Hubbleflow · Course

Reinforcement Learning.

From Bellman equations to GRPO.

Much of the agentic capability in today's frontier models is trained with reinforcement learning. This cohort teaches you RL from the ground up.

The long-horizon thinking, the step-by-step decomposition of a hard problem into actions, the ability to try, fail, and self-correct over many steps: these behaviours are largely shaped by reinforcement learning. RL is also the engine behind modern robotics, where a policy learns to act in the physical world. This live weekend cohort builds every algorithm from scratch, from MDPs and Q-learning to PPO, and finishes on the RLHF, DPO, and GRPO stack behind today's reasoning models.

Live weekend cohort · 18 modules · 5 levels · 5 capstone builds

Engineers from these companies are already in the cohort

Sankalp Sharma

Software Development Engineer 4 at Adobe

Adobe

Glad to be part of this cohort. What stands out is the practical depth, not just tools, but how AI, system design, and agentic patterns come together for real-world engineering. Looking forward to learning more.

Mohit Kumar Sahu

Senior Software Development Engineer at Expedia Group

Expedia

What I appreciate most is the depth of learning. Instead of just covering the “what,” the cohort dives into the “how” and “why” behind AI concepts. Great experience so far!

Prashant Bucha

SE-2 at Microsoft · 9+ YOE · Ex-Hike, ixigo, Housing

Microsoft

Aseem's masterclass finally made AI click for me beyond just writing prompts. He goes deep into how the models and agents actually work under the hood, the attention math, the agent loops, the evals, exactly the kind of depth you need as an engineer who wants to build with AI, not just use it. Genuinely one of the most useful technical programs I've done in years.

Divya Kriplani

SDE-2 at Blinkit · Ex-Probo, ixigo

Blinkit

Joining this cohort was one of the best decisions I made for my AI learning journey. Before this, I was unsure where to start and overwhelmed by the noise around AI. Aseem's sessions gave me clarity, strong fundamentals, and the confidence to build my own agents. The focus on basic principles and real-world systems makes all the difference.

Satyam Singh

Principal Engineer at ixigo · Ex-InfoEdge · NIT Allahabad

ixigo

I learned a lot about the internal workings of AI, which is helping me use AI far more effectively for technical and complex problem-solving tasks.

Ankit

SDE-2 at Amazon

Amazon

The pace is intense and the depth is real. We built things from scratch instead of gluing libraries together, and that completely changes how you think about the stack. Easily the best technical cohort I've taken.

Aman Sapra

Assistant Manager at KPMG Delivery Network India

KPMG

I would highly recommend this cohort to anyone who wants to understand Agentic AI beyond the hype and surface-level tutorials. What makes this program stand out is the way it combines fundamentals, system design, and real-world implementation thinking. The cohort does not just focus on tools or quick demos. It helps you understand how AI systems are actually designed, how LLMs and agents fit into modern product architectures, and how to reason about them as an engineer. If you want depth instead of buzzwords, this is the cohort to join.

Vipul Sharma

Lead Member of Technical Staff at Stage

Stage

Genuinely one of the best learning experiences I've had as an engineer. Aseem takes dense AI and systems topics and turns them into something you can actually build with, every session moves you from theory to working code. It's the rare cohort that respects your time and assumes you want real depth. Highly recommend it to anyone serious about going beyond the surface.

Jasleen Kaur

Staff Engineer at MediaTek

MediaTek

As a staff engineer, what I value most is depth and first-principles thinking, and this cohort delivers both. Aseem connects the math, the systems, and the production reality in a way I haven't seen in any other program. It's rare to find teaching that is this rigorous and this practical at the same time. I came in to fill gaps and left with a genuinely stronger mental model of the whole stack.

Harsh

SDE-3 at Flipkart

Flipkart

I've done plenty of online courses that stay at the surface. This one goes all the way down, tokenization, attention, agent loops, evals, and then back up to production. I finally feel like I understand AI instead of just using it.

Alankit Gupta

Senior Software Engineer at SplashLearn

SplashLearn

As someone coming from a backend and system-design background, this cohort has helped me connect traditional engineering principles with modern AI systems. Every session leaves me with a long list of things to explore and apply. Great learning experience so far.

Shyam Singh

Frontend Architect · React, Angular

I'm attending this weekend cohort on AI agents, and now I finally understand how AI and agents actually work. Earlier, AI was just magic to me, now I understand the machinery behind it. Thanks Aseem for these sessions.

Rohan

Senior Software Engineer at Uber

Uber

Coming from a distributed-systems background, I expected the AI parts to feel hand-wavy. They didn't. Every concept is grounded in how you'd actually design, ship, and operate it, latency, failure modes, evals, the works. This is the most engineering-honest AI course I've come across.

The Syllabus

Eighteen modules. Every one builds an algorithm.

Grouped into five levels of increasing depth, from bandits to reasoning LLMs. Each module ends in code you write yourself.

Level 01

Foundations & Bandits

Module

The RL Problem & Multi-Armed Bandits

The trial-and-error paradigm with roots in operations research, psychology, and AI. We start where Sutton & Barto and Ravindran's NPTEL course both start, “immediate RL”, to build intuition for exploration vs. exploitation before time and state enter the picture.

Key concepts

Agent–environment interface · The reward hypothesis · Exploration vs. exploitation · ε-greedy & softmax action selection · Action-value estimation · Regret · UCB1 & Hoeffding bounds · Thompson sampling (Bayesian) · PAC bounds & median elimination

Build

The Sutton & Barto 10-armed testbed, comparing ε-greedy, UCB1, and Thompson sampling with regret curves.
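
To give a flavour of this build, here is a minimal sketch of ε-greedy and UCB1 on a 10-armed Gaussian testbed, in plain Python. The function names, the unit-variance rewards, the step count, and the seeds are illustrative assumptions, not the course's reference code. UCB1's classical guarantee assumes bounded rewards, so treat the Gaussian version as a demonstration only.

```python
import math
import random

def run_bandit(select, true_means, steps, rng):
    """Play one run and return the cumulative pseudo-regret."""
    k = len(true_means)
    counts = [0] * k          # number of pulls per arm
    values = [0.0] * k        # sample-average reward estimate per arm
    best = max(true_means)
    regret = 0.0
    for t in range(1, steps + 1):
        arm = select(t, counts, values, rng)
        reward = rng.gauss(true_means[arm], 1.0)             # assumed unit-variance noise
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        regret += best - true_means[arm]
    return regret

def epsilon_greedy(eps):
    def select(t, counts, values, rng):
        if rng.random() < eps:
            return rng.randrange(len(values))                   # explore uniformly
        return max(range(len(values)), key=values.__getitem__)  # exploit
    return select

def ucb1(c=2.0):
    def select(t, counts, values, rng):
        for arm, n in enumerate(counts):
            if n == 0:
                return arm                                   # pull every arm once first
        # Estimate plus an exploration bonus that shrinks as an arm is pulled more
        return max(range(len(values)),
                   key=lambda a: values[a] + math.sqrt(c * math.log(t) / counts[a]))
    return select

setup_rng = random.Random(0)
true_means = [setup_rng.gauss(0.0, 1.0) for _ in range(10)]
regret_eps = run_bandit(epsilon_greedy(0.1), true_means, 2000, random.Random(1))
regret_ucb = run_bandit(ucb1(), true_means, 2000, random.Random(1))
print(f"eps-greedy regret: {regret_eps:.1f}  UCB1 regret: {regret_ucb:.1f}")
```

Plotting regret against the step index, averaged over many seeds, gives the kind of regret curve the build compares.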

Module

Contextual Bandits & Policy Search

Bridge from stateless bandits to full RL by adding context, and introduce the policy-gradient idea early. This frames the whole course: value-based vs. policy-based learning.

Key concepts

Contextual bandits · LinUCB · Policy parameterisation · The REINFORCE estimator (bandit setting) · Baselines & variance reduction · Gradient bandit algorithm · A/B testing as a bandit

Build

A contextual-bandit news / ad recommender using LinUCB and a policy-gradient bandit, evaluated by cumulative reward.
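
The policy-gradient idea in its simplest form is the gradient bandit: keep one preference per arm, act through a softmax, and nudge preferences by (reward minus baseline). The sketch below is context-free and uses fixed, deterministic arm rewards to stay short. Both are simplifying assumptions, and the names and hyperparameters are hypothetical.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)                                   # subtract max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_bandit(arm_rewards, steps=2000, alpha=0.1, seed=0):
    """Return the final softmax policy after stochastic gradient ascent on expected reward."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_rewards)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        arm = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[arm]                    # assumed deterministic reward per arm
        baseline += (reward - baseline) / t          # running average reward as baseline
        for a in range(len(prefs)):
            indicator = 1.0 if a == arm else 0.0
            # REINFORCE-style update: (R - baseline) * d log pi(arm) / d pref[a]
            prefs[a] += alpha * (reward - baseline) * (indicator - probs[a])
    return softmax(prefs)

policy = gradient_bandit([0.0, 0.5, 1.0])
print([round(p, 3) for p in policy])
```

Setting the baseline to zero leaves the expected update unchanged but usually increases its variance, which is the point of the "baselines & variance reduction" concept above.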

Module

Level 1 Capstone

Markov Decision Processes & Bellman Equations

Introduce time. Formalise returns, discounting, and value functions, then prove why the machinery works: the part most courses skip and where most bugs originate.

Key concepts

MDP tuple (S, A, P, R, γ) · Return & discounting · State / action value functions · Bellman expectation equation · Bellman optimality equation · Contraction mapping & Banach fixed-point · Existence / uniqueness of V* · POMDP preview

Capstone Build

A GridWorld MDP library with a Bellman-backup solver and visualised value / policy heatmaps.
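
As a rough illustration of a Bellman-backup solver, the sketch below repeats the Bellman optimality backup until the values stop changing. The dictionary-free `P` / `R` layout and the two-state toy MDP are assumptions made for brevity. A GridWorld would plug in its own transition table.

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """P[s][a] is a list of (prob, next_state); R[s][a] is the expected immediate reward."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman optimality backup: best action's reward plus discounted next value
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy MDP (hypothetical): state 0 can stay (reward 0) or move to state 1 (reward 1);
# state 1 loops on itself forever with reward 2.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 1.0], [2.0]]
V = value_iteration(P, R, gamma=0.9)
print([round(v, 3) for v in V])
```

With γ = 0.9 the self-loop is worth 2 / (1 − 0.9) = 20, so moving from state 0 is worth 1 + 0.9 · 20 = 19. Because the backup is a γ-contraction, the loop converges to these values from any starting point.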

Level 02

Tabular Methods

Module

Dynamic Programming

With a known model, solve MDPs exactly. This is the conceptual backbone every later algorithm approximates.

Key concepts

Policy evaluation · Policy improvement theorem · Policy iteration · Value iteration · Generalised policy iteration (GPI) · Asynchronous DP · Convergence guarantees · The curse of dimensionality

Build

Policy iteration vs. value iteration on FrozenLake and Jack's Car Rental, comparing iteration counts and runtime.
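
For comparison with value iteration, policy iteration alternates full evaluation of the current policy with a greedy improvement step. The sketch below assumes a small tabular MDP given as nested lists. The layout, the toy MDP, and the function names are illustrative, not the FrozenLake or Jack's Car Rental code.

```python
def evaluate(policy, P, R, gamma, tol=1e-10):
    """Iterative policy evaluation: repeat the Bellman expectation backup until stable."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s, a in enumerate(policy):
            v = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration(P, R, gamma):
    """Return (policy, values, number of evaluate/improve rounds)."""
    policy = [0] * len(P)
    rounds = 0
    while True:
        V = evaluate(policy, P, R, gamma)
        rounds += 1
        # Greedy improvement: pick the action with the best one-step lookahead value
        improved = [max(range(len(P[s])),
                        key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for s in range(len(P))]
        if improved == policy:
            return policy, V, rounds
        policy = improved

# Toy MDP (hypothetical): state 0 can stay (reward 0) or move to state 1 (reward 1);
# state 1 loops on itself with reward 2.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 1.0], [2.0]]
policy, V, rounds = policy_iteration(P, R, gamma=0.9)
print(policy, [round(v, 3) for v in V], rounds)
```

Counting `rounds` here, and sweeps in a value-iteration loop, is one simple way to produce the iteration-count comparison the build asks for.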

Module

Monte Carlo Methods

Learn from experience without a model. Introduce the on-policy / off-policy distinction and importance sampling, concepts that resurface in PPO and RLHF.

Key concepts

First-visit vs. every-visit MC · MC prediction & control · Exploring starts · On-policy vs. off-policy · Importance sampling (ordinary & weighted) · MC Tree Search / UCT · AlphaGo preview

Build

A Blackjack MC-control agent (off-policy via importance sampling), plus a minimal UCT player for Tic-Tac-Toe.
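
The off-policy idea can be sketched without Blackjack: collect returns under a behaviour policy, then reweight each return by the product of target-to-behaviour probability ratios. The example below assumes state-independent action probabilities and one-step episodes, purely to keep it short. All names are hypothetical.

```python
import random

def is_estimates(episodes, target_prob, behaviour_prob):
    """episodes: list of (actions, return). Returns (ordinary, weighted) IS estimates."""
    weighted_sum = 0.0
    rho_sum = 0.0
    for actions, G in episodes:
        rho = 1.0
        for a in actions:
            rho *= target_prob[a] / behaviour_prob[a]   # per-step likelihood ratio
        weighted_sum += rho * G
        rho_sum += rho
    ordinary = weighted_sum / len(episodes)             # unbiased, higher variance
    weighted = weighted_sum / rho_sum if rho_sum > 0 else 0.0  # biased, lower variance
    return ordinary, weighted

rng = random.Random(0)
target = {0: 0.1, 1: 0.9}       # policy we want to evaluate
behaviour = {0: 0.5, 1: 0.5}    # policy that generated the data
episodes = []
for _ in range(20000):
    a = 1 if rng.random() < behaviour[1] else 0
    episodes.append(([a], float(a)))                    # return is 1 for action 1, else 0
ordinary, weighted = is_estimates(episodes, target, behaviour)
print(round(ordinary, 3), round(weighted, 3))
```

Under the target policy the true value is 0.9, and both estimates should land near it even though the data came from a uniform policy. The same probability ratio reappears later as the importance-sampling ratio in PPO.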

Module

Level 2 Capstone

Temporal-Difference Learning & Eligibility Traces

The central idea of RL: bootstrapping. Build SARSA and Q-learning, then unify Monte Carlo and TD with eligibility traces.

Key concepts

TD(0) prediction & control · SARSA (on-policy) · Q-learning (off-policy, Watkins) · Expected SARSA · Maximisation bias & Double Q-learning · n-step TD · TD(λ) & forward / backward views · Eligibility traces

Capstone Build

SARSA vs. Q-learning vs. Expected SARSA on Cliff Walking and Windy GridWorld, with a TD(λ) extension.
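
Bootstrapping fits in one line of code: move Q(s, a) toward the reward plus the discounted value of the next state. The sketch below runs tabular Q-learning on a hypothetical five-state corridor rather than Cliff Walking. The environment, hyperparameters, and step cap are assumptions.

```python
import random

N_STATES, GOAL = 5, 4            # corridor 0..4; state 4 is terminal
ACTIONS = (-1, +1)               # index 0 = left, index 1 = right

def step(s, a):
    s2 = min(max(s + ACTIONS[a], 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):     # step cap keeps early episodes bounded
            if rng.random() < eps or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)                 # explore, or break ties randomly
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1    # greedy action
            s2, r, done = step(s, a)
            # Q-learning bootstraps from the best next action (off-policy);
            # SARSA would instead use the action actually taken next.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q

Q = q_learning()
print([[round(q, 3) for q in row] for row in Q])
```

Swapping the `max(Q[s2])` term for the value of the next sampled action turns this into SARSA, and for the policy-weighted average turns it into Expected SARSA.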

Level 03

Deep Value-Based RL

Module

Function Approximation & the Deadly Triad

Tabular methods don't scale. Move to parameterised value functions and confront the instability that defines deep RL.

Key concepts

Linear function approximation · Tile coding & state aggregation · Semi-gradient TD · The deadly triad · LSTD / LSTDQ · Least-squares policy iteration (LSPI) · Fitted Q-Iteration · Baird's counterexample

Build

Linear semi-gradient SARSA with tile coding on Mountain Car, then Fitted Q-Iteration, contrasting stability.

Module

Level 3 Capstone

Deep Q-Networks (DQN) & Variants

The 2013/2015 breakthrough that launched deep RL. Build DQN end-to-end, then layer on the “Rainbow” improvements.

Key concepts

Why naive policy gradients are unstable, and the trust-region fix that became both the industry default and the literal algorithm behind RLHF.

Key concepts

TRPO & monotonic improvement · KL-divergence constraint · Natural policy gradient · Conjugate gradient · PPO clipped surrogate objective · PPO-penalty vs. PPO-clip · KL early stopping · The importance-sampling ratio

Build

A clean, CleanRL-style PPO solving continuous control (Pendulum / Hopper), validated against Stable-Baselines3.
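
The clipped surrogate objective itself is only a few lines. The sketch below computes it for a batch of log-probabilities and advantages in plain Python. The function name and the default clip range of 0.2 are assumptions, and a real implementation would compute this on tensors and take gradients through the new log-probabilities.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate (to be maximised) over a batch of sampled actions."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                        # importance-sampling ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio outside the clip range
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage is capped at 1.2 * advantage.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
# Ratio 1.5 with a negative advantage is not capped: the pessimistic term wins.
print(ppo_clip_objective([math.log(1.5)], [0.0], [-1.0]))
```

The asymmetry in the two calls is the design choice: the objective is a pessimistic bound, so it limits how much credit a large policy change can claim but never hides how much harm it does.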

Module

Level 4 Capstone

Continuous Control: DDPG, TD3 & SAC

Off-policy actor-critics for robotics and control, where overestimation and exploration must be handled explicitly.

Key concepts

The 2023 insight that you can skip the reward model and the RL loop entirely, now a production default.

Key concepts

DPO (Rafailov et al., 2023) · Policy ↔ reward equivalence · The implicit reward · The DPO classification loss & β · Reference policy · IPO, KTO · Preference-data curation · PPO-vs-DPO trade-offs

Build

DPO-fine-tune a model with TRL on the same preference data from Module 16; compare alignment, cost, and stability head-to-head with the PPO result.
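
Stripped of the training loop, the DPO loss for one preference pair is a logistic loss on the difference of implicit rewards. The sketch below takes summed sequence log-probabilities as plain floats. The argument names and the default β of 0.1 are assumptions, and this is not TRL's implementation.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (implicit reward of chosen - implicit reward of rejected))."""
    # Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x))
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, so the loss is log 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has raised the chosen response relative to the reference: loss drops.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

β controls how strongly the policy is held near the reference: a larger β turns the same log-probability shift into a larger margin.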

Module

Level 5 Capstone

GRPO & RLVR: Training Reasoning Models

The 2024–25 frontier: drop the critic, drop the reward model, reward only verifiable correctness, and watch chain-of-thought reasoning emerge. The course's culmination, mirroring the DeepSeek-R1 recipe.

Key concepts

GRPO (Shao et al., DeepSeekMath) · Critic-free group-relative advantage · RLVR (verifiable / rule-based rewards) · DeepSeek-R1 & R1-Zero · Outcome vs. process rewards · Low-variance KL estimator · DAPO refinements · Dr. GRPO (length / variance-bias fix)

Capstone Build

Use GRPO (TRL or veRL) with a verifiable math / code reward to fine-tune a small base model on GSM8K-style problems; track accuracy and emergent chain-of-thought length over training.
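
The critic-free part of GRPO can be sketched in a few lines: sample a group of completions for one prompt, score each with a rule-based reward, and standardise the rewards within the group. The reward rule (last token must equal the gold answer), the population standard deviation, and the `eps` constant below are illustrative assumptions. Libraries differ on these details.

```python
from statistics import mean, pstdev

def verifiable_reward(completion, gold_answer):
    """Hypothetical rule-based reward: 1.0 if the final token equals the gold answer."""
    tokens = completion.split()
    return 1.0 if tokens and tokens[-1] == gold_answer else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one prompt's group; no learned critic is involved."""
    mu = mean(rewards)
    sigma = pstdev(rewards)       # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

group = [
    "6 * 7 = 42 so the answer is 42",
    "the answer is 41",
    "I think it is 48",
    "7 * 6 gives 42",
]
rewards = [verifiable_reward(c, "42") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 3) for a in advantages])
```

When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient. Dr. GRPO, listed above, revisits the division by the group standard deviation.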

Taught by

Aseem Rastogi.

Software & AI Architect · Co-Founder & CTO, Agentcord.ai

Ex-Architect, iXiGo · Ex-Staff Engineer, Synaptic · Ex-Senior Computer Scientist, Belzabar · B.Tech CSE, NIT Hamirpur (Gold Medalist)

Not a course taught from tutorials. Taught by the architect who built the systems, at iXiGo, Synaptic, and now Agentcord.ai.

Format & Cadence

Live on weekends. Supported every day.

Live weekend cohort

Saturday and Sunday sessions, with recordings of every class.

Hands-on

Weekly labs and a portfolio-grade capstone at the end of every level.

Support

Weekly office hours and capstone reviews per level.

Private Discord community

Dedicated channels per module and topic. Ask anything, any time. Every question and answer lives permanently, becoming a growing knowledge base for the cohort.

Tools you’ll master

The full modern reinforcement learning toolkit.

From Gymnasium to veRL. Every one of these appears in at least one module or lab.

Core & Deep Learning

Python · NumPy · PyTorch

Environments & Benchmarks

Gymnasium · Atari / ALE · MuJoCo · DeepMind Control · MiniGrid · PettingZoo · D4RL

RL Libraries

Stable-Baselines3 · CleanRL · Ray RLlib · Tianshou · TorchRL

LLM Post-Training

Hugging Face TRL · veRL · OpenRLHF · Unsloth · vLLM · PEFT / LoRA

Experiment Tracking & Infra

Weights & Biases · TensorBoard · Hydra · Docker

What it costs

Pay in full

₹40,000 · 15% off

₹33,999

one-time

Full 4-month cohort

Or pay monthly

₹9,999

per month

Billed across 4 months

Reinforcement learning is how machines learn to act, and now how they learn to reason. The engineers who understand it from the math up will build the systems that decide.

Join the WhatsApp Community

Doors open for the next cohort soon.

Frequently asked

Questions worth asking.

A little helps, but the course is self-contained. You need comfort with Python, NumPy, and basic calculus and probability (gradients, expectations). We build every algorithm from the MDP up and don't assume prior RL or deep-learning experience.

Join the community

Join our rapidly growing WhatsApp community.

Tap into a fast-growing community of engineers going deep on AI: cohort updates, resources, and a place to ask anything, alongside people building the same things you are.

Join the WhatsApp Community

Free to join. Open to anyone serious about going deep on AI.