Hubbleflow · Course

Reinforcement Learning.

From Bellman equations to GRPO.

Every agentic capability in today's frontier models comes from reinforcement learning. This cohort teaches you RL from the ground up.

The long-horizon thinking, the step-by-step decomposition of a hard problem into actions, the ability to try, fail, and self-correct over many steps: all of it is reinforcement learning. RL is also the engine behind modern robotics, where a policy learns to act in the physical world. This live weekend cohort builds every algorithm from scratch, from MDPs and Q-learning to PPO, and finishes on the RLHF, DPO, and GRPO stack behind today's reasoning models.

Live weekend cohort · 18 modules · 5 levels · 5 capstone builds

Engineers from these companies are already in the cohort
Microsoft
Adobe
Uber
Expedia
ixigo
Blinkit
Nielsen
Dream11
InMobi
Sankalp Sharma
Software Development Engineer 4 at Adobe
Adobe

Glad to be part of this cohort. What stands out is the practical depth, not just tools, but how AI, system design, and agentic patterns come together for real-world engineering. Looking forward to learning more.

Mohit Kumar Sahu
Senior Software Development Engineer at Expedia Group
Expedia

What I appreciate most is the depth of learning. Instead of just covering the “what,” the cohort dives into the “how” and “why” behind AI concepts. Great experience so far!

Prashant Bucha
SE-2 at Microsoft · 9+ YOE · Ex-Hike, ixigo, Housing
Microsoft

Aseem's masterclass finally made AI click for me beyond just writing prompts. He goes deep into how the models and agents actually work under the hood, the attention math, the agent loops, the evals, exactly the kind of depth you need as an engineer who wants to build with AI, not just use it. Genuinely one of the most useful technical programs I've done in years.

Divya Kriplani
SDE-2 at Blinkit · Ex-Probo, ixigo
Blinkit

Joining this cohort was one of the best decisions I made for my AI learning journey. Before this, I was unsure where to start and overwhelmed by the noise around AI. Aseem's sessions gave me clarity, strong fundamentals, and the confidence to build my own agents. The focus on basic principles and real-world systems makes all the difference.

Satyam Singh
Principal Engineer at ixigo · Ex-InfoEdge · NIT Allahabad
ixigo

I learned a lot about the internal workings of AI, which is helping me use AI far more effectively for technical and complex problem-solving tasks.

Ankit
SDE-2 at Amazon
Amazon

The pace is intense and the depth is real. We built things from scratch instead of gluing libraries together, and that completely changes how you think about the stack. Easily the best technical cohort I've taken.

Aman Sapra
Assistant Manager at KPMG Delivery Network India
KPMG

I would highly recommend this cohort to anyone who wants to understand Agentic AI beyond the hype and surface-level tutorials. What makes this program stand out is the way it combines fundamentals, system design, and real-world implementation thinking. The cohort does not just focus on tools or quick demos. It helps you understand how AI systems are actually designed, how LLMs and agents fit into modern product architectures, and how to reason about them as an engineer. If you want depth instead of buzzwords, this is the cohort to join.

Vipul Sharma
Lead Member of Technical Staff at Stage
Stage

Genuinely one of the best learning experiences I've had as an engineer. Aseem takes dense AI and systems topics and turns them into something you can actually build with, every session moves you from theory to working code. It's the rare cohort that respects your time and assumes you want real depth. Highly recommend it to anyone serious about going beyond the surface.

Jasleen Kaur
Staff Engineer at MediaTek
MediaTek

As a staff engineer, what I value most is depth and first-principles thinking, and this cohort delivers both. Aseem connects the math, the systems, and the production reality in a way I haven't seen in any other program. It's rare to find teaching that is this rigorous and this practical at the same time. I came in to fill gaps and left with a genuinely stronger mental model of the whole stack.

Harsh
SDE-3 at Flipkart
Flipkart

I've done plenty of online courses that stay at the surface. This one goes all the way down, tokenization, attention, agent loops, evals, and then back up to production. I finally feel like I understand AI instead of just using it.

Alankit Gupta
Senior Software Engineer at SplashLearn
SplashLearn

As someone coming from a backend and system-design background, this cohort has helped me connect traditional engineering principles with modern AI systems. Every session leaves me with a long list of things to explore and apply. Great learning experience so far.

Shyam Singh
Frontend Architect · React, Angular

I'm attending this weekend cohort on AI agents, and now I finally understand how AI and agents actually work. Earlier, AI was just magic to me, now I understand the machinery behind it. Thanks Aseem for these sessions.

Rohan
Senior Software Engineer at Uber
Uber

Coming from a distributed-systems background, I expected the AI parts to feel hand-wavy. They didn't. Every concept is grounded in how you'd actually design, ship, and operate it, latency, failure modes, evals, the works. This is the most engineering-honest AI course I've come across.

The Syllabus

Eighteen modules. Every one builds an algorithm.

Grouped into five levels of increasing depth, from bandits to reasoning LLMs. Each module ends in code you write yourself.

Level 01

Foundations & Bandits

Module
01

The RL Problem & Multi-Armed Bandits

The trial-and-error paradigm with roots in operations research, psychology, and AI. We start where Sutton & Barto and Ravindran's NPTEL course both start, “immediate RL”, to build intuition for exploration vs. exploitation before time and state enter the picture.

Key concepts
Agent–environment interface · The reward hypothesis · Exploration vs. exploitation · ε-greedy & softmax action selection · Action-value estimation · Regret · UCB1 & Hoeffding bounds · Thompson sampling (Bayesian) · PAC bounds & median elimination
Build

The Sutton & Barto 10-armed testbed, comparing ε-greedy, UCB1, and Thompson sampling with regret curves.
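As a rough illustration of what this build compares, here is a minimal sketch of ε-greedy and UCB1 on a 10-armed Gaussian testbed, using only the Python standard library. The function names (`run_bandit`, `epsilon_greedy`, `ucb1`), the seeds, and the step count are assumptions made for this example, not the course's actual code. Note that UCB1's regret bound is derived for bounded rewards, so applying it to Gaussian rewards here is a heuristic.

```python
import math
import random


def run_bandit(select_arm, true_means, steps, rng):
    """Play one bandit run and return the cumulative regret after each step."""
    n_arms = len(true_means)
    counts = [0] * n_arms      # number of pulls per arm
    values = [0.0] * n_arms    # sample-average reward estimate per arm
    best_mean = max(true_means)
    regret, curve = 0.0, []
    for t in range(1, steps + 1):
        arm = select_arm(counts, values, t, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # unit-variance reward noise
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        regret += best_mean - true_means[arm]
        curve.append(regret)
    return curve


def epsilon_greedy(epsilon):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    def select(counts, values, t, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(values))                     # explore
        return max(range(len(values)), key=values.__getitem__)    # exploit
    return select


def ucb1(c=2.0):
    """Pick the arm with the highest estimate plus exploration bonus."""
    def select(counts, values, t, rng):
        for arm, n in enumerate(counts):
            if n == 0:
                return arm  # pull every arm once before using the bonus
        return max(
            range(len(values)),
            key=lambda a: values[a] + math.sqrt(c * math.log(t) / counts[a]),
        )
    return select


rng = random.Random(0)
true_means = [rng.gauss(0.0, 1.0) for _ in range(10)]  # 10-armed testbed
for name, policy in [("eps-greedy", epsilon_greedy(0.1)), ("UCB1", ucb1())]:
    curve = run_bandit(policy, true_means, steps=1000, rng=random.Random(1))
    print(f"{name}: cumulative regret after 1000 steps = {curve[-1]:.1f}")
```

A single run is noisy; averaging the regret curves over many independently sampled testbeds gives a fairer comparison.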

Module
02

Contextual Bandits & Policy Search

Bridge from stateless bandits to full RL by adding context, and introduce the policy-gradient idea early. This frames the whole course: value-based vs. policy-based learning.

Key concepts
Contextual bandits · LinUCB · Policy parameterisation · The REINFORCE estimator (bandit setting) · Baselines & variance reduction · Gradient bandit algorithm · A/B testing as a bandit
Build

A contextual-bandit news / ad recommender using LinUCB and a policy-gradient bandit, evaluated by cumulative reward.

Module
03
Level 1 Capstone

Markov Decision Processes & Bellman Equations

Introduce time. Formalise returns, discounting, and value functions, then prove why the machinery works: the part most courses skip and the source of most bugs.

Key concepts
MDP tuple (S, A, P, R, γ) · Return & discounting · State / action value functions · Bellman expectation equation · Bellman optimality equation · Contraction mapping & Banach fixed-point · Existence / uniqueness of V* · POMDP preview
Capstone Build

A GridWorld MDP library with a Bellman-backup solver and visualised value / policy heatmaps.
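For reference, the two Bellman equations and the contraction property listed in this module can be written as follows. This is a sketch in one common notation, using the MDP tuple (S, A, P, R, γ) from the key concepts; the reward form R(s, a, s') and the operator symbol T (the Bellman optimality operator) are assumptions, since the page does not fix a convention.

```latex
\begin{align}
V^{\pi}(s) &= \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P(s' \mid s, a)\,
              \bigl[ R(s, a, s') + \gamma V^{\pi}(s') \bigr] \\
V^{*}(s)   &= \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,
              \bigl[ R(s, a, s') + \gamma V^{*}(s') \bigr] \\
\lVert T V - T U \rVert_{\infty} &\le \gamma \, \lVert V - U \rVert_{\infty}
\end{align}
```

For γ < 1 the last line makes T a contraction in the sup norm, so the Banach fixed-point theorem gives a unique fixed point V*, and repeated Bellman backups converge to it from any starting value function.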

Level 02

Tabular Methods

Module
04

Dynamic Programming

With a known model, solve MDPs exactly. This is the conceptual backbone every later algorithm approximates.

Key concepts
Policy evaluation · Policy improvement theorem · Policy iteration · Value iteration · Generalised policy iteration (GPI) · Asynchronous DP · Convergence guarantees · The curse of dimensionality
Build

Policy iteration vs. value iteration on FrozenLake and Jack's Car Rental, comparing iteration counts and runtime.

Module
05

Monte Carlo Methods

Learn from experience without a model. Introduce the on-policy / off-policy distinction and importance sampling, concepts that resurface in PPO and RLHF.

Key concepts
First-visit vs. every-visit MC · MC prediction & control · Exploring starts · On-policy vs. off-policy · Importance sampling (ordinary & weighted) · MC Tree Search / UCT · AlphaGo preview
Build

A Blackjack MC-control agent (off-policy via importance sampling), plus a minimal UCT player for Tic-Tac-Toe.

Module
06
Level 2 Capstone

Temporal-Difference Learning & Eligibility Traces

The central idea of RL: bootstrapping. Build SARSA and Q-learning, then unify Monte Carlo and TD with eligibility traces.

Key concepts
TD(0) prediction & control · SARSA (on-policy) · Q-learning (off-policy, Watkins) · Expected SARSA · Maximisation bias & Double Q-learning · n-step TD · TD(λ) & forward / backward views · Eligibility traces
Capstone Build

SARSA vs. Q-learning vs. Expected SARSA on Cliff Walking and Windy GridWorld, with a TD(λ) extension.
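The three algorithms in this build share one update rule and differ only in the bootstrapped target. A minimal sketch of that difference, with hypothetical helper names (`td_targets`, `td_update`) chosen for illustration:

```python
def td_targets(reward, next_q, next_action, gamma, policy_probs):
    """Return the bootstrapped targets used by three TD control methods.

    next_q: list of Q(s', a) for every action a in the next state s'.
    next_action: the action actually sampled in s' (used by SARSA).
    policy_probs: pi(a | s') for every action (used by Expected SARSA).
    """
    # SARSA (on-policy): bootstrap from the action the agent really takes next.
    sarsa = reward + gamma * next_q[next_action]
    # Q-learning (off-policy): bootstrap from the greedy action.
    q_learning = reward + gamma * max(next_q)
    # Expected SARSA: bootstrap from the policy's expected next value.
    expected = reward + gamma * sum(p * q for p, q in zip(policy_probs, next_q))
    return {"sarsa": sarsa, "q_learning": q_learning, "expected_sarsa": expected}


def td_update(q_value, target, alpha):
    """Move the current estimate a fraction alpha towards the target."""
    return q_value + alpha * (target - q_value)


targets = td_targets(
    reward=-1.0, next_q=[0.0, 2.0], next_action=0, gamma=0.9, policy_probs=[0.1, 0.9]
)
for name, target in targets.items():
    print(name, round(td_update(0.0, target, alpha=0.5), 3))
```

On Cliff Walking this is what separates the learned paths: SARSA's target includes the cost of its own exploratory steps, while Q-learning's target assumes greedy behaviour from the next state onwards.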

Level 03

Deep Value-Based RL

Module
07

Function Approximation & the Deadly Triad

Tabular methods don't scale. Move to parameterised value functions and confront the instability that defines deep RL.

Key concepts
Linear function approximation · Tile coding & state aggregation · Semi-gradient TD · The deadly triad · LSTD / LSTDQ · Least-squares policy iteration (LSPI) · Fitted Q-Iteration · Baird's counterexample
Build

Linear semi-gradient SARSA with tile coding on Mountain Car, then Fitted Q-Iteration, contrasting stability.

Module
08
Level 3 Capstone

Deep Q-Networks (DQN) & Variants

The 2013/2015 breakthrough that launched deep RL. Build DQN end-to-end, then layer on the “Rainbow” improvements.

Key concepts
DQN (Mnih et al., human-level control) · Experience replay · Target networks · Double DQN (van Hasselt) · Dueling DQN · Prioritised Experience Replay · Noisy Nets · Distributional RL (C51) · Rainbow (Hessel et al.)
Capstone Build

A from-scratch PyTorch DQN beating CartPole, then a CNN DQN on Atari (Breakout / Pong) with Double + Dueling + PER.

Level 04

Policy Gradients & Control

Module
09

Policy Gradient Methods & REINFORCE

Optimise the policy directly, essential for continuous actions and the foundation of all LLM RL.

Key concepts
The policy gradient theorem · REINFORCE (Williams, 1992) · Score-function estimator · Baselines · Reward-to-go · Variance reduction · The log-derivative trick · Entropy regularisation
Build

REINFORCE with a learned baseline on CartPole and LunarLander, plotting variance with and without the baseline.

Module
10

Actor-Critic Methods (A2C / A3C / GAE)

Combine value learning and policy gradients to cut variance. This is the architecture that PPO inherits and that GRPO later simplifies by dropping the critic.

Key concepts
Actor-critic architecture · The advantage function · A2C & A3C (Mnih et al.) · Generalised Advantage Estimation (GAE) · n-step returns · Synchronous vs. asynchronous rollouts · Bias–variance trade-off in advantage
Build

A synchronous A2C with GAE and vectorised environments on LunarLander, exposing λ and n-step as tunable knobs.
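To make the λ knob concrete, here is a minimal sketch of Generalised Advantage Estimation over a single rollout in plain Python. The function name `compute_gae`, its argument layout, and the default values are assumptions for this example; a real A2C implementation would typically vectorise this across environments.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages and value targets for one rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the
    critic's estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0   # cut bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * lam * not_done * gae  # exponentially weighted sum
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]  # critic targets
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5], last_value=0.0,
    dones=[False, False, True],
)
print([round(a, 3) for a in adv])
```

Setting `lam=0` reduces each advantage to the one-step TD error (low variance, more bias), while `lam=1` recovers the Monte Carlo return minus the value baseline (less bias, more variance).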

Module
11

Trust Regions & PPO

Why naive policy gradients are unstable, and the trust-region fix that became the industry default and the algorithm behind classic RLHF.

Key concepts
TRPO & monotonic improvement · KL-divergence constraint · Natural policy gradient · Conjugate gradient · PPO clipped surrogate objective · PPO-penalty vs. PPO-clip · KL early stopping · The importance-sampling ratio
Build

A clean, CleanRL-style PPO solving continuous control (Pendulum / Hopper), validated against Stable-Baselines3.
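The core of PPO-clip fits in a few lines. The sketch below evaluates the clipped surrogate objective for a single action in plain Python; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions for illustration, and a real implementation would average this over a batch and maximise it with automatic differentiation.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximised)."""
    # Importance-sampling ratio between the updated and the rollout policy.
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # further outside the [1 - eps, 1 + eps] trust region.
    return min(ratio * advantage, clipped_ratio * advantage)


for ratio in (0.5, 1.0, 1.5):
    for advantage in (1.0, -1.0):
        value = ppo_clip_objective(math.log(ratio), 0.0, advantage)
        print(f"ratio={ratio}, advantage={advantage:+.0f} -> objective={value:+.2f}")
```

The printed grid shows the asymmetry: gains are capped when the ratio moves in the direction the advantage favours, but the penalty is kept in full when it has moved the wrong way.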

Module
12
Level 4 Capstone

Continuous Control: DDPG, TD3 & SAC

Off-policy actor-critics for robotics and control, where overestimation and exploration must be handled explicitly.

Key concepts
Deterministic policy gradient · DDPG (Lillicrap et al.) · Ornstein–Uhlenbeck / Gaussian exploration · TD3 (clipped double-Q, target smoothing, delayed updates) · Soft Actor-Critic (maximum-entropy RL) · Automatic temperature tuning · Replay buffers for continuous control
Capstone Build

TD3 and SAC on MuJoCo tasks (HalfCheetah, Ant), benchmarking sample efficiency and stability against each other.

Level 05

Frontiers & LLM Alignment

Module
13

Exploration & Intrinsic Motivation

Sparse-reward problems where ε-greedy fails, the bridge to reasoning tasks where reward is rare and binary.

Key concepts
Count-based exploration · Pseudo-counts · Intrinsic Curiosity Module (ICM) · Random Network Distillation (RND) · Go-Explore · Entropy-based exploration · Hard-exploration benchmarks (Montezuma's Revenge)
Build

Add RND intrinsic rewards to PPO and solve a sparse-reward MiniGrid task that vanilla PPO cannot.

Module
14

Model-Based RL & Planning

Learn a world model to plan and slash sample cost, the lineage of AlphaGo through Dreamer.

Key concepts
Dyna-Q (integrated planning / learning) · MCTS · AlphaGo / AlphaZero / MuZero · World Models (Ha & Schmidhuber) · Dreamer (latent imagination) · PETS / probabilistic ensembles · Model-based vs. model-free trade-offs
Build

Dyna-Q on a maze (planning-steps ablation), plus a small learned-dynamics model for short-horizon planning.

Module
15

Offline RL & Hierarchical RL

Learn from fixed datasets with no environment access, which is directly relevant to training on logged and preference data, then add temporal abstraction through hierarchical RL.

Key concepts
Distributional shift & extrapolation error · Behaviour cloning · Conservative Q-Learning (CQL) · Implicit Q-Learning (IQL) · Decision Transformer · The options framework & SMDPs · MAXQ value decomposition · The D4RL benchmark
Build

Train CQL or IQL on a D4RL dataset and compare against behaviour cloning, quantifying the offline distributional-shift gap.

Module
16

RLHF: Reward Modeling + PPO

The classic three-stage alignment pipeline that produced InstructGPT and ChatGPT, now framed as “RL where the environment is a language model.”

Key concepts
SFT → reward model → PPO pipeline · Bradley-Terry preference model · Pairwise preference data · Reward-model training · KL-to-reference penalty · Per-token reward + value head · Reward hacking · InstructGPT (Ouyang et al., 2022)
Build

A full RLHF loop with TRL on a small model: train a reward model, then PPO-fine-tune with a KL penalty; measure win-rate vs. the SFT baseline.

Module
17

DPO & Preference Optimization Without RL

The 2023 insight that you can skip the reward model and the RL loop entirely, now a production default.

Key concepts
DPO (Rafailov et al., 2023) · Policy ↔ reward equivalence · The implicit reward · The DPO classification loss & β · Reference policy · IPO, KTO · Preference-data curation · PPO-vs-DPO trade-offs
Build

DPO-fine-tune a model with TRL on the same preference data from Module 16; compare alignment, cost, and stability head-to-head with the PPO result.
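To show why DPO needs no reward model or RL loop, here is a minimal sketch of the per-pair DPO loss in plain Python. It takes summed sequence log-probabilities as inputs; the function name `dpo_loss`, the argument names, and the example numbers are assumptions for illustration and are not TRL's API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: beta-scaled log-ratio of policy to reference policy.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))
# Policy shifted towards the chosen response: the loss drops below log(2).
print(round(dpo_loss(-10.0, -17.0, -12.0, -15.0), 4))
```

The loss is an ordinary binary classification loss on the reward margin, which is why the whole procedure reduces to supervised training on preference pairs.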

Module
18
Level 5 Capstone

GRPO & RLVR: Training Reasoning Models

The 2024–25 frontier: drop the critic, drop the reward model, reward only verifiable correctness, and watch chain-of-thought reasoning emerge. The course's culmination, mirroring the DeepSeek-R1 recipe.

Key concepts
GRPO (Shao et al., DeepSeekMath) · Critic-free group-relative advantage · RLVR (verifiable / rule-based rewards) · DeepSeek-R1 & R1-Zero · Outcome vs. process rewards · Low-variance KL estimator · DAPO refinements · Dr. GRPO (length / variance-bias fix)
Capstone Build

Use GRPO (TRL or veRL) with a verifiable math / code reward to fine-tune a small base model on GSM8K-style problems; track accuracy and emergent chain-of-thought length over training.
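The two ingredients named above, a verifiable reward and a critic-free group-relative advantage, can be sketched in a few lines of plain Python. Both function names are hypothetical, the `####` final-answer marker is an assumption borrowed from the GSM8K answer format, and using the population standard deviation is an implementation choice that libraries make differently.

```python
import statistics


def exact_match_reward(completion, reference_answer):
    """Rule-based reward: 1.0 if the text after the last '####' matches."""
    final_answer = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if final_answer == reference_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    # The group mean acts as the baseline, so no learned critic is needed.
    return [(r - mean) / (std + eps) for r in rewards]


completions = [
    "3 + 4 = 7, doubled is 14. #### 14",
    "3 + 4 = 8, doubled is 16. #### 16",
    "Double 7 to get 14. #### 14",
    "I am not sure. #### 12",
]
rewards = [exact_match_reward(c, "14") for c in completions]
print(rewards)
print([round(a, 2) for a in group_relative_advantages(rewards)])
```

When every sample in a group earns the same reward, all advantages are zero and the prompt contributes no gradient. Dr. GRPO, listed in the key concepts, revisits the division by the standard deviation used here.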

Taught by

Aseem Rastogi.

Software & AI Architect · Co-Founder & CTO, Agentcord.ai

Ex-Architect, iXiGo · Ex-Staff Engineer, Synaptic · Ex-Senior Computer Scientist, Belzabar · B.Tech CSE, NIT Hamirpur (Gold Medalist)

Not a course taught from tutorials. Taught by the architect who built the systems, at iXiGo, Synaptic, and now Agentcord.ai.

Format & Cadence

Live on weekends. Supported every day.

01

Live weekend cohort

Saturday and Sunday sessions, with recordings of every class.

02

Hands-on

Weekly labs and a portfolio-grade capstone at the end of every level.

03

Support

Weekly office hours and capstone reviews per level.

04

Private Discord community

Dedicated channels per module and topic. Ask anything, any time. Every question and answer lives permanently, becoming a growing knowledge base for the cohort.

Tools you’ll master

The full modern reinforcement learning toolkit.

From Gymnasium to veRL. Every one of these appears in at least one module or lab.

Core & Deep Learning
Python · NumPy · PyTorch
Environments & Benchmarks
Gymnasium · Atari / ALE · MuJoCo · DeepMind Control · MiniGrid · PettingZoo · D4RL
RL Libraries
Stable-Baselines3 · CleanRL · Ray RLlib · Tianshou · TorchRL
LLM Post-Training
Hugging Face TRL · veRL · OpenRLHF · Unsloth · vLLM · PEFT / LoRA
Experiment Tracking & Infra
Weights & Biases · TensorBoard · Hydra · Docker
What it costs
Pay in full
40,000 · 15% off
33,999
one-time
Full 4-month cohort
Or pay monthly
9,999
per month
Billed across 4 months

Reinforcement learning is how machines learn to act, and now how they learn to reason. The engineers who understand it from the math up will build the systems that decide.

Doors open for the next cohort soon.

Frequently asked

Questions worth asking.

A little helps, but the course is self-contained. You need comfort with Python, NumPy, and basic calculus and probability (gradients, expectations). We build every algorithm from the MDP up and don't assume prior RL or deep-learning experience.

Join the community

Join our rapidly growing WhatsApp community.

Tap into a fast-growing community of engineers going deep on AI: cohort updates, resources, and a place to ask anything, alongside people building the same things you are.

Free to join. Open to anyone serious about going deep on AI.