Bellman Grid

Q-Learning Visualizer — Interactive RL Dashboard

Environment Map

Idle
Dim X:
Dim Y:

👉 Click blank cells to build or remove walls before training

Bot (Agent) · Goal · Wall (Rack) · Q-Value Heat · Optimal Path

Live Telemetry

Success: 0
Wall Hits: 0
Episode: 0
Shortest: – steps
Convergence Proof: 0/3 Validated

Bot must walk the exact minimum path 3 consecutive times to prove convergence.

Hyperparameters

Epsilon (ε): 0.50

Controls curiosity. High ε = explore randomly. Low ε = follow the learned map. Start at 0.5, reduce after ~50 episodes.

Learning Rate (α): 0.10

How aggressively new experience overwrites old knowledge. α = 0.1 is stable; α → 1.0 is fast but noisy.

Deep Learning Masterclass

Understanding Q-Learning

You just watched a model-free Reinforcement Learning algorithm calculate the optimal policy of a Markov Decision Process. Here is how it works under the hood.

The Markov Property & Loops

Why does the bot loop when trapped? Q-Learning is Markovian — the agent has zero memory of where it has been within the same episode. Its decision is based solely on its current grid square. If trapped in a U-shaped wall, it bounces randomly until the dynamic Fatigue Timeout resets the episode.

Exploration vs. Exploitation

Epsilon (ε) controls the agent's curiosity:

  • High ε: Bot ignores its map, walks randomly, discovers new shortcuts.
  • Low ε: Bot strictly follows the highest Q-values. “Deploy Best Route” forces ε = 0.
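The ε-greedy rule above can be sketched in a few lines of Python. The action names, grid coordinates, and dictionary-based Q-table here are illustrative assumptions, not the dashboard's actual internals:

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def choose_action(q_table, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)  # High ε: random walk, discovers shortcuts
    # Low ε: strictly follow the highest Q-value for this state
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

# "Deploy Best Route" is equivalent to ε = 0: pure exploitation
q = {((0, 0), "right"): 1.0}
assert choose_action(q, (0, 0), epsilon=0.0) == "right"
```

Unseen state-action pairs default to 0.0, which is why an untrained bot with low ε just walks into walls: every action looks equally good until rewards propagate back.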

The Bellman Equation

Every step updates the Q-Table via the Temporal Difference (TD) rule:

New Q(s,a) = Q(s,a) + α × [Reward + γ × max Q(s′,a′) − Q(s,a)]

where α is the Learning Rate and γ is the Discount factor.

The bracketed term is the TD Error — the gap between what the agent expected and what actually happened. α scales how much that gap updates the Q-Table each step.
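The TD rule is one line of arithmetic per step. A minimal sketch, again assuming a dictionary-based Q-table (the real dashboard's storage may differ):

```python
ACTIONS = ["up", "down", "left", "right"]

def td_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Bellman/TD step: Q(s,a) += alpha * [r + gamma*max Q(s',a') - Q(s,a)]."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old_q = q_table.get((state, action), 0.0)
    td_error = reward + gamma * best_next - old_q   # expectation gap
    q_table[(state, action)] = old_q + alpha * td_error
    return td_error

q = {}
err = td_update(q, (0, 0), "right", reward=10.0, next_state=(0, 1))
# Empty table: TD error = 10, so the new Q-value is alpha * 10 = 1.0
assert abs(q[((0, 0), "right")] - 1.0) < 1e-9
```

Note how α = 0.1 absorbs only 10% of the surprise per visit, which is exactly why it is stable but slow, while α near 1.0 overwrites the old estimate almost entirely.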

Real-World Application & Reality Check

Gamechanger: Amazon Logistics

Q-Learning excels in Grid Worlds. Routing 10,000 robots across a warehouse floor where racks constantly move is impossible to hardcode — Q-Learning adapts dynamically to any layout change.

Gamechanger: Traffic Lights

Instead of fixed timers, smart grids use RL. The State is queued cars; the Reward is throughput. The system learns the optimal switching policy for any time of day.

Dealbreaker: Curse of Dimensionality

Standard Q-Learning stores a table of States × Actions. In a video game, states are millions of pixels — the Q-Table becomes too large for any RAM. That is why Deep Q-Networks (DQN) were invented: neural networks that approximate the Q-Table instead of storing it.
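The blow-up is easy to see with back-of-envelope arithmetic (the grid size and the 84×84 binary screen below are illustrative numbers, not figures from this dashboard):

```python
# Tabular case: Q-table entries = states * actions
grid_states = 10 * 10          # e.g. a 10x10 grid -> 100 states
actions = 4
table_entries = grid_states * actions
print(table_entries)           # 400 entries: trivially fits in RAM

# Pixel case: even a tiny 84x84 *binary* screen explodes
pixel_states = 2 ** (84 * 84)  # roughly 10**2124 distinct screens
# No table can enumerate this, so DQN replaces the table with a
# neural network that *approximates* Q(s, a) from the pixels.
```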

Dealbreaker: Real-World Training Cost

Q-Learning must fail thousands of times to converge. Training a $100M physical drone in the real world means crashing it 5,000 times during exploration. It must be trained in simulation first.