Q-Learning Visualizer — Interactive RL Dashboard
👉 Click blank cells to build or remove walls before training
Bot must walk the exact minimum path 3 consecutive times to prove convergence.
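Under the hood, that convergence test can be sketched as a simple streak check (a hypothetical helper, assuming the dashboard tracks each episode's path length; the actual check isn't shown here):

```python
def converged(recent_path_lengths, optimal_length, streak=3):
    """Declare convergence once the bot walks the shortest path `streak` times in a row."""
    if len(recent_path_lengths) < streak:
        return False
    # Only the most recent `streak` episodes matter; all must hit the minimum.
    return all(n == optimal_length for n in recent_path_lengths[-streak:])
```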
Controls curiosity. High ε = explore randomly. Low ε = follow the learned map. Start at 0.5, reduce after ~50 episodes.
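The ε-greedy choice described above fits in a few lines. A minimal sketch, assuming a dict-based Q-Table keyed by (state, action); the action names are illustrative, not the dashboard's actual data structures:

```python
import random

def choose_action(q_table, state, epsilon, actions=("up", "down", "left", "right")):
    """ε-greedy: explore with probability ε, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: ignore the Q-Table, move randomly
    # exploit: pick the action with the highest learned Q-value for this state
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```

With ε = 0.5 the agent coin-flips between exploring and exploiting; decaying ε after ~50 episodes shifts it toward pure exploitation of the learned map.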
How aggressively new experience overwrites old knowledge. α = 0.1 is stable; α → 1.0 is fast but noisy.
You just watched a model-free Reinforcement Learning algorithm calculate the optimal policy of a Markov Decision Process. Here is how it works under the hood.
Why does the bot loop when trapped? Q-Learning is Markovian — the agent has zero memory of where it has been within the same episode. Its decision is based solely on its current grid square. If trapped in a U-shaped wall, it bounces randomly until the dynamic Fatigue Timeout resets the episode.
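A toy sketch of that looping behavior, with a plain step cap standing in for the dashboard's dynamic Fatigue Timeout: the walker's only input is its current cell, so nothing stops it from revisiting cells inside a trap.

```python
import random

def trapped_walk(start, walls, max_steps=50, seed=0):
    """Random bouncing in a dead end: the agent keeps only `pos`, no visit history."""
    rng = random.Random(seed)
    pos, path = start, [start]
    for _ in range(max_steps):  # step cap: the stand-in for a fatigue timeout
        x, y = pos
        moves = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        legal = [m for m in moves if m not in walls]
        pos = rng.choice(legal)  # decision uses the current cell only
        path.append(pos)
    return path
```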
Epsilon (ε) controls the agent's curiosity:
Every step updates the Q-Table via the Temporal Difference (TD) rule: Q(s, a) ← Q(s, a) + α · [r + γ · max Q(s′, a′) − Q(s, a)], where γ discounts future reward and the max runs over the actions a′ available in the next state s′.
The bracketed term is the TD Error — the gap between what the agent expected and what actually happened. α scales how much that gap updates the Q-Table each step.
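A minimal sketch of that update with the TD error made explicit. The dict-based Q-Table and the action names are illustrative assumptions, not the dashboard's internals:

```python
def td_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9,
              actions=("up", "down", "left", "right")):
    """One Q-Learning step: nudge Q(s, a) toward the TD target by a factor of α."""
    # Value of the best action available from the next state
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    # TD error: (what actually happened + discounted future) minus the old estimate
    td_error = reward + gamma * best_next - q_table.get((state, action), 0.0)
    q_table[(state, action)] = q_table.get((state, action), 0.0) + alpha * td_error
    return td_error
```

With α = 0.1, only a tenth of each surprise is absorbed per step, which is why learning is stable but slow; α near 1.0 overwrites the old estimate almost entirely.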
Q-Learning excels in Grid Worlds. Routing 10,000 robots across a warehouse floor where racks constantly move is impossible to hardcode — Q-Learning adapts dynamically to any layout change.
Instead of fixed timers, smart grids use RL. The State is queued cars; the Reward is throughput. The system learns the optimal switching policy for any time of day.
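That formulation might be encoded like this (all names are hypothetical and the state is drastically simplified; real systems track far richer signals):

```python
# State: queue lengths per approach plus the current signal phase
state = (7, 2, "NS_green")  # (queued north-south, queued east-west, phase)

# Actions: hold the current phase or switch it
actions = ("keep_phase", "switch_phase")

def reward(cars_cleared_this_step):
    """Throughput as reward: more cars through the intersection, more reward."""
    return cars_cleared_this_step
```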
Standard Q-Learning stores a table of States × Actions. In a video game, a state is an entire screen of pixels, and the number of possible screens is astronomically large — the Q-Table could never fit in any RAM. That is why Deep Q-Networks (DQN) were invented: neural networks that approximate the Q-Table instead of storing it.
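To make the scale concrete, a back-of-envelope comparison (the 84×84 grayscale screen is the classic DQN preprocessing size; the grid numbers are illustrative):

```python
import math

grid_states = 10 * 10         # a 10×10 grid world: 100 cells
grid_table = grid_states * 4  # 4 actions per cell → 400 Q-values, trivial to store

pixels = 84 * 84              # a small downscaled game screen
shades = 256                  # 8-bit grayscale per pixel
# Distinct possible screens = 256^7056; just counting that number's digits:
digits = int(pixels * math.log10(shades)) + 1  # ≈ 17,000 digits
```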
Q-Learning must fail thousands of times to converge. Training a $100M physical drone in the real world means crashing it 5,000 times during exploration. It must be trained in simulation first.