Imagine teaching a puppy to sit. You do not hand it a textbook - you let it try things, and when it sits, you give it a treat. Over time, it learns which actions earn rewards. Reinforcement Learning (RL) works the same way: an AI agent learns by interacting with an environment, receiving feedback, and adjusting its behaviour to maximise long-term reward.
Every RL system has the same core structure:

- Agent - the learner and decision-maker
- Environment - the world the agent acts in
- State - what the agent observes about its current situation
- Action - a choice the agent makes
- Reward - the feedback signal the environment returns

The agent observes a state, chooses an action, receives a reward, and the environment moves to a new state.
This loop repeats - potentially millions of times - until the agent discovers a policy that consistently earns high rewards. Unlike supervised learning, there is no labelled dataset telling the agent what to do. It must discover effective behaviour purely through interaction.
One of RL's central dilemmas is the exploration vs exploitation trade-off: should the agent exploit what it already knows works, or explore unfamiliar actions that might yield even better rewards?
Most algorithms use an epsilon-greedy strategy: with probability ε the agent explores randomly, and the rest of the time it exploits its best-known action. Over training, ε gradually decreases as the agent becomes more confident. More sophisticated methods - like Upper Confidence Bound (UCB) and Thompson Sampling - explore more intelligently by targeting uncertain actions.
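The epsilon-greedy rule is only a few lines of code. Here is a minimal sketch in Python; the decay schedule and Q-values are illustrative assumptions, not taken from any particular library:

```python
import random

random.seed(0)

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: highest Q-value

# Typical schedule: start fully exploratory, decay towards mostly greedy.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for episode in range(1000):
    action = epsilon_greedy([0.1, 0.5, 0.2], epsilon)  # illustrative Q-values
    epsilon = max(min_epsilon, epsilon * decay)
```

With `epsilon = 0` the agent always picks the highest-valued action; with `epsilon = 1` it acts completely at random. The decay schedule moves it smoothly from the second regime to (nearly) the first.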
Q-learning assigns a value (called a Q-value) to every state-action pair. The Q-value represents the expected total future reward if the agent takes that action in that state and then follows the best policy afterwards.
The update rule is surprisingly elegant:
Q(state, action) ← Q(state, action) + α × [reward + γ × max Q(next_state, all_actions) − Q(state, action)]
Over many episodes, Q-values converge, and the agent can simply pick the action with the highest Q-value in any state. For simple environments this table-based approach works well. When the state space is too large for a table (say, raw pixel inputs from a video game), a neural network approximates the Q-values - this is Deep Q-Learning (DQN), the approach DeepMind used to master Atari games in 2015.
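The table-based approach can be shown end to end in a few dozen lines. The sketch below applies the update rule from the text to a toy environment of my own invention (a five-cell corridor where reaching the rightmost cell pays +1); the environment, hyperparameters, and episode count are illustrative assumptions:

```python
import random
from collections import defaultdict

random.seed(0)

# Toy environment (an illustrative assumption): a corridor of 5 cells.
# The agent starts at cell 0; reaching cell 4 ends the episode with reward +1.
N_STATES = 5
ACTIONS = [0, 1]                       # 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

Q = defaultdict(float)                 # Q[(state, action)], defaults to 0.0

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # The update rule from the text (no bootstrapping from terminal states)
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
```

After training, the greedy action in every non-terminal cell should be "move right", and Q-values should decay roughly geometrically with distance from the goal (Q(3, right) near 1, Q(2, right) near 0.9, and so on), reflecting the discount factor γ.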
What does the Q-value in Q-learning represent?
Q-learning works by learning the value of actions. Policy gradient methods take a different path: they directly optimise the policy itself - the mapping from states to actions.
Instead of asking "how good is this action?", they ask "how should I adjust my action probabilities to get more reward?" The agent nudges its policy in the direction that increases expected reward, using gradient ascent.
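That nudging can be made concrete with REINFORCE, the simplest policy gradient method. The sketch below runs it on a toy two-armed bandit; the payout probabilities, learning rate, and step count are illustrative assumptions:

```python
import math
import random

random.seed(0)

# Toy two-armed bandit (an illustrative assumption): arm 1 pays off more often.
def pull(action):
    pay_prob = 0.8 if action == 1 else 0.2
    return 1.0 if random.random() < pay_prob else 0.0

theta = [0.0, 0.0]   # one preference score per action
lr = 0.1             # learning rate for gradient ascent

def softmax(prefs):
    """Turn preference scores into action probabilities."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    action = 0 if random.random() < probs[0] else 1   # sample from the policy
    reward = pull(action)
    # REINFORCE: move each preference along reward * grad(log pi(action))
    for a in range(2):
        grad_log = (1.0 - probs[a]) if a == action else -probs[a]
        theta[a] += lr * reward * grad_log
```

Because rewarded actions have their log-probability pushed up, and arm 1 is rewarded more often, the policy typically drifts towards preferring arm 1 without ever estimating a value for either action.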
Proximal Policy Optimisation (PPO), developed by OpenAI, is one of the most popular policy gradient algorithms. It is stable, efficient, and was used to train the RL component of ChatGPT (via RLHF - Reinforcement Learning from Human Feedback).
In March 2016, DeepMind's AlphaGo defeated Lee Sedol, one of the greatest Go players in history, 4 games to 1. Go has more possible board positions than atoms in the universe - brute-force search was impossible.
AlphaGo combined:

- A policy network, initially trained on millions of moves from human expert games, to propose promising moves
- A value network to estimate how likely a board position is to lead to a win
- Monte Carlo tree search, guided by both networks, to look ahead through possible continuations
- Reinforcement learning through self-play to improve beyond its human training data
Its successor, AlphaGo Zero, skipped human data entirely - learning solely from self-play - and surpassed the original AlphaGo within 40 days.
In 2019, OpenAI Five defeated the world champion team in Dota 2 - a game with incomplete information, long time horizons, and complex team coordination. Each of the five agents used PPO trained across thousands of GPUs over months, accumulating the equivalent of 45,000 years of gameplay experience.
Game AI has always been a proving ground for RL, but the techniques developed here transfer to far more consequential domains. The scale of training - distributing RL across thousands of GPUs - pushed the boundaries of what was computationally feasible and inspired new infrastructure for large-scale AI training.
RL is increasingly applied beyond games:

- Robotics - learning to grasp, walk, and manipulate objects
- Recommendation systems - optimising for long-term user satisfaction rather than single clicks
- Energy - DeepMind's RL system cut the energy used to cool Google's data centres by roughly 40%
- Language models - RLHF, the technique used to fine-tune ChatGPT
What is the exploration vs exploitation trade-off?
A critical challenge in RL is reward hacking: the agent finds unexpected shortcuts to maximise its reward signal without actually solving the intended task.
Famous examples include a boat-racing agent that discovered it could score more points by spinning in circles collecting bonus items than by finishing the race, and simulated robots that learned to exploit physics engine bugs to "walk" by vibrating.
This highlights a fundamental truth: the agent optimises exactly what you measure, which may not be what you actually want. Designing reward functions that truly capture intended behaviour is one of the most important - and most difficult - aspects of applied RL. Researchers are exploring techniques like reward modelling and inverse reinforcement learning to address this challenge.
Training RL agents in the real world is expensive, slow, and potentially dangerous. Sim-to-real transfer trains agents in simulated environments and then deploys them in reality.
The challenge: simulations are never perfectly accurate. Techniques like domain randomisation - randomly varying simulation parameters (friction, lighting, object shapes) - force agents to learn robust policies that generalise across conditions.
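The core of domain randomisation is simply resampling the simulator's parameters every episode. A minimal sketch; the parameter names, ranges, and the commented-out simulator hooks are illustrative assumptions, not the API of any real simulator:

```python
import random

random.seed(0)

def randomized_sim_params():
    """Sample fresh physics/rendering parameters for each training episode.
    Names and ranges here are illustrative, not from any particular simulator."""
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass_kg": random.uniform(0.8, 1.2),
        "light_intensity": random.uniform(0.3, 1.0),
    }

# Each episode runs in a slightly different world, so the policy cannot
# overfit to one exact simulation and must learn robust behaviour.
for episode in range(3):
    params = randomized_sim_params()
    # env = make_simulator(**params)   # hypothetical simulator factory
    # run_episode(agent, env)          # hypothetical training step
```

If the real world's friction, masses, and lighting fall anywhere inside the randomised ranges, a policy that works across all sampled variations has a much better chance of working on the real system too.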
What is sim-to-real transfer?