How does a child learn to walk? Nobody hands them a labelled dataset of "good step / bad step" examples. Nobody writes out the physics equations for balance. They try, they fall, they try differently, they gradually improve — guided by feedback from the world around them.
This intuition — learning through trial and error, guided by rewards and penalties — is the foundation of Reinforcement Learning (RL), the approach behind some of AI's most spectacular achievements: defeating the world champion at Go, mastering chess through self-play, training ChatGPT to be helpful, and teaching robots to walk.
Every RL problem has the same fundamental structure:
┌─────────────────────────────────────────┐
│               ENVIRONMENT               │
│    (the world the agent operates in)    │
└─────────────────┬───────────────────────┘
                  │
  State (s_t)     │     Reward (r_t)
  What the        │     How good was
  agent sees ◄────┤     the last action?
                  │
┌─────────────────▼───────────────────────┐
│                  AGENT                  │
│       (the AI that takes actions)       │
└─────────────────┬───────────────────────┘
                  │
                  │  Action (a_t)
                  ▼
   Agent acts on the environment,
   environment transitions to state s_{t+1}
The agent's goal is to learn a policy that maximises the total cumulative reward over time, not just immediate reward.
RL agents face a perpetual tension:
Exploitation: do what you already know works well — take the action that currently seems best.
Exploration: try something new — maybe there's a better strategy you haven't discovered yet.
A robot that only exploits will never discover a better route. A robot that only explores will never reliably get anywhere.
A simple but effective strategy is ε-greedy:
import random
import numpy as np

epsilon = 0.1  # 10% exploration rate

def choose_action(state, Q_table, epsilon):
    n_actions = Q_table.shape[1]
    if random.random() < epsilon:
        # Explore: choose a random action
        return random.randrange(n_actions)
    else:
        # Exploit: choose the action with highest expected reward
        return int(np.argmax(Q_table[state]))
With ε = 0.1, the agent explores randomly 10% of the time and exploits its learned knowledge 90% of the time. Typically, ε starts high (lots of exploration early on) and decays over training as the agent becomes more confident.
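One illustrative decay schedule (the constants here are invented for this sketch, not from the text) multiplies ε by a fixed factor each episode and never lets it fall below a floor:

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Exponential epsilon decay with a floor: explore a lot early, a little forever."""
    return max(eps_min, eps_start * decay ** episode)

print(decayed_epsilon(0))     # 1.0: pure exploration at the start
print(decayed_epsilon(2000))  # 0.05: floor reached late in training
```

Keeping a small floor instead of decaying to zero means the agent never entirely stops checking whether its current policy is still the best one.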
Most RL problems are formalised as Markov Decision Processes (MDPs). The key assumption is the Markov property: the current state contains all information needed to choose the best action — the past history doesn't matter, only where you are now.
An MDP is defined by five components:
States (S): the set of situations the agent can be in.
Actions (A): the choices available to the agent.
Transition function (P): the probability of landing in state s' after taking action a in state s.
Reward function (R): the immediate reward for each transition.
Discount factor (γ): how much future rewards are worth relative to immediate ones.
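To make the definition concrete, a tiny MDP can be written out directly; the states, actions, and numbers below are invented purely for illustration:

```python
# Toy MDP: states 0 ("cold") and 1 ("hot"); actions 0 ("wait") and 1 ("heat").
# P[s][a] is a list of (probability, next_state); R[s][a] is the immediate reward.
P = {
    0: {0: [(1.0, 0)], 1: [(0.9, 1), (0.1, 0)]},
    1: {0: [(0.8, 1), (0.2, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: -1.0}, 1: {0: 1.0, 1: -0.5}}
gamma = 0.9  # discount factor

# The Markov property in action: each transition distribution depends only on
# the current (state, action) pair, never on how the agent got there.
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9  # probabilities sum to 1
```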
The discount factor γ matters because it makes rewards in the near future worth more than rewards in the distant future:
# Cumulative discounted reward (Return)
# R = r_0 + γ·r_1 + γ²·r_2 + γ³·r_3 + ...
# With γ = 0.99: future rewards are nearly as valuable as immediate ones
# With γ = 0.5: rewards 10 steps away are worth only 0.1% of immediate reward
gamma = 0.99
total_return = sum(gamma**t * reward for t, reward in enumerate(rewards))
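The claims in the comments above can be checked with two one-liners:

```python
# A reward 10 steps in the future, under each discount factor:
print(0.99 ** 10)  # ≈ 0.904: still worth about 90% of an immediate reward
print(0.5 ** 10)   # ≈ 0.00098: roughly 0.1% of an immediate reward
```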
Q-learning is one of the foundational RL algorithms. It learns a Q-function: Q(s, a) = the expected total future reward if you take action a in state s and then act optimally forever after.
import numpy as np
# Initialise Q-table to zeros
n_states = 16 # e.g., a 4×4 grid world
n_actions = 4 # up, down, left, right
Q = np.zeros((n_states, n_actions))
# Q-learning update rule (derived from the Bellman optimality equation)
def update_Q(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """
    alpha: learning rate (how fast to update Q-values)
    gamma: discount factor (how much to value future rewards)
    """
    # Current estimate
    current_Q = Q[state, action]
    # Target: immediate reward + discounted best future value
    target = reward + gamma * np.max(Q[next_state])
    # Move current estimate towards target
    Q[state, action] = current_Q + alpha * (target - current_Q)
    return Q
Given enough exploration and a suitably decaying learning rate, Q-learning converges to the optimal Q-function over thousands of episodes. Once learned, the policy is simple: in each state, take the action with the highest Q-value.
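As a self-contained illustration, here is the update rule applied end to end on a made-up five-state chain; the environment, constants, and random seed below are invented for this sketch, not taken from the text:

```python
import numpy as np

n_states, n_actions = 5, 2  # a toy chain; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Deterministic toy dynamics: reward 1 whenever the agent lands on the last state."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # gamma below 0.99 so this tiny problem converges fast
rng = np.random.default_rng(0)

for _ in range(2000):                  # episodes, each started from a random state
    s = int(rng.integers(n_states))
    for _ in range(20):                # steps per episode
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2

# Greedy policy: in every state the best action is "right", towards the reward.
policy = np.argmax(Q, axis=1)
```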
Deep Q-Networks (DQN), developed by DeepMind in 2013, replaced the Q-table with a deep neural network, allowing Q-learning to scale to complex visual inputs. DQN famously learned to play 49 Atari games directly from pixels, reaching human-level performance on many.
Q-learning learns the value of actions indirectly. Policy gradient methods directly optimise the policy — the mapping from states to actions.
# The core idea: if an action led to high reward, make it more likely;
# if an action led to low reward, make it less likely (REINFORCE).
# (policy_network, optimizer, and compute_returns are assumed to exist elsewhere.)
def policy_gradient_update(policy_network, optimizer, states, actions, rewards):
    # Compute discounted returns for each timestep
    returns = compute_returns(rewards, gamma=0.99)
    # Normalise returns (reduces variance)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy loss: -log(π(a|s)) × G_t
    # Maximise log-probability of actions, weighted by how good they were
    log_probs = policy_network.log_prob(states, actions)
    loss = -(log_probs * returns).mean()
    # Backpropagate and update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
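The `compute_returns` helper used above is left undefined in the snippet; a minimal NumPy version, computing every discounted return in one backwards pass, might look like this:

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed backwards in one pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Reward only at the final step: earlier steps still get credit, discounted by distance.
# G_0 = 0.99**2, G_1 = 0.99, G_2 = 1.0
print(compute_returns([0.0, 0.0, 1.0]))
```

Iterating backwards turns what looks like an O(T²) sum into a single O(T) sweep.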
Modern policy gradient variants include REINFORCE with baselines, Advantage Actor-Critic (A2C/A3C), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO), which stabilises training by limiting how far the policy can move in a single update.
One of the most impactful recent applications of RL is Reinforcement Learning from Human Feedback (RLHF) — the technique used to make ChatGPT and other LLMs helpful, harmless, and honest.
The process has three stages:
Stage 1: Pre-training
Train a large language model on internet text to predict the next token. This gives the model broad knowledge and language ability, but no understanding of what humans find helpful.
Stage 2: Train a Reward Model
Ask human annotators to rate pairs of model responses: "Which response is better?" Use these ratings to train a reward model, a neural network that scores how good a response is.
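Stage 2 is usually framed as a pairwise ranking problem: the reward model should score the human-preferred response above the rejected one (a Bradley-Terry style objective). A NumPy sketch of that loss for a single comparison follows; the scores are made up for illustration:

```python
import numpy as np

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): near zero when the preferred response scores higher."""
    margin = score_chosen - score_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Correct ranking with a wide margin gives a small loss...
print(pairwise_reward_loss(2.0, -1.0))
# ...while preferring the worse response is penalised heavily.
print(pairwise_reward_loss(-1.0, 2.0))
```

Minimising this loss over many human comparisons pushes the reward model's scores to agree with human preferences.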
Stage 3: RL Fine-tuning with PPO
Use the reward model as the environment. The LLM is the agent; its actions are generating tokens; its reward comes from the reward model's score of the completed response.
Human prompt → LLM generates response → Reward model scores it
                                      → PPO updates the LLM to generate
                                        higher-scoring responses
This loop runs for thousands of iterations until the LLM consistently generates responses that humans rate highly. The result is a model that is not just knowledgeable, but reliably steers its answers towards what humans find helpful.
Go is a board game so complex — more possible board positions than atoms in the observable universe — that it was considered one of the hardest challenges in AI. Most experts expected human champions to remain unbeaten for decades.
In March 2016, DeepMind's AlphaGo defeated world champion Lee Sedol 4-1.
The system used a combination of: a policy network, first trained on expert human games and then refined through self-play RL; a value network that evaluates board positions; and Monte Carlo tree search to look ahead during play.
AlphaZero (2017) went further: it learned Go, chess, and shogi from scratch, with no human game data, only the rules and self-play via RL. Its predecessor AlphaGo Zero surpassed AlphaGo after three days of self-play, and AlphaZero outperformed the strongest chess engine (Stockfish) after roughly four hours of training.
RL is central to modern robotics research, from legged locomotion and dexterous manipulation to navigation, with policies typically trained in simulation and then transferred to real hardware.
A key challenge in robotics RL is the sim-to-real gap: policies trained in simulation often fail in the real world because the simulation doesn't perfectly model physics, friction, sensor noise, and unexpected objects.
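One common mitigation is domain randomization: training across many randomly perturbed versions of the simulator so the policy cannot overfit to one exact physics model. A minimal sketch; the parameter names and ranges below are invented for illustration:

```python
import random

def randomized_physics(rng=None):
    """Sample a fresh set of simulator parameters for one training episode."""
    rng = rng or random.Random()
    return {
        "friction":     rng.uniform(0.5, 1.5),   # scale on the nominal friction coefficient
        "mass_scale":   rng.uniform(0.8, 1.2),   # vary link masses by up to 20%
        "sensor_noise": rng.uniform(0.0, 0.02),  # std of noise added to observations
        "latency_ms":   rng.uniform(0.0, 40.0),  # simulated actuation delay
    }

# Each episode sees a slightly different "world", so the policy must learn to be robust.
params = randomized_physics(random.Random(0))
```

A policy that succeeds across the whole randomized family is far more likely to survive the mismatch between any one simulator and the real robot.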
RL is powerful but has real constraints:
Sample inefficiency: agents often need millions of interactions to learn what a human picks up in minutes.
Reward design: a poorly specified reward invites reward hacking, where the agent maximises the score in unintended ways.
Instability: training can be highly sensitive to hyperparameters and random seeds.
Safety: exploratory actions can be costly or dangerous on real hardware.
What is RLHF and why is it used to train models like ChatGPT?