Backpropagation gives us gradients - but gradients of what? Before backprop can run, we need a single number that captures how wrong the model is. That number comes from a loss function. Once we have gradients, an optimiser decides how to update the weights. Together, they form the learning loop.
A loss function (also called a cost function) takes the model's prediction and the true answer, and returns a number measuring "wrongness." The goal of training is to minimise this number.
Think of it like a score in golf - lower is better. A loss of 0 means a perfect prediction.
When predicting continuous values (house prices, temperature), we use Mean Squared Error (MSE):
MSE = (1/n) × Σ(predicted - actual)²
Squaring does two things: it makes all errors positive, and it punishes large errors disproportionately. Predict a house price off by £100k and the squared error is 100× worse than being off by £10k.
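That formula translates directly into a few lines of Python. This is a minimal sketch (the function name is illustrative):

```python
def mse(predicted, actual):
    # Mean of squared differences: (1/n) * sum((predicted - actual)^2)
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# Being off by 100 is punished 100x harder than being off by 10
print(mse([110], [100]))  # error 10  -> squared error 100
print(mse([200], [100]))  # error 100 -> squared error 10000
```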
The method of least squares, on which MSE is built, dates back to Carl Friedrich Gauss - over two centuries before neural networks. He claimed to have used it as early as 1795, and famously applied it in 1801 to predict the orbit of the asteroid Ceres.
When predicting categories (spam or not, cat vs dog), we use cross-entropy loss. It measures how far the model's predicted probabilities are from the true labels.
If the correct answer is "cat" and the model says 99% cat, the loss is tiny. If it says 10% cat, the loss is enormous. Cross-entropy has a useful property: it becomes infinitely unhappy when the model is confidently wrong, creating a strong gradient to correct the mistake.
Binary cross-entropy is for two-class problems. Categorical cross-entropy handles multiple classes by comparing probability distributions.
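Binary cross-entropy is simple enough to sketch directly. The `eps` clamp below is an illustrative guard so the logarithm never sees exactly 0 or 1:

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    # p: predicted probability of the positive class; y: true label (0 or 1)
    p = min(max(p, eps), 1 - eps)  # clamp to keep log() well-defined
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(0.99, 1))  # confidently right: ~0.01
print(binary_cross_entropy(0.10, 1))  # confidently wrong: ~2.30
```

Note how the loss grows without bound as the predicted probability for the true class approaches zero - the "infinitely unhappy" behaviour described above.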
Why is MSE a poor choice for classification tasks?
With a loss function defined, we can visualise the loss landscape - a surface where each point represents a set of weights and the height is the loss. Training means finding the lowest valley.
Gradient descent is the algorithm that gets us there: compute the gradient of the loss with respect to every weight, take a small step in the opposite (downhill) direction, and repeat until the loss stops improving.
The size of each step is controlled by the learning rate - arguably the most important hyperparameter in deep learning.
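The update rule can be seen in miniature on a toy one-parameter loss L(w) = w², whose gradient is 2w (the starting value and learning rate here are purely illustrative):

```python
w = 5.0             # starting weight
learning_rate = 0.1

for _ in range(50):
    grad = 2 * w               # gradient of the loss at the current weight
    w -= learning_rate * grad  # step downhill, scaled by the learning rate

print(w)  # approaches the minimum at w = 0
```

Try a learning rate of 1.1 instead: each step overshoots and the weight diverges - a first taste of why this hyperparameter matters so much.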
Imagine hiking down a foggy mountain where you can only feel the slope directly under your feet. You step downhill, but you cannot see the whole landscape. How might you end up in a small dip that is not the deepest valley? This is the local minimum problem.
Batch gradient descent: computes the gradient using the entire dataset before each update. Accurate but painfully slow for large datasets - imagine re-reading every book in a library before correcting a single spelling mistake.
Stochastic gradient descent (SGD): updates weights after each single example. Fast but noisy - the path zigzags wildly. The noise can actually help escape local minima, which is a surprising benefit.
Mini-batch gradient descent: the practical sweet spot. Computes gradients on a small batch (typically 32–512 examples). Balances speed and stability, and is what virtually all modern training uses.
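A minimal sketch of how mini-batches are typically drawn - shuffle once per epoch, then take consecutive slices (the function name and batch size are illustrative):

```python
import random

def minibatches(data, batch_size=32):
    # Shuffle once per epoch, then yield consecutive slices
    data = list(data)
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# 100 examples with batch_size=32 gives batches of 32, 32, 32 and 4
batches = list(minibatches(range(100)))
```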
Plain SGD has limitations. Researchers have developed smarter optimisers that adapt as they go.
Momentum: like a heavy ball rolling downhill, momentum accumulates velocity in consistent directions and dampens oscillations. If the gradient keeps pointing the same way, momentum accelerates. If it keeps changing direction, momentum smooths it out.
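The momentum update on a single parameter can be sketched as follows (the hyperparameter values are illustrative; `beta` is the momentum coefficient):

```python
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Velocity is a running accumulation of past gradients:
    # consistent directions build speed, flip-flopping directions cancel out
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

# On the toy loss w^2 (gradient 2w), momentum steadily rolls toward 0
w, vel = 5.0, 0.0
for _ in range(100):
    w, vel = momentum_step(w, 2 * w, vel)
```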
AdaGrad: adapts the learning rate per parameter. Frequently updated weights get smaller steps; rarely updated weights get larger steps. Great for sparse data (like text), but the learning rate can shrink to zero over time.
Adam (Adaptive Moment Estimation): combines momentum and per-parameter adaptive rates. It maintains running averages of both the gradient (first moment) and the squared gradient (second moment). Adam is the default choice for most practitioners today.
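The Adam update can be sketched per parameter. The default hyperparameters below are those suggested in the original paper; the function name and the toy usage are illustrative:

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment: running average of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: running average of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction: the moments start at zero,
    v_hat = v / (1 - b2 ** t)           # so early averages are biased low
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Here `t` is the 1-indexed step count; the bias correction compensates for the running averages being initialised at zero.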
What advantage does Adam have over basic SGD?
Rather than fixing the learning rate, modern training often schedules it: common patterns include a brief warmup, step decay, and cosine annealing.
The intuition: take big steps early to explore broadly, then small steps later to fine-tune.
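One popular schedule, cosine annealing, captures that intuition in a single formula. A minimal sketch (the bounds `lr_max` and `lr_min` are illustrative):

```python
import math

def cosine_schedule(step, total_steps, lr_max=0.1, lr_min=0.001):
    # Starts at lr_max and decays smoothly along a half-cosine down to lr_min
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```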
Sometimes gradients explode (as we saw in the backpropagation lesson). Gradient clipping caps the gradient magnitude before the update step. If the gradient exceeds a threshold, it is scaled down proportionally. This is standard practice when training RNNs and Transformers.
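Clipping by global norm can be sketched in a few lines (the threshold value is illustrative):

```python
import math

def clip_gradients(grads, max_norm=1.0):
    # If the overall gradient norm exceeds the threshold, scale every
    # component down proportionally - the direction is preserved
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return grads
```

Because every component is scaled by the same factor, the update still points the same way - it is just shorter.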
What does gradient clipping prevent?
The Adam optimiser paper (Kingma & Ba, 2014) has over 150,000 citations, making it one of the most cited papers in all of computer science.
If you were training a model and the loss stopped decreasing after a few epochs, what would you investigate first - the learning rate, the loss function, or the data? Why?