📉
AI Sprout • Intermediate ⏱️ 15 min read

Loss Functions and Optimisers

Backpropagation gives us gradients - but gradients of what? Before backprop can run, we need a single number that captures how wrong the model is. That number comes from a loss function. Once we have gradients, an optimiser decides how to update the weights. Together, they form the learning loop.

What Is a Loss Function?

A loss function (also called a cost function) takes the model's prediction and the true answer, and returns a number measuring "wrongness." The goal of training is to minimise this number.

Think of it like a score in golf - lower is better. A loss of 0 means a perfect prediction.

[Figure: a U-shaped curve with loss on the y-axis and weight value on the x-axis, and a ball rolling towards the minimum. Training is like rolling a ball downhill on the loss landscape, searching for the lowest point.]

Loss Functions for Regression - MSE

When predicting continuous values (house prices, temperature), we use Mean Squared Error (MSE):

MSE = (1/n) × Σ(predicted - actual)²

Squaring does two things: it makes all errors positive, and it punishes large errors disproportionately. Predict a house price off by £100k and the squared error is 100× worse than being off by £10k.
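The formula above fits in a few lines of plain Python. A minimal sketch (an illustration, not code from the lesson):

```python
def mse(predicted, actual):
    """Mean Squared Error: the average of the squared differences."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# Squaring punishes large mistakes disproportionately: an error of 100
# contributes 100x more loss than an error of 10.
print(mse([100], [0]))  # 10000.0
print(mse([10], [0]))   # 100.0
```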

🤯

Gauss claimed to have been using the method of least squares - the idea behind MSE - since 1795, over two centuries before neural networks. He famously applied it in 1801 to predict the orbit of the asteroid Ceres.

Loss Functions for Classification - Cross-Entropy

When predicting categories (spam or not, cat vs dog), we use cross-entropy loss. It measures how far the model's predicted probabilities are from the true labels.

If the correct answer is "cat" and the model says 99% cat, the loss is tiny. If it says 10% cat, the loss is enormous. Cross-entropy has a useful property: it becomes infinitely unhappy when the model is confidently wrong, creating a strong gradient to correct the mistake.

Binary cross-entropy is for two-class problems. Categorical cross-entropy handles multiple classes by comparing probability distributions.
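The "infinitely unhappy when confidently wrong" behaviour is easy to see numerically. A minimal binary cross-entropy sketch in Python (the function name and clipping constant are illustrative choices, not from the lesson):

```python
import math

def binary_cross_entropy(p, y):
    """p: predicted probability of class 1; y: true label (0 or 1)."""
    eps = 1e-12                      # keep p away from 0 and 1 so log() is safe
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Correct answer is "cat" (y = 1):
print(binary_cross_entropy(0.99, 1))  # tiny loss (~0.01)
print(binary_cross_entropy(0.10, 1))  # large loss (~2.3)
```

As the predicted probability of the true class approaches 0, the loss grows without bound, producing the strong corrective gradient described above.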

🧠 Quiz

Why is MSE a poor choice for classification tasks?

Gradient Descent - Rolling Downhill

With a loss function defined, we can visualise the loss landscape - a surface where each point represents a set of weights and the height is the loss. Training means finding the lowest valley.

Gradient descent is the algorithm that gets us there:

  1. Compute the gradient (slope) at the current position.
  2. Take a step in the opposite direction (downhill).
  3. Repeat.

The size of each step is controlled by the learning rate - arguably the most important hyperparameter in deep learning.
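The three-step loop above can be sketched in a few lines of Python, minimising the toy function f(x) = (x − 3)², whose gradient is 2(x − 3) (a made-up example, not from the lesson):

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeatedly step opposite the gradient - i.e. downhill."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # steps 1-2: slope, then downhill step
    return x                             # step 3: repeated `steps` times

# Minimise f(x) = (x - 3)^2; the minimum is at x = 3.
x = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, learning_rate=0.1, steps=100)
print(round(x, 4))  # 3.0
```

Re-running with learning_rate=1.1 makes each update overshoot further than the last and diverge, while 0.0001 barely moves in 100 steps - the dilemma discussed next.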

The Learning Rate Dilemma

  • Too high: You overshoot the valley, bouncing back and forth or diverging entirely.
  • Too low: You creep along painfully slowly and may get stuck in a shallow local minimum.
  • Just right: You converge steadily to a good solution.
🤔
Think about it:

Imagine hiking down a foggy mountain where you can only feel the slope directly under your feet. You step downhill, but you cannot see the whole landscape. How might you end up in a small dip that is not the deepest valley? This is the local minimum problem.

Flavours of Gradient Descent

Batch Gradient Descent

Computes the gradient using the entire dataset before each update. Accurate but painfully slow for large datasets - imagine re-reading every book in a library before correcting a single spelling mistake.

Stochastic Gradient Descent (SGD)

Updates weights after each single example. Fast but noisy - the path zigzags wildly. The noise can actually help escape local minima, which is a surprising benefit.

Mini-Batch Gradient Descent

The practical sweet spot. Computes gradients on a small batch (typically 32–512 examples). Balances speed and stability, and is what virtually all modern training uses.
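The three flavours differ only in how much data feeds each update. A hypothetical batching helper in Python (names and sizes are illustrative):

```python
import random

def minibatches(dataset, batch_size):
    """Shuffle once per epoch, then yield consecutive slices."""
    indices = list(range(len(dataset)))
    random.shuffle(indices)                     # fresh order every epoch
    for i in range(0, len(indices), batch_size):
        yield [dataset[j] for j in indices[i:i + batch_size]]

data = list(range(10))
# batch_size=len(data) -> batch GD; batch_size=1 -> SGD; in between -> mini-batch.
sizes = [len(b) for b in minibatches(data, batch_size=4)]
print(sizes)  # [4, 4, 2]
```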

Modern Optimisers

Plain SGD has limitations. Researchers have developed smarter optimisers that adapt as they go.

SGD with Momentum

Like a heavy ball rolling downhill, momentum accumulates velocity in consistent directions and dampens oscillations. If the gradient keeps pointing the same way, momentum accelerates. If it keeps changing direction, momentum smooths it out.
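The heavy-ball idea is two update lines. A minimal sketch on the same toy quadratic (a simplified form of momentum; library implementations differ in details):

```python
def sgd_momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    """SGD with momentum: velocity accumulates in consistent directions."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)  # accumulate velocity; flips cancel out
        x = x - lr * v          # step along the smoothed direction
    return x

# Minimise (x - 3)^2 again: momentum overshoots slightly, then settles.
x = sgd_momentum(lambda x: 2 * (x - 3), x0=0.0)
print(round(x, 3))  # 3.0
```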

AdaGrad

Adapts the learning rate per parameter. Frequently updated weights get smaller steps; rarely updated weights get larger steps. Great for sparse data (like text), but the learning rate can shrink to zero over time.

Adam (Adaptive Moment Estimation)

Combines momentum and per-parameter adaptive rates. It maintains running averages of both the gradient (first moment) and the squared gradient (second moment). Adam is the default choice for most practitioners today.
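The two running averages translate directly into code. A minimal single-parameter sketch of the Adam update rule (simplified for one weight; not the lesson's own code):

```python
import math

def adam(grad, x0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first moment: momentum
        v = beta2 * v + (1 - beta2) * g * g  # second moment: adaptive scale
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

x = adam(lambda x: 2 * (x - 3), x0=0.0)  # ends up near the minimum at 3
```

Dividing by the square root of the second moment means each parameter effectively gets its own learning rate, which is why Adam needs far less tuning than plain SGD.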

🧠 Quiz

What advantage does Adam have over basic SGD?

Learning Rate Schedules

Rather than fixing the learning rate, modern training often schedules it:

  • Step decay: Halve the rate every N epochs.
  • Cosine annealing: Smoothly decrease following a cosine curve, sometimes with warm restarts.
  • Warmup: Start with a tiny rate, gradually increase, then decay. Used in Transformer training.

The intuition: take big steps early to explore broadly, then small steps later to fine-tune.
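Warmup followed by cosine decay - the Transformer-style schedule above - can be written as a pure function of the step number (parameter values here are illustrative):

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps, max_lr):
    """Linear warmup to max_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Big steps early to explore, small steps later to fine-tune:
for s in (0, 100, 550, 1000):
    print(s, cosine_with_warmup(s, total_steps=1000, warmup_steps=100, max_lr=0.001))
```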

Gradient Clipping - Safety Rails

Sometimes gradients explode (as we saw in the backpropagation lesson). Gradient clipping caps the gradient magnitude before the update step. If the gradient exceeds a threshold, it is scaled down proportionally. This is standard practice when training RNNs and Transformers.
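The proportional scale-down described above is a short function. A minimal sketch of clipping by global L2 norm (the threshold value is an illustrative choice):

```python
import math

def clip_by_norm(grads, max_norm):
    """If the gradient's L2 norm exceeds max_norm, scale it down proportionally."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return grads

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> scaled down to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # already small -> unchanged
```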

🧠 Quiz

What does gradient clipping prevent?

🤯

The Adam optimiser paper (Kingma & Ba, 2014) has over 150,000 citations, making it one of the most cited papers in all of computer science.

Key Takeaways

  • Loss functions quantify how wrong a model is - MSE for regression, cross-entropy for classification.
  • Gradient descent minimises the loss by repeatedly stepping opposite to the gradient.
  • The learning rate controls step size and is critical to get right.
  • Adam is the go-to optimiser, combining momentum and adaptive rates.
  • Learning rate schedules and gradient clipping are essential training stabilisers.
🤔
Think about it:

If you were training a model and the loss stopped decreasing after a few epochs, what would you investigate first - the learning rate, the loss function, or the data? Why?


📚 Further Reading

  • Andrej Karpathy - A Recipe for Training Neural Networks - Practical wisdom on loss debugging and optimiser selection
  • 3Blue1Brown - Gradient Descent - Stunning visual intuition for how gradient descent navigates loss landscapes
  • An Overview of Gradient Descent Optimisation Algorithms (Ruder, 2016) - Comprehensive comparison of SGD, Adam, and friends
Lesson 7 of 16

← Backpropagation
Tokenisation →