AI Educademy
AI Sprouts • Intermediate • ⏱️ 16 min read

Backpropagation

Backpropagation - The Engine of Learning

In the previous lessons you saw that neural networks have weights, and training adjusts those weights. But how does the network know which weights to change, and by how much? The answer is backpropagation - the single most important algorithm in modern deep learning.

Andrej Karpathy calls it "the most important thing to understand about neural networks." Let us see why.

A Quick Forward Pass Recap

During a forward pass, data flows left to right through the network:

  1. Inputs are multiplied by weights and summed.
  2. A bias is added.
  3. An activation function (like ReLU) is applied.
  4. The output feeds into the next layer, repeating until a final prediction emerges.
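The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real framework: the function names `relu` and `neuron_forward`, and all the input, weight, and bias values, are made up for the example.

```python
def relu(z):
    """ReLU activation: keep positive values, clamp negatives to 0."""
    return max(0.0, z)


def neuron_forward(inputs, weights, bias):
    """One neuron: weighted sum, add bias, apply activation."""
    z = sum(x * w for x, w in zip(inputs, weights))  # step 1: multiply and sum
    z += bias                                        # step 2: add the bias
    return relu(z)                                   # step 3: activation


# Hypothetical values, chosen only so the arithmetic is easy to follow.
inputs = [1.0, 2.0]
hidden = neuron_forward(inputs, weights=[0.5, -0.25], bias=0.5)

# Step 4: the output of one layer becomes the input of the next.
output = neuron_forward([hidden], weights=[2.0], bias=-0.5)

print(hidden, output)  # 0.5 0.5
```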

The prediction is then compared to the true answer using a loss function (covered in the next lesson). The loss is a single number that says: "Here is how wrong you are."

Figure: a computation graph showing a forward pass through three nodes, with arrows indicating data flow from input to loss. The forward pass builds a computation graph. Backpropagation then walks it in reverse.

The Key Insight - Blame Assignment

Imagine you bake a cake and it tastes awful. You used five ingredients. The question is: which ingredient contributed most to the bad taste, and by how much?

Backpropagation answers exactly this question for neural networks. It assigns blame to every single weight by asking: "If I nudge this weight slightly, how much does the loss change?"

That rate of change is called a gradient, and it comes from calculus - specifically, the derivative.
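The "nudge a weight and watch the loss" idea can be tried directly. The sketch below assumes a hypothetical one-weight squared-error loss and estimates the gradient by nudging the weight a tiny amount in both directions. Real backpropagation computes the same number analytically and far more cheaply; this brute-force version is only meant to build intuition.

```python
def loss(w, x=3.0, y=10.0):
    """Hypothetical loss: squared error of the prediction w * x against y."""
    return (w * x - y) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Estimate df/dw by nudging w slightly up and down (central difference)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


grad = numerical_gradient(loss, 2.0)
print(round(grad, 3))  # close to -24.0: nudging w upwards lowers the loss
```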

🤯

Geoffrey Hinton, one of the "godfathers of AI," has said that backpropagation is the key idea that made deep learning practical. Without it, training networks with millions of parameters would be impractically slow.

The Chain Rule - One Idea to Rule Them All

Neural networks are chains of simple operations composed together. The chain rule from calculus tells us how to differentiate composed functions:

If y = f(g(x)), then dy/dx = f'(g(x)) × g'(x).


Everyday analogy: You drive to a shop. Your speed depends on how hard you press the accelerator. The accelerator position depends on traffic. To know how traffic affects your speed, you multiply: (speed per accelerator press) × (accelerator press per traffic condition). That is the chain rule - multiplying local rates of change along a chain.
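To see the chain rule at work on numbers, here is a small check. The functions `f` and `g` below are arbitrary choices for illustration (they do not come from any particular network). The analytic answer from the chain rule is compared against a brute-force numerical estimate.

```python
def g(x):
    return 3.0 * x + 1.0   # inner function


def f(u):
    return u ** 2          # outer function


def g_prime(x):
    return 3.0             # local derivative of g


def f_prime(u):
    return 2.0 * u         # local derivative of f


x = 2.0

# Chain rule: dy/dx = f'(g(x)) * g'(x)
analytic = f_prime(g(x)) * g_prime(x)

# Brute-force check: nudge x and measure how y = f(g(x)) changes.
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(analytic)            # 42.0
print(round(numeric, 3))   # agrees with the analytic value
```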

🤯

Backpropagation was popularised in a landmark 1986 paper by Rumelhart, Hinton, and Williams, but the core idea of reverse-mode automatic differentiation dates back to the 1960s.

Computation Graphs - Visualising the Maths

Modern frameworks like PyTorch build a computation graph during the forward pass. Every operation - add, multiply, ReLU - becomes a node. Backpropagation then walks this graph in reverse, applying the chain rule at each node to compute gradients.

Think of it like a river system. The loss is the ocean at the end. Backprop traces every tributary upstream to find how much each source (weight) contributed to the final flow.
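A toy version of this idea fits in well under a hundred lines. The sketch below is loosely in the spirit of Karpathy's micrograd, but it is not PyTorch's actual implementation: the `Value` class, its attribute names, and the example numbers are all assumptions made for illustration. Each operation records its parents and a small function that applies the chain rule locally.

```python
class Value:
    """A scalar node in a computation graph (illustrative sketch only)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0                  # dLoss/dThisNode, filled in by backward()
        self._parents = parents
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Local derivative of a sum is 1 for both inputs.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Local derivative of a product is the other factor.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def _backward():
            # Local derivative is 1 when the input was positive, else 0.
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order the nodes so every node comes after its parents...
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        # ...then walk the graph in reverse, applying the chain rule at each node.
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


w, x, b = Value(2.0), Value(3.0), Value(-1.0)
out = (w * x + b).relu()   # forward pass builds the graph: relu(2 * 3 - 1) = 5
out.backward()             # reverse pass fills in every .grad

print(out.data, w.grad, x.grad, b.grad)  # 5.0 3.0 2.0 1.0
```

Note that gradients are accumulated with `+=` rather than assigned, so a node that feeds several later operations collects a contribution from each path.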

A Tiny Worked Example

Suppose L = (w × x - y)² with w = 2, x = 3, y = 10.

  1. Forward: w × x = 6, then diff = 6 - 10 = -4, then L = diff² = (-4)² = 16. Loss = 16.
  2. Backward: dL/d(diff) = 2 × diff = 2 × (-4) = -8, and d(diff)/d(wx) = 1, so dL/d(wx) = -8 × 1 = -8.
  3. Finally, d(wx)/dw = x = 3, so dL/dw = -8 × 3 = -24.

The gradient of -24 tells us that increasing w by a small amount decreases the loss by roughly 24 times that amount. That is exactly the signal we need to improve.
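The same arithmetic can be written out step by step in Python. The intermediate variable names (`wx`, `diff`, and the `d..._d...` gradients) are just labels chosen for this sketch.

```python
w, x, y = 2.0, 3.0, 10.0

# Forward pass: compute and keep each intermediate value.
wx = w * x          # 6.0
diff = wx - y       # -4.0
loss = diff ** 2    # 16.0

# Backward pass: one local derivative per step, multiplied along the chain.
dloss_ddiff = 2 * diff   # derivative of diff**2 is 2 * diff  -> -8.0
ddiff_dwx = 1.0          # derivative of (wx - y) with respect to wx
dwx_dw = x               # derivative of (w * x) with respect to w -> 3.0

dloss_dw = dloss_ddiff * ddiff_dwx * dwx_dw

print(loss, dloss_dw)  # 16.0 -24.0
```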

🧠 Quiz

In the chain rule, what do we do with the local derivatives at each node?

Gradient Flow Through Layers

In a deep network, gradients must travel through many layers. Each layer multiplies the gradient by its local derivative. This creates two dangerous failure modes:

Vanishing Gradients

If local derivatives are small, repeated multiplication makes gradients shrink towards zero. The sigmoid function is the classic culprit: its derivative never exceeds 0.25 and approaches zero when its output saturates near 0 or 1. Early layers barely learn because they receive almost no signal. This plagued early deep networks.
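A rough feel for the numbers: the sketch below multiplies a gradient by the sigmoid's derivative once per layer, for ten layers. It deliberately ignores the weights (a simplifying assumption) and uses the best case for sigmoid, an input of 0, where the derivative is at its maximum of 0.25.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


grad = 1.0
for layer in range(10):
    grad *= sigmoid_grad(0.0)   # best case: multiply by 0.25 at every layer

print(grad)  # roughly 9.5e-07 after only ten layers
```

With saturated units the local derivative is far below 0.25, so the signal reaching the early layers shrinks even faster.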

Exploding Gradients

If local derivatives are consistently larger than 1 (for example, because of large weights), gradients grow exponentially with depth. Weights receive enormous updates and training becomes unstable, often ending in NaN values.
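The mirror image is just as easy to simulate. The factor of 1.5 per layer and the depth of 50 below are arbitrary assumptions, picked only to show how quickly repeated multiplication gets out of hand.

```python
grad = 1.0
local_derivative = 1.5   # hypothetical: every layer amplifies the gradient a little

for layer in range(50):
    grad *= local_derivative

print(f"{grad:.1e}")  # hundreds of millions after 50 layers
```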

🤔
Think about it:

ReLU's derivative is either 0 or 1 - it never shrinks the gradient when active. Why might this simple property have been revolutionary for training deep networks?

Modern solutions include:

  • ReLU activation - derivative is 1 for positive inputs, avoiding shrinkage.
  • Residual connections (skip connections) - give gradients a highway to bypass layers.
  • Batch normalisation - keeps each layer's activations in a stable range, which helps gradients stay well-scaled.
  • Gradient clipping - caps gradients to prevent explosions.
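Of these, gradient clipping is the simplest to show in code. One common variant rescales the whole gradient vector when its length exceeds a threshold. The sketch below is a plain-Python illustration of that idea, with a made-up function name `clip_by_norm` and made-up numbers; deep learning frameworks ship their own utilities for this.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so their overall length (L2 norm) is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


# Hypothetical exploding gradient with norm 50, clipped to norm 25.
clipped = clip_by_norm([30.0, -40.0], max_norm=25.0)
print(clipped)  # [15.0, -20.0]
```

Clipping by norm shrinks every component by the same factor, so the direction of the update is preserved and only its size is capped.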

How Weights Actually Update

Once backprop computes every gradient, the optimiser (next lesson) updates each weight:

w_new = w_old - learning_rate × gradient

The learning rate controls the step size. Too large and you overshoot; too small and training takes forever. The gradient points in the direction that increases the loss, so we step the opposite way (hence the minus sign); the learning rate tells you how far to step.
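Here is the update rule in a loop, as a minimal sketch. It assumes a hypothetical one-weight loss L = (w × x - y)² with x = 3 and y = 10, whose minimum sits at w = 10/3, and tries three learning rates chosen for illustration.

```python
def train(learning_rate, steps, w=2.0, x=3.0, y=10.0):
    """Run plain gradient descent on the loss (w * x - y) ** 2."""
    for _ in range(steps):
        gradient = 2 * (w * x - y) * x        # dL/dw via the chain rule
        w = w - learning_rate * gradient      # w_new = w_old - lr * gradient
    return w


good = train(learning_rate=0.01, steps=100)        # settles near 10/3
too_large = train(learning_rate=0.2, steps=20)     # overshoots further every step
too_small = train(learning_rate=0.00001, steps=100)  # has barely moved from 2.0

print(round(good, 4))       # 3.3333
print(abs(too_large) > 1e6) # True: the weight has blown up
print(round(too_small, 2))  # 2.02
```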

🧠 Quiz

What causes vanishing gradients in deep networks?

Why Backprop Matters

Whenever a model like ChatGPT is trained to improve its next-word prediction, or a self-driving system is trained to refine its steering, backpropagation is running underneath. It is the algorithm that makes learning from mistakes mathematically precise.

Without backprop, we would have no efficient way to train networks with millions - or billions - of parameters.

🧠 Quiz

What does a gradient tell us about a weight?

🤔
Think about it:

Karpathy emphasises that backprop is "just recursive application of the chain rule." If you understand the chain rule and computation graphs, you understand backprop. What other complex systems could be understood by breaking them into simple, composable pieces?

Key Takeaways

  • Backpropagation computes gradients by walking the computation graph in reverse.
  • The chain rule multiplies local derivatives along each path.
  • Vanishing gradients slow learning; exploding gradients destabilise it.
  • Modern tricks (ReLU, skip connections, gradient clipping) keep gradient flow healthy.
  • Backprop + an optimiser = the learning engine of modern deep learning.

📚 Further Reading

  • Andrej Karpathy - nn-zero-to-hero (micrograd) - Build backprop from scratch in Python
  • 3Blue1Brown - Backpropagation Calculus - Beautiful visual explanation of the chain rule in neural networks
  • CS231n Backprop Notes - Stanford's concise reference on computation graphs and gradient flow