🎮
AI Branches • Intermediate • ⏱️ 15 min read

Reinforcement Learning

How AI Learns by Doing

Imagine teaching a puppy to sit. You do not hand it a textbook - you let it try things, and when it sits, you give it a treat. Over time, it learns which actions earn rewards. Reinforcement Learning (RL) works the same way: an AI agent learns by interacting with an environment, receiving feedback, and adjusting its behaviour to maximise long-term reward.

The Agent-Environment Loop

Every RL system has the same core structure:

  1. The agent observes the current state of the environment.
  2. It chooses an action from its available options.
  3. The environment transitions to a new state and returns a reward signal.
  4. The agent uses that reward to update its strategy (called a policy).

This loop repeats - potentially millions of times - until the agent discovers a policy that consistently earns high rewards. Unlike supervised learning, there is no labelled dataset telling the agent what to do. It must discover effective behaviour purely through interaction.

Diagram showing the agent-environment loop: state, action, reward, next state
The agent-environment loop is the foundation of all reinforcement learning.
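The four-step loop above fits in a few lines of code. Here is a minimal sketch; `ToyEnv` is a made-up one-dimensional environment (the agent walks left or right along a corridor toward a goal), not any real library's API:

```python
import random

class ToyEnv:
    """A tiny 1-D corridor: states 0..4, with a reward for reaching state 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the corridor)
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = ToyEnv()
state = env.reset()                         # 1. observe the current state
total_reward = 0.0
for _ in range(20):                         # one episode, capped at 20 steps
    action = random.choice([0, 1])          # 2. choose an action (random policy here)
    state, reward, done = env.step(action)  # 3. environment returns new state + reward
    total_reward += reward                  # 4. a learning agent would update its policy here
    if done:
        break
```

A real agent would replace the random `choice` in step 2 with a learned policy, and step 4 with an actual update rule such as Q-learning.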

Exploration vs Exploitation

One of RL's central dilemmas: should the agent exploit what it already knows works, or explore unknown actions that might yield even better rewards?

  • A restaurant analogy: do you order your favourite dish (exploit) or try something new that might be even better (explore)?
  • Too much exploitation and the agent gets stuck in mediocre strategies. Too much exploration and it wastes time on poor actions.

Most algorithms use an epsilon-greedy strategy: with probability ε the agent explores randomly, and the rest of the time it exploits its best-known action. Over training, ε gradually decreases as the agent becomes more confident. More sophisticated methods - like Upper Confidence Bound (UCB) and Thompson Sampling - explore more intelligently by targeting uncertain actions.
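Epsilon-greedy selection with a decaying ε can be sketched in a few lines (the decay schedule and constants below are illustrative choices, not standard values):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action; otherwise pick the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# Decay epsilon over training so the agent explores less as it grows confident.
epsilon, min_epsilon, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, choosing each action via epsilon_greedy(q_values, epsilon) ...
    epsilon = max(min_epsilon, epsilon * decay)
```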

🤯
DeepMind's AlphaGo bootstrapped its policy network on roughly 30 million board positions from human expert games, then refined it through millions of games of self-play - far more experience than any human could accumulate in a lifetime.

Q-Learning - A Simple but Powerful Idea

Q-learning assigns a value (called a Q-value) to every state-action pair. The Q-value represents the expected total future reward if the agent takes that action in that state and then follows the best policy afterwards.

The update rule is surprisingly elegant:

Q(state, action) ← Q(state, action) + α × [reward + γ × max Q(next_state, all_actions) − Q(state, action)]
  • α (learning rate) controls how quickly the agent updates its beliefs.
  • γ (discount factor) controls how much the agent values future rewards versus immediate ones.

Over many episodes, the Q-values converge, and the agent can simply pick the action with the highest Q-value in any state. For small environments this table-based approach works well. When the state space is too large for a table (say, raw pixel inputs from a video game), a neural network approximates the Q-values instead - the Deep Q-Network (DQN) approach DeepMind used to master Atari games in 2015.
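The update rule translates almost line-for-line into code. Below is a tabular sketch on a hypothetical five-state corridor (the agent must walk right to reach a reward at the end); the state space, episode counts, and hyperparameters are illustrative choices:

```python
import random

N_STATES = 5                      # states 0..4; reward for reaching state 4
ACTIONS = (0, 1)                  # 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]
rng = random.Random(42)

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for _ in range(500):                                  # training episodes
    state = 0
    for _ in range(100):                              # cap episode length
        if rng.random() < epsilon or Q[state][0] == Q[state][1]:
            action = rng.choice(ACTIONS)              # explore (and break ties randomly)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        td_target = reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (td_target - Q[state][action])
        state = next_state
        if state == N_STATES - 1:                     # reached the goal
            break

# Read off the greedy policy: it should move right in every non-terminal state.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

Note how `td_target` is exactly the bracketed term from the update rule: the immediate reward plus the discounted value of the best next action.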

🧠 Quiz

What does the Q-value in Q-learning represent?

Policy Gradient Methods

Q-learning works by learning the value of actions. Policy gradient methods take a different path: they directly optimise the policy itself - the mapping from states to actions.

Instead of asking "how good is this action?", they ask "how should I adjust my action probabilities to get more reward?" The agent nudges its policy in the direction that increases expected reward, using gradient ascent.
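As a minimal illustration of "nudge action probabilities toward more reward", here is REINFORCE on a one-state problem: a three-armed bandit with made-up payout probabilities. A real policy-gradient agent would use a neural network conditioned on the state, but the core update - score each sampled action by its reward relative to a baseline, and push its log-probability up or down accordingly - is the same idea:

```python
import math
import random

rng = random.Random(0)
payout = [0.2, 0.8, 0.5]          # hypothetical win probability of each arm
theta = [0.0, 0.0, 0.0]           # policy parameters: one logit per action
lr = 0.05

def policy(theta):
    """Softmax over logits -> action probabilities."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

baseline = 0.0                     # running average reward; reduces variance
for t in range(5000):
    probs = policy(theta)
    action = rng.choices(range(3), weights=probs)[0]     # sample from the policy
    reward = 1.0 if rng.random() < payout[action] else 0.0
    baseline += (reward - baseline) / (t + 1)
    # REINFORCE: the gradient of log pi(action) wrt the logits is (one-hot - probs),
    # scaled here by the advantage (reward minus baseline).
    for i in range(3):
        grad = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += lr * (reward - baseline) * grad
```

After training, most of the probability mass typically ends up on arm 1, the arm with the highest payout.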

Proximal Policy Optimisation (PPO), developed by OpenAI, is one of the most popular policy gradient algorithms. It is stable, efficient, and was used to train the RL component of ChatGPT (via RLHF - Reinforcement Learning from Human Feedback).

AlphaGo - The Game That Changed Everything

In March 2016, DeepMind's AlphaGo defeated Lee Sedol, one of the greatest Go players in history, 4 games to 1. Go has more possible board positions than atoms in the universe - brute-force search was impossible.

AlphaGo combined:

  • Supervised learning from human expert games to bootstrap its policy.
  • Self-play RL to improve far beyond human-level play.
  • Monte Carlo Tree Search to plan moves efficiently.

Its successor, AlphaGo Zero, skipped human data entirely - learning solely from self-play - and surpassed the original AlphaGo within 40 days.

🤔
Think about it: AlphaGo Zero learned without any human knowledge. Does this suggest that human expertise can sometimes be a limitation rather than an advantage for AI?

OpenAI Five and Game AI

In 2019, OpenAI Five defeated the world champion team in Dota 2 - a game with incomplete information, long time horizons, and complex team coordination. The five agents were trained with PPO on a large distributed cluster (hundreds of GPUs and over a hundred thousand CPU cores) running for months, accumulating the equivalent of roughly 45,000 years of gameplay experience.

Game AI has always been a proving ground for RL, but the techniques developed here transfer to far more consequential domains. The sheer scale of training - distributing RL across huge GPU and CPU fleets - pushed the boundaries of what was computationally feasible and inspired new infrastructure for large-scale AI training.

Real-World Reinforcement Learning

RL is increasingly applied beyond games:

  • Robotics - Teaching robot arms to grasp objects, navigate terrain, and perform assembly tasks.
  • Self-driving vehicles - Making lane-change and intersection decisions in complex traffic.
  • Data centre cooling - DeepMind used RL to cut the energy used for cooling Google's data centres by up to 40%.
  • Drug discovery - Optimising molecular structures for desired properties.
🧠 Quiz

What is the exploration vs exploitation trade-off?

Reward Hacking - When AI Games the System

A critical challenge in RL is reward hacking: the agent finds unexpected shortcuts to maximise its reward signal without actually solving the intended task.

Famous examples include a boat-racing agent that discovered it could score more points by spinning in circles collecting bonus items than by finishing the race, and simulated robots that learned to exploit physics engine bugs to "walk" by vibrating.

This highlights a fundamental truth: the agent optimises exactly what you measure, which may not be what you actually want. Designing reward functions that truly capture intended behaviour is one of the most important - and most difficult - aspects of applied RL. Researchers are exploring techniques like reward modelling and inverse reinforcement learning to address this challenge.

Sim-to-Real Transfer

Training RL agents in the real world is expensive, slow, and potentially dangerous. Sim-to-real transfer trains agents in simulated environments and then deploys them in reality.

The challenge: simulations are never perfectly accurate. Techniques like domain randomisation - randomly varying simulation parameters (friction, lighting, object shapes) - force agents to learn robust policies that generalise across conditions.
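In practice, domain randomisation often amounts to re-sampling the simulator's physical parameters at the start of every episode. A sketch with made-up parameter names and ranges (`make_simulator` is a hypothetical factory, not a real API):

```python
import random

def randomized_sim_params(rng):
    """Sample a fresh set of simulation parameters for one training episode."""
    return {
        "friction":     rng.uniform(0.5, 1.5),   # scale around the nominal value
        "mass_kg":      rng.uniform(0.8, 1.2),
        "light_level":  rng.uniform(0.3, 1.0),
        "sensor_noise": rng.gauss(0.0, 0.02),
    }

rng = random.Random(7)
for episode in range(3):
    params = randomized_sim_params(rng)
    # env = make_simulator(**params)   # hypothetical: build the episode's simulator
    # ... train one episode in this randomized environment ...
```

Because the agent never sees the same physics twice, it cannot overfit to one simulator configuration and must learn a policy robust enough to survive the real world's mismatch.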

🧠 Quiz

What is sim-to-real transfer?

🤔
Think about it: If an RL agent discovers a reward-hacking shortcut, is that a failure of the agent or a failure of the humans who designed the reward function?

📚 Further Reading

  • Spinning Up in Deep RL - OpenAI - Practical introduction to deep reinforcement learning with code examples
  • Mastering the Game of Go without Human Knowledge - Silver et al. - The AlphaGo Zero paper demonstrating superhuman play from pure self-play
Lesson 7 of 14
← Generative AI
Multimodal AI →