📉
AI Sprout • Intermediate ⏱️ 25 min read

Overfitting and Underfitting: Why ML Models Fail

You've trained your first machine learning model. It performs brilliantly on your training data — 98% accuracy! You test it on new data and it falls apart: 61% accuracy. What went wrong?

Almost certainly, your model has overfit. This is one of the two most common failure modes in machine learning, and understanding it — alongside its opposite, underfitting — is essential to building models that actually work in the real world.

📐 The Bias-Variance Tradeoff

Before diving into examples, it helps to understand the theoretical framework behind these concepts: the bias-variance tradeoff.

Every model makes prediction errors. Those errors can be decomposed into three parts:

Total Error = Bias² + Variance + Irreducible Noise

Bias is the error from wrong assumptions in the model. A high-bias model is too simple — it systematically misses the true pattern in the data.

Variance is the error from sensitivity to small fluctuations in the training data. A high-variance model is too complex — it memorises the training data, including its noise, rather than learning the underlying pattern.

Irreducible noise is the natural randomness in the data that no model can eliminate.

The tradeoff: reducing bias tends to increase variance, and vice versa. Your job as a machine learning practitioner is to find the sweet spot.
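The decomposition above can be made concrete with a small simulation. The sketch below (illustrative setup: a noisy sine curve, a hypothetical `bias_variance` helper) refits polynomials of two different degrees on many resampled training sets and measures, at one fixed point, how far the average prediction misses the truth (bias²) and how much predictions scatter across refits (variance):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)

def bias_variance(degree, n_datasets=200, n_points=30, x0=1.5):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit
    at the point x0, by refitting on many freshly sampled training sets."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, np.pi, n_points)
        y = true_fn(x) + rng.normal(0, 0.2, n_points)  # noisy samples
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_fn(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance

b1, v1 = bias_variance(degree=1)  # simple model: high bias, low variance
b9, v9 = bias_variance(degree=9)  # complex model: low bias, high variance
```

With this setup the degree-1 fit typically shows the larger bias² and the degree-9 fit the larger variance, which is the tradeoff in miniature.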

📈 Underfitting: Too Simple to Learn

Underfitting occurs when your model is too simple to capture the true pattern in the data. It performs poorly on both training data and new data.

A Visual Example

Imagine you have data showing house prices based on size. The true relationship is roughly a gentle curve — prices rise with size, but with some plateauing at the top end.

If you fit a straight horizontal line to this data:

# Underfitting: overly simple model
from sklearn.dummy import DummyRegressor

# True relationship is quadratic, but we're fitting a constant (the mean)
model = DummyRegressor(strategy='mean')
model.fit(X_train, y_train)

# Training accuracy:  55%
# Test accuracy:      54%
# Both are bad — classic underfitting

The model ignores the actual relationship between house size and price. It doesn't matter whether you show it training data or new data — it's wrong either way.

Signs of Underfitting

  • High training error AND high test error
  • Model predictions cluster around a mean regardless of input
  • Learning curves show both training and validation error are high and close together

Causes of Underfitting

  • Model is too simple for the complexity of the data (e.g., linear model for non-linear data)
  • Too few training epochs (model hasn't had time to learn)
  • Too aggressive regularisation (see below)
  • Important features are missing from the input
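The learning-curve signature mentioned above can be computed directly with scikit-learn. In this sketch (synthetic quadratic data, an intentionally too-simple mean predictor), both the training and validation scores stay poor and close together, which is exactly the underfitting pattern:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = (X[:, 0] - 5) ** 2 + rng.normal(0, 1, 200)  # quadratic pattern + noise

# Learning curves for an underfitting model: R^2 is near zero on BOTH
# the training folds and the validation folds, at every training size.
sizes, train_scores, val_scores = learning_curve(
    DummyRegressor(strategy='mean'), X, y,
    train_sizes=[0.2, 0.5, 1.0], cv=5, scoring='r2'
)
print(train_scores.mean(axis=1))  # ~0 at every size
print(val_scores.mean(axis=1))    # ~0 or slightly below
```

Compare this with an overfitting model, where the training score would be high while the validation score lags far behind.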

📉 Overfitting: Too Complex, Memorising Noise

Overfitting occurs when your model learns the training data too well — including its noise and random variation — and fails to generalise to new examples.

A Visual Example

Using a 15-degree polynomial to fit the same house price data:

# Overfitting: overly complex model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Degree-15 polynomial — wildly complex for this problem
model = make_pipeline(PolynomialFeatures(15), LinearRegression())
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)   # 0.99 — looks amazing!
test_score  = model.score(X_test, y_test)     # 0.43 — terrible on new data

The polynomial has twisted itself into knots to pass through every training point — including the noisy outliers. It has memorised the training set rather than learning the underlying pattern. On unseen data, it's useless.

🤯
A model that perfectly memorises all its training data is sometimes called "the world's worst model" — it has 100% training accuracy but cannot generalise at all, making it completely useless for its actual purpose.

Signs of Overfitting

  • Very low training error, but much higher test error (a large generalisation gap)
  • Model performance degrades significantly on any new data
  • The model is surprisingly sensitive to small changes in input

🔀 Train / Validation / Test Split

A fundamental tool for detecting overfitting is splitting your data into three sets:

from sklearn.model_selection import train_test_split

# First split: hold out 20% as the final test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: hold out 20% of remaining as validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)

# Result: 60% train / 20% validation / 20% test
  • Training set: what the model learns from
  • Validation set: what you use to tune hyperparameters and detect overfitting during development
  • Test set: the final, held-out evaluation — touch it only once, at the very end

The validation set is your early warning system. If training accuracy keeps improving but validation accuracy plateaus or drops, you're overfitting.

🤔
Think about it: Why is it important to keep your test set completely separate and use it only once? What could go wrong if you repeatedly evaluated on the test set and made adjustments based on it?

🛡️ Fixes for Overfitting

1. Regularisation

Regularisation adds a penalty to the loss function that discourages the model from learning overly complex patterns.

L2 Regularisation (Ridge) penalises large weights:

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha controls regularisation strength
model.fit(X_train, y_train)

L1 Regularisation (Lasso) can drive some weights all the way to zero, performing feature selection:

from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
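The feature-selection effect of L1 is easy to see on synthetic data. In this sketch (made-up data where only the first two of ten features matter), Lasso drives the coefficients of the irrelevant features to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_)

# Count how many coefficients were zeroed out entirely
n_zero = int(np.sum(np.abs(model.coef_) < 1e-6))
```

Ridge, by contrast, would shrink all ten coefficients toward zero without setting any of them exactly to zero.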

2. Dropout (Neural Networks)

During training, randomly "drop out" (set to zero) a proportion of neurons:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% of neurons randomly deactivated during training
    nn.Linear(64, 1)
)

This prevents neurons from co-adapting and forces the network to learn more robust, distributed representations.
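One detail worth seeing in action: dropout is only active in training mode. In evaluation mode PyTorch disables it (and during training the surviving activations are scaled up by 1/(1-p) so the expected output stays the same):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(8)

layer.train()    # training mode: ~half the entries zeroed, rest scaled to 2.0
print(layer(x))

layer.eval()     # evaluation mode: dropout is a no-op, output equals input
print(layer(x))
```

Forgetting to call `model.eval()` before inference is a classic bug: predictions stay randomly perturbed by dropout.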

3. Early Stopping

Monitor validation loss during training and stop when it starts to increase:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,          # stop after 5 epochs without improvement
    restore_best_weights=True
)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          callbacks=[early_stop],
          epochs=1000)

4. More Training Data

More data makes it harder for the model to memorise noise — there's simply too much to fit exactly. When data collection is expensive, data augmentation (creating modified copies of existing examples) can help.
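As a minimal illustration of augmentation for numeric data, the hypothetical `augment` helper below creates jittered copies of each example by adding small Gaussian noise (image data would instead use flips, crops, or rotations):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(X, y, n_copies=3, noise_scale=0.05):
    """Return the original data plus n_copies noise-jittered copies.
    Labels are unchanged: the jitter is assumed small enough not to
    change which class or value an example belongs to."""
    X_aug = [X]
    y_aug = [y]
    for _ in range(n_copies):
        X_aug.append(X + rng.normal(0, noise_scale, X.shape))
        y_aug.append(y)
    return np.concatenate(X_aug), np.concatenate(y_aug)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])
X_big, y_big = augment(X, y)
print(X_big.shape)  # (8, 2): 4x the original data
```

The key assumption is label-preservation: every transformation must produce an example the original label still describes.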

5. Simpler Architecture

Sometimes the right fix is simply choosing a less complex model for the problem.
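One practical way to choose that complexity is to sweep it and let validation performance decide. This sketch (synthetic quadratic data, an illustrative degree sweep) scores several polynomial degrees by cross-validation; the degree matching the true complexity should score best, with the too-simple and too-complex extremes falling off on either side:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (100, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, 100)  # true relationship is quadratic

# Sweep model complexity; keep the mean cross-validated R^2 for each degree
scores = {}
for degree in [1, 2, 5, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = cross_val_score(model, X, y, cv=5).mean()

best = max(scores, key=scores.get)
print(best, scores)  # with this data, degree 2 typically wins
```

Degree 1 underfits (high bias), degree 15 overfits (high variance), and the validation score exposes both without ever touching the test set.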

🔄 Cross-Validation

With small datasets, a single train/val split might be misleading due to randomness. K-fold cross-validation gives a more reliable estimate:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Mean accuracy: 0.847 ± 0.023

The data is split into 5 folds; the model trains on 4 and validates on 1, rotating each time. The final score is the average across all 5 — much more reliable than a single split.

🧠 Quiz

A model achieves 99% accuracy on training data but only 62% on test data. What does this indicate?

Key Takeaways

  • Underfitting (high bias): the model is too simple; it performs poorly on both training and test data — fix by using a more complex model or adding features
  • Overfitting (high variance): the model is too complex; it performs great on training data but poorly on test data — fix by regularising, adding data, or simplifying the model
  • The bias-variance tradeoff is the fundamental tension: reducing one tends to increase the other; your goal is to find the sweet spot
  • Always split data into train / validation / test sets — the test set should only be touched once, at the end
  • Key fixes for overfitting: L1/L2 regularisation, dropout, early stopping, more data, simpler architecture
  • Cross-validation gives a more reliable performance estimate when data is limited, by averaging results across multiple train/validation splits
Lesson 12 of 16
← Understanding Large Language Models
Feature Engineering: Teaching Machines What Matters →