You've trained your first machine learning model. It performs brilliantly on your training data — 98% accuracy! You test it on new data and it falls apart: 61% accuracy. What went wrong?
Almost certainly, your model has overfit. This is one of the two most common failure modes in machine learning, and understanding it — alongside its opposite, underfitting — is essential to building models that actually work in the real world.
Before diving into examples, it helps to understand the theoretical framework behind these concepts: the bias-variance tradeoff.
Every model makes prediction errors. Those errors can be decomposed into three parts:
Total Error = Bias² + Variance + Irreducible Noise
Bias is the error from wrong assumptions in the model. A high-bias model is too simple — it systematically misses the true pattern in the data.
Variance is the error from sensitivity to small fluctuations in the training data. A high-variance model is too complex — it memorises the training data, including its noise, rather than learning the underlying pattern.
Irreducible noise is the natural randomness in the data that no model can eliminate.
The tradeoff: reducing bias tends to increase variance, and vice versa. Your job as a machine learning practitioner is to find the sweet spot.
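To make the tradeoff concrete, here is a small sketch on synthetic data — the sine-shaped target, noise level, and choice of degrees are illustrative assumptions, not a recipe. Sweeping model complexity (polynomial degree) shows the pattern: too simple underfits everywhere, too complex aces training but slips on validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # true pattern + noise

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # R² on training data vs. held-out validation data
    results[degree] = (model.score(X_train, y_train),
                       model.score(X_val, y_val))
    print(degree, results[degree])
```

Degree 1 scores poorly on both sets (high bias); a moderate degree does well on both (the sweet spot); degree 15 pushes the training score up while the validation score stops following (high variance).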
Underfitting occurs when your model is too simple to capture the true pattern in the data. It performs poorly on both training data and new data.
Imagine you have data showing house prices based on size. The true relationship is roughly a gentle curve — prices rise with size, but with some plateauing at the top end.
If you fit a straight horizontal line to this data:
# Underfitting: overly simple model
from sklearn.dummy import DummyRegressor
# True relationship is a gentle curve, but we're fitting a constant (the mean)
model = DummyRegressor(strategy='mean')
model.fit(X_train, y_train)
# Training score: poor
# Test score: just as poor
# Both are bad — classic underfitting
The model ignores the actual relationship between house size and price. It doesn't matter whether you show it training data or new data — it's wrong either way.
Overfitting occurs when your model learns the training data too well — including its noise and random variation — and fails to generalise to new examples.
Using a 15-degree polynomial to fit the same house price data:
# Overfitting: overly complex model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Degree-15 polynomial — wildly complex for this problem
model = make_pipeline(PolynomialFeatures(15), LinearRegression())
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train) # 0.99 — looks amazing!
test_score = model.score(X_test, y_test) # 0.43 — terrible on new data
The polynomial has twisted itself into knots to pass through every training point — including the noisy outliers. It has memorised the training set rather than learning the underlying pattern. On unseen data, it's useless.
A fundamental tool for detecting overfitting is splitting your data into three sets:
from sklearn.model_selection import train_test_split
# First split: hold out 20% as the final test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: hold out 20% of remaining as validation set
X_train, X_val, y_train, y_val = train_test_split(
X_train_val, y_train_val, test_size=0.25, random_state=42
)
# Result: 60% train / 20% validation / 20% test
The validation set is your early warning system. If training accuracy keeps improving but validation accuracy plateaus or drops, you're overfitting.
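As a minimal sketch of that early warning system — the dataset, the unpruned decision tree, and the 0.1 gap threshold are all illustrative assumptions — you can simply compare training and validation accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0)  # unpruned tree: prone to overfit
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
if train_acc - val_acc > 0.1:  # arbitrary illustrative threshold
    print(f"Likely overfitting: train={train_acc:.2f}, val={val_acc:.2f}")
```

An unpruned tree memorises the training set perfectly, so a large train/validation gap shows up immediately.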
Regularisation adds a penalty to the loss function that discourages the model from learning overly complex patterns.
L2 Regularisation (Ridge) penalises large weights:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # alpha controls regularisation strength
model.fit(X_train, y_train)
L1 Regularisation (Lasso) can drive some weights all the way to zero, performing feature selection:
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
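To see the feature-selection effect, here is a sketch on synthetic data where only a few features actually matter — the dataset shape and alpha value are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 3 actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

model = Lasso(alpha=1.0)
model.fit(X, y)

n_zero = int(np.sum(model.coef_ == 0))
print(f"{n_zero} of 20 coefficients driven exactly to zero")
```

The L1 penalty zeroes out most of the uninformative coefficients, effectively discarding those features; Ridge, by contrast, only shrinks weights towards zero without eliminating them.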
Dropout takes a different approach. During training, randomly "drop out" (set to zero) a proportion of neurons:
import torch.nn as nn
model = nn.Sequential(
nn.Linear(128, 64),
nn.ReLU(),
nn.Dropout(p=0.5), # 50% of neurons randomly deactivated during training
nn.Linear(64, 1)
)
This prevents neurons from co-adapting and forces the network to learn more robust, distributed representations.
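One detail worth seeing directly: dropout is only active in training mode. A quick sketch (the layer size mirrors the snippet above; the exact surviving values depend on the random seed):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()            # training mode: values zeroed at random,
out_train = drop(x)     # survivors scaled by 1/(1-p) = 2.0
drop.eval()             # evaluation mode: dropout is a no-op
out_eval = drop(x)      # identical to x

print(out_train)
print(out_eval)
```

This is why calling `model.eval()` before inference matters: forgetting it leaves dropout switched on and makes predictions noisy.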
Early stopping does exactly what the validation set suggests: monitor validation loss during training and stop when it starts to increase:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(
monitor='val_loss',
patience=5, # stop after 5 epochs without improvement
restore_best_weights=True
)
model.fit(X_train, y_train,
validation_data=(X_val, y_val),
callbacks=[early_stop],
epochs=1000)
More data makes it harder for the model to memorise noise — there's simply too much to fit exactly. When data collection is expensive, data augmentation (creating modified copies of existing examples) can help.
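For tabular data, one simple augmentation strategy is to add jittered copies of each example. This is a hedged sketch — the helper name and the noise scale are assumptions, and in practice the scale should match the measurement noise of your real data:

```python
import numpy as np

def augment_with_noise(X, y, n_copies=3, scale=0.05, seed=0):
    """Append n_copies noisy duplicates of X, keeping labels unchanged."""
    rng = np.random.default_rng(seed)
    X_aug = [X] + [X + rng.normal(scale=scale, size=X.shape)
                   for _ in range(n_copies)]
    y_aug = [y] * (n_copies + 1)
    return np.vstack(X_aug), np.concatenate(y_aug)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([10.0, 20.0])
X_big, y_big = augment_with_noise(X, y)
print(X_big.shape, y_big.shape)  # (8, 2) (8,)
```

The idea generalises: for images, the modified copies come from flips, crops, and rotations instead of additive noise.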
Sometimes the right fix is simply choosing a less complex model for the problem.
With small datasets, a single train/val split might be misleading due to randomness. K-fold cross-validation gives a more reliable estimate:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
# Mean accuracy: 0.847 ± 0.023
The data is split into 5 folds; the model trains on 4 and validates on 1, rotating each time. The final score is the average across all 5 — much more reliable than a single split.
A model achieves 99% accuracy on training data but only 62% on test data. What does this indicate?