Many machine learning algorithms behave like black boxes: you feed in data, something mathematical happens inside, and a prediction comes out. Decision trees are different. They are one of the few algorithms you can fully explain to a non-technical colleague, draw on a whiteboard, and still trust to make accurate predictions.
You've probably played 20 Questions: one person thinks of something, and others ask yes/no questions to narrow it down. "Is it alive? Is it bigger than a car? Does it live in water?" Each answer eliminates a huge swath of possibilities until the answer becomes obvious.
A decision tree works exactly like this. Given a new data point to classify, the tree asks a series of questions about its features, following the branches that match each answer, until it reaches a leaf — a final prediction.
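Before we get into how trees learn, let's name the parts:
Root node: the first question, asked of every data point.
Internal (decision) nodes: each one tests a single feature and sends the data point down one branch.
Branches: the possible answers to a node's question, each leading to the next node.
Leaves: the end points of the tree, each holding a final prediction.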
A single data point travels from root to leaf, answering one question at each node, until it reaches a prediction.
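The root-to-leaf walk can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary layout, the feature names (`weight_kg`, `lives_in_water`), and the thresholds are all made up for the example, not taken from any library.

```python
def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while "prediction" not in node:
        # Each internal node asks: is this feature's value <= the threshold?
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]


# Hypothetical hand-built tree: internal nodes ask a question, leaves hold a label.
tree = {
    "feature": "weight_kg",
    "threshold": 10.0,
    "left": {"prediction": "cat"},
    "right": {
        "feature": "lives_in_water",
        "threshold": 0.5,
        "left": {"prediction": "dog"},
        "right": {"prediction": "dolphin"},
    },
}

print(predict(tree, {"weight_kg": 4.0, "lives_in_water": 0}))    # cat
print(predict(tree, {"weight_kg": 150.0, "lives_in_water": 1}))  # dolphin
```

A prediction costs one comparison per level, so even a deep tree answers quickly.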
The clever part: how does the algorithm decide which question to ask at each node? It tries every possible split on every feature and picks the one that best separates the data.
Two common measures of "best separation":
Gini impurity measures how mixed a group is. A perfectly pure node, where all examples belong to one class, has a Gini impurity of 0. A node split evenly between two classes has an impurity of 0.5, the maximum for a two-class problem. The algorithm prefers splits that produce the purest child nodes.
Information gain is similar: it measures how much a split reduces uncertainty (entropy) about the class label. Higher information gain = better split.
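Both measures are short enough to write out by hand. The sketch below uses the standard textbook definitions (Gini as one minus the sum of squared class proportions, entropy in bits). The function names are illustrative, and the code assumes every list of labels is non-empty.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = ["a", "a", "b", "b"]
print(gini(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(gini(mixed))                 # 0.5 (evenly mixed, two classes)
print(entropy(mixed))              # 1.0 (one full bit of uncertainty)
print(information_gain(mixed, [["a", "a"], ["b", "b"]]))  # 1.0 (perfect split)
```

A split that leaves both children as mixed as the parent, such as `[["a", "b"], ["a", "b"]]`, has an information gain of 0.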
Both measures ask the same underlying question: after splitting on this feature, how much more certain am I about the class?
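To make "tries every possible split" concrete, here is a minimal sketch of an exhaustive split search for numeric features, scored by weighted Gini impurity. It assumes candidate thresholds are the midpoints between consecutive distinct values, which is a common convention but not the only one. The function names and the toy data are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) for the purest split.

    Returns None if no feature has at least two distinct values.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            # Weight each child's impurity by its share of the samples.
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Toy data: feature 0 separates the classes cleanly, feature 1 does not.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
labels = ["no", "no", "yes", "yes"]
print(best_split(rows, labels))  # (0, 4.5, 0.0)
```

The search is greedy: it picks the best question for the current node only and never revisits that choice once the children are built.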
The CART algorithm (Classification and Regression Trees), introduced in 1984 by Breiman, Friedman, Olshen, and Stone, is the foundation of most modern decision tree implementations. Some four decades on, it remains one of the most widely used ML algorithms.
Left unconstrained, a decision tree will keep growing until every leaf is pure, in the extreme giving each training example its own leaf. It then scores close to 100% accuracy on the training data but generalises poorly to new data. This is overfitting.
Imagine memorising every past exam question word-for-word instead of understanding the subject. You'd ace the past papers but fail the real exam.
Two main remedies:
Pre-pruning (early stopping) — set limits during training: maximum depth, minimum samples per leaf, minimum information gain threshold. The tree stops growing when it hits these limits.
Post-pruning — grow the full tree, then trim back branches that don't improve performance on a validation set.
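Pre-pruning is the easier of the two to show in code. The sketch below is a simplified recursive tree builder, not a full CART implementation. It stops growing when a node is pure, when `max_depth` is reached, or when every candidate split would leave a child with fewer than `min_samples_leaf` samples. The parameter names mirror common library conventions, but the code and the noisy toy data are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def build(rows, labels, depth=0, max_depth=3, min_samples_leaf=1):
    """Grow a tree recursively, stopping early when a pre-pruning limit is hit."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop if the node is already pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth >= max_depth:
        return {"prediction": majority}
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i in range(n) if rows[i][f] <= t]
            right = [i for i in range(n) if rows[i][f] > t]
            # Skip splits that would create a leaf with too few samples.
            if min(len(left), len(right)) < min_samples_leaf:
                continue
            score = sum(len(idx) / n * gini([labels[i] for i in idx])
                        for idx in (left, right))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no allowed split: this node becomes a leaf
        return {"prediction": majority}
    _, f, t, left, right = best
    return {
        "feature": f,
        "threshold": t,
        "left": build([rows[i] for i in left], [labels[i] for i in left],
                      depth + 1, max_depth, min_samples_leaf),
        "right": build([rows[i] for i in right], [labels[i] for i in right],
                       depth + 1, max_depth, min_samples_leaf),
    }


def depth_of(node):
    """Number of questions on the longest root-to-leaf path."""
    if "prediction" in node:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))


# Noisy one-feature toy data: no single threshold separates the classes.
rows = [[1], [2], [3], [4], [5], [6], [7], [8]]
labels = ["a", "a", "b", "a", "b", "b", "a", "b"]

print(depth_of(build(rows, labels, max_depth=1)))                       # 1
print(depth_of(build(rows, labels, max_depth=10, min_samples_leaf=4)))  # 1
print(depth_of(build(rows, labels, max_depth=10)))  # deeper: it chases the noise
```

Post-pruning would start from the last, fully grown tree and collapse branches whose removal does not hurt accuracy on a validation set. That step is left out here to keep the example short.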
A decision tree with depth 1 (a single question) is called a "decision stump". It's extremely simple — almost certainly underfitting. A tree of depth 100 with one sample per leaf is overfitting. How would you decide where to stop?
A single decision tree is powerful but brittle: small changes in the training data can produce very different trees. The solution: grow hundreds of trees, each trained on a bootstrap sample of the data and restricted to a random subset of features at each split, then combine their predictions by majority vote for classification or averaging for regression.
This is a Random Forest, one of the most reliable and widely used algorithms in machine learning. You'll cover it in depth in a later lesson. For now, remember: individual trees are interpretable, forests are robust.
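As a preview, the core idea fits in a short sketch: draw a bootstrap sample, fit a tree on a random subset of features, repeat, and let the trees vote. To keep it brief, each "tree" here is a depth-1 stump, a deliberate simplification, since real random forests grow much deeper trees. All function names, parameters, and data below are illustrative assumptions.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def fit_stump(rows, labels, features):
    """Fit a one-question tree using only the given feature indices."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None
    n = len(rows)
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no usable split in this sample: always predict the majority
        return lambda row: majority
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data with two well-separated classes.
rows = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = ["low"] * 4 + ["high"] * 4

forest = fit_forest(rows, labels)
print(predict_forest(forest, [1.5, 1.5]))
print(predict_forest(forest, [8.5, 8.5]))
```

An individual stump can be thrown off by an unlucky bootstrap sample, but the vote across many stumps is far harder to sway. That is the robustness the forest buys at the cost of interpretability.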
| Strengths | Weaknesses |
|---|---|
| Fully interpretable (can be visualised) | Prone to overfitting without pruning |
| No need to normalise or scale features | Small data changes = very different trees |
| Handles both numerical and categorical features | Biased towards features with more values |
| Works without feature engineering | Not great at capturing linear relationships |
| Fast to train and predict | Single trees often underperform ensembles |