You have trained a model. The loss went down. But is it actually good? The answer depends entirely on how you measure it - and choosing the wrong metric can give you dangerously misleading confidence. This lesson covers the metrics every AI practitioner must understand.
Accuracy = correct predictions ÷ total predictions. Sounds reasonable - until you meet class imbalance.
Imagine a fraud detection model. Out of 10,000 transactions, only 50 are fraudulent. A model that simply predicts "not fraud" for every single transaction achieves 99.5% accuracy - while catching zero fraud. Utterly useless, yet the accuracy looks brilliant.
This is why accuracy alone is never enough for real-world AI.
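The trap is easy to reproduce. A minimal sketch of the always-"not fraud" baseline on the numbers above:

```python
# 10,000 transactions, 50 fraudulent (1 = fraud, 0 = not fraud).
labels = [1] * 50 + [0] * 9950
predictions = [0] * len(labels)   # a "model" that always predicts "not fraud"

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
frauds_caught = sum(p == y == 1 for p, y in zip(predictions, labels))

print(f"Accuracy: {accuracy:.1%}")        # 99.5%
print(f"Frauds caught: {frauds_caught}")  # 0
```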
In medical screening for rare diseases, a model that always predicts "healthy" can exceed 99.9% accuracy. This is why doctors and data scientists rely on sensitivity (recall) as the primary metric for screening tests.
Before diving into better metrics, we need the confusion matrix - a 2×2 table that breaks down every prediction:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Concrete example - email spam filter on 1,000 emails (100 spam, 900 legitimate):
| | Predicted Spam | Predicted Legitimate |
|---|---|---|
| Actually Spam | 80 (TP) | 20 (FN) |
| Actually Legitimate | 30 (FP) | 870 (TN) |
From this single table, we can derive every classification metric.
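As a sketch, the four cells can be counted directly from (actual, predicted) pairs - here reconstructed to match the spam-filter table above:

```python
# Reconstruct the spam-filter confusion matrix from (actual, predicted) pairs,
# using the counts from the table above: 80 TP, 20 FN, 30 FP, 870 TN.
pairs = ([("spam", "spam")] * 80 + [("spam", "legit")] * 20
         + [("legit", "spam")] * 30 + [("legit", "legit")] * 870)

tp = sum(actual == "spam" and pred == "spam" for actual, pred in pairs)
fn = sum(actual == "spam" and pred == "legit" for actual, pred in pairs)
fp = sum(actual == "legit" and pred == "spam" for actual, pred in pairs)
tn = sum(actual == "legit" and pred == "legit" for actual, pred in pairs)
print(tp, fn, fp, tn)  # 80 20 30 870
```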
Precision = TP ÷ (TP + FP) = 80 ÷ (80 + 30) = 72.7%
In our spam filter: of all emails marked as spam, 72.7% actually were spam. The other 27.3% were legitimate emails incorrectly caught - false positives.
When precision matters most: when false positives are expensive. A spam filter that sends important client emails to junk is a serious problem.
Recall = TP ÷ (TP + FN) = 80 ÷ (80 + 20) = 80%
The model caught 80 of the 100 actual spam emails - it recalled 80% of them. The other 20 slipped through as false negatives.
When recall matters most: when missing a positive case is dangerous. In cancer screening, failing to detect a tumour (false negative) could cost a life.
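Both formulas fall straight out of the confusion-matrix counts. A sketch using the spam-filter numbers:

```python
tp, fp, fn = 80, 30, 20   # spam-filter counts from the confusion matrix above

precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam?
recall = tp / (tp + fn)      # of all actual spam, how much did we flag?

print(f"Precision: {precision:.1%}")  # 72.7%
print(f"Recall: {recall:.1%}")        # 80.0%
```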
A hospital wants a model to screen for a dangerous disease. Which metric should they prioritise?
Precision and recall pull in opposite directions. Tightening the spam threshold catches fewer legitimate emails (precision rises) but lets more spam through (recall drops). Loosening it catches more spam (recall rises) but snags more good emails (precision drops).
There is no free lunch - you must decide which errors are more costly for your specific use case.
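The trade-off is visible by sweeping the threshold over a handful of illustrative spam scores (made-up data, not the 1,000-email example):

```python
# Illustrative (score, label) pairs: label 1 = spam, 0 = legitimate.
scores = [(0.95, 1), (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.5, 0), (0.4, 1), (0.3, 0), (0.2, 0)]

for threshold in (0.3, 0.55, 0.75):
    tp = sum(s >= threshold and y == 1 for s, y in scores)
    fp = sum(s >= threshold and y == 0 for s, y in scores)
    fn = sum(s < threshold and y == 1 for s, y in scores)
    print(f"threshold={threshold}: precision={tp / (tp + fp):.2f}, "
          f"recall={tp / (tp + fn):.2f}")
```

Raising the threshold pushes precision from 0.62 up to 1.00 while recall falls from 1.00 to 0.60.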
When you need a single number balancing precision and recall, use the F1 score:
F1 = 2 × (Precision × Recall) ÷ (Precision + Recall)
For our spam filter: F1 = 2 × (0.727 × 0.80) ÷ (0.727 + 0.80) = 0.762 - about 76.2%.
The harmonic mean punishes extreme imbalances. If either precision or recall is very low, F1 drops sharply.
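A sketch of the formula and its punishing behaviour:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.727, 0.80), 3))  # the spam filter: ~0.762
print(round(f1_score(0.95, 0.05), 3))   # high precision, terrible recall: ~0.095
```

Note how 95% precision cannot rescue a 5% recall - the harmonic mean stays close to the smaller of the two.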
A content moderation system has 95% precision but only 30% recall. The F1 score is just 46%. What does this tell you about the system's real-world behaviour, and would you deploy it?
The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate against the False Positive Rate at every possible classification threshold. It shows how well a model separates classes across all thresholds, not just one.
AUC (Area Under the Curve) summarises this into a single number:

- 1.0 - perfect separation of the two classes
- 0.5 - no better than random guessing
- Below 0.5 - worse than random (the model's rankings are inverted)
AUC is threshold-independent, making it excellent for comparing models before you have decided on a specific operating point.
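One way to see what AUC measures: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counted as half). A minimal sketch on made-up scores:

```python
# AUC as a rank statistic: P(random positive scores higher than random negative).
# The scores below are assumed purely for illustration.
pos_scores = [0.9, 0.8, 0.4]   # model scores for actual positives
neg_scores = [0.7, 0.3, 0.2]   # model scores for actual negatives

wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
auc = wins / (len(pos_scores) * len(neg_scores))
print(round(auc, 3))  # 0.889
```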
What does an AUC of 0.5 mean?
Classification metrics do not transfer directly to language models that generate free-form text. Generation tasks need their own measures.
BLEU (Bilingual Evaluation Understudy) measures how much a generated translation overlaps with reference translations, counting matching n-grams (word sequences). Scores range from 0 to 1.
BLEU is widely used in machine translation but has significant limitations: it rewards word overlap, not meaning. "The cat sat on the mat" and "A feline rested upon the rug" score poorly against each other despite similar meaning.
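Full BLEU combines clipped n-gram precisions across several n with a brevity penalty; a stripped-down sketch of the core ingredient, clipped n-gram precision, shows why paraphrases score poorly:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: fraction of candidate n-grams found in the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

# Similar meaning, almost no word overlap - only "the" matches:
print(ngram_precision("A feline rested upon the rug", "The cat sat on the mat", 1))
```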
Perplexity measures how surprised a language model is by new text. Lower is better - a perplexity of 20 means the model is, on average, choosing among 20 equally likely next words. A good model has low perplexity because it predicts text well.
GPT-4 achieves remarkably low perplexity on English text, reflecting its strong language understanding.
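Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch with hypothetical per-token probabilities:

```python
import math

# Hypothetical probabilities a model assigned to each actual next token:
token_probs = [0.2, 0.1, 0.05, 0.25]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(round(perplexity, 2))  # ~7.95: like choosing among ~8 equally likely words
```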
The BLEU metric was introduced in 2002 and quickly became the standard for machine translation evaluation. Despite known flaws, it remained dominant for nearly two decades because no simple alternative consistently correlated better with human judgement.
Offline metrics are necessary but not sufficient. The ultimate test is A/B testing: deploy two model versions to different user groups and measure real-world outcomes.
Production metrics often diverge from offline metrics because users behave unpredictably.
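One common way to judge whether an observed A/B difference is real or noise is a two-proportion z-test on the two groups' conversion rates - a sketch on made-up numbers:

```python
import math

# Made-up A/B results: conversions out of users exposed to each model version.
conv_a, n_a = 120, 2400   # version A: 5.0% conversion
conv_b, n_b = 156, 2400   # version B: 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                      # pooled rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
z = (p_b - p_a) / se
print(round(z, 2))  # |z| > 1.96 means significant at the 5% level
```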
Why might a model with excellent offline metrics perform poorly in A/B testing?
| Scenario | Primary metric |
|----------|---------------|
| Balanced classification | Accuracy, F1 |
| Imbalanced classes | Precision, Recall, AUC |
| Medical screening | Recall (sensitivity) |
| Spam filtering | Precision + Recall balance |
| Machine translation | BLEU, METEOR |
| Language model quality | Perplexity |
| Production impact | A/B test outcomes |
You are building a self-driving car's pedestrian detection system. A false negative means failing to see a pedestrian; a false positive means braking for a shadow. Which metric do you optimise, and what trade-off are you willing to accept?