You have trained a model. The loss went down. But is it actually good? The answer depends entirely on how you measure it - and choosing the wrong metric can give you dangerously misleading confidence. This lesson covers the metrics every AI practitioner must understand.
Accuracy = correct predictions ÷ total predictions. Sounds reasonable - until you meet class imbalance.
Imagine a fraud detection model. Out of 10,000 transactions, only 50 are fraudulent. A model that simply predicts "not fraud" for every single transaction achieves 99.5% accuracy - while catching zero fraud. Utterly useless, yet the accuracy looks brilliant.
This is why accuracy alone is never enough for real-world AI.
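The trap is easy to reproduce. A minimal sketch of the always-"not fraud" baseline on the numbers above:

```python
# 10,000 transactions, 50 fraudulent (1 = fraud, 0 = not fraud).
labels = [1] * 50 + [0] * 9950
predictions = [0] * len(labels)   # a "model" that always predicts "not fraud"

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
frauds_caught = sum(p == y == 1 for p, y in zip(predictions, labels))

print(f"Accuracy: {accuracy:.1%}")        # 99.5%
print(f"Frauds caught: {frauds_caught}")  # 0
```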
In medical screening for rare diseases, a model that always predicts "healthy" can exceed 99.9% accuracy. This is why doctors and data scientists rely on sensitivity (recall) as the primary metric for screening tests.
Before diving into better metrics, we need the confusion matrix - a 2×2 table that breaks down every prediction:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Concrete example - email spam filter on 1,000 emails (100 spam, 900 legitimate):
| | Predicted Spam | Predicted Legitimate |
|---|---|---|
| Actually Spam | 80 (TP) | 20 (FN) |
| Actually Legitimate | 30 (FP) | 870 (TN) |
From this single table, we can derive every classification metric.
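As a sketch, the four cells can be counted directly from (actual, predicted) pairs - here reconstructed to match the spam-filter table above:

```python
# Reconstruct the spam-filter confusion matrix from (actual, predicted) pairs,
# using the counts from the table above: 80 TP, 20 FN, 30 FP, 870 TN.
pairs = ([("spam", "spam")] * 80 + [("spam", "legit")] * 20
         + [("legit", "spam")] * 30 + [("legit", "legit")] * 870)

tp = sum(actual == "spam" and pred == "spam" for actual, pred in pairs)
fn = sum(actual == "spam" and pred == "legit" for actual, pred in pairs)
fp = sum(actual == "legit" and pred == "spam" for actual, pred in pairs)
tn = sum(actual == "legit" and pred == "legit" for actual, pred in pairs)
print(tp, fn, fp, tn)  # 80 20 30 870
```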
Precision = TP ÷ (TP + FP) = 80 ÷ (80 + 30) = 72.7%
In our spam filter: of all emails marked as spam, 72.7% actually were spam. The other 27.3% were legitimate emails incorrectly caught - false positives.
When precision matters most: when false positives are expensive. A spam filter that sends important client emails to junk is a serious problem.
Recall = TP ÷ (TP + FN) = 80 ÷ (80 + 20) = 80%
The model caught 80 of the 100 actual spam emails - it recalled 80% of them. The other 20 slipped through as false negatives.
When recall matters most: when missing a positive case is dangerous. In cancer screening, failing to detect a tumour (false negative) could cost a life.
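Both formulas fall straight out of the confusion-matrix counts. A sketch using the spam-filter numbers:

```python
tp, fp, fn = 80, 30, 20   # spam-filter counts from the confusion matrix above

precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam?
recall = tp / (tp + fn)      # of all actual spam, how much did we flag?

print(f"Precision: {precision:.1%}")  # 72.7%
print(f"Recall: {recall:.1%}")        # 80.0%
```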
A hospital wants a model to screen for a dangerous disease. Which metric should they prioritise?
Precision and recall pull in opposite directions. Tightening the spam threshold catches fewer legitimate emails (precision rises) but lets more spam through (recall drops). Loosening it catches more spam (recall rises) but snags more good emails (precision drops).
There is no free lunch - you must decide which errors are more costly for your specific use case.
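The trade-off is visible by sweeping the threshold over a handful of illustrative spam scores (made-up data, not the 1,000-email example):

```python
# Illustrative (score, label) pairs: label 1 = spam, 0 = legitimate.
scores = [(0.95, 1), (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.5, 0), (0.4, 1), (0.3, 0), (0.2, 0)]

for threshold in (0.3, 0.55, 0.75):
    tp = sum(s >= threshold and y == 1 for s, y in scores)
    fp = sum(s >= threshold and y == 0 for s, y in scores)
    fn = sum(s < threshold and y == 1 for s, y in scores)
    print(f"threshold={threshold}: precision={tp / (tp + fp):.2f}, "
          f"recall={tp / (tp + fn):.2f}")
```

Raising the threshold pushes precision from 0.62 up to 1.00 while recall falls from 1.00 to 0.60.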
When you need a single number balancing precision and recall, use the F1 score:
F1 = 2 × (Precision × Recall) ÷ (Precision + Recall)
For our spam filter: F1 = 2 × (0.727 × 0.80) ÷ (0.727 + 0.80) = 0.762 - about 76.2%.
The harmonic mean punishes extreme imbalances. If either precision or recall is very low, F1 drops sharply.
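A sketch of the formula and its punishing behaviour:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.727, 0.80), 3))  # the spam filter: ~0.762
print(round(f1_score(0.95, 0.05), 3))   # high precision, terrible recall: ~0.095
```

Note how 95% precision cannot rescue a 5% recall - the harmonic mean stays close to the smaller of the two.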
A content moderation system has 95% precision but only 30% recall. The F1 score is just 46%. What does this tell you about the system's real-world behaviour, and would you deploy it?
The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate against the False Positive Rate at every possible classification threshold. It shows how well a model separates classes across all thresholds, not just one.
AUC (Area Under the Curve) summarises this into a single number:

- 1.0 - perfect separation of the two classes
- 0.5 - no better than random guessing
- Below 0.5 - worse than random (the model's rankings are inverted)
AUC is threshold-independent, making it excellent for comparing models before you have decided on a specific operating point.
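One way to see what AUC measures: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counted as half). A minimal sketch on made-up scores:

```python
# AUC as a rank statistic: P(random positive scores higher than random negative).
# The scores below are assumed purely for illustration.
pos_scores = [0.9, 0.8, 0.4]   # model scores for actual positives
neg_scores = [0.7, 0.3, 0.2]   # model scores for actual negatives

wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
auc = wins / (len(pos_scores) * len(neg_scores))
print(round(auc, 3))  # 0.889
```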
What does an AUC of 0.5 mean?
Classification metrics do not transfer directly to language models that generate free-form text. Generation tasks need their own measures.
BLEU (Bilingual Evaluation Understudy) measures how much a generated translation overlaps with reference translations, counting matching n-grams (word sequences). Scores range from 0 to 1.
BLEU is widely used in machine translation but has significant limitations: it rewards word overlap, not meaning. "The cat sat on the mat" and "A feline rested upon the rug" score poorly against each other despite similar meaning.
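Full BLEU combines clipped n-gram precisions across several n with a brevity penalty; a stripped-down sketch of the core ingredient, clipped n-gram precision, shows why paraphrases score poorly:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: fraction of candidate n-grams found in the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

# Similar meaning, almost no word overlap - only "the" matches:
print(ngram_precision("A feline rested upon the rug", "The cat sat on the mat", 1))
```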
Perplexity measures how surprised a language model is by new text. Lower is better - a perplexity of 20 means the model is, on average, choosing among 20 equally likely next words. A good model has low perplexity because it predicts text well.
GPT-4 achieves remarkably low perplexity on English text, reflecting its strong language understanding.
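Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch with hypothetical per-token probabilities:

```python
import math

# Hypothetical probabilities a model assigned to each actual next token:
token_probs = [0.2, 0.1, 0.05, 0.25]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(round(perplexity, 2))  # ~7.95: like choosing among ~8 equally likely words
```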
The BLEU metric was introduced in 2002 and quickly became the standard for machine translation evaluation. Despite known flaws, it remained dominant for nearly two decades because no simple alternative consistently correlated better with human judgement.
Offline metrics are necessary but not sufficient. The ultimate test is A/B testing: deploy two model versions to different user groups and measure real-world outcomes.
Production metrics often diverge from offline metrics because users behave unpredictably.
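One common way to judge whether an observed A/B difference is real or noise is a two-proportion z-test on the two groups' conversion rates - a sketch on made-up numbers:

```python
import math

# Made-up A/B results: conversions out of users exposed to each model version.
conv_a, n_a = 120, 2400   # version A: 5.0% conversion
conv_b, n_b = 156, 2400   # version B: 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                      # pooled rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
z = (p_b - p_a) / se
print(round(z, 2))  # |z| > 1.96 means significant at the 5% level
```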
Why might a model with excellent offline metrics perform poorly in A/B testing?
| Scenario | Primary metric |
|----------|---------------|
| Balanced classification | Accuracy, F1 |
| Imbalanced classes | Precision, Recall, AUC |
| Medical screening | Recall (sensitivity) |
| Spam filtering | Precision + Recall balance |
| Machine translation | BLEU, METEOR |
| Language model quality | Perplexity |
| Production impact | A/B test outcomes |
You are building a self-driving car's pedestrian detection system. A false negative means failing to see a pedestrian; a false positive means braking for a shadow. Which metric do you optimise, and what trade-off are you willing to accept?