📊 Evaluation Metrics - Is Your AI Actually Good?

AI Sprout level • Intermediate • ⏱️ 15-minute read

You have trained a model. The loss went down. But is it actually good? The answer depends entirely on how you measure it - and choosing the wrong metric can give you dangerously misleading confidence. This lesson covers the metrics every AI practitioner must understand.

The Accuracy Trap

Accuracy = correct predictions ÷ total predictions. Sounds reasonable - until you meet class imbalance.

Imagine a fraud detection model. Out of 10,000 transactions, only 50 are fraudulent. A model that simply predicts "not fraud" for every single transaction achieves 99.5% accuracy - while catching zero fraud. Utterly useless, yet the accuracy looks brilliant.

This is why accuracy alone is never enough for real-world AI.
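The fraud example above takes only a few lines of plain Python to reproduce. The counts come straight from the text; the "model" is the degenerate one that predicts "not fraud" for everything:

```python
# Fraud detection: 10,000 transactions, of which 50 are fraudulent.
# The "model" predicts "not fraud" for every single transaction.
total = 10_000
fraud = 50

true_negatives = total - fraud   # every legitimate transaction, correctly ignored
true_positives = 0               # zero fraud caught

accuracy = (true_positives + true_negatives) / total
recall = true_positives / fraud

print(f"Accuracy: {accuracy:.1%}")  # 99.5% - looks brilliant
print(f"Recall:   {recall:.1%}")    # 0.0%  - catches no fraud at all
```

The accuracy number is genuinely 99.5%, yet the model has learned nothing - which is exactly why the next sections build better metrics from the confusion matrix.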

🤯

In medical screening for rare diseases, a model that always predicts "healthy" can exceed 99.9% accuracy. This is why doctors and data scientists rely on sensitivity (recall) as the primary metric for screening tests.

The Confusion Matrix

Before diving into better metrics, we need the confusion matrix - a 2×2 table that breaks down every prediction:

|                   | Predicted Positive  | Predicted Negative  |
|-------------------|---------------------|---------------------|
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |

Concrete example - email spam filter on 1,000 emails (100 spam, 900 legitimate):

|                     | Predicted Spam | Predicted Legitimate |
|---------------------|----------------|----------------------|
| Actually Spam       | 80 (TP)        | 20 (FN)              |
| Actually Legitimate | 30 (FP)        | 870 (TN)             |

From this single table, we can derive every classification metric.

[Figure: A confusion matrix for a spam filter with the four quadrants colour-coded - green for TP and TN, red for FP and FN - with the precision and recall formulas alongside.]
The confusion matrix is the foundation of all classification metrics.
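Counting the four cells is straightforward. Here is a minimal sketch in pure Python (libraries such as scikit-learn provide this out of the box); the toy labels below are illustrative, not the 1,000-email example:

```python
def confusion_matrix(actual, predicted, positive="spam"):
    """Count TP, FP, FN, TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Tiny illustrative sample
actual    = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```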

Precision - "Of Everything I Flagged, How Much Was Correct?"

Precision = TP ÷ (TP + FP) = 80 ÷ (80 + 30) = 72.7%

In our spam filter: of all emails marked as spam, 72.7% actually were spam. The other 27.3% were legitimate emails incorrectly caught - false positives.

When precision matters most: When false positives are expensive. A spam filter that sends important client emails to junk is a serious problem.

Recall - "Of Everything That Was Positive, How Much Did I Find?"

Recall = TP ÷ (TP + FN) = 80 ÷ (80 + 20) = 80%

The model caught 80 of the 100 actual spam emails - it recalled 80% of them. The other 20 slipped through as false negatives.

When recall matters most: When missing a positive case is dangerous. In cancer screening, failing to detect a tumour (false negative) could cost a life.
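Both formulas fall directly out of the spam-filter table above:

```python
tp, fp, fn = 80, 30, 20  # counts from the spam-filter confusion matrix

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam?
recall    = tp / (tp + fn)  # of all actual spam, how much did we catch?

print(f"Precision: {precision:.1%}")  # 72.7%
print(f"Recall:    {recall:.1%}")     # 80.0%
```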

🧠 Quiz

A hospital wants a model to screen for a dangerous disease. Which metric should they prioritise?

The Precision–Recall Trade-Off

Precision and recall pull in opposite directions. Tightening the spam threshold catches fewer legitimate emails (precision rises) but lets more spam through (recall drops). Loosening it catches more spam (recall rises) but snags more good emails (precision drops).

There is no free lunch - you must decide which errors are more costly for your specific use case.

F1 Score - The Harmonic Mean

When you need a single number balancing precision and recall, use the F1 score:

F1 = 2 × (Precision × Recall) ÷ (Precision + Recall)

For our spam filter: F1 = 2 × (0.727 × 0.80) ÷ (0.727 + 0.80) = 0.762 - about 76.2%.

The harmonic mean punishes extreme imbalances. If either precision or recall is very low, F1 drops sharply.
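The punishing effect of the harmonic mean is easy to see numerically - the content-moderation example from the question below shows a high precision being dragged down by poor recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.727, 0.80))  # ~0.762 - the spam filter above
print(f1_score(0.95, 0.30))   # ~0.456 - high precision cannot rescue low recall
```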

🤔
Think about it:

A content moderation system has 95% precision but only 30% recall. The F1 score is just 46%. What does this tell you about the system's real-world behaviour, and would you deploy it?

ROC Curves and AUC

The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate against the False Positive Rate at every possible classification threshold. It shows how well a model separates classes across all thresholds, not just one.

AUC (Area Under the Curve) summarises this into a single number:

  • AUC = 1.0 - perfect separation.
  • AUC = 0.5 - no better than random guessing (the diagonal line).
  • AUC < 0.5 - worse than random (your labels might be flipped!).

AUC is threshold-independent, making it excellent for comparing models before you have decided on a specific operating point.
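AUC has an equivalent probabilistic reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting half). A naive O(n·m) sketch of that definition, with made-up scores:

```python
def auc(scores_pos, scores_neg):
    """AUC = P(random positive outscores random negative), ties count 0.5.
    Pairwise version - fine for small data, not for production use."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

perfect = auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])  # 1.0 - perfect separation
chance  = auc([0.5, 0.5], [0.5, 0.5])            # 0.5 - no better than guessing
flipped = auc([0.1, 0.2], [0.8, 0.9])            # 0.0 - labels probably swapped
print(perfect, chance, flipped)
```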

🧠 Quiz

What does an AUC of 0.5 mean?

Metrics for Text Generation

Classification metrics do not apply to language models that generate text. Different tasks need different measures.

BLEU Score

BLEU (Bilingual Evaluation Understudy) measures how much a generated translation overlaps with reference translations, counting matching n-grams (word sequences). Scores range from 0 to 1.

BLEU is widely used in machine translation but has significant limitations: it rewards word overlap, not meaning. "The cat sat on the mat" and "A feline rested upon the rug" score poorly against each other despite similar meaning.
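A heavily simplified sketch makes the "overlap, not meaning" flaw concrete. This computes only clipped unigram precision - the first ingredient of real BLEU, which also uses 2- to 4-grams and a brevity penalty:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision - a toy stand-in for full BLEU."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

ref = "the cat sat on the mat"
print(unigram_precision("the cat sat on the mat", ref))        # 1.0
print(unigram_precision("a feline rested upon the rug", ref))  # ~0.17, despite same meaning
```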

Perplexity

Perplexity measures how surprised a language model is by new text. Lower is better - a perplexity of 20 means the model is, on average, as uncertain as if it were choosing among 20 equally likely next words. A good model has low perplexity because it predicts text well.

GPT-4 achieves remarkably low perplexity on English text, reflecting its strong language understanding.
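Formally, perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch with hypothetical token probabilities:

```python
import math

def perplexity(token_probs):
    """exp(average negative log-probability of the actual tokens). Lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95, 0.85])  # model predicts well -> low perplexity
uniform   = perplexity([0.05] * 4)              # every token a 1-in-20 guess -> 20.0
print(confident, uniform)
```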

🤯

The BLEU metric was introduced in 2002 and quickly became the standard for machine translation evaluation. Despite known flaws, it remained dominant for nearly two decades because no simple alternative consistently correlated better with human judgement.

A/B Testing in Production

Offline metrics are necessary but not sufficient. The ultimate test is A/B testing: deploy two model versions to different user groups and measure real-world outcomes.

  • Does the new recommendation model increase click-through rates?
  • Does the improved chatbot reduce support ticket escalations?
  • Does the updated spam filter receive fewer "not spam" corrections?

Production metrics often diverge from offline metrics because users behave unpredictably.
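Deciding whether an A/B difference is real or noise typically comes down to a significance test. One common choice for click-through rates is a two-proportion z-test; here is a stdlib-only sketch with entirely hypothetical numbers (a 5.0% vs 5.6% CTR experiment):

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    between variants A and B (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical experiment: new model lifts CTR from 5.0% to 5.6%
p = two_proportion_pvalue(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"p-value: {p:.3f}")
```

With 10,000 users per arm this lift is only borderline significant - a reminder that A/B tests need enough traffic before you trust the outcome.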

🧠 Quiz

Why might a model with excellent offline metrics perform poorly in A/B testing?

When to Use Which Metric

| Scenario                | Primary metric             |
|-------------------------|----------------------------|
| Balanced classification | Accuracy, F1               |
| Imbalanced classes      | Precision, Recall, AUC     |
| Medical screening       | Recall (sensitivity)       |
| Spam filtering          | Precision + Recall balance |
| Machine translation     | BLEU, METEOR               |
| Language model quality  | Perplexity                 |
| Production impact       | A/B test outcomes          |

🤔
Think about it:

You are building a self-driving car's pedestrian detection system. A false negative means failing to see a pedestrian; a false positive means braking for a shadow. Which metric do you optimise, and what trade-off are you willing to accept?

Key Takeaways

  • Accuracy is misleading with imbalanced data - always check the confusion matrix.
  • Precision measures correctness of positive predictions; recall measures completeness.
  • F1 balances both; AUC evaluates across all thresholds.
  • Text generation uses BLEU and perplexity instead of classification metrics.
  • A/B testing is the gold standard for measuring real-world model impact.

📚 Further Reading

  • Google ML Crash Course - Classification Metrics - Interactive walkthrough of precision, recall, and ROC curves
  • Towards Data Science - Beyond Accuracy - Practical guide with real-world examples of metric selection
Lesson 10 of 16

← Embeddings & Vector Databases | Understanding Large Language Models →