📊
AI Sprouts • Intermediate ⏱️ 15 min read

Evaluation Metrics - Is Your AI Actually Good?

You have trained a model. The loss went down. But is it actually good? The answer depends entirely on how you measure it - and choosing the wrong metric can give you dangerously misleading confidence. This lesson covers the metrics every AI practitioner must understand.

The Accuracy Trap

Accuracy = correct predictions ÷ total predictions. Sounds reasonable - until you meet class imbalance.

Imagine a fraud detection model. Out of 10,000 transactions, only 50 are fraudulent. A model that simply predicts "not fraud" for every single transaction achieves 99.5% accuracy - while catching zero fraud. Utterly useless, yet the accuracy looks brilliant.

This is why accuracy alone is never enough for real-world AI.
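The fraud example above can be checked in a few lines of plain Python (the counts are the hypothetical ones from the scenario, not real data):

```python
# Accuracy of a do-nothing classifier that predicts "not fraud" every time.
# Counts follow the example above: 10,000 transactions, 50 fraudulent.
total = 10_000
fraudulent = 50

# Correct on every legitimate transaction, wrong on every fraudulent one.
correct = total - fraudulent
accuracy = correct / total

print(f"Accuracy: {accuracy:.1%}")  # 99.5% - while catching zero fraud
```

A headline number this high, from a model that does nothing, is exactly why the metrics below exist.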

🤯

In medical screening for rare diseases, a model that always predicts "healthy" can exceed 99.9% accuracy. This is why doctors and data scientists rely on sensitivity (recall) as the primary metric for screening tests.

The Confusion Matrix

Before diving into better metrics, we need the confusion matrix - a 2×2 table that breaks down every prediction:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |

Concrete example - email spam filter on 1,000 emails (100 spam, 900 legitimate):

| | Predicted Spam | Predicted Legitimate |
|---|---|---|
| Actually Spam | 80 (TP) | 20 (FN) |
| Actually Legitimate | 30 (FP) | 870 (TN) |

From this single table, we can derive every classification metric.
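Building the table yourself is a one-liner with the standard library. A minimal sketch, using six made-up emails rather than the 1,000-email table above:

```python
from collections import Counter

# Count (actual, predicted) pairs to form the four confusion-matrix cells.
# The six labels below are hypothetical, purely for illustration.
actual    = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]

matrix = Counter(zip(actual, predicted))
tp = matrix[("spam", "spam")]  # true positives:  spam caught
fn = matrix[("spam", "ham")]   # false negatives: spam missed
fp = matrix[("ham", "spam")]   # false positives: good mail flagged
tn = matrix[("ham", "ham")]    # true negatives:  good mail passed

print(tp, fn, fp, tn)  # 2 1 1 2
```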

[Figure: a confusion matrix for the spam filter with the four quadrants colour-coded - green for TP and TN, red for FP and FN - with the precision and recall formulas alongside.]
The confusion matrix is the foundation of all classification metrics.

Precision - "Of Everything I Flagged, How Much Was Correct?"

Precision = TP ÷ (TP + FP) = 80 ÷ (80 + 30) = 72.7%

In our spam filter: of all emails marked as spam, 72.7% actually were spam. The other 27.3% were legitimate emails incorrectly caught - false positives.

When precision matters most: When false positives are expensive. A spam filter that sends important client emails to junk is a serious problem.
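As a quick sanity check, the precision figure can be reproduced from the spam-filter counts in the table above:

```python
# Precision: of everything flagged as spam, how much really was spam?
tp, fp = 80, 30
precision = tp / (tp + fp)
print(f"Precision: {precision:.1%}")  # 72.7%
```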


Recall - "Of Everything That Was Positive, How Much Did I Find?"

Recall = TP ÷ (TP + FN) = 80 ÷ (80 + 20) = 80%

The model caught 80 of the 100 actual spam emails - it recalled 80% of them. The other 20 slipped through as false negatives.

When recall matters most: When missing a positive case is dangerous. In cancer screening, failing to detect a tumour (false negative) could cost a life.
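Precision and recall are usually computed together. A minimal helper, using the spam-filter counts above (the zero-denominator guard is a common convention, not part of the lesson's formulas):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall), treating an empty denominator as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall(tp=80, fp=30, fn=20)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.727 recall=0.800
```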

🧠 Quick Check

A hospital wants a model to screen for a dangerous disease. Which metric should they prioritise?

The Precision–Recall Trade-Off

Precision and recall pull in opposite directions. Tightening the spam threshold catches fewer legitimate emails (precision rises) but lets more spam through (recall drops). Loosening it catches more spam (recall rises) but snags more good emails (precision drops).

There is no free lunch - you must decide which errors are more costly for your specific use case.
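The trade-off is easy to see by sweeping the threshold over model scores. The scores and labels below are invented for illustration; any scored classifier shows the same pattern:

```python
# Hypothetical spam scores (higher = more spam-like) with true labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]  # 1 = spam

for threshold in (0.25, 0.50, 0.75):
    preds = [s >= threshold for s in scores]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold:.2f} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold pushes precision up and recall down; lowering it does the reverse.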

F1 Score - The Harmonic Mean

When you need a single number balancing precision and recall, use the F1 score:

F1 = 2 × (Precision × Recall) ÷ (Precision + Recall)

For our spam filter: F1 = 2 × (0.727 × 0.80) ÷ (0.727 + 0.80) = 0.762 - about 76.2%.

The harmonic mean punishes extreme imbalances. If either precision or recall is very low, F1 drops sharply.
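A small function makes the "punishes imbalance" point concrete, using the spam-filter numbers from above and the 95%/30% moderation example below:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f"{f1_score(0.727, 0.80):.3f}")  # ~0.762, the spam filter above
print(f"{f1_score(0.95, 0.30):.2f}")   # 0.46 - high precision can't rescue low recall
```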

🤔
Think about it:

A content moderation system has 95% precision but only 30% recall. The F1 score is just 46%. What does this tell you about the system's real-world behaviour, and would you deploy it?

ROC Curves and AUC

The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate against the False Positive Rate at every possible classification threshold. It shows how well a model separates classes across all thresholds, not just one.

AUC (Area Under the Curve) summarises this into a single number:

  • AUC = 1.0 - perfect separation.
  • AUC = 0.5 - no better than random guessing (the diagonal line).
  • AUC < 0.5 - worse than random (your labels might be flipped!).

AUC is threshold-independent, making it excellent for comparing models before you have decided on a specific operating point.
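AUC has an equivalent rank-based definition that is easy to sketch: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A brute-force version (fine for small data; the scores below are made up):

```python
def auc(scores, labels):
    """AUC as the probability a random positive outranks a random negative,
    counting ties as half. O(P*N) comparison - a sketch, not production code."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0 - perfect separation
print(auc([0.9, 0.1, 0.8, 0.2], [1, 1, 0, 0]))  # 0.5 - no better than chance
```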

🧠 Quick Check

What does an AUC of 0.5 mean?

Metrics for Text Generation

Classification metrics do not apply to language models that generate text. Different tasks need different measures.

BLEU Score

BLEU (Bilingual Evaluation Understudy) measures how much a generated translation overlaps with reference translations, counting matching n-grams (word sequences). Scores range from 0 to 1.

BLEU is widely used in machine translation but has significant limitations: it rewards word overlap, not meaning. "The cat sat on the mat" and "A feline rested upon the rug" score poorly against each other despite similar meaning.
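BLEU's core mechanism - clipped n-gram precision - can be sketched for n=1 only. This toy version omits higher-order n-grams and the brevity penalty that real BLEU uses, but it reproduces the failure mode described above:

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Toy sketch of BLEU's core idea: clipped word-overlap precision (n=1).
    Real BLEU combines n=1..4 precisions and adds a brevity penalty."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's count at its count in the reference.
    clipped = sum(min(count, ref[word]) for word, count in cand.items())
    return clipped / sum(cand.values())

ref = "the cat sat on the mat"
print(clipped_unigram_precision("the cat sat on the mat", ref))        # 1.0
print(clipped_unigram_precision("a feline rested upon the rug", ref))  # ~0.17
```

The paraphrase scores near zero despite meaning the same thing - word overlap is not meaning.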

Perplexity

Perplexity measures how surprised a language model is by new text. Lower is better - a perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely next words. A good model has low perplexity because it predicts text well.

GPT-4 achieves remarkably low perplexity on English text, reflecting its strong language understanding.
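The "20 equally likely words" intuition follows directly from the definition: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch with hypothetical per-token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always gives the true next token probability 1/20
# has perplexity exactly 20 - "choosing among 20 equally likely words".
print(perplexity([1 / 20] * 5))  # 20.0
```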

🤯

The BLEU metric was introduced in 2002 and quickly became the standard for machine translation evaluation. Despite known flaws, it remained dominant for nearly two decades because no simple alternative consistently correlated better with human judgement.

A/B Testing in Production

Offline metrics are necessary but not sufficient. The ultimate test is A/B testing: deploy two model versions to different user groups and measure real-world outcomes.

  • Does the new recommendation model increase click-through rates?
  • Does the improved chatbot reduce support ticket escalations?
  • Does the updated spam filter receive fewer "not spam" corrections?

Production metrics often diverge from offline metrics because users behave unpredictably.
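Deciding whether an A/B difference is real is a statistics question. One common first check - not the only one - is a two-proportion z-test on the conversion rates; the click counts below are hypothetical:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates,
    using a pooled standard error - a common first A/B-test check."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical click-through counts: control (A) vs. new model (B).
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}")  # z ≈ 2.38; |z| > 1.96 suggests significance at the 5% level
```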

🧠 Quick Check

Why might a model with excellent offline metrics perform poorly in A/B testing?

When to Use Which Metric

| Scenario | Primary metric |
|----------|----------------|
| Balanced classification | Accuracy, F1 |
| Imbalanced classes | Precision, Recall, AUC |
| Medical screening | Recall (sensitivity) |
| Spam filtering | Precision + Recall balance |
| Machine translation | BLEU, METEOR |
| Language model quality | Perplexity |
| Production impact | A/B test outcomes |

🤔
Think about it:

You are building a self-driving car's pedestrian detection system. A false negative means failing to see a pedestrian; a false positive means braking for a shadow. Which metric do you optimise, and what trade-off are you willing to accept?

Key Takeaways

  • Accuracy is misleading with imbalanced data - always check the confusion matrix.
  • Precision measures correctness of positive predictions; recall measures completeness.
  • F1 balances both; AUC evaluates across all thresholds.
  • Text generation uses BLEU and perplexity instead of classification metrics.
  • A/B testing is the gold standard for measuring real-world model impact.

📚 Further Reading

  • Google ML Crash Course - Classification Metrics - Interactive walkthrough of precision, recall, and ROC curves
  • Towards Data Science - Beyond Accuracy - Practical guide with real-world examples of metric selection