🏆
AI Masterpiece • Advanced ⏱️ 20 min read

Kaggle Competition Guide - How to Compete (and Learn) on Kaggle

Kaggle isn't just a competition platform - it's the world's largest applied ML classroom. Over 15 million data scientists use it to sharpen skills, build portfolios, and land jobs. This lesson teaches you how to compete effectively and extract maximum learning from every competition.

🎯 Why Kaggle Matters for Your Career

Kaggle experience signals something CVs cannot: you can ship models that work on messy, real-world data. Hiring managers at Google, Meta, and top startups actively recruit from Kaggle leaderboards. Even without winning, a strong profile with well-documented notebooks demonstrates practical competence.

[Figure: Kaggle competition workflow from data exploration to final submission]
The typical Kaggle workflow - iterate between EDA, feature engineering, and validation until convergence.

📊 Competition Types and Winning Strategies

| Type | Example | Typical Winning Approach |
|------|---------|--------------------------|
| Tabular | House prices, fraud detection | Gradient boosting (XGBoost/LightGBM) + heavy feature engineering |
| Computer Vision | Image classification, segmentation | Pre-trained CNNs (EfficientNet, ConvNeXt) + augmentation |
| NLP | Sentiment, question answering | Fine-tuned transformers (DeBERTa, RoBERTa) |
| Simulation | Game AI, optimisation | Reinforcement learning + domain heuristics |

🤯
The most popular single algorithm in Kaggle competition winners is LightGBM - it appears in over 60% of top-placing tabular solutions, often combined with CatBoost or XGBoost in ensembles.

🔍 EDA - Your First 48 Hours

Never start modelling before understanding your data. A disciplined EDA workflow:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Shape and types - know what you're working with
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum().sort_values(ascending=False).head(10)}")

# Target distribution - is it balanced?
df["target"].value_counts(normalize=True).plot(kind="bar")
plt.title("Target Distribution")
plt.show()

# Correlations - find quick signal (drop the target's self-correlation of 1.0)
correlations = df.select_dtypes(include="number").corr()["target"].drop("target").sort_values()
print(correlations.head(10))  # Strongest negative correlations
print(correlations.tail(10))  # Strongest positive correlations

Check for leakage - features that wouldn't exist at prediction time. This is the number-one mistake beginners make. If a feature correlates suspiciously well with the target, investigate before celebrating.
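This leakage check can be automated with a small helper - a sketch, where `leakage_suspects` and the 0.95 threshold are illustrative choices rather than any standard Kaggle utility:

```python
import pandas as pd

def leakage_suspects(df: pd.DataFrame, target: str = "target",
                     threshold: float = 0.95) -> pd.Series:
    """Return numeric features whose absolute correlation with the target
    exceeds the threshold - prime candidates for a leakage investigation."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target).abs()
    return corr[corr > threshold].sort_values(ascending=False)
```

Anything this flags deserves a manual check: would the feature actually exist at prediction time, or is it computed from the answer?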

🧠 Quick Check

During EDA, you discover a feature with 0.98 correlation to the target. What should your FIRST reaction be?

🛠️ Feature Engineering That Wins

Top competitors spend 70% of their time on features, not model tuning. Battle-tested techniques:

Aggregation features - group statistics at different granularities:

for col in ["category", "store_id", "day_of_week"]:
    stats = df.groupby(col)["sales"].agg(["mean", "std", "median"])
    stats.columns = [f"{col}_sales_{s}" for s in ["mean", "std", "median"]]
    # reset_index() turns the group key back into a column so merge(on=col) works
    df = df.merge(stats.reset_index(), on=col, how="left")

Target encoding - replace categories with smoothed target means (use fold-based encoding to prevent leakage).
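Fold-based target encoding can be sketched as follows - `fold_target_encode` and its smoothing constant are illustrative choices, not a library API. Each row is encoded using statistics computed only from the *other* folds, which is what prevents leakage:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def fold_target_encode(df: pd.DataFrame, col: str, target: str,
                       n_splits: int = 5, smoothing: float = 10.0) -> pd.Series:
    """Replace a categorical column with out-of-fold smoothed target means."""
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare-category means toward the global mean
        smooth = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                  / (stats["count"] + smoothing))
        encoded.iloc[val_idx] = (df.iloc[val_idx][col].map(smooth)
                                 .fillna(global_mean).values)
    return encoded
```

The smoothing term matters: without it, a category seen twice in the training fold gets an unreliable mean that the model will happily overfit.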

Lag features - for time series, previous values are gold:

for lag in [1, 7, 14, 28]:
    df[f"sales_lag_{lag}"] = df.groupby("store_id")["sales"].shift(lag)

✅ Cross-Validation - Trust Your Local Score

Your local CV score matters more than the public leaderboard. A robust validation strategy:

import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_predictions = np.zeros(len(X))

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50)])
    oof_predictions[val_idx] = model.predict_proba(X_val)[:, 1]

print(f"OOF AUC: {roc_auc_score(y, oof_predictions):.5f}")

Golden rule: if your local CV and public LB disagree, trust your CV. The public leaderboard uses only a fraction of test data - overfitting to it is a trap.

🤔 Think about it: You're 15th on the public leaderboard but your local CV suggests your model is overfit. Do you submit your best LB score or your best CV score for the final submission? Why?

🎭 Ensemble Methods - The Final Push

Almost every winning solution uses ensembles. Three key techniques:

Bagging - train the same model on different data subsets, average predictions. Reduces variance.
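Bagging is built into scikit-learn - a minimal sketch on synthetic data (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for competition features
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# 50 trees, each trained on a bootstrap sample; predictions are averaged
bag = BaggingClassifier(n_estimators=50, random_state=42)
score = cross_val_score(bag, X, y, cv=5, scoring="roc_auc").mean()
print(f"Bagged AUC: {score:.3f}")
```

Because each tree sees a different bootstrap sample, the individual errors partially cancel when averaged - that is the variance reduction.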

Stacking - train a meta-model on out-of-fold predictions from diverse base models:

# Level 1: diverse base models produce OOF predictions
base_preds = np.column_stack([lgbm_oof, xgb_oof, catboost_oof, nn_oof])

# Level 2: logistic regression learns the optimal combination
from sklearn.linear_model import LogisticRegression
meta = LogisticRegression()
meta.fit(base_preds, y)  # OOF predictions cover the full training set, so fit on the full target

Blending - weighted average of model predictions. Simpler than stacking but effective:

final = 0.4 * lgbm_pred + 0.35 * xgb_pred + 0.25 * catboost_pred
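Blend weights are often hand-tuned, but you can also search for them on out-of-fold predictions - a sketch using scipy's optimiser, where `best_blend_weights` is an illustrative helper, not a standard API:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import roc_auc_score

def best_blend_weights(oof_list, y):
    """Search for blend weights (normalised to sum to 1) that maximise OOF AUC."""
    def neg_auc(w):
        w = np.abs(w) / np.abs(w).sum()  # keep weights positive and normalised
        blend = sum(wi * p for wi, p in zip(w, oof_list))
        return -roc_auc_score(y, blend)

    n = len(oof_list)
    res = minimize(neg_auc, x0=np.ones(n) / n, method="Nelder-Mead")
    return np.abs(res.x) / np.abs(res.x).sum()
```

Always optimise weights on OOF predictions, never on public-leaderboard feedback - the latter is exactly the overfitting trap described above.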

🧠 Quick Check

Why do ensemble methods almost always outperform single models in Kaggle competitions?

📓 Writing Great Kaggle Notebooks

Public notebooks build your reputation. A medal-worthy notebook includes:

  1. Clear narrative - explain your reasoning, not just your code
  2. Reproducibility - set random seeds, document library versions
  3. Visualisations - plots that reveal insights, not just decorate
  4. Honest results - share what didn't work alongside what did
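The reproducibility point from the list above often boils down to a seed-pinning helper run at the top of the notebook - a sketch; extend it with torch/tensorflow seeding if you use those frameworks:

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Pin the common sources of randomness so the notebook reruns identically."""
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # NumPy's global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)   # hash randomisation (affects subprocesses)
    # If using deep learning frameworks, seed them here as well, e.g.:
    # torch.manual_seed(seed); tf.random.set_seed(seed)

set_seed(42)
```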
🧠 Quick Check

What is the MOST effective way to progress from Kaggle Contributor to Expert rank?

📈 Kaggle Ranks and Progression

| Rank | Requirement | Typical Timeline |
|------|-------------|------------------|
| Novice | Create an account | Day 1 |
| Contributor | Complete profile, run a notebook | Week 1 |
| Expert | 2 bronze medals (competitions) | 3–6 months |
| Master | 1 gold + 2 silver medals | 1–2 years |
| Grandmaster | 5 gold medals (1 solo) | 3–5+ years |

🤔 Think about it: Beyond rankings, how would you use your Kaggle profile as a portfolio piece when applying for ML roles? What would a hiring manager look for in your competition history?

🎯 Key Takeaways

  • EDA first, modelling second - always check for data leakage
  • Feature engineering delivers more lift than hyperparameter tuning
  • Trust your local CV over the public leaderboard
  • Ensemble diverse models for the final push
  • Document your work in public notebooks - it's your portfolio

📚 Further Reading

  • The Kaggle Book by Konrad Banachewicz & Luca Massaron - strategies from grandmasters distilled into actionable advice
  • How to Win a Data Science Competition (Coursera) - Top-down Kaggle strategy course from NRU HSE
  • Papers With Code - Benchmarks - State-of-the-art results across all ML tasks