Most tutorials stop at `model.fit()`. In production, that's barely 20% of the work. This lesson walks through the entire ML lifecycle - the same kind of pipeline that powers recommendation engines at Spotify, fraud detection at Stripe, and search ranking at Google.
Every production ML system flows through the same core stages: data versioning, feature engineering, experiment tracking, pipeline orchestration, serving, and monitoring.
Raw data is your foundation. Treat it like source code - version everything.
```bash
# Initialise DVC in your repo
dvc init
dvc remote add -d storage s3://my-ml-data-bucket

# Track a dataset
dvc add data/transactions.csv
git add data/transactions.csv.dvc .gitignore
git commit -m "Track raw transaction dataset v1"
dvc push
```
DVC (Data Version Control) lets you restore any historical dataset: `git checkout` the `.dvc` pointer file, then `dvc checkout` to pull the matching data. Combined with Git tags, you can reproduce any experiment from any point in time.
Features matter more than algorithms. A gradient-boosted tree with brilliant features beats a deep neural network with poor ones.
```python
import pandas as pd

df = pd.read_csv("data/transactions.csv", parse_dates=["timestamp"])

# Temporal features
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5

# Aggregation features (per-user statistics)
user_stats = df.groupby("user_id")["amount"].agg(["mean", "std", "count"])
user_stats.columns = ["user_avg_amount", "user_std_amount", "user_txn_count"]

# user_stats is indexed by user_id, so join against the right-hand index
df = df.merge(user_stats, left_on="user_id", right_index=True, how="left")

# std is NaN for users with a single transaction
df["user_std_amount"] = df["user_std_amount"].fillna(0.0)

# Ratio features
df["amount_vs_avg"] = df["amount"] / df["user_avg_amount"]
```
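The groupby-then-merge pattern is worth sanity-checking on a toy frame before running it on real data. A minimal sketch (the user IDs and amounts below are invented for illustration):

```python
import pandas as pd

# Tiny stand-in for data/transactions.csv (values invented for illustration)
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [10.0, 30.0, 50.0],
})

# Per-user aggregates, then merge back onto every transaction row
user_stats = df.groupby("user_id")["amount"].agg(["mean", "count"])
user_stats.columns = ["user_avg_amount", "user_txn_count"]
df = df.merge(user_stats.reset_index(), on="user_id", how="left")

# Ratio feature: how unusual is this amount for this user?
df["amount_vs_avg"] = df["amount"] / df["user_avg_amount"]
print(df["amount_vs_avg"].tolist())  # [0.5, 1.5, 1.0]
```

User 1 averages 20, so their 10 and 30 transactions score 0.5 and 1.5 - exactly the kind of "is this typical for this user?" signal a fraud model feeds on.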
Which feature engineering technique typically provides the MOST predictive power in tabular ML problems?
Never lose track of what you tried. MLflow logs parameters, metrics, and artefacts automatically.
```python
import mlflow
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="gbm-baseline"):
    params = {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.05}
    mlflow.log_params(params)

    # X_train, X_test, y_train, y_test prepared upstream
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    score = f1_score(y_test, model.predict(X_test))
    mlflow.log_metric("f1_score", score)
    mlflow.sklearn.log_model(model, "model")
```
Compare runs in the MLflow UI, then promote the best model to the Model Registry with staging → production lifecycle tags.
Manual steps break. Encode your pipeline in code using orchestration frameworks:
| Tool | Best For | Key Strength |
|------|----------|--------------|
| Kubeflow Pipelines | Kubernetes-native ML | Scales on existing K8s clusters |
| Apache Airflow | Complex DAG scheduling | Mature ecosystem, wide adoption |
| Prefect | Modern Python workflows | Pythonic API, excellent error handling |
| ZenML | MLOps abstraction | Stack-agnostic, plug any tool |
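Whatever tool you choose, the core idea is the same: steps are functions, dependencies are declared explicitly, and the framework decides execution order. A framework-free sketch of that shape (step names and payloads are hypothetical; real orchestrators add retries, scheduling, and distributed execution):

```python
def topo_run(steps):
    """Run steps in dependency order; steps = {name: (deps, fn)}."""
    done, results = set(), {}
    while len(done) < len(steps):
        for name, (deps, fn) in steps.items():
            if name not in done and all(d in done for d in deps):
                # All upstream outputs are ready, so run this step now
                results[name] = fn(*(results[d] for d in deps))
                done.add(name)
    return results

pipeline = {
    "ingest":   ([], lambda: [3, 1, 2]),                  # hypothetical step
    "features": (["ingest"], lambda rows: sorted(rows)),  # hypothetical step
    "train":    (["features"], lambda feats: {"model": "gbm", "n": len(feats)}),
}

print(topo_run(pipeline)["train"])  # {'model': 'gbm', 'n': 3}
```

Once your pipeline has this declared-dependency shape, porting it to Airflow tasks or Prefect flows is largely mechanical.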
Your model needs an API. Two dominant patterns:
REST API - simple, HTTP-based, great for moderate throughput:
```python
# FastAPI model server
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

@app.post("/predict")
async def predict(features: dict):
    # Assumes the client sends features in training-column order
    prediction = model.predict([list(features.values())])
    return {"prediction": int(prediction[0])}
```
gRPC - binary protocol, 2–10× faster than REST, ideal for internal microservices and high-throughput scenarios.
Never send 100% of traffic to a new model. Use canary releases: route 5% → 25% → 50% → 100%, monitoring error rates at each step. Roll back instantly if metrics degrade.
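At its core, a canary router is a weighted coin flip per request. A minimal sketch (the 5% fraction and model names are illustrative):

```python
import random

def route(canary_fraction: float, rng: random.Random) -> str:
    """Send canary_fraction of traffic to the new model, the rest to stable."""
    return "model_v2" if rng.random() < canary_fraction else "model_v1"

rng = random.Random(42)  # seeded for reproducibility
hits = sum(route(0.05, rng) == "model_v2" for _ in range(10_000))
print(hits / 10_000)  # roughly 0.05
```

In production this decision usually lives in the load balancer or service mesh (e.g. weighted routing rules) rather than application code, but the logic - and the need to compare error rates between the two cohorts before ramping up - is the same.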
What is the PRIMARY advantage of canary deployments for ML models?
Deployment is not the finish line - it's the starting gun. Monitor for data drift (input distributions shifting), concept drift (the relationship between inputs and labels changing), and serving health (latency, error rates).
Tools like Evidently AI, WhyLabs, and Arize provide drift dashboards. When drift exceeds thresholds, trigger automatic retraining with fresh production data - closing the feedback loop.
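One drift score these dashboards commonly report is the Population Stability Index (PSI). A hand-rolled sketch for a single numeric feature (the bin count and the 0.2 alert threshold are conventional choices, not mandated, and the sample data is invented):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / bins

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(max(int((x - lo) / step), 0), bins - 1)  # clamp to edge bins
            counts[i] += 1
        # Small floor avoids log(0) on empty bins
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(100)]        # training-time distribution
shifted = [0.5 + i / 200 for i in range(100)]    # production sample, shifted up
print(psi(reference, reference) < 0.1)  # True: a sample matches itself
print(psi(reference, shifted) > 0.2)    # True: clear drift, would trigger an alert
```

A PSI above roughly 0.2 is the usual "investigate or retrain" signal; note that PSI only catches *input* drift - concept drift, where the inputs look unchanged but their relationship to the label moves, needs label-based monitoring.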
A fraud detection model's precision drops from 95% to 78% over three months, but the input feature distributions haven't changed. What type of drift is this?