You've made it through the coding rounds. Now the interviewer slides a whiteboard marker across the table and says, "Design a recommendation system for 50 million users." Your palms go damp — not because you don't know ML, but because you've never practised thinking out loud about entire systems. This lesson gives you a repeatable framework that turns that open-ended question into a structured conversation.
Traditional software system design focuses on data flow, storage, and scalability. ML system design adds an extra dimension: the model is a living component that degrades over time, depends on data quality, and requires continuous evaluation.
Interviewers aren't looking for a perfect architecture diagram. They want to see that you can frame an ambiguous problem, reason about data, and weigh trade-offs out loud.
The biggest mistake candidates make is jumping straight to model selection. Interviewers consistently report that problem framing and data discussion are where candidates differentiate themselves.
Use this framework as your skeleton for every design question. You don't need to spend equal time on every step — adapt based on the question — but touching each one signals maturity.
Start by clarifying the goal. Ask: What does success look like for the business?
Map the business metric to an ML-friendly objective early. For example, "increase user engagement" might translate to "predict probability of click on each item and rank by score."
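The translation above can be sketched in a few lines. This is a hypothetical illustration: `score_click` stands in for any trained classifier's predicted click probability, and the toy scorer at the bottom exists only so the example runs.

```python
def rank_items(user, items, score_click):
    """Rank candidate items by predicted click probability, descending.

    This is the "predict probability of click and rank by score" framing:
    the business metric (engagement) becomes an ML objective (P(click)).
    """
    scored = [(item, score_click(user, item)) for item in items]
    return [item for item, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

# Toy stand-in for a real model: pretend shorter titles get more clicks.
toy_scorer = lambda user, item: 1.0 / (1 + len(item))
print(rank_items("u1", ["a long title", "short", "mid title"], toy_scorer))
# → ['short', 'mid title', 'a long title']
```

In an interview you would name the real scorer (logistic regression, GBDT, a two-tower model) only after the framing and data discussion, per the framework above.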
Discuss where data comes from, how it's collected, and what problems you anticipate, such as label noise, class imbalance, delayed feedback, and privacy constraints.
Describe the features you'd extract. Group them logically:
| Feature Group | Examples |
|---|---|
| User features | age bucket, tenure, historical click rate |
| Item features | category, price range, popularity score |
| Context features | time of day, device type, location |
| Interaction features | user×item co-occurrence, session depth |
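To make the grouping concrete, here is a minimal sketch that assembles one feature row from the four groups in the table. Every field name here is illustrative, not a fixed schema.

```python
from datetime import datetime

def build_features(user, item, context):
    """Assemble one feature row from user, item, context, and interaction groups."""
    return {
        # User features
        "user_age_bucket": user["age"] // 10,
        "user_ctr": user["clicks"] / max(user["impressions"], 1),
        # Item features
        "item_category": item["category"],
        "item_popularity": item["views"],
        # Context features
        "hour_of_day": context["ts"].hour,
        "device": context["device"],
        # Interaction feature: has this user clicked this category before?
        "seen_category": int(item["category"] in user["clicked_categories"]),
    }

row = build_features(
    {"age": 34, "clicks": 12, "impressions": 200, "clicked_categories": {"books"}},
    {"category": "books", "views": 5400},
    {"ts": datetime(2024, 1, 1, 20, 15), "device": "mobile"},
)
```

Walking through a row like this in the interview shows you understand that features are derived, joined data, not raw columns.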
Now — and only now — discuss models. Justify your choice against the constraints you've already established: start with a simple baseline, then explain what would push you towards something more complex.
Cover how you'd train and validate: chronological splits that respect time, checks for label leakage, and a retraining cadence.
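The chronological-split point is worth making precise, because random splits on time-ordered data leak future information into training. A minimal sketch:

```python
def time_split(events, train_frac=0.8):
    """Chronological split: train on the past, validate on the future.

    Sorting by timestamp before cutting guarantees no event in the
    validation set precedes any event in the training set.
    """
    events = sorted(events, key=lambda e: e["ts"])
    cut = int(len(events) * train_frac)
    return events[:cut], events[cut:]

# Ten toy events with timestamps 0..9.
events = [{"ts": t, "label": t % 2} for t in range(10)]
train, val = time_split(events)
# train covers ts 0..7, val covers ts 8..9
```

Mentioning this split explicitly, rather than defaulting to `train_test_split` with shuffling, is an easy way to signal production experience.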
Pick metrics that align with the business goal. Precision@K for ranking, AUC-ROC for binary classification, NDCG for ordered lists.
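Two of those metrics are simple enough to define from scratch on the whiteboard. Here's a sketch for binary relevance labels (1 = relevant, 0 = not):

```python
import math

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results that are relevant (binary labels)."""
    return sum(ranked_relevance[:k]) / k

def ndcg_at_k(ranked_relevance, k):
    """Normalised discounted cumulative gain: rewards putting relevant
    items near the top, normalised by the best possible ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

rels = [1, 0, 1, 1, 0]  # relevance of results, in ranked order
```

Being able to explain *why* NDCG discounts by position (a relevant item at rank 5 helps the user less than one at rank 1) matters more than reciting the formula.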
Discuss how the model reaches users: batch precomputation, real-time inference behind a service, or a hybrid of the two.
Explain what you'd watch after launch: data drift, prediction distribution shifts, and online metrics via A/B tests.
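One concrete drift check you can name is the population stability index (PSI), which compares a feature's live distribution against its training baseline. A minimal from-scratch sketch (the 0.2 alert threshold is a common rule of thumb, not a universal standard):

```python
import math

def psi(expected, actual, bins=10):
    """Population stability index between a baseline sample and a live one.

    Both samples are binned on the baseline's range; values above ~0.2
    are commonly treated as drift worth investigating.
    """
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / step), bins - 1)
            counts[max(i, 0)] += 1
        # Floor at a tiny value so the log is always defined.
        return [max(c / len(xs), 1e-6) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]  # simulated upstream change
```

In practice you'd run a check like this per feature on a schedule and page someone when it fires, alongside the A/B-tested online metrics.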
Imagine you're designing a fraud detection system. The business says "catch all fraud." Why is 100% recall a dangerous target, and how would you frame the conversation around acceptable trade-offs?
In an ML system design interview, what should you do FIRST when given a design prompt?
Interviewers love trade-off discussions because they reveal depth of experience. Here are the trade-offs that come up most:
| Trade-Off | When to Favour Left | When to Favour Right |
|---|---|---|
| Latency vs Accuracy | Real-time user-facing (search) | Batch offline (email recs) |
| Simple vs Complex model | Small data, need interpretability | Large data, accuracy is critical |
| Batch vs Real-time serving | Predictions don't change quickly | Predictions must reflect latest context |
| Build vs Buy | Core differentiator for the business | Commodity capability (e.g., OCR) |
When you discuss a trade-off, use this pattern:
"We could go with option A which gives us [benefit], but the downside is [cost]. Alternatively, option B [benefit], though it introduces [cost]. Given [specific constraint from the problem], I'd lean towards option A because..."
Netflix estimates that its recommendation system saves the company over $1 billion per year in reduced churn. That single ML system's value exceeds the GDP of some small countries.
Prompt: "Design a content moderation system for a social media platform."
Why is a hybrid serving approach (batch candidate generation + real-time ranking) common in recommendation systems?
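To reason about that question, it helps to see the shape of a hybrid pipeline. This is a hypothetical sketch: a nightly batch job precomputes a shortlist of candidates per user (cheap to store, stale by hours), and a lightweight ranker re-scores them at request time with fresh context.

```python
# Produced offline, e.g. by an expensive candidate-generation model;
# refreshed on a batch schedule.
BATCH_CANDIDATES = {
    "u1": ["item_a", "item_b", "item_c"],
}

def realtime_rank(user_id, context, score):
    """Re-rank the precomputed candidates with a request-time scorer.

    The heavy work (scanning the full catalogue) happened offline;
    only a handful of candidates are scored per request.
    """
    candidates = BATCH_CANDIDATES.get(user_id, [])
    return sorted(candidates, key=lambda item: score(item, context), reverse=True)

# Toy context-aware scorer: boost item_b for mobile sessions.
score = lambda item, ctx: 2.0 if (item == "item_b" and ctx["device"] == "mobile") else 1.0
print(realtime_rank("u1", {"device": "mobile"}, score))
# → ['item_b', 'item_a', 'item_c']
```

The split buys you both scale (batch handles the expensive search over millions of items) and freshness (the ranker sees the current session), which is exactly the trade-off the question is probing.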
The best ML system design answers feel like a conversation, not a lecture. Pause to check in with the interviewer, ask clarifying questions, and be willing to pivot when they nudge you in a different direction.