Every algorithm you've studied so far has been supervised: you feed it labelled examples and it learns to predict labels for new ones. But what happens when you have a mountain of data and no labels at all?
This is where unsupervised learning comes in — and clustering is one of its most powerful tools.
Clustering is the task of grouping data points together so that points in the same group (cluster) are more similar to each other than to points in other groups.
Crucially, nobody tells the algorithm how many groups there are or what they represent. It finds the structure on its own.
Think of it like sorting a pile of mixed sweets with your eyes closed, just using touch. You'd group them by shape and texture: round ones together, long ones together, chewy ones apart from hard ones. No one defines the categories in advance.
Clustering is used by astronomers to group galaxies by shape and composition, by geneticists to identify disease subtypes, and by Spotify to generate personalised playlists — all without anyone labelling the data by hand.
K-Means is simple, fast, and surprisingly powerful. Here's the full algorithm in plain English:

1. Pick K, the number of clusters you want to find.
2. Place K centroids at random positions among the data.
3. Assign every data point to its nearest centroid.
4. Move each centroid to the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the assignments stop changing.
The analogy: imagine placing K magnets on a map of customer addresses. Each customer is attracted to the nearest magnet. Then you move each magnet to the centre of its customers. Repeat until the magnets stop moving.
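The magnet loop can be sketched in a few lines of NumPy. This is a minimal from-scratch version for illustration only (the function name, seed, and toy data are invented here; in practice you'd reach for scikit-learn's `KMeans`):

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal K-Means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Place the K "magnets" on K randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point is attracted to its nearest magnet.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each magnet to the centre of its points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the magnets have stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data: two obvious blobs, around (0, 0) and (10, 10).
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
centroids, labels = k_means(data, k=2)
```

On well-separated blobs like these, the loop converges in a handful of iterations and each blob ends up in its own cluster.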
K-Means depends on the initial random placement of centroids. Two runs on the same data can produce different clusters. How would you decide which result is better? What would you even measure?
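One common answer is the within-cluster sum of squared distances, often called inertia: run the algorithm several times with different random starts and keep the result with the lowest value. A minimal sketch of that measurement, assuming NumPy (the function and toy numbers are invented for illustration):

```python
import numpy as np

def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances: the usual yardstick for
    comparing two K-Means runs on the same data (lower is better)."""
    return sum(
        np.sum((points[labels == j] - c) ** 2)
        for j, c in enumerate(centroids)
    )

# Tiny worked example: every point sits exactly on its centroid,
# so the inertia is zero, the best possible score.
points = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
print(inertia(points, centroids, labels))  # → 0.0
```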
A practical problem: how do you choose K? If you set K equal to the number of data points, every point is its own cluster — perfect but useless. If K = 1, everything is one big blob — also useless.
The elbow method helps: run K-Means for K = 1, 2, 3, … N and plot how much "error" (within-cluster variance) decreases as K increases. More clusters always reduce error, but there's usually a K where the improvement starts to slow dramatically — the "elbow" in the curve. That's a good candidate for the right K.
It's not a hard rule, but it gives you a principled starting point.
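Here's a sketch of the elbow computation, using a bare-bones K-Means written with NumPy purely for illustration (real code would typically use scikit-learn's `KMeans` and read its `inertia_` attribute; the blob positions and seeds are invented):

```python
import numpy as np

def kmeans_inertia(points, k, n_iters=50, seed=0):
    """Run a basic K-Means and return the within-cluster variance."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return float(np.sum((points - centroids[labels]) ** 2))

# Three well-separated blobs: the error curve should bend sharply at K = 3.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 8, 16)])
errors = {k: kmeans_inertia(data, k) for k in range(1, 7)}
for k, e in errors.items():
    print(k, round(e, 1))
```

Plotting `errors` against K gives the elbow curve: a steep drop up to K = 3, then only marginal improvement beyond it.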
K-Means requires you to fix K upfront. Hierarchical clustering does not.
Instead, it builds a tree of clusters (called a dendrogram) by either:

- Agglomerative (bottom-up): start with each point in its own cluster, then repeatedly merge the two closest clusters until only one remains.
- Divisive (top-down): start with every point in a single cluster, then repeatedly split clusters until each point stands alone.
The resulting dendrogram lets you "cut" at any level to get any number of clusters, which is extremely flexible. The downside: hierarchical clustering is computationally expensive on large datasets.
Clustering turns up everywhere:
| Application | What Gets Clustered |
|---|---|
| Customer segmentation | Customers grouped by purchase behaviour |
| Image compression | Pixel colours reduced to K representative colours |
| Document organisation | Articles grouped by topic without manual labels |
| Anomaly detection | Outliers that don't fit any cluster = suspicious |
| Genetics | Gene expression patterns grouped by disease subtype |
| Social networks | Communities identified by connection patterns |
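To make one of those rows concrete, here's a sketch of image compression with K-Means: every pixel becomes a point in RGB space and is replaced by the nearest of K representative colours. This uses a synthetic random "image" and NumPy only, for illustration (a real image would come from an image library):

```python
import numpy as np

# Synthetic 8x8 RGB "image"; each pixel is a point in 3-D colour space.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3)).astype(float)
pixels = image.reshape(-1, 3)

# Basic K-Means over the pixels: find K = 4 representative colours.
k = 4
centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
for _ in range(20):
    labels = np.linalg.norm(pixels[:, None] - centroids[None, :], axis=2).argmin(axis=1)
    centroids = np.array([pixels[labels == j].mean(axis=0)
                          if np.any(labels == j) else centroids[j]
                          for j in range(k)])

# Replace each pixel with its cluster's colour: at most K distinct colours remain.
compressed = centroids[labels].reshape(image.shape)
print(len(np.unique(compressed.reshape(-1, 3), axis=0)))
```

Storing the K colours plus one small label per pixel takes far less space than the full 24 bits per pixel, which is the whole point of the compression.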
Clustering is powerful but has real limitations:

- You must choose a distance or similarity measure, and that choice shapes the clusters you find.
- Features on different scales distort distances, so data usually needs normalising first.
- K-Means assumes roughly spherical, similarly sized clusters and struggles with elongated or nested shapes.
- With no labels there is no ground truth, so validating the clusters is ultimately a judgement call.
K-Means was first proposed by Stuart Lloyd in 1957 (as an internal Bell Labs technical note) and independently re-described by Forgy in 1965. Despite its age, it remains one of the most widely used algorithms in data science.