AI Sprouts • Intermediate • ⏱️ 25 min read

Clustering: How AI Finds Patterns Without Labels 🔵

Every algorithm you've studied so far has been supervised: you feed it labelled examples and it learns to predict labels for new ones. But what happens when you have a mountain of data and no labels at all?

This is where unsupervised learning comes in — and clustering is one of its most powerful tools.


🔍 What Is Clustering?

Clustering is the task of grouping data points together so that points in the same group (cluster) are more similar to each other than to points in other groups.

Crucially, nobody tells the algorithm how many groups there are or what they represent. It finds the structure on its own.

Think of it like sorting a pile of mixed sweets with your eyes closed, just using touch. You'd group them by shape — round ones together, long ones together, chewy ones apart from hard ones — without anyone defining the categories in advance.

🤯

Clustering is used by astronomers to group galaxies by shape and composition, by geneticists to identify disease subtypes, and by Spotify to generate personalised playlists — all without anyone labelling the data by hand.


📍 K-Means: The Most Famous Clustering Algorithm

K-Means is simple, fast, and surprisingly powerful. Here's the full algorithm in plain English:

  1. Choose K — decide in advance how many clusters you want.
  2. Place K centroids randomly — scatter K points randomly across the data space. A centroid is just the "centre" of a cluster.
  3. Assign each data point to its nearest centroid — every point belongs to whichever centroid it's closest to.
  4. Recalculate each centroid — move each centroid to the average position of all points assigned to it.
  5. Repeat steps 3 and 4 — keep reassigning and recalculating until the assignments stop changing.
[Figure: three iterations of K-Means — random centroids, first assignment, final converged clusters shown in different colours. K-Means iterates between assigning points to centroids and recalculating centroid positions until clusters stabilise.]

The analogy: imagine placing K magnets on a map of customer addresses. Each customer is attracted to the nearest magnet. Then you move each magnet to the centre of its customers. Repeat until the magnets stop moving.
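To make the five steps concrete, here's a minimal from-scratch sketch in NumPy. The toy data, K = 2, and the iteration cap are illustrative assumptions, not part of the lesson:

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialise centroids at k randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        # (Sketch assumes no cluster ever goes empty.)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and hence assignments) settle.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy example: two obvious blobs.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = k_means(data, k=2)
print(labels)     # e.g. [0 0 0 1 1 1]
print(centroids)  # roughly [[1.03, 0.97], [8.0, 8.0]]
```

Note one design choice here: step 2 starts the centroids at randomly chosen data points rather than arbitrary positions in space, a common variant that avoids starting with empty clusters.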

🤔
Think about it:

K-Means depends on the initial random placement of centroids. Two runs on the same data can produce different clusters. How would you decide which result is better? What would you even measure?


🔢 Choosing K: The Elbow Method

A practical problem: how do you choose K? If you set K equal to the number of data points, every point is its own cluster — perfect but useless. If K = 1, everything is one big blob — also useless.

The elbow method helps: run K-Means for K = 1, 2, 3, … N and plot how much "error" (within-cluster variance) decreases as K increases. More clusters always reduce error, but there's usually a K where the improvement starts to slow dramatically — the "elbow" in the curve. That's a good candidate for the right K.

It's not a hard rule, but it gives you a principled starting point.
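A quick way to see the elbow in practice is scikit-learn's KMeans, whose inertia_ attribute is the within-cluster variance described above. The three-blob dataset below is an illustrative assumption, chosen so the elbow should appear around K = 3:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Illustrative data: three Gaussian blobs, so the "true" K is 3.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in ([0, 0], [5, 5], [0, 5])])

# Fit K-Means for a range of K and record the within-cluster variance.
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster variance (inertia)")
plt.title("Elbow method: look for the bend, here around K = 3")
plt.show()
```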


🌿 Hierarchical Clustering

K-Means requires you to fix K upfront. Hierarchical clustering does not.

Instead, it builds a tree (called a dendrogram) of clusters by either:

  • Agglomerative (bottom-up): start with every point as its own cluster. Repeatedly merge the two most similar clusters until everything is one big cluster.
  • Divisive (top-down): start with everything in one cluster. Repeatedly split the most dissimilar cluster until every point is its own cluster.

The resulting dendrogram lets you "cut" at any level to get any number of clusters, which is extremely flexible. The downside: hierarchical clustering is computationally expensive on large datasets.
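As a sketch of the agglomerative approach, SciPy's scipy.cluster.hierarchy module builds the dendrogram with linkage and lets you cut it at any level with fcluster. The toy data and the choice of Ward linkage are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Illustrative data: two small, well-separated groups of points.
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])

# Agglomerative (bottom-up) clustering; "ward" merges the pair of
# clusters that least increases within-cluster variance.
Z = linkage(data, method="ward")

# "Cut" the dendrogram so that it yields exactly 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# Visualise the full merge tree.
dendrogram(Z)
plt.show()
```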


🌍 Real-World Applications

Clustering turns up everywhere:

| Application | What Gets Clustered |
|---|---|
| Customer segmentation | Customers grouped by purchase behaviour |
| Image compression | Pixel colours reduced to K representative colours |
| Document organisation | Articles grouped by topic without manual labels |
| Anomaly detection | Outliers that don't fit any cluster = suspicious |
| Genetics | Gene expression patterns grouped by disease subtype |
| Social networks | Communities identified by connection patterns |


⚠️ Limitations of Clustering

Clustering is powerful but has real limitations:

  • Interpreting clusters is hard — the algorithm groups the data, but you still have to work out what each group means.
  • K-Means assumes roughly spherical clusters — it struggles with elongated, irregular, or nested cluster shapes.
  • Sensitive to scale — features with large numerical ranges dominate distance calculations; always normalise features before clustering (see the sketch after this list).
  • No ground truth — without labels, evaluating quality is subjective. Two analysts might draw different conclusions from the same clustering.
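Here is what that normalisation step might look like with scikit-learn's StandardScaler; the customer features (age in years, income in pounds) are a made-up example of mismatched scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Raw features: income dwarfs age, so unscaled Euclidean distances
# are driven almost entirely by the income column.
customers = np.array([[25, 20_000.0],
                      [30, 95_000.0],
                      [60, 21_000.0]])

# Rescale each feature to mean 0 and unit variance before clustering.
scaled = StandardScaler().fit_transform(customers)
print(scaled)
```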
🤯

K-Means was first proposed by Stuart Lloyd in 1957 (as an internal Bell Labs technical note) and independently re-described by Forgy in 1965. Despite its age, it remains one of the most widely used algorithms in data science.