AI Educademy
AI Sprouts • Intermediate • ⏱️ 25 min read

Clustering: How AI Finds Patterns Without Labels 🔵

Every algorithm you've studied so far has been supervised: you feed it labelled examples and it learns to predict labels for new ones. But what happens when you have a mountain of data and no labels at all?

This is where unsupervised learning comes in — and clustering is one of its most powerful tools.


🔍 What Is Clustering?

Clustering is the task of grouping data points together so that points in the same group (cluster) are more similar to each other than to points in other groups.

Crucially, nobody tells the algorithm how many groups there are or what they represent. It finds the structure on its own.

Think of it like sorting a pile of mixed sweets with your eyes closed, using only touch. You'd group them by shape and texture — round ones together, long ones together, chewy ones apart from hard ones — without anyone defining the categories in advance.

🤯

Clustering is used by astronomers to group galaxies by shape and composition, by geneticists to identify disease subtypes, and by Spotify to generate personalised playlists — all without anyone labelling the data by hand.


📍 K-Means: The Most Famous Clustering Algorithm

K-Means is simple, fast, and surprisingly powerful. Here's the full algorithm in plain English:

  1. Choose K — decide in advance how many clusters you want.
  2. Place K centroids randomly — scatter K points randomly across the data space. A centroid is just the "centre" of a cluster.
  3. Assign each data point to its nearest centroid — every point belongs to whichever centroid it's closest to.
  4. Recalculate each centroid — move each centroid to the average position of all points assigned to it.
  5. Repeat steps 3 and 4 — keep reassigning and recalculating until the assignments stop changing.
[Figure: three iterations of K-Means (random centroids, first assignment, final converged clusters in different colours). K-Means iterates between assigning points to centroids and recalculating centroid positions until the clusters stabilise.]

The analogy: imagine placing K magnets on a map of customer addresses. Each customer is attracted to the nearest magnet. Then you move each magnet to the centre of its customers. Repeat until the magnets stop moving.
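The five steps above can be sketched directly in Python. This is a minimal NumPy version on hypothetical toy data, not production code; libraries such as scikit-learn add smarter initialisation (k-means++) and multiple restarts:

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal K-Means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 1 is the caller's choice of k.
    # Step 2: use k distinct data points as the initial centroids
    # (purely random placement also works but converges more slowly).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep a centroid in place if it has no points)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids (and hence assignments) stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two obvious groups
data = np.array([[0.0, 0.1], [0.2, 0.0], [-0.1, 0.2],
                 [10.0, 9.9], [9.8, 10.1], [10.2, 10.0]])
centroids, labels = k_means(data, k=2)
```

On well-separated data like this, the loop converges in a couple of iterations: the first three points end up in one cluster and the last three in the other.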

🤔
Think about it:

K-Means depends on the initial random placement of centroids. Two runs on the same data can produce different clusters. How would you decide which result is better? What would you even measure?


🔢 Choosing K: The Elbow Method

A practical problem: how do you choose K? If you set K equal to the number of data points, every point is its own cluster — perfect but useless. If K = 1, everything is one big blob — also useless.

The elbow method helps: run K-Means for K = 1, 2, 3, … N and plot how much "error" (within-cluster variance) decreases as K increases. More clusters always reduce error, but there's usually a K where the improvement starts to slow dramatically — the "elbow" in the curve. That's a good candidate for the right K.

It's not a hard rule, but it gives you a principled starting point.
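A sketch of the elbow method on made-up data, assuming scikit-learn is available (its `KMeans` estimator exposes the within-cluster variance as `inertia_`):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs of 50 points each
data = np.vstack([rng.normal(loc=centre, scale=0.5, size=(50, 2))
                  for centre in [(0, 0), (5, 5), (10, 0)]])

# Run K-Means for K = 1..8 and record the within-cluster variance
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km.inertia_)

# Plotting `inertias` against K shows a sharp drop up to K = 3,
# then only marginal gains: the "elbow" suggests K = 3.
```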


🌿 Hierarchical Clustering

K-Means requires you to fix K upfront. Hierarchical clustering does not.

Instead, it builds a tree of clusters (called a dendrogram) in one of two ways:

  • Agglomerative (bottom-up): start with every point as its own cluster. Repeatedly merge the two most similar clusters until everything is one big cluster.
  • Divisive (top-down): start with everything in one cluster. Repeatedly split the most dissimilar cluster until every point is its own cluster.

The resulting dendrogram lets you "cut" at any level to get any number of clusters, which is extremely flexible. The downside: hierarchical clustering is computationally expensive on large datasets.
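As a sketch (on hypothetical data), SciPy implements the agglomerative variant: `linkage` records every merge as a tree, and `fcluster` cuts that tree at whatever level you ask for afterwards:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: two tight groups of three points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [8.0, 8.0], [8.1, 8.2], [7.9, 8.1]])

# Agglomerative (bottom-up): repeatedly merge the two closest
# clusters; the merge history is the dendrogram
tree = linkage(points, method="average")

# "Cut" the dendrogram into exactly 2 clusters. Note that K was
# never needed to build the tree, only to read clusters off it.
labels = fcluster(tree, t=2, criterion="maxclust")
```

Cutting the same `tree` with `t=3` would yield three clusters without recomputing anything, which is exactly the flexibility K-Means lacks.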


🌍 Real-World Applications

Clustering turns up everywhere:

| Application | What Gets Clustered |
|---|---|
| Customer segmentation | Customers grouped by purchase behaviour |
| Image compression | Pixel colours reduced to K representative colours |
| Document organisation | Articles grouped by topic without manual labels |
| Anomaly detection | Outliers that don't fit any cluster = suspicious |
| Genetics | Gene expression patterns grouped by disease subtype |
| Social networks | Communities identified by connection patterns |


⚠️ Limitations of Clustering

Clustering is powerful but has real limitations:

  • Interpreting clusters is hard — the algorithm groups the data, but you still have to work out what each group means.
  • K-Means assumes roughly spherical clusters — it struggles with elongated, irregular, or nested cluster shapes.
  • Sensitive to scale — features with large numerical ranges dominate distance calculations; always normalise features before clustering.
  • No ground truth — without labels, evaluating quality is subjective. Two analysts might draw different conclusions from the same clustering.
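The scale problem is easy to demonstrate with made-up customer data (the features and values here are hypothetical):

```python
import numpy as np

# Hypothetical customers: [income in dollars, age in years]
raw = np.array([[30_000.0, 25.0],
                [31_000.0, 60.0],
                [90_000.0, 26.0]])

# Unnormalised, Euclidean distance is dominated by income:
# customer 0 looks far closer to 1 (despite a 35-year age gap)
# than to 2 (despite nearly identical ages)
d01_raw = np.linalg.norm(raw[0] - raw[1])
d02_raw = np.linalg.norm(raw[0] - raw[2])

# Z-score normalisation puts both features on the same footing
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
d01 = np.linalg.norm(scaled[0] - scaled[1])
d02 = np.linalg.norm(scaled[0] - scaled[2])
# After scaling, the two distances become comparable: neither
# feature dominates the clustering
```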
🤯

K-Means was first proposed by Stuart Lloyd in 1957 (as an internal Bell Labs technical note) and independently re-described by Forgy in 1965. Despite its age, it remains one of the most widely used algorithms in data science.