AI Sprout • Intermediate • ⏱️ 25 min read


Clustering: How AI Finds Patterns Without Labels 🔵

Every algorithm you've studied so far has been supervised: you feed it labelled examples and it learns to predict labels for new ones. But what happens when you have a mountain of data and no labels at all?

This is where unsupervised learning comes in — and clustering is one of its most powerful tools.


🔍 What Is Clustering?

Clustering is the task of grouping data points together so that points in the same group (cluster) are more similar to each other than to points in other groups.

Crucially, nobody tells the algorithm how many groups there are or what they represent. It finds the structure on its own.

Think of it like sorting a pile of mixed sweets with your eyes closed, using only touch. You'd group them by shape and texture — round ones together, long ones together, chewy ones apart from hard ones — without anyone defining the categories in advance.

🤯

Clustering is used by astronomers to group galaxies by shape and composition, by geneticists to identify disease subtypes, and by Spotify to generate personalised playlists — all without anyone labelling the data by hand.


📍 K-Means: The Most Famous Clustering Algorithm

K-Means is simple, fast, and surprisingly powerful. Here's the full algorithm in plain English:

  1. Choose K — decide in advance how many clusters you want.
  2. Place K centroids randomly — scatter K points randomly across the data space. A centroid is just the "centre" of a cluster.
  3. Assign each data point to its nearest centroid — every point belongs to whichever centroid it's closest to.
  4. Recalculate each centroid — move each centroid to the average position of all points assigned to it.
  5. Repeat steps 3 and 4 — keep reassigning and recalculating until the assignments stop changing.
[Figure] Three iterations of K-Means — random centroids, first assignment, final converged clusters shown in different colours. K-Means alternates between assigning points to centroids and recalculating centroid positions until the clusters stabilise.

The analogy: imagine placing K magnets on a map of customer addresses. Each customer is attracted to the nearest magnet. Then you move each magnet to the centre of its customers. Repeat until the magnets stop moving.
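The five steps above fit in a short piece of plain Python. This is a minimal sketch on made-up 2-D points, using squared Euclidean distance, not a production implementation:

```python
import random

def kmeans(points, k, iterations=100, seed=42):
    """Minimal K-Means on 2-D points: assign, recalculate, repeat."""
    rng = random.Random(seed)
    # Step 2: pick K initial centroids at random from the data.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Step 3: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[dists.index(min(dists))].append((x, y))
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # leave an empty centroid where it is
        # Step 5: stop once the assignments (and hence centroids) stabilise.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious blobs: with K = 2, the algorithm should separate them.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
```

Note the "Think about it" question below: the `seed` parameter pins down the random initial placement, which is exactly why two runs with different seeds can converge to different clusterings on less well-separated data.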

🤔
Think about it:

K-Means depends on the initial random placement of centroids. Two runs on the same data can produce different clusters. How would you decide which result is better? What would you even measure?


🔢 Choosing K: The Elbow Method

A practical problem: how do you choose K? If you set K equal to the number of data points, every point is its own cluster — perfect but useless. If K = 1, everything is one big blob — also useless.

The elbow method helps: run K-Means for K = 1, 2, 3, … N and plot how much "error" (within-cluster variance) decreases as K increases. More clusters always reduce error, but there's usually a K where the improvement starts to slow dramatically — the "elbow" in the curve. That's a good candidate for the right K.

It's not a hard rule, but it gives you a principled starting point.
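To make the elbow concrete: with two well-separated blobs, the within-cluster variance collapses when K goes from 1 to 2 and barely improves afterwards. A minimal sketch on made-up points:

```python
def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def inertia(clusters):
    """Within-cluster variance: squared distance of each point to its cluster mean."""
    return sum(sq_dist(p, mean(c)) for c in clusters for p in c)

blob_a = [(1, 1), (1, 2), (2, 1)]
blob_b = [(8, 8), (8, 9), (9, 8)]

inertia_k1 = inertia([blob_a + blob_b])  # K = 1: one centroid stuck between blobs
inertia_k2 = inertia([blob_a, blob_b])   # K = 2: the natural grouping
inertia_k3 = inertia([blob_a, [(8, 8), (8, 9)], [(9, 8)]])  # K = 3: over-splitting
```

Here the drop from K = 1 to K = 2 is enormous, while K = 3 shaves off almost nothing: the elbow sits at K = 2.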


🌿 Hierarchical Clustering

K-Means requires you to fix K upfront. Hierarchical clustering does not.

Instead, it builds a tree (called a dendrogram) of clusters by either:

  • Agglomerative (bottom-up): start with every point as its own cluster. Repeatedly merge the two most similar clusters until everything is one big cluster.
  • Divisive (top-down): start with everything in one cluster. Repeatedly split the most dissimilar cluster until every point is its own cluster.

The resulting dendrogram lets you "cut" at any level to get any number of clusters, which is extremely flexible. The downside: hierarchical clustering is computationally expensive on large datasets, because the standard agglomerative algorithm compares every pair of points, so its cost grows at least quadratically with the dataset size.
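The agglomerative (bottom-up) variant can be sketched in a few lines of plain Python, here using single linkage (the distance between two clusters is the distance between their closest members). This is an illustration on made-up points, not an efficient implementation:

```python
def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def agglomerative(points, target_k):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sq_dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
two = agglomerative(data, target_k=2)
```

Stopping the merge loop at different values of `target_k` corresponds to cutting the dendrogram at different heights, which is where the flexibility mentioned above comes from.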


🌍 Real-World Applications

Clustering turns up everywhere:

| Application | What Gets Clustered |
|---|---|
| Customer segmentation | Customers grouped by purchase behaviour |
| Image compression | Pixel colours reduced to K representative colours |
| Document organisation | Articles grouped by topic without manual labels |
| Anomaly detection | Outliers that don't fit any cluster = suspicious |
| Genetics | Gene expression patterns grouped by disease subtype |
| Social networks | Communities identified by connection patterns |


⚠️ Limitations of Clustering

Clustering is powerful but has real limitations:

  • Interpreting clusters is hard — the algorithm groups the data, but you still have to work out what each group means.
  • K-Means assumes roughly spherical clusters — it struggles with elongated, irregular, or nested cluster shapes.
  • Sensitive to scale — features with large numerical ranges dominate distance calculations; always normalise features before clustering.
  • No ground truth — without labels, evaluating quality is subjective. Two analysts might draw different conclusions from the same clustering.
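The scale sensitivity is easy to demonstrate. In the hypothetical customer data below, income is thousands of times larger than age, so raw squared distance is decided almost entirely by income; z-score normalisation (subtract the mean, divide by the standard deviation) puts both features on a comparable footing:

```python
customers = [(25, 40000), (60, 41000), (26, 90000)]  # (age, income)

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Raw distances: the 25- and 60-year-old look "closest" because their
# incomes differ by only 1000; the 35-year age gap barely registers.
raw_young_old = sq_dist(customers[0], customers[1])
raw_young_rich = sq_dist(customers[0], customers[2])

def zscore(column):
    """Standardise one feature column: subtract mean, divide by std dev."""
    m = sum(column) / len(column)
    sd = (sum((v - m) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - m) / sd for v in column]

ages = zscore([c[0] for c in customers])
incomes = zscore([c[1] for c in customers])
scaled = list(zip(ages, incomes))

# After scaling, the age gap and the income gap carry comparable weight.
scaled_young_old = sq_dist(scaled[0], scaled[1])
scaled_young_rich = sq_dist(scaled[0], scaled[2])
```

Before scaling, one distance dwarfs the other by a factor of thousands; after scaling, the two are of the same order of magnitude, so both features genuinely influence the clustering.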
🤯

K-Means was first proposed by Stuart Lloyd in 1957 (as an internal Bell Labs technical note) and independently re-described by Forgy in 1965. Despite its age, it remains one of the most widely used algorithms in data science.


Lesson 16 of 16
← Decision Trees: An Algorithm You Can Draw on Paper
🌳 AI Branches →