AI Sprouts • Beginner • 12 min read

How Data Powers AI

You already know that AI can recognise faces, translate languages, and recommend songs. But what actually powers all of that? Data. Without data, AI is like a car without fuel - it simply cannot go anywhere.

In this lesson, we will explore what datasets look like, the different types of data AI uses, and why the quality of that data matters enormously.

What Is a Dataset?

A dataset is an organised collection of information that an AI system learns from. Think of it as a giant spreadsheet.

  • Rows represent individual examples (e.g., one photo of a cat, one patient record).
  • Columns represent features - the characteristics being measured (e.g., age, colour, size).
  • Labels are the answers we want the AI to predict (e.g., "cat" or "dog").
[Figure: a simple dataset table with one row per animal image, feature columns such as colour and size, and a label column showing cat or dog. A dataset is like a well-organised spreadsheet where each row is an example and each column is a feature.]
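
To make the row, column, and label vocabulary concrete, here is a minimal Python sketch. The three rows, the field names (`colour`, `size_cm`, `label`), and the helper `split_features_and_labels` are all made up for illustration; real datasets are far larger and usually live in files or databases.

```python
# Hypothetical toy dataset: each dict is one row (one example).
dataset = [
    {"colour": "grey", "size_cm": 46, "label": "cat"},
    {"colour": "brown", "size_cm": 60, "label": "dog"},
    {"colour": "black", "size_cm": 43, "label": "cat"},
]


def split_features_and_labels(rows, label_key="label"):
    """Separate each row into its feature columns and its label."""
    features = [
        {key: value for key, value in row.items() if key != label_key}
        for row in rows
    ]
    labels = [row[label_key] for row in rows]
    return features, labels


features, labels = split_features_and_labels(dataset)
print(features[0])  # the features of the first example
print(labels)       # the answers the AI should learn to predict
```

The features are what the AI gets to look at; the labels are the answers it is asked to predict.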
🤯

The ImageNet dataset contains over 14 million hand-labelled images across more than 20,000 categories. It took researchers years and thousands of human annotators to build it.

Types of Data

AI works with two broad categories of data:

Structured Data

This is data that fits neatly into tables with rows and columns - like a bank transaction log or a hospital patient record. Each field has a clear type (number, date, category).

Examples: sales figures, sensor readings, survey responses.

Unstructured Data

This is data that does not follow a fixed format. It includes images, videos, audio recordings, emails, and social media posts. By commonly cited industry estimates, around 80% or more of the world's data is unstructured.

Examples: photos on your phone, voice messages, news articles.
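
The difference between the two categories can be sketched in a few lines of Python. The `Transaction` record, the sample message, and the `word_counts` helper below are illustrative assumptions, not part of any particular AI system. The point is only that structured data arrives with named, typed fields, while unstructured data needs a processing step before it looks like columns.

```python
import string
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Transaction:
    """Structured data: every field has a fixed name and type."""
    account_id: int
    amount: float
    day: date


structured = Transaction(account_id=1001, amount=25.50, day=date(2025, 1, 15))

# Unstructured data: free text with no predefined fields.
unstructured = "Hi, I paid for lunch on Wednesday. Can you check the payment?"


def word_counts(text):
    """One very simple way to turn free text into countable features."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


print(structured.amount)             # a field we can read directly
print(word_counts(unstructured))     # features we had to extract first
```

Counting words is a deliberately crude choice here; it stands in for the much richer processing that real systems apply to text, images, and audio.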

🧠 Quiz

Which of the following is an example of unstructured data?

Why Does AI Need So Much Data?

Humans can learn to recognise a dog after seeing just a few pictures. AI typically needs thousands - sometimes millions - of examples before it can do the same. That is because AI has no built-in understanding of the world. Every piece of knowledge must come from the data.

The more varied and representative the data, the better the AI generalises to new situations. A model trained only on photos of golden retrievers might fail to recognise a poodle. Diversity in data leads to robustness in AI.

This is why companies invest heavily in collecting, cleaning, and curating large datasets - it is often the most time-consuming and expensive part of building an AI system.

The good news is that once a high-quality dataset exists, it can be reused and shared, accelerating research worldwide.

Data Quality: Garbage In, Garbage Out

AI is only as good as the data it learns from. If the data is messy, incomplete, or incorrect, the AI will produce unreliable results. This principle is known as garbage in, garbage out (GIGO).

Common data quality problems include:

  • Missing values - blank fields that leave gaps in what the AI can learn.
  • Duplicates - the same example appearing multiple times, skewing the model.
  • Incorrect labels - a photo of a dog labelled as a cat confuses the learning process.
  • Outdated information - training on data from 2005 will not reflect 2025 trends.
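
Two of these problems, missing values and duplicates, are easy to check for automatically. The sketch below uses a made-up three-row dataset and a hypothetical helper called `audit`; it is one simple way to do it, not a standard tool. Incorrect labels and outdated information are harder to spot, because they usually need a person or a trusted reference to compare against.

```python
def audit(rows):
    """Return the indices of rows with missing values and of exact duplicates."""
    missing = [
        index
        for index, row in enumerate(rows)
        if any(value is None or value == "" for value in row.values())
    ]

    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        # Sort the items so that field order does not affect the comparison.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates.append(index)
        else:
            seen.add(fingerprint)

    return missing, duplicates


# Hypothetical data: row 1 has a blank field, row 2 repeats row 0.
rows = [
    {"colour": "grey", "size_cm": 46, "label": "cat"},
    {"colour": "brown", "size_cm": None, "label": "dog"},
    {"colour": "grey", "size_cm": 46, "label": "cat"},
]

missing, duplicates = audit(rows)
print("Rows with missing values:", missing)
print("Duplicate rows:", duplicates)
```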
🤔
Think about it:

Imagine you are studying for an exam, but half your textbook pages are missing and some answers in the back are wrong. How well would you perform? That is exactly what happens when AI trains on poor-quality data.

Real-World Datasets

Some datasets have become famous in the AI community:

| Dataset | What It Contains | Size |
|---------|-----------------|------|
| ImageNet | Labelled photographs | 14 million+ images |
| Common Crawl | Web pages from across the internet | Petabytes of text |
| Wikipedia | Encyclopaedia articles | 60 million+ articles across all language editions |
| MNIST | Handwritten digits (0–9) | 70,000 images |

These datasets are publicly available (some, such as ImageNet, under research-use terms) and have been used to train some of the most influential AI models in history.

🧠 Quiz

What does the MNIST dataset contain?

Data Labelling and Annotation

Before AI can learn from data, someone usually needs to label it. This means telling the system what each example represents.

  • A photo gets tagged as "cat" or "dog."
  • A sentence gets marked as "positive sentiment" or "negative sentiment."
  • A medical scan gets annotated by a doctor as "healthy" or "abnormal."

This process is called annotation, and it is often done by humans - sometimes thousands of them working together.
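
When several people label the same example, their answers have to be combined somehow. A common, simple approach is a majority vote, sketched below. The function name `majority_label` and the example votes are assumptions made for illustration; real annotation pipelines often add extra checks, such as sending disputed items to an expert.

```python
from collections import Counter


def majority_label(votes):
    """Combine several annotators' labels for one example.

    Returns (label, agreement), where agreement is the share of votes
    that went to the most popular label. The label is None on a tie.
    """
    if not votes:
        raise ValueError("at least one vote is required")

    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)

    # If the runner-up has as many votes as the winner, nobody wins.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_label, agreement


print(majority_label(["cat", "cat", "dog"]))  # two of three annotators agree
print(majority_label(["cat", "dog"]))         # a tie: needs another opinion
```

Low agreement is a useful warning sign: it often means the example is ambiguous or the labelling instructions are unclear.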

🤯

Many AI companies use crowd-sourcing platforms where workers around the world label data for just a few pence per item. It is a massive global effort that most people never see.

Data Bias: When Data Gets It Wrong

Data reflects the world it comes from - including the world's prejudices. When a dataset over-represents one group or under-represents another, the AI trained on it will inherit those imbalances.

Real examples of data bias:

  • A hiring algorithm trained mostly on male CVs learned to penalise female applicants.
  • Facial recognition systems trained mostly on lighter skin tones performed poorly on darker skin tones.
  • A medical AI trained on data from one country missed symptoms common in other populations.
💡

Bias is not just a technical problem - it is a social one. Every dataset carries the assumptions and blind spots of the people who created it.
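
A first, very rough check for this kind of imbalance is to count how much of the dataset each group makes up. The sketch below is a minimal illustration: the group names, the ten sample rows, the helper names, and the 20% threshold are all arbitrary assumptions. Passing a check like this does not prove a dataset is fair.

```python
from collections import Counter


def group_shares(rows, group_key):
    """Return each group's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(row[group_key] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, threshold=0.2):
    """List the groups whose share falls below the chosen threshold."""
    return sorted(group for group, share in shares.items() if share < threshold)


# Hypothetical dataset: nine examples from one group, one from another.
rows = [{"group": "group_a"}] * 9 + [{"group": "group_b"}]

shares = group_shares(rows, "group")
print(shares)
print("Under-represented:", underrepresented(shares))
```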

🤔
Think about it:

If you trained an AI to recommend restaurants but only gave it data from London, would it give good recommendations for someone in Manchester? What might it get wrong?

🧠 Quiz

What is the most likely consequence of training an AI on biased data?

Key Takeaways

  • A dataset is an organised collection of examples that AI learns from.
  • Data can be structured (tables) or unstructured (images, text, audio).
  • Data quality directly determines AI quality - garbage in, garbage out.
  • Labelling is the human effort that teaches AI what each example means.
  • Bias in data leads to bias in AI - and it has real-world consequences.

Next up, we will look at the algorithms that actually process all this data and turn it into intelligence.