

How Data Powers AI

You already know that AI can recognise faces, translate languages, and recommend songs. But what actually powers all of that? Data. Without data, AI is like a car without fuel - it simply cannot go anywhere.

In this lesson, we will explore what datasets look like, the different types of data AI uses, and why the quality of that data matters enormously.

What Is a Dataset?

A dataset is an organised collection of information that an AI system learns from. Think of it as a giant spreadsheet.

Rows represent individual examples (e.g., one photo of a cat, one patient record).
Columns represent features - the characteristics being measured (e.g., age, colour, size).
Labels are the answers we want the AI to predict (e.g., "cat" or "dog").

Figure: A dataset is like a well-organised spreadsheet where each row is an example and each column is a feature. (Image: a table of animal examples with feature columns such as colour and size, and a label column showing cat or dog.)
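
To make the spreadsheet picture concrete, here is a minimal Python sketch of a toy dataset. The animal values and the field names (`colour`, `size_cm`, `label`) are invented for illustration; real datasets are far larger and usually live in files or databases.

```python
# A toy dataset: each dict is one row (one example).
# All values and field names are invented for illustration.
dataset = [
    {"colour": "ginger", "size_cm": 46, "label": "cat"},
    {"colour": "black", "size_cm": 60, "label": "dog"},
    {"colour": "white", "size_cm": 41, "label": "cat"},
]

# Every column other than "label" is a feature.
feature_names = [name for name in dataset[0] if name != "label"]

# Separate what the AI sees (features) from what it should predict (labels).
features = [[row[name] for name in feature_names] for row in dataset]
labels = [row["label"] for row in dataset]

print(feature_names)  # ['colour', 'size_cm']
print(features[0])    # ['ginger', 46]
print(labels)         # ['cat', 'dog', 'cat']
```

Keeping features and labels apart mirrors how learning works: the model is shown the features and is corrected against the labels.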

🤯

The ImageNet dataset contains over 14 million hand-labelled images across more than 20,000 categories. It took researchers years and thousands of human annotators to build it.

Types of Data

AI works with two broad categories of data:

Structured Data

This is data that fits neatly into tables with rows and columns - like a bank transaction log or a hospital patient record. Each field has a clear type (number, date, category).

Examples: sales figures, sensor readings, survey responses.

Unstructured Data

This is data that does not follow a fixed format. It includes images, videos, audio recordings, emails, and social media posts. By commonly cited industry estimates, around 80% or more of the world's data is unstructured.

Examples: photos on your phone, voice messages, news articles.
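
The difference between the two categories shows up quickly in code. The sketch below is only an illustration: the transaction and the message are invented, and the crude word-splitting stands in for the much heavier processing real systems apply to unstructured data.

```python
# Structured: every field has a name and a clear type.
transaction = {"date": "2025-01-15", "amount": 12.50, "category": "groceries"}

# Unstructured: similar information buried in free text.
message = "Spent about twelve pounds fifty on groceries yesterday, I think."

# A structured field can be read directly.
print(transaction["amount"])  # 12.5

# Free text has no "amount" field. Without further processing, the best
# we can do is treat it as a sequence of words.
words = message.lower().replace(",", "").replace(".", "").split()
print(len(words))            # 10
print("groceries" in words)  # True
```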

🧠 Quiz

Which of the following is an example of unstructured data?

Why Does AI Need So Much Data?

Humans can learn to recognise a dog after seeing just a few pictures. AI typically needs thousands - sometimes millions - of examples before it can do the same. That is because AI has no built-in understanding of the world. Every piece of knowledge must come from the data.

The more varied and representative the data, the better the AI generalises to new situations. A model trained only on photos of golden retrievers might fail to recognise a poodle. Diversity in data leads to robustness in AI.

This is why companies invest heavily in collecting, cleaning, and curating large datasets - it is often the most time-consuming and expensive part of building an AI system.

The good news is that once a high-quality dataset exists, it can be reused and shared, accelerating research worldwide.
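
The golden retriever example can be turned into a simple check: count how much of the training set each category takes up. This is a rough sketch, not a standard procedure. The breed counts are invented and the 20% threshold is an arbitrary assumption chosen for the example.

```python
from collections import Counter

# Hypothetical breed tags for a tiny training set of dog photos.
breeds = ["golden retriever"] * 8 + ["poodle"] * 1 + ["beagle"] * 1

counts = Counter(breeds)
total = len(breeds)

# Share of the training set taken up by each breed.
shares = {breed: count / total for breed, count in counts.items()}

for breed, share in shares.items():
    print(f"{breed}: {share:.0%}")

# Flag breeds that make up less than an (arbitrary) 20% of the data.
under_represented = [breed for breed, share in shares.items() if share < 0.2]
print(under_represented)  # ['poodle', 'beagle']
```

A model trained on this set would see eight golden retrievers for every poodle, which is the kind of imbalance that hurts generalisation.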

Data Quality: Garbage In, Garbage Out

AI is only as good as the data it learns from. If the data is messy, incomplete, or incorrect, the AI will produce unreliable results. This principle is known as garbage in, garbage out (GIGO).

Common data quality problems include:

Missing values - blank fields that leave gaps in what the AI can learn.
Duplicates - the same example appearing multiple times, skewing the model.
Incorrect labels - a photo of a dog labelled as a cat confuses the learning process.
Outdated information - training on data from 2005 will not reflect 2025 trends.
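
The first two problems in this list can be detected mechanically. Here is a minimal sketch using only the Python standard library; the patient records are invented and the problems are planted on purpose.

```python
# Toy patient records with deliberately planted problems (all values invented).
records = [
    {"id": 1, "age": 34, "label": "healthy"},
    {"id": 2, "age": None, "label": "abnormal"},  # missing value
    {"id": 3, "age": 51, "label": "healthy"},
    {"id": 3, "age": 51, "label": "healthy"},     # duplicate of the row above
]

# Missing values: rows where any field is None.
rows_with_missing = [r for r in records if any(v is None for v in r.values())]

# Duplicates: rows whose full contents have already been seen.
seen = set()
duplicates = []
for r in records:
    key = tuple(sorted(r.items()))  # hashable fingerprint of the whole row
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(rows_with_missing))  # 1
print(len(duplicates))         # 1
```

Incorrect labels and outdated information are harder to catch this way, because the data looks valid on the surface. Finding them usually requires human review or comparison against a trusted source.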

🤔

Think about it:

Imagine you are studying for an exam, but half your textbook pages are missing and some answers in the back are wrong. How well would you perform? That is exactly what happens when AI trains on poor-quality data.

Real-World Datasets

Some datasets have become famous in the AI community:

| Dataset | What It Contains | Size |
|---------|-----------------|------|
| ImageNet | Labelled photographs | 14 million+ images |
| Common Crawl | Web pages from across the internet | Petabytes of text |
| Wikipedia | Encyclopaedia articles | 60 million+ articles |
| MNIST | Handwritten digits (0–9) | 70,000 images |

These datasets are publicly available, although some (such as ImageNet) come with terms that limit them to research use, and they have been used to train some of the most influential AI models in history.

🧠 Quiz

What does the MNIST dataset contain?

Data Labelling and Annotation

Before AI can learn from data, someone usually needs to label it. This means telling the system what each example represents.

A photo gets tagged as "cat" or "dog."
A sentence gets marked as "positive sentiment" or "negative sentiment."
A medical scan gets annotated by a doctor as "healthy" or "abnormal."

This process is called annotation, and it is often done by humans - sometimes thousands of them working together.
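
When several people label the same example, their answers have to be combined somehow. One simple approach, shown here as an assumption rather than something this lesson prescribes, is a majority vote. The file names and votes below are hypothetical.

```python
from collections import Counter

# Hypothetical labels from three annotators for the same photos.
annotations = {
    "photo_001.jpg": ["cat", "cat", "dog"],
    "photo_002.jpg": ["dog", "dog", "dog"],
}


def majority_label(votes):
    """Return the most frequent label among the votes.

    On a tie, Counter keeps the label that appeared first.
    """
    return Counter(votes).most_common(1)[0][0]


final_labels = {name: majority_label(votes) for name, votes in annotations.items()}
print(final_labels)  # {'photo_001.jpg': 'cat', 'photo_002.jpg': 'dog'}
```

Disagreement between annotators, as in the first photo, is itself useful information: it often points to examples that are ambiguous or hard to label.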

🤯

Many AI companies use crowd-sourcing platforms where workers around the world label data for just a few pence per item. It is a massive global effort that most people never see.

Data Bias: When Data Gets It Wrong

Data reflects the world it comes from - including the world's prejudices. When a dataset over-represents one group or under-represents another, the AI trained on it will inherit those imbalances.

Real examples of data bias:

A hiring algorithm trained mostly on male CVs learned to penalise female applicants.
Facial recognition systems trained mostly on lighter skin tones performed poorly on darker skin tones.
A medical AI trained on data from one country missed symptoms common in other populations.
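
Problems like these are often hidden by a single overall score, so one way to look for them is to measure accuracy separately for each group. The sketch below assumes we already have evaluation results tagged by group; the group names and outcomes are invented.

```python
from collections import defaultdict

# Hypothetical evaluation results: (group, was the prediction correct?)
results = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, is_correct in results:
    total[group] += 1
    correct[group] += int(is_correct)

# Accuracy broken down by group.
accuracy = {group: correct[group] / total[group] for group in total}
print(accuracy)  # {'group_a': 0.75, 'group_b': 0.25}

# The overall figure hides the gap between the two groups.
overall = sum(correct.values()) / sum(total.values())
print(overall)  # 0.5
```

An overall accuracy of 50% says nothing about the fact that the system works three times as often for one group as for the other.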

💡

Bias is not just a technical problem - it is a social one. Every dataset carries the assumptions and blind spots of the people who created it.

🤔

Think about it:

If you trained an AI to recommend restaurants but only gave it data from London, would it give good recommendations for someone in Manchester? What might it get wrong?

🧠 Quiz

What is the most likely consequence of training an AI on biased data?

Key Takeaways

A dataset is an organised collection of examples that AI learns from.
Data can be structured (tables) or unstructured (images, text, audio).
Data quality directly determines AI quality - garbage in, garbage out.
Labelling is the human effort that teaches AI what each example means.
Bias in data leads to bias in AI - and it has real-world consequences.

Next up, we will look at the algorithms that actually process all this data and turn it into intelligence.
