In AI Seeds, you learned that AI learns from examples, just like a child learns to recognise animals from picture books. But where do those examples come from?
The answer is data, and it's the single most important ingredient in AI. Bad data leads to bad AI. Great data leads to great AI. Let's dig in.
Data is simply recorded information. Every time you do something digital (sending a message, taking a photo, searching the web), you create data.
In AI, we collect these pieces of information, organise them, and use them to teach machines.
Not all data looks the same. There are two main types:
Data that fits neatly into rows and columns, like a spreadsheet.
| Name | Age | City | Favourite Colour |
|------|-----|------|------------------|
| Aisha | 14 | London | Blue |
| Ravi | 16 | Hyderabad | Green |
| Emma | 15 | Amsterdam | Red |
Databases, CSV files, and Excel sheets contain structured data. It's easy for machines to read and process.
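To make this concrete, here is a minimal sketch (assuming pandas is available) that rebuilds the table above as a DataFrame, the structure pandas uses for exactly this kind of rows-and-columns data:

```python
import pandas as pd

# The spreadsheet-style table above, rebuilt as a pandas DataFrame
people = pd.DataFrame({
    "Name": ["Aisha", "Ravi", "Emma"],
    "Age": [14, 16, 15],
    "City": ["London", "Hyderabad", "Amsterdam"],
    "Favourite Colour": ["Blue", "Green", "Red"],
})

print(people)
print("Rows x columns:", people.shape)        # (3, 4)
print("Average age:", people["Age"].mean())   # 15.0
```

Because the data is structured, a machine can immediately compute things like the average age, with no extra interpretation needed.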
Data that doesn't fit into a table: images, videos, audio, emails, social media posts.
Over 80% of the world's data is unstructured! Photos, videos, and text messages massively outnumber spreadsheets. Modern AI, especially deep learning, was designed specifically to handle this messy, unstructured data.
When you study for an exam, you don't just read the textbook; you also practise with sample questions and then take the real exam. AI does the same thing with three data splits:
The largest portion, typically 70-80% of all data. The AI model studies this to learn patterns.
About 10-15% of the data. Used during training to check progress. Think of it as a practice test: "Am I learning the right things?"
The remaining 10-15%. Used after training is complete. The model has never seen this data before. It's the true measure of how well the model performs.
```python
# A common way to split data in Python
from sklearn.model_selection import train_test_split

# all_data is your full dataset (a list or array of examples)
# Split: 80% training, 20% temporary (random_state makes it reproducible)
train_data, temp_data = train_test_split(all_data, test_size=0.2,
                                         random_state=42)

# Split the temporary set: half validation, half test
val_data, test_data = train_test_split(temp_data, test_size=0.5,
                                       random_state=42)

print(f"Training: {len(train_data)}")
print(f"Validation: {len(val_data)}")
print(f"Test: {len(test_data)}")
```
Why can't we just test the model on the same data it trained on? Because it would be like giving a student the exact exam questions in advance: they'd score perfectly but might not actually understand the material. Test data must be unseen to give an honest evaluation.
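This effect is easy to demonstrate. The sketch below (using a synthetic dataset and a fully grown decision tree, both illustrative choices, not anything prescribed by this lesson) builds a model that memorises its training data, then scores it on data it has and hasn't seen:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with some label noise (flip_y), so perfect
# generalisation is impossible
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A fully grown tree keeps splitting until it memorises every
# training example, noise and all
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("Score on data it trained on:", tree.score(X_train, y_train))
print("Score on unseen test data:  ", tree.score(X_test, y_test))
```

The training score is a perfect 1.0, which tells us nothing; only the lower score on unseen test data reflects what the model actually learned.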
Here's a critical concept: AI is only as fair as the data it learns from.
If you train a facial recognition system mostly on photos of light-skinned people, it will perform poorly on darker-skinned faces. This isn't the algorithm being "racist"; it simply never had enough examples to learn properly.
Real-world examples of data bias include Amazon's experimental hiring tool, which learned to penalise CVs mentioning women's colleges because it was trained on historically male-dominated hiring records, and commercial facial recognition systems found to make far more errors on darker-skinned women than on lighter-skinned men.
Bias isn't always obvious. If your dataset contains 90% English text, your AI will be excellent at English but poor at Hindi, Telugu, or Dutch. That's a bias, even though nobody intended it. Always ask: "Who is missing from this data?"
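One practical habit: before training anything, count how each group is represented. Here is a minimal sketch using a hypothetical (made-up) set of language labels like the 90%-English scenario above:

```python
from collections import Counter

# Hypothetical dataset: the language label of each text sample collected
sample_languages = (["English"] * 900 + ["Hindi"] * 40 +
                    ["Telugu"] * 35 + ["Dutch"] * 25)

counts = Counter(sample_languages)
total = sum(counts.values())

# Ask "who is missing?" by checking each group's share of the data
for language, count in counts.most_common():
    print(f"{language}: {count} samples ({100 * count / total:.1f}%)")
```

A printout like this makes the imbalance impossible to miss: English dominates at 90%, so a model trained on this data will inevitably serve English speakers far better than everyone else.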
Many AI breakthroughs started with a great dataset. Here are some you should know: MNIST (70,000 handwritten digits, the classic starting point for image recognition), ImageNet (millions of labelled images that helped spark the deep learning boom), and Iris (150 flower measurements, which we'll explore below).
The entire MNIST dataset is under 15 MB, smaller than a single smartphone photo! Yet it launched thousands of AI careers. You don't always need "big data" to learn big concepts.
Let's explore a real dataset using Python. We'll use the famous Iris dataset: 150 measurements of flowers with 4 features each.
```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[t] for t in iris.target]

# Basic exploration
print("Shape:", df.shape)  # (150, 5)
print("\nFirst 5 rows:")
print(df.head())
print("\nSpecies counts:")
print(df['species'].value_counts())  # 50 of each species
print("\nBasic statistics:")
print(df.describe())
```
When you explore a dataset, always ask: How many examples does it contain? What does each feature mean? Are the classes balanced? And who, or what, might be missing?
You now know what fuels AI. In the next lesson, we'll explore algorithms, the step-by-step recipes that turn data into intelligent decisions. Think of it this way: data is the ingredients, and algorithms are the cooking instructions!