AI 萌芽 • Intermediate • ⏱️ 30 min read

Feature Engineering: Teaching Machines What Matters

There's a saying in machine learning: garbage in, garbage out. You can have the most sophisticated algorithm in the world, but if you feed it poorly prepared data, the results will be poor. Conversely, a simple algorithm given excellent, thoughtfully prepared data can outperform a complex algorithm given raw, messy data.

Feature engineering is the craft of transforming raw data into representations that machine learning algorithms can learn from effectively. It's arguably the single most impactful skill in applied machine learning — and it's where deep domain knowledge meets data science.

🧩 What Are Features?

In machine learning, a feature is any measurable property or attribute of the thing you're trying to predict. Features are the inputs to your model; the label (or target) is the output.

For a house price prediction model:

  • Features: square footage, number of bedrooms, postcode, year built, distance to nearest school
  • Label: sale price

For an email spam classifier:

  • Features: word frequencies, sender domain, presence of certain phrases, email length
  • Label: spam (1) or not spam (0)

The quality, quantity, and relevance of your features are often more important than which algorithm you choose.
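To make this concrete, the house-price example above can be sketched as a tiny feature matrix and label vector (all values invented for illustration):

```python
import pandas as pd

# Toy version of the house-price example (hypothetical numbers)
houses = pd.DataFrame({
    'sqft':       [1400, 2100, 900],            # feature
    'bedrooms':   [3, 4, 2],                    # feature
    'year_built': [1998, 2015, 1975],           # feature
    'price':      [250_000, 410_000, 160_000],  # label
})

X = houses.drop(columns='price')  # features: the model's inputs
y = houses['price']               # label: what the model predicts
```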

🗃️ Why Raw Data Is Rarely Model-Ready

Real-world data is messy. Machine learning algorithms have specific requirements that raw data almost never satisfies:

  • Numerical only: most algorithms can't directly process text, categories, or dates
  • No missing values: most algorithms can't handle NaN or null values
  • Similar scales: features with very different value ranges can distort gradient-based learning
  • Meaningful representation: raw timestamps or postcodes don't encode the patterns that matter (is it a weekend? is it a wealthy area?)

Feature engineering is the process of bridging the gap between raw data and model-ready inputs.
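As an illustration, here is a hypothetical raw frame that violates each of these requirements at once — most scikit-learn estimators would reject it as-is, since strings can't be cast to float and NaN values are refused:

```python
import numpy as np
import pandas as pd

# A raw frame breaking every requirement above (made-up data)
raw = pd.DataFrame({
    'city':   ['London', 'Leeds', np.nan],                    # text category + missing
    'signup': ['2024-01-05', '2024-03-12', '2024-07-30'],     # raw date strings
    'income': [18000, np.nan, 95000],                         # missing value, large scale
    'rating': [0.2, 0.9, 0.5],                                # tiny scale vs income
})
```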

📏 Normalisation and Standardisation

When features have very different scales, models can be misled. A feature with values in the thousands can dominate a feature with values between 0 and 1.

Min-Max Normalisation (Scaling to [0, 1])

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# Before: age=[22, 65, 34], income=[18000, 95000, 42000]
# After:  age=[0.0, 1.0, 0.28], income=[0.0, 1.0, 0.31]
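You can verify the age column by applying the min-max formula by hand — a quick sanity check rather than part of a real pipeline:

```python
import numpy as np

# Min-max by hand: (value - min) / (max - min)
age = np.array([22, 65, 34], dtype=float)
scaled = (age - age.min()) / (age.max() - age.min())
# scaled[2] = (34 - 22) / (65 - 22) = 12/43 ≈ 0.28
```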

Standardisation (Z-score, mean=0, std=1)

More robust to outliers than min-max scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Transforms each feature to have mean=0, standard deviation=1
# Feature value → (value - mean) / std_dev

Rule of thumb: use standardisation by default. Use min-max when you need values in a specific bounded range (e.g., neural network inputs).

🏷️ Encoding Categorical Variables

Categories like "red/green/blue" or "London/Manchester/Birmingham" have no natural numerical ordering. You need to encode them before feeding them to most algorithms.

One-Hot Encoding

Creates a binary column for each category. Best when categories have no inherent order:

import pandas as pd

# Original: colour = ['red', 'green', 'blue', 'red']
df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'red']})

# dtype=int gives 0/1 columns (pandas ≥ 2.0 defaults to True/False)
encoded = pd.get_dummies(df['colour'], prefix='colour', dtype=int)
print(encoded)

#    colour_blue  colour_green  colour_red
# 0            0             0           1
# 1            0             1           0
# 2            1             0           0
# 3            0             0           1

Label (Ordinal) Encoding

Maps each category to an integer. Only use when the categories have a natural order:

from sklearn.preprocessing import OrdinalEncoder

# Works for: ['low', 'medium', 'high'] → [0, 1, 2]
# DANGER: Don't use for unordered categories like cities —
# the model will incorrectly assume London(0) < Paris(1) < Tokyo(2)

# sklearn's LabelEncoder assigns integers alphabetically, which would
# scramble this ordering — OrdinalEncoder lets you state it explicitly
encoder = OrdinalEncoder(
    categories=[['school', 'undergraduate', 'postgraduate']]
)
df['education_level'] = encoder.fit_transform(df[['education_level']])
# 'school' → 0, 'undergraduate' → 1, 'postgraduate' → 2
🤯
A common beginner mistake is to label-encode city names, accidentally telling the model that Tokyo is "more than" London, which leads to bizarre predictions. Always use one-hot encoding for unordered categories.

🕳️ Handling Missing Values

Missing data is almost universal in real-world datasets. You have several options:

Imputation

Fill missing values with a calculated substitute:

from sklearn.impute import SimpleImputer
import numpy as np

# Strategy options: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# For numerical data: median is robust to outliers
# For categorical data: use 'most_frequent'

Adding a Missingness Indicator

Sometimes the fact that data is missing is itself informative:

# Create a binary flag: 1 if income was missing, 0 if present
df['income_missing'] = df['income'].isna().astype(int)

# Then impute the original column (avoid inplace=True on a column —
# chained-assignment fills are deprecated in recent pandas)
df['income'] = df['income'].fillna(df['income'].median())

When to Drop

Drop a column if it has more than ~70-80% missing values (rarely informative). Drop a row only if you have plenty of data and the row is randomly missing (not systematically).
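A quick sketch of the column rule, using a made-up frame and an arbitrary 70% threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one usable column, one that is 75% missing
df = pd.DataFrame({
    'useful':       [1.0, 2.0, np.nan, 4.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 7.0],
})

missing_frac = df.isna().mean()  # fraction of missing values per column
df = df.drop(columns=missing_frac[missing_frac > 0.70].index)
# Only 'useful' survives
```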

🛠️ Creating New Features

This is where feature engineering becomes an art. You use domain knowledge to create features that capture patterns not visible in the raw data.

Date/Time Decomposition

df['purchase_datetime'] = pd.to_datetime(df['purchase_datetime'])

# Extract meaningful components
df['hour_of_day']   = df['purchase_datetime'].dt.hour
df['day_of_week']   = df['purchase_datetime'].dt.dayofweek
df['is_weekend']    = (df['day_of_week'] >= 5).astype(int)
df['month']         = df['purchase_datetime'].dt.month
df['is_holiday']    = df['purchase_datetime'].dt.date.isin(uk_holidays).astype(int)  # uk_holidays: a predefined list of holiday dates

A raw timestamp tells the model nothing directly; these derived features expose patterns like "fraud peaks at 3am" or "sales spike at weekends".

Interaction Features

# Square footage × number of bathrooms might be more predictive
# than either feature alone for house prices
df['size_per_bathroom'] = df['sqft'] / df['bathrooms']

# Ratio of income to loan amount — a classic credit risk feature
df['debt_to_income'] = df['loan_amount'] / df['annual_income']

Text Features

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert raw text to numerical feature matrix
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
text_features = vectorizer.fit_transform(df['review_text'])

# Each word becomes a feature; its value reflects how important
# the word is in that document relative to the whole corpus
🤔
Think about it: If you were building a model to predict restaurant health inspection failures, what raw data might you have access to, and what new features could you engineer from it that would be more useful than the raw data alone?

📊 Feature Importance

Once you've built a model, you can measure which features it found most useful. This is valuable both for understanding your model and for deciding which features to keep or drop:

from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = pd.Series(
    model.feature_importances_,
    index=feature_names  # feature_names: list of column names in X_train
).sort_values(ascending=False)

print(importances.head(10))

# income_to_debt_ratio     0.187
# days_since_last_payment  0.143
# credit_utilisation       0.121
# ...

Features with near-zero importance can usually be dropped without affecting performance — and dropping them simplifies the model, reduces training time, and can improve generalisation.
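As a self-contained sketch of this pruning step — using scikit-learn's built-in breast-cancer dataset and an arbitrary 1% importance threshold:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Built-in dataset stands in for your own feature matrix
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep only features the model actually uses (1% cutoff is a judgment call)
keep = importances[importances > 0.01].index
X_slim = X[keep]
print(f"Kept {X_slim.shape[1]} of {X.shape[1]} features")
```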

🗜️ Dimensionality Reduction: A Brief Introduction

When you have hundreds or thousands of features, models can struggle — this is called the curse of dimensionality. Dimensionality reduction techniques compress features while preserving the most important patterns:

from sklearn.decomposition import PCA

# Reduce 100 features to 10 principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
# Variance explained: 94.7%

PCA (Principal Component Analysis) is the most common approach, but others include t-SNE (for visualisation) and UMAP (for preserving local structure).
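scikit-learn can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps just enough components to reach that variance target. A minimal sketch on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 correlated features made by randomly mixing 20 signals
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))

# A float target keeps the smallest number of components
# whose cumulative explained variance reaches 95%
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], "components retained")
```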

🧠 Quiz

You have a dataset with a 'city' column containing 50 different city names. What encoding approach should you use?

Key Takeaways

  • Features are the inputs to your model; feature quality often matters more than algorithm choice
  • Raw data is almost never model-ready — it needs cleaning, transformation, and enrichment
  • Normalisation/standardisation puts features on comparable scales; use standardisation by default
  • One-hot encoding is correct for unordered categorical variables; label encoding only for ordered ones
  • Missing values can be handled by imputation (median/mean/mode) and by adding a missingness indicator flag
  • Creating new features from domain knowledge — time decomposition, ratios, interactions — often produces the biggest performance gains
  • Feature importance scores help you understand which inputs your model relies on and which can be dropped
  • Dimensionality reduction (e.g., PCA) compresses high-dimensional data while preserving most of the useful variance
Lesson 13 of 16
← Overfitting and Underfitting: Why Machine Learning Models Fail
Supervised vs Unsupervised Learning: The Key Differences Explained →