There's a saying in machine learning: garbage in, garbage out. You can have the most sophisticated algorithm in the world, but if you feed it poorly prepared data, the results will be poor. Conversely, a simple algorithm given excellent, thoughtfully prepared data can outperform a complex algorithm given raw, messy data.
Feature engineering is the craft of transforming raw data into representations that machine learning algorithms can learn from effectively. It's arguably the single most impactful skill in applied machine learning — and it's where deep domain knowledge meets data science.
In machine learning, a feature is any measurable property or attribute of the thing you're trying to predict. Features are the inputs to your model; the label (or target) is the output.
For a house price prediction model: square footage, number of bedrooms, location, and property age are all features; the sale price is the label.
For an email spam classifier: the sender's domain, the subject-line length, and word frequencies in the body are features; spam-or-not is the label.
The quality, quantity, and relevance of your features are often more important than which algorithm you choose.
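To make the feature/label split concrete, here is a minimal sketch with illustrative column names, separating a feature matrix X from a label vector y in pandas:

```python
import pandas as pd

# Toy house-price data; column names and values are illustrative
df = pd.DataFrame({
    'sqft':     [1400, 2100, 950],
    'bedrooms': [3, 4, 2],
    'price':    [250000, 410000, 180000],
})

X = df[['sqft', 'bedrooms']]  # features: the model's inputs
y = df['price']               # label: the value to predict

print(X.shape, y.shape)  # (3, 2) (3,)
```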
Real-world data is messy. Machine learning algorithms have specific requirements that raw data almost never satisfies: numeric inputs, comparable scales, and no missing values.
Feature engineering is the process of bridging the gap between raw data and model-ready inputs.
When features have very different scales, models can be misled. A feature with values in the thousands can dominate a feature with values between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
# Before: age=[22, 65, 34], income=[18000, 95000, 42000]
# After: age=[0.0, 1.0, 0.28], income=[0.0, 1.0, 0.31]
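One detail the snippet above glosses over: fit the scaler on the training data only, then apply the same learned minimum and maximum to any new data. Fitting again on the test set would leak information and produce inconsistent scales. A sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[22.0], [65.0], [34.0]])
X_test = np.array([[40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=22, max=65
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max

print(X_test_scaled[0, 0])  # (40 - 22) / (65 - 22) ≈ 0.4186
```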
Standardisation is more robust to outliers than min-max scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Transforms each feature to have mean=0, standard deviation=1
# Feature value → (value - mean) / std_dev
Rule of thumb: use standardisation by default. Use min-max when you need values in a specific bounded range (e.g., neural network inputs).
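A quick numerical check of what StandardScaler produces, reusing the toy ages from above (note it uses the population standard deviation, ddof=0):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[22.0], [65.0], [34.0]])
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean())  # ≈ 0 (up to floating-point error)
print(X_scaled.std())   # ≈ 1.0 (population std, ddof=0)
```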
Categories like "red/green/blue" or "London/Manchester/Birmingham" have no natural numerical ordering. You need to encode them before feeding them to most algorithms.
Creates a binary column for each category. Best when categories have no inherent order:
import pandas as pd
# Original: colour = ['red', 'green', 'blue', 'red']
df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'red']})
# dtype=int gives 0/1 columns (pandas >= 2.0 defaults to True/False)
encoded = pd.get_dummies(df['colour'], prefix='colour', dtype=int)
print(encoded)
# colour_blue colour_green colour_red
# 0 0 0 1
# 1 0 1 0
# 2 1 0 0
# 3 0 0 1
Maps each category to an integer. Only use when the categories have a natural order:
from sklearn.preprocessing import OrdinalEncoder
# Works for: ['low', 'medium', 'high'] → [0, 1, 2]
# DANGER: Don't use for unordered categories like cities —
# the model will incorrectly assume London(0) < Paris(1) < Tokyo(2)
# (sklearn's LabelEncoder sorts categories alphabetically and is meant
# for targets, so pass the intended order explicitly to OrdinalEncoder)
encoder = OrdinalEncoder(
    categories=[['school', 'undergraduate', 'postgraduate']]
)
df['education_level'] = encoder.fit_transform(df[['education_level']])
# 'school' → 0, 'undergraduate' → 1, 'postgraduate' → 2
Missing data is almost universal in real-world datasets. You have several options:
Fill missing values with a calculated substitute:
from sklearn.impute import SimpleImputer
import numpy as np
# Strategy options: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# For numerical data: median is robust to outliers
# For categorical data: use 'most_frequent'
Sometimes the fact that data is missing is itself informative:
# Create a binary flag: 1 if income was missing, 0 if present
df['income_missing'] = df['income'].isna().astype(int)
# Then impute the original column
df['income'] = df['income'].fillna(df['income'].median())
Drop a column if it has more than ~70-80% missing values (rarely informative). Drop a row only if you have plenty of data and the row is randomly missing (not systematically).
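The column-level rule above is easy to apply mechanically. A sketch that drops columns past a 70% missing threshold (the threshold and column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_missing': [1.0, np.nan, np.nan, np.nan],  # 75% missing
    'mostly_present': [1.0, 2.0, np.nan, 4.0],        # 25% missing
})

missing_fraction = df.isna().mean()  # fraction missing per column
keep = missing_fraction[missing_fraction <= 0.7].index
df_clean = df[keep]

print(list(df_clean.columns))  # ['mostly_present']
```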
This is where feature engineering becomes an art. You use domain knowledge to create features that capture patterns not visible in the raw data.
df['purchase_datetime'] = pd.to_datetime(df['purchase_datetime'])
# Extract meaningful components
df['hour_of_day'] = df['purchase_datetime'].dt.hour
df['day_of_week'] = df['purchase_datetime'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['month'] = df['purchase_datetime'].dt.month
# Assumes uk_holidays is a collection of dates (e.g. from the `holidays` package);
# compare calendar dates, not full timestamps
df['is_holiday'] = df['purchase_datetime'].dt.date.isin(uk_holidays).astype(int)
A raw timestamp tells the model nothing directly; these derived features expose patterns like "fraud peaks at 3am" or "sales spike at weekends".
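One refinement often applied to these components: hour 23 and hour 0 are adjacent in time but numerically far apart, so a linear model sees them as opposites. Encoding the hour as a point on a circle with sine and cosine preserves that adjacency; a sketch:

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 6, 12, 23])

# Map each hour onto the unit circle so 23:00 and 00:00 end up close
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# e.g. hour 0 → (sin 0.0, cos 1.0); hour 23 → (sin −0.259, cos 0.966)
```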
# Square footage per bathroom might be more predictive
# than either feature alone for house prices
df['size_per_bathroom'] = df['sqft'] / df['bathrooms']
# Ratio of income to loan amount — a classic credit risk feature
df['debt_to_income'] = df['loan_amount'] / df['annual_income']
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert raw text to numerical feature matrix
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
text_features = vectorizer.fit_transform(df['review_text'])
# Each word becomes a feature; its value reflects how important
# the word is in that document relative to the whole corpus
Once you've built a model, you can measure which features it found most useful. This is valuable both for understanding your model and for deciding which features to keep or drop:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Get feature importances
importances = pd.Series(
model.feature_importances_,
index=feature_names
).sort_values(ascending=False)
print(importances.head(10))
# income_to_debt_ratio 0.187
# days_since_last_payment 0.143
# credit_utilisation 0.121
# ...
Features with near-zero importance can usually be dropped without affecting performance — and dropping them simplifies the model, reduces training time, and can improve generalisation.
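That pruning step can be scripted with scikit-learn's SelectFromModel, which keeps only the features whose importance clears a threshold. A sketch on synthetic data (the dataset and threshold are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 genuinely informative
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, n_redundant=0,
                           random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
# Keep features whose importance is above the mean importance
selector = SelectFromModel(model, threshold='mean').fit(X, y)
X_pruned = selector.transform(X)

print(X.shape[1], '->', X_pruned.shape[1])  # fewer columns survive
```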
When you have hundreds or thousands of features, models can struggle — this is called the curse of dimensionality. Dimensionality reduction techniques compress features while preserving the most important patterns:
from sklearn.decomposition import PCA
# Reduce 100 features to 10 components, keeping most of the variance
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
# Variance explained: 94.70%
PCA (Principal Component Analysis) is the most common approach, but others include t-SNE (for visualisation) and UMAP (for preserving local structure).
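Rather than fixing the component count in advance, PCA also accepts a float n_components meaning "keep enough components to reach this fraction of variance". A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
X[:, :5] *= 10  # five high-variance directions dominate

# A float between 0 and 1 means "retain this fraction of variance"
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], 'components retained')
```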
You have a dataset with a 'city' column containing 50 different city names. What encoding approach should you use?