Feature Engineering for Machine Learning

What Is Feature Engineering?

Feature engineering is the process of using domain knowledge to create, transform, or select input variables (features) that make machine learning models more accurate. Raw data rarely arrives in the form that best exposes the underlying patterns: a transaction timestamp is more useful as day-of-week and hour-of-day, and a free-text address is more useful as a country code and population density. In practice, the quality of features often matters more than the choice of algorithm.

This article covers the most important feature engineering techniques with Python code examples using pandas and scikit-learn.

Types of Feature Engineering

Category	What It Does	Example
Feature creation	Derive new columns from existing ones using domain knowledge	order_value / customer_age = spend_per_year_of_life
Feature transformation	Change the scale or distribution of an existing feature	Log-transform revenue to reduce right skew
Feature encoding	Convert non-numeric features into numeric form	One-hot encode country; ordinal-encode size (S/M/L/XL)
Feature extraction	Derive structured information from unstructured data	Extract hour, day-of-week from timestamp; TF-IDF from text
Feature selection	Remove irrelevant or redundant features	Drop columns with near-zero variance or high multicollinearity
Interaction features	Capture relationships between two or more features	price × discount_rate; age × income_bracket

Encoding Categorical Variables

Most ML algorithms require numeric inputs, so categorical columns must be encoded. The encoding strategy depends on cardinality (number of unique values) and whether the categories have a natural order.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

df = pd.read_csv('orders.csv')

# One-hot encoding — best for low-cardinality nominal categories
df = pd.get_dummies(df, columns=['country', 'product_category'], drop_first=True)

# Ordinal encoding — use when categories have a meaningful order
size_order = [['XS', 'S', 'M', 'L', 'XL', 'XXL']]
enc = OrdinalEncoder(categories=size_order)
df['size_encoded'] = enc.fit_transform(df[['size']])

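# Target encoding: replace category with mean of target variable
# Useful for high-cardinality columns (e.g., zip code, product_id)
# Caution: means computed on rows that are later used for evaluation leak the label;
# in practice compute them on training rows only (ideally out-of-fold) and map them onto the rest
target_means = df.groupby('city')['revenue'].mean()
df['city_encoded'] = df['city'].map(target_means)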

# Frequency encoding — replace with how often the value appears
freq = df['city'].value_counts() / len(df)
df['city_freq'] = df['city'].map(freq)

Encoding Strategies by Cardinality

Cardinality	Recommended Encoding	Why	Watch Out For
Binary (2 values)	Label encode (0/1)	Simple; no dimensionality increase	Ensure consistent 0/1 assignment across train/test
Low (<15 values), nominal	One-hot encoding	No false ordinal relationship implied	Dimensionality explosion; drop_first=True to avoid multicollinearity
Low, ordinal	Ordinal encoding	Preserves natural ordering	Must define explicit order; wrong order adds noise
High (15+ values)	Target or frequency encoding	Avoids wide sparse matrices	Target encoding leaks label info; compute it out-of-fold using cross-validation folds
Very high (IDs, zip codes)	Embedding (neural nets) or hashing	Compact representation; hashing maps unseen values without refitting	Hashing causes collisions; embeddings need training data and an explicit bucket for unseen values
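
The table's warning about target encoding leaking label information can be made concrete with an out-of-fold variant. The following is a minimal sketch, not a definitive implementation: the helper name oof_target_encode, the toy data, and the choice of five folds are all assumptions made for illustration. Each row is encoded with category means computed on the other folds, so a row's own target never contributes to its encoded value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train, col, target, n_splits=5, seed=0):
    """Hypothetical helper: out-of-fold mean-target encoding of train[col]."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        fit_part = train.iloc[fit_idx]
        # Category means come only from the rows outside the current fold
        means = fit_part.groupby(col)[target].mean()
        fold_values = train.iloc[enc_idx][col].map(means)
        # Categories unseen in the fitting rows fall back to the fitting-rows mean
        fold_values = fold_values.fillna(fit_part[target].mean())
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D"],
    "revenue": [10.0, 12.0, 11.0, 50.0, 55.0, 45.0, 5.0, 6.0, 7.0, 100.0],
})
toy["city_encoded"] = oof_target_encode(toy, "city", "revenue")
print(toy)
```

City D appears only once, so its encoded value comes entirely from other rows rather than from its own revenue of 100. For the test set, one reasonable choice is to map category means computed on the full training set.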

Numeric Transformations and Scaling

Many algorithms (linear regression, SVM, KNN, neural networks) are sensitive to feature scale. Tree-based models (random forests, XGBoost) are not. Always scale after splitting into train and test sets — fit the scaler on the training set only to avoid data leakage.

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np

# StandardScaler: mean=0, std=1 — works well for normally distributed features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform on train
X_test_scaled = scaler.transform(X_test)          # transform only on test

# MinMaxScaler: scales to [0,1] — good for neural networks and image data
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)

# RobustScaler: uses median and IQR — best when data has outliers
rob_scaler = RobustScaler()
X_train_rob = rob_scaler.fit_transform(X_train)

# Log transformation for right-skewed numeric features
df['revenue_log'] = np.log1p(df['revenue'])  # log1p = log(1+x), handles 0

# Box-Cox transformation (requires strictly positive values, hence the +1 shift;
# this assumes revenue is never negative)
from scipy.stats import boxcox
df['revenue_bc'], lambda_val = boxcox(df['revenue'] + 1)
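
The fit-on-training-data rule stated for scalers applies to Box-Cox as well, because the lambda parameter is estimated from the data. The sketch below assumes synthetic log-normal revenue (the arrays revenue_train and revenue_test are made up for illustration) and shows one way to estimate lambda on the training values and reuse it on the test values via the lmbda argument of scipy.stats.boxcox.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic right-skewed data standing in for revenue (an assumption)
rng = np.random.default_rng(0)
revenue_train = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
revenue_test = rng.lognormal(mean=3.0, sigma=1.0, size=200)

# Estimate lambda on the training values only
train_bc, lam = boxcox(revenue_train + 1)

# Reuse the training lambda on the test values; with lmbda given,
# boxcox returns only the transformed array
test_bc = boxcox(revenue_test + 1, lmbda=lam)

print("lambda estimated on train:", lam)
print("skew before:", skew(revenue_train), "after:", skew(train_bc))
```

scikit-learn's PowerTransformer offers the same transformation with a fit/transform interface, which may be more convenient inside a pipeline.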

Datetime Feature Extraction

Timestamps contain rich information that models cannot use directly. Extracting components like hour, day-of-week, and month unlocks cyclical patterns that drive many business metrics.

import pandas as pd

df['order_date'] = pd.to_datetime(df['order_date'])

# Extract calendar components
df['hour'] = df['order_date'].dt.hour
df['day_of_week'] = df['order_date'].dt.dayofweek   # 0=Monday, 6=Sunday
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['year'] = df['order_date'].dt.year
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_end'] = df['order_date'].dt.is_month_end.astype(int)

# Cyclical encoding: convert periodic features to sin/cos pairs
# so the model knows that hour 23 is close to hour 0
import numpy as np
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Days since a reference event (e.g., customer signup)
# Assumes signup_date is already a datetime column; if not, convert it with pd.to_datetime first
df['days_since_signup'] = (df['order_date'] - df['signup_date']).dt.days
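
To see why the sin/cos pair helps, it can be useful to compare distances between hours in the encoded space. The sketch below uses a hypothetical helper named cyclical; it is only an illustration of the idea, not part of pandas or scikit-learn.

```python
import numpy as np


def cyclical(value, period):
    """Hypothetical helper: map a periodic value to a point on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])


# Distance between encoded hours
d_23_0 = np.linalg.norm(cyclical(23, 24) - cyclical(0, 24))    # about 0.26
d_0_1 = np.linalg.norm(cyclical(0, 24) - cyclical(1, 24))      # about 0.26
d_23_12 = np.linalg.norm(cyclical(23, 24) - cyclical(12, 24))  # about 1.98

print(d_23_0, d_0_1, d_23_12)
```

With the raw integer, hour 23 and hour 0 are 23 units apart. After encoding, they are as close as any other pair of adjacent hours, while hours on opposite sides of the clock stay far apart.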

Creating Interaction and Aggregation Features

# Ratio and interaction features
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1)   # +1 avoids div/0
df['discount_impact'] = df['original_price'] * df['discount_rate']
df['cart_value_x_sessions'] = df['cart_value'] * df['sessions']

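# Aggregation features: group-level statistics joined back
# Use a fixed reference date (here the latest order in the data) so results are reproducible;
# for modelling, compute these statistics on training rows only to avoid leaking future orders
reference_date = df['order_date'].max()
customer_stats = df.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    avg_order_value=('revenue', 'mean'),
    max_order_value=('revenue', 'max'),
    days_since_last_order=('order_date', lambda x: (reference_date - x.max()).days)
).reset_index()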

df = df.merge(customer_stats, on='customer_id', how='left')

# Lag features for time series / sequential data
df = df.sort_values(['customer_id', 'order_date'])
df['prev_order_value'] = df.groupby('customer_id')['revenue'].shift(1)
df['revenue_change'] = df['revenue'] - df['prev_order_value']
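
Group-level aggregates computed over the whole table include each customer's later orders, which can leak future information into earlier rows. One possible alternative is to build point-in-time versions that look only at orders placed before the current one. The sketch below assumes a small made-up orders table; the column names prior_orders, prior_avg_value, and days_since_prev_order are illustrative.

```python
import pandas as pd

# Toy order history (illustrative values only)
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-10", "2024-02-01", "2024-01-05", "2024-03-01"]
    ),
    "revenue": [20.0, 40.0, 60.0, 100.0, 50.0],
})
orders = orders.sort_values(["customer_id", "order_date"]).reset_index(drop=True)

# Number of orders the customer placed before this one
orders["prior_orders"] = orders.groupby("customer_id").cumcount()

# Mean revenue of earlier orders only: shift(1) excludes the current row,
# then expanding().mean() averages everything seen up to that point
orders["prior_avg_value"] = orders.groupby("customer_id")["revenue"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# Days since the customer's previous order (NaN for a first order)
orders["days_since_prev_order"] = (
    orders.groupby("customer_id")["order_date"].diff().dt.days
)

print(orders)
```

First orders have no history, so their prior_avg_value and days_since_prev_order are missing and need an imputation strategy or a model that handles missing values.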

Feature Selection

More features are not always better: irrelevant or redundant features add noise, slow training, and hurt interpretability. Feature selection aims to find a smaller subset that preserves or improves model performance.

Method	How It Works	Pros	Cons
Variance threshold	Remove features with near-zero variance	Fast; removes useless constant columns	Misses redundant features with non-zero variance
Correlation filter	Drop one of any pair with |correlation| > threshold	Removes multicollinearity	Does not consider relationship with target
Univariate tests (SelectKBest)	Score each feature vs the target independently	Fast; interpretable	Ignores interactions between features
Recursive Feature Elimination	Iteratively remove least important features per model	Considers feature interactions	Slow for high-dimensional data
Feature importance (tree models)	Use impurity-based or permutation importance from fitted model	Captures non-linear importance	Impurity importance biased toward high-cardinality features

from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# 1. Remove near-zero variance features
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)

# 2. Remove highly correlated features
# Keep only the upper triangle so each pair of features is checked once
corr_matrix = pd.DataFrame(X).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if any(upper[c] > 0.95)]
X_uncorrelated = pd.DataFrame(X).drop(columns=to_drop)

# 3. Univariate selection: keep top 20 features by F-score
kbest = SelectKBest(f_classif, k=20)
X_kbest = kbest.fit_transform(X, y)

# 4. Feature importance from Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# feature_names is assumed to hold the column names, e.g. list(X_train.columns) for a DataFrame
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(15))
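
The table also lists Recursive Feature Elimination and permutation importance, which the snippet above does not exercise. The sketch below shows one way they might be combined on synthetic data; the dataset from make_classification, the logistic regression estimator, and the choice of three features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, 3 of them informative (an assumption)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Recursive Feature Elimination: repeatedly drop the weakest feature
# (by coefficient magnitude) until three remain; fit on training data only
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_train, y_train)
selected = rfe.support_  # boolean mask over the original columns

# Permutation importance on held-out data: how much the score drops
# when one feature's values are shuffled
model = LogisticRegression(max_iter=1000).fit(X_train[:, selected], y_train)
perm = permutation_importance(
    model, X_test[:, selected], y_test, n_repeats=10, random_state=0
)

print("selected columns:", selected.nonzero()[0])
print("permutation importances:", perm.importances_mean)
```

Measuring permutation importance on held-out rows avoids the bias toward high-cardinality features that the table attributes to impurity-based importance.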

Avoiding Data Leakage

Data leakage occurs when information that would not be available at prediction time flows into your features, making validation metrics look better than real-world performance. It is one of the most common and costly mistakes in ML projects.

Temporal leakage: using future data to predict the past — always split on time, never randomly, for time-ordered data
Target leakage: including features that are consequences of the target (e.g., churn_reason when predicting churn)
Preprocessing leakage: fitting scalers, encoders, or imputers on the full dataset before splitting — always fit on the training set only
Group leakage: train and test share customers, sessions, or other correlated groups — use group-aware splits
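
Two of the leakage types above, preprocessing leakage and group leakage, can be addressed together by wrapping preprocessing in a scikit-learn Pipeline and evaluating it with a group-aware splitter. The following is a minimal sketch under stated assumptions: the data is synthetic, and the column names customer_id, country, revenue, and sessions are chosen only to echo the earlier examples.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "customer_id": rng.integers(0, 40, size=n),
    "country": rng.choice(["US", "DE", "JP"], size=n),
    "revenue": rng.lognormal(3.0, 1.0, size=n),
    "sessions": rng.integers(1, 20, size=n).astype(float),
})
data.loc[rng.choice(n, size=10, replace=False), "sessions"] = np.nan
y = ((np.log(data["revenue"]) + rng.normal(0, 1, n)) > 3).astype(int)

numeric = ["revenue", "sessions"]
categorical = ["country"]

# Imputer, scaler, and encoder live inside the pipeline, so each
# cross-validation split refits them on that split's training rows only
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# GroupKFold keeps all rows of a customer on the same side of each split
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    model, data[numeric + categorical], y, groups=data["customer_id"], cv=cv
)
print(scores)
```

For time-ordered data, a splitter such as scikit-learn's TimeSeriesSplit could be passed as cv instead, provided the rows are sorted by time.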

Summary

Feature engineering is where domain expertise and data intuition translate into model performance. The best features are usually derived from asking business questions: what would a human analyst look at to predict this outcome? Encoding, scaling, and datetime extraction are mechanical steps; the creative value comes from constructing interaction features and aggregations that capture behaviour invisible in the raw columns. Always use a pipeline so that every preprocessing step is fit on the training data only and then applied unchanged to the test data, which prevents leakage and keeps results reproducible.