What Is Feature Engineering?
Feature engineering is the process of using domain knowledge to create, transform, or select input variables (features) that make machine learning models more accurate. Raw data rarely arrives in the form that best exposes the underlying patterns: a transaction timestamp is more useful as day-of-week and hour-of-day, and a free-text address is more useful as a country code and population density. On tabular problems, the quality of the features often matters more than the choice of algorithm.
This article covers the most important feature engineering techniques with Python code examples using pandas and scikit-learn.
Types of Feature Engineering
| Category | What It Does | Example |
|---|---|---|
| Feature creation | Derive new columns from existing ones using domain knowledge | order_value / customer_age = spend_per_year_of_life |
| Feature transformation | Change the scale or distribution of an existing feature | Log-transform revenue to reduce right skew |
| Feature encoding | Convert non-numeric features into numeric form | One-hot encode country; ordinal-encode size (S/M/L/XL) |
| Feature extraction | Derive structured information from unstructured data | Extract hour, day-of-week from timestamp; TF-IDF from text |
| Feature selection | Remove irrelevant or redundant features | Drop columns with near-zero variance or high multicollinearity |
| Interaction features | Capture relationships between two or more features | price × discount_rate; age × income_bracket |
Encoding Categorical Variables
Most ML algorithms require numeric inputs, so categorical columns must be encoded. The encoding strategy depends on cardinality (number of unique values) and whether the categories have a natural order.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
df = pd.read_csv('orders.csv')
# One-hot encoding — best for low-cardinality nominal categories
df = pd.get_dummies(df, columns=['country', 'product_category'], drop_first=True)
# Ordinal encoding — use when categories have a meaningful order
size_order = [['XS', 'S', 'M', 'L', 'XL', 'XXL']]
enc = OrdinalEncoder(categories=size_order)
df['size_encoded'] = enc.fit_transform(df[['size']])
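# Target encoding: replace each category with the mean of the target variable
# Useful for high-cardinality columns (e.g., zip code, product_id)
# Caution: means computed on the full dataset leak the target into the feature;
# in practice, compute them on training data (or within cross-validation folds) only
target_means = df.groupby('city')['revenue'].mean()
df['city_encoded'] = df['city'].map(target_means)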
# Frequency encoding — replace with how often the value appears
freq = df['city'].value_counts() / len(df)
df['city_freq'] = df['city'].map(freq)
Encoding Strategies by Cardinality
| Cardinality | Recommended Encoding | Why | Watch Out For |
|---|---|---|---|
| Binary (2 values) | Label encode (0/1) | Simple; no dimensionality increase | Ensure consistent 0/1 assignment across train/test |
| Low (<15 values), nominal | One-hot encoding | No false ordinal relationship implied | Dimensionality explosion; drop_first=True to avoid multicollinearity |
| Low, ordinal | Ordinal encoding | Preserves natural ordering | Must define explicit order; wrong order adds noise |
| High (≥15 values) | Target or frequency encoding | Avoids wide sparse matrices | Target encoding leaks label info; compute it within cross-validation folds |
| Very high (IDs, zip codes) | Embedding (neural nets) or hashing | Handles unseen values | Hashing causes collisions; embeddings need training data |
Numeric Transformations and Scaling
Many algorithms (linear regression, SVM, KNN, neural networks) are sensitive to feature scale. Tree-based models (random forests, XGBoost) are not. Always scale after splitting into train and test sets — fit the scaler on the training set only to avoid data leakage.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
# StandardScaler: mean=0, std=1 — works well for normally distributed features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform on train
X_test_scaled = scaler.transform(X_test) # transform only on test
# MinMaxScaler: scales to [0,1] — good for neural networks and image data
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)
# RobustScaler: uses median and IQR — best when data has outliers
rob_scaler = RobustScaler()
X_train_rob = rob_scaler.fit_transform(X_train)
# Log transformation for right-skewed numeric features
df['revenue_log'] = np.log1p(df['revenue']) # log1p = log(1+x), handles 0
# Box-Cox transformation (requires strictly positive values;
# the +1 shift assumes revenue is never negative)
from scipy.stats import boxcox
df['revenue_bc'], lambda_val = boxcox(df['revenue'] + 1)
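To see what the log transformation does to a right-skewed feature, the sketch below measures skewness before and after `np.log1p`. The revenue values are synthetic (drawn from a log-normal distribution purely for illustration), so the exact numbers say nothing about real data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic, right-skewed "revenue" values (an assumption for this demo)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)

skew_raw = skew(revenue)            # strongly positive for right-skewed data
skew_log = skew(np.log1p(revenue))  # much closer to zero after the transform

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness near zero suggests a roughly symmetric distribution, which tends to suit linear models and distance-based methods better than a long right tail.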
Datetime Feature Extraction
Timestamps contain rich information that models cannot use directly. Extracting components like hour, day-of-week, and month unlocks cyclical patterns that drive many business metrics.
import pandas as pd
df['order_date'] = pd.to_datetime(df['order_date'])
# Extract calendar components
df['hour'] = df['order_date'].dt.hour
df['day_of_week'] = df['order_date'].dt.dayofweek # 0=Monday, 6=Sunday
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['year'] = df['order_date'].dt.year
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_end'] = df['order_date'].dt.is_month_end.astype(int)
# Cyclical encoding: convert periodic features to sin/cos pairs
# so the model knows that hour 23 is close to hour 0
import numpy as np
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
# Days since a reference event (e.g., customer signup)
df['days_since_signup'] = (df['order_date'] - df['signup_date']).dt.days
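The claim that sin/cos encoding places hour 23 next to hour 0 can be checked directly. In this small sketch the helper `cyclical_encode` is a hypothetical name introduced for the example; it applies the same formula as the snippet above and compares distances in the encoded space.

```python
import numpy as np


def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = cyclical_encode([0, 1, 12, 23], period=24)

# Euclidean distances from midnight in the encoded space
dist_1_to_0 = np.linalg.norm(hours[1] - hours[0])
dist_12_to_0 = np.linalg.norm(hours[2] - hours[0])
dist_23_to_0 = np.linalg.norm(hours[3] - hours[0])
```

Hours 1 and 23 end up equally close to midnight, while noon sits at the opposite side of the circle. With the raw integer encoding, hour 23 would instead be the farthest value from hour 0.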
Creating Interaction and Aggregation Features
# Ratio and interaction features
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1) # +1 avoids div/0
df['discount_impact'] = df['original_price'] * df['discount_rate']
df['cart_value_x_sessions'] = df['cart_value'] * df['sessions']
# Aggregation features: group-level statistics joined back
# A fixed reference date keeps the recency feature reproducible across runs
reference_date = df['order_date'].max()
customer_stats = df.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    avg_order_value=('revenue', 'mean'),
    max_order_value=('revenue', 'max'),
    days_since_last_order=('order_date', lambda x: (reference_date - x.max()).days)
).reset_index()
df = df.merge(customer_stats, on='customer_id', how='left')
# Lag features for time series / sequential data
df = df.sort_values(['customer_id', 'order_date'])
df['prev_order_value'] = df.groupby('customer_id')['revenue'].shift(1)
df['revenue_change'] = df['revenue'] - df['prev_order_value']
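Customer-level aggregates computed over the whole table include each customer's later orders, which can leak future information when the model is meant to predict at order time. One possible alternative, sketched here on a hypothetical toy table, is to aggregate only over earlier orders by combining `shift(1)` with an expanding window. The column names `prior_orders` and `prior_avg_revenue` are illustrative.

```python
import pandas as pd

# Hypothetical toy orders table
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-15", "2024-01-20", "2024-02-25"]
    ),
    "revenue": [100.0, 200.0, 300.0, 50.0, 70.0],
}).sort_values(["customer_id", "order_date"])

# Number of earlier orders by the same customer (0 for the first order)
orders["prior_orders"] = orders.groupby("customer_id").cumcount()

# Mean revenue of earlier orders only: shift(1) removes the current row
# from each customer's expanding window, so a first order gets NaN
orders["prior_avg_revenue"] = (
    orders.groupby("customer_id")["revenue"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
```

The first order of each customer has no history, so its average is missing; whether to impute that value or leave it for the model to handle depends on the algorithm.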
Feature Selection
More features are not always better — irrelevant or redundant features add noise, slow training, and hurt interpretability. Feature selection finds the subset that gives the best model performance.
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Variance threshold | Remove features with near-zero variance | Fast; removes useless constant columns | Misses redundant features with non-zero variance |
| Correlation filter | Drop one feature from any pair whose absolute correlation exceeds a threshold | Removes multicollinearity | Does not consider relationship with target |
| Univariate tests (SelectKBest) | Score each feature vs the target independently | Fast; interpretable | Ignores interactions between features |
| Recursive Feature Elimination | Iteratively remove least important features per model | Considers feature interactions | Slow for high-dimensional data |
| Feature importance (tree models) | Use impurity-based or permutation importance from fitted model | Captures non-linear importance | Impurity importance biased toward high-cardinality features |
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
# 1. Remove near-zero variance features
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)
# 2. Remove highly correlated features
corr_matrix = pd.DataFrame(X).corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if any(upper[c] > 0.95)]
X_uncorrelated = pd.DataFrame(X).drop(columns=to_drop)
# 3. Univariate selection: keep top 20 features by F-score
kbest = SelectKBest(f_classif, k=20)
X_kbest = kbest.fit_transform(X, y)
# 4. Feature importance from Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(15))
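The table also lists Recursive Feature Elimination and permutation importance, which the snippet above does not exercise. The following sketch runs both on synthetic data from `make_classification`. The estimator choices, the number of features to keep, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Recursive Feature Elimination: refit the model repeatedly,
# dropping the weakest feature each round until 3 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_tr, y_tr)
selected_mask = rfe.support_  # boolean mask over the 10 features

# Permutation importance: drop in held-out score when one column is shuffled
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
ranked = perm.importances_mean.argsort()[::-1]  # most important first
```

Permutation importance is measured on held-out data here, which avoids the bias toward high-cardinality features that the table notes for impurity-based importance.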
Avoiding Data Leakage
Data leakage occurs when information from outside the training window flows into your features, making validation metrics look better than real-world performance. It is one of the most common and costly mistakes in ML projects.
- Temporal leakage: using future data to predict the past — always split on time, never randomly, for time-ordered data
- Target leakage: including features that are consequences of the target (e.g., churn_reason when predicting churn)
- Preprocessing leakage: fitting scalers, encoders, or imputers on the full dataset before splitting — always fit on the training set only
- Group leakage: train and test share customers, sessions, or other correlated groups — use group-aware splits
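Two of the leakage types above, preprocessing leakage and group leakage, can be guarded against mechanically. The sketch below uses synthetic data and hypothetical customer IDs: the scaler is placed inside a scikit-learn pipeline so it is refit on each split's training rows, and `GroupKFold` keeps every customer on one side of each split. Temporal and target leakage still need domain judgment and are not covered here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 300
X_grp = rng.normal(size=(n_rows, 4))
y_grp = (X_grp[:, 0] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
customer_ids = rng.integers(0, 30, size=n_rows)  # hypothetical group labels

# The scaler lives inside the pipeline, so every CV split refits it
# on that split's training rows only
model = make_pipeline(StandardScaler(), LogisticRegression())

cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X_grp, y_grp, groups=customer_ids, cv=cv)

# Count customers that appear on both sides of each split (should all be 0)
overlaps = [
    len(set(customer_ids[tr]) & set(customer_ids[te]))
    for tr, te in cv.split(X_grp, y_grp, groups=customer_ids)
]
```

For time-ordered data, `TimeSeriesSplit` from the same module plays the analogous role by always validating on rows that come after the training rows.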
Summary
Feature engineering is where domain expertise and data intuition translate into model performance. The best features are usually derived from asking business questions: what would a human analyst look at to predict this outcome? Encoding, scaling, and datetime extraction are mechanical steps — the creative value comes from constructing interaction features and aggregations that capture behaviour invisible in the raw columns. Always use a pipeline to prevent leakage and ensure reproducibility across train and test sets.
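As one way to act on the pipeline advice, the sketch below bundles encoding and scaling with the model using `ColumnTransformer`. The tiny train and test frames, the `converted` target, and the step names are hypothetical. Setting `handle_unknown="ignore"` is one option for categories that appear only at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data
train = pd.DataFrame({
    "country": ["US", "DE", "US", "FR", "DE", "US"],
    "revenue": [120.0, 80.0, 200.0, 50.0, 95.0, 160.0],
    "sessions": [3, 1, 5, 1, 2, 4],
    "converted": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({
    "country": ["US", "JP"],  # "JP" never appears in training
    "revenue": [150.0, 60.0],
    "sessions": [4, 1],
})

preprocess = ColumnTransformer([
    # Unseen categories are encoded as all zeros instead of raising an error
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ("num", StandardScaler(), ["revenue", "sessions"]),
])
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

feature_cols = ["country", "revenue", "sessions"]
clf.fit(train[feature_cols], train["converted"])  # all fitting uses train only
predictions = clf.predict(test[feature_cols])
n_encoded_cols = clf.named_steps["preprocess"].transform(test[feature_cols]).shape[1]
```

Because the encoder and scaler are fitted inside `clf.fit`, the same learned categories, means, and variances are reused at prediction time without any manual bookkeeping.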