What Is Feature Engineering?
Feature engineering is the process of transforming raw data into features — input variables — that better represent the underlying patterns in a dataset, improving the performance of machine learning models and analytical queries. It sits at the intersection of domain knowledge and statistical thinking: good features require an understanding of what the data means in the real world and how predictive algorithms interpret numerical inputs.
The importance of feature engineering is often underestimated. A simple logistic regression model with thoughtfully engineered features will frequently outperform a deep neural network trained on poorly structured raw data. The quality of features is the primary determinant of model quality, and no amount of algorithmic sophistication can compensate for irrelevant or poorly encoded inputs.
Types of Features
Features can be categorized by their data type and how they need to be processed before use in a model.
| Feature Type | Description | Example | Common Encoding |
|---|---|---|---|
| Numerical (continuous) | Real-valued numbers | Age, price, temperature | Standardization, normalization |
| Numerical (discrete) | Integer counts or ordinal values | Number of purchases, rating (1-5) | Direct use or binning |
| Categorical (nominal) | Unordered categories | Country, product category | One-hot encoding, target encoding |
| Categorical (ordinal) | Ordered categories | Education level, size (S/M/L) | Label encoding with order preserved |
| Datetime | Timestamps and dates | Transaction date, signup time | Extract components, cyclical encoding |
| Text | Free-form strings | Product description, review | TF-IDF, embeddings |
| Boolean / Binary | True/false flags | Is premium user, has churned | 0/1 encoding |
Numerical Feature Transformations
Raw numerical features often need transformation before they can be used effectively. The most common issue is skewed distributions — many real-world variables like income, transaction amounts, and website traffic follow power-law or log-normal distributions with long right tails. Linear models are sensitive to the extreme values in those tails, and in distance-based algorithms (like KNN and clustering) large-magnitude features dominate the distance calculation, so skewed features can distort both model coefficients and neighborhood structure.
Log transformation is the most common remedy for right-skewed data:
import numpy as np
import pandas as pd
df['log_price'] = np.log1p(df['price']) # log1p handles zeros safely
df['log_revenue'] = np.log(df['revenue'] + 1)  # equivalent to np.log1p(df['revenue'])
Standardization (z-score normalization) centers features at zero with unit variance, which is important for algorithms sensitive to feature scale like SVM, PCA, and regularized regression:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']]).ravel()  # ravel() flattens the (n, 1) output into a 1-D column
Binning (discretization) converts continuous variables into categorical buckets, which can be useful when the relationship with the target is non-linear or when domain knowledge suggests meaningful thresholds:
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 55, 100],
                         labels=['under_18', '18_35', '35_55', '55_plus'])
Categorical Encoding
Machine learning models require numeric inputs, so categorical variables must be encoded. The choice of encoding method significantly affects model performance.
One-hot encoding creates a binary column for each category level. It is appropriate for nominal variables with low cardinality (fewer than ~20 unique values):
df_encoded = pd.get_dummies(df, columns=['country', 'device_type'], drop_first=True)
Target encoding (also called mean encoding) replaces each category with the mean of the target variable for that category. It handles high-cardinality categoricals efficiently but is prone to overfitting — it should always be applied with cross-validation folds to prevent data leakage:
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['product_category'])
# Fitting on the full dataset leaks the target; in practice, fit the encoder inside cross-validation folds
df['product_category_encoded'] = encoder.fit_transform(df['product_category'], df['target'])
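To make the cross-validation caveat concrete, here is a minimal out-of-fold sketch using scikit-learn's KFold rather than the category_encoders API; the helper name target_encode_oof is hypothetical, and the product_category and target columns are carried over from the example above. Each row is encoded with category means computed only from the other folds, so the encoding never sees that row's own target value.
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5):
    # Encode each row from fold-wise category means so its own target never leaks in
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, valid_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = df.iloc[valid_idx][cat_col].map(fold_means).to_numpy()
    return encoded.fillna(global_mean)  # categories unseen in a fold fall back to the global mean

df['product_category_oof'] = target_encode_oof(df, 'product_category', 'target')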
Frequency encoding replaces categories with their frequency in the training set. It is a simple, leakage-free alternative to target encoding for high-cardinality features:
freq_map = df['city'].value_counts() / len(df)
df['city_freq'] = df['city'].map(freq_map)
Datetime Feature Engineering
Datetime columns contain rich temporal signals that are invisible to models unless explicitly extracted. Standard decomposition extracts components like year, month, day of week, hour, and whether a date falls on a weekend or holiday:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['days_since_signup'] = (df['timestamp'] - df['signup_date']).dt.days  # signup_date must also be a datetime column
Cyclical features like hour of day (0-23) and month (1-12) have a circular structure — hour 23 is closer to hour 0 than hour 12 is. Linear encoding fails to capture this. Sine/cosine transformation preserves the cyclical relationship:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Interaction and Derived Features
Some of the most predictive features are combinations or transformations of existing ones. Interaction features capture joint effects that individual features cannot express:
# Ratio feature
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1)
# Product interaction
df['price_x_quantity'] = df['price'] * df['quantity']
# Polynomial features (for linear models)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = poly.fit_transform(df[['age', 'tenure']])  # NumPy array with columns: age, tenure, age^2, age*tenure, tenure^2
Aggregation features compute statistics about a group — for example, the average purchase amount for a customer's region, or the number of times a product has been returned in the last 30 days. These features encode historical context and are particularly valuable in recommendation systems, fraud detection, and churn prediction.
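As a minimal sketch of the idea, assuming hypothetical columns named region, purchase_amount, and customer_id, pandas groupby/transform broadcasts a group statistic back onto every row:
# Average purchase amount in the customer's region, broadcast to each row
df['region_avg_purchase'] = df.groupby('region')['purchase_amount'].transform('mean')
# How far this transaction deviates from its regional average
df['purchase_vs_region'] = df['purchase_amount'] - df['region_avg_purchase']
# Number of transactions observed for each customer
df['customer_txn_count'] = df.groupby('customer_id')['purchase_amount'].transform('count')
As with target encoding, such aggregates should be computed on the training data (or on a historical window) rather than on the full dataset, so that future information does not leak into the features.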
Handling Missing Values
Missing data is a feature engineering problem as much as a data quality problem. The approach to imputation depends on the missing mechanism and the algorithm being used.
| Strategy | When to Use | Code |
|---|---|---|
| Mean/median imputation | Numerical features, missing completely at random | `df['x'].fillna(df['x'].median())` |
| Mode imputation | Categorical features | `df['x'].fillna(df['x'].mode()[0])` |
| Indicator flag | When missingness itself is informative | `df['x_missing'] = df['x'].isna().astype(int)` |
| Model-based imputation | Complex patterns, missing not at random | KNN imputer, iterative imputer |
| Forward/backward fill | Time series data | `df['x'].ffill()` / `df['x'].bfill()` |
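The model-based row has no one-line snippet; a minimal sketch using scikit-learn's KNNImputer, combined with an indicator flag, might look like this (the income, age, and tenure columns are hypothetical):
from sklearn.impute import KNNImputer

# Flag missingness first, in case the absence of a value is itself predictive
df['income_missing'] = df['income'].isna().astype(int)

# Fill each missing value from the k nearest rows, measured on the other numeric features
numeric_cols = ['income', 'age', 'tenure']
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
Because KNN imputation is distance-based, it works best after the numeric features have been standardized.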
Feature Selection
Not all engineered features improve model performance — some add noise, increase overfitting, and slow training. Feature selection identifies the most predictive subset. Common techniques include correlation filtering (removing features highly correlated with each other), univariate statistical tests (selecting features with the highest F-score or mutual information with the target), and model-based importance (using a tree model's feature importances or L1-regularized model's non-zero coefficients to rank features).
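A hedged sketch of the last two approaches, assuming a feature matrix X (a DataFrame of engineered features) and a classification target y:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Univariate filter: mutual information between each feature and the target
mi_scores = pd.Series(mutual_info_classif(X, y), index=X.columns).sort_values(ascending=False)

# Model-based ranking: importances from a tree ensemble
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep the top-ranked features and retrain on the reduced set
selected = importances.head(20).index.tolist()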
Feature engineering and selection are iterative processes. Creating features, training a baseline model, analyzing which features contribute most, removing or transforming low-quality features, and retraining — this cycle is the core of practical machine learning work. Domain knowledge is what allows you to generate hypotheses about which features might be predictive before running any code, making the process far more efficient than brute-force feature generation.