What Is Feature Engineering?
Feature engineering is the process of transforming raw data into features — input variables — that better represent the underlying patterns in a dataset, improving the performance of machine learning models and analytical queries. It sits at the intersection of domain knowledge and statistical thinking: good features require an understanding of what the data means in the real world and how predictive algorithms interpret numerical inputs.
The importance of feature engineering is often underestimated. A simple logistic regression model with thoughtfully engineered features will frequently outperform a deep neural network trained on poorly structured raw data. The quality of features is the primary determinant of model quality, and no amount of algorithmic sophistication can compensate for irrelevant or poorly encoded inputs.
Types of Features
| Feature Type | Description | Example | Common Encoding |
|---|---|---|---|
| Numerical (continuous) | Real-valued numbers | Age, price, temperature | Standardization, normalization |
| Numerical (discrete) | Integer counts or ordinal values | Number of purchases, rating (1-5) | Direct use or binning |
| Categorical (nominal) | Unordered categories | Country, product category | One-hot encoding, target encoding |
| Categorical (ordinal) | Ordered categories | Education level, size (S/M/L) | Label encoding with order preserved |
| Datetime | Timestamps and dates | Transaction date, signup time | Extract components, cyclical encoding |
| Text | Free-form strings | Product description, review | TF-IDF, embeddings |
| Boolean / Binary | True/false flags | Is premium user, has churned | 0/1 encoding |
Numerical Feature Transformations
Raw numerical features often need transformation before they can be used effectively. Many real-world variables like income, transaction amounts, and website traffic follow power-law or log-normal distributions with long right tails. Linear models and distance-based algorithms work best when inputs are roughly symmetric and on comparable scales; heavily skewed features can dominate distance calculations and distort model coefficients.
Log transformation is the most common remedy for right-skewed data:
import numpy as np
import pandas as pd
df['log_price'] = np.log1p(df['price'])        # log(1 + x); handles zero values safely
df['log_revenue'] = np.log(df['revenue'] + 1)  # equivalent to np.log1p(df['revenue'])
Standardization (z-score normalization) rescales features to zero mean and unit variance, which is important for scale-sensitive algorithms such as SVMs, PCA, and regularized regression:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']]).ravel()  # ravel the (n, 1) array to 1-D
Binning converts continuous variables into categorical buckets, useful when the relationship with the target is non-linear or when domain knowledge suggests meaningful thresholds:
df['age_group'] = pd.cut(df['age'],
bins=[0, 18, 35, 55, 100],
labels=['under_18', '18_35', '35_55', '55_plus'])
Categorical Encoding
Machine learning models require numeric inputs, so categorical variables must be encoded. The choice of encoding method significantly affects model performance.
One-hot encoding creates a binary column for each category level — appropriate for nominal variables with low cardinality (fewer than about 20 unique values):
df_encoded = pd.get_dummies(df, columns=['country', 'device_type'], drop_first=True)
Target encoding replaces each category with the mean of the target variable for that category, handling high-cardinality categoricals efficiently. It must be applied with cross-validation to prevent data leakage:
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['product_category'])
df['cat_encoded'] = encoder.fit_transform(df['product_category'], df['target'])
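Calling fit_transform on the full dataset, as above, lets each row see its own target value through its category mean. A minimal sketch of the leakage-free, out-of-fold variant, using plain pandas and a toy DataFrame (column names match the example above but are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded with
    category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # Means come only from the training folds, never the row being encoded
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)

df = pd.DataFrame({
    'product_category': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'target': [1, 0, 1, 1, 1, 0, 0, 1],
})
df['cat_encoded'] = oof_target_encode(df, 'product_category', 'target')
```

At inference time, the encoder is fit once on the full training set and the resulting per-category means are applied to new data.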
Frequency encoding replaces categories with their frequency in the training set — a simple, leakage-free alternative for high-cardinality features:
freq_map = df['city'].value_counts() / len(df)
df['city_freq'] = df['city'].map(freq_map)
Datetime Feature Engineering
Datetime columns contain rich temporal signals that are invisible to models unless explicitly extracted. Standard decomposition pulls out year, month, day of week, hour, and whether a date falls on a weekend or holiday:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['days_since_signup'] = (df['timestamp'] - df['signup_date']).dt.days
Cyclical features like hour of day and month have a circular structure — hour 23 is closer to hour 0 than hour 12 is. Sine and cosine transformation preserves this cyclical relationship:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Interaction and Derived Features
Some of the most predictive features are combinations or transformations of existing ones. Interaction features capture joint effects that individual features cannot express alone:
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1)  # +1 guards against division by zero
df['price_x_quantity'] = df['price'] * df['quantity']
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = poly.fit_transform(df[['age', 'tenure']])  # columns: age, tenure, age^2, age*tenure, tenure^2
Aggregation features compute statistics about a group — for example, the average purchase amount for a customer's region, or the number of times a product has been returned in the last 30 days. These features encode historical context and are particularly valuable in recommendation systems, fraud detection, and churn prediction.
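A minimal sketch of the group-statistic pattern with pandas groupby/transform, on a toy DataFrame (the column names region and purchase_amount are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'region': ['east', 'east', 'west', 'west', 'west', 'west'],
    'purchase_amount': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Average purchase amount in each customer's region, broadcast back to every row
df['region_avg_purchase'] = df.groupby('region')['purchase_amount'].transform('mean')

# Number of purchases per customer
df['customer_purchase_count'] = df.groupby('customer_id')['purchase_amount'].transform('count')
```

transform (rather than agg) keeps the result aligned with the original rows, so the group statistic can be joined back as a feature without a merge.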
Handling Missing Values
| Strategy | When to Use | Example |
|---|---|---|
| Mean/median imputation | Numerical, missing completely at random | df['col'].fillna(df['col'].median()) |
| Mode imputation | Categorical features | df['col'].fillna(df['col'].mode()[0]) |
| Indicator flag | When missingness itself is informative | df['col_missing'] = df['col'].isna().astype(int) |
| Model-based imputation | Complex patterns, missing not at random | KNN imputer, iterative imputer |
| Forward/backward fill | Time series data | df['col'].ffill() |
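The indicator-flag and imputation strategies from the table are often combined: record the missingness first, then fill. A short sketch on a toy column (the name income is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [50000.0, np.nan, 62000.0, np.nan, 58000.0]})

# Capture missingness before imputing, in case it carries signal
df['income_missing'] = df['income'].isna().astype(int)

# Median imputation is robust to the skew typical of income-like variables
df['income'] = df['income'].fillna(df['income'].median())
```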
Feature Selection
Not all engineered features improve model performance — some add noise, increase overfitting, and slow training. Feature selection identifies the most predictive subset. Common techniques include correlation filtering (removing features highly correlated with each other), univariate statistical tests (selecting features with the highest F-score or mutual information with the target), and model-based importance using a tree model's feature importances or L1-regularized coefficients to rank features.
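The univariate and model-based rankings mentioned above can be sketched with scikit-learn on synthetic data (make_classification here stands in for a real feature matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
feature_names = [f'f{i}' for i in range(X.shape[1])]

# Univariate ranking: mutual information between each feature and the target
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=feature_names)

# Model-based ranking: impurity-based importances from a tree ensemble
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names)

# Keep the top-ranked subset for the next training iteration
top5 = importances.sort_values(ascending=False).head(5)
```

Either ranking can then drive the drop-and-retrain loop: keep the top-k features, retrain, and compare validation scores against the full set.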
Feature engineering and selection are iterative processes. Creating features, training a baseline model, analyzing which features contribute most, removing or transforming low-quality features, and retraining — this cycle is the core of practical machine learning work. Domain knowledge is what allows you to generate hypotheses about which features might be predictive before running any code, making the process far more efficient than brute-force feature generation.