What Is Feature Engineering?
Feature engineering is the process of transforming raw data into features — input variables — that better represent the underlying patterns in a dataset, improving the performance of machine learning models and analytical queries. It sits at the intersection of domain knowledge and statistical thinking: good features require an understanding of what the data means in the real world and how predictive algorithms interpret numerical inputs.
The importance of feature engineering is often underestimated. A simple logistic regression model with thoughtfully engineered features will frequently outperform a deep neural network trained on poorly structured raw data. The quality of features is the primary determinant of model quality, and no amount of algorithmic sophistication can compensate for irrelevant or poorly encoded inputs.
Types of Features
| Feature Type | Description | Example | Common Encoding |
|---|---|---|---|
| Numerical (continuous) | Real-valued numbers | Age, price, temperature | Standardization, normalization |
| Numerical (discrete) | Integer counts or ordinal values | Number of purchases, rating (1-5) | Direct use or binning |
| Categorical (nominal) | Unordered categories | Country, product category | One-hot encoding, target encoding |
| Categorical (ordinal) | Ordered categories | Education level, size (S/M/L) | Label encoding with order preserved |
| Datetime | Timestamps and dates | Transaction date, signup time | Extract components, cyclical encoding |
| Text | Free-form strings | Product description, review | TF-IDF, embeddings |
| Boolean / Binary | True/false flags | Is premium user, has churned | 0/1 encoding |
Numerical Feature Transformations
Raw numerical features often need transformation before they can be used effectively. Many real-world variables like income, transaction amounts, and website traffic follow power-law or log-normal distributions with long right tails. Linear models and distance-based algorithms work best when inputs are roughly symmetric and on comparable scales; heavily skewed features can dominate distance calculations and distort model coefficients.
Log transformation is the most common remedy for right-skewed data:
import numpy as np
import pandas as pd
df['log_price'] = np.log1p(df['price'])        # log(1 + x); handles zero values safely
df['log_revenue'] = np.log(df['revenue'] + 1)  # equivalent to np.log1p(df['revenue'])
Standardization (z-score normalization) rescales features to zero mean and unit variance, which is important for scale-sensitive algorithms such as SVMs, PCA, and regularized regression:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']]).ravel()  # ravel the (n, 1) array to 1-D
Binning converts continuous variables into categorical buckets, useful when the relationship with the target is non-linear or when domain knowledge suggests meaningful thresholds:
df['age_group'] = pd.cut(df['age'],
bins=[0, 18, 35, 55, 100],
labels=['under_18', '18_35', '35_55', '55_plus'])
Categorical Encoding
Machine learning models require numeric inputs, so categorical variables must be encoded. The choice of encoding method significantly affects model performance.
One-hot encoding creates a binary column for each category level — appropriate for nominal variables with low cardinality (fewer than about 20 unique values):
df_encoded = pd.get_dummies(df, columns=['country', 'device_type'], drop_first=True)
Target encoding replaces each category with the mean of the target variable for that category, handling high-cardinality categoricals efficiently. It must be applied with cross-validation to prevent data leakage:
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['product_category'])
df['cat_encoded'] = encoder.fit_transform(df['product_category'], df['target'])
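Calling fit_transform on the full dataset, as above, lets each row see its own target value through its category mean. A minimal sketch of the leakage-free, out-of-fold variant, using plain pandas and a toy DataFrame (column names match the example above but are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded with
    category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # Means come only from the training folds, never the row being encoded
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)

df = pd.DataFrame({
    'product_category': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'target': [1, 0, 1, 1, 1, 0, 0, 1],
})
df['cat_encoded'] = oof_target_encode(df, 'product_category', 'target')
```

At inference time, the encoder is fit once on the full training set and the resulting per-category means are applied to new data.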
Frequency encoding replaces categories with their frequency in the training set — a simple, leakage-free alternative for high-cardinality features:
freq_map = df['city'].value_counts() / len(df)
df['city_freq'] = df['city'].map(freq_map)
Datetime Feature Engineering
Datetime columns contain rich temporal signals that are invisible to models unless explicitly extracted. Standard decomposition pulls out year, month, day of week, hour, and whether a date falls on a weekend or holiday:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['days_since_signup'] = (df['timestamp'] - df['signup_date']).dt.days
Cyclical features like hour of day and month have a circular structure — hour 23 is closer to hour 0 than hour 12 is. Sine and cosine transformation preserves this cyclical relationship:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Interaction and Derived Features
Some of the most predictive features are combinations or transformations of existing ones. Interaction features capture joint effects that individual features cannot express alone:
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1)  # +1 guards against division by zero
df['price_x_quantity'] = df['price'] * df['quantity']
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = poly.fit_transform(df[['age', 'tenure']])  # columns: age, tenure, age^2, age*tenure, tenure^2
Aggregation features compute statistics about a group — for example, the average purchase amount for a customer's region, or the number of times a product has been returned in the last 30 days. These features encode historical context and are particularly valuable in recommendation systems, fraud detection, and churn prediction.
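A minimal sketch of the group-statistic pattern with pandas groupby/transform, on a toy DataFrame (the column names region and purchase_amount are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'region': ['east', 'east', 'west', 'west', 'west', 'west'],
    'purchase_amount': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Average purchase amount in each customer's region, broadcast back to every row
df['region_avg_purchase'] = df.groupby('region')['purchase_amount'].transform('mean')

# Number of purchases per customer
df['customer_purchase_count'] = df.groupby('customer_id')['purchase_amount'].transform('count')
```

transform (rather than agg) keeps the result aligned with the original rows, so the group statistic can be joined back as a feature without a merge.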
Handling Missing Values
| Strategy | When to Use | Example |
|---|---|---|
| Mean/median imputation | Numerical, missing completely at random | df['col'].fillna(df['col'].median()) |
| Mode imputation | Categorical features | df['col'].fillna(df['col'].mode()[0]) |
| Indicator flag | When missingness itself is informative | df['col_missing'] = df['col'].isna().astype(int) |
| Model-based imputation | Complex patterns, missing not at random | KNN imputer, iterative imputer |
| Forward/backward fill | Time series data | df['col'].ffill() |
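The indicator-flag and imputation strategies from the table are often combined: record the missingness first, then fill. A short sketch on a toy column (the name income is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [50000.0, np.nan, 62000.0, np.nan, 58000.0]})

# Capture missingness before imputing, in case it carries signal
df['income_missing'] = df['income'].isna().astype(int)

# Median imputation is robust to the skew typical of income-like variables
df['income'] = df['income'].fillna(df['income'].median())
```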
Feature Selection
Not all engineered features improve model performance — some add noise, increase overfitting, and slow training. Feature selection identifies the most predictive subset. Common techniques include correlation filtering (removing features highly correlated with each other), univariate statistical tests (selecting features with the highest F-score or mutual information with the target), and model-based importance using a tree model's feature importances or L1-regularized coefficients to rank features.
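The univariate and model-based rankings mentioned above can be sketched with scikit-learn on synthetic data (make_classification here stands in for a real feature matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
feature_names = [f'f{i}' for i in range(X.shape[1])]

# Univariate ranking: mutual information between each feature and the target
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=feature_names)

# Model-based ranking: impurity-based importances from a tree ensemble
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names)

# Keep the top-ranked subset for the next training iteration
top5 = importances.sort_values(ascending=False).head(5)
```

Either ranking can then drive the drop-and-retrain loop: keep the top-k features, retrain, and compare validation scores against the full set.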
Feature engineering and selection are iterative processes. Creating features, training a baseline model, analyzing which features contribute most, removing or transforming low-quality features, and retraining — this cycle is the core of practical machine learning work. Domain knowledge is what allows you to generate hypotheses about which features might be predictive before running any code, making the process far more efficient than brute-force feature generation.