What Is Feature Engineering?
Feature engineering is the process of transforming raw data into features — input variables — that better represent the underlying patterns in a dataset, improving the performance of machine learning models and analytical queries. It sits at the intersection of domain knowledge and statistical thinking: good features require an understanding of what the data means in the real world and how predictive algorithms interpret numerical inputs.
The importance of feature engineering is often underestimated. A simple logistic regression model with thoughtfully engineered features will frequently outperform a deep neural network trained on poorly structured raw data. The quality of features is the primary determinant of model quality, and no amount of algorithmic sophistication can compensate for irrelevant or poorly encoded inputs.
Types of Features
Features can be categorized by their data type and how they need to be processed before use in a model.
| Feature Type | Description | Example | Common Encoding |
|---|---|---|---|
| Numerical (continuous) | Real-valued numbers | Age, price, temperature | Standardization, normalization |
| Numerical (discrete) | Integer counts or ordinal values | Number of purchases, rating (1-5) | Direct use or binning |
| Categorical (nominal) | Unordered categories | Country, product category | One-hot encoding, target encoding |
| Categorical (ordinal) | Ordered categories | Education level, size (S/M/L) | Label encoding with order preserved |
| Datetime | Timestamps and dates | Transaction date, signup time | Extract components, cyclical encoding |
| Text | Free-form strings | Product description, review | TF-IDF, embeddings |
| Boolean / Binary | True/false flags | Is premium user, has churned | 0/1 encoding |
Numerical Feature Transformations
Raw numerical features often need transformation before they can be used effectively. The most common issue is skewed distributions — many real-world variables like income, transaction amounts, and website traffic follow power-law or log-normal distributions with long right tails. Linear models are sensitive to the extreme values in those tails, and in distance-based algorithms (like KNN and clustering) large-magnitude features dominate the distance calculation, so skewed features can distort both model coefficients and neighborhood structure.
Log transformation is the most common remedy for right-skewed data:
import numpy as np
import pandas as pd
df['log_price'] = np.log1p(df['price']) # log1p handles zeros safely
df['log_revenue'] = np.log(df['revenue'] + 1)  # equivalent to np.log1p(df['revenue'])
Standardization (z-score normalization) centers features at zero with unit variance, which is important for algorithms sensitive to feature scale like SVM, PCA, and regularized regression:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['price_scaled'] = scaler.fit_transform(df[['price']]).ravel()  # ravel() flattens the (n, 1) output into a 1-D column
Binning (discretization) converts continuous variables into categorical buckets, which can be useful when the relationship with the target is non-linear or when domain knowledge suggests meaningful thresholds:
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 55, 100],
                         labels=['under_18', '18_35', '35_55', '55_plus'])
Categorical Encoding
Machine learning models require numeric inputs, so categorical variables must be encoded. The choice of encoding method significantly affects model performance.
One-hot encoding creates a binary column for each category level. It is appropriate for nominal variables with low cardinality (fewer than ~20 unique values):
df_encoded = pd.get_dummies(df, columns=['country', 'device_type'], drop_first=True)
Target encoding (also called mean encoding) replaces each category with the mean of the target variable for that category. It handles high-cardinality categoricals efficiently but is prone to overfitting — it should always be applied with cross-validation folds to prevent data leakage:
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=['product_category'])
# Fitting on the full dataset leaks the target; in practice, fit the encoder inside cross-validation folds
df['product_category_encoded'] = encoder.fit_transform(df['product_category'], df['target'])
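To make the cross-validation caveat concrete, here is a minimal out-of-fold sketch using scikit-learn's KFold rather than the category_encoders API; the helper name target_encode_oof is hypothetical, and the product_category and target columns are carried over from the example above. Each row is encoded with category means computed only from the other folds, so the encoding never sees that row's own target value.
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5):
    # Encode each row from fold-wise category means so its own target never leaks in
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, valid_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = df.iloc[valid_idx][cat_col].map(fold_means).to_numpy()
    return encoded.fillna(global_mean)  # categories unseen in a fold fall back to the global mean

df['product_category_oof'] = target_encode_oof(df, 'product_category', 'target')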
Frequency encoding replaces categories with their frequency in the training set. It is a simple, leakage-free alternative to target encoding for high-cardinality features:
freq_map = df['city'].value_counts() / len(df)
df['city_freq'] = df['city'].map(freq_map)
Datetime Feature Engineering
Datetime columns contain rich temporal signals that are invisible to models unless explicitly extracted. Standard decomposition extracts components like year, month, day of week, hour, and whether a date falls on a weekend or holiday:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['days_since_signup'] = (df['timestamp'] - df['signup_date']).dt.days  # signup_date must also be a datetime column
Cyclical features like hour of day (0-23) and month (1-12) have a circular structure — hour 23 is closer to hour 0 than hour 12 is. Linear encoding fails to capture this. Sine/cosine transformation preserves the cyclical relationship:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Interaction and Derived Features
Some of the most predictive features are combinations or transformations of existing ones. Interaction features capture joint effects that individual features cannot express:
# Ratio feature
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1)
# Product interaction
df['price_x_quantity'] = df['price'] * df['quantity']
# Polynomial features (for linear models)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = poly.fit_transform(df[['age', 'tenure']])  # NumPy array with columns: age, tenure, age^2, age*tenure, tenure^2
Aggregation features compute statistics about a group — for example, the average purchase amount for a customer's region, or the number of times a product has been returned in the last 30 days. These features encode historical context and are particularly valuable in recommendation systems, fraud detection, and churn prediction.
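As a minimal sketch of the idea, assuming hypothetical columns named region, purchase_amount, and customer_id, pandas groupby/transform broadcasts a group statistic back onto every row:
# Average purchase amount in the customer's region, broadcast to each row
df['region_avg_purchase'] = df.groupby('region')['purchase_amount'].transform('mean')
# How far this transaction deviates from its regional average
df['purchase_vs_region'] = df['purchase_amount'] - df['region_avg_purchase']
# Number of transactions observed for each customer
df['customer_txn_count'] = df.groupby('customer_id')['purchase_amount'].transform('count')
As with target encoding, such aggregates should be computed on the training data (or on a historical window) rather than on the full dataset, so that future information does not leak into the features.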
Handling Missing Values
Missing data is a feature engineering problem as much as a data quality problem. The approach to imputation depends on the missing mechanism and the algorithm being used.
| Strategy | When to Use | Code |
|---|---|---|
| Mean/median imputation | Numerical features, missing completely at random | `df['x'].fillna(df['x'].median())` |
| Mode imputation | Categorical features | `df['x'].fillna(df['x'].mode()[0])` |
| Indicator flag | When missingness itself is informative | `df['x_missing'] = df['x'].isna().astype(int)` |
| Model-based imputation | Complex patterns, missing not at random | KNN imputer, iterative imputer |
| Forward/backward fill | Time series data | `df['x'].ffill()` / `df['x'].bfill()` |
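The model-based row has no one-line snippet; a minimal sketch using scikit-learn's KNNImputer, combined with an indicator flag, might look like this (the income, age, and tenure columns are hypothetical):
from sklearn.impute import KNNImputer

# Flag missingness first, in case the absence of a value is itself predictive
df['income_missing'] = df['income'].isna().astype(int)

# Fill each missing value from the k nearest rows, measured on the other numeric features
numeric_cols = ['income', 'age', 'tenure']
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
Because KNN imputation is distance-based, it works best after the numeric features have been standardized.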
Feature Selection
Not all engineered features improve model performance — some add noise, increase overfitting, and slow training. Feature selection identifies the most predictive subset. Common techniques include correlation filtering (removing features highly correlated with each other), univariate statistical tests (selecting features with the highest F-score or mutual information with the target), and model-based importance (using a tree model's feature importances or L1-regularized model's non-zero coefficients to rank features).
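A hedged sketch of the last two approaches, assuming a feature matrix X (a DataFrame of engineered features) and a classification target y:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Univariate filter: mutual information between each feature and the target
mi_scores = pd.Series(mutual_info_classif(X, y), index=X.columns).sort_values(ascending=False)

# Model-based ranking: importances from a tree ensemble
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep the top-ranked features and retrain on the reduced set
selected = importances.head(20).index.tolist()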
Feature engineering and selection are iterative processes. Creating features, training a baseline model, analyzing which features contribute most, removing or transforming low-quality features, and retraining — this cycle is the core of practical machine learning work. Domain knowledge is what allows you to generate hypotheses about which features might be predictive before running any code, making the process far more efficient than brute-force feature generation.