What Is Feature Engineering?
Feature engineering is the process of using domain knowledge and analytical creativity to transform raw data into features that better represent the underlying patterns for machine learning models or analytical workflows. It bridges the gap between raw data and useful inputs, and is often the single most impactful lever for improving model performance — more so than algorithm selection or hyperparameter tuning.
A feature is any measurable property used as input to an analysis or model. Raw data rarely comes in a form that's immediately useful. Feature engineering transforms it into representations that capture the patterns you care about. A transaction timestamp, for example, might be transformed into day of week, hour of day, days since last purchase, or whether it occurred during a holiday — all of which may be more predictive than the raw timestamp.
Types of Feature Engineering Techniques
| Technique | Description | Example | Best Used For |
|---|---|---|---|
| Encoding | Convert categorical to numeric | One-hot encode "country" | ML models requiring numeric input |
| Binning | Group continuous values into ranges | Age → 18-25, 26-35, 36-50 | Reducing noise, segmentation |
| Log transform | Compress skewed distributions | log(revenue) | Right-skewed numeric features |
| Interaction features | Combine two features | price × quantity = order_value | Capturing multiplicative effects |
| Date extraction | Pull components from timestamps | day_of_week, is_weekend, quarter | Time-based patterns |
| Aggregation | Summarize group-level stats | user's avg order value (last 30d) | Entity-level behavioral features |
| Text features | Extract signals from text | word count, sentiment score | NLP preprocessing |
| Lag features | Previous period values | sales_7d_ago, sales_28d_ago | Time series forecasting |
Encoding Categorical Variables
Most machine learning models require numeric inputs, so categorical variables must be encoded. The most common technique is one-hot encoding, which creates a binary column for each category. A "color" column with values red, green, blue becomes three binary columns. This works well for nominal categories with no inherent order and a manageable number of unique values.
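As a minimal sketch, one-hot encoding a hypothetical "color" column with pandas:

```python
import pandas as pd

# Hypothetical nominal column with no inherent order
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encode: one binary indicator column per category
one_hot = pd.get_dummies(df, columns=["color"], prefix="color")
print(one_hot)
```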
For ordinal categories (small, medium, large), label encoding or ordinal encoding preserves the ordering information that one-hot encoding discards. For high-cardinality categoricals with hundreds of unique values, target encoding (replacing each category with its mean target value, computed on the training data to avoid leakage) or frequency encoding (replacing with category count) is more practical than one-hot encoding, which would create too many columns.
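The ordinal, frequency, and target variants can be sketched with plain pandas; the size, city, and churned columns below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordinal category
    "city": ["NYC", "LA", "NYC", "Chicago"],          # higher-cardinality in practice
    "churned": [1, 0, 1, 0],                          # hypothetical binary target
})

# Ordinal encoding: map categories to integers that preserve their order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Frequency encoding: replace each category with how often it appears
df["city_freq"] = df["city"].map(df["city"].value_counts())

# Target encoding: replace each category with its mean target value.
# In practice, compute these means on the training split only (ideally
# with cross-fold averages) to avoid leaking the target into the features.
df["city_target_enc"] = df["city"].map(df.groupby("city")["churned"].mean())
```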
Numerical Transformations
Many statistical methods and ML algorithms perform better when features have symmetric, approximately normal distributions. Right-skewed numerical features — like revenue, user counts, or file sizes — benefit from log transformation, which compresses large values and spreads small ones. Square root transformation is a milder alternative. Box-Cox transformation generalizes these, finding the optimal power transformation automatically.
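A rough illustration with NumPy and SciPy, using a made-up revenue column:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed feature (e.g. revenue per customer)
df = pd.DataFrame({"revenue": [10.0, 12.0, 15.0, 40.0, 90.0, 2500.0]})

# log1p handles zeros safely via log(1 + x); plain log needs strictly positive values
df["revenue_log"] = np.log1p(df["revenue"])

# Square root: a milder compression of large values
df["revenue_sqrt"] = np.sqrt(df["revenue"])

# Box-Cox searches for the power transform that best normalizes the data;
# it requires strictly positive inputs
df["revenue_boxcox"], fitted_lambda = stats.boxcox(df["revenue"])
```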
Standardization (z-score normalization) scales features to have mean 0 and standard deviation 1. Min-max scaling maps values to a 0–1 range. Distance-based algorithms like KNN and SVM are sensitive to feature scale, making normalization essential. Tree-based models like random forests and gradient boosting are scale-invariant and typically don't require normalization.
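A quick sketch with scikit-learn's scalers, on hypothetical age and income columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({
    "age": [22, 35, 47, 58],
    "income": [28_000, 54_000, 91_000, 120_000],
})

# Z-score standardization: each feature scaled to mean 0, standard deviation 1
df[["age_std", "income_std"]] = StandardScaler().fit_transform(df[["age", "income"]])

# Min-max scaling: each feature mapped to the 0-1 range
df[["age_mm", "income_mm"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# Note: fit scalers on the training split only, then apply them to validation/test data
```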
Date and Time Features
Timestamps are one of the richest sources of engineered features. From a single timestamp you can extract year, month, day, hour, minute, day of week, week of year, quarter, and whether the date falls on a weekend or public holiday. Each of these captures different temporal patterns — hour of day captures daily cycles, day of week captures weekly patterns, month captures seasonality.
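With pandas, most of these components come straight off the .dt accessor; the timestamps below are made up, and holiday flags would need an external calendar:

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime([
    "2024-03-15 08:30:00", "2024-07-04 19:45:00", "2024-12-25 12:00:00",
])})

# Extract calendar components from the timestamp
df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["day"] = df["ts"].dt.day
df["hour"] = df["ts"].dt.hour
df["day_of_week"] = df["ts"].dt.dayofweek            # Monday=0 .. Sunday=6
df["week_of_year"] = df["ts"].dt.isocalendar().week
df["quarter"] = df["ts"].dt.quarter
df["is_weekend"] = df["ts"].dt.dayofweek >= 5
```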
Time-since features are particularly powerful: days since first purchase, days since last login, days since account creation. These recency-based features often outperform raw timestamps because they encode meaningful behavioral context — a customer who last purchased 3 days ago is fundamentally different from one who last purchased 300 days ago, and this difference is captured more naturally as a numeric recency value.
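One way to sketch recency features from a hypothetical per-user event log, measured against an arbitrary reference date:

```python
import pandas as pd

# Hypothetical per-user purchase log
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "purchase_ts": pd.to_datetime([
        "2024-01-05", "2024-03-20", "2023-11-01", "2024-02-14", "2024-03-28",
    ]),
})
as_of = pd.Timestamp("2024-04-01")  # reference date for "days since"

recency = events.groupby("user_id")["purchase_ts"].agg(
    first_purchase="min", last_purchase="max"
)
recency["days_since_first_purchase"] = (as_of - recency["first_purchase"]).dt.days
recency["days_since_last_purchase"] = (as_of - recency["last_purchase"]).dt.days
```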
Aggregation and Window Features
For entity-level data (users, customers, products), aggregating historical behavior creates powerful predictive features. For each user, compute the number of purchases in the last 7, 30, and 90 days; average order value; total spend; most frequently purchased category; days since first purchase. These features summarize an entity's history in a form that's directly useful for prediction.
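A minimal groupby-based sketch over a hypothetical order history:

```python
import pandas as pd

# Hypothetical order history
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_value": [20.0, 35.0, 15.0, 120.0, 80.0],
    "category": ["books", "books", "toys", "electronics", "electronics"],
})

# One row of behavioral features per user
user_features = orders.groupby("user_id").agg(
    n_orders=("order_value", "count"),
    avg_order_value=("order_value", "mean"),
    total_spend=("order_value", "sum"),
    top_category=("category", lambda s: s.mode().iloc[0]),
)
```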
Rolling window aggregations (computing statistics over a sliding time window) capture recent trends more effectively than all-time aggregates. A customer's purchase count over the last 30 days is often more predictive of their next action than their all-time count, since recent behavior reflects current engagement level.
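A rough per-user rolling-window sketch; the three-row window is illustrative, and pandas also accepts time-based windows such as rolling("30D") when the data has a datetime index:

```python
import pandas as pd

# Hypothetical daily purchase counts for two users
daily = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "date": pd.date_range("2024-03-01", periods=3, freq="D").tolist() * 2,
    "purchases": [1, 0, 2, 0, 1, 3],
})
daily = daily.sort_values(["user_id", "date"])

# Rolling sum over the last 3 rows per user (min_periods=1 keeps the early rows)
daily["purchases_rolling_3"] = (
    daily.groupby("user_id")["purchases"]
    .rolling(window=3, min_periods=1)
    .sum()
    .reset_index(level=0, drop=True)
)
```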
Interaction and Ratio Features
Combining two existing features can create a new feature that captures relationships the model might not discover on its own. Price divided by quality score creates a value-for-money metric. Clicks divided by impressions gives click-through rate. Revenue minus cost gives profit. These interaction, ratio, and difference features often encode domain knowledge directly — they represent concepts that analysts understand to be meaningful.
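A few of these derived features, sketched on hypothetical columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 25.0, 5.0],
    "quantity": [3, 1, 10],
    "clicks": [12, 0, 45],
    "impressions": [400, 150, 900],
    "cost": [18.0, 20.0, 30.0],
})

# Interaction: multiplicative combination of two features
df["order_value"] = df["price"] * df["quantity"]

# Ratio: click-through rate, guarding against division by zero
df["ctr"] = df["clicks"] / df["impressions"].replace(0, np.nan)

# Difference: profit as revenue minus cost
df["profit"] = df["order_value"] - df["cost"]
```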
Be judicious with interaction features — combining N features pairwise creates N*(N-1)/2 combinations, which quickly becomes unwieldy and introduces multicollinearity. Use domain knowledge and EDA to identify the most promising combinations rather than generating all possible interactions blindly.
Feature Selection
Not every engineered feature improves model performance. Irrelevant or redundant features add noise, slow down training, and can hurt generalization. Feature selection techniques help identify which features to keep. Correlation analysis identifies features that are highly correlated with one another (multicollinearity) so redundant ones can be dropped. Feature importance scores from tree models rank features by their contribution to model performance. Recursive feature elimination iteratively removes the least important features until a target count is reached.
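A compact sketch of all three approaches with scikit-learn, using synthetic data in place of an engineered feature matrix:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data standing in for an engineered feature matrix
X, y = make_classification(n_samples=500, n_features=12, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(12)])

# 1) Correlation filter: flag one feature from each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

# 2) Tree-based importances: rank features by contribution to the model
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# 3) Recursive feature elimination: keep the 5 strongest features
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0), n_features_to_select=5).fit(X, y)
selected = X.columns[rfe.support_]
```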
A good rule of thumb: when in doubt, fewer high-quality features beat many low-quality ones. Start with a small set of strongly motivated features, validate their impact, then incrementally add more.
Conclusion
Feature engineering is where analytical insight meets machine learning practice. The best features come from deeply understanding the domain, asking what information would actually help predict the outcome, and creatively transforming raw data to represent that information clearly. Invest in learning the full toolkit of encoding, transformation, aggregation, and interaction techniques — and you'll consistently outperform analysts who rely solely on raw data.