What Is Regression Analysis?
Regression analysis is a statistical technique for modelling the relationship between a dependent variable (the outcome you want to predict or explain) and one or more independent variables (the features or inputs). It is one of the most widely used methods in data analysis because it produces interpretable, quantified relationships — not just "X and Y are correlated" but "a one-unit increase in X is associated with a β-unit change in Y, holding all else constant." Analysts use regression to forecast sales, explain churn drivers, price products, evaluate marketing campaigns, and build the foundation for more advanced machine learning models.
Simple vs. Multiple Linear Regression
Concept | Simple Linear Regression | Multiple Linear Regression |
|---|---|---|
Number of predictors | One independent variable | Two or more independent variables |
Equation form | Y = β₀ + β₁X + ε | Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε |
Interpretation of β | Change in Y per unit increase in X | Change in Y per unit increase in Xᵢ, holding all other variables constant |
Use case | Quick univariate exploration; e.g. ad spend vs. revenue | Real-world modelling where multiple factors drive the outcome simultaneously |
Key risk | Omitted variable bias if other drivers are ignored | Multicollinearity when predictors are highly correlated with each other |
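
The omitted-variable risk in the last row can be made concrete with a small simulation. This is a minimal sketch on synthetic data, assuming NumPy is available; the variable names, true coefficients (2 and 3), and the helper `ols` are illustrative choices, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (illustrative only): x2 is correlated with x1, and both drive y.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)


def ols(X, y):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


simple = ols(x1, y)                            # y ~ x1 only (x2 omitted)
multiple = ols(np.column_stack([x1, x2]), y)   # y ~ x1 + x2

print("simple slope on x1:  ", round(simple[1], 2))
print("multiple slope on x1:", round(multiple[1], 2))
print("multiple slope on x2:", round(multiple[2], 2))
```

In the simple model the slope on `x1` absorbs part of the effect of the omitted `x2`, so it lands well above the true value of 2. The multiple model recovers both coefficients approximately.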
Key Regression Metrics and Their Meaning
Metric | Formula / Definition | What It Tells You | Good Value |
|---|---|---|---|
R² (R-squared) | Proportion of variance in Y explained by the model | How much of the outcome variation is captured; 0 = nothing, 1 = perfect fit | Domain-dependent; 0.7+ is a common rule of thumb in business contexts, and lower values can still be useful in noisy social data |
Adjusted R² | R² penalised for adding predictors that don't help | Prevents artificially inflating R² by throwing in irrelevant variables | Close to R²; a big gap suggests over-fitting or irrelevant predictors |
RMSE (Root Mean Squared Error) | √(mean of squared residuals) | Average prediction error in the same units as Y; sensitive to large errors | As low as possible; compare against baseline (e.g. predict the mean) |
MAE (Mean Absolute Error) | Mean of \|actual − predicted\| | Average absolute prediction error; less sensitive to outliers than RMSE | As low as possible; preferred when large errors are not disproportionately bad |
p-value (for each coefficient) | Probability of observing an estimate at least as extreme as the fitted β if the true β were 0 | Statistical significance of each predictor; a low p-value is evidence against β = 0, not a measure of effect size | Conventionally below 0.05 (5% threshold); with large samples even trivially small effects reach significance, so check the effect size too |
Confidence interval | Range of plausible values for the true β at a stated confidence level (e.g. 95% CI) | Uncertainty around the coefficient estimate; wide CI = imprecise estimate | Narrow CI that does not include zero for a significant predictor |
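
The error metrics in the table can be computed by hand on a tiny example. The sketch below uses only the Python standard library; the four actual and predicted values are made up for illustration, and `n_predictors = 1` is an assumption about the model that produced them.

```python
import math

# Made-up values for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
n_predictors = 1  # assumed: predictions came from a one-predictor model

n = len(actual)
residuals = [a - p for a, p in zip(actual, predicted)]
mean_actual = sum(actual) / n

ss_res = sum(r ** 2 for r in residuals)               # unexplained variation
ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
rmse = math.sqrt(ss_res / n)
mae = sum(abs(r) for r in residuals) / n

# Baseline for comparison: always predict the mean of the actual values.
baseline_rmse = math.sqrt(ss_tot / n)

print(f"R2={r2:.3f}, adjusted R2={adj_r2:.3f}")        # R2=0.900, adjusted R2=0.850
print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")               # RMSE=0.707, MAE=0.500
print(f"baseline RMSE={baseline_rmse:.3f}")            # baseline RMSE=2.236
```

RMSE exceeds MAE here because squaring gives the two non-zero residuals more weight, which is the sensitivity to large errors noted in the table.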
Assumptions of Linear Regression
Assumption | What It Means | How to Check | What Happens When Violated |
|---|---|---|---|
Linearity | The relationship between X and Y is linear | Scatter plot of X vs. Y; residual vs. fitted plot (should show random scatter) | Biased coefficients; model misses the true pattern |
Independence | Observations are independent of each other | Check data collection process; Durbin-Watson test for time-series autocorrelation | Underestimated standard errors; misleading significance tests |
Homoscedasticity | Variance of residuals is constant across all fitted values | Plot residuals vs. fitted values; Breusch-Pagan test | Coefficients stay unbiased but are inefficient; standard errors are biased, so confidence intervals and p-values become unreliable |
Normality of residuals | Residuals are approximately normally distributed | Q-Q plot of residuals; Shapiro-Wilk test on small samples | Affects hypothesis tests on small samples; less critical for large n |
No multicollinearity | Predictors are not highly correlated with each other | Variance Inflation Factor (VIF); values above 5–10 indicate a problem | Unstable coefficient estimates; large standard errors; predictors hard to interpret |
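
The VIF check in the last row can be sketched directly from its definition: regress each predictor on the others and compute 1 / (1 − R²). This is a NumPy-only illustration on synthetic data; the function name `vif` and the three predictors are hypothetical. statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for real work.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors (with an intercept).
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # near-duplicate of x1
x3 = rng.normal(size=500)                  # unrelated to x1 and x2

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two near-duplicate predictors should show VIFs far above the 5 to 10 range, while the unrelated predictor stays close to 1.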
Beyond OLS: Other Regression Types Analysts Use
Type | When to Use | Key Difference from OLS | Common Tools |
|---|---|---|---|
Logistic Regression | Binary outcome (churn: yes/no; conversion: yes/no) | Models log-odds of the outcome; outputs a probability between 0 and 1 | sklearn LogisticRegression, statsmodels Logit |
Ridge Regression (L2) | Many correlated predictors; prevents over-fitting | Adds a penalty term λ·Σβ² to shrink coefficients without eliminating them | sklearn Ridge |
Lasso Regression (L1) | Feature selection needed; sparse models preferred | Penalty λ·Σ\|β\| can shrink some coefficients exactly to zero, selecting features | sklearn Lasso |
Polynomial Regression | Non-linear relationship between X and Y | Adds X², X³ terms; still linear in coefficients, so OLS can fit it | sklearn PolynomialFeatures + LinearRegression |
Poisson Regression | Count outcomes (number of events, page views, tickets) | Models the log of the expected count; assumes variance equals mean | statsmodels GLM with Poisson family |
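
To illustrate how the ridge penalty in the table shrinks coefficients, here is a minimal closed-form sketch in NumPy. It assumes centred data and an unpenalised intercept; `ridge_fit` is a hypothetical helper written for this example, not the scikit-learn implementation, and the data are synthetic.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge on centred data: beta = (X'X + lam*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta  # intercept is not penalised
    return intercept, beta


rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # highly correlated with x1
X = np.column_stack([x1, x2])
y = 1.0 + x1 + x2 + rng.normal(scale=1.0, size=n)

_, beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to OLS
_, beta_ridge = ridge_fit(X, y, lam=10.0)

print("OLS coefficients:  ", np.round(beta_ols, 2))
print("ridge coefficients:", np.round(beta_ridge, 2))
```

With two nearly identical predictors, the OLS coefficients tend to be erratic, while the ridge coefficients have a smaller overall norm. Neither is driven exactly to zero, which is the contrast with Lasso.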
Regression in Python: A Practical Workflow
Step | Code Pattern | Purpose |
|---|---|---|
1. Explore relationships | df[['X1','X2','Y']].corr() and seaborn pairplot | Identify candidate predictors and check linearity visually |
2. Split data | train_test_split(X, y, test_size=0.2, random_state=42) | Reserve a hold-out set to evaluate model performance on unseen data |
3. Fit model | model = LinearRegression().fit(X_train, y_train) | Estimate β coefficients via OLS (minimises sum of squared residuals) |
4. Evaluate | predictions = model.predict(X_test); r2_score(y_test, predictions) and root_mean_squared_error(y_test, predictions) (scikit-learn 1.4+; older versions use mean_squared_error(..., squared=False)) | Quantify fit on hold-out data; compare to baseline |
5. Check residuals | residuals = y_test - model.predict(X_test); plt.scatter(model.predict(X_test), residuals) | Validate homoscedasticity and detect non-linearity or outliers |
6. Interpret coefficients | pd.Series(model.coef_, index=X.columns).sort_values() | Identify the magnitude and direction of each predictor's effect |
7. Statistical inference (optional) | import statsmodels.api as sm; sm.OLS(y, sm.add_constant(X)).fit().summary() | Get p-values, confidence intervals, and F-statistic for formal hypothesis testing |
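
The steps above can be strung together end to end. The sketch below mirrors steps 2 to 6 using NumPy only, so it runs without scikit-learn; each step is a stand-in for the corresponding call in the table. The dataset and its true coefficients (1.5 and −2.0) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Hypothetical dataset: two predictors and a linear outcome with noise.
X = rng.normal(size=(n, 2))
y = 5.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Step 2: 80/20 split via a shuffled index (stand-in for train_test_split).
idx = rng.permutation(n)
cut = int(0.8 * n)
train, test = idx[:cut], idx[cut:]


def add_intercept(A):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(A)), A])


# Step 3: fit by OLS on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Step 4: evaluate on the hold-out rows and compare with a mean-only baseline.
pred = add_intercept(X[test]) @ coef
residuals = y[test] - pred
rmse = np.sqrt(np.mean(residuals ** 2))
r2 = 1 - np.sum(residuals ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
baseline_rmse = np.sqrt(np.mean((y[test] - y[train].mean()) ** 2))

# Step 5: a numeric residual check; a plot of pred vs. residuals is the usual choice.
resid_fitted_corr = np.corrcoef(pred, residuals)[0, 1]

# Step 6: inspect the coefficients (intercept first).
print("coefficients:", np.round(coef, 2))
print(f"hold-out R2={r2:.2f}, RMSE={rmse:.2f}, baseline RMSE={baseline_rmse:.2f}")
print(f"corr(fitted, residuals)={resid_fitted_corr:.2f}")
```

The baseline uses the training mean rather than the test mean so that no information from the hold-out set leaks into the comparison.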
Common Mistakes Analysts Make with Regression
Mistake | Why It's Problematic | How to Avoid It |
|---|---|---|
Confusing correlation with causation | A significant β does not mean X causes Y; confounders may drive both | Use domain knowledge; consider randomised experiments or causal inference methods |
Over-fitting to training data | Model captures noise rather than signal; performs poorly on new data | Always evaluate on a held-out test set; use cross-validation; regularise with Ridge/Lasso |
Ignoring multicollinearity | Coefficients become unstable; signs can flip depending on which variables are included | Compute VIF; remove or combine correlated predictors; use PCA or Ridge regression |
Extrapolating beyond the data range | The linear relationship may not hold outside the observed range of X | Limit predictions to the range of training data; communicate uncertainty at extremes |
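Not scaling features before regularisation | Ridge and Lasso penalise coefficient size, and coefficient size depends on each predictor's units; predictors measured on small scales need larger coefficients and are shrunk more heavily, so results change with the choice of units | Standardise features with StandardScaler before fitting Ridge or Lasso models |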
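
The scaling issue in the last row can be demonstrated by fitting ridge to the same data expressed in two different units. This is a NumPy-only sketch using a closed-form ridge with an unpenalised intercept; the helpers `ridge_predict` and `standardise` and the variables `distance_m` and `rating` are hypothetical names for this example.

```python
import numpy as np


def ridge_predict(X, y, lam):
    """Fit closed-form ridge (unpenalised intercept) and return fitted values."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y.mean() + Xc @ beta


def standardise(X):
    """Scale each column to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


rng = np.random.default_rng(3)
n = 200
distance_m = rng.normal(loc=2000, scale=500, size=n)  # distance in metres
rating = rng.normal(loc=3, scale=1, size=n)
y = 0.004 * distance_m + 2.0 * rating + rng.normal(scale=1.0, size=n)

X_m = np.column_stack([distance_m, rating])          # distance in metres
X_km = np.column_stack([distance_m / 1000, rating])  # same data in kilometres

lam = 10.0
raw_gap = np.max(np.abs(ridge_predict(X_m, y, lam) - ridge_predict(X_km, y, lam)))
std_gap = np.max(np.abs(ridge_predict(standardise(X_m), y, lam)
                        - ridge_predict(standardise(X_km), y, lam)))

print(f"largest prediction gap without scaling: {raw_gap:.3f}")
print(f"largest prediction gap with scaling:    {std_gap:.2e}")
```

Without standardisation, switching from metres to kilometres changes the fitted values because the penalty bites harder on the kilometre coefficient. After standardisation the two fits agree up to floating-point error.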
Summary
Regression analysis is the workhorse of quantitative data analysis. It allows analysts to move from observation to explanation and prediction with a mathematically rigorous, interpretable framework. Mastering ordinary least squares is the entry point, but understanding when to apply logistic, ridge, or lasso variants — and how to diagnose model assumptions — separates analysts who produce trustworthy insights from those who produce misleading ones. Always validate on held-out data, check your residuals, and be cautious about inferring causality from observational regression results.