Regression Analysis for Data Analysts

What Is Regression Analysis?

Regression analysis is a statistical technique for modelling the relationship between a dependent variable (the outcome you want to predict or explain) and one or more independent variables (the features or inputs). It is one of the most widely used methods in data analysis because it produces interpretable, quantified relationships — not just "X and Y are correlated" but "a one-unit increase in X is associated with a β-unit change in Y, holding all else constant." Analysts use regression to forecast sales, explain churn drivers, price products, evaluate marketing campaigns, and build the foundation for more advanced machine learning models.

Simple vs. Multiple Linear Regression

Concept	Simple Linear Regression	Multiple Linear Regression
Number of predictors	One independent variable	Two or more independent variables
Equation form	Y = β₀ + β₁X + ε	Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
Interpretation of β	Change in Y per unit increase in X	Change in Y per unit increase in Xᵢ, holding all other variables constant
Use case	Quick univariate exploration; e.g. ad spend vs. revenue	Real-world modelling where multiple factors drive the outcome simultaneously
Key risk	Omitted variable bias if other drivers are ignored	Multicollinearity when predictors are highly correlated with each other
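
The omitted variable bias named in the table can be seen in a small simulation. The sketch below uses NumPy only and entirely synthetic data; the variable names, the true coefficients (2 and 3) and the strength of the correlation between the predictors are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic data (all values are made up for illustration).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

def ols(features, target):
    """Return OLS coefficients [intercept, slopes...] via least squares."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

simple = ols(x1, y)                             # simple regression: x2 is omitted
multiple = ols(np.column_stack([x1, x2]), y)    # multiple regression: both predictors

print("simple   slope on x1:", round(simple[1], 2))
print("multiple slopes     :", np.round(multiple[1:], 2))
```

Under these assumptions the simple model's slope on x1 should land near 2 + 3 × 0.8 = 4.4, because x1 absorbs the effect of the omitted x2, while the multiple model should recover slopes close to 2 and 3.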

Key Regression Metrics and Their Meaning

Metric	Formula / Definition	What It Tells You	Good Value
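R² (R-squared)	Proportion of variance in Y explained by the model	How much of the outcome variation is captured; 0 = no better than predicting the mean, 1 = perfect fit (can be negative on hold-out data)	Domain-dependent; 0.7 or higher is a common rule of thumb in business contexts, while lower values can still be useful in noisy social data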
Adjusted R²	R² penalised for adding predictors that don't help	Prevents artificially inflating R² by throwing in irrelevant variables	Close to R²; a big gap suggests over-fitting or irrelevant predictors
RMSE (Root Mean Squared Error)	√(mean of squared residuals)	Typical size of the prediction error in the same units as Y; penalises large errors more heavily than MAE	As low as possible; compare against baseline (e.g. predict the mean)
MAE (Mean Absolute Error)	Mean of |actual − predicted|	Average absolute prediction error; less sensitive to outliers than RMSE	As low as possible; preferred when large errors are not disproportionately bad
p-value (for each coefficient)	Probability of observing an estimate at least as extreme as the estimated β if the true β were 0	Statistical significance of each predictor; a low p-value is evidence against a zero effect, not a measure of how large the effect is	Below 0.05 is the conventional threshold; with large samples even negligible effects reach significance, so check effect size as well
Confidence interval	Range within which the true β likely falls (e.g. 95% CI)	Uncertainty around the coefficient estimate; wide CI = imprecise estimate	Narrow CI that does not include zero for a significant predictor
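
The fit metrics in the table can be computed by hand in a few lines. This is a minimal sketch on four made-up values; the arrays and the predictor count used for adjusted R² are assumptions for illustration only.

```python
import numpy as np

# Tiny made-up example: four actual values and four model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])
n_predictors = 1          # assumed number of predictors in the model

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                      # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total variation around the mean

n = len(y_true)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
rmse = np.sqrt(np.mean(residuals ** 2))
mae = np.mean(np.abs(residuals))

print(f"R2={r2:.4f}  adjusted R2={adj_r2:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
# R2=0.9250  adjusted R2=0.8875  RMSE=0.6124  MAE=0.5000
```

RMSE (0.61) comes out higher than MAE (0.50) here because the single error of 1.0 is squared before averaging, which is the sensitivity to large errors that the table describes.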

Assumptions of Linear Regression

Assumption	What It Means	How to Check	What Happens When Violated
Linearity	The relationship between X and Y is linear	Scatter plot of X vs. Y; residual vs. fitted plot (should show random scatter)	Biased coefficients; model misses the true pattern
Independence	Observations are independent of each other	Check data collection process; Durbin-Watson test for time-series autocorrelation	Underestimated standard errors; misleading significance tests
Homoscedasticity	Variance of residuals is constant across all fitted values	Plot residuals vs. fitted values; Breusch-Pagan test	Coefficients stay unbiased but are inefficient; standard errors are wrong, so confidence intervals and p-values become unreliable
Normality of residuals	Residuals are approximately normally distributed	Q-Q plot of residuals; Shapiro-Wilk test on small samples	Affects hypothesis tests on small samples; less critical for large n
No multicollinearity	Predictors are not highly correlated with each other	Variance Inflation Factor (VIF); values above 5–10 indicate a problem	Unstable coefficient estimates; large standard errors; predictors hard to interpret
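
Two of the checks above, VIF and the Durbin-Watson statistic, are simple enough to compute directly. The sketch below implements both with NumPy on synthetic data; the helper names (fit_residuals, vif, durbin_watson) are hypothetical and written for this example. In practice statsmodels offers equivalents (variance_inflation_factor and durbin_watson).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated to both.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(size=n)

def fit_residuals(features, target):
    """Fit OLS with an intercept and return the residuals."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def vif(features, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the rest."""
    others = np.delete(features, j, axis=1)
    resid = fit_residuals(others, features[:, j])
    r2 = 1.0 - resid.var() / features[:, j].var()
    return 1.0 / (1.0 - r2)

def durbin_watson(resid):
    """Values near 2 suggest little first-order autocorrelation in the residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

vifs = [vif(X, j) for j in range(X.shape[1])]
dw = durbin_watson(fit_residuals(X, y))

print("VIF per predictor:", np.round(vifs, 1))
print("Durbin-Watson    :", round(dw, 2))
```

With this setup x1 and x2 should show VIF values far above the 5–10 threshold while x3 stays close to 1, and the Durbin-Watson statistic should sit near 2 because the simulated errors are independent.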

Beyond OLS: Other Regression Types Analysts Use

Type	When to Use	Key Difference from OLS	Common Tools
Logistic Regression	Binary outcome (churn: yes/no; conversion: yes/no)	Models log-odds of the outcome; outputs a probability between 0 and 1	sklearn LogisticRegression, statsmodels Logit
Ridge Regression (L2)	Many correlated predictors; prevents over-fitting	Adds a penalty term λ·Σβ² to shrink coefficients without eliminating them	sklearn Ridge
Lasso Regression (L1)	Feature selection needed; sparse models preferred	Penalty λ·Σ|β| can shrink some coefficients exactly to zero, selecting features	sklearn Lasso
Polynomial Regression	Non-linear relationship between X and Y	Adds X², X³ terms; still linear in coefficients, so OLS can fit it	sklearn PolynomialFeatures + LinearRegression
Poisson Regression	Count outcomes (number of events, page views, tickets)	Models the log of the expected count; assumes variance equals mean	statsmodels GLM with Poisson family
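
The difference between the Ridge and Lasso penalties is easiest to see side by side. The following sketch fits both with scikit-learn on synthetic data in which only two of five predictors matter; the data, the true coefficients and the alpha values are arbitrary assumptions, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 1_000

# Five synthetic predictors on the same scale; only the first two affect y.
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# alpha plays the role of the penalty strength λ; these values are arbitrary.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))   # shrunk slightly, none exactly zero
print("lasso:", np.round(lasso.coef_, 3))   # irrelevant predictors set exactly to zero

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("exact zeros, ridge vs lasso:", n_zero_ridge, n_zero_lasso)
```

Lasso also shrinks the two real effects (towards roughly 2.8 and -1.8 under these assumptions), which is the price paid for the built-in feature selection. In real work alpha would normally be chosen by cross-validation rather than fixed by hand.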

Regression in Python: A Practical Workflow

Step	Code Pattern	Purpose
1. Explore relationships	df[['X1','X2','Y']].corr() and seaborn pairplot	Identify candidate predictors and check linearity visually
2. Split data	train_test_split(X, y, test_size=0.2, random_state=42)	Reserve a hold-out set to evaluate model performance on unseen data
3. Fit model	model = LinearRegression().fit(X_train, y_train)	Estimate β coefficients via OLS (minimises sum of squared residuals)
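4. Evaluate	predictions = model.predict(X_test); r2_score(y_test, predictions); np.sqrt(mean_squared_error(y_test, predictions))	Quantify fit (R² and RMSE) on hold-out data; compare to baseline. The squared=False argument is avoided because recent scikit-learn releases removed it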
5. Check residuals	residuals = y_test - predictions; plt.scatter(predictions, residuals)	Validate homoscedasticity and detect non-linearity or outliers
6. Interpret coefficients	pd.Series(model.coef_, index=X.columns).sort_values()	Identify the magnitude and direction of each predictor's effect
7. Statistical inference (optional)	import statsmodels.api as sm; sm.OLS(y, sm.add_constant(X)).fit().summary()	Get p-values, confidence intervals, and F-statistic for formal hypothesis testing
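
Steps 2 to 6 can be strung together into one short script. This is a sketch on synthetic data: the column names (ad_spend, discount, revenue) and the coefficients used to generate revenue are hypothetical stand-ins for a real DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real DataFrame (column names are hypothetical).
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "ad_spend": rng.uniform(0, 100, size=n),
    "discount": rng.uniform(0, 30, size=n),
})
noise = rng.normal(scale=10, size=n)
df["revenue"] = 50 + 3.0 * df["ad_spend"] - 1.5 * df["discount"] + noise

X = df[["ad_spend", "discount"]]
y = df["revenue"]

# Steps 2-3: hold out 20% of rows, then fit OLS on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Step 4: evaluate on the hold-out set and compare with a predict-the-mean baseline.
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
baseline = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline))

# Step 5: residuals should centre on zero with no pattern against the predictions.
residuals = y_test - predictions

# Step 6: coefficients labelled by column name.
coefs = pd.Series(model.coef_, index=X.columns).sort_values()

print(f"R2={r2:.3f}  RMSE={rmse:.2f}  baseline RMSE={baseline_rmse:.2f}")
print(coefs)
```

Because the data were generated from a known linear rule, the fitted coefficients should come out close to 3.0 and -1.5 and the RMSE close to the noise level of 10. On real data there is no such ground truth, which is why the baseline comparison and the residual check matter.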

Common Mistakes Analysts Make with Regression

Mistake	Why It's Problematic	How to Avoid It
Confusing correlation with causation	A significant β does not mean X causes Y; confounders may drive both	Use domain knowledge; consider randomised experiments or causal inference methods
Over-fitting to training data	Model captures noise rather than signal; performs poorly on new data	Always evaluate on a held-out test set; use cross-validation; regularise with Ridge/Lasso
Ignoring multicollinearity	Coefficients become unstable; signs can flip depending on which variables are included	Compute VIF; remove or combine correlated predictors; use PCA or Ridge regression
Extrapolating beyond the data range	The linear relationship may not hold outside the observed range of X	Limit predictions to the range of training data; communicate uncertainty at extremes
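Not scaling features before regularisation	Ridge and Lasso penalise the raw size of each coefficient, which depends on the predictor's units; a predictor measured on a small scale needs a large coefficient and is shrunk more than an equally important predictor on a large scale	Standardise features with StandardScaler before fitting Ridge or Lasso models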
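
The last row of the table can be illustrated with a short experiment showing how the amount of shrinkage depends on each predictor's units unless features are standardised first. The data, the true coefficients and the deliberately strong alpha below are assumptions chosen to make the effect visible, not realistic settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two synthetic predictors that matter equally but are measured in different units.
x_small = rng.normal(scale=1.0, size=n)        # e.g. a rating, spread of about 1
x_large = rng.normal(scale=1000.0, size=n)     # e.g. an amount, spread of about 1000
X = np.column_stack([x_small, x_large])
true_coef = np.array([2.0, 0.002])             # each contributes equal variance to y
y = X @ true_coef + rng.normal(size=n)

alpha = float(n)  # deliberately strong penalty so the effect is easy to see

# Without scaling: fraction of each true coefficient that survives the penalty.
raw = Ridge(alpha=alpha).fit(X, y)
retained_raw = raw.coef_ / true_coef

# With scaling: both predictors have unit variance before the penalty applies.
scaled = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X, y)
coef_scaled = scaled.named_steps["ridge"].coef_

print("retained without scaling  :", np.round(retained_raw, 2))
print("coefficients after scaling:", np.round(coef_scaled, 2))
```

Without scaling, roughly half of the small-unit predictor's effect should survive while the large-unit predictor is left almost untouched. After standardisation the two coefficients should come out nearly equal, reflecting their equal importance. Putting the scaler inside the pipeline also ensures it is fitted on training data only when the pipeline is cross-validated.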

Summary

Regression analysis is the workhorse of quantitative data analysis. It allows analysts to move from observation to explanation and prediction with a mathematically rigorous, interpretable framework. Mastering ordinary least squares is the entry point, but understanding when to apply logistic, ridge, or lasso variants — and how to diagnose model assumptions — separates analysts who produce trustworthy insights from those who produce misleading ones. Always validate on held-out data, check your residuals, and be cautious about inferring causality from observational regression results.