Understanding Multicollinearity in Linear Regression

Multicollinearity is a common issue in linear regression models that can severely distort the interpretation of your results. In this article, I’ll walk you through:

  1. What multicollinearity is
  2. The difference between no multicollinearity, perfect multicollinearity, and imperfect multicollinearity
  3. How to detect it
  4. How to handle it using Python code


What is Multicollinearity?

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated. In simpler terms: one variable can be linearly predicted from the others with a substantial degree of accuracy.

This becomes problematic because:

  • It inflates the variance of coefficient estimates
  • It makes coefficients unstable and sensitive to small changes in the model (see the short sketch after this list)
  • It complicates the interpretation of individual predictors
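
To see that instability concretely, here is a minimal, self-contained sketch (synthetic data; all variable names are illustrative) that fits the same model on two random half-samples and watches the coefficients of the near-duplicate pair swing:

import numpy as np

np.random.seed(1)
n = 200
x1 = np.random.normal(size=n)
x2 = x1 + np.random.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2*x1 + 2*x2 + np.random.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

for trial in range(2):
    idx = np.random.choice(n, size=n // 2, replace=False)  # a random half of the data
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    print(beta)  # intercept barely moves, but the x1/x2 coefficients jump around

Their sum stays close to 4, but the individual values are poorly pinned down, which is exactly the interpretation problem described above.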


Types of Multicollinearity

1. No Multicollinearity

This is the ideal scenario. All independent variables are linearly independent. The regression coefficients can be estimated precisely and interpreted individually.

2. Perfect Multicollinearity

Occurs when one predictor is an exact linear combination of others. This violates a key assumption of linear regression and makes it impossible for OLS to estimate a unique set of coefficients.

🛑 Example: If X3 = 2*X1 + 3*X2, perfect multicollinearity exists. Depending on the implementation, the fit will either fail with a singular-matrix error or fall back to a pseudo-inverse and warn that the design matrix is singular.
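
To make that concrete, here is a minimal sketch (standalone synthetic data, names chosen for illustration) showing why an exact linear combination breaks the textbook OLS solve:

import numpy as np

np.random.seed(0)
n = 50
X1 = np.random.normal(size=n)
X2 = np.random.normal(size=n)
X3 = 2*X1 + 3*X2                      # exact linear combination of X1 and X2
X = np.column_stack([np.ones(n), X1, X2, X3])

print(np.linalg.matrix_rank(X))       # 3, not 4: the columns are linearly dependent
print(np.linalg.cond(X.T @ X))        # effectively infinite condition number

# The textbook solution beta = (X'X)^-1 X'y needs X'X to be invertible, which it
# is not here; statsmodels falls back to a pseudo-inverse and flags the singular
# design matrix in the summary notes instead of raising.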

3. Imperfect Multicollinearity

This is the real-world case we encounter most often. Predictors are highly correlated, but not perfectly. The model will still run, but the estimates may be unreliable.


Example in Python: Detecting Multicollinearity

Let’s go through some code to identify multicollinearity using the Variance Inflation Factor (VIF).

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
import matplotlib.pyplot as plt


# Generating synthetic data: X2 is built to be highly correlated with X1
np.random.seed(42)
X1 = np.random.normal(0, 1, 100)
X2 = X1 * 0.95 + np.random.normal(0, 0.1, 100)
X3 = np.random.normal(0, 1, 100)  # independent of X1 and X2
y = 3*X1 + 2*X2 + 1.5*X3 + np.random.normal(0, 1, 100)


df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y}) # Creating DataFrame

# Visualizing correlation
sns.heatmap(df[['X1', 'X2', 'X3']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

# Regression model
X = df[['X1', 'X2', 'X3']]
X = sm.add_constant(X)
model = sm.OLS(df['y'], X).fit()
print(model.summary())

# VIF calculation
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
        
[Image: correlation matrix heatmap of X1, X2, X3]

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     1001.
Date:                Mon, 24 Mar 2025   Prob (F-statistic):           2.91e-72
Time:                        15:28:46   Log-Likelihood:                -127.46
No. Observations:                 100   AIC:                             262.9
Df Residuals:                      96   BIC:                             273.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0875      0.089      0.981      0.329      -0.090       0.265
X1             3.1860      0.885      3.599      0.001       1.429       4.943
X2             1.6176      0.940      1.721      0.088      -0.248       3.483
X3             1.5269      0.083     18.302      0.000       1.361       1.693
==============================================================================
Omnibus:                        1.353   Durbin-Watson:                   1.821
Prob(Omnibus):                  0.508   Jarque-Bera (JB):                1.317
Skew:                           0.169   Prob(JB):                        0.518
Kurtosis:                       2.551   Cond. No.                         19.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

  feature        VIF
0   const   1.020278
1      X1  81.985000
2      X2  81.909203
3      X3   1.037920

Interpreting VIF

  • VIF ≈ 1: No multicollinearity
  • VIF between 1 and 5: Moderate multicollinearity, usually acceptable
  • VIF > 10: High multicollinearity (potential concern)

In the output above, X1 and X2 both have a VIF of roughly 82 while X3 sits near 1, confirming that X1 and X2 carry almost the same information.
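
As a quick follow-up (a minimal sketch reusing df, sm, pd, and variance_inflation_factor from the code above; the names X_reduced and vif_reduced are mine), dropping one of the offending columns brings the remaining VIFs back toward 1:

# Drop X2 and recompute VIF on the reduced design matrix
X_reduced = sm.add_constant(df[['X1', 'X3']])
vif_reduced = pd.DataFrame({
    'feature': X_reduced.columns,
    'VIF': [variance_inflation_factor(X_reduced.values, i)
            for i in range(X_reduced.shape[1])]
})
print(vif_reduced)  # X1 and X3 should now both sit close to 1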

What does OLS do?

OLS tries to find the "best-fitting line" through the data by minimizing the sum of the squared differences between the actual values (y) and the predicted values (ŷ) from the model.

Formally, OLS minimizes this:


RSS(β) = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − β₁x₁ᵢ − … − βₖxₖᵢ)²
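
For intuition, here is a minimal sketch (reusing df, sm, and np from above; X_mat, y_vec, and beta_hat are illustrative names) showing that the closed-form solution β̂ = (XᵀX)⁻¹Xᵀy reproduces the coefficients statsmodels reported:

# Closed-form OLS: solve (X'X) beta = X'y directly
X_mat = sm.add_constant(df[['X1', 'X2', 'X3']]).values
y_vec = df['y'].values
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)
print(beta_hat)  # ≈ [0.09, 3.19, 1.62, 1.53], matching the summary table above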

Why use OLS?

  • It gives unbiased estimates of the coefficients (if assumptions are met)
  • It’s efficient (smallest variance among all linear unbiased estimators)
  • It’s easy to interpret


How to Handle Multicollinearity

Here are some common strategies:

  1. Remove or combine correlated variables: If two variables are telling the same story, you might not need both (dropping X2 in the VIF follow-up above is exactly this).
  2. Use dimensionality reduction (PCA): Principal Component Analysis transforms your features into uncorrelated components (see the short PCA sketch after the Ridge example below).
  3. Regularization (Ridge, Lasso Regression): These techniques penalize large coefficients and help stabilize the model.

Example:

from sklearn.linear_model import Ridge

# alpha sets the strength of the L2 penalty; larger values shrink the
# correlated coefficients more aggressively
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(df[['X1', 'X2', 'X3']], df['y'])
print("Ridge Coefficients:", ridge_model.coef_)
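
For the PCA route mentioned in point 2 above, here is a minimal sketch (assuming scikit-learn is available, as in the Ridge example; the names pca, X_pcs, and pca_model are mine). The idea is to replace the correlated predictors with uncorrelated principal components and regress on those:

from sklearn.decomposition import PCA

# Project the three predictors onto uncorrelated principal components
pca = PCA(n_components=3)
X_pcs = pca.fit_transform(df[['X1', 'X2', 'X3']])
print(pca.explained_variance_ratio_)  # the last component is nearly empty because X1 and X2 overlap

# Regress y on the components; they are uncorrelated by construction
pca_model = sm.OLS(df['y'], sm.add_constant(X_pcs)).fit()
print(pca_model.params)

The trade-off is interpretability: each component is a mix of the original variables, so the coefficients no longer map one-to-one onto X1, X2, and X3.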
        

Insights from This Article

  • Multicollinearity doesn’t violate regression assumptions unless it’s perfect, but it can distort your results
  • Always check for it using correlation matrices and VIF
  • Consider removal, transformation, or regularization if it’s a concern


Have you faced multicollinearity in your models? What technique do you use to handle it? Let’s discuss in the comments!

#DataScience #MachineLearning #LinearRegression #Multicollinearity #FeatureEngineering #PythonForDataScience #Statsmodels #DataVisualization #RegressionAnalysis #Analytics #DataAnalysis #CorrelationMatrix #Heatmap #DataCleaning #LearningEveryday #DataCommunity #Pandas #Seaborn #ScikitLearn #PredictiveModeling #StatisticalModeling #MLBasics #LinkedInLearning

