Understanding Multicollinearity in Linear Regression

Multicollinearity is a common issue in linear regression models that can severely distort the interpretation of your results. In this article, I’ll walk you through:

  1. What multicollinearity is
  2. The difference between no multicollinearity, perfect multicollinearity, and imperfect multicollinearity
  3. How to detect it
  4. How to handle it using Python code


What is Multicollinearity?

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated. In simpler terms: one variable can be linearly predicted from the others with a substantial degree of accuracy.

This becomes problematic because:

  • It inflates the variance of coefficient estimates
  • It makes coefficients unstable and sensitive to small changes in the model (see the short sketch after this list)
  • It complicates the interpretation of individual predictors
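
To see that instability concretely, here is a minimal, self-contained sketch (synthetic data; all variable names are illustrative) that fits the same model on two random half-samples and watches the coefficients of the near-duplicate pair swing:

import numpy as np

np.random.seed(1)
n = 200
x1 = np.random.normal(size=n)
x2 = x1 + np.random.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2*x1 + 2*x2 + np.random.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

for trial in range(2):
    idx = np.random.choice(n, size=n // 2, replace=False)  # a random half of the data
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    print(beta)  # intercept barely moves, but the x1/x2 coefficients jump around

Their sum stays close to 4, but the individual values are poorly pinned down, which is exactly the interpretation problem described above.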


Types of Multicollinearity

1. No Multicollinearity

This is the ideal scenario. All independent variables are linearly independent. The regression coefficients can be estimated precisely and interpreted individually.

2. Perfect Multicollinearity

Occurs when one predictor is an exact linear combination of others. This violates a key assumption of linear regression and makes it impossible for OLS to estimate a unique set of coefficients.

🛑 Example: If X3 = 2*X1 + 3*X2, perfect multicollinearity exists. Depending on the implementation, the fit will either fail with a singular-matrix error or fall back to a pseudo-inverse and warn that the design matrix is singular.
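
To make that concrete, here is a minimal sketch (standalone synthetic data, names chosen for illustration) showing why an exact linear combination breaks the textbook OLS solve:

import numpy as np

np.random.seed(0)
n = 50
X1 = np.random.normal(size=n)
X2 = np.random.normal(size=n)
X3 = 2*X1 + 3*X2                      # exact linear combination of X1 and X2
X = np.column_stack([np.ones(n), X1, X2, X3])

print(np.linalg.matrix_rank(X))       # 3, not 4: the columns are linearly dependent
print(np.linalg.cond(X.T @ X))        # effectively infinite condition number

# The textbook solution beta = (X'X)^-1 X'y needs X'X to be invertible, which it
# is not here; statsmodels falls back to a pseudo-inverse and flags the singular
# design matrix in the summary notes instead of raising.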

3. Imperfect Multicollinearity

This is the real-world case we encounter most often. Predictors are highly correlated, but not perfectly. The model will still run, but the estimates may be unreliable.


Example in Python: Detecting Multicollinearity

Let’s go through some code to identify multicollinearity using the Variance Inflation Factor (VIF).

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
import matplotlib.pyplot as plt


# Generating synthetic data: X2 is built to be highly correlated with X1
np.random.seed(42)
X1 = np.random.normal(0, 1, 100)
X2 = X1 * 0.95 + np.random.normal(0, 0.1, 100)
X3 = np.random.normal(0, 1, 100)  # independent of X1 and X2
y = 3*X1 + 2*X2 + 1.5*X3 + np.random.normal(0, 1, 100)


df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y}) # Creating DataFrame

# Visualizing correlation
sns.heatmap(df[['X1', 'X2', 'X3']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

# Regression model
X = df[['X1', 'X2', 'X3']]
X = sm.add_constant(X)
model = sm.OLS(df['y'], X).fit()
print(model.summary())

# VIF calculation
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
        
[Image: correlation matrix heatmap of X1, X2, X3]

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     1001.
Date:                Mon, 24 Mar 2025   Prob (F-statistic):           2.91e-72
Time:                        15:28:46   Log-Likelihood:                -127.46
No. Observations:                 100   AIC:                             262.9
Df Residuals:                      96   BIC:                             273.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0875      0.089      0.981      0.329      -0.090       0.265
X1             3.1860      0.885      3.599      0.001       1.429       4.943
X2             1.6176      0.940      1.721      0.088      -0.248       3.483
X3             1.5269      0.083     18.302      0.000       1.361       1.693
==============================================================================
Omnibus:                        1.353   Durbin-Watson:                   1.821
Prob(Omnibus):                  0.508   Jarque-Bera (JB):                1.317
Skew:                           0.169   Prob(JB):                        0.518
Kurtosis:                       2.551   Cond. No.                         19.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

  feature        VIF
0   const   1.020278
1      X1  81.985000
2      X2  81.909203
3      X3   1.037920

Interpreting VIF

  • VIF ≈ 1: No multicollinearity
  • VIF between 1 and 5: Moderate multicollinearity, usually acceptable
  • VIF > 10: High multicollinearity (potential concern)

In the output above, X1 and X2 both have a VIF of roughly 82 while X3 sits near 1, confirming that X1 and X2 carry almost the same information.
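
As a quick follow-up (a minimal sketch reusing df, sm, pd, and variance_inflation_factor from the code above; the names X_reduced and vif_reduced are mine), dropping one of the offending columns brings the remaining VIFs back toward 1:

# Drop X2 and recompute VIF on the reduced design matrix
X_reduced = sm.add_constant(df[['X1', 'X3']])
vif_reduced = pd.DataFrame({
    'feature': X_reduced.columns,
    'VIF': [variance_inflation_factor(X_reduced.values, i)
            for i in range(X_reduced.shape[1])]
})
print(vif_reduced)  # X1 and X3 should now both sit close to 1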

What does OLS do?

OLS tries to find the "best-fitting line" through the data by minimizing the sum of the squared differences between the actual values (y) and the predicted values (ŷ) from the model.

Formally, OLS minimizes this:


RSS(β) = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − β₁x₁ᵢ − … − βₖxₖᵢ)²
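
For intuition, here is a minimal sketch (reusing df, sm, and np from above; X_mat, y_vec, and beta_hat are illustrative names) showing that the closed-form solution β̂ = (XᵀX)⁻¹Xᵀy reproduces the coefficients statsmodels reported:

# Closed-form OLS: solve (X'X) beta = X'y directly
X_mat = sm.add_constant(df[['X1', 'X2', 'X3']]).values
y_vec = df['y'].values
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)
print(beta_hat)  # ≈ [0.09, 3.19, 1.62, 1.53], matching the summary table above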

Why use OLS?

  • It gives unbiased estimates of the coefficients (if assumptions are met)
  • It’s efficient (smallest variance among all linear unbiased estimators)
  • It’s easy to interpret


How to Handle Multicollinearity

Here are some common strategies:

  1. Remove or combine correlated variables: If two variables are telling the same story, you might not need both (dropping X2 in the VIF follow-up above is exactly this).
  2. Use dimensionality reduction (PCA): Principal Component Analysis transforms your features into uncorrelated components (see the short PCA sketch after the Ridge example below).
  3. Regularization (Ridge, Lasso Regression): These techniques penalize large coefficients and help stabilize the model.

Example:

from sklearn.linear_model import Ridge

# alpha sets the strength of the L2 penalty; larger values shrink the
# correlated coefficients more aggressively
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(df[['X1', 'X2', 'X3']], df['y'])
print("Ridge Coefficients:", ridge_model.coef_)
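
For the PCA route mentioned in point 2 above, here is a minimal sketch (assuming scikit-learn is available, as in the Ridge example; the names pca, X_pcs, and pca_model are mine). The idea is to replace the correlated predictors with uncorrelated principal components and regress on those:

from sklearn.decomposition import PCA

# Project the three predictors onto uncorrelated principal components
pca = PCA(n_components=3)
X_pcs = pca.fit_transform(df[['X1', 'X2', 'X3']])
print(pca.explained_variance_ratio_)  # the last component is nearly empty because X1 and X2 overlap

# Regress y on the components; they are uncorrelated by construction
pca_model = sm.OLS(df['y'], sm.add_constant(X_pcs)).fit()
print(pca_model.params)

The trade-off is interpretability: each component is a mix of the original variables, so the coefficients no longer map one-to-one onto X1, X2, and X3.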
        

Insights from This Article

  • Multicollinearity doesn’t violate regression assumptions unless it’s perfect, but it can distort your results
  • Always check for it using correlation matrices and VIF
  • Consider removal, transformation, or regularization if it’s a concern


Have you faced multicollinearity in your models? What technique do you use to handle it? Let’s discuss in the comments!

#DataScience #MachineLearning #LinearRegression #Multicollinearity #FeatureEngineering #PythonForDataScience #Statsmodels #DataVisualization #RegressionAnalysis #Analytics #DataAnalysis #CorrelationMatrix #Heatmap #DataCleaning #LearningEveryday #DataCommunity #Pandas #Seaborn #ScikitLearn #PredictiveModeling #StatisticalModeling #MLBasics #LinkedInLearning

