Understanding Multicollinearity in Linear Regression
Multicollinearity is a common issue in linear regression models that can severely distort the interpretation of your results. In this article, I'll walk you through what multicollinearity is, the different types you may encounter, how to detect it in Python with the Variance Inflation Factor (VIF), and some common ways to handle it.
What is Multicollinearity?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated. In simpler terms: one variable can be linearly predicted from the others with a substantial degree of accuracy.
This becomes problematic because:
- The estimated coefficients become unstable: small changes in the data can swing them wildly.
- The standard errors of the affected coefficients are inflated, so predictors can look statistically insignificant even when they matter.
- It becomes hard to tell which variable is actually driving the response, because the correlated predictors carry overlapping information.
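To make the definition concrete, here is a minimal sketch (the variables and coefficients are made up for illustration) showing that a redundant predictor can be recovered almost exactly from another one:

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(0, 1, 200)
X2 = 0.9 * X1 + rng.normal(0, 0.1, 200)   # X2 is mostly a rescaled copy of X1

# Regress X2 on X1 with a simple least-squares line and check how much variance is explained
slope, intercept = np.polyfit(X1, X2, 1)
pred = slope * X1 + intercept
r2 = 1 - np.sum((X2 - pred) ** 2) / np.sum((X2 - X2.mean()) ** 2)
print(f"R^2 of X2 regressed on X1: {r2:.3f}")   # close to 1 -> X2 is nearly redundant

An R² close to 1 in this auxiliary regression is exactly the "can be linearly predicted from the others" situation described above.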
Types of Multicollinearity
1. No Multicollinearity
This is the ideal scenario. All independent variables are linearly independent. The regression coefficients can be estimated precisely and interpreted individually.
2. Perfect Multicollinearity
Occurs when one predictor is an exact linear combination of others. This violates one of the key assumptions of linear regression and makes it impossible to compute the model.
🛑 Example: If X3 = 2*X1 + 3*X2, then perfect multicollinearity exists. The design matrix becomes singular, so OLS has no unique solution; depending on the software, you will get a singular-matrix error, a warning, or one of the columns silently dropped.
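A quick sketch of that situation (synthetic data, separate from the main example below) showing why the math breaks down:

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(size=50)
X2 = rng.normal(size=50)
X3 = 2 * X1 + 3 * X2                      # exact linear combination -> perfect multicollinearity

X = np.column_stack([np.ones(50), X1, X2, X3])
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))   # rank 3 < 4 columns
print("det(X'X) ≈", np.linalg.det(X.T @ X))                        # ~0, so X'X cannot be inverted

Because X'X is not invertible, the usual OLS solution simply does not exist for this design matrix.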
3. Imperfect Multicollinearity
This is the real-world case we encounter most often. Predictors are highly correlated, but not perfectly. The model will still run, but the estimates may be unreliable.
Example in Python: Detecting Multicollinearity
Let’s go through some code to identify multicollinearity using the Variance Inflation Factor (VIF).
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)

# Generating synthetic data
X1 = np.random.normal(0, 1, 100)
X2 = X1 * 0.95 + np.random.normal(0, 0.1, 100)  # Highly correlated with X1
X3 = np.random.normal(0, 1, 100)
y = 3*X1 + 2*X2 + 1.5*X3 + np.random.normal(0, 1, 100)

# Creating DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y})
# Visualizing correlation
sns.heatmap(df[['X1', 'X2', 'X3']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
# Regression model
X = df[['X1', 'X2', 'X3']]
X = sm.add_constant(X)
model = sm.OLS(df['y'], X).fit()
print(model.summary())
# VIF calculation
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.969
Model: OLS Adj. R-squared: 0.968
Method: Least Squares F-statistic: 1001.
Date: Mon, 24 Mar 2025 Prob (F-statistic): 2.91e-72
Time: 15:28:46 Log-Likelihood: -127.46
No. Observations: 100 AIC: 262.9
Df Residuals: 96 BIC: 273.3
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0875 0.089 0.981 0.329 -0.090 0.265
X1 3.1860 0.885 3.599 0.001 1.429 4.943
X2 1.6176 0.940 1.721 0.088 -0.248 3.483
X3 1.5269 0.083 18.302 0.000 1.361 1.693
==============================================================================
Omnibus: 1.353 Durbin-Watson: 1.821
Prob(Omnibus): 0.508 Jarque-Bera (JB): 1.317
Skew: 0.169 Prob(JB): 0.518
Kurtosis: 2.551 Cond. No. 19.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
feature VIF
0 const 1.020278
1 X1 81.985000
2 X2 81.909203
3 X3 1.037920
Interpreting VIF
The variance inflation factor tells you how much the variance of a coefficient is inflated by that predictor's correlation with the other predictors. A VIF of 1 means no correlation; values between 1 and 5 are usually considered moderate; values above 5–10 are a common rule-of-thumb signal of problematic multicollinearity. In the output above, X1 and X2 both have a VIF of about 82, confirming they carry almost the same information, while X3 (VIF ≈ 1.04) is unaffected. You can also see the symptoms in the regression table: X1 and X2 have much larger standard errors than X3, and X2's p-value (0.088) is borderline even though it truly belongs in the model.
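Under the hood, the VIF for predictor j is 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A short sketch, reusing the df built in the example above, that reproduces the VIF for X1 by hand:

# VIF for X1 computed manually: regress X1 on the other predictors, then apply 1 / (1 - R^2)
others = sm.add_constant(df[['X2', 'X3']])
aux = sm.OLS(df['X1'], others).fit()
vif_x1 = 1 / (1 - aux.rsquared)
print(f"Manual VIF for X1: {vif_x1:.2f}")   # matches the ~82 reported by variance_inflation_factor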
What does OLS do?
OLS tries to find the "best-fitting line" through the data by minimizing the sum of the squared differences between the actual values (y) and the predicted values (ŷ) from the model.
Formally, OLS minimizes the residual sum of squares:
RSS(β) = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − β₁x₁ᵢ − ⋯ − βₖxₖᵢ)²
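Minimizing this expression leads to the normal equations (X'X)β = X'y. As a sanity-check sketch, reusing X, df, and model from the example above, solving the normal equations directly gives the same coefficients statsmodels reports:

# Solve the least-squares problem directly via the normal equations (X'X) b = X'y
beta_hat = np.linalg.solve(X.values.T @ X.values, X.values.T @ df['y'].values)
print("Normal-equation coefficients:", np.round(beta_hat, 4))
print("statsmodels coefficients:    ", np.round(model.params.values, 4))  # should match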
Why use OLS?
OLS is the workhorse of regression because it has a closed-form solution, it is cheap to compute, its coefficients have a direct interpretation (the expected change in y for a one-unit change in a predictor, holding the others fixed), and under the classical Gauss–Markov assumptions it gives the best linear unbiased estimates. Multicollinearity attacks precisely that "holding the others fixed" interpretation, which is why it matters so much here.
How to Handle Multicollinearity
Here are some common strategies:
- Drop one of the correlated predictors (sketched after the Ridge example below).
- Combine correlated predictors into a single feature, for example with PCA or by averaging them.
- Collect more data, which can sometimes reduce the correlation between predictors.
- Use regularization such as Ridge or Lasso, which shrinks the coefficients and stabilizes the estimates.
Example: Ridge regression keeps all the predictors but shrinks the correlated coefficients, which stabilizes them:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(df[['X1', 'X2', 'X3']], df['y'])
print("Ridge Coefficients:", ridge_model.coef_)
Insights from This Article
- Multicollinearity does not stop OLS from running, but it inflates standard errors and makes individual coefficients hard to trust.
- A correlation heatmap is a quick first check; VIF gives a more precise, per-predictor diagnosis.
- In our synthetic example, X1 and X2 had VIFs above 80, and the regression output showed the expected symptoms: large standard errors and a borderline p-value for X2.
- Dropping a redundant predictor or switching to Ridge regression are simple, effective remedies.
Have you faced multicollinearity in your models? What technique do you use to handle it? Let’s discuss in the comments!
#DataScience #MachineLearning #LinearRegression #Multicollinearity #FeatureEngineering #PythonForDataScience #Statsmodels #DataVisualization #RegressionAnalysis #Analytics #DataAnalysis #CorrelationMatrix #Heatmap #DataCleaning #LearningEveryday #DataCommunity #Pandas #Seaborn #ScikitLearn #PredictiveModeling #StatisticalModeling #MLBasics #LinkedInLearning