Explainable Regression Model Analysis

Regression analysis is a well-grounded method for determining the impact of variables in a model. Performing a regression lets us identify the factors that matter most and quantify how strongly the independent variables are associated with the dependent variable. It also helps us set aside factors that are not important to us.

There is one very important thing to keep in mind: regression establishes the existence of an association between variables, not a causal relationship. For example, ice cream sales increase during summer, and so do robbery incidents. So, should we build a regression model of one on the other? No. Calling a variable "dependent" does not imply that changes in its values are caused by changes in the values of the independent variables. We can only establish that a change in the value of the dependent variable is associated with a change in the value of the independent variables.

Let us look at simple linear regression (SLR) first.

The functional form of SLR is: Yi = β0 + β1 Xi + εi

Where,

Yi is the value of the ith observation of the dependent variable in the sample

Xi is the value of the ith observation of the independent variable in the sample

εi is the random error term (its estimated values are the residuals)

β0 and β1 are the regression parameters (or regression coefficients)

"Linear" regression means that the relationship between Y and the coefficients (β) is linear; the model is linear in its parameters, not necessarily in X.

One of the most important things in a regression model is to carefully examine the beta coefficients of the equation. They help you not only explain your model well but also build a correct and acceptable model. Let us see an example. The dataset I have used has three features: height, size of the left foot and size of the right foot. Height is the dependent variable we are trying to predict, while the others are independent variables.

Below is the summary output of the model I built with just one variable: size of the right foot.

[Image: OLS summary output for the model height ~ right]
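For readers who want to reproduce this kind of output, here is a minimal sketch using statsmodels. The file name and column names (height, right, left) are hypothetical; the numbers quoted below come from my own run.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset with columns: height (dependent), right and left foot sizes
df = pd.read_csv("foot_height.csv")

# Single-predictor model: height ~ right
X = sm.add_constant(df[["right"]])   # adds the intercept term beta_0
y = df["height"]

model_right = sm.OLS(y, X).fit()
print(model_right.summary())         # coefficients, p-values, R-squared
```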
We can clearly see that the variable is statistically significant and its coefficient is positive: with an intercept of 31.5457 and a coefficient of 3.194942 on "right", a one-unit increase in "right" is associated with an increase in height of about 3.19 units.

Also, the R-squared value is 0.81, which is very good. Now, let me introduce the other variable, "left", into the model.

[Image: OLS summary output for the model height ~ right + left]
Again, our model appears to be doing well, with an R-squared value of 0.818. It also showed very good accuracy on the test data.

But is this model acceptable?

This is a very important question which we always need to ask once we build our models, because answering it is how we build confidence in them and explain them further. Explainability has become crucial these days: it not only helps us get insights from the model, it also improves the model and builds our trust in it.

Now, if we look carefully at the "left" variable, it is statistically insignificant at the 95% confidence level. Therefore, we should not be using it, although in some business contexts we do keep variables that are not significant. But there is an even more important thing to look at: the sign of the "left" coefficient. It is negative, which would mean that for a unit increase in "left" the height of a person decreases. Logically, it is impossible that a larger left foot makes you shorter. Also, if you look at the coefficient of the "right" variable, it roughly doubled compared with the previous model, which is a very significant change. This shows the model is highly unstable, and the cause is multicollinearity: including variables that are highly correlated with each other can give you unstable coefficients. There is also the concept of omitted variable bias, which can likewise change coefficients, but we will look into that later.
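A quick way to confirm the problem is to check how strongly "left" and "right" are correlated and to compute their variance inflation factors (VIFs). A sketch, reusing the hypothetical df from the earlier snippet:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Correlation between the two foot measurements (expected to be close to 1)
print(df[["left", "right"]].corr())

# VIF of each predictor in the two-variable model
X2 = sm.add_constant(df[["left", "right"]])
for i, col in enumerate(X2.columns):
    if col != "const":
        print(col, variance_inflation_factor(X2.values, i))
```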

Now, would you still accept this model? (It had very good test accuracy.)

Let us investigate another example. Here we are trying to predict a person's salary from their percentage marks in high school.

The coefficients of the equation were:

const                    31306.421508

Percentage in grade 10    3597.521595

The independent variable came out to be significant.

- The estimated (predicted) model can be written as Salary = 30587.285 + 3560.587 * (Percentage in Grade 10)

- The equation can be interpreted as follows: for every 1% increase in Grade 10 marks, the salary of the MBA students is expected to increase by 3560.587.

Can we say here that if "Percentage in grade 10" is 0 then the salary will be 30587.285? No, we cannot, because we cannot extrapolate the regression equation outside the range of data on which the model was built. We must be very careful while reading our regression equation.

Now, this model looks good and statistically it passes all the tests. But can we accept it? Again, no. We still must validate it so that we can be confident the model works fine and that we are able to derive insights from it and explain it further.

The following measures are used to validate the linear regression models:

  • Coefficient of determination (R-squared).
  • Hypothesis tests for the regression coefficients.
  • Analysis of variance for overall model validity.
  • Residual analysis to validate the regression model assumptions.
  • Outlier analysis, since the presence of outliers can significantly impact the regression parameters.

Residual analysis

Analysis of residuals reveals whether the assumption of normally distributed errors holds. Residual plots can also reveal whether the actual relationship is non-linear.

The first plot we will look at is the P-P plot, which tests whether the errors follow a normal distribution.

[Image: P-P plot of the residuals]
The diagonal line is the cumulative distribution of a normal distribution, whereas the dots represent the cumulative distribution of the residuals. Since the dots are close to the diagonal line, we can conclude that the residuals approximately follow a normal distribution. Thus, the t and F hypothesis tests are valid for the model we built.
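A P-P plot like this can be produced from the fitted model's residuals. Here is a minimal sketch with statsmodels, assuming the model_right object from the earlier snippet:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot

# Residuals of the fitted single-variable model from the earlier sketch
resid = model_right.resid

# P-P plot: cumulative probabilities of the residuals against a fitted normal
ProbPlot(resid, fit=True).ppplot(line="45")
plt.title("P-P plot of residuals")
plt.show()
```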

Test for Homoscedasticity

An important assumption of the regression model is that the residuals have constant variance (homoscedasticity) across different values of the explanatory variable (X). That is, the variance of the residuals is assumed to be independent of X. Failure to meet this assumption makes the hypothesis tests unreliable.

[Image: residuals plotted against fitted values]
It can be observed from the figure above that the residuals are random and show no funnel shape, which means they have constant variance.
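One way to produce this diagnostic is a residual-versus-fitted plot, optionally backed by a Breusch-Pagan test. A sketch, again assuming model_right from earlier:

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Visual check: residuals against fitted values should show no funnel shape
plt.scatter(model_right.fittedvalues, model_right.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Formal check: a small Breusch-Pagan p-value suggests heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model_right.resid, model_right.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)
```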


Outlier analysis

Suppose you are examining the box-office collections of movies. There is one movie, 'AVENGERS', whose collection is way too high, and your plots show this data point outside the normal pattern or range. But is this data point really an outlier? In fact, it gives us a lot of information from a business perspective. Therefore, we should not treat a data point as an outlier purely on the basis of visualization.

The Z-score is the standardized distance of an observation from the mean. Any observation with a Z-score of more than 3 may be flagged as an outlier and as an influential observation that may change the regression parameter values significantly.

In our case there was no observation with a Z-score of more than 3, so the check produced no output.

Below is the output of another model on a different dataset where there were 3 records (data points) with a Z-score of more than 3, making them potentially very influential. We can go back, have a look at them and see whether they tell us anything. This is what the Z-score output looks like.

[Image: observations flagged with a Z-score greater than 3]
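A sketch of how such a Z-score check could be computed, using the hypothetical df and column name from earlier:

```python
from scipy import stats

# Z-score of each observation of the dependent variable
z = stats.zscore(df["height"])

# Flag observations whose absolute Z-score exceeds 3
print(df[abs(z) > 3])   # empty in our case, since no |Z| exceeded 3
```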
Cook's distance measures how much the predicted values of the dependent variable change, across all observations in the sample, when a particular observation is excluded from the estimation of the regression parameters. A Cook's distance value of more than 1 indicates a highly influential observation.

[Image: Cook's distance for each observation]

We can observe from the above that no observation has a Cook's distance greater than 1, meaning there is no highly influential observation present, though a couple of observations have distances slightly greater than the others. It is up to the individual whether to look into these observations at a local level and try to gain insights from them, or to be happy using them in the model as they are.
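Cook's distance is available from the fitted model's influence measures in statsmodels; a sketch:

```python
import matplotlib.pyplot as plt

influence = model_right.get_influence()
cooks_d, _ = influence.cooks_distance        # one distance per observation

# Plot the distances against the rule-of-thumb threshold of 1
plt.stem(cooks_d)
plt.axhline(1, color="red", linestyle="--")
plt.ylabel("Cook's distance")
plt.show()

print("Observations with Cook's distance > 1:", int((cooks_d > 1).sum()))
```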


The leverage value of an observation measures the influence of that observation on the overall fit of the regression function and is related to the Mahalanobis distance. A leverage value of more than 3(k + 1)/n is treated as indicating a highly influential observation, where k is the number of features in the model and n is the sample size.

[Image: leverage values for each observation]
There are a few observations that are influencing the overall fit of the regression function. Therefore, it is worth removing them, re-running the model and seeing how much the results vary.
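Leverage values come from the diagonal of the hat matrix; a sketch reusing the influence object from the Cook's distance snippet, with the 3(k + 1)/n threshold described above:

```python
# Leverage is the diagonal of the hat matrix, available from the influence object
leverage = influence.hat_matrix_diag
n = len(leverage)
k = model_right.df_model                     # number of features in the model
threshold = 3 * (k + 1) / n

print("High-leverage observations:", int((leverage > threshold).sum()))
```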

That was all about linear regression; now we will look at logistic regression.

In many ways, logistic regression is very similar to linear regression. One big difference, though, is the Logit function. Logistic regression seeks to model the relationship between a dependent variable and one or more independent variables. As in the case of linear regression, logistic regression allows us to look at the fit of the model as well as the significance of the relationships between dependent and independent variables. However, while linear regression uses least squares to find a best fitting curve and come up with coefficients that predict the change in the dependent variable given changes in the independent variables, logistic regression estimates the probability of an event occurring (e.g. the probability of a person staying in education post 16 years of age).

The regression coefficients are usually estimated using maximum likelihood estimation. Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead, for example Newton's method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.

In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for several reasons: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.

Having a large ratio of variables to cases results in an overly conservative Wald statistic and can lead to non-convergence. Regularized logistic regression is specifically intended to be used in this situation.

Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases. To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic used to assess whether multicollinearity is unacceptably high.

Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is an undefined value so that the final solution to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or add a constant to all cells.

Let us explore a binary credit classification problem. This is a very common problem and easy to work on. I have used a credit rating dataset from Kaggle. I will focus on model diagnostics in this article.

After all the EDA, encoding and hypothesis testing, I ran the first iteration of the model with all the variables.

[Image: logistic regression output of the first iteration with all the variables]
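A minimal sketch of how this first iteration could be fitted with statsmodels; the DataFrame credit_df and the target column "status" (1 for bad credit, 0 for good credit) are hypothetical names:

```python
import statsmodels.api as sm

# Hypothetical encoded credit data: 'status' is 1 for bad credit, 0 for good credit
y = credit_df["status"]
X = sm.add_constant(credit_df.drop(columns=["status"]))

logit_full = sm.Logit(y, X).fit()
print(logit_full.summary())   # the p-values here identify the significant variables
```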
The next step is to find the significant variables in the model and use only those.

Only 7 out of the 28 variables used in the first iteration came out to be significant.

[Image: the significant variables from the first iteration]
I have also checked the VIF score for these variables to see whether there is multicollinearity among them.

[Image: VIF scores for the selected variables]
Ideally, the VIF score should be less than 4, although in some cases researchers have used thresholds of 8-10. The VIF score is good for all the variables in our model. For AGE it is slightly higher than the others, but I have decided to keep this variable because, from a business perspective, it is very important. In any case, the VIF should not exceed 10.
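The selection of significant variables and the VIF check can be sketched as follows, continuing from the hypothetical logit_full model above; variance_inflation_factor is the statsmodels helper for this:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Variables significant at the 5% level in the first iteration
significant = logit_full.pvalues[logit_full.pvalues < 0.05].index.drop("const", errors="ignore")

X_sig = sm.add_constant(credit_df[list(significant)])
vif = pd.DataFrame({
    "variable": X_sig.columns,
    "VIF": [variance_inflation_factor(X_sig.values, i) for i in range(X_sig.shape[1])],
})
print(vif[vif["variable"] != "const"])   # ideally below 4, and never above 10
```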

Below is the output of the model with only the significant variables.

[Image: logistic regression output with only the significant variables]
The next step is to look at the classification report.

The confusion matrix for this model was:

[Image: confusion matrix]
[Image: classification report]
I have also used the ROC curve to determine how my model has performed. The AUC (area under the ROC curve) is the proportion of concordant pairs in the data. A model with a higher AUC is preferred, and AUC is frequently used for model selection.

[Image: ROC curve]
We got an area under the ROC curve of 0.75, which is a good number.
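For completeness, here is a sketch of how the confusion matrix, classification report and ROC/AUC above could be produced with scikit-learn; the train/test objects (y_train, X_train_sig, y_test, X_test_sig) are hypothetical names for a hold-out split over the significant variables:

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, classification_report

# Hypothetical hold-out evaluation of the model with the significant variables
logit_sig = sm.Logit(y_train, X_train_sig).fit()
y_prob = logit_sig.predict(X_test_sig)        # predicted probability of bad credit
y_pred = (y_prob >= 0.5).astype(int)          # default 0.50 cut-off

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print("AUC:", roc_auc_score(y_test, y_prob))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```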

The overall accuracy of the model is 0.71. The model performs well on the Good Credit class, which is '0': its recall and f1-score are very good. On the other hand, it does not do well on Bad Credit classification. From a business perspective, you have to decide which class to focus on more. Are you happy classifying good credit and working with that, in which case you will be able to cut down on bad credits and save resources? Do you want to focus more on bad credit, in which case you will lose out on some potential good credits? Or do you want to give equal weight to both? This is entirely a business decision and depends on how your model is performing.

Coming back to our model: can we improve our predictions? We have used a cut-off probability of 0.50 for classification, which is the standard and widely used value. But why stick to 0.50 when some other threshold might perform better? There are numerous values between 0 and 1 that we could try, but we do not have time to test them all blindly. Therefore, there are two techniques we can use here:

Youden’s Index for Optimal Cut-Off Probability

Sensitivity and specificity change when we change the cut-off probability. We can find the optimal cut-off by incrementally changing the cut-off probability, calculating the corresponding Youden's Index (sensitivity + specificity − 1) at each value, and choosing the cut-off that maximizes it.
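Since sensitivity is the true positive rate (TPR) and specificity is one minus the false positive rate (FPR), Youden's Index can be computed directly from the ROC curve. A sketch, reusing fpr, tpr, thresholds, y_prob and y_test from the earlier snippet:

```python
import numpy as np
from sklearn.metrics import classification_report

# Youden's Index at each ROC threshold: sensitivity + specificity - 1 = TPR - FPR
youden = tpr - fpr
best_cutoff = thresholds[np.argmax(youden)]
print("Cut-off maximizing Youden's Index:", best_cutoff)

y_pred_youden = (y_prob >= best_cutoff).astype(int)
print(classification_report(y_test, y_pred_youden))
```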

Using the Youden's Index approach, we got a cut-off probability of 0.22. Below is the classification report with 0.22 as the cut-off.

[Image: classification report with a 0.22 cut-off]

It can be clearly observed that we now perform better on the Bad Credit class (1): recall is 0.81, the f1-score has jumped from 0.37 to 0.59, and precision for Good Credit has also improved. However, overall accuracy has gone down to 0.66. In my opinion this classification report is better than the previous one. Some people may argue about that, but again it depends entirely on what you want to address. Accepting a model with lower overall accuracy than another because it serves the objective better is known as the accuracy paradox.

In this case the change is not as large as one might expect, because this is a fairly generic dataset, but in my experience I have seen substantial shifts in the classification report. In some settings even a 1% improvement is considered significant. Suppose you are on a flight and, just before take-off, the pilot announces that there is only a 95% chance the plane will land. Would you still take that flight?

Cost-Based Cut-Off Probability

In the cost-based approach, we assign a penalty cost to the misclassification of positives and negatives and try to find the cut-off probability p that minimizes the expected total penalty cost. A cut-off probability based on penalty costs is preferred in cases such as credit rating.
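A sketch of the search for a cost-minimizing cut-off; the penalty costs below are illustrative placeholders, not values from the article, and y_prob / y_test come from the earlier snippet:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FN = 5   # illustrative penalty for calling a bad credit good (false negative)
COST_FP = 1   # illustrative penalty for calling a good credit bad (false positive)

def total_cost(cutoff):
    y_pred = (y_prob >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return COST_FN * fn + COST_FP * fp

cutoffs = np.arange(0.01, 1.00, 0.01)
best = min(cutoffs, key=total_cost)
print("Cost-minimizing cut-off:", round(float(best), 2))
```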

Using the cost-based approach, we got 0.17 as our cut-off probability.

[Image: classification report with a 0.17 cut-off]

We can clearly see that recall for Bad Credit has jumped to 96%, which is a very high number. In our very first model we had 89% recall for Good Credit, and now we have even better recall on Bad Credit. In one case we identify our good credits accurately; in the other we identify the bad credits. On the very same trained model, different strategies give us two contrasting results. Which one to choose will depend on your business problem.

We have seen how different techniques can give you completely different outputs from the very same model. Therefore, we should always examine our model carefully and keep asking it different questions.

There is a lot more that goes into model building, but here I wanted to highlight the importance of model validation and diagnostics. They are crucial for explaining your model. You need to have full control over your model instead of blindly trusting it; as we have seen, trusting the accuracy of your model is not enough. You need to know the importance of each variable and its contribution to the model, both locally and globally. If you build your model logically, with the correct approach, you will be able to explain it and derive insights from it without any problem.

