Linear regression. What is it and how can it be useful?
Linear regression analysis is one of the most important methods of multivariate statistical analysis for studying relationships between variables. Regression analysis examines the relationship of one variable (called the dependent, or target, variable) to one or more other variables (called independent variables or predictors). The aim of regression analysis is to find, from the data, the regression equation that best expresses the relationship between the dependent and independent variables, to check whether the resulting model is adequate, and to use it to calculate values of the dependent variable. The general equation of linear regression can be written as:
Y = a + b1 X1 + b2 X2 + ... + bk Xk + e,
where a is a constant (the intercept), b1, b2, ..., bk are regression coefficients, also called slope or direction coefficients, and e is an error term. The regression coefficient bi shows by how much Y changes when Xi increases by one unit, with the other predictors held fixed.
Regression is used to:
1) predict or calculate the value of the dependent variable from the values of the independent variables;
2) determine what the relationship between the variables is;
3) compare the relationships between the independent variables and the dependent variable.
The linear regression analysis is based on the following assumptions:
- The relationship between the variables is linear
- Variables are measured at least on an interval scale
- The distributions of the variables are normal (at least approximately)
- Independent variables are uncorrelated (or only weakly correlated)
- Errors are normally distributed random variables
- The average of all errors is zero
- The variances of all errors are equal; this assumption is also called the homoskedasticity requirement
- All errors are independent of each other
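Some of the error assumptions above can be checked on the residuals of a fitted model. Below is a minimal sketch; the simulated data and the Shapiro-Wilk normality test are illustrative choices, not something the assumptions themselves prescribe:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data for a one-predictor model Y = a + b*X + e
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

# Fit by least squares; polyfit returns [slope, intercept]
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# With an intercept in the model, the residuals average to zero
print(abs(residuals.mean()))

# Shapiro-Wilk test for normality of the residuals:
# a large p-value means normality is not rejected
stat, p = stats.shapiro(residuals)
print(p)
```

A plot of residuals against fitted values is the usual companion check for the equal-variance (homoskedasticity) assumption.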
Although linear regression analysis requires that the dependent and independent variables be measured at least on an interval scale, nominal or ordinal variables can often be used by recoding them into dummy variables. Each nominal or ordinal variable with m categories can be described by m-1 dummy variables, each taking the values 0 and 1. If the number of dummy variables is too large, the results become difficult to interpret.
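The m-1 dummy coding can be sketched with pandas; the "city" variable and its categories below are made-up examples:

```python
import pandas as pd

# A nominal variable with m = 3 categories (hypothetical example)
df = pd.DataFrame({"city": ["Vilnius", "Kaunas", "Vilnius", "Riga"]})

# drop_first=True keeps m - 1 = 2 dummy variables, coded 0/1;
# the dropped category becomes the reference level
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

The dropped category serves as the baseline: both remaining dummies equal to 0 identifies it.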
Linear regression analysis is sensitive to outliers and requires that the relationships between the dependent and independent variables be linear. It is therefore recommended to screen for outliers before running the regression and to inspect the relationships graphically, for example with scatter plots.
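One simple numeric way to flag candidate outliers before the analysis is the interquartile-range (IQR) rule sketched below; the data and the 1.5*IQR cutoff are illustrative assumptions (the text does not prescribe a method), and a scatter plot would complement this check:

```python
import numpy as np

# Hypothetical sample with one obvious outlier appended
values = np.array([10.1, 9.8, 10.4, 9.9, 10.0, 10.2, 9.7, 35.0])

# IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < low) | (values > high)]
print(outliers)
```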
The problem of multicollinearity
Multicollinearity is a strong correlation between independent variables. A strong correlation between variables is indicated by correlation coefficients close to ±1; however, even moderate correlations can affect regression results, so it is important to consider them as well. Under multicollinearity, the influence of the individual independent variables on the dependent variable cannot be separated well, and the regression coefficients become unstable: a few additional observations can change their magnitude or even their sign. In most cases, multicollinearity is assessed by calculating the Variance Inflation Factor (VIF). A variable is considered too collinear if its VIF > 4. If multicollinearity is present, it must be removed: for example, by discarding one variable from the equation, or by computing a new variable from two strongly related ones (such as their average).
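The VIF of a variable is 1/(1 - R²), where R² comes from regressing that variable on the remaining predictors. A small NumPy sketch follows (libraries such as statsmodels also provide a ready-made `variance_inflation_factor`); the simulated data, where x2 nearly duplicates x1, is a made-up example:

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing X[:, i] on the rest."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1 -> collinear
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# The collinear x1 should exceed the VIF > 4 rule of thumb; x3 should not
print(vif(X, 0) > 4, vif(X, 2) > 4)
```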
How to evaluate whether linear regression analysis is reliable?
The coefficient of determination (R Square) shows the share of the variance of the dependent variable that is explained by the variation of the independent variable or variables. Adjusted R Square is the coefficient of determination adjusted for the sample size and the number of independent variables; it is an estimate of the coefficient of determination in the population. For large samples, the coefficient of determination and the adjusted coefficient of determination differ only slightly. The higher the coefficient of determination, the more accurately the dependent variable can be calculated from the independent ones. In addition to these coefficients, various other regression metrics such as RMSE, MASE, and others can be used.
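These quantities follow directly from the definitions: R² = 1 - SS_res/SS_tot, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), and RMSE is the root mean squared residual. A small sketch with made-up true values and predictions:

```python
import numpy as np

# Hypothetical true values and model predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

n, k = len(y_true), 1  # sample size and number of independent variables
resid = y_true - y_pred

ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
rmse = np.sqrt(np.mean(resid ** 2))

print(r2, adj_r2, rmse)
```

As the formula shows, adjusted R² is always at most R², and the gap shrinks as n grows relative to k.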
Yes, that is theory but how to implement it in practice?
There is more than one way to put this into practice; even everyone's favorite Excel has a linear regression analysis feature. Here are some ways to implement linear regression analysis. All of these are just examples, as there are many different libraries/packages for solving this problem.
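One common option in Python is scikit-learn; this is just one library among many (statsmodels, R's `lm`, Excel's built-in tool, etc.), and the simulated data below is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Simulated data following Y = 2 + 3*X1 - 1*X2 + e
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit the linear regression and inspect the recovered equation
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # close to the true 2 and [3, -1]
print(model.score(X, y))              # R^2 on the training data
```

`model.predict(new_X)` then calculates dependent-variable values for new observations, which is exactly the prediction use case described above.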