LINEAR REGRESSION

Linear regression helps to find the linear relationship between a target/dependent variable (continuous) and independent/predictor variables (which can be continuous, categorical, or both). Linear regression is used to predict only a continuous target variable.

When is linear regression appropriate?

  1. The relationship between the variables is linear.
  2. The data is homoskedastic, meaning the variance of the residuals (the differences between the actual and predicted values) is more or less constant.
  3. The residuals are independent, meaning the residuals are distributed randomly and not influenced by the residuals in previous observations. If the residuals are not independent of each other, they’re considered to be autocorrelated.
  4. The residuals are normally distributed. This assumption means the probability density function of the residual values is normally distributed at each x value. I leave this assumption for last because I don’t consider it to be a hard requirement for the use of linear regression, although if this isn’t true, some manipulations must be made to the model.
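These assumptions can be spot-checked programmatically. Below is a minimal sketch, assuming NumPy, SciPy, and statsmodels are available, and using synthetic data in place of a real dataset:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic data with a linear signal, standing in for a real dataset.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# 1. Linearity: the correlation between x and y should be strong.
print("Pearson r:", stats.pearsonr(x, y)[0])

# 2. Homoskedasticity: in practice, plot model.fittedvalues against
#    residuals and check the spread stays roughly constant.

# 3. Independence: a Durbin-Watson statistic near 2 suggests the
#    residuals are not autocorrelated.
print("Durbin-Watson:", durbin_watson(residuals))

# 4. Normality of residuals: a Shapiro-Wilk p-value above 0.05 is
#    consistent with normally distributed residuals.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```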

Simple Linear Regression:

Simple linear regression is useful for finding the relationship between two continuous variables: one is the predictor or independent variable and the other is the response or dependent variable.

Example: Relationship between ‘number of hours studied’ and ‘marks obtained’.

The goal is to design a model that can predict the mark given the number of hours studied. That is, if we give the number of hours studied by a student as input, our model should predict their mark with minimum error.

Model building: X = number of hours studied; Y (to be predicted) = mark scored:

Y(pred) = b0 + b1*X

b1 is the slope of the line, and b0 is the intercept (the value of Y when X = 0).

The values b0 and b1 must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain a line that best reduces this error.

If we don’t square the errors, then positive and negative errors will cancel each other out.

With X̄ as the mean of the X values and Ȳ as the mean of the Y values, the least-squares estimates are:

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

b0 = Ȳ − b1*X̄

Exploring ‘b1’

  • If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an increase in x will increase y.
  • If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an increase in x will decrease y.
The Math Behind Linear Regression
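To make the formulas above concrete, here is a minimal NumPy sketch of the closed-form least-squares estimates, using made-up hours/marks data for illustration:

```python
import numpy as np

# Toy data: hours studied (X) and marks obtained (Y).
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([35, 48, 55, 67, 80], dtype=float)

x_mean, y_mean = X.mean(), Y.mean()

# b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
# b0 = Ȳ − b1 * X̄
b0 = y_mean - b1 * x_mean

print(f"Y(pred) = {b0:.2f} + {b1:.2f} * X")
# Predict the mark for 6 hours of study:
print("Predicted mark for 6 hours:", b0 + b1 * 6)
```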

Multiple Regression:

  • Multiple linear regression is the most common form of linear regression analysis.
  • Multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables.
  • The independent variables can be continuous or categorical (dummy coded as appropriate).
  • The independent variables should not be multicollinear.
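As a sketch of how this looks in practice, scikit-learn's LinearRegression combined with pandas dummy coding handles a categorical predictor; the column names and data here are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict salary from experience (continuous)
# and city (categorical, dummy coded).
df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 2, 8],
    "city": ["A", "B", "A", "C", "B", "C"],
    "salary": [30, 45, 55, 75, 40, 80],
})

# Dummy code the categorical predictor; drop_first avoids perfect
# multicollinearity with the intercept.
X = pd.get_dummies(df[["experience", "city"]], columns=["city"], drop_first=True)
y = df["salary"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)), model.intercept_)
```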

Variance Inflation Factor - VIF

Multicollinearity is typically checked using the VIF.

The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related.

VIF = 1 / (1 – R^2)

R^2 for each independent variable is computed by regressing that variable against all the other independent variables.

By regressing each variable on the others, we find how much of its variance can be explained by all the other variables taken together.
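A common way to compute VIFs is statsmodels' variance_inflation_factor. The sketch below uses a made-up predictor matrix in which x3 is nearly a linear combination of x1 and x2, so its VIF should come out high:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictor matrix; x3 ≈ x1 + x2 by construction.
X = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [3.1, 2.9, 7.2, 6.8, 11.1, 10.9],
})
X_const = add_constant(X)  # VIF is computed against a model with an intercept

# One VIF per column; the 'const' entry can be ignored.
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)}
print(vifs)  # a VIF above roughly 5-10 usually flags problematic collinearity
```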

Coefficient of Determination: R2

  • R2 indicates how much variance two variables share, or how much variance in an outcome variable is explained, or accounted for, by a set of predictor variables.
  • Values can range from 0.00 to 1.00, or 0 to 100%.
  • In terms of regression analysis, the coefficient of determination is an overall measure of the accuracy of the regression model.
  • In simple linear regression analysis, this coefficient is calculated by squaring the r value between the two variables, where r is the correlation coefficient.
  • In multiple linear regression analysis, R2 is known as the coefficient of multiple determination.

It helps to describe how well a regression line fits the data (a.k.a. goodness of fit). An R2 value of 0 indicates that the regression line does not fit the set of data points, and a value of 1 indicates that the regression line fits the data points perfectly.

  • By definition, R2 is calculated as one minus the Sum of Squares of Residuals (SSerror) divided by the Total Sum of Squares (SStotal): R2 = 1 – (SSerror / SStotal).

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
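A quick sketch of the simple-regression case on toy data, confirming that squaring Pearson's r matches the model's R2:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# Toy data with an approximately linear relationship.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, _ = pearsonr(x, y)
r_squared_from_corr = r ** 2

model = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared_from_model = model.score(x.reshape(-1, 1), y)  # R2 of the fit

print(r_squared_from_corr, r_squared_from_model)  # the two values match
```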


Sum of Squared Errors (SSE), also called the Residual Sum of Squares (RSS) or Sum of Squared Residuals (SSR): Σ(Yi − Ŷi)², the squared distance between the actual values and the predictions.

Regression Sum of Squares, also called the Explained Sum of Squares (ESS) or Sum of Squares of the Model (SSM): Σ(Ŷi − Ȳ)², the variation captured by the regression line.

Total Sum of Squares (SSTO): Σ(Yi − Ȳ)², which tells how much the data points move around the mean. For an ordinary least-squares fit with an intercept, SSTO = SSM + SSE.
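A small sketch tying these quantities together, on toy data fitted by ordinary least squares:

```python
import numpy as np

# Toy data; fit a line by ordinary least squares so the classic
# decomposition SSTO = SSM + SSE holds exactly.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3.0, 5.0, 7.0, 9.0, 12.0])

b1, b0 = np.polyfit(x, y, deg=1)  # slope, intercept
y_pred = b0 + b1 * x
y_mean = y.mean()

sse = np.sum((y - y_pred) ** 2)       # residual sum of squares
ssm = np.sum((y_pred - y_mean) ** 2)  # explained / model sum of squares
ssto = np.sum((y - y_mean) ** 2)      # total sum of squares

print(np.isclose(ssto, ssm + sse))  # True: SSTO = SSM + SSE
print(1 - sse / ssto, ssm / ssto)   # two equal ways to compute R2
```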

Null-Hypothesis and P-value

The null hypothesis is the initial claim that the researcher specifies based on previous research or knowledge.

Low p-value: reject the null hypothesis, indicating that the predictor is related to the response.

High p-value: changes in the predictor are not associated with changes in the target.
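For example, fitting an OLS model with statsmodels exposes a p-value per coefficient. In the synthetic sketch below, x1 truly drives y while x2 is pure noise, so we expect a tiny p-value for x1 and a large one for x2:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)  # truly related to y
x2 = rng.normal(size=n)  # pure noise
y = 1.0 + 2.5 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# p-values for [intercept, x1, x2]: small for x1 (reject the null of
# "no relationship"), large for x2.
print(model.pvalues)
```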

Hypothesis Test for Correlations

Correlations have a hypothesis test. As with any hypothesis test, this test takes sample data and evaluates two mutually exclusive statements about the population from which the sample was drawn. For Pearson correlations, the two hypotheses are the following:

  • Null hypothesis (H0): the correlation in the population equals zero (no linear relationship).
  • Alternative hypothesis (H1): the correlation in the population does not equal zero.

A correlation of zero indicates that no linear relationship exists. If your p-value is less than your significance level, the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation does not equal zero. In other words, the sample data support the notion that the relationship exists in the population.
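A sketch of this test using scipy.stats.pearsonr, which returns both the correlation and its p-value (synthetic data assumed):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data, correlated by construction.
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(size=50)

r, p_value = pearsonr(x, y)
alpha = 0.05
print(f"r = {r:.3f}, p = {p_value:.2g}")
if p_value < alpha:
    print("Reject H0: the population correlation is likely nonzero.")
else:
    print("Fail to reject H0: no evidence of a linear relationship.")
```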

