All about Linear Regression

Linear Regression Analysis is a statistical technique for measuring the underlying relationship between two or more variables. In this blog, we will try to decode Linear Regression.

Statistical Concepts
Linear Regression, R-squared, and Adjusted R-squared

Linear Regression

When we study one variable (the dependent variable) in terms of another (the independent variable) through a linear relationship between them, it is called Bivariate Linear Regression Analysis.

Multiple Linear Regression Analysis is the study of the relationship between a dependent variable and two or more independent variables.

In non-deterministic models, regression is always an approximation and hence the presence of errors is unavoidable. However, by minimizing the sum of squared errors, we obtain the closest estimates of the intercept and the coefficients of the explanatory variables, and therefore the best equation for explaining the dependent variable in terms of the independent variables.

The equation of multiple linear regression:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

where Y is the response variable, the Xi’s (i = 1, 2, …, p) are the explanatory variables, and ε is the error term.

We estimate the (p+1) parameters β0, β1, …, βp using the method of least squares, that is, by minimizing the error sum of squares.
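
As a concrete illustration, here is a minimal sketch of such a least-squares fit using NumPy; the data and variable names are made up purely for demonstration and are not part of any standard API.

```python
import numpy as np

# Made-up data: n = 6 observations, p = 2 explanatory variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([6.1, 5.9, 12.2, 11.8, 18.1, 17.7])

# Prepend a column of ones so the first estimated parameter plays the role of the intercept β0
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimates of (β0, β1, β2): the values that minimize the error sum of squares
beta_hat, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients (β0, β1, β2):", beta_hat)
```

The same estimates can equivalently be obtained by solving the normal equations (XᵀX)β = Xᵀy.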

Assumptions:

  1. Normality of residuals: the errors ε are independent and identically distributed N(0, σ²) variates
  2. Linearity: The relationship between X and Y is linear
  3. Homoscedasticity: The variance of the error is constant for any value of X
  4. Independence: Observations (and hence the errors) are independent of each other
  5. No multicollinearity: The explanatory variables are not highly correlated with one another
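
Continuing the NumPy sketch above, here is one informal way to eyeball some of these assumptions from the residuals and the explanatory variables. These are quick checks, not formal tests, and the thresholds one would apply are a matter of judgment.

```python
# Residuals from the fit above
fitted = X_design @ beta_hat
residuals = y - fitted

# Zero mean / normality: residuals should be roughly centred around 0
print("Mean of residuals:", residuals.mean())

# Homoscedasticity: the spread of the residuals should look roughly constant across fitted values
for f, r in zip(fitted, residuals):
    print(f"fitted = {f:6.2f}   residual = {r:7.3f}")

# Multicollinearity: pairwise correlations between explanatory variables should not be too close to ±1
print("Correlation matrix of the explanatory variables:")
print(np.corrcoef(X, rowvar=False))
```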

Coefficient of determination (R2)

It measures the proportion of variation in the dependent variable that can be predicted from the set of independent variables in a regression equation. It ranges from 0 to 1. A value closer to 1 indicates that the variability in the dependent variable is explained well by the independent variables, whereas a value closer to 0 implies that most of the variance results from chance causes or from the absence of some other explanatory variable in the model.

R2 = 1 – SSE/TSS 

where SSE (Sum of Squares due to Error) denotes the variability left unexplained by the model, and

TSS (Total Sum of Squares) is the total variability in the dependent variable.
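
Continuing the same sketch, R2 can be computed directly from this definition; the variable names are ours, purely for illustration.

```python
# SSE: variability left unexplained by the model; TSS: total variability in the dependent variable
sse = np.sum((y - fitted) ** 2)
tss = np.sum((y - y.mean()) ** 2)

r_squared = 1 - sse / tss
print("R-squared:", r_squared)
```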

Adjusted R-squared

However, the addition of an explanatory variable almost always leads to an increase in the value of R2; that is, R2 is a non-decreasing function of the number of regressors. Therefore, we use adjusted R2, which takes into account the number of predictors in the model by dividing both SSE and TSS by their respective degrees of freedom.

Adjusted R2 = 1 – (SSE/(n – (p+1))) / (TSS/(n – 1))

where p is the number of explanatory variables in the model and n is the number of observations.

Note: SSE has n-(p+1) degrees of freedom since we estimate 1 intercept and p slope coefficients in the model.
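
Continuing the same example, the adjusted R2 follows by dividing SSE and TSS by their degrees of freedom.

```python
n, p = X.shape                      # n = 6 observations, p = 2 explanatory variables
adj_r_squared = 1 - (sse / (n - (p + 1))) / (tss / (n - 1))
print("Adjusted R-squared:", adj_r_squared)
```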

You can find some of the resources that helped us here.

To know how we approached interview preparation, you can read the related article here.

For any queries about the process, or suggestions about topics we can cover in the future, you can reach out to us on LinkedIn.

Cheers and Best!

Kanika & Anubhav