Multicollinearity

Multicollinearity is a statistical term for the existence of a high degree of linear correlation amongst two or more explanatory variables in a regression model, which makes it difficult to separate their individual effects on the dependent variable.

Definition
Collinearity refers to a linear relationship between two explanatory variables. Strictly speaking, multicollinearity means that three or more variables are correlated, but the term is often used to describe correlation amongst any number of variables. The original definition referred to an exact linear relationship; it was later extended to cover nearly perfect relationships. The correlation can be negative or positive.

Mathematically, an exact linear relationship is said to exist if the following condition is satisfied:

$$ \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0 $$

where the $$ \lambda_i $$ are constants, not all zero simultaneously, and the $$ X_i $$ are explanatory variables.

Imagine that you add a stochastic error term $$ v_i $$ to the equation above such that

$$ \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0 $$

In this case there is no exact linear relationship, and the $$ X_i $$ variables are said to be nearly perfectly correlated. In the perfectly multicollinear case, it is possible to express one of the X variables, e.g. $$ X_{1i} $$, as an exact linear function of the remaining X variables:

$$ X_{1i} = -{\lambda_2 \over \lambda_1} X_{2i} - {\lambda_3 \over \lambda_1} X_{3i} - \cdots - {\lambda_k \over \lambda_1} X_{ki} $$

where $$ \lambda_1 \ne 0 $$. In the nearly perfectly multicollinear case, $$ X_{1i} $$ is not an exact linear combination of the other X variables but also depends on the stochastic error term $$ v_i $$:

$$ X_{1i} = -{\lambda_2 \over \lambda_1} X_{2i} - {\lambda_3 \over \lambda_1} X_{3i} - \cdots - {\lambda_k \over \lambda_1} X_{ki} - {v_i \over \lambda_1} $$
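To make the distinction concrete, here is a minimal Python sketch (the data and coefficients are illustrative, not from the text): under an exact linear relationship the matrix of explanatory variables loses rank, while adding a small stochastic term leaves the variables only nearly perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)

# Perfect multicollinearity: x2 is an exact linear combination of x1,
# so the matrix of explanatory variables is rank-deficient.
x2_exact = 3.0 * x1
print(np.linalg.matrix_rank(np.column_stack([x1, x2_exact])))  # prints 1, not 2

# Near-perfect multicollinearity: a small stochastic term breaks the
# exact relationship, but the correlation stays close to 1.
v = rng.normal(scale=0.01, size=100)
x2_near = 3.0 * x1 + v
print(np.corrcoef(x1, x2_near)[0, 1])  # roughly 0.999
```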

Consequences of multicollinearity
In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable $$ y $$ while controlling for the others tends to be less precise than if the predictors were uncorrelated with one another.

Simply put, if nominally "different" measures actually quantify the same phenomenon, then they are redundant. Even when such variables are given different names and measured on different numeric scales, a high correlation between them means they carry largely the same information.

A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).

See "Multi-collinearity Variance Inflation and Orthogonalization in Regression" by Dr. Alex Yu.

Detection of multicollinearity
Indicators that multicollinearity may be present in a model:

1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted

2) Non-significant t-tests for individual coefficients even though a joint test of the affected coefficients is significant

3) Estimated regression coefficients with a sign opposite to that predicted by theory

4) Formal detection using tolerance or the variance inflation factor (VIF):

 * $$\mathrm{tolerance}_j = 1-R_j^2,\quad \mathrm{VIF}_j = \frac{1}{\mathrm{tolerance}_j},$$

where $$ R_j^2 $$ is the coefficient of determination from regressing the $$ j $$-th explanatory variable on all of the other explanatory variables.

A tolerance of less than 0.1 (equivalently, a VIF greater than 10) indicates a multicollinearity problem.
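As a sketch, the tolerance/VIF computation can be carried out directly with NumPy (the data are simulated and the helper function is illustrative, not from any particular library):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns plus an
    intercept, then return 1 / (1 - R_j^2)."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # roughly independent
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    v = vif(X, j)
    print(f"x{j + 1}: VIF = {v:.1f}, tolerance = {1 / v:.3f}")
```

In this example x1 and x2 would be flagged (tolerance far below 0.1), while x3 would not.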

Remedies for multicollinearity
Multicollinearity has been likened to micronumerosity ("too little data"). Multicollinearity does not actually bias the results; it only produces large standard errors for the coefficients of the correlated independent variables. With enough data, these errors are reduced.
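A small simulation (all numbers illustrative) makes both points: the standard errors of the collinear coefficients are inflated, but they shrink as the sample grows.

```python
import numpy as np

def ols_slope_se(X, y):
    """Standard errors of the OLS slope estimates (intercept added)."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = (resid @ resid) / (len(y) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return np.sqrt(np.diag(cov))[1:]  # drop the intercept entry

rng = np.random.default_rng(2)
for n in (50, 500, 5000):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.1, size=n)   # highly collinear pair
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
    se = ols_slope_se(np.column_stack([x1, x2]), y)
    print(n, se)  # the standard errors fall roughly like 1/sqrt(n)
```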

Beyond collecting more data, note the following:

1) The presence of multicollinearity doesn't harm predictions from the fitted model, provided that the predictor values at which predictions are made follow the same multicollinearity pattern as the data on which the regression model was based.

2) A predictor variable may be dropped to lessen multicollinearity (at the cost of discarding whatever separate information the dropped variable carried).

3) You may be able to add cases (collect additional observations) that break the multicollinearity pattern.

4) You may be able to estimate the regression coefficients from different data sets.

Note: Multicollinearity does not reduce the reliability of the forecast; it mainly complicates the interpretation of the individual explanatory variables. As long as the collinear relationships among your independent variables remain stable over time, multicollinearity will not affect your forecast. If there is reason to believe that the collinear relationships do NOT remain stable over time, it is better to consider a technique like ridge regression, sketched below.
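For reference, here is a minimal closed-form ridge regression sketch (the penalty value and the simulated data are illustrative; in practice the penalty is usually chosen by cross-validation):

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form ridge estimate (X'X + alpha*I)^(-1) X'y.
    X and y are assumed centered, so no intercept is penalized."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
y = y - y.mean()

print(ridge(X, y, alpha=0.0))   # OLS: the two coefficients split unstably
print(ridge(X, y, alpha=10.0))  # ridge: shrunken, more stable estimates
```

With the penalty set to zero the estimate reduces to OLS; a positive penalty shrinks the coefficients and stabilizes how the shared effect is split between the collinear predictors.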

Multicollinearity in survival analysis
Multicollinearity may also represent a serious issue in survival analysis, where time-varying covariates may change their values over the timeline of the study. A special procedure is recommended to assess the impact of multicollinearity on the results; see Van den Poel & Larivière (2004) for a detailed discussion.