# Multicollinearity

Multicollinearity is a problem in regression analysis that occurs when two or more explanatory variables are very strongly correlated with one another. On the one hand, as multicollinearity increases, the method for estimating the regression coefficients becomes numerically unstable and statements about the estimated regression coefficients become increasingly imprecise. On the other hand, the interpretation of the model is no longer unambiguous. The classic symptom of strong multicollinearity is a high coefficient of determination combined with low t-values for the individual regression parameters.

## Problems of multicollinearity

Perfect collinearity makes the computational implementation of the linear regression analysis impossible and usually arises from an incorrect specification of the underlying model (the true model).

### Numerical instability

Mathematically, for the multiple linear regression model

${\displaystyle y_{i}=b_{0}+b_{1}x_{i1}+\ldots +b_{k}x_{ik}}$

the solution for the regression coefficients obtained by means of the least squares method can be written in vector-matrix notation as

${\displaystyle \mathbf {b} =\left(\mathbf {X} ^{\top }\mathbf {X} \right)^{-1}\mathbf {X} ^{\top }\mathbf {y} }$.

The vector ${\displaystyle \mathbf {b} =(b_{0},\dots ,b_{k})^{\top }}$ contains the estimated regression coefficients, the vector ${\displaystyle \mathbf {y} =(y_{1},\dots ,y_{n})^{\top }}$ the observed values of the dependent variable, and the ${\displaystyle n\times (k+1)}$ data matrix

${\displaystyle \mathbf {X} ={\begin{pmatrix}1&x_{11}&\cdots &x_{1k}\\\vdots &\vdots &&\vdots \\1&x_{n1}&\cdots &x_{nk}\end{pmatrix}}}$

the observed values of the regressors. The problem lies in computing the inverse of the sum-of-products matrix ${\displaystyle \mathbf {X} ^{\top }\mathbf {X} }$: the stronger the multicollinearity, the closer ${\displaystyle \mathbf {X} ^{\top }\mathbf {X} }$ comes to a singular matrix, i.e. a matrix that has no inverse.
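The near-singularity of ${\displaystyle \mathbf {X} ^{\top }\mathbf {X} }$ can be made visible through its condition number. A minimal NumPy sketch with synthetic data (all variable names chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)

# x2 is almost an exact linear function of x1 -> near-perfect multicollinearity
x2 = 2.0 * x1 + 1e-8 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
# The condition number of X'X explodes as the matrix approaches singularity;
# inverting it then amplifies rounding errors enormously.
print(np.linalg.cond(XtX))  # enormous, signalling near-singularity
```

A well-conditioned design matrix would yield a condition number close to 1; here it is many orders of magnitude larger.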

### Model interpretation

If the regression model is ${\displaystyle y=b_{0}+b_{1}x_{1}+b_{2}x_{2}}$ and there is perfect multicollinearity, i.e.

${\displaystyle x_{2}=c_{0}+c_{1}x_{1}}$ or, solved for ${\displaystyle x_{1}}$,
${\displaystyle x_{1}={\frac {1}{c_{1}}}x_{2}-{\frac {c_{0}}{c_{1}}}}$,

and one substitutes each of these equations into the regression model, one obtains

(1) ${\displaystyle y=b_{0}+b_{1}x_{1}+b_{2}(c_{0}+c_{1}x_{1})=(b_{0}+b_{2}c_{0})+(b_{1}+b_{2}c_{1})x_{1}}$
(2) ${\displaystyle y=b_{0}+b_{1}\left({\frac {1}{c_{1}}}x_{2}-{\frac {c_{0}}{c_{1}}}\right)+b_{2}x_{2}=\left(b_{0}-{\frac {b_{1}c_{0}}{c_{1}}}\right)+\left({\frac {b_{1}}{c_{1}}}+b_{2}\right)x_{2}}$

In model (1), ${\displaystyle y}$ depends only on ${\displaystyle x_{1}}$; in model (2), ${\displaystyle y}$ depends only on ${\displaystyle x_{2}}$. The question then arises which model is the "right" one. In econometrics, one speaks of non-identifiable models.
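The non-identifiability can be demonstrated numerically: under perfect collinearity, the two one-regressor models produce exactly the same fitted values, so the data cannot distinguish between them. A small NumPy sketch with synthetic data (names and coefficient values chosen here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
c0, c1 = 1.0, 2.0
x2 = c0 + c1 * x1                        # perfect collinearity: x2 = c0 + c1*x1
y = 3.0 + 0.5 * x1 + 1.5 * x2 + rng.normal(scale=0.1, size=n)

# Model (1): regress y on x1 only; Model (2): regress y on x2 only
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# The columns {1, x1} and {1, x2} span the same space, so both models
# project y onto the same subspace and yield identical fitted values.
print(np.allclose(X1 @ b1, X2 @ b2))  # True
```

Both models fit the data equally well, which is precisely why no amount of data can identify the "right" one.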

## Identification of multicollinearity

Because empirical data always exhibit some degree of multicollinearity, diagnostic measures have been developed that provide indications of multicollinearity. However, there is no clear-cut threshold.

### Correlation

One way to reveal multicollinearity is to analyze the correlation coefficients of the regressors: very high positive or negative correlation coefficients indicate a strong relationship between the regressors and thus multicollinearity. However, a low pairwise correlation between the regressors does not automatically imply the absence of multicollinearity; a high positive or negative correlation between linear combinations of regressors, e.g. between ${\displaystyle d_{1}x_{1}+d_{2}x_{2}}$ and ${\displaystyle d_{3}x_{3}+d_{4}x_{4}}$, also leads to the problems described above. A high pairwise correlation between the regressors can be identified from the correlation matrix.
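Computing the correlation matrix of the regressors is a one-liner in NumPy. A minimal sketch with synthetic data (names chosen here):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                    # independent of x1 and x2

# Correlation matrix of the regressors; off-diagonal entries near +/-1
# point to pairwise multicollinearity (here between x1 and x2)
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(R.round(2))
```

Note that, as discussed above, an unremarkable correlation matrix does not rule out multicollinearity among linear combinations of regressors.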

### Coefficient of determination

A high coefficient of determination ${\displaystyle R_{i}^{2}}$ in the auxiliary linear regression

${\displaystyle x_{i}=d_{i0}+\sum _{j=1 \atop j\neq i}^{k}d_{ji}x_{j}}$,

i.e. the ${\displaystyle i}$-th regressor being well predicted by all the other regressors, indicates multicollinearity.

#### Tolerance

The tolerance ${\displaystyle {\text{Tol}}_{j}=1-R_{j}^{2}}$ is used to assess multicollinearity. A value of ${\displaystyle {\text{Tol}}_{j}<0.2}$ indicates strong multicollinearity.

#### Variance Inflation Factor (VIF)

The greater the variance inflation factor

${\displaystyle \operatorname {VIF} _{j}={\frac {1}{1-R_{j}^{2}}}={\frac {1}{{\text{Tol}}_{j}}}\in [1;\infty )}$

(with ${\displaystyle R_{j}^{2}}$ as the coefficient of determination of the regression of ${\displaystyle x_{j}}$ on all other explanatory variables), the stronger the indication of multicollinearity. There is no definitive threshold above which the VIF indicates (too) high multicollinearity. As a rule of thumb, VIF values above 10 are often classified as "too high".
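The auxiliary-regression ${\displaystyle R_{j}^{2}}$, the tolerance, and the VIF can all be computed with ordinary least squares. A minimal NumPy sketch (the `vif` helper and all data are constructed here for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the regressor
    matrix X (columns are the regressors, without the constant)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Auxiliary regression of x_j on an intercept and all other regressors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        # R_j^2 = 1 - RSS / TSS;  VIF_j = 1 / (1 - R_j^2) = 1 / Tol_j
        r2 = 1.0 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)    # nearly collinear with x1
x3 = rng.normal(size=n)
print(vif(np.column_stack([x1, x2, x3])))  # first two entries far above 10
```

By the rule of thumb above, the VIFs of `x1` and `x2` would flag strong multicollinearity, while that of `x3` stays near 1.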

### Condition index

The sum-of-products matrix ${\displaystyle \mathbf {X} ^{\top }\mathbf {X} }$ is positive semidefinite, i.e. all eigenvalues ${\displaystyle \lambda _{i}}$ of the matrix are positive or zero. If the matrix becomes singular, at least one eigenvalue is zero. If a condition index

${\displaystyle {\text{KI}}_{j}={\sqrt {\frac {\lambda _{j}}{\min _{i}\lambda _{i}}}}}$

takes a value greater than 30, this is likewise referred to as strong multicollinearity.
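The condition indices follow directly from the eigenvalues of ${\displaystyle \mathbf {X} ^{\top }\mathbf {X} }$. A minimal NumPy sketch using the formula above, with synthetic nearly collinear data (names chosen here):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

# Eigenvalues of X'X (eigvalsh: for symmetric matrices, ascending order);
# each condition index compares one eigenvalue with the smallest one
lam = np.linalg.eigvalsh(X.T @ X)
ki = np.sqrt(lam / lam.min())
print(ki.max())  # far above the rule-of-thumb threshold of 30
```

The largest condition index here is far above 30, flagging the near-collinearity between `x1` and `x2`.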