Bias from omitted variables

In statistics, omitted-variable bias (OVB) occurs when one or more relevant variables (regressors) are left out of a model. Here, a relevant variable is one whose true partial effect on the response variable is nonzero, i.e. a variable that influences the response variable in the true model. Variables that one actually wants to control for, but that were left out when estimating a regression model, are called omitted variables. The possible consequence of omitting one or more relevant variables is a biased and inconsistent estimate of the effect of interest.

If a regression model estimated by least squares is misspecified because a relevant explanatory variable was omitted from the regression equation, the least squares estimator is biased. In general, bias occurs if:

the omitted variable is correlated with a variable included in the model, and
the omitted variable is a determinant of the response variable.

The bias in the least squares estimator arises because the model tries to compensate for the missing relevant variables by overestimating or underestimating the effects of the other explanatory variables. In practice there is usually a trade-off between the bias caused by omitted variables and the problem of multicollinearity. One possible remedy is to use instrumental variables.
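The two conditions can be illustrated with a small simulation. The following is a minimal sketch, not a definitive implementation: the data generating process (intercept 1, slope 2 on `x`, standard normal noise), the function name `short_regression_slope` and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def short_regression_slope(rho, gamma, n=20_000, seed=0):
    """Slope on x from regressing y on (1, x) while z is left out.

    Assumed true model (illustrative): y = 1 + 2*x + gamma*z + noise,
    where x and z are standard normal with correlation rho.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    # Construct z so that corr(x, z) is approximately rho.
    z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    y = 1.0 + 2.0 * x + gamma * z + rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x])  # z is deliberately omitted
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]


# Both conditions hold: z is correlated with x and affects y.
both = short_regression_slope(rho=0.6, gamma=1.5)
# z affects y but is uncorrelated with x.
uncorrelated = short_regression_slope(rho=0.0, gamma=1.5)
# z is correlated with x but has no effect on y.
irrelevant = short_regression_slope(rho=0.6, gamma=0.0)

print(f"both conditions:    {both:.3f}")
print(f"uncorrelated z:     {uncorrelated:.3f}")
print(f"irrelevant z:       {irrelevant:.3f}")
```

Under these assumptions the slope is pushed away from the true value 2 (towards roughly 2 + 1.5 · 0.6) only when both conditions hold; if either one fails, the estimate stays close to 2.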

Setup

Consider a typical multiple linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with the $p \times 1$ vector $\boldsymbol{\beta}$ of unknown regression parameters, the $n \times p$ design matrix $\mathbf{X}$, the $n \times 1$ vector $\mathbf{y}$ of the dependent variable, and the $n \times 1$ vector $\boldsymbol{\varepsilon}$ of error terms. Furthermore, the error terms are assumed to have mean zero: $\operatorname{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$. This means that the model can be assumed to be correct on average.

Consider the following situation:

The true data generating process (the full model) is:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim (\mathbf{0}, \sigma^{2}\mathbf{I})$$

with

$$\boldsymbol{\gamma} \neq \mathbf{0}$$

The misspecified (reduced) model is:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}^{*}, \quad \boldsymbol{\varepsilon}^{*} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}$$

Although the full model is correct, the reduced model is the one that is mistakenly estimated. In this case, the relevant variables $\mathbf{Z}$ (relevant because $\boldsymbol{\gamma} \neq \mathbf{0}$ holds for the true parameter) are wrongly neglected. These omitted variables are absorbed into a newly defined stochastic disturbance term, because they are relevant but not taken into account in the model. With omitted variables, the least squares estimator $\mathbf{b}$ is generally biased (omitted-variable bias). An exception is when $\mathbf{X}$ and $\mathbf{Z}$ are orthogonal, i.e. every variable in $\mathbf{X}$ is uncorrelated with every variable in $\mathbf{Z}$. In addition, the components of the estimator from the reduced model have a smaller variance than the corresponding components of the estimator based on the true model.
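The orthogonality exception and the variance comparison can be checked numerically. The sketch below assumes a fixed design with an intercept column in X (so that orthogonality to X implies zero sample correlation), and the dimensions, seed, coefficient values and helper name `bias_term` are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 500, 3, 2

# Design matrix with an intercept column (an assumption of this sketch).
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
# Omitted variables that are correlated with the second column of X.
Z_raw = rng.standard_normal((n, q)) + X[:, [1]]

# Residual-maker of X: removes everything X can explain linearly.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
Z_orth = M @ Z_raw  # columns are orthogonal to every column of X

gamma = np.array([1.0, -2.0])  # illustrative nonzero coefficients


def bias_term(X, Z, gamma):
    """Evaluate (X'X)^{-1} X' Z gamma, the systematic error of b."""
    return np.linalg.solve(X.T @ X, X.T @ Z @ gamma)


bias_correlated = bias_term(X, Z_raw, gamma)
bias_orthogonal = bias_term(X, Z_orth, gamma)

# Variances of the beta components, up to the common factor sigma^2.
var_reduced = np.diag(np.linalg.inv(X.T @ X))
W = np.column_stack([X, Z_raw])
var_full = np.diag(np.linalg.inv(W.T @ W))[:p]

print("bias, correlated Z:", np.round(bias_correlated, 3))
print("bias, orthogonal Z:", np.round(bias_orthogonal, 12))
print("variance reduced vs. full:", var_reduced, var_full)
```

With orthogonal Z the bias term vanishes up to floating-point error, while the diagonal entries of the reduced-model covariance never exceed those of the full model, which is the bias-variance trade-off described above.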

Effects of model misspecification

Bias of the least squares estimator

If one estimates the reduced model, but in reality the true model is the full model, then because of

$$
\begin{aligned}
\operatorname{E}(\mathbf{b}) &= \operatorname{E}\left((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}\right) \\
&= \operatorname{E}\left((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon})\right) \\
&= \operatorname{E}\left((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Z}\boldsymbol{\gamma} + (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\boldsymbol{\varepsilon}\right) \\
&= \boldsymbol{\beta} + (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Z}\boldsymbol{\gamma} + (\mathbf{X}^{\top}\mathbf{X})^{-1}\underbrace{\operatorname{E}(\mathbf{X}^{\top}\boldsymbol{\varepsilon})}_{=\mathbf{0}} \\
&= \boldsymbol{\beta} + \underbrace{(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Z}\boldsymbol{\gamma}}_{\text{bias}}
\end{aligned}
$$

the estimator has a systematic error of $(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Z}\boldsymbol{\gamma}$.
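The systematic error $(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Z}\boldsymbol{\gamma}$ can be compared with a Monte Carlo average of the reduced-model estimates. This is a minimal sketch that assumes X and Z are held fixed across replications and the errors are independent standard normal; the sample size, number of replications and coefficient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps, sigma = 100, 5000, 1.0

# Fixed design: intercept plus one regressor; one omitted variable z
# that is correlated with the included regressor (all values assumed).
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
z = 0.8 * X[:, 1] + 0.6 * rng.standard_normal(n)
Z = z[:, None]
beta = np.array([1.0, 2.0])
gamma = np.array([1.5])

# A = (X'X)^{-1} X', so that b = A y for the reduced model.
A = np.linalg.solve(X.T @ X, X.T)
theoretical_bias = A @ Z @ gamma

# Draw many error vectors and estimate the reduced model each time.
eps = sigma * rng.standard_normal((reps, n))
Y = X @ beta + Z @ gamma + eps   # shape (reps, n), one sample per row
B = Y @ A.T                      # shape (reps, p), one estimate b per row
empirical_bias = B.mean(axis=0) - beta

print("theoretical bias:", np.round(theoretical_bias, 3))
print("empirical bias:  ", np.round(empirical_bias, 3))
```

Under these assumptions the two vectors should agree up to simulation noise, since the only random part of each estimate is the term $(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\boldsymbol{\varepsilon}$, which averages out to zero.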

Bias of the variance estimator

Due to the omitted variables, the estimator of the true variance of the disturbances is biased. The uncertainty in estimating the disturbances increases and the variance can no longer be estimated without bias. The bias of the variance estimator is

$$\operatorname{Bias}(\hat{\sigma}^{2}) = \operatorname{E}(\hat{\sigma}^{2}) - \sigma^{2} = \frac{\boldsymbol{\gamma}^{\top}\mathbf{Z}^{\top}\mathbf{M}\mathbf{Z}\boldsymbol{\gamma}}{T - K} \geq 0,$$

where $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}$ is the residual-generating matrix of the reduced model, $T$ is the number of observations and $K$ the number of regressors in the reduced model ($n$ and $p$ in the notation above). That is, on average the variance of the disturbances is systematically overestimated. Since the numerator is a quadratic form in the symmetric, idempotent (and therefore positive semidefinite) matrix $\mathbf{M}$, it is nonnegative.
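The overestimation of the error variance can be checked by simulation. The sketch below assumes that M denotes the residual-maker matrix $\mathbf{I} - \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}$ of the reduced model and that the denominator is the degrees of freedom $n - p$ of that model; the design, seed and coefficient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma2 = 100, 5000, 1.0

# Fixed design and one omitted variable (all values assumed).
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
p = X.shape[1]
z = 0.8 * X[:, 1] + 0.6 * rng.standard_normal(n)
beta = np.array([1.0, 2.0])
Zg = 1.5 * z  # Z @ gamma for a single omitted variable with gamma = 1.5

# Residual-maker of the reduced model (symmetric and idempotent).
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Bias formula: gamma' Z' M Z gamma / (n - p).
theoretical_bias = Zg @ M @ Zg / (n - p)

# Monte Carlo: residuals of the reduced model for many error draws.
eps = np.sqrt(sigma2) * rng.standard_normal((reps, n))
Y = X @ beta + Zg + eps
E = Y @ M  # each row is (M y)' because M is symmetric
sigma2_hat = (E**2).sum(axis=1) / (n - p)
empirical_bias = sigma2_hat.mean() - sigma2

print(f"theoretical bias: {theoretical_bias:.3f}")
print(f"empirical bias:   {empirical_bias:.3f}")
```

Under these assumptions both numbers are positive and close to each other, reflecting that the part of $\mathbf{Z}\boldsymbol{\gamma}$ not explained by X ends up in the residual sum of squares.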

Bias of the residuals

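When relevant variables are omitted, the disturbances of the reduced model are no longer centered around zero:

$$\operatorname{E}(\boldsymbol{\varepsilon}^{*}) \neq \mathbf{0}$$

Because $\boldsymbol{\varepsilon}^{*} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}$ and $\operatorname{E}(\boldsymbol{\varepsilon}) = \mathbf{0}$, this expectation equals $\mathbf{Z}\boldsymbol{\gamma}$. This can be interpreted as meaning that, on average, the true model is no longer being estimated.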

References

Peter Hackl: Introduction to Econometrics. 2nd updated edition, Pearson Deutschland GmbH, 2008, ISBN 978-3-86894-156-2, pp. 105 ff.
