# Instrumental variable estimation

Instrumental variable estimation (IV estimation for short), also known as the method of instrumental variables or the instrumental variable method, is an umbrella term for certain estimation methods in inferential statistics.

The aim of the IV method is to eliminate correlation between the explanatory variables and the error term in a regression analysis. This is done by replacing the explanatory variables with other quantities (instruments) that are closely related to them but neither correlate with the error term nor are linear combinations of the other explanatory variables.

## History

While instrumental variables are today mostly used in situations with omitted variables, historically they first appeared as a solution to problems of simultaneity. When estimating supply and demand curves, for example, the problem arises that only equilibrium prices and quantities are available as data points, i.e. quantities at which supply and demand coincide. The American economist Philip G. Wright published a book in 1928 under the title The Tariff on Animal and Vegetable Oils. In one of its appendices, Wright presented a method by which the demand and supply elasticities of butter and flaxseed oil could be estimated. This is believed to be the first study to use the instrumental variable approach.

It was later found that instrumental variables can also correct biases due to measurement error. The same applies to biases due to omitted variables (omitted variable bias).

## Idea

In many situations in which causal effects are to be examined and quantified, there is a correlation between the error term and the explanatory variable. For example, to study the effect of education ($x$) on a person's labor income ($y$), one could estimate a model of the following type (simple linear regression):

$$y_i = \alpha + \beta x_i + u_i,$$

where $u$ represents the error term.

One way of estimating $\beta$ would be ordinary least squares estimation. However, this relies on several assumptions, among them that the error term and the explanatory variable are uncorrelated.

However, this is very unlikely in the example mentioned. It is easy to think of many variables that do not appear in the model but affect both education and income. Moreover, some of these variables are hard or impossible to measure and therefore cannot be included in the model as control variables. For example, a person's diligence is very likely correlated with both that person's level of education and their income; since diligence is not measurable and thus remains in the error term, there will be exactly the correlation between the explanatory variable and the error term that must not exist for the least squares method to be valid. In such a case there is an omitted variable problem, and the least squares estimator is inconsistent. Correlation between the error term and the explanatory variables is called endogeneity. Besides omitted variables, it can also arise when the variables cannot be measured precisely, but only with measurement error, and when there is bilateral, simultaneous causality ($x$ has a causal effect on $y$, and $y$ has a causal effect on $x$).

Further approaches to solving endogeneity problems are regression discontinuity designs, panel data and the estimation methods built on them, and the classical experiment.

## Mathematical background

In the simple linear regression model with a single explanatory variable (and no intercept), the ordinary least squares (OLS) estimator satisfies

$$\widehat{\beta}_{\mathrm{OLS}} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i x_i (x_i \beta + \epsilon_i)}{\sum_i x_i^2} = \beta + \frac{\sum_i x_i \epsilon_i}{\sum_i x_i^2}.$$

If $x$ and $\epsilon$ are uncorrelated, the second term converges to zero as the number of observations grows, and the estimator is consistent for $\beta$. If $x$ and $\epsilon$ are correlated, the estimator is inconsistent.

An instrumental variable $z$ is correlated with the explanatory variable but not with the error term. The resulting estimator is:

$$\widehat{\beta}_{\mathrm{IV}} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i} = \frac{\sum_i z_i (x_i \beta + \epsilon_i)}{\sum_i z_i x_i} = \beta + \frac{\sum_i z_i \epsilon_i}{\sum_i z_i x_i}.$$

If $z$ and $\epsilon$ are uncorrelated, the last term vanishes in the limit and yields a consistent estimator. Note that if $x$ itself is uncorrelated with the error term, then $x$ is its own instrumental variable; in that case the OLS estimator is identical to the IV estimator.
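These two derivations can be checked with a small simulation (a sketch in Python with NumPy; the data-generating process and all coefficient values are illustrative assumptions): an error term that feeds into $x$ makes OLS inconsistent, while an instrument $z$ that shifts $x$ but is unrelated to the error recovers the true $\beta$.

```python
import numpy as np

# Illustrative data-generating process; all parameter values are assumptions.
rng = np.random.default_rng(0)
n = 100_000
eps = rng.normal(size=n)                      # error term, correlated with x below
z = rng.normal(size=n)                        # instrument: shifts x, unrelated to eps
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)  # endogenous explanatory variable
beta = 2.0                                    # true coefficient
y = beta * x + eps                            # no intercept, matching the derivation above

beta_ols = np.sum(x * y) / np.sum(x * x)      # inconsistent: x and eps are correlated
beta_iv = np.sum(z * y) / np.sum(z * x)       # consistent: z and eps are uncorrelated
```

With this setup `beta_ols` settles noticeably above 2, while `beta_iv` is close to the true value.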

The above approach generalizes readily to a regression with multiple explanatory variables. Let $X$ be the $T \times K$ matrix of explanatory variables (data matrix) arising from $T$ observations of $K$ variables, and let $Z$ be a $T \times K$ matrix of instrumental variables. Then

$$\hat{\beta}_{\mathrm{IV}} = (Z'X)^{-1}Z'Y = (Z'X)^{-1}Z'(X\beta + \epsilon) = \beta + (Z'X)^{-1}Z'\epsilon.$$
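The matrix form can be sketched on hypothetical data (all coefficient values are assumptions): one endogenous regressor is instrumented by $z_1$, while the constant and the exogenous regressor serve as their own instruments.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
beta = np.array([1.0, 2.0, -0.5])               # intercept and two slopes (illustrative)

eps = rng.normal(size=n)                        # structural error
z1 = rng.normal(size=n)                         # instrument for the endogenous column
x1 = 0.7 * z1 + 0.6 * eps + rng.normal(size=n)  # endogenous regressor
x2 = rng.normal(size=n)                         # exogenous regressor: its own instrument
y = beta[0] + beta[1] * x1 + beta[2] * x2 + eps

X = np.column_stack([np.ones(n), x1, x2])       # T x K data matrix
Z = np.column_stack([np.ones(n), z1, x2])       # T x K instrument matrix
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # (Z'X)^{-1} Z'Y
```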

## Implementation

This technique is often implemented by means of two-stage least squares estimation (2SLS for short). In the first stage, each endogenous explanatory variable is regressed on all valid instruments and all exogenous variables. Since the instruments are exogenous, these fitted values of the endogenous variables do not correlate with the error term. Intuitively, the first stage isolates the part of the variation in the endogenous explanatory variables that is explained by the instruments. In the second stage, the regression of interest is estimated as usual, but with all endogenous explanatory variables replaced by their fitted values from the first stage.

The estimator obtained in this way is consistent. For the standard errors to be calculated correctly, only the sum of squared residuals needs to be corrected (the residuals must be computed with the original regressors $X$, not the fitted values $\hat{X}$):

Step 1: $\hat{X} = Z(Z'Z)^{-1}Z'X$

Step 2: $\hat{\beta}_{\mathrm{IV}} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y$
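The two steps can be sketched as follows (illustrative simulated data with two instruments for one endogenous regressor, i.e. an overidentified model, where 2SLS goes beyond the simple $(Z'X)^{-1}Z'Y$ formula):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
eps = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)      # two instruments, one endogenous x
x = 0.5 * z1 + 0.4 * z2 + 0.6 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps                              # true coefficients (assumed)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])            # T x L with L > K: overidentified

# Step 1: project X onto the column space of the instruments
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
# Step 2: OLS of y on the fitted values
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```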

## Conditions

A valid instrument must meet the following two conditions.

### Relevance condition

A problem arises when the instruments are only weakly correlated with the endogenous variable(s) ("weak instruments"). This assumption is usually checked with an F-test in the first stage of the 2SLS regression. The null hypothesis to be rejected is that the instruments jointly have no influence on the endogenous variable that is distinguishable from zero. As a rule of thumb, the resulting F-statistic should be greater than 10.
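A sketch of this first-stage F-test, computed from restricted and unrestricted residual sums of squares (the coefficients 0.8 and 0.05 are assumptions chosen to produce one strong and one weak instrument):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
eps = rng.normal(size=n)
z = rng.normal(size=n)
x_strong = 0.8 * z + 0.6 * eps + rng.normal(size=n)   # strong first stage
x_weak = 0.05 * z + 0.6 * eps + rng.normal(size=n)    # weak first stage

def first_stage_F(x, Z_excl):
    """F-statistic for H0: all excluded instruments have zero first-stage coefficients."""
    n = len(x)
    W = np.column_stack([np.ones(n), Z_excl])         # unrestricted: constant + instruments
    b, *_ = np.linalg.lstsq(W, x, rcond=None)
    rss_u = np.sum((x - W @ b) ** 2)
    rss_r = np.sum((x - x.mean()) ** 2)               # restricted: constant only
    q = Z_excl.shape[1]                               # number of restrictions
    return ((rss_r - rss_u) / q) / (rss_u / (n - W.shape[1]))

F_strong = first_stage_F(x_strong, z[:, None])
F_weak = first_stage_F(x_weak, z[:, None])
```

Here `F_strong` comes out far above the rule-of-thumb threshold of 10, while `F_weak` hovers near it.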

### Exogeneity

A valid instrument correlates with the endogenous variable, and hence with the variable to be explained, but not with the error term. The difficulty is that this assumption cannot be tested statistically from the available data; it has to be justified by argument. Only if a valid instrument already exists can the exogeneity of an additional instrument be checked with the Sargan-Hansen test.
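In the overidentified case, the Sargan statistic can be computed as $n R^2$ from a regression of the 2SLS residuals on all instruments; under the null of valid instruments it is asymptotically $\chi^2$ with degrees of freedom equal to the number of overidentifying restrictions. A sketch on invented data (all coefficients are assumptions; `z_bad` is deliberately constructed to violate exogeneity):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
eps = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)   # two valid instruments
z_bad = 0.8 * eps + rng.normal(size=n)            # invalid: correlated with the error
x = 0.6 * z1 + 0.6 * z2 + 0.5 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

def sargan_J(y, X, Z):
    """Sargan statistic: n * R^2 from regressing the 2SLS residuals on all instruments."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)           # first stage
    b = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)       # 2SLS coefficients
    e = y - X @ b                                           # residuals use the original X
    g, *_ = np.linalg.lstsq(Z, e, rcond=None)
    r2 = 1.0 - np.sum((e - Z @ g) ** 2) / np.sum((e - e.mean()) ** 2)
    return len(y) * r2

X = np.column_stack([np.ones(n), x])
J_valid = sargan_J(y, X, np.column_stack([np.ones(n), z1, z2]))       # small under H0
J_invalid = sargan_J(y, X, np.column_stack([np.ones(n), z1, z_bad]))  # large: z_bad fails
```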

In addition, estimates based on valid instrumental variables are consistent but generally not unbiased; larger samples are therefore required.

## Interpretation

An estimate based on instrumental variables can be interpreted as a local average treatment effect (LATE for short). This means that IV estimation does not estimate the average treatment effect for the whole population, but only for the subpopulation for which the instrument influences the endogenous variable. The reason is that only the part of the variation in the endogenous variable that is explained by the instrument can be used for the estimation.

## Literature

### Textbooks and review articles

• Joshua D. Angrist , Jörn-Steffen Pischke: Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.
• Joshua D. Angrist, Alan B. Krueger: Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. In: Journal of Economic Perspectives. Volume 15, Number 4, Fall 2001, pp. 69-85.
• Hans-Friedrich Eckey, Reinhold Kosfeld, Christian Dreger: Econometrics. 3rd, revised and expanded edition. Gabler, Wiesbaden 2004.
• William H. Greene: Econometric Analysis. 5th edition. Prentice Hall, Upper Saddle River, NJ 2003.
• James H. Stock, Mark W. Watson: Introduction to Econometrics. 2nd Edition. Pearson Education, 2007.
• Marno Verbeek: A Guide to Modern Econometrics. 4th edition. John Wiley & Sons, Chichester 2012.
• Jeffrey M. Wooldridge: Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, Mass. 2002.

## Remarks

1. ^ JD Angrist, AB Krueger: Instrumental Variables and the Search for Identification. 2001, p. 69.
2. ^ JH Stock, MW Watson: Introduction to Econometrics. 2007, p. 425.
3. ^ JD Angrist, AB Krueger: Instrumental Variables and the Search for Identification. 2001, p. 71 f.
4. ^ Douglas Staiger, James H. Stock: Instrumental Variables Regression with Weak Instruments. In: Econometrica. Vol. 65 (3), May 1997, pp. 557-586.
5. ^ JD Angrist, AB Krueger: Instrumental Variables and the Search for Identification. 2001, p. 71.