In statistics, especially in regression diagnostics, the Cook distance (also Cook's distance or Cook's measure) is the most widely used measure for identifying so-called influential observations after a least squares regression has been carried out. The Cook distance is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
Data points with large residuals (outliers) and/or large leverage values can distort the outcome and precision of a regression. Cook's distance measures the effect of omitting a given observation; data points with a large Cook distance deserve closer scrutiny when analyzing the data. Let the multiple linear regression model in vector-matrix form be

$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$,
where the disturbance vector $\boldsymbol{\varepsilon}$ follows a multidimensional normal distribution $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$, $\boldsymbol{\beta} \in \mathbb{R}^{p}$ is the vector of regression coefficients (here $p = k + 1$ is the number of unknown parameters to be estimated and $k$ the number of explanatory variables), and $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ is the data matrix. The least squares estimation vector is then $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}$, from which the estimation vector of the dependent variable follows as

$\hat{\boldsymbol{y}} = \boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y} = \boldsymbol{H}\boldsymbol{y}$,
where $\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}$ represents the prediction matrix (hat matrix). The $i$-th diagonal element of $\boldsymbol{H}$ is given by $h_{ii} = \boldsymbol{x}_i^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{x}_i$, where $\boldsymbol{x}_i^{\top}$ is the $i$-th row of the data matrix $\boldsymbol{X}$. The values $h_{ii}$ are also referred to as the "leverage values" of the $i$-th observation. To formalize the influence of a point $(\boldsymbol{x}_i, y_i)$, consider the effect of omitting the point on $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{y}}$. The estimate of $\boldsymbol{\beta}$ obtained by omitting the $i$-th observation is given by $\hat{\boldsymbol{\beta}}_{(i)} = (\boldsymbol{X}_{(i)}^{\top}\boldsymbol{X}_{(i)})^{-1}\boldsymbol{X}_{(i)}^{\top}\boldsymbol{y}_{(i)}$, where $\boldsymbol{X}_{(i)}$ and $\boldsymbol{y}_{(i)}$ denote the data with the $i$-th observation removed. One can compare $\hat{\boldsymbol{\beta}}_{(i)}$ with $\hat{\boldsymbol{\beta}}$ using the Cook distance $D_i$, which is defined by

$D_i = \frac{(\hat{\boldsymbol{y}} - \hat{\boldsymbol{y}}_{(i)})^{\top}(\hat{\boldsymbol{y}} - \hat{\boldsymbol{y}}_{(i)})}{p\,\hat{\sigma}^2} = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$,

where $\hat{\boldsymbol{y}}_{(i)} = \boldsymbol{X}\hat{\boldsymbol{\beta}}_{(i)}$, $e_i$ is the $i$-th residual, and $\hat{\sigma}^2 = \boldsymbol{e}^{\top}\boldsymbol{e}/(n - p)$ is the estimated error variance.
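The two forms of the definition can be checked numerically: the closed-form expression using residuals and leverages must agree with the leave-one-out refits. A minimal NumPy sketch, on synthetic data of our own choosing (the design matrix and coefficients are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)          # synthetic data for illustration
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design matrix
p = X.shape[1]                          # p = k + 1 parameters
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Least squares fit and hat matrix H = X (X'X)^{-1} X'
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverage values h_ii
e = y - X @ beta_hat                    # residuals
s2 = e @ e / (n - p)                    # estimated error variance

# Closed-form Cook's distance: D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2
D = e**2 / (p * s2) * h / (1 - h)**2

# Definition: refit without observation i and compare fitted values
D_loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    diff = X @ beta_hat - X @ b_i       # y_hat - y_hat_(i)
    D_loo[i] = diff @ diff / (p * s2)

assert np.allclose(D, D_loo)            # both forms agree
```

The closed form is preferable in practice, since it needs only one fit of the full model rather than $n$ refits.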
Different cut-off values have been proposed for deciding which observations count as highly influential. The simple rule of thumb $D_i > 1$ has been suggested. Other authors have suggested $D_i > 4/n$, where $n$ is the number of observations.
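Applied to a vector of Cook distances, the two cut-offs above amount to simple threshold checks. A sketch with hypothetical distance values (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical Cook's distances for n = 10 observations
D = np.array([0.02, 0.05, 1.30, 0.01, 0.08, 0.45, 0.03, 0.02, 0.60, 0.04])
n = len(D)

flagged_simple = np.flatnonzero(D > 1)      # rule of thumb D_i > 1   -> [2]
flagged_4n = np.flatnonzero(D > 4 / n)      # cut-off D_i > 4/n = 0.4 -> [2, 5, 8]
```

The $4/n$ rule is the stricter of the two here, flagging moderately influential points that the simple rule would pass over.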
Kenneth A. Bollen, Robert W. Jackman: Regression Diagnostics: An Expository Treatment of Outliers and Influential Cases. In: Modern Methods of Data Analysis. Newbury Park, CA 1990, ISBN 0-8039-3366-5, pp. 257–259.