In statistics, especially in regression diagnostics, the Cook distance (also Cook's distance or Cook's measure) is the most widely used measure for identifying so-called influential observations after a least squares regression has been carried out. The Cook distance is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.
Data points with large residuals (outliers) and/or large leverage values can distort the outcome and precision of a regression. Cook's distance measures the effect of omitting a given observation; data points with a large Cook distance deserve closer scrutiny when analyzing the data. Let the multiple linear regression model in vector-matrix form be

$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$,
where the disturbance vector $\boldsymbol{\varepsilon}$ follows a multidimensional normal distribution $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$, $\boldsymbol{\beta} \in \mathbb{R}^{p}$ is the vector of regression coefficients (here $p = k + 1$ is the number of unknown parameters to be estimated and $k$ the number of explanatory variables), and $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ is the data matrix. The least squares estimation vector is then $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}$, from which the estimation vector of the dependent variable follows as

$\hat{\boldsymbol{y}} = \boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y} = \boldsymbol{H}\boldsymbol{y}$,
where $\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}$ represents the prediction matrix (hat matrix). The $i$-th diagonal element of $\boldsymbol{H}$ is given by $h_{ii} = \boldsymbol{x}_i^{\top}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{x}_i$, where $\boldsymbol{x}_i^{\top}$ is the $i$-th row of the data matrix $\boldsymbol{X}$. The values $h_{ii}$ are also referred to as the "leverage values" of the $i$-th observation. To formalize the influence of a point $(\boldsymbol{x}_i, y_i)$, consider the effect of omitting the point on $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{y}}$. The estimate of $\boldsymbol{\beta}$ obtained by omitting the $i$-th observation is given by $\hat{\boldsymbol{\beta}}_{(i)} = (\boldsymbol{X}_{(i)}^{\top}\boldsymbol{X}_{(i)})^{-1}\boldsymbol{X}_{(i)}^{\top}\boldsymbol{y}_{(i)}$, where $\boldsymbol{X}_{(i)}$ and $\boldsymbol{y}_{(i)}$ denote the data with the $i$-th observation removed. One can compare $\hat{\boldsymbol{\beta}}_{(i)}$ with $\hat{\boldsymbol{\beta}}$ using the Cook distance $D_i$, which is defined by

$D_i = \frac{(\hat{\boldsymbol{y}} - \hat{\boldsymbol{y}}_{(i)})^{\top}(\hat{\boldsymbol{y}} - \hat{\boldsymbol{y}}_{(i)})}{p\,\hat{\sigma}^2} = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$,

where $\hat{\boldsymbol{y}}_{(i)} = \boldsymbol{X}\hat{\boldsymbol{\beta}}_{(i)}$, $e_i$ is the $i$-th residual, and $\hat{\sigma}^2 = \boldsymbol{e}^{\top}\boldsymbol{e}/(n - p)$ is the estimated error variance.
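The two forms of the definition can be checked numerically: the closed-form expression using residuals and leverages must agree with the leave-one-out refits. A minimal NumPy sketch, on synthetic data of our own choosing (the design matrix and coefficients are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)          # synthetic data for illustration
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design matrix
p = X.shape[1]                          # p = k + 1 parameters
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Least squares fit and hat matrix H = X (X'X)^{-1} X'
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverage values h_ii
e = y - X @ beta_hat                    # residuals
s2 = e @ e / (n - p)                    # estimated error variance

# Closed-form Cook's distance: D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2
D = e**2 / (p * s2) * h / (1 - h)**2

# Definition: refit without observation i and compare fitted values
D_loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    diff = X @ beta_hat - X @ b_i       # y_hat - y_hat_(i)
    D_loo[i] = diff @ diff / (p * s2)

assert np.allclose(D, D_loo)            # both forms agree
```

The closed form is preferable in practice, since it needs only one fit of the full model rather than $n$ refits.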
Different cut-off values have been proposed for deciding which observations count as highly influential. The simple rule of thumb $D_i > 1$ has been suggested. Other authors have suggested $D_i > 4/n$, where $n$ is the number of observations.
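Applied to a vector of Cook distances, the two cut-offs above amount to simple threshold checks. A sketch with hypothetical distance values (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical Cook's distances for n = 10 observations
D = np.array([0.02, 0.05, 1.30, 0.01, 0.08, 0.45, 0.03, 0.02, 0.60, 0.04])
n = len(D)

flagged_simple = np.flatnonzero(D > 1)      # rule of thumb D_i > 1   -> [2]
flagged_4n = np.flatnonzero(D > 4 / n)      # cut-off D_i > 4/n = 0.4 -> [2, 5, 8]
```

The $4/n$ rule is the stricter of the two here, flagging moderately influential points that the simple rule would pass over.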
Kenneth A. Bollen, Robert W. Jackman: Regression Diagnostics: An Expository Treatment of Outliers and Influential Cases. In: Modern Methods of Data Analysis. Newbury Park, CA 1990, ISBN 0-8039-3366-5, pp. 257–259.