Imputation (statistics)

from Wikipedia, the free encyclopedia

In mathematical statistics, imputation is the collective term for methods that fill in missing data from statistical surveys (so-called non-response) in the data matrix. This reduces the non-response bias that the missing answers would otherwise cause.

General

Imputation is one of the so-called missing data techniques, i.e. methods used when analysing incomplete sample data sets. The problem arises relatively often in surveys and other data collections, for example when some respondents deliberately leave certain questions unanswered because they lack the knowledge or the motivation to answer. Incomplete data sets can also result from technical failures or data loss.

Besides imputation, the so-called elimination methods (also: complete case analysis) are among the common missing data techniques. Here, every record with a missing value in one or more survey variables is deleted from the data matrix, so that a complete data matrix remains for analysis. Although this procedure is very simple, it has considerable disadvantages. In particular, when item non-response (missing individual values) is frequent, a great deal of information is lost. Furthermore, the technique can bias the remaining sample if the mechanism behind the missing data depends on the values of the incompletely observed variable itself. A frequent example is income surveys, in which people with a relatively high income may be reluctant to report it, so that missing values tend to occur precisely in those cases. To address this problem, imputation methods were developed that do not simply ignore missing data but replace them with plausible values, which can be estimated from, among other things, the observed values of the same data set.
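The information loss caused by complete case analysis can be illustrated with a small sketch. The records and the helper name `complete_cases` are hypothetical; `None` is assumed to mark a missing answer.

```python
# Hypothetical toy survey: None marks a missing answer (item non-response).
records = [
    {"age": 34, "income": 2800},
    {"age": 51, "income": None},
    {"age": 29, "income": 2100},
    {"age": None, "income": 3900},
    {"age": 45, "income": 3300},
]

def complete_cases(rows):
    """Keep only the records in which every variable was observed."""
    return [row for row in rows if all(v is not None for v in row.values())]

kept = complete_cases(records)
lost = len(records) - len(kept)
print(f"{len(kept)} complete records kept, {lost} discarded")
```

In this toy data set, two of the five records are discarded although each of them lacks only a single value.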

Selected imputation procedures

There are a number of methods for filling in missing values. A rough distinction is made between single and multiple imputation. In single imputation, each missing value is replaced by one specific estimate; in multiple imputation, several values are estimated for each missing item, usually by simulation based on one or more distribution models.

Single imputation

Substitution by measures of central tendency

One of the simplest imputation methods is to replace all missing values of a survey variable with an empirical measure of central tendency of the observed values, usually the mean or, for non-quantitative variables, the median or mode. However, this method has the disadvantage that, as with an elimination method, bias arises if the missingness depends on the values of the variable concerned. Furthermore, the completed sample has a systematically underestimated standard deviation, since the imputed values are constant and therefore show no variation among themselves. These problems can be partially alleviated if the method is not applied uniformly to the entire sample but separately within classes, into which the records are divided according to the values of a fully observed variable. A class mean can then be calculated for each of these classes and used to replace the missing values within that class.
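Both variants can be sketched in a few lines. The function names, the toy data, and the use of `None` for missing values are assumptions made for illustration only.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace every None by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_class_mean(values, classes):
    """Replace every None by the mean of the observed values in the same class.

    Assumes each class contains at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]

income = [2000, None, 2400, 4000, None, 4400]
group = ["a", "a", "a", "b", "b", "b"]  # fully observed class variable

overall = impute_mean(income)                 # both gaps get the overall mean
per_class = impute_class_mean(income, group)  # each gap gets its class mean

observed = [v for v in income if v is not None]
print(pstdev(observed), pstdev(overall))      # the spread shrinks after imputation
```

The last line shows the underestimated standard deviation mentioned above: the series completed with the overall mean has a smaller spread than the observed values alone.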

Substitution by ratio estimators

Substitution by a ratio estimator is a relatively simple procedure that tries to exploit a possible functional relationship between two sample variables, one of which was fully observed, when estimating the imputation values. Let X and Y be two random variables collected in a sample of size $n$, where X was fully observed and the Y value is available for only $n_{obs}$ of the $n$ objects. Each of the $n_{mis} = n - n_{obs}$ missing Y values can then be estimated by a ratio estimator:

$$\hat{y}_i = \frac{\bar{y}_{obs}}{\bar{x}_{obs}} \, x_i \quad \text{for all } i \in mis$$

where

$$\bar{y}_{obs} = \frac{1}{n_{obs}} \sum_{i \in obs} y_i$$

and

$$\bar{x}_{obs} = \frac{1}{n_{obs}} \sum_{i \in obs} x_i .$$

It should be noted that this estimator can only be used meaningfully in special cases, usually when a strong correlation between X and Y can be assumed.
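A minimal sketch of ratio imputation follows. It assumes the usual form of the estimator, in which the ratio of the two means over the fully observed pairs scales each x value; the function name and the data are hypothetical.

```python
def impute_ratio(x, y):
    """Fill gaps in y with (mean of observed y / mean of the paired x) * x_i."""
    obs = [i for i, v in enumerate(y) if v is not None]
    x_bar = sum(x[i] for i in obs) / len(obs)  # mean of x over observed pairs
    y_bar = sum(y[i] for i in obs) / len(obs)  # mean of the observed y values
    ratio = y_bar / x_bar
    return [ratio * x[i] if y[i] is None else y[i] for i in range(len(y))]

x = [10, 20, 30, 40]       # fully observed variable
y = [21, 39, None, None]   # None marks a missing value
filled = impute_ratio(x, y)
print(filled)
```

Here the ratio of the means is 30 / 15 = 2, so the two missing values are estimated as 60 and 80.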

Hot deck and cold deck techniques

The procedures known as hot deck or cold deck all share the feature that missing sample values are replaced by observed values of the same variable. They differ only in how the imputation values are determined. Cold deck techniques take the values from other surveys (for example from historical, "cold" surveys), whereas the much more common hot deck techniques use the current data matrix. Deck techniques are usually applied within imputation classes, that is, classes into which the records can be divided according to the values of a fully observed variable.

A well-known hot deck method is the so-called sequential or traditional hot deck. The procedure is as follows: in the incomplete data matrix, a starting imputation value is first set within each imputation class for each incompletely observed variable. The sequential methods differ in how the starting values are determined; conceivable choices include the mean of the observed class values, a random value from the respective class, or a cold deck estimate. Once the starting values have been set, all elements of the data matrix are processed in order. If a value is available for an object, it becomes the new imputation value for that variable in the same imputation class; otherwise, the current imputation value for that variable is substituted for the missing value. This continues through all elements of the data matrix until it no longer contains any gaps.
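The sequential pass can be sketched for a single variable as follows. The function name and data are hypothetical, and the class mean is used as the starting value, which is only one of the options mentioned above.

```python
from statistics import mean

def sequential_hot_deck(values, classes):
    """Sequential hot deck for one variable within imputation classes."""
    # Starting value per class: the mean of the observed class values.
    # Assumes each class contains at least one observed value.
    observed = {}
    for v, c in zip(values, classes):
        if v is not None:
            observed.setdefault(c, []).append(v)
    current = {c: mean(vs) for c, vs in observed.items()}

    result = []
    for v, c in zip(values, classes):
        if v is None:
            result.append(current[c])  # substitute the current imputation value
        else:
            current[c] = v             # observed value becomes the new donor
            result.append(v)
    return result

values = [None, 4, None, 8, 6, None, None]
classes = ["a", "a", "a", "b", "a", "b", "a"]
filled = sequential_hot_deck(values, classes)
print(filled)
```

The first record has no preceding donor in class "a" and therefore receives the starting value; every later gap receives the most recently observed value of its class. The result thus depends on the order of the records in the data matrix.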

Regression method

The imputation methods based on regression analysis all try to exploit functional relationships between two or more sample variables when estimating the missing values. The imputations using the sample mean or a ratio estimator described above are also simplified forms of regression imputation. In general, the number of variables included and the regression method can both vary. For quantitative variables, linear regression by the least squares method is often used. Let X and Y be two random variables collected together in a sample of size $n$, where X was fully observed and Y was observed for only $n_{obs}$ of the $n$ objects. If, as assumed, the two variables are correlated, a regression equation of Y on X of the following form can be calculated from the observed $(x, y)$ value pairs:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} \, x_i \quad \text{for all } i \in mis$$

Here $\alpha$ and $\beta$ are the regression coefficients, which are estimated from the observed $(x, y)$ value pairs by their least squares estimators $\hat{\alpha}$ and $\hat{\beta}$:

$$\hat{\beta} = \frac{\sum_{i \in obs} (x_i - \bar{x}_{obs})(y_i - \bar{y}_{obs})}{\sum_{i \in obs} (x_i - \bar{x}_{obs})^2}, \qquad \hat{\alpha} = \bar{y}_{obs} - \hat{\beta} \, \bar{x}_{obs}$$

where $\bar{x}_{obs}$ and $\bar{y}_{obs}$ denote the means of the x and y values over the fully observed pairs.

Regression estimation with more than one regressor, the so-called multiple linear regression, is carried out analogously but is computationally more demanding because of the larger amount of data involved. It is a standard feature of statistical software packages such as SPSS.
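The simple case with a single regressor can be sketched as follows. The coefficients are the ordinary least squares estimates computed from the fully observed pairs; the function name and the data are hypothetical.

```python
def impute_regression(x, y):
    """Fill gaps in y with predictions from a least squares line of y on x."""
    obs = [i for i, v in enumerate(y) if v is not None]
    n_obs = len(obs)
    x_bar = sum(x[i] for i in obs) / n_obs
    y_bar = sum(y[i] for i in obs) / n_obs
    s_xy = sum((x[i] - x_bar) * (y[i] - y_bar) for i in obs)
    s_xx = sum((x[i] - x_bar) ** 2 for i in obs)
    beta = s_xy / s_xx              # estimated slope
    alpha = y_bar - beta * x_bar    # estimated intercept
    filled = [alpha + beta * x[i] if y[i] is None else y[i]
              for i in range(len(y))]
    return alpha, beta, filled

x = [1, 2, 3, 4, 5]            # fully observed variable
y = [3, 5, 7, None, None]      # None marks a missing value
alpha, beta, filled = impute_regression(x, y)
print(alpha, beta, filled)
```

In this example the observed pairs lie exactly on the line y = 1 + 2x, so the missing values are predicted as 9 and 11. Like mean imputation, this places all imputed values exactly on the fitted line and therefore understates the variability of Y.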

If an incompletely observed variable is not quantitative, no estimate can be calculated by linear regression. For certain categorical variables, however, there are special regression methods, of which logistic regression is probably the best known.

Multiple imputation

Multiple imputation is a comparatively demanding missing data procedure. In principle, "multiple" means that the procedure delivers several estimates for each missing value, one per imputation step. These can then be averaged into a single estimate, or a new, completed data matrix can be set up for each imputation step. A common way of determining the estimates is simulation from a multivariate distribution model that is considered plausible. If, for example, the two random variables X and Y are assumed to follow a joint normal distribution with fixed parameters, then for value pairs with an observed X value and a missing Y value the conditional distribution of Y given the observed X value can be derived; in this simple case it is a univariate normal distribution. The possible imputation values for each missing Y value can then be generated by repeated simulation from the respective distribution.
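The bivariate normal example can be sketched as follows. The parameter values are hypothetical and treated as fixed and known, as in the text; the conditional mean and standard deviation follow from the standard formulas for the bivariate normal distribution.

```python
import random
from math import sqrt

def conditional_normal(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Mean and standard deviation of Y given X = x under a bivariate normal."""
    cond_mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
    cond_sd = sigma_y * sqrt(1 - rho ** 2)
    return cond_mean, cond_sd

def multiply_impute(x, y, params, m=5, seed=0):
    """Return m completed copies of y with freshly simulated values in the gaps."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    completed = []
    for _ in range(m):
        copy = []
        for xi, yi in zip(x, y):
            if yi is None:
                mean_i, sd_i = conditional_normal(xi, **params)
                copy.append(rng.gauss(mean_i, sd_i))  # draw from Y | X = xi
            else:
                copy.append(yi)
        completed.append(copy)
    return completed

# Hypothetical, assumed-known parameters of the joint distribution.
params = {"mu_x": 0.0, "mu_y": 10.0, "sigma_x": 1.0, "sigma_y": 2.0, "rho": 0.5}
x = [-1.0, 0.5, 2.0]
y = [9.2, None, None]   # None marks a missing value
datasets = multiply_impute(x, y, params, m=5)
for d in datasets:
    print(d)
```

Each of the five completed data sets keeps the observed value and contains different simulated values for the gaps, so the variation between the data sets reflects the uncertainty about the missing values.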

Literature

  • U. Bankhofer: Incomplete data and distance matrices in multivariate data analysis. Dissertation, University of Augsburg, Verlag Josef Eul, Bergisch Gladbach 1995.
  • O. Lüdtke, A. Robitzsch, U. Trautwein, O. Köller: Dealing with missing values in psychological research. Problems and solutions. Psycholog. Rundschau 58 (2), 103-117 (2007). Comment and reply:
    • J. Wuttke: Increased need for documentation when imputing missing data. Psycholog. Rundschau 59 (3), 178-179 (2008).
    • O. Lüdtke et al.: Does transparency stand in the way of adequate data evaluation? ibid., 180-181 (2008).
  • J. L. Schafer: Analysis of Incomplete Multivariate Data. Chapman & Hall, London 1997, ISBN 0-412-04061-1.
  • D. Schunk: A Markov Chain Monte Carlo Algorithm for Multiple Imputation in Large Surveys. Advances in Statistical Analysis 92, 101-114 (2008).
  • C. F. G. Schendera: Data quality with SPSS. Oldenbourg Verlag, Munich 2007, pp. 119-161.
