Sample variance (estimator)


The sample variance is an estimator (estimation function) in mathematical statistics. Its central task is to estimate the unknown variance of an underlying probability distribution. Outside of estimation theory, it is also used as an auxiliary quantity for the construction of confidence intervals and statistical tests. The sample variance is defined in several variants, which differ slightly in their properties and thus also in their areas of application. The distinction between the different variants is not handled uniformly in the literature. Whenever only "the" sample variance is mentioned, it should therefore always be checked which of the definitions applies in the given context.

The empirical variance, a measure of the dispersion of a sample, i.e. of finitely many numbers, is also referred to as sample variance. The sample variance of a concrete sample corresponds to an estimated value and is thus a realization of the sample variance as an estimator and random variable.

Definition

To estimate the expected value and the variance of a population, let $X_1, \dots, X_n$ be random variables; in practice, these are the sample variables. Let

$$\bar{X} := \frac{1}{n}\sum_{i=1}^{n} X_i$$

denote the sample mean.

First, the expected value has to be estimated; it appears here as the parameter $\mu$. Using the least-squares criterion

$$\sum_{i=1}^{n}\left(X_i - \mu\right)^2 \;\to\; \min_{\mu},$$

one obtains the sample mean as the estimate of the expected value:

$$\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Since a degree of freedom is consumed by estimating the sample mean $\bar{X}$, it is common to "correct" the empirical variance with the factor $\frac{n}{n-1}$. There are essentially three different definitions of the sample variance in the literature. Many authors call

$$s^2 := \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

the sample variance or, for better distinction, the corrected sample variance. Alternatively,

$$\tilde{s}^2 := \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

is also called sample variance, as is

$$s_\mu^2 := \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2$$

for a fixed real number $\mu$.

Neither the notation nor the terminology for the various definitions of the sample variance is consistent in the literature, and this is a regular source of error: the same symbols are used by different authors for different definitions.

In this article, the notations $s^2$, $\tilde{s}^2$ and $s_\mu^2$ introduced above are used for the sake of clarity, and the corresponding quantities are referred to as the corrected sample variance, the sample variance and the sample variance with specified expected value, respectively. These terms are not widespread in the literature and are introduced here only for clarity. When comparing different sources, the definitions, notations and terminology should therefore always be checked against each other in order to avoid errors.
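As a concrete illustration, the three variants can be computed directly; the following Python sketch (the function names are illustrative and not taken from any library) evaluates $s^2$, $\tilde{s}^2$ and $s_\mu^2$ for a small list of values:

```python
def corrected_sample_variance(x):
    """s^2: divides the sum of squared deviations by n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def uncorrected_sample_variance(x):
    """s~^2: divides the sum of squared deviations by n."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / n

def sample_variance_known_mean(x, mu):
    """s_mu^2: uses a fixed, known expected value mu and divides by n."""
    return sum((xi - mu) ** 2 for xi in x) / len(x)

x = [3, 4, 5, 6, 7]
print(corrected_sample_variance(x))        # 2.5
print(uncorrected_sample_variance(x))      # 2.0
print(sample_variance_known_mean(x, 5.0))  # 2.0
```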

Use

An important use of the sample variance is to estimate the variance of an unknown probability distribution. Depending on the circumstances, the different definitions are used, since they satisfy different optimality criteria (see below). As a rule of thumb:

  • If the expected value and the variance of the probability measure are both unknown, the estimator $s^2$ is used.
  • If the variance is unknown and the expected value is known to equal $\mu$, then $s_\mu^2$ is used as the estimator.

The estimator $\tilde{s}^2$ is usually not used; it arises, for example, from the method of moments or the maximum likelihood method and does not satisfy the common quality criteria.

In addition to its use as an estimator, the sample variance also serves as an auxiliary function for the construction of confidence intervals and statistical tests. There it appears, for example, as a pivot statistic for the construction of confidence intervals in the normal distribution model or as a test statistic in the chi-square test.
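As a sketch of this use as a pivot statistic, the following Python snippet builds a two-sided confidence interval for the variance under the assumption of normally distributed data; it relies on the chi-square quantile function from SciPy and is meant as an illustration, not as the construction used in any particular source:

```python
from scipy import stats

def variance_confidence_interval(x, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the variance,
    assuming i.i.d. normal data.  Uses the pivot statistic
    (n - 1) * s^2 / sigma^2, which is chi-square distributed with
    n - 1 degrees of freedom."""
    n = len(x)
    mean = sum(x) / n
    s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)   # corrected sample variance
    lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
    return lower, upper

print(variance_confidence_interval([3, 4, 5, 6, 7]))   # roughly (0.90, 20.6)
```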

Properties

Framework

The sample variance is mostly used under the assumption that the observations are independent and identically distributed and have either a known or an unknown expected value. These assumptions are described by the following statistical models:

  • If the expected value is unknown, the statistical model is given by the (not necessarily parametric) product model
$$\left(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \left(P_\vartheta^{\otimes n}\right)_{\vartheta \in \Theta}\right).$$
Here $P_\vartheta^{\otimes n}$ denotes the $n$-fold product measure of $P_\vartheta$, and $(P_\vartheta)_{\vartheta \in \Theta}$ is the family of all probability measures with finite variance, indexed by an arbitrary index set $\Theta$. The sample variables $X_1, \dots, X_n$ are then independent and identically distributed according to $P_\vartheta$ and therefore have finite variance.
  • If the expected value is known and equals $\mu$, the statistical model is given by the (not necessarily parametric) product model
$$\left(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \left(P_\vartheta^{\otimes n}\right)_{\vartheta \in \Theta}\right).$$
Here $(P_\vartheta)_{\vartheta \in \Theta}$ denotes the family of all probability measures with finite variance and expected value $\mu$, indexed by an arbitrary index set $\Theta$. The sample variables $X_1, \dots, X_n$ are then independent and identically distributed according to $P_\vartheta$ and thus have finite variance and expected value $\mu$.

Unbiasedness

Known expected value

In the case of a known expected value, $s_\mu^2$ is an unbiased estimator of the variance. That means

$$\operatorname{E}_\vartheta\left(s_\mu^2\right) = \operatorname{Var}_\vartheta(X_1) \quad \text{for all } \vartheta \in \Theta.$$

Here $\operatorname{E}_\vartheta$ and $\operatorname{Var}_\vartheta$ denote the expected value and the variance with respect to the probability measure $P_\vartheta$.

Unbiasedness holds because

$$\operatorname{E}_\vartheta\left(s_\mu^2\right) = \operatorname{E}_\vartheta\!\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2\right) = \frac{1}{n}\sum_{i=1}^{n}\operatorname{E}_\vartheta\!\left(\left(X_i - \mu\right)^2\right) = \frac{1}{n}\sum_{i=1}^{n}\operatorname{Var}_\vartheta(X_i) = \operatorname{Var}_\vartheta(X_1).$$

Here the first step follows from the linearity of the expected value; the second holds because $\mu$ is, by assumption, the known expected value, so that $\operatorname{E}_\vartheta\left(\left(X_i - \mu\right)^2\right)$ is the variance of $X_i$ by definition; the third step uses that the $X_i$ are all identically distributed.

Unknown expected value

In the case of an unknown expected value, $s^2$ is an unbiased estimator of the variance, that is,

$$\operatorname{E}_\vartheta\left(s^2\right) = \operatorname{Var}_\vartheta(X_1) \quad \text{for all } \vartheta \in \Theta.$$

In contrast, $\tilde{s}^2$ is not unbiased, since

$$\operatorname{E}_\vartheta\left(\tilde{s}^2\right) = \frac{n-1}{n}\operatorname{Var}_\vartheta(X_1).$$

However, the estimator $\tilde{s}^2$ is still asymptotically unbiased. This follows directly from the representation above, because

$$\lim_{n \to \infty} \operatorname{E}_\vartheta\left(\tilde{s}^2\right) = \lim_{n \to \infty} \frac{n-1}{n}\operatorname{Var}_\vartheta(X_1) = \operatorname{Var}_\vartheta(X_1).$$
Derivation of unbiasedness

First, note that due to independence

$$\operatorname{E}(X_i X_j) = \operatorname{E}(X_i)\operatorname{E}(X_j) \quad \text{for } i \neq j,$$

and due to the identical distributions

$$\operatorname{E}(X_i) = \operatorname{E}(X_1) \quad \text{and} \quad \operatorname{E}(X_i^2) = \operatorname{E}(X_1^2)$$

for all $i$, and thus $\operatorname{E}(X_i X_j) = \left(\operatorname{E}(X_1)\right)^2$ for $i \neq j$.

It follows directly that

$$\operatorname{E}\left(\bar{X}^2\right) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{E}(X_i X_j) = \frac{1}{n^2}\left(n\operatorname{E}(X_1^2) + n(n-1)\left(\operatorname{E}(X_1)\right)^2\right) = \frac{1}{n}\operatorname{E}(X_1^2) + \frac{n-1}{n}\left(\operatorname{E}(X_1)\right)^2,$$

using the independence and the identical distributions in the last step together with the linearity of the expected value.

Analogously, since the $X_i$ are also identically distributed (in particular $\operatorname{E}(X_i^2) = \operatorname{E}(X_1^2)$ for all $i$),

$$\operatorname{E}\left(X_i \bar{X}\right) = \frac{1}{n}\sum_{j=1}^{n}\operatorname{E}(X_i X_j) = \frac{1}{n}\operatorname{E}(X_1^2) + \frac{n-1}{n}\left(\operatorname{E}(X_1)\right)^2,$$

again using the independence and the identical distributions in the last step.

With the help of these two identities and the linearity of the expected value, one obtains

$$\operatorname{E}\left(\tilde{s}^2\right) = \frac{1}{n}\sum_{i=1}^{n}\left(\operatorname{E}(X_i^2) - 2\operatorname{E}(X_i\bar{X}) + \operatorname{E}(\bar{X}^2)\right) = \frac{n-1}{n}\left(\operatorname{E}(X_1^2) - \left(\operatorname{E}(X_1)\right)^2\right) = \frac{n-1}{n}\operatorname{Var}(X_1).$$

The last equality follows from the shift theorem $\operatorname{Var}(X_1) = \operatorname{E}(X_1^2) - \left(\operatorname{E}(X_1)\right)^2$. It then follows that

$$\operatorname{E}\left(s^2\right) = \frac{n}{n-1}\operatorname{E}\left(\tilde{s}^2\right) = \operatorname{Var}(X_1),$$

so $s^2$ is unbiased, and analogously

$$\lim_{n\to\infty}\operatorname{E}\left(\tilde{s}^2\right) = \operatorname{Var}(X_1),$$

so $\tilde{s}^2$ is asymptotically unbiased.
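The two expected values can also be checked empirically; the following Python sketch (with an arbitrarily chosen uniform population and sample size) averages both estimators over many simulated samples:

```python
import random

random.seed(0)
n, runs = 5, 200_000
true_var = 1.0 / 12.0          # variance of the uniform distribution on [0, 1]

sum_s2, sum_s2_tilde = 0.0, 0.0
for _ in range(runs):
    x = [random.random() for _ in range(n)]      # i.i.d. uniform on [0, 1]
    mean = sum(x) / n
    ssd = sum((xi - mean) ** 2 for xi in x)      # sum of squared deviations
    sum_s2 += ssd / (n - 1)
    sum_s2_tilde += ssd / n

print(sum_s2 / runs)        # close to 1/12 ≈ 0.0833 (unbiased)
print(sum_s2_tilde / runs)  # close to (n-1)/n * 1/12 ≈ 0.0667 (biased downward)
```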

Bessel correction

From the definitions, the connection

$$s^2 = \frac{n}{n-1}\,\tilde{s}^2$$

follows directly. The factor $\frac{n}{n-1}$ is called the Bessel correction (after Friedrich Wilhelm Bessel). It can be understood as a correction factor insofar as it corrects $\tilde{s}^2$ in such a way that the resulting estimator is unbiased. Indeed, as shown above,

$$\operatorname{E}_\vartheta\left(\tilde{s}^2\right) = \frac{n-1}{n}\operatorname{Var}_\vartheta(X_1),$$

and the Bessel correction is exactly the reciprocal of the factor $\frac{n-1}{n}$. The estimator $s^2$ thus arises from $\tilde{s}^2$ via the Bessel correction.

Sample standard deviation

If the random variables $X_1, \dots, X_n$ are independent and identically distributed, for example a sample, then an estimate of the standard deviation of the population results as the square root of the sample variance $\tilde{s}^2$ or $s^2$, that is,

$$\tilde{s} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$

or

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$

with

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

The quantity $s$ is called the sample standard deviation; its realizations correspond to the empirical standard deviation. Since unbiasedness is lost in most cases when a nonlinear function such as the square root is applied, the sample standard deviation, in contrast to the corrected sample variance, is in neither case an unbiased estimator of the standard deviation.

Estimating population standard deviation from a sample

The corrected sample variance $s^2$ is an unbiased estimator of the population variance. In contrast, there is no unbiased estimator for the standard deviation. Since the square root is a concave function, it follows from Jensen's inequality together with the unbiasedness of $s^2$ that

$$\operatorname{E}\left(\sqrt{s^2}\right) \le \sqrt{\operatorname{E}\left(s^2\right)} = \sigma.$$

In most cases, this estimator underestimates the standard deviation of the population.
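This underestimation can be illustrated with a small simulation; the sketch below uses an arbitrarily chosen standard normal population with $\sigma = 1$ and averages the corrected sample standard deviation over many samples:

```python
import math
import random

random.seed(1)
n, runs = 5, 200_000
total_s = 0.0
for _ in range(runs):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]   # population with sigma = 1
    mean = sum(x) / n
    s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # corrected sample variance
    total_s += math.sqrt(s2)

print(total_s / runs)   # clearly below 1 (roughly 0.94 for n = 5)
```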

Example

If one of the numbers $-1$ or $1$ is chosen by tossing a fair coin, i.e. each with probability $\tfrac{1}{2}$, then the result is a random variable with expected value $0$, variance $\sigma^2 = 1$ and standard deviation $\sigma = 1$. From $n = 2$ independent throws $X_1$ and $X_2$ one calculates the corrected sample variance

$$s^2 = \frac{\left(X_1 - \bar{X}\right)^2 + \left(X_2 - \bar{X}\right)^2}{2 - 1},$$

in which

$$\bar{X} = \frac{X_1 + X_2}{2}$$

denotes the sample mean. There are four possible outcomes, each with probability $\tfrac{1}{4}$:

Outcome      Sample mean   s^2   s
(-1, -1)     -1            0     0
(-1, 1)       0            2     √2
(1, -1)       0            2     √2
(1, 1)        1            0     0

The expected value of the corrected sample variance is therefore

$$\operatorname{E}\left(s^2\right) = \frac{0 + 2 + 2 + 0}{4} = 1 = \sigma^2.$$

The corrected sample variance is thus indeed unbiased. The expected value of the corrected sample standard deviation, however, is

$$\operatorname{E}(s) = \frac{0 + \sqrt{2} + \sqrt{2} + 0}{4} = \frac{\sqrt{2}}{2} \approx 0.707 < 1 = \sigma.$$

So the corrected sample standard deviation underestimates the population standard deviation.
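The same numbers can be reproduced by enumerating the four outcomes directly; the following Python sketch assumes the two coin values $-1$ and $1$ used in the example above:

```python
import math
from itertools import product

outcomes = list(product([-1, 1], repeat=2))   # four equally likely results of two tosses

exp_s2, exp_s = 0.0, 0.0
for x1, x2 in outcomes:
    mean = (x1 + x2) / 2
    s2 = ((x1 - mean) ** 2 + (x2 - mean) ** 2) / (2 - 1)   # corrected sample variance
    exp_s2 += s2 / len(outcomes)
    exp_s += math.sqrt(s2) / len(outcomes)

print(exp_s2)   # 1.0  -> matches the population variance
print(exp_s)    # 0.707... = sqrt(2)/2 -> underestimates the standard deviation 1
```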

Calculation for accumulating measured values

In systems that continuously collect large amounts of measured values, it is often impractical to store all measured values temporarily in order to calculate the standard deviation.

In this context, it is better to use a modified formula that avoids the critical term $\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$. This term cannot be updated immediately for every new measured value because the mean $\bar{x}$ is not constant.

By applying the shift theorem and the definition of the mean, one arrives at the representation

$$s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)}$$

or

$$s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)},$$

which can be updated immediately for each incoming measured value if the sum of the measured values $\sum_i x_i$ and the sum of their squares $\sum_i x_i^2$ are stored and continuously updated. However, this representation is numerically less stable; in particular, the term under the square root can become numerically smaller than $0$ due to rounding errors.

By cleverly rearranging the equation, a form can be found for the latter equation that is numerically more stable. It uses the variance and the mean value of the previous iteration step as well as the sample value and the mean value of the current iteration step:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}, \qquad s_n^2 = \frac{n-2}{n-1}\, s_{n-1}^2 + \frac{\left(x_n - \bar{x}_{n-1}\right)\left(x_n - \bar{x}_n\right)}{n-1}.$$
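A minimal single-pass implementation of this update rule might look as follows; it is a sketch in the spirit of the recursion above (essentially Welford's method) and keeps only the running count, mean and corrected sample variance:

```python
class RunningVariance:
    """Single-pass computation of the sample mean and the corrected
    sample variance using the update formulas above."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.s2 = 0.0          # corrected sample variance of the values seen so far

    def update(self, x):
        self.n += 1
        delta = x - self.mean              # x_n - mean_{n-1}
        self.mean += delta / self.n        # mean_n = mean_{n-1} + delta / n
        if self.n > 1:
            # s_n^2 = ((n-2) * s_{n-1}^2 + (x_n - mean_{n-1}) * (x_n - mean_n)) / (n-1)
            self.s2 = ((self.n - 2) * self.s2 + delta * (x - self.mean)) / (self.n - 1)

rv = RunningVariance()
for x in [3, 4, 5, 6, 7]:
    rv.update(x)
print(rv.mean, rv.s2)   # 5.0 2.5
```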

Normally distributed random variables

Calculation bases

For the case of normally distributed random variables, however, an unbiased estimator of the standard deviation can be given:

$$\hat{\sigma} = \sqrt{\frac{n-1}{2}}\,\frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)}\, s,$$

where $\hat{\sigma}$ is the estimate of the standard deviation, $s = \sqrt{s^2}$ the corrected sample standard deviation and $\Gamma$ the gamma function. The formula follows from the fact that $(n-1)s^2/\sigma^2$ has a chi-square distribution with $n-1$ degrees of freedom.

Correction factors for the unbiased estimation of the standard deviation

Sample size n   Correction factor
2               1.253314
5               1.063846
10              1.028109
15              1.018002
25              1.010468
Example

In a random sample from a normally distributed random variable, the five values 3, 4, 5, 6, 7 were measured. The estimate of the standard deviation is to be calculated.

The corrected sample variance is

$$s^2 = \frac{(3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2}{5-1} = \frac{10}{4} = 2.5.$$

The correction factor in this case is

$$\sqrt{\frac{5-1}{2}}\,\frac{\Gamma\!\left(\frac{5-1}{2}\right)}{\Gamma\!\left(\frac{5}{2}\right)} \approx 1.063846,$$

and the unbiased estimate of the standard deviation is thus approximately

$$\hat{\sigma} \approx 1.063846 \cdot \sqrt{2.5} \approx 1.68.$$
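The correction factor and the resulting estimate can be reproduced with the gamma function from Python's standard library; the helper function name below is purely illustrative:

```python
import math

def std_correction_factor(n):
    """Correction factor sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2)
    for the unbiased estimation of the standard deviation of a
    normally distributed population."""
    return math.sqrt((n - 1) / 2) * math.gamma((n - 1) / 2) / math.gamma(n / 2)

x = [3, 4, 5, 6, 7]
n = len(x)
mean = sum(x) / n
s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))   # sqrt(2.5) ≈ 1.581

print(std_correction_factor(2))       # 1.253314...
print(std_correction_factor(5))       # 1.063846...
print(std_correction_factor(5) * s)   # ≈ 1.68
```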
