Total variance

The total variance is a statistical measure of the spread of a multivariate data set (with $p$ variables $X_j$):

$$T = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2}_{= d^2(x_i, \bar{x})}$$

where $x_{ij}$ is the $i$-th observation of the variable $X_j$, $\bar{x}_j$ is the arithmetic mean of the observations of the variable $X_j$, and $d^2(x_i, \bar{x})$ is the squared Euclidean distance between the multivariate observation $x_i = (x_{i1}, \ldots, x_{ip})$ and the center of the data $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_p)$.
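The two forms of the definition can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example data are illustrative assumptions, not part of the source. It computes the total variance once as the mean squared distance to the center and once as the sum of the univariate variances (both with divisor $n$).

```python
def total_variance(data):
    """Total variance T of a data set given as a list of n rows with p values each."""
    n = len(data)
    p = len(data[0])
    # Arithmetic mean of each variable X_j (the center of the data).
    center = [sum(row[j] for row in data) / n for j in range(p)]
    # Mean squared Euclidean distance of the observations to the center.
    return sum(
        sum((row[j] - center[j]) ** 2 for j in range(p)) for row in data
    ) / n


def sum_of_variances(data):
    """The same quantity as the sum of the p univariate variances (divisor n)."""
    n = len(data)
    p = len(data[0])
    total = 0.0
    for j in range(p):
        column = [row[j] for row in data]
        mean_j = sum(column) / n
        total += sum((x - mean_j) ** 2 for x in column) / n
    return total


# Hypothetical example data: n = 4 observations of p = 2 variables.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(total_variance(data))    # 10.0
print(sum_of_variances(data))  # 10.0
```

Each variable in this example has variance 5, so both computations give a total variance of 10.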

It is thus an extension of the empirical variance of a variable to the multivariate case:

$$s^2 = \frac{1}{n} \sum_{i=1}^{n} \underbrace{(x_i - \bar{x})^2}_{= d^2(x_i, \bar{x})}.$$

An important property of the total variance is its invariance under rotation of the data set, i.e. the total variance of the rotated data equals the total variance of the unrotated data. This holds because the total variance is the mean squared distance between the observations and the center of the data, and a rotation leaves Euclidean distances unchanged.
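The rotation invariance can be illustrated with a small two-dimensional example. This is only a sketch: the data, the rotation angle of 30 degrees and the helper names `total_variance` and `rotate_2d` are assumptions chosen for illustration.

```python
import math


def total_variance(data):
    """Mean squared Euclidean distance of the observations to their center."""
    n = len(data)
    p = len(data[0])
    center = [sum(row[j] for row in data) / n for j in range(p)]
    return sum(
        sum((row[j] - center[j]) ** 2 for j in range(p)) for row in data
    ) / n


def rotate_2d(data, angle):
    """Rotate every two-dimensional observation about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y] for x, y in data]


# Hypothetical example data: n = 4 observations of p = 2 variables.
data = [[1.0, 2.0], [3.0, 1.0], [4.0, 6.0], [0.0, 3.0]]
rotated = rotate_2d(data, math.radians(30))

# The two values agree up to floating-point rounding.
print(total_variance(data))
print(total_variance(rotated))
```

The variances of the individual coordinates generally change under the rotation; only their sum is preserved.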

The total variance is closely related to the covariance matrix of the data, which can also be viewed as a generalization of the univariate variance but depends on the chosen basis. The total variance is exactly the trace of this matrix and therefore also the sum of the eigenvalues of the covariance matrix.
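The relation to the covariance matrix can be sketched as follows. The example assumes the covariance matrix is computed with divisor $n$, so that it matches the definition of $T$, and it is restricted to $p = 2$ so that the eigenvalues have a simple closed form; all function names are illustrative.

```python
import math


def covariance_matrix(data):
    """Empirical covariance matrix with divisor n (assumed to match T)."""
    n = len(data)
    p = len(data[0])
    center = [sum(row[j] for row in data) / n for j in range(p)]
    return [
        [
            sum((row[j] - center[j]) * (row[k] - center[k]) for row in data) / n
            for k in range(p)
        ]
        for j in range(p)
    ]


def trace(matrix):
    """Sum of the diagonal entries of a square matrix."""
    return sum(matrix[j][j] for j in range(len(matrix)))


def eigenvalues_sym_2x2(matrix):
    """Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = matrix[0][0], matrix[0][1], matrix[1][1]
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return half_trace + radius, half_trace - radius


# Hypothetical example data: n = 4 observations of p = 2 variables.
data = [[1.0, 2.0], [3.0, 1.0], [4.0, 6.0], [0.0, 3.0]]
S = covariance_matrix(data)
lam1, lam2 = eigenvalues_sym_2x2(S)

print(trace(S))     # 6.0, the sum of the variances 2.5 and 3.5
print(lam1 + lam2)  # equals the trace up to floating-point rounding
```

The diagonal of the covariance matrix holds the variances of the individual variables, which is why its trace reproduces the total variance.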

The proportion of explained total variance is therefore used in principal component analysis, factor analysis and cluster analysis as a measure of how well the data reduction carried out represents the multivariate data set. When this measure is used in cluster analysis, it is referred to as an “internal validation”, since it does not require any additional external information.
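For principal component analysis, the proportion of explained total variance is commonly computed from the eigenvalues of the covariance matrix: the variance retained by the first $k$ components is the sum of the $k$ largest eigenvalues, divided by the total variance. The sketch below assumes this convention; the function name and the eigenvalues are hypothetical.

```python
def explained_variance_ratio(eigenvalues, k):
    """Share of the total variance carried by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    # The total variance is the sum of all eigenvalues of the covariance matrix.
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical eigenvalues of a covariance matrix with p = 4 variables.
eigenvalues = [4.0, 2.5, 1.0, 0.5]
print(explained_variance_ratio(eigenvalues, 1))  # 0.5
print(explained_variance_ratio(eigenvalues, 2))  # 0.8125
```

In this example the first two components would retain about 81 % of the total variance.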

Literature

Ludwig Fahrmeir, Wolfgang Brachinger, Alfred Hamerle, Gerhard Tutz: Multivariate statistical methods, de Gruyter, 2nd edition, 1996