Analysis of variance

Analysis of variance, VA for short (English: analysis of variance, ANOVA for short), also called scatter analysis or scatter decomposition, refers to a large group of data-analytical and structure-testing statistical methods that allow many different applications.

What they have in common is that they compute variances and test statistics in order to obtain information about the regularities behind the data. The variance of one or more target variables is explained by the influence of one or more influencing variables (factors). The simplest form of analysis of variance tests the influence of a single nominally scaled variable on an interval-scaled variable by comparing the means of the dependent variable within the groups defined by the categories of the independent variable. Thus, the analysis of variance in its simplest form is an alternative to the t-test that is suitable for comparisons between more than two groups. Analysis-of-variance models are usually special linear regression models. The method of analysis of variance goes back essentially to Ronald Aylmer Fisher.

Overview

Basic concepts

The dependent variable is called the target variable :

  • The metric random variable whose values are to be explained by the categorical variables (factors).

The independent variable is called the influencing variable or factor :

  • The categorical variable (= factor) that specifies the groups. Its influence is to be checked; it is nominally scaled .
  • The categories of a factor are then called factor levels . This designation is not identical to that used in factor analysis .

Number of target variables

Depending on whether one or more target variables are present, a distinction is made between two types of analysis of variance:

  • the univariate analysis of variance, abbreviated ANOVA after the English term analysis of variance
  • the multivariate analysis of variance, abbreviated MANOVA after the English term multivariate analysis of variance

Depending on whether one or more factors are present, a distinction is made between one-way (single-factor) and multi-way (multi-factor) analysis of variance.

Number of examination units

In the simplest case, the same number of observations is considered at each factor level. In this case, one also speaks of an orthogonal analysis of variance or a balanced model . Working with and interpreting data whose factor levels contain different numbers of elements (e.g. due to missing values) is more difficult (see unbalanced model ).

Fixed and random effects

A common distinction among analysis-of-variance models is whether the factors have fixed effects ( English fixed factors ) or random effects ( English random factors ). One speaks of fixed effects when the influencing factors occur in a finite number of factor levels, all of which have been observed, or when the statement of interest relates only to these factor levels. One speaks of models with random effects when only a selection of all possible factor levels can be observed (see also linear panel data models ).
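
For illustration, a minimal sketch of how this distinction might look in practice, using Python with statsmodels; the data, column names and grouping variable are hypothetical and not taken from any real study. A fixed factor enters the model as an ordinary categorical regressor, while a random factor is modeled as a random intercept per group.

```python
# Sketch: fixed vs. random factor in statsmodels (hypothetical data and columns).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for treatment in ["A", "B"]:                       # fixed factor: all levels of interest observed
    for field in ["f1", "f2", "f3", "f4"]:         # random factor: a sample of many possible fields
        effect = 1.0 if treatment == "B" else 0.0
        for value in 5.0 + effect + rng.normal(0, 0.5, size=5):
            rows.append({"response": value, "treatment": treatment, "field": field})
df = pd.DataFrame(rows)

# Fixed-effects view: both factors as categorical regressors.
fixed_model = smf.ols("response ~ C(treatment) + C(field)", data=df).fit()

# Random-effects view: a random intercept per field (mixed model).
mixed_model = smf.mixedlm("response ~ C(treatment)", data=df, groups=df["field"]).fit()

print(fixed_model.summary())
print(mixed_model.summary())
```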

Basic idea

The total variance can be easily broken down into groups if the variability between the factor levels is large, but the variability within them is low.

The procedures investigate whether (and if so, how) the expected values of the metric random variables differ between groups (also called classes ). The test statistics of the method are used to test whether the variance between the groups is greater than the variance within the groups . In this way it can be determined whether the grouping is useful or not, i.e. whether the groups differ significantly or not.

If they differ significantly, it can be assumed that different laws operate in the groups. For example, it can be clarified whether the behavior of a control group is identical to that of an experimental group. If, for example, the variance of one of these two groups has already been traced back to known causes (sources of variance) and the variances are equal, it can be concluded that no new source of variance (e.g. the experimental conditions) was added in the other group.

See also: discriminant analysis , coefficient of determination

Requirements and alternatives

The reliability of the significance tests in the context of the analysis of variance depends on the extent to which their requirements are met. These requirements vary slightly depending on the application; the following generally apply:

  • The error terms (residuals) must be normally distributed.
  • The variances of the groups must be equal ( variance homogeneity , homoscedasticity ).
  • The measured values or groups must be independent of one another.

The check is carried out with tests outside of the analysis of variance proper, which are nowadays included as standard options in statistical programs. The normal distribution of the residuals can be checked with the Shapiro-Wilk test , and homogeneity of variance with the Levene test .

Analyses of variance are considered robust against deviations from the normal-distribution assumption, especially with larger sample sizes (see central limit theorem ). Inhomogeneous variances are a problem with unequal group sizes. In the case of one-way analyses of variance, the Brown-Forsythe test can be used if necessary. A transformation of the dependent variable may also be considered in order to equalize the variances of the groups, for example by taking the logarithm . If the prerequisites are not sufficiently met, distribution-free, nonparametric methods are also available; they are robust but have lower statistical power and test different parameters than the analysis of variance, since they are based on ranks.
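
As a minimal sketch of how these checks might look in practice, the Shapiro-Wilk and Levene tests mentioned above are available in scipy; the group data below are hypothetical illustrative values.

```python
# Sketch: checking ANOVA assumptions with scipy (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)
group_c = rng.normal(loc=6.0, scale=1.0, size=30)

# Normality of the residuals: deviations of each value from its group mean.
residuals = np.concatenate([g - g.mean() for g in (group_a, group_b, group_c)])
print("Shapiro-Wilk:", stats.shapiro(residuals))

# Homogeneity of variances across groups (Levene's test).
print("Levene:", stats.levene(group_a, group_b, group_c))

# Brown-Forsythe variant: Levene's test using the median as center.
print("Brown-Forsythe:", stats.levene(group_a, group_b, group_c, center="median"))
```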

Simple analysis of variance

One-way analysis of variance ( English one-way analysis of variance , one-way ANOVA for short), also called single-factor analysis of variance , examines the influence of an independent variable (factor) with k levels (groups) on the characteristics of a random variable. For this purpose, the mean values of the groups are compared with one another; more precisely, the variance between the groups is compared with the variance within the groups. Because the total variance is composed of these two components, it is called analysis of variance. The one-way analysis of variance is the generalization of the t-test to more than two groups. For k = 2 it is equivalent to the t-test.

Requirements

  • The error components must be normally distributed. Error components denote the respective variances (total, treatment and error variance). The validity of this prerequisite also requires a normal distribution of the measured values ​​in the respective population.
  • The error variances between the groups (i.e. the k factor levels) must be equal or homogeneous ( homoscedasticity ).
  • The measured values ​​or factor levels must be independent of one another.

Example

This form of analysis of variance is indicated if, for example, one wants to examine whether smoking has an influence on aggressiveness. Smoking is the independent variable here, which can be divided into three levels ( factor levels): non-smokers , light smokers and heavy smokers . Aggressiveness, recorded by a questionnaire, is the dependent variable. To carry out the investigation, the test subjects are assigned to the three groups. Then the questionnaire measuring aggressiveness is administered.
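
A minimal sketch of such a one-way design in Python follows; the aggressiveness scores are invented illustrative values, not data from any real study.

```python
# Sketch: one-way ANOVA for the smoking/aggressiveness example (hypothetical scores).
from scipy import stats

non_smokers   = [12, 15, 11, 14, 13, 10, 12, 14]
light_smokers = [14, 16, 15, 17, 13, 15, 16, 14]
heavy_smokers = [18, 17, 19, 16, 20, 18, 17, 19]

# H0: all three group means are equal.
f_stat, p_value = stats.f_oneway(non_smokers, light_smokers, heavy_smokers)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) would indicate that at least two group means differ.
```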

Hypotheses

Let $\mu_i$ be the expected value of the dependent variable in the $i$-th group. The null hypothesis of a one-way analysis of variance is:

$H_0\colon \mu_1 = \mu_2 = \dots = \mu_k$

The alternative hypothesis is:

$H_1\colon \exists\, i, j\colon \mu_i \neq \mu_j$ (at least two group expected values differ)

The null hypothesis therefore states that there is no difference between the expected values of the groups (which correspond to the factor values or factor levels). The alternative hypothesis states that there is a difference between at least two expected values. For example, if we have five factor levels, the alternative hypothesis is supported if at least two of the group means differ. However, three, four or all five expected values can also differ significantly from one another.

If the null hypothesis is rejected, the analysis of variance does not provide any information about how many or between which factor levels there is a difference. We then only know with a certain probability (see level of significance ) that at least two values ​​show a significant difference.

One can now ask whether it would be permissible to carry out individual pairwise comparisons between the mean values using several t-tests. If only two groups (i.e. two mean values) are compared with the analysis of variance, then the t-test and the analysis of variance lead to the same result. However, if there are more than two groups, testing the global null hypothesis of the analysis of variance using pairwise t-tests is not permissible: so-called alpha error accumulation or alpha error inflation occurs. With the help of multiple comparison techniques, after a significant analysis-of-variance result it can be checked between which pairs of mean values the difference or differences lie. Examples of such comparison techniques are the Bonferroni test for the least significant difference and the Scheffé test (see also post-hoc test ). The advantage of these methods is that they take the aspect of alpha error inflation into account.
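
As a minimal sketch of Bonferroni-corrected pairwise comparisons after a significant one-way result, reusing the hypothetical smoking groups from above; the correction is applied by hand here rather than through a dedicated post-hoc routine.

```python
# Sketch: pairwise t-tests with Bonferroni correction (hypothetical data).
from itertools import combinations
from scipy import stats

groups = {
    "non_smokers":   [12, 15, 11, 14, 13, 10, 12, 14],
    "light_smokers": [14, 16, 15, 17, 13, 15, 16, 14],
    "heavy_smokers": [18, 17, 19, 16, 20, 18, 17, 19],
}

pairs = list(combinations(groups, 2))
alpha = 0.05
alpha_corrected = alpha / len(pairs)  # Bonferroni: divide alpha by the number of comparisons

for name_a, name_b in pairs:
    t_stat, p_value = stats.ttest_ind(groups[name_a], groups[name_b])
    significant = p_value < alpha_corrected
    print(f"{name_a} vs {name_b}: t = {t_stat:.2f}, p = {p_value:.4f}, "
          f"significant at corrected alpha {alpha_corrected:.4f}: {significant}")
```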

Basic idea of the calculation

  • When calculating the analysis of variance, one first calculates the total variance observed in all groups. To do this, all the measured values ​​from all groups are summarized and the total mean and the total variance are calculated.
  • Then one would like to determine the variance component of the total variance that is solely due to the factor. If the entire observed variance were due to the factor, then all measured values within a factor level would have to be identical; in that case there would only be differences between the groups. Since all measured values within a group share the same factor value, they would consequently all have to have the same value, because the factor would be the only source generating variance. In practice, however, measured values will also differ within a factor level. These differences within the groups must therefore come from other influences (either chance or so-called interfering variables ).
In order to calculate which part of the variance can be traced back to the levels of the factor, the data are temporarily "idealized", so to speak: the mean value of the respective factor level is assigned to every measured value within that factor level. This makes all values within a factor level equal, and the only differences that remain are those between the factor levels. The variance is now calculated again from this "idealized" data. It characterizes the variance that results from the factor ("variance of the treatments", treatment variance ).
If you divide the variance of the treatments by the total variance, you get the relative proportion of the variance attributable to the factor.
  • There is usually a discrepancy between the total variance and the variance of the treatments - the total variance is greater than the variance of the treatments. The variance that is not due to the factor (the treatment) is called the error variance. This is based either on chance or on other variables that have not been investigated (interfering variables).
The error variance can be calculated by rearranging your data: For each individual measured value, you calculate the deviation from the respective group mean of its factor level. The entire variance is calculated again from this. This then characterizes the error variance.
An important relationship between the components is the additivity of the sums of squares. The sum of squares is the part of the variance formula that appears in the numerator. If the denominator (the number of degrees of freedom ) is omitted when calculating the variance of the treatments, one obtains the treatment sum of squares. The total sum of squares (i.e. the total variance without the denominator) results from the sum of the treatment sum of squares and the residual sum of squares.
  • The final significance test is carried out using an "ordinary" F -test . One can show mathematically that, if the null hypothesis of the analysis of variance is valid, the treatment variance and the error variance must be equal. An F-test can be used to test the null hypothesis that two variances are equal by forming their quotient.
In the case of the analysis of variance, the quotient is formed from the treatment variance divided by the error variance. This quotient is F -distributed with $k - 1$ numerator degrees of freedom and $N - k$ denominator degrees of freedom ($k$ is the number of groups, $N$ is the total number of all test subjects, $n_i$ is the respective number of test subjects per factor level).
In tables of the F distribution one can then look up the corresponding F value with the corresponding degrees of freedom and read off what percentage of the F distribution's density this value "cuts off". For example, if we agree on a significance level of 5% before performing the analysis of variance, then the F value would have to cut off at least 95% of the F distribution on the left. If this is the case, then we have a significant result and can reject the null hypothesis at the 5% level. A small computational sketch of this procedure follows below.
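
To make the procedure concrete, here is a minimal computational sketch with hypothetical values: the treatment, error and total sums of squares are computed by hand, the F statistic is formed from the mean squares, and the result is compared with scipy's built-in one-way ANOVA.

```python
# Sketch: manual variance decomposition for a one-way ANOVA (hypothetical data).
import numpy as np
from scipy import stats

groups = [
    np.array([12.0, 15, 11, 14, 13, 10, 12, 14]),
    np.array([14.0, 16, 15, 17, 13, 15, 16, 14]),
    np.array([18.0, 17, 19, 16, 20, 18, 17, 19]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of factor levels
N = all_values.size      # total number of observations

# Treatment sum of squares: deviations of the group means from the grand mean.
sq_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Error (residual) sum of squares: deviations of the values from their group mean.
sq_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares: deviations of all values from the grand mean.
sq_total = ((all_values - grand_mean) ** 2).sum()
assert np.isclose(sq_total, sq_treatment + sq_error)  # additivity of the sums of squares

# Mean squares and F statistic.
mq_treatment = sq_treatment / (k - 1)
mq_error = sq_error / (N - k)
f_stat = mq_treatment / mq_error
p_value = stats.f.sf(f_stat, k - 1, N - k)  # right-tail probability of the F distribution

print(f"manual: F = {f_stat:.3f}, p = {p_value:.4f}")
print("scipy :", stats.f_oneway(*groups))
```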

Mathematical model

The one-way analysis of variance regards each measured value $y_{ij}$ (the $j$-th observation at factor level $i$, with $i = 1, \dots, k$ and $j = 1, \dots, n_i$) as the sum of a "component independent of the factor effect" $\mu$, the "factor effect" $\alpha_i$ and the experimental error $\varepsilon_{ij}$. Each measured value can thus be generated by the following data-generating process

$y_{ij} = \mu_i + \varepsilon_{ij} = \mu + \alpha_i + \varepsilon_{ij}$.

The second equality results from the fact that the fixed mean value $\mu_i$ that depends on the factor level (the mean of $y_{ij}$ under test condition $i$) can be split into a component $\mu$ that is independent of the factor effect and the factor effect $\alpha_i$ itself. It therefore holds that

$\mu_i = \mu + \alpha_i$.

The experimental error $\varepsilon_{ij}$ is assumed to be normally distributed at every factor level and for every repetition, with an expected value of zero and a homoscedastic unknown error variance $\sigma^2$ (independent of the factor level). This assumption can be interpreted as meaning that the experimental errors balance out on average and that the variability is the same in all groups. It is further assumed that the experimental errors of different repetitions are independent. In summary, one writes for the experimental error: $\varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. The goal is to estimate the model parameters $\mu$, $\alpha_i$ and $\sigma^2$ statistically, i.e. to find point estimators $\hat{\mu}$, $\hat{\alpha}_i$ and $\hat{\sigma}^2$. With the help of a so-called analysis of variance table, the $i$-th factor level mean

$\bar{y}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$

and the $i$-th factor level variance

$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2$

can be calculated. The overall mean represents the mean of the factor level means, weighted with the numbers of cases $n_i$:

$\bar{y} = \frac{1}{N} \sum_{i=1}^{k} n_i \bar{y}_{i\cdot}$,

where $N = \sum_{i=1}^{k} n_i$ represents the total size of the samples over all factor levels. The global expected value or the global mean ( English grand mean ) $\mu$ is set equal to the weighted mean of the level expected values $\mu_i$:

$\mu = \frac{1}{N} \sum_{i=1}^{k} n_i \mu_i$.

An additional condition on the model parameters in order to ensure the identifiability of the regression model is the so-called reparameterization condition , with which a new parameterization is carried out. In the one-way analysis of variance it is

$\sum_{i=1}^{k} n_i \alpha_i = 0$,

i.e. the sum of the factor effects, weighted with the numbers of cases, equals zero. In this case one speaks of effect coding. Using the reparameterization condition, the effects can be estimated uniquely. The global mean $\mu$ is estimated by the overall mean $\bar{y}$, the parameter $\mu_i$ is estimated by the factor level mean $\bar{y}_{i\cdot}$, and the factor effect $\alpha_i$ is estimated by the deviation $\hat{\alpha}_i = \bar{y}_{i\cdot} - \bar{y}$. The respective deviation between the measured value and the estimated value (the residual ) is given by

$\hat{\varepsilon}_{ij} = y_{ij} - \bar{y}_{i\cdot}$.

The residual is thus the deviation of the measured value from the level mean and is an expression of the random variation of the variable at level $i$ of the factor. It can be viewed as a realization of the experimental error $\varepsilon_{ij}$ in the $j$-th repetition at the $i$-th factor level. Each realization of the target variable $y_{ij}$ is made up of the overall mean, the estimated factor effect and the residual:

$y_{ij} = \bar{y} + (\bar{y}_{i\cdot} - \bar{y}) + (y_{ij} - \bar{y}_{i\cdot})$.
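
A minimal sketch of these estimators in Python follows; the data are hypothetical, and the group sizes are deliberately unequal to show the case-weighted overall mean.

```python
# Sketch: point estimators of the one-way ANOVA model (hypothetical data).
import numpy as np

groups = [
    np.array([12.0, 15, 11, 14, 13]),          # factor level 1
    np.array([14.0, 16, 15, 17, 13, 15, 16]),  # factor level 2
    np.array([18.0, 17, 19, 16]),              # factor level 3
]

n_i = np.array([len(g) for g in groups])
level_means = np.array([g.mean() for g in groups])        # estimates of mu_i
grand_mean = (n_i * level_means).sum() / n_i.sum()         # estimate of mu (case-weighted)
effects = level_means - grand_mean                         # estimates of alpha_i
residuals = [g - m for g, m in zip(groups, level_means)]   # estimates of epsilon_ij

# Reparameterization condition: the case-weighted effects sum to zero.
assert np.isclose((n_i * effects).sum(), 0.0)

# Unbiased estimate of the error variance sigma^2 (pooled within-group variance).
N, k = n_i.sum(), len(groups)
sigma2_hat = sum((r ** 2).sum() for r in residuals) / (N - k)
print(level_means, grand_mean, effects, sigma2_hat)
```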

Sums of squares

The "total sum of squares" or " total sum of squares " just SQT ( S umme the Q uadrate the T otalen deviations ), can be broken down into two parts. One part relates to group membership and the other part, the rest, is attributed to chance. The first part, d. H. The “sum of squares due to factor A”, SQA for short , can be expressed as the sum of the squares of the deviations of the mean values from the total mean of the groups. By the regression "unexplained square sum" and the residual sum of squares , short SQR ( S umme the Q uadrate the R estabweichungen (or "residuals")) , which relates to the differences within the groups is expressed as the total deviation from the Mean values ​​in the groups. The following applies:

.

Where:

,
,

and

.

The two sums of squares $SQA$ and $SQR$ are stochastically independent. In the case of groups of equal size, it can be shown that under the null hypothesis the following holds:

$\frac{SQA}{\sigma^2} \sim \chi^2_{k-1}$, i.e. the sum of squares $SQA$ follows a chi-square distribution with $k - 1$ degrees of freedom,

and

$\frac{SQR}{\sigma^2} \sim \chi^2_{N-k}$, i.e. the sum of squares $SQR$ follows a chi-square distribution with $N - k$ degrees of freedom.

Test variable

One usually also defines the " mean squares of deviations" (often incorrectly called mean sums of squares ):

$MQA = \frac{SQA}{k - 1}$,

and

$MQR = \frac{SQR}{N - k}$.

The test variable or the F statistic can thus be defined as follows:

$F = \frac{MQA}{MQR}$.

In the case of groups of equal size, $F$ is, under the null hypothesis, F -distributed with $k - 1$ degrees of freedom in the numerator and $N - k$ degrees of freedom in the denominator.

If the test statistic is significant, at least two groups differ from one another. Post-hoc tests can then be used to determine between which groups the differences lie.

Sample calculation

In the following example, a simple analysis of variance with two groups (also known as a two-sample F-test ) is carried out. In one experiment, two groups of animals each receive a different diet. After a certain time, the weight gain of each animal in Group 1 and Group 2 is measured.

The aim is to investigate whether the different diets have a significant influence on weight. The mean value $\bar{y}_i$ and the variance (here the estimated value, i.e. the empirical variance ) $s_i^2$ of each of the two groups are calculated. From these, the treatment sum of squares $SQA$ and the residual sum of squares $SQR$ can be calculated.
The underlying probability model assumes that the weights of the animals are normally distributed and have the same variance in each group. The null hypothesis to be tested is

$H_0\colon \mu_1 = \mu_2$ : "The mean values of the two groups are equal"

Obviously, the sample means $\bar{y}_1$ and $\bar{y}_2$ differ. However, this deviation could also lie within the range of natural fluctuation. To check whether the difference is significant, the test statistic $F = MQA / MQR$ is calculated.

According to the underlying model, the quantity $F$ is a random variable with an $F_{k-1,\,N-k}$ distribution, where $k$ is the number of groups (factor levels) and $N$ the number of measured values. The indices are called degrees of freedom . The value of the F distribution for given degrees of freedom (the F quantile ) can be looked up in an F table. A desired level of significance (the probability of error) must also be specified. In the present case, the F quantile for a type I error of 5% is 4.41. This means that the null hypothesis cannot be rejected for values of the test statistic up to 4.41. Since the computed test statistic exceeds 4.41, the null hypothesis can be rejected with the present values.

It can therefore be assumed that the animals in the two groups really differ in average weight. The probability of accepting a difference even though none exists is less than 5%.
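
Since the original measured values are not reproduced above, here is an analogous sketch with invented weight gains for two groups; it computes the two-sample F statistic via one-way ANOVA and also illustrates the equivalence with the t-test mentioned earlier ($F = t^2$ for two groups).

```python
# Sketch: two-group analysis of variance vs. t-test (hypothetical weight gains).
from scipy import stats

group_1 = [5.1, 6.0, 5.4, 6.3, 5.8, 5.5, 6.1, 5.7]
group_2 = [6.4, 7.1, 6.8, 7.4, 6.9, 7.2, 6.6, 7.0]

f_stat, p_anova = stats.f_oneway(group_1, group_2)
t_stat, p_ttest = stats.ttest_ind(group_1, group_2)

print(f"ANOVA : F = {f_stat:.3f}, p = {p_anova:.4f}")
print(f"t-test: t = {t_stat:.3f}, t^2 = {t_stat**2:.3f}, p = {p_ttest:.4f}")
# For two groups, F equals t squared and both p-values coincide.
```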

Two-way analysis of variance

The two-way analysis of variance ( English two-way analysis of variance , two-way ANOVA for short), also called two-factor analysis of variance , takes two factors (factor A and factor B) into account in order to explain the target variable.

Example

This form of analysis of variance is indicated, for example, in studies that want to show the influence of smoking and coffee drinking on nervousness. Smoking is factor A here, which can be divided into, e.g., three levels (factor levels): non-smokers , light smokers and chain smokers . Factor B can be the amount of coffee consumed daily, with the following levels: 0 cups, 1-3 cups, 4-8 cups, more than 8 cups. Nervousness is the dependent variable. To carry out the investigation, test persons are divided into 12 groups according to the combinations of factor levels. The measurement of nervousness provides metric data.

Basic idea of the calculation

The model (for the case with fixed effects) in effect representation is:

$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \qquad i = 1, \dots, I, \quad j = 1, \dots, J, \quad k = 1, \dots, K.$

In it are:

$y_{ijk}$ : target variable; assumed to be normally distributed in the groups
$I$ : number of factor levels of the first factor (A)
$J$ : number of factor levels of the second factor (B)
$K$ : number of observations per factor level combination (here the same for all combinations of factor levels)
$\alpha_i$ : effect of the $i$-th factor level of factor A
$\beta_j$ : effect of the $j$-th factor level of factor B
$(\alpha\beta)_{ij}$ : interaction of the factors at the factor level combination $(i, j)$.

The interaction describes a special effect that only occurs when the factor level combination $(i, j)$ is present.

$\varepsilon_{ijk}$ : disturbance variables, independent and normally distributed with expected value $0$ and equal variances $\sigma^2$.


The total sum of squares $SQT$ is broken down into four independent sums of squares (sum-of-squares decomposition):

$SQT = SQA + SQB + SQAB + SQR$

In it are:

$SQT$ : the total sum of squares,
$SQR$ : the residual sum of squares,
$SQAB$ : the sum of squares due to the interaction of A and B,
$SQA$ : the sum of squares due to factor A,
$SQB$ : the sum of squares due to factor B.

The expected values of the sums of squares are:

$E(SQR) = IJ(K - 1)\,\sigma^2$
$E(SQA) = (I - 1)\,\sigma^2 + JK \sum_{i} \alpha_i^2$
$E(SQB) = (J - 1)\,\sigma^2 + IK \sum_{j} \beta_j^2$
$E(SQAB) = (I - 1)(J - 1)\,\sigma^2 + K \sum_{i}\sum_{j} (\alpha\beta)_{ij}^2$

The sums of squares divided by $\sigma^2$ are, under appropriate assumptions, chi-square distributed, namely:

$SQR / \sigma^2$ with $IJ(K - 1)$ degrees of freedom,
$SQA / \sigma^2$ with $I - 1$ degrees of freedom if $\alpha_i = 0$ for all $i$,
$SQB / \sigma^2$ with $J - 1$ degrees of freedom if $\beta_j = 0$ for all $j$,
$SQAB / \sigma^2$ with $(I - 1)(J - 1)$ degrees of freedom if $(\alpha\beta)_{ij} = 0$ for all $i, j$.

The mean squares of deviation result from dividing the sums of squares by their degrees of freedom:

$MQA = \frac{SQA}{I - 1}, \quad MQB = \frac{SQB}{J - 1}, \quad MQAB = \frac{SQAB}{(I - 1)(J - 1)}, \quad MQR = \frac{SQR}{IJ(K - 1)}.$

The relevant test statistics are calculated as the quotients of the mean squares, with $MQR$ in the denominator:

$F_A = \frac{MQA}{MQR}, \quad F_B = \frac{MQB}{MQR}, \quad F_{AB} = \frac{MQAB}{MQR}.$

One now calculates the mean squares for the individual factors and for the interaction of A and B. The hypothesis to be tested is: there is no interaction, i.e. $(\alpha\beta)_{ij} = 0$ for all $i, j$. Again, the hypothesis is tested using a test statistic, here $F_{AB}$, which is the quotient of the mean square of the interaction of A and B and the residual mean square. One then compares $F_{AB}$ with the F quantile after specifying a desired level of significance. If the test statistic is greater than the quantile (which can be read from the relevant tables), then the null hypothesis is rejected, i.e. there is an interaction between the factors A and B. A minimal code sketch of such a two-way analysis follows below.
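
As a minimal sketch of the two-way design, one can fit the model with an interaction term and obtain the ANOVA table directly; the nervousness scores and group sizes below are invented for illustration, and statsmodels is assumed to be available.

```python
# Sketch: two-way ANOVA with interaction for the smoking/coffee example (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
smoking_levels = ["non", "light", "chain"]
coffee_levels = ["0", "1-3", "4-8", ">8"]

rows = []
for s in smoking_levels:
    for c in coffee_levels:
        # Invented cell means plus noise; 5 subjects per factor level combination.
        base = 10 + 2 * smoking_levels.index(s) + 1.5 * coffee_levels.index(c)
        for value in base + rng.normal(0, 1.0, size=5):
            rows.append({"nervousness": value, "smoking": s, "coffee": c})
df = pd.DataFrame(rows)

# 'smoking * coffee' expands to the main effects of A and B plus their interaction.
model = smf.ols("nervousness ~ C(smoking) * C(coffee)", data=df).fit()
print(anova_lm(model, typ=2))  # sums of squares, degrees of freedom, F statistics, p-values
```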

Analysis of variance table

In a practical analysis, the results are summarized in the table of the analysis of variance:

Source of variation | Sum of squares of deviation ( SQ ) | Number of degrees of freedom ( FG ) | Mean square of deviation ( MQ ) | F statistic ( F )
Factor A | $SQA$ | $I - 1$ | $MQA = \frac{SQA}{I-1}$ | $F_A = \frac{MQA}{MQR}$
Factor B | $SQB$ | $J - 1$ | $MQB = \frac{SQB}{J-1}$ | $F_B = \frac{MQB}{MQR}$
Interaction | $SQAB$ | $(I - 1)(J - 1)$ | $MQAB = \frac{SQAB}{(I-1)(J-1)}$ | $F_{AB} = \frac{MQAB}{MQR}$
Residual | $SQR$ | $IJ(K - 1)$ | $MQR = \frac{SQR}{IJ(K-1)}$ |
Total | $SQT$ | $IJK - 1$ | |

Multiple analysis of variance (more than two factors)

Several factors are also possible. This type of analysis of variance is known as multiple analysis of variance or multi-factorial analysis of variance . However, the amount of data required to estimate the model parameters increases sharply with the number of factors. Representations of the model (e.g. in tables) also become less clear as the number of factors increases. Models with more than three factors can only be represented with difficulty.

See also

Literature
