Sample covariance
The sample covariance or empirical covariance (often simply covariance, from the Latin con = "with" and variance, from variare = "to change, to be different") is a non-standardized measure of the (linear) relationship between two statistical variables. The corrected sample covariance is an unbiased estimate of the population covariance based on a sample.
If the covariance is positive, then small values of one variable are predominantly associated with small values of the other variable, and large values with large values. For a negative covariance the opposite holds.
Definition
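If $(x_1, y_1), \dots, (x_n, y_n)$ is a data series (sample) of two statistical variables $X$ and $Y$, then the sample covariance is defined as the "average deviation product"
$$s_{xy} := \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
with the arithmetic means $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ of the data series $x_1, \dots, x_n$ and $y_1, \dots, y_n$.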
The sample covariance measures the joint dispersion ("co-dispersion") of the observations in a sample. It is computed as the mean product of the deviations of the observations from their respective means.
The corrected sample covariance is also often used:
$$s^*_{xy} := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
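The two variants differ only in the prefactor, $\frac{1}{n}$ versus $\frac{1}{n-1}$. The following is a minimal sketch in plain Python; the function name `sample_covariance`, the `corrected` flag, and the small data set are illustrative assumptions, not part of the article.

```python
def sample_covariance(xs, ys, corrected=False):
    """Return the sample covariance of two equally long data series.

    corrected=False divides by n (average deviation product);
    corrected=True divides by n - 1 (corrected sample covariance).
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("both data series must have the same length")
    if n < (2 if corrected else 1):
        raise ValueError("not enough data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from the respective means.
    deviation_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return deviation_products / (n - 1 if corrected else n)


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
print(sample_covariance(xs, ys))                  # factor 1/n
print(sample_covariance(xs, ys, corrected=True))  # factor 1/(n-1)
```

For this data the sum of the deviation products is 11, so the two calls give 11/4 and 11/3 respectively.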
Construction of the covariance
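The blue data point at the top right in the graphic makes a positive contribution to the covariance:
- $\underbrace{(x_i - \bar{x})}_{>0} \cdot \underbrace{(y_i - \bar{y})}_{>0} > 0$.
This applies to all data points in quadrant I, with $x_i > \bar{x}$ and $y_i > \bar{y}$; the quadrants are taken relative to the point of means $(\bar{x}, \bar{y})$. These considerations can be continued analogously for the data points in the other quadrants: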
- Data points in quadrant I: positive contribution to covariance,
- Data points in quadrant II: negative contribution to covariance,
- Data points in quadrant III: positive contribution to covariance and
- Data points in quadrant IV: negative contribution to covariance.
If there is a "positive" relationship between the data points, then most of the data points (as in the example on the right) will lie in quadrants I and III and make many positive contributions to the covariance. Although the few data points in quadrants II and IV make negative contributions, the positive contributions will predominate, i.e. the covariance is positive. If there is a "negative" relationship, the same reasoning shows that the covariance is negative.
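The quadrant argument can be checked numerically by summing the deviation products separately for each quadrant around the point of means. This is a sketch under assumptions: the helper `quadrant_contributions` and the six data points are hypothetical and only serve as an illustration.

```python
def quadrant_contributions(xs, ys):
    """Sum the deviation products per quadrant around (mean_x, mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    totals = {"I": 0.0, "II": 0.0, "III": 0.0, "IV": 0.0}
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        if dx > 0 and dy > 0:
            quadrant = "I"
        elif dx < 0 and dy > 0:
            quadrant = "II"
        elif dx < 0 and dy < 0:
            quadrant = "III"
        elif dx > 0 and dy < 0:
            quadrant = "IV"
        else:
            continue  # point lies on an axis through the means: contributes 0
        totals[quadrant] += dx * dy
    return totals


# Hypothetical data with a mostly "positive" relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
totals = quadrant_contributions(xs, ys)
print(totals)
print(sum(totals.values()) / len(xs))  # sample covariance with factor 1/n
```

Here quadrants I and III each contribute 7.5, while quadrants II and IV each contribute -0.25, so the positive contributions dominate and the covariance is positive.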
Corrected sample covariance
To obtain an estimate of the unknown covariance $\sigma_{XY}$ of the population from a sample, the corrected sample covariance is used:
$$s^*_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
For a simple random sample, the sample variables $X_i$ and $Y_i$ have covariance $\operatorname{Cov}(X_i, Y_i) = \sigma_{XY}$. Assuming a bivariate normal distribution of the sample variables and using the maximum likelihood method, one obtains the estimator
- $S_{XY} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$.
However, its expectation is $\operatorname{E}(S_{XY}) = \frac{n-1}{n} \sigma_{XY}$, i.e. the estimator is biased for $\sigma_{XY}$.
The corrected sample covariance $S^*_{XY} = \frac{n}{n-1} S_{XY}$, by contrast, is unbiased. In inferential statistics, the corrected sample covariance is therefore always used.
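The bias of the uncorrected estimator can be made visible with a small simulation. The sketch below is an illustration under assumptions: it draws samples of size $n = 5$ from a bivariate normal model with true covariance 1 (constructed as $Y = X + \varepsilon$ with independent standard normal $X$ and $\varepsilon$), and the function names, seed, and repetition count are arbitrary choices.

```python
import random


def covariance(xs, ys, corrected):
    """Sample covariance with factor 1/(n-1) if corrected, else 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if corrected else n)


def average_estimates(n=5, repetitions=20_000, seed=0):
    """Average both estimators over many simulated samples of size n."""
    rng = random.Random(seed)
    sum_uncorrected = 0.0
    sum_corrected = 0.0
    for _ in range(repetitions):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Y = X + noise, so the true covariance of X and Y is Var(X) = 1.
        ys = [x + rng.gauss(0.0, 1.0) for x in xs]
        sum_uncorrected += covariance(xs, ys, corrected=False)
        sum_corrected += covariance(xs, ys, corrected=True)
    return sum_uncorrected / repetitions, sum_corrected / repetitions


mean_uncorrected, mean_corrected = average_estimates()
print(mean_uncorrected, mean_corrected)
```

Under these assumptions the first average should settle near $\frac{n-1}{n} \cdot 1 = 0.8$ and the second near the true value 1, up to simulation noise.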
Sample covariance vs. Corrected sample covariance
In the context of descriptive statistics, the question arises whether it is better to use the factor $\frac{1}{n}$ or $\frac{1}{n-1}$. In general, this depends on the goal of the analysis (or on the properties of the sample).
- If the goal is to estimate the covariance of a population, then the factor $\frac{1}{n-1}$ must be used because of its unbiasedness. It must, however, be possible to draw conclusions about the population, e.g. the sample should be a simple random sample.
- If the goal is only to describe the data, then either $\frac{1}{n}$ or $\frac{1}{n-1}$ can be used. This is the case, for example, when inferences about the population are not wanted or not possible. The user then has to decide which property matters more: the possible inference about the population (with $\frac{1}{n-1}$) or the interpretation as a mean deviation product (with $\frac{1}{n}$).
In the case of large sample sizes, the difference between $s_{xy}$ and $s^*_{xy}$ is small anyway, so that the above consideration only matters for small sample sizes.
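How quickly the two variants converge follows from their ratio: the corrected value equals the uncorrected one multiplied by $\frac{n}{n-1}$. A short sketch (the function name `correction_ratio` is a hypothetical helper) tabulates this factor for a few sample sizes.

```python
def correction_ratio(n):
    """Ratio of corrected to uncorrected sample covariance for sample size n."""
    if n < 2:
        raise ValueError("the corrected sample covariance needs n >= 2")
    return n / (n - 1)


for n in (2, 5, 10, 100, 1000):
    print(n, round(correction_ratio(n), 4))
```

For $n = 5$ the corrected value is 25 % larger, whereas for $n = 1000$ the difference is about 0.1 %.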
Properties
The following properties apply to both sample covariance and corrected sample covariance.
Interpretation of the covariance
- The covariance is positive if $x$ and $y$ tend to have a linear relationship in the same direction, i.e. high values of $x$ are associated with high values of $y$, and low values with low values.
- The covariance, on the other hand, is negative if $x$ and $y$ have an opposing linear relationship, i.e. high values of one variable are associated with low values of the other variable.
- If the result is 0, there is no linear relationship between the two variables $x$ and $y$ (non-linear relationships are still possible).
The covariance indicates the direction of a relationship between two variables, but it says nothing about the strength of the relationship, because, owing to its linearity, its magnitude depends on the units in which the variables are measured. To make relationships comparable, the covariance must be normalized. The most common normalization, dividing by the product of the standard deviations, leads to the correlation coefficient.
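The normalization can be sketched as follows: dividing the covariance by the product of the standard deviations gives a value that no longer changes when a variable is rescaled. The function names and the height/weight figures below are hypothetical and serve only as an illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance with factor 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: the same heights in metres and in centimetres.
heights_m = [1.60, 1.70, 1.80, 1.90]
heights_cm = [100.0 * h for h in heights_m]
weights_kg = [55.0, 68.0, 72.0, 85.0]

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

The covariance grows by the factor 100 when the unit changes from metres to centimetres, while the correlation coefficient stays the same. The prefactor cancels in the ratio, so the same correlation results with $\frac{1}{n-1}$ throughout.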
Relationship to variance
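The covariance is an extension of the variance, because
- $s_{xx} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = s_x^2$ and, respectively,
- $s^*_{xx} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = s_x^{*2}$,

where $s_x^2$ and $s_x^{*2}$ are the empirical variances with the corresponding prefactor. That is, the variance is the covariance of a variable with itself.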
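This identity is easy to verify against the variance functions of Python's standard library, where `statistics.pvariance` uses the factor $\frac{1}{n}$ and `statistics.variance` uses $\frac{1}{n-1}$. The `covariance` helper and the data are illustrative assumptions.

```python
import statistics


def covariance(xs, ys, corrected=False):
    """Sample covariance with factor 1/(n-1) if corrected, else 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if corrected else n)


xs = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical data

# Covariance of a variable with itself equals its variance (same prefactor).
print(covariance(xs, xs), statistics.pvariance(xs))
print(covariance(xs, xs, corrected=True), statistics.variance(xs))
```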
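Shift theorem
The shift theorem provides an alternative representation of the covariance:
$$s_{xy} = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} \quad \text{and} \quad s^*_{xy} = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y} \right).$$
In many cases, these formulas make it easier to calculate the covariance. In numerical calculations, however, care must be taken to avoid catastrophic cancellation (loss of significant digits) when subtracting large numbers of similar size.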
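The numerical caveat can be illustrated by comparing the definition (two passes over the data) with the shift-theorem form in double-precision arithmetic. This is a sketch with hypothetical function names and data; the offset of $10^9$ is chosen so that the products $x_i y_i$ exceed the range in which doubles can resolve small differences.

```python
def covariance_two_pass(xs, ys):
    """Definition: mean product of deviations from the means (factor 1/n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covariance_shift(xs, ys):
    """Shift theorem: mean of the products minus the product of the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
print(covariance_two_pass(xs, ys), covariance_shift(xs, ys))  # both agree here

# Adding a constant does not change the true covariance, but it makes the
# two terms of the shift formula huge and nearly equal.
offset = 1e9
xs_big = [x + offset for x in xs]
ys_big = [y + offset for y in ys]
print(covariance_two_pass(xs_big, ys_big))  # still accurate
print(covariance_shift(xs_big, ys_big))     # dominated by rounding error
```

For the shifted data the terms being subtracted are of order $10^{18}$, where neighbouring doubles are 128 apart, so the shift formula cannot reproduce the true value of 2.75.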
Symmetry and linearity
The covariance is linear and symmetric, i.e. the following applies:
- Symmetry
- Swapping the roles of $x$ and $y$ results in the same value for the covariance:
- $s_{xy} = s_{yx}$ and, respectively, $s^*_{xy} = s^*_{yx}$.
- Linearity
- If one of the variables is subjected to a linear transformation, e.g. $a + b\,x_i$, then
- $s_{a+bx,\,y} = b\, s_{xy}$ and, respectively, $s^*_{a+bx,\,y} = b\, s^*_{xy}$.
- Because of the symmetry, the covariance is also linear in the second argument.
The linearity of the covariance means that the covariance depends on the unit of measurement of the variables. For example, the covariance becomes ten times as large if the variable $10x$ is considered instead of $x$. Since this property makes the absolute values of the covariance difficult to interpret, the scale-independent correlation coefficient is often considered instead.
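Symmetry and the effect of a linear transformation $a + b\,x_i$ can be checked directly. The sketch below uses a hypothetical `covariance` helper with factor $\frac{1}{n}$ and made-up data; the values $a = 3$ and $b = 10$ are arbitrary.

```python
def covariance(xs, ys):
    """Sample covariance with factor 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
a, b = 3.0, 10.0
transformed = [a + b * x for x in xs]

print(covariance(xs, ys), covariance(ys, xs))        # symmetry
print(covariance(transformed, ys), b * covariance(xs, ys))  # scaled by b, a drops out
print(covariance(xs, [a + b * y for y in ys]))       # same in the second argument
```

The additive constant $a$ has no effect, while the factor $b$ multiplies the covariance, which is why a change of units changes its magnitude.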
Examples
1.) The following graphic shows the scatter diagram for 21 different data sets together with the covariance and correlation of each data set. The first row shows seven data sets with varying degrees of linear relationship, with the correlation going from +1 to 0 to −1. Since the covariance is a non-standardized measure, it goes from +2 to zero to −2. That is, if there is no linear relationship, then the covariance is zero, just like the correlation. The sign of the covariance indicates the direction of the relationship; however, it does not show the strength of the relationship.
This becomes even clearer in the second row, where all seven data sets have a perfect linear relationship. Nevertheless, the covariance decreases to zero and then becomes negative. The correlation for these data sets is either +1 or −1 (or undefined). Finally, the third row shows that both the covariance and the correlation are zero, although there is a clear relationship between the two variables. This means that the covariance only measures the linear relationship, and non-linear relationships are not detected.
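The last point, a clear but non-linear relationship with zero covariance, can be reproduced with a simple constructed case: $y = x^2$ on values of $x$ placed symmetrically around zero. This is an illustrative sketch with made-up data, not one of the data sets from the graphic.

```python
def covariance(xs, ys):
    """Sample covariance with factor 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]  # y is completely determined by x

print(covariance(xs, ys))  # the positive and negative contributions cancel
```

Although $y$ is an exact function of $x$, the deviation products on the left and right of the mean cancel pairwise, so the covariance is zero.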
2.) In a school, it is to be checked whether there is a relationship between the number of hours taught by the teachers per day and the number of cups of coffee they drink. Ten data pairs were collected and evaluated (the data are invented purely for illustration):
| Teacher no. $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Number of hours ($x_i$) | 5 | 6 | 8 | 4 | 6 | 6 | 5 | 7 | 5 | 4 |
| Number of cups ($y_i$) | 2 | 1 | 4 | 1 | 2 | 0 | 2 | 3 | 3 | 1 |
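The covariance is now calculated as follows:

a.) First, the arithmetic mean of both variables is determined: $\bar{x} = \frac{56}{10} = 5.6$ and $\bar{y} = \frac{19}{10} = 1.9$.

b.) The covariance is now calculated using:
$$s_{xy} = \frac{1}{10} \left[ (5 - 5.6)(2 - 1.9) + (6 - 5.6)(1 - 1.9) + \dots + (4 - 5.6)(1 - 1.9) \right] = \frac{7.6}{10} = 0.76$$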
Since the covariance is greater than zero, a positive relationship between the number of hours taught and the number of cups of coffee can be seen for this sample. Whether this can be generalized to the population, here the teaching staff, depends on the quality of the sample.
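The calculation for the ten teachers can be reproduced with a few lines of plain Python. The variable and function names are illustrative; the data are the ten pairs from the table.

```python
hours = [5, 6, 8, 4, 6, 6, 5, 7, 5, 4]
cups = [2, 1, 4, 1, 2, 0, 2, 3, 3, 1]


def covariance(xs, ys, corrected=False):
    """Sample covariance with factor 1/(n-1) if corrected, else 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if corrected else n)


mean_hours = sum(hours) / len(hours)  # step a.): arithmetic means
mean_cups = sum(cups) / len(cups)
print(mean_hours, mean_cups)

print(round(covariance(hours, cups), 4))                  # step b.) with factor 1/n
print(round(covariance(hours, cups, corrected=True), 4))  # factor 1/(n-1)
```

The sum of the deviation products is 7.6, which gives 0.76 with the factor $\frac{1}{10}$ and about 0.844 with the factor $\frac{1}{9}$; the sign, and hence the direction of the relationship, is the same in both cases.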




