Rank correlation coefficient
A rank correlation coefficient is a parameter-free measure of correlation: it measures how well an arbitrary monotone function can describe the relationship between two variables, without making any assumptions about the probability distributions of the variables. The eponymous property of these measures is that they take into account only the ranks of the observed values, i.e. only their positions in an ordered list.
There are two well-known rank correlation coefficients: Spearman's rank correlation coefficient (Spearman's rho) and Kendall's rank correlation coefficient (Kendall's tau). To determine the agreement between several observers (interrater reliability) at the ordinal scale level, however, one uses the concordance coefficient W, also known as Kendall's coefficient of concordance, which is related to the rank correlation coefficients and named after the statistician Maurice George Kendall (1907-1983).
We start with n pairs of measurements (x_i, y_i). The idea of nonparametric correlation is to replace the value of each measurement x_i by its rank relative to all the other x's in the sample, i.e. by a number 1, 2, ..., n. After this operation the values come from a well-known distribution, namely a uniform distribution over the integers between 1 and n. If the x_i are all different, each rank occurs exactly once. If some x_i have identical values, they are each assigned the mean of the ranks they would have received had they been slightly different; in this case one speaks of ties. This mean rank is sometimes a whole number, sometimes a "half" rank. In all cases the sum of all assigned ranks equals the sum of the numbers from 1 to n, namely n(n + 1)/2.
Then exactly the same procedure is carried out with the y_i, and each value is replaced by its rank among all the y's.
Information is lost when interval-scaled measured values are replaced by their ranks. Nevertheless, rank correlation can be worthwhile even for interval-scaled data, since a nonparametric correlation is more robust than the linear correlation and more resistant to gross errors and outliers in the data, just as the median is more robust than the mean. If the data consist only of rankings, i.e. data at the ordinal level, there is in any case no alternative to rank correlations.
Spearman's rank correlation coefficient
Spearman's rank correlation coefficient is named after Charles Spearman and is often denoted by the Greek letter ρ (rho) or, to distinguish it from Pearson's product-moment correlation coefficient, by r_s.
Spearman's rank correlation coefficient for random variables
For a random vector (X, Y) with continuous marginal distribution functions F_X and F_Y, Spearman's rank correlation coefficient is defined as

ρ_S(X, Y) := ρ(F_X(X), F_Y(Y)),

where ρ is the usual Pearson correlation coefficient.
Note that the value of ρ_S is independent of the concrete (marginal) distribution functions F_X and F_Y. In fact, this stochastic rank correlation coefficient depends only on the copula underlying the random vector (X, Y). Another advantage over Pearson's correlation coefficient is that ρ_S always exists, because F_X(X) and F_Y(Y) are bounded and hence square-integrable.
Independence from the marginal distributions
The fact that Spearman's rank correlation coefficient is not influenced by the marginal distributions of the random vector can be illustrated as follows. According to Sklar's theorem there is, for a random vector (X, Y) with joint distribution function H and continuous univariate marginal distribution functions F_X and F_Y, a unique copula C such that:

H(x, y) = C(F_X(x), F_Y(y)).
Now the random vector (X, Y) is transformed to the random vector (F_X(X), F_Y(Y)). Since copulas are invariant under strictly monotonically increasing transformations, and because F_X and F_Y are continuous, (F_X(X), F_Y(Y)) has the same copula C as (X, Y). In addition, the marginal distributions of (F_X(X), F_Y(Y)) are uniformly distributed on [0, 1]:

P(F_X(X) <= u) = u and P(F_Y(Y) <= v) = v for all u, v in [0, 1].
From these two observations it follows that ρ_S depends on the copula of (X, Y) but not on its marginal distributions.
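This invariance can be checked numerically: applying strictly increasing transformations to the data leaves all ranks, and hence the rank correlation, unchanged. The following is a minimal pure-Python sketch; `ranks` and `spearman` are illustrative helper names, not standard library functions.

```python
import math
import random

def ranks(values):
    """Assign ranks 1..n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(50)]
y = [xi + random.gauss(0, 1) for xi in x]

r1 = spearman(x, y)
# exp and the cube are strictly increasing, so the ranks do not change:
r2 = spearman([math.exp(v) for v in x], [v ** 3 for v in y])
```

Since the transformed series have exactly the same ranks, r1 and r2 agree exactly, illustrating that only the ordering (the copula) matters.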
Empirical Spearman's rank correlation coefficient
In principle, r_s is a special case of Pearson's product-moment correlation coefficient in which the data are converted to ranks before the correlation coefficient is calculated:

r_s = cov(rg_x, rg_y) / (s_rg_x * s_rg_y),

where
- rg(x_i) is the rank of x_i,
- the mean of the ranks of the x_i enters through the covariance and standard deviations,
- s_rg_x and s_rg_y are the standard deviations of the ranks of the x_i and y_i, and
- cov(rg_x, rg_y) is the sample covariance of the rank series of x and y.
In practice a simpler formula is usually used to calculate r_s, but it is only correct if all n ranks occur exactly once. Given two metric characteristics x and y and the associated samples x_1, ..., x_n and y_1, ..., y_n, ranking the values yields the rank series rg(x_1), ..., rg(x_n) and rg(y_1), ..., rg(y_n). If the x- and y-series are connected in such a way that the smallest values, the second-smallest values, and so on correspond to one another, then rg(x_i) = rg(y_i) for all i; that is, the two rankings are identical. If the pairs of rank numbers are plotted as points in the plane, rg(x_i) horizontally and rg(y_i) vertically, the points lie on a straight line with slope 1. In this case one speaks of a perfect positive rank correlation, to which the maximum correlation value r_s = 1 is assigned. To capture the deviation from a perfect positive rank correlation, following Spearman one forms the sum of squares of the rank differences d_i = rg(x_i) - rg(y_i). Spearman's rank correlation coefficient is then given by:

r_s = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)).
If all n ranks are different, this simple formula gives exactly the same result as the general formula.
If there are identical values of x or y (i.e. ties), the formula becomes a little more complicated. But as long as not too many values are tied, the deviations are only small:

r_s = [ (n^3 - n)/6 - T_x - T_y - sum(d_i^2) ] / sqrt( [ (n^3 - n)/6 - 2*T_x ] * [ (n^3 - n)/6 - 2*T_y ] )

with T = (1/12) * sum_j (t_j^3 - t_j), where t_j is the number of observations sharing rank j and T stands for either T_x or T_y.
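The tie-corrected formula can be sketched in a few lines of Python. This is an illustrative implementation under the definitions above (function names such as `spearman_horn` are not from any standard library); for untied data it reduces to the simple formula.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def tie_term(rank_series):
    """T = (1/12) * sum(t^3 - t) over groups of tied ranks."""
    return sum(t ** 3 - t for t in Counter(rank_series).values()) / 12

def spearman_horn(x, y):
    """Spearman's r_s with the tie (Horn) correction."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    tx, ty = tie_term(rx), tie_term(ry)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    base = (n ** 3 - n) / 6
    return (base - tx - ty - d2) / math.sqrt((base - 2 * tx) * (base - 2 * ty))

r = spearman_horn([1, 2, 2, 4, 5], [2, 1, 3, 3, 5])  # one tie group in each series
```

For these data the ranks are [1, 2.5, 2.5, 4, 5] and [2, 1, 3.5, 3.5, 5], giving T_x = T_y = 0.5, sum(d_i^2) = 4.5, and r = 14.5/19 ≈ 0.763, identical to Pearson's coefficient computed on the ranks.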
As an example, the height and weight of three people are examined. The measured pairs are (175 cm, 65 kg), (178 cm, 70 kg) and (190 cm, 98 kg).
In this example there is the maximum rank correlation r_s = 1: the data series is ranked by height, and the height ranks coincide with the weight ranks. A low rank correlation exists if, for example, height increases over the course of the data series while weight decreases; then one cannot say "the tallest person is the heaviest". The rank correlation coefficient is the numerical expression of the agreement between two rankings.
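For this small example without ties, the simple formula can be evaluated directly. The following sketch assumes distinct values in each series (so ranks are just positions in the sorted order); `simple_ranks` is an illustrative helper, not a library function.

```python
heights = [175, 178, 190]  # cm
weights = [65, 70, 98]     # kg

def simple_ranks(values):
    # valid only when all values are distinct
    s = sorted(values)
    return [s.index(v) + 1 for v in values]

n = len(heights)
d2 = sum((rx - ry) ** 2
         for rx, ry in zip(simple_ranks(heights), simple_ranks(weights)))
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(r_s)  # 1.0 -- perfect positive rank correlation
```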
There are eight observations of two variables a and b:
To determine the ranks for the observations of b, the procedure is as follows: first sort by value, then assign ranks (i.e. renumber) and normalize, i.e. compute the mean rank where values are equal. Finally, restore the input order so that the rank differences can then be formed.
|Input||Sort by value||Assign rank||Sort by index|
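The procedure just described (sort by value, assign average ranks to equal values, restore the input order) can be sketched as follows; `average_ranks` is an illustrative helper name, not a standard library function.

```python
from itertools import groupby

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    indexed = sorted(enumerate(values), key=lambda p: p[1])  # sort by value
    result = [0.0] * len(values)
    pos = 1
    for _, group in groupby(indexed, key=lambda p: p[1]):
        group = list(group)
        mean_rank = pos + (len(group) - 1) / 2  # mean of pos..pos+len-1
        for original_index, _ in group:
            result[original_index] = mean_rank   # restore input order
        pos += len(group)
    return result

print(average_ranks([2, 3, 3, 5]))  # [1.0, 2.5, 2.5, 4.0]
```

The two tied values 3 would have received ranks 2 and 3, so each gets the mean rank 2.5.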
The following interim calculation results from the two data series a and b:
|Values of a||Values of b||Rank of a||Rank of b|
The table is sorted by variable a. Note that individual values can share a rank: in series a there are two values "3", and each receives the average rank (2 + 3)/2 = 2.5. The same happens in series b.
|Values of a||Values of b|
Applying the Horn correction for ties finally yields:
Determination of significance
The modern approach to testing whether the observed value of r_s differs significantly from zero is a permutation test: one calculates the probability that, under the null hypothesis, a value of r_s greater than or equal to the observed one occurs for the permuted data.
This approach is superior to traditional methods whenever the data set is not too large to generate the necessary permutations, and whenever it is clear how to generate meaningful permutations under the null hypothesis for the given application (which is usually quite simple).
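A permutation test along these lines can be sketched as follows. For brevity the sketch assumes distinct values (no ties) and samples random permutations rather than enumerating all n! of them; the function names are illustrative, not from a library.

```python
import random

def spearman_no_ties(x, y):
    """Simple r_s formula; valid only when all values in x and in y are distinct."""
    def rk(v):
        s = sorted(v)
        return [s.index(t) + 1 for t in v]
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rk(x), rk(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def permutation_pvalue(x, y, n_perm=10_000, seed=1):
    """Two-sided p-value: fraction of random permutations of y whose |r_s|
    is at least as large as the observed |r_s|."""
    observed = abs(spearman_no_ties(x, y))
    rng = random.Random(seed)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(spearman_no_ties(x, y_perm)) >= observed - 1e-12:
            hits += 1
    return hits / n_perm

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]  # strongly, but not perfectly, monotone
p = permutation_pvalue(x, y)
```

Here the observed r_s is about 0.905, and only a tiny fraction of permutations reach that magnitude, so the correlation is judged significant.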
Kendall's tau
In contrast to Spearman's r_s, Kendall's τ uses only the relative ordering of the ranks (greater, equal, smaller), not the numerical differences between them. As a rule, the value of Kendall's τ is somewhat smaller than the value of Spearman's r_s. τ is also useful for interval-scaled data when the data are not normally distributed, the scales have unequal divisions, or the sample sizes are very small.
Kendall's tau for random variables
Let (X, Y) be a bivariate random vector with copula C and marginal distribution functions F_X and F_Y. By Sklar's theorem, (X, Y) thus has the joint distribution function C(F_X(x), F_Y(y)). Kendall's tau for the random vector (X, Y) is then defined as:

τ(X, Y) := 4 * ∫∫_{[0,1]^2} C(u, v) dC(u, v) - 1.
Note that τ(X, Y) is independent of the marginal distributions of the random vector (X, Y); its value depends only on the copula C.
Empirical Kendall's tau
To compute the empirical τ, consider all pairs of observations (x_i, y_i) and (x_j, y_j) with i < j. In a sample of size n there are n(n - 1)/2 such pairs.
Then pair 1 is compared with all of the following pairs (2, ..., n), pair 2 with all of the following pairs (3, ..., n), and so on. For a pair of observations the following applies:
- if x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j, the pair is called concordant (in agreement),
- if x_i < x_j and y_i > y_j, or x_i > x_j and y_i < y_j, the pair is called discordant (in disagreement),
- if x_i = x_j and y_i ≠ y_j, there is a tie in x,
- if x_i ≠ x_j and y_i = y_j, there is a tie in y, and
- if x_i = x_j and y_i = y_j, there is a tie in both x and y.
The number of pairs that
- are concordant is denoted by C,
- are discordant by D,
- have a tie in x by T_x,
- have a tie in y by T_y, and
- have a tie in both x and y by T_xy.
Kendall's τ now compares the numbers of concordant and discordant pairs:

τ = (C - D) / sqrt( (C + D + T_x) * (C + D + T_y) ).
If Kendall's τ is positive, there are more concordant than discordant pairs, i.e. it is likely that if x_i < x_j then also y_i < y_j. If Kendall's τ is negative, there are more discordant than concordant pairs, i.e. it is likely that if x_i < x_j then y_i > y_j. The denominator normalizes Kendall's τ so that:

-1 <= τ <= 1.
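The pairwise counting of C, D, T_x and T_y described above can be sketched directly in Python (an O(n^2) illustration; `kendall_tau_b` is an illustrative name for this tie-adjusted variant):

```python
import math

def kendall_tau_b(x, y):
    """Count concordant (C), discordant (D) and tied pairs, then return
    tau = (C - D) / sqrt((C + D + T_x) * (C + D + T_y))."""
    n = len(x)
    C = D = Tx = Ty = Txy = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = (x[i] > x[j]) - (x[i] < x[j])  # sign of the x difference
            dy = (y[i] > y[j]) - (y[i] < y[j])  # sign of the y difference
            if dx == 0 and dy == 0:
                Txy += 1          # tie in both x and y (not used in tau)
            elif dx == 0:
                Tx += 1           # tie in x only
            elif dy == 0:
                Ty += 1           # tie in y only
            elif dx == dy:
                C += 1            # concordant pair
            else:
                D += 1            # discordant pair
    return (C - D) / math.sqrt((C + D + Tx) * (C + D + Ty))

tau = kendall_tau_b([1, 2, 3, 4, 5], [1, 2, 3, 5, 4])  # 9 concordant, 1 discordant
```

For this example there are no ties, C = 9 and D = 1, so τ = 8/10 = 0.8.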
Test of Kendall's tau
Regarding τ as a random variable, Kendall found that under the null hypothesis H_0: τ = 0 the test statistic

z = 3τ * sqrt(n(n - 1)) / sqrt(2(2n + 5))

is approximately standard normally distributed. In addition to this approximate test, an exact permutation test can also be carried out.
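The normal approximation uses Var(τ) = 2(2n + 5) / (9n(n - 1)) under the null hypothesis; a minimal sketch (illustrative function names, two-sided p-value via the standard normal):

```python
import math

def kendall_z(tau, n):
    """z statistic for H0: tau = 0, using Var(tau) = 2(2n+5) / (9n(n-1))."""
    return 3 * tau * math.sqrt(n * (n - 1)) / math.sqrt(2 * (2 * n + 5))

def p_two_sided(z):
    """Two-sided p-value P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

z = kendall_z(0.8, 10)  # about 3.22
p = p_two_sided(z)
```

With τ = 0.8 from n = 10 observations, z is about 3.22 and the two-sided p-value about 0.0013, so the null hypothesis would be rejected at the usual levels (keeping in mind that the approximation is rough for such small n).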
More τ coefficients
With the above definitions, Kendall defined three coefficients in total:
- τ_a = (C - D) / (n(n - 1)/2),
- τ_b = (C - D) / sqrt((C + D + T_x)(C + D + T_y)) (see above), and
- τ_c = 2 min(r, c) (C - D) / (n^2 (min(r, c) - 1)) for an r x c contingency table.
Kendall's τ_a can only be applied to data without ties. Kendall's τ_b does not reach the extreme values +1 or -1 on non-square contingency tables, and it ignores ties in both x and y because T_xy does not appear in it. For four-field (2x2) tables, τ_b is identical to the four-field coefficient φ (phi) and, if the values of the two dichotomous variables are coded 0 and 1, also to Pearson's correlation coefficient.
Tetrachoric and polychoric correlation
In connection with Likert items, the tetrachoric (for two binary variables) or polychoric correlation is often calculated. The assumption is that, for example, for a question with the answer format (does not apply at all, ..., applies completely), the respondents would actually have answered in a metric sense but were forced by the answer format to choose one of the alternatives.
This means that behind the observed ordinal variables there are unobserved, interval-scaled variables. The correlation between these unobserved variables is called the tetrachoric or polychoric correlation.
The use of the tetrachoric or polychoric correlation for Likert items is recommended when the number of categories of the observed variables is less than seven. In practice the Bravais-Pearson correlation coefficient is often used instead to calculate the correlation, but it can be shown that this underestimates the true correlation.
Estimation methods for the tetrachoric or polychoric correlation
Assuming that the unobserved variables have a bivariate normal distribution in pairs , one can estimate the correlation between the unobserved variables with the help of the maximum likelihood method . There are two ways to do this:
- First, the interval boundaries of the categories are estimated for each unobserved variable (assuming a univariate normal distribution for the respective unobserved variable). In a second step, only the correlation is then estimated by maximum likelihood, with the interval boundaries fixed at their previously estimated values (two-step method).
- Both the unknown interval boundaries and the unknown correlation enter the maximum likelihood function as parameters and are then estimated in a single step (one-step method).
Approximation formula for the tetrachoric correlation
For two binary variables, the 2x2 contingency table with cell frequencies a, b, c, d (a and d on the "agreeing" diagonal, b and c off the diagonal) can be used to give an approximation formula for the tetrachoric correlation:

r_tet ≈ cos( π / (1 + sqrt(ad / (bc))) ).

This yields a correlation of +1 if and only if b = 0 or c = 0; accordingly, it yields a correlation of -1 if and only if a = 0 or d = 0.
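The cosine approximation is straightforward to evaluate; the following sketch (with an illustrative function name and the cell convention just described) handles the degenerate cells explicitly:

```python
import math

def tetrachoric_approx(a, b, c, d):
    """Cosine approximation to the tetrachoric correlation for a 2x2 table.
    a, d are the 'agreeing' diagonal cells; b, c the off-diagonal cells."""
    if b == 0 or c == 0:
        return 1.0   # sqrt term grows without bound, cos(0) = 1
    if a == 0 or d == 0:
        return -1.0  # sqrt term is 0, cos(pi) = -1
    return math.cos(math.pi / (1 + math.sqrt(a * d / (b * c))))

r = tetrachoric_approx(40, 10, 10, 40)  # cos(pi/5), about 0.809
```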
- Fahrmeir et al.: Statistics. 2004, p. 142.
- Werner Timischl: Applied Statistics. An Introduction for Biologists and Medical Professionals. 3rd edition. 2013, p. 303.
- D. Horn: A correction for the effect of tied ranks on the value of the rank difference correlation coefficient. In: Educational and Psychological Measurement, 3, 1942, pp. 686-690.
- D. J. Bartholomew, F. Steele, J. I. Galbraith, I. Moustaki: The Analysis and Interpretation of Multivariate Data for Social Scientists. Chapman & Hall/CRC, 2002.
- K. G. Jöreskog, D. Sorbom: PRELIS, a program for multivariate data screening and data summarization. Scientific Software, Mooresville 1988.