Similarity analysis

In statistics, especially multivariate statistics, one is interested in measuring the similarity between different objects and defines similarity and distance measures for this purpose. These are not measures in the mathematical (measure-theoretic) sense; the term refers only to the measurement of a certain quantity.

Typically, similarity measures are used for nominally or ordinally scaled variables, and distance measures are used for metrically scaled variables (i.e. for interval and ratio scales).

Similarity measure

Definition

Let $I = \{1, 2, \dots, N\}$ be a finite set. A function $s \colon I \times I \rightarrow \mathbb{R}$ is called a similarity measure or similarity function if the following holds for all $i, j \in I$:

$s(i,j) = s(j,i)$ and
$s(i,i) \geq s(i,j)$.

In addition, it is often required that the following holds for all $i, j \in I$:

$s(i,j) \geq 0$ and $s(i,i) = 1$.

The function values $s(i,j)$ can be arranged in a symmetric $N \times N$ matrix $\left(s(i,j)\right)_{i,j}$. This matrix is called the similarity matrix. In this context, $s(i,j)$ is also referred to as the similarity coefficient.
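
As a minimal sketch of the definition above, the following Python function checks whether a square matrix, given as a list of lists, satisfies the two required conditions and, optionally, the two additional ones. The function name `is_similarity_matrix`, the parameter `strict` and the example values are illustrative assumptions, not part of any library.

```python
def is_similarity_matrix(s, strict=False, tol=1e-12):
    """Check the conditions of a similarity measure on I = {0, ..., N-1}.

    s      -- square matrix as a list of lists, s[i][j] = s(i, j)
    strict -- if True, also require s(i, j) >= 0 and s(i, i) = 1
    tol    -- numerical tolerance for the comparisons
    """
    n = len(s)
    if any(len(row) != n for row in s):
        return False  # not an N x N matrix
    for i in range(n):
        for j in range(n):
            if abs(s[i][j] - s[j][i]) > tol:   # symmetry
                return False
            if s[i][i] < s[i][j] - tol:        # s(i, i) >= s(i, j)
                return False
            if strict and s[i][j] < -tol:      # non-negativity
                return False
        if strict and abs(s[i][i] - 1.0) > tol:  # s(i, i) = 1
            return False
    return True


# Hypothetical similarity matrix for three objects.
example = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
print(is_similarity_matrix(example, strict=True))
```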

Application in bioinformatics

Similarity matrices such as PAM or BLOSUM play an important role in sequence alignment. Similar proteins, nucleotides or amino acids receive higher scores (i.e. similarity values) than dissimilar ones. Here, similarity is defined by the chemical properties of the building blocks and their mutation rates.

Example (A, G, C and T stand for the four nucleobases adenine, guanine, cytosine and thymine):

	A	G	C	T
A	10	−1	−3	−4
G	−1	7	−5	−3
C	−3	−5	9	0
T	−4	−3	0	8

The molecules whose similarity is to be specified are listed in the same order in the columns and in the rows. The value $a_{i,j}$ at position $(i,j)$ thus indicates how similar the molecule at column position i is to that at row position j.

According to the above similarity matrix, cytosine and thymine (similarity score 0) are more similar to one another than guanine and cytosine (similarity score −5).
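
The example matrix can be stored as a lookup table. The sketch below uses the hypothetical helper names `score` and `alignment_score`; it looks up pairwise scores and sums them over two equally long, ungapped sequences. Real alignment programs additionally handle gaps, which this example deliberately leaves out.

```python
BASES = "AGCT"
# Similarity matrix from the example (rows and columns in the order A, G, C, T).
MATRIX = [
    [10, -1, -3, -4],
    [-1,  7, -5, -3],
    [-3, -5,  9,  0],
    [-4, -3,  0,  8],
]
INDEX = {base: k for k, base in enumerate(BASES)}


def score(x, y):
    """Return the similarity score of two nucleobases."""
    return MATRIX[INDEX[x]][INDEX[y]]


def alignment_score(seq1, seq2):
    """Sum of pairwise scores for two ungapped sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(score(x, y) for x, y in zip(seq1, seq2))


print(score("C", "T"))                  # 0
print(score("G", "C"))                  # -5
print(alignment_score("AGCT", "AGTT"))  # 10 + 7 + 0 + 8 = 25
```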

Similarity measures for binary variables

For $p$ binary variables and two observations $i$ and $j$, let

$$n_{00} = \sum_{k=1}^{p} I(x_{ik} = 0, x_{jk} = 0),$$

$$n_{01} = \sum_{k=1}^{p} I(x_{ik} = 0, x_{jk} = 1),$$

$$n_{10} = \sum_{k=1}^{p} I(x_{ik} = 1, x_{jk} = 0),$$

and

$$n_{11} = \sum_{k=1}^{p} I(x_{ik} = 1, x_{jk} = 1),$$

where $I(\cdot)$ is the indicator function, so that

$$p = n_{00} + n_{01} + n_{10} + n_{11}.$$

Then the following measures can be defined:

Similarity measure	$s(i,j)$
Braun	$\frac{n_{11}}{\max(n_{11}+n_{01},\, n_{11}+n_{10})}$
Dice	$\frac{2n_{11}}{n_{01}+n_{10}+2n_{11}}$
Hamann	$\frac{(n_{00}+n_{11})-(n_{01}+n_{10})}{p}$
Jaccard (S coefficient)	$\frac{n_{11}}{n_{01}+n_{10}+n_{11}}$
Kappa	$\frac{1}{1+\tfrac{p(n_{01}+n_{10})}{2(n_{00}n_{11}-n_{01}n_{10})}}$
Kulczynski	$\frac{n_{11}}{n_{01}+n_{10}}$
Ochiai	$\frac{n_{11}}{\sqrt{(n_{11}+n_{01})(n_{11}+n_{10})}}$
Phi	$\frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{(n_{11}+n_{01})(n_{11}+n_{10})(n_{00}+n_{01})(n_{00}+n_{10})}}$
Russell-Rao	$\frac{n_{11}}{p}$
Simple matching (M coefficient)	$\frac{n_{00}+n_{11}}{p}$
Simpson	$\frac{n_{11}}{\min(n_{11}+n_{01},\, n_{11}+n_{10})}$
Sneath	$\frac{n_{11}}{n_{11}+2n_{01}+2n_{10}}$
Tanimoto (Rogers)	$\frac{n_{00}+n_{11}}{n_{00}+2(n_{01}+n_{10})+n_{11}}$
Yule	$\frac{n_{00}n_{11}-n_{01}n_{10}}{n_{00}n_{11}+n_{01}n_{10}}$
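
A minimal sketch of how the counts and a few of the tabulated coefficients could be computed in Python is given here. The function names and the two example observations are illustrative assumptions; the functions do not guard against zero denominators, so a coefficient raises `ZeroDivisionError` when it is undefined for the given pair.

```python
import math


def counts(x, y):
    """Return (n00, n01, n10, n11) for two equally long 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("observations must have the same number of variables")
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        n[(a, b)] += 1
    return n[(0, 0)], n[(0, 1)], n[(1, 0)], n[(1, 1)]


def jaccard(x, y):
    n00, n01, n10, n11 = counts(x, y)
    return n11 / (n01 + n10 + n11)


def dice(x, y):
    n00, n01, n10, n11 = counts(x, y)
    return 2 * n11 / (n01 + n10 + 2 * n11)


def simple_matching(x, y):
    n00, n01, n10, n11 = counts(x, y)
    return (n00 + n11) / (n00 + n01 + n10 + n11)


def ochiai(x, y):
    n00, n01, n10, n11 = counts(x, y)
    return n11 / math.sqrt((n11 + n01) * (n11 + n10))


# Hypothetical observations on p = 6 binary variables.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
print(counts(x, y))   # (2, 1, 1, 2)
print(jaccard(x, y))  # 0.5
```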

For non-binary nominal or ordinal variables, one defines a binary variable for each category of the variable and can then use the similarity measures for binary variables.
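
The recoding into binary variables can be sketched as follows. The helper names `dummy_code` and `encode_observation` and the example categories are hypothetical. Note that this coding treats the categories as unordered, so the ordering of an ordinal variable is not used.

```python
def dummy_code(value, categories):
    """Encode one categorical value as 0/1 indicators, one per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_observation(obs, categories_per_variable):
    """Concatenate the indicator vectors of all variables of one observation."""
    encoded = []
    for value, categories in zip(obs, categories_per_variable):
        encoded.extend(dummy_code(value, categories))
    return encoded


# Hypothetical data: one nominal and one ordinal variable.
categories = [["red", "green", "blue"], ["low", "medium", "high"]]
a = encode_observation(["red", "high"], categories)
b = encode_observation(["red", "low"], categories)
print(a)  # [1, 0, 0, 0, 0, 1]
print(b)  # [1, 0, 0, 1, 0, 0]
```

The resulting 0/1 vectors can then be passed to any of the similarity measures for binary variables.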

Choice of the similarity measure

Which similarity measure is chosen for the analysis depends on the problem. However, there are some guidelines as to which measure is suitable, depending on the properties of the binary variable:

If the variable is symmetric, i.e. both categories are equally important (e.g. gender), then both joint presence ($n_{11}$) and joint absence ($n_{00}$) are usually relevant for a similarity measure. In this case Simple matching, Hamann or Tanimoto can be used.
If the variable is asymmetric, i.e. only one category plays an essential role (e.g. disease occurred), then often only joint presence ($n_{11}$) is relevant. In this case Dice, Jaccard, Kulczynski, Ochiai, Braun, Simpson or Sneath can be used.
Kappa, Phi and Yule can be used in both the symmetric and the asymmetric case.
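
The practical difference between the two groups of measures can be illustrated with a small sketch. The two observations below are hypothetical and consist mostly of joint absences. Simple matching counts these as agreement, whereas Jaccard ignores them.

```python
def counts(x, y):
    """Return (n00, n01, n10, n11) for two equally long 0/1 vectors."""
    pairs = list(zip(x, y))
    return tuple(pairs.count(c) for c in [(0, 0), (0, 1), (1, 0), (1, 1)])


# Hypothetical observations on p = 10 binary variables, mostly zeros.
x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

n00, n01, n10, n11 = counts(x, y)
p = n00 + n01 + n10 + n11

simple_matching = (n00 + n11) / p   # uses joint absences n00
jaccard = n11 / (n01 + n10 + n11)   # ignores joint absences n00

print(simple_matching)  # 0.8
print(jaccard)          # 1/3
```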

When choosing the similarity measure, relationships between the measures should also be taken into account:

Dice, Jaccard and Sneath are monotone functions of each other:

$$\text{Sneath} \leq \text{Jaccard} \leq \text{Dice}.$$

For Simpson and Braun, the harmonic mean is Dice, the geometric mean is Ochiai, and the arithmetic mean is the Kulczynski coefficient in its second form, $\tfrac{1}{2}\left(\tfrac{n_{11}}{n_{11}+n_{01}}+\tfrac{n_{11}}{n_{11}+n_{10}}\right)$, which differs from the form listed in the table above. The inequality of the means then gives:

$$\text{Braun} \leq \text{Dice} \leq \text{Ochiai} \leq \text{Kulczynski} \leq \text{Simpson}.$$

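Hamann, Rogers (Tanimoto) and Simple matching are also monotone functions of each other: writing $M$ for the simple matching coefficient, Hamann equals $2M - 1$ and Rogers equals $M / (2 - M)$.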
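
The relationships above can be checked numerically for any set of counts. The following sketch does this for one hypothetical set of counts. The key `Kulczynski2` denotes the arithmetic-mean form of the Kulczynski coefficient, which is an assumption of this sketch because it differs from the table entry $n_{11}/(n_{01}+n_{10})$.

```python
import math


def coefficients(n00, n01, n10, n11):
    """Return several similarity coefficients computed from the four counts."""
    a = n11 + n01
    b = n11 + n10
    p = n00 + n01 + n10 + n11
    return {
        "Sneath": n11 / (n11 + 2 * n01 + 2 * n10),
        "Jaccard": n11 / (n01 + n10 + n11),
        "Dice": 2 * n11 / (n01 + n10 + 2 * n11),
        "Braun": n11 / max(a, b),
        "Simpson": n11 / min(a, b),
        "Ochiai": n11 / math.sqrt(a * b),
        "Kulczynski2": 0.5 * (n11 / a + n11 / b),  # arithmetic mean form
        "SimpleMatching": (n00 + n11) / p,
        "Hamann": ((n00 + n11) - (n01 + n10)) / p,
        "Rogers": (n00 + n11) / (n00 + 2 * (n01 + n10) + n11),
    }


# Hypothetical counts.
c = coefficients(n00=5, n01=1, n10=3, n11=2)

# Ordering of Sneath, Jaccard and Dice.
print(c["Sneath"] <= c["Jaccard"] <= c["Dice"])
# Ordering that follows from the inequality of the means.
print(c["Braun"] <= c["Dice"] <= c["Ochiai"] <= c["Kulczynski2"] <= c["Simpson"])
# Dice and Sneath as monotone functions of Jaccard.
j = c["Jaccard"]
print(math.isclose(c["Dice"], 2 * j / (1 + j)), math.isclose(c["Sneath"], j / (2 - j)))
```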

Distance measure

Definition

Let $I = \{1, 2, \dots, N\}$ be a finite set. A function $d \colon I \times I \rightarrow \mathbb{R}$ is called a distance measure or distance function if the following holds for all $i, j \in I$:

$d(i,j) = d(j,i)$,
$d(i,j) \geq 0$ and $d(i,j) = 0 \Leftrightarrow i = j$.

The function values $d(i,j)$ can be arranged in a symmetric $N \times N$ matrix $\left(d(i,j)\right)_{i,j}$. This matrix is called the distance matrix.

If the function $d$ also satisfies the triangle inequality, it is a metric. A metric is often also referred to as a distance function.
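
The difference between a distance measure and a metric can be made concrete with a short sketch. The function names `is_distance_matrix` and `is_metric` and both example matrices are illustrative assumptions.

```python
from itertools import product


def is_distance_matrix(d, tol=1e-12):
    """Check symmetry, non-negativity and d(i, j) = 0 <=> i = j."""
    n = len(d)
    if any(len(row) != n for row in d):
        return False  # not an N x N matrix
    for i, j in product(range(n), repeat=2):
        if abs(d[i][j] - d[j][i]) > tol:        # symmetry
            return False
        if d[i][j] < -tol:                      # non-negativity
            return False
        if (abs(d[i][j]) <= tol) != (i == j):   # zero exactly on the diagonal
            return False
    return True


def is_metric(d, tol=1e-12):
    """Check the distance conditions plus the triangle inequality."""
    if not is_distance_matrix(d, tol):
        return False
    n = len(d)
    return all(
        d[i][k] <= d[i][j] + d[j][k] + tol
        for i, j, k in product(range(n), repeat=3)
    )


# Hypothetical matrices: the first is a metric, the second is only a distance
# measure because d(0, 2) = 3 > d(0, 1) + d(1, 2) = 2.
d1 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
d2 = [[0, 1, 3], [1, 0, 1], [3, 1, 0]]
print(is_metric(d1), is_distance_matrix(d2), is_metric(d2))
```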

Some distance measures

For $p$ metrically scaled variables and two observations $i$ and $j$, the following measures can be defined:

Distance measure	$d(i,j)$
$L_r$	$\left(\sum_{k=1}^{p} |x_{ik}-x_{jk}|^{r}\right)^{1/r}$
Euclidean $L_2$	$\sqrt{\sum_{k=1}^{p} (x_{ik}-x_{jk})^{2}}$
Pearson	$\sqrt{\sum_{k=1}^{p} \frac{(x_{ik}-x_{jk})^{2}}{s_k^{2}}}$ with the standard deviation $s_k$ of variable $k$
City block (Manhattan) $L_1$	$\sum_{k=1}^{p} |x_{ik}-x_{jk}|$
Gower	$\sum_{k=1}^{p} \frac{|x_{ik}-x_{jk}|}{r_k}$ with the range $r_k$ of variable $k$
Mahalanobis	$\sqrt{(x_i-x_j)^{T} S^{-1} (x_i-x_j)}$ with the sample covariance matrix $S$ of the variables, where $x_i$ denotes the vector of values of observation $i$
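
Most of the tabulated distances translate directly into a few lines of Python. The sketch below uses hypothetical function names and example values; the standard deviations and ranges are passed in as arguments rather than estimated from data. The Mahalanobis distance is left out because it requires a matrix inverse.

```python
import math


def minkowski(x, y, r):
    """L_r distance between two observations given as sequences of numbers."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def euclidean(x, y):
    return minkowski(x, y, 2)


def manhattan(x, y):
    return minkowski(x, y, 1)


def pearson(x, y, s):
    """Euclidean distance after scaling variable k by its standard deviation s[k]."""
    return math.sqrt(sum((a - b) ** 2 / sk ** 2 for a, b, sk in zip(x, y, s)))


def gower(x, y, ranges):
    """City block distance after scaling variable k by its range ranges[k]."""
    return sum(abs(a - b) / rk for a, b, rk in zip(x, y, ranges))


# Hypothetical observations on p = 2 variables.
x = (1, 2)
y = (4, 6)
print(euclidean(x, y))           # 5.0
print(manhattan(x, y))           # 7.0
print(gower(x, y, ranges=(6, 8)))  # 1.0
```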

Relationship between similarity and distance measures

In general, one can define a distance measure from a similarity measure by

$$d(i,j) = \sqrt{s(i,i) + s(j,j) - 2s(i,j)}.$$

However, a distance measure obtained in this way generally does not satisfy the triangle inequality and is therefore not a metric.
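
A short sketch illustrates both the transformation and the possible failure of the triangle inequality. The function name `similarity_to_distance` and the similarity matrix are hypothetical. The matrix is a valid similarity matrix (symmetric, non-negative, with ones on the diagonal), yet the derived distances violate the triangle inequality.

```python
import math


def similarity_to_distance(s):
    """Apply d(i, j) = sqrt(s(i, i) + s(j, j) - 2 s(i, j)) to a similarity matrix."""
    n = len(s)
    return [
        # max(0.0, ...) only guards against tiny negative rounding errors.
        [math.sqrt(max(0.0, s[i][i] + s[j][j] - 2 * s[i][j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarity matrix: objects 0 and 2 are both similar to object 1
# but not similar to each other.
s = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.9],
    [0.0, 0.9, 1.0],
]
d = similarity_to_distance(s)

# d(0, 2) = sqrt(2) is larger than d(0, 1) + d(1, 2) = 2 * sqrt(0.2).
print(d[0][2] > d[0][1] + d[1][2])  # True
```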

Literature

Joachim Hartung, Bärbel Elpelt: Multivariate Statistics. Teaching and handbook of applied statistics. Oldenbourg Verlag, Munich 1984, ISBN 3-486-28451-7.
Ludwig Fahrmeir, Alfred Hamerle: Multivariate statistical methods. de Gruyter, Berlin 1984, ISBN 3-11-008509-7.

References

[1] P. F. Russell, T. R. Rao: On habitat and association of species of anopheline larvae in south-eastern Madras. In: Journal of the Malaria Institute of India. 3, 1940, pp. 153-178.
[2] D. J. Rogers, T. T. Tanimoto: A Computer Program for Classifying Plants. In: Science. 132, No. 3434, October 21, 1960, pp. 1115-1118. doi:10.1126/science.132.3434.1115.
[3] ShengLi Tzeng, Han-Ming Wu, Chun-Houh Chen: Selection of Proximity Measures for Matrix Visualization of Binary Data. In: Biomedical Engineering and Informatics, 2009. BMEI '09. 2nd International Conference on. October 30, 2009, pp. 1-9, doi:10.1109/BMEI.2009.5305137.
[4] J. C. Gower: A General Coefficient of Similarity and Some of Its Properties. In: Biometrics. 27, No. 4, December 1971, pp. 857-871.
[5] Wolfgang Härdle, Léopold Simar: Applied Multivariate Statistical Analysis. 1st edition. Springer Verlag, Berlin 2003, ISBN 3-540-03079-4, p. 381.

