Similarity analysis

In statistics , especially multivariate statistics , one is interested in the measurement of the similarity between different objects and defines similarity and distance measures for this purpose . This is not to measure the mathematical sense , the term applies only to the measurement of a certain size.

Typically, similarity measures are used for nominally or ordinally scaled variables and distance measures are used for metrically scaled variables (i.e. for interval and ratio scales ).

Similarity measure

definition

Let be a finite set. A function is called a similarity measure or similarity function if the following applies to all : ${\ displaystyle I = \ left \ {1,2, \ dots, N \ right \}}$${\ displaystyle s \ colon I \ times I \ rightarrow \ mathbb {R}}$${\ displaystyle i, j \ in I}$

• ${\ displaystyle s (i, j) = s (j, i)}$ and
• ${\ displaystyle s (i, i) \ geq s (i, j)}$.

In addition, it is often required that the following applies to everyone : ${\ displaystyle i, j \ in I}$

• ${\ displaystyle s (i, j) \ geq 0}$and .${\ displaystyle s (i, i) = 1}$

The function values can be a symmetric - Matrix arrange. This matrix is ​​called the similarity matrix . In this context it is also referred to as the similarity coefficient. ${\ displaystyle s (i, j)}$ ${\ displaystyle N \ times N}$ ${\ displaystyle \ left (s (i, j) \ right) _ {i, j}}$${\ displaystyle s (i, j)}$

Application in bioinformatics

Similarity matrices such as PAM or BLOSUM play an important role in sequence alignment . Similar proteins , nucleotides or amino acids receive higher scores (i.e. similarity values) than dissimilar ones. The similarity is defined here by the chemical properties of the building blocks and their mutation rates .

Example (AGCT stands for the four nucleobases adenine , guanine , cytosine and thymine ):

A. G C. T
A. 10 −1 −3 −4
G −1 7th −5 −3
C. −3 −5 9 0
T −4 −3 0 8th

The molecules whose similarity is to be specified are sorted in the same order by columns and rows. The value at the position thus indicates how similar the molecule at column position i is to that at row position j. ${\ displaystyle a_ {i, j}}$${\ displaystyle (i, j)}$

According to the above similarity matrix, cytosine and tymine (similarity score 0) are more similar to one another than guanine to cytosine (similarity score -5).

Similarity measures for binary variables

For binary variables and two observations and be ${\ displaystyle p}$${\ displaystyle i}$${\ displaystyle j}$

${\ displaystyle n_ {00} = \ sum _ {k = 1} ^ {p} I (x_ {ik} = 0, x_ {jk} = 0)}$, ,${\ displaystyle n_ {01} = \ sum _ {k = 1} ^ {p} I (x_ {ik} = 0, x_ {jk} = 1)}$
${\ displaystyle n_ {10} = \ sum _ {k = 1} ^ {p} I (x_ {ik} = 1, x_ {jk} = 0)}$, and${\ displaystyle n_ {11} = \ sum _ {k = 1} ^ {p} I (x_ {ik} = 1, x_ {jk} = 1)}$
${\ displaystyle p = n_ {00} + n_ {01} + n_ {10} + n_ {11} \,}$.

Then you can define the following dimensions:

Similarity measure ${\ displaystyle s (i, j)}$
brown ${\ displaystyle {\ frac {n_ {11}} {\ max (n_ {11} + n_ {01}, n_ {11} + n_ {10})}}}$
Dice ${\ displaystyle {\ frac {2n_ {11}} {n_ {01} + n_ {10} + 2n_ {11}}}}$
Hamann ${\ displaystyle {\ frac {(n_ {00} + n_ {11}) - (n_ {01} + n_ {10})} {p}}}$
Jaccard ( S coefficient ) ${\ displaystyle {\ frac {n_ {11}} {n_ {01} + n_ {10} + n_ {11}}}}$
Kappa ${\ displaystyle {\ frac {1} {1 + {\ tfrac {p (n_ {01} + n_ {10})} {2 (n_ {00} n_ {11} -n_ {01} n_ {10}) }}}}}$
Kulczynski ${\ displaystyle {\ frac {n_ {11}} {n_ {01} + n_ {10}}}}$
Ochiai ${\ displaystyle {\ frac {n_ {11}} {\ sqrt {(n_ {11} + n_ {01}) (n_ {11} + n_ {10})}}}}$
Phi ${\ displaystyle {\ frac {n_ {11} n_ {00} -n_ {10} n_ {01}} {\ sqrt {(n_ {11} + n_ {01}) (n_ {11} + n_ {10} ) (n_ {00} + n_ {01}) (n_ {00} + n_ {10})}}}}$
Russel Rao ${\ displaystyle {\ frac {n_ {11}} {p}}}$
Simple matching ( M-coefficient ) ${\ displaystyle {\ frac {n_ {00} + n_ {11}} {p}}}$
Simpson ${\ displaystyle {\ frac {n_ {11}} {\ min (n_ {11} + n_ {01}, n_ {11} + n_ {10})}}}$
Sneath ${\ displaystyle {\ frac {n_ {11}} {n_ {11} + 2n_ {01} + 2n_ {10}}}}$
Tanimoto ( Rogers ) ${\ displaystyle {\ frac {n_ {00} + n_ {11}} {n_ {00} +2 (n_ {01} + n_ {10}) + n_ {11}}}}$
Yule ${\ displaystyle {\ frac {n_ {00} n_ {11} -n_ {01} n_ {10}} {n_ {00} n_ {11} + n_ {01} n_ {10}}}}$

For non-binary nominal or ordinal variables, one defines a binary variable for each category of the variable and can then use the similarity measures for binary variables.

Choice of the degree of similarity

Which degree of similarity is chosen for the analysis depends on the problem. However, there are some indications as to when which measure is best depending on the properties of the binary variable:

• Is the variable symmetric, i. H. Both categories are equally important (e.g. gender), then the same presence ( ) or the same absence ( ) is often important for a similarity measure. Then Simple Matching, Hamman or Tanimoto can be used.${\ displaystyle n_ {11}}$${\ displaystyle n_ {00}}$
• Is the variable asymmetrical, i. H. only one category plays an essential role (e.g. disease occurred), then often only the same occurrence ( ) plays a role. Then Dice, Jaccard, Kulczynskl, Ochiai, Braun, Simpson or Sneath can be used.${\ displaystyle n_ {11}}$
• Kappa, Phi and Yule can be used in both the symmetrical and the asymmetrical case.

When choosing the similarity measure, connections between the measures should also be taken into account:

• Dice, Jaccard and Sneath are monotonous functions of each other:
${\ displaystyle {\ text {Sneath}} \ leq {\ text {Jaccard}} \ leq {\ text {Dice}}.}$
${\ displaystyle {\ text {Brown}} \ leq {\ text {Dice}} \ leq {\ text {Ochiai}} \ leq {\ text {Kulczynski}} \ leq {\ text {Simpson}}.}$
• Hamman, Rogers and Simple matching also show a connection.

Distance measure

definition

Let be a finite set. A function is called a distance measure or distance function if the following applies to all : ${\ displaystyle I = \ left \ {1,2, \ dots, N \ right \}}$${\ displaystyle d \ colon I \ times I \ rightarrow \ mathbb {R}}$${\ displaystyle i, j \ in I}$

• ${\ displaystyle d (i, j) = d (j, i)}$ such as
• ${\ displaystyle d (i, j) \ geq 0}$and .${\ displaystyle d (i, j) = 0 \ Leftrightarrow i = j}$

The function values can be a symmetrical - Matrix arrange. This matrix is ​​called the distance matrix . ${\ displaystyle d (i, j)}$${\ displaystyle N \ times N}$ ${\ displaystyle \ left (d (i, j) \ right) _ {i, j}}$

If the function also satisfies the triangle inequality , it is a metric . A metric is often referred to as a distance function . ${\ displaystyle d}$

Some distance measurements

For scale variables and two observations and one can define the following measures: ${\ displaystyle p}$${\ displaystyle i}$${\ displaystyle j}$

Distance measure ${\ displaystyle d (i, j)}$
${\ displaystyle L_ {r}}$ ${\ displaystyle \ left (\ sum _ {k = 1} ^ {p} | x_ {ik} -x_ {jk} | ^ {r} \ right) ^ {1 / r}}$
Euclidean
${\ displaystyle L_ {2}}$
${\ displaystyle {\ sqrt {\ sum _ {k = 1} ^ {p} (x_ {ik} -x_ {jk}) ^ {2}}}}$
Pearson ${\ displaystyle {\ sqrt {\ sum _ {k = 1} ^ {p} {\ frac {(x_ {ik} -x_ {jk}) ^ {2}} {s_ {k} ^ {2}}} }}}$
with the standard deviation of the variable${\ displaystyle s_ {k}}$${\ displaystyle k}$
City block
Manhattan
${\ displaystyle L_ {1}}$
${\ displaystyle \ sum _ {k = 1} ^ {p} | x_ {ik} -x_ {jk} |}$
Gower ${\ displaystyle \ sum _ {k = 1} ^ {p} {\ frac {| x_ {ik} -x_ {jk} |} {r_ {k}}}}$
with the range of the variable${\ displaystyle r_ {k}}$${\ displaystyle k}$
Mahalanobis ${\ displaystyle {\ sqrt {(x_ {i} -x_ {j}) ^ {T} S ^ {- 1} (x_ {i} -x_ {j})}}}$
with the sample covariance matrix of the variables${\ displaystyle S}$${\ displaystyle x_ {i}}$

Relationship between similarity and distance measures

In general, one can define a distance measure from a similarity measure by

${\ displaystyle d (i, j) = {\ sqrt {s (i, i) + s (j, j) -2s (i, j)}}}$.

However, a distance measure obtained in this way generally does not satisfy the triangle inequality and is therefore not a metric.