Neighbor Joining Algorithm

The Neighbor Joining Algorithm is a mathematical procedure for comparing data sets and arranging them hierarchically in a bifurcal (two- forked ) manner . This procedure was introduced by Saitou and Nei in 1987 and further developed and simplified by Studier and Keppler in 1988.

application

In bioinformatics , the neighbor joining method describes a phenetic bottom-up clustering method that is used to create phylogenetic tree structures . This is intended to calculate the probability of an ancestry or family relationship in a family tree-like representation based on varying features in the data matrix .

Usually trees are created from DNA or protein sequence data or classic morphological data sets. The algorithm needs knowledge of the distance between two pairs of taxa (for example species or sequences) in a tree.

algorithm

Three steps in creating a phylogenetic tree for five taxa using the neighbor joining algorithm

Neighbor joining is mostly based on the “minimum evolution” criterion for phylogenetic trees: starting from an initially star-shaped “tree” in which all taxa are connected to a “center”, the DNA or protein sequences are paired with the smallest genetic distance selected and united to a branch of the tree. The genetic distances of the sequences are recalculated and the closest related ones are put together again to form a branch with two taxa. This continues until all taxa have been inserted in the tree and the star structure of the tree has been completely dissolved. In contrast to UPGMA , Neighbor-Joining takes into account that the rate of evolution is not constant: If a taxon is far away from all other taxa, this is not due to a distant degree of relationship, but to accelerated evolution.

The algorithm is iterative and replaces a pair of Operational Taxonomic Units (OTU) with a new sequence in each step. It iterates on the remaining sequences until there is only one possible topology for the three remaining OTUs . Then the tree structure is created.

example

A typical table of distances between taxa is given below, the values being purely hypothetical but realistic: Studier and Keppler introduced an alternative parameter in the algorithm called Mij. Saitou and Nei originally used the “minimal sum of branches” (Sij), ie the minimum number of branches, to determine the neighbors. The example is based on the algorithm by Studier and Keppler, which offers a complexity class compared to the original parameter . ${\ displaystyle O (n ^ {3})}$

	human	mouse	rose	tulip
human	0	3	14th	12
mouse	3	0	13	11
rose	14th	13	0	4th
tulip	12	11	4th	0

Since the table is triangular, the lower half does not have to be saved. The values in this table are named as. ${\ displaystyle d_ {i, j}}$

Step 1: The average distances from each taxon to each other must be calculated. This is done with the following formula for the net divergence r _i :

${\ displaystyle r_ {i} = {\ frac {1} {N-2}} \ sum _ {k = 1} ^ {N} d_ {i, k}}$

Where N is the number of taxa.

human	${\ displaystyle r_ {1} = {\ frac {0 + 3 + 14 + 12} {4-2}} = 14 {,} 5}$
mouse	${\ displaystyle r_ {2} = {\ frac {3 + 0 + 13 + 11} {4-2}} = 13 {,} 5}$
rose	${\ displaystyle r_ {3} = {\ frac {14 + 13 + 0 + 4} {4-2}} = 15 {,} 5}$
tulip	${\ displaystyle r_ {4} = {\ frac {12 + 11 + 4 + 0} {4-2}} = 13 {,} 5}$

Interpretation: In our example, the rose has the greatest net divergence, so it has experienced a greater evolutionary speed than the other taxa.

Step 2: We calculate an intermediate matrix M.

${\ displaystyle M_ {i, j} = d_ {i, j} - (r_ {i} + r_ {j})}$

Such as B. between man and mouse:

${\ displaystyle M_ {1,2} = d_ {1,2} - (r_ {1} + r_ {2}) = 3- (14 {,} 5 + 13 {,} 5) = - 25}$

${\ displaystyle M}$	human	mouse	rose	tulip
human		−25	−16	−16
mouse	−25		−16	−16
rose	−16	−16		−25
tulip	−16	−16	−25

Step 3: The smallest value, i.e. the smallest distance between two taxa, is now searched for in this newly calculated distance matrix M, and the two taxa found are combined to form a new subtree u = (i, j). In this example, there are two options, human and mouse, or rose and tulip to form a subtree. We will first choose humans and mice.

The edge length of the node to the branch is calculated as follows:

${\ displaystyle v_ {i, u} = {\ frac {d_ {i, j} + r_ {i} -r_ {j}} {2}}}$

${\ displaystyle v_ {j, u} = d_ {i, j} -v_ {i, u}}$

So human to MeMa is the same ${\ displaystyle {\ frac {3 + 14.5-13.5} {2}} = 2}$

Step 4: The new entry u = MeMa is added to the original distance matrix:

	human	mouse	rose	tulip	MeMa
human	0	3	14th	12	?
mouse	3	0	13	11	?
rose	14th	13	0	4th	?
tulip	12	11	4th	0	?
MeMa	?	?	?	?	0

The following formula is used to calculate the distances of the new entry u = (i, j) = (1,2) = (human, mouse) = MeMa to the remaining taxa:

${\ displaystyle d_ {u, k} = {\ frac {d_ {i, k} + d_ {j, k} -d_ {i, j}} {2}}}$

The entries i and j are combined to form a new entry u, and the distance to entry k is calculated. So the distance between rose and the new subtree is:

${\ displaystyle d_ {5,3} = {\ frac {d_ {1,3} + d_ {2,3} -d_ {1,2}} {2}} = {\ frac {14 + 13-3} {2}} = 12}$

The "old", merged entries are deleted from the distance matrices.

	rose	tulip	MeMa
rose	0	4th	12
tulip	4th	0	10
MeMa	12	10	0

Then it is calculated again and again , reassembled and started again from the beginning. This is repeated until only two taxa remain, which are then finally connected. ${\ displaystyle r_ {i}}$ ${\ displaystyle M_ {i, j}}$

The result of our example can be shown as follows:

additive tree	Edition of the Phylip program
Mensch Rose \ / \ 2 3 / \ 9 / ----------------- / \ / 1 1 \ / \ Maus Tulpe	+----Maus ! ! +-------------Rose 1-----------------------------------------2 ! +----Tulpe ! +--------Mensch

classification

Neighbor joining is one of the explicit methods. This means that different evolution models, i.e. different probabilities for point mutations, can be assumed when calculating the genetic distances. The correctness of these family trees is based on the assumption that the change in the characteristics under consideration does not contain any unknown intermediate steps. It is therefore assumed in a simplified way that “evolution does not take detours” (“minimum evolution”).

The neighbor joining algorithm calculates the family tree step by step and therefore does not necessarily find the optimal tree topology with the smallest branch length. This is based on its construction principle, as a greedy algorithm . In contrast to other algorithms, this one does not calculate all possible trees and finally selects the optimal ones, but discards some calculation methods during the process. Although the algorithm is sub-optimal, it has been extensively tested and will usually find a tree that is relatively close to the optimum.

advantages

The main advantage of this method is its speed. It can be applied to huge amounts of data, even where other methods of phylogenetic analysis such as maximum parsimony and maximum likelihood are no longer feasible. In contrast to the UPGMA algorithm (Unweighted Pair Group Method with Arithmetic mean) for phylogenetic tree reconstruction, neighbor joining does not assume that the evolution of the lineages takes place at the same rate (see also Molecular Clock ) and therefore creates an unbalanced tree.

literature

N. Saitou, M. Nei: The neighbor-joining method. A new method for reconstructing phylogenetic trees. In: Molecular Biology and Evolution . tape 4 , no. 4 , July 1, 1987, pp. 406-425 ( oxfordjournals.org ).
JA Studier, KJ Keppler: A note on the neighbor-joining algorithm of Saitou and Nei . In: Molecular Biology and Evolution . tape 5 , no. 6 , November 1, 1988, pp. 729-731 , PMID 3221794 ( mbe.oxfordjournals.org [PDF]).
Volker Knoop, Kai Müller: Genes and Family Trees. A handbook on molecular phylogenetics . 1st edition. Elsevier, Spektrum Akademischer Verlag, Munich / Heidelberg 2006, ISBN 3-8274-1642-6 .
Olivier Gascuel, Mike Steel: Neighbor-Joining Revealed . In: Molecular Biology and Evolution . tape 23 , no. 11 , November 1, 2006, pp. 1997-2000 , doi : 10.1093 / molbev / msl072 , PMID 16877499 .

swell

↑ 2.2.2 Neighbor joining method (PDF, p. 9) on spline.de

↑ ^a ^b 3.1 neighbor joining algorithm (PDF, p. 7).

^ The Neighbor-Joining Method on icp.ucl.ac.be

↑ Neighbor Joining Method (Saitou and Nei, 1987) Summary ( Memento of the original from September 16, 2016 in the Internet Archive ) Info: The archive link was inserted automatically and has not yet been checked. Please check the original and archive link according to the instructions and then remove this notice. on stat.berkeley.edu @1@ 2

↑ Kord Eickmeyer, Peter Huggins, Lior Pachter, Ruriko Yoshida: On the optimality of the neighbor-joining algorithm . In: Algorithms for Molecular Biology (AMB) . tape 3 , April 30, 2008, ISSN 1748-7188 , p. 5 , doi : 10.1186 / 1748-7188-3-5 .

Web links

Quicktree - An implementation of neighbor joining ( Memento of March 11, 2010 in the Internet Archive )
The PHYLIP package

[1] 2.2.2 Neighbor joining method (PDF, p. 9) on spline.de

[Algorithmus-2] 3.1 neighbor joining algorithm (PDF, p. 7).

[3] The Neighbor-Joining Method on icp.ucl.ac.be