Needleman Desire Algorithm

The Needleman Wunsch algorithm is an optimization algorithm from bioinformatics . There it is used to compare two nucleotide or amino acid sequences . It uses a model to calculate the optimal global similarity score or, with the help of backtracking, one or more optimal global alignments between two sequences . The similarity score is a measure of the similarity between two sequences; the higher the score, the more similar the sequences are under the scoring model used. The algorithm optimizes the score of the alignment. An alignment is a sequence of editing steps to convert the first sequence into the second. There are many alignments for two nontrivial sequences - an optimal alignment has a maximum similarity score. The algorithm uses the dynamic programming method and allows arbitrary scoring models (i.e. the use of general gap costs ) for the editing steps.

The procedure

The Needleman Wunsch DP algorithm calculates the optimal global similarity alignment score for all pairs of possible prefixes of the sequences a and b in a matrix. The element of the matrix contains the optimal score for the optimal global alignment of the partial sequence of a and b. The spelling corresponds to the i-th prefix of a. If m denotes the sequence length of a and n denotes the sequence length of b, then the score matrix contains M rows and columns. The alignment score of the complete sequences is contained in after the execution of the algorithm . ${\ displaystyle i, j}$ ${\ displaystyle a [0..i]}$ ${\ displaystyle b [0..j]}$ ${\ displaystyle a [0..i]}$ ${\ displaystyle a_ {1}, a_ {2}, \ dots, a_ {i}}$ ${\ displaystyle m + 1}$ ${\ displaystyle n + 1}$ ${\ displaystyle M (m, n)}$

The score matrix is calculated recursively . The element (i, j) of the matrix M is maximized over three cases. The extension of the previous alignment of the sequences and a match or mis-match corresponds to the addition of the previously calculated score from and the costs of replacing by . The extension of an alignment already calculated by a final deletion corresponds to the addition of the general gap costs of the length of the deletion to the score of the optimal alignment of the sequences and , where the length of the deletion is. Analogous to the deletion, the expansion of an optimal alignment of the sequences and a final insertion corresponds to the addition of the score of this alignment and the gap costs for the length of the insertion. The maximum value of these three alternatives is stored in the element . ${\ displaystyle a [0..i-1]}$ ${\ displaystyle b [0..j-1]}$ ${\ displaystyle M (i-1, j-1)}$ ${\ displaystyle a [i]}$ ${\ displaystyle b [j]}$ ${\ displaystyle a [0..ik]}$ ${\ displaystyle b [0..j]}$ ${\ displaystyle k}$ ${\ displaystyle a [0..i]}$ ${\ displaystyle b [0..jl]}$ ${\ displaystyle l}$ ${\ displaystyle M (i, j)}$

The gap cost function can be general. I.e. it is not assumed that uniform costs or affine gap costs are used.

The matrix recursions described informally so far are formally written down in the next section. In order to take into account the dependencies of the recursion, the score matrix must be calculated in rows or columns.

The alignment can be output through backtracking . As a possible backtracking method, during the calculation of the score matrix, it is possible to store in an auxiliary matrix for each element whether the maximum value was calculated by inserting, deleting or by a match. From the element of the matrix , after the complete calculation of the matrix, the path to the calculation of the maximum can be traced back and the optimal alignment can be constructed. For two sequences there are one or more optimal paths in a matrix in a score matrix that lead to the optimal score, just as there can be several alignments for two sequences that have the optimal score. ${\ displaystyle (i, j)}$ ${\ displaystyle (n, m)}$ ${\ displaystyle M}$

Matrix recursion

Specification of the algorithm by matrix - recursion :

${\ displaystyle M (0,0) = 0}$

${\ displaystyle M (i, 0) = M (i-1,0) + f (i), \; 1 \ leq i \ leq m}$

${\ displaystyle M (0, j) = M (0, j-1) + f (j), \; 1 \ leq j \ leq n}$

${\ displaystyle M (i, j) = \ max {\ begin {Bmatrix} M (i-1, j-1) + \ w (a_ {i}, b_ {j}) & {\ text {Match or Mismatch}} \\ M (i-1, j) + \ f (k) & {\ text {Deletion}} \\ M (i, j-1) + \ f (l) & {\ text {Insertion} } \ end {Bmatrix}}, \; 1 \ leq i \ leq m, 1 \ leq j \ leq n}$

${\ displaystyle a, b}$ : Strings over an alphabet ${\ displaystyle \ Sigma}$
${\ displaystyle m = \ operatorname {length} (a)}$
${\ displaystyle n = \ operatorname {length} (b)}$
${\ displaystyle M (i, j)}$ : maximum similarity score between a prefix of which ends in and a prefix of which ends in ${\ displaystyle a}$ ${\ displaystyle i}$ ${\ displaystyle b}$ ${\ displaystyle j}$
${\ displaystyle w (c, d), \; c, d \ in \ Sigma}$ : Similarity score function
${\ displaystyle f}$ : general gap cost function

example

The steps of the algorithm are presented here using a small example.

The following function is used as the evaluation function:

${\ displaystyle w (x, y) = {\ begin {cases} 1 & {\ text {for}} x = y \\ - 1 & {\ text {otherwise}} \\\ end {cases}}}$

a = ACGTC and b = AGTC

For a better understanding, one can imagine that the rows are labeled with the letters from sequence a and the columns with the letters from sequence b. Mathematically, this makes no sense within the matrix, so this is just an illustration.

${\ displaystyle D = {\ begin {pmatrix} & - & A & G & T & C \\ - & 0 & -1 & -2 & -3 & -4 \\ A & -1 & 0 & 0 & 0 & 0 \\ C & -2 & 0 & 0 & 0 & 0 \\ G & -3 & 0 & 0 & 0 & 0 \\ T & -4 & 0 & 0 & 0 & 0 \\ C & - 5 & 0 & 0 & 0 & 0 \\\ end {pmatrix}}}$

0. step: initialization

The matrix entries for the first row and the first column are filled as described above. The rating for the entry is calculated from the rating above and the score at the point . So the other values are now calculated analogously. ${\ displaystyle D (i, j)}$ ${\ displaystyle D (1,0)}$ ${\ displaystyle D (i-1, j) = D (0,0) = 0}$ ${\ displaystyle w (a_ {i}, b_ {i}) = w (a_ {1}, -) = w (A, -) = - 1}$ ${\ displaystyle D (1,0) = 0 + (- 1) = - 1}$

${\ displaystyle D = {\ begin {pmatrix} 0 & -1 & -2 & -3 & -4 \\ - 1 & 0 & 0 & 0 & 0 \\ - 2 & 0 & 0 & 0 & 0 \\ - 3 & 0 & 0 & 0 & 0 \\ - 4 & 0 & 0 & 0 & 0 \\ - 5 & 0 & 0 & 0 & 0 \\\ end {pmatrix}}}$

1st step: Calculation of : ${\ displaystyle D (1,1)}$

${\ displaystyle {\ begin {cases} D (0,0) + w (A, A) \ Rightarrow 0 + 1 \\ D (0,1) + w (A, -) \ Rightarrow -1 + (- 1 ) \\ D (1,0) + w (-, A) \ Rightarrow -1 + (- 1) \\\ end {cases}}}$

→ The maximum arises from the first case, ie A is aligned with A.

${\ displaystyle D = {\ begin {pmatrix} 0 & -1 & -2 & -3 & -4 \\ - 1 & 1 & 0 & 0 & 0 \\ - 2 & 0 & 0 & 0 & 0 \\ - 3 & 0 & 0 & 0 & 0 \\ - 4 & 0 & 0 & 0 & 0 \\ - 5 & 0 & 0 & 0 & 0 \\\ end {pmatrix}}}$

→ Increase j by 1, i remains the same

2nd step: Calculation of : ${\ displaystyle D (1,2)}$

${\ displaystyle {\ begin {cases} D (0,1) + w (A, C) \ Rightarrow -1 + (- 1) \\ D (0,2) + w (A, -) \ Rightarrow -2 + (- 1) \\ D (1,1) + w (-, C) \ Rightarrow 1 + (- 1) \\\ end {cases}}}$

→ The maximum arises from the third case, since the maximum of the calculation, namely 0, arises here, ie a gap (-) would be aligned with C.

${\ displaystyle D = {\ begin {pmatrix} 0 & -1 & -2 & -3 & -4 \\ - 1 & 1 & 0 & 0 & 0 \\ - 2 & 0 & 0 & 0 & 0 \\ - 3 & 0 & 0 & 0 & 0 \\ - 4 & 0 & 0 & 0 & 0 \\ - 5 & 0 & 0 & 0 & 0 \\\ end {pmatrix}}}$

${\ displaystyle \ vdots}$

After completing the above steps, the filled matrix looks like this:

${\ displaystyle D = {\ begin {pmatrix} 0 & -1 & -2 & -3 & -4 \\ - 1 & 1 & 0 & -1 & -2 \\ - 2 & 0 & 0 & -1 & 0 \\ - 3 & -1 & 1 & 0 & -1 \\ - 4 & -2 & 0 & 2 & 1 \\ -5 & -3 & -1 & 1 & 3 \\\ end {pmatrix}}}$

The rating of this alignment is 3.

The corresponding alignment looks like this:

${\ displaystyle {\ begin {matrix} {\ text {sequence}} \ a: & A & C & G & T & C \\ {\ text {sequence}} \ b: & A & - & G & T & C \ end {matrix}}}$

It is calculated by a traceback.

Choice of evaluation function

The choice of the evaluation function has a great influence on the results obtained using the Needleman Wunsch algorithm. A simple evaluation function as chosen above in no way reflects the biological background of an alignment and is therefore rather unsuitable for practical purposes. The currently most common evaluation functions read the evaluation from a so-called scoring matrix . For proteins one can use the PAM or Blosum matrices. These matrices with the size 20 × 20 (or 24 × 24, if a few special cases are also considered) contain evaluations (so-called log odds) for the fact that one amino acid is substituted by another. The log odds are based on probabilities. These scoring matrices are also calculated from sequence alignments.

The evaluation function used above is used to determine the similarity of two sequences. In order to be able to determine the distance, one can simply change the evaluation function, ie in the case of inequality one can return a positive value, which can be interpreted as a penalty, and in the case of equality 0 or a negative value. It must be noted, however, that in the recursion of a distance-based evaluation function, it is not the maximum but the minimum that has to be determined.

An example of a distance-based evaluation function:

${\ displaystyle w (x, y) = {\ begin {cases} 0 & {\ text {for}} x = y \\ 1 & {\ text {for}} x \ not = y \\\ end {cases}} }$

complexity

The runtime of the Needleman-Wunsch algorithm is in . It must the matrix elements are calculated, and for each item must have be maximized elements. This can be easily deduced from the matrix recursions. The memory requirement is in . ${\ displaystyle O}$ ${\ displaystyle (\ max (n, m) ^ {3})}$ ${\ displaystyle m \ cdot n}$ ${\ displaystyle O (n) + O (m)}$ ${\ displaystyle O}$ ${\ displaystyle (nm)}$

Demarcation

The 1970 Needleman-Wunsch paper does not contain any matrix recursions; the algorithm is described informally there; the matrix recursions shown here result from this description.

In a 1976 paper by Waterman, Smith and Beyer, the Needleman-Wunsch algorithm is specified explicitly in matrix recursions. This is why some authors refer to the Needleman-Wunsch algorithm as the Waterman-Smith-Beyer algorithm.

In some literature, the standard DP algorithm for calculating the global alignment at uniform gap costs is incorrectly referred to as the Needleman Wunsch algorithm. However, this is a specialization of the Needleman-Wunsch algorithm, since only the three neighboring cells have to be taken into account when using uniform gap costs for the edit operations. ${\ displaystyle O (n ^ {2})}$

Because of the cubic runtime, the Needleman Wunsch algorithm is not used in practice. With restriction to uniform costs, the optimal alignment can be calculated using the optimization described above . With the Gotoh algorithm , the optimal alignment can be calculated with affine gap costs in quadratic runtime. The Hirschberg algorithm calculates a global alignment with uniform or affine gap costs on linear storage space in time and the Smith-Waterman algorithm calculates the optimal local alignment of two sequences. ${\ displaystyle O (n ^ {2})}$ ${\ displaystyle O (m + n)}$ ${\ displaystyle \ Theta (mn)}$

Storage estimation

Because of the quadratic memory requirement, the algorithm is rather unsuitable for aligning longer sequences. If z. If, for example, integer values, each 4 bytes in size, are stored in the matrix, then the calculation of the alignment score of a sequence of 10,000 characters requires a matrix with entries. So are occupied by the matrix . The alignment of entire genomes cannot be carried out efficiently in this way. For example, an average bacterial genome has approximately 1-4 million base pairs and the human genome has approximately 3 billion base pairs. ${\ displaystyle 10,000 \ times 10,000}$ ${\ displaystyle 100,000,000 \ times 4 \, {\ text {Byte}} \ approx 381 \, {\ text {Megabyte}}}$

Apart from that, a global alignment of entire genomes does not necessarily result in a high level of biological knowledge gain.

variants

Uniform gap costs

If uniform gap costs are used, it is possible to specialize the matrix recursions of the Needleman Wunsch algorithm, which reduces the runtime from to . Sellers explicitly specified these recursions in a 1974 paper. ${\ displaystyle O (n ^ {3})}$ ${\ displaystyle O (n ^ {2})}$

A uniform gap cost function is defined by , i. H. each place in a gap costs the same. Using uniform gap costs in the consideration of an optimal alignment of prefixes and that in an insertion gap length with ends, viewed directly that the optimal alignment of prefixes and ends in an insertion Gap. Therefore, it is enough of an optimal insertion Gap (any length) in the calculation of the cost , to be expected. The calculation of the deletion gap costs is carried out in the same way. ${\ displaystyle f}$ ${\ displaystyle f (l) = l \ cdot c}$ ${\ displaystyle a [0..i]}$ ${\ displaystyle b [0..j]}$ ${\ displaystyle k_ {1}}$ ${\ displaystyle k_ {1}> 1}$ ${\ displaystyle a [0..i-1]}$ ${\ displaystyle b [0..j]}$ ${\ displaystyle M (i, j)}$ ${\ displaystyle M (i-1, j) + c}$

This results in the following specialized recursions:

${\ displaystyle M (0,0) = 0}$

${\ displaystyle M (i, 0) = M (i-1,0) + c, \; 1 \ leq i \ leq m}$

${\ displaystyle M (0, j) = M (0, j-1) + c, \; 1 \ leq j \ leq n}$

${\ displaystyle M (i, j) = \ max {\ begin {Bmatrix} M (i-1, j-1) + \ w (a_ {i}, b_ {j}) & {\ text {Match or Mismatch}} \\ M (i-1, j) + c & {\ text {Deletion}} \\ M (i, j-1) + c & {\ text {Insertion}} \ end {Bmatrix}}, \; 1 \ leq i \ leq m, 1 \ leq j \ leq n}$

Free-shift alignment

The calculation of the optimal free-shift alignment (or end-gap free alignment) is a variant of the Needleman-Wunsch algorithm in which a sequence of insertions or deletions at the beginning or end of the alignment are not taken into account in the score calculation . To do this, the algorithm for calculating the global alignment is modified so that the first row or first column is also initialized. With backtracking, the maximum value is searched for in the last column or row and used as a starting point. ${\ displaystyle 0}$

Individual evidence

^ Waterman, Smith, Beyer: Some Biological Sequence Metrics . In: Advances in Mathematics . tape 20 , 1976, p. 367-387 , doi : 10.1016 / 0001-8708 (76) 90202-4 (Theorem 3).
↑ P. Clote, R. Ofen: Computational Molecular Biology . Wiley, 2000, ISBN 0-471-87251-2 .
^ D. Gusfield: Algorithms on Strings, Trees and Sequences . 1997, ISBN 0-521-58519-8 , 11.7.3, pp. 234 .
^ Peter H. Sellers: On the Theory and Computation of Evolutionary Distances . In: SIAM Journal on Applied Mathematics . tape 26 , no. 4 June 1974, p. 787-793 , JSTOR : 2099985 .

literature

Saul B. Needleman, Christian D. Wunsch: A general method applicable to the search for similarities in the amino acid sequence of two proteins . In: Journal of Molecular Biology . tape 48 , 1970, pp. 443-453 , doi : 10.1016 / 0022-2836 (70) 90057-4 , PMID 5420325 .

Web links

Implementation of the Needleman Wunsch algorithm

[1] Waterman, Smith, Beyer: Some Biological Sequence Metrics . In: Advances in Mathematics . tape 20 , 1976, p. 367-387 , doi : 10.1016 / 0001-8708 (76) 90202-4 (Theorem 3).

[2] P. Clote, R. Ofen: Computational Molecular Biology . Wiley, 2000, ISBN 0-471-87251-2 .

[3] D. Gusfield: Algorithms on Strings, Trees and Sequences . 1997, ISBN 0-521-58519-8 , 11.7.3, pp. 234 .

[4] Peter H. Sellers: On the Theory and Computation of Evolutionary Distances . In: SIAM Journal on Applied Mathematics . tape 26 , no. 4 June 1974, p. 787-793 , JSTOR : 2099985 .