Gotoh algorithm

The Gotoh algorithm calculates the alignment of two sequences with affine gap costs. He uses the programming paradigm of dynamic programming .

Affine gap costs

Affine gap costs are the costs for gaps in alignments, which can be described by a linear function . Where is the length of the gap. is the cost of starting a new gap, and is the cost of adding one post to an existing gap . ${\ displaystyle g (l) = {\ text {gap}} \ _ {\ text {start}} + (l-1) \ times {\ text {gap}} \ _ {\ text {extend}}}$ ${\ displaystyle l}$ ${\ displaystyle {\ text {gap}} \ _ {\ text {start}}}$ ${\ displaystyle {\ text {gap}} \ _ {\ text {extend}}}$

Biologically affine gap costs can make more sense than purely linear gap costs, since one would like to consider an insertion or deletion of several characters in certain contexts as more likely than an insertion or deletion of a single character, e.g. As the aligning of cDNA from genomic DNA (Gusfield, 1999).

Matrix recurrences

Specification of the algorithm by matrix recurrences:

Initialization:

A [0, 0] = 0

B [0, 0] = 0

C [0, 0] = 0

A [i, 0] = C [i, 0] = -inf, 1 <= i <= m

A [0, j] = B [0, j] = -inf, 1 <= j <= n

B [i, 0] = g (i), 1 <= i <= m

C [0, j] = g (j), 1 <= i <= n

Recursion:

${\ displaystyle A (i, j) = \ max {\ begin {Bmatrix} A (i-1, j-1) + w (u [i-1], v [j-1]) \\ B (i -1, j-1) + w (u [i-1], v [j-1]) \\ C (i-1, j-1) + w (u [i-1], v [j- 1] \\\ end {Bmatrix}}, \; 1 \ leq i \ leq m, \; 1 \ leq j \ leq n}$

${\ displaystyle B (i, j) = \ max {\ begin {Bmatrix} A (i-1, j) + {\ text {gap}} \ _ {\ text {start}} \\ B (i-1 , j) + {\ text {gap}} \ _ {\ text {extend}} \\ C (i-1, j) + {\ text {gap}} \ _ {\ text {start}} \\\ end {Bmatrix}}, \; 1 \ leq i \ leq m, \; 1 \ leq j \ leq n}$

${\ displaystyle C (i, j) = \ max {\ begin {Bmatrix} A (i, j-1) + {\ text {gap}} \ _ {\ text {start}} \\ B (i, j -1) + {\ text {gap}} \ _ {\ text {start}} \\ C (i, j-1) + {\ text {gap}} \ _ {\ text {extend}} \\\ end {Bmatrix}}, \; 1 \ leq i \ leq m, \; 1 \ leq j \ leq n}$

u, v - sequences over an alphabet ${\ displaystyle \ Sigma}$
m = length (u), n = length (v)
w (a, b), - similarity function ${\ displaystyle a, b \ in \ Sigma}$
gap_start - cost to start a gap
gap_extend - the cost of expanding a gap
g (l) - affine gap-cost function ${\ displaystyle g (l) = {\ text {gap}} \ _ {\ text {start}} + (l-1) \ times {\ text {gap}} \ _ {\ text {extend}}}$
-inf = ${\ displaystyle - \ infty}$
A (i, j) - the optimal similarity score of the optimal alignment of the prefix u [1..i] of u and the prefix v [1..j] of v under affine gap costs
B (i, j) - the optimal similarity score of the optimal alignment of the prefix u [1..i] of u and the prefix v [1..j] of v, which ends with a gap in u
C (i, j) - the optimal similarity score of the optimal alignment of the prefix u [1..i] of u and the prefix v [1..j] of v, which ends with a gap in v

The optimal score of the alignment is the maximum of: A [m, n], B [m, n], C [m, n].

Implementation in pseudo-code:

 1 //Initialisierung der Matrizen
 2 a[0, 0] = 0;
 3 b[0, 0] = 0;
 4 c[0, 0] = 0;
 5 
 6 FOR i = 1; TO i <= m; i++;
 7     a[i, 0] = c[i, 0] = -inf;
 8     b[i, 0] = g(i);
 9 
10 FOR j = 1; TO j <= n; j++;
11     a[0, j] = b[0, j] = -inf;
12     c[0, j] = g(j);
13 
14 //Berechnen der restlichen Matrix
15 FOR i = 1; TO i <= m; i++;
16     FOR j = 1; TO j <= n; j++;
17         a[i, j] = get_max(a[i - 1, j - 1],
18                             b[i - 1, j - 1],
19                             c[i - 1, j - 1]) +
20                             match_or_mismatch(u[i - 1], v[i - 1]);
21         b[i, j] = get_max(a[i - 1, j] + gap_start,
22                             b[i - 1, j] + gap_extend,
23                             c[i - 1, j] + gap_start);
24         c[i, j] = get_max(a[i, j - 1] + gap_start,
25                             b[i, j - 1] + gap_start,
26                             c[i, j - 1] + gap_extend);
27 //Speichere Ergebnis
28 result = get_max(a[m, n], b[m, n], c[m, n]);
29 
30 function g(l)
31     return gap_start + (l - 1) * gap_extend;
32 
33 function match_or_mismatch(x, y)
34         if( x == y)
35             return match_score;
36         else
37             return mismatch_score;

Implementation in C # (without error handling):

 1 using System;
 2 using System.IO;
 3 using System.Linq;
 4 
 5 
 6 namespace DnaAlignment
 7 {
 8     class Program
 9     {
10         const int GAP_PENALTY_START = -10;
11         const int GAP_PENALTY_EXTEND = -0.5;
12         const int MIN_INF = int.MinValue + 100; //Pseudo-Unendlich
13         const int MATCH = 3;
14         const int MISMATCH = -3;
15         static void Main(string[] args)
16         {
17             using (StreamReader reader = File.OpenText(args[0]))
18                 while (!reader.EndOfStream)
19                 {
20                     string line = reader.ReadLine();
21                     if (null == line)
22                         continue;
23                     string[] sSplit = line.Split('|');
24                     if (sSplit.Count() != 2)
25                         continue;
26                     string seqA = sSplit[0].Trim(' ');
27                     string seqB = sSplit[1].Trim(' ');
28 
29                     //Initsialisiere Matrizen
30                     int[,] a = new int[seqA.Length + 1, seqB.Length + 1];
31                     int[,] b = new int[seqA.Length + 1, seqB.Length + 1];
32                     int[,] c = new int[seqA.Length + 1, seqB.Length + 1];
33                     a[0, 0] = 0;
34                     b[0, 0] = 0;
35                     c[0, 0] = 0;
36                     for (int i = 1; i <= seqA.Length; i++)
37                     {
38                         a[i, 0] = c[i, 0] = MIN_INF;
39                         b[i, 0] = G(i);
40                     }
41                     for (int j = 1; j <= seqB.Length; j++)
42                     {
43                         a[0, j] = b[0, j] = MIN_INF;
44                         c[0, j] = G(j);
45                     }
46                     //Berechne Matrizen
47                     for (int i = 1; i <= seqA.Length; i++)
48                     {
49                         for (int j = 1; j <= seqB.Length; j++)
50                         {
51                             a[i, j] = Get_Max_Value(
52                                         a[i - 1, j - 1],
53                                         b[i - 1, j - 1],
54                                         c[i - 1, j - 1]) +
55                                         Match_Or_Mismatch(seqA[i - 1], seqB[j - 1]);
56                             b[i, j] = Get_Max_Value(
57                                         a[i - 1, j] + GAP_PENALTY_START,
58                                         b[i - 1, j] + GAP_PENALTY_EXTEND,
59                                         c[i - 1, j] + GAP_PENALTY_START);
60                             c[i, j] = Get_Max_Value(
61                                         a[i, j - 1] + GAP_PENALTY_START,
62                                         b[i, j - 1] + GAP_PENALTY_START,
63                                         c[i, j - 1] + GAP_PENALTY_EXTEND);
64                         }
65                     }
66                     Console.WriteLine(Get_Max_Value(a[seqA.Length, seqB.Length],
67                                         b[seqA.Length, seqB.Length],
68                                         c[seqA.Length, seqB.Length]));
69                 }
70         }
71         private static int G(int index)
72         {
73             return GAP_PENALTY_START + (index - 1) * GAP_PENALTY_EXTEND;
74         }
75         private static int Get_Max_Value(int a, int b, int c)
76         {
77             return Math.Max(Math.Max(a, b), c);
78         }
79         private static int Match_Or_Mismatch(char a, char b)
80         {
81             return a == b ? MATCH : MISMATCH;
82         }
83     }
84 }

Efficiency

The running time and the storage space requirement of the algorithm are in O (nm). This is an improvement on the Needleman Wunsch algorithm , which has a running time of for any gap costs . In the worst case, the matrix is square ( ), which leads to a cubic running time of in the Needleman-Wunsch algorithm . The Gotoh algorithm with its linear gap costs reduces the complexity to . ${\ displaystyle O (nm ^ {2} + n ^ {2} m)}$ ${\ displaystyle n = m}$ ${\ displaystyle O (n ^ {3})}$ ${\ displaystyle O (n ^ {2})}$

If only the score of the optimal alignment is to be calculated, all matrices can also be calculated column-by-row or column-by-row, since each entry only depends on the previous column or row. So the storage space requirement drops . ${\ displaystyle O (m + n)}$

The Gotoh algorithm can also be implemented with the Divide-and-Conquer method, so that the optimal alignment can be calculated with linear space requirements. See Hirschberg's algorithm .

literature

O. Gotoh: An improved algorithm for matching biological sequences . In: Journal of Molecular Biology . tape 162 , 1982, pp. 705-708 ( jaligner.sourceforge.net [PDF; 206 kB ]).
Eugene W. Myers and Webb Miller: Optimal alignments in linear space . In: Bioinformatics . tape 4 , no. 1 , 1988, p. 11–17 , doi : 10.1093 / bioinformatics / 4.1.11 ( citeseerx.ist.psu.edu [PDF] application of the linear storage space technique of the Hirschberg algorithm to the Gotoh algorithm).
J. Stoye: Divide-and-Conquer Multiple Sequence Alignment (PDF; 938 kB). Dissertation thesis. Bielefeld University, Research Report of the Technical Faculty, Information Technology Department, Report 97-02, pp. 26-27, 1997. ISSN 0946-7831
D. Gusfield: Algorithms on Strings, Trees and Sequences. 11.8, 1999, ISBN 0-521-58519-8 , pp. 235-244.
Saad Mneimneh: Computational Biology Lecture. 6: Affine gap penalty function, multiple sequence alignment.

Web links

bibiserv.techfak.uni-bielefeld.de CGI - Script for the calculation of global alignments with affine gap costs at Bielefeld University

Individual evidence

↑ Saad Mneimneh: Affine gap penalty function, multiple sequence alignment. (PDF) Retrieved on August 15, 2015 .

[1] Saad Mneimneh: Affine gap penalty function, multiple sequence alignment. (PDF) Retrieved on August 15, 2015 .