Gotoh algorithm

from Wikipedia, the free encyclopedia

The Gotoh algorithm calculates the alignment of two sequences with affine gap costs. He uses the programming paradigm of dynamic programming .

Affine gap costs

Affine gap costs are the costs for gaps in alignments, which can be described by a linear function . Where is the length of the gap. is the cost of starting a new gap, and is the cost of adding one post to an existing gap .

Biologically affine gap costs can make more sense than purely linear gap costs, since one would like to consider an insertion or deletion of several characters in certain contexts as more likely than an insertion or deletion of a single character, e.g. As the aligning of cDNA from genomic DNA (Gusfield, 1999).

Matrix recurrences

Specification of the algorithm by matrix recurrences:

Initialization:

A [0, 0] = 0

B [0, 0] = 0

C [0, 0] = 0

A [i, 0] = C [i, 0] = -inf, 1 <= i <= m

A [0, j] = B [0, j] = -inf, 1 <= j <= n

B [i, 0] = g (i), 1 <= i <= m

C [0, j] = g (j), 1 <= i <= n

Recursion:

  • u, v - sequences over an alphabet
  • m = length (u), n = length (v)
  • w (a, b), - similarity function
  • gap_start - cost to start a gap
  • gap_extend - the cost of expanding a gap
  • g (l) - affine gap-cost function
  • -inf =
  • A (i, j) - the optimal similarity score of the optimal alignment of the prefix u [1..i] of u and the prefix v [1..j] of v under affine gap costs
  • B (i, j) - the optimal similarity score of the optimal alignment of the prefix u [1..i] of u and the prefix v [1..j] of v, which ends with a gap in u
  • C (i, j) - the optimal similarity score of the optimal alignment of the prefix u [1..i] of u and the prefix v [1..j] of v, which ends with a gap in v

The optimal score of the alignment is the maximum of: A [m, n], B [m, n], C [m, n].

Implementation in pseudo-code:

 1 //Initialisierung der Matrizen
 2 a[0, 0] = 0;
 3 b[0, 0] = 0;
 4 c[0, 0] = 0;
 5 
 6 FOR i = 1; TO i <= m; i++;
 7     a[i, 0] = c[i, 0] = -inf;
 8     b[i, 0] = g(i);
 9 
10 FOR j = 1; TO j <= n; j++;
11     a[0, j] = b[0, j] = -inf;
12     c[0, j] = g(j);
13 
14 //Berechnen der restlichen Matrix
15 FOR i = 1; TO i <= m; i++;
16     FOR j = 1; TO j <= n; j++;
17         a[i, j] = get_max(a[i - 1, j - 1],
18                             b[i - 1, j - 1],
19                             c[i - 1, j - 1]) +
20                             match_or_mismatch(u[i - 1], v[i - 1]);
21         b[i, j] = get_max(a[i - 1, j] + gap_start,
22                             b[i - 1, j] + gap_extend,
23                             c[i - 1, j] + gap_start);
24         c[i, j] = get_max(a[i, j - 1] + gap_start,
25                             b[i, j - 1] + gap_start,
26                             c[i, j - 1] + gap_extend);
27 //Speichere Ergebnis
28 result = get_max(a[m, n], b[m, n], c[m, n]);
29 
30 function g(l)
31     return gap_start + (l - 1) * gap_extend;
32 
33 function match_or_mismatch(x, y)
34         if( x == y)
35             return match_score;
36         else
37             return mismatch_score;

Implementation in C # (without error handling):

 1 using System;
 2 using System.IO;
 3 using System.Linq;
 4 
 5 
 6 namespace DnaAlignment
 7 {
 8     class Program
 9     {
10         const int GAP_PENALTY_START = -10;
11         const int GAP_PENALTY_EXTEND = -0.5;
12         const int MIN_INF = int.MinValue + 100; //Pseudo-Unendlich
13         const int MATCH = 3;
14         const int MISMATCH = -3;
15         static void Main(string[] args)
16         {
17             using (StreamReader reader = File.OpenText(args[0]))
18                 while (!reader.EndOfStream)
19                 {
20                     string line = reader.ReadLine();
21                     if (null == line)
22                         continue;
23                     string[] sSplit = line.Split('|');
24                     if (sSplit.Count() != 2)
25                         continue;
26                     string seqA = sSplit[0].Trim(' ');
27                     string seqB = sSplit[1].Trim(' ');
28 
29                     //Initsialisiere Matrizen
30                     int[,] a = new int[seqA.Length + 1, seqB.Length + 1];
31                     int[,] b = new int[seqA.Length + 1, seqB.Length + 1];
32                     int[,] c = new int[seqA.Length + 1, seqB.Length + 1];
33                     a[0, 0] = 0;
34                     b[0, 0] = 0;
35                     c[0, 0] = 0;
36                     for (int i = 1; i <= seqA.Length; i++)
37                     {
38                         a[i, 0] = c[i, 0] = MIN_INF;
39                         b[i, 0] = G(i);
40                     }
41                     for (int j = 1; j <= seqB.Length; j++)
42                     {
43                         a[0, j] = b[0, j] = MIN_INF;
44                         c[0, j] = G(j);
45                     }
46                     //Berechne Matrizen
47                     for (int i = 1; i <= seqA.Length; i++)
48                     {
49                         for (int j = 1; j <= seqB.Length; j++)
50                         {
51                             a[i, j] = Get_Max_Value(
52                                         a[i - 1, j - 1],
53                                         b[i - 1, j - 1],
54                                         c[i - 1, j - 1]) +
55                                         Match_Or_Mismatch(seqA[i - 1], seqB[j - 1]);
56                             b[i, j] = Get_Max_Value(
57                                         a[i - 1, j] + GAP_PENALTY_START,
58                                         b[i - 1, j] + GAP_PENALTY_EXTEND,
59                                         c[i - 1, j] + GAP_PENALTY_START);
60                             c[i, j] = Get_Max_Value(
61                                         a[i, j - 1] + GAP_PENALTY_START,
62                                         b[i, j - 1] + GAP_PENALTY_START,
63                                         c[i, j - 1] + GAP_PENALTY_EXTEND);
64                         }
65                     }
66                     Console.WriteLine(Get_Max_Value(a[seqA.Length, seqB.Length],
67                                         b[seqA.Length, seqB.Length],
68                                         c[seqA.Length, seqB.Length]));
69                 }
70         }
71         private static int G(int index)
72         {
73             return GAP_PENALTY_START + (index - 1) * GAP_PENALTY_EXTEND;
74         }
75         private static int Get_Max_Value(int a, int b, int c)
76         {
77             return Math.Max(Math.Max(a, b), c);
78         }
79         private static int Match_Or_Mismatch(char a, char b)
80         {
81             return a == b ? MATCH : MISMATCH;
82         }
83     }
84 }

Efficiency

The running time and the storage space requirement of the algorithm are in O (nm). This is an improvement on the Needleman Wunsch algorithm , which has a running time of for any gap costs . In the worst case, the matrix is ​​square ( ), which leads to a cubic running time of in the Needleman-Wunsch algorithm . The Gotoh algorithm with its linear gap costs reduces the complexity to .

If only the score of the optimal alignment is to be calculated, all matrices can also be calculated column-by-row or column-by-row, since each entry only depends on the previous column or row. So the storage space requirement drops .

The Gotoh algorithm can also be implemented with the Divide-and-Conquer method, so that the optimal alignment can be calculated with linear space requirements. See Hirschberg's algorithm .

literature

Web links

Individual evidence

  1. Saad Mneimneh: Affine gap penalty function, multiple sequence alignment. (PDF) Retrieved on August 15, 2015 .