Hirschberg algorithm
The Hirschberg algorithm calculates the pairwise sequence alignment and has a linear memory requirement for input . The algorithm developed by Dan Hirschberg in the 1970s uses the method of dynamic programming and the divide-and-conquer principle.
General
The Hirschberg algorithm is a generally applicable and optimal algorithm for finding a sequence alignment. The well-known BLAST algorithm and the FASTA algorithm are only suboptimal heuristics. If you compare the Hirschberg algorithm with the Needleman Wunsch algorithm , the Hirschberg algorithm is less a completely new algorithm, but rather a clever strategy that cleverly uses the Needleman Wunsch algorithm to reduce memory consumption linearize, which is also the special thing about this algorithm: The calculations for a sequence alignment only require a lot of memory space linearly, so the space complexity of the algorithm is in O (n). To calculate an alignment of two character strings and with and , the algorithm has a running time of and a memory consumption of . OBdA should apply in the following so that the space requirement is in .
The algorithm is used, for example, in bioinformatics to compare different DNA or protein sequences .
In a slightly modified form Hirschbergs algorithm is also used to in a graph parallel connected components with effort to compute processors.
Calculation of the Levenshtein distance on linear storage space
To understand the Hirschberg algorithm, it is first important to understand that the Levenshtein distance can be calculated on a linear storage space:
01 := 0 02 for j in 1..n loop 03 := + 04 end loop 05 for i in 1..m loop 06 s := 07 := + 08 c := 09 for j in 1..n loop 10 c := 11 s := 12 := c 13 end loop 14 end loop
In lines 1–4, the one-dimensional field is initialized with n places memory requirements. In line 6, the initialization of the first element is in saved. Then and are initialized with the start value for the next line. The figure below shows a snapshot of a line run. In the inner loop, it always points to the result previously calculated, while the result that is still required is saved in the last line. After line 14, the Levenshtein distance is the result in .
ε ... ε 0 1 2 3 ... 1 ... s = 0 c = = 1
It should be clear that this calculation can also be performed backwards. The imaginary matrix is not run through from left to right and from top to bottom, but from bottom right to top left:
01 := 0 02 for j in n-1..0 loop 03 := + 04 end loop 05 for i in m-1..0 loop 06 s := 07 := + 08 c := 09 for j in n-1..0 loop 10 c := 11 s := 12 := c 13 end loop 14 end loop
Calculation of the alignment on linear storage space
The Divide & Conquer algorithm from Hirschberg calculates an alignment of the character strings and , by combining forward and backward passages with one another (line specifications refer to the pseudocode specified below):
1. If or there is a trivial alignment problem (lines 14-22). A string consisting of only one character must be aligned with another string and an alignment is returned. Is and you go to step 2.
2. A forward pass computes an alignment of and the first half of (lines 27-40). The result of the forward run is a field whose elements indicate the cost of a run from to (with ).
3. A backward pass computes an alignment of with the second half of (lines 42-55). The result is another field , the elements of which indicate the cost of a run back .
4. In the field elements and the two Levenshtein distances stand for the global alignments of and the respective halves of . The remaining elements of are the distances from the first half to all prefixes of . Correspondingly, the remaining elements of contain the distances from the second half to all suffixes of .
5. The Levenshtein distance from zu can now be calculated by walking along the middle row of the alignment matrix and looking for a smallest sum of corresponding - and - elements. If such a minimum sum has been found, a position has been found in the middle line in which the optimal alignment intersects the middle line or the middle of . At this point it is divided into two parts and thus the alignment problem can also be divided into two smaller alignment problems (lines 59-65).
6. Step 1 is called recursively on both parts of and . The two recursive calls return partial alignments that are combined into a single alignment. The alignment is returned (lines 68 and 69).
01 -- 02 -- Der Divide & Conquer-Algorithmus von Hirschberg zur 03 -- Berechnung des globalen Alignments auf linearem Speicher. 04 -- 05 -- Bei besitzt der Algorithmus eine Laufzeit von 06 -- und einen Speicherverbrauch von . 07 -- 08 function HirschbergAlignment(x,y : string) return A is 09 function SubAlignment(,,, : integer) return A is 10 mitte,cut : integer 11 s,c : real 12 : array(..) of real 13 begin 14 if or then 15 -- Konstruiere Matrix T für die Teil-Strings 16 -- x(..) und y(..) 17 -- Achtung: Nur linearer Speicherplatz erforderlich! 18 T := ... 19 -- Berechne triviales Alignment auf Matrix T 20 -- in linearer Laufzeit 21 return Alignment(T,x(..),y(..)) 22 end if 23 24 mitte := 25 -- finde ausgehend von den minimalen Pfad 26 -- mit dem Vorwärtsalgorithmus: 27 := 0 28 for j in .. loop 29 := 30 end loop 31 for i in ..mitte loop 32 s := 33 c := 34 := c 35 for j in .. loop 36 c := 37 s := 38 := c 39 end loop 40 end loop 41 -- finde minimalen score-pfad nach 42 := 0 43 for j in .. loop 44 := 45 end loop 46 for i in ..mitte loop 47 s := 48 c := 49 := c; 50 for j in .. loop 51 c := 52 s := 53 := c 54 end loop 55 end loop 56 -- finde den Punkt aus .. in dem der Minimale Pfad die 57 -- mittlere Zeile schneidet: 58 -- 59 for j in .. loop 60 if j= then 61 cut := 62 elsif then 63 cut := j 64 end if 65 end loop 66 -- Alignment entsteht durch Konkatenation von linkem und 67 -- rechtem Teil-Alignment: 68 return SubAlignment(,,mitte,cut) 69 SubAlignment(mitte,cut,,) 70 end SubAlignment 71 m,n : integer 72 begin 73 m := ; n := 74 -- Sonderbehandlung: ist der leere String und lässt keine Zerteilung zu: 75 if m=0 then 76 return 77 else 78 return SubAlignment(0,0,m,n) 79 end if 80 end HirschbergAlignment
literature
- DS Hirschberg: A linear space algorithm for computing maximally common subsequences . In: Communications of the ACM . tape 18 , no. 6 , 1975, p. 341-343 ( PDF ).
- Chao, KM, Hardison, RC and Miller, W .: Recent developments in linear-space alignment methods: a survey . In: Journal of Computional Biology . No. 4 , 1994, pp. 271-291 ( PDF ).