Hirschberg algorithm

from Wikipedia, the free encyclopedia

The Hirschberg algorithm calculates the pairwise sequence alignment and has a linear memory requirement for input . The algorithm developed by Dan Hirschberg in the 1970s uses the method of dynamic programming and the divide-and-conquer principle.

General

The Hirschberg algorithm is a generally applicable and optimal algorithm for finding a sequence alignment. The well-known BLAST algorithm and the FASTA algorithm are only suboptimal heuristics. If you compare the Hirschberg algorithm with the Needleman Wunsch algorithm , the Hirschberg algorithm is less a completely new algorithm, but rather a clever strategy that cleverly uses the Needleman Wunsch algorithm to reduce memory consumption linearize, which is also the special thing about this algorithm: The calculations for a sequence alignment only require a lot of memory space linearly, so the space complexity of the algorithm is in O (n). To calculate an alignment of two character strings and with and , the algorithm has a running time of and a memory consumption of . OBdA should apply in the following so that the space requirement is in .

The algorithm is used, for example, in bioinformatics to compare different DNA or protein sequences .

In a slightly modified form Hirschbergs algorithm is also used to in a graph parallel connected components with effort to compute processors.

Calculation of the Levenshtein distance on linear storage space

To understand the Hirschberg algorithm, it is first important to understand that the Levenshtein distance can be calculated on a linear storage space:

01  := 0
02 for j in 1..n loop
03        :=  + 
04 end loop
05 for i in 1..m loop
06       s := 
07        :=  + 
08       c := 
09       for j in 1..n loop
10             c := 
11             s := 
12              := c
13       end loop
14 end loop

In lines 1–4, the one-dimensional field is initialized with n places memory requirements. In line 6, the initialization of the first element is in saved. Then and are initialized with the start value for the next line. The figure below shows a snapshot of a line run. In the inner loop, it always points to the result previously calculated, while the result that is still required is saved in the last line. After line 14, the Levenshtein distance is the result in .

  ε     ...
ε    0  1  2  3  ...
   1

...

s = 0
c =  = 1

It should be clear that this calculation can also be performed backwards. The imaginary matrix is ​​not run through from left to right and from top to bottom, but from bottom right to top left:

01  := 0
02 for j in n-1..0 loop
03        :=  + 
04 end loop
05 for i in m-1..0 loop
06       s := 
07        :=  + 
08       c := 
09       for j in n-1..0 loop
10             c := 
11             s := 
12              := c
13       end loop
14 end loop

Calculation of the alignment on linear storage space

The Divide & Conquer algorithm from Hirschberg calculates an alignment of the character strings and , by combining forward and backward passages with one another (line specifications refer to the pseudocode specified below):

1. If or there is a trivial alignment problem (lines 14-22). A string consisting of only one character must be aligned with another string and an alignment is returned. Is and you go to step 2.

2. A forward pass computes an alignment of and the first half of (lines 27-40). The result of the forward run is a field whose elements indicate the cost of a run from to (with ).

3. A backward pass computes an alignment of with the second half of (lines 42-55). The result is another field , the elements of which indicate the cost of a run back .

4. In the field elements and the two Levenshtein distances stand for the global alignments of and the respective halves of . The remaining elements of are the distances from the first half to all prefixes of . Correspondingly, the remaining elements of contain the distances from the second half to all suffixes of .

5. The Levenshtein distance from zu can now be calculated by walking along the middle row of the alignment matrix and looking for a smallest sum of corresponding - and - elements. If such a minimum sum has been found, a position has been found in the middle line in which the optimal alignment intersects the middle line or the middle of . At this point it is divided into two parts and thus the alignment problem can also be divided into two smaller alignment problems (lines 59-65).

6. Step 1 is called recursively on both parts of and . The two recursive calls return partial alignments that are combined into a single alignment. The alignment is returned (lines 68 and 69).

01 --
02 -- Der Divide & Conquer-Algorithmus von Hirschberg zur
03 -- Berechnung des globalen Alignments auf linearem Speicher.
04 --
05 -- Bei  besitzt der Algorithmus eine Laufzeit von 
06 -- und einen Speicherverbrauch von .
07 --
08 function HirschbergAlignment(x,y : string) return A is
09        function SubAlignment(,,, : integer) return A is
10                mitte,cut : integer
11                s,c : real
12                 : array(..) of real
13        begin
14                if  or  then
15                        -- Konstruiere Matrix T für die Teil-Strings
16                        -- x(..) und y(..)
17                        -- Achtung: Nur linearer Speicherplatz erforderlich!
18                        T := ...
19                        -- Berechne triviales Alignment auf Matrix T
20                        -- in linearer Laufzeit
21                        return Alignment(T,x(..),y(..))
22                end if
23
24                mitte := 
25                -- finde ausgehend von  den minimalen Pfad
26                -- mit dem Vorwärtsalgorithmus:
27                 := 0
28                for j in .. loop
29                         := 
30                end loop
31                for i in ..mitte loop
32                        s := 
33                        c := 
34                         := c
35                        for j in .. loop
36                                c := 
37                                s := 
38                                 := c
39                        end loop
40                end loop
41                -- finde minimalen score-pfad nach 
42                 := 0
43                for j in .. loop
44                         := 
45                end loop
46                for i in ..mitte loop
47                        s := 
48                        c := 
49                         := c;
50                        for j in .. loop
51                                c := 
52                                s := 
53                                 := c
54                        end loop
55                end loop
56                -- finde den Punkt aus .. in dem der Minimale Pfad die
57                -- mittlere Zeile schneidet:
58                -- 
59                for j in .. loop
60                        if j= then
61                                cut := 
62                        elsif  then
63                                cut := j
64                        end if
65                end loop
66                -- Alignment entsteht durch Konkatenation von linkem und
67                -- rechtem Teil-Alignment:
68                return SubAlignment(,,mitte,cut)
69                                 SubAlignment(mitte,cut,,)
70        end SubAlignment
71        m,n : integer
72 begin
73        m := ; n := 
74        -- Sonderbehandlung:  ist der leere String und lässt keine Zerteilung zu:
75        if m=0 then
76                return 
77        else
78                return SubAlignment(0,0,m,n)
79        end if
80 end HirschbergAlignment

literature

  • DS Hirschberg: A linear space algorithm for computing maximally common subsequences . In: Communications of the ACM . tape 18 , no. 6 , 1975, p. 341-343 ( PDF ).
  • Chao, KM, Hardison, RC and Miller, W .: Recent developments in linear-space alignment methods: a survey . In: Journal of Computional Biology . No. 4 , 1994, pp. 271-291 ( PDF ).