Nussinov algorithm

An optimal secondary structure of an RNA sequence from a virus genome calculated by the Nussinov algorithm. It has 18 base pairs and there are 41 further co-optimal secondary structures of this input sequence with 18 base pairs.

The Nussinov algorithm is a simple algorithm for calculating the maximum possible number of base pairs in an RNA sequence and one or more possible secondary structures of this RNA sequence. Because of its simple model assumptions, its biological importance is low, but it is used in bioinformatics didactics as a simple example for dynamic programming and serves as a starting point for more complex models.

algorithm

model

The algorithm models an RNA sequence and the base pairs within this sequence as a planar graph , i.e. without pseudoknodes . Between the bases of a single base pair there is at least one further base, i. that is, the loop of a hairpin structure consists of at least one base. ${\ displaystyle s}$

The sequence of the individual bases is given as a character string with the length . In this case, referred to the character at the position , and the partial sequence of the characters from the point to point . This is synonymous with and is . Furthermore, assume an empty string of length 0. ${\ displaystyle s}$ ${\ displaystyle n}$ ${\ displaystyle s [i]}$ ${\ displaystyle i}$ ${\ displaystyle s [i..j]}$ ${\ displaystyle i}$ ${\ displaystyle j}$ ${\ displaystyle s [i..i]}$ ${\ displaystyle s [i]}$ ${\ displaystyle s [1..n]}$ ${\ displaystyle s}$ ${\ displaystyle s [i..i-1]}$

The matrix of the size contains the number of the maximum possible base pairs of the partial sequences for the sequence . The matrix element accordingly designates the number of maximum possible base pairs for the entire sequence. ${\ displaystyle N [i, j]}$ ${\ displaystyle n \ times n}$ ${\ displaystyle s [i..j]}$ ${\ displaystyle i \ leq j}$ ${\ displaystyle s}$ ${\ displaystyle N [1, n]}$

The function returns 1 if and are complementary bases , otherwise 0. ${\ displaystyle \ sigma (i, j)}$ ${\ displaystyle s [i]}$ ${\ displaystyle s [j]}$

Pseudo-nodes are not allowed in the model, i. i.e., for two base pairs and holds or ${\ displaystyle (s [a]; s [b])}$ ${\ displaystyle (s [c]; s [d])}$ ${\ displaystyle a <b <c <d}$ ${\ displaystyle a> b> c> d}$

Breakdown into smaller sub-problems

The elements of the matrix are calculated by first assuming that all elements except the element describing the sequence are known. The sequence can be formed by adding the base to the sequence . This base can either form a base pair with another element of the sequence or not: ${\ displaystyle N}$ ${\ displaystyle N [1, n]}$ ${\ displaystyle s [1, n]}$ ${\ displaystyle s [1, n]}$ ${\ displaystyle s [1, n-1]}$ ${\ displaystyle s [n]}$

If no base pair is formed, then it must be and the problem is solved. ${\ displaystyle N [1, n] = N [1, n-1]}$
If a base pair is formed, this base pair can be formed between and one of the bases from the partial sequence . Assuming that the base pair is formed between and with the sequence divides into the further parts and . The number of maximum possible base pairs can in turn be calculated for these two parts by starting the algorithm for these parts again. The sum of the two parts plus the base pair formed between and gives a possible candidate value for the maximum sum. The value for should be maximum, so the candidate must be calculated for each allowed . The highest value that can be achieved in this way guarantees that it will also be maximum. So is ${\ displaystyle s [n]}$ ${\ displaystyle s [1..n-2]}$ ${\ displaystyle s [k]}$ ${\ displaystyle s [n]}$ ${\ displaystyle i \ leq k \ leq n-2}$ ${\ displaystyle s [1..k-1]}$ ${\ displaystyle s [k + 1..n-1]}$ ${\ displaystyle s [k]}$ ${\ displaystyle s [n]}$ ${\ displaystyle N [1, n]}$ ${\ displaystyle 1 \ leq k \ leq n-2}$ ${\ displaystyle N [1, n]}$

{\ displaystyle N [1, n] = \ max {\ begin {Bmatrix} \ max _ {1 <k <n-1} (N [1, k-1] + N [k + 1, n-1] +1) \ cdot \ sigma (k, n) & \\ N [2, n-1] \ cdot \ sigma (1, n) & \ end {Bmatrix}}}

and the problem is also solved. The lower term of the maximum value calculation deals with the special case of a base pair between the first and the last element of the sequence, whereby one of the partial sequences is empty ( ). Both options listed are checked and the higher number of base pairs that can be achieved is the result of the calculation. ${\ displaystyle s [0,0]}$

In this way, the algorithm reduces the sequence into smaller and smaller partial sequences until these can be calculated immediately. The intermediate results are then used to calculate the next larger partial sequences.

initialization

The partial sequences of length 2, length 1 and length 0 contain a maximum of 0 base pairs: ${\ displaystyle s [i, i + 1]}$ ${\ displaystyle s [i, i]}$ ${\ displaystyle s [i, i-1]}$

{\ displaystyle N [i, i + 1] = 0}

For

{\ displaystyle 1 \ leq i <n}

{\ displaystyle N [i, i] = 0}

For

{\ displaystyle 1 \ leq i \ leq n}

{\ displaystyle N [i, i-1] = 0}

For

{\ displaystyle 1 <i \ leq n}

Recursion

The following applies to the other elements of the matrix, provided that : ${\ displaystyle N [i, 0] = 0}$

{\ displaystyle N [i, j] = \ max {\ begin {Bmatrix} N [i, j-1] & \\\ max _ {i \ leq k <j-1} (N [i, k-1 ] + N [k + 1, j-1] +1) \ cdot \ sigma (k, j)) & \ end {Bmatrix}}}

With

{\ displaystyle 1 \ leq i <n, 1 <j <n, i <j-1}

After the algorithm has ended, the element of the matrix is the maximum possible number of base pairs of the substring of . So the maximum number of base pairs of the entire input sequence is in store. ${\ displaystyle N [i, j]}$ ${\ displaystyle N}$ ${\ displaystyle s [i..j]}$ ${\ displaystyle s}$ ${\ displaystyle s}$ ${\ displaystyle N [1, n-1]}$

The case distinction in recursion distinguishes between two cases. Either the substring , for which the maximum possible number of base pairs has already been calculated, is extended by a base that does not pair with another base. Or the base pairs with a complementary base in the substring . In the second case there are possible bases with which a base pair could form. The choice of the complementary base divides the substring into two substrings and , for which the maximum possible number of base pairs is recursively calculated. The function has the value if the base is complementary to , otherwise . ${\ displaystyle s [i..j-1]}$ ${\ displaystyle s [j]}$ ${\ displaystyle s [i..j-1]}$ ${\ displaystyle k}$ ${\ displaystyle s [j]}$ ${\ displaystyle s [j]}$ ${\ displaystyle s [i..j-1]}$ ${\ displaystyle s [i..k-1]}$ ${\ displaystyle s [k + 1..j-1]}$ ${\ displaystyle \ sigma (k, j)}$ ${\ displaystyle 1}$ ${\ displaystyle s [k]}$ ${\ displaystyle s [j]}$ ${\ displaystyle 0}$

The distinction between cases corresponds to context-free grammar

{\ displaystyle X = X. \ mid X (X) \ mid \ epsilon}

where a denotes an unpaired base and the brackets represent placeholders for all possible complementary base pairs. According to this grammar, all structures that the Nussinov algorithm uses to optimize can be derived. ${\ displaystyle.}$

The secondary structures, which contain the maximum base pairs, can be generated by the cell by backtracking . This means that the paths through the matrix are traced that lead to the optimal result in and the optimal secondary structures are generated as a function of these paths. ${\ displaystyle N [0, n-1]}$ ${\ displaystyle N [i, n-1]}$

Efficiency

The algorithm uses a matrix with entries, each entry is optimized using elements. The memory requirement is therefore in the complexity class and the runtime in . ${\ displaystyle {\ tfrac {1} {2}} n (n + 1) + n-1}$ ${\ displaystyle {\ mathcal {O}} (n)}$ ${\ displaystyle {\ mathcal {O}} (n ^ {2})}$ ${\ displaystyle {\ mathcal {O}} (n ^ {3})}$

Demarcation

The above specification of the matrix recurrences corresponds to the representation in Nussinov, 1978. In some cases, more recent literature describes a modification of these recurrences as the Nussinov algorithm (e.g. Durbin, 2006). In Durbin, 2006 the recursion consists of a distinction of 4 cases. This variation does not change the values of the calculated matrix , but then several different paths represent a secondary structure in backtracking, since the changed case distinction is semantically ambiguous. ${\ displaystyle N}$

Biological relevance

The secondary structure , which contains the maximum number of base pairs, is not necessarily the structure that occurs in nature (in a cell). Likewise, in natural RNA folding patterns, pseudo-knots do occur which are ignored from the outset by the Nussinov algorithm. In practice, the secondary structure is therefore predicted differently, for example using the Zuker algorithm with a thermodynamic model, which leads to biologically more meaningful results.

Nevertheless, the Nussinov algorithm is of theoretical interest in research and teaching. For example, the algorithm is used to describe the Waterman-Byers backtracking method for backtracking suboptimal structures using a clear matrix recursion as an example. Working with the algorithm is instructive because it uses the dynamic programming method like other RNA structure prediction algorithms, but is still relatively easy to understand with matrix recursion.

literature

Ruth Nussinov, George Pieczenik, Jerrold R. Griggs, Daniel J. Kleitman: Algorithms for Loop Matchings . In: SIAM Journal on Applied Mathematics . tape 35 , no. 1 , July 1978, p. 68-82 .
Durbin et al .: Biological sequence analysis . Cambridge, 2006, ISBN 0-521-62971-3 , pp. 269-272 .

Individual evidence

↑ Stefan Wuchty, Walter Fontana, Ivo L. Hofacker, Peter Schuster: Complete Suboptimal Folding of RNA and the Stability of Secondary Structures . In: Biopolymers . tape 49 , 1999, pp. 145-165 ( PDF, 317 kB ).

[1] Stefan Wuchty, Walter Fontana, Ivo L. Hofacker, Peter Schuster: Complete Suboptimal Folding of RNA and the Stability of Secondary Structures . In: Biopolymers . tape 49 , 1999, pp. 145-165 ( PDF, 317 kB ).