Smith-Waterman algorithm

from Wikipedia, the free encyclopedia

The Smith-Waterman algorithm is an algorithm that calculates the optimal local alignment score ( similarity score ) or the optimal local alignment between two sequences . A sequence alignment is a sequence of edit operations (such as character replacement, insertion, deletion) that convert one sequence into the other. The individual operations have a score and the alignment score is defined as the sum of the edit operations scores. A local alignment is a sequence of edit operations to convert a partial sequence of the first sequence into a partial sequence of the other sequence, i.e. H. a sequence of insert and delete operations at the beginning and end can be ignored during optimization if this improves the alignment score. These ignored operations are not part of the local alignment.

The input sequences can be character strings over different alphabets , e.g. B. in bioinformatics , the Smith-Waterman algorithm is applied to DNA sequences or amino acid sequences . One use case is e.g. B. the search for genes (in newly sequenced genomes ) whose sequence is similar to a known gene sequence in another organism, the edit-operations model approximating biological changes during evolution .

The algorithm uses the dynamic programming method and its running time is quadratic. It was developed in 1981 by Temple Smith and Michael S. Waterman and is a variant of the Needleman-Wunsch algorithm that calculates the global alignment .

Local alignment problem

The Smith-Waterman algorithm solves the local alignment problem:

Given are two sequences and , as well as an alignment assessment . We are looking for all optimal local alignments, that is, global alignments of partial sequences and that optimize the evaluation function .

motivation

The computation of the optimal local alignment has a different application than the computation of the optimal global alignment.

The consideration of global alignments is useful if one can assume that the sequences to be compared are relatively similar, e.g. B. Sequences of the same length from a protein family.

However, if you want to look for local similarities in sequences that can be very different in other areas, then it makes more sense to consider local alignments. Because an optimal global alignment could in this case hide these local matches, since it has to maximize its score with regard to the entire sequence, e.g. B. individual motifs in different protein sequences.

Differentiation from the Needleman Wunsch algorithm

The Needleman Wunsch algorithm calculates the global alignment of two sequences. To solve the local alignment problem, two modifications are necessary to the Needleman Wunsch algorithm:

  1. Initialization of the first column and the first row with 0
  2. Maximization over a fourth case, namely 0

The local alignment score is not in the lower right corner of the score matrix, but somewhere in the matrix. It is the entry with the greatest value in the matrix.

The optimal local alignment is obtained by backtracking from the matrix entry with the largest value to a 0 entry in the matrix.

As with the calculation of the global alignment, there can also be several optimal local alignments of two sequences. So several maximum values ​​can exist in the score matrix, and several different backtraces are possible for each optimal value.

Matrix recurrences

The algorithm specification matrix - recurrences :

Input

  • ... strings over an alphabet with
  • … Alignment score function with
    • ... gap character

Recurrences

  • indicates the maximum alignment score between a suffix of the first characters of and a suffix of the first character of at

Efficiency

The runtime complexity of the Smith-Waterman algorithm is in and the memory requirements in . This can easily be derived from the matrix recurrences.

Because you can calculate the score matrix line by line or column by column, you only need to save the current and the last line or column if you only want to calculate the score and not the alignment. In this case, the memory requirement is in or .

If the memory requirement is linear, the local alignment can also be calculated using the Divide-and-Conquer programming method. See Hirschberg's algorithm .

example

Input

  • Sequence a = TCCG
  • Sequence b = ACGA

For the optimal local alignment, one starts with the number and moves back diagonally, which delivers as the result of the alignment (from sequence ) with (from sequence ). This seems trivial in this simple example, but with longer sequences, the result can no longer be read off the information at a glance. CGCG

literature

Web links