Sequence alignment

Sequence alignment (from the Latin sequentia , “order” and English alignment , “adjustment, arrangement, alignment”) describes the methodical comparison of two or more nucleotide or amino acid sequences in a linear sequence. Sequence alignment is a branch of pattern matching . It is used in molecular phylogeny to study the functional or evolutionary relationship ( homology ) of nucleotide sequences or amino acid sequences. In technical terms, instead of the anglicized "alignment" , the Germanized terms Alignment or Alinierung used.

The principle

In the case of sequence alignment, the linearly following elements of an examined string ("character sequence", from English string , "cord, string") from the data set of a nucleotide or amino acid sequence are assigned to another string.

In particular, there are automated sequence alignment methods for large data sets of nucleotide or amino acid sequences. However, you can also manually align smaller data sets. The manual method enables greater care and the exclusion of highly variable and thus non-alignable positions that would interfere with later analyzes.

In the case of sequence alignment, the order of the linearly following elements should be retained and each element should be assigned in pairs to a different element or a gap ("space, gap") in each string. Mismatches are interpreted as mutations , while gaps are interpreted as gene deletions or insertions ( indel ). The elements assigned to one another should be identical or biochemically similar as possible, because many identical or similar elements in the same order indicate an evolutionary or functional relationship. The similarity of the elements is usually given and depends on the properties of the data or substitution matrices used . So that a meaningful alignment is possible and since the sequences examined are often of different lengths, artificially gaps can be inserted into the strings.

A sequence alignment, created with ClustalW , between two human zinc finger proteins from GenBank

The alignment of two sequences is called pair-wise alignment , and that of several sequences is called multiple alignment . In the case of pairwise alignment, a distinction is also made between global, local and semi-global alignment.

Cost function for automated alignment

The task of an alignment algorithm is to find an alignment that is optimal under a cost function ( alignment score ). The cost function in bioinformatics is usually specified in a so-called " similarity matrix ", also known as an "exchange matrix " ( similarity matrix or mutation data matrix ). For each pair of sequence elements to be compared, these indicate how likely it is that this pairing has arisen through evolution. Identical elements are rated highly, "similar" elements are rated less highly, and strongly different elements reduce the score . The consequence of the cost function is that the computationally best alignment is also that which would be expected if the two sequences were homologous . Gaps are a problem; To date, there are no precise mathematical models for the formation of insertions or deletions in biological sequences. For this reason, empirically motivated, so-called affine gap scores are used , which subtract a constant contribution for the introduction of each gap from the result and a further contribution that increases linearly with the length of the gap.

example

-AAACGG
AAAACCG

The alignment of two short DNA sequences shown above shows at the first position ( -A) that a gap can be inserted to compensate for differences in length. The gap was inserted at the beginning of the upper sequence and not in the middle, because from a biology point of view it is more likely that a sequence will mutate at the ends than in the middle.

In the penultimate position, C and G were aligned, as mutations are quite possible in the DNA in which a G is randomly incorporated instead of a C, or vice versa. It would also have been possible to align G and C each with a gap in the other sequence. This decision depends on the cost function used.

In protein sequence alignment, the amino acid sequences correspond to the strings. The cost functions for the similarities between the individual amino acids are somewhat more complex than for DNA.

Pairwise alignment

An alignment of two symbol sequences is a sequence of edit operations that describe a transformation from to . A symbol can be replaced by another symbol ( mismatch ), a symbol can be deleted ( deletion ) or a symbol can be inserted ( insert ). The symbols involved in an edit operation are usually written over one another, the symbol denoting a gap created by a symbol inserted or deleted in the other sequence. ${\ displaystyle S, T}$ ${\ displaystyle S}$ ${\ displaystyle T}$ -

Each edit operation can be assigned a value using an evaluation function. The sum of the values of all edit operations of an alignment is called the alignment score.

An optimal alignment is an alignment whose score is optimal under an evaluation scheme.

A distinction is sometimes made between the terms score and edit distance. The term score or distance is then used in a scheme that evaluates sequence similarities or sequence differences by positive values. In other words, according to this distinction, an optimal score or an optimal distance is maximum or minimum.

An example of a valuation scheme is the unit costs. The evaluation function is defined as: ${\ displaystyle \ delta}$

${\ displaystyle \ delta (a, b) = {\ begin {Bmatrix} 0 & a = b ~~ {\ text {match}} \\ 1 & a \ neq b ~~ {\ text {mismatch}} \ end {Bmatrix}} }$

${\ displaystyle \ delta (a, -) = 1 ~~ {\ text {Deletion}}}$

${\ displaystyle \ delta (-, b) = 1 ~~ {\ text {Insertion}}}$

Global alignment

With a global alignment between two sequences, all symbols are taken into account. Global alignments are mainly used when the sequences to be examined are of similar length and strong sequence homologies are expected.

The calculation of the optimal alignment score or the optimal alignment is an optimization problem that can be solved efficiently (running time in ) with the pairwise alignment using the dynamic programming method ( Needleman-Wunsch algorithm ) . ${\ displaystyle O (n ^ {2})}$

example

Given: Two sequences S and T, unit costs as an evaluation scheme
Assumption: S and T have common ancestors (are homologous).
Question: Which positions in S and T are homologous?

For S = GAC and T = GC are possible solutions:

possibility	Alignment	Score
1	GAC GC-	`0+1+1=2`
2	GAC-- ---GC	`1+1+1+1+1=5`
3	GAC G-C	`0+1+0=1`

Free-shift alignment

The free-shift alignment (also known as semi-global alignment or end-gap free alignment) of two sequences is a variant of global alignment in which a sequence of insertions or deletions at the beginning or at the end of the alignment is not included in the calculation of the score be taken into account. The calculation of the optimal free-shift alignment can make more sense in certain applications than the calculation of the optimal global alignment if, for example, one sequence is significantly longer than the other and a protruding suffix or prefix is irrelevant.

Local alignment

A local alignment of two sequences is a global alignment of a partial sequence (substring) of and a partial sequence of . In other words, in order to calculate the optimal local alignment of two sequences, the two partial sequences must be found whose optimal alignment score is maximum. Application examples for the calculation of local alignments are the search for the same sequence motifs or domains in proteins. The classic algorithm for calculating optimal local alignments is the Smith-Waterman algorithm . ${\ displaystyle S, T}$ ${\ displaystyle S}$ ${\ displaystyle T}$

Multiple sequence alignment

While the optimum alignment of two sequences using a computer can be quite fast (i.e.,.. In polynomial time) calculated exactly ( runtime O , n and m are the lengths of the sequences), this is when multiple sequence alignment ( engl. Multiple sequence alignment ) is not more possible, since the running time of the algorithm for the exact calculation of the multiple alignment grows exponentially with the number of sequences ( where k is the number of sequences and n is the longest of the sequences to be compared). ${\ displaystyle (nm)}$ ${\ displaystyle O (2 ^ {k} n ^ {k})}$

However, in order to be able to calculate a biologically or evolutionarily meaningful alignment from which similarities and differences in sequence, structure and function can actually be derived, one needs many long sequences. That is why heuristics are used, for example so-called progressive strategies (also called hierarchical methods ). First of all, all the optimal pairwise alignments of the sequences to be examined are calculated and a phylogenetic tree (the so-called guide tree ) is derived from this by cluster analysis (for example using a neighbor joining algorithm ). A multiple alignment is finally determined step by step ( progressively , according to the principle of a greedy algorithm ) along this tree , whereby the optimal solution is not guaranteed by this heuristic procedure.

Alignment algorithms

Needleman Wunsch algorithm (global alignment)
Hirschberg algorithm (global alignment on linear storage space)
Smith-Waterman algorithm (local alignment)
Gotoh algorithm (global alignment with affine gap costs)

Heuristic algorithms for pairwise alignment:

BLAST
FASTA

Heuristic algorithms for multiple alignment:

Popular fragment-based method DIALIGN-TX
Hierarchical methods ( e.g. Feng-Doolittle )
PSI-BLAST

software

Frequently used programs for general sequence alignments are ClustalW , MAFFT and T-Coffee as well as BLAST for database searches .

Online Interface
The STRAP program (software) integrates almost all freely available programs for calculating sequence alignments. These are installed automatically and can then be called up with a comfortable graphical user interface. This saves the user from having to individually install and learn the command line syntax of the individual programs. Because large alignments can be time-consuming to compute, results of lengthy computations are cached. If 3D structures are also available for at least two of the proteins, the combined use of sequence alignment and 3D structure overlay is recommended.

literature

Michael S. Waterman: Introduction to Computational Biology: Maps, Sequences and Genomes , 1995 Chapman & Hall, ISBN 0-412-99391-0 .
Dan Gusfield: Algorithms on Strings, Trees, and Sequences , 1997 Cambridge University Press, ISBN 0-521-58519-8 .
Andreas D. Baxevanis & BF Francis Ouellette (eds.): Bioinformatics: a practical guide to the analysis of genes and proteins , 2001 Wiley-Interscience, ISBN 0-471-38391-0 .
Andrea Hansen: Bioinformatics: A Guide for Natural Scientists , 2004 Birkhaeuser Verlag, ISBN 3-7643-6253-7 .
Roland Fleißner: Sequence alignment and phylogenetic inference , 2004 Logos Berlin, ISBN 3-8325-0588-1 .
Bernhard Haubold, Thomas Wiehe: Introduction to Computational Biology. An Evolutionary Approach , 2006 Birkhaeuser Verlag, ISBN 3-7643-6700-8 .