DNA sequence analysis
In molecular biology and bioinformatics, DNA sequence analysis is the automated, computer-aided identification of characteristic regions of a DNA sequence, in particular known and suspected genes. It examines the information on the order and position of the base pairs obtained by DNA sequencing. The results are also called annotations, although sequence analysis is not limited to annotation methods.
The analysis of DNA sequences has been driven by the availability of large amounts of genomic data and the need to interpret them. Many of the methods developed for nucleotide sequences can also be applied, unchanged or with minor modifications, to amino acid sequences, i.e. the primary structure of proteins. Most of these methods belong to the class of string algorithms and, if the biology-specific constraints are disregarded, can even be applied to arbitrary symbol sequences.
Sequence analyses can be motivated by the following problems:
- Sequencing a genome yields data in the form of thousands of relatively short sequences: how can these be assembled?
- Analogous genes, that is, genes whose protein products have similar functions, can show similar patterns in different species; homologous genes can diverge in the course of evolution: can unknown genes in humans be found from knowledge of the homologous genes in the mouse? How far apart are the organisms genetically? How much time has passed in the family tree since they diverged?
- Introns and exons have different patterns and statistics, and gene control regions are often highly conserved: can these regions be distinguished automatically, solely through pattern comparison and statistical analysis of n-tuple frequencies?
- A large part of genomic DNA consists of non-coding DNA, which is characterized by relatively short, very frequently repeated units (repeats): how can these be filtered out so that search algorithms do not produce false-positive or misleading results?
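The n-tuple statistics mentioned in the list above can be illustrated with a minimal Python sketch that counts overlapping n-tuples (often called k-mers) in a sequence and converts the counts to relative frequencies. The function name kmer_frequencies and the example sequence are assumptions chosen for illustration; real analyses would typically compare such profiles between regions rather than inspect a single one.

```python
from collections import Counter


def kmer_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Return the relative frequency of every overlapping k-mer in `sequence`."""
    sequence = sequence.upper()
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    # Slide a window of width k over the sequence, one base at a time.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical example: dinucleotide (2-tuple) frequencies of a short sequence.
freqs = kmer_frequencies("ATATATGC", 2)
for kmer, frequency in sorted(freqs.items()):
    print(kmer, round(frequency, 3))
```

Two regions with clearly different frequency profiles are candidates for belonging to different classes; whether such a difference is meaningful has to be judged with a suitable statistical test.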
Algorithms
String algorithms
One of the most common problems is searching a database for particular subsequences. The search can be for exact matches (string matching algorithms) or for all approximate matches within a given Levenshtein distance of the search string. In English, such a pairing of two strings is called a sequence alignment, which gave the whole family of alignment algorithms its name; the term is also increasingly used untranslated in German. By far the best-known alignment methods are the Needleman-Wunsch algorithm (global alignment), the Smith-Waterman algorithm (local alignment) and the BLAST algorithm (heuristic pairwise alignment).
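As a rough illustration of approximate matching within a given Levenshtein distance, the following Python sketch uses a simple dynamic-programming scheme in which a match may start at any position in the text. It is not one of the named algorithms above and is not optimized for database-scale searches; the function name approximate_matches and the example strings are assumptions made for this sketch.

```python
def approximate_matches(pattern: str, text: str, max_distance: int) -> list[tuple[int, int]]:
    """Return (end, distance) pairs for each text position `end` (exclusive)
    at which some substring ending there is within `max_distance` edits
    of `pattern`."""
    m = len(pattern)
    # previous[i]: smallest edit distance between pattern[:i] and any
    # substring of the text ending at the previous text position.
    previous = list(range(m + 1))
    hits = []
    for end, char in enumerate(text, start=1):
        # The empty pattern prefix matches anywhere at zero cost,
        # which lets a match begin at any text position.
        current = [0]
        for i in range(1, m + 1):
            substitution = previous[i - 1] + (pattern[i - 1] != char)
            skip_text_char = previous[i] + 1
            skip_pattern_char = current[i - 1] + 1
            current.append(min(substitution, skip_text_char, skip_pattern_char))
        if current[m] <= max_distance:
            hits.append((end, current[m]))
        previous = current
    return hits


# Hypothetical example: search for "GATT" in a short sequence.
exact_hits = approximate_matches("GATT", "ACGATTACA", 0)
near_hits = approximate_matches("GATT", "ACGATTACA", 1)
print(exact_hits)
print(near_hits)
```

With max_distance set to 0 the search reduces to exact matching; larger values also report neighbouring end positions whose substrings differ from the pattern by a few substitutions, insertions or deletions.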