BLAST algorithm

from Wikipedia, the free encyclopedia
Figure 1: Schematic flow of a BLAST query.

BLAST (short for Basic Local Alignment Search Tool) is the umbrella term for a collection of the world's most widely used programs for analyzing biological sequence data. BLAST compares experimentally determined DNA or protein sequences with sequences already available in a database. The program returns a series of local alignments, i.e. comparisons of segments of the query sequence with similar segments from the database. BLAST also reports how significant each hit is. The database can be searched either through a web interface or with stand-alone programs that can be installed locally.

BLAST was developed by Stephen Altschul, Warren Gish, David J. Lipman, Webb Miller and Eugene Myers at the National Institutes of Health. Samuel Karlin was also involved in developing the algorithm.

Functionality

The algorithm rests on the observation that significant alignments are very likely to contain short stretches of high identity. These stretches are then extended step by step in search of better and longer alignments.

Because these stretches are kept short, the query sequence can be preprocessed before the search: a table is built of all words it contains, each stored with its position in the original sequence.
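The word table described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_word_table`, the default word size of 3, and the sample sequence are assumptions made for the example and are not taken from the BLAST programs themselves.

```python
from collections import defaultdict


def build_word_table(query: str, word_size: int = 3) -> dict[str, list[int]]:
    """Map every word of length word_size in the query to its start positions."""
    table = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        word = query[pos:pos + word_size]
        table[word].append(pos)  # remember where the word came from
    return dict(table)


# Hypothetical protein fragment; the word "KVL" occurs twice.
table = build_word_table("MKVLAKVL")
```

Looking up a word in the table takes constant time, which is what makes scanning a large database against the query affordable.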

The algorithm builds a list of all neighboring words of fixed length, i.e. all words that score above a chosen threshold when aligned with a word of the query sequence. The target database is then scanned for the words in this list, and every hit found is extended in both directions to find the maximal contiguous match around it.
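The following Python sketch illustrates the three steps just described: building the neighborhood word list, scanning a subject sequence for those words, and extending each hit without gaps in both directions. It is a toy model under stated assumptions, not the actual BLAST implementation: it uses a simple match/mismatch score on nucleotides instead of a substitution matrix, and the names `neighborhood`, `extend`, `best_hsp`, `threshold` and `drop_off` are chosen for this example only.

```python
from itertools import product

ALPHABET = "ACGT"
MATCH, MISMATCH = 2, -1  # toy scores (assumption), not a real scoring matrix


def score(a: str, b: str) -> int:
    """Ungapped score of two equally long words."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def neighborhood(query: str, word_size: int, threshold: int) -> dict:
    """Map each word scoring >= threshold against a query word to query positions."""
    table = {}
    for pos in range(len(query) - word_size + 1):
        qword = query[pos:pos + word_size]
        for cand in map("".join, product(ALPHABET, repeat=word_size)):
            if score(qword, cand) >= threshold:
                table.setdefault(cand, []).append(pos)
    return table


def extend(query, subject, qpos, spos, word_size, drop_off):
    """Extend a seed in both directions; stop once the running score
    falls more than drop_off below the best score seen so far."""
    best = score(query[qpos:qpos + word_size], subject[spos:spos + word_size])
    grown = []
    # Right side starts just past the seed, left side just before it.
    for q0, s0, step in ((qpos + word_size, spos + word_size, 1),
                         (qpos - 1, spos - 1, -1)):
        cur, best_len, i = best, 0, 0
        while 0 <= q0 + i * step < len(query) and 0 <= s0 + i * step < len(subject):
            same = query[q0 + i * step] == subject[s0 + i * step]
            cur += MATCH if same else MISMATCH
            i += 1
            if cur > best:
                best, best_len = cur, i
            elif best - cur > drop_off:
                break
        grown.append(best_len)
    right, left = grown
    return best, qpos - left, spos - left, word_size + left + right


def best_hsp(query, subject, word_size=3, threshold=3, drop_off=3):
    """Return (score, query_start, subject_start, length) of the best hit."""
    table = neighborhood(query, word_size, threshold)
    hits = set()
    for spos in range(len(subject) - word_size + 1):
        for qpos in table.get(subject[spos:spos + word_size], []):
            hits.add(extend(query, subject, qpos, spos, word_size, drop_off))
    return max(hits) if hits else None


result = best_hsp("ACGTACGT", "TTACGTACGTTT")
```

With a threshold of 3 and these toy scores, a word with one mismatch still counts as a neighbor, so seeds are found even where the subject does not match the query exactly. Raising the threshold makes the search faster but less sensitive.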

The main application of BLAST is the search for paralogous and orthologous genes and proteins within one or more organisms.

Methods (selection)

Method: Description

blastp: Compares an amino acid sequence against a protein sequence database.

PSI-BLAST (Position-Specific Iterative BLAST): Used to identify distant relatives of a protein. First, a list of all closely related proteins is compiled. From these proteins a profile is built, a kind of averaged sequence. This profile is then used as the query for another search of the protein database, which returns a larger group of similar sequences. A new profile can be built from this group and the process repeated as often as desired. Because related proteins are included in the search, PSI-BLAST is much more sensitive in detecting distant relationships than ordinary protein-protein BLAST.

blastn: Compares a nucleotide sequence against a nucleotide sequence database.

blastx: Compares a nucleotide sequence (translated in all reading frames) against a protein database. This can be used to find a possible translation product of a known nucleotide sequence.

tblastn: Compares a protein sequence against a nucleotide database (dynamically translated in all reading frames).

tblastx: Compares the six-frame translations of a nucleotide sequence against the six-frame translations of a nucleotide sequence database. tblastx cannot be used with the nucleotide database on the BLAST website because it is computationally very expensive.

megablast: Recommended for finding sequences that are identical or nearly identical to the query sequence. megablast was designed specifically to compare particularly long sequences with their counterparts in the database.

discontiguous megablast: Recommended for finding matches between divergent sequences, e.g. sequences from different organisms, that have a low match rate.

cdart: Uses the CDD (conserved domain database, which imports entries from SMART and Pfam) to search for sequences whose arrangement of protein domains is as similar as possible to that of the query protein and its domains.
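The translation "in all reading frames" used by blastx, tblastn and tblastx can be illustrated with a short Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names and the frame numbering (+1 to +3 for the forward strand, -1 to -3 for the reverse complement) are conventions chosen for this example, not output of the BLAST programs.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict[int, str]:
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each nucleotide sequence yields six protein sequences, so a tblastx search compares six query translations against six translations of every database entry, which is why it is so much more expensive than blastn.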

Search results

The similarity between the query sequence and a hit is assessed using the score and the E-value.

The score is a quantitative measure of the similarity between the query sequence and a known sequence (the higher the score, the greater the similarity of the sequences).

The E-value gives the number of hits with a score at least as large as the observed one that would be expected by chance (the smaller, the better).
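The relationship between score and E-value can be made concrete with the Karlin-Altschul formula E = K · m · n · e^(-λS), where S is the raw score, m the query length and n the database length. The formula is not given in the text above and is added here as background. The Python sketch below is a minimal illustration; the values used for λ and K are placeholders for the example, since the real values depend on the scoring system.

```python
import math


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float, k: float) -> float:
    """Expected number of chance hits with a score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score, comparable across different scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters (assumptions for illustration only).
LAM, K = 0.318, 0.134
m, n = 250, 50_000_000  # hypothetical query and database lengths

weak = e_value(40, m, n, LAM, K)
strong = e_value(120, m, n, LAM, K)

# The same E-value expressed through the bit score: E = m * n * 2^(-S').
strong_from_bits = m * n * 2 ** (-bit_score(120, LAM, K))
```

Because the score enters the exponent, a modest increase in score lowers the E-value by many orders of magnitude, while a larger database raises it proportionally.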

The identifiers shown in front of and within the search results have the following formats (selection):

GenBank: gi-number|gb|accession|locus
EMBL Data Library: gi-number|emb|accession|locus
DDBJ, DNA Database of Japan: gi-number|dbj|accession|locus
NCBI Reference Sequence: gi-number|ref|accession|locus
SWISS-PROT: gi-number|sp|accession|name
General database identifier: database|identifier
Local sequence identifier: identifier

Note: The gi number is a sequence of digits that identifies a database entry at NCBI.
