BLOSUM

The BLOSUM62 matrix

BLOSUM (BLOcks SUbstitution Matrix) is an empirically derived substitution matrix used for the sequence alignment of proteins. Alongside the Point Accepted Mutation matrix (PAM matrix), it plays an important role in bioinformatics. BLOSUM was developed in 1992 by Jorja G. Henikoff and Steven Henikoff. Different matrices exist for different evolutionary distances.

Calculation

BLOSUM are derived from ungapped blocks within the sequences of homologous proteins that are being compared. There are several BLOSUM, designed for different areas of application. BLOSUM with high numbers, such as BLOSUM80, are suited to evolutionarily closely related proteins, and those with low numbers, such as BLOSUM45, are suited to strongly diverged proteins. In accordance with the matrix number, the authors of BLOSUM combined all sequences within a block whose sequence identity exceeded the specified percentage into a single sequence (clustering) in order to reduce the influence of closely related sequences. For example, for BLOSUM80, all sequences with more than 80% sequence identity were merged, so that the remaining sequences compared with one another had at most 80% identity. The log-odds values are entered in the matrix:

$$S_{ij} = \frac{1}{\lambda} \log \frac{p_{ij}}{q_i \, q_j}$$

where $p_{ij}$ is the probability of finding the amino acids $i$ and $j$ aligned with each other, and $q_i$ and $q_j$ are the overall frequencies of the two amino acids. $\lambda$ is a normalization factor, and the values are rounded to whole numbers. The logarithm is therefore greater than zero, and a positive score results, if the two amino acids are found aligned more often than would be expected by chance. For example, the value for a substitution of tryptophan by tyrosine in BLOSUM62 is 2, which is greater than zero. This means that tryptophan is replaced by tyrosine (and vice versa) more often than would be expected by chance, which is consistent with the similar physical and chemical properties of the two amino acids. The highest scores, however, are mostly observed for identities: a tryptophan that remains a tryptophan has a score of 11, and a tyrosine that remains a tyrosine has a score of 7.
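The log-odds calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function name `log_odds_score` and the input frequencies are made up for the example, and the default normalization factor assumes half-bit units (a score of 2 times the base-2 logarithm of the odds ratio), the scaling commonly given for BLOSUM62.

```python
import math

# Assumed scaling: half-bit units, i.e. score = 2 * log2(odds ratio).
HALF_BIT_LAMBDA = math.log(2) / 2


def log_odds_score(p_ij, q_i, q_j, lam=HALF_BIT_LAMBDA):
    """Return the rounded log-odds score (1/lam) * ln(p_ij / (q_i * q_j)).

    p_ij: probability of seeing amino acids i and j aligned with each other
    q_i, q_j: overall frequencies of amino acids i and j
    """
    odds_ratio = p_ij / (q_i * q_j)
    return round(math.log(odds_ratio) / lam)


# Hypothetical frequencies, chosen only to show the sign of the score.
more_than_chance = log_odds_score(0.004, 0.05, 0.04)  # pair seen twice as often as chance
as_chance = log_odds_score(0.002, 0.05, 0.04)         # pair seen exactly as often as chance
less_than_chance = log_odds_score(0.001, 0.05, 0.04)  # pair seen half as often as chance
print(more_than_chance, as_chance, less_than_chance)
```

A pair observed more often than chance gets a positive score, a pair observed as often as chance gets zero, and a pair observed less often gets a negative score.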

The advantage of log-odds scores is that they can be added rather than multiplied, as probabilities normally would be, which makes the calculation numerically simpler. The underlying odds ratio can be recovered, up to rounding, by multiplying the score by the normalization factor and exponentiating.
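The additivity can be written out explicitly. As a sketch, assume an ungapped alignment of length $n$ in which residue $a_k$ is aligned with residue $b_k$ at position $k$ (notation introduced here only for illustration). Also assume unrounded scores $S_{ij} = \frac{1}{\lambda} \ln \frac{p_{ij}}{q_i q_j}$ with the natural logarithm, where $p_{ij}$ is the probability of finding $i$ and $j$ aligned, $q_i$ and $q_j$ are the overall amino acid frequencies, and $\lambda$ is the normalization factor:

$$S_{\text{total}} = \sum_{k=1}^{n} S_{a_k b_k} = \frac{1}{\lambda} \ln \prod_{k=1}^{n} \frac{p_{a_k b_k}}{q_{a_k} \, q_{b_k}}, \qquad \prod_{k=1}^{n} \frac{p_{a_k b_k}}{q_{a_k} \, q_{b_k}} = e^{\lambda S_{\text{total}}}$$

Because the tabulated matrix entries are rounded to whole numbers, exponentiating a summed score recovers the product of odds ratios only approximately.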

Use

High-numbered BLOSUM (e.g. BLOSUM80) are used to compare closely related sequences, while low-numbered BLOSUM are used to compare distantly related proteins. An alignment of two sequences is often scored using a BLOSUM. For example, the following alignment

EKNGFPA
|  |  |
EMQGRWA

has a score of 7 with BLOSUM62.
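The score of 7 for the alignment above can be checked by summing the matrix entry for each aligned pair (5 - 1 + 0 + 6 - 3 - 4 + 4 = 7). The following is a minimal sketch: the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are chosen for this example, and only the seven BLOSUM62 entries needed here are included, not the full matrix.

```python
# Only the BLOSUM62 entries needed for this alignment (not the full matrix).
BLOSUM62_SUBSET = {
    ("E", "E"): 5,
    ("K", "M"): -1,
    ("N", "Q"): 0,
    ("G", "G"): 6,
    ("F", "R"): -3,
    ("P", "W"): -4,
    ("A", "A"): 4,
}


def pair_score(a, b, matrix):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix):
    """Sum the substitution scores over all positions of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment requires sequences of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


score = ungapped_score("EKNGFPA", "EMQGRWA", BLOSUM62_SUBSET)
print(score)  # 7
```

Since the matrix is symmetric, swapping the two sequences gives the same score.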

Algorithms that perform global (Needleman & Wunsch) or local (Smith & Waterman) pairwise sequence alignment often use a BLOSUM as the substitution matrix for protein sequences, although the matrix can be chosen freely. BLAST and FASTA, which search a database for a given sequence, also often use BLOSUM for protein searches. Users are often interested in more than exact hits. When related but non-identical proteins are also being sought, BLOSUM can be used to assess whether the alignment to a particular protein in the database is significant.

Literature

  • Albert Y. Zomaya: Handbook of Nature-Inspired and Innovative Computing: Integrating Classical Models with Emerging Technologies. Springer Science & Business Media, New York 2006, ISBN 0-387-40532-1, p. 673.
  • Sean R. Eddy: Where did the BLOSUM62 alignment score matrix come from? In: Nature Biotechnology. Vol. 22, no. 8, August 1, 2004, pp. 1035-1036, doi:10.1038/nbt0804-1035.

References

  1. In the acronym BLOSUM, the final 'M' already stands for 'Matrix', so speaking of a 'BLOSUM matrix' is strictly a redundant acronym.
  2. S. Henikoff, J. G. Henikoff: Amino acid substitution matrices from protein blocks. In: Proceedings of the National Academy of Sciences of the USA. 89 (22), Nov 15, 1992, pp. 10915-10919. PMID 1438297