Substitution matrix

In bioinformatics, the entries of a substitution matrix describe the relative rate at which one amino acid (in the case of a protein matrix) mutates into another in the course of evolution. The entry $a_{ij}$ indicates the relative rate at which amino acid $i$ mutates into amino acid $j$. Some matrices are symmetric, so that $a_{ij} = a_{ji}$. A substitution matrix is often used to assign a score to a particular sequence alignment and thus to determine how good the alignment is. Frequently used substitution matrices are BLOSUM and the Point Accepted Mutation matrix (PAM matrix). Algorithms such as BLAST or FASTA use a substitution matrix when searching a database for similar proteins.

Types of substitution matrices

There are different types of substitution matrices:

Identity matrix
Based on the genetic code
Based on the chemical properties of the amino acids
Based on empirical data (PAM and BLOSUM, as well as VT, MD BlastP and OPTIMA)

The last three types of matrices take into account that certain mutations are more common (more likely) than others. In practice, however, mostly matrices based on empirical data are in widespread use; the best known are the BLOSUM (BLOcks SUbstitution Matrix) and PAM (Percent Accepted Mutations or Point Accepted Mutations) matrices.

Identity matrix

The simplest substitution matrix is the identity matrix, in which all non-identical letters receive the value 0 and all identical letters receive the value 1. The alignment score under this matrix, divided by the length of the alignment, is therefore the fraction of identical positions; multiplied by 100, it gives the percent identity of the two sequences. The matrix looks like this:

$\begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$

This matrix would be very poorly suited to comparing two evolutionarily distant amino acid sequences. However, such a matrix is often used to compare nucleotide sequences (DNA), in which all mutations are treated as similarly likely.
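The relationship between the identity-matrix score and percent identity can be sketched in a few lines of Python. This is a minimal illustration, assuming two already aligned sequences of equal length without gaps; the function names and the example sequences are made up for this sketch.

```python
def identity_score(a: str, b: str) -> int:
    """Score two aligned, equal-length sequences with the identity matrix.

    Each identical pair of letters contributes 1, every other pair 0.
    Assumption: the alignment contains no gap characters.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 if x == y else 0 for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Identity score divided by the alignment length, as a percentage."""
    return 100.0 * identity_score(a, b) / len(a)


# Hypothetical DNA sequences, chosen only for illustration.
seq1 = "ACGTACGT"
seq2 = "ACGTTCGA"

print(identity_score(seq1, seq2))    # 6
print(percent_identity(seq1, seq2))  # 75.0
```

Six of the eight positions match, so the score is 6 and the percent identity is 75.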

Empirical matrices

BLOSUM matrix

The BLOSUM matrices were calculated by Henikoff and Henikoff in 1992. There are several such matrices, which are distinguished only by the number that follows the name. The most commonly used BLOSUM matrix is BLOSUM62. To calculate the BLOSUM62 matrix, related protein sequences that were at most 62% identical were compared. This comparison produces a table of relative substitution frequencies, expressed as log-odds scores.
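A log-odds score compares how often a pair of amino acids is observed in alignments with how often it would be expected by chance. The following is a rough sketch of that idea, not the exact procedure of Henikoff and Henikoff: the function name, the toy frequencies, and the treatment of the pair as ordered are assumptions made for illustration, and the scaling factor of 2 (half-bit units) is likewise stated as an assumption.

```python
import math


def log_odds_score(p_ij: float, q_i: float, q_j: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected).

    p_ij  -- observed frequency of the aligned pair (i, j)   (assumed input)
    q_i   -- background frequency of amino acid i            (assumed input)
    q_j   -- background frequency of amino acid j            (assumed input)
    scale -- scaling factor; 2.0 yields half-bit units       (assumption)
    """
    expected = q_i * q_j
    return round(scale * math.log2(p_ij / expected))


# Toy frequencies, not taken from any real data set.
print(log_odds_score(0.02, 0.1, 0.05))     # pair seen 4x more often than chance -> 4
print(log_odds_score(0.00125, 0.1, 0.05))  # pair seen 4x less often than chance -> -4
```

Positive scores mark substitutions that occur more often than expected by chance, negative scores those that occur less often.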

PAM matrix

The PAM matrix was one of the first amino acid substitution matrices. It was developed by Margaret Dayhoff in the 1970s.

The matrix is calculated by observing the differences between closely related proteins.

The PAM1 matrix gives the substitution rates expected when 1% of the amino acids have changed, i.e. it corresponds to a similarity of 99%. The highest level in common use is PAM250, which corresponds to a sequence similarity of approx. 20%. In practice, higher levels are not used, because at less than 20% one can no longer meaningfully speak of similarity.
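Higher PAM levels are conventionally obtained by multiplying the PAM1 probability matrix by itself, so that PAM250 corresponds to 250 successive PAM1 steps. Because a position can mutate several times, including back to the original residue, 250 steps do not mean that every position differs. The sketch below illustrates this with a made-up two-letter alphabet; the matrix `toy_pam1` and the helper functions are hypothetical and are not Dayhoff's data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1" for a two-letter alphabet (assumption, for illustration only).
# Columns: original letter, rows: replacement letter; each column sums to 1.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a letter is unchanged after 250 steps (diagonal entry).
print(round(toy_pam250[0][0], 3))
```

In this toy case the diagonal entry approaches 0.5, the value expected for two unrelated letters, which mirrors how the diagonal of the real PAM250 matrix is far smaller than that of PAM1.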

For the sake of clarity, the probabilities in a PAM matrix are multiplied by 10,000; for example, in the PAM1 matrix below, the entry 17 means that the probability that glutamic acid (E) will be replaced by alanine (A) is 0.0017, or 0.17%.

A not entirely correct but easy-to-remember reading of PAM is "percentage of accepted mutations".

Example of a PAM1 matrix

      A     R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A  9867     2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R     1  9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N     4     1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D     6     0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C     1     1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q     3     9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E    10     0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G    21     1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H     1     8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I     2     2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L     3     1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K     2    37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M     1     1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F     1     1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P    13     5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S    28    11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T    22     2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W     0     2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y     1     0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V    13     2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

horizontal (columns): original amino acid
vertical (rows): mutated amino acid
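Reading a probability from the PAM1 table above amounts to picking the column of the original amino acid, the row of the replacement, and dividing by 10,000. The sketch below copies only the columns for A and E from the table; the names `AMINO_ACIDS`, `PAM1_COLUMNS` and `substitution_probability` are chosen for this illustration.

```python
# Row order of the table, top to bottom.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Columns "A" and "E" of the PAM1 table above, read top to bottom.
PAM1_COLUMNS = {
    "A": [9867, 1, 4, 6, 1, 3, 10, 21, 1, 2, 3, 2, 1, 1, 13, 28, 22, 0, 1, 13],
    "E": [17, 0, 6, 53, 0, 27, 9865, 7, 1, 2, 1, 7, 0, 0, 3, 6, 2, 0, 1, 2],
}


def substitution_probability(original: str, replacement: str) -> float:
    """Probability that `original` is replaced by `replacement` in PAM1."""
    column = PAM1_COLUMNS[original]
    return column[AMINO_ACIDS.index(replacement)] / 10_000


print(substitution_probability("E", "A"))  # 0.0017
print(substitution_probability("E", "E"))  # 0.9865 (E stays unchanged)
print(sum(PAM1_COLUMNS["E"]))              # 10000, i.e. the column sums to 1
```

Each column sums to 10,000 because the original amino acid must end up as exactly one of the twenty amino acids, itself included.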

Example of a PAM250 matrix

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A    13    6    9    9    5    8    9   12    6    8    6    7    7    4   11   11   11    2    4    9
R     3   17    4    3    2    5    3    2    6    3    2    9    4    1    4    4    3    7    2    2
N     4    4    6    7    2    5    6    4    6    3    2    5    3    2    4    5    4    2    3    3
D     5    4    8   11    1    7   10    5    6    3    2    5    3    1    4    5    5    1    2    3
C     2    1    1    1   52    1    1    2    2    2    1    1    1    1    2    3    2    1    4    2
Q     3    5    5    6    1   10    7    3    7    2    3    5    3    1    4    3    3    1    2    3
E     5    4    7   11    1    9   12    5    6    3    2    5    3    1    4    5    5    1    2    3
G    12    5   10   10    4    7    9   27    5    5    4    6    5    3    8   11    9    2    3    7
H     2    5    5    4    2    7    4    2   15    2    2    3    2    2    3    3    2    2    3    2
I     3    2    2    2    2    2    2    2    2   10    6    2    6    5    2    3    4    1    3    9
L     6    4    4    3    2    6    4    3    5   15   34    4   20   13    5    4    6    6    7   13
K     6   18   10    8    2   10    8    5    8    5    4   24    9    2    6    8    8    4    3    5
M     1    1    1    1    0    1    1    1    1    2    3    2    6    2    1    1    1    1    1    2
F     2    1    2    1    1    1    1    1    3    5    6    1    4   32    1    2    2    4   20    3
P     7    5    5    4    3    5    4    5    5    3    3    4    3    2   20    6    5    1    2    4
S     9    6    8    7    7    6    7    9    6    5    4    7    5    3    9   10    9    4    4    6
T     8    5    6    6    4    5    5    6    4    6    4    6    5    3    6    8   11    2    3    6
W     0    2    0    0    0    0    0    0    1    0    1    0    0    1    0    1    0   55    1    0
Y     1    1    2    1    3    1    1    1    3    2    2    1    2   15    1    2    2    3   31    2
V     7    4    4    4    4    4    4    5    4   15   10    4   10    5    5    5    7    2    4   17

horizontal (columns): original amino acid
vertical (rows): mutated amino acid