Protein structure prediction

Protein structure prediction covers all computational methods for determining the three-dimensional structure of a folded protein from its amino acid sequence. It is one of the important goals of bioinformatics and theoretical chemistry. The need for it arises from the practical difficulty of measuring the atomic structure of a native protein with physical methods. Exact atomic positions within the tertiary structure are in particular demand, because they form the basis for drug design and other methods of biotechnology.

The methods developed so far start from the primary structure and postulate the secondary structure, the tertiary structure, or both. A separate problem is determining the quaternary structure from available tertiary structure data. Implementations of the algorithms are mostly available as source code or as web servers. Because of the enormous importance of a definitive solution to the problem, the CASP competition has been held every two years since 1994 to compare the best prediction methods.

Motivation

Determining the native protein structure with physical methods is possible for many, but by no means all, proteins, and it is costly and time-consuming. By 2012, the structures of around 50,000 different proteins had been determined by NMR and X-ray structure analysis (the number drops to about 30,000 if only proteins that differ by more than 10 percent in sequence are counted). This contrasts with an estimated 30 million protein sequences. There is therefore a great need for a reliable, purely computational method for determining the protein structure from the amino acid sequence. The expected acceleration in the sequencing of entire genomes, and even of entire ecological metagenomes, widens the gap between known primary and tertiary structures and makes solving the problem even more urgent.

Secondary structure considerations

Secondary structure prediction is a collection of bioinformatic techniques that aim to predict the secondary structure of proteins and RNA from their primary structure (amino acids or nucleotides). For proteins, the only case discussed below, the prediction consists of labeling sections of the amino acid sequence as probable α-helix, β-sheet, β-turn, or unstructured. Success is measured by comparing the prediction with the output of the DSSP algorithm applied to the actual structure. Besides these general structural motifs, there are also algorithms for recognizing special, well-defined motifs such as transmembrane helices or coiled coils.

The best modern methods of secondary structure prediction achieve about 80 percent accuracy, which makes them useful for fold recognition, ab initio structure prediction, and sequence alignment. Progress in the accuracy of secondary structure prediction methods is documented by weekly benchmarks such as LiveBench and EVA.
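
The accuracy figure quoted above is usually a per-residue agreement score between the predicted states and the DSSP assignment. The following is a minimal sketch of such a three-state score (often called Q3). The function names, the example strings, and the particular reduction of DSSP's eight classes to three states are assumptions made for illustration; other reductions are in use.

```python
# One common (but not the only) reduction of DSSP's eight classes to three
# states: helix (H), strand (E), and everything else (C). This is an assumption.
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp_string: str) -> str:
    """Map a DSSP string to three states; unknown codes and '-' become 'C'."""
    return "".join(DSSP_TO_THREE.get(code, "C") for code in dssp_string)


def q3_accuracy(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted state equals the reference state."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical data: a DSSP assignment and a three-state prediction.
reference_dssp = "-SHHHHGGT-EEEE--"
predicted = "CCHHHHHCCCEEEECC"

reference = reduce_dssp(reference_dssp)
accuracy = q3_accuracy(predicted, reference)
print(f"Q3 = {accuracy:.3f}")
```

A score of this kind only counts matching positions; it does not reward getting segment boundaries or segment counts right, which is one reason benchmarks often report additional measures.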

Tertiary structure considerations

Because a complete ab initio calculation of a protein structure with purely physical, energetic and quantum chemical methods is too time-consuming even for small proteins, structure prediction algorithms have become established that rely either on a classification of individual parts of the amino acid sequence or on predicted contact maps, and that calculate the final atomic positions only in a second step.

Structure classes / domains

Various statistical methods have emerged for classifying unknown proteins. The most successful use hidden Markov models, which have also proved successful in speech recognition. The resulting assignments can be downloaded from databases such as Pfam and InterPro. If the structure of one member of a class is already known, the structures of the other members can be calculated by comparative prediction. Otherwise, a newer approach is to predict the contact map of the structure class, which no longer depends on physical structure determination.
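
How a hidden Markov model scores a sequence can be sketched with the forward algorithm. The example below is a toy, not the profile HMM architecture used by family databases: the two hidden states, the two-letter alphabet (h = hydrophobic, p = polar), and all probabilities are invented for illustration. A sequence is compared against a uniform background model by a log-odds score.

```python
import math


def forward_log_likelihood(observations, states, start_p, trans_p, emit_p):
    """Log-probability of the observations under an HMM (forward algorithm)."""
    # alpha[s] = P(observations so far, current state = s), up to a scale factor
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    log_likelihood = 0.0
    for symbol in observations[1:]:
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(alpha.values())
        log_likelihood += math.log(scale)
        alpha = {s: alpha[s] / scale for s in states}
        alpha = {
            s: emit_p[s][symbol]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return log_likelihood + math.log(sum(alpha.values()))


# Hypothetical toy model: "sticky" buried and exposed segments.
states = ("buried", "exposed")
start_p = {"buried": 0.5, "exposed": 0.5}
trans_p = {
    "buried": {"buried": 0.8, "exposed": 0.2},
    "exposed": {"buried": 0.2, "exposed": 0.8},
}
emit_p = {
    "buried": {"h": 0.9, "p": 0.1},
    "exposed": {"h": 0.2, "p": 0.8},
}


def log_odds(sequence):
    """Model log-likelihood minus a uniform background (0.5 per symbol)."""
    model = forward_log_likelihood(sequence, states, start_p, trans_p, emit_p)
    return model - len(sequence) * math.log(0.5)


blocky = "hhhhpppp"        # matches the sticky-segment model
alternating = "hphphphp"   # does not
print(log_odds(blocky), log_odds(alternating))
```

In this toy setting a positive log-odds score means the sequence is better explained by the model than by the background, which is the basic decision a family classifier makes.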

Prediction from evolutionary information

With the availability of large numbers of genomic sequences, it has become possible to study the coevolution of amino acids in protein families. One can assume that, within a structurally conserved protein family, the three-dimensional structure of the proteins does not change significantly in the course of evolution. The folding of the protein results from the interactions between the individual amino acids. If one of the amino acids changes as a result of a mutation, the stability of the protein may be reduced and then has to be restored through compensatory (correlated) mutations.

Several statistical methods exist for determining evolutionarily coupled positions within a structurally classified protein family; the multiple sequence alignment of the family serves as input. Early methods used local statistical models that consider only two amino acid positions at a time, which leads to inadequate prediction accuracy because of transitive effects. Examples are McLachlan-based substitution correlation (McBASC), observed minus expected squared (OMES), statistical coupling analysis (SCA), and methods based on mutual information (MI).
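
As an illustration of a local, pairwise measure, the mutual information between two alignment columns can be computed directly from observed frequencies. This is a minimal sketch on an invented four-column alignment; practical implementations typically add sequence weighting, pseudocounts, and corrections for phylogenetic bias, all of which are omitted here.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# Hypothetical alignment: columns 0 and 1 covary perfectly,
# column 2 is fully conserved, column 3 varies independently of the rest.
alignment = [
    "AKLE",
    "AKLD",
    "GRLE",
    "GRLD",
    "AKLE",
    "GRLD",
]
columns = list(zip(*alignment))

for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        print(i, j, round(mutual_information(columns[i], columns[j]), 3))
```

Because each pair is scored in isolation, a position pair that is only indirectly linked through a third position can still receive a high score, which is the transitive effect mentioned above.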

It was only through the use of global statistical approaches such as the maximum entropy method (inverse Potts model) or partial correlations that it became possible to distinguish the causal coevolution between amino acids from indirect, transitive effects. In addition to the superiority of global models for contact prediction, it was shown for the first time in 2011 that the predicted amino acid contacts can be used to predict 3D protein structures from sequence information alone. No related structures or fragments are used, and the calculations can be carried out on a normal computer within a few hours, even for proteins with several hundred amino acids. Subsequent publications showed that transmembrane proteins can also be predicted with considerable accuracy.
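
The difference between a pairwise correlation and a direct coupling can be shown with a three-variable analogy. In the sketch below, which uses invented continuous data rather than amino acid alignments, variable a influences b and b influences c, so a and c are correlated only through b. The partial correlation removes that indirect path. Actual contact prediction methods apply the same idea to categorical data across all alignment columns at once; this example is only meant to convey the principle.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical chain of dependencies: a -> b -> c.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(5000)]
b = [v + rng.gauss(0, 1) for v in a]
c = [v + rng.gauss(0, 1) for v in b]

raw = pearson(a, c)                    # clearly non-zero (transitive effect)
direct = partial_correlation(a, c, b)  # close to zero (no direct coupling)
print(f"raw correlation a-c: {raw:.3f}, partial correlation given b: {direct:.3f}")
```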

Ab initio prediction

Every naive protein structure prediction method (one that uses no prior knowledge) must cope with the astronomical size of the space of possible structures to be searched. The Levinthal paradox illustrates this. Ab initio (also: de novo) methods rely solely on applying physical principles (quantum chemistry) to the known primary structure in order to simulate the folding process. Other methods start from the possible structures and try to optimize a suitable scoring function, which usually involves calculating the Gibbs free energy (Anfinsen's dogma). Such calculations still require a supercomputer and can be carried out only for the smallest proteins. The idea of supplying computing power for ab initio prediction through distributed computing led to the Folding@home, Human Proteome Folding Project, and Rosetta@home projects. Despite the computing power it requires, ab initio prediction is an active area of research.
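
The size of the search space can be made concrete with a back-of-the-envelope calculation in the spirit of the Levinthal paradox. All three input numbers below are illustrative assumptions, not measured values: a chain of 100 residues, three conformations per residue, and a sampling rate of 10^13 conformations per second.

```python
# Illustrative assumptions, not measured values.
residues = 100
conformations_per_residue = 3
samples_per_second = 1e13

total = conformations_per_residue ** residues   # exhaustive search space
seconds = total / samples_per_second
seconds_per_year = 365.25 * 24 * 3600
years = seconds / seconds_per_year

print(f"{total:.2e} conformations, about {years:.1e} years to enumerate")
```

Even under these generous assumptions the enumeration time exceeds the age of the universe (roughly 1.4 x 10^10 years) by many orders of magnitude, which is why practical methods must guide or restrict the search rather than enumerate conformations.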

Comparative prediction

Comparative protein modeling uses known (physically measured) structures as a starting point or template. This works whenever a homologous protein with a known structure exists. Because protein structures did not evolve arbitrarily but are always tied to a biological function, proteins can be grouped into families that are both structurally homologous and functionally uniform, and membership of such a family can readily be determined with machine learning methods such as hidden Markov models (HMMs). Structural biologists, in turn, try to physically measure at least one representative protein for each of these groups so that, ideally, all remaining protein structures can be predicted by comparison.

Homology modeling

Homology modeling has become the established approach in comparative prediction: the amino acid sequence under investigation is mapped onto known protein structures (templates), and the resulting fits are examined. From this it can be deduced which structure the sequence adopts for a given template structure.

The prerequisite is that the template and the target sequence are compatible with a common fold and can be aligned with one another, because sequence alignment is the main problem in comparative modeling. Very similar sequences produce the best results.
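
The alignment step can be sketched with the classic Needleman-Wunsch dynamic programming recurrence for global alignment. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the two short sequences are assumptions chosen for readability; production tools typically use substitution matrices such as BLOSUM62 and affine gap penalties instead.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two sequences (linear gap penalty)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align the two residues
                score[i - 1][j] + gap,               # gap in seq_b
                score[i][j - 1] + gap,               # gap in seq_a
            )
    return score[-1][-1]


# Hypothetical target and template sequences differing by one insertion.
target = "MKTAYIAK"
template = "MKTAYAK"
print(needleman_wunsch_score(target, template))
```

The score only says how well the two sequences can be aligned; building a model additionally requires the traceback through the matrix to recover which target residue is placed on which template position.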

Prediction from contact maps

Dividing proteins into structural groups makes it possible to predict a contact map for a group by calculating coupled positions in its alignment (see above). Structural biologists, for their part, first obtain a contact map when a protein structure is measured physically by NMR. Algorithms for deriving the tertiary structure of a protein from a contact map were therefore developed early on. In principle, it is now possible to predict the structure of any protein sequence reliably, provided that enough sequences from the same protein group are available to determine coupled positions and hence a contact map. With the increasing pace of sequencing, enough bacterial genomes (almost 10,000) are already available to apply the method successfully, for example to model membrane proteins. In some cases the number of eukaryotic sequences is also sufficient, and the situation is improving noticeably.
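
To make the term concrete, the sketch below computes a contact map from known coordinates, which is the reverse of the prediction problem described above. It assumes one representative atom per residue, a distance cutoff of 8 Å, and a minimum sequence separation of three residues; these thresholds are common conventions but are assumptions here, and the hairpin-like coordinates are invented.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Symmetric boolean matrix: True where two residues are within cutoff."""
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        # Skip near neighbors in sequence; they are trivially close in space.
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


# Hypothetical hairpin: four residues going out, four coming back (in Å).
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
contacts = contact_map(coords)

for row in contacts:
    print("".join("#" if cell else "." for cell in row))
```

Structure prediction from a contact map goes the other way: it searches for coordinates whose contact map agrees with the predicted one, for example by treating each predicted contact as a distance restraint.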

Prediction of the side chain geometry

Fitting the amino acid side chains exactly is a problem of its own within protein structure prediction. The protein backbone is assumed to be rigid, and the possible conformations (rotamers) of the individual side chains are varied so that the total energy is minimized. Methods designed specifically for side chain prediction include dead-end elimination (DEE) and self-consistent mean field (SCMF). Both use rotamer libraries, which list conformations known from experience to be favorable together with detailed data on them. These libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent.

Side chain prediction is particularly useful for the hydrophobic protein core, where the side chains are most closely packed; it is less suitable for the more flexible surface regions, where the number of possible rotamers increases significantly.
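
The idea behind dead-end elimination can be sketched with its original pruning criterion: a rotamer r at position i can be discarded if its best possible energy is still worse than the worst possible energy of some competing rotamer t at the same position. The energy tables below are invented toy numbers, and the data layout (`self_energy`, `pair_energy`) is an assumption made for this example, not the format of any particular program.

```python
from itertools import product


def pair(pair_energy, i, r, j, s):
    """Pair energy between rotamer r at position i and rotamer s at position j."""
    return pair_energy[(i, j)][r][s] if i < j else pair_energy[(j, i)][s][r]


def dead_end_elimination(self_energy, pair_energy):
    """Iteratively remove rotamers that cannot be in the minimum-energy solution."""
    n = len(self_energy)
    alive = [set(range(len(e))) for e in self_energy]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in list(alive[i]):
                # Best case for r: every other position is as favorable as possible.
                best_r = self_energy[i][r] + sum(
                    min(pair(pair_energy, i, r, j, s) for s in alive[j]) for j in others
                )
                for t in alive[i] - {r}:
                    # Worst case for the competitor t.
                    worst_t = self_energy[i][t] + sum(
                        max(pair(pair_energy, i, t, j, s) for s in alive[j]) for j in others
                    )
                    if best_r > worst_t:
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


def total_energy(self_energy, pair_energy, assignment):
    """Sum of self energies and all pair energies for one rotamer assignment."""
    n = len(assignment)
    energy = sum(self_energy[i][assignment[i]] for i in range(n))
    energy += sum(
        pair_energy[(i, j)][assignment[i]][assignment[j]]
        for i in range(n) for j in range(i + 1, n)
    )
    return energy


def brute_force_minimum(self_energy, pair_energy, candidates):
    """Exhaustively search the given candidate rotamers for the lowest energy."""
    return min(
        product(*[sorted(c) for c in candidates]),
        key=lambda a: total_energy(self_energy, pair_energy, a),
    )


# Hypothetical toy system: three positions with 3, 2 and 2 rotamers.
self_energy = [[0.0, 1.0, 5.0], [0.5, 0.0], [0.0, 2.0]]
pair_energy = {
    (0, 1): [[0.0, 0.5], [-1.2, 0.2], [0.3, 0.1]],
    (0, 2): [[0.2, 0.0], [0.0, -0.5], [0.4, 0.4]],
    (1, 2): [[0.0, 0.3], [-0.2, 0.1]],
}

full = [set(range(len(e))) for e in self_energy]
reduced = dead_end_elimination(self_energy, pair_energy)
print("surviving rotamers:", reduced)
print("minimum:", brute_force_minimum(self_energy, pair_energy, reduced))
```

In this toy case the pruning shrinks the search from 12 combinations to 4 without losing the minimum-energy assignment; the remaining combinations still have to be resolved by another search, here a brute-force enumeration.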

Quaternary structure considerations

When laboratory results show that a protein forms a complex with another protein or with copies of itself, and the tertiary structures are available, docking software can be used to work out how the proteins in the complex are oriented relative to one another (quaternary structure). In addition, genomic contact maps provide data from which contact positions can be inferred, because these positions are functionally linked. This also applies to protein-protein interactions, in which case contact positions between pairs of genes from the same species are considered. First applications to toxin-antitoxin systems and other signaling networks in bacteria have already been presented.


Literature

G. L. Butterfoss, B. Yoo et al.: De novo structure prediction and experimental characterization of folded peptoid oligomers. PNAS, Volume 109, 2012, pp. 14320-14325, doi:10.1073/pnas.1209945109.

Web links

ExPASy Proteomics tools - list of links on the topic

Server / software for prediction

NetSurfP - Secondary Structure and Surface Accessibility Predictor
DomPred - University College London
DOMpro - University of California Irvine
DomainSplit - University of Pittsburgh
PredictProtein
SCRATCH - Protein structure prediction suite that includes SSpro
PSSpred - A multiple neural network training program for protein secondary structure prediction
