Bioinformatics

Surface protein of an influenza virus (model)

Bioinformatics (also called computational biology) is an interdisciplinary science that solves problems in the life sciences using theoretical, computer-aided methods. It has contributed to fundamental knowledge in modern biology and medicine. Bioinformatics gained prominence in the media primarily in 2001 through its significant contribution to the sequencing of the human genome.

Bioinformatics is classified as a small subject in German higher education policy. It is a broad field of research, both in the problems it addresses and in the methods it uses. Its essential areas are the management and integration of biological data, sequence analysis, structural bioinformatics and the analysis of data from high-throughput methods (the "-omics" fields). Because bioinformatics is indispensable for analyzing data on a large scale, it forms an essential pillar of systems biology.

In the English-speaking world, bioinformatics is often contrasted with computational biology, which covers a wider area than classic bioinformatics, but the two terms are usually used synonymously.

Data management

The rapidly growing amount of biological data, especially DNA and protein sequences, their annotation, 3D protein structures, interactions of biological molecules and high-throughput data from, for example, microarrays, places special demands on how this data is handled. An important problem in bioinformatics is therefore the preparation and storage of data in suitably indexed and linked biological databases. The advantages lie in the uniform structure, the easier searchability and the automation of analyses by software.
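As a rough illustration of what "suitably indexed" storage can mean, the following sketch keeps sequence records in an in-memory SQLite table with a primary key on the accession and a secondary index on the organism. The table layout, function names and accession numbers are hypothetical and chosen only for this example; real biological databases use far richer schemas.

```python
import sqlite3

def build_db(records):
    """Store (accession, organism, sequence) tuples in an indexed table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, "   # unique identifier, indexed automatically
        "organism TEXT NOT NULL, "
        "sequence TEXT NOT NULL)"
    )
    # Secondary index so that lookups by organism do not scan the whole table.
    con.execute("CREATE INDEX idx_organism ON sequences (organism)")
    con.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
    con.commit()
    return con

def by_accession(con, accession):
    """Return the sequence stored under an accession, or None."""
    row = con.execute(
        "SELECT sequence FROM sequences WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None

def by_organism(con, organism):
    """Return all accessions for one organism, sorted alphabetically."""
    rows = con.execute(
        "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [r[0] for r in rows]

# Hypothetical toy records (made-up accessions and sequences).
con = build_db([
    ("X0001", "Homo sapiens", "ACGT"),
    ("X0002", "Mus musculus", "TTGA"),
    ("X0003", "Homo sapiens", "GGCC"),
])
```

Because every record has the same structure, the same two queries work for any entry, which is the kind of uniformity that makes automated analysis possible.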

One of the oldest biological databases is the Protein Data Bank (PDB), which holds data on the 3D structures of biological macromolecules, mostly proteins. Databases for managing nucleotide sequences (EMBL Data Library, GenBank) and amino acid sequences (Protein Information Resource, Swiss-Prot) were set up in the 1980s. The nucleotide sequence databases that have joined together in the International Nucleotide Sequence Database Collaboration are primary databases, that is, archives of original data submitted by the researchers themselves. In contrast, UniProt, the merger of PIR and Swiss-Prot, provides high-quality, expertly maintained and annotated entries of protein sequences with extensive information on each individual protein. These entries are supplemented by protein sequences that are translated automatically from the EMBL database without further annotation.

Other databases contain recurring motifs in protein sequences (Pfam), information about enzymes and biochemical components (BRENDA, KEGG LIGAND and ENZYME), about protein-protein or protein-DNA interactions (TRANSFAC), about metabolic and regulatory networks (KEGG, REACTOME) and much more.

The size of the individual databases is growing exponentially in some cases. The number of relevant databases is also growing steadily (over 350 worldwide). When searching for relevant information, bioinformatics meta-search engines (Bioinformatik-Harvester, Entrez, EBI SRS) are often used.

The variety of databases available worldwide often leads to redundant and thus error-prone data management, especially since DNA sequences exist partly as fragments and partly as fully assembled genomes. Ideally, the stored genome and proteome data would make it possible to reconstruct the regulatory system of an entire organism. Intensive work is being carried out on mapping identified proteins to the genes that code for them and vice versa, on linking them to represent their interactions, and on assigning proteins to metabolic and regulatory pathways.

Another task in data integration is the creation of controlled vocabularies and ontologies that allow function names to be assigned consistently across all levels. The Gene Ontology Consortium (GO) is currently trying to establish a consistent nomenclature for the molecular function, the biological process and the cellular localization of gene products.
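How a controlled vocabulary supports assignment "across all levels" can be sketched with a small is_a hierarchy: an annotation with a specific term implies all of its more general ancestors. The hierarchy below is a deliberately simplified toy, not a copy of the real Gene Ontology graph, and the gene name and function names are hypothetical.

```python
# Toy is_a hierarchy (child term -> parent terms). Simplified for
# illustration; it does not reproduce the actual Gene Ontology.
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}

def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

def expand_annotations(annotations, is_a):
    """Map each gene product to its annotated terms plus all their ancestors."""
    return {
        gene: set(terms).union(*(ancestors(t, is_a) for t in terms))
        for gene, terms in annotations.items()
    }

# Hypothetical annotation of a made-up gene product.
expanded = expand_annotations({"geneX": ["protein kinase activity"]}, IS_A)
```

With this expansion, a query for the general term "catalytic activity" also finds gene products that were annotated only with a more specific term.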

Sequence analysis

The first pure bioinformatics applications were developed for DNA sequence analysis and sequence comparison. Sequence analysis is primarily about quickly finding patterns in protein or DNA sequences. Sequence comparison (sequence alignment) addresses the question of whether two genes or proteins are related to each other ("homologous"). For this purpose, the sequences are placed one above the other and aligned so that the best possible agreement is achieved. If the agreement is significantly better than would be expected from chance similarity, one can conclude that the sequences are related: for genes and proteins, relatedness always implies a similar structure and usually a similar function. The central importance of sequence comparison for bioinformatics lies in its use for predicting the structure and function of unknown genes.

Both dynamic programming algorithms and heuristic algorithms are used for this purpose. Dynamic programming delivers optimal solutions, but because of the computing resources it requires, it cannot in practice be applied to very long sequences or very large databases. Heuristic algorithms are suitable for searching the large, globally available databases that archive all known sequences. Although they do not guarantee optimal results, they work so well that the daily work of bioinformaticians and molecular biologists would not be possible without, for example, the BLAST algorithm. Other frequently used algorithms, which serve different purposes depending on the area of application, are FASTA, Needleman-Wunsch and Smith-Waterman.
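The dynamic programming approach can be illustrated with a minimal sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example; practical tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

result = needleman_wunsch("ACGT", "AGT")
```

The table has (n+1) x (m+1) entries, so time and memory grow with the product of the two sequence lengths. This is the resource cost that makes the exact method impractical for searching very large databases and motivates heuristics such as BLAST.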

Biological questions less often require searching for exact matches of short sequence segments. Typical cases are the cleavage sites of restriction enzymes in DNA sequences and, possibly, sequence patterns in proteins taken from the PROSITE database.
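An exact-match search of this kind needs no alignment at all. The sketch below locates every occurrence of a recognition sequence in a DNA string; GAATTC is the recognition sequence of the restriction enzyme EcoRI, while the DNA string and the function name are made up for the example.

```python
def find_sites(dna, site):
    """Return the 0-based start positions of every exact occurrence of
    `site` in `dna`, including overlapping occurrences."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        # Continue one position further so overlapping hits are not missed.
        start = dna.find(site, start + 1)
    return positions

# Hypothetical DNA fragment containing two EcoRI recognition sequences.
ecori_sites = find_sites("TTGAATTCAAGAATTCG", "GAATTC")
```

The function reports where the recognition sequence starts; the position at which the enzyme actually cuts within that sequence would have to be added as a fixed offset.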

Bioinformatics also plays a major role in genome analysis. The DNA fragments, which are sequenced in small units, are combined to form an overall sequence with the aid of bioinformatic methods.
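One simple way to picture this assembly step is a greedy merge: repeatedly join the two fragments with the longest suffix-prefix overlap. This is only a sketch under strong assumptions (error-free fragments, a single strand, no repeats, no fragment contained inside another); the article does not say which method is used in practice, and the fragments below are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if that overlap is shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by largest overlap; return the contigs left."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; several contigs remain
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags

# Invented fragments that overlap by four bases each.
contigs = greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"])
```

Greedy merging is not guaranteed to find the shortest or the biologically correct sequence, particularly when the genome contains repeats.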

Methods were also developed for finding genes in unknown DNA sequences (gene prediction, also called gene finding). This problem is addressed using various computational methods and algorithms, including statistical sequence analysis, Markov chains and artificial neural networks for pattern recognition.
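The use of Markov chains can be sketched as follows: one first-order chain is estimated from example sequences of the class of interest, a second from background sequences, and a candidate is scored by the log-odds ratio of the two. The training strings here are toy data, the pseudocount of 1 is an assumption, and real gene finders are considerably more elaborate.

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current).

    Sequences are assumed to contain only the letters A, C, G and T.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }

def log_odds(seq, model, background):
    """Sum of log(P_model / P_background) over all adjacent base pairs.

    Positive values mean the sequence looks more like the model's
    training data than like the background.
    """
    return sum(
        math.log(model[a][b] / background[a][b]) for a, b in zip(seq, seq[1:])
    )

# Toy training data, not real genomic sequences.
model = train_markov(["CGCGCGCG"])
background = train_markov(["ATATATAT"])
```

In this toy setup a CG-rich candidate receives a positive score and an AT-rich candidate a negative one, which is the basic decision a statistical gene finder makes at much larger scale.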

Both DNA and amino acid sequences can be used to construct phylogenetic trees that represent the evolutionary development of present-day organisms from largely unknown and therefore hypothetical ancestors.
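As one simple illustration of tree construction from sequences, the sketch below computes pairwise distances between already aligned sequences and clusters them with UPGMA. The article does not name a particular method, so the choice of UPGMA is an assumption made for the example, and the labels and sequences are invented. UPGMA itself presumes a roughly constant rate of change along all lineages.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(names, seqs):
    """Cluster sequences with UPGMA; return the tree as nested tuples."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Keys are always (smaller id, larger id).
    dist = {
        (i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # average of the old distances, weighted by cluster size.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Invented aligned sequences with hypothetical labels.
tree = upgma(["A", "B", "C"], ["ACGTACGT", "ACGTACGA", "TCGAAGGA"])
```

The nested tuples group the two most similar sequences first; the internal nodes of the result correspond to the hypothetical ancestors mentioned above.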

Structural bioinformatics

Computer-aided visualization of the glucocorticoid receptor (PDB 1GLU) bound to a short DNA molecule with a specific nucleotide sequence. The surface of the protein is colored according to its electrostatic properties. Created with BALLView.

With the elucidation and extensive functional analysis of various complete genomes, the focus of bioinformatics work is shifting to questions of proteomics, e.g. the problem of protein folding and structure prediction, i.e. the question of the secondary or tertiary structure for a given amino acid sequence. The interaction of proteins with various ligands (nucleic acids, other proteins or smaller molecules) is also being investigated, since it yields not only knowledge for basic research but also important information for medicine and pharmacy, for example about how a protein altered by a mutation affects bodily functions, or which drugs act on which proteins and in what way.

Literature

Cynthia Gibas and Per Jambeck: Introduction to Practical Bioinformatics, O'Reilly, 2002, ISBN 3-89721-289-7
Nicola Gaedeke: Researching the life sciences: About the use of databases and other bioinformatics resources, Birkhäuser, 2007, ISBN 3-7643-8525-1
Reeves GA, Talavera D, Thornton JM: Genome and proteome annotation: organization, interpretation and integration. In: J R Soc Interface. 6, No. 31, February 2009, pp. 129-47. doi:10.1098/rsif.2008.0341. PMID 19019817. PMC 2658791 (free full text).

References

[1] Small subjects: bioinformatics on the small subjects portal. Retrieved June 12, 2019.
[2] TK Attwood, A. Gisel, NE. Eriksson, E. Bongcam-Rudloff: Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective. InTech, November 2, 2011, doi:10.5772/23535.
[3] IntAct protein interaction database at the EBI.
[4] GenBank Growth, Statistics 1982–2008
[5] Michael Y. Galperin, Guy R. Cochrane: The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. In: Nucleic Acids Research. vol. 39, suppl 1, January 1, 2011, pp. D1-D6, doi:10.1093/nar/gkq1243.
