Protein family

from Wikipedia, the free encyclopedia

A protein family is a group of structurally similar proteins that are evolutionarily related to one another and are encoded by corresponding gene families. The terms gene family and protein family are mostly used synonymously, depending on whether the homology is considered in relation to the genome and DNA (genes) or at the level of gene expression, biosynthesis and biological function (proteins).

Classifying proteins into families based on their amino acid sequence and the architecture of the protein domains within the sequence helps in understanding the evolutionary origin of these protein families and has practical applications in biotechnology and diagnostics.

Basics

Evolution of protein families

The expansion of a protein family, or the creation of a new family, can happen in different ways; the mechanisms are not mutually exclusive:

Origin of homologous genes: Two populations of the same species are, for example, geographically separated and develop independently of each other. Mutations occur in the genomes of their offspring and lead to altered proteins when expressed (for example, a change in the primary structure that in turn affects the stability and function of the protein). Depending on the different living conditions, these mutations are subject to natural selection. As a result, the gene that codes for a protein with slightly different properties becomes established in this subpopulation. In one of the two separate species, this genetic drift leads to a homologous variant of the protein within the protein family or, after further changes over a longer period, to an orthologous protein family with mostly still similar amino acid sequences.

Formation of paralogous genes: Another possibility is the alteration of a gene through complete or partial gene duplication (or multiplication). This creates a copy of the gene; the result is a gene cluster with paralogous sequences. Since one of the genes can still perform its original function, the other can diverge. Additional mutations can give rise to new functions in the resulting proteins.

Some gene and protein families have undergone "expansion" in the course of evolution through gene or genome duplication (e.g. an opsin gene duplication on the X chromosome in Old World monkeys).

Use of terms

Protein family, very narrowly defined: The human cyclophilin protein family. Different "family members" are represented by the slightly different structures of their isomerase domains.

The term protein family is not used uniformly in the literature; its meaning depends on the context. A protein family can comprise several very large groups of proteins with the lowest possible level of mathematically verifiable sequence homology (and correspondingly very different biological functions), or it can refer to a very narrow group of proteins which, compared with one another, have almost identical sequences, three-dimensional structures and functions.

When Margaret Oakley Dayhoff introduced the system of protein superfamilies in the mid-1970s, only 493 protein sequences were known. They were mostly small proteins with only one protein domain, such as myoglobin, hemoglobin and cytochrome c, which Dayhoff and co-workers divided into 116 superfamilies. The terms superfamily > family > subfamily allowed a gradation, and numerical definitions were given for them.
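The numerical definitions quoted in the notes at the end of this article (a subfamily with less than 20% differences, a family with less than 50% differences, and a larger grouping whenever the similarity is statistically non-random) can be illustrated with a small sketch. The function name `relationship_level` and the `significant` flag are hypothetical; the flag merely stands in for a significance test that is not implemented here, and the cut-offs follow one of the quoted sources rather than a universally accepted standard.

```python
def relationship_level(percent_identity: float, significant: bool = True) -> str:
    """Map a pairwise sequence identity (0-100) to a gradation level.

    Thresholds are taken from the numerical definitions quoted in the notes:
    less than 20% differences -> subfamily, less than 50% -> family.
    `significant` is an assumed input: whether the similarity was judged
    non-random by some external statistical test.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    differences = 100.0 - percent_identity
    if differences < 20.0:
        return "subfamily"
    if differences < 50.0:
        return "family"
    if significant:
        return "superfamily"
    return "unrelated"
```

For example, two sequences with 90% identity would fall into the same subfamily under these cut-offs, while a pair with 30% identity would be grouped only at the superfamily level, and only if the similarity is statistically significant.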

In parallel, other terms such as protein class, protein group and protein subfamily have been coined and used over the years. These terms are also used ambiguously depending on the context.

Importance of understanding protein families

The total number of proteins of living beings and viruses that have been sequenced, either directly or indirectly via their genes, is steadily increasing and requires a meaningful structuring and classification based on the biological conditions. Some scientists put the number of protein families at 60,000 or more.

On the one hand, there is a theoretical interest in an ever better understanding of how various genes, and the functions of the proteins they encode, have changed and developed in the course of evolution; on the other hand, there are very specific applications in which knowledge of the relationships between protein families and of domain architecture plays an important role. Examples are enzymatic synthesis in industrial biotechnology, the development of new vaccines from "tailor-made" recombinant proteins, and the area of medical analysis (proteomics).

Sequence comparisons by phylogenetic and cluster analysis allow proteins to be assigned to families, and these families to be assigned to higher-level superfamilies. From these assignments, theoretical predictions can be made about the likely secondary and tertiary structure of newly discovered proteins, and they open up possible approaches for elucidating as yet unknown functions.
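The idea of assigning proteins to families by cluster analysis can be sketched as follows. This is a deliberately simplified illustration, not the method of any particular database: it assumes the sequences are already aligned to equal length, uses plain percent identity as the similarity measure, and groups proteins by single-linkage clustering. The function names, the 50% default threshold and the toy sequences are all assumptions made for the example.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns in which both sequences carry a gap character.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def cluster_families(seqs: dict, threshold: float = 50.0) -> list:
    """Single-linkage clustering: link any pair at or above the threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy, made-up sequences purely for illustration.
toy = {
    "p1": "MKTAYIAK",
    "p2": "MKTAYLAK",
    "p3": "MKSAYLAR",
    "q1": "GGWPLNDE",
}
families = cluster_families(toy, threshold=50.0)
```

Real classification pipelines compute alignments first and use statistically grounded similarity scores; the sketch only shows how pairwise similarities above a cut-off translate into groups.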

Classification systems

There are several systems for the classification of protein families, which differ in approach and systematics. One of these systems is described in detail.

PIRSF classification

The Universal Protein Resource database (UniProt), the result of the 2002 merger of the databases TrEMBL of the European Bioinformatics Institute (EBI), Swiss-Prot of the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) of the Georgetown University Medical Center (GUMC), provides the PIR Superfamily Classification System (PIRSF).

Terminology

Initially, the PIR classification into superfamily, family and subfamily, which was based on the work of Dayhoff, was structured linearly and hierarchically: a protein could be assigned to only a single protein family, and this family to only a single superfamily. This system had to be revised as more and more primary structures became known (through the direct sequencing of purified proteins, but above all through the reading of the proteins encoded in sequenced genes). It was recognized that some proteins were structurally rather simple while others had very complex structures:

  • Homeomorphic proteins are proteins that are "topologically equivalent" to one another, that is, they are homologous from the N-terminus to the C-terminus and have the same type, number and arrangement of domains (also called the domain structure or domain architecture), but can differ in sequence length.
  • Domain proteins have a complex structure as a result of gene fusions, deletions and/or insertions and contain various domains that are otherwise found only in very different homeomorphic proteins (or contain domains in a different arrangement).

From 1993, PIR therefore distinguished between homeomorphic superfamilies and domain superfamilies.
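The domain-architecture part of this distinction can be illustrated with a minimal sketch. It assumes each protein is described as an ordered list of domain labels from N-terminus to C-terminus and treats consecutive repeats of the same domain as a tolerated variation, in line with the note on slight variations at the end of this article. The function names and the domain labels are hypothetical, and the check covers only the architecture, not the full-length homology that homeomorphic proteins also require.

```python
from itertools import groupby


def architecture(domains):
    """Ordered domain labels with consecutive repeats collapsed.

    Collapsing repeats is an assumption made for this sketch; it models
    the tolerated repetition of the same domain.
    """
    return tuple(label for label, _ in groupby(domains))


def same_architecture(p, q):
    """True if both proteins show the same domains in the same N-to-C order."""
    return architecture(p) == architecture(q)


# Hypothetical domain labels, listed from N-terminus to C-terminus.
protein_1 = ["domA", "domB"]
protein_2 = ["domA", "domB", "domB"]   # tandem repeat of domB
protein_3 = ["domB", "domA"]           # same domains, different order
```

Under these assumptions `protein_1` and `protein_2` share an architecture, whereas `protein_3`, with the same domains in a different order, does not.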

Rules

The PIRSF system is based on the following rules:

  • A new protein is not added to a superfamily, family or subfamily automatically but manually; the results of machine sequence alignments and cluster analysis are used to support this.
  • Each entry is annotated as extensively as possible, and other classification schemes as well as entries from other similar databases are referenced.
  • The PIRSF system is based on the classification of whole proteins and not on the classification of individual or isolated domains, so that both the biochemical and the biological functions of a protein are clearly presented and so that proteins with less well (or not) defined domains can also be classified.
  • A hierarchical structure cannot represent the shuffling of domains (domain shuffling) that has occurred in the course of evolution. The PIRSF system is therefore "a network-like classification system based on the evolutionary relationship of whole proteins".
    • Primary network nodes (parent nodes) are the homeomorphic protein families, which contain proteins that are both homologous (orthologous or paralogous, that is, sharing a common ancestor protein) and homeomorphic, i.e. they are similar and have a similar arrangement of domains over the entire length of the primary structure. Defined parameters are used by the algorithms that determine "similarity" through sequence alignment.
    • The nodes of the (domain) superfamilies are arranged above these nodes of the homeomorphic protein families. These superfamilies, which are evolutionarily more distant from one another (as are individual proteins that have not yet been assigned to a family), are based on domains shared by the families below them; a homeomorphic protein family can, but does not have to, be assigned to several superfamilies above it. These superfamilies may be homeomorphic protein superfamilies, but they are more likely to be domain superfamilies if the protein regions covered by the domains do not extend over the full length of the protein.
    • Below the homeomorphic protein families are the child nodes of the subfamilies: homologous and homeomorphic groups (clusters) of proteins with functional specialization and/or a variation of the domain architecture within the protein family. Each subfamily has only one parent node.
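The network-like structure described in these rules can be modelled with a small data structure. The following is a minimal sketch under stated assumptions and does not reflect the actual PIRSF database schema: the class `Node`, the function `link` and the node names are hypothetical. It encodes only two constraints from the text, namely that a family may have several superfamily parents and that a subfamily has exactly one parent.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    name: str
    level: str  # "superfamily", "family" or "subfamily"
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


# Which parent level may sit directly above which child level.
ALLOWED_LINKS = {("superfamily", "family"), ("family", "subfamily")}


def link(parent: Node, child: Node) -> None:
    """Attach child below parent, enforcing the constraints from the text."""
    if (parent.level, child.level) not in ALLOWED_LINKS:
        raise ValueError(f"cannot place a {child.level} below a {parent.level}")
    if child.level == "subfamily" and child.parents:
        raise ValueError("a subfamily has only one parent node")
    parent.children.append(child)
    child.parents.append(parent)


# Hypothetical example: one family shared by two domain superfamilies.
sf_a = Node("domain superfamily A", "superfamily")
sf_b = Node("domain superfamily B", "superfamily")
fam = Node("homeomorphic family X", "family")
sub = Node("subfamily X1", "subfamily")

link(sf_a, fam)
link(sf_b, fam)   # allowed: a family may belong to several superfamilies
link(fam, sub)    # a subfamily gets its single parent
```

Because a family can appear below more than one superfamily, the result is a directed graph rather than a strict tree, which is the point of the network-like design.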

Examples of protein (super) families

The following is an incomplete list of protein families and superfamilies.

Web links

  • Pfam - database of protein families, alignments and HMMs
  • PROSITE - database of protein domains, protein families and functional sites
  • PIRSF - SuperFamily Classification System
  • PASS2 - Protein Alignment in Structural Superfamilies
  • SUPERFAMILY - HMM library for the representation of superfamilies and database of superfamily and family annotations of all organisms that have been completely sequenced so far

References

  1. Timothy H. Goldsmith: Birds see the world more colorful.
  2. MO Dayhoff: Computer analysis of protein sequences, Fed. Proc. 33, 2314-2316, 1974
  3. MO Dayhoff, JP McLaughlin, WC Barker and LT Hunt: Evolution of sequences within protein superfamilies, Naturwissenschaften 62, 154-161, 1975
  4. MO Dayhoff: The origin and evolution of protein superfamilies, Fed. Proc. 35, 2132-2138, 1976
  5. Annual journals of the Society for Natural History in Württemberg, Volumes 130-132 (1975), page 18: Protein subfamily, with less than 20% differences; protein family, with less than 50% differences; large protein family, this includes all proteins whose similarity is not random with a probability of over 99.9%, whereby the number of matching amino acids can also be less than 50%.
  6. Detlev Ganten and Klaus Ruckpaul: Fundamentals of Molecular Medicine, Springer (2007), page xxxi: Protein family, group of proteins with at least 50% sequence identity; protein superfamily, group of proteins with significant similarity to one another but less than 50% sequence identity.
  7. V. Kunin, I. Cases, AJ Enright, V. de Lorenzo and CA Ouzounis: Myriads of protein families, and still counting, Genome Biology 4, 401 (2003)
  8. There may be slight variations in the domain architecture, e.g. repetition of the same domain, or in the case of "auxiliary domains", which can often be acquired, moved, replaced or lost again relatively easily.