Haplotype

Haplotypes from SNPs from chromosomal segments of the same chromosome from four haploid individuals

The haplotype (from ancient Greek ἁπλούς haplóos or haplús, “simple”, and τύπος týpos, “image”, “pattern”), short for “haploid genotype”, is a variant of a nucleotide sequence on one and the same chromosome in the genome of an organism. A given haplotype can be specific to an individual, a population or a species.

As in the International HapMap Project, the alleles compared can be individual combinations of SNPs that serve as genetic markers. If several individuals share the same haplotype at a given gene locus because of common ancestry, they are grouped into a haplogroup.

History

The term was introduced in 1967 by Ruggero Ceppellini. It was originally used to describe the genetic makeup of the MHC, a complex of genes that encode proteins important for the immune system.

Differentiation from the genotype

If a diploid organism has the genotype AaBb with regard to two genes A and B, this genotype can arise from the haplotype pairs AB | ab or Ab | aB. In the former case, one chromosome carries the alleles A and B, the other a and b. In the latter case, one chromosome carries the alleles A and b, the other a and B.
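The AaBb example can be reproduced with a few lines of Python. This is a minimal sketch; the function name possible_phasings and the representation of a genotype as a list of two-character strings are illustrative assumptions, not established notation.

```python
from itertools import product

def possible_phasings(genotype):
    """Enumerate unordered haplotype pairs consistent with a diploid genotype.

    genotype: list of 2-character strings, one per locus, e.g. ["Aa", "Bb"]
    (hypothetical encoding chosen for this illustration).
    """
    phasings = set()
    # For each locus, choose which of the two alleles lies on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        first = "".join(locus[c] for locus, c in zip(genotype, choice))
        second = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        # The two chromosomes are unordered, so store each pair sorted.
        phasings.add(tuple(sorted((first, second))))
    return sorted(phasings)

print(possible_phasings(["Aa", "Bb"]))  # [('AB', 'ab'), ('Ab', 'aB')]
```

A genotype that is heterozygous at only one locus, such as AABb, yields a single pair, so its haplotypes are unambiguous.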

Determination of haplotypes

A distinction can be made between two cases. (In the following, the term “allele” refers to the different nucleotides A, C, G and T, but the number of repeats of a given microsatellite can also define an allele.)

Haploid species

Determining the haplotypes of a population of haploid individuals of the same species (e.g. different E. coli strains) is trivial. It is sufficient to sequence the given population and determine its SNPs (see figure). If individuals are omitted from the sequencing, any additional alleles they carry (and the resulting SNPs) cannot, of course, be detected.
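The haploid case can be sketched as follows, assuming the sequences are already aligned and of equal length. The function name, the sample names and the sequences are invented for illustration.

```python
def haplotypes_from_haploid(sequences):
    """Return (snp_positions, haplotypes) for aligned haploid sequences.

    sequences: dict mapping an individual's name to its aligned sequence.
    """
    names = list(sequences)
    length = len(sequences[names[0]])
    if any(len(seq) != length for seq in sequences.values()):
        raise ValueError("sequences must be aligned to the same length")
    # A position is a SNP if more than one nucleotide occurs in the sample.
    snp_positions = [
        i for i in range(length)
        if len({sequences[n][i] for n in names}) > 1
    ]
    # Each individual's haplotype is its allele string over the SNP positions.
    haplotypes = {
        n: "".join(sequences[n][i] for i in snp_positions) for n in names
    }
    return snp_positions, haplotypes

# Made-up example data for four haploid individuals.
sample = {
    "ind1": "ACGTTA",
    "ind2": "ACGATA",
    "ind3": "GCGTTA",
    "ind4": "ACGATC",
}
positions, haps = haplotypes_from_haploid(sample)
print(positions)  # [0, 3, 5]
print(haps)       # {'ind1': 'ATA', 'ind2': 'AAA', 'ind3': 'GTA', 'ind4': 'AAC'}
```

Removing ind4 from the sample would make position 5 monomorphic, which illustrates why alleles of unsequenced individuals go undetected.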

Polyploid species

If the ploidy of the species under consideration is at least 2, the problem becomes more complicated (e.g. humans are diploid, potatoes are tetraploid and common wheat is hexaploid). In this case the genome consists of two or more homologous sets of chromosomes, half of which come from the maternal and half from the paternal parent. A distinction must be made between different types of SNPs:

If a maternal and a paternal homologous chromosome set of an individual differ at nucleotide positions in the DNA, these SNPs become visible when the corresponding chromosomes of the individual are sequenced (a mixture of homologous chromosomes is always sequenced). Such an SNP is called a heterozygous SNP in that individual.

If a maternal and a paternal homologous chromosome set of an individual are identical at the gene locus considered, no SNPs are visible when the DNA of the individual is sequenced. Only when another allele is found at the same locus in at least one other individual can one speak of an SNP at that nucleotide position. Such an SNP is called a homozygous SNP in the first individual, but it can be a heterozygous SNP in another individual.

If two different alleles appear in an SNP (relative to the entire population considered), this SNP is called “biallelic”. If there are three different alleles, this SNP is called “triallelic” and for four alleles “tetraallelic”. A tetraallelic SNP contains the maximum number of different alleles, since SNPs can only be formed from the four nucleotides A, C, G and T.

Diploid species can in principle have tetraallelic SNPs, although only a maximum of two alleles are possible for an individual.
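The distinction between the allele count of an SNP in the population and the zygosity of a single individual can be illustrated with a short sketch. The encoding of a diploid genotype as a two-letter string and the function names are assumptions made for this example.

```python
ALLELE_CLASS = {1: "monomorphic", 2: "biallelic", 3: "triallelic", 4: "tetraallelic"}

def classify_snp(genotypes):
    """Classify one nucleotide position from the diploid genotypes of a population."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    alleles = set()
    for genotype in genotypes:
        if len(genotype) != 2 or any(base not in "ACGT" for base in genotype):
            raise ValueError(f"not a diploid genotype: {genotype!r}")
        alleles.update(genotype)
    # At most four alleles are possible, one per nucleotide.
    return ALLELE_CLASS[len(alleles)]

def zygosity(genotype):
    """Zygosity of a single diploid individual at this position."""
    return "heterozygous" if genotype[0] != genotype[1] else "homozygous"

# Made-up population: every individual carries at most two alleles,
# yet the population as a whole shows all four.
population = ["AC", "GT", "AA", "CG"]
print(classify_snp(population))           # tetraallelic
print([zygosity(g) for g in population])  # ['heterozygous', 'heterozygous', 'homozygous', 'heterozygous']
```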

If a single SNP is determined in a polyploid population (of the same species), the haplotypes (of length 1) can be read off directly from the sequencing, as in the haploid case. With as few as two SNPs, however, the problem arises that sequencing loses the assignment of the individual alleles to their original chromosomes. Different combinations of the alleles at SNP 1 and SNP 2 are then possible, and thus also different haplotypes. The number of possible haplotypes grows exponentially with the number of SNPs.
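For the diploid case with biallelic SNPs, the growth can be made concrete: an individual with k heterozygous SNPs has 2^(k-1) possible haplotype pairs, because the two chromosomes are interchangeable. The following sketch checks this formula by enumeration; the restriction to diploids and the 0/1 allele coding are simplifying assumptions.

```python
from itertools import product

def count_phasings(num_het_snps):
    """Number of distinct haplotype pairs of a diploid individual
    with the given number of heterozygous biallelic SNPs."""
    if num_het_snps == 0:
        return 1
    return 2 ** (num_het_snps - 1)

def count_by_enumeration(num_het_snps):
    """Brute-force check with the two alleles of each SNP coded as 0 and 1."""
    pairs = set()
    for first in product((0, 1), repeat=num_het_snps):
        second = tuple(1 - a for a in first)  # the complementary chromosome
        pairs.add(frozenset((first, second)))
    return len(pairs)

for k in range(1, 11):
    assert count_phasings(k) == count_by_enumeration(k)
    print(k, count_phasings(k))
```

With one SNP there is a single possibility, with two SNPs there are two, and with ten SNPs already 512.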

Various methods have been developed to determine haplotypes in polyploid species.

i) Experimental:

A given chromosome of a given individual is sequenced several times, and the haplotype is determined each time. In each sequencing run, one of the homologous chromosomes is selected at random from the polyploid set. The number of runs is chosen so that, with a given probability, no haplotype is missed. This is expensive and time-consuming. In plant breeding, the problem is solved by creating inbred lines. Ultimately, the homologous chromosomes of an individual from such a line are genetically identical (the individual has only homozygous SNPs). Determining the haplotypes then reduces to the haploid case and thus to a single sequencing of a chromosome or locus.
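How many sequencing runs are needed can be estimated under simplifying assumptions that are not stated in the text: each run picks one of the homologous chromosomes uniformly and independently, and all homologs carry different haplotypes (the worst case). The probability of having seen every haplotype then follows from the inclusion-exclusion principle. The function names are illustrative.

```python
from math import comb

def prob_all_seen(ploidy, runs):
    """Probability that `runs` uniform random draws hit each of
    `ploidy` distinct homologous chromosomes at least once."""
    return sum(
        (-1) ** j * comb(ploidy, j) * ((ploidy - j) / ploidy) ** runs
        for j in range(ploidy + 1)
    )

def runs_needed(ploidy, confidence=0.95):
    """Smallest number of runs for which no haplotype is missed
    with at least the given probability."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    runs = ploidy  # fewer runs than homologs can never see all of them
    while prob_all_seen(ploidy, runs) < confidence:
        runs += 1
    return runs

print(runs_needed(2))  # 6 runs for a diploid at 95 % confidence
print(runs_needed(4))  # tetraploid, e.g. potato
print(runs_needed(6))  # hexaploid, e.g. common wheat
```

For a diploid the probability after n runs is 1 - 2^(1-n), which first exceeds 0.95 at n = 6. The required number of runs rises with ploidy, which is one reason the experimental approach is costly.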

ii) Bioinformatic:

The means to carry out repeated sequencing or to create inbred lines are not always available. If heterozygous SNPs appear in an individual after a single sequencing run and more than one SNP is considered, several different haplotypes are possible for that individual. To select a biologically plausible one from these possibilities, whose number grows exponentially as the number of SNPs grows linearly, various methods based on different assumptions have been developed:

ii.1) Parsimony (see also Ockham's razor). This method seeks to minimize the number of haplotypes needed to explain the SNPs of a given population. Approaches based on SAT or linear programming exist to solve this problem efficiently.
Further properties: Applied under the assumption that little or no recombination takes place in the locus considered. Any solution found is optimal with respect to the parsimony criterion. Not practical for large-scale analyses.
ii.2) Maximum likelihood (using the expectation-maximization algorithm or Monte Carlo simulation). These methods attempt to find the set of haplotypes (and their assignment to the individuals) that maximizes the probability of the observed data under a given objective function.
Further properties: Can also be used when recombination occurs. Solutions are usually only suboptimal, because the algorithm ends in a local optimum or has to make simplifications so that a solution can be computed at all. Practical for large-scale analyses, provided a suboptimal solution is sufficient.
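The maximum-likelihood idea in ii.2 can be illustrated with a small expectation-maximization sketch for diploid individuals. It assumes Hardy-Weinberg proportions, enumerates all compatible haplotype pairs explicitly (feasible only for a few SNPs) and is not the implementation of any particular published program. All names and the example data are invented.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a diploid genotype,
    given as a tuple of 2-character strings such as ("AG", "CT")."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies from unphased diploid genotypes."""
    pairs_per_ind = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pairs_per_ind for p in pairs for h in p})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for pairs in pairs_per_ind:
            # E-step: weight each compatible pair by its probability
            # under the current frequencies (Hardy-Weinberg assumption).
            weights = [
                (1 if h1 == h2 else 2) * freq[h1] * freq[h2] for h1, h2 in pairs
            ]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the expected haplotype counts
        # divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq

# Two SNPs; only the last individual is ambiguous (AC|GT or AT|GC).
data = [("AA", "CC"), ("AA", "CC"), ("GG", "TT"), ("AG", "CT")]
for hap, f in em_haplotype_frequencies(data).items():
    print(hap, round(f, 3))
```

In this example the estimate converges to frequencies of about 0.625 for AC and 0.375 for GT, so the ambiguous individual is resolved as AC | GT, because those haplotypes are already supported by the unambiguous individuals. In general, the result of such an iteration can depend on the starting values.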

Methods ii.1 and ii.2 require a population of more than one individual for the underlying assumptions to hold and for biologically meaningful conclusions to be drawn. Subproblems of the haplotype problem are NP-complete, since they can be expressed as SAT instances (Cook's theorem) and in the worst case have the same complexity as SAT; the overall problem is therefore NP-hard.
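The parsimony criterion of ii.1 can be demonstrated by exhaustive search on a toy example. This sketch tries all subsets of candidate haplotypes in order of increasing size, which takes exponential time and is therefore usable only for very small inputs; the SAT and linear programming approaches mentioned above exist precisely to avoid this. The function names and data are illustrative.

```python
from itertools import combinations, product

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a diploid genotype,
    given as a tuple of 2-character strings such as ("AG", "CT")."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def pure_parsimony(genotypes):
    """Smallest set of haplotypes that explains every genotype (brute force)."""
    pairs_per_ind = [compatible_pairs(g) for g in genotypes]
    candidates = sorted({h for pairs in pairs_per_ind for p in pairs for h in p})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # Every individual needs at least one pair drawn from the subset.
            if all(
                any(h1 in chosen and h2 in chosen for h1, h2 in pairs)
                for pairs in pairs_per_ind
            ):
                return subset
    return ()

data = [("AA", "CC"), ("GG", "TT"), ("AG", "CT")]
print(pure_parsimony(data))  # ('AC', 'GT')
```

Two haplotypes suffice here: the heterozygous individual is explained as AC | GT, whereas the alternative AT | GC would require four haplotypes in total.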

Nomenclature of the haplotypes

Rough genealogical tree of mitochondrial DNA in humans. The numbers indicate the position of the mutations.

Detailed family tree of human mitochondrial DNA
The numbers indicate the position of the mutations. “MtEve” is the mitochondrial Eve. “Outgroup” leads to mtDNA from other primates (e.g. chimpanzees). The figure uses the usual (incorrect) nomenclature with the “L1 haplogroup”. However, L1 forms the root (L1a is no more closely related to L1f than to V). The L1 fields have therefore been crossed out.

A haplogroup can itself contain further sub-haplogroups, which in turn can be subdivided further. The nomenclature of the haplogroups attempts to reflect this tree structure and alternates between letters and numbers. Two mtDNAs of a haplogroup are always monophyletic. Characteristic mutations in the gene sequences of the mtDNA outside the D-loop are used for the assignment.

For example, a person can have haplogroup C1a3b2. Their mtDNA is then closely related to that of another person who has, for example, C1a3b4. Their mtDNA also shares a common ancestor with that of a third person who has C1a3c5, but this common ancestor lived before the C1a3 lineage split. That is, C1a3b4 and C1a3b2 are monophyletic relative to C1a3c5. Likewise, C1a3b2 and C1a3c5 are monophyletic relative to all H haplotypes, and so on.
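Where the naming is applied consistently, the most recent shared branch of two haplogroups can be read off as the longest common prefix of their names. The following sketch shows this for the example above. The helper names are hypothetical, and the approach fails wherever the nomenclature departs from the tree structure, for instance for haplogroups that carry their own letter although they descend from M, N or R.

```python
import re

def split_haplogroup(name):
    """Split a haplogroup name into its alternating letter and digit levels,
    e.g. "C1a3b2" -> ["C", "1", "a", "3", "b", "2"]."""
    return re.findall(r"[A-Za-z]+|\d+", name)

def common_branch(a, b):
    """Longest shared prefix of two haplogroup names, level by level."""
    shared = []
    for x, y in zip(split_haplogroup(a), split_haplogroup(b)):
        if x != y:
            break
        shared.append(x)
    return "".join(shared)

print(common_branch("C1a3b2", "C1a3b4"))  # C1a3b
print(common_branch("C1a3b2", "C1a3c5"))  # C1a3
print(repr(common_branch("C1a3b2", "H1")))  # '' (no shared named branch)
```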

The nomenclature is applied relatively inconsistently. Many letters have been used to denote the major non-African haplogroups. However, many old haplogroups occur in Africa. These are grouped together under “L”, so that digits are already used to subdivide the main groups. There is still no scientific consensus on the assignment of some African haplotypes (in L1 and L3).

Starting from the root, the human mitochondrial family tree consists of a series of deep branches. These genetic lineages are now called L1. Contrary to earlier belief, L1 is not a monophyletic haplogroup but forms the root. L1 is thus actually a whole bundle of African haplogroups that are as old as mitochondrial Eve and whose exact relationship to one another has not yet been clarified.

A branch splits off from these old L1 lineages through a mutation at position 10810. Haplogroup L2 in turn splits off from this branch through a mutation at position 16390. L2, too, occurs practically only in sub-Saharan Africans.

A mutation at position 3594 forms the branch on which the large haplogroups M and N as well as numerous other African haplogroups, which are still summarized today under L3, are located. Like L1, L3 is not a true (monophyletic) haplogroup. The haplogroups M and N occur in the vast majority of non-Africans. They are very rare in sub-Saharan Africa, where L1, L2, and L3 dominate.

Haplogroup M is divided into the major haplogroups M1, Z, C, D, E, G and Q. Haplogroup N is divided into N1a, N1b, N9, A, I, W, X and Y, as well as haplogroup R, which comprises the sub-haplogroups B, F, H, P, T, J, U and K.

The most extensive study of mitochondrial DNA to date was carried out by the Genographic Consortium (see also The Genographic Project). This comparison included 78,590 genotypic samples, and the mitochondrial haplogroups (and their subgroups) were represented in a phylogenetic tree.

Geographical distribution

The old haplotypes from the L branches dominate in sub-Saharan Africa. There is no doubt that they originated there. These haplotypes are also found in North Africa (approx. 50% frequency) and, to a lesser extent, in Europe and Western Asia.

Haplogroups M and N dominate in the rest of the world and are rare in sub-Saharan Africa. Special variants of haplogroup M (M1) occur with a frequency of about 20% in Ethiopia. Either M originated there, or this reflects a Semitic back-migration to the south.

Native Americans have haplogroups A, B, C, D and X. Of these, A, B and X emerged from an eastern branch of haplogroup N, whereas C and D emerged from haplogroup M.

In Europe and Western Asia, haplogroup M is extremely rare. The most common haplogroups belong to haplogroup R: H, V, T, J, U and K. In addition, haplogroups I, W and X occur with a significant frequency. Practically the same haplogroups are found in Europe, the Caucasus and the Middle East; only their frequencies vary. Haplogroup H in particular is much rarer in the Middle East and the Caucasus than in Europe (~25% versus ~45%), while haplogroup K is much more common. Within Europe, the frequencies of the haplogroups vary slightly from region to region.

South and East Asia differ greatly from West Asia in terms of haplogroups. Of haplogroup M, the haplogroups C, D, E, G, Z and Q occur here. Haplogroup N also occurs here, but it is mainly represented by the haplogroups A, B, F, Y and X.

Haplogroup X is noteworthy because it occurs throughout Eurasia and North America, albeit at a relatively low frequency. It used to be assumed that haplogroup X originated in Europe and occurs only there. When the haplogroup was discovered among Native Americans, the hypothesis arose that it had been brought to America by sea by emigrants from Europe thousands of years ago. In the meantime, however, haplogroup X has also been discovered in Asia (Derenko et al., 2001).

Sources

References

The International HapMap Consortium: The International HapMap Project. In: Nature. Volume 426, 2003, pp. 789–796.
R. Ceppellini, ES Curtoni, PL Mattiuz, V. Miggiano, G. Scudeller, A. Serra: Genetics of leucocyte antigens. A family study of segregation and linkage. In: Histocompatibility Testing. 1967, pp. 149–185.
I. Lynce, JP Marques-Silva: Efficient haplotype inference with Boolean Satisfiability. In: National Conference on Artificial Intelligence (AAAI). 2006.
J. Neigenfind, G. Gyetvai, R. Basekow, S. Diehl, U. Achenbach, C. Gebhardt, J. Selbig, B. Kersten: Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT. In: BMC Genomics. Volume 9, 2008, p. 356.
D. Gusfield: Haplotype inference by Pure Parsimony. In: Proceedings of the 14th annual Symposium on Combinatorial Pattern Matching. 2003, pp. 144–155.
DG Brown, IM Harrower: Integer programming approaches to haplotype inference by pure parsimony. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics. Volume 3, Number 2, 2006, pp. 141–154, ISSN 1545-5963. doi:10.1109/TCBB.2006.24. PMID 17048400.
L. Excoffier, M. Slatkin: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. In: Molecular Biology and Evolution. Volume 12, 1995, pp. 921–927.
Tianhua Niu, Zhaohui S. Qin, Xiping Xu, Jun S. Liu: Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms. In: American Journal of Human Genetics. Volume 70, 2002, pp. 157–169, PMC 448439 (free full text).
Shu-Yi Su, Jonathan White, David J. Balding, Lachlan JM Coin: Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. In: BMC Bioinformatics. Volume 9, 2008, p. 513.
Macaulay and Richards
DM Behar et al.: The Genographic Project public participation mitochondrial DNA database. In: PLoS Genet. Vol. 3, San Francisco 2007, p. e104. PMID 17604454. doi:10.1371/journal.pgen.0030104. ISSN 1553-7390.

Literature

Lexicon of Biology. Volume 7. Spectrum Academic Publishing House, Heidelberg 2004, ISBN 3-8274-0332-4.
Benjamin Lewin: Molecular Biology of Genes. Spectrum Academic Publishing House, Heidelberg / Berlin 1998, ISBN 3-8274-0234-4.

Web links

Elke Binder: Chasing the differences. The international “HapMap” project aims to facilitate the search for disease genes. In: Der Tagesspiegel. August 26, 2004 (online, accessed August 10, 2011).
Theme and variation. A catalog of the differences in the genome should facilitate research. In: Der Tagesspiegel. October 27, 2005 (online, accessed August 10, 2011).
Jan Freudenberg, Sven Cichon, Markus M. Nöthen, Peter Propping: Block structure of the human genome: an organizational principle of genetic variability. In: Deutsches Ärzteblatt. Volume 99, no. 47, 2002, pp. A 3190–3195 (online, accessed August 10, 2011).
