GLIMMER

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Howicus (talk | contribs) at 00:40, 4 November 2013 (Formatting the lead, removing the first section so that the intro will be above the table of contents (No need for an "Introduction" section header)). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



GLIMMER
Developer(s): Steven Salzberg & Arthur Delcher
Stable release: 3.02 / 9 May 2006
Available in: C++
Type: Bioinformatics tool
License: OSI Certified Open Source Software under the Artistic License
Website: ccb.jhu.edu/software/glimmer/index.shtml

In bioinformatics, GLIMMER (Gene Locator and Interpolated Markov ModelER) is a software program used to find genes in microbial DNA. It is effective at finding genes in bacteria, archaea, and viruses, typically finding 98-99% of all protein-coding genes. GLIMMER was the first system to use the interpolated Markov model to identify coding regions. The GLIMMER software is open source and is maintained by Steven Salzberg, Art Delcher, and their colleagues at the Center for Bioinformatics and Computational Biology[1] at Johns Hopkins University.

Versions of GLIMMER

GLIMMER 1.0

The first version, GLIMMER 1.0, was released in 1998 and described in the paper Microbial gene identification using interpolated Markov models[2]. GLIMMER 1.0 was the first gene finder to use an interpolated Markov model as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An interpolated Markov model makes predictions based on a variable context, i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence, which makes GLIMMER more flexible and more powerful than a fixed-order Markov model.

The paper Microbial gene identification using interpolated Markov models[2] compares the interpolated Markov model used by GLIMMER with a fifth-order Markov model. The GLIMMER algorithm found 1680 of the 1717 annotated genes in Haemophilus influenzae, whereas the fifth-order Markov model found only 1574. GLIMMER also found 209 additional genes, whereas the fifth-order Markov model found only 104.
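The gene counts above translate directly into sensitivity figures. The short Python sketch below does the arithmetic; the function and variable names are illustrative and do not come from GLIMMER.

```python
# Counts reported in the text for Haemophilus influenzae.
ANNOTATED_GENES = 1717
FOUND = {
    "GLIMMER (interpolated Markov model)": 1680,
    "fifth-order Markov model": 1574,
}


def sensitivity(found_count: int, annotated_count: int) -> float:
    """Fraction of annotated genes that a gene finder recovered."""
    return found_count / annotated_count


for name, count in FOUND.items():
    print(f"{name}: {sensitivity(count, ANNOTATED_GENES):.1%}")
```

On these numbers GLIMMER recovers about 97.8% of the annotated genes and the fifth-order model about 91.7%.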

GLIMMER 2.0

The second version, GLIMMER 2.0, was released in 1999 and described in the paper Improved microbial gene identification with GLIMMER[3]. This paper[3] presents significant technical improvements that increase the accuracy of GLIMMER.

Interpolated context models are used instead of interpolated Markov models, which gives the flexibility to select any base as context, not just the immediately adjacent bases. For example, the nucleotide in the third codon position is sometimes irrelevant to the amino acid translation. GLIMMER 2.0 can take advantage of that flexibility.
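The third-codon-position remark can be illustrated with the four-fold degenerate codon families of the standard genetic code, in which the first two bases alone determine the amino acid. The Python sketch below is only an illustration of that biological fact; it is not part of GLIMMER, and the helper names are hypothetical.

```python
# Codon families of the standard genetic code in which the first two bases
# fix the amino acid, whatever the third base is (four-fold degenerate).
FOURFOLD_FAMILIES = {
    "GC": "Ala", "GG": "Gly", "CC": "Pro", "AC": "Thr",
    "GT": "Val", "TC": "Ser", "CT": "Leu", "CG": "Arg",
}


def amino_acid_from_prefix(codon: str):
    """Return the amino acid if the first two bases decide it, else None."""
    return FOURFOLD_FAMILIES.get(codon[:2].upper())


def third_base_irrelevant(codon: str) -> bool:
    """True when changing the third base cannot change the amino acid."""
    return amino_acid_from_prefix(codon) is not None


for codon in ("GCT", "GCC", "GCA", "GCG", "ATG"):
    print(codon, amino_acid_from_prefix(codon), third_base_irrelevant(codon))
```

For such codons a context model gains little from conditioning on the third position, which is the kind of position a context model that may pick non-adjacent bases is free to skip.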

GLIMMER 2.0 made a conscious effort to reduce the number of false negative gene predictions at the expense of a slight increase in the number of false positive predictions. GLIMMER 1.0 occasionally discarded overlapping genes because of the position of the start codon; GLIMMER 2.0 resolves this by incorporating additional rules.

The paper Improved microbial gene identification with GLIMMER[3] reports various comparisons between GLIMMER 1.0 and GLIMMER 2.0, which show significant improvement in the later version.

GLIMMER 3.0

The third version, GLIMMER 3.0, was released in 2007 and described in the paper Identifying bacterial genes and endosymbiont DNA with Glimmer[4]. This paper describes several major changes made to the GLIMMER system, including improved methods for identifying coding regions and start codons.

GLIMMER 3.0 dramatically reduces the rate of false positive predictions while maintaining Glimmer’s 99% sensitivity rate at detecting genes in most species. GLIMMER 3.0 uses a new algorithm for scanning coding regions, a new start site detection module, and an overall architecture that for the first time integrates all gene predictions across an entire genome. The paper Identifying bacterial genes and endosymbiont DNA with Glimmer[4] reports various comparisons between GLIMMER 3 and GLIMMER 2.

Accessing GLIMMER

GLIMMER can be accessed in two ways.

1. You can download the latest version of GLIMMER from the Glimmer home page and follow the installation instructions given there. You need a C++ compiler to build GLIMMER.

2. You can also access the online version of GLIMMER hosted by NCBI.

How GLIMMER works

1. GLIMMER searches for the longest open reading frames that do not overlap any other long open reading frame and that follow a certain amino acid distribution. These are used as the training set.

2. GLIMMER trains six Markov models, each of orders zero through eight, one for each possible reading frame, and also a model for noncoding DNA.

3. At every stage GLIMMER checks whether there are at least 400 observations for the Markov model.

   a. If there are at least 400 observations, GLIMMER obtains the probabilities directly from the data, like any other algorithm
      that uses a fixed-order Markov model.
   b. If there are fewer than 400 observations, GLIMMER combines the result with lower-order models using the interpolated Markov model.

4. GLIMMER scores every open reading frame longer than a minimum length using all seven models (six coding DNA models and one noncoding DNA model).

5. If the model for the correct reading frame scores above a threshold, GLIMMER predicts the open reading frame to be a gene.

6. GLIMMER resolves the overlapped regions.
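Steps 1 and 4 both depend on enumerating open reading frames in all six reading frames. The Python sketch below shows one simple way to do that. It is an illustration under stated assumptions, not GLIMMER's implementation: the start codon set (ATG, GTG, TTG), the rule of taking the first start codon after a stop, the default minimum length, and all names are assumptions made for the example.

```python
from typing import Iterator, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}          # assumed start codon set
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (strand, frame, start, end) for each ORF of at least min_len bases.

    Coordinates are 0-based offsets on the scanned strand, `end` is exclusive
    and includes the stop codon. Each ORF runs from the first start codon
    seen after the previous stop codon to the next in-frame stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if start is not None and i + 3 - start >= min_len:
                        yield strand, frame, start, i + 3
                    start = None            # begin looking for a new ORF
                elif start is None and codon in STARTS:
                    start = i


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAACCCGGGTAACC", min_len=9):
        print(orf)
```

A real gene finder would then score each candidate with the trained models; this sketch stops at enumeration and the length filter.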

The GLIMMER system

The GLIMMER system consists of two programs. The first program, build-imm, takes an input set of sequences and outputs the interpolated Markov model as follows.

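GLIMMER computes the probability of each base a, c, g, t following every k-mer for 0 ≤ k ≤ 8. Then, for each k-mer, it computes a weight. GLIMMER evaluates new sequences by computing the probability as

$$P(S \mid M) = \sum_{x=1}^{n} \mathrm{IMM}_8(S_x)$$

where $S_x$ is the oligomer ending at position x and n is the length of the sequence. $\mathrm{IMM}_8(S_x)$, the 8th-order interpolated Markov model score, is computed as

$$\mathrm{IMM}_k(S_x) = \lambda_k(S_{x-1}) \cdot P_k(S_x) + \left[1 - \lambda_k(S_{x-1})\right] \cdot \mathrm{IMM}_{k-1}(S_x)$$

where $\lambda_k(S_{x-1})$ is the numeric weight associated with the k-mer ending at position x-1 in the sequence S and $P_k(S_x)$ is the estimate obtained from the training data of the probability of the base located at position x in the kth-order model.

The probability of base $s_x$ given the i previous bases is computed as follows.

$$P_i(S_x) = P(s_x \mid S_{x,i}) = \frac{f(S_{x,i}, s_x)}{\sum_{b \in \{a,c,g,t\}} f(S_{x,i}, b)}$$

Here $S_{x,i} = s_{x-i} \ldots s_{x-1}$ is the context string of the i bases preceding position x, and $f(S_{x,i}, b)$ is the number of times base b follows that context in the training data.

The value of $\lambda_i(S_{x-1})$ associated with $P_i(S_x)$ can be regarded as a measure of our confidence in the accuracy of this value as an estimate of the true probability. GLIMMER uses two criteria to determine $\lambda_i(S_{x-1})$. The first of these is simple frequency of occurrence: if the number of occurrences of the context string $S_{x,i}$ in the training data exceeds a specific threshold value, then $\lambda_i(S_{x-1})$ is set to 1.0. The current default value for the threshold is 400, which gives 95% confidence. When there are insufficient sample occurrences of a context string, an additional criterion is used to determine the value of $\lambda$. For a given context string $S_{x,i}$ of length i, the observed frequencies of the following base, $f(S_{x,i}, a)$, $f(S_{x,i}, c)$, $f(S_{x,i}, g)$, $f(S_{x,i}, t)$, are compared with the previously calculated interpolated Markov model probabilities using the next shorter context, $\mathrm{IMM}_{i-1}(S_{x,i-1}, a)$, $\mathrm{IMM}_{i-1}(S_{x,i-1}, c)$, $\mathrm{IMM}_{i-1}(S_{x,i-1}, g)$, $\mathrm{IMM}_{i-1}(S_{x,i-1}, t)$. Using a $\chi^2$ test, we determine how likely it is that the four observed frequencies are consistent with the IMM values from the next shorter context.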
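The interpolation between a context's own frequency estimate and the score from the next shorter context can be sketched in Python as follows. This is a simplified illustration, not the build-imm code. In particular, the weight used when a context has fewer than `threshold` occurrences (`total / threshold`) is an assumption made to keep the example short; it stands in for the chi-squared criterion described above. All function names are hypothetical.

```python
from collections import defaultdict

BASES = "acgt"


def train_counts(sequences, max_order=8):
    """counts[context][base] = number of times `base` follows `context`.

    Contexts of every length from 0 (the empty string) to max_order are counted.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.lower()
        for x, base in enumerate(seq):
            for k in range(min(max_order, x) + 1):
                counts[seq[x - k:x]][base] += 1
    return counts


def imm_score(counts, context, base, threshold=400):
    """Interpolated score for `base` following `context` (order = len(context))."""
    seen = counts.get(context, {})
    total = sum(seen.values())
    p_k = seen.get(base, 0) / total if total else 0.0
    if not context:
        # Order 0: nothing shorter to fall back on; uniform if untrained.
        return p_k if total else 1.0 / len(BASES)
    # Weight is 1.0 once the context is seen often enough.
    # ASSUMPTION: below the threshold, use total/threshold instead of the
    # chi-squared based weight described in the text.
    lam = 1.0 if total >= threshold else total / threshold
    shorter = imm_score(counts, context[1:], base, threshold)
    return lam * p_k + (1.0 - lam) * shorter


def position_scores(counts, seq, max_order=8, threshold=400):
    """Per-position interpolated scores along `seq`."""
    seq = seq.lower()
    return [
        imm_score(counts, seq[max(0, x - max_order):x], seq[x], threshold)
        for x in range(len(seq))
    ]


if __name__ == "__main__":
    model = train_counts(["acgt" * 200])
    print(imm_score(model, "a", "c"))                 # context seen 200 times
    print(imm_score(model, "a", "c", threshold=100))  # context now above threshold
```

Dropping the first character of the context (`context[1:]`) yields the next shorter context ending at the same position, which mirrors the step from order k to order k-1.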

The second program, glimmer, then uses this IMM to identify putative genes in an entire genome. GLIMMER identifies all the open reading frames that score higher than a threshold and checks for overlapping genes. Resolving overlapping genes is explained in the next sub-section.

The equations used above are taken from the paper Microbial gene identification using interpolated Markov models[2].

Resolving overlapping genes

In GLIMMER 1.0, when two genes A and B overlap, the overlap region is scored. If A is longer than B, and if A scores higher on the overlap region, and if moving B's start site will not resolve the overlap, then B is rejected.
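The GLIMMER 1.0 rule is a conjunction of three conditions, which the Python sketch below makes explicit. The `Gene` record and its fields are hypothetical names introduced for the illustration; how the overlap score and the effect of moving a start site are computed is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    length: int            # gene length in bases
    overlap_score: float   # this gene's score on the overlap region


def reject_b(a: Gene, b: Gene, moving_b_start_resolves: bool) -> bool:
    """Apply the rule as stated in the text: B is rejected only if A is
    longer, A scores higher on the overlap, and moving B's start site
    would not remove the overlap."""
    return (
        a.length > b.length
        and a.overlap_score > b.overlap_score
        and not moving_b_start_resolves
    )


if __name__ == "__main__":
    gene_a = Gene("A", length=1200, overlap_score=3.1)
    gene_b = Gene("B", length=450, overlap_score=1.4)
    print(reject_b(gene_a, gene_b, moving_b_start_resolves=False))  # True
    print(reject_b(gene_a, gene_b, moving_b_start_resolves=True))   # False
```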

GLIMMER 2.0 provided a better way to resolve the overlap. In GLIMMER 2.0, when two potential genes A and B overlap, the overlap region is scored. Suppose gene A scores higher; four different orientations are then considered.

Case 1

In this case, moving the start sites does not remove the overlap. If A is significantly longer than B, then B is rejected; otherwise both A and B are called genes, with a doubtful overlap.

Case 2

In this case, moving the start of B can resolve the overlap, so A and B can be called non-overlapping genes. If B is significantly shorter than A, however, B is rejected.

Case 3

In this case, moving the start of A can resolve the overlap. A is moved only if the overlap is a small fraction of A; otherwise B is rejected.

Case 4

In this case, both A and B can be moved. The start of B is moved first, until the overlap region scores higher for B. Then the start of A is moved until it scores higher. Then B again, and so on, until either the overlap is eliminated or no further moves can be made.

The above example has been taken from the paper Identifying bacterial genes and endosymbiont DNA with Glimmer[4].

Reverse Scoring

GLIMMER computes the log-likelihood that a given interval of a DNA sequence was generated by a model of coding DNA versus a model of noncoding DNA. The main challenge for GLIMMER is to find the true start site. GLIMMER 3.0 scores all open reading frames in reverse, from the stop codon back toward the start codon. The advantage of scanning open reading frames in reverse is that, for nucleotides near the start site, the context window of the interpolated Markov model is contained within the coding portion of the gene, which is the type of data on which it was trained.
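One way to picture reverse scoring is as a running sum of per-position log-likelihood ratios accumulated from the stop codon backward, read off at each candidate start codon. The Python sketch below assumes the per-position scores have already been computed and that the candidate with the highest cumulative score is preferred. Both the selection rule and the names are assumptions for illustration, not a description of GLIMMER 3.0's code.

```python
from itertools import accumulate


def best_start(log_odds, start_positions):
    """Pick the candidate start with the highest cumulative score.

    log_odds[i] is the coding-versus-noncoding log-likelihood ratio at
    position i of the open reading frame; the last element is the position
    just before the stop codon. start_positions lists candidate start
    codon indices. Returns (start_index, cumulative_score).
    """
    # suffix[i] == sum(log_odds[i:]), built by accumulating from the
    # stop-codon end of the ORF back toward its beginning.
    suffix = list(accumulate(reversed(log_odds)))[::-1]
    return max(((s, suffix[s]) for s in start_positions), key=lambda t: t[1])


if __name__ == "__main__":
    scores = [-2.0, -1.0, 0.5, 1.0, 2.0, 1.5]
    print(best_start(scores, start_positions=[0, 2, 4]))
```

In this toy input the upstream candidate at index 0 is penalised by the negative scores before index 2, so the candidate at index 2 is chosen.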

Ribosome binding sites

The ribosome binding site (RBS) provides a strong signal for the position of the true start site. GLIMMER 3.0 has a standalone program, RBSfinder, that can be run as a post-processor on the results of a GLIMMER analysis. RBSfinder is quite effective at finding ribosome binding sites and adjusting the start positions predicted by GLIMMER.
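The idea of using an upstream signal to support a start site can be sketched as a motif search. The Python example below looks for the best ungapped match to the Shine-Dalgarno consensus AGGAGG in a window upstream of a candidate start codon. It is only an illustration: the choice of consensus, the window size, and the match-counting score are assumptions, and this is not RBSfinder's algorithm.

```python
SD_CONSENSUS = "AGGAGG"   # Shine-Dalgarno consensus, chosen for illustration


def best_rbs_match(seq: str, start: int, window: int = 20):
    """Return (position, matches) for the best ungapped match to
    SD_CONSENSUS lying entirely within the `window` bases upstream of
    the start codon at index `start`. Returns (None, -1) if the upstream
    region is shorter than the motif."""
    seq = seq.upper()
    motif_len = len(SD_CONSENSUS)
    best = (None, -1)
    for i in range(max(0, start - window), start - motif_len + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + motif_len], SD_CONSENSUS))
        if matches > best[1]:
            best = (i, matches)
    return best


if __name__ == "__main__":
    dna = "TTTTAGGAGGTTTTTTATGAAA"   # ATG start codon at index 16
    print(best_rbs_match(dna, start=16))
```

A post-processor in this spirit could compare the match quality upstream of alternative start codons and prefer the start with the stronger signal.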


Performance of GLIMMER

Glimmer is the system of choice for genome annotation efforts on a wide range of bacterial, archaeal, and viral species. In a large-scale reannotation effort at the DNA Data Bank of Japan (DDBJ, which mirrors GenBank), Kosuge et al. (2006)[5] examined the gene finding methods used for 183 genomes. They reported that of these projects, Glimmer was the gene finder for 49%, followed by GeneMark with 12%, with other algorithms used in 3% or fewer of the projects. (They also reported that 33% of genomes used "other" programs, which in many cases meant that they could not identify the method. Excluding those cases, Glimmer was used for 73% of the genomes for which the methods could be unambiguously identified.) Glimmer was used by the DDBJ to re-annotate all bacterial genomes in the International Nucleotide Sequence Databases.[6] It is also being used by this group to annotate viruses.[7] Glimmer is part of the bacterial annotation pipeline at the National Center for Biotechnology Information (NCBI),[8] which also maintains a web server for Glimmer,[9] as do sites in Germany,[10] Canada,[11] and elsewhere.
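The 73% figure follows from renormalising Glimmer's 49% share after setting aside the 33% of genomes listed under "other" programs. A short Python check of that arithmetic, with illustrative names:

```python
def share_among_identified(share: float, unidentified: float) -> float:
    """Rescale a usage share after excluding the unidentified fraction."""
    return share / (1.0 - unidentified)


glimmer_share = 0.49   # Glimmer's share of all 183 genomes
other_share = 0.33     # genomes listed under "other" programs

print(f"{share_among_identified(glimmer_share, other_share):.0%}")
```

Here 0.49 / 0.67 is approximately 0.731, which rounds to the 73% quoted above.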

Glimmer is a highly cited bioinformatics system in the scientific literature. According to Google Scholar, as of early 2011 the original Glimmer article (Salzberg et al., 1998)[2] has been cited 581 times, and the Glimmer 2.0 article (Delcher et al., 1999)[3] has been cited 950 times.

References

  1. ^ "Center for Computational Biology". Johns Hopkins University. Retrieved 23 March 2013.
  2. ^ a b c d Salzberg et al. (1998). "Microbial gene identification using interpolated Markov models". PMID 9421513.
  3. ^ a b c d Delcher et al. (1999). "Improved microbial gene identification with GLIMMER". PMID 10556321.
  4. ^ a b c "Identifying bacterial genes and endosymbiont DNA with Glimmer". PMID 17237039.
  5. ^ Kosuge et al. (2006). PMID 17166861.
  6. ^ PMID 17108353.
  7. ^ PMID 17158166.
  8. ^ "NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP)". Center for Bioinformatics and Computational Biology. Retrieved 23 March 2012.
  9. ^ "Microbial Genome Annotation Tools". Center for Bioinformatics and Computational Biology. Retrieved 23 March 2012.
  10. ^ "TiCo". Institut für Mikrobiologie und Genetik, Universität Göttingen. Retrieved 23 March 2012.
  11. ^ "BASys Bacterial Annotation System". Retrieved 23 March 2012.

External links