Celera assembler

The Celera Assembler , a genome assembler , was originally developed by the Celera company and is now being continued as an open source project. It is used to reconstruct a long, contiguous sequence from many short genomic fragments obtained by whole genome shotgun sequencing.

A seed-and-extend approach is used to search for overlaps between the cleaned fragments. The overlap graph is constructed from this. A spanning forest heuristic combines the fragments into unitigs. These in turn are arranged into scaffolds by the greedy path merging algorithm with the help of the mate pair information and the contig mate graph. The gaps between the contigs can possibly be filled heuristically with previously unused fragments and mate pairs. At the end, the consensus sequence is calculated by multialigning all fragments over the scaffolds found.

Theoretical considerations

An assembly can be described by the following problem.

Let there be a lot of strings. We are looking for a string S with the following properties: ${\ displaystyle F = {f_ {2}, f_ {2}, \ dotsb, f_ {n}}}$

Each is a substring of S, ${\ displaystyle f_ {i}}$ ${\ displaystyle \ forall f_ {i} \ in F: f_ {i} \ in S}$
S is minimal, with 1. applies ${\ displaystyle \ forall S '}$ ${\ displaystyle | S | \ leq | S '|}$

This problem is known as the Shortest Common Superstring . It is NP-complete .

In the biological context it is aggravating that the sequences can be faulty and two possible orientations of a fragment must always be taken into account. In addition, from a biological point of view it does not always make sense to determine a minimal sequence, since fragments from regions containing repetition can lead to over-compressed results. This and the combinatorial complexity mean that the method presented here is strongly heuristic.

Procedure

The Celera assembler implements the whole genome shotgun approach. According to [MSD + 00], the assembler can be divided into several stages that build on one another, all of which are heuristic. The heart of the assembler is the greedy path merging algorithm.

Input data

The sequence fragments are obtained with the help of “double barrel” shotgun sequencing. A clone is sequenced from both sides and a so-called mate pair of reads is obtained . The clones used here have typical lengths of 2 kb, 10 kb, 50 kb and 150 kb and the sequenced reads have an average length of 550 bases . The original reads are then truncated to an interval with bases of 98 percent certainty. The quality values assigned to the read bases are not taken into account in the further process. Parts of cloning or sequencing vectors contained in the reads are also radically removed. The remaining high quality pieces are the assembler input. These are referred to below as fragments or reads.

Screener

The sequence fragments are first searched for known repeats. Fragments that contain repeat patterns are masked and later viewed separately. The cleaned up fragments form the input for the next step.

Overlapper

The overlapper looks for overlaps between fragments. The fragments purified during the screening are entered in this phase. All significant overlaps between the reads are output. For each fragment it is therefore to be determined how it overlaps with any other fragment or its reverse complement . Any two fragments and can overlap in different ways. ${\ displaystyle f_ {i}}$ ${\ displaystyle f_ {j}}$ ${\ displaystyle {\ overline {f_ {j}}}}$ ${\ displaystyle f_ {i}}$ ${\ displaystyle f_ {j}}$

They overlap at the ends, one is included in the other, or they do not overlap. Alignment by dynamic programming is too slow with this large number of fragments. In the human genome, around 27 million reads would have to be compared in pairs. In addition, only overlaps with a high alignment score, i.e. with more than 96% identity, are relevant here. Therefore, a BLAST -like "seed-and-extend" approach is followed.

All fragments are first concatenated, so be . Then one looks for exact matches of the k-mers of on the sequence H, the so-called seeds , for all fragments . The parameter k is selected here between 18 and 22. Now one tries to expand the seeds through banded alignment . The running time is linear and only really good overlaps between and the corresponding one are found. ${\ displaystyle H = f_ {1} f_ {2} \ dotsb f_ {n}}$ ${\ displaystyle f_ {i}}$ ${\ displaystyle f_ {i}}$ ${\ displaystyle f_ {i}}$ ${\ displaystyle f_ {j}}$

If an overlap was found between and , it could have three causes. The fragments actually overlap on the original sequence, the ends of the fragments come from a repeating sequence (repeat) or the sequence equality is purely coincidental. The latter is avoided in that only overlaps of a length of at least 40 base pairs are considered. The overlaps generated by repeats can be avoided by meticulously screening and masking the fragments beforehand for known repeats, or by choosing k, determining the frequency for each k-mer and ignoring extremely frequent k-mers. The calculated overlaps between the fragments are represented in a graph. ${\ displaystyle f_ {i}}$ ${\ displaystyle f_ {j}}$

The overlap graph is defined as follows: for each fragment there are two nodes, a start node , an end node and a directed edge to represent the orientation of the read. Each overlap is represented by an undirected edge. The overlap edges incise with knots at the ends where the corresponding fragments overlap. ${\ displaystyle f_ {i}}$ ${\ displaystyle s_ {i}}$ ${\ displaystyle e_ {i}}$ ${\ displaystyle (s_ {i}, e_ {i})}$

Unitigger

The unit trigger uses the overlap graph to first arrange the fragments, i.e. to assign coordinates to all nodes (layout). The overlaps between fragments from the overlap phase are used as input. A simple heuristic , also used here, is that of the exciting forest . First all read edges are marked. The overlap edges are sorted in ascending order according to length and the shortest edges are added until an exciting tree has emerged in each connected component. Overlap edges that would close a circle are ignored.

This subset of the edges results in a relative arrangement of the reads of this related component. Such an alleged multi-alignment is called a contig.

Incorrect overlap edges generated by repeats can lead to incorrect assembly of a contig (misassembly).

A spanning tree that contains the wrong edge would create a wrong arrangement of the reads. If you ignore the wrong edges, you will of course have the right arrangement. In this case, however, there is no layout that is consistent with all of the overlapping edges found. Therefore one defines: A unitig ( uni que con tig ) is a contig that is consistent with all edges of the overlap graph. Reads that do not belong to a unit are not used in the following. Now it can happen that a longer part of the sequence occurs repeatedly. This means that a contig is consistent with all the overlap edges of the overlap graph, but its fragments do not come from a unique region of the original sequence. One further defines: A U-Unitig (unique unitig) is a unitig that is unique to the original sequence, i.e. not partially or completely in a repeating region. U-Unitigs are identified with the help of statistical considerations. It is assumed that the arrival of the fragments would be Poisson distributed . Let R be the number of reads, G the length of the original sequence and u the length of a unit. Then R / G is the average number of reads per base and the expected number of reads per unit (expected value). The probability that a unit contains k fragments is with the Poisson distribution . If the unitig resulted from the reads of two repeats, the expected value doubles and the probability is . ${\ displaystyle \ lambda = {\ frac {u \ cdot R} {G}}}$ ${\ displaystyle {\ frac {\ lambda ^ {k}} {k!}} e ^ {- \ lambda}}$ ${\ displaystyle {\ frac {2 \ lambda ^ {k}} {k!}} e ^ {- (2 \ lambda)}}$

literature

"Algorithms in Bioinformatics II", Chapter 17, Prof. Dr. Daniel Huson, University of Tübingen, 2004 (PDF file; 1.49 MB)
"The Celera Assembler", seminar paper, 2005 (PDF file; 890 KB)

Web links

Project page on SourceForge