Assembling the genome
March 24, 2000
Genome assembly starts with 3.1 million fragments of the Drosophila genome: random bits of fly DNA that have been converted into characters that a computer can read. The goal of assembly is to arrange these DNA sequences into a properly ordered and nearly complete genome. This requires powerful computers and algorithms that can sort and analyze more data than the human brain can process.
When assembled, the genome will read like one long sentence written with the same four letters repeated over and over in varying order. Because the machines that decode DNA can read only about 500 letters at once, the challenge is to recreate the whole sentence using overlapping sentence fragments. The problem is that one doesn’t know where the fragments belong in the genome and many look alike.
The three million fly fragments are sampled from the gene-rich regions of the genome (about 120 million letters). These fragments are enough DNA to cover the genome 14 times over. The genome is sequenced at this scale to reduce the chances that our random approach missed any of the targeted regions.
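A back-of-the-envelope sketch can show why deep coverage lowers the chance of missing a region. It assumes an idealized model that the article does not state: fragments land uniformly at random, so the number of fragments covering any given letter is roughly Poisson-distributed with a mean equal to the coverage. The function name is illustrative.

```python
import math

def expected_uncovered_letters(genome_length: int, coverage: float) -> float:
    """Expected number of letters hit by no fragment, under a Poisson model (assumption)."""
    # P(a given letter is covered zero times) = e^(-coverage)
    return genome_length * math.exp(-coverage)

# Compare shallow and deep coverage of a 120-million-letter target.
for c in (1, 5, 14):
    print(c, round(expected_uncovered_letters(120_000_000, c)))
```

Under these assumptions, 14-fold coverage leaves on the order of a hundred letters untouched out of 120 million. Real sampling is not perfectly uniform, so actual gaps will be larger.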
With the massive data set in place, the fun begins. Assembly is like putting together a jigsaw puzzle, and the challenge comes from not knowing what kind of pieces you’re dealing with. A fragment could be unique, spelling part of a fly gene, or it could be one of the many repeated sequences in Drosophila. Like blue sky in a puzzle, repeats can be tricky and hard to assemble.
As with any jigsaw puzzle, some moves are easier and certain to be correct. The strategy is to take few risks at first and gradually become more aggressive. The GNN Assembler is actually a pipeline, a series of mathematical steps to sort, edit, and assemble fragments. The steps are stages in a layered strategy.
The first stage in assembly is the heavy lifting: the assembler compares the millions of fragments against each other, finding every segment at least 40 letters long that two fragments have in common. Overlaps of this length could not have occurred by chance, and they become the foundation of assembly.
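The comparison step can be sketched as follows. This is a minimal illustration, not the assembler's actual method: it indexes every 40-letter window and reports pairs of fragments that share one exactly. It assumes error-free reads on a single strand, and the function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_overlaps(fragments: dict[str, str], min_len: int = 40) -> set[tuple[str, str]]:
    """Return pairs of fragment ids that share an exact segment of at least min_len letters."""
    index = defaultdict(set)  # window of min_len letters -> ids of fragments containing it
    for frag_id, seq in fragments.items():
        for i in range(len(seq) - min_len + 1):
            index[seq[i:i + min_len]].add(frag_id)
    pairs = set()
    for ids in index.values():
        # Every pair of fragments sharing this window is a candidate overlap.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Toy example with a short window so the strings stay readable.
toy = {"a": "ACGTACGGTT", "b": "CGGTTAAC", "c": "TTTTTTTT"}
print(find_overlaps(toy, min_len=5))
```

Indexing windows avoids comparing every fragment against every other directly, which matters when there are millions of fragments. A production tool would also have to tolerate sequencing errors and check both DNA strands.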
Of these overlaps, some are "true" and some are "repeat-induced." (Fig. 1) In true overlaps, the shared sequence involves fragments that come from overlapping sections of the genome. These fragments belong together. In repeat-induced overlaps, the shared sequence involves part of a repeat that occurs in several dispersed parts of the genome. These fragments do not belong together. If it were clear which overlaps were true, assembly would be a trivial matter.
The assembler now searches for groups of overlapping fragments that 1) together spell a common sequence, and 2) do not overlap fragments with sequences that dispute, or contest, the common sequence. Such uncontested groups of fragments are assembled into what are called "unitigs." (Fig. 2) Each unitig contains on average about 30 fragments. There are 100 times fewer overlaps between unitigs than overlaps between fragments.
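One simplified way to picture "uncontested" groups is as unbranched chains in an overlap graph. The sketch below assumes a directed graph in which an edge (u, v) means fragment v extends fragment u, with no duplicate edges. It merges u and v only when v is the sole extension of u and u is the sole fragment leading into v. This is an assumption made for illustration, not the assembler's actual rule, and all names are hypothetical.

```python
def build_unitigs(fragments, edges):
    """Group fragments into maximal unbranched chains of the overlap graph."""
    succ = {f: [] for f in fragments}
    pred = {f: [] for f in fragments}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def unique_next(u):
        # The step u -> v is uncontested if nothing else follows u or precedes v.
        if len(succ[u]) == 1 and len(pred[succ[u][0]]) == 1:
            return succ[u][0]
        return None

    def unique_prev(v):
        if len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1:
            return pred[v][0]
        return None

    seen = set()

    def walk(start):
        chain = [start]
        seen.add(start)
        while (nxt := unique_next(chain[-1])) is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
        return chain

    # Start at fragments with no uncontested predecessor, then sweep up any cycles.
    unitigs = [walk(f) for f in fragments if unique_prev(f) is None]
    unitigs += [walk(f) for f in fragments if f not in seen]
    return unitigs

# Fragment "r" has two predecessors and two successors, so it stays on its own.
frags = ["a", "b", "r", "x", "c", "y"]
links = [("a", "b"), ("b", "r"), ("x", "r"), ("r", "c"), ("r", "y")]
print(build_unitigs(frags, links))
```

In the toy input, only "a" and "b" join into a chain; every link that touches the branching fragment "r" is contested and left unmerged.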
Ninety-nine percent of unitigs are correctly assembled, but a small percentage consist entirely of DNA from a number of instances of the same repeat. Here the computer is simply doing its job: the fragments are assembled together because they spell the sequence of the repeat.
The assembler identifies incorrectly assembled unitigs that spell repeats by looking at their "depth," that is, how many fragments are stacked within the unitig's length. (Fig. 3) Think of cards fanned out on a table. Within the same space, a fan of three decks is deeper than a fan of one deck. A statistic called the Discriminator is used to find stacks of fragments that are suspiciously deep. Correctly assembled unitigs that do not spell repetitive DNA are the equivalent of no more than one deck of cards deep. These are called U-unitigs ("U" for unique), and all other unitigs are set aside.
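The article does not give the formula for the Discriminator. One plausible form, shown here purely as an assumption, is a log-odds score under a Poisson model: it compares how likely the observed number of fragments is if the unitig comes from one place in the genome versus two collapsed copies of a repeat. The function and parameter names are hypothetical.

```python
import math

def discriminator(unitig_length, n_fragments, total_fragments, genome_length):
    """Log-odds that a unitig is unique rather than two collapsed repeat copies (assumed model)."""
    # Fragments expected to start inside the unitig if it occurs once in the genome.
    expected = unitig_length * total_fragments / genome_length
    # ln[Poisson(n; expected) / Poisson(n; 2 * expected)] simplifies to this expression.
    return expected - n_fragments * math.log(2)

# A 10,000-letter unitig; 250 fragments is the expected depth for unique sequence here.
print(discriminator(10_000, 250, 3_000_000, 120_000_000))  # positive: consistent with unique
print(discriminator(10_000, 750, 3_000_000, 120_000_000))  # negative: suspiciously deep
```

A positive score favors a unique unitig and a negative score favors a collapsed repeat. Any cutoff used to declare a U-unitig would be a further assumption, since the article does not state one.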
The U-unitigs are mini-phrases that are ready to be ordered in the genome. The scaffolding stage begins. Critical to this stage is the fact that most of the fragments were grabbed from the genome in pairs during sequencing. Known as mate pairs, these fragments are separated by a known number of letters, either about 1,000 or about 9,000. Since most repeats are shorter than 7,000 letters, mates are a way to circumnavigate, or span, the repeats. However, about 1% of the time mate pairs are not actually paired at the given distance due to errors in the computer tracking of the fragments.
A contiguous sequence of ordered unitigs is a contig. During scaffolding, the assembler orients contigs using mates. (Fig. 4) Most mate pairs are reliable landmarks: they stick together and remain the same distance apart. If mates from the same pair lie on different contigs, for instance, the contigs are neighbors about 99% of the time. If two or more mate pairs enforce each other (that is, they indicate the same orientation), then the contigs involved are almost certain to be neighbors.
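The idea of mate pairs enforcing each other can be sketched as a simple tally. The example assumes each mate pair that spans two contigs has been reduced to a tuple of two contig names plus a relative-orientation label ("same" or "flipped"). That representation and the names below are illustrative, not taken from the assembler.

```python
from collections import Counter

def confirmed_links(mate_links, min_support=2):
    """Keep contig pairs whose link is backed by at least min_support agreeing mate pairs."""
    support = Counter()
    for a, b, orientation in mate_links:
        # Store each contig pair in a fixed order so (c1, c2) and (c2, c1) are tallied together.
        key = (min(a, b), max(a, b), orientation)
        support[key] += 1
    return {key: n for key, n in support.items() if n >= min_support}

links = [
    ("c1", "c2", "same"),
    ("c2", "c1", "same"),     # a second pair agreeing with the first
    ("c2", "c3", "flipped"),  # a lone pair, left unconfirmed
]
print(confirmed_links(links))
```

If each mate pair is wrong about 1% of the time and errors are independent (an assumption), the chance that two agreeing pairs are both wrong is at most about 1 in 10,000, which is why two enforcing pairs are treated as near certain.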
As the assembler compares more and more mates, the contig geography becomes apparent. Sets of contigs that are ordered and oriented using enforcing pairs are called scaffolds. At this point, the scaffolding is continuous except for gaps. (Fig. 5)
Some of these gaps are due to missing sequence; this is unavoidable. Other gaps contain repetitive sequence that can now be closed using the unitigs that were set aside earlier by the Discriminator. The same strategy (make progressively riskier moves) applies to closing gaps.
The assembler classifies repeat sequences by size and reliability, calling the largest and most reliable repeats "rocks." Rocks are tossed into the gaps first, to be followed by the lesser "stones," and finally the smallest and least reliable pieces, "pebbles." Rocks must be linked to the contigs on either side of a gap by two or more mates. (Fig. 6)
Stones are linked to the contigs by only one mate. Their position in a gap is confirmed by overlaps. (Fig. 7)
Pebbles are placed in a gap based on the quality of the overlaps between each other and the adjoining contigs. (Fig. 8)
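The three tiers can be summarized as a small decision rule. This is a simplified sketch: it reduces the evidence for each set-aside unitig to a count of mate links into the flanking contigs and a yes/no flag for a supporting overlap, whereas the assembler also weighs size and overlap quality. All names are hypothetical.

```python
PLACEMENT_ORDER = ["rock", "stone", "pebble"]  # most reliable first

def classify_gap_filler(mate_links: int, overlap_confirmed: bool) -> str:
    """Assign a set-aside unitig to a tier based on the evidence for its position."""
    if mate_links >= 2:
        return "rock"      # two or more mates tie it to the flanking contigs
    if mate_links == 1 and overlap_confirmed:
        return "stone"     # one mate, with the position confirmed by overlaps
    if overlap_confirmed:
        return "pebble"    # placed on overlap evidence alone
    return "unplaced"

def placement_queue(candidates):
    """Order placeable candidates (name, mate_links, overlap_confirmed) by tier."""
    tiered = [(name, classify_gap_filler(m, o)) for name, m, o in candidates]
    placeable = [t for t in tiered if t[1] in PLACEMENT_ORDER]
    return sorted(placeable, key=lambda t: PLACEMENT_ORDER.index(t[1]))

print(placement_queue([("u7", 0, True), ("u3", 2, False), ("u5", 1, True)]))
```

Sorting by tier mirrors the strategy of making the safest moves first: the best-supported pieces go into the gaps before the riskier ones.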
Assembly has created a path across the unique, gene-bearing regions of the genome and characterized the intervening repeats.
. . .