How do you assemble a genome?
Genome assembly is the job of computer programs known, appropriately enough, as "assemblers." These programs work by finding and analyzing overlaps, or identical DNA sequences at either end of two different reads.
At first sight one might think that reads that overlap belong next to each other in the final genome sequence. However, more than 30 percent of the genome consists of sequence that is repeated several times, so an overlap caused by a repeat can also occur between fragments that are millions of base pairs apart in the genome.
The task of the assembler is to compare each read to every other, then to put all the reads in the proper order based on how they overlap, while ignoring overlaps that arise only from repeats. The outcome of an assembly is a collection of big stretches of the genome that are put together correctly.
The process is a lot like assembling a jigsaw puzzle: methodically placing puzzle pieces next to each other to see if they fit together, then snapping the matching pieces into place.
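As a rough illustration of the overlap search described above, here is a minimal Python sketch. The function names (overlap, all_overlaps), the min_len cutoff, and the three toy reads are all invented for this example; it compares every read with every other by brute force and makes no attempt to detect repeat overlaps, so it should not be read as how any production assembler works.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one fits.
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads: list[str], min_len: int = 3) -> dict[tuple[int, int], int]:
    """Compare each read to every other and record the overlaps found."""
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    found[(i, j)] = k
    return found


# Hypothetical toy reads, chosen so that they chain end to end.
reads = ["ATGCCGTA", "CGTAGGCT", "GGCTTACA"]
print(all_overlaps(reads))  # {(0, 1): 4, (1, 2): 4}
```

The result says that read 0 overlaps read 1 by four bases and read 1 overlaps read 2 by four bases, which is the ordering information an assembler would use to lay the reads out. Because every read is compared with every other, the work grows with the square of the number of reads.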
Assembler programs have continuously improved since the first such software was written in the early 1980s. More powerful computers are also helping scientists assemble larger pieces of DNA faster than ever before.
Nevertheless, even the most powerful assembly software relies more on elegance and simplicity than on brute force. Many assemblers are only a fraction of the size of a typical word processing program: 150,000 to 200,000 lines of code, as opposed to several million. A program used to assemble the human genome would fit easily on the hard disk of a typical personal computer.
But you couldn't actually run the assembler from the computer on your desk. Due to the huge number (millions of trillions) of comparisons it must make and keep track of, an assembler needs lots of memory to run: thousands of times the RAM required to run a word processing program and much more than you are likely to have on your desktop computer.
How do scientists know if a genome sequence is right?
Genome scientists venture daily into uncharted territory. When a genome has never been sequenced before, there is nothing to tell its explorers whether they have sequenced it correctly. Moreover, DNA sequencing is an activity that could have been invented by Mr. Murphy himself: Anything that can go wrong will go wrong, and just about anything can go wrong.
Errors can emerge at any stage of the process: when DNA is chopped up, when it is copied, as it goes through the sequencing machine, or as it is put together. Some sequences are particularly difficult to copy or to sequence and get left out. And random "noise" in the data can cause a base to be misidentified or overlooked.
But a combination of redundancy and careful checking helps ensure that errors in genome sequencing are kept to an absolute minimum.
One trick for eliminating errors is to sequence the genome more than once. That is, scientists chop multiple copies of a genome up in such a way that each base is sequenced several times (6 to 10 times on average, depending on the specific project). That way, if the sequencing machine gets a base wrong, or if a piece of DNA slips through the cracks and doesn't get sequenced, there are likely to be other, correct reads that will provide the sequence.
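To see why sequencing each base 6 to 10 times leaves so little of the genome uncovered, here is a small Python sketch. It rests on an idealized assumption that is not stated in the text above: that reads land at uniformly random positions, so the number of reads covering a given base follows a Poisson distribution. The function names are made up for the example.

```python
import math


def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def fraction_unsequenced(coverage: float) -> float:
    """Poisson estimate of the fraction of bases that no read covers.

    Assumes reads fall at uniformly random positions, which real
    sequencing only approximates.
    """
    return math.exp(-coverage)


for c in (1, 6, 10):
    print(f"{c}x coverage: about {fraction_unsequenced(c):.6f} of bases missed")
```

Under this simplified model, roughly 0.25 percent of bases would be missed at 6-fold coverage and far fewer at 10-fold. Real projects do worse than the model predicts, because some sequences are systematically hard to copy or to sequence.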
In addition to identifying DNA bases, software on automatic sequencing machines can evaluate the probability that a base really is the base it appears to be. Error probabilities for all of the bases in a read are added together for an estimate of the number of errors in the sequence.
Bad reads or parts of reads (those with a lot of errors or question marks) are weeded out before they even make it to the assembly stage. With slab-gel machines some of this quality control is done by humans, while with capillary sequencers it is exclusively the province of computers.
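The idea of adding up per-base error probabilities, and then discarding reads that look too error-prone, can be sketched in a few lines of Python. Two things here are assumptions rather than anything stated above: that each base carries a Phred-style quality score Q, which maps to an error probability of 10^(-Q/10), and that a read is rejected when its expected error count exceeds a cutoff. The function names and the max_errors value of 1.0 are purely illustrative.

```python
def error_probability(quality: int) -> float:
    """Convert a Phred-style quality score to an error probability (assumed scale)."""
    return 10 ** (-quality / 10)


def expected_errors(qualities: list[int]) -> float:
    """Sum the per-base error probabilities to estimate errors in a read."""
    return sum(error_probability(q) for q in qualities)


def keep_read(qualities: list[int], max_errors: float = 1.0) -> bool:
    """Hypothetical filter: keep a read only if it is expected to be clean enough."""
    return expected_errors(qualities) <= max_errors


good_read = [30, 30, 40, 40, 30]   # every base likely correct
bad_read = [3, 5, 4, 3, 6]         # many doubtful bases

print(expected_errors(good_read), keep_read(good_read))
print(expected_errors(bad_read), keep_read(bad_read))
```

The good read is expected to contain far less than one error and is kept; the bad read is expected to contain more than one and is dropped. A real pipeline might instead trim only the low-quality ends of a read rather than reject it outright.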
In addition, assembler software compares all the different reads that cover the same stretch of DNA and generates what is known as a "consensus" sequence. For example, if a certain base comes out as an A nine times and C the tenth, then chances are the base is really an A. An assembler is designed to sift through conflicting information and decide which sequence is likely to be right.
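The nine-A's-and-one-C example can be turned into a short Python sketch of a majority-vote consensus. It assumes the reads have already been lined up against each other, with "-" marking positions a read does not cover; the function name and that convention are choices made for this example. Real assemblers are likely to weigh base quality as well as counts, which this sketch does not do.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Pick the most common base in each column of pre-aligned reads.

    '-' means the read does not cover that position. Columns with no
    coverage are reported as 'N'. Ties are broken arbitrarily.
    """
    length = max(len(r) for r in aligned_reads)
    result = []
    for col in range(length):
        bases = [r[col] for r in aligned_reads if col < len(r) and r[col] != "-"]
        if not bases:
            result.append("N")
            continue
        result.append(Counter(bases).most_common(1)[0][0])
    return "".join(result)


# Nine reads say A at the fifth position, one read says C.
reads = ["GATTACA"] * 9 + ["GATTCCA"]
print(consensus(reads))  # GATTACA
```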
Once a sequence is assembled, there are several ways to make sure it has been put together correctly. The sequence may be checked against small parts of the genome that have previously been sequenced and assembled or against various landmarks on genome maps. In other words, if an assembly is consistent with scattered bits of known information, that is a good sign it is correct overall.
Although computer programs can help resolve gaps and uncertainties in a genome sequence, much of the final polishing is still done by people known as finishers. These expert workers identify gaps in the sequence, design experiments to fill in those gaps, and determine how to collect any additional information that is necessary.
There is no mechanical substitute for the intuition and intelligence of an experienced finisher, so finishing is currently a bottleneck in the process of DNA sequencing. Automatic sequencing machines can churn out raw sequence much faster than humans can analyze and polish these sequences.
Many scientists foresee a day when genome sequencing will be routine, when sequencing the genomes of many different species will help biologists understand the patterns of evolution, or when sequencing the genomes of individual humans will help doctors design tailor-made medicine. But until speedy machines become finishers as well as sequencers, that scenario will remain science fiction.
What makes sequencing the human genome different from sequencing other genomes?
The human genome is a lot bigger than other genomes that have been sequenced in the past. Most genomes that have been sequenced to date belong to viruses, bacteria, or other simple forms of life with relatively small genomes. The human genome is about a thousand times larger than an average bacterial genome. Even the fruit fly genome, the largest genome sequenced prior to the human genome, is just 165 million base pairs, less than a tenth the size of the human genome.
In addition, the human genome is about 25 to 50 percent repetitive DNA, but bacterial and viral genomes contain very little of this exasperating stuff. In repetitive DNA, the same short sequence is repeated over and over again. For example, somewhere in the genome the sequence ATG may be repeated 150 times in a row; elsewhere there may be 40 consecutive copies of the sequence CCTTGCT.
In jigsaw puzzle terms, a genome with a lot of repetitive DNA would be like a puzzle that includes a large number of identical or near-identical pieces: one in which the entire foreground is a featureless field of small, pink flowers, for example.
Like repetitive jigsaw puzzles, repetitive DNA can be difficult to assemble. It is often difficult for scientists to determine how much repetitive sequence belongs where. For example, 100 copies of ATG may belong in one spot in the genome, or it may be that only 60 copies belong there and 40 copies belong somewhere else.
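A small Python sketch can show why a read that lies entirely inside a repeat is hard to place. The toy genome below, with 150 copies of ATG between two short invented flanking sequences, and the function name match_positions are made up for illustration.

```python
def match_positions(genome: str, read: str) -> list[int]:
    """Return every position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome: 150 copies of ATG between unique flanks.
genome = "CCCC" + "ATG" * 150 + "GGGG"

inside_repeat = "ATG" * 10               # read made only of repeat copies
spanning_edge = "CCCC" + "ATG" * 3       # read that includes unique flank

print(len(match_positions(genome, inside_repeat)))  # 141
print(match_positions(genome, spanning_edge))       # [0]
```

The read made only of repeat copies fits equally well at 141 different places, so it carries no information about where it belongs or how many copies surround it. The read that reaches into unique flanking sequence fits in exactly one place.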
Repetitive DNA may also be more difficult to sequence than other DNA. Sometimes the procedures used to copy DNA and prepare it for sequencing do not work on repetitive DNA, and a sequencing machine may have a hard time reading the same string of letters over and over.
When is a genome sequence done?
That question can be answered in more than one way. At present, GNN's analysis of the human genome indicates that there are about 2.91 billion base pairs in the euchromatic region of the genome. For purposes of scientific research, we can say that the genome sequence is 95 percent complete, even though certain portions of the genome (namely the centromeres and telomeres, which are the highly repetitive regions at the center and ends of chromosomes) are widely considered to be unsequenceable, at least with current technology.
But the sequence as it is now known is complete enough to be useful to scientists as a base for future research in finding genes and understanding how the human genome as a whole works.
The reasonableness of this approach is revealed by the scientists' experience with the genome of Caenorhabditis elegans, a small roundworm. The worm's genome of 97 million base pairs was pronounced done in late 1998, but scientists are still finding a few mistakes and holes in the sequence. Despite these minor imperfections, the sequence has already helped researchers learn more about the animal's biology: how it grows, develops, ages and dies, and goes about its daily life. And that, after all, is the point of genome sequencing.
Updated on January 15, 2003