How do you assemble a genome?
Genome assembly is the job of computer programs known, appropriately
enough, as "assemblers." These programs work by finding and analyzing
overlaps, or identical DNA sequences at the ends of two different reads.
At first sight one might think that reads that overlap belong next to
each other in the final genome sequence. However, more than 30 percent
of the genome consists of sequence that is repeated several times, so
an overlap caused by a repeat can also occur between fragments that are
millions of base pairs apart in the genome.
The task of the assembler is to compare each read to every other, then
to put all the reads in the proper order based on how they overlap, while
disregarding overlaps that arise from repeats. The outcome of an assembly
is a collection of long stretches of the genome that have been put
together correctly.
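The read-against-read comparison described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the reads are error-free, an overlap means that the end of one read exactly matches the start of another, and the function names, the `min_overlap` parameter, and the three sample reads are all invented for the example. Real assemblers use far more efficient methods and must also cope with repeats and sequencing errors.

```python
def overlap_length(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def all_overlaps(reads, min_overlap=3):
    """Compare each read to every other read and record the overlaps found."""
    overlaps = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i != j:
                size = overlap_length(left, right, min_overlap)
                if size:
                    overlaps[(i, j)] = size
    return overlaps


# Three hypothetical reads that tile a short stretch of DNA.
reads = ["ATGGCGTACG", "GTACGTTAGC", "TTAGCCATGA"]
overlaps = all_overlaps(reads)
print(overlaps)
```

In this toy case the only overlaps found are between read 0 and read 1 and between read 1 and read 2, which suggests the order 0, 1, 2. Because every read is compared with every other read, the number of comparisons grows roughly with the square of the number of reads.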

Figure: Assembling overlapping "reads"
The process is a lot like assembling a jigsaw puzzle: methodically
placing puzzle pieces next to each other to see if they fit together,
then snapping the matching pieces into place.
Assembler programs have continuously improved since the first such
software was written in the early 1980s. More powerful computers are
also helping scientists assemble larger pieces of DNA faster than ever
before.
Nevertheless, even the most powerful assembly software relies more
on elegance and simplicity than on brute force. Many assemblers are
only a fraction of the size of a typical word processing program: 150,000
to 200,000 lines of code as opposed to several million. A program used
to assemble the human genome would fit easily on the hard disk of a
typical personal computer.
But you couldn't actually run the assembler from the computer on your
desk. Due to the huge number (millions of trillions) of comparisons
it must make and keep track of, an assembler needs lots of memory to
run: thousands of times the RAM required to run a word processing
program and much more than you are likely to have on your desktop computer.
How do scientists know if a genome sequence is right?
Genome scientists venture daily into uncharted territory. When a genome
has never been sequenced before, there is nothing to tell its explorers
whether they have sequenced it correctly. Moreover, DNA sequencing is
an activity that could have been invented by Mr. Murphy himself: Anything
that can go wrong will go wrong, and just about anything can go wrong.
Errors can emerge at any stage of the process: when DNA is chopped
up, when it is copied, as it goes through the sequencing machine, or
as it is put together. Some sequences are particularly difficult to
copy or to sequence and get left out. And random "noise" in the data
can cause a base to be misidentified or overlooked.
But a combination of redundancy and careful checking helps ensure that
errors in genome sequencing are kept to an absolute minimum.
One trick for eliminating errors is to sequence the genome more than
once. That is, scientists chop multiple copies of a genome up in such
a way that each base is sequenced several times (6 to 10 times on
average, depending on the specific project). That way, if the sequencing
machine gets a base wrong, or if a piece of DNA slips through the cracks
and doesn't get sequenced, there are likely to be other, correct reads
that will provide the sequence.
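The arithmetic behind this redundancy can be sketched as follows. The average number of times each base is sequenced (often called coverage) is the total number of bases read divided by the size of the genome. The second function relies on an idealized assumption that is not part of the text above: if reads land uniformly at random along the genome, the chance that a given base is missed entirely is approximately e raised to the power of minus the coverage. The function names and the project numbers are hypothetical.

```python
import math


def average_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def fraction_unsequenced(coverage):
    """Approximate fraction of bases covered by no read at all.

    Assumes reads start at uniformly random positions (a Poisson model),
    which real sequencing projects only approximate.
    """
    return math.exp(-coverage)


# Hypothetical project: a 1,000,000-base genome and 16,000 reads of 500 bases.
coverage = average_coverage(num_reads=16_000, read_length=500, genome_size=1_000_000)
print(coverage)                        # 8.0
print(fraction_unsequenced(coverage))  # a small fraction, well under 0.1 percent
```

Under this simplified model, coverage in the range of 6 to 10 leaves only a very small fraction of bases unread, although in practice some sequences resist copying or sequencing and are missed more often than chance alone would predict.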
In addition to identifying DNA bases, software on automatic sequencing
machines can evaluate the probability that a base really is the base it
appears to be. Error probabilities for all of the bases in a read are
added together for an estimate of the number of errors in the sequence.
Bad reads or parts of reads (those with a lot of errors or question
marks) are weeded out before they even make it to the assembly stage.
With slab-gel machines some of this quality control is done by humans,
while with capillary sequencers it is exclusively the province of computers.
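Adding up error probabilities can be illustrated with a short sketch. It assumes that each base comes with a quality score in the widely used Phred convention, where a score Q corresponds to an error probability of 10^(-Q/10); the text above does not say which scoring scheme any particular machine uses. The function names, the cutoff of one expected error, and the sample scores are hypothetical.

```python
def error_probability(quality):
    """Convert a Phred-style quality score into an error probability."""
    return 10 ** (-quality / 10)


def expected_errors(qualities):
    """Estimate the number of wrong bases in a read by summing error probabilities."""
    return sum(error_probability(q) for q in qualities)


def passes_quality_check(qualities, max_expected_errors=1.0):
    """Keep a read only if its estimated error count is at or below the cutoff."""
    return expected_errors(qualities) <= max_expected_errors


# Hypothetical quality scores for two very short reads.
good_read = [30, 30, 40, 20]
bad_read = [10, 5, 10, 3]

print(expected_errors(good_read), passes_quality_check(good_read))
print(expected_errors(bad_read), passes_quality_check(bad_read))
```

With these made-up scores, the first read is expected to contain about 0.012 errors and is kept, while the second is expected to contain about one error in only four bases and is discarded.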
In addition, assembler software compares all the different reads that
cover the same stretch of DNA and generates what is known as a "consensus"
sequence. For example, if a certain base comes out as an A nine times
and C the tenth, then chances are the base is really an A. An assembler
is designed to sift through conflicting information and decide which
sequence is likely to be right.
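A consensus of this kind can be sketched as a simple majority vote at each position. The sketch assumes the reads have already been lined up so that the same index in every read refers to the same position in the genome, and it uses "-" for a position a read does not cover and "N" for a position no read covers; those conventions and the function name are assumptions made for the example. Real assemblers typically also weigh the quality of each base rather than counting votes alone.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most common base at each position of equally long, aligned reads."""
    result = []
    for column in zip(*aligned_reads):
        # Ignore gap characters, which mark positions a read does not cover.
        bases = [base for base in column if base != "-"]
        if bases:
            # Ties are broken in favor of the base seen first in the column.
            result.append(Counter(bases).most_common(1)[0][0])
        else:
            result.append("N")
    return "".join(result)


# Nine reads call an A at the second position; the tenth calls a C.
reads = ["GATTACA"] * 9 + ["GCTTACA"]
print(consensus(reads))  # GATTACA
```

This mirrors the example in the text: nine votes for A outweigh one vote for C, so the consensus base is A.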
Once a sequence is assembled, there are several ways to make sure it
has been put together correctly. The sequence may be checked against
small parts of the genome that have previously been sequenced and assembled
or against various landmarks on genome maps. In other words, if an assembly
is consistent with scattered bits of known information, that is a good
sign it is correct overall.

Figure: Pulse of the Genome
Although computer programs can help resolve gaps and uncertainties
in a genome sequence, much of the final polishing is still done by people
known as finishers. These expert workers identify gaps in the sequence,
design experiments to fill in those gaps, and determine how to collect
any additional information that is necessary.
There is no mechanical substitute for the intuition and intelligence
of an experienced finisher, so finishing is currently a bottleneck in
the process of DNA sequencing. Automatic sequencing machines can churn
out raw sequence much faster than humans can analyze and polish these
sequences.
Many scientists foresee a day when genome sequencing will be routine, when
sequencing the genomes of many different species will help biologists
understand the patterns of evolution, or when sequencing the genomes
of individual humans will help doctors design tailor-made medicine.
But until speedy machines become finishers as well as sequencers, that
scenario will remain science fiction.
What makes sequencing the human genome different from sequencing other genomes?
The human genome is a lot bigger than other genomes that have been sequenced
in the past. Most genomes that have been sequenced to date belong to viruses,
bacteria, or other simple forms of life with relatively small genomes.
The human genome is about a thousand times larger than an average bacterial
genome. Even the fruit fly genome, the largest genome sequenced prior
to the human genome, is just 165 million base pairs, less than a tenth
the size of the human genome.
In addition, the human genome is about 25 to 50 percent repetitive
DNA, but bacterial and viral genomes contain very little of this exasperating
stuff. In repetitive DNA, the same short sequence is repeated over and
over again. For example, somewhere in the genome the sequence ATG may
be repeated 150 times in a row; elsewhere there may be 40 consecutive
copies of the sequence CCTTGCT.
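To make the idea of back-to-back repeats concrete, the following sketch counts the longest run of consecutive copies of a short sequence. The test sequence is artificial, built directly from the two examples in the text with a few arbitrary bases around them, and the function name is invented for the illustration.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of consecutive copies of `motif` found in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    # Each match is a run of whole copies, so its length divided by the
    # motif length gives the number of copies in that run.
    runs = [len(match.group(0)) // len(motif) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Artificial sequence: 150 copies of ATG, then 40 copies of CCTTGCT.
sequence = "GGC" + "ATG" * 150 + "TTCA" + "CCTTGCT" * 40 + "GA"

print(longest_tandem_run(sequence, "ATG"))      # 150
print(longest_tandem_run(sequence, "CCTTGCT"))  # 40
```

Counting repeats is easy when the whole sequence is already known, as it is here. The difficulty for an assembler is that it sees only short reads, and a read that lies entirely inside such a run looks the same wherever in the run it came from.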
In jigsaw puzzle terms, a genome with a lot of repetitive DNA would
be like a puzzle that includes a large number of identical or near-identical
pieces: one in which the entire foreground is a featureless field
of small, pink flowers, for example.
Like repetitive jigsaw puzzles, repetitive DNA can be difficult to
assemble. It is often difficult for scientists to determine how much
repetitive sequence belongs where. For example, 100 copies of ATG may
belong in one spot in the genome, or it may be that only 60 copies belong
there and 40 copies belong somewhere else.
Repetitive DNA may also be more difficult to sequence than other DNA.
Sometimes the procedures used to copy DNA and prepare it for sequencing
do not work on repetitive DNA, and a sequencing machine may have a hard
time reading the same string of letters over and over.
When is a genome sequence done?
That question can be answered in more than one way. At present, GNN's
analysis of the human genome indicates that there are about 2.91 billion
base pairs in the euchromatic region of the genome. For purposes of scientific
research, we can say that the genome sequence is 95 percent complete,
even though certain portions of the genome (namely the centromeres
and telomeres, which are the highly repetitive regions at the center and
ends of chromosomes) are widely considered to be unsequenceable, at
least with current technology.
But the sequence as it is now known is complete enough to be useful to
scientists as a base for future research in finding genes and understanding
how the human genome as a whole works.

Figure: Caenorhabditis elegans
The reasonableness of this approach is revealed by the scientists'
experience with the genome of Caenorhabditis elegans, a small
roundworm. The worm's genome of 97 million base pairs was pronounced
done in late 1998, but scientists are still finding a few mistakes and
holes in the sequence. Despite these minor imperfections, the sequence
has already helped researchers learn more about the animal's biology: how
it grows, develops, ages and dies, and goes about its daily life. And
that, after all, is the point of genome sequencing.
Updated on January 15, 2003