The Homophila Database:
Screening the Fly Genome for Human Disease Genes
Edward R. Winstead
June 11, 2001
Two years ago, as the sequencing of the Drosophila genome neared completion, researchers took an increasing interest in using the fly to study human disease. The publication of the sequence in Science in March of 2000 has led to several studies assessing the prevalence of human disease counterparts in the fly genome.
Researchers at the University of California San Diego were among those who began constructing a database of human genes, fly genes and genetic diseases early on. The result is Homophila, a database that went online last fall. Now, in the June issue of Genome Research, they report on 548 Drosophila genes representing 714 different diseases that appear to be counterparts to human disease genes and may be good candidates for further study.
The authors have mined the Drosophila genome for links to existing knowledge about medical genetics. Although the value of the fly as a model system is well known, they argue that the biological connections are not always obvious. At the Homophila database Web site, human genes, fly genes, and diseases are cross-referenced and linked, for instance, to scientific abstracts and to a catalogue of genetic conditions called OMIM (Online Mendelian Inheritance in Man).
To produce the latest version of Homophila, Ethan Bier, a biology professor at UCSD, and colleagues screened a set of 929 human disease genes against the complete Drosophila sequence. Their analysis identified the 548 genes as potential relatives of the human genes based on a high degree of similarity in their amino acid sequences. Whether two genes are actually related or share similar functions is impossible to know by comparing their sequences alone.
"We're trying to show with the database that the fly is an extremely useful model system for studying genes associated with disease in humans," says Bier. He proposes that the set of fly genes is a starting point for investigators interested in studying human disease in Drosophila. Homophila is an ongoing project whose ultimate goal is to facilitate communication between fly and human researchers.
Researchers who do not normally work together collaborated on Homophila. The primary architect of the database is Michael Gribskov, a computational biologist at the San Diego Supercomputer Center.
"Ethan Bier came to me in 1999 and said, 'Let's try to find all the human disease genes in the fly,'" recalls Gribskov. "Homophila brought together experimentalists with real biological questions and our bioinformatics group, which does not usually work on fly genetics." Gribskov's group builds genomic databases for the plant Arabidopsis, among other projects.
The research team included Lorraine Potocki, a clinical geneticist and pathologist at Baylor College of Medicine, in Houston, Texas. Working with the fly researchers at UCSD, she generated lists of the kinds of human disorders that can potentially be studied using Drosophila as a model organism.
The organization of human and fly data and the analysis by Potocki suggested that there are fly counterpart genes for human conditions like blindness, deafness, blood disorders, and immunological disorders. "This came as a bit of a surprise as most people don't think to study hearing or cancer in Drosophila," says Lawrence T. Reiter, a UCSD researcher with a background in human genetics and co-leader of the Homophila project.
The 929 human disease genes in the study were compiled from OMIM, which was created by Victor A. McKusick, of the Johns Hopkins University School of Medicine, and colleagues. The categories of diseases listed on Homophila include neurological disorders, cardiovascular disorders, and disorders of skeletal development.
Typing 'deafness' into the Homophila search engine, for example, brings up hearing-loss syndromes, human genes associated with them, and the fly genes that match these sequences. Drosophila has about a dozen sequences resembling deafness genes in humans, according to the Homophila database.
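The kind of cross-referenced lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Homophila implementation: the disease names, the gene identifiers, and the `search` function are all hypothetical placeholders.

```python
# Hypothetical records for illustration only; these are placeholder
# names, not actual Homophila, OMIM, or fly genome entries.
DISEASE_TO_HUMAN_GENES = {
    "deafness, syndrome X": ["HUMAN_GENE_1", "HUMAN_GENE_2"],
    "anemia, type Y": ["HUMAN_GENE_3"],
}

# Each human gene maps to the fly genes whose sequences matched it.
HUMAN_TO_FLY_GENES = {
    "HUMAN_GENE_1": ["FLY_GENE_A"],
    "HUMAN_GENE_2": ["FLY_GENE_A", "FLY_GENE_B"],
    "HUMAN_GENE_3": [],
}


def search(keyword):
    """Return {disease: {human_gene: [fly_genes]}} for diseases whose
    name contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    results = {}
    for disease, human_genes in DISEASE_TO_HUMAN_GENES.items():
        if keyword in disease.lower():
            results[disease] = {
                gene: HUMAN_TO_FLY_GENES.get(gene, [])
                for gene in human_genes
            }
    return results


if __name__ == "__main__":
    for disease, genes in search("deafness").items():
        print(disease)
        for human_gene, fly_genes in genes.items():
            print(f"  {human_gene} -> {fly_genes}")
```

A search for a keyword walks from diseases to human genes to fly genes, which is the chain of links the database is described as providing.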
"The biggest surprise of this study to me was that so many human disease genes have cognates in flies," says Gribskov. He adds that 'cognate' implies a functional similarity between genes, but not necessarily the degree of similarity needed to infer homology (common evolutionary origin).
"Given how different humans are from flies, I expected about 20 to 30 percent of the human disease genes to have matches in the fly," he adds. "We found matches for nearly 80 percent of the genes we screened."
Two previous analyses of human disease counterparts in the fly came up with fewer candidate sequences, although the strategies and methods vary among the studies. Just how many human disease genes might have counterparts in the fly is at present unknown. After the Drosophila sequence was published, one research team reported that 178 fly genes are likely to be homologues of a set of 287 human disease genes.
The authors of that study, Mark E. Fortini, of the University of Pennsylvania School of Medicine, and colleagues, noted in a paper last year that the literature on the subject includes estimates that between half and three-quarters of the human disease genes have counterparts in the fly. The group found that 62 percent of the human disease genes in their set had homologues in the fly (178 fly genes out of 287 human genes).
Fortini and colleagues started with a list of more than 800 human genes compiled from OMIM, medical textbooks and scientific articles on classes of genes. The researchers eliminated over half of these, however, because they did not meet the criterion for the study, which was that "the human gene must actually be mutated, altered, amplified, or deleted in human subjects with the disease." The final list, they wrote in The Journal of Cell Biology, was not meant to be comprehensive.
How many 'hits' are generated in any cross-genomic comparison depends on several factors, including the statistical methods used to define evolutionary relatedness. The stricter the standard of relatedness based on the similarity of two gene sequences, the fewer the hits.
Bier and colleagues generated hit lists for a variety of 'E values,' a statistical measure of the odds that the match between the two sequences could have occurred by chance. "An E value of 10^-5 means that you have to run 10,000 searches with a random query to get the match you're seeing," explains Gribskov.
The 548 fly genes in the UCSD analysis were identified using an E value of 10^-10, according to the Genome Research paper. The researchers call these genes "clear hits."
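To see why a stricter cutoff shrinks the hit list, consider the following minimal Python sketch. It assumes each match has already been assigned an E value by a sequence-comparison program; the hits, gene names, and function names here are invented for illustration and are not data or code from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    human_gene: str  # hypothetical identifier
    fly_gene: str    # hypothetical identifier
    e_value: float   # smaller means less likely to be a chance match


# Made-up hits for illustration only; not results from Homophila.
HITS = [
    Hit("HUMAN_A", "FLY_1", 1e-40),
    Hit("HUMAN_B", "FLY_2", 1e-12),
    Hit("HUMAN_C", "FLY_3", 1e-7),
    Hit("HUMAN_D", "FLY_4", 1e-3),
]


def filter_hits(hits, max_e_value):
    """Keep only hits whose E value is at or below the cutoff."""
    return [h for h in hits if h.e_value <= max_e_value]


def count_by_cutoff(hits, cutoffs):
    """Count how many hits survive each cutoff."""
    return {c: len(filter_hits(hits, c)) for c in cutoffs}


if __name__ == "__main__":
    for cutoff, n in count_by_cutoff(HITS, [1e-5, 1e-10]).items():
        print(f"E <= {cutoff:g}: {n} hits")
```

With these invented values, three hits pass the looser cutoff and two pass the stricter one, which mirrors the point that a tighter standard of relatedness yields a shorter list.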
Gribskov's group determined that 409 of the 548 fly genes on the clear-hit list also have cognates in yeast. If the Homophila project develops as planned, the database will be expanded and updated to include newly discovered human and fly genes as well as data on other species. "Now that the human sequence data are available, we expect many more human disease gene candidates to be identified in the coming months," says Bier.
With so much data being generated all the time, researchers face a significant challenge in organizing and managing the information efficiently. "There's been a real revolution in genomic databases in the last five years," observes Gribskov. Until the recent explosion of sequence data, he says, many researchers were interested in downloading information and setting up their own databases.
The trend today is toward a kind of one-stop-shopping for data. "The personalized approach is no longer practical because the data sets are so big," he says. "And the Internet provides a great way to have a centralized service that is easy for any researcher to use."