Firepower in the Lab: Automation in the Fight Against Infectious Diseases and Bioterrorism (2001)

Chapter: High-Throughput Sequencing, Information Generation, and the Future of Biology

Suggested Citation: "High-Throughput Sequencing, Information Generation, and the Future of Biology." Scott P. Layne, et al. 2001. Firepower in the Lab: Automation in the Fight Against Infectious Diseases and Bioterrorism. Washington, DC: Joseph Henry Press. doi: 10.17226/9749.


High-Throughput Sequencing, Information Generation, and the Future of Biology

J. Craig Venter

When my laboratory at the National Institutes of Health started the first automated sequencing for the Human Genome Project, we had six 373 automated DNA sequencers, each of which could run 16 lanes once a day. Even at this rate, the biggest challenge for the project was interpreting the data and, in the process, finding the genes. In 1991 the expressed sequence tag (EST) method was developed, facilitating scale-up by using cDNA libraries to find large numbers of genes.

At the start of the 1990s there were fewer than 2,000 human genes known. Now there are millions of ESTs in the databases. However, the biggest impact in genomics and in understanding microbial genomes and other species has been the use of new mathematical models to deal first with tens and hundreds of thousands, and then with millions and tens of millions, of sequences and to assemble them into the picture of whole genomes.

MICROBIAL SEQUENCING

In 1995 we decided to apply this new model, or assembly algorithm, dubbed the “whole-genome shotgun strategy,” to Haemophilus influenzae as a test organism. It took less than a year, using the facility we had at the Institute for Genomic Research (TIGR). In 1995 we published the first complete genome sequence of a nonviral organism in Science (Smith et al., 1995).
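The core of the whole-genome shotgun strategy is computational: fragment the genome into short random reads, then reassemble them by finding overlaps. The following Python sketch is a toy greedy assembler with an invented example sequence, nothing like TIGR's production algorithm, but it illustrates the idea:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no remaining overlaps; contigs stay separate
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# A made-up "genome" sheared into overlapping fragments, then reassembled.
genome = "ATGGCGTGCAATGGCTTA"
reads = [genome[k:k + 8] for k in (0, 4, 8, 10)]
print(greedy_assemble(reads))  # the single reconstructed sequence
```

Real assemblers must also cope with sequencing errors, repeats longer than the reads, and millions of fragments, which is why the new mathematical models mentioned above were the decisive advance.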

A few years earlier we had published the complete sequence of smallpox (Massung et al., 1993), but it was orders of magnitude smaller. Observed in this sequence was something that has been found in every pathogen sequenced since—a finding that has changed our view of evolution. In front of a large number of the genes—basically all those coding for surface molecules (lipopolysaccharide biosynthesis)—there are tetrameric repeats that act as preprogrammed evolutionary switches. Every 10,000 or so replications there is slippage that changes the downstream reading frame, introducing stop codons into the genes—that is, knocking them out. This is an important finding for understanding emerging infections and biological warfare agents. For example, a number of companies have tried to use the H. influenzae sequence to develop new vaccines but have ignored these signals of proteins designed to evolve very rapidly in real time. The result was no new stable vaccines, because the organism changed too much in too short a period for a vaccine to be effective.
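The slippage mechanism is simple to illustrate: a tetrameric repeat unit is 4 bases long, so gaining or losing one unit shifts the downstream reading frame by 4 mod 3 = 1 base, which can expose a premature stop codon. The following sketch uses an invented gene sequence, not a real H. influenzae locus:

```python
STOPS = {"TAA", "TAG", "TGA"}

def first_stop(seq):
    """Index of the first in-frame stop codon, or None if there is none."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None

# Hypothetical gene: a 5' tract of CAAT repeats followed by coding sequence.
repeat = "CAAT"
tail = "GGCATTGCCTAATCC"          # a TAA lurks out of frame in this tail
gene_on = repeat * 4 + tail       # four repeats: reads through with no stop
gene_off = repeat * 3 + tail      # slippage dropped one repeat: frame shifts

print(first_stop(gene_on), first_stop(gene_off))  # None 21
```

With four repeats the downstream sequence is read in a stop-free frame; losing one repeat unit brings a premature stop codon into frame and knocks the gene out, exactly the on/off switching described above.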

Currently, we are in an exponential growth phase in microbial sequencing. The first two microbes were sequenced in 1995, and four were sequenced in 1996 and 1997. To date, more than 30 have been completed. The American Society for Microbiology has indicated that the next phase will include the sequencing of more than 500 microbes. Sequencing of biological warfare pathogens is likely to increase over the next few years. For comparative genomic purposes, having a number of complete sequences is orders of magnitude more valuable than having just one.

Sequencing of H. influenzae uncovered a number of genes that were completely new to biology. Initially it was thought that this was due to a paucity of data in the databases. However, sequencing of subsequent genomes has revealed that, on average, roughly half of the genes in each genome are new, in the sense that they match nothing previously seen. Some of these genes are highly conserved, but many in each species are novel, posing a real challenge to biology.

Before the first two genomes were sequenced—one from a gram-positive organism and the other from a gram-negative organism—we mistakenly expected tremendous overlap between them, which would have implied that the gene pool on this planet was remarkably small. Most of the genes turned out to be of unknown function, however, and that changed the view of genetic diversity. Subsequent data have not altered this revised view: most likely, the percentage of new genes will remain the same even after 500 organisms have been sequenced. Moreover, many techniques have been developed to try to identify organisms based on one or two gene sequences, but in view of the large percentage of previously unknown genes in each genome, it is unlikely that this approach will be effective. As we go forward in evolution, we see a great deal of gene duplication, increased specificity, and significant variation on a theme.

At TIGR we have spent a lot of time trying to characterize a simple organism, Mycoplasma genitalium, a possible causative agent of urethritis, although it is questionable whether it is a human pathogen. With 470 genes it has the smallest genome of any species known to be a self-replicating organism. Just how many of those genes are necessary for life in a rich growth environment is an important question.

Hutchison et al. (1999) developed a method to conduct whole-organism transposon mutagenesis. Having the complete genome sequence allows one to incorporate transposons and then look to see where they insert in the genome. Because one can sequence outward from the transposon, its precise location can be determined. This approach has been aided by the sequencing of a second bacterium, Mycoplasma pneumoniae. It turns out that the entire M. genitalium genome is contained in the M. pneumoniae genome; however, M. pneumoniae has 200 more genes than M. genitalium. This finding provided a way to test evolutionary hypotheses, with the assumption that because the 200 extra genes in M. pneumoniae are not required for normal life and replication, we should be able to knock those genes out with less likelihood of the organism dying than if genes of M. genitalium were randomly removed.

The resulting transposon maps showed which insertions a cell could survive: if a transposon inserted in the middle of an essential gene, it killed the organism, whereas if an insertion in a gene did not prove lethal, the gene was deemed nonessential. Eventually, it was found that 300 of the 470 genes could not be knocked out without killing the cell. In the next stage of the experiment, plans were made to create a synthetic organism, a step that raised some ethical concerns at TIGR.
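In outline, the logic of reading such a transposon map can be sketched as follows. The gene coordinates and insertion positions here are hypothetical; in the real study each insertion site was mapped by sequencing outward from the transposon.

```python
def nonessential(genes, viable_insertions):
    """Genes hit by a transposon in at least one viable mutant are nonessential.

    genes: dict of gene name -> (start, end) coordinates on the genome
    viable_insertions: insertion positions recovered from surviving cells
    """
    hit = set()
    for pos in viable_insertions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                hit.add(name)
    return hit

# Toy genome with three hypothetical genes (all coordinates illustrative).
genes = {"geneA": (0, 300), "geneB": (300, 900), "geneC": (900, 1200)}
viable = [120, 450, 470, 610]  # insertion sites recovered from live cells

dispensable = nonessential(genes, viable)
print(sorted(dispensable))              # genes that tolerated an insertion
print(sorted(set(genes) - dispensable)) # never hit: candidate essential genes
```

Note the asymmetry in the inference: a gene that tolerates an insertion is demonstrably nonessential, but a gene that is never hit is only a candidate essential gene, since the transposon may simply not have landed there yet.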

Earlier at the National Institutes of Health (NIH), debate began when a laboratory started sequencing the smallpox genome. The concern was that publishing the smallpox genome sequence would be akin to publishing the blueprint for a bomb, because eventually anybody with a molecular biology tool kit would be able to synthesize smallpox. This is not far from the truth. It would be relatively trivial to synthesize and replicate the smallpox virus or even modify vaccinia. In short order, genomes will be synthesizable from scratch. Consequently, it will be critical to have the complete genetic sequence of every potentially emerging pathogen and potential biological warfare agent, because only by examining the complete sequence can it be determined whether it has been deliberately modified.

Even though our focus is often on microbes, plants also could be important targets of biological warfare. Fifty percent of the world's food production comes from just three species. With funding from the National Science Foundation, TIGR is in the process of sequencing the rice genome and the Arabidopsis genome, which are models for about 170,000 species. TIGR has just finished the first chromosome and has discovered many features that play key roles in evolution. A surprising finding is that as we go up the evolutionary tree, genes have many more introns, and these DNA sequences play key regulatory roles.

SEQUENCING THE MALARIA GENOME

Malaria is a worldwide threat to civilian and military populations. The U.S. Department of Defense (DOD) estimates that if U.S. troops were deployed to some of the drug-resistant malaria regions of the world, there could be 20 to 30 percent casualties just from mosquito bites infecting soldiers with drug-resistant malaria. As a result, DOD, NIH, and some private foundations have funded TIGR and other organizations to sequence malaria genomes. For years many people thought that the malaria genome was undecipherable and could not be sequenced because of its high AT content. Although most clones could not even survive in Escherichia coli, we decided to try a whole-genome shotgun method with small clones. This method worked well.
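The property that made the genome so hard to clone is simple to quantify. As a rough point of reference, the Plasmodium falciparum genome is about 80 percent A+T, far above the roughly 50 percent typical of many bacteria; the short sequence below is invented for illustration:

```python
def at_content(seq):
    """Fraction of A and T bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)

# An invented AT-rich fragment, in the spirit of the malaria genome.
print(at_content("AATTATGCAT"))  # 0.8
```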

Some unique methods were used—such as restriction digests of single molecules of single chromosomes—to verify the assembly of the malaria genome. Moreover, even though this is a eukaryote, genes near the telomeres at both ends of the chromosomes appear to play the same role that such sequences play in bacteria—that is, they lead to constant antigen variation and evolution of the malaria parasite. This is why these organisms evolve rapidly, overcoming attempted drug interventions.

SEQUENCING TUBERCULOSIS

In 1999 we finished sequencing the Oshkosh strain of tuberculosis, a highly virulent emerging strain that originated with an individual working at an Oshkosh clothing factory in rural Tennessee. Fortunately, the strain was drug sensitive. Comparing the genome of this strain to a laboratory strain revealed that it had more genes. In addition, some genes may have been spliced out of the laboratory strains—which are still infectious—by transposons inserting near each other in the genome. We now have the tools to understand for the first time why this new strain is so infective. These tools also allowed us to realize that this probably was not a new strain but rather a re-emergence of an ancient strain. Tuberculosis used to be much more infective than it appears to be now. This pattern could also occur with smallpox.

BIOCOMPUTING: THE MAJOR TOOL OF SEQUENCING

The scale of sequencing has changed tremendously in the past year with the introduction of a completely automated sequencing instrument from Applied Biosystems. The system uses capillaries, 96 at a time, through which the DNA flows into a cuvette. A laser then shines through the cuvette, exciting all the dyes simultaneously, giving us 100-fold more sensitivity and throughput than we had before. Celera has 300 of these high-throughput sequencing machines in a new facility. These machines will allow Celera to do 200,000 sequencing reactions per day, generating 2 billion base pairs of sequence every month.
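As a back-of-envelope check of these figures (assuming a 30-day month, which the text does not state), the quoted rates imply a read length of roughly 330 base pairs per reaction:

```python
machines = 300
reactions_per_day = 200_000
bp_per_month = 2_000_000_000
days_per_month = 30  # assumption for the arithmetic

reads_per_month = reactions_per_day * days_per_month   # 6,000,000 reactions
implied_read_len = bp_per_month / reads_per_month      # bases per reaction
reactions_per_machine = reactions_per_day / machines   # daily load per machine

print(round(implied_read_len), round(reactions_per_machine))  # 333 667
```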

Using these tools, Celera is concentrating on four key genomes—the human genome, the mouse genome, an insect genome, and the rice genome. It is hoped that by studying mutations in humans we can understand the susceptibility of humans to certain pathogens.

To put this in perspective, it took us a year to do the H. influenzae genome, which required the sequencing of approximately 25,000 clones; at the new throughput, this amounts to a rate of roughly 12 pathogens per 24 hours for complete genome sequencing. There are numerous other possible applications of this technology, including sampling environments to detect and culture microbial agents. The ability to run 200,000 samples a day completely alters what can be accomplished.

For example, in an emergency one could collect samples from an individual and within 24 hours have a pathogen's complete genomic sequence, work through the assembly algorithms, and run comparisons to databases. Many people in genomics have been concerned about how quickly that many base pairs could be analyzed. Celera has been concentrating on the computational side of this problem and is collaborating with Compaq to build the second-largest supercomputer ever constructed.

Eventually, we will have more than 1,200 interconnected EV6 Alpha processors. Although it is complicated, even with the best algorithms, to compare sequences to each other, just one of these new Alpha chips can do over 250 billion of these pairwise comparisons an hour. We need this power just to deal with the daily throughput. These computers each have 20 to 30 gigabytes of RAM (random access memory). If a terabyte-RAM machine were available today, we would be among the first to purchase it, because dealing with the computational side is very important.

Over the next 18 months, we plan to complete the sequence of the human genome. This requires the sequencing of 70 million clones, or the equivalent of 3,000 complete genomes the size of H. influenzae.
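A quick back-of-envelope comparison with the figures given earlier (25,000 clones for H. influenzae, 200,000 reactions per day) shows that the numbers are mutually consistent:

```python
hinfluenzae_clones = 25_000     # clones for the H. influenzae project
human_clones = 70_000_000       # clones planned for the human genome

equivalents = human_clones / hinfluenzae_clones   # H. influenzae-sized projects
days_at_new_rate = human_clones / 200_000         # days of raw sequencing

print(round(equivalents), round(days_at_new_rate))  # 2800 350
```

Seventy million clones is roughly 2,800 H. influenzae-sized projects (about 3,000, as stated), and at 200,000 reactions per day the raw sequencing takes about 350 days, comfortably within the 18-month plan.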

In terms of human variation, we all differ from one another in about 3 million letters of genetic code. Although these differences can lead to diseases, our sequencing of the complete genomes of five people will provide a database of 20 million single nucleotide polymorphisms (SNPs) over the short term. Others have calculated that this database will represent 80 percent of the abundant polymorphisms in the human population, providing a tremendous resource for understanding the complexity of human diseases. A potential application in the next decade is in pharmacogenetics, as the pharmaceutical industry moves to patient segmentation based on SNPs for drug trials.

SUMMARY

All of the information being generated by these sequencing efforts needs to be organized and processed in order to be useful. To understand the structure of genomes, we need to be able to overlay the mouse genome on top of the human genome and those of other species. The supercomputers available today are woefully inadequate for this task. Biology is moving past the forefront of computing, requiring the development of new computational capabilities. However, there is a big difference between having the information and understanding it. We still must rely on the fundamentals of developmental biology to understand DNA repair, the cell cycle, and gene-gene and gene-environment interactions. Without the ability to process all of the data emerging from these sequencing efforts, we will not be able to make progress in understanding and curing complex multigenic diseases such as cancer. Nor will we be able to understand human variability in response to infectious agents.

REFERENCES

Hutchison, C. A., S. N. Peterson, S. R. Gill, R. T. Cline, O. White, C. M. Fraser, H. O. Smith, and J. C. Venter. 1999. Global transposon mutagenesis and a minimal Mycoplasma genome. Science, 286(5447):2165-2169.

Massung, R. F., J. J. Esposito, L. I. Liu, J. Qi, T. R. Utterback, J. C. Knight, L. Aubin, T. E. Yuran, J. M. Parsons, V. N. Loparev, et al. 1993. Potential virulence determinants in terminal regions of variola smallpox virus genome. Nature, 366(6457):748-751.

Smith, H. O., J. F. Tomb, B. A. Dougherty, R. D. Fleischmann, and J. C. Venter. 1995. Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science, 269(5223):538-540.
