Firepower in the Lab: Automation in the Fight Against Infectious Diseases and Bioterrorism (2001)

Suggested Citation: "Input/Output of High-Throughput Biology: Experience of the National Center for Biotechnology Information." Scott P. Layne, et al. 2001. Firepower in the Lab: Automation in the Fight Against Infectious Diseases and Bioterrorism. Washington, DC: Joseph Henry Press. doi: 10.17226/9749.

7

Input/Output of High-Throughput Biology: Experience of the National Center for Biotechnology Information

David J. Lipman

INTRODUCTION

This paper describes several current projects being conducted by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, that have involved collaboration among numerous groups. It will also review some aspects of how these projects were organized and developed.

MEDLINE/PUBMED

The PubMed system provides access to Medline, a database of more than 10 million abstracts in the biomedical literature, and also features links to online journals and a number of factual databases.1 PubMed currently receives approximately 8 million hits each day. Although the development of PubMed may not appear to involve high-throughput biology, this system does illustrate how many biology databases were built in the past, as well as their strengths and weaknesses.

First, the authors of the articles that are accessed through PubMed do not participate in its development and are not required to have knowledge of the database's functionality. The information is mainly keyed in from the journals, although currently there is some use of electronic input

1 PubMed, Medline, and MeSH are registered trademarks of the National Library of Medicine.


and optical character recognition. Nearly 400 journals submit electronic header information, which is collected in a dataset called Pre-Medline.

It is also important to note that an extensive amount of manual work may be necessary when different groups produce the data and build the database. Medline, for example, requires the assignment of keywords, a process that involves many indexers who read the entire article and then apply appropriate terms from a controlled vocabulary. In fact, a limiting factor for Medline, in terms of the number of journals included and the speed at which the records can be completed, is the cost of this indexing.

Both the availability and the quality of information are important factors for certain databases. For example, with PubMed a search can be conducted using the title and abstract alone, and this information can be made available as soon as the publisher provides it—even before the minimal checking performed at this stage is completed. Other databases also use this process, one that may be particularly relevant for rapid dissemination of information in the area of infectious diseases.

GENBANK

GenBank,2 a database of gene sequences, now contains over 4 billion base pairs of DNA from more than 60,000 different species (Benson et al., 2000). Unlike most factual databases, GenBank is used for computation. In fact, most of the new biological databases being developed are designed for computation, which requires information to be validated in a variety of ways. For example, sufficient similarity can often be found between the protein from a human gene and one from another organism (e.g., yeast) such that the related yeast sequence can be retrieved through a database search. What is already known experimentally about the yeast sequence can then be applied to the human gene. Even if it is not known what the sequence or the protein does in yeast, experiments can be conducted in yeast that cannot be conducted in humans. This is what makes comparative sequence analysis such a powerful tool.
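
The similarity searches described above rest on local sequence alignment. The following is a minimal sketch of the Smith-Waterman local alignment score, the exact method that heuristics such as BLAST approximate; the match, mismatch, and gap values here are illustrative assumptions (real protein searches use substitution matrices such as BLOSUM62):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.
    Illustrative scoring only; real tools use substitution matrices."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A conserved core region scores highly even when the flanks differ:
print(smith_waterman("MKTAYIAKQR", "GGMKTAYIAKQRGG"))  # 20 (10 matches x 2)
```

Because the score is local, a strongly conserved domain can be found even when the rest of the two proteins has diverged, which is exactly what makes cross-species retrieval possible.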

At GenBank, considerable work goes into building the database, using various tools that have been developed for data computation. In addition, a number of tests are conducted at different steps along the way to determine whether the genes are being translated to the protein correctly. The assumption is that a number of the databases that would be developed for high-throughput infectious disease projects would require similar data validation.

2 GenBank is a registered trademark of the U.S. Department of Health and Human Services.


The data flows for Medline and GenBank differ in a number of ways. For Medline, processing cannot begin in terms of keyboarding or MeSH (Medical Subject Headings) indexing until the journal arrives in its hard-copy format. For GenBank all of the data are electronic, and a submission program is used to help ensure that the syntax of the GenBank submission is preserved. However, most authors provide only annual submissions to GenBank, primarily because they have other priorities, and are unlikely to devote time to understanding the syntax and semantics of the records with which they are dealing. Thus, a disadvantage of this system is that, although authors can produce syntactically satisfactory records, staff skilled in molecular biology must review the biological content.
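
The split between syntax and semantics described above can be made concrete with a toy validator. This is a hypothetical sketch for a FASTA-style record, not the actual GenBank submission format or software; the point it illustrates is that a program can enforce structure, while biological correctness still requires a trained curator:

```python
VALID_BASES = set("ACGTUNRYKMSWBDHV")  # IUPAC nucleotide codes

def check_fasta_syntax(text):
    """Return a list of syntax problems in a FASTA-style record.
    A machine can verify this structure; it cannot verify that the
    biological annotation is correct -- that still needs a reviewer."""
    problems = []
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        problems.append("record must begin with a '>' definition line")
        return problems
    if len(lines[0]) < 2:
        problems.append("definition line is empty")
    body = "".join(lines[1:]).upper()
    if not body:
        problems.append("no sequence data")
    bad = set(body) - VALID_BASES
    if bad:
        problems.append(f"invalid characters: {sorted(bad)}")
    return problems

print(check_fasta_syntax(">seq1 example\nACGTACGT"))  # []
print(check_fasta_syntax("ACGT"))  # reports the missing '>' line
```

Every problem this function can catch is syntactic; a record that passes cleanly may still mislabel its organism or gene, which is why skilled molecular biologists must review the content.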

It is also important to keep in mind that while authors use sequences to answer specific questions, others who will be using GenBank may not be trying to answer those particular questions. In addition, because these authors were only interested in answering specific questions, they may have failed to provide information that, although available to them, did not relate to those questions. This is a common challenge encountered in developing any database: those who provide the data may have one main purpose in mind, while the database developer has a much broader range of purposes.

In the past, GenBank used indexers who manually scanned journals to locate sequence data in the articles. Although it would seem that direct electronic submission would be more efficient, in some ways this is not true. While obtaining the sequence electronically is certainly valuable, often authors provide additional information in their papers, and one might more easily find the information needed by reading the paper than by communicating with the author via e-mail or by telephone or fax.

Thus, although much progress has been made, challenges remain, and the process continues to be expensive and relatively slow. For example, annotating the records continues to require an extensive amount of time, notwithstanding the fact that GenBank annotators are highly trained and educated. In addition, senior scientists on NCBI staff are on call to conduct final reviews of these records.

Over time the genome centers will be providing a huge volume of data, and NCBI can work with the centers to make sure they develop a mode of submission with the correct syntax and semantics. Because in most cases these centers are funded for an infrastructure project, to do the sequencing rather than to answer a particular question, they are very cooperative. NCBI fully appreciates this cooperation and recognizes that staff at the genome centers are vastly overworked. However, it is often difficult to obtain the information that is needed from these centers. This is because many staff members have had long experience with this


kind of sequencing and still believe that once they post the data on their website, the job is done.

In planning any high-throughput molecular epidemiology project, it is crucial to ensure that, from the beginning, the groups that are developing and generating the data understand that cooperation with those who are collecting the data makes a significant difference in how the information can be ultimately used.

Multiple collection points exist for all of these kinds of data. With two other databases of DNA sequences, the DNA Databank of Japan (DDBJ) and the EMBL Library of Europe, there is general agreement on underlying syntax and semantics, and daily exchanges occur among these centers, with some high-throughput genomes sent to one group and some to another. The level of agreement here is fairly basic, and at times a genome center has information that one of the collection databases cannot represent in its internal data structures. That information is sometimes lost, and this results in complicated arrangements in which, for example, a center in the United Kingdom submits its primary information to EMBL but also submits some additional information directly to NCBI.

Finally, GenBank provides data in a number of ways. It is available on NCBI's site for interactive use or for computing via the BLAST (Basic Local Alignment Search Tool) program (Altschul et al., 1990; 1997). It is also available via FTP to be downloaded and integrated with other datasets. With any of these projects that are creating an information infrastructure in a particular area, it is important to make the data as widely available as possible. It is more effective for a variety of companies and other academic sites to work to make the data available than it is to rely on one center to provide access to the data.

Medline and GenBank also share a characteristic that makes them a bit more challenging to build than some other databases: they are continually being updated, which makes information handling much more complex. Another project, the transcript map, provides periodic releases that involve complete recomputations, an approach that makes information handling much easier. Thus, with a high-throughput project a full re-release every few months makes information handling a less expensive and more efficient process and also contributes to the development of a much more robust system than one in which the information is being updated continually.
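
The design trade-off described above, continual updating versus periodic full re-release, can be sketched schematically. The function names and the dictionary-based data model below are illustrative assumptions, not any actual NCBI code:

```python
def full_rebuild(source_records, derive):
    """Periodic-release model: recompute the derived dataset from scratch.
    Simpler and more robust -- no stale state can survive a release."""
    return {key: derive(rec) for key, rec in source_records.items()}

def incremental_update(derived, changed, derive):
    """Continuous-update model: patch only the changed records.
    Cheaper per change, but deletions, renames, and cross-record
    dependencies must each be handled explicitly and correctly."""
    for key, rec in changed.items():
        if rec is None:
            derived.pop(key, None)  # deletions are easy to mishandle
        else:
            derived[key] = derive(rec)
    return derived

source = {"a": "acgt", "b": "ttaa"}
view = full_rebuild(source, str.upper)          # {'a': 'ACGT', 'b': 'TTAA'}
view = incremental_update(view, {"b": None, "c": "ggcc"}, str.upper)
print(view)                                     # {'a': 'ACGT', 'c': 'GGCC'}
```

The full rebuild discards all intermediate state at every release, which is why the paragraph above describes it as cheaper to operate and more robust than a continuously patched system.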

HUMAN GENE MAP

A very different project from the information-handling point of view, and an extremely interesting model, is the human gene map. There have been two major releases of this project—one in 1996 (Schuler et al., 1996)


and one in 1998 (Deloukas et al., 1998), which involved an international mapping consortium of genome centers in the United States, the United Kingdom, France, and Japan. Approximately 16,000 genes were mapped in the first release, and then more than 30,000 genes were mapped in the second release. This is very useful for gene hunting. If a region of the genome can be identified based on affected families that may have a disease gene, one can then click on that little part of the map or input the two markers and find all of the human genes that have already been mapped there. A region can be input to get all of the markers that are in that region, or a gene itself can actually be entered. In some cases a database search can make a year's difference in terms of discovering the gene.

The evolution and information-handling aspects of this project are interesting. It began in an informal manner, with no directly targeted funding. Several mapping groups and Greg Schuler at NCBI developed a way to build the map, and ultimately the first release was so successful that additional funding was received to continue the project. One of the consequences of developing the project in this way was that a tremendous amount of cooperation emerged among all of the participating groups. From the beginning, all of the parties involved functioned as one team, something that is not always possible.

The process of developing the gene map involves clustering partial gene sequences called expressed sequence tags (ESTs). NCBI conducted the computational clustering task and developed a resource called UniGene (Schuler, 1997), which, while initiated specifically for the gene map project, has received far greater use. ESTs are clustered, and then nonredundant representatives are selected from each gene; a sequence of that EST is sent to one of the genome centers. This is an information transfer process; a reagent is not actually being sent. The genome centers take these sequences and, using the PCR technique, develop unique markers for the ESTs, which are mapped on two different radiation hybrid panels. Several laboratories are involved at any given time developing many of these markers, which are then sent to a database at the European Bioinformatics Institute called RHdb (Radiation Hybrid Database). Next, two centers download the data from RHdb and construct a map using that information. Map construction involves complete recomputation each time it is revised.
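
The clustering step described above can be illustrated with a drastically simplified sketch: two ESTs are placed in the same cluster when they share enough short subsequences (k-mers). The threshold and the greedy single-linkage criterion are illustrative assumptions; UniGene's actual clustering uses sequence alignment and far more careful rules:

```python
def kmers(seq, k=8):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_ests(ests, k=8, min_shared=3):
    """Greedy single-linkage clustering: an EST joins the first cluster
    in which it shares at least min_shared k-mers with any member."""
    clusters = []  # each cluster is a list of (name, kmer-set) pairs
    for name, seq in ests:
        ks = kmers(seq, k)
        home = None
        for c in clusters:
            if any(len(ks & member_ks) >= min_shared for _, member_ks in c):
                home = c
                break
        if home is None:
            home = []
            clusters.append(home)
        home.append((name, ks))
    return [[name for name, _ in c] for c in clusters]

# Hypothetical ESTs: a1 and a2 overlap the same gene; b1 is unrelated.
print(cluster_ests([("a1", "ACGTTGCAAGGCTTA"),
                    ("b1", "TTTTCCCCGGGGAAA"),
                    ("a2", "GCAAGGCTTACCGGA")]))  # [['a1', 'a2'], ['b1']]
```

Once clusters are formed, one nonredundant representative per putative gene can be chosen for marker development, which is the information transfer to the genome centers that the paragraph describes.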

Through an ongoing cycle of checking data and relaying information back to the centers, NCBI can refine the quality of the data and improve the quality of its own clustering. The feedback provided is also useful to improve the quality of intermediate information resources, which are useful internally and also distributed via the World Wide Web.

CANCER GENOME ANATOMY PROJECT

The Cancer Genome Anatomy Project (CGAP) provides a model for future full-length cDNA projects by the National Institutes of Health (NIH). Although NCBI is involved in this effort, CGAP is funded and directed by the National Cancer Institute (NCI). The project has developed information about reagents for deciphering the molecular anatomy of the cancer cell. From the beginning, this project developed not just information but reagents as well. A great deal of information and many ways of querying the system are available on this site. One of the main goals is to develop a tumor gene index of all of the genes that are involved in cancer. Obviously, because any gene could have something to do with cancer, in some sense this involves trying to find all the human genes. However, the goal is to find the genes that are most likely to be involved in cancer, which is not always consonant with the other main goal of this project: to maximize the gene discovery rate. Achieving a balance between these two goals presented a major challenge during the early part of this project.

Tissues and cell lines are obtained from a variety of sources, including normal colons, precancerous colons, and cancerous colons; cDNA libraries are made; and ESTs are sequenced. Then, a variety of analyses are performed, and that information and the reagents, the actual clones, become one of the products. The overall gene discovery rate is updated weekly. Because a tailing off of the discovery rate was evident with the previous EST project, a great deal of effort went into monitoring progress and maintaining as high a discovery rate as possible.
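
The weekly discovery-rate monitoring mentioned above amounts to tracking what fraction of newly sequenced ESTs map to genes not seen before. The sketch below uses hypothetical cluster labels; the batch structure is an illustrative assumption:

```python
def discovery_rate(batches):
    """For each batch of ESTs (labeled by the gene cluster each maps to),
    return the fraction that hit a gene not seen in any earlier batch.
    A falling rate signals that a library is nearing exhaustion."""
    seen = set()
    rates = []
    for batch in batches:
        new = [g for g in batch if g not in seen]
        rates.append(len(new) / len(batch))
        seen.update(batch)
    return rates

# Hypothetical weekly batches: discovery tails off as libraries saturate.
print(discovery_rate([["g1", "g2", "g3", "g4"],
                      ["g2", "g5", "g6", "g1"],
                      ["g1", "g2", "g5", "g7"]]))  # [1.0, 0.5, 0.25]
```

A curve like this, recomputed weekly, is what lets a project decide when a library has stopped yielding new genes and effort should shift elsewhere.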

It appears best to sequence a library broadly and then sequence it more deeply if it proves promising. The data flow is complicated. A steering committee for CGAP has decision-making and spending authority, with most funds spent outside NIH on various contracts. NCBI, a member of the steering committee, has built the tracking database and has modified two existing databases, UniGene and dbEST, to provide reports.

NCI handles the step that involves acquiring the tissues, which come from many different sources, and maintains a repository of these sources when possible. The tissue is then sent to the various groups that build the cDNA libraries. Almost all of this occurs outside NIH. Information is fed into the database and then goes to the cDNA library groups, which send information to the tracking database. The library then goes to a group at Lawrence Livermore Laboratory, part of the consortium, where the clones are arrayed out of the library and reports are sent back to the tracking database. The arrays are then sent to the Washington University group


for EST sequencing and to commercial distributors. Tracking is crucial, as it will be important to identify the source of the clones in the database.

All of the reports that NCBI provides are available to the various partners for use in semiautonomously adjusting their work. But sometimes at this level users find problems with the clones, including diverse artifacts and errors. Because NCBI does not have control over the system, these problems are difficult to correct.

In CGAP the various commercial distributors are independently trying to make their own curated, cleaned-up sets of the clones, which can cause confusion in terms of defining the “standard” clone set. Thus, on a new and related initiative—the full-length cDNA project—a steering committee will conduct some of the work itself, and a master repository located at NIH will be responsible for quality assurance of the reagents.

SUMMARY

High-throughput biology projects are representative of database projects of the future, which will involve ever-increasing numbers of participating sites. The challenge is and will continue to be maintaining control and organization in the data collection process while at the same time preserving the system flexibility and autonomy that are essential to the performance of high-quality scientific research.

REFERENCES

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215:403-410.

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402.

Benson, D. A., I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler. 2000. GenBank. Nucleic Acids Research, 28:15-18.

Deloukas, P., G. D. Schuler, G. Gyapay, E. M. Beasley, C. Soderlund, P. Rodriguez-Tome, L. Hui, T. C. Matise, K. B. McKusick, J. S. Beckmann, et al. 1998. A physical map of 30,000 genes. Science, 282:744-746.

Schuler, G. D. 1997. Pieces of the puzzle: Expressed sequence tags and the catalog of human genes. Journal of Molecular Medicine, 75:694-698.

Schuler, G. D., M. S. Boguski, E. A. Stewart, L. D. Stein, G. Gyapay, K. Rice, R. E. White, P. Rodriguez-Tome, A. Aggarwal, E. Bajorek, et al. 1996. A gene map of the human genome. Science, 274:540-546.
