Gary K. Schoolnik and Michael A. Wilson
The increasing availability of complete genome sequences of medically important bacteria has led to the need to extract functional information from the nucleotide sequence in a form that can be used to better understand pathogenesis and physiology and for the development of new diagnostics, antibiotics, and vaccines. Sequence annotation can initiate this process by identifying open reading frames (ORFs), followed by functional predictions of their putative protein products by reference to homologues of known function found in other organisms. This, however, is a nonempirical exercise, and most of the functional assignments must be considered as hypotheses that need to be experimentally tested.
Microarray-based expression profiling is the logical next step after completion of the annotation process and is currently the linchpin of functional genomics (Schena et al., 1998). It is a comprehensive method by which each ORF of the genome of an organism, printed as discrete ORF-specific spots on a surface, is interrogated simultaneously to identify genes selectively expressed by the organism under specific conditions of growth, in response to drugs and other inhibitors of metabolic and/or biosynthetic pathways, and during its growth in host tissues. As generally practiced, the method requires a surface representation of the organism's genome (the microarray) and labeled RNA or cDNA prepared from the organism. This system and the microarray experimental design are discussed below.
A microarray is a surface that contains representations of each ORF of a sequenced and annotated genome (Schena et al., 1995; 1996; 1998; DeRisi et al., 1998). The surface used, the method by which ORF-specific DNA is bound to the surface, and the overall arrangement of the array vary with the system employed. Of the several available formats, the most common, economical, and flexible—and the one used by the authors in their study of the expression response of Mycobacterium tuberculosis (MTB) to the antibiotic isoniazid (Wilson et al., 1999; Behr et al., 1999)—was developed by Patrick Brown and colleagues at Stanford University (Schena et al., 1995; 1996; Shalon et al., 1996; DeRisi et al., 1998).
This array format consists of a microscope slide whose surface contains an x by y matrix of printed spots, each spot containing a polymerase chain reaction (PCR)-derived amplicon that corresponds to all or part of an ORF of the sequenced genome. Thus, each ORF of the genome is represented on the array as a separate spot, its location designated by its xy address. Additional spots correspond to internal controls that monitor the printing and hybridization steps. For example, the MTB array fabricated in the authors' laboratory contains spots corresponding to nearly all of the 3,924 ORFs predicted by the Sanger Center's MTB sequencing project, together with additional control spots (Behr et al., 1999; Wilson et al., 1999).
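The xy addressing described above can be illustrated with a minimal sketch. It assumes spots are printed row by row from left to right, with control spots placed after the ORF spots; the function names, the grid width, and the spot labels are hypothetical and are not taken from the authors' printing software.

```python
def spot_address(index, columns):
    """Return the 1-based (x, y) address of the spot with 0-based print index."""
    if index < 0 or columns <= 0:
        raise ValueError("index must be >= 0 and columns must be > 0")
    row, col = divmod(index, columns)
    return col + 1, row + 1


def build_layout(orf_ids, control_ids, columns):
    """Assign ORF spots first, then control spots, to consecutive addresses."""
    names = list(orf_ids) + list(control_ids)
    return {spot_address(i, columns): name for i, name in enumerate(names)}


# Illustrative labels only; a real array would carry thousands of ORF spots.
layout = build_layout(["Rv0001", "Rv0002", "Rv0003"], ["print_control"], columns=2)
print(layout[(1, 2)])  # the third spot printed: first column, second row
```

Keeping the address-to-ORF mapping in one lookup table means that every later step (scanning, ratio calculation, clustering) can refer to genes by name rather than by position.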
Arrays of this kind can be fabricated in an academic laboratory. The process begins with the identification of optimal forward and reverse primer pairs for each predicted ORF. For this purpose the authors have used the Primer 3 software provided by the Whitehead Institute (http://www-genome.wi.mit.edu/genome_software/other/primer3.html), and for the MTB microarray the average length of the resulting amplicons is 520 bp. At Stanford University the identified primer sequences are downloaded to the Protein and Nucleic Acid Core facility, and the primers are synthesized by a 96-well multiplex oligonucleotide synthesizer. Subsequently, the corresponding forward and reverse primers are combined and used for PCR amplification in a 96-well thermocycler. Each PCR product is analyzed by gel electrophoresis; for primer pairs that fail to yield a single product of the expected size, either the primers are resynthesized or a new primer pair is selected for the ORF in question. Once the full set of amplicons is validated and adjusted to a standardized concentration, each is printed onto a prepared glass microscope slide by a robotic printer, and the DNA is cross-linked to the surface.
The principal innovation in gene expression profiling that was introduced by Brown and colleagues is two-color hybridization (Schena et al., 1995). This method employs two populations of cDNAs that have been differentially labeled with two different fluorochromes; the cDNAs are derived from RNA prepared from the same organism cultivated under or exposed to two contrasting conditions. The two differentially labeled populations of cDNAs are combined in equal masses, applied to the array surface, and allowed to hybridize to the corresponding ORF-specific targets. The array is then scanned, and the intensity of each label for each ORF-specific spot is quantified. These values are compared, yielding ratios that serve as a measure of the relative degree of expression or repression of each ORF for the two tested conditions.
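As a minimal sketch of the ratio calculation, the two channel intensities for each spot can be converted to a log2 ratio, so that induction and repression of equal magnitude are symmetric around zero. The use of log2, the handling of non-positive signals, and the spot names and intensities are illustrative assumptions rather than the authors' stated procedure.

```python
import math


def log2_ratio(experimental, control):
    """Return log2(experimental / control), or None if either signal is not positive."""
    if experimental <= 0 or control <= 0:
        return None  # a ratio cannot be formed from a missing or zero signal
    return math.log2(experimental / control)


def expression_ratios(spots):
    """spots maps an ORF name to (experimental_intensity, control_intensity)."""
    return {orf: log2_ratio(exp, ctl) for orf, (exp, ctl) in spots.items()}


# Hypothetical intensities for three spots.
spots = {"orf_a": (400.0, 100.0), "orf_b": (50.0, 100.0), "orf_c": (0.0, 80.0)}
ratios = expression_ratios(spots)
for orf, value in ratios.items():
    print(orf, value)
```

In this example a fourfold induction gives a value of 2, a twofold repression gives -1, and a spot with no usable signal is reported as missing rather than as an extreme ratio.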
In practice, for organisms growing in bulk culture, the culture is split, and one of the resulting cultures is experimentally manipulated to test the hypothesis in question while the other serves as the unaltered control. At prespecified time points, aliquots containing equal numbers of organisms are removed from the control and experimental cultures, and “total” RNA is extracted from each. Each pool of total RNA is then separately reverse transcribed in a “first-strand” reaction using random oligonucleotide primers. The authors have found that it is not necessary to separate bacterial ribosomal RNA from mRNA prior to cDNA labeling and hybridization even though the former constitutes approximately 97 percent of the total RNA. During the reverse transcription reaction, a nucleotide labeled with one kind of fluorochrome is incorporated into one of the cDNA products, and a nucleotide labeled with a different fluorochrome is incorporated into the alternative cDNA product. The two differentially labeled cDNA populations are then combined for use in the hybridization reaction that is conducted on the array surface.
The use of contrasting conditions as the principal experimental paradigm is intended to reveal the genome-wide response of the organism in three general areas of interest: (1) physiological responses to changes in growth conditions (e.g., log-phase growth versus stationary-phase growth or changes in gene expression during a diauxic shift); (2) responses to physicochemical parameters that simulate or could serve as signatures for a particular host environment (e.g., low pH/high pH, low O2/high O2, iron present/iron absent, or H2O2 present/H2O2 absent); and (3) responses relevant to the drug discovery process (e.g., drug or metabolic inhibitor present/absent).
In general, of the thousands of ORFs monitored during experiments of this kind, the vast majority show no differential expression for any particular tested condition. A much smaller set of genes exhibits selective expression or repression, and of these, many but not all can be plausibly associated with an adaptive response by the organism that is specific to the condition under study. This appears to be particularly true early in the time course and under conditions that do not induce a generalized stress response. However, no rationale can be adduced, based on current knowledge, for a fraction of the selectively induced or repressed genes. Some of these have no inferred function because, during the annotation process, they were not found to be homologous to genes of known function. For others, while their functions may be known or can be inferred, their regulation by the condition studied could not have been predicted by prior knowledge of the organism's biology. Unexpected results of this kind are perhaps the most interesting and provide the basis for new hypotheses. Accordingly, microarray expression profiling can be viewed as a hypothesis-generating exercise.
Bioinformatics, in its broadest sense, is required for microarray experimentation at three levels: (1) ORF identification and functional annotation of the nucleotide sequence, (2) image processing of the microarray scan, and (3) analysis of microarray data in order to identify genes that exhibit similar patterns of regulation and that may be functionally integrated as an adaptive response of the organism to the tested condition. The latter two are microarray specific.
Image processing is conducted using the crude signal intensities from each of the two fluorochrome-specific channels. Refinement of these signals comes from the use of local background measurements and determination of average signal intensity for each spot using custom-written software (Michael Eisen, Scanalyze, 1998, available at http://rana.stanford.edu).
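The refinement step can be sketched as a local background subtraction for one spot in one channel. This is not the Scanalyze algorithm; the use of the median for background pixels, the clipping of negative values to zero, and the pixel values are assumptions made for illustration.

```python
from statistics import mean, median


def corrected_intensity(spot_pixels, background_pixels):
    """Average spot signal minus the local background estimate, floored at zero."""
    signal = mean(spot_pixels)
    # Assumption: the median is used so that a few bright specks of dust near
    # the spot do not inflate the background estimate.
    background = median(background_pixels)
    return max(signal - background, 0.0)


# Hypothetical pixel values for one spot in one fluorochrome channel.
print(corrected_intensity([100, 120, 110], [10, 12, 11]))
```

The same function would be applied to each channel separately, and the two corrected values would then be compared to form the expression ratio for that spot.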
Microarray data analysis software is evolving rapidly (Eisen et al., 1998; Tamayo et al., 1999; Toronen et al., 1999), and in the authors' opinion data analysis is the current rate-limiting step in the experimental process. Even for the relatively small genomes of prokaryotes, a typical microarray-based experiment generates large datasets. Consider, for example, the detection of MTB genes selectively expressed or repressed at four different time points upon exposure of the organism to three buffers that differ only in pH (5.8, 6.8, and 7.4), conditions likely to be encountered by the organism in different host compartments.
Each of the ~4,000 ORFs represented on the array will yield a numerical value representing the cognate gene's level of expression for each pH compared to each of the two other tested pH values. At a minimum, for each of the five time points (including t = 0), this experiment would generate two values for each of the three pairwise comparisons of pH; therefore, the number of gene expression data points would be: 4,000 × 2 × 3 × 5 = 120,000. After statistically derived threshold values are set that distinguish expression and repression from system noise and provide the metric for the amplitude of the observed changes, the basic question asked of such data is: Which genes are induced and which are repressed upon exposure of the organism to one condition compared to a second condition?
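The arithmetic above, and the thresholding step that follows it, can be written out as a short sketch. The dataset dimensions come from the text; the twofold cutoff (a log2 ratio of 1.0) is only a placeholder for a statistically derived threshold, and the function names are hypothetical.

```python
from itertools import combinations

PH_VALUES = [5.8, 6.8, 7.4]
N_ORFS = 4000          # approximate number of ORFs on the array
VALUES_PER_PAIR = 2    # two values per pairwise pH comparison
N_TIME_POINTS = 5      # includes t = 0

# Three pH values give three pairwise comparisons.
pairs = list(combinations(PH_VALUES, 2))
total = N_ORFS * VALUES_PER_PAIR * len(pairs) * N_TIME_POINTS
print(len(pairs), total)


def classify(log2_ratio, threshold=1.0):
    """Label one gene for one comparison; the threshold is a placeholder value."""
    if log2_ratio >= threshold:
        return "induced"
    if log2_ratio <= -threshold:
        return "repressed"
    return "unchanged"
```

In practice the threshold would be set from replicate hybridizations so that it reflects the measured noise of the system rather than a fixed fold change.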
The biological presumption is that microarray-derived patterns of gene expression should correspond to regulatory networks of the organism that govern the adaptive response. For this purpose, clusters of genes that exhibit the same expression/repression behavior with respect to amplitude, vector, and time are identified. In turn, this information is further refined by reference to the putative or known functions of genes comprising discrete clusters including their relationship to metabolic or biosynthetic pathways that may be coordinately regulated by the condition under study. Membership in regulons of this kind can thus be inferred, and further analysis of their location and transcriptional orientation will likely show that some reside in operon-like gene clusters. Software currently available in the public domain for these and related applications includes the cluster analysis program of Eisen et al. (1998) and the self-organizing map-based program of Tamayo et al. (1999). The integration of gene cluster data for Escherichia coli with empirically derived information about the corresponding metabolic pathways is enabled by reference to the EcoCyc project (Karp et al., 1999, found at http://ecocyc.PangeaSystems.com/ecocyc/ecocyc.html).
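The idea of grouping genes by shared expression behavior can be sketched with a deliberately simple procedure: each gene's time course is compared with the first member of each existing group by Pearson correlation and joins the first group it matches. This is a simplified stand-in, not the hierarchical clustering of Eisen et al. (1998) or the self-organizing maps of Tamayo et al. (1999). Correlation captures direction and timing but not amplitude, and the 0.9 cutoff and the gene profiles are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length profiles; 0.0 if either is flat."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def greedy_clusters(profiles, cutoff=0.9):
    """Group genes whose profiles correlate with a cluster's first member."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= cutoff:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no match: the gene seeds a new cluster
    return clusters


# Hypothetical log2 ratios at four time points.
profiles = {
    "gene_a": [0.1, 1.0, 2.0, 2.5],
    "gene_b": [0.0, 0.9, 2.1, 2.4],
    "gene_c": [0.0, -1.0, -2.0, -2.2],
}
clusters = greedy_clusters(profiles)
print(clusters)
```

The two induced genes fall into one group and the repressed gene into another; a group of this kind is the starting point for asking whether its members share a pathway or an operon-like arrangement.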
New antibiotics are urgently needed to respond to the growing problem of bacterial strains resistant to one or more commonly used drugs and the imminent appearance of strains resistant to all available classes of drugs. To accelerate the drug discovery process, microarray expression profiling has been proposed as a method to identify new drug targets, to discover the target of an active compound whose mode of action is unknown, and as the basis for a high-throughput screen to identify active leads from large compound libraries.
The logic of this approach is as follows. The mode of action of most antibiotic classes is the inhibition of a vital metabolic or biosynthetic pathway. Inhibitors of this kind predictably cause a decrease in pathway products downstream of the point of inhibition and an accumulation of pathway precursors upstream of the site of inhibition. The resulting fluctuations in the abundance of products and precursors are sensed by the genome and result in increased expression of genes coding for pathway enzymes distal to the point of inhibition and decreased expression of genes proximal to the point of inhibition. Increased expression of genes in associated shunt pathways may be expected to occur as well, including those that degrade or expel toxic byproducts that have accumulated as a result of pathway inhibition.
Accordingly, exposure of an organism to a drug of unknown mode of action should elicit an expression profile that incriminates the affected pathway and perhaps even the target in the pathway (Wilson et al., 1999). Through the use of inhibitors that are selective for many such pathways it should be possible to identify signature profiles that are pathway specific and characteristic for an inhibitor's mode of action (Wilson et al., 1999). Such information could accelerate drug development because identification of the pathway and the target in the pathway could lead to the rational design, synthesis, and testing of a series of chemical derivatives of the original compound.
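Matching the profile of an uncharacterized compound against pathway-specific signatures can be sketched as follows. Each profile is reduced to the direction of change of its responsive genes, and the score is the fraction of genes on which the profile and a signature agree. The scoring rule, the pathway names, and the gene names are all hypothetical; the source does not specify how signatures would be compared.

```python
def signature_score(profile, signature):
    """Fraction of responsive genes on which profile and signature agree.

    Both arguments map a gene name to +1 (induced) or -1 (repressed);
    genes that did not respond are simply absent.
    """
    genes = set(profile) | set(signature)
    if not genes:
        return 0.0
    agree = sum(1 for g in genes if profile.get(g) == signature.get(g))
    return agree / len(genes)


def best_match(profile, signatures):
    """Return the name of the signature with the highest score."""
    return max(signatures, key=lambda name: signature_score(profile, signatures[name]))


# Hypothetical reference signatures from inhibitors of known mode of action.
signatures = {
    "pathway_a": {"gene1": 1, "gene2": 1, "gene3": -1},
    "pathway_b": {"gene4": 1, "gene5": -1},
}
unknown = {"gene1": 1, "gene2": 1, "gene6": 1}
print(best_match(unknown, signatures))
```

A compound whose best score is still low would match none of the reference signatures, which in itself suggests a mode of action not represented among the inhibitors already profiled.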
Alternatively, the use of pathway-specific inhibitors for gene-response profiling experiments will illuminate other components of the affected pathway beyond the target itself by causing changes in the expression of their cognate genes (Wilson et al., 1999). If pathway inhibition is lethal, inhibition of any of the critical pathway enzymes should be lethal as well. They, therefore, constitute new drug target candidates. This strategy is particularly attractive when mutations of the original target in a critical pathway have led to antibiotic resistance. In this case, expression profiling may lead to the identification of other targets in the same pathway.
To identify entries in large compound libraries that inhibit a specific preselected pathway, one or more of the genes coding for pathway enzymes (identified through the gene-response profiling process described above) can be used to signal pathway inhibition during high-throughput screens. For this purpose the promoter of the selected pathway gene is fused to a gene that encodes a signal suitable for a high-throughput assay, such as luciferase, and the reporter construct is then introduced into a bacterium of the same species. The pathway-specific reporter strain is then dispensed into multiple microtiter wells. Each well then receives a different entry of the compound library, and the plate is scanned to identify wells that emit the signal, indicating that the pathway-specific promoter has been induced.
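The final plate-reading step can be sketched as a simple hit-calling rule: a well is flagged when its reporter signal exceeds the mean of untreated control wells by a chosen number of standard deviations. The three-standard-deviation rule, the well labels, and the luminescence readings are assumptions for illustration, not part of the screen described in the text.

```python
from statistics import mean, stdev


def call_hits(readings, control_readings, n_sd=3.0):
    """Return wells whose signal exceeds control mean + n_sd * control SD."""
    cutoff = mean(control_readings) + n_sd * stdev(control_readings)
    return sorted(well for well, signal in readings.items() if signal > cutoff)


# Hypothetical luminescence readings (arbitrary units).
controls = [100, 110, 90, 105, 95]                       # reporter strain, no compound
readings = {"A1": 102, "A2": 480, "A3": 97, "B1": 150}   # one compound per well
hits = call_hits(readings, controls)
print(hits)
```

Compounds in the flagged wells would then be retested, since a single reading above the cutoff indicates promoter induction but does not by itself confirm inhibition of the intended pathway.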
It is widely recognized that the mode of growth of bacteria in nature, including in tissues during the infectious process, is vastly different from that of the broth-grown organisms that are used for most antibiotic development purposes. In turn, this has led to the idea that an unexplored avenue for drug development is the identification of compounds that selectively inhibit pathways that are crucial for growth of the organism in tissues, in pus, or on the surfaces of implantable prosthetic devices. It is argued that drugs of this kind might not disrupt the normal flora of the host since at least some of the pathways required by an invading organism would not be required by the same organism growing as a commensal in a normally colonized host compartment. In contrast, most of the antibiotics in common use today dramatically alter the normal flora of the host during treatment.
Microarray gene response profiling offers an efficient way to identify tissue-specific bacterial genes, and the two-color hybridization system described above is particularly well suited to compare the gene expression profiles of the same organism as a commensal and an infecting agent. Although the concepts are simple, obtaining sufficient RNA from organisms growing in tissues is technically far more difficult than obtaining it from broth-grown organisms, in part because their numbers are likely to be relatively small and because RNA from the host and other commensal bacteria may give rise to ambiguities and signal-to-noise problems. However, linear amplification methods and chemical methods to intensify the fluorescent signals of the labeled probes will likely overcome these problems in the near future. Additionally, knowledge of microbial metabolic pathways that are activated by organisms growing in tissues may lead to the identification of in vitro growth conditions that contain tissue-specific signals that will trigger responses that normally occur only at sites of infection. Tissue-simulating media of this kind could be used for high-throughput screens for the identification of compounds that act on organisms in tissues, although this screen would not necessarily exclude compounds that also act on commensal organisms.
The pharmaceutical industry has developed innovative and rapid ways to construct libraries containing 10^5 to 10^6 structurally characterized compounds and robotics-based methods to screen these combinatorial libraries against indicator strains or enzymes. This experience and technical platform can be adapted for microarray expression profiling; the following functionally discrete steps are particularly amenable to laboratory automation:
Function: Identification of genes selectively induced or repressed by pathway inhibitors.
Genome sequencing and automation.
Microarray fabrication.
Production of ORF-specific DNA targets.
Printing of targets onto the surface, yielding the x by y array.
mRNA preparation and production of labeled cDNA.
Array hybridization.
Scanning of the hybridized array.
Image processing: deriving data for analysis.
Image analysis to yield the gene response profile.
Data mining: comparative analysis to extract biological meaning.
Function: Identification of entries in a combinatorial compound library that inhibit a preselected critical metabolic/biosynthetic target.
Identify sentinel pathway genes.
Prepare promoter-fusion constructs that will signal promoter activation, indicating pathway inhibition.
Robotic delivery of reporter strains carrying the promoter-fusion constructs to microtiter wells.
Robotic delivery of each entry in the compound library to microtiter wells containing the reporter strain.
Microtiter plate incubation followed by signal detection.
Microarray expression profiling will likely become a common component of the drug discovery process, where it may be used in an iterative manner in the stepwise progression from compound libraries to the identification of leads and the refinement of leads. It will not replace biochemical assays of drug structure and action; rather, its main effect will be to focus the screening process on vital pathways that mediate the adaptive metabolic, physiological, and pathogenic responses of the organism.
The authors gratefully acknowledge Pat Brown and Joe DeRisi for their generous assistance and advice, members of the Schoolnik laboratory for their unstinting support, and the National Institutes of Health for providing funds for this project through grants AI35969 and AI44826.
Behr, M. A., M. A. Wilson, W. P. Gill, H. Salamon, G. K. Schoolnik, S. Rane, and P. M. Small. 1999. Comparative genomics of BCG vaccines by whole genome DNA microarray. Science, 284:1520-1523.
DeRisi, J. L., V. R. Iyer, and P. O. Brown. 1998. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680-686.
Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA, 95:14863-14868.
Karp, P. D., M. Riley, S. M. Paley, A. Pellegrini-Toole, and M. Krummenacker. 1999. EcoCyc: Encyclopedia of E. coli genes and metabolism. Nucleic Acids Research, 27:55-58.
Schena, M., D. Shalon, R. W. Davis, and P. O. Brown. 1995. Quantitative monitoring of gene expression profiles with a complementary DNA microarray. Science, 270:467-470.
Schena, M., D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis. 1996. Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proceedings of the National Academy of Sciences USA, 93:1061-1069.
Schena, M., R. A. Heller, T. P. Theriault, K. Konrad, E. Lachenmeier, and R. W. Davis. 1998. Microarrays: Biotechnology's discovery platform for functional genomics. Trends in Biotechnology, 7:301-306.
Shalon, D., S. J. Smith, and P. O. Brown. 1996. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 7:639-645.
Tamayo, P., D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic differentiation. Proceedings of the National Academy of Sciences USA, 96:2907-2912.
Toronen, P., M. Kolehmainen, G. Wong, and E. Castren. 1999. Analysis of gene expression data using self-organizing maps. FEBS Letters, 451:142-146.
Wilson, M. A., J. L. DeRisi, H. H. Kristensen, P. Imboden, S. Rane, P. O. Brown, and G. K. Schoolnik. 1999. Exploring drug-induced alterations in gene expression in Mycobacterium tuberculosis by microarray hybridization. Proceedings of the National Academy of Sciences USA, 96(22):12833-12838.