I. Genomics to Biology — Elucidating the Structure and Function of Genomes
The broadly available genome sequences of human and a select set of additional organisms represent foundational information for biology and biomedicine. Embedded within this as-yet poorly understood code are the genetic instructions for the entire repertoire of cellular components, knowledge of which is needed to unravel the complexities of biological systems. Elucidating the structure of genomes and identifying the function of the myriad encoded elements will allow connections to be made between genomics and biology and will, in turn, accelerate the exploration of all realms of the biological sciences.
For this, new conceptual and technological approaches will be needed to:
- Develop a comprehensive and comprehensible catalogue of all of the components encoded in the human genome.
- Determine how the genome-encoded components function in an integrated manner to perform cellular and organismal functions.
- Understand how genomes change and take on new functional roles.
Although DNA is relatively simple and well understood chemically, the human genome's structure is extraordinarily complex and its function is poorly understood. Only 1-2% of its bases encode proteins [7], and the full complement of protein-coding sequences still remains to be established. A roughly equivalent amount of the non-coding portion of the genome is under active selection [11], suggesting that it is also functionally important, yet vanishingly little is known about it. It probably contains the bulk of the regulatory information controlling the expression of the approximately 30,000 protein-coding genes, and myriad other functional elements, such as non-protein-coding genes and the sequence determinants of chromosome dynamics. Even less is known about the function of the roughly half of the genome that consists of highly repetitive sequences or of the remaining non-coding, non-repetitive DNA.
The next phase of genomics is to catalogue, characterize, and comprehend the entire set of functional elements encoded in the human and other genomes. Compiling this genome "parts list" will be an immense challenge. Well-known classes of functional elements, such as protein-coding sequences, still cannot be accurately predicted from sequence information alone. Other types of known functional sequences, such as genetic regulatory elements, are even less well understood; undoubtedly new types remain to be defined, so we must be ready to investigate novel, perhaps unexpected, ways in which DNA sequence can confer function. Similarly, a better understanding of epigenetic changes (for example, methylation and chromatin remodelling) is needed to comprehend the full repertoire of ways in which DNA can encode information.
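To make concrete why protein-coding sequences are hard to predict from sequence alone, the following is a minimal sketch of the simplest possible approach: scanning the forward strand for open reading frames (ORFs). The function name `find_orfs` and the `min_codons` parameter are illustrative assumptions, not part of any project described here. A scan of this kind ignores introns, alternative splicing, the reverse strand, and regulatory context, which is one reason it falls well short of accurate gene prediction in a genome like the human one.

```python
# Naive forward-strand ORF scan (illustrative only; not a real gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    start is the index of the ATG; end is exclusive and includes the stop
    codon. min_codons counts codons from the ATG up to, but not including,
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    # Toy sequence with one short ORF starting at index 2.
    print(find_orfs("CCATGAAACCCGGGTAGCC"))
```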
Comparison of genome sequences from evolutionarily diverse species has emerged as a powerful tool for identifying functionally important genomic elements. Initial analyses of available vertebrate genome sequences [7, 11, 19] have revealed many previously undiscovered protein-coding sequences. Mammal-to-mammal sequence comparisons have revealed large numbers of homologies in non-coding regions [11], few of which can be defined in functional terms. Further comparisons of sequences derived from multiple species, especially those occupying distinct evolutionary positions, will lead to significant refinements in our understanding of the functional importance of conserved sequences [20]. Thus, the generation of additional genome sequences from several well-chosen species is crucial to the functional characterization of the human genome (Box 1). The generation of such large sequence data sets will benefit from further advances in sequencing technology that yield significant cost reductions (Box 2). The study of sequence variation within species will also be important in defining the functional nature of some sequences (see Grand Challenge I-3).
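The comparative idea can be illustrated with a small sketch that scores fixed-size windows of a pairwise alignment by percent identity and reports the windows above a threshold as candidate conserved elements. The function names, window size, and threshold below are assumptions chosen for illustration; real analyses use multi-species alignments and explicit models of neutral evolution rather than a raw identity cutoff.

```python
# Sliding-window identity over a pairwise alignment (illustrative sketch).
def window_identity(aln_a, aln_b, window=10, step=5):
    """Return (start, identity) for each window of two aligned sequences.

    Both inputs must be the same length; '-' marks an alignment gap and
    never counts as a match.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        columns = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aln_a, aln_b, window=10, step=5, threshold=0.9):
    """Return the windows whose identity is at or above the threshold."""
    return [
        (start, identity)
        for start, identity in window_identity(aln_a, aln_b, window, step)
        if identity >= threshold
    ]


if __name__ == "__main__":
    human = "ACGTACGTACAAAAAAAAAA"  # toy aligned sequences, not real data
    mouse = "ACGTACGTACTTTTTTTTTT"
    print(conserved_windows(human, mouse, window=10, step=10))
```

A high-identity window is only a candidate: as the text notes, few conserved non-coding regions can yet be defined in functional terms, so experimental follow-up is still needed.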
Effective identification and analysis of functional genomic elements will require increasingly powerful computational capabilities, including new approaches for tackling ever-growing and increasingly complex data sets and a suitably robust computational infrastructure for housing, accessing, and analysing those data sets (Box 3). In parallel, investigators will need to become increasingly adept in dealing with this treasure trove of new information (Box 4). As a better understanding of genome function is gained, refined computational tools for de novo prediction of the identity and behaviour of functional elements should emerge [21].
Complementing the computational detection of functional elements will be the generation of additional experimental data by high-throughput methodologies. One example is the production of full-length complementary DNA (cDNA) sequences (see, for example, mgc.nci.nih.gov and www.fruitfly.org/EST/full.shtml). Major challenges inherent in programmes to discover genes are the experimental identification and validation of alternative splice forms and of messenger RNAs expressed in a highly restricted fashion. Even more challenging is the experimental validation of functional elements that do not encode protein (for example, regulatory regions and non-coding RNA sequences). High-throughput approaches to identifying such elements (Box 2) will be needed to generate the experimental data necessary to develop, confirm, and enhance computational methods for detecting functional elements in genomes.
Because current technologies cannot yet identify all functional elements, there is a need for a phased approach in which new methodologies are developed, tested on a pilot scale, and finally applied to the entire human genome. Along these lines, the NHGRI recently launched the Encyclopedia of DNA Elements (ENCODE) Project to identify all the functional elements in the human genome. In a pilot project, systematic strategies for identifying all functionally important genomic elements will be developed and tested using a selected 1% of the human genome. Parallel projects involving well-studied model organisms, for example, yeast, nematode, and fruit fly, are ongoing. The lessons learned will serve as the basis for implementing a broader programme for the entire human genome.
Grand Challenge I-2: Elucidate the organization of genetic networks and protein pathways and establish how they contribute to cellular and organismal phenotypes
Genes and gene products do not function independently, but participate in complex, interconnected pathways, networks, and molecular systems that, taken together, give rise to the workings of cells, tissues, organs, and organisms. Defining these systems and determining their properties and interactions is crucial to understanding how biological systems function. Yet these systems are far more complex than any problem that molecular biology, genetics, or genomics has yet approached. On the basis of previous experience, one effective path will begin with the study of relatively simple model organisms [22], such as bacteria and yeast, and then extend the early findings to more complex organisms, such as mouse and human. Alternatively, focusing on a few well-characterized systems in mammals will be a useful test of the approach (see, for example, www.signaling-gateway.org).
Understanding biological pathways, networks, and molecular systems will require information from several levels. At the genetic level, the architecture of regulatory interactions will need to be identified in different cell types, requiring, among other things, methods for simultaneously monitoring the expression of all genes in a cell [12]. At the gene-product level, similar techniques that allow in vivo, real-time measurement of protein expression, localization, modification, and activity/kinetics will be needed (Box 2). It will be important to develop, refine, and scale up techniques that modulate gene expression, such as conventional gene-knockout methods [23], newer knock-down approaches [24], and small-molecule inhibitors [25], to establish the temporal and cellular expression pattern of individual proteins and to determine the functions of those proteins. This is a key first step towards assigning all genes and their products to functional pathways.
The ability to monitor all proteins in a cell simultaneously would profoundly improve our ability to understand protein pathways and systems biology. A critical step towards gaining a complete understanding of systems biology will be to take an accurate census of the proteins present in particular cell types under different physiological conditions. This is becoming possible in some model systems, such as microorganisms [26]. It will be a major challenge to catalogue proteins present in low abundance or in membranes. Determining the absolute abundance of each protein, including all modified forms, will be an important next step. A complete interaction map of the proteins in a cell, and their cellular locations, will serve as an atlas for the biological and medical explorations of cellular metabolism [27] (see www.nrcam.uchc.edu, for example). These and other related areas constitute the developing field of proteomics.
Establishing a true understanding of how organized molecular pathways and networks give rise to normal and pathological cellular and organismal phenotypes will require more than large, experimentally derived data sets. Once again, computational investigation will be essential (Box 3), and there will be a greatly increased need for the collection, storage, and display of the data in robust databases. By modelling specific pathways and networks, predicting how they affect phenotype, testing hypotheses derived from these models, and refining the models based on new experimental data, it should be possible to understand more completely the difference between a "bag of molecules" and a functioning biological system.
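The model, predict, test, and refine cycle described above can be sketched with a deliberately tiny example: a single gene that represses its own production, integrated with the forward Euler method. The equation dx/dt = beta / (1 + (x/K)^n) - gamma * x and every parameter value below are assumptions chosen for illustration, not a model taken from this article. The second call mimics a knock-down by halving the maximal production rate and yields a lower steady-state level, which is the kind of prediction an experiment could then test.

```python
# Toy model of a negatively autoregulated gene (illustrative assumptions only).
def simulate(beta=10.0, K=1.0, n=2, gamma=1.0, x0=0.0, dt=0.01, steps=5000):
    """Integrate dx/dt = beta / (1 + (x/K)**n) - gamma * x by forward Euler.

    beta  : maximal production rate
    K     : repression threshold
    n     : Hill coefficient
    gamma : first-order degradation rate
    Returns the product level x after `steps` steps of size `dt`.
    """
    x = x0
    for _ in range(steps):
        production = beta / (1.0 + (x / K) ** n)
        x += dt * (production - gamma * x)
    return x


if __name__ == "__main__":
    baseline = simulate()            # unperturbed system
    knockdown = simulate(beta=5.0)   # hypothetical knock-down: halved production
    print(f"baseline steady state:   {baseline:.3f}")
    print(f"knock-down steady state: {knockdown:.3f}")
```

With the default parameters the steady state satisfies x^3 + x = 10, so the simulation should settle at x = 2. Comparing such predictions with measurements, and revising the model when they disagree, is the refinement loop the text describes.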
Grand Challenge I-3: Develop a detailed understanding of the heritable variation in the human genome
Genetics seeks to correlate variation in DNA sequence with phenotypic differences (traits). The greatest advances in human genetics have been made for traits associated with variation in a single gene. But most phenotypes, including common diseases and variable responses to pharmacological agents, have a more complex origin, involving the interplay between multiple genetic factors (genes and their products) and nongenetic factors (environmental influences). Unravelling such complexity will require both a complete description of the genetic variation in the human genome and the development of analytical tools for using that information to understand the genetic basis of disease.
Establishing a catalogue of all common variants in the human population, including single-nucleotide polymorphisms (SNPs), small deletions and insertions, and other structural differences, began in earnest several years ago. Many SNPs have been identified [28], and most are publicly available. A public collaboration, the International HapMap Project, was formed in 2002 to characterize the patterns of linkage disequilibrium and haplotypes across the human genome and to identify subsets of SNPs that capture most of the information about these patterns of genetic variation to enable large-scale genetic association studies. To reach fruition, such studies need more robust experimental (Box 2) and computational (Box 3) methods that use this new knowledge of human haplotype structure [29].
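As a rough illustration of how linkage disequilibrium lets a subset of SNPs stand in for the rest, the sketch below computes the standard pairwise r² statistic from phased haplotypes and then picks tag SNPs with a simple greedy rule. The function names, the 0/1 allele encoding, and the 0.8 threshold are assumptions made for this example; this is not the HapMap Project's selection procedure, and the greedy pass does not guarantee a minimal tag set.

```python
# Pairwise linkage disequilibrium (r^2) and greedy tag-SNP choice (sketch).
def r_squared(haplotypes, i, j):
    """Return r^2 between SNPs i and j.

    haplotypes is a list of equal-length tuples of 0/1 alleles, one tuple
    per phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick SNPs in order, skipping any already captured by a chosen tag."""
    tags = []
    for snp in range(len(haplotypes[0])):
        captured = any(r_squared(haplotypes, t, snp) >= threshold for t in tags)
        if not captured:
            tags.append(snp)
    return tags


if __name__ == "__main__":
    # Toy data: SNPs 0 and 1 are perfectly correlated; SNP 2 is independent.
    haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
    print(greedy_tag_snps(haps))
```

In the toy data, genotyping SNP 0 captures SNP 1 as well, so only two of the three SNPs need to be typed. The same logic, applied genome-wide, is what makes large-scale association studies more economical.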
A comprehensive understanding of genetic variation, both in humans and in model organisms, would facilitate studies to establish relationships between genotype and biological function. The study of particular variants and how they affect the functioning of specific proteins and protein pathways will yield important new insights about physiological processes in normal and disease states. An enhanced ability to incorporate information about genetic variation into human genetic studies would usher in a new era for investigating the genetic bases of human disease and drug response (see Grand Challenge II-1).
References
1. Mendel, G. Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines, Abhandlungen, Brünn 4, 3-47 (1866).
2. Avery, O. T., MacLeod, C. M. & McCarty, M. Studies of the chemical nature of the substance inducing transformation of pneumococcal types. Induction of transformation by a desoxyribonucleic acid fraction isolated from Pneumococcus Type III. J. Exp. Med. 79, 137-158 (1944).
3. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature 171, 737 (1953).
4. Nirenberg, M. W. The genetic code: II. Sci. Am. 208, 80-94 (1963).
5. Jackson, D. A., Symons, R. H. & Berg, P. Biochemical method for inserting new genetic information into DNA of Simian Virus 40: circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli. Proc. Natl Acad. Sci. USA 69, 2904-2909 (1972).
6. Cohen, S. N., Chang, A. C., Boyer, H. W. & Helling, R. B. Construction of biologically functional bacterial plasmids in vitro. Proc. Natl Acad. Sci. USA 70, 3240-3244 (1973).
7. The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
8. Sanger, F. & Coulson, A. R. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94, 441-448 (1975).
9. Maxam, A. M. & Gilbert, W. A new method for sequencing DNA. Proc. Natl Acad. Sci. USA 74, 560-564 (1977).
10. Smith, L. M. et al. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674-679 (1986).
11. The Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562 (2002).
12. The Chipping Forecast II. Nature Genet. 32, 461-552 (2002).
13. Guttmacher, A. E. & Collins, F. S. Genomic medicine: A primer. N. Engl. J. Med. 347, 1512-1520 (2002).
14. National Research Council. Mapping and Sequencing the Human Genome (National Academy Press, Washington DC, 1988).
15. US Department of Health and Human Services, US DOE. Understanding Our Genetic Inheritance. The US Human Genome Project: The First Five Years. NIH Publication No. 90-1590 (National Institutes of Health, Bethesda, MD, 1990).
16. Collins, F. & Galas, D. A new five-year plan for the US Human Genome Project. Science 262, 43-46 (1993).
17. Collins, F. S. et al. New goals for the US Human Genome Project: 1998-2003. Science 282, 682-689 (1998).
18. Hilbert, D. Mathematical problems. Bull. Am. Math. Soc. 8, 437-479 (1902).
19. Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301-1310 (2002).
20. Sidow, A. Sequence first. Ask questions later. Cell 111, 13-16 (2002).
21. Zhang, M. Q. Computational prediction of eukaryotic protein-coding genes. Nature Rev. Genet. 3, 698-709 (2002).
22. Banerjee, N. & Zhang, M. X. Functional genomics as applied to mapping transcription regulatory networks. Curr. Opin. Microbiol. 5, 313-317 (2002).
23. Van der Weyden, L., Adams, D. J. & Bradley, A. Tools for targeted manipulation of the mouse genome. Physiol. Genomics 11, 133-164 (2002).
24. Hannon, G. J. RNA interference. Nature 418, 244-251 (2002).
25. Stockwell, B. R. Chemical genetics: Ligand-based discovery of gene function. Nature Rev. Genet. 1, 116-125 (2000).
26. Gavin, A. C. et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141-147 (2002).
27. Tyson, J. J., Chen, K. & Novak, B. Network dynamics and cell physiology. Nature Rev. Mol. Cell Biol. 2, 908-916 (2001).
28. Sachidanandam, R. et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928-933 (2001).
29. Gabriel, S. B. et al. The structure of haplotype blocks in the human genome. Science 296, 2225-2229 (2002).