It is estimated that when the Human Genome Project's DNA sequencing phase reaches its peak steady-state, the production rate will be approximately 70 to 80 "new" genes each day, every day. This estimate is based on genomic sequencing and not on expressed sequence tags (ESTs) or full-length cDNA sequencing. The challenge to research scientists is to identify these "genes" to some level of detail, determine which are of potential value, and exploit that information as quickly as possible. This is a daunting task even with powerful computer and information systems; without a sophisticated informatics infrastructure, it is impossible. The supporting information infrastructure can be broken into a number of components: data-acquisition systems (including inventory control and reagent manipulation systems, sequence production software, etc.); data-analysis systems (including sequence analysis software, structure prediction software, gene mapping software, feature extraction software, etc.); and data-management systems (including local and shared databases). Databases form the core systems for the identification of "new" genome features and are the core of any informatics-based drug development strategy. The development and appropriate use of database collections, often referred to as data warehouses, is an essential process, and computer science has provided tools for the extraction of knowledge from these database collections. In high-throughput research environments, automated systems for processing new data through these data warehouses are essential tools for discovery research. (C) 1997 Wiley-Liss, Inc.