Bioinformatics is an emerging scientific discipline at the intersection of computer science and life science. Genome databases are a subarea of bioinformatics concerned with managing the large volumes of experimental data generated by new high-throughput techniques in biology, and the symbolic computational theories that result from integrating across those data. This talk will survey research issues in genome databases, and illustrate those issues with an analogy to reverse engineering of computer viruses. The talk will also examine why the computer science field of databases has had so little impact on the field of genome databases, and outline a proposed symbolic computing curriculum for scientists that I argue is essential in this industrial age of science.
"The EcoCyc and MetaCyc Pathway/Genome Databases"
Peter D. Karp, SRI International
"The Encyclopedia of Life Project"
Phil Bourne, San Diego Supercomputer Center
The Encyclopedia of Life (EOL; http://eol.sdsc.edu) is an ambitious project to catalog the complete proteome of every living species in a flexible, powerful reference system. EOL is an open collaboration calculating three-dimensional models and assigning biological function for all recognizable proteins in all currently partially or completely sequenced genomes. This talk will focus on the unique aspects of in silico high-throughput proteomics, with specific reference to our reliability measures in assigning putative structure and function, the use of high-performance computing environments for the massive amounts of computing needed by the project, and our efforts to improve the human-computer interface beyond that delivered by most biological resources today.
"Data Integration and Management for Molecular
and Cell Biology"
Barbara Eckman, IBM Life Sciences
Biological research poses significant data integration and management challenges. To identify and characterize regions of functional interest in genomic sequence requires full, flexible query access to an integrated, up-to-date view of all related information, irrespective of where it is stored (within an organization or across the Internet) and its format (traditional database, semi-structured text file, web site, results of runtime analysis). Wide-ranging multi-source queries often return unmanageably large result sets, requiring non-traditional approaches to exclude extraneous data. As high-throughput biology generates large volumes of data about the "parts list" of living organisms, there is a growing need for robust, efficient systems to manage metabolic and signaling pathways, gene regulatory networks, protein interaction networks, and a variety of annotations on the network components. This data tends to be best represented as graphs, and researchers need to navigate, query and manipulate the data in ways that may not be well supported by standard relational database tools.
In this talk I will present approaches for meeting these challenges through extensions to DB2, IBM's relational database management system. Our position is that the increasingly large size of biological data repositories, their complexity, and the requirement for efficient, robust data management make extending a mature data management technology a good choice. IBM's DiscoveryLink is a federated database middleware system that provides SQL access to data from multiple sources, irrespective of where it is stored (within an organization or across the Internet) and its format. User-Defined Functions (UDFs) "build science into DB2" by providing basic sequence manipulations, generalized pattern-matching, etc., from within a SQL query. Finally, we are currently considering how best to extend DB2 with graph objects and operations to support data management in systems biology.
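The idea of "building science into" a database via UDFs can be sketched in miniature. The example below uses Python's sqlite3 module purely as a stand-in for DB2 (DiscoveryLink and DB2 UDF mechanics differ); the function names and sequences are invented for illustration:

```python
import re
import sqlite3

def revcomp(seq):
    """Reverse-complement a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def matches_motif(seq, pattern):
    """Return 1 if a regex motif occurs in the sequence, else 0."""
    return 1 if re.search(pattern, seq) else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (name TEXT, seq TEXT)")
conn.executemany("INSERT INTO genes VALUES (?, ?)",
                 [("gA", "ATGGCGTATAA"), ("gB", "ATGCCCTGA")])

# Register the science as SQL-callable functions, in the spirit of DB2 UDFs.
conn.create_function("REVCOMP", 1, revcomp)
conn.create_function("MATCHES_MOTIF", 2, matches_motif)

# Sequence manipulation and pattern matching from within a SQL query:
rows = conn.execute(
    "SELECT name, REVCOMP(seq) FROM genes WHERE MATCHES_MOTIF(seq, 'TAT..')"
).fetchall()
print(rows)
```

Once registered, the biological operations compose with ordinary SQL predicates, joins, and aggregates, which is the practical appeal of the UDF approach.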
"Integrated Data Systems for Interpreting Genome-Focused
Data in Cancer"
Ajay Jain, University of California, San Francisco
Biology is rapidly evolving into a quantitative molecular science. Completion of the human genome sequence and improvements in measurement technology for DNA, RNA, and proteins contribute to this trend, resulting in an increasing pool of quantitative biological data. Given quantitative data, it is possible to induce constraints and interrelationships to construct predictive models of biological systems. Such models can provide context for the rapid interpretation of experimental observations. However, substantial statistical issues are embedded within data sets of millions of values on much smaller sets of samples. We believe that by integrating quantitative analytical methods and data visualization approaches with annotation information about biological entities, we can derive quantitatively defensible conclusions. Our collaborations are generating extensive and growing sets of microarray-based expression data and/or high-resolution genomic copy number data in multiple human cancers. This seminar will present a data system that enables integrated analysis combining experimental data with genomic and genetic annotations.
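The statistical issue mentioned above, many measured variables on few samples, can be illustrated with a small simulation (all numbers below are invented noise, not experimental data): even when every gene is pure noise, some genes will show large apparent group differences by chance alone.

```python
import random
import statistics

# Simulate 10,000 "genes" measured in two groups of 5 samples each.
# Both groups are drawn from the same distribution, so any apparent
# difference between groups is a false positive.
random.seed(0)
N_GENES, N_PER_GROUP, CUTOFF = 10_000, 5, 1.0

false_hits = 0
for _ in range(N_GENES):
    a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    if abs(statistics.mean(a) - statistics.mean(b)) > CUTOFF:
        false_hits += 1

print(f"{false_hits} of {N_GENES} noise-only genes exceed the cutoff")
```

With millions of values on small sample sets, naive per-gene thresholds produce hundreds of spurious "discoveries", which is why integrated annotation and careful multiple-testing correction matter for quantitatively defensible conclusions.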
"PharmGKB: The Pharmacogenomics Knowledge Base"
Micheal Hewett, Stanford University
The Pharmacogenomics Knowledge Base is a three-year-old project to collect information on interactions between genes and prescription drugs. Eventually, this knowledge will lead to personalized medicines that are optimally effective for a patient's genetic makeup. Currently the knowledge base supports primary research in pharmacogenomics.
This talk will focus on 3 main points:
"Integration Challenges for Gene Expression
and Related Data"
Anthony Kosky, GeneLogic, Inc.
DNA microarrays have emerged as the leading technology for measuring gene expression, primarily because of their high throughput: a single microarray experiment provides measurements of mRNA transcription levels for tens of thousands of genes in parallel. While this technology opens new opportunities for functional genomics and drug discovery applications, it also presents new bioinformatics and data management challenges arising from the need to capture, manage and integrate vast amounts of expression data together with related gene and sample annotation data.
GeneExpress is a data management system developed by Gene Logic Inc. that contains gene expression information for thousands of normal and diseased samples, and for experimental animal model and cellular tissues. Initially the GeneExpress system was developed with the goal of supporting effective exploration, analysis and management of gene expression data generated at Gene Logic using the Affymetrix GeneChip platform. Building such a system involved addressing various data integration problems in order to associate gene expression data with sample data and gene annotations. A subsequent goal for the GeneExpress system was to provide support for incorporating gene expression data generated by Gene Logic's customers. Addressing this additional goal required the resolution of various levels of syntactic and semantic heterogeneity of sample data, gene annotations and gene expression data, often generated under different experimental conditions.
In this talk, I will describe some of the data integration challenges associated with managing gene expression and related data, and describe how we have addressed these challenges in the GeneExpress system.
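One flavor of the syntactic heterogeneity described above is that the same sample attribute arrives under different field names from different sources. A minimal sketch of normalizing such records and joining expression measurements to gene annotations (all field names, synonym mappings, and values here are invented for illustration):

```python
# Invented synonym table mapping external field names to a canonical schema.
FIELD_SYNONYMS = {
    "tissue_type": "tissue",
    "Tissue": "tissue",
    "donor_age_yrs": "age",
    "age_years": "age",
}

def normalize_sample(record):
    """Map heterogeneous field names onto one canonical sample schema."""
    return {FIELD_SYNONYMS.get(key, key): value for key, value in record.items()}

def join_expression(expression, annotations):
    """Attach gene annotations to expression measurements by gene id."""
    return [
        {**m, "annotation": annotations.get(m["gene"], "unannotated")}
        for m in expression
    ]

internal = normalize_sample({"tissue_type": "liver", "donor_age_yrs": 54})
external = normalize_sample({"Tissue": "liver", "age_years": 54})
print(internal == external)  # the two heterogeneous records now agree

measurements = [{"gene": "TP53", "level": 8.1}, {"gene": "ORF123", "level": 2.3}]
annos = {"TP53": "tumor suppressor"}
print(join_expression(measurements, annos))
```

Semantic heterogeneity is harder: two sources may use the same field name with different meanings or units, which no synonym table alone can resolve.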
"Sharing Biomedical Data with Impunity"
Susan B. Davidson
Interim Director, Center for Bioinformatics
University of Pennsylvania
Biological data is frequently shared, copied from a remote location to a local database so that the researcher can have tighter control over the data, combine it with other data, and add additional information or inferences to the data. This entails several problems: First, the data is heterogeneous in format and content, and therefore not easy to integrate. Second, even if the data is similar in content to pre-existing information at the local database, creating a mapping between the data being imported and the pre-existing data is complicated and time-consuming. Third, the semantics of the data being imported have typically been lost when exported from the remote location, and semantic mismatches may therefore occur between the local representation of the data and the imported data. We discuss progress and problems in each of these three areas, and argue the need for more fundamental database research.
"Shared Vocabularies for Biological Annotation: Current Usage and Future"
J. Michael Cherry
The GO project, originally conceived by Dr. Michael Ashburner at the University of Cambridge, is a collaboration among many model organism community/genome databases, for example FlyBase (Drosophila), Mouse Genome Informatics (Mus), The Arabidopsis Information Resource (Arabidopsis), WormDB (Caenorhabditis) and SGD (Saccharomyces), and large specialty databases such as the Genome Knowledge Base (GKB), Swiss-Prot, TIGR and Compugen. There are many hierarchical classification systems for protein function, protein secondary structure and protein families. These classifications are very useful; however, they were not designed to describe the higher-level biological roles gene products play within the cell. The GO collaboration is developing and maintaining three independent ontologies that can be used for gene product annotation (www.GeneOntology.org). The three ontologies are Biological Process, Molecular Function and Cellular Component. These ontologies provide a controlled vocabulary used in the unification of biological annotations. At SGD, GO terms are used to annotate all yeast genes that have phenotype or functional information from the published literature. The ontologies will allow specific linkage between the member databases of the collaboration. As members of the GO Collaboration, we are collectively developing the terms, databases and software. Many other "ontologies" have recently been created by biological annotation projects; some of these, and their future usage, will be discussed.
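GO terms form a directed acyclic graph, and a gene annotated with one term is implicitly annotated with all of that term's ancestors, which is what makes the shared vocabulary useful for cross-database queries. A minimal sketch of that reasoning (the parent links below follow the real is-a idea, but this tiny term graph is a simplified illustration, not an excerpt of the ontology):

```python
# A toy fragment of the Biological Process ontology as a parent-link DAG.
PARENTS = {
    "GO:0006412": ["GO:0009059"],  # translation -> macromolecule biosynthesis
    "GO:0009059": ["GO:0008152"],  # -> metabolic process
    "GO:0008152": ["GO:0008150"],  # -> biological_process (root)
    "GO:0008150": [],
}

def ancestors(term):
    """All terms reachable from `term` via parent links, including itself."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, []))
    return seen

# A gene annotated to "translation" is implicitly annotated to every
# broader process, up to the root of the ontology:
print(sorted(ancestors("GO:0006412")))
```

This transitive-closure behavior is why two databases that annotate at different levels of specificity can still be queried uniformly: a query for "metabolic process" genes retrieves genes annotated to any descendant term.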