Conceptual Integration of Genome Databases via Reduced Autonomy and Domain-Specific Data Models


Mark Graves

Baylor College of Medicine, Department of Cell Biology
One Baylor Plaza, Houston, TX 77030 USA
phone: +1 (713) 798-8271; fax: +1 (713) 798-3759
email: mgraves@bcm.tmc.edu

Integrating genome data across multiple databases may be facilitated by examining the requirements of genome databases and tailoring available tools to those requirements. General techniques for integrating databases make assumptions that may not hold for every database. Genome databases have sociological and technical restrictions which distinguish them from other kinds of databases. Clarifying some of the sociological restrictions that genome databases place on data integration makes it possible to develop technological solutions which will be useful in practice.

INTRODUCTION

Integrating genome data for the Human Genome Project requires a database which is:

1. distributed across different sites,
2. managed by several autonomous groups, and
3. conceptually integrated.

An integrated genome database must balance the physical separation of the sources generating the data with the need to integrate the data. Combining all the data into a single store will not distribute the control and responsibility which is sociologically required. Storing data at different sites with minimal or no exchange of data between them will not allow for the conceptual integration which is necessary to further the science, prevent unwarranted duplication of scientific effort, and provide the breadth of genomic information which is required for many investigations.

For data to be shared between databases, there should be some correspondence in how concepts are interpreted by the different databases. The semantic heterogeneity of the databases cannot be eliminated because of the rapidly changing nature of the science. Instead, a mechanism is necessary which illuminates the differences and allows semantically identical constructs to be exchanged in a syntactically compatible format.

We describe multidatabase systems which allow for physical distribution; characterize design autonomy; and discuss conceptual integration. We present two approaches for achieving conceptual integration via reduced conceptual design autonomy that still preserve most design autonomy. Finally, we present a mechanism for data exchange based on domain-specific data models which supports the approaches to conceptual integration.

Multidatabase Solutions

A multidatabase system is a database management system which supports access to multiple component databases. Multidatabase systems may be classified into a large number of classes. Three classes of multidatabase systems relevant to genome informatics are:

interoperable databases - multiple, autonomous databases managed together without a global (logical) schema.

distributed database - a single logical database which provides access to data located at multiple sites.

conceptually integrated database - multiple databases which exchange data via an interlingua that captures the concepts of the domain.

The distinctions between the three may be clarified in terms of the design schemas for each database. A database can be described at three levels of abstraction based on the schemas generated in its design. The three design schemas for a database are conceptual, logical, and physical [Teorey 1994]. The conceptual schema describes the concepts and relationships in the domain, such as genes, sequence, maps, markers, clones, proteins and chromosomes. The logical schema describes the data using the data modeling language which formalizes the chosen database management system. The logical data model might be relational [Codd 1970] or object-oriented [Kifer 1989; Hull 1987; Bancilhon 1992]. The genes, sequence, maps, etc. of the conceptual schema are described in the vocabulary of the logical data model, such as relations and objects. The physical schema describes the constructs of the logical schema in the data structures of a database management system, such as tables, classes, or B-Trees.
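The three levels can be illustrated with a small sketch. It is not taken from any genome database: the `Gene` and `Chromosome` classes, the `LOGICAL_SCHEMA` dictionary, and the table and index names are all hypothetical, and SQLite is used only because it ships with Python. The same two concepts appear first as domain concepts, then as relations, then as tables with an index.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual schema: domain concepts and the relationship between them.
@dataclass(frozen=True)
class Chromosome:
    name: str

@dataclass(frozen=True)
class Gene:
    symbol: str
    chromosome: Chromosome  # relationship: a gene lies on a chromosome

# Logical schema: the same concepts in the vocabulary of the relational
# model (relation name -> attribute names).
LOGICAL_SCHEMA = {
    "chromosome": ("name",),
    "gene": ("symbol", "chromosome_name"),
}

# Physical schema: the relations realized as SQLite tables plus an index.
def build_physical(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE chromosome (name TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE gene ("
        "symbol TEXT PRIMARY KEY, "
        "chromosome_name TEXT NOT NULL REFERENCES chromosome(name))"
    )
    conn.execute("CREATE INDEX gene_by_chromosome ON gene (chromosome_name)")

def store(conn: sqlite3.Connection, gene: Gene) -> None:
    """Write one conceptual Gene down through the logical and physical levels."""
    conn.execute(
        "INSERT OR IGNORE INTO chromosome VALUES (?)", (gene.chromosome.name,)
    )
    conn.execute(
        "INSERT INTO gene VALUES (?, ?)", (gene.symbol, gene.chromosome.name)
    )

conn = sqlite3.connect(":memory:")
build_physical(conn)
store(conn, Gene("BRCA1", Chromosome("17")))
print(conn.execute("SELECT symbol, chromosome_name FROM gene").fetchall())
```

Two databases could share the conceptual classes while differing completely in the logical and physical parts, which is the situation conceptual integration is meant to accommodate.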

One possible solution to creating an integrated database is interoperability. Because interoperable databases do not have a global logical schema, they exchange data in the language of the physical layer. Interoperable databases are still a research area in computer science, but they would form a foundation for an integrated genome database, and considerable effort has gone into investigating interoperability for genome databases [Karp 1994].

A distributed database exchanges data using a common logical schema, though the databases are physically at separate locations. Developing a global logical schema is possible, as Ritter [1994] has shown by integrating the schemas of most major genome databases into an ACeDB schema. Although this is a successful technical solution, the genome databases appear to be too autonomous to agree on a global logical schema.

Another solution is conceptual integration. Since distributed databases and interoperable genome databases are not currently viable solutions, a reasonable approach to integrating data is to exchange data at the conceptual level. Rather than exchange data based on the physical formats, such as strings, integers and records, or the logical formats of objects or relations, exchanging data at the conceptual level consists of developing concepts for genomic data, such as genes, markers, and loci. These concepts do not restrict what can be stored in different databases, but provide a common language, or interlingua, for exchanging data between genome databases.

Design Autonomy

One of the factors affecting the choice of multidatabase solutions is autonomy of design [Sheth 1990; Veijalainen 1988]. Interoperability assumes that a global logical schema is not feasible and that each database must be completely autonomous. A distributed database assumes that a global logical schema exists and requires a reduction in the logical design autonomy to the point of cooperation.

Design autonomy may be subdivided into autonomy of physical, logical, and conceptual design. If autonomy could be reduced in one of these areas, a multidatabase solution would be easier.

Genome databases require some autonomy. Each database must be able to choose its own design with respect to the area of the domain being modeled, the name of the data elements, interpretation of the concepts in the domain, the data model used, constraints on the data, functionality of the system and the implementation. Each database needs to retain some control over each of these design decisions.

However, the genome databases are not completely autonomous. Each major genome database is providing a service to the same community and receiving funding from a small number of sources. The administrators of most databases consider sharing of data between databases to be a high priority. Currently, the major genome databases are storing overlapping, but not completely redundant data.

Conceptual Integration

Conceptual integration cannot be restrictive. Genome databases must capture the science, and part of the science is different theories. Communities studying different species or working in other areas of biology have different needs. A useful integrated database must capture a variety of definitions, old and new, many of which are not explicitly specified.

Integrating genome data requires reconciling the diverse interpretations of genomic concepts, integrating the relations between genomic concepts and representing the discrepancies in the underlying data. Because of the rapid accumulation of data and the existence of multiple genome databases, one paradigm in which to integrate the data is to develop a multidatabase system which integrates genomic data across databases.

REDUCED CONCEPTUAL DESIGN AUTONOMY

A multidatabase system can be developed by reducing the conceptual design autonomy of the different databases. The logical design of each database and the data model chosen remains under the complete control of the local database administrator. The only restriction is that the database capture the information of the conceptual design.

Conceptual integration allows for complete logical design autonomy, though it does restrict conceptual design autonomy. There are at least three approaches to conceptual integration:

1. complete autonomy in conceptual design
2. reduced (or restricted) autonomy in conceptual design
3. cooperation in conceptual design

Mathematically, these can be distinguished by the number of conceptual languages used to exchange data. For N databases, do we want N languages, fewer than N languages, or 1 language?

Each option has its own practical considerations. Few people have the inclination to develop N translators to and from their own database as would be required for complete autonomy. Cooperation is possible, but would require an agreement by database administrators (or the agencies that fund them) that sharing data is important enough to make it a priority in database design. Reduced autonomy is not as clean, but may be a mechanism for initiating more cooperation. If two or more databases agree on a common representation for shared genome objects, others may adopt that representation.
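The cost difference between the two extremes can be made concrete with a rough count. The sketch below rests on an assumption that is not stated in the text: under complete autonomy every ordered pair of databases needs its own translator, while under cooperation each database needs one translator into the shared language and one out of it. The function names are illustrative only.

```python
def pairwise_translators(n_databases: int) -> int:
    """Directed translators when every database keeps its own language."""
    return n_databases * (n_databases - 1)

def interlingua_translators(n_databases: int) -> int:
    """Translators when all databases share one language (one in, one out each)."""
    return 2 * n_databases

for n in (2, 5, 10):
    print(n, pairwise_translators(n), interlingua_translators(n))
```

Under these assumptions the shared language only pays off once more than three databases take part, and the gap then grows quadratically. Reduced autonomy, with fewer than N languages, falls somewhere between the two counts.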

Reduced conceptual autonomy may be obtained in at least two ways. The first approach is to not require the conceptual design of the interlingua be adopted in the design of local databases, but only require that the local conceptual design be mappable to the global conceptual schema. Two advantages of this approach are that it can be incorporated by existing databases, and it provides even more autonomy to the local databases.

The second approach is to restrict the functionality of the conceptual language used for data exchange. One such restriction would be to provide read-only access mechanisms (as suggested by Robbins [1995]). This approach would be simpler to adopt because semantically heterogeneous concepts do not need to be reconciled but only presented in a framework which illuminates the differences.

The two approaches can be combined to further increase autonomy and simplify adoption because the mapping function from the local database to the global conceptual schema needs only to be one-way.
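A minimal sketch of the combined approach follows. Everything in it is hypothetical: the two local databases (`db_a`, `db_b`), their field names and type codes, and the `GlobalMarker` record are invented for illustration. Each local database supplies only an export function into the shared record, the record is immutable, and no inverse mapping is written.

```python
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)  # frozen: consumers get read-only records
class GlobalMarker:
    name: str
    marker_type: str
    source: str  # originating database, kept so that differences stay visible

# Hypothetical database A stores markers as {"locus_symbol": ..., "kind": ...}.
def export_from_db_a(row: Mapping[str, str]) -> GlobalMarker:
    return GlobalMarker(
        name=row["locus_symbol"],
        marker_type=row["kind"].lower(),
        source="db_a",
    )

# Hypothetical database B uses one-letter type codes instead.
_DB_B_TYPES = {"R": "rflp", "D": "dinucleotide_repeat"}

def export_from_db_b(row: Mapping[str, str]) -> GlobalMarker:
    return GlobalMarker(
        name=row["id"],
        marker_type=_DB_B_TYPES[row["class_code"]],
        source="db_b",
    )

shared = [
    export_from_db_a({"locus_symbol": "marker_1", "kind": "RFLP"}),
    export_from_db_b({"id": "marker_1", "class_code": "D"}),
]
for record in shared:
    print(record)
```

The two exported records disagree about the type of `marker_1`. Nothing in the sketch resolves that disagreement; it only places both claims side by side in a common form, which is the limited goal the read-only approach sets.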

DOMAIN-SPECIFIC DATA MODELS

To exchange data at the conceptual level, an interlingua must exist which captures the semantics of the concepts and relationships being exchanged. Areas of artificial intelligence, such as knowledge representation and computational linguistics, have developed general languages for capturing semantic richness. Conceptual models are an attempt to transform those representation languages to data models which capture the semantic, structural, and behavioral information needed to describe a database.

A data model is a language for describing the data in a database. A data model may be defined by specifying its type constructors (basic building blocks), operators (which manipulate the data) and constraints (which restrict the instances of the data types that can legally occur in the database). The relational data model was the first data model to be specified this way [Codd 1970, 1980]. The type constructor of the relational model is the relation. The eight operators are select, project, join, product, union, difference, intersection and division. There are two database-independent relational integrity constraints: (1) No component of the primary key of a base relation is allowed to accept nulls; (2) The database must not contain any unmatched foreign key values.
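The two integrity constraints can be checked mechanically. The following is only a sketch: relations are represented as lists of Python dictionaries, `None` stands in for a null, and the function names and sample rows are invented. It also assumes that a foreign key containing a null counts as absent rather than unmatched.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]

def null_primary_keys(rows: List[Row], primary_key: Sequence[str]) -> List[Row]:
    """Rows violating constraint (1): some primary-key component is null."""
    return [r for r in rows if any(r.get(col) is None for col in primary_key)]

def unmatched_foreign_keys(
    child: List[Row],
    foreign_key: Sequence[str],
    parent: List[Row],
    parent_key: Sequence[str],
) -> List[Row]:
    """Rows violating constraint (2): a non-null foreign key with no parent row."""
    known = {tuple(p.get(c) for c in parent_key) for p in parent}
    bad = []
    for r in child:
        value = tuple(r.get(c) for c in foreign_key)
        if None in value:
            continue  # assumption: a null foreign key is absent, not unmatched
        if value not in known:
            bad.append(r)
    return bad

markers = [{"name": "m1"}, {"name": None}]
alleles = [
    {"id": 1, "marker": "m1"},
    {"id": 2, "marker": "m9"},
    {"id": 3, "marker": None},
]
print(null_primary_keys(markers, ["name"]))
print(unmatched_foreign_keys(alleles, ["marker"], markers, ["name"]))
```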

Complex domain-specific data models can be developed incrementally both in the area of the domain they cover and in the functionality they provide for that area. For example, a genome mapping data model could begin with addressing only physical maps or only genetic linkage maps. Within the limited area, restricted operations can be provided, such as read-only accessors. Later, operators to enter data may be added. Finally, the coverage of the data model can be expanded to cover other kinds of genome maps.

One simplification to developing domain-specific data models is to restrict the expressive power of the modeling language. One useful restriction is to limit the type constructors which define the data model to the expressiveness of abstract data types (ADTs).

For example, markers are used frequently in genome mapping, and having a set of commonly agreed upon definitions would increase sharing between databases. A simple data model for markers is given immediately below. The syntax is:

<definition>::= <operator>(<argument>*): <data type>
<argument>::= [<name>:] <data type>

Types:

MARKER, SYMBOL, MARKER_TYPE, NUMBER, DNA_BASE, ALLELE

Operations:

marker_name(MARKER): SYMBOL -- retrieve the name of a marker
marker_type(MARKER): MARKER_TYPE -- retrieve the type of a marker
allele_frequency(ALLELE): NUMBER -- retrieve the frequency of an allele

These retrieval operators depend upon other definitions:

rflp: MARKER_TYPE
dinucleotide_repeat(base1: DNA_BASE, base2: DNA_BASE): MARKER_TYPE
A: DNA_BASE
T: DNA_BASE
C: DNA_BASE
G: DNA_BASE

An advantage of using a domain-specific data model is that it accurately describes what is contained in the database in terms of the domain. Although inadequate for many tasks, much would be accomplished even with this simple data model, such as agreeing to support RFLP markers and dinucleotide repeat markers and to provide frequency information about their alleles. In addition, a data model describes what is not in the database: someone looking for minisatellites would know not to examine this data.
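The marker data model above translates almost directly into a set of abstract data types. The sketch below is one possible rendering, not a prescribed implementation: the Python class names, the use of frozen dataclasses, and the sample marker name `marker_1` are assumptions. Only the three retrieval operators and the MARKER_TYPE and DNA_BASE definitions come from the model itself; how values get built is left to each local database, and the dataclass constructors are used here only to produce test values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union

class DnaBase(Enum):  # A, T, C, G: DNA_BASE
    A = "A"
    T = "T"
    C = "C"
    G = "G"

@dataclass(frozen=True)
class Rflp:  # rflp: MARKER_TYPE
    pass

@dataclass(frozen=True)
class DinucleotideRepeat:  # dinucleotide_repeat(base1, base2): MARKER_TYPE
    base1: DnaBase
    base2: DnaBase

MarkerType = Union[Rflp, DinucleotideRepeat]

@dataclass(frozen=True)
class Allele:
    name: str  # SYMBOL
    frequency: float  # NUMBER

@dataclass(frozen=True)
class Marker:
    name: str  # SYMBOL
    type: MarkerType

# Read-only operations of the data model.
def marker_name(marker: Marker) -> str:
    return marker.name

def marker_type(marker: Marker) -> MarkerType:
    return marker.type

def allele_frequency(allele: Allele) -> float:
    return allele.frequency

m = Marker("marker_1", DinucleotideRepeat(DnaBase.C, DnaBase.A))
print(marker_name(m), marker_type(m))
print(allele_frequency(Allele("allele_1", 0.25)))
```

Because `MarkerType` admits only the two listed alternatives, a client can see from the types alone that minisatellites are not represented.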

Data constructors may also be added to the data model. These do not need to cover the entire functionality of the genome database, but should provide a collection of useful operations. For example, the multidatabase interface might place more restrictions on the type of data which can be accessed than the local operations do. Some data constructors for markers might include:

marker(name:SYMBOL, location:LOCUS, MARKER_TYPE, SET(ALLELE)): MARKER
PCR_marker(name:SYMBOL, PCR_PRIMERS): MARKER
CA_repeat_marker(name:SYMBOL, location:LOCUS, SET(ALLELE), PCR_PRIMERS): MARKER
allele(name:SYMBOL, frequency: NUMBER): ALLELE
allele(size:NUMBER, frequency: NUMBER): ALLELE
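A few of these constructors might be sketched as follows. Several choices here are assumptions rather than part of the data model: LOCUS is represented as a plain string, PCR_PRIMERS as a pair of strings, MARKER_TYPE as a tuple of strings, and the two overloaded `allele` constructors are given distinct names because Python does not overload on argument type. The check that a frequency lies between 0 and 1 is an added illustrative constraint. `PCR_marker` is omitted because the text does not say which marker type or location it yields.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple, Union

@dataclass(frozen=True)
class Allele:
    label: Union[str, float]  # symbolic name or size
    frequency: float

@dataclass(frozen=True)
class Marker:
    name: str
    location: str  # LOCUS, simplified to a string
    marker_type: Tuple[str, ...]
    alleles: FrozenSet[Allele]
    pcr_primers: Optional[Tuple[str, str]] = None

def _checked(frequency: float) -> float:
    # Illustrative constraint (an assumption): frequencies are proportions.
    if not 0.0 <= frequency <= 1.0:
        raise ValueError(f"frequency out of range: {frequency}")
    return frequency

def allele_named(name: str, frequency: float) -> Allele:
    """allele(name:SYMBOL, frequency:NUMBER): ALLELE"""
    return Allele(name, _checked(frequency))

def allele_sized(size: float, frequency: float) -> Allele:
    """allele(size:NUMBER, frequency:NUMBER): ALLELE"""
    return Allele(float(size), _checked(frequency))

def marker(
    name: str,
    location: str,
    marker_type: Tuple[str, ...],
    alleles: Iterable[Allele],
) -> Marker:
    """marker(name, location, MARKER_TYPE, SET(ALLELE)): MARKER"""
    return Marker(name, location, marker_type, frozenset(alleles))

def ca_repeat_marker(
    name: str,
    location: str,
    alleles: Iterable[Allele],
    pcr_primers: Tuple[str, str],
) -> Marker:
    """CA_repeat_marker: the marker type is fixed to a C/A dinucleotide repeat."""
    return Marker(
        name,
        location,
        ("dinucleotide_repeat", "C", "A"),
        frozenset(alleles),
        pcr_primers,
    )

m = ca_repeat_marker(
    "marker_1",
    "locus_1",
    [allele_sized(120, 0.6), allele_sized(122, 0.4)],
    ("primer_forward", "primer_reverse"),
)
print(m.marker_type, len(m.alleles))
```

A specialized constructor such as `ca_repeat_marker` shows how the interface can be narrower than the local database: the caller cannot choose the marker type at all.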

A data model provides flexibility in the design of the interlingua by being extensible. Details which are not addressed initially may be incorporated at a later date. For example, if assuming symbolic names for markers and alleles is too great a restriction, then the data model can be extended to incorporate that functionality.

It is not necessary to agree on all the information required to completely describe the domain, only to describe what is agreed upon in a mutually beneficial language. An interlingua which describes a subset of the data is more useful than no interlingua.

CONCLUSION

We have investigated conceptual integration of genome databases and have found that:

These results have been promising and appear to be a step in the direction of developing genome databases which are physically distributed, autonomously managed, and conceptually integrated.

REFERENCES

Bancilhon F, Delobel C, Kanellakis P, eds, 1992. Building an Object-Oriented Database System: The Story of O2. Morgan Kaufmann, San Francisco.

Codd EF, 1970. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377-387.

Codd EF, 1980. Data models in database management. In M Brodie and SN Zilles, eds, Proc Workshop on Data Abstraction, Databases, and Conceptual Modelling. Also, ACM SIGMOD Record, 11(2).

Hull R, 1987. A survey of theoretical results on typed complex database objects. In J Paredaens, ed, Databases, chap 5, pp 193-256. Academic Press.

Karp P, 1994. Report of the workshop on interconnection of molecular biology databases (MIMBD-94). WWW via http://www.ai.sri.com/people/pkarp/mimbd/mimbd-94.html.

Kifer M, Lausen G, 1989. F-Logic: A higher-order language for reasoning about objects, inheritance, and scheme. In Proc ACM SIGMOD Conference, pp 134-146.

Ritter O, Kocab M, Senger M, Wolf D, Suhai S, 1994. Prototype implementation of the Integrated Genome Database. Computers and Biomedical Research, 27:97-115.

Robbins R, 1995. An information infrastructure for the human genome project. IEEE Engineering in Medicine and Biology, special issue on genome informatics, in press.

Sheth A, Larson J, 1990. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3).

Teorey T, 1994. Database Modeling & Design: The Fundamental Principles (2ed). Morgan Kaufmann, San Francisco.

Veijalainen J, Popescu-Zeletin R, 1988. Multidatabase systems in ISO/OSI environment. In Proc of the 6th Symposium on Reliability in Distributed Software and Database Systems.