The Case for Componentry in Genome Information Systems

Nathan Goodman, Steve Rozen, Lincoln Stein

{nat,steve,lstein}@genome.wi.mit.edu
Whitehead/MIT Center for Genome Research
Whitehead Institute for Biomedical Research
One Kendall Square, Building 300, Floor 5
Cambridge MA 02139, USA

We are interested in a problem that is closely related to the theme of this workshop, namely, the construction of genome information systems using modular, ``plug-and-play'' components. By ``genome information systems'' we mean the sorts of computer systems constructed at genome centers, as well as organism-specific data resources and other public data repositories. By ``components'' we mean significant subsystems, such as databases, analysis programs, and user interfaces.

At present, almost all genome information systems are constructed from scratch with little reuse of software developed elsewhere. The main exceptions are a few organism-specific data resources that have adopted a complete existing system, generally ACEDB, and use it with minor customization. In other words, the choices that face the architect of a genome information system today are: (i) build it yourself so that it does exactly what you want, or (ii) adopt someone else's system and live with most of its quirks and limitations. There is no middle ground in which a designer could choose to build some parts of the system, while adopting existing components for the remainder.

This is a sad state of affairs for several reasons.

  1. It is expensive. Each genome information system duplicates functionality that exists in other systems.
  2. It slows creative progress in our field. Each project is forced to spend a large fraction of its time duplicating the work of others, leaving a small fraction for innovative work.
  3. It creates a barrier to small-scale, investigator-initiated research. Almost all work on genome information systems is done in large projects funded through genome centers or standalone community resources. There are but a handful of small, ``R01 scale'' grants in this area, making it difficult for genome informaticists to function as independent scientists. (There _are_ a reasonable number of small-scale grants on algorithm development and similar topics.)

Genome informatics cannot flourish as a scientific discipline until it becomes possible for significant contributions to be made by individual scientists. This cannot happen until the ``unit of contribution'' becomes the component, rather than the complete system. This in turn cannot happen until we learn how to build complete genome information systems out of independently developed components.

To achieve this goal, we need to devise one or more architectural frameworks into which genome information components can ``plug-and-play''. The paragraphs that follow present a strawman framework to illustrate the issues that arise.

Most modern information systems are built from the following generic types of components:

An interconnection framework must allow these components to interact gracefully.

An information system for a specific application domain (of which genomics is an example) must support some set of data types. For a genome information system these types would include:

Most data types place demands on multiple component types. For example, to do a good job of supporting sequence data, one generally needs:

Consider the following hypothetical scenario. Suppose it is the year 1995 (or 2000, or 2010, as you wish), and we face the job of building an information system to support a genome center that is sequencing the human genome. The system would have to manage:

The system would also have to import published map data (e.g., from GDB) and sequences (e.g., from GenBank) and maintain a database that integrates this information with the newly generated sequence. The system would also have to be able to disseminate these results to the community.

Software already exists to support most of these functions, and surely more and better software will exist by the start of this hypothetical project. In an ideal world, we could utilize existing components for many parts of the system, and focus our creative energy on creating or improving those components that don't exist or are not up to the task. For example, we might manage workflow data using our current LabBase system; we might use ACEDB or Genome Topographer as our integrated database; but we might choose to develop a new sequence database and tools if it turned out that no existing system were up to the challenge of the human genome.
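The scenario above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Component` and `Framework` classes, the role names, and the `adopted` flag are hypothetical names invented for this sketch, and none of them corresponds to an interface of LabBase, ACEDB, or any other real system. The sketch shows the one property that matters for the argument: the architect fills some roles with existing components and can see which roles remain to be built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A pluggable subsystem (hypothetical descriptor, not a real API)."""
    role: str       # the slot it fills, e.g. "workflow"
    name: str       # the system filling the slot, e.g. "LabBase"
    adopted: bool   # True if reused from elsewhere, False if built in-house


class Framework:
    """Tracks which required roles have a component plugged in."""

    def __init__(self, required_roles):
        self.required_roles = tuple(required_roles)
        self._slots = {}

    def plug(self, component):
        """Install a component, replacing any earlier one in the same role."""
        if component.role not in self.required_roles:
            raise ValueError(f"unknown role: {component.role}")
        self._slots[component.role] = component

    def missing_roles(self):
        """Roles that still need a component, in declaration order."""
        return [r for r in self.required_roles if r not in self._slots]

    def in_house_roles(self):
        """Roles where the architect chose to build rather than adopt."""
        return [r for r in self.required_roles
                if r in self._slots and not self._slots[r].adopted]


def build_example():
    """Assemble the hypothetical sequencing-center system from the text."""
    system = Framework(["workflow", "integrated_database", "sequence_database"])
    system.plug(Component("workflow", "LabBase", adopted=True))
    system.plug(Component("integrated_database", "ACEDB", adopted=True))
    return system
```

Calling `build_example().missing_roles()` reports the sequence database as the only unfilled role, which is where the new development effort would go. A real framework would of course also have to define how the plugged-in components exchange data, which this sketch deliberately leaves out.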

In recent years a number of schemes have become available for sharing structured objects among applications and transporting them among various systems. Notable examples are Microsoft's Object Linking and Embedding (OLE) and Apple's OpenDoc, which is based on IBM's System Object Model and the Common Object Request Broker Architecture (CORBA). Schemes like these should make it possible to construct well-integrated information systems from standard components. Nevertheless, several challenges must be met before the bioinformatics community can realize the benefits of frameworks for plug-compatible components. These include:

No one group alone can produce the framework and components we envision. The primary job of an individual group is to produce experimental results and provide information to molecular biology researchers. Therefore, once a group has built an information system that satisfies its primary needs, it derives little benefit ``on the margin'' from expending resources on developing a general framework and plug-compatible components. Even more importantly, if any framework is to be widely accepted by the bioinformatics community, it must be the result of a community effort built on substantial consensus.

We know that others share this vision of genome information systems constructed from ``plug-and-play'' components. The next step is for us to join together, combine our ideas, and create a concrete specification that our community can use to implement this vision.