The Case for Componentry in Genome Information Systems
Nathan Goodman Steve Rozen Lincoln Stein
{nat,steve,lstein}@genome.wi.mit.edu
Whitehead/MIT Center for Genome Research
Whitehead Institute for Biomedical Research
One Kendall Square, Building 300, Floor 5
Cambridge MA 02139, USA
We are interested in a problem that is closely related to the theme
of this workshop, namely, the construction of genome information
systems using modular, ``plug-and-play'' components. By ``genome
information systems'' we mean the sorts of computer systems
constructed at genome centers, as well as organism-specific data
resources and other public data repositories. By ``components'' we
mean significant subsystems, such as databases, analysis programs,
and user interfaces.
At present, almost all genome information systems are constructed
from scratch with little reuse of software developed elsewhere.
The main exceptions are a few organism-specific data resources that
have adopted a complete existing system, generally ACEDB, and use
it with minor customization. In other words, the choices that face
the architect of a genome information system today are: (i) build
it yourself so that it does exactly what you want, or (ii) adopt
someone else's system and live with most of its quirks and
limitations. There is no middle ground in which a designer could
choose to build some parts of the system, while adopting existing
components for the remainder.
This is a sad state of affairs for several reasons.
- It is expensive. Each genome information system duplicates
functionality that exists in other systems.
- It slows creative progress in our field. Each project is
forced to spend a large fraction of its time duplicating the work
of others, leaving a small fraction for innovative work.
- It creates a barrier to small-scale, investigator-initiated
research. Almost all work on genome information systems is done in
large projects funded through genome centers or standalone
community resources. There are but a handful of small, ``R01 scale''
grants in this area, making it difficult for genome informaticists
to function as independent scientists. (There _are_ a reasonable
number of small-scale grants on algorithm development and similar
topics.)
Genome informatics cannot flourish as a scientific discipline until
it becomes possible for significant contributions to be made by
individual scientists. This cannot happen until the ``unit of
contribution'' becomes the component, rather than the complete
system. This in turn cannot happen until we learn how to build
complete genome information systems out of independently developed
components.
To achieve this goal, we need to devise one or more architectural
frameworks into which genome information components can
``plug-and-play''. The paragraphs that follow present a strawman
framework to illustrate the issues that arise.
Most modern information systems are built from the following generic
types of components:
- Databases, or more generally, data sources, including flat files,
live data feeds, data from external databases, and database
management systems.
- Application software--typically called analysis software in the
genome field--that performs significant computation.
- Presentation services--such as graphical user interfaces, report
writers, and spreadsheets--that format and organize data in a form
suitable for its users.
- Gateways to external systems, to import and export data, and to
utilize external analysis resources such as the NCBI BLAST server
or the GRAIL server.
An interconnection framework must allow these components to
interact gracefully.
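
As a minimal sketch of what such a framework might look like, the
Python fragment below gives three of the component types a narrow
contract and wires them together with a function that knows only
those contracts. Every name here (DataSource, Analysis, Presenter,
run_pipeline, and the toy implementations) is hypothetical and
invented for illustration; none comes from an existing system.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical contract for databases, flat files, feeds, gateways."""

    @abstractmethod
    def fetch(self, key: str) -> str:
        """Return the record stored under key."""


class Analysis(ABC):
    """Hypothetical contract for analysis software."""

    @abstractmethod
    def run(self, data: str) -> dict:
        """Compute a result from the data."""


class Presenter(ABC):
    """Hypothetical contract for presentation services."""

    @abstractmethod
    def render(self, result: dict) -> str:
        """Format a result for the user."""


class InMemorySource(DataSource):
    """Toy data source backed by a dictionary."""

    def __init__(self, records):
        self._records = dict(records)

    def fetch(self, key):
        return self._records[key]


class BaseCounter(Analysis):
    """Toy analysis: count each nucleotide in a sequence."""

    def run(self, data):
        return {base: data.count(base) for base in "ACGT"}


class TextReport(Presenter):
    """Toy presenter: one line of base=count pairs."""

    def render(self, result):
        return " ".join(f"{base}={count}" for base, count in result.items())


def run_pipeline(source, analysis, presenter, key):
    # The "framework" depends only on the three contracts above, so any
    # conforming component can be substituted without changing this code.
    return presenter.render(analysis.run(source.fetch(key)))


source = InMemorySource({"seq1": "ACGTAC"})
report = run_pipeline(source, BaseCounter(), TextReport(), "seq1")
print(report)
```

The point of the sketch is that run_pipeline never names a concrete
class: replacing the in-memory source with a gateway, or the text
report with a graphical display, would leave it unchanged.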
An information system for a specific application domain (of which
genomics is an example) must support some set of data types. For a
genome information system these types would include:
- sequences, nucleotide and protein, raw and finished,
- maps of various sorts, raw and finished,
- workflow types for laboratory and curatorial projects,
- genes, alleles, and phenotypes, and
- abstracts and other scientific documents.
Most data types place demands on multiple component types. For
example, to do a good job of supporting sequence data, one
generally needs
- a database to store sequences,
- alignment software and other tools to analyze sequences,
- user interfaces to display aligned and unaligned sequences,
- gateways to BLAST, BLOCKS, public sequence databases, and other
sequence analysis services, and
- some means of moving sequences and analysis results among these
components.
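
One simple way to move sequences among such components is a shared
exchange record with a neutral text form. The sketch below assumes a
FASTA-style layout (a ``>'' header line followed by wrapped residue
lines); the SequenceRecord class, its field names, and the choice to
carry the sequence kind in the header are our own illustrative
assumptions, not part of any agreed standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical exchange record shared by all components."""

    identifier: str
    residues: str
    kind: str = "nucleotide"  # assumed values: "nucleotide" or "protein"

    def to_fasta(self, width: int = 60) -> str:
        # Header carries the identifier and kind; residues are wrapped
        # into fixed-width lines.
        chunks = [
            self.residues[i:i + width]
            for i in range(0, len(self.residues), width)
        ]
        return "\n".join([f">{self.identifier} {self.kind}"] + chunks) + "\n"

    @classmethod
    def from_fasta(cls, text: str) -> "SequenceRecord":
        header, *body = text.strip().splitlines()
        if not header.startswith(">"):
            raise ValueError("missing '>' header line")
        identifier, _, kind = header[1:].partition(" ")
        residues = "".join(line.strip() for line in body)
        return cls(identifier, residues, kind or "nucleotide")


# A database component hands out text; an analysis or display
# component rebuilds the same record from it.
stored = SequenceRecord("seq1", "ACGT" * 20)
wire_text = stored.to_fasta()
received = SequenceRecord.from_fasta(wire_text)
print(received == stored)
```

A real framework would need to carry analysis results as well as
raw sequences, which a format this simple does not address.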
Consider the following hypothetical scenario. Suppose it is the
year 1995 (or 2000, or 2010, as you wish), and we face the job of
building an information system to support a genome center that is
sequencing the human genome. The system would have to manage
- laboratory workflow data,
- raw sequences, and
- finished sequences.
The system would also have to import published map data (e.g., from
GDB) and sequences (e.g., from GenBank) and maintain a database
that integrates this information with the newly generated sequence.
The system would also have to be able to disseminate these results
to the community.
Software already exists to support most of these functions, and
surely more and better software will exist by the start of this
hypothetical project. In an ideal world, we could utilize existing
components for many parts of the system, and focus our creative
energy on creating or improving those components that don't exist
or are not up to the task. For example, we might manage workflow
data using our current LabBase system; we might use ACEDB or Genome
Topographer as our integrated database; but we might choose to
develop a new sequence database and tools if it turned out that no
existing system were up to the challenge of the human genome.
In recent years a number of schemes have become available for
sharing structured objects among applications and transporting them
among various systems. Notable examples are Microsoft's Object
Linking and Embedding (OLE) and Apple's OpenDoc, which is based on
IBM's System Object Model and the Common Object Request Broker
Architecture (CORBA). Schemes like these should make it possible
to construct well-integrated information systems from standard
components. Nevertheless, several challenges must be met before
the bioinformatics community can realize the benefits of frameworks
for plug-compatible components. These include:
- Selection of low-level representations and protocols for the
framework.
- Definition of molecular biology object types within the low-level
framework.
- Ensuring that the set of molecular biology object types is
extensible, with provisions for appropriate fall-back options for
situations where client software cannot display a particular type.
- Ensuring object interoperability among Unix, Windows, and
Macintosh computers.
- Allowing object transfer by means of e-mail, ftp, Gopher, WAIS,
and WWW.
- Defining protocols so that tools can let the user invoke behavior
associated with molecular biology objects (such as using the GenBank
accession number associated with a GenBank-derived sequence to find
the sequence's full GenBank entry, or allowing an STS to carry
pointers to maps that contain it).
- Developing a critical mass of plug-compatible components.
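
Two of the challenges above, extensible types with fall-back display
and behavior attached to objects, can be illustrated with a small
sketch. It assumes that types form a simple parent chain and that a
client falls back to the nearest ancestor it knows how to display.
The TypeRegistry class, the type names, the ``full_entry'' action,
and the accession value are all hypothetical.

```python
class TypeRegistry:
    """Hypothetical registry of object types, viewers, and behaviors."""

    def __init__(self):
        self._parent = {}  # type name -> parent type name, or None
        self._viewer = {}  # type name -> function(obj) -> str
        self._action = {}  # (type name, action name) -> function(obj)

    def define(self, name, parent=None):
        self._parent[name] = parent

    def add_viewer(self, name, viewer):
        self._viewer[name] = viewer

    def add_action(self, name, action_name, action):
        self._action[(name, action_name)] = action

    def _lineage(self, name):
        # Walk from the most specific type up to its most general ancestor.
        while name is not None:
            yield name
            name = self._parent.get(name)

    def view(self, name, obj):
        for candidate in self._lineage(name):
            if candidate in self._viewer:
                return self._viewer[candidate](obj)
        # Last-resort fall-back: the client can at least name the type.
        return f"<{name}: no viewer available>"

    def invoke(self, name, action_name, obj):
        for candidate in self._lineage(name):
            action = self._action.get((candidate, action_name))
            if action is not None:
                return action(obj)
        raise LookupError(f"{name} has no action {action_name!r}")


registry = TypeRegistry()
registry.define("sequence")
registry.define("genbank_sequence", parent="sequence")
registry.define("sts")  # defined, but this client has no viewer for it

registry.add_viewer("sequence", lambda obj: obj["residues"])
# The action only builds a request that a gateway component could
# resolve; no network access happens here.
registry.add_action(
    "genbank_sequence",
    "full_entry",
    lambda obj: ("GenBank", obj["accession"]),
)

entry = {"accession": "ACC0001", "residues": "ACGT"}  # made-up accession
shown = registry.view("genbank_sequence", entry)
unknown = registry.view("sts", {"name": "example"})
request = registry.invoke("genbank_sequence", "full_entry", entry)
```

Here a client that knows only the generic ``sequence'' viewer can
still display a GenBank-derived sequence, and a type with no viewer
at all degrades to a labeled placeholder rather than an error.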
No one group alone can produce the framework and components we
envision. The primary job of an individual group is to produce
experimental results and provide information to molecular biology
researchers. Therefore, once the group has built an information
system that satisfies its primary needs, the group derives little
benefit ``on the margin'' from expending resources on developing a
general framework and plug-compatible components. Even more
importantly, if any framework is to be widely accepted by the
bioinformatics community, it must be the result of a community-wide
effort founded on substantial consensus.
We know that others share this vision of genome information systems
constructed from ``plug-and-play'' components. The next step is
for us to join together, combine our ideas, and create a concrete
specification that our community can use to implement this vision.