The Importance of Standards and Componentry in Meeting the Genome Informatics Challenges of the Next Five Years

Nathan Goodman, Steve Rozen, Lincoln Stein

{nat,steve,lstein}@genome.wi.mit.edu

Whitehead Institute / MIT Center for Genome Research
One Kendall Square, Building 300, Floor 5
Cambridge MA 02139, USA

Componentry

Many investigators, including ourselves, have been advocating the use of componentry as a technique for constructing genome informatics systems. Components are independently developed programs (such as databases, user interfaces, analysis programs, and programs associated with laboratory instruments) that are designed to be used as modular, "plug-and-play" building blocks. Component-based systems are informatics systems constructed in a modular fashion from components.
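As a rough illustration of the "plug-and-play" idea, the sketch below chains three small components that all accept and return the same kind of data. Everything in it is hypothetical and chosen only for illustration: the record shape, the function names, and the sample markers do not come from any existing genome informatics package.

```python
from typing import Callable, Dict, List, Optional

# Assumption of this sketch: every component consumes and produces a list of
# plain dictionaries. This shared shape is what makes the pieces interchangeable.
Record = Dict[str, str]
Component = Callable[[List[Record]], List[Record]]


def load_markers(_: List[Record]) -> List[Record]:
    """Hypothetical 'database' component: returns a fixed set of records."""
    return [{"name": "M1", "seq": "acgt"}, {"name": "M2", "seq": "ggcc"}]


def uppercase_sequences(records: List[Record]) -> List[Record]:
    """Hypothetical clean-up component: normalizes sequence case."""
    return [{**r, "seq": r["seq"].upper()} for r in records]


def gc_content(records: List[Record]) -> List[Record]:
    """Hypothetical analysis component: adds the G+C fraction to each record."""
    out = []
    for r in records:
        seq = r["seq"].upper()
        fraction = (seq.count("G") + seq.count("C")) / len(seq)
        out.append({**r, "gc": f"{fraction:.2f}"})
    return out


def run_pipeline(components: List[Component],
                 records: Optional[List[Record]] = None) -> List[Record]:
    """Feed the output of each component into the next one."""
    data = list(records or [])
    for component in components:
        data = component(data)
    return data


result = run_pipeline([load_markers, uppercase_sequences, gc_content])
```

Because each piece depends only on the shared record shape, any one of them could be replaced by an independently developed alternative without touching the others.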

There is a growing body of experience demonstrating the utility of the componentry technique. Our experience at the Whitehead Institute/MIT Center for Genome Research (the Genome Center) is typical, we believe. We have found that the modularity of component-based systems leads to several important benefits: (i) they are easier to develop; (ii) they evolve more gracefully as user requirements change; (iii) they accommodate new computing technologies and new sub-problem solutions more smoothly; and (iv) they are more amenable to software sharing. All in all, we have found componentry to be a very effective technique for developing informatics systems for the Genome Center.

In addition to the pragmatic benefits noted above, componentry offers compelling social benefits. Almost all research and development on genome informatics systems (as opposed to algorithms and theory) is done in mega-projects such as genome centers and community database efforts. This makes it difficult for systems-oriented genome informaticists to function as independent scientists. The study of genome informatics systems cannot flourish as a scientific discipline until individual scientists working on normal-sized projects can make significant contributions. This cannot happen until the "unit of contribution" becomes the component, rather than the complete system.

A closely related social benefit is the potential to encourage community-based software development. Software packages such as Mosaic, perl, Tcl/Tk, TeX, LaTeX, GNU emacs, ACeDB, the Staden package of sequence tools, and many others gain tremendous value from incremental improvements made by software developers throughout the community. For such community efforts to gel, the community must adopt a culture in which software sharing is the norm, and system developers expect, as a matter of routine, to incorporate other people's software in the systems they are building.

Componentry Implies Standards

For componentry to be most effective, components and component-based systems must adhere to standards in four areas:

  1. Data representation -- so that components can exchange data in a uniform manner. Absent this, system developers would need to write specialized translation programs for every component in the system.
  2. Inter-program communication -- so that components can invoke each other in a uniform manner. Absent this, system developers might need to employ different communication methods for different components, which would be cumbersome and error-prone.
  3. Database management -- so that components can store and access data using the same database management system (DBMS). Absent this, a system with multiple components might need to operate multiple DBMSs, which would likely be an operational headache.
  4. Graphical user interfaces -- so that user interface components share the same look-and-feel. Absent this, users would have to cope with changing display and mouse conventions when moving from one component to another. Most users find this extremely frustrating.

Standards are always a touchy subject, because adherence to standards inevitably entails some loss of freedom and flexibility. Moreover, standards are rarely ideal for any one purpose, because requirements across a large community are often contradictory. The situation is exacerbated in the genome field, because competing standards are already in place, and the proponents of these standards have invested considerable energy and prestige in designing them, implementing software based on them, and promoting them in the community.

It seems unlikely that the genome informatics community will ever agree on a single universal standard. For example, it is hard to imagine that the ACeDB community would adopt ASN.1 in lieu of .ace format, or that NCBI would forsake ASN.1 and the NCBI tool set in favor of ACeDB. A more likely outcome is that the major players (such as ACeDB, NCBI, OPM, Genome Topographer) will augment their standards to handle requirements they currently do not meet, leaving the community with multiple, almost equivalent, standards. The challenge will then be to devise translators and other means to allow components based on different standards to interoperate. Though not ideal, this would be a manageable situation -- indeed it is quite common in the computing field -- and would be a major advance over the current state of affairs.
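One common way to build such translators is to route every format through a shared in-memory model, so that N formats need roughly 2N converters (one parser and one formatter each) rather than N(N-1) pairwise translators. The sketch below illustrates this under stated assumptions: the "tagged" and "tabbed" formats, the Marker fields, and all function names are invented for illustration and are not .ace, ASN.1, or any other real format.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Marker:
    """Hypothetical shared model that every format is translated through."""
    name: str
    chromosome: str
    position_kb: int


def parse_tagged(text: str) -> Marker:
    """Parse an invented 'key=value;key=value' representation."""
    fields = dict(item.split("=", 1) for item in text.strip().split(";"))
    return Marker(fields["name"], fields["chromosome"], int(fields["position_kb"]))


def format_tagged(m: Marker) -> str:
    return f"name={m.name};chromosome={m.chromosome};position_kb={m.position_kb}"


def parse_tabbed(text: str) -> Marker:
    """Parse an invented tab-separated representation."""
    name, chromosome, position_kb = text.strip().split("\t")
    return Marker(name, chromosome, int(position_kb))


def format_tabbed(m: Marker) -> str:
    return f"{m.name}\t{m.chromosome}\t{m.position_kb}"


# Adding a new format means registering one parser and one formatter;
# no existing converter has to change.
PARSERS: Dict[str, Callable[[str], Marker]] = {
    "tagged": parse_tagged,
    "tabbed": parse_tabbed,
}
FORMATTERS: Dict[str, Callable[[Marker], str]] = {
    "tagged": format_tagged,
    "tabbed": format_tabbed,
}


def translate(text: str, source: str, target: str) -> str:
    """Translate between any two registered formats via the shared model."""
    return FORMATTERS[target](PARSERS[source](text))


tabbed = translate("name=D1S1;chromosome=1;position_kb=1500", "tagged", "tabbed")
```

The approach only works to the extent that the shared model can express what each format carries. Where the standards are "almost equivalent" rather than equivalent, a translator of this kind would have to decide what to do with information it cannot represent.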

Reaching Consensus on Standards

We have begun a project whose broad aim is to help the community reach consensus on standards in the above areas by gathering information on requirements and potential standards, developing demonstration software that exercises the requirements and potential standards, and presenting the results in a series of reports and workshops. (We have submitted a grant application to NIH for this purpose, which is under review.) We emphasize that we have no desire to impose standards on the community (which would certainly be futile in any case), but rather to help the community agree on standards to be adopted on a voluntary basis.

The specific aims of the project are as follows:

  1. Technology survey. We will survey technical solutions in the above-mentioned areas of data representation, inter-program communication, database management, and graphical user interfaces. With the advice of the community, we will select a subset of these solutions as candidate standards.
  2. Requirements survey. We will survey the community to discover the community's perceived requirements and priorities in the above areas. We expect to find general agreement on most requirements, intermixed with some number of disagreements.
  3. Demonstration software. We will implement a small number of demonstration programs and systems that conform to the candidate standards, in most cases by adapting existing software. Our intent is to provide a communal base of experience to illustrate requirements, and to compare candidate standards.
  4. Requirements analysis and technology evaluation. We will write a report that describes the community's requirements in the above-mentioned areas, and that evaluates candidate standards with respect to these requirements. This will be a living document with versions released periodically.
  5. Community involvement. We will interact with the community by organizing workshops in conjunction with existing scientific meetings, and by maintaining a World Wide Web site and an email distribution list.

Aim 3, Demonstration software, is the heart of the project. We plan to develop demonstration software on the four topics below:

(i) contig map assembly;
(ii) laboratory informatics for a large-scale sequencing strategy;
(iii) analysis of gene fragments obtained by partial cDNA sequencing or exon trapping;
(iv) dissemination of integrated map and sequence data.

These topics cover, in broad strokes, the most common types of genome informatics problems, namely data analysis, high throughput laboratory informatics, and community databases. They also cover many common types of genome data including:

Each topic may also have idiosyncratic data types. These include YAC addresses in contig map assembly, sequence traces in sequence assembly, and motifs in gene analysis.

For each common data type, and as many idiosyncratic ones as possible, and for each candidate standard, we will develop the following software:

Most candidate standards already include software for these tasks. Our task is to modify the existing software to make it more uniform so that it will be easier to compare solutions.

We will combine some of the above software to create working demonstration systems. Development of working systems is much more time-consuming than the development of discrete software elements for data types. It is infeasible, given the scope of this project, to construct demonstration systems for all standards across all four topics. Nor would this be particularly informative. Instead, we will construct one or two systems that combine standards from multiple groups.

Certain candidate standards, such as ACeDB and Genome Topographer, are integrated, comprehensive systems providing database management, data analysis, and graphical user interfaces in a single, tightly coupled package. For these solutions to be useful in component-based systems, it is necessary to decompose them into component-able pieces. For ACeDB, Thierry-Mieg has already demonstrated a stand-alone version of the database server; it remains to be shown that the graphics library can be used in stand-alone programs. For Genome Topographer, we need to show that both the database and user interface elements can be used independently.

It is essential that our process be open and non-judgmental, and that we avoid the temptation to pick winners and losers. To this end, we will adopt the following policy:

We will not evaluate a candidate standard, nor implement demonstration software based on that standard, unless its sponsors agree to this activity. We will not disseminate our critical evaluation of a candidate standard without giving its sponsors a reasonable opportunity to review the material.

The Next Steps for Genome Informatics

The above project will get us to a point where (i) the community recognizes that standards are a good thing, or at least a necessary evil; and (ii) there is convergence of the major standards.

This will set the stage for interoperability. Without standards, interoperability quickly devolves into a jumbled Tower of Babel; with standards, it becomes possible to address the technical problems of data translation among standard data representations, database and query translation among standard database approaches, and integration of user interfaces constructed with standard toolkits.

It will also become sensible to construct libraries of sharable, standard components. Many investigators are developing libraries that could serve as starting points for this (e.g., Searls's BioTk graphical user interface widgets). The demonstration software developed in aim 3 could also play a role.

Once it becomes accepted practice to construct libraries of sharable components, investigators will be faced with the challenge of deciding what components to construct. Today, most investigators develop software to meet the immediate needs of their laboratories. Once sharing becomes the norm, investigators will have to balance local needs with community needs. This suggests that we must begin asking questions regarding the kinds of software that should be developed to meet the needs of the overall genome community.

Genomics and all of molecular biology will change dramatically over the next five years: several whole genomes (perhaps even human!) will be sequenced; more potent and easier-to-use genetic markers will be reduced to practice; systematic techniques for determining protein-protein interactions will be perfected; systematic gene expression mapping will be possible. The genome community will need a great deal of new software to exploit these advances.

The genome informatics community has a responsibility to analyze the implications of these advances with respect to informatics and to begin developing the necessary technology. The acceptance of standards, the adoption of componentry, and the creation of sharable components are essential steps without which we will have neither enough time nor enough trained and talented people to create all the necessary software. In parallel, we must begin the task of domain-requirements analysis to anticipate the software that will be needed in the near future, and to make plans for creating this software.

Last updated: 12 July 1995
Steve Rozen steve@genome.wi.mit.edu