Many investigators, including ourselves, have been advocating the use of componentry as a technique for constructing genome informatics systems. Components are independently developed programs (such as databases, user interfaces, analysis programs, and programs associated with laboratory instruments) that are designed to be used as modular, "plug-and-play" building blocks. Component-based systems are informatics systems constructed in a modular fashion from components.
There is a growing body of experience demonstrating the utility of the componentry technique. Our experience at the Whitehead Institute/MIT Center for Genome Research (the Genome Center) is typical, we believe. We have found that the modularity of component-based systems leads to several important benefits: (i) they are easier to develop; (ii) they evolve more gracefully as user requirements change; (iii) they accommodate new computing technologies and new sub-problem solutions more smoothly; and (iv) they are more amenable to software sharing. All in all, we have found componentry to be a very effective technique for developing informatics systems for the Genome Center.
In addition to the pragmatic benefits noted above, componentry offers compelling social benefits. Almost all research and development on genome informatics systems (as opposed to algorithms and theory) is done in mega-projects such as genome centers and community database efforts. This makes it difficult for systems-oriented genome informaticists to function as independent scientists. The study of genome informatics systems cannot flourish as a scientific discipline until individual scientists working on normal-sized projects can make significant contributions. This cannot happen until the "unit of contribution" becomes the component, rather than the complete system.
A closely related social benefit is the potential to encourage community-based software development. Software packages such as Mosaic, perl, Tcl/Tk, TeX, LaTeX, GNU emacs, ACeDB, the Staden package of sequence tools, and many others gain tremendous value from incremental improvements made by software developers throughout the community. For such community efforts to gel, the community must adopt a culture in which software sharing is the norm, and system developers expect, as a matter of routine, to incorporate other people's software in the systems they are building.
For componentry to be most effective, components and component-based systems must adhere to standards in four areas:
It seems unlikely that the genome informatics community will ever agree on a single universal standard. For example, it is hard to imagine that the ACeDB community would adopt ASN.1 in lieu of .ace format, or that NCBI would forsake ASN.1 and the NCBI tool set in favor of ACeDB. A more likely outcome is that the major players (such as ACeDB, NCBI, OPM, Genome Topographer) will augment their standards to handle requirements they currently do not meet, leaving the community with multiple, almost equivalent, standards. The challenge will then be to devise translators and other means to allow components based on different standards to interoperate. Though not ideal, this would be a manageable situation -- indeed it is quite common in the computing field -- and would be a major advance over the current state of affairs.
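As a minimal sketch of what such a translator might look like, consider two invented text formats that both describe a map marker. Neither format is .ace or ASN.1; the formats, the field names, the `Marker` class, and the sample values are all assumptions made purely for illustration. The idea shown is only that each standard gets a parser and a serializer around one neutral in-memory model, so a component written for one format can exchange data with a component written for the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """Neutral in-memory model shared by the translators (hypothetical)."""
    name: str
    chromosome: str
    position_kb: float


def parse_tagged(text: str) -> Marker:
    """Parse an invented 'Tag value' format with one field per line."""
    fields = {}
    for line in text.strip().splitlines():
        tag, _, value = line.strip().partition(" ")
        fields[tag] = value.strip()
    return Marker(fields["Marker"], fields["Chromosome"],
                  float(fields["Position_kb"]))


def to_tagged(marker: Marker) -> str:
    """Serialize to the invented 'Tag value' format."""
    return (f"Marker {marker.name}\n"
            f"Chromosome {marker.chromosome}\n"
            f"Position_kb {marker.position_kb}\n")


def parse_keyed(text: str) -> Marker:
    """Parse an invented single-line 'key=value;' format."""
    pairs = dict(item.split("=", 1)
                 for item in text.strip().split(";") if item)
    return Marker(pairs["name"], pairs["chr"], float(pairs["pos_kb"]))


def to_keyed(marker: Marker) -> str:
    """Serialize to the invented 'key=value;' format."""
    return (f"name={marker.name};chr={marker.chromosome};"
            f"pos_kb={marker.position_kb}")


def tagged_to_keyed(text: str) -> str:
    """Translate between the two formats via the neutral model."""
    return to_keyed(parse_tagged(text))


def keyed_to_tagged(text: str) -> str:
    """Translate in the opposite direction via the neutral model."""
    return to_tagged(parse_keyed(text))
```

Routing every translation through a shared model means that supporting an additional standard requires one new parser and one new serializer, rather than a separate translator for every pair of standards. Real standards differ in what they can express, so a practical translator would also need a policy for fields that one side cannot represent.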
We have begun a project whose broad aim is to help the community reach consensus on standards in the above areas by gathering information on requirements and potential standards, developing demonstration software that exercises the requirements and potential standards, and presenting the results in a series of reports and workshops. (We have submitted a grant application to NIH for this purpose, which is under review.) We emphasize that we have no desire to impose standards on the community (which would certainly be futile in any case), but rather to help the community agree on standards to be adopted on a voluntary basis.
The specific aims of the project are as follows:
Aim 3, Demonstration software, is the heart of the project. We plan to develop demonstration software on the four topics below:
(i) contig map assembly;
(ii) laboratory informatics for a large-scale sequencing strategy;
(iii) analysis of gene fragments obtained by partial cDNA sequencing or exon trapping;
(iv) dissemination of integrated map and sequence data.
These topics cover, in broad strokes, the most common types of genome informatics problems, namely data analysis, high-throughput laboratory informatics, and community databases. They also cover many common types of genome data including:
For each common data type, and as many idiosyncratic ones as possible, and for each candidate standard, we will develop the following software:
We will combine some of the above software to create working demonstration systems. Development of working systems is much more time-consuming than the development of discrete software elements for data types. It is infeasible, given the scope of this project, to construct demonstration systems for all standards across all four topics. Nor would this be particularly informative. Instead, we will construct one or two systems that combine standards from multiple groups.
Certain candidate standards, such as ACeDB and Genome Topographer, are integrated, comprehensive systems providing database management, data analysis, and graphical user interfaces in a single, tightly coupled package. For these solutions to be useful in component-based systems, it is necessary to decompose them into component-able pieces. For ACeDB, Thierry-Mieg has already demonstrated a stand-alone version of the database server; it remains to be shown that the graphics library can be used in stand-alone programs. For Genome Topographer, we need to show that both the database and user interface elements can be used independently.
It is essential that our process be open and non-judgmental, and that we avoid the temptation to pick winners and losers. To this end, we will adopt the following policy:
We will not evaluate a candidate standard, nor implement demonstration software based on that standard, unless its sponsors agree to this activity. We will not disseminate our critical evaluation of a candidate standard without giving its sponsors a reasonable opportunity to review the material.
The above project will get us to a point where (i) the community recognizes that standards are a good thing, or at least a necessary evil; and (ii) there is convergence of the major standards.
This will set the stage for interoperability. Without standards, interoperability quickly devolves into a jumbled tower of Babel; with standards, it becomes possible to address the technical problems of data translation among standard data representations, database and query translation among standard database approaches, and integration of user interfaces constructed with standard toolkits.
It will also become sensible to construct libraries of sharable, standard components. Many investigators are developing libraries that could serve as starting points for this (e.g., Searls's BioTk graphical user interface widgets). The demonstration software developed in Aim 3 could also play a role.
Once it becomes accepted practice to construct libraries of sharable components, investigators will be faced with the challenge of deciding what components to construct. Today, most investigators develop software to meet the immediate needs of their laboratories. Once sharing becomes the norm, investigators will have to balance local needs with community needs. This suggests that we must begin asking questions regarding the kinds of software that should be developed to meet the needs of the overall genome community.
Genomics and all of molecular biology will change dramatically over the next five years: several whole genomes (perhaps even human!) will be sequenced; more potent and easier-to-use genetic markers will be reduced to practice; systematic techniques for determining protein-protein interactions will be perfected; systematic gene expression mapping will be possible. The genome community will need a great deal of new software to exploit these advances.
The genome informatics community has a responsibility to analyze the implications of these advances with respect to informatics and to begin developing the necessary technology. The acceptance of standards, the adoption of componentry, and the creation of sharable components are essential steps without which we will have neither enough time nor enough trained and talented people to create all the necessary software. In parallel, we must begin the task of domain-requirements analysis to anticipate the software that will be needed in the near future, and to make plans for creating this software.