Unifying Database Queries: The ENQUire System

Curt Jamison, Brad Mills, and Bruce Schatz

Community Systems Laboratory
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Urbana, IL 61801 USA

Biological researchers rely heavily upon databases such as GenBank, PIR, and EMBL for sequence data, Medline for bibliographic data, and GDB, MGD, FlyBase, and a myriad of species-specific databases based upon ACeDB for genetic and genomic information. While most of these sources are easily accessible from ordinary computers, their data structures are fundamentally different, forcing the researcher to learn several different query languages and making it difficult to integrate results from different databases. There is a need for a system that federates these disparate sources of information so that a researcher can search several databases with a single query, view the results in an integrated report, and submit new data to each of them.

When designing a database federation system, it is clearly unreasonable (and in some cases impossible or counter-productive) to design a new data structure standard and expect the world to conform. Instead, existing databases and communication protocols must be used. As a preliminary step in this direction, we have written ENQUire (Extensible Network Query Unifier): a Mosaic forms-based query interface communicates across the Internet with the ENQUire gateway, which translates each query into the various native query languages, submits the queries to the appropriate databases, then collects and collates the responses, formats them as HTML, and returns them to the user. The prototype ENQUire system can be found here.
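The round trip described above can be sketched as follows. This is a hypothetical illustration only: the actual gateway is a Tcl CGI program, and the translator functions, query syntaxes, and database names here are invented stand-ins, with the network submission stubbed out.

```python
# Illustrative sketch of the ENQUire round trip: translate one query
# into per-database query languages, "submit" each one, collate the
# replies, and render an HTML report. All names are hypothetical.

def to_acedb(q):
    # Translate into an ACeDB-style query (illustrative syntax).
    return f"find Gene {q['term']}"

def to_genbank(q):
    # Translate into a GenBank-style query (illustrative syntax).
    return f"{q['term']}[Gene Name]"

TRANSLATORS = {"acedb": to_acedb, "genbank": to_genbank}

def submit(db, native_query):
    # Stub standing in for an HTTP request to the remote server.
    return f"results of '{native_query}' from {db}"

def enquire(query, databases):
    # Translate, submit, collate, and format the unified HTML report.
    replies = {db: submit(db, TRANSLATORS[db](query)) for db in databases}
    rows = "".join(f"<li>{db}: {text}</li>" for db, text in replies.items())
    return f"<html><body><ul>{rows}</ul></body></html>"

html = enquire({"term": "unc-22"}, ["acedb", "genbank"])
```

The essential point is that the user writes one query and never sees the per-database query languages; only the gateway does.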

The design of ENQUire is that of an extensible gateway: new capabilities can be added easily, without disrupting any existing operations. ENQUire has four sections: a query parser/translator, a WWW communications module, a results parser/translator, and a library of translation protocols. The first three form the backbone of the gateway, while the fourth is a dynamic library to which any number of new protocols can be added.
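The extensibility described above can be sketched as a registry pattern: the backbone looks translators up by name, so supporting a new database means registering a pair of translation functions rather than editing the backbone. Again this is a hypothetical Python sketch (the real system is written in Tcl), with the communications module stubbed out and all names invented.

```python
# Illustrative sketch of ENQUire's extensible-gateway design: the
# backbone consults a protocol registry, so new database support is
# added by registration, not by modifying the backbone itself.

PROTOCOLS = {}  # name -> (query translator, results translator)

def register(name, to_native, to_html):
    # Add a new protocol to the translation library.
    PROTOCOLS[name] = (to_native, to_html)

def gateway(db, query):
    # Backbone: translate the query, communicate (stubbed here),
    # and translate the raw results into HTML.
    to_native, to_html = PROTOCOLS[db]
    raw = f"<raw reply to {to_native(query)}>"
    return to_html(raw)

# Adding support for a new (illustrative) database takes one call:
register("acedb",
         lambda q: f"find Sequence {q}",
         lambda raw: f"<p>{raw}</p>")

page = gateway("acedb", "unc-22")
```

The design choice is that the backbone never changes: each entry in the library pairs an outbound query translator with an inbound results translator, mirroring the query parser/translator and results parser/translator sections.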

ENQUire is a CGI application written in Tcl (although slightly slower than a compiled C program, Tcl allows for rapid integration of new database translators and has better string manipulation capabilities than standard C). ENQUire was initially conceived and implemented as a replacement for the query capabilities of the Worm Community System (WCS), a digital library system that federates literature and data sources pertaining to C. elegans. The data used for WCS (ACeDB and literature) had to be reformatted before inclusion in the system and had to reside on the WCS server. ENQUire was designed to circumvent this costly and time-consuming effort by retrieving information directly from the data sources across the Internet.

Due to the history and parentage of ENQUire, the first translation programs implemented were for the data sources found in WCS: the ACeDB WWW server at INRA and the C. elegans literature server at CSL/NCSA. In addition, a translation program for the GenBank/Medline server at NCBI/NLM/NIH is available.

In combining information and computational resources, ENQUire is a prototype for a WWW-based analysis environment. Comments from researchers to whom the system has been demonstrated have been generally favorable, and the CSL HTTP server has averaged 10 queries per day since the official announcement of ENQUire at the Tenth International C. elegans Meeting. A more robust production version, written in Perl 5 to take full advantage of the Perl Tace Server (PTS) genomic servers at USDA/NAL, is now in progress. The new version will also include an object store to maintain session identity and to facilitate submission of objects to analysis tools such as BLAST, sequence alignment, and protein folding programs. Eventually, ENQUire will serve as the user interface to a biological analysis environment, allowing researchers to retrieve data from heterogeneous sources and manipulate it at large-scale computational resource centers such as NCSA, through custom viewers on their own PCs.
