This paper summarizes recent work on the XBio project, a set of modules and concepts for an advanced facility that provides access to, and interoperability among, remote databases and tools.
XBio is based on the idea of using a formal grammar to describe the content and syntax of the output reports generated by remote tools and database application programs. A data manipulation language (DML) defined over this structurally described text lets users issue a variety of powerful inter-database retrieval and manipulation queries.
The main advantages of this approach are that it treats both databases and computational tools uniformly, it does not require a global schema to be maintained, and it allows the user to easily integrate structured with unstructured data. Also, since output reports of public databases and tools are supplied as a service to the user community, they represent a more stable aspect of these remote services than their internal structures (e.g. conceptual schemas of databases). Thus, XBio follows a self-describing and self-documenting database approach, with a special emphasis on formal modelling of text.
2. System Components
The implemented components of the system include:
Abstract Syntax Notation One (ASN.1) has been adopted, with modifications, as the data definition language (DDL) for XBio. The modifications were introduced to remedy two deficiencies: a) its inability to represent unnamed terminal symbols, and b) its lack of a graphical representation that can be used to display diagrammatically the structures it describes. Deficiency a) alone is sufficient to deprive ASN.1 of the ability to represent an arbitrary context-free grammar, which requires the representation of terminal symbols.
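The role of unnamed terminal symbols can be illustrated with a minimal sketch. It is not XBio's DDL; the class names, the `entry_line` production, and the `Identifier` non-terminal are hypothetical and serve only to show a grammar production in which some literal tokens carry no name.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Terminal:
    """A literal token in a report; name is None for an unnamed terminal."""
    text: str
    name: Optional[str] = None


@dataclass
class NonTerminal:
    """A named reference to another production."""
    name: str


Symbol = Union[Terminal, NonTerminal]

# Hypothetical production for one line of a flat-file report:
#   "ID" Identifier "\n"
# The literal "ID" and the newline are terminals without names.
entry_line: List[Symbol] = [
    Terminal("ID"),
    NonTerminal("Identifier"),
    Terminal("\n"),
]


def unnamed_terminals(production: List[Symbol]) -> List[str]:
    """Return the literal text of every unnamed terminal in a production."""
    return [
        s.text
        for s in production
        if isinstance(s, Terminal) and s.name is None
    ]
```

A notation that can only describe named components has no place to put the two literals in this production, which is the gap the modifications are meant to close.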
A new DML has been defined for manipulating text that has been parsed using the DDL. The DML is SQL-like, except that it manipulates ASN.1-described text rather than relations. Together, the DDL and the DML specify a data model that is more powerful than the relational model. The model can represent attribute hierarchies, multivalued attributes, optional attributes, ordered and unordered attributes, ordered and unordered tuples, and unnamed attributes. The approach is well suited to modeling relations as well as data in which form is important, such as text and flat-file databases with ad hoc formats. The primary motivation for developing this language is to provide a common model for the database integration approach of XBio, in which some of the databases are assumed to be formatted flat-file text while others are relational.
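A small sketch may help make the data model concrete. It does not use the actual DML syntax, which is not given here; the records, attribute names, and the `select` and `get_path` helpers are hypothetical. It shows an SQL-like projection and selection over records with an attribute hierarchy, a multivalued attribute, and an optional attribute.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical parsed entries: "source" is an attribute hierarchy,
# "keywords" is multivalued, and "comment" is optional (absent in the first).
entries: List[Dict[str, Any]] = [
    {"id": "A1",
     "source": {"organism": "E. coli", "strain": "K-12"},
     "keywords": ["kinase", "membrane"]},
    {"id": "B2",
     "source": {"organism": "H. sapiens"},
     "keywords": ["kinase"],
     "comment": "partial sequence"},
]


def get_path(record: Dict[str, Any], path: str, default: Any = None) -> Any:
    """Follow a dotted attribute path; return default if any step is missing."""
    node: Any = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def select(records: List[Dict[str, Any]],
           paths: List[str],
           where: Callable[[Dict[str, Any]], bool]) -> List[Tuple[Any, ...]]:
    """Project the given attribute paths from records that satisfy where."""
    return [tuple(get_path(r, p) for p in paths) for r in records if where(r)]


# Roughly: SELECT id, source.organism WHERE 'membrane' IN keywords
rows = select(entries, ["id", "source.organism"],
              lambda r: "membrane" in r.get("keywords", []))
```

A flat relational table would need extra tables or null columns to hold the nested, repeated, and missing attributes that these records carry directly.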
The UDS is a comprehensive data structure that captures all the details of parsed text: the DDL that describes the format of the text, the raw text itself, and the way this text has been parsed (the parse tree). In addition, module information is embedded so that the manipulated text remains modular. The UDS is the unit of exchange between the different XBio modules.
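The four kinds of information listed for the UDS could be grouped as in the following sketch. The field names and layout are assumptions made for illustration; the actual UDS layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParseNode:
    """One parse-tree node; start and end index into the raw text."""
    symbol: str
    start: int
    end: int
    children: List["ParseNode"] = field(default_factory=list)


@dataclass
class UDS:
    """Hypothetical layout: grammar, module info, raw text, and parse tree."""
    ddl: Dict[str, str]            # production name -> definition text
    modules: Dict[str, List[str]]  # module name -> productions it contains
    raw_text: str
    root: ParseNode

    def text_of(self, node: ParseNode) -> str:
        """Return the slice of raw text covered by a parse-tree node."""
        return self.raw_text[node.start:node.end]


raw = "ID X123\n"
root = ParseNode("Entry", 0, len(raw), [ParseNode("Identifier", 3, 7)])
uds = UDS(
    ddl={"Entry": '"ID" Identifier "\\n"', "Identifier": "VisibleString"},
    modules={"EntryModule": ["Entry", "Identifier"]},
    raw_text=raw,
    root=root,
)
```

Storing offsets rather than copies of the text keeps the raw text intact, so it can be recovered unchanged alongside the grammar and the parse tree.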
The DML interpreter:
A function library called ShareLib has been implemented to perform the various tasks that can be specified using DML statements. An interpreter has been built to process source DML statements and translate them into the appropriate ShareLib function calls. Most ShareLib functions operate on the UDS.
The TP accepts text and its modular specifications as input and produces a UDS instance as its output. The TP acts as an "import" function for the DML and is therefore included as a special function in ShareLib.
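The translation of statements into library calls can be sketched as a dispatch table. The statement verbs, the function names `lib_import` and `lib_extract_text`, and their string results are all hypothetical stand-ins; the real DML syntax and ShareLib signatures are not given here.

```python
from typing import Callable, Dict, List


# Stand-ins for library functions. The names, arguments, and return values
# are hypothetical; they only record which call would have been made.
def lib_import(args: List[str]) -> str:
    return "import(" + ", ".join(args) + ")"


def lib_extract_text(args: List[str]) -> str:
    return "extract_text(" + ", ".join(args) + ")"


# Map each statement verb to the library function that carries it out.
DISPATCH: Dict[str, Callable[[List[str]], str]] = {
    "IMPORT": lib_import,
    "EXTRACT": lib_extract_text,
}


def interpret(statement: str) -> str:
    """Split a statement into verb and arguments and call the mapped function."""
    tokens = statement.split()
    if not tokens:
        raise ValueError("empty statement")
    verb, args = tokens[0].upper(), tokens[1:]
    if verb not in DISPATCH:
        raise ValueError(f"unknown statement: {verb}")
    return DISPATCH[verb](args)
```

Under this arrangement a graphical front end could reuse the same table by producing the function calls directly instead of statement text.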
The SE takes a UDS instance as its input and extracts the DDL specifications embedded in the UDS. Since the UDS incorporates the modularity information, the generated DDL will also be in modular form. This is important in light of the possible complexity of DDL descriptions and the need to share module descriptions among different database files. The SE is also implemented as a ShareLib module.
The TE extracts the flat text embedded in a UDS instance, and is also implemented as a ShareLib function.
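The relationship between the TP, SE, and TE can be sketched as a round trip. This is a deliberate simplification: regular expressions stand in for the DDL, a plain dictionary stands in for the UDS, and the function names, production names, and sample report are hypothetical.

```python
import re
from typing import Any, Dict


def text_parser(text: str, spec: Dict[str, str]) -> Dict[str, Any]:
    """TP stand-in: label each line with the first production it matches."""
    nodes = []
    pos = 0
    for line in text.splitlines(keepends=True):
        for name, pattern in spec.items():
            if re.fullmatch(pattern, line.rstrip("\n")):
                nodes.append((name, pos, pos + len(line)))
                break
        else:
            raise ValueError(f"no production matches line: {line!r}")
        pos += len(line)
    # The result keeps the specification, the raw text, and the parse result.
    return {"spec": dict(spec), "text": text, "nodes": nodes}


def schema_extractor(uds: Dict[str, Any]) -> Dict[str, str]:
    """SE stand-in: recover the specification embedded in the structure."""
    return dict(uds["spec"])


def text_extractor(uds: Dict[str, Any]) -> str:
    """TE stand-in: recover the flat text embedded in the structure."""
    return uds["text"]


spec = {"IdLine": r"ID\s+\S+", "DescLine": r"DE\s+.+"}
report = "ID X123\nDE hypothetical protein\n"
uds = text_parser(report, spec)
```

Because the parsed structure carries both its specification and its raw text, either can be recovered from it without access to the original inputs.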
A printform is a standard way to represent parsed text in a readable form. XBio supports the standard printform through PG, which generates a printform from an instance of the UDS. PG is another function in ShareLib.
Finally, a comprehensive graphical user interface called GBM allows users who do not wish to learn the DML to perform all of its functions graphically. GBM interprets the user's graphical commands and translates them into the appropriate ShareLib function calls.
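As a rough illustration of rendering parsed text in readable form, the sketch below prints a parse tree with indentation and braces. The output layout is an assumption chosen for readability and is not the standard printform itself; the tree contents are hypothetical.

```python
from typing import List, Tuple, Union

# A node is (name, value), where value is leaf text or a list of child nodes.
Tree = Tuple[str, Union[str, List["Tree"]]]


def printform(node: Tree, indent: int = 0) -> List[str]:
    """Render a parse tree as indented lines, one per leaf or brace."""
    name, value = node
    pad = "  " * indent
    if isinstance(value, str):
        return [f"{pad}{name} {value!r}"]
    lines = [f"{pad}{name} {{"]
    for child in value:
        lines.extend(printform(child, indent + 1))
    lines.append(f"{pad}}}")
    return lines


tree: Tree = ("Entry", [
    ("Identifier", "X123"),
    ("Description", "hypothetical protein"),
])
rendered = printform(tree)
```

Joining `rendered` with newlines gives a nested listing in which each named component appears next to the text it covers.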