TRANSFAC and connected databases.

Maintenance and display of a web of crossreferences

H.Karas, R.Knueppel, E.Wingender

Summary

TRANSFAC is a database that compiles experimental data from the literature. Much additional information is available in the major sequence databases and can be reached through a web of database cross-references. This abstract presents a possible solution to the problem of maintaining these links, as well as a stand-alone solution for displaying them.

Introduction

The TRANSFAC database is a database of TRANScription regulatory FACtors and is maintained at the GBF Braunschweig. It combines data about transcription factors and their DNA binding sites with additional important information (e.g. the sources of the factors, a systematic classification of transcription factors). All experimental data have been extracted from the literature. These data are accessible through two main tables, the FACTORS and the SITES table. While the first table holds data about the binding proteins, the second holds data about the DNA sequences that are recognized by these proteins. Besides these experimental data, TRANSFAC also comprises information derived from them. As many transcription factors can be classified by their DNA binding domains and/or their dimerization domains, we introduced the CLASS table to TRANSFAC. We also prepared a GENES table, which contains data about the corresponding genes and their promoters/enhancers (Knueppel et al.) and which will be part of the ASCII flat file version in future.
To provide database management systems for handling the TRANSFAC content, we have developed the TRANSFAC retrieval programs, TRP (network and relational data models) [1]. However, as each database should also provide its content in an easily readable format, TRANSFAC is also distributed as ASCII flat files.
The ASCII version of the database consists of five different files; a detailed description is available.

Tiny TRP

Tiny TRP is a browsing tool for the TRANSFAC database. It is a stand-alone solution that requires the linked databases in their original format. To allow the user to follow the various links in TRANSFAC, we established an index system whose format is similar to that used by the EMBL Data Library. With these indices, the user can also run simple queries on TRANSFAC.
The links between the different databases are established at runtime. The flat files contain all the necessary information and are analysed by a text parser. This parser chooses the appropriate database using the DR line of the flat file, in which it finds the name of the database, the accession number and the entry name.
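The parsing step described above can be illustrated with a minimal sketch. The exact DR line layout is not given in this abstract, so the code assumes a form such as "DR  <database>; <accession>; <entry name>." purely for illustration. The names CrossReference, parse_dr_line and collect_links are hypothetical and are not part of Tiny TRP.

```python
from typing import List, NamedTuple, Optional


class CrossReference(NamedTuple):
    database: str    # name of the linked database
    accession: str   # accession number of the linked entry
    entry_name: str  # entry name of the linked entry


def parse_dr_line(line: str) -> Optional[CrossReference]:
    """Parse one DR line; return None for any other or malformed line.

    Assumed layout (not confirmed by the text):
    "DR  <database>; <accession>; <entry name>."
    """
    if not line.startswith("DR "):
        return None
    # Drop the line code, surrounding blanks and the trailing full stop.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 3 or not all(fields[:3]):
        return None
    return CrossReference(fields[0], fields[1], fields[2])


def collect_links(entry_text: str) -> List[CrossReference]:
    """Return all cross-references found in the text of one flat-file entry."""
    links = []
    for line in entry_text.splitlines():
        ref = parse_dr_line(line)
        if ref is not None:
            links.append(ref)
    return links
```

Called on the text of one entry, collect_links returns one record per DR line; a browser could then use the database field to decide which linked flat file to open and the accession number to look up the entry there.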
The user can design an individual screen output by disabling selected fields. This is a very useful option because many entries are several pages long and thus difficult to survey. The output can thus be reduced to the important information and is easy to read directly on the screen.

The links between TRANSFAC and other databases

Pic 1: Interconnections between TRANSFAC and other databases
       Green: Already connected
       Blue:  Under construction
       Red:   Planned to connect

These links between TRANSFAC and the other databases are very important for the use of TRANSFAC as a basis for theoretical studies as well as for retrieving information for the experimental work.
The main problem in this context is to keep the links between the databases consistent. Each time an entry in a database is deleted, changed or split, there must be an update in all databases that have links to this entry. This is a laborious task which cannot always be automated but needs interactive double-checking. The effort grows rapidly with the number of connected databases.
It is very important to think about methods to keep the integrity of databases intact. One possible solution to this problem is a connection index that keeps track of the relations between databases.

Types of links

To construct this index, we have to distinguish different types of links. The following description takes the point of view of TRANSFAC, but the problem will affect other linked databases, too. Whether these three types of links are sufficient remains to be discussed.

Global links

A global link connects one or more entries of the source database with a complete entry of the target database. This is, for example, the link between the FACTORS table of TRANSFAC and an entry in SwissProt or PIR.

Position sensitive links

A position sensitive link is more complicated to maintain. Here, one or more entries of the source database are linked to a specific position in one or more entries of the target database. These are, for example, the links between the SITES table of TRANSFAC and the EMBL database, where each TRANSFAC site is at a specific position within the EMBL sequence entry.

Feature links

Feature links lie in between: here, one or more entries of the source database are linked to a specific feature of an entry of the target database. This is, for example, the link between the FACTORS table of TRANSFAC and the EMBL database, where the link points to the DNA sequence of the gene of a transcription factor.

The structure of the link index

One entry in this index should describe exactly one link between two databases. Therefore this link entry should consist of the following information:

-A unique identification of the entry of the source database. This can be the name of the database and the accession number of the entry (or any other unique identifier).
-A unique identification of the entry of the target database. This can also be the name of the database and the accession number of the entry (or any other unique identifier that identifies one entry; e.g. for the EMBL database this would be the accession number and the corresponding entry name).
-The third block of information is the classification of the link type and the additional positions/features for those links that are not global. For a feature link, this would be the corresponding term in the feature table (e.g. EMBL), and for a position sensitive link, it would be the start position of the element as well as its length.

Link entries should be as short as possible for fast access and easy maintenance. A delimited ASCII file would fit our preferences very well: it is accessible with every text editor and can easily be imported into any database management system. The working index may look very different, since it depends on the database management system that is used. Each change to an entry that is related to another database via this index triggers a suitable message to the maintainers of the connected database.
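As an illustration, the three blocks of information listed above could be held in a record like the one sketched below and written as one delimited ASCII line per link. This is only a sketch under stated assumptions: the field order, the "|" delimiter, the one-letter type codes and all names (LinkType, LinkEntry, to_line, from_line) are hypothetical choices, not a format defined by TRANSFAC, and identifiers are assumed not to contain the delimiter.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LinkType(Enum):
    GLOBAL = "G"    # whole entry of the target database
    FEATURE = "F"   # a feature-table term of the target entry
    POSITION = "P"  # start position and length within the target entry


@dataclass(frozen=True)
class LinkEntry:
    source_db: str
    source_id: str
    target_db: str
    target_id: str
    link_type: LinkType
    feature: Optional[str] = None  # feature links only
    start: Optional[int] = None    # position sensitive links only
    length: Optional[int] = None   # position sensitive links only

    def __post_init__(self) -> None:
        # Reject entries that lack the extra data their link type requires.
        if self.link_type is LinkType.FEATURE and not self.feature:
            raise ValueError("a feature link needs a feature term")
        if self.link_type is LinkType.POSITION and (
            self.start is None or self.length is None
        ):
            raise ValueError("a position sensitive link needs start and length")

    def to_line(self, sep: str = "|") -> str:
        """Serialize the entry as one delimited ASCII line."""
        fields = [
            self.source_db, self.source_id, self.target_db, self.target_id,
            self.link_type.value,
            self.feature or "",
            "" if self.start is None else str(self.start),
            "" if self.length is None else str(self.length),
        ]
        return sep.join(fields)

    @classmethod
    def from_line(cls, line: str, sep: str = "|") -> "LinkEntry":
        """Rebuild an entry from a line written by to_line."""
        (src_db, src_id, tgt_db, tgt_id,
         kind, feature, start, length) = line.rstrip("\n").split(sep)
        return cls(
            src_db, src_id, tgt_db, tgt_id, LinkType(kind),
            feature or None,
            int(start) if start else None,
            int(length) if length else None,
        )
```

A file of such lines can be opened in any text editor and imported into a database management system as a table with eight columns; the accession numbers used when trying it out would be those of the real linked entries.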

The communication

What such a trigger causes depends greatly on the kind of link.

Changes in globally linked entries

These links are easy to handle since they are only linked without 'knowing' the content. So the target database needs to be informed only if there are major changes concerning the whole entry (e.g. a change of the biological classification, a change of the entry name in the case of EMBL, or the merging/deleting of entries).

Changes to feature links

For these links, any changes in the designation of the specific feature should also be communicated to the target database, in addition to all of the above changes.

Changes to position sensitive links

In addition to the above changes, the source database has to be informed about changes and corrections affecting the sequence itself. If this update of the sequence affects the position of the specific feature, this should be included in the notification, as should changes in the sequence element itself.
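The rules for the three link types can be summarized in a small decision table. The sketch below is an assumption-laden illustration, not an existing message system: the ChangeKind categories, the function names and the 1-based coordinate convention are hypothetical. The second function shows how a new start position might be computed when bases are inserted or deleted in the linked sequence; it returns None when the element itself is touched, since such cases need interactive checking.

```python
from enum import Enum
from typing import Optional


class LinkType(Enum):
    GLOBAL = "G"
    FEATURE = "F"
    POSITION = "P"


class ChangeKind(Enum):
    ENTRY_LEVEL = "entry"            # rename, reclassification, merge, deletion
    FEATURE_DESIGNATION = "feature"  # designation of a feature changed
    SEQUENCE = "sequence"            # sequence corrected or updated


# Which kinds of change have to be communicated for which type of link.
RELEVANT_CHANGES = {
    LinkType.GLOBAL: {ChangeKind.ENTRY_LEVEL},
    LinkType.FEATURE: {ChangeKind.ENTRY_LEVEL, ChangeKind.FEATURE_DESIGNATION},
    LinkType.POSITION: {
        ChangeKind.ENTRY_LEVEL,
        ChangeKind.FEATURE_DESIGNATION,
        ChangeKind.SEQUENCE,
    },
}


def needs_notification(link_type: LinkType, change: ChangeKind) -> bool:
    """Return True if this change must be reported for a link of this type."""
    return change in RELEVANT_CHANGES[link_type]


def shifted_start(start: int, length: int, edit_pos: int, delta: int) -> Optional[int]:
    """Return the new 1-based start of a linked element after a sequence edit.

    delta > 0: delta bases are inserted before position edit_pos.
    delta < 0: -delta bases are deleted, beginning at position edit_pos.
    Returns None if the edit touches the element itself.
    """
    end = start + length - 1
    if delta >= 0:
        if edit_pos <= start:
            return start + delta  # insertion upstream: element moves right
        if edit_pos > end:
            return start          # insertion downstream: position unchanged
        return None               # insertion inside the element
    last_removed = edit_pos - delta - 1
    if last_removed < start:
        return start + delta      # deletion upstream: element moves left
    if edit_pos > end:
        return start              # deletion downstream: position unchanged
    return None                   # deletion overlaps the element
```

With this table, a sequence correction triggers a message only for position sensitive links, while the renaming or deletion of an entry triggers one for every link type.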

Obviously, a link index like the one proposed above could significantly improve the cooperation between databases. An automatically scheduled message system should be capable of handling the major tasks in maintaining the links between the connected databases. Since the link indices between pairs of databases connect the entries in their current states, they will enable the user to navigate between the most recent database issues available.

[1] Knüppel, R., Dietze, P., Lehnberg, W., Frech, K. and Wingender, E. (1994). TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J. Comput. Biol. 1, 191-198.