TRANSFAC is a database that compiles experimental data from the literature. Much additional information is available in the major sequence databases and can be reached through a web of database cross-references. This abstract presents a possible solution to the problem of maintaining these links, as well as a stand-alone tool for displaying them.
The TRANSFAC database is a database of TRANScription regulatory FACtors and is maintained at the GBF Braunschweig. It combines data about the transcription factors and their DNA binding sites with additional important information (e.g. the sources of the factors and a systematic classification of transcription factors). All experimental data have been extracted from the literature.
These data are accessible through two main tables, the FACTORS and the SITES table. While the first table holds data about the binding proteins, the second holds data about the DNA sequences that are recognized by these proteins. Besides these experimental data, TRANSFAC also comprises information derived from them. As many transcription factors can be classified by their DNA binding domains and/or their dimerization domains, we introduced the CLASS table to TRANSFAC. We also prepared a GENES table, which contains data about the corresponding genes and their promoters/enhancers (Knüppel et al.) and which will be part of the ASCII flat file version in the future.
To provide database management systems for handling the TRANSFAC content, we have developed the TRANSFAC retrieval programs, TRP (network and relational data models). However, since every database should also provide its content in an easily readable format, TRANSFAC is also distributed as ASCII flat files.
The ASCII version of the database consists of five different files; a detailed description is available.
Tiny TRP is a browsing tool for the TRANSFAC database. It is a stand-alone solution that requires the linked databases in their original format. To allow the user to follow the various links in TRANSFAC, we established an index system whose format is similar to that used by the EMBL Data Library. With these indices, the user can also run simple queries on TRANSFAC.
The links between the different databases are established at runtime. The flat files contain all necessary information and are analysed by a text parser. This parser chooses the appropriate database using the DR line of the flat file, in which it finds the name of the database, the accession number and the entry name.
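As an illustration of the parsing step described above, the following minimal sketch extracts the three items from a DR line. The exact line layout ("DR", database name, colon, accession number, semicolon, entry name, full stop) is an assumption made for this example, as are the names `CrossReference`, `parse_dr_line` and `collect_references`; it does not reproduce the parser used in Tiny TRP.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class CrossReference(NamedTuple):
    """One database cross-reference taken from a DR line."""
    database: str
    accession: str
    entry_name: str


# Assumed layout (not taken from the TRANSFAC documentation):
#   DR  <database>: <accession>; <entry name>.
DR_PATTERN = re.compile(
    r"^DR\s+(?P<db>[^:;]+):\s*(?P<acc>[^;]+);\s*(?P<name>[^.;]+)\.?\s*$"
)


def parse_dr_line(line: str) -> Optional[CrossReference]:
    """Return the cross-reference in a DR line, or None for any other line."""
    match = DR_PATTERN.match(line)
    if match is None:
        return None
    return CrossReference(
        match["db"].strip(), match["acc"].strip(), match["name"].strip()
    )


def collect_references(entry_lines: Iterable[str]) -> List[CrossReference]:
    """Collect all cross-references found in the lines of one entry."""
    parsed = (parse_dr_line(line) for line in entry_lines)
    return [ref for ref in parsed if ref is not None]
```

The `database` field of each result could then be used to select the flat file or index in which the linked entry is looked up.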
The user can design an individual screen output by disabling selected fields. This option is very useful because many entries are several pages long and thus difficult to survey. The output can be reduced to the important information and is then easy to read directly on the screen.
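Disabling fields can be sketched as a simple filter over the lines of one entry. The sketch assumes that every flat file line starts with a two-letter field code, as the DR line does; the function name `filter_entry` and the sample entry are hypothetical placeholders.

```python
from typing import Iterable, List, Set


def filter_entry(lines: Iterable[str], disabled: Set[str]) -> List[str]:
    """Return the lines of one entry whose two-letter field code is not disabled.

    Assumption: the first two characters of each line are its field code.
    """
    return [line for line in lines if line[:2] not in disabled]


# Placeholder entry; accession numbers and names are invented for illustration.
entry = [
    "AC  T00001",
    "RN  [1]",
    "RA  Author A.",
    "DR  EMBL: X12345; HSEXAMPLE.",
]

# Hide the reference fields and keep everything else.
for line in filter_entry(entry, {"RN", "RA"}):
    print(line)
```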
Figure 1: Interconnections between TRANSFAC and other databases. Green: already connected. Blue: under construction. Red: planned.
The links between TRANSFAC and the other databases are very important for the use of TRANSFAC as a basis for theoretical studies as well as for retrieving information for experimental work.
The main problem in this context is to keep the links between the databases consistent. Each time an entry in a database is deleted, changed or split, all databases that have links to this entry must be updated. This is a laborious task that cannot always be automated but needs interactive double-checking. The effort grows rapidly with the number of connected databases.
It is therefore important to think about methods for keeping the integrity of the databases intact. One possible solution to this problem is a connection index that keeps track of the relations between databases.
To construct this index, we have to consider different types of links. The following description of these types is given from the point of view of TRANSFAC, but the problem affects other linked databases, too. Whether these three types of links are sufficient remains to be discussed.
A global link connects one or more entries of the source database to a complete entry of the target database. An example is the link between the FACTORS table of TRANSFAC and an entry in SwissProt or PIR.
A position-sensitive link is more complicated to maintain. Here, one or more entries of the source database are linked to a specific position in one or more entries of the target database. Examples are the links between the SITES table of TRANSFAC and the EMBL database, where each TRANSFAC site is at a specific position within the EMBL sequence entry.
Feature links lie in between: here, one or more entries of the source database are linked to a specific feature of an entry of the target database. An example is the link between the FACTORS table of TRANSFAC and the EMBL database, where the link points to the DNA sequence of the gene of a transcription factor.
One entry in this index should describe exactly one link between two databases. Each link entry should therefore consist of the following information:
- A unique identification of the entry of the source database. This can be the name of the database and the accession number of the entry (or any other unique identifier).
- A unique identification of the entry of the target database. This can also be the name of the database and the accession number of the entry, or any other identifier that identifies exactly one entry (e.g. for the EMBL database this would be the accession number and the corresponding entry name).
- The classification of the link type and the additional positions/features for those links that are not global. For a feature link, this would be the corresponding term in the feature table (e.g. of EMBL); for a position-sensitive link, it would be the start position of the element and its length.
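The three blocks of information listed above can be sketched as a small record type. The class and field names (`LinkType`, `LinkEntry`, `feature`, `start`, `length`) are illustrative assumptions rather than a format defined by TRANSFAC, and the validation rules merely restate the description of the three link types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LinkType(Enum):
    GLOBAL = "global"      # points to a complete target entry
    FEATURE = "feature"    # points to a feature of a target entry
    POSITION = "position"  # points to a position within a target entry


@dataclass(frozen=True)
class LinkEntry:
    """Exactly one link between an entry of a source and a target database."""
    source_db: str
    source_id: str
    target_db: str
    target_id: str
    link_type: LinkType
    feature: Optional[str] = None  # feature table term, feature links only
    start: Optional[int] = None    # start position, position links only
    length: Optional[int] = None   # element length, position links only

    def __post_init__(self) -> None:
        has_position = self.start is not None or self.length is not None
        if self.link_type is LinkType.GLOBAL:
            if self.feature is not None or has_position:
                raise ValueError("a global link carries no feature or position")
        elif self.link_type is LinkType.FEATURE:
            if not self.feature or has_position:
                raise ValueError("a feature link needs only a feature term")
        else:
            if self.feature is not None:
                raise ValueError("a position link carries no feature term")
            if self.start is None or self.length is None:
                raise ValueError("a position link needs start and length")
            if self.start < 1 or self.length < 1:
                raise ValueError("start and length must be positive")
```

A site located in a sequence entry would then be recorded as, for example, `LinkEntry("TRANSFAC", "R00001", "EMBL", "X12345", LinkType.POSITION, start=120, length=14)`, where all identifiers and numbers are placeholders.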
The index entries should be as short as possible for fast access and easy maintenance. A delimited ASCII file would fit our preferences very well: it is accessible with every text editor and can easily be imported into any database management system. The working index may look very different, since it depends on the database management system that is used. Each change of an entry that is related to another database via this index triggers a suitable message to the maintainers of the connected database.
What such a trigger causes can vary and depends greatly on the type of link.
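A delimited ASCII index of this kind, together with the lookup that would precede a message to the maintainers, might be sketched as follows. The column names, the `|` delimiter and the function names are assumptions chosen for illustration, and the sketch only finds the affected links; sending the message itself is left out.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

# Assumed column layout of one index line.
FIELDS = ["source_db", "source_id", "target_db", "target_id", "link_type", "detail"]


def write_index(links: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write one delimited line per link."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="|", lineterminator="\n"
    )
    for link in links:
        writer.writerow(link)


def read_index(handle: TextIO) -> List[Dict[str, str]]:
    """Read the delimited index back into a list of records."""
    return list(csv.DictReader(handle, fieldnames=FIELDS, delimiter="|"))


def links_touching(index: List[Dict[str, str]], db: str, entry_id: str) -> List[Dict[str, str]]:
    """Return all links whose source or target is the changed entry."""
    return [
        link for link in index
        if (link["source_db"], link["source_id"]) == (db, entry_id)
        or (link["target_db"], link["target_id"]) == (db, entry_id)
    ]


# Placeholder records; identifiers are invented for illustration.
links = [
    {"source_db": "TRANSFAC", "source_id": "T00001", "target_db": "SWISSPROT",
     "target_id": "P00001", "link_type": "global", "detail": ""},
    {"source_db": "TRANSFAC", "source_id": "R00001", "target_db": "EMBL",
     "target_id": "X12345", "link_type": "position", "detail": "120,14"},
]
buffer = io.StringIO()
write_index(links, buffer)
buffer.seek(0)
index = read_index(buffer)
affected = links_touching(index, "EMBL", "X12345")
```

In this sketch, `affected` holds the links for which the maintainers of the connected database would be notified after a change to the placeholder EMBL entry.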
A link index like the one proposed above could significantly improve the cooperation between databases. An automatically scheduled message system should be capable of doing the major tasks in maintaining the links between the connected databases. Since the link indices between pairs of databases connect the entries in their current states, they will enable the user to navigate between the most recent database releases available.
Knüppel, R., Dietze, P., Lehnberg, W., Frech, K. and Wingender, E. (1994). TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J. Comput. Biol. 1, 191-198.