TRANSFAC, TRRD and COMPEL: three databases on gene regulation and their links

R. Knueppel (1), O. V. Kel (2), A. E. Kel (2), N. A. Kolchanov (2) and E. Wingender (1)
(1) GBF, Braunschweig, Germany
(2) Institute of Cytology and Genetics, RAS, Novosibirsk, Russia

Summary:

TRANSFAC, TRRD and COMPEL are three databases on gene regulation that supplement each other with specific information. Their shared biological subject suggests linking them through a common gene table. This common gene table also makes it easier to connect corresponding objects through appropriate relations, such as the sites of TRANSFAC and the cis-elements of TRRD and COMPEL.


Interconnecting databases that share common topics makes it easier for the user to retrieve and select data from them and, once the connection has been established, to maintain them by supplementing information from the linked databases as these are updated. TRANSFAC, TRRD and COMPEL are three databases on the transcriptional level of gene regulation. Each focusses on particular aspects of gene regulation, which is reflected in the structure of the corresponding data model. All three databases contain experimental data extracted from the literature; in addition, TRANSFAC contains data such as transcription factor binding matrices or consensus descriptions, partly derived with sequence analysis tools.

TRANSFAC is maintained at the GBF Braunschweig, Germany. It holds information about gene regulatory DNA sequences (sites) and the proteins (transcription factors) that bind to and act through them. The site and factor tables and the relation between them therefore form the central part of the data model. Besides compiling experimental data extracted from the literature in SITES, FACTORS and CELLS, TRANSFAC also comprises secondary data such as the CLASS table. This table holds information about factor classification, since many transcription factors have been assigned to a distinct family or superfamily on the basis of their DNA-binding domain and/or their dimerization domain. Another important example is the MATRIX table. This table holds nucleotide distribution matrices derived from aligned, experimentally verified binding sequences of individual transcription factors. These matrices have been obtained either by random oligonucleotide selection approaches (and, in this sense, are primary rather than secondary data) or from all binding sites of a factor that have been identified within genomes. They can be used to screen genomic sequences for potential recognition sites of these factors.
For local maintenance, the database is implemented as a relational database using an SQL server as DBMS. It is distributed as a set of five ASCII flat files with documentation: SITES, FACTORS, CELLS, CLASS and MATRIX. Some other tables (REFERENCES, METHODS) that appear in the relational model have been integrated into these files for the sake of clarity, at the price of increased redundancy. The SITES and FACTORS tables contain cross-references to the EMBL data library and to the SwissProt database, both of which refer in turn to TRANSFAC as well (see also Karas et al.). Moreover, TRANSFAC is publicly available as a network-model-based database including a browsing tool (TRANSFAC retrieval program, TRP [1]) and an integrated interface to the sequence analysis tool ConsInspector of Frech et al. [2].
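To illustrate how a nucleotide distribution matrix of the kind held in the MATRIX table can be used to screen a sequence, the following is a minimal sketch only. The matrix counts, the function names `score_window` and `scan`, the simple sum-of-counts score and the 0.8 threshold are all assumptions made for illustration; this is not the scoring scheme of TRANSFAC or ConsInspector.

```python
# Toy nucleotide distribution matrix: one dict of base counts per position.
# The counts are invented for illustration and do not come from TRANSFAC.
TOY_MATRIX = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 8, "C": 0, "G": 2, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
]


def score_window(matrix, window):
    """Sum the counts of the window's bases and scale by the best possible sum."""
    best = sum(max(column.values()) for column in matrix)
    observed = sum(column.get(base, 0) for column, base in zip(matrix, window))
    return observed / best


def scan(matrix, sequence, threshold=0.8):
    """Yield (start, window, score) for each window scoring at or above threshold."""
    width = len(matrix)
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_window(matrix, window)
        if score >= threshold:
            yield start, window, score


# Hypothetical input sequence; positions are 0-based.
hits = list(scan(TOY_MATRIX, "ttGTACaaGTGCcc"))
for start, window, score in hits:
    print(start, window, round(score, 2))
```

A real screening tool would typically weight positions by their information content and account for background base frequencies; the plain sum is used here only to keep the idea visible.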

The database TRRD (Transcription Regulatory Region Database) is maintained at the Institute of Cytology and Genetics, RAS, Novosibirsk, Russia. It contains information about the structure of transcription control regions of eukaryotic genes. The main aim of the description of regulatory regions is to reflect their modular structure. Gene regulating modules are described at the following hierarchical levels: cis-elements, composite elements, enhancers, promoters, regulatory control regions, and the entirety of all control regions of the gene. This hierarchical structure of transcription control regions is modelled as a hierarchical net with genes at the top level and cis-elements as elementary units at the bottom level. Up to now, TRRD has been implemented as an ASCII flat file database. It will be converted into a relational database for internal maintenance and will be publicly available as an ASCII flat file.
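The hierarchy described above can be pictured with a small sketch. It assumes a plain tree for simplicity, whereas the text speaks of a hierarchical net; the class `Node`, the function `cis_elements` and all entry names are hypothetical and do not reflect the actual TRRD format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One regulatory unit; `level` names its place in the hierarchy."""
    level: str
    name: str
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child unit and return it so that calls can be chained."""
        self.children.append(child)
        return child


def cis_elements(node):
    """Collect the names of all cis-elements below a node, depth first."""
    if node.level == "cis-element":
        return [node.name]
    found = []
    for child in node.children:
        found.extend(cis_elements(child))
    return found


# Hypothetical gene with one control region holding a promoter and an enhancer.
gene = Node("gene", "example gene")
region = gene.add(Node("regulatory control region", "region 1"))
promoter = region.add(Node("promoter", "P1"))
enhancer = region.add(Node("enhancer", "E1"))
composite = promoter.add(Node("composite element", "CE1"))
composite.add(Node("cis-element", "site a"))
composite.add(Node("cis-element", "site b"))
enhancer.add(Node("cis-element", "site c"))

print(cis_elements(gene))
```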

COMPEL is likewise maintained at the Institute of Cytology and Genetics, RAS, Novosibirsk, Russia. It contains information on composite elements. A composite element is formed of adjacent or partially overlapping binding sites for proteins that belong to different factor families and to different signal transduction pathways. Cross-coupling of two different factors at a composite element gives rise to a new pattern of transcription regulation. This is modelled by a tuple of site-transcription factor interactions, subject to the constraint that the participating sites lie in close neighbourhood (which of course implies that a composite element is related to just one gene). COMPEL is implemented as an ASCII flat file database. It will be converted into a relational database closely linked to TRRD and will be publicly available as an ASCII flat file.
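The constraints on a composite element stated above can be expressed as a simple check on a pair of site-factor interactions. This is a sketch under assumptions: the record layout `Interaction`, the function `is_composite_candidate` and the `max_gap` value of 10 bp are hypothetical, and the requirement of different signal transduction pathways is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One site-transcription factor interaction (hypothetical layout)."""
    gene: str
    start: int  # first position of the site, inclusive
    end: int    # last position of the site, inclusive
    factor: str
    factor_family: str


def is_composite_candidate(a, b, max_gap=10):
    """Check same gene, different factor families, and close neighbourhood.

    The gap is negative when the two sites overlap.
    """
    same_gene = a.gene == b.gene
    different_families = a.factor_family != b.factor_family
    gap = max(a.start, b.start) - min(a.end, b.end)
    return same_gene and different_families and gap <= max_gap


# Invented example records.
site_1 = Interaction("gene X", 100, 110, "factor A", "family 1")
site_2 = Interaction("gene X", 108, 120, "factor B", "family 2")
site_3 = Interaction("gene X", 150, 160, "factor C", "family 2")

print(is_composite_candidate(site_1, site_2))
print(is_composite_candidate(site_1, site_3))
```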

As TRRD and COMPEL are already closely linked by referencing each other's entry identifiers, the task of linking all three databases was to establish appropriate links between TRRD or COMPEL and TRANSFAC. First, we established a common table (GENES) that links TRANSFAC and TRRD, since TRANSFAC SITES are also connected to entries in this gene table. This table will comprise the gene's name, an identifier and an accession number, its assignment according to Bucher's promoter classification [3], the links to TRANSFAC and TRRD and, where possible, the cross-reference to EMBL/GenBank. The GENES table is maintained at both locations. Each side will update it locally in the interim, and the copies will be synchronized at defined intervals. We are now establishing links between the cis-elements of TRRD (COMPEL) and the sites of TRANSFAC, as well as between the interacting factors of COMPEL and TRANSFAC. These links will be held in two additional tables that use the respective entry identifiers to represent many-to-many relations between the respective tables.
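The linking scheme described above, a common gene table plus a link table of entry identifiers for a many-to-many relation, can be sketched with Python's built-in `sqlite3` module. All table names, column names and identifiers below are hypothetical and much reduced; they are not the actual schema of TRANSFAC, TRRD or COMPEL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (
    gene_id TEXT PRIMARY KEY,
    name    TEXT
);
CREATE TABLE transfac_sites (
    site_id TEXT PRIMARY KEY,
    gene_id TEXT REFERENCES genes(gene_id)
);
CREATE TABLE trrd_cis_elements (
    element_id TEXT PRIMARY KEY,
    gene_id    TEXT REFERENCES genes(gene_id)
);
-- Link table: one row per (site, cis-element) pair, many-to-many.
CREATE TABLE site_element_links (
    site_id    TEXT REFERENCES transfac_sites(site_id),
    element_id TEXT REFERENCES trrd_cis_elements(element_id),
    PRIMARY KEY (site_id, element_id)
);
""")

# Invented entries: one gene, two sites, two cis-elements, three links.
conn.execute("INSERT INTO genes VALUES ('G1', 'example gene')")
conn.executemany("INSERT INTO transfac_sites VALUES (?, ?)",
                 [("S1", "G1"), ("S2", "G1")])
conn.executemany("INSERT INTO trrd_cis_elements VALUES (?, ?)",
                 [("E1", "G1"), ("E2", "G1")])
conn.executemany("INSERT INTO site_element_links VALUES (?, ?)",
                 [("S1", "E1"), ("S1", "E2"), ("S2", "E2")])

# Follow the links from sites to cis-elements and on to the common gene table.
rows = conn.execute("""
SELECT l.site_id, l.element_id, g.name
FROM site_element_links AS l
JOIN transfac_sites AS s ON s.site_id = l.site_id
JOIN genes AS g ON g.gene_id = s.gene_id
ORDER BY l.site_id, l.element_id
""").fetchall()

for row in rows:
    print(row)
```

A second link table of the same shape would relate the interacting factors of COMPEL to the factors of TRANSFAC.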

TRANSFAC tables are already linked to the WWW. As soon as TRRD and COMPEL have been prepared for the WWW, the aforementioned database links will be included as active hyperlinks in the WWW pages, so that a WWW browser can serve as a simple database retrieval tool.

[1] Knueppel, R., Dietze, P., Lehnberg, W., Frech, K. and Wingender, E. (1994). TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J. Comput. Biol. 1, 191-198.
[2] Frech, K., Herrmann, G. and Werner, T. (1993). Computer-assisted prediction, classification and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 21, 1655-1664.
[3] Bucher, P. (1993). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 37, Postfach 10.2209, D-69012 Heidelberg.