To facilitate understanding of, and access to, the current information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. As such, it is largely a secondary, commentary database on the Protein Data Bank. In addition, it integrates data from a variety of sources, including links to coordinates, images of structures, interactive viewers, sequence data, and literature references. The database is freely accessible on the World Wide Web (WWW) with an entry point at URL http://scop.mrc-lmb.cam.ac.uk/scop/, and with mirrors around the world. Scop provides a model for how databases can be comprehensively annotated and integrated in a user-friendly manner.
The exponential growth in the number of proteins whose structures have been determined by X-ray crystallography and NMR spectroscopy means that there is now a large and rapidly growing corpus of information available. At present (February, 1995) the Brookhaven Protein Data Bank (PDB) contains 3500 entries and the number is increasing by more than 100 a month. Nearly all proteins have structural similarities with other proteins and, in many cases, share a common evolutionary origin. The knowledge of these relationships makes important contributions to molecular biology and to other related areas of science. It is central to our understanding of the structure and evolution of proteins. It will play an important role in the interpretation of the sequences produced by the genome projects and, therefore, in understanding the evolution of development.
The scop database includes all proteins in the current version of the PDB and many proteins whose structures have been published but whose coordinates are not available from the PDB, and organizes them on hierarchical levels that embody the evolutionary and structural relationships:
Presenting the large amounts of complex hierarchical data in the scop database in an accessible manner was a difficult task. Adding to the challenge was the need to rapidly update the system as new protein structures are solved. Finally, as scop is a "commentary," or annotation, database principally depending upon the contents of the PDB, it was desirable to provide a convenient mechanism for accessing the primary data.
In the spring of 1994, we decided that the WWW would provide all of the needed facilities as well as a convenient client interface. Hyperlinks between the thousands of pages representing every level of the hierarchical tree make navigation through the complex structure straightforward and even intuitive. In addition, it was straightforward to incorporate documentation for scop with appropriate links on the WWW, so that new users could easily approach the database and understand it. Because the data is maintained on a server, rather than distributed to all users of scop, it is easy to update the database: clients need undertake no special action to access the latest version of the system.
The availability of many other related databases on the WWW allowed us to provide links to them as well. In addition to providing hyperlinks to the PDB server and the coordinate data stored there, we also have links to the beautiful Swiss-3DImage database of protein pictures, and to NCBI Entrez. This last link is especially important, for in addition to providing the sequences of the proteins in scop, and easy access to MEDLINE abstracts, it also provides a way of using scop to organize protein sequences. Entrez contains links from each sequence to its computer-identified homologues, so it is trivial to go from a fold of interest to finding all of the known sequences expected to adopt that structure.
The interactive forms capability of the WWW has also been used to advantage in scop. Using a defined interface, users can send mail to the scop authors without leaving their browser. A keyword searching facility provides links into the main scop hierarchy, and sequence searching also shows alignments and will soon provide simple homology models.
Critically, the keyword searching facility also provides a means for linking in to scop; if a PDB identifier is presented to this system, the appropriate page(s) for that entry will be retrieved. Since scop is organized on a domain level, it is possible for a single protein to occur in multiple places in the database; at present there is no way to select just a region of a particular protein. Likewise, there is not yet any automatic mechanism for retrieving levels of the hierarchy above PDB entries or for providing machine-readable results from these queries. However, all of these necessary interconnectivity facilities are under development.
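The lookup described above, in which one PDB identifier can resolve to several pages because scop is organized by domain, can be illustrated with a minimal sketch. It is not the scop implementation: the record layout, the function names `build_index` and `pages_for`, and the identifiers and page names in `DOMAINS` are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records: (PDB identifier, region of the protein, scop page).
# The identifiers and page names are placeholders, not real scop entries.
DOMAINS = [
    ("1xyz", "residues 1-120", "fold_a.html"),
    ("1xyz", "residues 121-250", "fold_b.html"),
    ("2xyz", "whole chain", "fold_a.html"),
]


def build_index(domains):
    """Map each PDB identifier to every page that contains one of its domains."""
    index = defaultdict(list)
    for pdb_id, _region, page in domains:
        key = pdb_id.lower()  # treat identifiers case-insensitively
        if page not in index[key]:
            index[key].append(page)
    return index


def pages_for(index, pdb_id):
    """Return all pages for an identifier, or an empty list if it is unknown."""
    return list(index.get(pdb_id.strip().lower(), []))


INDEX = build_index(DOMAINS)
```

Because the index is keyed on the whole entry, a query returns every page on which any domain of that entry appears, which mirrors the limitation noted above: the region is recorded but cannot yet be used to narrow the result.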
In addition to using standard WWW capabilities, we have found that some additional features were necessary to make scop optimally usable. Although the individual pages of the scop database are deliberately made small, the international internet links were slow enough to delay their transmission noticeably. Thus, we have established mirrors for the database on the East and West Coasts of the USA, and in Japan, Israel, and Australia. Since not all searching and computational facilities could be implemented at each site, we developed a system that has every page in the database encode its location. When a search is initiated, this location information is transmitted to the search engine, which could be elsewhere. By this mechanism, we return links to hypertext pages that are at the original site (where the query was asked) rather than where the engine runs. Right now, building new mirrors is a computationally intensive process, and we feel that standardized mechanisms for encoding location information would be of great benefit.
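The location mechanism can be sketched as follows, assuming that each page passes the base URL of its own site to the search engine and that search hits are stored as site-relative paths. The function name `links_for_origin`, the mirror URL, and the page paths are hypothetical; how scop actually encodes and transmits the location is not specified here.

```python
from urllib.parse import urljoin


def links_for_origin(origin_base, page_paths):
    """Resolve search hits against the site where the query was asked.

    origin_base: base URL sent along with the query by the originating page.
    page_paths:  site-relative paths of the pages that matched the search.
    """
    # A trailing slash makes urljoin keep the last path segment of the base.
    base = origin_base if origin_base.endswith("/") else origin_base + "/"
    return [urljoin(base, path.lstrip("/")) for path in page_paths]


# A query asked on a (hypothetical) mirror gets links back to that mirror,
# even if the search engine itself runs on another host.
hits = links_for_origin("http://mirror.example.org/scop", ["data/page1.html"])
```

Keeping the hits relative and resolving them only at response time means a single search engine can serve every mirror without storing per-site copies of its index.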
Another feature developed specifically for scop was a system to allow markup of molecular structures in interactive viewers. Simply providing coordinates for a structure to be loaded (via MIME typing) into an interactive molecular viewer is straightforward, and a link from scop to the Molecules R Us server at NIH provides this facility. However, as noted above, a single protein structure can occur multiple times in the scop database; only a region of a particular protein may be relevant to the fold currently being investigated. Therefore we needed to develop a system to highlight the region of the protein which was of interest. In addition, we wanted to provide a mechanism to allow users to obtain standard coordinate data from a site local to them, rather than requiring the wasteful transmission of enormous files over the internet. The annotated macromolecule (Annmm) specification, being presented at the ISMB'95 conference, and its forerunner, rasmolscripts, provide a means of doing this. Annmms are not limited to the WWW and display systems; they provide a general and well-defined mechanism of communicating annotation and region information about molecular structures.
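To illustrate the idea of marking up a region, the sketch below generates a short RasMol command script that colours a whole structure in one colour and a residue range in another. It does not reproduce the Annmm specification or the rasmolscripts format, neither of which is detailed here; the function name `highlight_script` and its parameters are hypothetical, and the script relies only on the basic RasMol `select` and `colour` commands.

```python
def highlight_script(first_residue, last_residue, highlight="red", base="white"):
    """Build a RasMol script that highlights one residue range.

    The coordinates themselves are not included: the viewer is assumed to
    have loaded them already, for example from a site local to the user.
    """
    if first_residue > last_residue:
        raise ValueError("first_residue must not exceed last_residue")
    commands = [
        "select all",
        f"colour {base}",  # de-emphasize the whole molecule
        f"select {first_residue}-{last_residue}",
        f"colour {highlight}",  # mark the region of interest
    ]
    return "\n".join(commands) + "\n"


script = highlight_script(10, 50)
```

Separating the annotation from the coordinates keeps the transmitted file small, which is the motivation given above for letting users fetch standard coordinate data locally.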
The scop database was originally created as a tool for understanding protein evolution through sequence-structure relationships and determining if new sequences and new structures are related to previously known protein structures. On a more general level, the highest levels of classification provide an excellent overview of the diversity of protein structures now known and would be appropriate both for researchers and students. The specific lower levels should be helpful for comparing individual structures with their evolutionary and structurally related counterparts. In addition, we have also found that the search capabilities with easy access to data and images make scop a powerful general-purpose interface to the PDB. As such, scop both provides a model for how integrative and commentary databases can be created and made easy to use, and demonstrates their utility and power.