Bringing Information Extraction Technology to the Analyst's Fingertips

Personnel



Objective

Millions of people now use search engines every day, and they see how difficult it is to cope with a flood of articles that purport to be relevant to their information need. Information extraction technology can help by giving analysts direct access to the information they are seeking. Unfortunately, information extraction technology has always been very difficult to deploy, because it requires extensive application-specific system development to achieve good results. In this project we address both problems: improving the underlying technology and making it easily accessible to ordinary users.

Approach

We are focusing on three areas that will lead to this goal. First, we are developing a library of the most common event patterns in business news. Second, we are developing methods for acquiring new rules and modifying old rules in response to users' annotations and corrections. Third, we are improving the methods for entity and event coreference, because coreference failures have been shown to be the principal cause of poor performance in information extraction.

Recent Accomplishments

Current Research Plans

We are now beginning to implement an open-domain information extraction system covering about 150 of the most common event types in business news. This has involved developing an ontology of business news entities, defining patterns in terms of the ontology for the common event types, and compiling these into their various linguistic realizations. We expect to finish the first version of this system in the coming year.
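As a rough illustration of what "compiling a pattern into its linguistic realizations" can mean, the sketch below expands one abstract event pattern, stated in terms of ontology classes, into several surface templates (active, passive, and nominalized). It is not FASTUS's rule language: the `EventPattern` class, the `compile_realizations` function, the slot notation, and the class name `Company` are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventPattern:
    """Hypothetical abstract event pattern defined over ontology classes."""
    event_type: str    # e.g. "Acquisition"
    agent_class: str   # ontology class filling the agent role
    object_class: str  # ontology class filling the object role
    past: str          # past-tense verb form, e.g. "acquired"
    participle: str    # past participle, e.g. "acquired"
    nominal: str       # nominalization, e.g. "acquisition"


def compile_realizations(p: EventPattern) -> list[str]:
    """Expand one abstract pattern into a few surface templates.

    Slots are written as [role:Class]; a real system would compile
    these into finite-state rules rather than strings.
    """
    agent = f"[agent:{p.agent_class}]"
    obj = f"[object:{p.object_class}]"
    return [
        f"{agent} {p.past} {obj}",                  # active clause
        f"{obj} was {p.participle} by {agent}",     # passive clause
        f"{agent}'s {p.nominal} of {obj}",          # possessive nominalization
        f"the {p.nominal} of {obj} by {agent}",     # "of ... by" nominalization
    ]


if __name__ == "__main__":
    acquisition = EventPattern(
        "Acquisition", "Company", "Company",
        "acquired", "acquired", "acquisition",
    )
    for template in compile_realizations(acquisition):
        print(template)
```

The point of the design is that the analyst-facing definition is written once, against the ontology, while the variation in how news text phrases the event is generated mechanically.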

We have already implemented a preliminary version of a module for modifying existing rules in a large system on the basis of corrections made by a user unfamiliar with the way FASTUS is programmed; this will be improved and enhanced in the coming year. We have worked out the theoretical approach for rule acquisition from user annotations, which will be implemented in the coming year.

We will improve heuristics we have already implemented for entity coreference, and we will extend the coreference capabilities beyond identity for pronouns and definite noun phrases. We will work out and implement a formally specified theory of event coreference.

The FASTUS System

SRI's information extraction technology is based on the FASTUS system. FASTUS comprises a series of finite-state transducers that map input text into SGML- or HTML-tagged text indicating items of interest (e.g., person or company names), annotations in a TIPSTER document database, or templates summarizing key document content of interest to an analyst. The FASTUS architecture enables domain-dependent aspects of the system to be isolated, thus facilitating transportability of extraction systems across domains.
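The cascaded design described above can be sketched in a few lines. The example below is only a toy under stated assumptions: Python regular expressions stand in for finite-state transducers, and the stage names, tag names (`COMPANY`, `PERSON`), and the hiring pattern are hypothetical, not taken from FASTUS. It shows two generic name-tagging stages feeding a third, domain-dependent stage that fills a template.

```python
import re

# Stage 1 (assumed rule): capitalized words ending in a corporate suffix.
COMPANY = re.compile(r"\b((?:[A-Z][A-Za-z&]*\s)+(?:Inc|Corp|Ltd|Co)\.)")

# Stage 2 (assumed rule): a title followed by one or two capitalized names.
PERSON = re.compile(r"\b((?:Mr|Ms|Mrs|Dr)\.\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?)")

# Stage 3 (assumed, domain-dependent rule): a hiring event stated over the
# tags produced by the earlier stages rather than over raw text.
HIRE = re.compile(
    r"<COMPANY>(?P<company>[^<]+)</COMPANY> (?:named|appointed|hired) "
    r"<PERSON>(?P<person>[^<]+)</PERSON> (?:as )?(?P<post>[a-z ]+?)[.,]"
)


def tag_companies(text: str) -> str:
    """Wrap company names in SGML-style tags."""
    return COMPANY.sub(r"<COMPANY>\1</COMPANY>", text)


def tag_persons(text: str) -> str:
    """Wrap person names in SGML-style tags."""
    return PERSON.sub(r"<PERSON>\1</PERSON>", text)


def extract_templates(tagged: str) -> list[dict]:
    """Fill one template (a dict of slots) per hiring event found."""
    return [m.groupdict() for m in HIRE.finditer(tagged)]


def run_cascade(text: str) -> tuple[str, list[dict]]:
    """Run the stages in order; each stage consumes the previous output."""
    tagged = tag_persons(tag_companies(text))
    return tagged, extract_templates(tagged)


if __name__ == "__main__":
    sentence = "Acme Widgets Inc. named Dr. Jane Smith as chief executive."
    tagged, templates = run_cascade(sentence)
    print(tagged)
    print(templates)
```

Only the third stage refers to the domain (hiring events), so moving to a new domain would mean replacing that stage while the name-tagging stages stay as they are, which is the kind of isolation the architecture is meant to provide.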

System Diagram
This page designed and maintained by Douglas E. Appelt (appelt@ai.sri.com)
July 8, 1997