Objective
For the first time, millions of people use search engines on a
daily basis, and they can see how difficult it is to deal with a flood of
articles that purport to be relevant to their information need. Information
extraction technology can help by giving analysts direct access
to the information they are seeking. Unfortunately, information extraction
technology has always been very difficult to
deploy because it requires extensive application-specific system development to
achieve good results. In this project we address both problems: improving
the underlying technology and making it easily accessible
to ordinary users.
Approach
We are focusing on three areas that will lead to this goal. First, we are
developing a library of the most common event patterns in business
news. Second, we are developing methods for acquiring new rules and modifying
old rules in response to user annotations and corrections. Third, we are
improving the methods for entity and event coreference, since
coreference errors have been shown to be the principal cause of poor
performance in information extraction.
Recent Accomplishments
- We have designed the basic ontology of the open-domain information
extraction system.
- We have implemented a lattice-based FASTUS system for improved handling
of ambiguity.
- In cooperation with other contractors, we have developed the basic specifications
for a common pattern specification language.
- We have achieved a significant improvement in information retrieval
results by using extraction technology to refine the output of an IR system.
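
The last item above can be illustrated with a minimal sketch. It assumes one
possible reading of "refine": documents returned by an IR system are kept only
if an extraction pattern finds an event of interest in them. The pattern, the
function name, and the sample documents are hypothetical and are not taken
from the project.

```python
import re

# Hypothetical event pattern: "<Company> acquired|bought <Company>".
# A real extraction system would use far richer patterns than this.
ACQUISITION = re.compile(r"\b([A-Z]\w+) (?:acquired|bought) ([A-Z]\w+)")


def refine(ranked_docs, pattern):
    """Keep only the documents in which the pattern finds an event.

    The original IR ranking order is preserved.
    """
    return [doc for doc in ranked_docs if pattern.search(doc)]


# Made-up IR output, in rank order.
retrieved = [
    "Analysts discussed acquisition rumors all week.",
    "Acme acquired Globex for an undisclosed sum.",
    "Initech bought Hooli last quarter.",
]

refined = refine(retrieved, ACQUISITION)
```

Here the first document mentions the topic but reports no acquisition event, so
it is dropped, while the other two are retained in their original order.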
Current Research Plans
We are now beginning to implement an open-domain
information extraction system covering about 150 of the most common
event types in business news. This has involved developing an
ontology of business news entities, defining patterns in terms of the
ontology for the common event types, and compiling these into their
various linguistic realizations. We expect to finish the first
version of this system in the coming year.
We have already implemented a preliminary version of a module for
modifying existing rules in a large system on the basis of corrections
made by a user unfamiliar with the way FASTUS is programmed; this will be
improved and enhanced in the coming year. We have worked out the
theoretical approach for rule acquisition from user annotations,
which will be implemented in the coming year.
We will improve heuristics we have already implemented for entity
coreference, and we will extend the coreference capabilities beyond
identity for pronouns and definite noun phrases. We will work out and
implement a formally specified theory of event coreference.
The FASTUS System
SRI's information extraction technology is based on the
FASTUS system.
FASTUS comprises a series of finite-state transducers that map input text
into SGML- or HTML-tagged text indicating items of interest (e.g., person
or company names), annotations in a TIPSTER document database, or
templates summarizing key document content of interest to an analyst.
The FASTUS architecture enables domain-dependent aspects of the system
to be isolated, thus facilitating transportability of extraction systems
across domains.
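
The idea of a cascade of stages that turns raw text into tagged text can be
sketched as follows. This is a toy illustration and not the FASTUS
implementation: regular-expression substitutions stand in for the finite-state
transducers, and the patterns, tag names, and function names are assumptions
made for the example. Keeping the domain-dependent patterns in a separate list
mirrors, in a very simplified way, the isolation of domain-dependent parts
described above.

```python
import re

# Hypothetical patterns; the real FASTUS grammars are not shown in this document.
COMPANY = re.compile(r"\b(?:[A-Z][A-Za-z]*\s)+(?:Inc|Corp|Ltd)\.?")
PERSON = re.compile(r"\b(?:Mr|Ms|Dr)\.\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?")


def tag_stage(pattern, tag):
    """Build one stage that wraps every match of `pattern` in <tag>...</tag>."""
    def stage(text):
        return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)
    return stage


def run_cascade(text, stages):
    """Apply the stages in order; each stage reads the previous stage's output."""
    for stage in stages:
        text = stage(text)
    return text


# Domain-dependent part: swapping this list retargets the cascade.
STAGES = [tag_stage(COMPANY, "COMPANY"), tag_stage(PERSON, "PERSON")]

tagged = run_cascade("Dr. Jane Smith joined Acme Widgets Inc. last year.", STAGES)
```

With these assumed patterns, `tagged` wraps "Dr. Jane Smith" in a PERSON tag
and "Acme Widgets Inc." in a COMPANY tag, leaving the rest of the sentence
unchanged.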
This page designed and maintained by
Douglas E. Appelt
(appelt@ai.sri.com)
July 8, 1997