Search |  Contact |  SRI Home Do not follow this link, or your host will be blocked from this site. This is a spider trap. Do not follow this link, or your host will be blocked from this site. This is a spider trap. Do not follow this link, or your host will be blocked from this site. This is a spider trap.A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A ASRI International.  333 Ravenswood Avenue.  Menlo Park, CA 94025-3493. SRI International is a nonprofit corporation.

AIC Seminar Series

Answering Approximate Queries Efficiently

Chen LiUniversity of California, Irvine[Home Page]

Notice:  hosted by Guizhen Yang

Date:  Thursday, August 3rd 2006 at 4:00pm

Location:  EJ291  (Directions)


Many database applications have the emerging need to answer approximate queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function to specify the similarity between strings, such as edit distance. There are many reasons to support such queries. To name a few: (1) The user might not remember exactly the name or the telephone number when issuing the query. (2) There could be typos in the query. (3) There could be errors or inconsistencies even in the database, especially in applications such as data cleaning.

In this talk we will present some of our recent results on answering approximate queries efficiently. One problem related to optimizing such queries is to estimate the selectivity of a fuzzy string predicate, i.e., estimating how many strings in the database satisfy the predicate. We develop a novel technique, called SEPIA, to solve the problem. We will present the details of this technique using the edit distance function. We study challenges in adopting this technique, including how to construct its histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. We show the results of our extensive experiments.

Time permitting, we will also briefly report our other related results. One is on supporting fuzzy queries with both predicates on numeric attributes (e.g., salary > 50K) and predicates on string attributes (e.g., telephone numbers similar to 412-0964). Another one is on how to relax conditions in an SQL query that returns an empty answer. These results are based on three recent papers in VLDBÂ’2005 and VLDBÂ’2006.

   Bio for Chen Li

Chen Li is an assistant professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google, Santa Monica. His research interests are in the fields of database and information systems, including data integration and sharing, data cleansing, data warehousing, and data privacy.

   On-line Resources