On the Complexity of Schema Inference from Web Pages in the Presence of Nullable Data Attributes

Guizhen Yang    I. V. Ramakrishnan    Michael Kifer

Abstract

An increasingly large number of Web pages are machine-generated by filling in templates with data stored in backend databases. These templates can be viewed as the implicit schemas of those Web pages. The ability to infer the implicit schema from a collection of Web pages is important for scalable data extraction, since the inferred schema can be used to automatically identify schema attributes that are "encoded" in Web pages.

However, the task of inferring a "good" schema is complicated by the existence of nullable (missing) data attributes. Typically, an attribute whose value is null is omitted from the generated Web page, so pages generated from the same template can exhibit different variations and permutations of layout structure.
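
To make the effect of nullable attributes concrete, the following is a minimal illustrative sketch, not the formal model developed in the paper. The template, the attribute names (name, phone, fax), the tags, and the render function are all hypothetical. It assumes a template that wraps each attribute value in a fixed tag and skips any attribute whose value is null. When two nullable attributes share the same markup, different records can produce identical pages, so the page alone does not reveal which attribute a value belongs to.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical template: an ordered list of (attribute, tag) pairs.
# "phone" and "fax" are both nullable and happen to share the same tag.
TEMPLATE: List[Tuple[str, str]] = [("name", "b"), ("phone", "i"), ("fax", "i")]


def render(record: Dict[str, Optional[str]]) -> str:
    """Fill the template with a record, omitting attributes whose value is null."""
    parts = []
    for attr, tag in TEMPLATE:
        value = record.get(attr)
        if value is not None:  # a null attribute leaves no trace in the page
            parts.append(f"<{tag}>{value}</{tag}>")
    return "".join(parts)


# Two different records: one has only a phone number, the other only a fax number.
phone_only = {"name": "Ann", "phone": "555-0100", "fax": None}
fax_only = {"name": "Ann", "phone": None, "fax": "555-0100"}

# Both render to the same page, so the <i> element cannot be attributed
# to "phone" or "fax" from the page alone.
same_page = render(phone_only) == render(fax_only)
```

In this sketch same_page evaluates to True: the omission of null attributes collapses two distinct records into a single layout, which is the kind of ambiguity an inferred schema has to contend with.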

In this paper we investigate the complexity of schema inference from Web pages in the presence of nullable data attributes. We introduce the notion of unambiguity as a quality measure for inferred schemas and prove that the problem of inferring "good" (unambiguous) schemas is NP-complete. Our complexity results imply that ambiguity resolution is one of the root causes of the computational difficulty underlying schema inference from Web pages.