Automatic Discovery of Semantic Structures in HTML Documents

Saikat Mukherjee    Guizhen Yang    Wenfang Tan    I. V. Ramakrishnan

Abstract

Template-driven HTML documents possess an implicit, fixed schema that organizes concepts and their relationships hierarchically. Discovering this schema remains a relatively unexplored problem. Exploiting the key observation that semantically related items in HTML documents exhibit spatial locality, we develop an algorithm that automatically partitions such documents into tree-like semantic structures, which expose the implicit schema.
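
To make the notion of spatial locality concrete, the following is a minimal illustrative sketch and not the algorithm developed in this paper. It assumes that "related" can be approximated by text leaves that appear consecutively in document order and share the same root-to-leaf tag path. The names `LeafPathCollector`, `group_by_locality`, and `page`, as well as the sample markup, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from itertools import groupby

# Void elements never receive an end tag, so they are kept off the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LeafPathCollector(HTMLParser):
    """Records (tag_path, text) for each non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.leaves = []  # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed tags above it.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def group_by_locality(html):
    """Groups consecutive text leaves that share the same tag path (assumed proxy for relatedness)."""
    parser = LeafPathCollector()
    parser.feed(html)
    parser.close()
    return [
        (path, [text for _, text in run])
        for path, run in groupby(parser.leaves, key=lambda leaf: leaf[0])
    ]


# Hypothetical template-driven page: section headings followed by link lists.
page = """
<html><body>
<h2>Sports</h2>
<ul><li><a href="#">Match report</a></li><li><a href="#">Transfer news</a></li></ul>
<h2>Business</h2>
<ul><li><a href="#">Markets close higher</a></li></ul>
</body></html>
"""

for path, texts in group_by_locality(page):
    print(path, texts)
```

On this sample, the two adjacent links under "Sports" fall into a single group because they are contiguous and share one tag path, while each heading forms its own group. A full partitioning into a tree would additionally need to attach each link group to the heading that precedes it, which this sketch does not attempt.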