Stable Partial Information Extraction: A Self-Evolving Hybrid Mechanism

Gao, Peng; Han, Hao; Tokuda, Takehiro

doi:10.3233/978-1-61499-177-9-49

Abstract

The Internet provides a huge amount of information and knowledge through Web pages. For personal and effective use of such resources, partial information extraction technology opens a new path that enables end-users to obtain only the information they need from various Web pages and integrate it into original compositions. However, the traditional XPath-only extraction method fails when Web sites use different templates to construct Web pages or change the layout of Web pages, which we call the stability problem. In this paper, we propose a novel hybrid extraction mechanism that extracts partial information stably. We compare the original and changed Web pages to obtain the unchanged nodes as a stable-part list and use them to generate new paths. Since the list is re-ranked whenever new stable parts are found, the success rate of extraction is self-evolving, which correspondingly reduces manual intervention. We show the usefulness of our approach through experiments on real Web sites.
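The idea of comparing two page versions and re-locating a target through an unchanged node can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`text_nodes`, `relocate`), the choice of "same text in both versions" as the stability test, the sibling-offset path, and the success-count ranking are all assumptions made for illustration, and the pages are assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def text_nodes(root):
    """Map each non-empty stripped text to the first element carrying it."""
    nodes = {}
    for el in root.iter():
        text = (el.text or "").strip()
        if text:
            nodes.setdefault(text, el)
    return nodes


def relocate(old_root, new_root, old_target, scores):
    """Find old_target's counterpart in new_root via a stable sibling anchor.

    Assumption: a node is "stable" if its text occurs in both versions.
    `scores` is a hypothetical Counter of past successes used for ranking.
    """
    old_nodes, new_nodes = text_nodes(old_root), text_nodes(new_root)
    old_parents, new_parents = parent_map(old_root), parent_map(new_root)
    siblings = list(old_parents[old_target])  # old_target must not be the root
    target_idx = siblings.index(old_target)

    # Stable-part list, ranked so that anchors that worked before come first.
    stable = [t for t in old_nodes if t in new_nodes]
    stable.sort(key=lambda t: -scores[t])

    for text in stable:
        anchor_old, anchor_new = old_nodes[text], new_nodes[text]
        if (anchor_old is old_target
                or anchor_old not in siblings
                or anchor_new not in new_parents):
            continue
        # New path: same sibling offset, applied from the anchor's new position.
        offset = target_idx - siblings.index(anchor_old)
        new_siblings = list(new_parents[anchor_new])
        j = new_siblings.index(anchor_new) + offset
        if 0 <= j < len(new_siblings):
            scores[text] += 1  # reward the anchor that succeeded
            return new_siblings[j]
    return None


OLD = ET.fromstring(
    "<html><body><div><span>Price:</span><b>100</b></div></body></html>")
NEW = ET.fromstring(
    "<html><body><table><tr><td><p><span>Price:</span><b>120</b></p></td>"
    "</tr></table></body></html>")

scores = Counter()
target = OLD.find("./body/div/b")
found = relocate(OLD, NEW, target, scores)
```

In this example the fixed path `./body/div/b` no longer matches the changed page, while the unchanged "Price:" label still leads to the value next to it. A real system would need a more robust stability test and richer relative paths than a sibling offset.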
