The Internet provides a huge amount of information and knowledge through Web pages. To let end-users make personal and effective use of these resources, partial information extraction technology enables them to obtain only the information they need from various Web pages and integrate it into original compositions. However, the traditional XPath-only extraction method fails when Web sites use different templates to construct Web pages or change the layout of their pages, which we call the stability problem. In this paper, we propose a novel hybrid extraction mechanism for stably extracting partial information. We compare the original and changed Web pages to identify the unchanged nodes, record them as a stable-part list, and use them to generate new paths. Since the list is re-ranked whenever new stable parts are found, the extraction success rate is self-evolving, which correspondingly reduces manual intervention. We demonstrate the usefulness of our approach through experiments on real Web sites.
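The idea of generating new paths from stable parts can be illustrated with a minimal sketch. It is not the implementation described in the paper. It assumes well-formed pages that Python's `xml.etree.ElementTree` can parse, treats an element as "stable" when its tag and `id` attribute appear in both page versions, and re-ranks candidate paths by a simple success count. The sample pages and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: the site adds a banner and wraps the content in a new div.
OLD_PAGE = """<html><body>
<div id="header">Site</div>
<div id="main"><h1 id="title">News</h1><p>Price: 10</p></div>
</body></html>"""

NEW_PAGE = """<html><body>
<div id="banner">Ad</div>
<div id="header">Site</div>
<div id="wrap"><div id="main"><h1 id="title">News</h1><p>Price: 12</p></div></div>
</body></html>"""

ABSOLUTE_PATH = "body/div[2]/p"  # position-only path recorded on the old page


def signature(el):
    # Assumption: tag plus id is enough to recognize the same node across versions.
    return (el.tag, el.get("id"))


def stable_parts(old_root, new_root):
    """Signatures of id-bearing elements that occur in both page versions."""
    new_sigs = {signature(e) for e in new_root.iter() if e.get("id")}
    return [signature(e) for e in old_root.iter()
            if e.get("id") and signature(e) in new_sigs]


def steps_down(anchor, target):
    """Positional steps from anchor down to target, or None if not a descendant."""
    if anchor is target:
        return []
    for child in anchor:
        rest = steps_down(child, target)
        if rest is not None:
            same_tag = [c for c in anchor if c.tag == child.tag]
            return [f"{child.tag}[{same_tag.index(child) + 1}]"] + rest
    return None


def candidate_paths(old_root, target, stable):
    """One path per stable ancestor: locate the anchor by id, then step down."""
    paths = []
    for tag, id_ in stable:
        prefix = f".//{tag}[@id='{id_}']"
        steps = steps_down(old_root.find(prefix), target)
        if steps:
            paths.append(prefix + "/" + "/".join(steps))
    return paths


def extract(root, paths, scores):
    """Try paths in ranked order and reward the one that succeeds."""
    paths.sort(key=lambda p: -scores.get(p, 0))  # re-rank by past successes
    for path in paths:
        hit = root.find(path)
        if hit is not None:
            scores[path] = scores.get(path, 0) + 1
            return hit.text
    return None


old_root, new_root = ET.fromstring(OLD_PAGE), ET.fromstring(NEW_PAGE)
target = old_root.find(ABSOLUTE_PATH)
paths = candidate_paths(old_root, target, stable_parts(old_root, new_root))
scores = {}
absolute_hit = new_root.find(ABSOLUTE_PATH)     # position-only path on the new layout
hybrid_text = extract(new_root, paths, scores)  # anchor-based path on the new layout
print(absolute_hit, hybrid_text, paths)
```

In this toy case the position-only path no longer reaches the target after the layout change, while the path anchored at the unchanged `main` element still does. A fuller treatment would also need to handle anchors that are siblings rather than ancestors of the target, and nodes without `id` attributes.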