Sample-based XPath Ranking for Web Information Extraction

Oliver Jundt; Maurice Van Keulen

doi:10.2991/eusflat.2013.27

Corresponding Author

Oliver Jundt

Available Online August 2013.

DOI: 10.2991/eusflat.2013.27
Keywords: Web information extraction, Wrappers, XPath ranking
Abstract: Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract information for certain fields for a specific website. Manually creating and maintaining wrappers for all target websites is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from a wide variety of websites, possibly ones not seen beforehand. This paper approaches the problem of web information extraction from an angle that enables automatic on-the-fly wrapper creation. The approach is a form of wrapper induction: it uses a small set of data samples to rank XPaths by their suitability for extracting one particular field from the web pages of a certain site. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted information. Moreover, it appears that 20 to 25 input samples suffice for finding the right XPath for a field.
Copyright: © 2013, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
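
The sample-based ranking idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the scoring rule (credit of 1/n when the sample value is among the n nodes an XPath selects), the function name `rank_xpaths`, the hand-written candidate list, and the toy pages are all assumptions made for illustration. It uses only the limited XPath subset of Python's standard `xml.etree.ElementTree`, so the pages must be well-formed markup.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, samples, candidates):
    """Rank candidate XPaths by how well they reproduce known sample values.

    pages:      dict mapping a page id to well-formed XHTML markup
    samples:    dict mapping a page id to the expected value of one field
    candidates: list of XPath strings (ElementTree's XPath subset)

    Hypothetical scoring rule: for each sample, an XPath earns 1/n if the
    expected value is among the n nodes it selects, so broad XPaths that
    also select unrelated nodes score lower than specific ones.
    """
    trees = {pid: ET.fromstring(markup) for pid, markup in pages.items()}
    scored = []
    for xpath in candidates:
        total = 0.0
        for pid, expected in samples.items():
            texts = [(el.text or "").strip() for el in trees[pid].findall(xpath)]
            if expected in texts:
                total += 1.0 / len(texts)
        scored.append((total / len(samples), xpath))
    # Highest score first; ties keep the original candidate order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy pages from one imaginary site, with known values for the "price" field.
pages = {
    "p1": "<html><body><h1>Acme Phone</h1>"
          "<span class='price'>199</span><span class='stock'>12</span></body></html>",
    "p2": "<html><body><h1>Bolt Laptop</h1>"
          "<span class='price'>899</span><span class='stock'>3</span></body></html>",
}
samples = {"p1": "199", "p2": "899"}
candidates = [".//h1", ".//span", ".//span[@class='price']"]

ranked = rank_xpaths(pages, samples, candidates)
for score, xpath in ranked:
    print(f"{score:.2f}  {xpath}")
```

In this toy setting the class-specific XPath ranks first, the generic `.//span` second, and `.//h1` last. A real system would also have to generate the candidate XPaths and cope with malformed HTML, which this sketch leaves out.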

Volume Title: Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-13)
Series: Advances in Intelligent Systems Research
Publication Date: August 2013
ISBN: 978-90786-77-78-9
ISSN: 1951-6851

Cite this article

TY  - CONF
AU  - Oliver Jundt
AU  - Maurice Van Keulen
PY  - 2013/08
DA  - 2013/08
TI  - Sample-based XPath Ranking for Web Information Extraction
BT  - Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-13)
PB  - Atlantis Press
SP  - 187
EP  - 194
SN  - 1951-6851
UR  - https://doi.org/10.2991/eusflat.2013.27
DO  - 10.2991/eusflat.2013.27
ID  - Jundt2013/08
ER  -
