Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-13)

Sample-based XPath Ranking for Web Information Extraction

Authors
Oliver Jundt, Maurice Van Keulen
Corresponding Author
Oliver Jundt
Available Online August 2013.
DOI
https://doi.org/10.2991/eusflat.2013.27
Keywords
Web information extraction; Wrappers; XPath ranking
Abstract

Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract the information for certain fields from a specific website. Manually creating and maintaining wrappers for all target websites is cumbersome and error-prone. It may even be prohibitive, as some applications require information extraction from a wide variety of websites, possibly including previously unseen ones. This paper approaches web information extraction from an angle that enables automatic, on-the-fly wrapper creation. The approach is a form of wrapper induction: it uses a small set of data samples to rank XPaths by their suitability for extracting one particular field from the web pages of a given site. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted information. Moreover, it appears that 20 to 25 input samples suffice to find the right XPath for a field.
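
To make the idea of sample-based ranking concrete, the following is a minimal sketch and not the scoring method of the paper, whose details are not given in the abstract. It assumes that the samples are known values of the target field, that pages are well-formed markup, and that a candidate path can be scored by the fraction of its extracted values that match a sample. The function name `rank_xpaths`, the example pages, and the candidate paths are all hypothetical. Python's standard-library `xml.etree.ElementTree` supports only a subset of XPath, so real web pages would need a proper HTML parser and a full XPath engine.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, samples, candidates):
    """Rank candidate paths by how well their output matches known samples.

    pages      -- list of well-formed page sources (assumption: parseable as XML)
    samples    -- set of known values of the target field
    candidates -- list of path expressions in ElementTree's XPath subset
    """
    roots = [ET.fromstring(page) for page in pages]
    scored = []
    for path in candidates:
        # Collect the text of every node the path selects on every page.
        extracted = [
            (node.text or "").strip()
            for root in roots
            for node in root.findall(path)
        ]
        hits = sum(1 for value in extracted if value in samples)
        # Illustrative score: share of extracted values that are known samples.
        score = hits / len(extracted) if extracted else 0.0
        scored.append((score, hits, path))
    # Best score first; ties are broken by the number of matched samples.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [(path, score) for score, _hits, path in scored]


# Hypothetical pages from one site; the target field is the product name.
pages = [
    "<html><body><h1>Alpha Phone</h1><span class='price'>199</span>"
    "<span class='sku'>A1</span></body></html>",
    "<html><body><h1>Beta Tablet</h1><span class='price'>349</span>"
    "<span class='sku'>B2</span></body></html>",
]
samples = {"Alpha Phone", "Beta Tablet"}
candidates = [".//span", ".//h1", ".//span[@class='price']"]

ranked = rank_xpaths(pages, samples, candidates)
for path, score in ranked:
    print(f"{score:.2f}  {path}")
```

In this toy setup only `.//h1` selects values that appear in the sample set, so it is ranked first. A realistic ranking would also have to handle partial matches, noisy samples, and paths that extract the right value on some pages but not others.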

Copyright
© 2013, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-13)
Series
Advances in Intelligent Systems Research
Publication Date
August 2013
ISBN
978-90786-77-78-9
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Oliver Jundt
AU  - Maurice Van Keulen
PY  - 2013/08
DA  - 2013/08
TI  - Sample-based XPath Ranking for Web Information Extraction
BT  - Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-13)
PB  - Atlantis Press
SP  - 187
EP  - 194
SN  - 1951-6851
UR  - https://doi.org/10.2991/eusflat.2013.27
DO  - https://doi.org/10.2991/eusflat.2013.27
ID  - Jundt2013/08
ER  -