Sample-based XPath Ranking for Web Information Extraction
- DOI
- 10.2991/eusflat.2013.27
- Keywords
- Web information extraction; Wrappers; XPath ranking
- Abstract
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract the information for certain fields from a specific website. Manually creating and maintaining wrappers for all target websites is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from a wide variety of websites, possibly including previously unseen ones. This paper approaches the problem of web information extraction from an angle that enables automatic on-the-fly wrapper creation. The approach is a form of wrapper induction that uses a small set of data samples to rank XPaths by their suitability for extracting one particular field from the web pages of a certain site. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted information. Moreover, it appears that 20 to 25 input samples suffice for finding the right XPath for a field.
- Copyright
- © 2013, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
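The sample-based ranking idea summarized in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the candidate generation, the exact-match scoring rule, the function names `paths_to_value` and `rank_xpaths`, and the three sample pages are all assumptions made for illustration. It also assumes well-formed markup so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can parse the pages; real web pages would need an HTML parser.

```python
import xml.etree.ElementTree as ET


def paths_to_value(root, value):
    """Collect candidate paths (relative to root) for nodes whose text equals value."""
    found = []

    def walk(elem, path):
        for child in elem:
            plain = f"{path}/{child.tag}"
            if (child.text or "").strip() == value:
                # Candidate 1: the plain tag path.
                found.append(plain)
                # Candidate 2: the same path, qualified by the class attribute.
                cls = child.get("class")
                if cls:
                    found.append(f"{plain}[@class='{cls}']")
            walk(child, plain)

    walk(root, ".")
    return found


def rank_xpaths(samples):
    """Rank candidate paths by the fraction of samples they extract exactly."""
    roots = [(ET.fromstring(page), value) for page, value in samples]

    candidates = set()
    for root, value in roots:
        candidates.update(paths_to_value(root, value))

    scores = {}
    for path in candidates:
        hits = 0
        for root, value in roots:
            texts = [(e.text or "").strip() for e in root.findall(path)]
            # A hit means the path selects exactly one node holding the sample value.
            if texts == [value]:
                hits += 1
        scores[path] = hits / len(roots)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical samples: (page markup, known value of the "price" field).
SAMPLES = [
    ("<html><body><h1>Lamp</h1><div><span class='old'>30</span>"
     "<span class='price'>25</span></div></body></html>", "25"),
    ("<html><body><h1>Chair</h1><div>"
     "<span class='price'>80</span></div></body></html>", "80"),
    ("<html><body><h1>Desk</h1><div><span class='old'>200</span>"
     "<span class='price'>150</span></div></body></html>", "150"),
]

if __name__ == "__main__":
    for path, score in rank_xpaths(SAMPLES):
        print(f"{score:.2f}  {path}")
```

On this toy data the class-qualified path matches every sample, while the plain `span` path also selects the old price on two of the three pages and therefore ranks lower. More samples give the ranking more evidence to separate such candidates.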
Cite this article
TY - CONF
AU - Oliver Jundt
AU - Maurice Van Keulen
PY - 2013/08
DA - 2013/08
TI - Sample-based XPath Ranking for Web Information Extraction
BT - Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-13)
PB - Atlantis Press
SP - 187
EP - 194
SN - 1951-6851
UR - https://doi.org/10.2991/eusflat.2013.27
DO - 10.2991/eusflat.2013.27
ID - Jundt2013/08
ER -