An Approach of Web Page Information Extraction
Authors
Yaohui Li, Lixia Wang, Jianxiong Wang, Jie Yue, Mingzhan Zhao
Corresponding Author
Yaohui Li
Available Online March 2013.
- DOI
- 10.2991/iccsee.2013.556
- Keywords
- Information extraction, DOM, page segmentation, HTML tag
- Abstract
The Web has become the largest information source, but noise content is an inevitable part of any web page. Noise content reduces the accuracy of search engines and increases server load. Information extraction technology has been developed in response, and most of it is based on page segmentation. Based on an analysis of existing page segmentation methods, an approach to web page information extraction is proposed. Block nodes are identified by analyzing the attributes of HTML tags. The algorithm is easy to implement, and experiments show that it performs well.
- Copyright
- © 2013, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
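The abstract states that block nodes are identified by analyzing the attributes of HTML tags, but it does not give the actual rules. The following Python sketch illustrates the general idea only and should not be read as the authors' algorithm. The tag set `BLOCK_TAGS`, the keyword list `NOISE_HINTS`, and the names `is_noise_block`, `ContentExtractor`, and `extract_content` are all assumptions introduced here for illustration. The sketch treats an element as a noise block when its `id` or `class` attribute contains a typical navigation or advertising keyword, and keeps the text found outside such blocks.

```python
from html.parser import HTMLParser

# Assumed heuristics for illustration; the paper's real rules are not given in the abstract.
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "section", "article", "aside", "p"}
NOISE_HINTS = ("nav", "menu", "footer", "sidebar", "banner", "advert")
# Void elements never have a closing tag, so they are not tracked as open elements.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_noise_block(tag, attrs):
    """Return True if the element looks like a noise block (hypothetical rule)."""
    if tag in ("script", "style"):
        return True
    if tag not in BLOCK_TAGS:
        return False
    # Join the id and class values and look for a noise keyword in them.
    label = " ".join(v for k, v in attrs if k in ("id", "class") and v).lower()
    return any(hint in label for hint in NOISE_HINTS)


class ContentExtractor(HTMLParser):
    """Collects text fragments that are not inside a noise block."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # (tag, inside_noise) for each open element
        self.fragments = []      # extracted text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_noisy = self.open_elements[-1][1] if self.open_elements else False
        self.open_elements.append((tag, parent_noisy or is_noise_block(tag, attrs)))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray closing tags.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        inside_noise = self.open_elements[-1][1] if self.open_elements else False
        if text and not inside_noise:
            self.fragments.append(text)


def extract_content(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


sample = (
    '<html><body>'
    '<div id="nav-bar"><a href="/">Home</a></div>'
    '<div class="content"><p>Main story text.</p></div>'
    '<div class="footer">Copyright</div>'
    '</body></html>'
)
print(extract_content(sample))  # ['Main story text.']
```

A keyword match on `id` and `class` is only one possible attribute-based rule; a fuller implementation would presumably also weigh properties such as link density or block size, which the abstract does not describe.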
Cite this article
TY - CONF
AU - Yaohui Li
AU - Lixia Wang
AU - Jianxiong Wang
AU - Jie Yue
AU - Mingzhan Zhao
PY - 2013/03
DA - 2013/03
TI - An Approach of Web Page Information Extraction
BT - Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013)
PB - Atlantis Press
SP - 2217
EP - 2219
SN - 1951-6851
UR - https://doi.org/10.2991/iccsee.2013.556
DO - 10.2991/iccsee.2013.556
ID - Li2013/03
ER -