Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013)

An Approach of Web Page Information Extraction

Authors
Yaohui Li, Lixia Wang, Jianxiong Wang, Jie Yue, Mingzhan Zhao
Corresponding Author
Yaohui Li
Available Online March 2013.
DOI
10.2991/iccsee.2013.556
Keywords
Information extraction, DOM, page segmentation, HTML tag
Abstract

The Web has become the largest information source, but noise content is an inevitable part of any web page. Noise content reduces the precision of search engines and increases server load. Information extraction technology has been developed to address this problem, and it is mostly based on page segmentation. Based on an analysis of existing page segmentation methods, an approach to web page information extraction is proposed in which block nodes are identified by analyzing the attributes of HTML tags. The algorithm is easy to implement, and experiments show that it performs well.
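
The abstract does not say which tags or attributes the algorithm inspects, so the following is only a minimal sketch of the general idea of identifying block nodes from HTML tag attributes, not the authors' method. The set of block tags, the noise hints matched against `id` and `class`, and the names `BlockCollector` and `extract_content` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choices, not taken from the paper: which tags count as block
# containers and which id/class substrings suggest noise content.
BLOCK_TAGS = {"div", "table", "td", "ul", "section"}
NOISE_HINTS = ("nav", "menu", "footer", "sidebar", "banner", "advert")


class BlockCollector(HTMLParser):
    """Collects text per block node and flags blocks whose attributes look like noise."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open block nodes, innermost last
        self.blocks = []  # closed blocks as (tag, is_noise, text)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            attr_text = " ".join(
                (value or "").lower() for name, value in attrs if name in ("id", "class")
            )
            # A block is noise if its own attributes match a hint,
            # or if it is nested inside a block already flagged as noise.
            inherited = bool(self.stack) and self.stack[-1]["noise"]
            noise = inherited or any(hint in attr_text for hint in NOISE_HINTS)
            self.stack.append({"tag": tag, "noise": noise, "text": []})

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            node = self.stack.pop()
            text = " ".join(node["text"])
            if text:
                self.blocks.append((node["tag"], node["noise"], text))

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack:
            # Text is attributed to the innermost open block only.
            self.stack[-1]["text"].append(data)


def extract_content(html):
    """Return the text of blocks not flagged as noise, in the order the blocks close."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [text for _tag, noise, text in parser.blocks if not noise]
```

Under these assumptions, a page made of a `<div id="nav">` menu, a `<div class="article">` body, and a `<div id="footer">` would yield only the text of the article block. A real implementation would likely need additional attribute rules and handling for malformed markup.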

Copyright
© 2013, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013)
Series
Advances in Intelligent Systems Research
Publication Date
March 2013
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Yaohui Li
AU  - Lixia Wang
AU  - Jianxiong Wang
AU  - Jie Yue
AU  - Mingzhan Zhao
PY  - 2013/03
DA  - 2013/03
TI  - An Approach of Web Page Information Extraction
BT  - Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013)
PB  - Atlantis Press
SP  - 2217
EP  - 2219
SN  - 1951-6851
UR  - https://doi.org/10.2991/iccsee.2013.556
DO  - 10.2991/iccsee.2013.556
ID  - Li2013/03
ER  -