Proceedings of the 2016 4th International Conference on Sensors, Mechatronics and Automation (ICSMA 2016)

A Type of Web Content Extraction Algorithm Based on Adaptive Threshold

Authors
Guang Zheng, Xianghui Hui, Xin Xu, Lei Xi
Corresponding Author
Guang Zheng
Available Online December 2016.
DOI
10.2991/icsma-16.2016.45
Keywords
new rural community; Web information fetching; text density; adaptive threshold; Otsu threshold algorithm; Web page text extraction algorithm
Abstract

Building on text-density-based extraction, this paper proposes a Web page text extraction algorithm that combines text density with an adaptive threshold obtained from the Otsu threshold algorithm. The algorithm was applied in a new rural community employment information service system to fetch employment information from related government affairs websites. In comparative text extraction experiments on Web pages from "The Ministry of Human Resources and Social Security of the People's Republic of China", "The Department of Human Resources and Social Security of Henan Province" and "Sina.com", the algorithm reached text extraction rates of 90%, 92% and 92%, respectively. The results show that applying the algorithm in the new rural community employment information service system can provide technical support for targeted employment information acquisition and enable accurate employment information retrieval.
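
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that text density is measured per source line as the share of characters left after HTML tags are stripped, and that the Otsu method picks the density cut that maximizes between-class variance. The function names, the regular expression used to strip tags, and the 32-bin histogram are all illustrative assumptions.

```python
import re

TAG = re.compile(r"<[^>]*>")  # crude tag matcher, assumed for illustration only


def text_density(line):
    """Share of characters in a raw HTML line that remain once tags are removed."""
    raw = line.strip()
    if not raw:
        return 0.0
    visible = TAG.sub("", raw).strip()
    return len(visible) / len(raw)


def otsu_threshold(values, bins=32):
    """Return the cut in (0, 1] that maximizes between-class variance (Otsu)."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0
    for t in range(bins - 1):
        w0 += hist[t]            # number of samples in the low-density class
        sum0 += t * hist[t]
        w1 = total - w0          # number of samples in the high-density class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_bin = var, t
    return (best_bin + 1) / bins


def extract_text(html):
    """Keep the lines whose text density reaches the page-specific Otsu cut."""
    lines = [ln for ln in html.splitlines() if ln.strip()]
    densities = [text_density(ln) for ln in lines]
    cut = otsu_threshold(densities)
    return [TAG.sub("", ln).strip() for ln, d in zip(lines, densities) if d >= cut]


sample_page = "\n".join([
    '<div class="nav"><a href="/a">Home</a><a href="/b">News</a></div>',
    "<p>The ministry published new employment guidance for rural communities.</p>",
    "<p>Applicants can register at local service centres and receive training.</p>",
    '<div class="footer"><a href="/c"><img src="x.png"></a></div>',
])
body_lines = extract_text(sample_page)
```

Because the cut is recomputed from each page's own density distribution, no fixed threshold has to be tuned per site, which is the sense in which the threshold is adaptive in this sketch.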

Copyright
© 2016, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2016 4th International Conference on Sensors, Mechatronics and Automation (ICSMA 2016)
Series
Advances in Intelligent Systems Research
Publication Date
December 2016
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Guang Zheng
AU  - Xianghui Hui
AU  - Xin Xu
AU  - Lei Xi
PY  - 2016/12
DA  - 2016/12
TI  - A Type of Web Content Extraction Algorithm Based on Adaptive Threshold
BT  - Proceedings of the 2016 4th International Conference on Sensors, Mechatronics and Automation (ICSMA 2016)
PB  - Atlantis Press
SP  - 244
EP  - 250
SN  - 1951-6851
UR  - https://doi.org/10.2991/icsma-16.2016.45
DO  - 10.2991/icsma-16.2016.45
ID  - Zheng2016/12
ER  -