Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering

Automatic Recognition of the Hits Line in Search Engine Result Page

Authors
Qian Haibo, Qian ZhongMin
Corresponding Author
Qian Haibo
Available Online December 2015.
DOI
10.2991/nceece-15.2016.68
Keywords
Automatic Recognition; Hits line; Search Engine; Information Extraction; Decision tree
Abstract

When a search engine returns query results to users, it always returns the number of relevant documents (i.e., hits). The text line containing the hits number is called the hits line. The hits number can be used in several applications, such as building meta-search engines and estimating the size and relevance of search engines. Because the hits line is mixed with other text lines in the result page, it is difficult to recognize and extract automatically. To this end, decision tree techniques are combined with a heuristic approach to build two filters that automatically identify the hits line. First, the text in result pages is automatically extracted line by line. Then four key features are identified and used to build a decision tree from a learning sample of search engines. Classification rules derived from the tree serve as the first filter for the extracted text lines. To reduce misclassification by the first filter, a second filter is constructed using a heuristic weighting approach. An experiment on 100 search engines shows that the accuracy under 10-fold cross-validation reaches up to 95%.
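The two-filter pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not name the four features, the tree's rules, or the weights, so everything specific below is an assumption made for illustration: the cue-word list, the feature names, the hand-written rule in `first_filter` (a stand-in for rules read off a trained decision tree), and the weights in `score`. It is not the authors' implementation.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue words; the paper's actual features are not given in the abstract.
HIT_WORDS = ("results", "hits", "found", "matches", "documents")
NUMBER_RE = re.compile(r"\d[\d,.]*")
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "title"}


class LineExtractor(HTMLParser):
    """Collect visible text, starting a new line at each block-level tag."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_lines(html):
    """Step 1: extract the page text as a list of lines."""
    parser = LineExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines


def features(line, index, total):
    """Step 2: four illustrative (assumed) features per text line."""
    lower = line.lower()
    return {
        "has_number": bool(NUMBER_RE.search(line)),
        "cue_words": sum(w in lower for w in HIT_WORDS),
        "length": len(line),
        "position": index / max(total - 1, 1),  # 0.0 = top of page
    }


def first_filter(f):
    """Filter 1: hand-written rule standing in for decision-tree rules."""
    return f["has_number"] and f["cue_words"] >= 1 and f["length"] <= 120


def score(f):
    """Filter 2: assumed heuristic weights that rank surviving candidates."""
    return 2.0 * f["cue_words"] + (1.0 - f["position"]) - 0.01 * f["length"]


def find_hits_line(html):
    lines = extract_lines(html)
    candidates = []
    for i, line in enumerate(lines):
        f = features(line, i, len(lines))
        if first_filter(f):
            candidates.append((score(f), line))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

Calling `find_hits_line(page_html)` returns the highest-scoring candidate line, or `None` when no line passes the first filter. In the approach the abstract describes, the rule in `first_filter` would instead be derived from a decision tree trained on labeled lines from sample search engines.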

Copyright
© 2016, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
Series
Advances in Engineering Research
Publication Date
December 2015
ISBN
978-94-6252-150-6
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Qian Haibo
AU  - Qian ZhongMin
PY  - 2015/12
DA  - 2015/12
TI  - Automatic Recognition of the Hits Line in Search Engine Result Page
BT  - Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
PB  - Atlantis Press
SP  - 342
EP  - 348
SN  - 2352-5401
UR  - https://doi.org/10.2991/nceece-15.2016.68
DO  - 10.2991/nceece-15.2016.68
ID  - Haibo2015/12
ER  -