Automatic Recognition of the Hits Line in Search Engine Result Page

Qian Haibo; Qian ZhongMin

doi:10.2991/nceece-15.2016.68

Corresponding Author

Qian Haibo

Available Online December 2015.

DOI: 10.2991/nceece-15.2016.68
Keywords: Automatic Recognition; Hits line; Search Engine; Information Extraction; Decision tree
Abstract: When a search engine returns query results to users, it always returns the number of relevant documents (i.e., hits). The text line containing the hits number is called the hits line. The hits number can be used in several applications, such as building meta-search engines and estimating the size and relevance of search engines. Because the hits line is mixed with other text lines in the result page, it is difficult to recognize and extract it automatically. To this end, decision tree techniques are combined with a heuristic approach to build two filters that automatically identify the hits line. First, the text in result pages is automatically extracted line by line. Then four key features are identified and used to build a decision tree from a learning sample of search engines. Classification rules derived from the tree serve as the first filter for classifying the extracted text lines. To reduce misclassification by the first filter, a second filter is constructed using a heuristic weighting approach. An experiment on 100 search engines shows that the accuracy under 10-fold cross-validation is up to 95%.
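
The two-filter pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the four features, so the features, keyword list, rule thresholds, and weights below are all hypothetical, and the first filter is a hand-written rule standing in for rules that would be read off a trained decision tree.

```python
import re
from html.parser import HTMLParser

# Tags treated as line boundaries (an assumption of this sketch).
BLOCK_TAGS = {"div", "p", "br", "li", "td", "tr", "h1", "h2", "h3", "table", "ul"}
# Hypothetical cue words; the paper's actual features are not given in the abstract.
KEYWORDS = ("result", "hit", "match", "found", "document")
NUMBER = re.compile(r"\d")


class LineExtractor(HTMLParser):
    """Collects visible text from a result page, one entry per block-level run."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = []
        self._skip = 0  # depth inside <script>/<style>

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def features(line, index):
    """Four illustrative features of a text line (hypothetical choices)."""
    lowered = line.lower()
    return {
        "has_number": bool(NUMBER.search(line)),
        "keyword_count": sum(k in lowered for k in KEYWORDS),
        "word_count": len(line.split()),
        "position": index,
    }


def first_filter(f):
    # Stand-in for classification rules derived from a decision tree.
    return f["has_number"] and f["keyword_count"] >= 1 and f["word_count"] <= 15


def second_filter_score(f):
    # Hypothetical weights: more cue words, shorter and earlier lines score higher.
    return 2.0 * f["keyword_count"] - 0.1 * f["word_count"] - 0.05 * f["position"]


def find_hits_line(html):
    """Returns the best-scoring candidate line, or None if no line passes."""
    parser = LineExtractor()
    parser.feed(html)
    parser.close()
    candidates = []
    for i, line in enumerate(parser.lines):
        f = features(line, i)
        if first_filter(f):
            candidates.append((second_filter_score(f), line))
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


SAMPLE = """
<html><head><title>Search</title><script>var n = 10;</script></head>
<body>
<div>Web Images News</div>
<div>About 1,230,000 results (0.42 seconds)</div>
<li>10 ways to improve search results</li>
<p>Next page 2 3 4</p>
</body></html>
"""

if __name__ == "__main__":
    print(find_hits_line(SAMPLE))
```

In this sample, both the real hits line and the snippet "10 ways to improve search results" pass the rule-based first filter, and the weighted score is what separates them. In a faithful reproduction the rules would instead be learned from labeled result pages, and the weights tuned on the same sample.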
Copyright: © 2016, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title: Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
Series: Advances in Engineering Research
Publication Date: December 2015
ISBN: 978-94-6252-150-6
ISSN: 2352-5401

Cite this article

TY  - CONF
AU  - Qian Haibo
AU  - Qian ZhongMin
PY  - 2015/12
DA  - 2015/12
TI  - Automatic Recognition of the Hits Line in Search Engine Result Page
BT  - Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
PB  - Atlantis Press
SP  - 342
EP  - 348
SN  - 2352-5401
UR  - https://doi.org/10.2991/nceece-15.2016.68
DO  - 10.2991/nceece-15.2016.68
ID  - Haibo2015/12
ER  -