Automatic Recognition of the Hits Line in Search Engine Result Page
- DOI
- 10.2991/nceece-15.2016.68
- Keywords
- Automatic Recognition; Hits line; Search Engine; Information Extraction; Decision tree
- Abstract
When a search engine returns query results to users, it always reports the number of relevant documents (i.e., hits). The text line containing the hits number is called the hits line. The hits number can be used in several applications, such as building meta-search engines and estimating the size and relevance of search engines. Because the hits line is mixed with other text lines in the result page, it is difficult to recognize and extract automatically. To this end, decision tree techniques are combined with a heuristic approach to build two filters that automatically identify the hits line. First, the text in result pages is automatically extracted line by line. Then four key features are identified and used to build a decision tree from a learning sample of search engines. Classification rules derived from the tree serve as the first filter for recognizing the extracted text lines. To reduce misclassification by the first filter, a second filter is constructed using a heuristic weighting approach. An experiment on 100 search engines shows that the accuracy under 10-fold cross-validation is up to 95%.
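The two-filter idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the four features, the tree's rules, or the weights, so everything here is an assumption made for illustration: the cue words, the four features in `features`, the hand-written rules in `first_filter` (standing in for rules that would be read off a learned decision tree), and the weights in `score`.

```python
import re

# Hypothetical cue words; the abstract does not list the actual vocabulary.
CUE_WORDS = ("results", "hits", "found", "matches", "documents")
NUMBER = re.compile(r"\d[\d,.]*")


def features(line):
    """Compute four illustrative features for one text line (assumed, not the paper's)."""
    lowered = line.lower()
    return {
        "has_number": bool(NUMBER.search(line)),
        "has_cue": any(word in lowered for word in CUE_WORDS),
        "word_count": len(line.split()),
        "has_of": " of " in f" {lowered} ",
    }


def first_filter(line):
    """Hand-written rules standing in for rules derived from a learned decision tree."""
    f = features(line)
    return f["has_number"] and f["has_cue"] and f["word_count"] <= 15


def score(line):
    """Assumed heuristic weighting used to rank lines that pass the first filter."""
    f = features(line)
    return 2.0 * f["has_cue"] + 1.0 * f["has_of"] - 0.1 * f["word_count"]


def find_hits_line(lines):
    """Return the best-scoring candidate, or None if no line passes the first filter."""
    candidates = [line for line in lines if first_filter(line)]
    return max(candidates, key=score, default=None)


def extract_hits(line):
    """Take the largest number in the line, treating ',' and '.' as digit separators."""
    numbers = [int(re.sub(r"[^\d]", "", m)) for m in NUMBER.findall(line)]
    return max(numbers)


sample_lines = [
    "Example Search",
    "Web Images News",
    "Results 1 - 10 of about 1,230,000 for decision tree (0.21 seconds)",
    "Top 10 results for learning decision trees in 2015",
    "Copyright 2016 Example Inc.",
]
hits_line = find_hits_line(sample_lines)
```

In this sample, two lines pass the rule-based filter, and the weighting step is what separates the real hits line from a result title that happens to contain a number and a cue word. A faithful reproduction would instead train the tree on labeled lines from sample search engines and use the features and weights given in the full paper.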
- Copyright
- © 2016, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - CONF
AU  - Qian Haibo
AU  - Qian ZhongMin
PY  - 2015/12
DA  - 2015/12
TI  - Automatic Recognition of the Hits Line in Search Engine Result Page
BT  - Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
PB  - Atlantis Press
SP  - 342
EP  - 348
SN  - 2352-5401
UR  - https://doi.org/10.2991/nceece-15.2016.68
DO  - 10.2991/nceece-15.2016.68
ID  - Haibo2015/12
ER  -