Research on Feature Selection and kNN Classification Method in Chinese Text Classification
- DOI
- 10.2991/nceece-15.2016.172
- Keywords
- Chinese text classification; feature selection; text similarity; kNN; unbalanced degree of term distribution
- Abstract
Researchers in China and abroad have studied feature selection methods for Chinese text classification extensively, including document frequency (DF), information gain (IG), and the χ² test (CHI). Building on that work, we propose a new selection method that measures the unbalanced degree of term distribution, compare it with the other feature selection methods using the k-nearest-neighbor (kNN) algorithm, and find that it performs as well as CHI and IG. Experiments show that, whichever feature selection method is chosen, the gain in classification accuracy becomes very slight once the number of features reaches a certain value. Increasing the feature dimension further hardly improves classification performance, while the time consumed doubles. We therefore attempt to improve the kNN method by computing text similarity differently. The improved method encodes each feature's weight as a bit string and computes the similarity of two documents from their bit-string representations, which remarkably reduces both the space required to store documents and the time needed to compute their similarity. Experiments confirm that the new kNN method greatly accelerates classification at the cost of a small loss in classification accuracy.
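The abstract does not specify the bit-string encoding or the similarity measure, so the following is only a minimal Python sketch of the general idea, not the authors' method. It assumes one bit per feature (set when the weight exceeds a hypothetical `threshold`), packs a document into a single integer, and uses Jaccard similarity over the set bits. The names `to_bits`, `bit_similarity`, and `knn_classify`, along with the toy data, are illustrative assumptions.

```python
from collections import Counter


def to_bits(weights, threshold=0.0):
    """Pack a weight vector into an int; bit i is set if weights[i] > threshold.

    Assumption: one bit per feature. The paper may use more bits per weight.
    """
    bits = 0
    for i, w in enumerate(weights):
        if w > threshold:
            bits |= 1 << i
    return bits


def bit_similarity(a, b):
    """Jaccard similarity of two bit strings (an assumed measure)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def knn_classify(query_bits, train, k=3):
    """Majority vote among the k training documents most similar to the query.

    train is a list of (bits, label) pairs.
    """
    nearest = sorted(
        train,
        key=lambda item: bit_similarity(query_bits, item[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: four features, two classes (hypothetical values).
train = [
    (to_bits([0.9, 0.4, 0.0, 0.0]), "sports"),
    (to_bits([0.7, 0.0, 0.1, 0.0]), "sports"),
    (to_bits([0.0, 0.0, 0.8, 0.6]), "finance"),
    (to_bits([0.0, 0.2, 0.5, 0.9]), "finance"),
]
query = to_bits([0.8, 0.3, 0.0, 0.0])
predicted = knn_classify(query, train, k=3)
```

Because each document is a single integer, comparing two documents takes a few bitwise operations and a bit count instead of a floating-point dot product over every feature. This is the kind of space and time saving the abstract describes, traded against the precision lost when weights are reduced to bits.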
- Copyright
- © 2016, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Chao Xiao
AU - Ping Wu
PY - 2015/12
DA - 2015/12
TI - Research on Feature Selection and kNN Classification Method in Chinese Text Classification
BT - Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
PB - Atlantis Press
SP - 956
EP - 962
SN - 2352-5401
UR - https://doi.org/10.2991/nceece-15.2016.172
DO - 10.2991/nceece-15.2016.172
ID - Xiao2015/12
ER -