N-grams based feature selection and text representation for Chinese Text Classification

Zhihua Wei; Duoqian Miao; Jean-Hugues Chauchat; Rui Zhao; Wen Li

doi:10.2991/ijcis.2009.2.4.5


Volume 2, Issue 4, December 2009, Pages 365 - 374


Corresponding Author

Zhihua Wei

Received 30 December 2008, Accepted 28 May 2009, Available Online 1 December 2009.

DOI: 10.2991/ijcis.2009.2.4.5
Keywords: Chinese text classification, n-gram, feature selection, text representation weight
Abstract: This paper discusses n-gram-based text representation and feature selection strategies for Chinese text classification. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text representation weights are compared through exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which includes more than 14,000 texts divided into 12 classes. The experiments cover: (1) a performance comparison among feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency, and relative n-gram frequency; (2) a comparison of the sparseness and feature correlation in the “text by feature” matrices produced by the four feature selection methods; (3) a performance comparison among three term weights: 0/1 logical value, n-gram frequency numeric value (TF), and TF*IDF value.
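The three term weights named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `char_ngrams` and `weight_vectors` are hypothetical, character bigrams are an assumed choice of n, and the TF*IDF variant shown (raw count times log(N/df)) is a common textbook form that the paper may or may not use.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def weight_vectors(docs, n=2):
    """Build 0/1, TF, and TF*IDF rows of a "text by feature" matrix.

    Assumption: IDF is log(N / df), one common variant; the paper's
    exact formula is not given on this page.
    """
    # One n-gram count table per text.
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Feature set: every n-gram seen in any text, in sorted order.
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of texts containing each n-gram.
    df = {g: sum(1 for c in counts if g in c) for g in vocab}

    binary = [[1 if g in c else 0 for g in vocab] for c in counts]
    tf = [[c[g] for g in vocab] for c in counts]
    tfidf = [[c[g] * math.log(n_docs / df[g]) for g in vocab] for c in counts]
    return vocab, binary, tf, tfidf


# Usage on two toy texts (illustrative data, not from TanCorpV1.0).
vocab, binary, tf, tfidf = weight_vectors(["中国中国", "中文"], n=2)
```

Each returned matrix has one row per text and one column per n-gram in `vocab`, so the same feature set can be fed to a classifier under any of the three weightings.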
Copyright: © 2009, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).


Journal: International Journal of Computational Intelligence Systems
Volume-Issue: 2 - 4
Pages: 365 - 374
Publication Date: 2009/12/01
ISSN (Online): 1875-6883
ISSN (Print): 1875-6891

Cite this article


TY  - JOUR
AU  - Zhihua Wei
AU  - Duoqian Miao
AU  - Jean-Hugues Chauchat
AU  - Rui Zhao
AU  - Wen Li
PY  - 2009
DA  - 2009/12/01
TI  - N-grams based feature selection and text representation for Chinese Text Classification
JO  - International Journal of Computational Intelligence Systems
SP  - 365
EP  - 374
VL  - 2
IS  - 4
SN  - 1875-6883
UR  - https://doi.org/10.2991/ijcis.2009.2.4.5
DO  - 10.2991/ijcis.2009.2.4.5
ID  - Wei2009
ER  -
