Short text model based on Strong feature thesaurus
- DOI
- 10.2991/isrme-15.2015.126
- Keywords
- Short Text Model; Data Sparseness; Strong Feature; Latent Dirichlet Allocation; Clustering
- Abstract
Data sparseness, a defining characteristic of short text, is caused by the diversity of language expression and the short length of the texts. Previous text models, represented by Bag of Words (BOW), consider only the statistical features of words and thus tend to underperform on short texts. To tackle this problem, we introduce a new text model that combines statistical methods with semantic estimation. Specifically, we obtain a “Strong Feature Thesaurus” through a mining process based on the Latent Dirichlet Allocation (LDA) model, and then incorporate the semantic information into BOW by weighting those strong feature terms. To assess the performance of this model, we conduct two experiments on clustering short text corpora. The results show that our model outperforms prevailing text models such as BOW.
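The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how strong features are selected or what weights are used, so everything here is an assumption for illustration: the `topic_word` table is a hypothetical stand-in for the output of a trained LDA model, the top-`top_n`-words-per-topic selection rule is a guess, and `boost` is an arbitrary multiplier. None of it should be read as the authors' actual procedure.

```python
from collections import Counter


def build_strong_feature_thesaurus(topic_word, top_n=2):
    """Collect the top_n highest-probability words of every topic.

    Assumption: "strong features" are taken to be the most probable
    words per topic; the paper may use a different criterion.
    """
    thesaurus = set()
    for word_probs in topic_word.values():
        ranked = sorted(word_probs, key=word_probs.get, reverse=True)
        thesaurus.update(ranked[:top_n])
    return thesaurus


def weighted_bow(tokens, thesaurus, boost=2.0):
    """Return term counts, multiplying strong-feature terms by boost."""
    counts = Counter(tokens)
    return {
        word: count * boost if word in thesaurus else float(count)
        for word, count in counts.items()
    }


# Hypothetical topic-word probabilities standing in for a trained LDA model.
topic_word = {
    "sports": {"match": 0.30, "goal": 0.25, "team": 0.10},
    "finance": {"stock": 0.35, "market": 0.20, "team": 0.05},
}

thesaurus = build_strong_feature_thesaurus(topic_word, top_n=2)
vector = weighted_bow("the team scored a late goal".split(), thesaurus)
```

In this toy run, "goal" is in the thesaurus and its count is doubled, while "team" ranks third in both topics and keeps its plain count. The resulting vectors could then be passed to any standard clustering routine in place of raw BOW counts.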
- Copyright
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Wentao Lu
AU - Yongfeng Huang
AU - Xing Li
AU - Zhuo Zhang
AU - Yingkun Li
PY - 2015/04
DA - 2015/04
TI - Short text model based on Strong feature thesaurus
BT - Proceedings of the 2015 International Conference on Intelligent Systems Research and Mechatronics Engineering
PB - Atlantis Press
SP - 620
EP - 625
SN - 1951-6851
UR - https://doi.org/10.2991/isrme-15.2015.126
DO - 10.2991/isrme-15.2015.126
ID - Lu2015/04
ER -