Research on Chinese segmentation algorithm based on Hadoop cloud platform
- DOI
- 10.2991/itoec-15.2015.29
- Keywords
- Chinese word segmentation; ICTCLAS; IKAnalyzer; Inverted descending order; HDFS; MapReduce; Hadoop
- Abstract
IKAnalyzer (IK) and ICTCLAS (IC) are popular Chinese word segmentation algorithms that play an important role in processing text data in a stand-alone environment. In this paper, we compare the performance of the IK and IC algorithms through theory and experiments on the problem of segmenting massive Chinese text and on its optimized solution. That solution runs on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework to process large data sets in parallel. The experimental results indicate that the optimized IC and IK algorithms are well suited to massive Chinese text segmentation. In addition, to present the processed data set more clearly and directly, we introduce an inverted descending order over the word frequencies of the segmentation output. This comparative study of the two Chinese segmentation algorithms on the Hadoop platform provides strong support for the efficient processing of massive Chinese text.
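As an illustration only, the word-frequency pipeline described in the abstract can be sketched in plain Python. This is not the authors' implementation: it simulates the map, shuffle, and reduce steps in a single process rather than on a Hadoop cluster, and the `segment` function is a hypothetical stand-in that splits on whitespace instead of calling IKAnalyzer or ICTCLAS. The "inverted descending order" is read here as inverting each (word, count) pair to (count, word) and sorting by count from high to low, which is an assumption about what the paper means.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def segment(line: str) -> List[str]:
    # Hypothetical placeholder segmenter: assumes the text is already
    # separated by whitespace. A real job would call IK or IC here.
    return line.split()


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Map step: emit (word, 1) for every segmented word.
    for line in lines:
        for word in segment(line):
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle step: group all emitted counts by word.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    # Reduce step: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}


def inverted_descending(freqs: Dict[str, int]) -> List[Tuple[str, int]]:
    # Invert (word, count) to (count, word), then sort by count from
    # high to low. Ties are broken by word so the output is deterministic.
    inverted = [(count, word) for word, count in freqs.items()]
    inverted.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, count) for count, word in inverted]


def word_frequencies(lines: Iterable[str]) -> List[Tuple[str, int]]:
    # Full pipeline: map, shuffle, reduce, then frequency ordering.
    return inverted_descending(reduce_phase(shuffle(map_phase(lines))))


if __name__ == "__main__":
    sample = ["中文 分词 算法", "分词 算法", "分词"]
    for word, count in word_frequencies(sample):
        print(word, count)
```

On a real cluster the ordering step would typically be a second MapReduce job, since the framework sorts by key and the count must therefore become the key.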
- Copyright
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Hong Chen
PY - 2015/03
DA - 2015/03
TI - Research on Chinese segmentation algorithm based on Hadoop cloud platform
BT - Proceedings of the 2015 Information Technology and Mechatronics Engineering Conference
PB - Atlantis Press
SP - 134
EP - 138
SN - 2352-538X
UR - https://doi.org/10.2991/itoec-15.2015.29
DO - 10.2991/itoec-15.2015.29
ID - Chen2015/03
ER -