Proceedings of the 2015 Information Technology and Mechatronics Engineering Conference

Research on Chinese segmentation algorithm based on Hadoop cloud platform

Authors
Hong Chen
Corresponding Author
Hong Chen
Available Online March 2015.
DOI
10.2991/itoec-15.2015.29
Keywords
Chinese word segmentation; ICTCLAS; IKAnalyzer; Inverted descending order; HDFS; MapReduce; Hadoop
Abstract

IKAnalyzer (IK) and ICTCLAS (IC) are widely used Chinese word segmentation algorithms that play an important role in processing text data in a stand-alone environment. In this paper, we compare the performance of the IK and IC algorithms through theory and experiments on the problem of segmenting massive Chinese text. We also report an optimized solution that runs on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework to process large data sets in parallel. The experimental results indicate that the optimized IC and IK algorithms perform favorably on massive Chinese text segmentation problems. In addition, to present the processed data sets more clearly and directly, we apply an inverted descending order to the word frequencies produced by segmentation. This comparative study of the two Chinese segmentation algorithms on the Hadoop platform provides strong support for the efficient processing of massive Chinese text.
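
As a rough illustration of the pipeline summarized above (segment the text, count words in a MapReduce style, then list word frequencies in descending order), the following single-process Python sketch may help. It is not the paper's implementation and uses no Hadoop or HDFS API. The `segment` function is a toy forward-maximum-matching stand-in, not IKAnalyzer or ICTCLAS, and `TOY_DICT`, `MAX_WORD_LEN`, `map_phase`, `reduce_phase`, and `word_frequencies` are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Hypothetical toy lexicon; real segmenters ship far larger dictionaries.
TOY_DICT = {"中文", "分词", "算法", "研究"}
MAX_WORD_LEN = 2  # longest entry in TOY_DICT


def segment(text):
    """Toy forward maximum matching over TOY_DICT (an assumption for this
    sketch; it stands in for IKAnalyzer or ICTCLAS and is neither of them)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in TOY_DICT:
                words.append(piece)
                i += size
                break
    return words


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every segmented word in one line."""
    return [(word, 1) for word in segment(line)]


def reduce_phase(pairs):
    """Shuffle and reduce in one step: group pairs by word and sum counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_frequencies(lines):
    """Run map and reduce, then sort by frequency, highest first.
    Ties are broken by the word itself so the order is deterministic."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    counts = reduce_phase(pairs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `word_frequencies(["中文分词算法", "中文分词研究", "中文"])` returns the word counts with the most frequent word first. On a real cluster the mapper and reducer would run as separate distributed tasks over HDFS input splits, and the final ordering would typically need its own sorting step.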

Copyright
© 2015, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2015 Information Technology and Mechatronics Engineering Conference
Series
Advances in Computer Science Research
Publication Date
March 2015
ISBN
978-94-62520-52-3
ISSN
2352-538X

Cite this article

TY  - CONF
AU  - Hong Chen
PY  - 2015/03
DA  - 2015/03
TI  - Research on Chinese segmentation algorithm based on Hadoop cloud platform
BT  - Proceedings of the 2015 Information Technology and Mechatronics Engineering Conference
PB  - Atlantis Press
SP  - 134
EP  - 138
SN  - 2352-538X
UR  - https://doi.org/10.2991/itoec-15.2015.29
DO  - 10.2991/itoec-15.2015.29
ID  - Chen2015/03
ER  -