Proceedings of the 2012 National Conference on Information Technology and Computer Science

The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information

Authors
Pengyu Lu, Lijun Jin, Bin Jiang
Corresponding Author
Pengyu Lu
Available Online November 2012.
DOI
10.2991/citcs.2012.111
Keywords
word segmentation; word frequency; n-gram word; corpus type frequency information
Abstract

To address the difficulty of extracting words in a particular domain, we formulate a method for automatic Chinese word segmentation based on corpus type frequency information. The method can effectively extract n-gram words that are not predefined in a lexicon. It takes two settings: the maximum length (n) of the n-gram words to extract from a sentence, and the minimum threshold for the frequency with which an n-gram word appears in the corpus. An n-gram word is extracted when its actual frequency in the corpus is above the threshold. If two or more n-grams have the same length, the one with the higher frequency is chosen first, followed by the one with the next-highest frequency, provided that none of its characters are in the previously chosen n-gram.
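
The selection rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `count_ngrams` and `segment`, the use of raw occurrence counts as the frequency, the leftmost-first tie-break for equal frequencies, and the emission of leftover characters as single-character tokens are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(corpus, max_n):
    """Count every substring of length 2..max_n in the corpus (hypothetical helper)."""
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def segment(sentence, counts, max_n, min_freq):
    """Segment one sentence: longest n-grams first, then highest frequency first."""
    taken = [False] * len(sentence)
    spans = []
    for n in range(max_n, 1, -1):  # maximum-length n-grams get priority
        candidates = [
            (counts[sentence[i:i + n]], i)
            for i in range(len(sentence) - n + 1)
            if counts[sentence[i:i + n]] > min_freq  # strictly above the threshold
        ]
        # Higher frequency first; equal frequencies fall back to leftmost (assumption).
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for _, i in candidates:
            # Accept only if no character is already part of a chosen n-gram.
            if not any(taken[i:i + n]):
                for j in range(i, i + n):
                    taken[j] = True
                spans.append((i, i + n))
    # Assumption: characters not covered by any n-gram become single-character tokens.
    spans.extend((i, i + 1) for i, used in enumerate(taken) if not used)
    spans.sort()
    return [sentence[a:b] for a, b in spans]


corpus = ["我爱北京", "北京很大", "我爱上海"]
counts = count_ngrams(corpus, 2)
print(segment("北京很大", counts, 2, 1))  # ['北京', '很', '大']
```

In this toy corpus only 我爱 and 北京 occur more than once, so they are the only bigrams that pass a threshold of 1. The abstract does not say how the threshold or n should be chosen, so the values here are arbitrary.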

Copyright
© 2012, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2012 National Conference on Information Technology and Computer Science
Series
Advances in Intelligent Systems Research
Publication Date
November 2012
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Pengyu Lu
AU  - Lijun Jin
AU  - Bin Jiang
PY  - 2012/11
DA  - 2012/11
TI  - The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information
BT  - Proceedings of the 2012 National Conference on Information Technology and Computer Science
PB  - Atlantis Press
SP  - 426
EP  - 429
SN  - 1951-6851
UR  - https://doi.org/10.2991/citcs.2012.111
DO  - 10.2991/citcs.2012.111
ID  - Lu2012/11
ER  -