Proceedings of the 2012 National Conference on Information Technology and Computer Science

The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information

Authors
Pengyu Lu, Lijun Jin, Bin Jiang
Corresponding Author
Pengyu Lu
Available Online November 2012.
DOI
https://doi.org/10.2991/citcs.2012.111
Keywords
word segmentation; word frequency; n-gram word; corpus type frequency information
Abstract
To address the difficulty of extracting words in a particular domain, we formulate a method of automatic Chinese word segmentation based on corpus type frequency information. The method can effectively extract n-gram words that are not predefined in a lexicon. It takes two settings: the maximum length (n) of the n-gram words to extract from a sentence, and the minimum threshold frequency with which an n-gram word must appear in the corpus. When the actual frequency of an n-gram in the corpus is above the threshold, the n-gram is extracted as a word. If two or more n-grams have the same length, the one with the higher frequency is chosen first, followed by the next most frequent one if any of its characters are not in the previously chosen one.
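The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not settle: n-grams are counted over characters with lengths from 2 to n, an n-gram qualifies when its count is at least the threshold, a candidate is accepted only if it does not overlap characters already claimed by an earlier choice, and leftover characters become single-character tokens. The names `count_ngrams`, `segment`, `max_n`, and `min_freq` are hypothetical, and the corpus is a toy example.

```python
from collections import Counter


def count_ngrams(corpus, max_n):
    """Count every character n-gram of length 2..max_n in the corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def segment(sentence, counts, max_n, min_freq):
    """Greedy segmentation: longest n-grams first, then most frequent first."""
    taken = [False] * len(sentence)  # characters already claimed by a word
    spans = []
    for n in range(max_n, 1, -1):  # maximum-length priority
        candidates = [
            (counts[sentence[i:i + n]], i)
            for i in range(len(sentence) - n + 1)
            if counts[sentence[i:i + n]] >= min_freq  # assumption: >= threshold
        ]
        # Higher frequency first; ties broken by leftmost position (assumption).
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for _, i in candidates:
            # Assumption: skip any candidate that overlaps an earlier choice.
            if not any(taken[i:i + n]):
                taken[i:i + n] = [True] * n
                spans.append((i, i + n))
    # Characters not covered by any n-gram become single-character tokens.
    spans.extend((i, i + 1) for i, used in enumerate(taken) if not used)
    return [sentence[a:b] for a, b in sorted(spans)]


# Toy corpus, invented for illustration only.
corpus = ["机器学习很有趣", "我喜欢机器学习", "机器学习方法", "他喜欢方法"]
counts = count_ngrams(corpus, max_n=4)
print(segment("我喜欢机器学习", counts, max_n=4, min_freq=2))
# prints ['我', '喜欢', '机器学习']
```

In this toy run the 4-gram 机器学习 (count 3) is claimed first, so its shorter substrings are skipped even though they are equally frequent. The 2-gram 喜欢 (count 2) is then accepted because it overlaps nothing chosen earlier.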
Open Access
This is an open access article distributed under the CC BY-NC license.


Proceedings
2012 National Conference on Information Technology and Computer Science
Part of series
Advances in Intelligent Systems Research
Publication Date
November 2012
ISBN
978-94-91216-39-8
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Pengyu Lu
AU  - Lijun Jin
AU  - Bin Jiang
PY  - 2012/11
DA  - 2012/11
TI  - The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information
BT  - 2012 National Conference on Information Technology and Computer Science
PB  - Atlantis Press
SP  - 426
EP  - 429
SN  - 1951-6851
UR  - https://doi.org/10.2991/citcs.2012.111
DO  - https://doi.org/10.2991/citcs.2012.111
ID  - Lu2012/11
ER  -