The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information

Pengyu Lu; Lijun Jin; Bin Jiang

doi:10.2991/citcs.2012.111

Authors

Pengyu Lu, Lijun Jin, Bin Jiang

Corresponding Author

Pengyu Lu

Available Online November 2012.

DOI: 10.2991/citcs.2012.111
Keywords: word segmentation; word frequency; n-gram word; corpus type frequency information
Abstract: To address the difficulty of extracting words in a particular domain, we formulate a method of automatic Chinese word segmentation based on corpus type frequency information. The method can effectively extract n-gram words that are not predefined in a lexicon by setting two parameters: the maximum length (n) of the n-gram words to extract from a sentence, and the minimum threshold frequency with which an n-gram word must appear in the corpus. When the actual frequency of an n-gram in the corpus is above the threshold, the n-gram word is extracted. If two or more n-grams have the same length, the one with the higher frequency is chosen first, and then the one with the next higher frequency if any of its characters are not in the previous one.
Copyright: © 2012, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
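
The selection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions not stated on this page: frequencies are raw occurrence counts over the corpus sentences, the threshold is inclusive, a candidate is accepted only if it shares no character position with an n-gram already chosen, ties in frequency are broken by leftmost position, and characters left uncovered become single-character tokens. The names `count_ngrams`, `segment`, `max_n`, and `min_freq` are hypothetical.

```python
from collections import Counter


def count_ngrams(corpus, max_n):
    """Count every character n-gram of length 2..max_n in the corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def segment(sentence, counts, max_n, min_freq):
    """Pick n-grams longest first, then most frequent first (assumed reading)."""
    taken = [False] * len(sentence)  # character positions already covered
    chosen = []                      # (start index, token) pairs
    for n in range(max_n, 1, -1):    # maximum-length priority
        candidates = [
            (counts[sentence[i:i + n]], i)
            for i in range(len(sentence) - n + 1)
            if counts[sentence[i:i + n]] >= min_freq  # assumed inclusive threshold
        ]
        # Higher frequency first; equal frequencies fall back to leftmost position.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for _, i in candidates:
            # Assumption: skip a candidate that overlaps an earlier choice.
            if not any(taken[i:i + n]):
                chosen.append((i, sentence[i:i + n]))
                for j in range(i, i + n):
                    taken[j] = True
    # Assumption: uncovered characters are emitted as single-character tokens.
    for i, covered in enumerate(taken):
        if not covered:
            chosen.append((i, sentence[i]))
    chosen.sort()
    return [token for _, token in chosen]


# Toy corpus invented for illustration only.
corpus = ["机器学习方法", "机器学习模型", "学习方法"]
counts = count_ngrams(corpus, max_n=4)
print(segment("机器学习方法", counts, max_n=4, min_freq=2))
```

On this toy corpus, the 4-gram 机器学习 is taken first, the equally frequent 学习方法 is skipped because it overlaps it, and 方法 is then picked up at length 2. A different reading of the overlap condition in the abstract would change that outcome.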

Volume Title: Proceedings of the 2012 National Conference on Information Technology and Computer Science
Series: Advances in Intelligent Systems Research
Publication Date: November 2012
ISBN: 978-94-91216-39-8
ISSN: 1951-6851

Cite this article

TY  - CONF
AU  - Pengyu Lu
AU  - Lijun Jin
AU  - Bin Jiang
PY  - 2012/11
DA  - 2012/11
TI  - The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information
BT  - Proceedings of the 2012 National Conference on Information Technology and Computer Science
PB  - Atlantis Press
SP  - 426
EP  - 429
SN  - 1951-6851
UR  - https://doi.org/10.2991/citcs.2012.111
DO  - 10.2991/citcs.2012.111
ID  - Lu2012/11
ER  -