The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information
Pengyu Lu, Lijun Jin, Bin Jiang
Available Online November 2012.
- https://doi.org/10.2991/citcs.2012.111
- word segmentation; word frequency; n-gram word; corpus type frequency information
- To address the difficulty of extracting words in a particular domain, we propose a method of automatic Chinese word segmentation based on corpus type frequency information. The method can effectively extract n-gram words that are not predefined in a lexicon by setting two parameters: the maximum length (n) of the n-gram words to extract from a sentence, and the minimum threshold frequency with which an n-gram word must appear in the corpus. An n-gram word is extracted when its actual frequency in the corpus is above the threshold. If two or more n-grams have the same length, the one with the higher frequency is chosen first, followed by the one with the next-highest frequency, provided that it shares no characters with the one already chosen.
- Open Access
- This is an open access article distributed under the CC BY-NC license.
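The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `count_ngrams` and `segment`, the parameters `max_n` and `min_freq`, and the toy corpus are all hypothetical. The sketch also assumes that "above the threshold" means strictly greater than `min_freq`, that "sharing characters" means overlapping positions in the sentence, that ties in frequency are broken by leftmost position, and that characters not covered by any extracted n-gram are emitted as single-character tokens.

```python
from collections import Counter


def count_ngrams(corpus, max_n):
    """Count every character n-gram of length 2..max_n in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def segment(sentence, counts, max_n, min_freq):
    """Segment a sentence, giving priority to the longest n-grams.

    For each length from max_n down to 2, candidates whose corpus
    frequency is strictly above min_freq (an assumption) are tried in
    order of decreasing frequency; a candidate is accepted only if it
    does not overlap a span that was already accepted.
    """
    taken = [False] * len(sentence)
    words = []  # (start index, text) pairs

    for n in range(max_n, 1, -1):
        candidates = []
        for i in range(len(sentence) - n + 1):
            freq = counts[sentence[i:i + n]]
            if freq > min_freq:
                candidates.append((freq, i))
        # Higher frequency first; leftmost position breaks ties (assumption).
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for _, i in candidates:
            if not any(taken[i:i + n]):
                taken[i:i + n] = [True] * n
                words.append((i, sentence[i:i + n]))

    # Characters not covered by any n-gram become single-character tokens.
    for i, ch in enumerate(sentence):
        if not taken[i]:
            words.append((i, ch))

    return [text for _, text in sorted(words)]


# Toy corpus, invented for illustration only.
corpus = ["我喜欢北京", "北京很大", "我在北京"]
counts = count_ngrams(corpus, max_n=2)
print(segment("我喜欢北京", counts, max_n=2, min_freq=1))
# → ['我', '喜', '欢', '北京']
```

In this toy run only "北京" occurs more than once, so it is the only bigram extracted. Iterating from the longest length downward is what gives longer n-grams priority over shorter, more frequent ones.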
Cite this article
TY - CONF
AU - Pengyu Lu
AU - Lijun Jin
AU - Bin Jiang
PY - 2012/11
DA - 2012/11
TI - The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information
BT - 2012 National Conference on Information Technology and Computer Science
PB - Atlantis Press
SP - 426
EP - 429
SN - 1951-6851
UR - https://doi.org/10.2991/citcs.2012.111
DO - https://doi.org/10.2991/citcs.2012.111
ID - Lu2012/11
ER -