Proceedings of the 2015 International Conference on Electrical, Computer Engineering and Electronics

New Word Identification for Chinese Patents Based on Multiple Statistic Measures and Pattern Combination

Authors
Xiong Wen
Corresponding Author
Xiong Wen
Available Online June 2015.
DOI
https://doi.org/10.2991/icecee-15.2015.98
Keywords
New Word Identification (NWI); Out of Vocabulary (OOV); Pattern Combination; Candidate Generation; Statistical Measures Integration; Pattern Filtering
Abstract
New Word Identification (NWI) is a critical research topic in Chinese Natural Language Processing (NLP) and strongly influences downstream Chinese NLP tasks. Unidentified new words hamper the automatic and semi-automatic processing used in translating Chinese patent text. To address this problem, this paper proposes an NWI method for Chinese patents based on the integration of multiple statistical measures and pattern combination. First, a purpose-built preprocessing step divides the text into strings, retaining the technical terms in patents and removing as many non-technical words as possible. Second, divided strings of different lengths are combined using multiple patterns with greedy maximum matching to generate candidates. Third, noisy candidate strings are filtered out using four manually summarized filtering patterns. Finally, statistical measures defined for only two variables are extended to multiple variables, and the values of the extended measures are integrated by a ranking method, which evaluates the candidates against thresholds to form the set of new words. Experiments on the abstracts of Chinese patents show that precision can reach 80% and the F1 value can reach 68.15%, verifying the effectiveness of the method.
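The last step of the abstract (extending two-variable measures to several variables, then integrating them by rank) can be illustrated with a minimal sketch. The abstract does not name the measures or the ranking rule, so everything here is an assumption chosen for illustration: pointwise mutual information and the Dice coefficient as the two measures, their n-unit generalizations, mean rank as the integration rule, the threshold value, and the toy counts. It is not the paper's implementation.

```python
import math

def multi_pmi(parts, joint_count, unit_counts, total):
    """PMI generalized from two units to n: log( p(x1..xn) / prod p(xi) )."""
    p_joint = joint_count / total
    p_indep = math.prod(unit_counts[u] / total for u in parts)
    return math.log(p_joint / p_indep)

def multi_dice(parts, joint_count, unit_counts):
    """Dice coefficient generalized to n units: n * f(x1..xn) / sum f(xi)."""
    return len(parts) * joint_count / sum(unit_counts[u] for u in parts)

def rank_positions(scores):
    """Map each candidate to its rank under one measure (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}

def integrate_by_rank(score_tables):
    """Average each candidate's rank across all measures."""
    ranks = [rank_positions(table) for table in score_tables]
    return {c: sum(r[c] for r in ranks) / len(ranks) for c in score_tables[0]}

# Hypothetical toy corpus statistics; "A".."D" stand in for divided strings.
total = 1000
unit_counts = {"A": 30, "B": 25, "C": 400, "D": 500}
candidates = {("A", "B"): 15, ("C", "D"): 390, ("A", "C"): 8}

pmi = {c: multi_pmi(c, n, unit_counts, total) for c, n in candidates.items()}
dice = {c: multi_dice(c, n, unit_counts) for c, n in candidates.items()}
mean_rank = integrate_by_rank([pmi, dice])

# Assumed threshold: keep candidates whose mean rank is at most 2.0.
MAX_MEAN_RANK = 2.0
new_words = sorted(c for c, r in mean_rank.items() if r <= MAX_MEAN_RANK)
print(new_words)
```

In this toy data the two measures disagree on the top candidate, which is the situation where rank integration helps: a candidate is kept only when it ranks well across the measures taken together, not on a single measure alone.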
Open Access
This is an open access article distributed under the CC BY-NC license.

Proceedings
Part of series
Advances in Computer Science Research
Publication Date
June 2015
ISBN
978-94-62520-81-3
ISSN
2352-538X

Cite this article

TY  - CONF
AU  - Xiong Wen
PY  - 2015/06
DA  - 2015/06
TI  - New Word Identification for Chinese Patents Based on Multiple Statistic Measures and Pattern Combination
PB  - Atlantis Press
SP  - 472
EP  - 478
SN  - 2352-538X
UR  - https://doi.org/10.2991/icecee-15.2015.98
DO  - https://doi.org/10.2991/icecee-15.2015.98
ID  - Wen2015/06
ER  -