Proceedings of the 2016 7th International Conference on Mechatronics, Control and Materials (ICMCM 2016)

Document Structure Identification Method Based on Conditional Random Field

Authors
Yang Lei, Yingai Tian, Ning Li, Xiaolong Gao
Corresponding Author
Yang Lei
Available Online December 2016.
DOI
10.2991/icmcm-16.2016.71
Keywords
document structure identification; sequence labeling; CRF
Abstract

Based on an in-depth analysis of the structural and heading features of documents, this paper examines template-based classification, statistics-based classification, and sequence labeling based on CRF (Conditional Random Field). It then proposes treating document structure identification as a sequence labeling problem, builds a CRF model trained with feature templates, and uses the trained model to identify document structure through supervised learning. Experimental results show that identifying paragraph roles from the sequential structure of a document yields higher accuracy and provides a degree of fault tolerance. In addition, applying the CRF multiple times is observed to further improve identification accuracy.
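
To illustrate what labeling paragraphs as a sequence can look like, here is a minimal sketch in Python. It is not the authors' implementation. The label set, the feature names, and every weight below are hypothetical and hand-set for illustration, whereas a real CRF would learn its weights from annotated documents. The sketch shows only the decoding side of a linear-chain model: context-window features per paragraph (in the spirit of feature templates) and Viterbi search for the best label sequence.

```python
import re

# Hypothetical label set; the paper's actual paragraph roles may differ.
LABELS = ["title", "heading", "body"]

def paragraph_features(paragraphs, i):
    """Features for paragraph i plus its immediate neighbours (context window)."""
    def local(text):
        return {
            "short": len(text.split()) <= 8,
            "numbered": bool(re.match(r"^\d+(\.\d+)*\.?\s", text)),
            "ends_with_period": text.rstrip().endswith("."),
        }
    feats = {f"cur:{k}": v for k, v in local(paragraphs[i]).items()}
    feats["is_first"] = i == 0
    if i > 0:
        feats.update({f"prev:{k}": v for k, v in local(paragraphs[i - 1]).items()})
    if i + 1 < len(paragraphs):
        feats.update({f"next:{k}": v for k, v in local(paragraphs[i + 1]).items()})
    return feats

# Hand-set weights (assumptions, not trained values).
EMISSION = {
    ("title", "is_first"): 3.0,
    ("title", "cur:short"): 1.0,
    ("heading", "cur:numbered"): 3.0,
    ("heading", "cur:short"): 1.0,
    ("body", "cur:ends_with_period"): 2.0,
}
# (previous label, current label) -> weight; unlisted pairs score 0.
TRANSITION = {
    ("title", "title"): -2.0,
    ("heading", "title"): -4.0,
    ("body", "title"): -4.0,
    ("heading", "body"): 1.0,
}

def emission_score(label, feats):
    """Sum the weights of all active features for this label."""
    return sum(w for (lab, f), w in EMISSION.items() if lab == label and feats.get(f))

def label_paragraphs(paragraphs):
    """Viterbi decoding: the highest-scoring label sequence for the document."""
    if not paragraphs:
        return []
    feats = [paragraph_features(paragraphs, i) for i in range(len(paragraphs))]
    score = {lab: emission_score(lab, feats[0]) for lab in LABELS}
    backpointers = []
    for f in feats[1:]:
        new_score, ptr = {}, {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: score[p] + TRANSITION.get((p, lab), 0.0))
            new_score[lab] = (score[prev] + TRANSITION.get((prev, lab), 0.0)
                              + emission_score(lab, f))
            ptr[lab] = prev
        score = new_score
        backpointers.append(ptr)
    path = [max(LABELS, key=lambda lab: score[lab])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

doc = [
    "Document Structure Identification",
    "1 Introduction",
    "Documents contain titles, headings and body paragraphs that follow one another in a predictable order.",
    "2 Method",
    "Each paragraph is labeled using its own features and those of its neighbours.",
]
print(label_paragraphs(doc))
```

Because each label is scored jointly with its neighbours, one paragraph with misleading local features can still be labeled correctly when the surrounding sequence is consistent, which is one plausible reading of the fault tolerance mentioned in the abstract.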

Copyright
© 2016, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2016 7th International Conference on Mechatronics, Control and Materials (ICMCM 2016)
Series
Advances in Engineering Research
Publication Date
December 2016
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Yang Lei
AU  - Yingai Tian
AU  - Ning Li
AU  - Xiaolong Gao
PY  - 2016/12
DA  - 2016/12
TI  - Document Structure Identification Method Based on Conditional Random Field
BT  - Proceedings of the 2016 7th International Conference on Mechatronics, Control and Materials (ICMCM 2016)
PB  - Atlantis Press
SP  - 354
EP  - 361
SN  - 2352-5401
UR  - https://doi.org/10.2991/icmcm-16.2016.71
DO  - 10.2991/icmcm-16.2016.71
ID  - Lei2016/12
ER  -