Proceedings of the 3rd International Conference on Computer Science and Service System

An Analysis of Characters and Structures of Web Pages Based on Regular Expressions

Authors
Xu Lei
Corresponding Author
Xu Lei
Available Online June 2014.
DOI
10.2991/csss-14.2014.22
Keywords
information extraction; HTML; regular expressions
Abstract

This paper introduces a method for analyzing the characters and structures of web pages using regular expressions. Characters in web pages are counted one by one, from the encoding to the HTML elements. The effectiveness of the tool is demonstrated in experiments on more than one hundred real-world web pages. The results prepare the ground for large-scale web information extraction.
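
As an illustration of the kind of analysis the abstract describes, the following is a minimal sketch in Python, not the author's implementation. The regular expressions, the function name analyze_page, and the reported fields are all assumptions made for this example; the paper's actual expressions are not reproduced on this page.

```python
import re
from collections import Counter

# Hypothetical patterns chosen for illustration only.
# Matches a charset declaration inside a <meta> tag.
CHARSET_RE = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
# Matches opening (or self-closing) tags and captures the tag name.
TAG_RE = re.compile(r'<\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>')
# Matches any tag, opening or closing, so it can be stripped.
ANY_TAG_RE = re.compile(r'<[^>]+>')


def analyze_page(html: str) -> dict:
    """Count characters and opening tags in an HTML string (rough sketch)."""
    charset_match = CHARSET_RE.search(html)
    tag_counts = Counter(name.lower() for name in TAG_RE.findall(html))
    text = ANY_TAG_RE.sub('', html)  # what remains after removing markup
    return {
        "charset": charset_match.group(1).lower() if charset_match else None,
        "total_chars": len(html),
        "markup_chars": len(html) - len(text),
        "text_chars": len(text),
        "tag_counts": dict(tag_counts),
    }


if __name__ == "__main__":
    sample = ('<html><head><meta charset="UTF-8"><title>Hi</title></head>'
              '<body><p>a</p><p>b</p></body></html>')
    print(analyze_page(sample))
```

Regular expressions of this kind are adequate for rough counting but do not handle comments, scripts, or malformed markup reliably; a full HTML parser would be the safer choice for extraction itself.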

Copyright
© 2014, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).


Volume Title
Proceedings of the 3rd International Conference on Computer Science and Service System
Series
Advances in Intelligent Systems Research
Publication Date
June 2014
ISBN
10.2991/csss-14.2014.22
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Xu Lei
PY  - 2014/06
DA  - 2014/06
TI  - An Analysis of Characters and Structures of Web Pages Based on Regular Expressions
BT  - Proceedings of the 3rd International Conference on Computer Science and Service System
PB  - Atlantis Press
SP  - 98
EP  - 101
SN  - 1951-6851
UR  - https://doi.org/10.2991/csss-14.2014.22
DO  - 10.2991/csss-14.2014.22
ID  - Lei2014/06
ER  -