Proceedings of the 2nd International Symposium on Computer, Communication, Control and Automation (ISCCCA 2013)

Template-based Delta Compression of Large Scale Web Pages

Authors
Kai Lei, Guangyu Sun, Lian’en Huang
Corresponding Author
Kai Lei
Available Online February 2013.
DOI
10.2991/isccca.2013.153
Keywords
LCS, Diff, Delta compression, template
Abstract

Delta compression techniques are widely used in version control systems and on the World Wide Web. They compactly encode the differences between two files or strings in order to reduce communication or storage costs. In this paper, we study the use of delta compression for large collections of web pages, exploiting the similarity of their templates. We propose a framework for template-based delta compression that uses template-based clustering to find web pages with similar templates and then encodes their differences with delta compression to reduce storage cost. We also propose a filter-based optimization of the Diff algorithm to improve the efficiency of the delta compression approach. To demonstrate the efficiency of our approach, we present experimental results on a large collection of web pages. Our experiments show that template-based delta compression achieves a significantly better compression ratio than compressing each web page individually.
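
As an illustration of the general idea only, the following minimal Python sketch encodes one page as a delta against another page that shares its template. It is not the authors' method: the abstract does not specify their Diff variant, filter, or clustering step, so the standard-library difflib.SequenceMatcher stands in for the diff algorithm, and the names encode_delta, decode_delta, and the sample template are hypothetical.

```python
import difflib
import json
import zlib


def encode_delta(reference: str, target: str) -> list:
    """Encode target as copy/insert operations against reference.

    Assumption: difflib stands in for the paper's Diff algorithm.
    """
    matcher = difflib.SequenceMatcher(a=reference, b=target, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared text: store only the offsets into the reference.
            ops.append(["copy", i1, i2])
        elif tag in ("replace", "insert"):
            # New text: store it literally.
            ops.append(["insert", target[j1:j2]])
        # "delete" needs no operation: the reference span is just skipped.
    return ops


def decode_delta(reference: str, ops: list) -> str:
    """Rebuild the target from the reference and the delta operations."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(reference[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# Two hypothetical pages generated from the same template.
template = (
    "<html><head><title>News</title></head><body>"
    "<div id='nav'>Home | World | Sport</div><p>{}</p></body></html>"
)
page_a = template.format("First story text.")
page_b = template.format("A different second story.")

ops = encode_delta(page_a, page_b)
restored = decode_delta(page_a, ops)

# Compare the compressed delta with the page compressed on its own.
delta_size = len(zlib.compress(json.dumps(ops).encode("utf-8")))
standalone_size = len(zlib.compress(page_b.encode("utf-8")))
print(restored == page_b, delta_size, standalone_size)
```

On pages this small the two sizes may be close. Any saving depends on how much template text the pages share relative to their unique content, which is why grouping pages by template before choosing a reference matters.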

Copyright
© 2013, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2nd International Symposium on Computer, Communication, Control and Automation (ISCCCA 2013)
Series
Advances in Intelligent Systems Research
Publication Date
February 2013
ISBN
10.2991/isccca.2013.153
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Kai Lei
AU  - Guangyu Sun
AU  - Lian’en Huang
PY  - 2013/02
DA  - 2013/02
TI  - Template-based Delta Compression of Large Scale Web Pages
BT  - Proceedings of the 2nd International Symposium on Computer, Communication, Control and Automation (ISCCCA 2013)
PB  - Atlantis Press
SP  - 608
EP  - 612
SN  - 1951-6851
UR  - https://doi.org/10.2991/isccca.2013.153
DO  - 10.2991/isccca.2013.153
ID  - Lei2013/02
ER  -