Proceedings of the 2023 3rd International Conference on Business Administration and Data Science (BADS 2023)

Large-scale Chinese Text Infringement Detection Based on Dual-Semantic Fingerprinting

Authors
Ruixue Zhao¹, Xiao Yang¹, Honglei Li²,*
¹ Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing of National Press and Publication Administration, Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing, China
² School of Management, Liaoning Normal University, Dalian, China
*Corresponding author. Email: lhl@lnnu.edu.cn
Available Online 30 December 2023.
DOI
10.2991/978-94-6463-326-9_25
Keywords
Infringement detection; Text similarity; CiLin; Dual-semantic fingerprinting
Abstract

The SimHash algorithm is a hashing method originally used to deduplicate web pages at large scale. It is also widely used for text similarity comparison because it is both effective and efficient. In this paper, we improve the classical SimHash algorithm for semantic similarity detection in large-scale Chinese text. In our method, word similarity is first calculated with a text similarity method based on the CiLin path-depth algorithm; the keywords extracted with TF-IDF are then processed to remove synonym redundancy. Finally, dual-semantic fingerprints are generated and the Hamming distance between fingerprints is calculated. Experimental results show that the improved algorithm outperforms the classical SimHash algorithm in terms of F1 score. The results suggest that the algorithm can raise the probability of detecting semantically infringing texts and can provide technical support for digital copyright infringement detection.
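The pipeline named in the abstract can be illustrated with a minimal Python sketch of the classical building blocks: weighted SimHash over keyword weights, a synonym-merging step, and Hamming distance. This is not the authors' implementation. The abstract does not specify how the dual-semantic fingerprints are constructed, so that step is not shown. The function names, the 64-bit fingerprint length, the use of MD5 as the token hash, and the `synonym_of` lookup table (a hypothetical stand-in for a CiLin-based similarity decision) are all assumptions made for illustration.

```python
import hashlib
from typing import Dict


def token_hash(token: str, bits: int = 64) -> int:
    """Stable hash of a token, truncated to `bits` bits (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def merge_synonyms(weights: Dict[str, float], synonym_of: Dict[str, str]) -> Dict[str, float]:
    """Collapse keywords onto a canonical synonym and sum their weights.

    `synonym_of` is a hypothetical lookup table; in the paper this decision
    comes from CiLin-based word similarity, which is not reproduced here.
    """
    merged: Dict[str, float] = {}
    for word, weight in weights.items():
        canonical = synonym_of.get(word, word)
        merged[canonical] = merged.get(canonical, 0.0) + weight
    return merged


def simhash(weights: Dict[str, float], bits: int = 64) -> int:
    """Classical weighted SimHash: each keyword votes +w or -w on every bit."""
    votes = [0.0] * bits
    for word, weight in weights.items():
        h = token_hash(word, bits)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

In such a sketch, two documents would be flagged as candidate matches when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold value is a tuning parameter that the abstract does not state.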

Copyright
© 2023 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the 2023 3rd International Conference on Business Administration and Data Science (BADS 2023)
Series
Atlantis Highlights in Computer Sciences
Publication Date
30 December 2023
ISBN
978-94-6463-326-9
ISSN
2589-4900

Cite this article

TY  - CONF
AU  - Ruixue Zhao
AU  - Xiao Yang
AU  - Honglei Li
PY  - 2023
DA  - 2023/12/30
TI  - Large-scale Chinese Text Infringement Detection Based on Dual-Semantic Fingerprinting
BT  - Proceedings of the 2023 3rd International Conference on Business Administration and Data Science (BADS 2023)
PB  - Atlantis Press
SP  - 232
EP  - 244
SN  - 2589-4900
UR  - https://doi.org/10.2991/978-94-6463-326-9_25
DO  - 10.2991/978-94-6463-326-9_25
ID  - Zhao2023
ER  -