Proceedings of the 2nd Information Technology and Mechatronics Engineering Conference (ITOEC 2016)

Duplicate text detection based on LCS algorithm

Authors
Jiankun Yu, Mengrong Li, Dengyin Zhang
Corresponding Author
Jiankun Yu
Available Online May 2016.
DOI
10.2991/itoec-16.2016.2
Keywords
near-duplicate detection; duplicate detection; duplicate text filter.
Abstract

Broder's Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents, but neither takes the relative position of elements into consideration. This paper proposes SWLR (Shingling with Location Relationship), a method that combines Shingling with the LCS algorithm, together with a pre-filter method that speeds up the execution of SWLR. Experimental results show that SWLR performs better than Shingling in both recall and precision and better than MinHash in recall. With the pre-filter applied, SWLR can even run faster than MinHash and Shingling.
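
The point about relative position can be illustrated with a small Python sketch. It is not the authors' SWLR implementation, whose scoring and pre-filter are not described on this page. It assumes that LCS denotes the longest common subsequence, that shingles are word k-grams, and that the LCS length is normalised by the longer shingle sequence. All function names are hypothetical.

```python
def shingles(text, k=3):
    """Return the word k-shingles of a text, in document order."""
    words = text.lower().split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def jaccard(a, b):
    """Set-based resemblance: ignores where each shingle occurs."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Standard dynamic programme, O(len(a) * len(b)) time, keeping
    only the previous row of the table.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Order-aware score (assumed normalisation: divide by the longer sequence)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Same words, but the two halves of the text are swapped.
    s1 = shingles("a b c d e f", k=2)
    s2 = shingles("d e f a b c", k=2)
    print("set-based resemblance:", jaccard(s1, s2))
    print("order-aware similarity:", lcs_similarity(s1, s2))
```

In this example the two texts share four of six distinct shingles, so the set-based score stays fairly high, while the order-aware score drops because only two shingles appear in the same relative order in both texts.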

Copyright
© 2016, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2nd Information Technology and Mechatronics Engineering Conference (ITOEC 2016)
Series
Advances in Engineering Research
Publication Date
May 2016
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Jiankun Yu
AU  - Mengrong Li
AU  - Dengyin Zhang
PY  - 2016/05
DA  - 2016/05
TI  - Duplicate text detection based on LCS algorithm
BT  - Proceedings of the 2nd Information Technology and Mechatronics Engineering Conference (ITOEC 2016)
PB  - Atlantis Press
SP  - 5
EP  - 9
SN  - 2352-5401
UR  - https://doi.org/10.2991/itoec-16.2016.2
DO  - 10.2991/itoec-16.2016.2
ID  - Yu2016/05
ER  -