An Improvement Method of Duplicate Webpage Detection
- DOI
- 10.2991/emeit.2012.6
- Keywords
- search engine, duplicate detection, BloomFilter, Fuzzy Hamming distance
- Abstract
Because the Internet makes it easy to distribute and share resources, the number of duplicate pages on the Internet is very large. A search engine, as an indexing tool for Internet resources, therefore faces a serious duplicate-detection problem: its crawler encounters a large number of links that point to duplicate content. If all of these links are added to the download queue, crawler performance drops sharply, which in turn seriously affects the user experience. In this paper, we adopt an improved duplicate detection method that combines a BloomFilter with fuzzy Hamming distance. The method not only detects duplicate content but also meets the needs of users.
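The abstract does not give implementation details, so the following is only a minimal sketch of how a Bloom filter and a Hamming-distance check might be combined in a crawler. Everything in it is an assumption made for illustration: the class and function names, the filter size, the SimHash-style `fingerprint` function, and the threshold `max_distance`. In particular, plain Hamming distance on a 64-bit fingerprint stands in for the paper's "fuzzy Hamming distance", whose exact definition is not stated here.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def fingerprint(text):
    """64-bit SimHash-style fingerprint over whitespace tokens (assumed stand-in)."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if weights[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class DuplicateDetector:
    """Hypothetical two-stage check: exact URL match, then near-duplicate content."""

    def __init__(self, max_distance=3):
        self.seen_urls = BloomFilter()
        self.fingerprints = []
        self.max_distance = max_distance

    def is_duplicate(self, url, text):
        # Stage 1: the Bloom filter cheaply rejects URLs seen before.
        if url in self.seen_urls:
            return True
        self.seen_urls.add(url)
        # Stage 2: compare the content fingerprint with earlier pages.
        fp = fingerprint(text)
        if any(hamming_distance(fp, old) <= self.max_distance
               for old in self.fingerprints):
            return True
        self.fingerprints.append(fp)
        return False
```

In this sketch the second stage scans every stored fingerprint, which is only reasonable for small collections; a production crawler would need an index over the fingerprints, and the threshold would have to be tuned on real data.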
- Copyright
- © 2012, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Chengqi Zhang
AU - Wenqian Shang
AU - Yafeng Li
PY - 2012/09
DA - 2012/09
TI - An Improvement Method of Duplicate Webpage Detection
BT - Proceedings of the 2nd International Conference on Electronic & Mechanical Engineering and Information Technology (EMEIT 2012)
PB - Atlantis Press
SP - 27
EP - 30
SN - 1951-6851
UR - https://doi.org/10.2991/emeit.2012.6
DO - 10.2991/emeit.2012.6
ID - Zhang2012/09
ER -