Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

J. Prasanna Kumar; P. Govindarajulu

doi:10.1080/18756891.2013.752657

Volume 6, Issue 1, January 2013, Pages 1 - 13

Corresponding Author

J. Prasanna Kumar

Received 27 December 2011, Accepted 27 July 2012, Available Online 2 January 2013.

DOI: 10.1080/18756891.2013.752657
Keywords: Web crawling, web page, duplicate web page, near-duplicate web page, near-duplicate detection, fingerprinting
Abstract: Duplicate and near-duplicate web pages are a major concern for web search engines: they inflate the space needed to store indexes, which slows down the serving of results and increases its cost. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other, and the problem of finding near-duplicate web pages has been studied in the database and web-search communities for some years. To identify near-duplicate web pages, we combine sentence-level features with a fingerprinting method. When a large number of web documents must be checked, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we identify near-duplicate web pages accurately and efficiently. Experiments on web page collections confirm the efficiency of the proposed approach in detecting near-duplicate web pages.
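The comparison stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the sentence features, the fingerprint function, or the similarity measure, so the sketch assumes SHA-256 hashes of normalized sentences as fingerprints and Jaccard overlap as the similarity score. The function names and the threshold value are hypothetical, and the K-mode clustering step that would first narrow the candidate set is omitted.

```python
import hashlib
import re


def sentence_fingerprints(text):
    """Return the set of fingerprints of the sentences in `text`.

    Assumption: a sentence ends at '.', '!' or '?', and a fingerprint is
    the SHA-256 hash of the lowercased, whitespace-normalized sentence.
    """
    fingerprints = set()
    for sentence in re.split(r"[.!?]+", text):
        normalized = " ".join(sentence.lower().split())
        if normalized:  # skip empty fragments left by the split
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            fingerprints.add(digest)
    return fingerprints


def similarity(text_a, text_b):
    """Jaccard overlap of the two pages' sentence fingerprint sets."""
    fp_a = sentence_fingerprints(text_a)
    fp_b = sentence_fingerprints(text_b)
    if not fp_a and not fp_b:
        return 1.0  # two empty pages are treated as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def is_near_duplicate(text_a, text_b, threshold=0.7):
    """Flag a pair as near-duplicate when the overlap reaches `threshold`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return similarity(text_a, text_b) >= threshold


page_a = "The cat sat. The dog ran. Birds fly."
page_b = "The cat sat.  the dog RAN! Birds fly. Updated today."
print(similarity(page_a, page_b))
print(is_near_duplicate(page_a, page_b))
```

In a larger collection, a clustering pass would typically run first so that only pages within the same cluster are compared pairwise, which is the role the abstract assigns to K-mode clustering.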
Copyright: © 2017, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Journal: International Journal of Computational Intelligence Systems
Volume-Issue: 6 - 1
Pages: 1 - 13
Publication Date: 2013/01/02
ISSN (Online): 1875-6883
ISSN (Print): 1875-6891

Cite this article

TY  - JOUR
AU  - J. Prasanna Kumar
AU  - P. Govindarajulu
PY  - 2013
DA  - 2013/01/02
TI  - Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting
JO  - International Journal of Computational Intelligence Systems
SP  - 1
EP  - 13
VL  - 6
IS  - 1
SN  - 1875-6883
UR  - https://doi.org/10.1080/18756891.2013.752657
DO  - 10.1080/18756891.2013.752657
ID  - Kumar2013
ER  -