Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015

Dis-Dyn Crawler:A Distributed Crawler for Dynamic Web Page

Authors
Jianfu Cai, Hua Zhang
Corresponding Author
Jianfu Cai
Available Online December 2015.
DOI
https://doi.org/10.2991/icmmcce-15.2015.505
Keywords
distributed crawler, dynamic web page, HtmlUnit.
Abstract
AJAX has become a widespread way to deliver rich content in modern web applications, which causes two serious problems for web crawlers. The first is incomplete information: a crawler that cannot parse dynamic web pages retrieves only part of a page's content. The second is crawler efficiency. To address both problems, this paper proposes a distributed crawler for dynamic web pages named Dis-Dyn Crawler. The system uses HtmlUnit to parse dynamic pages and uses Redis and ZMQ (ZeroMQ, a message queue library) to implement distribution, which improves the crawler's efficiency. The experimental results show that Dis-Dyn Crawler performs better than Nutch, a distributed crawler system, and that it also improves dynamic page parsing efficiency.
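The abstract does not say how the crawler nodes share work, so the following is only a minimal sketch of one common arrangement: a deduplicated URL frontier that several workers draw from. It is an in-memory stand-in written with the Java standard library, not the authors' code and not a Redis or ZMQ client. The class name UrlFrontier and its methods are hypothetical, and the idea that a shared store would hold a "seen" set and a pending queue is an assumption.

```java
import java.util.Optional;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical in-memory stand-in for a shared URL frontier (not the paper's code). */
final class UrlFrontier {
    // Every URL ever enqueued; assumed role of a set kept in a shared store.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    // URLs waiting to be fetched; assumed role of a shared work queue.
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();

    /** Enqueues the URL if it has never been seen; returns true when it was new. */
    boolean offer(String url) {
        if (seen.add(url)) {      // add() is an atomic check-and-insert
            pending.add(url);
            return true;
        }
        return false;             // duplicate: some worker already queued it
    }

    /** Hands the next URL to a worker, or an empty Optional if none is waiting. */
    Optional<String> poll() {
        return Optional.ofNullable(pending.poll());
    }

    /** Number of URLs currently waiting to be fetched. */
    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch each worker would call poll(), fetch and render the page, then pass every extracted link to offer(), so a link discovered by two workers is queued only once. A real deployment would replace the two in-memory collections with a store that all nodes can reach.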
Open Access
This is an open access article distributed under the CC BY-NC license.

Cite this article

TY  - CONF
AU  - Jianfu Cai
AU  - Hua Zhang
PY  - 2015/12
DA  - 2015/12
TI  - Dis-Dyn Crawler:A Distributed Crawler for Dynamic Web Page
BT  - Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015
PB  - Atlantis Press
SN  - 2352-538X
UR  - https://doi.org/10.2991/icmmcce-15.2015.505
DO  - https://doi.org/10.2991/icmmcce-15.2015.505
ID  - Cai2015/12
ER  -