Dis-Dyn Crawler: A Distributed Crawler for Dynamic Web Page
Jianfu Cai, Hua Zhang
Available Online December 2015.
- https://doi.org/10.2991/icmmcce-15.2015.505
- distributed crawler, dynamic web page, HtmlUnit.
- AJAX is now widely used to deliver rich content in modern web applications, which causes two serious problems for web crawlers. The first is that the information retrieved from a page is incomplete, because the crawler cannot parse dynamic web pages. The second is the crawler's efficiency. To solve these problems, this paper proposes a distributed dynamic web crawler named Dis-Dyn Crawler. The system uses HtmlUnit to parse dynamic pages and uses Redis and ZMQ (ZeroMQ) to implement distribution, which improves the crawler's efficiency. The experimental results show that Dis-Dyn Crawler performs better than Nutch, a distributed crawler system, and that dynamic page parsing efficiency is also improved.
- Open Access
- This is an open access article distributed under the CC BY-NC license.
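
The distribution idea in the abstract (several workers sharing a URL queue and a shared record of visited URLs) can be illustrated with a minimal sketch. This is not the authors' implementation: a `queue.Queue` stands in for the ZMQ message queue, a lock-protected Python set stands in for a Redis set, and `fetch_links` with its `LINKS` graph is a hypothetical stand-in for rendering a dynamic page (the paper uses HtmlUnit for that step) and extracting its links.

```python
import queue
import threading

# Hypothetical in-memory link graph standing in for real rendered pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fetch_links(url):
    """Stand-in for rendering a page and extracting its outgoing links."""
    return LINKS.get(url, [])


def crawl(seed, num_workers=3):
    """Crawl from `seed` with several workers; return the set of URLs seen."""
    frontier = queue.Queue()   # assumption: stands in for the message queue
    seen = {seed}              # assumption: stands in for a shared Redis set
    seen_lock = threading.Lock()
    frontier.put(seed)

    def worker():
        while True:
            url = frontier.get()
            if url is None:    # sentinel: no more work, shut down
                frontier.task_done()
                return
            for link in fetch_links(url):
                # Check-and-add under a lock so each URL is queued only once.
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                frontier.put(link)
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()            # returns once every queued URL is processed
    for _ in threads:
        frontier.put(None)     # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(sorted(crawl("a")))
```

In a real deployment the check-and-add step would have to be atomic across machines, which is one reason a shared store such as Redis is a natural fit for the visited set.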
Cite this article
TY  - CONF
AU  - Jianfu Cai
AU  - Hua Zhang
PY  - 2015/12
DA  - 2015/12
TI  - Dis-Dyn Crawler: A Distributed Crawler for Dynamic Web Page
BT  - Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015
PB  - Atlantis Press
SN  - 2352-538X
UR  - https://doi.org/10.2991/icmmcce-15.2015.505
DO  - https://doi.org/10.2991/icmmcce-15.2015.505
ID  - Cai2015/12
ER  -