Proceedings of the 2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)

Comparison of Web Scraping Techniques : Regular Expression, HTML DOM and Xpath

Authors
Rohmat Gunawan, Alam Rahmatulloh, Irfan Darmawan, Firman Firdaus
Corresponding Author
Rohmat Gunawan
Available Online March 2019.
DOI
https://doi.org/10.2991/icoiese-18.2019.50
Keywords
DOM, Regex, Web Scraping, Xpath
Abstract
Data collection is the initial stage of research. Various data sources on the internet can be used in the research process. The process of extracting data or information from websites is called web scraping. Web scraping methods include Regular Expression (Regex), HTML DOM, and XPath. This study aims to determine the performance of these three methods. The comparison is done by testing each method while retrieving data from the target website, then measuring and comparing the performance of each process. Process time, memory usage, and data consumption are used as measurement parameters in the experiment. The results show that web scraping with the Regex method uses the least memory compared to the HTML DOM and XPath methods, while HTML DOM requires the least time and the smallest data consumption compared to the Regex and XPath methods.
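The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which language or libraries were used, so Python's standard library is assumed here, and the sample page, the function names (`scrape_regex`, `scrape_dom`, `scrape_xpath`, `measure`), and the extraction target are all hypothetical. The sketch measures process time and peak memory only; data consumption is omitted because the page is held in memory rather than fetched over the network.

```python
import re
import time
import tracemalloc
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption, not data from the paper).
SAMPLE_PAGE = """<html><body>
<ul>
  <li class="title">First article</li>
  <li class="title">Second article</li>
  <li class="other">Not a title</li>
</ul>
</body></html>"""


def scrape_regex(page):
    # Regex: pattern matching on the raw text; no document tree is built.
    return re.findall(r'<li class="title">(.*?)</li>', page)


def scrape_dom(page):
    # DOM: build a full node tree, then walk the <li> elements.
    doc = xml.dom.minidom.parseString(page)
    return [
        li.firstChild.data
        for li in doc.getElementsByTagName("li")
        if li.getAttribute("class") == "title"
    ]


def scrape_xpath(page):
    # XPath: ElementTree supports only a limited XPath subset,
    # which is enough for this attribute predicate.
    root = ET.fromstring(page)
    return [li.text for li in root.findall(".//li[@class='title']")]


def measure(func, page):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(page)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    for func in (scrape_regex, scrape_dom, scrape_xpath):
        titles, elapsed, peak = measure(func, SAMPLE_PAGE)
        print(f"{func.__name__}: {titles} in {elapsed:.6f} s, peak {peak} bytes")
```

The two XML parsers used here require well-formed markup, so real-world HTML would likely need a more lenient parser. Timings and memory figures from such a small in-memory sample say nothing about the results reported in the paper.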
Open Access
This is an open access article distributed under the CC BY-NC license.

Proceedings
2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)
Part of series
Atlantis Highlights in Engineering
Publication Date
March 2019
ISBN
978-94-6252-689-1
ISSN
2589-4943

Cite this article

TY  - CONF
AU  - Rohmat Gunawan
AU  - Alam Rahmatulloh
AU  - Irfan Darmawan
AU  - Firman Firdaus
PY  - 2019/03
DA  - 2019/03
TI  - Comparison of Web Scraping Techniques : Regular Expression, HTML DOM and Xpath
BT  - 2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)
PB  - Atlantis Press
SN  - 2589-4943
UR  - https://doi.org/10.2991/icoiese-18.2019.50
DO  - https://doi.org/10.2991/icoiese-18.2019.50
ID  - Gunawan2019/03
ER  -