Proceedings of the 2019 International Conference on Computer, Network, Communication and Information Systems (CNCI 2019)

Research of Spark SQL Query Optimization Based on Massive Small Files on HDFS

Authors
Kefei Cheng, Xudong Chen, Ke Zhou, Xianjun Deng, Zhao Luo
Corresponding Author
Kefei Cheng
Available Online May 2019.
DOI
10.2991/cnci-19.2019.25
Keywords
Industry Application Card data, Spark SQL, HDFS, Small Files, Parquet
Abstract

This paper addresses the low efficiency of Spark SQL when reading massive numbers of small files on HDFS in a 4G Industry Application Card (IAC) business analysis system. To solve this problem, we propose a Local Merge Storage Model (LMSM) for 4G IAC small files. In this model, locality is enhanced by exploiting the type and time of the small files. Spark is then used to merge the small files into Parquet columnar storage files and store them on HDFS. Experimental results show that, after the small files are merged and stored in partitions, Spark SQL query efficiency improves by up to 60 times.
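
The abstract does not spell out how LMSM groups files, so the following is only a minimal sketch of the general idea of grouping small files by type and time before merging them. The file naming scheme (`<type>_<yyyyMMddHHmm>.csv`), the function names `plan_merge_groups` and `partition_path`, and the `type=.../day=...` directory layout are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from datetime import datetime


def plan_merge_groups(file_names):
    """Group small-file names by (record type, day).

    Assumes hypothetical names of the form '<type>_<yyyyMMddHHmm>.csv'.
    Each resulting group is a candidate for merging into one larger file.
    """
    groups = defaultdict(list)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]               # drop the extension
        record_type, stamp = stem.rsplit("_", 1)    # split type from timestamp
        day = datetime.strptime(stamp, "%Y%m%d%H%M").date().isoformat()
        groups[(record_type, day)].append(name)
    # Sort keys and members so the plan is deterministic.
    return {key: sorted(names) for key, names in sorted(groups.items())}


def partition_path(base_dir, record_type, day):
    """Build a partition directory for one merged group (assumed layout)."""
    return f"{base_dir}/type={record_type}/day={day}"


# Hypothetical input: four small files of two types over two days.
files = [
    "flow_201901011200.csv",
    "flow_201901011205.csv",
    "sms_201901011200.csv",
    "flow_201901020000.csv",
]
plan = plan_merge_groups(files)
for (record_type, day), members in plan.items():
    print(partition_path("/iac", record_type, day), len(members))
```

In a Spark job, each group in such a plan would typically be read together and written out as a single Parquet partition, so that a query filtering on type or day touches a few large files instead of many small ones.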

Copyright
© 2019, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2019 International Conference on Computer, Network, Communication and Information Systems (CNCI 2019)
Series
Advances in Computer Science Research
Publication Date
May 2019
ISSN
2352-538X

Cite this article

TY  - CONF
AU  - Kefei Cheng
AU  - Xudong Chen
AU  - Ke Zhou
AU  - Xianjun Deng
AU  - Zhao Luo
PY  - 2019/05
DA  - 2019/05
TI  - Research of Spark SQL Query Optimization Based on Massive Small Files on HDFS
BT  - Proceedings of the 2019 International Conference on Computer, Network, Communication and Information Systems (CNCI 2019)
PB  - Atlantis Press
SP  - 180
EP  - 190
SN  - 2352-538X
UR  - https://doi.org/10.2991/cnci-19.2019.25
DO  - 10.2991/cnci-19.2019.25
ID  - Cheng2019/05
ER  -