Proceedings of the 2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)

Deduplication for Data Profiling using Open Source Platform

Authors
Margo Gunatama, Tien Fabrianti, Muhammad Azani Hasibuan
Corresponding Author
Margo Gunatama
Available Online March 2019.
DOI
https://doi.org/10.2991/icoiese-18.2019.48
Keywords
data preprocessing, data governance, Levenshtein distance
Abstract
Many companies have yet to recognize the importance of data quality for their own improvement. Many companies in Indonesia, especially state-owned enterprises (BUMN) and government companies, run only a single application with a single database, which causes data to be duplicated between columns, tables, and applications once that application is integrated with others. This problem can be handled through data preprocessing, one method of which is data profiling. Data profiling is the process of gathering information about data that can be determined by a process or by logic. It can be carried out with various tools, both paid and open source, each with its own advantages in performance and in data processing depending on the case study. This study focuses on data analysis through data profiling, using a deduplication method based on Levenshtein distance to check for duplicate data. The results of the profiling will be implemented in logical form in an open source application, and open source applications will then be compared.
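The abstract does not show how the Levenshtein check is implemented, so the following is only a minimal sketch of the general technique, not the authors' method. The function names `levenshtein` and `find_duplicates`, the whitespace and case normalization, and the `max_distance` threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def find_duplicates(values, max_distance=1):
    """Return index pairs whose normalized values are within max_distance edits.

    max_distance is a hypothetical threshold chosen for illustration;
    it is not a value taken from the paper.
    """
    normalized = [v.strip().lower() for v in values]
    pairs = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if levenshtein(normalized[i], normalized[j]) <= max_distance:
                pairs.append((i, j))
    return pairs
```

Comparing every pair of values is quadratic in the number of rows, so a sketch like this is only practical on small columns; a real profiling tool would likely restrict comparisons to candidate groups first.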
Open Access
This is an open access article distributed under the CC BY-NC license.

Proceedings
2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)
Part of series
Atlantis Highlights in Engineering
Publication Date
March 2019
ISBN
978-94-6252-689-1
ISSN
2589-4943

Cite this article

TY  - CONF
AU  - Margo Gunatama
AU  - Tien Fabrianti
AU  - Muhammad Azani Hasibuan
PY  - 2019/03
DA  - 2019/03
TI  - Deduplication for Data Profiling using Open Source Platform
BT  - 2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)
PB  - Atlantis Press
SN  - 2589-4943
UR  - https://doi.org/10.2991/icoiese-18.2019.48
DO  - https://doi.org/10.2991/icoiese-18.2019.48
ID  - Gunatama2019/03
ER  -