Deduplication for Data Profiling using Open Source Platform
Margo Gunatama, Tien Fabrianti, Muhammad Azani Hasibuan
Available Online March 2019.
- https://doi.org/10.2991/icoiese-18.2019.48
- data preprocessing, data governance, Levenshtein distance
- Many companies have yet to recognize the importance of data quality for their own improvement. Many companies in Indonesia, especially state-owned enterprises (BUMN) and government companies, have only a single application with a single database, which causes problems with duplicated data across columns, tables, and applications once that application is integrated with other applications. This problem can be handled through data preprocessing, and one data preprocessing method is data profiling. Data profiling is the process of gathering information about data that can be determined by process or logic. Data profiling can be done with various tools, both paid and open source, each with its own advantages in performance and in data processing depending on the case study. This study focuses on data analysis through data profiling, using a deduplication method based on the Levenshtein distance to check for duplicate data. The results of the profiling are implemented as logic in open source applications, and the open source applications are then compared with one another.
- Open Access
- This is an open access article distributed under the CC BY-NC license.
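The abstract names the Levenshtein distance as the deduplication method but does not show how it is applied. The following is a minimal sketch of the general technique, not the authors' implementation: the function names `levenshtein` and `find_near_duplicates`, the `max_distance` threshold, the lowercase/trim normalization, and the sample values are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def find_near_duplicates(values, max_distance=2):
    """Return (value_i, value_j, distance) for every pair of values whose
    normalized forms are within max_distance edits of each other.
    The threshold and the normalization are illustrative assumptions."""
    normalized = [v.strip().lower() for v in values]
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            distance = levenshtein(normalized[i], normalized[j])
            if distance <= max_distance:
                pairs.append((values[i], values[j], distance))
    return pairs


# Made-up sample column values, for illustration only.
sample = ["Jakarta", "Jakrta", "Bandung"]
duplicates = find_near_duplicates(sample, max_distance=2)
```

The pairwise comparison is quadratic in the number of values, so on a large column it would typically be combined with some form of grouping (for example, comparing only values that share a prefix) before computing distances.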
Cite this article
TY - CONF
AU - Margo Gunatama
AU - Tien Fabrianti
AU - Muhammad Azani Hasibuan
PY - 2019/03
DA - 2019/03
TI - Deduplication for Data Profiling using Open Source Platform
BT - 2018 International Conference on Industrial Enterprise and System Engineering (ICoIESE 2018)
PB - Atlantis Press
SN - 2589-4943
UR - https://doi.org/10.2991/icoiese-18.2019.48
DO - https://doi.org/10.2991/icoiese-18.2019.48
ID - Gunatama2019/03
ER -