Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task in item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that efficiently detects duplicate records in Big Data sources. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.
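To make the task concrete, here is a minimal single-machine sketch of duplicate detection. It is not BigDedup's API and does not use Apache Spark. The abstract does not say which techniques the toolkit implements, so the sketch assumes two common ones: token blocking to generate candidate pairs and Jaccard similarity to compare them. All function names, the threshold, and the toy records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lowercased word tokens from all attribute values of a record."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def candidate_pairs(records):
    """Token blocking: records sharing at least one token become candidates."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for tok in tokens(record):
            blocks[tok].add(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.5):
    """Return candidate pairs whose token sets are at least `threshold` similar."""
    toks = {rid: tokens(rec) for rid, rec in records.items()}
    return sorted(
        (a, b)
        for a, b in candidate_pairs(records)
        if jaccard(toks[a], toks[b]) >= threshold
    )


# Toy records, invented for illustration only; attribute names differ on
# purpose, since the sketch ignores the schema and compares values alone.
records = {
    1: {"name": "Apple iPhone 8 64GB", "brand": "Apple"},
    2: {"title": "iphone 8 apple 64gb"},
    3: {"name": "Samsung Galaxy S9", "brand": "Samsung"},
}
duplicates = find_duplicates(records)
```

Blocking avoids comparing every record with every other one. Here record 3 shares no token with records 1 and 2, so it is never compared against them.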
BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios / Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia. - 7:(2018), pp. 1015-1023. (Paper presented at the 25th International Conference on Transdisciplinary Engineering (TE2018), held in Modena, July 3-6, 2018) [10.3233/978-1-61499-898-3-1015].
BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios
Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia
2018
File | Access | Type | Size | Format
---|---|---|---|---
ATDE7-1015.pdf | Open access | Version published by the publisher | 949.9 kB | Adobe PDF
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.