BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios

Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia

doi:10.3233/978-1-61499-898-3-1015

Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit able to detect duplicate records in Big Data sources efficiently. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through industrial examples.
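To make the task concrete, the following is a minimal single-machine sketch of duplicate detection by pairwise token similarity. It is not BigDedup's API and does not claim to reproduce its techniques; the function names (`tokens`, `jaccard`, `find_duplicates`), the sample records, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(record):
    """Lowercase every attribute value and split it into a set of word tokens."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.5):
    """Return pairs of record ids whose token sets are at least `threshold` similar.

    Every pair is compared, so the cost is quadratic in the number of records.
    """
    profiles = {rid: tokens(rec) for rid, rec in records.items()}
    return [
        (left, right)
        for left, right in combinations(sorted(profiles), 2)
        if jaccard(profiles[left], profiles[right]) >= threshold
    ]


# Hypothetical product records coming from two catalogs with different schemas.
records = {
    "a1": {"name": "Apple iPhone 8 64GB", "color": "black"},
    "b7": {"title": "iPhone 8 Apple 64GB black"},
    "b9": {"title": "Samsung Galaxy S9 64GB"},
}

duplicates = find_duplicates(records)
print(duplicates)
```

Because the sketch ignores attribute names and compares only token sets, it also works on records with different schemas. Its all-pairs comparison is the kind of cost that becomes prohibitive on large sources, which is where a distributed framework such as Apache Spark becomes relevant.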

BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios / Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia. - 7:(2018), pp. 1015-1023. (Paper presented at the 25th International Conference on Transdisciplinary Engineering (TE2018), held in Modena, July 3-6, 2018) [10.3233/978-1-61499-898-3-1015].

Publication year: 2018
Conference title: 25th International Conference on Transdisciplinary Engineering (TE2018)
Conference location: Modena
Conference dates: July 3-6, 2018
DOI: https://dx.doi.org/10.3233/978-1-61499-898-3-1015
WoS code: WOS:000468226300101
Scopus code: 2-s2.0-85057972161
Series: Advances in Transdisciplinary Engineering
Volume: 7
First page: 1015
Last page: 1023
All authors: Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia
Type: Paper in conference proceedings

Files in this record:

File	Access	Type	Size	Format
ATDE7-1015.pdf	Open access	Publisher's published version	949.9 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.