BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios

Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia

doi:10.3233/978-1-61499-898-3-1015

Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that efficiently detects duplicate records in Big Data sources. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.

BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios / Gagliardelli, L., Zhu, S., Simonini, G., Bergamaschi, S.. - 7:(2018), pp. 1015-1023. (25th International Conference on Transdisciplinary Engineering (TE2018) Modena July 3-6, 2018) [10.3233/978-1-61499-898-3-1015].

Year of publication: 2018
Conference title: 25th International Conference on Transdisciplinary Engineering (TE2018)
Conference location: Modena
Conference dates: July 3-6, 2018
DOI: https://dx.doi.org/10.3233/978-1-61499-898-3-1015
WoS ID: WOS:000468226300101
Scopus ID: 2-s2.0-85057972161
Series: ADVANCES IN TRANSDISCIPLINARY ENGINEERING
Volume: 7
First page: 1015
Last page: 1023
All authors: Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia
Citation: BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios / Gagliardelli, L., Zhu, S., Simonini, G., Bergamaschi, S.. - 7:(2018), pp. 1015-1023. (25th International Conference on Transdisciplinary Engineering (TE2018) Modena July 3-6, 2018) [10.3233/978-1-61499-898-3-1015].
Type: Conference proceedings paper

Files in this item:

File	Size	Format
ATDE7-1015.pdf (Open access; Type: VOR - version published by the publisher; License: Creative Commons)	949.9 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.