BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios

Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia

doi:10.3233/978-1-61499-898-3-1015

Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that efficiently detects duplicate records in Big Data sources. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that permits the user to process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.
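
To make the task concrete, here is a minimal plain-Python sketch of duplicate detection on a toy item catalog. It is not BigDedup's API and does not use Apache Spark. It assumes a generic two-step approach (token blocking to generate candidate pairs, then Jaccard similarity to score them). The function names, the sample records, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lowercased word tokens drawn from every field of a record."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def candidate_pairs(records):
    """Token blocking: records sharing at least one token become a candidate pair."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for tok in tokens(record):
            blocks[tok].add(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(records, threshold=0.5):
    """Score every candidate pair and keep those at or above the threshold."""
    matches = []
    for left, right in sorted(candidate_pairs(records)):
        score = jaccard(tokens(records[left]), tokens(records[right]))
        if score >= threshold:
            matches.append((left, right, score))
    return matches


# Hypothetical catalog: records 1 and 2 describe the same product.
records = {
    1: {"name": "Acme Drill 500W", "brand": "Acme"},
    2: {"name": "ACME drill 500w cordless", "brand": "acme"},
    3: {"name": "Bolt Cutter 24in", "brand": "Forge"},
}
duplicates = find_duplicates(records)
```

Blocking keeps the number of comparisons well below the full quadratic set of pairs, which is what makes this kind of pipeline a natural fit for a distributed framework. Here record 3 shares no token with the others, so it is never compared.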

BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios / Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia. - 7:(2018), pp. 1015-1023. (25th International Conference on Transdisciplinary Engineering (TE2018), Modena, July 3-6, 2018) [10.3233/978-1-61499-898-3-1015].



Publication year: 2018
Authors affiliated with foreign institutions: no
Publication language: English
Conference title: 25th International Conference on Transdisciplinary Engineering (TE2018)
Conference location: Modena
Conference date: July 3-6, 2018
DOI: https://dx.doi.org/10.3233/978-1-61499-898-3-1015
WoS code: WOS:000468226300101
Scopus code: 2-s2.0-85057972161
Series: Advances in Transdisciplinary Engineering
Volume title: Transdisciplinary Engineering Methods for Social Innovation of Industry 4.0
Volume number: 7
First page: 1015
Last page: 1023
Volume ISBN: 9781614998976
Publisher: IOS Press BV
Publisher country: Netherlands
Publisher address: Nieuwe Hemweg 6B, 1013 BG Amsterdam, Netherlands
Keywords: Duplicate detection, Entity Resolution, Data Integration, Record Linkage, Big Data
Number of authors: 4
Type: Conference proceedings paper (info:eu-repo/semantics/conferenceObject)
Full text: open

Files in this record:

File	Size	Format
ATDE7-1015.pdf (Open access; type: VOR, version published by the publisher; license: [IR] creative-commons)	949.9 kB	Adobe PDF


The metadata in IRIS UNIMORE is released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.