How improve Set Similarity Join based on prefix approach in distributed environment

Zhu, Song; Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico

doi:10.1109/HPCS.2018.00136

Set similarity join is an essential operation in data integration and big data analytics, which finds similar pairs of records where the records contain string or set-based data. To cope with the increasing scale of the data, several techniques have been proposed to perform set similarity joins using distributed frameworks, such as the MapReduce framework. In particular, Vernica et al. [3] proposed a MapReduce implementation of the so-called PPJoin algorithm [2], which a recent study experimentally demonstrated to be one of the best set similarity join algorithms [4]. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully. The large number of duplicates incurs both a large shuffle cost and an unnecessary computation cost, which significantly decreases performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability of these techniques. To address these problems, in this paper we propose a duplicate-free framework, called TTJoin, to perform set similarity joins efficiently by utilizing an innovative filter based on prefix tokens, and we implement it with one of the most popular distributed frameworks, i.e., Apache Spark. Experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to both the traditional PPJoin and the MapReduce implementation proposed in [3].
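
To give a feel for the prefix approach that the abstract refers to, here is a minimal single-machine sketch of the generic prefix-filter principle used by PPJoin-family algorithms. It is not TTJoin and not the filter proposed in the paper, which the abstract does not describe. The function names, the example records and the choice of Jaccard similarity are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(records, threshold):
    """Return {(id_a, id_b): similarity} for pairs with Jaccard >= threshold.

    Hypothetical helper: records maps a record id to a list of tokens.
    """
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in set(toks))

    def rank(tok):
        return (freq[tok], tok)

    # Index every record under the tokens of its prefix only.
    index = defaultdict(list)
    for rid, toks in records.items():
        ordered = sorted(set(toks), key=rank)
        if not ordered:
            continue
        # Prefix length for Jaccard: |x| - ceil(t * |x|) + 1.
        # The small epsilon guards against float round-up shortening the prefix.
        min_overlap = math.ceil(threshold * len(ordered) - 1e-9)
        for tok in ordered[:len(ordered) - min_overlap + 1]:
            index[tok].append(rid)

    # Records sharing a prefix token become candidates. A pair that shares
    # several prefix tokens shows up once per token; the set removes repeats.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verify each candidate pair with the exact similarity.
    result = {}
    for a, b in candidates:
        sim = jaccard(set(records[a]), set(records[b]))
        if sim >= threshold:
            result[(a, b)] = sim
    return result


# Made-up example records, for illustration only.
example = {
    "r1": ["set", "similarity", "join", "spark"],
    "r2": ["set", "similarity", "join", "mapreduce"],
    "r3": ["load", "balancing", "skew"],
}
matches = prefix_filter_join(example, 0.5)
```

In this sketch a pair that shares more than one prefix token is generated once per shared token and then deduplicated in memory. In a distributed setting, where each token's group may be processed on a different worker, repeats of this kind would plausibly have to be shuffled and removed in a separate step, which is one way to picture the duplicate problem the abstract mentions.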

How improve Set Similarity Join based on prefix approach in distributed environment / Zhu, S., Gagliardelli, L., Simonini, G., Beneventano, D. - (2018), pp. 844-851. (16th International Conference on High Performance Computing and Simulation, HPCS 2018, Orléans, France, 16-20 July 2018) [10.1109/HPCS.2018.00136].

Publication year: 2018
Conference title: 16th International Conference on High Performance Computing and Simulation, HPCS 2018
Conference location: Orléans, France
Conference dates: 16-20 July 2018
DOI: https://dx.doi.org/10.1109/HPCS.2018.00136
WoS ID: WOS:000450677700116
Scopus ID: 2-s2.0-85057428449
First page: 844
Last page: 851
Type: Paper in conference proceedings

Files in this record:

File	Size	Format
08514441.pdf (restricted access; type: VOR, version published by the publisher)	701.81 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.