How improve Set Similarity Join based on prefix approach in distributed environment / Zhu, Song; Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico. - (2018), pp. 844-851. (Paper presented at the 2018 International Conference on High Performance Computing & Simulation (HPCS), held in Orléans, France, 16-20 July 2018) [10.1109/HPCS.2018.00136].

How improve Set Similarity Join based on prefix approach in distributed environment

Song Zhu; Luca Gagliardelli; Giovanni Simonini; Domenico Beneventano
2018

Abstract

Set similarity join is an essential operation in data integration and big data analytics that finds similar pairs of records, where the records contain string or set-based data. To cope with the increasing scale of the data, several techniques have been proposed to perform set similarity joins using distributed frameworks, such as MapReduce. In particular, Vernica et al. [3] proposed a MapReduce implementation of the PPJoin algorithm [2], which a recent study experimentally demonstrated to be one of the best set similarity join algorithms [4]. These techniques, however, usually produce huge numbers of duplicates in order to perform parallel processing successfully. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly decrease performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects their scalability. To address these problems, in this paper we propose a duplicate-free framework, called TTJoin, that performs set similarity joins efficiently by utilizing an innovative filter based on prefix tokens, and we implement it with one of the most popular distributed frameworks, i.e., Apache Spark. Experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to both traditional PPJoin and the MapReduce implementation proposed in [3].
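The prefix-based filtering that the abstract refers to can be illustrated with a minimal single-machine sketch. This is not TTJoin, PPJoin, or the MapReduce implementation of [3]. It is only the generic prefix-filter idea for a Jaccard self-join, and the function names, the choice of Jaccard similarity, and the rare-token-first ordering are assumptions made for illustration. For a record with n tokens and threshold t, only the first n - ceil(t*n) + 1 tokens (the prefix) are indexed. Two records whose similarity reaches t must share at least one prefix token, so only records that meet in the index are verified.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_self_join(records, threshold):
    """Illustrative prefix-filter self-join (hypothetical helper, not TTJoin).

    records: dict mapping a record id to a set of tokens.
    Returns a sorted list of id pairs whose Jaccard similarity >= threshold.
    """
    # Global token frequencies: rare tokens are placed first in each record,
    # so that prefixes contain the most selective tokens.
    freq = Counter(tok for toks in records.values() for tok in toks)

    # Inverted index built on prefix tokens only.
    index = defaultdict(list)
    for rid, toks in records.items():
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        n = len(ordered)
        # Small epsilon guards against floating-point overshoot in ceil.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            index[tok].append(rid)

    # A pair sharing several prefix tokens appears in several index lists;
    # collecting candidates in a set keeps each pair once.
    candidates = set()
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            candidates.add((a, b))

    # Verification step: compute the exact similarity for each candidate.
    return sorted(
        (a, b) for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    )


if __name__ == "__main__":
    sample = {
        1: {"a", "b", "c", "d"},
        2: {"a", "b", "c", "e"},
        3: {"x", "y", "z"},
    }
    print(prefix_filter_self_join(sample, 0.5))
```

In this sketch a set absorbs candidate pairs that share more than one prefix token. In a distributed setting where records are grouped by prefix token, such a pair would presumably be generated once per shared token, which is the kind of duplication the abstract describes.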
Conference: 2018 International Conference on High Performance Computing & Simulation (HPCS)
Location: Orléans, France
Dates: 16-20 July 2018
Pages: 844-851
Use this identifier to cite or link to this document: http://hdl.handle.net/11380/1167193