Set similarity join is an essential operation in data integration and big data analytics that finds similar pairs of records, where the records contain string or set-based data. To cope with the increasing scale of the data, several techniques have been proposed to perform set similarity joins using distributed frameworks, such as MapReduce. In particular, Vernica et al. proposed a MapReduce implementation of the so-called PPJoin algorithm, which a recent study experimentally demonstrated to be one of the best set similarity join algorithms. These techniques, however, usually produce huge numbers of duplicates in order to perform parallel processing successfully. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly decreases performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skew problem and negatively affects their scalability. To address these problems, in this paper we propose a duplicate-free framework, called TTJoin, that performs set similarity joins efficiently by utilizing an innovative filter based on prefix tokens, and we implement it with one of the most popular distributed frameworks, Apache Spark. Experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to both the traditional PPJoin and the MapReduce implementation proposed by Vernica et al.
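To make the prefix-based approach mentioned in the abstract more concrete, the following is a minimal single-machine sketch of the generic prefix filtering principle that the PPJoin family of algorithms builds on. It is not TTJoin, and it is not the authors' Spark implementation or the MapReduce implementation of Vernica et al. The function names `jaccard` and `prefix_filter_join`, the choice of Jaccard similarity, and the rare-tokens-first ordering are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Self-join: return pairs (i, j), i < j, with Jaccard >= threshold.

    Illustrative sketch of generic prefix filtering, not the paper's method.
    """
    # Global token order (assumed here): rare tokens first, ties by token.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # prefix token -> ids of records seen so far
    results = set()
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue
        n = len(tokens)
        # Two records with Jaccard >= threshold must share at least one token
        # within their first n - ceil(threshold * n) + 1 ordered tokens.
        # The small epsilon guards against floating-point round-up.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        candidates = set()  # a set, so each candidate pair is verified once
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verification step: compute the exact similarity for candidates only.
        for cid in candidates:
            if jaccard(tokens, ordered[cid]) >= threshold:
                results.add((cid, rid))
    return results


records = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["x", "y", "z"],
    ["a", "b", "c", "d"],
]
pairs = prefix_filter_join(records, 0.5)
```

In this sketch the candidate set plays the role that duplicate elimination plays in a distributed setting: when records are replicated once per prefix token across partitions, the same pair can be generated on several partitions, which is the source of the duplicates the abstract refers to.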
|Publication date:||2018|
|Title:||How improve Set Similarity Join based on prefix approach in distributed environment|
|Author(s):||Zhu, Song; Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico|
|Conference name:||2018 International Conference on High Performance Computing & Simulation (HPCS)|
|Conference location:||Orléans, France|
|Conference date:||16-20 July 2018|
|Citation:||How improve Set Similarity Join based on prefix approach in distributed environment / Zhu, Song; Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico. - (2018), pp. 844-851. (Paper presented at the 2018 International Conference on High Performance Computing & Simulation (HPCS), held in Orléans, France, 16-20 July 2018.)|
|Type:||Conference proceedings paper|
Documents in Iris Unimore are released under the Creative Commons Attribution - NonCommercial - NoDerivatives 3.0 Italy license, unless otherwise indicated.