Entity Resolution is a crucial task for many applications, but its naive solution is inefficient because of its quadratic complexity. To reduce this complexity, blocking is usually employed to cluster similar entities, so that comparisons are limited to entities within the same block. Meta-Blocking (MB) restructures the block collection to further reduce the number of comparisons, yielding better execution times. However, these techniques alone are not sufficient in the context of Big Data, where the records to be compared are typically in the order of hundreds of millions. Parallel implementations of MB have been proposed in the literature, but all of them are built on Hadoop MapReduce, which is known to have low efficiency on modern cluster architectures. We implement a Meta-Blocking technique for Apache Spark. Unlike Hadoop, Apache Spark uses a different paradigm to manage tasks: it does not need to save partial results on disk and instead keeps them in memory, which guarantees a shorter execution time. We reimplemented the state-of-the-art MB techniques, creating a new algorithm that exploits the Spark architecture. We tested our algorithm on several established datasets, showing that our Spark implementation outperforms existing ones based on Hadoop.
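To make the idea of blocking followed by meta-blocking concrete, here is a minimal single-machine sketch in plain Python. It is not the SparkER algorithm and does not use Spark. It assumes schema-agnostic token blocking, weights each candidate pair by the number of blocks the two records share, and prunes pairs whose weight falls below the mean weight. The function names (`token_blocking`, `candidate_pairs`, `prune_below_mean`) and the toy records are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Group record ids by every token in their text (assumed blocking key)."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            blocks[token].add(rid)
    # A block with a single record yields no comparison, so it is dropped.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Weight each co-occurring pair by the number of blocks it shares."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_below_mean(weights):
    """Keep only the pairs whose weight is at least the mean weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Toy records, invented for this example.
records = {
    1: "Apple iPhone 7 32GB black",
    2: "iphone 7 apple 32gb",
    3: "Samsung Galaxy S8 black",
    4: "galaxy s8 samsung 64GB",
}

naive = len(records) * (len(records) - 1) // 2  # all-pairs comparisons
blocks = token_blocking(records)
weights = candidate_pairs(blocks)               # pairs left after blocking
kept = prune_below_mean(weights)                # pairs left after pruning

print(naive, len(weights), len(kept))
```

On this toy input the naive approach compares every pair, blocking keeps only the pairs that share at least one token, and pruning then discards the pair that shares only the token "black". A distributed implementation would have to partition the blocks and the pair weighting across workers, which this sketch does not attempt.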
SparkER: an Entity Resolution framework for Apache Spark / Gagliardelli, Luca; Simonini, Giovanni; Zhu, Song; Bergamaschi, Sonia. - (2017).
SparkER: an Entity Resolution framework for Apache Spark
GAGLIARDELLI, LUCA;SIMONINI, GIOVANNI;ZHU, SONG;BERGAMASCHI, Sonia
2017