Entity Resolution (ER) is a fundamental task of data integration: it identifies different representations (i.e., profiles) of the same real-world entity in databases. To compare all possible profile pairs through an ER algorithm has a quadratic complexity. Blocking is commonly employed to avoid that: profiles are grouped into blocks according to some features, and ER is performed only for entities of the same block. Yet, devising blocking criteria and ER algorithms for data with highly schema heterogeneity is a difficult and error-prone task calling for automatic methods and debugging tools. In our previous work, we presented Blast, an ER system that can scale practitioners’ favorite Entity Resolution algorithms. In current version, Blast has been devised to take full advantage of parallel and distributed computation as well (running on top of Apache Spark). It implements the state-of-the-art unsuper- vised blocking method based on automatically extracted loose schema information. It comes with a GUI, which allows: (i) to visualize, understand, and (optionally) manually modify the loose schema information automatically extracted (i.e., injecting user’s knowledge in the system); (ii) to retrieve resolved entities through a free-text search box, and to visualize the process that lead to that result (i.e., the provenance). Experimental results on real-world datasets show that these two functionalities can significantly enhance Entity Resolution results.

Enhancing Loosely Schema-aware Entity Resolution with User Interaction / Simonini, Giovanni; Gagliardelli, Luca; Zhu, Song; Bergamaschi, Sonia. - (2018), pp. 844-851. ((Intervento presentato al convegno 2018 International Conference on High Performance Computing & Simulation, HPCS 2018, Orleans, France, July 16-20, 2018. tenutosi a Orléans, France nel 16-20 luglio 2018 [10.1109/HPCS.2018.00138].

Enhancing Loosely Schema-aware Entity Resolution with User Interaction

giovanni simonini
;
luca gagliardelli
;
ZHU, SONG;sonia bergamaschi
2018

Abstract

Entity Resolution (ER) is a fundamental task of data integration: it identifies different representations (i.e., profiles) of the same real-world entity in databases. To compare all possible profile pairs through an ER algorithm has a quadratic complexity. Blocking is commonly employed to avoid that: profiles are grouped into blocks according to some features, and ER is performed only for entities of the same block. Yet, devising blocking criteria and ER algorithms for data with highly schema heterogeneity is a difficult and error-prone task calling for automatic methods and debugging tools. In our previous work, we presented Blast, an ER system that can scale practitioners’ favorite Entity Resolution algorithms. In current version, Blast has been devised to take full advantage of parallel and distributed computation as well (running on top of Apache Spark). It implements the state-of-the-art unsuper- vised blocking method based on automatically extracted loose schema information. It comes with a GUI, which allows: (i) to visualize, understand, and (optionally) manually modify the loose schema information automatically extracted (i.e., injecting user’s knowledge in the system); (ii) to retrieve resolved entities through a free-text search box, and to visualize the process that lead to that result (i.e., the provenance). Experimental results on real-world datasets show that these two functionalities can significantly enhance Entity Resolution results.
2018 International Conference on High Performance Computing & Simulation, HPCS 2018, Orleans, France, July 16-20, 2018.
Orléans, France
16-20 luglio 2018
844
851
Simonini, Giovanni; Gagliardelli, Luca; Zhu, Song; Bergamaschi, Sonia
Enhancing Loosely Schema-aware Entity Resolution with User Interaction / Simonini, Giovanni; Gagliardelli, Luca; Zhu, Song; Bergamaschi, Sonia. - (2018), pp. 844-851. ((Intervento presentato al convegno 2018 International Conference on High Performance Computing & Simulation, HPCS 2018, Orleans, France, July 16-20, 2018. tenutosi a Orléans, France nel 16-20 luglio 2018 [10.1109/HPCS.2018.00138].
File in questo prodotto:
File Dimensione Formato  
08514443.pdf

non disponibili

Tipologia: Versione dell'editore (versione pubblicata)
Dimensione 364.22 kB
Formato Adobe PDF
364.22 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

Caricamento pubblicazioni consigliate

Licenza Creative Commons
I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11380/1167194
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 0
social impact