Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner---a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.

Entity Resolution On-Demand / Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 15:7(2022), pp. 1506-1518.

Entity Resolution On-Demand

Simonini, Giovanni
;
Zecchini, Luca;Bergamaschi, Sonia;
2022

Abstract

Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner---a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
30-mag-2022
15
7
1506
1518
Entity Resolution On-Demand / Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 15:7(2022), pp. 1506-1518.
Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix
File in questo prodotto:
File Dimensione Formato  
simonini2022brewer@vldb.pdf

accesso aperto

Descrizione: Articolo principale
Tipologia: Post-print dell'autore (bozza post referaggio)
Dimensione 1.69 MB
Formato Adobe PDF
1.69 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

Caricamento pubblicazioni consigliate

Licenza Creative Commons
I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11380/1278078
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact