Entity Resolution On-Demand

Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix

doi:10.14778/3523210.3523226

Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on demand and return the results in a timely manner, a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP (select-project) queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
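The idea of cleaning one entity at a time in ORDER BY order can be illustrated with a small sketch. This is not the BrewER algorithm from the paper, only a toy model built on assumptions that do not come from this record: records are already grouped into disjoint candidate blocks, conflicting prices are merged with MAX, the query orders by price descending, and `match`, `resolve`, `on_demand`, and the sample data are hypothetical names. Under those assumptions, the largest price in an unresolved block is an upper bound for every entity inside it, so a priority queue can defer matching until a block could actually contribute the next result.

```python
import heapq
from itertools import count


def match(a, b):
    # Hypothetical matcher: two records match when their names are equal
    # after lowercasing and removing spaces.
    def norm(r):
        return r["name"].lower().replace(" ", "")
    return norm(a) == norm(b)


def resolve(block):
    """Split one candidate block into merged entities (MAX price wins)."""
    clusters = []
    for rec in block:
        for cluster in clusters:
            if any(match(rec, other) for other in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return [{"name": c[0]["name"], "price": max(r["price"] for r in c)}
            for c in clusters]


def on_demand(blocks, where, stats=None):
    """Yield cleaned entities ordered by price DESC, resolving blocks lazily."""
    tie = count()  # unique tie-breaker so heap entries never compare dicts
    heap = []
    for block in blocks:
        bound = max(r["price"] for r in block)  # optimistic bound for the block
        heapq.heappush(heap, (-bound, next(tie), False, block))
    while heap:
        _, _, is_entity, item = heapq.heappop(heap)
        if is_entity:
            # A resolved entity at the top cannot be outranked by anything
            # still unresolved, so it is safe to emit now.
            if where(item):
                yield item
        else:
            if stats is not None:
                stats["resolved_blocks"] = stats.get("resolved_blocks", 0) + 1
            for entity in resolve(item):
                heapq.heappush(heap, (-entity["price"], next(tie), True, entity))


# Made-up sample data, for illustration only.
blocks = [
    [{"name": "Canon EOS 5D", "price": 2100},
     {"name": "canon eos5d", "price": 2300},
     {"name": "Canon EOS 400D", "price": 400}],
    [{"name": "Nikon D90", "price": 900},
     {"name": "nikon d90", "price": 850}],
    [{"name": "Sony A7", "price": 1500}],
]

stats = {}
top1 = next(on_demand(blocks, lambda e: e["price"] > 0, stats))
print(top1, stats)
```

Because `on_demand` is a generator, a top-1 request stops after resolving a single block in this example, and the caller can resume later by asking for more results, which mirrors the top-k and stop-and-resume behaviour the abstract describes.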

Entity Resolution On-Demand / Simonini, G., Zecchini, L., Bergamaschi, S., Naumann, F. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 15:7(2022), pp. 1506-1518. (48th International Conference on Very Large Data Bases, VLDB 2022, Australia, 2022) [10.14778/3523210.3523226].

Publication year: 2022
Date of first publication: 30 May 2022
Journal: PROCEEDINGS OF THE VLDB ENDOWMENT
Volume: 15
Issue: 7
First page: 1506
Last page: 1518
DOI: https://dx.doi.org/10.14778/3523210.3523226
WoS code: WOS:000992375700017
Scopus code: 2-s2.0-85130755597
All authors: Simonini, Giovanni; Zecchini, Luca; Bergamaschi, Sonia; Naumann, Felix
Type: Journal article

Files in this record:

File	Size	Format
[email protected] (Open access; Description: Main article; Type: AAM - author's version, revised and accepted for publication)	1.69 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.