Progressive Query-Driven Entity Resolution

Zecchini, Luca

doi:10.1007/978-3-030-89657-7_30

Entity Resolution (ER) aims to detect the records in a dirty dataset that refer to the same real-world entity, and plays a fundamental role in data cleaning and integration tasks. Often, a data scientist is only interested in a portion of the dataset (e.g., during data exploration); this interest can be expressed through a query. The traditional batch approach is far from optimal, since it requires performing ER on the whole dataset before executing a query on its cleaned version, and thus performs a huge number of useless comparisons. This wastes time, resources, and money. Proposed solutions to this problem follow either a query-driven approach (perform ER only on the useful data) or a progressive one (emit the entities in the result as soon as they are resolved), but these two aspects have never been reconciled. This paper introduces the BrewER framework, which allows clean queries to be executed on dirty datasets in a query-driven and progressive way, thanks to a preliminary filtering step and an iteratively managed sorted list that defines emission priority. Early results obtained by the first BrewER prototype on real-world datasets from different domains confirm the benefits of this combined solution, paving the way for a new and more comprehensive approach to ER.
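The abstract names two mechanisms, a preliminary filtering and a sorted list that defines emission priority, without giving further detail. The following Python sketch is one possible reading of that idea, not the BrewER implementation. Everything in it is an assumption made for illustration: candidate groups are taken as already produced by some blocking step, the query is an ascending ORDER BY on a numeric attribute with a WHERE predicate that tests a single attribute, conflicts on the ordering attribute are resolved by taking the minimum, and all function names (`resolve_group`, `merge`, `progressive_query`) are hypothetical.

```python
import heapq
from itertools import combinations, count


def resolve_group(group, is_match):
    """Cluster one candidate group into entities by pairwise matching (union-find)."""
    parent = list(range(len(group)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(group)), 2):
        if is_match(group[i], group[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(group):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())


def merge(cluster, order_attr):
    """Assumed conflict resolution: first value seen, minimum for order_attr."""
    entity = {}
    for rec in cluster:
        for key, value in rec.items():
            entity.setdefault(key, value)
    entity[order_attr] = min(rec[order_attr] for rec in cluster)
    return entity


def progressive_query(groups, is_match, predicate, order_attr):
    """Yield resolved entities satisfying `predicate`, in ascending `order_attr`."""
    tie = count()  # unique tie-breaker, so the heap never compares dicts or lists
    heap = []
    # Preliminary filtering (assumed form): skip groups in which no record
    # satisfies the single-attribute predicate.
    for group in groups:
        if any(predicate(rec) for rec in group):
            # Lower bound on the ordering value of any entity from this group.
            bound = min(rec[order_attr] for rec in group)
            heapq.heappush(heap, (bound, next(tie), False, group))
    while heap:
        _, _, resolved, item = heapq.heappop(heap)
        if resolved:
            # Every remaining element has a bound >= this value: safe to emit.
            yield item
        else:
            # Run ER only on this group, then re-insert its entities.
            for cluster in resolve_group(item, is_match):
                entity = merge(cluster, order_attr)
                if predicate(entity):
                    heapq.heappush(
                        heap, (entity[order_attr], next(tie), True, entity)
                    )


# Toy data (invented for this example).
records = [
    {"id": 1, "brand": "canon", "model": "eos 5d", "price": 900},
    {"id": 2, "brand": "canon", "model": "eos 5d", "price": 850},
    {"id": 3, "brand": "nikon", "model": "d750", "price": 700},
    {"id": 4, "brand": "canon", "model": "eos r", "price": 600},
    {"id": 5, "brand": "canon", "model": "eos-r", "price": 650},
]
groups = [[records[0], records[1]], [records[2]], [records[3], records[4]]]


def same_model(a, b):
    return a["model"].replace("-", " ") == b["model"].replace("-", " ")


results = list(
    progressive_query(groups, same_model, lambda r: r["brand"] == "canon", "price")
)
```

In this sketch the group containing only the "nikon" record is never compared, and each entity is emitted as soon as it reaches the head of the list, which is the combination of query-driven and progressive behavior the abstract describes. A batch approach would instead resolve all three groups before evaluating the query.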

Progressive Query-Driven Entity Resolution / Zecchini, L. - 13058 (2021), pp. 395-401. (14th International Conference on Similarity Search and Applications (SISAP 2021), Dortmund, Germany (virtual event), September 29 - October 1) [10.1007/978-3-030-89657-7_30].

Year of publication: 2021
Date of first publication: 22 Oct 2021
Conference title: 14th International Conference on Similarity Search and Applications (SISAP 2021)
Conference location: Dortmund, Germany (virtual event)
Conference dates: September 29 - October 1
DOI: https://dx.doi.org/10.1007/978-3-030-89657-7_30
WoS ID: WOS:000722252200030
Scopus ID: 2-s2.0-85119018070
Series: Lecture Notes in Computer Science
Volume number: 13058
First page: 395
Last page: 401
All authors: Zecchini, Luca
Type: Conference proceedings paper

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.