Progressive Query-Driven Entity Resolution / Zecchini, Luca. - 13058:(2021), pp. 395-401. (14th International Conference on Similarity Search and Applications (SISAP 2021), Dortmund, Germany (virtual event), September 29 - October 1) [10.1007/978-3-030-89657-7_30].

Progressive Query-Driven Entity Resolution

Luca Zecchini
2021

Abstract

Entity Resolution (ER) aims to detect the records in a dirty dataset that refer to the same real-world entity, and it plays a fundamental role in data cleaning and integration tasks. Often, a data scientist is interested in only a portion of the dataset (e.g., during data exploration); this interest can be expressed through a query. The traditional batch approach is far from optimal, since it requires performing ER on the whole dataset before executing a query on its cleaned version, which entails a huge number of useless comparisons and wastes time, resources, and money. Proposed solutions to this problem follow either a query-driven approach (performing ER only on the useful data) or a progressive one (emitting the entities in the result as soon as they are resolved), but these two aspects have never been reconciled. This paper introduces the BrewER framework, which makes it possible to execute clean queries on dirty datasets in a query-driven and progressive way, thanks to a preliminary filtering step and an iteratively managed sorted list that defines emission priority. Early results obtained by the first BrewER prototype on real-world datasets from different domains confirm the benefits of this combined solution, paving the way for a new and more comprehensive approach to ER.
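
To make the combination of preliminary filtering and a priority-ordered list more concrete, the following is a minimal illustrative sketch in Python. It is not the BrewER implementation: the function names (`matches`, `resolve`, `progressive_query`), the record fields, the toy matcher, and the aggregation rules are all hypothetical. It assumes that candidate blocks of possibly matching records are already available, that the query orders results by ascending price, and that the price of a resolved entity is the minimum over its records, so the minimum price in a block is a lower bound for every entity it can produce.

```python
import heapq
from itertools import count


def matches(a, b):
    # Hypothetical matcher: records match when their names agree ignoring case.
    return a["name"].lower() == b["name"].lower()


def resolve(block):
    """Cluster one candidate block into entities and merge each cluster."""
    clusters = []
    for rec in block:
        for cluster in clusters:
            if any(matches(rec, other) for other in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    # Assumed aggregation: first name, first non-missing brand, minimum price.
    return [
        {
            "name": c[0]["name"],
            "brand": next((r["brand"] for r in c if r["brand"] is not None), None),
            "price": min(r["price"] for r in c),
        }
        for c in clusters
    ]


def progressive_query(blocks, predicate, key="price"):
    """Yield resolved entities that satisfy predicate, in ascending key order."""
    tie = count()  # unique tie-breaker, so heap entries never compare payloads
    heap = []
    # Preliminary filtering: keep only blocks with at least one record that
    # satisfies the predicate; the other blocks are never resolved.
    for block in blocks:
        if any(predicate(r) for r in block):
            bound = min(r[key] for r in block)  # lower bound on entity key
            heapq.heappush(heap, (bound, next(tie), False, block))
    while heap:
        _, _, is_resolved, payload = heapq.heappop(heap)
        if is_resolved:
            # Nothing left in the heap can precede this entity: emit it now.
            yield payload
        else:
            # Resolve only the block at the head, then re-insert its entities.
            for entity in resolve(payload):
                if predicate(entity):
                    heapq.heappush(heap, (entity[key], next(tie), True, entity))


def is_acme(record):
    return record["brand"] == "acme"


blocks = [
    [{"name": "Cam X", "brand": "acme", "price": 120},
     {"name": "cam x", "brand": None, "price": 99}],
    [{"name": "Cam Y", "brand": "acme", "price": 80}],
    [{"name": "Phone Z", "brand": "other", "price": 10}],
    [{"name": "Cam W", "brand": "acme", "price": 150},
     {"name": "Cam V", "brand": "other", "price": 60}],
]

for entity in progressive_query(blocks, is_acme):
    print(entity["name"], entity["price"])
```

Because `progressive_query` is a generator, a caller can stop after the first few entities and the remaining blocks are never compared. In this toy data, the block containing only "Phone Z" is discarded by the filter without any matching work.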
2021
22 Oct 2021
English
14th International Conference on Similarity Search and Applications (SISAP 2021)
Dortmund, Germany (virtual event)
September 29 - October 1
https://doi.org/10.1007/978-3-030-89657-7_30
Similarity Search and Applications - 14th International Conference, SISAP 2021, Dortmund, Germany, September 29 - October 1, 2021, Proceedings
Nora Reyes, Richard Connor, Nils Kriege, Daniyal Kazempour, Ilaria Bartolini, Erich Schubert, Jian-Jia Chen
13058
30
395
401
978-3-030-89656-0
978-3-030-89657-7
Springer
Switzerland
Cham
International
Contribution
Entity resolution, Data integration, Data cleaning
Zecchini, Luca
Conference proceedings::Paper in conference proceedings
info:eu-repo/semantics/conferenceObject
Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1254600