Progressive Query-Driven Entity Resolution / Zecchini, Luca. - 13058:(2021), pp. 395-401. (Paper presented at the 14th International Conference on Similarity Search and Applications (SISAP 2021), held in Dortmund, Germany (virtual event), September 29 - October 1) [10.1007/978-3-030-89657-7_30].
Progressive Query-Driven Entity Resolution
Luca Zecchini
2021
Abstract
Entity Resolution (ER) aims to detect the records in a dirty dataset that refer to the same real-world entity, and it plays a fundamental role in data cleaning and integration tasks. Often, a data scientist is only interested in a portion of the dataset (e.g., during data exploration); this interest can be expressed through a query. The traditional batch approach is far from optimal, since it requires performing ER on the whole dataset before executing a query on its cleaned version, which entails a huge number of useless comparisons and wastes time, resources, and money. Proposed solutions to this problem follow either a query-driven approach (performing ER only on the useful data) or a progressive one (emitting the entities in the result as soon as they are resolved), but these two aspects have never been reconciled. This paper introduces the BrewER framework, which allows clean queries to be executed on dirty datasets in a query-driven and progressive way, thanks to a preliminary filtering step and an iteratively managed sorted list that defines emission priority. Early results obtained by the first BrewER prototype on real-world datasets from different domains confirm the benefits of this combined solution, paving the way for a new and more comprehensive approach to ER.
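The combination of preliminary filtering and a sorted list that drives emission priority can be illustrated with a minimal Python sketch. This is not the BrewER implementation: the function names (`resolve`, `progressive_query`), the precomputed candidate blocks, the greedy pairwise matcher, and the choice of the minimum price as the ordering value of an entity are all assumptions made here for illustration.

```python
import heapq
from itertools import count


def resolve(block, match):
    """Greedily group the records of one candidate block into entities.

    Simplification: a record joins the first entity containing a match;
    a real system would compute the full transitive closure of matches.
    """
    entities = []
    for rec in block:
        for ent in entities:
            if any(match(rec, other) for other in ent):
                ent.append(rec)
                break
        else:
            entities.append([rec])
    return entities


def progressive_query(blocks, predicate, key, match):
    """Yield (order_value, entity) in ascending order, resolving blocks lazily."""
    tie = count()  # tie-breaker so the heap never compares record lists
    heap = []
    # Preliminary filtering: discard blocks that cannot contribute to the result.
    for block in blocks:
        if any(predicate(r) for r in block):
            # Seed with an optimistic lower bound on the ordering value.
            heapq.heappush(heap, (min(key(r) for r in block), next(tie), False, block))
    while heap:
        value, _, resolved, recs = heapq.heappop(heap)
        if resolved:
            # Nothing left in the heap can precede this entity: emit it now.
            yield value, recs
            continue
        # Compare records only when the block reaches the head of the list.
        for ent in resolve(recs, match):
            if any(predicate(r) for r in ent):
                heapq.heappush(heap, (min(key(r) for r in ent), next(tie), True, ent))


# Hypothetical toy data: three candidate blocks of camera records.
blocks = [
    [{"id": 1, "brand": "canon", "model": "eos 5d", "price": 900},
     {"id": 2, "brand": "canon", "model": "EOS 5D", "price": 850}],
    [{"id": 3, "brand": "nikon", "model": "d750", "price": 700}],
    [{"id": 4, "brand": "canon", "model": "eos r", "price": 600},
     {"id": 5, "brand": "canon", "model": "eos rp", "price": 1200}],
]

# Roughly: SELECT * WHERE brand = 'canon' ORDER BY price ASC
results = [
    (value, [r["id"] for r in entity])
    for value, entity in progressive_query(
        blocks,
        predicate=lambda r: r["brand"] == "canon",
        key=lambda r: r["price"],
        match=lambda a, b: a["model"].lower() == b["model"].lower(),
    )
]
```

In this sketch the block containing only the Nikon record is never compared, and the cheapest entity is emitted before the first block is resolved at all, which is the intended effect of combining the query-driven and progressive aspects. Ordering is preserved because the seed value of an unresolved block is a lower bound on the value of any entity it can produce.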
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.