Entity Resolution On-Demand for Querying Dirty Datasets

Simonini, Giovanni; Zecchini, Luca; Naumann, Felix; Bergamaschi, Sonia
2023

Abstract

Entity Resolution (ER) is the process of identifying and merging records that refer to the same real-world entity. ER is usually applied as an expensive cleaning step on the entire dataset before it is consumed, yet the relevance of the cleaned entities ultimately depends on the user’s specific application, which may require only a small portion of them. We introduce BrewER, a framework that evaluates SQL SP queries on dirty data while progressively returning results as if they had been obtained from cleaned data. BrewER cleans a single entity at a time, following the ORDER BY predicate, and thus inherently supports top-k queries and stop-and-resume execution. This approach can save a significant amount of resources in many applications. BrewER is implemented as an open-source Python library and can be used with existing ER tools and algorithms. We demonstrate its efficiency through an evaluation on four real-world datasets.
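
The idea of cleaning one entity at a time in ORDER BY order can be illustrated with a small priority-queue sketch. This is not BrewER's actual API or algorithm: every name below (`progressive_query`, `cluster_matches`, `merge`, `same_model`, the sample records) is hypothetical. It assumes that records arrive pre-grouped into candidate groups (for example by a blocking step), that records in different groups never match, that the query orders by `price` ascending, and that conflicting prices are resolved with MIN, so the cheapest record of an unresolved group is a lower bound for every entity that group can produce.

```python
import heapq
from itertools import count


def cluster_matches(records, match):
    """Group records into clusters using a pairwise matcher (transitive closure)."""
    clusters = []
    for rec in records:
        merged, rest = [rec], []
        for cluster in clusters:
            if any(match(rec, other) for other in cluster):
                merged.extend(cluster)
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


def merge(cluster):
    """Assumed resolution functions: longest name, first known brand, MIN price."""
    return {
        "name": max((r["name"] for r in cluster), key=len),
        "brand": next((r["brand"] for r in cluster if r["brand"]), None),
        "price": min(r["price"] for r in cluster),
    }


def progressive_query(candidate_groups, match, where, k=None):
    """Yield merged entities that satisfy `where`, in ascending price order."""
    tie = count()  # unique tie-breaker so the heap never compares dicts or lists
    heap = []
    for group in candidate_groups:
        # With MIN aggregation, no entity in this group is cheaper than this bound.
        bound = min(r["price"] for r in group)
        heapq.heappush(heap, (bound, next(tie), False, group))
    emitted = 0
    while heap and (k is None or emitted < k):
        _, _, resolved, item = heapq.heappop(heap)
        if resolved:
            # Every remaining key is a lower bound, so this entity is next in order.
            if where(item):
                emitted += 1
                yield item
            continue
        # Resolve only the group at the head of the queue, then re-insert its entities.
        for cluster in cluster_matches(item, match):
            entity = merge(cluster)
            heapq.heappush(heap, (entity["price"], next(tie), True, entity))


def same_model(a, b):
    """Toy matcher: two records match if their first digit-bearing token is equal."""
    def model(rec):
        tokens = rec["name"].lower().split()
        return next(t for t in tokens if any(ch.isdigit() for ch in t))
    return model(a) == model(b)


groups = [
    [{"name": "Canon EOS 400D", "brand": "canon", "price": 185.0},
     {"name": "EOS 400D", "brand": None, "price": 165.0},
     {"name": "Canon EOS 40D", "brand": "canon", "price": 300.0}],
    [{"name": "Nikon D200", "brand": "nikon", "price": 150.0}],
    [{"name": "Canon EOS 5D", "brand": "canon", "price": 900.0},
     {"name": "canon eos 5d body", "brand": "canon", "price": 850.0}],
]

# Roughly: SELECT name, price WHERE brand = 'canon' ORDER BY price ASC LIMIT 2
top2 = list(progressive_query(groups, same_model, lambda e: e["brand"] == "canon", k=2))
```

In this toy run `top2` holds the merged 400D entity (price 165.0) followed by the 40D entity (price 300.0), and the 5D group is never compared at all, which is where the savings come from. The record with no brand still contributes to the result because the WHERE condition is checked on the merged entity rather than on raw records. Because `progressive_query` is a generator, a caller can stop consuming it and later continue with `next()`, which mirrors stop-and-resume execution in spirit.
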
Year: 2023
Date: 4 July 2023
Conference: 31st Italian Symposium on Advanced Database Systems (SEBD 2023)
Location: Galzignano Terme (Padova), Italy
Conference dates: July 2-5, 2023
Volume: 3478
Pages: 410-419
Authors: Simonini, Giovanni; Zecchini, Luca; Naumann, Felix; Bergamaschi, Sonia
Entity Resolution On-Demand for Querying Dirty Datasets / Simonini, Giovanni; Zecchini, Luca; Naumann, Felix; Bergamaschi, Sonia. - 3478:(2023), pp. 410-419. (Paper presented at the 31st Italian Symposium on Advanced Database Systems (SEBD 2023), held in Galzignano Terme (Padova), Italy, July 2-5, 2023).
Files in this record:
paper70.pdf
Access: Open access
Type: Publisher's version
Size: 2.79 MB
Format: Adobe PDF
Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1317066