Deduplicated Sampling On-Demand / Zecchini, Luca; Efthymiou, Vasilis; Naumann, Felix; Simonini, Giovanni. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 18:8(2025), pp. 2482-2495.

Deduplicated Sampling On-Demand

Luca Zecchini; Vasilis Efthymiou; Felix Naumann; Giovanni Simonini
2025

Abstract

Data practitioners often sample their datasets to produce representative subsets for their downstream tasks. When entities in a dataset can be partitioned into multiple groups, stratified sampling is commonly used to produce subsets that match a target group distribution, e.g., to select a balanced subset for training a machine learning model. However, real-world data frequently contains duplicates (multiple representations of the same real-world entity) that can bias sampling, necessitating deduplication. We define deduplicated sampling as the task of producing a clean sample of a dirty dataset according to a target group distribution. The naïve approach to deduplicated sampling would first deduplicate the entire dataset upfront, then perform sampling ex post. However, that approach can be prohibitively expensive for large datasets or under tight time and resource constraints. Deduplicated sampling on-demand with RadlER is a novel approach that produces a clean sample by focusing the cleaning effort only on the entities required to appear in that sample. Our experimental evaluation, performed on multiple datasets from different domains, demonstrates that RadlER consistently outperforms baseline approaches, providing data scientists with an efficient solution to quickly produce a clean sample of a dirty dataset according to a target group distribution.
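To make the naïve baseline described in the abstract concrete, the following is a minimal Python sketch of "deduplicate everything upfront, then sample". It is not RadlER and not the authors' code. The function name, the record fields `name` and `group`, and the use of a normalized name as the duplicate key are all assumptions made for illustration; a real pipeline would use a proper entity-resolution step in place of the exact-match key.

```python
import random
from collections import defaultdict


def naive_deduplicated_sample(records, target_counts, seed=0):
    """Deduplicate the whole dataset upfront, then draw a stratified sample.

    records: list of dicts with hypothetical keys "name" and "group".
    target_counts: dict mapping group -> number of entities wanted.
    """
    # Step 1: deduplicate every record. Matching on a normalized name is a
    # toy stand-in (assumption) for a real entity-resolution step.
    entities = {}
    for rec in records:
        key = rec["name"].strip().lower()
        entities.setdefault(key, rec)  # keep the first representation seen

    # Step 2: partition the clean entities into their groups.
    by_group = defaultdict(list)
    for rec in entities.values():
        by_group[rec["group"]].append(rec)

    # Step 3: sample each group according to the target distribution,
    # capped by the number of clean entities available in that group.
    rng = random.Random(seed)
    sample = []
    for group, count in target_counts.items():
        pool = by_group.get(group, [])
        sample.extend(rng.sample(pool, min(count, len(pool))))
    return sample


records = [
    {"name": "Acme Corp", "group": "A"},
    {"name": "acme corp ", "group": "A"},  # duplicate of the first record
    {"name": "Beta Ltd", "group": "A"},
    {"name": "Gamma Inc", "group": "B"},
    {"name": "Delta LLC", "group": "B"},
]
sample = naive_deduplicated_sample(records, {"A": 2, "B": 1})
```

Note that Step 1 touches every record even though only three entities end up in the sample. That upfront cost is what the abstract identifies as prohibitive on large datasets, and what an on-demand approach aims to avoid.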
Files in this record:
p2482-zecchini-2.pdf (open access; type: VOR, version published by the publisher; format: Adobe PDF; size: 8.97 MB)
Creative Commons license
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1385869