Deduplicated Sampling On-Demand / Zecchini, L., Efthymiou, V., Naumann, F., Simonini, G. - In: Proceedings of the VLDB Endowment. - ISSN 2150-8097. - 18:8 (2025), pp. 2482-2495. (51st International Conference on Very Large Data Bases, VLDB 2025, GBR, 2025) [10.14778/3742728.3742742].

Deduplicated Sampling On-Demand

Luca Zecchini; Vasilis Efthymiou; Felix Naumann; Giovanni Simonini
2025

Abstract

Data practitioners often sample their datasets to produce representative subsets for their downstream tasks. When entities in a dataset can be partitioned into multiple groups, stratified sampling is commonly used to produce subsets that match a target group distribution, e.g., to select a balanced subset for training a machine learning model. However, real-world data frequently contains duplicates (multiple representations of the same real-world entity) that can bias sampling, necessitating deduplication. We define deduplicated sampling as the task of producing a clean sample of a dirty dataset according to a target group distribution. The naïve approach to deduplicated sampling would first deduplicate the entire dataset upfront, then perform sampling ex post. However, that approach might be prohibitively expensive for large datasets or under tight time and resource constraints. Deduplicated sampling on-demand with RadlER is a novel approach to produce a clean sample by focusing the cleaning effort only on entities required to appear in that sample. Our experimental evaluation, performed on multiple datasets from different domains, demonstrates that RadlER consistently outperforms baseline approaches, providing data scientists with an efficient solution to quickly produce a clean sample of a dirty dataset according to a target group distribution.
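To make the task concrete, the following is a minimal sketch of the naïve baseline named in the abstract (deduplicate everything first, then draw a stratified sample). It is not RadlER and does not reflect the paper's implementation. The function names, the toy records, and the matching rule (two records are the same entity when their trimmed, lower-cased names are equal) are all assumptions made for illustration; real deduplication would rely on a proper entity-resolution matcher.

```python
import random
from collections import defaultdict


def deduplicate(records, entity_key):
    """Keep one representative record per entity.

    `entity_key` is a hypothetical matching rule that maps a record to an
    entity identifier; records sharing an identifier count as duplicates.
    """
    representatives = {}
    for record in records:
        # The first record seen for an entity becomes its representative.
        representatives.setdefault(entity_key(record), record)
    return list(representatives.values())


def stratified_sample(entities, group_of, target, seed=0):
    """Draw target[g] entities from each group g, without replacement."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for entity in entities:
        by_group[group_of(entity)].append(entity)
    sample = []
    for group, count in target.items():
        if len(by_group[group]) < count:
            raise ValueError(f"group {group!r} has fewer than {count} entities")
        sample.extend(rng.sample(by_group[group], count))
    return sample


def naive_deduplicated_sample(records, entity_key, group_of, target, seed=0):
    """Naive baseline: clean the whole dataset upfront, then sample."""
    clean = deduplicate(records, entity_key)
    return stratified_sample(clean, group_of, target, seed)


# Toy dirty dataset: two entities appear twice with different spellings.
dirty = [
    {"name": "Ann Lee", "group": "A"},
    {"name": "ann lee ", "group": "A"},
    {"name": "Bob Roy", "group": "A"},
    {"name": "Cy Tran", "group": "B"},
    {"name": "CY TRAN", "group": "B"},
    {"name": "Di Wu", "group": "B"},
]


def name_key(record):
    # Assumed matching rule: case- and whitespace-insensitive name equality.
    return record["name"].strip().lower()


def group_key(record):
    return record["group"]


sample = naive_deduplicated_sample(dirty, name_key, group_key, {"A": 1, "B": 1})
```

The baseline cleans all six records even though the sample needs only two entities. That upfront cost is what the abstract says the on-demand approach avoids by cleaning only the entities required to appear in the sample.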
Creative Commons License
The metadata in IRIS UNIMORE is released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1385869