Zecchini, Luca; Efthymiou, Vasilis; Naumann, Felix; Simonini, Giovanni. Deduplicated Sampling On-Demand. In: Proceedings of the VLDB Endowment, 18(8), 2025, pp. 2482-2495. ISSN 2150-8097.
Deduplicated Sampling On-Demand
Luca Zecchini; Vasilis Efthymiou; Felix Naumann; Giovanni Simonini
2025
Abstract
Data practitioners often sample their datasets to produce representative subsets for their downstream tasks. When entities in a dataset can be partitioned into multiple groups, stratified sampling is commonly used to produce subsets that match a target group distribution, e.g., to select a balanced subset for training a machine learning model. However, real-world data frequently contains duplicates (multiple representations of the same real-world entity) that can bias sampling, necessitating deduplication. We define deduplicated sampling as the task of producing a clean sample of a dirty dataset according to a target group distribution.

The naïve approach to deduplicated sampling would first deduplicate the entire dataset upfront, then perform sampling ex post. However, that approach might be prohibitively expensive for large datasets and under time or resource constraints. Deduplicated sampling on-demand with RadlER is a novel approach to produce a clean sample by focusing the cleaning effort only on the entities required to appear in that sample. Our experimental evaluation, performed on multiple datasets from different domains, demonstrates that RadlER consistently outperforms baseline approaches, providing data scientists with an efficient solution to quickly produce a clean sample of a dirty dataset according to a target group distribution.
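To make the contrast in the abstract concrete, the following Python sketch illustrates the two strategies under simplifying assumptions; it is not the RadlER algorithm from the paper. The `resolve` function, the toy records, and the per-group quota bookkeeping are hypothetical stand-ins: the naïve baseline resolves every record before sampling, while the on-demand variant resolves only records it may actually place in the sample and stops once every group quota is met.

```python
# Conceptual sketch only (not the RadlER algorithm): naive "deduplicate
# everything, then sample" vs. resolving records on demand during sampling.
# `resolve`, the record layout, and the data below are hypothetical.
import random
from collections import defaultdict

def resolve(record):
    """Placeholder for an expensive entity-resolution step: returns a canonical entity id."""
    return record["name"].strip().lower()  # toy normalization

def naive_sample(records, target_counts, seed=0):
    """Baseline: deduplicate every record up front, then stratified-sample per group."""
    by_group = defaultdict(set)
    for r in records:                       # resolves ALL records, even unneeded ones
        by_group[r["group"]].add(resolve(r))
    rng = random.Random(seed)
    return {g: rng.sample(sorted(by_group[g]), k) for g, k in target_counts.items()}

def on_demand_sample(records, target_counts, seed=0):
    """On-demand: resolve records lazily, stopping once every group quota is met."""
    rng = random.Random(seed)
    pool = list(records)
    rng.shuffle(pool)
    need = dict(target_counts)              # remaining quota per group
    chosen, seen = defaultdict(list), set()
    for r in pool:
        if not any(need.values()):
            break                           # all quotas satisfied: stop early
        g = r["group"]
        if need.get(g, 0) == 0:
            continue                        # group already filled: skip cleaning entirely
        e = resolve(r)                      # resolve only records we might keep
        if e not in seen:
            seen.add(e)
            chosen[g].append(e)
            need[g] -= 1
    return dict(chosen)

if __name__ == "__main__":
    data = [
        {"name": "Acme ", "group": "A"}, {"name": "acme", "group": "A"},
        {"name": "Beta", "group": "A"}, {"name": "Gamma", "group": "B"},
        {"name": "gamma ", "group": "B"}, {"name": "Delta", "group": "B"},
    ]
    target = {"A": 1, "B": 1}               # target group distribution
    print(naive_sample(data, target))
    print(on_demand_sample(data, target))
```

In this toy setting the savings come from skipping `resolve` for records whose group quota is already filled and from stopping as soon as all quotas are met, which is the intuition behind focusing the cleaning effort only on the entities required to appear in the sample.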
| File | Type | Size | Format | Access |
|---|---|---|---|---|
| p2482-zecchini-2.pdf | VOR - Version published by the publisher | 8.97 MB | Adobe PDF | Open access |

The metadata in IRIS UNIMORE is released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.