Deduplicated Sampling On-Demand

Zecchini, Luca; Efthymiou, Vasilis; Naumann, Felix; Simonini, Giovanni

Data practitioners often sample their datasets to produce representative subsets for their downstream tasks. When entities in a dataset can be partitioned into multiple groups, stratified sampling is commonly used to produce subsets that match a target group distribution, e.g., to select a balanced subset for training a machine learning model. However, real-world data frequently contains duplicates (multiple representations of the same real-world entity) that can bias sampling, necessitating deduplication. We define deduplicated sampling as the task of producing a clean sample of a dirty dataset according to a target group distribution. The naïve approach to deduplicated sampling would first deduplicate the entire dataset upfront, then perform sampling ex post. However, that approach might be prohibitively expensive for large datasets or under tight time and resource constraints. Deduplicated sampling on-demand with RadlER is a novel approach to produce a clean sample by focusing the cleaning effort only on entities required to appear in that sample. Our experimental evaluation, performed on multiple datasets from different domains, demonstrates that RadlER consistently outperforms baseline approaches, providing data scientists with an efficient solution to quickly produce a clean sample of a dirty dataset according to a target group distribution.
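To make the task concrete, the naïve baseline described above (deduplicate everything upfront, then sample per group) might be sketched as follows. This is only an illustration, not RadlER: the function name, the record layout, and the exact-match rule on a normalized name are all assumptions made for the example, and real deduplication would typically use a far more elaborate matcher.

```python
import random
from collections import defaultdict


def naive_deduplicated_sample(records, match_key, group_of, target_counts, seed=0):
    """Hypothetical baseline: clean the whole dataset first, then sample."""
    # Step 1: deduplicate the entire dataset upfront.
    # Assumption: records with the same match key describe the same entity;
    # the first representation seen is kept.
    entities = {}
    for rec in records:
        entities.setdefault(match_key(rec), rec)

    # Step 2: partition the clean entities into their groups.
    by_group = defaultdict(list)
    for ent in entities.values():
        by_group[group_of(ent)].append(ent)

    # Step 3: stratified sampling according to the target group distribution,
    # expressed here as a requested number of entities per group.
    rng = random.Random(seed)
    sample = []
    for group, k in target_counts.items():
        pool = by_group.get(group, [])
        sample.extend(rng.sample(pool, min(k, len(pool))))
    return sample


# Toy dirty dataset: two of the five records are duplicates.
records = [
    {"name": "Ada Lovelace", "group": "A"},
    {"name": "ada  lovelace", "group": "A"},
    {"name": "Alan Turing", "group": "A"},
    {"name": "Grace Hopper", "group": "B"},
    {"name": "GRACE HOPPER", "group": "B"},
]


def normalized_name(rec):
    # Illustrative matching rule: case- and whitespace-insensitive name.
    return " ".join(rec["name"].lower().split())


def group_label(rec):
    return rec["group"]


balanced = naive_deduplicated_sample(
    records, normalized_name, group_label, {"A": 1, "B": 1}
)
```

Note that step 1 touches every record even though only two entities end up in the sample; this upfront cost is what the abstract identifies as potentially prohibitive for large datasets.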

Deduplicated Sampling On-Demand / Zecchini, Luca; Efthymiou, Vasilis; Naumann, Felix; Simonini, Giovanni. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 18:8(2025), pp. 2482-2495.