Both the Web and data lakes contain a great deal of redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides meaningful information about the relatedness of the tables. In particular, we focus on the largest overlap between two tables, i.e., their largest common subtable. The largest overlap can help us discover multiple coexisting versions of the same table, which may differ in the completeness and correctness of the information they convey. Automatically detecting these highly similar, duplicate tables would allow us to guarantee their consistency through data cleaning or change propagation, and also to eliminate redundancy, freeing up storage space and saving editors additional work. Unfortunately, detecting the largest overlap is a computationally challenging problem that requires carefully permuting columns and rows. We therefore introduce Sloth, our solution to efficiently detect the largest overlap between two tables. As we experimentally demonstrate on real-world datasets, Sloth is not only effective in solving this task, but can also benefit multiple additional use cases, such as detecting potential copying across sources or automatically discovering candidate multi-column joins.
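To make the notion of a largest common subtable concrete, the following is a minimal brute-force sketch and not Sloth itself. It assumes that "largest" is measured by area (matched rows times matched columns), that columns are mapped one-to-one, and that rows are compared as multisets; none of these choices is stated in the abstract. The function name `largest_overlap` and the toy tables are hypothetical. Enumerating every column subset and permutation is exponential in the number of columns, which illustrates why the problem is computationally challenging.

```python
from collections import Counter
from itertools import combinations, permutations


def largest_overlap(table_a, table_b):
    """Brute-force search for the largest common subtable of two tables.

    Tables are lists of equal-length tuples. Returns a triple
    (area, column_mapping, shared_rows), where column_mapping pairs a
    column index of table_a with a column index of table_b.
    Assumption: size is measured as matched rows * matched columns.
    """
    n_cols_a = len(table_a[0]) if table_a else 0
    n_cols_b = len(table_b[0]) if table_b else 0
    best = (0, (), [])
    for k in range(1, min(n_cols_a, n_cols_b) + 1):
        # Choose k columns of table_a (order fixed) ...
        for cols_a in combinations(range(n_cols_a), k):
            rows_a = Counter(tuple(row[c] for c in cols_a) for row in table_a)
            # ... and try every ordered choice of k columns of table_b.
            for cols_b in permutations(range(n_cols_b), k):
                rows_b = Counter(tuple(row[c] for c in cols_b) for row in table_b)
                shared = rows_a & rows_b  # multiset intersection of projected rows
                area = k * sum(shared.values())
                if area > best[0]:
                    best = (area, tuple(zip(cols_a, cols_b)), list(shared.elements()))
    return best


# Made-up toy tables: same content in part, with columns and rows reordered.
table_a = [("a", 1, "x"), ("b", 2, "y"), ("c", 3, "z")]
table_b = [("y", "b", 2), ("x", "a", 9), ("w", "d", 4)]

area, mapping, rows = largest_overlap(table_a, table_b)
print(area, mapping, rows)
```

On these toy tables the search maps column 0 of `table_a` to column 1 of `table_b` and column 2 to column 0, yielding two shared rows over two columns. Mapping all three columns would keep only one row, so the wider overlap has a smaller area under the assumed measure.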

Overlap-Based Duplicate Table Detection / Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix. - 3741:(2024), pp. 643-652. (Paper presented at the 32nd Italian Symposium on Advanced Database Systems (SEBD 2024), held in Villasimius, Italy, June 23-26, 2024).

Overlap-Based Duplicate Table Detection

Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix
2024

Year: 2024
Date: 26 Jun 2024
Conference: 32nd Italian Symposium on Advanced Database Systems (SEBD 2024)
Location: Villasimius, Italy
Conference dates: June 23-26, 2024
Volume: 3741
First page: 643
Last page: 652
Files in this item:
paper24.pdf
Access: Open access
Type: Publisher's published version
Size: 1.66 MB
Format: Adobe PDF

Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1351687