Overlap-Based Duplicate Table Detection

Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix

Both the Web and data lakes contain much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides meaningful information about the relatedness of the tables. In particular, we focus on the largest overlap between two tables, i.e., their largest common subtable. The largest overlap can help us discover multiple coexisting versions of the same table, which may differ in the completeness and correctness of the information they convey. Automatically detecting these highly similar, duplicate tables would allow us to guarantee their consistency through data cleaning or change propagation, and also to eliminate redundancy, freeing up storage space and saving editors additional work. Unfortunately, detecting the largest overlap is a computationally challenging problem, as it requires carefully permuting columns and rows. We therefore introduce Sloth, our solution to efficiently detect the largest overlap between two tables. As we experimentally demonstrate on real-world datasets, Sloth is not only effective in solving this task, but can also benefit multiple additional use cases, such as detecting potential copying across sources or automatically discovering candidate multi-column joins.
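
To make the notion of a largest common subtable concrete, the following is a minimal brute-force sketch, not the Sloth algorithm. It assumes that tables are lists of equal-length tuples, that rows are compared under bag semantics, that each column is mapped at most once, and that "largest" means the greatest area (matched rows times matched columns). The function name `largest_overlap_bruteforce` and the example tables are hypothetical. The nested enumeration of column subsets and orderings shows why the problem becomes expensive as tables grow wider.

```python
from collections import Counter
from itertools import combinations, permutations


def largest_overlap_bruteforce(r, s):
    """Return (area, column_mapping, n_rows) for a largest common subtable.

    r, s: lists of equal-length tuples (rows). Illustrative only: the number
    of column mappings tried grows exponentially with the table width.
    """
    n_r = len(r[0]) if r else 0
    n_s = len(s[0]) if s else 0
    best = (0, (), 0)
    for k in range(1, min(n_r, n_s) + 1):
        # Choose k columns of r, then every ordered choice of k columns of s.
        for cols_r in combinations(range(n_r), k):
            bag_r = Counter(tuple(row[c] for c in cols_r) for row in r)
            for cols_s in permutations(range(n_s), k):
                bag_s = Counter(tuple(row[c] for c in cols_s) for row in s)
                # Multiset intersection counts the rows shared by both projections.
                n_rows = sum((bag_r & bag_s).values())
                area = n_rows * k
                if area > best[0]:
                    best = (area, tuple(zip(cols_r, cols_s)), n_rows)
    return best


# Hypothetical tables: same facts, different column order and extra columns.
r = [("Rome", "Italy", 2.8), ("Paris", "France", 2.1), ("Berlin", "Germany", 3.6)]
s = [("Italy", "Rome", "IT"), ("Germany", "Berlin", "DE"), ("Spain", "Madrid", "ES")]
area, mapping, n_rows = largest_overlap_bruteforce(r, s)
print(area, mapping, n_rows)
```

In this example the best mapping pairs column 0 of `r` with column 1 of `s` and column 1 of `r` with column 0 of `s`, matching two rows over two columns.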

Overlap-Based Duplicate Table Detection / Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix. - 3741:(2024), pp. 643-652. (32nd Italian Symposium on Advanced Database Systems (SEBD 2024), Villasimius, Italy, June 23-26, 2024).

Publication year: 2024
Date of first publication: 26 June 2024
Conference title: 32nd Italian Symposium on Advanced Database Systems (SEBD 2024)
Conference location: Villasimius, Italy
Conference dates: June 23-26, 2024
Scopus ID: 2-s2.0-85202075769
Series: CEUR Workshop Proceedings
Volume: 3741
First page: 643
Last page: 652
Type: Conference proceedings paper

Files in this record:

File	Size	Format
paper24.pdf (open access; VOR, version published by the publisher)	1.66 MB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.