Determining the Largest Overlap between Tables

Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix

doi:10.1145/3639303

Both on the Web and in data lakes, it is possible to detect much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides significant information about the relatedness of the tables. Unfortunately, efficiently quantifying the overlap between two tables is not trivial. In particular, detecting their largest overlap, i.e., their largest common subtable, is a computationally challenging problem. As the information overlap may not occur in contiguous portions of the tables, only the ability to permute columns and rows can reveal it. The detection of the largest overlap can help us in relevant tasks such as the discovery of multiple coexisting versions of the same table, which can present differences in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, matching tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. We present the first formal definition of this problem, and with it Sloth, our solution to efficiently detect the largest overlap between two tables. We experimentally demonstrate on real-world datasets its efficacy in solving this task, analyzing its performance and showing its impact on multiple use cases.
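
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not Sloth, whose algorithm the abstract does not describe. It assumes that a table is a list of equal-length tuples, that the size of an overlap is its area (matched rows times mapped columns), and that duplicate rows are paired one-to-one. The function name `largest_overlap_area` is hypothetical. The search enumerates every column mapping, so it is exponential in the number of columns and only usable on tiny inputs, which hints at why the problem is computationally challenging.

```python
from collections import Counter
from itertools import combinations, permutations


def largest_overlap_area(r, s):
    """Brute-force the area (rows x columns) of the largest common subtable.

    Assumption: r and s are lists of equal-length tuples (rows).
    """
    if not r or not s:
        return 0
    n_r, n_s = len(r[0]), len(s[0])
    best = 0
    for k in range(1, min(n_r, n_s) + 1):
        # Choose k columns of r, then every ordered choice of k columns of s:
        # together they enumerate all one-to-one column mappings of size k.
        for cols_r in combinations(range(n_r), k):
            # Counter keeps duplicate rows, so tables are treated as bags.
            proj_r = Counter(tuple(row[c] for c in cols_r) for row in r)
            for cols_s in permutations(range(n_s), k):
                proj_s = Counter(tuple(row[c] for c in cols_s) for row in s)
                # Multiset intersection: rows that can be paired one-to-one,
                # regardless of their position in either table.
                shared_rows = sum((proj_r & proj_s).values())
                best = max(best, k * shared_rows)
    return best


# Two toy tables with different column and row orders.
r = [("Ada", "UK", 1815), ("Alan", "UK", 1912), ("Grace", "US", 1906)]
s = [(1906, "Grace", "Navy"), (1815, "Ada", "Math"), (1903, "Alonzo", "Logic")]

# Name and year columns match for the Ada and Grace rows: 2 rows x 2 columns.
print(largest_overlap_area(r, s))  # 4
```

In this toy case the overlap is not contiguous in either table: it only appears once the columns of `s` are reordered and its rows are matched independently of their position.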

Determining the Largest Overlap between Tables / Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix. - In: PROCEEDINGS OF THE ACM ON MANAGEMENT OF DATA. - ISSN 2836-6573. - 2:1(2024), pp. 1-26. [10.1145/3639303]

Year of publication: 2024
Date of first publication: 26 March 2024
Journal: PROCEEDINGS OF THE ACM ON MANAGEMENT OF DATA
Volume: 2
Issue: 1
First page: 1
Last page: 26
DOI: https://dx.doi.org/10.1145/3639303
Citation: Determining the Largest Overlap between Tables / Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix. - In: PROCEEDINGS OF THE ACM ON MANAGEMENT OF DATA. - ISSN 2836-6573. - 2:1(2024), pp. 1-26. [10.1145/3639303]
All authors: Zecchini, Luca; Bleifuß, Tobias; Simonini, Giovanni; Bergamaschi, Sonia; Naumann, Felix
Type: Journal article

Files in this item:

File	Access	Type	License	Size	Format
zecchini_2024_pacmmod_sloth.pdf	Open access	VOR (version published by the publisher)	[IR] creative-commons	2.49 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.