Integrazione di dati on-demand

Zecchini, Luca

Companies and organizations depend heavily on their data to make informed business decisions, so guaranteeing high data quality is critical to the reliability of data analysis. Data integration, which combines data acquired from several heterogeneous sources to provide users with a unified, consistent view, plays a fundamental role in enhancing the value of the data at hand. In the past, when data integration involved a limited number of sources, ETL (extract, transform, load) established itself as the most popular paradigm: once collected, raw data is cleaned, then stored in a data warehouse for analysis. Nowadays, big data integration has to deal with millions of sources, and the paradigm is increasingly moving towards ELT (extract, load, transform): a huge amount of raw data is collected and stored directly (e.g., in a data lake), and different users then transform portions of it according to the task at hand. Novel approaches to data integration are therefore needed to address the challenges raised by this paradigm.
One of the fundamental building blocks of data integration is entity resolution (ER), which detects profiles that describe the same real-world entity in order to consolidate them into a single consistent representation. ER is typically employed as an expensive offline cleaning step on the entire data before it is consumed. Yet which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on demand and return results in a timely manner. Hence, we propose BrewER, a framework to evaluate SQL SP (select-project) queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER focuses the cleaning effort on one entity at a time, according to the priority defined by the user through the ORDER BY clause. For a wide range of applications (e.g., data exploration), a significant amount of resources can therefore be saved.
Further, duplicates exist not only at profile level, as is the case for ER, but also at dataset level. In the ELT scenario, it is common for data scientists to retrieve datasets from the enterprise’s data lake, perform transformations for their analysis, then store the new datasets back into the data lake. Similarly, in Web contexts such as Wikipedia, a table can be duplicated at a given time and the copies then develop independently, possibly leading to inconsistencies. Automatically detecting duplicate tables would make it possible to guarantee their consistency through data cleaning or change propagation, and also to eliminate redundancy, freeing up storage space or saving editors additional work. While dataset discovery research has developed efficient tools to retrieve unionable or joinable tables, the problem of detecting duplicate tables has been mostly overlooked in the existing literature. To fill this gap, we present Sloth, a framework to efficiently determine the largest overlap (i.e., the largest common subtable) between two tables. Detecting the largest overlap makes it possible to quantify the similarity between the two tables and to spot their inconsistencies.
BrewER and Sloth represent novel solutions for big data integration in the ELT scenario, fostering on-demand use of available resources and shifting this fundamental task towards a task-driven paradigm.
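The on-demand idea described above for BrewER can be illustrated with a small sketch. This is not BrewER's actual algorithm, only a simplified lazy scheme under stated assumptions: candidate duplicates are already grouped into blocks, the query orders entities by ascending price, conflicting prices within an entity are resolved by taking the minimum, and `same_entity` is a hypothetical pairwise matcher. The names `on_demand_er`, `_cluster`, `blocks`, and the sample records are all invented for illustration.

```python
import heapq
from itertools import count


def _cluster(records, same_entity):
    """Group records into entities (connected components of the match relation)."""
    clusters = []
    for rec in records:
        keep, merged = [], [rec]
        for cluster in clusters:
            if any(same_entity(rec, other) for other in cluster):
                merged.extend(cluster)  # rec links this cluster to the merged one
            else:
                keep.append(cluster)
        clusters = keep + [merged]
    return clusters


def on_demand_er(blocks, same_entity, key="price"):
    """Lazily yield (value, records) per entity, in ascending order of min(key)."""
    tie = count()  # unique tie-breaker, so the heap never compares record lists
    heap = []
    for block in blocks:
        if block:
            # Optimistic bound: no entity in this block can have a smaller key.
            bound = min(r[key] for r in block)
            heapq.heappush(heap, (bound, next(tie), False, block))
    while heap:
        value, _, resolved, records = heapq.heappop(heap)
        if resolved:
            yield value, records  # nothing left in the heap can precede it
            continue
        # Clean only the block that currently has the highest priority.
        for cluster in _cluster(records, same_entity):
            exact = min(r[key] for r in cluster)
            heapq.heappush(heap, (exact, next(tie), True, cluster))


# Hypothetical dirty data: two blocks of candidate duplicates.
blocks = [
    [
        {"id": 1, "name": "canon eos 5d", "price": 900},
        {"id": 2, "name": "Canon EOS 5D", "price": 850},
        {"id": 3, "name": "canon eos 6d", "price": 1200},
    ],
    [
        {"id": 4, "name": "nikon d750", "price": 1000},
        {"id": 5, "name": "Nikon D750", "price": 1100},
    ],
]


def same_entity(a, b):
    # Toy matcher (assumption): names equal up to letter case.
    return a["name"].lower() == b["name"].lower()


results = list(on_demand_er(blocks, same_entity))
for value, records in results:
    print(value, sorted(r["id"] for r in records))
```

Because `on_demand_er` is a generator, asking only for the first result with `next(...)` resolves the first block and never runs the matcher on the second one, which is the kind of saving the abstract refers to for exploratory queries.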

Companies and organizations increasingly base their decisions on the data at their disposal. Guaranteeing the quality of that data is essential for accurate and reliable analysis. Data integration consists of combining data acquired from multiple heterogeneous sources to give the end user a unified, consistent view of it, and is therefore a fundamental process for increasing the value of the available data. In the past, when only a limited number of sources was involved, the reference paradigm, known as ETL, required extracting the raw data, cleaning it, and storing it in a data warehouse for later analysis. Today, with millions of sources involved, the paradigm known as ELT is increasingly widespread: raw data is collected in large quantities and stored as is, for example in a data lake. Users can then clean the portions of data that are useful for their applications. It is therefore necessary to study innovative data integration solutions that are better suited to the new challenges this model poses. One of the fundamental processes for data integration is entity resolution, which consists of identifying the profiles that describe the same real-world entity (duplicates) in order to consolidate them into a single consistent profile. Traditionally, this process is carried out on the entire dataset before it can be used, which often makes it very expensive. In many cases, only a portion of the cleaned entities turns out to be useful for the end user's application. For example, when working with data collected from the Web, it is essential to be able to filter the entities of interest without cleaning the entire, continuously growing body of data. Likewise, when querying a data lake, one wants to clean on demand only the portion of interest and obtain the corresponding results as quickly as possible.
To meet these needs we present BrewER, a framework for running SQL queries on dirty data while progressively emitting the results as if they had been obtained on the cleaned data. BrewER focuses the cleaning process on one entity at a time, according to a priority defined by the user in the ORDER BY clause. For many applications, such as data exploration, BrewER saves a large amount of time and resources. Duplicates, however, exist not only at the level of individual profiles but also at the level of datasets. It is common, for example, for a data scientist to transform a dataset from the company data lake for an analysis and then store the resulting new version in the same data lake. Similar situations occur on the Web, for example on Wikipedia, where tables are often duplicated and the resulting copies evolve independently, possibly giving rise to inconsistencies. Automatically detecting these duplicate tables makes it possible to make them consistent through data cleaning or change propagation, or to remove the redundant copies to free up storage space or save editors future work. The search for duplicate tables has been largely ignored by the existing literature. To fill this gap we present Sloth, a framework that, given two tables, determines their largest common subtable, making it possible to quantify their similarity and to detect possible inconsistencies. BrewER and Sloth are innovative solutions for data integration in the ELT scenario, using the available resources on demand and steering the data integration process towards an application-oriented approach.
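To make the notion of a largest common subtable concrete, the following brute-force sketch enumerates column mappings between two tiny tables. It is not Sloth's algorithm, which is designed to be efficient; it is only an exhaustive illustration under stated assumptions: "largest" is measured as area (matched rows times matched columns), rows and columns may appear in any order, duplicate rows are counted with bag semantics, and cells match only when exactly equal. The function name `largest_overlap` and the sample tables are invented for illustration.

```python
from collections import Counter
from itertools import combinations, permutations


def largest_overlap(table_a, table_b):
    """Return (area, cols_a, cols_b, n_rows) for the largest common subtable.

    Tables are lists of equal-length tuples. Exponential in the number of
    columns, so this is only usable on very small inputs.
    """
    width_a = len(table_a[0]) if table_a else 0
    width_b = len(table_b[0]) if table_b else 0
    best = (0, (), (), 0)
    for k in range(1, min(width_a, width_b) + 1):
        for cols_a in combinations(range(width_a), k):
            # Project table A on the chosen columns and count identical rows.
            bag_a = Counter(tuple(row[c] for c in cols_a) for row in table_a)
            for cols_b in permutations(range(width_b), k):
                bag_b = Counter(tuple(row[c] for c in cols_b) for row in table_b)
                # Multiset intersection: rows present in both projections.
                shared_rows = sum((bag_a & bag_b).values())
                area = k * shared_rows
                if area > best[0]:
                    best = (area, cols_a, cols_b, shared_rows)
    return best


# Hypothetical copies of the same table that have drifted apart.
table_a = [
    ("Rome", "Italy", 2873),
    ("Paris", "France", 2161),
    ("Berlin", "Germany", 3645),
]
table_b = [
    ("Italy", "Rome", 2873),
    ("France", "Paris", 2165),
    ("Spain", "Madrid", 3223),
]

print(largest_overlap(table_a, table_b))
```

In this toy input the best mapping pairs the city and country columns across the two tables and matches two rows; the third column is left out because the value for Paris differs between the copies, which is the kind of inconsistency the overlap helps to spot.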

Integrazione di dati on-demand / Luca Zecchini, 11 April 2024. 36th cycle, academic year 2022/2023.

English title: Towards Data Integration On-Demand
Defense date: 11 April 2024
Supervisors affiliated with the University: BERGAMASCHI, Sonia; SIMONINI, Giovanni
Type: Doctoral thesis

Files in this record:

File	Size	Format
Tesi definitiva Zecchini Luca.pdf (open access; doctoral thesis)	6.47 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris (IRIS support).