Gestione ed Analisi di Big Data: Sfide e Opportunità nell'Integrazione e nell'Estrazione di Conoscenza dai Dati

Paganelli, Matteo

In the Big Data era, managing and consuming data adequately is one of the most challenging activities, owing to a set of critical issues usually summarized in five key concepts: volume, velocity, variety, veracity and variability. In response, a large number of algorithms and technologies have been proposed in recent years; however, many problems remain open and new challenges have emerged. These include, to name a few, the need for annotated data to train machine learning techniques, the need to interpret the logic of the systems used, the need to reduce the burden of managing them in production (the so-called technical debt), and the need for tools that support human-machine interaction. This thesis studies in depth the challenges affecting two areas: data integration and the modern management of relational DBMSs, that is, their readjustment to new requirements. The main problem affecting data integration is its evaluation in real contexts, which typically requires the costly and time-consuming involvement of domain experts. Tools that support and automate this critical task, or that resolve it in an unsupervised manner, would therefore be very useful. In this context, my contribution can be summarized in two points: 1) techniques for the unsupervised evaluation of data integration tasks and 2) automatic approaches for configuring rule-based matching models. Relational DBMSs, for their part, have been the workhorse of many companies over the last few decades, thanks to their simple governance, security, auditability and high performance. Today, however, we are witnessing a partial rethinking of their use with respect to their original design: for example, they are applied to more advanced tasks typical of the machine learning field, such as classification, regression and clustering. Establishing a symbiotic relationship between these two research fields could be essential to solving some of the critical issues listed above. In this context, my main contribution was to verify the feasibility of performing in-DBMS inference of machine learning pipelines at serving time.

Gestione ed Analisi di Big Data: Sfide e Opportunità nell'Integrazione e nell'Estrazione di Conoscenza dai Dati / Matteo Paganelli, 23 March 2021. 33rd cycle, academic year 2019/2020.

English title	Big Data Management and Analytics: Open Challenges and Opportunities for Data Integration and Data Mining
Defense date	23 March 2021
Supervisors affiliated with the University	GUERRA, Francesco
Type	Doctoral thesis

Files in this record:

File	Size	Format
MatteoPaganelli_PhDThesis_Final.pdf (Open Access since 23 March 2024; description: final thesis, Paganelli Matteo; type: doctoral thesis)	2.62 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.