Index-Driven Acceleration: Real-Time Join and Hybrid Queries

Aslam, Adeel

Real-time data analysis has become increasingly important with the growth of interconnected systems. One common application is the continuous monitoring of energy data, which is constantly generated by sensors installed on energy-producing and energy-consuming devices. Newly generated data must be processed frequently to deliver meaningful insights promptly. The typical processing approach follows a producer-consumer computational pattern. Numerous data processing frameworks have been proposed to consume streaming (real-time) data from various input devices, perform distributed computation, combine individual results, and provide insights. These frameworks commonly apply pipeline parallelism to incoming data and carry out online operations such as joining, aggregation, and filtering. Streaming data is confined to windows (sliding, tumbling, session, etc.), into which newly arriving tuples are continually inserted and from which expired tuples are removed. Stream join is an essential operation for handling real-time data; however, it poses additional computational challenges compared to a traditional batch join, because data points must be continuously looked up in, added to, and deleted from the streaming windows. Common join operators include equality and inequality (theta) joins. The stream inequality join is particularly computationally intensive because it incurs additional overhead to maintain the contents of the streaming window in index data structures. To tackle this challenge, this work builds on two key insights: 1) identifying skewed data distributions in real time and implementing dedicated indexing structures for skewed keys reduces index update costs; 2) combining insert-efficient mutable structures with search-efficient immutable structures optimizes the stream join search process. In this Ph.D. work, I propose novel solutions for distributed stream join processing.
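The first insight, detecting skewed keys in real time, can be illustrated with a small frequency tracker. The sketch below uses a Count-Min Sketch, which is an assumption chosen for illustration: the abstract does not say which filter the thesis uses. The class name `CountMinSketch`, the `is_skewed` helper, the 10% threshold, and the `meter-*` keys are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed memory (never underestimates)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0  # number of keys observed so far

    def _slots(self, key):
        # One salted hash per row; blake2b keeps results stable across runs.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                repr(key).encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key):
        self.total += 1
        for row, col in self._slots(key):
            self.rows[row][col] += 1

    def estimate(self, key):
        # The minimum over rows is the least collision-inflated counter.
        return min(self.rows[row][col] for row, col in self._slots(key))

    def is_skewed(self, key, threshold=0.1):
        # Hypothetical rule: a key is "hot" when its estimated share of the
        # stream exceeds the threshold.
        return self.total > 0 and self.estimate(key) / self.total > threshold


# Illustrative stream: one dominant key plus twenty keys seen once each.
sketch = CountMinSketch()
for key in ["meter-7"] * 80 + [f"meter-{i}" for i in range(20)]:
    sketch.add(key)

hot = sketch.is_skewed("meter-7")
```

A join operator could consult such a tracker on each arriving tuple and route hot keys to a dedicated index, which is the general idea the abstract describes; how STA-Join actually does this is not specified here.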
One of the key contributions is an indexing method that uses a dedicated, space-efficient filter to monitor the frequency of input keys in real time. This method, called STA-Join, adapts the data processing logic to the skewness of the data, and I have extensively compared it with existing approaches. I have also introduced a two-stage data structure for handling and processing sliding window items (bounded streaming contents) with complex inequality operators. This approach, named SPO-Join, divides the sliding window into mutable (insert-efficient) and immutable (search-efficient) data structures. Despite challenges such as state management for distributed processing, processing guarantees, and efficient concurrency mechanisms, experimental results on distributed stream processing systems show that the proposed solutions outperform existing state-of-the-art methods. Similarly, as generative AI models become more widespread across industries, including the energy sector, vector databases are increasingly used to store multidimensional industry data and to supply effective prompts to these models. The performance and accuracy of the models depend largely on the quality of the prompts. However, efficiently retrieving relevant vectors with high recall, especially for hybrid queries (vectors combined with predicate conditions), is a challenging task. To address this issue, I propose a frequency-aware index data structure that facilitates approximate nearest neighbor (ANN) search in high-dimensional spaces, especially for hybrid queries. I have extensively compared this solution with state-of-the-art vector indexing approaches on various types of queries (point, range, and mixed), and the results show that it performs better than the alternatives.
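The split of a sliding window into an insert-efficient mutable part and a search-efficient immutable part can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not the SPO-Join implementation: the class `TwoStageWindow`, its parameters, the count-based window, and the coarse expiry rule (whole sorted runs are dropped, so the window can briefly hold up to one run of extra tuples) are all hypothetical.

```python
import bisect


class TwoStageWindow:
    """Count-based sliding window split into mutable and immutable parts."""

    def __init__(self, window_size, merge_threshold):
        self.window_size = window_size
        self.merge_threshold = merge_threshold
        self.mutable = []  # recent values: O(1) append, unsorted
        self.runs = []     # older values: sorted tuples, oldest run first

    def __len__(self):
        return len(self.mutable) + sum(len(run) for run in self.runs)

    def insert(self, value):
        self.mutable.append(value)
        if len(self.mutable) >= self.merge_threshold:
            # Freeze the buffer into a sorted, read-only run.
            self.runs.append(tuple(sorted(self.mutable)))
            self.mutable = []
        self._expire()

    def _expire(self):
        # Assumed coarse expiry: drop the oldest run only while the
        # remaining tuples still fill the window.
        while self.runs and len(self) - len(self.runs[0]) >= self.window_size:
            self.runs.pop(0)

    def probe_less_than(self, value):
        """Return stored values v with v < value (one side of a theta join)."""
        matches = []
        for run in self.runs:
            # Binary search on each immutable run.
            matches.extend(run[: bisect.bisect_left(run, value)])
        # Linear scan of the small mutable buffer.
        matches.extend(v for v in self.mutable if v < value)
        return matches


window = TwoStageWindow(window_size=4, merge_threshold=2)
for reading in [5, 1, 9, 3, 7, 2]:
    window.insert(reading)

result = sorted(window.probe_less_than(6))
```

In this sketch inserts never touch a sorted structure, and probes pay a binary search per run plus a short scan of the buffer. The merge threshold trades insert cost against probe cost.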

Accelerazione basata sull’indice: join in tempo reale e query ibride / Adeel Aslam, 2025 Apr 03. 37th cycle, Academic Year 2023/2024.

English title	Index-Driven Acceleration: Real-Time Join and Hybrid Queries
Defense date	3 Apr 2025
Supervisors affiliated with the university	BERGAMASCHI, Sonia; SIMONINI, Giovanni
Type	Doctoral thesis

Files in this record:

File	Size	Format
Index-Driven Acceleration Real time join and hybrid queries.pdf (embargoed until 3 April 2026; description: Index-Driven Acceleration: Real-Time Join and Hybrid Queries; type: doctoral thesis)	3.55 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.