MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Pipoli, Vittorio; Saporita, Alessia; Bolelli, Federico; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose a novel framework to mitigate the aforementioned issue called Retrieval-Augmented Generation for missing modalities (MissRAG). It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy designed to enhance model robustness by mitigating the impact of absent modalities while preventing the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis.

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models / Pipoli, V., Saporita, A., Bolelli, F., Cornia, M., Baraldi, L., Grana, C., Cucchiara, R., Ficarra, E.. - (2025). (IEEE/CVF International Conference on Computer Vision Honolulu, Hawaii Oct 19-23).

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Vittorio Pipoli;Alessia Saporita;Federico Bolelli;Marcella Cornia;Lorenzo Baraldi;Costantino Grana;Rita Cucchiara;ELISA FICARRA

2025

Abstract

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose a novel framework to mitigate the aforementioned issue called Retrieval-Augmented Generation for missing modalities (MissRAG). It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy designed to enhance model robustness by mitigating the impact of absent modalities while preventing the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2025
			
	Titolo del Convegno
	
				IEEE/CVF International Conference on Computer Vision
			
	Luogo del Convegno
	
				Honolulu, Hawaii
			
	Data del Convegno
	
				Oct 19-23
			
	Tutti gli autori
	
						Pipoli, Vittorio; Saporita, Alessia; Bolelli, Federico; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa...espandi
						
	Citazione
	
				MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models / Pipoli, V., Saporita, A., Bolelli, F., Cornia, M., Baraldi, L., Grana, C., Cucchiara, R., Ficarra, E.. - (2025). (IEEE/CVF International Conference on Computer Vision Honolulu, Hawaii Oct 19-23).
			
	Tipologia
	
				Relazione in Atti di Convegno

File in questo prodotto:

File	Dimensione	Formato
2025iccv.pdf Open access Tipologia: AAM - Versione dell'autore revisionata e accettata per la pubblicazione Dimensione 786.86 kB Formato Adobe PDF Visualizza/Apri	786.86 kB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris