Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sarto, Sara; Cornia, Marcella; Cucchiara, Rita

The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives / Sarto, Sara; Cornia, Marcella; Cucchiara, Rita. - (2025). (34th International Joint Conference on Artificial Intelligence, Montreal, Canada, August 16-22).



Year of publication: 2025
Conference title: 34th International Joint Conference on Artificial Intelligence
Conference location: Montreal, Canada
Conference dates: August 16-22
All authors: Sarto, Sara; Cornia, Marcella; Cucchiara, Rita
Type: Paper in conference proceedings

Files in this record:

There are no files associated with this record.


The metadata in IRIS UNIMORE is released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.