With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning / Barraco, Manuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - (2023), pp. 3009-3019. (Paper presented at the 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023, held in Paris, France, October 2-6, 2023) [10.1109/ICCV51070.2023.00282].
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning
Barraco, Manuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
2023
Abstract
Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics of an image and translating them into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy loss only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net.
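To illustrate the idea described in the abstract, the sketch below shows a standard multi-head attention layer augmented with a bank of prototype key/value vectors, so that queries from the current sample can also attend to a compact memory summarizing activations from other training samples. This is a minimal illustration, not the authors' implementation (see the PMA-Net repository for that): the class name `PrototypeAugmentedAttention`, the parameter `num_prototypes`, and the use of freely learned prototype parameters (rather than prototypes distilled from past keys and values) are simplifying assumptions.

```python
# Minimal sketch: self-attention whose keys/values are extended with a bank of
# prototype vectors acting as a memory of other samples. All names and the
# learned-parameter prototypes are illustrative assumptions, not PMA-Net code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeAugmentedAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int = 8, num_prototypes: int = 64):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Prototype memory: compact vectors standing in for the distribution of
        # keys/values seen while processing other training samples.
        self.proto_k = nn.Parameter(torch.randn(num_prototypes, d_model) * 0.02)
        self.proto_v = nn.Parameter(torch.randn(num_prototypes, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q = self.q_proj(x)
        # Concatenate the current sample's keys/values with the prototype bank,
        # so attention can mix in-sample and memory information.
        k = torch.cat([self.k_proj(x), self.proto_k.unsqueeze(0).expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(x), self.proto_v.unsqueeze(0).expand(b, -1, -1)], dim=1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, length, d_model) -> (batch, heads, length, d_head)
            return t.view(b, t.size(1), self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out_proj(out)
```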
File: 2023_ICCV_Captioning_Memories.pdf (Open access)
Type: AAM - Author's version, revised and accepted for publication
Size: 2.91 MB
Format: Adobe PDF
Metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright violation, contact Iris Support.