Retrieval-Augmented Transformer for Image Captioning

Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

doi:10.1145/3549555.3549585

Image captioning models aim to connect Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models, with advances either in visual feature extraction or in the modeling of multi-modal connections. In this paper, we investigate an image captioning approach with a kNN memory, through which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer that predicts tokens based on the past context and on text retrieved from the external memory. Experimental results on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
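To make the two ideas named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that images are already represented as embedding vectors, that the external memory is a plain list of (embedding, caption) pairs, and that the output of attention over past context is blended with the output of attention over retrieved memory through a fixed scalar gate. The function names, the memory layout, and the `gate` parameter are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_knn(query, memory, k):
    """Return the captions of the k memory entries most similar to the query.

    `memory` is a list of (image_embedding, caption) pairs (assumed layout).
    """
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def knn_augmented_attention(query, ctx_keys, ctx_values,
                            mem_keys, mem_values, gate=0.5):
    """Blend attention over past context with attention over retrieved memory.

    `gate` is a fixed scalar here (an assumption); a learned gate is one
    alternative a real model might use.
    """
    local = attend(query, ctx_keys, ctx_values)
    external = attend(query, mem_keys, mem_values)
    return [gate * e + (1.0 - gate) * l for l, e in zip(local, external)]


# Toy external memory: 2-d "visual" embeddings paired with captions.
memory = [
    ([1.0, 0.0], "a dog runs on the grass"),
    ([0.9, 0.1], "a puppy plays in a park"),
    ([0.0, 1.0], "a plate of pasta on a table"),
]
neighbours = retrieve_knn([1.0, 0.05], memory, k=2)
print(neighbours)
```

In this toy setup the retrieved captions would then be encoded into `mem_keys` and `mem_values` before being passed to `knn_augmented_attention`; how that encoding is done is left open here, since the abstract only states that a differentiable encoder is used.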

Retrieval-Augmented Transformer for Image Captioning / Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R. (2022), pp. 1-7. 19th International Conference on Content-based Multimedia Indexing, CBMI 2022, Graz, Austria, September 14-16, 2022. doi:10.1145/3549555.3549585

Publication year: 2022
Conference title: 19th International Conference on Content-based Multimedia Indexing, CBMI 2022
Conference location: Graz, Austria
Conference dates: September 14-16, 2022
DOI: https://dx.doi.org/10.1145/3549555.3549585
WoS ID: WOS:001159476300001
Scopus ID: 2-s2.0-85139915726
First page: 1
Last page: 7
All authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Type: Conference proceedings paper

Files in this record:

File	Access	Type	Size	Format
2022_CBMI_Captioning.pdf	Open access	AAM (author's version, revised and accepted for publication)	798.26 kB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.