Retrieval-Augmented Transformer for Image Captioning / Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - (2022), pp. 1-7. (Paper presented at the 19th International Conference on Content-based Multimedia Indexing, CBMI 2022, held in Graz, Austria, September 14-16, 2022) [10.1145/3549555.3549585].

Retrieval-Augmented Transformer for Image Captioning

Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara
2022

Abstract

Image captioning models aim to connect Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing advances in visual feature extraction, or by modeling better multi-modal connections. In this paper, we investigate an image captioning approach with a kNN memory, through which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer that predicts tokens based on the past context and on text retrieved from the external memory. Experiments conducted on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
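
As a rough illustration of the retrieval step mentioned in the abstract, the sketch below looks up the captions of the k nearest images in a toy external memory. It is not the authors' implementation: the cosine-similarity metric, the brute-force search, the 3-dimensional embeddings, and the names `retrieve_captions`, `memory_embs`, and `memory_captions` are all assumptions made for this example. How the retrieved text is then consumed by the kNN-augmented attention layer is not shown.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_captions(query_emb, memory_embs, memory_captions, k=2):
    """Return the captions of the k memory images most similar to the query.

    Hypothetical helper: brute-force cosine-similarity search over all entries.
    """
    sims = l2_normalize(memory_embs) @ l2_normalize(query_emb[None, :])[0]
    top_k = np.argsort(-sims)[:k]  # indices ordered by decreasing similarity
    return [memory_captions[i] for i in top_k]


# Toy external memory (assumed data): visual embeddings paired with captions.
memory_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
memory_captions = [
    "a dog running on grass",
    "a puppy playing in a park",
    "a plate of pasta on a table",
    "a plane flying in the sky",
]

# Embedding of the input image (assumed to come from some visual encoder).
query_emb = np.array([1.0, 0.05, 0.0])
retrieved = retrieve_captions(query_emb, memory_embs, memory_captions, k=2)
print(retrieved)
```

In a setting like the one the abstract describes, the retrieved captions would presumably be encoded and made available to the decoder as additional context; an approximate nearest-neighbor index would likely replace the brute-force search for a large corpus.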
Conference: 19th International Conference on Content-based Multimedia Indexing, CBMI 2022
Location: Graz, Austria
Dates: September 14-16, 2022
Pages: 1-7
Files in this item:
2022_CBMI_Captioning.pdf (open access)
Type: AAM - Author's version, revised and accepted for publication
Size: 798.26 kB
Format: Adobe PDF

Creative Commons license
Metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1281718