Image captioning models have lately shown impressive results when applied to standard datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger variety of visual concepts which are not covered in existing training sets. For this reason, novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase. In this paper, we present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly. Our architecture is fully-attentive and end-to-end trainable, also when incorporating constraints. We perform experiments on the held-out COCO dataset, where we demonstrate improvements over the state of the art, both in terms of adaptability to novel objects and caption quality.

Learning to Select: A Fully Attentive Approach for Novel Object Captioning / Cagrandi, Marco; Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita. - (2021), pp. 437-441. (Intervento presentato al convegno 11th ACM International Conference on Multimedia Retrieval, ICMR 2021 tenutosi a Taipei, Taiwan nel August 21-24, 2021) [10.1145/3460426.3463587].

Learning to Select: A Fully Attentive Approach for Novel Object Captioning

Marcella Cornia;Matteo Stefanini;Lorenzo Baraldi;Rita Cucchiara
2021

Abstract

Image captioning models have lately shown impressive results when applied to standard datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger variety of visual concepts which are not covered in existing training sets. For this reason, novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase. In this paper, we present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly. Our architecture is fully-attentive and end-to-end trainable, also when incorporating constraints. We perform experiments on the held-out COCO dataset, where we demonstrate improvements over the state of the art, both in terms of adaptability to novel objects and caption quality.
2021
11th ACM International Conference on Multimedia Retrieval, ICMR 2021
Taipei, Taiwan
August 21-24, 2021
437
441
Cagrandi, Marco; Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita
Learning to Select: A Fully Attentive Approach for Novel Object Captioning / Cagrandi, Marco; Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita. - (2021), pp. 437-441. (Intervento presentato al convegno 11th ACM International Conference on Multimedia Retrieval, ICMR 2021 tenutosi a Taipei, Taiwan nel August 21-24, 2021) [10.1145/3460426.3463587].
File in questo prodotto:
File Dimensione Formato  
2021_ICMR_NOC.pdf

Open access

Tipologia: AAM - Versione dell'autore revisionata e accettata per la pubblicazione
Dimensione 708.29 kB
Formato Adobe PDF
708.29 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

Licenza Creative Commons
I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11380/1243376
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 8
  • ???jsp.display-item.citation.isi??? 2
social impact