Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning

Cornia, Marcella; Baraldi, Lorenzo; Ayellet, Tal; Cucchiara, Rita

doi:10.1016/j.cviu.2023.103857

Controllable image captioning has recently gained attention as a way to make image captioning algorithms more diverse and more applicable to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which it must follow while generating the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach through experiments on three datasets, where our model outperforms previous methods and achieves a new state of the art on both image and video controllable captioning.

Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning / Cornia, M., Baraldi, L., Ayellet, T., Cucchiara, R. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - 237:(2023), pp. 1-10. [10.1016/j.cviu.2023.103857]



Year of publication: 2023
Journal: COMPUTER VISION AND IMAGE UNDERSTANDING
Volume: 237
First page: 1
Last page: 10
DOI: https://dx.doi.org/10.1016/j.cviu.2023.103857
WoS ID: WOS:001096717500001
Scopus ID: 2-s2.0-85173486059
Type: Journal article

