Explaining Transformer-based Image Captioning Models: An Empirical Analysis / Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - In: AI COMMUNICATIONS. - ISSN 0921-7126. - 35:2(2022), pp. 111-129. [10.3233/AIC-210172]

Explaining Transformer-based Image Captioning Models: An Empirical Analysis

Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
2022

Abstract

Image Captioning is the task of translating an input image into a textual description. As such, it connects Vision and Language in a generative fashion, with applications that range from multi-modal search engines to assistance for visually impaired people. Although recent years have witnessed an increase in the accuracy of such models, this has also brought increasing complexity and challenges in interpretability and visualization. In this work, we focus on Transformer-based image captioning models and provide qualitative and quantitative tools to increase interpretability and assess the grounding and temporal alignment capabilities of such models. First, we employ attribution methods to visualize what the model concentrates on in the input image at each step of the generation. Further, we propose metrics to evaluate the temporal alignment between model predictions and attribution scores, which allows measuring the grounding capabilities of the model and spotting hallucination flaws. Experiments are conducted on three different Transformer-based architectures, employing both traditional and Vision Transformer-based visual features.
Year: 2022
Volume: 35
Issue: 2
First page: 111
Last page: 129
Files in this item:
File: aic_2022_35-2_aic-35-2-aic210172_aic-35-aic210172.pdf
Access: Open access
Type: VOR - Version published by the publisher
Size: 1.04 MB
Format: Adobe PDF

Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1255079
Citations
  • PubMed Central: n/a
  • Scopus: 24
  • ISI Web of Science: 12