Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate descriptive and accurate textual descriptions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.

SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning / Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - 14233:(2023), pp. 112-123. (Paper presented at the 22nd International Conference on Image Analysis and Processing, ICIAP 2023, held in Udine, Italy, September 11-15, 2023) [10.1007/978-3-031-43148-7_10].

SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
2023

Year: 2023
Conference: 22nd International Conference on Image Analysis and Processing, ICIAP 2023
Location: Udine, Italy
Dates: September 11-15, 2023
Volume: 14233
First page: 112
Last page: 123
Files in this item:
File: 2023-iciap-captioning.pdf
Access: Open access
Type: Author's revised version accepted for publication
Size: 576.36 kB
Format: Adobe PDF
Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Iris Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1309206