Augmenting and Mixing Transformers with Synthetic Data for Image Captioning / Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - In: IMAGE AND VISION COMPUTING. - ISSN 0262-8856. - 162:(2025), pp. 1-31. [10.1016/j.imavis.2025.105661]

Augmenting and Mixing Transformers with Synthetic Data for Image Captioning

Davide Caffagni;Marcella Cornia;Lorenzo Baraldi;Rita Cucchiara
2025

Abstract

Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.
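The Mixup-style combination of real and synthetic images mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the choice of α for the Beta distribution, and the toy arrays standing in for images are all assumptions.

```python
import numpy as np

def mixup_images(real_img, synth_img, alpha=0.2, rng=None):
    """Blend a real image with its synthetic counterpart via Mixup.

    Both inputs are float arrays of identical shape; a mixing
    coefficient lambda is drawn from a Beta(alpha, alpha) distribution,
    and the blended image is lambda * real + (1 - lambda) * synthetic.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed = lam * real_img + (1.0 - lam) * synth_img
    return mixed, lam

# Toy usage: constant arrays stand in for a real image and a
# Stable Diffusion-generated counterpart of the same size.
real = np.ones((4, 4, 3), dtype=np.float32)
synth = np.zeros((4, 4, 3), dtype=np.float32)
mixed, lam = mixup_images(real, synth)
```

In standard Mixup, a small α (e.g. 0.2) concentrates λ near 0 or 1, so most blended samples stay close to either the real or the synthetic image; the same blending weight would typically also be applied to the training targets.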
Files in this record:

2025_IMAVIS_Captioning_Augmentation.pdf
  Open access
  Type: AO - Author's original version proposed for publication
  License: [IR] unspecified-oa
  Size: 2.16 MB
  Format: Adobe PDF

1-s2.0-S0262885625002495-main.pdf
  Open access
  Type: VOR - Version published by the publisher
  License: [IR] creative-commons
  Size: 3.28 MB
  Format: Adobe PDF

Creative Commons License
Metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright violation, contact Iris Support

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1382308
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: 0