Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. "Augmenting and Mixing Transformers with Synthetic Data for Image Captioning." Image and Vision Computing, vol. 162 (2025), pp. 1-31. ISSN 0262-8856. DOI: 10.1016/j.imavis.2025.105661
Augmenting and Mixing Transformers with Synthetic Data for Image Captioning
Davide Caffagni;Marcella Cornia;Lorenzo Baraldi;Rita Cucchiara
2025
Abstract
Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.
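The abstract's core recipe, generating a synthetic twin of each training image with Stable Diffusion and blending the two with Mixup, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' implementation: the checkpoint id, the Beta(alpha, alpha) prior with alpha = 0.2, pixel-space mixing, and the helper names `generate_synthetic` and `mixup` are all assumptions made for illustration; see the linked repository for the actual method.

```python
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical sketch of the real/synthetic Mixup augmentation described in
# the abstract. Checkpoint id, alpha, and helper names are assumptions, not
# details taken from the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any Stable Diffusion checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

def generate_synthetic(caption: str, size: int = 512) -> torch.Tensor:
    """Generate a synthetic counterpart of a real image from one of its captions."""
    image = pipe(caption, height=size, width=size).images[0]  # PIL.Image
    array = np.asarray(image, dtype=np.float32) / 255.0       # HWC, values in [0, 1]
    return torch.from_numpy(array).permute(2, 0, 1)           # CHW tensor

def mixup(real: torch.Tensor, synthetic: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Pixel-wise Mixup: a convex combination of a real image and its synthetic twin.

    Assumes `real` has already been resized and scaled to the same shape and
    value range as `synthetic`.
    """
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    return lam * real + (1.0 - lam) * synthetic
```

Under these assumptions, each COCO pair (image, caption) would yield an extra training example (mixup(image, generate_synthetic(caption)), caption). Since both inputs depict the same caption, the mixed image can keep the original caption as its supervision target, unlike classic Mixup, which also interpolates the labels.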
Files

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| 2025_IMAVIS_Captioning_Augmentation.pdf | Open access | AO - Author's original version submitted for publication | [IR] unspecified-oa | 2.16 MB | Adobe PDF |
| 1-s2.0-S0262885625002495-main.pdf | Open access | VOR - Version published by the publisher | [IR] creative-commons | 3.28 MB | Adobe PDF |





