SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate descriptive and accurate textual descriptions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.


Introduction
Image captioning is a complex task that involves describing an image in natural language, posing challenges at the intersection of Computer Vision and Natural Language Processing. The most promising solutions are deep learning-based captioning architectures, which have become the de facto standard for the task [46]. Although these architectures achieve state-of-the-art results, it is becoming difficult to further improve their performance, primarily because of the difficulty of finding datasets containing a sufficient number of image-caption pairs. To overcome this issue, the predominant approach in the field is to train captioning networks [13,20,51,58] on large-scale datasets collected from the web [42,44], usually downloading an image along with the description provided in its "alt" tag. It is therefore no surprise that increasingly advanced deep learning-based models are being trained on web-collected data, especially after the spread of large-scale language models [10,59] and cross-modal architectures [36]. The knowledge found on the web, indeed, excels in size and variety, fostering the robustness and sensitivity of deep learning models to long-tail concepts. However, its quality and ethics might be questionable, especially for image captioning, which requires proper alignment between visual and textual contents. Although there are successful attempts to refine or distinguish web-based information [13,24], it is infeasible to completely filter out wrong and noisy data when its extent grows too much.
Synthetic data seems an appealing alternative to match the scaling requirements of modern neural networks while attenuating the drawbacks of web-crawled data. In fact, synthetic data can be produced on demand, are virtually infinite, and in most cases come with annotations at no cost. Moreover, from an ethical perspective, they usually offer better control over biases than their web counterparts. While the usage of synthetically generated data has led to promising results in various Computer Vision tasks [1,5,9,11,16], limited research effort has been devoted to it in the context of image captioning.
Motivated by the recent advancements in Generative AI, in this work we explore the usage of synthetic images to boost the performance of captioning architectures. In particular, we leverage the well-known Stable Diffusion model [39] to generate synthetic images associated with human-annotated textual sentences and employ these newly generated data to augment the most widely used dataset in the image captioning field (i.e., COCO [28]). From a technical point of view, we introduce a simple yet effective framework that probabilistically replaces real pictures with fake ones, and we apply it to a standard Transformer-based architecture [48]. To validate our proposal, we conduct extensive experiments to evaluate whether synthetic images can be leveraged to improve the quality of generated captions. Experimental results on the popular COCO dataset [28] demonstrate the effectiveness of our solution, which achieves better results than a baseline model without synthetic data augmentation and competitive performance compared to previously proposed approaches. We believe that our analysis can serve as a starting point for employing synthetically generated images as an effective data augmentation strategy in the field of image captioning and other vision-and-language tasks.

Related Work
Image Captioning. Early deep learning-based image captioning models were based on a basic encoder-decoder scheme, with the use of RNNs and LSTMs as popular choices for the text generation part along with CNNs to encode the visual content [22,38,50]. Following these initial attempts, subsequent techniques have steadily advanced both the image encoding and language generation stages. Regarding the image encoding, remarkable progress has been achieved through the introduction of additive attention mechanisms to incorporate spatial knowledge, first from a grid of CNN features [55] and later utilizing image regions extracted from pre-trained object detectors [4], eventually considering their semantic and spatial relationships encoded by graph neural networks [56,57]. Nowadays, Transformer-based architectures [48], initially designed for machine translation and language comprehension purposes and then employed in a variety of tasks [15,35,47], have been adopted in the domain of image captioning as well.
Recent advancements have been obtained by large-scale vision-and-language pre-training, which usually employs noisy image-text pairs to increase the number of training samples, thus further enhancing the performance of fully-attentive image captioning models [13,20,51,58]. Effective alternatives also involve the use of visual features from large-scale cross-modal architectures [7,8,45] like CLIP [36]. These multimodal architectures also allow for the enrichment of predicted textual sentences by employing retrieval components, which can be added to the captioning model, and external knowledge from which to extract additional information to improve the final performance [26,32,41].
Synthetic Data. To the best of our knowledge, only a limited number of works explore the usage of synthetic data in image captioning. In particular, Hossain et al. [19] introduced artificial images into a captioning system by creating new pictures with generative adversarial networks. More recently, Xiao et al. [54] leveraged a latent diffusion model [39] to augment the training dataset, also employing paraphrased sentences to pair with the generated pictures. However, they only achieved promising results when using limited training instances or when switching to an unpaired image captioning setting. Concurrently, Li et al. [25] proposed to employ fake images as a replacement for difficult samples to finetune a large-scale vision-and-language model for captioning. In this work, we stick with the same latent diffusion model to generate fake images (i.e., Stable Diffusion [39]), but we do not require any additional textual data outside of captions from the COCO dataset, demonstrating the effectiveness of synthetic data augmentation for the standard image captioning task.

Proposed Method
In this section, we introduce SynthCap, a novel image captioning architecture trained with the proposed synthetic augmentation strategy. Fig. 1 shows an overview of our complete model.

Model Architecture
Visual encoder. Our architecture is based on a fully-attentive Transformer network that takes as input visual features extracted from a pre-trained visual encoder. For the latter, we leverage the image encoder of a pre-trained CLIP-based model [36] and we freeze its weights throughout all the experiments. Specifically, we opt for the CLIP ViT-L/14 version, which is based on the Vision Transformer (ViT) backbone [15].
Transformer model. Our language model is a standard encoder-decoder Transformer network [48]. Each encoder layer is made of a self-attention block followed by a feed-forward layer. The former refines the supplied visual tokens via bidirectional self-attention. The latter operates on single tokens with two dense layers, featuring a GELU non-linearity in between. The output of each block is summed with its input through a residual connection and then normalized. The decoder network has a similar architecture to the encoder, but it comprises a cross-attention block interposed between the self-attention and feed-forward blocks. This additional component is critical, as this is where the cross-modal integration between visual and textual modalities occurs. In detail, the tokens representing the partial caption generated by the decoder up to time $t$ act as queries that attend to the visual tokens from the encoder, i.e., the keys and values. Unlike the encoder self-attention block, the decoder self-attention requires a causal mask to prevent tokens from attending to the future. Specifically, masking is implemented by artificially zeroing the entries of the self-attention matrix with row-column indexes $(i, j)$ such that $j > i$. The output of the decoder is a token sequence $x = \{x_t\}_{t=1,\dots,N}$ whose length is equal to that of the input. To select the next word $x_{t+1}$, we sample from a probability distribution over all the possible words in the reference vocabulary, obtained by feeding $x_t$ to a linear and a softmax layer. At inference time, the decoder works in an auto-regressive manner, meaning that the token produced at time $t$ will be included in the input for time $t + 1$.
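As an illustration of the causal masking described above, the following minimal sketch computes row-wise attention weights in which every entry $(i, j)$ with $j > i$ ends up exactly zero. It assumes the common practice of setting the masked scores to $-\infty$ before the softmax; the function name `causal_softmax` and the toy scores are hypothetical and are not taken from the SynthCap implementation.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    Entries (i, j) with j > i are set to -inf before the softmax
    (an assumed, commonly used implementation), so their attention
    weights become exactly zero.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Keep positions j <= i; mask out the future positions j > i.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # finite: position i itself is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3x3 score matrix: the first token can only attend to itself.
attn = causal_softmax([[0.0, 1.0, 2.0],
                       [1.0, 1.0, 5.0],
                       [0.5, 0.5, 0.5]])
```

Each row of `attn` still sums to one, while the upper-triangular part is zero, so the token at position $i$ never receives information from later positions.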

Synthetic Data Augmentation
Our goal is to probe whether synthetic images can be a valuable source of information to train captioning algorithms. We leverage Stable Diffusion [39] to generate fake images to extend the training set of the COCO dataset [22], which is originally composed of more than half a million image-caption pairs $(I^r, c_k)$, with $k = 1, 2, 3, 4, 5$, i.e., there are five different reference descriptions available for each image. By conditioning the Stable Diffusion model on $c_k$, we build an extra dataset of synthetic (or fake) images paired with the original captions, $(I^s_k, c_k)$. As we show in the experimental section, the synthetically generated images prove to have a good correspondence with the captions they have been generated from and therefore can be a valuable data augmentation strategy to train an image captioning model. Conversely, training a model exclusively on synthetic images and corresponding captions leads to unsatisfactory results. Therefore, we argue that both real and artificial pictures are useful for the task of image captioning, and they may be complementary to each other.
In our training framework, we propose to probabilistically replace a real image with its fake counterpart during each training iteration. When we feed the model with a real image $I^r$, one reference caption is sampled among the five ground-truth sentences available in the dataset. When instead a synthetic image $I^s_k$ is given as input, the network should only focus on the words specifically mentioned in $c_k$, as $c_k$ alone has been considered by the Stable Diffusion model when generating $I^s_k$. Formally, given a caption $c_k$, we build an image-text pair $(I, c_k)$, in which the visual component is chosen as follows:
$$
I = \begin{cases} I^s_k & \text{if } \epsilon < \lambda_s, \\ I^r & \text{otherwise}, \end{cases}
$$
where $\lambda_s$ is a hyperparameter controlling the probability of using synthetic data at each training iteration and $\epsilon \sim U(0, 1)$. When we set $\lambda_s = 0$, the training set is the original one without any synthetic data augmentation, while when $\lambda_s = 1$ the training set is composed only of fake images and corresponding textual sentences. Note that, regardless of $\lambda_s$, the number of processed samples per epoch remains the same as in the original training process.
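The replacement rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `choose_image` and the string placeholders for images are hypothetical, and the strict comparison `eps < lambda_s` is one possible convention.

```python
import random

def choose_image(real_image, synthetic_image, lambda_s, rng=random):
    """Return the synthetic image with probability lambda_s, else the real one."""
    eps = rng.random()  # eps is drawn uniformly from [0, 1)
    return synthetic_image if eps < lambda_s else real_image

# Placeholder strings stand in for actual image tensors.
rng = random.Random(0)
picks = [choose_image("real", "synthetic", 0.5, rng) for _ in range(10000)]
synthetic_fraction = picks.count("synthetic") / len(picks)
```

With `lambda_s = 0` the function always returns the real image, and with `lambda_s = 1` it always returns the synthetic one, matching the two limit cases discussed above. Since exactly one image is returned per call, the number of samples per epoch is unchanged.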
Training procedure. We adhere to the two-phase training typically used in image captioning [46], which consists of a pre-training step with cross-entropy loss followed by a finetuning phase based on the self-critical sequence training (SCST) proposed in [38], which optimizes the captioning model with reinforcement learning using the CIDEr metric [49] as a reward.
During SCST optimization, the baseline reward is chosen as the average score over all the sequences sampled using beam search within the same beam, following [14]. According to this setup, whenever we require a synthetic image to replace its associated real one, we opt to randomly draw from the five available fake images. Formally, $I^s_k \sim \{I^s_1, I^s_2, I^s_3, I^s_4, I^s_5\}$. Note that, although for each $k$ the synthetic image $I^s_k$ has been created from a single description $c_k$, the CIDEr metric still measures the consensus of the captions generated by our model among all five reference captions $\{c_k\}_{k=1,\dots,5}$.
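To make the role of the mean baseline concrete, the sketch below computes the per-sequence advantages and the resulting policy-gradient loss for one beam. It is an illustration under assumptions: the reward values are made-up numbers standing in for CIDEr scores, the log-probabilities are plain floats rather than differentiable tensors, and the function names are hypothetical.

```python
def scst_advantages(rewards):
    """Subtract the mean reward of the beam from each sequence reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def scst_loss(rewards, log_probs):
    """REINFORCE-style loss: -(1/k) * sum_i (r_i - b) * log p(w_i)."""
    advantages = scst_advantages(rewards)
    weighted = sum(a * lp for a, lp in zip(advantages, log_probs))
    return -weighted / len(rewards)

# Toy beam of three sequences: placeholder rewards and sequence log-probabilities.
rewards = [1.0, 2.0, 3.0]
log_probs = [-1.0, -2.0, -3.0]
advantages = scst_advantages(rewards)
loss = scst_loss(rewards, log_probs)
```

Sequences scoring above the beam average receive a positive advantage and have their likelihood increased, while those below the average are pushed down; the advantages always sum to zero within a beam.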

Implementation Details
Dataset and evaluation metrics. We evaluate our proposal on the Microsoft COCO dataset [28], using the standard Karpathy splits [22]. We report the results according to evaluation metrics typically used for image captioning: BLEU [34], METEOR [6], ROUGE [27], CIDEr [49], and SPICE [3].
Architecture. Before being fed to the CLIP visual encoder, each input image undergoes a pre-processing pipeline. The first step involves a resize to reduce the longer side length to a maximum of 224 pixels, keeping the original aspect ratio. This is followed by a center crop and a channel-wise normalization. The resulting input is a tensor with shape 3 × 224 × 224, from which the ViT-based CLIP encoder extracts a grid of 256 × 1024 features, i.e., the visual tokens. Our Transformer-based image captioning network comprises $L = 3$ layers in both the encoder and decoder, operating on a hidden size $d = 512$. We therefore apply a linear projection over the CLIP visual features to match this dimensionality. We employ multi-head attention with 8 different heads in each attentive layer, plus dropout with probability 0.1. To convert words into tokens, we leverage the same byte-pair encoding (BPE) tokenizer [43] used by the CLIP text module.
Training details. During cross-entropy optimization, we stick with the setup suggested in [26], using a batch size of 32 and the learning rate scheduling strategy of [48] with warmup equal to 20,000 iterations. In the SCST phase, we use a batch size of 16, a constant learning rate of $10^{-6}$, and apply beam search decoding with a beam size equal to 5. For both training phases, we employ Adam [23] as optimizer. All experiments have been carried out with mixed precision [31] and ZeRO memory offloading [37], using the Huggingface Transformers library [52].
Synthetic data generation. All synthetic images are generated following [2], by feeding Stable Diffusion with the reference captions from the COCO Karpathy training split using the standard prompt "An image of ". We employ the Stable Diffusion implementation provided by the Huggingface library.
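For reference, the learning rate scheduling strategy of [48] used during cross-entropy training can be sketched as below. The formula is the standard warmup-then-inverse-square-root schedule from the original Transformer paper; plugging in $d = 512$ and a warmup of 20,000 iterations follows the values stated above, whereas the absence of any extra scaling factor is an assumption, and the function name `noam_lr` is hypothetical.

```python
def noam_lr(step, d_model=512, warmup=20000):
    """Learning rate of the Transformer schedule at a given step (step >= 1).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    Linear warmup for `warmup` steps, then inverse-square-root decay.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Learning rate halfway through warmup, at its peak, and after decay.
lr_half = noam_lr(10000)
lr_peak = noam_lr(20000)
lr_late = noam_lr(80000)
```

Under these assumptions the peak is reached at the end of the warmup and equals $1/\sqrt{512 \cdot 20000} = 1/3200$; the rate is half of that value at 10,000 steps and again at 80,000 steps.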

Ablation Studies and Analysis
In this section, we conduct ablation studies to discuss the main design choices of our proposal and validate the proposed synthetic data augmentation strategy.
Overall validation of synthetic images. We first validate the correspondence of generated synthetic images with associated textual sentences by computing the image-text similarity between cross-modal embeddings extracted from CLIP-based visual and textual backbones. As demonstrated in recent literature [18,40], this image-text similarity is effective for evaluating image captioning models. As shown in Table 1, on average, synthetic images seem to have a slightly higher affinity with their descriptions compared to the real ones. This suggests that they could be a valuable source of information to feed an image captioning model during training.
Percentage of synthetic data. In our framework, we control the probability of replacing a real image with a synthetic one through $\lambda_s$. Table 2 presents the results when varying this parameter in comparison with a baseline model trained without synthetic data. When $\lambda_s = 1.0$, we entirely rely on synthetic images and experience a consistent drop with respect to the baseline. This behavior can be due to the reality gap between real and synthetic images, which prevents the model from generalizing to real data when it is trained on synthetically generated samples only. This means that synthetic images, despite the advancements in Generative AI, are still far from exactly mimicking pictures from the natural distribution. On the other hand, all other models benefit from augmented training with synthetic images. In detail, we reach the highest CIDEr score when feeding the model with fake images half of the time (i.e., $\lambda_s = 0.5$), but we still observe improvements with up to 60% of synthetic images. The positive effects of synthetic data appear to diminish with $\lambda_s = 0.7$, even though the performance is still competitive against the baseline without synthetic data augmentation.
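The image-text similarity used above to validate the synthetic images can be illustrated as the cosine similarity between an image embedding and the embedding of its caption, averaged over a set of pairs. The sketch below is only an illustration: the toy two-dimensional vectors stand in for real CLIP embeddings, plain cosine similarity is assumed (any rescaling applied by the metrics in [18,40] is not shown), and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_image_text_similarity(image_embs, text_embs):
    """Average similarity over paired image and caption embeddings."""
    sims = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    return sum(sims) / len(sims)

# Toy embeddings: the first pair is perfectly aligned, the second is orthogonal.
image_embs = [[1.0, 0.0], [0.0, 2.0]]
text_embs = [[3.0, 0.0], [1.0, 0.0]]
score = mean_image_text_similarity(image_embs, text_embs)
```

Computing this average once over real image-caption pairs and once over synthetic ones gives the kind of comparison reported in Table 1.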
Effectiveness of synthetic data. To prove that the observed improvements truly come from using synthetic images to augment our training set, we repeat the setup explained in Sec. 3.2 but change the source of visual input for augmentation. Since a synthetic image is naturally similar to the original image, a reasonable comparison should rely on visually similar but real images. Thus, in this case, given an image $I$ from the COCO dataset, we replace it with probability $\lambda_s$ with $I^r_k$, a real image randomly selected among the top-$k$ images most similar to $I$. In particular, following [41], we extract a feature vector for each image from a pre-trained CLIP model. Then, given an encoded query image, the $k$ most similar ones are retrieved with $k = 1, 3, 5$, using the cosine similarity between pairs of feature vectors as a similarity measure. For this experiment, we employ $\lambda_s = 0.5$, which corresponds to the configuration leading to the highest CIDEr score in the previous analysis. According to the results reported in Table 3, our synthetic data augmentation strategy achieves the best performance compared to both the baseline and the retrieval-based augmentation solution.
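The retrieval-based alternative described above can be sketched as follows. This is a minimal illustration with toy feature vectors in place of CLIP features; the function names are hypothetical, and excluding the query image from its own neighbours is an assumption that the text does not state explicitly.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_similar(query_idx, features, k):
    """Indices of the k images most similar to the query (query excluded, by assumption)."""
    scored = [(cosine(features[query_idx], f), j)
              for j, f in enumerate(features) if j != query_idx]
    scored.sort(reverse=True)  # highest similarity first
    return [j for _, j in scored[:k]]

def retrieval_augment(idx, features, k, lambda_s, rng=random):
    """With probability lambda_s, swap image idx for one of its top-k neighbours."""
    if rng.random() < lambda_s:
        return rng.choice(top_k_similar(idx, features, k))
    return idx

# Toy features: images 1 and 3 point in nearly the same direction as image 0.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
neighbours = top_k_similar(0, features, k=2)
```

The only difference from the synthetic augmentation is the source of the replacement image, which keeps the comparison in Table 3 focused on real versus generated content.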

Comparison to the State of the Art
We now test SynthCap against other state-of-the-art captioning models. In our analysis, we include earlier approaches featuring LSTMs as language models and attention over image regions, like Up-Down [4], possibly boosted with graph-based encoding (GCN-LSTM [57] and SGAE [56]) or self-attention, such as AoANet [21]. Further, we include more recent proposals that rely on the Transformer network, namely M² Transformer [14], X-Transformer [33], DLCT [30], RSTNet [60], DIFNet [53], CaMEL [8], and COS-Net [26]. We report the results in Table 4. As can be seen, SynthCap beats the baseline across all the metrics, in both the cross-entropy pre-training and CIDEr-based optimization stages. Compared to the other better-performing approaches, our framework achieves competitive results, while being based on a simple encoder-decoder Transformer model without any other specific architectural component.
To further confirm the effectiveness of our data augmentation strategy, we report the results on the COCO online test server in Table 5. Following previous literature, we leverage an ensemble of four models trained using different random seeds. Also in this setting, SynthCap achieves the best results according to all evaluation metrics. Finally, in Fig. 2, we show some qualitative results on sample images from the COCO dataset, comparing captions generated by our model with those generated by the baseline without synthetic data augmentation.

Conclusion
In this work, we propose a novel image captioning framework enhanced with a synthetic data augmentation strategy. In particular, we leverage the well-known Stable Diffusion model to generate images that can be effectively employed as additional training samples. The proposed strategy is widely usable, given the easy accessibility of advanced text-to-image generative models and their increasingly impressive results. Experimentally, the proposed solution is capable of boosting the performance of a standard Transformer-based model, working only at the data level and keeping the network architecture unchanged.

Fig. 1. Overview of the proposed method: (a) we select either a real or a synthetic image, according to the weight $\lambda_s$; (b) the CLIP-based visual encoder converts the input image into a sequence of visual tokens; (c) the encoder-decoder Transformer network generates the caption grounded on the visual tokens.

Fig. 2. Qualitative comparison between SynthCap and the baseline on sample images from the COCO dataset.

Table 1. CLIP-based image-text similarity scores for real and synthetic images and corresponding textual sentences.

Table 2. Analysis using different percentages of synthetic data. Results are reported after cross-entropy pre-training.

Table 3. Analysis using our best configuration (i.e., $\lambda_s = 0.5$), replacing synthetic images with real ones selected among the top-$k$ similar images. Results are reported after cross-entropy pre-training.

Table 4. Comparison with the state of the art on the COCO Karpathy test.