Unveiling the Impact of Image Transformations on Deepfake Detection: An Experimental Analysis

With the recent explosion of interest in visual Generative AI, the field of deepfake detection has gained a lot of attention. In fact, deepfake detection might be the only measure to counter the potential proliferation of generated media in support of fake news and its consequences. While many of the available works limit the detection to a pure and direct classification of fake versus real, this does not translate well to a real-world scenario. Indeed, malevolent users can easily apply post-processing techniques to generated content, changing the underlying distribution of fake data. In this work, we provide an in-depth analysis of the robustness of a deepfake detection pipeline, considering different image augmentations, transformations, and other pre-processing steps. These transformations are only applied in the evaluation phase, thus simulating a practical situation in which the detector is not trained on all the possible augmentations that can be used by the attacker. In particular, we analyze the performance of a k-NN and a linear probe detector on the COCOFake dataset, using image features extracted from pre-trained models such as CLIP and DINO. Our results demonstrate that while the CLIP visual backbone outperforms DINO in deepfake detection with no augmentation, its performance varies significantly in the presence of any transformation, favoring the robustness of DINO.


Introduction
Although deepfake generation encompasses results of diverse nature, the world of fake image forgery has gained a lot of attention since the breakthrough of diffusion models [7,13,30,31,33] in the Generative AI domain. While this technological advancement was received enthusiastically by the community, it has also raised significant concerns regarding its potential impact on various domains, including the realms of human art and privacy. Both these domains are susceptible to risks due to the ease with which these models generate new content. Consequently, in light of the ongoing advancements in Generative AI, there has been a significant shift towards enhancing deepfake detection systems [36,41] to mitigate the risks posed by the remarkably convincing nature of such content.
⋆ Equal contribution.
The first efforts towards AI-generated content detection were conceived in the realm of fake face detection, with the release of ad-hoc datasets [18,32] and methodologies [11,19]. However, it should be noted that the significance of deepfake detection extends beyond fake faces or biometric data, necessitating broader and more versatile detection methods that can address a wider range of generative scenarios. Only recently, a limited number of studies [1,6,36] have started to investigate deepfake images generated by text-to-image models [2,30,31,33], thereby enabling the detection of a wider variety of subjects with respect to biometric data. Although these studies assert high accuracy in detecting fake images, the resilience and robustness of the proposed methods have not yet been quantitatively evaluated.
In this manuscript, we freeze the recently proposed Stable Diffusion [31] model as the text-to-image generator and test two different detection approaches. In addition, we employ two different feature extractors, namely CLIP [29] and DINO [4], and evaluate their robustness to a wide variety of image transformations, at pixel-value and image-structure levels (Fig. 1). To the best of our knowledge, we are the first to assess the performance variability of real-fake recognition within such an environment. The experimental results shed light on the generally more robust performance of self-supervised methods (i.e., DINO) against transformations in deepfake detection. Indeed, while CLIP achieves better performance without augmentation, the behavior of deepfake classifiers across different transformations is more consistent for DINO than for CLIP. Surprisingly, CLIP performs similarly to DINO in the recognition of real images.

Related Work
Text-to-image generation. Deepfake images can be generated through three main families of models: autoregressive approaches [25,26,33,39], generative adversarial networks (GANs) [12,34,38,42], and diffusion models [7,13,17,37]. In this work, we narrow down the field of deepfake generation to the recent paradigm of text-to-image generation, which consists of generating an image starting from a textual description. While some GAN-based approaches [20] have been proposed as a possible solution to text-to-image generation, great results have recently been obtained with diffusion models [2,30,33] by conditioning the diffusion process on the input textual description.
Recently, latent diffusion models [28,31] have improved the efficiency of standard diffusion models while maintaining their generation quality, by operating in a lower-dimensional latent space z obtained through a pre-trained variational autoencoder (VAE) [9,16]. In particular, during image generation, the diffusion process occurs within the latent space z, and the resulting representation is then decoded into an image through the VAE decoder. We conduct our experiments using images generated by the Stable Diffusion model [31], in both the 1.4 and 2.0 versions. The main difference between them lies in the backbone used to extract features from texts and images. In fact, Stable Diffusion v1 employs CLIP [29], which is trained on a non-publicly available dataset, while Stable Diffusion v2 relies on OpenCLIP [14], which is trained on a subset of the LAION-5B [35] dataset. Both Stable Diffusion versions are finetuned on a filtered subset of LAION-5B to improve aesthetics and avoid explicit content.
Deepfake detection. The deepfake detection pipeline employed in this study comprises two consecutive stages: an image feature extractor followed by the actual detector. As for the former, different works have made extensive use of CLIP features as a starting point for their analysis [1,24,36]. The authors of [5] introduced an exploratory study of the frequency spectrum of the generated images, thus capturing the impact of the specific generation model on the structure of the final images. Conversely, in [1], the authors proposed a wider-spectrum evaluation of the effects of different image feature extractors, presenting results on CLIP and OpenCLIP. Simultaneously, within the literature on image watermarking [10], analyses have been conducted to examine the robustness of the added watermark when the image is subjected to transformations. This type of analysis has also been conducted for the detection of manipulated images and videos, specifically focused on facial manipulation [22]. We embark on this path, applying it to the deepfake detection scenario and studying how image transformations affect the performance of detection algorithms and the distribution of the features in the embedding space.

Dataset
This section provides an overview of the COCOFake dataset [1] used in this work to perform the analysis on deepfake detection. COCOFake is an extension of the COCO dataset [21] which includes both real and fake images. Specifically, each real image in COCO is paired with five captions, which are used to generate five fake images through a text-to-image model. The dataset is divided into training, validation, and test sets following the Karpathy splits, as used in the captioning literature [15]. Since COCO contains 113,287 training images and 5,000 validation and test images, COCOFake is composed of 679,722 instances in training, and 30,000 in validation and test.
From a technical standpoint, the counterfeit images are produced using Stable Diffusion [31] version 1.4. Furthermore, COCOFake also includes validation and test splits generated with Stable Diffusion version 2.0 to increase the robustness and generalization of possible analyses. It is worth mentioning that all the images of COCOFake are stored in JPEG format, following the original COCO compression.

Visual backbones
In our experimental analysis, we employ three different visual backbones, namely CLIP [29], DINO [4], and DINOv2 [27]. It is worth mentioning that all the backbones adopt the same Vision Transformer architecture [8], ensuring a fair comparison between the employed methods. The primary distinction among the visual backbones is the pre-training method employed. For instance, the CLIP approach utilizes language supervision to enforce similarities between visual and textual concepts. This is achieved by independently processing the image and its textual description using a visual and a textual backbone and then linearly projecting their representations into a shared embedding space. CLIP is pre-trained with a contrastive objective that maximizes the cosine similarity of correct image-text pairs. While CLIP obtains a semantic coherence [23] that can be useful for deepfake detection, the only image augmentation applied during training consists of a random square crop from resized images. This could make the visual backbone vulnerable to adversarial image augmentation.
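The contrastive objective described above can be sketched as follows. This is an illustrative NumPy implementation of a symmetric contrastive loss over a batch of paired embeddings, not the actual CLIP training code (which learns the temperature and trains both encoders end to end); the function name and fixed temperature are our own choices.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) arrays; row i of each matrix is a matching pair.
    """
    # L2-normalize so the dot product equals cosine similarity
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img_emb @ txt_emb.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(logits))              # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average of the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

The loss is minimized when each image embedding is most similar to the embedding of its own caption, which is what yields the shared image-text space exploited here for feature extraction.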
In contrast to CLIP, DINO eschews the use of textual references, heavily relying on image augmentations during the pre-training phase. Indeed, DINO augments the input image through various techniques, including multi-crop [3], color jittering, Gaussian blur, and solarization. Multi-crop is used to generate multiple views of the same image, which can be logically divided into local views with lower resolutions and global views with higher resolutions. The DINO model is trained by enforcing local-to-global correspondences between different views of the same image. On the other hand, DINOv2 introduces additional pre-training objectives compared to DINO, such as randomly masking patches of the local views, leaving the model to learn how to reconstruct these patches. Since both DINO and DINOv2 enforce robustness to image augmentation during pre-training, we investigate their effectiveness in a deepfake detection pipeline.
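The multi-crop step can be sketched as follows. This is a simplified illustration with hypothetical crop counts and sizes (the exact values are DINO hyperparameters); the real pipeline also rescales the crops and applies the color jittering, blur, and solarization mentioned above.

```python
import numpy as np

def multi_crop(image, n_global=2, n_local=6, global_size=224, local_size=96, rng=None):
    """Sample global (large) and local (small) square crops from an (H, W, C) image."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = image.shape

    def crop(size):
        top = rng.integers(0, H - size + 1)
        left = rng.integers(0, W - size + 1)
        return image[top:top + size, left:left + size]

    global_views = [crop(global_size) for _ in range(n_global)]
    local_views = [crop(local_size) for _ in range(n_local)]
    return global_views, local_views
```

During pre-training, the student network sees all views while the teacher sees only the global ones, and the local-to-global correspondence objective pushes their representations to agree.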

Deepfake Detection Pipeline
In this section, we present the deepfake detection pipeline that has been utilized for the analysis conducted in this study. Our pipeline encompasses a feature extraction phase followed by a detector model. Specifically, the detector models under investigation include both a linear probe and a k-nearest neighbor (k-NN) classifier. The incorporation of different detector models serves the purpose of assessing distinct aspects. Specifically, the linear probe is engineered to identify any potential indications of the generation process within the feature space. Conversely, the k-nearest neighbor approach relies on the distance between existing features stored during training, thus allowing us to measure the similarity between real and fake content in the embedding space.
Feature extraction process. From a technical perspective, the previously introduced visual backbones are employed as feature extraction models. Indeed, during the process of feature extraction, each image from the training, validation, and test sets of COCOFake undergoes processing by the visual backbones CLIP, DINO, and DINOv2. It is worth mentioning that no image augmentation is applied during the feature extraction phase.
Formally, each image x ∈ R^{C×H×W} is first split into a sequence of squared patches {x_i^p}_{i=1}^N, where C, H, and W are respectively the channel, height, and width, while x_i^p is the i-th image patch of size P × P. Subsequently, the sequence of image patches is linearly projected to the embedding dimensionality D of the model. At this step, a learnable classification token [CLS] ∈ R^D is concatenated to the input sequence. After L self-attention blocks, the [CLS] token is saved as the representation of the image. In addition, and only for the CLIP model, the [CLS] token is linearly projected into the multi-modal embedding space.
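The patch extraction step above can be sketched in NumPy; this is an illustrative reimplementation of the standard ViT patchify operation, not code from the evaluated backbones.

```python
import numpy as np

def patchify(x, P=16):
    """Split an image x of shape (C, H, W) into N = (H/P) * (W/P) flattened
    patches of length C*P*P, as done at the input of a Vision Transformer."""
    C, H, W = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    # (C, H/P, P, W/P, P) -> (H/P, W/P, C, P, P) -> (N, C*P*P)
    patches = x.reshape(C, H // P, P, W // P, P).transpose(1, 3, 0, 2, 4)
    return patches.reshape(-1, C * P * P)

# A 224x224 RGB image with P = 16 yields 196 raw patches of dimension 768.
patches = patchify(np.zeros((3, 224, 224)), P=16)
```

With P = 14 (as in DINOv2) the same image yields 256 patches of dimension 588; in both cases, the subsequent linear projection maps each patch to the model dimensionality D.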
Implementation-wise, the Base version of ViT [8] (i.e., ViT-B) is used for CLIP, DINO, and DINOv2. In detail, ViT-B includes 85M learnable parameters, an embedding dimensionality D of 768, and L = 12 self-attention blocks. The considered input image size is C = 3, H = 224, W = 224, while the image patch size P is 14 for DINOv2 and 16 for CLIP and DINO. Regarding the pre-trained weights, the open-source ViT-B/16 version (i.e., OpenCLIP [14]), pre-trained on the LAION-2B dataset [35], is used for CLIP, while the publicly available ViT-B/16 and ViT-B/14 are used for DINO and DINOv2, respectively.
Linear probe. In the linear probe approach, we use the extracted features to train a logistic regressor. The goal of the method is to identify a signature, or imprint, in the extracted features that enables the linear model to distinguish between real and fake data. The logistic regressor is trained with ℓ2 regularization, and the loss is weighted to account for the difference in the number of real and fake samples. Specifically, since the number of fake images in COCOFake is five times greater than the number of real images, the loss is weighted inversely proportionally to the class frequencies. In addition, the LBFGS solver [40] is employed for training. Results are evaluated with accuracy scores over real and fake data.
k-nearest neighbor (k-NN). The classification task in the k-nearest neighbor approach depends on measuring distances within the visual feature space extracted by the employed backbones, which implies that no further training is required. Hence, for the validation and test sets, the distances between each element and the features stored offline from the training split are calculated. The deepfake classification task is supervised, whereby the corresponding label (real or fake) is known for each feature embedding. Therefore, the accuracy is determined by applying majority voting over the k nearest features within the training feature space. While the k-NN approach was originally proposed by [24] in a deepfake detection scenario, it presents notable limitations. Specifically, k-NN is highly sensitive to missing values and outliers, necessitating extensive coverage of the embedding space of the visual backbones by the training dataset. Moreover, as the dataset size increases, the computational cost of calculating distances between a new image and each existing one escalates significantly, ultimately compromising the algorithm performance. From an implementation perspective, we use cosine similarity with the top-1 nearest neighbor to define the k-NN. Moreover, to manage the unbalanced COCOFake dataset, only one of the five fake images paired with each real image is considered when computing the visual features of the training split, thus obtaining a balanced set of real and fake images.
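The two detection heads can be sketched as follows. This is a minimal NumPy illustration, assuming pre-extracted features, of (i) the inverse-class-frequency weighting applied to the linear probe's loss and (ii) the top-1 cosine-similarity neighbor lookup; the actual probe is a logistic regressor trained with the LBFGS solver, which is omitted here.

```python
import numpy as np

def balanced_class_weights(labels):
    """Per-class weights inversely proportional to class frequency, used to
    compensate for an unbalanced training set (e.g., 5 fakes per real image)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def knn_top1(train_feats, train_labels, query_feats):
    """Predict each query's label as that of its most cosine-similar
    training feature (k = 1, so majority voting is trivial)."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    nearest = (b @ a.T).argmax(axis=1)   # index of the most similar training feature
    return train_labels[nearest]
```

With a 5:1 fake-to-real ratio, the real class receives a five-times-larger weight, so both classes contribute equally to the probe's loss; the k-NN head instead sidesteps the imbalance by subsampling the training features.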

Image Augmentation
Drawing inspiration from [10,22], we explore the effectiveness of twelve distinct image augmentation techniques, detailed in Table 1. This series of transformations depicts the potential manipulations of an image, considering both image-structure and pixel-value transforms. As can be noticed, each augmentation involves a tunable parameter that controls the degree of impact on the image. We undertake a detailed analysis of these parameters to assess the robustness of the classification methods in response to the strength of the transformation. To this end, we select a range delimited by a minimum and a maximum parameter value for each augmentation, aiming to preserve the visual quality of the image in both cases, thus ensuring visual consistency and usability. We then linearly partition each parameter range into five equally spaced values, obtaining five variants of each transform with varying strength. The utilization of these transformations evaluates the employed classifiers' accuracy in terms of resilience and generalization. A visual example of some of the image augmentations applied to an image is reported in Fig. 1.
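The partitioning of each parameter range can be sketched as follows; the parameter names and ranges below are purely illustrative placeholders (the actual twelve transforms and their ranges are listed in Table 1).

```python
import numpy as np

# Hypothetical parameter ranges for two transforms, chosen only for illustration.
param_ranges = {
    "jpeg_quality": (10, 90),
    "gaussian_blur_sigma": (0.5, 2.5),
}

def augmentation_strengths(ranges, n_levels=5):
    """Linearly partition each [min, max] parameter range into n_levels
    equally spaced values, one per evaluation strength."""
    return {name: np.linspace(lo, hi, n_levels) for name, (lo, hi) in ranges.items()}
```

Each classifier is then evaluated once per strength level, which is what produces the accuracy-versus-strength curves of Fig. 2.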
In addition to the conventional augmentation methods, we introduce a novel technique called Stable Diffusion (SD) compression. This approach involves projecting an image x into the latent space z of the Stable Diffusion model by utilizing the encoder of the autoencoder model [9] implemented within the Stable Diffusion framework. Following this projection, the image x is reconstructed using the decoder of the autoencoder. This augmentation technique is exclusively applied to real images to examine the biases of the detector concerning the lossy compression inherent in the generation of fake images.

Experimental Results
In this section, we analyze the results obtained by applying data augmentation to real and fake images while testing different visual backbones.
Deepfake detection of plain images. To evaluate the resilience of the aforementioned methods, a preliminary study is conducted to examine the performance of the detection pipeline without any applied transformations.
Based on the findings presented in Table 2, we can notice that the linear probe classifier exhibits high classification accuracy across all the backbones, with scores of 99.6%, 96.9%, and 96.6% for CLIP, DINO, and DINOv2 respectively, over the COCOFake test set generated with Stable Diffusion v1.4. These results validate the hypothesis that linear probes effectively identify the generator's imprint embedded in the image features. Similar behavior is also highlighted by the k-NN approach, whose objective is not to specifically identify the imprinting trace. The observed performance strongly suggests that, in the backbones' embedding space, fake images tend to exhibit proximity to one another, and a similar phenomenon may hold true for real images. Specifically, k-NN achieves an accuracy of 96.7%, 91.3%, and 89.0% over CLIP, DINO, and DINOv2, respectively. Moreover, the comparable performance observed on the COCOFake test sets of Stable Diffusion v1.4 and v2.0 underscores the classifiers' capability to generalize beyond their initial training domain. As a result, further experiments solely focus on the test set of Stable Diffusion v1.4. Building upon these initial results, subsequent experiments extend the analysis to explore the accuracy patterns when transformations are applied to fake and real images.
Fake data analysis. Table 3 presents a concise overview of the performance of the deepfake detection pipeline over transformed fake images. Evidently, the evaluation using the linear probe on the CLIP backbone demonstrates remarkably low performance. Specifically, CLIP achieves an average accuracy, among all the transformations, of only 41.6% for fake images, while DINO and DINOv2 demonstrate higher accuracies of 91.8% and 92.3%, respectively. Furthermore, the average standard deviation of CLIP, which amounts to 24.6%, highlights the substantial variability in performance across different transformations. This variability poses a significant threat to the overall robustness of a CLIP-based deepfake detector. In contrast, DINO and DINOv2 consistently exhibit robustness across a wide range of transformations. In addition, Figure 2 illustrates the trajectory of accuracy outcomes for the linear probes under varying degrees of strength of the image augmentations, as discussed in Sec. 3.4. It is visually evident that, while DINO and DINOv2 tend to maintain consistent performance levels, CLIP performance is highly influenced by the intensity of each transformation. For example, JPEG compression with 10% quality produces an accuracy of 0.4% with the CLIP linear probe, compared to 95% and 91% for DINO and DINOv2 respectively. We assume that the linear probe trained on CLIP-extracted features may be prone to overfitting on the distinctive imprint of fake data. This assumption arises from the observation that the CLIP visual backbone is not trained using extensive data augmentation. Consequently, alterations in the images could modify the extracted features, thus altering the fake imprint. This would explain the significant decline in the performance of the linear probe on CLIP. Although the k-NN outcomes, as shown in Table 3, indicate that CLIP achieves accuracy on par with DINO and DINOv2, the higher average standard deviation observed in CLIP highlights the superiority of the latter models.
Real data analysis. Table 4 presents a comprehensive analysis of the performance of CLIP, DINO, and DINOv2 evaluated on transformed real images. We logically cluster results into JPEG Compression, SD Compression, and Other Transforms to facilitate the analysis; that is, we isolate the compression augmentations and summarize the others. Regarding the obtained results, it is noteworthy that the linear probes demonstrate commendable performance on the non-compression-based transforms. However, when subjected to JPEG compression, the linear probes exhibit lower accuracy. Specifically, the average accuracy reaches 93.2%, 74.8%, and 58.2% for CLIP, DINO, and DINOv2 respectively. Furthermore, the poorest performance is observed in CLIP with SD compression, resulting in an accuracy of 44.2%. We hypothesize that the compression imprints bear a strong resemblance to the fake imprint, thereby deceiving the linear probe into misclassifying a real image as fake.
A comparable examination can be directly carried out on the feature space of the visual backbones. Specifically, when considering the embedding space of CLIP, real images subjected to SD compression exhibit closer proximity, on average, to fake images compared to JPEG compression and the other transformations. This is additional proof that SD compression has a great influence on the fake data imprint. In contrast, DINO and DINOv2 are similarly affected by all transformations, exhibiting an average accuracy in the k-NN analysis of 75.4% and 80.5%, respectively. It is noteworthy that the limitations inherent to k-NN, as mentioned in Sec. 3.3, may limit its effectiveness in deepfake detection.

Conclusion
In conclusion, the growing capacity and utilization of text-to-image models present a persistent challenge for the detection of artificially generated images. This work introduces an analysis of the robustness of a set of classifiers, specifically considering transformations that modify the visual appearance of the image. The performance of the classifiers is significantly influenced by these transformations, and this study emphasizes the importance of robustness to such transformations for deepfake detectors that need to operate in real-world scenarios.

Fig. 1. Visual comparison of image transformations on a sample real image (top left).

Fig. 2. Linear probe accuracy using different backbones, namely CLIP, DINO, and DINOv2, while varying the applied transformation. Each subplot illustrates the accuracy of the classifiers under varying degrees of strength of the image augmentations.

Table 1. Comprehensive summary of essential information regarding the transformations applied to assess the robustness of the different classifiers.

Table 2. Accuracy performance on the COCOFake test set without any transformations, for Stable Diffusion v1.4 and v2.0, using different classifiers and backbones.

Table 3. Comparison of accuracy performance on the COCOFake test set with transforms applied to fake images. The table shows results for linear probe and k-NN classifiers, for each backbone.

Table 4. Accuracy performance on the COCOFake test set with transformations applied to real images. We report results for linear probe and k-NN classifiers for each backbone. Transformations are divided into compression-based and others, to highlight the accuracy drop when applying compression-based transformations.