Explaining Digital Humanities by Aligning Images and Textual Descriptions

Replicating the human ability to connect Vision and Language has recently been gaining a lot of attention in the Computer Vision and the Natural Language Processing communities. This research effort has resulted in algorithms that can retrieve images from textual descriptions and vice versa, when realistic images and sentences with simple semantics are employed and when paired training data is provided. In this paper, we go beyond these limitations and tackle the design of visual-semantic algorithms in the domain of the Digital Humanities. This setting not only exhibits more complex visual and semantic structures, but also features a significant lack of training data, which makes the use of fully-supervised approaches infeasible. With this aim, we propose a joint visual-semantic embedding that can automatically align illustrations and textual elements without paired supervision. This is achieved by transferring the knowledge learned on ordinary visual-semantic datasets to the artistic domain. Experiments, performed on two datasets specifically designed for this domain, validate the proposed strategies and quantify the domain shift between natural images and artworks.


Introduction
As humans, we can easily link our ability to see and understand the surrounding environment with the ability to express ourselves in natural language. In the effort of artificially replicating these connections, new models have emerged for image and video captioning (Anderson et al., 2018; Lu et al., 2018; Cornia et al., 2019) and for visual-semantic retrieval (Kiros et al., 2014; Faghri et al., 2018; Lee et al., 2018). The former architectures combine vision and language in a generative flavor on the textual side; the latter build common spaces to integrate the two domains and retrieve textual elements given visual queries, and vice versa.
The leading solutions for visual-semantic retrieval have so far relied on fully supervised settings, in which paired training samples are available, and have been applied to general-purpose datasets where the state of the art of concept recognition methods is useful and well assessed. In the domain of arts and culture, however, both visual and textual elements are far from those of ordinary datasets. On one side, textual descriptions often contain technical language with symbolic references, metaphors, and artistic or historical connections; on the other side, artworks and illustrations are characterized by visual features different from those of natural images. Beyond this domain-shift issue, the supervised training of a common visual-semantic embedding requires sufficiently large datasets. Instead, the artistic domain is often characterized by small-scale datasets in which the pairing between visual and textual elements is not available or expensive to obtain.
Tackling the aforementioned setting, in this paper we propose a semi-supervised visual-semantic embedding model (SS-VSE) for cross-modal retrieval in the artistic domain. Our approach relies on the construction of a common semantic embedding, in which the knowledge learned on a supervised and ordinary visual-semantic dataset is transferred to an artistic dataset in which the pairing between images and sentences is not available. In addition to using global feature vectors, we also investigate the use of auto-encoders (SS-VSE-AE) to obtain more compact representations of input images and sentences. Experiments are conducted on two datasets specifically designed for the artistic domain. In particular, we use the BibleVSA dataset (Baraldi et al., 2018), which contains illustrations and textual sentences extracted from the commentaries of a historical manuscript, and the SemArt dataset (Garcia and Vogiatzis, 2018), which is composed of artwork images and textual comments. Extensive experiments are presented to validate the proposed solution and to visualize the effect of the knowledge transfer between source and target datasets.

Related work
Deep Learning techniques often require significant efforts to be applied to the domain of Digital Humanities and Cultural Heritage, due to the presence of specific challenges. The research efforts of the past few years have resulted in various works and applications spanning from generative models to classification and retrieval solutions. On the generative and synthesis side, promising results have been obtained for transferring the style of a painting to a real photograph (Gatys et al., 2016; Sanakoyeu et al., 2018; Jing et al., 2018) and, inversely, for creating a realistic representation of a given painting (Zhu et al., 2017; Tomei et al., 2018, 2019a,b). On the analysis and feature extraction side, instead, several efforts have been made on the collection and annotation of large-scale datasets containing artistic images, mainly focusing on style and genre classification (Karayev et al., 2014; Mao et al., 2017; Strezoski and Worring, 2018), visual pattern detection (Shen et al., 2019), and artwork instance recognition (Del Chiaro et al., 2019).
Concerning the problem of linking textual descriptions and artistic images, only a limited number of works is available in the literature. In the next section, after briefly reviewing the most important works related to visual-semantic retrieval, we focus on image-text matching approaches applied to the artistic domain, subdividing them between supervised and semi-supervised methods.

Visual-semantic retrieval
Matching visual data and natural language is a challenging task in computer vision and multimedia. Since visual and textual data belong to two distinct modalities, one of the seminal approaches (Kiros et al., 2014) has been that of generating a joint visual-semantic embedding space in which images and sentences could be compared. Even if other approaches exist, this is currently still one of the most commonly used solutions.
Following this line, Faghri et al. (2018) introduced a modification of the hinge-based loss function to exploit hard negatives, i.e., the worst matching pairs, during training. This has been shown to be effective in improving cross-modal retrieval performance and has been used in almost all subsequent works. Further, Wang et al. (2018) used a two-branch network composed of an embedding and a similarity branch: while the embedding network translates image and text into a feature representation, the similarity network predicts how well the feature representations match. Differently, Dong et al. (2018) suggested tackling the retrieval problem exclusively in the visual space, introducing a deep neural model that learns to predict a visual feature representation from textual input.
Recently, strong improvements have been obtained by Lee et al. (2018) with a stacked cross-attention mechanism that matches images and textual descriptions by learning a latent correspondence between detected regions and words of the caption. Wang et al. (2019) extended this model by integrating an encoding of the relative position of image regions, which has proven to further enhance the learning of the joint embedding. On the same line, Li et al. (2019) proposed a reasoning model based on graph convolutional networks to generate a visual representation that captures key objects and semantic concepts of a scene. All of these supervised methods have proven effective when trained on large-scale datasets, and are not designed to work with scarce data.
Only a few works have applied image-text matching strategies to artistic data. Among them, Garcia and Vogiatzis (2018) used additional metadata such as title, author, genre, and period of the paintings to find corresponding image-text pairs. Stefanini et al. (2019) introduced a new dataset and a visual-semantic model to discriminate visual and contextual sentences associated with artistic images and, at the same time, to align the corresponding visual and textual elements. While Garcia and Vogiatzis (2018) and Stefanini et al. (2019) matched images and textual descriptions in a supervised way, Baraldi et al. (2018) and Carraggi et al. (2018) addressed the problem in a semi-supervised setting, adapting the knowledge learned on a given source domain to align images and text belonging to a different target domain, without directly training the model on the target domain. This solution, which is known as domain adaptation, has been used in a wide variety of applications such as image classification (Long et al., 2017), semantic segmentation (Hoffman et al., 2018; Chen et al., 2018b), object detection (Inoue et al., 2018; Chen et al., 2018a), and image captioning (Chen et al., 2017; Yang et al., 2018). Typically, it is addressed by minimizing the distance between feature space statistics of the source and target domains, or by using domain adversarial objectives in which a domain classifier is trained to distinguish between the source and target representations.

Semi-supervised cross-modal retrieval
In the following, we describe our strategy for cross-modal retrieval in the artistic domain. Our model has a two-fold role: retrieving relevant images given textual sentences as queries, and retrieving relevant sentences given images as queries. The parameters of the model are learned with the objective of maximizing recall at K, i.e., the fraction of queries for which the most relevant item is ranked among the top-K retrieved ones. As training data in the artistic domain is often scarce, we build a proposal that does not need a paired training set in which the associations between images and sentences are known in advance. Rather, our model transfers the knowledge learned on a source annotated dataset to a target dataset in which the pairing between the two modalities is unknown at training time.
In a nutshell, the paradigm of the common embedding space is exploited to learn similarities between images and sentences. In addition to using global feature vectors to encode data from both modalities, we also investigate the use of auto-encoders to learn more compact representations of images and sentences. To transfer knowledge to the artistic domain without leveraging annotated pairs, we devise a distribution alignment strategy based on the Maximum Mean Discrepancy measure, which aims at uncovering a suitable cross-modal representation of cultural heritage data without supervision.

Visual-semantic embeddings
Aligning works of art and their corresponding textual descriptions requires the ability to compare visual and textual data in this particular domain. To this end, we adopt the strategy of creating a shared multi-modal embedding space, in which both textual and visual elements can be projected and compared using a similarity function.
Formally, we denote with $\phi(I; w_\phi) \in \mathbb{R}^{D_\phi}$ the feature representation computed from an image $I$ of the dataset (such as the representation coming from a CNN), and with $\psi(T; w_\psi) \in \mathbb{R}^{D_\psi}$ the representation of a textual element $T$, computed, for example, using a text encoder on one-hot vectors, or as a function of pre-trained word embeddings. Here, $w_\phi$ and $w_\psi$ indicate, respectively, the learnable weights of the visual and textual encoders.
To project these representations into a common semantic space, we perform a linear projection followed by an $\ell_2$ normalization step, so that the resulting embedding space lies on the $\ell_2$ unit ball:

$$f(I) = \ell_2\text{-norm}\left(w_f^T \phi(I; w_\phi)\right), \qquad (1)$$
$$g(T) = \ell_2\text{-norm}\left(w_g^T \psi(T; w_\psi)\right), \qquad (2)$$

where $\ell_2\text{-norm}(\cdot)$ is the $\ell_2$ normalization function. Being $D$ the dimensionality of the joint embedding space, $w_f$ is a $D_\phi \times D$ matrix, and $w_g$ is a $D_\psi \times D$ matrix.
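The projection step can be sketched in a few lines of NumPy. This is only a hedged illustration: the weight matrices are randomly initialized here, and the dimensionalities are the illustrative values used later in the implementation details, not learned parameters.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    # l2-normalize each row, so that embeddings lie on the l2 unit ball.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def embed(features, W):
    # Linear projection into the joint space followed by l2 normalization.
    return l2norm(features @ W)

rng = np.random.default_rng(0)
D_phi, D_psi, D = 2048, 512, 512          # illustrative dimensionalities
W_f = 0.01 * rng.normal(size=(D_phi, D))  # visual projection matrix
W_g = 0.01 * rng.normal(size=(D_psi, D))  # textual projection matrix

img = embed(rng.normal(size=(4, D_phi)), W_f)  # 4 image features -> joint space
txt = embed(rng.normal(size=(4, D_psi)), W_g)  # 4 sentence features -> joint space
sim = img @ txt.T  # cosine similarities between all image-sentence pairs
```

Because both projections are $\ell_2$-normalized, the dot product `img @ txt.T` directly gives the cosine similarity used for retrieval.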
Visual and textual elements can be compared in the joint multi-modal embedding space by computing the cosine similarity (equivalent, in this case, to a dot product) between their projections, so that the similarity between an image $I$ and a caption $T$ becomes

$$S(I, T) = f(I)^T g(T). \qquad (3)$$

Clearly, the utility of the joint embedding space is maximized when it exhibits suitable cross-modality matching properties, i.e., when similarities in the embedding space correspond to meaningful similarities in both modalities.
In this case, the embedding space acts as a bridge between the two modalities and makes it possible to retrieve textual pieces describing a query image, and images described by a query caption, by identifying the closest neighbors in both modalities. Given a dataset annotated with matching visual-semantic pairs, a good proxy of this property is to verify that corresponding pairs are neighbours in the embedding space. As a matter of fact, classical approaches have relied on the availability of paired datasets, and have learned the joint embedding for a specific domain in a completely supervised way, e.g., training the parameters of the model according to a hinge triplet ranking loss with margin, which imposes suitable similarities between matching and non-matching elements. Formally, it is defined as:

$$L_{rank}(I, T) = \sum_{\hat{T}} \left[\alpha - S(I, T) + S(I, \hat{T})\right]_+ + \sum_{\hat{I}} \left[\alpha - S(I, T) + S(\hat{I}, T)\right]_+, \qquad (4)$$

where $[x]_+ = \max(0, x)$ and $\alpha$ is a margin. In the equation above, $(I, T)$ is a matching image-text pair (i.e., such that $T$ describes the content of $I$, and $I$ represents the content of $T$), while $\hat{T}$ is a negative text with respect to $I$ (such that $\hat{T}$ does not describe $I$), and $\hat{I}$ is a negative image with respect to $T$ (such that $T$ does not describe $\hat{I}$). The terms contained in both sums require that the difference in similarity between the matching and the non-matching pair is higher than a margin $\alpha$: in the first sum, this is done by considering an image anchor and matching or non-matching captions; in the latter, instead, a caption is used as anchor.
As reported by Faghri et al. (2018), in a completely supervised setting it is often beneficial to replace the sums in Eq. 4 with maximum operations, so as to consider only the most violating non-matching pair.
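The hinge triplet ranking loss, including the hard-negative variant, can be sketched in NumPy as follows. This is an illustrative implementation, assuming a mini-batch of B matching pairs whose similarities lie on the diagonal of the score matrix; the batch layout is a common convention, not a detail stated in the paper.

```python
import numpy as np

def triplet_ranking_loss(img, txt, margin=0.2, hard_negatives=True):
    # img, txt: l2-normalized embeddings of B matching pairs (B x D).
    # The diagonal of the similarity matrix holds the matching pairs.
    S = img @ txt.T
    pos = np.diag(S)
    # Hinge costs: image anchor with caption negatives (rows), and
    # caption anchor with image negatives (columns).
    cost_t = np.maximum(0.0, margin - pos[:, None] + S)
    cost_i = np.maximum(0.0, margin - pos[None, :] + S)
    mask = np.eye(S.shape[0], dtype=bool)
    cost_t[mask] = 0.0   # do not penalize the matching pair itself
    cost_i[mask] = 0.0
    if hard_negatives:
        # Keep only the most violating negative per anchor (max of hinges).
        return cost_t.max(axis=1).sum() + cost_i.max(axis=0).sum()
    return cost_t.sum() + cost_i.sum()
```

With `hard_negatives=False` the function reduces to the sum-of-hinges form of Eq. 4; with `hard_negatives=True` it follows the max-of-hinges variant.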

Auto-encoding images and sentences
In addition to the use of plain global feature vectors, we also investigate an alternative projection strategy in which images and sentences are fed to an auto-encoder to learn a more compact yet powerful representation of the input, which can in turn be used as the input of the projection function defined in Eq. 1.
To this end, we design a textual auto-encoder which can convert variable-length captions into fixed-length representations from which the input sentences can be reconstructed. In particular, our model exploits Gated Recurrent Units (GRUs) (Cho et al., 2014) for both encoding and decoding. Formally, given a sentence $T = (w_1, w_2, ..., w_N)$ with length $N$, we firstly encode it word by word through a single-layer GRU and take the last hidden state of the recurrent layer as the encoding of the sentence. Given the recurrent relation defined by the GRU cell and the $t$-th word, i.e.

$$h_t = \text{GRU}(h_{t-1}, w_t), \qquad (5)$$
the encoding of the input sentence is defined as:

$$e(T) = h_N. \qquad (6)$$

In the decoding stage, the input sentence is reconstructed by feeding $h_N$ to a second GRU layer, which is in charge of generating the reconstructed sentence. During training, at the $t$-th iteration the recurrent layer is fed with $h_N$ and the previous ground-truth words, and it is trained to predict the $t$-th word. Formally, the training objective is thus:

$$L_{txt} = -\sum_{t=1}^{N} \log p\left(w_t \mid h_N, w_1, ..., w_{t-1}\right). \qquad (7)$$

The probability of a word is modeled via a softmax layer applied to the output of the decoder. To reduce the dimensionality of the decoder, a linear embedding transformation is used to project one-hot word vectors into the input space of the decoder and, vice versa, to project the output of the decoder to the dictionary space.
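The encoding stage above can be sketched with a minimal GRU cell in NumPy. This is a hedged illustration only: the decoder is omitted, and the weight layout, gating convention, and initialization are common textbook choices, not the exact configuration used by the authors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    # Minimal GRU cell (Cho et al., 2014); weights act on [input; hidden].
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_hid)
        self.Wz = rng.uniform(-s, s, (d_in + d_hid, d_hid))  # update gate
        self.Wr = rng.uniform(-s, s, (d_in + d_hid, d_hid))  # reset gate
        self.Wh = rng.uniform(-s, s, (d_in + d_hid, d_hid))  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)                   # how much to update
        r = sigmoid(xh @ self.Wr)                   # how much past to keep
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1 - z) * h + z * h_tilde

def encode_sentence(word_vectors, cell, d_hid=512):
    # Run the GRU over the words; the last hidden state h_N encodes
    # the whole sentence (Eq. 5-6).
    h = np.zeros(d_hid)
    for w in word_vectors:
        h = cell.step(w, h)
    return h
```

The fixed-length vector returned by `encode_sentence` plays the role of $h_N$, from which the decoder GRU would reconstruct the caption word by word.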
Given the auto-encoder for the textual part, we build an encoder-decoder model that can take an image feature vector as input and reconstruct it starting from an intermediate, more compact representation. In practice, the encoder model is composed of a single fully connected layer: we empirically observe that a single layer yields a fairly informative representation of the image feature vector. Formally, we define the output of the encoder model $z$ (i.e., the intermediate representation of the input image) as

$$z = \tanh\left(W_e \phi(I) + b_e\right),$$

where $W_e$ and $b_e$ are, respectively, the weight matrix and the bias vector of the encoder. Notice that the output of the encoder layer is fed through a $\tanh$ non-linearity.
The decoder model has a symmetric structure: starting from the intermediate vector $z$, the decoder applies a single fully connected layer that transforms $z$ back to the size of the input image feature vector. Formally, the reconstructed image feature vector $\tilde{\phi}(I)$ is defined as

$$\tilde{\phi}(I) = W_d z + b_d,$$

where $W_d$ and $b_d$ are the weight matrix and the bias vector of the decoder. Overall, the image auto-encoder is trained to minimize the reconstruction error for each input image: we define the decoder loss function as the mean square error between the original image feature vector $\phi(I)$ and the corresponding reconstruction $\tilde{\phi}(I)$.
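A compact NumPy sketch of the image auto-encoder forward pass follows. The dimensionalities are illustrative, the weights are randomly initialized, and the gradient-descent training loop is omitted; this only shows the encoder, decoder, and reconstruction objective described above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_phi, D_z = 2048, 512  # illustrative input and bottleneck sizes

# Encoder: single fully connected layer with tanh, z = tanh(W_e phi(I) + b_e).
W_e, b_e = 0.01 * rng.normal(size=(D_z, D_phi)), np.zeros(D_z)
# Decoder: symmetric fully connected layer back to the input size.
W_d, b_d = 0.01 * rng.normal(size=(D_phi, D_z)), np.zeros(D_phi)

phi = rng.normal(size=D_phi)          # CNN feature vector of an image
z = np.tanh(W_e @ phi + b_e)          # compact intermediate representation
phi_rec = W_d @ z + b_d               # reconstructed feature vector
loss = np.mean((phi - phi_rec) ** 2)  # MSE reconstruction objective
```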

Aligning distributions
While the knowledge of matching and non-matching pairs on a source dataset can be exploited to train the embedding space, as discussed in Sec. 3.1, the two reconstruction losses can be applied to both the source and the target dataset, thus building encoded representations which are suitable for both. However, this is not enough to transfer knowledge from the source domain to the target domain, as there is no guarantee that encoded images and sentences from the target dataset will lie together in the embedding space.
To this end, we match the distributions of textual and visual data in the target domain, while learning from pairs sampled from the source domain. Following recent works in the field (Hubert Tsai et al., 2016; Tsai et al., 2017; Yan et al., 2017), we use the Maximum Mean Discrepancy (MMD) to compare distributions. This, basically, computes the distance between the expectations of the two distributions in a reproducing kernel Hilbert space $H_\kappa$ endowed with a kernel $\kappa$, and can be used as an additional loss term:

$$L_{mmd} = \left\| \mathbb{E}_{I \sim p_I}\left[f(I)\right] - \mathbb{E}_{T \sim p_T}\left[g(T)\right] \right\|_{H_\kappa}^2,$$

where $p_I$ is the distribution of the illustrations, $p_T$ is the distribution of the captions, and the expectations are taken after mapping the embedded samples into $H_\kappa$. The kernel in the MMD criterion must be a universal kernel, and thus we empirically choose a Gaussian kernel:

$$\kappa(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).$$

At training time, we sample two mini-batches, one from the supervised set and a second one from the unsupervised dataset. The back-propagated loss is then the sum of the supervised loss (Eq. 4) on the supervised set, plus the MMD loss $L_{mmd}$ approximated over the batch from the unsupervised set. Additionally, the two loss terms of the auto-encoders are evaluated over both the supervised and the unsupervised batches.
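In practice the MMD term is approximated over mini-batches. A hedged NumPy sketch of the standard (biased) empirical estimate of the squared MMD with a Gaussian kernel is shown below; it is a generic estimator, not the authors' exact implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), for all pairs of rows.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased empirical estimate of the squared MMD between two mini-batches
    # of embeddings (here, image and sentence embeddings of the target set).
    return (gaussian_kernel(X, X, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())
```

When the two batches come from the same distribution the estimate tends to zero, which is exactly the behavior the alignment loss pushes the target-domain embeddings toward.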

Datasets
We perform experiments on two different visual-semantic datasets containing artistic images and corresponding textual descriptions (described below). As source domains, we use Flickr30k and COCO, which are composed of natural images and are commonly used to train cross-modal retrieval methods. For these two datasets, we use the splits provided by Karpathy and Fei-Fei (2015).

BibleVSA (Baraldi et al., 2018). The dataset consists of 2,282 illustrations taken from the digitized version of the Borso d'Este Holy Bible, one of the most significant illustrated manuscripts of the Renaissance. Each image is associated with a single textual phrase, extracted from a textual commentary which describes the content of each page of the manuscript. In our experiments, we use the original training, validation, and test splits, respectively composed of 1,671, 293, and 307 image-caption pairs.

SemArt (Garcia and Vogiatzis, 2018). This dataset is composed of 21,384 paintings extracted from the Web Gallery of Art, which contains European fine-art reproductions dating between the 8th and the 19th century. Each image is associated with an artistic comment and with a set of 7 different attributes comprising the title, the author, and the type of the painting. Overall, the dataset is divided into training, validation, and test splits with 19,244, 1,069, and 1,069 elements, respectively. The average length of each artistic comment is more than 80 words, with a maximum length of 830 words. This highlights the difference between SemArt and ordinary visual-semantic datasets (COCO, for instance, has an average caption length lower than 11 words) and accentuates the challenges of this set of data.
To first validate our solution in a less complex scenario, we limit the validation and test sets to 300 randomly selected image-text pairs. Then, we evaluate our model using a different number of retrievable items.

Implementation details
To encode input images, we use two different convolutional networks: VGG-19 (Simonyan and Zisserman, 2015) and ResNet-152 (He et al., 2016). We extract image features from the fc7 layer of VGG-19 and from the average pooling layer of ResNet-152, thus obtaining an input image embedding dimensionality $D_\phi$ of 4096 and 2048, respectively.

For encoding image descriptions, we use a GRU network (Cho et al., 2014). We set the dimensionality of the GRU and of the joint embedding space $D$ to 512, while the input size of the word embeddings $D_\psi$ is set to 300. We use either a text encoder on one-hot vectors or different pre-trained word embeddings (such as GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017)) as input of the GRU.
The model with textual and visual auto-encoders is trained using the same input and output sizes. When training with pre-trained word embeddings, instead of using the loss function defined in Eq. 7, we compute the cosine distance between the original and reconstructed embeddings of each word.
All experiments are performed using the Adam optimizer with a learning rate of 0.0002 for 15 epochs, after which the learning rate is decreased by a factor of 10. We set the margin $\alpha$ to 0.2, the $\sigma$ parameter of the Gaussian kernel to 1, and the size of the mini-batch to 128.

Analysis of artistic visual-semantic data
To gain insight into the characteristics of the BibleVSA and SemArt datasets, we analyze the distribution of image and textual features, respectively obtained from CNNs and sentence embeddings, and compare them with those extracted from classical visual-semantic datasets.
For the visual part, we extract the activations of the VGG-19 and ResNet-152 networks, while, for textual elements, we embed each word of a caption with a word embedding strategy (either GloVe or FastText). To get a feature vector for a sentence, we sum the $\ell_2$-normalized embeddings of the words, and we apply the $\ell_2$ normalization also to the result. This strategy is largely used in the image and video retrieval literature and is known to preserve the information of the original vectors in a compact representation with fixed dimensionality (Tolias et al., 2016).
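This normalize-sum-normalize aggregation can be sketched in a few lines of NumPy (a minimal illustration; the embedding dimensionality is the one used elsewhere in the paper):

```python
import numpy as np

def l2norm(x, eps=1e-8):
    # Row-wise l2 normalization.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sentence_vector(word_embeddings):
    # l2-normalize each word embedding, sum them, then l2-normalize the sum,
    # yielding a fixed-size sentence descriptor regardless of sentence length.
    return l2norm(l2norm(np.asarray(word_embeddings)).sum(axis=0))
```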
Fig. 2 shows the distributions of visual and textual features of both datasets. To get a suitable two-dimensional representation, we run the t-SNE algorithm (Maaten and Hinton, 2008), which iteratively finds a non-linear projection that preserves the statistical distribution of the pairwise distances of the original space. As can be observed, the features of ordinary visual-semantic datasets share almost the same visual and textual distributions. BibleVSA and SemArt, on the contrary, feature a completely different distribution, according to both modalities and all feature extractors. This underlines, on the one hand, that artistic datasets define a completely new domain; on the other hand, it motivates the low performance of existing models when tested on these datasets.

Cross-modal retrieval results
To evaluate the effectiveness of the visual-semantic embeddings, we report rank-based performance metrics R@K (K = 1, 5, 10) for image and caption retrieval. In particular, R@K computes the percentage of test images or test sentences for which at least one correct result is found among the top-K retrieved sentences, in the case of caption retrieval, or the top-K retrieved images, in the case of image retrieval. Firstly, we assess the performance of our full model when using different CNN features or different word embeddings, to get an insight into the role of different global feature vectors. In Table 1, we show the performance of the proposed approach on the test sets of BibleVSA and SemArt when using image features extracted, respectively, from VGG-19 and ResNet-152. Table 2 compares the use of FastText and GloVe embeddings versus a learned word embedding matrix. In this case, the results on the SemArt test set are obtained by using 300 randomly selected retrievable items.
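The R@K metric used throughout the tables can be sketched as follows. This is a simplified NumPy illustration assuming that the ground-truth match of query i is item i (it ignores the case of multiple relevant items per query):

```python
import numpy as np

def recall_at_k(similarity, k):
    # similarity[i, j]: score between query i and retrievable item j;
    # the ground-truth match of query i is assumed to be item i.
    ranks = (-similarity).argsort(axis=1)  # items sorted by decreasing score
    top_k = ranks[:, :k]
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return 100.0 * hits.mean()             # percentage of queries with a hit
```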
For space reasons, we limit this analysis to a single source dataset (namely, COCO), as we have observed similar behaviours on Flickr30k. The two variants of our approach are denoted as SS-VSE and SS-VSE-AE, where the former refers to the model with global feature vectors and linear projection, and the latter refers to the model with the visual and textual auto-encoders. As can be observed, the global descriptor extracted from ResNet-152 outperforms the one extracted from VGG-19 in almost all settings. Noticeably, learned word embeddings outperform pre-trained solutions. We speculate that this performance drop is due to the highly specialized nature of the target datasets. In this regard, pre-trained word embeddings seem to offer a poor initialization point with respect to learning the word embedding matrix from scratch.

Evaluation of semi-supervised embeddings
In Tables 3 and 4, we compare the performance of the two proposed semi-supervised approaches (SS-VSE and SS-VSE-AE) on the SemArt and BibleVSA test sets with respect to the two models trained without the distribution alignment (VSE and VSE-AE). For these experiments, we use global feature vectors extracted from ResNet-152 and learned word embeddings. Given the significant size of the SemArt dataset, we report retrieval results when using different sets of database items (i.e., 100, 300, 500, 1000).
We notice that, when using a medium-scale source dataset like Flickr30k, the use of the auto-encoder is competitive with the use of a linear projection of the global feature vector. Instead, when transferring from a large-scale dataset like COCO, the reconstruction term is not needed and the reduced size of the representation degrades the performance. In all settings, the MMD loss gives a significant contribution to the final performance, thus confirming the effectiveness of our distribution alignment strategy.
To get a better understanding of the role of the MMD loss, we also show the learned multi-modal embedding space by using t-SNE visualizations. Figure 3 shows the embedding spaces when transferring from COCO to SemArt, with and without the MMD loss. As can be noticed, without the MMD loss the distributions of textual and visual elements on the target domain remain almost separate, as the learning signal from the source domain is not general enough on the target domain. On the contrary, when applying the MMD loss the distribution of the learned image embeddings matches that of the textual counterpart on the target domain, thus confirming the effectiveness of the proposed semi-supervised strategy. Noticeably, the distributions of the source and target domains still remain separate in the embedding space, thus underlining the diverse nature of the two sets.
Finally, Fig. 4 reports sample qualitative results on the BibleVSA and SemArt datasets. As can be noticed, our method can retrieve significant elements without employing any paired supervision from the artistic dataset.

Conclusion
We tackled the task of building visual-semantic retrieval approaches for the Cultural Heritage domain. To this aim, we have proposed a semi-supervised approach which does not rely on labelled data in the artistic domain and transfers the knowledge learned on ordinary visual-semantic datasets to the more challenging case of artistic data. Extensive experimental results validated the proposed strategy. Nevertheless, future research should consider the potential effects of semi-supervised approaches using more fine-grained methods, for example aligning detected regions and sentence words between source and target distributions, instead of their global representations. As this has proven useful in ordinary domains, its interaction with domain adaptation should be investigated. Moreover, a comprehensive comparison of domain adaptation techniques, including those employing adversarial objectives, and of their applicability to the Cultural Heritage domain is needed to further advance the research in the field.

Figure 1: Visual and textual data from the artistic domain are different from those addressed by ordinary visual-semantic datasets, posing significant challenges in the automatic understanding of arts and culture. Our approach can align illustrations and textual elements by transferring the knowledge learned on standard datasets to match images and captions coming from a target domain.

Figure 2: Comparison between the visual and textual features of ordinary visual-semantic datasets (Flickr30k, COCO) and those of the BibleVSA and SemArt datasets. Visualization is obtained by running the t-SNE algorithm on top of the features. Best seen in color.

Figure 4: Qualitative image-to-text (upper) and text-to-image (lower) results on the BibleVSA (first and third rows) and SemArt (second and fourth rows) datasets, using the proposed semi-supervised strategy.

Table 1: Semi-supervised cross-modal retrieval results using different visual features. Results are reported on the BibleVSA and SemArt test sets.

Table 2: Semi-supervised cross-modal retrieval results using different word embeddings. Results are reported on the BibleVSA and SemArt test sets.

Table 4: Semi-supervised retrieval results on the BibleVSA test set.

Table 3: Semi-supervised cross-modal retrieval results on the SemArt test set using a different number N of retrievable items.