Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinitely many ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.


Introduction
Image captioning brings vision and language together in a generative way. As a fundamental step towards machine intelligence, this task has been recently gaining much attention thanks to the spread of Deep Learning architectures which can effectively describe images in natural language [42,18,46,43]. Image captioning approaches are usually capable of learning a correspondence between an input image and a probability distribution over time, from which captions can be sampled either using a greedy decoding strategy [43], or more sophisticated techniques like beam search and its variants [1].
As the two main components of captioning architectures are the image encoding stage and the language model, researchers have focused on improving both phases, which resulted in the emergence of attentive models [46] on one side, and of more sophisticated interactions with the language model on the other [25,5]. Recently, attentive models have been improved by replacing the attention over a grid of features with attention over image regions [3,44,50]. In these models, the generative process attends to a set of regions which are softly selected while generating the caption.
Figure 1: [...] [43], (b) attentive models which integrate features from image regions [3] and (c) our Show, Control and Tell. Our method can produce multiple captions for a given image, depending on a control signal which can be either a sequence or a set of image regions. Moreover, chunks of the generated sentences are explicitly grounded on regions.
Despite these advancements, captioning models still lack controllability and explainability, i.e., their behavior can hardly be influenced and explained. As an example, in the case of attention-driven models, the architecture implicitly selects which regions to focus on at each timestep, but it cannot be supervised from the exterior. While an image can be described in multiple ways, such an architecture provides no way of controlling which regions are described and what importance is given to each region. This lack of controllability creates a distance between human and machine intelligence, as humans can manage the variety of ways in which an image can be described, and select the most appropriate one depending on the task and the context at hand. Most importantly, this also limits the applicability of captioning algorithms to complex scenarios in which some control over the generation process is needed. As an example, a captioning-based driver assistance system would need to focus on dangerous objects on the road to alert the driver, rather than describing the presence of trees and cars when a risky situation is detected. Ultimately, such systems would also need to be explainable, so that their behavior could be easily interpreted in case of failures.
In this paper, we introduce Show, Control and Tell, a framework that explicitly addresses these shortcomings (Fig. 1). It can generate diverse natural language captions depending on a control signal which can be given either as a sequence or as a set of image regions which need to be described. As such, our method is capable of describing the same image by focusing on different regions and in a different order, following the given conditioning. Our model is built on a recurrent architecture which considers the decomposition of a sentence into noun chunks and models the relationship between image regions and textual chunks, so that the generation process can be explicitly grounded on image regions. To the best of our knowledge, this is the first captioning framework controllable from image regions.
Contributions. Our contributions are as follows:
• We propose a novel framework for image captioning which is controllable from the exterior, and which can produce natural language captions explicitly grounded on a sequence or a set of image regions.
• The model explicitly considers the hierarchical structure of a sentence by predicting a sequence of noun chunks. Also, it takes into account the distinction between visual and textual words, thus providing an additional grounding at the word level.
• We evaluate the model with respect to a set of carefully designed baselines, on Flickr30k Entities and on COCO, which we semi-automatically augment with grounding image regions for training and evaluation purposes.
• Our proposed method achieves state-of-the-art results for controllable image captioning on Flickr30k and COCO both in terms of diversity and caption quality, even when compared with methods which focus on diversity.

Related work
A large number of models have been proposed for image captioning [37,47,24,23,17,25]. Generally, all integrate recurrent neural networks as language models, and a representation of the image which might be given by the output of one or more layers of a CNN [43,10,37,24], or by a time-varying vector extracted with an attention mechanism [46,48,24,7,3] selected either from a grid over CNN features, or integrating image regions possibly extracted from a detector [32,3]. Attentive models provided a first way of grounding words to parts of the image, although with a blurry indication which was rarely semantically significant. Regarding the training strategies, notable advances have been made by using Reinforcement Learning to optimize non-differentiable captioning metrics [35,23,37]. In this work, we propose an extended version of this approach which deals with multiple output distributions and rewards the alignment of the caption to the control signal.
Recently, more principled approaches have been proposed for grounding a caption on the image [34,38,15,16]: DenseCap [17] generates descriptions for specific image regions. Further, the Neural Baby Talk approach [25] extends the attentive model in a two-step design in which a word-level sentence template is first generated and then filled by object detectors with concepts found in the image. We instead decompose the caption at the level of noun chunks, and explicitly ground each of them to a region. This approach has the additional benefit of providing an explainability mechanism at the chunk level.
Another related line of work is that of generating diverse descriptions. Some works have extended the beam-search algorithm to sample multiple captions from the same distribution [41,1], while different GAN-based approaches have also appeared [8,39,45]. Most of these improve on diversity, but suffer on accuracy and do not provide controllability over the generation process. Others have conditioned the generation with a specific style or sentiment [27,28,11]. Our work is mostly related to [9], which uses a control input as a sequence of part-of-speech tags. This approach, while generating diversity, is hardly employable to effectively control the generation of the sentence; in contrast, we use image regions as a controllability method.

Method
Sentences are natural language structures which are hierarchical by nature [26]. At the lowest level, a sentence might be thought of as a sequence of words: in the case of a sentence describing an image, we can further distinguish between visual words, which describe something visually present in the image, and textual words, which refer to entities that are not present in the image [25]. Analyzing further the syntactic dependencies between words, we can recover a higher abstraction level in which words can be organized into a tree-like structure: in a dependency tree [12,14,6], each word is linked together with its modifiers (Fig. 2).
Given a dependency tree, nouns can be grouped with their modifiers, thus building noun chunks. For instance, the caption depicted in Fig. 2 can be decomposed into a sequence of different noun chunks: "a young boy", "a cap", "his head", "striped shirt", and "gray and sweat jacket". As noun chunks, just like words, can be visually grounded into image regions, a caption can also be mapped to a sequence of regions, each corresponding to a noun chunk. A chunk might also be associated with multiple image regions of the same class if more than one possible mapping exists. The number of ways in which an image can be described results in different sequences of chunks, linked together to form a fluent sentence. Therefore, captions also differ in terms of the set of considered regions, the order in which they are described, and their mapping to chunks given by the linguistic abilities of the annotator.
Following these premises, we define a model which can recover the variety of ways in which an image can be described, given a control input expressed as a sequence or set of image regions. We begin by presenting the former case, and then show how our model deals with the latter scenario.

Generating controllable captions
Given an image $I$ and an ordered sequence of sets of regions $R = (r_0, r_1, ..., r_N)$, the goal of our captioning model is to generate a sentence $y = (y_0, y_1, ..., y_T)$ which in turn describes all the regions in $R$ while maintaining the fluency of language.
Our model is conditioned on both the input image $I$ and the sequence of region sets $R$, which acts as a control signal, and jointly predicts two output distributions which correspond to the word-level and chunk-level representation of the sentence: the probability of generating a word at a given time, i.e. $p(y_t | R, I; \theta)$, and that of switching from one chunk to another, i.e. $p(g_t | R, I; \theta)$, where $g_t$ is a boolean chunk-shifting gate. During the generation, the model maintains a pointer to the current region set $r_i$ and can shift to the next element in $R$ by means of the gate $g_t$.
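The pointer mechanism described above can be illustrated with a short sketch. It assumes that the pointer starts at the first region set, advances by one after every timestep whose gate equals 1, and stays on the last region set once the control sequence is exhausted; the function name `region_pointer_trace` is hypothetical and not part of the released code.

```python
def region_pointer_trace(gates, num_region_sets):
    """Return the index of the region set in use at each timestep.

    gates: list of 0/1 chunk-shifting gate values, one per timestep.
    num_region_sets: length of the control sequence R.
    Assumption of this sketch: the pointer is clamped to the last region set.
    """
    last = num_region_sets - 1
    pointer = 0
    trace = []
    for g in gates:
        trace.append(pointer)  # region set attended at this timestep
        if g == 1:             # gate fired: move to the next region set
            pointer = min(pointer + 1, last)
    return trace


# Toy caption "a young boy with a cap": the gate fires on "boy" and on "cap".
gates = [0, 0, 1, 0, 0, 1]
print(region_pointer_trace(gates, num_region_sets=2))
```

In this toy example the first three words are generated while pointing at the first region set, and the remaining words while pointing at the second one.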
To generate the output caption, we employ a recurrent neural network with adaptive attention. At each timestep, we compute the hidden state $h_t$ according to the previous hidden state $h_{t-1}$, the current image region set $r_t$ and the current word $w_t$, such that $h_t = \mathrm{RNN}(w_t, r_t, h_{t-1})$. At training time, $r_t$ and $w_t$ are the ground-truth region set and word corresponding to timestep $t$; at test time, $w_t$ is sampled from the first distribution predicted by the model, while the choice of the next image region is driven by the values of the chunk-shifting gate sampled from the second distribution:
$$r_{t+1} \leftarrow R[i], \quad \text{where } i = \min\left(\sum_{k=1}^{t} g_k,\; N\right), \quad g_k \in \{0, 1\} \tag{1}$$
where $\{g_k\}_k$ is the sequence of sampled gate values, and $N$ is the number of region sets in $R$.
Chunk-shifting gate. We compute $p(g_t | R)$ via an adaptive mechanism in which the LSTM computes a compatibility function between its internal state and a latent representation which models the state of the memory at the end of a chunk. The compatibility score is compared to that of attending one of the regions in $r_t$, and the result is used as an indicator to switch to the next region set in $R$.
The LSTM is first extended to obtain a chunk sentinel $s^c_t$, which models a component extracted from the memory encoding the state of the LSTM at the end of a chunk. The sentinel is computed as:
$$l^c_t = \sigma(W_{ig} x_t + W_{hg} h_{t-1}) \tag{2}$$
$$s^c_t = l^c_t \odot \tanh(m_t) \tag{3}$$
where $W_{ig} \in \mathbb{R}^{d \times k}$, $W_{hg} \in \mathbb{R}^{d \times d}$ are learnable weights, $m_t \in \mathbb{R}^d$ is the LSTM cell memory and $x_t \in \mathbb{R}^k$ is the input of the LSTM at time $t$; $\odot$ represents the Hadamard element-wise product and $\sigma$ the sigmoid logistic function.
We then compute a compatibility score between the internal state $h_t$ and the sentinel vector through a single-layer neural network; analogously, we compute a compatibility function between $h_t$ and the regions in $r_t$:
$$z^c_t = w_h^T \tanh(W_{sg} s^c_t + W_g h_t) \tag{4}$$
$$z^r_t = w_h^T \tanh(W_{sr} r_t + (W_g h_t) \mathbb{1}^T) \tag{5}$$
where $n$ is the number of regions in $r_t$, $\mathbb{1} \in \mathbb{R}^n$ is a vector with all elements set to 1, $w_h^T$ is a row vector, and all $W_*$, $w_*$ are learnable parameters. Notice that the representation extracted from the internal state is shared between all compatibility scores, as if the region set and the sentinel vector were part of the same attentive distribution. Contrary to an attentive mechanism, however, there is no value extraction.
The probability of shifting from one chunk to the next one is defined as the probability of attending the sentinel vector $s^c_t$ in a distribution over $s^c_t$ and the regions in $r_t$:
$$p(g_t = 1 | R) = \frac{\exp z^c_t}{\exp z^c_t + \sum_{i=1}^{n} \exp z^r_{ti}} \tag{6}$$
where $z^r_{ti}$ indicates the $i$-th element in $z^r_t$, and we dropped the dependency between $n$ and $t$ for clarity. At test time, the value of gate $g_t \in \{0, 1\}$ is then sampled from $p(g_t | R)$ and drives the shifting to the next region set in $R$.
Figure 3: Overview of the approach. Given an image and a control signal, the figure shows the process to generate the controlled caption and the architecture of the language model.
Adaptive attention with visual sentinel. While the chunk-shifting gate predicts the end of a chunk, thus linking the generation process with the control signal given by $R$, once $r_t$ has been selected a second mechanism is needed to attend its regions and distinguish between visual and textual words. To this end, we build an adaptive attention mechanism with a visual sentinel [24].
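As a minimal sketch of the chunk-shifting probability discussed above, the snippet below computes the share of probability mass assigned to the chunk sentinel in a softmax taken jointly over the sentinel score and the region scores. The scores are passed in as plain numbers; how they are produced by the network is outside the scope of this sketch, and the function name `shift_probability` is hypothetical.

```python
import math


def shift_probability(z_sentinel, z_regions):
    """Probability of attending the chunk sentinel in a softmax over
    [sentinel score] + region scores, i.e. the probability of g_t = 1."""
    scores = [z_sentinel] + list(z_regions)
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)


# A sentinel score well above the region scores makes a shift likely.
print(shift_probability(3.0, [0.5, -1.0, 0.2]))
# A sentinel score well below the region scores makes a shift unlikely.
print(shift_probability(-3.0, [0.5, -1.0, 0.2]))
```

At test time the binary gate would then be drawn from a Bernoulli distribution with this probability, or set by thresholding when decoding greedily.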
The visual sentinel vector models a component of the memory to which the model can fall back when it chooses to not attend a region in $r_t$. Analogously to Eq. 2, it is defined as:
$$l^v_t = \sigma(W_{is} x_t + W_{hs} h_{t-1}) \tag{7}$$
$$s^v_t = l^v_t \odot \tanh(m_t) \tag{8}$$
where $W_{is} \in \mathbb{R}^{d \times k}$ and $W_{hs} \in \mathbb{R}^{d \times d}$ are matrices of learnable weights. An attentive distribution is then generated over the regions in $r_t$ and the visual sentinel vector $s^v_t$:
$$\alpha_t = \mathrm{softmax}([z^r_t;\; w_h^T \tanh(W_{ss} s^v_t + W_g h_t)]) \tag{9}$$
where $[\cdot]$ indicates concatenation. Based on the attention distribution, we obtain a context vector which can be fed to the LSTM as a representation of what the network is attending:
$$c_t = \sum_{i=1}^{n+1} \alpha_{ti} [r_t;\; s^v_t] \tag{10}$$
Notice that the context vector will be, mostly, an approximation of one of the regions in $r_t$ or the visual sentinel. However, $r_t$ will vary at different timesteps according to the chunk-shifting mechanism, thus following the control input. The model can alternate the generation of visual and textual words by means of the visual sentinel.
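The adaptive attention step can be sketched as follows: a softmax is taken over the region scores plus one extra score for the visual sentinel, and the context vector is the resulting weighted sum of the region features and the sentinel. This is an illustrative sketch with plain Python lists; the names `softmax` and `context_vector` are hypothetical, and the scores are assumed to be given.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(region_feats, visual_sentinel, z_regions, z_sentinel):
    """Weighted sum of region features and the visual sentinel.

    region_feats: list of n feature vectors (lists of floats, same length).
    visual_sentinel: feature vector with the same length as the regions.
    z_regions: n compatibility scores; z_sentinel: score of the sentinel.
    Returns (context, weights), where weights has n + 1 entries.
    """
    weights = softmax(list(z_regions) + [z_sentinel])
    candidates = list(region_feats) + [visual_sentinel]
    dim = len(visual_sentinel)
    context = [sum(w * vec[j] for w, vec in zip(weights, candidates))
               for j in range(dim)]
    return context, weights


regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [0.5, 0.5]
# A dominant sentinel score pulls the context towards the sentinel,
# which corresponds to generating a textual (non-visual) word.
ctx, alpha = context_vector(regions, sentinel, [0.1, 0.2], 4.0)
print(alpha, ctx)
```

When the last attention weight dominates, the context is close to the sentinel; when one of the first weights dominates, the context is close to the corresponding region feature.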

Objective
The captioning model is trained using a loss function which considers the two output distributions of the model. Given the target ground-truth caption $y^*_{1:T}$, the ground-truth region sets $r^*_{1:T}$ and chunk-shifting gate values corresponding to each timestep $g^*_{1:T}$, we train both distributions by means of a cross-entropy loss. The relationship between target region sets and gate values will be further expanded in the implementation details. The loss function for a sample is defined as:
$$L(\theta) = -\sum_{t=1}^{T} \Big( \underbrace{\log p(y^*_t | r^*_{1:t}, y^*_{1:t-1})}_{\text{Word-level probability}} + \underbrace{g^*_t \log p(g_t = 1 | r^*_{1:t}, y^*_{1:t-1}) + (1 - g^*_t) \log\big(1 - p(g_t = 1 | r^*_{1:t}, y^*_{1:t-1})\big)}_{\text{Chunk-level probability}} \Big) \tag{11}$$
Following previous works [35,37,3], after a pre-training step using cross-entropy, we further optimize the sequence generation using Reinforcement Learning. Specifically, we use the self-critical sequence training approach [37], which baselines the REINFORCE algorithm with the reward obtained under the inference model at test time.
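The cross-entropy pre-training objective over the two output distributions can be sketched as below. The sketch assumes that the per-timestep probabilities have already been produced by the model: `word_probs[t]` is the probability assigned to the ground-truth word, and `gate_probs[t]` is the predicted probability that the gate equals 1. The function name `sample_loss` and the equal weighting of the two terms are assumptions of this example.

```python
import math


def sample_loss(word_probs, gate_probs, gate_targets):
    """Negative log-likelihood of one caption under both distributions.

    word_probs[t]: probability of the ground-truth word at timestep t.
    gate_probs[t]: predicted probability that the chunk-shifting gate is 1.
    gate_targets[t]: ground-truth gate value, 0 or 1.
    """
    loss = 0.0
    for p_word, p_gate, g in zip(word_probs, gate_probs, gate_targets):
        loss -= math.log(p_word)  # word-level term
        # chunk-level term: binary cross-entropy on the gate
        loss -= math.log(p_gate if g == 1 else 1.0 - p_gate)
    return loss


# Three timesteps; the chunk ends at the last one.
print(sample_loss([0.9, 0.8, 0.7], [0.1, 0.2, 0.9], [0, 0, 1]))
```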
Given the nature of our model, we extend the approach to work on multiple output distributions. At each timestep, we sample from both $p(y_t | R)$ and $p(g_t | R)$ to obtain the next word $w_{t+1}$ and region set $r_{t+1}$. Once an EOS tag is reached, we compute the reward of the sampled sentence $w^s$ and backpropagate with respect to both the sampled word sequence $w^s$ and the sequence of chunk-shifting gates $g^s$. The final gradient expression is thus:
$$\nabla_\theta L(\theta) = -(r(w^s) - b)\left(\nabla_\theta \log p(w^s) + \nabla_\theta \log p(g^s)\right) \tag{12}$$
where $b = r(\hat{w})$ is the reward of the sentence obtained using the inference procedure (i.e. by sampling the word and gate value with maximum probability). We then build a reward function which jointly considers the quality of the caption and its alignment with the control signal $R$.
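A common way to implement a self-critical update of this kind is through a surrogate loss whose gradient matches the policy-gradient expression: the advantage (sampled reward minus greedy reward) is treated as a constant and multiplies the summed log-probabilities of the sampled words and gates. The sketch below only computes the scalar value of such a surrogate; in practice the log-probabilities would be tensors in an automatic differentiation framework. The name `self_critical_loss` is hypothetical.

```python
def self_critical_loss(sample_reward, greedy_reward, word_logprobs, gate_logprobs):
    """Surrogate loss for self-critical training over two distributions.

    sample_reward: reward of the sampled caption.
    greedy_reward: baseline, i.e. reward of the greedily decoded caption.
    word_logprobs: log-probabilities of the sampled words.
    gate_logprobs: log-probabilities of the sampled gate values.
    """
    advantage = sample_reward - greedy_reward  # treated as a constant
    return -advantage * (sum(word_logprobs) + sum(gate_logprobs))


# The sampled caption beats the greedy one, so minimizing the loss
# increases the log-probability of the sampled words and gates.
print(self_critical_loss(1.2, 0.9, [-0.5, -1.0], [-0.1, -0.3]))
```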
Rewarding caption quality. To reward the overall quality of the generated caption, we use image captioning metrics as a reward. Following previous works [3], we employ the CIDEr metric (specifically, the CIDEr-D score) which has been shown to correlate better with human judgment [40].
Rewarding the alignment. While captioning metrics can reward the semantic quality of the sentence, none of them can evaluate the alignment with respect to the control input. Therefore, we introduce an alignment score based on the Needleman-Wunsch algorithm [30]. Given a predicted caption $y$ and its target counterpart $y^*$, we extract all nouns from both sentences, and evaluate the alignment between them, recalling the relationships between noun chunks and region sets. We use the following scoring system: the reward for matching two nouns is equal to the cosine similarity between their word embeddings; a gap gets a negative reward equal to the minimum similarity value, i.e. −1. Once the optimal alignment is computed, we normalize its score, $al(y, y^*)$, with respect to the length of the sequences. The alignment score is thus defined as:
$$\mathrm{NW}(y, y^*) = \frac{al(y, y^*)}{\max(\#y, \#y^*)} \tag{13}$$
where $\#y$ and $\#y^*$ represent the number of nouns contained in $y$ and $y^*$, respectively. Notice that $\mathrm{NW}(\cdot, \cdot) \in [-1, 1]$. The final reward that we employ is a weighted version of CIDEr-D and the alignment score.
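The alignment score can be sketched with a standard Needleman-Wunsch dynamic program over the two noun sequences. The example below makes two assumptions that should be read as such: the score is normalized by the length of the longer noun sequence, and a toy exact-match similarity stands in for the cosine similarity between word embeddings. The names `nw_alignment_score` and `exact_match` are hypothetical.

```python
def nw_alignment_score(pred_nouns, target_nouns, similarity, gap=-1.0):
    """Needleman-Wunsch global alignment score, normalized to [-1, 1].

    similarity(a, b) should return a value in [-1, 1]; a gap costs `gap`.
    Assumption: normalization by the length of the longer sequence.
    """
    n, m = len(pred_nouns), len(target_nouns)
    if max(n, m) == 0:
        return 0.0
    # score[i][j]: best score aligning the first i predicted nouns
    # with the first j target nouns
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + similarity(pred_nouns[i - 1], target_nouns[j - 1])
            skip_pred = score[i - 1][j] + gap
            skip_target = score[i][j - 1] + gap
            score[i][j] = max(match, skip_pred, skip_target)
    return score[n][m] / max(n, m)


def exact_match(a, b):
    # Toy stand-in for the cosine similarity between word embeddings.
    return 1.0 if a == b else 0.0


print(nw_alignment_score(["woman", "dog", "table"], ["woman", "table", "dog"], exact_match))
```

With a real embedding-based similarity, near-synonyms such as "woman" and "girl" would receive partial credit instead of zero.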

Controllability through a set of detections
The proposed architecture, so far, can generate a caption controlled by a sequence of region sets R. To deal with the case in which the control signal is unsorted, i.e. a set of region sets, we build a sorting network which can arrange the control signal in a candidate order, learning from data. The resulting sequence can then be given to the captioning network to produce the output caption (Fig. 3).
To this aim, we train a network which can learn a permutation, taking inspiration from Sinkhorn networks [29]. As shown in [29], the non-differentiable parameterization of a permutation can be approximated in terms of a differentiable relaxation, the so-called Sinkhorn operator. While a permutation matrix has exactly one entry of 1 in each row and each column, the Sinkhorn operator iteratively normalizes rows and columns of any matrix to obtain a "soft" permutation matrix, i.e. a real-valued matrix close to a permutation one.
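The iterative normalization performed by the Sinkhorn operator can be sketched in a few lines. This is an illustrative version on plain Python lists, which exponentiates the input to make it positive and then alternates row and column normalization for a fixed number of iterations; the iteration count and the function name `sinkhorn` are assumptions of this example, not settings taken from the paper.

```python
import math


def sinkhorn(matrix, n_iters=20):
    """Approximate a doubly stochastic ("soft permutation") matrix.

    matrix: square list of lists with real-valued scores.
    Rows and columns of exp(matrix) are normalized in alternation.
    """
    p = [[math.exp(v) for v in row] for row in matrix]
    n = len(p)
    for _ in range(n_iters):
        for row in p:  # make every row sum to 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):  # make every column sum to 1
            s = sum(p[i][j] for i in range(n))
            for i in range(n):
                p[i][j] /= s
    return p


scores = [[1.0, 0.0, -1.0],
          [0.5, 2.0, 0.0],
          [0.0, 0.3, 1.5]]
for row in sinkhorn(scores):
    print([round(v, 3) for v in row])
```

Scaling the input scores by a large factor before applying the operator pushes the result closer to a hard permutation matrix.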
Given a set of region sets $R = \{r_1, r_2, ..., r_N\}$, we learn a mapping from $R$ to its sorted version $R^*$. First, we pass each element in $R$ through a fully-connected network which processes every item of a region set independently and produces a single output feature vector with length $N$. By concatenating together the feature vectors obtained for all region sets, we thus get an $N \times N$ matrix, which is then passed to the Sinkhorn operator to obtain the soft permutation matrix $P$. The network is then trained by minimizing the mean square error between the scrambled input and its reconstructed version obtained by applying the soft permutation matrix to the sorted ground-truth, i.e. $P^T R^*$.
At test time, we take the soft permutation matrix and apply the Hungarian algorithm [20] to obtain the final permutation matrix, which is then used to get the sorted version of R for the captioning network.
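The test-time step can be illustrated on a small example. For brevity, the sketch below replaces the Hungarian algorithm with an exhaustive search over permutations, which returns the same optimal assignment but is only practical for a handful of region sets. It also assumes the convention that entry (i, j) of the soft permutation matrix scores placing the j-th scrambled element at sorted position i. The names `hard_permutation` and `apply_permutation` are hypothetical.

```python
from itertools import permutations


def hard_permutation(soft):
    """Return the assignment maximizing the total soft-permutation mass.

    soft: N x N list of lists. The result `cols` satisfies: sorted
    position i takes the element at scrambled position cols[i].
    Brute force is used here as a stand-in for the Hungarian algorithm.
    """
    n = len(soft)
    best = max(permutations(range(n)),
               key=lambda cols: sum(soft[i][cols[i]] for i in range(n)))
    return list(best)


def apply_permutation(assignment, scrambled):
    return [scrambled[j] for j in assignment]


soft = [[0.1, 0.8, 0.1],
        [0.7, 0.2, 0.1],
        [0.2, 0.0, 0.8]]
order = hard_permutation(soft)
print(order, apply_permutation(order, ["dog", "girl", "table"]))
```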

Implementation details
Language model and image features. We use a language model with two LSTM layers (Fig. 3): the input of the bottom layer is the concatenation of the embedding of the current word, the image descriptor, as well as the hidden state of the second layer. This layer predicts the context vector via the visual sentinel as well as the chunk-gate. The second layer, instead, takes as input the context vector and the hidden state of the bottom layer and predicts the next word.
To represent image regions, we use Faster R-CNN [36] with ResNet-101 [13]. In particular, we employ the model finetuned on the Visual Genome dataset [19] provided by [3]. As image descriptor, following the same work [3], we average the feature vectors of all the detections.
The hidden size of the LSTM layers is set to 1000, and that of attention layers to 512, while the input word embedding size is set to 1000.
Ground-truth chunk-shifting gate sequences. Given a sentence where each word of a noun chunk is associated to a region set, we build the chunk-shifting gate sequence $\{g^*_t\}_t$ by setting $g^*_t$ to 1 on the last word of every noun chunk, and 0 otherwise. The region set sequence $\{r^*_t\}_t$ is built accordingly, by replicating the same region set until the end of a noun chunk, and then using the region set of the next chunk. To compute the alignment score and for extracting dependencies, we use the spaCy NLP toolkit. We use GloVe [33] as word vectors.
Sorting network. To represent regions, we use Faster R-CNN vectors, the normalized position and size and the GloVe embedding of the region class. Additional details on architectures and training can be found in the Supplementary material.
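The construction of the ground-truth gate and region set sequences can be sketched as follows. The input format is an assumption of this example: each word carries the index of the noun chunk it belongs to, or `None` for words outside any chunk, and region sets are given in chunk order. Words that follow the last chunk keep the last region set, which is also an assumption. The name `build_targets` is hypothetical.

```python
def build_targets(chunk_ids, region_sets):
    """Build per-timestep gate values and region sets for one caption.

    chunk_ids[t]: index of the noun chunk containing word t, or None.
    region_sets[c]: region set grounded to noun chunk c (non-empty list).
    Returns (gates, regions), both of the same length as chunk_ids.
    """
    gates, regions = [], []
    pointer = 0
    last = len(region_sets) - 1
    for t, chunk in enumerate(chunk_ids):
        nxt = chunk_ids[t + 1] if t + 1 < len(chunk_ids) else None
        is_chunk_end = chunk is not None and nxt != chunk
        gates.append(1 if is_chunk_end else 0)  # 1 on the last word of a chunk
        regions.append(region_sets[pointer])    # replicate the current region set
        if is_chunk_end:
            pointer = min(pointer + 1, last)    # move on to the next chunk
    return gates, regions


# "a young boy with a cap": chunks are "a young boy" and "a cap".
chunk_ids = [0, 0, 0, None, 1, 1]
gates, regions = build_targets(chunk_ids, ["boy_regions", "cap_regions"])
print(gates)
print(regions)
```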

Datasets
We experiment with two datasets: Flickr30k Entities, which already contains the associations between chunks and image regions, and COCO, which we annotate semi-automatically. Table 1 summarizes the datasets we use.
Flickr30k Entities [34]. Based on Flickr30k [49], it contains 31,000 images annotated with five sentences each. Entity mentions in the caption are linked with one or more corresponding bounding boxes in the image. Overall, 276,000 manually annotated bounding boxes are available. In our experiments, we automatically associate each bounding box with the image region with maximum IoU among those detected by the object detector. We use the splits provided by Karpathy et al. [18].
COCO Entities. Microsoft COCO [22] contains more than 120,000 images, each of them annotated with around five crowd-sourced captions. Here, we again follow the splits defined by [18] and automatically associate noun chunks with image regions extracted from the detector [36].
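The association between an annotated bounding box and the detected region with maximum IoU, as used for Flickr30k Entities, can be sketched as below. Boxes are assumed to be given as `(x1, y1, x2, y2)` tuples in pixel coordinates; the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_matching_region(annotated_box, detected_boxes):
    """Index of the detected box with the highest IoU with the annotation."""
    return max(range(len(detected_boxes)),
               key=lambda i: box_iou(annotated_box, detected_boxes[i]))


detections = [(5, 5, 6, 6), (0, 0, 2, 3), (1, 1, 3, 3)]
print(best_matching_region((0, 0, 2, 2), detections))
```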
We first build an index associating each noun of the dataset with the five most similar class names, using word vectors. Then, each noun chunk in a caption is associated, by using either its name or the base form of its name, with the first class found in the index which is available in the image. This association process, as confirmed by an extensive manual verification step, is generally reliable and produces few false positive associations. Naturally, it can result in region sets with more than one element (as in Flickr30k), and noun chunks with an empty region set. In the latter case, we fill empty training region sets with the most probable detections of the image and let the adaptive attention mechanism learn the corresponding association; in validation and testing, we drop those captions. Some examples of the additional annotations extracted from COCO are shown in Fig. 4.

Experimental setting
The experimental setting we employ is different from that of standard image captioning. In our scenario, indeed, the sequence of region sets is a second input to the model which must be considered when selecting the ground-truth sentences to compare against. Also, we employ additional metrics beyond the standard ones like BLEU-4 [31], METEOR [4], ROUGE [21], CIDEr [40] and SPICE [2].
When evaluating the controllability with respect to a sequence, for each ground-truth regions-image input (R, I), we evaluate against all captions in the dataset which share the same pair. Also, we employ the alignment score (NW) to evaluate how the model follows the control input.
Similarly, when evaluating the controllability with respect to a set of regions, given a set-image pair (R, I), we evaluate against all ground-truth captions which have the same input. To assess how the predicted caption covers the control signal, we also define a soft intersection-over-union (IoU) measure between the ground-truth set of nouns and its predicted counterpart, recalling the relationships between region sets and noun chunks. First, we compute the optimal assignment between the two sets of nouns, using distances between word vectors and the Hungarian algorithm [20], and define an intersection score between the two sets as the sum of assignment profits. Then, recalling that set union can be expressed as a function of an intersection, we define the IoU measure as follows:
$$\mathrm{IoU}(y, y^*) = \frac{I(y, y^*)}{\#y + \#y^* - I(y, y^*)}$$
where $I(\cdot, \cdot)$ is the intersection score, and the $\#$ operator represents the cardinality of the two sets of nouns.
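The soft IoU measure can be sketched as follows. For readability, the optimal assignment is found by exhaustive search rather than with the Hungarian algorithm, which is feasible only for small noun sets, and a toy exact-match similarity stands in for the similarity between word vectors. The similarity is assumed to be symmetric. The names `soft_iou` and `exact_match` are hypothetical.

```python
from itertools import permutations


def exact_match(a, b):
    # Toy stand-in for a similarity between word vectors.
    return 1.0 if a == b else 0.0


def soft_iou(pred_nouns, target_nouns, similarity=exact_match):
    """Soft intersection over union between two collections of nouns.

    The intersection score is the total similarity of the best one-to-one
    assignment between the smaller and the larger collection.
    """
    small, large = sorted((list(pred_nouns), list(target_nouns)), key=len)
    if not large:
        return 0.0
    intersection = max(
        sum(similarity(s, l) for s, l in zip(small, chosen))
        for chosen in permutations(large, len(small))
    )
    union = len(small) + len(large) - intersection
    return intersection / union if union > 0 else 0.0


print(soft_iou(["dog", "man"], ["man", "dog", "bench"]))
```

With exact matching this reduces to the ordinary IoU between the two noun sets; an embedding-based similarity would additionally give partial credit to related nouns.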

Baselines
Controllable LSTM. We start from a model without attention: an LSTM language model with a single visual feature vector. Then, we generate a sequential control input by feeding a flattened version of R to a second LSTM and taking the last hidden state, which is concatenated to the visual feature vector. The structure of the language model resembles that of [3], without attention.
Controllable Up-Down. In this case, we employ the full Up-Down model from [3], which creates an attentive distribution over image regions, and make it controllable by feeding only the regions selected in R and ignoring the rest. This baseline is not sequentially controllable.
Ours without visual sentinel. To investigate the role of the visual sentinel and its interaction with the gate sentinel, in this baseline we ablate our model by removing the visual sentinel. The resulting baseline, therefore, lacks a mechanism to distinguish between visual and textual words.
Ours with single sentinel. Again, we ablate our model by merging the visual and chunk sentinel: a single sentinel is used for both roles, in place of $s^c_t$ and $s^v_t$.
As further baselines, we also compare against non-controllable captioning approaches, like FC-2K [37], Up-Down [3], and Neural Baby Talk [25].

Quantitative results
Controllability through a sequence of detections. First, we show the performance of our model when providing the full control signal as a sequence of region sets. Table 2 shows results on COCO Entities, in comparison with the aforementioned approaches. We can see that our method achieves state-of-the-art results on all automatic evaluation metrics, outperforming all baselines both in terms of overall caption quality and in terms of alignment with the control signal. Using the cross-entropy pre-training, we outperform the Controllable LSTM and Controllable Up-Down by 32.0 on CIDEr and 0.112 on NW. Optimizing the model with CIDEr and NW further increases the alignment quality while still outperforming the baselines on all metrics, leading to a final 0.649 on NW, which outperforms the Controllable Up-Down baseline by 0.25. Recalling that NW ranges from −1 to 1, this improvement amounts to 12.5% of the full metric range.
In Table 3, we instead show the results of the same experiments on Flickr30k Entities, using CIDEr+NW optimization for all controllable methods. Also on this manually annotated dataset, our method outperforms all the compared approaches by a significant margin, both in terms of caption quality and alignment with the control signal.
Controllability through a set of detections. We then assess the performance of our model when controlled with a set of detections. Tables 4 and 5 show the performance of our method in this setting, respectively on COCO Entities and Flickr30k Entities. We notice that the proposed approach outperforms all baselines and compared approaches in terms of IoU, thus testifying that we are capable of respecting the control signal more effectively. This is also combined with better captioning metrics, which indicate higher semantic quality.
Diversity evaluation. Finally, we also assess the diversity of the generated captions, comparing with the most recent approaches that focus on diversity. In particular, we compare with the variational autoencoder proposed in [45] and the approach of [9], which allows diversity and controllability by feeding PoS sequences. To test our method on a significant number of diverse captions, given an image we take all regions which are found in control region sets, and take the permutations which result in captions with higher log-probability. This approach is fairly similar to the sampling strategy used in [9], even if ours considers region sets. Then, we follow the experimental approach defined in [45,9]: each ground-truth sentence is evaluated against the generated caption with the maximum score for each metric. Higher scores, thus, indicate that the method is capable of sampling high accuracy captions. Results are reported in Table 6, where to guarantee the fairness of the comparison, we run these experiments on the full COCO test split.
As can be seen, our method can generate significantly diverse captions.
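The "best of n" protocol used in the diversity evaluation can be sketched as follows: for each ground-truth sentence, only the generated caption with the highest score is retained. The example below uses a toy unigram-overlap metric as a stand-in for the real captioning metrics, and averaging the per-reference maxima is an assumption of this sketch. The names `unigram_overlap` and `oracle_score` are hypothetical.

```python
def unigram_overlap(candidate, reference):
    """Toy metric: Jaccard overlap between the word sets of two captions."""
    c = set(candidate.lower().split())
    r = set(reference.lower().split())
    return len(c & r) / len(c | r) if (c | r) else 0.0


def oracle_score(references, generated, metric=unigram_overlap):
    """Score each reference against its best-matching generated caption.

    Assumption: the per-reference maxima are averaged into a single value.
    """
    best_per_reference = [max(metric(g, ref) for g in generated)
                          for ref in references]
    return sum(best_per_reference) / len(best_per_reference)


references = ["a dog sitting on a bench with a man",
              "a man and his dog on a bench"]
generated = ["a dog sitting on a bench with a man",
             "a couple of people sitting on a bench with a dog"]
print(oracle_score(references, generated))
```

Under this protocol, a method that samples a more varied set of accurate captions tends to cover more of the ground-truth sentences and therefore obtains higher scores.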

Conclusion
We presented Show, Control and Tell, a framework for generating controllable and grounded captions through regions. Our work is motivated by the need to bring captioning systems to more complex scenarios. The approach considers the decomposition of a sentence into noun chunks, and grounds chunks to image regions following a control signal. Experimental results, conducted on Flickr30k Entities and on COCO Entities, validate the effectiveness of our approach in terms of controllability and diversity.