Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era



Introduction
When humans look around the world, observing an image or watching a video sequence, attentive mechanisms drive their gaze towards salient regions. Attentional mechanisms have been studied in psychology and neuroscience for decades [17], and it is well established that attention is mainly bottom-up in its early stages, although influenced by some contextual cues, and guided by the salient points in the scene, which is scanned very quickly by the eyes (in about 25-50 ms per item). If the person has a task-driven behaviour, e.g. when driving a car, top-down attentive processes arise; they are slower (at least 200 ms of reaction time in humans) and depend on the learned semantics of the scene. In general, the control of attention combines stimuli processed in different cortical areas to mix spatial localization and recognition tasks, integrating data-driven pop-outs and learned semantics. It also evolves over time, since mechanisms such as the inhibition of return and the control of eye movements allow humans to progressively refine attention.
Reproducing the same attentional process in artificial vision is still an open problem. In the case of a static image, researchers have shown that salient regions can be identified by considering discontinuities in low-level visual features, such as color, texture and contrast, as well as high-level cues, like faces, text, and the horizon. When watching a video sequence, instead, static visual features have lower importance while motion gains a crucial role, motivating the need for different solutions for static images and video. In both scenarios, computational models capable of identifying salient regions can enhance many vision-based inference mechanisms, ranging from image captioning [11] to video compression [13].
Since the seminal research of Koch, Ullman and Itti [23,18], traditional prediction models have followed biological evidence using low-level features and semantic concepts [14,22]. With the advent of Deep Learning (DL), researchers have developed data-driven architectures capable of overcoming many of the limitations of previous hand-crafted models. This is not only due to the brute force of DL architectures and their capability of being trained on supervised data: these architectures are particularly suitable for this area since they closely recall biological neural models. Still, it is surprising to see how much today's models share with those early works.
Motivated by these considerations, we present an overview of different solutions that we have developed for saliency prediction on images and video with DL, which now represent the state of the art on publicly available benchmarks. We compare the neural network model with the early computational models of saliency maps, to show similarities and differences. The main contribution of this work is a discussion on why the model of attention prediction with Deep Learning is useful. The paper will show that today's models, based on Convolutional Neural Networks (CNNs), share many of the principles of early models, while at the same time solving many of their drawbacks. Different convolutional architectures will be presented, to deal with features extracted at multiple levels, and to refine saliency maps in an iterative way. Finally, a solution for video saliency prediction will be discussed and analyzed in the case of driver attention estimation.

Saliency prediction on images
Early works on saliency prediction on images were based on the Feature Integration Theory proposed by Treisman et al. [32] in the 1980s. Itti et al. [18] then proposed the first computational saliency model: this work, inspired by Koch and Ullman [23], computed a set of individual topographical maps representing low-level cues such as color, intensity and orientation and combined them into a global saliency map. The saliency map is a scalar map, as large as the image, where each point represents the visual saliency, irrespective of the feature dimension that makes the location salient. The locus of highest activity in the saliency map is the most probable eye fixation point, i.e. the point where the focus of attention should be localized.
After this work, a large variety of methods explored the same idea of combining complementary low-level features [5,14] and often included additional center-surround cues [38]. Other methods enriched predictions exploiting semantic classifiers for detecting higher level concepts such as faces, people, cars and horizons [22].
In the last few years, thanks to the wide spread of deep learning techniques, the saliency prediction task has seen considerable improvement. The first attempts at predicting saliency with convolutional networks mainly suffered from the absence of fine-tuning of network parameters over a saliency prediction dataset and from the lack of a sufficient amount of data to train a deep saliency architecture [33,25]. The publication of the large-scale attention dataset SALICON [20] has contributed to great progress in deep saliency prediction models, and several new architectures have been proposed.
Huang et al. [16] introduced a deep neural network applied at two different image scales and trained by using evaluation metrics specific to the saliency prediction task as loss functions. Kruthiventi et al. [24] proposed a fully convolutional network called DeepFix that captures features at multiple scales and takes global context into account through the use of large receptive fields. Pan et al. [27] instead presented a shallow and a deep convnet, where the first is trained from scratch while some layers of the second are initialized with the parameters of a standard convolutional network. Finally, Jetley et al. [19] introduced a saliency model that formulates a map as a generalized Bernoulli distribution and used these maps to train a CNN with different loss functions.

Saliency prediction in video
When considering video inputs, saliency estimation is quite different with respect to still images. Indeed, motion is a key factor that strongly attracts human attention. Accordingly, some video saliency models pair bottom-up feature extraction with a further motion estimation step, which can be performed either by means of optical flow [39] or feature tracking [37]. Somewhat differently, other models have been proposed to enforce the coherence of bottom-up features across time. In this setting, previous works address feature extraction both in a supervised [30] and unsupervised [34] fashion, whereas temporal smoothness of output maps can be achieved through optical flow motion cues [39] or by explicitly conditioning the current map on information from previous frames [28].
As previously discussed for the image saliency setting, the representation capability of deep learning architectures, along with large labeled datasets, can yield better results. However, deep video saliency models are still lacking, the work in [4] being the only meaningful effort that can be found in the current literature. This model leverages a recurrent architecture that iteratively updates its hidden state over time, and emits the saliency map at each step by means of a Gaussian Mixture Model.

Saliency Prediction with Deep Learning Architectures
In this section we provide a detailed discussion of different deep learning architectures for saliency prediction on images and video. We will introduce a convolutional model for images, which incorporates low and high level visual features, and which, conceptually, extends the seminal work by Itti and Koch [18] by means of a modern neural network. A discussion on the similarities and differences between these two models will follow, and precede the presentation of a second model, in which a recurrent convolutional architecture is used to refine saliency maps in a way which is roughly similar to the human scanpath. Finally, we will present an architecture for saliency prediction on video, and show how this particular domain differs from that of images in the case of driver attention prediction.

Incorporating low-level and high-level cues in a Multi-Level Network
In [8], we proposed a Deep Multi-Level Network (ML-Net) for saliency prediction.
In contrast to previous proposals, in which saliency maps were predicted from a non-linear combination of features coming from the last convolutional layer of a CNN, we effectively combined feature maps coming from three different levels of a fully convolutional network, thus taking into account low, medium and high level cues. Moreover, to model the center bias present in human eye fixations, we incorporated a learned prior map by applying it to the predicted saliency map. Fig. 1 shows the overall architecture of our ML-Net model.
In more detail, the first component of our architecture is a CNN based on a standard convolutional network originally designed for image classification and then employed in several other computer vision tasks. This network, named VGG-16 [29], is composed of 13 convolutional layers, divided into 5 different blocks, and 3 fully connected layers. Since we aimed at producing a 2-dimensional map (i.e. the predicted saliency map), we removed the fully connected layers, thus obtaining a fully convolutional architecture. Several other deep saliency models [16,27,19,9] employ VGG-16 as the starting point for their architectures, and almost all of them combine feature maps coming only from the last convolutional layer of the VGG-16 network, differing from each other in the design of a specific saliency component or in the training strategy. In contrast to this approach, the second component of our model took as input feature maps coming from three different levels of the VGG-16 network: the outputs of the third, fourth and fifth convolutional blocks. Our model combined these feature maps through two specific convolutional layers that merge low, medium and high level features and then produce a temporary saliency map. Finally, we decided to incorporate an important property of human gaze in our model: when an observer looks at an image, their gaze is biased toward the center of the scene. To this end, the last component of our architecture was designed to model this center bias through a learned prior map, which was applied to the predicted saliency map, thus giving more importance to the center of the image.
It is well known that at training time a deep learning architecture has to minimize a given loss function that, in the saliency prediction task, aims at bringing the predicted saliency map close to the ground-truth one obtained from human fixation points. Previous deep saliency models were trained with different strategies, using either a saliency evaluation metric as loss function or, more commonly, a square error loss (such as the Euclidean loss). We instead designed a specific loss function inspired by three objectives. First, predicted saliency maps should be similar to ground-truth ones, therefore a square error loss was a reasonable choice. Second, predictions should be invariant to their maximum, and there is no point in forcing the network to produce values in a given numerical range, so predictions were normalized by their maximum. Third, the loss should give the same importance to high and low ground-truth values, even though the majority of ground-truth pixels are close to zero. For this reason, the deviation between predicted values and ground-truth values was weighted by a linear function, which tends to give more importance to pixels with high ground-truth fixation probability. In the resulting loss function, x_i are the predicted saliency maps while y_i are the ground-truth ones.
The proposed architecture was trained with mini-batches of N samples, using Stochastic Gradient Descent as the optimizer.
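The three objectives behind the loss can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not necessarily the exact loss of [8]: the weighting is assumed to divide the deviation by the linear function `alpha - y`, the value `alpha = 1.1` is a hypothetical choice, and the function name is made up for this example.

```python
import numpy as np

def weighted_normalized_mse(pred, gt, alpha=1.1):
    """Sketch of a max-normalized, ground-truth-weighted square error.

    pred, gt: arrays of shape (N, H, W); gt is assumed to lie in [0, 1].
    alpha:    hypothetical constant, must be larger than gt.max().
    """
    # Objective 2: make each prediction invariant to its own maximum.
    max_per_map = pred.max(axis=(1, 2), keepdims=True)
    pred_norm = pred / np.maximum(max_per_map, 1e-8)
    # Objective 3 (assumed form): dividing by the linear function (alpha - gt)
    # gives larger weight to pixels with high ground-truth probability.
    weight = 1.0 / (alpha - gt)
    # Objective 1: square error, summed per map and averaged over the mini-batch.
    per_map = np.sum(((pred_norm - gt) * weight) ** 2, axis=(1, 2))
    return per_map.mean()
```

With this weighting, an error of the same magnitude costs more on a pixel whose ground-truth value is high than on a background pixel, which counteracts the dominance of near-zero pixels in the ground truth.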

Deep Learning architectures vs. the Itti and Koch's model
The first computational model for saliency prediction, and probably the most famous, was presented in a seminal paper by Itti and Koch [18]. It proposed to extract multi-scale low-level features from the input image, which were linearly combined and then processed by a dynamic neural network with a winner-takes-all strategy to select attended locations in decreasing order of saliency. As we have shown in the previous section, nowadays saliency prediction is generally tackled via CNN architectures, therefore giving more importance to learning than to hand engineering of features. However, today's models share a lot with that influential work. The model in [18] extracted three kinds of features from input images: color (as a linear combination of raw pixels in color channels), intensity (again, computed as a linear combination of color channels), and orientation, by means of oriented Gabor pyramids [12]. It should be noted that all these features can be easily extracted by a single convolutional layer, and, indeed, visualization and inversion techniques [36] showed that filters learned in the early stages of a CNN roughly extract color and gradient features. Also, the linear combinations of color channels in [18] can be computed via a single convolutional layer with channel-wise uniform weights or with a 1 × 1 kernel.
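The equivalence between linear combinations of color channels and a 1 × 1 convolution can be made concrete with a short NumPy sketch. The intensity weights (uniform average of the three channels) follow the description above; the two opponent-color rows are simplified, illustrative weights and not the exact channels of [18], and `conv1x1` is a hypothetical helper name.

```python
import numpy as np

def conv1x1(image, weights):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.

    image:   array of shape (H, W, C_in)
    weights: array of shape (C_in, C_out)
    returns: array of shape (H, W, C_out)
    """
    return np.tensordot(image, weights, axes=([2], [0]))

# One column per output feature map.
weights = np.stack(
    [
        np.array([1 / 3, 1 / 3, 1 / 3]),  # intensity: uniform channel weights
        np.array([1.0, -1.0, 0.0]),       # red vs. green (illustrative only)
        np.array([-0.5, -0.5, 1.0]),      # blue vs. yellow (illustrative only)
    ],
    axis=1,
)

rng = np.random.default_rng(0)
image = rng.random((4, 5, 3))          # toy RGB image
features = conv1x1(image, weights)     # three hand-crafted feature maps
```

In a CNN the same operation is learned rather than fixed: the weight matrix becomes a trainable parameter of the first layers.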
One detail, however, is missing in current convolutional architectures: the authors of [18] extracted the same features at multiple scales, and then validated them by performing central differences between adjacent scales. In a CNN, instead, features are always computed at a single scale, even though the overall architecture extracts (different) features at different scales thanks to pooling stages. Of course, the multi-scale validation of features was also motivated by the need to extract robust features, something which comes almost for free in modern architectures. Moreover, many state-of-the-art CNN models are multi-scale by construction, feeding a pyramid of images to the same convolutional stack. Even in our model, we combine different features extracted at different scales to form the final prediction, instead of taking only those produced by the last layer.
Conversely, the most evident characteristic that the Itti and Koch model misses with respect to today's architectures is the ability to extract higher level features, and to detect objects and parts of objects. This is achieved, in today's networks, by increasing the depth of the network (e.g. 152 layers in the ResNet model [15]). This, given the big performance gap, clearly highlights the need for high-level features in saliency prediction.
Fig. 2: Qualitative comparisons between the Itti [18] and ML-Net [8] models. Columns: Image, Itti, ML-Net, Ground-truth.
As a proof of concept, in Table 1 we compare the results of the model in [18]¹ with those of our method. We use the standard performance indicators for saliency: the Similarity, the Linear Correlation Coefficient (CC), the Area Under the ROC Curve (AUC) and its shuffled version (sAUC), the Normalized Scanpath Saliency (NSS) and the Earth-Mover Distance (EMD). We refer the reader to the work by Bylinskii et al. [7] for a detailed discussion on these metrics. It can clearly be seen that CNNs outperform that early model by a big margin with respect to all metrics, and this experimentally confirms the need for high-level features in saliency prediction, rather than just employing low-level cues as in [18]. To give better insight into the performance gain, we also report some qualitative results on images randomly chosen from the SALICON dataset. We show them in Fig. 2, along with the ground-truth saliency map computed from human eye fixations. While the model of [18] tends to concentrate on color and gradient discontinuities, which often do not match the human fixation map, our model predicts most of the saliency maps in a way which is almost indistinguishable from the ground-truth. The middle image, showing a pizza, is also a good example of the role of the center prior: when there is no clear object which stands out in the scene, human eyes tend to fixate the center of the image, as our model has learned to do. Also, predictions from our ML-Net are particularly focused on small areas, similarly to the SALICON ground-truth. This is due to the fact that, in the absence of a task-driven attentive mechanism, the focus tends to be directed at what is known a priori, such as a person, a face or a traffic sign. The architecture, trained on similar data, does not overfit specific points, but tends to replicate the same semantic-based attentive behaviour.
¹ Numerical and qualitative results of the Itti-Koch model have been generated using the re-implementation of [14], which is also the one reported in the MIT Saliency Benchmark [6].
Fig. 3: Qualitative comparison between the Itti [18], ML-Net [8] and SAM [10] models on images taken from the SALICON dataset [20]. For the SAM model, we show predictions given by the recurrent attentive network at different steps.
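Two of the indicators listed above, NSS and Similarity, are simple enough to sketch in a few lines of NumPy. The sketch follows their commonly used definitions (see [7] for the reference formulations); the function names and the small stabilizing constant are choices made for this example, and it is not the evaluation code used to produce Table 1.

```python
import numpy as np

def nss(sal_map, fixations):
    """Normalized Scanpath Saliency.

    sal_map:   predicted saliency map, shape (H, W)
    fixations: binary map of the same shape, non-zero at fixated pixels
    """
    # Standardize the map to zero mean and unit standard deviation,
    # then average the standardized values at the fixated locations.
    z = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return z[fixations.astype(bool)].mean()

def similarity(sal_map, gt_map):
    """Similarity (histogram intersection) between two saliency maps."""
    # Treat both maps as distributions over pixels and sum the overlap.
    p = sal_map / sal_map.sum()
    q = gt_map / gt_map.sum()
    return np.minimum(p, q).sum()
```

Similarity ranges from 0 (disjoint maps) to 1 (identical distributions), while a positive NSS indicates that fixated locations receive above-average predicted saliency.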

Saliency map refinement via a convolutional recurrent architecture
Models for saliency prediction can also go beyond feed-forward neural networks and include recurrent components. Recurrent neural networks are usually employed to deal with time-varying input sequences, but can be used, in general, to process any kind of sequence. Following this intuition, we proposed a second model [10] in which we combined a fully convolutional network (similar to the one described in the previous sections) with a recurrent convolutional network, endowed with an attention mechanism. The recurrent network, instead of looping on a time sequence as in the case of video captioning [3], performs an iterative refinement of the saliency map by focusing on different parts of the image. This behaviour is encouraged by using a spatial attentive mechanism, inspired by the machine translation literature [2]. We called the overall architecture SAM, i.e. Saliency Attentive Model.
Figure 3 shows, for some images taken from the SALICON dataset, the prediction from the model of Itti and Koch [18], that from our previous model [8], and the output of the attentive network at each step, for t = 1, ..., 4, as well as the ground-truth map. As can be noticed, the refinement strategy carried out by the network results in a progressive improvement of the prediction, which surpasses the performance of a feed-forward neural network like the one in the ML-Net model.

Estimating task-driven saliency in videos
In [26], we described a model devised for predicting saliency on the DR(eye)VE dataset [1], and capable of replicating human attentional behavior while driving. The need for a different model tailored to this specific context is twofold: first, as anticipated, object motion in videos tends to capture human attention. Second, fixations recorded during the dataset acquisition in [1] are strongly related to the driving activity, and call for a task-driven model and training procedure.
Motivated by the insight that a small temporal window holds sufficient information for the task of driving, our model captures short-term correlations by means of 3D convolutions, which also stride along the time axis. Accordingly, it takes as input samples holding 16 consecutive frames (called clips from now on) and provides a dense saliency probability map for the last (current) frame of the clip. The network is jointly trained with two input streams (Fig. 4), in order to tackle the central bias that usually affects saliency benchmarks in general, and is even more noticeable in the driving task. Both streams rely on the same backbone encoder, which we name the COARSE module, as it provides a rough saliency estimate. This model is based on the work by Tran et al. [31] and employs their C3D architecture to map pixels into a 512-dimensional encoding space. Being interested in spatially coherent feature maps, we drop the top fully connected classification module. Moreover, we discard the deepest convolutional layer, whose encodings are strongly tailored to the original action recognition task, retaining only the most general features provided by previous layers. Finally, we modify the last pooling layer to cover the whole time axis, and therefore squeeze the temporal dimension out of the output features. The resulting map, which is reduced by a factor of 16 along each spatial dimension and lacks the temporal axis due to the pooling layers, is then processed to produce a saliency estimate as big as the original image and featuring a single probability channel. This is achieved by means of a series of upsampling operations followed by convolutions.
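The shape bookkeeping of the COARSE encoder can be traced with a short sketch. It relies on several assumptions that are not stated in the text: the pooling schedule is taken to be a C3D-style one (a first spatial-only pooling followed by 2 × 2 × 2 poolings), convolutions are assumed to be padded so that they preserve the shape, and the 112 × 112 input size is taken from the crop size used during training. Function names are hypothetical.

```python
def pool_shape(shape, kernel):
    """Output (T, H, W) of non-overlapping pooling with kernel (kt, kh, kw)."""
    return tuple(s // k for s, k in zip(shape, kernel))

def coarse_encoder_shape(clip_shape=(16, 112, 112)):
    """Trace the (T, H, W) shape of a clip through the assumed pooling stages."""
    shape = clip_shape
    # Assumed schedule: the first pooling leaves time untouched,
    # the following ones halve time and space together.
    for kernel in [(1, 2, 2), (2, 2, 2), (2, 2, 2)]:
        shape = pool_shape(shape, kernel)
    # Last pooling modified to cover the whole remaining time axis,
    # which squeezes out the temporal dimension.
    shape = pool_shape(shape, (shape[0], 2, 2))
    return shape

t, h, w = coarse_encoder_shape()
spatial_reduction = 112 // h  # 16 under the assumptions above
```

Under these assumptions a 16-frame clip collapses to a single time step and the spatial resolution drops by a factor of 16, which matches the reduction described in the text and explains why upsampling is needed to recover a full-size map.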
During training, the model is fed with two streams. The first stream encourages the model to learn saliency estimation from visual cues rather than from prior spatial bias, and feeds the COARSE model with random crops. Cropping is also employed in the original C3D training process. Indeed, in [31] the authors resize the input tensor to 128 × 128 and then take a random 112 × 112 crop. In our experience, this cropping policy is too mild, and yields models strongly biased towards the image center, since ground-truth maps still exhibit little variety. The policy we employ is more aggressive, and features a 256 × 256 resize before the crop. This way, samples cover a small portion of the input tensor and allow variety in prediction targets, at the cost of a wider attentional area. Intuitively, the smaller the crops are, the larger the attentional map will appear. Thus, the trained model was able to escape the bias when required, but unfortunately provided overly coarse estimates.
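The difference between the two cropping policies is easy to quantify. The sketch below, with hypothetical function names, draws a random 112 × 112 window from an already resized clip and computes the fraction of the resized frame that a crop covers.

```python
import numpy as np

def random_crop(clip, crop=112, rng=None):
    """Take the same random spatial window from every frame of a clip.

    clip: array of shape (T, H, W, C), already resized
    returns: (cropped clip of shape (T, crop, crop, C), (top, left) offsets)
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w, _ = clip.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    return clip[:, top:top + crop, left:left + crop, :], (top, left)

def crop_coverage(resize, crop=112):
    """Fraction of a resize x resize frame covered by a crop x crop window."""
    return (crop / resize) ** 2

mild = crop_coverage(128)        # 0.765625: the crop sees most of the frame
aggressive = crop_coverage(256)  # 0.19140625: the crop sees under a fifth
```

With a 128 × 128 resize every crop contains roughly 77% of the frame, so the image center is almost always inside the window; with a 256 × 256 resize the window covers about 19%, so the fixated region can fall anywhere in the crop or outside it. The returned offsets are there because the same window would have to be applied to the ground-truth map.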
To address this issue, we feed the COARSE model with a second stream providing images resized to match the crop size. The prediction, after being resized and concatenated with the last frame of the clip, then undergoes a further block of convolutional layers (FINE module) that refines the map. Estimates from both streams are modeled as a probability density P over pixels, and optimized jointly against a ground-truth map Q by means of the Kullback-Leibler divergence, where the summation index i spans the image pixels and ε is a regularization constant.

Evaluation
Here we discuss the experiments performed in order to assess the design choices of our architecture for video saliency. As is common in public benchmarks, we first compare our model against two central baselines. The first one represents the central bias as a Gaussian N(µ, Σ), µ being the image center and Σ a diagonal covariance matrix whose variances are coherent with the image aspect ratio. A more precise, task-driven baseline is obtained by averaging all training ground-truth maps, and two unsupervised state-of-the-art video saliency models [35,34] are also included in the comparison. The evaluation has been carried out by comparing the shift between predicted and ground-truth maps both in terms of Pearson's correlation coefficient (CC) and Kullback-Leibler divergence (D_KL). We report these measures evaluated both on the whole test set and on the attentive subsequences only in Tab. 2. Moreover, we report the results of the ML-Net model, which was originally proposed for image saliency and has been trained from scratch on the DR(eye)VE dataset. Several conclusions can be drawn from this evaluation. Firstly, the poor performance of unsupervised models reveals the peculiar nature of the driving context, which demands task-driven supervision. Moreover, it can be noticed that the attentive subset of samples is crucial for the evaluation, as simple input-agnostic baselines perform well overall. Finally, an important remark comes from the superior performance of the proposed model with respect to ML-Net.
The gap in performance is due to the temporal nature of video data: indeed, COARSE+FINE learned to extract temporal features that are meaningful for video saliency prediction, whereas the design of ML-Net cannot capture such dependencies. A qualitative illustration of the difference in predictions is shown in Fig. 5.
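The Gaussian central baseline and the two measures used in this evaluation can be sketched as follows. Several details are assumptions made for illustration: the standard deviations are set to a hypothetical fraction `sigma_frac` of each image side (so that they follow the aspect ratio), the direction of the divergence (ground truth Q against prediction P) and the placement of the regularization constant `eps` inside the logarithm are one common convention and may differ from the formulation used in [26].

```python
import numpy as np

def central_gaussian(h, w, sigma_frac=0.25):
    """Input-agnostic baseline: a Gaussian centered on the image.

    sigma_frac is a hypothetical parameter: the standard deviation along
    each axis is sigma_frac times the image size along that axis.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    mu_y, mu_x = (h - 1) / 2, (w - 1) / 2
    sig_y, sig_x = sigma_frac * h, sigma_frac * w
    g = np.exp(-0.5 * (((ys - mu_y) / sig_y) ** 2 + ((xs - mu_x) / sig_x) ** 2))
    return g / g.sum()  # normalize to a distribution over pixels

def cc(p, q):
    """Pearson's correlation coefficient between two maps."""
    return np.corrcoef(p.ravel(), q.ravel())[0, 1]

def kl_div(p, q, eps=1e-8):
    """Regularized KL divergence of ground truth q from prediction p.

    The sum runs over pixels; eps avoids division by zero and log(0).
    """
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(q * np.log(eps + q / (p + eps)))

baseline = central_gaussian(9, 16)
uniform = np.full((9, 16), 1 / (9 * 16))
```

Higher CC and lower divergence indicate a better match; for instance, `kl_div(uniform, baseline)` is positive, while the divergence of a map from itself is approximately zero.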

Conclusions
In this work we presented different deep learning architectures for saliency prediction on images and video, showing the importance of multi-level features and the ability of recurrent architectures to enhance saliency prediction results. We also showed, with experiments on a driving dataset, that dealing with video sequences requires ad-hoc architectures due to the need to extract motion features. The comparison between today's models and the early model by Itti and Koch [18] revealed several similarities in the way features are extracted, and explained the gap in performance with current models, which is not merely due to their brute-force nature, but also to their ability to recall very closely early saliency and biological models, although improved with the semantics learned from the ground-truth.

Fig. 4: Illustration of the COARSE+FINE model depicting both streams guiding the optimization during training. Note that at test time the cropped stream is not used. At the bottom, the architecture of the COARSE module is illustrated.

Fig. 5: Representation of differences in the video saliency estimation. This qualitative assessment indicates the suitability of the COARSE+FINE model in encoding temporal information. On the other hand, the ML-Net model processes still images and is more influenced by low-level non-temporal features.

Table 2: Evaluation of the proposed models against central baselines, both on the test and attentive sequences of DR(eye)VE.