Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild

In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third one consists of a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% on the validation and the testing data, respectively.


INTRODUCTION
Emotion recognition is an active research topic in the affective computing community. During the last decade, emotion recognition systems have been integrated in a number of applications across a growing number of domains such as cognitive science [31], clinical diagnosis [15], entertainment [38] and human-machine interaction [3]. Automatic emotion analysis and recognition in real-world videos (i.e. in the wild) is nevertheless still an open challenge in computer vision. One fundamental limiting factor is that there is almost no large dataset with real-world facial expressions available for emotion recognition. Other challenging factors include head pose variation, complex facial expression variations, different illumination conditions and face occlusion.
Recent achievements in the field are based on the use of data coming from multiple modalities, such as facial and vocal expressions. Indeed, each modality presents very distinct properties, and combining them helps to learn useful and complementary representations of the data. Still, representing and fusing different modalities in an appropriate and efficient manner is an open research question.
The extraction of visual cues for emotion recognition has received a great deal of attention in the past decade. Recently, with the rapid growth of Convolutional Neural Networks (CNNs), extracting visual features from video frames has been investigated in many emotion recognition tasks, and various pretrained face models have been made available [34,36,40]. However, those models are not directly suitable for video due to the lack of temporal information and to the variation of emotion expression patterns across individuals. To deal with this issue, 3D versions of CNNs have recently been proposed [43].
Audio information also plays an important role in emotion recognition in video. Most multimodal approaches have used hand-crafted audio features such as the Mel-Frequency Cepstral Coefficients (MFCC) or spectrograms, with either traditional [33,42] or deep [48] classifiers. However, those audio features are very low level and are not designed for video analysis.
In this paper, we propose a deep multimodal architecture for emotion recognition. Visual and temporal information are represented using a hybrid 2D-3D CNN approach, whereas the audio information is extracted using a deep CNN that has been trained by transferring knowledge from vision to sound [2]. To the best of our knowledge, learned deep audio features have not yet been investigated in the context of multimodal emotion recognition. The remainder of the paper is organized as follows: Section 2 presents related work, Section 3 describes the proposed multimodal emotion recognition architecture, Section 4 presents experiments and results, and finally Section 5 concludes the work and gives some future directions.

RELATED WORK
Emotions are displayed in video by visual and vocal means. Visual information is related to the dynamic patterns of the face, while the vocal information relies on audio signals. Recently, several deep audio-visual emotion recognition approaches have been proposed.
In this section, we briefly review the related work on emotion recognition in videos, covering the deep learning representations of appearance, temporal and audio information and the related multimodal fusion schemes.
Spatio-temporal evolution of facial features is one of the strongest cues for emotion recognition. Prior works using Deep Neural Networks (DNNs) for emotion recognition in video have mainly relied on temporal averaging and pooling strategies [5,24]. More recently, we note an increase in the use of temporal neural networks such as Recurrent Neural Networks (RNNs) to quantify the visual motion. Several previous works trained temporal neural network models on visual hand-crafted features [17,35]. Few works have considered combining CNNs with RNNs [11,26]. For instance, in [11], the authors combine an RNN with a CNN to model the facial expression dynamics in video. They suggest that integrating temporal information improves classification results. In similar works [6,17,29], the authors use Long Short-Term Memory (LSTM) cells to aggregate CNN features over time. Other recent works model the motion information using 3D convolutional networks (C3D) [12,48].
Regarding the audio information, deep learning-based approaches have recently attracted increasing attention among the computer vision community. Classical approaches rely on extracting hand-crafted audio features and applying a DNN classifier on those features. For instance, [29] and [14] investigate the use of deep learning approaches for emotional speech recognition. [14] train a DNN with MFCC features to classify emotions into 6 classes. In [47], the authors extract Mel-spectrogram features from audio signals for each video segment to classify emotions using a DNN. In [10], the authors train an LSTM on an acoustic parameter set for affective computing. In [4], the authors investigate the emotional impact of movie genre to predict media interestingness. The latter work uses SoundNet and VGG features for genre recognition. Few works have proposed to learn a deep audio model from scratch, and most of them are dedicated to specific tasks such as speech recognition [18,21,30]. One of the main challenges in building deep audio models is the lack of labeled sound data. For instance, in [41], the authors present a new deep architecture with a data augmentation strategy to learn a model for audio event recognition. The authors claim that combining visual features with deep audio features leads to a significant performance improvement in action recognition and video highlight detection compared to either the use of visual features alone or the fusion with MFCC features.
Multimodal data fusion remains an important challenge in emotion recognition systems. Previous works in multimodal emotion recognition using deep learning assume independence of the different modalities, performing either early fusion (feature-level fusion) [7] or late fusion (decision-level fusion) [10,11,24,44]. Fan et al. [12] combine an RNN and a C3D network in a late-fusion fashion. The CNN-RNN, C3D and audio SVM models were trained separately and their prediction scores were combined into the final score. Kaya et al. [25] combine audio-visual data with least squares regression based classifiers and a weighted late fusion scheme. Recent works investigate the use of DNNs to fuse multimodal information. One advantage of DNNs is their capability to jointly learn feature representations and appropriate classifiers [27]. Some fusion methods based on fully connected layers have been suggested to improve video classification by capturing the mutual correlation among different modalities. For example, in [47], a fusion network is trained to obtain a joint audio-visual feature representation.

PROPOSED METHOD
To deal with the multimodal and temporal nature of the emotion recognition task, we build a network which is able to jointly extract static and dynamic features from different modalities, and to address the temporal evolution of the video.
Our architecture, as illustrated in Figure 1, is composed of three network branches, where the first and second ones are explicitly designed to deal with the visual features from the video, while the third one processes the audio of the input video clip. In particular, the first branch is a 2D CNN that processes single frames, the second one is a C3D network that processes short frame snippets, and the third one is a 1D CNN that processes audio snippets.
Since each branch can process only one frame or a short sequence at a time, a temporal fusion strategy is devised to deal with videos of varying length and to exploit temporal dependencies. In particular, the features of the first two branches are combined in the temporal dimension using a NetVLAD layer [1], which extends the VLAD [23] aggregation technique by learning its cluster centers. In the audio branch, instead, we make use of an LSTM network to learn the temporal dependencies between consecutive audio snippets, and represent them with the last hidden state of the network. Features coming from the three branches, once aggregated over time, are finally concatenated and fed to a multimodal network which is in charge of combining the visual, the motion, and the audio information.

Data Preprocessing
Video clips from emotion recognition datasets are usually collected from classic movies and TV reality shows, so most frames contain irrelevant or misleading information, like background objects and background motion. Therefore, it is beneficial to pre-process the original video frames in order to limit this effect. Specifically, we extract all faces from each frame of the input video clip and retain only the face bounding boxes, discarding the rest of the frame. For the face detection and extraction phase, we use a cascaded convolutional neural network [46] in which faces are detected by means of a multitask convolutional network which jointly detects facial landmarks and predicts the face bounding box. If more than one face is detected, we only use the crop of the biggest one as input to the model. Some examples of the pre-processing performed on video frames are shown in Figure 2. Furthermore, we follow the work of Aytar et al. [2] to extract and pre-process the audio information. Hence, we sample the audio from the video clips at a frequency of 22050 Hz and save every clip in mp3 format, single channel. Then, the waveform of every sample is scaled to be in the range [−256, 256].
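The two selection and scaling rules described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the `(x1, y1, x2, y2)` box layout are hypothetical, and since the exact scaling rule is not given in the text, the sketch assumes simple peak normalization so that the largest absolute sample maps to 256.

```python
def largest_face(boxes):
    """Return the bounding box (x1, y1, x2, y2) with the largest area, or None.

    The box layout is an assumption made for this sketch.
    """
    if not boxes:
        return None
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def scale_waveform(samples, target=256.0):
    """Rescale samples linearly so the largest absolute value equals `target`.

    Peak normalization is an assumption; the text only states the final range.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # A silent (or empty) clip stays silent.
        return [0.0 for _ in samples]
    return [s * target / peak for s in samples]


if __name__ == "__main__":
    print(largest_face([(0, 0, 10, 10), (5, 5, 40, 30)]))
    print(scale_waveform([0.5, -1.0, 0.25]))
```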

Hybrid Deep Visual Features Extraction
In order to capture visual and motion features, we design two different branches: the first one, based on a recent version of the popular FaceNet architecture [36], captures a set of visual features representing the face, while the second one, built upon the C3D network [43], jointly captures visual and motion information.
CNN Branch. In this branch, the Inception-ResNet v1 [40] network, trained as proposed in [36], is used as feature extractor. The network takes a color image of size 160 × 160 as input. The output of the fifth Inception-resnet-C block (of size 3 × 3 × 1792) is then used as input to a small neural network composed of a convolutional layer with 256 filters of size 3 × 3, a fully connected layer with 256 units and a softmax layer of 7 classes. This network is trained using every extracted face from the challenge dataset and every image of the FER-2013 dataset [13] (more details regarding the datasets are available in Section 4.1). Images are preprocessed according to the chosen open-source implementation of the network. Note that only the last convolutional and fully connected layers are trained from scratch.
C3D Branch. In this branch, the C3D network [43] is used as feature extractor. The network takes 16 frames of size 112 × 112 × 3 as input. The output of the Pool5 block (of size 4 × 4 × 512) is given as input to a small neural network composed of a max pooling layer of size 2 × 2 and stride 2, a convolutional layer with 256 filters of size 2 × 2, a fully connected layer with 256 units and a softmax layer of 7 classes. The network is trained using the challenge dataset only. Slices of 16 extracted faces are used as input to the network, and only the last convolutional and fully connected layers are trained from scratch.

Deep Acoustic Features Extraction
In the audio branch, the SoundNet network [2] is used as feature extractor. The output of the conv4 block (of size 22 × 128) is given as input to a network defined as follows. The first layer is a 1D convolutional layer with 512 filters of size 4, applied with a stride of 4. The size of the output is 6 × 512. The six feature vectors of size 512 can be seen as a compression of the audio temporal input; therefore, they still contain temporal information. Based on that, the six feature vectors are given as input to an LSTM [20] layer with two levels and 128 hidden units. This layer is followed by a softmax layer of 7 classes. The network is trained using the challenge dataset and part of the eNTERFACE dataset [32] (more details regarding the datasets are available in Section 4.1). Raw audio waveform sequences extracted from the videos are used as input to the network. Also in this case, only the last convolutional, LSTM, and softmax layers are trained from scratch.
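The temporal size of the convolution output can be checked with a short calculation. The padding mode is not stated in the text, so the sketch below compares two common conventions; the helper name `conv1d_output_length` is hypothetical. With 22 input steps, a kernel of 4 and a stride of 4, the six output vectors reported above are consistent with "same"-style padding (output length equal to the input length divided by the stride, rounded up), whereas unpadded ("valid") convolution would give five.

```python
import math


def conv1d_output_length(length, kernel, stride, padding):
    """Number of output steps of a 1D convolution along the temporal axis.

    "valid": no padding, only full windows are used.
    "same":  input is padded so that the output has ceil(length / stride) steps.
    """
    if padding == "valid":
        return (length - kernel) // stride + 1
    if padding == "same":
        return math.ceil(length / stride)
    raise ValueError(f"unknown padding mode: {padding!r}")


if __name__ == "__main__":
    # conv4 output of SoundNet as described in the text: 22 steps x 128 channels.
    for mode in ("valid", "same"):
        steps = conv1d_output_length(22, kernel=4, stride=4, padding=mode)
        print(f"{mode}: {steps} x 512")
```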

Temporal Aggregation and Multimodal Fusion
The fusion network has a double purpose: to combine the temporal information of the visual and motion features and to fuse the multiple modalities. In order to combine features extracted at different timesteps, the CNN branch and the C3D branch are each followed by a NetVLAD layer [1]. Given a set of D-dimensional features {x_i}, the layer learns K cluster centers {c_k} in the same space as the features and produces an aggregated description of the set with size K × D, through the sum of residuals with respect to the cluster centers. Formally, the k-th row of the aggregated description is given by

$$V_k = \sum_i \delta(x_i, c_k)\,(x_i - c_k), \tag{1}$$

where δ(x_i, c_k) denotes the degree of membership of descriptor x_i to cluster c_k. The resulting matrix is then column-wise L2-normalized, flattened and then L2-normalized again. Since a hard assignment of features to clusters would be non-differentiable, the NetVLAD layer employs a soft-assignment variant, in which

$$\delta(x_i, c_k) = \frac{e^{-\alpha \lVert x_i - c_k \rVert^2}}{\sum_{j} e^{-\alpha \lVert x_i - c_j \rVert^2}}, \tag{2}$$

where α controls the decay of the response with the magnitude of the distance. In practice, the learnable cluster centers c_k are decoupled into two sets of convolutional parameters, so that the layer can be implemented via the composition of a convolutional layer, a softmax activation and the final L2 normalizations.
In the proposed architecture, the NetVLAD layer on top of the CNN branch takes the features extracted from 48 frames and outputs an aggregated representation composed of 8 visual feature vectors corresponding to 8 different clusters. The feature vectors are then flattened and followed by a fully connected layer with 128 units to reduce the output dimension. In the C3D branch, instead, the NetVLAD layer takes 32 motion features and outputs an aggregated representation of 8 motion feature vectors corresponding to 8 different clusters. As before, the feature vectors are flattened and followed by a fully connected layer with 128 units.
Regarding the audio information, we take only one second in the middle of the video, since we found that this amount of data is sufficient to perform a good classification without over-fitting the training data. Then, the outputs of the two fully connected layers of the CNN and C3D branches and the output of the LSTM layer of the audio branch are concatenated, forming a 384-dimensional feature vector. The obtained feature vector is followed by a fully connected layer with 128 units and a softmax layer with 7 classes. The class with the highest softmax output is our classification of the video.
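To make the aggregation concrete, here is a minimal plain-Python sketch of soft-assignment VLAD pooling with fixed cluster centers. It is not the trainable layer: in NetVLAD the centers and assignment parameters are learned, whereas here they are passed in as constants. The function names are hypothetical, and the sketch interprets the first normalization step as normalizing each cluster's residual vector separately (the intra-normalization of the original NetVLAD formulation), which is an assumption about the intended axis.

```python
import math


def soft_assign(x, centers, alpha):
    """Softmax over negative scaled squared distances from x to every center."""
    d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    smallest = min(d2)  # subtracted for numerical stability; the ratios are unchanged
    weights = [math.exp(-alpha * (d - smallest)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def l2_normalize(vector, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def netvlad_aggregate(features, centers, alpha=1.0):
    """Aggregate a variable-length list of D-dim features into a K*D descriptor."""
    k_clusters, dim = len(centers), len(centers[0])
    vlad = [[0.0] * dim for _ in range(k_clusters)]
    for x in features:
        membership = soft_assign(x, centers, alpha)
        for k in range(k_clusters):
            for d in range(dim):
                # Weighted residual of x with respect to cluster center k.
                vlad[k][d] += membership[k] * (x[d] - centers[k][d])
    vlad = [l2_normalize(row) for row in vlad]   # per-cluster normalization
    flat = [v for row in vlad for v in row]      # flatten to K*D values
    return l2_normalize(flat)                    # final global normalization


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    cents = [[0.0, 0.0], [1.0, 1.0]]
    print(netvlad_aggregate(feats, cents))
```

Because the output size depends only on K and D, the same descriptor length is obtained for any number of input features, which is what allows the layer to pool over a varying number of frames.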
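The final fusion step amounts to a concatenation followed by a small classifier. The sketch below illustrates only the data flow (3 × 128 = 384 inputs, then an arg-max over seven class probabilities); the trained fully connected layers are omitted, the function names are hypothetical, and the ordering of the emotion labels is an assumption made for the example.

```python
import math

# Label order is an assumption for this sketch; the text does not fix class indices.
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]


def concatenate_branches(cnn_feat, c3d_feat, audio_feat):
    """Concatenate the three 128-dimensional branch outputs into one 384-dim vector."""
    for name, feat in (("cnn", cnn_feat), ("c3d", c3d_feat), ("audio", audio_feat)):
        if len(feat) != 128:
            raise ValueError(f"{name} feature must have 128 values, got {len(feat)}")
    return list(cnn_feat) + list(c3d_feat) + list(audio_feat)


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_emotion(class_scores):
    """Return the label with the highest softmax probability and that probability."""
    probs = softmax(class_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


if __name__ == "__main__":
    fused = concatenate_branches([0.0] * 128, [1.0] * 128, [2.0] * 128)
    print(len(fused))
    print(predict_emotion([0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0]))
```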

Training
The training process consists of two phases. During the first phase, the last layers of the three branches of our architecture are trained separately. Then, the multimodal fusion network is trained using the trained branches, without the softmax layer, as feature extractors. This approach allows us to use additional datasets during the training of the single branches, obtaining feature extractors that are more robust and generalize better.
The categorical cross-entropy loss function on the seven classes of the challenge dataset is used to train all the networks of the architecture except the multimodal fusion network of Submission 4. The fusion network related to this last submission is trained using a weighted version of the categorical cross-entropy loss function. In order to increase the importance of the most frequent classes and reduce the importance of the less frequent ones, the standard loss value is multiplied by a regularizing parameter based on the distribution of the seven classes in the training set. Specifically, an exponential function is sampled following the class distribution of the training set to obtain the regularizing parameters.
The standard and the weighted loss functions are as follows:

$$L = -\frac{1}{N} \sum_{i=1}^{N} t_i \cdot \log p_i, \tag{3}$$

$$L_w = -\frac{1}{N} \sum_{i=1}^{N} \lambda_{c_i}\, t_i \cdot \log p_i, \tag{4}$$

where N is the number of examples in the batch, t_i is the target probability vector of sample i (i.e. a one-hot vector), p_i is the vector containing the predicted probabilities for sample i, c_i is the ground truth class of sample i, and λ_k is the regularizing parameter for the class k.
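As an illustration of the two losses, the following plain-Python sketch computes the standard and the class-weighted categorical cross-entropy for a small batch. Averaging over the batch is an assumption of the sketch, and the weight values in the example are made up: the text states only that the per-class parameters are derived from the training class distribution through an exponential function, without giving their values.

```python
import math


def cross_entropy(targets, probs, eps=1e-12):
    """Mean categorical cross-entropy over a batch of one-hot targets."""
    total = 0.0
    for t_i, p_i in zip(targets, probs):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_i, p_i))
    return total / len(targets)


def weighted_cross_entropy(targets, probs, class_weights, eps=1e-12):
    """Same loss, with each sample scaled by the weight of its ground-truth class."""
    total = 0.0
    for t_i, p_i in zip(targets, probs):
        true_class = t_i.index(max(t_i))  # index of the 1 in the one-hot vector
        sample_loss = -sum(t * math.log(max(p, eps)) for t, p in zip(t_i, p_i))
        total += class_weights[true_class] * sample_loss
    return total / len(targets)


if __name__ == "__main__":
    targets = [[1, 0], [0, 1]]
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(cross_entropy(targets, probs))
    # Hypothetical weights: class 0 counts twice as much as class 1.
    print(weighted_cross_entropy(targets, probs, class_weights=[2.0, 1.0]))
```

With all weights equal to 1 the weighted loss reduces to the standard one, which is a quick sanity check for any implementation.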

EXPERIMENTS AND RESULTS
In this section, we first describe the datasets used during the experiments. Then, we detail the implementation of the proposed model. Finally, we report and discuss the results achieved on the validation and the testing data.

Datasets
We trained our networks with different emotion datasets, evaluating them on the challenge dataset only.
Acted Facial Expressions in the Wild (AFEW). The Acted Facial Expressions in the Wild (AFEW) dataset [9] (2017 edition) is the dataset of the Emotion Recognition in the Wild (EmotiW) challenge. It is composed of 1809 video clips extracted from movies and, since 2016, TV series. Every clip is annotated with one of seven emotions (Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral), but only the annotations of the training and validation sets are publicly available. Some statistics about the dataset are available in Table 1. In Figure 2, some frames from the dataset and the corresponding cropped faces are shown. The dataset is used for both training and evaluation of all the branches and the multimodal fusion.

Facial Expression Recognition 2013 (FER-2013).
The Facial Expression Recognition 2013 (FER-2013) dataset [13] was created for the Facial Expression Recognition Challenge. It contains 35,887 grayscale images crawled from the web and annotated with one of seven emotions (Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral). These additional images increase the accuracy of the CNN branch when used during training.

eNTERFACE. The eNTERFACE dataset [32] consists of 1166 video clips annotated with one of the six basic emotions (Angry, Disgust, Fear, Happy, Sad and Surprise). The clips are recorded in constrained environments and contain both audio and video data. This dataset is used during the training of the Audio branch of our architecture, which reduces over-fitting.

Implementation Details
Detected faces are pre-processed and resized to comply with the expected input of the CNN and C3D networks, 160 × 160 and 112 × 112 respectively. Data augmentation techniques are applied during training in order to reduce over-fitting and increase the generalization capabilities of our architecture. Random flip, crop, and zoom are applied to the visual input, while the audio is used as is, except that a random slice of 1 second is selected every time.
Furthermore, batch normalization [22] is applied before every activation of the trained layers, in conjunction with dropout [19,39] between fully connected layers. Dropout is also applied on the LSTM block of the Audio branch, following [45] and [37]. The related keep probabilities range between 0.1 and 0.8, depending on the position and the network where dropout is applied. Low keep probability values help to reduce over-fitting despite the small amount of training data.
The parameters of the layers for which we did not use pre-trained weights are initialized as proposed in [16]. The Adam optimizer [28] is used during the training of every network of our architecture with β_1 = 0.9, β_2 = 0.999, and ϵ = 10^−8. The learning rate varies depending on the network, due to the considerable differences between the branches and the fusion network. For the CNN, C3D, Audio, and fusion networks, the learning rates are 0.0001, 0.001, 0.001, and 0.0005, respectively. In all the networks, the target batch size is 128, but we are forced to drastically reduce it in the C3D and fusion networks due to memory limits.
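The audio slicing and the random flip can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, images are represented as nested lists of pixel rows to keep the example dependency-free, and returning clips shorter than one second unchanged is an assumption (the text does not say how such clips are handled). Crop and zoom are omitted.

```python
import random

SAMPLE_RATE = 22050  # Hz, the audio sampling frequency stated in the text


def random_one_second_slice(waveform, rng=random, sample_rate=SAMPLE_RATE):
    """Pick a random contiguous window of `sample_rate` samples (one second)."""
    if len(waveform) <= sample_rate:
        # Assumption: clips shorter than one second are returned unchanged.
        return list(waveform)
    start = rng.randrange(len(waveform) - sample_rate + 1)
    return list(waveform[start:start + sample_rate])


def random_horizontal_flip(image_rows, rng=random, probability=0.5):
    """Mirror every pixel row with the given probability."""
    if rng.random() < probability:
        return [row[::-1] for row in image_rows]
    return [list(row) for row in image_rows]


if __name__ == "__main__":
    generator = random.Random(0)  # seeded only to make the demo repeatable
    clip = list(range(3 * SAMPLE_RATE))
    window = random_one_second_slice(clip, generator)
    print(len(window), window[0], window[-1])
    print(random_horizontal_flip([[1, 2, 3], [4, 5, 6]], generator))
```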

Results
In this section, we present the achieved results using our different approaches on the challenge validation and test sets.

Results on Validation Set.
To validate the performance of our models, we first conduct a set of experiments on the validation set. Table 2 presents the best results achieved for the single branches and the multimodal fusion approaches. For the CNN and C3D branches, the accuracy is reported with respect to both every single frame and the whole video. In the latter case, the prediction for every video is obtained by averaging the predictions of its frames. As one can note, the multimodal fusion gives an absolute accuracy gain of nearly 6% with respect to the best single branch. Both the temporal combination on the visual branches and the multimodal fusion of the three branches contribute to the accuracy improvement. The corresponding confusion matrix is shown in Figure 3a.
It is worth noting that the proposed architecture is able to classify almost every emotion with good accuracy on the validation set. The only exception is the class Fear, which is mainly confused with Angry, Neutral, and Sad. Interestingly, while analyzing the results on the validation data, we observed that the CNN branch classifies nearly every emotion with acceptable accuracy, whereas the C3D branch fails to classify the classes Disgust, Fear, and Surprise in most cases and the Audio branch never correctly classifies the classes Disgust and Surprise.
Additional experiments were made on the validation set to investigate the fine-tuning of the pre-trained networks (FaceNet, C3D, SoundNet), but they resulted in early over-fitting of the models. Indeed, over-fitting has been a major issue in most of the experiments performed in this work. We think this is mainly caused by the limited size of the available datasets for multimodal emotion recognition. Furthermore, most of the available datasets contain video clips recorded in constrained environments in which a subject acts out an emotion. As a result, the expressed emotions are not natural and audio information is rarely available.
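The video-level prediction used for the single visual branches, obtained by averaging the per-frame predictions, can be sketched as below. The function name is hypothetical, and each frame prediction is assumed to be a list of class probabilities of equal length.

```python
def video_prediction(frame_probabilities):
    """Average per-frame class probabilities and return (class index, mean vector)."""
    if not frame_probabilities:
        raise ValueError("at least one frame prediction is required")
    count = len(frame_probabilities)
    # zip(*...) groups the probabilities of the same class across all frames.
    mean = [sum(per_class) / count for per_class in zip(*frame_probabilities)]
    best = max(range(len(mean)), key=mean.__getitem__)
    return best, mean


if __name__ == "__main__":
    frames = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
    print(video_prediction(frames))
```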

Challenge Submissions: Results on Test Set.
In order to evaluate the performance of our different approaches on the challenge test set, we submitted 4 runs to the EmotiW 2017 challenge. In this paper, we present only the three best submissions (1, 3 and 4). In particular, submission 1 corresponds to a preliminary version of our architecture: it contains the CNN branch and the Audio branch only, and it is trained with the standard categorical cross-entropy loss function (Eq. (3)). In contrast, both submission 3 and submission 4 correspond to the architecture described in Section 3. The standard categorical cross-entropy loss function (Eq. (3)) is used in the first case, while its weighted version (Eq. (4)) is used in the second one.
To increase the training data while keeping a stopping condition, the validation set of the AFEW dataset was split into five folds and the models were trained five times, following the k-fold cross validation technique. The folds were created so that the train and the validation folds are subject-independent. Submission 2 attempted to keep emotion balance instead of subject independence while training the same architecture as submission 1, but it obtained poor results, which are therefore not reported. The results of our submissions are presented in Table 3. The second column of the table contains the averaged accuracy on the five validation folds, while the third one contains the results of our submissions on the test set.
Looking at Figure 3 and Table 3, it can be noticed that the multimodal fusion network trained with the weighted loss (Submission 4) performs better on the test set, while the model trained with the standard loss performs better on the five validation folds. This counter-intuitive behaviour is presumably attributable to the different class distribution of the test set of the AFEW dataset compared to the training and validation sets of the same dataset. Moreover, the test set contains video clips extracted from TV series (since 2016), while the training and the validation sets do not. We think that these are the reasons for the discrepancy between our results on the validation set and on the test set. As shown in Table 4, our deep learning-based architecture outperforms the challenge baseline [8] by a clear margin on both the validation and the test set. In particular, our best submission reaches an accuracy of 49.92%, corresponding to an absolute improvement of 9.45% with respect to the challenge baseline.
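One simple way to build subject-independent folds is to assign whole subjects, rather than individual clips, to folds. The sketch below does this greedily, placing the subjects with the most clips first into the currently smallest fold. It is only an illustration: the text does not describe how the folds were actually constructed, and the function name, the `clip -> subject` mapping and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def subject_independent_folds(clip_subjects, k=5):
    """Split clips into k folds so that no subject appears in more than one fold.

    clip_subjects: dict mapping a clip identifier to a subject identifier.
    Returns a list of k lists of clip identifiers.
    """
    clips_by_subject = defaultdict(list)
    for clip, subject in clip_subjects.items():
        clips_by_subject[subject].append(clip)

    folds = [[] for _ in range(k)]
    # Largest subjects first, ties broken by name so the result is deterministic.
    ordered = sorted(clips_by_subject, key=lambda s: (-len(clips_by_subject[s]), str(s)))
    for subject in ordered:
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(clips_by_subject[subject])
    return folds


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    clips = {f"clip{i}": f"subject{i % 7}" for i in range(30)}
    for index, fold in enumerate(subject_independent_folds(clips, k=5)):
        print(index, len(fold), sorted({clips[c] for c in fold}))
```

Each fold can then serve once as the held-out set while the remaining four are added to the training data, which matches the five training runs mentioned above.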

CONCLUSION
We proposed a multimodal deep learning framework for emotion recognition in video that participated in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our approach combines visual, temporal and audio information using neural network-based architectures only. Notwithstanding the small amount of labelled data for emotion recognition in video, the proposed method outperforms the challenge baselines of 38.81% and 40.47%, obtaining an accuracy of 50.39% and 49.92% on the validation and the test dataset, respectively. In the future, making use of larger annotated datasets, we are planning to train the entire architecture in one step and to fine-tune the pre-trained audio and visual models to make the extracted features more domain specific.

Figure 1: Overview of the proposed architecture.

Figure 2: Some examples of cropped faces extracted from input video frames.

Table 2: Experimental results on the AFEW validation set.

Table 3: Experimental results of our three best submissions.

Table 4: Proposed method accuracy compared to the challenge baseline.