Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild

Pini, Stefano; Ben Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit

doi:10.1145/3136755.3143006

In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), respectively, while the third consists of a pretrained audio network used to extract deep acoustic signals from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks to capture the temporal evolution of the audio features. To identify and exploit possible relationships among the modalities, we propose a fusion network that merges their cues into a single representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% on the validation data and 49.92% on the test data.
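The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the simplest possible fusion (concatenating one feature vector per modality and applying a single linear layer with softmax), uses random weights in place of trained parameters, and picks hypothetical feature sizes. The seven labels are the emotion categories commonly used in the challenge's audio-video track; their order here is arbitrary.

```python
import math
import random

# Emotion categories commonly used in the EmotiW audio-video track.
# The ordering is an arbitrary assumption made for this sketch.
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]


def softmax(scores):
    """Turn raw scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(n_in, n_out, rng):
    """Random weights standing in for trained fusion parameters (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b to a plain Python list."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse(static_feat, motion_feat, audio_feat, layer):
    """Concatenate per-modality features and classify the joint vector."""
    joint = static_feat + motion_feat + audio_feat  # list concatenation
    weights, bias = layer
    return softmax(linear(joint, weights, bias))


rng = random.Random(0)

# Hypothetical feature vectors; in the paper these would come from the
# 2D CNN, the 3D CNN, and the audio network followed by an LSTM.
static_feat = [rng.random() for _ in range(8)]
motion_feat = [rng.random() for _ in range(6)]
audio_feat = [rng.random() for _ in range(4)]

layer = make_linear(8 + 6 + 4, len(EMOTIONS), rng)
probs = fuse(static_feat, motion_feat, audio_feat, layer)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the three sub-networks are trained separately, a fusion stage of this kind only needs their output features, which is why the sketch treats each branch as a fixed vector.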

Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild / Pini, Stefano; Ben Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit. - (2017), pp. 536-543. (Paper presented at the 19th ACM International Conference on Multimodal Interaction, held in Glasgow, Scotland, November 13-17, 2017) [10.1145/3136755.3143006].



Publication year	2017
Conference title	19th ACM International Conference on Multimodal Interaction
Conference location	Glasgow, Scotland
Conference dates	November 13-17, 2017
DOI	https://dx.doi.org/10.1145/3136755.3143006
Scopus ID	2-s2.0-85046678797
First page	536
Last page	543
All authors	Pini, Stefano; Ben Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit
Citation	Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild / Pini, Stefano; Ben Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit. - (2017), pp. 536-543. (Paper presented at the 19th ACM International Conference on Multimodal Interaction, held in Glasgow, Scotland, November 13-17, 2017) [10.1145/3136755.3143006].
Type	Conference proceedings paper

Files in this record:

File	Access	Type	Size	Format
2017-icmi.pdf	Open access	Author's revised version accepted for publication	6.2 MB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.