Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era

Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita

doi:10.1007/978-3-319-70169-1_29

Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e., which parts of an image pop out of a visual scene. This process has been studied for decades, both in neuroscience and through computational models that reproduce the human cortical process. In the last few years, early models have been replaced by deep learning architectures that outperform every earlier approach when evaluated on public datasets. In this paper, we discuss why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our deep learning architectures, which combine bottom-up cues with higher-level semantics and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Finally, we present a video-specific architecture based on the C3D network, which extracts spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show that these deep networks are not mere brute-force methods tuned on massive amounts of data, but well-defined architectures that closely recall the early saliency models, improved with the semantics learned from human ground truth.

Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era / Cornia, M., Abati, D., Baraldi, L., Palazzi, A., Calderara, S., Cucchiara, R.. - 10640:(2017), pp. 387-399. (16th International Conference of the Italian Association for Artificial Intelligence Bari, Italy November 14-17, 2017) [10.1007/978-3-319-70169-1_29].

Year of publication: 2017
Conference title: 16th International Conference of the Italian Association for Artificial Intelligence
Conference location: Bari, Italy
Conference date: November 14-17, 2017
DOI: https://dx.doi.org/10.1007/978-3-319-70169-1_29
WoS code: WOS:000451442200029
Scopus code: 2-s2.0-85033662960
Series: Lecture Notes in Computer Science
Volume number: 10640
First page: 387
Last page: 399
All authors: Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
Citation: Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era / Cornia, M., Abati, D., Baraldi, L., Palazzi, A., Calderara, S., Cucchiara, R.. - 10640:(2017), pp. 387-399. (16th International Conference of the Italian Association for Artificial Intelligence Bari, Italy November 14-17, 2017) [10.1007/978-3-319-70169-1_29].
Type: Conference proceedings paper

Files in this record:

File	Size	Format
paper_84 (1).pdf (Open access; type: AAM, author's version revised and accepted for publication)	6.68 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.