Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era

Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita
2018

Abstract

Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e., which parts of an image pop out of the visual scene. This process has been studied for decades, both in neuroscience and in terms of computational models that reproduce the human cortical process. In the last few years, early models have been replaced by deep learning architectures, which outperform any earlier approach when compared on public datasets. In this paper, we discuss the effectiveness of convolutional neural network (CNN) models in saliency prediction. We present a set of deep learning architectures we developed, which combine bottom-up cues with higher-level semantics and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We show how these deep networks closely recall the early saliency models, while improving on them with the semantics learned from human ground-truth. Finally, we present a use case in which saliency prediction is used to improve the automatic description of images.
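
As an illustration of the spatio-temporal processing mentioned in the abstract, below is a minimal, hypothetical PyTorch sketch of a 3D-convolutional saliency head. It is not the architecture described in the paper; all layer sizes, class names, and the temporal pooling choice are assumptions made only to show how 3D convolutions can turn a short clip of frames into a single saliency map.

# Illustrative sketch only: NOT the paper's architecture. Layer sizes and names
# are assumptions chosen to show how 3D convolutions mix spatial and temporal
# information before predicting a per-pixel saliency map.
import torch
import torch.nn as nn

class TinySaliency3D(nn.Module):
    def __init__(self):
        super().__init__()
        # Spatio-temporal feature extractor: 3D convolutions operate over
        # space (H, W) and time (T) jointly.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Collapse the temporal dimension, then predict a single-channel map.
        self.predict = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, clip):
        # clip: (batch, 3, T, H, W)
        x = self.features(clip)                 # (batch, 32, T, H, W)
        x = x.mean(dim=2)                       # average over time -> (batch, 32, H, W)
        return torch.sigmoid(self.predict(x))   # saliency map with values in [0, 1]

# Usage: predict saliency for a batch of two 8-frame RGB clips.
clip = torch.randn(2, 3, 8, 64, 64)
saliency = TinySaliency3D()(clip)               # (2, 1, 64, 64)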
Year: 2018
Volume: 12
Issue: 2
Pages: 161-175
Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era / Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita. - In: INTELLIGENZA ARTIFICIALE. - ISSN 1724-8035. - 12:2 (2018), pp. 161-175. [10.3233/IA-170033]
Files in this item:
There are no files associated with this item.

Creative Commons License
The metadata in IRIS UNIMORE are released under a Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, please contact Supporto Iris.

Use this identifier to cite or link to this item: https://hdl.handle.net/11380/1164162
Citations
  • PubMed Central: not available
  • Scopus: 0
  • Web of Science: 0