Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

In Vision-and-Language Navigation (VLN), an embodied agent must reach a target destination guided only by a natural language instruction. To explore the environment and progress towards the target location, the agent must perform a series of low-level actions, such as rotating before stepping ahead. In this paper, we propose to exploit dynamic convolutional filters to encode the visual information and the linguistic description efficiently. Unlike some previous works that abstract from the agent's perspective and use high-level navigation spaces, we design a policy which decodes the information provided by dynamic convolution into a series of low-level, agent-friendly actions. Results show that our model exploiting dynamic filters performs better than other architectures with traditional convolution, setting a new state of the art for embodied VLN in the low-level action space. Additionally, we attempt to categorize recent work on VLN according to its architectural choices and distinguish two main groups, which we call low-level action and high-level action models. To the best of our knowledge, we are the first to propose this analysis and categorization for VLN.
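
The abstract does not spell out how the dynamic filters are built, so the following is only a minimal sketch of the general idea of dynamic convolution: kernels are generated from the instruction rather than learned as fixed weights, then applied to a visual feature map. Every name here (`generate_dynamic_filters`, `apply_dynamic_filters`, the tensor sizes), the single linear projection, the `tanh` activation, and the L2 normalization are illustrative assumptions, not the authors' architecture.

```python
import numpy as np


def generate_dynamic_filters(instruction_embedding, weight, bias, num_filters):
    """Map an instruction embedding to `num_filters` 1x1 convolution kernels.

    Hypothetical design: one linear layer, tanh, then per-filter L2 normalization.
    """
    raw = np.tanh(weight @ instruction_embedding + bias)  # (num_filters * channels,)
    filters = raw.reshape(num_filters, -1)                 # (num_filters, channels)
    norms = np.linalg.norm(filters, axis=1, keepdims=True)
    return filters / np.maximum(norms, 1e-8)               # guard against zero norm


def apply_dynamic_filters(feature_map, filters):
    """Apply 1x1 kernels: feature_map (C, H, W), filters (F, C) -> maps (F, H, W)."""
    return np.einsum("fc,chw->fhw", filters, feature_map)


# Toy sizes and random inputs standing in for real image and text encoders.
rng = np.random.default_rng(0)
channels, height, width = 8, 4, 4
embed_dim, num_filters = 16, 3

feature_map = rng.standard_normal((channels, height, width))
instruction_embedding = rng.standard_normal(embed_dim)
weight = 0.1 * rng.standard_normal((num_filters * channels, embed_dim))
bias = np.zeros(num_filters * channels)

filters = generate_dynamic_filters(instruction_embedding, weight, bias, num_filters)
response_maps = apply_dynamic_filters(feature_map, filters)
```

Because the kernels depend on the instruction, a different instruction yields different response maps for the same image, which is what distinguishes this scheme from a traditional convolution with fixed weights.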

Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters / Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita. - (2019), pp. 1-12. (30th British Machine Vision Conference, BMVC 2019, Cardiff, UK, 9th-12th September 2019).

Publication year	2019
Conference title	30th British Machine Vision Conference, BMVC 2019
Conference location	Cardiff, UK
Conference dates	9th-12th September 2019
Scopus ID	2-s2.0-85087328502
First page	1
Last page	12
All authors	Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita
Type	Conference proceedings paper

Files in this item:

File	Size	Format
BMVC2019.pdf (Open access; type: AAM, author's version revised and accepted for publication)	2.07 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Iris Support.