In Vision-and-Language Navigation (VLN), an embodied agent must reach a target destination guided only by a natural language instruction. To explore the environment and progress towards the target location, the agent must perform a series of low-level actions, such as rotating, before stepping ahead. In this paper, we propose to exploit dynamic convolutional filters to encode the visual information and the linguistic description efficiently. Unlike some previous works that abstract away from the agent's perspective and use high-level navigation spaces, we design a policy that decodes the information provided by dynamic convolution into a series of low-level, agent-friendly actions. Results show that our model with dynamic filters outperforms other architectures based on traditional convolution, setting a new state of the art for embodied VLN in the low-level action space. Additionally, we attempt to categorize recent work on VLN according to its architectural choices and distinguish two main groups, which we call low-level action and high-level action models. To the best of our knowledge, we are the first to propose this analysis and categorization for VLN.
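The idea of dynamic convolutional filters mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the filters are produced by a linear projection of an instruction embedding, squashed with tanh, L2-normalized, and then applied as 1x1 convolutions over a visual feature map. The abstract does not specify these details, and all names, sizes, and random values below are hypothetical.

```python
import math
import random


def generate_filters(text_embedding, weights):
    """Map a text embedding to a bank of unit-norm 1x1 filters.

    weights[k][c] is a list with one coefficient per embedding entry,
    i.e. one linear projection per (filter k, channel c) pair.
    This layout is an assumption made for the illustration.
    """
    filters = []
    for per_filter in weights:
        raw = [math.tanh(sum(w * e for w, e in zip(row, text_embedding)))
               for row in per_filter]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        filters.append([v / norm for v in raw])
    return filters


def dynamic_conv(feature_map, filters):
    """Apply each 1x1 filter at every spatial position.

    feature_map[y][x] is a channel vector; the result holds one
    response map per filter, indexed as responses[k][y][x].
    """
    return [[[sum(f * v for f, v in zip(filt, pixel)) for pixel in row]
             for row in feature_map]
            for filt in filters]


# Toy sizes and random values, chosen only to make the sketch runnable.
random.seed(0)
EMBED, CHANNELS, NUM_FILTERS, H, W = 4, 3, 2, 2, 2
instruction = [random.uniform(-1, 1) for _ in range(EMBED)]
weights = [[[random.uniform(-1, 1) for _ in range(EMBED)]
            for _ in range(CHANNELS)] for _ in range(NUM_FILTERS)]
image = [[[random.uniform(0, 1) for _ in range(CHANNELS)]
          for _ in range(W)] for _ in range(H)]

filters = generate_filters(instruction, weights)
responses = dynamic_conv(image, filters)
```

Unlike a traditional convolution, whose kernels are fixed after training, the filters here change with every instruction, so the same image yields different response maps for different descriptions. A policy would then decode such maps into low-level actions; that part is not sketched here.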

Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters / Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita. - (2019), pp. 1-12. (Paper presented at the 30th British Machine Vision Conference, held in Cardiff, UK, 9th-12th September 2019).

Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita
2019

Conference: 30th British Machine Vision Conference
Location: Cardiff, UK
Dates: 9th-12th September 2019
Pages: 1-12
Files in this item:
File: BMVC2019.pdf
Access: Open access
Type: Author's revised version accepted for publication
Size: 2.07 MB
Format: Adobe PDF
Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright violation, contact Iris Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1178762
Citations
  • Scopus 14