
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation / Landi, Federico; Baraldi, Lorenzo; Cornia, Marcella; Corsini, Massimiliano; Cucchiara, Rita. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - 210:(2021), pp. 1-10. [10.1016/j.cviu.2021.103255]

Multimodal Attention Networks for Low-Level Vision-and-Language Navigation

Federico Landi; Lorenzo Baraldi; Marcella Cornia; Massimiliano Corsini; Rita Cucchiara
2021

Abstract

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal becomes even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting is known as low-level VLN. In this paper, we strive to create an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability to different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and is the first Transformer-like architecture incorporating three different modalities -- natural language, images, and low-level actions -- for agent control. In particular, we adopt an early fusion strategy to merge linguistic and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and good performance on the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act.
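The abstract only sketches the architecture, so the following is a minimal, illustrative PyTorch sketch of the early-fusion encoder (instruction and image features merged via cross-attention) and late-fusion decoder (action history attending to the fused perceptual stream). It is not the authors' implementation: module names, hidden size, number of heads, and the six-action output head are assumptions made here for illustration; the actual code is in the repository linked above.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only (not taken from the paper).
D_MODEL, N_HEADS = 512, 8


class EarlyFusionEncoder(nn.Module):
    """Early fusion: visual features attend to the instruction tokens."""

    def __init__(self, d_model=D_MODEL, n_heads=N_HEADS):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, vis_feats, lang_feats):
        fused, _ = self.cross_attn(vis_feats, lang_feats, lang_feats)
        x = self.norm1(vis_feats + fused)
        return self.norm2(x + self.ffn(x))


class LateFusionDecoder(nn.Module):
    """Late fusion: the action-history stream attends to the fused perception."""

    def __init__(self, d_model=D_MODEL, n_heads=N_HEADS, n_actions=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_actions)  # logits over low-level atomic actions

    def forward(self, action_hist, fused_percept):
        # Self-attention over the history of low-level actions.
        h, _ = self.self_attn(action_hist, action_hist, action_hist)
        h = self.norm1(action_hist + h)
        # Late fusion with the perceptual modalities.
        p, _ = self.cross_attn(h, fused_percept, fused_percept)
        h = self.norm2(h + p)
        return self.head(h[:, -1])  # predict the next atomic action


if __name__ == "__main__":
    vis = torch.randn(1, 36, D_MODEL)     # e.g. panoramic image features
    lang = torch.randn(1, 20, D_MODEL)    # embedded instruction tokens
    actions = torch.randn(1, 5, D_MODEL)  # embedded action history
    percept = EarlyFusionEncoder()(vis, lang)
    logits = LateFusionDecoder()(actions, percept)
    print(logits.shape)  # torch.Size([1, 6])
```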
Year: 2021
Volume: 210
Pages: 1-10
Files in this item:

2021_CVIU_VLN.pdf
  Type: Author's version, revised and accepted for publication
  Access: Open access
  Size: 991.66 kB
  Format: Adobe PDF

1-s2.0-S1077314221000990-main.pdf
  Type: Version published by the publisher
  Access: Restricted access
  Size: 1.15 MB
  Format: Adobe PDF

Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.
In case of copyright violation, please contact Iris Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1250623
Citations
  • PMC: not available
  • Scopus: 15
  • Web of Science (ISI): 11