All you can embed: Natural language based vehicle retrieval with spatio-temporal transformers

Combining natural language with vision is a distinctive challenge in Artificial Intelligence. Track 5 of the AI City Challenge, Natural Language-Based Vehicle Retrieval, focuses on combining visual and textual information in a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution that correlates single-vehicle tracking sequences with natural language descriptions. The main building blocks of the proposed architecture are (i) BERT, which provides an embedding of the textual descriptions, and (ii) a convolutional backbone combined with a Transformer model, which embeds the visual information. To train the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
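As a rough illustration of the training objective mentioned in the abstract, the sketch below implements the standard Triplet Margin Loss over a text embedding (anchor), a matching visual embedding (positive), and a non-matching visual embedding (negative), followed by distance-based ranking for retrieval. It is not the variation proposed in the paper, whose details are not given here; the function names, the Euclidean distance, and the margin value are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """Standard triplet margin loss: max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def rank_tracks(query: Sequence[float], tracks: Dict[str, Sequence[float]]) -> List[str]:
    """Return track ids sorted by ascending distance to the query embedding."""
    return sorted(tracks, key=lambda track_id: euclidean(query, tracks[track_id]))
```

At retrieval time, under these assumptions, a text query would be embedded once and every candidate tracking sequence ranked by its distance to that embedding, for example with `rank_tracks(text_embedding, {"track_a": emb_a, "track_b": emb_b})`.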

All you can embed: Natural language based vehicle retrieval with spatio-temporal transformers / Scribano, C.; Sapienza, D.; Franchini, G.; Verucchi, M.; Bertogna, M. - (2021), pp. 4248-4257. (Paper presented at the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021, held in the USA in 2021) [10.1109/CVPRW53098.2021.00481].


Scribano C.; Sapienza D.; Franchini G.; Verucchi M.; Bertogna M.

2021



Year of publication: 2021
Conference title: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021
Conference location: USA
Conference date: 2021
DOI: https://dx.doi.org/10.1109/CVPRW53098.2021.00481
WoS code: WOS:000705890204047
Scopus code: 2-s2.0-85116058044
Series: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS
First page: 4248
Last page: 4257
All authors: Scribano, C.; Sapienza, D.; Franchini, G.; Verucchi, M.; Bertogna, M.
Type: Conference proceedings paper

Files in this item:

File	Size	Format
All_You_Can_Embed_Natural_Language_based_Vehicle_Retrieval_with_Spatio-Temporal_Transformers.pdf (restricted access; type: VOR, version published by the publisher)	8.53 MB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1264951
