Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Nicholas Moratelli; Manuele Barraco; Davide Morelli; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara
2023

Abstract

Research related to the fashion and e-commerce domains is gaining attention in the computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed that integrates an external textual memory accessed through k-nearest-neighbor (kNN) searches. From an architectural point of view, the proposed transformer model reads and retrieves items from the external memory through cross-attention operations and tunes the flow of information coming from the external memory with a novel fully attentive gate. Experimental analyses were carried out on the Fashion Captioning Dataset (FACAD), which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies against carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
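To make the described mechanism concrete, below is a minimal PyTorch sketch of a decoder block that retrieves kNN entries from an external textual memory, reads them through cross-attention, and modulates their contribution with an attention-based gate. The class and all names in it (KNNMemoryCrossAttention, gate_attn, gate_proj, and so on) are hypothetical illustrations under these assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the retrieval-and-gating idea from the abstract:
# decoder states query an external textual memory via kNN search, read the
# retrieved entries through cross-attention, and fuse them with a learned gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KNNMemoryCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, k=16):
        super().__init__()
        self.k = k
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Attentive gate (assumption): a single-head attention summarizes the
        # retrieved slots, then a linear layer maps the summary to a scalar
        # gate in [0, 1] for each token.
        self.gate_attn = nn.MultiheadAttention(d_model, 1, batch_first=True)
        self.gate_proj = nn.Linear(d_model, 1)

    def retrieve(self, queries, memory_keys, memory_values):
        # queries: (B, T, D); memory_keys/values: (N, D) text embeddings.
        sims = F.normalize(queries, dim=-1) @ F.normalize(memory_keys, dim=-1).T
        topk = sims.topk(self.k, dim=-1).indices          # (B, T, k)
        return memory_values[topk]                        # (B, T, k, D)

    def forward(self, x, memory_keys, memory_values):
        B, T, D = x.shape
        mem = self.retrieve(x, memory_keys, memory_values).reshape(B * T, self.k, D)
        q = x.reshape(B * T, 1, D)
        read, _ = self.cross_attn(q, mem, mem)            # read retrieved entries
        gate_ctx, _ = self.gate_attn(q, mem, mem)         # summarize them for gating
        g = torch.sigmoid(self.gate_proj(gate_ctx))       # (B*T, 1, 1) gate
        return x + (g * read).reshape(B, T, D)            # gated residual fusion

# Usage with random stand-ins for decoder states and a memory of 1000 entries:
# layer = KNNMemoryCrossAttention()
# out = layer(torch.randn(2, 5, 512), torch.randn(1000, 512), torch.randn(1000, 512))
```

In this sketch the gate is a per-token scalar produced by letting a single-head attention summarize the retrieved slots; the paper's fully attentive gate may differ in detail, but the overall pattern (retrieve, cross-attend, gate, residual fusion) matches the architecture the abstract describes.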
Year: 2023
Issue date: January 2023
Volume: 23
Issue: 3
Pages: 1-16
Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates / Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - In: SENSORS. - ISSN 1424-8220. - 23:3(2023), pp. 1-16. [10.3390/s23031286]
Files in this record:
  • sensors-23-01286-v2.pdf (Open access)
    Type: Publisher's published version
    Size: 1.41 MB
    Format: Adobe PDF

Creative Commons License
Metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1295408
Citations
  • PubMed Central: 2
  • Scopus: 7
  • Web of Science: 5