Investigating Bidimensional Downsampling in Vision Transformer Models

Vision Transformers (ViT) and other Transformer-based architectures for image classification have achieved promising performance in the last two years. However, ViT-based models require large datasets, memory, and computational power to obtain state-of-the-art results compared to more traditional architectures. In particular, the generic ViT model maintains a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. With the goal of increasing the efficiency of Transformer-based models, we explore the application of a 2D max-pooling operator on the outputs of Transformer encoders. We conduct extensive experiments on the CIFAR-100 dataset and the large ImageNet dataset and consider both accuracy and efficiency metrics, with the final goal of reducing the token sequence length without affecting the classification performance. Experimental results show that bidimensional downsampling can outperform previous classification approaches while requiring relatively limited computational resources.
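To make the core idea concrete, the following is a minimal sketch of 2D max pooling applied to a sequence of patch tokens. It is not the authors' implementation: the abstract does not specify the kernel size, the stride, where in the network the pooling is placed, or how a class token is handled. The function name `max_pool_tokens`, the row-major token layout, and the non-overlapping 2x2 window are all assumptions made for illustration.

```python
from typing import List

Token = List[float]


def max_pool_tokens(tokens: List[Token], grid_h: int, grid_w: int,
                    kernel: int = 2) -> List[Token]:
    """Non-overlapping 2D max pooling over a row-major patch-token sequence.

    Hypothetical helper: assumes `tokens` holds grid_h * grid_w patch tokens
    (no class token) and that stride equals the kernel size.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if grid_h % kernel or grid_w % kernel:
        raise ValueError("grid dimensions must be divisible by the kernel size")

    dim = len(tokens[0])
    pooled: List[Token] = []
    for top in range(0, grid_h, kernel):
        for left in range(0, grid_w, kernel):
            # Gather the tokens that fall inside the current spatial window.
            window = [tokens[(top + dy) * grid_w + (left + dx)]
                      for dy in range(kernel) for dx in range(kernel)]
            # Take the channel-wise maximum over the window.
            pooled.append([max(tok[c] for tok in window) for c in range(dim)])
    return pooled


# Toy example: a 4x4 grid of 2-dimensional tokens is reduced to a 2x2 grid.
tokens = [[float(i), float(-i)] for i in range(16)]
pooled = max_pool_tokens(tokens, grid_h=4, grid_w=4)
print(len(tokens), "->", len(pooled))
```

With a 2x2 window and matching stride, the sequence length drops by a factor of four (for example, a 14x14 grid of 196 tokens would become a 7x7 grid of 49 tokens). Because self-attention cost grows quadratically with sequence length, every encoder layer after such a pooling step operates on a much cheaper input.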

Investigating Bidimensional Downsampling in Vision Transformer Models / Bruno, Paolo; Amoroso, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita. - 13232:(2022), pp. 287-299. (Paper presented at the 21st International Conference on Image Analysis and Processing, ICIAP 2022, held in Lecce, Italy, 23-27 May 2022) [10.1007/978-3-031-06430-2_24].

Year of publication: 2022
Conference title: 21st International Conference on Image Analysis and Processing, ICIAP 2022
Conference location: Lecce, Italy
Conference dates: 23-27 May 2022
DOI: https://dx.doi.org/10.1007/978-3-031-06430-2_24
WoS ID: WOS:000870296100024
Scopus ID: 2-s2.0-85130888294
Series: Lecture Notes in Computer Science
Volume: 13232
First page: 287
Last page: 299
All authors: Bruno, Paolo; Amoroso, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita
Type: Conference proceedings paper


Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1268738
