Video action detection by learning graph-based spatio-temporal interactions

Action detection is a complex task that aims to detect and classify human actions in video clips. It has typically been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, greater attention has been devoted to relationship modelling. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure that can connect entities from consecutive clips, thus capturing long-range spatial and temporal dependencies. The proposed module is backbone-independent by design and does not require end-to-end training. Extensive experiments on the AVA dataset show that our model achieves state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.
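
To make the idea of self-attention over a graph of entities more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (that is in the repository linked above). It assumes a single attention layer with no learned projections, and the names `build_adjacency`, `masked_self_attention`, `max_clip_gap` and the toy `nodes` list are hypothetical, chosen only for illustration. Each node is a person or object detected in a clip, and an edge lets one node attend to another when their clips are close enough in time.

```python
import math

# Hypothetical toy graph: clip 0 contains a person and an object,
# clip 1 contains a person. Features are made-up 2-D vectors.
nodes = [
    {"clip": 0, "kind": "person", "feat": [1.0, 0.0]},
    {"clip": 0, "kind": "object", "feat": [0.0, 1.0]},
    {"clip": 1, "kind": "person", "feat": [1.0, 1.0]},
]


def build_adjacency(nodes, max_clip_gap=1):
    """Connect two nodes when their clips are at most max_clip_gap apart.

    max_clip_gap=0 keeps only spatial (same-clip) edges; larger values
    add temporal edges between neighbouring clips. This is an assumed
    connectivity rule, used here only for illustration.
    """
    n = len(nodes)
    return [
        [1 if abs(nodes[i]["clip"] - nodes[j]["clip"]) <= max_clip_gap else 0
         for j in range(n)]
        for i in range(n)
    ]


def masked_self_attention(features, adjacency):
    """Scaled dot-product self-attention restricted to graph edges.

    Returns (updated_features, attention_weights). Queries, keys and
    values are the raw features; a real model would apply learned
    linear projections first.
    """
    n, d = len(features), len(features[0])
    scale = 1.0 / math.sqrt(d)
    updated, weights = [], []
    for i in range(n):
        neighbours = [j for j in range(n) if adjacency[i][j]]
        row = [0.0] * n
        if not neighbours:
            # Isolated node: nothing to attend to, keep features as they are.
            updated.append(list(features[i]))
            weights.append(row)
            continue
        scores = [scale * sum(a * b for a, b in zip(features[i], features[j]))
                  for j in neighbours]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j, e in zip(neighbours, exps):
            row[j] = e / total
        weights.append(row)
        # New feature = attention-weighted average of neighbour features.
        updated.append([sum(row[j] * features[j][k] for j in neighbours)
                        for k in range(d)])
    return updated, weights


feats = [node["feat"] for node in nodes]
spatial_only, w_spatial = masked_self_attention(feats, build_adjacency(nodes, 0))
spatio_temporal, w_st = masked_self_attention(feats, build_adjacency(nodes, 1))
```

With `max_clip_gap=0` the person in clip 0 attends only to entities in its own clip. With `max_clip_gap=1` it also receives a non-zero attention weight from the person in clip 1, which is the sense in which edges across consecutive clips let temporal context flow into each node. Stacking several such layers would extend the reach further, at the cost of more computation.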

Video action detection by learning graph-based spatio-temporal interactions / Tomei, Matteo; Baraldi, Lorenzo; Calderara, Simone; Bronzin, Simone; Cucchiara, Rita. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - 206:(2021), pp. 1-9. [10.1016/j.cviu.2021.103187]

Matteo Tomei; Lorenzo Baraldi; Simone Calderara; Simone Bronzin; Rita Cucchiara (Project Administration)

2021



Publication year	2021
Journal	COMPUTER VISION AND IMAGE UNDERSTANDING
Volume	206
First page	1
Last page	9
DOI	https://dx.doi.org/10.1016/j.cviu.2021.103187
WoS ID	WOS:000634960000006
Scopus ID	2-s2.0-85102037745
All authors	Tomei, Matteo; Baraldi, Lorenzo; Calderara, Simone; Bronzin, Simone; Cucchiara, Rita
Type	Journal article

Files in this record:

File	Size	Format
1-s2.0-S107731422100031X-main.pdf (open access; type: VOR, version published by the publisher)	1.22 MB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1235540
