Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks

Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita
2017

Abstract

In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual, and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve the presentation of retrieval results with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: 1) semantic feature extraction, which builds video-specific concept detectors; 2) multimodal feature embedding learning, which maps the feature vector of a shot to a space in which the Euclidean distance has task-specific semantic properties. The proposed method is able to decompose the video into annotated temporal segments, which allow for query-specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted, and a strategy to overcome the problem is suggested.
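The second contribution mentioned in the abstract, a multimodal embedding in which Euclidean distance carries scene-level semantics, can be illustrated with a contrastive formulation. The following is a minimal sketch only: the fusion architecture, feature dimensions, and the contrastive_loss function are illustrative assumptions, not the network or training objective actually used in the paper.

```python
# Hypothetical sketch: a multimodal shot-embedding network trained so that
# Euclidean distance between shot embeddings reflects same-scene membership.
# Architecture, dimensions, and loss are illustrative assumptions, not the
# paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShotEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, textual_dim=300, audio_dim=128, embed_dim=256):
        super().__init__()
        # Concatenate per-modality features and project them into a joint space.
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + textual_dim + audio_dim, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, visual, textual, audio):
        x = torch.cat([visual, textual, audio], dim=-1)
        return F.normalize(self.fuse(x), dim=-1)  # unit-norm shot embedding

def contrastive_loss(emb_a, emb_b, same_scene, margin=1.0):
    """Pull shots of the same scene together, push shots of different scenes apart."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = same_scene * dist.pow(2)                             # same scene: small distance
    neg = (1.0 - same_scene) * F.relu(margin - dist).pow(2)    # different scenes: at least `margin` apart
    return (pos + neg).mean()

# Example usage with random features for a batch of shot pairs:
model = ShotEmbedding()
emb_a = model(torch.randn(8, 2048), torch.randn(8, 300), torch.randn(8, 128))
emb_b = model(torch.randn(8, 2048), torch.randn(8, 300), torch.randn(8, 128))
same_scene = torch.randint(0, 2, (8,)).float()  # 1 if the pair belongs to the same scene
loss = contrastive_loss(emb_a, emb_b, same_scene)
```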
Year: 2017
Published online: 23 December 2016
Volume: 19
Issue: 5
Pages: 955-968
Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks / Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita. - In: IEEE TRANSACTIONS ON MULTIMEDIA. - ISSN 1520-9210. - 19:5(2017), pp. 955-968. [10.1109/TMM.2016.2644872]
Files in this item:
2016TMM.pdf, Adobe PDF, 3.18 MB. Open access. Type: author's original version submitted for publication.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1121582
Citations
  • PMC: not available
  • Scopus: 38
  • Web of Science: 30