LAMV: Learning to align and match videos with kernelized temporal layers / Baraldi, Lorenzo; Douze, Matthijs; Cucchiara, Rita; Jégou, Hervé. - (2018), pp. 7804-7813. (Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, held in Salt Lake City, UT, USA, June 18-22) [10.1109/CVPR.2018.00814].

LAMV: Learning to align and match videos with kernelized temporal layers

Baraldi, Lorenzo; Cucchiara, Rita
2018

Abstract

This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular event search, while precisely aligning videos.
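
The idea of finding a temporal alignment by maximizing a score over offsets, evaluated in the Fourier domain, can be illustrated with a minimal sketch. This is not the layer proposed in the paper: it uses a plain, unweighted cross-correlation of frame descriptors with no learned parameters, and the function names (`dft`, `offset_scores`, `best_offset`) and the toy data are assumptions made for illustration only.

```python
import cmath
import random


def dft(signal, n, inverse=False):
    """Naive O(n^2) discrete Fourier transform, zero-padding the input to length n."""
    sign = 1 if inverse else -1
    padded = list(signal) + [0.0] * (n - len(signal))
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, v in enumerate(padded))
        out.append(acc / n if inverse else acc)
    return out


def offset_scores(x, y):
    """Return scores where scores[k] = sum_t <x[t], y[t + k]> (indices taken modulo n).

    x and y are lists of equal-dimension frame descriptors. Zero-padding to
    n = len(x) + len(y) - 1 keeps circular wrap-around from mixing offsets.
    """
    n = len(x) + len(y) - 1
    dim = len(x[0])
    spectrum = [0j] * n
    for d in range(dim):
        fx = dft([frame[d] for frame in x], n)
        fy = dft([frame[d] for frame in y], n)
        for k in range(n):
            # Cross-correlation theorem: conj(X) * Y in frequency <-> correlation in time.
            spectrum[k] += fx[k].conjugate() * fy[k]
    return [c.real for c in dft(spectrum, n, inverse=True)]


def best_offset(x, y):
    """Offset k maximizing the score, so that x[t] best matches y[t + k]."""
    scores = offset_scores(x, y)
    k = max(range(len(scores)), key=scores.__getitem__)
    # Indices at or beyond len(y) encode negative offsets.
    return k if k < len(y) else k - len(scores)


# Toy data (hypothetical): y is a 40-frame "video" of 8-D descriptors,
# x is an excerpt of y starting at frame 7.
random.seed(0)
y = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(40)]
x = y[7:23]
print(best_offset(x, y))
```

A practical implementation would use a fast Fourier transform rather than the quadratic transform above; the naive version is kept only so the sketch depends on nothing beyond the standard library.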
Year: 2018
Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition
Location: Salt Lake City, UT, USA
Dates: June 18-22
Pages: 7804-7813
Authors: Baraldi, Lorenzo; Douze, Matthijs; Cucchiara, Rita; Jégou, Hervé
Files in this record:
File: 1517.pdf
Access: Open access
Type: Author's revised version, accepted for publication
Size: 2.54 MB
Format: Adobe PDF
Creative Commons License
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1155754
Citations
  • PMC: ND
  • Scopus: 34
  • ISI: 20