LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers

This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular event search, while precisely aligning videos.


Introduction
Thanks to the success of neural networks and the availability of large annotated collections of images like Imagenet [4] and COCO [23], we have recently witnessed drastic improvements on many core computer vision problems, such as image classification [22,14] and segmentation [13]. The analysis of videos has largely benefited from this game-changing adoption of neural networks, in particular by exploiting state-of-the-art image networks. Current methods for tackling video-related tasks mostly rely on the trunk of neural network architectures trained on images [32,34,9,21].
Many attempts to exploit the temporal axis of videos within neural architectures have been proposed. These approaches typically extract information at the frame level and subsequently enforce or measure the temporal consistency. For instance, Kang et al. [21] propose a temporal convolutional network to regularize object detection results. Fernando et al. [9] postulate that a method able to temporally re-order the frames of a video would be more suitable to detect the evolution of appearance, and use this supervision signal to improve action recognition. Diba et al. [5] investigate different ways of aggregating feature maps from image-level convolutional neural networks to achieve end-to-end learning of a video representation.
In contrast, only a few works consider learning a joint spatio-temporal representation, like the C3D network [32]. Several difficulties may explain this situation. First, the amount of temporally-labelled data is limited: for large collections the annotation is provided at the video level only, or automatically extracted, or both [1]. Second, the number of parameters to learn a spatio-temporal representation is generally much larger than for still images. Third, depending on the task, it is not obvious that temporality is at all useful. For instance, the recent high-profile leaderboard competition¹ on video understanding was won by a technique agnostic of temporality [24].
In this paper, we tackle the task of retrieving and aligning similar video instances. This problem arises in different applications such as copy detection, particular event detection, video editing and re-purposing. In the literature, one can distinguish the methods offering temporal alignment from those discarding the time information, typically through temporal pooling operations. According to a comparative study on copy detection conducted in 2014 [19], the best methods relied on local descriptors and frame-based matching [18], even though temporal alignment is often needed later, for example to manually verify a copyright infringement. In contrast, the state of the art for particular event retrieval [6,11] exploits a single vector per video.
Similarly, because accurate video alignment requires matching with a frame-level granularity, methods based on temporal pooling [8,32,10,25] inevitably introduce some invariance to small time shifts. They are therefore not appropriate for achieving high localization accuracy.
In order to preserve the capability to align videos while offering a competitive recognition accuracy, another line of research considers Fourier-domain representations, like the circulant temporal encoding (CTE) [28,7] inspired by prior works on tracking with correlation filters [16,17]. In our work, we consider the temporal match kernel (TMK) by Poullot et al. [26]. This representation consists of complementary periodic encodings of a sequence of frames into a fixed-size representation. It provides both an accurate matching and an alignment hypothesis, and outperforms CTE [28] in terms of alignment accuracy.
An advantage of TMK is that it disentangles the visual and temporal aspects while keeping the temporal consistency. Our proposal revisits temporal match kernels in the context of a neural network. More specifically, we propose a temporal layer inspired by TMK [26]. The design is modified and the parameters are learned with a supervision signal that takes into account both the matching quality and the precision of the alignment. This is in contrast to the original technique, where the parameters are hand-crafted through the choice of a specific kernel (Von Mises). To train our layer, we adopt a temporal proposal strategy providing both positive and negative examples. The learning is performed on both real and synthetic data simulating the temporal and visual attacks undergone by videos in our different tasks.
As a complementary contribution, we provide guidelines for tuning the hyper-parameters, in particular the design of better complementary elementary kernels. This, by itself, provides a significant boost, leading us to outperform the state of the art for temporal video alignment, copy detection and event retrieval on the public benchmarks Madonna [7], Climbing [7], VCDB [19] and EVVE [28].
The rest of this paper is organized as follows. After reviewing the fundamentals of temporal match kernels in Section 2, we introduce our approach in Section 3 and evaluate it in Section 4.

¹ https://www.kaggle.com/c/youtube8m/leaderboard

Related work and Temporal kernels
For a given video to describe, we consider a sequence of frame descriptors extracted at distinct timestamps $T = \{t_1, \ldots, t_i, \ldots\}$. Each frame $f_i$ is represented as a tuple $(x_i, t_i)$, where $x_i$ is a $d$-dimensional vector and $t_i$ denotes the scalar timestamp of the frame. The frame descriptor $x_i$ is typically obtained by post-processing hand-crafted or CNN-based representations. We assume that the frame descriptors are $\ell_2$-normalized and are compared with inner products, or equivalently with the cosine similarity.
Joint frame and timestamp encoding. We consider a kernel function between frame descriptors such that the similarity between a pair of descriptors takes into account their absolute position in time. This operation is commonly referred to as a modulation. Formally, it amounts to defining a kernel between frame descriptors $x_t$ and $x'_{t'}$ with respective timestamps $t$ and $t'$ as
$$k(x_t, x'_{t'}) = x_t^\top x'_{t'} \, k_t(t, t') \tag{1}$$
$$\approx x_t^\top x'_{t'} \, \varphi(t)^\top \varphi(t'), \tag{2}$$
where $\varphi(\cdot)$ is a feature map function approximating the kernel $k_t$ between timestamps, which lowers the similarity between frames that are distant in time. By convention, we set $k_t(t, t') = 0$ if $t$ or $t'$ are outside the range of the valid timestamps for the two videos. Further algebraic manipulation reveals that this kernel can be expressed as
$$k(x_t, x'_{t'}) \approx \big(x_t \otimes \varphi(t)\big)^\top \big(x'_{t'} \otimes \varphi(t')\big), \tag{3}$$
where $\otimes$ is the Kronecker product. Therefore, we describe the tuple $(x_t, t)$ by a single feature vector, namely $x_t \otimes \varphi(t)$.
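As a quick numerical check of the factorization described above, the following minimal sketch verifies that the inner product of two Kronecker-encoded tuples equals the descriptor similarity multiplied by the timestamp similarity. The vectors `phi_t` and `phi_t2` are random stand-ins for $\varphi(t)$ and $\varphi(t')$, and all sizes are arbitrary toy values; nothing here is taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 8, 5  # descriptor size and feature-map size (arbitrary toy values)

# Two l2-normalized frame descriptors.
x = rng.normal(size=d)
x2 = rng.normal(size=d)
x /= np.linalg.norm(x)
x2 /= np.linalg.norm(x2)

# Random stand-ins for phi(t) and phi(t'); any feature map works for this identity.
phi_t = rng.normal(size=D)
phi_t2 = rng.normal(size=D)

# Product of the two similarities ...
lhs = (x @ x2) * (phi_t @ phi_t2)
# ... equals one inner product between the Kronecker-encoded tuples.
rhs = np.kron(x, phi_t) @ np.kron(x2, phi_t2)
```

The identity holds for any pair of vectors, which is why the visual and temporal parts can be folded into a single feature vector per frame.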
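Temporal match kernel. Given two videos represented by the sequences of frame descriptors $X = \{(x_i, t_i)\}_i$ and $X' = \{(x'_j, t'_j)\}_j$, we consider the temporal kernel that compares the videos on a frame-by-frame basis, assuming that the videos are shifted in time by the duration $\delta$:
$$K_\delta(X, X') = \sum_i \sum_j k\big(x_{t_i}, x'_{t'_j + \delta}\big). \tag{4}$$
With Eqn. 3, this kernel is subsequently re-written as
$$K_\delta(X, X') \approx \Big(\sum_i x_i \otimes \varphi(t_i)\Big)^\top \Big(\sum_j x'_j \otimes \varphi(t'_j + \delta)\Big) = \psi_0(X)^\top \psi_\delta(X'), \tag{5}$$
where $\psi_0(X)$ is the descriptor associated with the first video, and $\psi_\delta(X')$ is the descriptor associated with the second video and re-mapped to the new time origin $\delta$.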
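In the temporal match kernel from Poullot et al. [26], $k_t$ is expressed by means of a Fourier approximation with period $T$ and $M$ coefficients. In this case, the feature vector representing a video can be written as
$$\psi_0(X) = \big[V_0^\top, V_{1,c}^\top, V_{1,s}^\top, \ldots, V_{M,c}^\top, V_{M,s}^\top\big]^\top, \tag{6}$$
where:
$$V_0 = \sqrt{a_0} \sum_i x_i, \tag{7}$$
$$V_{m,c} = \sqrt{a_m} \sum_i x_i \cos\Big(\frac{2\pi}{T} m t_i\Big), \tag{8}$$
$$V_{m,s} = \sqrt{a_m} \sum_i x_i \sin\Big(\frac{2\pi}{T} m t_i\Big), \tag{9}$$
where $a_m$ are the coefficients of the Fourier series. If $T$ consists of evenly-spaced timestamps, this is equivalent to taking the Fourier transform of the input time series with period $T$ and convolving it with $\varphi(t)$. It leads to a feature vector with dimensionality $d \times (2M + 1)$.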
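The encoding above can be sketched in a few lines. This is a minimal illustration, assuming the feature map $\varphi(t) = [\sqrt{a_0}, \sqrt{a_1}\cos(2\pi t/T), \sqrt{a_1}\sin(2\pi t/T), \ldots]$ with non-negative coefficients $a_m$; the function name `encode_video`, the row layout and the toy values are hypothetical choices, not the authors' code.

```python
import numpy as np

def encode_video(x, t, a, T):
    """Encode frames into a (2M+1, d) array with rows V_0, V_1c, V_1s, ..., V_Mc, V_Ms.

    x: (n, d) frame descriptors; t: (n,) timestamps;
    a: (M+1,) non-negative Fourier coefficients; T: period.
    """
    M = len(a) - 1
    w = 2.0 * np.pi * np.outer(np.arange(1, M + 1), t) / T  # (M, n) phases
    sa = np.sqrt(a[1:])[:, None]
    psi = np.empty((2 * M + 1, x.shape[1]))
    psi[0] = np.sqrt(a[0]) * x.sum(axis=0)   # V_0: scaled sum of descriptors
    psi[1::2] = (sa * np.cos(w)) @ x         # cosine components V_{m,c}
    psi[2::2] = (sa * np.sin(w)) @ x         # sine components V_{m,s}
    return psi

# Toy data (arbitrary values, for illustration only).
rng = np.random.default_rng(0)
n, d, M, T = 30, 4, 3, 50.0
x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
t = np.arange(n, dtype=float)
a = np.array([1.0, 0.8, 0.5, 0.2])
psi = encode_video(x, t, a, T)

# At zero shift, <psi, psi> equals the double sum of modulated frame kernels.
u = t[:, None] - t[None, :]
k_t = a[0] + sum(a[m] * np.cos(2 * np.pi * m * u / T) for m in range(1, M + 1))
direct = np.sum((x @ x.T) * k_t)
```

The descriptor size depends only on $d$ and $M$, not on the number of frames, which is what makes the representation fixed-size.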
Alternative choices for $\varphi$ exist. For instance, this kind of kernel approximation was first defined with random Fourier features [27]. Vedaldi and Zisserman [33] show that explicitly using the Fourier decomposition gives a much better approximation of shift-invariant kernels. By departing from the Fourier basis, Chum [3] shows how to learn sparse feature maps improving the compromise between the number of coefficients and the approximation of a kernel.
Trigonometric polynomial of scores. At this stage, $\psi_0(X)$ is a representation of the video. The first component $V_0$ is the average frame descriptor and can be used to directly compare two videos, in this case discarding the temporal information. Yet one of the strengths of the chosen kernelization is that it keeps a latent variable and allows the maximization of the kernel w.r.t. this variable. This property was first exploited by Tolias et al. [30] when aggregating local descriptors. Bursuc et al. [2] exploit it to define a kernel local descriptor that automatically adjusts the orientation and scale to maximize the matching score when provided with two candidate descriptors.
In our context, the latent variable is the relative time offset between the two videos. Given two videos $X$ and $X'$ and an alignment hypothesis $\delta$, the similarity between the two video sequences is computed as
$$K_\delta(X, X') = \langle V_0, V'_0 \rangle + \sum_{m=1}^{M} \cos\Big(\frac{2\pi}{T} m \delta\Big) \big(\langle V_{m,c}, V'_{m,c} \rangle + \langle V_{m,s}, V'_{m,s} \rangle\big) + \sum_{m=1}^{M} \sin\Big(\frac{2\pi}{T} m \delta\Big) \big(-\langle V_{m,c}, V'_{m,s} \rangle + \langle V_{m,s}, V'_{m,c} \rangle\big). \tag{10}$$
Therefore, the score as a function of $\delta$ is a trigonometric polynomial of degree $M$. Evaluating this polynomial at regular timestamps is efficient and only requires $1 + 4M$ dot products between vectors of dimension $d$.
Multiple periods. Poullot et al. [26] employ multiple kernels with distinct periods, shorter than the video length, and take the sum of the kernel scores as the final similarity measure. This increases localization accuracy while inducing a large period for the kernel summation.
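The trigonometric-polynomial scoring described above can be illustrated with a small sketch. It assumes the convention that the shift $\delta$ is added to the timestamps of the second video and that all Fourier coefficients are equal to one; the helper names (`encode`, `score`, `score_bruteforce`) and the one-hot toy descriptors are hypothetical and chosen only to make the result easy to check.

```python
import numpy as np

def encode(x, t, a, T):
    """Rows: V_0, V_1c, V_1s, ..., V_Mc, V_Ms for one period T."""
    M = len(a) - 1
    w = 2.0 * np.pi * np.outer(np.arange(1, M + 1), t) / T
    sa = np.sqrt(a[1:])[:, None]
    psi = np.empty((2 * M + 1, x.shape[1]))
    psi[0] = np.sqrt(a[0]) * x.sum(axis=0)
    psi[1::2], psi[2::2] = (sa * np.cos(w)) @ x, (sa * np.sin(w)) @ x
    return psi

def score(psi, psi2, delta, T):
    """K_delta from 1 + 4M dot products (they do not depend on delta and could be cached)."""
    M = (psi.shape[0] - 1) // 2
    Vc, Vs, Wc, Ws = psi[1::2], psi[2::2], psi2[1::2], psi2[2::2]
    even = np.sum(Vc * Wc, axis=1) + np.sum(Vs * Ws, axis=1)  # cosine terms
    odd = np.sum(Vs * Wc, axis=1) - np.sum(Vc * Ws, axis=1)   # sine terms
    w = 2.0 * np.pi * np.arange(1, M + 1) * delta / T
    return psi[0] @ psi2[0] + np.cos(w) @ even + np.sin(w) @ odd

def score_bruteforce(x, t, x2, t2, a, T, delta):
    """Reference: explicit double sum over all frame pairs."""
    u = t[:, None] - t2[None, :] - delta
    k_t = a[0] + sum(a[m] * np.cos(2 * np.pi * m * u / T) for m in range(1, len(a)))
    return np.sum((x @ x2.T) * k_t)

# Toy example: X' is an excerpt of X starting at frame 5 (one-hot descriptors).
M, T = 8, 64.0
a = np.ones(M + 1)
x, t = np.eye(20), np.arange(20.0)
x2, t2 = x[5:15], np.arange(10.0)
psi, psi2 = encode(x, t, a, T), encode(x2, t2, a, T)

deltas = np.arange(0, 20)
scores = np.array([score(psi, psi2, dl, T) for dl in deltas])
best = int(deltas[np.argmax(scores)])  # estimated offset of the excerpt
```

With real descriptors the peak is noisier than in this toy case, which is one motivation for combining several periods.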

Proposed approach: LAMV
We revisit the temporal match kernel as a global video descriptor to compute the similarity between videos and align them temporally. This approach is referred to as LAMV (Learning to Align and Match Videos). For this, we transform the kernel into a differentiable layer, and learn the coefficients of the feature transform by imposing a triplet loss that jointly takes into account (i) the similarity scores produced when comparing two videos globally, and (ii) the temporal alignment accuracy when processing overlapping videos. The batches on which the loss is evaluated contain hard negative proposals. We also devise a normalization strategy that enhances retrieval and alignment performance.

Overview: Layerizing temporal match kernels
All the operations involved in the computation of scores produced by the temporal match kernel are differentiable with respect to their parameters $a_i$, even when using multiple periods. The kernel can be seen as a differentiable layer that can compute the similarity between two videos. The Fourier coefficients of the feature map $\varphi(\cdot)$ are parameters that can be learned by backpropagating a supervision signal built on the similarity scores.
The LAMV layer can aggregate frame-level features to compute a video feature vector, and it can then compare two videos by shifting one of the two descriptors. In this regard, its structure resembles that of Siamese networks, in which the same function is applied to two branches whose outputs are then compared by a distance function.
Given a set $P$ of periods for which the kernel is computed, each video segment $X$ is encoded by taking a Fourier transform for each of the periods in $P$, and subsequently applying the feature map $\varphi(\cdot)$ according to Eqn. 6. This process results in a tensor $\psi_0(X)$ with dimensionality $d \times (2M + 1) \times |P|$, where one of the axes is along the different periods.
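Two video features are compared for a set of time shifts $\{\delta_0, \ldots, \delta_i, \ldots\}$ by taking the dot products of Eqn. 10 for each period and then summing, resulting in a scalar score for each shift. Once a loss function $L_\delta$ is defined over the score obtained for a time shift $\delta$, its partial derivatives with respect to the learnable Fourier coefficients of each of the periods in $P$ are expressed from the derivatives
$$\frac{\partial K_\delta}{\partial a_0} = \langle \tilde{V}_0, \tilde{V}'_0 \rangle \tag{11}$$
and
$$\frac{\partial K_\delta}{\partial a_m} = \cos\Big(\frac{2\pi}{T} m \delta\Big) \big(\langle \tilde{V}_{m,c}, \tilde{V}'_{m,c} \rangle + \langle \tilde{V}_{m,s}, \tilde{V}'_{m,s} \rangle\big) + \sin\Big(\frac{2\pi}{T} m \delta\Big) \big(-\langle \tilde{V}_{m,c}, \tilde{V}'_{m,s} \rangle + \langle \tilde{V}_{m,s}, \tilde{V}'_{m,c} \rangle\big), \tag{12}$$
where we define $\tilde{V}_0$ and $\tilde{V}_{m,*}$ as $V_0 / \sqrt{a_0}$ and $V_{m,*} / \sqrt{a_m}$, respectively.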
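Because the score is linear in each coefficient once the unscaled components $\tilde{V}$ are fixed, its derivative with respect to $a_m$ is simply the term of frequency $m$. The sketch below checks this against central finite differences for a single period and before any normalization. The random arrays stand in for $\tilde{V}$ of two videos, and the function names are hypothetical.

```python
import numpy as np

def frequency_terms(V, W, delta, T):
    """Per-frequency contributions to K_delta; V, W have rows V_0, V_1c, V_1s, ..."""
    M = (V.shape[0] - 1) // 2
    w = 2.0 * np.pi * np.arange(1, M + 1) * delta / T
    Vc, Vs, Wc, Ws = V[1::2], V[2::2], W[1::2], W[2::2]
    even = np.sum(Vc * Wc + Vs * Ws, axis=1)
    odd = np.sum(Vs * Wc - Vc * Ws, axis=1)
    return np.concatenate([[V[0] @ W[0]], np.cos(w) * even + np.sin(w) * odd])

def scale(Vt, a):
    """Apply sqrt(a_0), sqrt(a_1), sqrt(a_1), sqrt(a_2), ... to the rows of Vt."""
    s = np.sqrt(np.concatenate([[a[0]], np.repeat(a[1:], 2)]))
    return s[:, None] * Vt

def score(a, Vt, Wt, delta, T):
    return frequency_terms(scale(Vt, a), scale(Wt, a), delta, T).sum()

# Random stand-ins for the unscaled components of two videos (toy sizes).
rng = np.random.default_rng(0)
M, d, T, delta = 4, 6, 100.0, 7.0
Vt = rng.normal(size=(2 * M + 1, d))
Wt = rng.normal(size=(2 * M + 1, d))
a = rng.uniform(0.5, 1.5, size=M + 1)

# Analytic gradient: the per-frequency terms computed on the unscaled components.
analytic = frequency_terms(Vt, Wt, delta, T)

# Numerical gradient by central differences, one coefficient at a time.
eps = 1e-4
numeric = np.array([
    (score(a + eps * e, Vt, Wt, delta, T) - score(a - eps * e, Vt, Wt, delta, T)) / (2 * eps)
    for e in np.eye(M + 1)
])
```

In an automatic-differentiation framework these gradients come for free; the explicit form is mainly useful for understanding what the layer learns.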
Normalizations. With the aim of reducing the interferences caused by the strong self-similarity present in videos, we apply two normalization steps which improve the alignment and retrieval performance of the descriptor. First, the $\tilde{V}_0$ and $\tilde{V}_{m,*}$ vectors are $\ell_2$-normalized, so that $\psi_0(X)$ becomes a concatenation of normalized vectors, each weighted by its corresponding coefficient. Then, we $\ell_2$-normalize $\psi_0(X)$ over its frequency axis. The norms computed in this stage are independent of $\delta$, so the video feature vector $\psi_0(X)$ can be normalized once and then shifted multiple times using trigonometric polynomials to compute the final scores.
Figure 2 reports an example of the scores obtained at different time shifts for two matching videos. As can be seen, long periods (T = 651s) fail to provide enough localization accuracy, while shorter periods (T = 16.9s) provide good localization but generate frequent false positives. The sum of the scores obtained with different periods increases localization accuracy while avoiding false positives.
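One possible reading of the two normalization steps is sketched below for a single period. It is an assumption, not a confirmed implementation detail, that "over its frequency axis" means taking the $\ell_2$ norm along the $2M+1$ frequency components separately for each descriptor dimension. The function name, the `eps` guard and the toy sizes are likewise hypothetical.

```python
import numpy as np

def normalize_descriptor(Vt, a, eps=1e-12):
    """Two-step normalization of one period's descriptor.

    Vt: (2M+1, d) unscaled components (rows V~_0, V~_1c, V~_1s, ...).
    a:  (M+1,) Fourier coefficients.
    """
    # Step 1: l2-normalize each component vector, then weight it by sqrt(a_m).
    rows = Vt / (np.linalg.norm(Vt, axis=1, keepdims=True) + eps)
    s = np.sqrt(np.concatenate([[a[0]], np.repeat(a[1:], 2)]))
    psi = s[:, None] * rows
    # Step 2 (assumed interpretation): l2-normalize along the frequency axis.
    return psi / (np.linalg.norm(psi, axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
M, d = 4, 6
Vt = rng.normal(size=(2 * M + 1, d))
a = rng.uniform(0.5, 1.5, size=M + 1)
psi = normalize_descriptor(Vt, a)
```

Neither step involves the shift $\delta$, so the normalized descriptor can be stored once per video and reused for every alignment hypothesis.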

Loss function
Ideally, kernel scores $K_\delta(\cdot, \cdot)$ should be higher for overlapping videos and lower for non-overlapping videos, so as to enhance the retrieval of similar or overlapping videos. At the same time, the layer should perform a precise localization, which corresponds to requiring that the kernel scores for a pair of overlapping videos are higher near the ground truth alignment point, and lower for incorrect alignment points.
Given a triplet of videos $(X_0, X^+, X^-)$, where $X^+$ overlaps with $X_0$ and $X^-$ does not overlap with $X_0$, we define a retrieval loss that enforces kernel scores to be globally higher for the overlapping pair than for the non-overlapping pair. This is done by placing a margin loss between the maximum of the kernel scores obtained when evaluating $(X_0, X^+)$ and $(X_0, X^-)$:
$$L_r = \max\big(0,\; m_r + K^*(X_0, X^-) - K^*(X_0, X^+)\big), \tag{13}$$
where $K^*(X, X')$ is the maximum of $K_\delta(X, X')$, i.e. $K^*(X, X') = \max_\delta K_\delta(X, X')$, and $m_r$ is the retrieval margin.
To enforce a correct localization inside the overlapping pair, we instead define a localization loss which imposes a margin between the kernel scores in a neighborhood of the correct alignment point $\delta^*$ and the kernel scores outside the neighborhood:
$$L_l = \max\big(0,\; m_l + K_{O(\delta^*)}(X_0, X^+) - K_{N(\delta^*)}(X_0, X^+)\big), \tag{14}$$
where $m_l$ is the localization margin, $\delta^*$ is the ground truth alignment point, $K_{N(\delta^*)}(X_0, X^+)$ is the maximum of kernel scores in a neighborhood $[\delta^* - r, \delta^* + r]$, and $K_{O(\delta^*)}(X_0, X^+)$ is the maximum of kernel scores outside this neighborhood.
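The two losses can be sketched as follows, assuming the standard hinge form of a margin loss. The margins and the 1/4 and 3/4 weights are the values quoted in the implementation details, while the function names and the toy score arrays are hypothetical.

```python
import numpy as np

def retrieval_loss(scores_pos, scores_neg, m_r=0.01):
    """Hinge on the best score of the positive pair vs. the negative pair."""
    return max(0.0, m_r + float(np.max(scores_neg)) - float(np.max(scores_pos)))

def localization_loss(scores_pos, deltas, delta_star, r, m_l=0.001):
    """Hinge between the best score near delta_star and the best score elsewhere."""
    near = np.abs(deltas - delta_star) <= r
    k_in = float(np.max(scores_pos[near]))    # best score inside the neighborhood
    k_out = float(np.max(scores_pos[~near]))  # best score outside the neighborhood
    return max(0.0, m_l + k_out - k_in)

# Toy scores over four candidate shifts (arbitrary values).
deltas = np.array([0.0, 1.0, 2.0, 3.0])
scores_pos = np.array([0.1, 0.2, 0.9, 0.3])  # K_delta(X_0, X+)
scores_neg = np.array([0.95, 0.1])           # K_delta(X_0, X-)

l_r = retrieval_loss(scores_pos, scores_neg)
l_l = localization_loss(scores_pos, deltas, delta_star=2.0, r=0.5)
total = 0.25 * l_r + 0.75 * l_l  # weighted combination of the two terms
```

In this toy case the positive pair is well localized (zero localization loss) but the negative pair scores slightly higher overall, so only the retrieval term is active.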

Learning with temporal proposals
To learn the parameters of the layer, we exploit a dataset of video sequences aligned on a global timeline. In this setting, we know which sequences overlap with which sequences, and we can build suitable training triplets.
Overlapping sequences can be very long and using the entire sequences would result in a reduction of the minibatch size (because of GPU memory limitations). On the other hand, using very short snippets would degrade the recognition performance of the layer and create inconsistencies between the train and test phases. The length of training snippets should be related to the longest period in $P$. In our case, we build training triplets made of 500-frame snippets (which at 15 fps amounts to 33.3 s).
To speed up convergence, we perform negative mining. At each epoch, we build a training triplet for each pair of overlapping videos contained in the dataset. The $X_0$ snippet is sampled randomly from one of the two videos, while the matching snippet $X^+$ is obtained by randomly sampling a sequence from the other video, with at least a 75% overlap with $X_0$. In this way, we guarantee that the ground truth alignment point is random, and that coefficients of long periods can be properly learned. To select a hard negative $X^-$, we sample a random snippet from each of 20 videos which do not overlap with $X_0$ and $X^+$, and select the one having the highest $K^*(X_0, \cdot)$ for the current set of weights.
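The proposal strategy above can be sketched as follows. This is only an illustration: videos are described by a start frame and a length on the global timeline, the rejection-sampling loop and the sign convention of `delta_star` are assumptions, and `toy_k_star` is a plain dot product standing in for the actual $K^*$ score.

```python
import numpy as np

def sample_positive_pair(a_start, a_len, b_start, b_len, rng,
                         snippet=500, min_overlap=0.75, max_tries=100):
    """Sample start frames (global timeline) of X_0 in video A and X+ in video B.

    Returns (s0, s1, delta_star); the sign of delta_star is an assumed convention.
    """
    slack = int(round((1.0 - min_overlap) * snippet))  # largest allowed start gap
    for _ in range(max_tries):
        s0 = int(rng.integers(a_start, a_start + a_len - snippet + 1))
        lo = max(b_start, s0 - slack)
        hi = min(b_start + b_len - snippet, s0 + slack)
        if lo <= hi:  # video B can host a snippet close enough to s0
            s1 = int(rng.integers(lo, hi + 1))
            return s0, s1, s1 - s0
    raise ValueError("videos do not overlap enough for the requested snippets")

def pick_hard_negative(x0, candidates, k_star):
    """Index of the candidate with the highest K*(x0, .) under the current weights."""
    return int(np.argmax([k_star(x0, c) for c in candidates]))

rng = np.random.default_rng(0)
s0, s1, delta_star = sample_positive_pair(0, 2000, 300, 1500, rng)
overlap = 500 - abs(s0 - s1)  # frames shared by the two 500-frame snippets

# Stand-in similarity on toy vectors, NOT the LAMV score.
toy_k_star = lambda u, v: float(u @ v)
x0 = np.array([1.0, 0.0])
negatives = [np.array([0.1, 0.9]), np.array([0.7, 0.3]), np.array([0.4, 0.6])]
hardest = pick_hard_negative(x0, negatives, toy_k_star)
```

In the setting described above, `candidates` would hold one random snippet from each of 20 non-overlapping videos, re-scored at every epoch.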

Multiple period design
The choice of the periods in $P$ influences both localization and recognition, as well as the maximum video length the network can process. When summing two periodic signals with periods $T_1$ and $T_2$, the resulting signal is periodic with period $T_1 \cdot T_2 / \gcd(T_1, T_2)$, where $\gcd(\cdot, \cdot)$ is the greatest common divisor. To increase the periodicity of $K_\delta(\cdot, \cdot)$ while preserving a sufficient choice between short and long periods, periods in $P$ are conveniently selected to be relatively prime. In this case, the period of $K_\delta(\cdot, \cdot)$ is $\prod_{T \in P} T$.
To design the set of periods, we run a coarse grid search on the Madonna dataset for video alignment. Since no feature learning is involved, findings can be applied to other video alignment datasets. Starting with a single long period (T = 14653 frames, equal to 977s) sufficient to cover the longest video in the dataset, we subsequently add shorter and relatively prime periods, by approximately scaling with a factor of 1.5, and test all combinations.
Figure 3 reports the localization accuracy obtained when matching each sequence in the dataset to the rest of the database. Given a query, we use the maximum of kernel scores $K^*(\cdot, \cdot)$ as a global similarity score to sort the remaining videos in the database, and then select the offset with the maximum score to compute the localization error.
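The localization measure described above reduces to counting the queries whose offset error falls below an acceptance threshold. A minimal sketch, assuming offsets expressed in frames at 15 fps and the thresholds used in the alignment tables; the function name and the toy offsets are hypothetical.

```python
import numpy as np

def localization_accuracy(pred_offsets, true_offsets, thresholds, fps=15.0):
    """Fraction of queries localized better than each threshold (in seconds)."""
    pred = np.asarray(pred_offsets, dtype=float)
    true = np.asarray(true_offsets, dtype=float)
    err_s = np.abs(pred - true) / fps  # frame error converted to seconds
    return {th: float(np.mean(err_s < th)) for th in thresholds}

# Toy offsets in frames for four queries (arbitrary values).
pred = [100, 207, 345, 900]
true = [100.75, 200, 300, 600]
acc = localization_accuracy(pred, true, thresholds=(0.1, 1.0, 10.0))
```

A query whose top-ranked video is not the right one will generally produce a large offset error, so the looser thresholds also reflect retrieval quality.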
Starting from the longest period, as shorter periods are added, the localization accuracy increases monotonically (solid lines). On the other hand, this increases the size of the final descriptor, so we investigate the choice of a subset of periods. Using only short periods leads to precise localization and insufficient recognition (an example is reported as a dashed line), while a combination of short and medium-long periods provides the same performance at a fraction of the size (solid line with markers). In the rest of the experiments, we will use this combination of four periods.
Discussion. Figure 4 compares the temporal kernels learned by our procedure on the Madonna dataset (further details are provided in Sec. 4) with those employed in TMK [26] and with a cross-correlation kernel. We report the cross-correlation kernel using the longest period of TMK, and for an increasing number of frequencies. For m = 64 this has the same size as the TMK and our descriptor. While the limited number of frequencies induces oscillations in the cross-correlation kernel, TMK avoids this phenomenon by using Von Mises kernels, which have flat responses outside the target bandwidth. Kernels learned by LAMV, in contrast, have shorter periods and stronger higher-frequency coefficients, which proves experimentally beneficial for matching and localization.

Experiments
We assess the performance of the proposed method in three settings: temporal video alignment, video copy detection and event retrieval. All can be cast as joint retrieval and localization tasks, in which, given a query video, we want to retrieve overlapping videos and precisely localize the query with respect to the retrieved videos. In the case of temporal video alignment, the same action is recorded from different cameras, while in video copy detection the transformation between matching videos is limited to 2D geometric and photometric distortions. In event retrieval, finally, the same event is captured in different videos which do not necessarily overlap, making this a higher-level task.

Datasets
Table 1 summarizes the datasets we use. The clips of the Madonna dataset [7] are decomposed into segments, and the segments are temporally aligned on a common timeline. The image matching involves challenging viewpoint changes and wildly different frame representations. To build train and test splits, we identify the connected components inside the dataset (i.e. sets of sequences that overlap temporally) and build five folds which do not cross different connected components. We then use five-fold evaluation on these, and evaluate the fraction of accurately aligned videos. Similar to Madonna, the Climbing dataset [7] contains 89 aligned videos from a rock climbing session. It features only one connected component, therefore we use it only for testing.
The VCDB dataset for copy detection [19] consists of clips from sharing sites. They are all copies, possibly partial, of one of 30 source clips (Kennedy assassination, Titanic fly scene, etc.). The manual annotation gives the exact extent of the overlapping part between each pair of clips. Most clips are quite easy to match automatically, but there are also difficult transforms like large overlays or film-from-screen copies. For evaluation, each clip is matched with all the remaining ones, and a segment-level version of precision and recall is computed, as defined in [19]. An additional set of 100k distractors is also provided by the same authors.
The EVVE dataset [28] contains clips that illustrate one out of 13 "events". The events can be news events (Flood in Thailand), or an event occurring at a specific location (Wedding of Kate and William), or a re-occurring event (eruption of the Strokkur geyser). The depictions can be exactly the same (for example, for the wedding, there is a single official video), or slightly different (different views of the same concert), or just have a common topic (the flood) that is hard to match visually. The evaluation is done with a retrieval protocol: there is a query/database split of the dataset and the result is evaluated in terms of mean average precision.
The YFCC100M dataset [29] contains 800,000 videos, whose annotations we ignore. We use it as a background set for unsupervised training.
Finally, VCD is a synthetic video copy dataset that we generated for training our layer on video copy detection and event retrieval. We combined pairs of videos from YFCC100M [29]. One of the videos is used as foreground and inserted in the other, used as background. The foreground video is clipped to a few seconds, resized and transformed geometrically (rotation, perspective transform, etc.) and photometrically (conversion to gray, low-quality encoding, etc.) in various random ways. The ground-truth alignment is recorded. The data and alignment are used to train the alignment quality on an independent dataset. We split the dataset in two equal parts for training and validation.

Implementation details
The video clips are decoded at a fixed frame rate of 15 fps. As frame descriptors, we employ MultiVLAD whitened descriptors [28] and vanilla RMAC [31]. RMAC is a pooling layer that extracts bounding boxes from an arbitrary activation map in a CNN stack, and pools them into a fixed-size vector. The CNN can be fine-tuned [12], but we found that a pre-trained CNN works just as well in a context where the type of images to match is not known in advance. RMAC requires an unsupervised training phase (to find the PCA matrix), which we run on YFCC100M [29]. In preliminary experiments, we found that extracting RMAC from the 29th activation map of a Resnet-34 [15] gives the best matching results, so we keep this setting throughout. We also tested with C3D features [32]. The localization and retrieval accuracy was not satisfactory with these techniques.
We build mini-batches with 128 triplets. We combine the retrieval loss $L_r$ and the localization loss $L_l$ with weights 1/4 and 3/4, respectively. The retrieval margin $m_r$ is set to 0.01, and the localization margin $m_l$ to 0.001. The radius $r$ is set to 1s. We train the network using SGD with Nesterov momentum 0.9 and a learning rate of 0.001.

Method                                   F1 score
Temporal Hough voting (SIFT+BoV) [19]    55.0
Temporal network (SIFT+BoV) [19]         60.0
Temporal network (AlexNet) [20]          65.0
TMK (RMAC) [26]                          67.4
LAMV, freq norm.                         68.7

Table 3: Evaluation on the VCDB dataset for video copy detection. The evaluation measure is the maximum F1 score on segment-level precision and recall measures [20].
The set of periods $P$ is set to {9767, 2731, 1039, 253} frames, which correspond to {651s, 182s, 69s, 17s}. When computing the TMK and the LAMV descriptor, the number of frequencies $M$ is always set to 16, so as to have comparable descriptor sizes.
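The arithmetic behind this period set is easy to verify. The short sketch below converts the periods from frames to seconds at 15 fps, checks that they are pairwise relatively prime, and computes the period of their sum as a least common multiple; variable names are illustrative.

```python
from itertools import combinations
from math import gcd

periods = [9767, 2731, 1039, 253]  # in frames
fps = 15

# Frames to seconds at the decoding frame rate.
seconds = [p / fps for p in periods]

# Relatively prime periods share no common factor, pair by pair.
pairwise_coprime = all(gcd(p, q) == 1 for p, q in combinations(periods, 2))

# Period of the summed kernels: least common multiple of all periods.
combined = 1
for p in periods:
    combined = combined * p // gcd(combined, p)
```

Since the periods are pairwise relatively prime, the combined period equals their product, which is far longer than any video in the benchmarks.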

Experimental results
Video alignment. We assess the localization and retrieval performance of our model on temporal video alignment by learning on Madonna with five-fold evaluation, and using MVLAD and RMAC descriptors. For each fold we use each sequence in the test set as a query against the remaining sequences in the same set. As in Section 3.4, we use the maximum of kernel scores to sort the set, and then select the offset with maximum score from the first retrieved sequence.
We compare LAMV against our reimplementations of TMK [26] and CTE [28] with 16 and 64 frequencies. The size of our descriptor is equal to that of TMK and of CTE with 64 frequencies. Table 2 reports the localization errors: our model attains the best localization accuracy using both descriptors, both for low and high localization errors in comparable settings. To validate the two-stage normalization proposed in Section 3.1, we also show the performance of LAMV when applying both or only one of the two normalizations. Using the combined normalization helps to localize videos with greater accuracy, and enhances the retrieval capabilities of the layer, as shown by the increased localization accuracy at higher thresholds.
We investigate the generality of the models learned for temporal video alignment by testing each of them on the Climbing dataset, which contains a different scenario. Averaged results are reported in Table 2 (right). Our method obtains higher localization accuracy and retrieval performance than the same baselines, and the effectiveness of the two-step normalization is confirmed in this setting as well. In Figure 5 we report a sample of challenging sequences taken from different points of view that are correctly aligned by our method.
Video copy detection. For video copy detection, we train on VCD using RMAC features, which show good invariance to copy detection transformations, and test on the recent VCDB dataset. Results are reported in Table 3. We compare with our reimplementation of TMK, and with three state-of-the-art proposals for copy detection: the temporal Hough voting and the temporal network proposed in [19] on local SIFT descriptors, and the temporal network using AlexNet features [20]. Temporal Hough voting aligns matched frames by means of a temporal Hough transform, while the temporal network uses a network flow optimization strategy. Both require storing frame-level descriptors for matching videos. LAMV attains the best F1 score reported on this dataset, and features a fixed-size video descriptor, independent of the video length. When testing with the large number of distractors from the VCDB+100K set, however, we observed that the performance of the temporal network [20] is still higher (58.9 vs 49.3 F1), even though LAMV also outperforms TMK in this setting (49.3 vs 35.5 F1).

Table 4 (fragment). Method: TMK [26]; mean: 51.6; mAP per category: 65.9, 37.5, 13.2, 43.9, 36. ...
Event retrieval. Finally, we apply our approach to event retrieval. We compare against the Mean-MultiVLAD (MMV), obtained by averaging and $\ell_2$-normalizing MultiVLAD frame descriptors, CTE [28], Stable hyper-pooling [6] and the recent Counting Grid Aggregation (CGA) [11]. LAMV, CTE and TMK are able to provide a good localization in addition to retrieval, whereas the others cannot.
To factor out the impact of the raw frame descriptor, we also report the values obtained by using the $\ell_2$-normalized mean RMAC descriptor, and run our reimplementation of TMK on RMAC features. As shown in Table 5, our method outperforms all the baselines it has been compared to, including CGA and TMK. We also evaluate the performance of LAMV when using average query expansion (AQE) [6]. In this setting, the top $N_1$ results are averaged to produce an augmented query, which is then used for retrieval. Overall, our method attains the best result reported on this dataset, both without query expansion and with AQE.
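Average query expansion as described above can be sketched on generic fixed-size descriptors. For simplicity this illustration ranks by plain dot product rather than by the LAMV score, and re-normalizing the augmented query is an assumption; the function names and toy database are hypothetical. A common variant also includes the original query in the average.

```python
import numpy as np

def l2n(v):
    """l2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def average_query_expansion(q, db, n1):
    """Average the top-n1 results of a first pass and query again with the result."""
    first_pass = db @ q
    top = np.argsort(-first_pass)[:n1]       # indices of the n1 best matches
    augmented = l2n(db[top].mean(axis=0))    # augmented query descriptor
    ranking = np.argsort(-(db @ augmented))  # final ranking of the database
    return augmented, ranking

# Toy database of four unit-norm descriptors (arbitrary values).
db = l2n(np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]]))
q = np.array([1.0, 0.0, 0.0])
augmented, ranking = average_query_expansion(q, db, n1=2)
```

In this toy case the augmented query moves towards the second database item, so the third item gains similarity that the original query did not give it.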
End-to-end training and performance. We tested end-to-end training of the architecture. In practice it did not give a significant improvement. This observation is common with videos, and can be explained by (a) the lack of real data for these tasks (feature learning is limited with artificially copied sequences), and (b) the structure of TMK and RMAC, which creates complex gradient paths, as also observed in prior works [12]. The matching, for each $\delta$ hypothesis, requires the computation of an inner product between frame-level features, which is comparable to CTE. In terms of memory consumption, LAMV is $|P| = 4$ times larger than CTE if using the same number of frequencies, but provides a significant localization accuracy boost.

Conclusion
We presented a learnable descriptor based on temporal match kernels. It can be learned with a triplet loss function designed to improve its performance when comparing and temporally aligning videos. Experimental results, conducted on temporal video alignment, video copy detection and event retrieval, show that our approach beats the state of the art on all three tasks with a significant margin.

Figure 1: We present a learnable temporal layer that compares and precisely aligns videos by means of multi-period temporal kernels parametrized in the Fourier domain.

Figure 2: Response of the individual filters (top) when matching a video with a temporally-cropped excerpt of the same video. The bottom figure shows the combination of the responses. The ground truth alignment point is $\delta^* = 1000$.

Figure 3: Fraction of correct alignments as a function of the acceptance threshold for several combinations of periods.

Figure 4: Comparison between a cross-correlation kernel, the temporal kernels proposed in the paper by Poullot et al. [26] and those learned in LAMV.

Figure 5: Example of a sequence correctly aligned by LAMV on the Climbing dataset. Each column corresponds to temporally aligned frames (2 frames per second are represented).

Table 1: Characteristics of the datasets.

Table 2: Evaluation on the Madonna (left) and Climbing (right) datasets for temporal video alignment. The evaluation measure is the percentage of queries localized better than a threshold (0.1s, 1s, 10s).

Table 4: Evaluation for event retrieval (mAP on EVVE). The ordering of categories is the same as in the EVVE paper [28].

Table 5: Comparison with the state of the art for event retrieval (mAP on EVVE).