Time2Feat: learning interpretable representations for multivariate time series clustering

Clustering multivariate time series is a critical task in many real-world applications involving multiple signals and sensors. Existing systems aim to maximize effectiveness, efficiency, and scalability, but fail to guarantee the interpretability of the results. This hinders their application in critical real-world scenarios where human comprehension of algorithmic behavior is required. This paper introduces Time2Feat, an end-to-end machine learning system for multivariate time series (MTS) clustering. The system relies on inter-signal and intra-signal interpretable features extracted from the time series. A dimensionality reduction technique is then applied to select a subset of features that retains most of the information, thus enhancing the interpretability of the results. In addition, domain experts can semi-supervise the process by providing a small number of MTS with a target cluster. This further improves both accuracy and interpretability, narrowing down the number of features used by the clustering process. We demonstrate the effectiveness, interpretability, efficiency, and robustness of Time2Feat through experiments on eighteen benchmark time series datasets, comparing it with state-of-the-art MTS clustering methods.


INTRODUCTION
Time is a dimension that affects many aspects of real-world and digital-world phenomena. Physical environments, industrial machinery, healthcare monitoring, and economic and financial activities are a few examples of scenarios whose elements are regulated by and evolve over time. Multivariate time series (MTS), i.e., datasets with more than one time-dependent signal, are widely used data artifacts for encoding collections of sequential observations over the temporal axis. MTS analytics includes supervised and unsupervised tasks, ranging from classification and clustering to pattern discovery, forecasting, and exploration. Cluster analysis recently gained momentum in many applications and use cases where sensors collect massive amounts of data points.
Research on clustering time series has mainly focused on univariate time series (UTS), i.e., datasets with a single time-dependent variable, addressing issues related to the development of similarity measures to cluster the data (e.g., Dynamic Time Warping - DTW [11,14,22], K-Shape [30]). In contrast, research on MTS is still at an early stage. Existing proposals adapt clustering approaches designed for UTS to MTS after applying dimensionality reduction techniques. Examples of such techniques (CSPCA [19] and Mc2PCA [18]) are based on Principal Component Analysis (PCA), which converts a set of correlated features in the high-dimensional space into a set of uncorrelated features in a low-dimensional space. Nevertheless, the resulting clusters suffer from poor explainability, as the original dimensions are lost. More recently, approaches based on Deep Neural Networks (DNNs) [49], and in particular Variational Autoencoders [13,23], have been used to generate MTS encodings before applying clustering methods. Although these solutions might exhibit high performance, the resulting clusters are based on latent dimensions that remain unexplainable to the end-users. Limited interpretability can hamper the adoption of a clustering technique in critical real-world scenarios, where experts are asked to provide detailed and trustworthy explanations of their algorithms' recommendations [4,27,32,35,37,40].
We introduce Time2Feat, an open-source system for MTS clustering that adopts an end-to-end semi-supervised feature-based pipeline. Features are automatically extracted from the signals composing the MTS. We exploit both intra-signal features, characterizing the single signals of the MTS, and inter-signal features, measuring the pairwise relatedness (in terms of similarity and correlation) of multiple signals through interpretable metrics. For the extraction of the intra-signal features, we rely on the tsfresh library [6,7], which generates features describing the MTS signals according to statistical perspectives (Distribution, Correlation, Information Theory, etc.). Two dataset-dependent techniques are then introduced to select the most important features among the ones describing the MTS. The unsupervised mode is an entirely automatic approach based on Principal Feature Analysis (PFA) [21]. The semi-supervised mode relies on users' annotations of small dataset samples to improve the selection process. Our extensive experimental analysis shows that the number of features is reduced by two orders of magnitude while preserving the accuracy of the results. Finally, a clustering technique is applied to group the MTS.
Time2Feat is scalable: it computes clusters efficiently regardless of the main dimensions of the problem (i.e., the number of MTS in the dataset, the number of generated clusters, the number of signals composing each MTS, and the length of the time series), as the experiments in Section 5.3.1 demonstrate. Time2Feat provides interpretable features for the cluster representations. The meaning and the properties associated with interpretable features have not been clearly agreed upon in the literature. Like many other approaches, we consider features interpretable if humans can understand what they refer to [52], and we regard conciseness as one of the main properties to be satisfied by a set of interpretable features [16,29,31]. In Time2Feat, we rely on PFA [21] to select a concise number of features among the ones computed by tsfresh. These measures can be interpreted by experts who know the statistical measures used to summarize the time series values. Leveraging interpretable features, users can conduct an in-depth analysis to understand why MTS share the same cluster. They could, for instance, measure the value similarities among the features of MTS in the same cluster, or apply techniques for evaluating feature importance during cluster generation.
The key contributions of this paper are summarized as follows. (1) An interpretable and efficient end-to-end clustering system for multivariate time series: Time2Feat provides a suite of clustering pipelines that leverage MTS features to make the user aware of the results and internals of the clustering process, while at the same time preserving the efficiency of the process. (2) A human-in-the-loop clustering system allowing for learning-based annotations: Time2Feat allows domain experts to provide small and controllable amounts of labels as input to the process in order to improve the accuracy of the clusters and tame the size of the extracted feature sets. The feature reduction process further enhances the scalability and interpretability of the system. (3) A comprehensive evaluation: we evaluate Time2Feat by benchmarking our pipeline against 8 state-of-the-art MTS clustering systems on a collection of 18 underlying datasets [3], reporting their quality and computational performance.
The paper is organized as follows. Section 2 presents a motivating real-world scenario for the usage of the Time2Feat system. Section 3 describes the steps of our clustering pipeline, while Section 4 presents the implementation of Time2Feat. Section 5 presents our experimental setup on real-life and benchmarking data. Section 6 discusses the related work. Finally, Section 7 concludes our work.

MOTIVATING REAL-WORLD SCENARIO
The BasicMotions dataset is a real-world dataset belonging to the UEA multivariate time series classification archive [3]. This dataset describes four kinds of activity (i.e., playing badminton, running, standing, and walking) performed by students, recorded through two sensors (an accelerometer and a gyroscope) installed in their smartwatches. The sensors gather data in a three-dimensional space, thus producing three different signals (X, Y, Z). The overall dataset comprises 80 MTS, and each signal includes 100 recordings.
Suppose we are asked to analyze the dataset and no details on the activities that the MTS describe are provided to us. This lack of information frequently happens in business scenarios where trade secrets, or simply the costs of labeling, make it necessary to work with unlabeled datasets. Clustering is one of the main exploration techniques we can apply on unlabeled data. Generating clusters for MTS is a non-trivial task. From a data structure perspective, they are third-order tensors: a dataset includes many MTS, each one containing multiple signals, and each signal is composed of several timestamps. From a numerical perspective, it is frequent to work with datasets composed of thousands of MTS and tens of signals with thousands of records (see, for example, the datasets used in the experiments in Section 5). This problem gives rise to a first challenge to address: (C1): Analyzing MTS datasets requires the application of scalable techniques capable of dealing with the high dimensionality of the data.
We address this challenge by proposing Time2Feat, which computes the clusters based on features extracted from the signals of the MTS. This operation allows us to reduce the dimensionality of the problem: from the many timestamps constituting the time series to the single values of the features. We rely on external specialized software libraries to extract intra- and inter-signal interpretable features from the MTS. The former describe particular properties of the signals in isolation (e.g., the mean value, the autocorrelation, etc.). The latter evaluate pairs of signals, measuring for example distances and correlations. Feature extraction can give rise to a large number of features describing the same MTS under different (but possibly also close) perspectives. For example, this operation generates 4842 features overall for the BasicMotions dataset. The high dimensionality could both cause inefficiencies in the generation of clusters and lead to clusters that are difficult for users to interpret. Such a large number of features would make the clusters non-interpretable by humans, who would not understand what the clusters represent and why the data points were grouped together. This problem introduces a second challenge: (C2): providing an interpretable clustering technique is of paramount importance for data analysis. State-of-the-art approaches for MTS clustering [13,18,19] suffer from low interpretability. Time2Feat addresses this problem by relying on a reduced number of interpretable features. In particular, Time2Feat adopts a mechanism for feature selection based on PFA, which ranks the features according to their importance in the process and selects only the meaningful ones. As Figure 1a shows, PFA reduces the 4842 initially extracted features to 46.
Although Time2Feat does not rely on a specific clustering technique, the number of clusters to generate is another critical parameter to select. We adopt the well-known Elbow Method to automatically compute the number that best fits the computed statistical measures. The application of the Elbow Method to BasicMotions generates 4 clusters, shown as blue circles in Figure 1c (t-SNE [48] was applied to reduce the dimensionality). We point out that this result is obtained through a completely automatic unsupervised procedure, where the pipeline starts with the BasicMotions dataset and generates 4 clusters based on 46 features. Data analytics processes are typically the result of several iterations in which users gain more and more insights from the data, enabling them to apply deeper analytical functions. This intuition is the third challenge to address: (C3): clustering techniques for data analytics need to put the human in the loop.
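The Elbow Method can be sketched in a few lines; this is a minimal illustration rather than the Time2Feat implementation, and the `elbow_k` helper and its second-difference criterion (picking the point where the inertia curve bends most) are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def elbow_k(X, k_max=8, random_state=0):
    """Pick the k where the inertia (within-cluster sum of squares)
    curve bends most sharply, approximated here by the maximum
    second-order difference of the inertia sequence."""
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=random_state)
                .fit(X).inertia_ for k in range(1, k_max + 1)]
    second_diff = np.diff(inertias, 2)          # curvature proxy
    return int(np.argmax(second_diff)) + 2      # second diffs start at k=2

# four well-separated synthetic clusters as a stand-in for the feature matrix
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=42)
k = elbow_k(X)
```

Other elbow criteria (e.g., distance from the line joining the first and last inertia values) are equally common; any of them can feed the number of clusters to the pipeline.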
Time2Feat addresses this challenge by supporting a semi-supervised procedure that allows users to select samples of elements representative of the clusters they want to generate. Our experiments demonstrate that selecting (i.e., labeling) a few elements per cluster improves the accuracy and significantly reduces the number of features adopted by the clustering technique, thus improving their interpretability. Going back to our example, suppose that a user decides to manually provide four elements per cluster. For instance, the user can analyze some elements from the top-right cluster of Figure 1c and observe that they refer to people playing badminton. Indeed, the high variance of the acceleration recorded for the cluster elements (as shown in Figure 1a), caused by the sudden and irregular movements in badminton, clearly supports this decision. The user can also observe the values of the acceleration variance for the elements of the second top-right cluster, and decide that the cluster refers to people running. Through a similar analysis, the user can notice that the variance of the other clusters suggests less intensive activities, typical of walking and standing people. Time2Feat exploits the user annotations by further reducing the number of features used for the clustering (from 46 down to 3), as shown in Figure 1b, and improving the accuracy of the clusters.
Finally, we point out that the cluster analysis offered by Time2Feat is exceptionally flexible and facilitates precise and fruitful data exploration. For instance, if the user asked for two clusters instead of four, Time2Feat would derive a first cluster with the elements of the two clusters representing the badminton and running activities, and a second cluster with the elements representing the standing and walking activities.

THE CLUSTERING PIPELINE
In the pipeline, we consider both intra-signal features F, describing each signal constituting the MTS in isolation, and inter-signal features F′, describing pairs of signals, as defined below. Given the large scale of the feature sets, the feature selection step is devoted to pruning features with null values or low variance and ranking the remaining ones, so as to keep only the ones meaningful for the clustering step. This yields significant and semantically rich features, thus achieving interpretability and scalability. Unlike raw data points, features are interpretable and understandable for end-users. The third pipeline step then executes clustering, as defined below.

Definition 3.5 (Clustering MTS). Given a multivariate time series dataset D and a set of clusters C with cardinality k, the goal of MTS clustering is to map a cluster to each series in D. This corresponds to defining a surjective function f : D → C that maps each time series T ∈ D into a cluster c ∈ C.

THE TIME2FEAT SYSTEM
Time2Feat implements the components of the data analysis pipeline as illustrated in Figure 2. Time2Feat takes an MTS dataset D as input, along with the number of clusters to generate (provided by the user or via some heuristic). It can run in unsupervised mode, i.e., no further input is required, or in semi-supervised mode, i.e., the users specify a subset of clustered samples. The three steps of the pipeline are described in the following subsections.

Feature Extraction
The goal is to generate an exhaustive representation of an MTS dataset via a large spectrum of features, each describing the MTS signals (in isolation or in pairs). Intra-signal Features Extraction. The computation of statistical features describing the signals of the MTS relies on the tsfresh library [6,7], already used in many time series analysis tasks [36,38,50]. Each of the 700+ features computed by tsfresh encodes the signal description from the perspective offered by a specific analysis method, such as Distribution Analysis, Statistical Analysis, etc. In this way, the features are interpretable for users who know how the statistical measure summarizes the time series values. Lines 2-6 of Algorithm 1 show a simplified procedure where a nested for loop iterates over the MTS and the intra_feature_extraction function generates the features for each composing signal. In the actual implementation, we leverage the efficient parallelization of the feature extraction function provided by tsfresh, which can compute features in batches of univariate series. Inter-signal Features Extraction. Many works [20,42] highlight the importance of inter-signal relationships in the analysis of time series. Nevertheless, they typically extract the features employing neural network architectures, obtaining uninterpretable descriptions. We adopt a straightforward approach by conceiving inter-signal features as measures of the relatedness (in terms of similarity and correlation) between pairs of signals, which we compute through 8 metrics (e.g., correlation, Euclidean distance, etc.). The pipeline first generates the pairs of signals per time series (lines 7-10 of Algorithm 1), then applies (line 12 of Algorithm 1) the inter_feature_extraction function in charge of the extraction.
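To make the two feature families concrete, the following is a minimal stand-in for Algorithm 1: the helper names and the handful of statistical features are illustrative (the real system delegates the intra-signal part to tsfresh's 700+ extractors and uses 8 inter-signal metrics):

```python
import numpy as np
from itertools import combinations

def intra_features(signal):
    """A few statistical features of one signal (stand-in for tsfresh)."""
    return {"mean": float(signal.mean()), "std": float(signal.std()),
            "abs_energy": float(np.sum(signal ** 2)),
            "autocorr_lag1": float(np.corrcoef(signal[:-1], signal[1:])[0, 1])}

def inter_features(a, b):
    """Pairwise relatedness of two signals: correlation and distance."""
    return {"pearson": float(np.corrcoef(a, b)[0, 1]),
            "euclidean": float(np.linalg.norm(a - b))}

def extract(mts):
    """mts: array of shape (n_signals, n_timestamps) -> flat feature dict."""
    feats = {}
    for i, s in enumerate(mts):                    # intra-signal loop
        for name, v in intra_features(s).items():
            feats[f"s{i}__{name}"] = v
    for i, j in combinations(range(len(mts)), 2):  # inter-signal pairs
        for name, v in inter_features(mts[i], mts[j]).items():
            feats[f"s{i}_s{j}__{name}"] = v
    return feats

rng = np.random.default_rng(0)
f = extract(rng.normal(size=(3, 100)))  # one MTS with 3 signals of length 100
```

Running `extract` over every MTS in the dataset yields one feature vector per series, which is what the selection and clustering steps below consume.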

Feature Selection
The feature extraction procedure generates a large number of features per dataset. Reducing the dimensionality of such a representation improves the interpretability and increases (as the experiments in Section 5.1 show) the performance of the clustering procedure.
As a first operation, in line 1 of Algorithm 2, we clean the feature matrix by removing all zero-variance features and the features with missing or infinite values (they would be useless for cluster generation). If Time2Feat is running in semi-supervised mode, the available labels are used to rank the relevance of the features and identify a subset capable of generating clusters. We rely on the Analysis of Variance (ANOVA) [44] to compute the p-value associated with each feature and quantify its significance. Then, we apply a grid search identifying the subset of features that maximizes the quality of the generated clusters. To evaluate the quality, we use the Homogeneity Score [34], the Adjusted Mutual Information (AMI) [28], and the Adjusted Rand Index [45], which are specific measures evaluating the agreement and similarity of pairs of cluster elements. The joint application of ANOVA and grid search is referred to as auto_anova_selection in line 3 of Algorithm 2. Finally, both in the presence and in the absence of labels, we apply the PFA technique to select the most meaningful features. We observe that PFA not only guarantees conciseness but also diversity of the features by choosing "the principal features which retain most of the information in the sense of maximum variability of the features in the lower dimensional space" [21].
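A minimal sketch of this selection step, assuming a numeric feature matrix X (one row per MTS, one column per feature): the `clean` and `pfa_select` helper names are ours, and the PFA variant below follows the common recipe from [21] of clustering the PCA loading vectors and keeping one representative feature per cluster:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def clean(X):
    """Line 1 of Algorithm 2: drop features with missing/infinite
    values or zero variance."""
    X = X[:, np.all(np.isfinite(X), axis=0)]
    return X[:, X.std(axis=0) > 0]

def pfa_select(X, n_features, var_threshold=0.9, random_state=0):
    """Principal Feature Analysis: cluster the PCA loading vectors of
    the standardized features and keep, per cluster, the feature whose
    loadings are closest to the cluster centroid."""
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=var_threshold).fit(Xs)   # keep 90% of variance
    loadings = pca.components_.T                    # one row per feature
    km = KMeans(n_clusters=n_features, n_init=10,
                random_state=random_state).fit(loadings)
    selected = []
    for c in range(n_features):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(loadings[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dist)]))
    return sorted(selected)

rng = np.random.default_rng(1)
base = rng.normal(size=(50, 4))
# eight features: four informative plus four near-duplicates
X = clean(np.hstack([base, base + 0.01 * rng.normal(size=(50, 4))]))
idx = pfa_select(X, n_features=4)   # indices of the retained features
```

In semi-supervised mode, the labeled samples would first rank the features (e.g., via ANOVA F-test p-values as computed by `sklearn.feature_selection.f_classif`) before PFA is applied to the surviving subset.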

Clustering
Time2Feat can work with any clustering algorithm. In Section 5, we show that, among the experimented approaches, the hierarchical technique achieved the best accuracy. Concerning the number of clusters, the Time2Feat system leverages state-of-the-art heuristics (e.g., the well-known Elbow Method) or user preferences. Finally, the clustering operation includes a normalization step that avoids the dominance of features with large-scale domain ranges.
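A sketch of this final step, assuming a selected feature matrix F; the `cluster_features` helper is an illustrative name, and hierarchical clustering is just the best-performing choice from Section 5, since any algorithm can be plugged in:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

def cluster_features(F, n_clusters):
    """Normalize the feature matrix so that large-scale features do not
    dominate, then apply hierarchical (agglomerative) clustering."""
    Fn = StandardScaler().fit_transform(F)
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(Fn)

rng = np.random.default_rng(2)
# two well-separated groups in feature space; one feature on a huge scale
F = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
F[:, 0] *= 1000
labels = cluster_features(F, 2)
```

Without the normalization step, the first (rescaled) feature would dominate the Euclidean distances; after standardization, all three features contribute equally and the two groups separate cleanly.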

EXPERIMENTAL EVALUATION
The evaluation addresses four main research questions: the effectiveness of the generated clusters (RQ1), the conciseness and interpretability of their representation (RQ2), the efficiency of the pipeline (RQ3), and its robustness (RQ4). Datasets and Baselines. We selected 18 benchmark datasets from the UEA multivariate time series classification archive [3]. For each dataset, Table 1 reports the number of MTS (V), the number of signals (D), the length (T) of the series, and the number of clusters (k) into which the MTS can be grouped according to the baselines. In addition, we computed the overall number of elements in the dataset (V·D·T), which provides a yardstick for measuring the scalability of the approach. Finally, we estimate the complexity of generating the clusters by computing the number of elements per MTS (D·T). Intuitively, the lower this value, the lower the ability to extract descriptive features. The datasets represent different scenarios, as their overall number of elements V·D·T spans three orders of magnitude, and D·T ranges from 16 elements for the PD dataset to 10000 for SW. We compared Time2Feat with eight approaches. Hierarchical, KMeans, and Spectral are straightforward applications of these classical clustering techniques to MTS datasets. CSPCA and Mc2PCA introduce a PCA-based mechanism to reduce the data dimensionality before clustering. DETSEC and IT-TSC leverage neural networks 2 : the former creates embeddings for the series through autoencoders, the latter combines a multi-path neural network with variable association graphs to determine the importance of the signals for each cluster. Finally, we created a variant of KMeans that uses DTW to measure the similarity between two temporal sequences. We refer the reader to Section 6 for an extensive discussion of these baselines. Setup. The experiments are executed on a machine with a 12-core Intel Xeon processor, 64GB of RAM, and 324GB of local (SSD) storage. The machine runs Ubuntu 18.04. All experiments have been executed ten times, and the average result plus standard deviation is reported (whenever significant).
2 Our technical report [2] includes experiments with other neural network techniques.

Effectiveness
We evaluated the effectiveness of Time2Feat by adopting the AMI [28] to measure the accuracy of the generated clusters with respect to the baselines. The AMI evaluates to 1 when the two clusterings are identical, and to roughly 0 (negative values are allowed) for random partitions. Table 2 shows the results of this experiment. Time2Feat has been evaluated by executing the unsupervised mode (column T2F0) and by simulating the semi-supervised mode through stratified random samples composed of 20% (column T2F2), 40% (column T2F4), and 50% (column T2F5) of labels per cluster from the baseline datasets. The remaining columns show the competing approaches (discussed in Section 6). Among them, Hierarchical, KMeans, and Spectral can be considered reference baselines for their simplicity. Discussion. The experiment results clearly show that Time2Feat outperforms its competitors. In particular, in the unsupervised mode, the accuracy of the clusters generated by Time2Feat is higher than the other approaches in 11 out of 18 datasets; in 3 of them, it obtains the best accuracy score. By providing 20% of labels per cluster, Time2Feat outperforms the other approaches in 13 datasets (obtaining the best accuracy score in 2 datasets). The performance generally improves by adding more labels, as in the configuration T2F5, where Time2Feat outperforms the other approaches in 15 out of 18 datasets (showing the best accuracy value in 9 out of 18 datasets). This experiment helped us derive the following insights: (1) Time2Feat's pipeline is highly effective, as at least one configuration of Time2Feat outperforms the other approaches in all datasets except UW, where it performs slightly worse than some competing approaches. The reason is that UW describes trajectories, where one kind of trajectory is the composition of two others; the features extracted by Time2Feat cannot distinguish these three different movements. (2) Time2Feat is highly scalable, as it obtains high accuracy both for small datasets (the ones at the top of Table 2) and for large ones (the ones at the bottom of Table 2). These results do not hold for the competitors, whose accuracy drops as the number of elements in the dataset increases (and in some cases, marked / in Table 2, no cluster is generated due to timeout or memory exceptions). (3) The semi-supervised procedure improves the accuracy: by labeling a small number of elements per dataset, the accuracy steadily increases.
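The boundary behavior of the AMI used throughout this evaluation (1 for identical clusterings, roughly 0 for random ones, invariance to a relabeling of the clusters) can be verified with scikit-learn's implementation; this is a small sanity check, not part of Time2Feat:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

truth = [0, 0, 1, 1, 2, 2] * 20            # 120 ground-truth labels
same = adjusted_mutual_info_score(truth, truth)
# AMI is invariant to a permutation of the cluster labels
permuted = adjusted_mutual_info_score(truth, [(l + 1) % 3 for l in truth])
rng = np.random.default_rng(0)
random = adjusted_mutual_info_score(truth, rng.integers(0, 3, len(truth)))
```

Unlike raw mutual information, the adjusted variant corrects for chance agreement, which is why a random partition scores near 0 (and occasionally slightly below it) regardless of the number of clusters.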

Interpretability
We provide a measure of the interpretability of the clusters by analyzing the number of features that Time2Feat uses for their computation. A limited number of features facilitates human comprehension, and conciseness is one of the main properties of interpretable features (see Section 6). The column All in Table 3 shows the overall number of features extracted after the feature extraction step of the pipeline (Section 4.1). The other columns report the number of features retained in the unsupervised mode (column T2F0) and with increasing levels of supervision, as in the previous experiment. The values represent the average number of features across ten runs of each experiment. Discussion. The feature extraction generates a large number of features that increases with the number of elements per dataset (the Pearson correlation coefficient, Pcc, is 0.61) and with the number of elements per MTS (Pcc = 0.39). The unsupervised approach drastically reduces the selected features while maintaining a strong correlation (Pcc = 0.51) with the overall number of features. An increase in supervision does not always correspond to a reduction in features. We explain this as a sort of "overfitting" that achieves better accuracy results by adding features. However, we observe that the number of features retained in all semi-supervised settings remains small enough to be managed by humans. Finally, the experiment shows the importance of the inter-signal features. Despite their low number, the extracted inter-signal features are preserved in almost all Time2Feat settings, thus showing their importance in the clustering process.

Efficiency
We evaluate the efficiency of our approach by computing the overall time required to complete the pipeline (Section 5.3.1), evaluating the time breakdown (Section 5.3.2), and introducing a simple heuristic to optimize the parallelism of the feature extraction (Section 5.3.3).

Time performance.
Table 4 shows the maximum time to complete the cluster computations for all the datasets over the 10 repetitions of the experiment. We show only the time measured in the unsupervised mode (T2F0): semi-supervision does not change the value significantly. The last row shows the average time computed over all datasets (excluding the ones raising exceptions).
Discussion. The clustering techniques that reach the best time performance are KMeans and Hierarchical, which finish the pipelines in a few seconds. However, this comes at the cost of a loss in accuracy and interpretability. CSPCA is also fast, but the algorithm cannot handle large datasets, where time and memory exceptions occur and the accuracy is poor. The other approaches report an average time at least an order of magnitude greater than Time2Feat and, in some cases, time and memory exceptions. Finally, we observe that Time2Feat's execution time ranges from a few seconds to a few thousand seconds, with the performance correlated with the overall number of elements (Pcc = 0.75).

Time breakdown.
We analyze the breakdown of the computation time (Figure 3a) into the main pipeline components (feature extraction, feature selection, and cluster generation). Discussion. The time required for extracting the features dominates the other components: it takes between 88% and 99% of the overall time needed to complete the pipeline. The average time to complete the feature extraction is around 337 seconds and that of feature selection is 9 seconds, whereas clustering amounts to 1 second. The correlation between the time spent in clustering and the number of MTS V is strong (Pcc = 0.99). This is why in three datasets (PD, LS, and PS, the ones with the largest V), the clustering time takes more than 1 second but less than 8 seconds. Feature extraction and selection show a different behavior: they are correlated with the overall number of elements in the dataset (Pcc is more than 0.95 for both tasks). To further confirm these trends, we also used 27 synthetic datasets generated by means of the GRATIS tool [15], varying the number of MTS V ∈ (100, 1000, 2500), the number of signals D ∈ (2, 8, 16), and the length of the series T ∈ (100, 1000, 2000). The results (omitted for space reasons and reported in [2]) show that, by fixing V and varying D and T, the only time that increases is that of feature extraction, while the other times (feature selection and clustering) remain constant. In conclusion, despite the increase, the approach remains scalable overall for several datasets, whereas for larger ones alternative strategies can be devised, as shown in the next experiment.

Workload balancing.
We evaluate a straightforward heuristic to improve the time performance by balancing the computational workload across the processors. Feature extraction performs the computation using batches of time series, which are not balanced by default. Time2Feat allows balancing the workload by customizing the number of batches per dataset: the total number of MTS (V) is divided by the number of available processors, rounding up to the next integer. Figure 3b shows the time reduction obtained. Discussion. By balancing the workload across the processors, the time performance improves in almost all datasets (the average time computed over all datasets decreases from 348 to 242 seconds). In 5 datasets (AF, BM, Ep, SW, and HM), the time reduction is more than 60%. In only two cases (PD and LS) does the heuristic not help, with the time slightly increasing.
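The heuristic reduces to one line of arithmetic; a sketch (the helper name is ours, and the example reuses BasicMotions' 80 MTS and the 12-core experimental machine from earlier sections):

```python
import math

def batch_size(n_mts, n_processors):
    """Workload-balancing heuristic: split the dataset so that each
    processor gets one batch, i.e. ceil(V / processors) series each."""
    return math.ceil(n_mts / n_processors)

# BasicMotions (80 MTS) on a 12-core machine
size = batch_size(80, 12)           # series per batch
n_batches = math.ceil(80 / size)    # resulting number of batches
```

With 80 series and 12 processors this yields batches of 7 series, producing 12 batches so that no processor sits idle while another works through an oversized default batch.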

Robustness
This section assesses the robustness of the pipeline components by evaluating: the importance of feature selection (Section 5.4.1); alternative options to the hierarchical algorithm for performing the final cluster computations (Section 5.4.2); and the importance of the features in the clustering task (Section 5.4.3).

5.4.1 Importance of feature selection. We evaluate how the accuracy (AMI) in Figure 4a, the interpretability (number of features) in Figure 4b, and the efficiency in Figure 4c vary with and without feature selection along the pipeline.
Discussion. The experiment shows that removing feature selection generally has a large impact on accuracy and interpretability. The AMI decreases markedly in almost all datasets when the clusters are computed with all features. Moreover, feature selection reduces the number of features by two orders of magnitude. Finally, the time required for performing the feature selection is negligible, especially compared to the feature extraction step, as shown in Figure 3a. In summary, feature selection improves the accuracy and the interpretability, and only slightly affects the time efficiency of the pipeline.

5.4.2 Importance of the clustering technique.
We experimented with 3 techniques (Hierarchical, KMeans, Spectral) for generating the clusters, as shown in Table 5.For each technique, we computed the AMI of the clusters obtained with three settings: the unsupervised procedure (T2F 0 ), the semi-supervised procedure with 20% and 50% labeled elements per cluster (T2F 2 and T2F 5 , respectively).
Discussion. The results show that Hierarchical clustering obtains the best performance, with KMeans achieving similar accuracy. Spectral clustering is competitive only on the largest datasets, where it achieves the best results, though in line with the other approaches.
5.4.3 Importance of the features in the clustering task. This experiment evaluates whether a feature-based clustering approach is more effective than an approach based on raw data. To this end, we run Time2Feat in the unsupervised mode, performing the clustering computation with the same techniques used in the previous experiment (Hierarchical, KMeans, and Spectral), and we compare the accuracy obtained (in terms of AMI) with that obtained by applying the same clustering techniques to the raw datasets.
Discussion. The results of the experiment, reported in the technical report [2], show that the feature-based clustering approach usually obtains the best performance. Only in three datasets (PD, UW, and S1) does the approach based on raw data perform slightly better.

Lessons Learned
We conclude by pinpointing how our feature-based clustering pipeline addresses the aforementioned questions.
(RQ1) Thanks to the features, we gain in effectiveness. The experiment in Section 5.1 demonstrates that Time2Feat provides more accurate clusters than its competitors. The ablation test in Section 5.4.3 further confirms that feature-based clustering techniques generate more effective results than techniques using raw data.
(RQ2) The cluster representations are concise. The experiment in Section 5.2 shows that Time2Feat relies on a small number of features for generating the clusters, and shows the importance of the inter-signal features retained during cluster generation.
(RQ3) Feature-based clustering achieves a trade-off between accuracy and performance. Time2Feat achieves the best accuracy in the majority of the datasets along with good performance. The approach is among the fastest ones and runs in seconds, thus making it efficiently usable for batch analyses (Section 5.3). The time breakdown of the pipeline components shows that the feature extraction phase is the most expensive (Section 5.3.2). Nevertheless, heuristics for optimizing the workload balance, e.g., with respect to the available processors (Section 5.3.3), can be quickly developed to reduce the overall execution time considerably.
(RQ4) The pipeline is robust and scalable. The pipeline is highly modular and thus scalable with respect to the specificities of real-world environments. Our robustness analysis shows the importance of all pipeline components. Feature selection (Section 5.4.1) improves both accuracy and interpretability. The comparison of clustering techniques shows the adaptiveness of the pipeline, allowing for striking a balance between accurate results in small and large datasets depending on the use case at hand (Section 5.4.2).

RELATED WORK
Clustering of multivariate time series. Dimensionality reduction is one of the research questions addressed by previous work on MTS clustering. Principal component analysis (PCA) [5,17,39,41,43] has been adopted to transform MTS into a new dimensional space and find the most critical features representative of the original MTS. Among these approaches, we recall Covariance Sequence-based Principal Component Analysis (CSPCA) [19], which builds a matrix representing the pairwise covariance of the MTS; a PCA-based transformation is then applied to the matrix to reduce the dimensions, followed by clustering techniques. A similar technique [18] alternates a feature transformation based on Common Principal Component Analysis (CPCA) with KMeans-based clustering until the reconstruction error is small. The main problem of PCA-based approaches is the poor explainability of the generated clusters, which build on latent-space dimensions with no semantics for the end users. Building time series representations (embeddings) with neural networks is another line of research for generating clusters [1,9]. DETSEC [13] is a state-of-the-art embedding solution based on an encoder-decoder architecture built with GRU components. IT-TSC [49] is a neural network approach that builds variable association graphs for cluster creation. Approaches based on neural networks have shown high performance in detecting clusters, but they suffer from poor interpretability. Moreover, a large body of work has been devoted to univariate time series (UTS) clustering (such as Dynamic Time Warping (DTW) [26], KShape [30], and FeatTS [46,47]). These methods are not directly applicable to MTS clustering due to the inherent differences between univariate and multivariate time series, leading to poor scalability.
Interpreting the clusters. Providing insights for cluster membership is a real need in many scenarios [4,32,37,40]. Explainable AI (XAI) is currently one of the hottest topics [8,24], usually approached in two ways [10,51]: 1) by exploiting post-hoc analysis or 2) by designing intrinsically explainable systems. Time2Feat falls into the second category, being based on interpretable features. More specifically, interpretability of clustering techniques is typically addressed: (1) by applying dimensionality reduction techniques (e.g., PCA) to visualize clusters in two or three dimensions; (2) by identifying the centroid or a selected set of points to represent the cluster [33]; and (3) by relying on an interpretable model (usually a decision tree) that learns how to classify the generated clusters [4,12,32,37]. In our work, we mainly deal with interpretable representations that can support users in understanding the clusters' contents. While there is no consensus in the literature on the meaning of interpretable features [52] or on what properties interpretable features should satisfy [25], several approaches indicate conciseness as one of the key properties for interpreting algorithm behaviors [16,29,31].
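Approach (3) can be sketched in a few lines: a shallow decision tree is trained to re-classify the clusters produced by any algorithm, exposing the feature thresholds that separate them. The data and feature names below are hypothetical; only the standard scikit-learn API is used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Hypothetical interpretable features for 40 series.
feature_names = ["mean_signal_1", "variance_signal_1", "corr_signal_1_2"]
X = np.vstack([rng.normal([0.0, 1.0, 0.8], 0.1, size=(20, 3)),
               rng.normal([2.0, 1.0, 0.1], 0.1, size=(20, 3))])

# Any clustering algorithm produces the labels to be explained.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A shallow tree re-classifies the clusters, exposing the feature
# thresholds that separate them.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, clusters)
print(export_text(tree, feature_names=feature_names))
```

The printed rules (e.g., a single threshold on one named feature) give a post-hoc, human-readable account of cluster membership, whereas Time2Feat obtains interpretability intrinsically from the features themselves.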

CONCLUSION
We have presented an end-to-end feature-based clustering pipeline for multivariate time series, leveraging state-of-the-art machine learning components and making them interact with each other. We have empirically studied Time2Feat under the lens of its effectiveness, interpretability, efficiency, and robustness, comparing it with existing clustering methods on several real-world and benchmarking datasets. The results show that the combination of interpretable features and weakly labeled MTS leads to better quality and explainability of the obtained clusters.

(b) The features extracted in the semi-supervised mode. (c) Clusters generated with the unsupervised mode.

Figure 2
Figure 2 depicts the main components of our MTS clustering pipeline. The pipeline can be formally defined as follows.
Definition 3.1 (Multivariate Time Series). A multivariate time series M is a set of univariate time series (a.k.a. signals). In particular, M = (S_1, S_2, ..., S_m), where m is the number of signals and S_i = (s_1, s_2, ..., s_n) is a time series of length n. More generally, a multivariate time series can be represented as a matrix in R^{n×m}, where the signals are described as column vectors.
Definition 3.2 (Multivariate Time Series Dataset). A dataset D of multivariate time series is a set of N multivariate time series, D = (M_1, M_2, ..., M_N). A dataset is represented as a tensor in R^{N×n×m}.
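Definitions 3.1 and 3.2 map directly onto dense arrays; a minimal sketch with NumPy (shapes chosen arbitrarily for illustration):

```python
import numpy as np

# Symbols from Definitions 3.1-3.2: N series, length n, m signals.
N, n, m = 5, 100, 3
rng = np.random.default_rng(0)

# One MTS: an n x m matrix whose columns are the signals S_1..S_m.
M = rng.normal(size=(n, m))
assert M[:, 0].shape == (n,)   # a single univariate signal (column vector)

# A dataset: an N x n x m tensor stacking N such matrices.
D = rng.normal(size=(N, n, m))
assert D[0].shape == (n, m)    # D[i] recovers the i-th multivariate series
```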
RQ1 How effective is Time2Feat in solving MTS clustering tasks (Section 5.1); RQ2 To what extent are the representations of the generated clusters interpretable (Section 5.2); RQ3 How efficient is the cluster computation (Section 5.3);

Figure 4: Removing the Feature Selection from the pipeline.
(a) Excerpt of the features extracted in the unsupervised mode.
Algorithm 1: feature_extraction
Input: D ∈ R^{N×n×m}, a multivariate time series dataset.
Output: F, the matrix of extracted features.
1  F ← [ ];                              // list of extracted features
2  foreach M ∈ D do                      // for each MTS in the dataset
3      foreach S_i ∈ M do                // for each signal in the MTS: intra-signal features
4          F ← F ∪ intra_signal_features(S_i);
5      foreach (S_i, S_j) ∈ M, i < j do  // for each pair of signals in the MTS: inter-signal features
6          F ← F ∪ inter_signal_features(S_i, S_j);
7  return F;

Algorithm 2: feature_selection
Input: F, the matrix of extracted features; labels, optional labels.
Output: the matrix of signals and top features.
// Extract the best features with PFA
...
return F;

The three phases of the pipeline are implemented by the feature_extraction function (described in Section 4.1), the feature_selection function (in Section 4.2), and the cluster function (in Section 4.3).
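The loop structure of feature_extraction can be sketched in Python; the two helper feature functions below are hypothetical stand-ins (the actual system draws on a much richer set of interpretable intra- and inter-signal features):

```python
import numpy as np

def intra_signal_features(signal):
    # Hypothetical stand-ins for interpretable per-signal features.
    return {"mean": float(np.mean(signal)), "std": float(np.std(signal))}

def inter_signal_features(s_i, s_j):
    # Hypothetical pairwise feature relating two signals.
    return {"pearson_corr": float(np.corrcoef(s_i, s_j)[0, 1])}

def feature_extraction(D):
    """D: N x n x m tensor (Definitions 3.1-3.2); returns one feature row per MTS."""
    rows = []
    for M in D:                          # for each MTS in the dataset
        feats = {}
        m = M.shape[1]
        for i in range(m):               # intra-signal features
            for name, value in intra_signal_features(M[:, i]).items():
                feats[f"s{i}__{name}"] = value
        for i in range(m):               # inter-signal features (signal pairs)
            for j in range(i + 1, m):
                for name, value in inter_signal_features(M[:, i], M[:, j]).items():
                    feats[f"s{i}_s{j}__{name}"] = value
        rows.append(feats)
    return rows

D = np.random.default_rng(0).normal(size=(4, 50, 2))
F = feature_extraction(D)
print(len(F), len(F[0]))  # 4 rows; with 2 signals: 4 intra + 1 inter = 5 features
```

Naming each feature after the signal(s) it is computed from is what keeps the resulting representation interpretable for the downstream selection and clustering steps.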

Table 1: The datasets evaluated in the experiments: the number of MTS, the number of signals, the length of the series, the number of classes in the ground truth, the number of elements per dataset, and the number of elements per MTS.
RQ4 How robust is the pipeline, i.e., to what extent do the components in the pipeline contribute to the task (Section 5.4).

Table 2: Effectiveness (AMI). In bold, the best value per dataset. ↑ marks the Time2Feat settings that outperform all competing approaches.

Table 5: Accuracy (AMI) varying the clustering techniques. In bold, the best result per dataset.