Detection and classification of sensor anomalies for simulating urban traffic scenarios

Sensor network infrastructures are widely used in smart cities to monitor and analyze urban traffic flow. Starting from punctual information coming from traffic sensor data, traffic simulation tools are used to create the "digital twin" mobility data model that helps local authorities to better understand urban mobility. However, sensors can be faulty, and errors in sensor data can be propagated to the traffic simulations, leading to erroneous analysis of the traffic scenarios. Providing real-time anomaly detection for time series data streams is highly valuable since it enables automatically recognizing and discarding or repairing sensor faults in time-sensitive processes. In this paper, we implement a data cleaning process that detects and classifies traffic anomalies, distinguishing between sensor faults and unusual traffic conditions, and removes sensor faults from the input of the traffic simulation model, improving its performance. Experiments conducted on a real scenario for 30 days have demonstrated that anomaly detection coupled with anomaly classification boosts the performance of the traffic model in emulating real urban traffic.


Introduction
Anomaly investigation is an important element in many application domains such as fault detection, privacy and cybersecurity, communication networks, and social media. In a complex system, regular behavior allows us to investigate the general characteristics of the system, while anomalies let us discover and analyze unexpected behavior, such as significant events and patterns that otherwise could not be discovered. For this reason, anomaly detection and classification are universally recognized tasks and have become very important in data analysis [1].
In the smart city context, through the deployment of Internet of Things (IoT) technologies, many aspects of the urban environment can be monitored in real-time, such as mobility, pollution, parking, waste, and lighting. Although the technology has evolved greatly, drastically lowering costs and widening the variety of available sensors, sensors are still prone to malfunctions. Therefore, the big sensor data streams generated in real-time need to be handled with appropriate techniques to detect erroneous measurements instantly.
Analysis of traffic data is an essential component of intelligent transportation system applications, crucial for smart cities. Traffic data collected through sensors such as induction loop detectors often contain anomalies, e.g. due to malfunctioning detectors or anomalous traffic conditions. Such anomalies can heavily affect the results of subsequent analyses like traffic flow analysis, monitoring, and prediction. There are several challenges regarding anomaly detection, including the absence of a universal definition of anomaly, the change of traffic patterns over time, and the unavailability of labeled data. Traffic models are essential tools for road traffic analysis, the simulation of urban mobility, and the deployment of intelligent transportation system applications. Through a traffic model, real-time simulations can be performed using the traffic sensors' data as input to emulate the traffic flow in the entire urban context. However, the delay between the acquisition of sensor data and the generation of the model output should be reduced as much as possible. Therefore, the detection of anomalies in the input has to be done in a short time.
In this paper, a methodology for anomaly detection and classification on traffic time series data streams is proposed. This method does not assume labeled historical data and can be applied in real-time. Since observations coming from traffic sensors are often used as input of customized traffic models that simulate urban traffic flows in real-time, this paper also examines and discusses the improvement provided by the anomaly detection and classification method by comparing the traffic model outputs obtained by considering or excluding sensor faults. The method is exemplified and evaluated by applying it to real traffic data collected through loop detectors installed in a medium-sized city's road network. Employing the proposed methodology can increase the accuracy of sensor observations, ease the learning of different traffic patterns, and improve the performance of traffic simulations.
The rest of this document is structured as follows. Related works are introduced in Sect. 2, while Sect. 3 outlines the context of the use cases. The methodology, introduced in Sect. 4, is composed of: filtering and detection of anomalies (Sect. 5), classification of anomalies into sensor faults and unusual traffic conditions (Sect. 6), and the creation of a traffic model for running simulations (Sect. 7) that take advantage of the anomalies identified and classified in the previous steps. The experimentation on a real scenario is reported and discussed in Sect. 8. Section 9 sketches conclusions and future directions.
This work extends the conference paper [2] that introduced a novel data cleaning process to detect anomalies in real-time traffic data streams, exploiting a traffic sensor network. In this paper, the data cleaning process, which previously contained only anomaly detection, is extended with an anomaly classification phase where anomalies are classified into sensor faults and unusual traffic conditions; then, only observations classified as sensor faults are removed from the input of the traffic model. Several changes and extensions have been made to the article w.r.t. the previous conference paper. In particular, Sect. 3 is enriched with two new subsections (3.3 and 3.4) to help the reader better understand the features of the traffic data on which the anomaly detection and classification method is applied; in Sect. 4, the proposed methodology is enhanced with an anomaly classification phase where the anomalies detected by an anomaly detection algorithm based on STL are classified. In subsection 5.3, a new version of the anomaly detection process, which applies RobustSTL and the inverse function of the logarithm to the residual values and finally normalizes the obtained values using the Robust Scaler, is provided and compared to the original one presented in [2]. A new Sect. 6 is inserted to describe the classification of anomalies into sensor faults and unusual traffic conditions. Finally, Sect. 8 has been extensively enriched and restructured into several subsections; new experiments have been conducted to test the anomaly classification (subsection 8.3), and new traffic simulations have been run to demonstrate that eliminating anomalies after having classified them and distinguished between sensor faults and unusual traffic conditions further improves the results of the traffic model (subsection 8.4).

Related work
With IoT devices pervading our everyday life, we see an exponential increase in the availability of data streams. As is known, these devices may exhibit abnormal behavior; thus, different techniques to perform data cleaning on sensors' data have been discussed, classified, and compared.
In the literature, we find supervised [3], unsupervised [4], and semi-supervised algorithms. However, several anomaly detection methods were originally created for processing data in batches and are unsuitable for real-time streaming applications.
In [5], constraint-based algorithms are used with a focus on speed constraints. Then, time series anomaly detection algorithms are proposed: ARIMA (Autoregressive Integrated Moving Average), LSTM (Long Short-Term Memory), DBSCAN (Density-Based Spatial Clustering), and GANs (Generative Adversarial Networks). An interesting improvement of DBSCAN is the combination of ST-BOF (Spatio-Temporal Behavioral Outlier Factor) and ST-BDBCAN (Spatio-Temporal Behavioral Density-Based Clustering of Applications with Noise) [6]. Time series anomaly detection can be performed with several techniques: statistical approaches, machine learning algorithms [7], and deep neural networks. In [8], 20 different univariate anomaly detection methods from the above-mentioned three categories are compared and evaluated on publicly available datasets. Also, Seasonal-Trend decomposition using Loess (STL) is exploited for anomaly detection combined with other methods, such as the Interquartile Range (IQR). In [9], a two-layer outlier detection approach is proposed: firstly, the non-stationarity and periodic variation of the time series are studied, then observable variables in the environment are used to explain any additional signal variation. The authors of [10] combine STL decomposition with the SARIMA model to detect anomalies in non-periodic time series. Finally, the approach described in [11] combines STL decomposition with extended isolation forest. Our method integrates the use of STL decomposition with a constraint-based filter on the data coming from the sensors. This approach is missing in the aforementioned methods. The filter identifies anomalous values that would not be recognized as anomalies by the STL decomposition technique.
Explaining the causes of anomalous sensor readings is a tricky issue. An interesting approach that tries to classify sensor faults is ClariSense [12], an automated anomaly clarification service with the ability to explain sensor anomalies using social network feeds from Twitter. The authors demonstrated the feasibility of explaining the causes of traffic anomalies by identifying unusual social network feeds that are correlated with each anomaly in time and space. In [13], an accurate and transferable accident detection approach is described. The detection methodology is based on the relationship between traffic variables and observed traffic accidents. The authors employ a deep learning-based method calibrated using part of the collected traffic variables and the pre-assigned traffic accidents. This methodology is not able to detect sensor faults but is focused on the discovery of traffic accidents.
Correlated anomaly detection, especially from streaming data, is a cutting-edge topic and an essential task in several real-time data mining applications. In [14], a framework is introduced for better detection of correlated anomalies from large streaming data of various correlation strengths. The experiments show balanced recall and estimated accuracy for the framework, which was applied to the U.S. stock daily price dataset. In the traffic context, the correlation among traffic sensors to detect anomalies has been analyzed in [15]. The authors exploited a statistical baseline method, along with a sensor correlation analysis. They evaluated the approach by comparing the detected anomalies against traffic alerts, which are emitted by Traffic Agents on Twitter.
Another possible approach is described in [16]; in this study, faulty readings from traffic sensors are identified by examining the correlations among them and by taking advantage of crowdsourced data provided by citizens. The authors evaluate the cross-correlation between sensors using, firstly, the Pearson metric, and then employ a multivariate ARIMA model to detect anomalies considering correlated sensors. In our work, we use a different correlation metric and employ correlation to perform anomaly classification once the anomalies have already been detected with our combined anomaly detection methodology.

Context
This section is devoted to describing the urban context and the traffic sensor network in the city of Modena. Moreover, a statistical overview of the traffic data is provided, including an analysis of the stationarity of the traffic sensors' time series.
The discussed use case consists of the traffic sensor network of Modena, a medium-sized Italian city of 184,727 inhabitants with a population density of 1,017 inhabitants/km² and more than 900 km of public roads. Data coming from around 400 traffic sensors are given as input to a customized traffic model that simulates traffic flows on the whole road network in real-time. The improvement provided by the data cleaning process has been evaluated through a comparison between the model outputs obtained by considering or excluding sensor faults.

Sensor network
Smart cities employ sensors to share information with the public, businesses, city managers, and other smart systems. For traffic management, there are plenty of sensors, exploiting different technologies, able to determine the number of vehicles traversing the city streets (e.g. induction loop, preformed loop, pressure sensor, radar, video). Induction loops, introduced 50 years ago, are the most commonly used sensors and are still present in many cities. An induction loop sensor consists of wire "coiled" to form a loop that is installed into or under the surface of the roadway. In Modena, induction loop detectors are spread over different locations, usually near traffic lights. These sensors collect traffic data (i.e. the number of vehicles and the average speed) with different frequencies according to the provider of the data. In Modena, we have two data providers for traffic information: the sensors located in the urban area are managed by the city council and send data every minute, while the sensors on provincial and regional roads are owned by the Region and their data are distributed by a regional company with a frequency of 15 min.
Sensors' data are collected and exploited to emulate real routes of vehicles in a traffic model [17][18][19]. The Modena sensor map displays all the traffic sensors available in the city of Modena; the ones in green are used as input to the traffic model. The others, excluded from the input of the traffic model, might be unreliable sensors (i.e. they obtain only a few measurements in a day interval, or most of their measurements are zero values) or sensors located in streets outside the urban area where we run the traffic model.

Sensor data collection
The sensors' data coming from the two data providers are collected in real-time into a PostgreSQL database [18,20]. The database exploits two extensions: PostGIS, to handle geospatial data, and Timescale, to perform better with huge amounts of data. From September 2018 till now (June 2021), the database has collected more than 457 million observations recorded by urban traffic sensors in Modena. For each observation, the database stores a record composed of the identifier of the sensor, the flow measured by the sensor, the speed, the type of the vehicles (only for provincial and regional sensors), and the time slot of the measurement. The process to parse the data coming from the traffic sensors and store them has been implemented in Python.

Statistical overview of traffic sensors observations
In order to detect anomalies in data streams, a statistical analysis of the data distribution can help to determine the best methodology for the use case. Considering all the measurements from October 2018 till May 2020, we discovered that 33 out of 400 sensors did not provide any measurement. Then, the observations provided by the remaining sensors were aggregated every 15 min, and for each sensor the median, mean, standard deviation, and interquartile range (IQR) were evaluated. As a result of this evaluation, 35 sensors were considered malfunctioning since their mean, median, and IQR of traffic flow were all zero. Moreover, some sensors have an IQR of traffic flow equal to zero: this means that there is no difference between the first and third quartiles, and all the measurements have a very similar value. This is not the regular behavior of a traffic sensor, whose measurements are supposed to vary during the day; thus, the 6 sensors with IQR equal to zero were excluded. Furthermore, for 8 sensors the rate of measurements with a flow value equal to zero was above 90%; thus, they cannot be considered reliable. To summarize, the total number of untrustworthy traffic sensors is 49, in addition to the 33 not working sensors, as reported in Table 1. These sensors were not included in the data cleaning procedure and their data were not given as input to the traffic model. The number of reliable sensors is 318.
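The reliability screening described above can be sketched with pandas. This is an illustrative sketch, not the paper's implementation: the column names, the toy data, and the exact flagging rule (IQR equal to zero, or more than 90% zero readings) are assumptions reconstructed from the text.

```python
# Illustrative reliability screening of traffic sensors, assuming `df`
# holds 15-min aggregated flow counts per sensor (column names are ours).
import pandas as pd

df = pd.DataFrame({
    "sensor": ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4,
    "flow":   [10, 30, 25, 5,   0, 0, 0, 0,   7, 7, 7, 7],
})

stats = df.groupby("sensor")["flow"].agg(
    median="median", mean="mean", std="std",
    iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
    zero_rate=lambda s: (s == 0).mean(),
)

# Flag sensors whose IQR is zero (constant readings, e.g. s2 and s3)
# or whose readings are zero more than 90% of the time.
untrustworthy = stats[(stats["iqr"] == 0) | (stats["zero_rate"] > 0.9)]
print(sorted(untrustworthy.index))
```

In the real pipeline the same aggregation would run over the full October 2018 to May 2020 history, yielding the 49 untrustworthy sensors reported in Table 1.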
The Table in Fig. 1 shows some statistics of the distribution of IQR, median, and standard deviation considering only the 318 reliable traffic sensors. The median of the traffic flow values has an average value of 30 vehicles in 15 min; however, there are sensors with a median equal to zero and sensors with a median equal to 132. Observing the graph showing the distribution of the median, the majority of sensors have a traffic flow median value between 0 and 25. This happens because the time series of traffic flow measurements also include the night period, where the traffic flow values are all near zero and the standard deviation and the IQR are significantly reduced. Besides, the graphs show that the distribution of median, IQR, and standard deviation values is right-skewed: the mean is higher than the mode. Thus, there are few sensors with high values and more sensors with lower values.

Stationary study
If we think about vehicle traffic in a city, many trends come to mind: a daily trend, a weekly trend, a similar trend on working days that differs at the weekend, and also seasonal trends. Here, we introduce the concept of stationarity and study the stationarity of the time series provided by the traffic sensors in Modena.
In a stationary time series, each measurement reflects a system in a steady state. A series X_t is called "stationary" if, loosely speaking, its statistical properties (mean, variance, and covariance) do not change over time [21]. A stationary time series does not exhibit trends, time-based patterns, or periodic fluctuations (seasonality). There are three types of stationarity: (1) a strict stationary series satisfies the above-mentioned definition of stationarity; (2) a trend stationary series exhibits a trend that, if removed, makes the resulting series strict stationary; finally, (3) a difference stationary series can be transformed into a strict stationary series by differencing. Before applying any prediction model to a non-stationary time series, the time series has to be converted into a strict stationary series. For this reason, we studied the stationarity of our time series. The easiest way to do this is by differencing, which means computing the difference of consecutive terms in the series, following the formula Y_i = X_i - X_{i-1}, where i is the time instant and X_i is the value of the time series at instant i.
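The first-order differencing Y_i = X_i - X_{i-1} mentioned above is a one-liner with NumPy; the toy series here is illustrative.

```python
# First-order differencing of a short toy series: each output element is
# the difference between consecutive input elements.
import numpy as np

x = np.array([10, 12, 15, 14, 18])
y = np.diff(x)        # [x[1]-x[0], x[2]-x[1], x[3]-x[2], x[4]-x[3]]
print(y)              # [ 2  3 -1  4]
```

The differenced series has one element fewer than the original, since there is no predecessor for the first observation.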
The most popular tests to check whether a time series is stationary or non-stationary are the ADF test and the KPSS test. The ADF test is a statistical test and, in particular, a unit root test that aims at determining how strongly a time series is defined by a trend. A unit root is a characteristic of a time series that makes it non-stationary; besides, the number of unit roots in a time series corresponds to the number of differencing operations required to make the series stationary. Since the ADF test is a statistical test, there are two hypotheses to be tested: a null hypothesis (H0) and an alternative hypothesis (H1). The null hypothesis is that the time series is not stationary and has some time-dependent structure that can be represented by a unit root.
The alternative hypothesis is that the time series is stationary. The p-value obtained by the ADF test is used to interpret the result of the test, i.e., whether to reject the null hypothesis or not. In statistics, the p-value is a probability score that establishes the significance of an observed effect; in other words, it is the probability of seeing the effect E when the null hypothesis is true (p-value = P(E | H0)). If the p-value is lower than a threshold, the time series is stationary (H0 rejected); otherwise, it is non-stationary (H0 not rejected). Typically, the threshold is set to 0.05. The KPSS test, instead, studies the stationarity of the time series around a deterministic trend. A time series exhibits a deterministic trend if the slope of the trend does not change permanently; even if the series goes through a shock, it tends to regain its original path. The difference from the ADF test is that the null hypothesis of the KPSS test is the stationarity of the series. Therefore, the p-value must be interpreted oppositely: if the p-value is lower than the threshold, the series is non-stationary.
We decided to apply both tests to be sure of the stationarity of the time series. In some cases, the results of the two tests could conflict with each other. In particular, if the result of the KPSS test is "stationary" and the one of the ADF test is "non-stationary", then the series is trend stationary; on the other hand, if the result of the KPSS test is "non-stationary" and the one of the ADF test is "stationary", the series is difference stationary.
Our goal is to study the stationarity of the time series of traffic data in Modena to find the best way to look for anomalies. By way of example, Fig. 2 shows the number of vehicles measured by one traffic sensor; the data are related to the period from 1st to 8th April 2019 and are grouped over 15 min. Looking at the curve, it is straightforward to identify the weekly and daily trends. To check the stationarity of these time series, we applied both the ADF and the KPSS tests to the 15-min aggregated measurements of April 2019; we exploited the statsmodels package in Python. Then, we compared the p-values of each daily time series to the commonly used significance threshold of 0.05. All the p-values are far less than the threshold; therefore, the time series are classified as stationary, as expected.

Proposed methodology
Our goal is to perform data cleaning on traffic observations by identifying anomalies among the data coming from the traffic sensors, in order to exclude them from the input of the traffic simulation model. As represented in Fig. 3, the methodology consists of several steps: (1) filtering observations with an anomalous flow-speed correlation (these observations are stored as "filtered" in the database, as described in more detail in Sect. 5.1), (2) anomaly repairing by replacing "filtered" observations with average values, (3) anomaly detection through STL decomposition following two different approaches, (4) classification of anomalies into sensor faults and unusual traffic conditions, (5) generation of the input for the traffic simulation model by removing sensor faults, and (6) running the model. The effects of the data cleaning will be investigated by comparing the performance of the traffic model when excluding anomalies from the input data or including them.

Filtering and anomaly detection
In this Section, we describe the first three steps of the proposed methodology, namely filtering, anomaly repairing, and anomaly detection of Fig. 3.

Flow-speed correlation filter
The values of flow and speed provided by the traffic sensors are strongly correlated with each other. Assuming that a sensor is located in a single lane, as in our case, there is a maximum number of vehicles that can pass over a sensor with a certain average speed in a fixed time interval. This number is provided by the following formula: num_vehicles = (speed * 1000) / (vehicle_length + safe_distance), where speed is the average speed (in km/h) provided by the sensor, vehicle_length is the average length (in meters) of different types of vehicles, and safe_distance is the safe driving distance that should be kept between every pair of vehicles based on the vehicle speed (calculated as speed/3.6). We set the value of vehicle_length to 4.
The value of num_vehicles represents the maximum number of vehicles in one hour based on the average speed of the vehicles. The upper bound for allowed values of flow is obtained by dividing the value of num_vehicles by 60 or by 4, based on the time interval to which the observation is related (1 min or 15 min, respectively). If the flow provided by the sensor is higher than this number, the observation is considered an anomaly and marked as "filtered"; therefore, it will not be considered in the following steps.
We apply this filter in real-time; that is, every time an observation is stored in our database, we add a mark "filtered" or "non-filtered" to it.
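The filter above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names and the zero-speed handling are assumptions, and the bound of speed * 1000 / (vehicle_length + safe_distance) vehicles per hour is reconstructed from the constants given in the text (vehicle_length = 4 m, safe_distance = speed / 3.6).

```python
# Illustrative flow-speed correlation filter. The physical bound assumed
# here: in one hour a stream of traffic covers speed*1000 metres, and each
# vehicle occupies vehicle_length + safe_distance metres of road.

def max_flow(speed_kmh, interval_min):
    """Upper bound on vehicles passing over the sensor in the interval."""
    vehicle_length = 4.0                  # average vehicle length (m)
    safe_distance = speed_kmh / 3.6       # metres covered in 1 s at this speed
    num_vehicles = speed_kmh * 1000.0 / (vehicle_length + safe_distance)
    return num_vehicles * interval_min / 60.0

def label_observation(flow, speed_kmh, interval_min):
    if speed_kmh <= 0:
        return "filtered"                 # assumption: no speed, no valid bound
    if flow > max_flow(speed_kmh, interval_min):
        return "filtered"
    return "non-filtered"

# 200 vehicles in one minute at 50 km/h is physically impossible:
print(label_observation(200, 50, 1))      # -> filtered
# 40 vehicles in 15 minutes at 50 km/h is plausible:
print(label_observation(40, 50, 15))      # -> non-filtered
```

Dividing the hourly bound by 60 or by 4 reproduces the per-minute and per-15-min thresholds described in the text.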

Anomaly repairing
The majority of sensors provide measurements every minute, and the filtering is applied at this time interval. Then, data need to be aggregated every 15 min to be used by the traffic model. For this reason, once sensor measurements have been labeled as filtered, they cannot simply be removed from the traffic sensor observations: to evaluate the aggregated flow over a bigger time interval, flow values are summed up, so removing an observation would be equivalent to assuming that zero vehicles passed in that time interval. This assumption is not correct, so an alternative solution is needed. Since the measured value in a filtered observation is not reliable, it is substituted considering the values observed in the proximal time interval by the same sensor. We decided to consider a time interval of 15 min. The flow of a filtered observation is thus replaced with the average of the reliable (non-filtered) flows measured by the sensor in the same time interval.
The aggregated speed is more difficult to evaluate; since the measure provided by the sensor is an average speed, we assume that filtered observations have a speed equal to the weighted average speed evaluated considering only the non-filtered measurements in the 15-min time interval. This anomaly repairing technique cannot be applied when fewer than 2 reliable observations are available for a sensor in the 15-min time interval. In this case, the aggregated flow and speed of that 15-min time interval are classified as anomalies, and they are not used in the following steps.
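The repairing step can be sketched as below. This is a minimal sketch under our own conventions: observations are (flow, speed, is_filtered) tuples for one 15-min window, the flow weighting of the average speed is an assumption consistent with "weighted average speed", and the function name is illustrative.

```python
# Illustrative repair of a 15-min window: filtered flows are replaced by
# the average of reliable flows, filtered speeds by the flow-weighted
# average of reliable speeds. Fewer than 2 reliable observations -> the
# whole window is treated as anomalous (returned as None).

def repair_window(observations):
    """observations: list of (flow, speed, is_filtered) tuples."""
    reliable = [(f, s) for f, s, filtered in observations if not filtered]
    if len(reliable) < 2:
        return None
    avg_flow = sum(f for f, _ in reliable) / len(reliable)
    total_flow = sum(f for f, _ in reliable)
    weighted_speed = (sum(f * s for f, s in reliable) / total_flow
                      if total_flow > 0 else 0.0)
    return [(avg_flow, weighted_speed) if filtered else (f, s)
            for f, s, filtered in observations]

# One implausible minute (flow 500) among three reliable ones:
window = [(10, 40, False), (500, 42, True), (14, 44, False), (12, 41, False)]
repaired = repair_window(window)
print(repaired)
```

Summing the repaired flows then gives a 15-min aggregate that does not treat the filtered minute as zero vehicles.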

Anomaly detection through STL decomposition
The Seasonal-Trend decomposition using Loess (STL) is a filtering procedure for decomposing a time series into three components: the trend, the seasonal, and the remainder (also called residual) [22]. Splitting the time series into components can help in identifying anomalies. The additive decomposition considers the time series model described by the following formula: y_t = τ_t + s_t + r_t, t = 1, 2, ..., N, where y_t represents the observation at time t, τ_t is the trend component, s_t is the seasonal signal with period T, and r_t is the remainder component.
The trend component consists of the low-frequency variations in the data, with non-stationary, long-term changes, i.e. a continuous increase or decrease. The seasonality, instead, is composed of the periodic variations (patterns) in the data near a baseline. Typically, the seasonal component changes faster than the trend. The remaining variations are included in the residual.
The decomposition of the time series is obtained by applying the locally-weighted regression (Loess smoother) several times, iteratively. The result of these applications is a curve representing a smoothing of the original time series, computed taking into account the value of a variable number of observations in the neighborhood.
A variant of STL is RobustSTL [23], defined as a robust and generic seasonal-trend decomposition method able to extract seasonality from data with a long seasonality period and high noise. The method applied by RobustSTL consists of four steps: 1. noise removal by applying bilateral filtering, using neighbors in a window of 2H + 1 observations with similar values to smooth the time series; 2. trend extraction by using the least absolute deviations (LAD) loss with sparse regularization; 3. seasonality extraction through non-local seasonal filtering, which takes into consideration K seasonal neighborhoods of 2H + 1 observations with different weights according to their distance in the time dimension and their seasonality values; 4. final adjustment by calculating the mean of the seasonality, which is added to the trend and removed from the seasonality.
Several configuration parameters must be provided to the algorithm: the period T, i.e. the number of observations in each cycle of seasonality; the hyper-parameters of the bilateral filter of step 1 (dn1 and dn2); the neighborhood size (H); the regularization parameters for trend extraction (reg1 and reg2); the number of past season samples (K); and the hyper-parameters of the bilateral filter in the seasonality extraction step (ds1 and ds2). After each step, the input signal is updated by removing the component extracted in that specific step. After step 4, the steps are repeated until convergence is reached between the remainder of the current iteration and the one of the previous iteration. The convergence is computed by comparing the remainder r_i at iteration i (the current iteration) with the remainder r_{i-1} at iteration i-1 (the previous iteration). If the difference is higher than a threshold, the process continues with another iteration; otherwise, the decomposition obtained is considered definitive. In the latter case, the results are three time series representing trend, seasonal, and residual. Once the decomposition is concluded, the residual curve can be analyzed to detect anomalies using different methods. One solution is the application of the interquartile range (IQR), which allows the calculation of two fences defining the range of reliable values. More details are provided in the following.
We evaluated two versions of the anomaly detection process: ADP1 and ADP2. The operations of each version are illustrated in Fig. 4. In ADP1, we apply the logarithm to the flow values obtained after anomaly repairing and provide them to RobustSTL. Then, the remainder values are normalized using mean and standard deviation, as follows: normalized = (r - mean) / standard_deviation, where mean and standard_deviation are computed on the remainder. Thus, the first (Q1) and third (Q3) quartiles are calculated on the normalized residual to find the lower and upper fences with the following formulas: lower_fence = Q1 - k * IQR and upper_fence = Q3 + k * IQR, where IQR is the difference between the third and first quartiles, and k is a multiplier. The observations with a residual value lower than lower_fence or higher than upper_fence are outliers. The higher the multiplier value, the fewer outliers are found. The value of k depends on which type of outliers we want to detect; high values of k are used for extreme outliers.
In ADP2, after the application of RobustSTL, the inverse function of the logarithm is applied to the residual values. The obtained values are normalized using the Robust Scaler, as follows: normalized = (r - median) / IQR, where median and IQR are evaluated on the remainder after the inverse logarithm function. Then, the lower fence and the upper fence are calculated on the normalized remainder values and the IQR filter is applied to them. Again, the observations with a residual value out of the range between the lower fence and the upper fence are outliers. We set up n processes to apply the STL decomposition to sensors' data in real-time, where n is the number of sensors in our dataset; indeed, the decomposition of the time series of one sensor is independent of the others. We exploited an implementation of RobustSTL available online. In both versions of the anomaly detection process, the decomposition is performed every 15 min. The time series used as input is generated by grouping by 15 min the logarithm of one week of flow measurements of a specific sensor. After the application of IQR, only the anomalies of the last 15 min are taken into account, since the anomalies in the previous time slots have already been extracted by the previous applications of the decomposition. We chose to aggregate observations every 15 min since the traffic model takes this type of aggregation as input, as described in Sect. 7. The choice of using one week of observations is a trade-off between the need to provide context for the decomposition and the requirement of having results in a short time. Indeed, the time required for a one-month decomposition (on average 12 min) is far greater than the one for a one-week decomposition (on average 15 s). Therefore, we set the period T to the number of observations in the week divided by 7. After several experiments, we decided to set the values of the other configuration parameters as follows: dn1 = 1, dn2 = 1, H = 3, reg1 = 10, reg2 = 0.5, K = 1, ds1 = 50, ds2 = 1.
In the end, we set the value of k in the IQR method to 3, since we want to avoid that real unusual traffic conditions are labeled as anomalies. Using a lower value for k, we noticed that a lot of observations were considered anomalies and, checking the values of flow and speed, they did not seem like real anomalies. This is probably due to the fact that our sensors are located near traffic lights; therefore, the presence of peaks is possible and could be related to the traffic light turning green.

Anomaly classification
In this section, we introduce the methodology we applied to classify anomalies and distinguish between sensor faults and unusual traffic conditions. Firstly, the correlation among traffic sensors was studied: in this phase, groups of sensors that have similar trends are identified. Anomalies are classified as unusual traffic conditions when they occur simultaneously in correlated sensors; otherwise, they are labeled as sensor faults. In the road traffic environment, an anomaly can be caused by traffic accidents, road closures, and road works that create an unusual traffic condition, or by faulty sensors, i.e., sensors that are not able to correctly measure the vehicles in the streets. In the case of an unusual traffic condition, the measurement collected by the sensor is a real value; on the other hand, in the case of a sensor fault, the measured value is an error and does not correspond to the real number of vehicles crossing the street at the sensor's position. In [24], a description of how these faults may occur is given.
Induction loop sensors are composed of inductive and capacitive elements that may encounter open or short circuit faults during measurements. Such faults lead to an erroneous interpretation of the data acquired from the loops. The idea behind the classification of anomalies is the following: if an anomaly occurs only in the time series of a sensor and, at the same time, it is not recorded by other sensors correlated to the sensor itself, then it is a sensor fault. Given a sensor A, B is a correlated sensor of A if its observations of traffic flow are related to the observations recorded by A. Usually, B is placed in the proximity of A: for example, sensors placed at the same junction are often correlated, as are sensors on the same road and lane (as shown in Fig. 5). Otherwise, if the anomaly recorded by the sensor occurs in the same period as the anomalies recorded by the correlated sensors, then it is an unusual traffic condition.
In order to distinguish between sensor faults and unusual traffic conditions, the correlation evaluated with DCCA (described in Sect. 6.1) was taken into account. Each anomaly was associated with anomalies that happened in an adjacent time interval. The amplitude of this time interval is a parameter that should be defined considering the frequency of observations. For each anomaly, the number of traffic sensors that had simultaneous anomalies (ns) was calculated, then the number of correlated sensors was counted (nc). If an anomaly happened in adjacent time intervals for correlated sensors, the anomaly is considered an unusual traffic condition.
The anomalies observed for sensors with a low number of correlated sensors are more likely to be classified as sensor faults, even if ns is high. For this reason, we take into account not only correlated sensors but also proximal sensors. For each anomaly classified as sensor fault, the distance between the sensor it belongs to and each traffic sensor showing a simultaneous anomaly was evaluated; if there are at least two other sensors experiencing an anomaly in a radius of 1500 meters from the sensor, the anomaly is classified as an unusual traffic condition. The distance was evaluated directly, querying the PostgreSQL database where sensor locations are collected. This allows us to identify additional spatial correlations between traffic sensors located nearby that may appear in specific traffic conditions (i.e., traffic accidents that caused re-routing of vehicles) but were not detected by the DCCA coefficient that mainly explores the temporal dimension. An overview of the classification process is shown in Fig. 6.
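The two-stage rule described above (DCCA-correlated sensors first, then the 1500 m proximity fallback) can be sketched in Python; all container names and the `distance_m` callback (a PostGIS query in the paper) are illustrative:

```python
from datetime import datetime, timedelta

def classify_anomaly(sensor, t, anomalies, correlated, distance_m,
                     window=timedelta(minutes=15), radius=1500.0):
    """Label one anomaly of `sensor` at time `t` (illustrative sketch).

    anomalies:  dict sensor_id -> list of anomaly timestamps
    correlated: dict sensor_id -> set of DCCA-correlated sensor ids
    distance_m: callable (a, b) -> distance in metres between two sensors
    """
    def simultaneous(other):
        # An anomaly of another sensor counts if it falls in an adjacent
        # time interval; the window depends on the observation frequency.
        return any(abs(t - t2) <= window for t2 in anomalies.get(other, ()))

    others = [s for s in anomalies if s != sensor and simultaneous(s)]
    # Stage 1: a simultaneous anomaly on a correlated sensor.
    if any(s in correlated.get(sensor, set()) for s in others):
        return "unusual_traffic_condition"
    # Stage 2: at least two other anomalous sensors within 1500 m.
    if sum(1 for s in others if distance_m(sensor, s) <= radius) >= 2:
        return "unusual_traffic_condition"
    return "sensor_fault"
```

The proximity fallback matters for sensors with few correlated neighbours, which would otherwise be biased toward the sensor-fault label.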

Correlation
If sensors are placed nearby one another, an unusual traffic condition will be recorded by the majority of sensors in the area. Instead, if the anomaly is caused by sensor fault, the majority of sensors located nearby will not observe any anomaly. The relation between sensors is not completely dependent on their spatial distance. The road network has a complex structure, and two sensors can be distant but placed on two connected roads. For this reason, a way to deeply understand the relation between two or more sensors is to take into account historical data and evaluate their correlation with a representative metric.
Considering measurements referring to one month (April 2019), the correlation between sensors was investigated. Sensors' observations are collected every minute at different timestamps, and to compare them, we need a common timestamp. Thus, we aggregate values every 15 min. Moreover, the traffic sensors available in Modena are located near traffic lights, and their measurements are affected by the traffic light logic and the presence of the red light. When the traffic light is red, vehicles cannot move; thus the number of vehicles that can be counted is one (if there is a vehicle that stops on the sensor) or zero. Aggregating data every 15 min reduces this effect and allows us to better understand the real traffic flow.
Firstly, the correlation between sensors was studied on the 15-min aggregated time series with Pearson coefficient, as described in Sect. 6.2. Then, since Pearson correlation is not suitable for skewed distributions, in order to ameliorate the correlation study and obtain a more reliable result, DCCA was studied (Sect. 6.3).

Pearson correlation
Firstly, we evaluate the correlation between sensors using the Pearson correlation metric. The Pearson coefficient measures the strength and direction of the linear relationship between pairs of continuous variables. As described in [25], the Pearson coefficient is easily computed by dividing the covariance of the two variables by the product of their standard deviations (the square roots of their variances). The correlation was firstly evaluated considering the whole traffic observations of April 2019. Only traffic sensors with a correlation coefficient higher than or equal to 0.8 and with a distance between them lower than 2000 m are considered correlated; these thresholds were selected considering the necessity of a strong correlation and the mutual distance of the sensors in our urban area. There were a lot of high correlation scores. Observing Fig. 7, which shows the correlation between two highly correlated sensors, the comparison between the correlation computed with and without the night period shows that the high number of zero values registered during the night strongly influences the distribution of each sensor. For this reason, the correlation was evaluated by removing night hours (between midnight and five in the morning). For the two sensors in Fig. 7, the exclusion of night hours strongly decreases the observed correlation to 0.74, which is lower than 0.8; thus, the two sensors are no longer considered correlated. The sensors correlated with sensor ''R013_SM21'' are represented in Fig. 8. For each of the traffic sensors that collect at least one measure every 15 min in the reference period (310 out of the 318 reliable sensors), the correlated sensors have been evaluated with the presented methodology. The average number of correlated sensors per sensor is 21. In Table 2, it can be observed that the sensors showing at least three correlated sensors are 242: 196 of them considering a threshold of 0.8 for the Pearson coefficient, and 46 relaxing this threshold to 0.7.
The sensors that have less than three correlated sensors, but at least one correlated sensor, are 30. Finally, the isolated sensors are 38.
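A sketch of this correlation test with pandas follows; the names are illustrative, the mutual distance comes from the sensor-location database, and the two series are assumed to cover the same period:

```python
import pandas as pd

def pearson_correlated(flow_a, flow_b, dist_m, r_min=0.8, d_max=2000.0):
    """Decide whether two sensors are Pearson-correlated (illustrative sketch).

    `flow_a` and `flow_b` are one-minute flow Series indexed by timestamp;
    `dist_m` is the mutual distance of the two sensors in metres.
    """
    a = flow_a.resample("15min").sum()   # common 15-min timestamps
    b = flow_b.resample("15min").sum()
    day = a.index.hour >= 5              # drop night hours (00:00-04:59)
    r = a[day].corr(b[day])
    return bool(r >= r_min and dist_m < d_max)
```

Dropping the night hours before computing the coefficient avoids the spurious correlation induced by long runs of zero flow.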
Pearson correlation assumes that the two variables are individually normally distributed [26]; thus, it is not suitable for skewed distributions. Moreover, since we aim to find outliers, we can assume that outliers are present in our data, and the Pearson correlation coefficient is very sensitive to outliers.

Detrending cross-correlation analysis coefficient (DCCA)
In [27], a new coefficient is proposed to quantify the level of cross-correlation between non-stationary time series; it can also be employed to study traffic flow fluctuations in the presence of outliers in stationary time series [28]. This cross-correlation metric is called the DCCA (Detrending Cross-Correlation Analysis) coefficient. Since traffic sensor data have nonlinear behavior, their correlation can be studied by taking into account the properties of fractals. DFA (Detrended Fluctuation Analysis) is a method for determining the statistical auto-correlation of a trend-stationary signal. DCCA is a generalization of DFA which measures the cross-correlation between two time series. The presence of outliers can introduce noise in a stationary time series; thus, DCCA is preferred for the traffic flow correlation analysis.
To evaluate the DCCA coefficient between two time series of equal length, four steps are necessary:
• integrating each time series;
• dividing the integrated sequence into non-overlapping segments and finding a polynomial function representing the trend of each segment for each time series;
• evaluating the local detrended covariance between the two time series in each segment;
• calculating the mean of the fluctuations evaluated for each segment to obtain a unique value.
In the integration step, for each time series, the total mean is subtracted from each value of the sequence, and then a cumulative sum is performed. This step is performed to obtain the profile of each time series. The two resulting profiles of length N (the same as the two original time series) are then divided into sequences of n values, obtaining N_n = N/n segments for each profile. The obtained segments are enumerated as s = 1, ..., N_n. Then, each segment s is taken singularly to evaluate its trend and define a polynomial function (T) that approximates its sequence of values. The degree of the polynomial function is a parameter that can be fitted. Each segment s starts at element (s − 1) · n + 1 of the profile and ends at element s · n. The polynomial function has a value for each element of s and each profile. In the third step, the covariance of each segment is evaluated as follows [29]:

F²(n, s) = (1/n) · Σ_{i=1}^{n} D1_{(s−1)·n+i} · D2_{(s−1)·n+i}, with D_{(s−1)·n+i} = Y_{(s−1)·n+i} − T_{(s−1)·n+i},

where T_{(s−1)·n+i} is the value of the polynomial function at position (s − 1) · n + i of segment s, Y_{(s−1)·n+i} is the value in the sequence, and D1 and D2 are the values of the detrended signal in the first and second profile, respectively. In the last step, the average of all the evaluated fluctuations (one for each segment) is calculated to obtain a unique covariance that represents the correlation between the two time series. To quantify the level of cross-correlation, a dimensionless measure, the DCCA cross-correlation coefficient, is evaluated: it is the ratio between the previously calculated covariance and the product of the two detrended standard deviations. The DCCA method allows us to select a value of n; thus, it is possible to evaluate the covariance considering segments of a given length. This is important for our use case: sensor measurements can be correlated daily, but their correlation can be less evident considering a shorter period. We are mostly interested in the short-term correlation between sensors.
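The four steps can be sketched as follows. This is a minimal ρ_DCCA implementation with linear detrending; we use boxes of n + 1 points, as in the original formulation of the coefficient, so that the linear fit leaves a non-zero residual even for n = 2 — the authors' implementation may differ in such details:

```python
import numpy as np

def rho_dcca(x, y, n=2):
    """DCCA cross-correlation coefficient between two equal-length series
    (illustrative sketch: linear detrending, boxes of n + 1 points, n >= 2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Step 1: subtract the mean and integrate to obtain the profiles.
    px, py = np.cumsum(x - x.mean()), np.cumsum(y - y.mean())
    t = np.arange(n + 1)
    f_xy = f_xx = f_yy = 0.0
    n_boxes = len(x) - n
    for i in range(n_boxes):
        sx, sy = px[i:i + n + 1], py[i:i + n + 1]
        # Step 2: local polynomial (here linear) trend in each box.
        dx = sx - np.polyval(np.polyfit(t, sx, 1), t)
        dy = sy - np.polyval(np.polyfit(t, sy, 1), t)
        # Step 3: detrended covariance and variances in the box.
        f_xy += (dx * dy).sum() / (n - 1)
        f_xx += (dx * dx).sum() / (n - 1)
        f_yy += (dy * dy).sum() / (n - 1)
    # Step 4: average over boxes and normalize to a dimensionless coefficient.
    return (f_xy / n_boxes) / np.sqrt((f_xx / n_boxes) * (f_yy / n_boxes))
```

Like the Pearson coefficient, ρ_DCCA ranges between −1 and 1: identical series give 1 and perfectly anti-correlated series give −1.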
The daily trend can be similar (peak hours in the morning, fewer vehicles during the evening), and even the weekly trend can be easily detected (fewer vehicles during the weekend and more during working days), but to identify sensor faults we need to find sensors that are correlated considering a time interval of at most one hour. For this reason, the DCCA correlation coefficient was evaluated considering segments of length n = 2 (corresponding to 30 min). Two sensors are considered correlated if their DCCA correlation coefficient is higher than or equal to 0.7 and their spatial distance is lower than 2000 m. The choice of these thresholds is the result of several empirical tests. The mean number of correlated sensors for each sensor is 42, a higher value than the one obtained with Pearson correlation. The number of isolated sensors decreases to 23, compared with Pearson correlation. Comparing the isolated sensors detected by Pearson and DCCA, we found that 18 of them are in common. The DCCA methodology reduced the number of isolated sensors and better investigated the correlation between sensors. For the 23 isolated sensors, we decided to change the segment length to n = 4, to cover one hour, and to relax the condition on the distance between correlated sensors to 2500 m. In this way, the isolated sensors became 17 (indicated as strongly isolated sensors in Table 2); 9 of them are located far away from any other sensor, and one of them is a sensor that counts only buses and appears to have a different behavior. For these sensors, we were not able to find correlated sensors. A comparison of the results of the correlation analysis conducted with Pearson correlation and DCCA is shown in Table 2. It can be observed that the DCCA methodology not only better fits our data but also allows us to discover more correlation patterns and reduce the number of isolated sensors.
For these reasons, the DCCA correlation coefficient is the one employed in our data cleaning process to classify anomalies.

Traffic model
Traffic simulation models are mathematical tools that help to plan, manage and analyze urban mobility. Dynamic traffic models create a detailed evolution over time of traffic situations. The model employed is a dynamic microscopic model. In microscopic models, the movement of each vehicle is the result of individual choices depending on: interactions with other vehicles, road environments, and traffic signals. The vehicle's movements do not depend on macroscopic or probabilistic laws. Every single vehicle in the model is a unique entity with its own goals and behavioral characteristics; each possessing the ability to interact with other entities in the model. We employed the open-source micro-simulation model SUMO 3 [30], configured to generate the routes of the vehicles starting from traffic sensors data as described in [19]. This model produces data about vehicle counts and their average speed in every road of the city of Modena taking as input the measurements of the traffic sensors.

Input generation
Vehicle routes are created taking into account real traffic sensor data aggregated into 15-min time slots. The model includes objects called calibrators. Calibrators are part of the SUMO suite and act as virtual traffic sensors calibrated against the real measurements of the on-road sensors. Each calibrator can produce the aspired traffic flow, i.e., the number of vehicles counted by the sensor associated with that calibrator. For each real traffic sensor in the city, a corresponding calibrator is inserted in the simulation. Vehicle counts and speed from the traffic sensor are the values the calibrator aspires to reach. If no information about the traffic flow in a specific time interval is provided to the calibrator, vehicles are not added or removed. Instead, if information about the vehicle flow to be reached is given, the calibrator acts by adding or removing vehicles to reach the given traffic flow. For this reason, removing sensor faults from the input data is of fundamental importance to obtain a more realistic simulation.
We use OSM (OpenStreetMap) as the source of geographical data to define our road network and store it in our database. Every road section is a line object composed of several points. The road section is then divided into smaller segments, each being the piece of road between two consecutive points.

Output
Simulations are performed on an HPC resource. Once a simulation is finished, its output is stored in the database. The values of flow (veh/h) and average speed (m/s) are collected for each lane segment (more than 17000 in our use case) every 15 min of simulation. An analysis of the traffic data obtained from several simulations is described in [31]. An additional output is provided for every sensor location: a time series of 15 min aggregated vehicle counts and speeds observed during the simulation at that point.

Traffic model evaluation
The main goal of a traffic model is to simulate a traffic flow close enough to the real values observed in urban streets. The model provides as output the number of vehicles counted at the exact position where the traffic sensor is located. The number of vehicles simulated at a point where a traffic sensor is located must be compared with the real vehicle counts. Even if traffic sensor data are used to feed the model, calibrators are not always able to insert the required number of vehicles: the vehicles inserted in the simulation and the vehicles counted by real sensors can differ. This mainly happens when a sensor measures a very high flow and its calibrator inserts many vehicles, causing a jam in the simulation and preventing other calibrators placed nearby from adding additional vehicles. This very high flow could be caused by a sensor fault that affects the performance of the entire simulation.
The evaluation method of the presented traffic model is described in [32]. For each sensor, whose measurements are employed as input for the model, the distance between the two time series, i.e. real flows observed by the sensors and simulated flow, is calculated with three different metrics: the fast Dynamic Time Warping (DTW), the PointWise Distance (PWD), and the Count Time slots Distance (CTD).
The DTW is evaluated using FastDTW, a less complex version of DTW described in [33]. This metric allows sequences to be stretched along the time axis, is able to find corresponding regions in time series, and can tolerate noise, time shifts, and scaling in the y axis [34]. PWD is evaluated by summing all the differences between the measured flow and the simulated flow in all the time steps of the simulation. We do not use absolute distance, but we sum distances considering their sign since a subsequent time slot can compensate for the difference observed in a previous one. The CTD is the number of time slots in which the absolute difference between the measured and simulated flow is higher than 2 vehicles per minute.
Calibrators have been classified considering the presented metrics in ''aligned'' and ''not aligned''. A calibrator is considered ''not aligned'' if the DTW distance between real measurements and simulated flow is higher than 1200, the PWD is higher than 30, and the CTD is higher than half of the total time steps of the simulation. A calibrator classified as ''not aligned'' cannot correctly insert the expected number of vehicles as required by the given input.
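A sketch of the PWD and CTD metrics and of the alignment rule follows; the DTW distance itself would come from a FastDTW implementation, and the function names are ours:

```python
import numpy as np

def pwd(measured, simulated):
    """PointWise Distance: signed differences summed over all time slots,
    so a surplus in one slot can compensate a deficit in another."""
    return float(np.sum(np.asarray(measured) - np.asarray(simulated)))

def ctd(measured, simulated, threshold=2.0):
    """Count Time slots Distance: slots whose absolute flow difference
    exceeds `threshold` vehicles per minute."""
    diff = np.abs(np.asarray(measured) - np.asarray(simulated))
    return int(np.sum(diff > threshold))

def is_aligned(dtw_dist, pwd_val, ctd_val, n_slots):
    """Alignment rule with the paper's thresholds: a calibrator is 'not
    aligned' when DTW > 1200, PWD > 30, and CTD exceeds half the slots."""
    return not (dtw_dist > 1200 and pwd_val > 30 and ctd_val > n_slots / 2)
```

Since all three thresholds must be exceeded together, a calibrator is marked not aligned only when it diverges from the measurements under every metric at once.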
To evaluate the performance of the simulation, we need metrics that summarize with a unique value the distances observed at each sensor position. We defined 5 metrics that help us to compare simulations performed including or excluding anomalies, referring to the whole simulation: percentage of aligned calibrators, mean Root Mean Squared Error (RMSE), mean DTW reduction, mean PWD reduction, and mean CTD reduction. For each simulation, the number of calibrators classified as aligned has been divided by the total number of calibrators to obtain the percentage of aligned calibrators. A value of the percentage of aligned calibrators is calculated for the simulation performed including anomalies and for the simulation of the same period excluding anomalies from the input of the model. A higher percentage means a higher number of aligned calibrators, thus an improvement of model performance. The RMSE between the real measurements and the simulated flows is evaluated at each sensor position and averaged. A value of mean RMSE is calculated for each simulation, including and excluding anomalies. A higher mean RMSE means a bigger error and thus a decrease in the performance of the model. Mean DTW reduction, mean PWD reduction, and mean CTD reduction are obtained by comparing the performances of two simulations: the simulation with all the available measurements (we will refer to this simulation as ''Standard''), and the simulation excluding anomalies (we will refer to this simulation as ''Cleaned''). The DTW distance between the real measurements and the simulated flow time series is evaluated at each sensor position for the two simulations, and the mean DTW reduction is computed as follows:

(1/N) · Σ_{i=1}^{N} (DTW(m_{i,1}, s_{i,1}) − DTW(m_{i,2}, s_{i,2}))

where m_{i,1} is the real measurements time series of sensor i, s_{i,1} is the simulated flow at the sensor position (the subscripts 1 and 2 refer to the Standard and Cleaned simulations, respectively), and N is the total number of sensors.
The difference between the DTW distances observed by the two simulations in each location is evaluated and the mean of these differences is called mean DTW reduction. If the value is positive, the distance has been reduced, thus the cleaned simulation performed better; otherwise, the standard simulation has better performances. Similarly, mean PWD reduction and mean CTD reduction are obtained by evaluating the difference between values of PWD distances or CTD distances of the two simulations referring to the same period for each sensor location and averaging these differences to obtain a unique value. If the value is positive, the distance was reduced and the cleaned simulation performed better than the standard one.
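The reduction metrics compare the two simulations location by location; a minimal sketch, with illustrative names:

```python
import numpy as np

def mean_reduction(dist_standard, dist_cleaned):
    """Mean reduction of a per-sensor distance metric (DTW, PWD or CTD):
    positive values mean the Cleaned simulation is closer to reality."""
    d1 = np.asarray(dist_standard, dtype=float)
    d2 = np.asarray(dist_cleaned, dtype=float)
    return float(np.mean(d1 - d2))
```

The same function applies to any of the three distances, fed with the per-sensor values of the Standard and Cleaned simulations.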

Experiment and result
To demonstrate the effectiveness of our methodology, the traffic model performances without a data cleaning procedure (standard simulation) and excluding anomalies (cleaned simulation) are compared for each day of April 2019. In Sect. 8.1, we show the results of the flow-speed correlation filter, while in Sect. 8.2, the results of ADP1 without anomaly classification are discussed. Moreover, in Sect. 8.3, the results of the classification applied to both ADP1 and ADP2 are compared and examined. Finally, in Sect. 8.4, the performances of the simulations obtained by excluding the anomalies detected by ADP2 and classified as sensor faults are presented.

Flow-speed correlation filter
In April 2019, the number of observations coming from the traffic sensors was 13 million, produced by 338 sensors.
Using the flow-speed correlation filter, 450845 observations are filtered out (3% of the total number of observations). These filtered observations are related to 259 sensors. The scatter plots in Fig. 9 show the values of flow and speed of the observations of urban sensors (Fig. 9a) and sensors outside the urban area, in provincial or regional roads (Fig. 9b) in April 2019; the red points are the filtered observations. As can be seen, no filtered observations are found among data coming from provincial and regional sensors. In Fig. 9a, we can notice that very high values of speed are considered ''non-anomalous'' for low values of flow. Indeed, if there is no traffic (low flow), it could be possible that vehicles move at high speed, especially at night. However, for higher values of flow, an observation with a very high speed is considered ''anomalous'' and filtered. Then, the filtered flow values are replaced as described in Sect. 5.2.

Improvement in traffic simulation with anomaly detection (ADP1)
In this subsection, we discuss the anomalies identified by ADP1 and the performance of the traffic simulation obtained by excluding these anomalies (ADP1 SIM). After the anomaly repairing phase is applied to the one-minute filtered data, all sensor observations are aggregated every 15 min. The STL decomposition applied to the aggregated traffic data, combined with the IQR method, detects 13932 anomalies (less than 0.1% of the observations) related to 310 sensors. Figure 10a shows an example of decomposition, which refers to the observations of one sensor from April 8th to April 14th, 2019. Time series decomposition involves thinking of a series as a combination of trend, seasonality, and remainder (also called noise or residual) components. Figure 10b highlights the detected anomalies. Analyzing the time distribution of the anomalies, most of them are detected at night, as can be seen in Fig. 11. Figure 12 shows the percentage of anomalies for every day of April 2019: on one day (April 6th, 2019) the percentage exceeds 2%.
Once the anomalies have been detected by ADP1, they are stored in the database. Each anomaly is linked to the 15-min time interval of the aggregated observations and the sensor it belongs to.
For each day of April 2019, two simulations have been performed. The STD SIM uses as input all the available sensor observations (no data cleaning is performed on them). The cleaned simulation ADP1 SIM, instead, is obtained by applying ADP1 to the traffic data and removing the anomalous observations from the input. We simply remove the anomalous observations from the input of the traffic model, and the calibrators simulate vehicles considering the previous and next observations. Classification of anomalies is not applied in this case. The evaluation of both the standard simulation (STD SIM) and the ADP1 SIM is performed considering only the non-anomalous points. This is because we assume that the real measurements labeled as anomalous are not reliable and cannot be used to estimate the error.
In Table 3, all the evaluated metrics are displayed for each day of April 2019. The comparison metrics described in Sect. 7.3 are evaluated, considering as the first simulation the standard simulation (STD SIM), and as the second the simulation without the anomalies detected by ADP1 (ADP1 SIM). In 77% of the daily simulations, the percentage of aligned calibrators increased when excluding anomalies (as can be seen in the third column of Table 3); only on 3 days does the number of aligned calibrators decrease. In 73% of cases, the mean RMSE decreases (as can be seen in the fifth column of Table 3). If the mean DTW reduction has a positive value, the exclusion of anomalies reduces the mean DTW distance; in only 60% of the daily simulations is the mean DTW reduction positive (as can be seen in the sixth column of Table 3). Moreover, the mean PWD reduction is positive in 60% of the daily simulations (as can be seen in the seventh column of Table 3): this means that in the majority of days, the mean PWD was reduced through the data cleaning process. Finally, the mean CTD reduction is positive in 90% of the daily simulations (as can be seen in the eighth column of Table 3); thus, the mean number of time slots in which the distance between the simulated flow and the real measurements was higher than 2 veh/min is significantly reduced. Overall, 83% of the days show an improvement in at least 3 out of 5 metrics in the ADP1 SIM. For each day, the number of filtered one-minute data points and the number of anomalies detected using STL is calculated. The days with the highest number of filtered values and anomalies are the ones with better performances (see, for example, the 5th, 6th, and 20th of April in Table 3).
The reason for this enhancement of performance was further investigated by observing the time series of measured and simulated flows in the most affected locations. In Fig. 13, the comparison between the observed and the simulated flow on April 1st, without excluding anomalies (STD SIM) and removing the anomalies detected by ADP1 (ADP1 SIM), shows that the performances in 3 sensor locations are significantly affected by the anomaly detection process. Before the data cleaning process, the calibrators located at the positions of sensors ''R137_S11'', ''R047_S4'' and ''R027_SM125'' were not able to correctly follow the real measurements: ''R137_S11'' had some time slots with very high flow, ''R047_S4'' had zero flow for most of the simulation, and ''R027_SM125'' was not able to follow the real flow in the second half of the STD SIM. For these reasons, they were all classified as not aligned calibrators. After the data cleaning process, the similarity between the observed and the simulated flow rises, and the calibrators were classified as aligned. Finally, the outputs (flow and speed in every road section) of the simulations were compared. Since we removed some high flow values detected as anomalies, the expected result was a decrease in the total number of vehicles. Yet, the number of vehicles increased globally, and the total number of vehicles per day increased on average in 57% of the road sections. The daily mean speed has a difference with an absolute value higher than 10 km/h for only 11% of the road sections. Calculating the sum of the values of speed increment and speed reduction due to anomaly detection, the final result is that the data cleaning process speeds up the vehicles in the simulation. The road sections whose flow is most affected by the data cleaning process are displayed in the flow variation map in Fig. 14.
Roads with an increase in the mean daily flow for at least 20 days out of 30 are colored in red, while roads with a decrease are colored in blue.

Anomaly classification for tuning the anomaly detection process
The anomalies detected by ADP1 in April 2019 were classified. Firstly, the DCCA correlation coefficient was employed to identify for each sensor a group of correlated sensors. Furthermore, the corresponding value of flow in the detected anomalies is very low during the night, suggesting that the unusual traffic conditions can be acceptable values that should not be excluded from the input of the traffic model.
The results of the classification on the anomalies detected by ADP1 underline the necessity of some improvement; hence, a new version of the anomaly detection algorithm that includes classification was generated: ADP2.
In Fig. 15, the results of the classification for the two versions of the anomaly detection process can be compared; the value of traffic flow is very low for both sensor faults and unusual traffic conditions in ADP1. In ADP2, instead, the traffic flow values are significantly higher for sensor faults and remain lower for unusual traffic conditions. Besides, in ADP2 the percentage of anomalies detected during night hours is reduced to 68% for unusual traffic conditions and to 44% for sensor faults. (The flow variation map of Fig. 14 can be seen in more detail at https://trafair.eu/flowvariationmap/.)
In Fig. 16, the time series of measurements of sensor ''R124_S2'' is displayed, with sensor faults and unusual traffic conditions highlighted. When unusual traffic conditions appear during night hours, they are generally a consequence of the inability of the STL decomposition to correctly extract the seasonality of night hours; thus, they are not significant for traffic evaluation. Instead, when unusual traffic conditions are detected during the day, they may indicate traffic congestion or a real traffic event (e.g., a road accident).
In Fig. 17, the number of unusual traffic conditions during daylight hours for each day of April 2019 is displayed. The highest number of unusual traffic conditions is detected on Fridays (the 5th and the 12th) and Saturdays (the 6th and the 20th). The 23rd is a Tuesday, and it was the day after Easter Monday in Italy. Therefore, when a day is different from the days before, a high number of unusual traffic conditions is detected. This is a consequence of the STL parameters (the number of seasonal periods used by this version is 3): if the day's curve differs from the seasonality of the three days before, the remainder will be larger and more anomalies will be detected. As can be seen in Fig. 18, this effect is not present in sensor faults; thus, the classification discriminates between anomalies that are a consequence of the adopted algorithm (classified as unusual traffic conditions) and real anomalies.

Improvement in traffic simulation with anomaly detection and classification (ADP2+CLASS)
Differently from the experiments in Sect. 8.2, in this part we test the application of anomaly detection and classification in order to exclude from the input of the traffic simulation not all the anomalies but only the sensor faults. ADP2 was used to detect anomalies in April 2019; then, the detected anomalies were classified, and only the ones labeled as sensor faults were removed from the traffic model input. Thus, in this case, we combine ADP2 with the anomaly classification.
ADP2 detected 26792 anomalies, related to 333 sensors. Compared to the number of anomalies detected by ADP1 in the same period (as reported in Sect. 8.2), ADP2 finds twice as many anomalies. Figure 19a shows the anomalies detected by ADP2 on the time series observations of sensor ''R124_S2'' from April 8th to April 14th, 2019, while Fig. 19b reports the same anomalies on the remainder component after the application of the inverse logarithm function (the red lines represent the lower and upper fences).
These figures can be compared with those in Fig. 10. Obviously, the remainder values of Figs. 19b and 10c differ, since they are calculated in different ways, as described in Sect. 5.3. We can notice that the anomalies on the high values of April 10th are identified by both anomaly detection processes; in addition, ADP2 finds another very high value on April 10th. Not all the anomalies on low flow values are detected by ADP2; indeed, the low flow observations on the nights between the 13th and 14th and between the 14th and 15th are not highlighted as anomalies. Besides, at the end of April 10th, ADP2 manages to identify anomalies on two positive peaks.
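Reading the red lines of Fig. 19b as box-plot fences, the flagging step can be sketched as Tukey fences on the remainder component; the multiplier k = 1.5 is the conventional value, assumed here rather than stated in this section:

```python
import numpy as np

def tukey_fences(remainder: np.ndarray, k: float = 1.5):
    """Lower/upper fences computed from the interquartile range of the remainder."""
    q1, q3 = np.percentile(remainder, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_anomalies(remainder: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Boolean mask of remainder values falling outside the fences."""
    lower, upper = tukey_fences(remainder, k)
    return (remainder < lower) | (remainder > upper)
```

Observations whose remainder lies outside the fences (above the upper red line or below the lower one) are the ones marked as anomalies in the figure.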
After anomaly detection, the traffic of each day of April 2019 was simulated using the traffic model described in Sect. 7. We compared the results of the simulation considering anomalous values (STD SIM) with those of the simulation excluding them (ADP2+CLASS SIM) for all days of April 2019. As displayed in Tables 4 and 5, the average value of the mean DTW reduction is very high: 76.71. This is a significant improvement, considering that the same value was negative (-2.82) using ADP1. Moreover, the average value of the mean PWD reduction is 0.68 (it was 0.28 with ADP1); however, the average value of the mean CTD reduction is 1, lower than the one obtained with ADP1 (2).
Concerning the days with the worst performances using ADP1 (red values in Table 4), the values of the metrics described in Sect. 7.3 are evaluated considering only non-anomalous measurements for both the standard simulation (STD SIM) and the simulation performed removing sensor faults (ADP2+CLASS SIM); therefore, the values differ from those reported in Table 3 even if the simulation is the same.
The results show that ADP2 combined with classification significantly improves the performance on April 4th, 9th, and 17th; however, removing anomalies still reduces the performance on April 12th, where only the mean RMSE error is reduced.

Conclusion and future work
In this paper, we have presented a methodology to detect and classify anomalies in traffic sensor data streams. The proposed data cleaning process consists of three main components. The first one is a flow-speed correlation filter that removes unrealistic observations in which the number of vehicles counted in a certain time interval is not related to the corresponding average speed. The second one is the anomaly detection algorithm, based on Seasonal-Trend decomposition using Loess (STL); two versions of this algorithm, ADP1 and ADP2, are defined and tested. The third one is the anomaly classification, performed by investigating the correlation between sensors. The combination of ADP2 and anomaly classification has proved to be the most effective in identifying anomalies, as confirmed by the experimental results.
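As an illustration of the first component, a flow-speed consistency check can be sketched as follows; the specific inconsistency rule used here (zero flow paired with a nonzero speed, or vice versa) is an assumption for illustration, not the paper's exact filter:

```python
import numpy as np

def flow_speed_filter(flow: np.ndarray, speed: np.ndarray) -> np.ndarray:
    """Mask of observations to keep: drop physically inconsistent pairs,
    e.g. vehicles counted with zero average speed, or a nonzero average
    speed reported with no counted vehicles."""
    inconsistent = ((flow == 0) & (speed > 0)) | ((flow > 0) & (speed == 0))
    return ~inconsistent
```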
Experiments proved that classifying anomalies as either sensor faults or unusual traffic conditions makes it possible to remove sensor faults from the input of the traffic simulation model, improving its performance and ensuring that it better emulates real urban traffic conditions.
The proposed solution that employs ADP1 is currently used in real-time to detect anomalies in traffic sensor data. In the future, ADP2 will also be employed in real-time on our dataset of traffic sensor measurements. Moreover, we plan to experiment with different solutions for anomaly detection in multivariate time series, considering both flow and speed.
Author contributions The authors contributed equally to this work.
Funding Research reported in this paper was partially supported by the TRAFAIR project (Grant Agreement: 2017-EU-IA-0167), co-financed by the Connecting Europe Facility of the European Union. Conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the EU Commission.