SimilarityTS-cli is a command-line interface tool that act as an interface of the SimilarityTS package. Similarity-TS package facilitates the evaluation and comparison of multivariate time series data. It provides a comprehensive toolkit for analyzing, visualizing, and reporting multiple metrics and figures derived from time series datasets. The toolkit simplifies the process of evaluating the similarity of time series by offering data preprocessing, metrics computation, visualization, statistical analysis, and report generation functionalities. With its customizable features, SimilarityTS empowers researchers and data scientists to gain insights, identify patterns, and make informed decisions based on their time series data.
This command-line interface is OS independent and can be easily installed and used.
This toolkit can compute the following metrics:
kl
: Kullback-Leibler divergencejs
: Jensen-Shannon divergenceks
: Kolmogorov-Smirnov testmmd
: Maximum Mean Discrepancydtw
Dynamic Time Warpingcc
: Difference of co-variancescp
: Difference of correlationshi
: Difference of histograms
This toolkit can generate the following figures:
-
2d
: the ordinary graphical representation of the time series in a 2D figure with the time represented on the x axis and the data values on the y-axis for- the complete multivariate time series; and
- a plot per column.
Each generated figure plots both the
ts1
and thets2
data to easily obtain key insights into the similarities or differences between them. -
delta
: the differences between the values of each column grouped by periods of time. For instance, the differences between the percentage of cpu used every 10, 25 or 50 minutes. These delta can be used as a means of comparison between time series short-/mid-/long-term patterns. -
pca
: the linear dimensionality reduction technique that aims to find the principal components of a data set by computing the linear combinations of the original characteristics that explain the most variance in the data. -
tsne
: a tool for visualising high-dimensional data sets in a 2D or 3D graphical representation allowing the creation of a single map that reveals the structure of the data at many different scales. -
dtw
path: In addition to the numerical similarity measure, the graphical representation of the DTW path of each column can be useful to better analyse the similarities or differences between the time series columns. Notice that there is no multivariate representation of DTW paths, only single column representations.
To install the tool in your local environment, just run follow command:
pip install similarity-ts-cli
Users must provide .csv
files containing multivariate time series by using the arguments -ts1
and -ts2
.
-ts1
should point to a singlecsv
filename. This time series may represent the baseline or ground truth time series.-ts2
can point to another singlecsv
filename or a directory that contains multiplecsv
files to be compared with-ts1
file.-head
if your time series files include a header this argument must be present. If not present, the software understands that csv files don't include a header row.
Constraints:
-ts1
time-series file and-ts2
time-series file(s) must:- have the same dimensionality (number of columns)
- not include a timestamp column
- include only numeric values
- include the same header (if present)
- if a header is present as first row, use the
-head
argument. - all
-ts2
time-series files must have the same length (number of rows).
Note: the column delimiter is automatically detected.
If your data include categorical values, it might be pre-processed to convert them to numerical values. All ts2s
time-series must have the same length (number of rows).
If -ts1
time-series file is longer (more rows) than -ts2
time-series file(s), the -ts1
time series will be
divided in windows of the same
length as the -ts2
time-series file(s).
For each -ts2
time-series file, the most similar window (*) from -ts1
time series is selected.
Finally, metrics and figures that assess the similarity between each pair of -ts2
time-series file and its
associated most similar -ts1
window are computed.
(*) -w_select_met
is the metric used for the selection of the most
similar -ts1
time-series window per each -ts2
time-series file(s).dtw
is the default value, however, any of
the
metrics are also available for this purpose using this argument.
Users can provide metrics or figures to be computed/generated:
-m
the metrics names to be computed as a list separated by spaces.-f
the figures names to be computed as a list separated by spaces.
If no metrics nor figures are provided, the tool will compute all the available metrics and figures.
The following arguments are also available for fine-tuning:
-ts_freq_secs
the frequency in seconds in which samples were taken just to generate the delta figures. By default is1
second.-strd
whents1
time-series is longer thants2
time-series file(s) the windows are computed by using a stride of1
by default. Sometimes using a larger value for the stride parameter improves the performance by skipping the computation of similarity between so many windows.
Some examples of evaluation of similarity are shown below. You can download some test data by running the following command:
wget https://github.com/alejandrofdez-us/similarity-ts-cli/raw/main/data_samples.zip && unzip data_samples.zip
Or manually download and unzip from https://github.com/alejandrofdez-us/similarity-ts-cli/raw/main/data_samples.zip .
-
A time series and all time series computing all metrics and figures:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2 -head
Every metric computation and figure generated will be found in the
results/{timestamp}/
directory. -
Two time series computing only DTW metric and DTW figure:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2/sample_0.csv -head -m dtw -f dtw
-
A time series and all time series within a directory computing every metric and figure in SimilarityTS toolkit:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2 -head -m js mmd dtw ks kl cc cp hi -f delta dtw 2d pca tsne
-
Comparison between time series specifying the frequency in seconds in which samples were taken:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2 -head -m dtw -f dtw -ts_freq_secs 300
-
Comparison between time series specifying the stride that determines the step or distance by which a fixed-size window moves over the first time series:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2 -head -m dtw -f dtw -strd 5
-
Comparison between time series specifying the window selection metric to be used when selecting the most similar windows in the first time series:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2 -head -m dtw -f dtw -w_select_met js
-
Using our sample time series to compute every single metric and figure with a fixed timestamp frequency and stride:
similarity-ts-cli -ts1 data_samples/alibaba2018/ts1_machine_usage_days_1_to_8_grouped_300_seconds.csv -ts2 data_samples/alibaba2018/ts2 -head -m mmd dtw ks kl cc cp hi -f delta dtw 2d pca tsne -w_select_met cc -ts_freq_secs 300 -strd 5
SimilarityTS toolkit is free and open-source software licensed under the MIT license.
Project PID2021-122208OB-I00, PROYEXCEL_00286 and TED2021-132695B-I00 project, funded by MCIN / AEI / 10.13039 / 501100011033, by Andalusian Regional Government, and by the European Union - NextGenerationEU.