Time Series Data Clustering: Unsupervised Sequential Data Separation with tslearn
Clustering is an important machine learning technique that divides data points into groups. Common clustering algorithms include K-means, Mean-shift, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Expectation–Maximization (EM) clustering. However, these algorithms cannot be applied directly to time series data, because we are separating whole sequences rather than individual data points, and sequences can differ in length or be shifted in time. One solution is to replace the Euclidean distance that K-means uses for data points with Dynamic Time Warping (DTW).
1. Dynamic Time Warping (DTW)
As the picture shows, Euclidean matching leaves some data points unmatched when the two sequences have different durations. Dynamic Time Warping instead matches each point to the nearest corresponding point in the other sequence. Euclidean matching is strictly one-to-one, while DTW matching can be one-to-many.
The formal way to express the measurement is:
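A common formulation, which matches the definition given in the tslearn documentation as far as we can tell, takes two sequences x = (x_0, ..., x_{n-1}) and y = (y_0, ..., y_{m-1}) and minimizes the accumulated squared distance over all admissible alignment paths. The symbols below are our own notation and may differ from the original figure:

```latex
\mathrm{DTW}(x, y) \;=\; \min_{\pi \in \mathcal{A}(x, y)}
    \sqrt{\sum_{(i, j) \in \pi} d(x_i, y_j)^2}
```

Here d is the Euclidean distance between two individual points, and \mathcal{A}(x, y) is the set of alignment paths \pi, each a list of index pairs that starts at (0, 0), ends at (n-1, m-1), and at every step increases i, j, or both by exactly one without ever moving backward.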
The calculation is implemented in tslearn (a library that focuses on time series data analysis):
from tslearn.metrics import dtw
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 2.0, 3.0]
dtw_score = dtw(x, y)  # the two sequences may differ in length
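To see what such a call computes, here is a minimal pure-Python sketch of the standard dynamic-programming recurrence for DTW. It assumes one-dimensional sequences and a squared-difference local cost. The name dtw_distance is a hypothetical helper, not part of tslearn, and the library's own implementation is more general and much faster.

```python
import math


def dtw_distance(x, y):
    """DTW distance between two 1-D sequences (illustrative sketch only)."""
    n, m = len(x), len(y)
    # cost[i][j] = smallest accumulated squared cost of aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments:
            # advance in x only, advance in y only, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )
    return math.sqrt(cost[n][m])


x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 2.0, 3.0]  # same shape as x, with one value repeated
print(dtw_distance(x, y))  # → 0.0
```

The two sequences have different lengths, so a point-by-point Euclidean comparison is not even defined, yet DTW reports a distance of zero because the repeated value in y can be matched to the single 2.0 in x.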
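2. Algorithm (implemented in the tslearn library)
We can still use K-means to cluster the data, simply replacing the original Euclidean distance with DTW. This has been implemented as the TimeSeriesKMeans class in tslearn. With metric="dtw", the cluster centers are computed by DTW barycenter averaging (DBA) rather than a plain arithmetic mean.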
from tslearn.clustering import TimeSeriesKMeans
model = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=10)
A complete explanation and code can be found here: https://tslearn.readthedocs.io/en/stable/auto_examples/clustering/plot_kmeans.html#sphx-glr-auto-examples-clustering-plot-kmeans-py
This approach can be used to group sequences that follow similar trends, for example customers with similar purchasing schedules or stocks that fluctuated similarly over time.