Shapelets
polars_ts.clustering.shapelets
U-Shapelet (unsupervised shapelet) clustering for time series.
Discovers discriminative subsequences (shapelets) that separate groups of time series, then clusters in shapelet-distance space.
References
- Zakaria, J. et al. (2012). Clustering Time Series Using Unsupervised-Shapelets. ICDM.
UShapeletClusterer
Unsupervised shapelet-based time series clustering.
Discovers discriminative subsequences (shapelets) and clusters series by their shapelet distances.
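To make "shapelet-distance space" concrete, the following plain-Python sketch builds one feature row per series, with one column per shapelet. The helper names `min_distance` and `distance_features` are hypothetical and not part of the library. Unnormalized Euclidean distance is assumed.

```python
import math

def min_distance(shapelet, series):
    # Smallest Euclidean distance between the shapelet and any
    # equally long window of the series (assumed: no normalization).
    m = len(shapelet)
    return min(
        math.dist(shapelet, series[i:i + m])
        for i in range(len(series) - m + 1)
    )

def distance_features(X, shapelets):
    # Row i, column j holds the distance from series i to shapelet j.
    return [[min_distance(s, row) for s in shapelets] for row in X]

X = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
shapelets = [[0.0, 1.0, 0.0], [0.0, 0.0]]
features = distance_features(X, shapelets)
```

Each row of `features` is the representation that a clustering step such as k-means would then operate on.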
Parameters
- n_clusters: Number of clusters.
- n_shapelets: Number of shapelets to select.
- shapelet_lengths: Candidate shapelet lengths to consider.
- n_candidates: Number of random shapelet candidates to evaluate.
- target_col: Column with the values to cluster.
- id_col: Column identifying each time series.
- time_col: Column with timestamps for ordering.
- seed: Random seed for reproducibility.
- max_iter: Maximum k-means iterations.
fit(df)
Discover shapelets and cluster time series.
Parameters
- df: Input DataFrame with time series data.
Returns
Self
_extract_series(df, target_col, id_col, time_col)
Extract series as a zero-padded 2-D array (n_series, max_len).
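The padding step can be pictured with a small sketch. The helper `pad_series` is hypothetical and works on plain Python lists rather than a DataFrame. It assumes zeros are appended at the end of shorter series, which this page does not confirm.

```python
def pad_series(series_list):
    # Right-pad every series with zeros up to the longest length
    # (assumption: padding goes at the end, not the start).
    max_len = max(len(s) for s in series_list)
    return [list(s) + [0.0] * (max_len - len(s)) for s in series_list]

rows = pad_series([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
```

The result has shape (n_series, max_len), here 3 by 3.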
_subsequence_distance(shapelet, series)
Minimum sliding-window Euclidean distance between shapelet and series.
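A minimal sketch of the sliding-window computation, written in plain Python for clarity. The function name `subsequence_distance_sketch` is hypothetical. Whether the library normalizes windows or divides by the shapelet length is not stated here, so neither is done.

```python
import math

def subsequence_distance_sketch(shapelet, series):
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    # Slide a window of length m over the series and keep the closest match.
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(shapelet, window)))
        best = min(best, d)
    return best

exact = subsequence_distance_sketch([1.0, 2.0], [0.0, 1.0, 2.0, 5.0])
nearest = subsequence_distance_sketch([0.0, 0.0], [3.0, 4.0, 0.0])
```

In the first call the shapelet occurs verbatim in the series, so the distance is zero. In the second call the closest window is `[4.0, 0.0]`.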
_extract_candidates(X, shapelet_lengths, n_candidates, rng)
Extract random shapelet candidates from the dataset.
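Random candidate extraction might look like the sketch below. It is an illustration only: the name `extract_candidates_sketch` is hypothetical, `rng` is assumed to be a `random.Random` instance from the standard library, and every candidate length is assumed to be no longer than the series. The library's own sampling scheme may differ.

```python
import random

def extract_candidates_sketch(X, shapelet_lengths, n_candidates, rng):
    candidates = []
    for _ in range(n_candidates):
        length = rng.choice(shapelet_lengths)  # pick a candidate length
        series = rng.choice(X)                 # pick a source series
        start = rng.randrange(len(series) - length + 1)
        candidates.append(series[start:start + length])
    return candidates

rng = random.Random(42)
X = [[0.0, 1.0, 2.0, 3.0, 4.0], [5.0, 4.0, 3.0, 2.0, 1.0]]
cands = extract_candidates_sketch(X, [2, 3], 6, rng)
```

Seeding the generator makes the drawn candidates repeatable across runs.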
_score_shapelet(shapelet, X)
Score a shapelet candidate using the gap statistic.
Computes the distances from the shapelet to all series, then finds the split point that maximizes the gap between successive sorted distances. A larger gap means the shapelet better separates series into two groups.
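The scoring rule described above can be sketched as follows, taking the shapelet-to-series distances as already computed. The function name `gap_score_sketch` is hypothetical, and returning 0.0 for fewer than two distances is an assumption made for the sketch.

```python
def gap_score_sketch(distances):
    ordered = sorted(distances)
    if len(ordered) < 2:
        return 0.0  # assumption: no split is possible, so no gap
    # Largest difference between neighbouring sorted distances.
    return max(b - a for a, b in zip(ordered, ordered[1:]))

separating = gap_score_sketch([1.0, 5.25, 1.5, 5.0])
uniform = gap_score_sketch([1.0, 2.0, 3.0, 4.0])
```

In the first call the distances fall into two groups, near 1 and near 5, which yields a large gap. The evenly spaced distances in the second call give a small one.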
_kmeans_1d(distances, k, rng, max_iter=100)
Run k-means on a distance-feature matrix.
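For orientation, a plain-Python sketch of Lloyd's k-means on a feature matrix is given below. It is not the library's implementation: the name `kmeans_sketch` is hypothetical, `rng` is assumed to be a `random.Random`, and the initial centers are assumed to be k distinct rows drawn at random.

```python
import random

def kmeans_sketch(features, k, rng, max_iter=100):
    # Assumed initialization: k distinct rows chosen at random.
    centers = [list(row) for row in rng.sample(features, k)]
    labels = [-1] * len(features)
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(row, centers[j])))
            for row in features
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(features, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

features = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1]]
labels = kmeans_sketch(features, 2, random.Random(0))
```

The label values themselves are arbitrary. Only the grouping is meaningful, so the first two rows share one label and the last two share the other.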
shapelet_cluster(df, k=3, n_shapelets=10, shapelet_lengths=None, n_candidates=100, target_col='y', id_col='unique_id', time_col='ds', seed=42, max_iter=100)
Discover U-Shapelets and cluster time series.
Convenience function wrapping UShapeletClusterer.
Parameters
- df: Input DataFrame with time series data.
- k: Number of clusters.
- n_shapelets: Number of shapelets to select.
- shapelet_lengths: Candidate shapelet lengths to consider.
- n_candidates: Number of random shapelet candidates to evaluate.
- target_col: Column with the values to cluster.
- id_col: Column identifying each time series.
- time_col: Column with timestamps for ordering.
- seed: Random seed for reproducibility.
- max_iter: Maximum k-means iterations.
Returns
pl.DataFrame
DataFrame with columns [id_col, "cluster"].