Scalable

polars_ts.clustering.scalable

Scalable k-medoids variants: CLARA and CLARANS.

CLARA (Clustering LARge Applications): draw a random subsample, run PAM (Partitioning Around Medoids) on it, repeat, and keep the best result.

CLARANS (Clustering Large Applications based on RANdomized Search): randomized medoid neighborhood search over the full dataset.

References

Kaufman, L. & Rousseeuw, P.J. (1990). Finding Groups in Data. Wiley.

Ng, R. & Han, J. (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE TKDE.

clara(df, k, method='dtw', n_samples=5, sample_size=40, max_iter=100, seed=42, id_col='unique_id', target_col='y', **distance_kwargs)

CLARA: subsample-based PAM for large datasets.

Runs PAM on n_samples random subsamples of size sample_size, evaluates the full-dataset cost for each, and returns the best result.

Parameters

df: DataFrame with columns id_col and target_col.
k: Number of clusters.
method: Distance metric name (e.g. "dtw", "erp"). Default "dtw".
n_samples: Number of subsampling iterations. Default 5.
sample_size: Number of series per subsample. Clamped to the total number of series if larger. Default 40.
max_iter: Maximum PAM swap iterations per subsample. Default 100.
seed: Random seed for reproducibility. Default 42.
id_col: Column identifying each time series. Default "unique_id".
target_col: Column with the time series values. Default "y".
**distance_kwargs: Extra keyword arguments forwarded to the distance function.

Returns

pl.DataFrame: DataFrame with columns [id_col, "cluster"].
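The CLARA loop described above can be sketched in plain Python. This is an illustrative sketch only, not the polars_ts implementation: it assumes a precomputed pairwise distance matrix (absolute differences between scalars stand in for DTW distances between series), and the names clara_sketch, pam, and total_cost are hypothetical. The swap phase is a simplified first-improvement variant of PAM.

```python
import random


def total_cost(dist, medoids, points):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist[p][m] for m in medoids) for p in points)


def pam(dist, points, k, max_iter, rng):
    """Simplified PAM swap phase restricted to `points` (first-improvement swaps)."""
    medoids = rng.sample(points, k)
    best = total_cost(dist, medoids, points)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                # Try replacing the i-th medoid with the candidate.
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(dist, trial, points)
                if cost < best:
                    medoids, best, improved = trial, cost, True
        if not improved:
            break  # a full pass found no improving swap
    return medoids


def clara_sketch(dist, k, n_samples=5, sample_size=40, max_iter=100, seed=42):
    """Hypothetical CLARA loop: PAM on subsamples, scored on the full dataset."""
    rng = random.Random(seed)
    everyone = list(range(len(dist)))
    size = min(sample_size, len(everyone))  # clamp to the number of series
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(everyone, size)
        medoids = pam(dist, sample, k, max_iter, rng)
        cost = total_cost(dist, medoids, everyone)  # full-dataset cost
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every point to the index of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[p][best_medoids[j]]) for p in everyone]
    return best_medoids, labels, best_cost


# Toy data: two well-separated groups of scalars.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels, cost = clara_sketch(dist, k=2)
print(sorted(medoids), labels)
```

Because the toy dataset has fewer points than sample_size, every subsample is the full dataset here; with real data the subsample is what keeps each PAM run cheap, while the full-dataset cost decides which run wins.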

clarans(df, k, method='dtw', num_local=2, max_neighbor=10, seed=42, id_col='unique_id', target_col='y', **distance_kwargs)

CLARANS: randomized medoid neighborhood search.

Performs num_local restarts of a local search that explores up to max_neighbor random medoid swaps before declaring convergence.

Parameters

df: DataFrame with columns id_col and target_col.
k: Number of clusters.
method: Distance metric name (e.g. "dtw", "erp"). Default "dtw".
num_local: Number of random restarts. Default 2.
max_neighbor: Maximum random swap attempts per restart before stopping. Default 10.
seed: Random seed for reproducibility. Default 42.
id_col: Column identifying each time series. Default "unique_id".
target_col: Column with the time series values. Default "y".
**distance_kwargs: Extra keyword arguments forwarded to the distance function.

Returns

pl.DataFrame: DataFrame with columns [id_col, "cluster"].
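The restart-and-swap search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: it works on a precomputed distance matrix, the names clarans_sketch and full_cost are hypothetical, and the failure counter is reset after each improving swap, as in the original CLARANS paper. Whether polars_ts resets the counter the same way is not stated in this page.

```python
import random


def full_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def clarans_sketch(dist, k, num_local=2, max_neighbor=10, seed=42):
    """Hypothetical CLARANS search; requires k < len(dist)."""
    rng = random.Random(seed)
    n = len(dist)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):  # independent random restarts
        current = rng.sample(range(n), k)
        current_cost = full_cost(dist, current)
        failures = 0
        while failures < max_neighbor:
            # A neighbor differs from the current solution in exactly one medoid.
            slot = rng.randrange(k)
            cand = rng.choice([p for p in range(n) if p not in current])
            neighbor = current[:slot] + [cand] + current[slot + 1:]
            cost = full_cost(dist, neighbor)
            if cost < current_cost:
                current, current_cost = neighbor, cost
                failures = 0  # assumption: counter resets on improvement
            else:
                failures += 1
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost


# Toy data: absolute differences between scalars stand in for series distances.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in values] for a in values]
medoids, cost = clarans_sketch(dist, k=2)
print(sorted(medoids), round(cost, 6))
```

Unlike CLARA, every candidate swap is scored against the full dataset, so num_local and max_neighbor are the main knobs that trade run time for solution quality.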