Scalable
polars_ts.clustering.scalable
Scalable k-medoids variants: CLARA and CLARANS.
CLARA (Clustering LARge Applications): draw a random subsample, run PAM on it, repeat, and keep the best result.
CLARANS (Clustering Large Applications based on RANdomized Search): randomized medoid neighborhood search over the full dataset.
References
Kaufman, L. & Rousseeuw, P.J. (1990). Finding Groups in Data. Wiley.
Ng, R. & Han, J. (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE TKDE.
clara(df, k, method='dtw', n_samples=5, sample_size=40, max_iter=100, seed=42, id_col='unique_id', target_col='y', **distance_kwargs)
CLARA: subsample-based PAM for large datasets.
Runs PAM on n_samples random subsamples of size sample_size,
evaluates each resulting medoid set on the full dataset, and returns the
clustering with the lowest full-dataset cost.
Parameters
df
DataFrame with columns id_col and target_col.
k
Number of clusters.
method
Distance metric name (e.g. "dtw", "erp"). Default "dtw".
n_samples
Number of subsampling iterations. Default 5.
sample_size
Number of series per subsample. Clamped to the total number of
series if larger. Default 40.
max_iter
Maximum PAM swap iterations per subsample. Default 100.
seed
Random seed for reproducibility. Default 42.
id_col
Column identifying each time series. Default "unique_id".
target_col
Column with the time series values. Default "y".
**distance_kwargs
Extra keyword arguments forwarded to the distance function.
Returns
pl.DataFrame
DataFrame with columns [id_col, "cluster"].
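Example
The subsample, PAM, and full-dataset evaluation loop described above can be sketched in plain Python. This is an illustrative sketch only, not the polars_ts implementation: it works on a precomputed distance matrix instead of a DataFrame, uses a simple greedy swap search as a stand-in for PAM, and the names clara_sketch, pam, and total_cost are hypothetical.

```python
import random


def total_cost(dist, medoids):
    """Sum of every point's distance to its nearest medoid (full dataset)."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, candidates, k, max_iter, rng):
    """Greedy swap search restricted to the indices in `candidates`."""
    medoids = rng.sample(candidates, k)

    def cost(meds):
        return sum(min(dist[i][m] for m in meds) for i in candidates)

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for c in candidates:
                if c in medoids:
                    continue
                # Try replacing the medoid at `pos` with candidate `c`.
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
        if not improved:
            break  # no swap lowers the subsample cost any further
    return medoids


def clara_sketch(dist, k, n_samples=5, sample_size=40, max_iter=100, seed=42):
    rng = random.Random(seed)
    n = len(dist)
    sample_size = min(sample_size, n)  # clamp to the number of series
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(range(n), sample_size)
        medoids = pam(dist, sample, k, max_iter, rng)
        cost = total_cost(dist, medoids)  # scored on the full dataset
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every point to the index of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][best_medoids[j]]) for i in range(n)]
    return best_medoids, labels, best_cost


# Toy data: two well-separated groups of scalar "series".
points = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = clara_sketch(dist, k=2, n_samples=5, sample_size=6)
print(medoids, labels, cost)
```

Scoring each candidate medoid set against the full dataset, rather than only its own subsample, is what lets the best of the n_samples runs be chosen on a comparable basis.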
clarans(df, k, method='dtw', num_local=2, max_neighbor=10, seed=42, id_col='unique_id', target_col='y', **distance_kwargs)
CLARANS: randomized medoid neighborhood search.
Performs num_local restarts of a local search that explores up to
max_neighbor random medoid swaps before declaring convergence.
Parameters
df
DataFrame with columns id_col and target_col.
k
Number of clusters.
method
Distance metric name (e.g. "dtw", "erp"). Default "dtw".
num_local
Number of random restarts. Default 2.
max_neighbor
Maximum random swap attempts per restart before stopping. Default 10.
seed
Random seed for reproducibility. Default 42.
id_col
Column identifying each time series. Default "unique_id".
target_col
Column with the time series values. Default "y".
**distance_kwargs
Extra keyword arguments forwarded to the distance function.
Returns
pl.DataFrame
DataFrame with columns [id_col, "cluster"].
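Example
The restart and random-swap search described above can be sketched in plain Python. This is an illustrative sketch only, not the polars_ts implementation: it works on a precomputed distance matrix instead of a DataFrame, the names clarans_sketch and total_cost are hypothetical, and it assumes the failed-swap counter resets after every improving swap, as in the Ng & Han formulation. The library may count attempts differently.

```python
import random


def total_cost(dist, medoids):
    """Sum of every point's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def clarans_sketch(dist, k, num_local=2, max_neighbor=10, seed=42):
    rng = random.Random(seed)
    n = len(dist)  # requires k < n so that a non-medoid always exists
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):
        # Each restart begins from a random set of k medoids.
        current = rng.sample(range(n), k)
        current_cost = total_cost(dist, current)
        failures = 0
        while failures < max_neighbor:
            # A neighbor differs from the current set in exactly one medoid.
            pos = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:pos] + [candidate] + current[pos + 1:]
            neighbor_cost = total_cost(dist, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                failures = 0  # assumption: counter resets on improvement
            else:
                failures += 1
        if current_cost < best_cost:
            best_medoids, best_cost = current, current_cost
    # Assign every point to the index of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][best_medoids[j]]) for i in range(n)]
    return best_medoids, labels, best_cost


# Toy data: two well-separated groups of scalar "series".
points = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = clarans_sketch(dist, k=2, num_local=2, max_neighbor=50)
print(medoids, labels, cost)
```

Unlike the subsample-based approach, every swap here is scored on the full dataset, so max_neighbor trades search thoroughness against the number of full cost evaluations.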