Skip to content

Clustering Time Series with polars-ts Distance Matrices

This guide shows how to use pairwise distance matrices computed by polars-ts as input to standard clustering algorithms from scipy and scikit-learn.

Overview

polars-ts computes pairwise distances between time series and returns the results as a Polars DataFrame with three columns:

Column Description
id_1 Identifier of the first series
id_2 Identifier of the second series
(metric) The distance value (column name matches the metric, e.g. dtw, msm)

The output contains one row per unique unordered pair — that is, only the upper triangle of the distance matrix, excluding the diagonal. This is exactly the information needed to build a condensed distance vector for scipy or a full square matrix for scikit-learn.


1. Computing a Full Pairwise Distance Matrix

All compute_pairwise_* functions accept two DataFrames with columns unique_id (series identifier) and y (values). Pass the same DataFrame twice to get the full pairwise matrix of all series against each other.

import polars as pl
from polars_ts import compute_pairwise_dtw
# Each series is identified by "unique_id"; values live in "y"
df = pl.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "y":         [1.0, 2, 3, 4, 5,
                  1.0, 2, 3, 4, 6,
                  5.0, 4, 3, 2, 1],
})

distances = compute_pairwise_dtw(df, df)
print(distances)
# shape: (3, 3)
# ┌──────┬──────┬─────┐
# │ id_1 │ id_2 │ dtw │
# │ ---  │ ---  │ --- │
# │ str  │ str  │ f64 │
# ╞══════╪══════╪═════╡
# │ A    │ B    │ 1.0 │
# │ A    │ C    │ 8.0 │
# │ B    │ C    │ 9.0 │
# └──────┴──────┴─────┘

Other available distance functions and their key parameters:

Function Distance column Key parameters
compute_pairwise_dtw dtw method ("sakoe_chiba", "itakura", "fast"), param
compute_pairwise_ddtw ddtw
compute_pairwise_wdtw wdtw g (weight penalty)
compute_pairwise_msm msm c (cost)
compute_pairwise_erp erp g (gap penalty)
compute_pairwise_lcss lcss epsilon (matching threshold)
compute_pairwise_twe twe nu, lambda_
compute_pairwise_dtw_multi dtw_multi metric ("manhattan", "euclidean")
compute_pairwise_msm_multi msm_multi c

2. Converting to a scipy Condensed Distance Vector

scipy's hierarchical clustering functions expect a condensed distance vector — a 1-D array containing the upper triangle of the distance matrix in row-major order. polars-ts already returns exactly these upper-triangle pairs, but the row order may not match what scipy expects. The helper below re-indexes the pairs into the correct order.

import numpy as np
from scipy.spatial.distance import squareform

def to_condensed(distances):
    # Convert a polars-ts pairwise result to a scipy condensed distance vector.
    # Returns a (condensed_vector, labels) tuple.
    cols = [c for c in distances.columns if c not in ("id_1", "id_2")]
dist_col = cols[0]
    # Collect all unique ids in sorted order
    all_ids = set(distances["id_1"].to_list()) | set(distances["id_2"].to_list())
    ids = sorted(all_ids)
    id_to_idx = {name: i for i, name in enumerate(ids)}
    n = len(ids)
    # Build the condensed vector in scipy's expected order
    condensed = np.zeros(n * (n - 1) // 2)
    for row in distances.iter_rows(named=True):
        i, j = id_to_idx[row["id_1"]], id_to_idx[row["id_2"]]
        if i > j:
            i, j = j, i
        idx = n * i - i * (i + 1) // 2 + (j - i - 1)
        condensed[idx] = row[dist_col]
    return condensed, ids

You can also convert to a full square matrix when needed:

condensed, labels = to_condensed(distances)
square_matrix = squareform(condensed)
print(square_matrix)

3. Hierarchical Clustering with scipy

Once you have the condensed vector you can pass it directly to scipy.cluster.hierarchy.linkage.

from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

condensed, labels = to_condensed(distances)
# Compute the linkage matrix (Ward's method is not valid for arbitrary
# distance matrices; use "average", "complete", or "single" instead)
Z = linkage(condensed, method="average")
# Cut the dendrogram to get flat clusters
cluster_labels = fcluster(Z, t=2, criterion="maxclust")
for name, cluster in zip(labels, cluster_labels):
    print(f"  {name} -> cluster {cluster}")

Drawing a Dendrogram

fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, labels=labels, ax=ax)
ax.set_ylabel("Distance")
ax.set_title("Hierarchical Clustering Dendrogram (DTW)")
plt.tight_layout()
plt.show()

4. Spectral Clustering with scikit-learn

Spectral clustering operates on an affinity (similarity) matrix rather than a distance matrix. Convert distances to affinities using a Gaussian kernel or simple inversion.

from sklearn.cluster import SpectralClustering

condensed, labels = to_condensed(distances)
square_matrix = squareform(condensed)
# Convert distances to affinities via a Gaussian kernel.
# sigma controls the width; a common heuristic is the median distance.
sigma = np.median(square_matrix[square_matrix > 0])
affinity_matrix = np.exp(-square_matrix ** 2 / (2 * sigma ** 2))
# Ensure the diagonal is 1 (self-similarity)
np.fill_diagonal(affinity_matrix, 1.0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="precomputed",
    random_state=42,
)
cluster_labels = sc.fit_predict(affinity_matrix)

for name, cluster in zip(labels, cluster_labels):
    print(f"  {name} -> cluster {cluster}")

5. Complete Worked Example

The example below generates synthetic time series, computes DTW distances, performs both hierarchical and spectral clustering, and prints the results.

import numpy as np
import polars as pl
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import squareform
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt

from polars_ts import compute_pairwise_dtw
# ── Step 1: Create sample time series ──────────────────────────────────
np.random.seed(42)
n_points = 50
# Group 1: upward trends
series = {}
for i in range(4):
    series[f"up_{i}"] = np.linspace(0, 10, n_points) + np.random.normal(0, 0.5, n_points)
# Group 2: downward trends
for i in range(4):
    series[f"down_{i}"] = np.linspace(10, 0, n_points) + np.random.normal(0, 0.5, n_points)
# Build a polars DataFrame in the expected format
df = pl.DataFrame({
    "unique_id": [name for name, vals in series.items() for _ in vals],
    "y": [v for vals in series.values() for v in vals],
})

print(f"Input shape: {df.shape}")
print(f"Series: {sorted(series.keys())}")
# ── Step 2: Compute pairwise DTW distances ─────────────────────────────
distances = compute_pairwise_dtw(df, df)
print(f"\nPairwise distances: {distances.shape[0]} pairs")
print(distances.head())
# ── Step 3: Convert to condensed form ──────────────────────────────────
cols = [c for c in distances.columns if c not in ("id_1", "id_2")]
dist_col = cols[0]
ids = sorted(set(distances["id_1"].to_list()) | set(distances["id_2"].to_list()))
id_to_idx = {name: i for i, name in enumerate(ids)}
n = len(ids)

condensed = np.zeros(n * (n - 1) // 2)
for row in distances.iter_rows(named=True):
    i, j = id_to_idx[row["id_1"]], id_to_idx[row["id_2"]]
    if i > j:
        i, j = j, i
    idx = n * i - i * (i + 1) // 2 + (j - i - 1)
    condensed[idx] = row[dist_col]
# ── Step 4: Hierarchical clustering ───────────────────────────────────
Z = linkage(condensed, method="average")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

print("\n=== Hierarchical Clustering (2 clusters) ===")
for name, cl in zip(ids, hier_labels):
    print(f"  {name:>8s} -> cluster {cl}")
# ── Step 5: Spectral clustering ───────────────────────────────────────
square_matrix = squareform(condensed)
sigma = np.median(square_matrix[square_matrix > 0])
affinity = np.exp(-square_matrix ** 2 / (2 * sigma ** 2))
np.fill_diagonal(affinity, 1.0)

sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=42)
spec_labels = sc.fit_predict(affinity)

print("\n=== Spectral Clustering (2 clusters) ===")
for name, cl in zip(ids, spec_labels):
    print(f"  {name:>8s} -> cluster {cl}")
# ── Step 6: Visualize ─────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Dendrogram
dendrogram(Z, labels=ids, ax=axes[0])
axes[0].set_title("Hierarchical Clustering Dendrogram")
axes[0].set_ylabel("DTW Distance")
axes[0].tick_params(axis="x", rotation=45)
# Heatmap of the distance matrix
im = axes[1].imshow(square_matrix, cmap="viridis")
axes[1].set_xticks(range(n))
axes[1].set_yticks(range(n))
axes[1].set_xticklabels(ids, rotation=45, ha="right")
axes[1].set_yticklabels(ids)
axes[1].set_title("DTW Distance Heatmap")
fig.colorbar(im, ax=axes[1], shrink=0.8)

plt.tight_layout()
plt.show()

Expected output

The eight series should cleanly separate into two clusters: all up_* series in one cluster and all down_* series in the other. Both hierarchical and spectral clustering should agree on this partition because the within-group DTW distances (small warping between noisy versions of the same trend) are much smaller than the between-group distances (comparing an upward trend against a downward trend).


Tips

  • Linkage method: Ward's method (method="ward") requires Euclidean distances and is not appropriate for arbitrary time-series distance metrics. Use "average", "complete", or "single" instead.
  • Choosing the number of clusters: Use the dendrogram to visually identify a natural cut height, or use the inconsistent criterion in fcluster.
  • Large datasets: polars-ts distance functions are implemented in Rust and run in parallel. For very large collections, consider using approximate methods like compute_pairwise_dtw(df, df, method="fast", param=5.0) (FastDTW).
  • Alternative metrics: Different distance metrics capture different notions of similarity. MSM is better for amplitude-sensitive comparisons, LCSS is robust to outliers, and WDTW penalizes large temporal shifts. Experiment with multiple metrics on your data.