Clustering: Discovering Hidden Patterns in Production Line Data
What Is Unsupervised Learning?
In previous lessons, every dataset had labels. Unsupervised learning works without labels -- the algorithm discovers structure on its own. You might have millions of logged operating hours for a pump, but no one has ever categorized its operating modes. Clustering finds these hidden groups automatically.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
K-Means: Dividing Data Into Groups
K-Means is the most widely used clustering algorithm. You specify k clusters, and the algorithm iteratively assigns each point to its nearest center, then updates each center to the mean of its assigned points, repeating until the centers stabilize.
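The assign/update loop is easy to sketch from scratch. This is a minimal illustration only (the naive initialization and fixed iteration count are simplifications); in practice use scikit-learn's KMeans, which handles initialization and convergence robustly:

```python
import numpy as np

def kmeans_step(X, centers):
    # Assignment: distance from every point to every center, pick the nearest
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: move each center to the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centers = X[[0, 50]]          # naive init: one point from each region (for the sketch)
for _ in range(10):           # a few iterations suffice for well-separated data
    labels, centers = kmeans_step(X, centers)
print(centers.round(1))       # centers land near (0, 0) and (3, 3)
```

The two-step structure is exactly what `KMeans.fit` runs under the hood, many times from different starting centers (that is what `n_init` controls).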
from sklearn.cluster import KMeans
np.random.seed(42)
mode_a = np.random.normal([50, 3.0], [5, 0.3], (200, 2))
mode_b = np.random.normal([75, 4.5], [4, 0.4], (200, 2))
mode_c = np.random.normal([95, 6.0], [3, 0.5], (150, 2))
data = np.vstack([mode_a, mode_b, mode_c])
df = pd.DataFrame(data, columns=["pressure_psi", "flow_rate_m3h"])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_scaled)
fig, ax = plt.subplots(figsize=(10, 7))
for c in range(3):
    subset = df[df["cluster"] == c]
    ax.scatter(subset["pressure_psi"], subset["flow_rate_m3h"],
               alpha=0.5, s=20, label=f"Cluster {c}")
ax.set_xlabel("Pressure (PSI)")
ax.set_ylabel("Flow Rate (m3/h)")
ax.set_title("K-Means: Compressor Operating Modes")
ax.legend()
plt.tight_layout()
plt.show()
centers = scaler.inverse_transform(kmeans.cluster_centers_)
for i, center in enumerate(centers):
    print(f"Cluster {i}: Pressure={center[0]:.1f} PSI, Flow={center[1]:.2f} m3/h")
Choosing the Number of Clusters: The Elbow Method
The Elbow Method plots within-cluster variance (inertia) against different k values. The "elbow", where adding more clusters stops reducing inertia sharply, suggests the natural number of groups. The silhouette score, which measures how well each point fits its own cluster versus the nearest neighboring one (higher is better, up to 1.0), provides a complementary check.
from sklearn.metrics import silhouette_score
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(1, 11), inertias, "bo-", markersize=8)
ax.set_xlabel("Number of Clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow Method for Optimal k")
plt.tight_layout()
plt.show()
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: Silhouette Score = {silhouette_score(X_scaled, labels):.3f}")
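Rather than eyeballing the printout, the choice can be automated by taking the k with the highest silhouette score. `best_k` below is a hypothetical helper (not part of scikit-learn), demonstrated here on synthetic blobs rather than the compressor data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def best_k(X, k_range=range(2, 8)):
    """Hypothetical helper: return the k with the highest silhouette score."""
    return max(k_range,
               key=lambda k: silhouette_score(
                   X, KMeans(n_clusters=k, random_state=42,
                             n_init=10).fit_predict(X)))

# Three well-separated synthetic blobs -> the score should peak at k=3
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(
    np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0.0, 3.0, 6.0)]))
print(best_k(X))
```

Automating the choice is convenient, but still sanity-check the result: the silhouette score rewards compact, well-separated clusters and can be misleading on elongated or overlapping groups.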
DBSCAN: Discovering Irregularly Shaped Clusters
K-Means assumes spherical clusters. DBSCAN finds clusters of any shape and automatically identifies outliers as noise.
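The difference is easiest to see on data K-Means cannot handle. The sketch below uses scikit-learn's `make_moons` (two interleaved half-circles) rather than the compressor data; the `eps=0.2` value is an assumption tuned to that dataset's scale:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: dense, but nothing like spheres
X, y = make_moons(n_samples=400, noise=0.05, random_state=42)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# DBSCAN follows the density and recovers both moons; K-Means cuts the
# plane in half and splits each moon between its two centers.
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
```

DBSCAN achieves this by growing clusters outward from "core" points that have at least `min_samples` neighbors within radius `eps`; any point not reachable from a core point is labeled noise (-1).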
from sklearn.cluster import DBSCAN
noise = np.random.uniform([30, 1], [110, 8], (50, 2))
data_with_noise = np.vstack([data, noise])
df_noisy = pd.DataFrame(data_with_noise, columns=["pressure_psi", "flow_rate_m3h"])
X_noisy_scaled = scaler.fit_transform(df_noisy)
dbscan = DBSCAN(eps=0.3, min_samples=10)
df_noisy["cluster"] = dbscan.fit_predict(X_noisy_scaled)
n_clusters = len(set(df_noisy["cluster"])) - (1 if -1 in df_noisy["cluster"].values else 0)
n_noise = (df_noisy["cluster"] == -1).sum()
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, eps in enumerate([0.2, 0.3, 0.5]):
    db = DBSCAN(eps=eps, min_samples=10)
    labels = db.fit_predict(X_noisy_scaled)
    axes[i].scatter(df_noisy["pressure_psi"], df_noisy["flow_rate_m3h"],
                    c=labels, cmap="Set1", alpha=0.5, s=15)
    axes[i].set_title(f"eps={eps}, clusters={len(set(labels)) - (1 if -1 in labels else 0)}")
plt.tight_layout()
plt.show()
Comparing K-Means and DBSCAN
| Feature | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical | Arbitrary |
| Number of clusters | Must specify k | Automatic |
| Noise handling | Assigns everything | Labels noise as -1 |
| Speed | Very fast | Moderate |
| Best for | Known number of modes | Unknown structure with outliers |
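The noise-handling row is the one that matters most in practice. A small sketch with synthetic operating modes and scattered outliers (the values are illustrative, not the compressor data) makes it concrete:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three tight operating modes plus sparse outliers (illustrative values)
modes = np.vstack([rng.normal(m, 0.2, (150, 2))
                   for m in ([0, 0], [3, 0], [1.5, 3])])
outliers = rng.uniform(-2, 5, (30, 2))
X = StandardScaler().fit_transform(np.vstack([modes, outliers]))

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
db = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

print("K-Means noise points:", (km == -1).sum())  # always 0: everything is assigned
print("DBSCAN  noise points:", (db == -1).sum())  # sparse outliers flagged as -1
```

If outliers must not contaminate your cluster statistics (centers, means per mode), DBSCAN's -1 label lets you filter them out before aggregating; K-Means silently folds them into the nearest cluster.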
Practical Example: Discovering Different Operating Patterns in a Pump
A water treatment plant logs pump data 24/7 but has never documented the different operating regimes.
np.random.seed(42)
hours = 720
timestamps = pd.date_range("2025-03-01", periods=hours, freq="h")
regime = np.random.choice(["idle", "normal", "peak"], hours, p=[0.2, 0.5, 0.3])
flow, power, vibration = [], [], []
for r in regime:
    if r == "idle":
        flow.append(np.random.normal(5, 1))
        power.append(np.random.normal(2, 0.5))
        vibration.append(np.random.normal(0.5, 0.1))
    elif r == "normal":
        flow.append(np.random.normal(25, 3))
        power.append(np.random.normal(15, 2))
        vibration.append(np.random.normal(2.0, 0.3))
    else:
        flow.append(np.random.normal(45, 4))
        power.append(np.random.normal(28, 3))
        vibration.append(np.random.normal(4.5, 0.5))
df = pd.DataFrame({"timestamp": timestamps, "flow_m3h": flow,
                   "power_kw": power, "vibration_mm_s": vibration})
df.set_index("timestamp", inplace=True)
X = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)
print(f"Silhouette Score: {silhouette_score(X, df['cluster']):.3f}")
print("\nDiscovered operating patterns:")
for c in sorted(df["cluster"].unique()):
    subset = df[df["cluster"] == c]
    print(f"  Pattern {c}: Flow={subset['flow_m3h'].mean():.1f}, "
          f"Power={subset['power_kw'].mean():.1f}, Hours={len(subset)}")
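Because the regimes here were simulated, we can check how well the discovered clusters recover them with `pd.crosstab`. In a real plant there is no such ground truth; the means and standard deviations below simply mirror the simulation above in compact form:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
regime = rng.choice(["idle", "normal", "peak"], 720, p=[0.2, 0.5, 0.3])
means = {"idle": [5, 2, 0.5], "normal": [25, 15, 2.0], "peak": [45, 28, 4.5]}
stds = {"idle": [1, 0.5, 0.1], "normal": [3, 2, 0.3], "peak": [4, 3, 0.5]}
X = np.array([rng.normal(means[r], stds[r]) for r in regime])

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(
    StandardScaler().fit_transform(X))

# Rows: true regime, columns: discovered cluster. A table where each row
# concentrates in a single column means clustering recovered the regimes.
print(pd.crosstab(pd.Series(regime, name="regime"),
                  pd.Series(labels, name="cluster")))
```

Note that cluster numbers are arbitrary: K-Means may call "idle" cluster 2 on one run and cluster 0 on another, so always match clusters to physical regimes by their feature statistics, not by label.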
Summary
In this lesson you learned unsupervised learning through clustering. K-Means partitions data into spherical clusters, and the Elbow Method helps choose k. DBSCAN discovers clusters of any shape and identifies noise. You compared both approaches and applied K-Means to discover hidden operating patterns in an undocumented pump system. In the next lesson, you will tackle anomaly detection for finding rare events that signal equipment problems.