Clustering: Discovering Hidden Patterns in Production Line Data
What Is Unsupervised Learning?
In previous lessons, every dataset had labels. Unsupervised learning works without labels -- the algorithm discovers structure on its own. You might have millions of logged operating hours for a pump, but no one has ever categorized its operating modes. Clustering finds these hidden groups automatically.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
K-Means: Dividing Data Into Groups
K-Means is the most widely used clustering algorithm. You specify k clusters, and the algorithm iteratively assigns each point to its nearest center, then updates each center to the mean of its assigned points, repeating until the centers stabilize.
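The assign/update loop is easy to sketch from scratch. This is a minimal illustration only (the naive initialization and fixed iteration count are simplifications); in practice use scikit-learn's KMeans, which handles initialization and convergence robustly:

```python
import numpy as np

def kmeans_step(X, centers):
    # Assignment: distance from every point to every center, pick the nearest
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: move each center to the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centers = X[[0, 50]]          # naive init: one point from each region (for the sketch)
for _ in range(10):           # a few iterations suffice for well-separated data
    labels, centers = kmeans_step(X, centers)
print(centers.round(1))       # centers land near (0, 0) and (3, 3)
```

The two-step structure is exactly what `KMeans.fit` runs under the hood, many times from different starting centers (that is what `n_init` controls).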
from sklearn.cluster import KMeans
np.random.seed(42)
mode_a = np.random.normal([50, 3.0], [5, 0.3], (200, 2))
mode_b = np.random.normal([75, 4.5], [4, 0.4], (200, 2))
mode_c = np.random.normal([95, 6.0], [3, 0.5], (150, 2))
data = np.vstack([mode_a, mode_b, mode_c])
df = pd.DataFrame(data, columns=["pressure_psi", "flow_rate_m3h"])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_scaled)
fig, ax = plt.subplots(figsize=(10, 7))
for c in range(3):
    subset = df[df["cluster"] == c]
    ax.scatter(subset["pressure_psi"], subset["flow_rate_m3h"],
               alpha=0.5, s=20, label=f"Cluster {c}")
ax.set_xlabel("Pressure (PSI)")
ax.set_ylabel("Flow Rate (m3/h)")
ax.set_title("K-Means: Compressor Operating Modes")
ax.legend()
plt.tight_layout()
plt.show()
centers = scaler.inverse_transform(kmeans.cluster_centers_)
for i, center in enumerate(centers):
    print(f"Cluster {i}: Pressure={center[0]:.1f} PSI, Flow={center[1]:.2f} m3/h")
Choosing the Number of Clusters: The Elbow Method
The Elbow Method plots within-cluster variance (inertia) against different k values. The "elbow", where adding more clusters stops reducing inertia sharply, suggests the natural number of groups. The silhouette score, which measures how well each point fits its own cluster versus the nearest neighboring one (higher is better, up to 1.0), provides a complementary check.
from sklearn.metrics import silhouette_score
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(range(1, 11), inertias, "bo-", markersize=8)
ax.set_xlabel("Number of Clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow Method for Optimal k")
plt.tight_layout()
plt.show()
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: Silhouette Score = {silhouette_score(X_scaled, labels):.3f}")
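Rather than eyeballing the printout, the choice can be automated by taking the k with the highest silhouette score. `best_k` below is a hypothetical helper (not part of scikit-learn), demonstrated here on synthetic blobs rather than the compressor data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def best_k(X, k_range=range(2, 8)):
    """Hypothetical helper: return the k with the highest silhouette score."""
    return max(k_range,
               key=lambda k: silhouette_score(
                   X, KMeans(n_clusters=k, random_state=42,
                             n_init=10).fit_predict(X)))

# Three well-separated synthetic blobs -> the score should peak at k=3
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(
    np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0.0, 3.0, 6.0)]))
print(best_k(X))
```

Automating the choice is convenient, but still sanity-check the result: the silhouette score rewards compact, well-separated clusters and can be misleading on elongated or overlapping groups.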
DBSCAN: Discovering Irregularly Shaped Clusters
K-Means assumes spherical clusters. DBSCAN finds clusters of any shape and automatically identifies outliers as noise.
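The difference is easiest to see on data K-Means cannot handle. The sketch below uses scikit-learn's `make_moons` (two interleaved half-circles) rather than the compressor data; the `eps=0.2` value is an assumption tuned to that dataset's scale:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: dense, but nothing like spheres
X, y = make_moons(n_samples=400, noise=0.05, random_state=42)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# DBSCAN follows the density and recovers both moons; K-Means cuts the
# plane in half and splits each moon between its two centers.
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
```

DBSCAN achieves this by growing clusters outward from "core" points that have at least `min_samples` neighbors within radius `eps`; any point not reachable from a core point is labeled noise (-1).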
from sklearn.cluster import DBSCAN
noise = np.random.uniform([30, 1], [110, 8], (50, 2))
data_with_noise = np.vstack([data, noise])
df_noisy = pd.DataFrame(data_with_noise, columns=["pressure_psi", "flow_rate_m3h"])
X_noisy_scaled = scaler.fit_transform(df_noisy)
dbscan = DBSCAN(eps=0.3, min_samples=10)
df_noisy["cluster"] = dbscan.fit_predict(X_noisy_scaled)
n_clusters = len(set(df_noisy["cluster"])) - (1 if -1 in df_noisy["cluster"].values else 0)
n_noise = (df_noisy["cluster"] == -1).sum()
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, eps in enumerate([0.2, 0.3, 0.5]):
    db = DBSCAN(eps=eps, min_samples=10)
    labels = db.fit_predict(X_noisy_scaled)
    axes[i].scatter(df_noisy["pressure_psi"], df_noisy["flow_rate_m3h"],
                    c=labels, cmap="Set1", alpha=0.5, s=15)
    axes[i].set_title(f"eps={eps}, clusters={len(set(labels)) - (1 if -1 in labels else 0)}")
plt.tight_layout()
plt.show()
Comparing K-Means and DBSCAN
| Feature | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical | Arbitrary |
| Number of clusters | Must specify k | Automatic |
| Noise handling | Assigns everything | Labels noise as -1 |
| Speed | Very fast | Moderate |
| Best for | Known number of modes | Unknown structure with outliers |
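The noise-handling row is the one that matters most in practice. A small sketch with synthetic operating modes and scattered outliers (the values are illustrative, not the compressor data) makes it concrete:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three tight operating modes plus sparse outliers (illustrative values)
modes = np.vstack([rng.normal(m, 0.2, (150, 2))
                   for m in ([0, 0], [3, 0], [1.5, 3])])
outliers = rng.uniform(-2, 5, (30, 2))
X = StandardScaler().fit_transform(np.vstack([modes, outliers]))

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
db = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

print("K-Means noise points:", (km == -1).sum())  # always 0: everything is assigned
print("DBSCAN  noise points:", (db == -1).sum())  # sparse outliers flagged as -1
```

If outliers must not contaminate your cluster statistics (centers, means per mode), DBSCAN's -1 label lets you filter them out before aggregating; K-Means silently folds them into the nearest cluster.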
Practical Example: Discovering Different Operating Patterns in a Pump
A water treatment plant logs pump data 24/7 but has never documented the different operating regimes.
np.random.seed(42)
hours = 720
timestamps = pd.date_range("2025-03-01", periods=hours, freq="h")
regime = np.random.choice(["idle", "normal", "peak"], hours, p=[0.2, 0.5, 0.3])
flow, power, vibration = [], [], []
for r in regime:
    if r == "idle":
        flow.append(np.random.normal(5, 1))
        power.append(np.random.normal(2, 0.5))
        vibration.append(np.random.normal(0.5, 0.1))
    elif r == "normal":
        flow.append(np.random.normal(25, 3))
        power.append(np.random.normal(15, 2))
        vibration.append(np.random.normal(2.0, 0.3))
    else:
        flow.append(np.random.normal(45, 4))
        power.append(np.random.normal(28, 3))
        vibration.append(np.random.normal(4.5, 0.5))
df = pd.DataFrame({"timestamp": timestamps, "flow_m3h": flow,
                   "power_kw": power, "vibration_mm_s": vibration})
df.set_index("timestamp", inplace=True)
X = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)
print(f"Silhouette Score: {silhouette_score(X, df['cluster']):.3f}")
print("\nDiscovered operating patterns:")
for c in sorted(df["cluster"].unique()):
    subset = df[df["cluster"] == c]
    print(f"  Pattern {c}: Flow={subset['flow_m3h'].mean():.1f}, "
          f"Power={subset['power_kw'].mean():.1f}, Hours={len(subset)}")
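Because the regimes here were simulated, we can check how well the discovered clusters recover them with `pd.crosstab`. In a real plant there is no such ground truth; the means and standard deviations below simply mirror the simulation above in compact form:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
regime = rng.choice(["idle", "normal", "peak"], 720, p=[0.2, 0.5, 0.3])
means = {"idle": [5, 2, 0.5], "normal": [25, 15, 2.0], "peak": [45, 28, 4.5]}
stds = {"idle": [1, 0.5, 0.1], "normal": [3, 2, 0.3], "peak": [4, 3, 0.5]}
X = np.array([rng.normal(means[r], stds[r]) for r in regime])

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(
    StandardScaler().fit_transform(X))

# Rows: true regime, columns: discovered cluster. A table where each row
# concentrates in a single column means clustering recovered the regimes.
print(pd.crosstab(pd.Series(regime, name="regime"),
                  pd.Series(labels, name="cluster")))
```

Note that cluster numbers are arbitrary: K-Means may call "idle" cluster 2 on one run and cluster 0 on another, so always match clusters to physical regimes by their feature statistics, not by label.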
Summary
In this lesson you learned unsupervised learning through clustering. K-Means partitions data into spherical clusters, and the Elbow Method helps choose k. DBSCAN discovers clusters of any shape and identifies noise. You compared both approaches and applied K-Means to discover hidden operating patterns in an undocumented pump system. In the next lesson, you will tackle anomaly detection for finding rare events that signal equipment problems.