# Model Evaluation: How Do You Know Your Model Actually Works?

## Why Accuracy Alone Is Not Enough
Imagine a quality inspection task where 98% of parts are good. A model that labels everything "good" achieves 98% accuracy yet catches zero defects. In industrial applications, the rare events are often the ones that matter most.
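A minimal sketch of this accuracy trap, using scikit-learn's `DummyClassifier` as a stand-in for the "label everything good" model (the 2% defect rate here is synthetic, chosen to match the example above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels: 1 = defect (~2% of parts), 0 = good
y = (rng.random(1000) < 0.02).astype(int)
X = np.zeros((1000, 1))  # features are irrelevant to this baseline

# A "model" that always predicts "good" (class 0)
baseline = DummyClassifier(strategy="constant", constant=0)
baseline.fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.1%}")  # high, near 98%
print(f"Recall:   {recall_score(y, y_pred):.1%}")    # 0% -- no defects caught
```

High accuracy, zero recall: exactly the failure mode the rest of this lesson's metrics are designed to expose.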
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score,
                             ConfusionMatrixDisplay)

# Simulated inspection data: four process measurements per part
np.random.seed(42)
n = 2000
features = pd.DataFrame({
    "thickness_mm": np.random.normal(3.0, 0.1, n),
    "hardness_hrc": np.random.normal(58, 2.5, n),
    "surface_um": np.random.exponential(1.2, n),
    "cycle_time_s": np.random.normal(12, 1.5, n)
})

# A part is defective (1) when any measurement is out of tolerance
labels = ((features["thickness_mm"] < 2.78) | (features["hardness_hrc"] < 53) |
          (features["surface_um"] > 4.5)).astype(int)
print(f"Defect rate: {labels.mean():.2%}")

# Stratified split preserves the defect rate in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)
```
## The Confusion Matrix: The Full Picture
The confusion matrix shows exactly how predictions map to reality.
```python
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(7, 6))
ConfusionMatrixDisplay(cm, display_labels=["Pass", "Fail"]).plot(ax=ax, cmap="Blues")
plt.tight_layout()
plt.show()

# ravel() flattens the 2x2 matrix in row order: tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(f"True Positives (caught defects): {tp}")
print(f"True Negatives (correct passes): {tn}")
print(f"False Positives (false alarms): {fp}")
print(f"False Negatives (missed defects): {fn}")
```
## Precision, Recall, and the F1 Score

- **Precision** = TP / (TP + FP): of the parts flagged as defective, how many are actual defects? High precision means few false alarms.
- **Recall** = TP / (TP + FN): of the actual defects, how many were caught? High recall means few missed defects.
- **F1 score**: the harmonic mean of precision and recall, balancing both.
```python
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
```
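To connect the formulas back to the confusion matrix, the same metrics can be computed by hand. The counts below are made up for illustration, not taken from the model above:

```python
# Hypothetical confusion-matrix counts for illustration
tp, fp, fn = 40, 10, 8

precision = tp / (tp + fp)  # 40 / 50 = 0.800
recall = tp / (tp + fn)     # 40 / 48 ~= 0.833

# Harmonic mean: dominated by the smaller of the two
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

Note how the harmonic mean pulls F1 toward the weaker metric: a model with 1.0 precision but 0.1 recall scores only about 0.18 F1, not 0.55.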
| Scenario | Priority | Reason |
|---|---|---|
| Safety-critical parts | Recall | Missing a defect is catastrophic |
| Expensive rework | Precision | Every false alarm costs labor |
| General quality control | F1 Score | Balance both error types |
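When recall is the priority, one common lever is lowering the decision threshold on predicted probabilities rather than retraining. A sketch on standalone synthetic data (the data-generation rule and threshold values are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
# Imbalanced synthetic labels: a part fails when the first feature is in the tail
y = (X[:, 0] > 1.8).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(defect) for each part

# Lowering the threshold flags more parts: recall rises, precision falls
for threshold in (0.5, 0.3, 0.1):
    y_hat = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_te, y_hat, zero_division=0):.3f}  "
          f"recall={recall_score(y_te, y_hat):.3f}")
```

For safety-critical parts, a low threshold trades extra false alarms for fewer missed defects; the table above tells you which direction to push.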
## Cross-Validation: A Stronger Test

A single train/test split can give misleading results. Cross-validation trains and tests the model k times, each time holding out a different fold as the test set.
```python
# Stratified folds keep the defect rate consistent across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(clf, features, labels, cv=cv, scoring="f1")
precision_scores = cross_val_score(clf, features, labels, cv=cv, scoring="precision")
recall_scores = cross_val_score(clf, features, labels, cv=cv, scoring="recall")

print("5-Fold Cross-Validation:")
print(f"  F1: {f1_scores.mean():.3f} +/- {f1_scores.std():.3f}")
print(f"  Precision: {precision_scores.mean():.3f} +/- {precision_scores.std():.3f}")
print(f"  Recall: {recall_scores.mean():.3f} +/- {recall_scores.std():.3f}")
```
If the standard deviation across folds is high, your model is unstable -- it depends heavily on which data it sees, suggesting you need more data or a simpler model.
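As an aside, three separate `cross_val_score` calls refit the model three times per fold; scikit-learn's `cross_validate` accepts a list of scorers and computes all metrics in one pass. A sketch on standalone synthetic data (the dataset here is an assumption, not the inspection data above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced synthetic data standing in for the inspection features
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One fit per fold, three metrics per fit
results = cross_validate(clf, X, y, cv=cv, scoring=["f1", "precision", "recall"])
for metric in ("f1", "precision", "recall"):
    scores = results[f"test_{metric}"]
    print(f"{metric:>9s}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The per-metric fold scores land in `results["test_<metric>"]`, so the mean/std reporting pattern from above carries over directly.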
## Overfitting: When the Model Memorizes Instead of Learning
Overfitting occurs when a model learns training noise and fails to generalize.
```python
# Train trees of increasing depth and compare training vs. test performance
train_scores, test_scores = [], []
depths = range(1, 25)
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42)
    dt.fit(X_train, y_train)
    train_scores.append(f1_score(y_train, dt.predict(X_train)))
    test_scores.append(f1_score(y_test, dt.predict(X_test)))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(depths, train_scores, "b-o", markersize=4, label="Training F1")
ax.plot(depths, test_scores, "r-o", markersize=4, label="Test F1")
ax.set_xlabel("Tree Depth")
ax.set_ylabel("F1 Score")
ax.set_title("Overfitting: Training vs Test Performance")
ax.legend()
plt.tight_layout()
plt.show()

print(f"Optimal depth: {depths[np.argmax(test_scores)]}")
```
Remedies: reduce complexity, apply regularization, gather more data, use cross-validation to detect overfitting early.
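As a concrete illustration of the "reduce complexity / apply regularization" remedy, a decision tree can be constrained with `min_samples_leaf` (or pruned with `ccp_alpha`). A sketch on synthetic data; the parameter value of 20 is an assumption for illustration, not a tuned recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 5% label noise, so a perfect training fit must be memorization
X, y = make_classification(n_samples=1000, n_informative=4, flip_y=0.05,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Unconstrained tree: grows until it memorizes the training set
deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
# Regularized tree: every leaf must cover at least 20 training samples
pruned = DecisionTreeClassifier(min_samples_leaf=20, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unconstrained", deep), ("regularized", pruned)]:
    gap = f1_score(y_tr, tree.predict(X_tr)) - f1_score(y_te, tree.predict(X_te))
    print(f"{name:>13s}: depth={tree.get_depth():2d}  train-test F1 gap={gap:+.3f}")
```

A large train-test gap is the signature of memorization; the regularized tree typically trades a little training performance for a much smaller gap.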
## Practical Example: Evaluating a Quality Inspection Model
Compare multiple models and select the best for a factory quality system.
```python
from sklearn.svm import SVC

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", random_state=42)
}

# Rank candidates by cross-validated F1, then inspect the winner on the test set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_model_name, best_f1 = None, 0
for name, model in models.items():
    scores = cross_val_score(model, features, labels, cv=cv, scoring="f1")
    mean_f1 = scores.mean()
    print(f"{name:25s} F1={mean_f1:.3f} (+/- {scores.std():.3f})")
    if mean_f1 > best_f1:
        best_f1 = mean_f1
        best_model_name = name

print(f"\nBest model: {best_model_name}")
best_model = models[best_model_name]
best_model.fit(X_train, y_train)
print(classification_report(y_test, best_model.predict(X_test),
                            target_names=["Pass", "Fail"]))
```
## Summary
In this lesson you learned why accuracy is misleading for imbalanced industrial data. The confusion matrix shows the full picture. Precision, recall, and F1 quantify different error types. Cross-validation provides robust generalization estimates. Understanding overfitting helps build models that work on new data. In the next lesson, you will deploy your evaluated model from a notebook to a production service.