# Model Evaluation: How Do You Know Your Model Actually Works?

## Why Accuracy Alone Is Not Enough
Imagine a quality inspection task where 98% of parts are good. A model that labels everything "good" achieves 98% accuracy yet catches zero defects. In industrial applications, the rare events are often the ones that matter most.
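A minimal sketch of this accuracy trap, using scikit-learn's `DummyClassifier` as a stand-in for the "label everything good" model (the 2% defect rate here is synthetic, chosen to match the example above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels: 1 = defect (~2% of parts), 0 = good
y = (rng.random(1000) < 0.02).astype(int)
X = np.zeros((1000, 1))  # features are irrelevant to this baseline

# A "model" that always predicts "good" (class 0)
baseline = DummyClassifier(strategy="constant", constant=0)
baseline.fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.1%}")  # high, near 98%
print(f"Recall:   {recall_score(y, y_pred):.1%}")    # 0% -- no defects caught
```

High accuracy, zero recall: exactly the failure mode the rest of this lesson's metrics are designed to expose.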
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score,
                             ConfusionMatrixDisplay)

# Simulated inspection data: four process measurements per part
np.random.seed(42)
n = 2000
features = pd.DataFrame({
    "thickness_mm": np.random.normal(3.0, 0.1, n),
    "hardness_hrc": np.random.normal(58, 2.5, n),
    "surface_um": np.random.exponential(1.2, n),
    "cycle_time_s": np.random.normal(12, 1.5, n)
})

# A part is defective (1) when any measurement is out of tolerance
labels = ((features["thickness_mm"] < 2.78) | (features["hardness_hrc"] < 53) |
          (features["surface_um"] > 4.5)).astype(int)
print(f"Defect rate: {labels.mean():.2%}")

# Stratified split preserves the defect rate in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)
```
## The Confusion Matrix: The Full Picture
The confusion matrix shows exactly how predictions map to reality.
```python
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(7, 6))
ConfusionMatrixDisplay(cm, display_labels=["Pass", "Fail"]).plot(ax=ax, cmap="Blues")
plt.tight_layout()
plt.show()

# ravel() flattens the 2x2 matrix in row order: tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(f"True Positives (caught defects): {tp}")
print(f"True Negatives (correct passes): {tn}")
print(f"False Positives (false alarms): {fp}")
print(f"False Negatives (missed defects): {fn}")
```
## Precision, Recall, and the F1 Score

- **Precision** = TP / (TP + FP): of the parts flagged as defective, how many are actual defects? High precision means few false alarms.
- **Recall** = TP / (TP + FN): of the actual defects, how many were caught? High recall means few missed defects.
- **F1 score**: the harmonic mean of precision and recall, balancing both.
```python
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
```
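To connect the formulas back to the confusion matrix, the same metrics can be computed by hand. The counts below are made up for illustration, not taken from the model above:

```python
# Hypothetical confusion-matrix counts for illustration
tp, fp, fn = 40, 10, 8

precision = tp / (tp + fp)  # 40 / 50 = 0.800
recall = tp / (tp + fn)     # 40 / 48 ~= 0.833

# Harmonic mean: dominated by the smaller of the two
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

Note how the harmonic mean pulls F1 toward the weaker metric: a model with 1.0 precision but 0.1 recall scores only about 0.18 F1, not 0.55.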
| Scenario | Priority | Reason |
|---|---|---|
| Safety-critical parts | Recall | Missing a defect is catastrophic |
| Expensive rework | Precision | Every false alarm costs labor |
| General quality control | F1 Score | Balance both error types |
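When recall is the priority, one common lever is lowering the decision threshold on predicted probabilities rather than retraining. A sketch on standalone synthetic data (the data-generation rule and threshold values are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
# Imbalanced synthetic labels: a part fails when the first feature is in the tail
y = (X[:, 0] > 1.8).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(defect) for each part

# Lowering the threshold flags more parts: recall rises, precision falls
for threshold in (0.5, 0.3, 0.1):
    y_hat = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_te, y_hat, zero_division=0):.3f}  "
          f"recall={recall_score(y_te, y_hat):.3f}")
```

For safety-critical parts, a low threshold trades extra false alarms for fewer missed defects; the table above tells you which direction to push.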
## Cross-Validation: A Stronger Test

A single train/test split can give misleading results. Cross-validation trains and tests the model k times, each time holding out a different fold as the test set.
```python
# Stratified folds keep the defect rate consistent across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(clf, features, labels, cv=cv, scoring="f1")
precision_scores = cross_val_score(clf, features, labels, cv=cv, scoring="precision")
recall_scores = cross_val_score(clf, features, labels, cv=cv, scoring="recall")

print("5-Fold Cross-Validation:")
print(f"  F1: {f1_scores.mean():.3f} +/- {f1_scores.std():.3f}")
print(f"  Precision: {precision_scores.mean():.3f} +/- {precision_scores.std():.3f}")
print(f"  Recall: {recall_scores.mean():.3f} +/- {recall_scores.std():.3f}")
```
If the standard deviation across folds is high, your model is unstable -- it depends heavily on which data it sees, suggesting you need more data or a simpler model.
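As an aside, three separate `cross_val_score` calls refit the model three times per fold; scikit-learn's `cross_validate` accepts a list of scorers and computes all metrics in one pass. A sketch on standalone synthetic data (the dataset here is an assumption, not the inspection data above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced synthetic data standing in for the inspection features
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One fit per fold, three metrics per fit
results = cross_validate(clf, X, y, cv=cv, scoring=["f1", "precision", "recall"])
for metric in ("f1", "precision", "recall"):
    scores = results[f"test_{metric}"]
    print(f"{metric:>9s}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The per-metric fold scores land in `results["test_<metric>"]`, so the mean/std reporting pattern from above carries over directly.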
## Overfitting: When the Model Memorizes Instead of Learning
Overfitting occurs when a model learns training noise and fails to generalize.
```python
# Train trees of increasing depth and compare training vs. test performance
train_scores, test_scores = [], []
depths = range(1, 25)
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42)
    dt.fit(X_train, y_train)
    train_scores.append(f1_score(y_train, dt.predict(X_train)))
    test_scores.append(f1_score(y_test, dt.predict(X_test)))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(depths, train_scores, "b-o", markersize=4, label="Training F1")
ax.plot(depths, test_scores, "r-o", markersize=4, label="Test F1")
ax.set_xlabel("Tree Depth")
ax.set_ylabel("F1 Score")
ax.set_title("Overfitting: Training vs Test Performance")
ax.legend()
plt.tight_layout()
plt.show()

print(f"Optimal depth: {depths[np.argmax(test_scores)]}")
```
Remedies: reduce complexity, apply regularization, gather more data, use cross-validation to detect overfitting early.
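As a concrete illustration of the "reduce complexity / apply regularization" remedy, a decision tree can be constrained with `min_samples_leaf` (or pruned with `ccp_alpha`). A sketch on synthetic data; the parameter value of 20 is an assumption for illustration, not a tuned recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 5% label noise, so a perfect training fit must be memorization
X, y = make_classification(n_samples=1000, n_informative=4, flip_y=0.05,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Unconstrained tree: grows until it memorizes the training set
deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
# Regularized tree: every leaf must cover at least 20 training samples
pruned = DecisionTreeClassifier(min_samples_leaf=20, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unconstrained", deep), ("regularized", pruned)]:
    gap = f1_score(y_tr, tree.predict(X_tr)) - f1_score(y_te, tree.predict(X_te))
    print(f"{name:>13s}: depth={tree.get_depth():2d}  train-test F1 gap={gap:+.3f}")
```

A large train-test gap is the signature of memorization; the regularized tree typically trades a little training performance for a much smaller gap.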
## Practical Example: Evaluating a Quality Inspection Model
Compare multiple models and select the best for a factory quality system.
```python
from sklearn.svm import SVC

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", random_state=42)
}

# Rank candidates by cross-validated F1, then inspect the winner on the test set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_model_name, best_f1 = None, 0
for name, model in models.items():
    scores = cross_val_score(model, features, labels, cv=cv, scoring="f1")
    mean_f1 = scores.mean()
    print(f"{name:25s} F1={mean_f1:.3f} (+/- {scores.std():.3f})")
    if mean_f1 > best_f1:
        best_f1 = mean_f1
        best_model_name = name

print(f"\nBest model: {best_model_name}")
best_model = models[best_model_name]
best_model.fit(X_train, y_train)
print(classification_report(y_test, best_model.predict(X_test),
                            target_names=["Pass", "Fail"]))
```
## Summary
In this lesson you learned why accuracy is misleading for imbalanced industrial data. The confusion matrix shows the full picture. Precision, recall, and F1 quantify different error types. Cross-validation provides robust generalization estimates. Understanding overfitting helps build models that work on new data. In the next lesson, you will deploy your evaluated model from a notebook to a production service.