Linear Regression: Predicting Machine Behavior From Historical Data
What Is Regression and Why It Matters for Engineers
Regression predicts a continuous number from input data. In industry, this means predicting energy consumption from production volume, estimating remaining tool life from wear measurements, or forecasting output quality from process parameters.
Unlike classification, which answers "which category?", regression answers "how much?" -- and that distinction drives critical industrial decisions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Simple Linear Regression: One Line Summarizing Data
The simplest regression fits a straight line through your data. One input, one output, one relationship.
# Synthetic example: power draw grows roughly linearly with production speed
np.random.seed(42)
speed = np.random.uniform(40, 120, 200)
power = 2.5 + 0.08 * speed + np.random.normal(0, 1.0, 200)
# scikit-learn expects a 2D feature array, hence the reshape
X = speed.reshape(-1, 1)
y = power
model = LinearRegression()
model.fit(X, y)
print(f"Slope: {model.coef_[0]:.4f} kW per unit speed")
print(f"Prediction at speed 80: {model.predict([[80]])[0]:.2f} kW")
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(speed, power, alpha=0.4, s=15, label="Observed")
line_x = np.linspace(35, 125, 100).reshape(-1, 1)
ax.plot(line_x, model.predict(line_x), "r-", lw=2, label="Regression line")
ax.set_xlabel("Production Speed (units/hour)")
ax.set_ylabel("Power Consumption (kW)")
ax.legend()
plt.tight_layout()
plt.show()
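Under the hood, the fitted model is just the equation power = intercept + slope * speed, so a manual calculation should reproduce model.predict() exactly. A minimal sketch, regenerating the same synthetic data as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(42)
speed = np.random.uniform(40, 120, 200)
power = 2.5 + 0.08 * speed + np.random.normal(0, 1.0, 200)

model = LinearRegression()
model.fit(speed.reshape(-1, 1), power)

# The model is just: power = intercept + slope * speed,
# so computing it by hand must match model.predict()
manual = model.intercept_ + model.coef_[0] * 80.0
from_model = model.predict([[80.0]])[0]
print(f"manual: {manual:.2f} kW, predict(): {from_model:.2f} kW")
```

The fitted slope should land near the 0.08 kW per unit used to generate the data; the gap shrinks as the sample grows.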
Multiple Regression: Several Factors Acting Together
Real processes depend on multiple variables simultaneously. A motor's temperature depends on load, ambient temperature, and cooling flow -- not just one factor.
# Synthetic motor data: temperature driven by load, ambient, and cooling
np.random.seed(42)
n = 500
load = np.random.uniform(20, 100, n)
ambient_temp = np.random.uniform(15, 40, n)
coolant_flow = np.random.uniform(5, 20, n)
motor_temp = (35 + 0.3 * load + 0.5 * ambient_temp
              - 0.8 * coolant_flow + np.random.normal(0, 2, n))
df = pd.DataFrame({"load_pct": load, "ambient_c": ambient_temp,
"coolant_lpm": coolant_flow, "motor_temp_c": motor_temp})
X = df[["load_pct", "ambient_c", "coolant_lpm"]]
y = df["motor_temp_c"]
model_multi = LinearRegression()
model_multi.fit(X, y)
for name, coef in zip(X.columns, model_multi.coef_):
print(f"{name}: {coef:+.4f}")
The fitted coefficients tell the story, landing close to the values used to generate the data: each percentage point of load adds about 0.3 degrees, each degree of ambient adds about 0.5, and each liter per minute of coolant flow removes about 0.8 degrees.
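Because the model is linear, the predicted effect of a process change is simply the coefficient times the size of the change. A sketch at an illustrative operating point (80% load, 30 C ambient, 10 L/min coolant -- these values are assumed here, not taken from the text above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "load_pct": rng.uniform(20, 100, n),
    "ambient_c": rng.uniform(15, 40, n),
    "coolant_lpm": rng.uniform(5, 20, n),
})
y = (35 + 0.3 * X["load_pct"] + 0.5 * X["ambient_c"]
     - 0.8 * X["coolant_lpm"] + rng.normal(0, 2, n))

model = LinearRegression().fit(X, y)

# Raising coolant flow by 5 L/min at otherwise identical conditions
# should lower the prediction by roughly 5 * 0.8 = 4 degrees
base = pd.DataFrame([[80, 30, 10]], columns=X.columns)
more_coolant = pd.DataFrame([[80, 30, 15]], columns=X.columns)
delta = model.predict(more_coolant)[0] - model.predict(base)[0]
print(f"Effect of +5 L/min coolant: {delta:+.2f} C")
```

In a linear model this difference is exactly 5 times the coolant coefficient, regardless of the operating point chosen.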
Training and Testing: Splitting the Data
A model that only works on data it has already seen is useless. To measure real performance, test on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
model_eval = LinearRegression()
model_eval.fit(X_train, y_train)
y_pred = model_eval.predict(X_test)
The 80/20 split is a common starting point. With very large datasets, you can use 90/10. With small datasets, consider cross-validation instead.
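With k-fold cross-validation, every sample is used for testing exactly once instead of permanently setting 20% aside. A minimal sketch on the same kind of synthetic motor data, using scikit-learn's cross_val_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(20, 100, n),   # load_pct
    rng.uniform(15, 40, n),    # ambient_c
    rng.uniform(5, 20, n),     # coolant_lpm
])
y = 35 + 0.3 * X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(0, 2, n)

# 5-fold CV: the data is split into 5 parts, and each part takes one
# turn as the test set, giving 5 independent R2 estimates
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"R2 per fold: {np.round(scores, 3)}")
print(f"Mean R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds is useful in itself: a model whose score swings wildly between folds is less trustworthy than its mean score suggests.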
Evaluating the Model: R-squared and MSE
R-squared (R2) tells you what fraction of the variance in the target your model explains. An R2 of 0.85 means 85% of the variation is accounted for by the input features; the remaining 15% is noise or unmodeled factors.
RMSE (Root Mean Squared Error) is the typical size of a prediction error, expressed in the same unit as your target, which makes it directly interpretable.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f} C")
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(y_test, y_pred, alpha=0.5, s=15)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
"r--", lw=2, label="Perfect prediction")
ax.set_xlabel("Actual Temperature (C)")
ax.set_ylabel("Predicted Temperature (C)")
ax.set_title(f"Predicted vs Actual (R2={r2:.3f})")
ax.legend()
plt.tight_layout()
plt.show()
Practical Example: Predicting Energy Consumption From Production Data
A factory wants to predict daily energy consumption to optimize its electricity contract.
# One simulated year of daily production data
np.random.seed(42)
days = 365
df = pd.DataFrame({
    "production_tons": np.random.uniform(50, 200, days),
    "ambient_temp_c": 20 + 10 * np.sin(np.linspace(0, 2 * np.pi, days)),  # seasonal swing
    "machines_active": np.random.randint(3, 8, days),
    "weekend": np.tile([0, 0, 0, 0, 0, 1, 1], 53)[:days]
})
# Energy: base load plus contributions per ton, per degree, per machine
df["energy_kwh"] = (500 + 12 * df["production_tons"]
                    + 8 * df["ambient_temp_c"]
                    + 150 * df["machines_active"]
                    - 200 * df["weekend"]
                    + np.random.normal(0, 100, days))
X = df[["production_tons", "ambient_temp_c", "machines_active", "weekend"]]
y = df["energy_kwh"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.1f} kWh")
print("\nFactor contributions:")
for name, coef in zip(X.columns, model.coef_):
print(f" {name}: {coef:+.2f} kWh per unit")
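Once fitted, the model turns a planned production schedule into an energy forecast. A sketch with a made-up schedule entry (a 150-ton weekday at 25 C with 6 machines active -- the values are illustrative, not from the original text):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Regenerate the same synthetic year of factory data as above
np.random.seed(42)
days = 365
df = pd.DataFrame({
    "production_tons": np.random.uniform(50, 200, days),
    "ambient_temp_c": 20 + 10 * np.sin(np.linspace(0, 2 * np.pi, days)),
    "machines_active": np.random.randint(3, 8, days),
    "weekend": np.tile([0, 0, 0, 0, 0, 1, 1], 53)[:days],
})
df["energy_kwh"] = (500 + 12 * df["production_tons"]
                    + 8 * df["ambient_temp_c"]
                    + 150 * df["machines_active"]
                    - 200 * df["weekend"]
                    + np.random.normal(0, 100, days))

features = ["production_tons", "ambient_temp_c", "machines_active", "weekend"]
model = LinearRegression().fit(df[features], df["energy_kwh"])

# Illustrative plan: a 150-ton weekday at 25 C with 6 machines running
plan = pd.DataFrame([[150, 25, 6, 0]], columns=features)
forecast = model.predict(plan)[0]
print(f"Forecast energy use: {forecast:.0f} kWh")
```

Feeding the model a whole DataFrame of planned days would produce the daily forecast curve needed for the electricity contract negotiation.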
Summary
In this lesson you learned regression for predicting continuous values. You built simple linear regression with one variable and multiple regression with several factors. You practiced splitting data into training and testing sets, and evaluated model quality using R-squared and RMSE. Finally, you predicted factory energy consumption from production data. In the next lesson, you will move to classification -- predicting categories instead of numbers.