ML from Scratch · Part 3

Regression - Predicting Continuous Values

In this series (5 parts)
  1. Introduction to Machine Learning
  2. Supervised Learning - Learning from Labeled Data
  3. Regression - Predicting Continuous Values
  4. Classification - Predicting Categories
  5. Unsupervised Learning - Finding Hidden Patterns

We’ve established that supervised learning has two flavors: regression and classification. In this post, we go deep on regression - predicting a continuous numerical value.

What is Regression?

Regression answers the question: “How much?” or “How many?”

Given input features, a regression model outputs a continuous number - not a category, not a label, but a value on a number line.

flowchart LR
  A["Input Features
(Size, Bedrooms, Age)"] --> B["Regression Model"]
  B --> C["Continuous Output
($285,000)"]
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#e8f4fd,stroke:#1a5276,color:#1a5276

Real-World Regression Problems

| Problem | Input Features | Output (Prediction) |
| --- | --- | --- |
| House price prediction | Size, location, rooms | Price in dollars |
| Salary estimation | Experience, education, role | Annual salary |
| Temperature forecasting | Historical data, humidity | Temperature (°C) |
| Stock price prediction | Market data, volume | Tomorrow’s price |
| Crop yield estimation | Soil quality, rainfall | Yield (tons/acre) |
| Customer lifetime value | Purchase history, tenure | Revenue ($) |

Linear Regression: The Foundation

The simplest regression model is linear regression. It assumes a straight-line relationship between input and output.

One Feature: Simple Linear Regression

With a single feature $x$, the model is:

$$\hat{y} = wx + b$$

Where:

  • $\hat{y}$ is the predicted value
  • $x$ is the input feature
  • $w$ is the weight (slope of the line)
  • $b$ is the bias (y-intercept)

Example: House Size → Price

Let’s look at some data points:

| House Size (sq ft) | Actual Price ($) |
| --- | --- |
| 600 | 150,000 |
| 800 | 180,000 |
| 1,000 | 220,000 |
| 1,200 | 250,000 |
| 1,500 | 310,000 |
| 1,800 | 350,000 |
| 2,000 | 400,000 |
| 2,400 | 460,000 |

The model tries to find the best line through these points. If we fit a linear regression, we might get:

$$\hat{y} = 175 \cdot x + 42{,}000$$

This means: for every additional square foot, the price increases by $175, with a base price of $42,000.

Predictions with Our Model

| Size (x) | Actual Price | Predicted ($175x + 42{,}000$) | Error |
| --- | --- | --- | --- |
| 600 | $150,000 | $147,000 | -$3,000 |
| 1,000 | $220,000 | $217,000 | -$3,000 |
| 1,500 | $310,000 | $304,500 | -$5,500 |
| 2,000 | $400,000 | $392,000 | -$8,000 |
| 2,400 | $460,000 | $462,000 | +$2,000 |

Not perfect, but close. The errors are small relative to the prices.
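The predictions in the table are easy to check by hand. Here is a minimal Python sketch using the fitted values $w = 175$ and $b = 42{,}000$ from above (the function name `predict` is just for illustration):

```python
def predict(size_sqft, w=175.0, b=42_000.0):
    """Simple linear regression prediction: y_hat = w * x + b."""
    return w * size_sqft + b

# Sizes and actual prices from the table above
data = [(600, 150_000), (1_000, 220_000), (1_500, 310_000),
        (2_000, 400_000), (2_400, 460_000)]

for size, actual in data:
    y_hat = predict(size)
    print(f"{size:>5} sq ft: predicted ${y_hat:,.0f}, error ${y_hat - actual:+,.0f}")
```

Running this reproduces the Predicted and Error columns exactly.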

The Cost Function

How does the model know if it’s doing well? It uses a cost function (also called loss function) to measure its errors.

The most common cost function for regression is Mean Squared Error (MSE):

$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Where:

  • $n$ = number of training examples
  • $\hat{y}_i$ = predicted value for example $i$
  • $y_i$ = actual value for example $i$

We square the errors so that:

  1. Negative and positive errors don’t cancel out
  2. Larger errors are penalized more heavily

flowchart TD
  A["Training Data"] --> B["Model: ŷ = wx + b"]
  B --> C["Predictions ŷ₁, ŷ₂, ... ŷₙ"]
  C --> D["Compare with actual: y₁, y₂, ... yₙ"]
  D --> E["MSE = (1/n) Σ(ŷᵢ - yᵢ)²"]
  E --> F{"MSE low enough?"}
  F -->|"No"| G["Adjust w and b"]
  G --> B
  F -->|"Yes"| H["Final Model"]
  style E fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style H fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
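The MSE formula translates almost line-for-line into code. A minimal sketch (the function name `mse` is illustrative):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    n = len(y_true)
    return sum((yh - y) ** 2 for yh, y in zip(y_pred, y_true)) / n

# Squared errors are 1, 0, and 4, so MSE = 5/3
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))
```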

Gradient Descent: Finding the Best Parameters

The goal is to find ww and bb that minimize the cost function. The algorithm that does this is called gradient descent.

Intuition

Imagine you’re standing on a hilly landscape in thick fog. You can’t see the lowest point, but you can feel the slope under your feet. Gradient descent says: take a step in the direction of steepest descent. Repeat until you reach the bottom.

flowchart TD
  A["Start with random w, b"] --> B["Compute cost J(w,b)"]
  B --> C["Compute gradients ∂J/∂w, ∂J/∂b"]
  C --> D["Update:
w = w - α · ∂J/∂w
b = b - α · ∂J/∂b"]
  D --> E{"Converged?"}
  E -->|"No"| B
  E -->|"Yes"| F["Optimal w, b found"]
  style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38

The Update Rule

At each step, we update the parameters:

$$w = w - \alpha \cdot \frac{\partial J}{\partial w}$$

$$b = b - \alpha \cdot \frac{\partial J}{\partial b}$$

Where $\alpha$ is the learning rate - how big of a step we take.

The Gradients

For MSE with linear regression:

$$\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_i$$

$$\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
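With the update rule and both gradients in hand, the whole algorithm fits in a short function. A sketch on toy data that lies exactly on a line (the function name, data, and hyperparameters here are illustrative, not from the text):

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5_000):
    """Fit y_hat = w*x + b by gradient descent on MSE."""
    n = len(xs)
    w, b = 0.0, 0.0  # start from arbitrary parameters
    for _ in range(steps):
        # Prediction errors (y_hat - y) for the current w, b
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE with respect to w and b
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill, scaled by the learning rate
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
print(w, b)  # converges to w ≈ 2, b ≈ 1
```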

Learning Rate Matters

| Learning Rate | Behavior |
| --- | --- |
| Too small (0.0001) | Converges very slowly, thousands of iterations |
| Just right (0.01) | Smooth convergence, efficient training |
| Too large (1.0) | Overshoots the minimum, may diverge |

Multiple Linear Regression

Real problems have many features. With $m$ features, the model becomes:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + b$$

Or in compact vector notation:

$$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$$

Example: House Price with Multiple Features

| Size | Bedrooms | Age | Distance to City (km) | Actual Price |
| --- | --- | --- | --- | --- |
| 1,400 | 3 | 15 | 8 | $285,000 |
| 850 | 1 | 30 | 2 | $165,000 |
| 2,200 | 4 | 5 | 12 | $425,000 |
| 1,100 | 2 | 20 | 5 | $210,000 |
| 3,000 | 5 | 2 | 15 | $380,000 |
| 1,600 | 3 | 10 | 6 | $340,000 |

A trained model might learn:

$$\hat{y} = 150 \cdot \text{size} + 15{,}000 \cdot \text{beds} - 2{,}000 \cdot \text{age} - 5{,}000 \cdot \text{distance} + 30{,}000$$

This tells us:

  • Each sq ft adds $150
  • Each bedroom adds $15,000
  • Each year of age subtracts $2,000
  • Each km from city subtracts $5,000
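That hypothetical model is straightforward to evaluate, either as an explicit sum or in the dot-product form from above. A small sketch (the function names are illustrative):

```python
def predict_price(size, beds, age, distance):
    """Hypothetical multi-feature model from the equation above."""
    return 150 * size + 15_000 * beds - 2_000 * age - 5_000 * distance + 30_000

def predict_dot(w, x, b):
    """Equivalent vector form: y_hat = w · x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# First house from the table: 1,400 sq ft, 3 beds, 15 years old, 8 km out
print(predict_price(1_400, 3, 15, 8))  # 215000

w = [150, 15_000, -2_000, -5_000]
print(predict_dot(w, [1_400, 3, 15, 8], 30_000))  # same result
```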

Feature Scaling

When features have very different ranges, gradient descent struggles. Feature scaling normalizes them:

| Feature | Original Range | After Scaling (0–1) |
| --- | --- | --- |
| Size | 600 – 3,000 | 0.0 – 1.0 |
| Bedrooms | 1 – 5 | 0.0 – 1.0 |
| Age | 2 – 30 | 0.0 – 1.0 |
| Distance | 2 – 15 | 0.0 – 1.0 |

Common methods:

  • Min-Max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
  • Standardization (Z-score): $x' = \frac{x - \mu}{\sigma}$
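Both methods are a few lines each. A sketch (function names illustrative):

```python
def min_max_scale(values):
    """Min-max scaling: map values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: zero mean, unit standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

sizes = [600, 850, 1_100, 1_400, 1_600, 2_200, 3_000]
print(min_max_scale(sizes))  # 600 → 0.0, 3,000 → 1.0
print(standardize(sizes))    # centered around 0
```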

Evaluating Regression Models

Key Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| MAE | $\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert$ | Average absolute error, in the original units |
| MSE | $\frac{1}{n}\sum (y_i - \hat{y}_i)^2$ | Average squared error (penalizes outliers) |
| RMSE | $\sqrt{MSE}$ | Square root of MSE (back to original units) |
| R² | $1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | Fraction of variance explained (at most 1) |
  • R² = 1.0: Perfect predictions
  • R² = 0.0: Model is no better than predicting the mean
  • R² < 0: Model is worse than the mean
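All four metrics can be computed together. A sketch applied to the house-price predictions from earlier in the post (the function name is illustrative):

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² for a set of predictions."""
    n = len(y_true)
    mae = sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / n
    mse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n
    rmse = mse ** 0.5
    # R² compares residual error against the error of always predicting the mean
    y_mean = sum(y_true) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Actual vs. predicted prices from the simple linear regression table
y_true = [150_000, 220_000, 310_000, 400_000, 460_000]
y_pred = [147_000, 217_000, 304_500, 392_000, 462_000]
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R²={r2:.4f}")
```

The high R² confirms the earlier observation that the errors are small relative to the prices.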

Polynomial Regression

What if the relationship isn’t linear? Polynomial regression adds higher-degree terms:

$$\hat{y} = w_1 x + w_2 x^2 + w_3 x^3 + b$$

This can fit curves, but beware - higher degrees risk overfitting.

flowchart LR
  A["Degree 1
(Linear)"] --> B["Degree 2
(Quadratic)"]
  B --> C["Degree 3
(Cubic)"]
  C --> D["Degree 10
(Overfitting!)"]
  style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style D fill:#fde8e8,stroke:#7a1a1a,color:#7a1a1a
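One way to see polynomial regression is as ordinary linear regression on expanded features: replace each $x$ with the vector $[x, x^2, \ldots, x^d]$ and fit the weights as before. A sketch (function names illustrative):

```python
def poly_features(x, degree):
    """Expand a scalar x into [x, x², ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

def poly_predict(x, weights, b):
    """y_hat = w1·x + w2·x² + ... + b, one weight per power of x."""
    return sum(w * f for w, f in zip(weights, poly_features(x, len(weights)))) + b

# A quadratic model: y_hat = 1·x + 2·x² + 3
print(poly_predict(2.0, [1.0, 2.0], 3.0))  # 2 + 8 + 3 = 13.0
```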

Regularization

To prevent overfitting, we add a penalty to the cost function:

  • Ridge (L2): $J = \mathrm{MSE} + \lambda \sum w_i^2$ - shrinks weights toward zero
  • Lasso (L1): $J = \mathrm{MSE} + \lambda \sum |w_i|$ - can zero out features (feature selection)
  • Elastic Net: Combination of L1 and L2

$\lambda$ controls how strong the penalty is.
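In gradient descent, Ridge only changes the gradient of $w$: the penalty $\lambda \sum w_i^2$ contributes an extra $2\lambda w$ term. A sketch extending the earlier single-feature update (the function name and the choice $\lambda = 0.1$ are illustrative; the bias $b$ is conventionally left unpenalized):

```python
def ridge_gradient_step(w, b, xs, ys, alpha, lam):
    """One gradient-descent step on MSE + λ·w² (Ridge / L2)."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs)) + 2 * lam * w
    grad_b = (2 / n) * sum(errors)  # bias is not penalized
    return w - alpha * grad_w, b - alpha * grad_b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
w, b = 0.0, 0.0
for _ in range(5_000):
    w, b = ridge_gradient_step(w, b, xs, ys, alpha=0.05, lam=0.1)
print(w, b)  # λ > 0 shrinks w below the unregularized value of 2
```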

Summary

| Concept | Key Takeaway |
| --- | --- |
| Regression | Predicts continuous values |
| Linear regression | $\hat{y} = wx + b$ |
| Cost function | MSE measures prediction error |
| Gradient descent | Iteratively minimizes cost |
| Learning rate | Controls step size |
| Feature scaling | Normalizes different feature ranges |
| Regularization | Prevents overfitting |
| R² score | Measures model quality (1 = perfect) |

What’s Next?

flowchart LR
  A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
  B --> C["✅ Regression"]
  C --> D["Classification"]
  D --> E["Unsupervised Learning"]
  style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style D fill:#e8f4fd,stroke:#1a5276,color:#1a5276

Next up: Classification - predicting categories instead of numbers. We’ll cover logistic regression, decision boundaries, confusion matrices, and when to use which classifier.

See you in Part 4.
