ML from Scratch · Part 3

Regression - Predicting Continuous Values

In this series (5 parts)
  1. Introduction to Machine Learning
  2. Supervised Learning - Learning from Labeled Data
  3. Regression - Predicting Continuous Values
  4. Classification - Predicting Categories
  5. Unsupervised Learning - Finding Hidden Patterns

We’ve established that supervised learning has two flavors: regression and classification. In this post, we go deep on regression - predicting a continuous numerical value.

What is Regression?

Regression answers the question: “How much?” or “How many?”

Given input features, a regression model outputs a continuous number - not a category, not a label, but a value on a number line.

flowchart LR
  A["Input Features
(Size, Bedrooms, Age)"] --> B["Regression Model"]
  B --> C["Continuous Output
($285,000)"]
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#e8f4fd,stroke:#1a5276,color:#1a5276

Real-World Regression Problems

| Problem | Input Features | Output (Prediction) |
| --- | --- | --- |
| House price prediction | Size, location, rooms | Price in dollars |
| Salary estimation | Experience, education, role | Annual salary |
| Temperature forecasting | Historical data, humidity | Temperature (°C) |
| Stock price prediction | Market data, volume | Tomorrow’s price |
| Crop yield estimation | Soil quality, rainfall | Yield (tons/acre) |
| Customer lifetime value | Purchase history, tenure | Revenue ($) |

Linear Regression: The Foundation

The simplest regression model is linear regression. It assumes a straight-line relationship between input and output.

One Feature: Simple Linear Regression

With a single feature $x$, the model is:

$$\hat{y} = wx + b$$

Where:

  • $\hat{y}$ is the predicted value
  • $x$ is the input feature
  • $w$ is the weight (slope of the line)
  • $b$ is the bias (y-intercept)

Example: House Size → Price

Let’s look at some data points:

| House Size (sq ft) | Actual Price ($) |
| --- | --- |
| 600 | 150,000 |
| 800 | 180,000 |
| 1,000 | 220,000 |
| 1,200 | 250,000 |
| 1,500 | 310,000 |
| 1,800 | 350,000 |
| 2,000 | 400,000 |
| 2,400 | 460,000 |

The model tries to find the best line through these points. If we fit a linear regression, we might get:

$$\hat{y} = 175 \cdot x + 42{,}000$$

This means: for every additional square foot, the price increases by $175, with a base price of $42,000.

Predictions with Our Model

| Size (x) | Actual Price | Predicted ($175x + 42{,}000$) | Error |
| --- | --- | --- | --- |
| 600 | $150,000 | $147,000 | -$3,000 |
| 1,000 | $220,000 | $217,000 | -$3,000 |
| 1,500 | $310,000 | $304,500 | -$5,500 |
| 2,000 | $400,000 | $392,000 | -$8,000 |
| 2,400 | $460,000 | $462,000 | +$2,000 |

Not perfect, but close. The errors are small relative to the prices.
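The predictions in the table are easy to check by hand. Here is a minimal Python sketch using the fitted values $w = 175$ and $b = 42{,}000$ from above (the function name `predict` is just for illustration):

```python
def predict(size_sqft, w=175.0, b=42_000.0):
    """Simple linear regression prediction: y_hat = w * x + b."""
    return w * size_sqft + b

# Sizes and actual prices from the table above
data = [(600, 150_000), (1_000, 220_000), (1_500, 310_000),
        (2_000, 400_000), (2_400, 460_000)]

for size, actual in data:
    y_hat = predict(size)
    print(f"{size:>5} sq ft: predicted ${y_hat:,.0f}, error ${y_hat - actual:+,.0f}")
```

Running this reproduces the Predicted and Error columns exactly.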

The Cost Function

How does the model know if it’s doing well? It uses a cost function (also called loss function) to measure its errors.

The most common cost function for regression is Mean Squared Error (MSE):

$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Where:

  • $n$ = number of training examples
  • $\hat{y}_i$ = predicted value for example $i$
  • $y_i$ = actual value for example $i$

We square the errors so that:

  1. Negative and positive errors don’t cancel out
  2. Larger errors are penalized more heavily

flowchart TD
  A["Training Data"] --> B["Model: ŷ = wx + b"]
  B --> C["Predictions ŷ₁, ŷ₂, ... ŷₙ"]
  C --> D["Compare with actual: y₁, y₂, ... yₙ"]
  D --> E["MSE = (1/n) Σ(ŷᵢ - yᵢ)²"]
  E --> F{"MSE low enough?"}
  F -->|"No"| G["Adjust w and b"]
  G --> B
  F -->|"Yes"| H["Final Model"]
  style E fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style H fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
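The MSE formula translates almost line-for-line into code. A minimal sketch (the function name `mse` is illustrative):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    n = len(y_true)
    return sum((yh - y) ** 2 for yh, y in zip(y_pred, y_true)) / n

# Squared errors are 1, 0, and 4, so MSE = 5/3
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))
```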

Gradient Descent: Finding the Best Parameters

The goal is to find ww and bb that minimize the cost function. The algorithm that does this is called gradient descent.

Intuition

Imagine you’re standing on a hilly landscape in thick fog. You can’t see the lowest point, but you can feel the slope under your feet. Gradient descent says: take a step in the direction of steepest descent. Repeat until you reach the bottom.

flowchart TD
  A["Start with random w, b"] --> B["Compute cost J(w,b)"]
  B --> C["Compute gradients ∂J/∂w, ∂J/∂b"]
  C --> D["Update:
w = w - α · ∂J/∂w
b = b - α · ∂J/∂b"]
  D --> E{"Converged?"}
  E -->|"No"| B
  E -->|"Yes"| F["Optimal w, b found"]
  style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38

The Update Rule

At each step, we update the parameters:

$$w = w - \alpha \cdot \frac{\partial J}{\partial w}$$

$$b = b - \alpha \cdot \frac{\partial J}{\partial b}$$

Where $\alpha$ is the learning rate - how big of a step we take.

The Gradients

For MSE with linear regression:

$$\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_i$$

$$\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
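With the update rule and both gradients in hand, the whole algorithm fits in a short function. A sketch on toy data that lies exactly on a line (the function name, data, and hyperparameters here are illustrative, not from the text):

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5_000):
    """Fit y_hat = w*x + b by gradient descent on MSE."""
    n = len(xs)
    w, b = 0.0, 0.0  # start from arbitrary parameters
    for _ in range(steps):
        # Prediction errors (y_hat - y) for the current w, b
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE with respect to w and b
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill, scaled by the learning rate
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
print(w, b)  # converges to w ≈ 2, b ≈ 1
```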

Learning Rate Matters

| Learning Rate | Behavior |
| --- | --- |
| Too small (0.0001) | Converges very slowly, thousands of iterations |
| Just right (0.01) | Smooth convergence, efficient training |
| Too large (1.0) | Overshoots the minimum, may diverge |

Multiple Linear Regression

Real problems have many features. With $m$ features, the model becomes:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + b$$

Or in compact vector notation:

$$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$$

Example: House Price with Multiple Features

| Size | Bedrooms | Age | Distance to City (km) | Actual Price |
| --- | --- | --- | --- | --- |
| 1,400 | 3 | 15 | 8 | $285,000 |
| 850 | 1 | 30 | 2 | $165,000 |
| 2,200 | 4 | 5 | 12 | $425,000 |
| 1,100 | 2 | 20 | 5 | $210,000 |
| 3,000 | 5 | 2 | 15 | $380,000 |
| 1,600 | 3 | 10 | 6 | $340,000 |

A trained model might learn:

$$\hat{y} = 150 \cdot \text{size} + 15{,}000 \cdot \text{beds} - 2{,}000 \cdot \text{age} - 5{,}000 \cdot \text{distance} + 30{,}000$$

This tells us:

  • Each sq ft adds $150
  • Each bedroom adds $15,000
  • Each year of age subtracts $2,000
  • Each km from city subtracts $5,000
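That hypothetical model is straightforward to evaluate, either as an explicit sum or in the dot-product form from above. A small sketch (the function names are illustrative):

```python
def predict_price(size, beds, age, distance):
    """Hypothetical multi-feature model from the equation above."""
    return 150 * size + 15_000 * beds - 2_000 * age - 5_000 * distance + 30_000

def predict_dot(w, x, b):
    """Equivalent vector form: y_hat = w · x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# First house from the table: 1,400 sq ft, 3 beds, 15 years old, 8 km out
print(predict_price(1_400, 3, 15, 8))  # 215000

w = [150, 15_000, -2_000, -5_000]
print(predict_dot(w, [1_400, 3, 15, 8], 30_000))  # same result
```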

Feature Scaling

When features have very different ranges, gradient descent struggles. Feature scaling normalizes them:

| Feature | Original Range | After Scaling (0–1) |
| --- | --- | --- |
| Size | 600 – 3,000 | 0.0 – 1.0 |
| Bedrooms | 1 – 5 | 0.0 – 1.0 |
| Age | 2 – 30 | 0.0 – 1.0 |
| Distance | 2 – 15 | 0.0 – 1.0 |

Common methods:

  • Min-Max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
  • Standardization (Z-score): $x' = \frac{x - \mu}{\sigma}$
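Both methods are a few lines each. A sketch (function names illustrative):

```python
def min_max_scale(values):
    """Min-max scaling: map values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: zero mean, unit standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

sizes = [600, 850, 1_100, 1_400, 1_600, 2_200, 3_000]
print(min_max_scale(sizes))  # 600 → 0.0, 3,000 → 1.0
print(standardize(sizes))    # centered around 0
```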

Evaluating Regression Models

Key Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| MAE | $\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert$ | Average absolute error, in the original units |
| MSE | $\frac{1}{n}\sum (y_i - \hat{y}_i)^2$ | Average squared error (penalizes outliers) |
| RMSE | $\sqrt{MSE}$ | Square root of MSE (back to original units) |
| R² | $1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | Fraction of variance explained (at most 1) |
  • R² = 1.0: Perfect predictions
  • R² = 0.0: Model is no better than predicting the mean
  • R² < 0: Model is worse than the mean
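All four metrics can be computed together. A sketch applied to the house-price predictions from earlier in the post (the function name is illustrative):

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² for a set of predictions."""
    n = len(y_true)
    mae = sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / n
    mse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n
    rmse = mse ** 0.5
    # R² compares residual error against the error of always predicting the mean
    y_mean = sum(y_true) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Actual vs. predicted prices from the simple linear regression table
y_true = [150_000, 220_000, 310_000, 400_000, 460_000]
y_pred = [147_000, 217_000, 304_500, 392_000, 462_000]
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R²={r2:.4f}")
```

The high R² confirms the earlier observation that the errors are small relative to the prices.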

Polynomial Regression

What if the relationship isn’t linear? Polynomial regression adds higher-degree terms:

$$\hat{y} = w_1 x + w_2 x^2 + w_3 x^3 + b$$

This can fit curves, but beware - higher degrees risk overfitting.

flowchart LR
  A["Degree 1
(Linear)"] --> B["Degree 2
(Quadratic)"]
  B --> C["Degree 3
(Cubic)"]
  C --> D["Degree 10
(Overfitting!)"]
  style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style D fill:#fde8e8,stroke:#7a1a1a,color:#7a1a1a
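One way to see polynomial regression is as ordinary linear regression on expanded features: replace each $x$ with the vector $[x, x^2, \ldots, x^d]$ and fit the weights as before. A sketch (function names illustrative):

```python
def poly_features(x, degree):
    """Expand a scalar x into [x, x², ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

def poly_predict(x, weights, b):
    """y_hat = w1·x + w2·x² + ... + b, one weight per power of x."""
    return sum(w * f for w, f in zip(weights, poly_features(x, len(weights)))) + b

# A quadratic model: y_hat = 1·x + 2·x² + 3
print(poly_predict(2.0, [1.0, 2.0], 3.0))  # 2 + 8 + 3 = 13.0
```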

Regularization

To prevent overfitting, we add a penalty to the cost function:

  • Ridge (L2): $J = \mathrm{MSE} + \lambda \sum w_i^2$ - shrinks weights toward zero
  • Lasso (L1): $J = \mathrm{MSE} + \lambda \sum |w_i|$ - can zero out features (feature selection)
  • Elastic Net: Combination of L1 and L2

$\lambda$ controls how strong the penalty is.
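In gradient descent, Ridge only changes the gradient of $w$: the penalty $\lambda \sum w_i^2$ contributes an extra $2\lambda w$ term. A sketch extending the earlier single-feature update (the function name and the choice $\lambda = 0.1$ are illustrative; the bias $b$ is conventionally left unpenalized):

```python
def ridge_gradient_step(w, b, xs, ys, alpha, lam):
    """One gradient-descent step on MSE + λ·w² (Ridge / L2)."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs)) + 2 * lam * w
    grad_b = (2 / n) * sum(errors)  # bias is not penalized
    return w - alpha * grad_w, b - alpha * grad_b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
w, b = 0.0, 0.0
for _ in range(5_000):
    w, b = ridge_gradient_step(w, b, xs, ys, alpha=0.05, lam=0.1)
print(w, b)  # λ > 0 shrinks w below the unregularized value of 2
```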

Summary

| Concept | Key Takeaway |
| --- | --- |
| Regression | Predicts continuous values |
| Linear regression | $\hat{y} = wx + b$ |
| Cost function | MSE measures prediction error |
| Gradient descent | Iteratively minimizes cost |
| Learning rate | Controls step size |
| Feature scaling | Normalizes different feature ranges |
| Regularization | Prevents overfitting |
| R² score | Measures model quality (1 = perfect) |

What’s Next?

flowchart LR
  A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
  B --> C["✅ Regression"]
  C --> D["Classification"]
  D --> E["Unsupervised Learning"]
  style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style D fill:#e8f4fd,stroke:#1a5276,color:#1a5276

Next up: Classification - predicting categories instead of numbers. We’ll cover logistic regression, decision boundaries, confusion matrices, and when to use which classifier.

See you in Part 4.
