Regression - Predicting Continuous Values
In this series (5 parts)
- Introduction to Machine Learning
- Supervised Learning - Learning from Labeled Data
- Regression - Predicting Continuous Values
- Classification - Predicting Categories
- Unsupervised Learning - Finding Hidden Patterns
We’ve established that supervised learning has two flavors: regression and classification. In this post, we go deep on regression - predicting a continuous numerical value.
What is Regression?
Regression answers the question: “How much?” or “How many?”
Given input features, a regression model outputs a continuous number - not a category, not a label, but a value on a number line.
flowchart LR
A["Input Features (Size, Bedrooms, Age)"] --> B["Regression Model"]
B --> C["Continuous Output ($285,000)"]
style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
style C fill:#e8f4fd,stroke:#1a5276,color:#1a5276
Real-World Regression Problems
| Problem | Input Features | Output (Prediction) |
|---|---|---|
| House price prediction | Size, location, rooms | Price in dollars |
| Salary estimation | Experience, education, role | Annual salary |
| Temperature forecasting | Historical data, humidity | Temperature (°C) |
| Stock price prediction | Market data, volume | Tomorrow’s price |
| Crop yield estimation | Soil quality, rainfall | Yield (tons/acre) |
| Customer lifetime value | Purchase history, tenure | Revenue ($) |
Linear Regression: The Foundation
The simplest regression model is linear regression. It assumes a straight-line relationship between input and output.
One Feature: Simple Linear Regression
With a single feature x, the model is:

ŷ = wx + b

Where:
- ŷ is the predicted value
- x is the input feature
- w is the weight (slope of the line)
- b is the bias (y-intercept)
Example: House Size → Price
Let’s look at some data points:
| House Size (sq ft) | Actual Price ($) |
|---|---|
| 600 | 150,000 |
| 800 | 180,000 |
| 1,000 | 220,000 |
| 1,200 | 250,000 |
| 1,500 | 310,000 |
| 1,800 | 350,000 |
| 2,000 | 400,000 |
| 2,400 | 460,000 |
The model tries to find the best line through these points. If we fit a linear regression, we might get:

ŷ = 175x + 42,000

This means: for every additional square foot, the price increases by $175, on top of a base of $42,000.
Predictions with Our Model
| Size (x) | Actual Price | Predicted (ŷ = 175x + 42,000) | Error |
|---|---|---|---|
| 600 | $150,000 | $147,000 | -$3,000 |
| 1,000 | $220,000 | $217,000 | -$3,000 |
| 1,500 | $310,000 | $304,500 | -$5,500 |
| 2,000 | $400,000 | $392,000 | -$8,000 |
| 2,400 | $460,000 | $462,000 | +$2,000 |
Not perfect, but close. The errors are small relative to the prices.
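The prediction table above is easy to reproduce in code. A minimal sketch, using the fitted parameters from the text (w = 175, b = 42,000):

```python
# Simple linear regression prediction: y_hat = w * x + b
# w and b match the fitted line from the text.
w, b = 175.0, 42_000.0

def predict(size_sqft: float) -> float:
    """Predict house price (in dollars) from size in square feet."""
    return w * size_sqft + b

for size, actual in [(600, 150_000), (1_500, 310_000), (2_400, 460_000)]:
    pred = predict(size)
    print(f"{size:>5} sq ft: predicted ${pred:,.0f}, error ${pred - actual:+,.0f}")
```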
The Cost Function
How does the model know if it’s doing well? It uses a cost function (also called loss function) to measure its errors.
The most common cost function for regression is Mean Squared Error (MSE):

MSE = (1/n) Σᵢ (ŷᵢ - yᵢ)²

Where:
- n = number of training examples
- ŷᵢ = predicted value for example i
- yᵢ = actual value for example i
We square the errors so that:
- Negative and positive errors don’t cancel out
- Larger errors are penalized more heavily
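Applying the MSE formula to the prediction table from earlier is a one-liner-style exercise:

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    n = len(y_true)
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# Actual vs. predicted prices from the table above
actual    = [150_000, 220_000, 310_000, 400_000, 460_000]
predicted = [147_000, 217_000, 304_500, 392_000, 462_000]
print(f"MSE: {mse(actual, predicted):,.0f}")  # 23,250,000
```

Note how the squaring makes the single $8,000 miss dominate the total, exactly the "larger errors are penalized more heavily" property.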
flowchart TD
A["Training Data"] --> B["Model: ŷ = wx + b"]
B --> C["Predictions ŷ₁, ŷ₂, ... ŷₙ"]
C --> D["Compare with actual: y₁, y₂, ... yₙ"]
D --> E["MSE = (1/n) Σ(ŷᵢ - yᵢ)²"]
E --> F{"MSE low enough?"}
F -->|"No"| G["Adjust w and b"]
G --> B
F -->|"Yes"| H["Final Model"]
style E fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style H fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Gradient Descent: Finding the Best Parameters
The goal is to find and that minimize the cost function. The algorithm that does this is called gradient descent.
Intuition
Imagine you’re standing on a hilly landscape in thick fog. You can’t see the lowest point, but you can feel the slope under your feet. Gradient descent says: take a step in the direction of steepest descent. Repeat until you reach the bottom.
flowchart TD
A["Start with random w, b"] --> B["Compute cost J(w,b)"]
B --> C["Compute gradients ∂J/∂w, ∂J/∂b"]
C --> D["Update:
w = w - α · ∂J/∂w
b = b - α · ∂J/∂b"]
D --> E{"Converged?"}
E -->|"No"| B
E -->|"Yes"| F["Optimal w, b found"]
style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
The Update Rule
At each step, we update the parameters:

w = w − α · ∂J/∂w
b = b − α · ∂J/∂b

Where α is the learning rate - how big of a step we take.
The Gradients
For MSE with linear regression:

∂J/∂w = (2/n) Σᵢ (ŷᵢ − yᵢ) xᵢ
∂J/∂b = (2/n) Σᵢ (ŷᵢ − yᵢ)
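The whole loop - predict, compute gradients, update - fits in a few lines. A sketch on toy data (points on the line y = 2x + 1, so we know what the answer should be):

```python
def gradient_descent(xs, ys, lr=0.05, epochs=2_000):
    """Fit y ≈ w*x + b by gradient descent on MSE."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals ŷᵢ − yᵢ, then the MSE gradients:
        # ∂J/∂w = (2/n) Σ (ŷᵢ − yᵢ) xᵢ,  ∂J/∂b = (2/n) Σ (ŷᵢ − yᵢ)
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        dw = 2 / n * sum(e * x for e, x in zip(errs, xs))
        db = 2 / n * sum(errs)
        w -= lr * dw
        b -= lr * db
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
w, b = gradient_descent(xs, ys)
print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")
```

The learning rate 0.05 and 2,000 epochs are arbitrary choices that happen to work for this tiny dataset; real features usually need scaling first (see below).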
Learning Rate Matters
| Learning Rate | Behavior |
|---|---|
| Too small (0.0001) | Converges very slowly, thousands of iterations |
| Just right (0.01) | Smooth convergence, efficient training |
| Too large (1.0) | Overshoots the minimum, may diverge |
Multiple Linear Regression
Real problems have many features. With m features, the model becomes:

ŷ = w₁x₁ + w₂x₂ + … + wₘxₘ + b

Or in compact vector notation:

ŷ = wᵀx + b
Example: House Price with Multiple Features
| Size | Bedrooms | Age | Distance to City (km) | Actual Price |
|---|---|---|---|---|
| 1,400 | 3 | 15 | 8 | $285,000 |
| 850 | 1 | 30 | 2 | $165,000 |
| 2,200 | 4 | 5 | 12 | $425,000 |
| 1,100 | 2 | 20 | 5 | $210,000 |
| 3,000 | 5 | 2 | 15 | $380,000 |
| 1,600 | 3 | 10 | 6 | $340,000 |
A trained model might learn:

ŷ = 150·size + 15,000·bedrooms − 2,000·age − 5,000·distance + b
This tells us:
- Each sq ft adds $150
- Each bedroom adds $15,000
- Each year of age subtracts $2,000
- Each km from city subtracts $5,000
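A multi-feature prediction is just a weighted sum plus the bias. A sketch with the weights from the text; the bias value 100,000 is a made-up placeholder, since the text doesn't give one:

```python
# Weights taken from the text above; the bias is NOT given there,
# so 100_000 is a purely illustrative assumption.
weights = {"size": 150.0, "bedrooms": 15_000.0, "age": -2_000.0, "distance": -5_000.0}
bias = 100_000.0  # hypothetical

def predict(house: dict) -> float:
    """Multiple linear regression: ŷ = Σⱼ wⱼ·xⱼ + b."""
    return sum(weights[f] * house[f] for f in weights) + bias

# First row of the table: 1,400 sq ft, 3 bedrooms, 15 years old, 8 km out
house = {"size": 1_400, "bedrooms": 3, "age": 15, "distance": 8}
print(f"${predict(house):,.0f}")
```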
Feature Scaling
When features have very different ranges, gradient descent struggles. Feature scaling normalizes them:
| Feature | Original Range | After Scaling (0–1) |
|---|---|---|
| Size | 600 – 3,000 | 0.0 – 1.0 |
| Bedrooms | 1 – 5 | 0.0 – 1.0 |
| Age | 2 – 30 | 0.0 – 1.0 |
| Distance | 2 – 15 | 0.0 – 1.0 |
Common methods:
- Min-Max scaling: x′ = (x − min) / (max − min)
- Standardization (Z-score): x′ = (x − μ) / σ
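Both methods are a couple of lines each. A sketch, applied to the size column from the table above:

```python
def min_max(values):
    """Min-max scaling: x' = (x - min) / (max - min), mapping into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardization: x' = (x - mean) / std (population std)."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

sizes = [600, 850, 1_100, 1_400, 2_200, 3_000]
print(min_max(sizes))  # smallest size maps to 0.0, largest to 1.0
print(z_score(sizes))  # mean 0, standard deviation 1
```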
Evaluating Regression Models
Key Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | (1/n) Σ\|ŷᵢ − yᵢ\| | Average absolute error, in the same units as y |
| MSE | (1/n) Σ(ŷᵢ − yᵢ)² | Average squared error (penalizes outliers) |
| RMSE | √MSE | Square root of MSE (back to original units) |
| R² | 1 − SS_res / SS_tot | Fraction of variance explained (at most 1) |
- R² = 1.0: Perfect predictions
- R² = 0.0: Model is no better than predicting the mean
- R² < 0: Model is worse than the mean
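All four metrics can be computed from the same residuals. A sketch, reusing the house-price predictions from the simple linear regression example:

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² for a set of predictions."""
    n = len(y_true)
    residuals = [yp - yt for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)            # unexplained variance
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)  # total variance
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

actual    = [150_000, 220_000, 310_000, 400_000, 460_000]
predicted = [147_000, 217_000, 304_500, 392_000, 462_000]
print(regression_metrics(actual, predicted))
```

Here MAE comes out around $4,300 while RMSE is higher (around $4,800) because RMSE weights the single $8,000 miss more heavily.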
Polynomial Regression
What if the relationship isn’t linear? Polynomial regression adds higher-degree terms, up to some degree k:

ŷ = w₁x + w₂x² + … + wₖxᵏ + b
This can fit curves, but beware - higher degrees risk overfitting.
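The trick is that polynomial regression is still linear regression - you just expand each input into powers of itself and feed those to the same machinery. A minimal sketch of the feature expansion:

```python
def polynomial_features(x: float, degree: int) -> list[float]:
    """Expand one feature into [x, x², ..., x^degree].

    The result can be fed to ordinary (multiple) linear regression,
    which then fits a degree-`degree` curve.
    """
    return [x ** d for d in range(1, degree + 1)]

print(polynomial_features(3.0, 4))  # [3.0, 9.0, 27.0, 81.0]
```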
flowchart LR
A["Degree 1 (Linear)"] --> B["Degree 2 (Quadratic)"]
B --> C["Degree 3 (Cubic)"]
C --> D["Degree 10 (Overfitting!)"]
style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
style C fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style D fill:#fde8e8,stroke:#7a1a1a,color:#7a1a1a
Regularization
To prevent overfitting, we add a penalty to the cost function:
- Ridge (L2): J = MSE + λ Σ wⱼ² - shrinks weights toward zero
- Lasso (L1): J = MSE + λ Σ |wⱼ| - can zero out features (feature selection)
- Elastic Net: Combination of L1 and L2

λ (lambda) controls how strong the penalty is.
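The two penalties differ only in how they measure the weights. A sketch of the regularized cost functions (bias excluded from the penalty, as is conventional):

```python
def ridge_cost(y_true, y_pred, weights, lam):
    """Ridge (L2): J = MSE + λ Σ wⱼ². The bias is not penalized."""
    n = len(y_true)
    mse = sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n
    return mse + lam * sum(w * w for w in weights)

def lasso_cost(y_true, y_pred, weights, lam):
    """Lasso (L1): J = MSE + λ Σ |wⱼ|. The bias is not penalized."""
    n = len(y_true)
    mse = sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n
    return mse + lam * sum(abs(w) for w in weights)

# With perfect predictions, only the penalty term remains:
print(ridge_cost([1.0], [1.0], [2.0, -1.0], lam=0.1))  # 0.1 * (4 + 1)
print(lasso_cost([1.0], [1.0], [2.0, -1.0], lam=0.1))  # 0.1 * (2 + 1)
```

Minimizing these instead of plain MSE is what pulls large weights down (Ridge) or all the way to zero (Lasso).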
Summary
| Concept | Key Takeaway |
|---|---|
| Regression | Predicts continuous values |
| Linear regression | ŷ = wx + b |
| Cost function | MSE measures prediction error |
| Gradient descent | Iteratively minimizes cost |
| Learning rate | Controls step size |
| Feature scaling | Normalizes different feature ranges |
| Regularization | Prevents overfitting |
| R² score | Measures model quality (closer to 1 is better) |
What’s Next?
flowchart LR
A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
B --> C["✅ Regression"]
C --> D["Classification"]
D --> E["Unsupervised Learning"]
style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
style D fill:#e8f4fd,stroke:#1a5276,color:#1a5276
Next up: Classification - predicting categories instead of numbers. We’ll cover logistic regression, decision boundaries, confusion matrices, and when to use which classifier.
See you in Part 4.