
Logistic regression and classification

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: Linear regression; Calculus and derivatives.

Linear regression predicts a number. But many problems need a category: spam or not spam, tumor benign or malignant, digit 0 through 9. Logistic regression adapts linear regression for classification by squashing predictions into probabilities.

The problem: yes-or-no answers

Consider predicting whether a student passes an exam based on study hours and practice exams taken.

| Student | Study hours | Practice exams | Pass? |
|---------|-------------|----------------|-------|
| 1       | 2           | 0              | No    |
| 2       | 3           | 1              | No    |
| 3       | 4           | 1              | No    |
| 4       | 5           | 2              | Yes   |
| 5       | 6           | 2              | Yes   |
| 6       | 7           | 3              | Yes   |
| 7       | 8           | 3              | Yes   |
| 8       | 1           | 0              | No    |

Student study hours vs exam outcome

Linear regression would predict a number like 0.3 or 1.7 for each student. But pass/fail is binary. We need a model that outputs a probability between 0 and 1, then decides yes or no based on a cutoff.

Why we need a probability, not a raw number

graph TD
  A["Raw input features"] --> B["Linear combination: w*x + b"]
  B --> C["Problem: output can be -3, 0.5, 142..."]
  C --> D["Solution: sigmoid squashes to 0-1"]
  D --> E["Output: probability of Pass"]
  E --> F["Threshold at 0.5"]
  F --> G["Predict Pass or Fail"]

The sigmoid function is like a dimmer switch. Small inputs produce values near 0. Large inputs produce values near 1. It transitions smoothly through 0.5 in the middle. No matter what number goes in, you always get a valid probability out.

Now let’s formalize this intuition, starting with why plain linear regression breaks down for classification.

Why linear regression fails for classification

Suppose you want to classify emails as spam (1) or not spam (0). If you use linear regression, \hat{y} = w^Tx + b can output anything: -3.7, 0.5, 142. These numbers don't make sense as probabilities.

Even if the data happens to give sensible-looking outputs for your training set, a single outlier can skew the line and mess up all your predictions. We need a function that outputs values between 0 and 1, interpretable as P(y = 1 | x).

The sigmoid function

The sigmoid (logistic) function maps any real number to (0, 1):

\sigma(z) = \frac{1}{1 + e^{-z}}

Properties:

  • \sigma(0) = 0.5
  • \sigma(z) \to 1 as z \to +\infty
  • \sigma(z) \to 0 as z \to -\infty
  • Symmetric: \sigma(-z) = 1 - \sigma(z)

The derivative has a nice form:

\sigma'(z) = \sigma(z)(1 - \sigma(z))

This clean derivative is why sigmoid is so convenient for optimization. It appears naturally when you take derivatives of the loss function.
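These properties are easy to verify numerically. A quick NumPy sketch (the helper name `sigmoid` is ours):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0

# Symmetry: sigmoid(-z) == 1 - sigmoid(z)
z = 1.7
print(np.isclose(sigmoid(-z), 1 - sigmoid(z)))  # True

# Derivative: compare sigmoid(z)*(1 - sigmoid(z)) against a finite difference
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.isclose(numeric, analytic))  # True
```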

Sigmoid maps any input to a probability between 0 and 1

graph LR
  A["Large negative z"] --> B["sigmoid near 0"]
  C["z = 0"] --> D["sigmoid = 0.5"]
  E["Large positive z"] --> F["sigmoid near 1"]

Sigmoid function with decision threshold at 0.5

The logistic regression model

Combine a linear function with sigmoid:

\hat{p} = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}

\hat{p} is the predicted probability that y = 1. To get a class prediction, apply a threshold (usually 0.5):

\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \end{cases}

Since \sigma(z) \geq 0.5 when z \geq 0, the decision boundary is the set of points where w^Tx + b = 0. In 2D, this is a straight line. In higher dimensions, it's a hyperplane.
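Putting the model and the threshold together: a minimal sketch, with illustrative helper names and made-up weights:

```python
import numpy as np

def predict_proba(X, w, b):
    """Probability that each row of X belongs to class 1."""
    z = X @ w + b
    return 1 / (1 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Hard class labels: 1 where the probability clears the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Toy weights: the decision boundary is the line x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[0.0, 0.0], [2.0, 2.0]])
print(predict(X, w, b))  # [0 1]
```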

Logistic regression pipeline: from features to class prediction

graph LR
  A["Input features x"] --> B["Linear combination w*x + b"]
  B --> C["Sigmoid function"]
  C --> D["Probability p-hat"]
  D --> E["Threshold at 0.5"]
  E --> F["Class 0 or 1"]

Linear regression vs logistic regression output

graph TD
  subgraph Linear["Linear Regression"]
      L1["Input"] --> L2["w*x + b"]
      L2 --> L3["Any real number"]
  end
  subgraph Logistic["Logistic Regression"]
      LR1["Input"] --> LR2["w*x + b"]
      LR2 --> LR3["Sigmoid"]
      LR3 --> LR4["Probability in 0 to 1"]
  end

The loss function: cross-entropy

MSE is a bad choice for classification: combined with the sigmoid it produces a non-convex loss surface with flat regions that slow gradient descent. Instead, we use cross-entropy loss (also called log loss):

For a single example with true label y \in \{0, 1\} and predicted probability \hat{p}:

\ell(y, \hat{p}) = -[y \log(\hat{p}) + (1 - y) \log(1 - \hat{p})]

When y = 1: \ell = -\log(\hat{p}). If \hat{p} is close to 1, loss is near 0. If \hat{p} is close to 0, loss goes to +\infty. Heavily penalizes confident wrong predictions.

When y = 0: \ell = -\log(1 - \hat{p}). Same logic, flipped.

For the full dataset of n examples:

L(w, b) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)]

This loss is convex, so gradient descent finds the global minimum.
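A minimal implementation of this loss (the `eps` clipping is our addition, a common guard against log(0)):

```python
import numpy as np

def cross_entropy(y, p_hat, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1])
good = np.array([0.9, 0.1, 0.8])  # confident and correct
bad = np.array([0.1, 0.9, 0.2])   # confident and wrong

print(cross_entropy(y, good))  # small loss
print(cross_entropy(y, bad))   # much larger loss
```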

Example 1: computing sigmoid output and cross-entropy loss

Consider a single data point with features x = [2, 1], true label y = 1, weights w = [0.5, -0.3], and bias b = 0.1.

Step 1: compute the linear combination z.

z = w^Tx + b = 0.5 \cdot 2 + (-0.3) \cdot 1 + 0.1

= 1.0 - 0.3 + 0.1 = 0.8

Step 2: apply sigmoid.

\hat{p} = \sigma(0.8) = \frac{1}{1 + e^{-0.8}}

e^{-0.8} \approx 0.4493

\hat{p} = \frac{1}{1 + 0.4493} = \frac{1}{1.4493} \approx 0.6900

So the model predicts a 69% probability that this example belongs to class 1.

Step 3: compute the loss.

Since y = 1:

\ell = -\log(\hat{p}) = -\log(0.6900) \approx 0.3711

Step 4: check what happens if the prediction were wrong.

Suppose instead w = [-0.5, 0.3] and b = -0.1:

z = -0.5 \cdot 2 + 0.3 \cdot 1 + (-0.1) = -1.0 + 0.3 - 0.1 = -0.8

\hat{p} = \sigma(-0.8) = 1 - \sigma(0.8) \approx 1 - 0.690 = 0.310

\ell = -\log(0.310) \approx 1.171

The loss jumped from 0.371 to 1.171. When the model is more wrong (predicting 31% for a true positive), the loss is much higher. Cross-entropy punishes confident mistakes severely.
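The same arithmetic, checked in NumPy (the helper name `loss_for` is ours):

```python
import numpy as np

x = np.array([2.0, 1.0])
y = 1

def loss_for(w, b):
    """Sigmoid prediction and cross-entropy loss for the single point (x, y)."""
    z = w @ x + b
    p_hat = 1 / (1 + np.exp(-z))
    return p_hat, -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Weights from the example
p, l = loss_for(np.array([0.5, -0.3]), 0.1)
print(f"p_hat={p:.4f}, loss={l:.4f}")  # p_hat ≈ 0.6900, loss ≈ 0.3711

# Flipped weights: same |z| but on the wrong side
p, l = loss_for(np.array([-0.5, 0.3]), -0.1)
print(f"p_hat={p:.4f}, loss={l:.4f}")  # p_hat ≈ 0.3100, loss ≈ 1.1711
```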

The gradient

To run gradient descent, we need the gradient of the loss with respect to w and b. Using the chain rule:

For a single example:

\frac{\partial \ell}{\partial w_j} = (\hat{p} - y) x_j

\frac{\partial \ell}{\partial b} = \hat{p} - y

This is remarkably clean. The gradient is the prediction error (\hat{p} - y) times the input. Same form as linear regression.

For the full dataset:

\nabla_w L = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) x_i = \frac{1}{n} X^T(\hat{p} - y)

\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)

Example 2: three gradient descent steps

Let’s train a logistic regression model from scratch on a tiny dataset.

Data (1 feature plus bias):

| x | y |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |

We want to learn w (slope) and b (intercept). Initialize w = 0, b = 0. Learning rate \alpha = 0.1.

Step 1:

Compute z_i = wx_i + b for each point:

z = [0 \cdot 1 + 0, \; 0 \cdot 2 + 0, \; 0 \cdot 3 + 0, \; 0 \cdot 4 + 0] = [0, 0, 0, 0]

Compute \hat{p}_i = \sigma(z_i):

\hat{p} = [0.5, 0.5, 0.5, 0.5]

Compute errors \hat{p}_i - y_i:

\text{errors} = [0.5 - 0, \; 0.5 - 0, \; 0.5 - 1, \; 0.5 - 1] = [0.5, 0.5, -0.5, -0.5]

Compute gradients:

\frac{\partial L}{\partial w} = \frac{1}{4}(0.5 \cdot 1 + 0.5 \cdot 2 + (-0.5) \cdot 3 + (-0.5) \cdot 4) = \frac{1}{4}(0.5 + 1.0 - 1.5 - 2.0) = \frac{-2.0}{4} = -0.5

\frac{\partial L}{\partial b} = \frac{1}{4}(0.5 + 0.5 - 0.5 - 0.5) = \frac{0}{4} = 0

Update:

w \leftarrow 0 - 0.1 \cdot (-0.5) = 0.05

b \leftarrow 0 - 0.1 \cdot 0 = 0

Step 2:

Compute z:

z = [0.05 \cdot 1, \; 0.05 \cdot 2, \; 0.05 \cdot 3, \; 0.05 \cdot 4] = [0.05, 0.10, 0.15, 0.20]

Compute \hat{p}:

\hat{p} = [\sigma(0.05), \sigma(0.10), \sigma(0.15), \sigma(0.20)]

\approx [0.5125, 0.5250, 0.5374, 0.5498]

Compute errors:

= [0.5125, 0.5250, -0.4626, -0.4502]

Gradients:

\frac{\partial L}{\partial w} = \frac{1}{4}(0.5125 \cdot 1 + 0.5250 \cdot 2 + (-0.4626) \cdot 3 + (-0.4502) \cdot 4)

= \frac{1}{4}(0.5125 + 1.0500 - 1.3878 - 1.8008) = \frac{-1.6261}{4} = -0.4065

\frac{\partial L}{\partial b} = \frac{1}{4}(0.5125 + 0.5250 - 0.4626 - 0.4502) = \frac{0.1247}{4} = 0.0312

Update:

w \leftarrow 0.05 - 0.1 \cdot (-0.4065) = 0.05 + 0.0407 = 0.0907

b \leftarrow 0 - 0.1 \cdot 0.0312 = -0.0031

Step 3:

Compute z:

z = [0.0907 \cdot 1 - 0.0031, \; 0.0907 \cdot 2 - 0.0031, \; 0.0907 \cdot 3 - 0.0031, \; 0.0907 \cdot 4 - 0.0031]

= [0.0876, 0.1783, 0.2690, 0.3597]

Compute \hat{p}:

\approx [0.5219, 0.5444, 0.5668, 0.5890]

Errors:

= [0.5219, 0.5444, -0.4332, -0.4110]

Gradients:

\frac{\partial L}{\partial w} = \frac{1}{4}(0.5219 + 1.0888 - 1.2996 - 1.6440) = \frac{-1.3329}{4} = -0.3332

\frac{\partial L}{\partial b} = \frac{1}{4}(0.5219 + 0.5444 - 0.4332 - 0.4110) = \frac{0.2221}{4} = 0.0555

Update:

w \leftarrow 0.0907 + 0.0333 = 0.1240

b \leftarrow -0.0031 - 0.0056 = -0.0087

Progress after 3 steps: w = 0.124, b = -0.009. The weight is positive (larger x means higher probability of class 1), and the bias is slightly negative. After many more steps, the model converges to something like w \approx 2.1 and b \approx -5.3, giving a decision boundary at x = 5.3/2.1 \approx 2.5. That's right between the class-0 points (x = 1, 2) and class-1 points (x = 3, 4).

The same three updates in NumPy:

import numpy as np

X = np.array([1, 2, 3, 4]).reshape(-1, 1)
y = np.array([0, 0, 1, 1])

w = 0.0
b = 0.0
alpha = 0.1  # learning rate
n = len(y)

for step in range(3):
    z = w * X.flatten() + b       # linear scores
    p_hat = 1 / (1 + np.exp(-z))  # sigmoid
    errors = p_hat - y            # prediction errors
    dw = (1/n) * np.sum(errors * X.flatten())  # gradient w.r.t. w
    db = (1/n) * np.sum(errors)                # gradient w.r.t. b
    w -= alpha * dw
    b -= alpha * db
    print(f"Step {step+1}: w={w:.4f}, b={b:.4f}")

The decision boundary

The decision boundary is where \hat{p} = 0.5, which means w^Tx + b = 0.

For 2 features, this is a line: w_1 x_1 + w_2 x_2 + b = 0, or equivalently x_2 = -(w_1/w_2)x_1 - b/w_2.

Logistic regression can only produce linear decision boundaries. It can’t separate classes that require a curve. For that, you’d add polynomial features or use a nonlinear model.

graph TD
  A["Linear boundary<br/>w₁x₁ + w₂x₂ + b = 0"] --> B["Class 0 side<br/>w·x + b < 0"]
  A --> C["Class 1 side<br/>w·x + b > 0"]
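To make the boundary formula concrete, here is a tiny sketch with made-up weights w_1 = 2, w_2 = 1, b = -4:

```python
# Boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1/w2)*x1 - b/w2
w1, w2, b = 2.0, 1.0, -4.0

def boundary_x2(x1):
    """The x2 value on the decision boundary for a given x1."""
    return -(w1 / w2) * x1 - b / w2

# A few points on the line; any point above it falls on the class-1 side
for x1 in [0.0, 1.0, 2.0]:
    print(x1, boundary_x2(x1))  # (0, 4), (1, 2), (2, 0)
```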

Multi-class classification: softmax

Binary logistic regression handles two classes. For K > 2 classes, we generalize with the softmax function.

Instead of one weight vector, we have K weight vectors w_1, w_2, \ldots, w_K. For each class k, compute a score:

z_k = w_k^T x + b_k

Then softmax converts scores to probabilities:

P(y = k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}

Properties:

  • All probabilities are positive (exponentials are always positive)
  • They sum to 1: \sum_{k=1}^{K} P(y = k | x) = 1
  • The class with the highest score gets the highest probability

Softmax example

Three classes, scores z = [2.0, 1.0, 0.5]:

e^{z_1} = e^{2.0} \approx 7.389

e^{z_2} = e^{1.0} \approx 2.718

e^{z_3} = e^{0.5} \approx 1.649

\text{sum} = 7.389 + 2.718 + 1.649 = 11.756

P(y=1) = \frac{7.389}{11.756} \approx 0.629

P(y=2) = \frac{2.718}{11.756} \approx 0.231

P(y=3) = \frac{1.649}{11.756} \approx 0.140

Class 1 has the highest probability. The softmax amplifies differences: score gaps of 1.0 and 0.5 become probability ratios of roughly 2.7:1 and 1.6:1.
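The same computation in code. The max-subtraction trick is our addition: it changes nothing mathematically but avoids overflow for large scores:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.5]))
print(probs)       # ≈ [0.629, 0.231, 0.140]
print(probs.sum()) # 1.0
```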

Multi-class cross-entropy loss

The loss for a single example with true class c is:

\ell = -\log P(y = c | x) = -\log \frac{e^{z_c}}{\sum_j e^{z_j}}

This is called categorical cross-entropy. It only looks at the probability assigned to the correct class. If the model puts high probability on the right class, loss is low. If it spreads probability to wrong classes, loss is high.

For the full dataset:

L = -\frac{1}{n} \sum_{i=1}^{n} \log P(y = y_i | x_i)
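A sketch of this loss computed from raw scores (the log-sum-exp form and the helper name are ours; it is equivalent to taking softmax first, then the log):

```python
import numpy as np

def categorical_cross_entropy(Z, labels):
    """Mean -log P(correct class) from raw scores Z, shape (n, K)."""
    Z = Z - Z.max(axis=1, keepdims=True)  # stability shift per row
    log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

Z = np.array([[2.0, 1.0, 0.5],   # correct class 0 gets the top score
              [0.2, 3.0, 0.1]])  # correct class 1 gets the top score
labels = np.array([0, 1])
print(categorical_cross_entropy(Z, labels))  # low: both rows favor the right class
```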

Logistic regression with regularization

Just like linear regression, logistic regression benefits from regularization. Adding an L2 penalty:

L_{\text{reg}} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)] + \lambda \|w\|_2^2

In scikit-learn, the C parameter is 1/\lambda. Larger C means less regularization:

from sklearn.linear_model import LogisticRegression

# C=1.0 is the default
model = LogisticRegression(C=1.0, penalty='l2', max_iter=1000)
model.fit(X_train, y_train)

# Probabilities
probs = model.predict_proba(X_test)  # shape (n, 2) for binary

# Class predictions
preds = model.predict(X_test)

Why logistic regression, not “logistic classification”?

Despite being used for classification, it’s called “regression” because the model regresses on the log-odds:

\log \frac{\hat{p}}{1 - \hat{p}} = w^Tx + b

The left side is the log-odds (logit), which ranges from -\infty to +\infty. The linear model predicts the logit, and we transform it to a probability with the sigmoid. So internally, it's doing regression on a transformed target.
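You can check the inverse relationship directly (helper names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid on (0, 1)."""
    return np.log(p / (1 - p))

# The linear model predicts z; sigmoid maps it to p; logit recovers z
z = 0.8
p = sigmoid(z)
print(np.isclose(logit(p), z))  # True
print(logit(0.5))               # 0.0, matching sigmoid(0) = 0.5
```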

Summary

| Component | Formula | Purpose |
|-----------|---------|---------|
| Sigmoid | \sigma(z) = \frac{1}{1 + e^{-z}} | Maps linear output to probability |
| Cross-entropy | -[y\log\hat{p} + (1-y)\log(1-\hat{p})] | Loss for binary classification |
| Gradient | \frac{1}{n}X^T(\hat{p} - y) | Same form as linear regression |
| Softmax | \frac{e^{z_k}}{\sum_j e^{z_j}} | Multi-class probabilities |
| Decision boundary | w^Tx + b = 0 | Where predicted class changes |

Logistic regression is simple, interpretable, and surprisingly effective. It’s the standard baseline for classification problems. When it doesn’t work well enough, the fix is usually adding better features or switching to a more flexible model, not abandoning the framework. Neural networks are essentially stacked logistic regressions with nonlinear activations.

What comes next

Your model makes predictions, but how good are they? Accuracy alone isn’t enough, especially with imbalanced classes. The next article, Evaluation metrics, covers precision, recall, F1 score, ROC curves, and AUC, the tools you need to properly measure classification performance.
