
Logistic regression and classification

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: Linear regression; Calculus and derivatives.

Linear regression predicts a number. But many problems need a category: spam or not spam, tumor benign or malignant, digit 0 through 9. Logistic regression adapts linear regression for classification by squashing predictions into probabilities.

The problem: yes-or-no answers

Consider predicting whether a student passes an exam based on study hours and practice exams taken.

| Student | Study hours | Practice exams | Pass? |
|---------|-------------|----------------|-------|
| 1       | 2           | 0              | No    |
| 2       | 3           | 1              | No    |
| 3       | 4           | 1              | No    |
| 4       | 5           | 2              | Yes   |
| 5       | 6           | 2              | Yes   |
| 6       | 7           | 3              | Yes   |
| 7       | 8           | 3              | Yes   |
| 8       | 1           | 0              | No    |

Student study hours vs exam outcome

Linear regression would predict a number like 0.3 or 1.7 for each student. But pass/fail is binary. We need a model that outputs a probability between 0 and 1, then decides yes or no based on a cutoff.

Why we need a probability, not a raw number

graph TD
  A["Raw input features"] --> B["Linear combination: w*x + b"]
  B --> C["Problem: output can be -3, 0.5, 142..."]
  C --> D["Solution: sigmoid squashes to 0-1"]
  D --> E["Output: probability of Pass"]
  E --> F["Threshold at 0.5"]
  F --> G["Predict Pass or Fail"]

The sigmoid function is like a dimmer switch. Small inputs produce values near 0. Large inputs produce values near 1. It transitions smoothly through 0.5 in the middle. No matter what number goes in, you always get a valid probability out.

Now let’s formalize this intuition, starting with why plain linear regression breaks down for classification.

Why linear regression fails for classification

Suppose you want to classify emails as spam (1) or not spam (0). If you use linear regression, \hat{y} = w^Tx + b can output anything: -3.7, 0.5, 142. These numbers don't make sense as probabilities.

Even if the data happens to give sensible-looking outputs for your training set, a single outlier can skew the line and mess up all your predictions. We need a function that outputs values between 0 and 1, interpretable as P(y = 1 | x).

The sigmoid function

The sigmoid (logistic) function maps any real number to (0, 1):

\sigma(z) = \frac{1}{1 + e^{-z}}

Properties:

  • \sigma(0) = 0.5
  • \sigma(z) \to 1 as z \to +\infty
  • \sigma(z) \to 0 as z \to -\infty
  • Symmetric: \sigma(-z) = 1 - \sigma(z)

The derivative has a nice form:

\sigma'(z) = \sigma(z)(1 - \sigma(z))

This clean derivative is why sigmoid is so convenient for optimization. It appears naturally when you take derivatives of the loss function.
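These properties are easy to verify numerically. A quick NumPy sketch (the helper name `sigmoid` is ours):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0

# Symmetry: sigmoid(-z) == 1 - sigmoid(z)
z = 1.7
print(np.isclose(sigmoid(-z), 1 - sigmoid(z)))  # True

# Derivative: compare sigmoid(z)*(1 - sigmoid(z)) against a finite difference
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.isclose(numeric, analytic))  # True
```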

Sigmoid maps any input to a probability between 0 and 1

graph LR
  A["Large negative z"] --> B["sigmoid near 0"]
  C["z = 0"] --> D["sigmoid = 0.5"]
  E["Large positive z"] --> F["sigmoid near 1"]

Sigmoid function with decision threshold at 0.5

The logistic regression model

Combine a linear function with sigmoid:

\hat{p} = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}

\hat{p} is the predicted probability that y = 1. To get a class prediction, apply a threshold (usually 0.5):

\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \end{cases}

Since \sigma(z) \geq 0.5 when z \geq 0, the decision boundary is the set of points where w^Tx + b = 0. In 2D, this is a straight line. In higher dimensions, it's a hyperplane.
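Putting the model and the threshold together: a minimal sketch, with illustrative helper names and made-up weights:

```python
import numpy as np

def predict_proba(X, w, b):
    """Probability that each row of X belongs to class 1."""
    z = X @ w + b
    return 1 / (1 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Hard class labels: 1 where the probability clears the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Toy weights: the decision boundary is the line x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[0.0, 0.0], [2.0, 2.0]])
print(predict(X, w, b))  # [0 1]
```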

Logistic regression pipeline: from features to class prediction

graph LR
  A["Input features x"] --> B["Linear combination w*x + b"]
  B --> C["Sigmoid function"]
  C --> D["Probability p-hat"]
  D --> E["Threshold at 0.5"]
  E --> F["Class 0 or 1"]

Linear regression vs logistic regression output

graph TD
  subgraph Linear["Linear Regression"]
      L1["Input"] --> L2["w*x + b"]
      L2 --> L3["Any real number"]
  end
  subgraph Logistic["Logistic Regression"]
      LR1["Input"] --> LR2["w*x + b"]
      LR2 --> LR3["Sigmoid"]
      LR3 --> LR4["Probability in 0 to 1"]
  end

The loss function: cross-entropy

MSE is a bad choice for classification: combined with the sigmoid it produces a non-convex loss surface with flat regions that slow gradient descent. Instead, we use cross-entropy loss (also called log loss):

For a single example with true label y \in \{0, 1\} and predicted probability \hat{p}:

\ell(y, \hat{p}) = -[y \log(\hat{p}) + (1 - y) \log(1 - \hat{p})]

When y = 1: \ell = -\log(\hat{p}). If \hat{p} is close to 1, loss is near 0. If \hat{p} is close to 0, loss goes to +\infty. Heavily penalizes confident wrong predictions.

When y = 0: \ell = -\log(1 - \hat{p}). Same logic, flipped.

For the full dataset of n examples:

L(w, b) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)]

This loss is convex, so gradient descent finds the global minimum.
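A minimal implementation of this loss (the `eps` clipping is our addition, a common guard against log(0)):

```python
import numpy as np

def cross_entropy(y, p_hat, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1])
good = np.array([0.9, 0.1, 0.8])  # confident and correct
bad = np.array([0.1, 0.9, 0.2])   # confident and wrong

print(cross_entropy(y, good))  # small loss
print(cross_entropy(y, bad))   # much larger loss
```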

Example 1: computing sigmoid output and cross-entropy loss

Consider a single data point with features x = [2, 1], true label y = 1, weights w = [0.5, -0.3], and bias b = 0.1.

Step 1: compute the linear combination z.

z = w^Tx + b = 0.5 \cdot 2 + (-0.3) \cdot 1 + 0.1

= 1.0 - 0.3 + 0.1 = 0.8

Step 2: apply sigmoid.

\hat{p} = \sigma(0.8) = \frac{1}{1 + e^{-0.8}}

e^{-0.8} \approx 0.4493

\hat{p} = \frac{1}{1 + 0.4493} = \frac{1}{1.4493} \approx 0.6900

So the model predicts a 69% probability that this example belongs to class 1.

Step 3: compute the loss.

Since y = 1:

\ell = -\log(\hat{p}) = -\log(0.6900) \approx 0.3711

Step 4: check what happens if the prediction were wrong.

Suppose instead w = [-0.5, 0.3] and b = -0.1:

z = -0.5 \cdot 2 + 0.3 \cdot 1 + (-0.1) = -1.0 + 0.3 - 0.1 = -0.8

\hat{p} = \sigma(-0.8) = 1 - \sigma(0.8) \approx 1 - 0.690 = 0.310

\ell = -\log(0.310) \approx 1.171

The loss jumped from 0.371 to 1.171. When the model is more wrong (predicting 31% for a true positive), the loss is much higher. Cross-entropy punishes confident mistakes severely.
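The same arithmetic, checked in NumPy (the helper name `loss_for` is ours):

```python
import numpy as np

x = np.array([2.0, 1.0])
y = 1

def loss_for(w, b):
    """Sigmoid prediction and cross-entropy loss for the single point (x, y)."""
    z = w @ x + b
    p_hat = 1 / (1 + np.exp(-z))
    return p_hat, -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Weights from the example
p, l = loss_for(np.array([0.5, -0.3]), 0.1)
print(f"p_hat={p:.4f}, loss={l:.4f}")  # p_hat ≈ 0.6900, loss ≈ 0.3711

# Flipped weights: same |z| but on the wrong side
p, l = loss_for(np.array([-0.5, 0.3]), -0.1)
print(f"p_hat={p:.4f}, loss={l:.4f}")  # p_hat ≈ 0.3100, loss ≈ 1.1711
```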

The gradient

To run gradient descent, we need the gradient of the loss with respect to w and b. Using the chain rule:

For a single example:

\frac{\partial \ell}{\partial w_j} = (\hat{p} - y) x_j

\frac{\partial \ell}{\partial b} = \hat{p} - y

This is remarkably clean. The gradient is the prediction error (\hat{p} - y) times the input. Same form as linear regression.

For the full dataset:

\nabla_w L = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) x_i = \frac{1}{n} X^T(\hat{p} - y)

\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)

Example 2: three gradient descent steps

Let’s train a logistic regression model from scratch on a tiny dataset.

Data (1 feature plus bias):

| x | y |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |

We want to learn w (slope) and b (intercept). Initialize w = 0, b = 0. Learning rate \alpha = 0.1.

Step 1:

Compute z_i = wx_i + b for each point:

z = [0 \cdot 1 + 0, \; 0 \cdot 2 + 0, \; 0 \cdot 3 + 0, \; 0 \cdot 4 + 0] = [0, 0, 0, 0]

Compute \hat{p}_i = \sigma(z_i):

\hat{p} = [0.5, 0.5, 0.5, 0.5]

Compute errors \hat{p}_i - y_i:

\text{errors} = [0.5 - 0, \; 0.5 - 0, \; 0.5 - 1, \; 0.5 - 1] = [0.5, 0.5, -0.5, -0.5]

Compute gradients:

\frac{\partial L}{\partial w} = \frac{1}{4}(0.5 \cdot 1 + 0.5 \cdot 2 + (-0.5) \cdot 3 + (-0.5) \cdot 4) = \frac{1}{4}(0.5 + 1.0 - 1.5 - 2.0) = \frac{-2.0}{4} = -0.5

\frac{\partial L}{\partial b} = \frac{1}{4}(0.5 + 0.5 - 0.5 - 0.5) = \frac{0}{4} = 0

Update:

w \leftarrow 0 - 0.1 \cdot (-0.5) = 0.05

b \leftarrow 0 - 0.1 \cdot 0 = 0

Step 2:

Compute z:

z = [0.05 \cdot 1, \; 0.05 \cdot 2, \; 0.05 \cdot 3, \; 0.05 \cdot 4] = [0.05, 0.10, 0.15, 0.20]

Compute \hat{p}:

\hat{p} = [\sigma(0.05), \sigma(0.10), \sigma(0.15), \sigma(0.20)]

\approx [0.5125, 0.5250, 0.5374, 0.5498]

Compute errors:

= [0.5125, 0.5250, -0.4626, -0.4502]

Gradients:

\frac{\partial L}{\partial w} = \frac{1}{4}(0.5125 \cdot 1 + 0.5250 \cdot 2 + (-0.4626) \cdot 3 + (-0.4502) \cdot 4)

= \frac{1}{4}(0.5125 + 1.0500 - 1.3878 - 1.8008) = \frac{-1.6261}{4} = -0.4065

\frac{\partial L}{\partial b} = \frac{1}{4}(0.5125 + 0.5250 - 0.4626 - 0.4502) = \frac{0.1247}{4} = 0.0312

Update:

w \leftarrow 0.05 - 0.1 \cdot (-0.4065) = 0.05 + 0.0407 = 0.0907

b \leftarrow 0 - 0.1 \cdot 0.0312 = -0.0031

Step 3:

Compute z:

z = [0.0907 \cdot 1 - 0.0031, \; 0.0907 \cdot 2 - 0.0031, \; 0.0907 \cdot 3 - 0.0031, \; 0.0907 \cdot 4 - 0.0031]

= [0.0876, 0.1783, 0.2690, 0.3597]

Compute \hat{p}:

\approx [0.5219, 0.5444, 0.5668, 0.5890]

Errors:

= [0.5219, 0.5444, -0.4332, -0.4110]

Gradients:

\frac{\partial L}{\partial w} = \frac{1}{4}(0.5219 + 1.0888 - 1.2996 - 1.6440) = \frac{-1.3329}{4} = -0.3332

\frac{\partial L}{\partial b} = \frac{1}{4}(0.5219 + 0.5444 - 0.4332 - 0.4110) = \frac{0.2221}{4} = 0.0555

Update:

w \leftarrow 0.0907 + 0.0333 = 0.1240

b \leftarrow -0.0031 - 0.0056 = -0.0087

Progress after 3 steps: w = 0.124, b = -0.009. The weight is positive (larger x means higher probability of class 1), and the bias is slightly negative. After many more steps, the model converges to something like w \approx 2.1 and b \approx -5.3, giving a decision boundary at x = 5.3/2.1 \approx 2.5. That's right between the class-0 points (x = 1, 2) and class-1 points (x = 3, 4).

The same three updates in NumPy:

import numpy as np

X = np.array([1, 2, 3, 4]).reshape(-1, 1)
y = np.array([0, 0, 1, 1])

w = 0.0
b = 0.0
alpha = 0.1  # learning rate
n = len(y)

for step in range(3):
    z = w * X.flatten() + b       # linear scores
    p_hat = 1 / (1 + np.exp(-z))  # sigmoid
    errors = p_hat - y            # prediction errors
    dw = (1/n) * np.sum(errors * X.flatten())  # gradient w.r.t. w
    db = (1/n) * np.sum(errors)                # gradient w.r.t. b
    w -= alpha * dw
    b -= alpha * db
    print(f"Step {step+1}: w={w:.4f}, b={b:.4f}")

The decision boundary

The decision boundary is where \hat{p} = 0.5, which means w^Tx + b = 0.

For 2 features, this is a line: w_1 x_1 + w_2 x_2 + b = 0, or equivalently x_2 = -(w_1/w_2)x_1 - b/w_2.

Logistic regression can only produce linear decision boundaries. It can’t separate classes that require a curve. For that, you’d add polynomial features or use a nonlinear model.

graph TD
  A["Linear boundary<br/>w₁x₁ + w₂x₂ + b = 0"] --> B["Class 0 side<br/>w·x + b < 0"]
  A --> C["Class 1 side<br/>w·x + b > 0"]
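To make the boundary formula concrete, here is a tiny sketch with made-up weights w_1 = 2, w_2 = 1, b = -4:

```python
# Boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1/w2)*x1 - b/w2
w1, w2, b = 2.0, 1.0, -4.0

def boundary_x2(x1):
    """The x2 value on the decision boundary for a given x1."""
    return -(w1 / w2) * x1 - b / w2

# A few points on the line; any point above it falls on the class-1 side
for x1 in [0.0, 1.0, 2.0]:
    print(x1, boundary_x2(x1))  # (0, 4), (1, 2), (2, 0)
```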

Multi-class classification: softmax

Binary logistic regression handles two classes. For K > 2 classes, we generalize with the softmax function.

Instead of one weight vector, we have K weight vectors w_1, w_2, \ldots, w_K. For each class k, compute a score:

z_k = w_k^T x + b_k

Then softmax converts scores to probabilities:

P(y = k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}

Properties:

  • All probabilities are positive (exponentials are always positive)
  • They sum to 1: \sum_{k=1}^{K} P(y = k | x) = 1
  • The class with the highest score gets the highest probability

Softmax example

Three classes, scores z = [2.0, 1.0, 0.5]:

e^{z_1} = e^{2.0} \approx 7.389

e^{z_2} = e^{1.0} \approx 2.718

e^{z_3} = e^{0.5} \approx 1.649

\text{sum} = 7.389 + 2.718 + 1.649 = 11.756

P(y=1) = \frac{7.389}{11.756} \approx 0.629

P(y=2) = \frac{2.718}{11.756} \approx 0.231

P(y=3) = \frac{1.649}{11.756} \approx 0.140

Class 1 has the highest probability. The softmax amplifies differences: score gaps of 1.0 and 0.5 become probability ratios of roughly 2.7:1 and 1.6:1.
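The same computation in code. The max-subtraction trick is our addition: it changes nothing mathematically but avoids overflow for large scores:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.5]))
print(probs)       # ≈ [0.629, 0.231, 0.140]
print(probs.sum()) # 1.0
```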

Multi-class cross-entropy loss

The loss for a single example with true class c is:

\ell = -\log P(y = c | x) = -\log \frac{e^{z_c}}{\sum_j e^{z_j}}

This is called categorical cross-entropy. It only looks at the probability assigned to the correct class. If the model puts high probability on the right class, loss is low. If it spreads probability to wrong classes, loss is high.

For the full dataset:

L = -\frac{1}{n} \sum_{i=1}^{n} \log P(y = y_i | x_i)
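A sketch of this loss computed from raw scores (the log-sum-exp form and the helper name are ours; it is equivalent to taking softmax first, then the log):

```python
import numpy as np

def categorical_cross_entropy(Z, labels):
    """Mean -log P(correct class) from raw scores Z, shape (n, K)."""
    Z = Z - Z.max(axis=1, keepdims=True)  # stability shift per row
    log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

Z = np.array([[2.0, 1.0, 0.5],   # correct class 0 gets the top score
              [0.2, 3.0, 0.1]])  # correct class 1 gets the top score
labels = np.array([0, 1])
print(categorical_cross_entropy(Z, labels))  # low: both rows favor the right class
```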

Logistic regression with regularization

Just like linear regression, logistic regression benefits from regularization. Adding an L2 penalty:

L_{\text{reg}} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)] + \lambda \|w\|_2^2

In scikit-learn, the C parameter is 1/\lambda. Larger C means less regularization:

from sklearn.linear_model import LogisticRegression

# C=1.0 is the default
model = LogisticRegression(C=1.0, penalty='l2', max_iter=1000)
model.fit(X_train, y_train)

# Probabilities
probs = model.predict_proba(X_test)  # shape (n, 2) for binary

# Class predictions
preds = model.predict(X_test)

Why logistic regression, not “logistic classification”?

Despite being used for classification, it’s called “regression” because the model regresses on the log-odds:

\log \frac{\hat{p}}{1 - \hat{p}} = w^Tx + b

The left side is the log-odds (logit), which ranges from -\infty to +\infty. The linear model predicts the logit, and we transform it to a probability with the sigmoid. So internally, it's doing regression on a transformed target.
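You can check the inverse relationship directly (helper names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid on (0, 1)."""
    return np.log(p / (1 - p))

# The linear model predicts z; sigmoid maps it to p; logit recovers z
z = 0.8
p = sigmoid(z)
print(np.isclose(logit(p), z))  # True
print(logit(0.5))               # 0.0, matching sigmoid(0) = 0.5
```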

Summary

| Component | Formula | Purpose |
|-----------|---------|---------|
| Sigmoid | \sigma(z) = \frac{1}{1 + e^{-z}} | Maps linear output to probability |
| Cross-entropy | -[y\log\hat{p} + (1-y)\log(1-\hat{p})] | Loss for binary classification |
| Gradient | \frac{1}{n}X^T(\hat{p} - y) | Same form as linear regression |
| Softmax | \frac{e^{z_k}}{\sum_j e^{z_j}} | Multi-class probabilities |
| Decision boundary | w^Tx + b = 0 | Where predicted class changes |

Logistic regression is simple, interpretable, and surprisingly effective. It’s the standard baseline for classification problems. When it doesn’t work well enough, the fix is usually adding better features or switching to a more flexible model, not abandoning the framework. Neural networks are essentially stacked logistic regressions with nonlinear activations.

What comes next

Your model makes predictions, but how good are they? Accuracy alone isn’t enough, especially with imbalanced classes. The next article, Evaluation metrics, covers precision, recall, F1 score, ROC curves, and AUC, the tools you need to properly measure classification performance.
