Logistic regression and classification
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Prerequisites: Linear regression, plus calculus and derivatives.
Linear regression predicts a number. But many problems need a category: spam or not spam, tumor benign or malignant, digit 0 through 9. Logistic regression adapts linear regression for classification by squashing predictions into probabilities.
The problem: yes-or-no answers
Consider predicting whether a student passes an exam based on study hours and practice exams taken.
| Student | Study hours | Practice exams | Pass? |
|---|---|---|---|
| 1 | 2 | 0 | No |
| 2 | 3 | 1 | No |
| 3 | 4 | 1 | No |
| 4 | 5 | 2 | Yes |
| 5 | 6 | 2 | Yes |
| 6 | 7 | 3 | Yes |
| 7 | 8 | 3 | Yes |
| 8 | 1 | 0 | No |
Student study hours vs exam outcome
Linear regression would predict a number like 0.3 or 1.7 for each student. But pass/fail is binary. We need a model that outputs a probability between 0 and 1, then decides yes or no based on a cutoff.
Why we need a probability, not a raw number
graph TD
    A["Raw input features"] --> B["Linear combination: w*x + b"]
    B --> C["Problem: output can be -3, 0.5, 142..."]
    C --> D["Solution: sigmoid squashes to 0-1"]
    D --> E["Output: probability of Pass"]
    E --> F["Threshold at 0.5"]
    F --> G["Predict Pass or Fail"]
The sigmoid function is like a dimmer switch. Small inputs produce values near 0. Large inputs produce values near 1. It transitions smoothly through 0.5 in the middle. No matter what number goes in, you always get a valid probability out.
Now let’s formalize this intuition, starting with why plain linear regression breaks down for classification.
Why linear regression fails for classification
Suppose you want to classify emails as spam (1) or not spam (0). If you use linear regression, the output $w^\top x + b$ can be anything: $-3$, $0.5$, $142$. These numbers don't make sense as probabilities.
Even if the data happens to give sensible-looking outputs for your training set, a single outlier can skew the line and mess up all your predictions. We need a function that outputs values between 0 and 1, interpretable as $P(y = 1 \mid x)$.
The sigmoid function
The sigmoid (logistic) function maps any real number to $(0, 1)$:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
- $\sigma(z) \to 0$ as $z \to -\infty$
- $\sigma(z) \to 1$ as $z \to +\infty$
- Symmetric: $\sigma(-z) = 1 - \sigma(z)$
The derivative has a nice form:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
This clean derivative is why sigmoid is so convenient for optimization. It appears naturally when you take derivatives of the loss function.
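These properties are easy to check numerically; a quick sketch (function names are my own):

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Closed form: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid(0.0))                    # 0.5, the midpoint
print(sigmoid(-10.0), sigmoid(10.0))   # near 0 and near 1
# Symmetry: sigmoid(-z) = 1 - sigmoid(z)
print(sigmoid(-2.0) + sigmoid(2.0))    # 1.0
# The closed-form derivative matches a finite-difference estimate
eps = 1e-6
numeric = (sigmoid(1.0 + eps) - sigmoid(1.0 - eps)) / (2 * eps)
print(abs(numeric - sigmoid_derivative(1.0)) < 1e-9)  # True
```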
Sigmoid maps any input to a probability between 0 and 1
graph LR
    A["Large negative z"] --> B["sigmoid near 0"]
    C["z = 0"] --> D["sigmoid = 0.5"]
    E["Large positive z"] --> F["sigmoid near 1"]
Sigmoid function with decision threshold at 0.5
The logistic regression model
Combine a linear function with the sigmoid:

$$\hat{p} = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$
$\hat{p}$ is the predicted probability that $y = 1$. To get a class prediction, apply a threshold (usually 0.5):

$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Since $\hat{p} = 0.5$ exactly when $w^\top x + b = 0$, the decision boundary is the set of points where $w^\top x + b = 0$. In 2D, this is a straight line. In higher dimensions, it's a hyperplane.
Logistic regression pipeline: from features to class prediction
graph LR
    A["Input features x"] --> B["Linear combination w*x + b"]
    B --> C["Sigmoid function"]
    C --> D["Probability p-hat"]
    D --> E["Threshold at 0.5"]
    E --> F["Class 0 or 1"]
Linear regression vs logistic regression output
graph TD
subgraph Linear["Linear Regression"]
L1["Input"] --> L2["w*x + b"]
L2 --> L3["Any real number"]
end
subgraph Logistic["Logistic Regression"]
LR1["Input"] --> LR2["w*x + b"]
LR2 --> LR3["Sigmoid"]
LR3 --> LR4["Probability in 0 to 1"]
end
The loss function: cross-entropy
MSE is a bad choice for classification. The loss surface has flat regions that make gradient descent slow. Instead, we use cross-entropy loss (also called log loss):
For a single example with true label $y \in \{0, 1\}$ and predicted probability $\hat{p}$:

$$L(y, \hat{p}) = -\big[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\big]$$

When $y = 1$: $L = -\log \hat{p}$. If $\hat{p}$ is close to 1, the loss is near 0. If $\hat{p}$ is close to 0, the loss goes to $\infty$. Cross-entropy heavily penalizes confident wrong predictions.

When $y = 0$: $L = -\log(1 - \hat{p})$. Same logic, flipped.

For the full dataset of $n$ examples:

$$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \,\big]$$
This loss is convex, so gradient descent finds the global minimum.
Example 1: computing sigmoid output and cross-entropy loss
Consider a single data point with features $x = (2, 1)$, true label $y = 1$, weights $w = (0.5, 0.3)$, and bias $b = -0.5$.
Step 1: compute the linear combination $z = w^\top x + b$.

$$z = 0.5 \cdot 2 + 0.3 \cdot 1 - 0.5 = 0.8$$

Step 2: apply sigmoid.

$$\hat{p} = \sigma(0.8) = \frac{1}{1 + e^{-0.8}} \approx 0.69$$
So the model predicts a 69% probability that this example belongs to class 1.
Step 3: compute the loss.

Since $y = 1$:

$$L = -\log(0.69) \approx 0.371$$
Step 4: check what happens if the prediction were wrong.
Suppose instead the linear score were $z = -0.8$, giving $\hat{p} = \sigma(-0.8) \approx 0.31$:

$$L = -\log(0.31) \approx 1.171$$
The loss jumped from 0.371 to 1.171. When the model is more wrong (predicting 31% for a true positive), the loss is much higher. Cross-entropy punishes confident mistakes severely.
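A quick numpy check of this arithmetic (feature and weight values chosen so that $z = 0.8$):

```python
import numpy as np

x = np.array([2.0, 1.0])        # features (assumed for illustration)
w = np.array([0.5, 0.3])        # weights
b = -0.5                        # bias

z = w @ x + b                   # linear score: 1.0 + 0.3 - 0.5 = 0.8
p_hat = 1 / (1 + np.exp(-z))    # sigmoid, about 0.69

loss_right = -np.log(p_hat)       # y = 1, model says 69%: about 0.371
loss_wrong = -np.log(1 - p_hat)   # y = 1, model says 31%: about 1.171
print(round(z, 2), round(p_hat, 2), round(loss_right, 3), round(loss_wrong, 3))
```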
The gradient
To run gradient descent, we need the gradient of the loss with respect to $w$ and $b$. Applying the chain rule through the sigmoid, for a single example:

$$\frac{\partial L}{\partial w} = (\hat{p} - y)\,x, \qquad \frac{\partial L}{\partial b} = \hat{p} - y$$

This is remarkably clean. The gradient is the prediction error times the input. Same form as linear regression.

For the full dataset:

$$\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)\, x_i, \qquad \frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)$$
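The analytic gradient can be verified against a finite-difference approximation; a minimal sketch with made-up example values:

```python
import numpy as np

def loss(w, b, x, y):
    """Binary cross-entropy for a single example with one feature."""
    p = 1 / (1 + np.exp(-(w * x + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x, y = 2.0, 1.0      # one training example (values are my own)
w, b = 0.3, -0.1     # current parameters

# Analytic gradient: (p_hat - y) * x for w, (p_hat - y) for b
p_hat = 1 / (1 + np.exp(-(w * x + b)))
dw_analytic = (p_hat - y) * x
db_analytic = p_hat - y

# Numerical gradient via central differences
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(abs(dw_analytic - dw_numeric) < 1e-8)  # True
print(abs(db_analytic - db_numeric) < 1e-8)  # True
```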
Example 2: three gradient descent steps
Let’s train a logistic regression model from scratch on a tiny dataset.
Data (1 feature $x$, plus a bias term):

| $x$ | $y$ |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
We want to learn $w$ (slope) and $b$ (intercept). Initialize $w = 0$, $b = 0$. Learning rate $\alpha = 0.1$.
Step 1:

Compute $z_i = w x_i + b$ for each point: with $w = b = 0$, $z = [0, 0, 0, 0]$.

Compute $\hat{p}_i = \sigma(z_i)$: $\hat{p} = [0.5, 0.5, 0.5, 0.5]$.

Compute errors $\hat{p}_i - y_i$: $[0.5, 0.5, -0.5, -0.5]$.

Compute gradients:

$$\frac{\partial J}{\partial w} = \frac{1}{4}(0.5 \cdot 1 + 0.5 \cdot 2 - 0.5 \cdot 3 - 0.5 \cdot 4) = -0.5, \qquad \frac{\partial J}{\partial b} = \frac{1}{4}(0.5 + 0.5 - 0.5 - 0.5) = 0$$

Update: $w = 0 - 0.1 \cdot (-0.5) = 0.05$, $b = 0 - 0.1 \cdot 0 = 0$.
Step 2:

Compute $z_i$: $z = [0.05, 0.10, 0.15, 0.20]$.

Compute $\hat{p}_i$: $\hat{p} \approx [0.5125, 0.5250, 0.5374, 0.5498]$.

Compute errors: $\approx [0.5125, 0.5250, -0.4626, -0.4502]$.

Gradients: $\frac{\partial J}{\partial w} \approx -0.5623$, $\frac{\partial J}{\partial b} \approx 0.0312$.

Update: $w \approx 0.05 + 0.0562 = 0.1062$, $b \approx -0.0031$.
Step 3:

Compute $z_i$: $z \approx [0.1031, 0.2093, 0.3156, 0.4218]$.

Compute $\hat{p}_i$: $\hat{p} \approx [0.5258, 0.5521, 0.5782, 0.6039]$.

Errors: $\approx [0.5258, 0.5521, -0.4218, -0.3961]$.

Gradients: $\frac{\partial J}{\partial w} \approx -0.3049$, $\frac{\partial J}{\partial b} \approx 0.0650$.

Update: $w \approx 0.1367$, $b \approx -0.0096$.
Progress after 3 steps: $w \approx 0.137$, $b \approx -0.010$. The weight is positive (larger $x$ means higher probability of class 1), and the bias is slightly negative. After many more steps, the model converges toward something like $w \approx 2$ and $b \approx -5$, giving a decision boundary at $x = -b/w = 2.5$. That's right between the class-0 points ($x = 1, 2$) and class-1 points ($x = 3, 4$).
```python
import numpy as np

X = np.array([1, 2, 3, 4]).reshape(-1, 1)
y = np.array([0, 0, 1, 1])
w = 0.0
b = 0.0
alpha = 0.1        # learning rate
n = len(y)

for step in range(3):
    z = w * X.flatten() + b           # linear scores
    p_hat = 1 / (1 + np.exp(-z))      # sigmoid
    errors = p_hat - y                # prediction errors
    dw = (1 / n) * np.sum(errors * X.flatten())
    db = (1 / n) * np.sum(errors)
    w -= alpha * dw
    b -= alpha * db
    print(f"Step {step + 1}: w={w:.4f}, b={b:.4f}")
```
The decision boundary
The decision boundary is where $\hat{p} = 0.5$, which means $w^\top x + b = 0$.
For 2 features, this is a line: $w_1 x_1 + w_2 x_2 + b = 0$, or equivalently $x_2 = -\frac{w_1 x_1 + b}{w_2}$.
Logistic regression can only produce linear decision boundaries. It can’t separate classes that require a curve. For that, you’d add polynomial features or use a nonlinear model.
graph TD
    A["Linear boundary<br/>w₁x₁ + w₂x₂ + b = 0"] --> B["Class 0 side<br/>w·x + b < 0"]
    A --> C["Class 1 side<br/>w·x + b > 0"]
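The limitation, and the polynomial-feature fix, show up clearly on synthetic circular data; a sketch using scikit-learn (the dataset and pipeline are my own, not from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Class 1 = points inside the unit circle: not linearly separable
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)

linear = LogisticRegression(max_iter=1000).fit(X, y)
curved = make_pipeline(
    PolynomialFeatures(degree=2),   # adds x1^2, x1*x2, x2^2
    LogisticRegression(max_iter=1000),
).fit(X, y)

print(f"linear features:     {linear.score(X, y):.2f}")
print(f"polynomial features: {curved.score(X, y):.2f}")  # near 1.0
```

The model is still logistic regression; the boundary is linear in the expanded feature space, which makes it a circle in the original one.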
Multi-class classification: softmax
Binary logistic regression handles two classes. For $K$ classes, we generalize with the softmax function.
Instead of one weight vector, we have $K$ weight vectors $w_1, \dots, w_K$. For each class $k$, compute a score:

$$z_k = w_k^\top x + b_k$$

Then softmax converts scores to probabilities:

$$P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
Properties:
- All probabilities are positive (exponentials are always positive)
- They sum to 1: $\sum_{k=1}^{K} P(y = k \mid x) = 1$
- The class with the highest score gets the highest probability
Softmax example
Three classes, scores $z = (2.0, 1.0, 0.5)$:

$$P(y = 1 \mid x) = \frac{e^{2.0}}{e^{2.0} + e^{1.0} + e^{0.5}} \approx 0.63, \qquad P(y = 2 \mid x) \approx 0.23, \qquad P(y = 3 \mid x) \approx 0.14$$

Class 1 has the highest probability. The softmax amplifies differences: score gaps of 1.0 and 0.5 become probability ratios of roughly 2.7:1 and 1.6:1.
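A minimal softmax sketch, with scores chosen to match the gaps above:

```python
import numpy as np

def softmax(z):
    """Subtract the max before exponentiating for numerical stability;
    the result is unchanged because the shift cancels in the ratio."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])    # score gaps of 1.0 and 0.5
p = softmax(z)
print(np.round(p, 3))            # roughly [0.629, 0.231, 0.140]
print(p.sum())                   # 1, up to floating point
print(p[0] / p[1], p[1] / p[2])  # about 2.72 and 1.65
```

Note the ratios are exactly $e^{1.0}$ and $e^{0.5}$: softmax turns additive score gaps into multiplicative probability ratios.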
Multi-class cross-entropy loss
The loss for a single example with true class $c$ is:

$$L = -\log P(y = c \mid x)$$
This is called categorical cross-entropy. It only looks at the probability assigned to the correct class. If the model puts high probability on the right class, loss is low. If it spreads probability to wrong classes, loss is high.
For the full dataset:

$$J = -\frac{1}{n} \sum_{i=1}^{n} \log P(y = c_i \mid x_i)$$
Logistic regression with regularization
Just like linear regression, logistic regression benefits from regularization. Adding an L2 penalty:

$$J_{\text{reg}}(w, b) = J(w, b) + \lambda \lVert w \rVert_2^2$$
In scikit-learn, the C parameter is the inverse regularization strength, $C = 1/\lambda$. Larger C means less regularization:
```python
from sklearn.linear_model import LogisticRegression

# C=1.0 is the default
model = LogisticRegression(C=1.0, penalty='l2', max_iter=1000)
model.fit(X_train, y_train)

# Probabilities
probs = model.predict_proba(X_test)  # shape (n, 2) for binary

# Class predictions
preds = model.predict(X_test)
```
Why logistic regression, not “logistic classification”?
Despite being used for classification, it's called "regression" because the model regresses on the log-odds:

$$\log \frac{\hat{p}}{1 - \hat{p}} = w^\top x + b$$
The left side is the log-odds (logit), which ranges from $-\infty$ to $+\infty$. The linear model predicts the logit, and we transform it into a probability with the sigmoid. So internally, it's doing regression on a transformed target.
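Sigmoid and logit are inverses, which is easy to confirm numerically (weights below are made up):

```python
import numpy as np

w, b = np.array([0.5, -1.2]), 0.3    # made-up weights and bias
x = np.array([2.0, 1.0])

z = w @ x + b                    # linear score (the logit)
p = 1 / (1 + np.exp(-z))         # sigmoid: logit -> probability

log_odds = np.log(p / (1 - p))   # logit transform of the probability
print(z, log_odds)               # the two match: the round trip is exact
```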
Summary
| Component | Formula | Purpose |
|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | Maps linear output to probability |
| Cross-entropy | $-[y \log \hat{p} + (1 - y) \log(1 - \hat{p})]$ | Loss for binary classification |
| Gradient | $(\hat{p} - y)\,x$ | Same form as linear regression |
| Softmax | $\frac{e^{z_k}}{\sum_j e^{z_j}}$ | Multi-class probabilities |
| Decision boundary | $w^\top x + b = 0$ | Where predicted class changes |
Logistic regression is simple, interpretable, and surprisingly effective. It’s the standard baseline for classification problems. When it doesn’t work well enough, the fix is usually adding better features or switching to a more flexible model, not abandoning the framework. Neural networks are essentially stacked logistic regressions with nonlinear activations.
What comes next
Your model makes predictions, but how good are they? Accuracy alone isn’t enough, especially with imbalanced classes. The next article, Evaluation metrics, covers precision, recall, F1 score, ROC curves, and AUC, the tools you need to properly measure classification performance.