Support Vector Machines
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
A support vector machine draws the decision boundary that sits as far as possible from both classes. That “as far as possible” is what makes SVMs different from other linear classifiers. Instead of just finding any line that separates the data, an SVM finds the best line, the one with the maximum margin.
Prerequisites: You should be comfortable with Lagrangian duality and logistic regression before reading this post.
Why SVMs?
You need to separate two classes of data points. Many lines can do it, but which line is best? A line that barely squeezes between the classes is fragile. A tiny shift in the data could flip predictions. You want the line with the widest possible gap between the classes.
Consider these eight points in 2D:
| Point | x1 | x2 | Class |
|---|---|---|---|
| A | 1.0 | 3.0 | +1 |
| B | 2.0 | 3.5 | +1 |
| C | 1.5 | 2.5 | +1 |
| D | 2.5 | 4.0 | +1 |
| E | 5.0 | 1.0 | -1 |
| F | 6.0 | 1.5 | -1 |
| G | 5.5 | 0.5 | -1 |
| H | 6.5 | 2.0 | -1 |
The positive points cluster in the upper-left. The negative points cluster in the lower-right. Many lines separate them. The SVM picks the one with the widest gap (margin) between the two groups.
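We can confirm the two clusters are linearly separable with a quick scikit-learn sketch (the large `C` approximates a hard margin; exact support-vector count depends on the geometry):

```python
import numpy as np
from sklearn.svm import SVC

# The eight points from the table above
X = np.array([[1.0, 3.0], [2.0, 3.5], [1.5, 2.5], [2.5, 4.0],
              [5.0, 1.0], [6.0, 1.5], [5.5, 0.5], [6.5, 2.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel='linear', C=1000).fit(X, y)
print(clf.score(X, y))            # 1.0 -- every point lands on the correct side
print(len(clf.support_vectors_))  # only a few points define the boundary
```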
SVM decision boundary with margin and support vectors
Separating two classes with maximum margin:
```mermaid
graph TD
    subgraph space["Feature Space"]
        direction LR
        subgraph pos["Class +1"]
            A1["A"]
            B1["B"]
            C1["C"]
            D1["D"]
        end
        subgraph neg["Class -1"]
            E1["E"]
            F1["F"]
            G1["G"]
            H1["H"]
        end
    end
    space --> boundary["Decision boundary:<br/>widest margin between classes"]
```
Support vectors are the closest points to the boundary:
```mermaid
graph LR
    SV1["Support vector: closest +1 point"] --- margin["MARGIN:<br/>widest possible gap"]
    margin --- SV2["Support vector: closest -1 point"]
    margin --> boundary["Decision boundary<br/>sits in the middle"]
```
Only the closest points to the boundary matter. These are called support vectors. Every other point could move around freely without changing the boundary, as long as it stays on its correct side. This makes SVMs robust: the solution depends on just a handful of critical points.
The SVM finds this optimal boundary by minimizing the weight vector’s magnitude. Smaller weights mean a wider margin. Points that violate the margin get penalized.
Now let’s formalize this as an optimization problem.
The max-margin classifier
Suppose you have two classes of points that are linearly separable. There are infinitely many hyperplanes that separate them. Which one should you pick?
The SVM answer: pick the hyperplane that maximizes the distance to the nearest data point from either class. That distance is called the margin.
A hyperplane in $\mathbb{R}^d$ is defined by:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

where $\mathbf{w}$ is the normal vector and $b$ is the bias. For a point $\mathbf{x}_i$ with label $y_i \in \{-1, +1\}$, the signed distance to the hyperplane is:

$$\frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

The margin is twice the distance from the hyperplane to the nearest point:

$$\gamma = \frac{2}{\|\mathbf{w}\|}$$

We get this by enforcing a canonical scaling: the closest points satisfy $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. The optimization problem becomes:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 \;\text{ for all } i$$

We minimize $\frac{1}{2}\|\mathbf{w}\|^2$ instead of maximizing $\frac{2}{\|\mathbf{w}\|}$ because it gives us a nice quadratic objective that is easier to optimize. The factor of $\frac{1}{2}$ is there to make the derivatives cleaner.
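A minimal numpy sketch of this objective and its constraints (the helper names `margin_objective` and `satisfies_constraints` are illustrative, not library functions; the candidate hyperplane is an assumed example):

```python
import numpy as np

def margin_objective(w):
    """Objective of the hard-margin QP: (1/2) * ||w||^2."""
    return 0.5 * float(np.dot(w, w))

def satisfies_constraints(w, b, X, y):
    """Check the constraint y_i * (w . x_i + b) >= 1 for every point."""
    return bool(np.all(y * (X @ w + b) >= 1 - 1e-9))

# Two separable points, one per class
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])

w, b = np.array([0.5, 0.0]), 0.0          # candidate hyperplane: x1 = 0
print(margin_objective(w))                # 0.125
print(satisfies_constraints(w, b, X, y))  # True: both points sit on or outside the margin
```

Among all `(w, b)` satisfying the constraints, the SVM picks the one with the smallest objective, which is the widest margin.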
This is a convex quadratic program, so any local minimum is also the global minimum.
Support vectors
The points that sit exactly on the margin boundary, where $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$, are called support vectors. These are the only points that matter for defining the decision boundary.
If you removed every non-support-vector point and retrained, you would get the exact same hyperplane. This is a powerful property: the SVM solution depends only on a small subset of the training data.
Example 1: finding the max-margin hyperplane
Consider three points in 2D:

| Point | $x_1$ | $x_2$ | Label |
|---|---|---|---|
| $\mathbf{x}_1$ | 1 | 1 | +1 |
| $\mathbf{x}_2$ | 2 | 0 | +1 |
| $\mathbf{x}_3$ | -1 | 0 | -1 |

We need to find $\mathbf{w} = (w_1, w_2)$ and $b$ that minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to the margin constraints.

Step 1: Identify candidate support vectors.

The support vectors must include at least one point from each class. Let's try $\mathbf{x}_1 = (1, 1)$ and $\mathbf{x}_3 = (-1, 0)$ as support vectors (the points that land exactly on the margin).

Step 2: Set up the equations for the support vectors.

For $\mathbf{x}_1 = (1, 1)$ with $y_1 = +1$:

$$w_1 + w_2 + b = 1$$

For $\mathbf{x}_3 = (-1, 0)$ with $y_3 = -1$:

$$-(-w_1 + b) = 1 \quad\Longrightarrow\quad -w_1 + b = -1$$

Step 3: Solve for $w_2$ and $b$ in terms of $w_1$.

From the second equation:

$$b = w_1 - 1$$

Substitute into the first equation:

$$w_1 + w_2 + w_1 - 1 = 1 \quad\Longrightarrow\quad w_2 = 2 - 2w_1$$

Step 4: Minimize $\|\mathbf{w}\|^2 = w_1^2 + (2 - 2w_1)^2$.

Take the derivative and set it to zero:

$$2w_1 - 4(2 - 2w_1) = 10w_1 - 8 = 0 \quad\Longrightarrow\quad w_1 = 0.8$$

Step 5: Compute the remaining parameters.

$$w_2 = 2 - 2(0.8) = 0.4, \qquad b = 0.8 - 1 = -0.2$$

So $\mathbf{w} = (0.8, 0.4)$ and $b = -0.2$.

Step 6: Verify the constraint for $\mathbf{x}_2 = (2, 0)$.

$$y_2(\mathbf{w}^\top \mathbf{x}_2 + b) = 1 \cdot (1.6 + 0 - 0.2) = 1.4 \ge 1$$

Point $\mathbf{x}_2$ is not a support vector. It sits safely behind the margin.

Step 7: Compute the margin.

$$\gamma = \frac{2}{\|\mathbf{w}\|} = \frac{2}{\sqrt{0.8}} \approx 2.236$$

The decision boundary is $0.8x_1 + 0.4x_2 - 0.2 = 0$, or equivalently $4x_1 + 2x_2 = 1$. The support vectors are $\mathbf{x}_1$ and $\mathbf{x}_3$.
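The hand calculation above can be double-checked numerically (a quick numpy sketch using the values just derived):

```python
import numpy as np

# Points and labels from Example 1
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])

w = np.array([0.8, 0.4])
b = -0.2

# Functional margins y_i * (w . x_i + b): support vectors hit exactly 1
margins = y * (X @ w + b)
print(margins)                # [1.  1.4 1. ]

# Geometric margin width 2 / ||w||
print(2 / np.linalg.norm(w))  # ~2.236
```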
The dual problem
The primal problem above works fine, but the dual formulation unlocks something important: it lets us express everything in terms of dot products between data points. This is the door that opens to the kernel trick.
Using the Lagrangian, we introduce multipliers $\alpha_i \ge 0$ for each constraint:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right]$$

Setting derivatives with respect to $\mathbf{w}$ and $b$ to zero:

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \sum_i \alpha_i y_i = 0$$

Substituting back gives the dual problem:

$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

Notice that the data only appears through dot products $\mathbf{x}_i^\top \mathbf{x}_j$. The KKT conditions tell us that $\alpha_i > 0$ only for the support vectors, and $\alpha_i = 0$ for all other points.
Example 2: solving the dual for our 3-point dataset
Using the same three points from Example 1, the dual objective is:

$$W(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

Step 1: Compute the dot products.

$$\mathbf{x}_1^\top \mathbf{x}_1 = 2, \quad \mathbf{x}_1^\top \mathbf{x}_2 = 2, \quad \mathbf{x}_1^\top \mathbf{x}_3 = -1, \quad \mathbf{x}_2^\top \mathbf{x}_2 = 4, \quad \mathbf{x}_2^\top \mathbf{x}_3 = -2, \quad \mathbf{x}_3^\top \mathbf{x}_3 = 1$$

Step 2: Include the labels.

The $y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$ terms (the sign flips whenever exactly one of the labels is $-1$):

$$(1,1)\!: 2, \quad (1,2)\!: 2, \quad (1,3)\!: 1, \quad (2,2)\!: 4, \quad (2,3)\!: 2, \quad (3,3)\!: 1$$

Step 3: Write the dual objective.

$$W(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2}\left(2\alpha_1^2 + 4\alpha_2^2 + \alpha_3^2\right) - 2\alpha_1\alpha_2 - \alpha_1\alpha_3 - 2\alpha_2\alpha_3$$

With the constraint $\alpha_1 + \alpha_2 - \alpha_3 = 0$, so $\alpha_3 = \alpha_1 + \alpha_2$.

Step 4: Substitute and solve.

Substituting $\alpha_3 = \alpha_1 + \alpha_2$ and setting both partial derivatives to zero yields $\alpha_2 < 0$, which violates the constraint $\alpha_2 \ge 0$. The maximum therefore lies on the boundary $\alpha_2 = 0$ (consistent with the KKT conditions: $\mathbf{x}_2$ is not a support vector). With $\alpha_2 = 0$ and $\alpha_3 = \alpha_1$, the objective reduces to $W = 2\alpha_1 - \frac{5}{2}\alpha_1^2$, and setting $\frac{dW}{d\alpha_1} = 2 - 5\alpha_1 = 0$ gives:

$$\alpha_1 = 0.4, \quad \alpha_2 = 0, \quad \alpha_3 = 0.4$$

Step 5: Recover $\mathbf{w}$ and verify.

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i = 0.4\,(1, 1) - 0.4\,(-1, 0) = (0.8, 0.4)$$

From support vector $\mathbf{x}_1$: $b = y_1 - \mathbf{w}^\top \mathbf{x}_1 = 1 - 1.2 = -0.2$.

This matches our primal solution. The support vectors are $\mathbf{x}_1$ and $\mathbf{x}_3$ (the ones with $\alpha_i > 0$), and $\mathbf{x}_2$ is not a support vector ($\alpha_2 = 0$).
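The recovery step generalizes to any dual solution: $\mathbf{w}$ is a weighted sum of the support vectors. A short numpy sketch, using the multipliers $\alpha_1 = \alpha_3 = 0.4$, $\alpha_2 = 0$ computed for this dataset:

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.4, 0.0, 0.4])  # dual solution for this dataset

# w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
print(w)  # [0.8 0.4]

# Recover b from any support vector (alpha_i > 0): y_i (w . x_i + b) = 1
i = 0
b = y[i] - X[i] @ w
print(b)  # -0.2 (up to floating point)
```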
The kernel trick
What if the data is not linearly separable? Consider points arranged in a circle: positive class in the center, negative class around the outside. No line can separate them.
The idea: map the data to a higher-dimensional space where it is linearly separable, then find the max-margin hyperplane there.
Define a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ where $D > d$. The dual problem becomes:

$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$$

The kernel trick says: we don't need to compute $\phi$ explicitly. We just need a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$ that gives us the dot product in the higher-dimensional space directly.
Common kernels:
| Kernel | Formula | Use case |
|---|---|---|
| Linear | $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$ | Linearly separable data |
| Polynomial | $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p$ | Polynomial boundaries |
| RBF (Gaussian) | $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)$ | Complex, nonlinear boundaries |
The RBF kernel maps data to an infinite-dimensional space, yet we can compute the kernel value in $O(d)$ time. That's the power of the trick.
Why kernels work
Take the polynomial kernel with $c = 0$ and $p = 2$: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$.

For $\mathbf{x}, \mathbf{z} \in \mathbb{R}^2$:

$$(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2$$

This is the dot product of $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ with $\phi(\mathbf{z}) = (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)$. The kernel computes the 3D dot product without ever building the 3D vectors.
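We can verify this identity numerically (the helpers `poly_kernel` and `phi` are illustrative, and the inputs are arbitrary example vectors):

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with c = 0: (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit degree-2 feature map for a 2D input."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

print(poly_kernel(x, z))       # 25.0: computed in 2D
print(float(phi(x) @ phi(z)))  # 25.0: same value via the explicit 3D vectors
```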
Soft margin SVM
Real data is rarely perfectly separable. The soft margin SVM allows some points to violate the margin by introducing slack variables $\xi_i \ge 0$:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$$

subject to:

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

The parameter $C$ controls the trade-off:
- Large $C$: penalize misclassifications heavily. The margin will be narrow, fitting the training data closely. This can overfit.
- Small $C$: allow more margin violations. The margin will be wider, potentially ignoring some noisy points. This is more regularized.

When $\xi_i = 0$, point $i$ is correctly classified and outside the margin. When $0 < \xi_i \le 1$, the point is inside the margin but still on the correct side. When $\xi_i > 1$, the point is misclassified.
The dual of the soft margin SVM is the same as the hard margin dual, except now $0 \le \alpha_i \le C$. The upper bound $C$ is the only difference.
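A small sketch of that box constraint in practice, assuming scikit-learn's `SVC` (whose `dual_coef_` attribute stores $y_i \alpha_i$ for the support vectors) and a made-up non-separable dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data where one -1 point sits among the +1 cluster: no hard margin exists
X = np.array([[1.0, 1.0], [2.0, 0.0], [1.5, 0.2],
              [-1.0, 0.0], [-2.0, 1.0], [1.2, 0.8]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # dual_coef_ holds y_i * alpha_i for the support vectors, so the
    # box constraint 0 <= alpha_i <= C implies |dual_coef_| <= C
    assert np.all(np.abs(clf.dual_coef_) <= C + 1e-6)
    print(C, clf.n_support_)  # support-vector counts per class vary with C
```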
Hinge loss interpretation
The soft margin objective can be rewritten as:

$$\min_{\mathbf{w}, b} \; \sum_i \max\!\left(0,\; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) + \lambda \|\mathbf{w}\|^2$$

where $\lambda = \frac{1}{2C}$. The first term is the hinge loss, which is zero when a point is correctly classified with margin $\ge 1$, and grows linearly otherwise. This looks like regularized empirical risk minimization, connecting SVMs to the broader framework of machine learning.
You can optimize this with gradient descent or its stochastic variants like SGD, which is how SVMs are trained on large datasets in practice.
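A minimal sketch of that approach: subgradient descent on the hinge objective, with a hypothetical helper `train_linear_svm_sgd` (not a library function) and illustrative hyperparameters, run on the tiny dataset from Example 1:

```python
import numpy as np

def train_linear_svm_sgd(X, y, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Minimize sum of hinge losses + lam * ||w||^2 by subgradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            grad_w = 2 * lam * w          # gradient of the L2 penalty
            grad_b = 0.0
            if margin < 1:                # hinge subgradient is -y_i x_i inside the margin
                grad_w -= y[i] * X[i]
                grad_b -= y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
w, b = train_linear_svm_sgd(X, y)
print(np.sign(X @ w + b))  # predicted labels; should match y
```

Production SGD solvers (e.g. Pegasos-style schedules with decaying learning rates) refine this idea, but the subgradient update above is the core of it.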
SVM decision flow: from simple to complex:
```mermaid
graph TD
    A["Data to classify"] --> B{"Linearly<br/>separable?"}
    B -->|"Yes, perfectly"| C["Hard margin SVM:<br/>no violations allowed"]
    B -->|"Mostly, with noise"| D["Soft margin SVM:<br/>allow violations, tune C"]
    B -->|"No"| E["Kernel trick:<br/>map to higher dimensions"]
    E --> F{"Choose kernel"}
    F --> G["Polynomial: curved boundaries"]
    F --> H["RBF: complex, flexible boundaries"]
```
Kernel trick concept: making non-separable data separable:
```mermaid
graph LR
    A["2D: classes overlap,<br/>no line separates them"] -->|"Apply kernel function:<br/>map to higher dimensions"| B["Higher-D: classes<br/>now separable by<br/>a hyperplane"]
    B --> C["Find max-margin<br/>boundary in new space"]
    C --> D["Project decision<br/>boundary back to<br/>original space"]
```
Putting it together with code
Here is a minimal SVM implementation using scikit-learn:
```python
import numpy as np
from sklearn.svm import SVC

# Training data from Example 1
X = np.array([[1, 1], [2, 0], [-1, 0]])
y = np.array([1, 1, -1])

# Fit a linear SVM with large C (hard margin)
clf = SVC(kernel='linear', C=1000)
clf.fit(X, y)

print("w =", clf.coef_)       # [[0.8 0.4]]
print("b =", clf.intercept_)  # [-0.2]
print("Support vectors:", clf.support_vectors_)
# [[-1.  0.]
#  [ 1.  1.]]
```
The output matches our hand calculation: $\mathbf{w} = (0.8, 0.4)$ and $b = -0.2$.
Multiclass SVM
SVMs are inherently binary classifiers. For $K$ classes, there are two common strategies:
- One-vs-Rest (OvR): Train $K$ binary classifiers, each separating one class from all others. Predict the class whose classifier gives the highest score.
- One-vs-One (OvO): Train $\frac{K(K-1)}{2}$ classifiers, one for each pair of classes. Predict by majority vote.
Both approaches work well in practice. OvO trains more classifiers but each one uses less data.
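Both strategies are available in scikit-learn via the `sklearn.multiclass` wrappers; here is a quick sketch on a made-up three-class dataset:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Three well-separated clusters, one per class (illustrative data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
              [5.1, 4.9], [0.0, 5.0], [0.2, 5.1]])
y = np.array([0, 0, 1, 1, 2, 2])

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))  # 3: one classifier per class
print(len(ovo.estimators_))  # 3: K(K-1)/2 = 3 pairs for K = 3
print(ovr.predict([[0.1, 0.0]]), ovo.predict([[0.1, 0.0]]))  # both predict class 0
```

For $K = 3$ the counts coincide; at $K = 10$, OvR trains 10 classifiers while OvO trains 45, each on a smaller two-class subset.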
When to use SVMs
SVMs work well when:
- The number of features is large relative to the number of samples
- You need a strong classifier with good generalization
- The data is high-dimensional (text classification, genomics)
They struggle when:
- The dataset is very large (training time scales poorly for kernel SVMs)
- You need probability estimates (SVMs give hard decisions, though Platt scaling can add probabilities)
- The data has lots of noise and overlapping classes
Summary
| Concept | Key idea |
|---|---|
| Max margin | Maximize distance between classes |
| Support vectors | Points on the margin boundary; define the solution |
| Dual problem | Express SVM in terms of dot products |
| Kernel trick | Implicitly work in high-dimensional space |
| Soft margin | Allow margin violations with penalty |
| Hinge loss | $\max(0,\, 1 - y(\mathbf{w}^\top \mathbf{x} + b))$; connects SVM to ERM |
What comes next
Now that we have covered both supervised classification (logistic regression, SVMs) and the math behind them, we will shift to unsupervised learning. In the next post on K-Means clustering, you will learn how to group data points without labels.