
Support Vector Machines

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

A support vector machine draws the decision boundary that sits as far as possible from both classes. That “as far as possible” is what makes SVMs different from other linear classifiers. Instead of just finding any line that separates the data, an SVM finds the best line, the one with the maximum margin.

Prerequisites: You should be comfortable with Lagrangian duality and logistic regression before reading this post.

Why SVMs?

You need to separate two classes of data points. Many lines can do it, but which line is best? A line that barely squeezes between the classes is fragile. A tiny shift in the data could flip predictions. You want the line with the widest possible gap between the classes.

Consider these eight points in 2D:

| Point | $x_1$ | $x_2$ | Class |
|-------|-------|-------|-------|
| A     | 1.0   | 3.0   | +1    |
| B     | 2.0   | 3.5   | +1    |
| C     | 1.5   | 2.5   | +1    |
| D     | 2.5   | 4.0   | +1    |
| E     | 5.0   | 1.0   | -1    |
| F     | 6.0   | 1.5   | -1    |
| G     | 5.5   | 0.5   | -1    |
| H     | 6.5   | 2.0   | -1    |

The positive points cluster in the upper-left. The negative points cluster in the lower-right. Many lines separate them. The SVM picks the one with the widest gap (margin) between the two groups.

SVM decision boundary with margin and support vectors

Separating two classes with maximum margin:

```mermaid
graph TD
  subgraph space["Feature Space"]
      direction LR
      subgraph pos["Class +1"]
          A1["A"]
          B1["B"]
          C1["C"]
          D1["D"]
      end
      subgraph neg["Class -1"]
          E1["E"]
          F1["F"]
          G1["G"]
          H1["H"]
      end
  end
  space --> boundary["Decision boundary:<br/>widest margin between classes"]
```

Support vectors are the closest points to the boundary:

```mermaid
graph LR
  SV1["Support vector: closest +1 point"] --- margin["MARGIN:<br/>widest possible gap"]
  margin --- SV2["Support vector: closest -1 point"]
  margin --> boundary["Decision boundary<br/>sits in the middle"]
```

Only the closest points to the boundary matter. These are called support vectors. Every other point could move around freely without changing the boundary, as long as it stays on its correct side. This makes SVMs robust: the solution depends on just a handful of critical points.

The SVM finds this optimal boundary by minimizing the weight vector’s magnitude. Smaller weights mean a wider margin. Points that violate the margin get penalized.

Now let’s formalize this as an optimization problem.

The max-margin classifier

Suppose you have two classes of points that are linearly separable. There are infinitely many hyperplanes that separate them. Which one should you pick?

The SVM answer: pick the hyperplane that maximizes the distance to the nearest data point from either class. That distance is called the margin.

A hyperplane in $\mathbb{R}^d$ is defined by:

$$w \cdot x + b = 0$$

where $w$ is the normal vector and $b$ is the bias. For a point $x_i$ with label $y_i \in \{-1, +1\}$, the signed distance to the hyperplane is:

$$\frac{y_i(w \cdot x_i + b)}{\|w\|}$$

The margin is twice the distance from the hyperplane to the nearest point:

$$\text{margin} = \frac{2}{\|w\|}$$

We get this by enforcing a canonical scaling: the closest points satisfy $y_i(w \cdot x_i + b) = 1$. The optimization problem becomes:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1 \quad \forall i$$

We minimize $\|w\|^2$ instead of maximizing $\frac{2}{\|w\|}$ because it gives a smooth quadratic objective that is easier to optimize. The factor of $\frac{1}{2}$ makes the derivatives cleaner.

This is a convex quadratic program, so any local minimum is also the global minimum.
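To make that concrete, here is a minimal sketch that hands the hard-margin primal for the eight points from the earlier table to a general-purpose solver (assuming NumPy and SciPy are available; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# The eight points from the table in "Why SVMs?"
X = np.array([[1.0, 3.0], [2.0, 3.5], [1.5, 2.5], [2.5, 4.0],
              [5.0, 1.0], [6.0, 1.5], [5.5, 0.5], [6.5, 2.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# Variables v = (w1, w2, b); objective is (1/2)||w||^2
def objective(v):
    return 0.5 * (v[0] ** 2 + v[1] ** 2)

# One inequality constraint per point: y_i (w . x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```

Because the problem is convex, the solver's answer is the unique max-margin hyperplane regardless of the starting point.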

Support vectors

The points that sit exactly on the margin boundary, where $y_i(w \cdot x_i + b) = 1$, are called support vectors. These are the only points that matter for defining the decision boundary.

If you removed every non-support-vector point and retrained, you would get the exact same hyperplane. This is a powerful property: the SVM solution depends only on a small subset of the training data.
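A quick scikit-learn sketch of this property on the eight-point toy data from above (a very large C approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 3.0], [2.0, 3.5], [1.5, 2.5], [2.5, 4.0],
              [5.0, 1.0], [6.0, 1.5], [5.5, 0.5], [6.5, 2.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

full = SVC(kernel="linear", C=1e6).fit(X, y)

# Retrain on the support vectors alone: the hyperplane should not change
sv = full.support_
reduced = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])

print(np.allclose(full.coef_, reduced.coef_, atol=1e-3))       # same w
print(np.allclose(full.intercept_, reduced.intercept_, atol=1e-3))  # same b
```

Both checks should print `True`: the discarded points were never constraining the solution.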

Example 1: finding the max-margin hyperplane

Consider three points in 2D:

| Point | $x$ | Label $y$ |
|-------|-----|-----------|
| $x_1$ | $(1, 1)$  | $+1$ |
| $x_2$ | $(2, 0)$  | $+1$ |
| $x_3$ | $(-1, 0)$ | $-1$ |

We need to find $w = (w_1, w_2)$ and $b$ that minimize $\frac{1}{2}(w_1^2 + w_2^2)$ subject to the margin constraints.

Step 1: Identify candidate support vectors.

The support vectors must include at least one point from each class. Let's try $x_1$ and $x_3$ as support vectors (the points that land exactly on the margin).

Step 2: Set up the equations for the support vectors.

For $x_1 = (1,1)$ with $y_1 = +1$:

$$w_1(1) + w_2(1) + b = 1$$

For $x_3 = (-1, 0)$ with $y_3 = -1$:

$$-(w_1(-1) + w_2(0) + b) = 1 \implies w_1 - b = 1$$

Step 3: Solve for $b$ in terms of $w_1$.

From the second equation: $b = w_1 - 1$.

Substitute into the first equation:

$$w_1 + w_2 + (w_1 - 1) = 1 \implies 2w_1 + w_2 = 2 \implies w_2 = 2 - 2w_1$$

Step 4: Minimize $\|w\|^2$.

$$\|w\|^2 = w_1^2 + w_2^2 = w_1^2 + (2 - 2w_1)^2 = 5w_1^2 - 8w_1 + 4$$

Take the derivative and set it to zero:

$$\frac{d}{dw_1}(5w_1^2 - 8w_1 + 4) = 10w_1 - 8 = 0 \implies w_1 = \frac{4}{5}$$

Step 5: Compute the remaining parameters.

$$w_2 = 2 - 2 \times \frac{4}{5} = \frac{2}{5}, \qquad b = \frac{4}{5} - 1 = -\frac{1}{5}$$

So $w = \left(\frac{4}{5}, \frac{2}{5}\right)$ and $b = -\frac{1}{5}$.

Step 6: Verify the constraint for $x_2$.

$$y_2(w \cdot x_2 + b) = 1 \cdot \left(\frac{4}{5}(2) + \frac{2}{5}(0) - \frac{1}{5}\right) = \frac{8}{5} - \frac{1}{5} = \frac{7}{5} > 1 \checkmark$$

Point $x_2$ is not a support vector. It sits safely behind the margin.

Step 7: Compute the margin.

$$\|w\| = \sqrt{\frac{16}{25} + \frac{4}{25}} = \sqrt{\frac{20}{25}} = \frac{2\sqrt{5}}{5}$$

$$\text{margin} = \frac{2}{\|w\|} = \frac{2}{\frac{2\sqrt{5}}{5}} = \frac{5}{\sqrt{5}} = \sqrt{5} \approx 2.24$$

The decision boundary is $\frac{4}{5}x_1 + \frac{2}{5}x_2 - \frac{1}{5} = 0$, or equivalently $4x_1 + 2x_2 = 1$. The support vectors are $x_1 = (1,1)$ and $x_3 = (-1, 0)$.
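The hand calculation is easy to double-check numerically (a minimal sketch with NumPy):

```python
import numpy as np

w = np.array([4/5, 2/5])
b = -1/5
X = np.array([[1, 1], [2, 0], [-1, 0]])
y = np.array([1, 1, -1])

# Margin values: support vectors give exactly 1, x2 gives 7/5
print(y * (X @ w + b))        # [1.  1.4 1. ]
print(2 / np.linalg.norm(w))  # 2.2360... = sqrt(5)
```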

The dual problem

The primal problem above works fine, but the dual formulation unlocks something important: it lets us express everything in terms of dot products between data points. This is the door that opens to the kernel trick.

Using the Lagrangian, we introduce multipliers $\alpha_i \geq 0$ for each constraint:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[y_i(w \cdot x_i + b) - 1\right]$$

Setting derivatives to zero:

$$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{n} \alpha_i y_i = 0$$

Substituting back gives the dual problem:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$.

Notice that the data only appears through dot products $x_i \cdot x_j$. The KKT conditions tell us that $\alpha_i > 0$ only for the support vectors, and $\alpha_i = 0$ for all other points.

Example 2: solving the dual for our 3-point dataset

Using the same three points from Example 1, the dual objective is:

$$W(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

Step 1: Compute the dot products.

|       | $x_1 = (1,1)$ | $x_2 = (2,0)$ | $x_3 = (-1,0)$ |
|-------|---------------|---------------|----------------|
| $x_1$ | $1+1=2$  | $2+0=2$  | $-1+0=-1$ |
| $x_2$ | $2$      | $4+0=4$  | $-2+0=-2$ |
| $x_3$ | $-1$     | $-2$     | $1+0=1$   |

Step 2: Include the labels.

$$y_1 = +1, \quad y_2 = +1, \quad y_3 = -1$$

The $y_i y_j (x_i \cdot x_j)$ terms:

|       | $j=1$ | $j=2$ | $j=3$ |
|-------|-------|-------|-------|
| $i=1$ | $(+1)(+1)(2) = 2$  | $(+1)(+1)(2) = 2$  | $(+1)(-1)(-1) = 1$ |
| $i=2$ | $(+1)(+1)(2) = 2$  | $(+1)(+1)(4) = 4$  | $(+1)(-1)(-2) = 2$ |
| $i=3$ | $(-1)(+1)(-1) = 1$ | $(-1)(+1)(-2) = 2$ | $(-1)(-1)(1) = 1$  |

Step 3: Write the dual objective.

$$W(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2}\left(2\alpha_1^2 + 4\alpha_2^2 + \alpha_3^2 + 4\alpha_1\alpha_2 + 2\alpha_1\alpha_3 + 4\alpha_2\alpha_3\right)$$

The constraint $\sum_i \alpha_i y_i = 0$ gives $\alpha_1 + \alpha_2 - \alpha_3 = 0$, so $\alpha_3 = \alpha_1 + \alpha_2$.

Step 4: Substitute and solve.

Substituting $\alpha_3 = \alpha_1 + \alpha_2$ gives $W(\alpha) = 2\alpha_1 + 2\alpha_2 - \frac{1}{2}\left(5\alpha_1^2 + 12\alpha_1\alpha_2 + 9\alpha_2^2\right)$. Setting both partial derivatives to zero yields a negative $\alpha_2$, which violates $\alpha_2 \geq 0$, so the KKT conditions place the optimum on the boundary $\alpha_2 = 0$. Maximizing $W(\alpha) = 2\alpha_1 - \frac{5}{2}\alpha_1^2$ then gives $\alpha_1 = \frac{2}{5}$:

$$\alpha_1 = \frac{2}{5}, \quad \alpha_2 = 0, \quad \alpha_3 = \frac{2}{5}$$

Step 5: Recover $w$ and verify.

$$w = \sum_i \alpha_i y_i x_i = \frac{2}{5}(+1)(1,1) + 0 \cdot (+1)(2,0) + \frac{2}{5}(-1)(-1,0) = \frac{2}{5}(1,1) + \frac{2}{5}(1,0) = \left(\frac{4}{5}, \frac{2}{5}\right)$$

This matches our primal solution. The support vectors are $x_1$ and $x_3$ (the ones with $\alpha_i > 0$), and $x_2$ is not a support vector ($\alpha_2 = 0$).
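As a cross-check, the dual can also be solved numerically (a sketch using SciPy's general-purpose solver rather than a dedicated QP library):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])

K = X @ X.T                          # Gram matrix of dot products
Q = (y[:, None] * y[None, :]) * K    # y_i y_j (x_i . x_j)

# Minimize the negated dual objective, subject to alpha >= 0 and sum alpha_i y_i = 0
def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(3),
               bounds=[(0, None)] * 3,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                  # recover w from the dual solution

print(alpha)  # ~ [0.4, 0.0, 0.4]
print(w)      # ~ [0.8, 0.4]
```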

The kernel trick

What if the data is not linearly separable? Consider points arranged in a circle: positive class in the center, negative class around the outside. No line can separate them.

The idea: map the data to a higher-dimensional space where it is linearly separable, then find the max-margin hyperplane there.

Define a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ where $D > d$. The dual problem becomes:

$$\max_{\alpha} \sum_{i} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j)$$

The kernel trick says: we don't need to compute $\phi(x)$ explicitly. We just need a kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ that gives us the dot product in the higher-dimensional space directly.

Common kernels:

| Kernel | Formula | Use case |
|--------|---------|----------|
| Linear | $K(x, z) = x \cdot z$ | Linearly separable data |
| Polynomial | $K(x, z) = (x \cdot z + c)^d$ | Polynomial boundaries |
| RBF (Gaussian) | $K(x, z) = \exp\left(-\frac{\Vert x - z\Vert^2}{2\sigma^2}\right)$ | Complex, nonlinear boundaries |

The RBF kernel maps data to an infinite-dimensional space, yet we can compute the kernel value in $O(d)$ time. That's the power of the trick.
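A sketch of that $O(d)$ computation:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # One pass over the d coordinates, yet equal to a dot product
    # in an infinite-dimensional feature space
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(rbf_kernel(x, x))  # 1.0 (identical points)
print(rbf_kernel(x, z))  # exp(-1) ~ 0.3679
```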

Why kernels work

Take the polynomial kernel with degree $2$ and $c = 0$: $K(x, z) = (x \cdot z)^2$.

For $x, z \in \mathbb{R}^2$:

$$K(x, z) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 x_2 z_1 z_2 + x_2^2 z_2^2$$

This is the dot product of $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ with $\phi(z)$. The kernel computes the 3D dot product without ever building the 3D vectors.
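This identity is easy to verify numerically (a minimal sketch):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel with c = 0
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])

print((x @ z) ** 2)     # kernel in 2D: (1*3 + 2*1)^2 = 25.0
print(phi(x) @ phi(z))  # same value via the explicit 3D map: ~25.0
```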

Soft margin SVM

Real data is rarely perfectly separable. The soft margin SVM allows some points to violate the margin by introducing slack variables $\xi_i \geq 0$:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to:

$$y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

The parameter $C$ controls the trade-off:

  • Large $C$: penalize misclassifications heavily. The margin will be narrow, fitting the training data closely. This can overfit.
  • Small $C$: allow more margin violations. The margin will be wider, potentially ignoring some noisy points. This is more regularized.

When $\xi_i = 0$, point $i$ is correctly classified and outside the margin. When $0 < \xi_i < 1$, the point is inside the margin but still on the correct side. When $\xi_i \geq 1$, the point is misclassified.

The dual of the soft margin SVM is the same as the hard margin dual, except now $0 \leq \alpha_i \leq C$. The upper bound $C$ is the only difference.
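A quick sketch of the $C$ trade-off on synthetic overlapping blobs (hypothetical data, assuming scikit-learn):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy 2D blobs that overlap slightly
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([2.5, 2.5], 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin={margin:.2f}, support vectors={clf.support_.size}")
```

The smaller $C$ should produce the wider margin, at the cost of tolerating more margin violations.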

Hinge loss interpretation

The soft margin objective can be rewritten as:

$$\min_{w, b} \sum_{i=1}^{n} \max(0, 1 - y_i(w \cdot x_i + b)) + \frac{\lambda}{2}\|w\|^2$$

where $\lambda = \frac{1}{C}$. The first term is the hinge loss, which is zero when a point is correctly classified with margin $\geq 1$, and grows linearly otherwise. This is regularized empirical risk minimization, connecting SVMs to the broader framework of machine learning.

You can optimize this with gradient descent or its stochastic variants like SGD, which is how SVMs are trained on large datasets in practice.
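A minimal subgradient-descent sketch of that objective (illustrative, untuned hyperparameters; not a production trainer):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Stochastic subgradient descent on hinge loss + (lam/2)||w||^2."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w + b) < 1:
                # Hinge is active: point is misclassified or inside the margin
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:
                # Hinge is zero: only the regularizer pulls w toward 0
                w -= lr * lam * w
    return w, b

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))  # should match y on this separable toy set
```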

SVM decision flow: from simple to complex:

```mermaid
graph TD
  A["Data to classify"] --> B{"Linearly<br/>separable?"}
  B -->|"Yes, perfectly"| C["Hard margin SVM:<br/>no violations allowed"]
  B -->|"Mostly, with noise"| D["Soft margin SVM:<br/>allow violations, tune C"]
  B -->|"No"| E["Kernel trick:<br/>map to higher dimensions"]
  E --> F{"Choose kernel"}
  F --> G["Polynomial: curved boundaries"]
  F --> H["RBF: complex, flexible boundaries"]
```

Kernel trick concept: making non-separable data separable:

```mermaid
graph LR
  A["2D: classes overlap,<br/>no line separates them"] -->|"Apply kernel function:<br/>map to higher dimensions"| B["Higher-D: classes<br/>now separable by<br/>a hyperplane"]
  B --> C["Find max-margin<br/>boundary in new space"]
  C --> D["Project decision<br/>boundary back to<br/>original space"]
```

Putting it together with code

Here is a minimal SVM implementation using scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

# Training data from Example 1
X = np.array([[1, 1], [2, 0], [-1, 0]])
y = np.array([1, 1, -1])

# Fit a linear SVM with large C (hard margin)
clf = SVC(kernel='linear', C=1000)
clf.fit(X, y)

print("w =", clf.coef_)           # [[0.8  0.4]]
print("b =", clf.intercept_)      # [-0.2]
print("Support vectors:", clf.support_vectors_)
# [[ 1.  1.]
#  [-1.  0.]]
```

The output matches our hand calculation: $w = (0.8, 0.4) = (4/5, 2/5)$ and $b = -0.2 = -1/5$.

Multiclass SVM

SVMs are inherently binary classifiers. For $K$ classes, there are two common strategies:

  • One-vs-Rest (OvR): Train $K$ binary classifiers, each separating one class from all others. Predict the class whose classifier gives the highest score.
  • One-vs-One (OvO): Train $\binom{K}{2}$ classifiers, one for each pair of classes. Predict by majority vote.

Both approaches work well in practice. OvO trains more classifiers but each one uses less data.
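In scikit-learn, `SVC` trains one-vs-one classifiers internally, while `OneVsRestClassifier` gives explicit OvR; a small sketch on the bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes, so C(3,2) = 3 pairwise models

# SVC is one-vs-one under the hood
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(ovo.decision_function(X[:1]).shape)  # (1, 3): one score per class pair

# Explicit one-vs-rest: K binary SVMs, one per class
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))                # 3: one classifier per class
```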

When to use SVMs

SVMs work well when:

  • The number of features is large relative to the number of samples
  • You need a strong classifier with good generalization
  • The data is high-dimensional (text classification, genomics)

They struggle when:

  • The dataset is very large (training time scales poorly for kernel SVMs)
  • You need probability estimates (SVMs give hard decisions, though Platt scaling can add probabilities)
  • The data has lots of noise and overlapping classes

Summary

| Concept | Key idea |
|---------|----------|
| Max margin | Maximize distance between classes |
| Support vectors | Points on the margin boundary; define the solution |
| Dual problem | Express SVM in terms of dot products |
| Kernel trick | Implicitly work in high-dimensional space |
| Soft margin | Allow margin violations with penalty $C$ |
| Hinge loss | $\max(0, 1 - y \cdot f(x))$; connects SVM to ERM |

What comes next

Now that we have covered both supervised classification (logistic regression, SVMs) and the math behind them, we will shift to unsupervised learning. In the next post on K-Means clustering, you will learn how to group data points without labels.
