Support Vector Machines
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
A support vector machine draws the decision boundary that sits as far as possible from both classes. That “as far as possible” is what makes SVMs different from other linear classifiers. Instead of just finding any line that separates the data, an SVM finds the best line, the one with the maximum margin.
Prerequisites: You should be comfortable with Lagrangian duality and logistic regression before reading this post.
Why SVMs?
You need to separate two classes of data points. Many lines can do it, but which line is best? A line that barely squeezes between the classes is fragile. A tiny shift in the data could flip predictions. You want the line with the widest possible gap between the classes.
Consider these eight points in 2D:
| Point | x1 | x2 | Class |
|---|---|---|---|
| A | 1.0 | 3.0 | +1 |
| B | 2.0 | 3.5 | +1 |
| C | 1.5 | 2.5 | +1 |
| D | 2.5 | 4.0 | +1 |
| E | 5.0 | 1.0 | -1 |
| F | 6.0 | 1.5 | -1 |
| G | 5.5 | 0.5 | -1 |
| H | 6.5 | 2.0 | -1 |
The positive points cluster in the upper-left. The negative points cluster in the lower-right. Many lines separate them. The SVM picks the one with the widest gap (margin) between the two groups.
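We can confirm the two clusters are linearly separable with a quick scikit-learn sketch (the large `C` approximates a hard margin; exact support-vector count depends on the geometry):

```python
import numpy as np
from sklearn.svm import SVC

# The eight points from the table above
X = np.array([[1.0, 3.0], [2.0, 3.5], [1.5, 2.5], [2.5, 4.0],
              [5.0, 1.0], [6.0, 1.5], [5.5, 0.5], [6.5, 2.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel='linear', C=1000).fit(X, y)
print(clf.score(X, y))            # 1.0 -- every point lands on the correct side
print(len(clf.support_vectors_))  # only a few points define the boundary
```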
SVM decision boundary with margin and support vectors
Separating two classes with maximum margin:
```mermaid
graph TD
    subgraph space["Feature Space"]
        direction LR
        subgraph pos["Class +1"]
            A1["A"]
            B1["B"]
            C1["C"]
            D1["D"]
        end
        subgraph neg["Class -1"]
            E1["E"]
            F1["F"]
            G1["G"]
            H1["H"]
        end
    end
    space --> boundary["Decision boundary:<br/>widest margin between classes"]
```
Support vectors are the closest points to the boundary:
```mermaid
graph LR
    SV1["Support vector: closest +1 point"] --- margin["MARGIN:<br/>widest possible gap"]
    margin --- SV2["Support vector: closest -1 point"]
    margin --> boundary["Decision boundary<br/>sits in the middle"]
```
Only the closest points to the boundary matter. These are called support vectors. Every other point could move around freely without changing the boundary, as long as it stays on its correct side. This makes SVMs robust: the solution depends on just a handful of critical points.
The SVM finds this optimal boundary by minimizing the weight vector’s magnitude. Smaller weights mean a wider margin. Points that violate the margin get penalized.
Now let’s formalize this as an optimization problem.
The max-margin classifier
Suppose you have two classes of points that are linearly separable. There are infinitely many hyperplanes that separate them. Which one should you pick?
The SVM answer: pick the hyperplane that maximizes the distance to the nearest data point from either class. That distance is called the margin.
A hyperplane in $\mathbb{R}^d$ is defined by:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

where $\mathbf{w}$ is the normal vector and $b$ is the bias. For a point $\mathbf{x}_i$ with label $y_i \in \{-1, +1\}$, the signed distance to the hyperplane is:

$$\frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

The margin is twice the distance from the hyperplane to the nearest point:

$$\gamma = \frac{2}{\|\mathbf{w}\|}$$

We get this by enforcing a canonical scaling: the closest points satisfy $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. The optimization problem becomes:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 \;\text{ for all } i$$

We minimize $\frac{1}{2}\|\mathbf{w}\|^2$ instead of maximizing $\frac{2}{\|\mathbf{w}\|}$ because it gives us a nice quadratic objective that is easier to optimize. The factor of $\frac{1}{2}$ is there to make the derivatives cleaner.
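A minimal numpy sketch of this objective and its constraints (the helper names `margin_objective` and `satisfies_constraints` are illustrative, not library functions; the candidate hyperplane is an assumed example):

```python
import numpy as np

def margin_objective(w):
    """Objective of the hard-margin QP: (1/2) * ||w||^2."""
    return 0.5 * float(np.dot(w, w))

def satisfies_constraints(w, b, X, y):
    """Check the constraint y_i * (w . x_i + b) >= 1 for every point."""
    return bool(np.all(y * (X @ w + b) >= 1 - 1e-9))

# Two separable points, one per class
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])

w, b = np.array([0.5, 0.0]), 0.0          # candidate hyperplane: x1 = 0
print(margin_objective(w))                # 0.125
print(satisfies_constraints(w, b, X, y))  # True: both points sit on or outside the margin
```

Among all `(w, b)` satisfying the constraints, the SVM picks the one with the smallest objective, which is the widest margin.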
This is a convex quadratic program, so any local minimum is also the global minimum.
Support vectors
The points that sit exactly on the margin boundary, where $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$, are called support vectors. These are the only points that matter for defining the decision boundary.
If you removed every non-support-vector point and retrained, you would get the exact same hyperplane. This is a powerful property: the SVM solution depends only on a small subset of the training data.
Example 1: finding the max-margin hyperplane
Consider three points in 2D:

| Point | $x_1$ | $x_2$ | Label |
|---|---|---|---|
| $\mathbf{x}_1$ | 1 | 1 | +1 |
| $\mathbf{x}_2$ | 2 | 0 | +1 |
| $\mathbf{x}_3$ | -1 | 0 | -1 |

We need to find $\mathbf{w} = (w_1, w_2)$ and $b$ that minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to the margin constraints.

Step 1: Identify candidate support vectors.

The support vectors must include at least one point from each class. Let's try $\mathbf{x}_1 = (1, 1)$ and $\mathbf{x}_3 = (-1, 0)$ as support vectors (the points that land exactly on the margin).

Step 2: Set up the equations for the support vectors.

For $\mathbf{x}_1 = (1, 1)$ with $y_1 = +1$:

$$w_1 + w_2 + b = 1$$

For $\mathbf{x}_3 = (-1, 0)$ with $y_3 = -1$:

$$-(-w_1 + b) = 1 \quad\Longrightarrow\quad -w_1 + b = -1$$

Step 3: Solve for $w_2$ and $b$ in terms of $w_1$.

From the second equation:

$$b = w_1 - 1$$

Substitute into the first equation:

$$w_1 + w_2 + w_1 - 1 = 1 \quad\Longrightarrow\quad w_2 = 2 - 2w_1$$

Step 4: Minimize $\|\mathbf{w}\|^2 = w_1^2 + (2 - 2w_1)^2$.

Take the derivative and set it to zero:

$$2w_1 - 4(2 - 2w_1) = 10w_1 - 8 = 0 \quad\Longrightarrow\quad w_1 = 0.8$$

Step 5: Compute the remaining parameters.

$$w_2 = 2 - 2(0.8) = 0.4, \qquad b = 0.8 - 1 = -0.2$$

So $\mathbf{w} = (0.8, 0.4)$ and $b = -0.2$.

Step 6: Verify the constraint for $\mathbf{x}_2 = (2, 0)$.

$$y_2(\mathbf{w}^\top \mathbf{x}_2 + b) = 1 \cdot (1.6 + 0 - 0.2) = 1.4 \ge 1$$

Point $\mathbf{x}_2$ is not a support vector. It sits safely behind the margin.

Step 7: Compute the margin.

$$\gamma = \frac{2}{\|\mathbf{w}\|} = \frac{2}{\sqrt{0.8}} \approx 2.236$$

The decision boundary is $0.8x_1 + 0.4x_2 - 0.2 = 0$, or equivalently $4x_1 + 2x_2 = 1$. The support vectors are $\mathbf{x}_1$ and $\mathbf{x}_3$.
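The hand calculation above can be double-checked numerically (a quick numpy sketch using the values just derived):

```python
import numpy as np

# Points and labels from Example 1
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])

w = np.array([0.8, 0.4])
b = -0.2

# Functional margins y_i * (w . x_i + b): support vectors hit exactly 1
margins = y * (X @ w + b)
print(margins)                # [1.  1.4 1. ]

# Geometric margin width 2 / ||w||
print(2 / np.linalg.norm(w))  # ~2.236
```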
The dual problem
The primal problem above works fine, but the dual formulation unlocks something important: it lets us express everything in terms of dot products between data points. This is the door that opens to the kernel trick.
Using the Lagrangian, we introduce multipliers $\alpha_i \ge 0$ for each constraint:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right]$$

Setting derivatives with respect to $\mathbf{w}$ and $b$ to zero:

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \sum_i \alpha_i y_i = 0$$

Substituting back gives the dual problem:

$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

Notice that the data only appears through dot products $\mathbf{x}_i^\top \mathbf{x}_j$. The KKT conditions tell us that $\alpha_i > 0$ only for the support vectors, and $\alpha_i = 0$ for all other points.
Example 2: solving the dual for our 3-point dataset
Using the same three points from Example 1, the dual objective is:

$$W(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

Step 1: Compute the dot products.

$$\mathbf{x}_1^\top \mathbf{x}_1 = 2, \quad \mathbf{x}_1^\top \mathbf{x}_2 = 2, \quad \mathbf{x}_1^\top \mathbf{x}_3 = -1, \quad \mathbf{x}_2^\top \mathbf{x}_2 = 4, \quad \mathbf{x}_2^\top \mathbf{x}_3 = -2, \quad \mathbf{x}_3^\top \mathbf{x}_3 = 1$$

Step 2: Include the labels.

The $y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$ terms (the sign flips whenever exactly one of the labels is $-1$):

$$(1,1)\!: 2, \quad (1,2)\!: 2, \quad (1,3)\!: 1, \quad (2,2)\!: 4, \quad (2,3)\!: 2, \quad (3,3)\!: 1$$

Step 3: Write the dual objective.

$$W(\boldsymbol{\alpha}) = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2}\left(2\alpha_1^2 + 4\alpha_2^2 + \alpha_3^2\right) - 2\alpha_1\alpha_2 - \alpha_1\alpha_3 - 2\alpha_2\alpha_3$$

With the constraint $\alpha_1 + \alpha_2 - \alpha_3 = 0$, so $\alpha_3 = \alpha_1 + \alpha_2$.

Step 4: Substitute and solve.

Substituting $\alpha_3 = \alpha_1 + \alpha_2$ and setting both partial derivatives to zero yields $\alpha_2 < 0$, which violates the constraint $\alpha_2 \ge 0$. The maximum therefore lies on the boundary $\alpha_2 = 0$ (consistent with the KKT conditions: $\mathbf{x}_2$ is not a support vector). With $\alpha_2 = 0$ and $\alpha_3 = \alpha_1$, the objective reduces to $W = 2\alpha_1 - \frac{5}{2}\alpha_1^2$, and setting $\frac{dW}{d\alpha_1} = 2 - 5\alpha_1 = 0$ gives:

$$\alpha_1 = 0.4, \quad \alpha_2 = 0, \quad \alpha_3 = 0.4$$

Step 5: Recover $\mathbf{w}$ and verify.

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i = 0.4\,(1, 1) - 0.4\,(-1, 0) = (0.8, 0.4)$$

From support vector $\mathbf{x}_1$: $b = y_1 - \mathbf{w}^\top \mathbf{x}_1 = 1 - 1.2 = -0.2$.

This matches our primal solution. The support vectors are $\mathbf{x}_1$ and $\mathbf{x}_3$ (the ones with $\alpha_i > 0$), and $\mathbf{x}_2$ is not a support vector ($\alpha_2 = 0$).
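The recovery step generalizes to any dual solution: $\mathbf{w}$ is a weighted sum of the support vectors. A short numpy sketch, using the multipliers $\alpha_1 = \alpha_3 = 0.4$, $\alpha_2 = 0$ computed for this dataset:

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.4, 0.0, 0.4])  # dual solution for this dataset

# w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
print(w)  # [0.8 0.4]

# Recover b from any support vector (alpha_i > 0): y_i (w . x_i + b) = 1
i = 0
b = y[i] - X[i] @ w
print(b)  # -0.2 (up to floating point)
```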
The kernel trick
What if the data is not linearly separable? Consider points arranged in a circle: positive class in the center, negative class around the outside. No line can separate them.
The idea: map the data to a higher-dimensional space where it is linearly separable, then find the max-margin hyperplane there.
Define a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ where $D > d$. The dual problem becomes:

$$\max_{\boldsymbol{\alpha}} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$$

The kernel trick says: we don't need to compute $\phi$ explicitly. We just need a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$ that gives us the dot product in the higher-dimensional space directly.
Common kernels:
| Kernel | Formula | Use case |
|---|---|---|
| Linear | $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$ | Linearly separable data |
| Polynomial | $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z} + c)^p$ | Polynomial boundaries |
| RBF (Gaussian) | $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)$ | Complex, nonlinear boundaries |
The RBF kernel maps data to an infinite-dimensional space, yet we can compute the kernel value in $O(d)$ time. That's the power of the trick.
Why kernels work
Take the polynomial kernel with $c = 0$ and $p = 2$: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$.

For $\mathbf{x}, \mathbf{z} \in \mathbb{R}^2$:

$$(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2$$

This is the dot product of $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ with $\phi(\mathbf{z}) = (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)$. The kernel computes the 3D dot product without ever building the 3D vectors.
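We can verify this identity numerically (the helpers `poly_kernel` and `phi` are illustrative, and the inputs are arbitrary example vectors):

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with c = 0: (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit degree-2 feature map for a 2D input."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

print(poly_kernel(x, z))       # 25.0: computed in 2D
print(float(phi(x) @ phi(z)))  # 25.0: same value via the explicit 3D vectors
```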
Soft margin SVM
Real data is rarely perfectly separable. The soft margin SVM allows some points to violate the margin by introducing slack variables $\xi_i \ge 0$:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$$

subject to:

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

The parameter $C$ controls the trade-off:
- Large $C$: penalize misclassifications heavily. The margin will be narrow, fitting the training data closely. This can overfit.
- Small $C$: allow more margin violations. The margin will be wider, potentially ignoring some noisy points. This is more regularized.

When $\xi_i = 0$, point $i$ is correctly classified and outside the margin. When $0 < \xi_i \le 1$, the point is inside the margin but still on the correct side. When $\xi_i > 1$, the point is misclassified.
The dual of the soft margin SVM is the same as the hard margin dual, except now $0 \le \alpha_i \le C$. The upper bound $C$ is the only difference.
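A small sketch of that box constraint in practice, assuming scikit-learn's `SVC` (whose `dual_coef_` attribute stores $y_i \alpha_i$ for the support vectors) and a made-up non-separable dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data where one -1 point sits among the +1 cluster: no hard margin exists
X = np.array([[1.0, 1.0], [2.0, 0.0], [1.5, 0.2],
              [-1.0, 0.0], [-2.0, 1.0], [1.2, 0.8]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # dual_coef_ holds y_i * alpha_i for the support vectors, so the
    # box constraint 0 <= alpha_i <= C implies |dual_coef_| <= C
    assert np.all(np.abs(clf.dual_coef_) <= C + 1e-6)
    print(C, clf.n_support_)  # support-vector counts per class vary with C
```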
Hinge loss interpretation
The soft margin objective can be rewritten as:

$$\min_{\mathbf{w}, b} \; \sum_i \max\!\left(0,\; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) + \lambda \|\mathbf{w}\|^2$$

where $\lambda = \frac{1}{2C}$. The first term is the hinge loss, which is zero when a point is correctly classified with margin $\ge 1$, and grows linearly otherwise. This looks like regularized empirical risk minimization, connecting SVMs to the broader framework of machine learning.
You can optimize this with gradient descent or its stochastic variants like SGD, which is how SVMs are trained on large datasets in practice.
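A minimal sketch of that approach: subgradient descent on the hinge objective, with a hypothetical helper `train_linear_svm_sgd` (not a library function) and illustrative hyperparameters, run on the tiny dataset from Example 1:

```python
import numpy as np

def train_linear_svm_sgd(X, y, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Minimize sum of hinge losses + lam * ||w||^2 by subgradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            grad_w = 2 * lam * w          # gradient of the L2 penalty
            grad_b = 0.0
            if margin < 1:                # hinge subgradient is -y_i x_i inside the margin
                grad_w -= y[i] * X[i]
                grad_b -= y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
w, b = train_linear_svm_sgd(X, y)
print(np.sign(X @ w + b))  # predicted labels; should match y
```

Production SGD solvers (e.g. Pegasos-style schedules with decaying learning rates) refine this idea, but the subgradient update above is the core of it.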
SVM decision flow: from simple to complex:
```mermaid
graph TD
    A["Data to classify"] --> B{"Linearly<br/>separable?"}
    B -->|"Yes, perfectly"| C["Hard margin SVM:<br/>no violations allowed"]
    B -->|"Mostly, with noise"| D["Soft margin SVM:<br/>allow violations, tune C"]
    B -->|"No"| E["Kernel trick:<br/>map to higher dimensions"]
    E --> F{"Choose kernel"}
    F --> G["Polynomial: curved boundaries"]
    F --> H["RBF: complex, flexible boundaries"]
```
Kernel trick concept: making non-separable data separable:
```mermaid
graph LR
    A["2D: classes overlap,<br/>no line separates them"] -->|"Apply kernel function:<br/>map to higher dimensions"| B["Higher-D: classes<br/>now separable by<br/>a hyperplane"]
    B --> C["Find max-margin<br/>boundary in new space"]
    C --> D["Project decision<br/>boundary back to<br/>original space"]
```
Putting it together with code
Here is a minimal SVM implementation using scikit-learn:
```python
import numpy as np
from sklearn.svm import SVC

# Training data from Example 1
X = np.array([[1, 1], [2, 0], [-1, 0]])
y = np.array([1, 1, -1])

# Fit a linear SVM with large C (hard margin)
clf = SVC(kernel='linear', C=1000)
clf.fit(X, y)

print("w =", clf.coef_)       # [[0.8 0.4]]
print("b =", clf.intercept_)  # [-0.2]
print("Support vectors:", clf.support_vectors_)
# [[-1.  0.]
#  [ 1.  1.]]
```
The output matches our hand calculation: $\mathbf{w} = (0.8, 0.4)$ and $b = -0.2$.
Multiclass SVM
SVMs are inherently binary classifiers. For $K$ classes, there are two common strategies:
- One-vs-Rest (OvR): Train $K$ binary classifiers, each separating one class from all others. Predict the class whose classifier gives the highest score.
- One-vs-One (OvO): Train $\frac{K(K-1)}{2}$ classifiers, one for each pair of classes. Predict by majority vote.
Both approaches work well in practice. OvO trains more classifiers but each one uses less data.
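Both strategies are available in scikit-learn via the `sklearn.multiclass` wrappers; here is a quick sketch on a made-up three-class dataset:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Three well-separated clusters, one per class (illustrative data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
              [5.1, 4.9], [0.0, 5.0], [0.2, 5.1]])
y = np.array([0, 0, 1, 1, 2, 2])

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))  # 3: one classifier per class
print(len(ovo.estimators_))  # 3: K(K-1)/2 = 3 pairs for K = 3
print(ovr.predict([[0.1, 0.0]]), ovo.predict([[0.1, 0.0]]))  # both predict class 0
```

For $K = 3$ the counts coincide; at $K = 10$, OvR trains 10 classifiers while OvO trains 45, each on a smaller two-class subset.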
When to use SVMs
SVMs work well when:
- The number of features is large relative to the number of samples
- You need a strong classifier with good generalization
- The data is high-dimensional (text classification, genomics)
They struggle when:
- The dataset is very large (training time scales poorly for kernel SVMs)
- You need probability estimates (SVMs give hard decisions, though Platt scaling can add probabilities)
- The data has lots of noise and overlapping classes
Summary
| Concept | Key idea |
|---|---|
| Max margin | Maximize distance between classes |
| Support vectors | Points on the margin boundary; define the solution |
| Dual problem | Express SVM in terms of dot products |
| Kernel trick | Implicitly work in high-dimensional space |
| Soft margin | Allow margin violations with penalty |
| Hinge loss | $\max(0,\, 1 - y(\mathbf{w}^\top \mathbf{x} + b))$; connects SVM to ERM |
What comes next
Now that we have covered both supervised classification (logistic regression, SVMs) and the math behind them, we will shift to unsupervised learning. In the next post on K-Means clustering, you will learn how to group data points without labels.