
K-Nearest Neighbors

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

KNN is the laziest algorithm in machine learning, and that is not an insult. Most classifiers spend time during training to build a model: learn weights, split data, fit distributions. KNN does none of that. It memorizes the entire training set, then at prediction time it finds the K closest training points to the new input and lets them vote. The majority class wins. That’s it.

This “lazy learning” approach means there is zero training cost. The price you pay is at prediction time, where KNN has to compare the new point against every stored example. Whether that tradeoff is worth it depends on your problem.

The idea: ask your neighbors

Suppose you want to guess whether a house is expensive or affordable. Here are 10 houses you already know about.

| House | Sq ft | Bedrooms | Age | Pool | Neighborhood | Price class |
|-------|-------|----------|-----|------|--------------|-------------|
| 1 | 800 | 1 | 40 | No | Rural | Affordable |
| 2 | 1200 | 2 | 20 | No | Suburbs | Affordable |
| 3 | 1500 | 3 | 10 | No | Suburbs | Affordable |
| 4 | 2000 | 3 | 5 | Yes | Suburbs | Expensive |
| 5 | 2400 | 4 | 3 | Yes | Downtown | Expensive |
| 6 | 1800 | 3 | 8 | No | Downtown | Expensive |
| 7 | 900 | 1 | 35 | No | Rural | Affordable |
| 8 | 2100 | 4 | 2 | Yes | Suburbs | Expensive |
| 9 | 1100 | 2 | 25 | No | Rural | Affordable |
| 10 | 1600 | 3 | 12 | No | Suburbs | Affordable |

A new house comes along: 1900 sq ft, 3 bedrooms, 6 years old, with a pool. Which known houses does it look most like? Houses 4, 6, and 8 are similar in size, age, and features. Two of those three are Expensive. KNN predicts Expensive.

That is the entire algorithm. Find the K closest examples. Let them vote.

KNN in three steps

graph LR
  A["New data point"] --> B["Compute distance to every training point"]
  B --> C["Sort by distance"]
  C --> D["Pick K nearest neighbors"]
  D --> E["Majority vote"]
  E --> F["Predicted class"]

Distance measurement matters. If one feature is in square feet (hundreds or thousands) and another is number of bedrooms (1 to 5), the square footage will dominate unless you scale features first.

Now let’s define the algorithm precisely and explore the math behind distance metrics.

The Algorithm

KNN has three steps:

  1. Store all training data as-is. No transformations, no parameters.
  2. Compute distances from the new point to every training point.
  3. Vote. The K nearest neighbors each cast a vote for their class. The class with the most votes is the prediction.

If K = 1, you just predict whatever class the single closest point belongs to. If K = 5, you find the five closest points and go with the majority.

There is no “fit” step. The entire algorithm lives in the prediction phase.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest_indices = np.argsort(distances)[:k]
    nearest_labels = y_train[nearest_indices]
    # Majority vote among the k nearest labels
    vote = Counter(nearest_labels).most_common(1)[0][0]
    return vote

That’s a working KNN classifier in about six lines. The simplicity is genuine.

KNN prediction flowchart

graph TD
  A["New point q"] --> B["Compute distance to all training points"]
  B --> C["Sort distances smallest to largest"]
  C --> D["Select K nearest neighbors"]
  D --> E["Count class labels among K neighbors"]
  E --> F["Return most common class"]

Distance Metrics

The entire algorithm hinges on “closest,” so the distance metric matters a lot. The three you will see most often:

Euclidean distance (L2): the straight-line distance between two points. For points a and b in d dimensions:

$$d_{\text{euclid}}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{d}(a_i - b_i)^2}$$

This is the default choice for most problems. It works well when features have similar scales and you care about the actual geometric distance.

Manhattan distance (L1): sum of absolute differences along each axis.

$$d_{\text{manhattan}}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{d} |a_i - b_i|$$

Manhattan distance is better when your features are on different scales or when you are working on a grid (like city blocks, hence the name). It is also more robust to outliers than Euclidean distance because it does not square the differences.

Minkowski distance (Lp): the general form that includes both as special cases.

$$d_{\text{mink}}(\mathbf{a}, \mathbf{b}) = \left(\sum_{i=1}^{d} |a_i - b_i|^p \right)^{1/p}$$

Set p = 1 and you get Manhattan. Set p = 2 and you get Euclidean. Larger values of p put more emphasis on the largest single-dimension difference. In practice, p = 1 or p = 2 covers the vast majority of use cases.
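To make the special cases concrete, here is a small sketch of the Minkowski distance (the function name `minkowski` is my own) showing how p = 1 and p = 2 recover Manhattan and Euclidean:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: (sum of |a_i - b_i|^p) raised to 1/p
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

print(minkowski(a, b, 1))   # p=1, Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(a, b, 2))   # p=2, Euclidean: sqrt(9 + 16) = 5.0
print(minkowski(a, b, 10))  # large p: approaches the largest single-axis gap (4)
```

As p grows, the metric converges toward the Chebyshev (L-infinity) distance, which keeps only the largest coordinate difference.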

If you want a refresher on norms and how these distances relate to dot products, that foundation matters here because distance metrics are just norms applied to difference vectors.

Worked Example 1: Classifying a Point with K=1, K=3, K=5

Suppose we have 8 training points in 2D with two classes, A and B:

| Point | x1 | x2 | Class |
|-------|----|----|-------|
| P1 | 1 | 3 | A |
| P2 | 2 | 4 | A |
| P3 | 3 | 2 | A |
| P4 | 5 | 1 | B |
| P5 | 6 | 3 | B |
| P6 | 4 | 5 | B |
| P7 | 2 | 1 | A |
| P8 | 5 | 4 | B |

We want to classify the test point q = (3.5, 3).

Step 1: Compute Euclidean distances from q\mathbf{q} to every training point.

$$d(\mathbf{q}, P1) = \sqrt{(3.5 - 1)^2 + (3 - 3)^2} = \sqrt{6.25 + 0} = 2.50$$

$$d(\mathbf{q}, P2) = \sqrt{(3.5 - 2)^2 + (3 - 4)^2} = \sqrt{2.25 + 1} \approx 1.80$$

$$d(\mathbf{q}, P3) = \sqrt{(3.5 - 3)^2 + (3 - 2)^2} = \sqrt{0.25 + 1} \approx 1.12$$

$$d(\mathbf{q}, P4) = \sqrt{(3.5 - 5)^2 + (3 - 1)^2} = \sqrt{2.25 + 4} = 2.50$$

$$d(\mathbf{q}, P5) = \sqrt{(3.5 - 6)^2 + (3 - 3)^2} = \sqrt{6.25 + 0} = 2.50$$

$$d(\mathbf{q}, P6) = \sqrt{(3.5 - 4)^2 + (3 - 5)^2} = \sqrt{0.25 + 4} \approx 2.06$$

$$d(\mathbf{q}, P7) = \sqrt{(3.5 - 2)^2 + (3 - 1)^2} = \sqrt{2.25 + 4} = 2.50$$

$$d(\mathbf{q}, P8) = \sqrt{(3.5 - 5)^2 + (3 - 4)^2} = \sqrt{2.25 + 1} \approx 1.80$$

Step 2: Sort by distance.

| Rank | Point | Distance | Class |
|------|-------|----------|-------|
| 1 | P3 | 1.12 | A |
| 2 | P2 | 1.80 | A |
| 3 | P8 | 1.80 | B |
| 4 | P6 | 2.06 | B |
| 5 | P1 | 2.50 | A |
| 6 | P4 | 2.50 | B |
| 7 | P5 | 2.50 | B |
| 8 | P7 | 2.50 | A |

Step 3: Classify with different values of K.

  • K=1: Nearest neighbor is P3 (class A). Prediction: A.
  • K=3: Neighbors are P3 (A), P2 (A), P8 (B). Vote: 2A vs 1B. Prediction: A.
  • K=5: Neighbors are P3 (A), P2 (A), P8 (B), P6 (B), P1 (A). Vote: 3A vs 2B. Prediction: A.

In this case, all three values of K agree. That will not always happen. If the test point were closer to the cluster of B points, you could easily see K=1 say A (one nearby A point) while K=5 says B (more B points in the larger neighborhood).
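You can check the worked example numerically. This sketch recomputes the distance table and the vote for each K (a stable sort keeps the P2/P8 tie in table order):

```python
import numpy as np
from collections import Counter

# Training points P1..P8 and their classes, from the table above
X = np.array([[1, 3], [2, 4], [3, 2], [5, 1],
              [6, 3], [4, 5], [2, 1], [5, 4]], dtype=float)
y = np.array(["A", "A", "A", "B", "B", "B", "A", "B"])
q = np.array([3.5, 3.0])

dists = np.sqrt(np.sum((X - q) ** 2, axis=1))
order = np.argsort(dists, kind="stable")  # stable: ties keep table order

for k in (1, 3, 5):
    winner = Counter(y[order[:k]]).most_common(1)[0][0]
    print(f"K={k}: {winner}")  # A in all three cases
```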

KNN classification: query point with K=3 nearest neighbors

Choosing K

The value of K controls the bias-variance tradeoff directly.

Small K (e.g., K=1 or K=3):

  • Low bias, high variance.
  • The decision boundary is jagged and reacts to every single point.
  • Sensitive to noise. One mislabeled point in the training set can flip predictions in its neighborhood.

Large K (e.g., K=20 or K=50):

  • High bias, low variance.
  • The decision boundary is smooth.
  • You risk washing out local patterns. If you set K equal to the entire training set, you always predict the majority class.

The standard approach: use cross-validation. Try several values of K (typically odd numbers to avoid ties in binary classification), measure accuracy on a validation fold, and pick the K that performs best. A common starting range is K ∈ {1, 3, 5, 7, 9, 11, 15, 21}.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

best_k, best_score = 1, 0
for k in [1, 3, 5, 7, 9, 11, 15, 21]:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    mean_score = scores.mean()
    if mean_score > best_score:
        best_k, best_score = k, mean_score

print(f"Best K: {best_k}, CV Accuracy: {best_score:.3f}")

A rule of thumb you will sometimes see: K ≈ √n, where n is the number of training samples. This is a reasonable starting point, nothing more.

Effect of K on the decision boundary

graph TD
  A["K = 1"] --> B["Jagged boundary, follows every point"]
  A2["K = 5"] --> B2["Smoother boundary, less noise-sensitive"]
  A3["K = 20"] --> B3["Very smooth, may miss local patterns"]
  B --> C["Overfitting risk: high"]
  B2 --> C2["Good balance"]
  B3 --> C3["Underfitting risk: high"]

Feature Scaling: Why It Is Critical

KNN computes distances. If one feature ranges from 0 to 1000 and another from 0 to 1, the first feature will completely dominate the distance calculation. The second feature becomes almost irrelevant.

Worked Example 2: Feature Scaling Matters

Consider three training points for a house price classifier (expensive vs affordable):

| Point | Square Feet (x1) | Bedrooms (x2) | Class |
|-------|------------------|---------------|-------|
| H1 | 800 | 2 | Affordable |
| H2 | 1500 | 3 | Affordable |
| H3 | 2000 | 4 | Expensive |

New house to classify: q = (1400, 4).

Without scaling:

$$d(\mathbf{q}, H1) = \sqrt{(1400 - 800)^2 + (4 - 2)^2} = \sqrt{360000 + 4} \approx 600.00$$

$$d(\mathbf{q}, H2) = \sqrt{(1400 - 1500)^2 + (4 - 3)^2} = \sqrt{10000 + 1} \approx 100.00$$

$$d(\mathbf{q}, H3) = \sqrt{(1400 - 2000)^2 + (4 - 4)^2} = \sqrt{360000 + 0} = 600.00$$

With K=1, the nearest neighbor is H2 (Affordable) at distance 100.00. Notice how the bedroom feature contributes almost nothing. The difference of (4 - 3)^2 = 1 is invisible next to (1400 - 1500)^2 = 10000.

With min-max normalization:

We normalize each feature to [0, 1] using x' = (x - x_min) / (x_max - x_min).

For square feet: x_min = 800, x_max = 2000, range = 1200.

For bedrooms: x_min = 2, x_max = 4, range = 2.

Normalized training points:

  • H1: ((800 - 800)/1200, (2 - 2)/2) = (0, 0)
  • H2: ((1500 - 800)/1200, (3 - 2)/2) = (0.583, 0.5)
  • H3: ((2000 - 800)/1200, (4 - 2)/2) = (1.0, 1.0)

Normalized query: ((1400 - 800)/1200, (4 - 2)/2) = (0.5, 1.0)

Now compute distances:

$$d(\mathbf{q}, H1) = \sqrt{(0.5 - 0)^2 + (1.0 - 0)^2} = \sqrt{0.25 + 1.0} \approx 1.12$$

$$d(\mathbf{q}, H2) = \sqrt{(0.5 - 0.583)^2 + (1.0 - 0.5)^2} = \sqrt{0.0069 + 0.25} \approx 0.507$$

$$d(\mathbf{q}, H3) = \sqrt{(0.5 - 1.0)^2 + (1.0 - 1.0)^2} = \sqrt{0.25 + 0} = 0.500$$

With K=1, the nearest neighbor is now H3 (Expensive) at distance 0.500. The classification flipped because the bedroom feature (4 bedrooms, matching H3 exactly) now gets a fair say.

This is a real problem, not a textbook curiosity. Always scale your features before running KNN. Standard approaches are min-max normalization (scale to [0, 1]) or z-score standardization (subtract mean, divide by standard deviation).
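The flip in Worked Example 2 is easy to reproduce with scikit-learn. A minimal sketch, assuming the three training houses above; MinMaxScaler applies the same min-max normalization as the hand calculation:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X = np.array([[800, 2], [1500, 3], [2000, 4]], dtype=float)
y = np.array(["Affordable", "Affordable", "Expensive"])
q = np.array([[1400, 4]], dtype=float)

# Unscaled: square footage dominates, H2 is nearest
raw = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(raw.predict(q))  # ['Affordable']

# Min-max scaled: bedrooms get a fair say, H3 is nearest
scaler = MinMaxScaler().fit(X)
scaled = KNeighborsClassifier(n_neighbors=1).fit(scaler.transform(X), y)
print(scaled.predict(scaler.transform(q)))  # ['Expensive']
```

Note that the scaler is fit on the training data only and then applied to the query, exactly as in the hand calculation.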

Weighted KNN

Standard KNN gives every neighbor an equal vote. But should a neighbor at distance 0.1 count the same as one at distance 5.0? Probably not.

Weighted KNN assigns each neighbor a vote proportional to the inverse of its distance:

$$w_i = \frac{1}{d(\mathbf{q}, \mathbf{x}_i)}$$

If a neighbor is very close, it gets a large weight. If it is far away (but still within the top K), its influence is small. The predicted class is the one with the highest total weight.

This helps in cases where K is large and you want nearby points to dominate. It also smooths out the sensitivity to the exact value of K, because distant neighbors contribute less regardless.

from sklearn.neighbors import KNeighborsClassifier

# Uniform voting (default)
clf_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# Distance-weighted voting
clf_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')

One edge case to watch: if a test point is exactly on top of a training point (distance = 0), the weight goes to infinity. Implementations handle this by just assigning the label of the overlapping point directly.
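As a sketch of how an implementation might handle both the weighting and the zero-distance edge case (the function name weighted_knn_predict is my own, not a library API):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=3):
    # Distance-weighted vote: each neighbor contributes 1/distance to its class
    d = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    totals = {}
    for i in np.argsort(d)[:k]:
        if d[i] == 0:
            return y_train[i]  # query sits on a training point: use its label
        totals[y_train[i]] = totals.get(y_train[i], 0.0) + 1.0 / d[i]
    return max(totals, key=totals.get)

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5]], dtype=float)
y = ["A", "A", "A", "B"]

print(weighted_knn_predict(X, y, np.array([0.1, 0.1])))  # A
print(weighted_knn_predict(X, y, np.array([5.0, 5.0])))  # B (exact match)
```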

The Curse of Dimensionality

KNN works beautifully in low dimensions (2D, 3D, maybe 10D). It falls apart in high dimensions. This is one of the most important concepts in machine learning.

Here is the core problem: as the number of dimensions grows, distances between points become nearly identical. In a 1000-dimensional space, the “nearest” neighbor might not be meaningfully closer than the “farthest” one. The ratio of the nearest distance to the farthest distance approaches 1.

Think about it with a simple example. In 1D, two random points in [0, 1] are on average 0.33 apart. In 2D, two random points in the unit square are on average about 0.52 apart. In 100D, the average distance between random points in a unit hypercube is about 4.08, and the variance relative to the mean is tiny. Everything is roughly the same distance from everything else.
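This concentration effect is easy to observe empirically. The sketch below draws 1,000 random points in the unit hypercube and measures the nearest-to-farthest distance ratio from a random query point; the ratio climbs toward 1 as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}

for dim in (2, 10, 100, 1000):
    X = rng.random((1000, dim))       # 1000 uniform points in [0, 1]^dim
    q = rng.random(dim)               # random query point
    d = np.linalg.norm(X - q, axis=1)
    ratios[dim] = d.min() / d.max()   # approaches 1 as dim grows
    print(dim, round(ratios[dim], 3))
```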

The curse of dimensionality: distances lose meaning

graph TD
  A["Low dimensions: 2-3 features"] --> B["Nearest neighbor is clearly closer"]
  A2["High dimensions: 100+ features"] --> B2["All points are roughly the same distance apart"]
  B --> C["KNN works well"]
  B2 --> D["KNN struggles"]
  D --> E["Fix: reduce dimensions with PCA first"]

When distances stop being meaningful, KNN cannot distinguish neighbors from non-neighbors. The algorithm breaks.

Practical consequences:

  • If you have hundreds of features, KNN will likely perform poorly.
  • Feature selection or dimensionality reduction (like PCA) becomes essential before applying KNN in high-dimensional settings.
  • Regularization ideas don’t directly apply to KNN since there are no parameters to regularize, but the analogous remedy is reducing dimensionality.

A useful heuristic: KNN tends to work well when the number of features is low (under 20-30) and you have enough training data to densely populate the feature space. Once dimensions climb much higher, consider a different algorithm.

Computational Cost

KNN’s simplicity comes with a runtime cost. At prediction time, you compute the distance from the new point to all n training points, each with d features. That is O(nd) per prediction.

For small datasets (thousands of points), this is fine. For millions of points, it is slow. If you need to classify 10,000 test points against 1,000,000 training points with 50 features, that is 10,000 × 1,000,000 × 50 = 5 × 10^11 operations. Not great.

KD-Trees partition the feature space into axis-aligned regions, allowing the algorithm to prune large portions of the training data without computing distances to every point. In low dimensions, this reduces average query time to O(d log n). In high dimensions, KD-trees degrade back to brute force because the tree can no longer effectively prune.

Ball Trees use hyperspheres instead of axis-aligned boxes. They handle moderate dimensions better than KD-trees but still struggle in very high dimensions.

from sklearn.neighbors import KNeighborsClassifier

# Let sklearn choose the best algorithm automatically
clf = KNeighborsClassifier(n_neighbors=5, algorithm='auto')

# Or specify explicitly
clf_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
clf_ball = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
clf_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')

The 'auto' setting in scikit-learn picks between these based on the data shape. For most use cases, just leave it on auto.

KNN for Regression

Everything so far has been about classification (majority vote). KNN works for regression too. Instead of voting, you average the target values of the K nearest neighbors.

Given a test point q with neighbors x_1, ..., x_K having target values y_1, ..., y_K:

$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i$$

With distance weighting:

$$\hat{y} = \frac{\sum_{i=1}^{K} w_i \cdot y_i}{\sum_{i=1}^{K} w_i}, \quad w_i = \frac{1}{d(\mathbf{q}, \mathbf{x}_i)}$$

The same considerations apply: feature scaling, choice of K, curse of dimensionality. Regression KNN is useful as a baseline or when you want a non-parametric model that makes no assumptions about the functional form.
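A minimal numpy sketch of the distance-weighted average (the function name knn_regress is my own; the exact-match guard mirrors the classifier edge case):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # Average the k nearest targets, weighted by inverse distance
    d = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    idx = np.argsort(d)[:k]
    if np.any(d[idx] == 0):                      # exact match: return its target
        return float(y_train[idx][d[idx] == 0][0])
    w = 1.0 / d[idx]
    return float(np.sum(w * y_train[idx]) / np.sum(w))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 10.0, 20.0, 30.0])

# Neighbors of 1.2 are 1.0 (d=0.2) and 2.0 (d=0.8); the closer one dominates
print(round(knn_regress(X, y, np.array([1.2]), k=2), 6))  # 12.0
```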

from sklearn.neighbors import KNeighborsRegressor

reg = KNeighborsRegressor(n_neighbors=5, weights='distance')
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)

Strengths and Weaknesses

When KNN works well:

  • Low-dimensional data with clear, local decision boundaries.
  • Small to medium datasets where prediction speed is acceptable.
  • Problems where the decision boundary is highly non-linear but locally smooth.
  • As a baseline. KNN is fast to implement and gives you a sanity check for more complex models.

When KNN struggles:

  • High-dimensional feature spaces (curse of dimensionality).
  • Large datasets where brute-force distance computation is too slow.
  • Features on very different scales (if you forget to normalize).
  • Imbalanced classes, where the majority class can dominate votes. Distance weighting or resampling the training set can mitigate this.

What Comes Next

KNN predicts by looking at its neighbors. It has no notion of splitting the feature space into interpretable regions. Decision Trees take the opposite approach: they learn a series of if-then rules that recursively partition the space, producing models you can read and reason about. That is where we go next.
