Naive Bayes classifier
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Naive Bayes is one of the simplest classifiers you can build, and yet it holds its own against far more complex models in a surprising number of real-world tasks. It is fast to train, fast to predict, and needs very little data to produce reasonable results. The catch? It makes an assumption so strong that it’s almost always wrong. And somehow that barely matters.
Spam or not spam: the intuition
Consider 10 emails. Each one has a few observable features, plus a label.
| # | Contains “free” | Contains “meeting” | Contains “click” | Short? | Label |
|---|---|---|---|---|---|
| 1 | Yes | No | Yes | Yes | Spam |
| 2 | No | Yes | No | No | Not spam |
| 3 | Yes | No | Yes | Yes | Spam |
| 4 | No | Yes | No | No | Not spam |
| 5 | Yes | No | No | Yes | Spam |
| 6 | No | Yes | No | Yes | Not spam |
| 7 | Yes | Yes | Yes | No | Spam |
| 8 | No | No | No | No | Not spam |
| 9 | Yes | No | Yes | Yes | Spam |
| 10 | No | Yes | No | No | Not spam |
Look at the pattern. Emails with “free” and “click” tend to be spam. Emails with “meeting” tend to be legitimate. Naive Bayes counts how often each feature appears in spam versus not-spam, then uses those frequencies to classify new emails. It treats each feature independently, which is a simplification, but it works surprisingly well.
A new email arrives with “free” and “click” but no “meeting.” Naive Bayes checks: how often does “free” appear in spam? Very often. How often does “click” appear in spam? Very often. How often does “meeting” appear in spam? Rarely. Each feature casts its vote, and the votes combine into a final prediction.
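To make the counting concrete, here is a quick sketch with the toy table above hard-coded, tallying how often each feature appears within each class:

```python
# Toy table from above: (free, meeting, click, short, label)
emails = [
    (1, 0, 1, 1, "spam"), (0, 1, 0, 0, "ham"),
    (1, 0, 1, 1, "spam"), (0, 1, 0, 0, "ham"),
    (1, 0, 0, 1, "spam"), (0, 1, 0, 1, "ham"),
    (1, 1, 1, 0, "spam"), (0, 0, 0, 0, "ham"),
    (1, 0, 1, 1, "spam"), (0, 1, 0, 0, "ham"),
]
features = ["free", "meeting", "click", "short"]

def feature_freqs(label):
    """Fraction of emails in a class that contain each feature."""
    rows = [e for e in emails if e[4] == label]
    return {f: sum(r[i] for r in rows) / len(rows) for i, f in enumerate(features)}

spam_freqs = feature_freqs("spam")
ham_freqs = feature_freqs("ham")
print(spam_freqs)  # "free" appears in every spam email
print(ham_freqs)   # "meeting" appears in 4 of 5 ham emails
```

These per-class frequencies are exactly the raw material Naive Bayes works with; the rest of the article is about combining them correctly.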
How Naive Bayes classifies an email
```mermaid
graph TD
  A["New email arrives"] --> B["Check each feature independently"]
  B --> C["Contains free? High spam signal"]
  B --> D["Contains meeting? Low spam signal"]
  B --> E["Contains click? High spam signal"]
  C --> F["Combine all signals with Bayes rule"]
  D --> F
  E --> F
  F --> G["Predict: Spam or Not Spam"]
```
Feature likelihoods: Spam vs Ham
Now let’s formalize this counting-and-combining approach using probability theory.
Generative vs discriminative models
Before we get into the mechanics, it helps to know where Naive Bayes sits in the broader landscape of classifiers.
Discriminative models learn the boundary between classes directly. They model $P(y \mid x)$, the probability of a class $y$ given features $x$. Logistic regression is a classic example. It doesn’t care how the data was generated. It just finds the line (or surface) that best separates classes.
Generative models take a different path. They model how the data was generated for each class. That means they learn $P(x \mid y)$ and $P(y)$ separately, then combine them using Bayes’ theorem to get $P(y \mid x)$. Naive Bayes is a generative classifier.
Why does this matter? Generative models give you more. You can use them to generate synthetic data, detect outliers, or handle missing features more gracefully. The tradeoff is that they make stronger assumptions about the data distribution, and those assumptions can hurt if they’re wrong.
Bayes’ theorem in a classification context
Here’s the setup. You have a data point with features $x = (x_1, x_2, \ldots, x_n)$, and you want to figure out which class $y$ it belongs to. Bayes’ theorem gives us:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
- $P(y)$ is the prior: how likely is class $y$ before we see any features?
- $P(x \mid y)$ is the likelihood: given class $y$, how likely are we to see these features?
- $P(x)$ is the evidence: how likely are these features overall?
- $P(y \mid x)$ is the posterior: the probability of class $y$ after seeing the features.
For classification, we don’t actually need $P(x)$. It’s the same for every class, so it acts as a normalizing constant. We just need to compare the numerators:

$$P(y \mid x) \propto P(x \mid y)\, P(y)$$
That’s the core idea. Pick the class that maximizes the product of prior and likelihood.
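A tiny numeric sketch (with made-up priors and likelihoods) shows why dropping the evidence is safe: normalizing rescales every class score by the same factor, so the winner never changes.

```python
# Hypothetical numbers for two classes
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {"spam": 0.048, "ham": 0.0014}  # P(features | class), made up

# Unnormalized posteriors: prior × likelihood
scores = {c: priors[c] * likelihoods[c] for c in priors}

# Dividing by the evidence P(x) rescales both scores equally...
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# ...so the argmax is the same either way
best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_normalized)  # spam spam
```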
Bayes rule applied to classification
```mermaid
graph LR
  A["Prior P(class)"] --> D["Multiply"]
  B["Likelihood P(features | class)"] --> D
  D --> E["Unnormalized posterior"]
  E --> F["Pick class with highest score"]
```
The naive assumption
Here’s where the “naive” part comes in. Computing $P(x_1, x_2, \ldots, x_n \mid y)$ is hard. With $n$ features, you’d need to estimate a joint distribution that grows exponentially with $n$. For any reasonable dataset, you simply don’t have enough data.
Naive Bayes cuts through this by assuming that features are conditionally independent given the class:

$$P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$
Each feature contributes its own evidence independently. In reality, features are almost never independent. If an email contains the word “free,” it probably also contains “click” or “offer.” The naive assumption says these correlations don’t exist. That’s clearly wrong.
But here’s the thing. For classification, you only need the model to get the ranking of classes right, not the exact probabilities. Even when the independence assumption is violated, the relative ordering of posteriors often stays correct. This is why Naive Bayes works much better than you’d expect from such a crude assumption. The bias-variance perspective helps explain it: the strong assumption introduces bias, but it dramatically reduces variance. With small datasets, that’s a good trade.
The naive independence assumption: each feature contributes separately
```mermaid
graph TD
  C["Class label"] --> F1["Feature 1"]
  C --> F2["Feature 2"]
  C --> F3["Feature 3"]
  C --> F4["Feature n"]
  F1 -.- N1["No arrows between features"]
  F2 -.- N1
```
MAP estimation
When we pick the class that maximizes $P(y \mid x)$, we’re doing Maximum A Posteriori (MAP) estimation:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
In practice, multiplying many small probabilities leads to numerical underflow. We take the log to turn products into sums:

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right]$$
This is equivalent because log is a monotonic function. The class that wins before taking logs still wins after.
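To see why the log matters, multiply a thousand small likelihoods directly and the product underflows to zero in float64, while the sum of logs stays finite:

```python
import math

p = 0.01   # a typical small per-feature likelihood
n = 1000   # number of features

# Direct product underflows: 0.01**1000 is far below float64's minimum
product = 1.0
for _ in range(n):
    product *= p
print(product)    # 0.0

# Sum of logs is perfectly representable
log_score = sum(math.log(p) for _ in range(n))
print(log_score)  # ≈ -4605.17
```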
Types of Naive Bayes
The general framework is always the same. What changes is how you model $P(x_i \mid y)$, the likelihood of each feature given the class.
Multinomial Naive Bayes
Best for discrete count data, especially text classification. Each feature represents how many times a word appears in a document.
$$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha V}$$

where $N_{yi}$ is the count of feature $i$ in class $y$, $N_y$ is the total count of all features in class $y$, $V$ is the vocabulary size, and $\alpha$ is a smoothing parameter.
Bernoulli Naive Bayes
Similar to Multinomial, but features are binary: a word either appears in a document or it doesn’t. No counts. This model explicitly penalizes the absence of a feature, which Multinomial does not.
Use it when your features are naturally binary, or when document length varies a lot and you care more about presence than frequency.
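The standard Bernoulli likelihood is $P(x_i \mid y) = p_{iy}^{x_i}(1 - p_{iy})^{1 - x_i}$, where $p_{iy}$ is the probability that feature $i$ is present in class $y$. A short sketch (with made-up presence probabilities) shows the absence penalty that Multinomial NB lacks:

```python
def bernoulli_likelihood(x, p):
    """Likelihood of a binary feature vector x given per-feature presence probs p."""
    result = 1.0
    for xi, pi in zip(x, p):
        # Present features contribute p_i; absent features contribute (1 - p_i)
        result *= pi if xi == 1 else (1 - pi)
    return result

# Hypothetical presence probabilities for three words in the spam class
p_spam = [0.8, 0.7, 0.1]

# An email containing the first two words but not the third
print(bernoulli_likelihood([1, 1, 0], p_spam))  # 0.8 * 0.7 * 0.9 = 0.504

# Dropping a word that is common in spam lowers the likelihood explicitly
print(bernoulli_likelihood([1, 0, 0], p_spam))  # 0.8 * 0.3 * 0.9 = 0.216
```

Note how the absent third word still contributes a factor of $1 - 0.1 = 0.9$; in the Multinomial model a zero count simply contributes nothing.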
Gaussian Naive Bayes
For continuous features. Assumes each feature follows a normal distribution within each class:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left( -\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right)$$
You estimate $\mu_{y,i}$ and $\sigma_{y,i}^2$ from the training data for each feature-class pair. Simple, but it breaks down if your features aren’t roughly bell-shaped.
Worked example 1: spam detection with Multinomial Naive Bayes
Let’s classify emails as spam or ham (not spam) using a tiny vocabulary: free, money, the, meeting.
Training data
| Doc | Class | free | money | the | meeting |
|---|---|---|---|---|---|
| D1 | spam | 2 | 1 | 0 | 0 |
| D2 | spam | 1 | 1 | 1 | 0 |
| D3 | ham | 0 | 0 | 1 | 1 |
| D4 | ham | 0 | 0 | 2 | 1 |
Step 1: Compute priors.
We have 2 spam documents and 2 ham documents out of 4 total:

$$P(\text{spam}) = \frac{2}{4} = 0.5, \qquad P(\text{ham}) = \frac{2}{4} = 0.5$$
Step 2: Compute likelihoods with Laplace smoothing ($\alpha = 1$).
For spam, the total word count is $3 + 2 + 1 + 0 = 6$. Vocabulary size $V = 4$.

$$P(\text{free} \mid \text{spam}) = \frac{3 + 1}{6 + 4} = 0.4, \quad P(\text{money} \mid \text{spam}) = \frac{2 + 1}{10} = 0.3, \quad P(\text{the} \mid \text{spam}) = \frac{1 + 1}{10} = 0.2, \quad P(\text{meeting} \mid \text{spam}) = \frac{0 + 1}{10} = 0.1$$
For ham, the total word count is $0 + 0 + 3 + 2 = 5$. Same $V = 4$.

$$P(\text{free} \mid \text{ham}) = \frac{0 + 1}{5 + 4} = \frac{1}{9}, \quad P(\text{money} \mid \text{ham}) = \frac{1}{9}, \quad P(\text{the} \mid \text{ham}) = \frac{3 + 1}{9} = \frac{4}{9}, \quad P(\text{meeting} \mid \text{ham}) = \frac{2 + 1}{9} = \frac{3}{9}$$
Step 3: Classify a new email.
New document: “free money free” (word counts: free=2, money=1, the=0, meeting=0).
For Multinomial Naive Bayes, we only multiply likelihoods for words that appear, raised to their count:

$$\text{score}(\text{spam}) = P(\text{spam}) \cdot P(\text{free} \mid \text{spam})^2 \cdot P(\text{money} \mid \text{spam}) = 0.5 \times 0.4^2 \times 0.3 = 0.024$$

$$\text{score}(\text{ham}) = 0.5 \times \left(\frac{1}{9}\right)^2 \times \frac{1}{9} \approx 0.000686$$
Step 4: Decision.
$0.024 > 0.000686$, so we classify this email as spam. That makes sense: “free money free” looks like spam.
If you want actual probabilities, normalize:

$$P(\text{spam} \mid \text{doc}) = \frac{0.024}{0.024 + 0.000686} \approx 0.972$$
The model is 97.2% confident this is spam.
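You can check these numbers in a few lines, with the counts hard-coded from the training table above:

```python
# Word counts per class from the training table: [free, money, the, meeting]
spam_counts = [3, 2, 1, 0]  # totals over D1, D2
ham_counts = [0, 0, 3, 2]   # totals over D3, D4
alpha, V = 1, 4

def likelihoods(counts):
    """Smoothed likelihoods: (count + alpha) / (total + alpha * V)."""
    total = sum(counts)
    return [(c + alpha) / (total + alpha * V) for c in counts]

lik_spam = likelihoods(spam_counts)  # [0.4, 0.3, 0.2, 0.1]
lik_ham = likelihoods(ham_counts)    # [1/9, 1/9, 4/9, 3/9]

# Score "free money free": free^2 * money, times the prior 0.5
score_spam = 0.5 * lik_spam[0] ** 2 * lik_spam[1]
score_ham = 0.5 * lik_ham[0] ** 2 * lik_ham[1]

p_spam = score_spam / (score_spam + score_ham)
print(round(p_spam, 3))  # 0.972
```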
Worked example 2: Gaussian Naive Bayes on numeric features
Suppose we have two features, $x_1$ (height in cm) and $x_2$ (weight in kg), and two classes: athlete and non-athlete.
Training data
| Class | $x_1$ (height) | $x_2$ (weight) |
|---|---|---|
| athlete | 180 | 75 |
| athlete | 175 | 70 |
| athlete | 185 | 80 |
| non-athlete | 160 | 85 |
| non-athlete | 165 | 90 |
| non-athlete | 155 | 80 |
Step 1: Compute priors.

$$P(\text{athlete}) = \frac{3}{6} = 0.5, \qquad P(\text{non-athlete}) = \frac{3}{6} = 0.5$$
Step 2: Compute class-conditional means and variances.
For athletes: $\mu_1 = \frac{180 + 175 + 185}{3} = 180$, $\sigma_1^2 = \frac{0^2 + 5^2 + 5^2}{3} = \frac{50}{3} \approx 16.67$; likewise $\mu_2 = 75$, $\sigma_2^2 = \frac{50}{3} \approx 16.67$ (using the population variance, dividing by $n$).
For non-athletes: $\mu_1 = 160$, $\sigma_1^2 = \frac{50}{3} \approx 16.67$; $\mu_2 = 85$, $\sigma_2^2 = \frac{50}{3} \approx 16.67$.
Step 3: Classify a new point $x = (170, 72)$.
Using the Gaussian PDF $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$:
For athlete, feature $x_1 = 170$:

$$f(170 \mid \text{athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(170 - 180)^2}{2 \cdot 16.67}\right) \approx 0.00487$$

For athlete, feature $x_2 = 72$:

$$f(72 \mid \text{athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(72 - 75)^2}{2 \cdot 16.67}\right) \approx 0.0746$$

For non-athlete, feature $x_1 = 170$:

$$f(170 \mid \text{non-athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(170 - 160)^2}{2 \cdot 16.67}\right) \approx 0.00487$$

For non-athlete, feature $x_2 = 72$:

$$f(72 \mid \text{non-athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(72 - 85)^2}{2 \cdot 16.67}\right) \approx 0.000613$$
Step 4: Compute posteriors (unnormalized).

$$\text{score}(\text{athlete}) = 0.5 \times 0.00487 \times 0.0746 \approx 1.82 \times 10^{-4}$$

$$\text{score}(\text{non-athlete}) = 0.5 \times 0.00487 \times 0.000613 \approx 1.49 \times 10^{-6}$$
Step 5: Decision.
Athlete score is about 122 times larger. Classification: athlete.
Normalizing:

$$P(\text{athlete} \mid x) = \frac{1.82 \times 10^{-4}}{1.82 \times 10^{-4} + 1.49 \times 10^{-6}} \approx 0.992, \qquad P(\text{non-athlete} \mid x) \approx 0.008$$
The height of 170 is between the two class means, so it doesn’t help much. But the weight of 72 is much closer to the athlete mean (75) than the non-athlete mean (85), and that’s what drives the decision.
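The same arithmetic in code, using the population variance (dividing by $n$) as above:

```python
import math

athletes = [(180, 75), (175, 70), (185, 80)]
non_athletes = [(160, 85), (165, 90), (155, 80)]

def mean_var(values):
    """Mean and population variance (divide by n)."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_score(point, data, prior=0.5):
    """Prior times the product of per-feature Gaussian likelihoods."""
    score = prior
    for i, x in enumerate(point):
        mu, var = mean_var([row[i] for row in data])
        score *= gaussian_pdf(x, mu, var)
    return score

score_a = class_score((170, 72), athletes)
score_n = class_score((170, 72), non_athletes)
print(score_a / score_n)                        # ≈ 122
print(round(score_a / (score_a + score_n), 3))  # ≈ 0.992
```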
Laplace smoothing
Look back at the spam example. What if a word never appears in the spam training set? Its count is zero, so . And since we multiply likelihoods, one zero kills the entire product. The model would say “this can’t possibly be spam” based on a single unseen word, ignoring all other evidence.
Laplace smoothing (also called add-$\alpha$ smoothing) fixes this by adding a small count $\alpha$ to every feature:

$$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha V}$$
With $\alpha = 1$, we add 1 to every word count. No probability is ever zero. This is a form of regularization: it pushes the model away from extreme probability estimates and toward a more uniform distribution.
Setting $\alpha$ too high flattens the distribution too much, making all words equally likely. Setting it too low doesn’t fix the zero-probability problem enough. In practice, $\alpha = 1$ works well for most text tasks, but you can tune it with cross-validation.
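A quick illustration of the zero-probability problem and of what different $\alpha$ values do to the estimates:

```python
# Word counts for one class; the last word never appeared in training
counts = [3, 2, 1, 0]
V = len(counts)

def smoothed(counts, alpha):
    """Smoothed likelihoods: (count + alpha) / (total + alpha * V)."""
    total = sum(counts)
    return [(c + alpha) / (total + alpha * V) for c in counts]

# Without smoothing, the unseen word has probability exactly zero,
# and any document containing it scores zero for this class
print(smoothed(counts, alpha=0.0))   # last entry is 0.0

# With alpha = 1, every word keeps a small nonzero probability
print(smoothed(counts, alpha=1.0))   # [0.4, 0.3, 0.2, 0.1]

# A very large alpha flattens everything toward uniform
print(smoothed(counts, alpha=1000))  # all entries close to 1/V = 0.25
```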
Python implementation
Here’s a minimal Multinomial Naive Bayes from scratch:
```python
import numpy as np

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        n_features = X.shape[1]
        self.priors = np.zeros(n_classes)
        self.likelihoods = np.zeros((n_classes, n_features))
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.priors[i] = X_c.shape[0] / X.shape[0]
            # Smoothed likelihoods
            counts = X_c.sum(axis=0) + self.alpha
            self.likelihoods[i] = counts / counts.sum()

    def predict(self, X):
        log_priors = np.log(self.priors)
        log_likelihoods = np.log(self.likelihoods)
        # Log-posterior for each class
        log_posteriors = X @ log_likelihoods.T + log_priors
        return self.classes[np.argmax(log_posteriors, axis=1)]
```
That’s it. The entire training step is counting and dividing. Prediction is a matrix multiply plus an argmax. This is why Naive Bayes is so fast.
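A quick smoke test on the data from worked example 1 (the class is repeated here so the snippet runs standalone):

```python
import numpy as np

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.zeros(len(self.classes))
        self.likelihoods = np.zeros((len(self.classes), X.shape[1]))
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.priors[i] = X_c.shape[0] / X.shape[0]
            counts = X_c.sum(axis=0) + self.alpha  # smoothed counts
            self.likelihoods[i] = counts / counts.sum()

    def predict(self, X):
        log_posteriors = X @ np.log(self.likelihoods).T + np.log(self.priors)
        return self.classes[np.argmax(log_posteriors, axis=1)]

# Training data from worked example 1: columns [free, money, the, meeting]
X = np.array([[2, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 2, 1]])
y = np.array(["spam", "spam", "ham", "ham"])

model = MultinomialNB(alpha=1.0)
model.fit(X, y)
print(model.predict(np.array([[2, 1, 0, 0]])))  # "free money free" → spam
print(model.predict(np.array([[0, 0, 1, 2]])))  # meeting-heavy doc → ham
```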
Strengths and weaknesses
When Naive Bayes works well
- Text classification. Spam filtering, sentiment analysis, document categorization. Multinomial NB is a strong baseline here.
- Small datasets. Because of the strong independence assumption, the model has very few parameters. It won’t overfit when data is scarce.
- High-dimensional data. When you have thousands of features (like words in a vocabulary), Naive Bayes handles it without breaking a sweat. No matrix inversions, no gradient descent.
- Fast training and prediction. Training is a single pass through the data. Prediction is a dot product. Hard to beat on speed.
- Multi-class problems. It naturally handles multiple classes without any modification.
When Naive Bayes struggles
- Correlated features. If features are heavily dependent (e.g., pixel values in an image), the independence assumption hurts badly. The model double-counts evidence.
- Probability calibration. The predicted probabilities from Naive Bayes tend to be pushed toward 0 and 1. If you need well-calibrated probabilities (not just rankings), use calibration methods like Platt scaling.
- Complex decision boundaries. The decision boundary is linear (or piecewise linear). If classes overlap in complex ways, you’ll need something more flexible.
- Continuous features with non-Gaussian distributions. Gaussian NB assumes normality. Skewed, multimodal, or heavy-tailed features will mislead the model.
Naive Bayes vs Logistic Regression
```mermaid
graph TD
  subgraph NBG["Naive Bayes"]
    NB1["Generative model"]
    NB2["Learns P(x|y) and P(y)"]
    NB3["Fast: just counting"]
    NB4["Strong independence assumption"]
  end
  subgraph LOG["Logistic Regression"]
    LR1["Discriminative model"]
    LR2["Learns P(y|x) directly"]
    LR3["Needs gradient descent"]
    LR4["No independence assumption"]
  end
```
A practical tip
Naive Bayes is an excellent first model to try on any classification problem. It gives you a performance floor in seconds. If it works well enough, ship it. If it doesn’t, you now have a baseline to beat with more complex models. Either way, you’ve lost almost no time.
Connection to other loss functions
You might wonder how Naive Bayes relates to the loss functions used in other classifiers. When you train Naive Bayes with maximum likelihood estimation, you’re implicitly minimizing the cross-entropy loss between the true class distribution and your predicted distribution. The difference is that you’re doing it analytically through counting, rather than through iterative optimization like gradient descent.
What comes next
Naive Bayes is a strong, fast, and interpretable baseline. But it’s not the only simple classifier worth knowing. Next up, we’ll look at K-Nearest Neighbors, a model that takes the opposite approach: instead of learning a parametric model from training data, it just memorizes everything and makes predictions by looking at the closest examples at test time. No training step at all.