Naive Bayes classifier
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Naive Bayes is one of the simplest classifiers you can build, and yet it holds its own against far more complex models in a surprising number of real-world tasks. It is fast to train, fast to predict, and needs very little data to produce reasonable results. The catch? It makes an assumption so strong that it’s almost always wrong. And somehow that barely matters.
Spam or not spam: the intuition
Consider 10 emails. Each one has a few observable features, plus a label.
| # | Contains “free” | Contains “meeting” | Contains “click” | Short? | Label |
|---|---|---|---|---|---|
| 1 | Yes | No | Yes | Yes | Spam |
| 2 | No | Yes | No | No | Not spam |
| 3 | Yes | No | Yes | Yes | Spam |
| 4 | No | Yes | No | No | Not spam |
| 5 | Yes | No | No | Yes | Spam |
| 6 | No | Yes | No | Yes | Not spam |
| 7 | Yes | Yes | Yes | No | Spam |
| 8 | No | No | No | No | Not spam |
| 9 | Yes | No | Yes | Yes | Spam |
| 10 | No | Yes | No | No | Not spam |
Look at the pattern. Emails with “free” and “click” tend to be spam. Emails with “meeting” tend to be legitimate. Naive Bayes counts how often each feature appears in spam versus not-spam, then uses those frequencies to classify new emails. It treats each feature independently, which is a simplification, but it works surprisingly well.
A new email arrives with “free” and “click” but no “meeting.” Naive Bayes checks: how often does “free” appear in spam? Very often. How often does “click” appear in spam? Very often. How often does “meeting” appear in spam? Rarely. Each feature casts its vote, and the votes combine into a final prediction.
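To make the counting concrete, here is a quick sketch with the toy table above hard-coded, tallying how often each feature appears within each class:

```python
# Toy table from above: (free, meeting, click, short, label)
emails = [
    (1, 0, 1, 1, "spam"), (0, 1, 0, 0, "ham"),
    (1, 0, 1, 1, "spam"), (0, 1, 0, 0, "ham"),
    (1, 0, 0, 1, "spam"), (0, 1, 0, 1, "ham"),
    (1, 1, 1, 0, "spam"), (0, 0, 0, 0, "ham"),
    (1, 0, 1, 1, "spam"), (0, 1, 0, 0, "ham"),
]
features = ["free", "meeting", "click", "short"]

def feature_freqs(label):
    """Fraction of emails in a class that contain each feature."""
    rows = [e for e in emails if e[4] == label]
    return {f: sum(r[i] for r in rows) / len(rows) for i, f in enumerate(features)}

spam_freqs = feature_freqs("spam")
ham_freqs = feature_freqs("ham")
print(spam_freqs)  # "free" appears in every spam email
print(ham_freqs)   # "meeting" appears in 4 of 5 ham emails
```

These per-class frequencies are exactly the raw material Naive Bayes works with; the rest of the article is about combining them correctly.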
How Naive Bayes classifies an email
```mermaid
graph TD
  A["New email arrives"] --> B["Check each feature independently"]
  B --> C["Contains free? High spam signal"]
  B --> D["Contains meeting? Low spam signal"]
  B --> E["Contains click? High spam signal"]
  C --> F["Combine all signals with Bayes rule"]
  D --> F
  E --> F
  F --> G["Predict: Spam or Not Spam"]
```
Feature likelihoods: Spam vs Ham
Now let’s formalize this counting-and-combining approach using probability theory.
Generative vs discriminative models
Before we get into the mechanics, it helps to know where Naive Bayes sits in the broader landscape of classifiers.
Discriminative models learn the boundary between classes directly. They model $P(y \mid x)$, the probability of a class $y$ given features $x$. Logistic regression is a classic example. It doesn’t care how the data was generated. It just finds the line (or surface) that best separates classes.
Generative models take a different path. They model how the data was generated for each class. That means they learn $P(x \mid y)$ and $P(y)$ separately, then combine them using Bayes’ theorem to get $P(y \mid x)$. Naive Bayes is a generative classifier.
Why does this matter? Generative models give you more. You can use them to generate synthetic data, detect outliers, or handle missing features more gracefully. The tradeoff is that they make stronger assumptions about the data distribution, and those assumptions can hurt if they’re wrong.
Bayes’ theorem in a classification context
Here’s the setup. You have a data point with features $x = (x_1, x_2, \ldots, x_n)$, and you want to figure out which class $y$ it belongs to. Bayes’ theorem gives us:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
- $P(y)$ is the prior: how likely is class $y$ before we see any features?
- $P(x \mid y)$ is the likelihood: given class $y$, how likely are we to see these features?
- $P(x)$ is the evidence: how likely are these features overall?
- $P(y \mid x)$ is the posterior: the probability of class $y$ after seeing the features.
For classification, we don’t actually need $P(x)$. It’s the same for every class, so it acts as a normalizing constant. We just need to compare the numerators:

$$P(y \mid x) \propto P(x \mid y)\, P(y)$$
That’s the core idea. Pick the class that maximizes the product of prior and likelihood.
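A tiny numeric sketch (with made-up priors and likelihoods) shows why dropping the evidence is safe: normalizing rescales every class score by the same factor, so the winner never changes.

```python
# Hypothetical numbers for two classes
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {"spam": 0.048, "ham": 0.0014}  # P(features | class), made up

# Unnormalized posteriors: prior × likelihood
scores = {c: priors[c] * likelihoods[c] for c in priors}

# Dividing by the evidence P(x) rescales both scores equally...
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# ...so the argmax is the same either way
best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_normalized)  # spam spam
```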
Bayes rule applied to classification
```mermaid
graph LR
  A["Prior P(class)"] --> D["Multiply"]
  B["Likelihood P(features | class)"] --> D
  D --> E["Unnormalized posterior"]
  E --> F["Pick class with highest score"]
```
The naive assumption
Here’s where the “naive” part comes in. Computing $P(x_1, x_2, \ldots, x_n \mid y)$ is hard. With $n$ features, you’d need to estimate a joint distribution that grows exponentially with $n$. For any reasonable dataset, you simply don’t have enough data.
Naive Bayes cuts through this by assuming that features are conditionally independent given the class:

$$P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$
Each feature contributes its own evidence independently. In reality, features are almost never independent. If an email contains the word “free,” it probably also contains “click” or “offer.” The naive assumption says these correlations don’t exist. That’s clearly wrong.
But here’s the thing. For classification, you only need the model to get the ranking of classes right, not the exact probabilities. Even when the independence assumption is violated, the relative ordering of posteriors often stays correct. This is why Naive Bayes works much better than you’d expect from such a crude assumption. The bias-variance perspective helps explain it: the strong assumption introduces bias, but it dramatically reduces variance. With small datasets, that’s a good trade.
The naive independence assumption: each feature contributes separately
```mermaid
graph TD
  C["Class label"] --> F1["Feature 1"]
  C --> F2["Feature 2"]
  C --> F3["Feature 3"]
  C --> F4["Feature n"]
  F1 -.- N1["No arrows between features"]
  F2 -.- N1
```
MAP estimation
When we pick the class that maximizes $P(y \mid x)$, we’re doing Maximum A Posteriori (MAP) estimation:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
In practice, multiplying many small probabilities leads to numerical underflow. We take the log to turn products into sums:

$$\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right]$$
This is equivalent because log is a monotonic function. The class that wins before taking logs still wins after.
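To see why the log matters, multiply a thousand small likelihoods directly and the product underflows to zero in float64, while the sum of logs stays finite:

```python
import math

p = 0.01   # a typical small per-feature likelihood
n = 1000   # number of features

# Direct product underflows: 0.01**1000 is far below float64's minimum
product = 1.0
for _ in range(n):
    product *= p
print(product)    # 0.0

# Sum of logs is perfectly representable
log_score = sum(math.log(p) for _ in range(n))
print(log_score)  # ≈ -4605.17
```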
Types of Naive Bayes
The general framework is always the same. What changes is how you model $P(x_i \mid y)$, the likelihood of each feature given the class.
Multinomial Naive Bayes
Best for discrete count data, especially text classification. Each feature represents how many times a word appears in a document.
$$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha V}$$

where $N_{yi}$ is the count of feature $i$ in class $y$, $N_y$ is the total count of all features in class $y$, $V$ is the vocabulary size, and $\alpha$ is a smoothing parameter.
Bernoulli Naive Bayes
Similar to Multinomial, but features are binary: a word either appears in a document or it doesn’t. No counts. This model explicitly penalizes the absence of a feature, which Multinomial does not.
Use it when your features are naturally binary, or when document length varies a lot and you care more about presence than frequency.
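The standard Bernoulli likelihood is $P(x_i \mid y) = p_{iy}^{x_i}(1 - p_{iy})^{1 - x_i}$, where $p_{iy}$ is the probability that feature $i$ is present in class $y$. A short sketch (with made-up presence probabilities) shows the absence penalty that Multinomial NB lacks:

```python
def bernoulli_likelihood(x, p):
    """Likelihood of a binary feature vector x given per-feature presence probs p."""
    result = 1.0
    for xi, pi in zip(x, p):
        # Present features contribute p_i; absent features contribute (1 - p_i)
        result *= pi if xi == 1 else (1 - pi)
    return result

# Hypothetical presence probabilities for three words in the spam class
p_spam = [0.8, 0.7, 0.1]

# An email containing the first two words but not the third
print(bernoulli_likelihood([1, 1, 0], p_spam))  # 0.8 * 0.7 * 0.9 = 0.504

# Dropping a word that is common in spam lowers the likelihood explicitly
print(bernoulli_likelihood([1, 0, 0], p_spam))  # 0.8 * 0.3 * 0.9 = 0.216
```

Note how the absent third word still contributes a factor of $1 - 0.1 = 0.9$; in the Multinomial model a zero count simply contributes nothing.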
Gaussian Naive Bayes
For continuous features. Assumes each feature follows a normal distribution within each class:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left( -\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2} \right)$$
You estimate $\mu_{y,i}$ and $\sigma_{y,i}^2$ from the training data for each feature-class pair. Simple, but it breaks down if your features aren’t roughly bell-shaped.
Worked example 1: spam detection with Multinomial Naive Bayes
Let’s classify emails as spam or ham (not spam) using a tiny vocabulary: free, money, the, meeting.
Training data
| Doc | Class | free | money | the | meeting |
|---|---|---|---|---|---|
| D1 | spam | 2 | 1 | 0 | 0 |
| D2 | spam | 1 | 1 | 1 | 0 |
| D3 | ham | 0 | 0 | 1 | 1 |
| D4 | ham | 0 | 0 | 2 | 1 |
Step 1: Compute priors.
We have 2 spam documents and 2 ham documents out of 4 total:

$$P(\text{spam}) = \frac{2}{4} = 0.5, \qquad P(\text{ham}) = \frac{2}{4} = 0.5$$
Step 2: Compute likelihoods with Laplace smoothing ($\alpha = 1$).
For spam, the total word count is $3 + 2 + 1 + 0 = 6$. Vocabulary size $V = 4$.

$$P(\text{free} \mid \text{spam}) = \frac{3 + 1}{6 + 4} = 0.4, \quad P(\text{money} \mid \text{spam}) = \frac{2 + 1}{10} = 0.3, \quad P(\text{the} \mid \text{spam}) = \frac{1 + 1}{10} = 0.2, \quad P(\text{meeting} \mid \text{spam}) = \frac{0 + 1}{10} = 0.1$$
For ham, the total word count is $0 + 0 + 3 + 2 = 5$. Same $V = 4$.

$$P(\text{free} \mid \text{ham}) = \frac{0 + 1}{5 + 4} = \frac{1}{9}, \quad P(\text{money} \mid \text{ham}) = \frac{1}{9}, \quad P(\text{the} \mid \text{ham}) = \frac{3 + 1}{9} = \frac{4}{9}, \quad P(\text{meeting} \mid \text{ham}) = \frac{2 + 1}{9} = \frac{3}{9}$$
Step 3: Classify a new email.
New document: “free money free” (word counts: free=2, money=1, the=0, meeting=0).
For Multinomial Naive Bayes, we only multiply likelihoods for words that appear, raised to their count:

$$\text{score}(\text{spam}) = P(\text{spam}) \cdot P(\text{free} \mid \text{spam})^2 \cdot P(\text{money} \mid \text{spam}) = 0.5 \times 0.4^2 \times 0.3 = 0.024$$

$$\text{score}(\text{ham}) = 0.5 \times \left(\frac{1}{9}\right)^2 \times \frac{1}{9} \approx 0.000686$$
Step 4: Decision.
$0.024 > 0.000686$, so we classify this email as spam. That makes sense: “free money free” looks like spam.
If you want actual probabilities, normalize:

$$P(\text{spam} \mid \text{doc}) = \frac{0.024}{0.024 + 0.000686} \approx 0.972$$
The model is 97.2% confident this is spam.
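You can check these numbers in a few lines, with the counts hard-coded from the training table above:

```python
# Word counts per class from the training table: [free, money, the, meeting]
spam_counts = [3, 2, 1, 0]  # totals over D1, D2
ham_counts = [0, 0, 3, 2]   # totals over D3, D4
alpha, V = 1, 4

def likelihoods(counts):
    """Smoothed likelihoods: (count + alpha) / (total + alpha * V)."""
    total = sum(counts)
    return [(c + alpha) / (total + alpha * V) for c in counts]

lik_spam = likelihoods(spam_counts)  # [0.4, 0.3, 0.2, 0.1]
lik_ham = likelihoods(ham_counts)    # [1/9, 1/9, 4/9, 3/9]

# Score "free money free": free^2 * money, times the prior 0.5
score_spam = 0.5 * lik_spam[0] ** 2 * lik_spam[1]
score_ham = 0.5 * lik_ham[0] ** 2 * lik_ham[1]

p_spam = score_spam / (score_spam + score_ham)
print(round(p_spam, 3))  # 0.972
```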
Worked example 2: Gaussian Naive Bayes on numeric features
Suppose we have two features, $x_1$ (height in cm) and $x_2$ (weight in kg), and two classes: athlete and non-athlete.
Training data
| Class | $x_1$ (height) | $x_2$ (weight) |
|---|---|---|
| athlete | 180 | 75 |
| athlete | 175 | 70 |
| athlete | 185 | 80 |
| non-athlete | 160 | 85 |
| non-athlete | 165 | 90 |
| non-athlete | 155 | 80 |
Step 1: Compute priors.

$$P(\text{athlete}) = \frac{3}{6} = 0.5, \qquad P(\text{non-athlete}) = \frac{3}{6} = 0.5$$
Step 2: Compute class-conditional means and variances.
For athletes: $\mu_1 = \frac{180 + 175 + 185}{3} = 180$, $\sigma_1^2 = \frac{0^2 + 5^2 + 5^2}{3} = \frac{50}{3} \approx 16.67$; likewise $\mu_2 = 75$, $\sigma_2^2 = \frac{50}{3} \approx 16.67$ (using the population variance, dividing by $n$).
For non-athletes: $\mu_1 = 160$, $\sigma_1^2 = \frac{50}{3} \approx 16.67$; $\mu_2 = 85$, $\sigma_2^2 = \frac{50}{3} \approx 16.67$.
Step 3: Classify a new point $x = (170, 72)$.
Using the Gaussian PDF $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$:
For athlete, feature $x_1 = 170$:

$$f(170 \mid \text{athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(170 - 180)^2}{2 \cdot 16.67}\right) \approx 0.00487$$

For athlete, feature $x_2 = 72$:

$$f(72 \mid \text{athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(72 - 75)^2}{2 \cdot 16.67}\right) \approx 0.0746$$

For non-athlete, feature $x_1 = 170$:

$$f(170 \mid \text{non-athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(170 - 160)^2}{2 \cdot 16.67}\right) \approx 0.00487$$

For non-athlete, feature $x_2 = 72$:

$$f(72 \mid \text{non-athlete}) = \frac{1}{\sqrt{2\pi \cdot 16.67}} \exp\left(-\frac{(72 - 85)^2}{2 \cdot 16.67}\right) \approx 0.000613$$
Step 4: Compute posteriors (unnormalized).

$$\text{score}(\text{athlete}) = 0.5 \times 0.00487 \times 0.0746 \approx 1.82 \times 10^{-4}$$

$$\text{score}(\text{non-athlete}) = 0.5 \times 0.00487 \times 0.000613 \approx 1.49 \times 10^{-6}$$
Step 5: Decision.
Athlete score is about 122 times larger. Classification: athlete.
Normalizing:

$$P(\text{athlete} \mid x) = \frac{1.82 \times 10^{-4}}{1.82 \times 10^{-4} + 1.49 \times 10^{-6}} \approx 0.992, \qquad P(\text{non-athlete} \mid x) \approx 0.008$$
The height of 170 is between the two class means, so it doesn’t help much. But the weight of 72 is much closer to the athlete mean (75) than the non-athlete mean (85), and that’s what drives the decision.
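The same arithmetic in code, using the population variance (dividing by $n$) as above:

```python
import math

athletes = [(180, 75), (175, 70), (185, 80)]
non_athletes = [(160, 85), (165, 90), (155, 80)]

def mean_var(values):
    """Mean and population variance (divide by n)."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_score(point, data, prior=0.5):
    """Prior times the product of per-feature Gaussian likelihoods."""
    score = prior
    for i, x in enumerate(point):
        mu, var = mean_var([row[i] for row in data])
        score *= gaussian_pdf(x, mu, var)
    return score

score_a = class_score((170, 72), athletes)
score_n = class_score((170, 72), non_athletes)
print(score_a / score_n)                        # ≈ 122
print(round(score_a / (score_a + score_n), 3))  # ≈ 0.992
```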
Laplace smoothing
Look back at the spam example. What if a word never appears in the spam training set? Its count is zero, so . And since we multiply likelihoods, one zero kills the entire product. The model would say “this can’t possibly be spam” based on a single unseen word, ignoring all other evidence.
Laplace smoothing (also called add-$\alpha$ smoothing) fixes this by adding a small count $\alpha$ to every feature:

$$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha V}$$
With $\alpha = 1$, we add 1 to every word count. No probability is ever zero. This is a form of regularization: it pushes the model away from extreme probability estimates and toward a more uniform distribution.
Setting $\alpha$ too high flattens the distribution too much, making all words equally likely. Setting it too low doesn’t fix the zero-probability problem enough. In practice, $\alpha = 1$ works well for most text tasks, but you can tune it with cross-validation.
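A quick illustration of the zero-probability problem and of what different $\alpha$ values do to the estimates:

```python
# Word counts for one class; the last word never appeared in training
counts = [3, 2, 1, 0]
V = len(counts)

def smoothed(counts, alpha):
    """Smoothed likelihoods: (count + alpha) / (total + alpha * V)."""
    total = sum(counts)
    return [(c + alpha) / (total + alpha * V) for c in counts]

# Without smoothing, the unseen word has probability exactly zero,
# and any document containing it scores zero for this class
print(smoothed(counts, alpha=0.0))   # last entry is 0.0

# With alpha = 1, every word keeps a small nonzero probability
print(smoothed(counts, alpha=1.0))   # [0.4, 0.3, 0.2, 0.1]

# A very large alpha flattens everything toward uniform
print(smoothed(counts, alpha=1000))  # all entries close to 1/V = 0.25
```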
Python implementation
Here’s a minimal Multinomial Naive Bayes from scratch:
```python
import numpy as np

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        n_features = X.shape[1]
        self.priors = np.zeros(n_classes)
        self.likelihoods = np.zeros((n_classes, n_features))
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.priors[i] = X_c.shape[0] / X.shape[0]
            # Smoothed likelihoods
            counts = X_c.sum(axis=0) + self.alpha
            self.likelihoods[i] = counts / counts.sum()

    def predict(self, X):
        log_priors = np.log(self.priors)
        log_likelihoods = np.log(self.likelihoods)
        # Log-posterior for each class
        log_posteriors = X @ log_likelihoods.T + log_priors
        return self.classes[np.argmax(log_posteriors, axis=1)]
```
That’s it. The entire training step is counting and dividing. Prediction is a matrix multiply plus an argmax. This is why Naive Bayes is so fast.
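A quick smoke test on the data from worked example 1 (the class is repeated here so the snippet runs standalone):

```python
import numpy as np

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = np.zeros(len(self.classes))
        self.likelihoods = np.zeros((len(self.classes), X.shape[1]))
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.priors[i] = X_c.shape[0] / X.shape[0]
            counts = X_c.sum(axis=0) + self.alpha  # smoothed counts
            self.likelihoods[i] = counts / counts.sum()

    def predict(self, X):
        log_posteriors = X @ np.log(self.likelihoods).T + np.log(self.priors)
        return self.classes[np.argmax(log_posteriors, axis=1)]

# Training data from worked example 1: columns [free, money, the, meeting]
X = np.array([[2, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 2, 1]])
y = np.array(["spam", "spam", "ham", "ham"])

model = MultinomialNB(alpha=1.0)
model.fit(X, y)
print(model.predict(np.array([[2, 1, 0, 0]])))  # "free money free" → spam
print(model.predict(np.array([[0, 0, 1, 2]])))  # meeting-heavy doc → ham
```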
Strengths and weaknesses
When Naive Bayes works well
- Text classification. Spam filtering, sentiment analysis, document categorization. Multinomial NB is a strong baseline here.
- Small datasets. Because of the strong independence assumption, the model has very few parameters. It won’t overfit when data is scarce.
- High-dimensional data. When you have thousands of features (like words in a vocabulary), Naive Bayes handles it without breaking a sweat. No matrix inversions, no gradient descent.
- Fast training and prediction. Training is a single pass through the data. Prediction is a dot product. Hard to beat on speed.
- Multi-class problems. It naturally handles multiple classes without any modification.
When Naive Bayes struggles
- Correlated features. If features are heavily dependent (e.g., pixel values in an image), the independence assumption hurts badly. The model double-counts evidence.
- Probability calibration. The predicted probabilities from Naive Bayes tend to be pushed toward 0 and 1. If you need well-calibrated probabilities (not just rankings), use calibration methods like Platt scaling.
- Complex decision boundaries. The decision boundary is linear (or piecewise linear). If classes overlap in complex ways, you’ll need something more flexible.
- Continuous features with non-Gaussian distributions. Gaussian NB assumes normality. Skewed, multimodal, or heavy-tailed features will mislead the model.
Naive Bayes vs Logistic Regression
```mermaid
graph TD
  subgraph NBG["Naive Bayes"]
    NB1["Generative model"]
    NB2["Learns P(x|y) and P(y)"]
    NB3["Fast: just counting"]
    NB4["Strong independence assumption"]
  end
  subgraph LOG["Logistic Regression"]
    LR1["Discriminative model"]
    LR2["Learns P(y|x) directly"]
    LR3["Needs gradient descent"]
    LR4["No independence assumption"]
  end
```
A practical tip
Naive Bayes is an excellent first model to try on any classification problem. It gives you a performance floor in seconds. If it works well enough, ship it. If it doesn’t, you now have a baseline to beat with more complex models. Either way, you’ve lost almost no time.
Connection to other loss functions
You might wonder how Naive Bayes relates to the loss functions used in other classifiers. When you train Naive Bayes with maximum likelihood estimation, you’re implicitly minimizing the cross-entropy loss between the true class distribution and your predicted distribution. The difference is that you’re doing it analytically through counting, rather than through iterative optimization like gradient descent.
What comes next
Naive Bayes is a strong, fast, and interpretable baseline. But it’s not the only simple classifier worth knowing. Next up, we’ll look at K-Nearest Neighbors, a model that takes the opposite approach: instead of learning a parametric model from training data, it just memorizes everything and makes predictions by looking at the closest examples at test time. No training step at all.