Introduction
Naïve Bayes is a simple yet powerful probabilistic classification algorithm based on Bayes’ Theorem. It assumes that the features are conditionally independent given the class label. Despite its ‘naïve’ assumption, it performs well in various real-world scenarios like spam detection, sentiment analysis, and document classification.
Bayes’ Theorem
Bayes’ Theorem is the foundation of the Naïve Bayes algorithm:
P(C|X) = [P(X|C) * P(C)] / P(X)
Where:
- P(C|X) is the posterior probability of class C given predictor X
- P(X|C) is the likelihood: the probability of observing predictor X given class C
- P(C) is the prior probability of class C
- P(X) is the prior probability of predictor X (the evidence)
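As a quick illustration, here is a minimal Python sketch of the theorem with made-up numbers (the priors and likelihoods below are purely illustrative, not from any dataset):

```python
# Illustrative (made-up) numbers: suppose 30% of emails are spam, and the
# word "free" appears in 60% of spam emails but only 10% of non-spam emails.
p_spam = 0.3                 # P(C): prior probability of spam
p_free_given_spam = 0.6      # P(X|C): likelihood of "free" given spam
p_free_given_ham = 0.1       # likelihood of "free" given not-spam

# P(X): total probability of seeing "free" in any email
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# P(C|X): posterior probability of spam given that "free" appears
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(Spam | 'free') = {p_spam_given_free:.3f}")  # 0.720
```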
Assumption
Naïve Bayes assumes that all features are conditionally independent of one another given the class label. That is, for features X1, X2, ..., Xn:
P(X1, X2, ..., Xn | C) = P(X1 | C) * P(X2 | C) * ... * P(Xn | C)
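Under this assumption, the class-conditional likelihood of a whole feature vector is just the product of per-feature likelihoods. A minimal sketch (the per-feature probabilities here are hypothetical placeholders):

```python
import math

# Hypothetical per-feature likelihoods P(Xi | C) for one class
per_feature_likelihoods = [0.6, 0.9, 0.3]

# Independence assumption: multiply to get P(X1, ..., Xn | C)
joint_likelihood = math.prod(per_feature_likelihoods)
print(joint_likelihood)  # 0.162

# In practice, implementations sum log-probabilities instead,
# to avoid numerical underflow when there are many features
log_joint = sum(math.log(p) for p in per_feature_likelihoods)
```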
Types of Naïve Bayes Models
- Gaussian: Assumes features follow a normal distribution (used for continuous data)
- Multinomial: Used for discrete count features (e.g., text classification)
- Bernoulli: Used for binary/boolean features
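If you use scikit-learn, each variant has a ready-made estimator. A minimal sketch with tiny toy arrays (the data below is invented purely to show the expected input shapes):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([1, 1, 0, 0])  # class labels

# Gaussian: continuous features
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.2]])
GaussianNB().fit(X_cont, y)

# Multinomial: non-negative counts (e.g., word counts per document)
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 1]])
MultinomialNB().fit(X_counts, y)

# Bernoulli: binary presence/absence features
X_bin = (X_counts > 0).astype(int)
BernoulliNB().fit(X_bin, y)
```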
Example
Suppose we want to classify whether an email is “Spam” or “Not Spam” based on two features: “Contains ‘Free’” and “Contains ‘Buy Now’”.
| Email | Free | Buy Now | Spam |
|---|---|---|---|
| 1 | Yes | Yes | Yes |
| 2 | No | Yes | Yes |
| 3 | Yes | No | No |
| 4 | No | No | No |
Step 1: Calculate prior probabilities
P(Spam=Yes) = 2/4 = 0.5
P(Spam=No) = 2/4 = 0.5
Step 2: Calculate likelihoods
P(Free=Yes | Spam=Yes) = 1/2 = 0.5
P(Buy Now=Yes | Spam=Yes) = 2/2 = 1
P(Free=Yes | Spam=No) = 1/2 = 0.5
P(Buy Now=Yes | Spam=No) = 0/2 = 0
Step 3: Predict for new email: Free=Yes, Buy Now=Yes
P(Spam=Yes | X) ∝ 0.5 * 0.5 * 1 = 0.25
P(Spam=No | X) ∝ 0.5 * 0.5 * 0 = 0
Prediction: since 0.25 > 0, the email is classified as Spam.
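The hand calculation above can be reproduced in a few lines of Python. This is a minimal sketch that hard-codes the counts from the table; the `predict` helper is a hypothetical name introduced just for this example:

```python
# Training data from the table: (free, buy_now, spam), 1 = Yes, 0 = No
emails = [(1, 1, 1), (0, 1, 1), (1, 0, 0), (0, 0, 0)]

def predict(free, buy_now):
    scores = {}
    for c in (1, 0):  # 1 = Spam, 0 = Not Spam
        rows = [e for e in emails if e[2] == c]
        prior = len(rows) / len(emails)                   # P(C)
        p_free = sum(e[0] == free for e in rows) / len(rows)      # P(Free | C)
        p_buy = sum(e[1] == buy_now for e in rows) / len(rows)    # P(Buy Now | C)
        scores[c] = prior * p_free * p_buy                # unnormalized posterior
    return scores

print(predict(free=1, buy_now=1))  # {1: 0.25, 0: 0.0} -> classified as Spam
```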
Advantages
- Fast and simple
- Works well with high-dimensional data
- Requires relatively little training data to estimate its parameters
Limitations
- Assumes feature independence
- Assigns zero probability to an entire class if a feature value never co-occurs with that class in the training data (see the smoothing sketch below)
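The zero-probability issue is usually handled with Laplace (add-one) smoothing: add a small constant to every count so no likelihood is exactly zero. A minimal sketch, applied to the count that caused the zero in the example above (the helper name and parameters here are hypothetical; scikit-learn exposes the same idea through the `alpha` parameter of MultinomialNB and BernoulliNB):

```python
def smoothed_likelihood(feature_count, class_count, n_values=2, alpha=1):
    # Laplace smoothing: add alpha to the numerator and
    # alpha * (number of possible feature values) to the denominator
    return (feature_count + alpha) / (class_count + alpha * n_values)

# P(Buy Now=Yes | Spam=No) was 0/2 = 0 in the example; smoothed:
print(smoothed_likelihood(0, 2))  # 0.25 instead of 0
```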
Conclusion
Naïve Bayes is a powerful classification algorithm that works well for many tasks, especially with text data. It is computationally efficient, easy to implement, and surprisingly effective despite its simplifying assumptions.