Introduction
K-Nearest Neighbors (KNN) is a simple and intuitive supervised machine learning algorithm used for classification and regression tasks. For classification, it labels a new data point with the majority label of the ‘k’ closest training examples in the feature space; for regression, it predicts the average of those neighbors’ target values.
How KNN Works
- Choose the number of neighbors ‘k’
- Calculate the distance (e.g., Euclidean) between the new data point and all other points in the training dataset
- Select the ‘k’ nearest neighbors
- Assign the most frequent label among those neighbors to the new data point (for classification)
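The steps above map directly onto a few lines of code. Here is a minimal from-scratch sketch in Python; the function and variable names (e.g. `knn_classify`) are illustrative, not part of any library:

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Steps 1-2: compute the Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest points
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 4: return the most frequent label among them
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```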
Example
Let’s consider a dataset where we want to predict whether a fruit is an Apple (A) or an Orange (O) based on its weight and texture.
| Fruit | Weight (g) | Texture (1 = smooth, 0 = bumpy) |
|---|---|---|
| A | 150 | 1 |
| O | 170 | 0 |
| A | 140 | 1 |
| O | 160 | 0 |
Now we want to classify a new fruit with weight = 155g and texture = 0. We’ll compute the Euclidean distance between the new point and each existing one.
Euclidean Distance:
Distance = √((weight1 – weight2)² + (texture1 – texture2)²)
- To A (150,1): √((155−150)² + (0−1)²) = √(25 + 1) = √26 ≈ 5.1
- To O (170,0): √((155−170)² + (0−0)²) = √(225) = 15
- To A (140,1): √((155−140)² + (0−1)²) = √(225 + 1) = √226 ≈ 15.03
- To O (160,0): √((155−160)² + (0−0)²) = √25 = 5
For k=3, the nearest neighbors are: O (160,0) at distance 5, A (150,1) at ≈5.1, and O (170,0) at 15
Majority = O → Predicted: Orange
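The same result can be reproduced with the `knn_classify` sketch from earlier (again, a hypothetical helper, not a library function):

```python
fruits = [(150, 1), (170, 0), (140, 1), (160, 0)]
labels = ["A", "O", "A", "O"]

# Distances from (155, 0): ≈5.10, 15.0, ≈15.03, 5.0 — matching the hand calculation
prediction = knn_classify(fruits, labels, query=(155, 0), k=3)
print(prediction)  # -> "O" (Orange)
```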
Distance Metrics
- Euclidean Distance (default)
- Manhattan Distance
- Minkowski Distance
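These metrics are closely related: Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2 respectively. A rough sketch:

```python
def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski((155, 0), (150, 1), p=2))  # Euclidean, ≈ 5.10
print(minkowski((155, 0), (150, 1), p=1))  # Manhattan, = 6.0
```

Libraries such as scikit-learn expose this choice through the `metric` and `p` parameters of their KNN estimators, with Minkowski (p = 2, i.e. Euclidean) as the default.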
Choosing k
If ‘k’ is too small, the model is sensitive to noise and may overfit; if ‘k’ is too large, it smooths over local structure and may underfit. Use cross-validation to choose the best ‘k’, as sketched below.
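One common approach is a simple search over candidate values of ‘k’, scoring each with cross-validated accuracy. A sketch using scikit-learn, assuming feature array `X` and label array `y` are already loaded:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X, y, k_values=range(1, 21), cv=5):
    """Return the k with the highest mean cross-validated accuracy."""
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
        for k in k_values
    }
    return max(scores, key=scores.get)
```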
Advantages
- Simple to understand and implement
- No explicit training phase (a ‘lazy learner’: it simply stores the training data)
- Effective for small datasets
Disadvantages
- Computationally expensive for large datasets
- Performance degrades with high-dimensional data (the curse of dimensionality)
- Needs proper feature scaling, since features with large numeric ranges dominate the distance (see the sketch below)
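In the fruit example above, weight (in grams) dwarfs the 0/1 texture flag, so texture barely influences the distance. For that reason KNN is usually combined with a scaler; a sketch with scikit-learn:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale each feature to zero mean and unit variance before measuring distances
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
# model.fit(X_train, y_train); model.predict(X_test)  # assuming train/test arrays exist
```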
Conclusion
KNN is a foundational algorithm that works well for many problems where interpretability and simplicity are priorities. It’s a great baseline model to compare with more complex models.