K-Nearest Neighbors A Simple Yet Powerful Machine Learning Algorithm
Check the video explanation of this article here.
K-Nearest Neighbors aka KNN is a very powerful machine learning algorithm. It is used to solve both classifications as well as regression problems.
We will start with classification because it’s mostly used to solve classification problems.
KNN for Claasification:
A classification problem can be a binary classification or multi-class classification here for simplicity we are taking binary classification but the idea will be the same for multi-class classification as well.
Let’s assume the training dataset has the data distribution as given in the above image, there are two classes positive class (Yellow Points) and negative class (Red Points) now when a new data point comes in KNN has to tell that new data point belongs to which class?. In the above image q is the new data point.
How does KNN work?
As we said in the beginning knn is a very simple algorithm it looks at the nearest data points classes and whichever class is maximum, is the resultant class for the new data point.
Now we know how knn works but what is the use of K in the algorithm?
So when knn looks for the nearest data points classes the value of K decides how many neighbors it has to look for. E.g. if the value of K is 3 it will look for the 3 nearest data points. So while training knn algorithm K is a hyperparameter.
Let’s see visually how it works.
In the above image, we can see for a new data point “q” with K=3, we got 2 positive class points near it and 1 negative class and since the majority is with the positive class we will call this new data point a positive class data point. Similarly, we can try out different values of K to see which one performs well for the dataset.
Now one question comes to mind what if K=2 and 1 nearest data point belongs to the positive class and the other belongs to the negative class then how do we break the tie?
So the best way to deal with this problem is to set the K with numbers which are not divisible by the number of classes in the dataset.
Failure Cases of KNN:
- If the query point( New Data Point) is very far from the actual data points it’s not evident to decide whether the majority class is correct or not.
In the above image, we can see that the query point ‘q’ is very far from all the data points so it’s very hard to decide whether the majority class which is Negative in this case is correct or not.
2. If data points of all the classes are mixed together. It will be nearly impossible to classify.
In the above image, we can see that the data points of both the classes are mixed together so if we set the K=1 then also we will not be sure whether the resultant class is correct or not. So while selecting the algorithm we should look at the dataset and whether the algorithm can perform well on the dataset or not.
We have talked a lot about the working of knn, the significance of “K” and the failure cases of knn. But how knn decide which data point is closer to it? For that knn uses some distance measurement techniques.
Let’s see some of the distance measure techniques used by KNN:
Since Euclidean distance uses simple Pythagoras theorem to calculate the distance between two points it is also known as “Pythagorean distance”.
Euclidean distance is also known as the L2 norm.
In the manhattan distance, we take the absolute distance between two points. Manhattan distance is similar to adding the distance of the blocks between two points.
It is adding the distance of |P1 →Q1| and |P2 →Q2|.
Manhattan Distance is also known is L1 norm.
Minkowski Distance is a generalized form by which we can achieve both Euclidean distance and Manhattan distance.
It is also known as Lp-Norm.
In the formula of Minkowski distance, if we put the value of p=1, it becomes the Manhattan distance formula, similarly, if we put p=2, it becomes the Euclidean distance formula.
KNN In the scikit-learn package has a parameter p which is used to change the distance metric by default the Minkowski distance metric is used with p=2 so which becomes Euclidean distance. So if you do not change the p parameter in the KNN classifier it would use Euclidean distance.
I hope you got the idea of how KNN calculates the distance between two points and decides which data point is nearest neighbor.
Overfitting and Underfitting Cases in KNN:
As shown in the above image, if the value of K=1 the decision surface will look something like the above image. It will try to fit each and every point but in a real-time scenario those two red points which are outliers in yellow class should have classified as yellow points similarly one yellow point is also present in the red class region which should have been classified as red but since the value of K=1, it will try to map it to yellow. Just think if K=3 in that case the majority class will be red.
So if the value of K is less it will lead to overfitting.
If the value of K is very large it will lead to Underfitting.
For example, let’s say the value of K = Total Number of Data Points. In this case every time all the data points would become neighbors of the query point and whichever class is having more data points in the dataset will always be the winner in majority voting.
So having a big value of K is never a good idea in KNN.
We should always try to find a trade-off between overfitting and underfitting.
Regression in KNN:
The regression in KNN is also very simple.
Step 1: Get the nearest neighbors of the query point. Let’s say the target labels of the nearest data points are (y1, y2, y3).
Step 2: Calculate the mean of (y1,y2,y3) and get the results of regression for a query point. We are considering K=3 here that’s why we have only three dependent points (y1,y2,y3)
Now let’s see some of the limitations of KNN.
Limitations of KNN:
- KNN takes a lot of space to store the data points for distance calculations which is a very big problem if you are dealing with a large dataset with millions of data points.
- KNN takes a long time to respond because it calculates the distance with all the points and looks for the nearest data points. If the data set size is large the time will increase. That is why it is also called a lazy learner because it memorizes the dataset and uses it for distance calculations.
Hope you enjoyed reading it! 🤗