Chapter 1: K Nearest Neighbors (Supervised Machine Learning Algorithm)

Sabita Rajbanshi
Machine Learning Community
Aug 27, 2021


K-nearest neighbors (K-NN) is a simple yet powerful supervised machine learning algorithm. It can be used to solve both classification and regression problems. In this article, we will look at how a K-NN classifier works.

Table of Contents

  1. Geometric Intuition of KNN
  2. Failure Cases of KNN
  3. Limitations of KNN
  4. The right value of K
  5. Distance Metrics
  6. Cosine distance and cosine similarity
  7. Overfitting and Underfitting
  8. How K-Fold Cross-Validation Works
  9. Sklearn K Nearest Neighbor and Parameters
  10. Real-World Applications of KNN

1. Geometric Intuition of KNN:

In KNN, an object is classified by a majority vote of its neighbors. If K = 1, the object is simply assigned to the class of its single nearest neighbor.

Figure: KNN via https://www.datacamp.com/

Let’s take a simple case of binary classification. Here, the X-axis and Y-axis are the two dimensions. Consider the red points as class A and the green points as class B, and the “?” sign as a query point (xq) whose class we need to determine.

From the above image, if we take K = 3, xq is classified as class B, and if we instead take K = 7, xq is classified as class A, based on the majority vote of its nearest neighbors.
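
To make the voting concrete, here is a minimal sketch in plain Python with made-up 2-D points (the coordinates are hypothetical and chosen only so that K = 3 and K = 7 give different answers, as in the figure):

    # A minimal sketch of KNN majority voting on toy 2-D points.
    # The points are hypothetical; they only illustrate how changing K
    # can flip the predicted class of the query point xq.
    from collections import Counter
    import math

    # (x, y, class_label) -- toy points: class "A" (red) and class "B" (green)
    points = [
        (1.0, 1.0, "A"), (1.5, 2.0, "A"), (3.0, 4.0, "A"), (5.0, 7.0, "A"),
        (3.5, 5.0, "B"), (4.5, 5.0, "B"), (3.5, 4.5, "B"),
    ]
    xq = (4.0, 5.0)  # query point whose class we want to determine

    def knn_predict(query, data, k):
        # Sort the points by Euclidean distance to the query, keep the k
        # closest, and return the majority class among them.
        neighbors = sorted(data, key=lambda p: math.dist(query, p[:2]))[:k]
        votes = Counter(label for _, _, label in neighbors)
        return votes.most_common(1)[0][0]

    print(knn_predict(xq, points, k=3))  # class decided by the 3 nearest points
    print(knn_predict(xq, points, k=7))  # class decided by all 7 points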

2. Failure Cases of KNN:

Case a: When the data is jumbled (the classes are thoroughly mixed together), we cannot obtain any useful insight from it. In such a case, when we are given a query point, the KNN classifier will still find its nearest neighbors, but because the dataset is jumbled its accuracy can be very low.

Case b: When a query point is very far away from all the data points. In such a case, the nearest neighbors are not meaningful, so we cannot use KNN to identify the class reliably.

3. Limitations of KNN :

  • KNN doesn’t work well with a large dataset
  • It doesn’t work well with a high number of dimensions
  • It is sensitive to outliers and missing values

The biggest problem with KNN is its large time and space complexity at prediction time, but two data structures, the k-d tree and locality-sensitive hashing (LSH), can improve KNN performance by reducing this complexity.
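
As a sketch, scikit-learn lets you request a k-d tree index through the algorithm parameter of KNeighborsClassifier (LSH is not part of scikit-learn and would require a separate library); the data below is randomly generated purely for illustration:

    # Sketch: build a k-d tree index instead of brute-force neighbor search.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))          # 10k points in 3 dimensions
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # an arbitrary binary label

    # algorithm="kd_tree" builds the tree once at fit time, so each query
    # no longer scans every training point (most helpful in low dimensions).
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
    knn.fit(X, y)
    print(knn.predict(rng.normal(size=(3, 3))))  # predictions for 3 new points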

4. The right value of K:

  • As we decrease the value of K, our prediction becomes less stable (more sensitive to noise); as we increase K, the prediction becomes more stable due to majority averaging, up to the point where an overly large K over-smooths the decision and errors rise again.
  • To select the optimal value of K for the dataset, we run the KNN algorithm several times with different values of K and choose the K that minimizes the prediction error (see the sketch after this list).
  • For binary classification, K is usually taken to be an odd number (1, 3, 5, 7, 9, 11, …) so that the majority vote cannot end in a tie.
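
A simple sketch of this search over K, using scikit-learn and a synthetic dataset (the dataset and the range of K values are arbitrary choices for illustration):

    # Sketch: try several odd values of K and keep the one with the
    # lowest error on a held-out validation split.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=42)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    errors = {}
    for k in range(1, 22, 2):                      # odd K values: 1, 3, ..., 21
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        errors[k] = 1 - model.score(X_val, y_val)  # validation error for this K

    best_k = min(errors, key=errors.get)
    print(f"Best K = {best_k} with validation error {errors[best_k]:.3f}")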

5. Distance Metrics:

The distance metric is the most crucial factor for measuring the distance between the training data points and a new test sample. The three most commonly used distance measures are:

a. Euclidean distance: The most intuitive notion of distance is the length of the straight line between two points. Euclidean distance is also known as the L2 norm; it is computed between two rows of data with numerical values (i.e., integer or floating-point values). Here we measure the distance between Point 1 (x1, y1) and Point 2 (x2, y2) as sqrt((x2 − x1)² + (y2 − y1)²).

Figure: Euclidean distance via slideplayer.com

b. Manhattan distance: The Manhattan distance is the sum of the absolute differences between coordinates. Unlike the Euclidean distance, the Manhattan path between two points is generally longer and can be traced along more than one route. Manhattan distance is also known as the L1 norm. It is used in machine learning applications such as speech recognition and image processing.

Figure: Manhattan distance

c. Minkowski distance: The generalization of the L1 norm and L2 norm is the Lp norm, also known as the Minkowski distance. The Minkowski distance of order p between two points X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as (Σ |xi − yi|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

Figure: Minkowski distance via math.stackexchange.com
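
A short sketch computing all three metrics with scipy on two example points (the points themselves are arbitrary):

    # Sketch: the three distance metrics on a pair of example points.
    from scipy.spatial import distance

    p1 = [1.0, 2.0, 3.0]
    p2 = [4.0, 6.0, 3.0]

    # Euclidean (L2): straight-line distance.
    print(distance.euclidean(p1, p2))       # sqrt(3**2 + 4**2 + 0**2) = 5.0
    # Manhattan (L1): sum of absolute differences.
    print(distance.cityblock(p1, p2))       # |3| + |4| + |0| = 7
    # Minkowski (Lp): generalizes both; p=1 is Manhattan, p=2 is Euclidean.
    print(distance.minkowski(p1, p2, p=3))  # (3**3 + 4**3) ** (1/3) ≈ 4.498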

6. Cosine Distance and Cosine Similarity:

For two points A and B:

  • When the distance between two points increases, their similarity decreases, and when the distance decreases, the similarity increases; in fact, cosine distance = 1 − cosine similarity.
  • When two points are very similar (they point in the same direction), Cos-Sim(A, B) = +1, and when they are very dissimilar (they point in opposite directions), Cos-Sim(A, B) = −1.
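
A small numpy sketch of this relationship (the vectors are arbitrary examples):

    # Sketch: cosine similarity and cosine distance for two vectors.
    import numpy as np

    A = np.array([1.0, 2.0, 3.0])
    B = np.array([2.0, 4.0, 6.0])    # same direction as A, so similarity = +1

    cos_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
    cos_dist = 1 - cos_sim           # cosine distance = 1 - cosine similarity
    print(cos_sim, cos_dist)         # 1.0 and 0.0 for parallel vectors

    C = np.array([-1.0, -2.0, -3.0])  # opposite direction to A, similarity = -1
    print(A @ C / (np.linalg.norm(A) * np.linalg.norm(C)))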

7. Overfitting and Underfitting:

Figure: Fitting via medium.com

Underfitting: The classifier barely uses the structure of the data; in KNN this corresponds to a very large K. It effectively says: if the dataset has more positive points, every query is predicted positive, and if it has more negative points, every query is predicted negative.

Well-fit: It is neither over-fit nor under-fit. It ignores outliers and noisy data, and the decision surface is well balanced.

Overfitting: The decision surface is extremely non-smooth because the model tries too hard to fit every single point and make no mistakes on its training data; in KNN this corresponds to K = 1.
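
A sketch of this trade-off with scikit-learn on a synthetic dataset (the dataset and the specific values K = 1, 15, and 400 are arbitrary choices for illustration): K = 1 scores almost perfectly on its own training data but worse on test data, while a very large K underfits both.

    # Sketch: how K controls overfitting (K=1) and underfitting (very large K).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=600, n_features=5, flip_y=0.1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    for k in (1, 15, 400):  # overfit, reasonable, underfit (K close to n)
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        print(f"K={k}: train={model.score(X_tr, y_tr):.3f}, test={model.score(X_te, y_te):.3f}")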

8. How K-Fold Cross-Validation Works:

Note: The K in K-fold is different from the K in K-NN.

Cross-validation is a resampling procedure used to evaluate models on a limited number of data points. The K parameter refers to the number of groups the given data is divided into. If we choose K = 10, the data is divided into 10 equal parts; this is known as 10-fold cross-validation.

For a better understanding: given a dataset D, divide it into Dtrain, Dcv, and Dtest. The KNN algorithm finds the nearest neighbors in the training set Dtrain, the right K is chosen using the cross-validation set Dcv, and the model is tested on the unseen data in Dtest.

Here is how K-fold cross-validation works, step by step:

Step 1: The total data Dn is divided into Dtrain (80%) and Dtest (20%). Using the Dtrain data, we need to compute both the nearest neighbors and the right K.

Step 2: Take Dtrain and split it randomly into four equal-sized parts D1, D2, D3, D4 for 4-fold cross-validation.

Step 3: For K = 1, in the first fold use D1, D2, D3 as training data and D4 as cv data, and compute accuracy a4 on the cv data (D4). For the second fold, use D1, D2, D4 as training data and D3 as cv data, and compute accuracy a3 on the cv data (D3); do the same for the third and fourth folds. Then take the average of the four accuracies (a1, a2, a3, a4) and assign this average accuracy to K = 1. Repeat the process for K = 2 and assign its average accuracy to K = 2, and so on for increasing values of K. Finally, compare the accuracies for all the values of K to get the best K. Assume K = 3 has the highest accuracy on the 4-fold cv.

Step 4: Now, given a query point, use 3-NN to compute its nearest neighbors in the whole of Dtrain.

Finally, to determine the accuracy of the model, use the accuracy of 3-NN on unseen data, Dtest.

Note: Typically, we use 10-fold cross-validation.
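
A compact sketch of the same procedure using scikit-learn’s cross_val_score (the synthetic dataset and the range of K values are illustrative; 4 folds are used here to mirror the walkthrough above):

    # Sketch: choose K with k-fold cross-validation, then evaluate on Dtest.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    cv_scores = {}
    for k in range(1, 12, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        # Average accuracy over the 4 folds (D1..D4 of Dtrain) for this K.
        cv_scores[k] = cross_val_score(knn, X_train, y_train, cv=4).mean()

    best_k = max(cv_scores, key=cv_scores.get)
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print(best_k, final.score(X_test, y_test))  # accuracy on the unseen Dtest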

9. Sklearn K Nearest Neighbor and Parameters:

Scikit-learn (sklearn) is a Python library that provides an implementation of the K nearest neighbor classifier. A sample Python code snippet is shown below:
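
Here is a minimal sketch using KNeighborsClassifier and its most commonly tuned parameters (the Iris dataset is used purely for illustration; the parameter values shown are just reasonable defaults):

    # Sketch: a basic KNeighborsClassifier with its main parameters.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    knn = KNeighborsClassifier(
        n_neighbors=5,       # K, the number of neighbors that vote
        weights="uniform",   # "distance" weights closer neighbors more heavily
        algorithm="auto",    # "kd_tree", "ball_tree", or "brute" can be forced
        metric="minkowski",  # distance metric; with p=2 this is Euclidean
        p=2,
    )
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))  # accuracy on the held-out test set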

10. Real-World Applications of KNN:

  • Text mining
  • Climate forecasting
  • Stock market forecasting
  • Money laundering analysis
  • Currency exchange rate prediction
  • Loan management

Key Takeaways:

  • If you need an algorithm that can handle both regression and classification tasks, K-NN is a good fit
  • The more training data you have, the better the algorithm performs
  • It is flexible in the choice of distance metric
  • Find the right value of K (for example, through cross-validation)
  • For binary classification, the value of K should be odd to avoid ties

Happy Learning!
