Author: Anusree KJ
This blog helps you understand the working of KNN algorithm. It also provides detailed information on why and how KNN algorithm is useful in different analysis.
K-Nearest Neighbor (KNN), a simplest Machine Learning algorithm, is based on the Supervised Learning technique. KNN assumes the similarity among the new and the available data points and puts the new data point into the category that is most similar to the available categories. This means when new data appears, it can be easily classified into a well suited category by using KNN algorithm.
KNN is known as a non-parametric algorithm i.e., it does not make any assumption on underlying data. KNN uses the dataset for training to classify into similar groups. At the training phase, it simply stores the dataset and upon receiving new data it classifies that data into a category or group that is similar to it.
Why do we need a KNN Algorithm?
KNN algorithm is mostly used for classification problems to categorize the given data into K numbers of similar groups. The value of K is determined by the user as per the needs.
Eg: Suppose there are two categories, i.e., Category A and Category B, and there is one data point x1. In which category does this new data point x1 belong?
To solve this type of problem, we can use a KNN algorithm. KNN will help categorize this new data point x1 to a most similar group. Using KNN, we can easily identify the category or class of a particular dataset. (See the following image for more understanding)
How does KNN work?
Following are the steps of KNN algorithm:
Step 1: Select the number K of the neighbors
Step 2: Calculate the Euclidean distance for K number of neighbors
Step 3: Take the K nearest neighbors as per the Euclidean distance calculated in the previous step
Step 4: From these K neighbors, count the number of the data points in each category
Step 5: Assign the new data points to the category for which the number of the neighbor is maximum
Step 6: KNN model is ready
How to select the value of K in the KNN Algorithm?
There is no defined way to determine the best value for "K". Hence, we need to try some values to find the best out of them. The most used and preferred value for K is 5. A very low value for K such as value 1 or 2, can be noisy and may lead to the effects of outliers in the model.
Advantages of using KNN Algorithm:
It is very simple to implement.
It is robust for noisy training data.
It can be more effective when large training data is used.
Different Similarity Measurements
There are different similarity measurements available to calculate similarity among two points in a multidimensional space. Euclidean distance, Cosine Similarity, Manhattan distance are some of the different types of similarity measurements to calculate similarity among two data points. You can choose these similarity measurements based on the requirements.
The Euclidean distance is the distance calculated between two points.
Suppose we have a new data point. The task is to put the data point in the required category. Consider the below image:
Step 1: Choose the number of neighbors, say k=5.
Step 2: Calculate the Euclidean distance between the two data points.
By calculating the Euclidean distance we can find the nearest neighbors. In this case we get three nearest neighbors in category A and two nearest neighbors in category B. See below image:
Solution: As from the above image, the 3 nearest neighbors are from category A, hence the new data point must belong to category A.
Manhattan distance is a similarity distance measure, in which the distance between two points is calculated by the sum of the absolute differences of the points. It is given by:
Cosine similarity measures the similarity between two vectors or data points in multidimensional space. It is measured by the cosine of the angle between two vectors or data points. It determines whether these two vectors are pointing in the same direction. It is often used to measure similarity in text analysis.
Similarity among two points x and y is calculated as:
Use of Cosine Similarity over Euclidean Distance in Football Analysis
Euclidean distance is the distance between two data points A and B as shown in the below graph. It is used to find the distance between two real-valued vectors.
But Euclidean distance itself is not enough to find the player similarities. For example, as shown in the below graph, Ronaldo is close to Messi because they have high ratings in shots, speed or dribbles. But a young player like Joao Felix who has the same profile as Messi will be further away because his attributes are weaker but in the same proportion.
This is where cosine similarity comes into the picture. This measure of similarity between two vectors looks at the angle between them. It measures the similarity in the direction or orientation of the vectors and not the differences in their magnitude or scale.
Given two vectors a and b. The cosine similarity cos(θ) between vectors a and b is represented as:
θ being the angle between the vectors.
a.b is a dot product between x and y and is represented as:
||a|| and ||b|| represents magnitude of the vector is represented as:
In our example, as shown in the above figure, the angle between Messi and Joao Felix is smaller than the angle between Messi and Ronaldo. Even though they were further away. Smaller angle distance between two players is also taken as best player similarity rather than distance. Hence, cosine similarity helps us to capture correct similarities among players.
The K-Nearest Neighbor (KNN) algorithm is based on the Supervised Learning technique. KNN algorithm assumes the similarity among the new and the available data points and puts the new data point into the category that is most similar to the available categories.
KNN algorithm calculates Euclidean distance between two data points to group the data points into groups based on their similarities. Similarities among data points in multidimensional space can be calculated with different similarity measurements such as Euclidean distance, Cosine similarity, Manhattan distance, etc.
One has to choose the similarity measurements based on their requirements. For example, use cosine similarity over Euclidean distance to calculate similarity among players in Football analysis. This is because there will be players with less or more years of experience but with similar playing styles. Hence, use of Cosine similarity will be more useful.