# K-Nearest Neighbours (KNN) Algorithm

K-Nearest Neighbours is a supervised machine learning algorithm used for both classification and regression problems, though it is mainly used for classification. The algorithm works by measuring the similarity between training samples and test samples using various distance metrics.

## Let’s take an example to understand the KNN algorithm

There are two classes in the dataset, cross and dot. We have to predict whether the red square belongs to the cross class or the dot class. By eye we might say that it belongs to the green region, but applying the algorithm we find that it belongs to the dot class.

### Figure 1

**Working of KNN:** The following example illustrates the concept of K and the working of the KNN algorithm. Suppose we have a very clean dataset, as shown in Figure 1.

### Figure 2

**For K=1**

For K=1 we take the single nearest neighbour of the data point marked with the red square, as shown in **Figure 2**.

With K=1, the single nearest neighbour of the red label belongs to the green class, so the red label is assigned to the green class with probability one. This makes the model prone to **overfitting**: the test point simply inherits the class of whichever labelled point is nearest. To reduce overfitting we generally take the **K value as odd** and **greater than 1**.

### Figure 3

**For K=3**

In the diagram above we can see the three nearest neighbours of the data point marked with the red square. Two of the three lie in the black dot class, so the red square is also assigned to the black dot class.


## Implementation of KNN

**We are using the Breast Cancer dataset for a better understanding of the concept.**

We first import some basic libraries. We use **pandas** to read the CSV file, **sklearn** to import the **KNN classifier** and split the data, and **Matplotlib** and **Seaborn** for data visualization. After importing these libraries, we take a look at the dataset and the diagnosis (target) column.
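The original screenshots are not reproduced here; a minimal, self-contained sketch of the imports and data loading might look like the following. It uses scikit-learn's built-in copy of the Breast Cancer dataset in place of the post's CSV file (an assumption, since the original file path is not shown).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer  # stand-in for the original CSV

# Load the Breast Cancer dataset as a DataFrame (30 feature columns + 'target')
data = load_breast_cancer(as_frame=True)
df = data.frame
print(df.shape)   # (569, 31)
print(df.head())
```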

As shown above, the target column (**Diagnosis**) contains only two classes: class 1 is the **benign (B) stage** and class 2 is **malignant (M)**. To plot a scatter graph, we first convert the categorical feature (Diagnosis) into **0 and 1**, where 0 means benign and 1 means malignant, and then plot a bar graph or scatter plot for better visualization.

After mapping the categorical feature, the target takes the values 0 and 1, as shown in the diagram above. We now have the features in X and the diagnosis values in the target data frame.
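The mapping step described above can be sketched with a hypothetical slice of the Diagnosis column (the sample values here are illustrative, not the actual rows of the dataset):

```python
import pandas as pd

# Hypothetical slice of the 'Diagnosis' column: 'B' = benign, 'M' = malignant
diagnosis = pd.Series(['M', 'B', 'B', 'M', 'B'], name='Diagnosis')

# Map the categorical labels to numbers: 0 = Benign, 1 = Malignant
target = diagnosis.map({'B': 0, 'M': 1})
print(target.tolist())  # [1, 0, 0, 1, 0]
```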

## Data Preprocessing

Data preprocessing plays an important role in machine learning, enhancing the accuracy and performance of the model. As the output above shows, our dataset does not contain any NaN values.
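A quick way to confirm the absence of missing values, sketched here against sklearn's copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
missing = df.isnull().sum().sum()  # total NaN count across all columns
print(missing)  # 0 -> no missing values
```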

## Data Visualization

Using a **Seaborn** countplot, we can visualize the counts of the target values as shown.

**Scatter plots of some columns against the target value (Diagnosis) are shown as follows:**
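A hedged sketch of both plots, again using sklearn's numeric `target` column as the 0/1 diagnosis; the two feature columns chosen here (`mean radius`, `mean texture`) are illustrative, since the post does not say which columns it plotted:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

# Count plot of the two target classes
sns.countplot(x='target', data=df)
plt.savefig('target_counts.png')
plt.clf()

# Scatter plot of two feature columns, coloured by the target class
sns.scatterplot(x='mean radius', y='mean texture', hue='target', data=df)
plt.savefig('scatter.png')
```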

## Splitting up the data in train and test

To test the model's accuracy or score, we need held-out data, so we split our dataset in some ratio. Generally, we take **80%** of the data for training and **20%** for testing. To split the data, we first import the helper from sklearn.
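The 80/20 split described above can be sketched as follows (the `random_state` value is an arbitrary choice for reproducibility, not taken from the post):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 80% train / 20% test; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```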

## Feature Scaling

After scaling the data, we fit the KNN algorithm imported from sklearn, supplying values for parameters such as:
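The post does not show its scaling code; a common approach, sketched here with `StandardScaler` (an assumption), fits the scaler on the training data only and reuses its statistics on the test data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(round(X_train_scaled.mean(), 6), round(X_train_scaled.std(), 6))  # ~0.0 ~1.0
```

Scaling matters for KNN in particular, because distance metrics are dominated by features with large numeric ranges.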

### 1. n_neighbors

We can supply the `n_neighbors` value to our classifier; the score of the model depends on how many neighbours are considered when classifying the test data.

### Optimal value for K

In KNN we prefer odd values of K, both for better accuracy and so that the vote between two classes cannot tie. To find the best K we can use cross-validation via sklearn, which returns an array of scores: the data is split into K folds, and each fold in turn is held out for testing while the remaining folds are used for training. We average the cross-validation scores for each candidate K and pick the K whose average score is maximum.

We traverse K from 1 to 19 in increments of 2, so `i` always takes odd values between 1 and 20. Let's visualize the plot of the mean `cross_val_score`.
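A self-contained sketch of this odd-K search, assuming standardized features and 5-fold cross-validation (the post does not state its fold count):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

k_values = list(range(1, 20, 2))  # odd K from 1 to 19
mean_scores = []
for k in k_values:
    # Average 5-fold cross-validation accuracy for this K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores.append(scores.mean())

best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```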

As you can see in the graph above, the score is **maximum (1)** for **K=7**, so we take K as 7 when we apply the KNN classifier for model building.

## Model Building

In model building, we fit our training data on the **KNN classifier** from `sklearn.neighbors`.
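Putting the split, scaling, and fit together, a sketch of the model-building step (K=7 taken from the cross-validation result above; split and scaler details are assumptions) might look like:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# K=7 as chosen by the cross-validation step
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)
print(preds[:5])
```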

## Score of model
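The score step shown in the post's screenshot can be sketched as follows (same assumed split and scaling as above); `score` returns the fraction of correctly classified test samples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # fraction of correct test predictions
print(accuracy)
```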

## Confusion Matrix

The confusion matrix, also known as the error matrix, is a table whose rows correspond to the true classes and whose columns correspond to the predicted classes; each cell counts the samples that fall into that combination, summarizing the performance of the algorithm.
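A sketch of computing the confusion matrix for the fitted model (same assumed split, scaling, and K=7 as above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test))
print(cm)  # rows = true classes, columns = predicted classes
```

The diagonal cells count correct predictions; the off-diagonal cells count the two kinds of misclassification.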
