K-Nearest Neighbors (KNN)
The k-nearest neighbor algorithm is a machine learning algorithm used for both classification and regression problems. This blog explains how to use the algorithm in R.
Introduction
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm used to solve classification problems. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. Hence KNN captures the concept of similarity (sometimes referred to as distance). In KNN, we calculate the distance of a new data point from all points in the space, and out of all these distances we choose the K neighbors (the K points that have the minimum distance from the new data point).
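Before walking through the R workflow, here is a minimal from-scratch sketch of that idea on toy data (all names here are illustrative, not part of the analysis below):

```r
# Classify one new point by majority vote of its k nearest training points
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # labels of the k nearest training points
  nearest <- train_y[order(dists)[1:k]]
  # majority vote among the k neighbors
  names(which.max(table(nearest)))
}

set.seed(1)
train_x <- matrix(rnorm(20), ncol = 2)          # 10 toy points in 2D
train_y <- factor(rep(c("A", "B"), each = 5))   # two toy classes
knn_predict_one(train_x, train_y, new_x = c(0.2, -0.1), k = 3)
```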
Part-a-
STEP-1-
We are using breast cancer data in our analysis. The main objective is to predict whether a tumor is benign or malignant.
STEP-2-
Our dataset consists of 32 variables (30 quantitative variables, one qualitative variable, and one variable holding the IDs of the individuals).
Our response variable is Diagnosis.
| Benign | Malignant |
|--------|-----------|
| 62.7%  | 37.3%     |
Therefore, out of a total of 569 cases, 62.7% were benign, and 37.3% were malignant.
As we know, KNN uses distance as its measure of similarity, so variables on larger scales can dominate the distance calculation. Hence, we normalized the data so that all the variables are on the same scale.
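A sketch of this normalization step, assuming the data frame is named wbcd with the ID column already dropped and the 30 numeric predictors in columns 2:31 (these names are assumptions, not the original code):

```r
# Min-max normalization: rescales each column to [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# apply to the 30 numeric predictors (assumed to be columns 2:31)
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$radius_mean)  # each column now lies in [0, 1]
```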
Then we split the data into training and testing parts.
Training set- For training the model. Out of 569 cases, we trained the model on 469 cases.
Testing set- For validation of our model. We tested the model performance on the remaining 100 cases.
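One way to make the 469/100 split described above, continuing the assumed wbcd naming (row order is assumed to be random; otherwise shuffle first with sample()):

```r
# first 469 rows for training, remaining 100 for testing
wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]

# keep the class labels separately for class::knn
wbcd_train_labels <- wbcd$diagnosis[1:469]
wbcd_test_labels  <- wbcd$diagnosis[470:569]
```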
STEP-3-
We trained the KNN model on our training set and made predictions on our test set.
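A sketch of this step with class::knn, the standard KNN implementation in R. Note that knn() "trains" and predicts in a single call, because KNN simply stores the training data verbatim (variable names follow the earlier sketches):

```r
library(class)

# classify each test row by majority vote of its 21 nearest neighbors
wbcd_pred <- knn(train = wbcd_train, test = wbcd_test,
                 cl = wbcd_train_labels, k = 21)
```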
STEP-4-
We evaluated model performance by comparing the actual labels of individuals with the predicted labels, whether benign or malignant.
For k=21, we get an accuracy of 98%, or a misclassification rate of 2%.
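The evaluation can be done with a simple confusion matrix (assumed names as above):

```r
# rows: predicted class, columns: actual class
tab <- table(predicted = wbcd_pred, actual = wbcd_test_labels)
tab

# accuracy = share of correctly classified test cases
sum(diag(tab)) / sum(tab)
```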
STEP-5-
In order to give outliers more weight (z-scores do not compress extreme values into a fixed range), we used z-score standardization instead of min-max normalization.
In this case, for k=21 and using z-scores instead of the normalized scores, we see that accuracy has not changed: it is still 98%, or a misclassification error of 2%.
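A sketch of the z-score variant using the built-in scale(), which standardizes each column to mean 0 and standard deviation 1 instead of squashing it into [0, 1]:

```r
# z-score standardization of the 30 numeric predictors (assumed columns 2:31)
wbcd_z <- as.data.frame(scale(wbcd[2:31]))
summary(wbcd_z$radius_mean)  # mean ~0, sd 1; outliers keep their extreme values
```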
Then we tested model performance for three more values of k; a loop that automates these runs is sketched after the results below.
FOR K=3:
For k=3, we can see that accuracy is 92%, or the misclassification error is 8%.
FOR K=7:
For k=7, we can see that accuracy is 96%, or the misclassification error is 4%.
FOR K=19:
We can see that for k=19, model accuracy is 97% or the misclassification error is 3%.
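These repeated runs are easy to script; a compact loop sketch under the same assumed names as the earlier snippets:

```r
# evaluate accuracy for several candidate values of k
for (k in c(3, 7, 19, 21)) {
  pred <- knn(wbcd_train, wbcd_test, cl = wbcd_train_labels, k = k)
  acc  <- mean(pred == wbcd_test_labels)
  cat("k =", k, " accuracy =", round(100 * acc, 1), "%\n")
}
```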
Summary-
We learned about classification using k-nearest neighbors. Unlike many classification algorithms, KNN does not do any learning. It simply stores the training data verbatim. Unlabeled test examples are then matched to the most similar records in the training set using a distance function, and the unlabeled example is assigned the label of the majority of its neighbors. Also, the scaling of variables is very important in KNN as it is a distance-based method.
As the class of a new data point is decided by the majority vote of its k nearest neighbors, the value of k plays an important role in model performance.
| K  | Accuracy | Misclassification |
|----|----------|-------------------|
| 21 | 98%      | 2%                |
| 3  | 92%      | 8%                |
| 7  | 96%      | 4%                |
| 19 | 97%      | 3%                |
From the above table, we can conclude that model accuracy is maximum for k=21.
Part-b-
STEP-1-
We are using Hepatitis C data in our analysis. The main objective is to predict the baseline histological staging of each patient.
STEP-2-
Our dataset consists of 29 variables (20 quantitative variables and 9 qualitative variables).
Our response variable is Baselinehistological.staging.
We need to convert categorical variables into factors.
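A sketch of the factor conversion, assuming the data frame is named hcv; the predictor names listed are placeholders, except for the response, which the text names Baselinehistological.staging:

```r
# response variable: the four histological stages
hcv$Baselinehistological.staging <- as.factor(hcv$Baselinehistological.staging)

# the qualitative predictors get the same treatment
# (illustrative names, not the full list of 9)
cat_cols <- c("Gender", "Fever", "Jaundice")
hcv[cat_cols] <- lapply(hcv[cat_cols], as.factor)
str(hcv)

# note: class::knn needs numeric input, so these factors would be
# dummy-coded back to numbers before the distance computation
```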
Out of a total of 1385 cases, 24.3% were in stage 1, 24% were in stage 2, 25.6% in stage 3, and 26.1% in stage 4.
As we know, KNN uses distance as its measure of similarity, so variables on larger scales can dominate the distance calculation. Hence, we normalized the data so that all the variables are on the same scale.
Then we split the data into training and testing parts.
Training set- For training the model. Out of 1385 cases, we used 1185 cases in the training set.
Testing set- For validation of our model. We tested the model performance on the remaining 200 cases.
STEP-3-
We trained the KNN model on our training set and made predictions on our test set for k=47.
STEP-4-
We evaluated model performance by comparing the actual stage labels of individuals with the predicted stages.
For k=47, we get an accuracy of 23.5%, or a misclassification rate of 76.5%.
STEP-5-
In order to give outliers more weight, we again used z-score standardization instead of min-max normalization.
In this case, for k=47 and using z-scores instead of the normalized scores, we see that accuracy has improved to 28%, or a misclassification error of 72%.
Then we tested model performance for 3 more values of k.
FOR K=27:
For k=27, we can see that accuracy is 23%, or the misclassification error is 77%.
FOR K=15:
For k=15, we can see that accuracy is 23%, or the misclassification error is 77%.
FOR K=11:
We can see that for k=11, model accuracy is 24%, or the misclassification error is 76%.
Summary-
We learned about classification using k-nearest neighbors. Unlike many classification algorithms, KNN does not do any learning. It simply stores the training data verbatim. Unlabeled test examples are then matched to the most similar records in the training set using a distance function, and the unlabeled example is assigned the label of the majority of its neighbors.
Also, the scaling of variables is very important in KNN as it is a distance-based method.
As the class of a new data point is decided by the majority vote of its k nearest neighbors, the value of k plays an important role in model performance.
| K  | Accuracy | Misclassification |
|----|----------|-------------------|
| 47 | 23.5%    | 76.5%             |
| 27 | 23%      | 77%               |
| 15 | 23%      | 77%               |
| 11 | 24%      | 76%               |
From the above table, we can conclude that model accuracy is highest for k=11 (24%) with normalized data, and improves to 28% for k=47 when z-score standardization is used.
Since our overall accuracy is poor for every value of k, we should look for another classification algorithm that performs better on this dataset.