kNN: K-Nearest Neighbours Algorithm in R

A k-nearest neighbours (kNN) algorithm classifies an observation according to the classes of the training observations closest to it. kNN is one of the simplest machine learning algorithms, and is very useful when it comes to solving classification problems.

Here, we are using a winery dataset (available from the UCI Machine Learning Repository).

 
The image above (a simplified graph of how kNN works) shows each observation being assigned to a specific group (in this case a wine is categorised as 1, 2, or 3), with a boundary then defining the area for each category.
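To make the idea concrete, here is a minimal sketch of the kNN logic in base R: measure the distance from a new point to every training point, keep the k closest, and take a majority vote. The function `simple_knn` and the toy data are hypothetical and written only for illustration; they are not part of the wine analysis, which uses the knn function from the class package.

```r
# Minimal kNN sketch in base R (illustrative only; all names are hypothetical)
simple_knn <- function(train, test_point, labels, k = 3) {
  # Euclidean distance from the test point to every training row
  diffs <- sweep(as.matrix(train), 2, as.numeric(test_point))
  distances <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training observations
  nearest <- labels[order(distances)[seq_len(k)]]
  # Majority vote (a tie goes to whichever label table() lists first)
  names(which.max(table(nearest)))
}

# Toy data: two well-separated groups in two dimensions
toy_train <- data.frame(x = c(0.1, 0.2, 0.15, 0.8, 0.9, 0.85),
                        y = c(0.1, 0.15, 0.2, 0.8, 0.85, 0.9))
toy_labels <- c(1, 1, 1, 2, 2, 2)

pred_a <- simple_knn(toy_train, c(0.2, 0.2), toy_labels, k = 3)  # near group 1
pred_b <- simple_knn(toy_train, c(0.7, 0.9), toy_labels, k = 3)  # near group 2
print(c(pred_a, pred_b))
```

The first point sits among the group 1 observations and the second among the group 2 observations, so the votes should come out as "1" and "2" respectively.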

Once the dataset is loaded, we proceed with the following steps:

Max-Min Normalization

As in the example that we used for the neural network illustration, we first “normalize” our data by creating a common scale for all the independent variables (the attributes that influence whether or not a wine belongs in a specific category).

This is necessary because kNN classifies by distance: without a common scale, the variables with the largest numeric ranges dominate the distance calculation, and kNN cannot meaningfully classify the variable of interest. We normalize as follows:

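#Max-Min Normalization
#Rescale a numeric vector to the range [0, 1]
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

#Apply to every column (df is assumed to be a data frame of numeric variables)
maxmindf <- as.data.frame(lapply(df, normalize))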

Please see this link for further reference on how to use the normalization function.
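As a small worked example of why the common scale matters, the sketch below compares the distance between two observations before and after max-min normalization. The data frame `toy` and its values are made up for illustration; only the `normalize` function comes from the code above.

```r
# Same max-min function as above, repeated so this example runs on its own
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical values on very different scales
toy <- data.frame(alcohol = c(12, 13, 14),
                  proline = c(500, 1000, 1500))

# Unscaled: the distance between rows 1 and 2 is driven almost entirely by proline
raw_dist <- as.matrix(dist(toy))[1, 2]

# Scaled: both variables now contribute equally
scaled <- as.data.frame(lapply(toy, normalize))
scaled_dist <- as.matrix(dist(scaled))[1, 2]

print(round(c(raw = raw_dist, scaled = scaled_dist), 3))
```

Before scaling, the distance is roughly 500 and the one-unit difference in alcohol is effectively ignored. After scaling, each variable runs from 0 to 1 and the distance is about 0.707, with both variables contributing the same amount. Note that `normalize` divides by zero if a column is constant, so such columns would need to be dropped or handled separately.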

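Once we have done that, we can verify that a scaled variable (in this case alcoholscaled) now ranges from 0 to 1. Note that this checks the scaling, not whether the variable is normally distributed:

#Check that the variable is scaled to [0, 1]
summary(maxmindf$alcoholscaled)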

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.3507  0.5316  0.5186  0.6967  1.0000 

Training and Test Data

In order to test the validity of kNN as a classification algorithm, we then split our data into training and test data:

#Training and Test Data
trainset <- maxmindf[1:100, ]
testset <- maxmindf[101:178, ]

#Labels
trainset_labels <- mydata[1:100, 1]
testset_labels <- mydata[101:178, 1]
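Splitting by row position as above only gives a fair test if the rows are already in random order. If that cannot be assumed (for instance, if a file happens to be sorted by class), a random split is a safer alternative. The sketch below shows one way to do this; the objects `features` and `labels` are hypothetical stand-ins for maxmindf and the first column of mydata, filled with random values so that the example runs on its own.

```r
set.seed(123)  # arbitrary seed, used only to make the split reproducible

# Hypothetical stand-ins for the scaled predictors and the class labels
n <- 178
features <- data.frame(a = runif(n), b = runif(n))
labels <- sample(1:3, n, replace = TRUE)

# Draw 100 row indices at random for training; the rest form the test set
train_idx <- sample(seq_len(n), size = 100)

trainset <- features[train_idx, ]
testset <- features[-train_idx, ]
trainset_labels <- labels[train_idx]
testset_labels <- labels[-train_idx]

print(c(train = nrow(trainset), test = nrow(testset)))
```

The same index vector is used for both the predictors and the labels, which keeps each observation paired with its own class.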

kNN

Using the class package, we can then call the knn function to predict the wine class of each test observation from the training dataset. These predictions are then validated against the actual labels of the test dataset (as also done in the example of Analytics Vidhya).

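#KNN
#install.packages("class")
library(class)
#Predict the class of each test row from its 10 nearest training rows
knn_prediction <- knn(train = trainset, test = testset, cl = trainset_labels, k = 10)

#Check Model Performance
#install.packages("gmodels")
library(gmodels)
#Cross-tabulate actual labels (rows) against predicted labels (columns)
CrossTable(x = testset_labels, y = knn_prediction, prop.chisq = FALSE)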

The CrossTable function from the gmodels package produces a cross-tabulation (confusion matrix) of the actual test labels against the kNN predictions:

Total Observations in Table:  78 

 
               | knn_prediction 
testset_labels |         1 |         2 |         3 | Row Total | 
---------------|-----------|-----------|-----------|-----------|
             1 |        21 |         0 |         0 |        21 | 
               |     1.000 |     0.000 |     0.000 |     0.269 | 
               |     1.000 |     0.000 |     0.000 |           | 
               |     0.269 |     0.000 |     0.000 |           | 
---------------|-----------|-----------|-----------|-----------|
             2 |         0 |        33 |         1 |        34 | 
               |     0.000 |     0.971 |     0.029 |     0.436 | 
               |     0.000 |     1.000 |     0.042 |           | 
               |     0.000 |     0.423 |     0.013 |           | 
---------------|-----------|-----------|-----------|-----------|
             3 |         0 |         0 |        23 |        23 | 
               |     0.000 |     0.000 |     1.000 |     0.295 | 
               |     0.000 |     0.000 |     0.958 |           | 
               |     0.000 |     0.000 |     0.295 |           | 
---------------|-----------|-----------|-----------|-----------|
  Column Total |        21 |        33 |        24 |        78 | 
               |     0.269 |     0.423 |     0.308 |           | 
---------------|-----------|-----------|-----------|-----------|

From the above, we see that our model correctly classified 77 of the 78 wines in the test set (about 98.7%); the only error was one wine belonging to category 2 that was labelled as category 3. In this regard, the kNN model has been successful at classifying our variable of interest, based on comparing its predictions with the actual labels of the test dataset.
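The same figures can be reproduced with a few lines of base R. The sketch below rebuilds the counts from the cross-tabulation above as a matrix (the object name `confusion` is my own) and computes the overall accuracy and the share of each actual class that was classified correctly.

```r
# Counts copied from the cross-tabulation above (rows = actual, columns = predicted)
confusion <- matrix(c(21,  0,  0,
                       0, 33,  1,
                       0,  0, 23),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(actual = c("1", "2", "3"),
                                    predicted = c("1", "2", "3")))

# Overall accuracy: correctly classified wines (the diagonal) over all wines
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of each actual class that was classified correctly
per_class <- diag(confusion) / rowSums(confusion)

print(round(accuracy, 3))
print(round(per_class, 3))
```

The per-class values correspond to the row proportions on the diagonal of the table (1.000, 0.971, and 1.000). With the actual objects from the analysis, an expression such as mean(knn_prediction == testset_labels) should give the same overall accuracy without rebuilding the table by hand.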
