# kNN: K-Nearest Neighbours Algorithm in R and Python

The purpose of the k-nearest neighbours algorithm (kNN) is to classify data. kNN is one of the simplest machine learning algorithms, and it is very useful for solving classification problems.
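The core idea can be sketched in a few lines of Python: to classify a new point, find the k training points closest to it and take a majority vote of their labels. (This is a minimal illustration of the technique, not an optimised implementation; the toy data below is made up.)

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [math.dist(query, x) for x in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among the neighbours' labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled 1 and 2
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [1, 1, 1, 2, 2, 2]
print(knn_predict(train_X, train_y, (2, 2), k=3))  # -> 1
print(knn_predict(train_X, train_y, (9, 9), k=3))  # -> 2
```

In practice, libraries such as R's `class` package or Python's scikit-learn handle the distance computation efficiently, which is what the rest of this tutorial uses.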

# kNN: R

Let’s start off by examining how this algorithm is formulated in R. Here, we are using a winery dataset (available from the UCI Machine Learning Repository).

To picture how kNN works: each observation is assigned to a group (in this case, a wine is categorised as a 1, 2, or 3), with a decision boundary then defining the region belonging to each category.

So, once the dataset is loaded, we proceed with the following steps:

### Max-Min Normalization

As in the earlier neural network example, we first “normalize” our data by putting all the independent variables (the attributes that influence which category a wine belongs to) on a common scale.

This is necessary because kNN is distance-based: without a common scale, variables with large ranges would dominate the distance calculation and kNN would not be able to meaningfully classify the variable of interest. We normalize as per the below:

```r
# Max-Min normalization: rescale each variable to the [0, 1] range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

maxmindf <- as.data.frame(lapply(df, normalize))
```
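For comparison with the Python section later on, the same max-min normalization can be sketched in Python with NumPy (the `alcohol` values below are made-up illustrative numbers):

```python
import numpy as np

def normalize(x):
    """Rescale a 1-D array to the [0, 1] range, as in the R function above."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

alcohol = np.array([11.0, 12.5, 13.0, 14.8])
print(normalize(alcohol))  # all values now lie between 0 and 1
```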


Once we have done that, we can verify that a scaled variable (in this case alcoholscaled) now lies in the 0–1 range:

```r
# Check the scaled range
summary(maxmindf$alcoholscaled)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.3507  0.5316  0.5186  0.6967  1.0000
```

### Training and Test Data

In order to test the validity of kNN as a classification algorithm, we then split our data into training and test sets. (A fixed split like this assumes the rows are already in random order; the raw UCI wine data is sorted by class, so shuffle it first if necessary.)

```r
# Training and test data
trainset <- maxmindf[1:100, ]
testset  <- maxmindf[101:178, ]

# Labels come from the original (unscaled) data frame,
# with the class label in column 1
trainset_labels <- mydata[1:100, 1]
testset_labels <- mydata[101:178, 1]
```

### kNN

Using the class library, we can then invoke the knn function to generate predicted wine classifications from the training dataset, which are then validated against the test dataset (a similar example appears on Analytics Vidhya).

```r
# kNN
# install.packages("class")
library(class)
knn_prediction <- knn(train = trainset, test = testset,
                      cl = trainset_labels, k = 10)

# Check model performance
# install.packages("gmodels")
library(gmodels)
CrossTable(x = testset_labels, y = knn_prediction, prop.chisq = FALSE)
```

Once the knn model is set up, the gmodels library's CrossTable function produces a cross-tabulation of actual against predicted labels:

```
Total Observations in Table:  78

               | knn_prediction
testset_labels |         1 |         2 |         3 | Row Total |
---------------|-----------|-----------|-----------|-----------|
             1 |        21 |         0 |         0 |        21 |
               |     1.000 |     0.000 |     0.000 |     0.269 |
               |     1.000 |     0.000 |     0.000 |           |
               |     0.269 |     0.000 |     0.000 |           |
---------------|-----------|-----------|-----------|-----------|
             2 |         0 |        33 |         1 |        34 |
               |     0.000 |     0.971 |     0.029 |     0.436 |
               |     0.000 |     1.000 |     0.042 |           |
               |     0.000 |     0.423 |     0.013 |           |
---------------|-----------|-----------|-----------|-----------|
             3 |         0 |         0 |        23 |        23 |
               |     0.000 |     0.000 |     1.000 |     0.295 |
               |     0.000 |     0.000 |     0.958 |           |
               |     0.000 |     0.000 |     0.295 |           |
---------------|-----------|-----------|-----------|-----------|
  Column Total |        21 |        33 |        24 |        78 |
               |     0.269 |     0.423 |     0.308 |           |
---------------|-----------|-----------|-----------|-----------|
```

From the above, we see that the model correctly classified 77 of the 78 test wines; the only error was a single category-2 wine labelled as category 3. In this regard, the kNN model has been successful at classifying our variable of interest, as validated against the test dataset.
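As a quick sanity check, the accuracy implied by the table can be computed from its diagonal (the counts below are copied from the output above):

```python
# Rows are true classes 1-3; correct classifications sit on the diagonal
confusion = [
    [21, 0, 0],   # true class 1
    [0, 33, 1],   # true class 2
    [0, 0, 23],   # true class 3
]
correct = sum(confusion[i][i] for i in range(3))
total = sum(sum(row) for row in confusion)
print(correct, total, round(correct / total, 3))  # 77 78 0.987
```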

# kNN: Python

Now, let's take a look at how this works in Python.

For this particular example, let us see how kNN can be used to classify breast cancer data, where each case is classified as either benign or malignant. The Introduction to Machine Learning with Python guide by Müller & Guido gives a much more detailed insight into how kNN works in Python, and I would highly recommend it if you want to dive deeper.

Essentially, we are using kNN to determine - based on available features in the dataset - whether the case for a specific person is determined as benign or malignant.

### 1. First, we load our libraries.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import mglearn
import matplotlib.pyplot as plt
from IPython.display import display
```

### 2. With the libraries imported, we can load the breast cancer data and inspect its properties.

```python
data = load_breast_cancer()
data.target[[10, 50, 85]]   # class labels for three example records
list(data.target_names)     # ['malignant', 'benign']
```

The full dataset (Breast Cancer Wisconsin, Diagnostic) is available from the UCI Machine Learning Repository.

### 3. We now partition our data into training and test data (80%-20% split).

```python
X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2
)
```
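One caveat with the cell above: the scaler is fitted on the full dataset before splitting, so information about the test set leaks into the scaling. A safer pattern (a sketch using scikit-learn's Pipeline; the `random_state=0` seed is an assumption for reproducibility) fits the scaler on the training data only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits MinMaxScaler on X_train only, then applies it to X_test
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X_train, y_train)
print("Test set score: {:.2f}".format(model.score(X_test, y_test)))
```

For a dataset this size the difference is usually small, but the pipeline form avoids the leakage and keeps preprocessing and model together.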

### 4. Using one nearest neighbour, we run the k-Nearest Neighbour algorithm.

```python
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
```

After fitting the model, knn.score reports a test set accuracy of 0.94, meaning that 94% of the test points were classified correctly (the exact figure will vary with the random split):

```
(455, 30) (455,)
(114, 30) (114,)
Test set score: 0.94
```
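With k=1, the model is quite sensitive to noise in the training data. A common next step (as Müller & Guido also illustrate) is to compare accuracy across several values of k, ideally on a separate validation split or via cross-validation. A sketch, scoring k from 1 to 10 on the held-out data (again, `random_state=0` is an assumption for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = MinMaxScaler().fit(X_train)  # fit on training data only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    scores[k] = knn.score(X_test_s, y_test)

best_k = max(scores, key=scores.get)
print("Best k:", best_k, "score:", round(scores[best_k], 3))
```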

### 5. We can now plot an illustration of the kNN output.

```python
mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.show()
```

The stars in the resulting plot represent test points, and the connecting lines illustrate the distance from each test point to its nearest training point. (Note that mglearn draws this illustration on a small synthetic dataset rather than on the breast cancer data.)

# Use of kNN

Essentially, kNN is one of the simpler classification algorithms. A classification problem is one where data points must be assigned to specific groups based on their attributes. Here, we used kNN to classify cases of breast cancer as either benign or malignant based on the features of each data point.

As we saw in the R example, kNN can be used in just the same way to classify wines into specific categories based on their attributes.

While there are more complex classification algorithms out there, k-nearest neighbours is a good starting point when attempting to solve classification problems.

## Code Scripts and Datasets

Hope you enjoyed this tutorial!

The full code is available by subscribing to my mailing list.

Upon subscription, you will receive full access to the codes and datasets for my tutorials, as well as a comprehensive course in regression analysis in both Python and R.