k-Nearest Neighbors: Analysis of Breast Cancer Data


We have already seen how the k-Nearest Neighbors algorithm can be used to solve classification problems in R – the example there was the wine classification dataset from the UCI Machine Learning Repository.

Here, let us see how kNN can be used to classify breast cancer data, where each case is classified as either benign or malignant. Introduction to Machine Learning with Python by Müller & Guido gives a much more detailed insight into how kNN works with Python, and I would highly recommend it if you want to dive deeper.

Essentially, we are using kNN to determine – based on the available features in the dataset – whether a specific case is benign or malignant.

1. Firstly, we load our libraries.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import mglearn
import matplotlib.pyplot as plt
from IPython.display import display

2. Now that we have imported the loader from sklearn, we can load the breast cancer data and inspect its properties.

data = load_breast_cancer()
print(data.target[[10, 50, 85]])
print(list(data.target_names))
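Before modelling, it is worth getting a quick overview of the dataset's size and class balance. A short sketch of that inspection:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Feature matrix: 569 samples, each described by 30 numeric features
print(data.data.shape)

# Class labels: 0 = malignant, 1 = benign
print(list(data.target_names))

# Count of samples per class: 212 malignant, 357 benign
print(np.bincount(data.target))
```

The class distribution matters for kNN, since an imbalanced neighbourhood can bias votes towards the majority class.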

The full dataset is available from the UCI Machine Learning Repository.

3. We now partition our data into training and test data (80%-20% split).

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                    test_size=0.2)
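One caveat with the snippet above: fitting MinMaxScaler on the whole dataset before splitting lets information from the test set influence the scaling. A cleaner pattern fits the scaler on the training data only, for example via a pipeline. A sketch of that approach (the random_state value here is an arbitrary choice for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation to the test data at prediction time.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X_train, y_train)
print("Test set score: {:.2f}".format(pipe.score(X_test, y_test)))
```

With a distance-based model like kNN, scaling is essential either way, since features on large scales would otherwise dominate the distance calculation.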

4. We now run the k-Nearest Neighbour algorithm using a single neighbour (n_neighbors=1).

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

After fitting the model and calling knn.score on the held-out data, we obtain a test set score of about 0.94 – meaning that 94% of the test points were classified correctly (the exact figure varies between runs, since train_test_split shuffles the data):

(455, 30) (455,)
(114, 30) (114,)
Test set score: 0.94
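A single neighbour can overfit, so a common next step (and one explored in Müller & Guido) is to compare accuracy across a range of k. A sketch using the same scaled split, with a fixed random_state (my assumption, for reproducibility) and an arbitrary range of k from 1 to 10:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0)

# Fit a kNN model for each k and record its test accuracy
scores = []
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
    print("k = {:2d}: test score = {:.3f}".format(k, scores[-1]))
```

Plotting these scores against k typically shows a sweet spot: very small k overfits to noise, while very large k over-smooths the decision boundary.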

5. We can now plot the kNN output. Note that mglearn's helper draws the algorithm on a small synthetic dataset rather than on the breast cancer data, so the figure is purely illustrative of how kNN classifies.

mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.show()

The stars shown in the plot represent the test data, and the lines drawn to each star illustrate the distance between the test points and their nearest training points.

Use of kNN

Essentially, kNN is one of the simpler classification algorithms. A classification problem is one where data points must be assigned to specific groups based on their attributes. Here, we are using kNN to classify cases of breast cancer as either benign or malignant based on the features of each data point.
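To make the idea concrete, a minimal 1-nearest-neighbour classifier can be written in a few lines of NumPy: for each query point, find the closest training point by Euclidean distance and copy its label. This is an illustrative sketch (the function name one_nn_predict is my own), not how scikit-learn implements kNN internally:

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    """Predict by copying the label of the nearest training point."""
    preds = []
    for x in X_query:
        # Euclidean distance from x to every training point
        dists = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

# Tiny worked example: two well-separated clusters on a line
X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
print(one_nn_predict(X_train, y_train, np.array([[0.5], [10.5]])))  # → [0 1]
```

For k > 1 the same idea applies, except the k nearest labels take a majority vote rather than a single point deciding.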

As we saw in the earlier R example, kNN can be used in the same way to classify wines into specific categories based on their attributes.

While there are more complex classification algorithms out there, k-nearest neighbors is a good starting point for solving classification problems.
