K-Means Clustering and Unsupervised Learning: Python and R

The k-means clustering algorithm groups observations around a set of cluster means (centroids): each observation is assigned to the cluster with the nearest mean, and the means are then recomputed until the assignments stabilise. This allows us to categorise the data into specific segments far more efficiently than manual inspection would.
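
To make the idea concrete, here is a minimal sketch (separate from the worked example below) that clusters six made-up two-dimensional points with sklearn; the points and the choice of two clusters are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

# six illustrative points forming two loose groups
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assigned to each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two cluster means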

Supervised vs. Unsupervised Learning

An important distinction in data science is the difference between supervised and unsupervised learning.

For instance, suppose I ask you your age. I don’t know the answer yet, but I know exactly what information I am looking for and what form the answer will take. This is analogous to supervised learning, where the target we are predicting is known in advance.

However, let us now suppose that I ask you, “Tell me something interesting about yourself”. This could be anything, and I have no way of estimating what you might tell me! This is an example of unsupervised learning, where there is no predefined target and any structure must be discovered from the data itself.

Clustering is typically employed when analysing data from an unsupervised learning standpoint; i.e. where the researcher does not start with a specific hypothesis about relationships in the data, but uses cluster analysis to shed light on potential groupings.

This type of analysis is useful in many situations where we wish to break data down into specific groups or networks, such as social network analysis, market segmentation, and so on.

My specific example looks at how clustering can be used to analyse stock price data. For another general example, I highly recommend Ben Larson’s tutorial on k-means clustering using student test result data, which applies many of the principles that I also use here.

In this particular example, we use a hypothetical dataset of stock returns and dividends to segment the data into different clusters, analysing the relative groupings of stocks based on their 1-year returns and dividend yields. We use the sample_stocks dataset (see end of post for details).
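
As a quick sanity check before running the analysis (a convenience snippet, not part of the walkthrough itself), you can peek at the file with pandas to confirm the two columns used throughout this post, returns and dividendyield:

import pandas
stocks = pandas.read_csv('sample_stocks.csv')
print(stocks.head())      # first few rows: returns, dividendyield
print(stocks.describe())  # summary statistics for each column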

Part 1: k-means clustering with sklearn in Python

Below is an example of how sklearn can be used in Python to build a k-means clustering model.

Here, you will learn how to:

  1. Import KMeans and PCA from the sklearn library
  2. Devise an elbow curve to select the optimal number of clusters (k)
  3. Generate and visualise the k-means clustering output

The particular example used here is that of stock returns. Specifically, the k-means scatter plot will illustrate the clustering of specific stock returns according to their dividend yield.

1. Import Libraries

Firstly, we import the pandas, pylab and sklearn libraries. pandas is used to import the dataset in CSV format, pylab is the plotting interface used in this example, and sklearn provides the clustering and PCA routines.

import pandas
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

2. Define Variables

Then, the ‘sample_stocks.csv’ dataset is imported, with our Y variable defined as ‘returns’ and our X variable defined as ‘dividendyield’. We also create normalised versions of each variable (mean-centred and scaled by the range), which are useful whenever the variables need to be placed on a comparable scale for PCA (Principal Component Analysis):

variables = pandas.read_csv('sample_stocks.csv')
Y = variables[['returns']]
X = variables[['dividendyield']]
# mean-centre each variable and scale it by its range
X_norm = (X - X.mean()) / (X.max() - X.min())
Y_norm = (Y - Y.mean()) / (Y.max() - Y.min())

3. Determine “k” value from elbow curve

The elbow curve is then graphed using the pylab library. Specifically, we fit a k-means model for each candidate number of clusters from 1 to 19, and our score list holds the value returned by KMeans.score for each fit, i.e. the negative of the within-cluster sum of squares (so values closer to zero indicate a tighter fit).

Nc = range(1, 20)  # candidate cluster counts: 1 to 19
kmeans = [KMeans(n_clusters=i) for i in Nc]
# KMeans.score returns the negative of the within-cluster sum of squares
score = [kmeans[i].fit(Y).score(Y) for i in range(len(kmeans))]
pl.plot(Nc, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

When we graph the plot, we see that the curve levels off rapidly after 3 clusters, implying that the addition of further clusters does not explain much more of the variance in our variable of interest; in this case, stock returns.

(Figure: elbow curve)
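
As an optional numeric cross-check of the visual elbow (a small sketch, not part of the original workflow), you can print how much the score improves with each additional cluster; the gains should drop off sharply after k = 3 on this data:

import numpy as np
gains = np.diff(score)  # improvement in score from adding one more cluster
for k, gain in zip(list(Nc)[1:], gains):
    print('k=%d: score improved by %.4f' % (k, gain))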

4. Devise PCA and k-means algorithms

The purpose behind these two algorithms is twofold. Firstly, the PCA algorithm is used to convert data that might be widely dispersed into linear combinations (principal components) that are easier to plot and interpret. Secondly, the k-means algorithm then groups the observations into the three clusters suggested by the elbow curve.

# each variable gets its own PCA fit, so the two projections are independent
pca_d = PCA(n_components=1).fit_transform(Y)  # returns axis
pca_c = PCA(n_components=1).fit_transform(X)  # dividend yield axis

From Step 3, we already know that the optimal number of clusters according to the elbow curve is 3. We therefore set n_clusters equal to 3, fit the k-means model, and plot the clusters using the PCA-transformed coordinates:

kmeans = KMeans(n_clusters=3)
kmeansoutput = kmeans.fit(Y)  # cluster on the returns variable
pl.figure('3 Cluster K-Means')
# colour each point by its assigned cluster label
pl.scatter(pca_c[:, 0], pca_d[:, 0], c=kmeansoutput.labels_)
pl.xlabel('Dividend Yield')
pl.ylabel('Returns')
pl.title('3 Cluster K-Means')
pl.show()

5. K-Means Output

(Figure: 3-cluster k-means scatter plot)

From the above, we see that the clustering reveals an overall positive relationship between stock returns and dividend yields, suggesting that stocks paying higher dividend yields have tended to see higher overall returns. While this is a simple example that could also be modelled through linear regression, there are many instances where relationships in data are not linear, and k-means can serve as a valuable tool for understanding the data through clustering methods.
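
If you want to go a step beyond the plot (again, an optional sketch rather than part of the original post), the fitted model can be inspected directly for the mean return of each cluster and the number of stocks assigned to each:

import numpy as np
print(kmeansoutput.cluster_centers_)      # mean return of each of the 3 clusters
print(np.bincount(kmeansoutput.labels_))  # number of stocks in each cluster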

Part 2: Use of wss and kmeans in R

Now, let’s look at how we can accomplish the above using R.

Here is how the within-groups sum of squares (wss) is used to first determine the number of clusters, and how kmeans is then used to conduct the k-means analysis.

6. wss: Determining Number of Clusters

Firstly, we must determine the appropriate number of clusters in R as follows:

# Read in the dataset (two numeric columns: returns and dividendyield)
sample_stocks <- read.csv('sample_stocks.csv')
set.seed(42)  # kmeans uses random starts, so fix the seed for reproducibility
# Determine number of clusters
wss <- (nrow(sample_stocks)-1)*sum(apply(sample_stocks,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(sample_stocks,centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

In the above code, we consider a maximum of 20 clusters by plotting 1:20. This maximum is arbitrary; we could equally have chosen, say, 5. Once this is done, we identify the ideal number of clusters at the point where our scree plot shows a clear cut-off, or "elbow":

(Figure: scree plot with elbow at 3 clusters)

From the above, we see that our scree plot cuts off at 3 clusters. This is not where the within-groups sum of squares is smallest (it keeps falling as clusters are added), but rather the point beyond which additional clusters yield only marginal reductions.

7. kmeans: Clustering Algorithm

We now compute our k-means algorithm and plot output, setting the number of specified clusters equal to 3:

# K-Means Cluster Analysis
fit <- kmeans(sample_stocks, 3) # 3 cluster solution
# get cluster means
aggregate(sample_stocks, by=list(fit$cluster), FUN=mean)
# append cluster assignment
sample_stocks <- data.frame(sample_stocks, fit$cluster)
head(sample_stocks)  # inspect the first few rows with the new fit.cluster column
sample_stocks$fit.cluster <- as.factor(sample_stocks$fit.cluster)
library(ggplot2)
ggplot(sample_stocks, aes(x=dividendyield, y=returns, color=fit.cluster)) +
  geom_point()

Upon running this code, we see that our data has been "clustered" into three separate segments. Even though we do have a few outliers, our data has largely been separated into distinct groups:

(Figure: ggplot2 scatter plot of the three k-means clusters)

From this graph, we can see that stocks with higher dividend yields have also tended to see higher returns. In this regard, k-means clustering can identify patterns in data that would not otherwise be apparent through traditional methods of analysis.

Conclusion

In this tutorial, you have learned:

  • How to identify the number of clusters using the elbow method and wss
  • How to generate a k-means clustering algorithm in both Python and R
  • How to use this algorithm to interpret data from an "unsupervised learning" perspective

Dataset

sample_stocks.csv