Simple Clustering with Python and R

Clustering: it’s fun, but it can be misleading. There’s a chapter dedicated to it in Data Analysis with Open Source Tools by P. Janert, where he demonstrates simple usage of Pycluster.

The functions that Janert shows are kcluster, clustercentroids and kmedoids. The kcluster code outputs the silhouette coefficient but not the nice graphic. The kmedoids code (which lets you supply your own distance metric) outputs the points grouped by cluster, but again no graph. I modified the code slightly to use pylab to plot a figure something like the one shown in the chapter.

import Pycluster as pc
import numpy as np
import sys
from pylab import plot, show, scatter

# There is probably a better way to convert the data to a pylab-acceptable
# format, but this is simple and works
def toxy(data):
    # Split rows of (x, y) pairs into separate lists of x and y values
    x, y = [], []
    for a in data:
        x.append(a[0])
        y.append(a[1])
    return x, y

# Read data filename and desired number of clusters from command line
filename, n = sys.argv[1], int( sys.argv[2] )

data = np.loadtxt( filename )

datax,datay = toxy(data)

# Perform clustering and find centroids
clustermap, _, _ = pc.kcluster( data, nclusters=n, npass=50 )
centroids, _ = pc.clustercentroids( data, clusterid=clustermap )

clusterx, clustery = toxy( centroids )

scatter(datax, datay, marker='+')
scatter(clusterx, clustery, c='r')
show()  # display the figure

# removed silhouette code - it takes several seconds to calculate and I can't be bothered to wait.

The code now generates a scatter plot with the original dataset and however many clusters you asked the code to generate (see below).
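The silhouette calculation I dropped from the script is slow because it compares every point with every other point. Here is a minimal pure-Python sketch of the standard definition, not Janert's code: the function name silhouette and its arguments are my own, and it assumes Euclidean distance and at least two clusters.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Group point indices by cluster label
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for a one-point cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

In the script above, points would be data and labels would be clustermap. Values near 1 suggest tight, well-separated clusters, and negative values suggest points sitting in the wrong cluster. The nested loops are quadratic in the number of points, which is where the waiting comes from.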

Given that R exists, I figured I would also take a look at how to reproduce the results with it. It turns out to be rather simple.

# Both packages are attached by default; loaded explicitly here for clarity
library(stats)
library(graphics)
data <- read.table("ch13_workshop")
km <- kmeans(data, 7,  nstart=50)
# Several bits of summary data are calculated
# Run in interactive mode to see it

# create a nice plot: data coloured by cluster, centres overlaid in red
plot(data, col=km$cluster)
points(km$centers, col='red')

In this case, using the default save options, we get a PDF, which I converted to a JPG using Preview.

As with most things R, there are plenty of modifications you can make to the functions. Rather than explain them here, I'll point you to the help doc: in R, type ?kmeans
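Both calls above ask for 50 runs (npass=50 in Pycluster, nstart=50 in R) because k-means starts from a random configuration and can settle into a poor local optimum. The usual remedy is to repeat the run and keep the tightest result. A rough pure-Python sketch of that idea follows. The names kmeans_once and kmeans_restarts are mine, and picking k random data points as starting centres is an assumption; the libraries may initialise differently.

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, iters=100):
    # Assumption: start from k distinct data points chosen at random
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centre
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                  for p in points]
        # Move each centre to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    # Within-cluster sum of squares: lower means tighter clusters
    cost = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return cost, labels, centers

def kmeans_restarts(points, k, nstart=50, seed=0):
    # Run k-means nstart times and keep the run with the lowest cost
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(nstart))
    return min(runs, key=lambda run: run[0])
```

Calling kmeans_restarts(points, 7) on a list of (x, y) tuples returns the cost, the cluster label of each point and the centres of the best run.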

* Note to self *
There is also access to hierarchical clustering in R via the hclust function. This differs from kmeans in that you have to provide a dist object:

d <- dist(data)
h <- hclust(d)

Again, for further help see the included doc: ?hclust.
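To make the dist-then-cluster split concrete, here is a naive pure-Python sketch of agglomerative clustering that only ever looks at the pairwise distance matrix. It uses complete linkage, which is the hclust default, and stops once k clusters remain, roughly what cutree(h, k) gives you in R. The function names are my own, and this is an illustration rather than how hclust is implemented.

```python
import math

def pairwise_distances(points):
    # Plays the role of R's dist(): Euclidean distance between every pair
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [[dist(p, q) for q in points] for p in points]

def complete_linkage(d, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(d))]  # start with one cluster per point
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # distance between any two of their members
                gap = max(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Because the clustering step only sees the distance matrix, you can swap in any distance measure without touching it.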

In addition, both Pycluster and R provide access to Kohonen or self-organising maps (SOMs). In Pycluster you call somcluster, while with R you need to load the kohonen library with library(kohonen). I like the R package for SOMs as it provides some nice visualisations and the option of unsupervised or supervised SOMs. To find out more, type ??kohonen in an R shell. I have a post on SOMs in the pipeline, the result of having read a published paper that manages to do several things very wrong and/or skip important details.