I need a new job

Data Analysis with open source tools – Kernel Density Estimates

Some datasets are now available from O’Reilly

Get them here

There is a note on the forums: “For data sets that are publicly available, he didn’t replicate the data set, but included the URLs where you can find those data sets. Also, he explained that not all figures in the book have an attached data set; many figures are function plots or otherwise dynamically generated.”


Data Analysis with Open Source Tools by Philipp K Janert is an interesting book if you are into … data analysis. It also pairs rather nicely with Programming Collective Intelligence by Toby Segaran.
One problem with the book (see here for more) is that some of the data is most notable by its absence, for example the number of months in office served by each of the (US) presidents. This data is even referred to in the chapter workshop on numpy, meaning that you can’t follow along.

The result: frustration as you open Chrome, find the data and parse it (or just approximate a portion of it). Compared to Collective Intelligence, where Segaran thoughtfully provides code to fetch and parse data alongside preprocessed snippets, this is an unwelcome process*.

Enough with the complaining! I made up some data and used it with the code provided. My approximated set looks like this:

1	1
9	1
18	1
29	2
34	1
36	1
41	1
50	1
51	2
52	14
57	1
62	1
69	2
90	1
92	1
94	10
98	1

The number of months is on the left, the number of observations on the right. I expanded the list to cover every number from 0 to 99, and I also created an array in which each number of months is repeated once per observation: [1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 57, 62, 69, 69, 90, 92, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 98]. This array is the data that I analysed using the KDE.
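If typing the array out by hand sounds tedious, here is a minimal sketch of how the expansion could be done with numpy. The names counts, months, observations and hist are my own, and the layout of hist (one count per number from 0 to 99, zero where nothing was observed) is my reading of "expanded list", so treat that part as an assumption.

```python
import numpy as np

# (months in office, number of observations) pairs from the table above
counts = [
    (1, 1), (9, 1), (18, 1), (29, 2), (34, 1), (36, 1), (41, 1), (50, 1),
    (51, 2), (52, 14), (57, 1), (62, 1), (69, 2), (90, 1), (92, 1),
    (94, 10), (98, 1),
]

# Split the pairs into two parallel integer arrays
months, observations = np.array(counts).T

# Repeat each month value once per observation to get the flat sample array
x = np.repeat(months, observations)

# Assumed layout of the expanded list: a count for every number 0..99
hist = np.bincount(x, minlength=100)

print(x.tolist())
```

np.repeat does the "14 copies of 52" part, and np.bincount fills in the zeros for the months nobody served.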

Janert uses numpy to do his analysis because it allows the function to be condensed into a single block of code that, under the covers, runs as C:

from numpy import *
def kde(z, w, xv): # z: position w: bandwidth xv: vector of points
    return sum(exp(-0.5*((z-xv)/w)**2)/sqrt(2*pi*w**2))

#convert the above array into a numpy array, allows broadcasting
x = array([1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 57, 62, 69, 69, 90, 92, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 98])

w = 0.8 # The bandwidth used in the book

# Now calculate the kde at 1000 evenly spaced points across the data
# (print is a function in Python 3, so it needs parentheses)
for pos in linspace( min(x), max(x), 1000):
    print(pos, kde(pos, w, x))

If you aren’t familiar with numpy.linspace, open a Python shell and do this:

from numpy import linspace
help(linspace)

This will tell you that min(x) is the start point, max(x) the end point and 1000 the number of samples to generate. The result is an array of 1000 evenly spaced numbers over the specified interval, with both end points included.
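The loop calls kde once per position, but broadcasting can also do the whole grid in one go. What follows is a sketch of that idea, not code from the book: kde_grid is a name I made up, and kde is the same function as above rewritten with an explicit np prefix. Each point contributes a Gaussian bump of width w that integrates to 1, so the curve integrates to roughly the number of points; divide by len(x) if you want a proper probability density.

```python
import numpy as np

def kde(z, w, xv):
    """Sum of Gaussian kernels of width w centred on each point in xv, at position z."""
    return np.sum(np.exp(-0.5 * ((z - xv) / w) ** 2) / np.sqrt(2 * np.pi * w ** 2))

def kde_grid(grid, w, xv):
    """Hypothetical helper: the same sum for every grid position at once."""
    # grid[:, None] has shape (m, 1) and xv[None, :] has shape (1, n),
    # so the difference broadcasts to an (m, n) array
    diffs = (grid[:, None] - xv[None, :]) / w
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi * w ** 2)
    # Sum over the data points, leaving one value per grid position
    return kernels.sum(axis=1)

# The approximated data set: each month value repeated once per observation
x = np.repeat(
    [1, 9, 18, 29, 34, 36, 41, 50, 51, 52, 57, 62, 69, 90, 92, 94, 98],
    [1, 1, 1, 2, 1, 1, 1, 1, 2, 14, 1, 1, 2, 1, 1, 10, 1],
)
w = 0.8

grid = np.linspace(x.min(), x.max(), 1000)
density = kde_grid(grid, w, x)

# The vectorised version should agree with the one-position-at-a-time loop
looped = np.array([kde(pos, w, x) for pos in grid])
print(np.allclose(density, looped))
```

For 42 points and 1000 positions the intermediate array is tiny, so this is about convenience rather than speed.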

Janert uses GnuPlot (he has written a book) and Matplotlib to produce images. I usually jump between the two depending on what I am doing. Here, for ease of use, I redirected the output to a file and plotted using GnuPlot:

plot [0:100][0:15] 'hist.txt' using 1:2 with boxes, 'foo1.5' using 1:2 with lines lt 3, 'foo2.5' using 1:2 with lines lt -1

where hist.txt is the expanded list and the fooi files are KDEs with the bandwidth set to i. lt -1/2/3 changes the colour of the lines used by GnuPlot, [0:100] sets the range of the x axis and [0:15] the range of the y axis. If you are on a Mac you can install gnuplot using brew; otherwise use your package manager.
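Rather than redirecting output by hand for each bandwidth, the three files could be written from Python. This is only a sketch under a few assumptions: write_gnuplot_inputs is a hypothetical helper, the two-column layout of hist.txt (number, count) is my guess at what the expanded list looked like, and the only things taken from the gnuplot command are the file names.

```python
import os
import numpy as np

def kde(z, w, xv):
    """Sum of Gaussian kernels of width w centred on each point in xv, at position z."""
    return np.sum(np.exp(-0.5 * ((z - xv) / w) ** 2) / np.sqrt(2 * np.pi * w ** 2))

def write_gnuplot_inputs(x, bandwidths, outdir="."):
    """Hypothetical helper: write hist.txt plus one foo<bandwidth> file per bandwidth."""
    # Assumed hist.txt layout: every number 0..99 with its observation count
    hist = np.bincount(x, minlength=100)
    table = np.column_stack([np.arange(len(hist)), hist])
    np.savetxt(os.path.join(outdir, "hist.txt"), table, fmt="%d")

    grid = np.linspace(x.min(), x.max(), 1000)
    paths = []
    for w in bandwidths:
        values = [kde(pos, w, x) for pos in grid]
        # f"foo{1.5}" gives "foo1.5", matching the names in the plot command
        path = os.path.join(outdir, f"foo{w}")
        np.savetxt(path, np.column_stack([grid, values]))
        paths.append(path)
    return paths

# The approximated data set: each month value repeated once per observation
x = np.repeat(
    [1, 9, 18, 29, 34, 36, 41, 50, 51, 52, 57, 62, 69, 90, 92, 94, 98],
    [1, 1, 1, 2, 1, 1, 1, 1, 2, 14, 1, 1, 2, 1, 1, 10, 1],
)
```

Calling write_gnuplot_inputs(x, [1.5, 2.5]) from the directory where you run gnuplot should leave hist.txt, foo1.5 and foo2.5 ready for the plot command above.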

The output should look something like this. Note that neither of the bandwidths is the one used by Janert (0.8) and that my data is truncated/an approximation.

[Image: KDE curves plotted over the histogram]

Now, this looks something like the image in the chapter!

THERE ARE OTHER WAYS!

There are two: the first uses gaussian_kde in scipy.stats, which is described here; the second uses R. In R, KDEs can be computed in two ways:

The first uses density, a function that ships with R. It works as follows:

d <- density(x) # where x is your dataset
plot(d)

I will amend this example to include my dataset but at the moment Leffe beer is getting the better of me.

The next method, and perhaps the easier one to understand, is the ks library. This library doesn’t ship with R, so install it using the package manager & make sure that the install dependencies tab is checked.

In this example we will use the faithful dataset (to find out more about faithful type the following at the R console):

 ?faithful

To generate the KDE all you need to do is:

library(ks)
attach(faithful)
h <- hpi(x=waiting)
fhat <- kde(x=waiting, h=h)
plot(fhat, drawpoints=TRUE)

But what does it all mean? I have no idea! I am not an R expert, but here’s what I get from this:
attach: adds the faithful dataset to the search path, which means that rather than typing ‘faithful$waiting’ you just type ‘waiting’
hpi: a plug-in bandwidth selector, it picks the bandwidth h from the data
kde: the kernel density estimation function
plot: creates the nice images.
To find out more about any of these functions (substituting its name for function) type

?function

in the R console. WARNING: R help is mostly code and can be a bit heavy.

The result should be something like this:
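Back on the Python side, here is a rough sketch of the scipy.stats route mentioned at the top of this section, run on my approximated presidents data. One thing to watch: a scalar bw_method is not the bandwidth itself, it is a factor that gets multiplied by the sample standard deviation, so to get a bandwidth of w I divide w by x.std(ddof=1). The result is also normalised to integrate to 1, so it matches the hand-rolled kde only after that one is divided by the number of points. The variable names are mine.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde(z, w, xv):
    """Hand-rolled version from earlier: sum of Gaussian kernels of width w at z."""
    return np.sum(np.exp(-0.5 * ((z - xv) / w) ** 2) / np.sqrt(2 * np.pi * w ** 2))

# The approximated data set: each month value repeated once per observation
x = np.repeat(
    [1, 9, 18, 29, 34, 36, 41, 50, 51, 52, 57, 62, 69, 90, 92, 94, 98],
    [1, 1, 1, 2, 1, 1, 1, 1, 2, 14, 1, 1, 2, 1, 1, 10, 1],
).astype(float)
w = 0.8

# A scalar bw_method scales the sample standard deviation,
# so this factor gives a kernel width of exactly w
factor = w / x.std(ddof=1)
estimate = gaussian_kde(x, bw_method=factor)

grid = np.linspace(x.min(), x.max(), 1000)
density = estimate(grid)

# Divide the hand-rolled sum by the number of points to normalise it
manual = np.array([kde(pos, w, x) for pos in grid]) / len(x)
print(np.allclose(density, manual))
```

Leave bw_method out and gaussian_kde picks a bandwidth for you, which is probably what you want unless you are trying to reproduce a specific figure.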

CAVEAT EMPTOR
I am not a mathematician, nor am I smart/intelligent/competent. I am playing about with this stuff for my own amusement and drinking a fairly potent beer at the same time. If you find a problem with this (it’s factually wrong or it doesn’t work for you) please let me know via the comments and I will fix things.


* Collective Intelligence would be an utter failure if Segaran hadn’t provided some form of data. Data Analysis looks like it can still be a good reference text, despite poor English (what are the O’Reilly editors doing? And yes, this entry isn’t well written) and unclear explanations of equations for the mathematically challenged.

