
Connecting Hive and R via MySQL using Sqoop

Rrrrrr – because pirates > ninjas

This is all a bit backwards: why would I want to take data out of cascalog/hdfs into MySQL and then back into hdfs/Hive? Anyway, the general aim here is to get some data from Hive into R and do something with it.

Step 1: Get Data
In the first step we will grab some data from an existing MySQL database, created when doing a demo of cascalog, and load it into Hive using a tool called Sqoop. Sqoop is designed to move large quantities of data between hdfs and structured datastores.

Download sqoop and cd into bin/

./sqoop import --connect jdbc:mysql://localhost/cascalog  --username dan  --table unique_user_counts -m 1 --hive-import

This will import data from the cascalog MySQL database, specifically the unique_user_counts table. The --hive-import switch places the data into Hive (I installed Hive using Homebrew). The -m flag sets the number of mappers the job will use, in this instance just 1.

You can take a peek at the data using the Hive repl.

λ bin  hive
hive> show databases;
OK
default
Time taken: 2.48 seconds
hive> use default;
OK
Time taken: 0.014 seconds
hive> show tables;
OK
unique_user_counts
Time taken: 0.286 seconds
hive> select * from unique_user_counts;
...

This dumps out the data. If you want to know more about your table you can use the describe 'table_name' command (e.g. describe unique_user_counts).

To get data into R, Hive must be started in server mode; this can be done with

hive --service hiveserver

The final step is R …

library(RJDBC)

# the Hive JDBC driver needs both the Hadoop and Hive jars on the classpath
hadoop_jars <- list.files("/usr/local/Cellar/hadoop/1.1.1/libexec", pattern="jar$", full.names=T)
hive_jars   <- list.files("/usr/local/Cellar/hive/0.9.0/libexec/lib", pattern="jar$", full.names=T)

hive_driver <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", c(hive_jars, hadoop_jars))

hive_conn <- dbConnect(hive_driver, "jdbc:hive://localhost:10000/default")

rs <- dbGetQuery(hive_conn,"select client, count from unique_user_counts")

x <- 1:length(rs$count)
plot(x, rs$count)

This gives me a very boring plot. It's really of no interest at all; it's only marginally more interesting than doing a select count(*) from …

The *interesting* part here is the definition of the jars required to connect to Hive; this is largely identical to a previous post.

Connect to Lucid using R

*will work for any JDBC db*

First you will need to install the RJDBC library. I did this using the R package manager (remember to include all dependencies).

The following lines connect to my local LucidDB instance, select the number of users from my pretend visitors table and store the result in the 'result' variable.

library(RJDBC)
driver <- JDBC("org.luciddb.jdbc.LucidDbClientDriver", "/Users/dan/Desktop/LucidDbClient.jar", identifier.quote="'")
conn <- dbConnect(driver, "jdbc:luciddb:http://localhost:8034", "sa", "")
result <- dbGetQuery(conn, "select count(*) from visitors")
dbDisconnect(conn)

Remember, be a good person and close your connection :)
You can find out more about RJDBC here: http://cran.r-project.org/web/packages/RJDBC/index.html

Coloured horizontal bar graphs

R is great for creating acceptable-looking graphs, much more so than GnuPlot (which is also awesome). R's base plots are decent; however, there is a library called ggplot2 that is even better. I wanted to create some simple graphs of data and add a little colour: red for +ve values, blue for -ve. This is fairly simple to achieve with R and ggplot2.

I can’t share the data that I am using or explain what it is. I load the file contents into a data.frame and then use the names to extract an entry of interest.


setwd("path/to/data/dir")
library(ggplot2)

data <- read.table("data", h=T, sep=",")

This reads the data in.


plot <- ggplot(data, aes(fill=componentToFill, x=X, y=component)) + 
geom_bar(stat="identity") +   # bar heights come from the data values, not counts
coord_flip() + 
scale_y_continuous("") + 
scale_x_discrete("") + 
scale_fill_gradient2(low="blue", high="red")

component and componentToFill are columns in your data frame; again, I've had to change the names despite the fact you don't have the data (I don't want to get in trouble).
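
As a rough sketch of the idea with made-up column names (none of these exist in my data), the fill aesthetic can simply be the value itself, so the gradient runs blue for negative and red for positive:

library(ggplot2)

# hypothetical data: six components with positive and negative values
df <- data.frame(name  = letters[1:6],
                 value = c(3, -2, 5, -4, 1, -1))

example_plot <- ggplot(df, aes(x=name, y=value, fill=value)) +
	geom_bar(stat="identity") +                    # bar height taken from the values
	coord_flip() +                                 # horizontal bars
	scale_fill_gradient2(low="blue", high="red")   # blue below zero, red above

example_plot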


ggsave(filename="example.jpg", plot=plot, dpi=1000, width=7, height=5)

It’s probably a good idea to look at the plot; in this case we save it to a file called example.jpg.

The image I had looks like this.


Sort R data.frame by column

Load in your data and convert it to a data.frame. I assume you have headers in your file; if you don't, change TRUE to FALSE in the following:

data <- read.table("myfile.txt", header=True)

If you don’t want to keep typing data$column_name

attach(data)

To list the names of the columns, for reference

names(data)

To sort the data.frame by a column, in this case dmFactorRMS

data[order(data$dmFactorRMS),]

If you’ve bothered to attach the data you can do

data[order(dmFactorRMS), ]

This won’t modify your input, so you will need to create a new variable

sorted_data <- data[order(dmFactorRMS), ]
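
If you want the sort the other way round, order takes a decreasing flag (this works whether or not you've attached the data):

# sort descending instead of ascending
sorted_desc <- data[order(dmFactorRMS, decreasing=TRUE), ]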

Use R to access a mysql db

It’s fairly simple to connect to a MySQL database using R.

You may need to install the RMySQL library. Use the Package Manager to do so.

# require the mysql interface
library("RMySQL")
m <- dbDriver("MySQL")

#what database do you want to connect to
#this is just like Python and Perl connections
con <- dbConnect(m, password="your-passwd", dbname="your-db")

# prepare and send a query
rs <- dbSendQuery( con, statement = "select x from y where x > 10")

# get all the data. n = -1 means return all data
data <- fetch( rs, n = -1)

# now do something with the data ...
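
When you're done, tidy up the result set and the connection (same etiquette as the LucidDB example above):

# free the result set and close the connection when finished
dbClearResult(rs)
dbDisconnect(con)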

It’s that simple. However, like many things, the more you look at it the more you realise you don’t know. To explore further open R and type ??rmysql or ??mysql.

Simple Clustering with python and R

Clustering, it’s fun but it can be misleading. There’s a chapter dedicated to it in Data Analysis with OST by P. Janert where he demonstrates simple usage of Pycluster.

The functions that Janert shows are kcluster, clustercentroids and kmedoids. The output of the kcluster code is the silhouette coefficient but not the nice graphic; in the kmedoids function (which allows you to supply your own distance metric) the code outputs points grouped by cluster but, again, not the nice graph. I modified the code slightly to use pylab to plot a figure something like that shown in the chapter.

import Pycluster as pc
import numpy as np
import sys
from pylab import plot, show, scatter

# There is probably a better way to convert the data to a pylab-friendly format
# but this is simple and works
def toxy(data):
    x,y=[],[]
    [ (x.append(a[0]), y.append(a[1])) for a in data ]
    return x,y

# Read data filename and desired number of clusters from command line
filename, n = sys.argv[1], int( sys.argv[2] )

data = np.loadtxt( filename )

datax,datay = toxy(data)

# Perform clustering and find centroids
clustermap, _, _ = pc.kcluster( data, nclusters=n, npass=50 )
centroids, _ = pc.clustercentroids( data, clusterid=clustermap )

clusterx, clustery = toxy( centroids )

scatter(datax, datay, marker='+')
scatter(clusterx, clustery, c='r')
show()

# removed silhouette code - it takes several seconds to calculate and I can't be bothered to wait.

The code now generates a scatter plot of the original dataset along with however many cluster centroids you asked the code to generate (see below).

Given that R exists, I figured I would also take a look at how to reproduce the results with it; it turns out to be rather simple.

# stats and graphics are part of base R and attached by default
library(stats)
library(graphics)
data <- read.table("ch13_workshop")
km <- kmeans(data, 7,  nstart=50)
# Several bits of summary data are calculated
# Run in interactive mode to see it

# create a nice plot
plot(data)
points(km$centers, col='red')
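
If you want the points coloured by their cluster assignment as well, kmeans stores it in km$cluster:

# colour each point by the cluster it was assigned to
plot(data, col=km$cluster)
points(km$centers, col='red', pch=8)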

In this case, using the default save options, we get a PDF; I converted it to a jpg using Preview.

As with most things R, there are plenty of modifications you can make to functions and, rather than explain it here, take a look at the help doc – in R, type ?kmeans

* Note to self *
There is also access to hierarchical clustering in R via the hclust function. This differs from kmeans in that you have to provide a dist object:

d <- dist(data)
h <- hclust(d)
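
To actually see something, plot the dendrogram and then cut the tree into a chosen number of groups with cutree:

plot(h)                   # the dendrogram
groups <- cutree(h, k=7)  # assign each observation to one of 7 clusters
table(groups)             # how many observations landed in each cluster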

again, for further help hit the included doc – ?hclust.

In addition, both provide access to Kohonen or self-organising maps (SOM). In Pycluster you call somcluster, while in R you need to load the kohonen library – library(kohonen). I like the R package for SOMs as it provides some nice visualisations and the option of unsupervised and supervised SOMs – to find out more type ??kohonen in an R shell. I have a post on SOMs in the pipe, the result of having read a published paper that manages to do several things very wrong and/or skip important details.
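
For completeness, a minimal unsupervised sketch with the kohonen package (assuming `data` is the numeric data.frame used above; the grid size here is arbitrary):

library(kohonen)

# scale the data and train a small self organising map
som_model <- som(scale(as.matrix(data)), grid=somgrid(5, 5, "hexagonal"))

plot(som_model)                  # default "codes" view
plot(som_model, type="mapping")  # where each observation landed on the grid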

Anscombe’s Quartet – or playing with R after drinking London Pride.

Reading Data Analysis with Open Source Tools? If you’ve gone past chapter 3, table 3.1, then you have seen it. You have probably seen some plots too – that’s one of the main points of the dataset: look at your data before breaking out the calculator.

I think the following block is the Anscombe dataset:

x1 y1 x2 y2 x3 y3 x4 y4
10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

Save it in a file called ‘anscombe.txt’, then copy & paste the following R code into a file and save it as anscombe.R

setwd("path-to-your-working-directory")
data <- read.table("anscombe.txt", h=T)

info <- function(a,b) c( 
	paste("cc:", signif(cor(a,b),2)), 
	paste("var x,y:", signif(var(a),2), ",", signif(var(b),2)), 
	paste("mean x,y:", signif(mean(a),2), ",", signif(mean(b),2)),
	paste("linres", signif(lsfit(a,b)$coefficients[1],1))
	)

par(mfrow=c(2,2))
attach(data)

make_plot <- function(a,b) c(
	plot(a,b, xlim=c(3,20), xlab="x", ylab="y", ylim=c(2,13), main=info(a,b), pch=20),
	abline(lsfit(a,b), col="red") )

make_plot(x1,y1)
make_plot(x2,y2)
make_plot(x3,y3)
make_plot(x4,y4)

This, probably unidiomatic, R blurb should create a grid of plots. Each plot is one of the x,y pairs, and above each figure is the correlation coefficient, variance, mean and intercept.

It should look like this.

There’s actually a bit of precompiled code in R to do all this. Open the R terminal or GUI and type ??anscombe; this is an alias for help.search("anscombe"). The familiar help dialog should pop up with a single entry: Topic = Anscombe. Scroll to the bottom of the help page and copy & paste the example code into your R terminal/GUI. Enjoy the exciting plots.
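
The quartet also ships with R itself as the anscombe data.frame, so you can sanity-check the numbers without typing anything in:

data(anscombe)
summary(lm(y1 ~ x1, data=anscombe))  # pair 1; the other three give near-identical coefficients
sapply(anscombe, mean)               # the means match the figures above the plots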

R + London Pride => ftw!

I’ve been drinking, it’s late (00:50) and my g/f is nagging me to turn the computer off, if there’s a mistake let me know via the comments and I will fix it.

Create Animated Principal Component Analysis Biplots

PCA is a handy tool (see SVD); actually, a handy procedure that converts a set of observations into a set of values of uncorrelated variables called principal components. The outcome is that, given a load of input, we get a set of components ordered by how much of the input's variance they describe, high to low, with each component uncorrelated with the preceding ones. It is handy for data reduction and has been used in many areas, including biology.

I took a bunch of proteins with circular dichroism (CD) spectra, parsed the associated pdb files and ran PCA on vectors of various attributes and the associated spectral values across a wavelength range associated with synchrotron radiation CD.

I used R to perform PCA:

# grab the input/output file names passed on the Rscript command line
args <- commandArgs(trailingOnly=TRUE)
input <- args[1]
output <- args[2]

data <- read.csv( input, h=T )
pcx <- prcomp( data, scale=T )

png( output, height=800, width=800 )
biplot( pcx, choices=1:2, main=input, scale=1 )
dev.off()   # close the png device

The script, executed using Rscript, takes two arguments: an input file of comma-separated attributes (csv) and an output file, the name of the .png created by the biplot command. The input files were created using a python script.
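
As an aside, if you want to see how much of the variance each principal component captures, the prcomp object can be summarised before plotting:

summary(pcx)                   # proportion of variance per component
screeplot(pcx, type="lines")   # quick visual of the same thing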

As a result I have 66 .png files. If you want to create an animation that loops over a collection, you need to grab ImageMagick so that you can convert the .pngs to .gifs. Because I am using a mac I installed ImageMagick using homebrew:

-=[biomunky@blacksheep png]=- brew install imagemagick

Then, to convert all the .pngs to .gifs I run convert (ImageMagick) in a bash for loop:

for i in `ls *.png | sed 's/\.png//'`; do convert $i.png $i.gif; done

The result? A load of gifs, excellent.

Imagemagick can then create an animation from the collection of gifs:

convert -delay 50 -loop 0 *.gif animated.gif

-delay 50: causes a delay of 50 hundredths of a second between images.
-loop 0: causes an infinite loop.
*.gif: the input
animated.gif: the name of the output image.

The final output can be seen below – click the image and it should open in a new tab.

To understand what you are looking at read about PCA and biplots.

Data Analysis with open source tools – Kernel Density Estimates

Some datasets are now available from O’Reilly

Get them here

There is a note on the forums: “For data sets that are publicly available, he didn’t replicate the data set, but included the URLs where you can find those data sets. Also, he explained that not all figures in the book have an attached data set; many figures are function plots or otherwise dynamically generated.”


Data Analysis with OST by Philipp K. Janert is an interesting book if you are into … data analysis; it also pairs rather nicely with Programming Collective Intelligence by Toby Segaran.
One problem with the book (see here for more) is that some of the data is most notable by its absence, for example the number of months in office served by each of the (US) presidents. This data is even referred to in the chapter workshop on numpy, meaning that you can’t follow along.

The result: frustration as you open chrome, find the data and parse it (or just approximate a portion of it). When compared to Collective Intelligence, where Segaran thoughtfully provides code to fetch and parse data alongside preprocessed snippets, this is an unwelcome process*.

Enough with the complaining! I made up some data and used it with the code provided, my approximated set looks like this:

1	1
9	1
18	1
29	2
34	1
36	1
41	1
50	1
51	2
52	14
57	1
62	1
69	2
90	1
92	1
94	10
98	1

The number of months is on the left, the number of observations on the right. For the histogram I expanded the list to cover 0 .. 99, filling in the missing months with zero counts, and I also created an array of the individual observations that looked like: [1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 57, 62, 69, 69, 90, 92, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 98]; this is the data that I analysed using the KDE.

Janert uses numpy to do his analysis because it allows for the function to be condensed into a single block of code that, under the covers, is C:

from numpy import *
def kde(z, w, xv): # z: position w: bandwidth xv: vector of points
    return sum(exp(-0.5*((z-xv)/w)**2)/sqrt(2*pi*w**2))

#convert the above array into a numpy array, allows broadcasting
x = array([1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 57, 62, 69, 69, 90, 92, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 98])

w = 0.8 # The bandwidth used in the book

# Now for all points calculate the kde
for pos in linspace( min(x), max(x), 1000):
    print pos, kde(pos, w, x)

If you aren’t familiar with numpy.linspace open a python shell and do this

from numpy import linspace
help(linspace)

This will tell you that min(x) is the start point, max(x) the end point and 1000 the number of samples. The result is an array of evenly spaced numbers over the specified interval.

Janert uses GnuPlot (he has written a book about it) and Matplotlib to produce images. I usually jump between the two depending on what I am doing; here, for ease of use, I redirected the output to a file and plotted it with GnuPlot:

plot [0:100][0:15] 'hist.txt' using 1:2 with boxes, 'foo1.5' using 1:2 with lines lt 3, 'foo2.5' using 1:2 with lines lt -1

where hist.txt is the expanded list and the fooi files are KDEs with the bandwidth set to i. lt -1/2/3 changes the colour of the lines used by GnuPlot, [0:100] sets the range of the x axis and [0:15] the range of the y axis. If you are on a mac you can install gnuplot using brew, otherwise use your package manager.

The output should look something like this; note that neither of the bandwidths is the one used by Janert (0.8) and that my data is truncated/an approximation.

kde hist plot

Now, this looks something like the image in the chapter!

THERE ARE OTHER WAYS!

One is gaussian_kde in scipy.stats, which is described here; the other is via R, where KDEs can be produced in two ways:

The first uses density, a function that ships with R. It works as follows:

d <- density(x) # where x is your dataset
plot(d)

I will amend this example to include my dataset but at the moment Leffe beer is getting the better of me.
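
In the meantime, here's a quick sketch using the approximated months-in-office data from the numpy example above:

# the same expanded observations used for the numpy KDE
x <- c(1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, rep(52, 14),
       57, 62, 69, 69, 90, 92, rep(94, 10), 98)

d <- density(x)   # gaussian kernel by default, bandwidth chosen automatically
plot(d)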

The next method, and perhaps easier to understand, uses the ks library. This library doesn't ship with R; install it using the package manager and make sure that the install dependencies box is checked.

In this example we will use the faithful dataset (to find out more about faithful type the following at the R console):

 ?faithful

To generate the KDE all you need to do is:

library(ks)
attach(faithful)
h <- hpi(x=waiting)
fhat <- kde(x=waiting, h=h)
plot(fhat, drawpoints=TRUE)

But what does it all mean? I have no idea! I am not an R expert, but here's what I get from this:
attach: adds the faithful dataset to the search path, meaning that rather than typing ‘faithful$waiting’ you can just write ‘waiting’
hpi: a plug-in bandwidth selector
kde: the kernel density estimation function
plot: creates the nice images.
To find out more about any of these, type, for example,

?kde

in the R console. WARNING: R help is mostly code and can be a bit heavy.

The result should be something like this:

CAVEAT EMPTOR
I am not a mathematician nor am I smart/intelligent/competent, I am playing about with this stuff for my own amusement and drinking a fairly potent beer at the same time. If you find a problem with this (it’s factually wrong or it doesn’t work for you) please let me know via the comments and I will fix things.


* Collective Intelligence would be an utter failure if Segaran hadn’t provided some form of data. Data Analysis looks like it can still be a good reference text – despite poor English (what are the O’Reilly editors doing? and yes, this entry isn’t well written) and unclear explanations of equations for the mathematically challenged.

