Sqoop

Connecting Hive and R via MySQL using Sqoop

Rrrrrr – because pirates > ninjas

This is all kind of backwards, why would I want to take data out of cascalog/hdfs to MySQL and then back into hdfs/hive? Anyway, the general aim here it to get some data, from Hive, into R and do something with it.

Step 1: Get Data
In the first step we will grab some data from an existing database, created when doing a demo of cascalog, and load it into Hive using a tool called sqoop. Sqoop is a tool designed to move large quantities of data between hdfs and structured datastores.

Download sqoop and cd into bin/

./sqoop import --connect jdbc:mysql://localhost/cascalog  --username dan  --table unique_user_counts -m 1 --hive-import

This will import data from the cascalog MySQL database, specifically the unique_user_counts database. The –hive-import switch will place data into Hive (I installed Hive using Homebrew). The -m toggle defines the number of mappers the job will use, in this instance just 1.

You can take a peek at the data using the Hive repl.

λ bin  hive
hive> show databases;
OK
default
Time taken: 2.48 seconds
hive> use default;
OK
Time taken: 0.014 seconds
hive> show tables;
OK
unique_user_counts
Time taken: 0.286 seconds
hive> select * from unique_user_counts;
...

This dumps out the data. If you want to know more about your table then you can use the describe ‘table_name’ command.

To get data into R Hive must be started in server mode, this can be achieved by

hive --service hiveserver

The final step is R …

library(RJDBC)

hive_jars <-list.files("/usr/local/Cellar/hadoop/1.1.1/libexec", pattern="jar$", full.names=T)
hadoop_jars <- list.files("/usr/local/Cellar/hive/0.9.0/libexec/lib", pattern="jar$", full.names=T)

hive_driver <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", c(hive_jars, hadoop_jars))

hive_conn <- dbConnect(hive_driver, "jdbc:hive://localhost:10000/default")

rs <- dbGetQuery(hive_conn,"select client, count from unique_user_counts")

x <- 1:length(rs$count)
plot(x, rs$count)

This gives me a very boring plot – it’s really of no interest at all, it’s marginally more interesting than doing a select count(*) from …

The *interesting* part here is the definition of the jars required to connect to Hive, this is largely identical to a previous post.