python

Jython & Lucid

Install jython

 brew install jython

or whatever package installer your machine has.
Install pip

 sudo easy_install pip

Install Virtualenv

 sudo pip install virtualenv

Create new environment

virtualenv -p /usr/local/bin/jython <your-project-name>

Activate virtualenv


cd <your-project-name>

source bin/activate (I use zsch & this works for me)

Install jip

 ./bin/pip jip

Run a query


from com.ziclix.python.sql import zxJDBC

db = zxJDBC.connect("jdbc:luciddb:http://localhost:8034", "sa", None, "org.luciddb.jdbc.LucidDbClientDriver")

cursor = db.cursor()

cursor.execute("select * from reports.watching where event_id = '757670'")

for row in cursor:
print row

cursor.close()
db.close()

Anagrams with python

I was asked to write a bit of python code to find anagrams. The code consists of a class with one method and a unittest.TestCase companion. The anagram class takes a list of words and performs some crude checks to make sure the class word list is not empty. Passing a word to the get_anagrams will return a set([]) of all anagrams for that word. For example, passing in ‘cat’ will return set([‘act’, ‘cat’]).

The word list used here is found in /usr/share/dict/words, this should be ok for mac and linux users, anyone using windows will have to provide a word list and change the tests.

#!/usr/bin/env python

import itertools
import unittest
import types
import os

class Anagram( object ):
    def __init__(self, words):
        """
        Create an array containing words supplied in <words>.
        Raises errors if 1) words isn't a list type
                         2) words turns out to be empty or contains only non-StringType(s)
        """
        if not isinstance(words, types.ListType):
            raise TypeError("Anagram takes a list of strings")
        else:
            self.words = [ i for i in words if isinstance( i, types.StringType )]

            if len(self.words) == 0:
                raise ValueError("Empty word list")
            
    def get_anagrams(self, word):
        """
        Given a word returns a list of anagrams for <word>
        Example: x.get_anagrams('cat') would return set(['act', 'cat'])
        """
        if not isinstance(word, types.StringType):
            raise TypeError("get_anagrams takes a StringType")

        #Get a list of words from self.words that are the same length as word
        filtered_words = set([i for i in self.words if len(i) == len(word)])

        #create a list of all possible combintations of word - create tuple of chars
        #then join the ('d', 'a', 'n') to form  string - 'dan'
        word_permutations = set([ ''.join(w) for w in itertools.permutations( word )])

        #use set.intersection to find all 'real' words, specifically those found in <words>
        return  word_permutations.intersection(filtered_words)

class TestAnagram( unittest.TestCase ):
    def test_constructor(self):
        self.assertRaises(TypeError,  Anagram, 1)
        self.assertRaises(TypeError,  Anagram, "cat")        
        self.assertRaises(TypeError,  Anagram, ())
        self.assertRaises(ValueError, Anagram, [1,2])        
    
    def setUp(self):
        filename   = '/usr/share/dict/words'
        if not os.path.isfile(filename):
            raise IOError( "Sample word file does not exist" )
    
        self.words = [i.strip() for i in open(filename).readlines()]
        self.ana   = Anagram(self.words)

    def test_get_anagrams(self):
        cat_like = self.ana.get_anagrams('cat')
        python_no_mates   = self.ana.get_anagrams('python')
        self.assertEqual( cat_like, set(['act','cat']) )
        self.assertEqual( python_no_mates,  set([]) )
        
    def test_argument_get_anagrams(self):
        self.assertRaises(TypeError, self.ana.get_anagrams, 1)
        self.assertRaises(TypeError, self.ana.get_anagrams, [1])
        self.assertRaises(TypeError, self.ana.get_anagrams, {})
        
if __name__ == '__main__':
    unittest.main()

python, scala and clojure – old notes.

Below are some notes I made when initially looking at Scala and Clojure.
(List(1,2,3) :\ 0.0) (_+_)

Where’s my for loop?

Counting residues … start with a list of characters something like in the case of python.

characters = ['a','b','b','c','c','c','d','d','d','d']

What we want is a count of the number of each residue, therefore what we want is a mutable data structure – specifically an empty dictionary (also called associative arrays, maps, hashmap, kevin) and a for loop.

counts = {}
for achar in characters:
    if character in counts:
        counts[achar] += 1
    else:
        counts[achar] = 1

great, counts is a dictionary, which can also be declared using counts = dict(). We then loop over each character (achar) in characters, if the counts dictionary contains achar the count is incremented (+= 1) otherwise a new key is created in the dictionary and associated with the value 1 (counts[achar] = 1).

the output of this will be a populated dictionary

{'a': 1, 'c': 3, 'b': 2, 'd': 4}

the same outcome can be reached using fewer lines of code:

counts = {}
for achar in characters:
	counts[achar] = counts.get(achar, 0) + 1

This time we used the dictionary method .get. The method returns the value for the key if it’s in the dictionary and 0 if it isn’t. Without the 0 .get would return None – you can’t add None and 1!

So we’ve already got what we wanted but there’s yet another way – we could use a for comprehension and a list method .count to do the work

>>> characters.count('d')
4

Wrap the .count call in a for comprehension

>>> counts = [ (x, characters.count(x)) for x in set(characters)]

The result is however different. We are presented with an array of tuples

[('a', 1), ('c', 3), ('b', 2), ('d', 4)]

but we wanted a dictionary, well you can just wrap the comprehension in a dict()

>>> counts = dict([ (x, characters.count(x)) for x in set(characters)])
{'a': 1, 'c': 3, 'b': 2, 'd': 4}

We can also do this using a reduce and a function (since an assignment can’t be made directly in the lambda, we call with another function !!). This isn’t something to do in ‘real’ code – but may help with understanding clojure/scala

def addToDict( amap, achar ): 
	amap[achar] = amap.get(achar, 0) + 1
	return amap

counts = reduce( lambda x,y: addToDict(x,y), characters, {} )
print counts

So, first we define a function that takes a dictionary (called amap) and a character (achar), sets/increments the count and returns the dictionary – it’s identical to the example above – except that it now carries extra overhead. We then use the global reduce function and a lambda to call the function with the dictionary (the {} at the end of the reduce line) and each character in characters until the entire sequence has been covered.

If the above isn’t obvious then perhaps the following will help:

>>> thing_to_reduce = [1,2,3]
>>> the_function_to_apply = lambda i, running_total: running_total + i
>>> xsum = reduce( the_function_to_apply, thing_to_reduce, 0)
6

Should you do this in your code? No, probably not! So why bother? It may be helpful in working out how to achieve the same thing in the functional languages that all the cools kids are using.

# SCALA #

It’s perhaps less of a departure from languages like perl and python than clojure and haskell, although much like clojure, the docs often make reference/comparison to Java code – not much use if you don’t know Java. It also has two types of ‘variable’ – val and var. var is an indicator that you can mutate while val indicates that the item pointed to is immutable (this isn’t totally true).

scala> var x = 1
x: Int = 1

scala> x = 2
x: Int = 2

scala> val x = 1
x: Int = 1

scala> x = 2
<console>:8: error: reassignment to val
       x = 2
         ^

Again, we define characters as a list of chars

val characters:List[Char] = List('a','b','b','c','c','c','d','d','d','d')
// the :List[Char] can be dropped
val characters = List('a','b','b','c','c','c','d','d','d','d')

With scala we can import mutable maps and then use a for loop to populate it

import scala.collection.mutable.Map
val mutableCounter = Map[Char,Int]()
for ( achar <- characters ) mutableCounter(achar) = (mutableCounter.getOrElse(achar,0) + 1)
println( mutableCounter )

and from this we get: Map(c -> 3, a -> 1, d -> 4, b -> 2). A value can be extracted from the map using the .get method – mutableCounter.get(‘a’). Ideally we want to avoid mutable state (it’s helpful when delving into the world of concurrent programming), to achieve this we can use immutable collections and foldLeft on the characters List.

import scala.collection.immutable.Map
val foldCounts = characters.foldLeft(Map[Char,Int]()) {
    (amap, achar) =>  amap ++ Map(achar -> (amap.getOrElse(achar, 0) + 1  )) }
//res4: scala.collection.immutable.Map[Char,Int] = Map(a -> 1, b -> 2, c -> 3, d -> 4)

but, you can make this look pythonic again, this time making use of the count function which belongs to List.

println( characters.count(_ == 'd') )

To do it for all chars

characters.distinct map { achar => (achar, characters.count( _ == achar ) ) } 

to get a map, rather than List[(Char, Int)] wrap append .toMap on the end of the line

scala> characters.distinct map { achar => (achar, characters.count( _ == achar ) ) } toMap
res10: scala.collection.immutable.Map[Char,Int] = Map(a -> 1, b -> 2, c -> 3, d -> 4)

I think that the fold is the best method to use although I haven’t tested it.

# CLOJURE #
Clojure is a bit different, we don’t have the same options of vars and vals and, in addition, there are lots of parens to deal with!

Starting with the same thing – a vector of characters

(def characters [ \a \b \b \c \c \c \d \d \d \d])

you can check the class of characters by doing by typing the following at the REPL

(def characters [ \a \b \b \c \c \c \d \d \d \d])
(class \a)
(class characters)

you should see: java.lang.Character and clojure.lang.PersistentVector

It may be tempting to write a function that loops over our vector of characters and builds a dictionary of characters and their counts

(defn tally-characters-v1 [chars mydict]
  (if (empty? chars) mydict
      (let [achar (first chars)]
	(tally-characters-v1
	 (rest chars)
	 (assoc mydict achar (inc (mydict achar 0)))))))
	
(println (tally-characters-v1 characters {}))

Copy and paste the above into a REPL and you should get a clojure.lang.PersistentArrayMap. This can be accessed by key value pairs, much as you can do in python:

user=> (counts \d)
4

That function called itself, plus an empty map had to be passed in. You can avoid this step by altering the declaration

(defn tally-characters-v1
  ([inchars] tally-characters [inchars {}])
  (tally-characters-v1 [inchars counts]) ...

Now the method can be called by passing just the characters vector.

As i understand it functions should make use of the recur special form rather than self-calls.

(defn tally-characters-v2 [chars]
  (loop [coll chars mydict {}]
    (if (empty? coll) mydict
	(recur (rest coll)
	       (assoc mydict (first coll)
		 (inc (mydict (first coll) 0)))))))

that does as expected and produces a clojure.lang.PersistentArrayMap, but why bother creating a function using defn when reduce and an anonymous function will do just fine?

(def counts (reduce
	     (fn [amap achar]
	       (assoc amap achar (inc (amap achar 0))))
	     {} characters))

but, you know we don’t even need to do that when writing code for use in real applications. Clojure has a function called frequencies that does this

(println (frequencies characters))

nice.

The code block below shows the latest version of frequencies as defined in clojure.core

(defn frequencies
  "Returns a map from distinct items in coll to the number of times they appear."
  {:added "1.2" :static true}
  [coll]
  (persistent!
   (reduce (fn [counts x]
	     (assoc! counts x (inc (get counts x 0))))
	   (transient {}) coll)))

This code has not been checked for timing nor has any attempt been made to make it optimal or align with any programming best practice. It’s a collection of notes i made when learning something new.

Simple Clustering with python and R

Clustering, it’s fun but it can be misleading. There’s a chapter dedicated to it in Data Analysis with OST by P. Janert where he demonstrates simple usage of Pycluster.

The functions that Janert shows are: kcluster; clustercentroids; kmedioids. The output of the kcluster code is the silhouette coefficient but not the nice graphic, in the kmedioids function (which allows you to supply your own distance metric) the code outputs points grouped by cluster but not the nice graph. I modified the code slightly to use pylab to plot a figure something like that shown in the chapter.

import Pycluster as pc
import numpy as np
import sys
from pylab import plot, show, scatter

# The is probably a better way to convert the data to pylab acceptable format
# but this is simple and works
def toxy(data):
    x,y=[],[]
    [ (x.append(a[0]), y.append(a[1])) for a in data ]
    return x,y

# Read data filename and desired number of clusters from command line
filename, n = sys.argv[1], int( sys.argv[2] )

data = np.loadtxt( filename )

datax,datay = toxy(data)

# Perform clustering and find centroids
clustermap, _, _ = pc.kcluster( data, nclusters=n, npass=50 )
centroids, _ = pc.clustercentroids( data, clusterid=clustermap )

clusterx, clustery = toxy( centroids )

scatter(datax, datay, marker='+')
scatter(clusterx, clustery, c='r')
show()

# removed silhouette code - it takes several seconds to calculate and I can't be bothered to wait.

The code now generates a scatter plot with the original dataset and however many clusters you asked to code to generate (see below).

Given that R exists I figured I would also take a look at how to reproduce the results with it, turns out it’s rather simple.

library(stats, graphics)
data <- read.table("ch13_workshop")
km <- kmeans(data, 7,  nstart=50)
# Several bits of summary data are calculated
# Run in interactive mode to see it

# create a nice plot
plot(data)
points(km$centers, col='red')

In this case, using the default save options we get a pdf, I converted it to a jpg using Preview.

As with most things R, there are plenty of modifications you can make to functions and, rather than explain it here, take a look at the help doc – in R, type ?kmeans

* Note to self *
There is also access to hierarchical clustering in R via the hclust function. This differs from kmeans in that you have to provide a dist object:

d <- dist(data)
h <- hclust(d)

again, for further help hit the included doc – ?hclust.

In addition, both provide access to Kohonen or self organising maps (SOM). In Pycluster you call somcluster, while with R you need to import the kohonen library – library(kohonen). I like the R package for SOMs as it provides some nice visualisations and the options of unsupervised and supervised SOMs – to find out more type ??kohonen in an R shell. I have a post for SOMs in the pipe, the result of having read a published paper that manages to do several things very wrong &| skip important details.

Generate Codons with Python and Haskell

There are 64 (4**3) dna codons, of which Leder and Nirenberg were able to establish the sequence of 54, they code for the 20 standard amino acids (not the 29 my 3rd year project molecular biology undergrads believe to exist). Using python and itertools it’s simple to create all 64 using a generator, although I have no reason to do so.

from itertools import product
for codon in product('ATCG', repeat=3): print codon

yey, codons, I think.

Let’s do the same with haskell, not that I know how to do anything with them afterwards!
First, take three attempts to type ‘first’, now start GHCi

let bases = ["A", "T", "C", "G"]
let codons = [ (a, b, c) | a <- bases, b <- bases, c <- bases]
codons
length codons

That should yield a list of codons, all I need to do now is learn haskell, someone on hacker news/twitter suggested that to learn scala you had to learn haskell first, ho hum.

Data Analysis with open source tools – Kernel Density Estimates

Some datasets are now available from O’Reilly

Get them here

There is a note on the forums: “For data sets that are publicly available, he didn’t replicate the data set, but included the URLs where you can find those data sets. Also, he explained that not all figures in the book have an attached data set; many figures are function plots or otherwise dynamically generated.”


Data Analysis with OST by Philipp K Janert is an interesting book if you are into … data analysis, it also pairs rather nicely with Programming Collective Intelligence by Toby Segaran.
One problem, (see here for more), with the book is that some of the data is most notable by its absence, for example the number of months in office served by each of the (US) presidents. This data is even referred to in the chapter workshop on numpy meaning that you can’t follow along.

The result: frustration as you open chrome, find the data and parse it (or just approximate a portion of it). When compared to Collective Intelligence, where Segaran thoughtfully provides code to fetch and parse data alongside preprocessed snippets, this is an unwelcome process*.

Enough with the complaining! I made up some data and used it with the code provided, my approximated set looks like this:

1	1
9	1
18	1
29	2
34	1
36	1
41	1
50	1
51	2
52	14
57	1
62	1
69	2
90	1
92	1
94	10
98	1

The number of months is on the left, the number of observations on the right. I expanded the list to 0 .. 99 with each number as well as creating an array that looked like: [1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 57, 62, 69, 69, 90, 92, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 98], this is the data that I analysed using the KDE.

Janert uses numpy to do his analysis because it allows for the function to be condensed into a single block of code that, under the covers, is C:

from numpy import *
def kde(z, w, xv): # z: position w: bandwidth xv: vector of points
    return sum(exp(-0.5*((z-xv)/w)**2)/sqrt(2*pi*w**2))

#convert the above array into a numpy array, allows broadcasting
x = array([1, 9, 18, 29, 29, 34, 36, 41, 50, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 57, 62, 69, 69, 90, 92, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 98])

w = 0.8 # The bandwidth used in the book

# Now for all points calculate the kde
for pos in linspace( min(x), max(x), 1000):
    print pos, kde(pos, w, x)

If you aren’t familiar with numpy.linspace open a python shell and do this

from numpy import linspace
help(linspace)

This will tell you that min(x) is the start point, max(x) the end point and 1000, the number of steps. The result is an array of evenly spaced numbers over the specified interval.

Janert uses GnuPlot (he has written a book) and Matplotlib to produce images. I usually jump between the two depending on what I am doing, here, for ease of use, I redirected the output to a file and plotted using GnuPlot:

plot [0:100][0:15] 'hist.txt' using 1:2 with boxes, 'foo1.5' using 1:2 with lines lt 3, 'foo2.5' using 1:2 with lines lt -1

where hist.txt is the expanded list, the fooi files are KDEs with the bandwidth set to i. lt -1/2/3 changes the colour of the lines used by GnuPlot and [0:100] sets the width of the xaxis and [0:15] the width of the yaxis. If you are on a mac you can install gnuplot using brew otherwise use your package manager.

The output should look something like this, note that neither of the bandwidths are those used by Janert (0.8) and that my data is truncated/an approximation.

kde hist plot

Now, this looks something like the image in the chapter!

THERE ARE OTHER WAYS!

Using the gaussian_kde in scipy.stats which is described here the second via R. KDEs can be completed using two processes:

The first uses density, a function that ships with R. It works as follows:

d <- density(x) # where x is your dataset
plot(d)

I will amend this example to include my dataset but at the moment Leffe beer is getting the better of me.

The next method, and perhaps easier to understand, is the ks library. This library doesn’t ship with R, install it using the package manager & make sure that the install dependencies tab is checked.

In this example we will use the faithful dataset (to find out more about faithful type the following at the R console):

 ?faithful

To generate the KDE all you need to do is:

library(ks)
attach(faithful)
h <- hpi(x=waiting)
fhat <- kde(x=waiting, h=h) plot(fhat, drawpoints=TRUE) 

But what does it all mean? I have no idea! I am not an R expert, but here’s what I get from this:
attach adds the faithful dataset to the path, this means that rather than typing ‘faithful$waiting’ you just add ‘waiting’
hpi: a function that determines the bandwidth plugin
kde: the kernel density estimation function
plot: creates the nice images.
to find out more about these functions type

?function

in the R console. WARNING: R help is mostly code and can be a bit heavy.

The result should be something like this:

CAVEAT EMPTOR
I am not a mathematician nor am I smart/intelligent/competent, I am playing about with this stuff for my own amusement and drinking a fairly potent beer at the same time. If you find a problem with this (it’s factually wrong or it doesn’t work for you) please let me know via the comments and I will fix things.


* Collective Intelligence would be an utter failure if Segaran hadn’t provided some form of data. Data Analysis looks like it can still be a good reference text – despite poor english (what are the O’Reilly editors doing & yes this entry isn’t well written) and unclear explanation of equations for the mathematically challenged.


Some datasets are now available from O’Reilly

Get them here

There is a note on the forums: “For data sets that are publicly available, he didn’t replicate the data set, but included the URLs where you can find those data sets. Also, he explained that not all figures in the book have an attached data set; many figures are function plots or otherwise dynamically generated.”

Fetch protein sequences from the PDB using Scala (and clojure)

I am trying to use scala in day to day work, I find that using one language (poorly) gets boring. Typically I use python to grab data from the PDB using URLLIB, this can be achieved as follows:

url = 'http://wwww.rcsb.org/pdb/files/%s.pdb" % s
content = urllib.urlopen(url).readlines()

It’s very simple, requiring a single import (urllib). The scala code isn’t as succinct but clearly shows that scala is a viable scripting alternative.

#!/bin/bash
exec scala "$0" "$@"
!#
import scala.io.Source.{fromInputStream}
import java.net.URL
import java.io.{FileWriter}

object GetSeqs{
	def main(args: Array[String]) = {
		if (args.length != 1) {
			println("Gimme a PDBID")
			exit(1)
		}	
		val id = args(0)
		val url = new URL(String.format("http://www.pdb.org/pdb/files/fasta.txt?structureIdList=%s",id))
		val output = new FileWriter(id + ".fasta")
		for (line <- fromInputStream(url.openStream).getLines ) {
			output.write(line)
		}
		output.close	
	}
}
GetSeqs.main( args)

The first three lines allow you to call this code as you would any other script (you don’t need to compile this code). The rest is fairly self explanatory: create a url; open and file to write to; read a stream and write the data to the output file; close the file.

The result I get, when run as: sh seq-fetch.scala 1hnn
is:

>1HNN:B|PDBID|CHAIN|SEQUENCE
MSGADRSPNAGAAPDSAPGQAAVASAYQRFEPRAYLRNNYAPPRGDLCNPNGVGPWKLRCLAQTFATGEVSGRTLIDIGS
GPTVYQLLSACSHFEDITMTDFLEVNRQELGRWLQEEPGAFNWSMYSQHACLIEGKGECWQDKERQLRARVKRVLPIDVH
QPQPLGAGSPAPLPADALVSAFCLEAVSPDLASFQRALDHITTLLRPGGHLLLIGALEESWYLAGEARLTVVPVSEEEVR
EALVRSGYKVRDLRTYIMPAHLQTGVDDVKGVFFAWAQKVGL
>1HNN:A|PDBID|CHAIN|SEQUENCE
MSGADRSPNAGAAPDSAPGQAAVASAYQRFEPRAYLRNNYAPPRGDLCNPNGVGPWKLRCLAQTFATGEVSGRTLIDIGS
GPTVYQLLSACSHFEDITMTDFLEVNRQELGRWLQEEPGAFNWSMYSQHACLIEGKGECWQDKERQLRARVKRVLPIDVH
QPQPLGAGSPAPLPADALVSAFCLEAVSPDLASFQRALDHITTLLRPGGHLLLIGALEESWYLAGEARLTVVPVSEEEVR
EALVRSGYKVRDLRTYIMPAHLQTGVDDVKGVFFAWAQKVGL

The same can be accomplished using clojure with a couple of java functions, for lack of clojure knowledge:

(use '[clojure.contrib.duck-streams :only (write-lines read-lines)])
(def pdbid ( .substring (nth *command-line-args* 0) 0 4 ) )
(def url (str  "http://www.pdb.org/pdb/files/fasta.txt?structureIdList=" pdbid ))
(def output (str pdbid ".fasta"))
(write-lines output  (read-lines (java.net.URL. url)))

To run this script, using a couple of environment variables, I typed java -cp $CLJ_JAR:$CONTRIB_JAR clojure.main fetch-and-write-a-pdb-seq.clj 1hnnA. The output is exactly the same as above.