
Fetch protein sequences from the PDB using Scala (and Clojure)

I am trying to use Scala in day-to-day work, since using only one language (poorly) gets boring. Typically I use Python to grab data from the PDB with urllib, which can be done as follows:

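import urllib  # Python 2; on Python 3 urlopen lives in urllib.request
# pdbid holds a four-character PDB ID such as '1hnn'
url = 'http://www.rcsb.org/pdb/files/%s.pdb' % pdbid
content = urllib.urlopen(url).readlines()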

It’s very simple, requiring a single import (urllib). The Scala code isn’t as succinct, but it clearly shows that Scala is a viable scripting alternative.

#!/bin/bash
exec scala "$0" "$@"
!#
import scala.io.Source.fromInputStream
import java.net.URL
import java.io.FileWriter

object GetSeqs {
	def main(args: Array[String]): Unit = {
		if (args.length != 1) {
			println("Gimme a PDBID")
			sys.exit(1) // replaces the bare exit(1), which newer Scala releases no longer provide
		}
		val id = args(0)
		val url = new URL(String.format("http://www.pdb.org/pdb/files/fasta.txt?structureIdList=%s", id))
		val output = new FileWriter(id + ".fasta")
		// getLines strips the line terminator on Scala 2.8 and later, so add it back
		for (line <- fromInputStream(url.openStream).getLines()) {
			output.write(line + "\n")
		}
		output.close()
	}
}
GetSeqs.main(args)

The first three lines allow you to call this code as you would any other script (you don’t need to compile it). The rest is fairly self-explanatory: create a URL; open a file to write to; read the stream and write the data to the output file; close the file.
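For comparison with the Python snippet at the top, the same four steps can be sketched in Python 3, where urlopen lives in urllib.request. This is only a sketch: the names FASTA_URL, fetch_to_file and main are mine, and the URL prefix is simply the one from the Scala script, which I am assuming still answers at that address.

```python
import urllib.request

# URL prefix copied from the Scala script; whether the service still
# responds at this address is an assumption, not something checked here.
FASTA_URL = "http://www.pdb.org/pdb/files/fasta.txt?structureIdList="


def fetch_to_file(url, path):
    """Copy every line read from url into the file at path; return the line count."""
    count = 0
    # Both the stream and the output file are closed when the with-block exits.
    with urllib.request.urlopen(url) as stream, open(path, "wb") as output:
        for line in stream:  # each line keeps its trailing newline
            output.write(line)
            count += 1
    return count


def main(argv):
    """Mirror the Scala script: expect exactly one PDB ID argument."""
    if len(argv) != 2:
        print("Gimme a PDBID")
        return 1
    pdb_id = argv[1]
    fetch_to_file(FASTA_URL + pdb_id, pdb_id + ".fasta")
    return 0
```

To use it as a script, call sys.exit(main(sys.argv)) at the bottom of the file. Unlike the Scala version, the with statement closes the file even if the download fails part-way.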

Running the script as sh seq-fetch.scala 1hnn gives:

>1HNN:B|PDBID|CHAIN|SEQUENCE
MSGADRSPNAGAAPDSAPGQAAVASAYQRFEPRAYLRNNYAPPRGDLCNPNGVGPWKLRCLAQTFATGEVSGRTLIDIGS
GPTVYQLLSACSHFEDITMTDFLEVNRQELGRWLQEEPGAFNWSMYSQHACLIEGKGECWQDKERQLRARVKRVLPIDVH
QPQPLGAGSPAPLPADALVSAFCLEAVSPDLASFQRALDHITTLLRPGGHLLLIGALEESWYLAGEARLTVVPVSEEEVR
EALVRSGYKVRDLRTYIMPAHLQTGVDDVKGVFFAWAQKVGL
>1HNN:A|PDBID|CHAIN|SEQUENCE
MSGADRSPNAGAAPDSAPGQAAVASAYQRFEPRAYLRNNYAPPRGDLCNPNGVGPWKLRCLAQTFATGEVSGRTLIDIGS
GPTVYQLLSACSHFEDITMTDFLEVNRQELGRWLQEEPGAFNWSMYSQHACLIEGKGECWQDKERQLRARVKRVLPIDVH
QPQPLGAGSPAPLPADALVSAFCLEAVSPDLASFQRALDHITTLLRPGGHLLLIGALEESWYLAGEARLTVVPVSEEEVR
EALVRSGYKVRDLRTYIMPAHLQTGVDDVKGVFFAWAQKVGL

The same can be accomplished in Clojure, leaning on a couple of Java methods for lack of Clojure knowledge:

(use '[clojure.contrib.duck-streams :only (write-lines read-lines)])
;; Keep only the first four characters of the argument, i.e. the PDB ID itself.
(def pdbid (.substring (nth *command-line-args* 0) 0 4))
(def url (str "http://www.pdb.org/pdb/files/fasta.txt?structureIdList=" pdbid))
(def output (str pdbid ".fasta"))
(write-lines output (read-lines (java.net.URL. url)))

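To run this script, with a couple of environment variables supplying the classpath entries, I typed java -cp $CLJ_JAR:$CONTRIB_JAR clojure.main fetch-and-write-a-pdb-seq.clj 1hnnA. The script keeps only the first four characters of its argument, so the trailing chain letter is dropped and the output is exactly the same as above.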
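Either script leaves a file such as 1hnn.fasta on disk. As a small follow-up, here is a sketch in Python of reading that file back into one sequence per chain. It assumes the header layout seen in the output above (>1HNN:B|PDBID|CHAIN|SEQUENCE); the function name parse_fasta is my own.

```python
def parse_fasta(lines):
    """Return a dict mapping (pdb_id, chain) to the full sequence string."""
    records = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed header layout: >1HNN:B|PDBID|CHAIN|SEQUENCE
            ident = line[1:].split("|")[0]
            pdb_id, _, chain = ident.partition(":")
            key = (pdb_id, chain)
            records[key] = ""
        elif key is not None:
            # A sequence wraps over several lines; join them back together.
            records[key] += line
    return records
```

It accepts any iterable of lines, so an open file works directly: with open("1hnn.fasta") as handle: chains = parse_fasta(handle). For the 1HNN output shown earlier this would give two entries, one for chain B and one for chain A.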