Read a fasta file with Clojure

I am still not comfortable with clojure. I start to think it’s amazing and the something happens that sends me back to scala or more frequently python.

One thing that crops up fairly often is extracting a sequence from a fasta file. Sometimes I read the whole thing into memory and other times all I want is a single sequence. For S&Gs I wrote a little bit of clojure to do both (i am aware this is poor reinvention of the sequence extraction wheel).

The code looks like this:

(use '[clojure.java.io] )
(use '[clojure.contrib.string :only [blank?]] )

(def filename "fasta.txt")

(defn read-fasta-file
  "Reads a .fasta file into memory"
  ( [c] (read-fasta-file c {} nil))
  ( [c sequences id]
      (let [cl (first c)]
	(if (empty? cl) sequences
	    (if (.startsWith cl ">")
	      (recur (rest c)
		     sequences
		     cl)
	      (recur (rest c)
		     (assoc sequences id (str (sequences id) (.trim cl)))
		     id))))))

(defn extract-sequence
  ( [coll id] (extract-sequence coll id nil nil ))
  ( [coll id current-id sequence]
      (let [cl (first coll) fid (str ">" id)]
	(if (empty? coll)
	  (if (nil? sequence) nil
	      (apply str sequence))
	  (cond
	   (.startsWith cl ">") (recur (rest coll) id cl sequence)
;	   (blank? cl) (do (println cl) (recur (rest coll) id cl sequence))
	   :else (if (.startsWith current-id fid)
		   (recur (rest coll) id current-id (str sequence cl ) )
		   (recur (rest coll) id current-id sequence)))))))

;read the file in a lazy way
(println ">" (with-open [rdr (reader "pdbaa")] (extract-sequence (line-seq rdr) "3JU9A")))

(defn fasta [] (with-open [rdr (reader filename)]
	       (read-fasta-file (line-seq rdr))))

It’s not too well thought out or understood! I think that the file is read lazily – clojuredoc annotations state that line-seq’ing a (reader file) is lazy. The functions aren’t lazy! read-fasta-file will store the entire fasta file as a dictionary/hashmap. extract-sequence will continue looping over the file content until EOF because there i didn’t include a ‘sequence found’ boolean to indicate if what I was looking for had been found.

If you’ve read this far and you are shaking your head & thinking why the f*ck has he done it like that, I would be really grateful if you could provide feedback – i would appreciate it.

maybe this code is better for reading, once a resource containing the sequences has been acquired:

(def fasta [">cat" "123" "456", ">dog" "567", "890"])

(defn combine [m k v]
  (assoc m k (str (m k) v)))

(defn parseFasta [content]
  (loop [c content
	 m {}
	 cid nil]
    (if (empty? c) m
	(if (.startsWith (first c) ">")
	  (let [id (first c)]
	    (recur (rest c) (assoc m id "") id))
	  (recur (rest c) (combine m cid (first c)) cid)))))

I don’t think this will benefit from wrapping with lazy-seq, but I am not sure, I really have to focus more one clojure.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s