From Python/Perl to Scala: parsing .gen files

Sometimes I have to deal with data from circular dichroism experiments. This data comes in the form of a .gen file, containing the experimental conditions, the name and address of the experimentalist (for retribution purposes) and the raw/smoothed data. All I want from these files is the PDB code, the wavelength and the smoothed data. If I were doing this in Python/Perl, with my mutable variables/continue and limited typing, I would pull out the PDB code with something like this:

use warnings;
use strict;
my $line = "PDB                                       3CHY\n";
$line =~ /PDB\s+(\w+)/;
my $pdbid_1 = $1;
# or like this
my $pdbid_2 = (split(/\s+/, $line))[1];

At the same time, I would parse the wavelength and the associated millidegree value (wavelength and column z, below).

The lines take the form: wavelength x y z ….

import re

for line in open(genfile):
    if line.startswith("PDB"):
        parsePDB(line)
    elif re.match(r"^\d", line):
        ...  # split line, convert values to int and double, stash into lists
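
For the record, a runnable version of that Python loop might look like the sketch below. The function name parse_gen, the sample lines and the column layout (whitespace-separated, with the smoothed value in the fourth column) are assumptions made for illustration; a real .gen file may well differ. Both columns are converted to float here for simplicity.

```python
import re

def parse_gen(lines):
    """Return (pdb_id, wavelengths, values) from the lines of a .gen file."""
    pdb_id = None
    wavelengths, values = [], []
    for line in lines:
        if line.startswith("PDB"):
            # e.g. "PDB      3CHY" -> "3CHY"
            pdb_id = line.split()[1]
        elif re.match(r"^\d", line):
            # Assumed layout: wavelength x y z, separated by whitespace
            fields = line.split()
            wavelengths.append(float(fields[0]))
            values.append(float(fields[3]))
    return pdb_id, wavelengths, values

# Made-up sample lines, not real experimental data
sample = [
    "PDB                                       3CHY\n",
    "190\t1.0\t2.0\t3.5\n",
    "191\t1.1\t2.1\t3.6\n",
]
pdb_id, wavelengths, values = parse_gen(sample)
```

Passing open(genfile) instead of the sample list should work the same way, since a file object iterates over its lines.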

Doing the same with Scala doesn’t seem so simple (because I don’t know what I am doing). I have come up with the following solution, which I don’t think is particularly Scala-ish (I am learning – maybe):

import scala.io.Source.fromFile

val content = fromFile(fname).getLines.toList

// This gets me the pdb id
// (pdbidRE is an assumed pattern, mirroring the Perl regex above)
val pdbidRE = """PDB\s+(\w+)""".r
val pdbid = (content find ((p: String) => p.startsWith("PDB"))).get.trim match { case pdbidRE(pdbid) => pdbid }

// Now for the spectrum data - I want this as a List[(Double, Double)] so I can run some filter commands later ...
val startsWithNumberRE = """^(\d+.*)""".r
val spectrum: List[(Double, Double)] = for {
  line <- content
  // k is the matched data line, or None for lines that don't start with a digit
  k = line.trim match {
    case startsWithNumberRE(data) => data
    case _ => None
  }
  if k != None
} yield (k.toString.split("\t")(0).toDouble, k.toString.split("\t")(3).toDouble)

I tried a couple of different things, including returning a list whose first element was the PDB id and whose tail was a List[(Double,Double)]. However, the list returned was always a List[java.lang.Object], indicating that I had done something wrong (presumably because a String and a tuple share no more specific common type, so that is the best element type the compiler can infer). I have yet to work out better ways to do this at beginner level. Once I have the List[(Double,Double)] I so desperately wanted, I define another function that allows me to extract specific wavelength ranges.

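// The explicit return type and "=" are needed; without them the function returns Unit
def getSpectrum(start: Double, end: Double): List[(Double, Double)] = spectrum filter ((x: (Double, Double)) => x._1 >= start && x._1 <= end)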