Scala: fetch a web page within a time limit

Someone asked me to write a quick script to pull XML records from PubMed and extract various chunks of data, and I figured Scala was the way to go. Five minutes later they asked me to include a timeout, something I haven't needed before! Typically I don't bother with timeouts and would use something like:

import scala.io.Source.{fromInputStream}
import java.net._

val url = new URL("http://btareawasteofspace.com")
val content = fromInputStream( url.openStream ).getLines.mkString("\n")

That chunk of code imports the required libraries, connects to a resource, gets the content and chucks it into a string.

This time, instead of doing the above, I used scala.io.Source together with URLConnection, which lets me set the timeouts before retrieving the content:

import scala.io.Source.{fromInputStream}
import java.net._

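val url = new URL( seqUrl )      // seqUrl is a String holding the address to fetch
val urlCon = url.openConnection()
urlCon.setConnectTimeout(2000)   // up to 2 seconds to establish the connection
urlCon.setReadTimeout( 1000 )    // up to 1 second waiting on each read
val content = fromInputStream( urlCon.getInputStream ).getLines.mkString("\n")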

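As far as I understand it, URLConnection lets me define two limits: setConnectTimeout bounds how long to wait for the connection to be established, and setReadTimeout bounds how long to wait for data once connected. Both arguments are ints in milliseconds (0 means wait indefinitely), and a failure to connect or to get data within the allotted period results in a java.net.SocketTimeoutException.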
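
Since a timeout surfaces as an exception, the script has to decide what to do when one is thrown. Here is a minimal sketch of one way to handle it, assuming a script or REPL context like the snippets above. The names fetch and describe are ones I made up for illustration, and the default timeouts simply mirror the values used earlier:

```scala
import java.net.{SocketTimeoutException, URL}
import scala.io.Source
import scala.util.{Failure, Success, Try}

// Hypothetical helper: fetch `address` as a String, applying both timeouts (milliseconds).
// Any exception, including SocketTimeoutException, is captured in the returned Try.
def fetch(address: String, connectMs: Int = 2000, readMs: Int = 1000): Try[String] = Try {
  val con = new URL(address).openConnection()
  con.setConnectTimeout(connectMs)
  con.setReadTimeout(readMs)
  val source = Source.fromInputStream(con.getInputStream)
  try source.getLines().mkString("\n") finally source.close() // always release the stream
}

// Hypothetical helper: turn the outcome into a short message.
def describe(result: Try[String]): String = result match {
  case Success(body)                      => s"fetched ${body.length} characters"
  case Failure(_: SocketTimeoutException) => "timed out"
  case Failure(other)                     => s"failed: ${other.getClass.getSimpleName}"
}

// e.g. println(describe(fetch(seqUrl)))
```

Wrapping the call in Try keeps the timeout from killing the whole script, which matters when looping over many records: a slow record can be logged and skipped.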

2 comments

  1. This won’t work in every case; in particular, it won’t throw an exception after the time you specify via setReadTimeout() if the server is down or taking a long time to establish new connections.

    Java has two relevant timeouts here: a connect timeout and a read timeout. Your example sets the read timeout, but not the connect timeout; you need an extra line in there:

    val url = new URL( urlString )
    val connection = url.openConnection()
    connection.setConnectTimeout( 2000 )
    connection.setReadTimeout( 2000 )
    val content = fromInputStream( connection.getInputStream ).getLines.mkString("\n")

    This would give you a max timeout of 4 seconds, allowing two seconds to establish the connection and two more to read the response. I’m still not entirely sure where the request fits in; I’ve been assuming the time between sending the request and receiving the first packet counts against the read timeout.

    A connection timeout throws the same exception thrown by the read timeout, java.net.SocketTimeoutException.
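
A note on the "max timeout of 4 seconds" point above: as far as I can tell, the read timeout applies to each blocking read rather than to the response as a whole, so a server that trickles data out slowly could keep a fetch going well past that. If a hard overall limit matters, one option is to run the fetch in a Future and wait on it with a deadline. This is a rough sketch under that assumption, not a definitive implementation. fetchWithin and its parameters are hypothetical names, and it uses the global execution context for simplicity:

```scala
import java.net.URL
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.io.Source
import scala.util.Try

// Hypothetical helper: give up after `deadline` overall, however the time is spent.
def fetchWithin(address: String, deadline: FiniteDuration): Try[String] = Try {
  val work = Future {
    val con = new URL(address).openConnection()
    // Keep the socket timeouts too, so the background thread does not hang forever
    // on a dead server after the caller has already given up.
    con.setConnectTimeout(deadline.toMillis.toInt)
    con.setReadTimeout(deadline.toMillis.toInt)
    val source = Source.fromInputStream(con.getInputStream)
    try source.getLines().mkString("\n") finally source.close()
  }
  // Throws java.util.concurrent.TimeoutException once the deadline passes.
  Await.result(work, deadline)
}

// e.g. fetchWithin(urlString, 4.seconds)
```

Note that Await only stops the caller from waiting; the download itself may carry on in the background until it finishes or one of the socket timeouts fires.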
