A disaster, minimised!

I’ve not been blogging these last few months, what with all my spare time going into trying to do some proper computer science and then writing my dissertation. Last night, I had a catastrophe – I noticed something in my results that should be impossible and traced it back to a subtle bug that compromised all my results to date! Pretty nasty at this stage in the project…

The effect was subtle and I didn’t think it would alter my conclusions. That said, to ignore it and continue wouldn’t be right. The alternative of explaining about the bug and its effects in my dissertation is not something I wanted to have to do either.

After a few minutes of sitting with my head in my hands, I decided to fix it and start again. After all, it’s just compute time and elbow grease; it’s not like I just threw away a month’s time on the LHC or anything! As it turns out, my decision to script everything paid off: a few tens of hours of compute time to reproduce all my data, a couple of hours with the charting software, and I’m good to go. The choice of LaTeX also turned out to be even more of a winner, as I was able to rebuild my document with the new figures and any layout modifications required almost trivially.

I was right – the conclusions do not change. However, they are now more striking and there are no oddities that I can’t really explain. Tips of the day for those doing work like this:

  • script everything you can – just in case you need to redo stuff
  • use LaTeX – because you can swap out every figure for a new version easily

There are plenty of other reasons for applying these two tips, but these are two reasons I hadn’t thought of before yesterday.


Scripting Java with JavaScript

Java programs run on the Java Virtual Machine, a kind of virtual computer that hides many of the differences between the different kinds of computers it’s running on. Folks have been writing implementations of other languages that run on this virtual machine for a while now – besides JVM-specific languages like Scala and Groovy, you can also get ports of existing languages like JRuby, Jython and JavaScript.

Conveniently, the Java 6 specification (released way back in December 2006) requires official scripting support in the javax.script package, and a slightly stripped-down build of Mozilla Rhino, a JavaScript implementation, is shipped with the JVM.

I’ve been meaning to take a look at this for a while now, and I decided to use these facilities to solve a problem I was having in my MSc. project.
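To give a flavour of what javax.script looks like from the Java side, here is a minimal sketch rather than code from my project. The class name ScriptConfigSketch and the method evalAndGet are made up for illustration, and the sketch is written against Java 8 or later so that it can use Optional. It assumes the JVM provides an engine registered under the name "JavaScript"; more recent JDKs no longer bundle one, in which case the method simply returns an empty Optional.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the project described here.
final class ScriptConfigSketch {

    // Evaluates a script and reads back one of its global variables.
    // Returns an empty Optional when no "JavaScript" engine is registered
    // on this JVM, or when the script leaves the variable unset.
    static Optional<Object> evalAndGet(String script, String variable) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        if (engine == null) {
            return Optional.empty();
        }
        try {
            engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed to evaluate", e);
        }
        return Optional.ofNullable(engine.get(variable));
    }

    public static void main(String[] args) {
        Optional<Object> days = evalAndGet("var days = 3 + 4;", "days");
        System.out.println(days.isPresent()
                ? "days = " + days.get()
                : "No JavaScript engine on this JVM");
    }
}
```

The exact Java type that comes back for a JavaScript number depends on the engine, so it is safest to treat the result as a Number rather than casting to a specific type.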

My project consists of runnable experiments that produce some kind of results over sets of data. I want to have fully set up experiments ready to run so that I can repeat or extend an experiment very easily without having to refer to notes or other documentation. That means writing programs that accept configuration information and wire up components.

The Java code to do this kind of thing tends to be very verbose – lots of parsing, type-checking and an inability to declare simple data structures straight into code. It’s tedious to write and then hard to read afterwards. Using JavaScript to describe my experiment setup looked like a good solution.

Example: creating a data structure that provides two named date parameters in Java, as concisely as I can:

package com.crossedstreams.experiment;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class RunExperiment {
  public static void main(String[] args) throws Exception {
    // HH is hour-of-day (0-23) and mm is minutes
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    Map<String, Date> config = new HashMap<String, Date>();

    config.put("start", format.parse("2012-02-01 00:00:00"));
    config.put("end", format.parse("2012-02-08 00:00:00"));

    // do stuff with this config object...
  }
}

That’s a lot of code just to name a couple of dates! The amount of code involved hides the important stuff – the dates. Now, achieving the same with JavaScript…

var config = {
  start: new Date("February 1, 2012 00:00:00"),
  end: new Date("February 8, 2012 00:00:00")
};

// do stuff with this config

When there are many parameters and components to deal with, it gets tough to stay on top of them all. Some of what I’m doing involves defining functions to implement filters and generate new views over data elements, and JavaScript helps again here by letting me define my implementers inline as part of the configuration:

filter: new com.crossedstreams.Filter({
  accept: function(element) {
    return element == awesome;
  }
})

This approach isn’t without problems. For example, there’s some ugliness when it comes to using Java collections in JavaScript and JavaScript objects as collections. To be expected I guess – they are different languages that work in different ways so there’s going to be some ugly at some of the interfaces, maybe even some interpretation questions that don’t have one right answer.

Nothing I’ve come up against so far has been hard to overcome once I’d figured it out. I think that using Java to build components to strict interfaces and then configuring and wiring them up using a scripting language like JavaScript without leaving the JVM can be a pretty good solution.
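As a rough sketch of that split, the standard javax.script.Invocable interface can turn a function defined in a script into an implementation of a Java interface. Everything named here is hypothetical: the ScriptedFilterSketch class, its nested Filter interface (a stand-in, not my real com.crossedstreams.Filter) and the filterFrom method. It also assumes a JavaScript engine that supports Invocable is available on the JVM, and returns an empty Optional when there isn’t one.

```java
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.util.Optional;

// Hypothetical example; names are illustrative only.
final class ScriptedFilterSketch {

    // The strict interface that Java components are written against.
    public interface Filter {
        boolean accept(String element);
    }

    // Evaluates a configuration script and asks the engine to implement
    // Filter using the script's top-level accept(element) function.
    // Returns empty when there is no suitable engine, or when the script
    // does not define the functions the interface needs.
    static Optional<Filter> filterFrom(String script) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        if (!(engine instanceof Invocable)) {
            return Optional.empty(); // covers a missing (null) engine too
        }
        try {
            engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("Configuration script failed", e);
        }
        return Optional.ofNullable(((Invocable) engine).getInterface(Filter.class));
    }

    public static void main(String[] args) {
        String script = "function accept(element) { return element == 'awesome'; }";
        Optional<Filter> filter = filterFrom(script);
        if (filter.isPresent()) {
            System.out.println("accepts 'awesome': " + filter.get().accept("awesome"));
        } else {
            System.out.println("No usable JavaScript engine on this JVM");
        }
    }
}
```

The Java side only ever sees a Filter, so the component code stays strictly typed while the behaviour lives in the configuration script.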


Setting up my Project Website

One of the assessed deliverables for my MSc project is a project website, so I’ve been having a bit of a setup session this weekend.

The objectives set for the website are a little… what’s the word… vague? See what you think:

A multipage website summarizing the work so far.
– Objectives
– Deliverables
– Plan
– Literature

That’s it as far as I can tell. Exactly how will the delivered work be assessed? Your guess is probably about as good as mine. Having looked at the discussion forum for the module (the full-timers did this in the first half of the year – I’ve been told I set my own deadlines when it comes to the project stuff as I’m not a full-time student) it seems that the marking scheme was quite severe with many complaints about low marks and little evident explanation, so I’ll make some enquiries before I start work on the content proper.

Back in April, I asked how the website deliverable should be ‘handed in’ and was told that a zip with some files in it would be fine.

Screw that.

I mean, seriously – the world has moved on. To be even vaguely interesting, I’m thinking about reusing relevant content from this blog, and some of the tooling I’m using like Ganttproject saves XML data that’s crying out for some transformation and JavaScript magic. I have my own domain name and there’s an opportunity here to learn some stuff about infrastructure (and I am doing this MSc. to learn stuff in the first place), so I’ve been setting up a server. Again, checking back on the forums, some of the other students went the same route and there’s no evidence of it harming their chances. I think hosting the project website as a subdomain of crossedstreams.com makes sense – I already own the domain name and subdomains are a simple matter of extra DNS records, which are dead easy to set up with my provider, getNetPortal.

I shan’t be hosting my site on getNetPortal though. As I spend most of my professional life working on the Java EE platform, Java is the obvious choice. Why not use a different language for the experience? Whilst I’ve got the time to learn a bit about hosting a public-facing website, I’m not sure I’ll have the time to learn a new way of creating websites that I’ll be happy with… not to mention that there’s a toolset and delivery pipeline that varies from platform to platform. Playing about with Erlang or some such will have to wait for another day.

GetNetPortal do host Java web applications, but it’s a shared Tomcat environment with a bunch of limitations, and apparently I’d put other people’s app availability at risk if I deployed more than three times in a day. So where else can I go? Other specialised hosting companies are out there, but they’re not exactly cheap…

So I’ve provisioned myself a server on Amazon’s Elastic Compute Cloud (Amazon EC2). Amazon provide a bunch of images themselves and one of them happens to be a Linux-based 64-bit Tomcat 7 server. Time between me finding the image I wanted and having a working server available? About five minutes. No matter how you cut it, that’s pretty awesome. To be honest, the biggest challenge was choosing an image – there’s a huge number to choose from and I tried a couple of other images that weren’t as well set up before settling on the Amazon-provided one. The best thing – EC2 is pay-as-you-go, at dirt cheap rates for low utilisation.

For those of you who haven’t seen EC2, here’s a couple of screenshots that might help explain what it’s all about. First up, let’s take a look at the application server I provisioned.

AWS Management Console with my instances

Checking my bill tonight, I can see an itemised account of exactly what I’ve been billed for. Being able to see this level of detail should let me stay in control of what I’m spending.

Amazon Web Services - Billing

The rest of my time has been spent having a look around my new server and setting things up:

  • Tomcat, to host a placeholder app in the root context
  • iptables, to route traffic from the privileged ports 80 and 443 to the ports Tomcat is listening on (8080 and 8443), which avoids the need to install a dedicated webserver or run Tomcat with root privileges
  • some self-signed SSL certificates, which I’ll need so that I can bring up apps that require logon; without SSL, those usernames and passwords would be floating around the internetz in the clear, negating the point of their existence
  • a script for the whole setup process, in case I need to set this stuff up again

Now, I can tick off the project tasks around setting up hosting nice and early. Quite a productive weekend!

Essays on the State of the Art and Future of Text Mining

The coursework for this Text Mining module has been quite challenging. Each week we had a task to complete, along the lines of evaluating the training of a part-of-speech tagger (a piece of software that tries to tag words with the part of speech they serve), or creating a named entity recogniser (a piece of software that tries to work out that some sequences of words have meaning above their component parts – for example “New York” means something different to “new” and “York”) using various methods. As I’ve worked through, though, the goals have become clear – we were building up components that could work in sequence to process text. Neat.

One unusual aspect of the coursework is that it is all to be handed in together at the end, rather than week by week. If I’m honest it’d probably have been a little easier if I’d done the coursework in step with the lecture days – I actually fell a little behind because of various commitments.

Then there was the essay. A 3,000 word essay on the state of the art of text mining and my views for the future of the field.

I’ve not written an essay for at least 15 years now, and getting started was a real challenge. Text mining and Semantic Web maybe? Sentiment analysis is the future? I was pulling my hair out, trying to find an angle that I could argue cleanly through, citing academic research and the like. I’ve been screwing up outlines on bits of paper for about a week now!

That said, when I headed into Manchester yesterday and sat in my lectures, I had something of an epiphany. I guess the problem was that I feel the field has huge untapped potential, and I struggle to argue through a point of view I care about when I can’t see the current approaches panning out. I’m going to take a bit of a risk, and write an essay that (constructively) criticises some aspects of text mining today, proposing and arguing through a slightly different approach.

We’ll see how it goes – the last few bits of paper have so far avoided a one-way ticket to the bin. Hopefully I can produce a well-argued, reasonably interesting essay that I’ll get some marks for!

Why I didn’t write any software for Windows Mobile

A few years ago, around 2006 at a guess, I saved up a bit of my hard-earned dollar and bought a Dell Axim X51v. It was a wonderful little device for the time and I fancied having a go at writing software for it.

So I went to the Microsoft website to find out how to do that, where I was confronted with a request for more cash. In order to write a line of code for Windows Mobile at that time, you had to shell out for licenses to use Microsoft’s IDE and developer tools. That’s on top of whatever fees that MS was getting from Dell and the license I’d bought with the device to actually run Windows Mobile.

Naturally, I baulked at the idea and never gave it a go.

Nor have I bought anything from Microsoft since – although that wasn’t a conscious decision. It’s just that since then, there hasn’t been anything that I wanted to do in terms of development that mandated some kind of payment. Case in point – my faithful little HTC Magic, succeeded by my Samsung Galaxy S mobile phones. These phones are thoroughly awesome bits of kit which run on Android technology, and recently I had my first dabble in Android development.

Of course, everything you need to write software for Android is freely available on the web, and you can expect a post or two about how that’s going.

Out of curiosity, I checked back in on Microsoft, and it sure looks like you can write for Windows Mobile these days for free. Would it still cost money to write for Windows Mobile if the competition wasn’t giving away their goodies for free? I also had a look at Apple’s tooling to build stuff for the iPhone but I couldn’t work out if it’s free right now or not. (I couldn’t be bothered to look for more than a minute or two to be honest – any readers know?)

I wonder if my decisions since then would have played out any differently if I’d been able to just download the stuff I’d needed to have a go back in ’06? Who knows, I might have gotten hooked on Microsoft tools like Visual Studio.

Oracle Sues Google over Android

News has emerged of legal action being taken by Oracle (which recently acquired Sun Microsystems, the company behind Java) against Google over alleged infringements of patents in the Android operating system which currently enjoys great popularity in the mobile phone market.

As I’ve been giving software patents a bit of thought recently, I find this development quite interesting. The actual complaint against Google has been posted on VentureBeat and is worth a read. The language used comes over as direct and aggressive, but I think that’s just the way these legal claims are phrased generally.

There are a number of patents involved, all issued in the United States – so what are the alleged infringements? Let’s take a look – I’ll link the patents mentioned to copies of the claims and give a few thoughts based on a quick review of what I think the gist is. I should also say that patent claims are generally pretty dull to read so I don’t claim to have done a detailed analysis!

US Patent No. 6125447, “Protection domains to provide security in a computer system” filed in 1997. Lays out mechanisms to manage what software components can do in a computer system.

US Patent No. 6192476, “Controlling access to a resource”, also filed in 1997. A method for controlling, for example, the access a thread has to system resources based on what code is running in the thread at the time.

US Patent No. 5966702, “Method and apparatus for pre-processing and packaging class files” filed in 1999. A collection of mechanisms used in Java’s classloading operations.

US Patent No. 7426720, “System and method for dynamic preloading of classes through memory space cloning of a master runtime system process”, filed 2003. Er – what the title says!

US Patent No. RE38104, “Method and apparatus for resolving data references in generated code”, a re-issue of another patent originally filed in 1992 – three years before Java was released – by James Gosling. This patent is all about mechanisms Java uses to achieve the flexibility of an interpreted language with performance more akin to a compiled language. I wonder if this (potential) patent suit was a driver for Gosling’s departure from Oracle in early 2010?

US Patent No. 6910205, “Interpreting functions utilizing a hybrid of virtual and native machine”, filed 2002. More mechanisms Java uses to improve performance.

US Patent No. 6061520, “Method and system for performing static initialization”, filed 1998, improving performance of initialising static arrays.

Regardless of your views on software patents in general, I’d say these patents are quite well written and quite specific to the ways Java works. It’s likely that many will view this action as evil Oracle sniping at good ol’ Google, but I’m not sure I share that view.

This isn’t a corporate giant picking on the little guy – right now, Oracle is worth around $115bn, and Google weighs in at around $158bn. Why shouldn’t Oracle use the intellectual property assets it acquired when it bought Sun?

Will this lawsuit damage Google or kill Android? I don’t think so – besides the size and diversity of Google, Oracle and Google don’t seem to be directly competing in related domains, so it wouldn’t make much sense for Oracle to actually want to damage Google. I expect that this action is a move to get Oracle a slice of the Android pie and I think it might succeed.

Will Oracle’s customers turn away because of this action? Oracle’s big in the corporate world, and big companies aren’t very likely to take issue with business machinations such as these.

It could turn evil from there. If legal action against Google sticks, is it possible for action to be taken against everyone downstream of Google, from the phone companies to us, the users? It’s more difficult to imagine those kinds of moves being played by Oracle – there would surely then be reputational damage.

Another question I’m not sure of the answer to – where is OpenJDK in all this? Does this action present a future risk to these open source efforts or are there differences in licensing between Android and other open source initiatives?

At this stage at least, it seems to me that Oracle is playing the patent game by the rules. If there’s something wrong, it’s with the game, not the players.