Essays on the State of the Art and Future of Text Mining

The coursework for this Text Mining module has been quite challenging. Each week we had a task to complete, along the lines of evaluating the training of a part-of-speech tagger (a piece of software that tries to tag words with the part of speech they serve), or creating a named entity recogniser (a piece of software that tries to work out that some sequences of words have meaning above their component parts – for example “New York” means something different to “new” and “York”) using various methods. As I’ve worked through them, though, the goals have become clear – we were building up components that could work in sequence to process text. Neat.

One unusual aspect of the coursework is that it is all to be handed in together at the end, rather than week by week. If I’m honest it’d probably have been a little easier if I’d done the coursework in step with the lecture days – I actually fell a little behind because of various commitments.

Then there was the essay. A 3,000 word essay on the state of the art of text mining and my views for the future of the field.

I’ve not written an essay for at least 15 years now, and getting started was a real challenge. Text mining and Semantic Web maybe? Sentiment analysis is the future? I was pulling my hair out, trying to find an angle that I could argue cleanly through, citing academic research and the like. I’ve been screwing up outlines on bits of paper for about a week now!

That said, when I headed into Manchester yesterday and sat in my lectures, I had something of an epiphany. I guess the problem was that I feel the field has huge untapped potential, and I struggle to argue through a point of view I care about when I can’t see the current approaches panning out. I’m going to take a bit of a risk, and write an essay that (constructively) criticises some aspects of text mining today, proposing and arguing through a slightly different approach.

We’ll see how it goes – the last few bits of paper have so far avoided a one-way ticket to the bin. Hopefully I can produce a well-argued, reasonably interesting essay that I’ll get some marks for!


Why I didn’t write any software for Windows Mobile

A few years ago, around 2006 at a guess, I saved up a bit of my hard-earned dollar and bought a Dell Axim X51v. It was a wonderful little device for the time and I fancied having a go at writing software for it.

So I went to the Microsoft website to find out how to do that, where I was confronted with a request for more cash. In order to write a line of code for Windows Mobile at that time, you had to shell out for licenses to use Microsoft’s IDE and developer tools. That’s on top of whatever fees that MS was getting from Dell and the license I’d bought with the device to actually run Windows Mobile.

Naturally, I baulked at the idea and never gave it a go.

Nor have I bought anything from Microsoft since – although that wasn’t a conscious decision. It’s just that since then, there hasn’t been anything that I wanted to do in terms of development that mandated some kind of payment. Case in point – my faithful little HTC Magic, succeeded by my Samsung Galaxy S mobile phones. These phones are thoroughly awesome bits of kit which run on Android technology, and recently I had my first dabble in Android development.

Of course, everything you need to write software for Android is freely available on the web, and you can expect a post or two about how that’s going.

Out of curiosity, I checked back in on Microsoft, and it sure looks like you can write for Windows Mobile these days for free. Would it still cost money to write for Windows Mobile if the competition wasn’t giving away their goodies for free? I also had a look at Apple’s tooling to build stuff for the iPhone but I couldn’t work out if it’s free right now or not. (I couldn’t be bothered to look for more than a minute or two to be honest – any readers know?)

I wonder if my decisions since then would have played out any differently if I’d been able to just download the stuff I’d needed to have a go back in ’06? Who knows, I might have gotten hooked on the Microsoft toolset like Visual Studio.

Oracle Sues Google over Android

News has emerged of legal action being taken by Oracle (which recently acquired Sun Microsystems, the company behind Java) against Google over alleged infringements of patents in the Android operating system which currently enjoys great popularity in the mobile phone market.

As I’ve been giving software patents a bit of thought recently, I find this development quite interesting. The actual complaint against Google has been posted on VentureBeat and is worth a read. The language used comes over as direct and aggressive, but I think that’s just the way these legal claims are phrased generally.

There are a number of patents involved, all issued in the United States – so what are the alleged infringements? Let’s take a look – I’ll link the patents mentioned to copies of the claims and give a few thoughts based on a quick review of what I think the gist is. I should also say that patent claims are generally pretty dull to read so I don’t claim to have done a detailed analysis!

US Patent No. 6125447, “Protection domains to provide security in a computer system” filed in 1997. Lays out mechanisms to manage what software components can do in a computer system.

US Patent No. 6192476, “Controlling access to a resource”, filed again in 1997. A method for controlling, for example, the access a thread has to system resources based on what code is running in the thread at the time.

US Patent No. 5966702, “Method and apparatus for pre-processing and packaging class files” filed in 1999. A collection of mechanisms used in Java’s classloading operations.

US Patent No. 7426720, “System and method for dynamic preloading of classes through memory space cloning of a master runtime system process”, filed 2003. Er – what the title says!

US Patent No. RE38104, “Method and apparatus for resolving data references in generated code”, a re-issue of another patent originally filed in 1992 – three years before Java was released – by James Gosling. This patent is all about mechanisms Java uses to achieve the flexibility of an interpreted language with performance more akin to a compiled language. I wonder if this (potential) patent suit was a driver for Gosling’s departure from Oracle in early 2010?

US Patent No. 6910205, “Interpreting functions utilizing a hybrid of virtual and native machine”, filed 2002. More mechanisms Java uses to improve performance.

US Patent No. 6061520, “Method and system for performing static initialization”, filed 1998, improving performance of initialising static arrays.

Regardless of your views on software patents in general, I’d say these patents are quite well written and quite specific to the ways Java works. It’s likely that many will view this action as evil Oracle sniping at good ol’ Google, but I’m not sure I share that view.

This isn’t a corporate giant picking on the little guy – right now, Oracle is worth around $115bn, and Google weighs in at around $158bn. Why shouldn’t Oracle use the intellectual property assets it acquired when it bought Sun?

Will this lawsuit damage Google or kill Android? I don’t think so – besides the size and diversity of Google, Oracle and Google don’t seem to be directly competing in related domains, so it wouldn’t make much sense for Oracle to actually want to damage Google. I expect that this action is a move to get Oracle a slice of the Android pie and I think it might succeed.

Will Oracle’s customers turn away because of this action? Oracle’s big in the corporate world, and big companies aren’t very likely to take issue with business machinations such as these.

It could turn evil from there. If legal action against Google sticks, is it possible for action to be taken against everyone downstream of Google, from the phone companies to us, the users? It’s more difficult to imagine those kinds of moves being played by Oracle – there would surely then be reputational damage.

Another question I’m not sure of the answer to – where is OpenJDK in all this? Does this action present a future risk to these open source efforts or are there differences in licensing between Android and other open source initiatives?

At this stage at least, it seems to me that Oracle is playing the patent game by the rules. If there’s something wrong, it’s with the game, not the players.

Hibernate 3 Tip – Log PreparedStatement bindings

I was trying to see what values were being bound to placeholders in the JDBC PreparedStatements generated by Hibernate DAO test classes I’ve created as they go about their persistent business.

Dead easy, right? Hibernate supports a configuration parameter ‘show_sql’. Set that to true and see what’s going on under the covers. Well… not so much. For a simple save operation, setting that results in the following logged output:

Hibernate: insert into ComponentGroup (name, id) values (?, ?)

Not exactly what I was hoping for. I don’t know what values have been bound to those two question marks. After a bit of faffing and google-fu, I found this short-but-sweet post which showed a number of additional logging options to enable (assuming use of log4j). As it turns out, the important one in this case is the logger that covers Hibernate’s type classes (StringType, LongType and friends), which adds a little more detail to the output:

Hibernate: insert into ComponentGroup (name, id) values (?, ?)
19:23:40,753 TRACE StringType:151 – binding ‘group1’ to parameter: 1
19:23:40,754 TRACE LongType:151 – binding ‘1’ to parameter: 2

It has to be the TRACE level, not DEBUG, and I can now see that the effective SQL, substituting for the placeholders, is

insert into ComponentGroup(name, id) values (‘group1’, 1)

which helps me work out the detail of what’s going on.
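That substitution is easy enough to do by eye for two parameters, but as a rough sketch it can also be done in a few lines of plain Java. Everything here is hypothetical (the SqlPreview class and its render method are my own names, not part of Hibernate or JDBC), and it is only meant for reading logs: it naively replaces every question mark in order, so it would get confused by a literal '?' inside the SQL, and it should never be used to build real queries.

```java
import java.util.List;

// Hypothetical helper for reading logs only; not part of Hibernate or JDBC.
final class SqlPreview {
    private SqlPreview() {}

    // Replaces each '?' in turn with the matching bound value.
    // Naive on purpose: it does not understand '?' inside string literals.
    static String render(String sql, List<?> bindings) {
        StringBuilder out = new StringBuilder();
        int next = 0;
        for (char c : sql.toCharArray()) {
            if (c == '?' && next < bindings.size()) {
                Object value = bindings.get(next++);
                // Quote strings, leave numbers and other values bare.
                out.append(value instanceof String ? "'" + value + "'" : String.valueOf(value));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Values taken from the TRACE output above.
        System.out.println(render(
            "insert into ComponentGroup (name, id) values (?, ?)",
            List.of("group1", 1L)));
    }
}
```

Feeding it the statement and the two bindings from the TRACE lines gives the same insert statement I worked out by hand.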

Quick Review of ‘Spring in Action’

My better half bought me a copy of Spring in Action (2nd Edition) by Craig Walls last year. I think it’s been a great help for me as I’ve been getting started with Spring.

I’d say the first four chapters are worth reading in sequence to get a feel for what Spring does and how it does it.

Chapter 1 introduces what Spring does, with some really nice examples of how using the dependency injection capabilities allows components to be mocked up and unit tested much more easily. I think writing good unit tests can be challenging (well, it is for me anyway) so it’s nice to see this theme taking a prominent role.
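To give a flavour of why injection helps with testing, here is a minimal sketch in plain Java with no Spring involved. It is not an example from the book, and RateSource, PriceConverter and the numbers are all names I have made up for illustration. The point is only that the converter is handed its dependency instead of constructing it, so a test can pass in a stand-in.

```java
// Hypothetical names throughout; plain Java, no Spring required.
interface RateSource {
    double rateFor(String currency);
}

final class PriceConverter {
    private final RateSource rates;

    // The dependency is handed in rather than constructed here,
    // which is what lets a test supply a stub.
    PriceConverter(RateSource rates) {
        this.rates = rates;
    }

    double convert(double amount, String currency) {
        return amount * rates.rateFor(currency);
    }
}

final class PriceConverterDemo {
    public static void main(String[] args) {
        // A stub that always answers 2.0 stands in for a real rate service.
        RateSource fixedRate = currency -> 2.0;
        PriceConverter converter = new PriceConverter(fixedRate);
        System.out.println(converter.convert(10.0, "EUR"));
    }
}
```

In a Spring application the container would do the wiring in place of the hand-written constructor call, but the class under test looks the same either way.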

Chapters 2 and 3 start onto a description of Spring’s dependency injection capabilities, from declaring beans and references to craziness like declaratively substituting method implementations in a class.

Chapter 4 moves on to Spring support for aspect-oriented programming, a technique with which behaviours of an application that really don’t belong in an object’s code (think security, auditing, etc.) can be defined outside of your business logic. There’s a nice theme of examples running through these chapters that somehow does make this stuff make intuitive sense.
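The idea of keeping auditing out of the business code can be sketched with nothing more than the JDK's own dynamic proxies. This is not Spring's AOP API and not an example from the book; AccountService, PlainAccountService and AuditingProxy are hypothetical names, and the sketch skips error handling (an exception thrown by the target would surface wrapped in an InvocationTargetException).

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical business interface and implementation with no auditing code.
interface AccountService {
    String transfer(String from, String to);
}

final class PlainAccountService implements AccountService {
    public String transfer(String from, String to) {
        return from + "->" + to;
    }
}

// The cross-cutting concern lives here, outside the business logic.
final class AuditingProxy {
    private AuditingProxy() {}

    static AccountService wrap(AccountService target, List<String> auditLog) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Record the call, then hand over to the real object.
            auditLog.add("called " + method.getName());
            return method.invoke(target, args);
        };
        return (AccountService) Proxy.newProxyInstance(
            AccountService.class.getClassLoader(),
            new Class<?>[] { AccountService.class },
            handler);
    }

    public static void main(String[] args) {
        List<String> auditLog = new ArrayList<>();
        AccountService service = wrap(new PlainAccountService(), auditLog);
        System.out.println(service.transfer("savings", "current"));
        System.out.println(auditLog);
    }
}
```

The caller still just sees an AccountService, and PlainAccountService never mentions auditing, which is the separation the chapter is getting at.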

From here on in things get a bit more esoteric. Other topics I found interesting were database access (covering JDBC, Hibernate, JPA and more), web services, EJB and JMS – but there are many more. For these later topics, you tend to get a little background and step-by-step introductions with examples. Given the range of topics there’s often a surprising depth. The material has also proven to be quite accessible when I’ve gone back for reference.

There aren’t really any downsides. Where I’d like the material to go deeper there are other books I can get that are more specific. The occasional humour can be irritating if I’m in the wrong mood but hey, it’s in moderation.

If you’re looking for a good general introduction to what Spring is and what it can do for you, I can recommend this book.

On Software Patents

As I’ve been trying to broaden my knowledge of IT and software development, I thought it would be a good idea to read up on the issues around Intellectual Property as it relates to software – specifically, the idea of software patents and the implications for developers. I think this stuff is important – infringement of patents can lead to legal action which is expensive and can damage reputations.

There’s a lot of information out there on the subject, so rather than just repeat stuff that’s already been said, I thought I would link off a few resources I thought were informative/enjoyable and why.

I found Paul Graham‘s essay, ‘Are Software Patents Evil?‘ after reading a few other resources, but I’d recommend it as a first read as it’s not too long and it has a prosaic style which I thought was quite accessible. It also seems to be a fairly balanced account of the pros and cons of software patents, whereas the other resources I found tended to be in one camp or the other.

Ciaran O’Riordan has published a lengthy overview of the state of play with Software Patents in Europe, also referencing interesting material about the reality of software patents and impact on innovation in the US. There’s a lot of information here between the content and the links and I found it took a bit of digesting, but worth it to find out the recent history with regards software patent legislation.

Patent Risks of Open Source Software is a nice short article, focusing on the legal risks inherent in Open Source. There are some good points made in this paper, answering questions like ‘can you just swap out an open source component that infringes a patent for a custom component you wrote yourself and be safe?’. Although the article is focussed on Open Source, it seems (to me) that most of it is actually applicable to software in general – how much protection do you really get if the closed source software that you’re using is found to infringe patents?

I couldn’t decide whether a user of software that infringes on a third party’s patent could be liable for infringement themselves, so I asked the question. The answer seems to be that yes, a user could be liable, and there are a couple of statements and links off to articles that support that conclusion.

It seems to be something of a consensus that the software patent situation is becoming more heated, and that this focus is being driven by newer players in the game taking legal action, perhaps inappropriately, against other parties infringing their patents. Searching for company names and ‘patent’ tends to find sequences of results that are patent-related news for that company.

The most surprising thing for me about this whole patent business is that a patent lasts twenty years. In IT today, the world changes week by week and month by month. Twenty years ago, there was no such thing as a website. I guess no-one patented the idea of a website. I wonder how the world would be different if someone had?

Pop quiz – can you think of an example of a computing technology that succeeded because it wasn’t patented, or one that succeeded because it was?

Nexus and OpenJDK

An odd one tonight, using the Nexus Repository Manager with OpenJDK, the open source Java implementation. Nexus mostly works fine, but fails to re-index the public repository group with (according to the wrapper log) a JVM crash.

jvm 1    | 2010-06-21 20:55:11 INFO  [pool-1-thread-1] – o.s.n.i.DefaultInde~          – Cascading merge of group indexes for group “public”, where repository “releases” is member.
wrapper  | JVM exited unexpectedly.
wrapper  | JVM exited in response to signal UNKNOWN (11).

The problem manifested in the Eclipse IDE when the repositories view wouldn’t update, showing an empty folder under the Nexus public repo.

Switching out the OpenJDK implementation for the Sun implementation fixes the problem, and now re-indexing the public repository group works fine. Bug report NEXUS-3603 raised, but if you’re seeing this issue swapping the Java implementations seems to work.

End of Patterns for e-Business

Well, the P4EB exam was last week, so that wraps up that module – unless something goes terribly, terribly wrong and I have to resit!

The exam deviated from the previous years’ exams quite a bit. It was in two parts: the first was pretty much just bookwork, and the second was a choice of three questions and more analysis-based. In previous years, the second part was a set of three standalone questions, which meant that they were pretty well defined and it was fairly easy to see what knowledge the question wanted you to demonstrate.

This year, the second part consisted of a business description and context diagram that was then used as the basis for all three questions. I thought that the questions weren’t so well defined, and so I’m not totally sure how much or little I should have answered with. Oh well – time will tell!

That’s also half-way through the taught part of the course – three down, three to go. For the last three I’ll be heading back to Computer Science modules, probably centred on Logic, Ontologies and Natural Language Processing – which means that I need to spend some quality time with mathematical logic this summer ready for next year.

Revising for P4EB Exam

I feel revision for the Patterns for e-Business exam is going pretty well. There are some interesting questions to answer, such as describing the difference between the Strategy and State patterns. That one’s absorbed a fair bit of thought to get my ideas to some degree of clarity and conciseness.
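For what it's worth, here is how I'd sketch the difference in Java. This is my own revision aid under my own assumptions, not the lecture material, and every name in it (DiscountStrategy, Checkout, OrderStates, Order) is made up. The two patterns have nearly the same class structure; the difference is who decides which implementation is in play. With Strategy the caller picks an algorithm and it stays put, while with State the object moves itself from one behaviour to the next.

```java
import java.util.Locale;

// Strategy: the caller chooses the algorithm and it does not change by itself.
interface DiscountStrategy {
    double apply(double price);
}

final class Checkout {
    private final DiscountStrategy discount;

    Checkout(DiscountStrategy discount) {
        this.discount = discount;
    }

    double total(double price) {
        return discount.apply(price);
    }
}

// State: each state knows its successor, so the object swaps its own behaviour.
interface OrderState {
    OrderState next();
    String label();
}

enum OrderStates implements OrderState {
    NEW { public OrderState next() { return PAID; } },
    PAID { public OrderState next() { return SHIPPED; } },
    SHIPPED { public OrderState next() { return this; } };

    public String label() {
        return name().toLowerCase(Locale.ROOT);
    }
}

final class Order {
    private OrderState state = OrderStates.NEW;

    // The caller only asks to move on; the current state decides where to.
    void advance() {
        state = state.next();
    }

    String status() {
        return state.label();
    }
}
```

A Checkout built with a half-price strategy keeps halving prices until someone builds a different Checkout, whereas an Order goes from new to paid to shipped without the caller ever naming a state.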

My revision schedule has settled down into a pattern now and, being the public-spirited chap that I am, I’ll share what works for me with you.

Step 1 – Lecture Notes and Background

Review the lecture notes piece by piece, making sure that every term, statement and nuance is understood. This usually starts as soon as the lectures are done. Often, I’ll work through the notes noting down each important statement as a question so that I can quiz myself.

I haven’t yet had an exam immediately following the lecture series (I think in each case so far, the first five weeks have been lectures, followed by a reading week, a subsequent five week series, a reading week, and then the exam period) which would cut that time down to one week, meaning that revision would need to happen during the course of the exams.

This time involves a fair bit of ‘reading around’ the subject, chasing down those subtleties that I missed during the course of the lectures. Easily done – the pace can be kinda intense. This bit probably averages less than an hour a day – but it’s a marathon, not a sprint.

Step 2 – Past Papers

Answer every question on every past exam paper I can find to learn how the questions are asked and how to answer them.

I don’t pay too much attention to past papers until I feel I’ve got a good coverage of the course material. I hope that this helps me avoid just learning how to answer the exam questions. It’s more about the learning than the exams, right?

This step is generally no more than a week or two before the exam.

Step 3 – Exam Day

I like afternoon exams. I follow my usual schedule and get into University before 9am, giving me the whole morning to review the past papers and any troublesome spots one last time and generally take it fairly easy. It’s nice not to have to worry about travelling and delays, too.

This approach has also worked well for me for the Sun Certifications I’ve taken. As far as I can tell, there are no real short cuts to learning stuff – it takes time and effort (if only I had a USB port for my internal memory!)