Between prep for my MSc. project, getting married, being snowed under at work, starting my next MSc. module and being full of cold, there hasn’t been much time for blogging…
So today was day 4 of the Text Mining module. As a friend put it, “Text Mining? What – like using grep?”
Text Mining is defined as finding previously unknown information in unstructured data. Unknown – as in never explicitly written down.
So by ‘text’, we mean un- or partially-structured data, like Word documents or this blog page. There’s some structure here (headings, subheadings, lists and the like), but it’s not ‘structured’ in the sense that database tables are, with rows and columns and a type system.
Tools like grep match regular expressions: patterns that describe words or, more generally, relatively simple sequences of characters. They’re fairly easy to use (so long as you don’t try to push them too far), but they are limited in the complexity of what they can do.
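To make that concrete, here is a rough sketch of grep-style matching in Python. The `grep` function and the sample sentences are made up for illustration; the real tool has many more options, but the core idea of testing each line against a pattern is the same.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression.

    A hypothetical helper that mimics the basic behaviour of grep.
    """
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up sample lines, purely for illustration.
lines = [
    "A fish swam past the boat.",
    "I fish every weekend.",
    "We went fishing on Sunday.",
    "The shop sells bread.",
]

# \b marks a word boundary, so this matches 'fish' as a whole word
# but not 'fishing'.
matches = grep(r"\bfish\b", lines)
for line in matches:
    print(line)
```

Both "A fish swam past the boat." and "I fish every weekend." come back, because the pattern only ever sees characters, not what the word is doing in the sentence.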
For example, you can’t easily use grammatical ideas, like identifying documents that are about fish (a fish), but not fishing (I fish). You can’t search for documents related to a concept, and recognising generic names or technical terms is out. You also can’t build structures like indices to help with searches, so every search has to scan every document from scratch, which means that over reasonably large collections of documents, grep is too slow to be very useful.
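As an example of the kind of structure meant by an index, here is a minimal sketch of an inverted index in Python: a mapping from each word to the set of documents that contain it. The document names, their contents and the `build_index` helper are all invented for illustration, and the tokenising is deliberately crude (lowercase runs of the letters a to z).

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Crude tokeniser (an assumption): lowercase runs of a-z only.
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

# Hypothetical documents, purely for illustration.
docs = {
    "aquarium.txt": "A fish needs clean water.",
    "weekend.txt": "We went fishing at the lake.",
    "recipes.txt": "Grill the fish with lemon.",
}

index = build_index(docs)

# One dictionary lookup replaces a scan of every document.
hits = sorted(index.get("fish", set()))
print(hits)
```

The index is built once, up front, and after that each search is a lookup rather than a pass over the whole collection. It still treats 'fish' and 'fishing' as unrelated strings, so it helps with speed but not with the grammatical problem.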
I’m still getting my head around how it hangs together, but text mining seems like a set of gloriously messy, pragmatic and seemingly pretty successful ways to let computers listen in on the languages that humans have evolved.