Day 4 covered methods of automatically identifying clusters in data – and some of the issues that arise using those techniques.
Doing this automatic identification is called unsupervised learning, because it doesn’t depend on having a set of labelled data examples to hand. The learning is done purely based on the statistical and probabilistic properties of the data itself.
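To make that concrete, here's a tiny k-means sketch in plain NumPy – my own illustration of the idea, not code from the course. The data is two made-up blobs of points with no labels attached; the algorithm alternates between assigning points to their nearest centre and moving each centre to the mean of its assigned points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two obvious blobs of 2-D points, with no labels attached.
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Start with one point from each blob as the initial centres
# (cheating a little – real initialisation is usually random).
centres = np.array([data[0], data[50]])
for _ in range(20):
    # Assign each point to its nearest centre
    dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centre to the mean of its assigned points
    centres = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(centres)  # centres settle near (0, 0) and (5, 5)
```

The point being: no labels anywhere – the grouping falls out of the geometry of the data alone.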
I've got to say, I'm struggling with the probabilistic side of things – my intuition isn't helping me much, so I've been hitting the books to try and really get my head round it. So little time…
We also covered some techniques for reducing the dimensionality of data. Say a dataset has a thousand properties, and the computational overhead of processing grows with the number of properties – you'll need some way of reducing that number while retaining as much of the information those properties encode as possible. We were looking at selecting features a couple of weeks ago, but today we looked at PCA – Principal Component Analysis, a technique to ‘project’ the information into a smaller number of dimensions, or properties.
I quite like this paper on PCA, if you’re looking for somewhere to get an idea of what it’s about.
And that, if you were reading this blog a few weeks ago, is where the eigenvalues and eigenvectors come in.
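Here's roughly how that works, as a toy NumPy sketch of my own (the made-up dataset has a third column that's nearly a copy of the first, so two dimensions capture almost everything): the eigenvectors of the covariance matrix give the directions to project onto, and the eigenvalues tell you how much variance each direction carries.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                      # 200 samples, 3 'properties'
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)    # third column nearly redundant

Xc = X - X.mean(axis=0)                  # centre the data
cov = np.cov(Xc, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Keep the two eigenvectors with the largest eigenvalues –
# these are the principal components.
components = eigvecs[:, ::-1][:, :2]
projected = Xc @ components              # same data, now 2 columns instead of 3
print(projected.shape)                   # (200, 2)
```

Here the top two eigenvalues account for nearly all of the variance, which is exactly the signal that you can drop a dimension without losing much.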
We also have a project to complete in the next couple of weeks, so time is very much of the essence right now. I suspect that as with the Semi-Structured Data and the Web course last year, the deeper concepts behind some of this material will only become clear to me when I’ve completed the set material and start to work on the revision of what we’ve covered.
Back in the day, revision was time off between lessons and exams – these days, not so much!