CRUNCH 2017, Budapest

Friday 20th October 2017 at the Hungarian Railway History Park

I decided to attend CRUNCH 2017 earlier this year, when I was working in data engineering – it was the first data engineering conference I’d found in Europe. It turned out to be a great conference: a good mix of talks, an interesting venue, good organisation, great food and fun people. I’ve just filled out the feedback form and there wasn’t much I could think of for them to improve.

There’s another conference – AMUSE, a UX conference – that takes place in the same venue at the same time. I didn’t attend any of the UX talks, simply because I was in the data talks all day, but I did have a look at their schedule and some of the materials they produced, and it looks pretty good too. If I’d had a gap I’d have gone along and had a look.

I wasn’t intending to write a blog post, so I didn’t take good enough notes yesterday. That’s a shame as the first day’s talks were great too, and there were a couple of talks by women. All the talks I attended today were by men. The conference intro told us that 20% of CRUNCH attendees were female, with 50% female for AMUSE.

Beyond Ad-Click Prediction

If you didn’t test it, assume it doesn’t work.

Dirk Gorissen runs a Machine Learning meetup in London (which is useful for me, as I’m setting one up in Sheffield) and his day job is working with autonomous vehicles. He talked about machine learning-related projects he does in his own time.

He’s worked on processing data from ground-penetrating radar to find land mines and developed drones to help conservationists keep an eye on Orangutans that have been released into the wild in Borneo. He also talked about some of the challenges autonomous vehicles face, from old ladies in wheelchairs chasing ducks, through remembering that cyclist that just disappeared behind that bus, to dealing with bullying humans.

There are many opportunities to work on projects that build on the same skills that you use to predict whether someone will click on an ad. He suggests finding a local meetup, getting involved with charities or using resources like Kaggle to find projects to inspire you.

Scaling Reporting and Analytics at LinkedIn

Premature materialization is premature optimisation – the root of all evil

Shikanth Shankar talked about a problem we’ve all seen: metrics that overlap and disagree (typically because they’re produced by different teams!). LinkedIn faced this problem and built a system to produce metrics once and make them available to all. He stressed the need for end-to-end thinking for systems as well as people.

Analysing Your First 200M Users

Avoid complexity at all costs

Mohammad Shahangian joined Pinterest as its eighth engineer, two years after leaving education, and was tasked with writing its data strategy. Today, Pinterest has a billion boards, two billion searches per month and makes two trillion recommendations per year. Guess he got it right!

He opens with a story about when Pinterest opened up its initial Twitter and Facebook signup to email, with the resulting influx of spam accounts that’ll be familiar to anyone who’s dealt with email signups! To figure out how bad the problem was, they hid the email signup UI whilst leaving the form present but invisible in the page – reasoning that humans would stop signing up by email, revealing the spammers’ scripts. Neat!


Hiding the email signup UI revealed the spammers at Pinterest

If you treat your userbase as one homogenous group, your userbase will become one homogenous group

He went on to talk about how Pinterest used data to figure out that “Pin” was causing adoption problems in some languages, how segmentation really matters for decision making, the dangers of “linkage” (experiments tainting the control), and how optimising everything for clicks can lead to missed opportunities or bad UX – think clickbait!

Science the Sh*t out of your Business

A single KPI is data-driven

Justin Bozonier is the lead data scientist at Grubhub and wrote Test-Driven Machine Learning. I knew him as @databozo after a couple of people I’d met yesterday had mentioned him.

He dove a little deeper into some of the maths behind measures, but also talked about how to interpret and use them. Which is data-driven: a dashboard full of charts, or a single KPI that says ship or don’t ship? He argues it’s the single KPI, because a dashboard full of metrics needs a bunch of interpretation to drive a decision.

Justin also talked about false positives and shipping – shipping a feature that doesn’t have a positive impact doesn’t do any harm, and getting it out there has value in generating information and clearing the work out of the way. What about features that actually have negative impacts? Your analysis can show how much of a negative impact is likely, helping with the choice to ship. Not shipping something that actually has a positive impact costs you the benefit you would have had if you’d shipped it earlier, so there’s a tension here.

Event-Driven Growth Hacking

Sustainable growth is data-driven

Thomas in’t Veld opens with the observation that small companies have had access to big analytics for about ten years.

He skims the architectural choices Peak has made to provide analytics for a mobile-app-only, brain-training-style product, including Snowplow – a product I hadn’t come across before – for consuming and processing event data. He describes three “conditions for growth” in your company, including metrics and calculations to figure out whether your growth is sustainable (essentially, whether your cost of acquiring a customer is more or less than that customer’s expected lifetime value).

He finishes with five lessons learnt at Peak. Get the right culture, plan your event data carefully, validate early in the pipeline, deal with unit (not aggregate) economics and keep it simple.

AI in Production

Skin wins

Gio Fernandez-Kincade had me with “I’m going to talk about the things we don’t talk about”. Things like “How do you know that training data you crowdsourced is actually any good?” and “Does it work for our data?”.

He talked about his experience at companies like Etsy, picking out something that I’ve seen everywhere I’ve had contact with companies dealing with user-provided visual content – skin wins. People will post up… inappropriate content, and it’ll be a problem for the quality of your site and for your data science!

He also talked about some of the difficulties of taking models into production – like, how do you integrate them with your existing systems? It may be fast enough to classify an image in 800ms, but when a query has 40 images to classify, that time adds up. He hopes that within 5-10 years we’ll see production-ready AI systems where these kinds of concerns have been dealt with.

A new one for me: he talked about how asking users to complete multi-step processes before being allowed to register drives down registrations, and how the “Big ****ing Button” – that is, a plain “Sign Up Here!” button – is pretty much unbeatable.

Lessons Learned from Teaching Data Science to over a Million People (slides)

Sean Kross is one of the brains behind a set of three data science courses on the Coursera site. The Data Science Specialization has had 4 million enrolments!

He shared some of his insights into why people like the courses. First, they give everything away for free. People can pay for a certificate, but they don’t have to. The materials are hosted on GitHub, and the four scientists in Sean’s group have written fifteen books between them using the Leanpub publishing platform – also free, with the option to pay.

The courses use real-world data. They partner with SwiftKey for datasets around swipe-based mobile keyboards and Yelp for a bunch of different data.

They run every course every month instead of following the semester-style pattern that’s common. That means that people can finish quickly if they want to, or if life gets in the way, they can drop out and pick up again next month.

Finally, each course leads into the next, instead of being a collection of unrelated study material.

There’s also a new fourth course, still under wraps, on the way!

How Deep is your Data?


Sean and Mad bemused at questions about “Deep Data”. Were attendees just playing?

“Deep data” kept coming up in questions – none of the speakers knew what it was. Maybe it’s a new term that will improve how we communicate data stuff, just emerging at CRUNCH 2017. Or maybe it’s another silly buzzword that we’ll be rolling our eyes at by CRUNCH 2018. My money’s on the latter!

The questions were asked using sli.do – we just hit up the URL and then could ask and vote for questions. After the talk, the top voted questions got asked. Easy.

Update: recordings and slide decks are available now.


Recommended Tech Podcasts

I think podcasts are a great way of keeping up with a topic in that otherwise dead brain time when you’re travelling to work, washing the dishes or cleaning the floor. Here are a few of the best I’ve found over the last few years that don’t focus on any one particular technology.

Security Now (feed)

Since 2005, Steve Gibson and Leo Laporte have been talking security each week. You’ll get a summary of any high-impact or interesting security news, deep dives on technical topics and listener Q&As. You also get detailed show notes and full transcriptions of each podcast at grc.com, a service that has proved useful more than once in referring back to something I’d heard.

This is the place I first heard about Heartbleed and Shellshock. Steve’s discussion of HTTP/2 is both in-depth and straightforward, explaining a few details I’d missed in my own reading. The politics around security, privacy, advertising and encryption are also often a topic of discussion, and he recently explained how to use three dumb routers to securely operate IoT devices at home.

Episodes

Weekly, 1-2 hours. Summary of news early in the episode, deep dives later.

Recommended For

If you work in tech, you should be listening to this. If you don’t, but you have any interest at all in computers, you’ll probably get a lot out of it too.

Software Engineering Radio (feed)

‘The Podcast for Professional Software Developers’ has been working with IEEE Software since 2012, but has been broadcasting interviews with software industry luminaries since 2006. This is where I first learnt about REST, way back in 2008. More recently, the episodes on Redis, innovating with legacy systems, and marketing myself (which is why I’m making an effort to blog regularly!) really got me thinking.

Episodes

A little variable in timing, but normally at least one per month. 1-2 hours per episode, short introduction then straight on to the interview.

Recommended For

No prizes for guessing ‘Software Developers’. I think this is a great podcast for broadening your awareness of what’s going on outside whatever area you’re focussing on.

CodePen Radio (feed)

CodePen lets you write and share code with others, but that’s largely incidental to the podcast. Instead, the founders Chris Coyier, Alex Vasquez and Tim Sabat talk about the challenges and choices they face building and running CodePen. One of the things I like is the discussion of mistakes and compromises – it’s food for thought and makes me feel better about the mistakes and compromises I make!

They cover a variety of topics around running a site like CodePen. They talk about how their ‘Recent Activity’ feature works, switching from running their own database to using Amazon’s RDS, and how they deal with edge cases. They also talk about the business side of things, like hiring people and getting funding.

Episodes

2-4 episodes per month. A minute or two of introductions, then on to the main topic.

Recommended For

Detailed, practical insights into building and operating a small, successful tech company in 2016, so if this is something you do or want to do, I’d listen to this.

Developer Tea (feed)

Jonathan Cutrell produces ten-minute interviews and advice snippets for developers. He’s talked about prototypes, focus and ensuring professionalism. I think of this one as the super-short-form version of SERadio.

Episodes

10 minutes, 2-3 times weekly. Short intro, then content.

Recommended For

Software developers, maybe designers. The short format might work for you or not – I personally find it doesn’t seem to stick as well as the longer podcasts. I think a lot of the advice here is aimed at early-career developers, but it’s still worthwhile later in your career if you have time.

Wrapping Up

Have I missed any great podcasts along these lines? Let me know!

Expressively Selecting a Strategy using ES2015

I find myself needing to select a strategy based on some arbitrary function of the input often enough to look for a neat solution. Maybe it’s the output of a remote service that I want to decorate with a summary, or records from a document store that I want to normalize somehow. ES2015’s destructuring, Array’s find method and arrow functions provide the most flexible, concise and expressive way I’ve come up with so far of choosing the appropriate strategy from a list on a first-match basis. I’ll be using Babel so that I can write to the ES2015 spec and run on Node 4.

For example, say our spec says that given an input y:
* if it’s a string, uppercase it
* else if it’s an array, return a string describing the length
* else if it’s an object, return a string describing the number of keys
* else return “Nothing Matched” and the default toString() output

We’ll define an array of pairs of functions, where the first element in each pair will be treated like a predicate, and the second will be invoked if the first ‘matches’. Arrow functions make this definition much clearer than the traditional function() {...} syntax.

const renderingStrategies = [
  [x => typeof x === 'string',  x => x.toUpperCase()],
  [x => Array.isArray(x),       x => `Array with ${x.length} elements`],
  [x => typeof x === 'object',  x => `Object with ${Object.keys(x).length} keys`],
  [() => true,                  x => `Nothing matched '${x}'`]
];

That seems fairly expressive to me, mapping pretty directly onto the spec. You could use an array of objects, each with a pair of methods like (match, handle), but that involves quite a bit more boilerplate. Likewise, an if/else-if/else structure could do the job, but it’s more boilerplate and, for me at least, doesn’t imply the intent as clearly.

Now, we need a function that, for a given input, selects the first strategy whose predicate is true. Using Array’s find() to choose the first matching pair and destructuring to pull out the predicate makes this a one-liner.

const render = x => renderingStrategies.find(([matches]) => matches(x))[1](x);

render('Hello World'); // HELLO WORLD
render([1, 2, 3, 4]);  // Array with 4 elements
render({x: 1, y: 2});  // Object with 2 keys
render(1234);          // Nothing matched '1234'

Performance of this selector and its if/else-if/else version is roughly equivalent, both completing a million selections in around a second on my computer. It’s a shame that the only simple way I can see to pull out the decorator function (without a verbose filter and map) is to extract it by index. Let me know if you can see a better way!
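
If you don’t mind trading the one-liner for a block body, one sketch that avoids the index is to destructure the pair that find() returns:

const render = x => {
  // find() returns the first [predicate, decorate] pair whose predicate matches x
  const [, decorate] = renderingStrategies.find(([matches]) => matches(x));
  return decorate(x);
};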

If we were to use promises, then we could use destructuring again, and make our function asynchronous. For example:

const render = x => Promise.resolve(renderingStrategies.find(([matches]) => matches(x)))
  .then(([, decorate]) => decorate(x));

If you can improve on this, or suggest a better solution, leave me a comment, or get me on Twitter.

Node.js Microservice Optimisations

A few performance, scalability and availability tips for running Node.js microservices.

Unlike monolithic architectures, microservices typically have a relatively small footprint and achieve their goals by collaborating with other microservices over a network. Node.js has strengths that make it an obvious implementation choice, but some of its default behaviour could catch you out.


Cache your DNS results

Node does not cache the results of DNS queries. That means that every time your application uses a DNS name, it might be looking up an IP address for that name first.

It might seem odd that Node handles DNS queries like this. The quick version – the system calls that applications can use don’t expose important DNS details, preventing applications from using TTL information to manage caching. If you’re interested, Catchpoint has a nice walkthrough of why DNS works the way that it does and why applications typically work naively with DNS.

Never caching DNS lookups is going to really hurt your application’s performance and scalability. I think the simplest solution from a developer’s perspective is to add your own naive DNS cache. There are even libraries to help, like dnscache. I’d tend to err on the side of short cache expiry, particularly if you don’t own the DNS names you’re looking up. Even a 60-second cache will have a big impact on a system that’s doing a lot of DNS lookups.
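
For illustration, here’s a minimal sketch using dnscache – the ttl and cachesize values (and the URL) are placeholders to adjust for your own system:

// npm install dnscache – wraps the built-in dns module's lookups with a simple cache
require('dnscache')({
  enable: true,
  ttl: 60,        // illustrative: cache results for 60 seconds
  cachesize: 1000 // illustrative: maximum number of cached entries
});

// anything that resolves names via dns.lookup, like http.get, can now benefit from the cache
const http = require('http');
http.get('http://api.example.com/health', res => res.resume());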

An alternative, if you are running in an environment where you have sufficient control, is to add a caching DNS resolver to your system. This might be a little more complex but a better solution for some scenarios as it should be able to take advantage of the full DNS records, avoiding the hardcoded expiry. Bind, dnsmasq and unbound are solutions in this space and a little Google-fu should find you tutorials and walkthroughs.

Reuse HTTP Connections

Based on the network traffic I’ve seen from applications and test code, Node’s global HTTP agent disables HTTP Keep-Alive by default, always sending a Connection: close request header. That means that whether the server you’re talking to supports it or not, your Node application will create and destroy an HTTP connection for every request you make. That’s a lot of potentially unnecessary overhead on your service and the network. I’d expect a typical microservice to be talking frequently to a relatively small set of other services, in which case keep-alive might improve performance and scalability.

Enabling keep-alive is straightforward if it makes sense to do so – pass the option to a new Agent, or set the global agent’s http.globalAgent.keepAlive and http.globalAgent.keepAliveMsecs parameters as appropriate for your situation.
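
As a sketch, using a dedicated agent rather than the global one – the host, path and keepAliveMsecs values here are placeholders:

const http = require('http');

// an agent that keeps sockets open and reuses them across requests
const keepAliveAgent = new http.Agent({
  keepAlive: true,
  keepAliveMsecs: 1000 // initial delay for TCP keep-alive probes on idle sockets
});

http.get({ host: 'some-service.internal', path: '/health', agent: keepAliveAgent }, res => {
  res.resume(); // drain the response so the socket can go back into the pool
});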

Tell Node if it’s running in less than 1.5G of memory

According to RisingStack, Node assumes it has 1.5G of memory to work with. If you’re running with less, you can configure the allowed sizes of the different memory areas via v8 command line parameters. Their suggestion is to configure the old generation space by adding --max_old_space_size, with a numeric value in megabytes, to the startup command.

For 512M available, they suggest 400M of old generation space. I couldn’t find a great deal of information about the memory settings and their defaults in v8, so I’m using 80% as a rule-of-thumb starting point.
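
As a concrete example of that rule of thumb for a container with 512M available – server.js standing in for whatever your service’s entry point actually is:

node --max_old_space_size=400 server.js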

Summary

These tips might be pretty obvious – but they’re also subtle and easy to miss, particularly if you’re testing in a larger memory space, looping back to localhost or some local container.


Continuous Integration for Researchers?

TL;DR

Could tailored continuous integration help scientific researchers avoid errors in their data and code?

Computer Error?

Nature reported on the growing problem of errors in the computer code produced by researchers back in 2010. Last year, news hit the press about an error made in an Excel spreadsheet that undermined public policy in the UK. Mike Croucher discusses several more examples of bad code leading to bad research in his talk ‘Is your Research Software Correct?’.

It seems odd that computers are involved in these kinds of errors – after all, we write instructions down in the form of programs, complete and unambiguous descriptions of our methods. We feed the programs to computers and they do exactly what the programs tell them to do. If there’s an error, the scientific method should catch it when other researchers fail to reproduce the results. So why are errors slipping through?

That’s the question that Mike and I were chewing over between talks at TEDxSHU in December 2015. I think the talks I heard there inspired me to think harder about trying to find an answer. It seems like the first step to solving the problem is reproducing results.

Reproducibility Fail

My MSc. dissertation involved processing a load of data that I was given and running programs that I’d written to draw conclusions. Although my dissertation ran to many thousands of words, it was a fairly shallow description – my interpretation, in fact – of what the data said and what the code did. I can’t give you the data or the code as there were privacy and intellectual property concerns about both.

If I’m going to tear it apart, my dissertation really describes what I intended to tell a computer to do to execute my experiment. Then it claims success based on what happened when the computer did what I actually told it to do.

If you had my code, you could run it on your own data and see if my conclusions held up. You could inspect it for yourself. You could see the tests I wrote and maybe write some yourself if you had concerns. You could see exactly what versions of what library code I was using – maybe there have been bugs discovered since that invalidate my conclusions. If you had my data you could check that my answers were at least correct at the time and are still correct on more recent versions of the libraries.

If you had my code and my data, you still wouldn’t know what kind of computer I did the work on or how it was set up. Even that could change the result – remember the Pentium bug? Finally, if you had all that information, you’ve still got to get hold of everything you need, wire it all up and do your verifications. That’s quite a time and cost commitment, assuming you can still get hold of all that stuff months or years later.

Continuous Integration to the Rescue?

I’m sure I’ve just skimmed the surface of the problem here – I’m not a researcher myself, nor am I claiming that my dissertation was in any way equivalent to an academic paper. It’s just an example I can talk about, and it’s enough to give me an idea. It sounds a little like the “works on my machine” problem that used to be rife in software development. One of the tools we use to solve it is “continuous integration”.

Developers push their code to a system that “builds” it independently, in a clean and consistent environment (unlike a developer’s computer!). “Building” might involve steps like getting libraries you need, compiling and testing your code. If that system can’t independently build and test your code, then the build breaks and you fix it.

A solution along these lines would necessarily have to automatically verify that all the information needed to get the code running, such as the code itself, configuration parameters, libraries and their versions, and so forth are present and correct. If the solution could also accept data and results, and then verify that the code runs against the data to produce the results, then it seems like we’ve demonstrated reproducibility.

Setting up your own CI server isn’t necessarily straightforward, but Codeship, SnapCI and the like show that hosted versions of such solutions work, offer high levels of privacy and (IMHO) simplify the user experience dramatically. A solution like one of these, but tailored to the needs and skills of researchers, might help us start to solve the problem.

Tailored CI for Researchers

I think that the needs of a researcher might differ a little from those of a software developer. What kinds of tailoring am I talking about? How about:

  • quick, easy uploading of code, data and results, every effort to make it “just work” for a researcher with minimal general computing skills
  • built-in support for common research computing platforms like MATLAB and Mathematica
  • simple version control applied automatically behind the scenes – maybe by default each upload of code, data and results is a new commit on a single branch
  • maybe even entirely web-based development for the commonly-taken paths (taking cloud9 as inspiration)
  • support taking your code and data straight into big cloud and HPC compute services
  • enable more expert users to take more control of the build and test process for more unusual situations
  • private by default with ability to share code, data and results with individuals or groups
  • ability to allow individuals or groups to execute your code on their data, or their code on your data, without actually seeing any of your code or data
  • what-if scenarios, for example, does the code still produce the correct results if I update a library? How about if I run it on a Mac instead of a Windows machine?
  • support for academic scenarios like teams that might be researching under a grant but then move on to other things
  • support for important publication concerns like citations
  • APIs to allow integration with other academic services like figshare and academic journal systems

I think that’s the idea, in a nutshell. I’m not sure if it’s already being done or has been done, or if not, what could happen next, so I’m punting it into the public domain. If you have any comments or criticism, or if there’s anything I’ve skimmed over that you’d like me to talk about more, please leave me a comment or ping me on Twitter.

Embassytown

China Miéville’s Embassytown finally made it to the top of my reading list, after the recommendation on Terminally Incoherent. I have to agree with everything Luke says: it’s a pretty compelling sci-fi mixture. There’s a little fantastical technology, but the story revolves around humans interacting with an alien society whose use of language is fundamentally different to our own.

I finished the book yesterday and by chance listened to an oddly relevant episode of the Grammar Girl podcast this morning. “Because as a Preposition” talks about a new use of the word “because”, for example “I didn’t do my homework because Skyrim”. To me, this sounds wrong. No, sounds is too weak – it feels wrong, jarring, like other kinds of grammatical error. If I read it at speed, I read “…because of Skyrim”. It was a grammatical error when I learnt to speak, read and listen, back in the early eighties. Constructions that were erroneous then but became acceptable after I’d learnt seem to sit quite deep inside me – more a sense like taste or smell, with instinctive likes and dislikes, than something I think about.

The podcast talks about how this usage was already appearing in popular culture for those who learnt English after I did, so maybe to them it feels different – natural – when they use or observe it. You’ll see why it’s relevant when you read the book!

I found the story itself to be well crafted and I struggled to put it down. I’d certainly recommend it if you’re a fan of SF and the ideas of language and mind interest you.

Finishing my MSc. Dissertation

I finished my dissertation a couple of months ago, and have since graduated. Finishing was a great feeling, but I certainly remember the time when I thought I was losing control of the whole thing. I thought my experiments would fail to produce any positive results, and I lost any confidence that I would finish at all. It was a time of sleepless nights and distracted days, but I learned I’m not alone in feeling that way whilst trying to get a dissertation to come together. To anyone else who’s in that place: try not to get too stressed and negative about it. Stay focussed on what you want to achieve and keep going. If I can do it, you can – it will come together.

Here’s the final result of all that work, Pattern Recognition in Computer System Events – Paul Brabban, published here in the School of Computer Science library. If you want to read it, I’d suggest having a skim over the introduction and then maybe skip to the conclusions. If you’re still interested then the detail is in the middle sections and if you want to try and reproduce my work, there is an appendix detailing some of the implementation choices I made.

I’m lucky to have had such great tuition and support at Manchester, not to mention the excellent supervision I received for my project from Dr. Gavin Brown. I was also very happy to receive some great feedback from my external examiner, Professor Muffy Calder at the University of Glasgow. I couldn’t have done the project without the support of the industry partner, so thanks to them and their representatives. My mum and stepbrother painstakingly proofread my later drafts and picked out any number of grammatical errors, and my wife, my friends and my family supported me and listened to me going on and on about computer science geekery.

My eternal gratitude to everyone I’ve mentioned and anyone I’ve forgotten!