Donnie Berkholz's Story of Data

Time for sysadmins to learn data science


At PuppetConf 2012, I had an epiphany while watching a talk by Google’s Jamie Wilkinson in which he live-hacked monitoring data in R. I can’t recommend his talk highly enough; as an analytics guy, I found it mind-blowing.

Since then, one thing has become clear to me: As we scale applications and start thinking of servers as cattle rather than pets, coping with the vast amounts of data they generate will require increasingly advanced approaches. That means over time, monitoring will require the integration of statistics and machine learning in a way that’s incredibly rare today, on both the tools and people sides of the equation.

It’s clear that the analysis paralysis induced by the wall of dashboards doesn’t work. We’ve moved to an approach defined largely by alerting on-demand with tools like Nagios, Sensu, and PagerDuty. Most of the data is never viewed unless there’s a problem, in which case you investigate much more deeply than you ever see in any overview or dashboard.

However, most alerting remains broken. It’s based on dumb thresholds rather than anything even the slightest bit smarter. You’re lucky if you can get something as advanced as alerting based on percentiles, let alone standard deviations or their robust alternatives (black magic!). With log analysis, it’s considered great if you can even manage basic pattern-matching to group together repetitive entries. Granted, this is a big step forward from manual analysis, but we’re still a long way from the moon.
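
To make that gap concrete, here is a minimal sketch in R (with invented data and hand-picked cutoffs, not any particular tool’s method) contrasting a dumb static threshold with a percentile threshold and a robust median-plus-MAD alternative on the same latency series:

    # Simulated response-time series (ms): steady traffic plus a handful of spikes
    set.seed(42)
    latency <- c(rnorm(500, mean = 120, sd = 15), rnorm(5, mean = 400, sd = 20))

    # 1. Dumb static threshold: alert on anything above a hand-picked number
    static_alerts <- which(latency > 200)

    # 2. Percentile threshold: alert on anything above the 99th percentile
    p99        <- quantile(latency, 0.99)
    pct_alerts <- which(latency > p99)

    # 3. Robust alternative: median + k * MAD, which the spikes themselves
    #    cannot drag upward the way they would a mean/sd cutoff
    robust_cutoff <- median(latency) + 5 * mad(latency)
    robust_alerts <- which(latency > robust_cutoff)

    length(static_alerts); length(pct_alerts); length(robust_alerts)

The point of the robust version is simply that outliers in the history do not pollute the threshold itself, which is exactly where naive mean-and-standard-deviation cutoffs fall apart.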

This needs to change. As scale and complexity increase with companies moving to the cloud, to microservice architectures, and to transient containers, monitoring needs to go back to school for its Ph.D. to cope with this new generation of IT.

Exceptions are few and far between, often as add-ons that many users haven’t realized exist — for example Prelert (first for Splunk, now available as a standalone API engine too), or Bischeck for Nagios. Etsy open-sourced the Kale stack, which does some of this, but it wasn’t widely adopted. More recently Numenta announced Grok, its own foray into anomaly detection, which looks quite impressive. And today, Twitter announced another R-based tool in its anomaly-detection suite. Many of you may be surprised to hear that, completely on the other end of the tech spectrum, IBM’s monitoring tools can do some of this too.

On the system-state side, we’re seeing more entrants helping deal with related problems like configuration drift including Metafor, ScriptRock, and Opsmatic. They take a variety of approaches at present. But it’s clear that in the long term, a great deal of intelligence will be required behind the scenes because it’s incredibly difficult to effectively visualize web-scale systems.

The tooling of the future applies techniques like adaptive thresholds that vary by day, time, and more; predictive analytics; and anomaly detection (see the sketch after this list) to do things like:

  • Avoid false-positive alerts that wake you up at 3am for no reason;
  • Prevent eye strain from staring at hundreds of graphs looking for a blip;
  • Pinpoint problems before they would hit a static threshold, like an instance gradually running out of RAM; and
  • Group together alerts from a variety of applications and systems into a single logical error.
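
As a toy illustration of the third bullet, here is a hedged R sketch (simulated numbers, not any vendor’s algorithm) that fits a linear trend to free-memory samples and projects when the instance will run dry, well before a static “free memory below X” check would fire:

    # Simulated free-memory samples (MB), one per minute, leaking roughly 8 MB/min
    set.seed(7)
    minutes <- 0:180
    free_mb <- 2048 - 8 * minutes + rnorm(length(minutes), sd = 30)

    # Fit a linear trend and extrapolate to the point where free memory hits zero
    fit       <- lm(free_mb ~ minutes)
    slope     <- unname(coef(fit)["minutes"])
    intercept <- unname(coef(fit)["(Intercept)"])

    if (slope < 0) {
      minutes_left <- -intercept / slope - max(minutes)   # time until projected exhaustion
      cat(sprintf("Projected memory exhaustion in ~%.0f minutes\n", minutes_left))
      # Alert now if exhaustion is projected within the next four hours
      if (minutes_left < 240) cat("ALERT: instance trending toward out-of-memory\n")
    }

Real tools handle non-linear growth, seasonality, and noise far better than a straight line, but even this much beats waiting for the box to cross a fixed line at 3am.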

DevOps or not, I’m running into more people and bleeding-edge vendors who are bringing a “data science” approach to IT. This is epitomized by the attendees of Jason Dixon’s Monitorama conference. Before long, it will be unavoidable in modern infrastructure.

Want to get started? You could do a lot worse than Coursera’s data-science specialization.

Disclosure: Prelert, Splunk, IBM, and ScriptRock are clients. Puppet Labs has been. Etsy, Metafor, Nagios Inc, Numenta, Opsmatic, Twitter, and PagerDuty are not.


3 comments

  1. Donnie,

    I love the article and have to say it mirrors my own experience in trying to work out a ‘quick and dirty’ way to do performance aggregation and analysis for SAP landscapes. I learnt very quickly that visualisations are very easy, data wrangling is very hard, and you need a set of tools to work with the data at scale and ensure repeatability of results.

    The idea of treating servers as cattle and not pets is a critical shift in mindset, and you can see the difference between places that use Star Wars names for their servers and those with a constructed naming convention. I remember seeing servers being nursed through their problems when they should have been rebuilt and standardised – but process and paperwork got in the way.

    I do see a lot of broken monitoring processes which are espoused by the BIG IT firms, mostly because they are based on cookie-cutter templates. Monitoring is hard, because it is thought of as simple and because it is a completely inwardly focussed activity which does not have an easily calculable ROI. It therefore receives little investment, and when enterprise companies who already have the OSS fear that the only monitoring options available to them are expensive products from CA, IBM, HP…, monitoring is swept under the carpet and never talked about until something goes wrong.

    I think there is a missed opportunity for a lot of companies who want to undertake a Big Data project – they should use their lack of monitoring infrastructure as a test project. It is low-risk and has the ability to drive a lot of value. If a Big Data team can complete a Big Data project for server infrastructure monitoring on their own infrastructure, on time and in budget, then they know their stuff :-). Not that such expensive resources would ever be tested on such a demeaning project to ensure their hype lives up to their actual capability.

  2. For what it’s worth, static threshold alerts are not broken/useless. But you do have to be careful about how you use them. Over the last couple of years I’ve gotten a lot of mileage out of both static alerts and double-exponential-smoothing-based alerts. https://github.com/Netflix/atlas/wiki/Examples#double-exponential-smoothing has some details, and I look forward to more of the tooling being open-sourced in the coming months.
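
    As a rough illustration of that idea (this is not the Atlas implementation, just Holt’s double exponential smoothing from base R on made-up data), you can flag points that stray too far from the one-step-ahead forecast:

        # Simulated requests-per-minute series with a slow upward trend and one spike
        set.seed(1)
        rpm <- ts(200 + 0.5 * (1:300) + rnorm(300, sd = 10))
        rpm[250] <- rpm[250] + 120          # injected anomaly

        # Holt's method: double exponential smoothing (level + trend, no seasonality)
        fit  <- HoltWinters(rpm, gamma = FALSE)
        pred <- fitted(fit)[, "xhat"]       # one-step-ahead forecasts
        obs  <- window(rpm, start = start(pred))

        # Flag observations that deviate from the forecast by more than a robust cutoff
        resid  <- obs - pred
        cutoff <- 4 * mad(resid)
        time(obs)[abs(resid) > cutoff]      # should include the spike near t = 250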

  3. The people writing monitoring tools could make better systems by incorporating improved analytics, but if sysadmins have to “learn data science,” somebody is doing something wrong. Tool makers, senior sysadmins, system engineers – sure, learn away.

    Things like Bischeck and Prelert need to become more widely adopted and ingrained. Graphite has some nifty functions built in to do some of this.

    Monitorama is awesome. But the reason it is awesome is that monitoring sucks. Hopefully the monitoring tools (and the data provided to them) will eventually become good enough that Monitorama becomes a boring event.
