Donnie Berkholz's Story of Data

Cloud outages, transparency, and trust

Share via Twitter Share via Facebook Share via Linkedin Share via Reddit

CloudHarmony

The ongoing blips and bloops of public-cloud outages, whether planned or unplanned, continue to draw headlines and outrage. And rightly so, since downtime for those who use a single availability zone or even a single region can cost millions in lost business and reputation for companies whose own websites and online stores disappear.

The latest is a much-maligned 40-hour outage on Verizon’s new cloud:

As this tweet shows, the most important part of every outage, planned or unplanned, isn’t the outage itself. It’s everything surrounding it.

It’s the comms, stupid

Much like Bill Clinton’s 1992 rallying cry “It’s the economy, stupid,” cloud providers need to focus on what customers really care about.

Take a look at the CloudHarmony cloud-uptime listings. While AWS is among the top performers, Azure is far from it. Google has a few hours of downtime, and up-and-comer DigitalOcean is more comparable to Azure than AWS.

This suggests to me that outage frequency, within a certain range, isn’t a blocker on adoption of an otherwise compelling cloud provider. The question isn’t which provider is best — but what is the upper limit of what customers find acceptable.

One factor that does very clearly make a difference, however, is communications about the outage. The best-of-breed providers have status sites and Twitter accounts where they post periodic updates, whether an outage was planned or unplanned. Heroku and GitHub are good examples of this. While both sites have their share of downtime, they use strong transparency to maintain the trust of their users.

On the other side of the spectrum is Microsoft, which used to post nice postmortems but has since largely given it up. If you match up their public postmortems with articles pointing out Azure outages, you’ll note a significant disparity, particularly in the last year or two.

I got this bland, unattributed statement courtesy of Microsoft analyst relations:

Reliability is critical to our customers and therefore, extremely important to us. While we aim to deliver high uptime of all services, unfortunately sometimes machines break, software has bugs and people make mistakes, and these are realities that occur across all cloud vendors. When these unusual instances occur, our main focus is fixing the problem, getting the service working and then investigating the failure. Once we identify the cause of the failure we share those learnings with our customers so they can see what went wrong. We also take steps to mitigate that being a problem in the future, so that customers feel confident in us and the service.

We all understand that sometimes things break, because clouds are incredibly complex systems. We’re only really looking for two things out of it: (1) don’t have the same problem twice, and (2) keep us informed. Unfortunately, they aren’t living up to the second half of that. And they’re far from the only ones — see the Verizon example at the beginning of this piece.

As I argued a year ago:

For those wondering what a great postmortem looks like, Mark Imbriaco (in the past at Heroku, GitHub, and DigitalOcean) gives a masterclass here:

Monitorama 2013 – Mark Imbriaco from Monitorama on Vimeo.

And there’s a plethora of examples posted at sites including the following:

If you don’t have trust; if you think old-school opacity is still the right approach; you don’t have loyal customers and they’ll leave you at their first opportunity. Now you’ve seen the examples and the counterexamples — go forth and communicate!

Disclosure: Amazon Web Services, Microsoft, and Salesforce.com (Heroku) are clients. GitHub has been. Google, Verizon, CloudHarmony, and DigitalOcean are not.

by-sa

2 comments

  1. There’s also Allspaw’s write-up here, which includes the video from his 3-hour tutorial on post-mortems at Velocity: http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/

  2. I think re. keeping customers informed, immediacy is important. Awareness in time with specifics as to source/cause can enable the right kind of “graceful degradation”, presuming you haven’t automated it.

Leave a Reply

Your email address will not be published. Required fields are marked *