Sometimes Dragons

Automated Diagnostics: a conversation (and demo) with PagerDuty


I recently had the chance to film the above What Is/How to video with PagerDuty’s Jake Cohen on “What is automated diagnostics? How to reduce escalations and accelerate resolution with automation.”

In short, automated diagnostics is about helping incident responders determine whether there is actually an issue and, if so, what domain expertise is needed. The goal is to resolve the issue as quickly as possible while reducing both the number of people brought in and the number of steps needed to reach a diagnosis and resolution.

Jake walks through all of this in the video, and he uses a particularly useful visual to demonstrate the tendency of incidents to draw in more people over time in an attempt to figure out what knowledge, skills, and systems access are needed (and who might be available to provide these essentials):

The text "Limited expertise and access drive up incident escalations" above a diagram showing the number of people on a Zoom/Slack/Teams call increasing as time progresses, with a sample message reading "Who can pull the logs?"

The problem is therefore twofold: it takes time to track down the information and expertise an incident requires, and the incident draws in a growing number of team members along the way. Many of them may not be needed to reach a resolution, but all of them have been pulled away from other tasks.

This is where automated diagnostics comes in. In many ways it is about organizing and curating existing domain expertise, which in many organizations is scattered across docs, wikis, and chat apps, and sometimes lives only in the minds of domain experts. But it is also about codifying this expertise (in the form of runbooks), and then automatically presenting that knowledge, along with related checks and actions, to the folks responding to an incident.

The title "Emulating the domain expert's investigation" and subtitle "Codify and automate the domain expert's investigation" above an illustration showing piecemeal runbook steps translated to a codified automated diagnostics runbook.

As Jake notes:

“When that incident occurs, can we replicate and emulate what the domain experts tend to check when they are first paged…about an alert.”

Automated diagnostic runbooks create more detailed context around an incident and provide suggested actions (including potential temporary remediations, or “band-aid fixes”). They can also note which teams or individual domain experts should be brought in under specific circumstances, which further streamlines the process and reduces the number of team members who are pulled in.
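To make the idea of a codified runbook a little more concrete, here is a minimal sketch in Python. It is not PagerDuty's product or API; every name in it (`Check`, `Finding`, `run_diagnostics`, `teams_to_page`, and the stubbed checks) is hypothetical and invented purely for illustration. It assumes each thing a domain expert would look at can be written as a function that returns pass or fail, paired with a suggested action and, optionally, the team to bring in if it fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One step a domain expert would normally run by hand (hypothetical)."""
    name: str
    run: Callable[[], bool]            # returns True when the check passes
    suggested_action: str              # shown to the responder on failure
    escalate_to: Optional[str] = None  # team to bring in on failure, if any


@dataclass
class Finding:
    """The result of one check, as presented to the responder."""
    check: str
    passed: bool
    suggested_action: Optional[str]
    escalate_to: Optional[str]


def run_diagnostics(checks: List[Check]) -> List[Finding]:
    """Run every check in order and collect the results."""
    findings = []
    for check in checks:
        try:
            passed = check.run()
        except Exception:
            # A check that cannot run at all is treated as a failure.
            passed = False
        findings.append(Finding(
            check=check.name,
            passed=passed,
            suggested_action=None if passed else check.suggested_action,
            escalate_to=None if passed else check.escalate_to,
        ))
    return findings


def teams_to_page(findings: List[Finding]) -> List[str]:
    """Only the teams named by failed checks, de-duplicated and sorted."""
    return sorted({f.escalate_to for f in findings if f.escalate_to})


# Stubbed checks standing in for real log, disk, or database queries.
checks = [
    Check("disk usage under threshold", lambda: True,
          "Clear old logs on the host"),
    Check("database accepts connections", lambda: False,
          "Restart the connection pooler (band-aid fix)",
          escalate_to="database-team"),
]

findings = run_diagnostics(checks)
for f in findings:
    status = "ok" if f.passed else "FAILED: " + str(f.suggested_action)
    print(f.check, "->", status)
print("Bring in:", teams_to_page(findings))
```

The point of the sketch is the shape rather than the details: the first responder gets the expert's usual checks, a suggested action for each failure, and a short list of who is actually needed, instead of asking a growing call "Who can pull the logs?"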

For more on automated diagnostics (or to see a demo), watch the video (in two parts) below or see the full transcript and related links.

Disclaimer: The videos discussed here were sponsored by PagerDuty (but this post was not).
