Skip to main content

External Resources

This is a collection of external resources that may be useful for learning more about elements of incident response. Please feel free to submit PRs to add new resources if you find something particularly interesting.

Incident Response Procedures

Articles

  • PagerDuty Incident Response Guide This is the full (slightly sanitized) version of PagerDuty's internal incident response documentation, and it is very comprehensive. It is an excellent resource for seeing how to apply our general principles to a specific service.
  • Remote Incident Response This article by Ryan Frantz with help from Dr. Laura Maguire discusses the unique challenges of dealing with incident response with a distributed team.

Talks and Videos

  • nrrd 911 ic me: The Incident Commander Role This is a talk by Alice Goldfuss from SRECon 2016 where she talked about the incident response process at New Relic; this includes a discussion of severity levels and how they used a chatbot to automate elements of the process.

Books

  • Incident Management for Operations This is a book that talks about how to apply the ICS system to IT operations; it is a good introduction to the topic and describes how this actually looks in practice.

Incident Retrospectives (aka Postmortems)

Articles

  • Each Necessary, But Only Jointly Sufficient This 2012 blog post from John Allspaw provides a short description of why the idea of a "root cause" is a fundamentally flawed idea, and why learning must be the driving force behind incident analysis, not fixing.
  • Etsy Debriefing Facilitation Guide This is the incident retrospective guide used by Etsy and open-sourced in 2016; it's an excellent resource for conducting your own debriefings and the basis for a lot of similar guides throughout the industry.
  • The Infinite Hows This article by John Allspaw talks about the issues with the commonly used "Five Whys" system of incident analysis, and does an excellent job providing an overview of an alternative approach.

Talks and Videos

  • Incidents As We Imagine Them Versus How They Actually Are This is a talk by John Allspaw at PagerDuty Summit 2018 which is an excellent summary of the thorny issues around doing incident response and how what actually happened often gets oversimplified in a desire to make incidents fit in standardized boxes. If you watch nothing else about incident analysis, watch this.
  • Who Destroyed Three Mile Island? This talk by Nickolas Means at LeadDev Austin 2018 talks about the 1979 Three Mile Island disaster and is an excellent walkthrough of the difference between first stories and second stories, and the dangers of hindsight and outcome bias.

Books

  • The Field Guide to Understanding Human Error This one of the best texts for incident analysis, written by Sidney Dekker, a leader in the field. While not specifically about software services, the guidance in this book is applicable to almost any technical system.