YouTube video summary

Courtney Nash Discusses Incident Management, Automation, and the VOID Report

Technology · 03 Oct 2024 · 15 min summary

InfoQ Dev Conference and VOID Report Introduction

  • The InfoQ Dev conference in Boston will feature senior software practitioners sharing their experiences on critical topics such as generative AI, security, and modern web applications, with plenty of time for attendees to connect with peers and speakers at social events 37s.
  • Courtney Nash is the author of the recent VOID report and a longtime contributor to the incident management space, having delivered a talk on the topic at QCon New York last year 1m1s.
  • The VOID report explores the unintended consequences of automation in software, including the role of AI, and provides a call to action for the community to share incident management data 2m24s.
  • Courtney Nash is the founder of the VOID and has a background as an editor at O'Reilly, where she worked on the Head First series of books and the SRE book 1m37s.
  • The VOID report was started in 2021 as a research program to collect examples of incidents in the wild, using public incident reports as a starting point 2m44s.
  • The report aims to provide a comprehensive analysis of incidents and their causes, and to encourage the community to share their own incident data to facilitate learning and improvement 1m19s.
  • The discussion covers a wide range of topics, including DORA metrics, working with socio-technical systems, and comparing different approaches to incident management 1m21s.
  • Courtney Nash's goal is to create a shared understanding of incidents and their causes, and to encourage the community to work together to improve incident management practices 1m31s.
  • A collection of over 3,000 incident reports was gathered, highlighting the lack of a centralized library for incident reports in the industry, unlike in aviation where anonymized incident reports have been shared to improve safety 3m45s.
  • This collection has grown to over 10,000 public incident reports, including write-ups from industry professionals, news articles, and tweets, which are used to tackle broad research goals 4m40s.
  • Research was conducted to dispel myths about incidents, including incident response, management, analysis, and attitudes, using real data instead of relying on attitudes and perceptions 5m10s.
  • A study on mean time to recovery (MTTR) was inspired by an engineer at Google, revealing variability in MTTR for incidents and challenging the assumption that lower MTTR indicates greater resilience or reliability 5m22s.
  • The findings on MTTR were shared with the DORA metrics team, leading to a shift in the metrics they encourage people to use 5m53s.
  • Research was also conducted on root cause analysis, coinciding with Microsoft Azure's shift away from this approach in favor of post-incident reviews and publishing postmortem analyses 6m16s.
  • The Microsoft Azure team now publishes video postmortems with customers and executives, indicating a change in approach to incident analysis 6m42s.
  • The time is ripe to explore automation in incident management, with the VOID report participating in this shift 7m4s.
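
The point about MTTR variability can be illustrated with a small simulation. This is a minimal sketch, not the VOID's actual analysis: it assumes incident durations follow a log-normal distribution, and the parameters (4.0 and 1.2), the sample size, and the function name `simulate_mttr_changes` are all made up for illustration. It draws two sets of incidents from the same distribution and compares their means, so any apparent "improvement" is due to chance alone.

```python
import random
import statistics


def simulate_mttr_changes(n_incidents=50, n_trials=2000, seed=0):
    """Compare the mean duration of two incident samples drawn from the
    SAME skewed distribution; return the relative change for each trial."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_trials):
        # Assumed (hypothetical) log-normal incident durations, in minutes.
        before = [rng.lognormvariate(4.0, 1.2) for _ in range(n_incidents)]
        after = [rng.lognormvariate(4.0, 1.2) for _ in range(n_incidents)]
        mttr_before = statistics.fmean(before)
        mttr_after = statistics.fmean(after)
        changes.append((mttr_after - mttr_before) / mttr_before)
    return changes


changes = simulate_mttr_changes()
# Share of trials where MTTR "improved" by 10% or more although nothing changed.
improved_10pct = sum(c <= -0.10 for c in changes) / len(changes)
print(f"Apparent MTTR improvement of 10% or more: {improved_10pct:.0%} of trials")
```

Under these assumptions a sizeable share of trials shows a seemingly meaningful MTTR drop even though the underlying process is identical, which is one way to see why a lower MTTR alone says little about resilience.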

VOID Report Research and Findings

  • A qualitative methodology, specifically thematic analysis, was used to examine the role of automation in incidents, analyzing around 10,000 reports and narrowing it down to 200 incidents that showed some indication of automation involvement 7m43s.
  • The analysis revealed that automation plays multiple roles in incidents, sometimes within the same incident: it can be involved in detecting the problem, be part of the problem, or take part in the resolution, and it often makes the incident harder to resolve 9m30s.
  • The research found that humans have to intervene in about 75% of incidents involving automation to resolve or deal with the issue 10m12s.
  • The study's findings contradict the common narrative that automation will free humans from tasks they are bad at or do not want to do, instead highlighting the importance of human intervention in complex systems 10m32s.
  • Research from other domains, such as aviation, healthcare, and nuclear power plant systems, supports the idea that automation can be a challenging team player in complex systems 10m38s.
  • The VOID report's analysis of automation in incidents is limited by the data available, as public incident reports only provide a partial view of the incident, and internal discussions may reveal more information 8m18s.
  • The study used keyword searches to identify incidents involving automation, initially yielding around 5,600 incidents, which were then narrowed down to 200 7m52s.
  • The analysis developed codes to identify similar phenomena in the text data, which were then revised and revisited to develop themes or categories of automation's role in incidents 8m49s.
  • The research highlights the varied roles automation plays in incidents, which can be categorized into archetypes, and emphasizes the importance of considering these complexities when designing and implementing automation systems 9m26s.
  • The prevalent mental model is that automation is good at certain things and humans are good at others, but this model does not accurately reflect how complex systems work 11m0s.
  • Research has shown that there are "ironies of automation," a concept introduced by L. Bainbridge in the 1980s, which highlights the paradoxes that arise when humans and automation interact 11m39s.
  • The idea is not to eliminate automation, but to readjust mental models and rethink how systems are built to help humans, rather than replace them 12m1s.
  • The VOID report suggests retiring MTTR (Mean Time To Recovery) as a metric, as it can be misleading, and instead focusing on more nuanced approaches to incident management 12m24s.
  • Some organizations are moving away from shortsighted approaches like root cause analysis (RCA) and towards more comprehensive methods 12m31s.
  • The criticism that some people may not be ready to move on from MTTR or RCA is acknowledged, but it is argued that averages can be misleading and that organizations should focus on more nuanced approaches 13m58s.
  • The work of DORA (DevOps Research and Assessment) is praised for its contributions to the industry, particularly in highlighting the importance of developer experience and culture 13m6s.
  • The use of data science and actual data can help organizations better understand the distribution of their metrics and make more informed decisions 13m41s.
  • Many organizations deal with non-normally distributed data, which makes it difficult to summarize with a single measure of central tendency (mean, median, or mode), as the data is often skewed 14m18s.
  • Large organizations with thousands of incidents per year may be able to derive some value from their data, such as Cloudflare, Google, and Microsoft, which publish their incident data 14m39s.
  • Noisy and skewed data require a large sample size or transformations, such as log transformations, to make sense of them, but these transformations can make the data meaningless to others in the organization 15m18s.
  • Assigning numbers to complex systems can be challenging, and using metrics like MTTR (Mean Time To Recovery) can be misleading and may not provide meaningful insights 15m43s.
  • Incentivizing metrics like MTTR can lead to unintended behaviors, such as engineers and incident response teams being assigned OKRs (Objectives and Key Results) that can affect their bonuses 16m5s.
  • Making metrics a target can change people's behaviors, often leading to negative consequences, as illustrated by an XKCD cartoon 16m41s.
  • The focus on metrics and targets can negatively impact the people responsible for making systems work, particularly those at the "sharp end" of these systems 17m6s.
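
The points above about skewed data, misleading averages, and log transformations can be made concrete with a few lines of Python. This is a sketch on invented numbers: the `durations` list is hypothetical and is not taken from the VOID or any real incident data.

```python
import math
import statistics

# Hypothetical incident durations in minutes: nine short incidents
# and one day-long outage.
durations = [12, 15, 18, 22, 25, 31, 40, 47, 65, 1440]

mean = statistics.fmean(durations)      # 171.5
median = statistics.median(durations)   # 28.0

# How many incidents were shorter than the "average" incident?
below_mean = sum(d < mean for d in durations)  # 9

# A log transformation tames the skew; the mean of the logs, transformed
# back, is the geometric mean. It is less affected by the outlier but is
# harder to explain to people outside the analysis.
geometric_mean = math.exp(statistics.fmean(math.log(d) for d in durations))

print(f"mean={mean:.1f} median={median:.1f} geometric mean={geometric_mean:.1f}")
print(f"{below_mean} of {len(durations)} incidents were shorter than the mean")
```

With a single long outage in the sample, the mean lands far above nine of the ten incidents, so it describes almost none of them. The geometric mean sits closer to the bulk of the data, at the cost of being a number that few stakeholders will find intuitive.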

Challenging Traditional Incident Management Metrics

  • Research often focuses on studying incidents and failures, but normal work is not studied as much, despite its importance in understanding how systems function 17m15s.
  • Studying normal work and tradeoff decisions in incident management can provide valuable insights, as seen in research by Dr. Laura Maguire 17m38s.
  • Research is being conducted to study normal work in high-pressure and high-tempo situations to understand what information incident responders and software engineers need to collect to help managers and senior leadership make decisions 17m58s.
  • The focus is on understanding what works and what is successful, rather than just analyzing failures, and recognizing the socio-technical side of systems, including the roles people play in making them work 18m44s.
  • There is a tendency to focus on technology and architecture, but it's essential to consider both the social and technical aspects and how they interact 19m11s.
  • Automation is beneficial for companies, and high-performing organizations are thought to correlate with mature practices in this space, although there is currently no data to prove this 19m27s.
  • Investing in learning from incidents and having dedicated incident analysis roles can give organizations a competitive advantage and increase the confidence of engineers in handling and understanding their systems 19m50s.
  • A preliminary survey was conducted as part of the 2024 report to understand what people are actually doing in the space, as the current understanding is skewed towards those at the cutting edge of incident analysis 20m42s.
  • The Learning from Incidents community, started by Nora Jones, has a large Slack community, and the perception of what people are doing is influenced by interactions with those who are personally and intellectually at the cutting edge of this field 20m57s.
  • Incident analysis is crucial for organizations, and investing in it can lead to higher confidence in feeding information back into the organization in useful and beneficial ways, with dedicated roles, executive support, and funded programs being key factors 21m47s.
  • Organizations that invest in incident analysis may have certain organizational and competitive advantages, but more data is needed to prove this 22m34s.
  • Research on incident management is focused on people, as they are essential to this field, and more people are needed to participate in this research to provide data and help improve practices 22m56s.
  • Digital services are mission-critical to the world, and participating in this work, sharing experiences, and publicly discussing incidents can help improve their resilience and reliability 23m30s.
  • A call to action is made for people to participate in research and share their experiences to help improve incident management practices, with the promise that it is not just marketing or surveys, but actual research 23m22s.
  • Automation has multiple roles in incidents, and understanding these roles can help improve incident management, with archetypes such as the Sentinel, Gremlin, Meddler, Unreliable Narrator, Spectator, and Action Item being identified 24m36s.
  • The VOID report and other research aim to provide insights into incident management and improve practices, with the goal of making digital services more resilient and reliable 24m11s.
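
Because automation can play several roles in the same incident, a coded incident is better modeled with a set of archetypes than with a single label. The sketch below uses the archetype names mentioned above. Everything else is assumed for illustration: the `CodedIncident` structure, the incident IDs, the role assignments, and the resulting numbers are invented and do not reproduce the report's data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Archetype(Enum):
    SENTINEL = "Sentinel"
    GREMLIN = "Gremlin"
    MEDDLER = "Meddler"
    UNRELIABLE_NARRATOR = "Unreliable Narrator"
    SPECTATOR = "Spectator"
    ACTION_ITEM = "Action Item"


@dataclass
class CodedIncident:
    incident_id: str
    archetypes: set          # one incident may carry several archetypes
    human_intervened: bool


# Invented examples, for illustration only.
incidents = [
    CodedIncident("inc-001", {Archetype.SENTINEL, Archetype.GREMLIN}, True),
    CodedIncident("inc-002", {Archetype.GREMLIN}, True),
    CodedIncident("inc-003", {Archetype.MEDDLER, Archetype.UNRELIABLE_NARRATOR,
                              Archetype.ACTION_ITEM}, True),
    CodedIncident("inc-004", {Archetype.SPECTATOR}, False),
    CodedIncident("inc-005", {Archetype.SENTINEL}, True),
]

# How often each archetype appears across all incidents.
role_counts = Counter(a for inc in incidents for a in inc.archetypes)

# Incidents where automation played more than one role.
multi_role = [inc.incident_id for inc in incidents if len(inc.archetypes) > 1]

# Share of incidents in which a human had to step in.
intervention_share = sum(inc.human_intervened for inc in incidents) / len(incidents)

for archetype, count in role_counts.most_common():
    print(f"{archetype.value}: {count}")
print("Multiple roles:", multi_role)
print(f"Human intervention: {intervention_share:.0%}")
```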

Automation in Complex Systems

  • The concept of automation can be complex and nuanced, with different definitions and interpretations, but a common understanding is that automation involves a computer performing tasks instead of a human, often in a repetitive and faster manner 25m38s.
  • The definition of automation often includes a third layer, which is the notion that it can perform tasks better than humans, but this is where the concept can become problematic 26m12s.
  • In complex systems, automation can fail in unexpected and surprising ways, which can be confusing and difficult to diagnose 26m40s.
  • Research by Lisanne Bainbridge and others has focused on the ironies of automation and the concept of automation surprises, highlighting the challenges of designing automation systems that can handle complex and unpredictable systems 27m1s.
  • Complex systems are defined as those that cannot be fully modeled or understood by one person, and are characterized by non-linear relationships between inputs and outputs 27m16s.
  • The concept of automation originated from the idea of assembly lines and linear systems, where tasks can be easily modeled and predicted, but this does not always translate to complex systems 27m35s.
  • In complex systems, automation failures can be difficult to diagnose and may not fail in expected ways, making it challenging to design and implement effective automation solutions 28m0s.
  • The design of automation systems in complex environments requires careful consideration of the potential failure modes and the limitations of automation in handling unexpected events 28m10s.
  • Automation can make complex systems harder to understand and manage, especially when things go wrong, as it can be difficult to introspect and understand what the automated system is doing 28m17s.
  • This issue is not unique to IT and is also seen in other domains such as medicine, healthcare, and aviation, where automation can make it harder for humans to deal with problems when they arise 29m26s.
  • The mental model of automation needs to be rethought, and it should be designed as a joint player or team member, rather than just a tool, to make it easier to work with and understand 29m38s.
  • To achieve this, it's recommended to focus on making automation more introspectable and easier to understand, and to prioritize developer UX for tooling and incident management 30m2s.
  • The amount of time and money spent on product UX should also be invested in developer UX for complex systems, to make it easier for developers to work with these systems 30m11s.
  • Automation in simple linear systems is different from automation in complex systems, and they should not be treated the same 30m42s.
  • While AI may be able to help with explainability and understandability, it's recommended not to jump into AI solutions too quickly, but rather to focus on getting the basics of automation right first 31m29s.
  • The argument that AI will make things better and smarter is not necessarily true, as humans are still responsible for making AI systems, and the basics of automation need to be understood before moving on to more complex solutions 31m42s.
  • There's a logical black hole in understanding how AI can diagnose complex systems better than humans, especially when the AI hasn't been given access to all the necessary inputs, and this is a major challenge in incident management and automation 31m47s.
  • The idea that AI can model an entire system is flawed, and it's essential to acknowledge that AI needs to know where to look for information to be effective 32m21s.
  • The hype surrounding autonomous cars and generative AI is similar, and it's based on the same misconceptions about automation and complex systems, which can lead to unrealistic expectations 33m16s.
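
One possible reading of "introspectable" automation is that every automated decision records what it saw, what it did, and why, so a responder can ask it to explain itself during an incident. The sketch below is only an illustration of that idea under stated assumptions: the `AutoScaler` class, its 80% CPU threshold, and the `explain_last` method are hypothetical and do not come from the talk or from any real tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Decision:
    at: datetime
    observed: dict
    action: str
    reason: str


@dataclass
class AutoScaler:
    """Toy automation that keeps a human-readable record of its decisions."""
    max_replicas: int = 10
    log: list = field(default_factory=list)

    def step(self, cpu_percent: float, replicas: int) -> int:
        if cpu_percent > 80 and replicas < self.max_replicas:
            new, action, reason = replicas + 1, "scale_up", "CPU above 80% threshold"
        elif cpu_percent > 80:
            new, action = replicas, "hold"
            reason = "CPU above 80% but already at max_replicas"
        else:
            new, action, reason = replicas, "hold", "CPU at or below 80% threshold"
        # Record the inputs and the reasoning, not just the outcome.
        self.log.append(Decision(
            at=datetime.now(timezone.utc),
            observed={"cpu_percent": cpu_percent, "replicas": replicas},
            action=action,
            reason=reason,
        ))
        return new

    def explain_last(self) -> str:
        if not self.log:
            return "No decisions recorded yet."
        d = self.log[-1]
        return f"{d.at.isoformat()} {d.action}: {d.reason} (observed {d.observed})"


scaler = AutoScaler(max_replicas=3)
scaler.step(cpu_percent=92.0, replicas=3)
print(scaler.explain_last())
```

The design choice worth noting is that the "hold" case is logged with its reason too. Automation that silently does nothing because it hit a limit is exactly the kind of behavior that is hard to diagnose from the outside.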

Improving Incident Management through Joint Cognitive Systems

  • It's crucial for organizations to focus on foundational automation and understanding complex systems before implementing AI solutions, rather than relying on AI to recommend solutions that may not work 33m54s.
  • Learning from incidents and introspecting system failures can help organizations identify pain points and improve response times, making it easier for engineers and incident responders to do their jobs 34m2s.
  • Providing better tools for incident responders and engineers is essential for organizations to improve their incident management capabilities 34m45s.
  • The concept of joint cognitive systems and making automation a team player is a challenging area, with 10 key challenges identified in the VOID report, including the need for a deeper understanding of agent theory and complex systems 35m1s.
  • The challenges in incident management can be addressed by leveraging human factors research and user interface expertise to improve the joint cognitive systems that work together to achieve a common goal 35m38s.
  • Joint cognitive systems research originated from the fields of aviation and surgical environments, where the collaboration between humans and machines was critical to preventing harm 36m4s.
  • The concept of joint cognitive systems involves anthropomorphizing computers to create a team-like environment where humans and machines work together to achieve a common goal 36m33s.
  • This approach requires updating the mental model from "computers are better at" and "humans are better at" to "we are a team trying to achieve this common goal" 36m56s.
  • To achieve this, it's essential to have better tooling, introspective systems, and user experience (UX) experts who can design internal tools that facilitate collaboration between humans and machines 37m11s.
  • The idealized version of a joint cognitive system is exemplified by the Iron Man suit, which represents a system that can communicate with its user and provide real-time information 37m43s.
  • Designing such a system requires a deep understanding of human factors, user experience, and the ability to create a system that can guide the user's attention and summarize information 38m21s.
  • Implementing this approach requires significant work, investment, and a shift in corporate priorities, which can be challenging, especially in the current economic climate 38m37s.

Apples and Volkswagens: The Problem with Aggregate Incident Metrics

  • A recording titled "Comparing Apples and Volkswagens: The Problem with Aggregate Incident Metrics" is available online, discussing the issue of using metrics that don't accurately represent the reality of one's experience 39m3s.
  • The title is inspired by the speaker's late mother, a sociologist who used the metaphor of comparing apples to Volkswagens to convey that different things can't be directly compared 39m17s.
  • The recording explores whether duration and severity of incidents are related, finding no statistical correlation between the two 40m13s.
  • The speaker aims to empower people at the sharp end of incident response to have more meaningful conversations with those at the blunt end by providing data-driven insights 40m57s.
  • The speaker is collecting incident data and invites people to submit their own incidents through a simple form or by reaching out through LinkedIn or the website's contact form 41m19s.
  • A larger survey will be fielded this year to gather more data on the experiences and effectiveness of incident responders and on-call teams 41m51s.
  • People can sign up for the newsletter to receive updates on the survey and other related information 42m8s.
  • The conversation closes with the host thanking the guest, noting that there was a lot to cover, and promising to link to the relevant reports and other materials 42m16s.
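
Readers who want to check the duration and severity question against their own incident data could start with a rank correlation, which suits skewed durations and ordinal severity levels. The sketch below implements Spearman's rank correlation with the standard library only. The choice of Spearman is an assumption made here, not necessarily the method used in the talk, and the sample `durations` and `severities` are invented, so the printed value says nothing about real incidents.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented data: duration in minutes, severity 1 (most severe) to 4.
durations = [12, 340, 45, 18, 95, 1440, 27, 60]
severities = [2, 3, 1, 4, 2, 3, 1, 4]

rho = spearman(durations, severities)
print(f"Spearman rho between duration and severity: {rho:.2f}")
```

A value near zero would be consistent with duration and severity being unrelated in that data set. With only a handful of incidents, though, the estimate is noisy, so it should be read alongside the sample size.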