Summit’s Technologist-in-Residence, Tim Kohn, shares lessons leaders can take from last week’s CrowdStrike incident. As he notes, effective incident response and a culture of preparedness require cross-functional coordination. Read more below.
Last week’s CrowdStrike incident was a painful reminder that failures with a large blast radius are intrinsic to the connected systems powering our world. We should absolutely wring every preventative lesson from this event but still embrace the reality that unforeseen bad things will happen. In my experience, there are some relatively lightweight practices that help organizations prepare and prevent recurrence:

- Identify a named “incident manager” and an escalation path before an event occurs. These individuals are responsible for coordinating resolution AND communicating status and next steps. Even the smallest teams need central coordination.
- Define a plan for communicating when messaging and/or email systems are down. A phone or text fan-out through reporting chains works well as long as people are prepared with team contact information.
- Develop an objective classification system for events. Know what qualifies as critical / high / medium / low, and have defined engagement and escalation for each category. This helps ensure the worst issues get prioritized and minimizes churn on low-severity issues.
- Prepare a Root Cause Analysis and (if necessary) a Correction of Errors document after the issue is remediated. When done well, these practices promote transparency and a culture of continuous improvement while reducing the chance that similar failures happen again. Google SRE has helpful guidance and an example.
- Run regular tabletop and “game day” simulations covering the most likely or impactful scenarios. An exercise that simulates many unavailable workstations would be good preparation for the CrowdStrike bug (plus similar higher-likelihood, high-impact failures).
- Finally, consider developing relationships with the appropriate incident response vendors (e.g., for forensic work) and with local law enforcement. This can be particularly important for incidents that might involve data loss or other security breaches.
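For teams that want to write the classification system down in a form tooling can consume, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a recommendation: the thresholds, the role names, the escalation timings, and the names `Severity`, `Policy`, `POLICIES` and `classify` are all hypothetical and should be replaced with whatever your organization agrees on.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


@dataclass(frozen=True)
class Policy:
    notify: tuple[str, ...]        # roles engaged as soon as the event is classified
    escalate_after_minutes: int    # escalate further if still unresolved after this long


# Hypothetical engagement and escalation rules, one per category.
POLICIES = {
    Severity.CRITICAL: Policy(("incident manager", "executive on call"), 15),
    Severity.HIGH: Policy(("incident manager",), 60),
    Severity.MEDIUM: Policy(("owning team lead",), 240),
    Severity.LOW: Policy(("owning team queue",), 1440),
}


def classify(users_affected_pct: float, data_at_risk: bool) -> Severity:
    """Map two objective inputs to a severity (thresholds are assumptions)."""
    if data_at_risk or users_affected_pct >= 50:
        return Severity.CRITICAL
    if users_affected_pct >= 10:
        return Severity.HIGH
    if users_affected_pct >= 1:
        return Severity.MEDIUM
    return Severity.LOW


if __name__ == "__main__":
    severity = classify(users_affected_pct=80.0, data_at_risk=False)
    policy = POLICIES[severity]
    print(severity.name, policy.notify, policy.escalate_after_minutes)
```

The point of the sketch is that classification depends only on inputs anyone can check during an event, and that each category maps to one predefined response, so nobody has to debate severity or escalation in the middle of an incident.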
Effective incident response is not solely the responsibility of engineering teams. It requires a coordinated, cross-functional effort to create a culture of preparedness and to address any potential impacts on your customers (e.g., customer service), your prospects (e.g., sales and marketing) and your brand (e.g., PR and communications).