
Cyber Incident Response Playbook: From Detection to Lessons Learned

The attack on MGM Resorts in September 2023 started with a 10-minute phone call.

Scattered Spider social-engineered an MGM IT help-desk worker into resetting MFA credentials on a privileged Okta account. Within hours, BlackCat/ALPHV ransomware had encrypted slot machines, hotel key systems, restaurant point-of-sale terminals, and reservation platforms across multiple Las Vegas properties. The final tab: $100 million in Q3 losses and a 10-day operational outage.

That same week, Caesars Entertainment was hit by the same threat actor using the same playbook. Caesars reportedly negotiated a $15 million ransom payment and avoided material operational disruption.

Same attacker. Same technique. Radically different outcomes. The variable wasn’t security tooling — it was response execution.

TL;DR

  • NIST SP 800-61 Rev. 3 (April 2025) restructures incident response around CSF 2.0’s six functions and explicitly recommends playbooks as the preferred format for documenting response procedures
  • CIRCIA requires critical infrastructure entities to notify CISA within 72 hours of a covered cyber incident and within 24 hours of a ransomware payment; final rule expected May 2026; penalties up to $500,000 per day for non-compliance
  • The MGM-Caesars divergence shows that response speed and pre-established decision authority determine outcomes more than any individual security control
  • A playbook is not an incident response plan — it is the operational step-by-step that gets executed at 2am when people are stressed and Slack is down

What a Playbook Is (and Isn’t)

An incident response plan defines governance: who has authority, what the escalation path looks like, and what notification obligations apply. A playbook is what your team does — step by step — when a specific type of incident hits.

NIST SP 800-61 Rev. 3, finalized April 3, 2025, explicitly notes that “formatting procedures within a playbook instead of another format can improve their usability.” The revision also structurally reorganized incident response to align with CSF 2.0’s six functions — Govern, Identify, Protect, Detect, Respond, and Recover — integrating it into the broader risk management program rather than treating it as a standalone process.

In practice, you want separate playbooks for separate incident types: ransomware, data exfiltration, business email compromise, DDoS, insider threat, third-party breach. What follows is the generic six-phase framework that every specific playbook builds on. The six-phase incident response plan structure covers governance and documentation requirements in depth; this playbook covers execution.
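
If the playbook library lives alongside the rest of the response tooling, the type-to-playbook mapping can be made explicit. A minimal sketch in Python, assuming a simple file-per-playbook layout (all names and paths here are illustrative):

```python
# Illustrative sketch: one playbook document per incident type, with the
# generic six-phase framework as the fallback. All paths are hypothetical.
PLAYBOOKS = {
    "ransomware": "playbooks/ransomware.md",
    "data_exfiltration": "playbooks/data_exfiltration.md",
    "business_email_compromise": "playbooks/bec.md",
    "ddos": "playbooks/ddos.md",
    "insider_threat": "playbooks/insider_threat.md",
    "third_party_breach": "playbooks/third_party_breach.md",
}

def playbook_for(incident_type: str) -> str:
    """Return the playbook document for a triaged incident type."""
    return PLAYBOOKS.get(incident_type, "playbooks/generic_six_phase.md")
```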

Phase 1: Preparation

Preparation is the only phase you fully control before the incident occurs. Everything else is reaction.

Build your response infrastructure now:

  • Incident response team with named roles: Incident Commander (single decision-maker), Technical Lead, Legal Counsel (internal and pre-identified outside counsel), Communications Lead, and Executive Sponsor. Every role needs a named primary and a named backup — not a title, a person (a machine-readable roster sketch follows this list).
  • Out-of-band communication channel: When your network is isolated or email is down, your response team needs another way to communicate. Signal group, phone tree, or a separate communication environment. Specify it in writing and test it.
  • Pre-identified outside counsel with IR capability: Sourcing legal help mid-incident costs hours, and communications made before counsel is engaged may not be protected by attorney-client privilege.
  • Forensic investigator on retainer: A 24-hour engagement delay for a forensics firm is expensive when the clock has already started.
  • Cyber insurance carrier procedures: Most policies require notifying the carrier before engaging external counsel or forensics. Know the process before it matters.
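
One way to keep the "named person, not a title" requirement honest is to store the roster in machine-readable form and flag stale entries automatically. A minimal sketch — the fields and the quarterly verification window are assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Contact:
    name: str            # a person, not a title
    phone: str           # reachable out-of-band
    signal_handle: str   # backup-channel identity
    last_verified: date  # when this entry was last confirmed current

@dataclass
class Role:
    title: str           # e.g. "Incident Commander"
    primary: Contact
    backup: Contact

def stale_entries(roster: list[Role], max_age_days: int = 90) -> list[str]:
    """Flag any contact not re-verified within the quarterly window."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [
        f"{role.title}: {person.name}"
        for role in roster
        for person in (role.primary, role.backup)
        if person.last_verified < cutoff
    ]
```

Running the staleness check on a schedule turns "verify contacts quarterly" from a policy statement into an alert.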

Pre-define decision authorities:

  • Who can authorize network isolation of a production system? This decision needs to happen in minutes. If it requires three approval layers, attackers gain time.
  • Who can approve a ransomware payment? What is the OFAC sanctions screening process?
  • Who is authorized to speak with the press?
  • Who makes the call to notify regulators?

The MGM response illustrates the cost of undefined decision authority. Unclear escalation paths for network isolation allowed ransomware to propagate during the decision gap. Every hour of delay between detection and containment is another hour of potential encryption or exfiltration.

Test your playbooks annually. Organizations that have never walked through a ransomware scenario in a tabletop are executing their playbook for the first time under maximum stress. That is not preparation — it is a hypothesis. The cyber resilience and business continuity framework covers tabletop exercise design and integration with BCP.

Phase 2: Detection and Analysis

The goal of detection is not to find every threat — it is to confirm an incident has occurred and scope it accurately before over- or under-reacting.

Immediate triage questions:

  1. Is this a confirmed security incident or a system failure?
  2. What systems are affected, and what data do they process?
  3. How far has the attack progressed — what is the blast radius?
  4. Is data exfiltration confirmed or suspected?
  5. What is the most likely initial access vector?

Severity classification: Define your levels in advance — P1 (Critical) through P4 (Low) — with specific criteria and escalation requirements at each level. CISA’s Federal Government Cybersecurity Incident and Vulnerability Response Playbooks use a comparable tiered severity structure. P1 incidents should trigger immediate Incident Commander activation and executive notification within one hour.
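
Severity tiers are easiest to enforce when they are encoded rather than left as prose. A sketch of one possible encoding — the criteria and escalation windows shown are illustrative, not drawn from the CISA playbooks:

```python
from enum import Enum

class Severity(Enum):
    P1 = "Critical"  # e.g. confirmed compromise of production or regulated data
    P2 = "High"      # confirmed incident with a contained blast radius
    P3 = "Medium"    # suspicious activity under active investigation
    P4 = "Low"       # anomaly or policy violation, no confirmed impact

# Illustrative escalation windows in minutes from classification;
# None means no mandatory escalation at that tier.
ESCALATION_MINUTES = {
    Severity.P1: {"incident_commander": 0,    "executive_notice": 60},
    Severity.P2: {"incident_commander": 30,   "executive_notice": 240},
    Severity.P3: {"incident_commander": 240,  "executive_notice": None},
    Severity.P4: {"incident_commander": None, "executive_notice": None},
}
```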

Start the clock. From the moment you “reasonably believe” a covered cyber incident has occurred, CIRCIA’s 72-hour reporting clock starts. The OCC’s 36-hour notification requirement for computer security incidents at supervised institutions also begins at detection — not at investigation completion. Timestamp everything: when the alert fired, when the first analyst reviewed it, when the incident was confirmed, and every subsequent finding.
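
A minimal sketch of the timestamping discipline, assuming nothing more than an append-only log (the event descriptions and hostname are hypothetical):

```python
from datetime import datetime, timezone

class IncidentTimeline:
    """Append-only, UTC-timestamped record of every incident event."""

    def __init__(self) -> None:
        self.events: list[tuple[datetime, str]] = []

    def record(self, description: str) -> datetime:
        stamp = datetime.now(timezone.utc)
        self.events.append((stamp, description))
        return stamp

timeline = IncidentTimeline()
timeline.record("EDR alert fired on host db-prod-02")   # hypothetical host
timeline.record("First analyst review began")
confirmed = timeline.record("Incident confirmed: reasonable-belief threshold met")
# Regulatory clocks (e.g. CIRCIA's 72 hours) run from this point,
# not from forensic conclusion.
```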

Preserve forensic evidence before anything else is touched. Containment actions that destroy forensic data are a persistent problem. Many organizations capture logs and images after the fact and discover critical evidence is gone — encrypted along with the affected systems or overwritten during recovery. The playbook should state explicitly: preserve before remediate.
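
As one concrete form of "preserve before remediate," the sketch below hashes and copies artifacts into an evidence store before any cleanup runs. The store path is a hypothetical write-once share; a real deployment would also stream large files rather than reading them whole:

```python
import hashlib
import shutil
from pathlib import Path

EVIDENCE_STORE = Path("/mnt/evidence")  # hypothetical write-once share

def preserve(source: Path) -> str:
    """Hash and copy an artifact into the evidence store; return its SHA-256.

    Run for logs and disk images BEFORE remediation touches the host;
    the recorded hash supports later chain-of-custody claims.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    destination = EVIDENCE_STORE / f"{digest[:12]}_{source.name}"
    shutil.copy2(source, destination)  # copy2 keeps original timestamps
    return digest

# Example: capture authentication logs before disabling the
# compromised account that appears in them.
# preserve(Path("/var/log/auth.log"))
```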

Phase 3: Containment

Containment has two phases: short-term (stop the spread) and long-term (establish a stable operating perimeter while investigation continues).

Short-term containment:

  • Isolate affected systems from the network while preserving forensic state where possible
  • Block the identified attack vector: disable compromised credentials, sinkhole C2 IP ranges, kill malicious processes
  • Implement emergency network segmentation to prevent lateral movement
  • Capture forensic images before further remediation

Long-term containment:

Establish a clean environment from which investigation and parallel recovery can proceed. This may mean standing up shadow infrastructure, re-routing critical processes through unaffected systems, or activating the business continuity plan for disrupted functions.

The Caesars-MGM comparison is the clearest real-world illustration of why containment speed matters. Caesars reportedly contained faster, which limited the ransomware’s propagation window. MGM’s extended containment gap — driven by unclear decision authority — allowed significantly broader encryption across more systems. Every system encrypted after detection is a system that could have been saved by a faster decision.

Phase 4: Eradication

Once contained, the goal is removing the attacker completely. Partial eradication is the most common cause of reinfection.

Eradication checklist:

  • Remove all malware, backdoors, webshells, and persistence mechanisms from affected systems
  • Reset all compromised credentials — assume every credential on an affected system is compromised until proven otherwise
  • Patch the exploited vulnerability across all instances, not just the initially identified system
  • Audit privileged access: were domain admin or service account credentials accessed or used?
  • Validate that all C2 communication channels are severed
  • Conduct threat hunting across adjacent systems — the affected systems you found first may not be the only ones

The MOVEit breach is the cautionary example. When Cl0p exploited CVE-2023-34362 — a SQL injection zero-day in Progress Software’s MOVEit Transfer product — the campaign hit more than 2,700 organizations and exposed data on approximately 95 million individuals. Many organizations that patched their MOVEit instance quickly discovered the exfiltration had already occurred. Eradication requires understanding when access began, not just closing the current vulnerability.

Phase 5: Recovery

Recovery is staged, verified restoration — not “restore from backup and go live.”

Recovery sequence:

  1. Validate that affected systems are clean before reconnecting to any production environment
  2. Restore from the most recent clean backup — verify backup integrity before restoring, not after
  3. Reconnect systems to production in stages, monitoring for reinfection at each stage (sketched after this list)
  4. Validate system integrity against known-good baselines before advancing
  5. Restore full operations only after confirming clean state across all affected systems
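
Steps 3 and 4 amount to a gate at every stage: reconnect a batch, watch it, validate it, and only then advance. A sketch of that loop, with the monitoring and validation hooks left as hypothetical environment-specific callables:

```python
import time

def staged_restore(stages, reconnect, is_reinfected, matches_baseline,
                   monitor_seconds=3600):
    """Reconnect systems stage by stage; halt on any sign of reinfection.

    `reconnect`, `is_reinfected`, and `matches_baseline` are placeholders
    for environment-specific hooks (EDR queries, integrity checks, etc.).
    """
    for stage in stages:
        for system in stage:
            reconnect(system)
        time.sleep(monitor_seconds)  # monitoring window before validation
        for system in stage:
            if is_reinfected(system) or not matches_baseline(system):
                raise RuntimeError(f"Halting restore: {system} failed checks")
```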

The 3-2-1-1-0 backup rule: Three copies, two different media types, one offsite, one immutable or air-gapped, and zero untested restores. The “zero untested” element is where most organizations fail — they have backups that have never been used in a recovery drill and discover during an actual incident that restoration procedures are broken or slower than documented.
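
The "zero untested" element can be partially automated: verify the backup set against the hashes recorded at backup time, then prove the restore path in a drill. A sketch, assuming a simple JSON manifest format:

```python
import hashlib
import json
from pathlib import Path

def verify_backup(backup_dir: Path, manifest_path: Path) -> list[str]:
    """Return files whose current hash differs from the recorded manifest.

    Assumes a manifest of {"relative/path": "sha256-hex"} written at
    backup time. An empty list means integrity checks passed; a full
    restore drill is still needed to satisfy "zero untested restores."
    """
    manifest = json.loads(manifest_path.read_text())
    failures = []
    for rel_path, expected in manifest.items():
        file = backup_dir / rel_path
        if not file.exists():
            failures.append(f"missing: {rel_path}")
        elif hashlib.sha256(file.read_bytes()).hexdigest() != expected:
            failures.append(f"hash mismatch: {rel_path}")
    return failures
```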

For publicly traded companies, assess materiality obligations as recovery progresses. Under the SEC cybersecurity disclosure rule, material incidents require an 8-K filing within four business days of a materiality determination. The materiality clock runs concurrently with recovery — it does not pause while systems are being restored.

Phase 6: Post-Incident Activity and Lessons Learned

The post-incident phase is where the next incident either gets prevented or set up to repeat. Most organizations skip it, or treat it as a 30-minute debrief that gets scheduled and rescheduled until nobody schedules it again.

Root cause analysis: The output should not be “ransomware hit us.” It should be the specific control failure that allowed initial access to succeed, propagation to go undetected, and containment to be delayed. MGM’s root cause was a help-desk process with no identity verification protocol for MFA reset requests. Fixing “ransomware” without fixing that process leaves the same door open.

Lessons learned report (within 10 business days of recovery):

  • Complete incident timeline from initial access through full recovery
  • Root cause analysis with contributing factors
  • Control gaps that enabled or extended the incident
  • Response actions taken and their effectiveness — what worked, what slowed you down
  • Specific corrective action items with named owners and deadlines
  • Plan and playbook updates required before the next incident

Regulatory documentation: Maintain complete documentation of the incident and response for regulatory examination. The OCC, FDIC, and state regulators have all requested incident timelines and response documentation during post-incident examinations. What you cannot document, from a regulatory standpoint, did not happen.

CIRCIA Reporting Integration

CIRCIA applies to covered entities across 16 critical infrastructure sectors. The final rule has been delayed to May 2026 — but the reporting obligations should be embedded in your playbook now, because the rule will apply immediately upon finalization and compliance timelines will not accommodate a program build-out after the fact.

CIRCIA clock triggers:

  • 72 hours from “reasonably believing” a covered cyber incident occurred → notify CISA
  • 24 hours from making a ransomware payment → notify CISA
  • Supplemental report required within 72 hours if additional material information emerges after the initial report

The controlling phrase is “reasonably believes” — not “investigation confirms” or “scope is fully known.” The clock starts at detection and plausible assessment, not at forensic conclusion. Build your escalation process to hit the 72-hour notification before it becomes a violation.

Concurrent notification obligations: A single incident may simultaneously trigger CIRCIA (72 hours), OCC computer security notification (36 hours), applicable state breach notification laws (timelines vary widely by state), and the SEC 8-K requirement (four business days from materiality determination). Your playbook should include a notification decision tree — a pre-mapped set of triggers, timelines, and approval steps for each obligation — embedded as a standard phase of response, not a footnote.
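
Because each obligation runs from its own trigger, the deadlines can be computed the moment those triggers fire. A sketch of that calculation — the state-law window shown is illustrative, since timelines vary by jurisdiction:

```python
from datetime import datetime, timedelta, timezone

def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by N business days, skipping weekends (holidays ignored)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday-Friday
            days -= 1
    return current

def notification_deadlines(reasonable_belief: datetime,
                           materiality: datetime | None = None) -> dict:
    deadlines = {
        "CIRCIA report to CISA": reasonable_belief + timedelta(hours=72),
        "OCC computer-security notification": reasonable_belief + timedelta(hours=36),
        # Illustrative 72-hour state window; actual timelines vary by state.
        "State breach notification": reasonable_belief + timedelta(hours=72),
    }
    if materiality is not None:
        # SEC 8-K: four business days from the materiality determination.
        deadlines["SEC Form 8-K"] = add_business_days(materiality, 4)
    return deadlines

now = datetime.now(timezone.utc)
for obligation, due in notification_deadlines(now, materiality=now).items():
    print(f"{obligation}: {due:%Y-%m-%d %H:%M} UTC")
```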

Common Playbook Failures

No pre-established out-of-band communication channel. During a network-down event or ransomware encryption, your primary communication tools may be unavailable. Every playbook needs a specified backup channel — tested.

Contact information that isn’t current. The Incident Commander listed in the playbook left the company eight months ago. Every role needs a named backup, and all contact information should be verified quarterly. An incident is not the time to discover that the IR team’s contact list is stale.

No forensic preservation step. Containment actions that destroy forensic evidence are common and costly. The playbook should contain explicit language: preserve system images and logs before any remediation action.

Undefined ransom payment authority. For ransomware incidents, who can authorize a payment? What is the legal review requirement? OFAC sanctions screening is a legal prerequisite before any ransom payment — paying a sanctioned entity creates regulatory liability on top of the incident itself. This decision chain cannot be improvised at 3am.

The lessons learned meeting gets cancelled. It always does. The business is back online, the pressure has lifted, and everyone is behind on their real work. Make the post-incident review a non-negotiable obligation with a fixed deadline. It is the only mechanism for converting incidents into program improvements.

So What?

A cyber incident response playbook is what happens when an incident response plan meets an actual incident. The MGM and Caesars comparison from September 2023 shows that response speed, pre-established decision authority, and clear containment triggers determine outcomes more than any individual security control.

Build the playbook before the incident. Test it annually in a scenario that simulates realistic conditions — not a reading exercise. Update it after every tabletop and every actual incident. When the 72-hour CIRCIA clock starts or the 36-hour OCC clock begins, you want a team that has rehearsed the process — not a team reading the plan for the first time under maximum stress.

The Incident Response & Breach Notification Kit includes an incident response plan template, incident classification and severity matrix, tabletop exercise scenarios, and breach notification decision trees covering federal and state requirements.

Frequently Asked Questions

What is the difference between an incident response plan and a playbook?
An incident response plan defines governance: roles, responsibilities, escalation paths, notification obligations, and the overall framework. A playbook defines execution: the specific step-by-step actions your team performs during a specific type of incident (ransomware, data exfiltration, business email compromise). NIST SP 800-61 Rev. 3 (April 2025) explicitly recommends formatting response procedures as playbooks because the format improves usability under stress. You need both — the plan establishes authority, the playbook tells you what to do at 2am.
What does NIST SP 800-61 Rev. 3 change about incident response?
NIST finalized SP 800-61 Rev. 3 on April 3, 2025, simultaneously withdrawing Rev. 2. The key structural change: instead of the Rev. 2 four-phase lifecycle (Preparation, Detection/Analysis, Containment/Eradication/Recovery, Post-Incident), Rev. 3 maps incident response recommendations to CSF 2.0’s six functions — Govern, Identify, Protect, Detect, Respond, and Recover. This integrates incident response into the broader cybersecurity risk management program rather than treating it as a standalone process.
When does the CIRCIA 72-hour reporting clock start?
CIRCIA’s 72-hour clock starts from the moment an organization “reasonably believes” a covered cyber incident has occurred — not when the investigation is complete or the scope is fully known. For ransomware payments, a separate 24-hour clock applies from the moment of payment. The CIRCIA final rule has been delayed to May 2026, but organizations in critical infrastructure sectors should build these timelines into their playbooks now. Penalties for non-compliance can reach $500,000 per day.
What was the key difference between the MGM and Caesars incident responses in 2023?
Both were hit by Scattered Spider in September 2023 using the same social engineering technique — calling the IT help desk to reset MFA on a privileged account. MGM’s response was slowed by unclear decision authority for network isolation; BlackCat/ALPHV ransomware propagated during that delay, resulting in a $100 million Q3 loss and a 10-day operational outage. Caesars reportedly identified the attack earlier, made rapid containment decisions, and negotiated a $15 million ransom payment — avoiding material operational disruption. Response speed and pre-established decision authority were the differentiating variables.
What should be in a ransomware-specific incident response playbook?
A ransomware playbook should include: (1) initial detection and triage criteria — how you distinguish ransomware from other failures; (2) isolation authority — who can authorize immediate network isolation of affected systems; (3) ransom payment decision tree — OFAC sanctions screening is required before any payment; (4) backup validation procedure — confirming backup integrity before restoration; (5) law enforcement notification (FBI Cyber Division); (6) CISA and regulatory notification timelines; (7) communication protocols for customers, employees, and the public; (8) forensic preservation steps before remediation.
How do you run an effective post-incident review?
Schedule it within 10 business days of returning to normal operations — before the institutional memory fades. The agenda: timeline reconstruction from initial access through recovery, root cause analysis (the actual control failure, not just “we got ransomware”), what response actions worked and what slowed you down, specific action items with owners and due dates, and plan and playbook updates required. The output should be a written report retained for regulatory examination. The failure mode is cancelling the meeting once the pressure is off — make it a standing obligation with a firm deadline.
Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

