RiskTemplates · The Daily Brief · Sunday, May 24, 2026

Incident KRIs: Volume, Severity, Time to Contain, Time to Resolve, and Root Cause Patterns

How to build an incident KRI dashboard that measures what regulators actually care about—response times, severity patterns, repeat incident rates, and root cause closure.

TL;DR

  • Incident KRIs measure whether your program generates learning and accountability—not just whether you logged what happened
  • The core set: volume by severity, MTTD, MTTC, MTTR, regulatory notification timeliness, repeat incident rate, and corrective action aging
  • Root cause trending is the leading indicator most programs skip—rising repeat rates in any category are a governance problem, not just an operational one
  • OCC and FFIEC examiners look for evidence that incidents produce improvement, not just documentation

The CRO pulls up last quarter’s board deck and asks three questions: How many P1 incidents did we have? What was our average time to contain? And how many had the same root cause as the quarter before?

If your team scrambles to pull spreadsheets, the answer is already bad. Not because of the numbers—because those questions revealed that you don’t have incident KRIs. You have an incident log.

The difference matters when a regulator walks in. OCC and FFIEC examiners don’t just want to see that you tracked incidents. They want evidence that your program generates learning, drives accountability, and catches deterioration before it becomes a 36-hour notification event. That’s what KRIs do that logs don’t.

Why Incident Metrics Need a KRI Layer

Most incident programs are good at recording. They capture what happened, when it was detected, when it was resolved, and who owned the response. That’s a log. KRIs are different: they’re the metrics that tell you whether the program itself is healthy—whether it’s improving or degrading over time.

The interagency computer-security incident notification rule, issued by the OCC, the Federal Reserve, and the FDIC and codified for OCC-supervised institutions at 12 CFR Part 53, set an explicit external performance standard: banking organizations must notify their primary federal regulator as soon as possible and no later than 36 hours after determining that a notification incident has occurred. That’s a clock you can’t miss. Your KRIs should be calibrated to catch the conditions that put that clock at risk before the clock starts.

An institution tracking strong MTTD and MTTC metrics will almost never have a 36-hour notification problem—because it detects and contains quickly enough to make the determination and notification with time to spare. An institution that can’t answer basic questions about its response times is flying blind toward that deadline.

The Core Incident KRI Set

1. Incident Volume by Severity Tier

Track total incidents by severity tier (P1/Critical, P2/High, P3/Moderate, P4/Low) month-over-month. Volume alone doesn’t tell you much—a high-transaction-volume fintech might generate dozens of P4 events weekly without any systemic concern. What matters is trend and composition.

Rising P1 volume over consecutive quarters is a KRI. A shift from mostly P4 to increasing P2/P3 is a KRI signaling control degradation upstream. Flat volume with increasing customer impact per incident is a KRI showing that your severity classification may be understating harm.

Threshold guidance:

  • Green: No quarter-over-quarter increase in P1/P2 volume; P1 incidents remain within board-approved tolerance
  • Amber: 15–25% increase in P1/P2 volume quarter-over-quarter, or isolated spike with documented root cause
  • Red: >25% increase in P1/P2 volume without identifiable cause, or P1 count exceeds board-approved tolerance
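
As a minimal sketch, the tiers above could be encoded as a small function. The function name and parameters are hypothetical, and two cases the tiers leave open are filled in as labeled assumptions: an increase below 15% is treated as amber, and a quarter with no prior baseline is treated as amber if any P1/P2 incident occurred.

```python
def volume_status(prev_count, curr_count, p1_count, p1_tolerance, cause_documented):
    """Return 'green', 'amber', or 'red' for quarter-over-quarter P1/P2 volume.

    prev_count / curr_count: combined P1+P2 incidents in the prior and current quarter.
    p1_tolerance: board-approved maximum number of P1 incidents per quarter.
    cause_documented: True if a spike has a documented root cause.
    """
    if p1_count > p1_tolerance:
        return "red"
    if prev_count == 0:
        # Assumption: with no prior baseline, any P1/P2 incident is amber.
        return "green" if curr_count == 0 else "amber"
    change = (curr_count - prev_count) / prev_count
    if change <= 0:
        return "green"
    if change > 0.25:
        # A large spike stays amber only when the root cause is documented.
        return "amber" if cause_documented else "red"
    # Assumption: increases under 15% are not covered by the tiers; treated as amber.
    return "amber"


print(volume_status(prev_count=10, curr_count=12, p1_count=1,
                    p1_tolerance=2, cause_documented=False))
```

Checking the board tolerance first means a P1 breach is never masked by a flat or falling P1/P2 total.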

2. Mean Time to Detect (MTTD)

MTTD measures how long between an incident occurring and your team becoming aware of it. This is where most incident programs are blind. An alert-heavy environment can still have poor MTTD if alerts aren’t reviewed promptly, if monitoring coverage has gaps, or if teams have alert fatigue from too many false positives.

MTTD degrades MTTR silently. If you’re not catching P1 incidents for four hours, your 24-hour MTTR target is already compressed into a 20-hour window before the response even starts.

Threshold guidance:

  • Green: P1 MTTD under 1 hour; P2 MTTD under 4 hours
  • Amber: P1 MTTD 1–4 hours consistently; P2 MTTD 4–12 hours
  • Red: P1 MTTD exceeding 4 hours; any P1 incident with MTTD over 8 hours
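
One way to compute P1 MTTD and map it to the tiers above is sketched below. The record layout (dicts with `occurred` and `detected` timestamps) and the function names are assumptions for illustration, not a prescribed schema.

```python
from datetime import datetime, timedelta
from statistics import mean


def detect_gaps_hours(incidents):
    """Hours between occurrence and detection for each incident."""
    return [(i["detected"] - i["occurred"]).total_seconds() / 3600 for i in incidents]


def p1_mttd_status(p1_incidents):
    """Map P1 detection times to green/amber/red using the tiers above."""
    gaps = detect_gaps_hours(p1_incidents)
    if not gaps:
        return "green"  # Assumption: no P1 incidents in the period.
    mttd = mean(gaps)
    # Red: mean over 4 hours, or any single P1 undetected for more than 8 hours.
    if mttd > 4 or max(gaps) > 8:
        return "red"
    if mttd >= 1:
        return "amber"
    return "green"


start = datetime(2026, 1, 5, 9, 0)
sample = [
    {"occurred": start, "detected": start + timedelta(minutes=30)},
    {"occurred": start, "detected": start + timedelta(hours=9)},
]
print(p1_mttd_status(sample))
```

The single-incident check matters because one very slow detection can hide inside an otherwise acceptable mean.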

3. Mean Time to Contain (MTTC)

Containment means the incident is no longer spreading. Customer impact is capped, systems are isolated or failed over, and further damage is stopped. This is different from resolution, which means the problem is fully fixed and normal operations are restored.

These two metrics get conflated constantly in incident reporting, and the conflation matters. Tracking them separately tells you exactly where your response process breaks down: is the gap in the detect-to-contain phase (awareness and initial response), or contain-to-resolve (root cause identification and remediation)?

For critical incidents, containment should target under four hours. High-severity incidents should target under eight hours.
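
To show why the two phases are worth separating, here is a small sketch that splits one incident’s timeline into detect-to-contain and contain-to-resolve hours and names the longer phase. It assumes both clocks start at detection; the function and key names are hypothetical.

```python
from datetime import datetime


def phase_breakdown(detected, contained, resolved):
    """Split an incident timeline into its two response phases (in hours)."""
    detect_to_contain = (contained - detected).total_seconds() / 3600
    contain_to_resolve = (resolved - contained).total_seconds() / 3600
    # The longer phase is where the response process lost the most time.
    if detect_to_contain >= contain_to_resolve:
        bottleneck = "detect_to_contain"
    else:
        bottleneck = "contain_to_resolve"
    return {
        "detect_to_contain": detect_to_contain,
        "contain_to_resolve": contain_to_resolve,
        "bottleneck": bottleneck,
    }


result = phase_breakdown(
    detected=datetime(2026, 2, 3, 9, 0),
    contained=datetime(2026, 2, 3, 15, 0),
    resolved=datetime(2026, 2, 4, 3, 0),
)
print(result)
```

Aggregating the `bottleneck` field across a quarter of incidents gives a quick read on whether initial response or remediation is the slower phase.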

4. Mean Time to Resolve (MTTR)

Resolution is full restoration—root cause identified, fix deployed, systems normalized, post-incident review scheduled. The cross-industry MTTR average runs roughly 72 hours, but financial institutions with mature programs average 15–24 hours for critical incidents (FS-ISAC, 2023 benchmarks).

Be careful not to close incidents prematurely to hit MTTR targets. A common pattern: incidents get marked “resolved” at containment, the underlying root cause fix gets tracked as a separate project, and MTTR looks clean while the actual vulnerability remains open. Incident triage and severity classification discipline directly affects the integrity of your MTTR data.

Severity    | MTTD Target | MTTC Target | MTTR Target
P1 Critical | < 1 hour    | < 4 hours   | < 24 hours
P2 High     | < 4 hours   | < 8 hours   | < 48 hours
P3 Moderate | < 12 hours  | < 24 hours  | < 5 business days
P4 Low      | < 24 hours  | N/A         | < 10 business days
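
The targets in the table could be held as a small lookup and checked per incident, roughly as sketched here. Only the hour-based targets are encoded; the business-day MTTR targets for P3 and P4 are left as `None` because checking them needs a business calendar, which is outside this sketch. All names are hypothetical.

```python
# Hour-based targets from the table; None marks N/A or business-day targets.
TARGET_HOURS = {
    "P1": {"mttd": 1, "mttc": 4, "mttr": 24},
    "P2": {"mttd": 4, "mttc": 8, "mttr": 48},
    "P3": {"mttd": 12, "mttc": 24, "mttr": None},   # MTTR target: 5 business days
    "P4": {"mttd": 24, "mttc": None, "mttr": None},  # MTTC N/A; MTTR: 10 business days
}


def breaches(severity, observed_hours):
    """Return the metrics whose observed hours meet or exceed the target.

    Targets are strict ("< 4 hours"), so exactly hitting the limit counts as a breach.
    """
    targets = TARGET_HOURS[severity]
    return [
        metric
        for metric, limit in targets.items()
        if limit is not None
        and metric in observed_hours
        and observed_hours[metric] >= limit
    ]


print(breaches("P1", {"mttd": 0.5, "mttc": 5, "mttr": 20}))
```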

5. Regulatory Notification Rate and Timeliness

Under 12 CFR Part 53, banking organizations must notify the OCC within 36 hours of determining that a notification incident has occurred. Track two separate KRIs here:

Notification rate: What percentage of your incidents triggered the regulatory notification threshold? Tracking this over time shows whether your exposure to notification-level events is stable, growing, or declining.

Notification timeliness: Of incidents that required notification, what percentage were reported within the 36-hour window? Late or missed notifications are the kind of finding that ends up as a Matter Requiring Attention (MRA).

A secondary signal: incidents that required analysis to determine whether notification was required. If this category is growing, your severity classification framework may be creating unnecessary ambiguity at the reporting threshold.
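
Both notification KRIs can be computed from the same incident list. The sketch below assumes each record carries a `notification_required` flag, the time the determination was made (`determined`), and the time the regulator was notified (`notified`, or `None` if it never happened); these field names are illustrative.

```python
from datetime import datetime, timedelta

NOTIFICATION_WINDOW = timedelta(hours=36)


def notification_kris(incidents):
    """Return (notification_rate, timeliness) as fractions between 0 and 1.

    timeliness is None when no incident required notification.
    """
    if not incidents:
        return 0.0, None
    required = [i for i in incidents if i["notification_required"]]
    rate = len(required) / len(incidents)
    if not required:
        return rate, None
    # A missing notification counts as late, the same as one outside the window.
    on_time = [
        i for i in required
        if i["notified"] is not None
        and i["notified"] - i["determined"] <= NOTIFICATION_WINDOW
    ]
    return rate, len(on_time) / len(required)


t0 = datetime(2026, 3, 1, 8, 0)
sample = [
    {"notification_required": True, "determined": t0, "notified": t0 + timedelta(hours=30)},
    {"notification_required": True, "determined": t0, "notified": t0 + timedelta(hours=40)},
    {"notification_required": False, "determined": None, "notified": None},
    {"notification_required": False, "determined": None, "notified": None},
]
print(notification_kris(sample))
```

The clock here runs from the determination, not from detection, which mirrors how the 36-hour window is described above.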

6. Repeat Incident Rate by Root Cause Category

This is the metric most programs skip, and it correlates most directly with program maturity.

Categorize incidents by root cause: technology failure, human error, process failure, third-party or vendor failure, external event (fraud, cyberattack, natural event). Track month-over-month: what percentage of incidents share a root cause category with an incident from the prior 90 days?

A repeat incident rate above 25–30% in any category means your corrective action plans either aren’t being implemented or aren’t addressing the actual root cause. The Basel Committee’s principles for sound operational risk management explicitly connect loss data analysis to identifying control weaknesses and repeating risk patterns—institutions that can’t demonstrate declining repeat rates over time have an examination finding waiting to happen.
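
A minimal sketch of the repeat-rate calculation follows, assuming incidents are available as `(date, root_cause_category)` pairs. The function name and the category labels are hypothetical; the 90-day lookback comes from the definition above.

```python
from collections import defaultdict
from datetime import date


def repeat_rate_by_category(incidents, lookback_days=90):
    """Share of incidents per category preceded by a same-category incident
    within the lookback window.

    incidents: iterable of (date, category) tuples.
    """
    totals = defaultdict(int)
    repeats = defaultdict(int)
    last_seen = {}
    # Sorting by date means the most recent prior incident is the only one to check.
    for day, category in sorted(incidents):
        totals[category] += 1
        previous = last_seen.get(category)
        if previous is not None and (day - previous).days <= lookback_days:
            repeats[category] += 1
        last_seen[category] = day
    return {category: repeats[category] / totals[category] for category in totals}


history = [
    (date(2026, 1, 1), "vendor"),
    (date(2026, 2, 1), "vendor"),
    (date(2026, 8, 1), "vendor"),
    (date(2026, 3, 1), "human_error"),
]
print(repeat_rate_by_category(history))
```

The earliest incident in the data set can never count as a repeat, so a short history understates the rate. Loading at least 90 days before the reporting period avoids that.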

The July 2024 CrowdStrike outage is an instructive case at scale: a single third-party software update created cascading failures across global financial institutions, airlines, and healthcare systems simultaneously. Institutions that had previously tracked third-party technology failure as a root cause category—and had open corrective actions from prior vendor incidents—were exposed to a governance question they needed to answer quickly.

7. Open Corrective Actions: Age and Closure Rate

Every significant incident should produce at least one corrective action with a defined owner and due date. Track:

  • Open corrective actions by age bucket: 0–30 days, 31–60 days, 61–90 days, 90+ days
  • Overdue rate: Corrective actions past their due date that remain open
  • Reopen rate: Corrective actions marked closed that are linked to a subsequent incident in the same root cause category

Corrective action ownership without accountability produces the classic pattern: lots of logged actions, minimal closure, and the same categories showing up in the repeat incident KRI. The KRI governance and ownership framework covers how to build accountability structures that actually close the loop.

Threshold guidance:

  • Green: < 10% of corrective actions overdue; no corrective actions open > 90 days without documented extension
  • Amber: 10–25% of corrective actions overdue; some items in 90+ day bucket with documented reason
  • Red: > 25% overdue; corrective actions in 90+ day bucket without documented rationale or escalation
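
The aging buckets and tiers above might be computed as in this sketch. It assumes the overdue rate is measured against open corrective actions only, and that each record has `opened`, `due`, `closed`, and `extension_documented` fields; those names are illustrative.

```python
from datetime import date


def age_bucket(opened, today):
    """Place an open corrective action into one of the four age buckets."""
    days = (today - opened).days
    if days <= 30:
        return "0-30"
    if days <= 60:
        return "31-60"
    if days <= 90:
        return "61-90"
    return "90+"


def corrective_action_status(actions, today):
    """Return green/amber/red from overdue rate and undocumented 90+ day items."""
    open_actions = [a for a in actions if not a["closed"]]
    if not open_actions:
        return "green"
    overdue_rate = sum(1 for a in open_actions if a["due"] < today) / len(open_actions)
    stale = [a for a in open_actions if age_bucket(a["opened"], today) == "90+"]
    if overdue_rate > 0.25 or any(not a["extension_documented"] for a in stale):
        return "red"
    if overdue_rate >= 0.10 or stale:
        return "amber"
    return "green"


today = date(2026, 5, 1)
recent = {"opened": date(2026, 4, 10), "due": date(2026, 6, 1),
          "closed": False, "extension_documented": False}
print(corrective_action_status([recent] * 10, today))
```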

8. Customer Impact Rate

What percentage of incidents resulted in customer-facing impact? Of those, what was the average duration of impact and the estimated customer count affected?

Persistent customer-facing incidents—even at P2/P3 severity—suggest systemic resilience gaps. Track this separately from internal operational incidents. Your regulators and your customers care about different things; your KRI dashboard should reflect both.
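
The three questions above reduce to a short calculation. This sketch assumes each incident record has a `customer_impact` flag plus `impact_hours` and `customers_affected` values for impacted incidents; the field names are hypothetical.

```python
from statistics import mean


def customer_impact_kris(incidents):
    """Return impact rate, mean impact duration, and mean customers affected."""
    impacted = [i for i in incidents if i["customer_impact"]]
    if not incidents or not impacted:
        return {"impact_rate": 0.0, "avg_impact_hours": 0.0, "avg_customers": 0.0}
    return {
        "impact_rate": len(impacted) / len(incidents),
        "avg_impact_hours": mean(i["impact_hours"] for i in impacted),
        "avg_customers": mean(i["customers_affected"] for i in impacted),
    }


sample = [
    {"customer_impact": True, "impact_hours": 2, "customers_affected": 100},
    {"customer_impact": True, "impact_hours": 4, "customers_affected": 300},
    {"customer_impact": False, "impact_hours": 0, "customers_affected": 0},
    {"customer_impact": False, "impact_hours": 0, "customers_affected": 0},
]
print(customer_impact_kris(sample))
```

Averages are taken over impacted incidents only, so a large number of internal-only incidents does not dilute the duration figure.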

Setting Thresholds Against Your Risk Appetite

Incident KRI thresholds aren’t benchmarks you copy from a framework. They’re calibrated to your risk appetite, your business model, and your regulatory environment.

A real-time payments processor with 24/7 transaction volumes has near-zero tolerance for P1 incidents lasting more than two hours—because two hours at peak can mean millions in failed transactions and regulatory notification. A community bank with lower transaction intensity may calibrate differently.

Start with the question your board actually cares about: how much customer impact and regulatory exposure is acceptable, and under what conditions? Work backward to the operational metrics that predict that impact before it occurs. The KRI thresholds and false green/false red guidance covers calibration mechanics in detail.

The OCC’s 2025 Cybersecurity and Financial System Resilience Report emphasized that institutions need to demonstrate not just that they detect and respond to incidents, but that they learn from them. That’s exactly what root cause KRIs are designed to surface.

Root Cause Patterns as Leading Indicators

If you track 12 months of incidents by root cause category, patterns emerge that single-incident analysis misses entirely.

A cluster of third-party vendor failures in Q1 and Q2 is a leading indicator for Q3—especially if the underlying vendor relationships haven’t been remediated or renegotiated. Repeating human-error incidents in a specific process often precede a larger failure when that error hits a high-value transaction at a critical moment. Technology failure incidents concentrated in a specific system or platform suggest capacity or maintenance issues building toward a more significant outage.
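
One simple way to surface these patterns is to count incidents per root cause category per quarter and flag any category whose count rises every quarter. The strictly-rising rule is an illustrative heuristic chosen for this sketch, not a standard, and the names are hypothetical.

```python
from collections import Counter


def rising_categories(incidents, quarters):
    """Return categories whose incident count increases in every consecutive quarter.

    incidents: list of (quarter_label, category) tuples.
    quarters: quarter labels in chronological order, e.g. ["Q1", "Q2", "Q3"].
    """
    counts = Counter(incidents)
    flagged = []
    for category in sorted({c for _, c in incidents}):
        series = [counts[(q, category)] for q in quarters]
        if len(series) >= 2 and all(b > a for a, b in zip(series, series[1:])):
            flagged.append(category)
    return flagged


log = [
    ("Q1", "vendor"), ("Q2", "vendor"), ("Q2", "vendor"),
    ("Q1", "human_error"), ("Q2", "human_error"),
]
print(rising_categories(log, ["Q1", "Q2"]))
```

A flat but persistently high count would not be flagged by this rule, which is why it works best alongside the repeat incident rate rather than instead of it.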

This is how operational risk KRIs function as leading indicators rather than lagging scorecards. The Basel Committee’s operational risk guidance has long connected loss data analysis to prospective risk identification, and the same principle applies at the incident level.

The repeat incident rate, combined with root cause trending, gives you the data to tell your board: “We’ve had four operational failures traced to the same process gap in 90 days. Here’s our remediation plan and our closure KPI.” That’s a fundamentally different conversation than “we had four incidents.”

Building the Dashboard

A practical incident KRI dashboard doesn’t require a sophisticated GRC platform. It requires consistent data entry, defined ownership, and a regular review cadence.

The minimum viable setup:

  • Weekly incident data entry by the incident response team (severity, root cause category, MTTD/MTTC/MTTR, customer impact yes/no)
  • Monthly KRI calculation and threshold assessment by the risk team
  • Quarterly board reporting with trend charts, threshold status, and corrective action aging
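
For the weekly data entry step, a single record type covering the fields listed above is often enough to start. The following is a sketch with hypothetical field names. It stores raw timestamps rather than pre-computed durations, and it assumes time to contain and time to resolve are both measured from detection.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentRecord:
    incident_id: str
    severity: str               # "P1", "P2", "P3", or "P4"
    root_cause_category: str    # e.g. "technology", "human_error", "vendor"
    occurred: datetime
    detected: datetime
    contained: datetime
    resolved: datetime
    customer_impact: bool

    @staticmethod
    def _hours(start, end):
        return (end - start).total_seconds() / 3600

    @property
    def time_to_detect_hours(self):
        return self._hours(self.occurred, self.detected)

    @property
    def time_to_contain_hours(self):
        # Assumption: containment clock starts at detection.
        return self._hours(self.detected, self.contained)

    @property
    def time_to_resolve_hours(self):
        # Assumption: resolution clock starts at detection.
        return self._hours(self.detected, self.resolved)


record = IncidentRecord(
    incident_id="INC-001", severity="P1", root_cause_category="vendor",
    occurred=datetime(2026, 4, 6, 0, 0), detected=datetime(2026, 4, 6, 1, 0),
    contained=datetime(2026, 4, 6, 3, 0), resolved=datetime(2026, 4, 6, 13, 0),
    customer_impact=True,
)
print(record.time_to_detect_hours, record.time_to_contain_hours, record.time_to_resolve_hours)
```

Keeping timestamps as the source of truth lets the risk team recompute the monthly means if a definition changes, without re-entering data.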

The KRI Library (132 Key Risk Indicators) includes the full incident KRI set (volume, severity, containment time, resolution time, repeat rate, and corrective action aging) with calibrated thresholds ready to drop into your operational risk reporting. If you’re building this from scratch, it saves months of calibration work.

So What?

If your incident program can’t answer the CRO’s three questions—current volume by severity, average response times, and repeat rate by root cause—it’s time to build the KRI layer.

Start with what you can measure today: pull the last 90 days of incidents, classify by severity and root cause category, and calculate rough MTTD/MTTC/MTTR by severity tier. That baseline tells you where the gaps are and gives you the first data point for trend tracking.
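
A rough baseline of this kind can be produced with a few lines once the last 90 days of incidents are in a list. The sketch assumes each record already has its per-incident hours under `mttd`, `mttc`, and `mttr`; the names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def baseline_by_severity(incidents):
    """Group incidents by severity and return count plus mean hours per metric."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[incident["severity"]].append(incident)
    baseline = {}
    for severity, rows in sorted(groups.items()):
        summary = {"count": len(rows)}
        for metric in ("mttd", "mttc", "mttr"):
            summary[metric] = mean(row[metric] for row in rows)
        baseline[severity] = summary
    return baseline


last_90_days = [
    {"severity": "P1", "mttd": 1, "mttc": 3, "mttr": 20},
    {"severity": "P1", "mttd": 3, "mttc": 5, "mttr": 28},
    {"severity": "P2", "mttd": 6, "mttc": 10, "mttr": 40},
]
for severity, summary in baseline_by_severity(last_90_days).items():
    print(severity, summary)
```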

Then set thresholds calibrated to your risk appetite, assign ownership, and put the dashboard in front of the board quarterly. Incident KRIs that trend in the wrong direction without a documented remediation plan aren’t a monitoring problem. They’re a governance problem.

Regulators examining your incident program aren’t looking for a clean log. They’re looking for evidence that your program generates accountability and improvement over time. Incident KRIs are how you prove it.


Frequently Asked Questions

What are the most important incident KRIs for financial institutions?
The core incident KRIs are: incident volume by severity tier, mean time to detect (MTTD), mean time to contain (MTTC), mean time to resolve (MTTR), repeat incident rate by root cause category, regulatory notification timeliness, and open corrective action aging. Together these tell you whether your incident program is functioning or just documenting failures.
What MTTR benchmark should financial institutions target?
Financial services benchmarks average 15–24 hours MTTR for critical incidents, versus a cross-industry average of roughly 72 hours. Critical incidents should target resolution in under 24 hours, and high-severity incidents in under 48 hours. Your KRI threshold should be calibrated to your program maturity and regulatory expectations, not a generic industry figure.
How does incident volume function as a KRI versus a KPI?
Incident volume is a lagging indicator of control failure—it tells you something already went wrong. It becomes a leading KRI when tracked as a trend: if volume is rising month-over-month in a specific root cause category, that trend warns of control degradation before the next major event. Trend matters more than absolute count.
What does a regulator expect to see in an incident KRI dashboard?
OCC and FFIEC examiners expect institutions to track response times against defined SLAs, document root cause analysis for all significant incidents, demonstrate closure of corrective actions, and show declining repeat incident rates over time. Red flags include chronic SLA breaches, open root cause actions, and the same incident category recurring quarter after quarter.
How do root cause KRIs prevent repeat incidents?
Root cause KRIs—specifically the rate of incidents traced to the same systemic failure—show whether your corrective action plans are actually closing the gap or just documenting it. An institution with a 30%+ repeat incident rate in any category likely has a corrective action process that isn't addressing underlying causes.
What is the difference between MTTC and MTTR?
MTTC (Mean Time to Contain) measures how long it takes to stop an incident from spreading—customer impact is capped, systems are isolated, the bleeding stops. MTTR (Mean Time to Resolve) measures full restoration: root cause identified, fix deployed, operations normalized. These are different milestones and should be tracked separately. Conflating them hides where response bottlenecks actually occur.

Author: Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
