How to Build an Operational Risk Management Framework From Scratch
TL;DR:
- An operational risk management (ORM) framework isn’t a document — it’s a living system of identification, assessment, monitoring, and response that regulators will test under pressure.
- The OCC’s 2025 supervisory priorities explicitly call out operational risk controls, third-party resilience, and enterprise change management — examiners are looking for this.
- Build yours around four core pillars: risk and control self-assessments (RCSAs), key risk indicators (KRIs), loss event tracking, and scenario analysis. Here’s exactly how.
Your Examiner Doesn’t Care About Your Risk Framework Slide Deck
The OCC’s 2025 Bank Supervision Operating Plan reorganized its priorities into three categories: financial risk, operational risk, and compliance risk. Operational risk got its own section for the first time — with specific callouts for cybersecurity preventative controls, incident response, third-party risk management, and enterprise change management.
Translation: examiners are done accepting “we have an ORM program” as an answer. They want to see how it works. How risks flow from identification through assessment to monitoring. What triggers escalation. Who owns what.
And if you’re at a mid-size bank or fintech that’s been running on spreadsheets and good intentions, this is the guide that gets you from nothing to examiner-ready.
What Operational Risk Actually Covers (It’s Broader Than You Think)
The Basel Committee’s definition — “the risk of loss resulting from inadequate or failed internal processes, people, and systems, or from external events” — is deceptively simple. In practice, operational risk includes:
| Risk Category | Examples | Who Typically Owns It |
|---|---|---|
| Process failures | Transaction errors, settlement breaks, misapplied payments | Operations / COO |
| People risk | Unauthorized trading, employee fraud, key person dependency | HR + Business Line |
| Technology risk | System outages, failed deployments, data corruption | CTO / CIO |
| External events | Cyberattacks, natural disasters, vendor failures | CISO + BCP |
| Legal & compliance | Regulatory fines, contract disputes, litigation | CLO + CCO |
| Third-party risk | Vendor outages, concentration risk, fourth-party exposure | TPRM / Vendor Management |
The reason this matters: most firms that get MRAs (Matters Requiring Attention) or consent orders for operational risk deficiencies don’t fail because they ignored one big thing. They fail because they didn’t connect the dots across categories.
Case in point: In October 2020, the OCC issued a cease and desist order against Citibank and assessed a $400 million civil money penalty for deficiencies in risk management, data governance, and internal controls. The order cited “serious and longstanding deficiencies” across multiple operational risk areas — not one spectacular failure, but the accumulation of inadequate processes, poor data quality, and insufficient board oversight. When Citibank failed to remediate adequately, the OCC came back in July 2024 with an additional $75 million penalty specifically for violating the original order and lacking processes to monitor data quality impacts on regulatory reporting.
That’s $475 million because operational risk management was treated as a checkbox exercise instead of an integrated system.
The Four Pillars of an ORM Framework
Every functional ORM program rests on four interconnected components. Skip one and the whole thing wobbles.
Pillar 1: Risk and Control Self-Assessment (RCSA)
The RCSA is where you identify what can go wrong and evaluate whether your controls are working. It’s the foundation everything else builds on.
How to run one that isn’t theater:
- Scope by process, not department. Map your RCSAs to actual business processes (loan origination, wire transfers, customer onboarding) rather than org chart boxes. A single process can cross three departments, and the risk doesn’t care about your reporting lines.
- Use a consistent risk taxonomy. Align to Basel II event types or develop your own, but make it consistent across the enterprise. The seven Basel event types are:
  - Internal fraud
  - External fraud
  - Employment practices & workplace safety
  - Clients, products & business practices
  - Damage to physical assets
  - Business disruption & system failures
  - Execution, delivery & process management
- Rate inherent risk AND residual risk. Inherent = what’s the exposure without controls. Residual = what’s left after controls. If your residual risk rating is always “low,” your assessment is wrong: go back and stress-test.
- Document control effectiveness with evidence, not opinions. “We have a dual-approval process” isn’t evidence. “Dual-approval rejected 47 transactions in Q4, 12 of which were confirmed errors” is evidence. Track control testing results, exception rates, and failure incidents.
- Refresh annually at minimum, and quarterly for high-risk processes. An RCSA that’s 18 months old is a liability, not an asset. Forvis Mazars’ 2024 RCSA best practices guidance emphasizes that RCSA capabilities must be “adaptable, agile, and integrated” to keep pace with evolving operational environments. Static annual exercises don’t cut it anymore.
30-day RCSA launch plan:
- Days 1–5: Select 3–5 critical processes for pilot. Pull process maps, prior audit findings, and loss event data.
- Days 6–15: Facilitate RCSA workshops with process owners and control owners. Use structured templates with consistent rating scales (1–5 for likelihood and impact).
- Days 16–25: Compile results, identify gaps between documented controls and actual practice, flag residual risks above appetite.
- Days 26–30: Present results to risk committee, assign action items with owners and deadlines, establish refresh cadence.
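The inherent and residual ratings described above can be sketched in a few lines. This is a minimal illustration, not a prescribed methodology: the `RcsaEntry` class, the 0-to-1 `control_effectiveness` scale, the proportional discount formula, and the `appetite` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RcsaEntry:
    """One risk in an RCSA, rated on 1-5 likelihood and impact scales."""
    process: str
    risk: str
    likelihood: int  # 1 (rare) to 5 (almost certain)
    impact: int      # 1 (minor) to 5 (severe)
    control_effectiveness: float  # assumed scale: 0.0 (none) to 1.0 (fully effective)

    def __post_init__(self):
        if not (1 <= self.likelihood <= 5 and 1 <= self.impact <= 5):
            raise ValueError("likelihood and impact must be between 1 and 5")
        if not 0.0 <= self.control_effectiveness <= 1.0:
            raise ValueError("control_effectiveness must be between 0 and 1")

    @property
    def inherent_score(self) -> int:
        # Exposure before any controls are considered.
        return self.likelihood * self.impact

    @property
    def residual_score(self) -> float:
        # Assumed formula: controls reduce inherent exposure proportionally.
        return self.inherent_score * (1.0 - self.control_effectiveness)


def above_appetite(entries, appetite: float):
    """Return the entries whose residual score exceeds the appetite threshold."""
    return [e for e in entries if e.residual_score > appetite]


# Hypothetical pilot results for two processes.
entries = [
    RcsaEntry("Wire transfers", "Misdirected payment", 4, 5, 0.5),
    RcsaEntry("Customer onboarding", "Incomplete KYC file", 3, 3, 0.8),
]
flagged = above_appetite(entries, appetite=6.0)
print([e.risk for e in flagged])
```

Keeping the effectiveness figure as an explicit input makes it easy to challenge: if nobody can point to test results supporting a 0.8, the residual rating is an opinion.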
Pillar 2: Key Risk Indicators (KRIs)
KRIs are the early warning system. They tell you a risk is materializing before it becomes a loss event.
The difference between a good KRI and a useless one:
| Bad KRI | Good KRI | Why It’s Better |
|---|---|---|
| “Number of system outages” | “Unplanned system downtime hours for core banking platform, trailing 30 days” | Specific, measurable, tied to a critical system, trended over time |
| “Employee turnover” | “Turnover rate in BSA/AML team, trailing 90 days vs. 12-month avg” | Targets a high-risk function, includes comparison baseline |
| “Number of customer complaints” | “Complaint-to-transaction ratio for wire transfers, month over month” | Normalized, focused on a risk-prone process, shows trajectory |
| “Vendor incidents” | “Critical/high severity incidents from Tier 1 vendors, trailing quarter, vs. SLA breach threshold” | Tiered by vendor criticality, benchmarked against contractual SLAs |
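To make the normalization point concrete, here is a rough sketch of the complaint-to-transaction KRI from the table. The function names, the per-10,000 scaling, and the monthly figures are hypothetical choices for illustration only.

```python
def per_10k(events: int, volume: int) -> float:
    """Normalize an event count to a rate per 10,000 units of volume."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return events * 10_000 / volume


def month_over_month(current: float, previous: float) -> float:
    """Relative change versus the prior month (0.25 means up 25%)."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous


# Hypothetical wire-transfer figures for two consecutive months.
previous = per_10k(events=12, volume=40_000)
current = per_10k(events=18, volume=45_000)
change = month_over_month(current, previous)
print(f"{previous:.1f} -> {current:.1f} complaints per 10k wires ({change:+.0%})")
```

In this made-up example the raw complaint count rises by half, but part of that is volume growth. The normalized rate isolates the portion that reflects a deteriorating process.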
Setting thresholds that trigger action:
Every KRI needs three zones:
- Green (within appetite): Business as usual. Report in standard dashboards.
- Amber (approaching limit): Investigate root cause. First-line risk owner must document analysis within 5 business days. Risk committee notified.
- Red (breached): Immediate escalation. First-line must submit remediation plan within 48 hours. Second-line validates. Board risk committee briefed at next meeting (or emergency session if severity warrants).
A KRI without thresholds is just a metric, and a metric nobody acts on is just a number. Numbers don’t prevent losses; escalation protocols do.
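The three zones can be expressed as a small classification step. The sketch below assumes a KRI where higher values are worse, and the `KriThreshold` class and the threshold numbers are hypothetical; the escalation text mirrors the zone descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KriThreshold:
    """Zone boundaries for a KRI where higher values are worse (an assumption)."""
    amber: float  # value at which the KRI enters the amber zone
    red: float    # value at which the KRI is considered breached

    def __post_init__(self):
        if self.amber >= self.red:
            raise ValueError("amber threshold must be below red threshold")


# Required actions per zone, taken from the zone descriptions in the text.
ESCALATION = {
    "green": "Report in standard dashboards",
    "amber": "First-line analysis within 5 business days; notify risk committee",
    "red": "First-line remediation plan within 48 hours; second-line validation; "
           "brief board risk committee",
}


def classify(value: float, threshold: KriThreshold) -> str:
    """Map a KRI reading to its zone."""
    if value >= threshold.red:
        return "red"
    if value >= threshold.amber:
        return "amber"
    return "green"


# Hypothetical thresholds: unplanned core banking downtime hours, trailing 30 days.
downtime = KriThreshold(amber=2.0, red=4.0)
zone = classify(3.5, downtime)
print(zone, "->", ESCALATION[zone])
```

Storing the required action next to the zone keeps the threshold and its escalation path in one place, so a breach always comes with an owner and a deadline.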
Start with 15–20 KRIs across your top risk categories. Don’t try to boil the ocean with 200 indicators nobody monitors. You can expand later once the muscle memory exists.
Pillar 3: Loss Event Tracking
Every operational risk loss needs to be captured, categorized, analyzed, and fed back into your RCSA and KRI program. This is where most firms are weakest — they track big losses when forced to, but let the small ones slip through.
Why small losses matter: According to ORX’s 2024 Banking Operational Risk Loss Data Report, global banks reported over 65,000 loss events in 2023, with an average loss size of €231,651. But the real insight was in frequency trends: low-severity external fraud events hit their highest level in ORX’s 22-year database history in 2022, with 38% of firms reporting their all-time peak fraud event counts that year. Transaction-related losses — processing errors, accounting mistakes, failed settlements — hit nearly €8 billion in 2023, making them the costliest operational risk category that year.
The firms that catch these trends early are the ones with disciplined loss event capture. The ones without it find out during exam prep.
What to capture for every loss event:
- Event date and discovery date (the gap between these tells you something about detection controls)
- Basel event type classification
- Gross loss amount
- Recovery amount (insurance, legal settlements)
- Net loss
- Root cause (use a standardized taxonomy: process, people, technology, external)
- Business line and process
- Related control failures (link back to RCSA)
- Near-miss indicator (was this a loss or a near-miss? Both matter)
Minimum reporting thresholds: Most mid-size banks use €10,000–€20,000 as the minimum capture threshold. Below that, tracking costs exceed insight value. But track near-misses regardless of potential amount — they’re free lessons.
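A structured record makes these fields hard to skip. The following is a minimal sketch of such a record: the `LossEvent` class, its field names, the sample values, and the choice to apply the capture threshold to gross loss are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class LossEvent:
    """One operational loss event or near-miss (hypothetical schema)."""
    event_date: date
    discovery_date: date
    basel_event_type: str
    gross_loss: float
    recovery: float
    root_cause: str  # process, people, technology, or external
    business_line: str
    process: str
    related_control_ids: List[str] = field(default_factory=list)  # links back to the RCSA
    near_miss: bool = False

    @property
    def net_loss(self) -> float:
        return self.gross_loss - self.recovery

    @property
    def detection_lag_days(self) -> int:
        # Gap between occurrence and discovery; a long lag points at detection controls.
        return (self.discovery_date - self.event_date).days


def should_capture(event: LossEvent, min_threshold: float = 10_000.0) -> bool:
    """Capture every near-miss, and any loss at or above the minimum threshold."""
    return event.near_miss or event.gross_loss >= min_threshold


event = LossEvent(
    event_date=date(2024, 3, 1),
    discovery_date=date(2024, 3, 15),
    basel_event_type="Execution, delivery & process management",
    gross_loss=25_000.0,
    recovery=5_000.0,
    root_cause="process",
    business_line="Payments",
    process="Wire transfers",
    related_control_ids=["CTL-017"],
)
print(event.net_loss, event.detection_lag_days, should_capture(event))
```

The same structure works as column headers in a spreadsheet register, with net loss and detection lag as calculated columns.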
Pillar 4: Scenario Analysis
Scenario analysis stress-tests your framework against plausible-but-severe events that haven’t happened yet. It’s where you answer: “What if our core processor goes down for 72 hours?” or “What if a key vendor gets breached and attackers exfiltrate customer data?”
Why this matters now more than ever: The July 2024 CrowdStrike outage demonstrated exactly how a single third-party failure cascades. A faulty software update crashed 8.5 million Windows machines globally, disrupting banks, airlines, hospitals, and government services. Insurance firm Parametrix estimated the top 500 US companies (excluding Microsoft) faced approximately $5.4 billion in financial losses. Banks that had scenario-analyzed a “critical vendor software failure” event were the ones with tested playbooks and faster recovery.
Running useful scenarios:
- Select 5–8 scenarios annually that align with your top inherent risks and emerging threats. Prioritize scenarios the OCC is signaling concern about: cybersecurity events, third-party failures, and operational resilience disruptions.
- Define severity and frequency estimates using structured expert judgment. Bring risk owners, business leaders, and subject matter experts together — not just the risk team in a room guessing.
- Quantify potential impact in terms of direct financial loss, regulatory penalties, customer impact, and reputational damage.
- Test your response capabilities against the scenario. Don’t just estimate the loss — walk through what you’d actually do. Who gets called? What decisions need to be made in the first hour? The first 24 hours?
- Feed results back into capital planning (for Basel requirements) and insurance coverage reviews.
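One simple way to record the workshop output is to hold the frequency estimate and the impact components together. The sketch below is an assumption-laden illustration: the `Scenario` class, the idea of putting a monetary estimate on customer and reputational impact, the frequency-times-severity annualization, and every figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Workshop estimates for one plausible-but-severe event (hypothetical schema)."""
    name: str
    annual_frequency: float  # expert estimate, e.g. 0.1 = roughly once in ten years
    direct_loss: float
    regulatory_penalties: float
    customer_impact: float      # assumed to be expressed as an estimated cost
    reputational_damage: float  # assumed to be expressed as an estimated cost

    @property
    def total_severity(self) -> float:
        # Sum of the four impact components if the scenario occurs once.
        return (self.direct_loss + self.regulatory_penalties
                + self.customer_impact + self.reputational_damage)

    @property
    def annualized_loss(self) -> float:
        # Simple frequency x severity view; a simplifying assumption, not a capital model.
        return self.annual_frequency * self.total_severity


scenarios = [
    Scenario("Core processor down 72 hours", 0.1, 2_000_000, 500_000, 300_000, 200_000),
    Scenario("Critical vendor data breach", 0.05, 1_500_000, 1_000_000, 800_000, 700_000),
]
# Rank scenarios by single-event severity, which is what the response plan must absorb.
for s in sorted(scenarios, key=lambda s: s.total_severity, reverse=True):
    print(f"{s.name}: severity {s.total_severity:,.0f}, annualized {s.annualized_loss:,.0f}")
```

Ranking on severity rather than the annualized figure keeps rare but damaging scenarios from being averaged away.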
The ORM Lifecycle: Connecting the Pillars
These four pillars don’t operate in isolation. Here’s how they feed each other:
RCSA identifies risks & control gaps
↓
KRIs monitor the risks RCSA identified
↓
Loss events validate (or challenge) RCSA ratings
↓
Scenarios stress-test the risks KRIs can't predict
↓
All four inform risk appetite, capital, and reporting
↓
Board & management receive integrated view
↓
Loop back: update RCSA with new loss data & scenario results
The integration test: If a loss event occurs and you can’t trace it back to a risk in your RCSA, either your RCSA missed something or your taxonomy doesn’t match your loss event categories. If a KRI breaches red and nobody acts, your escalation protocols are broken. If a scenario materializes and your response looks nothing like what you planned, your scenarios are fantasy.
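The first two checks in the integration test lend themselves to a quick automated pass over your registers. This sketch assumes each loss event carries an `rcsa_risk_id` and each red KRI breach carries an `action_logged` flag; those field names and the sample records are hypothetical.

```python
def untraceable_loss_events(loss_events, rcsa_risk_ids):
    """Loss events that do not map to any risk recorded in the RCSA."""
    return [e for e in loss_events if e.get("rcsa_risk_id") not in rcsa_risk_ids]


def unactioned_red_kris(kri_readings):
    """Red-zone KRI readings with no documented response."""
    return [k for k in kri_readings
            if k["zone"] == "red" and not k.get("action_logged", False)]


# Hypothetical register extracts.
rcsa_risk_ids = {"R-001", "R-002"}
loss_events = [
    {"id": "LE-1", "rcsa_risk_id": "R-001"},
    {"id": "LE-2", "rcsa_risk_id": "R-009"},  # points at a risk the RCSA does not contain
    {"id": "LE-3", "rcsa_risk_id": None},     # never linked to a risk at all
]
kri_readings = [
    {"id": "KRI-03", "zone": "amber", "action_logged": False},
    {"id": "KRI-07", "zone": "red", "action_logged": False},
    {"id": "KRI-11", "zone": "red", "action_logged": True},
]

print([e["id"] for e in untraceable_loss_events(loss_events, rcsa_risk_ids)])
print([k["id"] for k in unactioned_red_kris(kri_readings)])
```

Any non-empty result is a prompt to fix either the RCSA, the taxonomy mapping, or the escalation process before an examiner finds the same gap.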
Building From Nothing: A 120-Day Implementation Roadmap
If you’re starting fresh — maybe you just got hired as the first risk manager, or maybe the examiner just handed you an MRA — here’s a realistic timeline.
Days 1–30: Foundation
- Inventory existing risk-related documentation (even if scattered across departments)
- Define your operational risk taxonomy (Basel-aligned or custom)
- Draft the ORM policy: scope, governance, roles, risk appetite statement
- Identify your risk committee structure and meeting cadence
- Deliverable: Approved ORM policy, risk taxonomy, governance structure
Days 31–60: RCSA Pilot
- Select 5 high-risk processes for initial RCSA
- Conduct facilitated workshops with process owners
- Document inherent risks, controls, control effectiveness, and residual risk ratings
- Identify immediate gaps (risks with no controls, or controls with no evidence of effectiveness)
- Deliverable: Completed RCSA for 5 processes, gap analysis report
Days 61–90: KRIs and Loss Event Tracking
- Design 15–20 KRIs across top risk categories from RCSA results
- Set green/amber/red thresholds with risk committee input
- Implement loss event capture process (can start with a structured spreadsheet — don’t let tool selection delay launch)
- Back-populate with known loss events from past 12 months
- Deliverable: KRI dashboard (even if in Excel), loss event register, escalation procedures
Days 91–120: Scenario Analysis and Reporting
- Conduct 3–5 scenario analysis workshops
- Build first board-level ORM report integrating RCSA results, KRI status, loss event trends, and scenario outcomes
- Establish quarterly reporting cadence
- Document the program for examiner consumption — methodology documents, governance minutes, evidence of action on identified risks
- Deliverable: Scenario analysis results, first integrated ORM report, program documentation package
The honest truth: 120 days gets you functional, not mature. A genuinely embedded ORM program takes 12–18 months of repetition, calibration, and cultural change. But 120 days gets you something defensible when the examiner shows up — and that’s the immediate goal.
Three Lines of Defense: Who Owns What
Operational risk is one of those domains where “everyone owns it” quickly becomes “nobody owns it.” Apply the three lines of defense model clearly:
| Line | Role in ORM | Specific Responsibilities |
|---|---|---|
| 1st Line: Business Units | Own and manage operational risks daily | Conduct RCSAs, maintain controls, report loss events, monitor KRIs, escalate breaches |
| 2nd Line: Risk Management | Provide the framework, challenge, and oversight | Design ORM methodology, set risk appetite, validate RCSAs, aggregate reporting, independent challenge |
| 3rd Line: Internal Audit | Independent assurance | Audit the ORM framework itself — is it effective? Are RCSAs credible? Are KRIs actionable? Are loss events being captured? |
At a mid-size bank (under $50B in assets): The operational risk function typically sits within the CRO organization, staffed by 2–5 dedicated ORM professionals. They own the methodology but depend on first-line risk coordinators embedded in each business unit to execute RCSAs and KRI monitoring.
At a fintech or smaller bank: You might not have a dedicated ORM team. In that case, the CCO or Head of Risk typically owns the framework, with operational risk activities distributed among department heads. The key is documenting those assignments — an examiner wants to see named individuals, not vague “the business owns it” statements.
What Examiners Actually Look For
Based on the OCC’s Semiannual Risk Perspective and recent enforcement trends, here’s what gets flagged:
- No documented ORM policy or outdated policy. If your ORM policy is from 2019 and doesn’t mention third-party risk, cyber, or operational resilience, it’s a finding.
- RCSAs that don’t reflect the actual risk environment. If your RCSA hasn’t been updated since you onboarded a major new vendor or launched a new product, that’s a gap.
- KRIs with no defined thresholds or escalation procedures. Tracking 50 metrics nobody acts on is worse than tracking 10 that drive decisions.
- No loss event tracking or incomplete capture. If the only losses you’ve recorded are the ones the auditor found, you have a culture problem.
- Board and management reporting that’s all green. A risk dashboard that never shows amber or red signals either a perfect organization (unlikely) or a broken assessment process (probable).
- No connection between ORM outputs and business decisions. The framework exists to inform decisions: new product approvals, vendor selections, capital allocation, technology investments. If none of those reference your ORM data, the program is decorative.
So What? Why This Matters Right Now
Operational risk isn’t theoretical. In 2023, global banks reported over 65,000 loss events to ORX, with transaction-related losses alone hitting nearly €8 billion. The CrowdStrike outage in July 2024 showed how a single third-party failure can cascade into billions in losses across industries. And regulators are watching more closely than ever: the OCC’s 2025 priorities elevate operational risk to a dedicated supervisory category for the first time.
If you’re at a mid-size bank or fintech without a structured ORM framework, you’re not just carrying unmanaged risk — you’re carrying exam risk. The MRA for “inadequate operational risk management” is one of the most common findings in community and mid-size bank exams.
The good news: you don’t need a million-dollar GRC platform to start. You need a policy, a taxonomy, a handful of RCSAs, some meaningful KRIs, and a process for capturing when things go wrong. Start there. Iterate. Mature.
FAQ
What’s the difference between operational risk and enterprise risk management?
Operational risk is a subset of enterprise risk management (ERM). ERM covers all risk categories — credit, market, liquidity, strategic, reputational, and operational. An ORM framework zooms in on risks from internal processes, people, systems, and external events. In practice, your ORM program should feed into the broader ERM framework, with operational risk data flowing into enterprise-level risk appetite statements and board reporting. At many mid-size banks, the ORM team sits within the ERM function but maintains its own methodology and assessment cycle.
How many KRIs should a mid-size bank track?
Start with 15–20, focusing on your highest inherent risks from the RCSA. Common starter KRIs include: system availability rates for critical applications, transaction error rates, employee turnover in key risk functions (compliance, BSA, operations), cybersecurity incident volume, vendor SLA breach rates, and customer complaint trends. Quality over quantity — every KRI should have defined thresholds, an owner, and a documented escalation path. You can expand as the program matures, but tracking 100 indicators nobody reviews is worse than tracking 15 that drive action.
Can we build an ORM framework in spreadsheets or do we need GRC software?
Spreadsheets are a perfectly valid starting point, especially for banks under $10B in assets. What matters to examiners is the process, not the platform. A well-maintained Excel-based RCSA with documented methodology, evidence of refresh, and clear action tracking is infinitely better than a six-figure GRC tool that nobody uses properly. That said, spreadsheets break down around 50+ RCSAs or when you need to aggregate KRI data from multiple sources automatically. Plan for a GRC migration once the program is stable and you’ve proven the methodology works — typically 12–18 months after initial implementation.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.