AI Operational Resilience: Making Sure AI Systems Don't Break the Business
TL;DR:
- AI systems are creating new single points of failure that most BCP/DR programs don’t account for — and the November 2025 Cloudflare outage proved how fast things cascade.
- Regulators are connecting the dots between AI risk and operational resilience: DORA, SR 20-24, the EU AI Act, and OCC Heightened Standards all expect you to plan for AI failure.
- Map your AI dependencies, stress-test vendor concentration, build fallback procedures, and run AI-specific tabletop exercises before your examiner asks why you didn’t.
Your BCP Probably Doesn’t Cover AI Failure. That’s a Problem.
On November 18, 2025, a Cloudflare outage took down ChatGPT, Claude, Shopify, and dozens of other services for over three hours. Financial institutions using OpenAI’s API for loan document processing, customer support, and compliance workflows didn’t just lose a chatbot — they lost operational capability with no fallback in place.
Then, on March 24, 2026, OpenAI announced it was shutting down Sora, its AI video generation tool, just six months after launch. App access ends April 26, 2026; API access dies September 24, 2026. Any firm that built workflows around Sora now has months to rip and replace.
These aren’t hypothetical scenarios — they’re the new normal. Between November 2025 and March 2026, major AI platforms including ChatGPT, Claude, and Cloudflare-dependent services experienced multiple disruptions, some lasting over 12 hours. And here’s the cascade problem: when one AI platform goes down, users flood alternatives, which overwhelms those systems too.
Most business continuity plans were written for server outages, natural disasters, and pandemic scenarios. AI system failure is a fundamentally different beast — and your BCP needs to catch up.
The Regulatory Landscape: Operational Resilience Meets AI
Regulators aren’t waiting for you to figure this out. Multiple frameworks now explicitly or implicitly require AI operational resilience planning.
US: SR 20-24 and OCC Heightened Standards
The interagency paper “Sound Practices to Strengthen Operational Resilience” (SR 20-24), issued by the Federal Reserve, FDIC, and OCC, draws from existing guidance on operational risk management, business continuity management, third-party risk management, and cybersecurity risk management. While it doesn’t explicitly mention AI, its principles apply directly:
- Identify critical operations and core business lines that depend on AI
- Map internal and external dependencies — including AI vendor APIs
- Maintain sound scenario analysis that includes technology disruption
The OCC’s Heightened Standards require larger banks to adjust risk governance when introducing AI activities. Examiners are already asking about AI dependency during operational resilience reviews.
EU: DORA and the Critical Provider Framework
The Digital Operational Resilience Act (DORA), which entered into application on January 17, 2025, is the most prescriptive framework for AI operational resilience. DORA requires financial entities to:
- Maintain a Register of Information on all ICT third-party arrangements (due to competent authorities by April 30, 2025)
- Assess and manage ICT concentration risk across critical providers
- Conduct threat-led penetration testing and resilience testing
On November 18, 2025, the European Supervisory Authorities (ESAs) published the first list of designated Critical ICT Third-Party Providers (CTPPs) under DORA. The list includes hyperscale cloud providers, data center operators, and financial services technology vendors — IBM was among those formally designated in December 2025. These CTPPs, designated against four criteria (systemic impact, reliance by financial entities, sector concentration, and substitutability), now face direct EU-level oversight.
EU AI Act: Resilience for High-Risk Systems
The EU AI Act, fully applicable August 2, 2026, requires high-risk AI system providers to ensure appropriate levels of accuracy, robustness, and cybersecurity. Article 15 specifically mandates that high-risk AI systems achieve “an appropriate level of resilience” against errors, faults, and inconsistencies. If your AI system makes or supports consequential financial decisions, resilience isn’t optional — it’s a legal requirement.
AI Dependency Mapping: Know What Breaks When AI Goes Down
Before you can plan for AI failure, you need to know where AI lives in your organization. Most firms can’t answer this question completely.
The Dependency Inventory
Build an AI dependency map that covers every business process touching AI:
| Dependency Category | What to Document | Example |
|---|---|---|
| Direct AI services | Model name, provider, API endpoint, SLA terms | OpenAI GPT-4 for customer support triage |
| Embedded AI | Vendor products with AI components you didn’t build | Fraud detection in your payment processor |
| AI infrastructure | Cloud providers, data pipelines, vector databases | AWS Bedrock for model hosting |
| Data dependencies | Training data sources, real-time data feeds | Market data feeds for trading models |
| Human dependencies | ML engineers, data scientists, model validators | 2-person ML ops team |
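To keep this inventory queryable rather than trapped in a spreadsheet, it helps to store each dependency as a structured record. Here is a minimal sketch in Python; the field names, the example entry, and the criticality values (which mirror the three tiers described below) are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum

class Criticality(Enum):
    MISSION_CRITICAL = "mission_critical"      # business stops without it
    IMPORTANT_DEGRADED = "important_degraded"  # business continues at reduced capacity
    NICE_TO_HAVE = "nice_to_have"              # manual workaround with minimal impact

@dataclass
class AIDependency:
    process: str              # business process that depends on the AI system
    category: str             # direct service, embedded AI, infrastructure, data, or human
    provider: str             # vendor (or internal team) behind the capability
    model_or_product: str     # model name, vendor product, or platform
    sla_terms: str            # contractual availability and support terms
    criticality: Criticality
    fallback_documented: bool = False
    shadow_ai: bool = False   # discovered outside sanctioned procurement

# Illustrative entry matching the first row of the table above
inventory = [
    AIDependency(
        process="Customer support triage",
        category="Direct AI service",
        provider="OpenAI",
        model_or_product="GPT-4 via API",
        sla_terms="Per contract; document uptime and support commitments here",
        criticality=Criticality.IMPORTANT_DEGRADED,
    ),
]
```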
Finding Shadow AI
Your inventory is incomplete if you only count sanctioned tools. Shadow AI — employees using ChatGPT, Copilot, and other tools without IT approval — creates undocumented dependencies that won’t show up in your BCP until something breaks.
Detection methods:
- Network monitoring for API calls to known AI providers (a minimal log-scan sketch follows this list)
- Procurement and expense audits for AI subscription charges
- Browser extension and endpoint agent scanning
- Employee surveys (you’ll be surprised what people admit to using)
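As a starting point for the first method, a proxy or firewall log export can be scanned for traffic to known AI provider endpoints. A minimal sketch, assuming a CSV export with user and destination-host columns; the column names and the domain list are assumptions to adapt to your own environment.

```python
import csv
from collections import Counter

# Hypothetical watch list of AI provider API hosts; extend to match the
# sanctioned and unsanctioned vendors relevant to your environment.
AI_API_DOMAINS = {
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
}

def scan_proxy_log(path: str) -> Counter:
    """Count requests per (user, AI domain) from a proxy log export.

    Assumes a CSV with 'user' and 'dest_host' columns; adjust to whatever
    your proxy or firewall actually produces.
    """
    hits: Counter = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            host = row.get("dest_host", "").lower()
            if any(host == d or host.endswith("." + d) for d in AI_API_DOMAINS):
                hits[(row.get("user", "unknown"), host)] += 1
    return hits

if __name__ == "__main__":
    for (user, host), count in scan_proxy_log("proxy_export.csv").most_common(20):
        print(f"{user:20s} {host:40s} {count:5d} requests")
```

High hit counts from business units with no sanctioned AI tools are the entries most likely missing from your inventory.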
The Critical Path Question
For every AI-dependent process, ask: What happens if this AI system is unavailable for 4 hours? 24 hours? 7 days?
Map each to one of three categories:
- Mission-critical: Business stops without it (e.g., AI-driven fraud detection on transaction processing)
- Important but degraded: Business continues at reduced capacity (e.g., AI-assisted customer service)
- Nice-to-have: Manual workaround exists with minimal impact (e.g., AI-generated internal reports)
This classification drives your investment in fallbacks.
Vendor Concentration Risk: The Elephant in the Room
Here’s an uncomfortable truth: a handful of AI providers power most of the financial services industry’s AI capabilities. OpenAI, Anthropic, Google, and Microsoft dominate the foundation model layer. AWS, Azure, and GCP dominate the infrastructure layer. That’s a massive concentration of risk.
Why This Matters
McKinsey’s 2025 State of AI survey found that, across six consecutive years of research, most respondents’ organizations mitigate only a handful of the risks associated with AI use. Vendor concentration risk is consistently under-managed.
The Sora shutdown illustrates the product discontinuation risk: OpenAI can kill a product with 30 days’ notice, and there’s nothing in your contract that prevents it. If you built workflows on Sora, you’re now scrambling.
As Aon’s 2026 AI Risk report noted, AI platform dependencies “introduce concentration risk and underscore the importance of supply chain resilience in AI environments.”
Concentration Risk Assessment
Evaluate your AI vendor portfolio across five dimensions:
| Risk Dimension | Assessment Question | Red Flag |
|---|---|---|
| Provider concentration | How many critical processes depend on one AI vendor? | >3 critical processes on single vendor |
| Infrastructure concentration | Do your AI vendors share the same cloud provider? | All AI vendors on AWS |
| Model concentration | Are you using one foundation model family across use cases? | GPT-4 for everything |
| Geographic concentration | Where are your AI vendors’ data centers? | All US-based, no EU fallback |
| Financial viability | Can your AI vendor sustain operations long-term? | Vendor burning cash with no profitability path |
That last dimension deserves special attention. Reports in early 2026 flagged that some major AI providers face significant financial sustainability questions, with projected losses of billions per year. If your critical AI vendor becomes financially distressed, your operational resilience plan needs to account for it.
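The provider-concentration red flag above (more than three critical processes on a single vendor) is easy to check mechanically once the dependency inventory exists. A minimal sketch, assuming the inventory is available as a list of records with provider and criticality fields; the names, record shape, and threshold are illustrative.

```python
from collections import Counter

CRITICAL_TIERS = {"mission_critical"}

def flag_provider_concentration(dependencies, threshold=3):
    """Return providers carrying more mission-critical processes than the threshold.

    `dependencies` is an iterable of dicts with 'provider' and 'criticality'
    keys; adjust to however your inventory is actually stored.
    """
    counts = Counter(
        d["provider"] for d in dependencies if d["criticality"] in CRITICAL_TIERS
    )
    return {provider: n for provider, n in counts.items() if n > threshold}

# Illustrative inventory slice: four mission-critical processes on one vendor
deps = [
    {"process": "Fraud detection", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "Sanctions screening", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "Credit decisioning", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "AML alert triage", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "Support chat triage", "provider": "VendorB", "criticality": "important_degraded"},
]
print(flag_provider_concentration(deps))  # {'VendorA': 4}, which exceeds the >3 red flag
```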
Building AI Fallback Procedures
Every mission-critical AI process needs a documented fallback. Here’s the framework:
Fallback Tier Model
Tier 1 — Automated failover (seconds to minutes):
- Secondary AI vendor API activated via circuit breaker (see the sketch after the tier descriptions)
- Load balancer routes to backup model
- Degraded but functional service continues
- Best for: Real-time fraud detection, trading models, customer-facing AI
Tier 2 — Managed switchover (minutes to hours):
- Operations team activates pre-configured backup
- May involve switching to a simpler model or rule-based system
- Service quality drops but core function maintained
- Best for: AI-assisted underwriting, document processing, compliance screening
Tier 3 — Manual operations (hours to days):
- Human operators take over AI-dependent processes
- Pre-written runbooks guide manual execution
- Capacity limited by available staff
- Best for: AI-generated reports, risk analytics, internal tooling
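For the Tier 1 pattern, here is a minimal in-process sketch of a circuit breaker that routes to a pre-configured backup after repeated primary failures. The class, thresholds, and the primary_score/backup_score placeholders are illustrative assumptions; production failover is more often implemented at the API gateway or load-balancer layer, but the logic is the same.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive errors, skip the
    primary for `cooldown_seconds` and route straight to the fallback."""

    def __init__(self, max_failures: int = 3, cooldown_seconds: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: reset and allow a trial call to the primary
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary, fallback, *args, **kwargs):
        if not self._is_open():
            try:
                result = primary(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
        # Degraded but functional: secondary vendor, simpler model, or rule-based logic
        return fallback(*args, **kwargs)

# Usage sketch: primary_score / backup_score stand in for calls to your primary
# AI vendor and your pre-configured backup (placeholder names, not real APIs).
def primary_score(transaction):
    raise TimeoutError("primary AI vendor unreachable")  # simulate an outage

def backup_score(transaction):
    return {"risk": "manual_review", "source": "rule-based fallback"}

breaker = CircuitBreaker(max_failures=3, cooldown_seconds=60)
print(breaker.call(primary_score, backup_score, {"amount": 12_500}))
```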
What a Good Fallback Plan Includes
For each AI-dependent process, document:
- Trigger criteria: What conditions activate the fallback? (API error rate > 5%? Latency > 10 seconds? Complete outage?) A minimal evaluation sketch follows this list.
- Decision authority: Who authorizes the switch? At 2 AM on a Saturday?
- Switchover procedure: Step-by-step, including technical commands
- Capacity constraints: How much volume can the fallback handle?
- Communication protocol: Who gets notified — internal teams, customers, regulators?
- Restoration procedure: How do you switch back when the primary AI service recovers?
- Data reconciliation: How do you sync data processed during the fallback period?
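Trigger criteria are easier to enforce if they are computed continuously instead of judged ad hoc mid-incident. A minimal sketch that evaluates an error-rate and latency window against the example thresholds above; the class names, window size, and the half-window warm-up rule are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class TriggerCriteria:
    max_error_rate: float = 0.05     # fall back if >5% of recent calls fail
    max_p95_latency_s: float = 10.0  # fall back if p95 latency exceeds 10 seconds
    window_size: int = 100           # number of recent calls to evaluate

class FallbackTrigger:
    """Track recent AI API calls and report when fallback criteria are met."""

    def __init__(self, criteria: TriggerCriteria):
        self.criteria = criteria
        self.samples = deque(maxlen=criteria.window_size)  # (ok, latency_s) pairs

    def record(self, ok: bool, latency_s: float) -> None:
        self.samples.append((ok, latency_s))

    def should_fail_over(self) -> bool:
        if len(self.samples) < self.criteria.window_size // 2:
            return False  # not enough data yet to make the call
        error_rate = sum(1 for ok, _ in self.samples if not ok) / len(self.samples)
        latencies = sorted(lat for _, lat in self.samples)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        return (error_rate > self.criteria.max_error_rate
                or p95 > self.criteria.max_p95_latency_s)

# Usage: call trigger.record(...) after every API call; when should_fail_over()
# returns True, notify the decision authority named in the fallback plan.
trigger = FallbackTrigger(TriggerCriteria())
```

Note that the sketch only detects the condition; the switch itself still goes through the decision authority and switchover procedure documented above.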
AI Tabletop Exercises: Test Before You Need It
You wouldn’t skip fire drills because your building hasn’t burned down. The same logic applies to AI resilience.
Tabletop exercises for AI failures are becoming standard practice — and increasingly required. DORA mandates threat-led penetration testing for financial entities. The FFIEC IT Examination Handbook expects BCP testing that reflects current technology dependencies. And as an ISACA analysis on operational resilience in the AI era noted, resilience “determines whether a business can withstand shocks and continue to generate value in the event of a disaster.”
Five AI Tabletop Scenarios to Run
Scenario 1: Complete AI vendor outage
Your primary AI provider is down for 8+ hours. No API access. How do customer-facing and back-office operations continue?
Scenario 2: AI model producing wrong outputs
Your credit decisioning model starts approving high-risk applicants at 3x the normal rate. The model isn’t “down” — it’s just wrong. How quickly do you detect it? Who pulls the kill switch?
Scenario 3: AI vendor discontinues a product
You receive notice that a critical AI tool will be discontinued in 60 days (see: Sora). What’s your migration plan? Do you even have one?
Scenario 4: Cascading AI failure
Your primary AI vendor goes down. You fail over to your secondary. The secondary is also degraded because everyone else failed over to them too. Now what?
Scenario 5: AI data poisoning or model compromise
An adversary has been feeding manipulated data to your AI system for weeks. Your outputs are subtly wrong but have been used for thousands of decisions. What’s your investigation and remediation protocol?
Running the Exercise
- Participants: Business unit leads, ML/AI team, IT operations, risk management, compliance, legal
- Duration: 2-3 hours per scenario
- Output: Gap analysis, updated runbooks, remediation action items with owners and deadlines
- Frequency: At least annually, more often for mission-critical AI systems
The 30/60/90-Day Implementation Roadmap
Days 1-30: Discovery and Assessment
| Week | Deliverable | Owner |
|---|---|---|
| Week 1 | Complete AI dependency inventory (including shadow AI discovery) | CTO / Head of AI |
| Week 2 | Classify all AI dependencies by criticality tier | CRO / Head of Operational Risk |
| Week 3 | Assess vendor concentration risk across all five dimensions | TPRM Lead |
| Week 4 | Identify gaps between current BCP and AI dependencies | Head of BCP/DR |
Days 31-60: Build Fallback Framework
| Week | Deliverable | Owner |
|---|---|---|
| Week 5-6 | Design and document Tier 1/2/3 fallback procedures for all mission-critical AI | ML Ops Lead + Business Unit Owners |
| Week 7 | Negotiate secondary vendor agreements or develop in-house backup models | Procurement + ML Engineering |
| Week 8 | Update BCP/DR plans to include AI-specific failure scenarios and recovery procedures | Head of BCP/DR |
Days 61-90: Test and Operationalize
| Week | Deliverable | Owner |
|---|---|---|
| Week 9-10 | Conduct first AI tabletop exercise (Scenario 1 or 2) | Head of Operational Risk |
| Week 11 | Implement monitoring and alerting for AI system health and vendor SLA compliance | ML Ops / IT Operations |
| Week 12 | Executive readout with gap analysis, remediation plan, and ongoing testing schedule | CRO |
So What?
AI operational resilience isn’t a future-state problem — it’s a right-now problem. The Cloudflare outage proved that AI vendor failures cascade instantly. The Sora shutdown proved that AI products can disappear with weeks of notice. And regulators from the ESAs to the OCC are actively building frameworks that expect you to have planned for all of it.
The firms that treat AI like any other critical infrastructure — with dependency mapping, concentration risk limits, documented fallbacks, and regular testing — will weather AI disruptions without breaking a sweat. The firms that don’t will be explaining to their examiner why a single API going down shut off their fraud detection for half a day.
Start with the dependency map. You can’t protect what you can’t see.
Need a structured framework? The Business Continuity & Disaster Recovery Kit includes BIA templates, recovery plan frameworks, and testing protocols you can adapt for AI operational resilience.
FAQ
How is AI operational resilience different from traditional IT resilience?
Traditional IT resilience focuses on infrastructure — servers, networks, databases — where failures are usually binary (up or down) and well-understood. AI resilience adds unique challenges: models can degrade silently (producing wrong outputs without error messages), vendor concentration risk is extreme (a handful of providers power most AI), and failure modes cascade unpredictably (one platform’s outage overwhelms alternatives). Your BCP needs AI-specific scenarios, not just an extension of existing IT recovery plans.
What regulations require AI operational resilience planning?
In the EU, DORA (effective January 2025) explicitly requires ICT operational resilience including vendor concentration risk management. The EU AI Act (fully applicable August 2026) mandates resilience for high-risk AI systems under Article 15. In the US, SR 20-24’s interagency sound practices and the OCC Heightened Standards apply operational resilience expectations to AI dependencies. The FFIEC IT Examination Handbook expects BCP testing that covers current technology dependencies. While no US regulation explicitly says “plan for AI failure,” examiners are already asking the question.
How often should we test AI resilience?
Run AI-specific tabletop exercises at least annually — quarterly for mission-critical AI systems. Automated failover mechanisms (Tier 1 fallbacks) should be tested monthly. Review and update your AI dependency inventory every time you onboard a new AI vendor, deploy a new model, or experience a significant AI-related incident. DORA requires advanced testing including threat-led penetration testing for critical ICT services, which should include AI systems.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.