AI Operational Resilience: Making Sure AI Systems Don't Break the Business
TL;DR:
- AI systems are creating new single points of failure that most BCP/DR programs don’t account for — and the November 2025 Cloudflare outage proved how fast things cascade.
- Regulators are connecting the dots between AI risk and operational resilience: DORA, SR 20-24, the EU AI Act, and OCC Heightened Standards all expect you to plan for AI failure.
- Map your AI dependencies, stress-test vendor concentration, build fallback procedures, and run AI-specific tabletop exercises before your examiner asks why you didn’t.
Your BCP Probably Doesn’t Cover AI Failure. That’s a Problem.
On November 18, 2025, a Cloudflare outage took down ChatGPT, Claude, Shopify, and dozens of other services for over three hours. Financial institutions using OpenAI’s API for loan document processing, customer support, and compliance workflows didn’t just lose a chatbot — they lost operational capability with no fallback in place.
Then, on March 24, 2026, OpenAI announced it was shutting down Sora, its AI video generation tool, just six months after launch. App access ends April 26, 2026; API access dies September 24, 2026. Any firm that built workflows around Sora now has months to rip and replace.
These aren’t hypothetical scenarios — they’re the new normal. Between November 2025 and March 2026, major AI platforms including ChatGPT, Claude, and Cloudflare-dependent services experienced multiple disruptions, some lasting over 12 hours. And here’s the cascade problem: when one AI platform goes down, users flood alternatives, which overwhelms those systems too.
Most business continuity plans were written for server outages, natural disasters, and pandemic scenarios. AI system failure is a fundamentally different beast — and your BCP needs to catch up.
The Regulatory Landscape: Operational Resilience Meets AI
Regulators aren’t waiting for you to figure this out. Multiple frameworks now explicitly or implicitly require AI operational resilience planning.
US: SR 20-24 and OCC Heightened Standards
The interagency paper “Sound Practices to Strengthen Operational Resilience” (SR 20-24), issued by the Federal Reserve, FDIC, and OCC, draws from existing guidance on operational risk management, business continuity management, third-party risk management, and cybersecurity risk management. While it doesn’t explicitly mention AI, its principles apply directly:
- Identify critical operations and core business lines that depend on AI
- Map internal and external dependencies — including AI vendor APIs
- Maintain sound scenario analysis that includes technology disruption
The OCC’s Heightened Standards require larger banks to adjust risk governance when introducing AI activities. Examiners are already asking about AI dependency during operational resilience reviews.
EU: DORA and the Critical Provider Framework
The Digital Operational Resilience Act (DORA), which entered into application on January 17, 2025, is the most prescriptive framework for AI operational resilience. DORA requires financial entities to:
- Maintain a Register of Information on all ICT third-party arrangements (due to competent authorities by April 30, 2025)
- Assess and manage ICT concentration risk across critical providers
- Conduct threat-led penetration testing and resilience testing
On November 18, 2025, the European Supervisory Authorities (ESAs) published the first list of designated Critical ICT Third-Party Providers (CTPPs) under DORA. The list includes hyperscale cloud providers, data center operators, and financial services technology vendors — IBM was among those formally designated in December 2025. These CTPPs, designated against four criteria (systemic impact, reliance by financial entities, sector concentration, and substitutability), now face direct EU-level oversight.
EU AI Act: Resilience for High-Risk Systems
The EU AI Act, fully applicable August 2, 2026, requires high-risk AI system providers to ensure appropriate levels of accuracy, robustness, and cybersecurity. Article 15 specifically mandates that high-risk AI systems achieve “an appropriate level of resilience” against errors, faults, and inconsistencies. If your AI system makes or supports consequential financial decisions, resilience isn’t optional — it’s a legal requirement.
AI Dependency Mapping: Know What Breaks When AI Goes Down
Before you can plan for AI failure, you need to know where AI lives in your organization. Most firms can’t answer this question completely.
The Dependency Inventory
Build an AI dependency map that covers every business process touching AI:
| Dependency Category | What to Document | Example |
|---|---|---|
| Direct AI services | Model name, provider, API endpoint, SLA terms | OpenAI GPT-4 for customer support triage |
| Embedded AI | Vendor products with AI components you didn’t build | Fraud detection in your payment processor |
| AI infrastructure | Cloud providers, data pipelines, vector databases | AWS Bedrock for model hosting |
| Data dependencies | Training data sources, real-time data feeds | Market data feeds for trading models |
| Human dependencies | ML engineers, data scientists, model validators | 2-person ML ops team |
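To keep this inventory queryable rather than trapped in a spreadsheet, it helps to store each dependency as a structured record. Here is a minimal sketch in Python; the field names, the example entry, and the criticality values (which mirror the three tiers described below) are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum

class Criticality(Enum):
    MISSION_CRITICAL = "mission_critical"      # business stops without it
    IMPORTANT_DEGRADED = "important_degraded"  # business continues at reduced capacity
    NICE_TO_HAVE = "nice_to_have"              # manual workaround with minimal impact

@dataclass
class AIDependency:
    process: str              # business process that depends on the AI system
    category: str             # direct service, embedded AI, infrastructure, data, or human
    provider: str             # vendor (or internal team) behind the capability
    model_or_product: str     # model name, vendor product, or platform
    sla_terms: str            # contractual availability and support terms
    criticality: Criticality
    fallback_documented: bool = False
    shadow_ai: bool = False   # discovered outside sanctioned procurement

# Illustrative entry matching the first row of the table above
inventory = [
    AIDependency(
        process="Customer support triage",
        category="Direct AI service",
        provider="OpenAI",
        model_or_product="GPT-4 via API",
        sla_terms="Per contract; document uptime and support commitments here",
        criticality=Criticality.IMPORTANT_DEGRADED,
    ),
]
```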
Finding Shadow AI
Your inventory is incomplete if you only count sanctioned tools. Shadow AI — employees using ChatGPT, Copilot, and other tools without IT approval — creates undocumented dependencies that won’t show up in your BCP until something breaks.
Detection methods:
- Network monitoring for API calls to known AI providers (a minimal log-scan sketch follows this list)
- Procurement and expense audits for AI subscription charges
- Browser extension and endpoint agent scanning
- Employee surveys (you’ll be surprised what people admit to using)
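As a starting point for the first method, a proxy or firewall log export can be scanned for traffic to known AI provider endpoints. A minimal sketch, assuming a CSV export with user and destination-host columns; the column names and the domain list are assumptions to adapt to your own environment.

```python
import csv
from collections import Counter

# Hypothetical watch list of AI provider API hosts; extend to match the
# sanctioned and unsanctioned vendors relevant to your environment.
AI_API_DOMAINS = {
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
}

def scan_proxy_log(path: str) -> Counter:
    """Count requests per (user, AI domain) from a proxy log export.

    Assumes a CSV with 'user' and 'dest_host' columns; adjust to whatever
    your proxy or firewall actually produces.
    """
    hits: Counter = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            host = row.get("dest_host", "").lower()
            if any(host == d or host.endswith("." + d) for d in AI_API_DOMAINS):
                hits[(row.get("user", "unknown"), host)] += 1
    return hits

if __name__ == "__main__":
    for (user, host), count in scan_proxy_log("proxy_export.csv").most_common(20):
        print(f"{user:20s} {host:40s} {count:5d} requests")
```

High hit counts from business units with no sanctioned AI tools are the entries most likely missing from your inventory.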
The Critical Path Question
For every AI-dependent process, ask: What happens if this AI system is unavailable for 4 hours? 24 hours? 7 days?
Map each to one of three categories:
- Mission-critical: Business stops without it (e.g., AI-driven fraud detection on transaction processing)
- Important but degraded: Business continues at reduced capacity (e.g., AI-assisted customer service)
- Nice-to-have: Manual workaround exists with minimal impact (e.g., AI-generated internal reports)
This classification drives your investment in fallbacks.
Vendor Concentration Risk: The Elephant in the Room
Here’s an uncomfortable truth: a handful of AI providers power most of the financial services industry’s AI capabilities. OpenAI, Anthropic, Google, and Microsoft dominate the foundation model layer. AWS, Azure, and GCP dominate the infrastructure layer. That’s a massive concentration of risk.
Why This Matters
McKinsey’s 2025 State of AI survey found that, across six consecutive years of research, most respondents’ organizations mitigate only a handful of the risks associated with AI use. Vendor concentration risk is consistently under-managed.
The Sora shutdown illustrates the product discontinuation risk: OpenAI can kill a product with 30 days’ notice, and there’s nothing in your contract that prevents it. If you built workflows on Sora, you’re now scrambling.
As Aon’s 2026 AI Risk report noted, AI platform dependencies “introduce concentration risk and underscore the importance of supply chain resilience in AI environments.”
Concentration Risk Assessment
Evaluate your AI vendor portfolio across five dimensions:
| Risk Dimension | Assessment Question | Red Flag |
|---|---|---|
| Provider concentration | How many critical processes depend on one AI vendor? | >3 critical processes on single vendor |
| Infrastructure concentration | Do your AI vendors share the same cloud provider? | All AI vendors on AWS |
| Model concentration | Are you using one foundation model family across use cases? | GPT-4 for everything |
| Geographic concentration | Where are your AI vendors’ data centers? | All US-based, no EU fallback |
| Financial viability | Can your AI vendor sustain operations long-term? | Vendor burning cash with no profitability path |
That last dimension deserves special attention. Reports in early 2026 flagged that some major AI providers face significant financial sustainability questions, with projected losses of billions per year. If your critical AI vendor becomes financially distressed, your operational resilience plan needs to account for it.
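The provider-concentration red flag above (more than three critical processes on a single vendor) is easy to check mechanically once the dependency inventory exists. A minimal sketch, assuming the inventory is available as a list of records with provider and criticality fields; the names, record shape, and threshold are illustrative.

```python
from collections import Counter

CRITICAL_TIERS = {"mission_critical"}

def flag_provider_concentration(dependencies, threshold=3):
    """Return providers carrying more mission-critical processes than the threshold.

    `dependencies` is an iterable of dicts with 'provider' and 'criticality'
    keys; adjust to however your inventory is actually stored.
    """
    counts = Counter(
        d["provider"] for d in dependencies if d["criticality"] in CRITICAL_TIERS
    )
    return {provider: n for provider, n in counts.items() if n > threshold}

# Illustrative inventory slice: four mission-critical processes on one vendor
deps = [
    {"process": "Fraud detection", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "Sanctions screening", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "Credit decisioning", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "AML alert triage", "provider": "VendorA", "criticality": "mission_critical"},
    {"process": "Support chat triage", "provider": "VendorB", "criticality": "important_degraded"},
]
print(flag_provider_concentration(deps))  # {'VendorA': 4}, which exceeds the >3 red flag
```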
Building AI Fallback Procedures
Every mission-critical AI process needs a documented fallback. Here’s the framework:
Fallback Tier Model
Tier 1 — Automated failover (seconds to minutes):
- Secondary AI vendor API activated via circuit breaker (see the sketch after the tier descriptions)
- Load balancer routes to backup model
- Degraded but functional service continues
- Best for: Real-time fraud detection, trading models, customer-facing AI
Tier 2 — Managed switchover (minutes to hours):
- Operations team activates pre-configured backup
- May involve switching to a simpler model or rule-based system
- Service quality drops but core function maintained
- Best for: AI-assisted underwriting, document processing, compliance screening
Tier 3 — Manual operations (hours to days):
- Human operators take over AI-dependent processes
- Pre-written runbooks guide manual execution
- Capacity limited by available staff
- Best for: AI-generated reports, risk analytics, internal tooling
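For the Tier 1 pattern, here is a minimal in-process sketch of a circuit breaker that routes to a pre-configured backup after repeated primary failures. The class, thresholds, and the primary_score/backup_score placeholders are illustrative assumptions; production failover is more often implemented at the API gateway or load-balancer layer, but the logic is the same.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive errors, skip the
    primary for `cooldown_seconds` and route straight to the fallback."""

    def __init__(self, max_failures: int = 3, cooldown_seconds: float = 60.0):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: reset and allow a trial call to the primary
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary, fallback, *args, **kwargs):
        if not self._is_open():
            try:
                result = primary(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
        # Degraded but functional: secondary vendor, simpler model, or rule-based logic
        return fallback(*args, **kwargs)

# Usage sketch: primary_score / backup_score stand in for calls to your primary
# AI vendor and your pre-configured backup (placeholder names, not real APIs).
def primary_score(transaction):
    raise TimeoutError("primary AI vendor unreachable")  # simulate an outage

def backup_score(transaction):
    return {"risk": "manual_review", "source": "rule-based fallback"}

breaker = CircuitBreaker(max_failures=3, cooldown_seconds=60)
print(breaker.call(primary_score, backup_score, {"amount": 12_500}))
```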
What a Good Fallback Plan Includes
For each AI-dependent process, document:
- Trigger criteria: What conditions activate the fallback? (API error rate > 5%? Latency > 10 seconds? Complete outage?) A minimal evaluation sketch follows this list.
- Decision authority: Who authorizes the switch? At 2 AM on a Saturday?
- Switchover procedure: Step-by-step, including technical commands
- Capacity constraints: How much volume can the fallback handle?
- Communication protocol: Who gets notified — internal teams, customers, regulators?
- Restoration procedure: How do you switch back when the primary AI service recovers?
- Data reconciliation: How do you sync data processed during the fallback period?
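Trigger criteria are easier to enforce if they are computed continuously instead of judged ad hoc mid-incident. A minimal sketch that evaluates an error-rate and latency window against the example thresholds above; the class names, window size, and the half-window warm-up rule are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class TriggerCriteria:
    max_error_rate: float = 0.05     # fall back if >5% of recent calls fail
    max_p95_latency_s: float = 10.0  # fall back if p95 latency exceeds 10 seconds
    window_size: int = 100           # number of recent calls to evaluate

class FallbackTrigger:
    """Track recent AI API calls and report when fallback criteria are met."""

    def __init__(self, criteria: TriggerCriteria):
        self.criteria = criteria
        self.samples = deque(maxlen=criteria.window_size)  # (ok, latency_s) pairs

    def record(self, ok: bool, latency_s: float) -> None:
        self.samples.append((ok, latency_s))

    def should_fail_over(self) -> bool:
        if len(self.samples) < self.criteria.window_size // 2:
            return False  # not enough data yet to make the call
        error_rate = sum(1 for ok, _ in self.samples if not ok) / len(self.samples)
        latencies = sorted(lat for _, lat in self.samples)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        return (error_rate > self.criteria.max_error_rate
                or p95 > self.criteria.max_p95_latency_s)

# Usage: call trigger.record(...) after every API call; when should_fail_over()
# returns True, notify the decision authority named in the fallback plan.
trigger = FallbackTrigger(TriggerCriteria())
```

Note that the sketch only detects the condition; the switch itself still goes through the decision authority and switchover procedure documented above.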
AI Tabletop Exercises: Test Before You Need It
You wouldn’t skip fire drills because your building hasn’t burned down. The same logic applies to AI resilience.
Tabletop exercises for AI failures are becoming standard practice — and increasingly required. DORA mandates threat-led penetration testing for financial entities. The FFIEC IT Examination Handbook expects BCP testing that reflects current technology dependencies. And as an ISACA analysis on operational resilience in the AI era noted, resilience “determines whether a business can withstand shocks and continue to generate value in the event of a disaster.”
Five AI Tabletop Scenarios to Run
Scenario 1: Complete AI vendor outage
Your primary AI provider is down for 8+ hours. No API access. How do customer-facing and back-office operations continue?
Scenario 2: AI model producing wrong outputs
Your credit decisioning model starts approving high-risk applicants at 3x the normal rate. The model isn’t “down” — it’s just wrong. How quickly do you detect it? Who pulls the kill switch?
Scenario 3: AI vendor discontinues a product
You receive notice that a critical AI tool will be discontinued in 60 days (see: Sora). What’s your migration plan? Do you even have one?
Scenario 4: Cascading AI failure
Your primary AI vendor goes down. You fail over to your secondary. The secondary is also degraded because everyone else failed over to them too. Now what?
Scenario 5: AI data poisoning or model compromise
An adversary has been feeding manipulated data to your AI system for weeks. Your outputs are subtly wrong but have been used for thousands of decisions. What’s your investigation and remediation protocol?
Running the Exercise
- Participants: Business unit leads, ML/AI team, IT operations, risk management, compliance, legal
- Duration: 2-3 hours per scenario
- Output: Gap analysis, updated runbooks, remediation action items with owners and deadlines
- Frequency: At least annually, more often for mission-critical AI systems
The 30/60/90-Day Implementation Roadmap
Days 1-30: Discovery and Assessment
| Week | Deliverable | Owner |
|---|---|---|
| Week 1 | Complete AI dependency inventory (including shadow AI discovery) | CTO / Head of AI |
| Week 2 | Classify all AI dependencies by criticality tier | CRO / Head of Operational Risk |
| Week 3 | Assess vendor concentration risk across all five dimensions | TPRM Lead |
| Week 4 | Identify gaps between current BCP and AI dependencies | Head of BCP/DR |
Days 31-60: Build Fallback Framework
| Week | Deliverable | Owner |
|---|---|---|
| Week 5-6 | Design and document Tier 1/2/3 fallback procedures for all mission-critical AI | ML Ops Lead + Business Unit Owners |
| Week 7 | Negotiate secondary vendor agreements or develop in-house backup models | Procurement + ML Engineering |
| Week 8 | Update BCP/DR plans to include AI-specific failure scenarios and recovery procedures | Head of BCP/DR |
Days 61-90: Test and Operationalize
| Week | Deliverable | Owner |
|---|---|---|
| Week 9-10 | Conduct first AI tabletop exercise (Scenario 1 or 2) | Head of Operational Risk |
| Week 11 | Implement monitoring and alerting for AI system health and vendor SLA compliance | ML Ops / IT Operations |
| Week 12 | Executive readout with gap analysis, remediation plan, and ongoing testing schedule | CRO |
So What?
AI operational resilience isn’t a future-state problem — it’s a right-now problem. The Cloudflare outage proved that AI vendor failures cascade instantly. The Sora shutdown proved that AI products can disappear with weeks of notice. And regulators from the ESAs to the OCC are actively building frameworks that expect you to have planned for all of it.
The firms that treat AI like any other critical infrastructure — with dependency mapping, concentration risk limits, documented fallbacks, and regular testing — will weather AI disruptions without breaking a sweat. The firms that don’t will be explaining to their examiner why a single API going down shut off their fraud detection for half a day.
Start with the dependency map. You can’t protect what you can’t see.
Need a structured framework? The Business Continuity & Disaster Recovery Kit includes BIA templates, recovery plan frameworks, and testing protocols you can adapt for AI operational resilience.
FAQ
How is AI operational resilience different from traditional IT resilience?
Traditional IT resilience focuses on infrastructure — servers, networks, databases — where failures are usually binary (up or down) and well-understood. AI resilience adds unique challenges: models can degrade silently (producing wrong outputs without error messages), vendor concentration risk is extreme (a handful of providers power most AI), and failure modes cascade unpredictably (one platform’s outage overwhelms alternatives). Your BCP needs AI-specific scenarios, not just an extension of existing IT recovery plans.
What regulations require AI operational resilience planning?
In the EU, DORA (effective January 2025) explicitly requires ICT operational resilience including vendor concentration risk management. The EU AI Act (fully applicable August 2026) mandates resilience for high-risk AI systems under Article 15. In the US, SR 20-24’s interagency sound practices and the OCC Heightened Standards apply operational resilience expectations to AI dependencies. The FFIEC IT Examination Handbook expects BCP testing that covers current technology dependencies. While no US regulation explicitly says “plan for AI failure,” examiners are already asking the question.
How often should we test AI resilience?
Run AI-specific tabletop exercises at least annually — quarterly for mission-critical AI systems. Automated failover mechanisms (Tier 1 fallbacks) should be tested monthly. Review and update your AI dependency inventory every time you onboard a new AI vendor, deploy a new model, or experience a significant AI-related incident. DORA requires advanced testing including threat-led penetration testing for critical ICT services, which should include AI systems.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.