AI Operational Resilience: Making Sure AI Systems Don't Break the Business

TL;DR:

  • AI systems are creating new single points of failure that most BCP/DR programs don’t account for — and the November 2025 Cloudflare outage proved how fast things cascade.
  • Regulators are connecting the dots between AI risk and operational resilience: DORA, SR 20-24, the EU AI Act, and OCC Heightened Standards all expect you to plan for AI failure.
  • Map your AI dependencies, stress-test vendor concentration, build fallback procedures, and run AI-specific tabletop exercises before your examiner asks why you didn’t.

Your BCP Probably Doesn’t Cover AI Failure. That’s a Problem.

On November 18, 2025, a Cloudflare outage took down ChatGPT, Claude, Shopify, and dozens of other services for over three hours. Financial institutions using OpenAI’s API for loan document processing, customer support, and compliance workflows didn’t just lose a chatbot — they lost operational capability with no fallback in place.

Then, on March 24, 2026, OpenAI announced it was shutting down Sora, its AI video generation tool, just six months after launch. App access ends April 26, 2026; API access dies September 24, 2026. Any firm that built workflows around Sora now has months to rip and replace.

These aren’t hypothetical scenarios — they’re the new normal. Between November 2025 and March 2026, major AI platforms including ChatGPT, Claude, and Cloudflare-dependent services experienced multiple disruptions, some lasting over 12 hours. And here’s the cascade problem: when one AI platform goes down, users flood alternatives, which overwhelms those systems too.

Most business continuity plans were written for server outages, natural disasters, and pandemic scenarios. AI system failure is a fundamentally different beast — and your BCP needs to catch up.

The Regulatory Landscape: Operational Resilience Meets AI

Regulators aren’t waiting for you to figure this out. Multiple frameworks now explicitly or implicitly require AI operational resilience planning.

US: SR 20-24 and OCC Heightened Standards

The interagency paper “Sound Practices to Strengthen Operational Resilience” (SR 20-24), issued by the Federal Reserve, FDIC, and OCC, draws from existing guidance on operational risk management, business continuity management, third-party risk management, and cybersecurity risk management. While it doesn’t explicitly mention AI, its principles apply directly:

  • Identify critical operations and core business lines that depend on AI
  • Map internal and external dependencies — including AI vendor APIs
  • Maintain sound scenario analysis that includes technology disruption

The OCC’s Heightened Standards require larger banks to adjust risk governance when introducing AI activities. Examiners are already asking about AI dependency during operational resilience reviews.

EU: DORA and the Critical Provider Framework

The Digital Operational Resilience Act (DORA), which entered into application on January 17, 2025, is the most prescriptive framework for AI operational resilience. DORA requires financial entities to:

  • Maintain a Register of Information on all ICT third-party arrangements (due to competent authorities by April 30, 2025)
  • Assess and manage ICT concentration risk across critical providers
  • Conduct threat-led penetration testing and resilience testing

On November 18, 2025, the European Supervisory Authorities (ESAs) published the first list of designated Critical ICT Third-Party Providers (CTPPs) under DORA. The list includes hyperscale cloud providers, data center operators, and financial services technology vendors — IBM was among those formally designated in December 2025. These CTPPs now face direct EU-level oversight based on four criteria: systemic impact, reliance by financial entities, sector concentration, and substitutability.

EU AI Act: Resilience for High-Risk Systems

The EU AI Act, most of whose provisions apply from August 2, 2026, requires high-risk AI system providers to ensure appropriate levels of accuracy, robustness, and cybersecurity. Article 15 specifically mandates that high-risk AI systems achieve “an appropriate level of resilience” against errors, faults, and inconsistencies. If your AI system makes or supports consequential financial decisions, resilience isn’t optional — it’s a legal requirement.

AI Dependency Mapping: Know What Breaks When AI Goes Down

Before you can plan for AI failure, you need to know where AI lives in your organization. Most firms can’t answer this question completely.

The Dependency Inventory

Build an AI dependency map that covers every business process touching AI:

| Dependency Category | What to Document | Example |
| --- | --- | --- |
| Direct AI services | Model name, provider, API endpoint, SLA terms | OpenAI GPT-4 for customer support triage |
| Embedded AI | Vendor products with AI components you didn’t build | Fraud detection in your payment processor |
| AI infrastructure | Cloud providers, data pipelines, vector databases | AWS Bedrock for model hosting |
| Data dependencies | Training data sources, real-time data feeds | Market data feeds for trading models |
| Human dependencies | ML engineers, data scientists, model validators | 2-person ML ops team |
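If you want the inventory to be queryable rather than a static spreadsheet, a minimal machine-readable schema helps. The sketch below is illustrative — the field names, categories, and records are assumptions, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class AIDependency:
    """One row of the AI dependency inventory (illustrative schema)."""
    process: str      # business process that relies on the AI capability
    category: str     # direct | embedded | infrastructure | data | human
    provider: str     # vendor or internal team
    detail: str       # model name, API endpoint, SLA terms, etc.
    criticality: str  # mission-critical | important-degraded | nice-to-have

inventory = [
    AIDependency("Customer support triage", "direct", "OpenAI",
                 "GPT-4 via API, 99.9% SLA", "important-degraded"),
    AIDependency("Payment fraud detection", "embedded", "Payment processor",
                 "Vendor-supplied fraud model", "mission-critical"),
]

# Pull out everything that will need a documented Tier 1 fallback
critical = [d for d in inventory if d.criticality == "mission-critical"]
```

Keeping the inventory in a structured form like this makes the later concentration-risk and criticality analyses a filter operation instead of a manual review.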

Finding Shadow AI

Your inventory is incomplete if you only count sanctioned tools. Shadow AI — employees using ChatGPT, Copilot, and other tools without IT approval — creates undocumented dependencies that won’t show up in your BCP until something breaks.

Detection methods:

  • Network monitoring for API calls to known AI providers
  • Procurement and expense audits for AI subscription charges
  • Browser extension and endpoint agent scanning
  • Employee surveys (you’ll be surprised what people admit to using)
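The first detection method above can be approximated with a simple log scan. A minimal sketch, assuming a known-provider domain list and a `<user> <domain> <port>` proxy log format — both of which are illustrative assumptions:

```python
# Flag outbound requests to known AI provider domains in a proxy log.
# The domain list and log format are illustrative assumptions.
AI_PROVIDER_DOMAINS = {
    "api.openai.com", "api.anthropic.com",
    "generativelanguage.googleapis.com", "api.cohere.com",
}

def find_shadow_ai(log_lines):
    """Return (user, domain) pairs for traffic to AI provider endpoints."""
    hits = []
    for line in log_lines:
        user, domain = line.split()[:2]  # assumed format: "<user> <domain> <port>"
        if domain in AI_PROVIDER_DOMAINS:
            hits.append((user, domain))
    return hits

sample_log = [
    "jdoe api.openai.com 443",
    "asmith intranet.example.com 443",
    "jdoe api.anthropic.com 443",
]
print(find_shadow_ai(sample_log))
```

In practice you would feed this from your proxy or DNS logs and keep the domain list current, since new AI endpoints appear constantly.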

The Critical Path Question

For every AI-dependent process, ask: What happens if this AI system is unavailable for 4 hours? 24 hours? 7 days?

Map each to one of three categories:

  1. Mission-critical: Business stops without it (e.g., AI-driven fraud detection on transaction processing)
  2. Important but degraded: Business continues at reduced capacity (e.g., AI-assisted customer service)
  3. Nice-to-have: Manual workaround exists with minimal impact (e.g., AI-generated internal reports)

This classification drives your investment in fallbacks.

Vendor Concentration Risk: The Elephant in the Room

Here’s an uncomfortable truth: a handful of AI providers power most of the financial services industry’s AI capabilities. OpenAI, Anthropic, Google, and Microsoft dominate the foundation model layer. AWS, Azure, and GCP dominate the infrastructure layer. That’s a massive concentration of risk.

Why This Matters

McKinsey’s 2025 State of AI survey found that, across six consecutive years of research, most respondents’ organizations mitigate only a small share of the risks associated with AI use. Vendor concentration risk is consistently under-managed.

The Sora shutdown illustrates the product discontinuation risk: OpenAI can kill a product with 30 days’ notice, and there’s nothing in your contract that prevents it. If you built workflows on Sora, you’re now scrambling.

As Aon’s 2026 AI Risk report noted, AI platform dependencies “introduce concentration risk and underscore the importance of supply chain resilience in AI environments.”

Concentration Risk Assessment

Evaluate your AI vendor portfolio across five dimensions:

| Risk Dimension | Assessment Question | Red Flag |
| --- | --- | --- |
| Provider concentration | How many critical processes depend on one AI vendor? | >3 critical processes on a single vendor |
| Infrastructure concentration | Do your AI vendors share the same cloud provider? | All AI vendors on AWS |
| Model concentration | Are you using one foundation model family across use cases? | GPT-4 for everything |
| Geographic concentration | Where are your AI vendors’ data centers? | All US-based, no EU fallback |
| Financial viability | Can your AI vendor sustain operations long-term? | Vendor burning cash with no profitability path |

That last dimension deserves special attention. Reports in early 2026 flagged that some major AI providers face significant financial sustainability questions, with projected losses of billions per year. If your critical AI vendor becomes financially distressed, your operational resilience plan needs to account for it.
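The first two red flags in the table lend themselves to automated checks against your dependency inventory. A sketch with illustrative data — the thresholds mirror the table, and the record structure is an assumption:

```python
from collections import Counter

# Illustrative dependency records; in practice these come from your inventory.
deps = [
    {"process": "Fraud detection", "vendor": "OpenAI", "cloud": "Azure", "critical": True},
    {"process": "KYC screening",   "vendor": "OpenAI", "cloud": "Azure", "critical": True},
    {"process": "Support triage",  "vendor": "OpenAI", "cloud": "Azure", "critical": True},
    {"process": "Doc processing",  "vendor": "OpenAI", "cloud": "Azure", "critical": True},
]

def concentration_flags(deps):
    """Return red-flag strings for provider and infrastructure concentration."""
    flags = []
    crit_by_vendor = Counter(d["vendor"] for d in deps if d["critical"])
    for vendor, n in crit_by_vendor.items():
        if n > 3:  # table threshold: >3 critical processes on a single vendor
            flags.append(f"{n} critical processes on {vendor}")
    clouds = {d["cloud"] for d in deps}
    if len(clouds) == 1:  # table threshold: all AI vendors on one cloud
        flags.append(f"all AI vendors on {clouds.pop()}")
    return flags

print(concentration_flags(deps))
```

Running checks like this on every inventory update turns concentration risk from an annual review item into a continuous control.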

Building AI Fallback Procedures

Every mission-critical AI process needs a documented fallback. Here’s the framework:

Fallback Tier Model

Tier 1 — Automated failover (seconds to minutes):

  • Secondary AI vendor API activated via circuit breaker
  • Load balancer routes to backup model
  • Degraded but functional service continues
  • Best for: Real-time fraud detection, trading models, customer-facing AI
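A Tier 1 circuit breaker can be small. The sketch below fails over to a secondary provider after repeated primary failures; the provider call signatures are assumptions, and a production version would add a half-open state that periodically retries the primary:

```python
class AIFailover:
    """Circuit-breaker failover between a primary and secondary AI provider.

    Illustrative sketch: `primary` and `secondary` are any callables that
    take a prompt and return a response (signatures are assumptions).
    """

    def __init__(self, primary, secondary, max_failures=3):
        self.primary, self.secondary = primary, secondary
        self.max_failures = max_failures
        self.failures = 0  # consecutive primary failures

    def call(self, prompt):
        if self.failures < self.max_failures:  # circuit still closed
            try:
                result = self.primary(prompt)
                self.failures = 0              # success resets the count
                return result
            except Exception:
                self.failures += 1
        return self.secondary(prompt)          # circuit open: use the backup

# Demo with a primary that always times out (illustrative):
def flaky_primary(prompt):
    raise TimeoutError("primary unavailable")

def backup_model(prompt):
    return "backup:" + prompt

failover = AIFailover(flaky_primary, backup_model, max_failures=2)
print(failover.call("score this transaction"))
```

The design choice worth noting: once the threshold is hit, the breaker stops sending traffic to the failing primary at all, which protects your latency budget and avoids hammering a provider that is already struggling.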

Tier 2 — Managed switchover (minutes to hours):

  • Operations team activates pre-configured backup
  • May involve switching to a simpler model or rule-based system
  • Service quality drops but core function maintained
  • Best for: AI-assisted underwriting, document processing, compliance screening

Tier 3 — Manual operations (hours to days):

  • Human operators take over AI-dependent processes
  • Pre-written runbooks guide manual execution
  • Capacity limited by available staff
  • Best for: AI-generated reports, risk analytics, internal tooling

What a Good Fallback Plan Includes

For each AI-dependent process, document:

  1. Trigger criteria: What conditions activate the fallback? (API error rate > 5%? Latency > 10 seconds? Complete outage?)
  2. Decision authority: Who authorizes the switch? At 2 AM on a Saturday?
  3. Switchover procedure: Step-by-step, including technical commands
  4. Capacity constraints: How much volume can the fallback handle?
  5. Communication protocol: Who gets notified — internal teams, customers, regulators?
  6. Restoration procedure: How do you switch back when the primary AI service recovers?
  7. Data reconciliation: How do you sync data processed during the fallback period?
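The trigger criteria in item 1 can be encoded as a health check that your monitoring calls on each evaluation cycle. The metric names and structure are illustrative assumptions; the thresholds match the examples in the text:

```python
def should_trigger_fallback(metrics,
                            max_error_rate=0.05,   # API error rate > 5%
                            max_latency_s=10.0):   # latency > 10 seconds
    """Return a reason string if any trigger condition is met, else None.

    `metrics` is an assumed structure fed by your monitoring stack.
    """
    if metrics.get("outage"):
        return "complete outage"
    if metrics["error_rate"] > max_error_rate:
        return f"error rate {metrics['error_rate']:.0%} > {max_error_rate:.0%}"
    if metrics["p95_latency_s"] > max_latency_s:
        return f"p95 latency {metrics['p95_latency_s']}s > {max_latency_s}s"
    return None

print(should_trigger_fallback({"error_rate": 0.08, "p95_latency_s": 2.1}))
```

Returning the reason string rather than a bare boolean matters for item 5: the same value can drive the alert that notifies internal teams and, where required, regulators.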

AI Tabletop Exercises: Test Before You Need It

You wouldn’t skip fire drills because your building hasn’t burned down. The same logic applies to AI resilience.

Tabletop exercises for AI failures are becoming standard practice — and increasingly required. DORA mandates threat-led penetration testing for financial entities. The FFIEC IT Examination Handbook expects BCP testing that reflects current technology dependencies. And as an ISACA analysis on operational resilience in the AI era noted, resilience “determines whether a business can withstand shocks and continue to generate value in the event of a disaster.”

Five AI Tabletop Scenarios to Run

Scenario 1: Complete AI vendor outage
Your primary AI provider is down for 8+ hours. No API access. How do customer-facing and back-office operations continue?

Scenario 2: AI model producing wrong outputs
Your credit decisioning model starts approving high-risk applicants at 3x the normal rate. The model isn’t “down” — it’s just wrong. How quickly do you detect it? Who pulls the kill switch?

Scenario 3: AI vendor discontinues a product
You receive notice that a critical AI tool will be discontinued in 60 days (see: Sora). What’s your migration plan? Do you even have one?

Scenario 4: Cascading AI failure
Your primary AI vendor goes down. You fail over to your secondary. The secondary is also degraded because everyone else failed over to them too. Now what?

Scenario 5: AI data poisoning or model compromise
An adversary has been feeding manipulated data to your AI system for weeks. Your outputs are subtly wrong but have been used for thousands of decisions. What’s your investigation and remediation protocol?

Running the Exercise

  • Participants: Business unit leads, ML/AI team, IT operations, risk management, compliance, legal
  • Duration: 2-3 hours per scenario
  • Output: Gap analysis, updated runbooks, remediation action items with owners and deadlines
  • Frequency: At least annually, more often for mission-critical AI systems

The 30/60/90-Day Implementation Roadmap

Days 1-30: Discovery and Assessment

| Week | Deliverable | Owner |
| --- | --- | --- |
| Week 1 | Complete AI dependency inventory (including shadow AI discovery) | CTO / Head of AI |
| Week 2 | Classify all AI dependencies by criticality tier | CRO / Head of Operational Risk |
| Week 3 | Assess vendor concentration risk across all five dimensions | TPRM Lead |
| Week 4 | Identify gaps between current BCP and AI dependencies | Head of BCP/DR |

Days 31-60: Build Fallback Framework

| Week | Deliverable | Owner |
| --- | --- | --- |
| Weeks 5-6 | Design and document Tier 1/2/3 fallback procedures for all mission-critical AI | ML Ops Lead + Business Unit Owners |
| Week 7 | Negotiate secondary vendor agreements or develop in-house backup models | Procurement + ML Engineering |
| Week 8 | Update BCP/DR plans to include AI-specific failure scenarios and recovery procedures | Head of BCP/DR |

Days 61-90: Test and Operationalize

| Week | Deliverable | Owner |
| --- | --- | --- |
| Weeks 9-10 | Conduct first AI tabletop exercise (Scenario 1 or 2) | Head of Operational Risk |
| Week 11 | Implement monitoring and alerting for AI system health and vendor SLA compliance | ML Ops / IT Operations |
| Week 12 | Executive readout with gap analysis, remediation plan, and ongoing testing schedule | CRO |

So What?

AI operational resilience isn’t a future-state problem — it’s a right-now problem. The Cloudflare outage proved that AI vendor failures cascade instantly. The Sora shutdown proved that AI products can disappear with weeks of notice. And regulators from the ESAs to the OCC are actively building frameworks that expect you to have planned for all of it.

The firms that treat AI like any other critical infrastructure — with dependency mapping, concentration risk limits, documented fallbacks, and regular testing — will weather AI disruptions without breaking a sweat. The firms that don’t will be explaining to their examiner why a single API going down shut off their fraud detection for half a day.

Start with the dependency map. You can’t protect what you can’t see.

Need a structured framework? The Business Continuity & Disaster Recovery Kit includes BIA templates, recovery plan frameworks, and testing protocols you can adapt for AI operational resilience.

FAQ

How is AI operational resilience different from traditional IT resilience?

Traditional IT resilience focuses on infrastructure — servers, networks, databases — where failures are usually binary (up or down) and well-understood. AI resilience adds unique challenges: models can degrade silently (producing wrong outputs without error messages), vendor concentration risk is extreme (a handful of providers power most AI), and failure modes cascade unpredictably (one platform’s outage overwhelms alternatives). Your BCP needs AI-specific scenarios, not just an extension of existing IT recovery plans.

What regulations require AI operational resilience planning?

In the EU, DORA (applicable since January 2025) explicitly requires ICT operational resilience including vendor concentration risk management. The EU AI Act (most provisions applicable from August 2026) mandates resilience for high-risk AI systems under Article 15. In the US, SR 20-24’s interagency sound practices and the OCC Heightened Standards apply operational resilience expectations to AI dependencies. The FFIEC IT Examination Handbook expects BCP testing that covers current technology dependencies. While no US regulation explicitly says “plan for AI failure,” examiners are already asking the question.

How often should we test AI resilience?

Run AI-specific tabletop exercises at least annually — quarterly for mission-critical AI systems. Automated failover mechanisms (Tier 1 fallbacks) should be tested monthly. Review and update your AI dependency inventory every time you onboard a new AI vendor, deploy a new model, or experience a significant AI-related incident. DORA requires advanced testing including threat-led penetration testing for critical ICT services, which should include AI systems.

Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
