AI and Business Continuity: How to Plan for AI System Failures and Model Risk

On July 19, 2024, a faulty software update from CrowdStrike crashed approximately 8.5 million Windows systems worldwide. The resulting outage — later described as the largest in the history of information technology — caused an estimated $5.4 billion in losses for Fortune 500 companies. Airlines grounded flights. Banks locked customers out of accounts. Hospitals rerouted patient care.

CrowdStrike isn’t an AI company. But the failure pattern it exposed is the same one that will define AI continuity failures over the next decade: single-vendor dependency, cascading system effects, and a gap between what organizations’ business continuity plans (BCPs) assumed could fail and what actually failed.

Your BCP was probably written for server outages. It wasn’t written for a model that keeps running while quietly making the wrong decisions — or for the day your AI vendor gets acquired, deprecates your foundation model, or suffers an infrastructure failure that takes your fraud detection offline at 2am on a Friday.

TL;DR

  • AI systems fail differently than traditional IT — often silently, through drift or degraded outputs rather than visible downtime
  • CrowdStrike’s July 2024 outage ($5.4B in Fortune 500 losses) illustrated the systemic risk of single-vendor technology dependency — the same pattern applies to AI vendors
  • Your business impact analysis (BIA) needs an AI dependency layer: map which business functions rely on AI and what the failure impact is
  • Recovery objectives (RTO/RPO) for AI systems must account for model version management and fallback procedures, not just infrastructure restoration
  • EU AI Act and SR 11-7 create regulatory obligations around AI resilience that are landing in BCP examinations

The Core Problem: AI Doesn’t Fail Like a Server

Traditional IT continuity planning is built around a simple failure model: systems are up or down. Availability monitoring catches failures quickly, runbooks activate, and recovery procedures restore service.

AI failures don’t follow this pattern. They fall into three categories that your existing BCP almost certainly doesn’t address:

1. Silent Degradation (Model Drift)

The model keeps running. The API returns a 200 status. But the underlying statistical patterns the model learned during training have diverged from current reality. Credit models trained on 2021 borrower behavior may systematically miscalibrate in 2025. Fraud models trained on card-present transactions may miss card-not-present fraud patterns that emerged after COVID.

Silent drift doesn’t trigger monitoring alerts. It surfaces as unexplained losses, rising complaint volumes, or a regulatory examination finding that your model “no longer reflects current risk.”
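Drift can be caught with ordinary monitoring statistics if you instrument for it. One widely used indicator is the population stability index (PSI), which compares the score distribution the model produced at training time with recent production scores. Below is a minimal sketch, assuming you can pull both score arrays; the NumPy implementation details and the 0.1 / 0.25 bands are a common rule of thumb, not a standard — your threshold is a policy choice.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (training-era) score distribution and a
    recent production distribution. Rule of thumb: < 0.1 stable,
    0.1-0.25 watch list, > 0.25 investigate as drift."""
    # Bin edges from the baseline distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range production scores

    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)

    # Small floor avoids log(0) / division by zero on empty bins
    exp_pct = np.clip(exp_counts / len(expected), 1e-6, None)
    act_pct = np.clip(act_counts / len(actual), 1e-6, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Illustrative data: a training-era score distribution vs. a shifted one
rng = np.random.default_rng(0)
baseline = rng.normal(600, 50, 10_000)
production = rng.normal(580, 60, 10_000)
print(f"PSI = {population_stability_index(baseline, production):.3f}")
```

The useful part for BCP purposes is not the statistic itself but wiring the threshold breach to a documented escalation path — the scenario in Step 3 below assumes exactly this kind of alert exists.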

2. Vendor Platform Failures

According to Information Week, many organizations have concentrated critical AI functions with a handful of providers — foundation model vendors, AI-as-a-service platforms, or specialized vertical AI companies. When those platforms go down, organizations with no fallback face complete functional failure.

The BCI’s analysis of AI and business continuity notes that most enterprises have no continuity plan for the day their foundation model gets deprecated, repriced, or acquired. These aren’t hypothetical risks — they’re foreseeable events in a rapidly consolidating AI market.

3. Adversarial Failures (Data Poisoning, Prompt Injection)

Ransomware attacks targeting AI infrastructure surged significantly in 2025, with threat actors specifically targeting AI workloads and GPU resources. Beyond ransomware, adversarial inputs — prompt injection, model evasion, data poisoning — can corrupt AI outputs without disabling the underlying infrastructure. The model appears operational while producing manipulated results.

For high-risk AI systems in financial services, these aren’t theoretical threats. They’re active attack vectors.

Step 1: Add an AI Dependency Layer to Your BIA

The Business Impact Analysis is the foundation of BCP. If your BIA doesn’t capture AI dependencies, your BCP can’t address AI failures.

For every business function in your BIA, add the following questions:

Does this function rely on an AI system?

  • What does the AI do? (Describe the decision or process)
  • Who is the AI vendor / platform?
  • What is the estimated impact if the AI is unavailable for 1 hour? 4 hours? 24 hours?
  • Is there a manual fallback? Is it documented and tested?
  • What are the early warning indicators that the AI is failing or degrading?

Build a simple AI dependency register as part of the BIA. It doesn’t need to be a separate system — add a column to your existing BIA spreadsheet. What you’re building is visibility: which functions have AI dependencies, and what those dependencies mean for recovery.

| Business Function | AI System | Vendor | Manual Fallback? | Drift Indicators | RTO (AI) |
| --- | --- | --- | --- | --- | --- |
| Fraud detection | [Vendor model] | [Vendor] | Manual review queue | Rising false-negative rate | 2 hours |
| Loan decisioning | Internal credit model | N/A | Manual underwriting | KS stat, PSI > threshold | 4 hours |
| Customer service routing | LLM-based classifier | [Vendor] | Manual queue | Resolution rate, escalation rate | 1 hour |
| AML transaction monitoring | Rules + ML hybrid | [Vendor] | Rules-only mode | Alert rate deviation | 8 hours |

This table becomes the AI-specific section of your BIA and drives recovery objective setting.
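If your team keeps risk tooling in code rather than spreadsheets, the same register can be sketched as a small data class with an automated gap check. Field names here are illustrative — align them with your existing BIA columns — and the two sample rows are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AIDependency:
    """One row of the AI dependency register (mirrors the BIA table)."""
    business_function: str
    ai_system: str
    vendor: str               # "internal" for in-house models
    manual_fallback: str      # documented fallback process, or "NONE"
    drift_indicators: list[str]
    rto_hours: float          # RTO of the business function, not the model

    def gaps(self) -> list[str]:
        """Flag the continuity gaps a reviewer should chase down."""
        issues = []
        if self.manual_fallback.upper() == "NONE":
            issues.append("no manual fallback documented")
        if not self.drift_indicators:
            issues.append("no drift / degradation indicators defined")
        return issues

register = [
    AIDependency("Fraud detection", "Vendor fraud model", "ExampleVendor",
                 "Manual review queue", ["rising false-negative rate"], 2),
    AIDependency("Customer service routing", "LLM-based classifier",
                 "ExampleVendor", "NONE", [], 1),
]

for dep in register:
    for gap in dep.gaps():
        print(f"{dep.business_function}: {gap}")
```

The `gaps()` check is the point: a register that can tell you which functions have no fallback or no drift indicator turns the BIA from documentation into an early-warning tool.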

Step 2: Set Recovery Objectives for AI Systems

RTO and RPO apply to AI systems — but they work differently than for traditional IT.

RTO for AI systems should be set based on the business function the AI supports, not the AI technology itself. If fraud detection has a 2-hour RTO, that means the function must be operational within 2 hours — either through AI recovery or through a manual fallback process. Both paths need to be planned and tested.

RPO for AI systems is more complex. For traditional IT, RPO is about data freshness — restore from a backup taken N hours ago. For AI:

  • Model version management matters. Restoring AI infrastructure to a stale model version may produce different outputs than expected — in fraud detection, that could mean different approval rates or detection patterns. Your RPO should specify which model version is the acceptable fallback.
  • Training data currency matters. If your model is continuously retrained on recent data, what happens when it’s restored to an older checkpoint? Does the business function owner understand that outputs may differ?
  • Configuration and feature pipelines matter. An AI model restored without its feature engineering pipeline or real-time data feeds may produce nonsensical outputs.

Work with your data science and ML engineering teams to document what a “recovered AI system” actually means for each model in scope.
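That definition can be made executable as a recovery smoke test. The sketch below is hypothetical — the model interface, version string, and golden-set idea are assumptions, not any specific framework's API — but it captures the two RPO concerns above: version pinning and output consistency against a pre-incident baseline.

```python
def verify_restored_model(model, expected_version, golden_inputs,
                          golden_outputs, tolerance=1e-6):
    """Return a list of recovery-criteria failures (empty list = pass).
    Run before routing live traffic back to a restored model."""
    failures = []
    # 1. Restored model version matches the version your RPO designates
    if model.version != expected_version:
        failures.append(
            f"version {model.version!r} != expected {expected_version!r}")
    # 2. Golden-set outputs match the pre-incident baseline within tolerance
    for x, expected in zip(golden_inputs, golden_outputs):
        got = model.predict(x)
        if abs(got - expected) > tolerance:
            failures.append(f"output mismatch on input {x!r}: {got} vs {expected}")
    return failures

class StubModel:
    """Stand-in so the sketch runs end to end; replace with your model client."""
    version = "2024-11-fraud-v3"
    def predict(self, x):
        return x * 0.5

model = StubModel()
print(verify_restored_model(model, "2024-11-fraud-v3", [2.0, 4.0], [1.0, 2.0]))
```

A restored model that passes a version check but fails the golden-set comparison is exactly the "API returns 200 but outputs differ" failure mode this article is about — the test makes that visible before customers do.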

Step 3: Write AI Failure Scenarios for Your BCP

Traditional BCP scenarios — ransomware, natural disaster, key person loss — need AI-specific variants. Add these to your testing calendar:

Scenario: Foundation model deprecated. Your primary AI vendor announces end-of-life for the model version you depend on in 90 days. The replacement model produces measurably different outputs. What’s your migration plan? Who owns it? What’s the fallback if migration isn’t complete in time?

Scenario: AI vendor platform outage. Your AI-as-a-service provider experiences an unplanned infrastructure outage during peak processing hours. Your fraud model, document classification pipeline, and customer service routing are all offline. What processes go to manual? How long can you sustain manual operations? What’s the communication plan for customers and regulators?

Scenario: Silent model degradation detected. Monitoring flags that your credit model’s population stability index has exceeded your drift threshold. The model is technically operational but outputs may no longer be reliable. Do you continue using the model? Suspend it? Who has authority to make that call? What’s the documented escalation path?

Scenario: Adversarial attack on AI system. A threat actor has injected malicious inputs into a document processing pipeline, causing the AI to approve fraudulent claims. The attack is discovered three days after it began. What’s the incident response process? How do you quantify the exposure? What regulatory notification requirements apply?

Each scenario should include: trigger conditions, detection method, immediate response steps, escalation path, fallback procedures, and recovery criteria. For more scenario structure, see 10 Tabletop Exercise Scenarios for Business Continuity.

Step 4: Address AI Vendor Risk in Your BCP

Single-vendor dependency in AI is a continuity risk that standard TPRM programs aren’t always designed to catch. Your BCP and vendor management programs need to align on AI vendor resilience.

For every critical AI vendor, document:

Contractual protections:

  • Uptime SLAs with meaningful penalty provisions
  • Data portability rights (can you extract your data and models if the vendor fails?)
  • Change notification requirements (how much notice do you get before model updates?)
  • Model deprecation notice periods

Operational safeguards:

  • Can you export a static version of the model for fallback use?
  • Does the vendor publish their own BCP and have they tested it?
  • Is there a multi-vendor architecture for this function, or are you fully concentrated?

Monitoring and early warning:

  • What leading indicators tell you the vendor platform is degrading before it fails?
  • Do you receive proactive notifications or rely on status page monitoring?
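Detection and fallback can also be wired together in application code. Below is a simplified circuit-breaker-style sketch: after a run of consecutive vendor failures, work routes to the documented manual fallback instead of retrying the vendor indefinitely. The names (`call_vendor`, `manual_queue`) and the threshold are hypothetical; production implementations usually add half-open retry logic and alerting.

```python
class AIFallbackRouter:
    """Route items to an AI vendor call, falling back to the documented
    manual process once the vendor looks unhealthy (circuit open)."""

    def __init__(self, call_vendor, manual_queue, failure_threshold=3):
        self.call_vendor = call_vendor        # vendor API call
        self.manual_queue = manual_queue      # documented manual fallback
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def score(self, item):
        if self.consecutive_failures >= self.failure_threshold:
            return self.manual_queue(item)    # circuit open: skip the vendor
        try:
            result = self.call_vendor(item)
            self.consecutive_failures = 0     # healthy call resets the counter
            return result
        except Exception:
            self.consecutive_failures += 1
            return self.manual_queue(item)    # per-item fallback on failure

# Demo with a vendor that is hard down
def flaky_vendor(item):
    raise TimeoutError("vendor platform outage")

def manual_queue(item):
    return f"queued for manual review: {item}"

router = AIFallbackRouter(flaky_vendor, manual_queue, failure_threshold=2)
print(router.score("txn-001"))  # falls back on the first failure
```

The design point is continuity of the business function, not the technology: every path through `score()` returns a disposition, which is what a 2-hour RTO on fraud detection actually requires.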

Operational Resilience and Business Continuity is relevant here: the PRA’s operational resilience framework and its financial services equivalents are increasingly focused on important business services — which in many organizations are now AI-dependent.

The Regulatory Landscape

EU AI Act: Article 15 requires that high-risk AI systems demonstrate resilience — specifically, the ability to resist attempts to alter outputs and to maintain accuracy across the operational lifecycle. Redundancy and failover mechanisms are explicitly required for critical infrastructure AI. Full applicability for high-risk AI providers lands August 2, 2026. Non-compliance penalties reach €15 million or 3% of global annual turnover.

SR 11-7 (Model Risk Management): The Federal Reserve and OCC’s model risk guidance requires that model inventories include governance over model performance monitoring and validation. While SR 11-7 predates modern AI, examiners apply its principles to ML models and increasingly expect to see model risk considerations integrated into BCP programs. AI Consequential Decision-Making compliance covers where the regulatory lines are being drawn.

FFIEC BCM: The FFIEC’s Business Continuity Management booklet doesn’t specifically address AI, but the principles apply: critical functions must have documented recovery procedures, and systems supporting critical functions must be included in the testing program. As AI becomes embedded in more critical financial institution functions, examiners are beginning to ask specifically about AI dependencies in BCP examinations.

ISACA’s 2025 Operational Resilience guidance specifically calls out AI as a new category of operational risk requiring integration into enterprise resilience programs — alongside cyber resilience and third-party resilience. The ISACA Now Blog analysis notes that most organizations still treat AI governance as separate from continuity planning.

What a Mature AI Continuity Posture Looks Like

Immature: AI is deployed, but not captured in the BIA. No manual fallback documented. Recovery objectives were set for the underlying infrastructure but not for the AI function. Testing has never included an AI failure scenario.

Developing: AI dependencies are in the BIA. Vendor SLAs exist. RTO/RPO cover AI functions. No testing of fallback procedures.

Mature: AI dependency register complete and updated annually. Manual fallback procedures documented and tested. RTO/RPO include model version management and data currency requirements. At least one AI failure scenario included in the annual BCP testing calendar. Board reporting includes AI resilience as a distinct program element.

Most organizations are between immature and developing. The gap between developing and mature is primarily a testing and documentation problem, not a technology problem.

So What?

The CrowdStrike outage didn’t require AI to demonstrate that technology vendor dependency is an existential continuity risk. AI raises the stakes — because AI failures can be silent, gradual, and deeply embedded in consequential decisions before anyone notices.

Your BCP wasn’t written for this. That doesn’t mean it can’t cover it — it means you need to add an AI dependency layer to your BIA, write AI failure scenarios into your testing calendar, and document fallback procedures for every function where AI is making decisions that matter.

The organizations that get ahead of this now won’t be scrambling when the first AI vendor outage becomes their headline.

The Business Continuity & Disaster Recovery (BCP/DR) Kit includes a BIA template, BCP plan templates, and tabletop exercise facilitator guides — all ready to be updated for AI dependencies. For organizations also managing model risk, the AI Risk Assessment Template covers model inventory, pre-deployment checklists, and monitoring frameworks that connect directly to BCP continuity requirements.

For the broader AI resilience picture, see AI Operational Resilience for Financial Services — which covers how AI is reshaping the resilience conversation across the financial sector.


FAQ

What’s the most common AI continuity gap in BCP programs? No manual fallback for AI-dependent processes. Most organizations have BCPs that assume IT systems fail cleanly and recovery restores the same functionality. When an AI model is the critical system — and it either fails or degrades — there’s often no documented path for the human process that substitutes. Write the manual procedure before you need it.

How do we include AI in our BIA without a full model inventory? Start with function-level questions, not system-level questions. For each business function, ask: does this function rely on any automated decisioning, scoring, classification, or recommendation system? If yes, document it. You don’t need a complete ML inventory to add AI dependencies to a BIA — you need your business function owners to be honest about what they’re relying on.

Does FFIEC specifically examine AI in BCP? Not yet with dedicated AI-specific guidance, but FFIEC examiners apply the general BCP requirements to AI-dependent critical functions. As AI becomes more embedded in financial institution operations, examiner questions about AI dependencies in BCP programs are increasing. Don’t wait for formal guidance — the obligation to cover critical functions is already there.

What is an acceptable manual fallback for AI fraud detection? A manual fraud review queue with defined escalation criteria, staffing estimates, and throughput assumptions. It doesn’t need to match AI volume indefinitely — it needs to sustain operations for the RTO window (e.g., 4 hours) while AI is restored. Document the queue management process, who authorizes transactions above threshold, and what transactions are suspended vs. processed at elevated risk tolerance during the fallback period.

Why is AI different from traditional IT in business continuity planning?

Traditional IT systems fail in binary, visible ways — they're up or down. AI systems can fail silently: they keep running while their outputs gradually degrade due to model drift, data quality issues, or adversarial inputs. This silent failure mode means AI may be making bad decisions for days or weeks before anyone notices. Your BCP needs specific detection criteria — not just availability monitoring — to catch AI failures before they cause material harm.

What is model drift and how does it affect business continuity?

Model drift occurs when the statistical patterns an AI model was trained on diverge from the patterns in current data. A fraud detection model trained pre-COVID may perform poorly on post-COVID transaction patterns. A credit scoring model trained before a rate cycle shift may systematically misevaluate risk. Unlike a server outage, drift doesn't trigger a downtime alert — it shows up as unexplained losses, customer complaints, or regulatory findings. BCP planning needs to account for detecting drift and having fallback procedures.

Do we need separate BCPs for AI systems?

Most organizations don't need standalone AI BCPs — they need existing BCPs updated to cover AI-specific failure modes. For each business function that depends on an AI system, the BCP should document: what the AI does, how failure is detected, what the manual or fallback process is, and who has authority to suspend or override the AI. High-risk AI systems (those making consequential decisions) warrant more detailed treatment.

What does the EU AI Act require for business continuity of high-risk AI?

EU AI Act Article 9 requires high-risk AI systems to have comprehensive risk management systems, and Article 15 imposes robustness, resilience, and accuracy requirements. For high-risk AI systems, operators must implement redundancy and failover mechanisms to ensure continuity even when primary AI systems fail. Penalties for non-compliance reach €15 million or 3% of global annual turnover, whichever is higher.

What is an AI vendor single point of failure and how do we mitigate it?

An AI vendor becomes a single point of failure when your organization relies on one provider for a critical AI capability — a foundation model, an AI-powered fraud detection service, or a document processing platform — without fallback options. The CrowdStrike outage of July 2024, which caused $5.4 billion in losses to Fortune 500 companies, illustrated how single-vendor dependency in any critical technology creates systemic exposure. Mitigation strategies include multi-vendor architecture for critical AI functions, contractual SLAs with meaningful penalty provisions, and documented manual fallback procedures for every AI-dependent process.

How do we set RTO and RPO for AI systems?

RTO (Recovery Time Objective) for an AI system should be set based on the business function it supports — not the AI technology itself. If the AI handles loan decisioning with a 4-hour RTO for the function, then either the AI must be recoverable within that window or a manual decisioning process must be available. RPO (Recovery Point Objective) for AI is more nuanced: it includes not just data recovery but also model version management, because an AI restored to a stale model version may produce different outputs than expected.
Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

