AI Risk

Continuous Monitoring for AI Models: Drift, Degradation, and Compliance Triggers

April 19, 2026
Rebecca Leung

TL;DR:

  • SR 11-7 treats ongoing monitoring as a co-equal pillar with initial validation — not a quarterly checkbox
  • AI/ML models drift faster than traditional statistical models; PSI > 0.20 triggers re-validation under standard industry practice
  • Different model types require different monitoring KPIs: credit models need PSI + accuracy decay; GenAI needs output quality + hallucination rate
  • OCC Bulletin 2026-13 explicitly excludes GenAI from current guidance — institutions must build their own GenAI monitoring frameworks while separate guidance develops

Your model passed initial validation. PSI checked out, backtests closed clean, SR 11-7 documentation was signed off. Six months later, a demographic shift in your application population has been silently degrading your credit model’s fairness metrics. Nobody noticed because the monitoring report only tracked aggregate accuracy, not segment-level performance. Your examiner is now asking why ongoing monitoring failed to catch it.

This is the ongoing monitoring problem that trips up model risk teams repeatedly. Initial validation gets the dedicated timeline, the resources, the independent review. Ongoing monitoring gets a quarterly spreadsheet and whoever can spare an afternoon.

That imbalance is exactly what regulators are correcting — and the 2026 updated guidance makes it harder to coast.

Why Ongoing Monitoring Is Not Optional

SR 11-7 (Federal Reserve, April 2011) establishes ongoing monitoring as one of three co-equal pillars alongside model development and validation. Section IV states explicitly that model risk does not diminish post-deployment — monitoring is expected to confirm that models continue performing as intended and to detect early warning signs before they become exam findings.

In practice, MRAs citing model monitoring gaps are among the most consistent findings in model risk examinations. Institutions that treat monitoring as documentation rather than detection routinely discover this the hard way.

In 2026, OCC Bulletin 2026-13 updated interagency model risk management guidance to reinforce monitoring and validation as core risk management obligations. The bulletin also explicitly excludes generative AI and agentic AI from current guidance scope — acknowledging that existing frameworks were not designed for AI-specific monitoring needs, while signaling that dedicated GenAI guidance is coming.

The takeaway for practitioners: if your monitoring program was built for logistic regression in 2015, it isn’t built for your current model portfolio.

The Four Types of Model Drift You Need to Detect

Not all drift is the same, and a single monitoring metric won’t catch all four.

1. Data Drift (Covariate Shift)

The distribution of input features changes from what the model was trained on. The model logic remains intact — but the population being scored now looks different from the training population.

Classic example: A credit scoring model trained on 2019–2021 application data encounters applicants with pandemic-era employment volatility, gig-economy income, and post-inflation expense profiles. The model’s scores reflect assumptions that no longer hold for a material portion of applications.

Primary metric: Population Stability Index (PSI)

| PSI Value | Interpretation | Required Action |
|---|---|---|
| < 0.10 | Input distributions stable | Continue standard monitoring |
| 0.10 – 0.20 | Moderate shift detected | Investigate; segment-level performance review |
| > 0.20 | Significant shift | Trigger re-validation or model replacement |

These thresholds are well-established in financial services model monitoring practice and examiners are familiar with them. If your monitoring policy doesn’t define PSI thresholds, that’s a gap that will come up.
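
PSI itself is straightforward to compute: bin the baseline (training) distribution, score the current population against the same bins, and sum the weighted log-ratio of the bin proportions. A minimal sketch (the bin count, epsilon guard, and synthetic data are illustrative choices, not a standard):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a current sample.

    Bin edges come from the baseline's percentiles; a small epsilon
    guards against log(0) in sparsely populated bins.
    """
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    # clip current values into the baseline range so nothing falls outside the bins
    act_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    eps = 1e-6
    exp_pct = np.clip(exp_pct, eps, None)
    act_pct = np.clip(act_pct, eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)          # training-era feature distribution
stable = rng.normal(0, 1, 10_000)            # same population, new sample
shifted = rng.normal(0.5, 1.2, 10_000)       # drifted population

print(population_stability_index(baseline, stable))   # stable: well under 0.10
print(population_stability_index(baseline, shifted))  # shifted: breaches 0.20
```

Run per feature on each monitoring cycle; the maximum PSI across key input features is what gets compared against the policy thresholds above.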

2. Concept Drift

The statistical relationship between inputs and the target outcome changes — even when the inputs themselves haven’t drifted materially. The world changed; the model didn’t.

Example: A fraud detection model trained on 2022 wire transfer patterns doesn’t recognize 2025-era AI-assisted social engineering attacks. Transaction amount, geography, and timing haven’t shifted dramatically — but fraudulent transactions now resemble legitimate ones in ways the model can’t detect.

Primary metrics: False positive rate trend, false negative rate trend, precision-recall curve decay over rolling 90-day windows.
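
Tracking these rates over time is mostly bookkeeping: bucket confirmed outcomes by period and condition each rate on the right denominator (false positive rate on actual negatives, false negative rate on actual positives). A toy sketch with an assumed alert-log schema:

```python
import pandas as pd

# Toy alert log (assumed schema): each row is a scored transaction
# with its eventual confirmed outcome after investigation.
df = pd.DataFrame({
    "date":    pd.to_datetime(["2025-01-10", "2025-01-20", "2025-02-05",
                               "2025-02-15", "2025-03-01", "2025-03-20"]),
    "flagged": [1, 1, 0, 1, 0, 0],   # model raised a fraud alert
    "fraud":   [1, 0, 0, 1, 1, 1],   # confirmed fraud outcome
})

df["month"] = df["date"].dt.to_period("M")
df["fp"] = (df["flagged"] == 1) & (df["fraud"] == 0)   # alert on a clean transaction
df["fn"] = (df["flagged"] == 0) & (df["fraud"] == 1)   # missed fraud

fp_rate = df[df["fraud"] == 0].groupby("month")["fp"].mean()  # FP / actual negatives
fn_rate = df[df["fraud"] == 1].groupby("month")["fn"].mean()  # FN / actual positives
trend = pd.DataFrame({"fp_rate": fp_rate, "fn_rate": fn_rate})
print(trend)
```

A rising false negative rate while input PSI stays flat is the classic concept-drift signature: the inputs look the same, but the model is missing what it used to catch.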

3. Performance Degradation

Overall model accuracy, discrimination, or output quality declines measurably. This is the most visible type of drift but often the last one detected — by the time AUROC drops 5 points, the model has been operating with impaired performance for months.

Primary metrics: AUROC, Gini coefficient, KS statistic (classification models); RMSE and MAE trend (regression models).
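
These discrimination metrics are related: AUROC is the probability a random positive outscores a random negative, Gini is simply 2·AUROC − 1, and KS is the maximum gap between the score CDFs of the two classes. A small self-contained sketch (pairwise AUROC is O(n²), fine for illustration; production code would use a ranking implementation):

```python
import numpy as np

def auroc(scores, labels):
    """P(random positive outscores random negative); ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def ks_statistic(scores, labels):
    """Max distance between the positive-class and negative-class score CDFs."""
    thresholds = np.unique(scores)
    cdf_pos = [(scores[labels == 1] <= t).mean() for t in thresholds]
    cdf_neg = [(scores[labels == 0] <= t).mean() for t in thresholds]
    return max(abs(p - n) for p, n in zip(cdf_pos, cdf_neg))

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])

auc = auroc(scores, labels)
gini = 2 * auc - 1
print(auc, gini, ks_statistic(scores, labels))
```

Monitoring compares each cycle's values against the validation baseline; the decay versus that baseline, not the absolute level, is what feeds the trigger logic.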

4. Fairness Metric Drift

Disparate impact or disparate treatment patterns emerge or worsen in a protected class segment — while overall model performance remains stable. This is the drift type regulators are most concerned about and the one most teams are least equipped to catch. A model can maintain its aggregate Gini coefficient while quietly failing a demographic segment.

Primary metrics: Demographic parity ratio, equalized odds (true positive rate parity by group), adverse action rate by protected class. The 80% rule — used in employment discrimination law and increasingly applied to AI credit models — flags when the selection rate for a protected group falls below 80% of the highest-selected group.
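
The 80% rule is simple enough to automate in a few lines: compute the selection (approval) rate per group and flag any group whose rate falls below 80% of the best-treated group's rate. A minimal sketch with made-up data:

```python
import numpy as np

def selection_rates(approved, group):
    """Approval rate per demographic group."""
    return {str(g): float(approved[group == g].mean()) for g in np.unique(group)}

def four_fifths_check(rates, threshold=0.8):
    """True = passes; False = selection rate below 80% of the best group's."""
    best = max(rates.values())
    return {g: bool(r / best >= threshold) for g, r in rates.items()}

# Illustrative decisions only: group A approved 3/5, group B approved 2/5
approved = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
group    = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

rates = selection_rates(approved, group)
print(four_fifths_check(rates))  # B's ratio is 0.40/0.60 ≈ 0.67 → breach
```

The critical design point from the paragraph above: run this per segment on every monitoring cycle, because aggregate accuracy metrics will not surface it.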

Compliance Triggers Requiring Re-Validation

SR 11-7 defines when re-validation is required. For AI/ML models, these triggers need to be extended beyond the original guidance’s assumptions:

Standard SR 11-7 triggers:

  • Material change to model design, methodology, or assumptions
  • Material change to the underlying training or input data
  • Performance degradation detected in ongoing monitoring
  • Changed business or regulatory environment affecting model use
  • Outcomes that deviate materially from expected results

AI/ML-specific triggers now expected in MRM practice:

  • Vendor model version update (for third-party or API-based models — including foundation model providers)
  • PSI exceeds 0.20 for any key input feature
  • AUROC/Gini drops more than your defined threshold (typically 3–5 points) from validation baseline
  • Any fairness metric breach — adverse action rate disparity or demographic parity ratio below defined tolerance
  • Significant change in model use case, user population, or deployment context
  • Material model incident, complaint spike, or adverse outcome attributable to model output

All triggers need to be documented in your monitoring policy with specific thresholds, escalation procedures, and approval authorities for triggered re-validation. Examiners want to see that triggers are defined before events occur — not reconstructed after.
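
One way to make "defined before events occur" concrete is to encode the triggers as data rather than prose, so each monitoring cycle evaluates the same rules mechanically. The names and threshold values below are illustrative placeholders, to be replaced with your policy's documented figures:

```python
# Hypothetical trigger definitions; thresholds are illustrative, not prescriptive.
TRIGGERS = {
    "psi_breach":      lambda m: m["psi_max"] > 0.20,
    "auroc_decay":     lambda m: (m["auroc_baseline"] - m["auroc_current"]) > 0.03,
    "fairness_breach": lambda m: m["demographic_parity_ratio"] < 0.80,
    "vendor_update":   lambda m: m["vendor_version"] != m["validated_version"],
}

def evaluate_triggers(metrics):
    """Return the re-validation triggers that fired for one monitoring cycle."""
    return [name for name, rule in TRIGGERS.items() if rule(metrics)]

metrics = {
    "psi_max": 0.24,                       # worst feature-level PSI this cycle
    "auroc_baseline": 0.82, "auroc_current": 0.80,
    "demographic_parity_ratio": 0.85,
    "vendor_version": "2026-03", "validated_version": "2026-03",
}
print(evaluate_triggers(metrics))  # → ['psi_breach']
```

Because the rules live in one reviewable table, the escalation path and approval authority can be attached to each trigger name, and the audit trail of fired triggers writes itself.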

Monitoring Frequency by Model Risk Tier

How frequently you monitor should reflect the risk a model carries. Annual reviews may be defensible for a low-risk internal reporting model. They aren’t defensible for an AI model making credit decisions on thousands of applications monthly.

| Risk Tier | Model Examples | Recommended Monitoring Frequency |
|---|---|---|
| Tier 1 (High) | Credit decisioning, fraud detection, customer pricing, adverse action | Monthly; continuous monitoring for high-volume deployments |
| Tier 2 (Medium) | Collections scoring, churn prediction, next-best-offer | Quarterly, with monthly spot checks on key KPIs |
| Tier 3 (Low) | Internal forecasting, reporting models, non-consequential outputs | Semi-annual or annual |

These frequencies assume stable conditions. Any trigger event from the list above requires an immediate out-of-cycle review regardless of tier.
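
The tier-plus-trigger logic above amounts to a simple scheduling rule: next review is the calendar interval for the tier, unless a trigger fired, in which case it is due immediately. A sketch (the interval values mirror the table and are assumptions to set against your own risk appetite):

```python
from datetime import date, timedelta

# Illustrative tier-to-interval mapping from the table above.
REVIEW_INTERVAL_DAYS = {"tier_1": 30, "tier_2": 90, "tier_3": 365}

def next_review(last_review: date, tier: str, trigger_fired: bool = False) -> date:
    """Trigger events force an immediate out-of-cycle review regardless of tier."""
    if trigger_fired:
        return last_review  # due now, out of cycle
    return last_review + timedelta(days=REVIEW_INTERVAL_DAYS[tier])

print(next_review(date(2026, 1, 31), "tier_2"))                      # → 2026-05-01
print(next_review(date(2026, 1, 31), "tier_1", trigger_fired=True))  # → 2026-01-31
```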

GenAI-Specific Monitoring: A Different Problem Set

The metrics above were designed for supervised learning models with numeric outputs. Generative AI models are fundamentally different — outputs are probabilistic text, “correctness” is harder to define, and a vendor can change model behavior without any internal deployment change.

As OCC Bulletin 2026-13 acknowledged by explicitly excluding GenAI from its scope, existing frameworks don’t fully map to generative AI. Institutions deploying GenAI today can’t wait for updated guidance — they need to build monitoring now.

What to track for GenAI models:

  • Output quality benchmarking: Measure against a fixed test set quarterly — same prompts, evaluate whether outputs remain within expected accuracy and quality bounds.
  • Hallucination/confabulation rate: What percentage of responses contain factually incorrect or invented content? Requires human evaluation on sampled outputs or automated fact-checking tools.
  • Tone and policy compliance drift: Does output remain within policy guardrails — no prohibited disclosures, no UDAAP risk, no fair lending exposure?
  • Refusal rate trends: A sudden spike or drop in model refusals may indicate a vendor model update that changed safety guardrails without your knowledge.
  • Vendor model version tracking: Log every time your foundation model provider releases a new version. Treat each version change as a potential re-evaluation trigger.

The vendor update problem has no equivalent in traditional MRM. When OpenAI, Anthropic, or Google releases a new model version, your GenAI application may immediately begin producing different outputs — without any internal code change. Your monitoring program needs to detect this, which requires keeping a fixed benchmark test set and running it after every detected version change.
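
A benchmark harness for this can be thin: replay a fixed prompt set, record the provider-reported model version alongside each score, and treat any version change between runs as a re-evaluation trigger. Everything in this sketch is an assumption to adapt — `call_model` is a placeholder for your vendor's SDK, and the prompts and scoring function are yours to define:

```python
import hashlib

# Illustrative fixed benchmark; in practice this is a versioned, reviewed test set.
BENCHMARK_PROMPTS = [
    "Summarize the adverse action notice requirements in two sentences.",
    "What does SR 11-7 require for ongoing monitoring documentation?",
]

def call_model(prompt: str) -> dict:
    # Placeholder: replace with your provider's API client. Assumed to return
    # the response text plus the model version string the provider reports.
    raise NotImplementedError

def run_benchmark(score_fn):
    """Replay the fixed prompt set; log score and reported version per prompt."""
    results = []
    for prompt in BENCHMARK_PROMPTS:
        reply = call_model(prompt)
        results.append({
            "prompt_id": hashlib.sha256(prompt.encode()).hexdigest()[:8],
            "model_version": reply["model_version"],
            "score": score_fn(prompt, reply["text"]),
        })
    return results

def version_changed(previous_run, current_run):
    """Re-evaluation trigger: provider-reported version differs from last run."""
    return ({r["model_version"] for r in previous_run}
            != {r["model_version"] for r in current_run})
```

The prompt IDs keep results comparable across runs even if the benchmark file is reordered; the version check catches silent vendor updates that no internal deployment change would surface.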

For the pre-deployment validation framework that precedes ongoing monitoring, see AI Model Validation Best Practices: Why Traditional Testing Breaks with Generative AI.

Building Your Monitoring Program: 90-Day Roadmap

If your monitoring program is currently a quarterly spreadsheet, here’s how to build toward something defensible.

Days 1–30: Inventory and Tier

  • Complete or update your AI/ML model inventory — see LLM Model Risk Assessment: What MRM Teams Actually Need to Test for a testing framework
  • Assign a risk tier (High/Medium/Low) to each model with documented rationale
  • Identify current monitoring state: what’s being tracked, how often, by whom, and with what process for escalation

Days 31–60: Metric Selection and Threshold Setting

  • Define KPIs for each model risk tier (PSI, AUROC, fairness ratios, GenAI output quality metrics)
  • Document thresholds for each KPI tied to your risk appetite — these need to be specific numbers, not “significant degradation”
  • Define all re-validation triggers with threshold values and escalation paths
  • Build monitoring dashboards — a well-structured Excel report is defensible if it consistently covers the right metrics

Days 61–90: Policy Documentation and Governance Integration

  • Document your monitoring policy: scope, frequency, KPIs, thresholds, escalation procedures, model owner responsibilities
  • Define accountability — who owns each model’s monitoring outputs and signs off on quarterly reviews?
  • Establish a model incident reporting process for threshold breaches
  • Present monitoring program design to model risk governance; document approval and board-level summary

The “So What” for Your Next Exam

When your examiner asks about model monitoring, the answers that close MRAs are:

  • “We have a documented monitoring policy with defined KPIs and thresholds per model risk tier.”
  • “We track PSI on all credit models monthly, with a documented threshold of 0.20 for re-validation triggering.”
  • “We have a trigger-based re-validation process — here are the trigger definitions and the records of when they’ve been activated.”

The answers that open MRAs:

  • “We review models quarterly and haven’t seen any issues.”
  • “We rely on our vendor to notify us when the model needs to be updated.”
  • “We track overall accuracy but don’t segment monitoring by protected class.”

For the governance framework that connects monitoring to your broader AI risk program, see SR 11-7 for AI Systems: Applying Legacy Model Risk Guidance to LLMs.

The AI Risk Assessment Template & Guide includes a pre-built model monitoring dashboard template with SR 11-7-aligned KPIs for credit, fraud, and classification models — a starting point if you’re building from scratch.

Frequently Asked Questions

What is model drift and why does it matter for compliance?
Model drift occurs when a model's input data distribution changes (data drift) or the relationship between inputs and outputs shifts (concept drift), causing performance to degrade. For compliance, undetected drift can introduce disparate impact, bias, or unsafe outputs — creating regulatory exposure under SR 11-7 and fair lending laws without any internal alert firing.
What PSI threshold triggers re-validation under SR 11-7?
Industry practice uses three tiers: PSI < 0.10 = stable (no action required), PSI 0.10–0.20 = moderate shift warranting segment-level review, PSI > 0.20 = significant shift requiring re-validation or model replacement. These thresholds should be documented in your monitoring policy and tied to your risk appetite statement.
How often should AI models be monitored compared to traditional models?
Traditional statistical models can tolerate quarterly monitoring in stable conditions. AI/ML models making credit, fraud, or pricing decisions should be monitored monthly at minimum — continuous for high-volume deployments. GenAI models can change behavior with each vendor update, requiring event-triggered re-evaluation rather than calendar-based review alone.
Does OCC Bulletin 2026-13 apply to generative AI models?
No. OCC Bulletin 2026-13 explicitly excludes generative AI and agentic AI from its scope, stating these models are 'novel and rapidly evolving.' The OCC has indicated it will issue separate guidance for GenAI. Until then, institutions are expected to apply SR 11-7 principles while developing their own GenAI monitoring frameworks.
What are the mandatory re-validation triggers under SR 11-7?
SR 11-7 identifies: material changes to model design or assumptions; material changes to underlying data; significant performance degradation; changes in the regulatory or business environment; and outcomes that deviate materially from expectations. For AI models, vendor model updates and PSI breaches are additional triggers the original guidance doesn't address but examiners expect.
What documentation do examiners expect for a model monitoring program?
A written monitoring policy with: in-scope models, monitoring frequency by risk tier, specific KPIs and thresholds, escalation procedures, governance reporting structure, and monitoring output records. For AI models, examiners also expect evidence that bias and fairness metrics are monitored continuously — not just assessed at pre-deployment.
Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
