Continuous Monitoring for AI Models: Drift, Degradation, and Compliance Triggers
TL;DR:
- SR 11-7 treats ongoing monitoring as a co-equal pillar with initial validation — not a quarterly checkbox
- AI/ML models drift faster than traditional statistical models; PSI > 0.20 triggers re-validation under industry-standard practice
- Different model types require different monitoring KPIs: credit models need PSI + accuracy decay; GenAI needs output quality + hallucination rate
- OCC Bulletin 2026-13 explicitly excludes GenAI from current guidance — institutions must build their own GenAI monitoring frameworks while separate guidance develops
Your model passed initial validation. PSI checked out, backtests closed clean, SR 11-7 documentation was signed off. Six months later, a demographic shift in your application population has been silently degrading your credit model’s fairness metrics. Nobody noticed because the monitoring report only tracked aggregate accuracy, not segment-level performance. Your examiner is now asking why ongoing monitoring failed to catch it.
This is the ongoing monitoring problem that trips up model risk teams repeatedly. Initial validation gets the dedicated timeline, the resources, the independent review. Ongoing monitoring gets a quarterly spreadsheet and whoever can spare an afternoon.
That imbalance is exactly what regulators are correcting — and the 2026 updated guidance makes it harder to coast.
Why Ongoing Monitoring Is Not Optional
SR 11-7 (Federal Reserve, April 2011) establishes ongoing monitoring as one of three co-equal pillars alongside model development and validation. Section IV states explicitly that model risk does not diminish post-deployment — monitoring is expected to confirm that models continue performing as intended and to detect early warning signs before they become exam findings.
In practice, MRAs citing model monitoring gaps are among the most consistent findings in model risk examinations. Institutions that treat monitoring as documentation rather than detection routinely discover this the hard way.
In 2026, OCC Bulletin 2026-13 updated interagency model risk management guidance to reinforce monitoring and validation as core risk management obligations. The bulletin also explicitly excludes generative AI and agentic AI from current guidance scope — acknowledging that existing frameworks were not designed for AI-specific monitoring needs, while signaling that dedicated GenAI guidance is coming.
The takeaway for practitioners: if your monitoring program was built for logistic regression in 2015, it isn’t built for your current model portfolio.
The Four Types of Model Drift You Need to Detect
Not all drift is the same, and a single monitoring metric won’t catch all four.
1. Data Drift (Covariate Shift)
The distribution of input features changes from what the model was trained on. The model logic remains intact — but the population being scored now looks different from the training population.
Classic example: A credit scoring model trained on 2019–2021 application data encounters applicants with pandemic-era employment volatility, gig-economy income, and post-inflation expense profiles. The model’s scores reflect assumptions that no longer hold for a material portion of applications.
Primary metric: Population Stability Index (PSI)
| PSI Value | Interpretation | Required Action |
|---|---|---|
| < 0.10 | Input distributions stable | Continue standard monitoring |
| 0.10 – 0.20 | Moderate shift detected | Investigate; segment-level performance review |
| > 0.20 | Significant shift | Trigger re-validation or model replacement |
These thresholds are well-established in financial services model monitoring practice and examiners are familiar with them. If your monitoring policy doesn’t define PSI thresholds, that’s a gap that will come up.
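The PSI calculation itself is standard: bin the baseline (training-time) distribution of a feature, compare the share of current observations falling in each bin, and sum the weighted log-ratio differences. Below is a minimal sketch using decile bins taken from the baseline; the function name, bin count, and epsilon floor are illustrative choices, not prescribed by any guidance.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (training) sample and a current (scoring)
    sample of one input feature. Bin edges come from the baseline, so
    the comparison is anchored to what the model was trained on."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))

    # Clip both samples into the baseline range so every value lands in a bin
    expected_counts, _ = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)
    actual_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)

    expected_pct = expected_counts / len(expected)
    actual_pct = actual_counts / len(actual)

    # Floor at a small epsilon to avoid log(0) and division by zero
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
```

Run this per key input feature each monitoring cycle and compare the result against the 0.10/0.20 thresholds in the table above.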
2. Concept Drift
The statistical relationship between inputs and the target outcome changes — even when the inputs themselves haven’t drifted materially. The world changed; the model didn’t.
Example: A fraud detection model trained on 2022 wire transfer patterns doesn’t recognize 2025-era AI-assisted social engineering attacks. Transaction amount, geography, and timing haven’t shifted dramatically — but fraudulent transactions now resemble legitimate ones in ways the model can’t detect.
Primary metrics: False positive rate trend, false negative rate trend, precision-recall curve decay over rolling 90-day windows.
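A rolling error-rate trend is straightforward to compute once predictions and realized outcomes are stored in time order. This sketch (function name and window handling are illustrative) computes the false positive rate over consecutive windows; a rising series is a concept-drift signal even when aggregate accuracy looks stable.

```python
import numpy as np

def fpr_by_window(y_true, y_pred, window_size):
    """False positive rate in consecutive, equally sized windows of
    time-ordered labels (y_true) and binary predictions (y_pred)."""
    rates = []
    for start in range(0, len(y_true) - window_size + 1, window_size):
        t = np.asarray(y_true[start:start + window_size])
        p = np.asarray(y_pred[start:start + window_size])
        negatives = (t == 0).sum()
        false_pos = ((t == 0) & (p == 1)).sum()
        # NaN when a window has no negatives to misclassify
        rates.append(false_pos / negatives if negatives else float("nan"))
    return rates
```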
3. Performance Degradation
Overall model accuracy, discrimination, or output quality declines measurably. This is the most visible type of drift but often the last one detected — by the time AUROC drops 5 points, the model has been operating with impaired performance for months.
Primary metrics: AUROC, Gini coefficient, KS statistic (classification models); RMSE and MAE trend (regression models).
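AUROC decay is easiest to monitor as a delta against the validation baseline. As a sketch (the 3-point default threshold here is illustrative; use the value in your policy), AUROC can be computed directly from its rank-sum definition, and Gini follows as 2 × AUROC − 1:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney identity: the probability that a
    randomly chosen positive outscores a randomly chosen negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()  # pairwise comparisons
    ties = (pos[:, None] == neg[None, :]).sum()    # ties count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def degradation_alert(baseline_auroc, current_auroc, threshold=0.03):
    """Flag when AUROC has fallen more than `threshold` (e.g. 3 points)
    below the validation baseline."""
    return (baseline_auroc - current_auroc) > threshold
```

The pairwise approach is O(n²) and fine for sampled monitoring sets; production pipelines would typically use a library implementation instead.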
4. Fairness Metric Drift
Disparate impact or disparate treatment patterns emerge or worsen in a protected class segment — while overall model performance remains stable. This is the drift type regulators are most concerned about and the one most teams are least equipped to catch. A model can maintain its aggregate Gini coefficient while quietly failing a demographic segment.
Primary metrics: Demographic parity ratio, equalized odds (true positive rate parity by group), adverse action rate by protected class. The 80% rule — used in employment discrimination law and increasingly applied to AI credit models — flags when the selection rate for a protected group falls below 80% of the rate for the group with the highest selection rate.
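The four-fifths (80%) rule reduces to a few lines once segment-level selection rates are available. This sketch (function name and input shapes are illustrative) returns each group's selection rate as a ratio to the highest group's rate; any ratio below 0.80 warrants investigation:

```python
import numpy as np

def adverse_impact_ratios(groups, selected):
    """Selection rate per group divided by the highest group's rate.
    `groups` holds a group label per applicant; `selected` holds 1 for
    approved, 0 for declined. A ratio below 0.80 fails the 80% rule."""
    groups = np.asarray(groups)
    selected = np.asarray(selected)
    rates = {g: selected[groups == g].mean() for g in np.unique(groups)}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}
```

Note this is exactly the segment-level view the opening scenario lacked: run it per protected class alongside, not instead of, aggregate accuracy.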
Compliance Triggers Requiring Re-Validation
SR 11-7 defines the events that require re-validation. For AI/ML models, those triggers need to be extended beyond the original guidance's assumptions:
Standard SR 11-7 triggers:
- Material change to model design, methodology, or assumptions
- Material change to the underlying training or input data
- Performance degradation detected in ongoing monitoring
- Changed business or regulatory environment affecting model use
- Outcomes that deviate materially from expected results
AI/ML-specific triggers now expected in MRM practice:
- Vendor model version update (for third-party or API-based models — including foundation model providers)
- PSI exceeds 0.20 for any key input feature
- AUROC/Gini drops more than your defined threshold (typically 3–5 points) from validation baseline
- Any fairness metric breach — adverse action rate disparity or demographic parity ratio below defined tolerance
- Significant change in model use case, user population, or deployment context
- Material model incident, complaint spike, or adverse outcome attributable to model output
All triggers need to be documented in your monitoring policy with specific thresholds, escalation procedures, and approval authorities for triggered re-validation. Examiners want to see that triggers are defined before events occur — not reconstructed after.
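Codifying the trigger register makes "defined before events occur" concrete and auditable. A minimal sketch, with entirely illustrative names and thresholds (your own values must come from your approved monitoring policy):

```python
# Illustrative trigger register; thresholds shown here are examples,
# not policy. Each entry: threshold and breach direction.
REVALIDATION_TRIGGERS = {
    "psi_max":      {"threshold": 0.20, "direction": "above"},  # worst-feature PSI
    "auroc_drop":   {"threshold": 0.03, "direction": "above"},  # decay vs baseline
    "dp_ratio_min": {"threshold": 0.80, "direction": "below"},  # 80% rule
}

def fired_triggers(metrics):
    """Return names of triggers breached by current metrics.
    `metrics` maps trigger name -> observed value."""
    fired = []
    for name, rule in REVALIDATION_TRIGGERS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not computed this cycle
        if rule["direction"] == "above" and value > rule["threshold"]:
            fired.append(name)
        elif rule["direction"] == "below" and value < rule["threshold"]:
            fired.append(name)
    return fired
```

A non-empty return would then route to the escalation and approval workflow defined in the policy.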
Monitoring Frequency by Model Risk Tier
How frequently you monitor should reflect the risk a model carries. Annual reviews may be defensible for a low-risk internal reporting model. They aren’t defensible for an AI model making credit decisions on thousands of applications monthly.
| Risk Tier | Model Examples | Recommended Monitoring Frequency |
|---|---|---|
| Tier 1 (High) | Credit decisioning, fraud detection, customer pricing, adverse action | Monthly; continuous monitoring for high-volume deployments |
| Tier 2 (Medium) | Collections scoring, churn prediction, next-best-offer | Quarterly, with monthly spot checks on key KPIs |
| Tier 3 (Low) | Internal forecasting, reporting models, non-consequential outputs | Semi-annual or annual |
These frequencies assume stable conditions. Any trigger event from the list above requires an immediate out-of-cycle review regardless of tier.
GenAI-Specific Monitoring: A Different Problem Set
The metrics above were designed for supervised learning models with numeric outputs. Generative AI models are fundamentally different — outputs are probabilistic text, “correctness” is harder to define, and a vendor can change model behavior without any internal deployment change.
As OCC Bulletin 2026-13 acknowledged by explicitly excluding GenAI from its scope, existing frameworks don’t fully map to generative AI. Institutions deploying GenAI today can’t wait for updated guidance — they need to build monitoring now.
What to track for GenAI models:
- Output quality benchmarking: Measure against a fixed test set quarterly — same prompts, evaluate whether outputs remain within expected accuracy and quality bounds.
- Hallucination/confabulation rate: What percentage of responses contain factually incorrect or invented content? Requires human evaluation on sampled outputs or automated fact-checking tools.
- Tone and policy compliance drift: Does output remain within policy guardrails — no prohibited disclosures, no UDAAP risk, no fair lending exposure?
- Refusal rate trends: A sudden spike or drop in model refusals may indicate a vendor model update that changed safety guardrails without your knowledge.
- Vendor model version tracking: Log every time your foundation model provider releases a new version. Treat each version change as a potential re-evaluation trigger.
The vendor update problem has no equivalent in traditional MRM. When OpenAI, Anthropic, or Google releases a new model version, your GenAI application may immediately begin producing different outputs — without any internal code change. Your monitoring program needs to detect this, which requires keeping a fixed benchmark test set and running it after every detected version change.
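One low-cost way to detect silent output shifts is to fingerprint the benchmark run. The sketch below is an assumption-laden illustration: `generate` stands in for whatever function calls your vendor's API, and exact-match hashing only makes sense with deterministic settings (e.g. temperature 0); for sampled outputs you would compare graded quality scores instead.

```python
import hashlib

def benchmark_fingerprint(generate, prompts):
    """Run a fixed prompt set through the model and hash the outputs.
    A changed fingerprint after a vendor release means outputs shifted,
    so the benchmark set needs human or automated quality review.
    `generate(prompt) -> str` is a hypothetical wrapper for your API."""
    h = hashlib.sha256()
    for prompt in prompts:
        h.update(generate(prompt).encode("utf-8"))
    return h.hexdigest()
```

Store the fingerprint with the vendor model version; re-run on every detected version change and diff against the stored value.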
For the pre-deployment validation framework that precedes ongoing monitoring, see AI Model Validation Best Practices: Why Traditional Testing Breaks with Generative AI.
Building Your Monitoring Program: 90-Day Roadmap
If your monitoring program is currently a quarterly spreadsheet, here’s how to build toward something defensible.
Days 1–30: Inventory and Tier
- Complete or update your AI/ML model inventory — see LLM Model Risk Assessment: What MRM Teams Actually Need to Test for a testing framework
- Assign a risk tier (High/Medium/Low) to each model with documented rationale
- Identify current monitoring state: what’s being tracked, how often, by whom, and with what process for escalation
Days 31–60: Metric Selection and Threshold Setting
- Define KPIs for each model risk tier (PSI, AUROC, fairness ratios, GenAI output quality metrics)
- Document thresholds for each KPI tied to your risk appetite — these need to be specific numbers, not “significant degradation”
- Define all re-validation triggers with threshold values and escalation paths
- Build monitoring dashboards — a well-structured Excel report is defensible if it consistently covers the right metrics
Days 61–90: Policy Documentation and Governance Integration
- Document your monitoring policy: scope, frequency, KPIs, thresholds, escalation procedures, model owner responsibilities
- Define accountability — who owns each model’s monitoring outputs and signs off on quarterly reviews?
- Establish a model incident reporting process for threshold breaches
- Present monitoring program design to model risk governance; document approval and board-level summary
The “So What” for Your Next Exam
When your examiner asks about model monitoring, the answers that close MRAs are:
- “We have a documented monitoring policy with defined KPIs and thresholds per model risk tier.”
- “We track PSI on all credit models monthly, with a documented threshold of 0.20 for re-validation triggering.”
- “We have a trigger-based re-validation process — here are the trigger definitions and the records of when they’ve been activated.”
The answers that open MRAs:
- “We review models quarterly and haven’t seen any issues.”
- “We rely on our vendor to notify us when the model needs to be updated.”
- “We track overall accuracy but don’t segment monitoring by protected class.”
For the governance framework that connects monitoring to your broader AI risk program, see SR 11-7 for AI Systems: Applying Legacy Model Risk Guidance to LLMs.
The AI Risk Assessment Template & Guide includes a pre-built model monitoring dashboard template with SR 11-7-aligned KPIs for credit, fraud, and classification models — a starting point if you’re building from scratch.
Related Template
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.