AI Risk

LLM Model Risk Assessment: What MRM Teams Actually Need to Test

April 10, 2026 Rebecca Leung
Table of Contents

Here’s the conversation happening in MRM teams right now: “The business unit wants to deploy an LLM for contract review. It needs to go through model risk.” “OK. How do we validate it?”

Silence.

Then someone suggests running the SR 11-7 checklist and hoping for the best. The validation report goes out — full of language about “conceptual soundness” and “outcome analysis” — and neither the validator nor the examiner is entirely sure whether it captures what actually matters.

The framework is real. The intent is right. But the implementation is theater.

Here’s what model risk management teams actually need to test when evaluating large language models, drawn from regulatory guidance, NIST frameworks, and the practical reality of how examiners are approaching these reviews in 2025 and 2026.


TL;DR

  • SR 11-7 applies to LLMs, but its traditional testing assumptions — deterministic outputs, annual validation — don’t hold for generative AI
  • MRM teams need to run at least six distinct test types that traditional model validation doesn’t require: hallucination evaluation, prompt variance testing, red-teaming, bias/fairness, data lineage review, and continuous drift monitoring
  • The GAO found in May 2025 that examiners apply SR 11-7 inconsistently to AI tools, creating both compliance uncertainty and under-validation risk
  • Documentation gaps are the top examiner finding — MRM teams need pre-deployment test results, risk tiering rationale, and ongoing monitoring metrics in the file before go-live

Why Your SR 11-7 Checklist Doesn’t Work for LLMs

SR 11-7 — the 2011 Federal Reserve and OCC supervisory guidance on model risk management — was built around three assumptions that no longer hold for generative AI.

Determinism. Traditional models produce the same output from the same input. LLMs are stochastic. Run the same prompt 50 times, get 50 slightly different answers. This breaks the validation assumption that pre-deployment testing can exhaustively characterize a model’s behavior.

Bounded scope. A credit scorecard does one thing. An LLM connected to a customer service platform can be prompted to do hundreds of things, many of which the deployment team never anticipated. The attack surface is orders of magnitude larger.

Explainability. SR 11-7’s “effective challenge” requirement assumes validators can examine how a model reaches conclusions. The internal mechanics of transformer models are not auditable the same way as a regression equation. Effective challenge for an LLM has to mean something different.

The GAO’s May 2025 report on AI in financial services (GAO-25-107197) found that existing supervisory guidance — including SR 11-7 — can apply to AI models, but institutions bear the burden of figuring out how. It also found that examiners apply this guidance inconsistently, with some banks opting to disable AI features from vendor products rather than navigate an unpredictable MRM review process.

The OCC Bulletin 2025-26 added useful clarity: institutions should tailor MRM practices to the nature and complexity of their models. For LLMs, that means the standard playbook needs significant adaptation — not abandonment. For a deeper look at how each SR 11-7 pillar must evolve, see our post on SR 11-7 in the Age of AI.

The 6 Tests MRM Teams Actually Need to Run

1. Conceptual Soundness Review (Adapted for LLMs)

Traditional conceptual soundness asks: is the model mathematically appropriate for the use case? For an LLM, that question shifts to: is this model type appropriate for the risk profile of this use case?

An LLM generating marketing copy for a checking account is materially different from an LLM auto-generating adverse action notices for denied loan applications. The MRM team’s conceptual soundness review should document:

  • What the model is being used for — not just what it can do
  • Why a generative model is appropriate for that use case versus a deterministic alternative
  • What happens when the model output is wrong — who reviews it, what controls exist
  • Whether the model’s training data and any fine-tuning is appropriate for the regulatory context
  • Human-in-the-loop controls: under what circumstances does a human review, override, or catch model output before it affects a customer

2. Hallucination and Factual Accuracy Testing

Hallucination — generating plausible-sounding but incorrect information — is the defining reliability risk for LLMs. For financial services, hallucinated facts in a regulatory summary, client-facing response, or compliance analysis can cause real harm.

MRM teams should define a hallucination rate threshold before deployment and test against it. This means:

  • Building a ground-truth dataset of questions with verifiable, domain-specific answers
  • Running the model repeatedly across that dataset to account for stochasticity
  • Testing specifically for the domain where the model will operate: regulatory terminology, product features, compliance procedures
  • Documenting the acceptable error rate and the monitoring threshold that triggers re-validation

There’s no regulatory standard yet for “acceptable” hallucination rates — which means MRM teams have to define and defend their own threshold. That threshold, and the test results against it, should be in the model file before go-live.

3. Prompt Variance and Semantic Consistency Testing

Because LLMs are non-deterministic, validation needs to test not just what the model says but how consistently it says it across semantically equivalent inputs.

Prompt variance testing means:

  • Asking the same substantive question multiple ways and checking for consistent answers
  • Testing how small phrasing differences affect outputs
  • Identifying whether the model gives materially different treatment to equivalent inputs from different demographic contexts — directly relevant to fair lending

As GARP noted in February 2026, this represents a shift from periodic validation to behavioral analysis — continuous evidence of behavior rather than a point-in-time snapshot. Prompt-variance tests and semantic consistency checks are the practical implementation of that shift.

4. Red-Teaming and Adversarial Testing

Red-teaming means deliberately trying to make the model fail. For regulated institutions deploying LLMs in customer-facing or decisioning contexts, this is becoming a mandatory step — NIST AI RMF and NIST AI 600-1 (the Generative AI Profile) both reference adversarial testing as a core risk management practice.

A financial services red-team exercise for an LLM should test:

  • Prompt injection: Can a user override the model’s system prompt through clever input formatting?
  • Jailbreaking: Can the model be coaxed into generating content that violates institutional policy?
  • Data extraction: Can the model be prompted to surface sensitive information from its training data or context window?
  • Regulatory non-compliance: Can the model generate content that violates UDAAP, ECOA, or fair lending requirements under adversarial prompting?
  • Hallucination under pressure: Does accuracy degrade when users ask leading or misleading questions?

Red-team findings should be documented in the model file, along with the mitigations applied. An unmitigated red-team finding — even a low-severity one — left undocumented is an MRA waiting to happen.

5. Fairness and Bias Testing

If an LLM is used in any context where its outputs could affect customers differently by protected class — credit communications, benefit eligibility, customer service response quality — the MRM team needs to run fairness testing.

This is not abstract ethics. The CFPB has been explicit that UDAAP and ECOA apply to AI-generated outputs. An LLM that gives meaningfully different quality responses to customers based on input features correlated with race or gender is a fair lending problem.

Fairness testing for LLMs means:

  • Building demographic-varied prompt sets to test for differential output quality
  • Checking adverse action and denial language for consistency across customer profiles
  • Running disparate impact analysis on outputs where the model influences customer outcomes

Document the test methodology, the demographic prompt variants used, and the outcome comparison. Examiners increasingly expect this to be in the pre-deployment package for any LLM that touches customer-facing workflows.

For technical detail on fairness evaluation methods, see our post on AI model validation testing techniques.

6. Continuous Drift Monitoring (Not Annual Validation)

Traditional models get validated before deployment and reviewed annually. LLMs degrade continuously. Model providers push updates. Prompts evolve. User behavior changes how the model gets used in ways developers never anticipated.

MRM teams need to define ongoing monitoring metrics and thresholds before deployment:

  • Output quality metrics: accuracy against ground-truth set, hallucination rate, semantic consistency score
  • Usage pattern monitoring: detecting prompt patterns that fall outside the model’s validated operational scope
  • API and model version tracking: logging when the underlying model is updated by the provider and triggering re-evaluation
  • Human escalation rate: if human reviewers are overriding model outputs at an increasing rate, that’s a signal of degradation

Define re-validation triggers explicitly. A major provider version update, degradation below your pre-defined quality threshold, or deployment into a new use case should all trigger a formal re-validation — not just a note in the monitoring log.

What Examiners Are Actually Looking For

Based on the OCC’s 2025 model risk clarification and the GAO’s findings on regulatory examination of AI, the most common MRM documentation gaps for LLMs look like this:

GapWhat Examiners Flag
No model inventory entryLLM in production with no formal classification as a model
Weak risk tieringHigh-stakes LLM classified as low-risk with no documented rationale
Missing pre-deployment testingValidation report with no hallucination rate, no red-team results
No monitoring planLLM deployed with no defined quality metrics or re-validation triggers
Inadequate limitation documentationFile doesn’t describe what the model does poorly or when it shouldn’t be relied on
Effective challenge gapValidation conducted by deployment team, no independent review

The SR 11-7 guidance requires that model limitations be clearly identified and that users understand those limitations. For LLMs, this means explicit documentation in the model file, in user-facing guidance, and in escalation procedures that define when model output should not be used without human review.

For the documentation structure examiners expect, see our post on AI model documentation requirements for examiners.

The Effective Challenge Problem

SR 11-7 requires that models be subject to “effective challenge” — independent review by someone with the expertise to evaluate the model’s suitability and limitations. For a logistic regression model, effective challenge means reviewing coefficients, checking variable selection, testing out-of-sample performance.

For an LLM, effective challenge is a harder problem. The model’s internal mechanics aren’t auditable in the traditional sense. The validation team may not have generative AI expertise. And the model’s behavior in production may differ from its behavior in a controlled validation environment.

In practice, effective challenge for LLMs means:

  • The validation team executes the six tests above and evaluates results against pre-defined thresholds
  • The validator is independent from the model deployment team
  • The challenge is documented with evidence — not just a signature on a validation report that says “model approved”

Some institutions are solving this by creating an AI model validation center of excellence with dedicated LLM expertise, separate from traditional MRM functions. Others are engaging third-party validators for LLM-specific review alongside internal teams. Either approach is defensible — but the examination record needs to show that effective challenge actually happened, with documented test results to back it up.

So What?

MRM teams that apply traditional validation playbooks to LLMs are building documentation gaps that examiners are increasingly equipped to find. The conceptual framework from SR 11-7 is still sound — model validation, effective challenge, ongoing monitoring, clear documentation of limitations. The methods needed to satisfy those requirements are completely different for generative AI.

The six-test framework above gives MRM teams a working checklist. But the most important step is defining thresholds before deployment: what hallucination rate is acceptable for this use case? What red-team finding is a go/no-go? What triggers re-validation?

If those thresholds aren’t in the model file before the model goes live, the model shouldn’t go live.

The AI Risk Assessment Template & Guide includes a pre-deployment LLM validation checklist, model inventory template, risk tiering criteria, and third-party AI vendor due diligence questionnaire — built for MRM teams that need to demonstrate SR 11-7 alignment without constructing the framework from scratch.

Frequently Asked Questions

Does SR 11-7 actually apply to LLMs?
Yes. The Federal Reserve and OCC confirmed that SR 11-7 applies to AI/ML models, including LLMs. The challenge is that SR 11-7 was written for traditional statistical models, so MRM teams must interpret how concepts like 'effective challenge,' 'conceptual soundness,' and 'ongoing monitoring' map onto non-deterministic generative models. OCC Bulletin 2025-26 explicitly noted that institutions should tailor MRM practices to the nature and complexity of their models.
What's different about validating an LLM vs. a traditional credit scoring model?
Traditional models are deterministic: same input produces the same output. LLMs are stochastic—the same prompt can produce different outputs each time. This breaks the standard validation assumption that you can characterize a model's behavior exhaustively through pre-deployment testing. You also can't easily reconstruct how an LLM reached a conclusion, which complicates conceptual soundness review and creates fair lending documentation challenges.
What is red-teaming and does my MRM team need to do it?
Red-teaming for AI means deliberately trying to break the model—testing for harmful outputs, prompt injection, jailbreaks, and policy violations. For regulated financial institutions deploying LLMs in customer-facing or decisioning contexts, red-teaming is rapidly becoming a de facto regulatory expectation. NIST AI RMF and NIST AI 600-1 (the Generative AI Profile) both reference adversarial testing as a core risk management practice.
How often do LLMs need to be re-validated?
Unlike traditional models validated annually, LLMs need continuous monitoring because their effective behavior can shift with API version updates, prompt changes, or changes in the underlying training data. MRM teams should define triggers for formal re-validation before deployment: major provider version updates, degradation below quality thresholds, or deployment into a new use case not covered by original validation scope.
What documentation does an examiner expect for an LLM in production?
At minimum: a model inventory entry covering purpose, risk tier, and use context; pre-deployment testing results including hallucination rate, bias testing, and red-team findings; ongoing monitoring metrics and thresholds; documented limitations and escalation procedures; and a clear description of human-in-the-loop controls. LLMs used in high-risk contexts—credit decisioning, compliance automation—should carry documentation equivalent to a Tier 1 model under SR 11-7.
Rebecca Leung

Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

Related Framework

AI Risk Assessment Template & Guide

Comprehensive AI model governance and risk assessment templates for financial services teams.

Immaterial Findings ✉️

Weekly newsletter

Sharp risk & compliance insights practitioners actually read. Enforcement actions, regulatory shifts, and practical frameworks — no fluff, no filler.

Join practitioners from banks, fintechs, and asset managers. Delivered weekly.