AI Model Documentation: What Examiners Actually Want to See in 2026
TL;DR:
- SR 11-7 and OCC Bulletin 2011-12 require documentation “detailed enough that parties unfamiliar with the model can understand its operation” — and examiners are now applying that standard to AI/ML models.
- Traditional model documentation templates miss AI-specific elements: training data provenance, hyperparameter decisions, explainability approaches, drift thresholds, and known failure modes.
- Build an AI model card for every model in your inventory — we break down each section below with what examiners expect.
Your Model Documentation Is Probably Getting You MRAs
Here’s what happens in most AI-related examinations right now: the examiner asks for the model documentation. Your team hands over the standard model development document — the one that worked fine for your logistic regression models. The examiner flips through it, asks where the training data provenance is documented, how you validated for bias, what your drift thresholds are, and what happens when the model hallucinates. Blank stares.
According to McKinsey’s 2025 State of AI survey, 51% of organizations using AI reported at least one negative AI-related consequence in the past year — with inaccuracy as the leading issue. Yet only 28% have CEO-level governance oversight, and just 17% report board-level AI governance. The documentation gap is a symptom of a broader governance problem, and examiners know it.
The regulatory expectations haven’t changed in principle — SR 11-7 (Federal Reserve, April 2011) and OCC Bulletin 2011-12 still form the backbone. But the application to AI/ML models requires fundamentally different documentation practices. Here’s what examiners actually want to see.
What SR 11-7 and OCC 2011-12 Actually Require
Both guidance documents establish the same core principle: model documentation must be “sufficiently detailed that parties unfamiliar with a model could understand how the model operates, its limitations, and its key assumptions.” That standard applies to every model — including the ones powered by neural networks and large language models.
The guidance organizes documentation requirements around three pillars:
| Pillar | SR 11-7 Expectation | AI/ML Translation |
|---|---|---|
| Model Development | Document theoretical basis, methodology choices, data sources, variable selection, testing results | Training data provenance, architecture decisions, hyperparameter choices, feature engineering, benchmark evaluations |
| Model Validation | Independent review of soundness, including developmental evidence, process verification, and outcomes analysis | Adversarial testing results, bias evaluations, robustness checks, validation techniques specific to ML/LLM models |
| Ongoing Monitoring | Performance tracking, comparison of outcomes to expectations, stability assessment | Drift detection thresholds, performance decay metrics, retraining triggers, output consistency monitoring |
The problem isn’t that the guidance is silent on what to document — it’s that most MRM teams are using documentation templates built for spreadsheet-based models in 2012 and haven’t updated them for models where “variable selection” means “we fine-tuned a transformer on 500GB of text data.”
The Model Card: Your AI Documentation Foundation
The concept of a “model card” was introduced by Mitchell et al. in 2019 at Google and has since become an industry standard adopted by organizations including Hugging Face and Google DeepMind. Think of it as a nutrition label for machine learning models — a standardized format that makes model details scannable and comparable.
For regulated financial services firms, the model card isn’t a replacement for your full model development document. It’s the executive summary that sits on top and gives examiners — and your own risk committee — a rapid understanding of what the model does, how it was built, and where the risks live.
A well-constructed model card for a regulated AI system should include:
- Model Details: Name, version, type (classification, regression, generative), owner, intended purpose
- Intended Use: Approved use cases, out-of-scope uses, known limitations
- Training Data: Source, size, date range, representativeness, any known biases
- Performance Metrics: Accuracy, precision, recall, F1, AUC — appropriate to the model type
- Fairness Evaluations: Disparate impact testing results across protected classes
- Ethical Considerations: Known failure modes, potential for harm, human oversight requirements
- Caveats and Recommendations: What users should and shouldn’t rely on
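To make the model card machine-readable and diffable across versions, some teams keep it as structured data rather than free prose. Here is a minimal sketch of that idea as a Python dataclass; every field name and example value below is hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ModelCard:
    """Minimal, illustrative model card mirroring the sections above."""
    name: str
    version: str
    model_type: str            # classification, regression, generative
    owner: str                 # a named person, not a team
    intended_use: list         # approved use cases
    out_of_scope: list         # explicitly disallowed uses
    training_data: dict        # source, size, date range, known biases
    performance: dict          # metric name -> value
    fairness: dict             # subgroup -> evaluation result
    limitations: list          # known failure modes and caveats

# Hypothetical example entry
card = ModelCard(
    name="credit-risk-pd",
    version="2.3.1",
    model_type="classification",
    owner="Jane Doe, Lead Data Scientist",
    intended_use=["retail credit underwriting"],
    out_of_scope=["commercial lending", "pricing decisions"],
    training_data={"source": "internal loan origination system",
                   "rows": 1_200_000, "window": "2018-01 to 2024-12"},
    performance={"auc_roc": 0.81, "ks": 0.42},
    fairness={"age_62_plus_disparate_impact": 0.93},
    limitations=["performance degrades on thin-file applicants"],
)
print(json.dumps(asdict(card), indent=2))
```

Storing the card this way lets you validate completeness automatically and track changes in version control alongside the model code.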
Section-by-Section: What Goes in AI Model Documentation
Below is the expanded documentation template that maps SR 11-7’s requirements to AI/ML-specific content. This is what examiners are looking for in 2026.
Section 1: Model Overview and Business Context
What examiners expect: A clear statement of what the model does, why it was built, and what business decisions depend on its output.
What to document:
- Model name, version number, and unique identifier (ties to your model inventory)
- Business problem being solved and the decision the model informs
- Model type and architecture (logistic regression, random forest, neural network, LLM, etc.)
- Model tier/risk classification (per your risk tiering methodology)
- Model owner (name and role — not a team, a person)
- Date deployed, last validated, next validation due
Section 2: Training Data Provenance
What examiners expect: Full documentation of where the training data came from, how it was selected, and whether it’s representative of the population the model serves.
This is where most AI documentation falls short. Traditional model docs describe data sources in a paragraph. For AI/ML models, examiners want:
| Element | What to Document |
|---|---|
| Data sources | Every source, including vendor-provided data, internal databases, public datasets, and synthetic data |
| Date ranges | Training data time window and whether it covers relevant economic cycles |
| Volume | Number of records, features, and target distribution |
| Representativeness | Demographic breakdown compared to the target population |
| Labeling methodology | Who labeled the data, what instructions they followed, inter-annotator agreement rates |
| Data cleaning | Outlier detection methods, missing value treatment, transformation logic |
| Known gaps | What the data doesn’t cover — and what that means for model performance |
For LLMs and generative AI models, also document: the source corpus (or vendor’s disclosed training approach), any fine-tuning data, retrieval-augmented generation (RAG) knowledge bases, and the data governance controls applied to each.
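One practical way to keep provenance documentation from going stale is a completeness check that flags missing elements from the table above before a model can advance through the lifecycle. A minimal sketch, assuming a simple dict-based provenance record (the field names mirror the table; the example record is hypothetical):

```python
# Required elements, taken from the provenance table above
REQUIRED_PROVENANCE_FIELDS = {
    "data_sources", "date_range", "volume", "representativeness",
    "labeling_methodology", "data_cleaning", "known_gaps",
}

def provenance_gaps(record: dict) -> set:
    """Return which required provenance elements are missing or empty."""
    return {f for f in REQUIRED_PROVENANCE_FIELDS if not record.get(f)}

# Hypothetical, deliberately incomplete record
record = {
    "data_sources": ["internal loan origination system", "vendor bureau feed"],
    "date_range": "2018-01 to 2024-12",
    "volume": {"rows": 1_200_000, "features": 240},
}
print(sorted(provenance_gaps(record)))
```

A check like this can gate deployment pipelines: if `provenance_gaps` is non-empty, the documentation package is not ready for validation.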
Section 3: Architecture and Design Decisions
What examiners expect: Not a textbook explanation of how neural networks work — a documented rationale for why you chose this architecture and the tradeoffs you considered.
What to document:
- Model architecture choice and alternatives considered (with rationale for selection)
- Key hyperparameters and how they were tuned (grid search, Bayesian optimization, manual selection)
- Feature engineering decisions — which features were created, transformed, or excluded and why
- For LLMs: prompt engineering approach, system prompts, temperature and parameter settings, guardrails configuration
- For ensemble models: component model descriptions and combination methodology
- Computational resources required for training and inference
- Third-party components used (pre-trained models, open-source libraries, vendor APIs) — with version numbers
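Hyperparameter tuning decisions are easiest to document when the tuning run itself emits the record. A sketch of that pattern using scikit-learn's grid search on synthetic data; the model, search space, and record fields are illustrative assumptions, not a prescribed format:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for real training data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Document the search space itself, not just the winning configuration
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="roc_auc")
search.fit(X, y)

tuning_record = {
    "method": "grid search, 5-fold CV, AUC-ROC objective",
    "search_space": param_grid,
    "selected": search.best_params_,
    "cv_score": round(search.best_score_, 3),
}
print(tuning_record)
```

Capturing the record at tuning time means the rationale examiners ask about (what you searched, how you scored it, what won) never has to be reconstructed after the fact.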
Section 4: Performance Metrics and Validation Results
What examiners expect: Quantitative evidence that the model works, documented in a way that enables effective challenge.
What to document:
| Model Type | Primary Metrics | Additional Requirements |
|---|---|---|
| Classification | Accuracy, precision, recall, F1, AUC-ROC | Confusion matrix, performance by segment |
| Regression | RMSE, MAE, R², MAPE | Residual analysis, prediction intervals |
| Generative/LLM | Faithfulness score, hallucination rate, toxicity rate, task-specific accuracy | Human evaluation results, benchmark scores (e.g., MMLU, HellaSwag) |
For every metric, document:
- The validation dataset (separate from training data)
- How test/validation splits were created
- Performance across demographic subgroups (this isn’t optional — fair lending requirements demand it)
- Comparison to the baseline/champion model and to simpler alternative approaches
- Known conditions where performance degrades
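Subgroup performance reporting is straightforward to automate. A minimal sketch on synthetic data, computing the classification metrics from the table for each demographic segment; the segment labels and score-generation logic are hypothetical placeholders for your real holdout data:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
n = 2000
segment = rng.choice(["A", "B"], size=n)   # hypothetical demographic flag
y_true = rng.integers(0, 2, size=n)
# Synthetic scores correlated with the label, clipped to [0, 1]
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=n), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

# Report the same metrics separately for each segment
for seg in ["A", "B"]:
    m = segment == seg
    print(seg, {
        "precision": round(precision_score(y_true[m], y_pred[m]), 3),
        "recall": round(recall_score(y_true[m], y_pred[m]), 3),
        "auc": round(roc_auc_score(y_true[m], y_score[m]), 3),
    })
```

The point is that the per-segment table lands in the documentation package automatically with every validation run, rather than being assembled by hand when an examiner asks.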
Section 5: Explainability and Interpretability
What examiners expect: How you can explain the model’s decisions — especially when those decisions affect consumers.
SR 11-7 requires “effective challenge,” which means someone independent must be able to question the model’s logic. For opaque AI models, this means documenting your explainability approach:
- Global explanations: What drives the model overall? (Feature importance, SHAP summary plots, attention patterns)
- Local explanations: How do you explain individual decisions? (LIME, individual SHAP values, counterfactual explanations)
- Limitations of the explanation method: SHAP and LIME are approximations — document what they can’t capture
- Consumer-facing explanations: If the model drives adverse action notices (credit denial, insurance pricing), document how the explanation is generated and its fidelity to the actual model logic
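As a dependency-light illustration of a global explanation, here is a sketch using scikit-learn's permutation importance (SHAP and LIME, named above, are the more common production choices; this simply shows the documented artifact). The model and data are synthetic stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Global explanation: how much held-out AUC drops when each feature is shuffled
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:+.3f}")
```

Whatever method you use, the documentation should capture both the ranked output and the method's limitations, as the section notes: permutation importance, like SHAP and LIME, is an approximation of the model's behavior, not the behavior itself.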
Section 6: Known Limitations and Failure Modes
What examiners expect: Honest documentation of where the model breaks down.
This is counterintuitive for some teams — why would you document your model’s weaknesses? Because examiners already assume every model has them. What concerns them is when you haven’t identified them.
What to document:
- Known edge cases where performance degrades
- Input conditions that produce unreliable outputs
- Population segments where accuracy drops below acceptable thresholds
- For LLMs: documented hallucination patterns, prompt injection vulnerabilities, topics where the model lacks expertise
- Compensating controls for each limitation (human review, fallback logic, output filters)
- Scenarios that should trigger a model kill switch or shutdown
Section 7: Ongoing Monitoring Plan
What examiners expect: A defined plan — not a vague commitment to “keep an eye on it.”
What to document:
| Monitoring Element | Specification |
|---|---|
| Performance metrics tracked | List each metric and its threshold |
| Data drift detection | Method (PSI, KL divergence, chi-squared), threshold, check frequency |
| Concept drift detection | How you detect when the relationship between inputs and outputs changes |
| Alert and escalation paths | Who gets notified, at what thresholds, and what happens next |
| Retraining triggers | Specific conditions that require model retraining or re-validation |
| Monitoring cadence | Daily, weekly, monthly — tied to model tier |
| Reporting | Who receives monitoring reports and how often |
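Of the drift methods in the table, PSI is the most common in banking and is simple enough to implement and document directly. A minimal sketch in pure NumPy, using the widely cited (but not universal) rule of thumb that PSI below 0.1 is stable and above 0.25 signals significant drift; your documented thresholds should be set per model tier:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    # Decile edges from the baseline; open-ended outer bins catch outliers
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    # Floor proportions to avoid division by zero in sparse bins
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 50_000)
print(round(psi(baseline, rng.normal(0.0, 1, 50_000)), 4))  # same distribution
print(round(psi(baseline, rng.normal(0.5, 1, 50_000)), 4))  # shifted mean
```

Documenting the monitoring plan then means recording, per feature and per model output: the baseline window, the bin construction, the PSI threshold, the check frequency, and who gets paged when the threshold trips.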
Section 8: Change Log and Version History
What examiners expect: A complete record of every change to the model, who approved it, and why.
- Date, description, and rationale for every model update
- Who approved the change (name and role)
- Whether the change triggered re-validation
- Rollback plan if the change degrades performance
- Version number tied to your model inventory
The EU AI Act Raises the Bar Even Higher
If you operate in or serve EU markets, Article 11 of the EU AI Act and its Annex IV establish specific technical documentation requirements for high-risk AI systems that go beyond US regulatory expectations. High-risk AI systems must have technical documentation prepared before being placed on the market.
Annex IV requires documentation of: the general system description and intended purpose, design specifications including key algorithm choices and trade-offs, system architecture and computational resources, training data descriptions including provenance and labeling procedures, validation and testing procedures with metrics and test logs, and human oversight measures.
For firms operating across jurisdictions, the practical approach is to build documentation to the EU standard — it will satisfy both EU requirements and US examiner expectations under SR 11-7.
Colorado SB 205: Documentation for Deployers
Colorado’s AI Act (SB 24-205), effective February 1, 2026, adds state-level documentation requirements. Both developers and deployers of high-risk AI systems must maintain documentation of their risk management policies, including impact assessments and records of consumer notifications. Deployers must also document any known incidents of algorithmic discrimination and the corrective measures taken.
If you haven’t read our full Colorado AI Act compliance guide, start there for the complete picture.
Making It Practical: A 30/60/90-Day Documentation Roadmap
You can’t fix every documentation gap at once. Here’s how to prioritize:
Days 1–30: Triage and Template
- Deliverable: Updated AI model documentation template approved by MRM leadership
- Owner: Head of Model Risk Management or Chief Risk Officer
- Actions: Inventory your current AI/ML models (or update your model inventory). Assess each model’s existing documentation against the sections above. Identify the highest-risk models with the biggest documentation gaps. Build the template using the sections in this article.
Days 31–60: High-Risk Model Documentation
- Deliverable: Complete documentation packages for all Tier 1 (highest risk) AI models
- Owner: Individual model owners (data scientists/ML engineers), validated by MRM team
- Actions: Populate the new template for each Tier 1 model. Conduct gap-fill research where training data provenance or design rationale was never formally documented. Add explainability documentation (SHAP, LIME results) where missing. Document known limitations honestly.
Days 61–90: Validation, Process Integration, and Tier 2 Models
- Deliverable: Independent review of Tier 1 documentation; completed documentation for Tier 2 models
- Owner: Model validation team (or qualified third party)
- Actions: Independent reviewers validate that documentation supports effective challenge. Integrate the new template into the model development lifecycle — every new model ships with a completed model card. Begin Tier 2 model documentation. Establish the quarterly documentation review cadence.
So What?
Documentation isn’t paperwork — it’s your evidence of sound risk management. When an examiner asks “how does this AI model work?” and you hand them a comprehensive, honest model card with training data provenance, performance metrics by demographic group, documented limitations, and a clear monitoring plan, you’ve just demonstrated exactly what SR 11-7 demands.
When you hand them a two-page Word doc that says “this model uses machine learning to predict credit risk” and nothing else, you’ve just earned an MRA (a Matter Requiring Attention). Or worse.
The firms that treat AI documentation as an afterthought will spend their 2026 exam cycles writing remediation plans. The ones that build it into the model development process — documentation as a first-class artifact, not a compliance checkbox — will spend those cycles deploying more models.
Need a structured framework to assess and document your AI model risks? The AI Risk Assessment Template & Guide includes risk scoring matrices, documentation templates, and control frameworks designed for AI/ML models in financial services.
FAQ
What documentation do examiners look for first when reviewing AI models?
Examiners typically start with the model inventory to understand scope, then request the full model development document for high-risk models. They focus on training data provenance (where did the data come from and is it representative?), validation results (especially performance across demographic segments), and the ongoing monitoring plan (what triggers re-validation?). If any of these are missing or thin, it’s an immediate red flag. Documentation of known limitations is also high on their list — they want to see that you’ve identified the model’s weaknesses, not just its strengths.
How is AI model documentation different from traditional model documentation?
Traditional model documentation (designed for linear regressions, scorecards, and financial models) focuses on variable selection, coefficient stability, and backtesting. AI model documentation must additionally cover training data provenance and labeling methodology, hyperparameter tuning decisions, explainability approaches for opaque models (SHAP, LIME), known failure modes like hallucinations or adversarial vulnerabilities, and drift detection thresholds. The core regulatory principle is the same — documentation sufficient for an unfamiliar party to understand the model — but the specifics are fundamentally different for AI/ML systems.
Do I need a separate model card for every AI model in production?
Yes. Every AI model in your model inventory should have its own model card — a standardized summary document covering its purpose, training data, performance, fairness evaluations, and limitations. The model card serves as the “nutrition label” that gives examiners and risk committees a quick understanding of the model. Full development documentation sits behind it for deeper review. For vendor-provided AI models, you should still create a model card documenting what the vendor disclosed, your independent validation results, and any gaps in vendor transparency.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.