Confabulation and Hallucination Risk: What NIST AI 600-1 Says and How to Test for It
Table of Contents
TL;DR:
- NIST AI 600-1 identifies confabulation as one of its 12 primary generative AI risk categories — not a footnote, but a first-class compliance risk with specific testing requirements.
- The framework requires pre-deployment TEVV covering confabulation rates across domain-specific tasks, explicit go/no-go thresholds, and continuous post-deployment monitoring.
- Compliance teams typically treat hallucination as a product quality problem. Examiners will treat it as a model risk governance failure if you can’t show your controls.
- This post covers exactly what NIST AI 600-1 requires, which testing techniques satisfy those requirements, and how to document the results for an AI governance review.
Most compliance teams think about LLM hallucination the same way product teams do — a quality problem to minimize, an annoying failure mode that occasionally surfaces wrong information. That framing misses what’s at stake in regulated environments. When your LLM writes a compliance advisory with a fabricated regulatory citation, when your customer chatbot invents a fee structure that doesn’t exist, when your contract review tool confidently cites a clause that’s absent from the document — you don’t have a quality problem. You have a model risk governance failure.
NIST AI 600-1, the Generative AI Profile released July 2024, treats it that way. Confabulation is one of 12 formally identified GenAI risk categories, with specific actions mapped across all four NIST AI RMF functions — GOVERN, MAP, MEASURE, and MANAGE. If you’re deploying GenAI in financial services and you can’t point to a confabulation testing program, you’re missing a documented NIST expectation.
Here’s what the framework actually says and how to build the program around it.
What NIST AI 600-1 Means by Confabulation
The formal definition matters: NIST AI 600-1 defines confabulation as “the production of confidently stated but erroneous or false content.” The document goes further, specifying that confabulations also include “generated outputs that diverge from the prompts or other input or that contradict previously generated statements in the same context.”
That’s a broader definition than most teams are working with. It captures:
- Factual fabrication — invented facts, statistics, citations, entities
- Prompt divergence — the model answers a different question than was asked
- Internal contradiction — responses that conflict with what the model said moments earlier
- Attribution failure — the model cites a real source but misrepresents what it says
The root cause, per NIST, is architectural: LLMs generate outputs that approximate the statistical distribution of their training data. They are fundamentally optimized to produce plausible text, not verified facts. Confabulations are what happens when statistical plausibility and factual accuracy diverge — which in financial services happens constantly, because regulatory specifics, client data, and recent enforcement actions are poorly represented in most training corpora.
NIST specifically calls out consequential decision-making domains: “Risks of confabulated content may be especially important to monitor when integrating GAI into applications involving consequential decision making.” Credit, compliance, risk assessment, customer interaction — all fall squarely in that category.
The 12 Risk Categories and Where Confabulation Sits
NIST AI 600-1 identifies 12 primary risk categories for generative AI systems:
| # | Risk Category | Financial Services Relevance |
|---|---|---|
| 1 | CBRN Information or Capabilities | Low — unless dual-use research |
| 2 | Confabulation | High — all customer-facing and analytical GenAI |
| 3 | Dangerous, Violent, or Hateful Content | Low-medium |
| 4 | Data Privacy | High — training data, PII in prompts |
| 5 | Environmental Impacts | Low for compliance teams |
| 6 | Harmful Bias or Homogenization | High — fair lending, UDAAP exposure |
| 7 | Human-AI Configuration | High — over-reliance risk |
| 8 | Information Integrity | High — regulatory filings, reports |
| 9 | Information Security | High — prompt injection, data leakage |
| 10 | Intellectual Property | Medium |
| 11 | Obscene, Degrading, Abusive Content | Low-medium |
| 12 | Value Chain and Component Integration | High — third-party model dependencies |
For most financial services AI use cases — compliance chatbots, contract review, regulatory analysis, adverse action explanations — confabulation is a Tier 1 risk. So is information integrity (whether generated content is factually grounded). These two are related and often require overlapping controls.
What NIST AI 600-1 Requires: Controls by Function
The framework organizes confabulation controls across all four NIST AI RMF functions. Most teams implement MEASURE (testing) but skip GOVERN (governance structure), MAP (risk framing), and MANAGE (ongoing treatment). That’s a coverage gap.
GOVERN: Policy and accountability structure
Before any testing happens, your governance framework needs to establish:
- A classification for GenAI use cases that identifies confabulation as a risk category requiring pre-deployment evaluation
- Clear ownership: who is responsible for confabulation testing, who reviews results, who has authority to block deployment
- A policy or standard that defines acceptable confabulation thresholds for different deployment contexts (a regulatory research assistant has different tolerances than a customer-facing chatbot giving compliance advice)
The GOVERN function requires that these structures exist before models reach MEASURE. An organization that runs confabulation tests but has no policy defining what acceptable results look like has checked a procedural box without the governance substance.
MAP: Risk framing before you test
The MAP function requires you to characterize the confabulation risk before testing begins. That means:
- Impact assessment: What decisions does this model inform? What happens if it fabricates a regulatory citation, invents a compliance deadline, or contradicts a policy document? Who is harmed?
- Deployment context documentation: Is this model customer-facing? Analyst-facing? Does it produce regulatory submissions? The confabulation risk profile is fundamentally different across contexts.
- Domain characterization: What regulatory domains, product types, or factual domains will the model be queried on? Your TEVV design needs to cover these domains — generic hallucination tests are insufficient.
This is the step teams skip most often. If your TEVV doesn’t test the domains where the model will actually be deployed — specific regulations, product structures, compliance requirements — you’re testing statistical performance, not operational risk.
MEASURE: The TEVV requirement in detail
This is where most teams spend their effort, and it’s also where NIST AI 600-1 is most specific. The framework calls for pre-deployment TEVV that assesses confabulation rates across domain-specific tasks, with defined thresholds and go/no-go gates before deployment.
What pre-deployment confabulation TEVV looks like:
1. Benchmark testing
Standardized benchmarks establish a baseline confabulation rate against known-answer tasks:
- TruthfulQA tests whether LLMs propagate common misconceptions — run it on every model version change and significant prompt update. Note: TruthfulQA is now saturated by training data inclusion, so results need to be interpreted alongside other benchmarks.
- HalluLens (presented at ACL 2025) provides a more recent, less saturated benchmark for both intrinsic and extrinsic confabulation.
- BLEURT and FactScore evaluate factual precision in generated outputs — particularly useful for document-grounded tasks like contract review or regulatory analysis.
These establish baselines. They don’t replace domain-specific testing.
2. Domain-specific adversarial testing
Generic benchmarks won’t catch the confabulations that matter most. You need to test the specific regulatory and product domains where the model will operate:
- Prompt the model with questions about regulations that changed recently (post-training data cutoff) and evaluate whether it hallucinates current requirements
- Test with real client scenarios that have definitive right answers — loan eligibility determinations, disclosure requirements, filing deadlines
- Ask about specific enforcement actions, cases, or guidance documents and verify citations are real
- Test with ambiguous or incomplete prompts to evaluate whether the model asks for clarification vs. fabricates assumptions
3. Red-teaming
NIST AI 600-1 explicitly names red-teaming as a required measurement technique. NIST’s own software tool, Dioptra, is designed for AI model testing including red-teaming exercises.
Red-teaming for confabulation should target:
- Prompts designed to elicit overconfident responses (“What is the exact penalty under Section X?” when no specific penalty exists)
- Citation requests (“What does OCC Bulletin X say about Y?”) for guidance that either doesn’t exist or has been superseded
- Multi-turn prompts that challenge earlier responses to test for internal contradiction
- Edge-case and out-of-distribution queries where training data is thin
4. RAG grounding evaluation
For document-grounded use cases — contract review, regulatory Q&A, policy analysis — RAG architecture is your primary confabulation mitigation. But RAG introduces its own failure modes that need TEVV:
| RAG Failure Mode | What to Test |
|---|---|
| Retrieval miss | Correct answer exists in corpus but wasn’t retrieved — model fills gap with fabrication |
| Attribution error | Model uses retrieved content but doesn’t cite it, or cites wrong source |
| Boundary violation | Model goes beyond retrieved documents to answer from parametric memory |
| Retrieval hallucination | Model claims it retrieved something that wasn’t actually in the corpus |
Test each failure mode explicitly. RAG reduces hallucination rates by 60–80% in production systems, but that reduction is meaningless if you can’t verify your specific system achieves it in your specific deployment context.
5. Setting thresholds and go/no-go gates
NIST AI 600-1 requires explicit performance and safety thresholds with deployment gates — you need a defined answer to “what confabulation rate is acceptable for this use case?”
The threshold is context-specific:
- Customer-facing chatbot providing compliance information: lower tolerance — fabricated regulatory requirements expose the institution to liability
- Internal analyst tool drafting regulatory memos (human review required): higher tolerance — human-in-the-loop catches errors before external publication
- Regulatory submission drafting assistance: near-zero tolerance — human must verify every citation before submission
Document the threshold, who set it, the rationale, and the TEVV results against it. If the model doesn’t clear the threshold, it doesn’t deploy until retesting shows otherwise.
MANAGE: Post-deployment monitoring and incident response
Confabulation risk doesn’t end at deployment. NIST AI 600-1’s MANAGE function requires:
- Ongoing monitoring metrics: track user-reported factual errors, output review samples, and downstream correction rates in production
- Feedback loops: mechanisms for users or reviewers to flag confabulations — and a process that logs, categorizes, and routes them back to the model risk team
- Retest triggers: define conditions that require re-TEVV — model version updates, prompt changes, new use case expansions, significant user-reported error patterns
- Incident response: if a confabulation causes a material downstream error (a customer acted on a fabricated compliance requirement, a filing included a nonexistent regulation), that’s an AI incident requiring formal documentation and root cause analysis
The NIST AI RMF MEASURE function post covers the broader TEVV framework in detail — confabulation testing sits within that structure.
Human-in-the-Loop: The Underrated Control
NIST AI 600-1 explicitly addresses human-AI configuration as a separate risk category (risk #7), but its treatment intersects directly with confabulation management. The framework’s guidance on confabulation consistently points toward human oversight as the compensating control when automated testing can’t achieve acceptable thresholds:
- Don’t deploy GenAI in contexts where confabulations can reach external audiences without human review unless confabulation rates are demonstrably low
- Define which outputs require independent human verification before action
- Train users on the specific confabulation failure modes relevant to their use case — most users significantly overestimate LLM factual reliability
The compliance teams that have the most trouble with LLM confabulation are the ones where the AI’s outputs go from model to decision without human check. That’s not just a risk management failure — it’s a governance design failure.
Documentation That Satisfies an Examiner
If an examiner asks to see your confabulation controls for a deployed GenAI system, here’s what the documentation should include, mapped to NIST AI 600-1 requirements:
| Documentation Element | NIST AI 600-1 Alignment | Where to File It |
|---|---|---|
| Confabulation risk classification in AI inventory | GOVERN: AI use case registry | Model inventory / AI governance register |
| Impact assessment for confabulation failures | MAP: harm and impact analysis | Pre-deployment risk assessment |
| TEVV results with benchmark scores | MEASURE: testing documentation | Model validation file |
| Go/no-go threshold and approval | MEASURE: deployment gate | Model approval memorandum |
| Ongoing monitoring metrics and review cadence | MANAGE: monitoring plan | Operational monitoring documentation |
| Incident log for confabulation events | MANAGE: incident response | Issues management tracker |
This documentation structure mirrors what SR 26-02 requires for traditional model validation — the same rigor applied to GenAI’s specific risk profile.
So What?
The shift you need to make is treating confabulation as a model risk governance question, not a product quality question. Those are governed by different teams, different documentation standards, and different escalation paths.
NIST AI 600-1 gives you the framework requirement. What it doesn’t give you is the testing infrastructure, the threshold-setting methodology, or the documentation templates — those are program-design decisions that your team has to make. Start with the highest-consequence deployments: any GenAI touching customer decisions, regulatory submissions, or compliance advisory functions. Build the TEVV there first.
The NIST AI 600-1 overview post covers all 12 risk categories in the framework. If you haven’t read the existing LLM hallucination management guide, that covers the detection and mitigation controls from a broader risk management lens — this post covers the regulatory compliance angle specifically.
Need a framework for AI model governance that includes pre-deployment checklists aligned to NIST AI 600-1? The AI Risk Assessment Template & Guide includes confabulation risk assessment tools, TEVV documentation templates, and a model inventory designed for financial services teams.
Related Template
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Frequently Asked Questions
What is confabulation in NIST AI 600-1?
How does NIST AI 600-1 differ from just calling it hallucination?
What TEVV does NIST AI 600-1 require for confabulation?
What benchmarks can I use to test for confabulation?
Does RAG eliminate confabulation risk?
How should financial services firms document confabulation controls for examiners?
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
Related Framework
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Keep Reading
GenAI Supply Chain Risk: Third-Party Model Dependencies and NIST AI 600-1 Controls
Most financial institutions using GenAI APIs don't fully own their AI supply chain. NIST AI 600-1 says that's your problem. Here's what you need to control.
Apr 25, 2026
AI RiskDeveloper vs. Deployer vs. Operator: Role-Specific Obligations Under NIST AI 600-1
NIST AI 600-1 assigns different GenAI risk obligations to developers, deployers, and operators. Here's what each role actually owns—and where the gaps live.
Apr 25, 2026
AI RiskGenerative AI Incident Disclosure and Content Provenance: NIST AI 600-1 Requirements
What NIST AI 600-1 requires when your GenAI system fails: incident disclosure obligations, after-action review requirements, and content provenance tracking.
Apr 24, 2026
Immaterial Findings ✉️
Weekly newsletter
Sharp risk & compliance insights practitioners actually read. Enforcement actions, regulatory shifts, and practical frameworks — no fluff, no filler.
Join practitioners from banks, fintechs, and asset managers. Delivered weekly.