TEVV for Generative AI: Pre-Deployment Testing Requirements Under NIST AI 600-1
Table of Contents
TL;DR:
- NIST AI 600-1 requires pre-deployment TEVV (Test, Evaluate, Verify, Validate) across all 12 GenAI risk categories before deploying any generative AI system — not just accuracy testing.
- The five highest-priority testing domains for financial services are: confabulation, harmful bias, information security, information integrity, and data privacy.
- Structured red-team exercises are required under MP-2.3-005 — before deployment, not as an afterthought.
- You need documented go/no-go gates (GV-1.3-002) and stop-build authority (GV-1.3-006) before testing starts, not after results are in.
Most compliance teams check a box when their AI vendor says the model passed internal safety testing. That’s not TEVV. That’s vendor attestation — and under NIST AI 600-1, it doesn’t satisfy your pre-deployment obligations as a deployer.
NIST AI 600-1, the Generative AI Profile released July 2024, is explicit: organizations that deploy generative AI to customers or in business processes cannot outsource TEVV to the model vendor. Your use case, your regulatory context, and your customer population require your own testing. The vendor tested their model. You need to test your deployment.
That distinction matters enormously for financial services. A foundation model might perform acceptably across general benchmarks. But when you deploy it for loan application assistance, your ECOA exposure is a function of how that model performs across your specific applicant population — not the vendor’s test set.
Here’s what the framework actually requires, and how to build a testing program that satisfies it.
Why Traditional Model Validation Falls Short for GenAI
SR 11-7 model validation was designed for a different kind of model. Traditional financial models — credit scores, fraud detection classifiers, AML monitoring systems — produce structured outputs against defined inputs. Validation tests statistical soundness: backtesting, sensitivity analysis, out-of-time performance benchmarking.
Generative AI systems produce probabilistic, open-ended outputs in natural language. The same input can produce different outputs on different runs. The failure modes aren’t underfitting or overfitting — they’re confabulation, bias amplification, prompt injection vulnerability, and privacy leakage. SR 11-7 wasn’t designed to catch any of these.
NIST AI 600-1 fills the gap. It provides 12 risk categories specific to or amplified by generative AI — the vocabulary and testing requirements that traditional model risk management simply didn’t need until now. The four functions of the broader AI RMF (GOVERN, MAP, MEASURE, MANAGE) structure how those requirements flow from governance policy through deployment and monitoring.
The NIST AI RMF MEASURE function covers TEVV methodology at a high level across all AI types. AI 600-1 makes it concrete for generative systems — specifying what you test, not just that you test.
What TEVV Means for GenAI
TEVV stands for Test, Evaluate, Verify, and Validate. For generative AI specifically, each component has a distinct function:
| Component | What It Does | GenAI Specific |
|---|---|---|
| Test | Structured examination of system behavior before deployment | Domain-specific confabulation rates, adversarial probing, bias disparity testing |
| Evaluate | Benchmark performance against pre-defined thresholds | Demographic parity ratios, confabulation error rates, injection success rates |
| Verify | Confirm the system meets its technical specifications | Output format compliance, latency SLAs, model version matches documentation |
| Validate | Confirm the system is appropriate for its intended real-world use | Use-case relevance, regulatory environment fit, customer population performance |
The critical distinction: Verify asks “does it do what we built?” Validate asks “should we deploy it for this purpose?” Most teams do the former and skip the latter. AI 600-1 requires both.
For financial services, the Validate step is where ECOA, UDAAP, and fair lending obligations live. A GenAI system that technically does what it was designed to do can still generate discriminatory adverse action notices, confabulate regulatory requirements, or give customers legally inaccurate product information. Validation catches that before deployment.
The 12-Category Testing Matrix
NIST AI 600-1 defines 12 risk categories. Not all require equal testing depth for every use case — but you need to assess each and document why your testing coverage is sufficient for your specific deployment context.
| Risk Category | Testing Required | Financial Services Priority |
|---|---|---|
| Confabulation | Benchmark scoring on domain-specific queries, RAG grounding tests | Critical — customer-facing or advisory uses |
| Harmful Bias & Homogenization | Demographic disparity analysis across protected classes | Critical — any use touching underwriting, pricing, decisions |
| Information Security | Prompt injection, data extraction, model extraction testing | Critical — all deployments |
| Data Privacy | Training data memorization, PII leakage testing | High — all customer-data uses |
| Information Integrity | Susceptibility to disinformation amplification | High — any public-facing or advisory use |
| Human-AI Configuration | Automation bias testing, appropriate escalation | High — customer service, decisioning support |
| Value Chain & Component Integration | Third-party model dependency risk, supply chain exposure | High — all third-party GenAI deployments |
| Intellectual Property | Output copyright analysis, training data attribution | Medium — varies by use case |
| CBRN Information | Capability assessment for dangerous content elicitation | Low — unless model has broad generation capability |
| Dangerous/Violent Content | Content safety boundary testing | Low-Medium — varies by customer interaction |
| Obscene/Degrading Content | Content safety boundary testing | Low-Medium — varies by customer interaction |
| Environmental Impacts | Compute cost and energy efficiency assessment | Low — organizational sustainability reporting |
Your MAP function work (risk classification and context framing) should determine which categories get deep testing versus a documented risk acceptance decision.
The Five Priority Testing Domains in Detail
1. Confabulation Testing
Confabulation — the production of confidently stated but erroneous content — is the GenAI failure mode most likely to create direct regulatory exposure in financial services. A customer chatbot that invents a fee structure that doesn’t exist in your disclosures. A compliance assistant that fabricates a regulatory citation. A document review tool that misses a clause by confidently describing something adjacent.
Pre-deployment confabulation testing requires:
- Domain-specific benchmark evaluation: Don’t use general benchmarks. Test on queries drawn from your actual use case domain. If you’re deploying a mortgage assistance chatbot, test it on mortgage-specific fact scenarios where correct answers are verifiable against your product documentation.
- RAG grounding assessment: If you’re using retrieval-augmented generation to ground outputs in verified documents, test whether retrieval actually constrains hallucination — or whether the model still goes off-document. Test boundary cases: what happens when the retrieved document doesn’t contain a clear answer?
- Threshold definition: Define an acceptable confabulation rate before testing. GV-1.3-002 requires performance thresholds be established pre-testing so deployment decisions aren’t made post-hoc. For high-consequence customer-facing uses, a confabulation rate above 2-5% on domain-specific tasks should trigger a no-deploy decision.
See the full confabulation testing methodology for benchmark options and documentation format.
2. Harmful Bias Testing
For financial services, this is where ECOA and UDAAP exposure concentrates. AI 600-1’s Harmful Bias category requires demographic disparity analysis — testing whether the GenAI system’s outputs differ materially across protected class attributes.
Pre-deployment bias testing requires:
- Disparate performance analysis: Does the system perform at the same accuracy rate across race, gender, age, national origin, and other protected characteristics? For GenAI, this includes output quality, response completeness, and confabulation rates disaggregated by simulated demographic scenarios.
- Output tone and framing analysis: Does the system consistently use different language when describing similar situations to demographically different users? Subtle framing differences in customer communication can create UDAAP exposure.
- Adverse action notice evaluation: For any GenAI used in decisioning support, evaluate whether the explanations generated for denial decisions satisfy ECOA’s adverse action notice requirements — including specificity and legal accuracy.
3. Information Security Testing
NIST AI 600-1 (MP-2.3-005) requires adversarial testing before and after deployment. This is not optional and it’s not satisfied by vendor attestation.
Required information security tests:
- Direct prompt injection: Attempts to override system instructions through user input
- Indirect prompt injection: Attempts to inject malicious instructions through retrieved documents or external data sources (critical for RAG deployments)
- Data extraction probing: Attempts to elicit training data, system prompts, or other confidential information through structured queries
- Model behavior boundary testing: Testing for jailbreak scenarios — attempts to circumvent content safety controls through role-play, hypotheticals, or encoded requests
For financial services deployments, also test: Can users elicit the system prompt? Can they extract other users’ conversation history? Can they cause the model to produce compliance-relevant content it’s configured to avoid?
4. Data Privacy Testing
Training data memorization is an underappreciated pre-deployment risk. Large language models can memorize and reproduce verbatim samples from training data — including PII, internal documents, or confidential financial information that appeared in training corpora.
Privacy pre-deployment testing includes:
- Memorization probing: Structured queries designed to elicit verbatim reproduction of training data
- PII leakage testing: Testing whether the model produces real names, account numbers, or other identifying information in response to partial prompts
- Inference attack resistance: Testing whether model outputs reveal information that could allow re-identification of individuals from training data
5. Information Integrity Testing
Information integrity describes whether your GenAI system amplifies or resists disinformation. In a customer-facing context, this means: can adversarial users get your chatbot to repeat and appear to endorse false regulatory or product information? Can users manipulate the system into making statements that appear authoritative but are factually wrong?
This matters beyond internal risk — customer complaints arising from a GenAI system repeating user-supplied disinformation as fact create both UDAAP exposure and reputational risk.
Red-Team Testing Requirements
NIST AI 600-1 (MP-2.3-005) explicitly requires structured adversarial testing — red-team exercises — before deployment. Red-team testing goes beyond automated benchmark testing: it involves skilled testers actively trying to break the system using the same techniques adversarial users would employ.
Pre-deployment red-team scope should include:
- All information security attack vectors described above
- Scenario-based testing aligned to your specific use case (e.g., a mortgage chatbot red-team that tests for confabulated interest rate quotes, false regulatory claims, and discriminatory framing)
- Testing by personnel outside the team that built the system — independent assessment is a consistent NIST requirement
Who should conduct the red team? At large institutions with dedicated AI risk teams, this is an internal function separate from the model development team. At smaller fintechs or banks, this is a case for bringing in external expertise — at minimum for the first deployment of a new GenAI use case type. The independence requirement isn’t ceremonial; it’s how you catch the things the development team is blind to.
Red-team documentation should capture: tester identities and independence from development, scope and methodology, all significant findings, mitigations applied, and a residual risk conclusion that feeds into the go/no-go decision.
Go/No-Go Gates and Stop-Build Authority
This is where most organizations’ GenAI governance has the biggest gap. Testing without pre-defined deployment criteria is theater — it generates findings but doesn’t drive decisions.
NIST AI 600-1 requires:
GV-1.3-002: Establish performance thresholds before testing. Define what passing looks like before any test data is in. This prevents post-hoc rationalization where “acceptable” shifts to match whatever results come back. Thresholds should be documented in your AI use case approval package and approved by model risk or governance before testing begins.
GV-1.3-006 and GV-1.3-007: Stop-build authority. The framework requires a defined policy and a named role empowered to halt development or deployment when testing reveals unacceptable risk — regardless of business pressure, schedule, or investment sunk. At most mid-size banks, this lives with the CRO or Chief Model Risk Officer. At fintechs, it’s often the Head of Compliance or equivalent first- or second-line risk leader.
Stop-build authority only works if it’s documented and the authority has teeth. If the executive team can simply override a compliance objection without a formal risk acceptance process, the control doesn’t exist.
Post-Deployment: When TEVV Continues
Pre-deployment TEVV is not a one-time event. NIST AI 600-1 requires continuous monitoring post-deployment because generative AI behavior can drift even when the model version doesn’t change — as the input distribution shifts, as fine-tuning is applied, or as the system prompt evolves.
Post-deployment requirements include:
- Ongoing confabulation monitoring: Track confabulation rates on production queries, with a cadence defined in the deployment approval (monthly for high-risk uses, quarterly for lower-risk)
- Ongoing bias monitoring: Monitor demographic performance disparities in production outputs, with triggers for re-evaluation if disparity metrics exceed thresholds
- Periodic re-red-teaming: Schedule adversarial testing at defined intervals after deployment — at minimum annually, and triggered by any significant model update or use case change
- Incident tracking: Log any outputs that triggered user complaints, regulatory inquiries, or internal flags as AI incidents (connected to the incident disclosure obligations covered separately in NIST AI 600-1’s content provenance and incident disclosure requirements)
How to Document TEVV for Examiners
Examiners applying SR 11-7 principles to GenAI deployments will expect documentation that mirrors a model validation report — but extended to cover the GenAI-specific testing domains.
Your TEVV documentation package should include:
- Pre-deployment test plan: Use case description, risk tier, applicable AI 600-1 risk categories, testing scope and methodology, defined pass/fail thresholds (GV-1.3-002)
- Test results by risk category: Benchmark scores and methodology for confabulation, bias disparity analysis results by demographic group, red-team findings and severity ratings, information security test results
- Go/no-go decision documentation: Who made the deployment decision, what findings were considered, any mitigations applied before go-live, and documented risk acceptances for any findings within tolerance
- Post-deployment monitoring plan: Metrics, cadence, responsible owner, escalation triggers
For third-party GenAI deployments, your documentation also needs to cover the Value Chain risk category — what testing the vendor conducted, what access you have to their test results, and what independent testing you ran on your use case layer on top.
So What?
The regulators haven’t published a GenAI TEVV examination module yet — but they’re applying AI 600-1 principles under existing SR 11-7 and safety-and-soundness authorities right now. The 2024 interagency request for information on AI, OCC Bulletin 2025-26, and every bank partner questionnaire on AI governance all point the same direction: demonstrate your pre-deployment testing, or explain why you didn’t think it was necessary.
“We tested it informally before we launched” isn’t going to hold up. “Here’s our TEVV plan, our threshold documentation, our red-team report, and our go/no-go decision” will.
If you’re building or deploying GenAI in financial services and you don’t have a structured pre-deployment testing program, start with the AI 600-1 risk category overview, prioritize the five testing domains most applicable to your use case, and build your threshold documentation before you start testing.
The AI Risk Assessment Template & Guide includes a pre-deployment assessment scorecard covering all 11 AI risk domains, worked examples for GenAI use cases, and documentation templates built for examiner review — so you’re not starting from a blank spreadsheet.
Frequently Asked Questions
Does NIST AI 600-1 pre-deployment testing apply to third-party GenAI tools we didn’t build?
Yes, and this is one of the most important points in the framework. If you are deploying a third-party GenAI tool — whether it’s a vendor chatbot, an AI-assisted underwriting tool, or a compliance workflow automation — you are a “deployer” under NIST AI 600-1, and your testing obligations are not satisfied by the vendor’s testing. You must conduct TEVV appropriate to your specific use case, regulatory context, and customer population. The vendor’s safety card is informative, not sufficient.
How long does pre-deployment TEVV take for a GenAI system?
It depends on use case complexity and risk tier. A low-risk internal productivity tool with no customer-facing output might require two to three weeks of testing and documentation. A customer-facing GenAI system touching regulated decisioning or advice — mortgage chatbot, credit application support, investment information — realistically requires six to ten weeks for a thorough TEVV including red-team testing, bias analysis, and documentation. Plan for it before you commit to a deployment timeline.
Can we use the vendor’s red-team results to satisfy MP-2.3-005?
Partially. Vendor red-team results are useful inputs and should be requested and reviewed. But they test the model in isolation — not your specific deployment configuration, system prompt, customer population, or use case. Your adversarial testing needs to test the full deployment: model + configuration + retrieval system + user interface + customer interaction patterns. Most vendor testing won’t cover that stack.
What if testing reveals a significant finding we can’t fix before the planned launch date?
Stop-build authority (GV-1.3-006) exists for exactly this situation. The framework requires that deployment be blocked when testing reveals risk above defined tolerance — regardless of schedule. The operational answer is: document the finding, assess severity against your defined thresholds, and either fix it before launching or document a formal risk acceptance with appropriate controls and a remediation timeline. Launching with a known unmitigated finding and no formal risk acceptance is the worst outcome — it transforms a governance gap into a deliberate decision that will look very bad in an examination.
How does this interact with our existing model risk management process?
NIST AI 600-1 TEVV is additive to SR 11-7 model validation, not a replacement. For GenAI systems that meet SR 11-7’s definition of a “model,” you still need traditional validation: conceptual soundness review, backtesting, benchmarking against alternatives. AI 600-1 TEVV adds the GenAI-specific testing domains on top — confabulation, adversarial robustness, bias across output modalities. Your model validation policy should explicitly address how these requirements interact, and your model risk committee should have approved the testing framework before you’re facing your first GenAI deployment decision.
Related Template
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Frequently Asked Questions
What does TEVV mean in NIST AI 600-1 for generative AI?
What must be tested before deploying a generative AI system under NIST AI 600-1?
How is GenAI TEVV different from traditional SR 11-7 model validation?
Is red-team testing required under NIST AI 600-1 before deployment?
What are go/no-go gates under NIST AI 600-1 for GenAI deployment?
How do I document TEVV results for a bank examiner?
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
Related Framework
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Keep Reading
GenAI Supply Chain Risk: Third-Party Model Dependencies and NIST AI 600-1 Controls
Most financial institutions using GenAI APIs don't fully own their AI supply chain. NIST AI 600-1 says that's your problem. Here's what you need to control.
Apr 25, 2026
AI RiskDeveloper vs. Deployer vs. Operator: Role-Specific Obligations Under NIST AI 600-1
NIST AI 600-1 assigns different GenAI risk obligations to developers, deployers, and operators. Here's what each role actually owns—and where the gaps live.
Apr 25, 2026
AI RiskGenerative AI Incident Disclosure and Content Provenance: NIST AI 600-1 Requirements
What NIST AI 600-1 requires when your GenAI system fails: incident disclosure obligations, after-action review requirements, and content provenance tracking.
Apr 24, 2026
Immaterial Findings ✉️
Weekly newsletter
Sharp risk & compliance insights practitioners actually read. Enforcement actions, regulatory shifts, and practical frameworks — no fluff, no filler.
Join practitioners from banks, fintechs, and asset managers. Delivered weekly.