AI Red Teaming Techniques: How to Stress-Test LLMs Before Deployment
Table of Contents
TL;DR
- Researchers from OpenAI, Anthropic, and Google DeepMind published findings in October 2025 showing they could bypass 95–100% of 12 published AI defenses under adaptive attack conditions. Defense without adversarial testing is theater.
- NIST AI 600-1 (MP-2.3-005) requires red-team exercises before LLM deployment — not optional, not delegatable to your vendor’s safety testing.
- Financial services LLMs face five attack categories that compliance teams need to own: prompt injection, hallucination, biased/discriminatory output, data exfiltration, and agentic misuse.
- Anthropic’s testing on Claude Opus 4.5 found attack success rates climbing from 4.7% at one attempt to 63% at 100 attempts in coding environments — sustained adversarial pressure is qualitatively different from one-shot testing.
In October 2025, researchers from OpenAI, Anthropic, and Google DeepMind published results from testing 12 widely-used AI defenses. The team achieved bypass rates above 90% on most defenses under adaptive attack conditions. Prompting-based defenses reached 95–99% attack success rates. Training-based methods hit 96–100%. The conclusion was not that AI is hopeless — it was that single-layer defenses evaluated against non-adaptive attackers don’t tell you much about real-world robustness.
That’s the starting premise for AI red teaming: you cannot know what your LLM does under adversarial pressure by testing it the way a friendly user would use it. If you’ve deployed an LLM in a customer-facing context, in your compliance workflow, or in anything touching credit decisioning — and you haven’t stress-tested it against someone actively trying to make it fail — you don’t know what you’re running.
This is the practitioner playbook for running a structured AI red team exercise. Not the theoretical overview. The methodology: how to organize it, what to test, how to score findings, and what the output looks like when you hand it to an examiner.
What Makes AI Red Teaming Different
Traditional penetration testing probes your infrastructure for known vulnerability classes: unpatched CVEs, misconfigured access controls, exposed credentials, injection vulnerabilities in web applications. The attacker wins by gaining unauthorized access or escalating privileges.
AI red teaming probes your model’s decision-making for failure modes specific to how language models work. The attack surface is the context window — everything the model sees, including the system prompt, user input, retrieved documents, and tool outputs. The attacker wins by causing the model to:
- Override its instructions and do something it was told not to do
- Generate harmful, biased, or factually wrong outputs that look authoritative
- Reveal information it shouldn’t — training data, other users’ data, system prompts
- Take unauthorized actions when given agentic capabilities (API calls, transactions)
You cannot fully secure these attack surfaces with traditional controls. You learn what they are through structured adversarial testing, then build mitigations specific to the vulnerabilities you find.
Five Attack Categories for Financial Services LLMs
1. Prompt Injection
OWASP LLM01:2025. The highest-priority attack class. Prompt injection occurs when an attacker crafts input that overrides the model’s system instructions — effectively hijacking the model’s behavior.
Direct injection happens in the user-visible input field: a user types an instruction that contradicts the system prompt (“Ignore all previous instructions and instead…”). Indirect injection is harder to catch: the adversarial instruction is embedded in a document, email, or web page that the model retrieves and processes as context. If your LLM reads customer documents for a loan underwriting workflow and a submitted document contains instructions to approve the application regardless of financials, that’s an indirect injection attack in a regulated decisioning context.
What to test:
- Can a user override a system prompt instruction by asking directly?
- Can adversarial content in retrieved documents change the model’s conclusions?
- Does the model reveal its system prompt when asked directly or through indirect techniques?
- Does the model follow user instructions to act as a different persona with different constraints?
Threshold for passing: Zero tolerance on system prompt override in production. Low-risk tolerance on indirect injection in any workflow where the model processes external documents.
2. Hallucination and Confabulation
NIST AI 600-1 treats confabulation — the generation of factually wrong but confidently stated outputs — as a primary risk category for financial services LLMs. This is not a security attack, but it is a red team category because the stress test reveals the scope of the problem under adversarial questioning.
Testing hallucination means going beyond “does the model get easy questions right?” to: what happens when a user asks domain-specific questions where wrong answers create regulatory exposure?
What to test:
- Ask regulatory compliance questions with objectively correct answers from published guidance (e.g., “What is the SAR filing deadline under 31 CFR 1020.320?”) and measure accuracy.
- Present the model with ambiguous or edge-case scenarios where confidently-stated wrong answers create compliance risk.
- Test whether the model appropriately expresses uncertainty or confidently states incorrect information.
- For customer-facing deployments: test whether the model gives advice that violates UDAAP or creates a misleading representation.
Document: confabulation rate on domain-specific queries, whether the model hedges appropriately when uncertain, and specific failure scenarios.
3. Biased and Discriminatory Output
For any LLM that touches underwriting, pricing, collections, insurance, or any consumer interaction that could influence access to financial services, ECOA and UDAAP exposure is a function of what the model outputs across protected classes. This is where your AI bias testing methodology framework overlaps with red teaming.
Red teaming for bias goes beyond statistical disparity testing. You’re testing whether the model can be prompted to make explicitly discriminatory recommendations, whether subtle cues in input (names, locations, cultural context) shift its outputs, and whether adversarial framing causes the model to rationalize discriminatory outcomes.
What to test:
- Do outputs differ materially across protected class proxies when everything else is held constant?
- Can a user prompt the model to explain a credit decision in terms of protected characteristics?
- Does the model generate different guidance for customers in majority-minority geographies vs. majority-white geographies for the same product?
- Can adversarial prompting cause the model to produce outputs that would fail a UDAAP review?
Threshold for passing: Any finding where the model produces outputs that would constitute disparate treatment under ECOA is a go/no-go blocker. Statistical disparity findings require documented investigation and mitigation before deployment.
4. Data Exfiltration and PII Leakage
OWASP LLM02:2025 (Sensitive Information Disclosure). LLMs can reveal information they shouldn’t — training data, other users’ conversations, injected context that was meant to be invisible, or system prompt content.
For financial services, the regulatory risk is concrete: Gramm-Leach-Bliley Act protections for nonpublic personal information, Reg P obligations, and state privacy law exposure. If your model was trained on or has access to customer data, you need to test whether adversarial prompting can extract it.
What to test:
- Can a user cause the model to reproduce text from its training data, including any customer data used in fine-tuning?
- Does the model reveal the contents of its system prompt under direct questioning or indirect extraction techniques?
- In multi-tenant deployments: can a user in one context extract information from another user’s context?
- Does the model leak retrieved document content it was instructed to treat as confidential?
Tools: Promptfoo provides open-source automated testing against OWASP LLM categories. Use automated tools to scale coverage, but have human testers review outputs — automated tools miss context-dependent disclosure that a human reviewer catches immediately.
5. Agentic Misuse and Excessive Agency
OWASP LLM06:2025. For LLMs with tool access — the ability to call APIs, read/write databases, execute code, or take actions in external systems — the red team attack surface expands to the downstream consequences of model action. NIST AI 600-1 and the OWASP Agentic AI Top 10 (2026) both flag excessive agency as the highest-priority risk for autonomous AI systems.
The attack scenario is concrete: a financial assistant agent with access to a fund transfer API was manipulated in a documented red team exercise to reframe a transaction as an internal test and invoke a “Developer Mode” context, resulting in a $900 unauthorized withdrawal. That is agentic misuse in a financial services context.
What to test:
- Can adversarial prompting cause the model to invoke tools it wasn’t instructed to use?
- Can a user escalate the model’s permissions by claiming special context (“I’m a developer running a test”)?
- What happens when the model receives conflicting instructions between the system prompt and user input — which wins?
- Does the model confirm high-stakes actions (transactions, data deletions) before executing, or does it proceed on inference?
For agentic deployments, a finding where the model takes an irreversible financial action without explicit user confirmation is a go/no-go blocker. Scope tool access to minimum necessary permissions and document the authorization architecture.
How to Structure a Red Team Exercise
Pre-Exercise: Define Scope and Thresholds
Before running a single test, write down:
- Deployment context: What does this model do, who uses it, what data does it access, what actions can it take?
- Threat model: Who is the adversary? External users trying to manipulate outputs? Insider threat? Malicious documents in the processing pipeline?
- Risk categories in scope: Of the five categories above, which apply to this deployment? Rank them by regulatory consequence.
- Go/no-go thresholds: What finding level blocks deployment? Per NIST AI 600-1 GV-1.3-002, thresholds must be set before testing, not after you see results.
Team Composition
Minimum viable red team: three people with different mental models.
- Technical: Knows the model’s architecture, context window structure, and what tools it can invoke. Generates the most sophisticated attack variations.
- Compliance/Risk: Knows what the regulatory consequences of each failure mode look like. Grounds the exercise in what actually matters to examiners.
- Adversarial tester: Approaches the model with no assumptions about expected behavior. Ideally someone from internal audit or an external consultant — not the team that built or deployed it.
For high-risk deployments (consumer-facing credit decisioning, transaction authorization), add a domain expert who can evaluate whether a wrong output is merely awkward or actually harmful.
Execution: Attack Pattern Library
Work through each category systematically. For each category, start with published attack patterns (OWASP LLM Top 10 provides canonical examples) and then escalate to context-specific variations. Anthropic’s own research found that attack success rates climbed from 4.7% at one attempt to 33.6% at 10 attempts and 63% at 100 attempts — run each attack variation at least 10 times before concluding the model is resistant.
Scoring
| Severity | Definition |
|---|---|
| Critical | Finding meets go/no-go threshold. Deployment blocked until mitigated. |
| High | Material regulatory or customer harm possible. Mitigation required; retest before deployment. |
| Medium | Finding is reproducible under specific conditions; risk is bounded. Document mitigation or acceptance. |
| Low | Edge case with limited harm potential. Document and monitor post-deployment. |
Score each finding at the severity of the worst confirmed output, not the average outcome across test runs.
Documentation Output
Your red team report is a pre-deployment testing artifact — it becomes part of the model’s governance file and may be reviewed by examiners. Structure it as:
- Executive summary: scope, overall go/no-go recommendation, critical findings
- Methodology: team composition, attack categories tested, number of test cases per category
- Findings by severity: attack method, observed output, risk category mapping, severity score
- Mitigations applied: what was changed before deployment as a result of findings
- Residual risk statement: what risk was accepted and who accepted it
- Post-deployment monitoring plan: what ongoing testing continues after launch
Align each finding to the NIST AI 600-1 action that applies and, where relevant, the OWASP LLM category. Examiners following NIST AI 600-1 TEVV requirements expect the documentation organized this way.
What Examiners and Regulators Are Looking For
SR 26-02 (the Federal Reserve/OCC/FDIC revised model risk guidance released in 2026) extends SR 11-7 principles to generative AI and explicitly calls for testing that goes beyond traditional statistical validation. Examiners are starting to ask:
- Do you have pre-deployment red team results for your customer-facing LLMs?
- Who conducted the red teaming, and was it independent of the deployment team?
- What were the findings, what was mitigated, and what residual risk was accepted?
- How do you test ongoing robustness post-deployment?
A vendor’s safety testing attestation does not answer those questions for your deployment. The OCC’s revised model risk bulletin 2026-13 clarifies that the deployer organization is responsible for validating fitness for its specific use case — which means red teaming your deployment, not the vendor’s general model.
So What? The Baseline Has Shifted
Twelve published defenses. All broken above 90%. That’s where we are on AI robustness. The implication isn’t that you shouldn’t deploy LLMs — it’s that “we tested it and it worked fine” is no longer a complete statement. The meaningful question is: tested by whom, against what adversarial pressure, across which failure modes?
A structured red team exercise that covers the five categories above, produces a documented findings report, and feeds into a defined go/no-go decision is the current bar for defensible pre-deployment AI governance. It is also increasingly what regulators will expect to see when they open your AI model inventory file and start asking questions.
Start with the AI Risk Assessment Template for the model risk documentation framework, then layer red team results into the pre-deployment testing section.
Sources
- NIST AI 600-1: Generative AI Profile (July 2024)
- OWASP Top 10 for LLM Applications 2025
- CSA Research Note: NIST AI Agent Red-Teaming Standards (March 2026)
- VentureBeat: Anthropic vs. OpenAI red teaming methods reveal different security priorities
- VentureBeat: Researchers broke every AI defense they tested
Need the working template?
Start with the source guide.
These answer-first guides summarize the required fields, evidence, and implementation steps behind the templates practitioners search for.
Related Template
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Frequently Asked Questions
What is AI red teaming and how is it different from traditional security red teaming?
Does NIST AI 600-1 require red teaming before LLM deployment?
What's the difference between safety red teaming and security red teaming for LLMs?
How do you score and report AI red team findings?
Who should be on an AI red team for a financial services firm?
How often should AI red teaming occur after initial deployment?
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
Related Framework
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Keep Reading
EU AI Act Digital Omnibus: What the December 2027 Deadline Deferral Means for Financial Services AI Teams
The EU AI Act's Digital Omnibus deal, reached May 7, 2026, defers Annex III high-risk AI obligations from August 2, 2026 to December 2, 2027. Here's what changed, what didn't, and how financial services AI teams should use the extra 16 months.
May 14, 2026
AI RiskEU AI Act Article 5 Prohibited AI Systems: The Compliance Checklist Financial Institutions Can't Ignore
Article 5 prohibitions have been in force since February 2025 and the enforcement regime launched August 2025. Here's what financial institutions must audit, stop doing, and document — with the credit scoring carve-out explained.
May 12, 2026
AI RiskEU AI Act High-Risk AI in Financial Services: What Banks and Fintechs Must Document by August 2, 2026
Annex III of the EU AI Act covers credit scoring, insurance pricing, and financial standing assessment. Here's what the seven compliance obligations actually require — and who they apply to.
May 10, 2026
Immaterial Findings ✉️
Weekly newsletter
Sharp risk & compliance insights practitioners actually read. Enforcement actions, regulatory shifts, and practical frameworks — no fluff, no filler.
Join practitioners from banks, fintechs, and asset managers. Delivered weekly.