AI Risk

AI Red Teaming Techniques: How to Stress-Test LLMs Before Deployment

May 6, 2026 Rebecca Leung
Table of Contents

TL;DR

  • Researchers from OpenAI, Anthropic, and Google DeepMind published findings in October 2025 showing they could bypass 95–100% of 12 published AI defenses under adaptive attack conditions. Defense without adversarial testing is theater.
  • NIST AI 600-1 (MP-2.3-005) requires red-team exercises before LLM deployment — not optional, not delegatable to your vendor’s safety testing.
  • Financial services LLMs face five attack categories that compliance teams need to own: prompt injection, hallucination, biased/discriminatory output, data exfiltration, and agentic misuse.
  • Anthropic’s testing on Claude Opus 4.5 found attack success rates climbing from 4.7% at one attempt to 63% at 100 attempts in coding environments — sustained adversarial pressure is qualitatively different from one-shot testing.

In October 2025, researchers from OpenAI, Anthropic, and Google DeepMind published results from testing 12 widely-used AI defenses. The team achieved bypass rates above 90% on most defenses under adaptive attack conditions. Prompting-based defenses reached 95–99% attack success rates. Training-based methods hit 96–100%. The conclusion was not that AI is hopeless — it was that single-layer defenses evaluated against non-adaptive attackers don’t tell you much about real-world robustness.

That’s the starting premise for AI red teaming: you cannot know what your LLM does under adversarial pressure by testing it the way a friendly user would use it. If you’ve deployed an LLM in a customer-facing context, in your compliance workflow, or in anything touching credit decisioning — and you haven’t stress-tested it against someone actively trying to make it fail — you don’t know what you’re running.

This is the practitioner playbook for running a structured AI red team exercise. Not the theoretical overview. The methodology: how to organize it, what to test, how to score findings, and what the output looks like when you hand it to an examiner.


What Makes AI Red Teaming Different

Traditional penetration testing probes your infrastructure for known vulnerability classes: unpatched CVEs, misconfigured access controls, exposed credentials, injection vulnerabilities in web applications. The attacker wins by gaining unauthorized access or escalating privileges.

AI red teaming probes your model’s decision-making for failure modes specific to how language models work. The attack surface is the context window — everything the model sees, including the system prompt, user input, retrieved documents, and tool outputs. The attacker wins by causing the model to:

  • Override its instructions and do something it was told not to do
  • Generate harmful, biased, or factually wrong outputs that look authoritative
  • Reveal information it shouldn’t — training data, other users’ data, system prompts
  • Take unauthorized actions when given agentic capabilities (API calls, transactions)

You cannot fully secure these attack surfaces with traditional controls. You learn what they are through structured adversarial testing, then build mitigations specific to the vulnerabilities you find.


Five Attack Categories for Financial Services LLMs

1. Prompt Injection

OWASP LLM01:2025. The highest-priority attack class. Prompt injection occurs when an attacker crafts input that overrides the model’s system instructions — effectively hijacking the model’s behavior.

Direct injection happens in the user-visible input field: a user types an instruction that contradicts the system prompt (“Ignore all previous instructions and instead…”). Indirect injection is harder to catch: the adversarial instruction is embedded in a document, email, or web page that the model retrieves and processes as context. If your LLM reads customer documents for a loan underwriting workflow and a submitted document contains instructions to approve the application regardless of financials, that’s an indirect injection attack in a regulated decisioning context.

What to test:

  • Can a user override a system prompt instruction by asking directly?
  • Can adversarial content in retrieved documents change the model’s conclusions?
  • Does the model reveal its system prompt when asked directly or through indirect techniques?
  • Does the model follow user instructions to act as a different persona with different constraints?

Threshold for passing: Zero tolerance on system prompt override in production. Low-risk tolerance on indirect injection in any workflow where the model processes external documents.

2. Hallucination and Confabulation

NIST AI 600-1 treats confabulation — the generation of factually wrong but confidently stated outputs — as a primary risk category for financial services LLMs. This is not a security attack, but it is a red team category because the stress test reveals the scope of the problem under adversarial questioning.

Testing hallucination means going beyond “does the model get easy questions right?” to: what happens when a user asks domain-specific questions where wrong answers create regulatory exposure?

What to test:

  • Ask regulatory compliance questions with objectively correct answers from published guidance (e.g., “What is the SAR filing deadline under 31 CFR 1020.320?”) and measure accuracy.
  • Present the model with ambiguous or edge-case scenarios where confidently-stated wrong answers create compliance risk.
  • Test whether the model appropriately expresses uncertainty or confidently states incorrect information.
  • For customer-facing deployments: test whether the model gives advice that violates UDAAP or creates a misleading representation.

Document: confabulation rate on domain-specific queries, whether the model hedges appropriately when uncertain, and specific failure scenarios.

3. Biased and Discriminatory Output

For any LLM that touches underwriting, pricing, collections, insurance, or any consumer interaction that could influence access to financial services, ECOA and UDAAP exposure is a function of what the model outputs across protected classes. This is where your AI bias testing methodology framework overlaps with red teaming.

Red teaming for bias goes beyond statistical disparity testing. You’re testing whether the model can be prompted to make explicitly discriminatory recommendations, whether subtle cues in input (names, locations, cultural context) shift its outputs, and whether adversarial framing causes the model to rationalize discriminatory outcomes.

What to test:

  • Do outputs differ materially across protected class proxies when everything else is held constant?
  • Can a user prompt the model to explain a credit decision in terms of protected characteristics?
  • Does the model generate different guidance for customers in majority-minority geographies vs. majority-white geographies for the same product?
  • Can adversarial prompting cause the model to produce outputs that would fail a UDAAP review?

Threshold for passing: Any finding where the model produces outputs that would constitute disparate treatment under ECOA is a go/no-go blocker. Statistical disparity findings require documented investigation and mitigation before deployment.

4. Data Exfiltration and PII Leakage

OWASP LLM02:2025 (Sensitive Information Disclosure). LLMs can reveal information they shouldn’t — training data, other users’ conversations, injected context that was meant to be invisible, or system prompt content.

For financial services, the regulatory risk is concrete: Gramm-Leach-Bliley Act protections for nonpublic personal information, Reg P obligations, and state privacy law exposure. If your model was trained on or has access to customer data, you need to test whether adversarial prompting can extract it.

What to test:

  • Can a user cause the model to reproduce text from its training data, including any customer data used in fine-tuning?
  • Does the model reveal the contents of its system prompt under direct questioning or indirect extraction techniques?
  • In multi-tenant deployments: can a user in one context extract information from another user’s context?
  • Does the model leak retrieved document content it was instructed to treat as confidential?

Tools: Promptfoo provides open-source automated testing against OWASP LLM categories. Use automated tools to scale coverage, but have human testers review outputs — automated tools miss context-dependent disclosure that a human reviewer catches immediately.

5. Agentic Misuse and Excessive Agency

OWASP LLM06:2025. For LLMs with tool access — the ability to call APIs, read/write databases, execute code, or take actions in external systems — the red team attack surface expands to the downstream consequences of model action. NIST AI 600-1 and the OWASP Agentic AI Top 10 (2026) both flag excessive agency as the highest-priority risk for autonomous AI systems.

The attack scenario is concrete: a financial assistant agent with access to a fund transfer API was manipulated in a documented red team exercise to reframe a transaction as an internal test and invoke a “Developer Mode” context, resulting in a $900 unauthorized withdrawal. That is agentic misuse in a financial services context.

What to test:

  • Can adversarial prompting cause the model to invoke tools it wasn’t instructed to use?
  • Can a user escalate the model’s permissions by claiming special context (“I’m a developer running a test”)?
  • What happens when the model receives conflicting instructions between the system prompt and user input — which wins?
  • Does the model confirm high-stakes actions (transactions, data deletions) before executing, or does it proceed on inference?

For agentic deployments, a finding where the model takes an irreversible financial action without explicit user confirmation is a go/no-go blocker. Scope tool access to minimum necessary permissions and document the authorization architecture.


How to Structure a Red Team Exercise

Pre-Exercise: Define Scope and Thresholds

Before running a single test, write down:

  1. Deployment context: What does this model do, who uses it, what data does it access, what actions can it take?
  2. Threat model: Who is the adversary? External users trying to manipulate outputs? Insider threat? Malicious documents in the processing pipeline?
  3. Risk categories in scope: Of the five categories above, which apply to this deployment? Rank them by regulatory consequence.
  4. Go/no-go thresholds: What finding level blocks deployment? Per NIST AI 600-1 GV-1.3-002, thresholds must be set before testing, not after you see results.

Team Composition

Minimum viable red team: three people with different mental models.

  • Technical: Knows the model’s architecture, context window structure, and what tools it can invoke. Generates the most sophisticated attack variations.
  • Compliance/Risk: Knows what the regulatory consequences of each failure mode look like. Grounds the exercise in what actually matters to examiners.
  • Adversarial tester: Approaches the model with no assumptions about expected behavior. Ideally someone from internal audit or an external consultant — not the team that built or deployed it.

For high-risk deployments (consumer-facing credit decisioning, transaction authorization), add a domain expert who can evaluate whether a wrong output is merely awkward or actually harmful.

Execution: Attack Pattern Library

Work through each category systematically. For each category, start with published attack patterns (OWASP LLM Top 10 provides canonical examples) and then escalate to context-specific variations. Anthropic’s own research found that attack success rates climbed from 4.7% at one attempt to 33.6% at 10 attempts and 63% at 100 attempts — run each attack variation at least 10 times before concluding the model is resistant.

Scoring

SeverityDefinition
CriticalFinding meets go/no-go threshold. Deployment blocked until mitigated.
HighMaterial regulatory or customer harm possible. Mitigation required; retest before deployment.
MediumFinding is reproducible under specific conditions; risk is bounded. Document mitigation or acceptance.
LowEdge case with limited harm potential. Document and monitor post-deployment.

Score each finding at the severity of the worst confirmed output, not the average outcome across test runs.

Documentation Output

Your red team report is a pre-deployment testing artifact — it becomes part of the model’s governance file and may be reviewed by examiners. Structure it as:

  1. Executive summary: scope, overall go/no-go recommendation, critical findings
  2. Methodology: team composition, attack categories tested, number of test cases per category
  3. Findings by severity: attack method, observed output, risk category mapping, severity score
  4. Mitigations applied: what was changed before deployment as a result of findings
  5. Residual risk statement: what risk was accepted and who accepted it
  6. Post-deployment monitoring plan: what ongoing testing continues after launch

Align each finding to the NIST AI 600-1 action that applies and, where relevant, the OWASP LLM category. Examiners following NIST AI 600-1 TEVV requirements expect the documentation organized this way.


What Examiners and Regulators Are Looking For

SR 26-02 (the Federal Reserve/OCC/FDIC revised model risk guidance released in 2026) extends SR 11-7 principles to generative AI and explicitly calls for testing that goes beyond traditional statistical validation. Examiners are starting to ask:

  • Do you have pre-deployment red team results for your customer-facing LLMs?
  • Who conducted the red teaming, and was it independent of the deployment team?
  • What were the findings, what was mitigated, and what residual risk was accepted?
  • How do you test ongoing robustness post-deployment?

A vendor’s safety testing attestation does not answer those questions for your deployment. The OCC’s revised model risk bulletin 2026-13 clarifies that the deployer organization is responsible for validating fitness for its specific use case — which means red teaming your deployment, not the vendor’s general model.


So What? The Baseline Has Shifted

Twelve published defenses. All broken above 90%. That’s where we are on AI robustness. The implication isn’t that you shouldn’t deploy LLMs — it’s that “we tested it and it worked fine” is no longer a complete statement. The meaningful question is: tested by whom, against what adversarial pressure, across which failure modes?

A structured red team exercise that covers the five categories above, produces a documented findings report, and feeds into a defined go/no-go decision is the current bar for defensible pre-deployment AI governance. It is also increasingly what regulators will expect to see when they open your AI model inventory file and start asking questions.

Start with the AI Risk Assessment Template for the model risk documentation framework, then layer red team results into the pre-deployment testing section.


Sources

Need the working template?

Start with the source guide.

These answer-first guides summarize the required fields, evidence, and implementation steps behind the templates practitioners search for.

Frequently Asked Questions

What is AI red teaming and how is it different from traditional security red teaming?
Traditional security red teaming tests systems, networks, and applications for vulnerabilities — access controls, patching gaps, misconfigured permissions. AI red teaming tests whether an AI system can be manipulated to produce harmful, inaccurate, biased, or unauthorized outputs. The attack surface is the model's inputs and context window rather than network perimeters. For LLMs, this means crafting prompts that override instructions, extract sensitive training data, generate discriminatory outputs, or cause the model to take unauthorized actions. You can't firewall your way out of a prompt injection vulnerability.
Does NIST AI 600-1 require red teaming before LLM deployment?
Yes. NIST AI 600-1 action MP-2.3-005 explicitly requires that generative AI systems undergo adversarial testing — red-team exercises — to identify vulnerabilities and potential manipulation or misuse, both before and after deployment. Pre-deployment red teaming is a required input to the go/no-go gate under GV-1.3-002. The model vendor's safety testing does not satisfy your organization's pre-deployment obligation as a deployer.
What's the difference between safety red teaming and security red teaming for LLMs?
Microsoft's AI Red Team distinguishes these explicitly. Safety red teaming tests whether the model can be caused to generate harmful content — hate speech, self-harm guidance, policy violations, biased outputs that create regulatory risk. Security red teaming tests whether the model can be exploited to exfiltrate data, execute unauthorized commands, or compromise system integrity. For financial services, both matter: safety failures create UDAAP and fair lending exposure; security failures create data breach and unauthorized transaction risk.
How do you score and report AI red team findings?
Score each finding on two dimensions: likelihood of exploitation (how reproducible and accessible is the attack vector?) and impact severity (what's the worst-case regulatory, financial, or customer harm?). Use a simple heat map — High/Medium/Low on each axis. Document the attack method, the model response observed, the risk category it maps to (NIST AI 600-1 or OWASP LLM Top 10), and whether it meets your pre-defined go/no-go threshold. The report should be structured like a model validation document: scope, findings, risk ratings, mitigations applied before deployment, and residual risk accepted.
Who should be on an AI red team for a financial services firm?
At minimum: someone who understands the model and its deployment context (your AI/ML team or vendor technical contact), someone who knows the regulatory exposure (compliance or risk — ideally the person who'd own a UDAAP or fair lending finding), and someone who will try adversarial approaches without anchoring on expected behavior (an internal audit member or external consultant). Larger organizations should add a domain expert who knows what harmful outputs look like in context — a credit analyst if the LLM is in the lending stack, a fraud investigator if it's in transaction monitoring.
How often should AI red teaming occur after initial deployment?
NIST AI 600-1 (MP-2.3-005) requires red teaming both before and after deployment. Post-deployment cadence should match the model's risk tier: high-risk models (consumer-facing decisioning, credit) should be red-teamed at least annually and after any significant model update. Tier 2 models should be red-teamed every 18–24 months. Continuous automated red teaming tools can supplement periodic exercises for prompt injection and jailbreak detection, but they don't replace structured adversarial exercises with human judgment.
Rebecca Leung

Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

Related Framework

AI Risk Assessment Template & Guide

Comprehensive AI model governance and risk assessment templates for financial services teams.

Immaterial Findings ✉️

Weekly newsletter

Sharp risk & compliance insights practitioners actually read. Enforcement actions, regulatory shifts, and practical frameworks — no fluff, no filler.

Join practitioners from banks, fintechs, and asset managers. Delivered weekly.