PII in AI Systems: How to Handle Personal Data When Using LLMs
TL;DR:
- Cyberhaven Labs found that 11% of data employees paste into ChatGPT is confidential — and most firms have no controls to stop it
- GLBA, CCPA, and state breach notification laws all apply when PII flows through LLM systems — even third-party APIs
- Build a PII protection stack: classify data before it reaches the model, detect PII in prompts and outputs, and lock down vendor contracts with AI-specific data handling clauses
Samsung engineers pasted proprietary source code into ChatGPT in March 2023. Three separate incidents. Confidential semiconductor code, internal meeting transcripts, and test sequences — all fed into a third-party AI system with no data retention controls. Samsung banned ChatGPT company-wide within weeks.
That was source code. Now imagine it’s customer Social Security numbers, account balances, or loan application data. For financial services firms deploying LLMs, PII handling in AI systems isn’t a theoretical risk — it’s a compliance obligation with teeth.
The PII Attack Surface in LLM Systems
Personal data can leak into and out of LLM systems at every stage. Understanding the full attack surface is the first step to controlling it.
Input-Side Exposure
The most common vector: employees pasting customer data directly into LLM prompts. Cyberhaven Labs analyzed how 1.6 million workers use ChatGPT and found that 11% of the data pasted into the tool was confidential. That includes customer records, financial data, and proprietary information.
The input-side attack surface includes:
| Vector | Example | Risk Level |
|---|---|---|
| Direct prompts | "Summarize this customer complaint" (with full name, account number, SSN attached) | Critical |
| Fine-tuning datasets | Training an internal model on customer service transcripts containing NPI | High |
| RAG retrieval databases | Vector databases populated with unredacted customer documents | High |
| Context windows | Multi-turn conversations where PII accumulates across messages | Medium |
| File uploads | Uploading spreadsheets with customer data for analysis | Critical |
Output-Side Exposure
LLMs can also leak PII in their outputs — especially models trained or fine-tuned on data containing personal information. The OWASP Top 10 for LLM Applications (2025) ranks Sensitive Information Disclosure as the #2 risk, up from #6 in the previous version. Output risks include:
- Memorization attacks: Models can memorize and regurgitate training data, including PII, when prompted with the right context
- Inference attacks: Even without direct memorization, models can infer sensitive attributes from seemingly anonymous data
- Prompt leakage: System prompts containing data classification rules or customer-specific context leaking through crafted queries
What the Regulations Actually Require
If you’re in financial services, multiple regulatory frameworks apply when PII flows through AI systems. Here’s what each demands.
GLBA (Gramm-Leach-Bliley Act)
GLBA’s Financial Privacy Rule restricts when financial institutions can disclose “nonpublic personal information” (NPI) to nonaffiliated third parties. When an employee pastes customer NPI into a third-party LLM API, that’s a disclosure. The Safeguards Rule requires administrative, technical, and physical safeguards for customer information — including when it’s processed by AI systems.
What this means for your AI program:
- Sending customer NPI to a third-party LLM provider (OpenAI, Anthropic, Google) without contractual safeguards likely violates GLBA
- Your information security program must cover AI-related data flows
- Consumer opt-out rights apply to AI-processed data sharing
CCPA / CPRA
California’s privacy law got significantly more relevant for AI in July 2025 when the California Privacy Protection Agency finalized regulations on automated decision-making technology (ADMT). Key requirements effective January 1, 2026:
- Risk assessments required before using AI/ADMT for significant decisions
- Consumer notification when AI is used to make decisions about them
- Opt-out rights for automated decision-making
- Data minimization — only process personal information that’s “reasonably necessary and proportionate”
The data minimization requirement is particularly sharp for LLMs. If you’re feeding an entire customer file into a model when you only need three data points, that’s a compliance gap.
State Breach Notification Laws
All 50 states plus DC have breach notification laws. If PII is exposed through an AI system — whether through a prompt injection attack, a model memorization leak, or an employee error — it triggers the same notification requirements as any other breach.
California’s SB 446, effective January 1, 2026, tightens the timeline: breach notifications must be sent to consumers within 30 calendar days of discovery. That’s a hard deadline — and AI-related breaches are notoriously difficult to scope quickly because you often can’t determine exactly what data the model processed or retained.
Building a PII Protection Stack for LLM Systems
Policies won’t save you. Architecture will. Here’s a layered approach to protecting PII in AI systems.
Layer 1: Data Classification Before the Model
Classify data before it reaches any LLM. This is your first line of defense.
Implementation steps:
- Extend your data classification schema to include AI-specific categories: “Approved for LLM Processing,” “Requires Redaction Before LLM,” “Prohibited from LLM Use”
- Map data elements to classification levels. At minimum: SSN, account numbers, and dates of birth are always prohibited. Names and addresses require context-dependent evaluation.
- Integrate DLP tools with your LLM gateway. Modern DLP solutions can inspect prompts before they reach the API endpoint and block or redact sensitive elements in real time.
- Create an approved data inventory — a positive list of data types allowed in LLM interactions, rather than trying to block everything bad
Who owns this: Chief Data Officer or Head of Data Governance. If you don’t have one, your CISO or Head of Compliance.
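The "positive list" approach in Layer 1 can be sketched in a few lines. This is a minimal illustration, not a real DLP product's API: the category names mirror the schema above, and the `FIELD_POLICY` entries and `gate()` helper are hypothetical. The key design choice is that unknown fields default to prohibited, so new data types fail closed.

```python
from enum import Enum

class LLMPolicy(Enum):
    APPROVED = "Approved for LLM Processing"
    REDACT_FIRST = "Requires Redaction Before LLM"
    PROHIBITED = "Prohibited from LLM Use"

# Positive list: map known field names to a policy. Anything not on the
# list defaults to PROHIBITED rather than slipping through.
FIELD_POLICY = {
    "ticket_summary": LLMPolicy.APPROVED,
    "customer_name": LLMPolicy.REDACT_FIRST,
    "ssn": LLMPolicy.PROHIBITED,
    "account_number": LLMPolicy.PROHIBITED,
}

def gate(fields: dict) -> dict:
    """Return only fields approved for LLM use; raise on prohibited data."""
    allowed = {}
    for name, value in fields.items():
        policy = FIELD_POLICY.get(name, LLMPolicy.PROHIBITED)
        if policy is LLMPolicy.PROHIBITED:
            raise ValueError(f"field '{name}' is prohibited from LLM use")
        if policy is LLMPolicy.REDACT_FIRST:
            value = "[REDACTED]"
        allowed[name] = value
    return allowed
```

In practice the policy table lives in your data governance catalog, not in code, but the fail-closed default is the part worth copying.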
Layer 2: PII Detection in Prompts and Outputs
Even with classification controls, PII will leak through. Detection is your safety net.
Prompt-side detection:
- Deploy named entity recognition (NER) models trained on financial services data to scan prompts before submission
- Flag patterns: SSN formats (XXX-XX-XXXX), account number formats, email addresses, phone numbers
- Use regex + ML hybrid approaches — regex catches formats, ML catches context (“my customer John Smith at 123 Main St”)
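The regex half of that hybrid can be sketched as below. This is an illustrative minimum, not production-grade detection: the patterns cover only the formats named above, and a real deployment would pair this with an NER model to catch contextual PII like names and street addresses.

```python
import re

# Format-based patterns for the most common US PII shapes. These catch
# structure, not context -- the ML half of the hybrid handles the rest.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def scan_prompt(text: str) -> list:
    """Return (pii_type, matched_text) pairs found in a prompt."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((pii_type, match.group()))
    return hits
```

Run this before the prompt leaves your perimeter; anything it returns gets blocked or redacted, and the event gets logged.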
Output-side detection:
- Scan all model outputs before they reach the end user
- Flag any PII that wasn’t present in the original prompt (potential memorization leak)
- Log flagged outputs for compliance review
Technical implementation:
- Route all LLM traffic through a secure API gateway that inspects both requests and responses
- Set up automated alerts when PII detection thresholds are exceeded
- Maintain audit logs of all detected PII instances for regulatory examination
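Tying the pieces together, the gateway's inspection step looks roughly like this. It's a sketch under stated assumptions: `scan` is any PII detector with the shape shown earlier, and `call_llm` stands in for your provider call; both are placeholders, not a specific vendor's API. The output check implements the memorization-leak rule above: PII in the response that was never in the prompt gets quarantined.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_gateway")

def guarded_completion(prompt: str, call_llm, scan) -> str:
    """Inspect both the request and the response for PII."""
    prompt_pii = {match for _, match in scan(prompt)}
    if prompt_pii:
        # Request-side block: PII never leaves the perimeter.
        log.warning("blocked prompt containing %d PII item(s)", len(prompt_pii))
        raise PermissionError("prompt contains PII; redact before submission")

    output = call_llm(prompt)
    # PII in the output that was absent from the prompt suggests a
    # memorization leak: withhold and route to compliance review.
    leaked = {match for _, match in scan(output)} - prompt_pii
    if leaked:
        log.warning("possible memorization leak: %d item(s) quarantined", len(leaked))
        return "[OUTPUT WITHHELD PENDING REVIEW]"
    return output
```

A real gateway would sit as a reverse proxy in front of the provider API so this check can't be bypassed by individual applications.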
Layer 3: De-Identification and Data Masking
When you legitimately need to process data that contains PII through an LLM, de-identify it first.
Techniques by effectiveness:
| Technique | How It Works | Strength | Limitation |
|---|---|---|---|
| Tokenization | Replace PII with random tokens, maintain a secure mapping table | Strong — reversible for authorized users | Requires secure token vault |
| Data masking | Replace real values with realistic fake values | Strong for testing/development | Irreversible — can’t map back |
| K-anonymity | Ensure each record is indistinguishable from at least k-1 others | Good for datasets | Doesn’t protect against attribute disclosure |
| Differential privacy | Add calibrated noise to data or model outputs | Mathematically provable privacy guarantees | Reduces model accuracy |
| Synthetic data generation | Create statistically similar but entirely artificial datasets | Eliminates direct PII exposure | May not preserve all data relationships |
For most financial services LLM use cases, tokenization is the right answer. Replace customer names, account numbers, and SSNs with tokens before the data hits the model. Map them back on the other side. The model never sees real PII.
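The tokenization round trip can be sketched as below. This is a simplified illustration: in production the vault is an encrypted, access-controlled store with its own audit trail, not an in-memory dict, and the token format here is an arbitrary choice.

```python
import secrets

class TokenVault:
    """Replace PII with opaque tokens; reverse the mapping for authorized users."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str, kind: str) -> str:
        # Reuse the existing token so the same customer maps consistently
        # across a conversation.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = f"<{kind}_{secrets.token_hex(4)}>"
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, text: str) -> str:
        # Map tokens in the model's reply back to real values.
        for token, value in self._token_to_value.items():
            text = text.replace(token, value)
        return text
```

The prompt carries only tokens, the model's reply comes back through `detokenize`, and the mapping never leaves your perimeter.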
Layer 4: Privacy-Preserving AI Approaches
For firms building or fine-tuning their own models on customer data, consider architectural approaches that minimize PII exposure:
- Federated learning: Train models across decentralized data sources without centralizing the raw data. Each node trains locally and only shares model updates, not the underlying PII. Research published in the International Journal of Computer Applications describes federated learning as a privacy-enforced analytics approach well suited to fintech applications.
- On-premise / private cloud deployment: Run LLMs within your security perimeter rather than sending data to third-party APIs. Eliminates the GLBA disclosure issue entirely.
- Retrieval-Augmented Generation (RAG) with access controls: Keep PII in a secured, access-controlled database. The LLM retrieves relevant context at query time but doesn’t store it in model weights.
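The access-control point in the RAG approach can be sketched as below. This is a toy illustration, assuming documents carry a role-based ACL and retrieval is a plain substring match; a real deployment would enforce the same filter through a vector database's metadata filtering so unauthorized documents are never candidates for retrieval. The `Document` type and `retrieve` helper are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    allowed_roles: set = field(default_factory=set)

def retrieve(query: str, docs: list, user_roles: set, k: int = 3) -> list:
    """Return up to k documents the requesting user is authorized to see."""
    # Filter by ACL *before* matching, so restricted content never
    # reaches the LLM's context window for this user.
    visible = [d for d in docs if d.allowed_roles & user_roles]
    hits = [d.text for d in visible if query.lower() in d.text.lower()]
    return hits[:k]
```

The ordering matters: authorization runs before relevance ranking, so the model's context is scoped to the user's entitlements, not just to the query.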
Vendor Contract Requirements
If you’re using a third-party LLM provider, your vendor contract is a critical control. Most standard AI service agreements are insufficient for financial services PII handling.
Required contract clauses:
- No training on customer data: Explicit prohibition on using your prompts, inputs, or outputs to train or improve the provider’s models. Get this in writing — not buried in a terms of service update.
- Data retention limits: Maximum retention period for prompt data, with automatic deletion. Zero retention is ideal; 30 days maximum.
- Subprocessor restrictions: Require notification and approval before the provider uses subprocessors who may access your data.
- Breach notification obligations: Provider must notify you within 24-72 hours of any security incident involving your data — faster than the 30-day consumer notification window.
- Audit rights: Right to audit the provider’s data handling practices, including how prompts are stored, processed, and deleted.
- Data residency: Specify where your data is processed and stored. Critical for firms subject to data localization requirements.
- Return/deletion obligations: Clear process for retrieving or certifying deletion of all your data upon contract termination.
The FTC has explicitly warned that AI companies using customer data for training without clear notice or affirmative consent risk enforcement action under Section 5 of the FTC Act. Make sure your contract addresses this.
30/60/90-Day Implementation Roadmap
Days 1-30: Discovery and Quick Wins
- Week 1: Inventory all AI/LLM tools in use (approved and shadow AI). Survey employees. Check network logs for API calls to OpenAI, Anthropic, Google AI, and other providers.
- Week 2: Classify the data types flowing through each tool. Flag any NPI/PII exposure.
- Week 3: Deploy basic DLP rules to block SSNs, account numbers, and other high-risk PII from LLM prompts. Even regex-based blocking catches the most common patterns.
- Week 4: Review and update vendor contracts for all approved AI tools. Send addendum requests for missing clauses.
Owner: CISO or Head of Compliance. Deliverable: AI data flow inventory and gap assessment.
Days 31-60: Architecture and Controls
- Week 5-6: Deploy a secure LLM gateway that routes all AI traffic through a single inspection point. Implement prompt scanning and output monitoring.
- Week 7-8: Build or integrate PII detection models. Test against your actual data patterns (financial services PII is specific — account numbers, SWIFT codes, loan IDs have distinct formats).
Owner: Engineering/IT Security lead. Deliverable: LLM gateway with PII detection deployed.
Days 61-90: Governance and Monitoring
- Week 9-10: Establish monitoring dashboards tracking PII detection rates, blocked prompts, and policy violations. Set alerting thresholds.
- Week 11-12: Conduct tabletop exercise: “An employee fed 10,000 customer records into ChatGPT. Now what?” Test your breach notification workflow, regulatory reporting, and customer communication.
Owner: CRO or Head of Risk. Deliverable: Operational monitoring and tested incident response playbook.
So What?
Every financial institution using LLMs is processing personal data through AI — whether they’ve built controls for it or not. The regulatory landscape is tightening: CCPA’s ADMT rules are live, GLBA has always applied to AI data flows (firms just weren’t thinking about it), and state breach notification laws don’t care whether the breach happened through a traditional database or a chatbot prompt.
The firms that get ahead will treat PII in AI systems the same way they treat PII in traditional systems: classify it, protect it, detect exposures, and have a tested plan for when things go wrong. The firms that don’t will learn the hard way — from an examiner’s findings letter or a state AG investigation.
If you’re building your AI data privacy controls and want a head start, the Data Privacy Compliance Kit includes data classification frameworks, vendor assessment templates, and breach notification workflows that adapt directly to AI use cases.
For more on preventing data leakage in AI systems, see our guide on AI Data Leakage Prevention: Protecting Sensitive Data When Employees Use LLMs. For training data governance fundamentals, check out AI Training Data Governance: Managing Data Quality, Consent, and Provenance.
FAQ
Does GLBA apply when employees use ChatGPT with customer data?
Yes. GLBA’s Financial Privacy Rule restricts disclosure of nonpublic personal information to nonaffiliated third parties. When an employee inputs customer NPI into a third-party LLM API, that constitutes a disclosure. The Safeguards Rule also requires that your information security program covers AI-related data flows, including controls on what data can be sent to external AI services.
What’s the difference between data masking and tokenization for AI systems?
Data masking permanently replaces real values with realistic but fake data — useful for testing and development but irreversible. Tokenization replaces PII with random tokens while maintaining a secure mapping table, allowing authorized users to reverse the process. For production LLM use cases where you need to map results back to real customers, tokenization is the better choice. For model training and development, data masking or synthetic data generation eliminates PII exposure entirely.
Do state breach notification laws apply if PII is exposed through an AI system?
Yes. Every U.S. state has breach notification laws that apply regardless of how the exposure occurred. If PII is exposed through a prompt injection attack, a model memorization leak, or an employee error involving an AI system, you must follow the same notification procedures as any other data breach. California’s SB 446, effective January 1, 2026, requires notification within 30 calendar days of discovery — and AI-related breaches are often harder to scope because determining exactly what data the model processed or retained can be technically complex.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.