Disparate Impact Testing Techniques: Statistical Methods Examiners Actually Accept
TL;DR
- OCC Bulletin 2025-16 (July 2025) and FDIC guidance (August 2025) removed disparate impact from federal fair lending examination frameworks — but state laws (CA, MA, NJ, NY), the Fair Housing Act, and private litigation still create live exposure
- The four statistical methods regulators and plaintiffs have historically used: Adverse Impact Ratio (four-fifths rule), regression analysis, Fisher’s exact test/standard deviation analysis, and BISG proxy methodology
- Each method answers a different question; no single test is sufficient for a defensible AI model testing program
- Documentation is the exam deliverable — undocumented testing is the same as no testing when an examiner or plaintiff asks
The OCC’s July 2025 bulletin removing disparate impact from its Fair Lending examination booklet generated exactly the reaction compliance teams expected: some departments immediately asked whether they could deprioritize statistical testing. The answer is no — and the reasoning matters more than the conclusion.
Federal de-emphasis of disparate impact doesn’t extinguish the risk. It redistributes it. Massachusetts AG Andrea Joy Campbell settled with Earnest Operations for $2.5 million in July 2025 over AI underwriting disparate impact — after the federal rollback had already begun. New Jersey codified disparate impact under its Lending Anti-Discrimination Act in December 2025. New York’s FAIR Business Practices Act, effective February 2026, gives the state AG new tools to pursue algorithmic discrimination independently of federal enforcement posture.
The statistical testing discipline hasn’t become optional. It’s become more complex — because now you’re managing exposure across jurisdictions with different standards, not just one federal framework. What follows is the practitioner’s guide to the four methods that actually matter and how to document them.
Why “No Federal Disparate Impact” Doesn’t Mean “No Disparate Impact Risk”
Understanding your residual exposure requires mapping it specifically. As of mid-2026:
Federal: OCC Bulletin 2025-16 removed disparate impact references from the OCC’s Fair Lending booklet. The FDIC followed in August 2025. The CFPB finalized its ECOA Regulation B amendment eliminating the effects test effective July 21, 2026. Federally supervised institutions face disparate treatment-only examination risk at the federal level.
State law: California DFPI, Massachusetts AG, New Jersey (Lending Anti-Discrimination Act), and New York retain disparate impact liability. Institutions operating in multiple states are not shielded by the federal shift.
Fair Housing Act: The Supreme Court’s Texas Department of Housing and Community Affairs v. Inclusive Communities Project (2015) held that the FHA supports disparate impact claims for residential mortgage lending. That ruling remains in effect regardless of ECOA changes.
Private litigation: The statute of limitations under ECOA is five years. Disparities building now can be litigated by private plaintiffs under state law or FHA for years after federal enforcement contracts.
The practical posture: maintain statistical testing capability, but adapt documentation to reflect which legal framework each test is designed to satisfy.
The Four Statistical Methods Examiners and Plaintiffs Use
1. Adverse Impact Ratio (AIR) — The Screening Test
The adverse impact ratio is the starting point for any fair lending statistical review. It answers one narrow question: are two groups being approved at materially different rates?
How to calculate it:
AIR = Approval rate for protected class ÷ Approval rate for comparison group (highest rate)
The four-fifths rule, from the EEOC’s Uniform Guidelines on Employee Selection Procedures (UGESP), treats an AIR below 0.80 (80%) as a signal of adverse impact. The threshold was designed for employment selection, but regulators and litigants have applied it in fair lending contexts as well.
Practical limits: AIR is a screening test, not a finding. A group with a lower credit profile will show a lower approval rate even under a perfectly calibrated model. AIR below 0.80 doesn’t prove discrimination — it triggers the next layer of analysis. Also, AIR is unstable when sample sizes are small; you need at least 30 applicants per group before the ratio is meaningful.
At which decision points to run AIR: Application approval, pricing (APR bracket), product assignment (if multiple products have meaningfully different terms), and adverse action.
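The AIR calculation is simple enough to sketch in a few lines. The figures below are illustrative, not drawn from any real portfolio:

```python
def adverse_impact_ratio(protected_approved, protected_total,
                         comparison_approved, comparison_total):
    """AIR = protected-class approval rate / comparison-group approval rate."""
    protected_rate = protected_approved / protected_total
    comparison_rate = comparison_approved / comparison_total
    return protected_rate / comparison_rate

# Illustrative example: 120 of 200 protected-class applicants approved (60%)
# vs. 400 of 500 comparison-group applicants approved (80%).
air = adverse_impact_ratio(120, 200, 400, 500)
print(f"AIR = {air:.2f}")  # 0.60 / 0.80 = 0.75
print("Below four-fifths threshold — escalate to next layer of analysis"
      if air < 0.80 else "Passes AIR screen")
```

In practice the same function runs once per protected class per decision point, with the sample sizes logged alongside each ratio so the 30-applicants-per-group floor is visible in the documentation.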
2. Regression Analysis — Controlling for Legitimate Factors
Regression analysis answers the question AIR can’t: after controlling for the legitimate underwriting variables, does a protected class characteristic still predict outcomes?
The standard approach for mortgage and consumer credit is logistic regression (binary outcome: approved/denied) or OLS regression (continuous outcome: APR). Include all legitimate underwriting variables the model actually uses:
| Control Variable Category | Examples |
|---|---|
| Credit quality | Credit score, derogatory marks, bankruptcy, foreclosure history |
| Capacity | Debt-to-income ratio, income verification |
| Collateral | Loan-to-value ratio, appraisal value |
| Loan characteristics | Loan amount, term, purpose, product type, occupancy |
| Geography | MSA, census tract income category |
If a racial or ethnic group coefficient remains statistically significant (p < 0.05) after including all legitimate controls, you have unexplained disparate impact — the model’s structure is producing differential outcomes that can’t be attributed to credit risk factors.
For AI models specifically: The regression approach faces a challenge when the model uses hundreds of variables. Including all model inputs as controls essentially replicates the model’s logic, which can mask disparate impact. Some practitioners use a two-stage approach: run the model on a test dataset, then regress the model’s outputs against protected class status and a limited set of legitimate factors. The residual gap is what you’re measuring.
Examiner expectation (historically): The FDIC’s fair lending examination procedures describe regression analysis as the standard methodology for pricing and underwriting disparities. Even with the federal disparate impact shift, regression methodology remains the gold standard for demonstrating that you’ve looked for unexplained disparities and addressed them.
3. Fisher’s Exact Test and Standard Deviation Analysis
When sample sizes are small, Fisher’s Exact test is the statistically appropriate tool for evaluating whether differences in approval rates between groups are statistically significant or likely due to random variation.
Standard deviation (sigma) test: For larger samples, the standard deviation test computes the z-score of the difference in selection rates. A z-score with absolute value above 1.96 (p < 0.05, two-tailed) indicates the difference is statistically significant at the 95% confidence level. An absolute z-score above 2.0 has historically been treated as a presumptive trigger for further investigation in fair lending contexts.
Fisher’s Exact: Produces a p-value directly from a 2x2 contingency table (protected class vs. comparison group, approved vs. denied). Use when group counts are below 30.
Neither test tells you why the difference exists — only that it’s unlikely to be random. Statistical significance is a necessary precondition for a disparate impact claim, not a sufficient one.
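Both tests are one-liners with SciPy; the counts below are illustrative. Fisher's exact runs directly on the 2x2 table, and the standard deviation test is the classic two-proportion z-test on pooled rates:

```python
from math import sqrt
from scipy.stats import fisher_exact, norm

# Small-sample case: 12 of 20 protected-class applicants approved
# vs. 18 of 22 comparison-group applicants (illustrative counts).
table = [[12, 8],    # protected: approved, denied
         [18, 4]]    # comparison: approved, denied
_, p_fisher = fisher_exact(table)
print(f"Fisher's exact p = {p_fisher:.3f}")

# Large-sample case: two-proportion z-test (standard deviation test)
a1, n1 = 300, 500    # protected group: approvals, applicants
a2, n2 = 420, 600    # comparison group: approvals, applicants
p1, p2 = a1 / n1, a2 / n2
pooled = (a1 + a2) / (n1 + n2)
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_z = 2 * norm.sf(abs(z))    # two-tailed p-value
print(f"z = {z:.2f}, p = {p_z:.4f}")
```

In the large-sample example the 10-percentage-point gap produces a z-score beyond the 1.96 threshold, so it would trigger the next layer of analysis; the small-sample table, despite a similar rate gap, may not reach significance — which is precisely why the small-n and large-n tools differ.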
4. BISG Proxy Methodology — When You Don’t Have Race Data
Most credit models don’t contain race or ethnicity as an input field — and shouldn’t. But you can’t test for disparate impact without knowing the demographic composition of your applicant pool.
Bayesian Improved Surname Geocoding (BISG) estimates the probability that an individual belongs to each race/ethnicity category by combining:
- Surname analysis (Census Bureau surname frequency data by race/ethnicity)
- Geocoding (census tract demographic composition from ACS data)
The result is a probability vector for each applicant: e.g., P(White) = 0.72, P(Hispanic) = 0.18, P(Asian) = 0.06. These probabilities are used in weighted regression analysis rather than hard-coded classifications.
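The Bayesian update itself is a single multiply-and-normalize step. The probabilities below are illustrative stand-ins, not actual Census surname or ACS tract figures:

```python
def bisg_posterior(surname_probs, tract_shares):
    """Combine a surname-based race/ethnicity prior with the applicant's
    census tract composition: multiply category-by-category, then normalize
    so the posterior probabilities sum to 1."""
    joint = {race: surname_probs[race] * tract_shares[race]
             for race in surname_probs}
    total = sum(joint.values())
    return {race: p / total for race, p in joint.items()}

# Illustrative figures for one applicant (not real Census/ACS data):
surname = {"white": 0.55, "black": 0.10, "hispanic": 0.25, "asian": 0.10}
tract   = {"white": 0.30, "black": 0.05, "hispanic": 0.60, "asian": 0.05}
post = bisg_posterior(surname, tract)
print({race: round(p, 3) for race, p in post.items()})
```

Note how the heavily Hispanic tract pulls the Hispanic posterior well above its surname-only prior — the geocoding component doing its work. The resulting vector feeds the weighted regression as probabilities, never as a hard-coded classification.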
BISG limitations to document:
- Surname proxy is most reliable for Hispanic and Asian applicants; less reliable for Black applicants (due to lower correlation between surname and race in Census data)
- Geocoding accuracy degrades in census tracts with high racial/ethnic mixing
- Does not distinguish between subgroups within broad categories
- The CFPB accepted BISG in its supervisory guidance; state regulators have generally followed
Alternative: Disclosed data where available. For HMDA-covered loans, you have actual race/ethnicity data for most applicants. Use BISG to fill gaps for applicants who declined to disclose, not as a substitute for disclosed data.
Choosing the Right Method: A Decision Framework
| Situation | Primary Method | Secondary Method |
|---|---|---|
| Large applicant pool (500+ per group) | AIR screening → regression | Standard deviation test |
| Small applicant pool (<30 per group) | Fisher’s Exact test | Document sample limitation |
| AI model with no race/ethnicity field | BISG proxy → AIR | BISG-weighted regression |
| Pricing disparity analysis | Regression (OLS on APR) | AIR on pricing tiers |
| Adverse action disparity | AIR → Fisher’s/standard deviation | Regression on denial rates |
| Pre-deployment model testing | All four methods on test dataset | LDA search |
The LDA Obligation: What Happens When Testing Finds Disparity
A less discriminatory alternative (LDA) is an alternative model or decision rule that serves the same business purpose — predicting credit risk — with less adverse impact. Under traditional disparate impact doctrine, demonstrating adverse impact doesn’t automatically create liability; it shifts the burden to the institution to show business necessity. If a plaintiff can then identify an available LDA, liability may follow.
Proactively searching for LDAs during model development — before deployment — is the documented best practice. The search doesn’t require testing every conceivable alternative. It requires documenting what alternatives were considered, why they were rejected on business grounds, and whether the current approach represents the least discriminatory path to serving the business purpose.
A basic LDA analysis log covers:
- The current model’s adverse impact ratios by protected class
- Alternative specifications tested (different variable weightings, thresholds, feature sets)
- The AIR and business performance metrics (accuracy, default prediction) of each alternative
- The rationale for selecting the current model over alternatives with lower adverse impact
This documentation is the difference between a proactive compliance program and a reactive litigation response.
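An LDA search log can be generated mechanically from a threshold sweep. The sketch below uses synthetic scores with a built-in group gap (all figures illustrative) and records, for each candidate cutoff, the AIR alongside a business-performance metric — the raw material of the log described above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
protected = rng.integers(0, 2, n)
# Synthetic scores with a built-in 15-point gap against the protected group
score = rng.normal(640 - 15 * protected, 50, n)
# Simulated default outcomes: lower score -> higher default probability
default = (rng.uniform(size=n) < 1 / (1 + np.exp((score - 600) / 40))).astype(int)

log = []
for cutoff in (600, 620, 640):
    approved = score >= cutoff
    rate_p = approved[protected == 1].mean()
    rate_c = approved[protected == 0].mean()
    air = rate_p / rate_c
    # Business metric: share of approved loans that do not default
    perf = 1 - default[approved].mean()
    log.append((cutoff, round(air, 2), round(perf, 2)))

for cutoff, air, perf in log:
    print(f"cutoff={cutoff}  AIR={air}  non-default rate={perf}")
```

Each row of `log` maps directly onto the required artifacts: the alternative specification tested, its adverse impact ratio, and its business performance — leaving only the written rationale for the final selection.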
Documenting Testing for Exam Readiness
The question isn’t whether your testing will satisfy a federal examiner today — it’s whether it will satisfy a state examiner or plaintiff tomorrow. Documentation requirements are the same regardless of the reviewing body:
Required artifacts:
- Testing protocol memo — which groups, which decision points, which methods, which thresholds
- AIR calculations — by group, by decision point, with sample sizes
- Regression output tables — model specification, coefficients, p-values, R²
- BISG methodology documentation — if used, explain the approach, data sources, and limitations
- Findings memo — what was found, what the finding means, what action was taken
- LDA search documentation — what alternatives were evaluated and why the current approach was retained
- Monitoring plan — how often testing repeats, who reviews results, what triggers remediation
The absence of documentation — even when testing was actually conducted — is an independent finding. “We ran the numbers but didn’t write it up” is not a defensible response to a regulatory inquiry or a discovery request.
So What? Actions for Compliance and Model Risk Teams
The federal disparate impact shift creates a false sense of security for teams that conflate “OCC won’t examine for it” with “we have no exposure.” The exposure is real — it’s just more fragmented and harder to predict.
Right now:
- Map your jurisdiction footprint. If you operate in CA, MA, NJ, or NY, you have active disparate impact exposure under state law regardless of federal posture.
- Confirm which of the four statistical methods your current testing program uses and whether each decision point is covered.
- Check whether your BISG methodology documentation is current and addresses known limitations.
Before next model deployment:
- Run all four methods on the pre-deployment test dataset.
- Conduct and document an LDA search.
- Get the findings memo signed by the model owner and compliance officer before launch.
Ongoing:
- Schedule post-deployment disparity monitoring at least quarterly for high-volume, high-impact models.
- Set re-testing triggers: any model update, training data refresh, or >10% change in approval volume by product.
The statistical work isn’t going away. The regulatory audience for it has just gotten more complicated.
Need a structured pre-deployment AI risk assessment with built-in bias testing documentation? The AI Risk Assessment Template & Guide includes an AI Use Case Inventory with auto-tiering, bias testing scorecard, and worked examples across eight use cases mapped to the current regulatory landscape.
Also see: AI Bias Testing for Fair Lending: Methodologies Every Risk Team Needs, AI and Fair Lending: UDAAP Risk in Algorithmic Decisioning, and AI Risk Assessment Template: Pre-Deployment Checklist for Financial Services.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.