Disparate Impact Testing Techniques: Statistical Methods Examiners Actually Accept
TL;DR
- OCC Bulletin 2025-16 (July 2025) and FDIC guidance (August 2025) removed disparate impact from federal fair lending examination frameworks — but state laws (CA, MA, NJ, NY), the Fair Housing Act, and private litigation still create live exposure
- The four statistical methods regulators and plaintiffs have historically used: Adverse Impact Ratio (four-fifths rule), regression analysis, Fisher’s exact test/standard deviation analysis, and BISG proxy methodology
- Each method answers a different question; no single test is sufficient for a defensible AI model testing program
- Documentation is the exam deliverable — undocumented testing is the same as no testing when an examiner or plaintiff asks
The OCC’s July 2025 bulletin removing disparate impact from its Fair Lending examination booklet generated exactly the reaction compliance teams expected: some departments immediately asked whether they could deprioritize statistical testing. The answer is no — and the reasoning matters more than the conclusion.
Federal de-emphasis of disparate impact doesn’t extinguish the risk. It redistributes it. Massachusetts AG Andrea Joy Campbell settled with Earnest Operations for $2.5 million in July 2025 over AI underwriting disparate impact — after the federal rollback had already begun. New Jersey codified disparate impact under its Lending Anti-Discrimination Act in December 2025. New York’s FAIR Business Practices Act, effective February 2026, gives the state AG new tools to pursue algorithmic discrimination independently of federal enforcement posture.
The statistical testing discipline hasn’t become optional. It’s become more complex — because now you’re managing exposure across jurisdictions with different standards, not just one federal framework. What follows is the practitioner’s guide to the four methods that actually matter and how to document them.
Why “No Federal Disparate Impact” Doesn’t Mean “No Disparate Impact Risk”
Understanding your residual exposure requires mapping it specifically. As of mid-2026:
Federal: OCC Bulletin 2025-16 removed disparate impact references from the OCC’s Fair Lending booklet. The FDIC followed in August 2025. The CFPB finalized its ECOA Regulation B amendment eliminating the effects test effective July 21, 2026. Federally supervised institutions face disparate treatment-only examination risk at the federal level.
State law: California DFPI, Massachusetts AG, New Jersey (Lending Anti-Discrimination Act), and New York retain disparate impact liability. Institutions operating in multiple states are not shielded by the federal shift.
Fair Housing Act: The Supreme Court’s Texas Department of Housing and Community Affairs v. Inclusive Communities Project (2015) held that the FHA supports disparate impact claims for residential mortgage lending. That ruling remains in effect regardless of ECOA changes.
Private litigation: The statute of limitations under ECOA is five years. Disparities building now can be litigated by private plaintiffs under state law or FHA for years after federal enforcement contracts.
The practical posture: maintain statistical testing capability, but adapt documentation to reflect which legal framework each test is designed to satisfy.
The Four Statistical Methods Examiners and Plaintiffs Use
1. Adverse Impact Ratio (AIR) — The Screening Test
The adverse impact ratio is the starting point for any fair lending statistical review. It answers one narrow question: are two groups being approved at materially different rates?
How to calculate it:
AIR = Approval rate for protected class ÷ Approval rate for comparison group (highest rate)
The four-fifths rule, from the EEOC’s Uniform Guidelines on Employee Selection Procedures (UGESP), treats an AIR below 0.80 (80%) as a signal of adverse impact. The threshold was designed for employment selection, but regulators and litigants have applied it in fair lending contexts as well.
Practical limits: AIR is a screening test, not a finding. A group with a lower credit profile will show a lower approval rate even under a perfectly calibrated model. AIR below 0.80 doesn’t prove discrimination — it triggers the next layer of analysis. Also, AIR is unstable when sample sizes are small; you need at least 30 applicants per group before the ratio is meaningful.
At which decision points to run AIR: Application approval, pricing (APR bracket), product assignment (if multiple products have meaningfully different terms), and adverse action.
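The AIR calculation is simple enough to sketch in a few lines. The figures below are illustrative, not drawn from any real portfolio:

```python
def adverse_impact_ratio(protected_approved, protected_total,
                         comparison_approved, comparison_total):
    """AIR = protected-class approval rate / comparison-group approval rate."""
    protected_rate = protected_approved / protected_total
    comparison_rate = comparison_approved / comparison_total
    return protected_rate / comparison_rate

# Illustrative example: 120 of 200 protected-class applicants approved (60%)
# vs. 400 of 500 comparison-group applicants approved (80%).
air = adverse_impact_ratio(120, 200, 400, 500)
print(f"AIR = {air:.2f}")  # 0.60 / 0.80 = 0.75
print("Below four-fifths threshold — escalate to next layer of analysis"
      if air < 0.80 else "Passes AIR screen")
```

In practice the same function runs once per protected class per decision point, with the sample sizes logged alongside each ratio so the 30-applicants-per-group floor is visible in the documentation.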
2. Regression Analysis — Controlling for Legitimate Factors
Regression analysis answers the question AIR can’t: after controlling for the legitimate underwriting variables, does a protected class characteristic still predict outcomes?
The standard approach for mortgage and consumer credit is logistic regression (binary outcome: approved/denied) or OLS regression (continuous outcome: APR). Include all legitimate underwriting variables the model actually uses:
| Control Variable Category | Examples |
|---|---|
| Credit quality | Credit score, derogatory marks, bankruptcy, foreclosure history |
| Capacity | Debt-to-income ratio, income verification |
| Collateral | Loan-to-value ratio, appraisal value |
| Loan characteristics | Loan amount, term, purpose, product type, occupancy |
| Geography | MSA, census tract income category |
If a racial or ethnic group coefficient remains statistically significant (p < 0.05) after including all legitimate controls, you have unexplained disparate impact — the model’s structure is producing differential outcomes that can’t be attributed to credit risk factors.
For AI models specifically: The regression approach faces a challenge when the model uses hundreds of variables. Including all model inputs as controls essentially replicates the model’s logic, which can mask disparate impact. Some practitioners use a two-stage approach: run the model on a test dataset, then regress the model’s outputs against protected class status and a limited set of legitimate factors. The residual gap is what you’re measuring.
Examiner expectation (historically): The FDIC’s fair lending examination procedures describe regression analysis as the standard methodology for pricing and underwriting disparities. Even with the federal disparate impact shift, regression methodology remains the gold standard for demonstrating that you’ve looked for unexplained disparities and addressed them.
3. Fisher’s Exact Test and Standard Deviation Analysis
When sample sizes are small, Fisher’s Exact test is the statistically appropriate tool for evaluating whether differences in approval rates between groups are statistically significant or likely due to random variation.
Standard deviation (sigma) test: For larger samples, the standard deviation test computes the z-score of the difference in selection rates. A z-score with absolute value above 1.96 (p < 0.05, two-tailed) indicates the difference is statistically significant at the 95% confidence level. An absolute z-score above 2.0 has historically been treated as a presumptive trigger for further investigation in fair lending contexts.
Fisher’s Exact: Produces a p-value directly from a 2x2 contingency table (protected class vs. comparison group, approved vs. denied). Use when group counts are below 30.
Neither test tells you why the difference exists — only that it’s unlikely to be random. Statistical significance is a necessary precondition for a disparate impact claim, not a sufficient one.
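Both tests are one-liners with SciPy; the counts below are illustrative. Fisher's exact runs directly on the 2x2 table, and the standard deviation test is the classic two-proportion z-test on pooled rates:

```python
from math import sqrt
from scipy.stats import fisher_exact, norm

# Small-sample case: 12 of 20 protected-class applicants approved
# vs. 18 of 22 comparison-group applicants (illustrative counts).
table = [[12, 8],    # protected: approved, denied
         [18, 4]]    # comparison: approved, denied
_, p_fisher = fisher_exact(table)
print(f"Fisher's exact p = {p_fisher:.3f}")

# Large-sample case: two-proportion z-test (standard deviation test)
a1, n1 = 300, 500    # protected group: approvals, applicants
a2, n2 = 420, 600    # comparison group: approvals, applicants
p1, p2 = a1 / n1, a2 / n2
pooled = (a1 + a2) / (n1 + n2)
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_z = 2 * norm.sf(abs(z))    # two-tailed p-value
print(f"z = {z:.2f}, p = {p_z:.4f}")
```

In the large-sample example the 10-percentage-point gap produces a z-score beyond the 1.96 threshold, so it would trigger the next layer of analysis; the small-sample table, despite a similar rate gap, may not reach significance — which is precisely why the small-n and large-n tools differ.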
4. BISG Proxy Methodology — When You Don’t Have Race Data
Most credit models don’t contain race or ethnicity as an input field — and shouldn’t. But you can’t test for disparate impact without knowing the demographic composition of your applicant pool.
Bayesian Improved Surname Geocoding (BISG) estimates the probability that an individual belongs to each race/ethnicity category by combining:
- Surname analysis (Census Bureau surname frequency data by race/ethnicity)
- Geocoding (census tract demographic composition from ACS data)
The result is a probability vector for each applicant: e.g., P(White) = 0.72, P(Hispanic) = 0.18, P(Asian) = 0.06. These probabilities are used in weighted regression analysis rather than hard-coded classifications.
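The Bayesian update itself is a single multiply-and-normalize step. The probabilities below are illustrative stand-ins, not actual Census surname or ACS tract figures:

```python
def bisg_posterior(surname_probs, tract_shares):
    """Combine a surname-based race/ethnicity prior with the applicant's
    census tract composition: multiply category-by-category, then normalize
    so the posterior probabilities sum to 1."""
    joint = {race: surname_probs[race] * tract_shares[race]
             for race in surname_probs}
    total = sum(joint.values())
    return {race: p / total for race, p in joint.items()}

# Illustrative figures for one applicant (not real Census/ACS data):
surname = {"white": 0.55, "black": 0.10, "hispanic": 0.25, "asian": 0.10}
tract   = {"white": 0.30, "black": 0.05, "hispanic": 0.60, "asian": 0.05}
post = bisg_posterior(surname, tract)
print({race: round(p, 3) for race, p in post.items()})
```

Note how the heavily Hispanic tract pulls the Hispanic posterior well above its surname-only prior — the geocoding component doing its work. The resulting vector feeds the weighted regression as probabilities, never as a hard-coded classification.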
BISG limitations to document:
- Surname proxy is most reliable for Hispanic and Asian applicants; less reliable for Black applicants (due to lower correlation between surname and race in Census data)
- Geocoding accuracy degrades in census tracts with high racial/ethnic mixing
- Does not distinguish between subgroups within broad categories
- The CFPB accepted BISG in its supervisory guidance; state regulators have generally followed
Alternative: Disclosed data where available. For HMDA-covered loans, you have actual race/ethnicity data for most applicants. Use BISG to fill gaps for applicants who declined to disclose, not as a substitute for disclosed data.
Choosing the Right Method: A Decision Framework
| Situation | Primary Method | Secondary Method |
|---|---|---|
| Large applicant pool (500+ per group) | AIR screening → regression | Standard deviation test |
| Small applicant pool (<30 per group) | Fisher’s Exact test | Document sample limitation |
| AI model with no race/ethnicity field | BISG proxy → AIR | BISG-weighted regression |
| Pricing disparity analysis | Regression (OLS on APR) | AIR on pricing tiers |
| Adverse action disparity | AIR → Fisher’s/standard deviation | Regression on denial rates |
| Pre-deployment model testing | All four methods on test dataset | LDA search |
The LDA Obligation: What Happens When Testing Finds Disparity
A less discriminatory alternative (LDA) is an alternative model or decision rule that serves the same business purpose — predicting credit risk — with less adverse impact. Under traditional disparate impact doctrine, demonstrating adverse impact doesn’t automatically create liability; it shifts the burden to the institution to show business necessity. If a plaintiff can then identify an available LDA, liability may follow.
Proactively searching for LDAs during model development — before deployment — is the documented best practice. The search doesn’t require testing every conceivable alternative. It requires documenting what alternatives were considered, why they were rejected on business grounds, and whether the current approach represents the least discriminatory path to serving the business purpose.
A basic LDA analysis log covers:
- The current model’s adverse impact ratios by protected class
- Alternative specifications tested (different variable weightings, thresholds, feature sets)
- The AIR and business performance metrics (accuracy, default prediction) of each alternative
- The rationale for selecting the current model over alternatives with lower adverse impact
This documentation is the difference between a proactive compliance program and a reactive litigation response.
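An LDA search log can be generated mechanically from a threshold sweep. The sketch below uses synthetic scores with a built-in group gap (all figures illustrative) and records, for each candidate cutoff, the AIR alongside a business-performance metric — the raw material of the log described above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
protected = rng.integers(0, 2, n)
# Synthetic scores with a built-in 15-point gap against the protected group
score = rng.normal(640 - 15 * protected, 50, n)
# Simulated default outcomes: lower score -> higher default probability
default = (rng.uniform(size=n) < 1 / (1 + np.exp((score - 600) / 40))).astype(int)

log = []
for cutoff in (600, 620, 640):
    approved = score >= cutoff
    rate_p = approved[protected == 1].mean()
    rate_c = approved[protected == 0].mean()
    air = rate_p / rate_c
    # Business metric: share of approved loans that do not default
    perf = 1 - default[approved].mean()
    log.append((cutoff, round(air, 2), round(perf, 2)))

for cutoff, air, perf in log:
    print(f"cutoff={cutoff}  AIR={air}  non-default rate={perf}")
```

Each row of `log` maps directly onto the required artifacts: the alternative specification tested, its adverse impact ratio, and its business performance — leaving only the written rationale for the final selection.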
Documenting Testing for Exam Readiness
The question isn’t whether your testing will satisfy a federal examiner today — it’s whether it will satisfy a state examiner or plaintiff tomorrow. Documentation requirements are the same regardless of the reviewing body:
Required artifacts:
- Testing protocol memo — which groups, which decision points, which methods, which thresholds
- AIR calculations — by group, by decision point, with sample sizes
- Regression output tables — model specification, coefficients, p-values, R²
- BISG methodology documentation — if used, explain the approach, data sources, and limitations
- Findings memo — what was found, what the finding means, what action was taken
- LDA search documentation — what alternatives were evaluated and why the current approach was retained
- Monitoring plan — how often testing repeats, who reviews results, what triggers remediation
The absence of documentation — even when testing was actually conducted — is an independent finding. “We ran the numbers but didn’t write it up” is not a defensible response to a regulatory inquiry or a discovery request.
So What? Actions for Compliance and Model Risk Teams
The federal disparate impact shift creates a false sense of security for teams that conflate “OCC won’t examine for it” with “we have no exposure.” The exposure is real — it’s just more fragmented and harder to predict.
Right now:
- Map your jurisdiction footprint. If you operate in CA, MA, NJ, or NY, you have active disparate impact exposure under state law regardless of federal posture.
- Confirm which of the four statistical methods your current testing program uses and whether each decision point is covered.
- Check whether your BISG methodology documentation is current and addresses known limitations.
Before next model deployment:
- Run all four methods on the pre-deployment test dataset.
- Conduct and document an LDA search.
- Get the findings memo signed by the model owner and compliance officer before launch.
Ongoing:
- Schedule post-deployment disparity monitoring at least quarterly for high-volume, high-impact models.
- Set re-testing triggers: any model update, training data refresh, or >10% change in approval volume by product.
The statistical work isn’t going away. The regulatory audience for it has just gotten more complicated.
Need a structured pre-deployment AI risk assessment with built-in bias testing documentation? The AI Risk Assessment Template & Guide includes an AI Use Case Inventory with auto-tiering, bias testing scorecard, and worked examples across eight use cases mapped to the current regulatory landscape.
Also see: AI Bias Testing for Fair Lending: Methodologies Every Risk Team Needs, AI and Fair Lending: UDAAP Risk in Algorithmic Decisioning, and AI Risk Assessment Template: Pre-Deployment Checklist for Financial Services.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.