AI Bias Testing for Fair Lending: Methodologies Every Risk Team Needs
Earnest Operations LLC thought its AI underwriting model was just doing its job—filtering applications based on relevant financial factors. The Massachusetts AG disagreed. On July 10, 2025, a $2.5 million settlement made the consequences clear: when your model disfavors Black, Hispanic, and non-citizen applicants, “the algorithm did it” is not a defense.
AI-driven lending decisions are now firmly in the regulatory crosshairs. The CFPB has made it explicit—black-box models don’t exempt lenders from ECOA, Regulation B, or the Fair Housing Act. If your credit model produces disparate outcomes along protected class lines and you can’t explain why or prove you tested for it, you have a problem.
The good news: bias testing methodologies have matured significantly. The bad news: most risk teams are either not testing at all, running the wrong tests, or misinterpreting what the results mean.
Here’s the complete guide.
TL;DR
- AI credit models must comply with ECOA and the Fair Housing Act—regulators apply these laws the same way regardless of whether a human or an algorithm makes the decision.
- Multiple bias testing methodologies exist; they detect different types of bias and no single test is sufficient.
- The Earnest Operations $2.5M settlement (July 2025) shows state AGs will fill enforcement gaps left by federal agencies.
- The CFPB’s September 2023 guidance requires lenders using AI to provide specific, accurate adverse action reasons—not generic sample-form language.
Why AI Makes Bias Testing Harder—and More Important
Traditional credit underwriting used transparent factors: credit score, DTI, income verification. Examiners could trace a denial reason directly to a rule. AI models work differently. A gradient boosting model might use 500+ features with complex nonlinear interactions. A deep learning model might be nearly uninterpretable by design.
This creates two intertwined problems:
Problem 1: Proxies. AI models trained on historical data will encode historical discrimination. If minority applicants were historically denied at higher rates in certain zip codes, and the model learns that zip code predicts default risk (which itself reflects historical discrimination), the model perpetuates the cycle. The model never “sees” race—but race is embedded in the patterns.
Problem 2: Explainability. ECOA and Regulation B require lenders to provide applicants with specific reasons for adverse actions. If your model is a black box, telling a denied applicant “the model said no” doesn’t cut it. The CFPB’s September 2023 guidance is unambiguous: specific and accurate reasons must reflect the factors actually considered or scored by the model. Sample checklist items that don’t match your model’s actual outputs violate the regulation.
Neither problem disappears by choosing a more complex model. Both require proactive testing.
The Legal Framework: What You’re Actually Testing Against
Before diving into methodologies, get clear on the legal landscape:
| Law | Applies To | Key Prohibition | Enforcement |
|---|---|---|---|
| Equal Credit Opportunity Act (ECOA) / Reg B | All credit products | Disparate treatment + disparate impact on protected classes (race, color, religion, national origin, sex, marital status, age, familial status) | CFPB, federal banking agencies |
| Fair Housing Act | Mortgage lending, residential real estate | Disparate treatment + disparate impact based on race, color, national origin, sex, disability, familial status, religion | HUD, DOJ |
| State fair lending laws | Varies by state | Often broader protected classes (Massachusetts includes immigration status) | State AGs |
The disparate impact theory is the key one for AI. A lender doesn't have to intend to discriminate—if an AI model produces statistically significantly worse outcomes for protected class members, the lender must show that the practice is necessary to achieve a legitimate business goal and that no less discriminatory alternative exists. The Earnest case turned on exactly this: the Massachusetts AG alleged the model disfavored Black and Hispanic applicants in approval rates and loan terms, and that the company failed to test for this and failed to remediate it.
Methodology 1: Disparate Impact Analysis (The Foundation)
This is your starting point. Measure outcome rates (approval, denial, pricing tiers, APR) by protected class and compare them.
The four-fifths rule: The industry standard threshold, drawn from EEOC employment guidance and applied by analogy to lending contexts. If a protected group receives a positive outcome at less than 80% of the rate of the most favored group, that’s a potential disparate impact trigger requiring further analysis.
Example: If white applicants are approved at 70% and Black applicants at 50%, the ratio is 50/70 = 71%. That’s below 80%—a red flag that requires investigation.
What to measure:
- Approval/denial rates by race, national origin, sex
- Loan pricing (APR, fees) by protected class after controlling for risk factors
- Product steering (were protected class applicants offered less favorable products?)
- Cutoff disparities (are protected class applicants disproportionately near decision boundaries?)
Critical limitation: Disparate impact analysis tells you there’s a problem. It doesn’t tell you what’s causing it or whether it’s legally justified. You need additional testing to answer those questions.
Data requirement: You need to know the race and national origin of applicants, which means you need demographic data. For mortgage lending, HMDA provides this. For non-mortgage consumer credit, lenders typically use BISG (Bayesian Improved Surname Geocoding) to infer probable race/ethnicity from name and address data. BISG isn’t perfect, but it’s the accepted methodology—CFPB and DOJ have used it in enforcement actions.
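The ratio calculation itself is simple enough to sketch in a few lines. This is a minimal illustration, not a production analysis; the group labels and sample data are hypothetical, and a real run would use BISG-inferred or HMDA demographic fields:

```python
from collections import Counter

def disparate_impact_ratios(decisions, threshold=0.80):
    """Compute approval-rate ratios against the most favored group.

    decisions: iterable of (group_label, approved_bool) tuples.
    Returns {group: (approval_rate, ratio_vs_top_group, flagged)}.
    """
    approved, total = Counter(), Counter()
    for group, ok in decisions:
        total[group] += 1
        approved[group] += int(ok)
    rates = {g: approved[g] / total[g] for g in total}
    top = max(rates.values())
    # Flag any group whose ratio falls below the four-fifths threshold.
    return {g: (rate, rate / top, rate / top < threshold)
            for g, rate in rates.items()}

# Mirrors the worked example: 70% vs. 50% approval -> ratio ~0.71, flagged.
sample = ([("A", True)] * 70 + [("A", False)] * 30
        + [("B", True)] * 50 + [("B", False)] * 50)
results = disparate_impact_ratios(sample)
```

In practice you would run this per outcome type (approval, pricing tier, product offered), since a model can pass on approvals while failing on pricing.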
Methodology 2: Regression-Based Decomposition
Disparate impact ratios tell you what—regression analysis helps you understand why.
Run a regression with the lending outcome as the dependent variable and include:
- Legitimate risk factors (credit score, DTI, LTV, income, employment history)
- Protected class membership (as indicator variables)
If protected class variables remain statistically significant after controlling for all legitimate risk factors, you have evidence of residual discrimination that’s not explained by creditworthiness differences.
What to do with the results:
- Non-significant protected class coefficients with a clean model → good signal, but keep monitoring
- Significant protected class coefficients → investigate what’s driving it; look for proxies in the feature set
- Proxy features (zip code, school name, employer type, social network) that correlate with protected class → remove or reconstruct
The Earnest case illustrates this. The Massachusetts AG alleged the company trained its models on historical human decisions that themselves reflected biased underwriting practices. When you train an ML model on biased historical data, the model learns and replicates that bias. The solution: use holdout testing, synthetic data augmentation, and regular retraining on debiased datasets.
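The decomposition logic can be sketched with a linear probability model on synthetic data. This is a toy illustration with made-up coefficients, not a recommended specification: a real analysis would use logistic regression with your full set of legitimate risk factors and formal significance tests, and the 8-point gap below is deliberately built into the fake data so the regression has something to find:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic applicants: one legitimate risk factor plus a protected-class
# indicator (independent of the risk factor by construction).
credit_score = rng.normal(680, 50, n)
protected = rng.integers(0, 2, n)          # 1 = protected-class member

# Ground truth with a built-in ~8-point approval-rate gap (the "bias").
p_approve = np.clip(0.5 + 0.003 * (credit_score - 680) - 0.08 * protected, 0, 1)
approved = (rng.random(n) < p_approve).astype(float)

# Linear probability model: approval ~ intercept + credit_score + protected.
X = np.column_stack([np.ones(n), credit_score, protected])
coef, *_ = np.linalg.lstsq(X, approved, rcond=None)

# coef[2] estimates the residual approval gap after controlling for the
# legitimate factor; a materially negative value is the red flag.
residual_gap = coef[2]
```

A coefficient near zero on the protected-class indicator is the "clean model" signal described above; here the regression recovers roughly the 8-point gap that was baked in.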
Methodology 3: Counterfactual Fairness Testing
This is one of the most powerful—and underused—approaches. The question: if everything about an applicant were the same except their protected class, would the model produce the same outcome?
How it works:
- Take a sample of application records
- Change only the demographic-proxy features (flip race-correlated features like zip code, name-derived probabilities)
- Run the same application through the model
- Measure how often the outcome changes
If changing features that should be legally irrelevant causes the model’s output to change, you have a fairness problem even if aggregate disparate impact ratios look clean.
Tools: IBM’s AI Fairness 360 (open source), What-If Tool (Google), SHAP-based counterfactual frameworks. These tools can automate counterfactual generation at scale.
Why regulators care: Counterfactual fairness speaks directly to the legal question of whether a “similarly situated” applicant was treated differently. It maps well onto the matched-pair testing methodology that DOJ and CFPB have used in mortgage discrimination investigations.
Methodology 4: Calibration Testing Across Groups
A model is calibrated if its predicted probabilities of default match observed default rates. When a well-calibrated model assigns a 20% default probability to a group of applicants, approximately 20% of them actually default.
The bias problem: A model can be calibrated overall while being poorly calibrated for minority subgroups. If the model consistently overestimates default probability for Black applicants (predicting 30% when actual defaults are 20%), it’s systematically undervaluing their creditworthiness—and they’ll be denied or steered to worse products at higher rates.
How to test:
- Score all applications with the model
- Split the population by protected class
- Compare predicted default rates vs. actual default rates within each group
- Flag groups where predicted rates systematically exceed actual rates by more than your materiality threshold (typically 3-5 percentage points)
Calibration errors are particularly insidious because they don’t show up in standard disparate impact testing. A model can have a disparate impact ratio above 80% while still being significantly miscalibrated against a protected class—especially if that group has a smaller sample size in the training data.
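The group-level comparison in the steps above can be sketched as follows. The group labels and the 3-point threshold are illustrative; your materiality threshold should come from your model risk policy:

```python
def calibration_by_group(records, threshold=0.03):
    """records: iterable of (group, predicted_pd, defaulted_bool) tuples.
    Flags groups where mean predicted default probability exceeds the
    observed default rate by more than the materiality threshold."""
    by_group = {}
    for group, pred, defaulted in records:
        by_group.setdefault(group, []).append((pred, defaulted))
    report = {}
    for group, rows in by_group.items():
        predicted = sum(p for p, _ in rows) / len(rows)
        observed = sum(d for _, d in rows) / len(rows)
        # Overestimated default risk systematically undervalues the group.
        report[group] = (predicted, observed, predicted - observed > threshold)
    return report

# Group B's default risk is over-predicted by 10 points -> flagged.
records = ([("A", 0.20, True)] * 20 + [("A", 0.20, False)] * 80
         + [("B", 0.30, True)] * 20 + [("B", 0.30, False)] * 80)
report = calibration_by_group(records)
```

Note that both groups here have identical 20% observed default rates, so a disparate impact ratio on realized defaults alone would look clean; only the prediction-vs-outcome comparison surfaces the problem.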
Methodology 5: Adversarial Debiasing and Reweighting
If testing reveals bias, you need to address it. Two primary approaches:
Adversarial debiasing: Train the model with an adversarial architecture: the primary network learns to predict creditworthiness while an adversary tries to predict protected class membership from the primary network's outputs, and the primary network is penalized whenever the adversary succeeds. This forces the model toward features genuinely correlated with credit risk rather than with demographics.
Reweighting/resampling: Adjust training data weights to increase representation of underrepresented groups or reduce the influence of historically biased decisions. Requires careful implementation—you don’t want to introduce new statistical problems while solving the bias problem.
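The reweighting idea can be sketched with the Kamiran-Calders style scheme, where each (group, label) cell is weighted by its expected-over-observed frequency so that group and outcome become independent in the weighted training data. This is one reweighting variant among several, shown on hypothetical data:

```python
from collections import Counter

def fairness_reweights(samples):
    """samples: list of (group, label) tuples.
    Returns {(group, label): weight} where weight = P(group) * P(label)
    divided by P(group, label), so the weighted data has no
    group-label association."""
    n = len(samples)
    group_counts = Counter(g for g, _ in samples)
    label_counts = Counter(y for _, y in samples)
    cell_counts = Counter(samples)
    return {(g, y): (group_counts[g] * label_counts[y]) / (n * cell_counts[(g, y)])
            for (g, y) in cell_counts}

# Historically biased data: group B approved at half group A's rate.
data = ([("A", 1)] * 60 + [("A", 0)] * 40
      + [("B", 1)] * 30 + [("B", 0)] * 70)
weights = fairness_reweights(data)
```

Applied as sample weights during training, these upweight group B's approvals (weight 1.5) and downweight group A's (weight 0.75), equalizing the weighted approval rates at 45% for both groups without discarding any records.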
Key principle: Document every debiasing intervention in your model documentation. The techniques used, the tradeoffs accepted, the before/after testing results, and the rationale for your final approach. SR 11-7 requires this documentation, and examiners will ask for it.
What Examiners Actually Want to See
When CFPB or OCC examiners look at your AI fair lending program, they’re looking for:
1. Evidence you tested before deployment—not just after a complaint. A fair lending testing protocol that’s part of model validation, not a reactive exercise.
2. Ongoing monitoring, not just pre-launch validation. Data distributions shift. What passes testing in 2024 may produce disparate outcomes by 2026 as applicant demographics or economic conditions change. Schedule quarterly disparate impact monitoring for all production credit models.
3. Governance and accountability. Who owns fair lending testing? Who reviews results? What’s the escalation path when a test fails? At most mid-size banks, this sits with Model Risk Management with oversight from Fair Lending Compliance. At fintechs without a formal MRM function, the Head of Compliance and VP of Data Science typically share ownership. Document it clearly.
4. Adverse action notice compliance. If your model is making or contributing to credit denials, your adverse action notices must reflect the specific factors the model actually used. The CFPB’s 2023 guidance is clear: using sample checklist items that don’t match your model’s actual outputs is a violation. Your explainability framework (SHAP values, LIME outputs, or other feature attribution) should feed directly into adverse action notice generation.
5. Vendor AI accountability. If you’re using a third-party credit scoring or underwriting AI, you are responsible for its fair lending compliance. “The vendor handles it” is not an acceptable answer. Your vendor due diligence process should require evidence of the vendor’s bias testing methodology, the frequency of testing, results from the most recent tests, and a commitment to share future testing results. Document this in your third-party risk management program.
Building a Fair Lending Testing Protocol: 30/60/90/120-Day Roadmap
Days 1–30: Inventory and baseline
- Inventory all AI/ML models contributing to credit decisions (approvals, pricing, collections, line management)
- Classify models by risk tier (consumer credit AI = typically Tier 1 or Tier 2)
- Collect demographic data or implement BISG methodology where demographic data is unavailable
- Run initial disparate impact analysis on production models
Days 31–60: Deep analysis
- For any model with disparate impact ratios below 80%: run regression decomposition and counterfactual testing
- Audit training data for historical bias; document known limitations
- Review adverse action notice language against model feature outputs
- Assess third-party AI vendor contracts for fair lending testing requirements
Days 61–90: Remediation and governance
- Implement any required debiasing interventions (reweighting, adversarial debiasing, feature removal)
- Validate remediation effectiveness with before/after testing documentation
- Establish quarterly disparate impact monitoring schedule
- Draft or update Fair Lending AI Policy covering testing requirements, ownership, escalation, and remediation standards
Days 91–120: Documentation and exam readiness
- Update model risk documentation to include bias testing methodology, results, and remediation history
- Brief Fair Lending Committee and Board Risk Committee on AI fair lending risk posture
- Train model owners and validators on bias testing requirements and regulatory expectations
- Conduct tabletop exercise: “CFPB requests our fair lending AI testing documentation tomorrow—what do we pull?”
The So What
The Earnest settlement in July 2025 sent a clear message: even as federal agencies pull back on disparate impact enforcement, state AGs are filling the gap. Massachusetts. Illinois. New York. California. These states have strong fair lending frameworks and the appetite to use them.
For risk teams, the calculus is simple: AI bias testing is no longer optional. It’s a core component of model risk management for any model touching credit decisions. The question isn’t whether you should be testing—it’s whether your testing methodology is robust enough to withstand scrutiny when a regulator asks.
The firms caught flat-footed are the ones who assumed their vendor’s model was clean, or who ran a one-time disparate impact analysis at launch and called it done. Don’t be that firm.
Running AI models in credit decisions and not sure if your testing protocol covers all the bases? The AI Risk Assessment Template & Guide includes a fair lending testing framework, model risk controls checklist, and documentation templates designed for the current regulatory environment.
Frequently Asked Questions
What is the four-fifths rule in AI bias testing?
The four-fifths (80%) rule is a disparate impact threshold drawn from EEOC employment guidelines and widely applied in fair lending analysis. If a protected group receives a positive credit outcome at less than 80% of the rate of the most favored group, that disparity is considered a potential disparate impact violation requiring further investigation and potential remediation. It’s a trigger for deeper analysis, not a standalone test.
Does the CFPB require AI credit models to explain their decisions?
Yes. The CFPB’s September 2023 guidance makes clear that lenders using AI or complex algorithms must still provide specific, accurate reasons for adverse credit actions under ECOA and Regulation B. The specific reasons must reflect the factors actually considered or scored by the model—generic sample checklist items that don’t match the model’s outputs are insufficient and violate Regulation B.
What happened in the Earnest Operations AI fair lending settlement?
On July 10, 2025, Massachusetts Attorney General Andrea Joy Campbell announced a $2.5 million settlement with Earnest Operations LLC, a student loan company. The AG alleged that Earnest’s AI underwriting models produced unlawful disparate impact against Black, Hispanic, and non-citizen applicants—and that the company failed to adequately test its models for disparate impact and trained models on historical human decisions that themselves reflected biased underwriting. The case is a landmark example of state-level AI fair lending enforcement.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.