Bias in AI Demonstrator

Pick a scenario, compare the per-group fairness metrics for a naive model and a debiased model, and see why "high overall accuracy" does not mean a model is fair to every group.

A bank predicts whether to approve a loan application based on income, credit score, and employment history. Cluster A and Cluster B are two zip-code clusters with different historical lending records.

Dataset visualization

Scatter plot of applicants in Cluster A and Cluster B (filled points = predicted approve, hollow points = predicted deny). Toggle the switch above the plot to flip between the naive and the debiased predictions.

Naive model (trained on biased history)

The model uses a stricter approval threshold for one group, mirroring biased historical labels.

Metric                                                  Cluster A   Cluster B   Gap
Approval rate (fraction predicted approved)             50.0%       4.0%        46.0%
Accuracy (predictions matching truth)                   65.0%       70.0%       5.0%
True positive rate (caught when truly approve)          64.2%       11.8%       52.4%
False positive rate (wrongly approved when truly deny)  34.0%       0.0%        34.0%

Debiased model (equalized-odds adjustment)

Per-group thresholds are tuned so the true positive rates match more closely.

Metric                                                  Cluster A   Cluster B   Gap
Approval rate (fraction predicted approved)             43.0%       20.0%       23.0%
Accuracy (predictions matching truth)                   60.0%       74.0%       14.0%
True positive rate (caught when truly approve)          52.8%       41.2%       11.7%
False positive rate (wrongly approved when truly deny)  31.9%       9.1%        22.8%

What just happened

The naive model can have respectable overall accuracy while still producing very different error rates across groups. Debiasing narrows the per-group gap in true positive rate, but it typically lowers raw accuracy and may shift other tradeoffs. No single fairness metric works for every situation, which is why algorithmic fairness is a design choice, not a one-click setting.

Two formal definitions of fairness

Demographic parity

The model approves both groups at the same rate. It does not look at who actually deserves approval, only at the overall rates.

Equalized odds

Among applicants who truly should be approved, both groups are approved at the same rate (equal true positive rates). Among applicants who truly should be denied, both groups are wrongly approved at the same rate (equal false positive rates).
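
To make both definitions concrete, here is a minimal sketch in Python with NumPy. The arrays y_true, y_pred, and group are hypothetical stand-ins, not the demo's actual data; it computes the demographic-parity gap and the two equalized-odds gaps:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    # y_true: 1 = truly deserves approval; y_pred: 1 = model approves.
    # Assumes exactly two groups.
    rates = {}
    for g in np.unique(group):
        m = group == g
        rates[g] = (
            y_pred[m].mean(),                  # approval rate (demographic parity)
            y_pred[m][y_true[m] == 1].mean(),  # TPR: actual positives approved
            y_pred[m][y_true[m] == 0].mean(),  # FPR: actual negatives approved
        )
    (app_a, tpr_a, fpr_a), (app_b, tpr_b, fpr_b) = rates.values()
    return {
        "demographic_parity_gap": abs(app_a - app_b),  # zero = parity holds
        "tpr_gap": abs(tpr_a - tpr_b),  # equalized odds needs both the TPR gap
        "fpr_gap": abs(fpr_a - fpr_b),  # and the FPR gap to be (near) zero
    }

# Tiny hypothetical sample: six applicants per group
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0])
group = np.array(["A"] * 6 + ["B"] * 6)
print(fairness_gaps(y_true, y_pred, group))
```

A demographic-parity gap of zero means equal approval rates; equalized odds asks for both the TPR gap and the FPR gap to be zero.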

Real cases of algorithmic bias

The synthetic data above is illustrative. These four cases are from peer-reviewed studies, journalism, and regulatory filings.

2016

COMPAS Recidivism Risk Score

ProPublica analyzed risk scores produced by the COMPAS tool, which courts in several US states used to predict whether a defendant would re-offend. The investigation found that black defendants who did not re-offend were roughly twice as likely to be incorrectly flagged as high risk compared with white defendants who did not re-offend. Northpointe, the vendor, disputed the framing, and the case became one of the most-cited examples of how a model can be calibrated overall yet still produce very different error rates per group.

2019

Apple Card Credit Limits

When the Apple Card launched, several customers reported that women received much lower credit limits than their husbands, even when the couples shared finances and the woman had a higher credit score. The New York Department of Financial Services opened an investigation into Goldman Sachs, the issuing bank. The case highlighted that even without sex as an input, an opaque underwriting model can still produce sex-correlated outcomes through proxies.

2018

Amazon Resume Screening Tool

Reuters reported that Amazon abandoned an experimental resume-screening model after the company found it systematically downgraded resumes that contained the word "women's" (as in "women's chess club captain") and that came from two all-women's colleges. The model had been trained on a decade of past hiring decisions, which were themselves dominated by men. Amazon could not guarantee the model would be neutral and shut the project down.

2019

Optum Healthcare Risk Algorithm

A study in Science by Obermeyer and colleagues examined a widely used commercial algorithm that helped US hospitals identify patients for extra care. The algorithm used past healthcare costs as a proxy for medical need. Because less money had historically been spent on black patients with the same conditions, the algorithm assigned them lower risk scores than equally sick white patients. Correcting the proxy more than doubled the share of black patients flagged for extra care.

Reference Guide

What is algorithmic fairness

A model can be accurate on average while still treating two groups very differently. Algorithmic fairness asks how to define and measure equal treatment when the groups are defined by protected attributes such as race, gender, age, or disability status.

Researchers have proposed dozens of fairness definitions. Two common ones are demographic parity, which asks that the approval rate be equal across groups, and equalized odds, which asks that true positive and false positive rates be equal across groups. These two definitions can conflict, and a model usually cannot satisfy both at the same time.
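
One way to see the conflict: a group's approval rate is determined by its TPR, FPR, and base rate (the fraction of the group that truly merits approval), since P(approve) = TPR x base + FPR x (1 - base). If base rates differ, matching TPR and FPR across groups forces the approval rates apart. A small arithmetic sketch with hypothetical numbers:

```python
# A group's approval rate decomposes as TPR * base + FPR * (1 - base).
tpr, fpr = 0.6, 0.2           # shared by both groups, so equalized odds holds
base_a, base_b = 0.50, 0.25   # hypothetical fractions who truly merit approval

rate_a = tpr * base_a + fpr * (1 - base_a)   # 0.40
rate_b = tpr * base_b + fpr * (1 - base_b)   # 0.30

# Approval rates differ by 10 points, so demographic parity fails
# even though equalized odds is satisfied exactly.
```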

The metrics table in this tool reports approval rate, accuracy, true positive rate, and false positive rate for each group, plus the gap between groups.

Why hiding the attribute fails

A first instinct is to remove the protected attribute (such as race or gender) from the training data. This is sometimes called "fairness through unawareness". It rarely works.

Other features often act as proxies. Zip code stands in for race in many US cities, resume keywords stand in for gender, and healthcare cost stands in for medical need while undercounting care for groups that historically had less access to it.

In the Optum study, the algorithm did not use race as an input. It still produced racially disparate outcomes because the cost proxy was itself a product of unequal access to care.

Reading the metrics table

Approval rate. Fraction of applicants the model predicts should be approved. A demographic-parity gap is the difference in approval rates between groups.

Accuracy. Fraction of predictions that match the true label. Two groups can have the same accuracy with very different error patterns.

True positive rate (TPR). Of the applicants who truly should be approved, the fraction the model caught. Also called recall or sensitivity.

False positive rate (FPR). Of the applicants who truly should be denied, the fraction the model wrongly approved. Equalized odds asks for TPR and FPR to be similar across groups.
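
All four numbers fall out of a group's confusion matrix. A minimal sketch (the counts below are hypothetical, not taken from the demo):

```python
def group_metrics(tp, fp, tn, fn):
    # tp: truly approve, predicted approve    fn: truly approve, predicted deny
    # fp: truly deny, predicted approve       tn: truly deny, predicted deny
    n = tp + fp + tn + fn
    return {
        "approval_rate": (tp + fp) / n,   # everyone the model approved
        "accuracy":      (tp + tn) / n,   # predictions matching truth
        "tpr":           tp / (tp + fn),  # recall / sensitivity
        "fpr":           fp / (fp + tn),  # wrongly approved among true denials
    }

# Hypothetical confusion counts for two groups of 100 applicants each
a = group_metrics(tp=32, fp=17, tn=33, fn=18)
b = group_metrics(tp=6, fp=0, tn=49, fn=45)
gaps = {k: abs(a[k] - b[k]) for k in a}   # the "Gap" column is this difference
print(a, b, gaps, sep="\n")
```

Computing this per group and differencing gives the Gap column in the tables above.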

Mitigation strategies overview

Pre-processing. Adjust the training data so each group is represented fairly. Examples include reweighting samples and relabeling biased historical decisions.

In-processing. Add a fairness constraint to the loss function during training. The model is asked to balance accuracy with a fairness target.

Post-processing. Train an unconstrained model, then apply per-group thresholds chosen so a fairness metric is satisfied. The debiased toggle in this tool is a simple form of post-processing.
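
The sketch below shows the post-processing idea in miniature: search for a per-group score cutoff whose TPR lands closest to a shared target. All names and data here are hypothetical; practical methods, such as the equalized-odds post-processing of Hardt, Price, and Srebro (2016), use a more careful, sometimes randomized, threshold choice.

```python
import numpy as np

def tpr_at(scores, y_true, threshold):
    # TPR if we approve everyone whose score clears the threshold
    approved = scores >= threshold
    return approved[y_true == 1].mean()

def pick_group_thresholds(scores, y_true, group, target_tpr):
    # Post-processing: leave the trained model untouched and choose,
    # per group, the cutoff whose TPR is nearest a shared target.
    thresholds = {}
    for g in np.unique(group):
        m = group == g
        candidates = np.unique(scores[m])
        thresholds[g] = min(
            candidates,
            key=lambda t: abs(tpr_at(scores[m], y_true[m], t) - target_tpr),
        )
    return thresholds

# Hypothetical scores: group B's scores run lower, mirroring biased history
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.2, 1.0, 50), rng.uniform(0.0, 0.8, 50)])
y_true = (rng.uniform(size=100) < scores).astype(int)
group = np.array(["A"] * 50 + ["B"] * 50)
print(pick_group_thresholds(scores, y_true, group, target_tpr=0.6))
```

Equalizing TPR this way usually lowers the threshold for the disadvantaged group, which can raise that group's FPR — the same tradeoff visible in the debiased table above.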

Every mitigation has tradeoffs. Improving one fairness metric often worsens another or lowers raw accuracy. Choosing which tradeoff is acceptable is a values question, not a math question.