AI in Mammography: Background and Demo

Why radiology is both a promising and failure-prone setting for AI, plus a demo on 945 mammogram cases that surfaced systematic BI-RADS severity downgrading and subgroup gaps.

In A Giant Leap, Dr. Robert Wachter explains why radiology is both a promising and a failure-prone setting for AI. Radiology is not just image pattern recognition: the same scan can mean very different things depending on the patient’s age, history, and prior disease course. Even for the same patient, imaging collected across different sites can lead to different outputs because of differences in scanners, imaging protocols, or patient positioning. Radiologists often compare the current image with years of prior scans across modalities like CT, MRI, and ultrasound, while also moving through a broader workflow of verifying the patient, reviewing the clinical history, toggling between old and new images, and dictating an impression.

Many radiology AI tools today are still single-task tools, each built to detect one condition at a time, while human radiologists must evaluate the full clinical picture. Together, these factors create many ways for AI to fail even when a model looks strong in isolation: it can miss context, ignore prior imaging, fit poorly into the workflow, or overreach beyond the narrow task it was designed for. This is why radiology AI needs an intelligent reliability layer inside clinical workflows, one that makes performance, risks, and failure patterns visible and supports safer AI-enabled care.

Where AI is delivering value — and where automation bias creeps in

AI is already delivering clear value in some parts of radiology, especially in structured screening workflows like cancer detection. In A Giant Leap, Wachter points to mammography as one example where AI-assisted screening has shown promising results, highlighting the Swedish MASAI trial, where an AI-assisted workflow identified about 20% more cancers while reducing radiologist workload.[1] He also notes an important behavioral risk: when AI marks a case as positive, radiologists may be inclined to go along with it rather than overrule it, because missing a true positive feels much riskier than accepting a false alarm.

That is part of what makes the MASAI workflow so interesting: cases the AI considered more suspicious were reviewed by two radiologists, while lower-risk cases could be signed off by one.[1] This is the kind of issue we want to help uncover: where AI may push radiologists toward approving false positives, and where its failure patterns call for more careful human review. With that visibility, hospitals can design review workflows and workload policies like those in the MASAI trial, helping radiologists stay vigilant rather than become overly reliant on AI or conditioned to accept its errors.

A demo: what intelligent reliability looks like in practice

To make that concrete, we built a demo to show what an intelligent reliability layer for radiology AI can look like in practice. For this demo, we evaluated a mammography-adapted LLaVA-based vision-language model[2] on 945 mammogram question-answer pairs from the VinDr-Mammo evaluation dataset,[3] covering tasks such as ACR density, abnormality detection, BI-RADS scoring, laterality, and view classification. The model reached 79.1% overall accuracy and an F1 score of 0.883, but the more important insight was that those averages hid clinically meaningful failure patterns.
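
As a rough illustration of how an overall average can hide task-level failures, the sketch below computes accuracy overall and per task. The records, task names, and the `accuracy_by_task` helper are made-up illustrative values, not VinDr-Mammo data and not the demo's actual evaluation code.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (overall_accuracy, {task: accuracy}) for (task, truth, prediction) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, truth, pred in records:
        totals[task] += 1
        correct[task] += int(truth == pred)
    overall = sum(correct.values()) / sum(totals.values())
    per_task = {task: correct[task] / totals[task] for task in totals}
    return overall, per_task


# Hypothetical toy records: (task, ground truth, model answer).
records = [
    ("view", "CC", "CC"),
    ("view", "MLO", "MLO"),
    ("laterality", "L", "L"),
    ("laterality", "R", "R"),
    ("birads", "5", "1"),
    ("birads", "4", "2"),
    ("birads", "1", "1"),
    ("abnormality", "mass", "normal"),
]

overall, per_task = accuracy_by_task(records)
print(f"overall: {overall:.3f}")
for task, acc in sorted(per_task.items()):
    print(f"{task}: {acc:.3f}")
```

In this toy data the easy tasks (view, laterality) pull the overall figure up while the clinically important ones (BI-RADS, abnormality) lag, which is the kind of gap a single headline number conceals.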

Three concrete risks the demo surfaced

A systematic BI-RADS severity downgrading pattern, where the model often called cases less severe than they really were.
Weak abnormality detection, with many truly abnormal cases labeled "normal".
Non-uniform performance across patients, with substantial drops for older patients on tasks like density assessment and abnormality detection.

The demo also showed where the model was more dependable, such as basic view and laterality identification.
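
The downgrading and subgroup patterns above can be quantified in a few lines. This is a minimal sketch under stated assumptions: BI-RADS categories are treated as integers where a lower number means less severe, the age bands are hypothetical, and all values are invented for illustration rather than taken from the demo.

```python
def birads_shift_summary(pairs):
    """Summarize the direction of BI-RADS errors for (true, predicted) integer pairs."""
    n = len(pairs)
    downgraded = sum(1 for true, pred in pairs if pred < true)
    upgraded = sum(1 for true, pred in pairs if pred > true)
    return {
        "downgrade_rate": downgraded / n,
        "upgrade_rate": upgraded / n,
        # Negative values mean the model leans toward less severe calls.
        "mean_signed_error": sum(pred - true for true, pred in pairs) / n,
    }


def accuracy_by_group(rows):
    """Return {group: accuracy} for (group, is_correct) rows."""
    totals, hits = {}, {}
    for group, ok in rows:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(ok)
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical (true BI-RADS, predicted BI-RADS) pairs.
pairs = [(5, 1), (4, 2), (3, 3), (2, 3)]
summary = birads_shift_summary(pairs)

# Hypothetical age bands with per-case correctness flags.
rows = [("under_60", True), ("under_60", True), ("60_plus", True), ("60_plus", False)]
by_age = accuracy_by_group(rows)

print(summary)
print(by_age)
```

A downgrade rate well above the upgrade rate, or a clearly negative mean signed error, is one simple way to express "the model often called cases less severe than they really were" as a number that can be tracked over time.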

Case-level analysis

The case-level analysis made these patterns more tangible. In one example, a highly suspicious BI-RADS 5 case was predicted as BI-RADS 1 — the worst possible direction of error — and the model’s answers scattered widely across repeated runs with near-maximum uncertainty. In another, a benign BI-RADS 1 case was predicted correctly every time, with unanimous agreement and zero uncertainty.
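
The article does not specify how uncertainty was computed. One common way to get a score that is zero for unanimous answers and near its maximum for widely scattered ones is the normalized Shannon entropy of the answers across repeated runs. The sketch below assumes that approach, assumes five BI-RADS categories, and uses invented answers; the function names are hypothetical.

```python
import math
from collections import Counter


def answer_uncertainty(answers, num_classes):
    """Normalized Shannon entropy in [0, 1] of the answers from repeated runs.

    0.0 means every run gave the same answer; 1.0 means the answers were
    spread evenly over all num_classes options. Requires num_classes >= 2.
    """
    n = len(answers)
    probs = [count / n for count in Counter(answers).values()]
    entropy = sum(p * math.log(1 / p) for p in probs)
    return entropy / math.log(num_classes)


def majority_answer(answers):
    """Most frequent answer across runs (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical repeated-run answers for two cases.
scattered = ["1", "2", "3", "4", "5"]   # a different answer on every run
unanimous = ["1", "1", "1", "1", "1"]   # the same answer on every run

u_scattered = answer_uncertainty(scattered, num_classes=5)
u_unanimous = answer_uncertainty(unanimous, num_classes=5)
print(majority_answer(unanimous), u_unanimous)  # unanimous answers give 0.0
print(round(u_scattered, 3))
```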

Uncertainty as a signal for human oversight

More broadly, uncertainty was operationally useful across the evaluation. We computed uncertainty from the model’s own outputs and found that when the model was most confident, performance was much stronger; when uncertainty was high, accuracy dropped sharply. Uncertainty helps show where radiologist intervention is most useful. That framing is also showing up in the broader screening mammography literature, which is increasingly focused on balancing workload reduction against clinical risk rather than reporting average accuracy alone.[4]

Why these patterns are actionable

These insights are actionable because they can change how AI-assisted radiology workflows are designed. Cases with high model uncertainty, severity downgrading risk, or abnormality-detection disagreement could be routed for second review rather than treated like routine AI-supported reads. Subgroup performance monitoring could also alert teams when the model is performing worse for certain patient populations, such as older patients, and trigger additional auditing or threshold adjustments.
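
A routing rule of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the `CaseSignals` fields, the thresholds, and the way each risk is detected without ground truth (for example, treating a large gap between the final BI-RADS answer and the highest answer seen across repeated runs as downgrade risk). Real thresholds would need clinical validation.

```python
from dataclasses import dataclass


@dataclass
class CaseSignals:
    uncertainty: float           # assumed normalized to [0, 1]
    predicted_birads: int        # the model's final BI-RADS answer
    max_birads_across_runs: int  # highest BI-RADS answer seen in repeated runs
    says_abnormal: bool          # the model's abnormality-detection answer


def review_reasons(case, max_uncertainty=0.5, downgrade_margin=2, suspicious_birads=3):
    """Return the list of reasons (possibly empty) to send a case for second review."""
    reasons = []
    if case.uncertainty > max_uncertainty:
        reasons.append("high_uncertainty")
    # Some runs rated the case much more severe than the final answer.
    if case.max_birads_across_runs - case.predicted_birads >= downgrade_margin:
        reasons.append("downgrade_risk")
    # The BI-RADS answer and the abnormality answer contradict each other.
    suspicious_but_normal = case.predicted_birads >= suspicious_birads and not case.says_abnormal
    negative_but_abnormal = case.predicted_birads == 1 and case.says_abnormal
    if suspicious_but_normal or negative_but_abnormal:
        reasons.append("abnormality_disagreement")
    return reasons


def route(case):
    """Route to 'second_review' if any reason fires, otherwise 'routine_read'."""
    return "second_review" if review_reasons(case) else "routine_read"


# Hypothetical cases.
unstable = CaseSignals(uncertainty=0.9, predicted_birads=1, max_birads_across_runs=5, says_abnormal=False)
stable = CaseSignals(uncertainty=0.0, predicted_birads=1, max_birads_across_runs=1, says_abnormal=False)
conflicted = CaseSignals(uncertainty=0.1, predicted_birads=4, max_birads_across_runs=4, says_abnormal=False)

for case in (unstable, stable, conflicted):
    print(route(case), review_reasons(case))
```

Returning the reasons alongside the routing decision keeps the rule auditable: a reviewer can see why a case was escalated, and teams can track how often each reason fires.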

Taken together, these signals form a practical system for deciding when AI can safely reduce workload and when radiologists need to apply closer scrutiny and oversight. By separating cases where AI is reliable from cases where it is uncertain or failure-prone, radiologists can spend less time double-checking low-risk outputs and more time on the cases that truly need expert judgment, reducing cognitive burden while preserving patient safety.

Looking ahead: foundation models raise the bar

An intelligent reliability layer will become increasingly important as radiology AI becomes broader and more capable. Even LLaVA-Mammo, the model evaluated in our demo, which already goes beyond single-label screening by handling multiple mammography tasks, is still focused on one imaging domain. Newer foundation-model efforts such as Pillar-0 point toward a future where radiology AI spans more modalities, body regions, and findings, which raises the bar further for continuous monitoring and oversight.[5]

More broadly, recent governance work in oncology argues that responsible deployment requires more than model development alone — it needs ongoing oversight across clinical, operational, and research settings.[6] As these systems become more embedded in care, the need for continuous monitoring, deep visibility into failure modes, and clear signals for when human oversight matters most will only grow.

References