SEED AI | 健康 AI：高分不等于临床 ready SEED AI | In health AI, benchmark wins are not clinical readiness AI-assisted · reviewed

2026-06-27 · AI, health-ai, clinical-ai, benchmark

Paper

Evaluating the robustness and readiness of large frontier models in health AI applications

Yu Gu, Jingjing Fu, Xiaodong Liu, ..., Eric Topol, Hoifung Poon & Paul Vozila · Nature Medicine, 2026

Yu Gu、Eric Topol 和 Hoifung Poon 团队近期在 Nature Medicine 发表一篇 Article，系统评估大型 frontier models 在健康 AI 应用中的鲁棒性与 readiness。它的重点不是再证明 GPT-5、Gemini 2.5 Pro 或 OpenAI-o3 能在医学 benchmark 上拿高分，而是追问一个更接近临床的问题：当图像缺失、选项顺序变化、视觉证据被替换、模型还要解释自己时，这些高分还站得住吗？

Content infographic

为什么健康 AI 不能只看高分

这篇论文真正面对的不是“模型医学知识够不够”，而是临床决策场景天然有多模态输入、不完整信息、模糊线索和高失败成本。医学 benchmark 通常把问题压缩成可评分的选择题或 VQA，但真实医疗里更关键的是：模型在信息缺失、证据冲突、噪声输入和责任边界下是否仍然可靠。

AI 介入的主因是证据太碎、预测建模太难，次因是临床判断带有主观性且验证成本高。价值链位置属于 clinical decision / deployment 之前的 evaluation layer：这篇文章并不是把 AI 直接推入临床工作流，而是在回答“什么样的评估证据才足以支撑健康 AI 的 readiness 叙事”。

真正的新意是评价方式变了

这篇文章改变的不是某个医疗模型本身，而是健康 AI 的评价链条。传统读法看 leaderboard accuracy，容易把“某模型在 NEJM、JAMA、VQA-RAD 或 OmniMedVQA 上表现好”理解为模型已经具备临床推理能力；作者把评价拆成 6 类 stress tests，观察模型在输入被破坏、格式被扰动、视觉证据被替换和 reasoning 被要求解释时如何变化。

这一步很重要，因为它把 AI 应用边界从“模型答对多少题”推进到“模型为什么答对、在什么条件下答错、是否知道自己不该答”。对临床 AI 来说，这比单一准确率更接近真实落地，因为临床问题往往不是 clean benchmark，而是模态不完整、线索互相矛盾、患者背景不充分的情境。

证据强在压力测试，弱在真实世界闭环

这篇最强的证据形态是多 benchmark、多模型、结构化扰动加 clinician-guided profiling。作者评估的模型包括 GPT-4o、GPT-5、OpenAI-o3、OpenAI-o4-mini、Gemini 2.5 Pro、Claude 3.5 Sonnet、DeepSeek-VL2、Qwen3-VL、LLaVA-Med 1.5 和 MedGemma；任务覆盖 VQA-RAD、PMC-VQA、OmniMedVQA、MIMIC-CXR、JAMA 和 NEJM 等数据源。6 类压力测试包括 input removal、visual-required subset、format perturbation、distractor manipulation、visual substitution 和 reasoning signal fidelity。

分母和结果很关键。T1 在 NEJM 947 题和 JAMA 1,314 题上比较 image+text 与 text-only；GPT-5 在 NEJM 上从 81.33% 降到 67.41%，Gemini 2.5 Pro 从 81.12% 降到 67.43%，但在 JAMA 上下降小得多，说明不同 benchmark 测的东西并不一样。T2 的 NEJM Visual-required Subset 有 197 题，本应依赖图像，但去掉图像后 GPT-5 仍有 41.32%、Gemini 40.1%、OpenAI-o3 38.58%、OpenAI-o4-mini 37.97%，都高于 20% 随机基线，提示模型可能在用文本先验、疾病流行度或记忆关联猜答案。

更临床相关的是 T5 visual substitution：作者在 40 个 NEJM 问题中把原始图像替换成支持 distractor 的临床合理图像，文本和选项不变。GPT-5 从 84.0% 降到 35.0%，Gemini 2.5 Pro 从 76.0% 降到 52.5%，o4-mini 和 o3 也分别下降约 23 和 32.5 个百分点。T6 还显示，chain-of-thought 在 NEJM 上并没有稳定增益，模型生成的解释常出现“答案对但逻辑错”、虚构视觉发现、把初始感知错误继续放大的模式。

别误读第一点：这不是前瞻性临床试验，也不是医院真实工作流部署证据，而是 retrospective benchmarks 上的鲁棒性测试。别误读第二点：stress test 发现的失败模式不能直接推出某个模型“完全不能用”，但足以说明 benchmark 高分还不能被当成临床 readiness 证据。

最重要的一点：杠杆在评估，卡点在验证成本

这篇文章对 AI 应用边界的核心判断是：健康 AI 当前最真实的杠杆，不是马上替代医生做临床判断，而是建立更便宜、更系统的失效发现机制。Stress tests 能把“模型表现好不好”拆成更可诊断的问题：它是否忽略图像？是否依赖选项位置？是否被无关 distractor 影响？是否在没有证据时仍然编出合理解释？

卡点则仍在验证成本。选择题和 benchmark 可以快速评分，所以模型进步看起来很快；但临床信任需要前瞻性验证、真实工作流适配、责任归属、数据治理、可审计性和长期安全监测。这些验证昂贵、慢、主观且受监管环境约束，所以健康 AI 的落地速度会显著慢于一般知识问答或代码任务。

怎样批判性地读这篇临床 AI 论文

第一条红旗是证据层级。这篇已经比普通 benchmark 论文强很多，但仍然不是 prospective real-world validation；它告诉我们 leaderboard 不能代表 clinical readiness，却没有证明某个系统可以安全进入临床决策。

第二条红旗是任务形式。NEJM/JAMA 选择题和 VQA 能暴露重要问题，但它们仍是受限格式，不能覆盖开放式诊疗、纵向病程、医患沟通、多科协作和真实 EHR 工作流。

第三条红旗是数据污染和文本先验。作者用 PX-60 私有 X-ray 小数据集补充检查，但规模和模态都有限；公开 benchmark 是否进入过模型预训练语料，仍然难完全排除。

第四条红旗是 closed API 与可复现性。代码和最小复现数据已开放，但完整复现依赖商业模型 API、受限或版权数据集，以及当时的模型版本。对健康 AI 来说，模型更新、日志、内部推理不可见和数据驻留要求，都会影响真实机构能否审计和部署。

第五条红旗是 reasoning 可解释性。更长的解释不等于更可靠的推理；如果模型能生成流畅但虚构的视觉证据，那么“看起来像临床推理”的文本本身也需要独立验证。

启发与待解问题：AI 边界在便宜验证之外

传播层面最清楚的角度是“健康 AI 的高分陷阱”：不要只追逐模型排行榜，而要解释什么样的 stress test 才能让普通读者理解临床风险。公共表达上，重点可以放在“模型不是答错才危险，答对但理由错也危险”。

投资和合作判断上，这篇提示要优先看评估基础设施、临床数据治理、可审计 workflow 和持续监测能力，而不是只看单次 benchmark 分数。一个健康 AI 团队如果没有真实场景中的 failure taxonomy、stress testing、日志审计和责任边界，模型再强也很难形成临床级壁垒。

科研方向跟踪上，我会继续看三类证据：第一，stress testing 是否能从选择题扩展到开放式诊疗任务；第二，是否出现跨医院、前瞻性、真实 workflow 的验证；第三，模型更新后鲁棒性是否稳定，而不是每次版本变化都重新归零。

待解问题有两个。第一，能否建立一套被医院、监管者和模型公司共同接受的 health AI robustness audit？第二，如何把“模型知道自己不知道”做成可度量、可审计、可追责的临床系统能力？

Yang 的信号评级：High

轴一，信号强度：High。理由：这篇文章抓住了健康 AI 的关键矛盾：benchmark 高分和临床 readiness 之间还有一层鲁棒性、解释真实性和真实工作流验证的鸿沟。

轴二，临床成熟度：Medium-Low。理由：它提供了更好的评价框架和开放代码，但证据仍主要来自 retrospective benchmarks 和压力测试；真正的临床成熟度还需要前瞻性、真实世界、可审计的部署证据。

一句话总结：健康 AI 最快进步的是可评分 benchmark，最慢建立的是临床信任。

The Yu Gu, Eric Topol and Hoifung Poon teams recently published a Nature Medicine Article evaluating the robustness and readiness of large frontier models in health AI applications. The point is not to show again that GPT-5, Gemini 2.5 Pro or OpenAI-o3 can score well on medical benchmarks. The sharper question is whether those scores still hold when images are removed, answer choices are shuffled, visual evidence is substituted and the model is asked to explain itself.

Content infographic

Why health AI cannot be judged by high scores alone

The real problem is not simply whether models contain enough medical knowledge. Clinical decision settings involve multimodal inputs, incomplete information, ambiguous signals and high failure costs. Medical benchmarks often compress this into scoreable multiple-choice or VQA tasks, but real healthcare asks whether a model remains reliable under missing information, conflicting evidence, noisy inputs and unclear responsibility boundaries.

The primary reason AI enters this space is fragmented evidence and hard prediction; secondary reasons are subjective judgment and expensive validation. The value-chain position is the evaluation layer before clinical decision/deployment. This paper does not place AI directly into the clinical workflow. It asks what evidence should be required before health AI readiness claims become credible.

What changed is the evaluation chain

The contribution is not a new medical model. It changes the evaluation chain for health AI. A conventional reading focuses on leaderboard accuracy and can easily treat strong scores on NEJM, JAMA, VQA-RAD or OmniMedVQA as evidence of clinical reasoning. The authors instead decompose evaluation into six stress tests, asking how models behave when inputs are degraded, formats are perturbed, visual evidence changes and reasoning traces are audited.

This matters because it moves the AI-boundary question from “how many items did the model answer correctly?” to “why did it answer correctly, under what conditions does it fail, and does it know when it should not answer?” For clinical AI, that is closer to deployment reality than a single accuracy number, because real cases are rarely clean benchmark items.

The evidence is strong stress testing, not real-world closure

The strongest evidence shape is multi-benchmark, multi-model, structured perturbation plus clinician-guided profiling. The evaluated models include GPT-4o, GPT-5, OpenAI-o3, OpenAI-o4-mini, Gemini 2.5 Pro, Claude 3.5 Sonnet, DeepSeek-VL2, Qwen3-VL, LLaVA-Med 1.5 and MedGemma. Tasks span VQA-RAD, PMC-VQA, OmniMedVQA, MIMIC-CXR, JAMA and NEJM. The six stress tests cover input removal, a visual-required subset, format perturbation, distractor manipulation, visual substitution and reasoning signal fidelity.

The denominators matter. T1 compares image+text versus text-only conditions on 947 NEJM questions and 1,314 JAMA questions. GPT-5 drops from 81.33% to 67.41% on NEJM, and Gemini 2.5 Pro drops from 81.12% to 67.43%, while JAMA shows much smaller declines. T2 uses a 197-item NEJM Visual-required Subset. Even after removing images, GPT-5 reaches 41.32%, Gemini 40.1%, OpenAI-o3 38.58% and OpenAI-o4-mini 37.97%, all above the 20% random baseline, suggesting reliance on textual priors, disease prevalence or memorized associations.

The more clinically important test is T5 visual substitution. In 40 NEJM questions, the authors replaced the original image with a clinically plausible image supporting a distractor while keeping the vignette and answer choices unchanged. GPT-5 drops from 84.0% to 35.0%, Gemini 2.5 Pro from 76.0% to 52.5%, and o4-mini and o3 drop by roughly 23 and 32.5 percentage points. T6 further shows that chain-of-thought prompting does not reliably help on NEJM, and manual audits find recurring patterns: correct answers with wrong logic, fabricated visual findings and perceptual errors amplified by later reasoning.

Do not misread this in two ways. First, this is not a prospective clinical trial or real hospital workflow deployment; it is robustness testing on retrospective benchmarks. Second, the failure modes do not prove that any model is useless, but they do show that benchmark wins are not clinical readiness evidence.

The leverage is evaluation; the bottleneck is validation cost

The core AI-boundary lesson is that the most realistic near-term leverage in health AI is not replacing physicians in clinical judgment. It is building cheaper, more systematic ways to find failure modes. Stress tests turn “does the model perform well?” into more diagnostic questions: does it ignore images, rely on answer position, get fooled by distractors or generate plausible explanations without evidence?

The bottleneck remains validation cost. Multiple-choice benchmarks can be scored quickly, so model progress looks fast. Clinical trust requires prospective validation, workflow fit, responsibility assignment, data governance, auditability and long-term safety monitoring. Those tests are expensive, slow, subjective and institutionally constrained, which is why health AI deployment should move much more slowly than general question answering or coding tools.

How to read this clinical AI paper critically

The first red flag is evidence level. This paper is much stronger than ordinary benchmark reports, but it is still not prospective real-world validation. It shows that leaderboards cannot stand in for readiness, but it does not prove that a system is safe for clinical decision support.

The second red flag is task form. NEJM/JAMA multiple-choice tasks and VQA are useful stressors, but they remain constrained formats. They cannot fully capture open-ended diagnosis, longitudinal context, clinician-patient interaction, multi-specialty coordination or real EHR workflows.

The third red flag is contamination and textual priors. The authors add a small private PX-60 X-ray evaluation, but it is limited in scale and modality. Whether public benchmarks have appeared in model pretraining corpora remains hard to exclude completely.

The fourth red flag is closed APIs and reproducibility. Code and minimum reproducibility data are released, but full replication depends on commercial model APIs, restricted or copyrighted datasets and point-in-time model versions. In health AI, model updates, logging behavior, hidden reasoning and data residency requirements all affect institutional auditability and deployment.

The fifth red flag is explainability. Longer explanations do not mean more reliable reasoning. If a model can generate fluent but fabricated visual evidence, the text that appears to be clinical reasoning must itself be independently validated.

Takeaways and open questions: AI boundaries sit beyond cheap validation

For public communication, the clearest angle is the high-score trap in health AI. The public point is not that models are useless; it is that being right for the wrong reason can be dangerous, especially when the model presents a fluent clinical explanation.

For investment and collaboration judgment, this paper shifts attention toward evaluation infrastructure, clinical data governance, auditable workflows and continuous monitoring. A health AI team without failure taxonomy, stress testing, logs and responsibility boundaries may struggle to build a clinical-grade moat, even if its model scores are strong.

For research tracking, I would follow three evidence streams: whether stress testing moves beyond multiple-choice tasks into open-ended clinical work; whether cross-hospital prospective workflow evidence appears; and whether robustness persists after model version updates rather than resetting with every release.

Two open questions matter most. First, can health AI develop a robustness audit that hospitals, regulators and model companies all accept? Second, can “the model knows when it does not know” become a measurable, auditable and accountable clinical system capability?

Yang’s signal rating: High

Axis one, signal strength: High. Reason: the paper targets the central tension in health AI: the gap between benchmark success and the robustness, reasoning fidelity and workflow evidence needed for clinical readiness.

Axis two, clinical maturity: Medium-Low. Reason: the framework and code are useful, but the evidence is still primarily retrospective benchmarks and stress tests. True clinical maturity requires prospective, real-world, auditable deployment evidence.

One-sentence summary: Health AI improves fastest on scoreable benchmarks and slowest on clinical trust.