ArXiv 2026-04-13

New arXiv survey warns exam‑ready LLMs are not yet clinic‑ready, and offers MR‑Bench to test medical reasoning

LLMs score well on tests but clinical care is different

A new arXiv paper, "Medical Reasoning with Large Language Models: A Survey and MR‑Bench" (https://arxiv.org/abs/2604.08559), argues that strong performance on medical exam‑style tasks does not guarantee safe, reliable clinical decision‑making. The authors survey recent work showing that while large language models (LLMs) can excel at exam‑style answers and medical knowledge retrieval, real‑world care is safety‑critical, context‑dependent, and requires up‑to‑date evidence and calibrated uncertainty estimates.

What’s missing from current evaluations?

According to the survey, existing benchmarks emphasize static question‑answer performance rather than step‑by‑step reasoning, management decisions, and the ability to detect and communicate uncertainty. It has been reported that LLMs can hallucinate clinical facts or present low‑confidence inferences as confident recommendations, raising safety concerns if such outputs are used in patient care. The paper stresses the need to evaluate models on reasoning chains, counterfactuals, and evolving evidence rather than only on exam scores.

MR‑Bench: a focused stress test for clinical reasoning

To address these gaps, the authors propose MR‑Bench, a new benchmark framework intended to better probe medical reasoning capabilities — including diagnostic inference, treatment planning, and justification under uncertainty. The proposal focuses on metrics beyond accuracy, such as justification quality, error modes, and the model’s ability to revise conclusions when given new data. The paper is framed as both a survey and a call to shift evaluation practices toward safety‑oriented, clinically realistic scenarios.
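The paper does not publish an evaluation API, but one of the metrics it describes — whether a model revises its conclusion when new data arrives — can be illustrated with a toy harness. Everything below (the `CaseStep` schema, the `revision_score` function, and the sample vignette) is a hypothetical sketch for illustration, not code from MR‑Bench:

```python
from dataclasses import dataclass

@dataclass
class CaseStep:
    """One step of an evolving clinical vignette (hypothetical schema)."""
    new_evidence: str      # information revealed at this step
    expected_answer: str   # reference diagnosis/plan after this evidence

def revision_score(model_answers, case_steps):
    """Fraction of steps where the model's answer matches the reference
    AFTER each piece of evidence is revealed -- a toy proxy for the
    'revise conclusions when given new data' metric the survey calls for."""
    if len(model_answers) != len(case_steps):
        raise ValueError("expected one answer per evidence step")
    hits = sum(ans == step.expected_answer
               for ans, step in zip(model_answers, case_steps))
    return hits / len(case_steps)

# Toy vignette: an initial impression, then a lab result that should change it.
steps = [
    CaseStep("fever, productive cough", "community-acquired pneumonia"),
    CaseStep("blood culture grows S. aureus", "staphylococcal bacteremia"),
]
# A model that anchors on its first answer and fails to revise:
answers = ["community-acquired pneumonia", "community-acquired pneumonia"]
print(revision_score(answers, steps))  # -> 0.5
```

A real benchmark would of course need semantic matching rather than string equality, plus per-step justification grading; the point of the sketch is that scoring answers after each evidence update captures failure modes that a single static question misses.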

Policy and deployment implications

Why does this matter beyond the lab? Deployment of medical LLMs sits at the intersection of health regulation, data governance, and geopolitics. Reportedly, export controls on advanced chips and national AI strategies are already shaping where and how large models are trained and deployed, and regulators in the US, EU and China are increasingly scrutinizing clinical AI. The survey and MR‑Bench aim to give clinicians, developers and regulators better tools to judge when an LLM is ready — and when it clearly is not.

Tags: AI, Research, Biotech