MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
New benchmark separates thinking from revision
A new arXiv paper, MEDLEY-BENCH (arXiv:2604.16009), introduces a behavioural benchmark for metacognition: the capacity of a system to monitor and regulate its own reasoning. The dataset and evaluation explicitly separate three conditions (independent reasoning, private self-revision, and socially influenced revision) under genuine inter-model disagreement. The authors report running the benchmark across 35 models, with a design meant to tease apart whether larger models merely assess their outputs better or actually exert reliable control over their chain of thought.
What the benchmark measures
MEDLEY-BENCH frames metacognition as a set of behavioural decisions: can a model detect when it is likely wrong, fix its own mistakes privately, and change its stance after seeing peer disagreement? The paper constructs scenarios that force a trade-off between private correction and social conformity, letting researchers observe whether a model's self-revision reflects sincere error correction or simply echoes the other models. Reportedly, the benchmark spans a mix of model families to capture a broad behavioural landscape.
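To make the three conditions concrete, here is a minimal sketch of how one question might be run through them. Everything in it is an assumption for illustration: the model callable, the prompt wording, and the run_conditions name are hypothetical, not the authors' published harness.

    def run_conditions(model, question, peer_answers):
        # Condition 1: independent reasoning. The model answers alone.
        independent = model(f"Question: {question}\nAnswer:")

        # Condition 2: private self-revision. The model sees only its own
        # answer and may revise with no outside signal.
        private = model(
            f"Question: {question}\n"
            f"Your previous answer: {independent}\n"
            "Review your answer and revise it if you find an error:"
        )

        # Condition 3: socially influenced revision. The model also sees
        # disagreeing answers from other models before revising.
        peers = "\n".join(f"- {a}" for a in peer_answers)
        social = model(
            f"Question: {question}\n"
            f"Your previous answer: {independent}\n"
            f"Other models answered:\n{peers}\n"
            "Review your answer and revise it if you find an error:"
        )
        return {"independent": independent, "private": private, "social": social}

Scoring would then compare the three answers against ground truth, which is what lets the benchmark distinguish sincere private correction from revision driven by peer pressure.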
Findings: evaluation improves faster than control
According to the authors, larger models tend to get better at evaluating their own answers: they flag uncertainty and identify probable errors more reliably. But this improvement does not uniformly translate into robust self-correction or principled resistance to social pressure. In short, scale appears to buy evaluation, not consistent control over reasoning or revision strategies. The divergence is reportedly sharpest when private self-revision and socially influenced revision come into conflict.
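One way to see how "evaluation" and "control" can come apart is to score them as separate rates. The sketch below assumes a hypothetical per-question record schema (flagged, was_wrong, fixed, swayed); the paper's actual metric definitions may differ.

    def evaluation_vs_control(records):
        """Each record is a dict with boolean fields (hypothetical schema):
           flagged   -- the model marked its own answer as likely wrong
           was_wrong -- the independent answer was in fact wrong
           fixed     -- the private revision corrected the error
           swayed    -- the social revision adopted an incorrect peer majority
        """
        wrong = [r for r in records if r["was_wrong"]]
        flagged_wrong = [r for r in wrong if r["flagged"]]

        # Evaluation: how often does the model notice its own errors?
        detection = len(flagged_wrong) / max(len(wrong), 1)

        # Control: of the errors it noticed, how many does it fix privately?
        correction = sum(r["fixed"] for r in flagged_wrong) / max(len(flagged_wrong), 1)

        # Social susceptibility: how often an initially correct answer is
        # abandoned under incorrect peer pressure.
        right = [r for r in records if not r["was_wrong"]]
        sway = sum(r["swayed"] for r in right) / max(len(right), 1)

        return {"detection": detection, "correction": correction, "sway": sway}

On the paper's reported trend, a larger model would score higher on detection without a matching gain in correction or resistance to sway.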
Why this matters — and the geopolitical angle
For Western readers, the result is a cautionary note: better self-assessment is not the same as dependable, controllable reasoning. As national AI stacks evolve, with firms across the U.S. and China (for example Baidu (百度) and Alibaba (阿里巴巴)) racing to deploy large language models, benchmarks that dissect metacognition will be crucial for safety and product design. Restrictions on model access and cross-border data flows, from export controls to platform policies, are reported to complicate comprehensive evaluation, making independent, transparent benchmarks like MEDLEY-BENCH all the more important.
