Rescaling Confidence: What Scale Design Reveals About LLM Metacognition
A new arXiv preprint (arXiv:2603.09309) argues that a seemingly trivial design choice—the numeric scale used when large language models (LLMs) verbalize confidence—shapes what we think those models “know” about their own answers. The paper examines verbalized confidence (typically reported on a 0–100 scale) across six LLMs and three datasets, and finds that the scale is far from neutral. But can a choice of numbers really change a model’s apparent self-knowledge? The authors say yes.
Key findings
Across models and tasks, the study finds heavy discretization: confidence reports cluster at a handful of round values, leaving large portions of the 0–100 range effectively unused. The authors further report that rescaling (changing the mapping from internal scores to the displayed numeric range) alters measured calibration and the apparent degree of metacognitive ability. In other words, the same underlying model can look better or worse at judging its own correctness depending on how researchers or interfaces map its internal signals to human-readable percentages.
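To make the effect concrete, here is a minimal, self-contained simulation in Python. This is not the paper's code: the Beta-distributed internal signal, the snapping rule, and both elicitation scales are illustrative assumptions. It shows how one internally calibrated signal can yield different measured calibration (expected calibration error, ECE) purely because of the display scale and the back-mapping to probabilities.

```python
# Minimal sketch (not the paper's code): the SAME internal signal,
# verbalized on two different scales, yields different measured ECE.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Latent "internal" confidence in [0, 1]; correctness is drawn from it,
# so the internal signal is perfectly calibrated by construction.
internal = rng.beta(5, 2, size=n)
correct = rng.random(n) < internal

def verbalize(scores, lo, hi, step):
    """Map internal scores onto a numeric scale [lo, hi], snapping to
    'round' values (multiples of step) to mimic the clustering the
    paper reports."""
    raw = lo + scores * (hi - lo)
    snapped = np.round(raw / step) * step
    return np.clip(snapped, lo, hi)

def ece(conf, correct, bins=10):
    """Expected Calibration Error over equal-width bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    total = 0.0
    for i in range(bins):
        lo, hi = edges[i], edges[i + 1]
        mask = (conf >= lo) & (conf <= hi) if i == bins - 1 else (conf >= lo) & (conf < hi)
        if mask.any():
            total += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return total

# Scale A: 0-100, snapped to multiples of 10, divided by 100 for scoring.
conf_a = verbalize(internal, 0, 100, 10) / 100
# Scale B: 1-5, snapped to integers, then naively back-mapped to [0, 1].
# The back-mapping itself is a design choice that affects measured ECE.
conf_b = (verbalize(internal, 1, 5, 1) - 1) / 4

print(f"ECE on 0-100 scale (multiples of 10): {ece(conf_a, correct):.4f}")
print(f"ECE on 1-5 scale (naive back-mapping): {ece(conf_b, correct):.4f}")
print(f"distinct 0-100 values actually used:  {np.unique(conf_a).size} of 101")
```

Because the internal signal is perfectly calibrated by construction, any nonzero ECE here is an artifact of the elicitation scale and the back-mapping, which is the kind of measurement artifact the paper warns about.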
Why it matters
For Western policymakers and engineers unfamiliar with the nuances of LLM evaluation, the stakes are concrete: uncertainty estimates are a core tool for safe deployment in black-box settings, from content moderation to medical assistants. If scale design biases those estimates, then benchmarks, user-facing interfaces, and even regulatory audits could be misled. As governments and regulators worldwide push for AI transparency, and as companies contend with export controls and scrutiny around reliability, something as mundane as "0–100" is suddenly a policy-relevant choice. The paper prompts a simple but urgent question for practitioners: standardize your scales, or at least document them (a sketch of what that could look like follows); otherwise you may be measuring the metric, not the model.
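The paper leaves "document your scale" open-ended; one hypothetical sketch (all class and field names below are invented for illustration, not taken from the paper or any library) is to log the elicitation scale, the exact prompt wording, and the back-mapping alongside every reported number:

```python
# Hypothetical sketch: record the elicitation scale with each confidence
# report so downstream audits can tell whether two numbers are comparable.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ConfidenceScale:
    lo: int             # lowest value the model is asked to report
    hi: int             # highest value
    step: int           # granularity the prompt requests (e.g., multiples of 5)
    prompt_phrase: str  # exact wording used to elicit the number

@dataclass
class ConfidenceReport:
    raw_value: float        # number as verbalized by the model
    scale: ConfidenceScale  # the scale it was elicited on

    def as_probability(self) -> float:
        """Naive affine back-mapping to [0, 1]; this mapping is itself a
        design choice and should be documented too."""
        return (self.raw_value - self.scale.lo) / (self.scale.hi - self.scale.lo)

scale = ConfidenceScale(lo=0, hi=100, step=5,
                        prompt_phrase="Rate your confidence from 0 to 100.")
report = ConfidenceReport(raw_value=85, scale=scale)
print(json.dumps({**asdict(report), "probability": report.as_probability()}, indent=2))
```

Attaching the scale metadata to each report makes two confidence numbers comparable only when their scales match, which is one way to avoid silently comparing values elicited under incompatible designs.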
