ArXiv 2026-05-27

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Key finding

A new preprint on arXiv (arXiv:2605.26789) warns that standard post-training evaluations can be misleading: a model that retains or even improves its atomic factual knowledge may still fail to assemble those facts into correct multi-step answers. If a model "knows" individual facts, does it follow that it can reason compositionally? The paper's short answer is no. The authors describe a phenomenon they call "composition collapse," in which training recipes that yield statistically indistinguishable atomic knowledge produce markedly different compositional behavior.

What the authors did and found

The paper argues that aggregate benchmark scores treat multi-hop reasoning as a single capability and therefore hide a critical failure mode. Through controlled experiments, the authors construct model variants whose single-fact accuracy is effectively the same, yet whose multi-step question answering diverges. Post-training or fine-tuning that raises atomic accuracy does not reliably translate into improved compositional reasoning, the study finds, suggesting that current evaluation practices conflate knowing facts with the ability to assemble them.
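To make the distinction concrete, here is a minimal sketch of how atomic accuracy and compositional accuracy could be scored separately. It is an illustration only, not the paper's evaluation code: the `TwoHopItem` type, the function names, and the toy results for "variant A" and "variant B" are all hypothetical. The idea is to score multi-step questions only where the model already answers every constituent fact correctly, so that a failure cannot be blamed on a missing fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoHopItem:
    """Hypothetical two-hop question built from two atomic facts."""
    fact_ids: tuple[str, str]
    question_id: str


def atomic_accuracy(fact_correct: dict[str, bool]) -> float:
    """Fraction of single facts the model answers correctly."""
    return sum(fact_correct.values()) / len(fact_correct)


def conditional_composition_accuracy(
    items: list[TwoHopItem],
    fact_correct: dict[str, bool],
    question_correct: dict[str, bool],
) -> float:
    """Two-hop accuracy restricted to questions whose facts are both known."""
    eligible = [it for it in items if all(fact_correct[f] for f in it.fact_ids)]
    if not eligible:
        return float("nan")
    return sum(question_correct[it.question_id] for it in eligible) / len(eligible)


# Toy data (made up): both variants know the same three of four facts.
facts = {"f1": True, "f2": True, "f3": True, "f4": False}
items = [
    TwoHopItem(("f1", "f2"), "q1"),
    TwoHopItem(("f2", "f3"), "q2"),
    TwoHopItem(("f3", "f4"), "q3"),
]
variant_a = {"q1": True, "q2": True, "q3": False}
variant_b = {"q1": True, "q2": False, "q3": False}

print("atomic accuracy (both variants):", atomic_accuracy(facts))
print("variant A composition:", conditional_composition_accuracy(items, facts, variant_a))
print("variant B composition:", conditional_composition_accuracy(items, facts, variant_b))
```

In this toy setup the two variants are identical on single facts but differ on the questions that combine them, which is the kind of gap an aggregate score would not reveal.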

Why it matters

This matters to every organization building large language models, from US and European labs to Chinese firms such as Baidu (百度), Alibaba (阿里巴巴) and Tencent (腾讯), because product and policy decisions often hinge on benchmark numbers. Many groups reportedly optimize heavily for aggregate scores, a practice said to encourage brittle improvements that do not generalize to chain-of-thought tasks. The paper's remedy is practical: adopt finer-grained, compositional evaluations and training objectives that explicitly test and teach fact assembly, not just fact retention. In a world of export controls and intense geopolitical scrutiny over AI capability, efficient, robust compositional reasoning may become as important as raw factual breadth, and current benchmarks are not yet up to the task.
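One simple form a finer-grained evaluation could take is reporting accuracy per reasoning depth instead of a single number. The sketch below assumes each benchmark result is tagged with a hop count. That tagging scheme, the function names, and the sample results are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def aggregate_accuracy(results: list[tuple[int, bool]]) -> float:
    """Single headline score over all questions, regardless of hop count."""
    return sum(ok for _, ok in results) / len(results)


def accuracy_by_hops(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Accuracy broken down by the number of reasoning hops per question."""
    totals: dict[int, int] = defaultdict(int)
    hits: dict[int, int] = defaultdict(int)
    for hops, ok in results:
        totals[hops] += 1
        hits[hops] += int(ok)
    return {h: hits[h] / totals[h] for h in sorted(totals)}


# Toy results (made up): six 1-hop questions, four 2-hop questions.
results = [(1, True)] * 6 + [(2, True)] * 1 + [(2, False)] * 3

print("aggregate:", aggregate_accuracy(results))
print("by hops:", accuracy_by_hops(results))
```

With these made-up numbers, the headline score looks respectable while the two-hop slice is weak, which is the pattern a single aggregate figure tends to hide.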
