arXiv, 2026-04-13

New "Robust Reasoning Benchmark" challenges LLMs with 14 perturbations on AIME problems

Summary

A new preprint on arXiv, "Robust Reasoning Benchmark" (arXiv:2604.08571), argues that high scores on standard mathematical benchmarks can mask brittle, format-sensitive reasoning in large language models (LLMs). The paper proposes a perturbation pipeline of 14 techniques designed to scramble or disguise problem presentation while preserving mathematical content. Can an LLM still solve a problem when the surface form changes? The authors set out to answer that question with a systematic stress test.

What the authors did

The team applied their pipeline to the AIME 2024 dataset — problems drawn from the American Invitational Mathematics Examination, a high-school level contest that tests multi-step quantitative reasoning — and used the transformed problems to probe model robustness. They reportedly evaluated eight LLMs across the perturbed dataset, measuring how performance degrades under changes such as rewording, symbol substitution, layout alteration, and introduction of irrelevant text. The framework is designed to expose overfitting to conventional textual formatting rather than true compositional reasoning.
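The paper's actual pipeline is not reproduced here, but two of the perturbation types it describes — symbol substitution and injection of irrelevant text — can be sketched in a few lines. This is an illustrative toy, assuming whitespace-tokenized problems; the function name and substitution map are hypothetical, not taken from the paper.

```python
import random

def perturb(problem: str, seed: int = 0) -> str:
    """Toy surface-level perturbation: rename variables consistently and
    prepend a distractor sentence, leaving the math itself unchanged.
    (Hypothetical sketch; not the authors' pipeline.)"""
    rng = random.Random(seed)  # seeded so perturbations are reproducible
    # Symbol substitution: consistently swap variable tokens so the
    # surface form changes but the underlying problem does not.
    substitutions = {"x": "t", "n": "m"}
    tokens = [substitutions.get(tok, tok) for tok in problem.split()]
    perturbed = " ".join(tokens)
    # Irrelevant-text injection: add a sentence carrying no
    # mathematical information, to test resistance to distraction.
    distractor = "Note: the exam hall seats 300 students."
    return distractor + " " + perturbed

# Example:
# perturb("Find x such that x + 1 = 2.")
# → "Note: the exam hall seats 300 students. Find t such that t + 1 = 2."
```

A real pipeline would need token-aware parsing (so that renaming `x` does not corrupt words containing the letter), but the principle — content-preserving surface changes — is the same.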

Findings and caveats

The authors report that model accuracy falls substantially under many of the perturbations, indicating that strong benchmark performance does not necessarily imply generalizable mathematical reasoning. Because this is a preprint, the claims are preliminary and have not undergone peer review; readers should treat specific numeric results as provisional. The authors have released the pipeline and the perturbed datasets alongside the preprint, inviting replication and extension.
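The headline measurement — how much accuracy degrades per perturbation — can be expressed as a relative drop from the unperturbed baseline. The metric below is a plausible formulation for this kind of comparison, not necessarily the exact one used in the paper.

```python
def robustness_drop(base_acc: float, perturbed_acc: dict) -> dict:
    """Relative accuracy drop per perturbation: (base - perturbed) / base.
    A value of 0.0 means no degradation; 1.0 means accuracy collapsed to zero.
    (Illustrative metric; the paper may report degradation differently.)"""
    return {name: (base_acc - acc) / base_acc
            for name, acc in perturbed_acc.items()}

# Example: a model scoring 0.80 unperturbed, 0.60 after rewording:
drops = robustness_drop(0.80, {"reword": 0.60, "symbol_sub": 0.72})
# drops["reword"] ≈ 0.25, i.e. a 25% relative loss of accuracy
```

Reporting relative rather than absolute drops makes models with different baseline scores comparable under the same perturbation.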

Why it matters

The work highlights a practical risk for deploying LLMs in domains that require robust reasoning: models that latch onto formatting shortcuts can fail when presentation changes. That matters for education, scientific assistance, and any application where input distributions shift. The benchmark offers a way for researchers and engineers to stress-test models beyond conventional accuracy metrics and to prioritize robustness in model development.
