OmniToM: a new benchmark probes whether LLMs really model other minds
What the paper introduces
A team of researchers has posted OmniToM on arXiv (arXiv:2605.26322), proposing a shift in how Theory of Mind (ToM) is evaluated in large language models (LLMs). Instead of judging models solely by their final answer to a social-reasoning question, OmniToM requires them to produce explicit intermediate belief states: representations of what each agent knows or believes at each step. Why does that matter? Because a correct final answer can mask whether the model actually constructed the underlying mental model or simply guessed the outcome.
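To make "intermediate belief states" concrete, here is a minimal Python sketch built on the classic Sally-Anne false-belief scenario. It is illustrative only and is not OmniToM's data format: the `Event` class, the `witnesses` field, and the `track_beliefs` function are hypothetical names, and the sketch assumes an agent updates a belief only when it performs or observes an event.

```python
from dataclasses import dataclass


@dataclass
class Event:
    actor: str            # who performs the action
    obj: str              # object being moved
    location: str         # where the object ends up
    witnesses: set        # other agents who observe the event (hypothetical field)


def track_beliefs(events, agents):
    """Return one snapshot per step of where each agent believes each object is."""
    beliefs = {agent: {} for agent in agents}
    trace = []
    for ev in events:
        for agent in agents:
            # Assumption: only the actor and the witnesses update their beliefs.
            if agent == ev.actor or agent in ev.witnesses:
                beliefs[agent][ev.obj] = ev.location
        # Copy the state so later updates do not alter earlier snapshots.
        trace.append({agent: dict(known) for agent, known in beliefs.items()})
    return trace


events = [
    Event("Sally", "marble", "basket", {"Anne"}),
    Event("Anne", "marble", "box", set()),  # Sally is out of the room
]
trace = track_beliefs(events, ["Sally", "Anne"])
for step, snapshot in enumerate(trace, start=1):
    print(step, snapshot)
```

After the second event, Sally still believes the marble is in the basket while Anne knows it is in the box. A per-step trace of this kind is the sort of intermediate state that a final-answer-only test never inspects.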
Key findings
Using explicit belief modeling, the authors show that many LLMs that score well on conventional ToM tests fail to maintain consistent, human-like belief attributions across intermediate steps. The benchmark exposes cases where a model’s final response is right for the wrong reasons: the model reaches the correct answer without reliably attributing beliefs, intentions, or knowledge to agents throughout the scenario. The paper provides a protocol for eliciting and evaluating these intermediate states and reports systematic gaps in current models’ social reasoning capabilities.
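The gap between final-answer accuracy and step-wise belief accuracy can be sketched in a few lines of Python. This is not the paper's evaluation protocol: the `score_response` function, the exact-match comparison, and the data layout are assumptions chosen for illustration, and the sample traces are invented for the example rather than taken from the benchmark.

```python
def score_response(gold_trace, predicted_trace, gold_answer, predicted_answer):
    """Compare a final answer and per-step belief states against a reference."""
    final_correct = predicted_answer == gold_answer
    # Exact match per step; steps missing from the prediction count as wrong.
    matched = sum(1 for gold, pred in zip(gold_trace, predicted_trace) if gold == pred)
    step_accuracy = matched / len(gold_trace) if gold_trace else 0.0
    return {
        "final_correct": final_correct,
        "step_accuracy": step_accuracy,
        "right_for_wrong_reasons": final_correct and step_accuracy < 1.0,
    }


# Invented example: where each agent believes the marble is, step by step.
gold_trace = [
    {"Sally": "basket", "Anne": "basket"},
    {"Sally": "basket", "Anne": "box"},
]
predicted_trace = [
    {"Sally": "basket", "Anne": "basket"},
    {"Sally": "box", "Anne": "box"},  # wrongly gives Sally the updated belief
]
result = score_response(gold_trace, predicted_trace, "basket", "basket")
print(result)
```

Here the hypothetical model answers "basket" correctly but matches only one of the two reference belief states, so the response is flagged as right for the wrong reasons. A real protocol would likely need softer matching than exact equality, since models express belief states in free text.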
Why it matters — for developers, regulators and geopolitics
OmniToM is relevant to Western and Chinese model developers alike. Better diagnostics for social reasoning are important for chatbot safety, alignment research, and applications where understanding human mental states affects outcomes, from education to customer service. For Chinese LLM efforts such as Baidu’s (百度) ERNIE series or other domestic models, the benchmark offers a way to demonstrate deeper interpretability beyond end-to-end accuracy. Regulators and policymakers are reportedly paying increasing attention to model interpretability and risk assessment; a benchmark that exposes hidden failure modes could feed directly into safety standards and compliance debates amid ongoing tech rivalry and export-control discussions.
Next steps
OmniToM is released as a preprint on arXiv and invites the community to adopt, critique, and extend its methodology. The paper’s core claim is simple but consequential: asking a model what agents believe, step by step, gives a clearer picture of whether it truly models other minds — or just finesses the answer.
