arXiv · 2026-03-20

LLMs still stumble at whodunit: Clue-style text game exposes gaps in multi-step deductive reasoning

Experiment and key finding

A new arXiv preprint (arXiv:2603.17169) builds a text-based, multi-agent version of the classic board game Clue as a controlled testbed to probe whether large language models can perform multi-step deductive reasoning. The authors place six agents, drawn from GPT-4o-mini and Gemini-2.5-Flash, into the rule-based environment and measure how well they infer the culprit after rounds of private and public reasoning. The reported headline finding: deducing "whodunit" remains challenging for these LLM agents even in this simplified, symbolic setting.

Methods and interventions tested

The paper implements a multi-agent, turn-based Clue simulator so that reasoning mistakes can be traced to specific inference steps rather than noisy real-world input. The team also investigates interventions — reportedly including fine-tuning on structured logic puzzles — aimed at improving chain-of-thought and long-horizon deduction. The setup lets researchers isolate the models’ failure modes: do they forget earlier clues, misapply simple elimination logic, or hallucinate facts that break the inference chain?
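To make the elimination logic concrete, here is a minimal sketch of the kind of knowledge tracking such a Clue-style testbed can check an agent against. The class name, method names, and card lists below are illustrative assumptions, not the paper's actual implementation: the idea is simply that each revealed card or failed refutation shrinks the set of cards a player might hold, until exactly one suspect, weapon, and room can be in the envelope.

```python
# Hypothetical elimination-logic tracker for a simplified Clue-like game.
# All names and categories here are illustrative, not from the paper.

SUSPECTS = {"Green", "Mustard", "Plum"}
WEAPONS = {"Rope", "Knife", "Pistol"}
ROOMS = {"Hall", "Study", "Lounge"}
ALL_CARDS = SUSPECTS | WEAPONS | ROOMS


class DeductionTracker:
    """Tracks which cards each opponent might still hold; the hidden
    solution is the one card per category that nobody can hold."""

    def __init__(self, players, own_hand):
        self.own_hand = set(own_hand)
        # Initially, any opponent may hold any card we don't hold ourselves.
        self.possible = {p: ALL_CARDS - self.own_hand for p in players}

    def record_shown(self, player, card):
        # `player` revealed `card`, so no other player can hold it.
        for p, cards in self.possible.items():
            if p != player:
                cards.discard(card)

    def record_pass(self, player, suggestion):
        # `player` could not refute the suggestion: holds none of those cards.
        self.possible[player] -= set(suggestion)

    def solution(self):
        # A category is solved when exactly one of its cards is neither in
        # our hand nor possibly held by any player.
        out = {}
        for name, cat in [("suspect", SUSPECTS),
                          ("weapon", WEAPONS),
                          ("room", ROOMS)]:
            candidates = {c for c in cat
                          if c not in self.own_hand
                          and all(c not in held for held in self.possible.values())}
            if len(candidates) == 1:
                out[name] = candidates.pop()
        return out


# Example round: we hold Green and Rope; both opponents fail to refute
# the suggestion (Plum, Knife, Study), which pins down the envelope.
tracker = DeductionTracker(["A", "B"], {"Green", "Rope"})
tracker.record_pass("A", {"Plum", "Knife", "Study"})
tracker.record_pass("B", {"Plum", "Knife", "Study"})
print(tracker.solution())  # {'suspect': 'Plum', 'weapon': 'Knife', 'room': 'Study'}
```

The failure modes the paper probes map directly onto this bookkeeping: forgetting earlier clues corresponds to dropping `record_pass` updates, and hallucinated facts correspond to spurious `record_shown` calls that break the chain.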

Why this matters

Why care about a game? Because complex decision-making in many applications—legal reasoning, medical triage, automated debugging—depends on reliably chaining together many small, correct inferences. If state-of-the-art models from major labs struggle in a stylized logic test, that raises questions about their readiness for high-stakes, multi-step tasks. It also matters for how researchers evaluate progress: benchmark scores on single-step QA may mask brittleness that only shows up under multi-turn, multi-agent dynamics.

Broader context and next steps

The paper joins a growing body of work using synthetic environments to stress-test reasoning capabilities. Can fine-tuning on structured logical tasks or architectural changes close the gap? The arXiv posting invites follow-up experiments and replication. Readers can consult arXiv:2603.17169 for details and to judge the authors’ evidence for themselves.

Tags: AI, Research, Gaming