Did You Forget What I Asked? New Paper Shows LLMs Drop Formatting When Tasks Get Hard
Large language models can solve hard problems, yet they often fail the simplest instruction attached to the same prompt. A new arXiv preprint, "Did You Forget What I Asked? Prospective Memory Failures in Large Language Models" (arXiv:2603.23530), frames that surprising gap through the lens of prospective memory from cognitive psychology and demonstrates it with a controlled, verifiable paradigm. The finding is stark: as task demands rise, compliance with explicit formatting constraints (the very properties that make model outputs machine-parseable and legally or operationally useful) reliably degrades.
Findings from the paper
The authors combine benchmark tasks of increasing complexity with strict formatting requirements to isolate when and how models "forget" an instruction they were explicitly given. Across their experiments, formatting adherence falls off even when task accuracy remains acceptable, suggesting a distinct failure mode separate from reasoning or factual errors. The paper uses a controlled setup that lets researchers measure formatting compliance quantitatively, providing a clearer test than anecdotal prompt-engineering examples.
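The paper's measurement harness is not published in the article above, but the core idea of scoring formatting compliance separately from answer accuracy can be sketched as follows. The `ANSWER: <integer>` constraint here is a hypothetical stand-in for whatever format rule a prompt imposes:

```python
import re

def formatting_compliant(output: str) -> bool:
    """Check one hypothetical constraint: the output must be exactly
    one line of the form 'ANSWER: <integer>'."""
    return re.fullmatch(r"ANSWER: -?\d+", output.strip()) is not None

def compliance_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that satisfy the formatting constraint,
    independent of whether the answer itself is correct."""
    if not outputs:
        return 0.0
    return sum(formatting_compliant(o) for o in outputs) / len(outputs)

# A correct answer in the wrong shape still counts as a compliance failure.
outputs = ["ANSWER: 42", "The answer is 42.", "ANSWER: 7"]
print(compliance_rate(outputs))  # 2 of 3 outputs comply
```

Tracking this rate alongside task accuracy is what lets the failure mode show up as a separate curve: accuracy can stay flat while compliance falls.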
Why it matters
This is not just an academic quibble. Many real-world deployments, including data pipelines, legal document generation, API adapters, and safety filters, depend on precise output structure. If an LLM produces the right answer in the wrong shape, automation breaks. Geopolitical pressures and export controls on AI are reportedly nudging organizations toward a broader mix of models from different jurisdictions, from OpenAI to Chinese firms such as Baidu (百度), which makes robust, predictable instruction-following across architectures increasingly important.
Implications and next steps
The paper points to practical mitigations: stronger evaluation suites that test prospective memory, architecture and training tweaks that reinforce long-range instruction retention, and deployment strategies that verify formatting post hoc. The question remains: can we teach models to remember instructions the way humans do, or will we need hybrid systems and external checkers to make LLM outputs reliably usable? This study gives practitioners a clear metric and an urgent reason to find out.
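The last of those mitigations, verifying formatting after generation, is straightforward to wire up today. A minimal sketch, assuming the deployment expects JSON with a known set of keys (the `generate` callable and the key names are placeholders, not anything from the paper):

```python
import json

def verify_json_output(raw: str, required_keys: set) -> "dict | None":
    """Post-hoc check: parse the model's raw output as JSON and confirm
    the required keys are present. Returns the parsed dict, or None if
    the output is malformed, so the caller can retry or escalate."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not required_keys <= parsed.keys():
        return None
    return parsed

def call_with_verification(generate, required_keys, max_attempts=3):
    """Wrap a hypothetical generate() call with verification and retry,
    rejecting non-compliant outputs instead of passing them downstream."""
    for _ in range(max_attempts):
        result = verify_json_output(generate(), required_keys)
        if result is not None:
            return result
    raise ValueError("model never produced compliant output")
```

External checkers like this do not fix the prospective-memory failure itself, but they convert a silent downstream breakage into an explicit, retryable error.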
