New arXiv paper urges rethink of benchmarks for “knowledge‑work” AI

A new preprint on arXiv argues that current benchmarks used to evaluate large‑language‑model (LLM) agents in coding, research assistance and clinical support are poorly matched to real knowledge work. The paper, "Design and Report Benchmarks for Knowledge Work" (arXiv:2605.23262), says higher scores on conventional NLP tests do not reliably show that a system can carry out sustained, verifiable knowledge tasks in the real world. How do you know a model can actually do a job, not just ace a quiz?

The authors call for rethinking how benchmarks are designed and reported: move from single‑shot question answering to multi‑step, tool‑enabled workflows; measure traceability, verifiability and human‑machine handoffs; and report experimental context and failure modes in standardized ways. The abstract points to domains already being explored (coding, scientific research and healthcare) where surface performance often masks brittle behavior. The paper reportedly also recommends richer metadata about evaluation setups so downstream users and regulators can better interpret claims.

Why this matters—industry and geopolitics

This is more than an academic bookkeeping exercise. Companies from OpenAI and Google to China's Baidu (百度) and Alibaba (阿里巴巴) are racing to productize LLMs as knowledge assistants. Regulators and enterprise buyers want reliable evidence of safety and efficacy. Export controls and broader tech‑policy tensions between the U.S. and China are reportedly increasing the incentive for companies to demonstrate trustworthy capability in their home markets as well as globally. Better benchmarks could shape procurement, research funding and cross‑border collaboration.

The paper is available on arXiv. Read the full preprint at https://arxiv.org/abs/2605.23262 to see the proposed reporting checklist and examples that aim to make knowledge‑work claims auditable rather than merely impressive.