Stop Comparing LLM Agents Without Disclosing the Harness

Key claim

A new position paper on arXiv (arXiv:2605.23950) argues that, on long-horizon tasks and across models of comparable frontier capability, the agent execution harness is often a stronger determinant of performance than the underlying model itself. The harness is the infrastructure layer that governs context construction, tool interaction, orchestration and verification around a language model. The authors make a blunt point: if you compare agents, disclose the harness. Otherwise you may be measuring the engineering around the models rather than the models themselves.

Why the harness matters

The paper lays out concrete ways harnesses shape outcomes: how prompts are chunked and retrieved, which external tools are called and how results are validated, and how multi-step plans are orchestrated and retried. Small changes in orchestration or verification logic can turn a failing multi-step task into a success. Small tweaks, big gains. That matters for reproducibility: a benchmark that reports only the model, without the scripts, plugins, retry policy and verifier details around it, produces results that are hard to replicate or interpret. The authors call for standardized disclosure and richer benchmark protocols so that researchers and product teams can tell whether gains come from model advances or from better engineering around the model.
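To make the point concrete, here is a minimal sketch in Python, not taken from the paper, of how the same model can fail or pass a task depending only on harness settings. Everything in it is a hypothetical stand-in: `HarnessConfig` and its fields are an illustrative disclosure record, `toy_model` replaces a real LLM call, and `is_correct` doubles as both the harness verifier and the benchmark grader for simplicity.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical disclosure record; field names are illustrative only."""
    max_retries: int        # extra attempts allowed after the first
    use_verifier: bool      # whether answers are checked before acceptance
    context_strategy: str   # how the prompt context is assembled
    tools: tuple            # names of external tools the agent may call


def run_task(model: Callable[[str, int], str],
             check: Callable[[str], bool],
             prompt: str,
             cfg: HarnessConfig) -> bool:
    """Run one task under a given harness; return True if it ends correct."""
    for attempt in range(cfg.max_retries + 1):
        answer = model(prompt, attempt)
        if not cfg.use_verifier:
            # No verifier: the first answer is final, right or wrong.
            return check(answer)
        if check(answer):
            # Verifier accepted the answer, so the harness stops here.
            return True
    return False  # retries exhausted without a verified answer


def toy_model(prompt: str, attempt: int) -> str:
    """Stand-in for an LLM call: wrong on the first try, right on a retry."""
    return "4" if attempt >= 1 else "5"


def is_correct(answer: str) -> bool:
    """Toy check used as both verifier and grader (an assumption)."""
    return answer == "4"


bare = HarnessConfig(max_retries=0, use_verifier=False,
                     context_strategy="full-prompt", tools=())
tuned = HarnessConfig(max_retries=2, use_verifier=True,
                      context_strategy="full-prompt", tools=())

results = {name: run_task(toy_model, is_correct, "2 + 2 = ?", cfg)
           for name, cfg in [("bare", bare), ("tuned", tuned)]}
print(results)  # {'bare': False, 'tuned': True}

# The kind of record a benchmark report could publish next to its scores.
print(json.dumps(asdict(tuned)))
```

The model is identical in both runs; only the retry budget and the verifier differ. A leaderboard that printed the two scores without the accompanying configuration would invite the wrong conclusion about which "model" is better.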

Geopolitics and industry implications

This debate matters beyond academia. China’s fast-growing AI ecosystem, from Baidu (百度) to Alibaba (阿里巴巴) and other local labs, competes on both models and integration engineering, and reportedly faces constraints on access to the most advanced accelerators due to export controls. When hardware or model upgrades are hard to come by, improving the harness becomes a cost-effective way to close capability gaps. For regulators, enterprises and international partners, the paper raises practical questions: should procurement and evaluation require harness disclosure? How should benchmark suites evolve to account for orchestration and tool chains rather than raw model outputs alone? The authors’ recommendation is simple and urgent: stop comparing black boxes; disclose the harness. Who wants to buy a car based only on its engine spec, without knowing its transmission, brakes, or driver aids?