Harnesses are rewriting the economics of tokens: price per word is dead, price per result is rising

A new unit of value

For the past two years the AI business debate has centered on a single metric: price per million tokens. Cheap wins. Efficiency wins. But engineering experiments with "harnesses" at Anthropic, the US AI lab, reportedly show a different reality: tokens are increasingly bought not for text but for reliable outcomes. In one oft‑cited test using Claude Opus 4.5, a solo agent produced a demo UI in 20 minutes for about $9, while a full harnessed pipeline ran six hours and cost roughly $200, about 22 times more expensive on the surface. Was the harness a waste? Reportedly not: the harness delivered a playable, tested product with 16 features built over multiple sprints, while the solo agent produced only a brittle demonstration.
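The headline arithmetic can be checked in a few lines. This is a minimal sketch that uses only the figures reported above; the variable names are illustrative, and the per-feature figure is a simple average, not something the experiment itself reported.

```python
# Figures as reported in the article; variable names are illustrative.
solo_cost_usd = 9.0
solo_minutes = 20
harness_cost_usd = 200.0
harness_minutes = 6 * 60
harness_features = 16

# How much more the harnessed run cost and how much longer it took.
cost_ratio = harness_cost_usd / solo_cost_usd
time_ratio = harness_minutes / solo_minutes

# Simple average: total harness spend divided by delivered features.
harness_cost_per_feature = harness_cost_usd / harness_features

print(f"cost ratio (harness / solo): {cost_ratio:.1f}x")
print(f"time ratio (harness / solo): {time_ratio:.1f}x")
print(f"harness cost per delivered feature: ${harness_cost_per_feature:.2f}")
```

On these numbers the harness costs about 22 times as much and runs 18 times as long, which works out to $12.50 per delivered feature. No comparable per-feature figure can be computed for the solo run, because the source does not say how many of its features actually worked.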

From content cost to control cost

The experiments reveal the mechanism. A harness wraps generation with planning, execution, automated testing and iterative fixes. It pairs a generator with an evaluator that uses tools like Playwright to operate pages, take screenshots and give concrete feedback. These loops run five to fifteen rounds and can last hours. The bulk of the added tokens pays for verification and direction, not extra prose. In effect, token spending shifts from “how many words did you output?” to “how many planning, validation and rework cycles did you buy to get a working result?” That reframes token economics: a small number of supervisory tokens can safeguard large bursts of generation tokens, which makes cheap planners highly valuable supervisors of expensive generation.
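The generator and evaluator loop described above can be sketched roughly as follows. This is not Anthropic's implementation: `run_harness`, `Ledger`, and the toy generator and evaluator are hypothetical stand-ins, the real evaluator would drive a browser rather than parse a string, and the token counts are placeholder values that make no claim about real proportions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Ledger:
    """Token counts per role, so spend can be attributed afterwards."""
    tokens: Dict[str, int] = field(default_factory=dict)

    def charge(self, role: str, n: int) -> None:
        self.tokens[role] = self.tokens.get(role, 0) + n


def run_harness(
    generate: Callable[[List[str]], Tuple[str, int]],
    evaluate: Callable[[str], Tuple[List[str], int]],
    max_rounds: int = 15,
) -> Tuple[str, int, Ledger]:
    """Alternate generation and evaluation until no issues remain."""
    ledger = Ledger()
    feedback: List[str] = []
    artifact = ""
    for round_no in range(1, max_rounds + 1):
        artifact, gen_tokens = generate(feedback)
        ledger.charge("generation", gen_tokens)
        feedback, eval_tokens = evaluate(artifact)
        ledger.charge("validation", eval_tokens)
        if not feedback:  # evaluator found nothing left to fix
            return artifact, round_no, ledger
    return artifact, max_rounds, ledger


def make_toy_generator(defects: List[str]) -> Callable[[List[str]], Tuple[str, int]]:
    """Hypothetical generator: fixes the first reported defect each round."""
    remaining = list(defects)

    def generate(feedback: List[str]) -> Tuple[str, int]:
        if feedback:
            remaining.remove(feedback[0])
        return "build with defects: " + ",".join(remaining), 1000  # placeholder tokens

    return generate


def toy_evaluate(artifact: str) -> Tuple[List[str], int]:
    """Hypothetical evaluator: reports whatever defects the build still lists."""
    listed = artifact.split(": ", 1)[1]
    issues = [d for d in listed.split(",") if d]
    return issues, 200  # placeholder tokens


artifact, rounds, ledger = run_harness(
    make_toy_generator(["layout", "login", "save"]), toy_evaluate
)
print(rounds, ledger.tokens)
```

Keeping a per-role ledger is the design choice that matters here: it lets a team see how much of the bill went to validation and rework cycles rather than to generated output.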

A moving target — models, harnesses and geopolitics

The value of harnessing is not fixed. Anthropic’s Opus 4.6 reportedly internalizes much of the planning and long‑run stability, shrinking the need for some harness components; once a model learns a capability, the formerly essential harness component becomes redundant overhead. Commercial pricing already reflects this system view: Opus 4.6 fast mode reportedly charges much higher per‑token rates for low latency and US‑only inference (a compliance/latency premium), and OpenAI’s newer tiers charge differently for tool use and regional processing. Why mention geography? Because inference location, export controls and regulatory compliance are now part of the cost calculus: a geopolitical layer that affects which tokens you can run where, and at what price. Meanwhile, Chinese cloud and model vendors such as Baidu (百度) and Alibaba (阿里巴巴) are building their own agent toolchains, which will face their own policy and market constraints.

The new business metric

The takeaway is simple and disruptive: the industry is moving from a static per‑token price list to a dynamic, boundary‑driven cost structure measured by task completion, reduced rework and reliability. Tokens become organizational fuel allocated across planning, generation, validation and tooling. So which matters more, the model or the harness? That may be the wrong question. The real one for product and procurement teams is: how much does it cost to deliver a real, working result, not just a convincing demo?
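One way to make "cost per working result" concrete is to divide the cost of an attempt by the rate at which attempts are accepted. The sketch below is an assumption, not a metric taken from the experiments: it treats attempts as independent retries, ignores human review time, and uses hypothetical function names. Only the $9 and $200 figures come from the article.

```python
def expected_cost_per_accepted_result(cost_per_attempt: float, acceptance_rate: float) -> float:
    """Expected spend to get one accepted result, assuming independent retries."""
    if not 0.0 < acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return cost_per_attempt / acceptance_rate


def break_even_acceptance_rate(
    cheap_cost: float, expensive_cost: float, expensive_acceptance_rate: float = 1.0
) -> float:
    """Acceptance rate at which the cheap option costs the same per accepted result."""
    return cheap_cost * expensive_acceptance_rate / expensive_cost


# Reported costs from the article; the acceptance rate of 1.0 for the
# harnessed run is an illustrative assumption, not a measured value.
threshold = break_even_acceptance_rate(cheap_cost=9.0, expensive_cost=200.0)
print(f"solo agent break-even acceptance rate: {threshold:.3f}")
```

Under these assumptions the break-even rate is 0.045: the $200 harnessed run is the cheaper route per accepted result only if fewer than about one in 22 solo runs produces something shippable. The point of the sketch is that the comparison depends on an acceptance rate that a per-token price list never shows.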