SGR-Bench: A new benchmark for agents that must navigate site-specific states to retrieve answers

What SGR-Bench measures

A team has published SGR-Bench on arXiv (arXiv:2605.22219), a benchmark designed to evaluate search agents on "state-gated retrieval": situations where answer-bearing evidence becomes available only after the agent sets a site-specific retrieval state (filters, query contexts, pagination, or login). The paper argues that this class of specialized retrieval tasks has been undercharacterized in prior benchmarks, which focus on open-web, stateless retrieval. The question it poses is how well today's tool-using agents handle interfaces that require a sequence of correct interactions before the evidence appears.
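To make the idea of a gated state concrete, here is a minimal toy sketch in Python. It is not taken from the paper or its dataset: the class ToyGatedSite, its filter names ("year", "region"), and the sample records are all hypothetical, chosen only to show a source that returns nothing until the required filters have been set.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGatedSite:
    """Hypothetical site (not from SGR-Bench) that hides records until filters are set."""
    records: list
    required_filters: tuple = ("year", "region")  # assumed filter names
    state: dict = field(default_factory=dict)

    def set_filter(self, name, value):
        # Each call is one state transition the agent has to discover.
        self.state[name] = value

    def search(self, query):
        # Without the complete retrieval state, no evidence is exposed.
        if any(f not in self.state for f in self.required_filters):
            return []
        return [
            r for r in self.records
            if all(r[f] == self.state[f] for f in self.required_filters)
            and query.lower() in r["text"].lower()
        ]


# Illustrative data only.
site = ToyGatedSite(records=[
    {"year": 2021, "region": "EU", "text": "Emissions report 2021"},
    {"year": 2020, "region": "EU", "text": "Emissions report 2020"},
])

stateless_hits = site.search("emissions")  # a single-shot query finds nothing

site.set_filter("year", 2021)
site.set_filter("region", "EU")
gated_hits = site.search("emissions")      # the same query now returns evidence
```

In this toy setup, an agent that treats the site as a stateless search box gets an empty result, while one that first sets both filters retrieves the matching record.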

How the benchmark works

SGR-Bench assembles tasks drawn from a range of specialized data sites and frames them so that success requires establishing the correct state before evidence appears. The authors report using a suite of automated agents that combine language models with tool calls to interact with page-like APIs and simulated UI steps. Performance is measured not just on final answer accuracy but on whether the agent executed the right state transitions and retrieved the underlying evidence.
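The three-part evaluation described above (final answer, state transitions, retrieved evidence) can be illustrated with a rough sketch. This is an assumption about how such scoring could be organized, not the paper's actual metric: the Episode fields, the action string format, and the function names score_episode and is_subsequence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record of one agent run (field names are assumptions)."""
    actions: list        # e.g. ["set_filter:year=2021", "search:emissions"]
    retrieved_ids: set   # identifiers of the documents the agent actually fetched
    answer: str


def is_subsequence(required, actions):
    # True if every required step appears in `actions` in the same order,
    # allowing unrelated actions in between.
    remaining = iter(actions)
    return all(step in remaining for step in required)


def score_episode(ep, required_steps, gold_evidence_id, gold_answer):
    # Report the three signals separately rather than collapsing them,
    # so a correct answer without evidence stays visible.
    return {
        "answer_correct": ep.answer.strip().lower() == gold_answer.strip().lower(),
        "state_transitions_ok": is_subsequence(required_steps, ep.actions),
        "evidence_retrieved": gold_evidence_id in ep.retrieved_ids,
    }


required = ["set_filter:year=2021", "set_filter:region=EU", "search:emissions"]

# An agent that answers correctly but never establishes the state.
guessed = Episode(actions=["search:emissions"], retrieved_ids=set(), answer="42 Mt")

# An agent that sets the state, retrieves the evidence, then answers.
grounded = Episode(
    actions=["set_filter:year=2021", "open:help", "set_filter:region=EU", "search:emissions"],
    retrieved_ids={"doc-7"},
    answer="42 Mt",
)

guessed_score = score_episode(guessed, required, "doc-7", "42 Mt")
grounded_score = score_episode(grounded, required, "doc-7", "42 Mt")
```

Keeping the three signals separate is a design choice of this sketch: both example episodes give the same final answer, but only the second one would count as having reached it through the gated evidence.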

Key findings

The authors report a substantial gap between how agents perform when a single-shot query suffices and how they perform when multi-step stateful retrieval is required: many contemporary models fail to discover or maintain the necessary state, and either produce answers without verifiable evidence or fail outright. The paper also highlights specific failure modes (brittle navigation, misuse of filters, and the assumption that the web is stateless) and proposes evaluation metrics and task designs to surface them.

Why it matters

State-gated retrieval is common in specialized domains such as legal research, government databases, paywalled archives, and enterprise systems, and it matters for trust, auditability, and compliance. As platforms and regulators tighten control over access to data, and gated or rate-limited interfaces reportedly become more common, benchmarks like SGR-Bench are likely to be important for building agents that can operate reliably and transparently in those environments. The full paper and dataset are available on arXiv.