Teaching AI by building tests: QuestBench proposes course-based benchmark construction for accountable knowledge work

A new arXiv paper argues for a simple pedagogical pivot: don’t just teach students to use AI as a productivity tool. Teach them to test it. The paper, "Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work" (arXiv:2605.21413), proposes QuestBench — a classroom practice in which students design, curate and critique evaluation benchmarks to learn how models fail, how datasets encode bias, and how to judge machine-produced knowledge. Short lessons. Hands-on work. A different emphasis: accountability over automation.

What the authors propose

The authors lay out a course workflow in which students create question sets, annotate evidence and evaluate model outputs against human judgments. The exercise does double duty: it is both technical training in dataset construction and an ethical lesson in epistemic responsibility. Students learn to flag hallucinations, measure reproducibility, and document provenance. The paper frames benchmark construction not as a one-off research task but as an educational tool that highlights students’ role as critical consumers and producers of AI-generated information.
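To make that workflow concrete, here is a minimal Python sketch of what a student-built benchmark record and two simple measurements might look like. It is an illustration only, not the paper's format or tooling: the class names, fields, verdict labels, the exact-match definition of reproducibility and the fictional example question are all assumptions made for this sketch.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass

# Hypothetical verdict labels a student annotator might assign to each output.
VERDICTS = ("correct", "incorrect", "hallucinated")


@dataclass
class BenchmarkItem:
    """One student-written question with its evidence trail (assumed schema)."""
    question: str
    reference_answer: str
    evidence_source: str  # provenance: where the reference answer comes from
    author: str           # who wrote and checked the item


@dataclass
class ModelRun:
    """Repeated model outputs for one item, each with a human verdict."""
    item: BenchmarkItem
    outputs: list[str]
    human_verdicts: list[str]  # same length as outputs, values from VERDICTS


def reproducibility(run: ModelRun) -> float:
    """Share of outputs equal to the most common output (1.0 = fully stable)."""
    if not run.outputs:
        raise ValueError("run has no outputs")
    normalized = [o.strip().lower() for o in run.outputs]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def verdict_rates(runs: list[ModelRun]) -> dict[str, float]:
    """Fraction of all judged outputs that received each human verdict."""
    counts = Counter(v for run in runs for v in run.human_verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no human verdicts recorded")
    return {label: counts[label] / total for label in VERDICTS}


# Fictional example: the same question asked four times.
item = BenchmarkItem(
    question="In what year was the (fictional) Example Town library founded?",
    reference_answer="1898",
    evidence_source="Example Town archive, founding charter",
    author="student_a",
)
run = ModelRun(
    item=item,
    outputs=["1898", "1898", "1902", " 1898 "],
    human_verdicts=["correct", "correct", "hallucinated", "correct"],
)

print(reproducibility(run))   # 0.75
print(verdict_rates([run]))   # {'correct': 0.75, 'incorrect': 0.0, 'hallucinated': 0.25}
```

Keeping the evidence source and author on every item is what makes the record auditable: anyone who disputes a verdict can trace the reference answer back to where it came from. A real course would likely need a richer definition of reproducibility than exact string matching.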

Why this matters — for educators and for tech policy

Why should teachers care? Because current curricula often stop at prompting, summarizing and tool use. Those are useful skills. But can you trust what a model tells you? QuestBench pushes students to answer that question empirically. The method also has broader implications for national AI ecosystems. In China, where universities and firms such as Baidu (百度) and Alibaba (阿里巴巴) are rapidly rolling out AI services and training programs, building local benchmarks can surface culturally specific failure modes and improve accountability for domestic deployment. And in a geopolitical climate shaped by export controls and debate over model governance, transparent benchmarking helps institutions demonstrate reliability and traceability.

The authors note caveats: benchmarks can be gamed, and dataset creation is itself value-laden. Still, the practical takeaway is clear. If education aims to produce responsible actors, it must teach students to interrogate AI as well as to instruct it. Can a classroom that builds tests produce better, more skeptical users and builders of AI? The paper argues that it can, and it provides a road map for instructors willing to try.