Large models ace every exam yet are further from AGI: what does this paper reveal?

A new definition at odds with leaderboard thinking

Michael Timothy Bennett of the Australian National University has put a thorny question back on the table: do high scores on human-style exams mean progress toward artificial general intelligence (AGI)? His arXiv paper argues that they do not. Industry leaders have been talking past one another. A secret OpenAI–Microsoft deal reportedly used a financial threshold (systems that can generate at least $100 billion in profit) as a proxy for AGI, while Nvidia’s Jensen Huang (黄仁勋) and Elon Musk have offered timelines ranging from imminent to five years away. The deeper problem, Bennett says, is conceptual: AGI has no agreed yardstick, so everyone projects their hopes onto the same blurry target.

From mimicry to “artificial scientist”

Bennett reframes AGI as an “artificial scientist”: a system that, under real-world constraints on compute, memory and energy, can actively design experiments, infer causal structure, and balance exploration with exploitation. Why does that matter? Because current large language models (LLMs) are extreme correlational learners: they compress massive amounts of data into weights and approximate likely outputs. They ace exams yet fail simple out-of-distribution tasks; ask a top model whether 9.11 or 9.9 is larger and it may confidently give the wrong answer. Bennett joins other sceptics such as Melanie Mitchell, and echoes calls from scholars like Yoshua Bengio and research efforts at DeepMind, in arguing that pass-fail benchmarks and Turing-style tests have become a kind of Rorschach test: observers see what they expect, not what is objectively present.

A pragmatic pivot with policy implications

Bennett’s prescription is concrete: stop treating scaling laws as the only route. He contrasts three meta-strategies (scale-maxing, simplicity-maxing and constraint-weakening) and argues that AGI will emerge from their combination, not from a single brute-force trajectory. Crucially, he folds energy and resource limits into the definition, which makes the debate directly relevant to geopolitics: export controls and sanctions on advanced chips make compute a strategic, scarce resource, so definitions that prize energy-efficient, adaptive intelligence will matter to governments and firms alike. If Bennett’s “artificial scientist” benchmark gains traction, expect a reset in evaluation: not higher exam scores, but tests that throw AIs into novel physical environments, force them to propose and test hypotheses, and judge them by how much new knowledge they discover rather than by how well they mimic human text. Are we ready to swap leaderboard vanity for measures of adaptability? That question may determine the next phase of the global AI race.