Huxiu (虎嗅) · 2026-03-09

From Rote Memorization to Logical Games: Can AIs Truly Do Science?

The new yardstick for “scientific” AI

Rote memorization or scientific reasoning? That debate has sharpened since Science magazine asked a simple question with a profound sting: how will we know if AI is smart enough to do science? Chinese tech outlet Huxiu (虎嗅) spotlighted the argument, noting that large language models (LLMs) now draft papers, summarize literature, and even sketch experimental workflows, yet their triumphs on popular exams may say more about training data than genuine understanding.

Benchmarks are breaking

Models have been trained on much of the open internet, which puts widely used academic benchmarks, like MMLU for multi-task language understanding, at risk of contamination, turning tough tests into glorified memory checks. Researchers call this “data contamination.” Enter GPQA, a graduate-level, “Google-proof” science QA benchmark whose questions are written to resist simple web search. Even domain experts with unrestricted internet access score only roughly 65–70%. When OpenAI’s o1 series reportedly surpassed 80% on the GPQA-Diamond split, it reignited the debate: is this recall of patterns at scale, or evidence of multi-step scientific reasoning?
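To make the contamination worry concrete, here is a minimal sketch of the kind of n-gram overlap audit used to flag benchmark leakage. Everything here is illustrative: the function names, the toy data, and the 13-gram default; real audits run against far larger corpora with fuzzier, normalized matching.

```python
# Hypothetical sketch: flag benchmark questions whose word n-grams
# also appear in a training corpus. All names and data are illustrative.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark: list[str], corpus: list[str], n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams = set().union(*(ngrams(doc, n) for doc in corpus))
    flagged = sum(1 for q in benchmark if ngrams(q, n) & corpus_ngrams)
    return flagged / len(benchmark) if benchmark else 0.0

# Toy usage (n lowered because the strings are short):
corpus = ["the heat capacity of an ideal monatomic gas at constant volume is 3R/2"]
questions = ["What is the heat capacity of an ideal monatomic gas at constant volume?"]
print(f"contaminated: {contamination_rate(questions, corpus, n=5):.0%}")  # -> 100%
```

A benchmark like GPQA tries to sidestep this audit entirely by commissioning fresh questions that have never appeared on the open web.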

From final answers to process supervision

A growing consensus holds that outcomes aren’t enough; the path matters. New evaluation frameworks emphasize process supervision, auditing each reasoning step: did the model consider temperature and pressure, anticipate side reactions, and diagnose anomalies when an experiment failed? This approach exposes the “logical hallucinations” hiding behind fluent scientific prose, prioritizing rigor over rhetoric. It also mirrors real lab practice, where careful inference and error analysis count for as much as the final result.
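The shift is easy to sketch in code: instead of scoring only the final line, a grader walks the chain and rejects it at the first step it cannot justify. In this hypothetical Python sketch, `verify_step` stands in for a human grader or a trained process-reward model; its one-line heuristic is purely illustrative.

```python
# Hypothetical sketch of process supervision: grade each reasoning step,
# not just the final answer. All names and the heuristic are illustrative.
from dataclasses import dataclass

@dataclass
class StepVerdict:
    step: str
    ok: bool
    note: str

def verify_step(step: str, context: list[str]) -> StepVerdict:
    """Placeholder verifier. A real one would check `step` against `context`
    for unit errors, ignored side reactions, or unsupported leaps."""
    suspicious = "assume" in step.lower() and "because" not in step.lower()
    return StepVerdict(step, ok=not suspicious,
                       note="unjustified assumption" if suspicious else "ok")

def audit_chain(steps: list[str]) -> bool:
    """Outcome-only grading would check the last line; this checks every step."""
    context: list[str] = []
    for step in steps:
        verdict = verify_step(step, context)
        print(("PASS" if verdict.ok else "FAIL") + f" [{verdict.note}] {step}")
        if not verdict.ok:
            return False  # one bad step invalidates the whole chain
        context.append(step)
    return True

audit_chain([
    "Balance the equation: 2H2 + O2 -> 2H2O.",
    "Assume the reaction is isothermal.",  # flagged: no justification offered
    "Therefore the theoretical yield is reached.",
])
```

The design choice worth noting is the early return: under process supervision, a fluent conclusion cannot rescue a chain with a broken link.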

The real proving ground: closed-loop labs

The frontier test is “closed-loop automated discovery,” where AI steers robot chemists or computing pipelines toward open-ended goals—say, a better carbon-capture material—then updates hypotheses on the fly as data arrives. Can an AI distinguish model bias from measurement error? Can it learn from a handful of experiments and converge on truth? That reflective capacity is fast becoming the gold standard. And the geopolitical backdrop matters: U.S. export controls on advanced chips constrain compute for Chinese labs, nudging the field—both in China and globally—toward data- and reasoning-efficient approaches, smarter evaluation, and automation that extracts more from fewer trials. Chinese internet firms such as Baidu (百度) and Alibaba (阿里巴巴), alongside top universities, are investing in AI-for-science; however, concrete progress toward fully autonomous “digital scientists” remains incremental.
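Stripped to a skeleton, such a loop alternates between proposing a condition, measuring it, and updating its working hypothesis under a fixed experiment budget. The toy sketch below assumes a hidden one-dimensional objective standing in for, say, CO2 uptake; every name is illustrative, and a real system would fit a surrogate model and explicitly separate measurement noise from model bias rather than hill-climb at random.

```python
# Hypothetical sketch of a closed-loop discovery cycle: propose -> measure -> update.
# `run_experiment` stands in for a robot chemist or a simulation pipeline.
import random

def run_experiment(x: float) -> float:
    """Noisy measurement of a hidden objective with its optimum at x = 0.62."""
    true_value = -(x - 0.62) ** 2 + 1.0
    return true_value + random.gauss(0, 0.02)  # measurement noise

def propose_next(history: list[tuple[float, float]]) -> float:
    """Refine around the best condition seen so far (a stand-in for real
    hypothesis updating, which would model noise and bias explicitly)."""
    best_x, _ = max(history, key=lambda xy: xy[1])
    return min(1.0, max(0.0, best_x + random.uniform(-0.1, 0.1)))

history = [(x, run_experiment(x)) for x in (0.1, 0.5, 0.9)]  # seed trials
for _ in range(20):                                          # experiment budget
    x = propose_next(history)
    history.append((x, run_experiment(x)))

best_x, best_y = max(history, key=lambda xy: xy[1])
print(f"best condition x={best_x:.2f}, measured value {best_y:.2f}")
```

The interesting evaluation question is not whether the loop converges on this toy, but whether an AI can tell when a disappointing result reflects a wrong hypothesis rather than a noisy instrument.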

Outlook: redefining what we measure

The point isn’t to replace scientists but to forge a new model of collaboration. As evaluations shift from exam scores to logical rigor, the ability to self-correct mid-experiment, and cross-disciplinary generalization, we may get closer to knowing when an AI truly “does science.” Until then, each new benchmark and each closed-loop run poses the real question behind the hype: not what an AI can recite, but what it can discover.
