arXiv 2026-03-27

Large Language Models and Scientific Discourse: Where's the Intelligence?

Large language models (LLMs) are superb at imitating scientific prose. But can they genuinely produce the kind of knowledge that the scientific method yields? A new arXiv preprint (arXiv:2603.23543) argues they cannot, at least not in the way human science is produced.

What the paper says

The authors contrast two schematics: one of scientific knowledge production (hypothesis, experiment, peer critique, replication) and one of an LLM's training pipeline (ingest large corpora, compress statistical patterns, predict the next token). They argue that LLMs excel at synthesizing and summarizing recorded knowledge but lack the experimental loop that generates new, testable knowledge: models can mimic the surface of scientific discourse, yet they do not design experiments, manipulate variables, or operate under the accountability structures (peer review, reproducibility norms) that underpin scientific progress.
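
To make the contrast concrete, here is a minimal, hypothetical Python sketch, not taken from the paper: the toy "world", the function names, and the bigram model are all invented for illustration. The first loop revises a hypothesis against experimental feedback from a hidden ground truth; the second only compresses a recorded corpus and samples from it, with no experiment anywhere in the loop.

```python
import random

random.seed(0)

TRUE_BIAS = 0.7  # hidden property of "the world"; only experiments can probe it

def run_experiment(n_flips):
    """Manipulate a variable (sample size) and observe outcomes."""
    return sum(random.random() < TRUE_BIAS for _ in range(n_flips)) / n_flips

def scientific_loop():
    """Hypothesis -> experiment -> critique -> revision, until it replicates."""
    hypothesis = 0.5  # initial guess about the coin's bias
    for _ in range(20):
        observed = run_experiment(1000)
        if abs(observed - hypothesis) < 0.02:  # crude stand-in for replication
            return hypothesis                  # provisionally accepted knowledge
        hypothesis = observed                  # revise in light of evidence
    return hypothesis

def llm_loop(corpus, prompt):
    """Ingest a corpus, compress bigram statistics, predict the next token."""
    counts = {}
    tokens = corpus.split()
    for a, b in zip(tokens, tokens[1:]):
        counts.setdefault(a, []).append(b)
    # No experiment happens here: the output can only recombine the corpus.
    return random.choice(counts.get(prompt, ["<unknown>"]))

print(scientific_loop())  # converges toward TRUE_BIAS via experimental feedback
print(llm_loop("the coin is fair the coin is biased", "is"))  # echoes its corpus
```

The point of the toy is the asymmetry: the experimental loop can discover a fact absent from any prior text, while the corpus loop, however large its training data, can only redistribute what was already recorded.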

Implications and geopolitical context

The gap matters for funders, journals, and labs racing to integrate LLMs into research workflows. Some vendors reportedly market LLMs as research assistants capable of hypothesis generation; such claims should be treated with caution. Globally, the push to build ever-larger models is shaped by geopolitics: export controls on advanced chips and national AI strategies affect who can train the biggest systems. Chinese firms such as Baidu (百度), Alibaba (阿里巴巴), and Tencent (腾讯) are major players in this race and are developing their own LLM stacks, but access to compute and datasets remains a strategic chokepoint. The practical takeaway: treat LLM outputs as productive heuristics and literature filters, not as substitutes for experiment-driven science.
