Review Arcade flags risks in using LLMs to write and judge science

What the paper does

A new preprint on arXiv, "Review Arcade: On the Human Alignment and Gameability of LLM Reviews" (arXiv:2605.28897v1), probes a growing and consequential practice: using large language models (LLMs) to produce peer reviews of scientific papers. The authors report experiments testing how well LLM-written reviews align with human judgment and how easily those reviews can be "gamed" by tweaking paper text or prompts. Major conferences are reportedly already piloting LLM-assisted reviewing, and the authors frame their work as timely.

Main concerns and findings

According to its abstract, the paper looks at both sides of the review process: reviewers are using LLM assistance, and authors are using LLMs to revise submissions before sending them in. That creates a feedback loop. The study reportedly finds that while LLMs can produce plausible, thorough reviews, they are sensitive to surface cues and prompt engineering, meaning authors could optimize manuscripts for positive automated feedback rather than for scientific rigor. Who benefits? Who loses? The short answer: incentives matter, and automation can shift them in opaque ways.
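To make "gameability" concrete, here is a minimal sketch of one way such sensitivity could be measured: score each paper, apply a cosmetic edit that leaves the science unchanged, score it again, and average the difference. This is an illustration only and not the preprint's method. The function names, the buzzword list, and the toy scoring rule are all hypothetical; in practice `toy_reviewer_score` would be replaced by a call to whatever LLM reviewer is under test.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical surface cues; a real study would choose its own perturbations.
BUZZWORDS = ("novel", "state-of-the-art", "significant")


def toy_reviewer_score(paper_text: str) -> float:
    """Stand-in for an LLM reviewer (assumption): returns a score from 5 to 10.

    Deliberately flawed: it rewards buzzwords rather than content, so the
    check below has something to detect.
    """
    text = paper_text.lower()
    hits = sum(text.count(word) for word in BUZZWORDS)
    return float(min(10, 5 + hits))


def add_surface_cues(paper_text: str) -> str:
    """Cosmetic edit that adds no scientific content."""
    return paper_text + " This novel method achieves state-of-the-art results."


def mean_score_shift(
    papers: Sequence[str],
    score: Callable[[str], float],
    perturb: Callable[[str], str],
) -> float:
    """Average change in score caused by the perturbation alone."""
    return mean(score(perturb(p)) - score(p) for p in papers)


papers = [
    "We measure latency across three schedulers.",
    "We report a significant reduction in error on two benchmarks.",
]
shift = mean_score_shift(papers, toy_reviewer_score, add_surface_cues)
print(f"Mean score shift from cosmetic edits: {shift:+.1f}")
```

A shift near zero would suggest the reviewer ignores the cosmetic change; a consistently positive shift, as with this deliberately flawed toy scorer, is the kind of signal that would indicate a review process can be gamed.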

Why this matters for research integrity and policy

Readers less familiar with China’s AI ecosystem should note that the relevant players are global: models from U.S. firms and from Chinese firms such as Baidu (百度) and Alibaba (阿里巴巴) are all part of the toolkit researchers might use. Geopolitics enters too. Export controls, trade policy, and national AI strategies influence which models institutions can access, and that fragmentation could lead to uneven reviewing standards across regions. Policymakers and conference organizers face a hard choice: embrace efficiency and risk gameability, or restrict LLM use and risk falling behind.

Takeaway

The Review Arcade paper is a timely technical warning: automated reviews are not a neutral efficiency gain. They reshape incentives, can be nudged by prompt and text manipulation, and therefore require new guardrails if they are to augment — rather than undermine — peer review. The full preprint is available on arXiv for those who want to dig into methods and data.