QV May Be Enough: New arXiv paper questions the need for Keys in Transformer attention
Paper argues Query-Value (QV) alone may capture attention's essence
A new preprint on arXiv (arXiv:2603.15665) argues that the classic Query-Key-Value (QKV) formulation of Transformer attention can be pared back: the authors contend that Query-Value (QV) interactions alone capture the essential behavior attributed to attention. Could attention work without keys? The authors approach the problem from first principles, using a linguistic lens centered on part-of-speech (POS) tagging and syntactic structure to derive a theoretical foundation for why QV might suffice.
Theory-driven derivation, not yet a production-tested claim
The paper presents a unified explanatory framework that derives attention’s role in modeling linguistic relations and shows how QV interactions emerge from syntactic regularities. The work is primarily analytical: the authors derive conditions under which keys are redundant and support them with proofs and examples grounded in syntactic analysis. Because this is an arXiv preprint that has not undergone peer review, its empirical support is preliminary, and generalization to large-scale models remains to be independently validated.
Why practitioners should care — and why they should be cautious
If borne out in practice, a QV-only design could simplify architectures, reduce parameter counts, and change hardware and software optimization priorities for large language models (LLMs). That matters now, as firms and researchers scramble to squeeze out efficiency gains amid rising compute costs and restrictions on advanced chip availability. However, claims about eliminating keys warrant close experimental scrutiny across languages, model sizes, and downstream tasks before the community rewrites textbooks or production stacks.
Context and next steps
The paper is available on arXiv for anyone to read and test (https://arxiv.org/abs/2603.15665). For readers unfamiliar with the technical background: Transformers use attention to weight how strongly each token attends to every other token, and QKV is the canonical formulation behind recent LLM breakthroughs. Whether QV reforms that canon will depend on independent replications, rigorous benchmarks, and follow-up work connecting the new theory to large-scale empirical results.
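To make the contrast concrete, here is a minimal NumPy sketch of canonical scaled dot-product QKV attention alongside a hypothetical keyless variant that scores queries directly against values. Note the caveats: the `qv_attention` formulation below (reusing V in place of K) is an illustrative assumption for this article, not the paper's actual construction, and the function names and toy dimensions are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    # Canonical Transformer attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V

def qv_attention(Q, V):
    # Hypothetical keyless variant: attention scores come from Q against V
    # directly, so the separate key projection disappears. This is a toy
    # stand-in for a QV-only design, not the preprint's exact formulation.
    d = Q.shape[-1]
    weights = softmax(Q @ V.T / np.sqrt(d))
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8  # 4 tokens, model dimension 8
Q, K, V = rng.normal(size=(3, n, d))

print(qkv_attention(Q, K, V).shape)  # (4, 8)
print(qv_attention(Q, V).shape)      # (4, 8)
```

Both variants return one output vector per token; the QV version simply drops the key projection matrix, which is the source of the parameter and compute savings the paper's thesis implies.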
