arXiv 2026-03-20

Per-Domain Generalizing Policies Paper Argues for Q-Value Functions Over State-Values

What the paper proposes

A new preprint on arXiv, "Per-Domain Generalizing Policies: On Learning Efficient and Robust Q-Value Functions (Extended Version with Technical Appendix)" (arXiv:2603.17544v1), advocates a shift in how learned planners are trained. Standard methods for per-domain generalization typically learn state-value functions represented with graph neural networks (GNNs), trained by supervised learning on optimal plans produced by a teacher planner. The authors argue that learning Q-value functions instead yields policies that are far cheaper to obtain and more robust in practice. According to the paper, this approach reduces the dependency on full optimal plans from the teacher and improves sample efficiency during training.
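To make the state-value/Q-value distinction concrete, here is a minimal sketch, not the paper's code: the function names, the unit-cost assumption, and the toy plan are all illustrative. It shows how supervision labels for both targets could be derived from a single teacher plan in a unit-cost domain.

```python
# Illustrative sketch (hypothetical names, unit-cost assumption):
# deriving supervision labels from one teacher plan that ends in a goal state.

def state_value_labels(plan_states):
    """One scalar label V*(s) per state: optimal cost-to-go along the plan."""
    n = len(plan_states) - 1  # last entry is the goal, with cost-to-go 0
    return {s: n - i for i, s in enumerate(plan_states)}

def q_value_labels(plan_states, plan_actions):
    """One label per (state, action) pair along the plan.

    With unit costs, Q*(s, a) = 1 + V*(s') where s' is the successor
    reached by applying a in s.
    """
    v = state_value_labels(plan_states)
    return {
        (s, a): 1 + v[plan_states[i + 1]]
        for i, (s, a) in enumerate(zip(plan_states, plan_actions))
    }

states = ["s0", "s1", "s2", "goal"]
actions = ["a0", "a1", "a2"]
print(state_value_labels(states))          # {'s0': 3, 's1': 2, 's2': 1, 'goal': 0}
print(q_value_labels(states, actions))     # {('s0', 'a0'): 3, ('s1', 'a1'): 2, ('s2', 'a2'): 1}
```

The sketch only labels on-plan pairs; how the paper handles off-plan actions and non-unit costs is exactly what the full text and appendix spell out.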

Technical angle, briefly

Where previous work framed the task as predicting a scalar value for each state, this paper frames it as predicting action-conditioned Q-values, which directly estimate the cost of taking a specific action in a state. According to the authors, this lets learners exploit local decision structure and rely on cheaper, more localized supervision, reducing the computational burden of producing training labels and making policies less sensitive to teacher imperfections. The extended version includes a technical appendix with proofs, experimental details, and benchmarks comparing GNN-based state-value learners to their Q-value counterparts across standard planning domains.
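One practical consequence of action-conditioned values can be sketched directly. The snippet below is a toy illustration, not the paper's API (the function names and the integer domain are hypothetical): acting greedily with a state-value model requires generating and evaluating every successor state, while a Q-model scores the applicable actions in place.

```python
# Illustrative sketch (hypothetical names and toy domain, not the paper's code):
# greedy action selection from a learned value model, assuming unit action costs.

def greedy_from_v(state, actions, successor, V):
    # State-value policy: generate each successor and run the value model
    # once per successor before choosing the cheapest-looking action.
    return min(actions, key=lambda a: 1 + V(successor(state, a)))

def greedy_from_q(state, actions, Q):
    # Q-value policy: score each action directly; no successor generation.
    return min(actions, key=lambda a: Q(state, a))

# Toy unit-cost domain: integers with goal 0, actions move +1 or -1.
succ = lambda s, a: s + a
V_star = abs                       # optimal cost-to-go
Q_star = lambda s, a: 1 + abs(s + a)

print(greedy_from_v(3, [+1, -1], succ, V_star))  # -1 (steps toward the goal)
print(greedy_from_q(3, [+1, -1], Q_star))        # -1
```

Both policies pick the same action here; the difference is that the Q-based one never had to expand successor states, which is part of the efficiency argument the paper makes.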

Why readers should care

This is a preprint — not yet peer-reviewed — but it speaks to a practical question that matters for robotics, automated planning, and combinatorial solvers: how do you train domain-general policies cheaply and reliably? Could a move to Q-value supervision change the default training paradigm for learned planners? Practitioners and researchers can read the full paper and technical appendix on arXiv (https://arxiv.org/abs/2603.17544) to judge applicability to their domains.
