New arXiv benchmark exposes how large language models contradict themselves across related queries
What researchers released
A paper on arXiv titled "Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning" introduces a benchmark for measuring contradictions that arise when large language models answer multiple interdependent questions about the same "case-file." The authors assemble 390 multi-query reasoning instances, labeling each answer pair as entailment, contradiction, or unknown, and frame the task as maintaining a globally satisfiable belief state across related queries, which they call case-file logical consistency. The paper is openly available on arXiv.
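To make the pair-level labeling scheme concrete, here is a minimal sketch of what one benchmark instance might look like. The class name, field names, and example content are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class CaseFileInstance:
    """Hypothetical shape of one instance: several interdependent queries
    about a shared case-file, plus pair-level labels on the logical
    relation between the answers to those queries."""
    case_file: str                                  # shared context
    queries: list[str]                              # interdependent questions
    # (i, j) -> "entailment" | "contradiction" | "unknown"
    pair_labels: dict[tuple[int, int], str] = field(default_factory=dict)

instance = CaseFileInstance(
    case_file="The suspect was in Paris on May 3.",
    queries=[
        "Was the suspect in France on May 3?",
        "Could the suspect have been in Tokyo on May 3?",
    ],
    pair_labels={(0, 1): "contradiction"},
)
```

A "yes" to both queries would be the kind of cross-query contradiction the benchmark is designed to surface.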
Methods and measurement
Rather than judging single-answer correctness, the paper evaluates whether a model’s ensemble of answers can be made jointly consistent. The benchmark pairs interdependent queries and annotates the logical relations between answers; the goal is to quantify cross-query contradictions and to propose metrics that capture global satisfiability instead of isolated accuracy. The authors also discuss experimental protocols for probing multi-query behavior, and they make their dataset available to the community so others can reproduce and extend the work.
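The notion of joint consistency can be sketched as a small satisfiability check: treat each answer as a boolean proposition, require entailment pairs to respect implication and contradiction pairs to disagree, and ask whether any truth assignment satisfies all constraints at once. This is one illustrative reading of "globally satisfiable", not the authors' exact formalization, and the brute-force search is only practical for small instances.

```python
import itertools

def globally_satisfiable(n: int, pair_labels: dict) -> bool:
    """Return True if some truth assignment to the n answers satisfies
    every pairwise constraint: "entailment" (i, j) means answer i being
    true forces answer j true; "contradiction" means i and j must
    disagree; "unknown" imposes no constraint."""
    for assignment in itertools.product([False, True], repeat=n):
        ok = True
        for (i, j), rel in pair_labels.items():
            if rel == "entailment" and assignment[i] and not assignment[j]:
                ok = False
                break
            if rel == "contradiction" and assignment[i] == assignment[j]:
                ok = False
                break
        if ok:
            return True
    return False
```

A model whose answer set fails this check has contradicted itself somewhere, even if every individual answer looks plausible in isolation.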
Why this matters
Why should practitioners care? Many real-world deployments, from legal intake and medical triage to customer support and long-running assistants, ask a model multiple related questions and expect a coherent belief state in return. When a model contradicts itself, downstream decisions suffer and user trust erodes. Reportedly, such inconsistencies are common even among state-of-the-art models, raising both practical and safety concerns. The problem also intersects with regulatory and geopolitical scrutiny of AI: unreliable reasoning complicates certification, export controls, and deployment across jurisdictions that are tightening oversight of high-risk AI applications.
Takeaway
The new benchmark gives researchers and engineers a concrete way to measure and reduce cross-query contradictions. Short-term fixes might include retrieval-augmented pipelines or consistency-aware decoding; long-term solutions require architectural and evaluation shifts that reward global logical coherence, not just per-query plausibility. The dataset and metrics on arXiv should accelerate that work.
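One way to picture a consistency-aware fix is a selection step that, given several candidate answers per query, prefers the combination with the fewest pairwise contradictions. The sketch below assumes a user-supplied `contradicts(a, b)` judge (in practice this might be an NLI model); the brute-force search is purely illustrative and scales poorly.

```python
import itertools

def pick_consistent_answers(candidates, contradicts):
    """Among all combinations of candidate answers (one per query),
    return the combination with the fewest pairwise contradictions,
    as scored by the caller-provided `contradicts(a, b)` predicate."""
    best, best_score = None, float("inf")
    for combo in itertools.product(*candidates):
        score = sum(
            contradicts(combo[i], combo[j])
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score < best_score:
            best, best_score = combo, score
    return list(best)
```

Decoding one query at a time cannot enforce this kind of global constraint, which is why the article's long-term point about rewarding global coherence matters.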
