arXiv, 2026-03-17

ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation

A new arXiv paper introduces ManiBench, a targeted benchmark designed to evaluate large language models' ability to generate Manim Community Edition (Manim CE) animations that are both temporally faithful and API-correct. Traditional code-evaluation suites such as HumanEval and MBPP measure syntactic correctness and static logic, but they do not capture whether generated code produces the intended dynamic visuals or respects evolving library APIs. The ManiBench authors argue that these omissions hide two important failure modes: visual-logic drift, where the produced animation no longer matches the pedagogical intent, and syntactic hallucinations, where models invent or misuse API calls.
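To make the second failure mode concrete, here is a minimal sketch (not from the paper) of how a syntactic-hallucination check could work: parse a generated script with Python's standard `ast` module and flag animation classes passed to `self.play` that are not in a known-API allowlist. The allowlist here is an illustrative hand-picked subset; a real checker would derive it from the installed Manim CE version. `ShowCreation` is a pre-Community-Edition name that Manim CE replaced with `Create`, so a model emitting it against a current release is hallucinating the API.

```python
import ast

# Illustrative allowlist: a tiny hand-picked subset of Manim CE animation
# names. A real checker would enumerate these from the target library version.
KNOWN_ANIMATIONS = {"Create", "FadeIn", "FadeOut", "Transform", "Write"}

def find_hallucinated_animations(source: str) -> list[str]:
    """Flag animation constructors passed to self.play(...) that are
    not in the allowlist for the targeted Manim version."""
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        # Match calls of the form <something>.play(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "play"):
            for arg in node.args:
                # Each positional argument like FadeOut(circle) is itself a call
                if isinstance(arg, ast.Call) and isinstance(arg.func, ast.Name):
                    if arg.func.id not in KNOWN_ANIMATIONS:
                        flagged.append(arg.func.id)
    return flagged

script = """
circle = Circle()
self.play(ShowCreation(circle))   # pre-CE name; Manim CE renamed it to Create
self.play(FadeOut(circle))
"""
print(find_hallucinated_animations(script))  # ['ShowCreation']
```

A static check like this catches invented or outdated names cheaply, but as the paper argues, it cannot catch visual-logic drift, which requires inspecting the rendered behavior.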

What ManiBench tests

Manim CE is an open-source Python engine widely used for math and science visualizations; timing, choreography, and subtle visual semantics matter as much as whether the script executes without error. ManiBench assembles tasks that require temporal fidelity (correct sequencing, durations, and transitions) and version-aware API correctness, since Manim’s API has changed across releases and small name or argument errors break animations or change their meaning. The benchmark includes test cases that check rendered behavior across versions rather than only static compilation or unit tests, making it harder for models to “pass” by producing superficially plausible but pedagogically wrong code.
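One piece of temporal fidelity, sequencing and durations, can be checked statically before any rendering. The sketch below (my illustration, not ManiBench's actual harness) extracts the ordered timeline of `self.play` animations and their `run_time` arguments from a generated script and compares it with an intended timeline; the 1.0-second default mirrors Manim CE's documented default `run_time`.

```python
import ast

def extract_timeline(source: str) -> list[tuple[str, float]]:
    """Return [(animation_name, run_time), ...] in the order the script
    calls self.play. run_time defaults to 1.0 when not given explicitly."""
    tree = ast.parse(source)
    timeline = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "play"):
            run_time = 1.0  # Manim CE's default duration in seconds
            for kw in node.keywords:
                if kw.arg == "run_time" and isinstance(kw.value, ast.Constant):
                    run_time = float(kw.value.value)
            for arg in node.args:
                if isinstance(arg, ast.Call) and isinstance(arg.func, ast.Name):
                    timeline.append((arg.func.id, run_time))
    return timeline

generated = """
self.play(FadeIn(title))
self.play(Transform(title, formula), run_time=3)
self.play(FadeOut(formula))
"""
# A hypothetical intended choreography for the task prompt
intended = [("FadeIn", 1.0), ("Transform", 3.0), ("FadeOut", 1.0)]
print(extract_timeline(generated) == intended)  # True
```

Even this check only covers ordering and duration; properties like spatial layout or whether a `Transform` preserves the pedagogically relevant correspondence still require the rendered-behavior tests the benchmark describes.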

Why this matters

As developers and educators increasingly rely on LLMs to draft simulation and visualization code, these failure modes have real downstream costs: misleading educational videos, broken learning tools, and brittle automated content pipelines. Can a model write code that not only runs but also teaches correctly? ManiBench provides an evaluative framework to answer that question and to push model development toward semantics-aware code generation. The paper reports initial evaluations that expose common hallucination patterns and version-sensitivity, underscoring the need for benchmarks that reflect real-world usage beyond static unit tests.

The ManiBench paper is available on arXiv (arXiv:2603.13251v1), and the authors provide the benchmark suite for further research. For readers unfamiliar with this niche, think of ManiBench as a stress test for the "visual intelligence" of code-generating LLMs, an increasingly important capability as AI moves from producing text to orchestrating time-based, pedagogical experiences.
