凤凰科技 2026-04-08

Open-source "MemPalace" (记忆宫殿) touts perfect long-memory score — community tests temper the hype

Bold claim, fast attention

An open-source long-term memory system called MemPalace (记忆宫殿), developed publicly with visible involvement from actress Milla Jovovich and engineer Ben Sigman, has reportedly achieved a perfect 500/500 on the LongMemEval benchmark, a result that sent the project viral across developer and AI communities. The repository and demos are on GitHub, and the team has pitched a headline-grabbing price point (reportedly about $0.7 per year) to suggest that persistent memory for large models can be nearly free. Big claim. Big appetite for a solution to a real LLM limitation.

How the system works — and where scrutiny begins

MemPalace borrows the ancient Method of Loci, structuring memories into wings, rooms, closets and drawers so that conversational context and raw transcripts are both retained. Official testing on a 22,000+ dialogue corpus reports a raw recall of 60.9%, rising to 94.8% after wing+room metadata filtering — a dramatic jump. The project also introduces AAAK, a bespoke short-hand encoding the team says can provide “30× lossless compression” and an efficient retrieval workflow that loads only ~170 tokens of “key facts” per model call.
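To make the "wings and rooms" idea concrete, here is a minimal sketch of Method-of-Loci-style storage with metadata filtering before ranking. The class names, fields, and term-overlap scoring are illustrative assumptions, not MemPalace's actual API; a real system would rank with embeddings (e.g. a ChromaDB collection queried with a metadata filter) rather than keyword overlap.

```python
# Illustrative sketch only: hierarchical memory with wing/room metadata.
# All names here (Memory, MemoryStore, retrieve) are hypothetical.
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    wing: str   # broad topic area, e.g. "work"
    room: str   # narrower context, e.g. "project-alpha"

class MemoryStore:
    def __init__(self):
        self.items: list[Memory] = []

    def add(self, text, wing, room):
        self.items.append(Memory(text, wing, room))

    def retrieve(self, query_terms, wing=None, room=None, k=3):
        # Narrow the candidate pool by metadata first (the step the
        # article credits with lifting recall), then rank survivors by
        # naive term overlap as a stand-in for vector similarity.
        pool = [m for m in self.items
                if (wing is None or m.wing == wing)
                and (room is None or m.room == room)]
        scored = sorted(
            pool,
            key=lambda m: -sum(t in m.text.lower() for t in query_terms))
        return scored[:k]

store = MemoryStore()
store.add("Alice prefers meetings after 2pm", wing="work", room="scheduling")
store.add("Project Alpha deadline is June 3", wing="work", room="project-alpha")
store.add("User's cat is named Miso", wing="personal", room="pets")

hits = store.retrieve(["deadline"], wing="work", room="project-alpha")
print(hits[0].text)  # → Project Alpha deadline is June 3
```

The point of the two-stage design is that metadata filtering shrinks the search space before any similarity ranking runs, which is also why critics note that this stage relies on standard tooling rather than anything unique to MemPalace.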

But community analysis has pushed back. Independent tokenizers reportedly show AAAK can increase token counts (examples where 66 tokens rose to 73), and AAAK is not lossless in practice: on LongMemEval AAAK-mode scored 84.2% versus 96.6% for raw-mode retrieval — a 12.4-point drop. Metadata filtering itself uses mainstream tools (ChromaDB), so critics say the headline performance is not solely MemPalace’s invention and that AAAK’s trade-offs undermine some core claims.
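The tokenizer finding is easy to reproduce in spirit: shrinking the character count does not guarantee fewer tokens, because subword tokenizers keep common words as single tokens while symbol-dense shorthand fragments. The mapping and tokenizer below are toy stand-ins, not the actual AAAK encoding or any production tokenizer.

```python
# Illustrative only: character compression is not token compression.
import re

def toy_tokenize(text):
    # Stand-in for a subword tokenizer: whole English words survive as
    # single tokens, while each punctuation symbol becomes its own token.
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)

plain = "the user prefers afternoon meetings"
shorthand = "usr>pm-mtg!"  # hypothetical compressed form: fewer characters

print(len(plain), len(shorthand))                       # chars: 35 vs 11
print(len(toy_tokenize(plain)), len(toy_tokenize(shorthand)))  # tokens: 5 vs 6
```

Under a real BPE-style tokenizer the effect is the same in kind: shorthand glyphs rarely match the vocabulary's frequent subwords, which is consistent with the reported 66-to-73 token inflation.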

Why it matters — and what comes next

Long-term memory is one of the most demanded features for practical LLM deployments: stateful assistants, team knowledge bases, and smoother developer workflows would all benefit. MemPalace shows a plausible, model-agnostic architecture (compatible with Claude, GPT, Gemini, etc.) that sidesteps heavy fine-tuning. But questions about compression fidelity, token economics, and the durability of a “cheap” public service remain. Will the free or ultra-low-cost model scale when users accumulate millions of tokens? Who owns and secures the stored dialogues across jurisdictions?

Reporters and developers will be watching both the code and the numbers. Exceptional benchmark headlines attract users — and scrutiny. In this case, the applause is real, but so are the caveats.
