2026-06-05 · a no-cheating evaluation of donto-memory against the LongMemEval long-term-memory benchmark, with a frontier model (codex gpt-5.4) as both reader and judge, measured against a codex-alone full-context baseline
We put donto-memory — a bitemporal, paraconsistent, evidence-first memory built on the donto substrate — through LongMemEval, the standard long-term-memory benchmark, under audited no-leakage conditions. We did one thing most published memory-benchmark numbers don't: we ran a codex-alone full-context baseline — the same reader handed the entire conversation history with no memory system — so the delta isolates exactly what the memory layer contributes.
The honest headline: on longmemeval_s, where the
whole history still fits in a frontier model's context window, a memory
layer does not win on raw accuracy. A strong reader given
everything is very hard to beat. What donto-memory does
deliver, measurably:
longmemeval_m (≈500 sessions), where
full-context is impossible.We report the gaps as plainly as the wins. This study is about understanding where a memory layer earns its keep, not about a leaderboard trophy.
LongMemEval (ICLR 2025) is 500 questions over realistic multi-session chat histories, across six abilities — single-session (user / assistant / preference), multi-session, temporal-reasoning, knowledge-update — plus an abstention set of unanswerable questions. Three variants differ only in haystack size:
| variant | haystack | what it tests |
|---|---|---|
_oracle |
only the evidence sessions | the reader given perfect retrieval (a ceiling) |
_s |
~50 sessions (~115–128k tok) | the memory system: ingest all, retrieve at query time |
_m |
~500 sessions | memory at a scale that exceeds the context window |
We focus on _s (the standard
memory-system test), report _oracle as a ceiling, and flag
_m as the next run.
donto-memory is an example consumer of donto, a bitemporal, paraconsistent, evidence-first claim substrate. For this study:
valid_from). All ~50 sessions
are ingested; nothing is filtered by evidence.valid_from/valid_to, so the reader can prefer
the current value of an attribute.donto_recall / donto_search
/ donto_memorize; npx -y donto-memory-mcp), so
any agent can use it as native tools and drive its own recalls. (Docs:
mcp.donto.org.)A memory benchmark is trivially gameable (peek at evidence labels), so we audited every gold-field use line by line:
role+content only; the per-turn
has_answer flag is never emitted.answer_session_ids / has_answer.label = "yes" in response.lower() rule, and the
"_abs" abstention detection are byte-identical to
the official evaluate_qa.py.The one disclosed deviation: the official judge is the OpenAI GPT-4o API; we use codex gpt-5.4 (subscription; zero per-token API budget). Same prompts, same parsing, applied identically to every instance and to both arms. The reader is likewise codex gpt-5.4. This makes our setup methodologically analogous to the leaderboard leader OMEGA (GPT-4.1 reader+judge), and means our numbers compare to OMEGA / HydraDB / Mastra — not to the paper's weaker GPT-4o rows.
With perfect retrieval (evidence sessions only), codex reader+judge:
| ability | accuracy |
|---|---|
| single-session-assistant | 1.000 |
| temporal-reasoning | 0.985 |
| single-session-user | 0.971 |
| knowledge-update | 0.962 |
| multi-session | 0.895 |
| single-session-preference | 0.800 |
| overall | 0.946 · task-avg 0.935 · abstention 0.967 |
This is the reader's ceiling; it competes with the paper's GPT-4o oracle (0.87–0.92).
_s,
paired, n=48)Same 48 instances, same reader, same judge; the only difference is whether codex reads donto's retrieved top-20 or the entire haystack:
| ability | codex-alone (full-context) | codex+donto |
|---|---|---|
| knowledge-update | 0.875 | 1.000 |
| multi-session | 0.875 | 0.875 |
| single-session-assistant | 1.000 | 1.000 |
| single-session-user | 1.000 | 1.000 |
| temporal-reasoning | 1.000 | 1.000 |
| single-session-preference | 1.000 | 0.750 |
| overall | 0.957 | 0.936 |
| abstention | 0.667 | 1.000 |
A separate run over the full ~50-session haystack with partial
embeddings (graceful FTS fallback) held at overall 0.933 /
abstention 1.0, confirming the result is robust to incomplete
vector coverage. The full-500 _s run is completing as of
writing.
A zero-cost offline sweep (recall only, no reader) measured whether the evidence session lands in the top-k. The hybrid vector arm is load-bearing — turning it on lifts overall hit@10 from 0.85 (FTS-only) to 0.98:
| ability | hit@5 FTS→hybrid | hit@10 FTS→hybrid |
|---|---|---|
| single-session-assistant | 0.62 → 1.00 | 0.88 → 1.00 |
| temporal-reasoning | 0.75 → 0.88 | 0.88 → 1.00 |
| single-session-preference | 0.12 → 0.50 | 0.38 → 0.88 |
| multi-session | 1.00 → 1.00 | 1.00 → 1.00 |
| knowledge-update | 1.00 → 1.00 | 1.00 → 1.00 |
| overall | 0.75 → 0.90 | 0.85 → 0.98 |
single-session-preference is the weak spot. Preference questions ("what should I cook for guests?") share almost no surface words with the session that states the preference ("I'm vegan"); even semantically the link is inferential. donto stores one chunk per session and embeds only its first ~300 tokens, so evidence buried mid-session is invisible to the vector arm.
A validated fix (future work): finer chunking.
Splitting sessions into ~1,000-char windows (each fully embedded) moved
a known-missed preference instance from miss → hit@5.
The cost is ~10× more chunks — ~250k at full _s scale —
which on CPU-only bge-small (~3 chunk/s, no GPU) is ~17h, infeasible on
the current box. So the headline run uses whole-session chunks and we
report finer chunking as a measured, deferred improvement (it needs
int8/GPU embedding to scale).
Published longmemeval_s scores:
| system | overall | reader / judge | retrieval |
|---|---|---|---|
| OMEGA | 95.4% | GPT-4.1 (both) | bge-small + FTS + cross-encoder rerank + time-decay |
| Mastra | 94.87% | — | — |
| HydraDB | 90.79% | — | graph-native (entity/temporal/causal + BM25) |
| Memoria | 88.78% | — | — |
| Zep / Graphiti | 71.2% | — | — |
| paper, GPT-4o full-context | 60–64% | GPT-4o | none |
The caution: the LongMemEval score conflates the reader and the memory system. The paper's GPT-4o full-context scores 60–64%, but our codex gpt-5.4 full-context — no memory at all — scored 95.7%. Swap in a stronger reader and the "memory system" numbers move with it. A single overall number is not a clean ranking of memory systems; the codex-alone baseline is the control that matters, and the right questions are: how much does the memory layer add over giving the reader everything, at what token cost, and does it scale past the context window?
donto-memory shares OMEGA's retrieval family (identical bge-small + FTS + fusion) but lacks OMEGA's cross-encoder reranking and time-decay weighting — concrete, queued improvements we'll measure against this same benchmark.
_s, a memory layer doesn't win on
accuracy when a strong reader can hold the whole history in
context. We say so plainly._m (~500 sessions), where full-context is
impossible and retrieval is mandatory. That is the next run.The realistic way an agent uses donto-memory isn't a bespoke harness
— it's the Model Context Protocol. donto-memory ships an MCP server
(npx -y donto-memory-mcp, or a dependency-free Python
single-file variant) exposing donto_recall /
donto_search / donto_memorize. We verified a
frontier agent (codex gpt-5.4) agentically calling
donto_recall and grounding its answer only in recalled
memories. Full tool reference and agent guidance: mcp.donto.org.
_s, so 1,200 chars is the speed/quality sweet spot._s comparison is a paired
48-instance subset plus a partial-embedding full-haystack run;
the full-500 _s completes shortly and will
update §5.2._m (the variant where memory's accuracy
case is strongest) are the next runs.The harness is fully scripted: a resumable donto arm, a codex
full-context baseline arm, an offline recall sweep, plus the
faithfulness audit and competitive-landscape notes. Judge prompts are
the verbatim official evaluate_qa.py. No per-token API;
codex via subscription.
Honest caveats restated: codex gpt-5.4 reader+judge;
_s results on the config above with the full-500 in
progress; donto-memory does not top the _s accuracy table,
and that is the expected, correct finding for a memory layer when the
whole history fits in context — its value is efficiency, bitemporal
correctness, honest abstention, and scaling to histories that
don't.