donto-memory on LongMemEval — a faithful study of what a memory layer adds to a frontier reader

2026-06-05 · a no-cheating evaluation of donto-memory against the LongMemEval long-term-memory benchmark, with a frontier model (codex gpt-5.4) as both reader and judge, measured against a codex-alone full-context baseline


Executive summary

We put donto-memory, a bitemporal, paraconsistent, evidence-first memory built on the donto substrate, through LongMemEval, the standard long-term-memory benchmark, under audited no-leakage conditions. We also did one thing most published memory-benchmark reports skip: we ran a codex-alone full-context baseline, in which the same reader is handed the entire conversation history with no memory system, so the delta isolates what the memory layer contributes.

The honest headline: on longmemeval_s, where the whole history still fits in a frontier model's context window, a memory layer does not win on raw accuracy. A strong reader given everything is very hard to beat. What donto-memory does deliver, measurably, is roughly 2× token efficiency plus wins on knowledge-update and abstention (detailed in sections 5.2 and 8).

We report the gaps as plainly as the wins. This study is about understanding where a memory layer earns its keep, not about a leaderboard trophy.

1. The benchmark

LongMemEval (ICLR 2025) is 500 questions over realistic multi-session chat histories, across six abilities — single-session (user / assistant / preference), multi-session, temporal-reasoning, knowledge-update — plus an abstention set of unanswerable questions. Three variants differ only in haystack size:

variant | haystack | what it tests
_oracle | only the evidence sessions | the reader given perfect retrieval (a ceiling)
_s | ~50 sessions (~115–128k tok) | the memory system: ingest all, retrieve at query time
_m | ~500 sessions | memory at a scale that exceeds the context window

We focus on _s (the standard memory-system test), report _oracle as a ceiling, and flag _m as the next run.

2. The system under test: donto-memory

donto-memory is an example consumer of donto, a bitemporal, paraconsistent, evidence-first claim substrate. For this study:

3. Faithfulness — no cheating

A memory benchmark is trivially gameable (peek at evidence labels), so we audited every gold-field use line by line:

The one disclosed deviation: the official judge is the OpenAI GPT-4o API; we use codex gpt-5.4 (subscription; zero per-token API budget). Same prompts, same parsing, applied identically to every instance and to both arms. The reader is likewise codex gpt-5.4. This makes our setup methodologically analogous to the leaderboard leader OMEGA (GPT-4.1 reader+judge), and means our numbers compare to OMEGA / HydraDB / Mastra — not to the paper's weaker GPT-4o rows.

4. Methodology

5. Results

5.1 Oracle ceiling (full 500)

With perfect retrieval (evidence sessions only), codex reader+judge:

ability | accuracy
single-session-assistant | 1.000
temporal-reasoning | 0.985
single-session-user | 0.971
knowledge-update | 0.962
multi-session | 0.895
single-session-preference | 0.800

overall 0.946 · task-avg 0.935 · abstention 0.967

This is the reader's ceiling; for comparison, the paper's GPT-4o oracle scores 0.87–0.92.
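The task-avg figure can be checked directly from the table: it is the unweighted mean of the six per-ability accuracies, whereas the overall figure is computed over all questions and so weights each ability by its question count. A minimal sketch of that arithmetic follows; the variable and function names are illustrative and not taken from the harness.

```python
# Per-ability oracle accuracies, copied from the table above.
oracle_accuracy = {
    "single-session-assistant": 1.000,
    "temporal-reasoning": 0.985,
    "single-session-user": 0.971,
    "knowledge-update": 0.962,
    "multi-session": 0.895,
    "single-session-preference": 0.800,
}


def task_average(per_ability):
    """Unweighted mean over abilities: each ability counts once,
    regardless of how many questions it contains."""
    return sum(per_ability.values()) / len(per_ability)


task_avg = task_average(oracle_accuracy)
print(f"task-avg = {task_avg:.4f}")
```

The result lands at about 0.9355, in line with the reported 0.935. Because single-session-preference is both the smallest and the weakest category, it pulls the task average below the overall score.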

5.2 The comparison that matters — codex vs codex+donto (_s, paired, n=48)

Same 48 instances, same reader, same judge; the only difference is whether codex reads donto's retrieved top-20 or the entire haystack:

ability | codex-alone (full-context) | codex+donto
knowledge-update | 0.875 | 1.000
multi-session | 0.875 | 0.875
single-session-assistant | 1.000 | 1.000
single-session-user | 1.000 | 1.000
temporal-reasoning | 1.000 | 1.000
single-session-preference | 1.000 | 0.750
overall | 0.957 | 0.936
abstention | 0.667 | 1.000

A separate run over the full ~50-session haystack with partial embeddings (graceful FTS fallback) held at overall 0.933 / abstention 1.0, which suggests the result is robust to incomplete vector coverage. The full-500 _s run is still in progress as of writing.
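Since the two arms are paired, the quantity of interest is the per-ability difference rather than either column alone. The sketch below computes those deltas from the figures in the paired table; the names `paired` and `deltas` are illustrative only.

```python
# (codex-alone full-context, codex+donto) accuracies from the paired table.
paired = {
    "knowledge-update": (0.875, 1.000),
    "multi-session": (0.875, 0.875),
    "single-session-assistant": (1.000, 1.000),
    "single-session-user": (1.000, 1.000),
    "temporal-reasoning": (1.000, 1.000),
    "single-session-preference": (1.000, 0.750),
    "overall": (0.957, 0.936),
    "abstention": (0.667, 1.000),
}


def deltas(table):
    """Accuracy of the memory arm minus the full-context arm, per row."""
    return {name: round(donto - alone, 3) for name, (alone, donto) in table.items()}


# Print rows from largest gain to largest loss.
for name, delta in sorted(deltas(paired).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {delta:+.3f}")
```

Read this way, the memory arm gains on abstention and knowledge-update, loses on single-session-preference, and ties elsewhere. With n=48, each per-ability difference rests on a small number of instances.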

6. Retrieval analysis — where donto's recall is strong and weak

A zero-cost offline sweep (recall only, no reader) measured whether the evidence session lands in the top-k. The hybrid vector arm is load-bearing — turning it on lifts overall hit@10 from 0.85 (FTS-only) to 0.98:

ability | hit@5 (FTS → hybrid) | hit@10 (FTS → hybrid)
single-session-assistant | 0.62 → 1.00 | 0.88 → 1.00
temporal-reasoning | 0.75 → 0.88 | 0.88 → 1.00
single-session-preference | 0.12 → 0.50 | 0.38 → 0.88
multi-session | 1.00 → 1.00 | 1.00 → 1.00
knowledge-update | 1.00 → 1.00 | 1.00 → 1.00
overall | 0.75 → 0.90 | 0.85 → 0.98
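The hit@k metric used in the sweep can be sketched in a few lines. This version assumes that a hit means at least one evidence session appears in the top k; the study does not say whether questions with several evidence sessions require any or all of them. The record layout (`ranked`, `evidence`) and the toy data are hypothetical.

```python
def hit_at_k(ranked_ids, evidence_ids, k):
    """True if any evidence session id appears among the first k retrieved ids.

    Assumption: 'any' evidence session counts as a hit, not 'all'.
    """
    return any(session_id in evidence_ids for session_id in ranked_ids[:k])


def mean_hit_at_k(results, k):
    """Fraction of questions whose evidence lands in the top k."""
    hits = [hit_at_k(r["ranked"], r["evidence"], k) for r in results]
    return sum(hits) / len(hits)


# Toy retrieval results (hypothetical session ids, not benchmark data).
results = [
    {"ranked": ["s3", "s7", "s1"], "evidence": {"s7"}},  # evidence at rank 2
    {"ranked": ["s2", "s9", "s4"], "evidence": {"s4"}},  # evidence at rank 3
    {"ranked": ["s5", "s6", "s8"], "evidence": {"s0"}},  # evidence never retrieved
]

print(mean_hit_at_k(results, 2))
print(mean_hit_at_k(results, 3))
```

Because the metric needs only ranked ids and gold evidence ids, it can be run offline with no reader or judge calls, which is what makes such a sweep zero-cost.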

single-session-preference is the weak spot. Preference questions ("what should I cook for guests?") share almost no surface words with the session that states the preference ("I'm vegan"); even semantically the link is inferential. donto stores one chunk per session and embeds only its first ~300 tokens, so evidence buried mid-session is invisible to the vector arm.

A validated fix (future work): finer chunking. Splitting sessions into ~1,000-char windows (each fully embedded) moved a known-missed preference instance from miss → hit@5. The cost is ~10× more chunks (~250k at full _s scale), which on CPU-only bge-small (~3 chunks/s, no GPU) means roughly 17 to 23 hours of embedding, infeasible on the current box. So the headline run uses whole-session chunks, and we report finer chunking as a measured, deferred improvement (it needs int8/GPU embedding to scale).
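To make the trade-off concrete, here is a minimal sketch of fixed-size character windowing together with the embedding-time arithmetic. The function names, the optional `overlap` parameter, and the 2,500-character toy session are assumptions for illustration; the study does not describe its chunker beyond the window size.

```python
def char_windows(text, size=1000, overlap=0):
    """Split text into consecutive windows of at most `size` characters.

    `overlap` (hypothetical, default 0) repeats that many trailing
    characters of one window at the start of the next.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def embed_hours(n_chunks, chunks_per_second):
    """Wall-clock hours to embed n_chunks at a steady throughput."""
    return n_chunks / chunks_per_second / 3600


session = "x" * 2500  # stand-in for one session transcript
chunks = char_windows(session)
print([len(c) for c in chunks])            # [1000, 1000, 500]

print(round(embed_hours(250_000, 3), 1))   # 23.1
print(round(embed_hours(250_000, 4), 1))   # 17.4
```

With the rounded figures quoted above, 250k chunks at 3 chunks/s comes to about 23 hours, and about 17 hours at 4 chunks/s. The estimate is sensitive to the throughput figure, but either way it is well beyond a short CPU-only run.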

7. The competitive landscape — and a caution about leaderboards

Published longmemeval_s scores:

system | overall | reader / judge | retrieval
OMEGA | 95.4% | GPT-4.1 (both) | bge-small + FTS + cross-encoder rerank + time-decay
Mastra | 94.87% | |
HydraDB | 90.79% | | graph-native (entity/temporal/causal + BM25)
Memoria | 88.78% | |
Zep / Graphiti | 71.2% | |
paper, GPT-4o full-context | 60–64% | GPT-4o | none

The caution: the LongMemEval score conflates the reader and the memory system. The paper's GPT-4o full-context scores 60–64%, but our codex gpt-5.4 full-context — no memory at all — scored 95.7%. Swap in a stronger reader and the "memory system" numbers move with it. A single overall number is not a clean ranking of memory systems; the codex-alone baseline is the control that matters, and the right questions are: how much does the memory layer add over giving the reader everything, at what token cost, and does it scale past the context window?

donto-memory shares OMEGA's retrieval family (identical bge-small + FTS + fusion) but lacks OMEGA's cross-encoder reranking and time-decay weighting — concrete, queued improvements we'll measure against this same benchmark.
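The fusion step that combines the FTS and vector rankings is not specified here. One common choice is reciprocal rank fusion (RRF), sketched below purely as an assumption about what such a step can look like; it is not a description of donto-memory's or OMEGA's actual method. The constant k=60 is the conventional default and the session ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per id (rank starts at 1).
    Ids are returned by descending fused score, ties broken by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fts_ranking = ["s1", "s2", "s3"]      # hypothetical full-text search order
vector_ranking = ["s3", "s1", "s4"]   # hypothetical embedding search order

fused = reciprocal_rank_fusion([fts_ranking, vector_ranking])
print(fused)  # ['s1', 's3', 's2', 's4']
```

Sessions found by both arms (s1, s3) rise above sessions found by only one. A cross-encoder reranker or time-decay weighting would then operate on this fused candidate list.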

8. What we actually learned

  1. On _s, a memory layer doesn't win on accuracy when a strong reader can hold the whole history in context. We say so plainly.
  2. donto's defensible value is specific and real: ~2× token efficiency, and wins on knowledge-update (bitemporal) and abstention (evidence-first) — the load-bearing parts of its thesis.
  3. The accuracy case for a memory layer lives on _m (~500 sessions), where full-context is impossible and retrieval is mandatory. That is the next run.

9. The production integration (MCP)

The realistic way an agent uses donto-memory isn't a bespoke harness — it's the Model Context Protocol. donto-memory ships an MCP server (npx -y donto-memory-mcp, or a dependency-free Python single-file variant) exposing donto_recall / donto_search / donto_memorize. We verified a frontier agent (codex gpt-5.4) agentically calling donto_recall and grounding its answer only in recalled memories. Full tool reference and agent guidance: mcp.donto.org.
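For orientation, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with method `tools/call`. The sketch below only builds such a message for donto_recall. The argument name `query` is a guess for illustration; the actual parameter schema is in the tool reference at mcp.donto.org.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# `query` is a hypothetical argument name, not confirmed by the source.
message = build_tool_call(
    1, "donto_recall", {"query": "What diet does the user follow?"}
)
print(json.dumps(message, indent=2))
```

In practice an agent framework handles this framing and the transport; the point is that recall is an ordinary tool call the model decides to make.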

10. Engineering notes

11. Limitations & future work

12. Reproducibility

The harness is fully scripted: a resumable donto arm, a codex full-context baseline arm, and an offline recall sweep, plus the faithfulness audit and competitive-landscape notes. Judge prompts are taken verbatim from the official evaluate_qa.py. No per-token API is used; codex runs via subscription.
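A resumable arm of this kind typically appends one JSON line per answered question and skips ids already on disk when restarted. The sketch below shows that pattern under stated assumptions: `run_resumable`, `load_done`, and the stub reader are hypothetical names, and the record fields `question_id` and `hypothesis` are assumed rather than taken from the harness.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of question ids already written to the JSONL file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["question_id"] for line in f if line.strip()}


def run_resumable(instances, answer_fn, path):
    """Answer each instance once, appending results so a rerun resumes."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as out:
        for inst in instances:
            qid = inst["question_id"]
            if qid in done:
                continue  # already answered in an earlier run
            record = {"question_id": qid, "hypothesis": answer_fn(inst)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # push each answer to disk before the next call
    return len(load_done(path))


calls = []


def stub_reader(inst):
    """Stand-in for the reader call; records which questions were asked."""
    calls.append(inst["question_id"])
    return "stub answer"


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "hypotheses.jsonl")
    batch = [{"question_id": "q1"}, {"question_id": "q2"}]
    first = run_resumable(batch, stub_reader, out_path)
    second = run_resumable(batch + [{"question_id": "q3"}], stub_reader, out_path)

print(first, second, calls)
```

On the second invocation only q3 reaches the reader, so an interrupted run costs nothing for the questions already answered.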

Honest caveats restated: codex gpt-5.4 reader+judge; _s results on the config above with the full-500 in progress; donto-memory does not top the _s accuracy table, and that is the expected, correct finding for a memory layer when the whole history fits in context — its value is efficiency, bitemporal correctness, honest abstention, and scaling to histories that don't.