
donto-memory on LongMemEval — a faithful study of what a memory layer adds to a frontier reader

2026-06-05 · a no-cheating evaluation of donto-memory against the LongMemEval long-term-memory benchmark, with a frontier model (codex gpt-5.4) as both reader and judge, measured against a codex-alone full-context baseline

Executive summary

We put donto-memory, a bitemporal, paraconsistent, evidence-first memory built on the donto substrate, through LongMemEval, the standard long-term-memory benchmark, under audited no-leakage conditions. We also did one thing most published memory-benchmark results omit: we ran a codex-alone full-context baseline (the same reader handed the entire conversation history with no memory system), so the delta isolates what the memory layer contributes.

The honest headline: on longmemeval_s, where the whole history still fits in a frontier model's context window, a memory layer does not win on raw accuracy. A strong reader given everything is very hard to beat. What donto-memory does deliver, measurably:

~2× lower token cost at parity accuracy (it reads a retrieved top-k, not a ~120k-token haystack);
wins on knowledge-update (bitemporal valid-time → pick the latest value) and abstention (evidence-first → say "I don't know" instead of hallucinating);
a clear, diagnosed gap on preference recall, with a validated fix;
and the real accuracy case for a memory layer lives on longmemeval_m (≈500 sessions), where full-context is impossible.

We report the gaps as plainly as the wins. This study is about understanding where a memory layer earns its keep, not about a leaderboard trophy.

1. The benchmark

LongMemEval (ICLR 2025) is 500 questions over realistic multi-session chat histories, across six abilities — single-session (user / assistant / preference), multi-session, temporal-reasoning, knowledge-update — plus an abstention set of unanswerable questions. Three variants differ only in haystack size:

variant	haystack	what it tests
`_oracle`	only the evidence sessions	the reader given perfect retrieval (a ceiling)
`_s`	~50 sessions (~115–128k tok)	the memory system: ingest all, retrieve at query time
`_m`	~500 sessions	memory at a scale that exceeds the context window

We focus on _s (the standard memory-system test), report _oracle as a ceiling, and flag _m as the next run.

2. The system under test: donto-memory

donto-memory is an example consumer of donto, a bitemporal, paraconsistent, evidence-first claim substrate. For this study:

Ingest — each session becomes an episodic chunk anchored to a per-question holder, with the session date injected as valid-time (valid_from). All ~50 sessions are ingested; nothing is filtered by evidence.
Hybrid recall — holder-scoped lexical (FTS) + semantic (bge-small-en-v1.5, 384-dim, HNSW) retrieval, RRF-fused, with learned predicate alignment-closure expansion. This is the same retrieval family as the current public leader (OMEGA): identical embedding model + FTS + fusion.
Bitemporal valid-time — recalled rows carry valid_from/valid_to, so the reader can prefer the current value of an attribute.
MCP server — donto-memory is exposed over the Model Context Protocol (donto_recall / donto_search / donto_memorize; npx -y donto-memory-mcp), so any agent can use it as native tools and drive its own recalls. (Docs: mcp.donto.org.)
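
The fusion step in hybrid recall can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion over two ranked lists, not donto-memory's actual code: the function name `rrf_fuse`, the smoothing constant `k=60` (a common default in the RRF literature), and the session ids are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first (e.g. FTS hits, vector hits).
    k:        smoothing constant (assumed value; not taken from the study).
    top_n:    how many fused ids to return (the study uses a top-20 budget).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical per-arm results for one question.
fts_hits = ["s12", "s03", "s41"]
vector_hits = ["s03", "s27", "s12"]
fused = rrf_fuse([fts_hits, vector_hits])
print(fused)
```

An id ranked well by both arms (here `s03`) rises above an id ranked first by only one arm, which is the property that makes rank fusion attractive when lexical and semantic scores are not on a comparable scale.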

3. Faithfulness — no cheating

A memory benchmark is trivially gameable (peek at evidence labels), so we audited every gold-field use line by line:

Ingest renders sessions from role+content only; the per-turn has_answer flag is never emitted.
Recall queries with the QUESTION and holder only — never answer_session_ids / has_answer.
Reader sees only: the question, the question date, and the recalled memories. No gold answer, no evidence labels.
Judge receives the gold answer for grading only — exactly as the official harness does.
The judge prompts, the label = "yes" in response.lower() rule, and the "_abs" abstention detection are byte-identical to the official evaluate_qa.py.
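
The two parsing rules named above are small enough to write out. The sketch below restates them as standalone functions; the function names are ours, and we assume the "_abs" marker is looked up in the question id, so treat this as an illustration rather than a copy of evaluate_qa.py.

```python
def parse_judge_label(response: str) -> bool:
    """Return True when the judge's reply counts as 'correct'.

    This is the substring rule quoted in the text: any reply containing
    'yes' (case-insensitive) is scored as correct.
    """
    return "yes" in response.lower()


def is_abstention(question_id: str) -> bool:
    """Return True for abstention (unanswerable) questions.

    Assumption: the '_abs' marker appears in the question id.
    """
    return "_abs" in question_id


# Hypothetical judge replies and question ids, for illustration only.
print(parse_judge_label("Yes."))        # True
print(parse_judge_label("no"))          # False
print(is_abstention("q_0042_abs"))      # True
print(is_abstention("q_0042"))          # False
```

The substring rule is lenient (a reply such as "No, but yesterday..." would also match), which is why it matters that the same rule is applied identically to every instance and to both arms.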

The one disclosed deviation: the official judge is the OpenAI GPT-4o API; we use codex gpt-5.4 (subscription; zero per-token API budget). Same prompts, same parsing, applied identically to every instance and to both arms. The reader is likewise codex gpt-5.4. This makes our setup methodologically analogous to the leaderboard leader OMEGA (GPT-4.1 reader+judge), and means our numbers compare to OMEGA / HydraDB / Mastra — not to the paper's weaker GPT-4o rows.

4. Methodology

Harness — a resumable, checkpointed pipeline (ingest → embed → recall → reader → judge); per-instance checkpointing so a crash never re-spends compute.
Recall budget — bounded top-k (k=20) hybrid recall over the full ~50-session haystack (a real retrieval test, not the oracle hand-off).
Reader / judge — codex gpt-5.4, official prompts.
Baseline arm — the same codex reader handed the ENTIRE date-sorted haystack (~120k tokens), NO donto retrieval — isolating the memory system's contribution.
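
Per-instance checkpointing of the kind described in the harness bullet can be sketched like this. It is a minimal pattern under our own assumptions (a JSON-lines checkpoint file, a `question_id` key, and a caller-supplied `process` function standing in for the recall, reader and judge steps), not the study's actual harness.

```python
import json
from pathlib import Path


def run_resumable(instances, process, checkpoint_path):
    """Process each instance once, appending results to a JSONL checkpoint.

    On restart, instances already present in the checkpoint are skipped,
    so a crash never repeats finished work.
    """
    path = Path(checkpoint_path)
    done = {}
    if path.exists():
        with path.open() as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    # A crash mid-write can leave a truncated last line; skip it.
                    continue
                done[rec["question_id"]] = rec

    with path.open("a") as fh:
        for inst in instances:
            qid = inst["question_id"]
            if qid in done:
                continue  # already checkpointed in an earlier run
            rec = {"question_id": qid, "result": process(inst)}
            fh.write(json.dumps(rec) + "\n")
            fh.flush()  # push the record to the OS before moving on
            done[qid] = rec
    return done
```

Writing one record per instance and flushing immediately keeps the unit of lost work to a single instance, at the cost of slightly more I/O than batching.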

5. Results

5.1 Oracle ceiling (full 500)

With perfect retrieval (evidence sessions only), codex reader+judge:

ability	accuracy
single-session-assistant	1.000
temporal-reasoning	0.985
single-session-user	0.971
knowledge-update	0.962
multi-session	0.895
single-session-preference	0.800
overall	0.946
task-avg	0.935
abstention	0.967

This is the reader's ceiling; it sits above the paper's GPT-4o oracle range (0.87–0.92).

5.2 The comparison that matters — codex vs codex+donto (`_s`, paired, n=48)

Same 48 instances, same reader, same judge; the only difference is whether codex reads donto's retrieved top-20 or the entire haystack:

ability	codex-alone (full-context)	codex+donto
knowledge-update	0.875	1.000
multi-session	0.875	0.875
single-session-assistant	1.000	1.000
single-session-user	1.000	1.000
temporal-reasoning	1.000	1.000
single-session-preference	1.000	0.750
overall	0.957	0.936
abstention	0.667	1.000

Accuracy: ~tied (0.957 vs 0.936 ≈ one instance on n=47).
Token cost: donto ~2.1× cheaper — measured median reader prompt 124k tok (full-context) vs 58k tok (codex+donto).
donto wins knowledge-update (bitemporal valid-time → latest value, not drowned in superseded history).
donto wins abstention (evidence-first → refuses unanswerable questions full-context hallucinates).
donto loses preference (whole-session chunks miss the subtle, inferential preference session — see §6).
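
The knowledge-update win rests on valid-time: when several recalled rows describe the same attribute, the one whose validity interval covers the question date is the current value. In the study this preference is exercised by the reader over the recalled rows; the sketch below only illustrates the selection rule. The `Recalled` class, the half-open interval convention, and the example rows are assumptions of ours; only the field names valid_from / valid_to come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Recalled:
    text: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def current_value(rows: List[Recalled], as_of: date) -> Optional[Recalled]:
    """Pick the row whose valid-time interval covers `as_of`.

    Assumes half-open intervals [valid_from, valid_to). If several rows
    are live, the most recently started one wins.
    """
    live = [
        r for r in rows
        if r.valid_from <= as_of and (r.valid_to is None or as_of < r.valid_to)
    ]
    if not live:
        return None
    return max(live, key=lambda r: r.valid_from)


# Hypothetical recalled rows about one attribute.
rows = [
    Recalled("User lives in Boston", date(2023, 1, 10), date(2023, 6, 1)),
    Recalled("User lives in Denver", date(2023, 6, 1)),
]
print(current_value(rows, date(2023, 9, 1)).text)  # User lives in Denver
```

Asking the same question with an earlier date returns the superseded value instead, which is the behaviour a full-context reader has to reconstruct on its own from a long history.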

A separate run over the full ~50-session haystack with partial embeddings (graceful FTS fallback) held at overall 0.933 / abstention 1.0, confirming the result is robust to incomplete vector coverage. The full-500 _s run is completing as of writing.

6. Retrieval analysis — where donto's recall is strong and weak

A zero-cost offline sweep (recall only, no reader) measured whether the evidence session lands in the top-k. The hybrid vector arm is load-bearing — turning it on lifts overall hit@10 from 0.85 (FTS-only) to 0.98:

ability	hit@5 FTS→hybrid	hit@10 FTS→hybrid
single-session-assistant	0.62 → 1.00	0.88 → 1.00
temporal-reasoning	0.75 → 0.88	0.88 → 1.00
single-session-preference	0.12 → 0.50	0.38 → 0.88
multi-session	1.00 → 1.00	1.00 → 1.00
knowledge-update	1.00 → 1.00	1.00 → 1.00
overall	0.75 → 0.90	0.85 → 0.98
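
The hit@k figures in the table can be computed with a few lines. This is a sketch under one stated assumption: an instance counts as a hit if any of its evidence sessions appears in the top-k (the text does not say how instances with several evidence sessions are scored). Evidence labels are used here only for offline scoring, never at recall time.

```python
def hit_at_k(ranked_ids, evidence_ids, k):
    """True if at least one evidence session is among the first k results."""
    return any(sid in evidence_ids for sid in ranked_ids[:k])


def hit_rate(runs, k):
    """Fraction of instances with a hit in the top-k.

    runs: list of (ranked_ids, evidence_ids) pairs, one per question.
    """
    if not runs:
        raise ValueError("no instances to score")
    hits = sum(hit_at_k(ranked, evidence, k) for ranked, evidence in runs)
    return hits / len(runs)


# Hypothetical sweep output: two questions, three recalled sessions each.
runs = [
    (["a", "b", "c"], {"c"}),   # evidence lands at rank 3
    (["d", "e", "f"], {"z"}),   # evidence never recalled
]
print(hit_rate(runs, 2))  # 0.0
print(hit_rate(runs, 3))  # 0.5
```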

single-session-preference is the weak spot. Preference questions ("what should I cook for guests?") share almost no surface words with the session that states the preference ("I'm vegan"); even semantically the link is inferential. donto stores one chunk per session and embeds only its first ~300 tokens, so evidence buried mid-session is invisible to the vector arm.

A validated fix (future work): finer chunking. Splitting sessions into ~1,000-char windows (each fully embedded) moved a known-missed preference instance from miss → hit@5. The cost is ~10× more chunks — ~250k at full _s scale — which on CPU-only bge-small (~3 chunk/s, no GPU) is ~17h, infeasible on the current box. So the headline run uses whole-session chunks and we report finer chunking as a measured, deferred improvement (it needs int8/GPU embedding to scale).
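
Finer chunking of this kind can be sketched as a fixed-size character window. The function below is an illustration only: the name `window_chunks` and the optional `overlap` parameter are our own additions (the text specifies ~1,000-char windows and says nothing about overlap, so it defaults to 0).

```python
def window_chunks(text, size=1000, overlap=0):
    """Split a session's text into fixed-size character windows.

    size:    window length in characters (~1,000 in the text).
    overlap: characters shared between consecutive windows
             (hypothetical knob; 0 reproduces plain splitting).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# A hypothetical 2,500-character session becomes three windows.
session_text = "x" * 2500
chunks = window_chunks(session_text)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Because every window is short enough to be embedded in full, evidence stated mid-session becomes visible to the vector arm; the price is the chunk-count multiplier described above.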

7. The competitive landscape — and a caution about leaderboards

Published longmemeval_s scores:

system	overall	reader / judge	retrieval
OMEGA	95.4%	GPT-4.1 (both)	bge-small + FTS + cross-encoder rerank + time-decay
Mastra	94.87%	—	—
HydraDB	90.79%	—	graph-native (entity/temporal/causal + BM25)
Memoria	88.78%	—	—
Zep / Graphiti	71.2%	—	—
paper, GPT-4o full-context	60–64%	GPT-4o	none

The caution: the LongMemEval score conflates the reader and the memory system. The paper's GPT-4o full-context scores 60–64%, but our codex gpt-5.4 full-context, with no memory at all, scored 95.7% on the paired subset in §5.2. Swap in a stronger reader and the "memory system" numbers move with it. A single overall number is not a clean ranking of memory systems; the codex-alone baseline is the control that matters, and the right questions are: how much does the memory layer add over giving the reader everything, at what token cost, and does it scale past the context window?

donto-memory shares OMEGA's retrieval family (identical bge-small + FTS + fusion) but lacks OMEGA's cross-encoder reranking and time-decay weighting — concrete, queued improvements we'll measure against this same benchmark.

8. What we actually learned

On _s, a memory layer doesn't win on accuracy when a strong reader can hold the whole history in context. We say so plainly.
donto's defensible value is specific and real: ~2× token efficiency, and wins on knowledge-update (bitemporal) and abstention (evidence-first) — the load-bearing parts of its thesis.
The accuracy case for a memory layer lives on _m (~500 sessions), where full-context is impossible and retrieval is mandatory. That is the next run.

9. The production integration (MCP)

The realistic way an agent uses donto-memory isn't a bespoke harness — it's the Model Context Protocol. donto-memory ships an MCP server (npx -y donto-memory-mcp, or a dependency-free Python single-file variant) exposing donto_recall / donto_search / donto_memorize. We verified a frontier agent (codex gpt-5.4) agentically calling donto_recall and grounding its answer only in recalled memories. Full tool reference and agent guidance: mcp.donto.org.
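
For orientation, this is roughly what a tool call looks like on the wire. MCP uses JSON-RPC 2.0, and tools are invoked with the `tools/call` method carrying a tool name and an arguments object. The argument names below (`query`, `holder`) and their values are hypothetical placeholders; the real input schema for donto_recall is in the tool reference at mcp.donto.org. A real client would also perform the protocol's initialization handshake first, which is omitted here.

```python
import json


def tools_call_request(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 'tools/call' request as used by MCP clients."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Hypothetical arguments: check the server's published schema before use.
req = tools_call_request(
    1,
    "donto_recall",
    {"query": "Where does the user live now?", "holder": "user-123"},
)
print(json.dumps(req, indent=2))
```

In practice an agent framework builds and sends these messages itself once the server is registered; the sketch is only meant to show how little sits between the agent and the recall call.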

10. Engineering notes

Embedding is the scaling bottleneck, not retrieval. CPU-only bge-small does ~3 chunk/s; ~25k chunks ≈ 2h. Throughput scales inversely with chunk length (400ch→21/s, 1500ch→6/s, 3000ch→2.7/s); raising the embed cap past ~300 tokens (roughly 1,200 chars) gave no recall gain on _s, so 1,200 chars is the speed/quality sweet spot.
A real failure we caught and fixed. A single long-lived embed process leaked to 9.7 GB RSS and, colliding with live extraction traffic, swap-thrashed a 16 GB box to a near-total stall (~0 chunk/s while its own log claimed "3 stmt/s"). Fix: a segmented embed (fresh process per 4k chunks → bounded ~0.8 GB) at low priority. Standing lesson: a launched job is not a working job until you've watched the real metric move — verify throughput by the delta of the output count over wall-clock, never the program's self-reported rate.
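
The standing lesson translates into a very small check: sample the output count twice and divide by elapsed wall-clock time. The sketch below pairs that with the back-of-envelope time estimate used throughout this report. Function names and the sample numbers in the usage lines are ours; how the output count is obtained (a row count, a file count) is left to the caller.

```python
def measured_throughput(count_start, count_end, t_start, t_end):
    """Chunks per second from two observations of the real output count.

    count_*: number of embedded chunks actually present in the output.
    t_*:     wall-clock timestamps in seconds (e.g. from time.time()).
    """
    elapsed = t_end - t_start
    if elapsed <= 0:
        raise ValueError("t_end must be later than t_start")
    return (count_end - count_start) / elapsed


def embed_hours(n_chunks, chunks_per_sec):
    """Estimated wall-clock hours to embed n_chunks at a steady rate."""
    if chunks_per_sec <= 0:
        raise ValueError("chunks_per_sec must be positive")
    return n_chunks / chunks_per_sec / 3600.0


# Hypothetical samples taken one minute apart.
rate = measured_throughput(12_000, 12_180, 0.0, 60.0)
print(rate)                                # 3.0
print(round(embed_hours(25_000, rate), 1))   # 2.3
print(round(embed_hours(250_000, rate), 1))  # 23.1
```

At a flat 3 chunk/s, 25k chunks comes to about 2.3 hours, in line with the ≈ 2h quoted above. The same rate would put 250k chunks nearer 23 hours, so the ~17h figure in §6 corresponds to roughly 4 chunk/s, which would be consistent with shorter windows embedding faster than whole-session chunks.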

11. Limitations & future work

Judge/reader is codex gpt-5.4, not GPT-4o (disclosed; comparable to OMEGA's GPT-4.1).
The headline _s comparison is a paired 48-instance subset plus a partial-embedding full-haystack run; the full-500 _s run is still in progress and will update §5.2.
Finer chunking (preference fix) and _m (the variant where memory's accuracy case is strongest) are the next runs.
Cross-encoder reranking + time-decay (OMEGA-style) are queued retrieval improvements.

12. Reproducibility

The harness is fully scripted: a resumable donto arm, a codex full-context baseline arm, an offline recall sweep, plus the faithfulness audit and competitive-landscape notes. Judge prompts are taken verbatim from the official evaluate_qa.py. No per-token API is used; codex runs via subscription.

Honest caveats restated: codex gpt-5.4 reader+judge; _s results on the config above with the full-500 in progress; donto-memory does not top the _s accuracy table, and that is the expected, correct finding for a memory layer when the whole history fits in context — its value is efficiency, bitemporal correctness, honest abstention, and scaling to histories that don't.