# donto-memory on LongMemEval — a faithful study of what a memory layer adds to a frontier reader

**2026-06-05 · a no-cheating evaluation of donto-memory against the LongMemEval long-term-memory benchmark, with a frontier model (codex gpt-5.4) as both reader and judge, measured against a codex-alone full-context baseline**

---

## Executive summary

We put **donto-memory** — a bitemporal, paraconsistent, evidence-first memory built on the [donto](https://donto.org) substrate — through [LongMemEval](https://arxiv.org/abs/2410.10813), the standard long-term-memory benchmark, under audited no-leakage conditions. We did one thing most published memory-benchmark numbers don't: we ran a **codex-alone full-context baseline** — the same reader handed the *entire* conversation history with no memory system — so the delta isolates exactly what the memory layer contributes.

The honest headline: **on `longmemeval_s`, where the whole history still fits in a frontier model's context window, a memory layer does not win on raw accuracy.** A strong reader given everything is very hard to beat. What donto-memory *does* deliver, measurably:

- **~2× lower token cost** at parity accuracy (it reads a retrieved top-k, not a ~120k-token haystack);
- **wins on knowledge-update** (bitemporal valid-time → pick the *latest* value) and **abstention** (evidence-first → say "I don't know" instead of hallucinating);
- a clear, diagnosed gap on **preference** recall, with a validated fix;
- and the real accuracy case for a memory layer lives on **`longmemeval_m`** (≈500 sessions), where full-context is impossible.

We report the gaps as plainly as the wins. This study is about understanding *where a memory layer earns its keep*, not about a leaderboard trophy.

## 1. The benchmark

LongMemEval (ICLR 2025) is 500 questions over realistic multi-session chat histories, across six abilities — single-session (user / assistant / preference), multi-session, temporal-reasoning, knowledge-update — plus an abstention set of unanswerable questions. Three variants differ only in haystack size:

| variant | haystack | what it tests |
|---|---|---|
| `_oracle` | only the evidence sessions | the **reader** given perfect retrieval (a ceiling) |
| `_s` | ~50 sessions (~115–128k tok) | the **memory system**: ingest all, retrieve at query time |
| `_m` | ~500 sessions | memory at a scale that exceeds the context window |

We focus on **`_s`** (the standard memory-system test), report `_oracle` as a ceiling, and flag `_m` as the next run.

## 2. The system under test: donto-memory

donto-memory is an example consumer of donto, a bitemporal, paraconsistent, evidence-first claim substrate. For this study:

- **Ingest** — each session becomes an episodic chunk anchored to a per-question holder, with the session date injected as **valid-time** (`valid_from`). All ~50 sessions are ingested; nothing is filtered by evidence.
- **Hybrid recall** — holder-scoped **lexical (FTS) + semantic (bge-small-en-v1.5, 384-dim, HNSW) retrieval, RRF-fused**, with learned predicate **alignment-closure** expansion. This is the same retrieval family as the current public leader (OMEGA): identical embedding model + FTS + fusion.
- **Bitemporal valid-time** — recalled rows carry `valid_from`/`valid_to`, so the reader can prefer the *current* value of an attribute.
- **MCP server** — donto-memory is exposed over the Model Context Protocol (`donto_recall` / `donto_search` / `donto_memorize`; `npx -y donto-memory-mcp`), so any agent can use it as native tools and drive its own recalls. (Docs: [mcp.donto.org](https://mcp.donto.org).)
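
The RRF fusion step named in the hybrid-recall bullet can be illustrated with a minimal sketch. This is generic Reciprocal Rank Fusion, not donto's actual code: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the session ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains, so an id
    ranked well by both the lexical and the semantic arm rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical rankings from the two arms for one question.
lexical = ["s12", "s03", "s41"]   # FTS order
semantic = ["s03", "s27", "s12"]  # vector order
fused = rrf_fuse([lexical, semantic])
print(fused)  # ['s03', 's12', 's27', 's41']
```

RRF works on ranks only, so the lexical and vector scores never need to be put on a common scale.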

## 3. Faithfulness — no cheating

A memory benchmark is trivially gameable (peek at evidence labels), so we audited every gold-field use line by line:

- **Ingest** renders sessions from `role`+`content` only; the per-turn `has_answer` flag is never emitted.
- **Recall** queries with the QUESTION and holder only — never `answer_session_ids` / `has_answer`.
- **Reader** sees only: the question, the question date, and the recalled memories. No gold answer, no evidence labels.
- **Judge** receives the gold answer for grading only — exactly as the official harness does.
- The judge prompts, the `label = "yes" in response.lower()` rule, and the `"_abs"` abstention detection are **byte-identical to the official `evaluate_qa.py`**.
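
For readers unfamiliar with the official grading rule, here is a small sketch of the two checks named in the last bullet. The `label` expression is quoted from the text above; applying the `"_abs"` check to the question id is our reading of the official script, and the example ids are hypothetical.

```python
def parse_label(response: str) -> bool:
    """Grade a judge response: correct iff it contains 'yes' (case-insensitive)."""
    return "yes" in response.lower()


def is_abstention(question_id: str) -> bool:
    """Assumed convention: abstention questions carry '_abs' in their id."""
    return "_abs" in question_id


print(parse_label("Yes."))           # True
print(parse_label("No"))             # False
print(is_abstention("q_0042_abs"))   # True
print(is_abstention("q_0042"))       # False
```

The rule is a plain substring test, so any response containing "yes" anywhere counts as correct. Keeping it identical to the official script matters more for comparability than tightening it.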

**The one disclosed deviation:** the official judge is the OpenAI GPT-4o API; we use **codex gpt-5.4** (subscription; zero per-token API budget). Same prompts, same parsing, applied identically to every instance and to both arms. The reader is likewise codex gpt-5.4. This makes our setup **methodologically analogous to the leaderboard leader OMEGA** (GPT-4.1 reader+judge), and means our numbers compare to OMEGA / HydraDB / Mastra — **not** to the paper's weaker GPT-4o rows.

## 4. Methodology

- **Harness** — a resumable, checkpointed pipeline (ingest → embed → recall → reader → judge); per-instance checkpointing so a crash never re-spends compute.
- **Recall budget** — bounded top-k (k=20) hybrid recall over the full ~50-session haystack (a real retrieval test, not the oracle hand-off).
- **Reader / judge** — codex gpt-5.4, official prompts.
- **Baseline arm** — the same codex reader handed the ENTIRE date-sorted haystack (~120k tokens), NO donto retrieval — isolating the memory system's contribution.
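
The per-instance checkpointing described in the harness bullet can be sketched as an append-only JSONL file keyed by question id. This is an illustrative pattern, not the harness itself: `run_resumable`, `load_done`, the file name, and the `fake_process` stand-in for the recall, reader and judge steps are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(path):
    """Read finished records from the checkpoint file, keyed by question id."""
    done = {}
    p = Path(path)
    if p.exists():
        with p.open() as f:
            for line in f:
                line = line.strip()
                if line:
                    rec = json.loads(line)
                    done[rec["question_id"]] = rec
    return done


def run_resumable(instances, process, path):
    """Process each instance once, appending one JSON line per finished instance."""
    done = load_done(path)
    with Path(path).open("a") as f:
        for inst in instances:
            qid = inst["question_id"]
            if qid in done:
                continue  # already paid for on a previous run
            rec = {"question_id": qid, "result": process(inst)}
            f.write(json.dumps(rec) + "\n")
            f.flush()  # push the record out before starting the next instance
            done[qid] = rec
    return done


# Demo: a first run that stops after q1, then a rerun over the full list.
calls = []


def fake_process(inst):
    calls.append(inst["question_id"])
    return inst["question_id"].upper()


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "results.jsonl")
    data = [{"question_id": "q1"}, {"question_id": "q2"}]
    run_resumable(data[:1], fake_process, ckpt)
    results = run_resumable(data, fake_process, ckpt)

print(calls)  # ['q1', 'q2']
```

The rerun processes only `q2`, which is the property that keeps a crash from re-spending reader or judge calls.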

## 5. Results

### 5.1 Oracle ceiling (full 500)

With perfect retrieval (evidence sessions only), codex reader+judge:

| ability | accuracy |
|---|---|
| single-session-assistant | 1.000 |
| temporal-reasoning | 0.985 |
| single-session-user | 0.971 |
| knowledge-update | 0.962 |
| multi-session | 0.895 |
| single-session-preference | 0.800 |
| **overall** | **0.946** · task-avg 0.935 · abstention 0.967 |

This is the reader's ceiling; at 0.946 overall it sits above the range reported for the paper's GPT-4o oracle (0.87–0.92).

### 5.2 The comparison that matters — codex vs codex+donto (`_s`, paired, n=48)

Same 48 instances, same reader, same judge; the only difference is whether codex reads donto's retrieved top-20 or the entire haystack:

| ability | codex-alone (full-context) | codex+donto |
|---|---|---|
| knowledge-update | 0.875 | **1.000** |
| multi-session | 0.875 | 0.875 |
| single-session-assistant | 1.000 | 1.000 |
| single-session-user | 1.000 | 1.000 |
| temporal-reasoning | 1.000 | 1.000 |
| single-session-preference | **1.000** | 0.750 |
| **overall** | **0.957** | 0.936 |
| abstention | 0.667 | **1.000** |

- **Accuracy: roughly tied** (0.957 vs 0.936 is a one-instance gap: 45 vs 44 correct of the 47 instances behind the overall rate).
- **Token cost: donto ~2.1× cheaper** — measured median reader prompt **124k tok (full-context)** vs **58k tok (codex+donto)**.
- **donto wins knowledge-update** (bitemporal valid-time → latest value, not drowned in superseded history).
- **donto wins abstention** (evidence-first → refuses unanswerable questions full-context hallucinates).
- **donto loses preference** (whole-session chunks miss the subtle, inferential preference session — see §6).
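
The knowledge-update win is attributed above to valid-time. A minimal sketch of "prefer the current value" follows, assuming recalled rows expose `valid_from`/`valid_to` as described in §2. The `Row` shape, the half-open interval convention, and the example data are assumptions for illustration, not donto's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Row:
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def current_value(rows, attribute, as_of):
    """Return the value whose valid-time interval covers `as_of`, if any.

    Intervals are treated as half-open: valid_from <= as_of < valid_to.
    If several rows qualify, the one with the latest valid_from wins.
    """
    live = [
        r for r in rows
        if r.attribute == attribute
        and r.valid_from <= as_of
        and (r.valid_to is None or as_of < r.valid_to)
    ]
    if not live:
        return None
    return max(live, key=lambda r: r.valid_from).value


# Hypothetical history: the user moved from Boston to Denver.
rows = [
    Row("home_city", "Boston", date(2023, 1, 10), date(2023, 6, 1)),
    Row("home_city", "Denver", date(2023, 6, 1)),
]
print(current_value(rows, "home_city", date(2023, 7, 1)))  # Denver
print(current_value(rows, "home_city", date(2023, 3, 1)))  # Boston
```

A full-context reader sees both statements with equal weight and has to work out which is newer. With valid-time attached, that choice reduces to a comparison.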

A separate run over the full ~50-session haystack with partial embeddings (graceful FTS fallback) held at **overall 0.933 / abstention 1.0**, which suggests the result is robust to incomplete vector coverage. The full-500 `_s` run was still in progress at the time of writing.

## 6. Retrieval analysis — where donto's recall is strong and weak

A zero-cost offline sweep (recall only, no reader) measured whether the evidence session lands in the top-k. **The hybrid vector arm is load-bearing** — turning it on lifts overall **hit@10 from 0.85 (FTS-only) to 0.98**:

| ability | hit@5 FTS→hybrid | hit@10 FTS→hybrid |
|---|---|---|
| single-session-assistant | 0.62 → 1.00 | 0.88 → 1.00 |
| temporal-reasoning | 0.75 → 0.88 | 0.88 → 1.00 |
| single-session-preference | 0.12 → 0.50 | 0.38 → 0.88 |
| multi-session | 1.00 → 1.00 | 1.00 → 1.00 |
| knowledge-update | 1.00 → 1.00 | 1.00 → 1.00 |
| **overall** | **0.75 → 0.90** | **0.85 → 0.98** |
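
For reference, hit@k as used in the table can be computed as below. This sketch assumes a question counts as a hit if any of its evidence sessions appears in the top-k, which is our reading of "the evidence session lands in the top-k". The record layout and ids are hypothetical, and the gold evidence ids are used only to score the ranking offline.

```python
def hit_at_k(retrieved_ids, evidence_ids, k):
    """True if any evidence session appears among the first k retrieved ids."""
    return any(sid in evidence_ids for sid in retrieved_ids[:k])


def hit_rate(runs, k):
    """Fraction of questions whose evidence lands in the top-k."""
    hits = sum(hit_at_k(r["retrieved"], r["evidence"], k) for r in runs)
    return hits / len(runs)


# Two hypothetical questions: one found at rank 3, one never found.
runs = [
    {"retrieved": ["s1", "s2", "s3"], "evidence": {"s3"}},
    {"retrieved": ["s4", "s5", "s6"], "evidence": {"s9"}},
]
print(hit_rate(runs, 2))  # 0.0
print(hit_rate(runs, 3))  # 0.5
```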

**single-session-preference** is the weak spot. Preference questions ("what should I cook for guests?") share almost no surface words with the session that states the preference ("I'm vegan"); even semantically the link is inferential. donto stores one chunk per session and embeds only its first ~300 tokens, so evidence buried mid-session is invisible to the vector arm.

**A validated fix (future work):** finer chunking. Splitting sessions into ~1,000-char windows (each fully embedded) moved a known-missed preference instance from miss → **hit@5**. The cost is ~10× more chunks — ~250k at full `_s` scale — which on CPU-only bge-small (~3 chunk/s, no GPU) is ~17h, infeasible on the current box. So the headline run uses whole-session chunks and we report finer chunking as a measured, deferred improvement (it needs int8/GPU embedding to scale).
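
The finer-chunking idea can be sketched as fixed-size character windows. This is an assumption-laden illustration: the text above gives only the ~1,000-char window size, so the `overlap` parameter (off by default) and the choice to split on raw characters rather than turn boundaries are ours.

```python
def window_chunks(text, size=1000, overlap=0):
    """Split a session transcript into consecutive windows of `size` chars.

    `overlap` chars are shared between neighbouring windows; with the
    default of 0 the windows tile the text exactly.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


session_text = "x" * 2500  # stand-in for a rendered session
chunks = window_chunks(session_text)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Each window is short enough to be embedded in full, so a preference stated mid-session gets its own vector instead of falling past the embed cap.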

## 7. The competitive landscape — and a caution about leaderboards

Published `longmemeval_s` scores:

| system | overall | reader / judge | retrieval |
|---|---|---|---|
| OMEGA | 95.4% | GPT-4.1 (both) | bge-small + FTS + cross-encoder rerank + time-decay |
| Mastra | 94.87% | — | — |
| HydraDB | 90.79% | — | graph-native (entity/temporal/causal + BM25) |
| Memoria | 88.78% | — | — |
| Zep / Graphiti | 71.2% | — | — |
| paper, GPT-4o full-context | 60–64% | GPT-4o | none |

**The caution:** the LongMemEval score conflates the *reader* and the *memory system*. The paper's GPT-4o full-context scores 60–64%, but **our codex gpt-5.4 full-context — no memory at all — scored 95.7%.** Swap in a stronger reader and the "memory system" numbers move with it. A single overall number is not a clean ranking of memory systems; the **codex-alone baseline is the control that matters**, and the right questions are: *how much does the memory layer add over giving the reader everything, at what token cost, and does it scale past the context window?*

donto-memory shares OMEGA's retrieval family (identical bge-small + FTS + fusion) but lacks OMEGA's **cross-encoder reranking** and **time-decay weighting** — concrete, queued improvements we'll measure against this same benchmark.

## 8. What we actually learned

1. **On `_s`, a memory layer doesn't win on accuracy** when a strong reader can hold the whole history in context. We say so plainly.
2. **donto's defensible value is specific and real:** ~2× token efficiency, and wins on **knowledge-update (bitemporal)** and **abstention (evidence-first)** — the load-bearing parts of its thesis.
3. **The accuracy case for a memory layer lives on `_m`** (~500 sessions), where full-context is impossible and retrieval is mandatory. That is the next run.

## 9. The production integration (MCP)

The realistic way an agent uses donto-memory isn't a bespoke harness — it's the Model Context Protocol. donto-memory ships an MCP server (`npx -y donto-memory-mcp`, or a dependency-free Python single-file variant) exposing `donto_recall` / `donto_search` / `donto_memorize`. We verified a frontier agent (codex gpt-5.4) **agentically** calling `donto_recall` and grounding its answer only in recalled memories. Full tool reference and agent guidance: [mcp.donto.org](https://mcp.donto.org).

## 10. Engineering notes

- **Embedding is the scaling bottleneck, not retrieval.** CPU-only bge-small does ~3 chunk/s, so ~25k chunks take a little over 2h. Throughput falls as chunks get longer (400 chars → 21/s, 1,500 chars → 6/s, 3,000 chars → 2.7/s); raising the embed cap past ~300 tokens (roughly 1,200 chars) gave no recall gain on `_s`, so 1,200 chars is the speed/quality sweet spot.
- **A real failure we caught and fixed.** A single long-lived embed process leaked to **9.7 GB RSS** and, colliding with live extraction traffic, swap-thrashed a 16 GB box to a near-total stall (~0 chunk/s while its own log claimed "3 stmt/s"). Fix: a **segmented** embed (fresh process per 4k chunks → bounded ~0.8 GB) at low priority. Standing lesson: *a launched job is not a working job until you've watched the real metric move* — verify throughput by the delta of the output count over wall-clock, never the program's self-reported rate.
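
The standing lesson above, measuring throughput from the output count rather than the job's own log, can be sketched as follows. The function names and sample numbers are illustrative. In practice the counts would come from whatever the embed job writes, for example a row count polled every few minutes.

```python
def observed_rate(samples):
    """Throughput from (unix_seconds, rows_written) samples in time order.

    Uses the first and last sample: delta of output count over wall-clock.
    """
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (n1 - n0) / (t1 - t0)


def eta_hours(remaining, rate):
    """Hours left at the observed rate; infinite if nothing is moving."""
    if rate <= 0:
        return float("inf")
    return remaining / rate / 3600


healthy = [(0, 1000), (600, 2800)]   # 1,800 chunks written in 10 minutes
stalled = [(0, 5000), (600, 5000)]   # the count did not move at all

print(observed_rate(healthy))                   # 3.0
print(round(eta_hours(25_000, 3.0), 1))         # 2.3
print(eta_hours(1_000, observed_rate(stalled))) # inf
```

A stalled job shows up as a rate of zero regardless of what its log claims.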

## 11. Limitations & future work

- Judge/reader is **codex gpt-5.4, not GPT-4o** (disclosed; comparable to OMEGA's GPT-4.1).
- The headline `_s` comparison is a **paired 48-instance** subset plus a partial-embedding full-haystack run; the **full-500 `_s`** completes shortly and will update §5.2.
- **Finer chunking** (preference fix) and **`_m`** (the variant where memory's accuracy case is strongest) are the next runs.
- **Cross-encoder reranking + time-decay** (OMEGA-style) are queued retrieval improvements.

## 12. Reproducibility

The harness is fully scripted: a resumable donto arm, a codex full-context baseline arm, an offline recall sweep, plus the faithfulness audit and competitive-landscape notes. Judge prompts are taken verbatim from the official `evaluate_qa.py`. No per-token API is used; codex runs via subscription.

*Honest caveats restated: codex gpt-5.4 reader+judge; `_s` results on the config above with the full-500 in progress; donto-memory does not top the `_s` accuracy table, and that is the expected, correct finding for a memory layer when the whole history fits in context — its value is efficiency, bitemporal correctness, honest abstention, and scaling to histories that don't.*