donto-memory deep-mode — engine reference

Date: 2026-05-31
Scope: Everything the mode: "deep" extraction pipeline does, from a POST /memorize request to the rows that land in the substrate. Covers code paths, prompts, dedup, salvage, token usage, the async queue, the audit log, the job-detail surface, observed empirics, and the known limits.

This document complements the empirical case study at /research/donto-deep-mode-eternal-recurrence-2026-05-31.html. Where that report reads results, this one explains the machinery that produced them.


1. What deep mode is

mode: "deep" is the iterative-novelty extraction lane inside donto-memory. Where mode: "single" does one LLM call with a maximalist prompt (~30 facts target) and mode: "exhaustive" does five parallel calls under different rhetorical “apertures” (surface / linguistic / presupposition / inferential / conceivable), mode: "deep" does N sequential calls of the same prompt, each shown a list of the facts the prior passes already produced. The model’s task each pass is only to find things the earlier passes missed.

The design intent:

Configuration is a single HTTP knob:

{
  "mode": "deep",
  "passes": 7,
  "modality": "descriptive",
  "holder": "agent:omega-bot",
  "session_id": "discord:1349727923434815519:1497274794586931220",
  "text": "...",
  "images": []
}

Synonyms: "sequential", "iterative" route to the same code path.


2. Request lifecycle

omega-bot
   │  POST https://memories.apexpots.com/memorize
   ▼
caddy (TLS, gzip)
   │
   ▼
donto-memory-api (127.0.0.1:7900, axum)
   │
   ├─► should_defer(req, default_mode)?
   │     true if mode ∈ {deep, exhaustive, sequential, iterative, multi, apertures}
   │     OR req.r#async == Some(true)
   │
   │  ── deferred path ──
   ├─► write "POST /memorize (queued)" audit row     (immediate)
   ├─► return HTTP 202 {status: "queued", queue_id, ...}  (immediate)
   ├─► spawn tokio task
   │       │
   │       ▼
   │   acquire AppState.async_memorize_lock  (single-permit Mutex)
   │       │
   │       ▼
   │   memorize_one(...)  ← see §3
   │       │
   │       ▼
   │   write "POST /memorize (async)" audit row with final stats
   │
   │  ── sync path ──
   └─► memorize_one(...) inline, return 200 with full body

Why the deferred path exists

Cloudflare in front of memories.apexpots.com cuts proxied HTTP at ~100 s. The Nietzsche run took 854 s end-to-end; deep mode cannot be served synchronously over a Cloudflare-fronted endpoint without 524-ing. The deferred path returns 202 immediately so the client never blocks, and the actual work runs to completion in the background.
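
The defer decision in the lifecycle diagram can be sketched as a small predicate. This is a minimal illustration, not the actual handler: the trimmed-down MemorizeReq struct and its field types are assumptions, as is lowercasing the mode before matching (the document only states that for the extraction dispatch).

```rust
/// Hypothetical, trimmed-down request type: only the two fields the
/// defer decision reads. The real MemorizeReq has more fields.
pub struct MemorizeReq {
    pub mode: Option<String>,
    pub r#async: Option<bool>,
}

/// Modes that always run in the background, per the lifecycle diagram.
const DEFERRED_MODES: [&str; 6] = [
    "deep", "exhaustive", "sequential", "iterative", "multi", "apertures",
];

/// Returns true when the request should take the deferred (202) path.
pub fn should_defer(req: &MemorizeReq, default_mode: &str) -> bool {
    // Fall back to the server default when the request omits `mode`.
    let mode = req.mode.as_deref().unwrap_or(default_mode).to_lowercase();
    DEFERRED_MODES.contains(&mode.as_str()) || req.r#async == Some(true)
}
```

Under these assumptions a mode: "single" request with async: true is still deferred, and a request with no mode at all is deferred whenever the server default is one of the listed modes.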

Why a single-permit Mutex

async_memorize_lock: Arc<tokio::sync::Mutex<()>> lives on the AppState. Each spawned task acquires it before running memorize_one. The lock is intentionally narrow — it serialises deep extractions, not all memorize traffic, because:

The cost: queue is a Mutex, not a real queue. Multiple queued tasks pile up as parked futures inside the running process. Restarting the binary loses all in-flight + queued work. We’ve seen this twice: the binary was restarted at 14:44:58 mid-Nietzsche-rerun and again later, marking 2 orphaned (queued) rows as (lost) in the audit log. The startup path does the right thing here — it stamps surviving (queued) rows with a (lost) endpoint label and status_code=500 so they don’t ghost in the job list.


3. memorize_one — the inner engine

memorize_one(s: &AppState, req: &MemorizeReq)
│
├─► §3.1  OCR (if images attached)
├─► §3.2  Episodic ingest (always — raw text → substrate)
├─► §3.3  Optional LLM extraction (single | exhaustive | deep)
└─► §3.4  Semantic-claim ingest (per-fact ingest into substrate)

3.1 OCR

If req.images is non-empty and s.settings.ocr_enabled, a separate LLM call is made before extraction:

The augmented effective_text becomes both the episodic record body and the extractor input. An OCR failure is not fatal — it logs a warning and proceeds with the original text. There are no images in either of the runs analysed here, so the OCR path didn’t run.

3.2 Episodic ingest

The raw text is always written as an episodic record, regardless of extraction mode. This is the substrate’s atomic “something happened” anchor.

let episodic_input = IngestInput {
    holder: req.holder,           // "agent:omega-bot"
    session_id: req.session_id,   // "discord:<guild>:<channel>"
    text: effective_text,         // raw + OCR
    modality: req.modality,       // "descriptive"
    ...
};
episodic.ingest(substrate, pool, consumer_iri, &episodic_input)
    → (episodic_record_id, episodic_record_iri)

The episodic record gets a stable IRI like donto:record:<uuid> that is then handed to every subsequent semantic-claim ingest as source_record_iri so the provenance chain is episodic_record ← semantic_claim ← fact.

3.3 Extraction dispatch

req.mode (or s.settings.extract_mode if unset) is lowercased and matched:

| Mode keyword | Function | Concurrency | Default usage |
|---|---|---|---|
| single (default) | extract_single | 1 call | omega-bot historic |
| exhaustive / multi / apertures | extract_exhaustive | 5 parallel | research/testing |
| deep / sequential / iterative | extract_deep | N sequential | omega-bot current |

Unrecognised modes fall through to single. Deep mode also accepts req.passes (default 3, clamped to 1..=10).
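
A minimal sketch of that dispatch, assuming the mode arrives as an optional string and passes as an optional integer. The ExtractMode enum and the resolve_mode name are hypothetical; only the keyword sets, the fall-through to single, and the default-3 / 1..=10 clamp come from the description above.

```rust
/// Hypothetical enum for the three extraction lanes.
#[derive(Debug, PartialEq)]
pub enum ExtractMode {
    Single,
    Exhaustive,
    Deep { passes: u32 },
}

/// Lowercases the requested mode (or the configured default) and maps it
/// to a lane. Unknown keywords fall through to `Single`.
pub fn resolve_mode(mode: Option<&str>, default_mode: &str, passes: Option<u32>) -> ExtractMode {
    match mode.unwrap_or(default_mode).to_lowercase().as_str() {
        "exhaustive" | "multi" | "apertures" => ExtractMode::Exhaustive,
        "deep" | "sequential" | "iterative" => ExtractMode::Deep {
            // Default of 3 passes, clamped into 1..=10.
            passes: passes.unwrap_or(3).clamp(1, 10),
        },
        _ => ExtractMode::Single,
    }
}
```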

3.4 Semantic-claim ingest

Each surviving fact (post-dedup) is written via the mem:module/semantic-claim module. The ingest is per-fact and sequential; there is no batch. Progress is logged every 5 s with ingest progress lines (239/697 ingested, 0 errors). Errors are accumulated but do not abort the run — a failed fact gets logged and the loop continues.

Each semantic claim ingest produces a row at the substrate level, anchored to episodic_record.record_iri, typed by the holder’s overlay, and bitemporal — visible from now() onward through the substrate’s tx_time machinery.


4. extract_deep — the orchestrator

The full function is ~140 lines at crates/donto-memory-core/src/extract.rs:507. The control flow:

extract_deep(text, holder, session_id, source_record_iri, images, passes)
│
│   seen: BTreeSet<content_key>  — global dedup
│   all_facts: Vec<ExtractedFact>  — running fact list
│   pass_yields: Vec<ApertureYield>  — per-pass audit
│   merged_usage: ChatUsage  — accumulated tokens
│
└─► for pass_n in 1..=passes {
        prior_block = if all_facts.is_empty() {
            None
        } else {
            Some(format_prior_facts_block(&all_facts))   ← see §5
        }

        call_one_with_context(SINGLE_PROMPT, pass_id, text, ..., prior_block)
        │
        ├─► on Ok(yield):
        │      for fact in yield.facts:
        │          fact.aperture = Some(pass_id)         ← authoritative pass label
        │          key = sha256(subject | predicate | object_iri_or_lit)
        │          if seen.insert(key):
        │              all_facts.push(fact); added += 1
        │          else:
        │              dedup_collisions += 1; collided += 1
        │      merge usage; log "deep pass complete"
        │
        └─► on Err(e):
               pass_yields.push(ApertureYield { error: Some(e) });
               log "deep pass failed"
               continue with next pass  ← failure of one pass does not abort
    }

The orchestrator does not retry failed passes. A pass that fails (e.g. pass_2 prose-not-JSON on the Nietzsche run) contributes zero facts and the loop moves on. This is intentional — the cost of a wasted pass is bounded — but as recommended in the previous report, a single retry on JSON parse failure would recover ~14% of capacity on a 7-pass run.
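
The recommended single retry could look roughly like the wrapper below. It is a sketch of the proposal, not existing code: PassError and call_with_one_retry are hypothetical names, and the closure stands in for one call_one_with_context invocation.

```rust
/// Hypothetical error type: only the distinction the retry needs.
#[derive(Debug, PartialEq)]
pub enum PassError {
    /// The model answered, but the body was not parseable JSON.
    JsonParse(String),
    /// Anything else (transport, HTTP status, ...). Not retried here.
    Other(String),
}

/// Runs `call` once, and once more only if the first attempt failed with
/// a JSON parse error. `call` stands in for one LLM pass.
pub fn call_with_one_retry<T, F>(mut call: F) -> Result<T, PassError>
where
    F: FnMut() -> Result<T, PassError>,
{
    match call() {
        Err(PassError::JsonParse(_)) => call(),
        other => other,
    }
}
```

Retrying only on parse failures keeps the worst case bounded at one extra call per pass, and leaves transport errors on the existing skip-and-continue path.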


5. The prior-facts block

This is the only piece of the prompt that varies between passes. It’s prepended to the user prompt as a single block, formatted like:

Earlier passes over this same chunk already extracted the facts below.
Your job in this pass is to find EVERY remaining fact the previous passes
missed. Do NOT repeat anything in the list — content-hash dedup will drop
repeats anyway, so your job is pure novelty. Push harder: deeper inferences,
unstated assumptions, additional entities (including abstract/conceptual
ones, time/place anchors, counterfactuals), alternate framings,
finer-grained properties, temporal and spatial nuance, causal and dependency
links, contrastive readings, parts of named entities, generic-class facts
("X is a Y", "Y has property Z"), metalinguistic facts about the utterance
itself (sentence count, mood, register, sentiment, politeness, addressee,
speech act), pragmatic implicatures, conventional implicatures, scalar
implicatures, conversational maxims, intent, plan, prerequisite,
consequence, related concepts in the same domain, related practitioners,
related tools/standards/formats, the user's evident expertise level, the
user's evident emotional state, the user's evident workflow, the user's
evident dependencies, the user's evident substitutes-avoided, the user's
evident counterfactual world ("would be lost without X"), domain knowledge
implied. Aim for 30-60+ NEW facts in this pass. Repeat content will be
dropped — your incentive is breadth + novelty. Only return {"facts": []}
if you genuinely cannot think of one more angle.

ALREADY EXTRACTED (subject | predicate | object):
- discord:user:ajaxdavis | rdf:type | donto:DiscordUser
- discord:channel:donto | rdf:type | donto:DiscordChannel
- donto:Song | rdf:type | donto:ArtisticWork
- ...

Key behaviours:


6. call_one_with_context — the LLM call

The function builds an OpenAI-compatible chat-completion request body:

{
  "model": "z-ai/glm-5",
  "temperature": 0.2,
  "max_tokens": 8000,
  "response_format": { "type": "json_object" },
  "messages": [
    { "role": "system", "content": SINGLE_PROMPT },
    { "role": "user", "content":
        "holder: agent:omega-bot\n" +
        "session_id: discord:...\n" +
        "source_record_iri: donto:record:...\n\n" +
        prior_facts_block +               ← only present pass 2+
        "chunk:\n" + text +
        "\n\n" + COMMON_FRAGMENT           ← JSON schema reminder
    }
  ]
}

Notes:

The response is parsed into ChatCompletion { choices, usage, model }. The choices[0].message.content string is parsed as { "facts": [ ExtractedFact, ... ] }.


7. JSON salvage

max_tokens: 8000 is a hard truncation ceiling, and the model exhausts it on roughly half of all passes. When the JSON is structurally invalid (because the closing ] and } got cut off), the orchestrator does not throw away the entire pass. Instead:

1. Try strict JSON parse.
2. On failure, walk forward through the string looking for the
   "facts": [ marker, then scan element by element using a small
   bracket-balance state machine.
3. For each well-formed object found before the EOF point, parse it
   individually as ExtractedFact.
4. Discard the malformed tail (the last partial fact).
5. Return the salvaged Vec<ExtractedFact> with the original raw count,
   logged as `WARN  LLM JSON truncated; recovered partial facts`.

Empirically this saves enormous amounts of pass yield. On cat-is-red’s pass_1, the model output truncated at position 1:1 (an extreme case: almost the whole output was malformed prose-prefix) and the salvager still recovered 10 facts. On the Nietzsche pass_4 the salvager recovered 204 well-formed facts from the truncated output, 202 of which survived dedup. Without this path each truncation would discard an entire 100+-fact pass.

The salvager is unit-tested against synthetic truncation cases (see assert_eq!(out.len(), 2, ...) at extract.rs:1237).
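
The scanning part of the salvage steps above can be sketched with the standard library alone. This is an illustrative reimplementation, not the code in extract.rs: salvage_fact_objects is a hypothetical name, and it returns the raw text of each complete object rather than parsed ExtractedFact values (parsing each slice individually, for instance with a JSON library, would be the next step).

```rust
/// Returns the source slices of every complete top-level `{...}` object
/// inside the "facts" array, dropping a truncated tail if there is one.
pub fn salvage_fact_objects(raw: &str) -> Vec<&str> {
    let mut out = Vec::new();
    // Locate the "facts" key, then the opening bracket of its array.
    let Some(key_at) = raw.find("\"facts\"") else { return out };
    let Some(open_rel) = raw[key_at..].find('[') else { return out };
    let body_start = key_at + open_rel + 1;

    let mut depth = 0usize; // brace/bracket nesting inside the array
    let mut in_string = false; // inside a JSON string literal
    let mut escaped = false; // previous char was a backslash in a string
    let mut obj_start = None; // byte offset of the current object's `{`

    for (i, c) in raw[body_start..].char_indices() {
        let at = body_start + i;
        if in_string {
            match c {
                _ if escaped => escaped = false,
                '\\' => escaped = true,
                '"' => in_string = false,
                _ => {}
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' | '[' => {
                if depth == 0 && c == '{' {
                    obj_start = Some(at);
                }
                depth += 1;
            }
            '}' | ']' => {
                if depth == 0 {
                    break; // closing `]` of the facts array
                }
                depth -= 1;
                if depth == 0 {
                    // A complete element; keep it only if it is an object.
                    if let Some(start) = obj_start.take() {
                        out.push(&raw[start..=at]);
                    }
                }
            }
            _ => {}
        }
    }
    // An unfinished object never returns to depth 0, so it is never pushed.
    out
}
```

Tracking string state matters here: a `}` inside a literal value must not close the object, otherwise a fact whose object text contains braces would be cut short.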


8. Dedup — content-key hashing

After each pass the orchestrator computes a SHA256 of the fact’s content tuple:

content_key = SHA256(
    subject.bytes
    | 0x1f
    | predicate.bytes
    | 0x1f
    | (object_iri OR JSON-serialised object_lit).bytes
)

The key is then inserted into the seen set with BTreeSet::insert(key); first write wins. Confidence, modality, hypothesis_only, aperture label, and notes are deliberately excluded from the key so that a second pass restating a known fact with higher confidence still collides (we keep the earlier copy with its lower confidence).
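
A std-only sketch of the first-write-wins behaviour follows. The Fact struct is a hypothetical, trimmed-down stand-in for ExtractedFact, and the SHA-256 step is deliberately left out (it needs a hashing crate); the set is keyed on the pre-hash bytes instead, which deduplicates identically.

```rust
use std::collections::BTreeSet;

/// Hypothetical, trimmed-down fact: only the fields that matter for dedup,
/// plus `confidence` to show that it is ignored.
pub struct Fact {
    pub subject: String,
    pub predicate: String,
    pub object_iri: Option<String>,
    pub object_lit_json: Option<String>, // literal, already JSON-serialised
    pub confidence: f32,
}

/// Pre-image of the content key: subject, predicate and object joined by
/// the 0x1f unit separator. Hashing this with SHA-256 is omitted here.
pub fn content_key_preimage(f: &Fact) -> Vec<u8> {
    let object = f
        .object_iri
        .as_deref()
        .or(f.object_lit_json.as_deref())
        .unwrap_or("");
    let mut key = Vec::new();
    key.extend_from_slice(f.subject.as_bytes());
    key.push(0x1f);
    key.extend_from_slice(f.predicate.as_bytes());
    key.push(0x1f);
    key.extend_from_slice(object.as_bytes());
    key
}

/// First-write-wins dedup: keeps the first fact seen for each key and
/// returns the survivors together with the collision count.
pub fn dedup(facts: Vec<Fact>) -> (Vec<Fact>, usize) {
    let mut seen = BTreeSet::new();
    let mut kept = Vec::new();
    let mut collisions = 0;
    for fact in facts {
        if seen.insert(content_key_preimage(&fact)) {
            kept.push(fact);
        } else {
            collisions += 1;
        }
    }
    (kept, collisions)
}
```

The separator byte keeps ("ab", "c") and ("a", "bc") from producing the same key, which plain concatenation would not.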

Limitations of string-key dedup:

These are not bugs of the dedup function — they’re a known limit of going content-key-only. The standing recommendation is a semantic-dedup pass at the end, before substrate ingest.


9. Token + cost accounting

When the LLM endpoint returns usage in the response body, the orchestrator merges it into merged_usage. The substrate stores the final tally in the audit row’s columns:

| Column | Source | Notes |
|---|---|---|
| prompt_tokens | sum of usage.prompt_tokens across passes | accurate when endpoint returns usage |
| completion_tokens | sum of usage.completion_tokens across passes | on Z.AI endpoint, this is an estimate, currently passes_succeeded × max_tokens |
| total_tokens | sum | accurate when individual sums are accurate |
| model | top-level model field of the response (echoed by endpoint) | z-ai/glm-5-20260211 for current runs |

The “completion_tokens looks like an upper bound” footnote in the cost analysis is exactly this — when the endpoint omits usage.completion_tokens, we fall back to max_tokens × successful_passes, which is a pessimistic estimate.
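
The fallback amounts to a one-line estimate. The sketch below is illustrative only: the function names are hypothetical, and the per-million-token prices are caller-supplied placeholders rather than actual GLM-5 prices.

```rust
/// Completion tokens for a run: the endpoint's own figure when it reports
/// one, otherwise the pessimistic ceiling passes_succeeded * max_tokens.
pub fn completion_tokens(reported: Option<u64>, passes_succeeded: u64, max_tokens: u64) -> u64 {
    reported.unwrap_or(passes_succeeded * max_tokens)
}

/// Worst-case cost in USD. Prices are per million tokens and are
/// placeholders supplied by the caller.
pub fn worst_case_cost_usd(
    prompt_tokens: u64,
    completion_tokens: u64,
    usd_per_m_in: f64,
    usd_per_m_out: f64,
) -> f64 {
    (prompt_tokens as f64 * usd_per_m_in + completion_tokens as f64 * usd_per_m_out)
        / 1_000_000.0
}
```

With max_tokens at 8000, six successful passes give a ceiling of 48,000 completion tokens and seven give 56,000.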

Pricing (OpenRouter z-ai/glm-5)

Empirical per-run cost

| Job | Input | Prompt tk | Completion tk* | Cost (worst case) | Facts | $/fact |
|---|---|---|---|---|---|---|
| Nietzsche, 7 passes (6 succeeded) | 109 words | 33,644 | 48,000* | $0.112 | 1000 | $0.000113 |
| cat is red, 7 passes (7 succeeded) | 3 words | 45,839 | 56,000* | $0.135 | 697 | $0.000194 |

* completion_tokens estimated at 8000 × passes_succeeded; the real value is likely 60–80% of that.

A 7-pass deep run costs roughly $0.11–0.14 per message at worst-case GLM-5 prices. The cost is dominated by output tokens (~85% of total). Reducing passes for short messages, raising max_tokens (so fewer passes truncate-and-redo), and caching the prior-facts prefix are the three biggest cost levers.


10. Audit log — donto_x_memory_job_log

Every memorize and recall touches this table. The deep mode pipeline writes two rows per request:

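-- Row 1: "queued" placeholder, written immediately after the defer decision.
-- :name tokens are bind parameters.
INSERT INTO donto_x_memory_job_log
    (job_id, endpoint, status_code, elapsed_ms,
     request, response, holder, session_id)
VALUES
    (:job_id, 'POST /memorize (queued)', 202, 0,
     :request,    -- full request body
     :response,   -- {status: queued, queue_id: ...}
     :holder, :session_id);

-- Row 2: "async" final row, written when memorize_one returns.
INSERT INTO donto_x_memory_job_log
    (job_id, endpoint, status_code, elapsed_ms,
     request, response, facts_extracted, facts_ingested,
     model, prompt_tokens, completion_tokens, total_tokens,
     error, holder, session_id)
VALUES
    (:job_id, 'POST /memorize (async)',
     :status_code,   -- 200 on success, 500 on failure
     :elapsed_ms,    -- final elapsed time
     :request,       -- full request body
     :response,      -- full response body
     :facts_extracted, :facts_ingested,
     :model, :prompt_tokens, :completion_tokens, :total_tokens,
     :error,         -- NULL on success, message on failure
     :holder, :session_id);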

If the binary restarts mid-run, the startup path stamps any orphaned (queued) rows with endpoint POST /memorize (lost) and status_code=500 so they don’t appear “still running” forever. This is the marker we saw at 14:44:58 in this session.

Sync-mode requests write one row with endpoint POST /memorize (sync) or POST /memorize depending on path.

The audit table is the source of truth for the /jobs index and /jobs/<id> detail page.


11. The /jobs UI surface

There are three routes:

| Route | Purpose |
|---|---|
| GET /jobs | Paginated index of recent jobs (status, route, holder, text preview, elapsed) |
| GET /jobs/<id> | HTML detail page: request payload, response payload, per-pass yields, per-fact table |
| GET /jobs/<id>/raw | JSON detail (same shape as the HTML page’s underlying data) |

Detail-page features specific to deep mode:

The operator endpoint /jobs was previously gated by DONTO_MEMORY_OPS_TOKEN. The token is currently set to empty in /etc/donto-memory/env, so the page is publicly browsable. Anyone with the URL can read every memorize that has ever happened. This is acceptable for the current single-user deployment but will need a real auth model before multi-tenant use.


12. What lands in the substrate

Per memorize, the substrate receives:

  1. One episodic record (donto:record:<uuid>) typed via mem:module/episodic. The full text + modality + holder + session_id. Bitemporal, visible from now().
  2. N semantic-claim records (one per fact, post-dedup, post-substrate-policy-gating) typed via mem:module/semantic-claim. Each carries:

The substrate’s policy gate can reject a fact (e.g. predicate-domain violation, identity policy violation). The orchestrator counts these as errors in the ingest progress log and the audit row’s facts_ingested will be lower than facts_extracted. On the Nietzsche run this gap was 2 facts (998/1000); on the cat-is-red run it was 0.

The substrate also handles federation, sleep-path reconsolidation, and the trust kernel — none of which are deep-mode-specific. They apply to every memorize regardless of mode.


13. Recall — the other side

Deep mode is a write mode. The complementary read mode is POST /recall, which:

  1. Embeds the query.
  2. Runs Reciprocal Rank Fusion across multiple retrievers (BM25, vector, IRI-prefix, holder).
  3. Filters by identity lens (which subset of statements is this holder allowed to see).
  4. Returns a ranked list with provenance back to the originating episodic record.
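
Step 2 can be illustrated with the textbook form of Reciprocal Rank Fusion. This is a generic sketch, not the recall implementation: the rrf function name and the fact ids are hypothetical, and the document does not state which constant k the service uses (60 is the value commonly used in the literature).

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over retrievers of 1 / (k + rank),
/// with rank starting at 1. Returns ids sorted by descending fused score.
pub fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Sort by score descending; break ties by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because only ranks enter the score, retrievers with incomparable raw scores (BM25 weights, cosine similarities, prefix matches) can be fused without normalisation.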

The more facts deep mode ingests per message, the higher recall precision should be for queries that touch those facts, provided dedup and identity resolution work properly. Both runs in this analysis were writes; we have not yet exercised recall against either output corpus, so this remains untested.


14. Observed empirics so far

Two end-to-end deep-mode runs at production parameters (passes=7, modality=descriptive, holder=agent:omega-bot):

Run A — Nietzsche / Eternal Recurrence (109 words)

| Pass | Raw | New unique | Collisions | Elapsed | Notes |
|---|---|---|---|---|---|
| 1 | 132 | 132 | 0 | 124 s | clean cold start |
| 2 | FAIL | 0 | 0 | <1 s | prose-not-JSON |
| 3 | 202 | 202 | 0 | 62 s | prior-facts redirect; fastest pass |
| 4 | 204 | 202 | 2 | 127 s | truncated, salvaged |
| 5 | 159 | 157 | 2 | 109 s | |
| 6 | 158 | 156 | 2 | 142 s | |
| 7 | 158 | 151 | 7 | 124 s | highest collision rate |
| Σ | 1013 | 1000 | 13 | 854 s | 998 ingested |

Run B — “cat is red” (3 words)

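| Pass | Raw | New unique | Collisions | Elapsed | Notes |
|---|---|---|---|---|---|
| 1 | 10 | 10 | 0 | 46 s | early truncation, salvaged |
| 2 | 121 | 121 | 0 | 112 s | prior-facts redirect |
| 3 | 150 | 42 | 108 | 100 s | 72% collision — model stuck |
| 4 | 127 | 127 | 0 | 105 s | re-energised |
| 5 | 139 | 139 | 0 | 105 s | re-energised |
| 6 | 103 | 94 | 9 | 96 s | |
| 7 | 165 | 164 | 1 | 120 s | |
| Σ | 815 | 697 | 118 | 703 s | 697 ingested |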

Key contrasts

| Metric | Run A | Run B |
|---|---|---|
| facts/input-word | 9.2 | 232 |
| cost | $0.112 | $0.135 |
| %-truncations (passes truncated) | ~30% | ~86% |
| pass_2 outcome | prose-fail | 121 unique |
| substrate rejections | 2/1000 | 0/697 |

The non-linearity in facts/word (3-word input produces 25× more facts per word than the 109-word input) is the headline finding from the empirics: the model’s elaboration capacity is decoupled from the input. Deep mode on tiny inputs is largely hallucinating consistent ontology. This is a feature for some use cases (priming recall on common concepts) and a bug for others (signal-to-noise in the substrate).


15. Known issues & quirks

In rough priority order:

  1. No retry on JSON parse failure. A single prose-not-JSON failure forfeits the entire pass (14% capacity loss on a 7-pass run).
  2. max_tokens: 8000 is too low. Most passes truncate; salvage works but loses the malformed tail. Recommend 12000.
  3. Completion-token usage isn’t reported by Z.AI endpoint. Cost accounting uses passes_succeeded × max_tokens as an upper-bound estimate, so the logged figure overstates real cost and should be read as a per-message cost ceiling.
  4. String-key dedup misses semantic duplicates. (Cat, rdf:type, Animal) and (Cat, rdf:type, Mammal) both land. Identity collapse (discord:user:ajax vs donto:Ajax) similarly slips through.
  5. Prior-facts window is hard-capped at 300. Beyond 300 cumulative facts, older facts fall out of the model’s context and can be re-derived (then dedup-dropped). For 7-pass runs producing 700-1000 facts this is a meaningful blind spot.
  6. Queue is an in-memory Mutex, not a durable table. Restart loses in-flight + queued work. Startup marks orphaned (queued) rows as (lost) but they’re gone.
  7. Per-pass tracing visibility lives at info! level. If the binary’s RUST_LOG defaults to WARN, journalctl shows nothing useful during a run. Verified this session: the original Nietzsche run was silent in the journal because the level filter dropped info-level events.
  8. The format_prior_facts_block function has dead code (let take = facts.len().saturating_sub(facts.len().saturating_sub(300)); is a no-op, followed by let _ = take;). Cosmetic but should be cleaned up.
  9. No backpressure signal to clients. The 202 response gives a queue_id but no estimated_position or estimated_wait. With a single-permit Mutex and 7-pass runs taking 12–14 min, the 5th queued job will wait ~70 min with no signal.
  10. /jobs page is unauthenticated. Acceptable for single-user; needs auth before multi-tenant.
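
For reference, the expression quoted in the dead-code item computes min(len, 300), so a cleanup could replace it with a direct min if the value is ever put to use. A small check, with hypothetical function names wrapping the two forms:

```rust
/// The expression quoted in the dead-code item, lifted into a function.
pub fn take_as_written(len: usize) -> usize {
    len.saturating_sub(len.saturating_sub(300))
}

/// The same value, written directly.
pub fn take_simplified(len: usize) -> usize {
    len.min(300)
}
```

For len up to 300 the inner subtraction saturates to 0 and the result is len; above 300 it yields len - (len - 300) = 300.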

16. Roadmap (next concrete changes)

In order from cheapest-fastest-win to deepest:

  1. JSON-mode retry on parse failure — 5 lines of code, recovers the prose-not-JSON case.
  2. Bump max_tokens to 12000 — one constant change in build_request_body.
  3. Input-length-aware default passes on the bot side: passes = clamp(2, ceil(words/25), 7). Saves 70-80% on short messages.
  4. Promote tracing level filter — explicit RUST_LOG=info,donto_memory=info,donto_memory_core=info in the systemd unit.
  5. Prompt caching on the prior-facts prefix — Anthropic-style cache_control or OpenAI prefix caching. Cuts input cost ~60% on later passes.
  6. Semantic dedup post-pass — collapse class hierarchies + identity synonyms before substrate ingest. Saves storage + improves recall precision.
  7. Quality-grounding score per fact — cheap LLM pass scoring 0–1 for direct-support vs speculation; route below-threshold facts to a candidate overlay.
  8. Real queue table: donto_x_memory_extract_queue with FOR UPDATE SKIP LOCKED, multiple workers, restart-safe.
  9. Per-pass temperature ramp — start at 0.2 for pass 1, climb to 0.6 for pass 7. Push novelty harder where novelty is the explicit goal.
  10. Identity-resolution module — proper alias graph in the substrate so discord:user:ajaxdavis ↔︎ donto:AjaxDavis ↔︎ donto:Reader collapse at query time.
  11. Multi-tenant auth on /jobs and /memorize.
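
The input-length-aware default from the roadmap can be written as a short helper. This is a sketch of the proposed formula only: default_passes is a hypothetical name, and how words are counted on the bot side is left open.

```rust
/// passes = clamp(2, ceil(words / 25), 7), using integer arithmetic.
pub fn default_passes(words: usize) -> u32 {
    // Integer ceiling of words / 25.
    let by_length = (words + 24) / 25;
    // Rust's clamp takes (min, max) on the value being clamped.
    by_length.clamp(2, 7) as u32
}
```

Under this formula the 3-word "cat is red" input would get 2 passes instead of 7, and the 109-word Nietzsche input would get 5.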

17. File map

If you’re reading source:

| Concern | File | Lines |
|---|---|---|
| Deep orchestrator | donto-memory-core/src/extract.rs | 507–646 |
| Prior-facts block formatter | donto-memory-core/src/extract.rs | 961–1003 |
| Per-call HTTP + salvage | donto-memory-core/src/extract.rs | 676–870 |
| SINGLE_PROMPT constant | donto-memory-core/src/extract.rs | 863–872 |
| Content-key dedup | donto-memory-core/src/extract.rs | 112–135 (content_key) |
| Route handler + dispatch | donto-memory/src/api/routes/memorize.rs | 100–340 |
| memorize_one inner engine | donto-memory/src/api/routes/memorize.rs | 333–560 |
| Ingest progress logging | donto-memory/src/api/routes/memorize.rs | 497–540 |
| Async lock initialisation | donto-memory/src/main.rs | 106–133 |
| Job audit log table | migrations: *_donto_x_memory_job_log.sql | |
| Job detail page rendering | donto-memory/src/api/routes/jobs.rs | |

Operator endpoints in production:


18. Glossary


Related reports: