Date: 2026-05-31
Scope: Everything the mode: "deep" extraction pipeline does, from a
POST /memorize request to the rows that land in the
substrate. Covers code paths, prompts, dedup, salvage, token usage, the
async queue, the audit log, the job-detail surface, observed empirics,
and the known limits.
This document complements the empirical case study at /research/donto-deep-mode-eternal-recurrence-2026-05-31.html.
Where that report reads results, this one explains the machinery that
produced them.
mode: "deep" is the iterative-novelty
extraction lane inside donto-memory. Where mode: "single"
does one LLM call with a maximalist prompt (~30 facts target) and
mode: "exhaustive" does five parallel calls under different
rhetorical “apertures” (surface / linguistic / presupposition /
inferential / conceivable), mode: "deep" does N
sequential calls of the same prompt, each shown a list
of the facts the prior passes already produced. The model’s task each
pass is only to find things the earlier passes missed.
The design intent:
- Depth is set by the operator through the passes param (default 3, currently 7
from omega-bot). There is no automatic saturation detection; the operator
decides how hard to push.
Configuration is a single HTTP knob:
{
  "mode": "deep",
  "passes": 7,
  "modality": "descriptive",
  "holder": "agent:omega-bot",
  "session_id": "discord:1349727923434815519:1497274794586931220",
  "text": "...",
  "images": []
}
Synonyms: "sequential" and "iterative" route
to the same code path.
omega-bot
  │  POST https://memories.apexpots.com/memorize
  ▼
caddy (TLS, gzip)
  │
  ▼
donto-memory-api (127.0.0.1:7900, axum)
  │
  ├─► should_defer(req, default_mode)?
  │     true if mode ∈ {deep, exhaustive, sequential, iterative, multi, apertures}
  │     OR req.r#async == Some(true)
  │
  │   ── deferred path ──
  ├─► write "POST /memorize (queued)" audit row (immediate)
  ├─► return HTTP 202 {status: "queued", queue_id, ...} (immediate)
  ├─► spawn tokio task
  │     │
  │     ▼
  │   acquire AppState.async_memorize_lock (single-permit Mutex)
  │     │
  │     ▼
  │   memorize_one(...) ← see §3
  │     │
  │     ▼
  │   write "POST /memorize (async)" audit row with final stats
  │
  │   ── sync path ──
  └─► memorize_one(...) inline, return 200 with full body

Cloudflare in front of memories.apexpots.com cuts
proxied HTTP at ~100 s. The Nietzsche run took 854 s end-to-end; deep
mode cannot be served synchronously over a Cloudflare-fronted endpoint
without 524-ing. The deferred path returns 202 immediately so the client
never blocks, and the actual work runs to completion in the
background.
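The defer decision in the diagram can be sketched as a small predicate. This is a minimal illustration, not the handler's actual code: the trimmed `MemorizeReq` struct is hypothetical, and lowercasing the mode before matching is an assumption carried over from how the extraction dispatcher is described.

```rust
/// Modes that always take the deferred (queued) path, per the request-flow diagram.
const DEFERRED_MODES: [&str; 6] = [
    "deep", "exhaustive", "sequential", "iterative", "multi", "apertures",
];

/// Hypothetical, trimmed-down request: only the two fields the decision needs.
pub struct MemorizeReq {
    pub mode: Option<String>,
    pub r#async: Option<bool>,
}

/// Returns true when the request should get an immediate 202 and run in the background.
/// Assumption: the mode is lowercased before matching.
pub fn should_defer(req: &MemorizeReq, default_mode: &str) -> bool {
    let mode = req.mode.as_deref().unwrap_or(default_mode).to_lowercase();
    let slow_mode = DEFERRED_MODES.iter().any(|m| *m == mode.as_str());
    slow_mode || req.r#async == Some(true)
}
```

A request with mode "single" and async set to true is still deferred, which lets a client opt in to the 202 path even for a single-call extraction.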
async_memorize_lock: Arc<tokio::sync::Mutex<()>>
lives on the AppState. Each spawned task acquires it before
running memorize_one. The lock is intentionally narrow — it
serialises deep extractions, not all memorize traffic,
because:
The cost: the queue is a Mutex, not a real queue. Multiple queued tasks
pile up as parked futures inside the running process, so restarting
the binary loses all in-flight and queued work. We’ve seen this
twice: the binary was restarted at 14:44:58 mid-Nietzsche-rerun and
again later, leaving 2 orphaned (queued) rows that were marked
(lost) in the audit log. The startup path does the right
thing here: it stamps surviving (queued) rows with a
(lost) endpoint label and status_code=500 so
they don’t ghost in the job list.
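The startup stamping can be illustrated with an in-memory stand-in for the audit table. This is a sketch under stated assumptions: `JobRow` is a hypothetical struct, and treating a (queued) row as orphaned when no (async) row shares its `job_id` is a guess at the matching rule, which this document does not spell out.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for a donto_x_memory_job_log row.
#[derive(Debug, Clone, PartialEq)]
pub struct JobRow {
    pub job_id: u64,
    pub endpoint: String,
    pub status_code: u16,
}

const QUEUED: &str = "POST /memorize (queued)";
const ASYNC_DONE: &str = "POST /memorize (async)";
const LOST: &str = "POST /memorize (lost)";

/// Stamp every queued row that never got a final row. Returns how many were stamped.
/// Assumption: the queued row and the final row of one request share a job_id.
pub fn stamp_lost(rows: &mut [JobRow]) -> usize {
    let finished: HashSet<u64> = rows
        .iter()
        .filter(|r| r.endpoint == ASYNC_DONE)
        .map(|r| r.job_id)
        .collect();
    let mut stamped = 0;
    for row in rows.iter_mut() {
        if row.endpoint == QUEUED && !finished.contains(&row.job_id) {
            row.endpoint = LOST.to_string();
            row.status_code = 500;
            stamped += 1;
        }
    }
    stamped
}
```

Running this once at startup, before any new request is accepted, keeps the job list free of rows that look permanently in progress.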
memorize_one: the inner engine
memorize_one(s: &AppState, req: &MemorizeReq)
  │
  ├─► §3.1 OCR (if images attached)
  ├─► §3.2 Episodic ingest (always: raw text → substrate)
  ├─► §3.3 Optional LLM extraction (single | exhaustive | deep)
  └─► §3.4 Semantic-claim ingest (per-fact ingest into substrate)
If req.images is non-empty and
s.settings.ocr_enabled, a separate LLM call is made
before extraction:
- Prompt: OCR_SYSTEM_PROMPT (transcribe every visible word, return {"transcripts": [...]} with one entry per image, in order).
- Parameters: temperature: 0.0, max_tokens: 4000, response_format: { type: "json_object" }.
- Output: each transcript is given an [OCR text from image #N] header and prepended to req.text.
The augmented effective_text becomes both the episodic
record body and the extractor input. An OCR failure is not
fatal: it logs a warning and proceeds with the original text. There are
no images in either of the runs analysed here, so the OCR path didn’t
run.
The raw text is always written as an episodic record, regardless of extraction mode. This is the substrate’s atomic “something happened” anchor.
let episodic_input = IngestInput {
    holder: req.holder,          // "agent:omega-bot"
    session_id: req.session_id,  // "discord:<guild>:<channel>"
    text: effective_text,        // raw + OCR
    modality: req.modality,      // "descriptive"
    ...
};
episodic.ingest(substrate, pool, consumer_iri, &episodic_input)
    → (episodic_record_id, episodic_record_iri)
The episodic record gets a stable IRI like
donto:record:<uuid> that is then handed to every
subsequent semantic-claim ingest as source_record_iri so
the provenance chain is
episodic_record ← semantic_claim ← fact.
req.mode (or s.settings.extract_mode if
unset) is lowercased and matched:
| Mode keyword | Function | Concurrency | Default usage |
|---|---|---|---|
| single (default) | extract_single | 1 call | omega-bot historic |
| exhaustive / multi / apertures | extract_exhaustive | 5 parallel | research/testing |
| deep / sequential / iterative | extract_deep | N sequential | omega-bot current |
Unrecognised modes fall through to single. Deep mode
also accepts req.passes (default 3, clamped to
1..=10).
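The dispatch rules above can be sketched as one function. The `ExtractMode` enum and the `resolve_mode` name are hypothetical and exist only for illustration; the keywords, the fall-through to single, the default of 3 passes, and the 1..=10 clamp come from this section.

```rust
/// Hypothetical enum naming the three extraction lanes.
#[derive(Debug, PartialEq)]
pub enum ExtractMode {
    Single,
    Exhaustive,
    Deep { passes: u32 },
}

/// Resolve the lane from the request mode (or the configured default) and
/// the optional passes parameter.
pub fn resolve_mode(
    req_mode: Option<&str>,
    default_mode: &str,
    req_passes: Option<u32>,
) -> ExtractMode {
    let mode = req_mode.unwrap_or(default_mode).to_lowercase();
    match mode.as_str() {
        "exhaustive" | "multi" | "apertures" => ExtractMode::Exhaustive,
        "deep" | "sequential" | "iterative" => ExtractMode::Deep {
            // Default 3, clamped to 1..=10.
            passes: req_passes.unwrap_or(3).clamp(1, 10),
        },
        // Unrecognised keywords fall through to single.
        _ => ExtractMode::Single,
    }
}
```

Under these rules a request asking for 25 passes silently runs 10, and a typo in the mode keyword silently runs a single-call extraction.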
Each surviving fact (post-dedup) is written via the
mem:module/semantic-claim module. The ingest is
per-fact and sequential; there is no batch. Progress is
logged every 5 s with ingest progress lines
(239/697 ingested, 0 errors). Errors are accumulated but do
not abort the run — a failed fact gets logged and the loop
continues.
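The per-fact ingest loop described above has a simple shape: try each fact, record failures, keep going, and log progress on a timer. The sketch below assumes facts are plain strings and takes the ingest step as a closure; `ingest_all` and `IngestStats` are hypothetical names, not the ones in memorize.rs.

```rust
use std::time::{Duration, Instant};

/// Hypothetical summary of one ingest run.
pub struct IngestStats {
    pub ingested: usize,
    pub errors: Vec<String>,
}

/// Ingest facts one by one. A failure is recorded and the loop continues.
pub fn ingest_all<F>(facts: &[String], mut ingest_one: F) -> IngestStats
where
    F: FnMut(&str) -> Result<(), String>,
{
    let total = facts.len();
    let mut stats = IngestStats { ingested: 0, errors: Vec::new() };
    let mut last_log = Instant::now();
    for fact in facts {
        match ingest_one(fact) {
            Ok(()) => stats.ingested += 1,
            Err(e) => stats.errors.push(e), // accumulate, do not abort
        }
        // Emit a progress line at most once every 5 seconds.
        if last_log.elapsed() >= Duration::from_secs(5) {
            println!(
                "ingest progress: {}/{} ingested, {} errors",
                stats.ingested,
                total,
                stats.errors.len()
            );
            last_log = Instant::now();
        }
    }
    stats
}
```

Because errors are collected rather than propagated, the caller can report facts_ingested and facts_extracted separately.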
Each semantic claim ingest produces a row at the substrate level,
anchored to episodic_record.record_iri, typed by the
holder’s overlay, and bitemporal — visible from now()
onward through the substrate’s tx_time machinery.
extract_deep: the orchestrator
The full function is ~140 lines at
crates/donto-memory-core/src/extract.rs:507. The control
flow:
extract_deep(text, holder, session_id, source_record_iri, images, passes)
  │
  │  seen: BTreeSet<content_key>       // global dedup
  │  all_facts: Vec<ExtractedFact>     // running fact list
  │  pass_yields: Vec<ApertureYield>   // per-pass audit
  │  merged_usage: ChatUsage           // accumulated tokens
  │
  └─► for pass_n in 1..=passes {
        prior_block = if all_facts.is_empty() {
          None
        } else {
          Some(format_prior_facts_block(&all_facts))   ← see §5
        }
        call_one_with_context(SINGLE_PROMPT, pass_id, text, ..., prior_block)
          │
          ├─► on Ok(yield):
          │     for fact in yield.facts:
          │       fact.aperture = Some(pass_id)   ← authoritative pass label
          │       key = sha256(subject | predicate | object_iri_or_lit)
          │       if seen.insert(key):
          │         all_facts.push(fact); added += 1
          │       else:
          │         dedup_collisions += 1; collided += 1
          │     merge usage; log "deep pass complete"
          │
          └─► on Err(e):
                pass_yields.push(ApertureYield { error: Some(e) });
                log "deep pass failed"
                continue with next pass   ← failure of one pass does not abort
      }
The orchestrator does not retry failed passes. A pass that fails (e.g. pass_2 prose-not-JSON on the Nietzsche run) contributes zero facts and the loop moves on. This is intentional — the cost of a wasted pass is bounded — but as recommended in the previous report, a single retry on JSON parse failure would recover ~14% of capacity on a 7-pass run.
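A single retry on parse failure could look like the wrapper below. This is a sketch of the recommendation, not code from extract.rs: `PassError` and `call_with_one_retry` are hypothetical names, and the call is modelled as a closure so the example stays free of HTTP details.

```rust
/// Hypothetical error type distinguishing parse failures from transport failures.
#[derive(Debug, PartialEq)]
pub enum PassError {
    JsonParse(String),
    Http(String),
}

/// Run `call` once; if it fails with a retryable error, run it exactly one more time.
pub fn call_with_one_retry<T, E, F, R>(mut call: F, is_retryable: R) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    R: Fn(&E) -> bool,
{
    match call() {
        Err(e) if is_retryable(&e) => call(),
        other => other,
    }
}
```

Retrying only on `PassError::JsonParse` keeps the worst case bounded at one extra call per pass and leaves transport errors to fail fast as they do today.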
The prior-facts block is the only piece of the prompt that varies between passes. It is prepended to the user prompt as a single block, formatted like:
Earlier passes over this same chunk already extracted the facts below.
Your job in this pass is to find EVERY remaining fact the previous passes
missed. Do NOT repeat anything in the list — content-hash dedup will drop
repeats anyway, so your job is pure novelty. Push harder: deeper inferences,
unstated assumptions, additional entities (including abstract/conceptual
ones, time/place anchors, counterfactuals), alternate framings,
finer-grained properties, temporal and spatial nuance, causal and dependency
links, contrastive readings, parts of named entities, generic-class facts
("X is a Y", "Y has property Z"), metalinguistic facts about the utterance
itself (sentence count, mood, register, sentiment, politeness, addressee,
speech act), pragmatic implicatures, conventional implicatures, scalar
implicatures, conversational maxims, intent, plan, prerequisite,
consequence, related concepts in the same domain, related practitioners,
related tools/standards/formats, the user's evident expertise level, the
user's evident emotional state, the user's evident workflow, the user's
evident dependencies, the user's evident substitutes-avoided, the user's
evident counterfactual world ("would be lost without X"), domain knowledge
implied. Aim for 30-60+ NEW facts in this pass. Repeat content will be
dropped — your incentive is breadth + novelty. Only return {"facts": []}
if you genuinely cannot think of one more angle.
ALREADY EXTRACTED (subject | predicate | object):
- discord:user:ajaxdavis | rdf:type | donto:DiscordUser
- discord:channel:donto | rdf:type | donto:DiscordChannel
- donto:Song | rdf:type | donto:ArtisticWork
- ...
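A block like the one above could be assembled from the running fact list as follows. This is a minimal sketch: the trimmed `ExtractedFact` struct and the `prior_facts_block` name are hypothetical, the instruction text is passed in as `header`, and the 300-fact window mirrors the `facts.len().saturating_sub(300)` expression this document quotes for the real formatter.

```rust
/// Hypothetical, trimmed-down fact: only the three fields the block prints.
pub struct ExtractedFact {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// Only the most recent facts are listed, to bound prompt size.
const PRIOR_FACTS_WINDOW: usize = 300;

/// Build the block shown to pass 2 and later. Returns None on the first pass,
/// when there is nothing to list yet.
pub fn prior_facts_block(facts: &[ExtractedFact], header: &str) -> Option<String> {
    if facts.is_empty() {
        return None;
    }
    let start = facts.len().saturating_sub(PRIOR_FACTS_WINDOW);
    let mut block = String::from(header);
    block.push_str("\n\nALREADY EXTRACTED (subject | predicate | object):\n");
    for f in &facts[start..] {
        block.push_str(&format!("- {} | {} | {}\n", f.subject, f.predicate, f.object));
    }
    Some(block)
}
```

With 533 accumulated facts the block lists only the last 300, so anything extracted before that window is invisible to the model on the next pass.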
Key behaviours:
- Only the most recent 300 facts are listed
(start = facts.len().saturating_sub(300)). On the 7-pass cat-is-red run,
which finished with 697 facts, pass 7 started with 533 accumulated facts and
so saw only facts 234–533. Earlier facts fall out of the prompt. This is a
deliberate cap to bound prompt size, but it means the model can re-derive a
fact already extracted in pass 2 if it didn’t make the cutoff for pass 7.
- Aim for 30-60+ NEW facts in this pass
is a soft target. Pass 4 on the Nietzsche run actually delivered 204;
pass 3 on cat-is-red delivered only 42 unique (108/150 collisions).
- An empty result is allowed ({"facts": []}). The model is told it’s a legitimate output
if it really has nothing left. In practice it never returns this: the
model finds something to extract even when the input is 3
words.
call_one_with_context: the LLM call
The function builds an OpenAI-compatible chat-completion request body:
{
  "model": "z-ai/glm-5",
  "temperature": 0.2,
  "max_tokens": 8000,
  "response_format": { "type": "json_object" },
  "messages": [
    { "role": "system", "content": SINGLE_PROMPT },
    { "role": "user", "content":
        "holder: agent:omega-bot\n" +
        "session_id: discord:...\n" +
        "source_record_iri: donto:record:...\n\n" +
        prior_facts_block +            ← only present pass 2+
        "chunk:\n" + text +
        "\n\n" + COMMON_FRAGMENT       ← JSON schema reminder
    }
  ]
}
Notes:
- temperature: 0.2 is a configurable default
(DONTO_MEMORY_LLM_TEMPERATURE). Low but not zero. We could
justify pushing this up to 0.5–0.7 for later passes specifically: the
model is being asked for novelty, which is exactly what temperature is
for.
- max_tokens: 8000 is hardcoded in the
request body. This is the choke point that produced the salvage cases
across both runs. Bumping to 12000 is the standing recommendation.
- response_format: { "type": "json_object" } is the
OpenAI-style JSON-mode hint. Z.AI’s GLM-5 honours it most of the time
but pass_2 on the Nietzsche run still returned prose.
- The HTTP client is a reqwest::Client with a 900-second
per-request timeout (bumped from the original 180 s after a
pass-1 hit elapsed_ms=180002 mid-Pandoc experiment). The
right limit is “longer than the worst observed pass plus margin”; 900 s
is comfortable.
- Images are sent as { "type": "image_url", "image_url": { "url": "..." } }
content parts on the user message, same shape as OpenAI vision. Deep
mode hasn’t been exercised with images yet.
The response is parsed into
ChatCompletion { choices, usage, model }. The
choices[0].message.content string is parsed as
{ "facts": [ ExtractedFact, ... ] }.
max_tokens: 8000 is a hard truncation ceiling, and the
model exhausts it on roughly half of all passes. When the JSON is
structurally invalid (because the closing ] and
} got cut off), the orchestrator does not throw
away the entire pass. Instead:
1. Try strict JSON parse.
2. On failure, walk forward through the string looking for the
"facts": [ marker, then scan element by element using a small
bracket-balance state machine.
3. For each well-formed object found before the EOF point, parse it
individually as ExtractedFact.
4. Discard the malformed tail (the last partial fact).
5. Return the salvaged Vec<ExtractedFact> with the original raw count,
logged as `WARN LLM JSON truncated; recovered partial facts`.
Empirically this saves a large share of pass yield. On cat-is-red’s pass_1, the strict parse failed at position 1:1 (an extreme case: the output opened with a malformed prose prefix) and the salvager still recovered 10 facts. On the Nietzsche pass_4 the salvager recovered 204 facts, 202 of them unique after dedup. Without this path each truncation would dump an entire 100+-fact pass.
The salvager has its own test
(assert_eq!(out.len(), 2, ...) at
extract.rs:1237) and is unit-tested against synthetic
truncation cases.
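Steps 2 to 4 of the salvage procedure can be sketched with a small bracket-balance scanner. This is an illustration of the technique, not the code in extract.rs: the function name is hypothetical, it returns the raw text of each complete object rather than parsed `ExtractedFact` values, and it skips the strict-parse attempt of step 1 (both would need a JSON library).

```rust
/// Return the complete top-level `{...}` elements of the "facts" array in a
/// possibly truncated JSON document. A partial trailing object is dropped.
pub fn salvage_fact_objects(raw: &str) -> Vec<&str> {
    let mut out = Vec::new();
    // Find the "facts" key, then the '[' that opens its array.
    let key_pos = match raw.find("\"facts\"") {
        Some(p) => p,
        None => return out,
    };
    let array_start = match raw[key_pos..].find('[') {
        Some(rel) => key_pos + rel + 1,
        None => return out,
    };

    let mut depth = 0usize; // nesting depth of {} and [] inside the array
    let mut in_string = false;
    let mut escaped = false;
    let mut obj_start: Option<usize> = None;

    for (i, c) in raw[array_start..].char_indices() {
        let pos = array_start + i;
        if in_string {
            // Brackets inside string literals must not affect the depth.
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' | '[' => {
                if depth == 0 && c == '{' {
                    obj_start = Some(pos);
                }
                depth += 1;
            }
            '}' | ']' => {
                if depth == 0 {
                    break; // the ']' that closes the facts array
                }
                depth -= 1;
                if depth == 0 {
                    if let Some(start) = obj_start.take() {
                        out.push(&raw[start..=pos]);
                    }
                }
            }
            _ => {}
        }
    }
    out
}
```

Each returned slice is a balanced object that can be handed to a JSON parser on its own, so one malformed or truncated element costs one fact, not the whole pass.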
For each fact a pass returns, the orchestrator computes a SHA-256 of the fact’s content tuple:
content_key = SHA256(
    subject.bytes
    | 0x1f
    | predicate.bytes
    | 0x1f
    | (object_iri OR JSON-serialised object_lit).bytes
)
The key is then inserted into the BTreeSet (seen.insert(key)); first write wins.
Confidence, modality, hypothesis_only, aperture label, and notes are
deliberately excluded from the key so that a second pass restating a
known fact with higher confidence still collides (we keep the earlier
copy with its lower confidence).
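The first-write-wins behaviour can be shown with a std-only sketch. Two simplifications are assumptions of this example, not of the real code: the key is the raw separator-joined bytes instead of their SHA-256 (the standard library has no SHA-256), and `object_lit` is taken as an already serialised string. The struct is a hypothetical, trimmed `ExtractedFact`.

```rust
use std::collections::BTreeSet;

/// Hypothetical, trimmed-down fact with only the fields this example needs.
pub struct ExtractedFact {
    pub subject: String,
    pub predicate: String,
    pub object_iri: Option<String>,
    pub object_lit: Option<String>,
    pub confidence: f32,
}

/// Build the dedup key from subject, predicate, and object only.
/// Confidence is deliberately left out of the key.
pub fn content_key(f: &ExtractedFact) -> Vec<u8> {
    let object = f
        .object_iri
        .as_deref()
        .or(f.object_lit.as_deref())
        .unwrap_or("");
    let mut key = Vec::new();
    key.extend_from_slice(f.subject.as_bytes());
    key.push(0x1f); // unit separator between fields
    key.extend_from_slice(f.predicate.as_bytes());
    key.push(0x1f);
    key.extend_from_slice(object.as_bytes());
    key
}

/// Merge one pass into the running list. Returns (added, collided).
pub fn merge_pass(
    seen: &mut BTreeSet<Vec<u8>>,
    all_facts: &mut Vec<ExtractedFact>,
    pass: Vec<ExtractedFact>,
) -> (usize, usize) {
    let (mut added, mut collided) = (0, 0);
    for fact in pass {
        if seen.insert(content_key(&fact)) {
            all_facts.push(fact);
            added += 1;
        } else {
            collided += 1; // earlier copy wins, even if this one is more confident
        }
    }
    (added, collided)
}
```

A restatement with confidence 0.9 collides with the 0.6 original and is dropped, while a fact that differs only in its object IRI produces a different key and is kept.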
Limitations of string-key dedup:
- discord:user:ajaxdavis and donto:AjaxDavis are
different keys; both land. Same for rdf:type vs
donto:isA. This is the identity-collapse phenomenon called
out in the case-study report.
- (donto:Cat, rdf:type, donto:Animal) and
(donto:Cat, rdf:type, donto:DomesticAnimal) are both kept.
This shows up as suspiciously-zero collision rates in passes where the
model is just generating finer-grained type assertions of already-known
entities.
- donto:requires vs donto:needs look distinct to
the hash.
These are not bugs of the dedup function; they’re a known limit of going content-key-only. The standing recommendation is a semantic-dedup pass at the end, before substrate ingest.
When the LLM endpoint returns usage in the response
body, the orchestrator merges it into merged_usage. The
substrate stores the final tally in the audit row’s columns:
| Column | Source | Notes |
|---|---|---|
| prompt_tokens | sum of usage.prompt_tokens across passes | accurate when endpoint returns usage |
| completion_tokens | sum of usage.completion_tokens across passes | on Z.AI endpoint, this is an estimate, currently passes_succeeded × max_tokens |
| total_tokens | sum | accurate when individual sums are accurate |
| model | model field of the response (echoed by endpoint) | z-ai/glm-5-20260211 for current runs |
The “completion_tokens looks like an upper bound” footnote in the
cost analysis is exactly this — when the endpoint omits
usage.completion_tokens, we fall back to
max_tokens × successful_passes, which is a pessimistic
estimate.
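The fallback can be written as a per-pass default. This is a sketch with hypothetical struct and function names; it assumes the slice holds only the passes that succeeded, so a run where every pass omits completion_tokens sums to max_tokens × successful_passes, matching the rule above.

```rust
/// Accumulated usage across passes (field names follow the audit columns).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct ChatUsage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
}

/// Hypothetical per-pass usage; completion_tokens is None when the endpoint omits it.
pub struct PassUsage {
    pub prompt_tokens: u64,
    pub completion_tokens: Option<u64>,
}

/// Merge usage from the successful passes, substituting max_tokens for any
/// pass that did not report completion_tokens.
pub fn merge_usage(successful_passes: &[PassUsage], max_tokens: u64) -> ChatUsage {
    let mut merged = ChatUsage::default();
    for p in successful_passes {
        merged.prompt_tokens += p.prompt_tokens;
        merged.completion_tokens += p.completion_tokens.unwrap_or(max_tokens);
    }
    merged.total_tokens = merged.prompt_tokens + merged.completion_tokens;
    merged
}
```

With max_tokens at 8000, six successful passes that all omit the field produce 48,000 estimated completion tokens and seven produce 56,000.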
Cost per run (z-ai/glm-5)
| Job | Input | Prompt tk | Completion tk* | Cost (worst case) | Facts | $/fact |
|---|---|---|---|---|---|---|
| Nietzsche, 7 passes (6 succeeded) | 109 words | 33,644 | 48,000* | $0.112 | 1000 | $0.000113 |
| cat is red, 7 passes (7 succeeded) | 3 words | 45,839 | 56,000* | $0.135 | 697 | $0.000194 |
* completion_tokens estimated at
8000 × passes_succeeded; the real value is likely 60–80% of that.
A 7-pass deep run costs roughly $0.10–0.13 per message at worst-case GLM-5 prices. The cost is dominated by output tokens (~85% of total). Reducing passes for short messages, raising max_tokens (so fewer passes truncate-and-redo), and caching the prior-facts prefix are the three biggest cost levers.
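One way to reduce passes for short messages is to scale the pass count with input length. The sketch below applies the clamp(2, ceil(words/25), 7) rule of thumb from this document's recommendations; the function name is hypothetical and counting words by whitespace is an assumption.

```rust
/// Choose a pass count from the input length: ceil(words / 25), kept within 2..=7.
pub fn adaptive_passes(text: &str) -> u32 {
    let words = text.split_whitespace().count();
    // Integer ceiling of words / 25.
    let scaled = (words + 24) / 25;
    scaled.clamp(2, 7) as u32
}
```

Under this rule the 3-word input would get 2 passes instead of 7 and the 109-word input would get 5.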
donto_x_memory_job_log
Every memorize and recall touches this table. The deep mode pipeline writes two rows per request:
-- Row 1: "queued" placeholder, written immediately after defer decision
INSERT INTO donto_x_memory_job_log (
  job_id, endpoint='POST /memorize (queued)',
  status_code=202, elapsed_ms=0,
  request=<full request body>, response=<{status:queued,queue_id:...}>,
  holder, session_id
)
-- Row 2: "async" final row, written when memorize_one returns
INSERT INTO donto_x_memory_job_log (
  job_id, endpoint='POST /memorize (async)',
  status_code=200 (or 500), elapsed_ms=<final>,
  request=<full request body>, response=<full response body>,
  facts_extracted, facts_ingested,
  model, prompt_tokens, completion_tokens, total_tokens,
  error=NULL (or message),
  holder, session_id
)
If the binary restarts mid-run, the startup path stamps any orphaned
(queued) rows with endpoint
POST /memorize (lost) and status_code=500 so
they don’t appear “still running” forever. This is the marker we saw at
14:44:58 in this session.
Sync-mode requests write one row with endpoint
POST /memorize (sync) or POST /memorize
depending on path.
The audit table is the source of truth for the /jobs
index and /jobs/<id> detail page.
/jobs UI surface
There are three routes:
| Route | Purpose |
|---|---|
| GET /jobs | Paginated index of recent jobs (status, route, holder, text preview, elapsed) |
| GET /jobs/<id> | HTML detail page: request payload, response payload, per-pass yields, per-fact table |
| GET /jobs/<id>/raw | JSON detail (same shape as the HTML page’s underlying data) |
Detail-page features specific to deep mode:
- Facts are grouped by aperture, with a divider per group:
── pass_1 (132 facts) ──.
- Each fact shows the pass that produced it (pass_3, pass_5, etc.).
- A salvaged pass reports JSON truncated in its logs (TODO:
this is currently in journal only; should be in the response body).
The operator endpoint /jobs was previously gated by
DONTO_MEMORY_OPS_TOKEN. The token is currently set to empty in
/etc/donto-memory/env so the page is publicly browsable.
Anyone with the URL can read every memorize that has ever happened. That is
fine for the current single-user deployment, but it will need a real auth model
before multi-tenant.
Per memorize, the substrate receives:
1. One episodic record (donto:record:<uuid>) typed via
mem:module/episodic. The full text + modality + holder +
session_id. Bitemporal, visible from now().
2. N semantic claims typed via mem:module/semantic-claim. Each carries:
   - subject, predicate, object_iri or object_lit
   - confidence, modality, hypothesis_only
   - aperture (= the pass label, e.g. pass_3)
   - source_record_iri (= the episodic record’s IRI from step 1)
   - context (ctx:memory)
The substrate’s policy gate can reject a fact (e.g. predicate-domain
violation, identity policy violation). The orchestrator counts these as
errors in the ingest progress log and the audit row’s
facts_ingested will be lower than
facts_extracted. On the Nietzsche run this gap was 2 facts
(998/1000); on the cat-is-red run it was 0.
The substrate also handles federation, sleep-path reconsolidation, and the trust kernel — none of which are deep-mode-specific. They apply to every memorize regardless of mode.
Deep mode is a write mode. The complementary read
mode is POST /recall, which:
The more facts deep mode ingests per message, the higher the recall precision for queries that touch those facts — provided the dedup and identity work properly. Both runs in this analysis have been writes; we have not yet exercised recall against either output corpus.
Two end-to-end deep-mode runs at production parameters
(passes=7, modality=descriptive,
holder=agent:omega-bot):
Run A: Nietzsche (109 words)
| Pass | Raw | New unique | Collisions | Elapsed | Notes |
|---|---|---|---|---|---|
| 1 | 132 | 132 | 0 | 124 s | clean cold start |
| 2 | FAIL | 0 | 0 | <1 s | prose-not-JSON |
| 3 | 202 | 202 | 0 | 62 s | prior-facts redirect; fastest pass |
| 4 | 204 | 202 | 2 | 127 s | truncated, salvaged |
| 5 | 159 | 157 | 2 | 109 s | |
| 6 | 158 | 156 | 2 | 142 s | |
| 7 | 158 | 151 | 7 | 124 s | highest collision rate |
| Σ | 1013 | 1000 | 13 | 854 s | 998 ingested |

Run B: cat is red (3 words)
| Pass | Raw | New unique | Collisions | Elapsed | Notes |
|---|---|---|---|---|---|
| 1 | 10 | 10 | 0 | 46 s | early truncation, salvaged |
| 2 | 121 | 121 | 0 | 112 s | prior-facts redirect |
| 3 | 150 | 42 | 108 | 100 s | 72% collision — model stuck |
| 4 | 127 | 127 | 0 | 105 s | re-energised |
| 5 | 139 | 139 | 0 | 105 s | re-energised |
| 6 | 103 | 94 | 9 | 96 s | |
| 7 | 165 | 164 | 1 | 120 s | |
| Σ | 815 | 697 | 118 | 703 s | 697 ingested |

| Metric | Run A | Run B |
|---|---|---|
| facts/input-word | 9.2 | 232 |
| cost | $0.112 | $0.135 |
| %-truncations (passes truncated) | ~30% | ~86% |
| pass_2 outcome | prose-fail | 121 unique |
| substrate rejections | 2/1000 | 0/697 |
The non-linearity in facts/word (3-word input produces 25× more facts per word than the 109-word input) is the headline finding from the empirics: the model’s elaboration capacity is decoupled from the input. Deep mode on tiny inputs is largely hallucinating consistent ontology. This is a feature for some use cases (priming recall on common concepts) and a bug for others (signal-to-noise in the substrate).
In rough priority order:
1. max_tokens: 8000 is too low. Most
passes truncate; salvage works but loses the malformed tail. Recommend
12000.
2. Completion-token accounting falls back to passes × max_tokens
as an upper-bound estimate. The audit row therefore overstates real cost;
treat the reported figure as a per-message cost ceiling.
3. Dedup is content-key only. (Cat, rdf:type, Animal) and
(Cat, rdf:type, Mammal) both land. Identity collapse
(discord:user:ajax vs donto:Ajax) similarly
slips through.
4. The queue lives in process memory. A restart loses in-flight and queued work;
the startup path marks orphaned (queued) rows as (lost) but they’re gone.
5. Per-pass progress is logged at info!
level. If the binary’s RUST_LOG defaults to WARN,
journalctl shows nothing useful during a run. Verified this session:
the original Nietzsche run was silent in the journal because the level
filter dropped the info! lines.
6. The format_prior_facts_block function has dead
code
(let take = facts.len().saturating_sub(facts.len().saturating_sub(300));
is a no-op followed by let _ = take;). Cosmetic but should
be cleaned up.
7. The 202 response carries a queue_id but no estimated_position or
estimated_wait. With a single-permit Mutex and 7-pass runs
taking 12–14 min, the 5th queued job will wait ~70 min with no
signal.
8. The /jobs page is unauthenticated.
Acceptable for single-user; needs auth before multi-tenant.
In order from cheapest-fastest-win to deepest:
1. Bump max_tokens to 12000: one
constant change in build_request_body.
2. Adaptive passes on the bot
side: passes = clamp(2, ceil(words/25), 7). Saves
70-80% on short messages.
3. Set RUST_LOG=info,donto_memory=info,donto_memory_core=info in
the systemd unit.
4. Prompt caching via cache_control or OpenAI prefix caching.
Cuts input cost ~60% on later passes.
5. Semantic-dedup pass at the end, before substrate ingest (candidate overlay).
6. A real queue: donto_x_memory_extract_queue with
FOR UPDATE SKIP LOCKED, multiple workers,
restart-safe.
7. Identity resolution: discord:user:ajaxdavis ↔︎
donto:AjaxDavis ↔︎ donto:Reader collapse at
query time.
8. Auth on /jobs and
/memorize.
If you’re reading source:
| Concern | File | Lines |
|---|---|---|
| Deep orchestrator | donto-memory-core/src/extract.rs | 507–646 |
| Prior-facts block formatter | donto-memory-core/src/extract.rs | 961–1003 |
| Per-call HTTP + salvage | donto-memory-core/src/extract.rs | 676–870 |
| SINGLE_PROMPT constant | donto-memory-core/src/extract.rs | 863–872 |
| Content-key dedup | donto-memory-core/src/extract.rs | 112–135 (content_key) |
| Route handler + dispatch | donto-memory/src/api/routes/memorize.rs | 100–340 |
| memorize_one inner engine | donto-memory/src/api/routes/memorize.rs | 333–560 |
| Ingest progress logging | donto-memory/src/api/routes/memorize.rs | 497–540 |
| Async lock initialisation | donto-memory/src/main.rs | 106–133 |
| Job audit log table | migrations: *_donto_x_memory_job_log.sql | — |
| Job detail page rendering | donto-memory/src/api/routes/jobs.rs | — |
Operator endpoints in production:
- Local: 127.0.0.1:7900 on donto-db VM
(us-central1-a, apex-494316).
- Public: https://memories.apexpots.com/....
- Service: donto-memory-api.service (loaded,
enabled, autorestart).
- Database: postgres://donto:***@127.0.0.1:5432/donto_db.
Glossary:
- Aperture: the label of the pass that produced a fact (pass_1 …
pass_7).
- Content key: hash of (subject, predicate, object) used for first-write-wins
dedup across passes.
- Holder: agent:omega-bot for the Discord bot’s auto-memorize.
- Modality: descriptive,
imperative, interrogative, etc. Currently
always descriptive from omega-bot.
- Salvage: recovery of well-formed facts from output truncated at max_tokens.
- ApertureYield: (aperture, raw_facts, elapsed_ms, error). Forms the input
to the /jobs/<id> page’s per-pass breakdown.
Related reports: