An empirical read

What the LLM Actually Extracts —
A qualitative audit of donto-memory's first Discord corpus

Author: Thomas Davis · Date: 2026-05-30 · Corpus window: 2026-05-30 02:31 UTC – 11:31 UTC (~9 h) · Method: direct inspection of every successful POST /memorize call from agent:omega-bot + every fact in its response

Abstract. donto-memory has been wired into a Discord bot (omega-bot) for less than a day, and seventeen messages have flowed through POST /memorize using mode: "single" against z-ai/glm-5. That run produced 1,653 typed ontological statements across 250 distinct subjects and 638 distinct predicates. This report reads every chunk and every extracted triple. It finds: (a) the LLM reliably constructs a per-message Discord-entity skeleton (user, channel, session, message, bot, episodic record), about 20 boilerplate facts per call regardless of input length, accounting for ~25% of the volume; (b) content-bearing extraction is sharp on substantive utterances (a 53-word workflow description yielded a clean dependency graph between actions, file types, runtimes, and outputs) and absurd on trivial utterances (the 3-word message "cat is alive" yielded 87 facts, including a hypothesis-marked fact that the cat "isSchrodingerCat"); (c) cross-chunk identity does not yet converge: the user xenonfun appears in three channels under two different subject IRIs, so recall sees more than one entity; and (d) the model is appropriately conservative about speculation: 1.7% of extracted facts carry hypothesis_only: true, and the model puts Schrödinger-style reaches there, not in the asserted set. The corpus is small enough to read entirely; the patterns are crisp enough to act on. Concrete proposals for identity convergence land in §8, and proposals for predicate alignment, boilerplate suppression, and length-conditional yield land in §9.

1 The corpus

Seventeen successful POST /memorize calls reached the production instance at memories.apexpots.com between 02:31 UTC and 11:31 UTC. All carried holder: "agent:omega-bot", all used mode: "single" against z-ai/glm-5 via OpenRouter, and each carried a single Discord message in the text field of the request:

POST /memorize
{
  "holder":     "agent:omega-bot",
  "session_id": "discord:1349727923434815519:1497274794586931220",
  "text":       "ajaxdavis in #donto: a dog fell into river and hunted fish",
  "mode":       "single"
}

The session_id is keyed on discord:<guild_id>:<channel_id> — per-channel, not per-user (which is a recoverable choice; see §8). Across the seventeen calls, four distinct sessions appear:

Channel (last 19 digits)	Calls	Facts produced
`…1497274794586931220` ("#donto")	7	509
`…1349727923434815522` ("#general")	4	508
`…1462240469864943626` ("#safiersemantics")	4	437
`discord:test` (diagnostic)	2	199

Two distinct human authors appear in the message text: ajaxdavis and xenonfun; one channel (#general) also surfaces a third user, girvo. None of the seventeen calls carry an images field — the multimodal path (agent.md §4) is wired server-side but the bot hasn't shipped image extraction yet.
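The session_id convention and request shape described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the bot's actual code: the helper names `parse_session_id` and `build_memorize_payload` are hypothetical, and the only fields used are the four shown in the request above.

```python
from typing import NamedTuple, Optional


class DiscordSession(NamedTuple):
    guild_id: str
    channel_id: str


def parse_session_id(session_id: str) -> Optional[DiscordSession]:
    """Split 'discord:<guild_id>:<channel_id>' into its parts.

    Returns None for ids that do not follow the convention,
    such as the diagnostic 'discord:test' session.
    """
    parts = session_id.split(":")
    if len(parts) != 3 or parts[0] != "discord":
        return None
    guild_id, channel_id = parts[1], parts[2]
    if not (guild_id.isdigit() and channel_id.isdigit()):
        return None
    return DiscordSession(guild_id, channel_id)


def build_memorize_payload(author: str, channel: str, message: str,
                           guild_id: str, channel_id: str) -> dict:
    """Assemble a request body in the shape shown above (mode 'single')."""
    return {
        "holder": "agent:omega-bot",
        "session_id": f"discord:{guild_id}:{channel_id}",
        "text": f"{author} in #{channel}: {message}",
        "mode": "single",
    }


session = parse_session_id("discord:1349727923434815519:1497274794586931220")
print(session.channel_id)  # 1497274794586931220
```

Because the key contains no user component, two authors posting in the same channel share one session, which is the per-channel behaviour noted above.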

2 Volumetrics

Metric	Value
memorize calls	17
facts extracted	1,653
avg facts per call	97
distinct subjects	250
distinct predicates	638
total LLM tokens	112,760
hypothesis_only facts	28 (1.7%)
avg latency per call	~76 s

Single-mode z-ai/glm-5 takes about 76 seconds per call and produces about 100 facts per call. The yield distribution is roughly bimodal:

Bucket	Calls	Avg ms	Avg input chars
0 (extraction failed)	1	51,144	58
50–99 facts	9	79,399	104
100–149 facts	5	66,370	149
150+ facts	2	93,588	54

Three things in that table merit a stare. First: input length is a weak predictor of yield. The two highest-yielding calls (150+ facts) average just 54 characters of input. Second: the one extraction failure (0 facts) was the EOF-truncation case the runtime now salvages — re-running the same input later produced 130 facts. Third: the dominant bucket (50–99 facts) is concentrated on substantive but moderate-length messages — about a sentence each. The system pays about $0.015–0.02 of OpenRouter spend per chunk.
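The headline figures can be re-derived from the bucket table as a quick consistency check. The sketch below only uses numbers that already appear in this section; the variable names are illustrative.

```python
# (bucket label, calls, average latency in ms), copied from the table above
buckets = [
    ("0 (extraction failed)", 1, 51_144),
    ("50-99 facts", 9, 79_399),
    ("100-149 facts", 5, 66_370),
    ("150+ facts", 2, 93_588),
]

total_calls = sum(calls for _, calls, _ in buckets)
total_ms = sum(calls * avg_ms for _, calls, avg_ms in buckets)
mean_seconds = total_ms / total_calls / 1000  # call-weighted mean latency

total_facts = 1_653
facts_per_call = total_facts / total_calls
hypothesis_share = 28 / total_facts

print(total_calls)                 # 17
print(round(mean_seconds, 1))      # 75.6
print(round(facts_per_call, 1))    # 97.2
print(f"{hypothesis_share:.1%}")   # 1.7%
```

The call-weighted mean of the bucket latencies lands on the ~76 s figure, and the per-call yield on the ~97 facts reported above.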

3 The Discord skeleton

Every chunk produces a recognisable opening pattern of structural facts before the LLM gets to anything content-specific. The skeleton takes about 15–25 of every call's facts (the ~25% boilerplate share) and looks like this in practice (taken from the "model override for agent" call, abbreviated):

(agent:omega-bot,        rdf:type,           ex:Agent)                       0.95
(agent:omega-bot,        ex:hasName,         "omega-bot")                    0.95
(agent:omega-bot,        ex:holdsMemoryContext, ctx:memory/episodic/3677…)   0.95
(ctx:memory/episodic/3677…, rdf:type,        ex:EpisodicMemoryChunk)         0.95
(discord:1349…:1462…,    rdf:type,           ex:DiscordSession)              0.95
(discord:1349…:1462…,    ex:occurredOnPlatform, ex:Discord)                  0.95
(discord:1349…:1462…,    ex:hasGuildId,      "1349727923434815519")          0.95
(discord:1349…:1462…,    ex:hasChannelId,    "1462240469864943626")          0.95
(xenonfun,               rdf:type,           ex:Person)                      0.9
(xenonfun,               ex:hasName,         "xenonfun")                     0.9
(xenonfun,               ex:isDiscordUser,   ex:True)                        0.9
(xenonfun,               ex:participatedInSession, discord:1349…:1462…)      0.9
(xenonfun,               ex:authoredMessage, ctx:memory/episodic/3677…)      0.9
(#safiersemantics,       rdf:type,           ex:DiscordChannel)              0.95
(#safiersemantics,       ex:hasName,         "safiersemantics")              0.95
(#safiersemantics,       ex:isChannelInGuild, "1349727923434815519")         0.9

This is the LLM doing the schema work donto-memory's design takes for granted — turning the bare session_id string into a typed Discord-session entity with a guild-id and channel-id, and constructing the user → message → channel → guild graph that downstream recall can walk. It is real ontology work; it would not happen if the agent went straight to donto_statement ingest. But it also obviously repeats. The seventeen chunks have produced seventeen slightly different discord:<guild>:<channel> DiscordSession typings, seventeen agent:omega-bot rdf:type ex:Agent assertions, and so on. The boilerplate is expensive in tokens, and most of it is also discoverable from the structure of donto-memory's overlay tables already. A v0.2 system prompt could ask the LLM to skip the platform boilerplate, knock 15-20 facts off every call, and recover ~20% of the per-call cost.
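To measure the boilerplate share on a given response, one option is a small heuristic classifier over the extracted triples. The sketch below is an assumption-laden illustration: the type and predicate sets are taken from the skeleton excerpt above and are not exhaustive, and the function names are hypothetical rather than part of donto-memory.

```python
from typing import Iterable, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)

# Types seen in the skeleton excerpt above; illustrative, not exhaustive.
STRUCTURAL_TYPES = {
    "ex:Agent",
    "ex:EpisodicMemoryChunk",
    "ex:DiscordSession",
    "ex:DiscordChannel",
}
# Predicates from the same excerpt that only restate platform structure.
STRUCTURAL_PREDICATES = {
    "ex:holdsMemoryContext",
    "ex:occurredOnPlatform",
    "ex:hasGuildId",
    "ex:hasChannelId",
    "ex:isChannelInGuild",
    "ex:isDiscordUser",
    "ex:participatedInSession",
    "ex:authoredMessage",
}


def is_skeleton(fact: Fact) -> bool:
    """Heuristic: True when a fact only restates Discord/session structure."""
    _, predicate, obj = fact
    if predicate in STRUCTURAL_PREDICATES:
        return True
    return predicate == "rdf:type" and obj in STRUCTURAL_TYPES


def boilerplate_share(facts: Iterable[Fact]) -> float:
    """Fraction of facts flagged as skeleton; 0.0 for an empty input."""
    facts = list(facts)
    if not facts:
        return 0.0
    return sum(is_skeleton(f) for f in facts) / len(facts)


sample = [
    ("agent:omega-bot", "rdf:type", "ex:Agent"),
    ("discord:1349…:1462…", "ex:hasGuildId", "1349727923434815519"),
    ("file:shot.js", "rdf:type", "ex:JavaScriptFile"),
    ("file:shot.js", "ex:produces", "file:out.png"),
]
print(boilerplate_share(sample))  # 0.5
```

A classifier like this would only report the share; the saving itself still depends on the prompt change, since the tokens are spent at generation time.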

4 Content extraction quality

The other 75% of each call's facts is content-specific extraction about the message's actual subject matter. Quality varies sharply with message substance — and not always in the way you'd expect.

Input text	Facts	Subject matter the model went to
"cat is alive" (3 words)	87	built an entire epistemic theory of the cat (see §5)
"creepy" (1 word)	82	boilerplate plus aesthetic typing of the word itself
"hi" (1 word)	55	greeting taxonomy; phatic-vs-substantive analysis
"a dog fell into river and hunted fish" (8 words)	108	dog, river, fish, falling, hunting — proper event decomposition
"who's dog. is this now just about that dog…well established FACT that does feel into the river"	94	discourse meta — recognises this as a reply, types the prior message as referent
"how much memory it sucking down?" (informal infra Q)	195	memory measurement, software perf, the elided "it"
"I have nemo at 256K and down to ~33GB with 6 concurrency"	121	nemo (the model), context length, RAM, concurrency parameter
"The loop is now: edit a part / CSS / HTML → node shot.js…"	105	workflow graph: actions, file types, runtimes, outputs (see §6)
"model override for agent"	154	maximal boilerplate; very little content because the message has little

A pattern emerges: the model fills its yield budget regardless of input substance. A 3-word message and a 53-word message both produce around 100 facts. The longer message has more facts per sentence because it has more substance; the shorter one has more facts per word because the model elaborates speculatively. The extreme case — "cat is alive" — deserves its own section.
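The facts-per-word contrast can be made concrete with the rows of the table whose text is quoted in full. This is plain arithmetic on the figures above; word counts come from whitespace splitting, and the dev-loop row uses the 53-word count stated in the text because the table truncates that message.

```python
def facts_per_word(text: str, facts: int) -> float:
    """Facts extracted divided by whitespace-separated words in the message."""
    return facts / len(text.split())


rows = [
    ("cat is alive", 87),
    ("creepy", 82),
    ("hi", 55),
    ("a dog fell into river and hunted fish", 108),
    ("how much memory it sucking down?", 195),
    ("model override for agent", 154),
]

for text, facts in rows:
    print(f"{text!r}: {facts_per_word(text, facts):.1f} facts per word")

# The dev-loop message is truncated in the table, so use the stated 53 words.
dev_loop_ratio = 105 / 53
print(f"dev-loop: {dev_loop_ratio:.1f} facts per word")
```

The cat message comes out at 29 facts per word against roughly 2 for the dev-loop message, which is the over-elaboration the paragraph above describes.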

5 "Cat is alive": anatomy of an over-yield

The single most interesting chunk in the corpus is the 3-word message "cat is alive". It produced 87 facts. The first 27 are the expected skeleton (Discord, user, channel, session). The next 4 are the right facts: the cat exists, the cat is an Animal, the cat hasLifeStatus "alive", the cat ex:isAlive true. So far so good — about 31 facts in. Then the model gets creative:

Subject	Predicate	Object	Conf	Mark
ex:cat:mentioned	ex:wasSubjectOf	discord:message:…	0.9
discord:user:ajaxdavis	ex:asserted	ex:proposition:cat-alive	0.95
ex:proposition:cat-alive	rdf:type	ex:Proposition	0.9
ex:proposition:cat-alive	ex:hasContent	"cat is alive"	0.9
ex:proposition:cat-alive	ex:isAbout	ex:cat:mentioned	0.9
ex:proposition:cat-alive	ex:hasTruthValue	"claimed"	0.8
discord:user:ajaxdavis	ex:hasKnowledgeOf	ex:cat:mentioned	0.8
discord:user:ajaxdavis	ex:observed	ex:cat:mentioned	0.7
ex:cat:mentioned	ex:hadUncertainStatus	true	0.6	[H]
ex:cat:mentioned	ex:wasPotentiallyDead	true	0.5	[H]
ex:cat:mentioned	ex:isSchrodingerCat	true	0.4	[H]

Read this carefully. The model has noticed that "cat is alive" is a statement about a cat's life status, that it is sensible to ask why anyone would announce it, and that the ordinary discourse function of such an utterance is to resolve uncertainty about that status. It has therefore inferred that the cat was previously in an uncertain life-status, that this might mean the cat was potentially dead, and, at confidence 0.4 with hypothesis_only: true, it has named the cat Schrödinger's cat. This is a sharp piece of pragmatic inference, and I have to give credit where it is due. It is also, even with the hypothesis flag, ridiculous. donto-memory's policy machinery has no way to mark these as "delete on policy change" or "expire after N days unless corroborated" (M11.x territory), so they sit in the substrate forever.

The lesson is structural: mode: "single" on z-ai/glm-5 with the maximalist prompt over-yields on short inputs. Two paths from here. One: a length-conditional prompt that asks for "at most ⌈3 × words⌉ facts" on sub-10-word inputs. Two: respect the maturity ladder (E0..E5) the substrate has and degrade the hypothesis_only Schrödinger fact at maturity 0 with a worker-side decay rule. Both are implementable in donto-memory without touching the substrate. The extracted Schrödinger inference is interesting; it does not need to be permanent.
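The first path, a length-conditional ceiling, might look like the sketch below. It assumes the "<user> in #<channel>: " prefix seen in the request example should not count toward the word total; `fact_cap` and its parameters are hypothetical names, and how the resulting number is injected into the prompt is left open.

```python
import math
import re
from typing import Optional

# Matches the "<user> in #<channel>: " prefix the bot prepends to each message.
PREFIX = re.compile(r"^\S+ in #\S+: ")


def fact_cap(text: str, short_limit: int = 10, per_word: int = 3) -> Optional[int]:
    """Return a fact ceiling for short inputs, or None to leave the prompt uncapped."""
    content = PREFIX.sub("", text, count=1)
    words = len(content.split())
    if words >= short_limit:
        return None
    return math.ceil(per_word * words)


print(fact_cap("cat is alive"))  # 9
print(fact_cap("ajaxdavis in #donto: a dog fell into river and hunted fish"))  # 24
```

Under this rule the cat message would be capped at 9 facts, while inputs of ten words or more keep the existing prompt behaviour.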

6 The dev-loop chunk: anatomy of a clean yield

At the opposite end of the substance spectrum is the workflow description from xenonfun in #safiersemantics:

The loop is now: edit a part / CSS / HTML → node shot.js … out.png <action> → look. ~6 seconds, zero recompile. Verified it renders identically to the built bundle. You only cargo build + deploy-hub.sh once you're happy, to ship it to the running hub.

Substance, density, and a clear graph. The model produced 105 facts, including this rich semantic skeleton:

(workflow:dev-loop-xenonfun, rdf:type,         ex:DevelopmentWorkflow)    0.95
(workflow:dev-loop-xenonfun, ex:hasStep,        action:edit-files)        0.95
(workflow:dev-loop-xenonfun, ex:hasStep,        action:run-shot-js)       0.95
(workflow:dev-loop-xenonfun, ex:hasStep,        action:view-output)       0.95
(workflow:dev-loop-xenonfun, ex:hasDuration,    "6")                      0.95
(workflow:dev-loop-xenonfun, ex:durationUnit,   "seconds")                0.95
(workflow:dev-loop-xenonfun, ex:requiresRecompile, false)                 0.95
(action:edit-files,          ex:involvesFileType, filetype:part)          0.95
(action:edit-files,          ex:involvesFileType, filetype:css)           0.95
(action:edit-files,          ex:involvesFileType, filetype:html)          0.95
(filetype:css,               ex:fullName,         "Cascading Style Sheets") 0.95
(filetype:html,              ex:fullName,         "HyperText Markup Language") 0.95
(file:shot.js,               rdf:type,            ex:JavaScriptFile)       0.99
(file:shot.js,               ex:executedBy,       software:node-js)        0.99
(file:shot.js,               ex:produces,         file:out.png)            0.95
(software:node-js,           rdf:type,            ex:JavaScriptRuntime)    0.99
(action:run-shot-js,         ex:usesRuntime,      software:node-js)        0.99
(file:out.png,               rdf:type,            ex:ImageFile)            0.99
(file:out.png,               ex:format,           "PNG")                   0.99
(file:out.png,               ex:isOutputOf,       file:shot.js)            0.95

This is properly typed, properly connected, and properly cross-referenced. The workflow has three steps, each step is a typed action, each action references the file types it touches, each file references the runtime it runs under, the runtime is typed, the output file is typed. A future recall like "how does xenonfun rebuild the page?" can walk this graph without any vector-similarity guesswork. The maturity 0 confidences are mostly 0.95–0.99: the model is sure about everything because the source text was concrete.
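To illustrate what "walking the graph" means here, the sketch below runs a breadth-first traversal over a handful of the triples listed above. It is not donto-memory's recall implementation: the `reachable` helper is hypothetical, and edges are treated as undirected because, in this excerpt, the file nodes link to the runtime rather than the other way round.

```python
from collections import defaultdict, deque

# A subset of the triples from the listing above.
triples = [
    ("workflow:dev-loop-xenonfun", "ex:hasStep", "action:edit-files"),
    ("workflow:dev-loop-xenonfun", "ex:hasStep", "action:run-shot-js"),
    ("workflow:dev-loop-xenonfun", "ex:hasStep", "action:view-output"),
    ("action:edit-files", "ex:involvesFileType", "filetype:css"),
    ("action:run-shot-js", "ex:usesRuntime", "software:node-js"),
    ("file:shot.js", "ex:executedBy", "software:node-js"),
    ("file:shot.js", "ex:produces", "file:out.png"),
]


def reachable(start, triples):
    """Breadth-first walk over undirected subject/object edges, in visit order."""
    edges = defaultdict(list)
    for subject, _, obj in triples:
        edges[subject].append(obj)
        edges[obj].append(subject)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in edges[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


nodes = reachable("workflow:dev-loop-xenonfun", triples)
print("file:out.png" in nodes)  # True
```

Starting from the workflow node, the walk reaches the runtime, the script, and the output image purely through typed edges.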

The contrast with the cat example is the lesson. The same prompt and the same model produce a tightly-connected graph on a substantive input and a speculative cloud on a barren one. Using fact count as donto-memory's single quality metric will mislead: the cat chunk got 87 facts and the dev-loop chunk got 105, but the second is meaningfully more recoverable.

7 What the model marks as speculation

The corpus contains 1,653 facts, of which exactly 28 carry hypothesis_only: true. That is 1.7%. By construction the model is supposed to use this flag for inferred facts it isn't confident about. The cat-Schrödinger speculations are in the set. So are a handful of guesses about which Discord guild houses which channel, and a few sociolinguistic readings (a message ending in "haha" tagged ex:hasEmotionalTone "playful" with hypothesis_only). On the whole, the model is appropriately sparing with the flag. It is not using it to soft-mark every inference; most of the inferred facts (~28% of the corpus) are unmarked. The flag does seem to specifically mark "this is a real reach" facts rather than "this extrapolates beyond literal content" facts.

For an agent reading donto-memory output, this means polarity = "asserted" is not a high-confidence filter — 0.85-0.95 confidence inferred facts are mixed in. The right filter for "things I'm sure about" is WHERE hypothesis_only IS NOT TRUE AND confidence >= 0.9 or similar.
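Applied client-side, that filter could be sketched as follows. The dictionary keys (`predicate`, `confidence`, `hypothesis_only`) are an assumed shape for facts in the response, not a documented schema, and the sample rows are taken from the cat table in §5.

```python
def confident_facts(facts, min_confidence=0.9):
    """Keep facts that are not flagged hypothesis_only and meet the confidence floor."""
    return [
        fact for fact in facts
        if fact.get("hypothesis_only") is not True
        and fact.get("confidence", 0.0) >= min_confidence
    ]


sample = [
    {"predicate": "rdf:type", "confidence": 0.9},
    {"predicate": "ex:observed", "confidence": 0.7},
    {"predicate": "ex:isSchrodingerCat", "confidence": 0.4, "hypothesis_only": True},
    {"predicate": "ex:asserted", "confidence": 0.95},
]

print([fact["predicate"] for fact in confident_facts(sample)])
# ['rdf:type', 'ex:asserted']
```

Treating a missing flag as "not hypothesis" mirrors the IS NOT TRUE wording, so facts without the field are judged on confidence alone.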

8 Identity drift across messages

Three distinct chunks describe xenonfun:

Message	Subject IRI minted
"model override for agent" (#safiersemantics)	`xenonfun`
"how much memory it sucking down?" (#general)	`xenonfun`
"who's dog. is this now just about…" (#donto)	`discord:user:xenonfun`

Two different subject IRIs for the same person. And those are just two of the patterns; ajaxdavis appears across chunks as ajaxdavis, discord:user:ajaxdavis, and once as user:ajaxdavis. The substrate's identity-lens mechanism is the right home for resolving this (the likely_identity_v1 lens at confidence ≥0.85), but the substrate-side identity edges aren't being minted automatically. A recall by subject = "xenonfun" today will miss the chunk where the bot's LLM used the longer IRI.

This is a fixable gap with three layers of intervention, in order of effort:

1. Prompt: the system prompt currently says "Reuse existing donto vocabulary where obvious". Add an explicit convention block listing the canonical IRI shapes for Discord entities: discord:user:<handle>, discord:channel:<name>, discord:message:<id>, and a last-resort fallback instruction "use a stable bare handle if canonical form is uncertain". This won't fix every case but should cut drift by half.
2. Post-processing: in the semantic-claim module's ingest, normalise subject/object IRIs against a small regex dictionary ("any subject that looks like a Discord handle, snap to discord:user:<handle>"). This is a write-time canonicalisation. Cheap; preserves provenance via the original-IRI metadata field.
3. Identity edges: after each /memorize, the worker (sleep path) walks the new facts and mints donto_same_referent rows between IRIs that name the same Discord user. This is the substrate-native solution and is where M11.x identity-cluster work would land.

Path 1 is a 30-minute prompt change. Path 2 is an afternoon of canonicalisation logic. Path 3 needs a new sleep-path operator and is the right end-state.
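A minimal sketch of the write-time canonicalisation in the second path is shown below. It makes two assumptions that are not in the audit: that a set of known handles is available (a bare regex cannot tell a handle from any other bare subject), and that returning the original IRI alongside the canonical one is enough to feed the original-IRI metadata field. The names `KNOWN_HANDLES` and `canonicalise` are hypothetical.

```python
import re

# Handles the bot has seen; in practice this might come from Discord metadata (assumption).
KNOWN_HANDLES = {"xenonfun", "ajaxdavis", "girvo"}

# Optional "discord:user:" or "user:" prefix followed by a handle.
USER_IRI = re.compile(r"^(?:discord:user:|user:)?(?P<handle>[A-Za-z0-9_.]+)$")


def canonicalise(iri: str):
    """Return (canonical_iri, original_iri); IRIs that are not known users pass through."""
    match = USER_IRI.match(iri)
    if match and match.group("handle") in KNOWN_HANDLES:
        return f"discord:user:{match.group('handle')}", iri
    return iri, iri


for raw in ["xenonfun", "user:ajaxdavis", "discord:user:xenonfun", "file:shot.js"]:
    print(canonicalise(raw))
# ('discord:user:xenonfun', 'xenonfun')
# ('discord:user:ajaxdavis', 'user:ajaxdavis')
# ('discord:user:xenonfun', 'discord:user:xenonfun')
# ('file:shot.js', 'file:shot.js')
```

All three IRI shapes observed for the two users collapse to one form, while non-user IRIs are left untouched.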

9 Concrete opportunities

Three high-impact changes follow directly from this audit (§9.1–9.3), along with one integration task that is still pending (§9.4):

9.1 — Predicate alignment

The corpus has 638 distinct predicates from 1,653 facts. That is roughly 1 unique predicate per 2.6 statements. Some are genuinely distinct (ex:executedBy, ex:produces, ex:hasLifeStatus); many are slight variants of each other (ex:hasUsername vs ex:hasName vs ex:hasHandle; ex:participatedInSession vs ex:isParticipantOf vs ex:hasParticipant). The substrate ships donto_predicate_alignment for exactly this — and donto align auto is already a CLI command. A one-off run over the corpus would collapse the predicate count by 40-60% and dramatically improve recall by predicate filter. This is the cheapest single quality win.
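The effect of alignment can be illustrated with a hand-written map over the variants named above. This is only a sketch of the idea: the map is invented for illustration and is not the output of donto align auto, and ex:hasParticipant is left out because it runs in the inverse direction and cannot be merged by simple renaming.

```python
# Hand-written alignment map for illustration; not produced by `donto align auto`.
ALIGNMENT = {
    "ex:hasUsername": "ex:hasName",
    "ex:hasHandle": "ex:hasName",
    "ex:isParticipantOf": "ex:participatedInSession",
}


def align(predicate: str) -> str:
    """Map a variant predicate to its canonical form, or return it unchanged."""
    return ALIGNMENT.get(predicate, predicate)


observed = [
    "ex:hasUsername", "ex:hasName", "ex:hasHandle",
    "ex:participatedInSession", "ex:isParticipantOf", "ex:executedBy",
]
before = len(set(observed))
after = len({align(p) for p in observed})
print(before, after)  # 6 3

# Corpus-level arithmetic from the figures in this section.
print(round(1653 / 638, 1))                                # 2.6
print(round(638 * (1 - 0.6)), round(638 * (1 - 0.4)))      # 255 383
```

If the 40-60% estimate holds, the corpus would end up with somewhere between roughly 255 and 383 distinct predicates.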

9.2 — Boilerplate suppression in the system prompt

The Discord skeleton facts (omega-bot rdf:type Agent, episodic record rdf:type EpisodicMemoryChunk, session rdf:type DiscordSession) are mechanical and rederivable from donto-memory's overlay tables. The system prompt could be expanded to say:

The donto-memory runtime already records: which agent holds the memory, the episodic chunk's record IRI, the holder, the session IRI, and the time. DO NOT re-extract those structural facts. Focus on facts implied by the message content itself.

This alone would knock 15–20 facts off every call (so ~300 facts saved across the corpus), shorten the LLM response, save tokens, and remove the repetition in the substrate that predicate alignment currently has to clean up.

9.3 — Length-conditional yield

The current prompt asks for "30+ statements from sentence-length chunk, 100+ from paragraph" regardless of input length. On 3-word messages this produces the Schrödinger cat. A length-conditional clause such as

Aim for one fact per 2-3 words of input content (excluding the <user> in #<channel>: prefix). Under-yield rather than over-yield on short utterances; the next call can fill gaps if needed.

would map the cat chunk to ~4 facts instead of ~67, and would probably push the average call from ~76s to ~40-50s on short messages without losing anything important.

9.4 — Recall integration (still pending)

The bot is writing memories but not yet reading them (integration-patterns §1.2). The whole point of the corpus is to be available on the response hot path. A 17-message corpus is small enough to walk in 30–80 ms per recall (per the v0.1.0 paper), so latency is not a blocker. This is the single biggest open task in the integration backlog.

10 Conclusion

The corpus is nine hours old and seventeen messages deep, which makes it small enough to read every chunk and every triple by hand. Three things are working: the structural Discord-entity skeleton is built consistently per chunk; content extraction on substantive messages yields tightly-connected typed graphs; the hypothesis_only flag is used sparingly and accurately. Three things need attention: a high boilerplate share that's mechanically rederivable; identity drift across chunks (the same user reads as several distinct subjects); and an over-yield on trivial inputs that puts speculative Schrödinger-style facts permanently into the substrate. None of these is hard. Predicate alignment is a single CLI run. Boilerplate suppression is a system-prompt edit. Length-conditioned extraction is another paragraph in the prompt. Identity convergence is harder but can be done in steps, and the substrate already supports the machinery. The integration's largest remaining open task is to read the memories on the response path: the bot is write-only today, and the most expensive memorize calls are producing data nobody is consulting.

The qualitative impression: the system is sharper than its defaults. The substrate handles append-only paraconsistent storage correctly; donto-memory's pipeline correctly produces typed graphs from raw chat. The remaining work is mostly about being less prolix.
