An empirical read
What the LLM Actually Extracts —
A qualitative audit of donto-memory's first Discord corpus
POST /memorize using mode:
"single" against z-ai/glm-5. That run produced
1,653 typed ontological statements across 250
distinct subjects and 638 distinct predicates.
This report reads every chunk and every extracted triple. It finds:
(a) the LLM reliably constructs a per-message Discord-entity skeleton
(user, channel, session, message, bot, episodic record) — about 20
boilerplate facts per call regardless of input length, accounting for
~25% of the volume; (b) content-bearing extraction is sharp on
substantive utterances (a 53-word workflow description yielded a
clean dependency graph between actions, file types, runtimes, and
outputs) and absurd on trivial utterances (the 3-word message
"cat is alive" yielded 87 facts, including a hypothesis-marked
fact that the cat "isSchrodingerCat"); (c) cross-chunk
identity does not yet converge: the user xenonfun
appears in three channels under two different subject IRIs; and
(d) the model is appropriately conservative about speculation —
1.7% of extracted facts carry hypothesis_only: true and
the model puts Schrödinger-style reaches there, not in the
asserted set. The corpus is small enough to read entirely; the
patterns are crisp enough to act on. Concrete proposals for predicate
alignment, identity convergence, and a boilerplate-suppression
prompt land in §9.
1 The corpus
Seventeen successful POST /memorize calls reached the
production instance at memories.apexpots.com between
02:31 UTC and 11:31 UTC. All carried holder:
"agent:omega-bot", all used mode: "single"
against z-ai/glm-5 via OpenRouter, and all were a single
Discord message embedded in the text field of the request:
POST /memorize
{
"holder": "agent:omega-bot",
"session_id": "discord:1349727923434815519:1497274794586931220",
"text": "ajaxdavis in #donto: a dog fell into river and hunted fish",
"mode": "single"
}
The session_id is keyed on discord:<guild_id>:<channel_id>
— per-channel, not per-user (which is a recoverable choice; see
§8). Across the seventeen calls, four distinct
sessions appear:
| Channel (last 19 digits) | Calls | Facts produced |
|---|---|---|
| …1497274794586931220 ("#donto") | 7 | 509 |
| …1349727923434815522 ("#general") | 4 | 508 |
| …1462240469864943626 ("#safiersemantics") | 4 | 437 |
| discord:test (diagnostic) | 2 | 199 |
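The request shape and the session key described above can be sketched in a few lines. This is a minimal sketch, not the bot's actual code: `build_memorize_payload`, `parse_session_id`, and `SessionKey` are hypothetical names, and the only details taken from the corpus are the four payload fields, the `<user> in #<channel>: <message>` text convention, and the `discord:<guild_id>:<channel_id>` key.

```python
from typing import NamedTuple, Optional


class SessionKey(NamedTuple):
    guild_id: Optional[str]
    channel_id: Optional[str]


def build_memorize_payload(author, channel, message, guild_id, channel_id,
                           holder="agent:omega-bot"):
    """Assemble a JSON body for POST /memorize (hypothetical helper)."""
    return {
        "holder": holder,
        # Keyed per channel, not per user.
        "session_id": f"discord:{guild_id}:{channel_id}",
        "text": f"{author} in #{channel}: {message}",
        "mode": "single",
    }


def parse_session_id(session_id):
    """Split a session key into guild and channel ids.

    Keys that do not match discord:<guild_id>:<channel_id>, such as the
    diagnostic "discord:test", yield SessionKey(None, None).
    """
    parts = session_id.split(":")
    if (len(parts) == 3 and parts[0] == "discord"
            and parts[1].isdigit() and parts[2].isdigit()):
        return SessionKey(parts[1], parts[2])
    return SessionKey(None, None)
```

Because the key carries no user component, any per-user grouping has to come from the `text` prefix or from the extracted facts, which is where the identity drift discussed in §8 enters.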
Two distinct human authors appear in the message text:
ajaxdavis and xenonfun; one channel
(#general) also surfaces a third user, girvo.
None of the seventeen calls carry an images field —
the multimodal path (agent.md §4)
is wired server-side but the bot hasn't shipped image extraction yet.
2 Volumetrics
Single-mode z-ai/glm-5 takes about 76 seconds per call
and produces about 100 facts per call. The yield distribution is
roughly bimodal:
| Bucket | Calls | Avg ms | Avg input chars |
|---|---|---|---|
| 0 (extraction failed) | 1 | 51,144 | 58 |
| 50–99 facts | 9 | 79,399 | 104 |
| 100–149 facts | 5 | 66,370 | 149 |
| 150+ facts | 2 | 93,588 | 54 |
Three things in that table merit a stare. First: input length is a weak predictor of yield. The two highest-yielding calls (150+ facts) average just 54 characters of input. Second: the one extraction failure (0 facts) was the EOF-truncation case the runtime now salvages — re-running the same input later produced 130 facts. Third: the dominant bucket (50–99 facts) is concentrated on substantive but moderate-length messages — about a sentence each. The system pays about $0.015–0.02 of OpenRouter spend per chunk.
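The headline figures (about 76 seconds and about 100 facts per call) can be recomputed from the bucket table. The sketch below only re-derives call-weighted means from the four table rows and the 1,653-fact total; the variable names are illustrative.

```python
# (bucket label, calls, average latency in ms, average input chars),
# copied from the bucket table.
BUCKETS = [
    ("0 (extraction failed)", 1, 51_144, 58),
    ("50-99 facts", 9, 79_399, 104),
    ("100-149 facts", 5, 66_370, 149),
    ("150+ facts", 2, 93_588, 54),
]
TOTAL_FACTS = 1_653


def weighted_mean(rows, column):
    """Mean of one column, weighted by the number of calls in each bucket."""
    calls = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / calls


total_calls = sum(row[1] for row in BUCKETS)
mean_latency_s = weighted_mean(BUCKETS, 2) / 1000
mean_input_chars = weighted_mean(BUCKETS, 3)
facts_per_call = TOTAL_FACTS / total_calls

print(f"{total_calls} calls, {mean_latency_s:.1f} s/call, "
      f"{mean_input_chars:.0f} chars/call, {facts_per_call:.1f} facts/call")
```

The weighted latency comes out just under 76 seconds and the yield just over 97 facts per call, which matches the rounded figures quoted at the top of the section.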
3 The Discord skeleton
Every chunk produces a recognisable opening pattern of structural facts before the LLM gets to anything content-specific. The skeleton takes about 15–25 of every call's facts (the ~25% boilerplate share) and looks like this in practice (taken from the "model override for agent" call, abbreviated):
(agent:omega-bot, rdf:type, ex:Agent) 0.95
(agent:omega-bot, ex:hasName, "omega-bot") 0.95
(agent:omega-bot, ex:holdsMemoryContext, ctx:memory/episodic/3677…) 0.95
(ctx:memory/episodic/3677…, rdf:type, ex:EpisodicMemoryChunk) 0.95
(discord:1349…:1462…, rdf:type, ex:DiscordSession) 0.95
(discord:1349…:1462…, ex:occurredOnPlatform, ex:Discord) 0.95
(discord:1349…:1462…, ex:hasGuildId, "1349727923434815519") 0.95
(discord:1349…:1462…, ex:hasChannelId, "1462240469864943626") 0.95
(xenonfun, rdf:type, ex:Person) 0.9
(xenonfun, ex:hasName, "xenonfun") 0.9
(xenonfun, ex:isDiscordUser, ex:True) 0.9
(xenonfun, ex:participatedInSession, discord:1349…:1462…) 0.9
(xenonfun, ex:authoredMessage, ctx:memory/episodic/3677…) 0.9
(#safiersemantics, rdf:type, ex:DiscordChannel) 0.95
(#safiersemantics, ex:hasName, "safiersemantics") 0.95
(#safiersemantics, ex:isChannelInGuild, "1349727923434815519") 0.9
This is the LLM doing the schema work donto-memory's design takes
for granted — turning the bare session_id string into a
typed Discord-session entity with a guild-id and channel-id, and
constructing the user → message → channel → guild graph that downstream
recall can walk. It is real ontology work; it would not happen if the
agent went straight to donto_statement ingest. But it
also obviously repeats. The seventeen chunks have produced seventeen
slightly different discord:<guild>:<channel>
DiscordSession typings, seventeen agent:omega-bot
rdf:type ex:Agent assertions, and so on. The boilerplate is
expensive in tokens, and most of it is also discoverable
from the structure of donto-memory's overlay tables already.
A v0.2 system prompt could ask the LLM to skip the platform
boilerplate, knock 15-20 facts off every call, and recover ~20% of
the per-call cost.
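One way to track the boilerplate share before and after a prompt change is to classify each triple as skeleton or content. The sketch below is an assumption-laden illustration: the predicate and type sets are copied from the sample listing above rather than from any donto-memory configuration, and `is_skeleton` and `boilerplate_share` are hypothetical names. `ex:hasName` is deliberately left out because content facts can use it too.

```python
# Predicates and rdf:type objects seen in the skeleton listing (assumed set).
SKELETON_PREDICATES = {
    "ex:holdsMemoryContext", "ex:occurredOnPlatform", "ex:hasGuildId",
    "ex:hasChannelId", "ex:isDiscordUser", "ex:participatedInSession",
    "ex:authoredMessage", "ex:isChannelInGuild",
}
SKELETON_TYPES = {
    "ex:Agent", "ex:EpisodicMemoryChunk", "ex:DiscordSession",
    "ex:DiscordChannel",
}


def is_skeleton(triple):
    """True when a (subject, predicate, object) triple is platform boilerplate."""
    _subject, predicate, obj = triple
    if predicate == "rdf:type":
        return obj in SKELETON_TYPES
    return predicate in SKELETON_PREDICATES


def boilerplate_share(triples):
    """Fraction of triples classified as skeleton; 0.0 for an empty list."""
    if not triples:
        return 0.0
    return sum(is_skeleton(t) for t in triples) / len(triples)


SAMPLE = [
    ("agent:omega-bot", "rdf:type", "ex:Agent"),
    ("xenonfun", "ex:isDiscordUser", "ex:True"),
    ("#safiersemantics", "rdf:type", "ex:DiscordChannel"),
    ("file:shot.js", "rdf:type", "ex:JavaScriptFile"),
]
print(boilerplate_share(SAMPLE))
```

Running the same measurement over a chunk extracted with the current prompt and one extracted with a boilerplate-suppressing prompt would show whether the suppression actually takes effect.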
4 Content extraction quality
The other 75% of each call's facts is content-specific extraction about the message's actual subject matter. Quality varies sharply with message substance — and not always in the way you'd expect.
| Input text | Facts | Subject matter the model went to |
|---|---|---|
| "cat is alive" (3 words) | 87 | built an entire epistemic theory of the cat (see §5) |
| "creepy" (1 word) | 82 | boilerplate plus aesthetic typing of the word itself |
| "hi" (1 word) | 55 | greeting taxonomy; phatic-vs-substantive analysis |
| "a dog fell into river and hunted fish" (8 words) | 108 | dog, river, fish, falling, hunting — proper event decomposition |
| "who's dog. is this now just about that dog…well established FACT that does feel into the river" | 94 | discourse meta — recognises this as a reply, types the prior message as referent |
| "how much memory it sucking down?" (informal infra Q) | 195 | memory measurement, software perf, the elided "it" |
| "I have nemo at 256K and down to ~33GB with 6 concurrency" | 121 | nemo (the model), context length, RAM, concurrency parameter |
| "The loop is now: edit a part / CSS / HTML → node shot.js…" | 105 | workflow graph: actions, file types, runtimes, outputs (see §6) |
| "model override for agent" | 154 | maximal boilerplate; very little content because the message has little |
A pattern emerges: the model fills its yield budget regardless of input substance. A 3-word message and a 53-word message both produce around 100 facts. The longer message has more facts per sentence because it has more substance; the shorter one has more facts per word because the model elaborates speculatively. The extreme case — "cat is alive" — deserves its own section.
5 "Cat is alive": anatomy of an over-yield
The single most interesting chunk in the corpus is the 3-word message
"cat is alive". It produced 87 facts. The first 27 are the
expected skeleton (Discord, user, channel, session). The next 4 are
the right facts: the cat exists, the cat is an Animal, the cat
hasLifeStatus "alive", the cat ex:isAlive true. So far
so good — about 31 facts in. Then the model gets creative:
| Subject | Predicate | Object | Conf | Mark |
|---|---|---|---|---|
| ex:cat:mentioned | ex:wasSubjectOf | discord:message:… | 0.9 | |
| discord:user:ajaxdavis | ex:asserted | ex:proposition:cat-alive | 0.95 | |
| ex:proposition:cat-alive | rdf:type | ex:Proposition | 0.9 | |
| ex:proposition:cat-alive | ex:hasContent | "cat is alive" | 0.9 | |
| ex:proposition:cat-alive | ex:isAbout | ex:cat:mentioned | 0.9 | |
| ex:proposition:cat-alive | ex:hasTruthValue | "claimed" | 0.8 | |
| discord:user:ajaxdavis | ex:hasKnowledgeOf | ex:cat:mentioned | 0.8 | |
| discord:user:ajaxdavis | ex:observed | ex:cat:mentioned | 0.7 | |
| ex:cat:mentioned | ex:hadUncertainStatus | true | 0.6 | [H] |
| ex:cat:mentioned | ex:wasPotentiallyDead | true | 0.5 | [H] |
| ex:cat:mentioned | ex:isSchrodingerCat | true | 0.4 | [H] |
Read this carefully. The model has noticed that "cat is alive"
is a statement about a cat's life status, an utterance about which it
is sensible to ask why announce this, an utterance whose
ordinary discourse-functional role is to resolve uncertainty
about some cat's life status. It has therefore inferred that
the cat was previously in an uncertain life-status, that this might
mean the cat was potentially dead, and at confidence 0.4 with
hypothesis_only: true it has named the cat
Schrödinger's cat. This is — and I have to give credit
where it is due — a sharp piece of pragmatic inference. It is also,
even with the hypothesis flag, ridiculous. donto-memory's
policy machinery has no way to mark these as "delete on policy
change" or "expire after N days unless corroborated"
(M11.x territory) so they sit in the substrate forever.
The lesson is structural: mode: "single" on
z-ai/glm-5 with the maximalist prompt over-yields on
short inputs. Two paths from here. One: a length-conditional prompt
that asks for "at most ⌈3 × words⌉ facts" on
sub-10-word inputs. Two: respect the maturity ladder (E0..E5)
the substrate has and degrade the
hypothesis_only Schrödinger fact at maturity
0 with a worker-side decay rule. Both are
implementable in donto-memory without touching the substrate. The
extracted Schrödinger inference is interesting; it does not need to
be permanent.
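The first path, a length-conditional cap, reduces to a small calculation that could run before the prompt is assembled. This is a sketch under stated assumptions: `fact_budget` and `content_words` are hypothetical helpers, the prefix pattern assumes the `<user> in #<channel>: ` convention seen in the request text, and words are counted by whitespace splitting.

```python
import math
import re

# Assumed shape of the "<user> in #<channel>: " prefix on each message.
PREFIX = re.compile(r"^\S+ in #\S+: ")


def content_words(text):
    """Whitespace-separated words of the message, without the author prefix."""
    return PREFIX.sub("", text, count=1).split()


def fact_budget(text, short_limit=10, facts_per_word=3):
    """Return ceil(3 x words) for sub-10-word inputs, else None (no cap)."""
    n_words = len(content_words(text))
    if n_words < short_limit:
        return math.ceil(facts_per_word * n_words)
    return None


print(fact_budget("cat is alive"))
print(fact_budget("a dog fell into river and hunted fish"))
```

Under this rule the three-word cat message is capped at 9 facts and the eight-word dog message at 24, while longer messages keep the existing prompt behaviour.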
6 The dev-loop chunk: anatomy of a clean yield
At the opposite end of the substance spectrum is the workflow description from xenonfun in #safiersemantics:
The loop is now: edit a part / CSS / HTML → node shot.js … out.png <action> → look. ~6 seconds, zero recompile. Verified it renders identically to the built bundle. You only cargo build + deploy-hub.sh once you're happy, to ship it to the running hub.
Substance, density, and a clear graph. The model produced 105 facts, including this rich semantic skeleton:
(workflow:dev-loop-xenonfun, rdf:type, ex:DevelopmentWorkflow) 0.95
(workflow:dev-loop-xenonfun, ex:hasStep, action:edit-files) 0.95
(workflow:dev-loop-xenonfun, ex:hasStep, action:run-shot-js) 0.95
(workflow:dev-loop-xenonfun, ex:hasStep, action:view-output) 0.95
(workflow:dev-loop-xenonfun, ex:hasDuration, "6") 0.95
(workflow:dev-loop-xenonfun, ex:durationUnit, "seconds") 0.95
(workflow:dev-loop-xenonfun, ex:requiresRecompile, false) 0.95
(action:edit-files, ex:involvesFileType, filetype:part) 0.95
(action:edit-files, ex:involvesFileType, filetype:css) 0.95
(action:edit-files, ex:involvesFileType, filetype:html) 0.95
(filetype:css, ex:fullName, "Cascading Style Sheets") 0.95
(filetype:html, ex:fullName, "HyperText Markup Language") 0.95
(file:shot.js, rdf:type, ex:JavaScriptFile) 0.99
(file:shot.js, ex:executedBy, software:node-js) 0.99
(file:shot.js, ex:produces, file:out.png) 0.95
(software:node-js, rdf:type, ex:JavaScriptRuntime) 0.99
(action:run-shot-js, ex:usesRuntime, software:node-js) 0.99
(file:out.png, rdf:type, ex:ImageFile) 0.99
(file:out.png, ex:format, "PNG") 0.99
(file:out.png, ex:isOutputOf, file:shot.js) 0.95
This is properly typed, properly connected, and properly cross-referenced. The workflow has three steps, each step is a typed action, each action references the file types it touches, each file references the runtime it runs under, the runtime is typed, the output file is typed. A future recall like "how does xenonfun rebuild the page?" can walk this graph without any vector-similarity guesswork. The maturity 0 confidences are mostly 0.95–0.99; the model is sure about everything because the source text was concrete.
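What walking this graph means can be shown with a breadth-first traversal over a subset of the triples listed above. This is an illustrative sketch, not donto-memory's recall implementation; `walk` is a hypothetical helper and the triples are held in a plain Python list.

```python
from collections import deque

# A subset of the dev-loop triples from the listing (confidences omitted).
TRIPLES = [
    ("workflow:dev-loop-xenonfun", "ex:hasStep", "action:edit-files"),
    ("workflow:dev-loop-xenonfun", "ex:hasStep", "action:run-shot-js"),
    ("workflow:dev-loop-xenonfun", "ex:hasStep", "action:view-output"),
    ("action:edit-files", "ex:involvesFileType", "filetype:css"),
    ("action:run-shot-js", "ex:usesRuntime", "software:node-js"),
    ("file:shot.js", "ex:produces", "file:out.png"),
]


def walk(start, triples):
    """Breadth-first walk along subject -> object edges from `start`.

    Returns the traversed (subject, predicate, object) edges in visit order.
    """
    edges = {}
    for subject, predicate, obj in triples:
        edges.setdefault(subject, []).append((predicate, obj))
    seen = {start}
    queue = deque([start])
    visited_edges = []
    while queue:
        node = queue.popleft()
        for predicate, obj in edges.get(node, []):
            visited_edges.append((node, predicate, obj))
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return visited_edges


for edge in walk("workflow:dev-loop-xenonfun", TRIPLES):
    print(edge)
```

In this subset the walk reaches the steps, the file type, and the runtime, but not `file:out.png`, because no listed triple links an action to `file:shot.js`. Whether the full 105-fact chunk contains such a link is not visible from the excerpt.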
The contrast with the cat example is the lesson. The same prompt and the same model produce a tightly-connected graph on a substantive input and a speculative cloud on a barren one. donto-memory's fact-count yield as a single quality metric will mislead — the cat chunk got 87, the dev-loop chunk got 105, but the second is meaningfully more recoverable.
7 What the model marks as speculation
The corpus contains 1,653 facts, of which exactly 28
carry hypothesis_only: true. That is 1.7%. By
construction the model is supposed to use this flag for inferred
facts it isn't confident about. The cat-Schrödinger speculations are
in the set. So are a handful of guesses about which Discord guild
houses which channel, and a few sociolinguistic readings (a message
ending in "haha" tagged ex:hasEmotionalTone "playful"
with hypothesis_only). On the whole, the model is appropriately
sparing with the flag. It is not using it to soft-
mark every inference — most of the inferred facts (~28% of the
corpus) are unmarked. The flag does seem to specifically mark
"this is a real reach" facts rather than "this
extrapolates beyond literal content" facts.
For an agent reading donto-memory output, this means
polarity = "asserted" is not a high-confidence filter —
0.85-0.95 confidence inferred facts are mixed in. The right filter
for "things I'm sure about" is
WHERE hypothesis_only IS NOT TRUE AND confidence >= 0.9
or similar.
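The same filter can be applied client-side when facts arrive as decoded JSON. The sketch below assumes each fact is a dict carrying `hypothesis_only` and `confidence` keys; that record shape and the `is_confident` helper are assumptions, not a documented response format.

```python
def is_confident(fact, min_confidence=0.9):
    """Mirror `hypothesis_only IS NOT TRUE AND confidence >= 0.9`.

    A missing or false hypothesis_only flag passes; a missing confidence
    is treated as 0.0 and therefore fails.
    """
    return (fact.get("hypothesis_only") is not True
            and fact.get("confidence", 0.0) >= min_confidence)


# Rows adapted from the cat table in section 5.
FACTS = [
    {"predicate": "ex:asserted", "confidence": 0.95},
    {"predicate": "ex:observed", "confidence": 0.7},
    {"predicate": "ex:isSchrodingerCat", "confidence": 0.4,
     "hypothesis_only": True},
]

sure = [f["predicate"] for f in FACTS if is_confident(f)]
print(sure)
```

Only the `ex:asserted` row survives: the observation fails on confidence and the Schrödinger fact fails on both conditions.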
8 Identity drift across messages
Three distinct chunks describe xenonfun:
| Message | Subject IRI minted |
|---|---|
| "model override for agent" (#safiersemantics) | xenonfun |
| "how much memory it sucking down?" (#general) | xenonfun |
| "who's dog. is this now just about…" (#donto) | discord:user:xenonfun |
Two different subject IRIs for the same person. And those are just
two of the patterns; ajaxdavis appears across chunks as
ajaxdavis, discord:user:ajaxdavis, and once
as user:ajaxdavis. The substrate's identity-lens
mechanism is the right home for resolving this (the
likely_identity_v1 lens at confidence ≥0.85), but the
substrate-side identity edges aren't being minted automatically.
A recall by subject = "xenonfun" today will miss the
chunk where the bot's LLM used the longer IRI.
This is a fixable gap with three layers of intervention, in order of effort:
- Prompt: the system prompt currently says "Reuse existing donto vocabulary where obvious". Add an explicit convention block listing the canonical IRI shapes for Discord entities: discord:user:<handle>, discord:channel:<name>, discord:message:<id>, and a last-resort fallback instruction "use a stable bare handle if canonical form is uncertain". This won't fix every case but should cut drift by half.
- Post-processing: in the semantic-claim module's ingest, normalise subject/object IRIs against a small regex dictionary ("any subject that looks like a Discord handle, snap to discord:user:<handle>"). This is a write-time canonicalisation. Cheap; preserves provenance via the original-IRI metadata field.
- Identity edges: after each /memorize, the worker (sleep path) walks the new facts and mints donto_same_referent rows between IRIs that name the same Discord user. This is the substrate-native solution and is where M11.x identity-cluster work would land.
Path 1 is a 30-minute prompt change. Path 2 is an afternoon of canonicalisation logic. Path 3 needs a new sleep-path operator and is the right end-state.
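The second path, write-time canonicalisation, might look like the sketch below. Everything here is an assumption made for illustration: the function names, the `known_handles` allow-list (used so that bare IRIs such as file names are not snapped by accident), and the `original_iri` key standing in for the original-IRI metadata field.

```python
import re

BARE_HANDLE = re.compile(r"^[A-Za-z0-9_.]+$")
# IRI prefixes observed for users in the corpus, besides the bare handle.
USER_PREFIXES = ("discord:user:", "user:")


def canonical_user_iri(iri, known_handles):
    """Snap an IRI to discord:user:<handle> when it names a known handle."""
    handle = iri
    for prefix in USER_PREFIXES:
        if iri.startswith(prefix):
            handle = iri[len(prefix):]
            break
    if BARE_HANDLE.match(handle) and handle in known_handles:
        return f"discord:user:{handle}"
    return iri


def canonicalise_fact(fact, known_handles):
    """Return a copy of `fact` with subject/object snapped to canonical IRIs.

    The pre-snap value is kept under the hypothetical "original_iri" key so
    provenance is not lost.
    """
    out = dict(fact)
    for role in ("subject", "object"):
        original = fact[role]
        snapped = canonical_user_iri(original, known_handles)
        if snapped != original:
            out[role] = snapped
            out["original_iri"] = {**out.get("original_iri", {}), role: original}
    return out


HANDLES = {"xenonfun", "ajaxdavis", "girvo"}
print(canonicalise_fact(
    {"subject": "user:ajaxdavis", "predicate": "ex:asserted",
     "object": "ex:proposition:cat-alive"},
    HANDLES,
))
```

The allow-list could be populated from the author prefix of each message text, which the bot already knows at write time.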
9 Concrete opportunities
Three high-impact changes follow directly from this audit, plus one integration task that is still pending:
9.1 — Predicate alignment
The corpus has 638 distinct predicates from 1,653 facts.
That is roughly 1 unique predicate per 2.6 statements. Some are
genuinely distinct (ex:executedBy,
ex:produces, ex:hasLifeStatus); many are
slight variants of each other (ex:hasUsername vs
ex:hasName vs ex:hasHandle;
ex:participatedInSession vs
ex:isParticipantOf vs ex:hasParticipant).
The substrate ships donto_predicate_alignment for
exactly this — and donto align auto is already a CLI
command. A one-off run over the corpus would collapse the predicate
count by 40-60% and dramatically improve recall by predicate filter.
This is the cheapest single quality win.
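The effect of alignment on the predicate count can be illustrated with a hand-written mapping. This sketch does not reproduce what `donto align auto` or `donto_predicate_alignment` do internally; the mapping tables are assumptions about which variants are equivalent. `ex:hasParticipant` is treated as the inverse of `ex:participatedInSession`, so aligning it swaps subject and object.

```python
# Assumed equivalences between variant predicates (illustrative only).
ALIGNMENT = {
    "ex:hasUsername": "ex:hasName",
    "ex:hasHandle": "ex:hasName",
    "ex:isParticipantOf": "ex:participatedInSession",
}
# Assumed inverse predicates: aligning them swaps subject and object.
INVERSES = {
    "ex:hasParticipant": "ex:participatedInSession",
}


def align(triple):
    """Rewrite one (subject, predicate, object) triple onto canonical predicates."""
    subject, predicate, obj = triple
    if predicate in INVERSES:
        return (obj, INVERSES[predicate], subject)
    return (subject, ALIGNMENT.get(predicate, predicate), obj)


def distinct_predicates(triples):
    return {predicate for _, predicate, _ in triples}


BEFORE = [
    ("xenonfun", "ex:hasUsername", "xenonfun"),
    ("xenonfun", "ex:hasName", "xenonfun"),
    ("xenonfun", "ex:hasHandle", "xenonfun"),
    ("xenonfun", "ex:participatedInSession", "session:a"),
    ("xenonfun", "ex:isParticipantOf", "session:a"),
    ("session:a", "ex:hasParticipant", "xenonfun"),
]
AFTER = {align(t) for t in BEFORE}

print(len(distinct_predicates(BEFORE)), "->", len(distinct_predicates(AFTER)))
```

In this toy sample six predicates collapse to two, and the six triples collapse to two distinct statements, which is the kind of reduction that makes a recall filtered by predicate return everything it should.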
9.2 — Boilerplate suppression in the system prompt
The Discord skeleton facts (omega-bot rdf:type Agent, episodic record rdf:type EpisodicMemoryChunk, session rdf:type DiscordSession) are mechanical and rederivable from donto-memory's overlay tables. The system prompt could be expanded to say:
The donto-memory runtime already records: which agent holds the memory, the episodic chunk's record IRI, the holder, the session IRI, and the time. DO NOT re-extract those structural facts. Focus on facts implied by the message content itself.
This alone would knock 15–20 facts off every call (so ~300 facts saved across the corpus), shorten the LLM response, save tokens, and remove the repetition in the substrate that predicate alignment currently has to clean up.
9.3 — Length-conditional yield
The current prompt asks for "30+ statements from sentence-length chunk, 100+ from paragraph" regardless of input length. On 3-word messages this produces the Schrödinger cat. A length-conditional clause could read:
Aim for one fact per 2-3 words of input content (excluding the <user> in #<channel>: prefix). Under-yield rather than over-yield on short utterances; the next call can fill gaps if needed.
Such a clause would map the cat chunk to ~4 facts instead of ~67, and would probably push the average call from ~76s to ~40-50s on short messages without losing anything important.
9.4 — Recall integration (still pending)
The bot is writing memories but not yet reading them (integration-patterns §1.2). The whole point of the corpus is to be available on the response hot path. A 17-message corpus is small enough to walk in 30–80 ms per recall (per the v0.1.0 paper), so latency is not a blocker. This is the single biggest open task in the integration backlog.
10 Conclusion
The corpus is nine hours old and seventeen messages deep, but it is sufficient to read every chunk and every triple by hand. Three things are working: the structural Discord-entity skeleton is built consistently per chunk; content extraction on substantive messages yields tightly-connected typed graphs; the hypothesis_only flag is used sparingly and accurately. Three things need attention: a high boilerplate share that's mechanically rederivable; identity drift across chunks (the same user reads as several distinct subjects); and an over-yield on trivial inputs that puts speculative Schrödinger-style facts permanently into the substrate. None of these is hard. Predicate alignment is a single CLI run. Boilerplate suppression is a system-prompt edit. Length-conditioned extraction is another paragraph in the prompt. Identity convergence is harder but can be done in steps, and the substrate already supports the machinery. The integration's largest remaining open task is to read the memories on the response path: the bot is write-only today, and even the most expensive memorize calls are producing data nobody is consulting.
The qualitative impression: the system is sharper than its defaults. The substrate handles append-only paraconsistent storage correctly; donto-memory's pipeline correctly produces typed graphs from raw chat. The remaining work is mostly about being less prolix.