An end-to-end description of donto's vision and its rebuilt extraction engine, with a fully-measured single-document run — written for an external reviewer. 2026-06-03.
For the reviewer (e.g. ChatGPT Pro Research). This document is written to give you enough of donto's vision and system to critique it usefully. Part 1 is the vision (why we extract the way we do). Part 2 is the engine. Part 3 is a real run with exact numbers. Part 4 is what we'd most like your feedback on. Everything in Part 3 is measured on the production box today, not estimated. A companion report ("How the OpenCode Extraction Engine Works (and Where It Breaks)", same /research/ directory) covers the earlier failure analysis.
donto is a bitemporal, paraconsistent, evidence-first knowledge substrate built for the age of generative abundance. The thesis: for sixty years, generating typed knowledge was the scarce, human-bottlenecked step in every knowledge system; a guided frontier LLM now emits an essentially unbounded, multi-directional space of properties and relations about any entity for ~$0.0001 each. So the hard problem flipped from "how do we generate enough?" to "where do we put an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?" Vector DBs and ordinary knowledge graphs collapse (dedup, pick a winner, invalidate-on-conflict). donto does the opposite: it holds incompatible claims forever as legal state, anchors each to its source, links them with typed argument edges, and re-ranks by reality over time instead of deleting on conflict. The extraction engine described here is the front door: its job is maximal faithful capture, deferring typing/alignment/identity/joining to query time.
Six principles drive every design decision below.
likelySameAs and a nameDiscrepancyWith. Nothing is reconciled at extraction.
surface_text → donto_span → donto_evidence_link → revision → content-addressed blob). "Show me the source for this claim" must always return the passage.
attestedBy, reportedIn, accordingTo), never as the extractor's own judgement.
The consequence for extraction: optimise for recall and faithfulness, never for tidiness, schema-conformance, or token economy. The operator's instruction this session was explicit: "extract every possible ontological fact about the chunk text; time, tokens, and money are not important."
A Python service (donto-api, FastAPI) drives OpenCode, an agentic CLI, running GLM-5.1 on a flat-rate "coding" subscription, inside a Docker container.
docker exec in.
prompt.txt + source.txt, run OpenCode with that dir as CWD, and read back the output file. No stdin/stdout plumbing.
flock over N slot files bounds concurrent OpenCode subprocesses across all consumers.
part table: every tool call, result, and reasoning) is exported to JSONL and published read-only at /opencode/logs, alongside a summary.
Naïvely, you ask the agent to "extract facts and write facts.json." This fails for high-recall extraction: the model generates the entire output as the argument of a single write tool call, which for a large output never completes inside the time cap, so the call never executes and nothing is captured, not even as assistant text (we verified this directly). The fix is structural:
cat >>-appends them to facts.jsonl in small bash batches (~30–60 facts), re-scanning the source between batches. Each batch is a durable commit boundary: a timeout keeps everything written so far.
exit=0); the controller's only intra-run logic is retry-on-empty (re-run a run that produced zero facts due to an error).
{"s":subj,"p":pred,"o":obj,"a":anchor,"c":conf,"h":hypothesis?} with the object given directly (an ex: IRI for entities, a bare number/string for literals). The Python parser reconstructs the full donto fact (IRI-vs-literal, datatype incl. gYear/date, anchor) losslessly. This roughly tripled facts-per-unit-time vs verbose JSON and raised the anchor rate, with no loss of information.
A single pass, however long, still misses facts. So the durable controller (Python/Temporal, not the agent) owns an outer loop:
facts.jsonl with everything found so far and prepends a gap-finding preamble: "read what's already captured, then append ONLY what was missed: unused ontological lenses, finer relations, deeper inferences." The agent reads the seed + source and appends only genuinely new facts.
(subject,predicate,object) across passes.
This is "controller-owned loop, agent-as-bounded-worker," not "let the agent loop until it thinks it's done." It gives a durable commit boundary, retry semantics, and a measurable stop criterion.
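The controller-owned loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in donto-api/opencode_extract.py: `parse_compact`, `loop_until_dry`, `run_pass`, and `max_passes` are hypothetical names, and stopping when a pass adds fewer than `dry_threshold` new facts is an assumed reading of the dry criterion. `run_pass` stands in for one bounded OpenCode run that receives the seed lines and returns the lines it appended.

```python
import json
from typing import Callable, Iterable

# Dedup key: subject, predicate, and the object serialised as JSON so that
# the number 1855 and the string "1855" stay distinct.
Triple = tuple[str, str, str]


def parse_compact(lines: Iterable[str]) -> dict[Triple, dict]:
    """Parse compact JSONL lines, keeping the first record per (s, p, o)."""
    facts: dict[Triple, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            # A run cut off by the time cap may leave a partial last line.
            continue
        if not isinstance(rec, dict) or not all(k in rec for k in ("s", "p", "o")):
            continue
        key = (str(rec["s"]), str(rec["p"]), json.dumps(rec["o"], sort_keys=True))
        facts.setdefault(key, rec)
    return facts


def loop_until_dry(
    run_pass: Callable[[list[str]], Iterable[str]],
    dry_threshold: int = 10,
    max_passes: int = 5,  # hypothetical safety bound, not from the source
) -> tuple[dict[Triple, dict], list[int]]:
    """Re-run the agent, seeded with everything found so far, until a pass is dry."""
    seen: dict[Triple, dict] = {}
    new_per_pass: list[int] = []
    for _ in range(max_passes):
        seed_lines = [json.dumps(rec) for rec in seen.values()]
        emitted = parse_compact(run_pass(seed_lines))
        new = {k: v for k, v in emitted.items() if k not in seen}
        seen.update(new)
        new_per_pass.append(len(new))
        if len(new) < dry_threshold:  # assumed stop rule
            break
    return seen, new_per_pass
```

Because the loop lives in the controller, the per-pass counts in `new_per_pass` are the measurable stop criterion, and a failed or empty pass can be retried without losing what earlier passes committed.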
The prompt assumes no subject matter. It states the abundance philosophy (above), then asks the agent to extract every entity (named or implied), attribute, relationship (both directions), event (participants/time/place/cause/result), quantity (+ qualifier), containment hierarchy, provenance/epistemics, contradiction, and to decode figurative language generically. The key recall driver is an explicit menu of ontological lenses — directions to form predicates from — presented as examples to spark creativity, with an explicit instruction to invent new predicates and new lenses of its own:
taxonomy/type · mereology (part/whole) · identity/persistence · topology/spatial · chronology/time · causation/etiology · teleology/function · agency & thematic roles · epistemology · deontology/norms · axiology/value · modality · qualia structure · lexical semantics · social ontology · process/event structure · constitution/material · dependence/grounding · genetic/provenance · comparison/similarity · quantity/measurement · disposition/capacity · speech-acts · phenomenology — "that was just a list of examples; invent your own."
The lenses are universal (no domain bias) and, as Part 3 shows, produce both broad coverage and a long tail of model-invented predicates.
Parsed facts go to helpers.ingest_facts → dontosrv → Postgres. The source is registered first (document + revision) so anchors target a real revision_id; each fact's surface_text is re-found in the source (exact, then whitespace/case-tolerant) and materialised as a donto_span + donto_evidence_link. Hypothesis-only facts are preserved with polarity=unknown, maturity=0. Predicate alignment and entity identity-resolution are deferred to query time (running them per-row at ingest hammered the substrate at abundance scale and is, more importantly, against the vision).
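The two-stage re-find (exact, then whitespace/case-tolerant) could look roughly like the sketch below. It is an assumption about the general technique, not the code in donto-api/helpers.py; `find_anchor` is a hypothetical name, and the tolerant stage is implemented here as a case-insensitive regex in which any run of whitespace matches any other.

```python
import re
from typing import Optional


def find_anchor(source: str, surface_text: str) -> Optional[tuple[int, int]]:
    """Return (start, end) character offsets of surface_text in source, or None."""
    if not surface_text.strip():
        return None

    # Stage 1: exact substring match.
    start = source.find(surface_text)
    if start != -1:
        return start, start + len(surface_text)

    # Stage 2: tolerate differences in whitespace and letter case.
    tokens = surface_text.split()
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    match = re.search(pattern, source, flags=re.IGNORECASE)
    if match:
        return match.start(), match.end()
    return None
```

The offsets always index the registered source text, so a span built from them quotes the passage as it appears in the source rather than the model's slightly normalised copy. A fact whose surface text cannot be found at either stage gets no span, which is how an anchored rate below 100% would arise under this sketch.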
Source. One row of the University of Newcastle Colonial Frontier Massacres in Australia 1788–1930 dataset, rendered to a 12,967-char text document: "Attack on NMP detachment — Combo James, Colin and Hamlet at Rannes station (1855)". Structured fields + multi-source narrative (period newspapers, depositions, later historians).
| Metric | Value |
|---|---|
| Unique facts (deduped s,p,o) | 2,333 (2,390 emitted lines, 57 dups) |
| Distinct predicates | 1,320 |
| Distinct subjects (entities) | 154 |
| Anchored (carry an exact source span) | 2,168 / 2,333 = 93% |
| Object is an entity-IRI (graph edge) | 658 (28%) · literal 1,675 (72%) |
| Pass 1 / Pass 2 | 1,882 new / +451 new → 2,333 |
| Pass 3 | interrupted by GLM usage cap (not by convergence) |
One 13-KB document yielded 2,333 evidence-anchored claims across 1,320 distinct predicates, 93% anchored.
Pass 2 added 24% on top of pass 1 (451 / 1,882); equivalently, a single pass would have missed ~19% of the two-pass total (451 / 2,333). 451 ≫ the dry threshold (10), so pass 3 was warranted and had already started when the usage cap hit; the document was not yet exhausted. This directly validates controller-owned loop-until-dry for the "never miss a fact" objective.
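The two percentages are the same 451 new facts measured against different denominators. A short check using only the figures from the table above (variable names are illustrative):

```python
pass1_new = 1882  # unique facts after pass 1
pass2_new = 451   # additional unique facts from pass 2

total = pass1_new + pass2_new
gain_over_pass1 = pass2_new / pass1_new    # pass 2 relative to pass 1
missed_by_single_pass = pass2_new / total  # share of the two-pass total

print(f"total={total}")
print(f"pass 2 added {gain_over_pass1:.1%} on top of pass 1")
print(f"a single pass would have missed {missed_by_single_pass:.1%} of the total")
```

This prints a total of 2333, a gain of 24.0%, and a miss rate of 19.3%.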
| Configuration | Facts | Notes |
|---|---|---|
| abundance prompt, 30-min cap, single pass | ~920 | earlier engine |
| agnostic + lenses + compact, 1-hr cap, single pass | 1,882 | this run, pass 1 |
| + loop-until-dry (2 passes) | 2,333 | this run, cumulative |
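Single-pass yield roughly doubled over the earlier engine (~920 → 1,882 facts). The longer cap is one driver (the model keeps surfacing real facts when given time), but the prompt and output format changed in the same step, so the gain cannot be attributed to the cap alone. The second pass added another ~24%.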
Heuristic categorisation of the 1,320 predicates by lens, counted as predicate instances (one per fact, so the counts below total the 2,333 unique facts):
identity/naming 290 · chronology/time 284 · taxonomy/type 198 · causation 167 ·
spatial/topology 161 · agency/role 141 · provenance/epistemic 140 · quantity 77 ·
social/kinship 75 · mereology 57 · contradiction 17 · modality 15
+ 711 instances of predicates the heuristic could NOT categorise — i.e. lenses/
predicates the model invented beyond the listed menu.
The spread confirms the lenses work as intended (broad, multi-directional), and the large uncategorised tail confirms the "invent your own" instruction is firing — the model is creating ontological directions we did not enumerate.
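As a consistency check on the breakdown, the per-lens counts plus the uncategorised tail should account for every unique fact. The snippet below uses only the numbers listed above; the dictionary layout is illustrative.

```python
lens_counts = {
    "identity/naming": 290,
    "chronology/time": 284,
    "taxonomy/type": 198,
    "causation": 167,
    "spatial/topology": 161,
    "agency/role": 141,
    "provenance/epistemic": 140,
    "quantity": 77,
    "social/kinship": 75,
    "mereology": 57,
    "contradiction": 17,
    "modality": 15,
}
uncategorised = 711

categorised = sum(lens_counts.values())
total = categorised + uncategorised
uncategorised_share = uncategorised / total

print(categorised, total, f"{uncategorised_share:.1%}")
```

The categorised instances sum to 1,622 and the total to 2,333, matching the unique-fact count, so roughly 30% of instances fall outside what the heuristic could assign to a listed lens.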
ex:depredations-on-flocks | decodedAs | "Aboriginal people killing or stealing sheep" (euphemism decode)
ex:capricornian-1913-07-19 | statesDistanceFromMurder | "two or three hundred yards" (source-attributed detail)
ex:aboriginal-attackers-rannes | tacticalPlanningEvidenced | "coalition building, intelligence gathering, timing, coordination, swift withdrawal"
ex:hamlet-nmp | rdfType | ex:... ; ex:nmp-regimental-system | assignedRegNoTo | ex:hamlet-nmp "Reg No 53"
ex:attack-rannes-1855 | countDiscrepancyWith | ... (paraconsistent contradiction)
ex:henry-walker | escapedBySleepingAt | ex:rannes-headstation ["was sleeping at the station over the cre"]
ex:nmp-corps | alsoKnownAs | "Native Mounted Police"
These show second-order capture (causation, motive, tactics), provenance (attestedBy, authoredBy, narratedBy, certaintyMarker, sourceDepth), contradiction (countDiscrepancyWith, corroboratedBy), and generic euphemism-decoding, all from a domain-neutral prompt.
Each pass is dominated by output-token generation at GLM's rate (~40 tok/s). On an earlier comparable run, the actual bash/file tool calls accounted for ~0% of the wall-time; the rest was the model generating the facts (with a minority spent on reasoning). Per-document time is therefore set by (facts × tokens-per-fact) ÷ generation-rate. Implications: the compact format helps (fewer tokens/fact); per-row wall-clock cannot beat the model's generation speed; throughput across many documents is a concurrency problem, not a per-row one.
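The timing relation can be turned around to estimate the effective token cost per new fact. The sketch below is a back-of-envelope reading, not a measurement: it assumes the whole wall-time of a pass is generation at ~40 tok/s, and it takes the pass durations (~48 min and ~25 min) from the summary table in this document. The function names are hypothetical.

```python
GEN_RATE_TOK_PER_S = 40.0  # approximate GLM generation rate quoted above


def pass_seconds(facts: int, tokens_per_fact: float,
                 rate: float = GEN_RATE_TOK_PER_S) -> float:
    """Wall-time of one pass: (facts x tokens-per-fact) / generation-rate."""
    return facts * tokens_per_fact / rate


def implied_tokens_per_fact(new_facts: int, minutes: float,
                            rate: float = GEN_RATE_TOK_PER_S) -> float:
    """Invert the relation: all tokens generated in the pass, per new fact.

    This counts reasoning and duplicate lines too, so it is an upper bound
    on the size of a single emitted fact.
    """
    return minutes * 60.0 * rate / new_facts


pass1 = implied_tokens_per_fact(1882, 48)  # pass 1: 1,882 new facts in ~48 min
pass2 = implied_tokens_per_fact(451, 25)   # pass 2: 451 new facts in ~25 min
print(f"pass 1: ~{pass1:.0f} tokens per new fact")
print(f"pass 2: ~{pass2:.0f} tokens per new fact")
```

Under these assumptions pass 1 costs about 61 generated tokens per new fact and pass 2 about 133, which would suggest each later pass pays more per fact it recovers.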
The GLM "coding" subscription enforces a rolling 5-hour usage limit (error 1308: "Usage limit reached for 5 hour"). Our exhaustive loop-until-dry is extremely generation-heavy (2,333 facts × multiple passes × many tool-turns), so it consumed the window's allowance during a single multi-document session and began returning HTTP 429 until reset. This is the dominant economic/throughput constraint for a full corpus run (here: 2,818 documents), more than CPU or per-row latency.
hypothesis_only). At 2,333 facts/doc, what's the right way to evaluate faithfulness at scale: sampling? adversarial verification of a subset? downstream task-lift?
surface_text works at 93% and is non-negotiable (it's the substrate's value).
| Metric | Value |
|---|---|
| Source size | 12,967 chars |
| Unique facts | 2,333 |
| Emitted lines / dups | 2,390 / 57 |
| Distinct predicates | 1,320 |
| Distinct subjects | 154 |
| Anchored | 2,168 (93%) |
| Entity-IRI objects / literals | 658 / 1,675 |
| Pass 1 / Pass 2 new | 1,882 / 451 |
| Pass 1 duration | ~48 min (to 1-hr cap) |
| Pass 2 duration | ~25 min |
| Stop cause | external GLM 5-hour usage cap (not convergence) |
| Anchor method | surface_text exact → whitespace/case-tolerant re-find |
| Concern | Where |
|---|---|
| Agentic runner (docker exec, flock, file exchange, session-log export) | donto-api/opencode_agent.py |
| Agnostic prompt + ontological lenses | donto-api/prompts/extract_broad.txt |
| Compact-JSONL parse, incremental append, loop-until-dry, retry-on-empty | donto-api/opencode_extract.py |
| Ingestion + evidence anchoring + source registration | donto-api/helpers.py |
| Temporal workflow/activities | donto-api/workflows.py, activities.py |
| Substrate (Rust) + Postgres | dontosrv (:7879), donto-pg |
| Live session logs (tool calls + reasoning) | https://genes.apexpots.com/opencode/logs/ |
Generated 2026-06-03 from a live run on the donto-db box.
Companion: "How the OpenCode Extraction Engine Works (and Where It Breaks)" and the "Generative Abundance" report, both at https://genes.apexpots.com/research/.