donto — Generative-Abundance Knowledge Extraction: Vision, System, and a Measured Run

An end-to-end description of donto's vision and its rebuilt extraction engine, with a fully measured single-document run, written for an external reviewer. 2026-06-03.

For the reviewer (e.g. ChatGPT Pro Research). This document is written to give you enough of donto's vision and system to critique it usefully. Part 1 is the vision (why we extract the way we do). Part 2 is the engine. Part 3 is a real run with exact numbers. Part 4 is what we'd most like your feedback on. Everything in Part 3 is measured on the production box today, not estimated. A companion report ("How the OpenCode Extraction Engine Works (and Where It Breaks)", same /research/ directory) covers the earlier failure analysis.


0. One-paragraph orientation

donto is a bitemporal, paraconsistent, evidence-first knowledge substrate built for the age of generative abundance. The thesis: for sixty years, generating typed knowledge was the scarce, human-bottlenecked step in every knowledge system; a guided frontier LLM now emits an essentially unbounded, multi-directional space of properties and relations about any entity for ~$0.0001 each. So the hard problem flipped from "how do we generate enough?" to "where do we put an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?" Vector DBs and ordinary knowledge graphs collapse (dedup, pick a winner, invalidate-on-conflict). donto does the opposite: it holds incompatible claims forever as legal state, anchors each to its source, links them with typed argument edges, and re-ranks by reality over time instead of deleting on conflict. The extraction engine described here is the front door: its job is maximal faithful capture, deferring typing/alignment/identity/joining to query time.


1. The vision (why the extractor behaves as it does)

Six principles drive every design decision below.

  1. Abundance / emit-free. Generation is cheap and getting ~10× cheaper per year. The substrate wants an unbounded, multi-directional firehose. There is no upper bound on facts or on distinct predicates — more is strictly better. A "proliferation" of freely-minted predicates (the live store has ~900K) is the signature of abundance, not a defect.
  2. Defer joining to query time. Typing, ontology alignment, identity resolution, and joins are not done at write time. The extractor emits free/untyped claims; alignment happens lazily at query time (predicate-closure, identity-as-hypothesis). This is the substrate's native strength and it is why the extractor is allowed to mint predicates with abandon.
  3. Paraconsistent / contradiction-preserving. Contradictions are legal and wanted. Two sources naming a victim "Anthony" vs "Thomas" Cox → both entities are kept, plus a likelySameAs and a nameDiscrepancyWith. Nothing is reconciled at extraction.
  4. Evidence-first. Every directly-stated claim is anchored to an exact source span (surface_text → donto_span → donto_evidence_link → revision → content-addressed blob). "Show me the source for this claim" must always return the passage.
  5. No authority is ground truth. Every newspaper, court, official, deposition, and historian is an interpretive witness. Claims are framed as source-attestations (attestedBy, reportedIn, accordingTo), never as the extractor's own judgement.
  6. Domain-neutral core. donto is a substrate; genealogy, agentic memory, and (planned) jsonresume→jobs are example consumers that test it. The extractor must assume no subject matter — baking a domain's vocabulary into the engine is a bug. (We hit this exact bug mid-development and corrected it; see Part 2.4.)

The consequence for extraction: optimise for recall and faithfulness, never for tidiness, schema-conformance, or token economy. The operator's instruction this session was explicit — "extract every possible ontological fact about the chunk text; time, tokens, and money are not important."


2. The extraction engine

2.1 Runtime

A Python service (donto-api, FastAPI) drives OpenCode — an agentic CLI — running GLM-5.1 on a flat-rate "coding" subscription, inside a Docker container.

2.2 The output mechanism (the hard-won core)

Naïvely, you ask the agent to "extract facts and write facts.json." This fails for high-recall extraction: the model generates the entire output as the argument of a single write tool call, which for a large output never completes inside the time cap, so the call never executes and nothing is captured, not even as assistant text (we verified this directly). The fix is structural: the agent appends facts to facts.jsonl incrementally, in a compact JSONL format, rather than emitting the whole output in one write call.

2.3 Controller loop-until-dry (the exhaustiveness mechanism)

A single pass — however long — still misses facts. So the durable controller (Python/Temporal, not the agent) owns an outer loop:

  1. Pass 1: broad extraction (up to a 1-hour cap), incremental append, OpenCode-decides-done.
  2. Pass 2+ (continuation): the controller seeds facts.jsonl with everything found so far and prepends a gap-finding preamble — "read what's already captured, then append ONLY what was missed: unused ontological lenses, finer relations, deeper inferences." The agent reads the seed + source and appends only genuinely new facts.
  3. Stop: the loop runs until a pass adds fewer than 10 new facts (dry-streak) or the pass cap (4) is reached. Facts are deduped by (subject, predicate, object) across passes.

This is "controller-owned loop, agent-as-bounded-worker," not "let the agent loop until it thinks it's done." It gives a durable commit boundary, retry semantics, and a measurable stop criterion.
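
The three steps above can be sketched as a small controller function. This is a minimal illustration, not the code in opencode_extract.py: run_pass and fake_pass are hypothetical stand-ins for launching one bounded OpenCode pass seeded with the facts found so far, and the sketch assumes that a single pass below the threshold ends the loop, whereas "dry-streak" in the text may mean something stricter.

```python
from typing import Callable, Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)


def loop_until_dry(
    run_pass: Callable[[int, List[Fact]], Iterable[Fact]],
    dry_threshold: int = 10,  # "adds < 10 new facts" from the text
    max_passes: int = 4,      # pass cap from the text
) -> Tuple[List[Fact], List[int]]:
    """Run bounded passes until one adds fewer than dry_threshold new facts."""
    seen: Set[Fact] = set()
    facts: List[Fact] = []
    new_per_pass: List[int] = []
    for pass_no in range(1, max_passes + 1):
        # The worker receives a copy of everything captured so far (the seed).
        emitted = run_pass(pass_no, list(facts))
        added = 0
        for fact in emitted:
            if fact not in seen:  # dedup on (subject, predicate, object)
                seen.add(fact)
                facts.append(fact)
                added += 1
        new_per_pass.append(added)
        if added < dry_threshold:  # assumption: one dry pass stops the loop
            break
    return facts, new_per_pass


def fake_pass(pass_no: int, seed: List[Fact]) -> List[Fact]:
    """Hypothetical worker: 25, then 12, then 4 new facts, plus a few repeats."""
    sizes = {1: 25, 2: 12, 3: 4}
    fresh = [(f"ex:s{pass_no}-{i}", "ex:p", "o") for i in range(sizes.get(pass_no, 0))]
    return seed[:3] + fresh  # re-emit seeded facts to exercise the dedup


facts, new_per_pass = loop_until_dry(fake_pass)
print(new_per_pass, len(facts))
```

Because the controller counts new facts itself, the stop criterion is a number it computes rather than something the agent reports.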

2.4 The prompt: content-agnostic + ontological lenses

The prompt assumes no subject matter. It states the abundance philosophy (above), then asks the agent to extract every entity (named or implied), attribute, relationship (both directions), event (participants/time/place/cause/result), quantity (+ qualifier), containment hierarchy, provenance/epistemics, contradiction, and to decode figurative language generically. The key recall driver is an explicit menu of ontological lenses (directions to form predicates from), presented as examples to spark creativity, with an explicit instruction to invent new predicates and new lenses of its own:

taxonomy/type · mereology (part/whole) · identity/persistence · topology/spatial · chronology/time · causation/etiology · teleology/function · agency & thematic roles · epistemology · deontology/norms · axiology/value · modality · qualia structure · lexical semantics · social ontology · process/event structure · constitution/material · dependence/grounding · genetic/provenance · comparison/similarity · quantity/measurement · disposition/capacity · speech-acts · phenomenology — "that was just a list of examples; invent your own."

The lenses are universal (no domain bias) and, as Part 3 shows, produce both broad coverage and a long tail of model-invented predicates.

2.5 Ingestion + anchoring

Parsed facts go to helpers.ingest_facts → dontosrv → Postgres. The source is registered first (document + revision) so anchors target a real revision_id; each fact's surface_text is re-found in the source (exact, then whitespace/case-tolerant) and materialised as a donto_span + donto_evidence_link. Hypothesis-only facts are preserved with polarity=unknown, maturity=0. Predicate alignment and entity identity-resolution are deferred to query time (running them per-row at ingest hammered the substrate at abundance scale and is, more importantly, against the vision).
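
The two-stage re-find (exact, then whitespace/case-tolerant) might look like the sketch below. find_span is a hypothetical name, SAMPLE is an invented sentence, and the tolerance rule (any whitespace run matches any other, case ignored) is an assumption; the actual logic in helpers.py is not shown in this report.

```python
import re
from typing import Optional, Tuple


def find_span(source: str, surface_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of surface_text in source, or None."""
    if not surface_text.strip():
        return None
    # Stage 1: exact substring match.
    start = source.find(surface_text)
    if start != -1:
        return start, start + len(surface_text)
    # Stage 2 (assumed rule): whitespace runs are interchangeable, case is ignored.
    pattern = r"\s+".join(re.escape(tok) for tok in surface_text.split())
    match = re.search(pattern, source, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None


# Invented sample text with a double space, a line break, and a capital letter.
SAMPLE = "Walker  was sleeping at\nthe Station over the creek."

print(find_span(SAMPLE, "over the creek"))                            # exact hit
print(find_span(SAMPLE, "was sleeping at the station over the cre"))  # tolerant hit
print(find_span(SAMPLE, "not in source"))                             # no anchor
```

A fact whose surface_text returns None here would simply carry no span; it is not discarded.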


3. A measured run: event 10690 (loop-until-dry, 1-hour passes)

Source. One row of the University of Newcastle Colonial Frontier Massacres in Australia 1788–1930 dataset, rendered to a 12,967-char text document: "Attack on NMP detachment — Combo James, Colin and Hamlet at Rannes station (1855)". Structured fields + multi-source narrative (period newspapers, depositions, later historians).

3.1 Headline numbers

Metric | Value
--- | ---
Unique facts (deduped s,p,o) | 2,333 (2,390 emitted lines, 57 dups)
Distinct predicates | 1,320
Distinct subjects (entities) | 154
Anchored (carry an exact source span) | 2,168 / 2,333 = 93%
Object is an entity-IRI (graph edge) | 658 (28%) · literal 1,675 (72%)
Pass 1 / Pass 2 | 1,882 new / +451 new → 2,333
Pass 3 | interrupted by GLM usage cap (not by convergence)

One 13-KB document yielded 2,333 evidence-anchored claims across 1,320 distinct predicates, 93% anchored.

3.2 The loop-until-dry result (why a single pass is not enough)

Pass 2 added 24% on top of pass 1; equivalently, a single pass would have missed ~19% of the two-pass total. The 451 new facts far exceed the dry threshold (10), so pass 3 was warranted and was starting when the usage cap hit; that is, the document was not yet exhausted. This directly validates controller-owned loop-until-dry for the "never miss a fact" objective.
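
The two percentages are the same 451 facts measured against different denominators. A short check, using only the counts reported in 3.1:

```python
pass1_new = 1_882
pass2_new = 451
anchored = 2_168

total = pass1_new + pass2_new            # unique facts after two passes
gain_over_pass1 = pass2_new / pass1_new  # what pass 2 added relative to pass 1
missed_by_one_pass = pass2_new / total   # share of the total found only in pass 2
anchored_share = anchored / total        # facts carrying an exact source span

print(total)
print(f"{gain_over_pass1:.1%} {missed_by_one_pass:.1%} {anchored_share:.1%}")
# prints 2333, then: 24.0% 19.3% 92.9%
```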

3.3 Within-event evolution (same document, 10690)

Configuration | Facts | Notes
--- | --- | ---
abundance prompt, 30-min cap, single pass | ~920 | earlier engine
agnostic + lenses + compact, 1-hr cap, single pass | 1,882 | this run, pass 1
+ loop-until-dry (2 passes) | 2,333 | this run, cumulative

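The 1-hour single pass roughly doubled the yield of the earlier 30-min single pass (the model keeps surfacing real facts when given time), although the prompt and output format also changed between those two runs, so the gain cannot be attributed to the cap alone. The second pass added another ~24%.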

3.4 Ontological-lens coverage

Heuristic categorisation by lens of the facts carrying the 1,320 predicates (counts are predicate instances, one per fact, and sum to 2,333):

identity/naming 290 · chronology/time 284 · taxonomy/type 198 · causation 167 ·
spatial/topology 161 · agency/role 141 · provenance/epistemic 140 · quantity 77 ·
social/kinship 75 · mereology 57 · contradiction 17 · modality 15
+ 711 instances of predicates the heuristic could NOT categorise — i.e. lenses/
predicates the model invented beyond the listed menu.

The spread confirms the lenses work as intended (broad, multi-directional), and the large uncategorised tail confirms the "invent your own" instruction is firing — the model is creating ontological directions we did not enumerate.
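
The heuristic behind these counts is not described in this report. As a purely hypothetical illustration of how such a categoriser could work, the sketch below matches keyword fragments against predicate names; LENS_KEYWORDS, lens_of, and lens_counts are invented for the example, and only the sample predicates come from 3.5.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical keyword fragments per lens; not the real heuristic.
LENS_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "identity/naming": ("name", "knownas", "sameas"),
    "chronology/time": ("date", "year", "time", "before", "after"),
    "spatial/topology": ("distance", "location", "place", "near"),
    "contradiction": ("discrepancy", "contradict", "conflict"),
}


def lens_of(predicate: str) -> str:
    """Return the first lens whose keyword occurs in the predicate, else 'uncategorised'."""
    lowered = predicate.lower()
    for lens, keywords in LENS_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return lens
    return "uncategorised"


def lens_counts(predicates: Iterable[str]) -> Counter:
    """Count predicate instances (one per fact) by lens."""
    return Counter(lens_of(p) for p in predicates)


sample = ["alsoKnownAs", "countDiscrepancyWith", "statesDistanceFromMurder",
          "decodedAs", "tacticalPlanningEvidenced"]
counts = lens_counts(sample)
print(counts["uncategorised"])  # 2
```

A first-match rule like this is order-sensitive: nameDiscrepancyWith would be counted under identity/naming rather than contradiction, which is one reason to read the per-lens counts as approximate.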

3.5 Representative facts (the texture)

ex:depredations-on-flocks | decodedAs | "Aboriginal people killing or stealing sheep"   (euphemism decode)
ex:capricornian-1913-07-19 | statesDistanceFromMurder | "two or three hundred yards"     (source-attributed detail)
ex:aboriginal-attackers-rannes | tacticalPlanningEvidenced | "coalition building, intelligence gathering, timing, coordination, swift withdrawal"
ex:hamlet-nmp | rdfType | ex:... ; ex:nmp-regimental-system | assignedRegNoTo | ex:hamlet-nmp "Reg No 53"
ex:attack-rannes-1855 | countDiscrepancyWith | ...                                        (paraconsistent contradiction)
ex:henry-walker | escapedBySleepingAt | ex:rannes-headstation  ["was sleeping at the station over the cre"]
ex:nmp-corps | alsoKnownAs | "Native Mounted Police"

These show second-order capture (causation, motive, tactics), provenance (attestedBy, authoredBy, narratedBy, certaintyMarker, sourceDepth), contradiction (countDiscrepancyWith, corroboratedBy), and generic euphemism-decoding — all from a domain-neutral prompt.

3.6 Where the wall-clock goes (timestamp analysis, from the session DB)

Each pass is dominated by output-token generation at GLM's rate (~40 tok/s). On an earlier comparable run, the bash/file tool calls accounted for ~0% of wall time; the rest was the model generating facts, with a minority spent on reasoning. Per-document time is therefore set by (facts × tokens-per-fact) ÷ generation-rate. Implications: the compact format helps (fewer tokens/fact); per-row wall-clock cannot beat the model's generation speed; throughput across many documents is a concurrency problem, not a per-row one.
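
The relation above can be inverted to back out a rough token budget per fact from the pass 1 figures in Appendix A. Both function names are hypothetical. The inputs (~48 min, ~40 tok/s) are approximations, and the wall time also covers reasoning tokens and duplicate lines, so the result is a loose upper bound on tokens per kept fact, not a measurement.

```python
def minutes_for(facts: int, tokens_per_fact: float, tokens_per_second: float) -> float:
    """Wall-clock minutes implied by (facts x tokens-per-fact) / generation rate."""
    return facts * tokens_per_fact / tokens_per_second / 60.0


def implied_tokens_per_fact(minutes: float, facts: int, tokens_per_second: float) -> float:
    """Invert the same relation: tokens generated per kept fact, reasoning included."""
    return minutes * 60.0 * tokens_per_second / facts


# Pass 1: ~48 min, 1,882 new facts, ~40 tok/s.
budget = implied_tokens_per_fact(48, 1_882, 40)
print(round(budget, 1))                          # 61.2
print(round(minutes_for(1_882, budget, 40), 1))  # 48.0
```

Under these assumptions, halving tokens per fact would roughly halve per-document time, while nothing on the controller side can shorten it.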


4. Findings, constraints, and questions for the reviewer

4.1 The binding constraint we just discovered: a 5-hour usage cap

The GLM "coding" subscription enforces a rolling 5-hour usage limit (error 1308: "Usage limit reached for 5 hour"). Our exhaustive loop-until-dry is extremely generation-heavy (2,333 facts × multiple passes × many tool-turns), so it consumed the window's allowance during a single multi-document session and began returning HTTP 429 until reset. This is the dominant economic/throughput constraint for a full corpus run (here: 2,818 documents), more than CPU or per-row latency.

4.2 Open questions (where we want your critique)

  1. Exhaustiveness vs. cost. Given a hard "never miss a fact" objective and a flat-rate-but-rate-capped model, how would you balance loop-until-dry depth against the 5-hour usage cap across a 2,818-doc corpus? Is there a principled stopping point short of "dry" that captures, say, 95% of facts at a fraction of the generation?
  2. Diminishing returns. Pass 1 → 1,882, pass 2 → +451 (24%). Is continuing to pass 3/4 worth it, or is there a better marginal-value signal than a raw dry-count (e.g., novelty-weighted, or lens-coverage-based)?
  3. The 1,320-predicate tail. We deliberately defer alignment to query time, but is there a cheap write-time signal worth capturing per minted predicate (a one-line gloss?) that materially lowers later alignment cost without violating emit-free? Or does that re-introduce the write-time tax we're avoiding?
  4. Quality of abundance. 93% anchored is strong, but ~7% are inferences/decodes (correctly flagged hypothesis_only). At 2,333 facts/doc, what's the right way to evaluate faithfulness at scale — sampling? adversarial verification of a subset? downstream task-lift?
  5. Agentic CLI vs. direct API. For pure extraction we keep OpenCode for the flat-rate economics + file/agentic convenience, but it adds reasoning/loop overhead. Given the 5-hour cap, would a direct streamed completion (no agentic loop) extract more facts per unit of the usage budget? That's the metric that now matters.
  6. Throughput. The box is 4 vCPU/16 GB and network-bound on the model. With the per-row engine fixed, is raising concurrency the right (only?) lever, and how would you stage it against the model's rate limit to maximise accepted anchored facts per usage-window?

4.3 What is already settled (so feedback can build on it)


Appendix A — measurements (event 10690, 2026-06-03)

Metric | Value
--- | ---
Source size | 12,967 chars
Unique facts | 2,333
Emitted lines / dups | 2,390 / 57
Distinct predicates | 1,320
Distinct subjects | 154
Anchored | 2,168 (93%)
Entity-IRI objects / literals | 658 / 1,675
Pass 1 / Pass 2 new | 1,882 / 451
Pass 1 duration | ~48 min (to 1-hr cap)
Pass 2 duration | ~25 min
Stop cause | external GLM 5-hour usage cap (not convergence)
Anchor method | surface_text exact → whitespace/case-tolerant re-find

Appendix B — system map

Concern | Where
--- | ---
Agentic runner (docker exec, flock, file exchange, session-log export) | donto-api/opencode_agent.py
Agnostic prompt + ontological lenses | donto-api/prompts/extract_broad.txt
Compact-JSONL parse, incremental append, loop-until-dry, retry-on-empty | donto-api/opencode_extract.py
Ingestion + evidence anchoring + source registration | donto-api/helpers.py
Temporal workflow/activities | donto-api/workflows.py, activities.py
Substrate (Rust) + Postgres | dontosrv (:7879), donto-pg
Live session logs (tool calls + reasoning) | https://genes.apexpots.com/opencode/logs/

Generated 2026-06-03 from a live run on the donto-db box. Companion: "How the OpenCode Extraction Engine Works (and Where It Breaks)" and the "Generative Abundance" report, both at https://genes.apexpots.com/research/.