# donto — Generative-Abundance Knowledge Extraction: Vision, System, and a Measured Run

**An end-to-end description of donto's vision and its rebuilt extraction engine, with a fully-measured single-document run — written for an external reviewer. 2026-06-03.**

> **For the reviewer (e.g. ChatGPT Pro Research).** This document is written to give you enough of donto's *vision* and *system* to critique it usefully. Part 1 is the vision (why we extract the way we do). Part 2 is the engine. Part 3 is a real run with exact numbers. Part 4 is what we'd most like your feedback on. Everything in Part 3 is measured on the production box today, not estimated. A companion report ("How the OpenCode Extraction Engine Works (and Where It Breaks)", same `/research/` directory) covers the earlier failure analysis.

---

## 0. One-paragraph orientation

donto is a **bitemporal, paraconsistent, evidence-first knowledge substrate** built for the age of generative abundance. The thesis: for sixty years, *generating* typed knowledge was the scarce, human-bottlenecked step in every knowledge system; a guided frontier LLM now emits an essentially unbounded, multi-directional space of properties and relations about any entity for ~$0.0001 each. So the hard problem flipped from *"how do we generate enough?"* to *"where do we put an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?"* Vector DBs and ordinary knowledge graphs **collapse** (dedup, pick a winner, invalidate-on-conflict). donto does the opposite: it **holds incompatible claims forever as legal state, anchors each to its source, links them with typed argument edges, and re-ranks by reality over time instead of deleting on conflict.** The extraction engine described here is the front door: its job is **maximal faithful capture**, deferring typing/alignment/identity/joining to query time.

---

## 1. The vision (why the extractor behaves as it does)

Six principles drive every design decision below.

1. **Abundance / emit-free.** Generation is cheap and getting ~10× cheaper per year. The substrate *wants* an unbounded, multi-directional firehose. There is no upper bound on facts or on distinct predicates — *more is strictly better*. A "proliferation" of freely-minted predicates (the live store has ~900K) is the **signature of abundance**, not a defect.
2. **Defer joining to query time.** Typing, ontology alignment, identity resolution, and joins are **not** done at write time. The extractor emits free/untyped claims; alignment happens lazily at query time (predicate-closure, identity-as-hypothesis). This is the substrate's native strength and it is *why* the extractor is allowed to mint predicates with abandon.
3. **Paraconsistent / contradiction-preserving.** Contradictions are legal and wanted. Two sources naming a victim "Anthony" vs "Thomas" Cox → both entities are kept, plus a `likelySameAs` and a `nameDiscrepancyWith`. Nothing is reconciled at extraction.
4. **Evidence-first.** Every directly-stated claim is anchored to an exact source span (`surface_text` → `donto_span` → `donto_evidence_link` → revision → content-addressed blob). "Show me the source for this claim" must always return the passage.
5. **No authority is ground truth.** Every newspaper, court, official, deposition, and historian is an *interpretive witness*. Claims are framed as source-attestations (`attestedBy`, `reportedIn`, `accordingTo`), never as the extractor's own judgement.
6. **Domain-neutral core.** donto is a substrate; genealogy, agentic memory, and (planned) jsonresume→jobs are *example consumers* that test it. The extractor must assume **no subject matter** — baking a domain's vocabulary into the engine is a bug. (We hit this exact bug mid-development and corrected it; see Part 2.4.)

The consequence for extraction: **optimise for recall and faithfulness, never for tidiness, schema-conformance, or token economy.** The operator's instruction this session was explicit — *"extract every possible ontological fact about the chunk text; time, tokens, and money are not important."*

---

## 2. The extraction engine

### 2.1 Runtime

A Python service (`donto-api`, FastAPI) drives **OpenCode** — an agentic CLI — running **GLM-5.1** on a flat-rate "coding" subscription, inside a Docker container.

- **Why a container.** OpenCode only honours the GLM provider config reliably inside the bot container (which holds the API key). We `docker exec` in.
- **File exchange.** A per-run scratch dir is shared host↔container. We write `prompt.txt` + `source.txt`, run OpenCode with that dir as CWD, and read back the output file. No stdin/stdout plumbing.
- **Concurrency.** A host-wide `flock` over N slot files bounds concurrent OpenCode subprocesses across all consumers.
- **Durability.** Jobs run as **Temporal** workflows, so they survive restarts and are retried on failure.
- **Observability.** Every run's native OpenCode session (sqlite `part` table: every tool call, result, *and reasoning*) is exported to JSONL and published read-only at `/opencode/logs`, alongside a summary.
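The slot-based concurrency bound described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `opencode_agent.py`: the names `try_acquire_slot` and `opencode_slot`, the `slot-N.lock` file naming, and the polling interval are all hypothetical. It relies only on `fcntl.flock`, which is available on Unix hosts.

```python
import fcntl
import os
import time
from contextlib import contextmanager


def try_acquire_slot(slot_dir, n_slots):
    """Return (index, open file) for the first free slot, or None if all are held."""
    os.makedirs(slot_dir, exist_ok=True)
    for i in range(n_slots):
        fh = open(os.path.join(slot_dir, f"slot-{i}.lock"), "a")
        try:
            # Non-blocking exclusive lock: fails at once if another holder exists.
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return i, fh
        except BlockingIOError:
            fh.close()
    return None


@contextmanager
def opencode_slot(slot_dir, n_slots, poll_seconds=1.0):
    """Hold one of n_slots host-wide slots for the duration of the with-block."""
    while True:
        got = try_acquire_slot(slot_dir, n_slots)
        if got is not None:
            break
        time.sleep(poll_seconds)  # every slot is busy; wait and try again
    index, fh = got
    try:
        yield index
    finally:
        fcntl.flock(fh, fcntl.LOCK_UN)
        fh.close()
```

A caller would wrap the `docker exec` invocation in `with opencode_slot("/some/lock/dir", n_slots):`. Because the kernel drops a `flock` when its holder exits, a crashed worker cannot leak a slot.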

### 2.2 The output mechanism (the hard-won core)

Naïvely, you ask the agent to "extract facts and write `facts.json`." This **fails** for high-recall extraction: the model generates the entire output as the argument of a *single* `write` tool call, which for a large output never completes inside the time cap, so the call never executes and **nothing is captured** — not even as assistant text (we verified this directly). The fix is structural:

- **Incremental append.** The agent emits facts as **JSONL** and `cat >>`-appends them to `facts.jsonl` in **small bash batches** (~30–60 facts), re-scanning the source between batches. Each batch is a **durable commit boundary** — a timeout keeps everything written so far.
- **OpenCode decides when it is done.** Within a run, the agent keeps appending until it judges the source covered (`exit=0`); the controller's only intra-run logic is **retry-on-empty** (re-run a run that produced zero facts due to an error).
- **Compact JSONL, not verbose.** Each line is `{"s":subj,"p":pred,"o":obj,"a":anchor,"c":conf,"h":hypothesis?}` with the object given directly (an `ex:` IRI for entities, a bare number/string for literals). The Python parser reconstructs the full donto fact (IRI-vs-literal, datatype incl. gYear/date, anchor) losslessly. This roughly **tripled facts-per-unit-time** vs verbose JSON and *raised* the anchor rate, with no loss of information.
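A hedged sketch of how the compact lines might be expanded into full facts. The output field names (`subject`, `surface_text`, `hypothesis_only`, and so on), the `xsd:` datatype labels, and the rule that a four-digit string is a gYear are assumptions for illustration; the real parser in `opencode_extract.py` may differ. Malformed lines are skipped, so a final line cut off by a timeout does not poison the batches already committed.

```python
import json
import re

GYEAR = re.compile(r"^\d{4}$")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def classify_object(obj):
    """Decide IRI vs literal and pick a datatype for literals (assumed rules)."""
    if isinstance(obj, bool):
        return {"kind": "literal", "value": obj, "datatype": "xsd:boolean"}
    if isinstance(obj, int):
        return {"kind": "literal", "value": obj, "datatype": "xsd:integer"}
    if isinstance(obj, float):
        return {"kind": "literal", "value": obj, "datatype": "xsd:decimal"}
    text = str(obj)
    if text.startswith("ex:"):
        return {"kind": "iri", "value": text}
    if ISO_DATE.match(text):
        return {"kind": "literal", "value": text, "datatype": "xsd:date"}
    if GYEAR.match(text):
        return {"kind": "literal", "value": text, "datatype": "xsd:gYear"}
    return {"kind": "literal", "value": text, "datatype": "xsd:string"}


def parse_facts_jsonl(text):
    """Expand compact JSONL lines into full fact dicts, skipping bad lines."""
    facts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a line truncated by the time cap
        if not isinstance(row, dict) or not all(k in row for k in ("s", "p", "o")):
            continue
        facts.append({
            "subject": row["s"],
            "predicate": row["p"],
            "object": classify_object(row["o"]),
            "surface_text": row.get("a"),   # anchor text, re-found at ingest
            "confidence": row.get("c"),
            "hypothesis_only": bool(row.get("h")),
        })
    return facts
```

Keeping the short keys on the wire and expanding them in Python is what lets the model spend its output tokens on facts rather than on repeated field names.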

### 2.3 Controller loop-until-dry (the exhaustiveness mechanism)

A single pass — however long — still *misses* facts. So the durable controller (Python/Temporal, **not** the agent) owns an outer loop:

1. **Pass 1:** broad extraction (up to a 1-hour cap), incremental append, OpenCode-decides-done.
2. **Pass 2+ (continuation):** the controller **seeds `facts.jsonl` with everything found so far** and prepends a gap-finding preamble — "read what's already captured, then append ONLY what was missed: unused ontological lenses, finer relations, deeper inferences." The agent reads the seed + source and appends only genuinely new facts.
3. **Stop:** the loop ends when a pass adds **< 10 new facts** (the dry threshold) or when the pass cap (4) is reached. Facts are deduped by `(subject,predicate,object)` across passes.

This is "controller-owned loop, agent-as-bounded-worker," not "let the agent loop until it thinks it's done." It gives a durable commit boundary, retry semantics, and a *measurable* stop criterion.
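The outer loop can be sketched as below. This is an illustration under assumptions rather than the production controller: `run_pass` is a hypothetical callable standing in for one OpenCode run (it receives the pass number and the seed, and returns the facts it emitted, or `None` when the run errored without producing anything), and `max_empty_retries` is an invented parameter. The thresholds mirror the ones stated above (dry threshold 10, pass cap 4).

```python
def loop_until_dry(run_pass, dry_threshold=10, max_passes=4, max_empty_retries=1):
    """Run extraction passes until one adds fewer than dry_threshold new facts."""
    seen = {}            # (s, p, o) -> fact, preserving first-seen order
    new_per_pass = []
    for pass_no in range(1, max_passes + 1):
        seed = list(seen.values())          # everything captured so far
        emitted = run_pass(pass_no, seed)
        retries = 0
        # Retry-on-empty: re-run a pass that failed without writing anything.
        while emitted is None and retries < max_empty_retries:
            retries += 1
            emitted = run_pass(pass_no, seed)
        added = 0
        for fact in emitted or []:
            key = (fact["s"], fact["p"], fact["o"])
            if key not in seen:             # dedup across passes
                seen[key] = fact
                added += 1
        new_per_pass.append(added)
        if added < dry_threshold:           # the pass came back dry
            break
    return list(seen.values()), new_per_pass
```

The stop decision depends only on the count of new `(s, p, o)` keys, which the controller computes itself, so the agent's own sense of being "done" never ends the outer loop.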

### 2.4 The prompt: content-agnostic + ontological lenses

The prompt assumes **no subject matter**. It states the abundance philosophy (above), then asks the agent to extract every entity (named or implied), attribute, relationship (both directions), event (participants/time/place/cause/result), quantity (+ qualifier), containment hierarchy, provenance/epistemics, contradiction, and to decode figurative language generically. The key recall driver is an explicit menu of **ontological lenses** — *directions* to form predicates from — presented as **examples to spark creativity, with an explicit instruction to invent new predicates and new lenses of its own**:

> taxonomy/type · mereology (part/whole) · identity/persistence · topology/spatial · chronology/time · causation/etiology · teleology/function · agency & thematic roles · epistemology · deontology/norms · axiology/value · modality · qualia structure · lexical semantics · social ontology · process/event structure · constitution/material · dependence/grounding · genetic/provenance · comparison/similarity · quantity/measurement · disposition/capacity · speech-acts · phenomenology — *"that was just a list of examples; invent your own."*

The lenses are universal (no domain bias) and, as Part 3 shows, produce both broad coverage *and* a long tail of model-invented predicates.

### 2.5 Ingestion + anchoring

Parsed facts go to `helpers.ingest_facts → dontosrv → Postgres`. The source is registered first (`document` + `revision`) so anchors target a real `revision_id`; each fact's `surface_text` is re-found in the source (exact, then whitespace/case-tolerant) and materialised as a `donto_span` + `donto_evidence_link`. Hypothesis-only facts are preserved with `polarity=unknown, maturity=0`. Predicate alignment and entity identity-resolution are **deferred to query time** (running them per-row at ingest hammered the substrate at abundance scale and is, more importantly, against the vision).
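The two-stage re-find (exact, then whitespace/case-tolerant) might look like the sketch below. The function name `find_span` and its return shape are assumptions, and the real `helpers.py` logic may handle more cases; the tolerant stage here simply allows any run of whitespace between the anchor's tokens and ignores case.

```python
import re


def find_span(source, surface_text):
    """Return (start, end, method) for the anchor in source, or None if absent."""
    if not surface_text:
        return None
    # Stage 1: exact substring match.
    start = source.find(surface_text)
    if start != -1:
        return start, start + len(surface_text), "exact"
    # Stage 2: tolerate differences in whitespace and letter case.
    tokens = surface_text.split()
    if not tokens:
        return None
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, source, flags=re.IGNORECASE)
    if match:
        return match.start(), match.end(), "tolerant"
    return None
```

The returned offsets index into the registered revision's text, so the span always quotes the source verbatim even when the model's anchor differed in spacing or case. An anchor that matches in neither stage returns `None` and yields no span.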

---

## 3. A measured run: event 10690 (loop-until-dry, 1-hour passes)

**Source.** One row of the University of Newcastle *Colonial Frontier Massacres in Australia 1788–1930* dataset, rendered to a 12,967-char text document: "Attack on NMP detachment — Combo James, Colin and Hamlet at Rannes station (1855)". Structured fields + multi-source narrative (period newspapers, depositions, later historians).

### 3.1 Headline numbers

| Metric | Value |
|---|---|
| **Unique facts** (deduped s,p,o) | **2,333** (2,390 emitted lines, 57 dups) |
| **Distinct predicates** | **1,320** |
| Distinct subjects (entities) | 154 |
| **Anchored** (carry an exact source span) | **2,168 / 2,333 = 93%** |
| Object is an entity-IRI (graph edge) | 658 (28%) · literal 1,675 (72%) |
| Pass 1 / Pass 2 | 1,882 new / +451 new → 2,333 |
| Pass 3 | interrupted by GLM usage cap (not by convergence) |

**One 13-KB document yielded 2,333 evidence-anchored claims across 1,320 distinct predicates, 93% anchored.**

### 3.2 The loop-until-dry result (why a single pass is not enough)

- Pass 1 (broad, ~48 min, ran to the cap): **1,882** facts.
- Pass 2 (gap-finding, seeded with the 1,882): **+451 genuinely new** facts pass 1 had missed → 2,333.

Pass 2 added **24%** on top of pass 1 (451 / 1,882); equivalently, **a single pass would have missed ~19% of the two-pass total** (451 / 2,333). 451 ≫ the dry threshold (10), so pass 3 was warranted and was under way when the usage cap hit; i.e., the document was *not yet exhausted*. This directly validates controller-owned loop-until-dry for the "never miss a fact" objective.
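For readers who want to check the two percentages, the arithmetic is reproduced below from the per-pass counts reported in this section. The helper name `pass_yield_summary` is hypothetical; only the inputs (1,882 and 451) and the dry threshold (10) come from the run.

```python
def pass_yield_summary(new_per_pass, dry_threshold=10):
    """Summarise marginal yield across passes from the per-pass new-fact counts."""
    total = sum(new_per_pass)
    first = new_per_pass[0]
    later = total - first
    return {
        "total": total,
        "gain_over_first_pass": later / first,    # what later passes added
        "missed_by_single_pass": later / total,   # share a lone pass would miss
        "last_pass_is_dry": new_per_pass[-1] < dry_threshold,
    }


summary = pass_yield_summary([1882, 451])
print(f"{summary['gain_over_first_pass']:.1%}")   # 24.0%
print(f"{summary['missed_by_single_pass']:.1%}")  # 19.3%
```

The two figures describe the same 451 facts against different denominators, which is why both are quoted.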

### 3.3 Within-event evolution (same document, 10690)

| Configuration | Facts | Notes |
|---|---|---|
| abundance prompt, 30-min cap, single pass | ~920 | earlier engine |
| **agnostic + lenses + compact, 1-hr cap, single pass** | **1,882** | this run, pass 1 |
| **+ loop-until-dry (2 passes)** | **2,333** | this run, cumulative |

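Single-pass yield roughly **doubled** between the earlier 30-min configuration and this run's 1-hour pass (~920 → 1,882; the model keeps surfacing real facts when given time). The prompt and output format also changed between those two rows, so the gain is not attributable to the cap alone. The second pass then added another ~24%.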

### 3.4 Ontological-lens coverage

Heuristic categorisation by lens. Counts are predicate-instances (one per fact), so they sum to the 2,333 facts rather than to the 1,320 distinct predicates:

```
identity/naming 290 · chronology/time 284 · taxonomy/type 198 · causation 167 ·
spatial/topology 161 · agency/role 141 · provenance/epistemic 140 · quantity 77 ·
social/kinship 75 · mereology 57 · contradiction 17 · modality 15
+ 711 instances of predicates the heuristic could NOT categorise — i.e. lenses/
predicates the model invented beyond the listed menu.
```

The spread confirms the lenses work as intended (broad, multi-directional). The large *uncategorised* tail (711 of 2,333 instances, ~30%) indicates the "invent your own" instruction is firing: the model is creating ontological directions we did not enumerate.
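The report does not spell out the heuristic, so the sketch below is only one plausible shape for it: a keyword match on the predicate name, with anything unmatched falling into an `uncategorised` bucket. The keyword table `LENS_KEYWORDS` is invented for illustration and covers only a few lenses.

```python
from collections import Counter

# Hypothetical keyword table; the first lens with a matching keyword wins.
LENS_KEYWORDS = {
    "identity/naming": ("name", "knownas", "alias", "sameas"),
    "chronology/time": ("date", "year", "time", "before", "after", "during"),
    "causation": ("cause", "result", "motive", "ledto"),
    "spatial/topology": ("location", "place", "distance", "near", "locatedin"),
    "provenance/epistemic": ("attested", "reported", "accordingto", "source", "narrated"),
    "contradiction": ("discrepancy", "contradict", "conflict"),
}


def lens_for(predicate):
    """Map one predicate name to a lens by case-insensitive substring match."""
    lowered = predicate.lower()
    for lens, keywords in LENS_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return lens
    return "uncategorised"


def lens_histogram(predicates):
    """Count predicate-instances per lens (pass one predicate per fact)."""
    return Counter(lens_for(predicate) for predicate in predicates)
```

A substring heuristic of this kind is crude: `nameDiscrepancyWith` lands in identity/naming because that lens is checked first, and any predicate whose vocabulary the table lacks is counted as uncategorised whether or not it belongs to a listed lens. The uncategorised count is therefore an upper bound on genuinely novel lenses.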

### 3.5 Representative facts (the texture)

```
ex:depredations-on-flocks | decodedAs | "Aboriginal people killing or stealing sheep"   (euphemism decode)
ex:capricornian-1913-07-19 | statesDistanceFromMurder | "two or three hundred yards"     (source-attributed detail)
ex:aboriginal-attackers-rannes | tacticalPlanningEvidenced | "coalition building, intelligence gathering, timing, coordination, swift withdrawal"
ex:hamlet-nmp | rdfType | ex:... ; ex:nmp-regimental-system | assignedRegNoTo | ex:hamlet-nmp "Reg No 53"
ex:attack-rannes-1855 | countDiscrepancyWith | ...                                        (paraconsistent contradiction)
ex:henry-walker | escapedBySleepingAt | ex:rannes-headstation  ["was sleeping at the station over the cre"]
ex:nmp-corps | alsoKnownAs | "Native Mounted Police"
```

These show second-order capture (causation, motive, tactics), provenance (`attestedBy`, `authoredBy`, `narratedBy`, `certaintyMarker`, `sourceDepth`), contradiction (`countDiscrepancyWith`, `corroboratedBy`), and generic euphemism-decoding — all from a domain-neutral prompt.

### 3.6 Where the wall-clock goes (timestamp analysis, from the session DB)

Each pass is dominated by **output-token generation** at GLM's rate (~40 tok/s). On an earlier comparable run: of the wall-time, the actual `bash`/file tool calls were ~0%; the rest was the model generating the facts (and a minority on reasoning). **Per-document time is therefore set by (facts × tokens-per-fact) ÷ generation-rate.** Implications: the compact format helps (fewer tokens/fact); per-row wall-clock cannot beat the model's generation speed; throughput across many documents is a *concurrency* problem, not a per-row one.
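The time relation above can be written as two small functions. The function names are hypothetical, and the tokens-per-fact figure is not a measured quantity: it is back-solved here from two approximate numbers in this report (~48 min for pass 1 and ~40 tok/s), so it is an all-in value that also absorbs reasoning tokens.

```python
def estimate_pass_seconds(facts, tokens_per_fact, tokens_per_second=40.0):
    """Wall-clock estimate: (facts x tokens-per-fact) / generation-rate."""
    return facts * tokens_per_fact / tokens_per_second


def implied_tokens_per_fact(facts, wall_seconds, tokens_per_second=40.0):
    """Invert the relation to get an all-in tokens-per-fact figure."""
    return wall_seconds * tokens_per_second / facts


# Pass 1 of event 10690: 1,882 facts in roughly 48 minutes.
all_in = implied_tokens_per_fact(1882, 48 * 60)
print(round(all_in, 1))  # 61.2
```

Under these assumptions each fact costs on the order of 60 generated tokens all-in. Halving that figure would halve per-document time at a fixed generation rate, while adding concurrency leaves per-document time unchanged and raises only aggregate throughput.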

---

## 4. Findings, constraints, and questions for the reviewer

### 4.1 The binding constraint we just discovered: a 5-hour usage cap

The GLM "coding" subscription enforces a **rolling 5-hour usage limit** (`error 1308: "Usage limit reached for 5 hour"`). Our exhaustive loop-until-dry is extremely generation-heavy (2,333 facts × multiple passes × many tool-turns), so it consumed the window's allowance during a single multi-document session and began returning HTTP 429 until reset. This is the **dominant economic/throughput constraint** for a full corpus run (here: 2,818 documents), more than CPU or per-row latency.

### 4.2 Open questions (where we want your critique)

1. **Exhaustiveness vs. cost.** Given a hard "never miss a fact" objective *and* a flat-rate-but-rate-capped model, how would you balance loop-until-dry depth against the 5-hour usage cap across a 2,818-doc corpus? Is there a principled stopping point short of "dry" that captures, say, 95% of facts at a fraction of the generation?
2. **Diminishing returns.** Pass 1 → 1,882, pass 2 → +451 (24%). Is continuing to pass 3/4 worth it, or is there a better marginal-value signal than a raw dry-count (e.g., novelty-weighted, or lens-coverage-based)?
3. **The 1,320-predicate tail.** We deliberately defer alignment to query time, but is there a *cheap* write-time signal worth capturing per minted predicate (a one-line gloss?) that materially lowers later alignment cost without violating emit-free? Or does that re-introduce the write-time tax we're avoiding?
4. **Quality of abundance.** 93% anchored is strong, but ~7% are inferences/decodes (correctly flagged `hypothesis_only`). At 2,333 facts/doc, what's the right way to *evaluate* faithfulness at scale — sampling? adversarial verification of a subset? downstream task-lift?
5. **Agentic CLI vs. direct API.** For pure extraction we keep OpenCode for the flat-rate economics + file/agentic convenience, but it adds reasoning/loop overhead. Given the 5-hour cap, would a direct streamed completion (no agentic loop) extract *more facts per unit of the usage budget*? That's the metric that now matters.
6. **Throughput.** The box is 4 vCPU/16 GB and network-bound on the model. With the per-row engine fixed, is raising concurrency the right (only?) lever, and how would you stage it against the model's rate limit to maximise *accepted anchored facts per usage-window*?

### 4.3 What is already settled (so feedback can build on it)

- Output **must** be incremental-append (the monolithic write fails); compact JSONL is ~3× more efficient and loses nothing.
- The loop **must** be controller-owned with a durable commit boundary, not agent-owned.
- The prompt **must** be content-agnostic; ontological lenses (as invent-your-own examples) are the recall driver.
- Alignment/identity **stay** at query time (per the vision + per a measured perf disaster doing them per-row).
- Evidence anchoring via `surface_text` works at 93% and is non-negotiable (it's the substrate's value).

---

## Appendix A — measurements (event 10690, 2026-06-03)

| Measurement | Value |
|---|---|
| Source size | 12,967 chars |
| Unique facts | 2,333 |
| Emitted lines / dups | 2,390 / 57 |
| Distinct predicates | 1,320 |
| Distinct subjects | 154 |
| Anchored | 2,168 (93%) |
| Entity-IRI objects / literals | 658 / 1,675 |
| Pass 1 / Pass 2 new | 1,882 / 451 |
| Pass 1 duration | ~48 min (to 1-hr cap) |
| Pass 2 duration | ~25 min |
| Stop cause | external GLM 5-hour usage cap (not convergence) |
| Anchor method | `surface_text` exact → whitespace/case-tolerant re-find |

## Appendix B — system map

| Concern | Where |
|---|---|
| Agentic runner (docker exec, flock, file exchange, session-log export) | `donto-api/opencode_agent.py` |
| Agnostic prompt + ontological lenses | `donto-api/prompts/extract_broad.txt` |
| Compact-JSONL parse, incremental append, loop-until-dry, retry-on-empty | `donto-api/opencode_extract.py` |
| Ingestion + evidence anchoring + source registration | `donto-api/helpers.py` |
| Temporal workflow/activities | `donto-api/workflows.py`, `activities.py` |
| Substrate (Rust) + Postgres | `dontosrv` (:7879), `donto-pg` |
| **Live session logs (tool calls + reasoning)** | `https://genes.apexpots.com/opencode/logs/` |

*Generated 2026-06-03 from a live run on the donto-db box. Companion: "How the OpenCode Extraction Engine Works (and Where It Breaks)" and the "Generative Abundance" report, both at `https://genes.apexpots.com/research/`.*
