# donto — How the OpenCode Extraction Engine Works (and Where It Breaks)

**An engineering report on agentic LLM fact-extraction into a paraconsistent substrate — 2026-06-03**

> **Purpose of this document.** This is a complete, honest technical description of
> how donto's "opencode" extraction pipeline works today, the failure modes we
> found while load-testing it on a 2,818-row historical dataset, the fixes applied
> so far, and an unresolved performance problem. It is written to be handed to an
> external reviewer (e.g. ChatGPT) for a third opinion. Section 8 lists the
> specific questions we want answered. Numbers are real measurements from
> 2026-06-02/03 on the production box, not estimates.

---

## 0. TL;DR

- **Goal:** read a source document (newspaper article, BDM record, native-title
  transcript, or — in the current test — one row of the *Colonial Frontier
  Massacres in Australia 1788–1930* dataset) and emit the **largest possible set
  of evidence-anchored RDF-style claims** into donto, a bitemporal,
  paraconsistent, evidence-first knowledge substrate.
- **How:** a Python service drives **OpenCode** (an agentic CLI) running
  **GLM-5.1** over a flat-rate coding subscription, inside a Docker container, via
  a shared-file scratch directory. The agent writes a `facts.json`; we ingest it,
  re-deriving an evidence span for every claim whose quoted `surface_text` can be
  located in the source (exact match first, then a whitespace/case-tolerant match).
- **What works:** on small inputs it is excellent — a 2,222-char event yields
  **317 facts / 140 distinct predicates / 95% anchored** in ~441 s, including
  paraconsistent handling of a name conflict and an inferred jurisdictional
  hierarchy.
- **What breaks:** large inputs and high-recall prompts produce **zero facts**.
  The agent generates the entire output as the argument to a *single* `write`
  tool call; for large outputs that generation never completes within the
  time cap, so the tool call never executes and **nothing is captured** — not
  even as inline assistant text. We confirmed this directly (Section 5.3).
- **The tension:** "more facts" and "faster" pull against each other because
  facts are output tokens, and our output format (verbose JSON, ~55 tokens/fact)
  and output *mechanism* (one monolithic write) are both inefficient.
- **Proposed redesign (Section 7):** make the agent **iterate** — accumulate and
  self-critique `facts` across many small, persisted turns until its own reasoning
  says the output is complete — instead of one giant write; pair it with an
  **append-friendly line format** (JSONL/TSV, ~3× fewer tokens/fact); add
  inline-text capture as a fallback; warm per-slot agent state; and real
  concurrency.

---

## 1. System architecture

```
  one source document (≤ a few hundred KB of text)
        │
        ▼
  donto-api  (FastAPI, :8000)
   ├─ POST /extract-and-ingest   (synchronous)
   └─ POST /jobs/extract|/batch  (async → Temporal durable queue)
        │
        ▼
  Temporal workflow  (ExtractionWorkflow)
   1. extract_facts_activity   ── runs OpenCode + ingests, in one activity slot
   2. align_predicates_activity ── files close-match predicate alignments
   3. resolve_entities_activity ── cross-context identity edges
        │
        ▼
  OpenCodeAgent  (opencode_agent.py)
   docker exec into the omega-bot container → `opencode run` (GLM-5.1)
   file exchange over a shared bind mount; host-wide flock concurrency cap
        │  facts.json
        ▼
  ingest_facts (helpers.py)  → dontosrv (Rust, :7879) → Postgres 16 (donto-pg)
   /documents/register + /documents/revision   (save the source first)
   /assert/batch with per-fact anchors          → donto_span + donto_evidence_link
```

The substrate (`donto-pg`) currently holds ~**39.5M live statements** across
~**20K contexts** and ~**938K freely-minted predicates**. The design thesis is
*generative abundance*: emit free/untyped claims now, defer typing, alignment,
identity-resolution, and joining to **query time**; hold contradictions forever
as legal state; re-rank by reality over time rather than deleting on conflict.
Extraction is therefore tuned for **recall**, not precision-by-suppression.

---

## 2. The OpenCode runtime

`OpenCodeAgent.run(prompt, input_files, output_files, timeout, model)` is the
generic agentic step. Mechanics:

1. **Why a container.** OpenCode only honours the GLM coding-subscription provider
   config reliably *inside* the `omega-bot` container (which holds `GLM_API_KEY`).
   So we `docker exec` into it rather than running `opencode` on the host.
2. **File exchange.** A per-run scratch dir is created on a shared bind mount:
   host `/data/omega/shared/oc/<run_id>/` == container `/data/oc/<run_id>/`. We
   write `prompt.txt` + `source.txt` there, run with that dir as CWD, and read
   back `facts.json`. No stdin/stdout plumbing.
3. **The invocation** (verbatim shape):
   ```
   docker exec -e OPENCODE_CONFIG_CONTENT='{provider z-ai → api.z.ai/api/coding/paas/v4, model glm-5.1}' \
               -e HOME=/data/oc/<run_id> \
               omega-bot sh -c 'cd /data/oc/<run_id> && \
                 timeout <N> opencode run --dangerously-skip-permissions --format json "$(cat prompt.txt)"'
   ```
4. **Per-run isolated HOME.** `HOME` is set to the run's own scratch dir so each
   run gets an isolated OpenCode sqlite state (avoids `SQLITE_BUSY` when runs are
   concurrent in the shared container). **Cost:** every run bootstraps a fresh
   OpenCode DB — its stderr literally logs *"Performing one time database
   migration… Database migration complete."* on **every** invocation.
5. **Concurrency cap.** A host-wide `flock` over N slot files
   (`/var/lock/opencode-slots/slot-{0..N-1}.lock`, `OPENCODE_MAX_CONCURRENT`,
   currently 5) bounds how many `opencode run` subprocesses exist across *all*
   consumers (genealogy worker, memory worker, sync API, the Omega bot's
   self-memorisation). flock auto-releases on death, so killed runs never leak
   locks.
6. **Timeout.** `timeout <N>` (coreutils) bounds the agent; the Python subprocess
   timeout is `N+60` as a backstop. An exceeded cap shows as `exit=143`
   (128 + 15, i.e. the process was terminated by SIGTERM).
7. **Result.** `AgentRun{exit_code, elapsed_s, text (assembled from 'text'
   events), output_files, event_counts, stderr_tail, timed_out}`. **Important:**
   `elapsed_s` includes time spent *waiting for a flock slot*, which inflated some
   logged durations to 1,170–1,360 s even though the agent itself was capped at
   780 s.
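
The slot mechanism in items 5 and 7 can be sketched as follows. This is a minimal illustration, not the code in `opencode_agent.py`: the name `opencode_slot`, the polling interval, and the returned `waited_s` value are assumptions made for the example. It shows how a non-blocking `flock` over N slot files bounds concurrency, and how the slot wait can be measured separately from the agent's own run time.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def opencode_slot(lock_dir, max_concurrent, poll_s=0.5):
    """Hold one of `max_concurrent` slot locks; yields (slot_index, waited_s)."""
    os.makedirs(lock_dir, exist_ok=True)
    started = time.monotonic()
    while True:
        for i in range(max_concurrent):
            path = os.path.join(lock_dir, f"slot-{i}.lock")
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            try:
                # Non-blocking exclusive lock: fails at once if the slot is taken.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                os.close(fd)
                continue
            try:
                yield i, time.monotonic() - started
            finally:
                # Closing the descriptor releases the lock; the kernel does the
                # same if the process dies, so a killed run cannot leak a slot.
                os.close(fd)
            return
        # Every slot is busy: wait briefly, then scan again.
        time.sleep(poll_s)
```

Timing the `opencode run` call inside the `with` block and logging `waited_s` as its own field would keep slot waits out of the reported agent duration.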

---

## 3. The extraction prompt and multi-pass lenses

Extraction is agentic, not a single chat completion. The prompt instructs the
agent to **read `source.txt`** and **write `facts.json`** as `{"facts": [...]}`.

- **Pass 1 ("broad")** is a general sweep.
- **Passes 2+** (only in `exhaustive` mode) are single-lens passes (kinship,
  vital-events, places, identity-resolution, occupations, sources/provenance),
  each seeing all prior facts and told to find *new* ones. Facts are deduplicated
  across passes by a content key, and the loop stops after K consecutive passes
  that add nothing new.
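
The pass loop described above might look roughly like this. It is a sketch under stated assumptions: `run_pass` is a hypothetical callable standing in for one OpenCode invocation, and the content key used here (subject, predicate, object) is a guess at what "content-key" means; the real key in `opencode_extract.py` may include other fields.

```python
import json


def content_key(fact):
    """Key over the claim itself; anchor and confidence are ignored (assumption)."""
    return json.dumps(
        [fact["subject"], fact["predicate"], fact["object"]], sort_keys=True
    )


def run_passes(run_pass, lenses, max_empty=2):
    """Run one pass per lens, dedup across passes, stop after a dry streak.

    run_pass(lens, prior_facts) must return a list of fact dicts.
    """
    seen, facts, empty_streak = set(), [], 0
    for lens in lenses:
        new = []
        for fact in run_pass(lens, list(facts)):
            key = content_key(fact)
            if key not in seen:
                seen.add(key)
                new.append(fact)
        facts.extend(new)
        # A pass counts as empty when it adds nothing after dedup.
        empty_streak = 0 if new else empty_streak + 1
        if empty_streak >= max_empty:
            break
    return facts
```

Counting a pass as empty *after* dedup means a lens that only restates known facts still advances the stop counter.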

The batch path uses **single-pass broad** (`mode=opencode`, `passes=1`).

The **donto fact shape** each emitted object must follow:

```json
{ "subject": "ex:<kebab>", "predicate": "<camelCase>",
  "object": {"iri": "ex:<kebab>"} | {"literal": {"v": <value>, "dt": "<xsd type>"}},
  "anchor": {"surface_text": "<EXACT substring of source.txt>"} | null,
  "confidence": 0.0-1.0, "hypothesis_only": true|false }
```

The **anchor rule** is load-bearing: `surface_text` must be copyable
character-for-character from the source. A claim that cannot be quoted this way
must instead carry `anchor:null`, `hypothesis_only:true`, and `confidence<0.9`.
This is what makes every anchored claim traceable to evidence and every
unanchored one visibly flagged as inference.
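
As an illustration, the rule can be expressed as a small check over one emitted fact. This is a sketch only: `check_anchor_rule` is a hypothetical helper, not a function from the codebase, and it applies the strict exact-substring reading of the rule.

```python
def check_anchor_rule(fact, source_text):
    """Return a list of rule violations; an empty list means the fact passes."""
    problems = []
    anchor = fact.get("anchor")
    if anchor is None:
        # Unanchored claims must be flagged as hypotheses with capped confidence.
        if fact.get("hypothesis_only") is not True:
            problems.append("anchor is null but hypothesis_only is not true")
        if not fact.get("confidence", 0.0) < 0.9:
            problems.append("anchor is null but confidence is not < 0.9")
    else:
        surface = anchor.get("surface_text", "")
        if not surface or surface not in source_text:
            problems.append("surface_text is not an exact substring of the source")
    return problems
```

Returning a list of problems rather than a boolean keeps the check usable for reporting, which matters when the policy is to preserve facts rather than drop them.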

We recently replaced the original conservative prompt ("**reuse** predicates, do
not mint; aim for ~300 facts") with an **abundance prompt** (15.7 KB): "mint
predicates freely; no upper limit; `rdf:type`+`rdfs:label` every entity; emit
inverse edges; capture second-order relations — causation, reprisal chains,
attestation-as-its-own-fact, jurisdiction ladders, euphemism decoding,
legal-process chains; contradictions are wanted, never reconcile." This prompt is
philosophically aligned with the substrate — and, as Section 5 shows, it exposed
the engine's real bottleneck.

---

## 4. Ingestion and evidence anchoring

`ingest_facts(facts, context, revision_id, source_text)`:

1. Ensure the context exists (`/contexts/ensure`).
2. For each fact: parse the object into `object_iri` **or** `object_lit`; map
   `confidence`→`maturity` (0–4); route `hypothesis_only` facts to
   `polarity=unknown, maturity=0` (preserved, distinguishable).
3. **Anchor.** If `revision_id` + `source_text` are present and the fact carries
   a `surface_text`: accept the model's offsets if they match, else re-find the
   span by exact substring, else by a **whitespace/case-tolerant regex**
   (`_flex_find`). On success, attach an anchor that materialises a `donto_span`
   + `donto_evidence_link` row.
4. `POST /assert/batch` to dontosrv.
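
The two-stage span lookup in step 3 can be sketched like this. The function below is a hypothetical stand-in for `_flex_find`, written from the description above rather than from `helpers.py`; the actual tolerance rules may differ.

```python
import re


def find_span(source_text, surface_text):
    """Return (start, end) offsets of surface_text in source_text, or None."""
    if not surface_text.strip():
        return None
    # Stage 1: exact substring.
    start = source_text.find(surface_text)
    if start != -1:
        return start, start + len(surface_text)
    # Stage 2: same tokens in order, any whitespace between them, any letter case.
    pattern = r"\s+".join(re.escape(tok) for tok in surface_text.split())
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    return match.span() if match else None
```

Because the offsets always come from the source itself, the stored span covers the source's own characters even when the model's quote differed in spacing or case.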

The **source is saved first**: `register_source_document` calls
`/documents/register` + `/documents/revision`, so facts anchor to a real
`revision_id`. **Bug found & fixed this session:** the *queued* (Temporal) path
called `ingest_facts` **without** `revision_id`/`source_text`, so queued
extractions saved the source but anchored **zero** spans. The synchronous
endpoint did it correctly. We aligned the queued path to the sync path; queued
extractions now anchor (verified 301/317 spans on a pilot).

---

## 5. Empirical evaluation and failure modes

Test corpus: the Univ. of Newcastle *Colonial Frontier Massacres* events export
(2,818 events; each rendered to a provenance-stamped text document of ~1.7 K–19 K
chars). All runs GLM-5.1, single broad pass.

### 5.1 It works well on small inputs

| Row | Source chars | Facts | Distinct preds | Anchored | Time |
|---|---|---|---|---|---|
| 10605 (Anthony Cox) | 2,222 | 317 | 140 | 95.0% | 441 s |
| 20781 (Lawn Hill) | 1,765 | 249 | 126 | 97.6% | 389 s |
| 38090 (North Keppel) | 2,351 | 322 | 165 | 68.6%¹ | 580 s |

¹ lower anchor rate is correct — many facts are structural inferences
(`rdf:type`, `likelySameAs`, grouping) that legitimately carry `anchor:null`.

Quality is genuinely strong: e.g. 10605 captured all 26 populated source fields,
held a name conflict paraconsistently ("Anthony" vs "Thomas" Cox → two entities +
`likelySameAs`), decomposed a follow-on reprisal into its own event, and inferred
a correct 1850 jurisdiction chain (station → run → Darling Downs → Police District
of Surat → Colony of NSW).

### 5.2 The "size cliff": large inputs → zero facts

With the *original* prompt, 3 of 5 larger rows returned **0 facts**:

| Row | Source chars | Result |
|---|---|---|
| 20744 (Goulbolba) | 6,925 | `exit=143`, 0 facts |
| 20111 ('King Billy') | 14,100 | `exit=143` @ 836 s, 0 facts |
| 10763 (NMP detachment) | 19,332 | `exit=143`, 0 facts |

**Fix that worked: chunking.** Splitting the source into ≤3,500-char chunks
(on paragraph→line→sentence→hard boundaries), running the broad pass per chunk,
and merging+deduping recovered all three:

| Row | Before | After chunking |
|---|---|---|
| 10763 | 0 | **1,236 facts, 95.1% anchored** |
| 20744 | 0 | **540 facts, 96.7%** |
| 20111 | 0 | **368 facts, 89.9%** |

Each chunk is a contiguous slice, so quoted `surface_text` still resolves against
the full source. We also added **retry-once-on-empty-chunk** (an occasional chunk
still returns nothing) so a single bad chunk no longer drops a slice of the doc.
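
A boundary-preferring splitter of this kind can be sketched as below. It is an illustrative reimplementation, not the chunker in `opencode_extract.py`: `chunk_spans` is a hypothetical name, and the sentence heuristic (`". "`) is an assumption. It returns offsets rather than strings so that each chunk is provably a contiguous slice of the source.

```python
def chunk_spans(text, max_chars=3500):
    """Split text into contiguous (start, end) spans of at most max_chars."""
    spans, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer a paragraph break, then a line break, then a sentence end.
            for sep in ("\n\n", "\n", ". "):
                cut = window.rfind(sep)
                if cut > 0:
                    end = start + cut + len(sep)
                    break
            # If no separator was found, `end` stays at the hard boundary.
        spans.append((start, end))
        start = end
    return spans
```

Since the spans tile the source with no gaps or overlaps, any `surface_text` quoted from a chunk is also a substring of the full document.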

### 5.3 The root failure mode (confirmed): the monolithic write never completes

The size cliff is a symptom. The cause, confirmed by a controlled run:

> **Diagnostic.** Abundance prompt (15.7 KB) + a single 2,222-char source, 280 s
> cap. Result: `exit=143`, `events={step_start:2, tool_use:1, step_finish:1}`,
> **`facts.json` = 0 bytes, assembled inline text = 0 chars.** stderr showed only
> the one-time sqlite migration.

Interpretation: the agent fires **one** tool call (the `read` of `source.txt`),
then begins generating the **entire `facts.json` as the argument of a single
`write` tool call**. For a high-recall prompt that argument is enormous; the
generation does not finish before the cap, so the `write` call **never executes**
→ no file. And because the content lives in the in-flight tool-call stream (not in
`text` events), it is **not** recoverable from the assistant text either. We get
*nothing* despite minutes of generation.

This also explains an earlier `exit=0` run that ran 639 s and still produced **no
file**: the model ended its turn without ever completing (or issuing) the write.

The original prompt survived only because it asked for *less* (it targeted ~300
facts; row 10605 yielded 317), so the single write was small enough to finish
(~441 s). Raising recall pushed the single
write past what one tool call can complete in time. **This is a mechanism problem,
not a "too many facts" problem** — the model is willing and able to generate the
volume; we cannot capture it.

### 5.4 Speed ↔ facts tension and resource contention

- Output is **verbose JSON** (~55 tokens/fact incl. anchor). Wall-clock ≈
  facts × tokens/fact ÷ generation-rate. So at fixed format, "3× the facts" ≈ "3×
  the time".
- The box is **4 vCPU / 16 GB**. opencode runs are largely network-bound on the
  GLM API (~10–17% CPU each), but **effective concurrency was ~1–2 of 5**: the
  Omega bot continuously self-memorises Discord messages through the *same* flock
  + GLM subscription (observed ~45–66 s runs, ~15–30 facts each), holding 2–3
  slots. Five concurrently-launched extractions serialised behind it (slot waits
  56 s → 390 s → 580 s).
- **Cold-start tax:** the per-run fresh `HOME` forces an OpenCode sqlite migration
  every single run.
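
The wall-clock relation in the first bullet can be worked through with the row 10605 figures (317 facts, 441 s, ~55 tokens/fact). This is back-of-envelope arithmetic, not a measurement: the derived rate is an *effective* rate that folds in reading, reasoning, and tool overhead, so it understates the model's raw decode speed.

```python
def implied_rate(facts, tokens_per_fact, elapsed_s):
    """Effective output tokens per second implied by one completed run."""
    return facts * tokens_per_fact / elapsed_s


def wall_clock_s(facts, tokens_per_fact, tokens_per_s):
    """Wall-clock estimate: facts x tokens/fact / generation rate."""
    return facts * tokens_per_fact / tokens_per_s


rate = implied_rate(317, 55, 441)            # about 39.5 tokens/s
triple_json = wall_clock_s(3 * 317, 55, rate)  # 3x the facts, same format
triple_tsv = wall_clock_s(3 * 317, 18, rate)   # 3x the facts at ~18 tokens/fact
```

Under these assumptions, tripling the facts in verbose JSON triples the time to about 1,323 s, while the same volume at ~18 tokens/fact would take roughly the time of the original run.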

---

## 6. Fixes applied this session

1. **Chunking** of large sources (≤3,500 chars), merge + dedup across chunks.
2. **Retry-once-on-empty-chunk.**
3. **Per-chunk cap raised** from 780 s to 1,800 s (30 min), with a matching
   Temporal activity ceiling (3 h); a whitespace/case-tolerant `_flex_find`
   fallback was also added for anchors.
4. **Queued-path anchoring fixed** (forward `revision_id`+`source_text`).
5. **Abundance prompt** as a swappable file (`OC_EXTRACT_BROAD_PROMPT_FILE`), so
   prompts are tunable per-corpus without code changes.
6. Removed the prompt's mandatory `node` self-verify loop (it consumed agent turns
   on large inputs).

Net: reliability on large inputs went from *fails* to *works*; but raising recall
re-exposed §5.3, and per-event latency is now too high for 2,818 rows.

---

## 7. Proposed redesign (the open question)

We believe the engine can produce **thousands of facts per rich event in minutes**
if the output *mechanism* and *format* change. The central idea is to stop asking
for one monolithic write and instead use OpenCode as the **iterative agent it is**.
Candidate changes, roughly in priority order:

1. **An iterative, self-refining extraction loop (the core change).** Treat
   extraction as a loop the agent runs *many times*, not as a single emission.
   Each turn: (a) re-read `source.txt` and the facts written so far; (b) reason
   about what is still **missing or wrong** — uncovered clauses, un-typed entities,
   missing inverse/second-order relations, unanchored claims; (c) **append** the
   newly-found/corrected facts. The agent repeats until, *by its own reasoning*,
   the output is complete and correct (a full re-scan yields nothing new and every
   claim is anchored or properly flagged). This is the §5.3 fix **and** a recall
   multiplier: instead of one fragile giant generation, the agent accumulates and
   self-critiques toward completeness across many small, fast, **persisted** steps.
   It needs a clear stop criterion (e.g. *N consecutive re-scans add no new fact*)
   and a turn/time budget as a backstop.
2. **Append-friendly line format (JSONL or TSV), because the loop requires it.**
   You cannot cleanly *append* to a JSON `{"facts":[...]}` array — each turn would
   have to rewrite the whole (growing) file, re-triggering the monolithic-write
   failure. A line-oriented format appends with `>>` and never rewrites: **JSONL**
   (one fact object per line — preserves datatypes/anchors, easy to parse) or
   **TSV** (`subject⇥predicate⇥object⇥anchor`, ~18 tokens/fact vs ~55 for verbose
   JSON → ~**3× more facts per unit time**). Parse line-by-line in Python; a
   malformed line is skipped, not fatal. The loop + line format are one coupled
   design.
3. **Inline-text fallback.** Also parse the assembled assistant `text` (and accept
   `facts.jsonl`/`facts.tsv`/`facts.json`), so a non-file emission is never lost.
4. **Warm per-slot agent state.** Reuse N persistent `HOME` dirs (one per flock
   slot) to avoid the per-run sqlite migration, instead of a fresh `HOME` each
   run.
5. **Real concurrency.** Pause/curtail the Omega self-memorisation during a batch;
   raise `OPENCODE_MAX_CONCURRENT` (network-bound work has headroom on 4 vCPU);
   and/or run a row's chunks in parallel rather than sequentially.
6. **Open alternative we have *not* adopted (by owner preference):** bypass the
   agentic OpenCode CLI and call the GLM API directly (same model/subscription) as
   a single streamed completion — no agentic overhead, trivially parallel because
   it is pure network I/O. We are deliberately keeping the agentic OpenCode path
   and improving *it* first; this is noted for the reviewer's consideration.
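
The append-and-tolerant-parse half of items 1 and 2 can be sketched on the Python side as below. This illustrates the proposed design and is not existing code: `append_facts`, `read_facts_jsonl`, and the required-key check are hypothetical. It shows that appending never rewrites earlier lines, and that a truncated final line (for example, from a run killed mid-write) costs one fact rather than the whole file.

```python
import json

REQUIRED_KEYS = ("subject", "predicate", "object")


def append_facts(path, facts):
    """Append one JSON object per line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        for fact in facts:
            fh.write(json.dumps(fact, ensure_ascii=False) + "\n")


def read_facts_jsonl(path):
    """Parse line by line; return (facts, skipped_line_count)."""
    facts, skipped = [], 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                fact = json.loads(line)
            except json.JSONDecodeError:
                skipped += 1  # malformed or truncated line: skip, do not abort
                continue
            if isinstance(fact, dict) and all(k in fact for k in REQUIRED_KEYS):
                facts.append(fact)
            else:
                skipped += 1
    return facts, skipped
```

Reporting the skipped count alongside the facts gives the ingest step a cheap signal for when a run's output was cut short.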

---

## 8. Questions for the reviewer (third opinion wanted)

1. **Iterative loop + output mechanism.** Our intended fix (§7.1) is to have the
   agent loop many times — append, re-read, self-critique for completeness, repeat
   until *its own reasoning* judges `facts` complete and correct. Is that the right
   pattern for an agentic CLI, and what is the best **stop criterion** (dry-streak
   of N re-scans? an explicit self-assessed coverage check? a turn budget?) so it
   neither stops early nor loops forever? Is there a better OpenCode-native idiom
   (a specific tool, a different `--format`, a streaming-to-file pattern) to get a
   large, *complete* structured output out of an agentic CLI reliably?
2. **Format.** Is TSV the best compactness/robustness trade-off, or would
   JSON-Lines (one JSON object per line, appended) be safer to parse while still
   streamable? Any format that preserves datatypes/anchors but cuts tokens?
3. **Throughput.** Given a 4 vCPU / 16 GB box, a flat-rate GLM-5.1 coding
   subscription, and an agentic CLI that is network-bound — what concurrency and
   batching strategy would you choose for ~2,818 documents (and later, 50-page
   PDFs ≈ tens of chunks each)?
4. **Chunking vs context.** Chunking fixed reliability but loses cross-chunk
   context (an entity appearing in both chunk 1 and chunk 6 is minted twice and
   reconciled later
   at query time). Is per-chunk independence + downstream identity-resolution the
   right call, or should chunks carry a running entity/context summary (at the
   cost of serialising them)?
5. **Agentic vs direct API.** For *pure extraction* (read one doc → emit many
   facts), is there any real benefit to the agentic OpenCode wrapper over a direct
   streamed GLM completion, other than the flat-rate subscription billing? What
   would you weigh?
6. **Anchoring at scale.** We re-derive evidence spans by finding the model's
   quoted `surface_text` in the source (exact, then whitespace/case-tolerant). Is
   there a more robust span-grounding approach that survives paraphrase without
   trusting model-reported character offsets (which are unreliable)?

---

## Appendix A — key measurements

| Metric | Value |
|---|---|
| Substrate live statements / predicates / contexts | ~39.5M / ~938K / ~20K |
| Small-row yield (old prompt) | 317 facts / 140 preds / 95% anchored / 441 s |
| Large-row failure (old prompt, ≥6.9K chars) | 0 facts (`exit=143`) |
| Large-row recovery (chunked) | 1,236 / 540 / 368 facts; 95.1% / 96.7% / 89.9% anchored |
| Abundance prompt, 2.2K-char single chunk | 0 facts at 280 s & 1,800 s caps (write never completes) |
| Memory-worker comparison (small task) | ~45–66 s, ~15–30 facts, file written |
| Effective opencode concurrency (cap 5) | ~1–2 (Omega self-memorisation contention) |
| Output cost | verbose JSON ≈ 55 tokens/fact; proposed TSV ≈ 18 |
| Per-run overhead | OpenCode sqlite migration on every run (fresh HOME) |

## Appendix B — code map

| Concern | File |
|---|---|
| Agentic runner (docker exec, flock, file exchange) | `donto/apps/donto-api/opencode_agent.py` |
| Prompt, lenses, chunking, dedup, retry | `donto/apps/donto-api/opencode_extract.py` |
| Ingestion + evidence anchoring + source registration | `donto/apps/donto-api/helpers.py` |
| Temporal workflow + activities | `donto/apps/donto-api/workflows.py`, `activities.py` |
| Substrate (Rust) + Postgres | `dontosrv` (:7879), `donto-pg` (Postgres 16) |
| Swappable broad prompt | `donto/apps/donto-api/prompts/frontier_broad.txt` (`OC_EXTRACT_BROAD_PROMPT_FILE`) |

*Generated 2026-06-03 from live measurements on the donto-db box. Companion to the
donto substrate PRD and the "Generative Abundance" report at
`https://genes.apexpots.com/research/`.*
