donto — How the OpenCode Extraction Engine Works (and Where It Breaks)

An engineering report on agentic LLM fact-extraction into a paraconsistent substrate — 2026-06-03

Purpose of this document. This is a complete, honest technical description of how donto's "opencode" extraction pipeline works today, the failure modes we found while load-testing it on a 2,818-row historical dataset, the fixes applied so far, and an unresolved performance problem. It is written to be handed to an external reviewer (e.g. ChatGPT) for a third opinion. Section 8 lists the specific questions we want answered. Numbers are real measurements from 2026-06-02/03 on the production box, not estimates.

0. TL;DR

Goal: read a source document (newspaper article, BDM record, native-title transcript, or — in the current test — one row of the Colonial Frontier Massacres in Australia 1788–1930 dataset) and emit the largest possible set of evidence-anchored RDF-style claims into donto, a bitemporal, paraconsistent, evidence-first knowledge substrate.
How: a Python service drives OpenCode (an agentic CLI) running GLM-5.1 over a flat-rate coding subscription, inside a Docker container, via a shared-file scratch directory. The agent writes a facts.json; we ingest it, re-deriving an evidence span for every claim whose quoted surface_text is found verbatim in the source.
What works: on small inputs it is excellent — a 2,222-char event yields 317 facts / 140 distinct predicates / 95% anchored in ~441 s, including paraconsistent handling of a name conflict and an inferred jurisdictional hierarchy.
What breaks: large inputs and high-recall prompts produce zero facts. The agent generates the entire output as the argument to a single write tool call; for large outputs that generation never completes within the time cap, so the tool call never executes and nothing is captured — not even as inline assistant text. We confirmed this directly (Section 5.3).
The tension: "more facts" and "faster" pull against each other because facts are output tokens, and our output format (verbose JSON, ~55 tokens/fact) and output mechanism (one monolithic write) are both inefficient.
Proposed redesign (Section 7): make the agent iterate — accumulate and self-critique facts across many small, persisted turns until its own reasoning says the output is complete — instead of one giant write; pair it with an append-friendly line format (JSONL/TSV, ~3× fewer tokens/fact); add inline-text capture as a fallback; warm per-slot agent state; and real concurrency.

1. System architecture

  one source document (≤ a few hundred KB of text)
        │
        ▼
  donto-api  (FastAPI, :8000)
   ├─ POST /extract-and-ingest   (synchronous)
   └─ POST /jobs/extract|/batch  (async → Temporal durable queue)
        │
        ▼
  Temporal workflow  (ExtractionWorkflow)
   1. extract_facts_activity   ── runs OpenCode + ingests, in one activity slot
   2. align_predicates_activity ── files close-match predicate alignments
   3. resolve_entities_activity ── cross-context identity edges
        │
        ▼
  OpenCodeAgent  (opencode_agent.py)
   docker exec into the omega-bot container → `opencode run` (GLM-5.1)
   file exchange over a shared bind mount; host-wide flock concurrency cap
        │  facts.json
        ▼
  ingest_facts (helpers.py)  → dontosrv (Rust, :7879) → Postgres 16 (donto-pg)
   /documents/register + /documents/revision   (save the source first)
   /assert/batch with per-fact anchors          → donto_span + donto_evidence_link

The substrate (donto-pg) currently holds ~39.5M live statements across ~20K contexts and ~938K freely-minted predicates. The design thesis is generative abundance: emit free/untyped claims now, defer typing, alignment, identity-resolution, and joining to query time; hold contradictions forever as legal state; re-rank by reality over time rather than deleting on conflict. Extraction is therefore tuned for recall, not precision-by-suppression.

2. The OpenCode runtime

OpenCodeAgent.run(prompt, input_files, output_files, timeout, model) is the generic agentic step. Mechanics:

Why a container. OpenCode only honours the GLM coding-subscription provider config reliably inside the omega-bot container (which holds GLM_API_KEY). So we docker exec into it rather than running opencode on the host.
File exchange. A per-run scratch dir is created on a shared bind mount: host /data/omega/shared/oc/<run_id>/ == container /data/oc/<run_id>/. We write prompt.txt + source.txt there, run with that dir as CWD, and read back facts.json. No stdin/stdout plumbing.

The invocation (verbatim shape):

docker exec -e OPENCODE_CONFIG_CONTENT='{provider z-ai → api.z.ai/api/coding/paas/v4, model glm-5.1}' \
            -e HOME=/data/oc/<run_id> \
            omega-bot sh -c 'cd /data/oc/<run_id> && \
              timeout <N> opencode run --dangerously-skip-permissions --format json "$(cat prompt.txt)"'

Per-run isolated HOME. HOME is set to the run's own scratch dir so each run gets an isolated OpenCode sqlite state (avoids SQLITE_BUSY when runs are concurrent in the shared container). Cost: every run bootstraps a fresh OpenCode DB — its stderr literally logs "Performing one time database migration… Database migration complete." on every invocation.
Concurrency cap. A host-wide flock over N slot files (/var/lock/opencode-slots/slot-{0..N-1}.lock, OPENCODE_MAX_CONCURRENT, currently 5) bounds how many opencode run subprocesses exist across all consumers (genealogy worker, memory worker, sync API, the Omega bot's self-memorisation). flock auto-releases on death, so killed runs never leak locks.
Timeout. timeout <N> (coreutils) bounds the agent; the Python subprocess timeout is N+60. An exceeded cap shows as exit=143 (SIGTERM).
Result. AgentRun{exit_code, elapsed_s, text (assembled from 'text' events), output_files, event_counts, stderr_tail, timed_out}. Important: elapsed_s includes time spent waiting for a flock slot, which inflated some logged durations to 1,170–1,360 s even though the agent itself was capped at 780 s.
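
The slot cap and the wait-time accounting described above can be illustrated with a small sketch. It is not the code in opencode_agent.py: the function name acquire_slot, the polling interval, and the returned wait time are assumptions made for illustration. It relies on the fact that flock locks belong to an open file description, so the kernel releases them when the holding process dies.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def acquire_slot(lock_dir, max_slots, poll_s=0.5):
    """Hold one of max_slots flock slots; yields (slot_index, wait_seconds).

    Hypothetical helper: scans slot-0..slot-(N-1) and takes the first free one,
    sleeping poll_s seconds between scans when every slot is busy.
    """
    os.makedirs(lock_dir, exist_ok=True)
    started = time.monotonic()
    while True:
        for i in range(max_slots):
            path = os.path.join(lock_dir, f"slot-{i}.lock")
            fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
            try:
                # Non-blocking exclusive lock; raises if another holder has it.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                os.close(fd)
                continue
            try:
                yield i, time.monotonic() - started
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
                os.close(fd)
            return
        time.sleep(poll_s)
```

Starting the agent timer only inside the with block, and logging the yielded wait separately, would keep slot-queue time out of elapsed_s.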

3. The extraction prompt and multi-pass lenses

Extraction is agentic, not a single chat completion. The prompt instructs the agent to read source.txt and write facts.json as {"facts": [...]}.

Pass 1 ("broad") is a general sweep.
Passes 2+ (only in exhaustive mode) are single-lens passes — kinship, vital-events, places, identity-resolution, occupations, sources/provenance — each seeing all prior facts and told to find new ones. Facts are content-key deduped across passes; passes stop after K consecutive empty passes.

The batch path uses single-pass broad (mode=opencode, passes=1).
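
The pass loop described above (dedup across passes, stop after K consecutive empty passes) can be sketched as follows. This is an illustration under stated assumptions, not the code in opencode_extract.py: the content key is assumed to be the (subject, predicate, object) triple, and run_pass is a hypothetical callable standing in for one OpenCode run.

```python
import json


def content_key(fact):
    # Assumed key: subject + predicate + object, serialised deterministically.
    return json.dumps(
        [fact.get("subject"), fact.get("predicate"), fact.get("object")],
        sort_keys=True,
    )


def run_passes(lenses, run_pass, max_empty=2):
    """Run one pass per lens; stop after max_empty consecutive passes add nothing.

    run_pass(lens, prior_facts) is a hypothetical callable returning a list of
    fact dicts. Each pass sees a copy of everything merged so far.
    """
    seen, merged, empty_streak = set(), [], 0
    for lens in lenses:
        added = 0
        for fact in run_pass(lens, list(merged)):
            key = content_key(fact)
            if key not in seen:
                seen.add(key)
                merged.append(fact)
                added += 1
        # A pass that only repeats known facts counts as empty.
        empty_streak = 0 if added else empty_streak + 1
        if empty_streak >= max_empty:
            break
    return merged
```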

The donto fact shape each emitted object must follow:

{ "subject": "ex:<kebab>", "predicate": "<camelCase>",
  "object": {"iri": "ex:<kebab>"} | {"literal": {"v": <value>, "dt": "<xsd type>"}},
  "anchor": {"surface_text": "<EXACT substring of source.txt>"} | null,
  "confidence": 0.0-1.0, "hypothesis_only": true|false }

The anchor rule is load-bearing: surface_text must be copyable character-for-character from the source; otherwise anchor:null + hypothesis_only:true + confidence<0.9. This is what makes every claim traceable to evidence.
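
The shape and the anchor rule can be checked mechanically. A minimal sketch of such a check, assuming only the rules stated above; the function name check_fact and the wording of its messages are hypothetical:

```python
def check_fact(fact, source_text):
    """Return a list of rule violations for one fact (empty list = passes)."""
    problems = []
    for field in ("subject", "predicate", "object"):
        if not fact.get(field):
            problems.append(f"missing {field}")

    obj = fact.get("object")
    if isinstance(obj, dict) and obj:
        # Exactly one of "iri" / "literal" must be present.
        if ("iri" in obj) == ("literal" in obj):
            problems.append("object needs exactly one of iri / literal")

    anchor = fact.get("anchor")
    if anchor is None:
        # Unanchored claims must be flagged and kept below 0.9 confidence.
        if fact.get("hypothesis_only") is not True:
            problems.append("anchor is null but hypothesis_only is not true")
        if not fact.get("confidence", 1.0) < 0.9:
            problems.append("anchor is null but confidence is not < 0.9")
    else:
        surface = anchor.get("surface_text") or ""
        if not surface or surface not in source_text:
            problems.append("surface_text is not an exact substring of the source")
    return problems
```

For example, a fact quoting "Lawn Hill" passes against a source containing that string, while the same fact with anchor set to null and confidence 0.95 is reported twice (missing hypothesis flag, confidence too high).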

We recently replaced the original conservative prompt ("reuse predicates, do not mint; aim for ~300 facts") with an abundance prompt (15.7 KB): "mint predicates freely; no upper limit; rdf:type+rdfs:label every entity; emit inverse edges; capture second-order relations — causation, reprisal chains, attestation-as-its-own-fact, jurisdiction ladders, euphemism decoding, legal-process chains; contradictions are wanted, never reconcile." This prompt is philosophically aligned with the substrate — and, as Section 5 shows, it exposed the engine's real bottleneck.

4. Ingestion and evidence anchoring

ingest_facts(facts, context, revision_id, source_text):

Ensure the context exists (/contexts/ensure).
For each fact: parse the object into object_iri or object_lit; map confidence→maturity (0–4); route hypothesis_only facts to polarity=unknown, maturity=0 (preserved, distinguishable).
Anchor. If revision_id + source_text are present and the fact carries a surface_text: accept the model's offsets if they match, else re-find the span by exact substring, else by a whitespace/case-tolerant regex (_flex_find). On success, attach an anchor that materialises a donto_span + donto_evidence_link row.
POST /assert/batch to dontosrv.
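
The three-step anchor lookup (model offsets, exact substring, tolerant regex) can be sketched as below. This is a stand-in written for illustration, not the actual _flex_find in helpers.py; the function name find_span and its offsets parameter are assumptions.

```python
import re


def find_span(source_text, surface_text, offsets=None):
    """Return (start, end) of surface_text in source_text, or None.

    Order: trust model offsets only if they reproduce the quote exactly,
    then exact substring search, then a whitespace/case-tolerant regex.
    """
    if not surface_text or not surface_text.strip():
        return None

    if offsets is not None:
        start, end = offsets
        if 0 <= start <= end <= len(source_text) and source_text[start:end] == surface_text:
            return start, end

    start = source_text.find(surface_text)
    if start != -1:
        return start, start + len(surface_text)

    # Tolerant fallback: any run of whitespace matches any other, case ignored.
    pattern = r"\s+".join(re.escape(tok) for tok in surface_text.split())
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None
```

The returned offsets always index the real source, so the stored span reflects the source's own casing and whitespace rather than the model's rendering of it.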

The source is saved first: register_source_document calls /documents/register + /documents/revision, so facts anchor to a real revision_id. Bug found & fixed this session: the queued (Temporal) path called ingest_facts without revision_id/source_text, so queued extractions saved the source but anchored zero spans. The synchronous endpoint did it correctly. We aligned the queued path to the sync path; queued extractions now anchor (verified 301/317 spans on a pilot).

5. Empirical evaluation and failure modes

Test corpus: the Univ. of Newcastle Colonial Frontier Massacres events export (2,818 events; each rendered to a provenance-stamped text document of ~1.7 K–19 K chars). All runs GLM-5.1, single broad pass.

5.1 It works well on small inputs

Row	Source chars	Facts	Distinct preds	Anchored	Time
10605 (Anthony Cox)	2,222	317	140	95.0%	441 s
20781 (Lawn Hill)	1,765	249	126	97.6%	389 s
38090 (North Keppel)	2,351	322	165	68.6%¹	580 s

¹ lower anchor rate is correct — many facts are structural inferences (rdf:type, likelySameAs, grouping) that legitimately carry anchor:null.

Quality is genuinely strong: e.g. 10605 captured all 26 populated source fields, held a name conflict paraconsistently ("Anthony" vs "Thomas" Cox → two entities + likelySameAs), decomposed a follow-on reprisal into its own event, and inferred a correct 1850 jurisdiction chain (station → run → Darling Downs → Police District of Surat → Colony of NSW).

5.2 The "size cliff": large inputs → zero facts

With the original prompt, 3 of 5 larger rows returned 0 facts:

Row	Source chars	Result
20744 (Goulbolba)	6,925	`exit=143`, 0 facts
20111 ('King Billy')	14,100	`exit=143` @ 836 s, 0 facts
10763 (NMP detachment)	19,332	`exit=143`, 0 facts

Fix that worked: chunking. Splitting the source into ≤3,500-char chunks (on paragraph→line→sentence→hard boundaries), running the broad pass per chunk, and merging+deduping recovered all three:

Row	Before	After chunking
10763	0	1,236 facts, 95.1% anchored
20744	0	540 facts, 96.7%
20111	0	368 facts, 89.9%

Each chunk is a contiguous slice, so quoted surface_text still resolves against the full source. We also added retry-once-on-empty-chunk (an occasional chunk still returns nothing) so a single bad chunk no longer drops a slice of the doc.
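
A minimal sketch of boundary-preferring chunking, assuming the preference order stated above (paragraph, then line, then sentence, then a hard cut). The function name chunk_source and the separator strings are assumptions, not the implementation in opencode_extract.py. The property that matters is that the chunks concatenate back to the source exactly, so any quote taken from a chunk is also a substring of the full document.

```python
def chunk_source(text, max_chars=3500):
    """Split text into contiguous slices of at most max_chars characters."""
    chunks, pos = [], 0
    while pos < len(text):
        end = min(pos + max_chars, len(text))
        if end < len(text):
            window = text[pos:end]
            # Prefer the last paragraph break, then line break, then sentence end.
            for sep in ("\n\n", "\n", ". "):
                cut = window.rfind(sep)
                if cut > 0:
                    end = pos + cut + len(sep)
                    break
            # If no separator is found, end stays at the hard max_chars cut.
        chunks.append(text[pos:end])
        pos = end
    return chunks
```

Because separators stay attached to the chunk that precedes them, "".join(chunk_source(text)) == text holds for any input.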

5.3 The root failure mode (confirmed): the monolithic write never completes

The size cliff is a symptom. The cause, confirmed by a controlled run:

Diagnostic. Abundance prompt (15.7 KB) + a single 2,222-char source, 280 s cap. Result: exit=143, events={step_start:2, tool_use:1, step_finish:1}, facts.json = 0 bytes, assembled inline text = 0 chars. stderr showed only the one-time sqlite migration.

Interpretation: the agent fires one tool call (the read of source.txt), then begins generating the entire facts.json as the argument of a single write tool call. For a high-recall prompt that argument is enormous; the generation does not finish before the cap, so the write call never executes → no file. And because the content lives in the in-flight tool-call stream (not in text events), it is not recoverable from the assistant text either. We get nothing despite minutes of generation.

This also explains an earlier exit=0 run that ran 639 s and still produced no file: the model ended its turn without ever completing (or issuing) the write.

The original prompt survived only because it asked for less (~300 facts; the pilot row yielded 317), so the single write was small enough to finish (~441 s). Raising recall pushed the single write past what one tool call can complete in time. This is a mechanism problem, not a "too many facts" problem: the model is willing and able to generate the volume; we cannot capture it.
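
The two empty outcomes described in this section (exit=143 with no file, and exit=0 with no file) can be told apart mechanically. A minimal sketch, assuming the caller already has the exit code, the size of facts.json, and the length of the assembled text; the function name and the labels are hypothetical:

```python
def classify_run(exit_code, facts_bytes, text_chars):
    """Label a finished run by what, if anything, was captured."""
    if facts_bytes > 0:
        return "file_written"
    if text_chars > 0:
        # Nothing on disk, but the assistant text may still hold facts.
        return "inline_text_only"
    if exit_code == 143:
        # SIGTERM from the time cap while the write argument was in flight.
        return "killed_mid_generation"
    return "ended_without_write"
```

The diagnostic run above maps to "killed_mid_generation"; the earlier 639 s run with exit=0 maps to "ended_without_write".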

5.4 Speed ↔︎ facts tension and resource contention

Output is verbose JSON (~55 tokens/fact incl. anchor). Wall-clock ≈ facts × tokens/fact ÷ generation-rate. So at fixed format, "3× the facts" ≈ "3× the time".
The box is 4 vCPU / 16 GB. opencode runs are largely network-bound on the GLM API (~10–17% CPU each), but effective concurrency was ~1–2 of 5: the Omega bot continuously self-memorises Discord messages through the same flock + GLM subscription (observed ~45–66 s runs, ~15–30 facts each), holding 2–3 slots. Five concurrently-launched extractions serialised behind it (slot waits 56 s → 390 s → 580 s).
Cold-start tax: the per-run fresh HOME forces an OpenCode sqlite migration every single run.
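
A worked instance of the wall-clock relation, using the 10605 row (317 facts, 441 s) and the ~55 and ~18 tokens/fact figures from this report. The generation rate below is derived, not measured: it assumes the whole 441 s was spent generating output and ignores reading, tool-call, and startup overhead, so treat it as a rough lower bound on the true rate.

```python
def wall_clock_s(facts, tokens_per_fact, tokens_per_s):
    """Wall-clock estimate: facts x tokens/fact / generation rate."""
    return facts * tokens_per_fact / tokens_per_s


# Implied rate if all 441 s of the 10605 run were output generation.
implied_rate = 317 * 55 / 441          # roughly 39.5 tokens/s

# Same format, three times the facts: three times the time.
triple_json_s = wall_clock_s(3 * 317, 55, implied_rate)

# Same 317 facts in the proposed ~18 tokens/fact line format.
same_facts_tsv_s = wall_clock_s(317, 18, implied_rate)

print(round(implied_rate, 1), round(triple_json_s), round(same_facts_tsv_s))
```

Under these assumptions, tripling the facts in verbose JSON costs about 1,323 s, while the same 317 facts in the compact format would take about 144 s.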

6. Fixes applied this session

Chunking of large sources (≤3,500 chars), merge + dedup across chunks.
Retry-once-on-empty-chunk.
Per-chunk cap raised from 780 s to 30 min, with a matching Temporal activity ceiling (3 h), plus a whitespace/case-tolerant _flex_find fallback for anchors.
Queued-path anchoring fixed (forward revision_id+source_text).
Abundance prompt as a swappable file (OC_EXTRACT_BROAD_PROMPT_FILE), so prompts are tunable per-corpus without code changes.
Removed the prompt's mandatory node self-verify loop (it consumed agent turns on large inputs).

Net: large inputs went from returning 0 facts to extracting reliably, but raising recall re-exposed §5.3, and per-event latency is now too high for 2,818 rows.

7. Proposed redesign (the open question)

We believe the engine can produce thousands of facts per rich event in minutes if the output mechanism and format change. The central idea is to stop asking for one monolithic write and instead use OpenCode as the iterative agent it is. Candidate changes, roughly in priority order:

An iterative, self-refining extraction loop (the core change). Treat extraction as a loop the agent runs many times over, not a single emission. Each turn: (a) re-read source.txt and the facts written so far; (b) reason about what is still missing or wrong — uncovered clauses, un-typed entities, missing inverse/second-order relations, unanchored claims; (c) append the newly-found/corrected facts. The agent repeats until, by its own reasoning, the output is complete and correct (a full re-scan yields nothing new and every claim is anchored or properly flagged). This is the §5.3 fix and a recall multiplier: instead of one fragile giant generation, the agent accumulates and self-critiques toward completeness across many small, fast, persisted steps. It needs a clear stop criterion (e.g. N consecutive re-scans add no new fact) and a turn/time budget as a backstop.
Append-friendly line format (JSONL or TSV), because the loop requires it. You cannot cleanly append to a JSON {"facts":[...]} array — each turn would have to rewrite the whole (growing) file, re-triggering the monolithic-write failure. A line-oriented format appends with >> and never rewrites: JSONL (one fact object per line — preserves datatypes/anchors, easy to parse) or TSV (subject⇥predicate⇥object⇥anchor, ~18 tokens/fact vs ~55 for verbose JSON → ~3× more facts per unit time). Parse line-by-line in Python; a malformed line is skipped, not fatal. The loop + line format are one coupled design.
Inline-text fallback. Also parse the assembled assistant text (and accept facts.jsonl/facts.tsv/facts.json), so a non-file emission is never lost.
Warm per-slot agent state. Reuse N persistent HOME dirs (one per flock slot) to avoid the per-run sqlite migration, instead of a fresh HOME each run.
Real concurrency. Pause/curtail the Omega self-memorisation during a batch; raise OPENCODE_MAX_CONCURRENT (network-bound work has headroom on 4 vCPU); and/or run a row's chunks in parallel rather than sequentially.
Open alternative we have not adopted (by owner preference): bypass the agentic OpenCode CLI and call the GLM API directly (same model/subscription) as a single streamed completion — no agentic overhead, trivially parallel because it is pure network I/O. We are deliberately keeping the agentic OpenCode path and improving it first; this is noted for the reviewer's consideration.
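
The loop and the line format proposed in the first two items can be sketched together. This is one possible shape under stated assumptions, not an adopted design: run_turn is a hypothetical callable that makes the agent perform one append turn, facts are assumed to be JSONL, distinct facts are assumed to be keyed on (subject, predicate, object), and dry_limit / max_turns are the dry-streak stop criterion and the backstop budget.

```python
import json
import os


def parse_jsonl_facts(text):
    """Parse one fact per line; malformed or incomplete lines are skipped."""
    facts, skipped = [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # e.g. a line truncated when the agent was killed
            continue
        if isinstance(obj, dict) and {"subject", "predicate", "object"} <= obj.keys():
            facts.append(obj)
        else:
            skipped += 1
    return facts, skipped


def read_distinct(facts_path):
    """Read the append-only file and drop repeated (s, p, o) triples."""
    if not os.path.exists(facts_path):
        return []
    with open(facts_path, encoding="utf-8") as fh:
        facts, _ = parse_jsonl_facts(fh.read())
    seen, distinct = set(), []
    for fact in facts:
        key = json.dumps([fact["subject"], fact["predicate"], fact["object"]], sort_keys=True)
        if key not in seen:
            seen.add(key)
            distinct.append(fact)
    return distinct


def run_until_dry(run_turn, facts_path, dry_limit=2, max_turns=20):
    """Call run_turn() until dry_limit consecutive turns add no new fact.

    Returns (distinct_facts, turns_used). max_turns is the backstop budget.
    """
    facts, count, dry, turn = [], 0, 0, 0
    for turn in range(1, max_turns + 1):
        run_turn()
        facts = read_distinct(facts_path)
        dry = dry + 1 if len(facts) <= count else 0
        count = max(count, len(facts))
        if dry >= dry_limit:
            break
    return facts, turn
```

Because the file is re-read after every turn, whatever was appended before a timeout survives, which is the property the single monolithic write lacks.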

8. Questions for the reviewer (third opinion wanted)

Iterative loop + output mechanism. Our intended fix (the first item in Section 7) is to have the agent loop many times: append, re-read, self-critique for completeness, and repeat until its own reasoning judges the facts complete and correct. Is that the right pattern for an agentic CLI, and what is the best stop criterion (dry-streak of N re-scans? an explicit self-assessed coverage check? a turn budget?) so it neither stops early nor loops forever? Is there a better OpenCode-native idiom (a specific tool, a different --format, a streaming-to-file pattern) to get a large, complete structured output out of an agentic CLI reliably?
Format. Is TSV the best compactness/robustness trade-off, or would JSON-Lines (one JSON object per line, appended) be safer to parse while still streamable? Any format that preserves datatypes/anchors but cuts tokens?
Throughput. Given a 4 vCPU / 16 GB box, a flat-rate GLM-5.1 coding subscription, and an agentic CLI that is network-bound — what concurrency and batching strategy would you choose for ~2,818 documents (and later, 50-page PDFs ≈ tens of chunks each)?
Chunking vs context. Chunking fixed reliability but loses cross-chunk context (an entity on chunk 1 and chunk 6 is minted twice and reconciled later at query time). Is per-chunk independence + downstream identity-resolution the right call, or should chunks carry a running entity/context summary (at the cost of serialising them)?
Agentic vs direct API. For pure extraction (read one doc → emit many facts), is there any real benefit to the agentic OpenCode wrapper over a direct streamed GLM completion, other than the flat-rate subscription billing? What would you weigh?
Anchoring at scale. We re-derive evidence spans by finding the model's quoted surface_text in the source (exact, then whitespace/case-tolerant). Is there a more robust span-grounding approach that survives paraphrase without trusting model-reported character offsets (which are unreliable)?

Appendix A — key measurements

Metric	Value
Substrate live statements / predicates / contexts	~39.5M / ~938K / ~20K
Small-row yield (old prompt)	317 facts / 140 preds / 95% anchored / 441 s
Large-row failure (old prompt, ≥6.9K chars)	0 facts (`exit=143`)
Large-row recovery (chunked)	1,236 / 540 / 368 facts; 90–95% anchored
Abundance prompt, 2.2K-char single chunk	0 facts at 280 s & 1,800 s caps (write never completes)
Memory-worker comparison (small task)	~45–66 s, ~15–30 facts, file written
Effective opencode concurrency (cap 5)	~1–2 (Omega self-memorisation contention)
Output cost	verbose JSON ≈ 55 tokens/fact; proposed TSV ≈ 18
Per-run overhead	OpenCode sqlite migration on every run (fresh HOME)

Appendix B — code map

Concern	File
Agentic runner (docker exec, flock, file exchange)	`donto/apps/donto-api/opencode_agent.py`
Prompt, lenses, chunking, dedup, retry	`donto/apps/donto-api/opencode_extract.py`
Ingestion + evidence anchoring + source registration	`donto/apps/donto-api/helpers.py`
Temporal workflow + activities	`donto/apps/donto-api/workflows.py`, `activities.py`
Substrate (Rust) + Postgres	`dontosrv` (:7879), `donto-pg` (Postgres 16)
Swappable broad prompt	`donto/apps/donto-api/prompts/frontier_broad.txt` (`OC_EXTRACT_BROAD_PROMPT_FILE`)

Generated 2026-06-03 from live measurements on the donto-db box. Companion to the donto substrate PRD and the "Generative Abundance" report at https://genes.apexpots.com/research/.