An engineering report on agentic LLM fact-extraction into a paraconsistent substrate — 2026-06-03
Purpose of this document. This is a complete, honest technical description of how donto's "opencode" extraction pipeline works today, the failure modes we found while load-testing it on a 2,818-row historical dataset, the fixes applied so far, and an unresolved performance problem. It is written to be handed to an external reviewer (e.g. ChatGPT) for a third opinion. Section 8 lists the specific questions we want answered. Numbers are real measurements from 2026-06-02/03 on the production box, not estimates.
- The agent writes facts.json; we ingest it, re-deriving an evidence span for every claim whose quoted surface_text is found verbatim in the source.
- The agent generates the whole output as the argument of a single write tool call; for large outputs that generation never completes within the time cap, so the tool call never executes and nothing is captured, not even as inline assistant text. We confirmed this directly (Section 5.3).
- The proposed direction: have the agent append facts across many small, persisted turns until its own reasoning says the output is complete, instead of one giant write; pair it with an append-friendly line format (JSONL/TSV, ~3× fewer tokens/fact); add inline-text capture as a fallback; warm per-slot agent state; and real concurrency.

one source document (≤ a few hundred KB of text)
│
▼
donto-api (FastAPI, :8000)
├─ POST /extract-and-ingest (synchronous)
└─ POST /jobs/extract|/batch (async → Temporal durable queue)
│
▼
Temporal workflow (ExtractionWorkflow)
1. extract_facts_activity ── runs OpenCode + ingests, in one activity slot
2. align_predicates_activity ── files close-match predicate alignments
3. resolve_entities_activity ── cross-context identity edges
│
▼
OpenCodeAgent (opencode_agent.py)
docker exec into the omega-bot container → `opencode run` (GLM-5.1)
file exchange over a shared bind mount; host-wide flock concurrency cap
│ facts.json
▼
ingest_facts (helpers.py) → dontosrv (Rust, :7879) → Postgres 16 (donto-pg)
/documents/register + /documents/revision (save the source first)
/assert/batch with per-fact anchors → donto_span + donto_evidence_link
The substrate (donto-pg) currently holds ~39.5M
live statements across ~20K contexts and
~938K freely-minted predicates. The design thesis is
generative abundance: emit free/untyped claims now, defer
typing, alignment, identity-resolution, and joining to query
time; hold contradictions forever as legal state; re-rank by
reality over time rather than deleting on conflict. Extraction is
therefore tuned for recall, not
precision-by-suppression.
OpenCodeAgent.run(prompt, input_files, output_files, timeout, model) is the generic agentic step. Mechanics:
- OpenCode runs inside the omega-bot container (which holds GLM_API_KEY), so we docker exec into it rather than running opencode on the host.
- File exchange uses a shared bind mount: host /data/omega/shared/oc/<run_id>/ == container /data/oc/<run_id>/. We write prompt.txt + source.txt there, run with that dir as CWD, and read back facts.json. No stdin/stdout plumbing.
- The invocation is:

docker exec -e OPENCODE_CONFIG_CONTENT='{provider z-ai → api.z.ai/api/coding/paas/v4, model glm-5.1}' \
-e HOME=/data/oc/<run_id> \
omega-bot sh -c 'cd /data/oc/<run_id> && \
timeout <N> opencode run --dangerously-skip-permissions --format json "$(cat prompt.txt)"'

- HOME is set to the run's own scratch dir so each run gets an isolated OpenCode sqlite state (avoids SQLITE_BUSY when runs are concurrent in the shared container). Cost: every run bootstraps a fresh OpenCode DB; its stderr literally logs "Performing one time database migration… Database migration complete." on every invocation.
- A host-wide flock over N slot files (/var/lock/opencode-slots/slot-{0..N-1}.lock, OPENCODE_MAX_CONCURRENT, currently 5) bounds how many opencode run subprocesses exist across all consumers (genealogy worker, memory worker, sync API, the Omega bot's self-memorisation). flock auto-releases on death, so killed runs never leak locks.
- timeout <N> (coreutils) bounds the agent; the Python subprocess timeout is N+60. An exceeded cap shows as exit=143 (SIGTERM).
- The result is AgentRun{exit_code, elapsed_s, text (assembled from 'text' events), output_files, event_counts, stderr_tail, timed_out}. Important: elapsed_s includes time spent waiting for a flock slot, which inflated some logged durations to 1,170–1,360 s even though the agent itself was capped at 780 s.

Extraction is agentic, not a single chat completion. The prompt
instructs the agent to read source.txt and
write facts.json as
{"facts": [...]}.
Later passes (exhaustive mode) are single-lens passes (kinship, vital-events, places, identity-resolution, occupations, sources/provenance), each seeing all prior facts and told to find new ones. Facts are content-key deduped across passes; passes stop after K consecutive empty passes.
The batch path uses single-pass broad (mode=opencode, passes=1).
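As a rough illustration of the multi-pass behaviour described above (each pass sees all prior facts, facts are content-key deduped, and the loop stops after K consecutive empty passes), the following is a minimal sketch. The names `content_key` and `run_passes`, and the choice of subject/predicate/object as the dedup key, are assumptions for illustration; the report does not state how the real key is computed.

```python
import json
from typing import Callable, Iterable

Fact = dict


def content_key(fact: Fact) -> str:
    """Hypothetical dedup key: subject + predicate + object, serialised stably."""
    return json.dumps(
        [fact.get("subject"), fact.get("predicate"), fact.get("object")],
        sort_keys=True,
        ensure_ascii=False,
    )


def run_passes(
    passes: Iterable[Callable[[list[Fact]], list[Fact]]],
    max_empty: int = 2,
) -> list[Fact]:
    """Run lens passes in order; stop after max_empty consecutive passes add nothing."""
    seen: set[str] = set()
    merged: list[Fact] = []
    empty_streak = 0
    for run_pass in passes:
        new_facts = []
        # Each pass receives a copy of everything found so far.
        for fact in run_pass(list(merged)):
            key = content_key(fact)
            if key not in seen:
                seen.add(key)
                new_facts.append(fact)
        merged.extend(new_facts)
        empty_streak = 0 if new_facts else empty_streak + 1
        if empty_streak >= max_empty:
            break
    return merged
```

A pass that only re-emits known facts counts as empty here, which is one plausible reading of "empty pass"; the real pipeline may define it differently.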
The donto fact shape each emitted object must follow:
{ "subject": "ex:<kebab>", "predicate": "<camelCase>",
"object": {"iri": "ex:<kebab>"} | {"literal": {"v": <value>, "dt": "<xsd type>"}},
"anchor": {"surface_text": "<EXACT substring of source.txt>"} | null,
"confidence": 0.0-1.0, "hypothesis_only": true|false }
The anchor rule is load-bearing: surface_text must be copyable character-for-character from the source; otherwise anchor:null + hypothesis_only:true + confidence<0.9. This is what makes every claim traceable to evidence.
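The anchor rule can also be checked mechanically after the fact. The sketch below is a hypothetical post-check, not something the report says the pipeline runs: `enforce_anchor_rule` and the 0.89 confidence ceiling are assumptions chosen only to satisfy "confidence<0.9", and the sample sentence is made up.

```python
def enforce_anchor_rule(fact: dict, source_text: str) -> dict:
    """Return a copy of fact that satisfies the anchor rule.

    If surface_text is an exact substring of the source, the fact is kept as is.
    Otherwise the fact is demoted: anchor=None, hypothesis_only=True and
    confidence capped below 0.9 (0.89 is an assumed ceiling).
    """
    out = dict(fact)
    anchor = out.get("anchor") or {}
    surface = anchor.get("surface_text")
    if surface and surface in source_text:
        return out
    out["anchor"] = None
    out["hypothesis_only"] = True
    out["confidence"] = min(float(out.get("confidence", 0.0)), 0.89)
    return out


source = "John Smith was born at Example Station in 1850."
fact = {
    "subject": "ex:john-smith",
    "predicate": "bornIn",
    "object": {"literal": {"v": "1850", "dt": "xsd:gYear"}},
    "anchor": {"surface_text": "born at Example Station in 1850"},
    "confidence": 0.95,
    "hypothesis_only": False,
}
checked = enforce_anchor_rule(fact, source)  # unchanged: the quote is verbatim
```

Demoting rather than dropping keeps the claim in the substrate, in line with the recall-first design described earlier.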
We recently replaced the original conservative prompt
("reuse predicates, do not mint; aim for ~300 facts")
with an abundance prompt (15.7 KB): "mint predicates
freely; no upper limit; rdf:type+rdfs:label
every entity; emit inverse edges; capture second-order relations —
causation, reprisal chains, attestation-as-its-own-fact, jurisdiction
ladders, euphemism decoding, legal-process chains; contradictions are
wanted, never reconcile." This prompt is philosophically aligned with
the substrate — and, as Section 5 shows, it exposed the engine's real
bottleneck.
ingest_facts(facts, context, revision_id, source_text):
- Ensure the context exists (/contexts/ensure).
- Normalise each object to object_iri or object_lit; map confidence→maturity (0–4); route hypothesis_only facts to polarity=unknown, maturity=0 (preserved, distinguishable).
- When revision_id + source_text are present and the fact carries a surface_text: accept the model's offsets if they match, else re-find the span by exact substring, else by a whitespace/case-tolerant regex (_flex_find). On success, attach an anchor that materialises a donto_span + donto_evidence_link row.
- POST /assert/batch to dontosrv.

The source is saved first:
register_source_document calls
/documents/register + /documents/revision, so
facts anchor to a real revision_id. Bug found &
fixed this session: the queued (Temporal) path called
ingest_facts without
revision_id/source_text, so queued extractions
saved the source but anchored zero spans. The
synchronous endpoint did it correctly. We aligned the queued path to the
sync path; queued extractions now anchor (verified 301/317 spans on a
pilot).
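The span re-finding fallback described above (exact substring first, then a whitespace/case-tolerant regex) could look roughly like the sketch below. This is not the real `_flex_find` from helpers.py, whose implementation is not shown in this report; `flex_find` here is a stand-in written under the assumption that "tolerant" means any run of whitespace matches any other and letter case is ignored.

```python
import re
from typing import Optional


def flex_find(surface_text: str, source_text: str) -> Optional[tuple[int, int]]:
    """Locate surface_text in source_text and return (start, end) offsets.

    Tries an exact substring match first, then a regex in which every run of
    whitespace in the quote matches any run of whitespace in the source,
    case-insensitively. Returns None when neither matches.
    """
    if not surface_text.strip():
        return None
    start = source_text.find(surface_text)
    if start != -1:
        return start, start + len(surface_text)
    tokens = surface_text.split()
    # re.escape keeps punctuation in the quote from being read as regex syntax.
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, source_text, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None
```

Because offsets are always recomputed against the stored source, the span never depends on character positions reported by the model.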
Test corpus: the Univ. of Newcastle Colonial Frontier Massacres events export (2,818 events; each rendered to a provenance-stamped text document of ~1.7 K–19 K chars). All runs GLM-5.1, single broad pass.
| Row | Source chars | Facts | Distinct preds | Anchored | Time |
|---|---|---|---|---|---|
| 10605 (Anthony Cox) | 2,222 | 317 | 140 | 95.0% | 441 s |
| 20781 (Lawn Hill) | 1,765 | 249 | 126 | 97.6% | 389 s |
| 38090 (North Keppel) | 2,351 | 322 | 165 | 68.6%¹ | 580 s |
¹ lower anchor rate is correct — many facts are structural inferences
(rdf:type, likelySameAs, grouping) that
legitimately carry anchor:null.
Quality is genuinely strong: e.g. 10605 captured all 26 populated
source fields, held a name conflict paraconsistently ("Anthony" vs
"Thomas" Cox → two entities + likelySameAs), decomposed a
follow-on reprisal into its own event, and inferred a correct 1850
jurisdiction chain (station → run → Darling Downs → Police District of
Surat → Colony of NSW).
With the original prompt, 3 of 5 larger rows returned 0 facts:
| Row | Source chars | Result |
|---|---|---|
| 20744 (Goulbolba) | 6,925 | exit=143, 0 facts |
| 20111 ('King Billy') | 14,100 | exit=143 @ 836 s, 0 facts |
| 10763 (NMP detachment) | 19,332 | exit=143, 0 facts |
Fix that worked: chunking. Splitting the source into ≤3,500-char chunks (on paragraph→line→sentence→hard boundaries), running the broad pass per chunk, and merging+deduping recovered all three:
| Row | Before | After chunking |
|---|---|---|
| 10763 | 0 | 1,236 facts, 95.1% anchored |
| 20744 | 0 | 540 facts, 96.7% |
| 20111 | 0 | 368 facts, 89.9% |
Each chunk is a contiguous slice, so quoted surface_text
still resolves against the full source. We also added
retry-once-on-empty-chunk (an occasional chunk still
returns nothing) so a single bad chunk no longer drops a slice of the
doc.
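A minimal sketch of the boundary-preferring chunker described above, assuming the preference order paragraph break, line break, sentence end, then a hard cut. `chunk_text` is a hypothetical name; the real splitter in opencode_extract.py may choose cut points differently.

```python
def chunk_text(text: str, max_chars: int = 3500) -> list[str]:
    """Split text into contiguous slices of at most max_chars characters.

    Within each window the cut is placed after the last paragraph break,
    else the last line break, else the last sentence end, else at max_chars.
    Joining the chunks reproduces the input exactly.
    """
    chunks: list[str] = []
    pos = 0
    while len(text) - pos > max_chars:
        window = text[pos:pos + max_chars]
        cut = max_chars  # hard boundary unless a softer one is found
        for separator in ("\n\n", "\n", ". "):
            index = window.rfind(separator)
            if index > 0:
                cut = index + len(separator)
                break
        chunks.append(text[pos:pos + cut])
        pos += cut
    if pos < len(text):
        chunks.append(text[pos:])
    return chunks
```

Since every chunk is an unmodified slice of the source, any surface_text quoted from a chunk is also an exact substring of the full document, which is what lets anchoring run against the whole source after the merge.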
The size cliff is a symptom. The cause, confirmed by a controlled run:
Diagnostic. Abundance prompt (15.7 KB) + a single 2,222-char source, 280 s cap. Result:
exit=143, events={step_start:2, tool_use:1, step_finish:1}, facts.json = 0 bytes, assembled inline text = 0 chars. stderr showed only the one-time sqlite migration.
Interpretation: the agent fires one tool call (the
read of source.txt), then begins generating
the entire facts.json as the argument of a single
write tool call. For a high-recall prompt that
argument is enormous; the generation does not finish before the cap, so
the write call never executes → no file.
And because the content lives in the in-flight tool-call stream (not in
text events), it is not recoverable from
the assistant text either. We get nothing despite minutes of
generation.
This also explains an earlier exit=0 run that ran 639 s
and still produced no file: the model ended its turn
without ever completing (or issuing) the write.
The original prompt survived only because it asked for less (~300 facts; row 10605 yielded 317), so the single write was small enough to finish (~441 s). Raising recall pushed the single write past what one tool call can complete in time. This is a mechanism problem, not a "too many facts" problem: the model is willing and able to generate the volume; we cannot capture it.
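The two no-output cases discussed in this section (a capped run with exit=143, and an exit=0 run that ended its turn without a write) can be told apart from the fields the runner already returns. The sketch below uses a hypothetical stand-in for AgentRun with the field names listed earlier; the shape of `output_files` (filename to content) and the category labels are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical stand-in carrying the fields this report lists for AgentRun."""
    exit_code: int
    elapsed_s: float
    text: str = ""
    output_files: dict = field(default_factory=dict)  # assumed: filename -> content
    event_counts: dict = field(default_factory=dict)
    stderr_tail: str = ""
    timed_out: bool = False


def classify_run(run: AgentRun, output_name: str = "facts.json") -> str:
    """Label a run by where, if anywhere, its output ended up."""
    content = run.output_files.get(output_name) or ""
    if content.strip():
        return "output_written"
    if run.text.strip():
        return "inline_text_only"
    if run.timed_out or run.exit_code == 143:
        return "capped_before_write"
    return "ended_without_write"


# The controlled diagnostic run: capped, zero-byte file, no inline text.
diagnostic = AgentRun(
    exit_code=143,
    elapsed_s=280.0,
    output_files={"facts.json": ""},
    event_counts={"step_start": 2, "tool_use": 1, "step_finish": 1},
    timed_out=True,
)
label = classify_run(diagnostic)
```

Logging a label like this per run would make it easier to count how often the single-write failure occurs versus other empty outcomes.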
A fresh HOME forces an OpenCode sqlite migration every single run.

- _flex_find substring fallback for anchors.
- The queued path now anchors (it passes revision_id+source_text).
- A swappable broad prompt file (OC_EXTRACT_BROAD_PROMPT_FILE), so prompts are tunable per-corpus without code changes.
- Removed the node self-verify loop (it consumed agent turns on large inputs).

Net: reliability on large inputs went from fails to works; but raising recall re-exposed §5.3, and per-event latency is now too high for 2,818 rows.
We believe the engine can produce thousands of facts per rich event in minutes if the output mechanism and format change. The central idea is to stop asking for one monolithic write and instead use OpenCode as the iterative agent it is. Candidate changes, roughly in priority order:
- Iterative append loop. In each turn the agent would (a) re-read source.txt and the facts written so far; (b) reason about what is still missing or wrong (uncovered clauses, un-typed entities, missing inverse/second-order relations, unanchored claims); (c) append the newly-found/corrected facts. The agent repeats until, by its own reasoning, the output is complete and correct (a full re-scan yields nothing new and every claim is anchored or properly flagged). This is the §5.3 fix and a recall multiplier: instead of one fragile giant generation, the agent accumulates and self-critiques toward completeness across many small, fast, persisted steps. It needs a clear stop criterion (e.g. N consecutive re-scans add no new fact) and a turn/time budget as a backstop.
- Append-friendly line format. The loop cannot use a single {"facts":[...]} array: each turn would have to rewrite the whole (growing) file, re-triggering the monolithic-write failure. A line-oriented format appends with >> and never rewrites: JSONL (one fact object per line; preserves datatypes/anchors, easy to parse) or TSV (subject⇥predicate⇥object⇥anchor, ~18 tokens/fact vs ~55 for verbose JSON → ~3× more facts per unit time). Parse line-by-line in Python; a malformed line is skipped, not fatal. The loop + line format are one coupled design.
- Inline-text capture as a fallback. Parse facts from the assembled text (and accept facts.jsonl/facts.tsv/facts.json), so a non-file emission is never lost.
- Warm per-slot agent state. Reuse persistent HOME dirs (one per flock slot) to avoid the per-run sqlite migration, instead of a fresh HOME each run.
- Real concurrency. Raise OPENCODE_MAX_CONCURRENT (network-bound work has headroom on 4 vCPU); and/or run a row's chunks in parallel rather than sequentially.

The proposed loop has the agent iterate until it judges the facts complete and correct. Is that the right
pattern for an agentic CLI, and what is the best stop
criterion (dry-streak of N re-scans? an explicit self-assessed
coverage check? a turn budget?) so it neither stops early nor loops
forever? Is there a better OpenCode-native idiom (a specific tool, a
different --format, a streaming-to-file pattern) to get a
large, complete structured output out of an agentic CLI
reliably?

We anchor by finding each surface_text in the source (exact, then whitespace/case-tolerant). Is there a more robust span-grounding approach that survives paraphrase without trusting model-reported character offsets (which are unreliable)?

| Metric | Value |
|---|---|
| Substrate live statements / predicates / contexts | ~39.5M / ~938K / ~20K |
| Small-row yield (old prompt) | 317 facts / 140 preds / 95% anchored / 441 s |
| Large-row failure (old prompt, ≥6.9K chars) | 0 facts (exit=143) |
| Large-row recovery (chunked) | 1,236 / 540 / 368 facts; 90–95% anchored |
| Abundance prompt, 2.2K-char single chunk | 0 facts at 280 s & 1,800 s caps (write never completes) |
| Memory-worker comparison (small task) | ~45–66 s, ~15–30 facts, file written |
| Effective opencode concurrency (cap 5) | ~1–2 (Omega self-memorisation contention) |
| Output cost | verbose JSON ≈ 55 tokens/fact; proposed TSV ≈ 18 |
| Per-run overhead | OpenCode sqlite migration on every run (fresh HOME) |
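As a sketch of the line-oriented output proposed in this report (JSONL, one fact object per line, with a malformed line skipped rather than fatal), a tolerant parser might look like the following. `parse_facts_jsonl` and the set of required keys are assumptions for illustration.

```python
import json

REQUIRED_KEYS = ("subject", "predicate", "object")  # assumed minimum for a usable fact


def parse_facts_jsonl(raw: str) -> tuple[list[dict], int]:
    """Parse one fact per line; return (facts, number_of_skipped_lines).

    Blank lines are ignored. A line that is not valid JSON, or that is not an
    object carrying the required keys, is counted as skipped and never raises.
    """
    facts: list[dict] = []
    skipped = 0
    for line in raw.split("\n"):
        line = line.strip()
        if not line:
            continue
        try:
            candidate = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if isinstance(candidate, dict) and all(key in candidate for key in REQUIRED_KEYS):
            facts.append(candidate)
        else:
            skipped += 1
    return facts, skipped
```

A run killed mid-append loses at most its final, truncated line under this scheme, where a truncated {"facts": [...]} array would fail to parse as a whole.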
| Concern | File |
|---|---|
| Agentic runner (docker exec, flock, file exchange) | donto/apps/donto-api/opencode_agent.py |
| Prompt, lenses, chunking, dedup, retry | donto/apps/donto-api/opencode_extract.py |
| Ingestion + evidence anchoring + source registration | donto/apps/donto-api/helpers.py |
| Temporal workflow + activities | donto/apps/donto-api/workflows.py, activities.py |
| Substrate (Rust) + Postgres | dontosrv (:7879), donto-pg (Postgres 16) |
| Swappable broad prompt | donto/apps/donto-api/prompts/frontier_broad.txt (OC_EXTRACT_BROAD_PROMPT_FILE) |
Generated 2026-06-03 from live measurements on the donto-db box.
Companion to the donto substrate PRD and the "Generative Abundance"
report at https://genes.apexpots.com/research/.