genes.apexpots.com / research source: donto-cerebras-bakeoff-2026-06-04.md

donto — Extraction-Provider Bake-Off: Cerebras vs. z.ai vs. Codex, a Faithful 5-Way Run

donto — Extraction-Provider Bake-Off: Cerebras vs. z.ai vs. Codex, a Faithful 5-Way Run

A measured, single-article comparison of five LLM extraction back-ends — across TWO different agentic harnesses (OpenCode and the Codex CLI) — driving donto's real lens-sweep controller. Goal: pick a cheaper, faster, available extraction provider for the months ahead without losing faithfulness. 2026-06-04.

What this is. donto's extraction front door currently runs GLM-4.7 via z.ai's "coding" subscription through an agentic OpenCode driver. That path is fast enough but it is an expiring subsidy, TOS-risky for non-coding use, and rate-capped. This report tests four alternatives against the incumbent, on the same article, with the same lens-sweep prompt and ingest path, reading results out of live donto_statement. Three (A, B, C) ran through OpenCode inside the omega container; two (D, E) ran through the Codex CLI on the host on the user's ChatGPT Pro subscription — a different agentic harness, no API credit. Every number below is from the production box on 2026-06-04, not estimated. Where a provider could not run, that is stated plainly and the cap was probed live, not assumed. As of this writing, both Cerebras and z.ai are quota-exhausted, and the ChatGPT-Pro Codex path is the only live extraction route.


0. One-paragraph orientation

donto is a bitemporal, paraconsistent, evidence-first knowledge substrate built for the age of generative abundance: generation of typed claims is now cheap, so the engine's job is maximal faithful capture — emit free/untyped, multi-directional, evidence-anchored claims and defer typing/alignment/identity/joining to query time. The extraction engine that feeds it drives an agentic CLI over an LLM provider, sweeping a document through a broad lens prompt across multiple passes ("loop until dry"), then ingests the parsed facts as anchored statements. In this experiment two things vary: (1) which model on which hardware does the inference, and (2) which agentic harness drives the loop — OpenCode (A/B/C) or the Codex CLI (D/E). Everything else — the article, the lens-sweep prompt, the multi-pass "loop until dry" mechanism, the {s,p,o,a,c,h} fact shape, the parse/normalise/ingest path, and the live-DB measurement — is held fixed, so the five contexts are directly comparable.


1. Why a provider bake-off, and why five-way

The incumbent extraction provider (GLM-4.7 on z.ai's coding subscription) has three problems we want to engineer away:

  1. Cost trajectory. It is a flat-rate coding subscription used for non-coding extraction — an expiring subsidy, and per CLAUDE.md §4 a TOS-risk. We need a path we can run at volume for months.
  2. Throughput / latency. Extraction is the bottleneck in every consumer (genealogy, memory). Faster inference directly multiplies how much of the firehose we can capture per day.
  3. Hard caps. The subscription enforces weekly/monthly limits; when it caps, all extraction stops.

Cerebras was the first candidate: wafer-scale inference is the fastest tokens/sec generally available, and it hosts both a small open model (gpt-oss-120b) and — in preview — GLM-4.7 itself (zai-glm-4.7). That second fact made a clean OpenCode-internal three-way design possible (model effect on identical hardware; hardware effect on the identical model).

But both Cerebras and z.ai are now quota-exhausted (§4, §5). So this iteration adds a second harness and a second account entirely: the Codex CLI (codex-cli 0.130.0) running OpenAI models on the user's ChatGPT Pro subscription — which uses no API credit and is not subject to either capped account. That gives two more live providers:

What the five-way design now isolates:

Comparison Holds fixed Varies What it isolates
A vs B OpenCode harness, article, hardware (Cerebras) the model (gpt-oss-120b vs glm-4.7) model effect on identical hardware
B vs C model (glm-4.7), OpenCode harness, article the hardware/provider (Cerebras vs z.ai) hardware/provider effect on the identical model
D vs E Codex CLI harness, article, ChatGPT-Pro account the OpenAI model (gpt-5.4 vs gpt-5.3-codex-spark) depth-vs-speed within the live Codex path
(A–C) vs (D–E) article, lens prompt, fact shape, ingest path the agentic harness + provider account harness/availability effect — the only currently-live route

2. The two integration problems we hit on the OpenCode/Cerebras path (and fixed / worked around)

Standing up Cerebras behind donto's OpenCode driver surfaced two distinct, non-obvious failures. Both are documented here because they will recur for anyone wiring an agentic CLI to a reasoning-model endpoint. Neither applies to the Codex CLI path (D/E), which runs on the host with its own shell tool and its own multi-turn loop — see §2.3.

2.1 BLOCKER — reasoning_content echo → Cerebras HTTP 400 (FIXED)

OpenCode (v1.15.13) uses the Vercel AI-SDK, which, on a multi-turn agentic loop, echoes the assistant's prior reasoning_content / reasoning fields back into the next request's message array. Cerebras's chat-completions endpoint rejects those fields on inbound requests with an HTTP 400, so the agent loop died on turn two and produced 0 facts. This is purely an integration artifact — the model is fine; the SDK is replaying a field the upstream won't accept on input.

Fix: a tiny sanitizing reverse-proxy in front of Cerebras (/mnt/donto-data/workspace/donto-align/cerebras_proxy.py, listening on 172.18.0.1:8089). It forwards chat-completions to api.cerebras.ai but strips reasoning_content/reasoning from every message in the outbound request body before relaying. OpenCode points its provider.cerebras.options.baseURL at the proxy instead of the upstream. With the proxy in place the agent loop runs to completion. (For the preview zai-glm-4.7 model the proxy is a hard runtime dependency — kill the proxy and B fails even with billing intact. The proxy was verified up, HTTP 200 on /v1/models, throughout this session.)

2.2 BUG — OpenCode slot-pool deadlock, blind to free slots (WORKED AROUND)

donto bounds concurrent OpenCode subprocesses host-wide with a flock over OPENCODE_MAX_CONCURRENT slot files. The acquire loop only ever looks at slot indices 0 .. MAX_CONCURRENT-1. The ~10 production frontier-extraction jobs already running hold a set of slot files; if a benchmark process is launched with a lower MAX_CONCURRENT than the number of busy slots, it iterates only the low indices, finds them all held, and never even checks the free higher-numbered slot files — it deadlocks behind production instead of waiting for a genuinely free slot.

Work-around (per the run protocol): set OPENCODE_MAX_CONCURRENT=16 on every benchmark invocation so the acquire loop scans all slot files and waits for a free one. Benchmark extractions were run sequentially as root. This is a work-around, not a fix; the underlying loop should scan a dynamically-sized pool, not a fixed range. (D/E avoid this entirely — the Codex CLI does not touch the OpenCode slot pool.)

2.3 The Codex CLI harness (D/E) — what differs

Providers D and E do not use OpenCode at all. They run the Codex CLI (codex-cli 0.130.0) headless on the host under the user's ChatGPT Pro auth:

codex exec --dangerously-bypass-approvals-and-sandbox -C <run_dir> -c model="<MODEL>" "<lens-sweep prompt>"

Harness caveat (faithful representation). D and E are a different agentic harness on a different account than A–C. The prompt, fact shape, and ingest path are identical, so the extraction quality numbers are comparable — but wall-time and "passes" are not strictly apples-to-apples across harnesses (different tool-call overhead, different loop heuristics, host vs container). Read D/E wall-times as Codex-CLI figures, not as a like-for-like speed test against OpenCode.


3. Setup

Component Value
Article frontier EntryId 23778"Attack on Aboriginal people — Bundamba Lagoon (August 1860)", ~14,600 chars. Reused verbatim: /tmp/cerebras-test/source.txt.
Prompt the broad lens-sweep prompts/extract_broad.txt, identical for all five (D/E prepend only a short harness preamble that restates the prompt's own OUTPUT MECHANISM).
Controllers A/B/C: production multi-pass "loop-until-dry" driver opencode_extract.extract_facts_opencode over OpenCodeAgent (headless OpenCode). D/E: the Codex CLI's own shell-tool + multi-turn loop (codex exec, host).
Ingest the normal donto-api path for all five: register source document + revision, parse {s,p,o,a,c,h} via opencode_extract._parse_jsonl, ingest facts, attach evidence spans.
Measurement counts read from live donto_statement (rows where upper(tx_time) IS NULL); anchoring from donto_evidence_link. Spot-checked 2026-06-04.
Contexts A → ctx:test/cerebras-gptoss/23778; B → ctx:test/cerebras-glm/23778; C → ctx:test/zai-glm/23778; D → ctx:test/codex-normal/23778; E → ctx:test/codex-spark/23778.

Provider configs: A/B point OpenCode's cerebras provider at the sanitizing proxy (baseURL=http://172.18.0.1:8089/v1), key from /etc/donto/cerebras.env; C uses the unchanged production z.ai config (https://api.z.ai/api/coding/paas/v4, model glm-4.7, GLM_API_KEY); D/E use the Codex CLI on host ChatGPT-Pro auth (no API key/credit), codex exec … -c model=gpt-5.4 / gpt-5.3-codex-spark.


4. Results

4.1 The 5-way table (all figures from live donto_statement, spot-checked 2026-06-04)

Axis (A) gpt-oss-120b @ Cerebras (B) glm-4.7 @ Cerebras (C) glm-4.7 @ z.ai (INCUMBENT) (D) codex-normal gpt-5.4 (E) codex-spark gpt-5.3-codex-spark
Harness OpenCode (container) OpenCode (container) OpenCode (container) Codex CLI (host) Codex CLI (host)
Account / billing Cerebras PAYG Cerebras PAYG (preview) z.ai coding sub ChatGPT Pro (no API credit) ChatGPT Pro (no API credit)
Live context …/cerebras-gptoss/23778 …/cerebras-glm/23778 …/zai-glm/23778 …/codex-normal/23778 …/codex-spark/23778
Status ✅ real run ✅ real run (prior); fresh re-run BLOCKED ❌ blocked — 0 facts real run, LIVE path real run, LIVE path
Live facts 320 3,590 0 511 426
Distinct subjects 145 486 0 119 78
Distinct predicates 95 1,852 0 292 213
Anchored (≥1 evidence_link) 151 / 320 = 47.2% 2,498 / 3,590 = 69.6% 391 / 511 = 76.5% 368 / 426 = 86.4%
Object split (IRI / literal) 141 / 179 748 / 2,842 265 / 246 162 / 264
Predicate-style hygiene 79/93 bare-camel ≈ 83.2%; 2 vocab (rdf:type,rdfs:label); 0 kebab/space ≈89.7% camel among bare; 674 :-prefixed minted preds dilute all-pred camel to 56.9% n/a 273/292 = 93.5% camelCase; 2 vocab (rdf:type,rdfs:label); 0 kebab, 0 clause/spaced 90.1% camelCase (192/213 distinct); 0 kebab, 0 clause/spaced; one faithful artifact damagedCattle? (trailing ? preserved as emitted)
Source attribution edge-style (reportedIn/attestedBy) edge-style + some source-baked preds n/a clean edges (reportedIn,attestedBy,accordingTo) not baked into predicate names clean edges — e.g. 6+ distinct attestedBy edges on the event
JSONL cleanliness clean clean n/a clean clean
Controller / loop toolloop fired; 3 passes (196→303→320) toolloop fired; multi-pass did not fire toolloop fired; multi-pass, self-judged done toolloop fired; multi-pass (58→…→432 raw → 426 ingested), self-judged done; retry-on-empty NOT needed
Wall time 71.8 s ~prior run; re-run fast-fails at cap 0 s 461.1 s (104k tokens) 60.3 s (89k tokens)
Block reason HTTP 402 payment_required — Cerebras account-wide cap (proxy + direct, ~0.14 s) z.ai code 1310 "Weekly/Monthly Limit Exhausted, reset 2026-06-10 16:43:35"

Note on E's camel-%. A strict classifier that rejects the trailing ? counts 192/213 = 90.1% camelCase (the damagedCattle? artifact is the one excluded). A lenient regex that tolerates the ? counts 193/213. Either way, 0 kebab and 0 clause/spaced predicates — strong style discipline, the artifact faithfully preserved rather than silently normalised.

4.2 The isolated axes

4.3 On predicate counts and the abundance signature

B's 1,852 distinct predicates are inflated by 674 :-prefixed, clause-style minted predicates — high-resolution but ballooning the raw count (all-pred camel diluted to 56.9%; ≈89.7% among un-prefixed). The Codex runs (D/E) show the opposite, tighter profile: 292 / 213 distinct predicates, 0 clause-style and 0 kebab, 93.5% / 90.1% camelCase, and source attribution modeled as clean edges (reportedIn, attestedBy, accordingTo) rather than baked into predicate names — exactly what extract_broad.txt asks for. Either profile is aligned at query time by the substrate's alignment engine, not a static map (CLAUDE.md no-brittle-logic rule); but the Codex output needs less query-time predicate-folding to be load-bearing.


5. Quality verdict

FAITHFUL on the live data. The headline change since the 3-way run: with Cerebras (402) and z.ai (1310) BOTH quota-exhausted, the ChatGPT-Pro Codex CLI is now the only live extraction path — and it works well.


6. Economics

donto's extraction cost is dominated by tokens generated per document × documents per day, and by which accounts are not capped.

Provider / plan Shape Indicative rate Fit for donto extraction
Cerebras PAYG (gpt-oss-120b) per-token low-single-digit $ / Mtok best $/fact for shallow sweeps — currently 402-capped
Cerebras PAYG (zai-glm-4.7, preview) per-token preview pricing; verify best depth/$ if rate reasonable; needs sanitizing proxy — currently 402-capped
z.ai GLM coding subscription (incumbent) flat-rate fixed $ / month in use; weekly/monthly cap stops all extraction (1310, reset 2026-06-10); TOS-risky; expiring subsidy
ChatGPT Pro — Codex CLI / gpt-5.3-codex-spark (E) flat-rate Pro sub $0 marginal — no API credit the only live path now: fast (60 s), high-anchor (86.4%), clean predicates; Cerebras-accelerated OpenAI codex on an already-paid Pro sub
ChatGPT Pro — Codex CLI / gpt-5.4 (D) flat-rate Pro sub $0 marginal — no API credit live, deeper/slower (511 facts, 461 s); use when depth matters more than wall-time

The economic story has shifted. The all-time depth argument (B's ~11× over A) still holds when Cerebras billing is live — but right now both per-token providers are capped, and the ChatGPT-Pro Codex path costs no marginal API credit (it draws on a subscription the user already pays for). Among runnable providers, codex-spark (E) gives the best speed × anchoring × cleanliness at zero marginal cost; codex-normal (D) trades ~7.6× wall-time for more depth. The flat-rate posture means it does not meter per fact — attractive for steady volume, subject to whatever throughput limits the Pro sub enforces.

Caveat: exact $/Mtok for the preview zai-glm-4.7 on Cerebras was never pinned (account in payment_required); the Codex path has no per-token meter to pin, only the Pro subscription's own usage limits.


7. Recommendation

Live-now primary — use the Codex CLI on ChatGPT Pro as the working extraction path while Cerebras and z.ai are capped. It is the only runnable route today, costs no marginal API credit, and on this article it produced the best-anchored, cleanest-predicate extractions in the whole field.

All-time depth primary — adopt glm-4.7 on Cerebras (Provider B) once Cerebras billing is restored. On the same article + prompt it extracted 3,590 faithful facts (~11× gpt-oss, ~7× the Codex runs) at 69.6% anchoring — the depth ceiling measured. Keep it as the heavy-extraction engine when its account is live; it keeps the exact glm-4.7 model donto already runs, on faster hardware, off the z.ai subsidy.

Secondary tier — gpt-oss-120b on Cerebras (A). Fast/cheap recall-floor when Cerebras is live, but ~9% of B's yield and the lowest anchoring (47.2%); a coverage floor, not the main engine.

Incumbent (glm-4.7 @ z.ai, C) — fallback only, currently dead. Hard-capped now (1310, reset 2026-06-10 16:43:35); TOS-risky; expiring subsidy. Migrate off it.

Action items:

  1. Run the genealogy/memory engines on the Codex CLI path now (it is the only live route): wire donto-extract's swappable provider to codex exec -c model=gpt-5.3-codex-spark (default) / gpt-5.4 (depth) on ChatGPT-Pro auth; remember the sunset-default workaround (always pass -c model=, never the default gpt-5.3-codex).
  2. Restore Cerebras billing, then re-run B fresh into a clean context to confirm the 3,590-fact result reproduces (current B figure is a standing prior run).
  3. After 2026-06-10 16:43:35, run run_glm_zai.py to populate C and finally measure the same-model-different-hardware axis (B vs C).
  4. Treat the harness as a first-class swappable abstraction in donto-extract (OpenCode and Codex CLI), so capping one account or one harness never stops all extraction — exactly the resilience this session demonstrated.
  5. Fix the OpenCode slot-pool acquire loop to scan a dynamically-sized pool, not range(MAX_CONCURRENT) (§2.2).
  6. Pin per-token pricing for preview zai-glm-4.7 on Cerebras with a billed run (§6).

8. Honest limits


Measured on donto-db (apex-494316), 2026-06-04. Counts from live donto_statement (upper(tx_time) IS NULL); anchoring from donto_evidence_link. Caps probed live this session. Mechanisms: production opencode_extract multi-pass controller over OpenCodeAgent with broad lens prompt extract_broad.txt (A/B/C, OPENCODE_MAX_CONCURRENT=16); the Codex CLI (codex exec, codex-cli 0.130.0, ChatGPT Pro, host) with the same prompt + {s,p,o,a,c,h} fact shape + the same parse/ingest path (D/E). Codex run script: /tmp/cerebras-test/run_codex_extract.py.