donto research report — 2026-06-04 — operational companion to donto — The Substrate for Generative Abundance
Scope and honesty contract. This is an engineering report, not a pitch. Every fact count, anchor rate, and namespace split below was re-verified live against
donto_statement and donto_evidence_link on donto-pg on 2026-06-04 (read-only SELECTs; donto invariant I3 honored, nothing mutated). Where a number comes from a run log or an analysis script rather than the database, it is labelled as such, and the run-logs themselves (run_codex_glean.py outputs, glean-spark.log, glean_smoke.out, the v2-citer source) are preserved on the box, not asserted from memory. Where a claim is an emerging finding rather than a settled measurement, it is flagged. This is an n=1 single-source study (one article, EntryId 23778): the findings are directional engineering evidence, not a population benchmark; that caveat applies to the whole report and is not repeated at every line. The companion report The Cerebras / Codex Bake-off covers provider economics and the 5-way bake-off in more depth; this report adds the new material: the gleaning loop, the count-vs-coverage saturation principle, and the "why 7×" decomposition.
donto's thesis is that generating typed knowledge is no longer the scarce step. A guided frontier LLM emits an essentially unbounded, multi-directional space of evidence-anchored claims about any entity for fractions of a cent. The engineering question is therefore not "can we generate enough?" but "how do we drive a model to extract maximally (exhaust the meaningful content of a source) and sustainably (keep the firehose running under real-world quota limits), while keeping each claim anchored to its source — and defer typing, alignment, and identity resolution to query time?"
This session answered that question with measurements rather than assertions. The settled findings:
Provider economics force rotation, not a single
lane. The operator runs on flat-subscription / prepaid lanes
only (per-token API is TOS-clean but unaffordable at this volume). Every
such lane hard-caps the sustained firehose — z.ai's GLM coding sub hit
its weekly cap (error 1310, reset 2026-06-10), Cerebras PAYG returned
HTTP 402 (account out of credit), and ChatGPT-Pro/Codex carries a hidden
weekly cap plus datacenter-IP ban risk. The answer is a
multi-lane router that rotates by leftover quota, with
a declarative cap-detection registry. Built and tested: the
whole donto-extract repo suite is 91 tests
passing; the lanes module's own file
tests/test_lanes.py holds 17
tests.
A 5-way model bake-off on one source. On the same article (EntryId 23778), five extraction configurations produced wildly different live fact counts and anchor rates. The Codex-CLI runs anchored best (76.5%, 86.4%, and up to 99.1% after gleaning) versus the opencode runs (47.2%, 69.6%) — but with an important harness caveat (below).
Models self-stop by choice, not by
budget. Three baseline single-shot runs landed at
{496, 511, 426} facts; the gleaning loop that followed them was
never quota-capped (the spark glean log records
usage_capped: false, stopping on the pass cap with its last
pass still adding 616 new keys). The model judges itself "done" long
before it is exhausted. The supporting per-run reasoning-token figures
(e.g. a 511-fact run reportedly spending ~587 reasoning tokens) come
from run-logs, not the DB — see §5.1 and the appendix.
A harness gleaning loop fixes this.
model_reasoning_effort=xhigh + a resume-and-re-prompt loop
("you missed many, append ≥150 new, do not repeat") raised one gpt-5.4
run from 511 → 1,915 facts at 99.1% anchored, a single clean
namespace, and 0.00% exact-duplicate triples. A spark run
reached 3,227 facts in 6 passes and was still
climbing (last pass added 616 new keys) — it stopped on the pass
cap, not on saturation.
Count is the wrong target; coverage / saturation is
right. Pushing a raw count floor makes models pad with
noise: a gpt-5.4 run chasing volume degenerated into garbage
predicates (containsCommaCharacter,
answerWordCount,
coordinateMentionsLatitudeLabel). The disciplined spark
glean (3,227 facts, ~90% camelCase, single namespace, 0 exact-dups) is
healthier than count-chasing. Abundance is not
noise.
The headline "7× more facts" from glm decomposes to ~1.4–1.7× real unique knowledge. The naive ratio (glm 3,590 ÷ codex 511 = 7.0×) is mostly harness artifact: glm's run was two un-deduped merged runs under two near-disjoint namespaces, plus subject-level dual-namespace identity redundancy (31.8% of facts) and bare-flag restatements (18.5%). Raw counts are not comparable across harnesses.
The through-line: maximize meaningful evidence-anchored coverage, anchor as a separable stage, rotate providers to stay alive, and let the substrate fold identity/typing at query time. That is the abundance vision, operationalized and measured.
For sixty years the bottleneck in every knowledge system was generation — a human had to author each typed fact. That scarcity is gone. The literature is unambiguous: GPTKB extracted ~105M triples at ~$0.00009 each; AutoSchemaKG built a 900M-node schema-free graph; per-token costs are falling roughly an order of magnitude per year. Generation is now the cheap, abundant step.
When generation is abundant, three sub-problems become the real work:
Maximize meaningful yield. A model left to its own devices stops extracting when it feels done, not when the source is exhausted (Section 5 shows the run was not budget-bound). Getting the last large fraction of a source's content out requires deliberate harness engineering.
Keep every claim anchored. A fact without a retrievable source span is, for donto's evidence-first model, a liability. Anchoring is a measurable quality axis independent of count (Sections 4–5), and it is best handled as a separate stage (Section 8).
Defer joining. Models invent predicates and
entity IRIs as they go. Two runs will describe the same entity under
ex: and exo:; one model will mint
birthPlace, another placeOfBirth. donto's
design answer is emit free / untyped now, reconcile by
similarity at query time — never a hand-maintained synonym
table. Section 7 shows this is not just philosophy: most of glm's
apparent "7×" advantage is exactly this kind of identity
redundancy, which query-time alignment is designed to fold.
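
As a rough illustration of reconciling invented predicates by similarity rather than by a synonym table, the sketch below splits camelCase predicate names into word tokens and compares the token sets with Jaccard similarity. The tokenizer, the choice of Jaccard, and both function names are assumptions made for this example; the report does not say which similarity measure donto applies at query time.

```python
import re


def camel_tokens(name: str) -> frozenset[str]:
    """Split a camelCase predicate name into lowercase word tokens."""
    return frozenset(t.lower() for t in re.findall(r"[A-Za-z][a-z]*|\d+", name))


def predicate_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two predicates' token sets (0.0 to 1.0)."""
    ta, tb = camel_tokens(a), camel_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# birthPlace and placeOfBirth share two of three distinct tokens.
score = predicate_similarity("birthPlace", "placeOfBirth")
```

Under this measure the two spellings from the example above score well above unrelated predicates, without anyone listing them as synonyms by hand. A production system would more plausibly use embeddings or usage statistics; token overlap is only the simplest stand-in.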
This report is the operational companion to that vision: how to actually drive the firehose.
The operator's hard constraint is cost. Per-token public APIs (OpenAI, Anthropic, etc.) are TOS-clean and uncapped, but unaffordable at sustained extraction volume. So extraction runs on flat-subscription or prepaid lanes — and every one of them hard-caps the firehose. Observed live this session:
| Lane | Cap mechanism (observed) | State this session |
|---|---|---|
| z.ai GLM coding subscription | Weekly cap → error 1310 (returned over HTTP 429) | Capped; reset 2026-06-10 16:43:35 |
| Cerebras PAYG | Out of credit → HTTP 402 payment_required (account-wide) | Capped |
| ChatGPT-Pro / Codex CLI | Hidden weekly cap (~1–2 days at volume) + GCP-datacenter-IP ban risk for automated non-coding use | The only live route at time of writing |
TOS nuance (stated plainly). The per-token public APIs are the clean path — they are simply unaffordable here. The subscription lanes are affordable but carry real terms-of-service exposure: the z.ai/GLM "coding" subscription is intended for coding assistance, and driving a 24/7 non-coding extraction firehose through it (or through ChatGPT-Pro from a datacenter IP) is TOS-risky and an expiring subsidy, not a durable foundation. The router below is an availability mechanism for research on a budget; it is not an endorsement of using a coding subscription as a production extraction backend. The durable answer is paying per-token (or self-hosting) once the work is funded.
The economic logic is inescapable: a flat subscription is priced for interactive coding, not a 24/7 extraction firehose. Any single lane will throttle. The answer is to rotate lanes by leftover usage — when one caps, jump to the next with quota remaining.
Built at donto-extract/src/donto_extract/lanes/. Design
principles, all consistent with the no-brittle-logic rule (verified in
the source this session):

- Cap detection is declared as a CapSignature data record (registry.py), not as an if/elif ladder buried in driver code: z.ai's body_code=("1310",), Cerebras's http_status=(402,), codex's usage-snapshot cap. The matcher in caps.py "matches the data": there is no if lane == ... branch. New lanes/signatures are added declaratively.
- The Codex lanes all carry pool="codex" (verified in registry.py): they draw from one ChatGPT-Pro quota pool. So failover within Codex buys no headroom; when one Codex model caps, all four are capped. The registry's pool field is what lets the router know failover must jump pool, not just model. (The two Cerebras lanes likewise share pool="cerebras".)
- sunset-400 is not a quota cap (it's a deprecated-model signal). The registry/matcher is built to treat it as not-a-cap so the router doesn't waste a failover hop on it.
- The CLI flags in lanes/cli.py are --lane, --auto, --context, --json, --no-probe, --timeout. There is no literal --status flag; lane status is read through the registry view, not a top-level flag. (An earlier internal note claimed a --status flag and a router "91-test" count; both are corrected here, see the test footprint below.)

Test footprint (corrected for honesty). Verified by running the suite this session: the whole donto-extract repo suite is 91 tests passing (pytest tests/ -q → 91 passed); the lanes module's own file tests/test_lanes.py contains 17 tests. The router is not independently "91 tests": that figure is the entire repo. Both numbers are now correct in this report.
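
The design principles above can be sketched in a few lines. This is a minimal illustration, not the code in registry.py or caps.py: only the names CapSignature, body_code, http_status, and pool come from the report, while the Lane record, the leftover field, the lane names, and the quota values are hypothetical. Codex's usage-snapshot cap is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapSignature:
    """Declarative description of what a quota cap looks like for one lane."""
    http_status: tuple[int, ...] = ()
    body_code: tuple[str, ...] = ()


@dataclass(frozen=True)
class Lane:
    name: str
    pool: str          # lanes in the same pool share one quota
    cap: CapSignature
    leftover: float    # hypothetical: fraction of quota remaining, 0..1


def is_capped(sig: CapSignature, status: int, body_code: Optional[str]) -> bool:
    """Match a response against the signature's data; no per-lane branching."""
    return status in sig.http_status or (
        body_code is not None and body_code in sig.body_code
    )


def next_lane(lanes: list[Lane], capped_pools: set[str]) -> Optional[Lane]:
    """Pick the lane with the most leftover quota outside every capped pool."""
    live = [ln for ln in lanes if ln.pool not in capped_pools and ln.leftover > 0]
    return max(live, key=lambda ln: ln.leftover, default=None)


# Illustrative registry; lane names and leftover values are made up.
REGISTRY = [
    Lane("zai-glm", "zai", CapSignature(body_code=("1310",)), leftover=0.0),
    Lane("cerebras-glm", "cerebras", CapSignature(http_status=(402,)), leftover=0.2),
    Lane("codex-normal", "codex", CapSignature(), leftover=0.6),
    Lane("codex-spark", "codex", CapSignature(), leftover=0.5),
]
```

Because the router excludes whole pools rather than single lanes, a cap on one Codex lane removes every Codex lane from the candidate set. A response that matches no signature, such as a plain HTTP 400, is simply not treated as a cap.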
The rotation strategy is what converts a set of individually-capped lanes into a sustained firehose. None of it changes extraction quality — that's the next section.
Harness caveat: the glm row (B) is two un-deduped merged runs (exo: 2,327 + ex: 1,258 facts; only 30 distinct local subject-names appear in both, ~6% overlap). Treat its 3,590 as a merged figure, not a single-run yield. Section 7 decomposes this in full.

| # | Model / config | Harness | Live facts | Subj | Pred | Anchored | Anchor % | Namespaces |
|---|---|---|---|---|---|---|---|---|
| A | gpt-oss-120b @ Cerebras | opencode | 320 | 145 | 95 | 151 | 47.2% | 1 (ex:) |
| B | glm-4.7 @ Cerebras (two merged runs) | opencode | 3,590 | 486 | 1,852 | 2,498 | 69.6% | 5 (dom. exo:+ex:) |
| — | glm-4.7 @ z.ai | opencode | — | — | — | — | — | never ran — capped (code 1310) |
| D | gpt-5.4 (codex-normal) | Codex CLI | 511 | 119 | 292 | 391 | 76.5% | 1 (ex:) |
| E | gpt-5.3-codex-spark (single run) | Codex CLI | 426 | 78 | 213 | 368 | 86.4% | 1 |
Contexts (live): ctx:test/cerebras-gptoss/23778,
ctx:test/cerebras-glm/23778,
ctx:test/codex-normal/23778,
ctx:test/codex-spark/23778. The z.ai glm lane
(ctx:test/zai-glm/23778) has 0 rows — it
never executed because the lane was capped, an unintended but
instructive demonstration of Section 3.

The Codex smoke run (ctx:test/codex-smoke/23778, 496 facts) anchored at 99.2% (492/496, live-verified). The Codex harness simply tracks its citations better.

Three baseline single-shot Codex runs on the same source landed at {496, 511, 426} facts. The substantive claim, that the model is satisficing rather than running out of budget, is supported two ways:

- The spark glean log records "usage_capped": false and "stop_reason": "max_passes (6)" with its final pass still adding 616 new keys. The run was not throttled; it would have kept producing. This is verified in glean-spark.log on the box.
- The per-run reasoning-token figures from the run-logs (e.g. the 511-fact run reportedly spending ~587 reasoning tokens) point the same way, but they are directional and not DB-verified; the firmer support is the usage_capped: false evidence above.

Either way the operational insight holds: left alone, a frontier model extracts what it considers a reasonable, representative set and stops. It satisfices. For donto's maximal-extraction goal that is a bug, not a feature.
Two levers, applied together:

- model_reasoning_effort = xhigh: push the model to think harder per pass. (Verified in the spark run's rollout: all 10 turn-context efforts were honored as xhigh.)
- A resume-and-re-prompt loop via codex exec --resume <session-id>, re-prompting each pass with coverage framing: "you missed many; append ≥150 NEW facts; do not repeat what you already emitted." Stop when a pass yields <30 new keys twice in a row (a saturation gate), or on a max-pass cap.

The framing matters: telling the model it missed things and asking for coverage (not "more facts") is what overrides the satisfice instinct.
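
The stopping rule of the loop can be sketched independently of the Codex CLI. In the sketch below, run_pass is a hypothetical callable standing in for one resumed, re-prompted pass; the real harness wiring is omitted, and treating a "key" as a (subject, predicate, object) triple is an assumption.

```python
from typing import Callable, Iterable

Triple = tuple[str, str, str]


def glean(
    run_pass: Callable[[int], Iterable[Triple]],
    max_passes: int = 6,
    min_new: int = 30,
    patience: int = 2,
) -> tuple[set[Triple], str]:
    """Re-prompt until `patience` consecutive passes each add < `min_new` keys."""
    seen: set[Triple] = set()
    low_streak = 0
    for pass_no in range(1, max_passes + 1):
        before = len(seen)
        seen.update(run_pass(pass_no))  # repeated triples are absorbed by the set
        new_keys = len(seen) - before
        low_streak = low_streak + 1 if new_keys < min_new else 0
        if low_streak >= patience:
            return seen, f"saturated after pass {pass_no}"
    return seen, f"max_passes ({max_passes})"
```

A run that returns the max_passes reason with a large final pass, as the spark glean did, has hit the pass cap rather than the saturation gate, which is the signal that more passes would still pay off.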
gpt-5.4 glean, ctx:test/codex-glean-smoke/23778:

- 1,915 facts (up from the 511-fact baseline), 99.1% anchored.
- Single ex: namespace; 0.00% exact-duplicate triples (verified: 1,915 distinct subject+predicate+object triples, zero exact repeats).
- Pass 1 hit its timeout (timed_out: true) yet still appended 1,298 facts before the cut; pass 2 added 617 in 583 s.

spark glean, ctx:test/codex-glean-spark/23778:

- 3,227 facts in 6 passes.
- Stopped on the pass cap (stop_reason: max_passes (6), usage_capped: false); the last pass added 616 new keys. The source was not saturated; there was more to extract.
- Single ex: namespace; 0.00% exact-duplicate triples. Wall time ~1,306 s (~22 min) (run-log).

511 → 1,915 is the cleanest "maximize meaningful yield"
result in the whole set: a 3.7× count lift that
simultaneously raised anchor coverage to
99.1%, kept a single clean namespace, and produced
zero exact-duplicate triples. That is meaningful
abundance — more real anchored claims, not padding. It is the pattern
the engine should default to.
When the loop was instead given a raw count floor
("emit thousands"), a gpt-5.4 run obeyed — and degenerated into
noise. The predicate tail collapsed into trivially-true
string-property assertions — predicates like
answerWordCount, containsCommaCharacter,
containsApostropheCharacter, and (this source carries
lat/long) coordinateMentionsLatitudeLabel. These are not
knowledge about the entity; they are the model manufacturing
filler to hit a number.
Caveat (honesty — this one is NOT a live DB context). This padded run was not retained in live
donto_statement(consistent with it being a rejected output). So its raw line count (recorded in run-logs around ~7,500–8,400 valid lines / ~560 distinct predicates) and the specific garbage-predicate names are harness-log / on-disk observations, not a queryable context — they are presented as observed-during-the-run, and I do not claim a precise live figure for it. The contrast it illustrates, however, rests entirely on live data: the disciplined spark glean's clean profile (89.8% camelCase, 0.00% exact-dup, single namespace — all verified above) is in the database.
Abundance ≠ noise. A count floor optimizes the wrong objective and the model games it. The right target is exhaustive meaningful coverage, with saturation deciding "done" — the operator's own framing: "not an absolute number, just everything possible."
The two glean runs make this concrete:
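Both kept a clean profile (gpt-5.4 glean: 1,915 facts, 99.1% anchored, single namespace, 0.00% exact-dups; spark glean: 3,227 facts, 89.8% camelCase, single namespace, 0.00% exact-dups), which makes either of them healthier than a padded run full of containsCommaCharacter-grade lines, even with fewer facts.

The operational rule that falls out: drive to saturation (a falling new-key curve), reject count floors, and let the predicate-cleanliness profile (camelCase share, exact-dup rate, namespace count) be a live quality monitor.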
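
The three profile metrics can be computed from a list of triples as in the sketch below. It is illustrative only: the report does not say whether camelCase share is measured per fact or per distinct predicate, so per fact is assumed, and the camelCase regex (which also accepts a single lowercase word) and the prefix-before-colon namespace rule are assumptions too.

```python
import re
from collections import Counter

CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)*$")


def quality_profile(triples: list[tuple[str, str, str]]) -> dict[str, float]:
    """Compute camelCase share, exact-duplicate rate, and subject namespace count."""
    if not triples:
        return {"camel_share": 0.0, "exact_dup_rate": 0.0, "namespaces": 0}
    counts = Counter(triples)
    # Every repeat beyond the first occurrence of a triple is a duplicate row.
    dup_rows = sum(n - 1 for n in counts.values())
    # Strip any "prefix:" before testing the predicate's local name.
    camel = sum(1 for _, p, _ in triples if CAMEL.match(p.split(":", 1)[-1]))
    namespaces = {s.split(":", 1)[0] for s, _, _ in triples if ":" in s}
    return {
        "camel_share": camel / len(triples),
        "exact_dup_rate": dup_rows / len(triples),
        "namespaces": len(namespaces),
    }
```

Note that a predicate such as containsCommaCharacter is itself valid camelCase, so this profile flags structural drift (naming, duplication, namespace sprawl) rather than semantic padding; the new-key curve and a look at the predicate tail remain necessary.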
glm's 3,590 vs codex-normal's 511 is a naive 7.0×. Decomposed against live data, the real unique-knowledge multiple is ~1.4–1.7×. The factors (DB-verified except the one row explicitly marked):
| Factor | Multiplier | Evidence |
|---|---|---|
| Harness: two merged runs | ~1.9× | glm context is two un-deduped runs: exo: 2,327 + ex: 1,258 facts (live); only 30 of ~483 distinct local subject-names appear in both namespaces (~6% overlap → near-disjoint, never deduped) |
| Granularity / redundancy | ~1.5× | 31.8% of glm facts (1,140 / 3,590, live) sit on subjects whose local name exists in both ex: and exo:, i.e. the same entity described twice under two IRIs. Plus 18.5% (663 / 3,590, live) bare true/1/yes flag restatements |
| Q&A reification | ~1.1× | The source is a Q&A transcript; glm reified questions/answers as extra subjects |
| Real extra source coverage | ~1.45× | Span-union analysis: glm touched ~93.3% of source characters vs codex ~64.6% — analysis-script figure, not recomputable from DB counts alone (see §7.1) |
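Net: the three artifact factors multiply to ≈ 1.9 × 1.5 × 1.1 ≈ 3.1×, and including the ~1.45× real-coverage factor brings the product to ≈ 4.5×. The real-coverage factor is the only one that reflects new knowledge, which is why the honest unique-knowledge multiple is put at ~1.4–1.7×. (These four factors are not cleanly orthogonal, so the decomposition is an estimate, not an identity, and the product does not reproduce the naive 7.0× exactly; but every input number is sourced.)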
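
The arithmetic behind the factor table can be rechecked with a few lines. All inputs below are the figures quoted in this section; the dictionary keys and variable names are just labels for this sketch.

```python
import math

naive_ratio = 3590 / 511  # glm live facts / codex-normal live facts

# Multipliers from the factor table (estimates, not exact measurements).
artifact_factors = {"merged runs": 1.9, "redundancy": 1.5, "qa reification": 1.1}
real_coverage = 1.45  # span-union estimate from the analysis script

artifact_product = math.prod(artifact_factors.values())
full_product = artifact_product * real_coverage

dual_iri_share = 1140 / 3590   # facts on subjects present under both ex: and exo:
bare_flag_share = 663 / 3590   # bare true/1/yes flag restatements

print(f"naive ratio        {naive_ratio:.2f}x")
print(f"artifact product   {artifact_product:.2f}x")
print(f"all four factors   {full_product:.2f}x")
print(f"dual-IRI share     {dual_iri_share:.1%}")
print(f"bare-flag share    {bare_flag_share:.1%}")
```

The gap between the four-factor product and the naive ratio is expected: the multipliers are rounded estimates and overlap with one another, so they should be read as a rough attribution rather than an exact factorization.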
Raw counts are not comparable across harnesses.
glm's "7×" is mostly merged-run concatenation plus dual-namespace
identity redundancy — exactly the kind of thing the substrate reconciles
at query time. Meanwhile Codex stops by choice, not
budget, and its single disciplined namespace means
multi-session UNION dedups cleanly — no
ex:/exo: identity-duplication tax to pay
later. The right target remains exhaustive meaningful coverage
judged by saturation, with identity and typing reconciled
downstream — donto's emit-free / defer-joining thesis,
validated by the numbers.
The gleaning loop maximizes yield; anchoring is a separate quality axis, and the cleanest way to handle it is to separate extraction from anchoring — let the model extract freely, then run an always-on post-hoc citer that locates a supporting span for each emitted fact.
v1 result (verified two ways). The citer lifted the
spark glean's anchoring substantially. Measured against the citer's
own input fact file, it raised anchor coverage 47.1% →
90.2%. Measured the donto-native way — distinct live
self-anchored statements in
donto_statement/donto_evidence_link — the
re-ingested context ctx:test/spark-cited/23778 (3,229
facts; the re-ingest added 2 vs the 3,227 source) goes from
50.0% → 91.9% anchored (2,969 / 3,229, live-verified).
(The two bases — input-file anchor rate vs DB self-anchor rate — differ
slightly; both are reported so neither is cherry-picked.) Either way it
is a large, real coverage win, and it confirms the architectural bet:
anchoring as a downstream stage works.
v1 problem (honest, do not overclaim). An
adversarial audit found that a locatable citation is not always
a supporting one. On a small, deliberately relational-skewed
sample, roughly ~40–46% of the recovered relational
citations were wrong — the citer found a span that
mentions an entity but does not support the asserted
relation (e.g. question-34 askedBy mr-watts anchored
to a different question's "By Mr. WATTS:" line because both share the
token watts). The ~40% figure is documented in the v2-citer
source header; the ~46% figure comes from an n=30 adversarial-judge
sample (relational subset n≈26, ~12 wrong) recorded in the run-logs.
This is a worst-case adversarial probe on a curated relational
sample, not a population rate — most facts are simpler
attributive/literal claims the citer handles correctly (those anchored
fine in v1). But it is a genuine correctness gap, and the exact rate is
a run-log/audit figure, not a DB-verified population statistic.
v2 (in progress). An instance-aware
co-location gate (real, in-tree at
cite_facts_v2.py): classify each fact structurally
by object type — literal objects keep the v1 lexical layer; IRI
(relational) objects must find a span that co-locates BOTH
endpoints (the subject's distinguishing token,
weighted by inverse batch document-frequency — IDF computed from the
data, never a hand-maintained stopword/synonym list — AND the object's
value token) in the same window, else route to semantic, else
mark unanchorable rather than attach a plausible-looking
neighbour. A wrong span is worse than none. This is
emerging work, not a settled result — its task is in
progress this session and no precise post-fix correctness number
is claimed here.
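
The co-location idea can be illustrated with a small sketch. It is not the code in cite_facts_v2.py: the tokenizer, the smoothed IDF formula, the 80-character window, and all function names are assumptions, and the semantic fallback layer is left out.

```python
import math
import re
from collections import Counter
from typing import Optional


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens of an identifier or phrase."""
    return re.findall(r"[a-z0-9]+", text.lower())


def distinguishing_token(name: str, batch: list[str]) -> str:
    """Return the token of `name` with the highest smoothed IDF over the batch."""
    df = Counter(t for other in batch for t in set(tokens(other)))
    n = len(batch)
    return max(tokens(name), key=lambda t: math.log((n + 1) / (df[t] + 1)))


def colocated_span(source: str, subj_tok: str, obj_tok: str,
                   window: int = 80) -> Optional[tuple[int, int]]:
    """Return the tightest span holding both tokens within `window` chars, else None."""
    low = source.lower()
    subj_hits = [m.start() for m in re.finditer(rf"\b{re.escape(subj_tok)}\b", low)]
    obj_hits = [m.start() for m in re.finditer(rf"\b{re.escape(obj_tok)}\b", low)]
    pairs = [
        (abs(s - o), min(s, o), max(s + len(subj_tok), o + len(obj_tok)))
        for s in subj_hits for o in obj_hits if abs(s - o) <= window
    ]
    if not pairs:
        return None  # no co-location: better unanchored than wrongly anchored
    _, start, end = min(pairs)
    return start, end
```

For a subject like question-34 in a batch full of question-N subjects, the shared token question has low IDF and the number becomes the distinguishing token. A "By Mr. WATTS:" line that sits next to a different question number then fails the gate instead of being attached as a plausible-looking citation.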
The takeaway for the architecture: anchor as a separate, auditable stage; measure not just coverage but correctness; and treat "locatable ≠ supporting" as a first-class problem.
Operationalizing generative abundance comes down to four engineering commitments, each now backed by live measurement:
Maximize meaningful coverage, not
count. Models satisfice and self-stop while still
producing (the spark glean was usage_capped: false
with a 616-new-key final pass). Drive them to
saturation with xhigh effort + a resume/re-prompt
gleaning loop and coverage framing. The exemplar: 511 → 1,915
facts at 99.1% anchored, single namespace, 0 exact-dups. Reject
count floors — they produce containsCommaCharacter-grade
noise.
Anchor as a separable stage. Extraction and citation are different quality axes. An always-on post-hoc citer lifted anchoring 50.0% → 91.9% (DB-verified) — but "locatable ≠ supporting," so the citer needs a co-location correctness gate (v2, in progress). Measure coverage and correctness.
Rotate providers to stay alive. Every flat-sub/prepaid lane caps (z.ai 1310, Cerebras 402, Codex weekly + IP risk). A declarative, pool-aware multi-lane router (91-test repo, 17 lane-specific tests) converts individually-capped lanes into a sustained firehose. This is purely an availability mechanism — it does not alter quality, and it is a budget-research expedient, not a TOS-clean production backend (§3.1).
Defer alignment and typing to query time. Raw counts are not comparable across harnesses, and most of glm's "7×" advantage is identity redundancy (31.8% of facts on dual-IRI subjects, two merged runs) — exactly what query-time identity alignment is designed to fold. Emit free, untyped, multi-namespace now; reconcile by similarity later. Do not maintain synonym tables or namespace-merge maps by hand.
The data demonstrates the abundance thesis three ways: the gleaning loop raises yield 3.7× while raising anchor coverage to 99.1% and keeping one clean namespace (meaningful, not padded); the saturation principle beats a count floor (clean spark glean > degenerate count-padding); and the why-7× decomposition shows the substrate's deferred-joining design is not a hedge but the right place to absorb the identity/typing tax that abundant generation necessarily produces.
Generation is cheap. The engineering — and the win — is in maximizing meaningful, anchored coverage and pushing everything else to query time. This is n=1, directional: one source, one session. The next step is to repeat it across many sources, domains, and models before any of these multipliers is treated as a constant.
All live counts:
donto_statement WHERE upper(tx_time) IS NULL with exact
context = equality on donto-pg, 2026-06-04.
Anchored = distinct live statements whose
statement_id is the source of ≥1 live
donto_evidence_link (upper(tx_time) IS NULL),
via LEFT JOIN on a distinct-statement_id
subquery. statement_timeout 120–300 s. Read-only
SELECTs only; donto invariant I3 (no
destructive overwrite) honored — nothing mutated. The 91-test suite was
re-run this session (pytest tests/ -q →
91 passed); the lanes registry/cap/CLI claims were read
from
donto-extract/src/donto_extract/lanes/{registry,caps,cli}.py.
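
The counting rule above can be restated as a small in-memory sketch. The real measurement runs as SQL against donto-pg; here each row carries a tx_upper field standing in for upper(tx_time), with None meaning the row is live. The Statement and Link records and their field layout are simplifications assumed for this example, not the actual table schemas.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    statement_id: str
    context: str
    tx_upper: Optional[str]  # None stands for upper(tx_time) IS NULL (live)


class Link(NamedTuple):
    statement_id: str        # the statement this evidence link anchors
    tx_upper: Optional[str]


def anchor_rate(statements: list[Statement], links: list[Link],
                context: str) -> tuple[int, int, float]:
    """Return (anchored, total, rate) for live statements in one exact context."""
    live_ids = {s.statement_id for s in statements
                if s.context == context and s.tx_upper is None}
    # Distinct statement ids with at least one live link, so that several
    # links on one statement are counted once.
    linked_ids = {ln.statement_id for ln in links if ln.tx_upper is None}
    anchored = len(live_ids & linked_ids)
    rate = anchored / len(live_ids) if live_ids else 0.0
    return anchored, len(live_ids), rate
```

Deduplicating link sources before intersecting mirrors the distinct-statement_id subquery: without it, a statement with several evidence links would inflate the anchored count.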
Figures that are NOT recomputable from
donto_statement and are labelled so in-text:
per-pass new-key counts, wall times, and the
saturation/usage_capped flags come from
glean-spark.log and glean_smoke.out (run-logs
preserved on the box); per-run reasoning-token budgets (incl. the
~587-token datum) come from run_codex_glean.py run-logs and
are directional; the source-character-coverage percentages (93.3% /
64.6%) come from a span-union analysis script that needs source text +
span offsets; the padded-run line counts and garbage-predicate names are
on-disk/log observations of a run that was not retained in the
DB; and the adversarial mis-anchor rate (~40–46%) comes from the
v2-citer source header plus an n=30 audit log. Every other
number in this report is DB-verified.
Companion reports: donto — The Substrate for Generative Abundance (canonical vision) · Driving Frontier LLMs to Extract Maximally and Sustainably · The Cerebras / Codex Bake-off (provider economics + 5-way detail) · The donto Extraction System