genes.apexpots.com / research source: donto-extraction-engineering-2026-06-04.md

Extraction Engineering for Generative Abundance: Provider Rotation, Gleaning Loops, and the Coverage-not-Count Principle

donto research report — 2026-06-04 — operational companion to donto — The Substrate for Generative Abundance

Scope and honesty contract. This is an engineering report, not a pitch. Every fact count, anchor rate, and namespace split below was re-verified live against donto_statement and donto_evidence_link on donto-pg on 2026-06-04 (read-only SELECTs; donto invariant I3 honored, nothing mutated). Where a number comes from a run log or an analysis script rather than the database, it is labelled as such — and the run-logs themselves (run_codex_glean.py outputs, glean-spark.log, glean_smoke.out, the v2-citer source) are preserved on the box, not asserted from memory. Where a claim is an emerging finding rather than a settled measurement, it is flagged. This is an n=1 single-source study (one article, EntryId 23778) — the findings are directional engineering evidence, not a population benchmark; that caveat applies to the whole report and is not repeated at every line. The companion report The Cerebras / Codex Bake-off covers provider economics and the 5-way bake-off in more depth; this report adds the gleaning loop, the count-vs-coverage saturation principle, and the "why 7×" decomposition — the new material.

1. Executive summary

donto's thesis is that generating typed knowledge is no longer the scarce step. A guided frontier LLM emits an essentially unbounded, multi-directional space of evidence-anchored claims about any entity for fractions of a cent. The engineering question is therefore not "can we generate enough?" but "how do we drive a model to extract maximally (exhaust the meaningful content of a source) and sustainably (keep the firehose running under real-world quota limits), while keeping each claim anchored to its source — and defer typing, alignment, and identity resolution to query time?"

This session answered that question with measurements rather than assertions. The settled findings:

Provider economics force rotation, not a single lane. The operator runs on flat-subscription / prepaid lanes only (per-token API is TOS-clean but unaffordable at this volume). Every such lane hard-caps the sustained firehose — z.ai's GLM coding sub hit its weekly cap (error 1310, reset 2026-06-10), Cerebras PAYG returned HTTP 402 (account out of credit), and ChatGPT-Pro/Codex carries a hidden weekly cap plus datacenter-IP ban risk. The answer is a multi-lane router that rotates by leftover quota, with a declarative cap-detection registry. Built and tested: the whole donto-extract repo suite is 91 tests passing; the lanes module's own file tests/test_lanes.py holds 17 tests.
A 5-way model bake-off on one source. On the same article (EntryId 23778), five extraction configurations were set up; the four that ran produced live fact counts from 320 to 3,590 and anchor rates from 47.2% to 86.4% (the fifth, glm-4.7 @ z.ai, never executed because its lane was capped). The Codex-CLI runs anchored best (76.5%, 86.4%, and up to 99.1% after gleaning) versus the opencode runs (47.2%, 69.6%), subject to an important harness caveat (Section 4.1).
Models self-stop by choice, not by budget. Three baseline single-shot runs landed at {496, 511, 426} facts; the gleaning loop that followed them was never quota-capped (the spark glean log records usage_capped: false, stopping on the pass cap with its last pass still adding 616 new keys). The model judges itself "done" long before it is exhausted. The supporting per-run reasoning-token figures (e.g. a 511-fact run reportedly spending ~587 reasoning tokens) come from run-logs, not the DB — see §5.1 and the appendix.
A harness gleaning loop fixes this. model_reasoning_effort=xhigh + a resume-and-re-prompt loop ("you missed many, append ≥150 new, do not repeat") raised one gpt-5.4 run from 511 → 1,915 facts at 99.1% anchored, a single clean namespace, and 0.00% exact-duplicate triples. A spark run reached 3,227 facts in 6 passes and was still climbing (last pass added 616 new keys) — it stopped on the pass cap, not on saturation.
Count is the wrong target; coverage / saturation is right. Pushing a raw count floor makes models pad with noise: a gpt-5.4 run chasing volume degenerated into garbage predicates (containsCommaCharacter, answerWordCount, coordinateMentionsLatitudeLabel). The disciplined spark glean (3,227 facts, ~90% camelCase, single namespace, 0 exact-dups) is healthier than count-chasing. Abundance is not noise.
The headline "7× more facts" from glm decomposes to ~1.4–1.7× real unique knowledge. The naive ratio (glm 3,590 ÷ codex 511 = 7.0×) is mostly harness artifact: glm's run was two un-deduped merged runs under two near-disjoint namespaces, plus subject-level dual-namespace identity redundancy (31.8% of facts) and bare-flag restatements (18.5%). Raw counts are not comparable across harnesses.

The through-line: maximize meaningful evidence-anchored coverage, anchor as a separable stage, rotate providers to stay alive, and let the substrate fold identity/typing at query time. That is the abundance vision, operationalized and measured.

2. The abundance premise

For sixty years the bottleneck in every knowledge system was generation — a human had to author each typed fact. That scarcity is gone. The literature is unambiguous: GPTKB extracted ~105M triples at ~$0.00009 each; AutoSchemaKG built a 900M-node schema-free graph; per-token costs are falling roughly an order of magnitude per year. Generation is now the cheap, abundant step.

When generation is abundant, three sub-problems become the real work:

Maximize meaningful yield. A model left to its own devices stops extracting when it feels done, not when the source is exhausted (Section 5 shows the run was not budget-bound). Getting the last large fraction of a source's content out requires deliberate harness engineering.
Keep every claim anchored. A fact without a retrievable source span is, for donto's evidence-first model, a liability. Anchoring is a measurable quality axis independent of count (Sections 4–5), and it is best handled as a separate stage (Section 8).
Defer joining. Models invent predicates and entity IRIs as they go. Two runs will describe the same entity under ex: and exo:; one model will mint birthPlace, another placeOfBirth. donto's design answer is emit free / untyped now, reconcile by similarity at query time — never a hand-maintained synonym table. Section 7 shows this is not just philosophy: most of glm's apparent "7×" advantage is exactly this kind of identity redundancy, which query-time alignment is designed to fold.

This report is the operational companion to that vision: how to actually drive the firehose.

3. Provider economics — why every flat-sub lane caps, and the rotation answer

3.1 The constraint

The operator's hard constraint is cost. Per-token public APIs (OpenAI, Anthropic, etc.) are TOS-clean and uncapped, but unaffordable at sustained extraction volume. So extraction runs on flat-subscription or prepaid lanes — and every one of them hard-caps the firehose. Observed live this session:

Lane	Cap mechanism (observed)	State this session
z.ai GLM coding subscription	Weekly cap → error 1310 (returned over HTTP 429)	Capped; reset 2026-06-10 16:43:35
Cerebras PAYG	Out of credit → HTTP 402 `payment_required` (account-wide)	Capped
ChatGPT-Pro / Codex CLI	Hidden weekly cap (~1–2 days at volume) + GCP-datacenter-IP ban risk for automated non-coding use	The only live route at time of writing

TOS nuance (stated plainly). The per-token public APIs are the clean path — they are simply unaffordable here. The subscription lanes are affordable but carry real terms-of-service exposure: the z.ai/GLM "coding" subscription is intended for coding assistance, and driving a 24/7 non-coding extraction firehose through it (or through ChatGPT-Pro from a datacenter IP) is TOS-risky and an expiring subsidy, not a durable foundation. The router below is an availability mechanism for research on a budget; it is not an endorsement of using a coding subscription as a production extraction backend. The durable answer is paying per-token (or self-hosting) once the work is funded.

The economic logic is inescapable: a flat subscription is priced for interactive coding, not a 24/7 extraction firehose. Any single lane will throttle. The answer is to rotate lanes by leftover usage — when one caps, jump to the next with quota remaining.

3.2 The multi-lane router

Built at donto-extract/src/donto_extract/lanes/. Design principles, all consistent with the no-brittle-logic rule (verified in the source this session):

Declarative cap-detection registry. Each lane declares its cap signatures as a CapSignature data record (registry.py), not as an if/elif ladder buried in driver code: z.ai's body_code=("1310",), Cerebras's http_status=(402,), codex's usage-snapshot cap. The matcher in caps.py "matches the data" — there is no if lane == ... branch. New lanes/signatures are added declaratively.
Shared-pool awareness. Codex exposes four model lanes that all carry pool="codex" (verified in registry.py) — they draw from one ChatGPT-Pro quota pool. So failover within Codex buys no headroom — when one Codex model caps, all four are capped. The registry's pool field is what lets the router know failover must jump pool, not just model. (The two Cerebras lanes likewise share pool="cerebras".)
Non-cap discrimination. A Codex sunset-400 is not a quota cap (it's a deprecated-model signal) — the registry/matcher is built to treat it as not-a-cap so the router doesn't waste a failover hop on it.
CLI surface. Verified flags in lanes/cli.py: --lane, --auto, --context, --json, --no-probe, --timeout. There is no literal --status flag — lane status is read through the registry view, not a top-level flag. (An earlier internal note claimed a --status flag and a router "91-test" count; both are corrected here — see the test footprint below.)
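The declarative registry idea can be illustrated with a minimal sketch. Only the CapSignature name, the body_code=("1310",) and http_status=(402,) values, and the pool field come from this report; the lane names, the not_a_cap_status field, and the is_capped function are hypothetical stand-ins and do not reproduce registry.py or caps.py. The Codex usage-snapshot cap is left out because its shape is not described here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CapSignature:
    """Data record: how one lane signals that its quota is exhausted."""
    lane: str                                # hypothetical lane name
    pool: str                                # lanes in one pool share one quota
    http_status: tuple[int, ...] = ()        # HTTP statuses that mean "capped"
    body_code: tuple[str, ...] = ()          # provider error codes that mean "capped"
    not_a_cap_status: tuple[int, ...] = ()   # hypothetical: statuses never treated as a cap


# Illustrative registry; new lanes are added as data, not as new branches.
REGISTRY = {
    sig.lane: sig
    for sig in (
        CapSignature("zai-glm", pool="zai", body_code=("1310",)),
        CapSignature("cerebras-glm", pool="cerebras", http_status=(402,)),
        CapSignature("codex-normal", pool="codex", not_a_cap_status=(400,)),
    )
}


def is_capped(sig: CapSignature, status: int, code: str | None = None) -> bool:
    """Match a response against the record's data; no per-lane branching."""
    if status in sig.not_a_cap_status:
        return False
    return status in sig.http_status or (code is not None and code in sig.body_code)
```

In this sketch a z.ai response with HTTP 429 and body code 1310 matches as a cap, a bare 429 does not, and a Codex 400 is never counted as a cap.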

Test footprint (corrected for honesty). Verified by running the suite this session: the whole donto-extract repo suite is 91 tests passing (pytest tests/ -q → 91 passed); the lanes module's own file tests/test_lanes.py contains 17 tests. The router is not independently "91 tests" — that figure is the entire repo. Both numbers are now correct in this report.

The rotation strategy is what converts a set of individually-capped lanes into a sustained firehose. None of it changes extraction quality — that's the next section.
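Pool-aware rotation by leftover quota can be sketched as follows. This is an illustration under stated assumptions, not the router's code: the Lane record, the leftover fraction (imagined as the result of a quota probe), and the next_lane function are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str        # hypothetical lane name
    pool: str        # quota pool the lane draws from
    leftover: float  # assumed probe result: fraction of quota remaining (0.0 to 1.0)


def next_lane(lanes, capped_pools):
    """Return the lane with the most leftover quota outside every capped pool."""
    candidates = [
        lane for lane in lanes
        if lane.pool not in capped_pools and lane.leftover > 0
    ]
    # default=None signals that every pool is exhausted.
    return max(candidates, key=lambda lane: lane.leftover, default=None)


LANES = [
    Lane("codex-a", "codex", 0.4),
    Lane("codex-b", "codex", 0.9),
    Lane("cerebras-glm", "cerebras", 0.2),
    Lane("zai-glm", "zai", 0.6),
]
```

Filtering by pool rather than by lane name is the point: once the codex pool is marked capped, neither codex-a nor codex-b is considered, even though codex-b reports the largest leftover.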

4. The 5-way model bake-off

4.1 Read these caveats first

Two different agentic harnesses. Configurations A/B/C were set up under opencode (C, the z.ai glm lane, never executed); D/E ran under the Codex CLI. These are different agent drivers. Extraction quality (anchor rate, namespace cleanliness) is comparable across them; raw speed and raw count are not like-for-like. Do not read the count column as a clean model ranking.
glm's 3,590 was two merged runs. The glm-4.7@Cerebras context is not one extraction — it is two un-deduped runs concatenated under two near-disjoint namespaces (live: exo: 2,327 + ex: 1,258 facts; only 30 distinct local subject-names appear in both, ~6% overlap). Treat its 3,590 as a merged figure, not a single-run yield. Section 7 decomposes this in full.

4.2 The table (all numbers live-verified 2026-06-04)

#	Model / config	Harness	Live facts	Subj	Pred	Anchored	Anchor %	Namespaces
A	gpt-oss-120b @ Cerebras	opencode	320	145	95	151	47.2%	1 (`ex:`)
B	glm-4.7 @ Cerebras (two merged runs)	opencode	3,590	486	1,852	2,498	69.6%	5 (dom. `exo:`+`ex:`)
C	glm-4.7 @ z.ai	opencode	—	—	—	—	—	never ran (lane capped, error 1310)
D	gpt-5.4 (codex-normal)	Codex CLI	511	119	292	391	76.5%	1 (`ex:`)
E	gpt-5.3-codex-spark (single run)	Codex CLI	426	78	213	368	86.4%	1

Contexts (live): ctx:test/cerebras-gptoss/23778, ctx:test/cerebras-glm/23778, ctx:test/codex-normal/23778, ctx:test/codex-spark/23778. The z.ai glm lane (ctx:test/zai-glm/23778) has 0 rows — it never executed because the lane was capped, an unintended but instructive demonstration of Section 3.

4.3 What the table says

Anchoring leadership is real and belongs to the Codex-CLI runs. 76.5% and 86.4% (single-run) beat the opencode runs' 47.2% and 69.6%. A supplementary Codex smoke run (ctx:test/codex-smoke/23778, 496 facts) anchored at 99.2% (492/496, live-verified). The Codex harness simply tracks its citations better.
Count alone is misleading. glm's 3,590 looks dominant but is two merged runs riddled with dual-namespace identity redundancy (Section 7). The single-run Codex configs are smaller but cleaner (one namespace, higher anchoring).
glm's 1,852 distinct predicates vs Codex's ~290 is the abundance signature — and the alignment problem — in miniature: free-minted predicates, to be reconciled at query time, not pruned at write time.

5. The gleaning loop — models self-stop by choice, not budget

5.1 The evidence that "done" is a judgment, not a ceiling

Three baseline single-shot Codex runs on the same source landed at {496, 511, 426} facts. The substantive claim — that the model is satisficing, not running out of budget — is supported two ways:

Directly, from the live glean log (the strongest evidence). The 6-pass spark glean records "usage_capped": false and "stop_reason": "max_passes (6)" with its final pass still adding 616 new keys. The run was not throttled; it would have kept producing. This is verified in glean-spark.log on the box.
From per-run token accounting (run-log, not DB). Internal notes record the {496,511,426}-fact runs against token budgets of roughly {127.6k, 104.1k, 89.1k}, with the 511-fact run reportedly spending only ~587 reasoning tokens — i.e. count does not track spend. This specific per-run reasoning-token figure is a run-log observation that the surviving logs did not let me re-derive line-for-line; treat it as directional, not DB-verified. It is consistent with, but weaker than, the usage_capped:false evidence above.

Either way the operational insight holds: left alone, a frontier model extracts what it considers a reasonable, representative set and stops — it satisfices. For donto's maximal-extraction goal that is a bug, not a feature.

5.2 The fix: xhigh effort + a harness resume-and-re-prompt loop

Two levers, applied together:

model_reasoning_effort = xhigh — push the model to think harder per pass. (Verified in the spark run's rollout: all 10 turn-context efforts were honored as xhigh.)
A harness gleaning loop — codex exec --resume <session-id>, re-prompting each pass with coverage framing: "you missed many; append ≥150 NEW facts; do not repeat what you already emitted." Stop when a pass yields <30 new keys twice in a row (a saturation gate), or on a max-pass cap.

The framing matters: telling the model it missed things and asking for coverage (not "more facts") is what overrides the satisfice instinct.
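The stop logic of the gleaning loop can be sketched as below. This is a minimal illustration, not run_codex_glean.py: the run_pass callable stands in for one resumed, re-prompted model pass and is assumed to return the set of fact keys that pass emitted; glean and fake_pass_factory are hypothetical names. The thresholds (fewer than 30 new keys, twice in a row, or a max-pass cap) are the ones stated above.

```python
from typing import Callable, Set


def glean(run_pass: Callable[[int], Set[str]], max_passes: int = 6,
          dry_threshold: int = 30, dry_streak_needed: int = 2) -> dict:
    """Re-run passes until two consecutive dry passes or the pass cap."""
    seen: Set[str] = set()
    new_per_pass = []
    dry_streak = 0
    stop_reason = f"max_passes ({max_passes})"
    for pass_no in range(1, max_passes + 1):
        new = run_pass(pass_no) - seen   # keep only keys not emitted before
        seen |= new
        new_per_pass.append(len(new))
        # A dry pass extends the streak; a productive pass resets it.
        dry_streak = dry_streak + 1 if len(new) < dry_threshold else 0
        if dry_streak >= dry_streak_needed:
            stop_reason = "saturated"
            break
    return {"total": len(seen), "new_per_pass": new_per_pass,
            "stop_reason": stop_reason}


def fake_pass_factory(sizes):
    """Test stub: pass n emits sizes[n-1] fresh keys (stands in for the model)."""
    def run_pass(pass_no: int) -> Set[str]:
        return {f"pass{pass_no}-key{i}" for i in range(sizes[pass_no - 1])}
    return run_pass
```

With stubbed pass sizes of 200, 100, 10, 5 the loop stops as saturated after the fourth pass; with six passes of 100 new keys each it stops on the pass cap, which is the case the report treats as "not yet exhausted".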

5.3 Results (live-verified)

gpt-5.4 glean — ctx:test/codex-glean-smoke/23778:

511 → 1,915 facts (a 3.7× lift over the single-shot baseline), 2 passes.
99.1% anchored (1,897 / 1,915, live-verified) — anchoring went up, not down.
Single ex: namespace; 0.00% exact-duplicate triples (verified: 1,915 distinct subject+predicate+object triples, zero exact repeats).
376 subjects / 398 predicates. Wall time ~2,083 s (~34.7 min) (run-log). Caveat from the log: pass 1 hit the 1,500 s timeout (timed_out: true) yet still appended 1,298 facts before the cut; pass 2 added 617 in 583 s.

spark glean — ctx:test/codex-glean-spark/23778:

3,227 facts in 6 passes. Per-pass cumulative keys (from the log): 1,245 → 1,922 → 2,074 → 2,267 → 2,611 → 3,227 (new keys per pass: 1,245 / 677 / 152 / 193 / 344 / 616).
It stopped on the max-pass cap, NOT on the dry gate (stop_reason: max_passes (6), usage_capped: false) — the last pass added 616 new keys. The source was not saturated; there was more to extract.
50.0% self-anchored (1,613 / 3,227, live-verified), which is lower and is addressed by the post-hoc citer in Section 8. ~90% camelCase predicates (live + log agree: 280 / 306 distinct = 91.5%; 2,897 / 3,227 facts = 89.8%). Single ex: namespace; 0.00% exact-duplicate triples. Wall time ~1,306 s (~22 min) (run-log).
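The per-pass figures quoted for the spark glean can be re-derived from the cumulative series with a few lines. The numbers are the ones reported from the log; the function name and the reuse of the 30-key dry threshold are illustrative.

```python
# Cumulative key counts per pass, as quoted from the spark glean log.
CUMULATIVE = [1245, 1922, 2074, 2267, 2611, 3227]


def new_keys_per_pass(cumulative):
    """Difference each cumulative total against the previous one (0 before pass 1)."""
    previous = [0] + cumulative[:-1]
    return [now - before for before, now in zip(previous, cumulative)]


NEW_KEYS = new_keys_per_pass(CUMULATIVE)
# No pass fell under the 30-key dry threshold, so the dry gate could not have fired.
ANY_DRY_PASS = any(n < 30 for n in NEW_KEYS)
```

The smallest per-pass gain in this series is 152 new keys, so the run could only have ended on the pass cap.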

5.4 Why the gpt-5.4 glean is the right exemplar

511 → 1,915 is the cleanest "maximize meaningful yield" result in the whole set: a 3.7× count lift that simultaneously raised anchor coverage to 99.1%, kept a single clean namespace, and produced zero exact-duplicate triples. That is meaningful abundance — more real anchored claims, not padding. It is the pattern the engine should default to.

6. Count is the wrong target — coverage and saturation are right

6.1 The cautionary tale

When the loop was instead given a raw count floor ("emit thousands"), a gpt-5.4 run obeyed — and degenerated into noise. The predicate tail collapsed into trivially-true string-property assertions — predicates like answerWordCount, containsCommaCharacter, containsApostropheCharacter, and (this source carries lat/long) coordinateMentionsLatitudeLabel. These are not knowledge about the entity; they are the model manufacturing filler to hit a number.

Caveat (honesty: this one is NOT a live DB context). This padded run was not retained in live donto_statement (consistent with it being a rejected output). Its raw line count (recorded in run-logs at roughly 7,500–8,400 valid lines / ~560 distinct predicates) and the specific garbage-predicate names are therefore harness-log / on-disk observations, not a queryable context; they are presented as observed during the run, and I do not claim a precise live figure for it. The clean side of the contrast, however, rests entirely on live data: the disciplined spark glean's profile (89.8% camelCase, 0.00% exact-dup, single namespace, all verified above) is in the database.

6.2 The principle

Abundance ≠ noise. A count floor optimizes the wrong objective and the model games it. The right target is exhaustive meaningful coverage, with saturation deciding "done" — the operator's own framing: "not an absolute number, just everything possible."

The two glean runs make this concrete:

The spark glean (3,227, ~90% camelCase, single namespace, 0 exact-dups) is healthier than a count-chasing run padded toward thousands of containsCommaCharacter-grade lines — even with fewer facts.
And it should have kept going: it stopped on the pass cap with its last pass still adding 616 new keys. Saturation, not a number, is the correct stop condition — and here saturation had not been reached.

The operational rule that falls out: drive to saturation (a falling new-key curve), reject count floors, and let the predicate-cleanliness profile (camelCase share, exact-dup rate, namespace count) be a live quality monitor.
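A predicate-cleanliness monitor of this kind could be computed as in the sketch below. It assumes facts are (subject, predicate, object) tuples of prefixed names such as ex:birthPlace, and it assumes a particular reading of "camelCase" (lowercase first letter, no separators); the report does not state the exact test used, so both the regular expression and the function names are hypothetical.

```python
import re

# Assumption: "camelCase" means a lowercase first letter and no separators.
LOWER_CAMEL = re.compile(r"^[a-z][A-Za-z0-9]*$")


def local_name(term: str) -> str:
    """Strip a namespace prefix such as 'ex:' if one is present."""
    return term.split(":", 1)[-1]


def quality_profile(triples):
    """Return camelCase share, exact-duplicate rate, and subject namespace count."""
    if not triples:
        return {"camel_share": 0.0, "exact_dup_rate": 0.0, "namespace_count": 0}
    total = len(triples)
    camel = sum(1 for _, p, _ in triples if LOWER_CAMEL.match(local_name(p)))
    exact_dups = total - len(set(triples))  # repeats of an identical triple
    namespaces = {s.split(":", 1)[0] for s, _, _ in triples if ":" in s}
    return {
        "camel_share": camel / total,
        "exact_dup_rate": exact_dups / total,
        "namespace_count": len(namespaces),
    }
```

These three numbers are form checks only. A predicate such as containsCommaCharacter is well-formed camelCase, so this profile would not flag count-padding by itself; it complements the falling new-key curve rather than replacing it.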

7. The "why 7×" decomposition — measured

glm's 3,590 vs codex-normal's 511 is a naive 7.0×. Decomposed against live data, the real unique-knowledge multiple is ~1.4–1.7×. The factors (DB-verified except the one row explicitly marked):

Factor	Multiplier	Evidence
Harness — two merged runs	~1.9×	glm context is two un-deduped runs: `exo:` 2,327 + `ex:` 1,258 facts (live); only 30 of ~483 distinct local subject-names appear in both namespaces (~6% overlap → near-disjoint, never deduped)
Granularity / redundancy	~1.5×	31.8% of glm facts (1,140 / 3,590, live) sit on subjects whose local name exists in both `ex:` and `exo:` — the same entity described twice under two IRIs. Plus 18.5% (663 / 3,590, live) bare `true`/`1`/`yes` flag restatements
Q&A reification	~1.1×	The source is a Q&A transcript; glm reified questions/answers as extra subjects
Real extra source coverage	~1.45×	Span-union analysis: glm touched ~93.3% of source characters vs codex ~64.6% — analysis-script figure, not recomputable from DB counts alone (see §7.1)

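Net: the three artifact factors multiply to ≈ 1.9 × 1.5 × 1.1 ≈ 3.1×, and including the ~1.45× real-coverage factor brings the product to ≈ 4.5×. Most of the apparent advantage is therefore artifact; the honest unique-knowledge multiple is ~1.4–1.7×. (These four factors are not cleanly orthogonal, and their product does not reproduce the naive 7.0× exactly, so the decomposition is an estimate, not an identity; every input number is sourced.)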
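The arithmetic behind the decomposition can be re-run directly. The sketch only multiplies the factors from the table and compares the product with the naive ratio; the variable names are illustrative, and because the factors overlap, the residual is a rough consistency check rather than a derived result.

```python
import math

# Factors copied from the table above.
ARTIFACT_FACTORS = {"merged_runs": 1.9, "redundancy": 1.5, "qa_reification": 1.1}
REAL_COVERAGE = 1.45

naive_ratio = 3590 / 511                                 # glm facts / codex-normal facts
artifact_product = math.prod(ARTIFACT_FACTORS.values())  # harness and redundancy only
full_product = artifact_product * REAL_COVERAGE          # all four factors
residual = naive_ratio / full_product                    # what the factors leave unexplained

print(f"naive {naive_ratio:.2f}x, artifact {artifact_product:.2f}x, "
      f"all four {full_product:.2f}x, residual {residual:.2f}x")
```

The product of all four factors (about 4.5×) falls short of the naive 7.0×; the leftover ratio of roughly 1.5× lies inside the ~1.4–1.7× band quoted for real unique knowledge.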

7.1 Two precisions that matter for honesty

The dual-namespace figure is subject-level redundancy, not exact-triple restatement. Live: 31.8% (1,140/3,590) of facts are on subjects appearing under both IRIs — an identity-resolution tax. The exact subject+predicate+object dual-namespace duplication is only ~1.2% (42 facts / 21 shared keys, live). Both are real; this report uses the right one in each place. The 31.8% is precisely what donto's query-time identity alignment is designed to fold — it is not waste, it is deferred joining made visible.
The 93.3%-vs-64.6% source-character coverage is the one number not recomputable from the DB alone — it needs the source text plus span char-offsets, so it is a span-union measurement from the analysis script, labelled as such. The anchor-span counts that feed it are DB-verified (glm 2,498 anchored facts; codex-normal 391).
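A span-union measurement of this kind can be sketched as an interval merge over character offsets. This is not the analysis script itself: it assumes each anchor is a half-open (start, end) character range into the source text, and the function name is hypothetical.

```python
def span_union_coverage(spans, source_len):
    """Fraction of source characters covered by at least one (start, end) span."""
    if source_len <= 0:
        raise ValueError("source_len must be positive")
    covered = 0
    frontier = 0  # end of the region already counted
    for start, end in sorted(spans):
        start = max(start, frontier)   # skip the part already counted
        end = min(end, source_len)     # clamp spans that overrun the text
        if end > start:
            covered += end - start
            frontier = end
    return covered / source_len
```

Overlapping and nested spans are counted once, which is why many anchored facts pointing at the same sentence do not inflate the coverage figure.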

7.2 The lesson

Raw counts are not comparable across harnesses. glm's "7×" is mostly merged-run concatenation plus dual-namespace identity redundancy — exactly the kind of thing the substrate reconciles at query time. Meanwhile Codex stops by choice, not budget, and its single disciplined namespace means multi-session UNION dedups cleanly — no ex:/exo: identity-duplication tax to pay later. The right target remains exhaustive meaningful coverage judged by saturation, with identity and typing reconciled downstream — donto's emit-free / defer-joining thesis, validated by the numbers.
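The subject-level redundancy measure from Section 7.1 (facts sitting on subjects whose local name exists under both namespaces) can be sketched as follows. It assumes subjects are prefixed names such as ex:john-smith; the function name and the sample data are hypothetical, and the live figure in this report was computed in the database, not with this code.

```python
def dual_namespace_share(facts, ns_a="ex", ns_b="exo"):
    """Share of facts whose subject's local name appears under both namespaces."""
    locals_by_ns = {ns_a: set(), ns_b: set()}
    for subject, _, _ in facts:
        ns, _, local = subject.partition(":")
        if ns in locals_by_ns:
            locals_by_ns[ns].add(local)
    shared = locals_by_ns[ns_a] & locals_by_ns[ns_b]
    hits = 0
    for subject, _, _ in facts:
        ns, _, local = subject.partition(":")
        if ns in locals_by_ns and local in shared:
            hits += 1
    share = hits / len(facts) if facts else 0.0
    return share, shared


SAMPLE = [
    ("ex:alice", "ex:bornIn", "ex:ohio"),
    ("exo:alice", "exo:age", "42"),
    ("ex:bob", "ex:bornIn", "ex:iowa"),
    ("exo:carol", "exo:age", "30"),
]
```

In the sample, alice is described under both ex: and exo:, so two of the four facts count toward the share even though no triple is an exact duplicate. That is the distinction between the 31.8% subject-level figure and the ~1.2% exact-triple figure.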

8. Forward: always-on post-hoc citing (emerging — honest)

The gleaning loop maximizes yield; anchoring is a separate quality axis, and the cleanest way to handle it is to separate extraction from anchoring — let the model extract freely, then run an always-on post-hoc citer that locates a supporting span for each emitted fact.

v1 result (verified two ways). The citer lifted the spark glean's anchoring substantially. Measured against the citer's own input fact file, it raised anchor coverage 47.1% → 90.2%. Measured the donto-native way — distinct live self-anchored statements in donto_statement/donto_evidence_link — the re-ingested context ctx:test/spark-cited/23778 (3,229 facts; the re-ingest added 2 vs the 3,227 source) goes from 50.0% → 91.9% anchored (2,969 / 3,229, live-verified). (The two bases — input-file anchor rate vs DB self-anchor rate — differ slightly; both are reported so neither is cherry-picked.) Either way it is a large, real coverage win, and it confirms the architectural bet: anchoring as a downstream stage works.

v1 problem (honest, do not overclaim). An adversarial audit found that a locatable citation is not always a supporting one. On a small, deliberately relational-skewed sample, roughly 40–46% of the recovered relational citations were wrong: the citer found a span that mentions an entity but does not support the asserted relation (e.g. question-34 askedBy mr-watts anchored to a different question's "By Mr. WATTS:" line because both share the token watts). The ~40% figure is documented in the v2-citer source header; the ~46% figure comes from an n=30 adversarial-judge sample (relational subset n≈26, ~12 wrong) recorded in the run-logs. This is a worst-case adversarial probe on a curated relational sample, not a population rate; most facts are simpler attributive/literal claims, which the citer anchored correctly in v1. But it is a genuine correctness gap, and the exact rate is a run-log/audit figure, not a DB-verified population statistic.

v2 (in progress). An instance-aware co-location gate (real, in-tree at cite_facts_v2.py): classify each fact structurally by object type — literal objects keep the v1 lexical layer; IRI (relational) objects must find a span that co-locates BOTH endpoints (the subject's distinguishing token, weighted by inverse batch document-frequency — IDF computed from the data, never a hand-maintained stopword/synonym list — AND the object's value token) in the same window, else route to semantic, else mark unanchorable rather than attach a plausible-looking neighbour. A wrong span is worse than none. This is emerging work, not a settled result — its task is in progress this session and no precise post-fix correctness number is claimed here.
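The co-location idea for relational facts can be illustrated with a small sketch. It is not cite_facts_v2.py: it assumes entity local names are hyphen-separated tokens, computes IDF over the names in one batch, picks each endpoint's highest-IDF token, and accepts a span only if both tokens fall inside one character window. The window size, the use of IDF for the object token, all function names, and the toy transcript are assumptions; the semantic fallback and the literal-object path are omitted.

```python
import math
import re
from collections import Counter


def tokens(name):
    return re.findall(r"[a-z0-9]+", name.lower())


def build_idf(names):
    """Inverse document frequency over the distinct entity names in a batch."""
    distinct = set(names)
    df = Counter(tok for name in distinct for tok in set(tokens(name)))
    return {tok: math.log(len(distinct) / count) for tok, count in df.items()}


def key_token(name, idf):
    """The token that best distinguishes this name within the batch."""
    toks = tokens(name)
    return max(toks, key=lambda t: idf.get(t, 0.0)) if toks else None


def hits(token, text):
    pattern = rf"\b{re.escape(token)}\b"
    return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]


def colocated_span(source, subject, obj, idf, window=40):
    """Return a span containing both key tokens, or None (unanchorable)."""
    s_tok, o_tok = key_token(subject, idf), key_token(obj, idf)
    if s_tok is None or o_tok is None:
        return None
    for s_start, s_end in hits(s_tok, source):
        for o_start, o_end in hits(o_tok, source):
            lo, hi = min(s_start, o_start), max(s_end, o_end)
            if hi - lo <= window:
                return lo, hi
    return None


# Toy transcript and batch, invented for illustration only.
SOURCE = (
    "33. By Mr. WATTS: Where were you born?\n"
    "A. In Ohio, as far as I have been told by my family.\n"
    "34. By Mr. WATTS: When did you arrive?\n"
    "A. In the spring, I believe, though I cannot give the exact day now.\n"
    "35. By Mr. JONES: Who met you there?\n"
)
IDF = build_idf(["question-33", "question-34", "question-35", "mr-watts", "mr-jones"])
```

On the toy transcript, question-34 askedBy mr-watts anchors to the line that carries both "34" and "WATTS", not to question 33's "By Mr. WATTS:" line, and question-35 askedBy mr-watts returns None rather than attaching a neighbouring span.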

The takeaway for the architecture: anchor as a separate, auditable stage; measure not just coverage but correctness; and treat "locatable ≠ supporting" as a first-class problem.

9. What this means for the substrate

Operationalizing generative abundance comes down to four engineering commitments, each now backed by live measurement:

Maximize meaningful coverage, not count. Models satisfice and self-stop while still producing (the spark glean was usage_capped: false with a 616-new-key final pass). Drive them to saturation with xhigh effort + a resume/re-prompt gleaning loop and coverage framing. The exemplar: 511 → 1,915 facts at 99.1% anchored, single namespace, 0 exact-dups. Reject count floors — they produce containsCommaCharacter-grade noise.
Anchor as a separable stage. Extraction and citation are different quality axes. An always-on post-hoc citer lifted anchoring 50.0% → 91.9% (DB-verified) — but "locatable ≠ supporting," so the citer needs a co-location correctness gate (v2, in progress). Measure coverage and correctness.
Rotate providers to stay alive. Every flat-sub/prepaid lane caps (z.ai 1310, Cerebras 402, Codex weekly + IP risk). A declarative, pool-aware multi-lane router (91-test repo, 17 lane-specific tests) converts individually-capped lanes into a sustained firehose. This is purely an availability mechanism — it does not alter quality, and it is a budget-research expedient, not a TOS-clean production backend (§3.1).
Defer alignment and typing to query time. Raw counts are not comparable across harnesses, and most of glm's "7×" advantage is identity redundancy (31.8% of facts on dual-IRI subjects, two merged runs) — exactly what query-time identity alignment is designed to fold. Emit free, untyped, multi-namespace now; reconcile by similarity later. Do not maintain synonym tables or namespace-merge maps by hand.

The data demonstrates the abundance thesis three ways: the gleaning loop raises yield 3.7× while raising anchor coverage to 99.1% and keeping one clean namespace (meaningful, not padded); the saturation principle beats a count floor (clean spark glean > degenerate count-padding); and the why-7× decomposition shows the substrate's deferred-joining design is not a hedge but the right place to absorb the identity/typing tax that abundant generation necessarily produces.

Generation is cheap. The engineering — and the win — is in maximizing meaningful, anchored coverage and pushing everything else to query time. This is n=1, directional: one source, one session. The next step is to repeat it across many sources, domains, and models before any of these multipliers is treated as a constant.

Appendix — verification method

All live counts: donto_statement WHERE upper(tx_time) IS NULL, filtered by exact equality on context, on donto-pg, 2026-06-04. Anchored = distinct live statements whose statement_id is the source of ≥1 live donto_evidence_link (upper(tx_time) IS NULL), via LEFT JOIN on a distinct-statement_id subquery. statement_timeout 120–300 s. Read-only SELECTs only; donto invariant I3 (no destructive overwrite) honored, nothing mutated. The 91-test suite was re-run this session (pytest tests/ -q → 91 passed); the lanes registry/cap/CLI claims were read from donto-extract/src/donto_extract/lanes/{registry,caps,cli}.py.
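The shape of the anchored-count query can be sketched against a stand-in schema. This sketch runs on an in-memory SQLite database, not on donto-pg: the range test upper(tx_time) IS NULL is replaced by a nullable tx_hi column, the link table is assumed to reference statements through a statement_id column, and the sample rows are invented. Only the table names, the live-row filter, and the LEFT JOIN on a distinct-statement_id subquery follow the description above.

```python
import sqlite3

ANCHOR_QUERY = """
SELECT COUNT(*) AS facts, COUNT(a.statement_id) AS anchored
FROM donto_statement AS s
LEFT JOIN (
    SELECT DISTINCT statement_id
    FROM donto_evidence_link
    WHERE tx_hi IS NULL              -- live links only
) AS a ON a.statement_id = s.statement_id
WHERE s.tx_hi IS NULL                -- live statements only
  AND s.context = ?                  -- exact context equality
"""


def anchor_counts(conn, context):
    """Return (live facts, live facts with at least one live evidence link)."""
    facts, anchored = conn.execute(ANCHOR_QUERY, (context,)).fetchone()
    return facts, anchored


def demo_connection():
    """Build a tiny stand-in database (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE donto_statement (
            statement_id INTEGER PRIMARY KEY, context TEXT, tx_hi TEXT);
        CREATE TABLE donto_evidence_link (statement_id INTEGER, tx_hi TEXT);
        INSERT INTO donto_statement VALUES
            (1, 'ctx:demo', NULL), (2, 'ctx:demo', NULL),
            (3, 'ctx:demo', NULL), (4, 'ctx:demo', '2026-06-01');
        INSERT INTO donto_evidence_link VALUES
            (1, NULL), (1, NULL), (2, '2026-06-01'), (3, NULL), (4, NULL);
    """)
    return conn
```

The distinct subquery is what keeps a statement with several evidence links from being counted more than once, and COUNT over the joined column skips statements with no live link.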

Figures that are NOT recomputable from donto_statement and are labelled so in-text: per-pass new-key counts, wall times, and the saturation/usage_capped flags come from glean-spark.log and glean_smoke.out (run-logs preserved on the box); per-run reasoning-token budgets (incl. the ~587-token datum) come from run_codex_glean.py run-logs and are directional; the source-character-coverage percentages (93.3% / 64.6%) come from a span-union analysis script that needs source text + span offsets; the padded-run line counts and garbage-predicate names are on-disk/log observations of a run that was not retained in the DB; and the adversarial mis-anchor rate (~40–46%) comes from the v2-citer source header plus an n=30 audit log. Every other number in this report is DB-verified.

Companion reports: donto — The Substrate for Generative Abundance (canonical vision) · Driving Frontier LLMs to Extract Maximally and Sustainably · The Cerebras / Codex Bake-off (provider economics + 5-way detail) · The donto Extraction System