genes.apexpots.com / research source: donto-extraction-engineering-2026-06-04.md

Extraction Engineering for Generative Abundance: Provider Rotation, Gleaning Loops, and the Coverage-not-Count Principle

donto research report — 2026-06-04 — operational companion to donto — The Substrate for Generative Abundance

Scope and honesty contract. This is an engineering report, not a pitch. Every fact count, anchor rate, and namespace split below was re-verified live against donto_statement and donto_evidence_link on donto-pg on 2026-06-04 (read-only SELECTs; donto invariant I3 honored, nothing mutated). Where a number comes from a run log or an analysis script rather than the database, it is labelled as such — and the run-logs themselves (run_codex_glean.py outputs, glean-spark.log, glean_smoke.out, the v2-citer source) are preserved on the box, not asserted from memory. Where a claim is an emerging finding rather than a settled measurement, it is flagged. This is an n=1 single-source study (one article, EntryId 23778) — the findings are directional engineering evidence, not a population benchmark; that caveat applies to the whole report and is not repeated at every line. The companion report The Cerebras / Codex Bake-off covers provider economics and the 5-way bake-off in more depth; this report adds the gleaning loop, the count-vs-coverage saturation principle, and the "why 7×" decomposition — the new material.


1. Executive summary

donto's thesis is that generating typed knowledge is no longer the scarce step. A guided frontier LLM emits an essentially unbounded, multi-directional space of evidence-anchored claims about any entity for fractions of a cent. The engineering question is therefore not "can we generate enough?" but "how do we drive a model to extract maximally (exhaust the meaningful content of a source) and sustainably (keep the firehose running under real-world quota limits), while keeping each claim anchored to its source — and defer typing, alignment, and identity resolution to query time?"

This session answered that question with measurements rather than assertions. The settled findings:

The through-line: maximize meaningful evidence-anchored coverage, anchor as a separable stage, rotate providers to stay alive, and let the substrate fold identity/typing at query time. That is the abundance vision, operationalized and measured.


2. The abundance premise

For sixty years the bottleneck in every knowledge system was generation — a human had to author each typed fact. That scarcity is gone. The literature is unambiguous: GPTKB extracted ~105M triples at ~$0.00009 each; AutoSchemaKG built a 900M-node schema-free graph; per-token costs are falling roughly an order of magnitude per year. Generation is now the cheap, abundant step.

When generation is abundant, three sub-problems become the real work:

  1. Maximize meaningful yield. A model left to its own devices stops extracting when it feels done, not when the source is exhausted (Section 5 shows the run was not budget-bound). Getting the last large fraction of a source's content out requires deliberate harness engineering.

  2. Keep every claim anchored. A fact without a retrievable source span is, for donto's evidence-first model, a liability. Anchoring is a measurable quality axis independent of count (Sections 4–5), and it is best handled as a separate stage (Section 8).

  3. Defer joining. Models invent predicates and entity IRIs as they go. Two runs will describe the same entity under ex: and exo:; one model will mint birthPlace, another placeOfBirth. donto's design answer is emit free / untyped now, reconcile by similarity at query time — never a hand-maintained synonym table. Section 7 shows this is not just philosophy: most of glm's apparent "7×" advantage is exactly this kind of identity redundancy, which query-time alignment is designed to fold.
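To make "reconcile by similarity" concrete, here is a minimal sketch that compares two predicate names by the overlap of their camelCase tokens. It is only a lexical stand-in: the report does not say which similarity measure donto uses at query time, and the helper names `predicate_tokens` and `predicate_similarity` are hypothetical.

```python
import re


def predicate_tokens(name: str) -> frozenset:
    """Split a predicate such as 'ex:placeOfBirth' into lowercase word tokens."""
    local = name.split(":", 1)[-1]  # drop a namespace prefix if present
    return frozenset(t.lower() for t in re.findall(r"[A-Za-z][a-z0-9]*", local))


def predicate_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets (assumed measure, not donto's)."""
    ta, tb = predicate_tokens(a), predicate_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # The two predicates named in the text share 'birth' and 'place'.
    print(predicate_similarity("ex:birthPlace", "exo:placeOfBirth"))
    # A string-property filler predicate shares nothing with them.
    print(predicate_similarity("birthPlace", "answerWordCount"))
```

The point of the sketch is that the score is computed from the names themselves at comparison time, so no synonym table has to be written or maintained by hand.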

This report is the operational companion to that vision: how to actually drive the firehose.


3. Provider economics — why every flat-sub lane caps, and the rotation answer

3.1 The constraint

The operator's hard constraint is cost. Per-token public APIs (OpenAI, Anthropic, etc.) are TOS-clean and uncapped, but unaffordable at sustained extraction volume. So extraction runs on flat-subscription or prepaid lanes — and every one of them hard-caps the firehose. Observed live this session:

| Lane | Cap mechanism (observed) | State this session |
|---|---|---|
| z.ai GLM coding subscription | Weekly cap → error 1310 (returned over HTTP 429) | Capped; reset 2026-06-10 16:43:35 |
| Cerebras PAYG | Out of credit → HTTP 402 payment_required (account-wide) | Capped |
| ChatGPT-Pro / Codex CLI | Hidden weekly cap (~1–2 days at volume) + GCP-datacenter-IP ban risk for automated non-coding use | The only live route at time of writing |

TOS nuance (stated plainly). The per-token public APIs are the clean path — they are simply unaffordable here. The subscription lanes are affordable but carry real terms-of-service exposure: the z.ai/GLM "coding" subscription is intended for coding assistance, and driving a 24/7 non-coding extraction firehose through it (or through ChatGPT-Pro from a datacenter IP) is TOS-risky and an expiring subsidy, not a durable foundation. The router below is an availability mechanism for research on a budget; it is not an endorsement of using a coding subscription as a production extraction backend. The durable answer is paying per-token (or self-hosting) once the work is funded.

The economic logic is inescapable: a flat subscription is priced for interactive coding, not a 24/7 extraction firehose. Any single lane will throttle. The answer is to rotate lanes by leftover usage — when one caps, jump to the next with quota remaining.
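The rotation rule can be sketched in a few lines. This is an illustration under stated assumptions, not the router in donto-extract: the `Lane` class, the `remaining` estimate, and the `call` signature are hypothetical. The only cap signals it recognizes are the two observed above (error 1310 over HTTP 429, and HTTP 402).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Lane:
    name: str
    remaining: float      # hypothetical leftover-usage estimate in [0, 1]
    capped: bool = False


def is_cap_signal(status: int, code: Optional[int] = None) -> bool:
    """True for the cap responses observed this session (402, or 429 carrying 1310)."""
    return status == 402 or (status == 429 and code == 1310)


def pick_lane(lanes: List[Lane]) -> Optional[Lane]:
    """Choose the uncapped lane with the most leftover usage, if any."""
    live = [lane for lane in lanes if not lane.capped and lane.remaining > 0]
    return max(live, key=lambda lane: lane.remaining) if live else None


def run_with_rotation(
    lanes: List[Lane],
    call: Callable[[Lane], Tuple[int, Optional[int], Optional[str]]],
) -> Tuple[str, Optional[str]]:
    """Try lanes in leftover-usage order; mark a lane capped and move on when it caps."""
    while True:
        lane = pick_lane(lanes)
        if lane is None:
            raise RuntimeError("all lanes capped")
        status, code, result = call(lane)
        if is_cap_signal(status, code):
            lane.capped = True
            continue
        return lane.name, result
```

A hidden cap such as the Codex weekly limit has no documented status code in this report, so a real router would need an additional signal for it; the sketch does not guess one.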

3.2 The multi-lane router

Built at donto-extract/src/donto_extract/lanes/. Design principles, all consistent with the no-brittle-logic rule (verified in the source this session):

Test footprint (corrected for honesty). Verified by running the suite this session: the whole donto-extract repo suite is 91 tests passing (pytest tests/ -q → 91 passed); the lanes module's own file tests/test_lanes.py contains 17 tests. The router is not independently "91 tests"; that figure covers the entire repo. Both numbers are now correct in this report.

The rotation strategy is what converts a set of individually-capped lanes into a sustained firehose. None of it changes extraction quality — that's the next section.


4. The 5-way model bake-off

4.1 Read these caveats first

  1. Two different agentic harnesses. Configurations A/B/C ran under opencode; D/E ran under the Codex CLI. These are different agent drivers. Extraction quality (anchor rate, namespace cleanliness) is comparable across them; raw speed and raw count are not like-for-like. Do not read the count column as a clean model ranking.
  2. glm's 3,590 was two merged runs. The glm-4.7@Cerebras context is not one extraction — it is two un-deduped runs concatenated under two near-disjoint namespaces (live: exo: 2,327 + ex: 1,258 facts; only 30 distinct local subject-names appear in both, ~6% overlap). Treat its 3,590 as a merged figure, not a single-run yield. Section 7 decomposes this in full.

4.2 The table (all numbers live-verified 2026-06-04)

| # | Model / config | Harness | Live facts | Subj | Pred | Anchored | Anchor % | Namespaces |
|---|---|---|---|---|---|---|---|---|
| A | gpt-oss-120b @ Cerebras | opencode | 320 | 145 | 95 | 151 | 47.2% | 1 (ex:) |
| B | glm-4.7 @ Cerebras (two merged runs) | opencode | 3,590 | 486 | 1,852 | 2,498 | 69.6% | 5 (dom. exo:+ex:) |
| C | glm-4.7 @ z.ai | opencode | never ran: capped (code 1310) | | | | | |
| D | gpt-5.4 (codex-normal) | Codex CLI | 511 | 119 | 292 | 391 | 76.5% | 1 (ex:) |
| E | gpt-5.3-codex-spark (single run) | Codex CLI | 426 | 78 | 213 | 368 | 86.4% | 1 |

Contexts (live): ctx:test/cerebras-gptoss/23778, ctx:test/cerebras-glm/23778, ctx:test/codex-normal/23778, ctx:test/codex-spark/23778. The z.ai glm lane (ctx:test/zai-glm/23778) has 0 rows — it never executed because the lane was capped, an unintended but instructive demonstration of Section 3.
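As a quick consistency check, the anchor percentages in the table can be recomputed from its own "Live facts" and "Anchored" columns. The snippet below uses only numbers printed in the table; the dictionary layout and the `anchor_pct` helper are illustrative.

```python
# (live facts, anchored) per configuration, copied from the table above.
ROWS = {
    "A gpt-oss-120b @ Cerebras": (320, 151),
    "B glm-4.7 @ Cerebras (two merged runs)": (3590, 2498),
    "D gpt-5.4 (codex-normal)": (511, 391),
    "E gpt-5.3-codex-spark": (426, 368),
}


def anchor_pct(facts: int, anchored: int) -> float:
    """Anchored share of live facts, as a percentage rounded to one decimal."""
    return round(100 * anchored / facts, 1)


if __name__ == "__main__":
    for label, (facts, anchored) in ROWS.items():
        print(f"{label}: {anchor_pct(facts, anchored)}%")
```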

4.3 What the table says


5. The gleaning loop — models self-stop by choice, not budget

5.1 The evidence that "done" is a judgment, not a ceiling

Three baseline single-shot Codex runs on the same source landed at {496, 511, 426} facts. The substantive claim — that the model is satisficing, not running out of budget — is supported two ways:

Either way the operational insight holds: left alone, a frontier model extracts what it considers a reasonable, representative set and stops — it satisfices. For donto's maximal-extraction goal that is a bug, not a feature.

5.2 The fix: xhigh effort + a harness resume-and-re-prompt loop

Two levers, applied together:

  1. model_reasoning_effort = xhigh — push the model to think harder per pass. (Verified in the spark run's rollout: all 10 turn-context efforts were honored as xhigh.)
  2. A harness gleaning loop: codex exec --resume <session-id>, re-prompting each pass with coverage framing: "you missed many; append ≥150 NEW facts; do not repeat what you already emitted." Stop when a pass yields <30 new keys twice in a row (a saturation gate), or on a max-pass cap.

The framing matters: telling the model it missed things and asking for coverage (not "more facts") is what overrides the satisfice instinct.
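The stop rule described above (fewer than 30 new keys twice in a row, or a max-pass cap) can be sketched independently of the Codex CLI. In this sketch `run_pass` is a hypothetical callable standing in for one resume-and-re-prompt pass; it returns the fact keys that pass emitted. The default `max_passes=10` is an assumption, not a value from the report.

```python
from typing import Callable, Iterable, List, Set, Tuple


def glean(
    run_pass: Callable[[int], Iterable[str]],
    min_new: int = 30,      # saturation threshold from the text
    patience: int = 2,      # "twice in a row"
    max_passes: int = 10,   # assumed cap; the report does not state a value
) -> Tuple[Set[str], List[int], str]:
    """Run passes until the new-key curve saturates or the pass cap is hit."""
    seen: Set[str] = set()
    new_per_pass: List[int] = []
    low_streak = 0
    for pass_index in range(max_passes):
        new_keys = set(run_pass(pass_index)) - seen   # only never-seen keys count
        seen |= new_keys
        new_per_pass.append(len(new_keys))
        low_streak = low_streak + 1 if len(new_keys) < min_new else 0
        if low_streak >= patience:
            return seen, new_per_pass, "saturated"
    return seen, new_per_pass, "max_passes"
```

Counting only keys that were not seen before is what makes the gate a coverage measure: a pass that repeats earlier facts contributes nothing to the curve, however many lines it emits.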

5.3 Results (live-verified)

gpt-5.4 glean (ctx:test/codex-glean-smoke/23778):

spark glean (ctx:test/codex-glean-spark/23778):

5.4 Why the gpt-5.4 glean is the right exemplar

511 → 1,915 is the cleanest "maximize meaningful yield" result in the whole set: a 3.7× count lift that simultaneously raised anchor coverage to 99.1%, kept a single clean namespace, and produced zero exact-duplicate triples. That is meaningful abundance — more real anchored claims, not padding. It is the pattern the engine should default to.


6. Count is the wrong target — coverage and saturation are right

6.1 The cautionary tale

When the loop was instead given a raw count floor ("emit thousands"), a gpt-5.4 run obeyed — and degenerated into noise. The predicate tail collapsed into trivially-true string-property assertions — predicates like answerWordCount, containsCommaCharacter, containsApostropheCharacter, and (this source carries lat/long) coordinateMentionsLatitudeLabel. These are not knowledge about the entity; they are the model manufacturing filler to hit a number.

Caveat (honesty: this one is NOT a live DB context). This padded run was not retained in live donto_statement (consistent with it being a rejected output). So its raw line count (recorded in run-logs at roughly 7,500–8,400 valid lines / ~560 distinct predicates) and the specific garbage-predicate names are harness-log / on-disk observations, not a queryable context; they are presented as observed-during-the-run, and I do not claim a precise live figure for it. The contrast it illustrates, however, rests entirely on live data: the disciplined spark glean's clean profile (89.8% camelCase, 0.00% exact-dup, single namespace; all verified above) is in the database.

6.2 The principle

Abundance ≠ noise. A count floor optimizes the wrong objective and the model games it. The right target is exhaustive meaningful coverage, with saturation deciding "done" — the operator's own framing: "not an absolute number, just everything possible."

The two glean runs make this concrete:

The operational rule that falls out: drive to saturation (a falling new-key curve), reject count floors, and let the predicate-cleanliness profile (camelCase share, exact-dup rate, namespace count) be a live quality monitor.
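The three monitor signals named above can be computed from a list of emitted triples. This is a minimal sketch under assumptions: facts are `(subject, predicate, object)` string tuples with `prefix:local` names, camelCase share is taken per fact rather than per distinct predicate, and namespaces are counted on subjects. The report does not specify these details, and the function name `cleanliness_profile` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Assumed definition of camelCase: lowercase start, then capitalized word parts.
CAMEL_CASE = re.compile(r"^[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*$")


def cleanliness_profile(triples: List[Triple]) -> Dict[str, float]:
    """Return camelCase share, exact-duplicate rate, and subject namespace count."""
    total = len(triples)
    if total == 0:
        return {"camel_case_share": 0.0, "exact_dup_rate": 0.0, "namespace_count": 0}
    local_predicates = [p.split(":", 1)[-1] for _, p, _ in triples]
    camel = sum(1 for p in local_predicates if CAMEL_CASE.match(p))
    duplicates = total - len(set(triples))        # repeats of an identical triple
    namespaces = {s.split(":", 1)[0] for s, _, _ in triples if ":" in s}
    return {
        "camel_case_share": camel / total,
        "exact_dup_rate": duplicates / total,
        "namespace_count": len(namespaces),
    }
```

Note that a filler predicate such as containsCommaCharacter is itself valid camelCase, so these signals flag structural drift only; they do not by themselves detect the semantic padding described in 6.1.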


7. The "why 7×" decomposition — measured

glm's 3,590 vs codex-normal's 511 is a naive 7.0×. Decomposed against live data, the real unique-knowledge multiple is ~1.4–1.7×. The factors (DB-verified except the one row explicitly marked):

| Factor | Multiplier | Evidence |
|---|---|---|
| Harness: two merged runs | ~1.9× | glm context is two un-deduped runs: exo: 2,327 + ex: 1,258 facts (live); only 30 of ~483 distinct local subject-names appear in both namespaces (~6% overlap → near-disjoint, never deduped) |
| Granularity / redundancy | ~1.5× | 31.8% of glm facts (1,140 / 3,590, live) sit on subjects whose local name exists in both ex: and exo:, i.e. the same entity described twice under two IRIs. Plus 18.5% (663 / 3,590, live) bare true/1/yes flag restatements |
| Q&A reification | ~1.1× | The source is a Q&A transcript; glm reified questions/answers as extra subjects |
| Real extra source coverage | ~1.45× | Span-union analysis: glm touched ~93.3% of source characters vs codex ~64.6% (analysis-script figure, not recomputable from DB counts alone; see §7.1) |

Net: ≈ 1.9 × 1.5 × 1.1 × 1.45 ≈ 4.5× of "apparent" advantage is artifact; the honest unique-knowledge multiple is ~1.4–1.7×. (These four factors are not cleanly orthogonal, so the decomposition is an estimate, not an identity — but every input number is sourced.)
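The arithmetic behind these figures is easy to reproduce. The snippet uses only numbers stated in this section. The report does not spell out how the factors are combined into the ~1.4–1.7× range, so the last two quantities are offered only as consistency checks, not as the author's derivation.

```python
import math

# Multipliers from the decomposition table in this section.
FACTORS = {
    "merged runs": 1.9,
    "granularity / redundancy": 1.5,
    "q&a reification": 1.1,
    "real extra source coverage": 1.45,
}

naive_multiple = 3590 / 511                    # glm facts vs codex-normal facts
factor_product = math.prod(FACTORS.values())   # the product quoted in the text
coverage_ratio = 93.3 / 64.6                   # span-union coverage, glm vs codex
naive_over_product = naive_multiple / factor_product

if __name__ == "__main__":
    print(f"naive multiple      : {naive_multiple:.2f}x")
    print(f"product of factors  : {factor_product:.2f}x")
    print(f"coverage ratio      : {coverage_ratio:.2f}x")
    print(f"naive / product     : {naive_over_product:.2f}x")
```

Both the coverage ratio and the naive-over-product quotient fall inside the ~1.4–1.7× band the section reports.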

7.1 Two precisions that matter for honesty

7.2 The lesson

Raw counts are not comparable across harnesses. glm's "7×" is mostly merged-run concatenation plus dual-namespace identity redundancy — exactly the kind of thing the substrate reconciles at query time. Meanwhile Codex stops by choice, not budget, and its single disciplined namespace means multi-session UNION dedups cleanly — no ex:/exo: identity-duplication tax to pay later. The right target remains exhaustive meaningful coverage judged by saturation, with identity and typing reconciled downstream — donto's emit-free / defer-joining thesis, validated by the numbers.
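The dual-namespace redundancy figure in the decomposition (facts sitting on subjects whose local name exists under both ex: and exo:) is the kind of measurement sketched below. The triple layout and the function names are assumptions for illustration; the report's own figure was computed against donto_statement, not with this code.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def split_iri(iri: str) -> Tuple[str, str]:
    """Split 'exo:mr-watts' into ('exo', 'mr-watts')."""
    prefix, _, local = iri.partition(":")
    return prefix, local


def dual_namespace_share(
    facts: List[Triple], ns_a: str = "ex", ns_b: str = "exo"
) -> Tuple[int, float]:
    """Return (count of dual-namespace local names, share of facts on those subjects)."""
    namespaces_by_local: Dict[str, Set[str]] = defaultdict(set)
    for subject, _, _ in facts:
        prefix, local = split_iri(subject)
        namespaces_by_local[local].add(prefix)
    dual = {
        local for local, seen in namespaces_by_local.items() if {ns_a, ns_b} <= seen
    }
    on_dual = sum(1 for subject, _, _ in facts if split_iri(subject)[1] in dual)
    share = on_dual / len(facts) if facts else 0.0
    return len(dual), share
```

Grouping by local name is the simplest possible identity signal; it is used here only to show where the redundancy comes from, not as a proposal for how query-time alignment should work.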


8. Forward: always-on post-hoc citing (emerging — honest)

The gleaning loop maximizes yield; anchoring is a separate quality axis, and the cleanest way to handle it is to separate extraction from anchoring — let the model extract freely, then run an always-on post-hoc citer that locates a supporting span for each emitted fact.

v1 result (verified two ways). The citer lifted the spark glean's anchoring substantially. Measured against the citer's own input fact file, it raised anchor coverage 47.1% → 90.2%. Measured the donto-native way — distinct live self-anchored statements in donto_statement/donto_evidence_link — the re-ingested context ctx:test/spark-cited/23778 (3,229 facts; the re-ingest added 2 vs the 3,227 source) goes from 50.0% → 91.9% anchored (2,969 / 3,229, live-verified). (The two bases — input-file anchor rate vs DB self-anchor rate — differ slightly; both are reported so neither is cherry-picked.) Either way it is a large, real coverage win, and it confirms the architectural bet: anchoring as a downstream stage works.

v1 problem (honest, do not overclaim). An adversarial audit found that a locatable citation is not always a supporting one. On a small, deliberately relational-skewed sample, roughly 40–46% of the recovered relational citations were wrong: the citer found a span that mentions an entity but does not support the asserted relation (e.g. question-34 askedBy mr-watts anchored to a different question's "By Mr. WATTS:" line because both share the token watts). The ~40% figure is documented in the v2-citer source header; the ~46% figure comes from an n=30 adversarial-judge sample (relational subset n≈26, ~12 wrong) recorded in the run-logs. This is a worst-case adversarial probe on a curated relational sample, not a population rate: most facts are simpler attributive/literal claims the citer handles correctly (those anchored fine in v1). But it is a genuine correctness gap, and the exact rate is a run-log/audit figure, not a DB-verified population statistic.

v2 (in progress). An instance-aware co-location gate (real, in-tree at cite_facts_v2.py): classify each fact structurally by object type — literal objects keep the v1 lexical layer; IRI (relational) objects must find a span that co-locates BOTH endpoints (the subject's distinguishing token, weighted by inverse batch document-frequency — IDF computed from the data, never a hand-maintained stopword/synonym list — AND the object's value token) in the same window, else route to semantic, else mark unanchorable rather than attach a plausible-looking neighbour. A wrong span is worse than none. This is emerging work, not a settled result — its task is in progress this session and no precise post-fix correctness number is claimed here.
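The co-location idea for relational facts can be sketched as follows. This is not the code in cite_facts_v2.py: every function name here is hypothetical, the "value token" is assumed to be the object's last token, the window is measured in characters, and the semantic fallback route is omitted. IDF is computed from the batch of subject names, with no stopword list.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokens(text: str) -> List[str]:
    return [t.lower() for t in TOKEN.findall(text)]


def batch_idf(subjects: List[str]) -> Dict[str, float]:
    """Inverse document frequency of each token across the batch of subject names."""
    df = Counter(t for s in subjects for t in set(tokens(s)))
    return {t: math.log(len(subjects) / count) for t, count in df.items()}


def distinguishing_token(subject: str, idf: Dict[str, float]) -> str:
    """The subject token that is rarest in the batch (highest IDF)."""
    return max(tokens(subject), key=lambda t: idf.get(t, 0.0))


def colocated_span(
    source: str, subj_tok: str, obj_tok: str, window: int = 120
) -> Optional[Tuple[int, int]]:
    """Tightest character span containing both tokens, if it fits in the window."""
    subj_hits, obj_hits = [], []
    for match in TOKEN.finditer(source):
        tok = match.group().lower()
        if tok == subj_tok:
            subj_hits.append(match.span())
        if tok == obj_tok:
            obj_hits.append(match.span())
    best = None
    for s0, s1 in subj_hits:
        for o0, o1 in obj_hits:
            start, end = min(s0, o0), max(s1, o1)
            if end - start <= window and (best is None or end - start < best[1] - best[0]):
                best = (start, end)
    return best


def anchor_relational(source, subject, obj, idf, window=120):
    """Anchor only when both endpoints co-locate; otherwise report unanchorable."""
    obj_tok = tokens(obj)[-1]  # assumed "value token": last token of the object
    span = colocated_span(source, distinguishing_token(subject, idf), obj_tok, window)
    return ("anchored", span) if span else ("unanchorable", None)
```

On a transcript where question 33 is followed by "By Mr. WATTS:" and question 34 by a different examiner, the gate anchors a question-33 fact to the nearby line and leaves the question-34 fact unanchored when the two tokens are farther apart than the window, which is the behavior the v1 audit called for.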

The takeaway for the architecture: anchor as a separate, auditable stage; measure not just coverage but correctness; and treat "locatable ≠ supporting" as a first-class problem.


9. What this means for the substrate

Operationalizing generative abundance comes down to four engineering commitments, each now backed by live measurement:

  1. Maximize meaningful coverage, not count. Models satisfice and self-stop while still producing (the spark glean was usage_capped: false with a 616-new-key final pass). Drive them to saturation with xhigh effort + a resume/re-prompt gleaning loop and coverage framing. The exemplar: 511 → 1,915 facts at 99.1% anchored, single namespace, 0 exact-dups. Reject count floors — they produce containsCommaCharacter-grade noise.

  2. Anchor as a separable stage. Extraction and citation are different quality axes. An always-on post-hoc citer lifted anchoring 50.0% → 91.9% (DB-verified) — but "locatable ≠ supporting," so the citer needs a co-location correctness gate (v2, in progress). Measure coverage and correctness.

  3. Rotate providers to stay alive. Every flat-sub/prepaid lane caps (z.ai 1310, Cerebras 402, Codex weekly + IP risk). A declarative, pool-aware multi-lane router (91-test repo, 17 lane-specific tests) converts individually-capped lanes into a sustained firehose. This is purely an availability mechanism — it does not alter quality, and it is a budget-research expedient, not a TOS-clean production backend (§3.1).

  4. Defer alignment and typing to query time. Raw counts are not comparable across harnesses, and most of glm's "7×" advantage is identity redundancy (31.8% of facts on dual-IRI subjects, two merged runs) — exactly what query-time identity alignment is designed to fold. Emit free, untyped, multi-namespace now; reconcile by similarity later. Do not maintain synonym tables or namespace-merge maps by hand.

The data demonstrates the abundance thesis three ways: the gleaning loop raises yield 3.7× while raising anchor coverage to 99.1% and keeping one clean namespace (meaningful, not padded); the saturation principle beats a count floor (clean spark glean > degenerate count-padding); and the why-7× decomposition shows the substrate's deferred-joining design is not a hedge but the right place to absorb the identity/typing tax that abundant generation necessarily produces.

Generation is cheap. The engineering — and the win — is in maximizing meaningful, anchored coverage and pushing everything else to query time. This is n=1, directional: one source, one session. The next step is to repeat it across many sources, domains, and models before any of these multipliers is treated as a constant.


Appendix — verification method

All live counts: donto_statement WHERE upper(tx_time) IS NULL with exact context = equality on donto-pg, 2026-06-04. Anchored = distinct live statements whose statement_id is the source of ≥1 live donto_evidence_link (upper(tx_time) IS NULL), via LEFT JOIN on a distinct-statement_id subquery. statement_timeout 120–300 s. Read-only SELECTs only; donto invariant I3 (no destructive overwrite) honored; nothing mutated. The 91-test suite was re-run this session (pytest tests/ -q → 91 passed); the lanes registry/cap/CLI claims were read from donto-extract/src/donto_extract/lanes/{registry,caps,cli}.py.
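The "anchored" definition above can be restated over in-memory rows, which makes the counting rule explicit. This is a sketch, not the query that was run: rows are plain dictionaries, and the key `tx_upper` is a hypothetical stand-in for upper(tx_time), with `None` meaning the row is live.

```python
from typing import Dict, List, Tuple


def anchor_rate(
    statements: List[Dict], links: List[Dict], context: str
) -> Tuple[int, int, float]:
    """Return (live statements, anchored statements, anchored percentage) for one context."""
    live_ids = {
        s["statement_id"]
        for s in statements
        if s["context"] == context and s["tx_upper"] is None   # exact match, live only
    }
    linked_ids = {l["statement_id"] for l in links if l["tx_upper"] is None}
    anchored = live_ids & linked_ids   # distinct statements with at least one live link
    pct = 100 * len(anchored) / len(live_ids) if live_ids else 0.0
    return len(live_ids), len(anchored), pct
```

Using sets mirrors the distinct-statement_id subquery: a statement with several evidence links is still counted once.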

Figures that are NOT recomputable from donto_statement and are labelled so in-text: per-pass new-key counts, wall times, and the saturation/usage_capped flags come from glean-spark.log and glean_smoke.out (run-logs preserved on the box); per-run reasoning-token budgets (incl. the ~587-token datum) come from run_codex_glean.py run-logs and are directional; the source-character-coverage percentages (93.3% / 64.6%) come from a span-union analysis script that needs source text + span offsets; the padded-run line counts and garbage-predicate names are on-disk/log observations of a run that was not retained in the DB; and the adversarial mis-anchor rate (~40–46%) comes from the v2-citer source header plus an n=30 audit log. Every other number in this report is DB-verified.


Companion reports: donto — The Substrate for Generative Abundance (canonical vision) · Driving Frontier LLMs to Extract Maximally and Sustainably · The Cerebras / Codex Bake-off (provider economics + 5-way detail) · The donto Extraction System