genes.apexpots.com / research source: donto-abundance-appendix-2026-06-02.md

donto — Generative Abundance: Research Appendix (2026-06-02)

donto — Substrate for Generative Abundance: Research Appendix

Companion to the iteration-4 unified report. Structured output of the 5 forward-looking research deep-dives (2026-06-02).

frontier-llm-generative-decomposition

The founder's thesis, "a frontier LLM can emit an immeasurable amount of typed properties in any direction about a thing, with guidance", is no longer a metaphor in 2025-2026; it is a measured, replicated engineering result. The cleanest proof is GPTKB: pointed at a single CHEAP model (GPT-4o-mini) and asked to recursively elaborate entities, it materialized 105M triples over 2.9M entities using 2,133 distinct relations and 367 classes (roughly 36 typed properties per entity on average) for $0.00009 per correct triple (https://arxiv.org/html/2411.04920v1). The follow-up GPTKB v1.5 repeated the exercise on a frontier model, eliciting ~100M beliefs from GPT-4.1 (https://arxiv.org/abs/2510.07024). The decisive part for the thesis is not the volume but the DIRECTIONS: the model invented predicates no human schema had, such as historicalSignificance (270K triples), hasArtStyle (11K) and hobbies (30K), and 69.5% of the entities it described were NOT in Wikidata at all. The scarce, human-bottlenecked step in every prior knowledge system (knowledge engineers in Cyc, predefined attributes in formal concept analysis) is the exact step a guided LLM now does at near-zero cost, and it does it along axes nobody pre-declared. "Essentially unbounded multi-directional emission" is a fair characterization: the supply of typed properties and the supply of NEW predicates are both effectively elastic now.

The abundance compounds when you let the model invent the axes themselves. AutoSchemaKG built a 900M+ node, 5.9B edge graph (ATLAS) from 50M+ documents with ZERO predefined schema — the LLM induced entity/event/concept types on the fly and hit 95% semantic alignment with human-crafted schemas with no manual intervention, while preserving 93-97% of source-passage information (https://arxiv.org/html/2505.23628). Strikingly, that was achieved with a small Llama-3-8B extractor, which means the ceiling under a frontier model (GPT-5.5 at 1M-token context, Claude Opus 4.5) is far higher than these papers measured. This is the "lens engine" intuition made quantitative: decompose along attributes, parts, functions, causes, counterfactuals, comparisons, AND emergent axes, and a single entity yields not one row but a fan of dozens-to-hundreds of typed claims plus brand-new predicate types — exactly the abundance donto is built to hold rather than collapse.

The honest counterweight — and the reason donto's posture is right rather than naive — is that emission breadth is now cheap but per-claim PRECISION is not automatic. GPTKB v1.5's own headline finding is that GPT-4.1's factual accuracy is "significantly lower than indicated by previous benchmarks," with inconsistency, ambiguity and hallucination as the main failure modes (https://arxiv.org/abs/2510.07024). On precision-critical typed extraction, GPT-5 reaches only 0.616 F1 five-shot on ChemProt relation extraction — about 12 points behind fine-tuned SOTA — and still trails specialists on disease NER, while leading on chemical NER and reasoning-heavy QA (https://arxiv.org/abs/2509.04462). So the 2025-2026 reality is asymmetric: GENERATION breadth is unbounded and nearly free; per-triple TRUTH is good and improving but not guaranteed. This is precisely the volume-reconciliation the founder already drew — maximize at the typed-extraction layer (where breadth wins), gate at the relationship/belief layer (where truth must be earned by evidence). It also validates donto as the natural home: a vector DB or normal KG must dedup and pick a winner; DEG-RAG shows LLM-built graphs need ~70% entity reduction to be usable in a collapsing store (https://arxiv.org/html/2510.14271v1). donto doesn't collapse — it holds the contradictory firehose as legal bitemporal state and prunes by reality (evidence + re-ranking), which is the only architecture that lets you keep ALL of generation's abundance instead of throwing 70% of it away.

On the "immeasurable directions" claim specifically: the strongest evidence is internal, not behavioral. Anthropic's sparse-autoencoder work extracted 34 million distinct interpretable features (concepts/directions) from the residual stream of a SINGLE mid-size model, Claude 3 Sonnet, with ~12M alive (https://transformer-circuits.pub/2024/scaling-monosemanticity/). The number of latent axes along which a frontier model can characterize a thing is measured in the tens of millions and grows with scale — so when the founder says "any direction," the substrate inside the model genuinely has tens of millions of them. The engineering task is not to manufacture directions (they exist in superabundance) but to ELICIT the useful ones with guidance and ANCHOR each emitted claim to evidence. That reframes donto from "a place to store extracted facts" to "the only substrate that can absorb generative abundance at full bandwidth and let reality, not a dedup heuristic, decide what survives."

What's newly possible:

Pointing a single guided LLM at one entity and getting a FAN of typed claims plus invented predicates — GPTKB averaged 36 triples/entity across 2,133 relations from GPT-4o-mini alone, and the model coined axes Wikidata never had (historicalSignificance, hasArtStyle, hobbies). For donto's flagship: emit skills/roles/seniority/trajectory PLUS latent traits and freshly-invented career-axis predicates per resume, not a fixed feature list.
Schema-FREE construction at web scale: AutoSchemaKG built 900M+ nodes / 5.9B edges from 50M docs with the LLM inducing all types on the fly, 95% aligned to human schemas with zero manual intervention. You no longer need to pre-author an ontology before extraction — the substrate's predicate set can grow itself.
Treating emitted predicates as a first-class growing vocabulary: because models invent new relation types per-pass, donto can let ctx:* predicate inventories EXPAND as a feature, then reconcile/align them later (ESCO/O*NET/Lightcast for jobs) rather than forcing every claim into a frozen schema at write time.
Holding the full firehose without collapsing: prior LLM-KG pipelines must do ~70% entity reduction (DEG-RAG) to stay usable. A paraconsistent bitemporal store can ingest 100% of generated abundance, keep contradictions as legal state, and let evidence/re-ranking prune — abundance becomes an asset instead of noise to be discarded.
Multi-directional relationship discovery at intersections: with breadth this cheap, you can decompose two entities along dozens of axes each and surface relationships at the crossings (skill adjacency, hidden career paths, candidate↔︎role bridges) that no human drew — 'machine serendipity that accumulates,' now with a per-claim cost of ~$0.0001.
1M-token context frontier models (GPT-5.5) mean an entire corpus (a person's full resume history + a job family + ESCO branch) fits in ONE decomposition pass, so cross-entity property emission and relationship hypotheses can be generated jointly rather than stitched from chunks.
Closing the loop with built-in ground truth: because emission is cheap and donto re-ranks on new evidence, the jsonresume→jobs flagship can treat got-interviewed/hired as live labels that continuously re-weight which emitted properties and relationship hypotheses actually predict outcomes.
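
The "fan of typed claims" idea in the list above can be sketched as a loop over lenses. This is a minimal illustration, not donto's implementation: the `Claim` fields, the `decompose` function and the `fake_llm` stub are hypothetical names, and the stub stands in for a real guided LLM call. The lens names follow the axes listed earlier in this section.

```python
from dataclasses import dataclass

# Lens names taken from the text; "emergent_axes" is the open lens for invented predicates.
LENSES = ["attributes", "parts", "functions", "causes",
          "counterfactuals", "comparisons", "emergent_axes"]

@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    value: str
    lens: str                        # provenance: which lens produced the claim
    status: str = "hypothesis_only"  # every emitted claim starts ungated

def decompose(entity, emit_for_lens):
    """Run every lens over one entity and keep all results, with no dedup."""
    claims = []
    for lens in LENSES:
        for predicate, value in emit_for_lens(entity, lens):
            claims.append(Claim(entity, predicate, value, lens))
    return claims

def fake_llm(entity, lens):
    # Hypothetical stub standing in for a guided LLM call.
    canned = {
        "attributes": [("hasSkill", "Python"), ("seniority", "senior")],
        "emergent_axes": [("mentorshipStyle", "hands-on")],
    }
    return canned.get(lens, [])

fan = decompose("resume:42", fake_llm)
for claim in fan:
    print(claim.lens, claim.predicate, claim.value, claim.status)
```

Recording the lens on each claim keeps provenance cheap: a later re-ranking step can ask which lenses produce claims that survive evidence.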

Measurable signals:

GPTKB: 105M triples / 2.9M entities / 2,133 relations / 367 classes from GPT-4o-mini = ~36 typed properties per entity, $0.00009 per correct triple, 27 hours total runtime
GPTKB novelty: 69.5% of subjects are NOT in Wikidata; only 24% have exact Wikidata matches — the model emits along axes structured KBs never modeled
GPTKB v1.5: ~100M beliefs recursively elicited from GPT-4.1 (single frontier model); accuracy 'significantly lower than prior benchmarks' (the honest precision gap)
AutoSchemaKG / ATLAS: 900M+ nodes (937.3M in ATLAS-CC), 5.9B edges, 241M entities + 696M events + 31M concepts, from 50M+ docs; 95% schema alignment with zero manual intervention; 93-97% passage-information preservation; +12-18% multi-hop QA, +9% factuality (FELM) — achieved with only Llama-3-8B as extractor
Triple-extraction quality in AutoSchemaKG: precision 95-99%, recall 85-94%, F1 89-96% (LLM-as-judge, DeepSeek-V3)
GPT-5 biomedical (the precision gate): ChemProt RE 0.616 F1 five-shot (~12 pts behind fine-tuned SOTA); leads on chemical NER and reasoning QA; trails specialists on disease NER
Internal directions: 34M interpretable SAE features from ONE model (Claude 3 Sonnet), ~12M alive — the literal measured count of distinct axes a single model represents
Faithfulness is engineerable: VeriFY reduces factual hallucination 9.7-53.3% with only 0.4-5.7% recall loss via structured verification traces; SyRACT improves biomedical RE F1 by 11-41% over standard prompting
Collapsing-store tax avoided: DEG-RAG needs ~70% entity reduction for LLM-built KGs to perform — the abundance a paraconsistent substrate gets to keep

Concrete for donto:

Build the cross-entity GENERATOR as a multi-lens decomposition pass: for each entity, run guided prompts along a standing lens set (attributes, parts, functions, causes, counterfactuals, comparisons, analogies) PLUS an open 'invent new axes' lens; emit each result as a typed hypothesis_only claim with the lens recorded as provenance. Target: ≥30 typed claims/entity at ≥90% human-judged faithfulness on a 200-entity audit.
Make the predicate inventory a first-class GROWING object per ctx:*. Let new LLM-invented predicates land as hypothesis_only predicate-types, then align them to ESCO/O*NET/Lightcast (for jobs) at re-rank time, not write time. Measure: % of emitted predicates auto-aligned vs net-new, and downstream lift from keeping net-new ones.
Adopt the founder's two-layer gate explicitly in code: extraction layer maximizes breadth (no dedup, all variants kept as identity-as-hypothesis), relationship/belief layer requires ≥1 evidence edge before a claim leaves hypothesis_only. This mirrors GPTKB's breadth + AutoSchemaKG's precision split and avoids DEG-RAG's 70% throwaway.
Wire VeriFY-style verification traces into opencode_extract: each emitted claim carries a self-generated probing question + consistency judgment; route low-confidence claims to stay hypothesis_only. Falsifiable target: cut hallucinated claims 30%+ with <6% recall loss, matching VeriFY's published band.
For the jsonresume→jobs flagship: per resume, emit explicit + INFERRED/implicit skills + latent traits + identity variants as separate typed claims anchored to ESCO/O*NET; per job, emit required/nice-to-have/implied competencies; discover candidate↔︎role relationships at the lens intersections; rank with got-interviewed/hired as the re-ranking signal. Baseline to beat: embedding-cosine match; metric: interview/hire precision@k.
Stress-test the paraconsistent claim: ingest a deliberately contradictory entity (a resume with conflicting dates + a job with conflicting seniority) and confirm donto holds both readings as legal bitemporal state with supports/rebuts edges, then re-ranks correctly when one is evidenced. This is donto's differentiator vs a vector DB that must pick.
Exploit 1M-token context: do whole-corpus single-pass decomposition (full resume history + a job family + relevant ESCO subtree) so cross-entity properties and relationship hypotheses are generated jointly. Measure relationship-discovery recall vs the chunked baseline.
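
The two-layer gate and the contradiction stress test from the list above can be sketched with a toy in-memory store. This is a sketch under stated assumptions: `ClaimStore`, the `"evidenced"` status name and the `MIN_EVIDENCE_EDGES` constant are hypothetical, and the bitemporal dimension is deliberately left out.

```python
from dataclasses import dataclass, field

MIN_EVIDENCE_EDGES = 1  # the ">=1 evidence edge" rule from the text

@dataclass
class Claim:
    subject: str
    predicate: str
    value: str
    status: str = "hypothesis_only"
    evidence: list = field(default_factory=list)  # ids of supporting evidence edges

class ClaimStore:
    """Toy append-only store: contradictory claims coexist, nothing is merged."""

    def __init__(self):
        self.claims = []

    def emit(self, subject, predicate, value):
        # Extraction layer: keep every variant, no dedup.
        claim = Claim(subject, predicate, value)
        self.claims.append(claim)
        return claim

    def attach_evidence(self, claim, evidence_id):
        # Belief layer: a claim leaves hypothesis_only only once it is evidenced.
        claim.evidence.append(evidence_id)
        if len(claim.evidence) >= MIN_EVIDENCE_EDGES:
            claim.status = "evidenced"

    def readings(self, subject, predicate):
        """Return all readings for a slot, best-evidenced first."""
        hits = [c for c in self.claims
                if (c.subject, c.predicate) == (subject, predicate)]
        return sorted(hits, key=lambda c: len(c.evidence), reverse=True)

store = ClaimStore()
early = store.emit("resume:42", "startDate", "2019-03")
late = store.emit("resume:42", "startDate", "2020-01")   # conflicting reading
store.attach_evidence(late, "doc:offer-letter")          # hypothetical evidence id
ranked = store.readings("resume:42", "startDate")
```

Both readings stay in the store; evidence changes only the ranking and the status of one claim, which is the behaviour the stress test is meant to confirm.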

Honest constraints:

Per-claim precision is not free even when breadth is: GPT-5 hits only 0.616 F1 on ChemProt RE and GPTKB v1.5 reports accuracy below prior benchmarks. CHALLENGE: gate at the relationship/belief layer with evidence edges; target ≥90% faithfulness on audited claims while keeping 100% of emitted breadth as hypothesis_only.
Invented predicates create vocabulary sprawl and near-duplicate axes (hasHobby vs hobbies vs interests). CHALLENGE: run periodic predicate-alignment to ESCO/O*NET/Lightcast at re-rank time; measure net-new vs aligned ratio and prune only predicates with zero downstream lift.
Emitted entities collide and fork (GPTKB found 69.5% of its entities to be novel vs Wikidata, but many of those are variants of one another); without dedup you get sprawl, with dedup you lose abundance. CHALLENGE: use donto's identity-as-hypothesis + query-time resolution lenses so merging is non-destructive and reversible, which is the architectural answer to DEG-RAG's 70%-throwaway dilemma.
Cost and latency of multi-lens decomposition at scale: emitting 30+ claims/entity across many lenses multiplies token spend. CHALLENGE: GPTKB shows $0.00009/correct-triple is achievable; budget a per-entity ceiling, use cheaper models for breadth and a frontier model only for the verification/relationship pass.
Self-verification has limits — models are over-confident on plausible falsehoods. CHALLENGE: prefer EXTERNAL grounding (source-anchored evidence edges, the jsonresume got-interviewed/hired signal) over pure self-consistency; treat self-verification (VeriFY-style) as a first filter, not the arbiter.
The cross-entity relationship GENERATOR is still the unbuilt piece — emitting properties per-entity is solved, but generating and ranking relationship hypotheses ACROSS entities at scale is the frontier. CHALLENGE: exploit 1M-token context for joint multi-entity passes; first falsifiable milestone is relationship-discovery recall vs an embedding-cosine baseline on the jobs flagship.
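
The vocabulary-sprawl constraint above (hasHobby vs hobbies vs interests) can be illustrated with a crude surface-form grouping. This is only a sketch: `normalize` and `group_predicates` are hypothetical helpers, and the rule set is deliberately naive. Note that it groups hasHobby with hobbies but cannot see that interests is a semantic neighbour; that step would need the taxonomy alignment the text describes.

```python
import re
from collections import defaultdict

def normalize(predicate):
    """Crude surface key: drop a leading has/is, lowercase, singularize naively."""
    key = re.sub(r"^(has|is)(?=[A-Z])", "", predicate).lower()
    if key.endswith("ies"):
        key = key[:-3] + "y"
    elif key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key

def group_predicates(predicates):
    """Group predicate names by surface key; the original names are all kept."""
    groups = defaultdict(list)
    for name in predicates:
        groups[normalize(name)].append(name)
    return dict(groups)

groups = group_predicates(["hasHobby", "hobbies", "interests", "hasArtStyle"])
print(groups)
```

Because the grouping is a mapping over names rather than a rewrite of stored claims, it stays non-destructive and can be recomputed when the rules change.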

Examples / systems:

GPTKB (Hu, Ghosh, Weikum — MPI) — Recursively elicited a 105M-triple / 2.9M-entity / 2,133-relation / 367-class KB from GPT-4o-mini at $0.00009/correct-triple; 69.5% of entities novel vs Wikidata; invented predicates like historicalSignificance and hasArtStyle — direct proof of unbounded multi-directional emission [105M triples, 36 triples/entity avg, 2,133 distinct relations] https://arxiv.org/html/2411.04920v1
GPTKB v1.5 / Mining the Mind — ~100M beliefs recursively elicited from GPT-4.1; the honest counterweight — breadth is enormous but per-belief accuracy is below prior benchmarks, inconsistency/ambiguity/hallucination are the main failure modes [~100M beliefs from one frontier model] https://arxiv.org/abs/2510.07024
AutoSchemaKG / ATLAS — Schema-FREE web-scale KG: LLM induces all entity/event/concept types on the fly; 95% alignment to human schemas with zero manual intervention; built with only Llama-3-8B, so frontier ceiling is far higher [900M+ nodes, 5.9B edges, 95% schema alignment, +9% factuality] https://arxiv.org/html/2505.23628
Scaling Monosemanticity (Anthropic) — Extracted 34M distinct interpretable features (concept-directions) from a single model's residual stream — the measured count of axes along which one model can characterize a thing [34M features, ~12M alive, in one mid-size model] https://transformer-circuits.pub/2024/scaling-monosemanticity/
Benchmarking GPT-5 for biomedical NLP — The precision gate: GPT-5 strong on chemical NER and reasoning QA but 0.616 F1 on ChemProt RE (~12 pts behind fine-tuned SOTA) — shows emission breadth outruns guaranteed per-triple truth, justifying donto's relationship-layer gating [ChemProt RE 0.616 F1 five-shot] https://arxiv.org/abs/2509.04462
DEG-RAG (Denoising KGs for RAG) — Collapsing stores need ~70% entity reduction to make LLM-built KGs usable — quantifies the abundance a paraconsistent substrate gets to KEEP instead of discard [up to 70% entity reduction without performance loss] https://arxiv.org/html/2510.14271v1
VeriFY (factual self-verification) — Structured verification traces cut factual hallucination 9.7-53.3% with only 0.4-5.7% recall loss — shows faithfulness is an engineerable knob, not a fixed property, for donto's verification layer [9.7-53.3% hallucination reduction, <6% recall loss] https://arxiv.org/pdf/2602.02018

economics-and-measurement-of-abundance

The scarce step in every prior knowledge system was generating typed properties and relations: Cyc paid knowledge engineers per assertion, literature-based discovery rode co-occurrence stats, formal concept analysis needed hand-defined attributes. That bottleneck is now an economic non-event. A guided frontier LLM emits hundreds of self-validated typed facts per source in minutes (donto's own pipeline: a one-sentence "Pandoc" input -> 483 valid ingested facts in ~4.7 min on a flat-rate coding subscription), and inference cost for a fixed capability has fallen ~10x/year for three straight years (a16z "LLMflation"; GPT-3-quality dropped from $60/M tokens in Nov-2021 to ~$0.06/M by Nov-2024, a ~1000x decline). Epoch AI puts the per-task decline at a median 50x/year (range 9x-900x), accelerating to a median 200x/year on post-Jan-2024 data alone. The crossover has already happened: at DeepSeek V4-Pro rates ($0.44/$0.87 per M input/output tokens, made permanent May-2026) or Gemini 2.5 Flash ($0.30/$2.50), decomposing an entity along dozens of directions costs at most single-digit cents. Concretely: emitting ~500 typed properties for one entity (~5K output tokens) costs ~$0.004-0.04. "Generate everything about everything" is no longer a thought experiment; it is a line item you can budget. This is donto's foundational tailwind: generation abundance is now cheaper than the human curation it replaces by 3-4 orders of magnitude.
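
The per-entity figure is plain token arithmetic, which the following sketch makes explicit using the output prices quoted above. It is a simplification: `output_cost_usd` is a hypothetical helper, and input tokens, retries and verification passes are ignored, which is one reason the quoted range extends above these output-only numbers.

```python
def output_cost_usd(output_tokens, usd_per_million_tokens):
    """Output-token cost only; input tokens and retries are ignored."""
    return output_tokens / 1_000_000 * usd_per_million_tokens

TOKENS_PER_ENTITY = 5_000  # ~500 typed properties, per the text

deepseek_cost = output_cost_usd(TOKENS_PER_ENTITY, 0.87)  # about $0.004
gemini_cost = output_cost_usd(TOKENS_PER_ENTITY, 2.50)    # about $0.0125

# GPT-3-quality price drop quoted from a16z: $60/M -> $0.06/M over three years.
total_decline = 60 / 0.06             # about 1000x
per_year = total_decline ** (1 / 3)   # about 10x per year

print(f"DeepSeek V4-Pro: ${deepseek_cost:.5f} per entity")
print(f"Gemini 2.5 Flash: ${gemini_cost:.5f} per entity")
print(f"Decline: ~{total_decline:.0f}x total, ~{per_year:.1f}x per year")
```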

A critical 2026 nuance keeps the spine honest: headline API prices are BIFURCATING even as cost-per-capability keeps falling 5-10x/year (Epoch). Western labs raised premium reasoning-model prices in May-2026 (GPT-5.5 doubled to $2.50/$15; Gemini 3.5 Flash 3x'd to $1.50/$9) while efficiency-leaders (DeepSeek) drove to the floor. The takeaway for a builder: the abundance thesis is real but you must ENGINEER for the floor — route bulk typed-property emission to the cheapest capable model (the donto GLM/DeepSeek-tier extraction path), and reserve premium reasoning tokens for the high-value relationship-hypothesis and re-ranking steps. Cost is now a steering variable, not a wall.

The harder, more valuable frontier is MEASUREMENT: once generation is free, the scarce resource becomes knowing WHICH generated properties are worth keeping. Accuracy alone is the wrong yardstick, because a true-but-redundant fact has near-zero value. The right metrics, all live in 2024-2026 research, are: (a) downstream TASK-PERFORMANCE LIFT: does adding the emitted structure move a real metric? GraphRAG-style KGs deliver 72-83% comprehensiveness vs vector RAG and +12.8 QA points from better-constructed graphs; this is donto's gold standard, and the jsonresume->jobs flagship has a built-in one: got-interviewed / hired; (b) information gain / Bayesian surprise: how much a new property shifts the posterior over an entity, which doubles as the steering wheel for active generation (BED-LLM and Uncertainty-of-Thoughts choose what to generate next by maximizing expected information gain); (c) novelty/diversity: measured as the harmonic mean of originality (fraction of unseen n-grams) and quality, plus embedding-distance diversity, together with the reported result that AI-generated research ideas are rated MORE novel than those of expert humans (p<0.05, Stanford 100+ researcher study) though less diverse, which directly validates "machine serendipity that accumulates"; and (d) coverage/completeness benchmarks like MINE (Feb-2025), where KGGen beat OpenIE/GraphRAG by 18% on representing source text.
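
The novelty metric in (c) can be written down directly. The sketch below assumes whitespace tokenization and bigrams, and it treats the quality score as an input supplied by some external judge; the function names are illustrative, not from any cited paper's code.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def originality(candidate, corpus, n=2):
    """Fraction of the candidate's n-grams not seen anywhere in the corpus."""
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    seen = set()
    for doc in corpus:
        seen |= ngrams(doc.lower().split(), n)
    return len(cand - seen) / len(cand)

def novelty(orig, quality):
    """Harmonic mean: high only when originality and quality are both high."""
    if orig + quality == 0:
        return 0.0
    return 2 * orig * quality / (orig + quality)

corpus = ["python developer with sql"]
orig_score = originality("Python developer mentors juniors", corpus)  # 2 of 3 bigrams unseen
score = novelty(orig_score, quality=0.9)  # quality value is an assumed external judgment
print(round(orig_score, 3), round(score, 3))
```

The harmonic mean is the useful part of the design: an original but unsupported claim, or a well-supported but redundant one, both score low.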

Model collapse is the one real risk to a self-growing knowledge base, and the 2024-2026 literature has already de-fanged it for donto's exact architecture. Collapse only happens under REPLACEMENT (training on synthetic data while discarding real); error then grows roughly linearly with iterations (Shumailov). Under ACCUMULATION — keeping all real + synthetic data forever — error is provably BOUNDED, not divergent (Gerstgrasser et al. 2024; the variance converges to a finite limit independent of iteration count). donto is an accumulation system BY CONSTRUCTION: bitemporal, paraconsistent, evidence-anchored, it never overwrites or dedups; every claim keeps its provenance and counter-evidence. The collapse-avoidance recipe the field converged on — accumulate + verify/curate (selection on synthetic data "significantly enhances performance" especially where verifiers exist) — IS donto's claim lifecycle: emit abundantly at the typed-extraction layer, then gate/rank at the relationship layer against evidence and reality. The contradiction-preserving substrate is not just compatible with generative abundance; it is the provably-safe container for it.

What's newly possible:

Budget 'decompose this entity along 50 directions' as a line item: ~500 typed properties per entity (~5K output tokens) now costs ~$0.004-0.04 at DeepSeek/Gemini-Flash rates. A 1M-resume corpus fully decomposed = roughly $4K-40K of inference, not a knowledge-engineering department-decade. The Cyc-style human bottleneck is gone by 3-4 orders of magnitude.
Active, uncertainty-STEERED generation: instead of emitting blindly, use expected-information-gain (BED-LLM, Uncertainty-of-Thoughts 2024-25) to spend the next generation budget where the posterior over an entity is most uncertain — turning measurement from a backward judge into a forward steering wheel ('generate more where it pays').
A self-growing knowledge base that provably does NOT collapse: because donto ACCUMULATES (bitemporal, never overwrites) rather than replaces, error is bounded by theorem (Gerstgrasser 2024), not divergent — so donto can safely ingest its own and other models' generated claims indefinitely, which a vector DB or normal KG (which must dedup/pick) cannot.
Predicate invention as a runtime primitive: frontier LLMs can INVENT new relation types mid-extraction, so the schema grows with the data. Measure each invented predicate by its downstream task lift and information gain; keep the ones that pay. This was impossible when attributes had to be pre-defined (FCA) or hand-engineered (Cyc).
Novelty as a generatable, measurable product: AI ideas now rate MORE novel than expert humans (Stanford p<0.05). donto can emit relationship hypotheses, score each by novelty (harmonic mean of unseen-n-gram originality x evidence-backed quality) AND by Bayesian surprise, and surface only the high-surprise/high-evidence ones — 'machine serendipity that accumulates,' now with a number on it.
Task-lift A/B as the universal value gate: because generation is cheap, you can afford to generate a property, measure whether it improves a real downstream task (interview rate, QA accuracy, link prediction), and keep or kill it empirically — closing the loop that pre-LLM systems could never afford to run at scale.
Cost-tiered cognition routing: emit the firehose of typed properties on the floor-priced model (DeepSeek/GLM tier, ~$0.5/M) and spend premium reasoning tokens (GPT-5.x/Gemini-Pro tier) only on relationship-hypothesis generation and re-ranking — making 'generate everything' AND 'reason hard about the best bits' simultaneously affordable.
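
Cost-tiered routing, as described in the last item above, reduces to a small dispatch table plus a spend log. This is a sketch, not donto's routing code: the task names, `Tier`, `route` and `SpendLog` are hypothetical, the floor price uses the ~$0.5/M figure from the list, and the premium price uses the $15/M GPT-5.5 output rate quoted earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_output_tokens: float

FLOOR = Tier("floor", 0.50)       # DeepSeek/GLM tier, ~$0.5/M per the text
PREMIUM = Tier("premium", 15.00)  # premium reasoning tier, GPT-5.5 output rate

# Hypothetical task names: only these go to the premium tier.
PREMIUM_TASKS = {"relationship_hypothesis", "rerank"}

def route(task):
    """Bulk emission goes to the floor tier; reasoning-heavy tasks go premium."""
    return PREMIUM if task in PREMIUM_TASKS else FLOOR

class SpendLog:
    def __init__(self):
        self.spend = {"floor": 0.0, "premium": 0.0}

    def record(self, task, output_tokens):
        tier = route(task)
        cost = output_tokens / 1_000_000 * tier.usd_per_million_output_tokens
        self.spend[tier.name] += cost
        return cost

    def premium_share(self):
        total = sum(self.spend.values())
        return self.spend["premium"] / total if total else 0.0

log = SpendLog()
emit_cost = log.record("property_emission", 5_000)  # bulk typed-property pass
rerank_cost = log.record("rerank", 200)             # short premium reasoning pass
print(log.spend, round(log.premium_share(), 3))
```

Tracking the premium share alongside raw spend makes it easy to check whether premium tokens stay confined to the steps that need them.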

Measurable signals:

Inference cost for fixed capability: ~10x/year decline 3yrs running (a16z); GPT-3-quality $60/M (Nov-2021) -> $0.06/M (Nov-2024) = ~1000x. Epoch median 50x/year per task, range 9x-900x, accelerating to median 200x/year on post-Jan-2024 data.
Per-entity generation cost target: <$0.05 to emit ~500 typed properties (~5K output tokens) at DeepSeek V4-Pro $0.87/M or Gemini 2.5 Flash $2.50/M output; <$0.005 at floor rates. Corpus target: full 1M-resume decomposition <$40K.
Cost bifurcation to engineer around: cost-per-capability still falling 5-10x/year (Epoch) but headline premium prices rose May-2026 (GPT-5.5 $2.50/$15, Gemini 3.5 Flash $1.50/$9) while DeepSeek held floor ($0.44/$0.87). Route bulk emission to the floor.
Downstream task lift (the gold metric): GraphRAG 72-83% comprehensiveness vs vector RAG; +12.8 QA points from better KG construction; 3.4x accuracy in enterprise scenarios (Microsoft 2024). donto target: each kept property class must show measurable lift on a held-out task.
Coverage/completeness: MINE benchmark (Feb-2025) — KGGen beat OpenIE/GraphRAG by 18% on faithfully representing source text. Target a published MINE-style score per extraction pass.
Novelty: AI-generated research ideas rated MORE novel than 100+ expert humans (p<0.05, Stanford 2024), though lower diversity — so MEASURE and optimize diversity explicitly (distinct-n, Self-BLEU, embedding-distance, NovelSum which correlates with downstream tuning performance).
Model-collapse boundary conditions: REPLACEMENT -> error grows ~linearly with iterations; ACCUMULATION -> bounded finite error independent of iteration count (Gerstgrasser 2024). Strong Model Collapse: as little as 1-per-1000 synthetic fraction can degrade a REPLACEMENT regime — irrelevant to accumulation, but the reason verification/curation is mandatory.
Synthetic-data regime rule: synthetic helps when real data is scarce (<=~1,024 samples in the study); degrades when real data is abundant. donto rule-of-thumb: weight generated claims down where dense real evidence already exists, up where the entity is sparse.
donto live baseline to beat: 483 valid+ingested facts from a 1-sentence input in ~4.7 min (one opencode/GLM pass); '697 facts from cat-is-red'. Track facts/min, valid-fact %, and downstream-lift-per-fact as the core production dashboard.
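
Two of the signals above are easy to compute directly: distinct-n as a diversity measure and facts per minute from the live baseline. The sketch assumes whitespace tokenization; `distinct_n` is a common definition (unique n-grams over total n-grams), written here as an illustration rather than taken from a specific library.

```python
def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all texts (1.0 = no repetition)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

repetitive = distinct_n(["leads backend team", "leads backend team"])
varied = distinct_n(["leads backend team", "mentors junior engineers"])

# Baseline quoted in the text: 483 valid facts in ~4.7 minutes.
facts_per_minute = 483 / 4.7

print(repetitive, varied, round(facts_per_minute, 1))
```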

Concrete for donto:

Add a per-claim value score column populated by three measurable signals: (1) information_gain (posterior shift over the entity from adding this claim), (2) novelty (harmonic mean of unseen-n-gram originality x evidence-backed quality), (3) downstream_task_lift (filled in retroactively when a claim participates in a successful task — e.g., a resume property present in a got-interviewed match). Re-rank on these, not on accuracy alone.
Build the active-generation steering loop: after each extraction pass, compute per-entity posterior uncertainty and re-spend the next generation budget on the highest expected-information-gain directions (BED-LLM / Uncertainty-of-Thoughts pattern). Expose an EIG-ordered 'what to generate next' queue. Turns the firehose into a guided drill.
Implement cost-tiered routing in opencode_agent.py / extraction.py: bulk typed-property emission on the floor model (GLM/DeepSeek tier), relationship-hypothesis generation + re-ranking on a premium reasoning model. Log $/fact and $/kept-fact per tier; target <$0.05 per ~500-property entity.
Formalize donto's accumulation guarantee as a design invariant and SAY it: 'donto avoids model collapse by construction (Gerstgrasser-2024 accumulation regime) — it never replaces or dedups, so ingested generated claims have bounded, not divergent, error.' This is a defensible moat vs vector/normal-KG competitors who must collapse.
Add a verification/curation gate at the relationship layer (NOT the extraction layer): generate abundantly into hypothesis_only, then promote only claims that pass evidence-attachment + pass a downstream-lift or information-gain threshold. This is the field's proven collapse-avoidance recipe (accumulate + verify) mapped onto the 8-step lifecycle.
Adopt MINE-style coverage scoring as a CI metric on the extraction pipeline: every pass reports a coverage score against source text; regression-test that new extraction prompts/models don't drop coverage. Target beating KGGen's 18%-over-baseline.
For the jsonresume->jobs flagship, wire got-interviewed / hired as the ground-truth task-lift label feeding back into claim value scores: emitted/inferred skills and latent-trait properties that correlate with positive outcomes get up-ranked; those that never help get pruned. This is the rare system with built-in measurable downstream truth — exploit it as the canonical proof of the abundance thesis.
Track a diversity metric on generated relationship hypotheses (distinct-n / embedding-distance / NovelSum) and optimize it explicitly, because the literature shows LLMs are novel-but-low-diversity — donto should counter that with multi-lens decomposition + diversity-aware sampling so 'serendipity that accumulates' actually spans the space.
Weight generated claims by real-evidence density per entity: down-weight where dense real evidence exists (synthetic degrades in abundant-real regime), up-weight for sparse entities (synthetic helps in scarce-real regime, <=~1K samples). A simple per-entity evidence-count feature implements the field's synthetic-data regime rule.
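
The per-claim value score and the evidence-density weighting from the list above could be combined along the following lines. Everything here is an assumption made for illustration: the text does not fix a combination rule, so a weighted mean is used; the three signals are assumed to be scaled to [0, 1]; and the weighting curve is a hypothetical choice that only borrows the ~1,024-sample threshold from the synthetic-data regime rule.

```python
from dataclasses import dataclass

@dataclass
class ClaimSignals:
    information_gain: float      # posterior shift, assumed scaled to [0, 1]
    novelty: float               # assumed in [0, 1]
    downstream_task_lift: float  # assumed in [0, 1], filled in retroactively

def evidence_density_weight(real_evidence_count, scarce_threshold=1024):
    """Shrink the weight of generated claims as real evidence for the entity grows."""
    return scarce_threshold / (scarce_threshold + real_evidence_count)

def value_score(signals, real_evidence_count, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three signals, scaled by evidence density."""
    w_ig, w_nov, w_lift = weights
    base = (w_ig * signals.information_gain
            + w_nov * signals.novelty
            + w_lift * signals.downstream_task_lift) / sum(weights)
    return base * evidence_density_weight(real_evidence_count)

signals = ClaimSignals(information_gain=0.6, novelty=0.9, downstream_task_lift=0.3)
sparse_entity = value_score(signals, real_evidence_count=0)
dense_entity = value_score(signals, real_evidence_count=1024)
print(round(sparse_entity, 3), round(dense_entity, 3))
```

The same claim scores half as much on an entity that already has 1,024 real evidence items, which is the down-weighting behaviour the rule of thumb asks for.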

Honest constraints:

Precision/value gate is the real work, not generation. Once facts are free, the scarce resource is deciding which to keep. CHALLENGE: ship the per-claim value score (info-gain x novelty x downstream-lift) and prove that gating on it beats keep-everything on a real task. TARGET: kept-claim set delivers >=90% of full-firehose task-lift at <20% of the claims.
Headline-price bifurcation means 'cost->0' is not automatic for premium reasoning. CHALLENGE: keep bulk emission on floor models (DeepSeek/GLM tier) and prove premium tokens are spent only where they pay. TARGET: <$0.05 per ~500-property entity; <30% of total spend on premium-tier reasoning.
Model collapse is real under replacement and verification is mandatory under accumulation (selection 'significantly enhances performance'). CHALLENGE: enforce the verify/curate gate at the relationship layer and never train downstream models on un-curated self-generated claims. TARGET: measured downstream task-lift is flat-or-up across N self-ingestion cycles (not declining).
Synthetic data helps only in the scarce-real regime and degrades in the abundant-real regime. CHALLENGE: implement per-entity evidence-density weighting so generated claims don't drown dense real evidence. TARGET: no task-lift regression on entities with abundant real evidence after adding generated claims.
LLMs are novel-but-LOW-diversity, so naive abundance produces correlated near-duplicates. CHALLENGE: multi-lens decomposition + diversity-aware sampling, measured by distinct-n/embedding-distance/NovelSum. TARGET: maintain a target diversity coefficient as volume scales; reject passes that collapse it.
The cross-entity relationship generator (the 'discovery at intersections' engine) is the unbuilt high-value piece. CHALLENGE: generate relationship hypotheses across entities and rank by evidence + Bayesian surprise. TARGET (flagship): surfaced hidden-candidate / skill-adjacency edges that beat the existing matcher on got-interviewed rate by a measurable margin in an A/B.
Measuring information gain / Bayesian surprise over an evolving substrate is non-trivial at 39.5M-statement scale. CHALLENGE: an efficient, incremental posterior-shift estimate per claim (approximate, embedding- or count-based) that runs in the ingest path. TARGET: <X ms per claim so it doesn't bottleneck the firehose.
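
For the last constraint, a count-based stand-in for Bayesian surprise can run in constant time per claim. The sketch below is one possible approximation, not a method from the cited papers: `SurprisalEstimator` is a hypothetical class that scores a (predicate, value) pair by its smoothed negative log-probability under running counts, with one reserved slot for unseen values.

```python
import math
from collections import Counter, defaultdict

class SurprisalEstimator:
    """Incremental, count-based proxy for how surprising a new claim is."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # additive smoothing constant
        self.value_counts = defaultdict(Counter)  # predicate -> value -> count
        self.totals = Counter()                   # predicate -> total observations

    def surprisal(self, predicate, value):
        seen = self.value_counts[predicate]
        vocab = len(seen) + 1  # seen values plus one slot for anything unseen
        p = (seen[value] + self.alpha) / (self.totals[predicate] + self.alpha * vocab)
        return -math.log2(p)

    def observe(self, predicate, value):
        """Score the claim against current counts, then update the counts."""
        score = self.surprisal(predicate, value)
        self.value_counts[predicate][value] += 1
        self.totals[predicate] += 1
        return score

est = SurprisalEstimator()
for _ in range(3):
    est.observe("seniority", "senior")

common = est.surprisal("seniority", "senior")  # frequently seen value
rare = est.surprisal("seniority", "junior")    # never seen for this predicate
print(round(common, 3), round(rare, 3))
```

Each update touches two counters, so the estimate can sit in the ingest path; whether it tracks true posterior shift well enough is an empirical question the sketch does not answer.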

Examples / systems:

a16z LLMflation — Coined the ~10x/year inference-cost-for-fixed-capability decline; GPT-3-quality $60/M (2021) -> $0.06/M (2024), ~1000x in 3 years [10x/year; 1000x over 3 years] https://a16z.com/llmflation-llm-inference-cost/
Epoch AI inference price trends — Per-task price decline median 50x/year (range 9x-900x), accelerating to median 200x/year on post-Jan-2024 data; cost-per-capability still falling 5-10x/year even as 2026 headline prices rose [median 50x/yr, up to 900x/yr] https://epoch.ai/data-insights/llm-inference-price-trends
Gerstgrasser et al. 2024 (Accumulating data) — Model collapse happens under data REPLACEMENT (error grows linearly) but is AVOIDED under ACCUMULATION (bounded error independent of iteration count) — across text/molecule/image generative models [linear-divergence vs bounded finite error] https://arxiv.org/pdf/2404.01413
KGGen + MINE benchmark (Feb 2025) — LLM KG extractor with clustering-based entity resolution; MINE is the first benchmark for how well a KG represents source text [+18% over OpenIE/GraphRAG] https://arxiv.org/abs/2502.09956
Si, Yang, Hashimoto (Stanford 2024) — Can LLMs Generate Novel Research Ideas? — 100+ NLP researcher human study; AI ideas rated MORE novel than expert humans but lower diversity [AI > human novelty, p<0.05] https://arxiv.org/pdf/2409.04109
BED-LLM (Bayesian Experimental Design with LLMs) — Picks next query/generation by maximizing expected information gain — the active/curiosity-driven steering pattern for abundant generation [EIG-maximizing selection] https://www.researchgate.net/publication/397197970_Uncertainty_of_Thoughts_Uncertainty-Aware_Planning_Enhances_Information_Seeking_in_LLMs
Microsoft GraphRAG — Entity-relation graphs + community summaries lift downstream QA over vector RAG — the canonical 'does emitted structure improve a real task' result [72-83% comprehensiveness; +12.8 QA pts; 3.4x enterprise accuracy] https://arxiv.org/abs/2501.00309
DeepSeek V4-Pro / Gemini 2.5 Flash pricing — Floor-tier rates that make per-entity abundant generation cost single-digit cents [DeepSeek $0.44/$0.87; Gemini Flash $0.30/$2.50 per M] https://www.tldl.io/resources/llm-api-pricing-2026
Lightcast Open Skills / ESCO — Anchoring taxonomies for the jsonresume->jobs flagship: 34,000+ Lightcast skills (updated biweekly); ESCO 13,939 skills x 3,039 occupations [34K skills / 13,939 x 3,039] https://lightcast.io/open-skills/extraction

jsonresume-jobs-abundance

The scarce step in every prior matching system was emitting typed properties about people and jobs, and 2024-2026 evidence shows that bottleneck has collapsed. Most required skills in a job posting are expressed implicitly, not as keywords, and a frontier LLM now extracts them better than the entire prior supervised state of the art: zero-shot GPT-4 ESCO skill matching beat the previous best (Decorte et al.) by +22.33 and +29.75 percentage points on RP@10 (arXiv:2307.03539). That is the abundance thesis made measurable: a guided LLM can emit competencies, seniority, trajectory-implied capabilities, working-style signals, and transferable-skill bridges that keyword/embedding pipelines structurally cannot see. And it changes ranking, not just recall: LLM re-ranking (ConFit v3) lifts nDCG@10 from 52.33 to 61.37 (+9.04 pp absolute) over the strongest embedding baseline, ConFit v2, on a real 49K-resume recruiting set, or +7.81 pp when averaged with Recall@10; the explainable Synapse system reports +22% nDCG@10 over embedding-only retrieval. So abundance is not noise: more typed properties → better, explainable matches.

The deepest finding for donto is architectural. The best 2026 explainable matcher, JobMatchAI (arXiv:2603.14558), wins by strictly separating a deterministic scoring layer from a generative explanation layer — the LLM "can explain a ranking but never inflate one," yielding 100% faithful explanations (0% unsupported claims), 70.5% top-factor mention, and 94.5% weakness-surfacing, all at 82ms median. This is precisely donto's split: let the LLM emit an unbounded firehose of typed, evidence-anchored claims (HAS_SKILL, IMPLIES_COMPETENCY, BRIDGES_TO, hypothesis_only trajectory inferences), hold the contradictory ones forever as legal paraconsistent state, then gate at the relationship/ranking layer with deterministic, auditable utility — and have the LLM explain only what the evidence already supports. A vector DB must collapse to one embedding; a normal KG must dedup to one canonical skill. donto is the only home that can keep "claims this person can do Kubernetes (inferred from 3 years of Docker + Terraform)" alongside "no direct Kubernetes evidence" as separate evidence-bearing claims, and re-rank when an interview outcome arrives.
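The scoring/explanation split described above can be sketched as follows. This is an illustrative toy, not JobMatchAI's or donto's code: the `Claim` fields, the 0.5 weight for inferred claims, and the `is_faithful` check are assumptions chosen to show the shape of the contract. The scorer is deterministic, and an explanation passes only if every claim it cites was already used by the scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    predicate: str               # e.g. "HAS_SKILL" or "IMPLIES_COMPETENCY"
    value: str                   # e.g. "kubernetes"
    evidence: str                # resume span the claim is anchored to
    hypothesis_only: bool = False


def score(required: set[str], claims: list[Claim], inferred_weight: float = 0.5):
    """Deterministic score: full credit for direct evidence, partial for inferred."""
    if not required:
        return 0.0, {}
    best = {}
    for c in claims:
        if c.value in required:
            w = inferred_weight if c.hypothesis_only else 1.0
            if w > best.get(c.value, (0.0, None))[0]:
                best[c.value] = (w, c)
    total = sum(w for w, _ in best.values()) / len(required)
    return total, best


def is_faithful(cited: list[Claim], best: dict) -> bool:
    """An explanation may cite only claims the scorer actually used."""
    allowed = {c for _, c in best.values()}
    return all(c in allowed for c in cited)


claims = [
    Claim("HAS_SKILL", "docker", "3 years of Docker"),
    Claim("HAS_SKILL", "terraform", "Terraform modules"),
    Claim("IMPLIES_COMPETENCY", "kubernetes",
          "inferred from Docker + Terraform", hypothesis_only=True),
]
required = {"docker", "kubernetes", "python"}
total, best = score(required, claims)
fabricated = Claim("HAS_SKILL", "python", "not in the resume")
print(total, is_faithful([claims[0]], best), is_faithful([fabricated], best))
```

In a real pipeline the LLM would write the prose, and a check like `is_faithful` would run over the claims that prose cites. The missing "python" skill stays visible as a gap in the score, and the explanation cannot paper over it.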

The network effects compound at jsonresume scale. The career-mobility literature now grounds next-role prediction to standard taxonomies at volume: KARRIEREWEGE+ (100K resumes → 3,039 ESCO occupations, MRR 43.58, COLING 2025 industry track) is exactly the kind of skill-adjacency and career-path graph that emerges once millions of resumes are decomposed into typed claims. LinkedIn's economic graph (800M+ members; job-required skills changed ~25% since 2015, a shift projected to double by 2027; skill-adds up 140% since 2022) and Lightcast Open Skills (32K+ skills mined from 1B+ postings, refreshed biweekly) prove the demand and the moat, but they are closed and embedding-collapsed. An open jsonresume claim-substrate, anchored to the now-official ESCO↔︎O*NET crosswalk plus Lightcast/ESCO, can do the one thing the incumbents cannot: expose why a non-obvious candidate fits, as a checkable evidence chain, and improve it with built-in ground truth (got-interviewed / hired / retained). Cost is the only real constraint and it is falling fast: a million resume-extractions run $72 (small open models) to ~$9,000 (GPT-4o), batch APIs cut that 50%, and per-token prices fell ~80% in the last year, so full-firehose extraction over millions of resumes is already an O($1K-10K) line item, not a research project.
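The cost arithmetic in the paragraph above can be checked in a few lines. The per-million prices and the 50% batch discount come from the text; the `passes` parameter, which models several lenses each re-reading the same resume, is a hypothetical knob.

```python
def extraction_cost(n_resumes: int, usd_per_million_docs: float,
                    batch: bool = False, passes: int = 1) -> float:
    """Estimated USD for extracting n_resumes, optionally with the batch-API discount."""
    cost = n_resumes / 1_000_000 * usd_per_million_docs * passes
    return cost * 0.5 if batch else cost


small_model = extraction_cost(1_000_000, 72)                  # small open model, one pass
frontier_batch = extraction_cost(1_000_000, 9_000, batch=True)  # GPT-4o-class price, batch API
print(small_model, frontier_batch)
```

Under these assumptions the line item scales linearly with the number of lens passes, so how many lenses run on a frontier model versus a small one dominates the bill.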

What's newly possible:

Emit the IMPLICIT skill graph that keyword/embedding matching structurally misses: most required skills in a posting are never stated explicitly, and zero-shot LLMs now extract them +22-30 pp (RP@10) above the prior supervised best (arXiv:2307.03539) — so a resume's 'real' competency set can be 3-10x larger than its listed skills, anchored to ESCO codes.
Generate typed transferable-skill BRIDGE claims across domains (e.g. 'client-facing ops 6yr ⇒ stakeholder-management + incident-comms' or 'competitive-StarCraft ⇒ real-time resource-allocation') as first-class evidence-anchored edges, surfacing candidates who never held the title — the 'hidden candidate' Eightfold markets but cannot make auditable.
Decompose trajectory, not just snapshot: ground each resume to a career-path graph (KARRIEREWEGE+ style, 3,039 ESCO occupations) and emit 'next-role-ready' / 'over-qualified' / 'stretch-fit' as hypothesis_only claims, re-ranked as outcomes arrive.
De-conflate PREFERENCE vs QUALIFICATION as two separate typed claim streams (arXiv:2602.03097) — donto holds 'wants executive role' and 'qualified for executive role' as distinct, possibly contradictory claims instead of one blended score.
Hold contradictory identity/skill claims paraconsistently: 'senior per title' vs 'junior per tenure', or duplicate-profile variants, live side-by-side with evidence — query-time entity-resolution lenses decide per use, no destructive dedup.
Explanation that is provably faithful, not generated spin: deterministic scoring layer + LLM-explains-only-supported-evidence (JobMatchAI: 100% faithful, 0% unsupported, 94.5% weakness-surfacing) — a compliance-grade 'why this match' that incumbents' black-box embeddings can't produce.
Self-growing skill-adjacency ontology: the LLM can INVENT new predicates/skill-relations it observes across millions of resumes (RELATED_TO, IMPLIES, OBSOLETED_BY) rather than being limited to a pre-frozen 32K-skill list — the taxonomy grows itself and prunes by hire-outcome reality.
Built-in falsifiable ground truth at population scale: every match carries got-interviewed/hired/retained outcomes, so the substrate continuously re-ranks which inferred/implicit/bridge claims actually predict success — turning abundance into a self-calibrating engine.

Measurable signals:

Implicit-skill lift: zero-shot GPT-4 ESCO matching RP@10 = 61.02 (House) / 68.94 (Tech) vs prior supervised best 38.69 / 39.19 — +22.33 / +29.75 pp (arXiv:2307.03539). Target for donto extractor: ≥ this on a held-out jsonresume↔︎ESCO set.
LLM re-ranking > embeddings: ConFit v3 nDCG@10 61.37 / Recall@10 68.89 vs ConFit v2 52.33 / 62.30 (+9.04 / +6.59 pp; +7.81 pp avg) on 10,597 jobs × 49,398 resumes (arXiv:2605.09760).
Explainable ensemble: Synapse +22% nDCG@10 over embedding-only retrieval; evolutionary loop >60% relative gain on recommender scores (arXiv:2604.02539).
Explanation faithfulness (the donto promise, measurable): JobMatchAI 100% faithful / 0% unsupported claims / 70.5% top-factor mention / 94.5% weakness-surfacing / 82ms median (arXiv:2603.14558).
Career-path prediction grounded to taxonomy: KARRIEREWEGE+ 100K resumes → 3,039 ESCO occupations, best MRR 43.58 (COLING-2025 industry). Target: beat on next-role R@5 using full claim-set vs skills-only.
Human-LLM resume rating: on 736 real submissions, GPT-4 ratings correlate only weakly with human ratings, and the LLM shows NO larger demographic group differences than humans (Findings of NAACL 2025, paper 270). Fairness is measurable and not worse than the status quo.
Network scale / demand: LinkedIn 800M+ members; job-required skills changed ~25% since 2015, projected to DOUBLE by 2027; skill-adds up 140% since 2022. Lightcast 32K+ skills from 1B+ postings, refreshed biweekly.
TalentCLEF 2025 open benchmark (Zenodo): best multilingual job-title match MAP 0.534; best job-title→skill MAP 0.360 — a public leaderboard donto can post to.
Extraction cost: $72 (Qwen3-4B) to ~$9,000 (GPT-4o) per 1M docs @ 2.4K tok; batch API -50%; per-token prices fell ~80% in 12 months — full-firehose over millions of resumes is O($1K-10K).
Built-in outcome ground truth: precision/recall of 'inferred/bridge claims' measured against got-interviewed / hired / 6-month-retained, re-computed bitemporally as outcomes land.
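Several signals above are reported as nDCG@10, so a reference computation is useful when reproducing them on a jsonresume cohort. This sketch uses the linear-gain form of DCG and derives the ideal ordering from the same list it scores, which assumes every relevant candidate appears somewhere in the ranking. Using got-interviewed as the relevance label (1 or 0) is an assumption.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# 1 = candidate got interviewed, 0 = did not; list order is the matcher's ranking.
baseline_ranking = [0, 1, 0, 1, 0]
claim_ranking = [1, 1, 0, 0, 0]
print(ndcg_at_k(baseline_ranking), ndcg_at_k(claim_ranking))
```

Comparing the two rankings on the same outcome labels gives the per-job lift. Averaging over jobs gives the figure to set against the published baselines.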

Concrete for donto:

Build the jsonresume Abundance Extractor: a guided multi-lens opencode/GLM pass that emits, per resume, typed claims across ~10 directions — explicit skills, IMPLIED competencies (Docker+Terraform⇒Kubernetes-ready), seniority-from-trajectory, transferable BRIDGES, working-style/context signals, latent traits (flagged low-confidence), identity variants — every claim ESCO/O*NET/Lightcast-anchored with an evidence_link back to the resume span.
Mirror JobMatchAI's split inside donto: a DETERMINISTIC scoring layer (Jaccard + KG-relatedness + experience-distance + the dual preference/qualification scores) reads claims; the LLM generates ONLY evidence-supported explanations. Enforce '0% unsupported claims' as a Lean-4-checkable shape on the explanation step.
Store skill-adjacency and career-path as bitemporal typed edges (RELATED_TO, IMPLIES, NEXT_ROLE, OBSOLETED_BY) that accumulate across all resumes — the 'machine serendipity that accumulates' becomes the population skill-graph; let the extractor PROPOSE new predicates, gated before they join the canonical lens.
Make got-interviewed/hired/retained a bitemporal outcome claim that triggers re-ranking: each outcome updates the measured precision of the inferred/bridge claim TYPES that fed the match — the substrate learns which abundance directions actually predict hiring.
Implement preference vs qualification as two separate claim contexts (ctx:jobs/preference/, ctx:jobs/qualification/) so a candidate can be 'wants X' and 'not-yet-qualified-for-X' simultaneously without collapse (arXiv:2602.03097).
Ship a public 'explainable hidden-candidate' demo: given a job, return non-obvious fits with a full evidence chain (which implicit/bridge claims fired, counter-evidence shown) — the thing LinkedIn/Eightfold cannot expose because their match is an opaque embedding.
Adopt the official ESCO↔︎O*NET crosswalk + Lightcast Open Skills as the anchor namespace so jsonresume claims are portable and the open standard is interoperable by construction — a moat incumbents' proprietary taxonomies can't claim.
Run a falsifiable first milestone: extract abundance-claims for a held-out jsonresume cohort, match vs an embedding-only baseline, and report nDCG@10 / hidden-candidate recall AND explanation-faithfulness. Target a ConFit-v3-class lift (52.33→61.37 is +9.04 pp nDCG@10; +7.81 pp averaged with Recall@10) with 100%-faithful explanations.
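The items above imply a claim record that carries its context and evidence pointer, and a store that never merges. A minimal sketch under stated assumptions: the field names `evidence_link` and `hypothesis_only` and the two context prefixes come from the text, while the class names, predicates, and placeholder identifiers are hypothetical.

```python
from dataclasses import dataclass

PREFERENCE = "ctx:jobs/preference/"
QUALIFICATION = "ctx:jobs/qualification/"


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    context: str
    evidence_link: str           # pointer back to the resume span
    hypothesis_only: bool = True
    confidence: float = 0.5


class ClaimStore:
    """Append-only: claims are never merged, overwritten, or deduplicated."""

    def __init__(self):
        self._claims: list[Claim] = []

    def add(self, claim: Claim) -> None:
        self._claims.append(claim)

    def about(self, subject: str, context_prefix: str) -> list[Claim]:
        return [c for c in self._claims
                if c.subject == subject and c.context.startswith(context_prefix)]


store = ClaimStore()
store.add(Claim("resume:ana", "WANTS_ROLE", "occupation:executive",
                PREFERENCE, "resume:ana#summary",
                hypothesis_only=False, confidence=0.9))
store.add(Claim("resume:ana", "NOT_YET_QUALIFIED_FOR", "occupation:executive",
                QUALIFICATION, "resume:ana#work", confidence=0.6))

wants = store.about("resume:ana", PREFERENCE)
qualified = store.about("resume:ana", QUALIFICATION)
print(len(wants), len(qualified))
```

Both claims about the same role survive side by side. A scorer can read the two contexts separately instead of receiving one blended number.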

Honest constraints:

Abundance can hallucinate: inferred/bridge claims risk false positives. Solution: store as hypothesis_only with explicit evidence_links and confidence; gate at the ranking layer; measure each claim TYPE's precision against hire outcomes and auto-demote types that don't predict. Target: inferred-claim precision ≥ explicit-claim precision within 2 outcome cycles.
Explanation faithfulness is the whole product: a generated 'why' that isn't backed by evidence is worse than no explanation. Enforce JobMatchAI-style separation (LLM sees only pre-scored evidence) and Lean-4-check '0% unsupported claims'. Target: ≥99% faithful as the floor, converging on the 100%/0% benchmark.
Cost at firehose scale is real but bounded: 10 lenses × millions of resumes multiplies tokens. Target: keep full re-extraction under ~$0.005/resume using batch API + small models for routine lenses, frontier models only for bridge/trajectory lenses; re-extract incrementally on resume edits, not nightly.
The cross-entity relationship GENERATOR (proposing skill-adjacency/bridge edges no one drew) is the unbuilt piece. Build it as a guarded proposer: LLM proposes candidate edges, deterministic support/rebut scoring + outcome data confirm before they enter the canonical lens. Target: ≥X confirmed novel adjacency edges/month with hire-outcome support above chance.
Ground truth is sparse and biased (only some matches get interview/hire signals, and the funnel itself is biased). Use the 736-resume fairness result as a floor (no worse than humans), measure group-differential outcomes continuously, and treat missing outcomes as missing-not-at-random in the re-ranking. Falsifiable target: demographic outcome gaps ≤ human-baseline gaps.
Taxonomy drift: job-required skills are shifting fast (the ~25% change since 2015 is projected to double by 2027), so any frozen taxonomy rots. The self-growing predicate mechanism is the answer, but it needs governance: Lean-checkable shape constraints on new predicates before promotion. Target: incorporate Lightcast biweekly refresh + auto-propose net-new skills within one refresh cycle.
Privacy/consent for resume claims: inferring traits/working-style from a resume raises legitimate consent and EU-AI-Act high-risk concerns. Make every inferred claim user-visible, contestable, and deletable; keep inferred-trait lenses opt-in and clearly labeled low-confidence; never let trait inferences enter deterministic scoring without explicit policy.
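The "measure each claim TYPE's precision against hire outcomes and auto-demote" constraint can be sketched as below. The input shape, the type names, and the 0.6 floor are hypothetical. Matches with no outcome are simply skipped here, which is a simplification: the list above asks for missing outcomes to be treated as missing-not-at-random.

```python
from collections import defaultdict


def claim_type_precision(matches):
    """matches: iterable of (claim_types_that_fired, outcome); outcome is True, False, or None."""
    hits, trials = defaultdict(int), defaultdict(int)
    for types, outcome in matches:
        if outcome is None:  # no interview/hire signal yet; naive skip (see caveat above)
            continue
        for t in set(types):
            trials[t] += 1
            hits[t] += int(outcome)
    return {t: hits[t] / trials[t] for t in trials}


def types_to_demote(precision, floor):
    """Claim types whose measured precision falls below the floor."""
    return sorted(t for t, p in precision.items() if p < floor)


matches = [
    ({"explicit", "bridge"}, True),
    ({"bridge"}, False),
    ({"explicit"}, True),
    ({"latent_trait"}, None),
]
precision = claim_type_precision(matches)
demoted = types_to_demote(precision, floor=0.6)
print(precision, demoted)
```

Demotion here only lowers a claim type's weight in ranking. The claims themselves stay in the store, so a type can recover if later outcomes support it.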

Examples / systems:

LLMs as Zero-Shot ESCO Skill Matchers (Clavié & Soulié) — Zero-shot GPT-4 extracts ESCO skills incl. implicit ones, beating prior supervised best by +22.33/+29.75 pp RP@10 — proof the extraction bottleneck collapsed [RP@10 61.02/68.94 vs 38.69/39.19] https://arxiv.org/html/2307.03539
ConFit v3 (LLM re-ranking) — LLM listwise re-ranking over embedding baseline on 49K-resume recruiting set with controllable non-negotiables checklist [nDCG@10 52.33→61.37 (+7.81 pp avg)] https://arxiv.org/html/2605.09760v1
Synapse (explainable two-phase + genetic optimization) — Explainable retrieval ensemble + LLM-guided resume optimization for job-person fit [+22% nDCG@10 over embedding-only; >60% relative recommender-score gain] https://arxiv.org/pdf/2604.02539
JobMatchAI (KG + semantic + explainable) — Deterministic scoring layer strictly separated from generative explanation — the architecture donto should mirror [100% faithful / 0% unsupported / 94.5% weakness-surfacing / 82ms] https://arxiv.org/html/2603.14558
KARRIEREWEGE+ career-path dataset — 100K resumes grounded to 3,039 ESCO occupations for next-role prediction at scale [best MRR 43.58] https://aclanthology.org/2025.coling-industry.46.pdf
De-conflating Preference & Qualification (LLM job rec) — Two separate reasoning streams instead of one blended fit score — directly maps to two donto claim contexts [significant gains over single-stream LLM + CF baselines] https://arxiv.org/pdf/2602.03097
Human vs LLM resume matching (observational, 736 resumes) — Real-world fairness/validity check: LLM no more biased than humans across race/gender [no larger group differences than human raters] https://aclanthology.org/2025.findings-naacl.270/
TalentCLEF 2025 (open benchmark, Zenodo) — First public skill/job-title intelligence benchmark, multilingual, ESCO-grounded — a leaderboard to post to [job-title match MAP 0.534; title→skill MAP 0.360] https://arxiv.org/html/2507.13275v1
Lightcast Open Skills + LinkedIn Economic Graph — Scale/demand evidence and the closed-incumbent moat an open claim-substrate can break open [Lightcast 32K+ skills / 1B+ postings; LinkedIn 800M members, skills doubling by 2027] https://lightcast.io/open-skills

modern-abundance-harnessing-systems

Generation of typed properties was the historic bottleneck; it is now abundant. The evidence is itemized in the lists below.

What's newly possible:

A guided LLM decomposes any entity along unbounded directions and invents new predicate types as it goes (AutoSchemaKG at 50M docs, 92 percent schema-alignment, zero human schema engineering, the step that bottlenecked Cyc).
Build a 900M node and 5.9B edge KG directly from text with no predefined schema (ATLAS, 2025); schema induction is a generated artifact now.
Self-improving loops compound: phi-1 reaches 50.6 percent HumanEval on about 7B tokens; Llama-3-8B goes 22.9 to 39.4 percent AlpacaEval 2 on model-generated rewards.
End-to-end agentic discovery ships real candidates: AI Co-Scientist (30 AML drug candidates narrowed to 5, then to 1 active) and Robin (ripasudil for dry-AMD via ABCA1, 2.5 months), both Nature 2026.
Generative agents simulate 1052 real people from interviews, replicating their survey answers 85 percent as accurately as the people replicate their own answers (Stanford 2024-25); text-derived world-models of entity populations are buildable.
The new primitive: pair unbounded generation with a paraconsistent persistent substrate that holds contradictory claims forever and re-ranks as evidence arrives. No shipped system does this; they collapse, dedup, or evaporate.

Measurable signals:

AutoSchemaKG/ATLAS: 50M plus docs to 900M nodes, 5.9B edges, 92 percent schema alignment, zero manual intervention.
GraphRAG: LazyGraphRAG cuts indexing cost 10 to 90 percent (Microsoft 2024).
phi-1: 50.6 percent HumanEval, 55.5 percent MBPP at 1.3B params on about 7B tokens, beats 10x-larger models.
Self/meta rewarding: Llama-3-8B 22.9 to 39.4 percent AlpacaEval 2; STaR 95 percent AMC23; RLAIF about 10x cheaper than RLHF.
AI Co-Scientist (Nature 2026): 30 AML candidates to 5 to 1 active; Elo correlates with correctness.
Robin (Nature 2026): autonomous hypotheses plus 2 lab rounds, dry-AMD ripasudil via ABCA1, 2.5 months, 3 agents.
SciAgents: 33159 node and 48753 edge ontological KG from about 1000 papers (2024).
Sakana AI Scientist-v2: ICLR-workshop review 6,7,6 (avg 6.33) but flagged for hallucinations and faked results.
Generative agents: 1052 people at 85 percent of self-consistency on the GSS (Stanford 2024-25).
LLM-as-judge: 60 to 70 percent positional swing, 10 to 25 percent self-preference, 60 to 68 percent expert agreement.
Skill extraction to ESCO: best LLM pipeline about 0.56 end-to-end, the flagship baseline to beat.

Concrete for donto:

Position donto as the persistent paraconsistent home for generative abundance, the layer AutoSchemaKG, GraphRAG, Co-Scientist, and Robin all lack: they generate then collapse; donto generates then holds and re-ranks.
Implement the 8-step claim lifecycle: ingest, emit unbounded typed claims, hold incompatible claims paraconsistently, generate relationship hypotheses at lens intersections, attach evidence and counter-evidence via supports/rebuts/undercuts edges, rank, re-rank bitemporally, explain. Maximize at extraction, gate at the relationship layer.
Adopt the AutoSchemaKG pattern (entities, events, concepts; schema induced not predefined), but write every induced predicate as a hypothesis-only claim with provenance so schema growth is auditable and reversible.
Build the cross-entity relationship generator donto lacks: at the intersection of two lens decompositions, have the LLM propose typed relationships, store each as hypothesis-only with supports/rebuts edges, then rank. This is the donto analogue of SciAgents/Co-Scientist, with persistence added.
Treat LLM-as-judge bias as a measured target: ensemble plus position-swap plus reference-anchored scoring to lift human-agreement from about 60 to 68 percent toward over 85 percent; log disagreement as paraconsistent state.
Ship jsonresume to jobs as the flagship: emit unbounded typed properties (explicit, inferred, latent skills, seniority, trajectory, identity variants) anchored to ESCO/O*NET/Lightcast; matching is explainable evidence-anchored discovery; first milestone is beating the ESCO baseline of about 0.56 end-to-end, with got-interviewed and hired as bitemporal ground-truth.
Run the synthetic-data self-improvement loop inside donto: claims that survive evidence-anchored re-ranking become high-quality curation signal (the phi/STaR pattern).
Make persistence the demo: re-run the same query a week apart and show a previously low-ranked hypothesis rise on new evidence, a falsifiable capability no shipped discovery agent has.
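The "ensemble plus position-swap" judging step in the list above can be sketched with stub judges. The judge signature (a callable that sees two candidates and returns "first" or "second") and both stub judges are hypothetical. A real judge would be an LLM call.

```python
from collections import Counter


def position_swap_verdict(judge, a, b):
    """Ask twice with the order swapped; keep the verdict only if it survives the swap."""
    pick1 = a if judge(a, b) == "first" else b
    pick2 = b if judge(b, a) == "first" else a
    return pick1 if pick1 == pick2 else None  # None = position-dependent answer


def ensemble_verdict(judges, a, b):
    """Majority over swap-consistent verdicts; returns (winner or None, abstentions to log)."""
    verdicts = [position_swap_verdict(j, a, b) for j in judges]
    abstained = verdicts.count(None)
    votes = Counter(v for v in verdicts if v is not None)
    if not votes:
        return None, abstained
    (top, n), *rest = votes.most_common()
    if rest and rest[0][1] == n:
        return None, abstained  # tie: keep both hypotheses, force no winner
    return top, abstained


def always_first(x, y):
    return "first"  # stub with pure positional bias


def prefers_longer(x, y):
    return "first" if len(x) >= len(y) else "second"  # stub with a content-based rule


a, b = "short claim", "a longer, evidence-anchored claim"
winner, abstentions = ensemble_verdict([prefers_longer, always_first, prefers_longer], a, b)
print(winner, abstentions)
```

The positionally biased judge flips its answer under the swap and is counted as an abstention. Logging that count is one way to keep judge disagreement as state instead of discarding it.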

Honest constraints:

The cross-entity relationship generator is the unbuilt core; intra-corpus extraction works at scale but cross-entity typed-relationship proposal is the harder open problem. Target: 20 plus candidates per entity pair at precision-at-10 over 0.4 on a gold set.
Generative abundance produces hallucinated low-precision claims (the Sakana failure); holding contradictions does not excuse junk. Target: high extraction recall while the relationship gate holds precision-at-k above a published threshold, both layers reported separately.
The ranking layer inherits LLM-as-judge bias; solvable with ensembling and position-swap. Target: lift judge-human agreement from about 60 to 68 percent toward over 85 percent, and log residual disagreement as paraconsistent state rather than forcing a verdict.
Cost and latency of opening the faucet on every entity is real (GraphRAG indexing is expensive). Target: tiered extraction budget plus an indexing cost-per-entity ceiling tracked as a metric.
Storing the firehose stresses the substrate (donto about 39.5M statements; abundance could 10x to 100x that). Target: extend bounded-candidate query patterns to the claim and relationship layers so worst-case latency stays sub-second.
Ground truth for re-ranking is delayed and sparse. Target: instrument the bitemporal re-rank so a handful of got-interviewed/hired events visibly move rankings, and report calibration as the falsifiable test.

Examples / systems:

AutoSchemaKG / ATLAS — Autonomous LLM KG construction with schema induction; proof that typed-property generation is no longer human-bottlenecked [50M plus docs to 900M nodes, 5.9B edges, 92 percent schema alignment] https://arxiv.org/abs/2505.23628
Google AI Co-Scientist (Nature 2026) — Multi-agent generate/debate/Elo-rank hypotheses; lab-validated; lacks a persistent contradiction-holding ledger across runs [30 AML candidates to 5 to 1 active; Elo correlates with correctness] https://www.nature.com/articles/s41586-026-10652-y
Robin (FutureHouse, Nature 2026) — End-to-end autonomous hypothesis generation; discovered ripasudil for dry-AMD; each run starts fresh, no persistent KB [dry-AMD via ABCA1 upregulation; 2.5 months; 3 agents] https://www.futurehouse.org/research-announcements/demonstrating-end-to-end-scientific-discovery-with-robin-a-multi-agent-system
SciAgents (MIT, 2024) — Multi-agent reasoning over an ontological KG for hidden interdisciplinary relationships, but on an ephemeral graph [33159 nodes and 48753 edges from about 1000 papers] https://arxiv.org/abs/2409.05556
Sakana AI Scientist-v2 — Autonomous papers via tree search; one passed ICLR-workshop review but was flagged for hallucinations and faked results [scores 6,7,6 (avg 6.33); later withdrawn] https://arxiv.org/abs/2504.08066
Generative Agents of 1052 People (Stanford) — LLM agents from interviews replicate real individuals survey responses; text-derived world-models of entity populations [85 percent of self-consistency on GSS] https://arxiv.org/pdf/2411.10109
phi-1 / Textbooks Are All You Need — Synthetic-data curation beats scale; value is in the filter not the faucet [50.6 percent HumanEval at 1.3B params on about 7B tokens] https://arxiv.org/abs/2306.11644
GraphRAG / LazyGraphRAG (Microsoft) — Reference LLM-built KG from text with provenance, but collapses to one canonical graph and cannot hold contradictions [LazyGraphRAG cuts indexing cost 10 to 90 percent] https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LLM skill extraction to ESCO — Flagship-relevant: LLM pipelines map resume and job text to skill taxonomies; the baseline jsonresume-to-jobs must beat [about 0.56 end-to-end] https://arxiv.org/html/2512.03195v1

substrate-as-possibility-space-and-domains

The scarce step in every prior knowledge-and-discovery system was generation of typed properties and relations. Cyc needed knowledge engineers; literature-based discovery needed co-occurrence statistics; formal concept analysis needed predefined attributes. That bottleneck is gone. A guided frontier LLM now emits typed claims about any entity along essentially unbounded directions for ~$0.00014/triple: GPTKB v1.5 materialized 100M triples from 6.1M entities for ~$14,136 (arXiv 2507.05740), and "Mining the Mind" extracted ~100M beliefs from frontier models and showed they assert mutually contradictory claims depending on framing (arXiv 2510.07024). Emission is now abundant; it is also abundantly contradictory and redundant. The design question has flipped from "how do we get enough typed knowledge?" to "where do we PUT an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?"

TASK A thesis: the standard storage targets are structurally hostile to abundance. A vector DB collapses meaning to a single embedding and returns "semantically redundant outputs that lack contextual diversity" — when an LLM emits a fact that conflicts with a stored one, the closer vector wins and the other is silently lost (Mem0 on LoCoMo retrieves the stale address when it is embedding-closer). Standard KGs and even the best 2025 agent-memory graphs enforce single-truth: Zep/Graphiti explicitly uses an LLM to detect contradicting edges and INVALIDATES the overlapping edge (arXiv 2501.13956); Mem0's update step overwrites. Every one of these must collapse, dedup, or pick-a-winner at write time — destroying exactly the speculative, minority, not-yet-supported claims that are the raw material of discovery. donto's paraconsistent, evidence-first, bitemporal quad store does the opposite: it holds incompatible claims forever as legal state, anchors each to evidence with typed argument edges (supports/rebuts/undercuts), and re-ranks by reality over time instead of deleting on conflict. This converts three problems-of-abundance into assets: (1) RECALL — nothing emitted is lost, so the recall ceiling is set by generation, not by a dedup threshold; (2) AUDITABILITY — every claim carries provenance and counter-evidence, matching the citation/contract demands of AML triage (arXiv 2604.19755) and Claimify-style verification (96.7% precision, 87.6% coverage; arXiv 2502.10855); (3) COMPOUNDING — because claims are retained and re-rankable, new evidence re-scores old hypotheses (Elo-tournament re-ranking is exactly what Google's AI co-scientist used to reach Nature in 2026), so the base accumulates "machine serendipity" rather than resetting each run.
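The recall argument above can be made concrete with a toy comparison. The overwrite store below is a deliberately simple stand-in for any pick-a-winner write path, keyed on (subject, predicate). It does not model Mem0, Zep/Graphiti, or any specific product, and the triples are invented for illustration.

```python
def ingest_overwrite(claims):
    """Pick-a-winner write path: the newest object per (subject, predicate) replaces the old."""
    store = {}
    for s, p, o in claims:
        store[(s, p)] = o
    return {(s, p, o) for (s, p), o in store.items()}


def ingest_hold_all(claims):
    """Hold-everything write path: every distinct claim is retained."""
    return set(claims)


def retention(claims, stored):
    """Fraction of distinct emitted claims still retrievable after ingest."""
    distinct = set(claims)
    return len(distinct & stored) / len(distinct)


firehose = [
    ("person:ana", "lives_in", "Berlin"),
    ("person:ana", "lives_in", "Lisbon"),   # conflicts with the claim above
    ("person:ana", "HAS_SKILL", "docker"),
]
held = retention(firehose, ingest_hold_all(firehose))
collapsed = retention(firehose, ingest_overwrite(firehose))
print(held, collapsed)
```

The gap between the two retention figures is what a collapse benchmark would report, measured on a real LLM firehose in place of three hand-written triples.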

The volume reconciliation is concrete and measurable: maximize at the typed-extraction layer (emit everything, gate nothing — donto-memory already produces 483 valid facts from one sentence), and gate at the relationship/promotion layer (a claim earns .candidate→.proved only via supports-edges to independent evidence, with a precision target ≥0.95 borrowed from Claimify). The two layers are decoupled by the substrate's hypothesis_only flag: abundance lives below the waterline as hypothesis_only claims; reality pulls a vanishing fraction above it. This is the only architecture where "generate in all directions, prune by reality over time" is even expressible.
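A sketch of the promotion gate described above, assuming supports and rebuts are given as lists of evidence-source identifiers. The three status names mirror the text. The thresholds (1 source for .candidate, 3 for .proved, a 0.2 rebut-ratio ceiling) and the rule that repeated sources count once are illustrative assumptions.

```python
from enum import Enum


class Status(Enum):
    HYPOTHESIS_ONLY = "hypothesis_only"
    CANDIDATE = ".candidate"
    PROVED = ".proved"


def promote(supports, rebuts, min_candidate=1, min_proved=3, max_rebut_ratio=0.2):
    """Return the status a claim has earned from its evidence edges."""
    independent = len(set(supports))  # the same source cited twice counts once
    against = len(set(rebuts))
    total = independent + against
    rebut_ratio = against / total if total else 0.0
    if independent < min_candidate or rebut_ratio > max_rebut_ratio:
        return Status.HYPOTHESIS_ONLY  # stays below the waterline, but is not deleted
    return Status.PROVED if independent >= min_proved else Status.CANDIDATE


print(promote([], []))
print(promote(["resume:1"], []))
print(promote(["resume:1", "posting:7", "outcome:3"], []))
print(promote(["resume:1", "posting:7", "outcome:3"], ["review:9"]))
```

Extraction never consults this gate. It only decides which claims rise above hypothesis_only, so tightening the thresholds trades coverage for precision without discarding anything.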

TASK B / flagship: jsonresume→jobs is the cleanest proving ground because abundance has built-in ground truth. Each resume and job is decomposed into unbounded typed properties — explicit skills, inferred/implicit skills, seniority, trajectory, latent traits, identity variants — anchored to ESCO/O*NET/Lightcast (GPT-4 re-ranking already lifts skill-linking RP@10 by 22+ points; arXiv 2307.03539). Matching becomes explainable, evidence-anchored relationship discovery across the network: skill-adjacency, hidden career paths, candidates no recruiter would surface. The falsifiable signal is got-interviewed/hired — a real-world reality check that re-ranks the whole claim graph, the same compounding loop that the substrate gives to genealogy, drug-repurposing, law, and science integrity below.

What's newly possible:

Generate an entity's properties in ESSENTIALLY UNBOUNDED directions for ~$0.00014/triple (GPTKB: 100M triples/$14k) — and let the LLM INVENT new predicates as it goes, not just fill a fixed schema. Cyc-style hand-authored ontologies are no longer the rate limiter; the lens set is open-ended.
Hold the firehose paraconsistently: store ~100M mutually-contradictory LLM beliefs (Mining the Mind) WITHOUT a write-time winner-pick. Every other 2025 stack (Mem0, Zep/Graphiti) is forced to invalidate-on-conflict; donto can retain both claims + their argument edges as legal state.
Re-rank old hypotheses on new evidence at substrate scale (bitemporal) — the Elo-tournament compounding that took Google's AI co-scientist from demo to Nature (2026) becomes a standing property of the knowledge base, not a one-shot pipeline run.
Cross-entity relationship discovery at the intersections of many lenses: 'machine serendipity that accumulates.' The serendipity-KG benchmark shows frontier models still top out at a ~13% serendipity hit rate (0.048-0.134) over 15.4M entities / 201.7M relations, leaving enormous headroom that a hold-everything substrate can mine and bank rather than recompute.
Decontextualized, atomic, audit-grade claims at scale: Claimify-style extraction (96.7% precision, 87.6% coverage) means the abundance is verifiable claim-by-claim, so gating-by-evidence is a real engineering knob, not a hope.
Treat contradiction itself as a queryable signal: because both sides are retained with provenance, you can rank entities/papers/people by INTERNAL inconsistency — impossible in any store that dedups on write. Directly enables science-integrity and OSINT use.
Sub-second substrate-wide retrieval over the whole firehose (donto's POST /search: 39.3M stmts, 270-820ms incl. stopwords) — abundance is only useful if it stays queryable; this is already built.
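Treating contradiction as a queryable signal, as listed above, reduces to an aggregate over retained rebut edges. A minimal sketch, assuming each edge is available as (entity, claim_a, claim_b, strength). The edge shape, the additive scoring, and the identifiers are hypothetical.

```python
from collections import defaultdict


def inconsistency_rank(rebut_edges):
    """Rank entities by the summed strength of mutually rebutting claims retained about them."""
    score = defaultdict(float)
    for entity, _claim_a, _claim_b, strength in rebut_edges:
        score[entity] += strength
    # Highest inconsistency first; ties broken by entity id for a stable order.
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))


edges = [
    ("paper:1", "claim:1", "claim:2", 0.9),
    ("paper:1", "claim:3", "claim:4", 0.4),
    ("paper:2", "claim:5", "claim:6", 0.5),
]
ranking = inconsistency_rank(edges)
print(ranking)
```

This query is only possible when both sides of each contradiction are still in the store, which is the point the list above makes about write-time dedup.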

Measurable signals:

Generation cost/abundance: ~$0.00014/triple (GPTKB v1.5: 100M triples, 6.1M entities, ~$14,136); donto-memory baseline 483 valid facts from one sentence in ~4.7 min — target: properties-per-entity emitted, % syntactically/ontologically valid (>95%).
Recall-vs-collapse: A/B donto (hold-all) vs a vector-DB and a Zep/Graphiti-style invalidate-on-conflict KG on the SAME LLM firehose — measure % of emitted minority/contradictory claims still retrievable after ingest. Target: donto 100% retained vs measurable loss in the collapsing stores (Mem0 demonstrably returns stale-but-closer facts on LoCoMo).
Promotion precision (the gate): claims promoted .candidate→.proved must hit precision ≥0.95 against held-out ground truth, borrowing Claimify's 96.7% precision / 87.6% coverage as the bar; coverage measured separately so abundance isn't penalized.
Re-ranking lift: when new evidence arrives, measure rank-correlation change of affected hypotheses and downstream accuracy gain. Touchstone: AI co-scientist Elo correlates with GPQA correctness; rare-disease agentic hypothesis-testing lifted Top-5 by >17% and recall to 41.4% with KG retrieval.
Serendipity hit rate over the substrate: replicate the RNS (relevance/novelty/surprise) measure on a held-out set; current frontier ceiling is 0.048-0.134 hit rate, 0.18-0.48 type-match. Track whether accumulated, re-ranked claims push past the 0.134 ceiling.
Flagship reality-check: got-interviewed / hired rate on explainable matches vs an embedding-only baseline; skill-linking RP@10 (GPT-4 re-rank already +22 pts over distant supervision).
Auditability: % of promoted claims with a complete provenance chain (evidence + counter-evidence edges) — target 100%, the precondition for AML/legal/clinical use (AML triage frameworks already require explicit citations + supporting/other separation).
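
Two of the signals above (recall-vs-collapse and promotion precision) reduce to set arithmetic over claim identifiers. The sketch below shows one way to compute them. The function names are hypothetical, and "coverage" here is a recall-style stand-in (share of held-out true claims that were promoted), which is an assumption and may differ from how Claimify defines coverage.

```python
def retention_rate(emitted_minority_ids, retrievable_ids):
    """Fraction of emitted minority/contradictory claims still retrievable."""
    emitted = set(emitted_minority_ids)
    if not emitted:
        raise ValueError("no minority claims were emitted; rate is undefined")
    return len(emitted & set(retrievable_ids)) / len(emitted)


def promotion_metrics(promoted_ids, true_ids):
    """Return (precision, coverage) of promoted claims vs held-out truth.

    precision = correct promoted / all promoted
    coverage  = correct promoted / all held-out true claims (assumed definition)
    Either value is None when its denominator is empty.
    """
    promoted, truth = set(promoted_ids), set(true_ids)
    correct = promoted & truth
    precision = len(correct) / len(promoted) if promoted else None
    coverage = len(correct) / len(truth) if truth else None
    return precision, coverage


# Toy collapse-delta comparison on the same emitted minority claims.
minority = ["m1", "m2", "m3", "m4"]
hold_all_store = ["m1", "m2", "m3", "m4", "majority-1"]
collapsing_store = ["m1", "m3", "majority-1"]
print(retention_rate(minority, hold_all_store))
print(retention_rate(minority, collapsing_store))
```

Keeping precision and coverage as separate return values mirrors the point made above: the gate is judged on precision, and a large unpromoted pool does not count against it.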

Concrete for donto:

Build the cross-entity relationship generator (the currently-unbuilt piece): run an LLM over pairs/clusters of high-overlap entities and emit hypothesis_only relationship claims at the lens intersections (e.g. skill-A-implies-skill-B, drug-X-repurposes-to-disease-Y). Store ALL as hypothesis_only with typed argument edges; never dedup at write.
Add a two-layer pipeline contract: (1) extraction layer = emit-everything, gate nothing (already live, 483 facts/sentence); (2) promotion layer = a Lean-4-certified rule that only flips hypothesis_only→.candidate→.proved when N independent supports-edges exist and counter-edges are below threshold. Make the precision target (≥0.95) a config knob.
Instrument a 'collapse delta' benchmark in tests/system/: ingest the same LLM firehose into donto, a vector DB, and a Graphiti-style invalidate-on-conflict graph; report % minority/contradictory claims retained and retrievable. This is the headline measurable proof of TASK A.
Make contradiction first-class in /search: add an inconsistency-rank that scores an entity by count/strength of mutually-rebutting retained claims (directly powers science-integrity + OSINT use cases).
Wire bitemporal re-ranking as a standing job: when new evidence statements land, re-score affected hypothesis_only/.candidate claims (Elo or Bayesian credence) and log rank deltas — turning the substrate into the accumulating co-scientist, not a one-shot run.
Flagship: extend donto-memory extraction lenses for resumes/jobs to emit inferred/implicit skills + latent traits + identity variants anchored to ESCO/O*NET/Lightcast IRIs; expose explainable matches with the supporting evidence chain; capture got-interviewed/hired as ground-truth evidence statements that re-rank the graph.
Provenance completeness gate: refuse to promote any claim lacking a complete evidence chain (mirrors the empty evidence_links problem already flagged in the Caroline-line kinship triples) — make 100% provenance a CI assertion.
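
The promotion layer and the provenance gate from the list above can be illustrated together. The text calls for a Lean-4-certified rule; the Python below only sketches the decision logic such a rule might encode. The `Edge` record, the knob names `min_supports` and `max_counters`, their default values, and the use of distinct sources as a proxy for "independent" support are all assumptions.

```python
from dataclasses import dataclass

# Maturity ladder named in the text, lowest rung first.
LADDER = ("hypothesis_only", ".candidate", ".proved")


@dataclass(frozen=True)
class Edge:
    """Hypothetical argument edge attached to a claim."""
    kind: str           # "supports" or "rebuts"
    source: str         # provenance identifier of the evidence
    evidence_link: str  # empty string means the provenance chain is broken


def gate_passes(edges, min_supports=3, max_counters=0):
    """True when support is sufficient, independent, and fully sourced."""
    supports = [e for e in edges if e.kind == "supports"]
    counters = [e for e in edges if e.kind == "rebuts"]
    # Provenance completeness: refuse if any edge lacks an evidence link.
    if any(not e.evidence_link for e in supports + counters):
        return False
    # Independence is approximated by counting distinct sources.
    independent_sources = {e.source for e in supports}
    return (len(independent_sources) >= min_supports
            and len(counters) <= max_counters)


def promote(state, edges, **knobs):
    """Advance a claim one rung up the ladder if the gate passes."""
    idx = LADDER.index(state)
    if idx == len(LADDER) - 1 or not gate_passes(edges, **knobs):
        return state  # already at the top, or the gate refused
    return LADDER[idx + 1]


edges = [
    Edge("supports", "source-a", "doc#1"),
    Edge("supports", "source-b", "doc#2"),
    Edge("supports", "source-c", "doc#3"),
]
print(promote("hypothesis_only", edges))
```

The extraction layer never calls this function; it writes everything as `hypothesis_only`. Tightening `min_supports` or `max_counters` is one plausible way to tune the gate toward the precision target measured on a held-out set.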

Honest constraints:

The cross-entity relationship generator is unbuilt. Emitting properties per-entity is proven (483 facts/sentence); emitting and storing relationship hypotheses across pairs/clusters at substrate scale is the open engineering work. Target: a working pairwise/cluster generator + promotion gate with measured precision ≥0.95 on a held-out set.
Abundance without a gate is noise. The substrate can HOLD everything, but consumers need a trustworthy waterline. Mitigation = the decoupled two-layer design (emit-all below as hypothesis_only; promote only on independent evidence) with Claimify-grade precision targets — make the gate, not the firehose, the contract.
Cost and storage scale with abundance. At ~$0.00014/triple, a 10x lens expansion is real money and real rows on a 39.5M-stmt Postgres box. Target: track cost-per-promoted-claim (not per-emitted-claim) as the unit-economics metric; tier hypothesis_only storage cheaply, keep promoted claims hot.
Serendipity precision is genuinely hard: the frontier ceiling is a 13.4% hit rate (0.048-0.134 across frontier models). Frame as: the substrate's job is to RETAIN and RE-RANK candidates so accumulated evidence raises that number over time, not to nail it in one pass. Falsifiable: does hit-rate rise as the evidence base grows?
Re-ranking can drift, not just revise. LLM belief updates aren't always Bayes-consistent (arXiv 2507.17951) and context accumulation causes drift. Mitigation = re-rank on EXTERNAL evidence statements with provenance, certify the promotion rule in Lean-4, and audit rank deltas — distinguish evidence-driven revision from model drift.
Garbage-provenance erodes trust fast. donto already has near-empty evidence_links on most Caroline-line kinship triples — abundance amplifies this. Make 100% provenance-completeness a hard gate for promotion (CI assertion), so unsupported abundance can never masquerade as fact.
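
The difference between cost per emitted claim and cost per promoted claim is simple arithmetic, shown here as a small sketch. The emitted-claim figures are the GPTKB v1.5 numbers quoted in this appendix; the 5% promotion rate is purely hypothetical and chosen only to make the example concrete.

```python
def cost_per_claim(total_cost_usd, claim_count):
    """Average cost in USD over a positive number of claims."""
    if claim_count <= 0:
        raise ValueError("claim_count must be positive")
    return total_cost_usd / claim_count


TOTAL_COST_USD = 14_136        # GPTKB v1.5 figure quoted in this appendix
EMITTED = 100_000_000          # triples emitted for that cost
PROMOTION_RATE = 0.05          # hypothetical share that passes the gate

promoted = int(EMITTED * PROMOTION_RATE)
per_emitted = cost_per_claim(TOTAL_COST_USD, EMITTED)
per_promoted = cost_per_claim(TOTAL_COST_USD, promoted)

print(f"per emitted claim:  ${per_emitted:.8f}")
print(f"per promoted claim: ${per_promoted:.8f}")
```

Under these assumed numbers the per-promoted cost is 20 times the per-emitted cost, which is why the gate's pass rate, not the emission price, dominates the unit economics.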

Examples / systems:

GPTKB v1.5 (Max Planck / TU Dresden) — Materialized 100M triples from 6.1M entities for ~$14,136 (~$0.00014/triple) and chose to KEEP the inconsistent firehose as a queryable KB rather than dedup to perfection — direct proof that abundant typed emission is cheap and that retaining contradictions is a deliberate, viable design. [100M triples, 6.1M entities, $14,136, ~$0.00014/triple] https://arxiv.org/pdf/2507.05740
Mining the Mind (100M beliefs): Extracted ~100M beliefs from frontier LLMs and documented systematic internal contradictions (the same model asserts conflicting claims depending on framing). This is the empirical case that emission is abundant AND abundantly contradictory, so a paraconsistent home is required. [~100M beliefs; pervasive intra-model contradiction] https://arxiv.org/pdf/2510.07024
Zep / Graphiti temporal KG (anti-pattern contrast) — State-of-the-art 2025 agent-memory graph that detects contradicting edges with an LLM and INVALIDATES the overlapping edge — the exact collapse/pick-a-winner behavior donto refuses; shows the field defaults to destroying minority claims. [bi-temporal; invalidate-on-conflict (vs donto hold-forever)] https://arxiv.org/abs/2501.13956
Serendipity Discovery in KGs for Drug Repurposing: RNS (relevance/novelty/surprise) benchmark over a 15.4M-entity / 201.7M-relation clinical KG; frontier models hit only a 0.048-0.134 serendipity hit rate, which quantifies the discovery headroom a hold-everything, re-ranking substrate can mine and accumulate. [15.4M entities, 201.7M relations; ≤13.4% serendipity hit rate] https://arxiv.org/html/2511.12472
Google DeepMind AI co-scientist (Nature 2026) — Multi-agent system that generates, then re-ranks hypotheses via an Elo tournament that improves with compute and correlates with correctness — the compounding re-rank loop donto can make a standing substrate property instead of a one-shot run. [Elo↑ with compute; correlates with GPQA correctness; lab-validated] https://www.nature.com/
Claimify (Microsoft Research) — Atomic, decontextualized claim extraction at 96.7% precision / 87.6% coverage / 99% entailment — the verification bar for the promotion gate (extract-everything below, gate-by-evidence above). [96.7% precision, 87.6% coverage, 99% entailment] https://arxiv.org/pdf/2502.10855
Rare-disease differential dx with LLMs (2025) — Agentic hypothesis-testing + KG retrieval (Orphanet/OMIM) lifted Top-5 accuracy >17% and recall to 41.4%; ChatGPT-4o 22.4% solo, 30% combined with Exomiser — shows reality-anchored re-ranking beats single-shot and that holding many differential hypotheses pays off. [Top-5 +17%, recall 41.4%, combined dx 30%] https://pubmed.ncbi.nlm.nih.gov/40776018/
OpenSanctions Pairs / LLM entity matching (OSINT) — Large-scale LLM entity matching where rule systems over-match and LLMs fail mainly on transliteration/date noise; multi-agent ER hits 94.3% on name-variation — maps onto donto's identity-as-hypothesis (keep variants as competing claims, resolve at query time). [94.3% name-variation match; complementary failure modes] https://arxiv.org/pdf/2603.11051