donto — Generative Abundance: Research Appendix
(2026-06-02)
donto —
Substrate for Generative Abundance: Research Appendix
Companion to the iteration-4 unified report. Structured output of
the 5 forward-looking research deep-dives (2026-06-02).
frontier-llm-generative-decomposition
The founder's thesis — "a frontier LLM can emit an immeasurable
amount of typed properties in any direction about a thing, with
guidance" — is no longer a metaphor in 2025-2026; it is a measured,
replicated engineering result. The cleanest proof is GPTKB: pointed at a
single CHEAP model (GPT-4o-mini) and asked to recursively elaborate
entities, it materialized 105M triples over 2.9M entities using 2,133
distinct relations and 367 classes — roughly 36 typed properties per
entity on average, for $0.00009 per correct triple (https://arxiv.org/html/2411.04920v1).
The follow-up GPTKB v1.5 repeated the elicitation with GPT-4.1, yielding ~100M beliefs (https://arxiv.org/abs/2510.07024).
The decisive part for the thesis is not the volume but the DIRECTIONS:
the model invented predicates no human schema had —
historicalSignificance (270K triples), hasArtStyle (11K), hobbies (30K)
— and 69.5% of the entities it described were NOT in Wikidata at all.
The scarce, human-bottlenecked step in every prior knowledge system
(knowledge engineers in Cyc, predefined attributes in formal concept
analysis) is the exact step a guided LLM now does for free, and it does
it along axes nobody pre-declared. "Essentially unbounded
multi-directional emission" is a fair characterization: the supply of
typed properties and the supply of NEW predicates are both effectively
elastic now.
The abundance compounds when you let the model invent the axes
themselves. AutoSchemaKG built a 900M+ node, 5.9B edge graph (ATLAS)
from 50M+ documents with ZERO predefined schema — the LLM induced
entity/event/concept types on the fly and hit 95% semantic alignment
with human-crafted schemas with no manual intervention, while preserving
93-97% of source-passage information (https://arxiv.org/html/2505.23628).
Strikingly, that was achieved with a small Llama-3-8B extractor, which
suggests the ceiling under a frontier model (GPT-5.5 at 1M-token context,
Claude Opus 4.5) is likely higher than these papers measured. This is the
"lens engine" intuition made quantitative: decompose along attributes,
parts, functions, causes, counterfactuals, comparisons, AND emergent
axes, and a single entity yields not one row but a fan of
dozens-to-hundreds of typed claims plus brand-new predicate types —
exactly the abundance donto is built to hold rather than collapse.
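The lens fan described above can be sketched in a few lines. This is an illustration only, not donto's implementation: the `LENSES` table, the `Claim` record and `decompose` are hypothetical names, and `emit` stands in for whatever guided LLM call is used (a stub here, so the example runs offline).

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical standing lens set; the last entry is the open "invent new axes" lens.
LENSES = {
    "attributes": "List typed attributes of {entity}.",
    "parts": "List the parts or components of {entity}.",
    "functions": "List what {entity} does or is used for.",
    "causes": "List causes and effects involving {entity}.",
    "counterfactuals": "List what would differ if {entity} did not exist.",
    "comparisons": "Compare {entity} with its nearest neighbours.",
    "emergent": "Propose new predicate types that describe {entity}.",
}

@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    value: str
    lens: str                         # provenance: which lens produced the claim
    status: str = "hypothesis_only"   # every emitted claim starts ungated

def decompose(entity: str,
              emit: Callable[[str], Iterable[tuple[str, str]]]) -> list[Claim]:
    """Run every lens prompt and keep all results; no dedup at this layer."""
    claims = []
    for lens, template in LENSES.items():
        prompt = template.format(entity=entity)
        for predicate, value in emit(prompt):
            claims.append(Claim(entity, predicate, value, lens))
    return claims

def stub_emit(prompt: str):
    """Offline stand-in for an LLM call: returns one fake (predicate, value) pair."""
    return [("describedBy", prompt)]

fan = decompose("Pandoc", stub_emit)
```

With a real model behind `emit`, each lens would return many pairs, so one entity yields a fan of claims that all carry their lens as provenance.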
The honest counterweight — and the reason donto's posture is right
rather than naive — is that emission breadth is now cheap but per-claim
PRECISION is not automatic. GPTKB v1.5's own headline finding is that
GPT-4.1's factual accuracy is "significantly lower than indicated by
previous benchmarks," with inconsistency, ambiguity and hallucination as
the main failure modes (https://arxiv.org/abs/2510.07024).
On precision-critical typed extraction, GPT-5 reaches only 0.616 F1
five-shot on ChemProt relation extraction — about 12 points behind
fine-tuned SOTA — and still trails specialists on disease NER, while
leading on chemical NER and reasoning-heavy QA (https://arxiv.org/abs/2509.04462).
So the 2025-2026 reality is asymmetric: GENERATION breadth is unbounded
and nearly free; per-triple TRUTH is good and improving but not
guaranteed. This is precisely the volume-reconciliation the founder
already drew — maximize at the typed-extraction layer (where breadth
wins), gate at the relationship/belief layer (where truth must be earned
by evidence). It also validates donto as the natural home: a vector DB
or normal KG must dedup and pick a winner; DEG-RAG shows LLM-built
graphs need ~70% entity reduction to be usable in a collapsing store (https://arxiv.org/html/2510.14271v1).
donto doesn't collapse — it holds the contradictory firehose as legal
bitemporal state and prunes by reality (evidence + re-ranking), which is
the only architecture that lets you keep ALL of generation's abundance
instead of throwing 70% of it away.
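The two-layer posture (maximize at extraction, gate at belief) can be sketched as below. The `Claim` record, the `evidenced` status name and the evidence anchor string are assumptions made for illustration; only `hypothesis_only` comes from this report.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    status: str = "hypothesis_only"            # every emitted claim starts here
    evidence: list[str] = field(default_factory=list)

def extract_layer(raw_claims: list[str]) -> list[Claim]:
    """Breadth layer: keep every variant, including contradictions."""
    return [Claim(text) for text in raw_claims]

def belief_layer(claims: list[Claim], min_evidence: int = 1) -> list[Claim]:
    """Truth layer: promote claims with enough evidence edges; drop nothing."""
    for claim in claims:
        if len(claim.evidence) >= min_evidence:
            claim.status = "evidenced"         # hypothetical status name
    return claims

claims = extract_layer(["joined in 2019", "joined in 2020", "knows Rust"])
claims[0].evidence.append("resume.pdf#page=1")  # hypothetical evidence anchor
claims = belief_layer(claims)
```

Both conflicting join dates survive; only the one with an evidence edge is promoted, and the other stays available for later re-ranking.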
On the "immeasurable directions" claim specifically: the strongest
evidence is internal, not behavioral. Anthropic's sparse-autoencoder
work extracted 34 million distinct interpretable features
(concepts/directions) from the residual stream of a SINGLE mid-size
model, Claude 3 Sonnet, with ~12M alive (https://transformer-circuits.pub/2024/scaling-monosemanticity/).
The number of latent axes along which a frontier model can characterize
a thing is measured in the tens of millions and grows with scale — so
when the founder says "any direction," the substrate inside the model
genuinely has tens of millions of them. The engineering task is not to
manufacture directions (they exist in superabundance) but to ELICIT the
useful ones with guidance and ANCHOR each emitted claim to evidence.
That reframes donto from "a place to store extracted facts" to "the only
substrate that can absorb generative abundance at full bandwidth and let
reality, not a dedup heuristic, decide what survives."
What's newly possible:
Pointing a single guided LLM at one entity and getting a FAN of
typed claims plus invented predicates — GPTKB averaged 36 triples/entity
across 2,133 relations from GPT-4o-mini alone, and the model coined axes
Wikidata never had (historicalSignificance, hasArtStyle, hobbies). For
donto's flagship: emit skills/roles/seniority/trajectory PLUS latent
traits and freshly-invented career-axis predicates per resume, not a
fixed feature list.
Schema-FREE construction at web scale: AutoSchemaKG built 900M+
nodes / 5.9B edges from 50M docs with the LLM inducing all types on the
fly, 95% aligned to human schemas with zero manual intervention. You no
longer need to pre-author an ontology before extraction — the
substrate's predicate set can grow itself.
Treating emitted predicates as a first-class growing vocabulary:
because models invent new relation types per-pass, donto can let ctx:*
predicate inventories EXPAND as a feature, then reconcile/align them
later (ESCO/O*NET/Lightcast for jobs) rather than forcing every claim
into a frozen schema at write time.
Holding the full firehose without collapsing: prior LLM-KG pipelines
must do ~70% entity reduction (DEG-RAG) to stay usable. A paraconsistent
bitemporal store can ingest 100% of generated abundance, keep
contradictions as legal state, and let evidence/re-ranking prune —
abundance becomes an asset instead of noise to be discarded.
Multi-directional relationship discovery at intersections: with
breadth this cheap, you can decompose two entities along dozens of axes
each and surface relationships at the crossings (skill adjacency, hidden
career paths, candidate↔︎role bridges) that no human drew — 'machine
serendipity that accumulates,' now with a per-claim cost of
~$0.0001.
1M-token context frontier models (GPT-5.5) mean an entire corpus (a
person's full resume history + a job family + ESCO branch) fits in ONE
decomposition pass, so cross-entity property emission and relationship
hypotheses can be generated jointly rather than stitched from
chunks.
Closing the loop with built-in ground truth: because emission is
cheap and donto re-ranks on new evidence, the jsonresume→jobs flagship
can treat got-interviewed/hired as live labels that continuously
re-weight which emitted properties and relationship hypotheses actually
predict outcomes.
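Discovery at intersections can be illustrated with plain set algebra. This sketch assumes each entity has already been decomposed into axis-to-values sets; the axis names and values are invented for the example, and a real system would compare aligned or embedded values rather than exact strings.

```python
def intersections(a: dict[str, set[str]], b: dict[str, set[str]]):
    """Return (axis, shared_values) for every axis on which both entities overlap."""
    hypotheses = []
    for axis in sorted(a.keys() & b.keys()):
        shared = a[axis] & b[axis]
        if shared:
            hypotheses.append((axis, sorted(shared)))
    return hypotheses

# Hypothetical decomposition output for one candidate and one role.
candidate = {
    "skills": {"python", "sql", "airflow"},
    "domains": {"logistics"},
    "working_style": {"remote"},
}
role = {
    "skills": {"sql", "dbt", "airflow"},
    "domains": {"fintech"},
    "working_style": {"remote", "hybrid"},
}

bridges = intersections(candidate, role)
```

Each returned pair is a relationship hypothesis to be stored as `hypothesis_only` and ranked later; more axes per entity mean more crossings to test.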
Measurable signals:
GPTKB: 105M triples / 2.9M entities / 2,133 relations / 367 classes
from GPT-4o-mini = ~36 typed properties per entity, $0.00009 per correct
triple, 27 hours total runtime
GPTKB novelty: 69.5% of subjects are NOT in Wikidata; only 24% have
exact Wikidata matches — the model emits along axes structured KBs never
modeled
GPTKB v1.5: ~100M beliefs recursively elicited from GPT-4.1 (single
frontier model); accuracy 'significantly lower than prior benchmarks'
(the honest precision gap)
AutoSchemaKG / ATLAS: 900M+ nodes (937.3M in ATLAS-CC), 5.9B edges,
241M entities + 696M events + 31M concepts, from 50M+ docs; 95% schema
alignment with zero manual intervention; 93-97% passage-information
preservation; +12-18% multi-hop QA, +9% factuality (FELM) — achieved
with only Llama-3-8B as extractor
Triple-extraction quality in AutoSchemaKG: precision 95-99%, recall
85-94%, F1 89-96% (LLM-as-judge, DeepSeek-V3)
GPT-5 biomedical (the precision gate): ChemProt RE 0.616 F1
five-shot (~12 pts behind fine-tuned SOTA); leads on chemical NER and
reasoning QA; trails specialists on disease NER
Internal directions: 34M interpretable SAE features from ONE model
(Claude 3 Sonnet), ~12M alive — the literal measured count of distinct
axes a single model represents
Faithfulness is engineerable: VeriFY reduces factual hallucination
9.7-53.3% with only 0.4-5.7% recall loss via structured verification
traces; SyRACT improves biomedical RE F1 by 11-41% over standard
prompting
Collapsing-store tax avoided: DEG-RAG needs ~70% entity reduction
for LLM-built KGs to perform — the abundance a paraconsistent substrate
gets to keep
Concrete for donto:
Build the cross-entity GENERATOR as a multi-lens decomposition pass:
for each entity, run guided prompts along a standing lens set
(attributes, parts, functions, causes, counterfactuals, comparisons,
analogies) PLUS an open 'invent new axes' lens; emit each result as a
typed hypothesis_only claim with the lens recorded as provenance.
Target: ≥30 typed claims/entity at ≥90% human-judged faithfulness on a
200-entity audit.
Make the predicate inventory a first-class GROWING object per
ctx:*. Let new LLM-invented predicates land as hypothesis_only
predicate-types, then align them to ESCO/O*NET/Lightcast (for jobs)
at re-rank time, not write time. Measure: % of emitted predicates
auto-aligned vs net-new, and downstream lift from keeping net-new
ones.
Adopt the founder's two-layer gate explicitly in code: extraction
layer maximizes breadth (no dedup, all variants kept as
identity-as-hypothesis), relationship/belief layer requires ≥1 evidence
edge before a claim leaves hypothesis_only. This mirrors GPTKB's breadth
+ AutoSchemaKG's precision split and avoids DEG-RAG's 70%
throwaway.
Wire VeriFY-style verification traces into opencode_extract: each
emitted claim carries a self-generated probing question + consistency
judgment; route low-confidence claims to stay hypothesis_only.
Falsifiable target: cut hallucinated claims 30%+ with <6% recall
loss, matching VeriFY's published band.
For the jsonresume→jobs flagship: per resume, emit explicit +
INFERRED/implicit skills + latent traits + identity variants as separate
typed claims anchored to ESCO/O*NET; per job, emit
required/nice-to-have/implied competencies; discover candidate↔︎role
relationships at the lens intersections; rank with got-interviewed/hired
as the re-ranking signal. Baseline to beat: embedding-cosine match;
metric: interview/hire precision@k.
Stress-test the paraconsistent claim: ingest a deliberately
contradictory entity (a resume with conflicting dates + a job with
conflicting seniority) and confirm donto holds both readings as legal
bitemporal state with supports/rebuts edges, then re-ranks correctly
when one is evidenced. This is donto's differentiator vs a vector DB
that must pick.
Exploit 1M-token context: do whole-corpus single-pass decomposition
(full resume history + a job family + relevant ESCO subtree) so
cross-entity properties and relationship hypotheses are generated
jointly. Measure relationship-discovery recall vs the chunked
baseline.
Honest constraints:
Per-claim precision is not free even when breadth is: GPT-5 hits
only 0.616 F1 on ChemProt RE and GPTKB v1.5 reports accuracy below prior
benchmarks. CHALLENGE: gate at the relationship/belief layer with
evidence edges; target ≥90% faithfulness on audited claims while keeping
100% of emitted breadth as hypothesis_only.
Invented predicates create vocabulary sprawl and near-duplicate axes
(hasHobby vs hobbies vs interests). CHALLENGE: run periodic
predicate-alignment to ESCO/O*NET/Lightcast at re-rank time; measure
net-new vs aligned ratio and prune only predicates with zero downstream
lift.
Emitted entities collide and fork (GPTKB found 69.5% of its entities
absent from Wikidata, many of them variants); without dedup you get
sprawl, with dedup you lose abundance. CHALLENGE: use donto's identity-as-hypothesis +
query-time resolution lenses so merging is non-destructive and
reversible — the architectural answer to DEG-RAG's 70%-throwaway
dilemma.
Cost and latency of multi-lens decomposition at scale: emitting 30+
claims/entity across many lenses multiplies token spend. CHALLENGE:
GPTKB shows $0.00009/correct-triple is achievable; budget a per-entity
ceiling, use cheaper models for breadth and a frontier model only for
the verification/relationship pass.
Self-verification has limits — models are over-confident on
plausible falsehoods. CHALLENGE: prefer EXTERNAL grounding
(source-anchored evidence edges, the jsonresume got-interviewed/hired
signal) over pure self-consistency; treat self-verification
(VeriFY-style) as a first filter, not the arbiter.
The cross-entity relationship GENERATOR is still the unbuilt piece —
emitting properties per-entity is solved, but generating and ranking
relationship hypotheses ACROSS entities at scale is the frontier.
CHALLENGE: exploit 1M-token context for joint multi-entity passes; first
falsifiable milestone is relationship-discovery recall vs an
embedding-cosine baseline on the jobs flagship.
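The vocabulary-sprawl constraint can be made concrete with a crude, non-destructive grouping pass. This is a sketch under stated assumptions: `canonical` is a hypothetical surface normalizer, not an alignment to ESCO/O*NET/Lightcast, and it deliberately shows what surface rules cannot do.

```python
import re
from collections import defaultdict

def canonical(predicate: str) -> str:
    """Crude surface normalisation: drop a has/is prefix, lowercase, singularise."""
    name = re.sub(r"^(has|is)(?=[A-Z])", "", predicate)
    name = name.lower()
    if name.endswith("ies"):
        name = name[:-3] + "y"
    elif name.endswith("s") and not name.endswith("ss"):
        name = name[:-1]
    return name

def group_predicates(predicates):
    """Group predicates by canonical form; originals are kept, never rewritten."""
    groups = defaultdict(list)
    for p in predicates:
        groups[canonical(p)].append(p)
    return dict(groups)

groups = group_predicates(["hasHobby", "hobbies", "interests", "hasArtStyle"])
```

`hasHobby` and `hobbies` land in one group, but `interests` does not: synonyms need a taxonomy or embedding alignment step, which is why the alignment belongs at re-rank time rather than in a write-time rule.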
Examples / systems:
GPTKB (Hu, Ghosh, Weikum — MPI) — Recursively
elicited a 105M-triple / 2.9M-entity / 2,133-relation / 367-class KB
from GPT-4o-mini at $0.00009/correct-triple; 69.5% of entities novel vs
Wikidata; invented predicates like historicalSignificance and
hasArtStyle — direct proof of unbounded multi-directional emission
[105M triples, 36 triples/entity avg, 2,133 distinct relations] https://arxiv.org/html/2411.04920v1
GPTKB v1.5 / Mining the Mind — ~100M beliefs
recursively elicited from GPT-4.1; the honest counterweight — breadth is
enormous but per-belief accuracy is below prior benchmarks,
inconsistency/ambiguity/hallucination are the main failure modes [~100M
beliefs from one frontier model] https://arxiv.org/abs/2510.07024
AutoSchemaKG / ATLAS — Schema-FREE web-scale KG:
LLM induces all entity/event/concept types on the fly; 95% alignment to
human schemas with zero manual intervention; built with only Llama-3-8B,
so frontier ceiling is far higher [900M+ nodes, 5.9B edges, 95%
schema alignment, +9% factuality] https://arxiv.org/html/2505.23628
Scaling Monosemanticity (Anthropic) — Extracted 34M
distinct interpretable features (concept-directions) from a single
model's residual stream — the measured count of axes along which one
model can characterize a thing [34M features, ~12M alive, in one
mid-size model] https://transformer-circuits.pub/2024/scaling-monosemanticity/
Benchmarking GPT-5 for biomedical NLP — The
precision gate: GPT-5 strong on chemical NER and reasoning QA but 0.616
F1 on ChemProt RE (~12 pts behind fine-tuned SOTA) — shows emission
breadth outruns guaranteed per-triple truth, justifying donto's
relationship-layer gating [ChemProt RE 0.616 F1 five-shot] https://arxiv.org/abs/2509.04462
DEG-RAG (Denoising KGs for RAG) — Collapsing stores
need ~70% entity reduction to make LLM-built KGs usable — quantifies the
abundance a paraconsistent substrate gets to KEEP instead of discard
[up to 70% entity reduction without performance loss] https://arxiv.org/html/2510.14271v1
VeriFY (factual self-verification) — Structured
verification traces cut factual hallucination 9.7-53.3% with only
0.4-5.7% recall loss — shows faithfulness is an engineerable knob, not a
fixed property, for donto's verification layer [9.7-53.3%
hallucination reduction, <6% recall loss] https://arxiv.org/pdf/2602.02018
economics-and-measurement-of-abundance
The scarce step in every prior knowledge system was generating typed
properties and relations: Cyc paid knowledge engineers per assertion,
literature-based discovery rode co-occurrence stats, formal concept
analysis needed hand-defined attributes. That bottleneck is now an
economic non-event. A guided frontier LLM emits hundreds of
self-validated typed facts per source in minutes (donto's own pipeline:
a one-sentence "Pandoc" input -> 483 valid ingested facts in ~4.7 min
on a flat-rate coding subscription), and inference cost for a fixed
capability has fallen ~10x/year for three straight years (a16z
"LLMflation"; GPT-3-quality dropped from $60/M tokens in Nov-2021 to
~$0.06/M by Nov-2024, a ~1000x decline). Epoch AI puts the per-task
decline at a median 50x/year (range 9x-900x), accelerating to a median
200x/year on post-Jan-2024 data alone. The crossover has already
happened: at DeepSeek V4-Pro rates ($0.44/$0.87 per M tokens, made
permanent May-2026) or Gemini 2.5 Flash ($0.30/$2.50), decomposing an
entity along dozens of directions costs a few cents at most. Concretely:
emitting ~500 typed properties for one entity (~5K output tokens) costs
~$0.004-0.04. "Generate everything about everything" is no longer a
thought experiment — it is a line item you can budget. This is donto's
foundational tailwind: generation abundance is now cheaper than the
human curation it replaces by 3-4 orders of magnitude.
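The per-entity figure can be checked with back-of-envelope arithmetic. The sketch below uses only the output-token prices and the ~5K-token estimate quoted above; it ignores input tokens, retries and verification passes, so real spend will be somewhat higher.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of the generated tokens only, at a flat per-million price."""
    return output_tokens * price_per_million_usd / 1_000_000

TOKENS_PER_ENTITY = 5_000   # ~500 typed properties, per the estimate in the text
OUTPUT_PRICES = {           # $ per million output tokens, as quoted in the text
    "DeepSeek V4-Pro": 0.87,
    "Gemini 2.5 Flash": 2.50,
}

for model, price in OUTPUT_PRICES.items():
    per_entity = output_cost_usd(TOKENS_PER_ENTITY, price)
    corpus = per_entity * 1_000_000   # a 1M-entity corpus
    print(f"{model}: ${per_entity:.5f} per entity, ${corpus:,.0f} per 1M entities")
```

Output tokens alone come to roughly $0.004 and $0.0125 per entity, which sits inside the ~$0.004-0.04 band once some input and overhead are allowed for.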
A critical 2026 nuance keeps the spine honest: headline API prices
are BIFURCATING even as cost-per-capability keeps falling 5-10x/year
(Epoch). Western labs raised premium reasoning-model prices in May-2026
(GPT-5.5 doubled to $2.50/$15; Gemini 3.5 Flash 3x'd to $1.50/$9) while
efficiency-leaders (DeepSeek) drove to the floor. The takeaway for a
builder: the abundance thesis is real but you must ENGINEER for the
floor — route bulk typed-property emission to the cheapest capable model
(the donto GLM/DeepSeek-tier extraction path), and reserve premium
reasoning tokens for the high-value relationship-hypothesis and
re-ranking steps. Cost is now a steering variable, not a wall.
The harder, more valuable frontier is MEASUREMENT — because once
generation is free, the scarce resource becomes knowing WHICH generated
properties are worth keeping. Accuracy alone is the wrong yardstick: a
true-but-redundant fact has near-zero value. The right metrics, all live
in 2024-2026 research, are (a) downstream TASK-PERFORMANCE LIFT — does
adding the emitted structure move a real metric? GraphRAG-style KGs
win 72-83% of comprehensiveness comparisons against vector RAG, and
better-constructed graphs add +12.8 QA points; this is donto's gold standard, and the
jsonresume->jobs flagship has a built-in one: got-interviewed /
hired; (b) information gain / Bayesian surprise — how much a new
property shifts the posterior over an entity, which doubles as the
steering wheel for active generation (BED-LLM and
Uncertainty-of-Thoughts choose what to generate next by maximizing
expected information gain); (c) novelty/diversity — measured as harmonic
mean of originality (fraction of unseen n-grams) and quality,
embedding-distance diversity, and the proven result that AI-generated
research ideas are rated MORE novel than expert humans (p<0.05,
Stanford 100+ researcher study) though less diverse — directly
validating "machine serendipity that accumulates"; and (d)
coverage/completeness benchmarks like MINE (Feb-2025), where KGGen beat
OpenIE/GraphRAG by 18% on representing source text.
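The novelty metric in (c) is simple to compute once its two inputs exist. A minimal sketch, assuming whitespace tokenization, bigrams, and a quality score in [0, 1] supplied by some external judge (all assumptions, not details from the cited work):

```python
def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def originality(candidate: str, reference: list[str], n: int = 2) -> float:
    """Fraction of the candidate's n-grams that never occur in the reference texts."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen = set().union(*(ngrams(r, n) for r in reference))
    return len(cand - seen) / len(cand)

def novelty(orig: float, quality: float) -> float:
    """Harmonic mean: high only when originality AND quality are both high."""
    if orig + quality == 0:
        return 0.0
    return 2 * orig * quality / (orig + quality)

reference = ["pandoc converts markdown to html"]
orig = originality("pandoc converts markdown to docx", reference)
score = novelty(orig, quality=0.75)   # quality value is a made-up example
```

The harmonic mean punishes lopsided claims: a fully original but worthless claim, or a high-quality restatement of known text, both score near zero.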
Model collapse is the one real risk to a self-growing knowledge base,
and the 2024-2026 literature has already de-fanged it for donto's exact
architecture. Collapse only happens under REPLACEMENT (training on
synthetic data while discarding real); error then grows roughly linearly
with iterations (Shumailov). Under ACCUMULATION — keeping all real +
synthetic data forever — error is provably BOUNDED, not divergent
(Gerstgrasser et al. 2024; the variance converges to a finite limit
independent of iteration count). donto is an accumulation system BY
CONSTRUCTION: bitemporal, paraconsistent, evidence-anchored, it never
overwrites or dedups; every claim keeps its provenance and
counter-evidence. The collapse-avoidance recipe the field converged on —
accumulate + verify/curate (selection on synthetic data "significantly
enhances performance" especially where verifiers exist) — IS donto's
claim lifecycle: emit abundantly at the typed-extraction layer, then
gate/rank at the relationship layer against evidence and reality. The
contradiction-preserving substrate is not just compatible with
generative abundance; it is the provably-safe container for it.
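The replacement-versus-accumulation contrast shows up even in a toy model. The simulation below is not the cited paper's setup: it fits only the mean of a unit-variance Gaussian, with made-up sizes, and compares how far the estimate drifts from the true mean of 0 under each regime.

```python
import random
import statistics

def final_error(accumulate: bool, generations: int, n: int,
                rng: random.Random) -> float:
    """Refit a Gaussian mean each generation; return the final estimate.

    The true mean is 0, so the returned value is the accumulated error.
    """
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]          # real data
    for _ in range(generations):
        mean = statistics.fmean(data)
        synthetic = [rng.gauss(mean, 1.0) for _ in range(n)]
        # Accumulation keeps everything; replacement discards the past.
        data = data + synthetic if accumulate else synthetic
    return statistics.fmean(data)

def error_variance(accumulate: bool, trials: int = 300,
                   generations: int = 40, n: int = 20) -> float:
    rng = random.Random(0)   # fixed seed so the run is repeatable
    return statistics.pvariance(
        [final_error(accumulate, generations, n, rng) for _ in range(trials)]
    )

var_replace = error_variance(accumulate=False)
var_accumulate = error_variance(accumulate=True)
```

Under replacement the estimate performs a random walk, so its variance grows with every generation; under accumulation each new batch moves the pooled mean by a shrinking step, so the variance levels off.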
What's newly possible:
Budget 'decompose this entity along 50 directions' as a line item:
~500 typed properties per entity (~5K output tokens) now costs
~$0.004-0.04 at DeepSeek/Gemini-Flash rates. A 1M-resume corpus fully
decomposed = roughly $4K-40K of inference, not a knowledge-engineering
department-decade. The Cyc-style human bottleneck is gone by 3-4 orders
of magnitude.
Active, uncertainty-STEERED generation: instead of emitting blindly,
use expected-information-gain (BED-LLM, Uncertainty-of-Thoughts 2024-25)
to spend the next generation budget where the posterior over an entity
is most uncertain — turning measurement from a backward judge into a
forward steering wheel ('generate more where it pays').
A self-growing knowledge base that provably does NOT collapse:
because donto ACCUMULATES (bitemporal, never overwrites) rather than
replaces, error is bounded by theorem (Gerstgrasser 2024), not divergent
— so donto can safely ingest its own and other models' generated claims
indefinitely, which a vector DB or normal KG (which must dedup/pick)
cannot.
Predicate invention as a runtime primitive: frontier LLMs can INVENT
new relation types mid-extraction, so the schema grows with the data.
Measure each invented predicate by its downstream task lift and
information gain; keep the ones that pay. This was impossible when
attributes had to be pre-defined (FCA) or hand-engineered (Cyc).
Novelty as a generatable, measurable product: AI ideas now rate MORE
novel than expert humans (Stanford p<0.05). donto can emit
relationship hypotheses, score each by novelty (harmonic mean of
unseen-n-gram originality x evidence-backed quality) AND by Bayesian
surprise, and surface only the high-surprise/high-evidence ones —
'machine serendipity that accumulates,' now with a number on it.
Task-lift A/B as the universal value gate: because generation is
cheap, you can afford to generate a property, measure whether it
improves a real downstream task (interview rate, QA accuracy, link
prediction), and keep or kill it empirically — closing the loop that
pre-LLM systems could never afford to run at scale.
Cost-tiered cognition routing: emit the firehose of typed properties
on the floor-priced model (DeepSeek/GLM tier, ~$0.5/M) and spend premium
reasoning tokens (GPT-5.x/Gemini-Pro tier) only on
relationship-hypothesis generation and re-ranking — making 'generate
everything' AND 'reason hard about the best bits' simultaneously
affordable.
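Uncertainty-steered generation reduces to ranking candidate directions by expected information gain. The sketch below is the generic discrete formulation (prior entropy minus expected posterior entropy), not the BED-LLM or Uncertainty-of-Thoughts code; the hypotheses, directions and likelihood tables are invented for illustration.

```python
import math

def entropy(p: list[float]) -> float:
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior: list[float],
                              likelihood: list[list[float]]) -> float:
    """likelihood[h][a] = P(answer a | hypothesis h). Returns EIG in bits."""
    gain = entropy(prior)
    for a in range(len(likelihood[0])):
        joint = [prior[h] * likelihood[h][a] for h in range(len(prior))]
        p_answer = sum(joint)
        if p_answer > 0:
            posterior = [j / p_answer for j in joint]
            gain -= p_answer * entropy(posterior)
    return gain

prior = [0.5, 0.5]   # e.g. "senior engineer" vs "junior engineer"
directions = {
    "years_of_experience": [[0.9, 0.1], [0.1, 0.9]],   # answers separate the hypotheses
    "favourite_editor":    [[0.5, 0.5], [0.5, 0.5]],   # answers are uninformative
}

# EIG-ordered "what to generate next" queue.
queue = sorted(directions,
               key=lambda d: expected_information_gain(prior, directions[d]),
               reverse=True)
```

In practice the likelihood tables would themselves be estimated, for instance by sampling the model, so the ordering is only as good as those estimates.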
Measurable signals:
Inference cost for fixed capability: ~10x/year decline 3yrs running
(a16z); GPT-3-quality $60/M (Nov-2021) -> $0.06/M (Nov-2024) =
~1000x. Epoch median 50x/year per task, range 9x-900x, accelerating to
median 200x/year on post-Jan-2024 data.
Per-entity generation cost target: <$0.05 to emit ~500 typed
properties (~5K output tokens) at DeepSeek V4-Pro $0.87/M or Gemini 2.5
Flash $2.50/M output; <$0.005 at floor rates. Corpus target: full
1M-resume decomposition <$40K.
Cost bifurcation to engineer around: cost-per-capability still
falling 5-10x/year (Epoch) but headline premium prices rose May-2026
(GPT-5.5 $2.50/$15, Gemini 3.5 Flash $1.50/$9) while DeepSeek held floor
($0.44/$0.87). Route bulk emission to the floor.
Downstream task lift (the gold metric): GraphRAG 72-83%
comprehensiveness vs vector RAG; +12.8 QA points from better KG
construction; 3.4x accuracy in enterprise scenarios (Microsoft 2024).
donto target: each kept property class must show measurable lift on a
held-out task.
Coverage/completeness: MINE benchmark (Feb-2025) — KGGen beat
OpenIE/GraphRAG by 18% on faithfully representing source text. Target a
published MINE-style score per extraction pass.
Novelty: AI-generated research ideas rated MORE novel than 100+
expert humans (p<0.05, Stanford 2024), though lower diversity — so
MEASURE and optimize diversity explicitly (distinct-n, Self-BLEU,
embedding-distance, NovelSum which correlates with downstream tuning
performance).
Model-collapse boundary conditions: REPLACEMENT -> error grows
~linearly with iterations; ACCUMULATION -> bounded finite error
independent of iteration count (Gerstgrasser 2024). Strong Model
Collapse: as little as 1-per-1000 synthetic fraction can degrade a
REPLACEMENT regime — irrelevant to accumulation, but the reason
verification/curation is mandatory.
Synthetic-data regime rule: synthetic helps when real data is scarce
(<=~1,024 samples in the study); degrades when real data is abundant.
donto rule-of-thumb: weight generated claims down where dense real
evidence already exists, up where the entity is sparse.
donto live baseline to beat: 483 valid+ingested facts from a
1-sentence input in ~4.7 min (one opencode/GLM pass); '697 facts from
cat-is-red'. Track facts/min, valid-fact %, and downstream-lift-per-fact
as the core production dashboard.
Concrete for donto:
Add a per-claim value score column populated by three measurable
signals: (1) information_gain (posterior shift over the entity from
adding this claim), (2) novelty (harmonic mean of unseen-n-gram
originality x evidence-backed quality), (3) downstream_task_lift (filled
in retroactively when a claim participates in a successful task — e.g.,
a resume property present in a got-interviewed match). Re-rank on these,
not on accuracy alone.
Build the active-generation steering loop: after each extraction
pass, compute per-entity posterior uncertainty and re-spend the next
generation budget on the highest expected-information-gain directions
(BED-LLM / Uncertainty-of-Thoughts pattern). Expose an EIG-ordered 'what
to generate next' queue. Turns the firehose into a guided drill.
Implement cost-tiered routing in opencode_agent.py / extraction.py:
bulk typed-property emission on the floor model (GLM/DeepSeek tier),
relationship-hypothesis generation + re-ranking on a premium reasoning
model. Log $/fact and $/kept-fact per tier; target <$0.05 per
~500-property entity.
Formalize donto's accumulation guarantee as a design invariant and
SAY it: 'donto avoids model collapse by construction (Gerstgrasser-2024
accumulation regime) — it never replaces or dedups, so ingested
generated claims have bounded, not divergent, error.' This is a
defensible moat vs vector/normal-KG competitors who must collapse.
Add a verification/curation gate at the relationship layer (NOT the
extraction layer): generate abundantly into hypothesis_only, then
promote only claims that pass evidence-attachment + pass a
downstream-lift or information-gain threshold. This is the field's
proven collapse-avoidance recipe (accumulate + verify) mapped onto the
8-step lifecycle.
Adopt MINE-style coverage scoring as a CI metric on the extraction
pipeline: every pass reports a coverage score against source text;
regression-test that new extraction prompts/models don't drop coverage.
Target beating KGGen's 18%-over-baseline.
For the jsonresume->jobs flagship, wire got-interviewed / hired
as the ground-truth task-lift label feeding back into claim value
scores: emitted/inferred skills and latent-trait properties that
correlate with positive outcomes get up-ranked; those that never help
get pruned. This is the rare system with built-in measurable downstream
truth — exploit it as the canonical proof of the abundance thesis.
Track a diversity metric on generated relationship hypotheses
(distinct-n / embedding-distance / NovelSum) and optimize it explicitly,
because the literature shows LLMs are novel-but-low-diversity — donto
should counter that with multi-lens decomposition + diversity-aware
sampling so 'serendipity that accumulates' actually spans the
space.
Weight generated claims by real-evidence density per entity:
down-weight where dense real evidence exists (synthetic degrades in
abundant-real regime), up-weight for sparse entities (synthetic helps in
scarce-real regime, <=~1K samples). A simple per-entity
evidence-count feature implements the field's synthetic-data regime
rule.
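The value-score and evidence-density items above can be combined in one small function. This is a sketch under stated assumptions: the product form follows the "info-gain x novelty x downstream-lift" phrasing in this report, while `NEUTRAL_LIFT`, the weighting formula and the use of 1,024 as a threshold are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoredClaim:
    claim_id: str
    information_gain: float                        # assumed scaled to [0, 1]
    novelty: float                                 # harmonic-mean novelty, [0, 1]
    downstream_task_lift: Optional[float] = None   # filled in retroactively

NEUTRAL_LIFT = 0.5   # assumed placeholder until an outcome label arrives

def evidence_weight(real_evidence_count: int, scarce_threshold: int = 1024) -> float:
    """Down-weight generated claims as real evidence about the entity grows."""
    return scarce_threshold / (scarce_threshold + real_evidence_count)

def value_score(claim: ScoredClaim, real_evidence_count: int) -> float:
    lift = (NEUTRAL_LIFT if claim.downstream_task_lift is None
            else claim.downstream_task_lift)
    return (claim.information_gain * claim.novelty * lift
            * evidence_weight(real_evidence_count))

claims = [
    ScoredClaim("inferred-skill", 0.8, 0.5, downstream_task_lift=1.0),
    ScoredClaim("latent-trait", 0.8, 0.5),          # no outcome label yet
]
ranked = sorted(claims, key=lambda c: value_score(c, real_evidence_count=0),
                reverse=True)
```

A product means any zero signal zeroes the claim; a weighted sum would be gentler, and which is right is an empirical question for the task-lift A/B.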
Honest constraints:
Precision/value gate is the real work, not generation. Once facts
are free, the scarce resource is deciding which to keep. CHALLENGE: ship
the per-claim value score (info-gain x novelty x downstream-lift) and
prove that gating on it beats keep-everything on a real task. TARGET:
kept-claim set delivers >=90% of full-firehose task-lift at <20%
of the claims.
Headline-price bifurcation means 'cost->0' is not automatic for
premium reasoning. CHALLENGE: keep bulk emission on floor models
(DeepSeek/GLM tier) and prove premium tokens are spent only where they
pay. TARGET: <$0.05 per ~500-property entity; <30% of total spend
on premium-tier reasoning.
Model collapse is real under replacement and verification is
mandatory under accumulation (selection 'significantly enhances
performance'). CHALLENGE: enforce the verify/curate gate at the
relationship layer and never train downstream models on un-curated
self-generated claims. TARGET: measured downstream task-lift is
flat-or-up across N self-ingestion cycles (not declining).
Synthetic data helps only in the scarce-real regime and degrades in
the abundant-real regime. CHALLENGE: implement per-entity
evidence-density weighting so generated claims don't drown dense real
evidence. TARGET: no task-lift regression on entities with abundant real
evidence after adding generated claims.
LLMs are novel-but-LOW-diversity, so naive abundance produces
correlated near-duplicates. CHALLENGE: multi-lens decomposition +
diversity-aware sampling, measured by
distinct-n/embedding-distance/NovelSum. TARGET: maintain a target
diversity coefficient as volume scales; reject passes that collapse
it.
The cross-entity relationship generator (the 'discovery at
intersections' engine) is the unbuilt high-value piece. CHALLENGE:
generate relationship hypotheses across entities and rank by evidence +
Bayesian surprise. TARGET (flagship): surfaced hidden-candidate /
skill-adjacency edges that beat the existing matcher on got-interviewed
rate by a measurable margin in an A/B.
Measuring information gain / Bayesian surprise over an evolving
substrate is non-trivial at 39.5M-statement scale. CHALLENGE: an
efficient, incremental posterior-shift estimate per claim (approximate,
embedding- or count-based) that runs in the ingest path. TARGET: <X
ms per claim so it doesn't bottleneck the firehose.
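The first TARGET above (at least 90% of full-firehose lift from under 20% of the claims) is easy to turn into a pass/fail check. A minimal sketch, assuming each claim already has a value score and a measured lift contribution; the two synthetic claim sets are invented to show one pass and one fail.

```python
def gate_report(claims, keep_top):
    """claims = [(value_score, measured_lift)]. Keep the keep_top highest scores.

    Returns (share_of_claims_kept, share_of_total_lift_retained).
    """
    ranked = sorted(claims, key=lambda c: c[0], reverse=True)
    kept = ranked[:keep_top]
    total_lift = sum(lift for _, lift in ranked)
    kept_lift = sum(lift for _, lift in kept)
    claim_share = len(kept) / len(ranked)
    lift_share = kept_lift / total_lift if total_lift else 0.0
    return claim_share, lift_share

def meets_target(claims, keep_top, max_claim_share=0.20, min_lift_share=0.90):
    claim_share, lift_share = gate_report(claims, keep_top)
    return claim_share < max_claim_share and lift_share >= min_lift_share

# Lift concentrated in a few high-scoring claims: the gate can work.
concentrated = [(0.9, 1.0)] * 10 + [(0.1, 0.01)] * 90
# Lift spread evenly: no 19-claim subset can carry 90% of it.
uniform = [(0.5, 1.0)] * 100

ok = meets_target(concentrated, keep_top=19)
not_ok = meets_target(uniform, keep_top=19)
```

The target is therefore also a claim about the data: it is reachable only if lift is concentrated and the value score finds where it is concentrated.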
Examples / systems:
a16z LLMflation — Coined the ~10x/year
inference-cost-for-fixed-capability decline; GPT-3-quality $60/M (2021)
-> $0.06/M (2024), ~1000x in 3 years [10x/year; 1000x over 3
years] https://a16z.com/llmflation-llm-inference-cost/
Epoch AI inference price trends — Per-task price
decline median 50x/year (range 9x-900x), accelerating to median
200x/year on post-Jan-2024 data; cost-per-capability still falling
5-10x/year even as 2026 headline prices rose [median 50x/yr, up to
900x/yr] https://epoch.ai/data-insights/llm-inference-price-trends
Gerstgrasser et al. 2024 (Accumulating data) —
Model collapse happens under data REPLACEMENT (error grows linearly) but
is AVOIDED under ACCUMULATION (bounded error independent of iteration
count) — across text/molecule/image generative models
[linear-divergence vs bounded finite error] https://arxiv.org/pdf/2404.01413
KGGen + MINE benchmark (Feb 2025) — LLM KG
extractor with clustering-based entity resolution; MINE is the first
benchmark for how well a KG represents source text [+18% over
OpenIE/GraphRAG] https://arxiv.org/abs/2502.09956
Si, Yang, Hashimoto (Stanford 2024) — Can LLMs Generate
Novel Research Ideas? — 100+ NLP researcher human study; AI
ideas rated MORE novel than expert humans but lower diversity [AI
> human novelty, p<0.05]https://arxiv.org/pdf/2409.04109
Microsoft GraphRAG — Entity-relation graphs +
community summaries lift downstream QA over vector RAG — the canonical
'does emitted structure improve a real task' result [72-83%
comprehensiveness; +12.8 QA pts; 3.4x enterprise accuracy]https://arxiv.org/abs/2501.00309
DeepSeek V4-Pro / Gemini 2.5 Flash pricing —
Floor-tier rates that make per-entity abundant generation cost
single-digit cents [DeepSeek $0.44/$0.87; Gemini Flash $0.30/$2.50
per M]https://www.tldl.io/resources/llm-api-pricing-2026
Lightcast Open Skills / ESCO — Anchoring taxonomies
for the jsonresume->jobs flagship: 34,000+ Lightcast skills (updated
biweekly); ESCO 13,939 skills x 3,039 occupations [34K skills /
13,939 x 3,039]https://lightcast.io/open-skills/extraction
jsonresume-jobs-abundance
The scarce step in every prior matching system was emitting typed
properties about people and jobs — and 2024-2026 evidence shows that
bottleneck has collapsed. Most required skills in a job posting are
expressed implicitly, not as keywords, and a frontier LLM now
extracts them better than the entire prior supervised state of the art:
zero-shot GPT-4 ESCO skill matching beat the previous best (Decorte et
al.) by +22.33 and +29.75 percentage points on RP@10 (arXiv:2307.03539).
That is the abundance thesis made measurable — a guided LLM can emit
competencies, seniority, trajectory-implied capabilities, working-style
signals, and transferable-skill bridges that keyword/embedding pipelines
structurally cannot see. And it changes ranking, not just recall: LLM
re-ranking (ConFit v3) lifts nDCG@10 from 52.33 to 61.37 (+9.04 pp) over
the strongest embedding baseline, ConFit v2, on a real 49K-resume
recruiting set (+7.81 pp when averaged with Recall@10), and the
explainable Synapse system reports +22% nDCG@10 over embedding-only
retrieval. So abundance is not noise: more typed properties → better,
explainable matches.
The deepest finding for donto is architectural. The best 2026
explainable matcher, JobMatchAI (arXiv:2603.14558), wins by strictly
separating a deterministic scoring layer from a generative explanation
layer — the LLM "can explain a ranking but never inflate one,"
yielding 100% faithful explanations (0% unsupported claims), 70.5%
top-factor mention, and 94.5% weakness-surfacing, all at 82ms median.
This is precisely donto's split: let the LLM emit an unbounded firehose
of typed, evidence-anchored claims (HAS_SKILL, IMPLIES_COMPETENCY,
BRIDGES_TO, hypothesis_only trajectory inferences), hold the
contradictory ones forever as legal paraconsistent state, then
gate at the relationship/ranking layer with deterministic,
auditable utility — and have the LLM explain only what the evidence
already supports. A vector DB must collapse to one embedding; a normal
KG must dedup to one canonical skill. donto is the only home that can
keep "claims this person can do Kubernetes (inferred from 3 years of
Docker + Terraform)" alongside "no direct Kubernetes evidence" as
separate evidence-bearing claims, and re-rank when an interview outcome
arrives.
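The "separate evidence-bearing claims" idea can be sketched as an append-only store in which a conflicting claim never overwrites an earlier one. This is a minimal illustration, not donto's schema: the `Claim` and `ClaimStore` names, the `polarity` field, and the evidence identifiers are assumptions made for the example; only HAS_SKILL and hypothesis_only come from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    polarity: bool              # True asserts the triple, False denies it
    evidence: tuple[str, ...]   # ids of the spans or sources backing the claim
    hypothesis_only: bool = True


@dataclass
class ClaimStore:
    claims: list[Claim] = field(default_factory=list)

    def add(self, claim: Claim) -> None:
        # Append-only: nothing is deduplicated or invalidated at write time.
        self.claims.append(claim)

    def about(self, subject: str, predicate: str, obj: str) -> list[Claim]:
        key = (subject, predicate, obj)
        return [c for c in self.claims if (c.subject, c.predicate, c.obj) == key]

    def is_contested(self, subject: str, predicate: str, obj: str) -> bool:
        # Contested means both an asserting and a denying claim are held.
        return {c.polarity for c in self.about(subject, predicate, obj)} == {True, False}
```

A ranking layer would read both sides and their evidence; the store itself never picks a winner.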
The network effects compound at jsonresume scale. The career-mobility
literature now grounds next-role prediction to standard taxonomies at
volume — KARRIEREWEGE+ (100K resumes → 3,039 ESCO occupations, MRR
43.58, arXiv/COLING-2025) — exactly the skill-adjacency and career-path
graph that emerges once millions of resumes are decomposed into typed
claims. LinkedIn's economic graph (800M+ members; the skill sets
required for jobs have changed ~25% since 2015, a figure projected to
double by 2027; skill-adds up 140% since 2022) and Lightcast Open
Skills (32K+ skills mined from 1B+
postings, refreshed biweekly) prove the demand and the moat — but they
are closed and embedding-collapsed. An open jsonresume claim-substrate,
anchored to the now-official ESCO↔︎O*NET crosswalk plus Lightcast/ESCO,
can do the one thing the incumbents cannot: expose why a
non-obvious candidate fits, as a checkable evidence chain, and improve
it with built-in ground truth (got-interviewed / hired / retained). Cost
is the main scaling constraint, and it is falling fast: a million
resume-extractions run $72 (small open models) to ~$9,000 (GPT-4o),
batch APIs cut that 50%, and per-token prices fell ~80% in the last year
— so full-firehose extraction over millions of resumes is already an
O($1K-10K) line item, not a research project.
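The cost range can be checked with a few lines of arithmetic. The sketch below assumes the 2.4K tokens per document quoted in the measurable signals for this section; the function names are hypothetical, and the blended per-token rate is derived from the quoted totals rather than taken from any price list.

```python
def extraction_cost(n_docs: int, tokens_per_doc: int,
                    usd_per_million_tokens: float,
                    batch_discount: float = 0.0) -> float:
    """Total USD for one extraction pass over n_docs documents."""
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens * (1 - batch_discount)


def implied_price_per_million_tokens(total_usd: float, n_docs: int,
                                     tokens_per_doc: int) -> float:
    """Blended USD per million tokens implied by a quoted total."""
    return total_usd / (n_docs * tokens_per_doc / 1_000_000)
```

Under those assumptions the $9,000 figure implies about $3.75 per million tokens and the $72 figure about $0.03; a 50% batch discount brings the upper figure to $4,500 per million resumes.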
What's newly possible:
Emit the IMPLICIT skill graph that keyword/embedding matching
structurally misses: most required skills in a posting are never stated
explicitly, and zero-shot LLMs now extract them +22-30 pp (RP@10) above
the prior supervised best (arXiv:2307.03539) — so a resume's 'real'
competency set can be 3-10x larger than its listed skills, anchored to
ESCO codes.
Generate typed transferable-skill BRIDGE claims across domains (e.g.
'client-facing ops 6yr ⇒ stakeholder-management + incident-comms' or
'competitive-StarCraft ⇒ real-time resource-allocation') as first-class
evidence-anchored edges, surfacing candidates who never held the title —
the 'hidden candidate' Eightfold markets but cannot make auditable.
Decompose trajectory, not just snapshot: ground each resume to a
career-path graph (KARRIEREWEGE+ style, 3,039 ESCO occupations) and emit
'next-role-ready' / 'over-qualified' / 'stretch-fit' as hypothesis_only
claims, re-ranked as outcomes arrive.
De-conflate PREFERENCE vs QUALIFICATION as two separate typed claim
streams (arXiv:2602.03097) — donto holds 'wants executive role' and
'qualified for executive role' as distinct, possibly contradictory
claims instead of one blended score.
Hold contradictory identity/skill claims paraconsistently: 'senior
per title' vs 'junior per tenure', or duplicate-profile variants, live
side-by-side with evidence — query-time entity-resolution lenses decide
per use, no destructive dedup.
Explanation that is provably faithful, not generated spin:
deterministic scoring layer + LLM-explains-only-supported-evidence
(JobMatchAI: 100% faithful, 0% unsupported, 94.5% weakness-surfacing) —
a compliance-grade 'why this match' that incumbents' black-box
embeddings can't produce.
Self-growing skill-adjacency ontology: the LLM can INVENT new
predicates/skill-relations it observes across millions of resumes
(RELATED_TO, IMPLIES, OBSOLETED_BY) rather than being limited to a
pre-frozen 32K-skill list — the taxonomy grows itself and prunes by
hire-outcome reality.
Built-in falsifiable ground truth at population scale: every match
carries got-interviewed/hired/retained outcomes, so the substrate
continuously re-ranks which inferred/implicit/bridge claims actually
predict success — turning abundance into a self-calibrating engine.
Measurable signals:
Implicit-skill lift: zero-shot GPT-4 ESCO matching RP@10 = 61.02
(House) / 68.94 (Tech) vs prior supervised best 38.69 / 39.19 — +22.33 /
+29.75 pp (arXiv:2307.03539). Target for donto extractor: ≥ this on a
held-out jsonresume↔︎ESCO set.
LLM re-ranking > embeddings: ConFit v3 nDCG@10 61.37 / Recall@10
68.89 vs ConFit v2 52.33 / 62.30 (+7.81 pp avg) on 10,597 jobs × 49,398
resumes (arXiv:2605.09760).
Explainable ensemble: Synapse +22% nDCG@10 over embedding-only
retrieval; evolutionary loop >60% relative gain on recommender scores
(arXiv:2604.02539).
Career-path prediction grounded to taxonomy: KARRIEREWEGE+ 100K
resumes → 3,039 ESCO occupations, best MRR 43.58 (COLING-2025 industry).
Target: beat on next-role R@5 using full claim-set vs skills-only.
Human-LLM resume rating: GPT-4 ratings correlate only weakly with
human ratings on 736 real submissions, and the LLM shows NO larger
demographic group differences than humans (Findings of NAACL 2025,
2025.findings-naacl.270), so fairness is measurable and not worse than
the status quo.
Network scale / demand: LinkedIn 800M+ members; job skill sets
changed ~25% since 2015, a figure projected to DOUBLE by 2027;
skill-adds up 140% since 2022. Lightcast 32K+ skills from 1B+ postings,
refreshed biweekly.
TalentCLEF 2025 open benchmark (Zenodo): best multilingual job-title
match MAP 0.534; best job-title→skill MAP 0.360 — a public leaderboard
donto can post to.
Extraction cost: $72 (Qwen3-4B) to ~$9,000 (GPT-4o) per 1M docs @
2.4K tok; batch API -50%; per-token prices fell ~80% in 12 months —
full-firehose over millions of resumes is O($1K-10K).
Built-in outcome ground truth: precision/recall of 'inferred/bridge
claims' measured against got-interviewed / hired / 6-month-retained,
re-computed bitemporally as outcomes land.
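Several of the signals above are reported as nDCG@10. For readers who want to reproduce the metric on a held-out set, here is a minimal sketch using linear gain and a log2 rank discount. Two assumptions to note: the cited papers may use a different gain function, and the ideal ordering here is computed only over the retrieved list rather than over every relevant item in the corpus.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items, rank 1 first."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the same items in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

`relevances` is the graded relevance of each returned candidate in ranked order, for example 1.0 for a candidate who got interviewed and 0.0 otherwise.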
Concrete for donto:
Build the jsonresume Abundance Extractor: a guided multi-lens
opencode/GLM pass that emits, per resume, typed claims across ~10
directions — explicit skills, IMPLIED competencies
(Docker+Terraform⇒Kubernetes-ready), seniority-from-trajectory,
transferable BRIDGES, working-style/context signals, latent traits
(flagged low-confidence), identity variants — every claim
ESCO/O*NET/Lightcast-anchored with an evidence_link back to the resume
span.
Mirror JobMatchAI's split inside donto: a DETERMINISTIC scoring
layer (Jaccard + KG-relatedness + experience-distance + the dual
preference/qualification scores) reads claims; the LLM generates ONLY
evidence-supported explanations. Enforce '0% unsupported claims' as a
Lean-4-checkable shape on the explanation step.
Store skill-adjacency and career-path as bitemporal typed edges
(RELATED_TO, IMPLIES, NEXT_ROLE, OBSOLETED_BY) that accumulate across
all resumes — the 'machine serendipity that accumulates' becomes the
population skill-graph; let the extractor PROPOSE new predicates, gated
before they join the canonical lens.
Make got-interviewed/hired/retained a bitemporal outcome claim that
triggers re-ranking: each outcome updates the measured precision of the
inferred/bridge claim TYPES that fed the match — the substrate learns
which abundance directions actually predict hiring.
Implement preference vs qualification as two separate claim contexts
(ctx:jobs/preference/, ctx:jobs/qualification/) so a candidate
can be 'wants X' and 'not-yet-qualified-for-X' simultaneously without
collapse (arXiv:2602.03097).
Ship a public 'explainable hidden-candidate' demo: given a job,
return non-obvious fits with a full evidence chain (which
implicit/bridge claims fired, counter-evidence shown) — the thing
LinkedIn/Eightfold cannot expose because their match is an opaque
embedding.
Adopt the official ESCO↔︎O*NET crosswalk + Lightcast Open Skills as
the anchor namespace so jsonresume claims are portable and the open
standard is interoperable by construction — a moat incumbents'
proprietary taxonomies can't claim.
Run a falsifiable first milestone: extract abundance-claims for a
held-out jsonresume cohort, match vs an embedding-only baseline, and
report nDCG@10 / hidden-candidate recall AND explanation-faithfulness —
target ConFit-v3-class lift (+7-8 pp nDCG@10) with 100%-faithful
explanations.
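The scoring/explanation split can be sketched in a few lines. This is an illustration only: it uses Jaccard overlap alone (one of the components named above), it assumes the skills mentioned in an LLM explanation have already been extracted into a list, and all function names are hypothetical. It is not the Lean-4 check and not JobMatchAI's implementation.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_match(candidate_skills: set, job_skills: set) -> dict:
    """Deterministic layer: the score and the evidence an explanation may cite."""
    return {
        "score": jaccard(candidate_skills, job_skills),
        "matched": sorted(candidate_skills & job_skills),
        "missing": sorted(job_skills - candidate_skills),
    }


def unsupported_mentions(explained_skills: list, scored: dict) -> list:
    """Skills an explanation cites that the scoring layer never produced."""
    allowed = set(scored["matched"]) | set(scored["missing"])
    return [skill for skill in explained_skills if skill not in allowed]
```

An explanation passes the "0% unsupported claims" rule in this sketch when `unsupported_mentions` returns an empty list; the score itself is never touched by the explanation step.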
Honest constraints:
Abundance can hallucinate: inferred/bridge claims risk false
positives. Solution: store as hypothesis_only with explicit
evidence_links and confidence; gate at the ranking layer; measure each
claim TYPE's precision against hire outcomes and auto-demote types that
don't predict. Target: inferred-claim precision ≥ explicit-claim
precision within 2 outcome cycles.
Explanation faithfulness is the whole product — a generated 'why'
that isn't backed by evidence is worse than no explanation. Enforce
JobMatchAI-style separation (LLM sees only pre-scored evidence) and
Lean-4-check '0% unsupported claims'. Target: ≥99% faithful, approaching
JobMatchAI's 100%/0% benchmark.
Cost at firehose scale is real but bounded: 10 lenses × millions of
resumes multiplies tokens. Target: keep full re-extraction under
~$0.005/resume using batch API + small models for routine lenses,
frontier models only for bridge/trajectory lenses; re-extract
incrementally on resume edits, not nightly.
The cross-entity relationship GENERATOR (proposing
skill-adjacency/bridge edges no one drew) is the unbuilt piece. Build it
as a guarded proposer: LLM proposes candidate edges, deterministic
support/rebut scoring + outcome data confirm before they enter the
canonical lens. Target: ≥X confirmed novel adjacency edges/month with
hire-outcome support above chance.
Ground truth is sparse and biased (only some matches get
interview/hire signals, and the funnel itself is biased). Use the
736-resume fairness result as a floor (no worse than humans), measure
group-differential outcomes continuously, and treat missing outcomes as
missing-not-at-random in the re-ranking. Falsifiable target: demographic
outcome gaps ≤ human-baseline gaps.
Taxonomy drift: the ~25% change in job skill sets since 2015 is
projected to double by 2027, so any frozen taxonomy rots. The
self-growing predicate mechanism is the
answer, but needs governance — Lean-checkable shape constraints on new
predicates before promotion. Target: incorporate Lightcast biweekly
refresh + auto-propose net-new skills within one refresh cycle.
Privacy/consent for resume claims: inferring traits/working-style
from a resume raises legitimate consent and EU-AI-Act high-risk
concerns. Make every inferred claim user-visible, contestable, and
deletable; keep inferred-trait lenses opt-in and clearly labeled
low-confidence; never let trait inferences enter deterministic scoring
without explicit policy.
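The "measure each claim TYPE's precision against hire outcomes and auto-demote" mitigation could look roughly like the tracker below. It is a sketch under assumptions: the class name, the minimum-observation threshold, and the precision floor are hypothetical, and it treats every observed outcome as unbiased, which the sparse-ground-truth constraint above says is not true in practice.

```python
from collections import defaultdict


class ClaimTypeTracker:
    """Running precision of each claim type against observed match outcomes."""

    def __init__(self, min_observations: int = 20, precision_floor: float = 0.5):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)
        self.min_observations = min_observations
        self.precision_floor = precision_floor

    def record_outcome(self, fired_claim_types, positive: bool) -> None:
        # One observation per claim type that contributed to the match.
        for claim_type in set(fired_claim_types):
            self.total[claim_type] += 1
            if positive:
                self.hits[claim_type] += 1

    def precision(self, claim_type: str):
        n = self.total.get(claim_type, 0)
        return self.hits.get(claim_type, 0) / n if n else None

    def demoted(self) -> list:
        # Types with enough observations whose precision is below the floor.
        return sorted(
            t for t, n in self.total.items()
            if n >= self.min_observations
            and self.hits.get(t, 0) / n < self.precision_floor
        )
```

Demotion here only produces a list; what the ranking layer does with a demoted type (down-weighting, exclusion) is left open.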
Examples / systems:
LLMs as Zero-Shot ESCO Skill Matchers (Clavié &
Soulié) — Zero-shot GPT-4 extracts ESCO skills incl. implicit
ones, beating prior supervised best by +22.33/+29.75 pp RP@10 — proof
the extraction bottleneck collapsed [RP@10 61.02/68.94 vs
38.69/39.19]https://arxiv.org/html/2307.03539
ConFit v3 (LLM re-ranking) — LLM listwise
re-ranking over embedding baseline on 49K-resume recruiting set with
controllable non-negotiables checklist [nDCG@10 52.33→61.37 (+7.81
pp avg)]https://arxiv.org/html/2605.09760v1
Synapse (explainable two-phase + genetic
optimization) — Explainable retrieval ensemble + LLM-guided
resume optimization for job-person fit [+22% nDCG@10 over
embedding-only; >60% relative recommender-score gain]https://arxiv.org/pdf/2604.02539
De-conflating Preference & Qualification (LLM job
rec) — Two separate reasoning streams instead of one blended
fit score — directly maps to two donto claim contexts [significant
gains over single-stream LLM + CF baselines]https://arxiv.org/pdf/2602.03097
Human vs LLM resume matching (observational, 736
resumes) — Real-world fairness/validity check: LLM no more
biased than humans across race/gender [no larger group differences
than human raters]https://aclanthology.org/2025.findings-naacl.270/
TalentCLEF 2025 (open benchmark, Zenodo) — First
public skill/job-title intelligence benchmark, multilingual,
ESCO-grounded — a leaderboard to post to [job-title match MAP 0.534;
title→skill MAP 0.360]https://arxiv.org/html/2507.13275v1
Lightcast Open Skills + LinkedIn Economic Graph —
Scale/demand evidence and the closed-incumbent moat an open
claim-substrate can break open [Lightcast 32K+ skills / 1B+
postings; LinkedIn 800M members, skills doubling by 2027]https://lightcast.io/open-skills
modern-abundance-harnessing-systems
Generation of typed properties was the historic bottleneck; it is now
abundant. The lists below give the details.
What's newly possible:
A guided LLM decomposes any entity along unbounded directions and
invents new predicate types as it goes (AutoSchemaKG at 50M docs, 95
percent schema-alignment, zero human schema engineering, the step that
bottlenecked Cyc).
Build a 900M node and 5.9B edge KG directly from text with no
predefined schema (ATLAS, 2025); schema induction is a generated
artifact now.
Self-improving loops compound: phi-1 reaches 50.6 percent HumanEval
on about 7B tokens; Llama-3-8B goes 22.9 to 39.4 percent AlpacaEval 2 on
model-generated rewards.
End-to-end agentic discovery ships real candidates: AI Co-Scientist
(30 to 5 to 1 AML drugs) and Robin (ripasudil for dry-AMD via ABCA1, 2.5
months), both Nature 2026.
Generative agents simulate 1052 real people from interviews at 85
percent of self-consistency (Stanford 2024-25); text-derived
world-models of entity populations are buildable.
The new primitive: pair unbounded generation with a paraconsistent
persistent substrate that holds contradictory claims forever and
re-ranks as evidence arrives. No shipped system does this; they
collapse, dedup, or evaporate.
Measurable signals:
AutoSchemaKG/ATLAS: 50M plus docs to 900M nodes, 5.9B edges, 95
percent schema alignment, zero manual intervention.
phi-1: 50.6 percent HumanEval, 55.5 percent MBPP at 1.3B params on
about 7B tokens, beats 10x-larger models.
Self/meta rewarding: Llama-3-8B 22.9 to 39.4 percent AlpacaEval 2;
STaR 95 percent AMC23; RLAIF about 10x cheaper than RLHF.
AI Co-Scientist (Nature 2026): 30 AML candidates to 5 to 1 active;
Elo correlates with correctness.
Robin (Nature 2026): autonomous hypotheses plus 2 lab rounds,
dry-AMD ripasudil via ABCA1, 2.5 months, 3 agents.
SciAgents: 33159 node and 48753 edge ontological KG from about 1000
papers (2024).
Sakana AI Scientist-v2: ICLR-workshop review 6,7,6 (avg 6.33) but
flagged for hallucinations and faked results.
Generative agents: 1052 people at 85 percent of self-consistency on
the GSS (Stanford 2024-25).
LLM-as-judge: 60 to 70 percent positional swing, 10 to 25 percent
self-preference, 60 to 68 percent expert agreement.
Skill extraction to ESCO: best LLM pipeline about 0.56 end-to-end,
the flagship baseline to beat.
Concrete for donto:
Position donto as the persistent paraconsistent home for generative
abundance, the layer AutoSchemaKG, GraphRAG, Co-Scientist, and Robin all
lack: they generate then collapse; donto generates then holds and
re-ranks.
Implement the 8-step claim lifecycle: ingest, emit unbounded typed
claims, hold incompatible claims paraconsistently, generate relationship
hypotheses at lens intersections, attach evidence and counter-evidence
via supports/rebuts/undercuts edges, rank, re-rank bitemporally,
explain. Maximize at extraction, gate at the relationship layer.
Adopt the AutoSchemaKG pattern (entities, events, concepts; schema
induced not predefined), but write every induced predicate as a
hypothesis-only claim with provenance so schema growth is auditable and
reversible.
Build the cross-entity relationship generator donto lacks: at the
intersection of two lens decompositions, have the LLM propose typed
relationships, store each as hypothesis-only with supports/rebuts edges,
then rank. This is the donto analogue of SciAgents/Co-Scientist, with
persistence.
Treat LLM-as-judge bias as a measured target: ensemble plus
position-swap plus reference-anchored scoring to lift human-agreement
from about 60 to 68 percent toward over 85 percent; log disagreement as
paraconsistent state.
Ship jsonresume to jobs as the flagship: emit unbounded typed
properties (explicit, inferred, latent skills, seniority, trajectory,
identity variants) anchored to ESCO/O*NET/Lightcast; matching is
explainable evidence-anchored discovery; first milestone is beating the
about 0.56 ESCO baseline, with got-interviewed and hired as bitemporal
ground-truth.
Run the synthetic-data self-improvement loop inside donto: claims
that survive evidence-anchored re-ranking become high-quality curation
signal (the phi/STaR pattern).
Make persistence the demo: re-run the same query a week apart and
show a previously low-ranked hypothesis rise on new evidence, a
falsifiable capability no shipped discovery agent has.
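The position-swap and ensemble idea for judge bias can be sketched as follows. The judge is modelled as any callable that takes two answers and returns "first" or "second"; that interface, the function names, and the majority rule are assumptions for illustration, and no real LLM API is called.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "first" or "second"


def position_swap_verdict(judge: Judge, a: str, b: str) -> str:
    """Ask the judge twice with the order swapped; keep only stable verdicts."""
    a_wins_forward = judge(a, b) == "first"    # a shown in first position
    a_wins_backward = judge(b, a) == "second"  # a shown in second position
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "disagree"  # the verdict flipped with position


def ensemble_verdict(judges: list, a: str, b: str):
    """Majority over position-swapped verdicts; the tally is kept for logging."""
    votes = [position_swap_verdict(j, a, b) for j in judges]
    tally = {v: votes.count(v) for v in ("a", "b", "disagree")}
    if tally["a"] > tally["b"]:
        return "a", tally
    if tally["b"] > tally["a"]:
        return "b", tally
    return "disagree", tally
```

The returned tally is what would be stored as paraconsistent state, so a split panel stays visible instead of being flattened into a single verdict.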
Honest constraints:
The cross-entity relationship generator is the unbuilt core;
intra-corpus extraction works at scale but cross-entity
typed-relationship proposal is the harder open problem. Target: 20 plus
candidates per entity pair at precision-at-10 over 0.4 on a gold
set.
Generative abundance produces hallucinated low-precision claims (the
Sakana failure); holding contradictions does not excuse junk. Target:
high extraction recall while the relationship gate holds precision-at-k
above a published threshold, both layers reported separately.
The ranking layer inherits LLM-as-judge bias; solvable with
ensembling and position-swap. Target: lift judge-human agreement from
about 60 to 68 percent toward over 85 percent, and log residual
disagreement as paraconsistent state rather than forcing a verdict.
Cost and latency of opening the faucet on every entity are real
(GraphRAG indexing is expensive). Target: tiered extraction budget plus
an indexing cost-per-entity ceiling tracked as a metric.
Storing the firehose stresses the substrate (donto about 39.5M
statements; abundance could 10x to 100x that). Target: extend
bounded-candidate query patterns to the claim and relationship layers so
worst-case latency stays sub-second.
Ground truth for re-ranking is delayed and sparse. Target:
instrument the bitemporal re-rank so a handful of got-interviewed/hired
events visibly move rankings, and report calibration as the falsifiable
test.
Examples / systems:
AutoSchemaKG / ATLAS — Autonomous LLM KG
construction with schema induction; proof that typed-property generation
is no longer human-bottlenecked [50M plus docs to 900M nodes, 5.9B
edges, 95 percent schema alignment]https://arxiv.org/abs/2505.23628
Google AI Co-Scientist (Nature 2026) — Multi-agent
generate/debate/Elo-rank hypotheses; lab-validated; lacks a persistent
contradiction-holding ledger across runs [30 AML candidates to 5 to
1 active; Elo correlates with correctness]https://www.nature.com/articles/s41586-026-10652-y
SciAgents (MIT, 2024) — Multi-agent reasoning over
an ontological KG for hidden interdisciplinary relationships, but on an
ephemeral graph [33159 nodes and 48753 edges from about 1000
papers]https://arxiv.org/abs/2409.05556
Sakana AI Scientist-v2 — Autonomous papers via tree
search; one passed ICLR-workshop review but was flagged for
hallucinations and faked results [scores 6,7,6 (avg 6.33); later
withdrawn]https://arxiv.org/abs/2504.08066
Generative Agents of 1052 People (Stanford) — LLM
agents from interviews replicate real individuals survey responses;
text-derived world-models of entity populations [85 percent of
self-consistency on GSS]https://arxiv.org/pdf/2411.10109
phi-1 / Textbooks Are All You Need — Synthetic-data
curation beats scale; value is in the filter not the faucet [50.6
percent HumanEval at 1.3B params on about 7B tokens]https://arxiv.org/abs/2306.11644
LLM skill extraction to ESCO — Flagship-relevant:
LLM pipelines map resume and job text to skill taxonomies; the baseline
jsonresume-to-jobs must beat [about 0.56 end-to-end]https://arxiv.org/html/2512.03195v1
substrate-as-possibility-space-and-domains
The scarce step in every prior knowledge-and-discovery system was
generation of typed properties and relations. Cyc needed knowledge
engineers; literature-based discovery needed co-occurrence statistics;
formal concept analysis needed predefined attributes. That bottleneck is
gone. A guided frontier LLM now emits typed claims about any entity
along essentially unbounded directions for ~$0.00014/triple: GPTKB v1.5
materialized 100M triples from 6.1M entities for ~$14,136 (arXiv
2507.05740), and "Mining the Mind" extracted ~100M beliefs from frontier
models and showed they assert mutually contradictory claims depending on
framing (arXiv 2510.07024). Emission is now abundant; it is also
abundantly contradictory and redundant. The design question has flipped
from "how do we get enough typed knowledge?" to "where do we PUT an
unbounded, contradictory, evidence-anchored firehose without throwing
most of it away?"
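The per-triple figure follows directly from the two GPTKB v1.5 numbers quoted above. The second line extrapolates a tenfold increase in emitted triples, under the assumption (not established by the source) that unit cost stays flat:

```latex
\[
\frac{\$14{,}136}{1.0 \times 10^{8}\ \text{triples}}
  \approx \$1.4 \times 10^{-4}\ \text{per triple}
\]
\[
10 \times \$14{,}136 = \$141{,}360
  \quad \text{for } 10^{9}\ \text{triples at the same unit cost}
\]
```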
TASK A thesis: the standard storage targets are structurally hostile
to abundance. A vector DB collapses meaning to a single embedding and
returns "semantically redundant outputs that lack contextual diversity"
— when an LLM emits a fact that conflicts with a stored one, the closer
vector wins and the other is silently lost (Mem0 on LoCoMo retrieves the
stale address when it is embedding-closer). Standard KGs and even the
best 2025 agent-memory graphs enforce single-truth: Zep/Graphiti
explicitly uses an LLM to detect contradicting edges and INVALIDATES the
overlapping edge (arXiv 2501.13956); Mem0's update step overwrites.
Every one of these must collapse, dedup, or pick-a-winner at write time
— destroying exactly the speculative, minority, not-yet-supported claims
that are the raw material of discovery. donto's paraconsistent,
evidence-first, bitemporal quad store does the opposite: it holds
incompatible claims forever as legal state, anchors each to evidence
with typed argument edges (supports/rebuts/undercuts), and re-ranks by
reality over time instead of deleting on conflict. This converts three
problems-of-abundance into assets: (1) RECALL — nothing emitted is lost,
so the recall ceiling is set by generation, not by a dedup threshold;
(2) AUDITABILITY — every claim carries provenance and counter-evidence,
matching the citation/contract demands of AML triage (arXiv 2604.19755)
and Claimify-style verification (96.7% precision, 87.6% coverage; arXiv
2502.10855); (3) COMPOUNDING — because claims are retained and
re-rankable, new evidence re-scores old hypotheses (Elo-tournament
re-ranking is exactly what Google's AI co-scientist used to reach Nature
in 2026), so the base accumulates "machine serendipity" rather than
resetting each run.
The volume reconciliation is concrete and measurable: maximize at the
typed-extraction layer (emit everything, gate nothing — donto-memory
already produces 483 valid facts from one sentence), and gate at the
relationship/promotion layer (a claim earns .candidate→.proved only via
supports-edges to independent evidence, with a precision target ≥0.95
borrowed from Claimify). The two layers are decoupled by the substrate's
hypothesis_only flag: abundance lives below the waterline as
hypothesis_only claims; reality pulls a vanishing fraction above it.
This is the only architecture where "generate in all directions, prune
by reality over time" is even expressible.
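The two-layer gate can be illustrated with a small status function. This is a sketch, not donto's promotion rule: the threshold of three supports, the rule that a single support earns candidate status, and the zero-rebuttal ceiling are assumptions chosen for the example. Independence of evidence is approximated by counting distinct source ids.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    supports: set = field(default_factory=set)  # distinct evidence source ids
    rebuts: set = field(default_factory=set)    # distinct counter-evidence ids


def gate_status(h: Hypothesis, n_required: int = 3, max_rebuts: int = 0) -> str:
    """Status implied by the current argument edges; nothing is ever deleted."""
    if len(h.rebuts) > max_rebuts or not h.supports:
        return "hypothesis_only"
    if len(h.supports) >= n_required:
        return "proved"
    return "candidate"
```

Because the status is recomputed from the edges, new counter-evidence can pull a claim back below the waterline while the claim and its history stay in the store. Whether such a gate reaches the ≥0.95 precision target has to be measured against held-out ground truth.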
TASK B / flagship: jsonresume→jobs is the cleanest proving ground
because abundance has built-in ground truth. Each resume and job is
decomposed into unbounded typed properties — explicit skills,
inferred/implicit skills, seniority, trajectory, latent traits, identity
variants — anchored to ESCO/O*NET/Lightcast (GPT-4 re-ranking already
lifts skill-linking RP@10 by 22+ points; arXiv 2307.03539). Matching
becomes explainable, evidence-anchored relationship discovery across the
network: skill-adjacency, hidden career paths, candidates no recruiter
would surface. The falsifiable signal is got-interviewed/hired — a
real-world reality check that re-ranks the whole claim graph, the same
compounding loop that the substrate gives to genealogy,
drug-repurposing, law, and science integrity below.
What's newly possible:
Generate an entity's properties in ESSENTIALLY UNBOUNDED directions
for ~$0.00014/triple (GPTKB: 100M triples/$14k) — and let the LLM INVENT
new predicates as it goes, not just fill a fixed schema. Cyc-style
hand-authored ontologies are no longer the rate limiter; the lens set is
open-ended.
Hold the firehose paraconsistently: store ~100M
mutually-contradictory LLM beliefs (Mining the Mind) WITHOUT a
write-time winner-pick. Every other 2025 stack (Mem0, Zep/Graphiti) is
forced to invalidate-on-conflict; donto can retain both claims + their
argument edges as legal state.
Re-rank old hypotheses on new evidence at substrate scale
(bitemporal) — the Elo-tournament compounding that took Google's AI
co-scientist from demo to Nature (2026) becomes a standing property of
the knowledge base, not a one-shot pipeline run.
Cross-entity relationship discovery at the intersections of many
lenses: 'machine serendipity that accumulates.' The serendipity-KG
benchmark shows frontier models still reach at most a ~13% hit rate
over 15.4M entities / 201.7M relations — enormous headroom that a
hold-everything substrate can mine and bank rather than recompute.
Decontextualized, atomic, audit-grade claims at scale:
Claimify-style extraction (96.7% precision, 87.6% coverage) means the
abundance is verifiable claim-by-claim, so gating-by-evidence is a real
engineering knob, not a hope.
Treat contradiction itself as a queryable signal: because both sides
are retained with provenance, you can rank entities/papers/people by
INTERNAL inconsistency — impossible in any store that dedups on write.
Directly enables science-integrity and OSINT use.
Sub-second substrate-wide retrieval over the whole firehose (donto's
POST /search: 39.3M stmts, 270-820ms incl. stopwords) — abundance is
only useful if it stays queryable; this is already built.
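Ranking by internal inconsistency only needs the retained rebuts edges. The sketch below assumes each edge is available as an (entity, claim_a, claim_b, strength) tuple and scores an entity by summed strength; the tuple layout and the function name are hypothetical, and other aggregations (counts, maxima) would be equally defensible.

```python
from collections import defaultdict


def inconsistency_rank(rebut_edges):
    """Rank entities by the total strength of mutually rebutting claims.

    rebut_edges: iterable of (entity, claim_a, claim_b, strength) tuples.
    Returns (entity, score) pairs, most inconsistent first, ties by name.
    """
    scores = defaultdict(float)
    for entity, _claim_a, _claim_b, strength in rebut_edges:
        scores[entity] += strength
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A store that deduplicates on write has no such edges to aggregate, which is why this query is specific to a hold-everything substrate.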
Measurable signals:
Generation cost/abundance: ~$0.00014/triple (GPTKB v1.5: 100M
triples, 6.1M entities, ~$14,136); donto-memory baseline 483 valid facts
from one sentence in ~4.7 min — target: properties-per-entity emitted, %
syntactically/ontologically valid (>95%).
Recall-vs-collapse: A/B donto (hold-all) vs a vector-DB and a
Zep/Graphiti-style invalidate-on-conflict KG on the SAME LLM firehose —
measure % of emitted minority/contradictory claims still retrievable
after ingest. Target: donto 100% retained vs measurable loss in the
collapsing stores (Mem0 demonstrably returns stale-but-closer facts on
LoCoMo).
Promotion precision (the gate): claims promoted .candidate→.proved
must hit precision ≥0.95 against held-out ground truth, borrowing
Claimify's 96.7% precision / 87.6% coverage as the bar; coverage
measured separately so abundance isn't penalized.
Re-ranking lift: when new evidence arrives, measure rank-correlation
change of affected hypotheses and downstream accuracy gain. Touchstone:
AI co-scientist Elo correlates with GPQA correctness; rare-disease
agentic hypothesis-testing lifted Top-5 by >17% and recall to 41.4%
with KG retrieval.
Serendipity hit rate over the substrate: replicate the RNS
(relevance/novelty/surprise) measure on a held-out set; current frontier
ceiling is 0.048-0.134 hit rate, 0.18-0.48 type-match — track whether
accumulated, re-ranked claims push past 13%.
Flagship reality-check: got-interviewed / hired rate on explainable
matches vs an embedding-only baseline; skill-linking RP@10 (GPT-4
re-rank already +22 pts over distant supervision).
Auditability: % of promoted claims with a complete provenance chain
(evidence + counter-evidence edges) — target 100%, the precondition for
AML/legal/clinical use (AML triage frameworks already require explicit
citations + supporting/other separation).
Concrete for donto:
Build the cross-entity relationship generator (the currently-unbuilt
piece): run an LLM over pairs/clusters of high-overlap entities and emit
hypothesis_only relationship claims at the lens intersections (e.g.
skill-A-implies-skill-B, drug-X-repurposes-to-disease-Y). Store ALL as
hypothesis_only with typed argument edges; never dedup at write.
Add a two-layer pipeline contract: (1) extraction layer =
emit-everything, gate nothing (already live, 483 facts/sentence); (2)
promotion layer = a Lean-4-certified rule that only flips
hypothesis_only→.candidate→.proved when N independent supports-edges
exist and counter-edges are below threshold. Make the precision target
(≥0.95) a config knob.
Instrument a 'collapse delta' benchmark in tests/system/: ingest the
same LLM firehose into donto, a vector DB, and a Graphiti-style
invalidate-on-conflict graph; report % minority/contradictory claims
retained and retrievable. This is the headline measurable proof of TASK
A.
Make contradiction first-class in /search: add an inconsistency-rank
that scores an entity by count/strength of mutually-rebutting retained
claims (directly powers science-integrity + OSINT use cases).
Wire bitemporal re-ranking as a standing job: when new evidence
statements land, re-score affected hypothesis_only/.candidate claims
(Elo or Bayesian credence) and log rank deltas — turning the substrate
into the accumulating co-scientist, not a one-shot run.
Flagship: extend donto-memory extraction lenses for resumes/jobs to
emit inferred/implicit skills + latent traits + identity variants
anchored to ESCO/O*NET/Lightcast IRIs; expose explainable matches with
the supporting evidence chain; capture got-interviewed/hired as
ground-truth evidence statements that re-rank the graph.
Provenance completeness gate: refuse to promote any claim lacking a
complete evidence chain (mirrors the empty evidence_links problem
already flagged in the Caroline-line kinship triples) — make 100%
provenance a CI assertion.
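The promotion layer described above (N independent supports-edges, counter-edges below a threshold, precision target as a config knob) could look roughly like the Python sketch below. This is not the Lean-4-certified rule itself. The Edge and GateConfig types, the default thresholds, and the choice to treat "independent" as "distinct source_id" are all assumptions made for illustration. Edges with an empty source_id are ignored, which mirrors the provenance completeness gate.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    HYPOTHESIS_ONLY = "hypothesis_only"
    CANDIDATE = "candidate"
    PROVED = "proved"


@dataclass(frozen=True)
class Edge:
    kind: str       # "supports" or "rebuts"
    source_id: str  # provenance handle; empty string means no provenance


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical defaults; every value here is meant to be a config knob.
    min_supports_candidate: int = 2
    min_supports_proved: int = 4
    max_counter_ratio: float = 0.25
    precision_target: float = 0.95


def promote(edges: List[Edge], cfg: GateConfig = GateConfig()) -> Status:
    # Count each source once, so repeated edges from one source are not
    # treated as independent support. Edges without provenance are dropped.
    supports = {e.source_id for e in edges if e.kind == "supports" and e.source_id}
    rebuts = {e.source_id for e in edges if e.kind == "rebuts" and e.source_id}
    if not supports:
        return Status.HYPOTHESIS_ONLY
    counter_ratio = len(rebuts) / (len(supports) + len(rebuts))
    if counter_ratio >= cfg.max_counter_ratio:
        return Status.HYPOTHESIS_ONLY
    if len(supports) >= cfg.min_supports_proved:
        return Status.PROVED
    if len(supports) >= cfg.min_supports_candidate:
        return Status.CANDIDATE
    return Status.HYPOTHESIS_ONLY


def gate_meets_target(true_pos: int, false_pos: int, cfg: GateConfig = GateConfig()) -> bool:
    # Precision of promoted claims on a held-out set, compared to the knob.
    promoted = true_pos + false_pos
    return promoted > 0 and true_pos / promoted >= cfg.precision_target
```

Nothing in this sketch deletes or rewrites a claim: a claim that fails the gate simply stays hypothesis_only, so the extraction layer can keep emitting everything.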
Honest constraints:
The cross-entity relationship generator is unbuilt. Emitting
properties per-entity is proven (483 facts/sentence); emitting and
storing relationship hypotheses across pairs/clusters at substrate scale
is the open engineering work. Target: a working pairwise/cluster
generator + promotion gate with measured precision ≥0.95 on a held-out
set.
Abundance without a gate is noise. The substrate can HOLD
everything, but consumers need a trustworthy waterline. Mitigation = the
decoupled two-layer design (emit-all below as hypothesis_only; promote
only on independent evidence) with Claimify-grade precision targets —
make the gate, not the firehose, the contract.
Cost and storage scale with abundance. At ~$0.00014/triple, a 10x
lens expansion is real money and real rows on a 39.5M-stmt Postgres box.
Target: cost-per-promoted-claim (not per-emitted-claim) as the
unit-economics metric; tier hypothesis_only storage cheaply, keep promoted
claims hot.
Serendipity precision is genuinely hard: the frontier ceiling is a
~13% hit rate (0.134). Frame as: the substrate's job is to RETAIN and RE-RANK
candidates so accumulated evidence raises that number over time, not to
nail it in one pass. Falsifiable: does hit-rate rise as the evidence
base grows?
Re-ranking can drift, not just revise. LLM belief updates aren't
always Bayes-consistent (arXiv 2507.17951) and context accumulation
causes drift. Mitigation = re-rank on EXTERNAL evidence statements with
provenance, certify the promotion rule in Lean-4, and audit rank deltas
— distinguish evidence-driven revision from model drift.
Garbage-provenance erodes trust fast. donto already has near-empty
evidence_links on most Caroline-line kinship triples — abundance
amplifies this. Make 100% provenance-completeness a hard gate for
promotion (CI assertion), so unsupported abundance can never masquerade
as fact.
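One way to make the re-ranking mitigation above concrete is a Bayesian credence update in log-odds form, where credences change only through external evidence statements that carry an ID, and every rank change is logged. The Python sketch below works under stated assumptions: the EvidenceStatement type and the likelihood_ratio input are hypothetical, credences are assumed to lie strictly between 0 and 1, and likelihood ratios are assumed positive. An Elo scheme would be an alternative; this is not donto's implementation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EvidenceStatement:
    evidence_id: str         # provenance handle of the external statement
    claim_id: str            # claim the statement bears on
    likelihood_ratio: float  # >1 supports the claim, <1 counts against it


def _logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ranks(credence: Dict[str, float]) -> Dict[str, int]:
    # Rank 1 = highest credence; ties broken by claim id for determinism.
    ordered = sorted(credence, key=lambda cid: (-credence[cid], cid))
    return {cid: i + 1 for i, cid in enumerate(ordered)}


def rerank(
    credence: Dict[str, float], evidence: List[EvidenceStatement]
) -> Tuple[Dict[str, float], List[dict]]:
    before = ranks(credence)
    updated = dict(credence)
    touched: Dict[str, List[str]] = {}
    for ev in evidence:
        if ev.claim_id not in updated:
            continue  # evidence about unknown claims is ignored here
        # Bayes in log-odds form: posterior logit = prior logit + log(LR).
        new_logit = _logit(updated[ev.claim_id]) + math.log(ev.likelihood_ratio)
        updated[ev.claim_id] = _sigmoid(new_logit)
        touched.setdefault(ev.claim_id, []).append(ev.evidence_id)
    after = ranks(updated)
    # Audit log: one entry per claim whose rank moved. An empty evidence_ids
    # list means the claim was displaced by updates to other claims.
    log = [
        {
            "claim_id": cid,
            "rank_before": before[cid],
            "rank_after": after[cid],
            "evidence_ids": touched.get(cid, []),
        }
        for cid in sorted(updated)
        if before[cid] != after[cid]
    ]
    return updated, log
```

Because the only path to a credence change runs through an EvidenceStatement, any credence movement that cannot be tied to an evidence_id would indicate drift rather than revision.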
Examples / systems:
GPTKB v1.5 (Max Planck / TU Dresden) — Materialized
100M triples from 6.1M entities for ~$14,136 (~$0.00014/triple) and
chose to KEEP the inconsistent firehose as a queryable KB rather than
dedup to perfection — direct proof that abundant typed emission is cheap
and that retaining contradictions is a deliberate, viable design.
[100M triples, 6.1M entities, $14,136, ~$0.00014/triple] https://arxiv.org/pdf/2507.05740
Mining the Mind (100M beliefs) — Extracted ~100M
beliefs from frontier LLMs and documented systematic internal
contradictions (same model asserts conflicting claims by framing) — the
empirical case that emission is abundant AND abundantly contradictory,
so a paraconsistent home is required. [~100M beliefs; pervasive
intra-model contradiction] https://arxiv.org/pdf/2510.07024
Zep / Graphiti temporal KG (anti-pattern contrast)
— State-of-the-art 2025 agent-memory graph that detects contradicting
edges with an LLM and INVALIDATES the overlapping edge — the exact
collapse/pick-a-winner behavior donto refuses; shows the field defaults
to destroying minority claims. [bi-temporal; invalidate-on-conflict
(vs donto hold-forever)] https://arxiv.org/abs/2501.13956
Serendipity Discovery in KGs for Drug Repurposing —
RNS (relevance/novelty/surprise) benchmark over a 15.4M-entity /
201.7M-relation clinical KG; frontier models hit only 0.048-0.134
serendipity hit rate — quantifies the discovery headroom a
hold-everything, re-ranking substrate can mine and accumulate.
[15.4M entities, 201.7M relations; 0.048-0.134 serendipity hit
rate] https://arxiv.org/html/2511.12472
Google DeepMind AI co-scientist (Nature 2026) —
Multi-agent system that generates, then re-ranks hypotheses via an Elo
tournament that improves with compute and correlates with correctness —
the compounding re-rank loop donto can make a standing substrate
property instead of a one-shot run. [Elo↑ with compute; correlates
with GPQA correctness; lab-validated] https://www.nature.com/
Claimify (Microsoft Research) — Atomic,
decontextualized claim extraction at 96.7% precision / 87.6% coverage /
99% entailment — the verification bar for the promotion gate
(extract-everything below, gate-by-evidence above). [96.7%
precision, 87.6% coverage, 99% entailment] https://arxiv.org/pdf/2502.10855
Rare-disease differential dx with LLMs (2025) —
Agentic hypothesis-testing + KG retrieval (Orphanet/OMIM) lifted Top-5
accuracy by >17% and recall to 41.4%; ChatGPT-4o reached 22.4% solo, 30%
combined with Exomiser — shows reality-anchored re-ranking beats
single-shot and that holding many differential hypotheses pays off.
[Top-5 +17%, recall 41.4%, combined dx 30%] https://pubmed.ncbi.nlm.nih.gov/40776018/
OpenSanctions Pairs / LLM entity matching (OSINT) —
Large-scale LLM entity matching where rule systems over-match and LLMs
fail mainly on transliteration/date noise; multi-agent ER hits 94.3% on
name-variation — maps onto donto's identity-as-hypothesis (keep variants
as competing claims, resolve at query time). [94.3% name-variation
match; complementary failure modes] https://arxiv.org/pdf/2603.11051