genes.apexpots.com / research source: donto-abundance-appendix-2026-06-02.md

donto — Generative Abundance: Research Appendix (2026-06-02)

donto — Substrate for Generative Abundance: Research Appendix

Companion to the iteration-4 unified report. Structured output of the 5 forward-looking research deep-dives (2026-06-02).


frontier-llm-generative-decomposition

The founder's thesis ("a frontier LLM can emit an immeasurable amount of typed properties in any direction about a thing, with guidance") is no longer a metaphor in 2025-2026; it is a measured, replicated engineering result. The cleanest proof is GPTKB: pointed at a single CHEAP model (GPT-4o-mini) and asked to recursively elaborate entities, it materialized 105M triples over 2.9M entities using 2,133 distinct relations and 367 classes (roughly 36 typed properties per entity on average) for $0.00009 per correct triple (https://arxiv.org/html/2411.04920v1). The follow-up GPTKB v1.5 repeated the exercise with GPT-4.1 at ~100M triples (arXiv 2507.05740), and the "Mining the Mind" study analyzes those ~100M triples as the model's beliefs (https://arxiv.org/abs/2510.07024). The decisive part for the thesis is not the volume but the DIRECTIONS: the model invented predicates no human schema had, such as historicalSignificance (270K triples), hasArtStyle (11K) and hobbies (30K), and 69.5% of the entities it described were NOT in Wikidata at all. The scarce, human-bottlenecked step in every prior knowledge system (knowledge engineers in Cyc, predefined attributes in formal concept analysis) is the exact step a guided LLM now does almost for free, and it does it along axes nobody pre-declared. "Essentially unbounded multi-directional emission" is a fair characterization: the supply of typed properties and the supply of NEW predicates are both effectively elastic now.
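The recursive elaboration loop described above can be sketched as a breadth-first crawl over entities. This is a minimal sketch under stated assumptions, not GPTKB's actual pipeline: `emit_triples` is a hypothetical stand-in for a guided LLM call, and the canned `TOY_MODEL` answers exist only so the example runs offline.

```python
from collections import deque
from typing import Callable, Iterable

Triple = tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical canned answers standing in for a guided LLM call.
TOY_MODEL: dict[str, list[Triple]] = {
    "Pandoc": [
        ("Pandoc", "writtenIn", "Haskell"),
        ("Pandoc", "createdBy", "John MacFarlane"),
    ],
    "Haskell": [("Haskell", "paradigm", "functional")],
    "John MacFarlane": [("John MacFarlane", "occupation", "philosopher")],
}


def emit_triples(entity: str) -> list[Triple]:
    """Stand-in for the LLM: return typed claims about one entity."""
    return TOY_MODEL.get(entity, [])


def elaborate(
    seed: str,
    emit: Callable[[str], Iterable[Triple]],
    max_entities: int = 100,
) -> list[Triple]:
    """Breadth-first expansion: each new object becomes an entity to elaborate.

    A real system would skip literal objects (dates, numbers); this sketch
    expands every object for simplicity.
    """
    seen = {seed}
    queue = deque([seed])
    triples: list[Triple] = []
    expanded = 0
    while queue and expanded < max_entities:
        entity = queue.popleft()
        expanded += 1
        for s, p, o in emit(entity):
            triples.append((s, p, o))
            if o not in seen:
                seen.add(o)
                queue.append(o)
    return triples


kb = elaborate("Pandoc", emit_triples)
predicates = {p for _, p, _ in kb}
print(len(kb), "triples,", len(predicates), "distinct predicates")
```

The `max_entities` budget is the only brake in the loop; the predicate set is never declared in advance, which is the point the paragraph makes about emergent directions.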

The abundance compounds when you let the model invent the axes themselves. AutoSchemaKG built a 900M+ node, 5.9B edge graph (ATLAS) from 50M+ documents with ZERO predefined schema: the LLM induced entity/event/concept types on the fly and hit 95% semantic alignment with human-crafted schemas with no manual intervention, while preserving 93-97% of source-passage information (https://arxiv.org/html/2505.23628). Strikingly, that was achieved with a small Llama-3-8B extractor, which suggests the ceiling under a frontier model (GPT-5.5 at 1M-token context, Claude Opus 4.5) is well above what these papers measured. This is the "lens engine" intuition made quantitative: decompose along attributes, parts, functions, causes, counterfactuals, comparisons, AND emergent axes, and a single entity yields not one row but a fan of dozens-to-hundreds of typed claims plus brand-new predicate types. That is exactly the abundance donto is built to hold rather than collapse.

The honest counterweight, and the reason donto's posture is right rather than naive, is that emission breadth is now cheap but per-claim PRECISION is not automatic. The headline finding of the "Mining the Mind" analysis of GPTKB v1.5 is that GPT-4.1's factual accuracy is "significantly lower than indicated by previous benchmarks," with inconsistency, ambiguity and hallucination as the main failure modes (https://arxiv.org/abs/2510.07024). On precision-critical typed extraction, GPT-5 reaches only 0.616 F1 five-shot on ChemProt relation extraction, about 12 points behind fine-tuned SOTA, and still trails specialists on disease NER, while leading on chemical NER and reasoning-heavy QA (https://arxiv.org/abs/2509.04462). So the 2025-2026 reality is asymmetric: GENERATION breadth is unbounded and nearly free; per-triple TRUTH is good and improving but not guaranteed. This is precisely the volume reconciliation the founder already drew: maximize at the typed-extraction layer (where breadth wins), gate at the relationship/belief layer (where truth must be earned by evidence). It also validates donto as the natural home: a vector DB or normal KG must dedup and pick a winner, and DEG-RAG shows LLM-built graphs need ~70% entity reduction to be usable in a collapsing store (https://arxiv.org/html/2510.14271v1). donto doesn't collapse: it holds the contradictory firehose as legal bitemporal state and prunes by reality (evidence + re-ranking), an architecture that lets you keep ALL of generation's abundance instead of throwing 70% of it away.

On the "immeasurable directions" claim specifically: the strongest evidence is internal, not behavioral. Anthropic's sparse-autoencoder work extracted 34 million distinct interpretable features (concepts/directions) from the residual stream of a SINGLE mid-size model, Claude 3 Sonnet, with ~12M alive (https://transformer-circuits.pub/2024/scaling-monosemanticity/). The number of latent axes along which a frontier model can characterize a thing is measured in the tens of millions and grows with scale — so when the founder says "any direction," the substrate inside the model genuinely has tens of millions of them. The engineering task is not to manufacture directions (they exist in superabundance) but to ELICIT the useful ones with guidance and ANCHOR each emitted claim to evidence. That reframes donto from "a place to store extracted facts" to "the only substrate that can absorb generative abundance at full bandwidth and let reality, not a dedup heuristic, decide what survives."

What's newly possible:

Measurable signals:

Concrete for donto:

Honest constraints:

Examples / systems:

economics-and-measurement-of-abundance

The scarce step in every prior knowledge system was generating typed properties and relations: Cyc paid knowledge engineers per assertion, literature-based discovery rode co-occurrence stats, formal concept analysis needed hand-defined attributes. That bottleneck is now an economic non-event. A guided frontier LLM emits hundreds of self-validated typed facts per source in minutes (donto's own pipeline: a one-sentence "Pandoc" input -> 483 valid ingested facts in ~4.7 min on a flat-rate coding subscription), and inference cost for a fixed capability has fallen ~10x/year for three straight years (a16z "LLMflation"; GPT-3-quality dropped from $60/M tokens in Nov-2021 to ~$0.06/M by Nov-2024, a ~1000x decline). Epoch AI puts the per-task decline at a median 50x/year (range 9x-900x), accelerating to a median 200x/year on post-Jan-2024 data alone. The crossover has already happened: at DeepSeek V4-Pro rates ($0.44/$0.87 per M tokens, made permanent May-2026) or Gemini 2.5 Flash ($0.30/$2.50), decomposing an entity along dozens of directions costs single-digit cents. Concretely: emitting ~500 typed properties for one entity (~5K output tokens) costs ~$0.004-0.04. "Generate everything about everything" is no longer a thought experiment — it is a line item you can budget. This is donto's foundational tailwind: generation abundance is now cheaper than the human curation it replaces by 3-4 orders of magnitude.
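The per-entity cost and the "LLMflation" decline quoted above are simple arithmetic. The sketch below reproduces them; it assumes that only output tokens are billed (input tokens are ignored) and that 500 properties take roughly 5K output tokens, as the paragraph states. The function names are illustrative.

```python
# Output-token prices in USD per million tokens, as quoted in the text above.
PRICE_PER_M_OUTPUT = {
    "DeepSeek V4-Pro": 0.87,
    "Gemini 2.5 Flash": 2.50,
}

TOKENS_PER_ENTITY = 5_000  # ~500 typed properties, per the text


def emission_cost(output_tokens: int, usd_per_million: float) -> float:
    """Cost in USD of emitting `output_tokens` at a per-million-token price."""
    return output_tokens * usd_per_million / 1_000_000


def price_after(start_price: float, yearly_factor: float, years: int) -> float:
    """Price after `years` of a constant `yearly_factor`-fold annual decline."""
    return start_price / yearly_factor**years


for model, price in PRICE_PER_M_OUTPUT.items():
    cost = emission_cost(TOKENS_PER_ENTITY, price)
    print(f"{model}: ${cost:.5f} per entity")

# $60/M in Nov-2021 at ~10x/year for three years lands near $0.06/M.
print(f"GPT-3-quality price after 3 years: ${price_after(60.0, 10.0, 3):.2f}/M")
```

Both per-entity figures fall inside the ~$0.004-0.04 range given above; adding input-token cost would raise them, so treat the result as a lower bound.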

A critical 2026 nuance keeps the spine honest: headline API prices are BIFURCATING even as cost-per-capability keeps falling 5-10x/year (Epoch). Western labs raised premium reasoning-model prices in May-2026 (GPT-5.5 doubled to $2.50/$15; Gemini 3.5 Flash 3x'd to $1.50/$9) while efficiency-leaders (DeepSeek) drove to the floor. The takeaway for a builder: the abundance thesis is real but you must ENGINEER for the floor — route bulk typed-property emission to the cheapest capable model (the donto GLM/DeepSeek-tier extraction path), and reserve premium reasoning tokens for the high-value relationship-hypothesis and re-ranking steps. Cost is now a steering variable, not a wall.

The harder, more valuable frontier is MEASUREMENT, because once generation is free, the scarce resource becomes knowing WHICH generated properties are worth keeping. Accuracy alone is the wrong yardstick: a true-but-redundant fact has near-zero value. The right metrics, all live in 2024-2026 research, are: (a) downstream TASK-PERFORMANCE LIFT: does adding the emitted structure move a real metric? GraphRAG-style KGs win 72-83% of comprehensiveness comparisons against vector RAG, and better-constructed graphs add +12.8 QA points; this is donto's gold standard, and the jsonresume->jobs flagship has a built-in one: got-interviewed / hired; (b) information gain / Bayesian surprise: how much a new property shifts the posterior over an entity, which doubles as the steering wheel for active generation (BED-LLM and Uncertainty-of-Thoughts choose what to generate next by maximizing expected information gain); (c) novelty/diversity: measured as the harmonic mean of originality (fraction of unseen n-grams) and quality, or as embedding-distance diversity, and supported by the result that AI-generated research ideas are rated MORE novel than those of expert humans (p<0.05, Stanford 100+ researcher study) though less diverse, which directly validates "machine serendipity that accumulates"; and (d) coverage/completeness benchmarks like MINE (Feb-2025), where KGGen beat OpenIE/GraphRAG by 18% on representing source text.
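Metrics (b) and (c) have compact definitions. The sketch below is illustrative only: it models an entity as a discrete distribution over hypotheses, treats a new property as a likelihood vector, and uses hypothetical function names; none of this is taken from BED-LLM or Uncertainty-of-Thoughts.

```python
import math


def entropy(p: list[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def bayes_update(prior: list[float], likelihood: list[float]) -> list[float]:
    """Posterior over hypotheses after observing one property."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def entropy_reduction(prior: list[float], likelihood: list[float]) -> float:
    """Bits of uncertainty removed by this one observed property."""
    return entropy(prior) - entropy(bayes_update(prior, likelihood))


def bayesian_surprise(prior: list[float], posterior: list[float]) -> float:
    """KL(posterior || prior) in bits: how far the belief moved."""
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)


def novelty(originality: float, quality: float) -> float:
    """Harmonic mean of originality and quality, both in [0, 1]."""
    if originality + quality == 0:
        return 0.0
    return 2 * originality * quality / (originality + quality)


prior = [0.25, 0.25, 0.25, 0.25]  # four equally likely hypotheses
informative = [1.0, 1.0, 0.0, 0.0]  # property rules out two of them
redundant = [1.0, 1.0, 1.0, 1.0]  # true, but tells us nothing new

print(entropy_reduction(prior, informative))  # 1 bit gained
print(entropy_reduction(prior, redundant))    # 0 bits gained
```

The redundant property scores zero even though it is true, which is the "true-but-redundant fact has near-zero value" point in numeric form. Expected information gain, as used for steering generation, would average `entropy_reduction` over the possible answers before asking.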

Model collapse is the one real risk to a self-growing knowledge base, and the 2024-2026 literature has already de-fanged it for donto's exact architecture. Collapse only happens under REPLACEMENT (training on synthetic data while discarding real); error then grows roughly linearly with iterations (Shumailov). Under ACCUMULATION — keeping all real + synthetic data forever — error is provably BOUNDED, not divergent (Gerstgrasser et al. 2024; the variance converges to a finite limit independent of iteration count). donto is an accumulation system BY CONSTRUCTION: bitemporal, paraconsistent, evidence-anchored, it never overwrites or dedups; every claim keeps its provenance and counter-evidence. The collapse-avoidance recipe the field converged on — accumulate + verify/curate (selection on synthetic data "significantly enhances performance" especially where verifiers exist) — IS donto's claim lifecycle: emit abundantly at the typed-extraction layer, then gate/rank at the relationship layer against evidence and reality. The contradiction-preserving substrate is not just compatible with generative abundance; it is the provably-safe container for it.
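The linear-versus-bounded contrast can be seen in a toy model. The sketch below is an assumption for illustration, not the setting analyzed in the cited papers: each generation fits only the mean of a Gaussian with known variance `sigma2`, draws `n` synthetic samples from that fit, and either replaces or accumulates the data. The closed forms follow from the recurrences written in the docstrings.

```python
import math


def replace_variance(t: int, sigma2: float = 1.0, n: int = 100) -> float:
    """Variance of the fitted mean after t generations under REPLACEMENT.

    Each fit uses only the latest n samples, so m_t = m_{t-1} + eps_t with
    Var(eps_t) = sigma2 / n, and the errors add up linearly in t.
    """
    return t * sigma2 / n


def accumulate_variance(t: int, sigma2: float = 1.0, n: int = 100) -> float:
    """Variance of the fitted mean after t generations under ACCUMULATION.

    The fit uses all t batches, so m_t = ((t-1) * m_{t-1} + b_t) / t with
    b_t = m_{t-1} + eps_t, which gives m_t = m_{t-1} + eps_t / t.
    The variance is (sigma2 / n) * sum(1 / i^2), bounded by pi^2 / 6.
    """
    return sum(sigma2 / (n * i * i) for i in range(1, t + 1))


BOUND = (math.pi**2 / 6) / 100  # limit for sigma2 = 1, n = 100

for t in (1, 10, 100, 1000):
    print(t, replace_variance(t), round(accumulate_variance(t), 6))
```

In this toy setting the replacement error grows without limit while the accumulation error never exceeds `BOUND`, about 1.64 times the error of a single clean fit.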

What's newly possible:

Measurable signals:

Concrete for donto:

Honest constraints:

Examples / systems:

jsonresume-jobs-abundance

The scarce step in every prior matching system was emitting typed properties about people and jobs — and 2024-2026 evidence shows that bottleneck has collapsed. Most required skills in a job posting are expressed implicitly, not as keywords, and a frontier LLM now extracts them better than the entire prior supervised state of the art: zero-shot GPT-4 ESCO skill matching beat the previous best (Decorte et al.) by +22.33 and +29.75 percentage points on RP@10 (arXiv:2307.03539). That is the abundance thesis made measurable — a guided LLM can emit competencies, seniority, trajectory-implied capabilities, working-style signals, and transferable-skill bridges that keyword/embedding pipelines structurally cannot see. And it changes ranking, not just recall: LLM re-ranking (ConFit v3) adds +7.81 pp absolute nDCG@10 over the strongest embedding baseline ConFit v2 (52.33→61.37 on a real 49K-resume recruiting set), and the explainable Synapse system reports +22% nDCG@10 over embedding-only retrieval. So abundance is not noise: more typed properties → better, explainable matches.

The deepest finding for donto is architectural. The best 2026 explainable matcher, JobMatchAI (arXiv:2603.14558), wins by strictly separating a deterministic scoring layer from a generative explanation layer — the LLM "can explain a ranking but never inflate one," yielding 100% faithful explanations (0% unsupported claims), 70.5% top-factor mention, and 94.5% weakness-surfacing, all at 82ms median. This is precisely donto's split: let the LLM emit an unbounded firehose of typed, evidence-anchored claims (HAS_SKILL, IMPLIES_COMPETENCY, BRIDGES_TO, hypothesis_only trajectory inferences), hold the contradictory ones forever as legal paraconsistent state, then gate at the relationship/ranking layer with deterministic, auditable utility — and have the LLM explain only what the evidence already supports. A vector DB must collapse to one embedding; a normal KG must dedup to one canonical skill. donto is the only home that can keep "claims this person can do Kubernetes (inferred from 3 years of Docker + Terraform)" alongside "no direct Kubernetes evidence" as separate evidence-bearing claims, and re-rank when an interview outcome arrives.
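The separation of a deterministic scoring layer from an explanation layer can be sketched in a few lines. This is a minimal sketch, not JobMatchAI's or donto's implementation: the `Factor` type, the weights, and the template-based `explain` (standing in for an LLM that phrases the output) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    name: str
    weight: float
    matched: bool
    evidence: str


def score(factors: list[Factor]) -> float:
    """Deterministic layer: weighted fraction of matched factors."""
    total = sum(f.weight for f in factors)
    if total == 0:
        return 0.0
    return sum(f.weight for f in factors if f.matched) / total


def explain(factors: list[Factor]) -> list[str]:
    """Explanation layer: restates scored factors, heaviest first.

    It reads the factors but returns only text, so it cannot change the
    score, and it can only mention evidence that is already attached.
    """
    ranked = sorted(factors, key=lambda f: f.weight, reverse=True)
    lines = []
    for f in ranked:
        tag = "strength" if f.matched else "weakness"
        lines.append(f"{tag}: {f.name} ({f.evidence})")
    return lines


factors = [
    Factor("Docker", 2.0, True, "3 years in work history"),
    Factor("Terraform", 1.0, True, "listed in 2 projects"),
    Factor("Kubernetes", 3.0, False, "no direct evidence"),
]

print(score(factors))
for line in explain(factors):
    print(line)
```

Because `explain` sorts by weight and includes unmatched factors, the heaviest gap is surfaced first, a simple analogue of the top-factor and weakness-surfacing properties reported above.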

The network effects compound at jsonresume scale. The career-mobility literature now grounds next-role prediction to standard taxonomies at volume — KARRIEREWEGE+ (100K resumes → 3,039 ESCO occupations, MRR 43.58, arXiv/COLING-2025) — exactly the skill-adjacency and career-path graph that emerges once millions of resumes are decomposed into typed claims. LinkedIn's economic graph (800M+ members, skills required per job up ~25% since 2015 and projected to double by 2027, skill-adds up 140% since 2022) and Lightcast Open Skills (32K+ skills mined from 1B+ postings, refreshed biweekly) prove the demand and the moat — but they are closed and embedding-collapsed. An open jsonresume claim-substrate, anchored to the now-official ESCO↔︎O*NET crosswalk plus Lightcast/ESCO, can do the one thing the incumbents cannot: expose why a non-obvious candidate fits, as a checkable evidence chain, and improve it with built-in ground truth (got-interviewed / hired / retained). Cost is the only real constraint and it is falling fast: a million resume-extractions run $72 (small open models) to ~$9,000 (GPT-4o), batch APIs cut that 50%, and per-token prices fell ~80% in the last year — so full-firehose extraction over millions of resumes is already an O($1K-10K) line item, not a research project.

What's newly possible:

Measurable signals:

Concrete for donto:

Honest constraints:

Examples / systems:

modern-abundance-harnessing-systems

Generation of typed properties was the historic bottleneck; it is now abundant. See arrays.

What's newly possible:

Measurable signals:

Concrete for donto:

Honest constraints:

Examples / systems:

substrate-as-possibility-space-and-domains

The scarce step in every prior knowledge-and-discovery system was generation of typed properties and relations. Cyc needed knowledge engineers; literature-based discovery needed co-occurrence statistics; formal concept analysis needed predefined attributes. That bottleneck is gone. A guided frontier LLM now emits typed claims about any entity along essentially unbounded directions for ~$0.00014/triple: GPTKB v1.5 materialized 100M triples from 6.1M entities for ~$14,136 (arXiv 2507.05740), and "Mining the Mind" extracted ~100M beliefs from frontier models and showed they assert mutually contradictory claims depending on framing (arXiv 2510.07024). Emission is now abundant; it is also abundantly contradictory and redundant. The design question has flipped from "how do we get enough typed knowledge?" to "where do we PUT an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?"

TASK A thesis: the standard storage targets are structurally hostile to abundance. A vector DB collapses meaning to a single embedding and returns "semantically redundant outputs that lack contextual diversity" — when an LLM emits a fact that conflicts with a stored one, the closer vector wins and the other is silently lost (Mem0 on LoCoMo retrieves the stale address when it is embedding-closer). Standard KGs and even the best 2025 agent-memory graphs enforce single-truth: Zep/Graphiti explicitly uses an LLM to detect contradicting edges and INVALIDATES the overlapping edge (arXiv 2501.13956); Mem0's update step overwrites. Every one of these must collapse, dedup, or pick-a-winner at write time — destroying exactly the speculative, minority, not-yet-supported claims that are the raw material of discovery. donto's paraconsistent, evidence-first, bitemporal quad store does the opposite: it holds incompatible claims forever as legal state, anchors each to evidence with typed argument edges (supports/rebuts/undercuts), and re-ranks by reality over time instead of deleting on conflict. This converts three problems-of-abundance into assets: (1) RECALL — nothing emitted is lost, so the recall ceiling is set by generation, not by a dedup threshold; (2) AUDITABILITY — every claim carries provenance and counter-evidence, matching the citation/contract demands of AML triage (arXiv 2604.19755) and Claimify-style verification (96.7% precision, 87.6% coverage; arXiv 2502.10855); (3) COMPOUNDING — because claims are retained and re-rankable, new evidence re-scores old hypotheses (Elo-tournament re-ranking is exactly what Google's AI co-scientist used to reach Nature in 2026), so the base accumulates "machine serendipity" rather than resetting each run.
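The difference between overwrite-on-conflict and hold-both can be made concrete with a small append-only store. This is a hedged sketch and not donto's schema: the `Claim` fields, the `ClaimStore` methods, and the example data are hypothetical; only the edge kinds (supports/rebuts/undercuts), the two time axes, and the `hypothesis_only` flag come from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

EDGE_KINDS = {"supports", "rebuts", "undercuts"}


@dataclass(frozen=True)
class Claim:
    id: int
    subject: str
    predicate: str
    obj: str
    source: str                # provenance anchor
    valid_from: Optional[str]  # valid time: when the fact held in the world
    recorded_at: datetime      # transaction time: when the store learned it
    hypothesis_only: bool = True


@dataclass
class ClaimStore:
    claims: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (from_id, kind, to_id)

    def assert_claim(self, subject, predicate, obj, source, valid_from=None):
        """Append a claim; existing claims are never overwritten."""
        claim = Claim(len(self.claims), subject, predicate, obj, source,
                      valid_from, datetime.now(timezone.utc))
        self.claims.append(claim)
        return claim

    def link(self, from_id: int, kind: str, to_id: int) -> None:
        """Record a typed argument edge between two claims."""
        if kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        self.edges.append((from_id, kind, to_id))

    def about(self, subject: str, predicate: str) -> list:
        """All claims on a subject/predicate pair, conflicting ones included."""
        return [c for c in self.claims
                if c.subject == subject and c.predicate == predicate]


store = ClaimStore()
old = store.assert_claim("user:42", "livesIn", "Berlin", "chat-2024", "2024-01")
new = store.assert_claim("user:42", "livesIn", "Lisbon", "chat-2025", "2025-03")
store.link(new.id, "rebuts", old.id)  # conflict is recorded, nothing deleted

print([c.obj for c in store.about("user:42", "livesIn")])
```

Both addresses remain queryable with their provenance, and the `rebuts` edge lets a later ranking step prefer one without the other ever being lost.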

The volume reconciliation is concrete and measurable: maximize at the typed-extraction layer (emit everything, gate nothing — donto-memory already produces 483 valid facts from one sentence), and gate at the relationship/promotion layer (a claim earns .candidate→.proved only via supports-edges to independent evidence, with a precision target ≥0.95 borrowed from Claimify). The two layers are decoupled by the substrate's hypothesis_only flag: abundance lives below the waterline as hypothesis_only claims; reality pulls a vanishing fraction above it. This is the only architecture where "generate in all directions, prune by reality over time" is even expressible.
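The promotion gate can be sketched as a count of independent supporting sources plus a precision check on an audited sample. The threshold `min_sources=2`, the edge tuple layout, and the function names are assumptions for illustration; the text specifies only that promotion requires supports-edges to independent evidence and a precision target of at least 0.95.

```python
from collections import defaultdict

PRECISION_TARGET = 0.95  # target stated in the text


def independent_support(edges, source_of):
    """Map each claim id to the set of distinct sources that support it.

    `edges` holds (evidence_id, kind, claim_id) tuples; only "supports"
    edges count, and two edges from the same source count once.
    """
    support = defaultdict(set)
    for evidence_id, kind, claim_id in edges:
        if kind == "supports":
            support[claim_id].add(source_of[evidence_id])
    return support


def maturity(claim_id, support, min_sources=2):
    """'proved' once enough independent sources agree, else 'candidate'."""
    if len(support.get(claim_id, ())) >= min_sources:
        return "proved"
    return "candidate"


def precision(promoted: set, audited_true: set) -> float:
    """Share of promoted claims that a labeled audit confirmed."""
    if not promoted:
        raise ValueError("no promoted claims to audit")
    return len(promoted & audited_true) / len(promoted)


edges = [
    ("e1", "supports", "c1"),
    ("e2", "supports", "c1"),
    ("e3", "supports", "c2"),
    ("e4", "supports", "c2"),
    ("e5", "rebuts", "c1"),
]
source_of = {"e1": "resume", "e2": "github", "e3": "resume",
             "e4": "resume", "e5": "interview"}

support = independent_support(edges, source_of)
states = {c: maturity(c, support) for c in ("c1", "c2", "c3")}
print(states)
```

Claim `c2` has two supporting edges but both come from the same source, so it stays a candidate; claims with no support at all, like `c3`, remain below the waterline without being deleted.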

TASK B / flagship: jsonresume→jobs is the cleanest proving ground because abundance has built-in ground truth. Each resume and job is decomposed into unbounded typed properties — explicit skills, inferred/implicit skills, seniority, trajectory, latent traits, identity variants — anchored to ESCO/O*NET/Lightcast (GPT-4 re-ranking already lifts skill-linking RP@10 by 22+ points; arXiv 2307.03539). Matching becomes explainable, evidence-anchored relationship discovery across the network: skill-adjacency, hidden career paths, candidates no recruiter would surface. The falsifiable signal is got-interviewed/hired — a real-world reality check that re-ranks the whole claim graph, the same compounding loop that the substrate gives to genealogy, drug-repurposing, law, and science integrity below.
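One way to let got-interviewed/hired outcomes re-rank match hypotheses is a standard Elo update, the same rating scheme the text mentions for tournament re-ranking. Applying it to hiring outcomes is an assumption made here for illustration, as are the starting rating of 1500, the K-factor of 32, and the identifiers in the example.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return the new (winner, loser) ratings after one observed outcome."""
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta


# Hypothetical match hypotheses for the same job, equal before any outcome.
ratings = {"match:alice->job1": 1500.0, "match:bob->job1": 1500.0}

# Each outcome is (interviewed, not interviewed) for one job.
outcomes = [("match:alice->job1", "match:bob->job1")]

for winner, loser in outcomes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(ratings)
```

The update moves ratings but removes nothing, so a hypothesis that loses early can still climb if later outcomes favor it.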

What's newly possible:

Measurable signals:

Concrete for donto:

Honest constraints:

Examples / systems: