genes.apexpots.com / research last updated 2026-06-01

research notes

Long-form analyses, design documents, and status snapshots from the donto / genes workspace. Each entry links to a rendered HTML view and the raw markdown source.

donto — extraction engineering

Does the Lens Sweep Generalize? A Cross-Domain Extraction Test · [md] (the genealogy-tuned engine, pointed at PlayStation hardware + an HN thread)
2026-06-03 · An out-of-domain test of the extraction engine: Copetti's PlayStation Architecture (7,273 words) + its 25-comment HN thread, run at concurrency 2 through the current lens-sweep. Result: 1,613 anchored statements over ~700 distinct predicates from ~8,150 words, with the recent prompt fixes holding cross-domain (rdf:type in use with 0× rdfType, no owl:sameAs, numbered predicates ≈0) and real esoteric depth (it independently described the PS1's vertex-jitter / texture-warping, decomposed a comment's C4 trick into usesBitwiseOperation → or-operation + operand, and logged a typo as a fact). Honest defect found: predicate-style drift on dense technical prose (292 kebab vs 230 camelCase, 77 clause-fragment predicates), which is exactly the mess the alignment fabric folds at query time (manufacturedBy → manufactured/isManufacturedBy at 0.89–0.92). Validates both halves: maximal extraction generalizes, and is meant to be tamed downstream, not at the source.
Total Extraction: Deconstructing a Source Through the Whole of Human Understanding · [md] (the extraction vision the systems paper is silent on — "the faucet")
2026-06-03 · donto's philosophy of extraction as total deconstruction: point an agent at one source and guide it to read that source simultaneously as a logician, mereologist, jurist, phenomenologist, historian, and linguist — minting precise predicates from every analytical direction at once, anchoring each claim to its exact span, decoding euphemism and capturing contradiction as first-class facts, and looping until the well is dry. Premise: a thing is not a bag of facts but an unbounded space of true properties, and almost every extractor sees one thin slice and discards the rest. Sane now that generation is cheap (~$10⁻⁴/claim) and safe via the two-layer contract (maximize at extraction, gate at query time) paired with a substrate built to hold the firehose. Grounded in the real lens-sweep engine + output (frontier corpus 31,038 anchored facts / 14,146 predicates / 80.1% singletons; the most dramatic meta-facts are flagged as coming from a deep test re-ingest).
The Embedding Fabric: How Pervasive Embeddings Make donto's Query-Time Vision Real · [md] (why embeddings are the non-brittle join key the whole substrate needs)
2026-06-03 · donto's bet is "emit free, defer joining to query time" — but a deferred join is only as good as the key you join on, and today that key is lexical (trigram + FTS), the brittle surface-form fallback the no-brittle-logic rule forbids. Live proof: the lexical neighbours of killed are only {killedAt, killedBy, killedIn, killedOn}; the top semantic match is murdered (cosine 0.95, trigram just 0.0667), with slew (trigram 0.0) and assassinated (0.0556) reachable only by meaning. Generalizes to an embedding fabric — one maintained vector per object (predicate · entity · statement · span · document · context), refreshed by one continuous loop, consulted everywhere a join/match/rank happens. The sacred constraint: embeddings cluster and rank, never collapse — identity stays a hypothesis, contradictions stay held, alignment is non-destructive query-time expansion. pgvector 0.8.2 + bge-small-384; 865,834 predicates (84.7% singletons); covers the lexical+semantic+LLM ensemble, hybrid RRF search, disk-honest costs, and a 7-eval measurement suite.
Does donto Work? — 105 Queries Against an Abundance-Extracted Knowledge Graph · [md] (empirical query stress-test — all 105 Q&A + good/bad analysis)
2026-06-03 · Empirical answer to "does donto actually work + is abundance useful?": ~12,500 evidence-anchored facts (6,111 predicates) from 7 frontier-conflict events, hit 105 ways across 21 lenses (22-agent workflow), with every query + actual result in tables. 103/105 queries returned real rows (98%). Verified wins: paraconsistency holds incompatible casualty counts as legal state (killedCountPerMurrayLetter=11 AND killCountPerCapricornian1925="over 100" + countDiscrepancyWith); derived historiographic meta-facts (deathCountTrendOverTime="increasing (2 in 1855 to 12 in 1913)"); cross-event prosopography; decoded euphemisms; reprisal causation chains. Honest costs: evidence links wired at the run tier (span=0, doc=0), predicate fragmentation (rdfType=899 vs rdf:type=95), positional pseudo-arrays, and in-band provenance; together these make query-time normalization mandatory. Verdict: a forensic/humanities evidence store that works, not yet a clean auto-reasoning graph.
Generative-Abundance Knowledge Extraction: Vision, System, and a Measured Run · [md] (vision + system + measured run — for external review)
2026-06-03 · The full picture for an outside reviewer: donto's vision (abundance/emit-free, defer joining to query time, paraconsistent, evidence-first, bitemporal, domain-neutral core) and the rebuilt extraction engine (content-agnostic prompt + invent-your-own ontological lenses, compact JSONL, incremental bash-append, OpenCode-decides-done, controller loop-until-dry, retry-on-empty, query-time alignment), with a fully-measured single-document run: 2,333 evidence-anchored facts (93% anchored) across 1,320 distinct predicates from one 13 KB event (pass 1 = 1,882; pass 2 caught +451 the first pass missed), and the binding constraint we hit — a GLM 5-hour usage cap. Ends with six questions for external feedback.
How the OpenCode Extraction Engine Works (and Where It Breaks) · [md] (engineering report — third opinion wanted)
2026-06-03 · A complete, honest description of donto's agentic LLM fact-extraction pipeline — GLM-5.1 driven via OpenCode in a container, writing evidence-anchored claims into the paraconsistent substrate — with real load-test measurements on a 2,818-row historical dataset. Documents the architecture, the multi-pass prompt + anchoring model, and the failure modes we found: the "size cliff" (large inputs → 0 facts), the confirmed root cause (the agent generates the whole facts.json as a single write-tool argument that never completes for large outputs, so nothing is captured), the speed↔facts tension (verbose JSON ≈ 55 tokens/fact), and flock/concurrency contention. Proposes a redesign — an iterative, self-refining extraction loop that accumulates and self-critiques facts across many small persisted turns until the agent judges the output complete, paired with an append-friendly line format (JSONL/TSV). Ends with six explicit questions for an external reviewer.
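
The trigram figures quoted in the Embedding Fabric entry above (killed vs murdered at 0.0667, slew at 0.0, assassinated at 0.0556) can be reproduced with a few lines of Python. This is a minimal sketch that assumes the lexical key is scored the way PostgreSQL's pg_trgm extension scores single words: lowercase the word, pad it with two leading spaces and one trailing space, and take the Jaccard ratio of the two trigram sets. The entry itself only says "trigram", so the pg_trgm convention and the helper names `trigrams` and `trigram_similarity` are assumptions made for illustration.

```python
def trigrams(word: str) -> set[str]:
    """Trigram set for one word, using the assumed pg_trgm-style padding
    (two spaces before, one space after, lowercased)."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard ratio)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    for other in ("killedBy", "murdered", "slew", "assassinated"):
        print(other, round(trigram_similarity("killed", other), 4))
    # killedBy 0.6
    # murdered 0.0667
    # slew 0.0
    # assassinated 0.0556
```

Under this assumption the only trigram that killed shares with murdered or assassinated is the word-final "ed ", which is why surface-form matching surfaces killedBy-style variants and misses the synonyms.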

donto — discovery & vision

donto — The Substrate for Generative Abundance · [md] (unified — iteration 4)
2026-06-02 · The unifying resynthesis, combining the visionary voice of the lens-engine report with the measurable, product-focused spine of the claim-substrate report — and recentred on one thesis: generating typed knowledge was always the scarce step; it just became abundant, and that abundance is a primitive no prior system had. A guided frontier LLM emits an essentially unbounded, multi-directional space of properties and relations about any entity, inventing the predicates as it goes (GPTKB: 105M triples / ~36 properties per entity at $0.00009/triple; AutoSchemaKG: a 900M-node, 5.9B-edge graph with zero predefined schema; ~10×/year cost collapse → a 1M-resume corpus is a $4K–40K line item). Headline design principle: emit free / untyped now; defer typing, alignment, identity and joining to query time — donto's native strength — reframing the live ~938K freely-minted predicates from a "proliferation problem" into the signature of abundance. Covers the abundance engine, the substrate-as-possibility-space, measurement as the steering wheel, the 8-step claim lifecycle, the jsonresume→jobs flagship, 10 example projects, falsifiable milestones + baselines, and the horizon. Companion: research appendix (md).
donto as a Claim/Discovery Substrate — Product Spec, jsonresume→jobs & 10 Projects · [md]
2026-06-02 · Iteration 3, deliberately tight. Reframes donto from "lens engine" to a contradiction-preserving claim/discovery substrate whose product is the 8-step claim lifecycle (ingest → typed-claim extraction → hold incompatible claims → generate relationship hypotheses → evidence + counter-evidence → rank → re-rank on new evidence → explain). Settles the volume question honestly (maximize at the typed-extraction layer — ~1–2k falsifiable claims per paragraph; gate at the relationship layer — Calude-Longo), insists lenses emit typed claims not prose, and goes deep on the flagship jsonresume → jobs matching application (ESCO/O*NET/Lightcast skill claims, explainable evidence-anchored matches, skill-decay, network-effect career-path discovery, competitive landscape). Ends with 10 concrete example projects across domains (talent, genealogy, linguistics, drug-repurposing, law, science-integrity, OSINT, clinical, financial-crime, personal-AI memory), a falsifiable first milestone + baselines, and 5 honest risks. Companion: research appendix (md).
The Lens Engine: Discovery at the Intersection of Many Apertures · [md]
2026-06-01 · ~9,000 words · A research essay on polyperspectival decomposition: using agents to break any entity down through many deep lenses (philosophical, temporal, causal, mereological, teleological, ethical…) and harvesting the inter-entity relationships that emerge at the intersections — connections no single mind would draw. Honestly maps the lineage (Swanson's literature-based discovery, Koestler's bisociation + BisoNets/CrossBee, Gentner/Hofstadter analogy, Ranganathan facets + Formal Concept Analysis, Wierzbicka's semantic primes + Pustejovsky's qualia, Burt's structural holes + Uzzi's atypical combinations, KG link-prediction, AI co-scientist/SciAgents), isolates the genuine white space (agentic × many-deep-lenses × paraconsistent hold-and-verify substrate), designs the engine (lens taxonomy, cross-lens hypothesis generation, novelty×plausibility×value scoring, the verification ladder), confronts the noise/pareidolia problem and a falsifiable time-slicing pilot, and ends on the 10-year horizon. Distilled from a 10-area study + 4 adversarial critiques.
The Lens Engine: Research Appendix (raw findings) · [md]
2026-06-01 · Companion archive: the 4 adversarial critiques (verdicts + counterarguments) and all 10 area findings — foundational works and modern AI systems with URLs, relevance-to-the-engine, already-done-vs-white-space, and hard problems per area.

donto — company & strategy

donto: A Strategy for Turning a Knowledge Substrate Into a Company · [md]
2026-06-01 · ~9,000 words · Flagship strategy document synthesised from an 11-area landscape study and 5 adversarial thesis stress-tests. Covers the sharpened one-sentence thesis, why-now macro forces (agent explosion, provenance crisis, EU AI Act, memory-as-the-moat), the defensible core (paraconsistency + governance) vs. what is commodity (bitemporality, Postgres, scale), the four-arena competitive landscape (Mem0 / Zep-Graphiti / Letta / Cognee / Supermemory / GraphRAG / XTDB / Datomic / Wikidata), the layered company + six new verticals (clinical, legal, scientific claim-curation, OSINT/ACH, EU-AI-Act audit, sovereign/indigenous memory), the open-core MCP-native GTM that resolves the "substrate, never a product" tension, an honest take on the "1M facts per text" horizon, the hard problems (technical, market, business, legal/ethical, founder), an annotated modern-research reading map, and a sequenced 6–18 month plan.
donto — Company Vision: Research Appendix (raw findings) · [md]
2026-06-01 · Companion archive: the full structured output of the 5 thesis stress-tests (with verdicts, confidences, strongest-counterarguments) and all 11 area findings — every named player with funding/traction, every cited paper with URL, and the per-area differentiator / gap / opportunity / risk breakdowns.

donto-memory — extraction engine

donto-memory deep-mode — engine reference · [md]
2026-05-31 · Comprehensive engine doc covering the mode:"deep" pipeline end to end. Topics: request lifecycle, async queue + tokio Mutex, memorize_one internals, the extract_deep orchestrator, the prior-facts block, the LLM-call shape, JSON salvage on truncation, content-key dedup, token + cost accounting, audit log schema, /jobs UI, substrate outputs, observed empirics from both runs, known limits, and a numbered roadmap. Includes a file map and glossary.
Deep-mode extraction on a 109-word Discord message yields 1000 facts · [md]
2026-05-31 · End-to-end review of the first mode:"deep" run: a Nietzsche/Eternal-Recurrence Discord message put through 7 sequential GLM-4.7 passes, producing 1013 raw facts → 1000 unique after dedup. Per-pass yield curve, saturation gradient, pass_2 prose-not-JSON failure, identity-collapse side effect, operational notes on the new async tokio-Mutex queue, and a punch list of follow-ups (JSON-retry, max_tokens bump, identity-resolution post-pass, per-modality default passes).

donto-memory — extraction experiments

What the LLM Actually Extracts — A Qualitative Audit of donto-memory's First Discord Corpus
2026-05-30 · Qualitative read of the first 17 omega-bot memorizes through donto-memory: boilerplate share, identity drift, structural-vs-content yield, what the integration is silently capturing.
donto-memory — early activation report
2026-05-28 · First-light report on donto-memory after stand-up.

donto — forward-looking design

donto — Substrate PRD (PRD-SUBSTRATE-002) · [md]
2026-05-28 · ~5,500 words · Locks donto in as infrastructure rather than product. Names the first-tier consumers (donto-memory, genes, donto-lang). Articulates the substrate contract and the consumer contract. Specifies M10 (Substrate Hardening — overlay extension API, predicate minting controls, hot-path policy projection, true-deletion tombstones, schema-discovery API, multi-tenant pattern, SDKs in Rust / TS / Python), M11 (Federation), and M12 (Scale & Calibration). Defines the substrate test and the definition-of-done for "donto is a true substrate".

donto — systems papers

donto: An Evidence Operating System for Contested Knowledge · [md]
2026-05-28 · ~13,000 words · Long-form scientific paper covering donto's architecture, the ten non-negotiable invariants, the 14-family data model, the DontoQL query language, the six-aperture extraction pipeline, the Trust Kernel, identity and predicate alignment, source-provenance tracing, the Lean 4 overlay, the release-and-federation machinery, and an empirical characterisation of the 39.3M-statement genes corpus.

donto — status snapshots

donto — status snapshot (2026-05-28) · [md]
Self-orientation pass: milestone position M0–M9, live state (39.3M statements, 938k predicates), recent trajectory, open items.