genes.apexpots.com / research
last updated 2026-06-01
research notes
Long-form analyses, design documents, and status snapshots from
the donto / genes workspace. Each entry links to a rendered HTML
view and the raw markdown source.
donto — extraction engineering
- Does the Lens Sweep Generalize? A Cross-Domain Extraction Test · [md]
(the genealogy-tuned engine, pointed at PlayStation hardware + an HN thread)
2026-06-03 · An out-of-domain test of the extraction engine: Copetti's
PlayStation Architecture (7,273 words) + its 25-comment HN thread, run at concurrency 2
through the current lens-sweep. Result: 1,613 anchored statements over ~700 distinct
predicates from ~8,150 words, with the recent prompt fixes holding cross-domain
(rdf:type used with 0× rdfType, no owl:sameAs, numbered predicates
≈0) and real esoteric depth (it independently described the PS1's vertex-jitter /
texture-warping, decomposed a comment's C4 trick into usesBitwiseOperation →
or-operation + operand, and logged a typo as a fact). Honest defect found: predicate-style
drift on dense technical prose (292 kebab vs 230 camelCase, 77 clause-fragment
predicates) — which is exactly the mess the alignment fabric folds at query time
(manufacturedBy → manufactured/isManufacturedBy at
0.89–0.92). Validates both halves: maximal extraction generalizes, and is meant to be tamed
downstream, not at the source.
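The style drift described above can be counted mechanically. The following is a minimal Python sketch, assuming predicates are plain strings; `predicate_style` and `fold_key` are hypothetical helper names for illustration, not part of the donto engine, and `manufactured-by` is an invented kebab spelling used only to show the fold. A purely lexical fold merges spelling variants but cannot reach `isManufacturedBy`, which is the kind of variant the entry leaves to semantic alignment at query time.

```python
import re
from collections import Counter

# Surface-style patterns (assumed definitions, for illustration only).
KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$")
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")


def predicate_style(name: str) -> str:
    """Classify a predicate name by its surface naming style."""
    if KEBAB.match(name):
        return "kebab"
    if CAMEL.match(name):
        return "camelCase"
    return "other"


def fold_key(name: str) -> str:
    """Case- and separator-insensitive key: split into word tokens, lowercase, join."""
    tokens = re.findall(r"[A-Za-z][a-z0-9]*|[0-9]+", name)
    return "".join(t.lower() for t in tokens)


sample = [
    "usesBitwiseOperation",
    "or-operation",
    "operand",
    "manufacturedBy",
    "isManufacturedBy",
]
style_counts = Counter(predicate_style(p) for p in sample)
print(style_counts)

# A lexical fold merges style variants of the same words...
print(fold_key("manufacturedBy") == fold_key("manufactured-by"))   # True
# ...but not variants that differ in wording.
print(fold_key("manufacturedBy") == fold_key("isManufacturedBy"))  # False
```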
- Total Extraction: Deconstructing a Source Through the Whole of Human Understanding · [md]
(the extraction vision the systems paper is silent on — "the faucet")
2026-06-03 · donto's philosophy of extraction as total deconstruction:
point an agent at one source and guide it to read that source simultaneously as a
logician, mereologist, jurist, phenomenologist, historian, and linguist — minting precise
predicates from every analytical direction at once, anchoring each claim to its exact span,
decoding euphemism and capturing contradiction as first-class facts, and looping until
the well is dry. Premise: a thing is not a bag of facts but an unbounded space
of true properties, and almost every extractor sees one thin slice and discards the rest. The
approach is sane now that generation is cheap (~$10⁻⁴/claim), and safe via the two-layer contract
(maximize at extraction, gate at query time) paired with a substrate built to hold the firehose.
Grounded in
the real lens-sweep engine + output (frontier corpus 31,038 anchored facts / 14,146 predicates /
80.1% singletons; the most dramatic meta-facts are flagged as coming from a deep test re-ingest).
- The Embedding Fabric: How Pervasive Embeddings Make donto's Query-Time Vision Real · [md]
(why embeddings are the non-brittle join key the whole substrate needs)
2026-06-03 · donto's bet is "emit free, defer joining to query time" — but a deferred
join is only as good as the key you join on, and today that key is
lexical (trigram + FTS), the brittle surface-form fallback the no-brittle-logic
rule forbids. Live proof: the lexical neighbours of killed are only
{killedAt, killedBy, killedIn, killedOn}; the top semantic match is
murdered (cosine 0.95, trigram just 0.0667), with slew
(trigram 0.0) and assassinated (0.0556) reachable only by meaning.
Generalizes to an embedding fabric — one maintained vector per object
(predicate · entity · statement · span · document · context), refreshed by one continuous loop,
consulted everywhere a join/match/rank happens. The sacred constraint: embeddings
cluster and rank, never collapse — identity stays a hypothesis, contradictions
stay held, alignment is non-destructive query-time expansion. pgvector 0.8.2 + bge-small-384;
865,834 predicates (84.7% singletons); covers the lexical+semantic+LLM
ensemble, hybrid RRF search, disk-honest costs, and a 7-eval measurement suite.
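The trigram figures quoted above can be reproduced by hand. The following is a minimal Python sketch, assuming the scores come from pg_trgm-style similarity (each word lowercased, padded with two leading spaces and one trailing space, then compared as the ratio of shared to total distinct trigrams). The function names are illustrative, and the sketch handles single words only.

```python
def trigrams(word: str) -> set:
    """Distinct trigrams of one word, padded the way pg_trgm pads words."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


for other in ("murdered", "slew", "assassinated"):
    print(other, round(trigram_similarity("killed", other), 4))
# murdered 0.0667
# slew 0.0
# assassinated 0.0556
```

The only trigram `killed` shares with `murdered` or `assassinated` is the word-final `ed `, which is why a lexical key cannot surface either synonym.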
- Does donto Work? — 105 Queries Against an Abundance-Extracted Knowledge Graph · [md]
(empirical query stress-test — all 105 Q&A + good/bad analysis)
2026-06-03 · Empirical answer to "does donto actually work + is abundance
useful?": ~12,500 evidence-anchored facts (6,111 predicates) from 7 frontier-conflict
events, hit 105 ways across 21 lenses (22-agent workflow), with every
query + actual result in tables. 103/105 queries returned real rows (98%).
Verified wins: paraconsistency holds incompatible casualty counts as legal state
(killedCountPerMurrayLetter=11 AND killCountPerCapricornian1925="over 100"
+ countDiscrepancyWith); derived historiographic meta-facts
(deathCountTrendOverTime="increasing (2 in 1855 to 12 in 1913)"); cross-event
prosopography; decoded euphemisms; reprisal causation chains. Honest costs: evidence links
wired only at the run tier (span=0, doc=0), predicate fragmentation
(rdfType=899 vs rdf:type=95), positional pseudo-arrays, and in-band
provenance, so query-time normalization is now mandatory. Verdict: a forensic/humanities
evidence store that works, not yet a clean auto-reasoning graph.
- Generative-Abundance Knowledge Extraction: Vision, System, and a Measured Run · [md]
(vision + system + measured run — for external review)
2026-06-03 · The full picture for an outside reviewer: donto's vision
(abundance/emit-free, defer joining to query time, paraconsistent,
evidence-first, bitemporal, domain-neutral core) and the rebuilt extraction
engine (content-agnostic prompt + invent-your-own ontological
lenses, compact JSONL, incremental bash-append, OpenCode-decides-done,
controller loop-until-dry, retry-on-empty, query-time
alignment), with a fully-measured single-document run: 2,333
evidence-anchored facts (93% anchored) across 1,320 distinct predicates from
one 13 KB event (pass 1 = 1,882; pass 2 caught +451 the first
pass missed), and the binding constraint we hit — a GLM 5-hour usage
cap. Ends with six questions for external feedback.
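The controller behaviour listed above (loop-until-dry, retry-on-empty, incremental JSONL append) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's code: `extract_pass` stands in for one agent pass, the `s`/`p`/`o` field names and the `facts.jsonl` filename are hypothetical, and the canned facts are placeholders.

```python
import json
import os
import tempfile


def fact_key(fact: dict) -> tuple:
    """Content key used to recognise a fact already emitted (assumed fields)."""
    return (fact["s"], fact["p"], fact["o"])


def run_until_dry(extract_pass, out_path, max_passes=10, retries_on_empty=1):
    """Run passes until one yields nothing new, appending fresh facts as JSONL."""
    seen = set()
    for pass_no in range(1, max_passes + 1):
        fresh = []
        for _attempt in range(retries_on_empty + 1):  # retry-on-empty
            for fact in extract_pass(pass_no):
                key = fact_key(fact)
                if key not in seen:
                    seen.add(key)
                    fresh.append(fact)
            if fresh:
                break
        if not fresh:  # the well is dry
            break
        # Append one JSON object per line so earlier passes stay on disk.
        with open(out_path, "a", encoding="utf-8") as fh:
            for fact in fresh:
                fh.write(json.dumps(fact, ensure_ascii=False) + "\n")
    return len(seen)


# Placeholder passes: pass 2 repeats one fact and adds one; pass 3 is empty.
canned = {
    1: [{"s": "doc1", "p": "p1", "o": "a"}, {"s": "doc1", "p": "p2", "o": "b"}],
    2: [{"s": "doc1", "p": "p1", "o": "a"}, {"s": "doc1", "p": "p3", "o": "c"}],
}


def fake_pass(pass_no):
    return canned.get(pass_no, [])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "facts.jsonl")
    total = run_until_dry(fake_pass, path)
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

print(total, len(lines))  # 3 3
```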
- How the OpenCode Extraction Engine Works (and Where It Breaks) · [md]
(engineering report — third opinion wanted)
2026-06-03 · A complete, honest description of donto's agentic
LLM fact-extraction pipeline — GLM-5.1 driven via OpenCode in a container,
writing evidence-anchored claims into the paraconsistent substrate — with
real load-test measurements on a 2,818-row historical dataset. Documents
the architecture, the multi-pass prompt + anchoring model, and the failure
modes we found: the "size cliff" (large inputs → 0 facts), the confirmed
root cause (the agent generates the whole facts.json as a
single write-tool argument that never completes for large outputs,
so nothing is captured), the speed↔facts tension (verbose JSON ≈ 55
tokens/fact), and flock/concurrency contention. Proposes a redesign — an
iterative, self-refining extraction loop that accumulates
and self-critiques facts across many small persisted turns until the agent
judges the output complete, paired with an append-friendly line format
(JSONL/TSV). Ends with six explicit questions for an external reviewer.
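Why a line format is friendlier to interrupted output can be shown with a small Python sketch. It illustrates the format property only, not the agent's write tool: the `p`/`o` field names and the sample strings are placeholders, and `load_jsonl_prefix` is a hypothetical helper.

```python
import json


def load_jsonl_prefix(text: str) -> list:
    """Parse JSONL line by line, keeping every complete line before a broken one."""
    facts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            facts.append(json.loads(line))
        except json.JSONDecodeError:
            break  # truncated tail; earlier lines are already captured
    return facts


# The same two facts, cut off at the same point in each format.
interrupted_json = '[{"p": "q1", "o": "a"}, {"p": "q2", "o": "b'
interrupted_jsonl = '{"p": "q1", "o": "a"}\n{"p": "q2", "o": "b'

try:
    json.loads(interrupted_json)
    whole_file_result = "parsed"
except json.JSONDecodeError:
    whole_file_result = "nothing captured"

recovered = load_jsonl_prefix(interrupted_jsonl)
print(whole_file_result)  # nothing captured
print(recovered)          # [{'p': 'q1', 'o': 'a'}]
```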
donto — discovery & vision
- donto — The Substrate for Generative Abundance · [md]
(unified — iteration 4)
2026-06-02 · The unifying resynthesis, combining the visionary
voice of the lens-engine report with the measurable, product-focused
spine of the claim-substrate report — and recentred on one thesis:
generating typed knowledge was always the scarce step; it just
became abundant, and that abundance is a primitive no prior
system had. A guided frontier LLM emits an essentially unbounded,
multi-directional space of properties and relations about any entity,
inventing the predicates as it goes (GPTKB: 105M triples / ~36
properties per entity at $0.00009/triple; AutoSchemaKG: a 900M-node,
5.9B-edge graph with zero predefined schema; ~10×/year cost collapse →
a 1M-resume corpus is a $4K–40K line item). Headline design principle:
emit free / untyped now; defer typing, alignment, identity and
joining to query time — donto's native strength — reframing
the live ~938K freely-minted predicates from a "proliferation problem"
into the signature of abundance. Covers the abundance engine, the
substrate-as-possibility-space, measurement as the steering wheel, the
8-step claim lifecycle, the jsonresume→jobs flagship, 10 example
projects, falsifiable milestones + baselines, and the horizon.
Companion: research appendix (md).
- donto as a Claim/Discovery Substrate — Product Spec, jsonresume→jobs & 10 Projects · [md]
2026-06-02 · Iteration 3, deliberately tight. Reframes donto
from "lens engine" to a contradiction-preserving claim/discovery
substrate whose product is the 8-step claim lifecycle (ingest →
typed-claim extraction → hold incompatible claims → generate
relationship hypotheses → evidence + counter-evidence → rank →
re-rank on new evidence → explain). Settles the volume question
honestly (maximize at the typed-extraction layer — ~1–2k falsifiable
claims per paragraph; gate at the relationship layer — Calude-Longo),
insists lenses emit typed claims not prose, and goes deep on the
flagship jsonresume → jobs matching application
(ESCO/O*NET/Lightcast skill claims, explainable evidence-anchored
matches, skill-decay, network-effect career-path discovery,
competitive landscape). Ends with 10 concrete example
projects across domains (talent, genealogy, linguistics,
drug-repurposing, law, science-integrity, OSINT, clinical,
financial-crime, personal-AI memory), a falsifiable first milestone +
baselines, and 5 honest risks. Companion: research appendix (md).
- The Lens Engine: Discovery at the Intersection of Many Apertures · [md]
2026-06-01 · ~9,000 words · A research essay on
polyperspectival decomposition: using agents to break any
entity down through many deep lenses (philosophical, temporal,
causal, mereological, teleological, ethical…) and harvesting the
inter-entity relationships that emerge at the intersections
— connections no single mind would draw. Honestly maps the lineage
(Swanson's literature-based discovery, Koestler's bisociation +
BisoNets/CrossBee, Gentner/Hofstadter analogy, Ranganathan facets +
Formal Concept Analysis, Wierzbicka's semantic primes +
Pustejovsky's qualia, Burt's structural holes + Uzzi's atypical
combinations, KG link-prediction, AI co-scientist/SciAgents),
isolates the genuine white space (agentic × many-deep-lenses ×
paraconsistent hold-and-verify substrate), designs the engine
(lens taxonomy, cross-lens hypothesis generation, novelty×plausibility×value
scoring, the verification ladder), confronts the noise/pareidolia
problem and a falsifiable time-slicing pilot, and ends on the
10-year horizon. Distilled from a 10-area study + 4 adversarial
critiques.
- The Lens Engine: Research Appendix (raw findings) · [md]
2026-06-01 · Companion archive: the 4 adversarial critiques
(verdicts + counterarguments) and all 10 area findings — foundational
works and modern AI systems with URLs, relevance-to-the-engine,
already-done-vs-white-space, and hard problems per area.
donto — company & strategy
- donto: A Strategy for Turning a Knowledge Substrate Into a Company · [md]
2026-06-01 · ~9,000 words · Flagship strategy document
synthesised from an 11-area landscape study and 5 adversarial
thesis stress-tests. Covers the sharpened one-sentence thesis,
why-now macro forces (agent explosion, provenance crisis, EU AI
Act, memory-as-the-moat), the defensible core (paraconsistency +
governance) vs. what is commodity (bitemporality, Postgres,
scale), the four-arena competitive landscape (Mem0 / Zep-Graphiti
/ Letta / Cognee / Supermemory / GraphRAG / XTDB / Datomic /
Wikidata), the layered company + six new verticals (clinical,
legal, scientific claim-curation, OSINT/ACH, EU-AI-Act audit,
sovereign/indigenous memory), the open-core MCP-native GTM that
resolves the "substrate, never a product" tension, an honest take
on the "1M facts per text" horizon, the hard problems (technical,
market, business, legal/ethical, founder), an annotated
modern-research reading map, and a sequenced 6–18 month plan.
- donto — Company Vision: Research Appendix (raw findings) · [md]
2026-06-01 · Companion archive: the full structured output
of the 5 thesis stress-tests (with verdicts, confidences,
strongest-counterarguments) and all 11 area findings — every named
player with funding/traction, every cited paper with URL, and the
per-area differentiator / gap / opportunity / risk breakdowns.
donto-memory — extraction engine
- donto-memory deep-mode — engine reference · [md]
2026-05-31 · Comprehensive engine doc covering the
mode: "deep" pipeline end to end: request lifecycle,
async queue + tokio Mutex, memorize_one internals,
the extract_deep orchestrator, the prior-facts
block, the LLM-call shape, JSON salvage on truncation,
content-key dedup, token + cost accounting, audit log schema,
/jobs UI, substrate outputs, observed empirics from
both runs, known limits, and a numbered roadmap. Includes a
file map and glossary.
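One way "JSON salvage on truncation" can work is to decode complete objects one at a time and stop at the first incomplete one. The Python sketch below is an assumption about the general technique, not the engine's actual salvage routine; `salvage_objects` is a hypothetical name, and it expects the model output to be a JSON array of objects.

```python
import json


def salvage_objects(text: str) -> list:
    """Recover every complete element from a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    recovered = []
    pos = text.find("[")
    if pos == -1:
        return recovered
    pos += 1
    end = len(text)
    while pos < end:
        # Skip whitespace and element separators before the next value.
        while pos < end and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= end or text[pos] == "]":
            break
        try:
            # raw_decode returns the value and the index just past it.
            value, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the remainder was cut off mid-value
        recovered.append(value)
    return recovered


truncated = '[{"a": 1}, {"b": 2}, {"c":'
print(salvage_objects(truncated))  # [{'a': 1}, {'b': 2}]
```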
- Deep-mode extraction on a 109-word Discord message yields 1000 facts · [md]
2026-05-31 · End-to-end review of the first mode:"deep"
run: a Nietzsche/Eternal-Recurrence Discord message put through
7 sequential GLM-4.7 passes, producing 1013 raw facts → 1000
unique after dedup. Per-pass yield curve, saturation gradient,
pass_2 prose-not-JSON failure, identity-collapse side effect,
operational notes on the new async tokio-Mutex queue, and a
punch list of follow-ups (JSON-retry, max_tokens bump,
identity-resolution post-pass, per-modality default
passes).
donto-memory — extraction experiments
donto — forward-looking design
- donto — Substrate PRD (PRD-SUBSTRATE-002) · [md]
2026-05-28 · ~5,500 words · Locks donto in as
infrastructure rather than product. Names the first-tier
consumers (donto-memory, genes, donto-lang). Articulates the
substrate contract and the consumer contract. Specifies M10
(Substrate Hardening — overlay extension API, predicate
minting controls, hot-path policy projection, true-deletion
tombstones, schema-discovery API, multi-tenant pattern, SDKs
in Rust / TS / Python), M11 (Federation), and M12 (Scale &
Calibration). Defines the substrate test and the
definition-of-done for "donto is a true substrate".
donto — systems papers
- donto: An Evidence Operating System for Contested Knowledge · [md]
2026-05-28 · ~13,000 words · Long-form scientific paper
covering donto's architecture, the ten non-negotiable
invariants, the 14-family data model, the DontoQL query
language, the six-aperture extraction pipeline, the Trust
Kernel, identity and predicate alignment, source-provenance
tracing, the Lean 4 overlay, the release-and-federation
machinery, and an empirical characterisation of the
39.3M-statement genes corpus.
donto — status snapshots