The Embedding Fabric: How Pervasive Embeddings Make donto's Query-Time Vision Real

2026-06-03

Abstract

donto's entire bet is to emit free / untyped now and defer joining (typing, alignment, identity resolution) to query time. This report argues that a deferred join is only ever as good as the key you join on, and that donto's join key today is lexical (trigram similarity plus a tsvector FTS projection), which is precisely the brittle, surface-form fallback the abundance vision explicitly forbids. The proof is one predicate: lexically, the neighbours of killed are only {killedAt, killedBy, killedIn, killedOn}; semantically, the top match is murdered (cosine 0.95). The trigram similarity of killed ↔︎ murdered is just 0.0667, an incidental overlap on the shared -ed suffix that sits more than four times below the 0.30 alignment threshold, so lexical cannot reach it in practice. Zero- and near-zero-overlap synonyms like slew (trigram 0.0) and assassinated (0.0556) are effectively unreachable to any character key, while semantically they sit beside killed. From this we make the central claim: embeddings are not a feature bolted onto predicate alignment; they are the enabling substrate primitive that makes query-time alignment, and therefore the whole defer-joining vision, actually work. (Lexical 0.0667 vs semantic 0.95 on killed ↔︎ murdered is the contrast in one line.) Generalized to every join-relevant object (predicate, entity, statement, span, document, context), this becomes an embedding fabric: one maintained vector per object, refreshed by one continuous loop, consulted everywhere a match, rank, or join happens. The non-negotiable constraint, the line that keeps donto donto: embeddings may cluster and rank; they must never collapse or merge. Identity stays a hypothesis; contradictions stay held; alignment stays non-destructive query-time expansion. We specify what to embed, how the loop maintains it, how each consumer reads it, the costs against this box's real disk budget, and the eval suite that measures whether the fabric finds the joins lexical keys cannot.


1. Embeddings are the join key abundance needs

For sixty years the scarce step in every knowledge system was generation — minting a typed fact cost human attention, and so every architecture downstream of it was built to ration. That scarcity is gone. A guided frontier model now emits an essentially unbounded, multi-directional space of properties and relations about any entity — inventing the predicates as it goes — for roughly $0.0001 each. The hard problem flipped. It is no longer "how do we generate enough typed knowledge?" It is "where do we put an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?"

donto's answer is its whole identity: hold everything, paraconsistently, and defer joining to query time. Don't dedup, don't pick a winner, don't invalidate-on-conflict. Anchor every claim to its source, keep incompatible claims forever as legal state, and resolve typing, alignment, and identity later — at the moment a query actually needs them. The live store is the proof that this is happening at scale: 39,560,959 live statements across ~19,931 contexts, minted under 865,834 distinct predicates. That predicate count is not a proliferation bug to be cleaned up. In the frontier-test corpus, roughly 4,995 of ~6,111 predicates were singletons — used exactly once. That long tail is the signature of abundance. A system that forces every emission through a fixed schema would have discarded most of it at the door; donto keeps it and types it on demand.

But here is the twist that this report is about. A deferred join is only as good as the key you join on. "Defer alignment to query time" is a beautiful principle right up until a query arrives and asks: which of these 865,834 predicates mean the same thing as the one I'm asking about? That question is answered by a join key, and today donto's join key is lexical — trigram similarity (donto_suggest_alignments) and an FTS tsvector projection over humanized IRI segments. Lexical matching is exactly the brittle, surface-form fallback the abundance vision explicitly forbids. CLAUDE.md's no-brittle-logic rule is unambiguous: never resolve a semantic problem with string overlap, synonym lists, or if/elif ladders. Yet a trigram join is a string-overlap heuristic wearing a SQL function's clothes. It can only ever find predicates that share characters.

The cost of that limitation is measurable and stark. Ask the lexical engine for neighbours of the predicate killed and it returns {killed, killedAt, killedBy, killedIn, killedOn} — five variants that all share the substring killed. It cannot reach murdered, whose trigram similarity to killed is a mere 0.0667 (a single incidental overlap on the -ed suffix, far below the 0.30 alignment threshold), nor genuine zero-overlap synonyms like slew (0.0) or assassinated (0.0556) — even though any human (or model) reads them all as the same relation. The semantic index built in this work returns exactly those: murdered at 0.95 cosine, with slew and assassinated in the same neighbourhood — true synonyms with effectively no lexical overlap whatsoever. This is the entire abundance thesis compressed into one example. killedBy ↔︎ assassinatedBy, rdfType ↔︎ rdf:type, the thousands of self-invented predicate variants the firehose produces every day — these are precisely the alignments a character-based key structurally cannot make, and they are precisely the alignments query-time resolution exists to make. A lexical join key doesn't just underperform; it makes the wrong cases impossible by construction.

So the central argument is this:

Embeddings are not a feature bolted onto predicate alignment. They are the enabling substrate primitive that makes query-time alignment — and therefore the entire defer-joining-to-query-time vision — actually work. The vision's load-bearing operation is the deferred join; the deferred join's load-bearing component is its key; and the only non-brittle key available for an open, self-minted, semantic vocabulary is a learned vector. Embeddings are the non-brittle join key abundance needs.

This reframes embeddings from an optimization into a precondition. donto's three pillars — paraconsistent hold, evidence anchoring, bitemporal re-ranking — all assume that when a query finally reaches across the firehose, it can find the claims that belong together despite never having been forced together at write time. Lexical keys break that assumption for the exact alignments that matter most. The state of the alignment machinery confirms how thin the current foundation is: ~18,500 live predicate alignments and 62,559 rows in donto_predicate_closure (as of 2026-06-03; the loop is now actively running, so these drift), all of it lexical-only — and until this work, the alignment engine was dormant and un-scheduled. The identity layer is barely exercised at all (122 identity edges; the cluster cache empty). The differentiating machinery exists; it has simply never had a key strong enough to make it load-bearing.

1.1 The embedding fabric

The fix is not to embed predicates and stop. If a vector is the right join key for predicates, it is the right join key for every object donto defers a join on. That generalization is what this report calls the embedding fabric:

An embedding fabric is a maintained, dense vector attached to every join-relevant object type in the substrate — predicate, entity, statement, span, document, context — kept fresh by one continuous background loop, and consulted everywhere a join, a match, a rank, or a "find-similar" happens. It is a single semantic coordinate system laid over the entire store, so that "things that mean the same thing land near each other" becomes a primitive the substrate offers rather than a trick each consumer reinvents.

Concretely, the seed of this fabric already exists: pgvector 0.8.2 is installed; 30,587 predicates (as of 2026-06-03, with the backfill still in progress) are embedded with fastembed BAAI/bge-small-en-v1.5 (384-dim) in donto_predicate_embedding behind an HNSW vector_cosine_ops index; and the full ~865,800-predicate registry backfill is queued. The new donto_suggest_alignments_semantic and _hybrid SQL functions are the first consumers. The rest of this report argues that this should not remain a predicate-only convenience: it should become a substrate-wide primitive, with the same vector discipline applied to entities (identity-as-hypothesis), statements and spans (evidence retrieval and contradiction-clustering), documents, and contexts.

One nuance is non-negotiable, because it is where naive "just use embeddings" advice would destroy donto's identity. Embeddings cluster and rank; they must never collapse or merge. Proximity in vector space is a hypothesis that two predicates, or two entities, are the same — exactly the status donto already assigns to identity. Query-time alignment must be non-destructive expansion (when you ask for killed, also surface the claims filed under murdered and wasKilled, with their alignment scores attached), never a write-time rewrite that fuses them. The firehose stays whole. Contradictions stay legal. Identity stays a hypothesis the query can lean on or override. The embedding fabric makes the deferred join possible; paraconsistency is the constraint that keeps it donto. The sections that follow specify what to embed, how the continuous loop maintains it, how each consumer reads it, and how to measure whether it is actually finding the joins lexical keys cannot.
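A minimal sketch of what non-destructive expansion looks like in code, under stated assumptions. The toy store, the `NEIGHBOURS` lookup, and the function name `expand_query` are hypothetical and exist only for illustration; this is not donto's actual query path. The 0.95 and 0.94 scores are the cosine figures quoted in this report.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Toy store: three statements filed under three different predicates.
STORE = (
    Statement("ex:a", "killed", "ex:b"),
    Statement("ex:c", "murdered", "ex:d"),
    Statement("ex:e", "bornIn", "ex:f"),
)

# Hypothetical alignment lookup for the probe "killed"; a real system would
# read these scores from a vector index instead of a dict.
NEIGHBOURS = {"killed": {"murdered": 0.95, "wasKilled": 0.94}}

def expand_query(probe: str, min_score: float = 0.9):
    """Return (statement, alignment score) pairs for the probe and its aligned
    predicates. The store is only read; nothing is rewritten or merged."""
    scores = {probe: 1.0}
    for predicate, score in NEIGHBOURS.get(probe, {}).items():
        if score >= min_score:
            scores[predicate] = score
    hits = [(st, scores[st.predicate]) for st in STORE if st.predicate in scores]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

results = expand_query("killed")
for statement, score in results:
    print(statement.predicate, score)
```

The point of the design is that the alignment score travels with each hit, so a caller can tighten `min_score` or ignore the expansion entirely, while the statement filed under murdered stays exactly as it was written.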


2. The lexical ceiling: where donto is blind today

Section 1 argued why deferring the join to query time is the correct bet in the age of generative abundance: when generation is cheap and unbounded, the only sane place to type, align, resolve identity, and dedup is late, against the accumulated firehose, never at write time. But a deferred join is only ever as good as the key you join on — and today every deferred operation in the live substrate joins on the surface string. The substrate can read characters. It cannot read meaning. That gap is not a missing feature; it is a structural blindness sitting directly underneath the thesis, and this section measures it.

2.1 The shape of the firehose

The live numbers make the problem concrete and unavoidable:

All figures as of 2026-06-03; the embedding backfill and the alignment loop are both actively running, so the last rows drift upward.

| Object | Live count | Note |
|---|---|---|
| Live statements (upper(tx_time) IS NULL) | 39,560,959 | donto_statement is ~34 GB on disk; a full scan ≈ 34s |
| Contexts | ~19,812 | ctx:genealogy + ctx:genes ≈ 98.7% |
| Distinct predicates (registry) | 865,836 | the freely-minted "tail" |
| Singleton predicates (used exactly once) | 733,401 | 84.7% of all predicates appear on a single statement |
| Live predicate alignments | 18,488 | trigram-only; dormant/un-scheduled until this build |
| donto_predicate_closure rows | 62,559 | transitive expansion of those alignments |
| Identity edges (donto_identity_edge) | 122 | the contradiction/identity machinery, barely used |
| Predicates with an embedding | ~30,587 / 865,836 | ~3.5% and climbing; this build's bootstrap, the rest queued |

The headline is 865,834 distinct predicates, but the sharper number is 733,401 singletons (84.7%). The substrate is not holding a few thousand reusable schema terms with a long tail of typos — it is holding an overwhelmingly one-shot predicate space where the typical predicate has been minted once and never again. This is exactly what the vision predicts and welcomes: it is the signature of LLM extraction inventing predicates as it goes (rdfType, killedBy, wasAssassinatedBy, dateOfDeathApproximate, servedAsWitnessAt), each anchored to evidence, none collapsed. Abundance is working as designed on the write path.

The problem is entirely on the read path. A predicate space that is 84.7% singletons is useless to query unless you can group it by meaning at query time. If a user asks "who killed whom," the answer is scattered across killed, wasKilledBy, murdered, assassinated, putToDeath, slew, causedDeathOf, and several hundred one-off variants — and the only mechanism the live substrate has to gather them is to match the characters in the string. That mechanism cannot see that murdered and killed are the same question.

2.2 The live proof: lexical can't reach "murdered"

The clearest demonstration is a single predicate. Ask the substrate for the neighbours of killed using the alignment engine as it existed before this build — donto_suggest_alignments, which is pure pg_trgm trigram similarity:

lexical neighbours of "killed"  →  killed, killedAt, killedBy, killedIn, killedOn

Every single neighbour is a substring relative of killed. The trigram index can only find predicates that share characters with the probe, so it returns the morphological family and stops. The query-time alignment that the entire vision rests on, run lexically, expands killed into five spellings of the same English root and declares victory.

Now the semantic neighbours, from this build's donto_suggest_alignments_semantic (fastembed bge-small-en-v1.5, 384-dim, cosine over the HNSW index):

semantic neighbours of "killed"  →  murdered (0.95), wasKilled (0.94), killedOn (0.94), killedAt (0.94), killedIn (0.93) ...

murdered is effectively invisible to lexical alignment: its trigram similarity to killed is 0.0667. pg_trgm pads each word with two leading spaces and one trailing space, so the two share exactly one incidental trigram, the trailing "ed " from their common -ed suffix, out of {"  k", " ki", "ed ", "ill", "kil", "led", "lle"} for killed and {"  m", " mu", "der", "ed ", "ere", "mur", "rde", "red", "urd"} for murdered. That is 1 shared trigram out of 15 distinct ones, or 1/15 ≈ 0.0667. The score is more than four times below the 0.30 alignment threshold, so lexical cannot return murdered at any usable threshold (you would have to drop the cutoff below ~0.07, where the engine would also surface thousands of unrelated -ed predicates and become useless). Zero- and near-zero-trigram synonyms make the point even more sharply: slew shares no trigram with killed (similarity 0.0) and assassinated scores only 0.0556, so there is effectively nothing for the string-distance metric to grab. The embedding returns all of them near 0.95 because they mean the same thing. This is the canonical hard case from the project's own rules (killedBy ↔︎ assassinatedBy, rdfType ↔︎ rdf:type), and it is the whole report in one line: the trigram engine and the embedding engine disagree precisely where meaning and spelling diverge, which is exactly where alignment matters.
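The trigram figures above can be reproduced by hand. The sketch below is a small pure-Python approximation of pg_trgm's similarity on raw, un-humanized strings: lowercase the input, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, and compare the resulting trigram sets. The function names are illustrative; the real computation happens inside Postgres, and any humanizing that donto_suggest_alignments applies before comparing is not modelled here.

```python
import re

def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm trigram extraction for a raw string."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

for other in ("murdered", "slew", "assassinated", "wasKilled"):
    print(other, round(similarity("killed", other), 4))
# murdered 0.0667
# slew 0.0
# assassinated 0.0556
# wasKilled 0.4167
```

Only wasKilled clears the 0.30 threshold, which matches the split described in this section: the morphological family is reachable, the synonyms are not.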

The table below is the entire indictment. Note wasKilled is not in the "missed" column: its trigram similarity to killed is 0.4167 (it shares ill, kil, led, lle, ed ), so lexical alignment at threshold 0.3 does return it. wasKilled is the morphological family lexical is good at; the synonyms lexical genuinely cannot reach are the zero/near-zero-trigram ones.

| probe | what lexical returns (trigram sim) | what it misses (semantic, cosine) |
|---|---|---|
| killed | killedAt/killedBy/killedIn/killedOn (0.70), wasKilled (0.42) | murdered (trigram 0.0667 / cosine 0.95), assassinated (0.0556), slew (0.0) |

Lexical is not wrong about killedBy or wasKilled — those are genuine relatives. It is blind about murdered, slew, and assassinated. And blindness, not error, is the failure mode that scales catastrophically across 733,401 singletons, because singletons are exactly the predicates with no morphological family to fall back on.

2.3 The closure inherits the blindness

It is tempting to think donto_predicate_closure (62,559 rows) rescues this — that even if each alignment edge is shallow, the transitive closure stitches the variants together into rich equivalence neighbourhoods. It does not, and it cannot, for two compounding reasons.

First, the closure is only the transitive hull of the edges it is given. Those 62,559 rows are computed over the 18,488 lexical alignments. If killed → murdered was never an edge (and it never could be, lexically), then no amount of transitive expansion invents it. killed → killedBy → killedOnDate → ... closes over the trigram family and stays trapped inside it. Closure amplifies whatever signal the edges carry; fed a string-similarity signal, it produces a larger string-similarity neighbourhood, not a semantic one. Garbage-in is not the right phrase: it is blindness-in, blindness-amplified.
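The first reason can be checked with a toy graph. The sketch below computes reachability over a handful of alignment edges, treating edges as undirected (an assumption; the real donto_predicate_closure may be directional) and using murderedBy as a hypothetical predicate for illustration.

```python
from collections import defaultdict, deque

def closure(edges):
    """Map each predicate to every predicate reachable through alignment edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    reach = {}
    for start in list(graph):
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        reach[start] = seen - {start}
    return reach

# Edges a trigram matcher can find: each pair shares most of its characters.
lexical_edges = [
    ("killed", "killedBy"),
    ("killedBy", "killedOn"),
    ("killed", "killedAt"),
    ("murdered", "murderedBy"),
]

lexical = closure(lexical_edges)
# One semantic edge is enough to connect the two spelling families.
semantic = closure(lexical_edges + [("killed", "murdered")])

print(sorted(lexical["killed"]))
print(sorted(semantic["killed"]))
```

With lexical edges alone, killed reaches only its own spelling family; the murdered family is a separate component that closure can never bridge. Adding a single killed ↔︎ murdered edge merges both families in one step, which is why the quality of the edge source matters more than the closure algorithm.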

Second, the alignment engine that feeds the closure was dormant. The SQL machinery (donto_suggest_alignments, the closure builder) existed but was un-scheduled — nothing ran it on a loop, so the ~18,500 alignments and 62,559 closure rows are a stale, partial snapshot over a predicate space that has grown to ~865,800. Coverage is the killer statistic here: ~18,500 alignment edges against 733,401 singletons means the vast majority of the predicate tail has no alignment edge at all, lexical or otherwise. The closure is not just lexically blind; it is largely empty over precisely the long tail that abundance produces and that query-time joining must reach. A deferred-join key that exists for a few percent of keys is not a key.
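The "few percent" figure follows from simple arithmetic on the counts above. This is a worked upper bound, not a measured coverage number: it assumes the best case in which every alignment edge touches two predicates that no other edge touches, and it uses the 865,834 predicate count quoted in the prose.

```python
alignment_edges = 18_488
distinct_predicates = 865_834
singletons = 733_401

# Each edge joins two predicates, so the edges can touch at most
# 2 * alignment_edges distinct predicates.
max_covered = 2 * alignment_edges
coverage_upper_bound = max_covered / distinct_predicates
singleton_share = singletons / distinct_predicates

print(max_covered)                     # 36976
print(f"{coverage_upper_bound:.1%}")   # 4.3%
print(f"{singleton_share:.1%}")        # 84.7%
```

Even under that generous assumption, at most about 4.3% of predicates carry any alignment edge, against a tail in which 84.7% of predicates are singletons.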

2.4 Search is FTS, which is the same blindness wearing a different hat

The substrate-wide /search endpoint is the other place where a meaning-join is deferred to query time, and it makes the identical compromise. The index donto_statement_fts_name is a GIN tsvector over a humanized projection of subject + object_iri + left(object_lit,120), queried with plainto_tsquery + ts_rank. It is well-engineered as lexical search — the to_tsvector(...) expression matches the index DDL exactly (including upper(tx_time) IS NULL) so it doesn't seq-scan 39.5M rows, a bounded candidate CTE caps latency, and it returns in ~270–820 ms. None of that is the problem.

The problem is that tsvector is lexical to the bone. It tokenizes, lowercases, and stems — running → run — but stemming is morphology, not meaning. A search for killed will not surface a statement whose object reads was assassinated; a search for physician will not find doctor; a search for spouse will not find wife. Postgres FTS reaches synonyms only through a hand-maintained synonym or thesaurus dictionary — which is exactly the static string-list the project's non-negotiable no-brittle-logic rule forbids. So /search, like alignment, is structurally capped at the morphological boundary. Fast, bounded, correct-as-spelled, and blind to every synonym a human or an LLM would consider obvious.

2.5 The general law: every deferred operation falls back to surface strings

Step back and the pattern is not five separate gaps — it is one gap, appearing everywhere donto defers a meaning-join to query time. Each query-time-deferred operation needs a notion of "same / related," and in the live substrate that notion is currently implemented as "shares characters."

| Deferred operation | What it must decide | Live mechanism | Blind to |
|---|---|---|---|
| Predicate alignment | is killed the same relation as murdered? | trigram (donto_suggest_alignments) | synonyms with no shared trigrams |
| Entity / identity resolution | is ex:kitty the same person as ex:kitty-wulbar? | string/IRI overlap; dozens of identity edges total | co-referent entities with different surface forms |
| Substrate-wide search | which statements answer this query? | tsvector FTS (donto_statement_fts_name) | synonymous objects/predicates |
| Contradiction detection | do two claims even talk about the same predicate+entity? | requires aligned keys → inherits lexical alignment | contradictions hidden behind different spellings |
| Dedup / clustering | are these two claims restatements? | no semantic key; cluster cache empty | paraphrased duplicates |

The last two rows are where the cost compounds into the substrate's differentiator. donto's reason to exist is paraconsistent contradiction-holding and re-ranking — but a contradiction can only be detected once you agree the two claims are about the same predicate and the same entity. If A killed B and B was murdered by A live under non-aligned predicates and non-resolved entities, the substrate never notices they are even comparable, so it never holds them as a contradiction, never builds an argument edge, never re-ranks by reality. The differentiator is gated behind the alignment key — and the alignment key is lexical. This is visible in the data: donto_argument carries ~2,426 edges and donto_identity_edge only dozens, against 39.5M statements. The contradiction machinery is barely exercised not because contradictions are rare in a 33M-statement contested-genealogy corpus — they are everywhere — but because the substrate can't see them through the surface strings.

2.6 What this costs, stated plainly

The lexical ceiling is the cost of not having an embedding layer, and it is not an inconvenience at the margin — it is a cap on the core thesis:

Crucially, the forbidden fix and the blindness are the same thing seen twice. The brittle escape hatch from lexical blindness — a synonym table, an if/elif predicate-variant map, a hand-curated thesaurus — is precisely what the no-brittle-logic rule outlaws, because it does not scale to 865,834 freely-minted predicates and never will. So the live substrate sits trapped between two walls: the lexical engine it has cannot see meaning, and the static-list engine it could bolt on is forbidden (and unmaintainable) by design. The vision says emit free now, defer the meaning-join to query time — but the meaning-join, today, has no meaning in it.

The next section names the way out across every place a join is deferred. There is exactly one join key that is non-brittle (no lists to maintain), learned (not hand-authored), and meaning-bearing (reaches murdered from killed despite a trigram similarity of just 0.0667, and slew despite literally zero shared characters): the embedding.


3. Aspect by aspect: what the fabric changes

The argument so far has been structural: a deferred join is only as good as its key, and donto's key today is lexical, which the vision explicitly forbids. This section makes that argument concrete across the ten places in the substrate where joins, comparisons, and decisions actually happen. Each one is, on inspection, a similarity operation wearing a different costume — and each one is, today, served by a trigram or tsvector fallback that fails precisely where abundance bites hardest: when two LLM-minted strings mean the same thing but share no characters.

The discipline throughout is the one the vision insists on: embeddings cluster and rank; they never collapse or merge. The fabric adds a non-brittle key to every comparison in the system without ever destroying a statement, retracting a winner, or hardening a hypothesis into a fact. Identity stays a hypothesis. Contradiction stays held. Alignment stays a query-time expansion, not a write-time deletion.

| # | Aspect | Today (lexical / absent) | With the embedding fabric |
|---|---|---|---|
| 1 | Predicate alignment | donto_suggest_alignments = trigram only; ~18,500 live alignments, 62,559 closure rows, all from shared substrings | killed↔︎murdered (trigram 0.0667 / cosine 0.95), killed↔︎slew (trigram 0 / cosine ~0.9): synonyms lexical cannot reach in practice; ~30k+/865,834 predicates embedded, HNSW recall |
| 2 | Entity / identity resolution | dozens of identity edges, cluster cache empty; resolution by IRI string + label trigram | Vector built from label + statement-signature; nearest-neighbour candidate generation feeding donto_identity_edge; still a hypothesis, non-destructive |
| 3 | Semantic + hybrid /search | GIN tsvector over humanized IRI segments; plainto_tsquery, seq-scan risk on 39.5M rows | Vector recall fused with FTS via reciprocal-rank fusion; finds paraphrase, not just lexical overlap |
| 4 | Lens Engine / discovery | Cross-lens step unbuilt; analogy has no operator | Analogy = vector arithmetic; structural similarity = neighbourhood-embedding distance; discovery becomes a native query |
| 5 | Contradiction & corroboration | ~2,426 argument edges over 39.5M statements; conflict found only on exact (s,p,o) collision | Cluster claims about the same proposition despite wording; populate donto_argument at scale; cluster, don't collapse |
| 6 | Evidence ↔︎ claim anchoring | ~1.88M evidence links (4.75%); span match by substring | Semantic span match when the claim paraphrases the source and substring search returns nothing |
| 7 | Extraction-time reuse | New predicate/entity minted per run; tail grows unbounded (4,995 singletons in the frontier test) | Semantic retrieval of existing predicates/entities at emit time; reduce fragmentation at the source, by choice not by force |
| 8 | Query-time dedup / collapse | No dedup primitive; either keep all rows or pick a winner (forbidden) | Collapse-as-a-lens: a view clusters near-duplicate objects on read; the base rows are untouched |
| 9 | Emergent typing / ontology induction | 865,834 predicates, no families; typing deferred but never done | Cluster the predicate tail into families: the deferred typing the vision promises, induced not authored |
| 10 | Routing / classification | if/elif over names, hand-maintained lists (the forbidden pattern) | Embedding nearest-centroid classifiers; routing learned from data, refreshed by one loop |

The rest of this section takes each row in turn.

3.1 Predicate alignment — the proof that lexical is not enough

This is where the thesis is no longer a claim but a measurement. donto's alignment engine already exists: donto_suggest_alignments produces candidate equivalences, donto_register_alignment records them, and donto_predicate_closure (62,559 rows) materializes the transitive expansion consulted at query time. There are ~18,500 live alignments today (18,488 as of 2026-06-03, and now drifting as the loop runs). Every one of them was found by trigram similarity — that is, by shared substrings. That is exactly why the registry's kill* neighbourhood aligns cleanly into {killed, killedAt, killedBy, killedIn, killedOn} and stops there. Trigram similarity is, by construction, blind to any synonym that does not share characters.

The fabric closes that blindness. With ~30,000+ of 865,834 predicates embedded into donto_predicate_embedding (384-dim bge-small-en-v1.5, HNSW vector_cosine_ops index) and the new donto_suggest_alignments_semantic / donto_suggest_alignments_hybrid functions, the top semantic neighbour of killed is murdered (0.95) — a true synonym whose trigram similarity to killed is only 0.0667, far below any usable alignment threshold, so trigram search cannot reach it in practice. Zero-trigram synonyms like slew (0.0) and assassinated (0.0556) are reachable only semantically. This is the canonical killedBy ↔︎ assassinatedBy problem from the no-brittle-logic rule, solved by similarity rather than by a hand-maintained synonym table. The hybrid function is the right default: lexical for morphological variants (killedBy/killedAt), semantic for paraphrase (murdered/assassinated), fused so neither blind spot survives.

The scale stakes are specific to donto's corpus. The kill* neighbourhood in the live registry is not five tidy predicates — it includes event-narrative monsters like 100NativePeopleKilled, 11SettlersKilledLargestLossPowellFirst8Killed70Total3UnarmedWomenIndiscriminate, and 15AboriginalsKilledIncludingBaulie. These are the 4,995 singletons of the frontier test made flesh: each is a once-minted, sentence-length predicate that lexical alignment will never connect to the generic killed, because the shared trigrams are drowned by the surrounding tokens. Embeddings place all of them in the same semantic region as killed/murdered, which is the only mechanism that can ever fold the abundance tail back toward a usable join key.

Before: killed aligns to {killedAt, killedBy, killedIn, killedOn, wasKilled} — its spelling family. A query for "who killed whom" misses every statement minted under murdered, slew, assassinated, or any narrative variant that shares no usable substring. After: the same query, expanded through a hybrid closure, reaches murdered, assassinated, and the long narrative tail — without a single synonym ever being typed by hand, and without retracting or rewriting any of the original predicates.

3.2 Entity / identity resolution — a vector per entity, still a hypothesis

Identity is donto's most underexercised differentiator: dozens of donto_identity_edge rows against tens of millions of distinct subjects, and an empty cluster cache. The reason is the same as everywhere else — the only cheap candidate-generation key today is the IRI string and a trigram on the label, which means ex:kitty (a known junk-drawer URI), ex:kitty-wulbar, and a freshly minted ex:kitty-munro are only as joinable as their spelling. In genealogy, where the entire problem is "are these two attestations the same person under different names," a spelling-based key is the wrong tool by definition.

The fabric gives every entity a vector built from two signals: its label (the surface name) and its statement-signature — a pooled embedding of the predicates and objects attached to it (born-in, married-to, child-of, occupation). Two entities are identity candidates when they are near in this combined space, which captures "same name, same shape of life" even when the names diverge (Mahamoodally/Mamode Ally, Lablanche/Lablache). Nearest-neighbour search over those vectors becomes the candidate generator that feeds — not replaces — donto_identity_edge.
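One way the two signals could be combined is sketched below. This is an assumption-heavy illustration, not donto's design: the equal weighting, the mean pooling, the function names, and the three-dimensional toy vectors are all hypothetical stand-ins for real label and statement embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def entity_vector(label_vec, signature_vecs, label_weight=0.5):
    """Blend the label embedding with the pooled statement-signature embedding."""
    label = normalize(label_vec)
    signature = normalize(mean_pool(signature_vecs))
    mixed = [label_weight * l + (1 - label_weight) * s
             for l, s in zip(label, signature)]
    return normalize(mixed)

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy signatures: two "shapes of life" that point in opposite directions.
shared_life = [[0, 1, 0], [0, 0, 1]]
other_life = [[0, -1, 0], [0, 0, -1]]

person = entity_vector([1, 0, 0], shared_life)
variant_spelling = entity_vector([0.8, 0.6, 0], shared_life)  # different label, same life
namesake = entity_vector([1, 0, 0], other_life)               # same label, different life

print(round(cosine(person, variant_spelling), 3))
print(round(cosine(person, namesake), 3))
```

In this toy setup the variant spelling with the same statement signature scores far higher than the exact namesake with a different one. A nearest-neighbour pass over such vectors would only nominate the pair as a candidate identity edge; the decision to assert or retract that edge stays with the existing machinery.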

The non-negotiable nuance: this stays a hypothesis. The fabric proposes; it does not merge. An identity edge is a bitemporal, retractable assertion that two URIs may co-refer, scored and dated, exactly as the substrate already models it. Embeddings raise the recall of candidate generation from "names that look alike" to "people whose whole evidentiary footprint rhymes," but the destructive act — collapsing two entities into one — never happens. Paraconsistency requires that ex:kitty-as-Brady-2013 and ex:kitty-as-EKY-2026 can be held as possibly-distinct even while an identity edge hypothesizes they are one. The vector is the matchmaker, not the registrar.

Before: dozens of identity edges; candidate pairs surfaced only when labels share trigrams; co-referent people under variant spellings never even nominated. After: every entity carries a label+signature vector; nearest-neighbour scan nominates co-reference candidates across spelling boundaries, each written as a dated, retractable donto_identity_edge — recall goes up, destruction stays at zero.

3.3 Semantic + hybrid /search — fuse vector recall with FTS

The substrate-wide /search (donto-memory, POST /search) is the most-used read path and the clearest case of the lexical ceiling. It runs plainto_tsquery + ts_rank against donto_statement_fts_name, a GIN tsvector over a humanized projection of subject + object IRI + literal prefix, with a bounded candidate CTE to keep latency under a second across 39.5M rows. It is fast and it works — for terms that appear. A search for "homicide" returns nothing from statements minted under killed; a search for "spouse" misses marriedTo. FTS retrieves lexical overlap, never meaning.

The fabric adds a second retriever — vector recall over statement (or object/span) embeddings — and fuses the two with reciprocal-rank fusion rather than choosing between them. RRF is the right join because it needs no score calibration between an incomparable ts_rank and a cosine distance: each retriever contributes $\frac{1}{k + \text{rank}}$, the lists merge, and a result that ranks well in either surfaces. FTS keeps its precision on exact tokens and rare proper nouns (where embeddings are weakest); vectors add the paraphrase recall FTS structurally cannot have. The partial:true / 9s-timeout discipline already in search.rs carries over unchanged; the vector leg is just a second bounded candidate source feeding the same fusion step. (Section 4.3 specifies RRF in full; it is the query-side instance of the same ensemble that aligns predicates.)
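A minimal sketch of the fusion step follows. The constant k = 60 is a commonly used default for reciprocal-rank fusion and is an assumption here, as are the function name `rrf` and the statement ids; the report does not state which k the fabric would use.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to every item it contains."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical candidate lists from the two retrievers, best first.
fts_hits = ["s1", "s2", "s3"]
vector_hits = ["s4", "s1", "s5"]

fused = rrf([fts_hits, vector_hits])
print(fused)  # ['s1', 's4', 's2', 's3', 's5']
```

s1 wins because both retrievers returned it, and s4 surfaces second even though FTS never saw it. Only ranks enter the formula, so the raw ts_rank and cosine values never need to be put on a common scale.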

This also fixes the index-fragility footgun in one stroke: the existing FTS to_tsvector(...) expression must match the index DDL exactly (including upper(tx_time) IS NULL) or it silently seq-scans 39M rows. An HNSW vector index has no such hand-matched-expression hazard — the recall path is the index, full stop — so the semantic leg is both more capable and less brittle to operate.

3.4 The Lens Engine / discovery — analogy is a vector operation

The Lens Engine's discovery vision (relationships emerging at lens intersections, analogy, structural similarity) has a missing piece: the cross-lens step is unbuilt, and analogy has no operator. That is not an accident — analogy and structural similarity are inherently vector operations, and the substrate had no vectors. You cannot ask "what is to context A as X is to context B" with tsquery; you cannot rank entities by "structural role similarity" with a trigram. The lens vision was waiting for the fabric whether or not anyone named it.

With embeddings present, the core discovery moves become native queries. Analogy is offset arithmetic in entity space (the $a:b :: c:?$ pattern as nearest-neighbour to $\mathrm{vec}(b) - \mathrm{vec}(a) + \mathrm{vec}(c)$). Structural similarity is distance between neighbourhood embeddings — two people who occupy the same shape of relations (same predicate profile, same kinds of neighbours) are near even if they share no literal facts. Lens intersection becomes "objects near the centroid of two lens-defined regions at once." These are exactly the operations FCA, structure-mapping, and bisociation describe in the lens lineage — and they reduce, in an embedded substrate, to vector neighbourhood queries the substrate can already index with HNSW.
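The offset-arithmetic form of analogy can be sketched in a few lines. This is a minimal illustration, assuming entity vectors are already available in memory; the `analogy_candidates` helper, the `ex:` IRIs, and the 3-dimensional toy vectors are invented for the example and are not part of donto.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def analogy_candidates(vectors, a, b, c, top_k=3):
    """Rank candidates for a:b :: c:? by cosine to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (name, cosine(target, vec))
        for name, vec in vectors.items()
        if name not in (a, b, c)  # the query terms themselves are not answers
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (invented): the parent -> child offset is the same in both contexts.
toy = {
    "ex:parent_A": [1.0, 0.0, 0.0],
    "ex:child_A": [1.0, 1.0, 0.0],
    "ex:parent_B": [0.0, 0.0, 1.0],
    "ex:child_B": [0.0, 1.0, 1.0],
    "ex:unrelated": [-1.0, 0.0, 0.5],
}
candidates = analogy_candidates(toy, "ex:parent_A", "ex:child_A", "ex:parent_B")
```

The function returns ranked candidates only; whether any of them is a real relationship remains a question for the verifier.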

The lens lineage also warns that the moat is the verifier, not the lenses — and the fabric respects that division cleanly. Embeddings generate candidates (this analogy is plausible, these two roles are structurally alike); they do not assert the discovered relationship. A vector-proposed analogy enters the substrate as a candidate claim with an argument edge, to be corroborated or contradicted by evidence like any other. The fabric makes the lenses cheap and the cross-lens step possible; verification stays the load-bearing, paraconsistent core.

3.5 Contradiction & corroboration clustering — cluster, never collapse

donto's headline differentiator is paraconsistency: hold incompatible claims forever, re-rank by reality, never pick a winner. Yet the machinery that finds the incompatibilities is barely populated — ~2,426 donto_argument edges against 39.5M statements. The reason is mechanical: a contradiction is only detected today when two statements collide on an exact (subject, predicate, object) shape. But two claims that contradict each other in an abundance corpus rarely share that shape. "Caroline's mother was Kitty" and "Jessie Buchanan could not possibly be the GM's mother" are about the same proposition and they conflict — but they share neither predicate nor object string, so the substrate never notices.

The fabric supplies the missing primitive: cluster claims by the proposition they express, using embeddings of the (subject, predicate, object) triple as text, so that semantically-equivalent and semantically-opposed claims land near each other regardless of wording. Within a tight cluster you get corroboration candidates (many independent sources, different phrasings, same assertion — exactly the signal bitemporal re-ranking needs). Across the polarity of a cluster you get contradiction candidates to write as donto_argument edges — finally populating the differentiator at scale instead of by hand. This is the engine that turns ~2,426 argument edges into a number proportional to a 39.5M-statement corpus.

The paraconsistency discipline is absolute here and bears restating: cluster, don't collapse. A corroboration cluster does not dedup into one statement; it remains N held statements with a shared cluster id and N evidence trails. A contradiction cluster does not invalidate either side; it writes an argument edge and lets reality re-rank over bitemporal time. The fabric is what lets donto exercise its own thesis — it finds the contradictions abundance generates so the substrate can hold them, rather than leaving them undetected because no two LLM phrasings ever collided on a string.
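A minimal sketch of "cluster, don't collapse", assuming statement embeddings are already computed: single-link grouping over a cosine threshold that only assigns a cluster id to each statement and never removes one. The function name, the 0.9 threshold, and the toy vectors are assumptions for illustration; at corpus scale the pairwise loop would be replaced by HNSW neighbour lookups.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_statements(embeddings, threshold=0.9):
    """Return {statement_id: cluster_id}; every input statement keeps its own row."""
    ids = list(embeddings)
    parent = {sid: sid for sid in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)
    return {sid: find(sid) for sid in ids}


# Toy embeddings (invented): s1 and s2 express the same proposition, s3 does not.
toy = {"s1": [1.0, 0.0], "s2": [0.98, 0.2], "s3": [0.0, 1.0]}
clusters = cluster_statements(toy)
```

The output is an overlay: three statements go in, three statements come out, and two of them share a cluster id. Deciding polarity within a cluster (corroboration versus contradiction) is a separate step this sketch does not attempt.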

Before: ~2,426 argument edges; contradiction detected only on exact triple collision; semantically-opposed claims with different wording sit unrelated and unnoticed. After: propositions clustered by triple-embedding; corroboration and contradiction candidates surfaced across wording differences; donto_argument populated at corpus scale — every cluster held intact, nothing merged, nothing invalidated.

3.6 Evidence ↔︎ claim anchoring — semantic spans when substring fails

Only 4.75% of statements carry an evidence link (~1.88M of 39.56M), and the genealogy work repeatedly hit the failure mode behind that number: a fact is true to its source, but the span — the exact text snippet that supports it — can't be located because the claim paraphrases the document rather than quoting it. Span anchoring today is substring search: find where the object literal appears in the revision body. When the extractor writes "died in 1898" and the source says "passed away in the year eighteen ninety-eight," substring returns nothing, and the statement ends up in the un-anchored 95%.

The fabric makes span anchoring a semantic match. Embed candidate spans from the revision body, embed the claim, and take the nearest span as the evidence anchor when substring search comes up empty. This directly attacks the evidence-link sparsity that the genealogy notes flagged as a Trust Kernel testbed problem (e.g., only 3 of ~80 Caroline-line kinship triples carry evidence links). The substrate's native evidence model — fact → evidence_link → span → revision → blob — is unchanged; the fabric only improves the span-finding step from "exact string present" to "this passage means this claim."
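The fallback order described above (substring first, semantic match only when substring fails) can be sketched as follows. This assumes span and claim vectors are precomputed; `anchor_claim`, the 0.8 floor, and the 2-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def anchor_claim(object_literal, claim_vec, spans, min_similarity=0.8):
    """spans: list of (span_text, span_vec). Returns (span_index, method)."""
    # 1. Exact path: the literal appears verbatim in a span.
    for idx, (text, _) in enumerate(spans):
        if object_literal.lower() in text.lower():
            return idx, "substring"
    # 2. Fallback: nearest span by cosine, but only above a similarity floor.
    best_idx, best_sim = None, min_similarity
    for idx, (_, vec) in enumerate(spans):
        sim = cosine(claim_vec, vec)
        if sim >= best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is None:
        return None, "unanchored"
    return best_idx, "semantic"


# Toy spans with invented vectors; real vectors would come from the embedding model.
spans = [
    ("She was born in Sydney.", [0.1, 0.9]),
    ("He passed away in the year eighteen ninety-eight.", [0.95, 0.1]),
]
paraphrased = anchor_claim("died in 1898", [1.0, 0.0], spans)
verbatim = anchor_claim("Sydney", [1.0, 0.0], spans)
```

Either way the anchor is an index into real stored text; the similarity floor is what keeps a weak match from being recorded as evidence.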

This is the most direct payoff against donto's one hard rule and its evidence-first identity. Every fact is supposed to be defensible by a retrievable snippet plus a link to the full resource. Substring matching quietly drops that guarantee for any paraphrased extraction; semantic span matching restores it, and does so without inventing evidence — the span is real text from a real stored revision, merely found by meaning rather than by character match.

Before: span anchoring by substring; paraphrased extractions fall into the ~95% of statements left without an evidence link; evidence model intact but mostly empty. After: semantic span match as the fallback when substring fails; paraphrased claims anchored to the real passage that supports them; the 4.75% coverage figure becomes movable for the first time.

3.7 Extraction-time reuse — fight fragmentation at the source

Every aspect above cleans up fragmentation after it happens. This one prevents a share of it from happening at all. Today each extraction run mints predicates and entities freely — the right default, and the source of the 865,834-predicate registry that is the signature of abundance, not a bug. But "emit free" and "emit blindly" are not the same thing. When the extractor needs a predicate for "was killed by," nothing tells it that killedBy, murderedBy, and wasKilledBy already exist, so it mints a fourth. Much of the singleton tail (4,995 of ~6,111 in the frontier test) is not genuinely novel meaning — it is the same meaning re-minted because the extractor couldn't see what was already there.

The fabric offers the extractor a semantic lookup at emit time: before minting, embed the proposed predicate/entity and retrieve the nearest existing ones, surfacing "you may mean killedBy (0.94)" as a suggestion. Reuse when it fits; mint when it genuinely doesn't. This reduces fragmentation at the source — fewer accidental singletons, a denser and more joinable registry — without ever capping invention. The two-layer contract holds: maximize at extraction, but make the maximization informed rather than amnesiac.
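A sketch of the emit-time lookup, assuming the registry's vectors are reachable from the extractor. `suggest_reuse`, the 0.9 floor, and the toy vectors are hypothetical; the important property is that the function only returns suggestions and has no way to block a mint.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def suggest_reuse(proposed_vec, registry, min_similarity=0.9, limit=3):
    """Return up to `limit` (iri, similarity) suggestions; an empty list means 'just mint'."""
    scored = sorted(
        ((cosine(proposed_vec, vec), iri) for iri, vec in registry.items()),
        reverse=True,
    )
    return [(iri, round(sim, 2)) for sim, iri in scored if sim >= min_similarity][:limit]


# Toy registry with invented vectors.
registry = {
    "killedBy": [0.9, 0.1, 0.0],
    "murderedBy": [0.85, 0.2, 0.0],
    "bornIn": [0.0, 0.0, 1.0],
}
near_existing = suggest_reuse([1.0, 0.0, 0.0], registry)   # e.g. a proposed "wasSlainBy"
genuinely_new = suggest_reuse([0.0, 1.0, 0.0], registry)   # nothing close: mint freely
```

The extractor stays the decision-maker in both cases: it may decline `near_existing`, and `genuinely_new` being empty is the normal, unpenalized path to a new predicate.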

The line that must not be crossed: this is suggestion, never enforcement. This is the one aspect that, mis-implemented, pushes back toward write-time schema-fixing — an over-helpful emit-time suggester is a soft collapse-at-the-source, the very thing "emit free / untyped now" exists to prevent. The vision forbids collapsing the emit step into a fixed vocabulary; an extractor that cannot mint a new predicate is no longer abundance-native. So the guard is not just "the suggestion is declinable" — it is a measurable obligation: reuse-suggestion must never reduce the recall of genuinely-novel predicate minting. If the suggester ever nudges an extractor toward a near-but-wrong existing predicate when the right move was a new one, it has quietly narrowed the firehose, and that is a regression to be caught by eval (a minting-recall metric on a gold set of genuinely-novel cases), not a feature. Tuned correctly, the fabric reduces accidental fragmentation (re-minting an existing meaning) while preserving every bit of intentional novelty (the predicate that really is new). It is the cheapest fragmentation win available — a predicate not minted needs no later alignment at all — but only so long as it never costs a single legitimate new predicate.

3.8 Query-time dedup / collapse-as-a-lens

donto faces a real tension: the substrate must never dedup (paraconsistency, I3 no-destructive-overwrite), yet a human or downstream consumer reading 40 near-identical corroborating statements wants them folded. The resolution the vision implies is collapse as a lens — a read-time view, not a write-time mutation. But a collapse view needs a similarity key to decide what counts as "near-identical," and a lexical key collapses only verbatim duplicates while leaving every paraphrase un-folded.

The fabric supplies that key. A collapse lens clusters statements (or entities, or predicates) by embedding proximity above a threshold and presents one representative per cluster on read, with the cluster's members and their evidence trails one click away. The base table is untouched: all N statements remain live, bitemporal, and individually retrievable; the "collapse" exists only in the projection the query chose. Turn the lens off and the full firehose returns. This is dedup that respects I3 absolutely — nothing is overwritten, nothing is retracted, the merge lives entirely in the view layer.
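As a sketch of what "the merge lives entirely in the view layer" means, the lens below is a pure function over statements and a precomputed cluster assignment. The `collapse_lens` name, the dict shapes, and the "first member is the representative" rule are assumptions for illustration.

```python
from collections import defaultdict


def collapse_lens(statements, cluster_of, enabled=True):
    """Read-time projection: one representative per cluster, members kept alongside.

    statements: list of dicts with an "id" key (the base rows, never modified).
    cluster_of: mapping from statement id to cluster id.
    """
    if not enabled:
        # Lens off: every statement is shown on its own.
        return [{"representative": s, "members": [s]} for s in statements]
    groups = defaultdict(list)
    for s in statements:
        groups[cluster_of[s["id"]]].append(s)
    # Representative choice is arbitrary here (first member seen).
    return [{"representative": members[0], "members": members} for members in groups.values()]


base = [
    {"id": "s1", "text": "X killed Y"},
    {"id": "s2", "text": "X murdered Y"},
    {"id": "s3", "text": "Z was born in Sydney"},
]
cluster_of = {"s1": "c1", "s2": "c1", "s3": "c2"}

folded = collapse_lens(base, cluster_of)
unfolded = collapse_lens(base, cluster_of, enabled=False)
```

`folded` has two rows and `unfolded` has three, while `base` is identical before and after both calls.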

This reframes the volume debate the claim-substrate report settled. "Maximize at extraction, gate at promotion" gets a third companion: collapse at presentation. The substrate holds everything; the query decides how much of it to fold for this reader, for this purpose, reversibly. Collapse-as-a-lens is the consumer-facing relief valve that lets donto keep its hoard-everything core while still handing a memory consumer or a UI a clean, de-duplicated surface when that is what the task wants.

3.9 Emergent typing / ontology induction — fold the tail into families

The vision's central design move is "emit free / untyped now, defer typing to query time." donto does the first half well — 865,834 predicates, no imposed schema — but the second half has never actually happened. Typing is deferred, but it is not done; the predicate tail sits un-grouped, and "defer to query time" is only a promise until something at query time can produce the types. Without a similarity key, the only way to induce a type system over 865K freely-minted predicates would be to author one by hand, which is the forbidden pattern at maximum scale.

The fabric makes induced typing a clustering job. Embed the predicates (~30,000+ done and climbing, the rest queued) and cluster the embedding space: the kill/death/violence predicates — from the clean killed to the sentence-long 100NativePeopleKilled — fall into one family; kinship predicates into another; occupation, place, date into theirs. Each cluster is an induced type, with a centroid that names the family and a membership that spans every wording variant the LLM ever minted. This is precisely the deferred typing the vision wants: not a schema authored up front, but families induced from the data after the fact, refreshable as the corpus grows.
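One way to read "a centroid that names the family" is to label each cluster with the member closest to its centroid. The sketch below assumes cluster assignments already exist; `name_families`, the family ids, and the 2-dimensional toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def name_families(embeddings, family_of):
    """Group predicates by family id and label each family by its most central member."""
    members = {}
    for iri, family in family_of.items():
        members.setdefault(family, []).append(iri)
    families = {}
    for family, iris in members.items():
        center = [sum(col) / len(iris) for col in zip(*(embeddings[i] for i in iris))]
        label = max(iris, key=lambda i: cosine(embeddings[i], center))
        families[family] = {"label": label, "members": sorted(iris)}
    return families


# Toy vectors (invented); family ids would come from a clustering pass.
embeddings = {
    "killed": [0.9, 0.15],
    "murdered": [0.95, 0.1],
    "100NativePeopleKilled": [0.8, 0.3],
    "motherOf": [0.0, 1.0],
    "parentOf": [0.3, 0.9],
}
family_of = {"killed": 0, "murdered": 0, "100NativePeopleKilled": 0, "motherOf": 1, "parentOf": 1}
families = name_families(embeddings, family_of)
```

Every predicate is still listed under its own IRI in `members`; the label is a name for the family, not a replacement for any of them.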

The discipline is the cluster-not-collapse one (§7.6): induced families are an overlay, not a rewrite. A predicate's membership in the "violence" family is an alignment/closure relationship and a soft type — not a destructive recanonicalization of the IRI. The 865K predicates remain individually addressable; the type system is a lens over them, consulted at query time, re-derivable when embeddings refresh. This is the move that finally turns "abundance is a feature" from a defensive slogan into an operational claim: the proliferation is fine because the fabric can fold it into a navigable family structure on demand.

3.10 Routing / classification — embedding classifiers, not brittle rules

The last aspect is where the fabric most directly enforces the no-brittle-logic rule. The substrate and its consumers are full of latent classification decisions: which lens applies to a document, which context a statement belongs in, whether a span is relevant, how to route an extraction. The tempting implementation for every one of these is an if/elif ladder over names or a hand-maintained list of strings — the exact anti-pattern the project bans, because such lists rot the moment the data shifts and they are blind to anything not enumerated.

The fabric replaces those decisions with nearest-centroid classification. Build a small set of labeled exemplars per class, embed them, take centroids, and classify a new object by cosine distance to the nearest centroid. There is no string list to maintain: when the data drifts, you add exemplars and re-embed, you do not edit a ladder of conditionals. A document about a massacre routes to the violence/event lens because its embedding is near that centroid — not because someone added "massacre" to a keyword array. This is the same primitive as §3.9 (a class is a cluster with a label) applied to routing.

The payoff is both correctness and maintenance. Embedding classifiers degrade gracefully on novel inputs (they return the nearest class with a distance you can threshold for "none of the above"), they are auditable (the matched exemplars explain the decision), and they are refreshed by the same continuous loop that maintains every other vector in the fabric. One embedding loop, consulted everywhere — for alignment, identity, search, discovery, clustering, anchoring, reuse, collapse, typing, and routing — is the operational meaning of "embedding fabric." It is not ten features; it is one maintained primitive that every brittle decision in the substrate can delegate to.
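A minimal sketch of nearest-centroid routing with a "none of the above" threshold. The class names, exemplar vectors, and the 0.7 floor are invented for illustration; real exemplars would be embedded by the same loop as everything else.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(vec, centroids, min_similarity=0.7):
    """Return (label, similarity); label is None when nothing is close enough."""
    label, sim = max(
        ((name, cosine(vec, center)) for name, center in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (label, sim) if sim >= min_similarity else (None, sim)


# Labeled exemplars per class (toy vectors). Adding a class means adding exemplars.
exemplars = {
    "violence_event": [[1.0, 0.1], [0.9, 0.0]],
    "kinship": [[0.0, 1.0], [0.1, 0.9]],
}
centroids = {name: centroid(vecs) for name, vecs in exemplars.items()}

routed = classify([0.8, 0.2], centroids)
rejected = classify([-1.0, -1.0], centroids)
```

There is no keyword list anywhere in the decision path; drift is handled by changing `exemplars`, not by editing conditionals.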


4. All three techniques, not one

Section 3 made the case that the join key must be an embedding. That is true, but it is not the whole truth, and stating it baldly would reproduce exactly the error this report is trying to correct — swapping one brittle monoculture (lexical) for another (semantic). The fabric is not "embeddings instead of trigrams." It is an ensemble of three signals, each of which is individually insufficient and demonstrably blind in a specific, characterizable way. The art is in how they compose: cheap signals generate recall, an expensive signal supplies precision, governance decides what is allowed to act, and closure plus query-time expansion turn the result into something the substrate can actually join on. This section walks the three signals, shows where each fails on real donto predicates, and then specifies the pipeline that stitches them together — including the query-side instance of the same ensemble, hybrid search via reciprocal rank fusion.

4.1 Three signals, three blind spots

donto already carries all three signals in some form. The point of this section is that none of them is load-bearing alone.

Lexical (trigram + FTS). This is what the alignment engine has had all along: donto_suggest_alignments(source, min_similarity, limit) over pg_trgm, plus the GIN tsvector index donto_statement_fts_name powering substrate-wide /search. It is essentially free — a GiST/GIN index lookup, no model, no GPU, microseconds per probe — and it is genuinely good at one thing: syntactic variants of the same surface string. rdfType ↔︎ rdf-type ↔︎ rdf:type, birthPlace ↔︎ birth_place ↔︎ birthplace, casing, separators, pluralization, stemming. Across donto's 865,834 distinct predicates — a tail in which ~4,995 of ~6,111 frontier-test predicates were singletons — a very large fraction of the "alignments" we need are nothing more than the same human concept spelled five ways by five extraction runs. Trigram catches every one of those for nothing.

Its blind spot is total and structural: it cannot see meaning. Two predicates that share no substring are invisible to it, no matter how synonymous. This is not a tuning problem; it is what trigram similarity is. The live proof is the canonical example. For the predicate killed, the lexical neighbours returned by donto_suggest_alignments('killed', 0.3, …) are:

 target_iri                     | trigram sim
--------------------------------+-------------
 game:killed / game:game:killed | 1.000
 killedAt                       | 0.700
 killedBy                       | 0.700
 killedIn                       | 0.700
 killedOn                       | 0.700
 allKilled                      | 0.636

(The game:killed / game:game:killed rows at 1.000 are themselves an abundance/identity artifact — the same predicate minted under duplicated namespaces — and they occupy the top of both the lexical and the semantic result sets; the analysis below is about the substantive neighbours beneath them.) Every substantive hit shares the substring killed. The true synonym murdered — the one alignment a human would reach for first — is nowhere, and trigram cannot reach it in practice: its similarity to killed is 0.0667, an incidental brush on the shared -ed suffix, far below the 0.30 cutoff this very call uses. Drop the cutoff low enough to catch it and you drown in thousands of unrelated -ed predicates. This is the failure mode the no-brittle-logic rule names directly: lexical matching is the hand-maintained-synonym-list problem wearing an index, and it silently drops the abundance tail's most valuable joins.
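The 0.0667 figure can be reproduced by hand. The sketch below is a pure-Python approximation of pg_trgm's documented behaviour (lowercase, split into alphanumeric words, pad each word with two leading spaces and one trailing space, then compare trigram sets as shared over union). It is an illustration of the arithmetic, not the extension's code, and it makes no claim about how donto normalizes IRIs before comparing them.

```python
import re


def trigrams(text):
    """Set of padded trigrams for every alphanumeric word in `text`."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by the size of the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


shared = trigrams("killed") & trigrams("murdered")       # only the trailing "ed "
sim_murdered = trigram_similarity("killed", "murdered")  # 1 shared / 15 in the union
sim_slew = trigram_similarity("killed", "slew")          # no shared trigrams
sim_assassinated = trigram_similarity("killed", "assassinated")  # 1 shared / 18
```

`killed` yields 7 trigrams and `murdered` yields 9; they share exactly one, so the similarity is 1/15, which is the 0.0667 quoted above.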

Semantic (embeddings). This is what this build added: ~30,000+ of ~865,800 predicates embedded (full population is a queued background job) with fastembed BAAI/bge-small-en-v1.5 at 384 dimensions, stored in donto_predicate_embedding under an HNSW vector_cosine_ops index, queried by donto_suggest_alignments_semantic(iri, threshold, limit). Cosine over learned vectors is exactly the inverse of trigram: it is blind to surface form and sighted to meaning. The actual top-k for killed (representative snapshot, as of 2026-06-03; the three game:*:killed namespace duplicates at 1.000 are elided) is:

 target_iri   | cosine sim | trigram sim
--------------+------------+-------------
 murdered     | 0.947      | 0.0667
 wasKilled    | 0.942      | 0.417
 killedOn     | 0.940      | 0.700
 killedAt     | 0.939      | 0.700
 killedIn     | 0.933      | 0.700
 wasKilledBy  | 0.930      |
 killedBy     | 0.930      | 0.700
 killedPerson | 0.928      |

murdered at 0.947 cosine but only 0.0667 trigram is the entire thesis in one row: the only neighbour the embedding finds that lexical cannot, sitting above every morphological variant lexical already had. The killedBy ↔︎ assassinatedBy class of join — synonyms that diverge in spelling — is solved here and only here.

But look harder at that same table, because it also exposes the embedding's blind spot. killedOn, killedAt, killedIn, killedBy, and wasKilledBy sit at essentially the same cosine (0.93–0.94) as the true synonym murdered. Yet several of them are not the same relation as killed. killed and murdered are roughly exact_equivalent; killed and killedBy/wasKilledBy are direction-flipped — they are inverse_equivalent (the subject and object swap roles), even though cosine ranks them right alongside the synonym. Collapsing X killed Y into X killedBy Y would silently reverse the meaning of every statement that joins through it. Embeddings put synonyms, inverses, hypernyms, and false friends all in the same neighbourhood, because a 384-dimensional sentence embedding of a short predicate label encodes topic, not argument structure. Cosine cannot tell you the relation type, and it cannot tell you that bornIn (place) and diedIn (place) — which sit at cosine 0.638 in the live embedding table because both are about life-events at a place — are not interchangeable, or that partner (business) and partner (romantic) are false friends despite identical strings and high cosine. Semantic recall is wide and meaning-aware; it is also relation-blind and false-friend-prone.

LLM adjudication. The third signal is the only one that can read usage and direction. Where trigram judges strings and embeddings judge topical proximity, an LLM judge is handed the two predicates, representative subject/object pairs drawn from donto_statement, and the evidence spans, and asked to classify the relation between them into donto's existing alignment vocabulary. That vocabulary is not invented for this report; it is the live relation enum on donto_predicate_alignment, the same column the engine already writes:

exact_equivalent · inverse_equivalent · sub_property_of · close_match · decomposition · not_equivalent (plus the legacy/import aliases exact_match, inverse_of, narrow_match, broad_match, incompatible_with, …).

The current ~18,500 live alignments (as of 2026-06-03; the loop is actively running, so these counts drift) are almost entirely close_match (18,080) with only 305 exact_equivalent, 53 inverse_equivalent, 26 sub_property_of, 12 decomposition, and 12 not_equivalent — because they were produced by the lexical-only engine, which can assert "these strings look alike" (close_match) but cannot justify a stronger or directional claim. That distribution is the signature of an engine running on one signal. It is precisely the inverse_equivalent/sub_property_of/not_equivalent distinctions — the ones that change query correctness, not just recall — that neither trigram nor cosine can produce and that LLM adjudication exists to supply.

The LLM is also the only signal that resolves false friends, because resolving them requires reading actual usage: partner in ctx:genealogy co-occurs with marriage and parentage; partner in a jobs/resume context co-occurs with firms and equity. Same string, same embedding, opposite meaning — separable only by looking at the statements each predicate actually appears in. That is a judgement task, not a similarity computation, and the no-brittle-logic rule's permitted toolset names it explicitly: "LLM judgement" is a first-class technique, not a fallback.

The three blind spots, side by side:

 Signal                | Cost / probe       | Catches | Structurally blind to
-----------------------+--------------------+---------+-----------------------
 Lexical (trigram/FTS) | ~free (index)      | syntactic variants (rdfType↔︎rdf-type, casing, separators, stemming) | meaning; any low-overlap synonym (killed↔︎murdered at trigram 0.0667; killed↔︎slew at 0)
 Semantic (HNSW cosine) | ~ms (vector index) | synonyms, paraphrase, cross-spelling (killed↔︎murdered, killedBy↔︎assassinatedBy) | direction (killed↔︎killedBy look identical); false friends; relation type
 LLM adjudication      | ~$/call, seconds   | relation type (exact/inverse/sub_property/not_equivalent); usage-disambiguated false friends | scale: too expensive to run on every pair

The complementarity is exact. Lexical's blind spot (meaning) is semantic's strength. Semantic's blind spot (direction, false friends, relation type) is the LLM's strength. The LLM's blind spot (cost at scale) is precisely what the two cheap signals fix, by shrinking the candidate set the LLM ever has to look at. No two of the three cover all three failure modes; you need the full ensemble.

4.2 The pipeline: cheap recall → expensive precision → governance → closure → expansion

Composing the three is a recall/precision cascade. Each stage is calibrated so that the expensive signal only ever sees candidates the cheap signals could not resolve on their own.

Stage 1 — cheap recall (lexical ∪ semantic candidate generation). For a source predicate, take the union of trigram neighbours (donto_suggest_alignments, threshold ~0.30) and HNSW neighbours (donto_suggest_alignments_semantic, threshold ~0.80). Union, not intersection: the whole point is that each signal contributes candidates the other is blind to — lexical contributes killedBy (shared string), semantic contributes murdered (shared meaning), and the candidates that both signals return (e.g. killedBy, which is lexically and semantically close to killed) are exactly the high-confidence ones. The combined ranker donto_suggest_alignments_hybrid(source, min_score, limit, semantic_weight) already implements the fused score; the live pg_proc actually carries five suggest functions — donto_suggest_alignments, _semantic, two overloads of _hybrid (one keyed by iri, one by source with a semantic_weight default of 0.70), and a _hybrid_fast variant — which is a clear sign this layer is mid-build and should be consolidated to a single _hybrid contract (folding the _fast path in as a query-planner option, not a separate function). The output of Stage 1 is a candidate list of perhaps 10–30 predicates per source, with each candidate tagged by which signals proposed it and at what score.
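The Stage 1 union can be sketched as a merge that keeps, per candidate, which signal proposed it and at what score. The `merge_candidates` helper and its dict shapes are hypothetical; the scores are the ones quoted in §4.1.

```python
def merge_candidates(lexical, semantic):
    """Union of two {target_iri: score} maps, tagged by the signal that proposed each."""
    merged = {}
    for signal, hits in (("lexical", lexical), ("semantic", semantic)):
        for iri, score in hits.items():
            merged.setdefault(iri, {})[signal] = score
    return merged


# Neighbours of `killed`, using the similarity values reported in the tables above.
lexical_hits = {"killedBy": 0.700, "killedAt": 0.700, "allKilled": 0.636}
semantic_hits = {"murdered": 0.947, "killedAt": 0.939, "killedBy": 0.930}

candidates = merge_candidates(lexical_hits, semantic_hits)
both_signals = sorted(iri for iri, tags in candidates.items() if len(tags) == 2)
semantic_only = sorted(iri for iri, tags in candidates.items() if set(tags) == {"semantic"})
```

`murdered` survives only because the merge is a union; an intersection would have discarded the one candidate the embedding exists to find.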

Stage 2 — cheap resolution where the signals already agree. Most candidates never need an LLM. When lexical and semantic both score a pair very high and the surface forms are trivial variants (rdfType/rdf-type), the engine can write the alignment directly with relation exact_equivalent or close_match at high confidence — these are the cases trigram was always right about. This stage is what keeps the LLM bill bounded: the 865K-predicate tail is dominated by spelling variants, and those should be resolved for free.

Stage 3 — expensive precision (LLM adjudication on the borderline). Route to the LLM only the candidates that are genuinely ambiguous: high cosine but low lexical (possible synonym or possible false friend — the engine cannot tell which), or candidates where direction is in question (killed vs killedBy both surfaced, both near 0.93). The judge receives the two predicates, sampled (subject, object) pairs and evidence spans from donto_statement for each, and returns a relation drawn from the live enum plus a confidence and a short rationale. This is where inverse_equivalent gets correctly assigned to killed↔︎killedBy, where partner(genealogy) and partner(jobs) get split as not_equivalent, and where bornIn vs bornOn is caught. The rationale and the sampled spans are stored, so the alignment is itself evidence-anchored — evidence_anchor_ids is a real column on the table and should be populated here, making the alignment auditable in the same way every donto statement is.
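The routing between Stage 2 and Stage 3 can be sketched as a small triage function over the tagged candidates from Stage 1. The thresholds and the three outcome labels are assumptions for illustration, not donto's calibrated values.

```python
def triage(candidate, lex_hi=0.8, sem_hi=0.9):
    """Decide what happens to one candidate given its per-signal scores.

    candidate: dict with optional "lexical" and "semantic" scores.
    Returns "auto" (cheap resolution), "llm" (needs adjudication) or "hold".
    """
    lex = candidate.get("lexical", 0.0)
    sem = candidate.get("semantic", 0.0)
    if lex >= lex_hi and sem >= sem_hi:
        # Both cheap signals agree strongly: likely a trivial surface variant.
        return "auto"
    if sem >= sem_hi:
        # Semantically close but not a near-identical string:
        # synonym, inverse, or false friend; only a judge can tell.
        return "llm"
    # Weak evidence: stays a plain candidate, no LLM spend.
    return "hold"


spelling_variant = triage({"lexical": 0.95, "semantic": 0.97})
possible_synonym = triage({"lexical": 0.0667, "semantic": 0.947})   # killed vs murdered
direction_unclear = triage({"lexical": 0.700, "semantic": 0.930})   # killed vs killedBy
weak = triage({"lexical": 0.35})
```

Under these assumed thresholds both `murdered` and `killedBy` reach the judge, which is the point: cosine alone scores them almost identically, and only adjudication separates the synonym from the inverse.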

Stage 4 — governance (review_status, the safe_for_* flags). This is the stage that protects donto's identity, and it is already modeled in the schema. Every alignment carries review_status ∈ {candidate, accepted, rejected, superseded} and three independent capability flags: safe_for_query_expansion, safe_for_export, safe_for_logical_inference. The graduation is deliberate and asymmetric. A candidate close_match from cheap signals can be safe_for_query_expansion = true (it is fine to widen a search with it; the worst case is a few extra recalled rows the caller can ignore) while staying safe_for_logical_inference = false and safe_for_export = false (it must not be used to derive new facts or be shipped as ground truth). LLM-adjudicated exact_equivalent alignments with strong rationale and human sign-off can graduate review_status → accepted and earn the stronger flags. Critically: all ~18,500 current alignments are review_status = candidate. The engine has never promoted one, which tells us governance is wired but the loop that exercises it has not run. The flags are the difference between "the fabric helps you find more" and "the fabric silently rewrites your knowledge"; they are how aggressive recall and conservative truth coexist in one table.

Stage 5 — closure. Accepted/safe alignments compose transitively into donto_predicate_closure (currently 62,559 rows over the ~18,500 base alignments, a ~3.4× fan-out), so that A exact_equivalent B and B exact_equivalent C make A ↔︎ C reachable without a third pairwise call. Closure must be relation-aware, which is the whole reason Stage 3's typing matters: exact_equivalent is transitive and symmetric and closes freely; sub_property_of is transitive but directional (closes one way); inverse_equivalent flips argument order on traversal; not_equivalent and close_match must not transitively close at all (chaining "roughly similar" links is how you drift from killed to something unrelated in four hops). A closure that ignored relation type would manufacture false joins, which is exactly the failure a lexical-only closure is prone to, since it only has close_match to work with.
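The composition rules can be made concrete with a small breadth-first closure over typed edges. This is a sketch under stated assumptions: the `COMPOSE` table encodes only the rules named above, any pair of relations missing from it is treated conservatively as non-composable, and the toy edges (including the `slew` and `massKilling` rows) are invented.

```python
from collections import deque

# (relation so far, next relation) -> composed relation. Missing pairs do not compose.
COMPOSE = {
    ("exact_equivalent", "exact_equivalent"): "exact_equivalent",
    ("exact_equivalent", "sub_property_of"): "sub_property_of",
    ("sub_property_of", "exact_equivalent"): "sub_property_of",
    ("sub_property_of", "sub_property_of"): "sub_property_of",
    ("exact_equivalent", "inverse_equivalent"): "inverse_equivalent",
    ("inverse_equivalent", "exact_equivalent"): "inverse_equivalent",
    ("inverse_equivalent", "inverse_equivalent"): "exact_equivalent",  # two flips cancel
}
# Relations that can be traversed from either end.
SYMMETRIC = {"exact_equivalent", "inverse_equivalent", "close_match"}


def closure(edges, start):
    """Return {predicate: {relations to start}} reachable under COMPOSE."""
    adjacency = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))
        if rel in SYMMETRIC:
            adjacency.setdefault(dst, []).append((rel, src))
    reached, seen, queue = {}, set(), deque()
    # Direct neighbours are always included, whatever their relation.
    for rel, nxt in adjacency.get(start, []):
        if (nxt, rel) not in seen:
            seen.add((nxt, rel))
            reached.setdefault(nxt, set()).add(rel)
            queue.append((nxt, rel))
    while queue:
        node, rel_so_far = queue.popleft()
        for rel, nxt in adjacency.get(node, []):
            combined = COMPOSE.get((rel_so_far, rel))
            if combined is None or nxt == start or (nxt, combined) in seen:
                continue
            seen.add((nxt, combined))
            reached.setdefault(nxt, set()).add(combined)
            queue.append((nxt, combined))
    return reached


edges = [
    ("killed", "exact_equivalent", "murdered"),
    ("murdered", "exact_equivalent", "slew"),
    ("killed", "inverse_equivalent", "killedBy"),
    ("killedBy", "exact_equivalent", "wasKilledBy"),
    ("killed", "close_match", "allKilled"),
    ("allKilled", "close_match", "massKilling"),
]
reach = closure(edges, "killed")
```

`slew` is reached as an equivalent and `wasKilledBy` as an inverse, while `massKilling` is not reached at all because the chain to it runs through two close_match links.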

Stage 6 — query-time expansion. This is the payoff and the line donto must not cross. At query time, a probe on predicate P expands to $P \cup \mathrm{closure}(P \mid \texttt{safe\_for\_query\_expansion})$ and the join runs over the expanded set. The expansion is non-destructive: the underlying statements keep their original, freely-minted predicates forever; nothing is merged, deduped, or overwritten; identity stays a hypothesis. The fabric ranks and clusters the join candidates; it never collapses them. This is the concrete mechanism behind "emit free / untyped now, defer joining to query time": the deferral is real precisely because the join key (the embedding-anchored, LLM-typed, governance-gated closure) is now good enough to defer to.
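A sketch of non-destructive expansion, assuming closure rows expose a source, a target, a relation, and the safe_for_query_expansion flag. The row shape and both function names are hypothetical (the real donto_predicate_closure columns are not shown in this report), and statements are plain (subject, predicate, object) tuples.

```python
def expand_predicate(predicate, closure_rows):
    """Return [(predicate_iri, swap_arguments)] for P plus its safe closure."""
    expansions = [(predicate, False)]
    for row in closure_rows:
        if row["source"] != predicate:
            continue
        if not row["safe_for_query_expansion"]:
            continue  # governance gate: unsafe alignments never widen a query
        swap = row["relation"] == "inverse_equivalent"
        expansions.append((row["target"], swap))
    return expansions


def query(statements, predicate, closure_rows):
    """Match statements over the expanded set; each hit keeps its original predicate."""
    swap_for = dict(expand_predicate(predicate, closure_rows))
    hits = []
    for subject, pred, obj in statements:
        if pred in swap_for:
            s, o = (obj, subject) if swap_for[pred] else (subject, obj)
            hits.append((s, o, pred))
    return hits


statements = [
    ("ex:A", "killed", "ex:B"),
    ("ex:C", "murdered", "ex:D"),
    ("ex:F", "killedBy", "ex:E"),   # read as: E killed F
    ("ex:G", "allKilled", "ex:H"),
]
closure_rows = [
    {"source": "killed", "target": "murdered", "relation": "exact_equivalent", "safe_for_query_expansion": True},
    {"source": "killed", "target": "killedBy", "relation": "inverse_equivalent", "safe_for_query_expansion": True},
    {"source": "killed", "target": "allKilled", "relation": "close_match", "safe_for_query_expansion": False},
]
hits = query(statements, "killed", closure_rows)
```

The inverse row is joined with its arguments swapped, the unsafe row is ignored, and `statements` is never written to: the expansion exists only for the duration of the query.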

4.3 The query side is the same ensemble: hybrid search via RRF

The cascade above aligns predicates (a maintenance-time job). But the identical ensemble logic governs the query-time retrieval the fabric is meant to power, and it would be a mistake to treat them as two separate systems. Substrate-wide /search today is lexical-only: plainto_tsquery + ts_rank over the donto_statement_fts_name GIN index, with a bounded candidate CTE (LIMIT 2000) to cap latency at ~270–820 ms across 39.5M statements. That inherits the exact blind spot from §4.1 — a query for "homicide victim" will never retrieve a statement whose predicate is killed or murdered, because FTS matches words, not meaning.

The fix is hybrid search: run the FTS query and a vector-kNN query (statement/span embeddings, the same HNSW machinery, once the embedding fabric extends past predicates to spans and statements) in parallel, then fuse the two ranked lists with Reciprocal Rank Fusion. RRF scores each candidate by $\sum_i \frac{1}{k + \text{rank}_i}$ over the lists it appears in (typically $k=60$), which has three properties that make it the right fusion operator here: it needs no score calibration between the incommensurable ts_rank and cosine scales (it consumes ranks, not raw scores); it is robust to one retriever returning garbage (a low rank contributes a small term, it cannot dominate); and it rewards candidates that both retrievers surface — precisely the "lexical ∩ semantic = high confidence" intuition from Stage 1, now at query time. A statement that is both a lexical match and a semantic match floats to the top; one that only one retriever found still appears, lower. This is the recall-union/precision-rerank pattern of the maintenance pipeline, compressed into a single request and a single fusion formula.
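The fusion formula is short enough to show whole. This is a minimal sketch of Reciprocal Rank Fusion over two already-ranked id lists; the statement ids are invented, and the tie-break on id is an assumption added only to make the output deterministic.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of ids; each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


fts_ranking = ["st:1", "st:2", "st:3"]      # from the lexical (FTS) leg
vector_ranking = ["st:4", "st:2", "st:5"]   # from the vector-kNN leg

fused = reciprocal_rank_fusion([fts_ranking, vector_ranking])
fused_ids = [item for item, _ in fused]
```

`st:2` is second in both lists and wins with 1/62 + 1/62, ahead of either list's top hit at 1/61; items found by only one retriever still appear, lower down. No raw ts_rank or cosine value is ever compared.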

So hybrid-search-by-RRF is not a different idea bolted onto alignment; it is the query-side instance of the same ensemble. The maintenance loop and the query path draw on one embedding fabric, one set of cheap-recall signals, and the same conviction: each technique alone is insufficient, and the system that ships is the one that runs all three and fuses them — cheaply where they agree, expensively where they don't, and never destructively.


5. Architecture & best practices

This section describes how the embedding fabric is actually built on donto-pg today, what is already running, and the production-grade knobs that govern it as it scales from the current ~30,000+ embedded predicates to the full ~865,800-predicate registry and, eventually, to entities, statements, spans, documents, and contexts. The design intent throughout is the one stated in §1–§4: embeddings are the non-brittle join key for query-time alignment, and they must cluster and rank without ever collapsing or merging — paraconsistency stays intact, identity stays a hypothesis.

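A point worth stating up front, because it changes how to read everything below: the substrate was already scaffolded for this. pgvector and a working embedding source were the only missing pieces; the schema, the alignment plumbing, and the CLI surface were already in place. Three pieces of pre-existing evidence (each listed in the §5.8 table):

  1. alignment.rs already carried embedding parameters (embedding_model, embedding: Vec<f32>): present, unused.
  2. The identity-edge method ∈ {…, embedding, neural} (migration 0060) was already declared, but never produced.
  3. A derive_embeddings Trust-Kernel capability and CLI subcommand already existed, with no backend behind them.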

So this build did not bolt a new subsystem onto donto. It populated a socket the substrate had been carrying empty.

5.1 The physical layer: pgvector 0.8.2 and per-object embedding tables

The substrate runs Postgres 16 in the donto-pg container with pgvector 0.8.2 and pg_trgm 1.6 both installed (confirmed live via pg_extension). Keeping vectors inside the same Postgres that holds the 39.56M statements is deliberate and load-bearing for the vision: alignment is a query-time join, and a join is cheapest and most consistent when both sides live in one transactional store. There is no external vector service to keep in sync, no dual-write consistency problem, no separate backup story — the embedding fabric is backed up by the same pg_dump that backs up the substrate, and it participates in the same MVCC/bitemporal world as everything else.

The fabric is per-object-type, not one giant table. The live table today is:

                Table "public.donto_predicate_embedding"
   Column   |           Type           | Nullable | Default
------------+--------------------------+----------+---------
 iri        | text                     | not null |
 embedding  | vector(384)              | not null |
 model      | text                     | not null |
 updated_at | timestamp with time zone | not null | now()
Indexes:
    "donto_predicate_embedding_pkey" PRIMARY KEY, btree (iri)
    "donto_predicate_embedding_hnsw" hnsw (embedding vector_cosine_ops)

Five design decisions are encoded in that small table, and they are the template every other layer copies:

  1. The text projection is the load-bearing, under-specified step. bge-small-en-v1.5 is an English sentence model, and a predicate IRI is a camelCase identifier, not a sentence. What actually gets embedded is therefore a humanized projection of the IRI, not the raw string: strip the namespace prefix, split camelCase / snake_case / separators into words, lowercase, and (where available) append a short usage descriptor sampled from the predicate's statements (representative object types, a label if one exists). So killed is embedded as roughly "killed", wasKilledBy as "was killed by", and the sentence-length tail monster 100NativePeopleKilled as "100 native people killed" — which is exactly why it lands near killed/murdered rather than near numeric predicates. This matters for honesty about the 0.95 scores: killed/murdered are the easy case (the split tokens are dictionary words the model knows well), and the projection step is what determines whether the hard cases — opaque camelCase coinages, abbreviations, multilingual act-text predicates — embed meaningfully or as noise. The projection is a tunable component, and §6 must measure scores across easy and hard predicate shapes, not just the dictionary-word example.

  2. Keyed by the object's natural identity (iri), one row per object. The embedding is a projection of the object, not a new object. This is what keeps embeddings from leaking into the statement model — a predicate's vector lives beside the predicate registry, not as a statement, so it can be recomputed, dropped, and rebuilt freely without touching the immutable donto_statement ledger or violating I3 (no destructive overwrite of facts).

  3. vector(384) matches the active provider exactly (see §5.3). The dimension is a property of the table, which is why a provider swap to a different dimensionality means a new/migrated table, not an in-place type change (see §5.3 on halfvec and migration).

  4. model is stored per row. Embeddings from different models are not comparable; recording the producing model per row is what lets the maintenance loop detect "this row was embedded by an older model" and re-embed it, and lets readers refuse to compare across models. This is the analogue, in the embedding layer, of donto's evidence-first discipline: every vector knows where it came from.

  5. The index is HNSW with vector_cosine_ops. Cosine distance is the right metric for normalized sentence-embedding spaces like bge-small; the operator class is fixed at index-build time, so queries must use the matching <=> cosine-distance operator to use the index (the same "the query expression must match the index DDL exactly or you seq-scan" discipline already documented for the FTS to_tsvector index — it applies identically to vector indexes).
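The text projection from item 1 can be sketched as follows. This is a minimal sketch under stated assumptions: `humanize_predicate` is a hypothetical name, the namespace rule (keep what follows the last `/`, `#`, or `:`) is an assumption about IRI shape, and the optional usage descriptor is omitted entirely.

```python
import re

def humanize_predicate(iri: str) -> str:
    """Project a predicate IRI to the plain-English text that gets embedded."""
    # Strip the namespace (assumed to end at the last '/', '#', or ':').
    local = re.split(r"[/#:]", iri)[-1]
    # snake_case, kebab-case, and dotted names become spaces.
    local = re.sub(r"[_\-.]+", " ", local)
    # camelCase boundary: lowercase letter or digit followed by uppercase.
    local = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local)
    # Acronym followed by a word: "HTTPServer" -> "HTTP Server".
    local = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", local)
    # Digits followed by letters: "100native" -> "100 native".
    local = re.sub(r"(?<=[0-9])(?=[A-Za-z])", " ", local)
    return " ".join(local.lower().split())

examples = {
    iri: humanize_predicate(iri)
    for iri in ["killed", "wasKilledBy", "100NativePeopleKilled", "ex:born_in"]
}
```

The three predicates named in item 1 come out as "killed", "was killed by", and "100 native people killed". The hard shapes item 1 warns about (abbreviations, opaque coinages, multilingual names) pass through this kind of splitter largely unchanged, which is why the projection needs measuring rather than assuming.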

The layering plan. Predicate is layer one because predicate proliferation is the most acute abundance signal (865,836 distinct predicates in the registry; ~4,995 of ~6,111 frontier-test predicates are singletons) and because predicate alignment is what the existing closure/identity machinery already consumes. The same table shape extends outward, each as its own donto_<type>_embedding table with its own HNSW index:

| Layer | Object embedded | Text projection that gets embedded | Primary downstream use |
|---|---|---|---|
| Predicate (live, ~30,587 / 865,836) | predicate IRI | humanized predicate name + sampled descriptor/usage context | query-time alignment, closure, predicate identity clusters |
| Entity (next) | subject/object entity IRI | label + key attributes + type hints | identity-as-hypothesis clustering, entity dedup candidates (never merges) |
| Statement | a live statement | humanized subject predicate object triple | semantic recall, contradiction-neighborhood discovery, lens intersection |
| Span | an evidence span | the snippet text itself | evidence retrieval, "find the source that says X" |
| Document / revision | a registered source body | document text (chunked) | /search/resources, source triangulation |
| Context | a context | aggregate/centroid of its statements | context similarity, cross-corpus routing |

The crucial invariant restated at the schema level: none of these tables has a "merged_into" or "canonical_iri" column. An embedding table can only ever rank neighbors; it can never rewrite an object's identity. Collapse is structurally impossible because the fabric has nowhere to write a collapse decision. Identity decisions live in donto_identity_edge as hypotheses (with method ∈ {trigram, embedding, neural, human, import, rule}), and alignment decisions live in donto_predicate_alignment as bitemporal, retractable rows — both are additive, both are reversible, both preserve every original object forever.

5.2 The candidate/closure/identity machinery the fabric feeds

The embedding tables are the new input; the consumers already existed and are now getting a better key. Live counts on donto-pg:

| Object | Live count (as of 2026-06-03) | Notes |
|---|---|---|
| donto_predicate_embedding rows | ~30,587 and climbing | full ~865,800 registry is a queued background job |
| donto_predicate_alignment (live) | 18,488 | bitemporal; upper(tx_time) IS NULL filter |
| donto_predicate_closure rows | 62,559 | transitive closure over alignments |
| donto_identity_edge | 122 | barely used; the differentiator, still cold |

The suggest functions sit between embeddings and these tables. Live pg_proc carries five (donto_suggest_alignments, _semantic, two _hybrid overloads, and _hybrid_fast — see §4.2 on consolidation); the four that matter conceptually:

The flow is: embeddings (+ lexical) → suggest candidates → adjudicate → register alignment → rebuild closure → (optionally) emit identity edges → consult at query time. Embeddings improve exactly one thing — candidate generation — but candidate generation is the upstream bottleneck that gated the entire pipeline. Better candidates are why the closure can grow past lexical's ceiling, and why the cold identity machinery finally has a non-brittle signal to run on.

5.3 The swappable embedding provider

The active provider is fastembed running BAAI/bge-small-en-v1.5 (384-dim) on CPU, local, in-process. The properties that make it the right default for an abundance substrate:

But "default" is not "only." The provider is an abstraction, mirroring the model/provider abstraction the rewrite is building for extraction. The model column on every embedding row, the per-object-type tables, and the per-row dimension are exactly what make a swap clean: dropping in OpenAI text-embedding-3-*, a GLM embedding endpoint, or a larger local model (bge-large, e5-large) is a matter of (a) a new table at the new dimension or a migration, (b) a new model string, (c) the maintenance loop re-embedding under the new model. Because model is recorded per row, a migration can run incrementally and mixed — new rows in the new space, old rows re-embedded lazily — and readers can refuse cross-model comparisons until a layer is fully migrated. Critically, alignment.rs already accepts embedding_model: Option<String> alongside the vector, so the provider identity flows through the Rust API without a schema change there.
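The "readers can refuse cross-model comparisons" rule can be made concrete with a small sketch. All names here (`StoredEmbedding`, `CrossModelComparison`, `rows_to_migrate`, the model strings) are hypothetical illustrations rather than donto's API; the only thing taken from the text is that every row records its producing model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEmbedding:
    iri: str
    model: str      # producing model, recorded per row
    vector: tuple

class CrossModelComparison(ValueError):
    """Raised when two vectors from different models are compared."""

def cosine_similarity(a: StoredEmbedding, b: StoredEmbedding) -> float:
    # Vectors from different models live in different spaces: refuse.
    if a.model != b.model:
        raise CrossModelComparison(f"{a.model} vs {b.model}")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm_a = math.sqrt(sum(x * x for x in a.vector))
    norm_b = math.sqrt(sum(y * y for y in b.vector))
    return dot / (norm_a * norm_b)

def rows_to_migrate(rows, active_model):
    """IRIs still embedded under a model other than the active one."""
    return [row.iri for row in rows if row.model != active_model]

# Toy 3-dimensional vectors; real rows would be 384-dimensional.
old = StoredEmbedding("ex:killed", "bge-small-en-v1.5", (1.0, 0.0, 0.0))
same = StoredEmbedding("ex:murdered", "bge-small-en-v1.5", (1.0, 0.0, 0.0))
new = StoredEmbedding("ex:slew", "hypothetical-larger-model", (1.0, 0.0, 0.0))
```

During an incremental migration, `rows_to_migrate` is the lazy re-embed queue and the exception is what keeps a mixed layer from producing meaningless scores.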

Recommended posture: keep bge-small as the always-on local baseline (it must always work offline and for free), and treat larger/hosted models as an opt-in quality lever for high-value layers (e.g. entity-identity clustering, where a false merge is expensive) rather than a wholesale replacement.

5.4 Incremental maintenance: signature-hash change detection

The fabric must never re-embed the whole world. With 865K predicates and millions of statements, full re-embedding is both wasteful and a recurring 34s-class scan against a 34 GB table. The discipline is embed-on-change, keyed by a content signature:

This makes maintenance $O(\text{new} + \text{changed})$, not $O(\text{total})$. New predicates minted by the firehose get embedded; predicates whose sampled descriptor context shifted get refreshed; a model swap invalidates exactly the rows under the old model and no others. The updated_at column gives a cheap watermark for "embed everything touched since T," and model gives the cross-model invalidation key. The ~30,587/865,836 coverage figure (as of 2026-06-03, rising) is itself a maintenance artifact — it is simply how far the incremental backfill has progressed, not a sampling decision.
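A minimal sketch of embed-on-change follows. It assumes the signature is a SHA-256 over the model name plus the projected text, and that the last-embedded signature is stored somewhere per IRI; both are assumptions for illustration, since the live table shown in §5.1 has no signature column. The function names are hypothetical.

```python
import hashlib

def signature(model: str, projected_text: str) -> str:
    """Content signature: changes when the text or the model changes."""
    payload = f"{model}\x1f{projected_text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_embedding(current_text: dict, stored_signature: dict, model: str) -> list:
    """Return IRIs that are new, changed, or embedded under another model.

    current_text:     iri -> projected text as it stands now
    stored_signature: iri -> signature recorded at the last embed
    """
    return sorted(
        iri
        for iri, text in current_text.items()
        if stored_signature.get(iri) != signature(model, text)
    )

model = "bge-small-en-v1.5"
stored = {
    "ex:killed": signature(model, "killed"),
    "ex:wasKilledBy": signature(model, "was killed by"),
}
current = {
    "ex:killed": "killed",                      # unchanged: skipped
    "ex:wasKilledBy": "was killed by person",   # descriptor shifted
    "ex:murdered": "murdered",                  # newly minted
}
todo = needs_embedding(current, stored, model)
after_model_swap = needs_embedding(current, stored, "hypothetical-new-model")
```

Because the model name is part of the hash input, a provider swap invalidates every row under the old model without any separate bookkeeping.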

5.5 The one continuous loop — "baked in, constantly aligning"

The defining architectural choice is that alignment is not a batch job someone remembers to run. The handbook records that the alignment engine was historically dormant and un-scheduled — the SQL functions existed but nothing drove them, which is why an 865K-predicate registry had only ~18,500 alignments and 122 identity edges. The fix is a single continuous loop, scheduled (cron/Temporal), idempotent, and observable, that does the whole cycle each tick:

  1. embed-new — find objects needing embedding via signature-hash diff (§5.4); embed them in batches with the active provider; upsert rows (with model, updated_at).
  2. candidate generation — for changed/new objects, run donto_suggest_alignments_hybrid (lexical ∪ semantic) to propose alignment candidates above thresholds.
  3. LLM adjudication — for ambiguous candidates in the gray band, ask an LLM judge to confirm/reject (this is the "use a model, not an if/elif ladder" rule applied to the alignment decision itself; high-confidence and clearly-low candidates skip the LLM to save cost).
  4. register — write confirmed alignments to donto_predicate_alignment as bitemporal rows (additive; a later reversal is a retraction, never a delete).
  5. rebuild closure — recompute the transitive closure into donto_predicate_closure (currently 62,559 rows) so query-time expansion sees the new edges.
  6. rebuild identity clusters — recompute entity/predicate identity hypothesis clusters into donto_identity_edge with method ∈ {embedding, neural} — the previously-cold machinery, now fed a real signal. These are hypotheses with confidence, never merges.
  7. record run — write a run record (counts embedded, candidates, adjudicated, alignments added, closure delta, coverage %, wall-clock) for observability (§5.6).

This is what "constantly aligning" means in practice: the firehose mints free, untyped predicates; the loop quietly embeds them, finds their semantic neighbors, proposes and adjudicates alignments, and refreshes the closure — so that by the time a query arrives, the join key is already current. Query-time alignment stays query-time (nothing is decided eagerly or destructively at write time), but the machinery the query consults is kept warm by the loop. The two are not in tension: the loop maintains the index; the query does the expansion.
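The gray-band routing in step 3 can be sketched as below. The thresholds (0.90 and 0.60) and all names are hypothetical placeholders, not donto's tuned values, and the judge is a stub standing in for the LLM call. The two scores reused from this report are killed/murdered at 0.95 and the bornIn/diedIn false friend at 0.638.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    source: str
    target: str
    score: float

def triage(candidates, accept_at=0.90, reject_below=0.60):
    """Split candidates into auto-accepted, gray-band, and dropped."""
    accepted, gray, dropped = [], [], []
    for c in candidates:
        if c.score >= accept_at:
            accepted.append(c)
        elif c.score < reject_below:
            dropped.append(c)
        else:
            gray.append(c)   # only these cost an LLM call
    return accepted, gray, dropped

def adjudicate(candidates, judge):
    """Confirm candidates; `judge` is called only for the gray band."""
    accepted, gray, dropped = triage(candidates)
    confirmed = accepted + [c for c in gray if judge(c)]
    run_record = {
        "auto_accepted": len(accepted),
        "sent_to_judge": len(gray),
        "dropped": len(dropped),
        "confirmed": len(confirmed),
    }
    return confirmed, run_record

def stub_judge(c):
    # Stand-in for the LLM: rejects the born/died false friend.
    return {c.source, c.target} != {"bornIn", "diedIn"}

confirmed, record = adjudicate(
    [
        Candidate("killed", "murdered", 0.95),
        Candidate("bornIn", "diedIn", 0.638),
        Candidate("killed", "occupation", 0.20),
    ],
    stub_judge,
)
```

The confirmed list is what step 4 would register as bitemporal alignment rows; the run record is the kind of per-tick count step 7 writes.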

Idempotency and locking. Each loop tick must be safe to overlap or re-run: steps 1, 4, 5, 6 are upserts/recomputes keyed by stable identity, so a crashed tick re-does work harmlessly. A single advisory lock (e.g. pg_advisory_lock on a fixed key) prevents two loop instances from running steps 2–6 concurrently and double-proposing; embed-new (step 1) can fan out under that lock by batching disjoint IRI ranges. Durability via Temporal means a restart loses no progress — a half-finished tick resumes rather than restarting cold.

5.6 Query-time consultation

Reads never touch the loop; they consult the maintained artifacts:

5.7 Best-practice knobs

HNSW build and search parameters. pgvector's HNSW exposes three primary knobs:

| Knob | What it controls | Recommended posture for donto |
|---|---|---|
| m (build) | edges per node; graph density | default 16 for predicate/entity layers; consider 24–32 for the statement layer where recall matters most and the corpus is large |
| ef_construction (build) | candidate list size at build; build quality vs. build time | 64 baseline; raise to 128–200 for the high-value entity/statement layers, accepting slower builds (a once-per-layer cost) |
| ef_search (query, session-set) | candidate list at query; recall vs. latency | tune per read path: small (40–80) for interactive alignment lookups, larger (100–200) for offline identity-cluster rebuilds where recall trumps latency |

m and ef_construction are fixed at index build, so getting them right per layer matters before the layer scales; ef_search is a session GUC and is the live recall/latency dial.
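As a sketch of how these knobs surface in SQL, the helper below renders pgvector's index DDL and the session setting as strings. The `WITH (m, ef_construction)` options and the `hnsw.ef_search` setting are pgvector's own; the helper names, the `<table>_hnsw` index naming, and the statement-layer table are assumptions following the donto_<type>_embedding pattern.

```python
def hnsw_index_ddl(table, opclass="vector_cosine_ops", m=16, ef_construction=64):
    """Render CREATE INDEX for an HNSW index on the `embedding` column."""
    # pgvector rejects builds where ef_construction is below 2 * m.
    if ef_construction < 2 * m:
        raise ValueError("ef_construction must be >= 2 * m")
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_hnsw ON {table} "
        f"USING hnsw (embedding {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

def ef_search_sql(value):
    """Render the per-session recall/latency dial."""
    return f"SET hnsw.ef_search = {int(value)};"

predicate_ddl = hnsw_index_ddl("donto_predicate_embedding")
# Hypothetical statement layer: denser graph, halfvec storage.
statement_ddl = hnsw_index_ddl(
    "donto_statement_embedding",
    opclass="halfvec_cosine_ops",
    m=24,
    ef_construction=128,
)
interactive = ef_search_sql(60)
offline_rebuild = ef_search_sql(200)
```

Since m and ef_construction are baked in at build time, changing them later means rebuilding the index, whereas the `SET` line can differ per read path on the same index.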

halfvec for storage at scale. vector(384) is 4 bytes/dim = ~1.5 KB/row before index overhead. Across 865K predicates that is tolerable; across tens of millions of statements it is not — the embedding column plus its HNSW graph would dwarf reasonable memory and stress the /dev/sdb data disk (which hit 95% once, on 2026-06-02, when an ENOSPC truncated a source file; it sits at 56% / ~157 GB free as of 2026-06-03, but a statement-level build is exactly the kind of write that could exhaust it again). pgvector's halfvec (2 bytes/dim) roughly halves storage and index size for ~negligible recall loss on normalized sentence embeddings, and HNSW supports halfvec_cosine_ops. Posture: keep vector at the predicate/entity layers (small, precision-sensitive); adopt halfvec for the statement/span layers where row counts are large and the marginal recall cost is immaterial. This is a per-layer decision precisely because the fabric is per-object-type.
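The arithmetic behind that posture is short enough to check directly. The sketch below counts raw vector payload only (dimensions times bytes per dimension); per-value headers, tuple overhead, and the HNSW graph itself are excluded, so real usage is higher. Row counts are the live figures quoted in this report.

```python
def payload_bytes(rows: int, dim: int = 384, bytes_per_dim: int = 4) -> int:
    """Raw embedding payload, excluding row and index overhead."""
    return rows * dim * bytes_per_dim

PREDICATES = 865_836
STATEMENTS = 39_560_959

predicate_vector = payload_bytes(PREDICATES)                     # vector(384)
statement_vector = payload_bytes(STATEMENTS)                     # vector(384)
statement_halfvec = payload_bytes(STATEMENTS, bytes_per_dim=2)   # halfvec(384)

def gb(n: int) -> float:
    return round(n / 1e9, 1)

summary = {
    "predicates, vector": gb(predicate_vector),
    "statements, vector": gb(statement_vector),
    "statements, halfvec": gb(statement_halfvec),
}
```

The predicate layer is about 1.3 GB of payload, while a full statement layer is about 60.8 GB as vector and about 30.4 GB as halfvec, before any index. Set against the ~157 GB free on the data disk, that difference is what makes halfvec a per-layer decision rather than a blanket one.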

Batching. Embed in provider-sized batches (bge-small on CPU is throughput-bound, not latency-bound) and upsert in transactional chunks so a failure rolls back a bounded unit. Build/maintain HNSW indexes with adequate maintenance_work_mem; on the 16 GB box, building a large layer's index may warrant a temporary bump and off-peak scheduling, and --shm-size=2g on the donto-pg container must be preserved (it is a documented hard requirement of the pgvector image).

Idempotency & locking. Covered in §5.5: signature-hash skip + upsert-by-IRI make every step replay-safe; a single advisory lock serializes the adjudicate→closure→identity stages; Temporal durability makes the whole loop crash-safe.

Observability / coverage metrics. The loop's step-7 run record is the operational dashboard. Track at minimum, per layer:

These metrics also close the loop on the measurement-as-steering-wheel principle (§6): coverage and closure-growth are how you see abundance being tamed at query time instead of thrown away at write time.

5.8 What was missing vs. what was added

To make the "already scaffolded" claim precise:

| Component | State before this build | State after |
|---|---|---|
| alignment.rs embedding params (embedding_model, embedding: Vec<f32>) | present, unused | now fed real vectors |
| identity-edge method ∈ {…, embedding, neural} (migration 0060) | declared, never produced | producible by the loop's cluster step |
| derive_embeddings Trust-Kernel capability + CLI subcommand | present, no backend | backed by a real provider |
| donto_predicate_closure / donto_match_aligned | present, lexical-fed | semantic+lexical-fed |
| pgvector extension | absent | 0.8.2 installed |
| embedding source/provider | absent | fastembed bge-small-en-v1.5, local CPU |
| donto_predicate_embedding + HNSW index | absent | present, ~30,587 rows (rising) |
| semantic / hybrid suggest functions | absent | donto_suggest_alignments_semantic / _hybrid present |
| continuous scheduled loop | absent (engine dormant/un-scheduled) | the one loop (§5.5) |

The substrate had been built as if embeddings would arrive. This build delivered the two genuinely missing primitives — a vector index (pgvector) and a vector source (a local provider) — and the one piece of orchestration (the continuous loop) that turns them from a one-time backfill into a permanently-maintained fabric. Everything else was a socket waiting for a plug.


6. Measurement: the steering wheel

donto does not get to assert that the embedding fabric "works." The substrate's own design principle — measurement is the steering wheel — forbids it. Every claim in §§3–5 (semantic join keys beat lexical ones; pervasive embeddings raise recall; the fabric stays paraconsistent) is a hypothesis until it is instrumented, baselined, and tracked over time. This section defines the eval suite that turns the fabric from an architectural bet into a measured system, gives target and illustrative numbers anchored to the live store, and specifies the instrumentation each metric needs.

The discipline is the one that already governs extraction: a metric that is not stored, re-runnable, and time-sliced is not a metric. Each eval below is specified as (a) a gold or proxy ground-truth, (b) a numerator/denominator that can be recomputed against any tx_time slice, and (c) a steering decision it informs. The whole suite is written to a donto_eval_run context (ctx:eval/<suite>/<run-id>) so eval results are themselves bitemporal donto state — we can ask "what did alignment precision look like as of 2026-05-01?" the same way we ask any retrospective question.

6.0 Why measurement is load-bearing here specifically

The fabric is a quality intervention, not a capability one. Lexical alignment already returns something for almost every query; the question is never "does a result come back" but "is the result the right one, and would a human or a downstream task agree." That makes baselines non-negotiable. The live store gives us the denominators that make the metrics honest:

| Quantity | Live value (2026-06-03) | Role in the eval suite |
|---|---|---|
| Live statements | 39,560,959 | denominator for closure-expansion recall, time-slice population |
| Distinct predicates (registry) | 865,836 | embedding-coverage denominator |
| Distinct predicates in live statements | 985,448 | fragmentation numerator (the surface actually queried) |
| Singleton predicates (used exactly once) | 733,401 (84.7%) | the fragmentation tail the fabric must compress |
| Live predicate alignments | 18,488 | precision/recall test population |
| donto_predicate_closure rows | 62,559 | effective-predicate expansion factor |
| Predicates embedded | ~30,587 (3.5% of registry) | embedding-coverage starting point |
| Identity edges | 122 | identity-cluster purity test population (tiny; see §6.5) |
| Evidence links | ~1.88M (4.75% of stmts) | anchors for span-level relevance judging |

(The alignment, embedding, and identity counts drift between measurements because the loop is now actively running.) One row looks like a contradiction and is not: the distinct predicates in live statements (985,448) exceeds the registry (865,836) because statements freely reference predicate IRIs that were never written into donto_predicate — the registry is a catalog the firehose is allowed to outrun, which is itself a small abundance artifact (the emit path mints predicate IRIs on statements without a registry round-trip). The eval suite should use the larger, surface-actually-queried figure as the fragmentation denominator.

Two of these numbers are the report's central tension stated as measurements. The overwhelming majority of predicates are singletons — that is the fragmentation the fabric exists to compress. 3.5% embedding coverage is the gap between the fabric as designed and the fabric as deployed; until coverage is high, every other metric below is being measured on a partially-built fabric, and the coverage number must be reported alongside every other result as a confound.

6.1 Alignment precision / recall (the join-key quality eval)

What it measures. Of the alignment pairs the fabric proposes (lexical-only, semantic-only, hybrid-RRF), how many are true synonyms/sub-property relations (precision), and of the true relations that exist, how many does each method find (recall). This is the eval that directly tests the report's money-shot claim — that semantic finds killed↔︎murdered and killedBy↔︎assassinatedBy where lexical structurally cannot.

Gold set. We need a held-out labelled set of predicate pairs. Bootstrapping it without violating the no-brittle-logic rule (no hand-curated synonym dictionary used in the system): take a stratified sample of ~2,000 predicate pairs drawn from three buckets — (i) high lexical similarity (trigram > 0.5), (ii) high semantic similarity but low lexical (cosine > 0.85, trigram < 0.2 — the bucket lexical can never reach), (iii) random pairs as negatives — and have an LLM judge label each same / sub-property / related / unrelated, with a human spot-check of a 200-pair subset to estimate judge error. The gold set is a measurement artifact, never a runtime lookup; it lives in ctx:eval/alignment-gold/v1 and is itself versioned.
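The bucket assignment can be sketched with a small trigram-similarity function. The similarity below approximates pg_trgm for plain ASCII identifiers (lowercase, pad each word with two leading spaces and one trailing space, compare trigram sets); it is an illustration, not a replacement for the extension. The thresholds are the ones stated above, and bucket (iii) is a random sample rather than a threshold, so it is not assigned here.

```python
import re

def trigrams(text: str) -> set:
    """Trigram set in the style of pg_trgm (ASCII approximation)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def trigram_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def bucket(trigram: float, cosine: float):
    """Stratum for a candidate pair, or None if it falls in neither."""
    if trigram > 0.5:
        return "i"    # high lexical similarity
    if cosine > 0.85 and trigram < 0.2:
        return "ii"   # semantic-only: the stratum lexical cannot reach
    return None

sims = {
    other: round(trigram_similarity("killed", other), 4)
    for other in ["murdered", "slew", "assassinated", "killedBy"]
}
killed_murdered_bucket = bucket(sims["murdered"], cosine=0.95)
```

This reproduces the figures quoted in the abstract (0.0667 for murdered, 0.0 for slew, 0.0556 for assassinated), and places killed/murdered squarely in bucket (ii).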

Method comparison (illustrative targets). The structural point is bucket (ii): semantic and hybrid should dominate there by construction, and the lexical column should be near-zero — that asymmetry is the thesis.

| Method | Precision@proposed | Recall (all buckets) | Recall on bucket (ii) only |
|---|---|---|---|
| Lexical (trigram, current prod) | ~0.80 (illustrative) | ~0.45 | ~0.02 (structurally near-zero) |
| Semantic (bge-small cosine) | ~0.78 | ~0.70 | ~0.75 |
| Hybrid (RRF of lexical+semantic) | ~0.86 | ~0.78 | ~0.74 |

The numbers are illustrative pending the first gold run, but the shape is a prediction the eval will confirm or falsify: lexical's bucket-(ii) recall must be ~0 (it cannot find synonyms below a usable trigram threshold — the killed↔︎murdered case at 0.0667), and hybrid should beat both single methods on overall recall while matching or exceeding lexical on precision via RRF's agreement-weighting. If semantic precision comes in below lexical, that steers us to raise the cosine threshold or add an LLM-adjudication gate on the semantic-only proposals before they are written as alignments.
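A sketch of the per-method scoring, including the bucket-(ii) breakdown, is below. The gold labels and proposal lists are toy data invented for illustration (they are not judged results), and the function names are hypothetical. Pairs are unordered, so each is stored as a frozenset.

```python
def pair(a, b):
    return frozenset((a, b))

def precision_recall(proposed, gold):
    """gold maps pair -> (bucket, is_true_relation); proposed is a list of pairs."""
    judged = [p for p in proposed if p in gold]
    true_pairs = {p for p, (_, is_true) in gold.items() if is_true}
    hits = [p for p in judged if p in true_pairs]
    precision = len(hits) / len(judged) if judged else 0.0
    recall = len(hits) / len(true_pairs) if true_pairs else 0.0
    return precision, recall

def recall_in_bucket(proposed, gold, bucket):
    targets = {p for p, (b, is_true) in gold.items() if is_true and b == bucket}
    if not targets:
        return 0.0
    return len(targets & set(proposed)) / len(targets)

# Toy gold set: labels are illustrative only.
gold = {
    pair("killed", "murdered"): ("ii", True),
    pair("killedBy", "assassinatedBy"): ("ii", True),
    pair("bornIn", "bornAt"): ("i", True),
    pair("bornIn", "diedIn"): ("iii", False),
}
lexical = [pair("bornIn", "bornAt")]
semantic = [
    pair("killed", "murdered"),
    pair("killedBy", "assassinatedBy"),
    pair("bornIn", "diedIn"),
]

lexical_pr = precision_recall(lexical, gold)
semantic_pr = precision_recall(semantic, gold)
lexical_ii = recall_in_bucket(lexical, gold, "ii")
semantic_ii = recall_in_bucket(semantic, gold, "ii")
```

On this toy set the lexical proposals are precise but reach nothing in bucket (ii), while the semantic proposals reach all of it at the cost of one false friend. That is the shape the real gold run is meant to confirm or falsify.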

A blunt honesty note on the current empirical base. As of this report, the entire measured evidence for the thesis is a single predicate: killed, whose semantic top-k (murdered 0.95, then the killed* morphological family at 0.93–0.94) and lexical top-k (the killed* family, no murdered) are reproduced above from live function calls, plus one live false-friend cosine (bornIn/diedIn at 0.638). That is one decisive anecdote, not a trend — and it is partly an artifact of coverage: at 3.5% embedding coverage most candidate probes (married, occupation, …) currently return nothing semantic because their neighbours are not yet embedded. The honest minimum next step, achievable before the full 2,000-pair gold set, is a 20-neighbour-by-5-predicate hand spot-check: pick five well-covered predicates spanning easy (dictionary-word) and hard (camelCase/sentence-length) shapes, label the top-20 semantic neighbours of each same / sub-property / inverse / related / unrelated, and report raw precision@20. Until that exists, every numeric row in this section is explicitly a projection, and the report says so.

Instrumentation needed. A harness that, for each gold pair, queries all three donto_suggest_alignments* functions and records rank + score; a confusion-matrix writer keyed to the gold labels; and a per-bucket breakdown so the bucket-(ii) asymmetry is always visible. The functions already exist (donto_suggest_alignments, _semantic, _hybrid); what is missing is the gold context and the runner.

6.2 Predicate-fragmentation reduction (the abundance-compression eval)

What it measures. Abundance produces 985,448 distinct surface predicates in live statements (§6.0), the overwhelming majority of them singletons. The fabric's job is not to delete them (paraconsistency forbids that) but to make them queryable as fewer effective predicates via closure expansion at query time. The metric is the effective-predicate count after closure: how many equivalence-ish clusters do the alignment+closure relations induce, and how much of the singleton tail gets pulled into a cluster with a populated, well-typed predicate?

Definition. Run connected-components over the donto_predicate_closure graph (62,559 rows today). Report:

| Metric | Today (lexical closure) | Target after semantic fabric |
|---|---|---|
| Effective predicates (components) | ~ (distinct − small merges) | meaningful contraction, tail-driven |
| Singleton rescue rate (≥10-freq anchor) | low (lexical can't reach the tail) | the steering target; track monthly |
| Mean component size | ~1.0+ | rises as the tail joins |

The critical guard. Compression must be reported with §6.1 precision. Fragmentation reduction is trivially maximized by aligning everything to everything — which destroys precision and, worse, would silently collapse genuinely-distinct predicates. The steering rule is: maximize singleton rescue subject to alignment precision ≥ threshold. A fragmentation drop with no precision floor is a regression, not a win.

Instrumentation needed. A periodic connected-components job over donto_predicate_closure writing component sizes and the rescue-rate histogram to ctx:eval/fragmentation/<date>.
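The connected-components job can be sketched with a union-find. The reading of "singleton rescue rate" used here (the share of frequency-1 predicates that land in a component containing a predicate used at least 10 times) is an interpretation of the table above, and the frequencies and edges are toy data.

```python
def components(nodes, edges):
    """Connected components via union-find with path halving."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for n in list(parent):
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

def fragmentation_report(freq, edges, anchor_freq=10):
    """freq: predicate -> usage count; edges: closure pairs."""
    comps = components(freq, edges)
    singletons = sum(1 for f in freq.values() if f == 1)
    rescued = sum(
        sum(1 for p in comp if freq.get(p, 0) == 1)
        for comp in comps
        if any(freq.get(p, 0) >= anchor_freq for p in comp)
    )
    total = sum(len(c) for c in comps)
    return {
        "effective_predicates": len(comps),
        "mean_component_size": total / len(comps),
        "singleton_rescue_rate": rescued / singletons if singletons else 0.0,
    }

# Toy data: three singletons, two of them aligned to a populated anchor.
freq = {"killed": 120, "murdered": 1, "slew": 1, "bornIn": 50, "xyzzy": 1}
edges = [("killed", "murdered"), ("murdered", "slew")]
report = fragmentation_report(freq, edges)
```

Nothing is deleted in this computation: the five surface predicates all survive, and only the count of effective predicates drops to three.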

6.3 Search relevance lift (nDCG / MRR, lexical vs semantic vs hybrid)

What it measures. Whether the fabric improves the substrate-wide /search (the GIN-tsvector path over 39M statements) and the predicate/entity retrieval that feeds query-time joins. This is the user-facing payoff metric.

Gold set. A query set with graded relevance judgments. Two sources, neither hand-maintained as a runtime list: (i) proxy-from-behavior — Omega's recallMemories.ts and the genealogy front already issue real queries; log query→clicked/used-result pairs as weak positive signal; (ii) LLM-judged — for ~150 representative queries (genealogy name+place lookups, memory recall, predicate-intent queries) have an LLM grade the top-20 from each method 0–3. Store in ctx:eval/search-gold/v1.

Metrics. nDCG@10 and MRR per method, plus a hybrid via Reciprocal Rank Fusion column — because RRF (§4.3) is the principled way to combine the lexical FTS path (which is what prod runs today) with the semantic path without tuning a score-scale.

| Method | nDCG@10 | MRR | Notes |
|---|---|---|---|
| Lexical FTS (current donto_statement_fts_name) | baseline | baseline | the brittle fallback; strong on exact name hits |
| Semantic (statement/entity embeddings) | +Δ on paraphrase/variant queries | | wins on spelling variants (Mahamoodally↔︎Mamode Ally) |
| Hybrid (RRF) | ≥ max(lexical, semantic) | ≥ max | the production target |

donto-specific prediction. Lexical should win or tie on exact-token queries (it is excellent at "Maurel" → the Maurel statements) and lose badly on variant/paraphrase queries — exactly the genealogy variant-spelling problem (Lablanche↔︎Lablache, Collinson↔︎Colinson) the handbook calls out. Hybrid-RRF should never lose to either single method on nDCG; if it does on a query class, that class needs reweighting. Report nDCG segmented by query class (exact-name / variant-name / intent/paraphrase) — a single average hides the whole story.
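For reference, the two metrics can be computed as below. This sketch assumes the exponential gain form of DCG (2^grade − 1) and takes the ideal ordering from the judged grades of the list being scored; other conventions exist, and the source does not say which one the suite will adopt. Grades are the 0–3 LLM judgments described in the gold set.

```python
import math

def dcg(grades):
    """Discounted cumulative gain with exponential gain."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_grades, k=10):
    """nDCG@k for one query; ranked_grades are grades in returned order."""
    ideal = dcg(sorted(ranked_grades, reverse=True)[:k])
    return dcg(ranked_grades[:k]) / ideal if ideal else 0.0

def mrr(per_query_grades, relevant_at=1):
    """Mean reciprocal rank of the first result graded >= relevant_at."""
    total = 0.0
    for grades in per_query_grades:
        for rank, g in enumerate(grades, start=1):
            if g >= relevant_at:
                total += 1.0 / rank
                break
    return total / len(per_query_grades) if per_query_grades else 0.0

# Toy judgments for illustration.
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([3, 2, 0, 1])
nothing_relevant = ndcg_at_k([0, 0, 0])
toy_mrr = mrr([[0, 0, 3], [2, 0, 0]])
```

Segmenting by query class is then a matter of grouping queries before averaging, so that an exact-name win does not mask a variant-name loss.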

Instrumentation needed. Statement- and entity-level embeddings (currently only predicates are embedded), a query-logging hook in /search and /recall, and an RRF fusion step in the search route. The latency budget already exists (the route caps at a 2000-row candidate CTE + 9s timeout); semantic retrieval must live inside it via the HNSW index, not a brute-force scan.

6.4 Embedding coverage (the readiness gauge)

What it measures. What fraction of each object type carries a current vector. This is the gauge that contextualizes every other metric: a 0.78 semantic recall measured at 3.5% predicate coverage means something very different at 95% coverage.

Definition. Per object type, coverage = (objects with a non-stale embedding) / (objects). "Non-stale" matters: an embedding is stale if the underlying text changed after the embedding was last written (the updated_at column in §5.1), or if it was produced by a model other than the active one. The fabric's continuous loop (§5.5) must close this gap and keep it closed.

| Object type | Embedded today | Total | Coverage | Target |
|---|---|---|---|---|
| Predicates | ~30,587 | 865,836 | 3.5% | ≥ 99% (queued backfill) |
| Entities | 0 | (millions) | 0% | high-value-first, then full |
| Statements | 0 | 39,560,959 | 0% | sampled then full (cost-gated) |
| Spans / evidence | 0 | ~1.88M | 0% | full (small, high-value) |
| Documents | 0 | | 0% | full |
| Contexts | 0 | ~19,812 | 0% | full (cheap) |
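The coverage definition reduces to a short computation. The sketch below assumes each object carries the time its text last changed and, if embedded, the time its vector was written; those field names are hypothetical. The predicate figure at the end uses the live counts from the table and ignores staleness, since no staleness data is reported yet.

```python
from datetime import datetime

def is_stale(embedded_at: datetime, text_changed_at: datetime) -> bool:
    """Stale when the text changed after the vector was written."""
    return text_changed_at > embedded_at

def coverage(objects) -> float:
    """objects: list of (embedded_at or None, text_changed_at)."""
    if not objects:
        return 0.0
    fresh = sum(
        1
        for embedded_at, text_changed_at in objects
        if embedded_at is not None and not is_stale(embedded_at, text_changed_at)
    )
    return fresh / len(objects)

t0 = datetime(2026, 6, 1)
t1 = datetime(2026, 6, 2)
toy = [
    (t1, t0),     # embedded after the last text change: fresh
    (t0, t1),     # text changed since embedding: stale
    (None, t0),   # never embedded
    (t1, t1),     # embedded at the same instant: fresh
]
toy_coverage = coverage(toy)

# Predicate row from the table, ignoring staleness.
predicate_coverage_pct = round(100 * 30_587 / 865_836, 1)
```

A stale row counts the same as a missing one here, which is the point of the "non-stale" qualifier: a backfill that never refreshes would otherwise report full coverage forever.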

Steering use. Coverage is the prerequisite metric — it gates interpretation of §§6.1–6.3. The report's honest framing is that the fabric is presently a predicate-only fabric at 3.5%; the eval suite must publish coverage first and refuse to over-claim the others until coverage on the relevant object type is high. Cheap high-value layers (~19,812 contexts; ~1.88M spans) should be backfilled first because they unlock span-level relevance judging (§6.3) and context-routing for very little compute.

Instrumentation needed. A donto_embedding_coverage view (counts per type + staleness), and a freshness SLO on the maintenance loop (e.g. p95 time-from-mint-to-embedded < 1h for predicates).

6.5 Identity-cluster purity and coverage (the paraconsistency guard)

What it measures. Embeddings cluster entity mentions; identity decides whether two mentions are the same individual. The non-negotiable nuance from the report's thesis is that embeddings must cluster and rank but must not collapse — identity stays a hypothesis (donto_identity_edge), not a destructive merge. This eval proves the fabric helps propose identity candidates without silently merging distinct people — a live genealogy hazard (16 distinct "Kittys"; the ex:kitty junk-drawer URI).

Caveat the honest report must state. The identity machinery is barely exercised: only 122 identity edges total against tens of millions of distinct subjects. Purity/coverage here are measured on a near-empty population; the metric's first job is to grow that population safely, then measure it. The Kitty disambiguation (16 distinct individuals collapsed under one URI) is the natural first gold case.

Instrumentation needed. An entity-mention embedding layer (§6.4 currently 0%), a candidate-clustering job (HNSW k-NN, no merge), and an LLM/human adjudication queue that materializes confirmed/refuted decisions as identity-edge hypotheses — never as merges.

6.6 Downstream task-lift (does the fabric move a real outcome)

Intrinsic metrics (§§6.1–6.5) can all improve while the user-facing system does not. Task-lift is the metric that matters most and the one the abundance vision centers — measurement as steering means optimizing the downstream task, not the intermediate score.

Flagship: jsonresume→jobs matching. The match quality depends entirely on query-time alignment of freely-minted skill predicates (usesReact, reactDeveloper, proficientInReact, ESCO/Lightcast skill IRIs). Metric: match precision/recall and explainability coverage (fraction of matches with a citable evidence path) against a gold set of resume→suitable-job judgments. Ablation: matching with lexical-only alignment vs hybrid-fabric alignment.

| Configuration | Match recall | Match precision | Explainable-match rate |
|---|---|---|---|
| Lexical alignment only | baseline | baseline | baseline |
| + semantic fabric (hybrid) | +Δ (target the headline) | ≥ baseline | ≥ baseline |

Genealogy record-matching recall. The fabric's job is to raise recall on the variant-spelling problem without losing precision: of known true person-record matches (DNA-triangulated cases, the Sherrington/Brooks gold from the live research), what fraction does fabric-assisted search surface that lexical misses? This is the killed↔︎murdered contrast (trigram 0.0667, far below threshold) applied to names: Mahamoodally↔︎Mamode Ally is a near-zero-overlap pair lexical cannot bridge at any usable cutoff.
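The trigram figures quoted for the killed family can be reproduced with a short sketch. It assumes the lexical key behaves like PostgreSQL's pg_trgm similarity() on lowercase ASCII words (each word padded with two leading spaces and one trailing space, similarity = shared trigrams over the union); whether donto's lexical key is exactly this function is an assumption, although the numbers below match the ones reported.

```python
import re


def trigrams(text):
    """Set of padded character trigrams, one word at a time (pg_trgm-style)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by the union of trigrams; 0.0 for empty input."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    for other in ("murdered", "slew", "assassinated"):
        print(other, round(trigram_similarity("killed", other), 4))
    # murdered 0.0667
    # slew 0.0
    # assassinated 0.0556
```

The only trigram killed shares with murdered or assassinated is the trailing "ed ", which is why the score is non-zero yet far below the 0.30 threshold.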

Instrumentation needed. A held-out gold for each task (jobs-match judgments; DNA-confirmed genealogy match pairs already exist in the research corpus), and an ablation switch that runs the same pipeline with the fabric on/off so the lift is attributable to the fabric and nothing else.
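One way to make the lift attributable is to route both configurations through the same function and change a single flag. The sketch below assumes a hypothetical match_fn(query, use_fabric=...) that returns candidate IDs; the function name, the flag, and the result layout are illustrative, not an existing donto interface.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (query, match) pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def run_ablation(match_fn, queries, gold):
    """Run the same pipeline twice, fabric off then on, and report the lift.

    match_fn is a hypothetical callable: match_fn(query, use_fabric=bool)
    returns an iterable of match IDs. Nothing else differs between runs.
    """
    results = {}
    for label, use_fabric in (("lexical_only", False), ("hybrid_fabric", True)):
        predicted = {
            (q, m) for q in queries for m in match_fn(q, use_fabric=use_fabric)
        }
        results[label] = precision_recall(predicted, gold)
    p0, r0 = results["lexical_only"]
    p1, r1 = results["hybrid_fabric"]
    results["lift"] = (p1 - p0, r1 - r0)
    return results
```

Because the gold set and the query list are shared, any difference in the "lift" entry comes from the flag alone, which is the property the ablation switch exists to guarantee.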

6.7 Retrospective time-slicing (donto's signature eval)

This is the eval no vector DB or normal KG can run, and it is the one that proves the bitemporal-paraconsistent design pays off. Question: does an alignment learned at time T improve recall on claims that were ingested before T?

Why it is unique to donto. A collapse-on-conflict store rewrites history; it cannot ask "given what I know now, how would my older answers change," because the old state is gone. donto keeps everything as legal bitemporal state, so we can hold the claim population fixed at an older tx_time and vary only the alignment knowledge to its newer state. The deferred-join architecture means alignment learned today retroactively improves every claim ever stored — and we can measure exactly that.

Protocol.

  1. Freeze a query gold set Q and a claim population as of tx_time = T0 (e.g. 2026-05-01).
  2. Measure recall/nDCG on Q using alignment knowledge frozen at T0.
  3. Re-measure the same Q on the same T0 claim population, but using alignment+closure knowledge as of T1 = now.
  4. Lift = metric(T1-alignment, T0-claims) − metric(T0-alignment, T0-claims).

A positive lift is the proof: learning is retroactive. An alignment minted today (reactDeveloper ↔︎ usesReact, killedBy ↔︎ assassinatedBy) raises recall on claims that have sat untouched in the store for months — without re-ingesting or rewriting a single statement. That is "defer joining to query time" delivering compounding value, measured.
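The four protocol steps can be sketched on toy data. The tuple layouts, the one-hop symmetric expansion (no transitive closure), and the function names are assumptions made for illustration; the real harness would read the bitemporal closure tables with an as-of tx_time instead.

```python
from datetime import date


def alignments_as_of(alignments, as_of):
    """alignments: list of (pred_a, pred_b, tx_time).

    Returns a one-hop, symmetric expansion map using only alignments whose
    tx_time is at or before `as_of`.
    """
    expand = {}
    for a, b, tx_time in alignments:
        if tx_time <= as_of:
            expand.setdefault(a, {a}).add(b)
            expand.setdefault(b, {b}).add(a)
    return expand


def recall(queries, claims, alignments, claims_as_of, align_as_of):
    """queries: list of (query_predicate, set_of_gold_claim_ids).
    claims: list of (claim_id, predicate, tx_time)."""
    expand = alignments_as_of(alignments, align_as_of)
    visible = [(cid, p) for cid, p, tx_time in claims if tx_time <= claims_as_of]
    hits = total = 0
    for query_pred, gold in queries:
        preds = expand.get(query_pred, {query_pred})
        found = {cid for cid, p in visible if p in preds}
        hits += len(found & gold)
        total += len(gold)
    return hits / total if total else 0.0


def retroactive_lift(queries, claims, alignments, t0, t1):
    """Step 4: metric(T1-alignment, T0-claims) - metric(T0-alignment, T0-claims)."""
    baseline = recall(queries, claims, alignments, claims_as_of=t0, align_as_of=t0)
    retro = recall(queries, claims, alignments, claims_as_of=t0, align_as_of=t1)
    return retro - baseline
```

Note that the claim population is pinned to T0 in both calls; only the alignment cutoff moves, so a non-zero result cannot come from newly ingested claims.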

| Slice | Alignment knowledge | Claim population | Recall@10 | Interpretation |
|---|---|---|---|---|
| Baseline | T0 (2026-05-01) | T0 | baseline | what we could answer then |
| Retroactive | T1 (now) | T0 (held fixed) | baseline + Δ | what today's knowledge unlocks in old data |

Instrumentation needed. Bitemporal-aware querying in the harness (every alignment lookup must accept an as-of tx_time; the closure tables are already bitemporal, so this is queryable today), a frozen gold set with stable IDs, and a stored lift series so the retroactive Δ is itself tracked over time (the trend of the lift: is the fabric's retroactive power growing?).

6.8 The eval as continuous instrument

These seven evals are not a one-time validation; they are the dashboard the maintenance loop steers by. The intended cadence:

| Eval | Cadence | Primary steering decision |
|---|---|---|
| Embedding coverage (§6.4) | continuous (SLO) | backfill priority; gates interpretation of all others |
| Alignment P/R (§6.1) | per alignment-model change | cosine/RRF thresholds; LLM-gate on semantic-only |
| Fragmentation reduction (§6.2) | weekly | singleton-rescue vs precision-floor tradeoff |
| Search relevance (§6.3) | weekly + per index change | fusion weights by query class |
| Identity purity/coverage (§6.5) | per clustering run | candidate threshold; collapse-safety must always pass |
| Task-lift (§6.6) | per release | ship/hold the fabric change |
| Retrospective time-slice (§6.7) | monthly | proves compounding; reports the headline retroactive Δ |

The unifying rule, and the one that keeps the fabric honest: every quality metric is reported jointly with embedding coverage and with the paraconsistency invariant (0 destructive merges). A recall win that is really a coverage confound, or a fragmentation win bought with a silent merge, is not a win; it is a regression the steering wheel exists to catch. The fabric earns its place only when these numbers move together in the right direction, and donto is the rare substrate that can prove they did: retroactively, against history it never threw away.
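As a sketch, the joint-reporting rule can be written as a gate that refuses to call anything a win unless all three readings agree. The EvalReport fields and the reading of "coverage confound" as "coverage fell relative to the baseline run" are assumptions for illustration; a real gate might compare evaluated populations more carefully.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    metric: str               # e.g. "recall@10"
    value: float              # metric in the candidate run
    baseline: float           # metric in the baseline run
    coverage: float           # embedding coverage during the candidate run
    baseline_coverage: float  # embedding coverage during the baseline run
    destructive_merges: int   # paraconsistency invariant: must be 0


def is_win(report, max_coverage_drop=0.0):
    """A metric gain counts only if no merge happened and coverage held."""
    if report.destructive_merges != 0:
        return False   # any silent merge is a regression, whatever the score
    if report.coverage < report.baseline_coverage - max_coverage_drop:
        return False   # gain may be an artifact of measuring on less data
    return report.value > report.baseline
```

The order of the checks mirrors the priority in the text: the merge invariant first, coverage second, and the headline metric last.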


7. Costs, tradeoffs, and what embeddings must NOT do

An embedding fabric that touches every object type is a strong claim, and a report that only sold the upside would be dishonest. This section is the counterweight. It states the real costs in real numbers for this box, the tradeoffs we are deliberately accepting, and — most importantly — the one thing embeddings must never be allowed to do inside donto, because doing it would dissolve the differentiator the whole substrate exists to protect.

The short version: embeddings are cheap and continuous for the small, high-leverage object types (predicates and entities, ~1.8M vectors) and expensive and opt-in for the large ones (statements, spans, documents). And across all of them, the embedding fabric is permitted to cluster and rank, and forbidden to collapse and merge. Everything below elaborates those two sentences.

7.1 The disk arithmetic, done honestly

The single hardest constraint is the data disk. /mnt/donto-data (/dev/sdb) is ~373 GB and already carries pgdata (the live substrate), backups, and the genealogy workspace/. As of 2026-06-03 it is 56% used (~157 GB free) — but that headroom is recent and fragile: on 2026-06-02 this same disk hit 95% and an ENOSPC truncated a source file mid-write, and it was only pruning old backups that bought back the space. Whatever the embedding fabric costs, it pays out of that budget, against a disk that has demonstrated it can fill. So the design has to be sized against it before anything else.

A bge-small-en-v1.5 vector is 384 float4 = 1,536 bytes of raw payload. The realistic on-disk cost is higher than that once you count Postgres row overhead, the vector type header, the foreign-key/IRI column, and — the big one — the HNSW index, which for typical m/ef_construction settings runs roughly the same order of magnitude as the vectors themselves (often 1.0–1.5x the raw vector bytes). A defensible planning figure is ~3–4 KB of total footprint per embedded object (vector row + its share of the HNSW graph). Applying that to each object type:

| Object type | Population (live) | Raw vectors @1.5 KB | Vectors + HNSW @ ~3.5 KB | Verdict |
|---|---|---|---|---|
| Predicates | 865,836 distinct | ~1.3 GB | ~3.0 GB | Embed fully + continuously |
| Entities (distinct subjects) | ~1M (order-of-magnitude) | ~1.5 GB | ~3.5 GB | Embed fully + continuously |
| Contexts | ~19,812 | ~30 MB | ~70 MB | Trivial; embed fully |
| Statements | 39,560,959 | ~59 GB | ~140 GB | Tiered / opt-in only |
| Evidence spans | ~1.88M evidence links | ~3 GB | ~7 GB | Opt-in per consumer |

The contrast is the entire policy. Predicates + entities + contexts together are ~1.8M vectors ≈ 6–7 GB all-in: a rounding error against the free budget, and small enough that a single continuous loop can keep every one of them fresh. Statements are a different universe: embedding all 39.5M of them is ~59 GB of raw vectors and ~140 GB once the HNSW index is built, about 89% of the current ~157 GB free, on top of a donto_statement heap that is already 34 GB total. That is not a maintenance burden we can absorb; it would leave the disk a single backup run away from repeating the 2026-06-02 ENOSPC. A blanket statement build is, in effect, a disk-exhaustion event with a build step attached. (For calibration, the predicate embedding table, with only ~30,587 of 865,836 predicates embedded so far, is already ~110 MB.)
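The arithmetic behind the table is simple enough to keep as a runnable check. The sketch uses the table's own planning figures (1.5 KB raw and ~3.5 KB all-in per vector, decimal GB) and the populations quoted above; the entity and evidence-span counts are the approximate values from the table, not measurements, and the constant names are hypothetical.

```python
RAW_BYTES_PER_VECTOR = 1_500      # table's "@1.5 KB" figure (384 float4 is 1,536 bytes exactly)
ALL_IN_BYTES_PER_VECTOR = 3_500   # planning figure: vector row plus its share of HNSW
FREE_DISK_GB = 157                # free space quoted for /mnt/donto-data on 2026-06-03

POPULATION = {
    "predicates": 865_836,
    "entities": 1_000_000,        # order-of-magnitude estimate from the table
    "contexts": 19_812,
    "statements": 39_560_959,
    "evidence_spans": 1_880_000,  # approximate
}
KEY_TYPES = ("predicates", "entities", "contexts")


def footprint_gb(count, bytes_per_vector):
    """Footprint in decimal gigabytes."""
    return count * bytes_per_vector / 1e9


def report():
    rows = {
        name: (
            footprint_gb(n, RAW_BYTES_PER_VECTOR),
            footprint_gb(n, ALL_IN_BYTES_PER_VECTOR),
        )
        for name, n in POPULATION.items()
    }
    keys_all_in = sum(rows[k][1] for k in KEY_TYPES)
    statements_all_in = rows["statements"][1]
    return {
        "rows": rows,
        "keys_all_in_gb": keys_all_in,
        "statements_share_of_free": statements_all_in / FREE_DISK_GB,
        "statements_vs_keys_ratio": statements_all_in / keys_all_in,
    }


if __name__ == "__main__":
    summary = report()
    for name, (raw, all_in) in summary["rows"].items():
        print(f"{name:15s} raw {raw:8.2f} GB   all-in {all_in:8.2f} GB")
    print(f"keys all-in: {summary['keys_all_in_gb']:.1f} GB")
    print(f"statements share of free disk: {summary['statements_share_of_free']:.0%}")
```

With unrounded inputs the statement tier comes to roughly 138 GB, which the text rounds to ~140 GB; the conclusion does not depend on the rounding.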

The conclusion is not "embeddings don't scale." It is that the small object types are where the leverage is anyway. The query-time-alignment vision is fundamentally about joining on predicates and entities, the keys. A predicate vector is consulted on behalf of every statement that uses that predicate; embedding 866K predicates buys you semantic reach over all 39.5M statements for the cost of 866K vectors. Embedding the statements themselves would buy comparatively little additional join power at ~20x the all-in cost of the keys (~140 GB against 6–7 GB). The math and the vision agree: embed the keys fully, embed the rows selectively.

7.2 The tiering policy for statement-level embeddings

Statement and span embeddings are therefore a tiered, opt-in layer, never a blanket build. Concretely:

This keeps the genuinely large layer governed by explicit, bounded opt-in, and keeps the disk constraint from ever being the thing that breaks.

7.3 Drift, staleness, and the cost of re-embedding

Vectors are not write-once. Three things make an existing embedding wrong over time:

  1. New objects. The substrate mints predicates and entities continuously (the abundance firehose — ~4,995 of 6,111 frontier-test predicates were singletons, i.e. brand new). An embedding that doesn't exist yet is the most common kind of staleness. This is exactly the current state: **~30,587 of 865,836 predicates are embedded (3.5%) and climbing**; the remaining ~835K are a queued background job. Until that backfill completes, semantic alignment silently degrades to lexical for any predicate without a vector — a correctness gap, not just a coverage gap.
  2. Model change. Swapping the embedding model (or its dimensionality) invalidates every vector at once. A migration off bge-small re-embeds the entire fabric. For 1.8M predicate+entity vectors at fastembed throughput this is hours, not days, and is the strongest reason to keep the large statement tier small — re-embedding 39.5M statements on a model change would be prohibitive.
  3. Definitional drift. A predicate's meaning can shift as its usage distribution changes (a freely-minted status predicate used one way in genealogy and another in memory). The vector embeds the name/descriptor, so this is slower-moving, but it argues for periodic re-embedding from current descriptors rather than embed-once-forget.

The mitigation is the one continuous loop the fabric is built around (§5.5): a single scheduled worker that (a) embeds new objects, (b) re-embeds objects whose descriptor changed, and (c) carries a model_version / embed_version stamp on every vector so a model swap is a filtered backfill (WHERE embed_version < current) rather than a stop-the-world rebuild. Crucially, this loop must actually run — the alignment engine's prior failure mode was being built but dormant/un-scheduled (lexical-only, never invoked). An embedding fabric that isn't continuously refreshed is worse than no fabric, because it gives stale answers with the confidence of fresh ones.
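One pass of that loop can be sketched as below. The VectorRow fields, the descriptor hash used to detect descriptor changes, the batch size, and the `embed` callable (standing in for the local embedding provider) are all assumptions for illustration; only the embed_version idea comes from the text.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

CURRENT_EMBED_VERSION = 2   # hypothetical; bumped whenever the model changes


@dataclass
class VectorRow:
    iri: str
    descriptor: str                  # the text that gets embedded
    vector: Optional[list] = None    # None means never embedded
    embed_version: int = 0
    descriptor_hash: str = ""        # hash of the descriptor that was embedded


def hash_descriptor(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_embedding(row, current_version):
    """(a) new object, (c) older model version, or (b) descriptor changed."""
    return (
        row.vector is None
        or row.embed_version < current_version
        or row.descriptor_hash != hash_descriptor(row.descriptor)
    )


def refresh_batch(rows, embed, current_version=CURRENT_EMBED_VERSION, batch_size=256):
    """Embed up to batch_size stale rows in place; return how many were refreshed.

    `embed` is any callable mapping a list of strings to a list of vectors.
    """
    stale = [r for r in rows if needs_embedding(r, current_version)][:batch_size]
    if not stale:
        return 0
    vectors = embed([r.descriptor for r in stale])
    for row, vec in zip(stale, vectors):
        row.vector = list(vec)
        row.embed_version = current_version
        row.descriptor_hash = hash_descriptor(row.descriptor)
    return len(stale)
```

A scheduler would call refresh_batch repeatedly until it returns 0. After a model swap, raising current_version makes every row stale again, so the same loop performs the filtered backfill without a separate rebuild path.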

7.4 Model and dimensionality tradeoffs

The current choice is BAAI/bge-small-en-v1.5, 384-dim, run locally via fastembed. The tradeoffs that pinned that choice, and where they'd be revisited:

Dimensionality is also a schema commitment: the donto_predicate_embedding column and its vector_cosine_ops HNSW index are typed to 384. Changing it is a migration, which is the final argument for getting the default right rather than oscillating.

7.5 The false-friend risk, and why LLM adjudication gates

Semantic similarity is powerful and over-eager. §4.1 laid out the taxonomy of the embedding's blind spot (relation-type blindness, inverses-as-equivalents, false friends) as one of three signals; this section is the cost-side restatement of the same fact, grounded in a live measurement. The same property that lets cosine find murdered ≈ killed (cosine 0.95, trigram 0.0667) will also rank dangerous near-neighbours highly:

This is why the cosine score is a candidate generator, not a decision — the §4 ensemble in its sharpest form. The pipeline is explicitly multi-stage: lexical + semantic propose (donto_suggest_alignments, _semantic, _hybrid), and an LLM adjudicator disposes — judging whether a high-cosine pair is exact_equivalent, inverse_equivalent, sub_property_of, close_match, or not_equivalent / incompatible_with (all of which are first-class values of the relation column on donto_predicate_alignment). The embedding's job is to shrink 866K candidates to a handful; the adjudicator's job is to assign the typed relation and the confidence. Skipping the adjudicator — promoting on cosine alone — is precisely the brittle, semantically-wrong shortcut the alignment relation enum exists to prevent.
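The propose/dispose split can be sketched in a few lines. The relation names are the ones listed above; everything else (the function names, the in-memory vector map, the adjudicator callable standing in for the LLM, and the decision record layout) is a hypothetical illustration, not the donto_suggest_alignments interface.

```python
from math import sqrt

ALLOWED_RELATIONS = frozenset({
    "exact_equivalent", "inverse_equivalent", "sub_property_of",
    "close_match", "not_equivalent", "incompatible_with",
})


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def propose(query, vectors, cosine_floor=0.6, top_k=5):
    """Candidate generation only: rank neighbours, decide nothing."""
    q = vectors[query]
    scored = [(cosine(q, v), name) for name, v in vectors.items() if name != query]
    scored = [(s, n) for s, n in scored if s >= cosine_floor]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


def dispose(query, candidates, adjudicator):
    """Every candidate gets a typed relation from the adjudicator.

    `adjudicator(source, target)` returns (relation, confidence). The cosine
    is stored as evidence but never chooses the relation.
    """
    decisions = []
    for name, score in candidates:
        relation, confidence = adjudicator(query, name)
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"unknown relation: {relation!r}")
        decisions.append({
            "source": query, "target": name, "cosine": score,
            "relation": relation, "confidence": confidence,
        })
    return decisions
```

There is deliberately no branch that maps a cosine value to a relation: a high score can only put a pair in front of the adjudicator, and a negative verdict such as not_equivalent is recorded like any other.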

7.6 The one thing embeddings must NOT do: collapse

Everything above is ordinary engineering tradeoff. This is not. It is the load-bearing constraint, and it is donto-specific.

Embeddings may CLUSTER and RANK. They must never COLLAPSE or MERGE.

donto's entire differentiator is that it is paraconsistent and evidence-first: it holds incompatible claims forever as legal state, never dedups, never picks a winner, never invalidates-on-conflict. Identity is a hypothesis, not a fact. An over-eager embedding pipeline is the single most natural way to destroy this — because "two things are close in vector space" is exactly the signal a naive system uses to say "these are the same thing, merge them." A MERGE INTO ... WHERE cosine > 0.9 would, in one query, turn donto into the thing it was built to replace: a collapsing store that throws most of the firehose away.

The contradiction-preserving discipline is enforced structurally, and the live schema already encodes it:

Put plainly: the embedding fabric earns its place by making the firehose navigable — by ranking what is near and clustering what is alike — and it keeps donto donto by never once acting on that nearness destructively. Clustering is a view; merging is a deletion. The fabric does the first and is structurally incapable of the second.
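"Clustering is a view; merging is a deletion" can be made concrete with a toy model. In this sketch statements are immutable tuples, alignment lives in separate edge records, and a query widens its predicate set at read time. The AlignmentEdge shape, the retracted flag, and the choice of which relations expand a query are assumptions for illustration, not the live schema.

```python
from dataclasses import dataclass

# Assumption: only these relations widen a query in this sketch.
EXPANDING_RELATIONS = frozenset({"exact_equivalent", "close_match"})


@dataclass
class AlignmentEdge:
    source: str
    target: str
    relation: str
    retracted: bool = False   # reversal flips a flag; nothing is deleted


def expand(predicate, edges):
    """Predicates a query should match, given the currently active edges."""
    out = {predicate}
    for e in edges:
        if e.retracted or e.relation not in EXPANDING_RELATIONS:
            continue
        if e.source == predicate:
            out.add(e.target)
        elif e.target == predicate:
            out.add(e.source)
    return out


def query(statements, predicate, edges):
    """Read-only scan: statements are (subject, predicate, object) tuples."""
    preds = expand(predicate, edges)
    return [s for s in statements if s[1] in preds]
```

Retracting an edge restores the narrower result without touching a single statement, which is the sense in which alignment here is reversible: the stored claims are identical before, during, and after the alignment exists.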

7.7 The honest bottom line

The fabric is, today, a predicate-only fabric at 3.5% coverage, proven on one decisive example (killed → murdered at cosine 0.95 but trigram 0.0667 — an order of magnitude below the 0.30 alignment threshold, so unreachable by the lexical key in practice) and one piece of infrastructure (pgvector + a local provider + two new SQL functions, plugged into sockets the substrate had already cut). The honest qualifier the report carries everywhere applies here too: that example is, for now, almost the entire empirical base — one positive pair plus a single live false-friend cosine (bornIn/diedIn at 0.638). The §6 eval suite exists precisely to convert this from one anecdote into a measured trend. The case for it is not that embeddings are a good idea in general — they are everywhere, and saying so adds nothing. The case is specific and structural: donto's whole bet is to defer the join to query time; a deferred join is only as good as its key; donto's key was lexical, which the vision itself forbids; and the only non-brittle, learned, meaning-bearing key for an open self-minted vocabulary is a vector. Embeddings are therefore not a feature of the alignment engine — they are the missing substrate primitive the deferred-join thesis has needed all along.

The costs are real and bounded: a few GB to embed every key (predicate + entity + context) fully and forever, and an explicit opt-in tier for the 39.5M-statement body where a blanket build would eat the disk. The risks are real and answered: false friends are gated behind LLM adjudication, staleness is answered by one continuous loop, and the one catastrophic failure mode — collapse — is made structurally impossible by recording alignment as a reversible, governance-gated, bitemporal edge that never touches a statement. The proof obligation is laid out in §6 and is donto's to discharge: coverage to ~100% on the keys, alignment precision/recall on a real gold set, fragmentation reduction under a precision floor, and the retroactive time-slice that only a never-forgetting substrate can even run. If those numbers move together, donto will have done the thing it was built to do — hold an unbounded, contradictory, evidence-anchored firehose and, at query time, find what belongs together without ever forcing it together. The embedding fabric is what makes that last clause true.