2026-06-03
donto's entire bet is to emit free / untyped now and defer
joining — typing, alignment, identity-resolution — to query time.
This report argues that a deferred join is only ever as good as the
key you join on, and that donto's join key today is
lexical (trigram similarity plus a
tsvector FTS projection) — precisely the brittle,
surface-form fallback the abundance vision explicitly forbids. The proof
is one predicate: lexically, the neighbours of killed are
only {killedAt, killedBy, killedIn, killedOn};
semantically, the top match is murdered (cosine 0.95). The
trigram similarity of killed→murdered is just
0.0667 — an incidental overlap on the shared
-ed suffix, an order of magnitude below the 0.30 alignment
threshold, so lexical cannot reach it in practice. Genuine zero-overlap
synonyms like slew (trigram 0.0) and
assassinated (0.0556) are structurally unreachable
to any character key, while semantically they sit beside
killed. From this we make the central claim:
embeddings are not a feature bolted onto predicate alignment;
they are the enabling substrate primitive that makes query-time
alignment, and therefore the whole defer-joining vision, actually
work. (Lexical 0.0667 vs semantic 0.95 on
killed→murdered is the contrast in one line.)
Generalized to every join-relevant object — predicate, entity,
statement, span, document, context — this becomes an embedding
fabric: one maintained vector per object, refreshed by one
continuous loop, consulted everywhere a match, rank, or join happens.
The non-negotiable constraint, the line that keeps donto donto:
embeddings may cluster and rank; they must never
collapse or merge. Identity stays a hypothesis;
contradictions stay held; alignment stays non-destructive query-time
expansion. We specify what to embed, how the loop maintains it, how each
consumer reads it, the costs against this box's real disk budget, and
the eval suite that measures whether the fabric finds the joins lexical
keys cannot.
For sixty years the scarce step in every knowledge system was generation — minting a typed fact cost human attention, and so every architecture downstream of it was built to ration. That scarcity is gone. A guided frontier model now emits an essentially unbounded, multi-directional space of properties and relations about any entity — inventing the predicates as it goes — for roughly $0.0001 each. The hard problem flipped. It is no longer "how do we generate enough typed knowledge?" It is "where do we put an unbounded, contradictory, evidence-anchored firehose without throwing most of it away?"
donto's answer is its whole identity: hold everything, paraconsistently, and defer joining to query time. Don't dedup, don't pick a winner, don't invalidate-on-conflict. Anchor every claim to its source, keep incompatible claims forever as legal state, and resolve typing, alignment, and identity later — at the moment a query actually needs them. The live store is the proof that this is happening at scale: 39,560,959 live statements across ~19,931 contexts, minted under 865,834 distinct predicates. That predicate count is not a proliferation bug to be cleaned up. In the frontier-test corpus, roughly 4,995 of ~6,111 predicates were singletons — used exactly once. That long tail is the signature of abundance. A system that forces every emission through a fixed schema would have discarded most of it at the door; donto keeps it and types it on demand.
But here is the twist that this report is about. A deferred
join is only as good as the key you join on. "Defer alignment
to query time" is a beautiful principle right up until a query arrives
and asks: which of these 865,834 predicates mean the same thing as
the one I'm asking about? That question is answered by a join key,
and today donto's join key is lexical — trigram
similarity (donto_suggest_alignments) and an FTS
tsvector projection over humanized IRI segments. Lexical
matching is exactly the brittle, surface-form fallback the abundance
vision explicitly forbids. CLAUDE.md's no-brittle-logic rule is
unambiguous: never resolve a semantic problem with string overlap,
synonym lists, or if/elif ladders. Yet a trigram join
is a string-overlap heuristic wearing a SQL function's clothes.
It can only ever find predicates that share characters.
The cost of that limitation is measurable and stark. Ask the lexical
engine for neighbours of the predicate
killed and it returns
{killed, killedAt, killedBy, killedIn, killedOn} — five
variants that all share the substring killed. It cannot
reach murdered, whose trigram similarity
to killed is a mere 0.0667 (a single
incidental overlap on the -ed suffix, far below the 0.30
alignment threshold), nor genuine zero-overlap synonyms like
slew (0.0) or assassinated (0.0556) — even
though any human (or model) reads them all as the same relation. The
semantic index built in this work returns exactly those:
murdered at 0.95 cosine, with
slew and assassinated in the same
neighbourhood — true synonyms with effectively no lexical overlap
whatsoever. This is the entire abundance thesis compressed into one
example. killedBy ↔︎ assassinatedBy,
rdfType ↔︎ rdf:type, the thousands of self-invented
predicate variants the firehose produces every day — these are precisely
the alignments a character-based key structurally cannot make,
and they are precisely the alignments query-time resolution exists to
make. A lexical join key doesn't just underperform; it makes the wrong
cases impossible by construction.
So the central argument is this:
Embeddings are not a feature bolted onto predicate alignment. They are the enabling substrate primitive that makes query-time alignment — and therefore the entire defer-joining-to-query-time vision — actually work. The vision's load-bearing operation is the deferred join; the deferred join's load-bearing component is its key; and the only non-brittle key available for an open, self-minted, semantic vocabulary is a learned vector. Embeddings are the non-brittle join key abundance needs.
This reframes embeddings from an optimization into a precondition.
donto's three pillars — paraconsistent hold, evidence anchoring,
bitemporal re-ranking — all assume that when a query finally reaches
across the firehose, it can find the claims that belong
together despite never having been forced together at write time.
Lexical keys break that assumption for the exact alignments that matter
most. The state of the alignment machinery confirms how thin the current
foundation is: ~18,500 live predicate alignments and
62,559 rows in donto_predicate_closure (as
of 2026-06-03; the loop is now actively running, so these drift), all of
it lexical-only — and until this work, the alignment engine was dormant
and un-scheduled. The identity layer is barely exercised at all (122
identity edges; the cluster cache empty). The differentiating machinery
exists; it has simply never had a key strong enough to make it
load-bearing.
The fix is not to embed predicates and stop. If a vector is the right join key for predicates, it is the right join key for every object donto defers a join on. That generalization is what this report calls the embedding fabric:
An embedding fabric is a maintained, dense vector attached to every join-relevant object type in the substrate — predicate, entity, statement, span, document, context — kept fresh by one continuous background loop, and consulted everywhere a join, a match, a rank, or a "find-similar" happens. It is a single semantic coordinate system laid over the entire store, so that "things that mean the same thing land near each other" becomes a primitive the substrate offers rather than a trick each consumer reinvents.
Concretely, the seed of this fabric already exists: pgvector 0.8.2 is
installed, ~30,000+ predicates and climbing are
embedded with fastembed BAAI/bge-small-en-v1.5 (384-dim) in
donto_predicate_embedding behind an HNSW
vector_cosine_ops index (30,587 as of 2026-06-03, an
in-progress backfill), and the full ~865,800-predicate registry backfill
is queued. The new donto_suggest_alignments_semantic and
_hybrid SQL functions are the first consumers. The rest of
this report argues that this should not remain a predicate-only
convenience: it should become a substrate-wide primitive, with the same
vector discipline applied to entities (identity-as-hypothesis),
statements and spans (evidence retrieval and contradiction-clustering),
documents, and contexts.
One nuance is non-negotiable, because it is where naive "just use
embeddings" advice would destroy donto's identity. Embeddings
cluster and rank; they must never collapse or merge. Proximity
in vector space is a hypothesis that two predicates, or two
entities, are the same — exactly the status donto already assigns to
identity. Query-time alignment must be non-destructive
expansion (when you ask for killed, also surface
the claims filed under murdered and wasKilled,
with their alignment scores attached), never a write-time rewrite that
fuses them. The firehose stays whole. Contradictions stay legal.
Identity stays a hypothesis the query can lean on or override. The
embedding fabric makes the deferred join possible;
paraconsistency is the constraint that keeps it donto. The
sections that follow specify what to embed, how the continuous loop
maintains it, how each consumer reads it, and how to measure whether it
is actually finding the joins lexical keys cannot.
Section 1 argued why deferring the join to query time is the correct bet in the age of generative abundance: when generation is cheap and unbounded, the only sane place to type, align, resolve identity, and dedup is late, against the accumulated firehose, never at write time. But a deferred join is only ever as good as the key you join on — and today every deferred operation in the live substrate joins on the surface string. The substrate can read characters. It cannot read meaning. That gap is not a missing feature; it is a structural blindness sitting directly underneath the thesis, and this section measures it.
The live numbers make the problem concrete and unavoidable:
All figures as of 2026-06-03; the embedding backfill and the alignment loop are both actively running, so the last rows drift upward.
| Object | Live count | Note |
|---|---|---|
Live statements (upper(tx_time) IS NULL) |
39,560,959 | donto_statement is ~34 GB on disk; a full scan ≈
34s |
| Contexts | ~19,812 | ctx:genealogy + ctx:genes ≈ 98.7% |
| Distinct predicates (registry) | 865,836 | the freely-minted "tail" |
| Singleton predicates (used exactly once) | 733,401 | 84.7% of all predicates appear on a single statement |
| Live predicate alignments | 18,488 | trigram-only; dormant/un-scheduled until this build |
donto_predicate_closure rows |
62,559 | transitive expansion of those alignments |
Identity edges (donto_identity_edge) |
122 | the contradiction/identity machinery, barely used |
| Predicates with an embedding | ~30,587 / 865,836 | ~3.5% and climbing — this build's bootstrap; the rest queued |
The headline is 865,834 distinct predicates, but the sharper number
is 733,401 singletons (84.7%). The substrate is not
holding a few thousand reusable schema terms with a long tail of typos —
it is holding an overwhelmingly one-shot predicate space where
the typical predicate has been minted once and never again. This is
exactly what the vision predicts and welcomes: it is the signature of
LLM extraction inventing predicates as it goes (rdfType,
killedBy, wasAssassinatedBy,
dateOfDeathApproximate, servedAsWitnessAt),
each anchored to evidence, none collapsed. Abundance is working as
designed on the write path.
The problem is entirely on the read path. A predicate space
that is 84.7% singletons is useless to query unless you can
group it by meaning at query time. If a user asks "who killed
whom," the answer is scattered across killed,
wasKilledBy, murdered,
assassinated, putToDeath, slew,
causedDeathOf, and several hundred one-off variants — and
the only mechanism the live substrate has to gather them is to
match the characters in the string. That mechanism cannot see that
murdered and killed are the same question.
The clearest demonstration is a single predicate. Ask the substrate
for the neighbours of killed using the alignment engine as
it existed before this build — donto_suggest_alignments,
which is pure pg_trgm trigram similarity:
lexical neighbours of "killed" → killed, killedAt, killedBy, killedIn, killedOn
Every single neighbour is a substring relative of
killed. The trigram index can only find predicates that
share characters with the probe, so it returns the
morphological family and stops. The query-time alignment that the entire
vision rests on, run lexically, expands killed into five
spellings of the same English root and declares victory.
Now the semantic neighbours, from this build's
donto_suggest_alignments_semantic (fastembed
bge-small-en-v1.5, 384-dim, cosine over the HNSW
index):
semantic neighbours of "killed" → murdered (0.95), wasKilled (0.94), killedOn (0.94), killedAt (0.94), killedIn (0.93) ...
murdered is effectively invisible to lexical
alignment: its trigram similarity to killed is
0.0667. pg_trgm pads each string with spaces, so the
two share exactly one incidental trigram — the trailing ed
from their common -ed suffix — out of
{ k, ki, ed , ill, kil, led, lle} for killed
and { m, mu, der, ed , ere, mur, rde, red, urd} for
murdered. That 0.0667 is more than four times below the
0.30 alignment threshold, so lexical cannot return murdered
at any usable threshold (you would have to drop the cutoff
below ~0.07, where the engine would also surface thousands of unrelated
-ed predicates and become useless). Genuine
zero/near-zero-trigram synonyms make the point even more sharply:
slew shares no trigram with killed
(similarity 0.0) and assassinated only 0.0556 — there is
effectively nothing for the string-distance metric to grab. The
embedding returns all of them near 0.95 because they mean the
same thing. This is the canonical hard case from the project's own rules
— killedBy ↔︎ assassinatedBy,
rdfType ↔︎ rdf:type — and it is the whole report in one
line: the trigram engine and the embedding engine disagree
precisely where meaning and spelling diverge, which is exactly where
alignment matters.
The table below is the entire indictment. Note wasKilled
is not in the "missed" column: its trigram similarity to
killed is 0.4167 (it shares
ill, kil, led, lle, ed ), so lexical alignment at threshold
0.3 does return it. wasKilled is the morphological
family lexical is good at; the synonyms lexical genuinely cannot reach
are the zero/near-zero-trigram ones.
| probe | what lexical returns (trigram sim) | what it misses (semantic, cosine) |
|---|---|---|
killed |
killedAt/killedBy/killedIn/killedOn
(0.70), wasKilled (0.42) |
murdered (trigram 0.0667 / cosine 0.95),
assassinated (0.0556), slew (0.0) |
Lexical is not wrong about killedBy or
wasKilled — those are genuine relatives. It is
blind about murdered, slew,
and assassinated. And blindness, not error, is the failure
mode that scales catastrophically across 733,401 singletons, because
singletons are exactly the predicates with no morphological family to
fall back on.
It is tempting to think donto_predicate_closure (62,559
rows) rescues this — that even if each alignment edge is shallow, the
transitive closure stitches the variants together into rich equivalence
neighbourhoods. It does not, and it cannot, for two compounding
reasons.
First, the closure is only the transitive hull of the edges
it is given. Those 62,559 rows are computed over the 18,487
lexical alignments. If killed → murdered was never an edge
— and it never could be, lexically — then no amount of transitive
expansion invents it.
killed → killedBy → killedOnDate → ... closes over the
trigram family and stays trapped inside it. Closure amplifies whatever
signal the edges carry; fed a string-similarity signal, it produces a
larger string-similarity neighbourhood, not a semantic one. Garbage-in
is not the right phrase — it is blindness-in,
blindness-amplified.
Second, the alignment engine that feeds the closure was
dormant. The SQL machinery
(donto_suggest_alignments, the closure builder) existed but
was un-scheduled — nothing ran it on a loop, so the ~18,500 alignments
and 62,559 closure rows are a stale, partial snapshot over a predicate
space that has grown to ~865,800. Coverage is the killer statistic here:
~18,500 alignment edges against 733,401 singletons means the vast
majority of the predicate tail has no alignment edge at all,
lexical or otherwise. The closure is not just lexically blind; it is
largely empty over precisely the long tail that abundance
produces and that query-time joining must reach. A deferred-join key
that exists for a few percent of keys is not a key.
The substrate-wide /search endpoint is the other place
where a meaning-join is deferred to query time, and it makes the
identical compromise. The index donto_statement_fts_name is
a GIN tsvector over a humanized projection of
subject + object_iri + left(object_lit,120), queried with
plainto_tsquery + ts_rank. It is
well-engineered as lexical search — the
to_tsvector(...) expression matches the index DDL exactly
(including upper(tx_time) IS NULL) so it doesn't seq-scan
39.5M rows, a bounded candidate CTE caps latency, and it returns in
~270–820 ms. None of that is the problem.
The problem is that tsvector is lexical to the bone. It
tokenizes, lowercases, and stems — running → run — but
stemming is morphology, not meaning. A search for killed
will not surface a statement whose object reads
was assassinated; a search for physician will
not find doctor; a search for spouse will not
find wife. Postgres FTS reaches synonyms only through a
hand-maintained synonym or thesaurus
dictionary — which is exactly the static string-list the
project's non-negotiable no-brittle-logic rule forbids. So
/search, like alignment, is structurally capped at the
morphological boundary. Fast, bounded, correct-as-spelled, and blind to
every synonym a human or an LLM would consider obvious.
Step back and the pattern is not five separate gaps — it is one gap, appearing everywhere donto defers a meaning-join to query time. Each query-time-deferred operation needs a notion of "same / related," and in the live substrate that notion is currently implemented as "shares characters."
| Deferred operation | What it must decide | Live mechanism | Blind to |
|---|---|---|---|
| Predicate alignment | is killed the same relation as
murdered? |
trigram (donto_suggest_alignments) |
synonyms with no shared trigrams |
| Entity / identity resolution | is ex:kitty the same person as
ex:kitty-wulbar? |
string/IRI overlap; dozens of identity edges total | co-referent entities with different surface forms |
| Substrate-wide search | which statements answer this query? | tsvector FTS
(donto_statement_fts_name) |
synonymous objects/predicates |
| Contradiction detection | do two claims even talk about the same predicate+entity? | requires aligned keys → inherits lexical alignment | contradictions hidden behind different spellings |
| Dedup / clustering | are these two claims restatements? | no semantic key; cluster cache empty | paraphrased duplicates |
The last two rows are where the cost compounds into the substrate's
differentiator. donto's reason to exist is paraconsistent
contradiction-holding and re-ranking — but a contradiction can
only be detected once you agree the two claims are about the same
predicate and the same entity. If A killed B and
B was murdered by A live under non-aligned predicates and
non-resolved entities, the substrate never notices they are even
comparable, so it never holds them as a contradiction, never builds an
argument edge, never re-ranks by reality. The differentiator is gated
behind the alignment key — and the alignment key is lexical. This is
visible in the data: donto_argument carries ~2,426 edges
and donto_identity_edge only dozens, against 39.5M
statements. The contradiction machinery is barely exercised not because
contradictions are rare in a 33M-statement contested-genealogy corpus —
they are everywhere — but because the substrate can't see them
through the surface strings.
The lexical ceiling is the cost of not having an embedding layer, and it is not an inconvenience at the margin — it is a cap on the core thesis:
killed
to its spelling family and cannot reach murdered (trigram
0.0667) at any usable threshold.Crucially, the forbidden fix and the blindness are the same thing
seen twice. The brittle escape hatch from lexical blindness — a
synonym table, an if/elif predicate-variant map, a
hand-curated thesaurus — is precisely what the no-brittle-logic rule
outlaws, because it does not scale to 865,834 freely-minted predicates
and never will. So the live substrate sits trapped between two walls:
the lexical engine it has cannot see meaning, and the static-list engine
it could bolt on is forbidden (and unmaintainable) by design. The vision
says emit free now, defer the meaning-join to query time — but
the meaning-join, today, has no meaning in it.
The next section names the way out across every place a join is
deferred. There is exactly one join key that is non-brittle (no lists to
maintain), learned (not hand-authored), and meaning-bearing (reaches
murdered from killed despite a trigram
similarity of just 0.0667, and slew despite literally zero
shared characters): the embedding.
The argument so far has been structural: a deferred join is only as
good as its key, and donto's key today is lexical, which the vision
explicitly forbids. This section makes that argument concrete across the
ten places in the substrate where joins, comparisons, and decisions
actually happen. Each one is, on inspection, a similarity
operation wearing a different costume — and each one is, today, served
by a trigram or tsvector fallback that fails precisely
where abundance bites hardest: when two LLM-minted strings mean
the same thing but share no characters.
The discipline throughout is the one the vision insists on: embeddings cluster and rank; they never collapse or merge. The fabric adds a non-brittle key to every comparison in the system without ever destroying a statement, retracting a winner, or hardening a hypothesis into a fact. Identity stays a hypothesis. Contradiction stays held. Alignment stays a query-time expansion, not a write-time deletion.
| # | Aspect | Today (lexical / absent) | With the embedding fabric |
|---|---|---|---|
| 1 | Predicate alignment | donto_suggest_alignments = trigram only; ~18,500 live
alignments, 62,559 closure rows, all from shared substrings |
killed↔︎murdered (trigram 0.0667 / cosine 0.95),
killed↔︎slew (trigram 0 / cosine ~0.9) — synonyms lexical
cannot reach in practice; ~30k+/865,834 predicates embedded, HNSW
recall |
| 2 | Entity / identity resolution | dozens of identity edges, cluster cache empty; resolution by IRI string + label trigram | Vector built from label + statement-signature; nearest-neighbour
candidate generation feeding donto_identity_edge —
still a hypothesis, non-destructive |
| 3 | Semantic + hybrid /search |
GIN tsvector over humanized IRI segments;
plainto_tsquery, seq-scan risk on 39.5M rows |
Vector recall fused with FTS via reciprocal-rank fusion; finds paraphrase, not just lexical overlap |
| 4 | Lens Engine / discovery | Cross-lens step unbuilt; analogy has no operator | Analogy = vector arithmetic; structural similarity = neighbourhood-embedding distance — discovery becomes a native query |
| 5 | Contradiction & corroboration | ~2,426 argument edges over 39.5M statements; conflict found only on exact (s,p,o) collision | Cluster claims about the same proposition despite wording;
populate donto_argument at scale — cluster, don't
collapse |
| 6 | Evidence ↔︎ claim anchoring | ~1.88M evidence links (4.75%); span match by substring | Semantic span match when the claim paraphrases the source and substring search returns nothing |
| 7 | Extraction-time reuse | New predicate/entity minted per run; tail grows unbounded (4,995 singletons in the frontier test) | Semantic retrieval of existing predicates/entities at emit time — reduce fragmentation at the source, by choice not by force |
| 8 | Query-time dedup / collapse | No dedup primitive; either keep all rows or pick a winner (forbidden) | Collapse-as-a-lens: a view clusters near-duplicate objects on read; the base rows are untouched |
| 9 | Emergent typing / ontology induction | 865,834 predicates, no families; typing deferred but never done | Cluster the predicate tail into families — the deferred typing the vision promises, induced not authored |
| 10 | Routing / classification | if/elif over names, hand-maintained lists (the
forbidden pattern) |
Embedding nearest-centroid classifiers; routing learned from data, refreshed by one loop |
The rest of this section takes each row in turn.
This is where the thesis is no longer a claim but a measurement.
donto's alignment engine already exists:
donto_suggest_alignments produces candidate equivalences,
donto_register_alignment records them, and
donto_predicate_closure (62,559 rows) materializes the
transitive expansion consulted at query time. There are ~18,500 live
alignments today (18,488 as of 2026-06-03, and now drifting as the loop
runs). Every one of them was found by trigram similarity — that is, by
shared substrings. That is exactly why the registry's
kill* neighbourhood aligns cleanly into
{killed, killedAt, killedBy, killedIn, killedOn} and stops
there. Trigram similarity is, by construction, blind to any synonym that
does not share characters.
The fabric closes that blindness. With ~30,000+ of 865,834 predicates
embedded into donto_predicate_embedding (384-dim
bge-small-en-v1.5, HNSW vector_cosine_ops
index) and the new donto_suggest_alignments_semantic /
donto_suggest_alignments_hybrid functions, the top semantic
neighbour of killed is murdered
(0.95) — a true synonym whose trigram similarity to
killed is only 0.0667, far below any
usable alignment threshold, so trigram search cannot reach it in
practice. Zero-trigram synonyms like slew (0.0) and
assassinated (0.0556) are reachable only
semantically. This is the canonical
killedBy ↔︎ assassinatedBy problem from the no-brittle-logic
rule, solved by similarity rather than by a hand-maintained synonym
table. The hybrid function is the right default: lexical for
morphological variants (killedBy/killedAt),
semantic for paraphrase
(murdered/assassinated), fused so neither
blind spot survives.
The scale stakes are specific to donto's corpus. The
kill* neighbourhood in the live registry is not five tidy
predicates — it includes event-narrative monsters like
100NativePeopleKilled,
11SettlersKilledLargestLossPowellFirst8Killed70Total3UnarmedWomenIndiscriminate,
and 15AboriginalsKilledIncludingBaulie. These are the 4,995
singletons of the frontier test made flesh: each is a once-minted,
sentence-length predicate that lexical alignment will never
connect to the generic killed, because the shared trigrams
are drowned by the surrounding tokens. Embeddings place all of them in
the same semantic region as killed/murdered,
which is the only mechanism that can ever fold the abundance tail back
toward a usable join key.
Before:
killedaligns to{killedAt, killedBy, killedIn, killedOn, wasKilled}— its spelling family. A query for "who killed whom" misses every statement minted undermurdered,slew,assassinated, or any narrative variant that shares no usable substring. After: the same query, expanded through a hybrid closure, reachesmurdered,assassinated, and the long narrative tail — without a single synonym ever being typed by hand, and without retracting or rewriting any of the original predicates.
Identity is donto's most underexercised differentiator: dozens of
donto_identity_edge rows against tens of millions of
distinct subjects, and an empty cluster cache. The reason is the same as
everywhere else — the only cheap candidate-generation key today is the
IRI string and a trigram on the label, which means ex:kitty
(a known junk-drawer URI), ex:kitty-wulbar, and a freshly
minted ex:kitty-munro are only as joinable as their
spelling. In genealogy, where the entire problem is "are these two
attestations the same person under different names," a spelling-based
key is the wrong tool by definition.
The fabric gives every entity a vector built from two signals: its
label (the surface name) and its
statement-signature — a pooled embedding of the
predicates and objects attached to it (born-in, married-to, child-of,
occupation). Two entities are identity candidates when they are
near in this combined space, which captures "same name, same shape of
life" even when the names diverge
(Mahamoodally/Mamode Ally,
Lablanche/Lablache). Nearest-neighbour search
over those vectors becomes the candidate generator that feeds — not
replaces — donto_identity_edge.
The non-negotiable nuance: this stays a hypothesis.
The fabric proposes; it does not merge. An identity edge is a
bitemporal, retractable assertion that two URIs may co-refer,
scored and dated, exactly as the substrate already models it. Embeddings
raise the recall of candidate generation from "names that look alike" to
"people whose whole evidentiary footprint rhymes," but the destructive
act — collapsing two entities into one — never happens. Paraconsistency
requires that ex:kitty-as-Brady-2013 and
ex:kitty-as-EKY-2026 can be held as
possibly-distinct even while an identity edge hypothesizes they are
one. The vector is the matchmaker, not the registrar.
Before: dozens of identity edges; candidate pairs surfaced only when labels share trigrams; co-referent people under variant spellings never even nominated. After: every entity carries a label+signature vector; nearest-neighbour scan nominates co-reference candidates across spelling boundaries, each written as a dated, retractable
donto_identity_edge— recall goes up, destruction stays at zero.
/search — fuse vector recall with
FTSThe substrate-wide /search (donto-memory,
POST /search) is the most-used read path and the clearest
case of the lexical ceiling. It runs plainto_tsquery +
ts_rank against donto_statement_fts_name, a
GIN tsvector over a humanized projection of subject +
object IRI + literal prefix, with a bounded candidate CTE to keep
latency under a second across 39.5M rows. It is fast and it works — for
terms that appear. A search for "homicide" returns nothing from
statements minted under killed; a search for "spouse"
misses marriedTo. FTS retrieves lexical overlap, never
meaning.
The fabric adds a second retriever — vector recall over statement (or
object/span) embeddings — and fuses the two with reciprocal-rank
fusion rather than choosing between them. RRF is the right join
because it needs no score calibration between an incomparable
ts_rank and a cosine distance: each retriever contributes
$\frac{1}{k + \text{rank}}$, the lists merge, and a result that ranks
well in either surfaces. FTS keeps its precision on exact
tokens and rare proper nouns (where embeddings are weakest); vectors add
the paraphrase recall FTS structurally cannot have. The
partial:true / 9s-timeout discipline already in
search.rs carries over unchanged; the vector leg is just a
second bounded candidate source feeding the same fusion step. (Section
4.3 specifies RRF in full; it is the query-side instance of the same
ensemble that aligns predicates.)
This also fixes the index-fragility footgun in one stroke: the
existing FTS to_tsvector(...) expression must match the
index DDL exactly (including
upper(tx_time) IS NULL) or it silently seq-scans 39M rows.
An HNSW vector index has no such hand-matched-expression hazard — the
recall path is the index, full stop — so the semantic leg is both more
capable and less brittle to operate.
The Lens Engine's discovery vision (relationships emerging at lens
intersections, analogy, structural similarity) has a missing piece: the
cross-lens step is unbuilt, and analogy has no operator. That is not an
accident — analogy and structural similarity are inherently
vector operations, and the substrate had no vectors. You cannot ask
"what is to context A as X is to context B" with tsquery;
you cannot rank entities by "structural role similarity" with a trigram.
The lens vision was waiting for the fabric whether or not anyone named
it.
With embeddings present, the core discovery moves become native queries. Analogy is offset arithmetic in entity space (the $a:b :: c:?$ pattern as nearest-neighbour to $\mathrm{vec}(b) - \mathrm{vec}(a) + \mathrm{vec}(c)$). Structural similarity is distance between neighbourhood embeddings — two people who occupy the same shape of relations (same predicate profile, same kinds of neighbours) are near even if they share no literal facts. Lens intersection becomes "objects near the centroid of two lens-defined regions at once." These are exactly the operations FCA, structure-mapping, and bisociation describe in the lens lineage — and they reduce, in an embedded substrate, to vector neighbourhood queries the substrate can already index with HNSW.
The lens lineage also warns that the moat is the verifier, not the lenses — and the fabric respects that division cleanly. Embeddings generate candidates (this analogy is plausible, these two roles are structurally alike); they do not assert the discovered relationship. A vector-proposed analogy enters the substrate as a candidate claim with an argument edge, to be corroborated or contradicted by evidence like any other. The fabric makes the lenses cheap and the cross-lens step possible; verification stays the load-bearing, paraconsistent core.
donto's headline differentiator is paraconsistency: hold incompatible
claims forever, re-rank by reality, never pick a winner. Yet the
machinery that finds the incompatibilities is barely populated
— ~2,426 donto_argument edges against 39.5M statements. The
reason is mechanical: a contradiction is only detected today when two
statements collide on an exact (subject, predicate, object)
shape. But two claims that contradict each other in an abundance corpus
rarely share that shape. "Caroline's mother was Kitty" and "Jessie
Buchanan could not possibly be the GM's mother" are about the same
proposition and they conflict — but they share neither predicate
nor object string, so the substrate never notices.
The fabric supplies the missing primitive: cluster claims by the
proposition they express, using embeddings of the (subject,
predicate, object) triple as text, so that semantically-equivalent and
semantically-opposed claims land near each other regardless of wording.
Within a tight cluster you get corroboration candidates
(many independent sources, different phrasings, same assertion — exactly
the signal bitemporal re-ranking needs). Across the polarity of a
cluster you get contradiction candidates to write as
donto_argument edges — finally populating the
differentiator at scale instead of by hand. This is the engine that
turns ~2,426 argument edges into a number proportional to a
39.5M-statement corpus.
The paraconsistency discipline is absolute here and bears restating: cluster, don't collapse. A corroboration cluster does not dedup into one statement; it remains N held statements with a shared cluster id and N evidence trails. A contradiction cluster does not invalidate either side; it writes an argument edge and lets reality re-rank over bitemporal time. The fabric is what lets donto exercise its own thesis — it finds the contradictions abundance generates so the substrate can hold them, rather than leaving them undetected because no two LLM phrasings ever collided on a string.
Before: ~2,426 argument edges; contradiction detected only on exact triple collision; semantically-opposed claims with different wording sit unrelated and unnoticed. After: propositions clustered by triple-embedding; corroboration and contradiction candidates surfaced across wording differences;
donto_argumentpopulated at corpus scale — every cluster held intact, nothing merged, nothing invalidated.
Only 4.75% of statements carry an evidence link (~1.88M of 39.56M), and the genealogy work repeatedly hit the failure mode behind that number: a fact is true to its source, but the span — the exact text snippet that supports it — can't be located because the claim paraphrases the document rather than quoting it. Span anchoring today is substring search: find where the object literal appears in the revision body. When the extractor writes "died in 1898" and the source says "passed away in the year eighteen ninety-eight," substring returns nothing, and the statement ends up in the un-anchored 95%.
The fabric makes span anchoring a semantic match. Embed candidate
spans from the revision body, embed the claim, and take the nearest span
as the evidence anchor when substring search comes up empty. This
directly attacks the evidence-link sparsity that the genealogy notes
flagged as a Trust Kernel testbed problem (e.g., only 3 of ~80
Caroline-line kinship triples carry evidence links). The substrate's
native evidence model —
fact → evidence_link → span → revision → blob — is
unchanged; the fabric only improves the span-finding step from
"exact string present" to "this passage means this claim."
This is the most direct payoff against donto's one hard rule and its evidence-first identity. Every fact is supposed to be defensible by a retrievable snippet plus a link to the full resource. Substring matching quietly drops that guarantee for any paraphrased extraction; semantic span matching restores it, and does so without inventing evidence — the span is real text from a real stored revision, merely found by meaning rather than by character match.
Before: span anchoring by substring; paraphrased extractions (95% of statements) left without an evidence link; evidence model intact but mostly empty. After: semantic span match as the fallback when substring fails; paraphrased claims anchored to the real passage that supports them; the 4.75% coverage figure becomes movable for the first time.
Every aspect above cleans up fragmentation after it happens.
This one prevents a share of it from happening at all. Today each
extraction run mints predicates and entities freely — the right default,
and the source of the 865,834-predicate registry that is the signature
of abundance, not a bug. But "emit free" and "emit blindly" are
not the same thing. When the extractor needs a predicate for "was killed
by," nothing tells it that killedBy,
murderedBy, and wasKilledBy already exist, so
it mints a fourth. Much of the singleton tail (4,995 of ~6,111 in the
frontier test) is not genuinely novel meaning — it is the same meaning
re-minted because the extractor couldn't see what was already there.
The fabric offers the extractor a semantic lookup at emit time:
before minting, embed the proposed predicate/entity and retrieve the
nearest existing ones, surfacing "you may mean killedBy
(0.94)" as a suggestion. Reuse when it fits; mint when it
genuinely doesn't. This reduces fragmentation at the source — fewer
accidental singletons, a denser and more joinable registry — without
ever capping invention. The two-layer contract holds: maximize at
extraction, but make the maximization informed rather than
amnesiac.
The line that must not be crossed: this is suggestion, never enforcement. This is the one aspect that, mis-implemented, pushes back toward write-time schema-fixing — an over-helpful emit-time suggester is a soft collapse-at-the-source, the very thing "emit free / untyped now" exists to prevent. The vision forbids collapsing the emit step into a fixed vocabulary; an extractor that cannot mint a new predicate is no longer abundance-native. So the guard is not just "the suggestion is declinable" — it is a measurable obligation: reuse-suggestion must never reduce the recall of genuinely-novel predicate minting. If the suggester ever nudges an extractor toward a near-but-wrong existing predicate when the right move was a new one, it has quietly narrowed the firehose, and that is a regression to be caught by eval (a minting-recall metric on a gold set of genuinely-novel cases), not a feature. Tuned correctly, the fabric reduces accidental fragmentation (re-minting an existing meaning) while preserving every bit of intentional novelty (the predicate that really is new). It is the cheapest fragmentation win available — a predicate not minted needs no later alignment at all — but only so long as it never costs a single legitimate new predicate.
donto faces a real tension: the substrate must never dedup (paraconsistency, I3 no-destructive-overwrite), yet a human or downstream consumer reading 40 near-identical corroborating statements wants them folded. The resolution the vision implies is collapse as a lens — a read-time view, not a write-time mutation. But a collapse view needs a similarity key to decide what counts as "near-identical," and a lexical key collapses only verbatim duplicates while leaving every paraphrase un-folded.
The fabric supplies that key. A collapse lens clusters statements (or entities, or predicates) by embedding proximity above a threshold and presents one representative per cluster on read, with the cluster's members and their evidence trails one click away. The base table is untouched: all N statements remain live, bitemporal, and individually retrievable; the "collapse" exists only in the projection the query chose. Turn the lens off and the full firehose returns. This is dedup that respects I3 absolutely — nothing is overwritten, nothing is retracted, the merge lives entirely in the view layer.
This reframes the volume debate the claim-substrate report settled. "Maximize at extraction, gate at promotion" gets a third companion: collapse at presentation. The substrate holds everything; the query decides how much of it to fold for this reader, for this purpose, reversibly. Collapse-as-a-lens is the consumer-facing relief valve that lets donto keep its hoard-everything core while still handing a memory consumer or a UI a clean, de-duplicated surface when that is what the task wants.
The vision's central design move is "emit free / untyped now, defer typing to query time." donto does the first half well — 865,834 predicates, no imposed schema — but the second half has never actually happened. Typing is deferred, but it is not done; the predicate tail sits un-grouped, and "defer to query time" is only a promise until something at query time can produce the types. Without a similarity key, the only way to induce a type system over 865K freely-minted predicates would be to author one by hand, which is the forbidden pattern at maximum scale.
The fabric makes induced typing a clustering job. Embed the
predicates (~30,000+ done and climbing, the rest queued) and cluster the
embedding space: the kill/death/violence predicates — from the clean
killed to the sentence-long
100NativePeopleKilled — fall into one family; kinship
predicates into another; occupation, place, date into theirs. Each
cluster is an induced type, with a centroid that names the
family and a membership that spans every wording variant the LLM ever
minted. This is precisely the deferred typing the vision wants: not a
schema authored up front, but families induced from the data
after the fact, refreshable as the corpus grows.
The discipline is the cluster-not-collapse one (§7.6): induced families are an overlay, not a rewrite. A predicate's membership in the "violence" family is an alignment/closure relationship and a soft type — not a destructive recanonicalization of the IRI. The 865K predicates remain individually addressable; the type system is a lens over them, consulted at query time, re-derivable when embeddings refresh. This is the move that finally turns "abundance is a feature" from a defensive slogan into an operational claim: the proliferation is fine because the fabric can fold it into a navigable family structure on demand.
The last aspect is where the fabric most directly enforces the
no-brittle-logic rule. The substrate and its consumers are full of
latent classification decisions: which lens applies to a document, which
context a statement belongs in, whether a span is relevant, how to route
an extraction. The tempting implementation for every one of these is an
if/elif ladder over names or a hand-maintained list of
strings — the exact anti-pattern the project bans, because such lists
rot the moment the data shifts and they are blind to anything not
enumerated.
The fabric replaces those decisions with nearest-centroid
classification. Build a small set of labeled exemplars per class, embed
them, take centroids, and classify a new object by cosine distance to
the nearest centroid. There is no string list to maintain: when the data
drifts, you add exemplars and re-embed, you do not edit a ladder of
conditionals. A document about a massacre routes to the violence/event
lens because its embedding is near that centroid — not because someone
added "massacre" to a keyword array. This is the same
primitive as §3.9 (a class is a cluster with a label) applied to
routing.
The payoff is both correctness and maintenance. Embedding classifiers degrade gracefully on novel inputs (they return the nearest class with a distance you can threshold for "none of the above"), they are auditable (the matched exemplars explain the decision), and they are refreshed by the same continuous loop that maintains every other vector in the fabric. One embedding loop, consulted everywhere — for alignment, identity, search, discovery, clustering, anchoring, reuse, collapse, typing, and routing — is the operational meaning of "embedding fabric." It is not ten features; it is one maintained primitive that every brittle decision in the substrate can delegate to.
Section 3 made the case that the join key must be an embedding. That is true, but it is not the whole truth, and stating it baldly would reproduce exactly the error this report is trying to correct — swapping one brittle monoculture (lexical) for another (semantic). The fabric is not "embeddings instead of trigrams." It is an ensemble of three signals, each of which is individually insufficient and demonstrably blind in a specific, characterizable way. The art is in how they compose: cheap signals generate recall, an expensive signal supplies precision, governance decides what is allowed to act, and closure plus query-time expansion turn the result into something the substrate can actually join on. This section walks the three signals, shows where each fails on real donto predicates, and then specifies the pipeline that stitches them together — including the query-side instance of the same ensemble, hybrid search via reciprocal rank fusion.
donto already carries all three signals in some form. The point of this section is that none of them is load-bearing alone.
Lexical (trigram + FTS). This is what the alignment
engine has had all along:
donto_suggest_alignments(source, min_similarity, limit)
over pg_trgm, plus the GIN tsvector index
donto_statement_fts_name powering substrate-wide
/search. It is essentially free — a GiST/GIN index lookup,
no model, no GPU, microseconds per probe — and it is genuinely good at
one thing: syntactic variants of the same surface
string. rdfType ↔︎ rdf-type ↔︎ rdf:type,
birthPlace ↔︎ birth_place ↔︎ birthplace, casing, separators,
pluralization, stemming. Across donto's 865,834 distinct predicates — a
tail in which ~4,995 of ~6,111 frontier-test predicates were singletons
— a very large fraction of the "alignments" we need are nothing more
than the same human concept spelled five ways by five extraction runs.
Trigram catches every one of those for nothing.
Its blind spot is total and structural: it cannot see
meaning. Two predicates that share no substring are invisible
to it, no matter how synonymous. This is not a tuning problem; it is
what trigram similarity is. The live proof is the canonical
example. For the predicate killed, the lexical neighbours
returned by donto_suggest_alignments('killed', 0.3, …)
are:
| target_iri | trigram sim |
|---|---|
game:killed / game:game:killed |
1.000 |
killedAt |
0.700 |
killedBy |
0.700 |
killedIn |
0.700 |
killedOn |
0.700 |
allKilled |
0.636 |
(The game:killed / game:game:killed rows at
1.000 are themselves an abundance/identity artifact — the same predicate
minted under duplicated namespaces — and they occupy the top of
both the lexical and the semantic result sets; the analysis
below is about the substantive neighbours beneath them.) Every
substantive hit shares the substring killed. The true
synonym murdered — the one alignment a human would reach
for first — is nowhere, and trigram cannot reach it in
practice: its similarity to killed is
0.0667, an incidental brush on the shared
-ed suffix, far below the 0.30 cutoff this very call uses.
Drop the cutoff low enough to catch it and you drown in thousands of
unrelated -ed predicates. This is the failure mode the
no-brittle-logic rule names directly: lexical matching is the
hand-maintained-synonym-list problem wearing an index, and it silently
drops the abundance tail's most valuable joins.
Semantic (embeddings). This is what this build
added: ~30,000+ of ~865,800 predicates embedded (full population is a
queued background job) with fastembed
BAAI/bge-small-en-v1.5 at 384 dimensions, stored in
donto_predicate_embedding under an HNSW
vector_cosine_ops index, queried by
donto_suggest_alignments_semantic(iri, threshold, limit).
Cosine over learned vectors is exactly the inverse of trigram: it is
blind to surface form and sighted to meaning. The
actual top-k for killed (representative snapshot, as of
2026-06-03; the three game:*:killed namespace duplicates at
1.000 are elided) is:
| target_iri | cosine sim | trigram sim |
|---|---|---|
murdered |
0.947 | 0.0667 |
wasKilled |
0.942 | 0.417 |
killedOn |
0.940 | 0.700 |
killedAt |
0.939 | 0.700 |
killedIn |
0.933 | 0.700 |
wasKilledBy |
0.930 | — |
killedBy |
0.930 | 0.700 |
killedPerson |
0.928 | — |
murdered at 0.947 cosine but only 0.0667 trigram is the
entire thesis in one row: the only neighbour the embedding finds that
lexical cannot, sitting above every morphological variant
lexical already had. The killedBy ↔︎ assassinatedBy class of
join — synonyms that diverge in spelling — is solved here and only
here.
But look harder at that same table, because it also exposes the
embedding's blind spot. killedOn, killedAt,
killedIn, killedBy, and
wasKilledBy sit at essentially the same cosine (0.93–0.94)
as the true synonym murdered. Yet several of them are
not the same relation as killed.
killed and murdered are roughly
exact_equivalent; killed and
killedBy/wasKilledBy are
direction-flipped — they are
inverse_equivalent (the subject and object swap roles),
even though cosine ranks them right alongside the synonym. Collapsing
X killed Y into X killedBy Y would silently
reverse the meaning of every statement that joins through it. Embeddings
put synonyms, inverses, hypernyms, and false friends
all in the same neighbourhood, because a 384-dimensional sentence
embedding of a short predicate label encodes topic, not
argument structure. Cosine cannot tell you the relation type,
and it cannot tell you that bornIn (place) and
diedIn (place) — which sit at cosine 0.638
in the live embedding table because both are about life-events at a
place — are not interchangeable, or that partner (business)
and partner (romantic) are false friends despite identical
strings and high cosine. Semantic recall is wide and
meaning-aware; it is also relation-blind and false-friend-prone.
LLM adjudication. The third signal is the only one
that can read usage and direction. Where trigram judges
strings and embeddings judge topical proximity, an LLM judge is handed
the two predicates, representative subject/object pairs drawn from
donto_statement, and the evidence spans, and asked to
classify the relation between them into donto's existing
alignment vocabulary. That vocabulary is not invented for this report;
it is the live relation enum on
donto_predicate_alignment, the same column the engine
already writes:
exact_equivalent · inverse_equivalent ·
sub_property_of · close_match ·
decomposition · not_equivalent (plus the
legacy/import aliases exact_match, inverse_of,
narrow_match, broad_match,
incompatible_with, …).
The current ~18,500 live alignments (as of 2026-06-03; the loop is
actively running, so these counts drift) are almost entirely
close_match (18,080) with only 305
exact_equivalent, 53 inverse_equivalent, 26
sub_property_of, 12 decomposition, and 12
not_equivalent — because they were produced by the
lexical-only engine, which can assert "these strings
look alike" (close_match) but cannot justify a stronger or
directional claim. That distribution is the signature of an engine
running on one signal. It is precisely the
inverse_equivalent/sub_property_of/not_equivalent
distinctions — the ones that change query correctness, not just
recall — that neither trigram nor cosine can produce and that LLM
adjudication exists to supply.
The LLM is also the only signal that resolves false friends, because
resolving them requires reading actual usage:
partner in ctx:genealogy co-occurs with
marriage and parentage; partner in a jobs/resume context
co-occurs with firms and equity. Same string, same embedding, opposite
meaning — separable only by looking at the statements each predicate
actually appears in. That is a judgement task, not a similarity
computation, and the no-brittle-logic rule's permitted toolset names it
explicitly: "LLM judgement" is a first-class technique, not a
fallback.
The three blind spots, side by side:
| Signal | Cost / probe | Catches | Structurally blind to |
|---|---|---|---|
| Lexical (trigram/FTS) | ~free (index) | syntactic variants (rdfType↔︎rdf-type, casing,
separators, stemming) |
meaning; any low-overlap synonym (killed↔︎murdered at
trigram 0.0667; killed↔︎slew at 0) |
| Semantic (HNSW cosine) | ~ms (vector index) | synonyms, paraphrase, cross-spelling (killed↔︎murdered,
killedBy↔︎assassinatedBy) |
direction (killed↔︎killedBy look identical); false
friends; relation type |
| LLM adjudication | ~$/call, seconds | relation type
(exact/inverse/sub_property/not_equivalent);
usage-disambiguated false friends |
scale — too expensive to run on every pair |
The complementarity is exact. Lexical's blind spot (meaning) is semantic's strength. Semantic's blind spot (direction, false friends, relation type) is the LLM's strength. The LLM's blind spot (cost at scale) is precisely what the two cheap signals fix, by shrinking the candidate set the LLM ever has to look at. No two of the three cover all three failure modes; you need the full ensemble.
Composing the three is a recall/precision cascade. Each stage is calibrated so that the expensive signal only ever sees candidates the cheap signals could not resolve on their own.
Stage 1 — cheap recall (lexical ∪ semantic candidate
generation). For a source predicate, take the
union of trigram neighbours
(donto_suggest_alignments, threshold ~0.30) and HNSW
neighbours (donto_suggest_alignments_semantic, threshold
~0.80). Union, not intersection: the whole point is that each signal
contributes candidates the other is blind to — lexical contributes
killedBy (shared string), semantic contributes
murdered (shared meaning), and the candidates that
both signals return (e.g. killedBy, which
is lexically and semantically close to killed) are
exactly the high-confidence ones. The combined ranker
donto_suggest_alignments_hybrid(source, min_score, limit, semantic_weight)
already implements the fused score; the live pg_proc
actually carries five suggest functions —
donto_suggest_alignments, _semantic, two
overloads of _hybrid (one keyed by iri, one by
source with a semantic_weight default of
0.70), and a _hybrid_fast variant — which is a clear sign
this layer is mid-build and should be consolidated to a single
_hybrid contract (folding the _fast path in as
a query-planner option, not a separate function). The output of Stage 1
is a candidate list of perhaps 10–30 predicates per source, with each
candidate tagged by which signals proposed it and at what
score.
Stage 2 — cheap resolution where the signals already
agree. Most candidates never need an LLM. When lexical and
semantic both score a pair very high and the surface forms are
trivial variants (rdfType/rdf-type), the
engine can write the alignment directly with relation
exact_equivalent or close_match at high
confidence — these are the cases trigram was always right about. This
stage is what keeps the LLM bill bounded: the 865K-predicate tail is
dominated by spelling variants, and those should be resolved for
free.
Stage 3 — expensive precision (LLM adjudication on the
borderline). Route to the LLM only the
candidates that are genuinely ambiguous: high cosine but low lexical
(possible synonym or possible false friend — the engine cannot
tell which), or candidates where direction is in question
(killed vs killedBy both surfaced, both near
0.93). The judge receives the two predicates, sampled
(subject, object) pairs and evidence spans from
donto_statement for each, and returns a relation drawn from
the live enum plus a confidence and a short rationale. This is where
inverse_equivalent gets correctly assigned to
killed↔︎killedBy, where partner(genealogy) and
partner(jobs) get split as not_equivalent, and
where bornIn vs bornOn is caught. The
rationale and the sampled spans are stored, so the alignment is itself
evidence-anchored — evidence_anchor_ids is a real column on
the table and should be populated here, making the alignment auditable
in the same way every donto statement is.
Stage 4 — governance (review_status, the safe_for_
flags).* This is the stage that protects donto's identity, and it
is already modeled in the schema. Every alignment carries
review_status ∈ {candidate, accepted, rejected, superseded}
and three independent capability flags:
safe_for_query_expansion, safe_for_export,
safe_for_logical_inference. The graduation is deliberate
and asymmetric. A candidate close_match from cheap signals
can be safe_for_query_expansion = true (it is fine to
widen a search with it — the worst case is a few extra recalled
rows the caller can ignore) while staying
safe_for_logical_inference = false and
safe_for_export = false (it must not be
used to derive new facts or be shipped as ground truth). LLM-adjudicated
exact_equivalent alignments with strong rationale and human
sign-off can graduate review_status → accepted and earn the
stronger flags. Critically: all ~18,500 current alignments are
review_status = candidate — the engine has never
promoted one, which tells us governance is wired but the loop that
exercises it has not run. The flags are the difference between "the
fabric helps you find more" and "the fabric silently rewrites your
knowledge"; they are how aggressive recall and conservative truth
coexist in one table.
Stage 5 — closure. Accepted/safe alignments compose
transitively into donto_predicate_closure (currently 62,559
rows over the ~18,500 base alignments — a ~3.4× fan-out), so that
A exact_equivalent B and B exact_equivalent C
make A–C reachable without a third pairwise
call. Closure must be relation-aware, which is the
whole reason Stage 3's typing matters: exact_equivalent is
transitive and symmetric and closes freely; sub_property_of
is transitive but directional (closes one way);
inverse_equivalent flips argument order on traversal;
not_equivalent and close_match must
not transitively close at all (chaining "roughly
similar" links is how you drift from killed to something
unrelated in four hops). A closure that ignored relation type would
manufacture false joins — which is exactly the failure a lexical-only
closure is prone to, since it only has close_match to work
with.
Stage 6 — query-time expansion. This is the payoff
and the line donto must not cross. At query time, a probe on predicate
P expands to $P \cup \mathrm{closure}(P \mid
\texttt{safe_for_query_expansion})$ and the join runs over the expanded
set. The expansion is non-destructive: the underlying
statements keep their original, freely-minted predicates forever;
nothing is merged, deduped, or overwritten; identity stays a hypothesis.
The fabric ranks and clusters the join candidates; it never collapses
them. This is the concrete mechanism behind "emit free / untyped now,
defer joining to query time" — the deferral is real precisely because
the join key (the embedding-anchored, LLM-typed, governance-gated
closure) is now good enough to defer to.
The cascade above aligns predicates (a maintenance-time
job). But the identical ensemble logic governs the query-time
retrieval the fabric is meant to power, and it would be a mistake to
treat them as two separate systems. Substrate-wide /search
today is lexical-only: plainto_tsquery +
ts_rank over the donto_statement_fts_name GIN
index, with a bounded candidate CTE (LIMIT 2000) to cap
latency at ~270–820 ms across 39.5M statements. That inherits the exact
blind spot from §4.1 — a query for "homicide victim" will never retrieve
a statement whose predicate is killed or
murdered, because FTS matches words, not meaning.
The fix is hybrid search: run the FTS query and a
vector-kNN query (statement/span embeddings, the same HNSW machinery,
once the embedding fabric extends past predicates to spans and
statements) in parallel, then fuse the two ranked lists
with Reciprocal Rank Fusion. RRF scores each candidate
by $\sum_i \frac{1}{k + \text{rank}_i}$ over the lists it appears in
(typically $k=60$), which has three properties that make it the right
fusion operator here: it needs no score calibration between the
incommensurable ts_rank and cosine scales (it consumes
ranks, not raw scores); it is robust to one retriever returning garbage
(a low rank contributes a small term, it cannot dominate); and it
rewards candidates that both retrievers surface —
precisely the "lexical ∩ semantic = high confidence" intuition from
Stage 1, now at query time. A statement that is both a lexical match
and a semantic match floats to the top; one that only one
retriever found still appears, lower. This is the
recall-union/precision-rerank pattern of the maintenance pipeline,
compressed into a single request and a single fusion formula.
So hybrid-search-by-RRF is not a different idea bolted onto alignment; it is the query-side instance of the same ensemble. The maintenance loop and the query path draw on one embedding fabric, one set of cheap-recall signals, and the same conviction: each technique alone is insufficient, and the system that ships is the one that runs all three and fuses them — cheaply where they agree, expensively where they don't, and never destructively.
This section describes how the embedding fabric is actually built on
donto-pg today, what is already running, and the
production-grade knobs that govern it as it scales from the current
~30,000+ embedded predicates to the full ~865,800-predicate registry
and, eventually, to entities, statements, spans, documents, and
contexts. The design intent throughout is the one stated in §1–§4:
embeddings are the non-brittle join key for query-time
alignment, and they must cluster and rank without ever
collapsing or merging — paraconsistency stays intact, identity
stays a hypothesis.
A point worth stating up front, because it changes how to read everything below: the substrate was already scaffolded for this. pgvector and a working embedding source were the only missing pieces — not the schema, not the alignment plumbing, not the CLI surface. Three pieces of pre-existing evidence:
apps/dontosrv/src/alignment.rs already takes an
optional embedding on the alignment-register path —
pub embedding_model: Option<String> and
pub embedding: Option<Vec<f32>> — and a
required embedding: Vec<f32> on the embedding-store
path. The Rust API never assumed lexical-only; it assumed a vector would
eventually be supplied.packages/sql/migrations/0060_identity_edge.sql declares
the identity-edge method column with the literal allowed
values trigram, embedding, neural, human, import, rule.
embedding and neural identity were always
legal — there was simply nothing producing them.derive_embeddings is a first-class capability in the
Trust Kernel policy surface (it appears in
packages/donto-query/src/evaluator.rs,
packages/client-ts/src/policy.ts, the DontoQL grammar, and
the PRD's capability matrix), and derive_embeddings is a
documented donto CLI subcommand.So this build did not bolt a new subsystem onto donto. It populated a socket the substrate had been carrying empty.
The substrate runs Postgres 16 in the donto-pg container
with pgvector 0.8.2 and pg_trgm 1.6
both installed (confirmed live via pg_extension). Keeping
vectors inside the same Postgres that holds the 39.56M
statements is deliberate and load-bearing for the vision: alignment is a
query-time join, and a join is cheapest and most
consistent when both sides live in one transactional store. There is no
external vector service to keep in sync, no dual-write consistency
problem, no separate backup story — the embedding fabric is backed up by
the same pg_dump that backs up the substrate, and it
participates in the same MVCC/bitemporal world as everything else.
The fabric is per-object-type, not one giant table. The live table today is:
Table "public.donto_predicate_embedding"
Column | Type | Nullable | Default
------------+--------------------------+----------+---------
iri | text | not null |
embedding | vector(384) | not null |
model | text | not null |
updated_at | timestamp with time zone | not null | now()
Indexes:
"donto_predicate_embedding_pkey" PRIMARY KEY, btree (iri)
"donto_predicate_embedding_hnsw" hnsw (embedding vector_cosine_ops)
Five design decisions are encoded in that small table, and they are the template every other layer copies:
The text projection is the load-bearing, under-specified
step. bge-small-en-v1.5 is an English
sentence model, and a predicate IRI is a camelCase identifier, not
a sentence. What actually gets embedded is therefore a humanized
projection of the IRI, not the raw string: strip the namespace
prefix, split camelCase / snake_case / separators into words, lowercase,
and (where available) append a short usage descriptor sampled from the
predicate's statements (representative object types, a label if one
exists). So killed is embedded as roughly "killed",
wasKilledBy as "was killed by", and the sentence-length
tail monster 100NativePeopleKilled as "100 native people
killed" — which is exactly why it lands near
killed/murdered rather than near numeric
predicates. This matters for honesty about the 0.95 scores:
killed/murdered are the easy case
(the split tokens are dictionary words the model knows well), and the
projection step is what determines whether the hard cases —
opaque camelCase coinages, abbreviations, multilingual act-text
predicates — embed meaningfully or as noise. The projection is a tunable
component, and §6 must measure scores across easy and hard predicate
shapes, not just the dictionary-word example.
Keyed by the object's natural identity
(iri), one row per object. The embedding is a
projection of the object, not a new object. This is what keeps
embeddings from leaking into the statement model — a predicate's vector
lives beside the predicate registry, not as a statement, so it can be
recomputed, dropped, and rebuilt freely without touching the immutable
donto_statement ledger or violating I3 (no destructive
overwrite of facts).
vector(384) matches the active
provider exactly (see §5.3). The dimension is a property of the table,
which is why a provider swap to a different dimensionality means a
new/migrated table, not an in-place type change (see §5.3 on
halfvec and migration).
model is stored per row. Embeddings
from different models are not comparable; recording the producing model
per row is what lets the maintenance loop detect "this row was embedded
by an older model" and re-embed it, and lets readers refuse to compare
across models. This is the analogue, in the embedding layer, of donto's
evidence-first discipline: every vector knows where it came
from.
The index is HNSW with
vector_cosine_ops. Cosine distance is the right
metric for normalized sentence-embedding spaces like bge-small; the
operator class is fixed at index-build time, so queries must use the
matching <=> cosine-distance operator to use the
index (the same "the query expression must match the index DDL exactly
or you seq-scan" discipline already documented for the FTS
to_tsvector index — it applies identically to vector
indexes).
The layering plan. Predicate is layer one because
predicate proliferation is the most acute abundance signal (865,834
distinct predicates, ~4,995 of ~6,111 frontier-test predicates
singletons) and because predicate alignment is what the existing
closure/identity machinery already consumes. The same table shape
extends outward, each as its own
donto_<type>_embedding table with its own HNSW
index:
| Layer | Object embedded | Text projection that gets embedded | Primary downstream use |
|---|---|---|---|
| Predicate (live, ~30,587 / 865,836) | predicate IRI | humanized predicate name + sampled descriptor/usage context | query-time alignment, closure, predicate identity clusters |
| Entity (next) | subject/object entity IRI | label + key attributes + type hints | identity-as-hypothesis clustering, entity dedup candidates (never merges) |
| Statement | a live statement | humanized subject predicate object triple |
semantic recall, contradiction-neighborhood discovery, lens intersection |
| Span | an evidence span | the snippet text itself | evidence retrieval, "find the source that says X" |
| Document / revision | a registered source body | document text (chunked) | /search/resources, source triangulation |
| Context | a context | aggregate/centroid of its statements | context similarity, cross-corpus routing |
The crucial invariant restated at the schema level: none of
these tables has a "merged_into" or "canonical_iri" column. An
embedding table can only ever rank neighbors; it can never
rewrite an object's identity. Collapse is structurally impossible
because the fabric has nowhere to write a collapse decision. Identity
decisions live in donto_identity_edge as
hypotheses (with method ∈ {trigram, embedding,
neural, human, import, rule}), and alignment decisions live in
donto_predicate_alignment as bitemporal, retractable rows —
both are additive, both are reversible, both preserve every original
object forever.
The embedding tables are the new input; the consumers already existed
and are now getting a better key. Live counts on
donto-pg:
| Object | Live count (as of 2026-06-03) | Notes |
|---|---|---|
donto_predicate_embedding rows |
~30,587 and climbing | full ~865,800 registry is a queued background job |
donto_predicate_alignment (live) |
18,488 | bitemporal; upper(tx_time) IS NULL filter |
donto_predicate_closure rows |
62,559 | transitive closure over alignments |
donto_identity_edge |
122 | barely used; the differentiator, still cold |
The suggest functions sit between embeddings and these tables. Live
pg_proc carries five
(donto_suggest_alignments, _semantic, two
_hybrid overloads, and _hybrid_fast — see §4.2
on consolidation); the four that matter conceptually:
donto_suggest_alignments — the original,
lexical-only (pg_trgm similarity over predicate
strings). This is the brittle fallback the vision forbids as a
sole key: for killed it can only return
{killed, killedAt, killedBy, killedIn, killedOn} —
neighbors that share a substring.donto_suggest_alignments_semantic — new this
build, cosine-kNN over donto_predicate_embedding.
For killed it surfaces murdered (0.95) at the
top — a synonym whose trigram similarity to killed is only
0.0667, far below any usable threshold, so lexical cannot reach it in
practice (and slew, trigram 0.0, not at all). This is the
entire thesis in one query.donto_suggest_alignments_hybrid — new this
build, the union: lexical ∪ semantic, deduped and re-scored.
Hybrid is the production default precisely because the two signals are
complementary failure modes — lexical catches morphological
variants and typos that drift in embedding space
(organisation/organization), semantic catches
synonyms and paraphrase that share no characters
(killedBy/assassinatedBy). Neither alone is
sufficient; the union dominates both.donto_match_aligned — the query-time read
path: given a predicate (or its variants), expand through
donto_predicate_closure to its aligned set so a query for
killed also matches statements minted as
murdered, assassinated,
wasKilled. This is "defer joining to query time" made
concrete — the join key is the closure, and the closure is now fed by
semantics, not just trigrams.The flow is: embeddings (+ lexical) → suggest candidates → adjudicate → register alignment → rebuild closure → (optionally) emit identity edges → consult at query time. Embeddings improve exactly one thing — candidate generation — but candidate generation is the upstream bottleneck that gated the entire pipeline. Better candidates are why the closure can grow past lexical's ceiling, and why the cold identity machinery finally has a non-brittle signal to run on.
The active provider is fastembed running BAAI/bge-small-en-v1.5 (384-dim) on CPU, local, in-process. The properties that make it the right default for an abundance substrate:
But "default" is not "only." The provider is an
abstraction, mirroring the model/provider abstraction
the rewrite is building for extraction. The model column on
every embedding row, the per-object-type tables, and the per-row
dimension are exactly what make a swap clean: dropping in OpenAI
text-embedding-3-*, a GLM embedding endpoint, or a larger
local model (bge-large, e5-large) is a matter of (a) a new table at the
new dimension or a migration, (b) a new model string, (c)
the maintenance loop re-embedding under the new model. Because
model is recorded per row, a migration can run
incrementally and mixed — new rows in the new space,
old rows re-embedded lazily — and readers can refuse cross-model
comparisons until a layer is fully migrated. Critically,
alignment.rs already accepts
embedding_model: Option<String> alongside the
vector, so the provider identity flows through the Rust API
without a schema change there.
Recommended posture: keep bge-small as the always-on local baseline (it must always work offline and for free), and treat larger/hosted models as an opt-in quality lever for high-value layers (e.g. entity-identity clustering, where a false merge is expensive) rather than a wholesale replacement.
The fabric must never re-embed the whole world. With 865K predicates and millions of statements, full re-embedding is both wasteful and a recurring 34s-class scan against a 34 GB table. The discipline is embed-on-change, keyed by a content signature:
model. Store it (alongside, or derivable from,
the embedding row).This makes maintenance $O(\text{new} + \text{changed})$, not
$O(\text{total})$. New predicates minted by the firehose get embedded;
predicates whose sampled descriptor context shifted get refreshed; a
model swap invalidates exactly the rows under the old model and no
others. The updated_at column gives a cheap watermark for
"embed everything touched since T," and model gives the
cross-model invalidation key. The ~30,587/865,836 coverage figure (as of
2026-06-03, rising) is itself a maintenance artifact — it is simply how
far the incremental backfill has progressed, not a sampling
decision.
The defining architectural choice is that alignment is not a batch job someone remembers to run. The handbook records that the alignment engine was historically dormant and un-scheduled — the SQL functions existed but nothing drove them, which is why an 865K-predicate registry had only ~18,500 alignments and 122 identity edges. The fix is a single continuous loop, scheduled (cron/Temporal), idempotent, and observable, that does the whole cycle each tick:
model,
updated_at).donto_suggest_alignments_hybrid (lexical ∪ semantic) to
propose alignment candidates above thresholds.donto_predicate_alignment as bitemporal rows (additive; a
later reversal is a retraction, never a delete).donto_predicate_closure (currently 62,559 rows) so
query-time expansion sees the new edges.donto_identity_edge with method ∈ {embedding,
neural} — the previously-cold machinery, now fed a real signal. These
are hypotheses with confidence, never merges.This is what "constantly aligning" means in practice: the firehose mints free, untyped predicates; the loop quietly embeds them, finds their semantic neighbors, proposes and adjudicates alignments, and refreshes the closure — so that by the time a query arrives, the join key is already current. Query-time alignment stays query-time (nothing is decided eagerly or destructively at write time), but the machinery the query consults is kept warm by the loop. The two are not in tension: the loop maintains the index; the query does the expansion.
Idempotency and locking. Each loop tick must be safe
to overlap or re-run: steps 1, 4, 5, 6 are upserts/recomputes keyed by
stable identity, so a crashed tick re-does work harmlessly. A single
advisory lock (e.g. pg_advisory_lock on a fixed key)
prevents two loop instances from running steps 2–6 concurrently and
double-proposing; embed-new (step 1) can fan out under that lock by
batching disjoint IRI ranges. Durability via Temporal means a restart
loses no progress — a half-finished tick resumes rather than restarting
cold.
Reads never touch the loop; they consult the maintained artifacts:
donto_match_aligned + donto_predicate_closure:
a query naming killed transparently also matches
murdered/assassinated/wasKilled.
This is alignment-as-expansion, the non-destructive dual of
dedup — the original predicates all survive in the ledger; the query
just reaches further.<=> cosine operator to hit
donto_predicate_embedding_hnsw, and ef_search
(§5.7) bounds recall-vs-latency per query. The same
bounded-candidate-then-rerank pattern that keeps the 39M-row FTS search
under ~1s applies: take a bounded HNSW neighborhood, then
re-rank/filter, rather than scanning.HNSW build and search parameters. pgvector's HNSW exposes three primary knobs:
| Knob | What it controls | Recommended posture for donto |
|---|---|---|
m (build) |
edges per node; graph density | default 16 for predicate/entity layers; consider
24–32 for the statement layer where recall matters most and
the corpus is large |
ef_construction (build) |
candidate list size at build; build quality vs. build time | 64 baseline; raise to 128–200 for the
high-value entity/statement layers, accepting slower builds (a
once-per-layer cost) |
ef_search (query, session-set) |
candidate list at query; recall vs. latency | tune per read path: small (40–80) for interactive alignment lookups, larger (100–200) for offline identity-cluster rebuilds where recall trumps latency |
m and ef_construction are fixed at index
build, so getting them right per layer matters before the layer scales;
ef_search is a session GUC and is the live recall/latency
dial.
halfvec for storage at scale.
vector(384) is 4 bytes/dim = ~1.5 KB/row before index
overhead. Across 865K predicates that is tolerable; across tens of
millions of statements it is not — the embedding column plus its
HNSW graph would dwarf reasonable memory and stress the
/dev/sdb data disk (which hit 95% once, on 2026-06-02, when
an ENOSPC truncated a source file; it sits at 56% / ~157 GB free as of
2026-06-03, but a statement-level build is exactly the kind of write
that could exhaust it again). pgvector's halfvec (2
bytes/dim) roughly halves storage and index
size for ~negligible recall loss on normalized sentence embeddings, and
HNSW supports halfvec_cosine_ops. Posture: keep
vector at the predicate/entity layers (small,
precision-sensitive); adopt halfvec for the
statement/span layers where row counts are large and the
marginal recall cost is immaterial. This is a per-layer decision
precisely because the fabric is per-object-type.
Batching. Embed in provider-sized batches (bge-small
on CPU is throughput-bound, not latency-bound) and upsert in
transactional chunks so a failure rolls back a bounded unit.
Build/maintain HNSW indexes with adequate
maintenance_work_mem; on the 16 GB box, building a large
layer's index may warrant a temporary bump and off-peak scheduling, and
--shm-size=2g on the donto-pg container must
be preserved (it is a documented hard requirement of the pgvector
image).
Idempotency & locking. Covered in §5.5: signature-hash skip + upsert-by-IRI make every step replay-safe; a single advisory lock serializes the adjudicate→closure→identity stages; Temporal durability makes the whole loop crash-safe.
Observability / coverage metrics. The loop's step-7 run record is the operational dashboard. Track at minimum, per layer:
donto_predicate_closure row delta per run (62,559 as of
2026-06-03) and alignment count (~18,500 live) — the proof the loop is
doing work;donto_identity_edge additions by method
(almost all cold today — the number to watch as the differentiator
finally warms);These metrics also close the loop on the measurement-as-steering-wheel principle (§6): coverage and closure-growth are how you see abundance being tamed at query time instead of thrown away at write time.
To make the "already scaffolded" claim precise:
| Component | State before this build | State after |
|---|---|---|
alignment.rs embedding params
(embedding_model,
embedding: Vec<f32>) |
present, unused | now fed real vectors |
identity-edge method ∈ {…, embedding, neural}
(migration 0060) |
declared, never produced | producible by the loop's cluster step |
derive_embeddings Trust-Kernel capability + CLI
subcommand |
present, no backend | backed by a real provider |
donto_predicate_closure /
donto_match_aligned |
present, lexical-fed | semantic+lexical-fed |
| pgvector extension | absent | 0.8.2 installed |
| embedding source/provider | absent | fastembed bge-small-en-v1.5, local CPU |
donto_predicate_embedding + HNSW index |
absent | present, ~30,587 rows (rising) |
| semantic / hybrid suggest functions | absent | donto_suggest_alignments_semantic /
_hybrid present |
| continuous scheduled loop | absent (engine dormant/un-scheduled) | the one loop (§5.5) |
The substrate had been built as if embeddings would arrive. This build delivered the two genuinely missing primitives — a vector index (pgvector) and a vector source (a local provider) — and the one piece of orchestration (the continuous loop) that turns them from a one-time backfill into a permanently-maintained fabric. Everything else was a socket waiting for a plug.
donto does not get to assert that the embedding fabric "works." The substrate's own design principle — measurement is the steering wheel — forbids it. Every claim in §§3–5 (semantic join keys beat lexical ones; pervasive embeddings raise recall; the fabric stays paraconsistent) is a hypothesis until it is instrumented, baselined, and tracked over time. This section defines the eval suite that turns the fabric from an architectural bet into a measured system, gives target and illustrative numbers anchored to the live store, and specifies the instrumentation each metric needs.
The discipline is the one that already governs extraction: a
metric that is not stored, re-runnable, and time-sliced is not a
metric. Each eval below is specified as (a) a gold or proxy
ground-truth, (b) a numerator/denominator that can be recomputed against
any tx_time slice, and (c) a steering decision it informs.
The whole suite is written to a donto_eval_run context
(ctx:eval/<suite>/<run-id>) so eval results are
themselves bitemporal donto state — we can ask "what did alignment
precision look like as of 2026-05-01?" the same way we ask any
retrospective question.
The fabric is a quality intervention, not a capability one. Lexical alignment already returns something for almost every query; the question is never "does a result come back" but "is the result the right one, and would a human or a downstream task agree." That makes baselines non-negotiable. The live store gives us the denominators that make the metrics honest:
| Quantity | Live value (2026-06-03) | Role in the eval suite |
|---|---|---|
| Live statements | 39,560,959 | denominator for closure-expansion recall, time-slice population |
| Distinct predicates (registry) | 865,836 | embedding-coverage denominator |
| Distinct predicates in live statements | 985,448 | fragmentation numerator (the surface actually queried) |
| Singleton predicates (used exactly once) | 733,401 (84.7%) | the fragmentation tail the fabric must compress |
| Live predicate alignments | 18,488 | precision/recall test population |
donto_predicate_closure rows |
62,559 | effective-predicate expansion factor |
| Predicates embedded | ~30,587 (3.5% of registry) | embedding-coverage starting point |
| Identity edges | 122 | identity-cluster purity test population (tiny — see §6.5) |
| Evidence links | ~1.88M (4.75% of stmts) | anchors for span-level relevance judging |
(The alignment, embedding, and identity counts drift between
measurements because the loop is now actively running.) One row looks
like a contradiction and is not: the distinct predicates in
live statements (985,448) exceeds the registry (865,836)
because statements freely reference predicate IRIs that were never
written into donto_predicate — the registry is a catalog
the firehose is allowed to outrun, which is itself a small abundance
artifact (the emit path mints predicate IRIs on statements without a
registry round-trip). The eval suite should use the larger,
surface-actually-queried figure as the fragmentation denominator.
Two of these numbers are the report's central tension stated as measurements. The overwhelming majority of predicates are singletons — that is the fragmentation the fabric exists to compress. 3.5% embedding coverage is the gap between the fabric as designed and the fabric as deployed; until coverage is high, every other metric below is being measured on a partially-built fabric, and the coverage number must be reported alongside every other result as a confound.
What it measures. Of the alignment pairs the fabric
proposes (lexical-only, semantic-only, hybrid-RRF), how many
are true synonyms/sub-property relations (precision), and of the true
relations that exist, how many does each method find (recall). This is
the eval that directly tests the report's money-shot claim — that
semantic finds killed↔︎murdered and
killedBy↔︎assassinatedBy where lexical structurally
cannot.
Gold set. We need a held-out labelled set of
predicate pairs. Bootstrapping it without violating the no-brittle-logic
rule (no hand-curated synonym dictionary used in the system):
take a stratified sample of ~2,000 predicate pairs drawn from three
buckets — (i) high lexical similarity (trigram > 0.5), (ii) high
semantic similarity but low lexical (cosine > 0.85, trigram
< 0.2 — the bucket lexical can never reach), (iii) random pairs as
negatives — and have an LLM judge label each
same / sub-property / related / unrelated, with a human
spot-check of a 200-pair subset to estimate judge error. The gold set is
a measurement artifact, never a runtime lookup; it lives in
ctx:eval/alignment-gold/v1 and is itself versioned.
Method comparison (illustrative targets). The structural point is bucket (ii): semantic and hybrid should dominate there by construction, and the lexical column should be near-zero — that asymmetry is the thesis.
| Method | Precision@proposed | Recall (all buckets) | Recall on bucket (ii) only |
|---|---|---|---|
| Lexical (trigram, current prod) | ~0.80 (illustrative) | ~0.45 | ~0.02 — structurally near-zero |
| Semantic (bge-small cosine) | ~0.78 | ~0.70 | ~0.75 |
| Hybrid (RRF of lexical+semantic) | ~0.86 | ~0.78 | ~0.74 |
The numbers are illustrative pending the first gold run, but the
shape is a prediction the eval will confirm or falsify:
lexical's bucket-(ii) recall must be ~0 (it cannot find synonyms below a
usable trigram threshold — the killed↔︎murdered case at
0.0667), and hybrid should beat both single methods on overall recall
while matching or exceeding lexical on precision via RRF's
agreement-weighting. If semantic precision comes in below
lexical, that steers us to raise the cosine threshold or add an
LLM-adjudication gate on the semantic-only proposals before they are
written as alignments.
A blunt honesty note on the current empirical base.
As of this report, the entire measured evidence for the thesis
is a single predicate: killed, whose semantic top-k
(murdered 0.95, then the killed* morphological family at
0.93–0.94) and lexical top-k (the killed* family, no
murdered) are reproduced above from live function calls,
plus one live false-friend cosine
(bornIn/diedIn at 0.638). That is one decisive
anecdote, not a trend — and it is partly an artifact of coverage: at
3.5% embedding coverage most candidate probes (married,
occupation, …) currently return nothing semantic
because their neighbours are not yet embedded. The honest minimum next
step, achievable before the full 2,000-pair gold set, is a
20-neighbour-by-5-predicate hand spot-check: pick five
well-covered predicates spanning easy (dictionary-word) and hard
(camelCase/sentence-length) shapes, label the top-20 semantic neighbours
of each
same / sub-property / inverse / related / unrelated, and
report raw precision@20. Until that exists, every numeric row in this
section is explicitly a projection, and the report says so.
Instrumentation needed. A harness that, for each
gold pair, queries all three donto_suggest_alignments*
functions and records rank + score; a confusion-matrix writer keyed to
the gold labels; and a per-bucket breakdown so the bucket-(ii) asymmetry
is always visible. The functions already exist
(donto_suggest_alignments, _semantic,
_hybrid); what is missing is the gold context and the
runner.
What it measures. Abundance produces ~985,446 distinct surface predicates, the overwhelming majority of them singletons. The fabric's job is not to delete them (paraconsistency forbids that) but to make them queryable as fewer effective predicates via closure expansion at query time. The metric is the effective-predicate count after closure: how many equivalence-ish clusters do the alignment+closure relations induce, and how much of the singleton tail gets pulled into a cluster with a populated, well-typed predicate?
Definition. Run connected-components over the
donto_predicate_closure graph (62,559 rows today).
Report:
| Metric | Today (lexical closure) | Target after semantic fabric |
|---|---|---|
| Effective predicates (components) | ~ (distinct − small merges) | meaningful contraction, tail-driven |
| Singleton rescue rate (≥10-freq anchor) | low (lexical can't reach the tail) | the steering target — track monthly |
| Mean component size | ~1.0+ | rises as the tail joins |
The critical guard. Compression must be reported with §6.1 precision. Fragmentation reduction is trivially maximized by aligning everything to everything — which destroys precision and, worse, would silently collapse genuinely-distinct predicates. The steering rule is: maximize singleton rescue subject to alignment precision ≥ threshold. A fragmentation drop with no precision floor is a regression, not a win.
Instrumentation needed. A periodic
connected-components job over donto_predicate_closure
writing component sizes and the rescue-rate histogram to
ctx:eval/fragmentation/<date>.
What it measures. Whether the fabric improves the
substrate-wide /search (the GIN-tsvector path over 39M
statements) and the predicate/entity retrieval that feeds query-time
joins. This is the user-facing payoff metric.
Gold set. A query set with graded relevance
judgments. Two sources, neither hand-maintained as a runtime list: (i)
proxy-from-behavior — Omega's
recallMemories.ts and the genealogy front already issue
real queries; log query→clicked/used-result pairs as weak positive
signal; (ii) LLM-judged — for ~150 representative
queries (genealogy name+place lookups, memory recall, predicate-intent
queries) have an LLM grade the top-20 from each method 0–3.
Store in ctx:eval/search-gold/v1.
Metrics. nDCG@10 and MRR per method, plus a hybrid via Reciprocal Rank Fusion column — because RRF (§4.3) is the principled way to combine the lexical FTS path (which is what prod runs today) with the semantic path without tuning a score-scale.
| Method | nDCG@10 | MRR | Notes |
|---|---|---|---|
Lexical FTS (current donto_statement_fts_name) |
baseline | baseline | the brittle fallback; strong on exact name hits |
| Semantic (statement/entity embeddings) | +Δ on paraphrase/variant queries | +Δ | wins on spelling variants (Mahamoodally↔︎Mamode Ally) |
| Hybrid (RRF) | ≥ max(lexical, semantic) | ≥ max | the production target |
donto-specific prediction. Lexical should win or tie on exact-token queries (it is excellent at "Maurel" → the Maurel statements) and lose badly on variant/paraphrase queries — exactly the genealogy variant-spelling problem (Lablanche↔︎Lablache, Collinson↔︎Colinson) the handbook calls out. Hybrid-RRF should never lose to either single method on nDCG; if it does on a query class, that class needs reweighting. Report nDCG segmented by query class (exact-name / variant-name / intent/paraphrase) — a single average hides the whole story.
Instrumentation needed. Statement- and entity-level
embeddings (currently only predicates are embedded), a query-logging
hook in /search and /recall, and an RRF fusion
step in the search route. The latency budget already exists (the route
caps at a 2000-row candidate CTE + 9s timeout); semantic retrieval must
live inside it via the HNSW index, not a brute-force scan.
What it measures. What fraction of each object type carries a current vector. This is the gauge that contextualizes every other metric: a 0.78 semantic recall measured at 3.5% predicate coverage means something very different at 95% coverage.
Definition. Per object type, coverage = (objects
with a non-stale embedding) / (objects). "Non-stale" matters: an
embedding is stale if the underlying text changed after the embedding's
tx_time. The fabric's continuous loop (the one maintenance
loop from §5) must close this gap and keep it closed.
| Object type | Embedded today | Total | Coverage | Target |
|---|---|---|---|---|
| Predicates | ~30,587 | 865,836 | 3.5% | ≥ 99% (queued backfill) |
| Entities | 0 | (millions) | 0% | high-value-first, then full |
| Statements | 0 | 39,560,959 | 0% | sampled then full (cost-gated) |
| Spans / evidence | 0 | ~1.88M | 0% | full (small, high-value) |
| Documents | 0 | — | 0% | full |
| Contexts | 0 | ~19,812 | 0% | full (cheap) |
Steering use. Coverage is the prerequisite metric — it gates interpretation of §§6.1–6.3. The report's honest framing is that the fabric is presently a predicate-only fabric at 3.5%; the eval suite must publish coverage first and refuse to over-claim the others until coverage on the relevant object type is high. Cheap high-value layers (~19,812 contexts; ~1.88M spans) should be backfilled first because they unlock span-level relevance judging (§6.3) and context-routing for very little compute.
Instrumentation needed. A
donto_embedding_coverage view (counts per type +
staleness), and a freshness SLO on the maintenance loop (e.g. p95
time-from-mint-to-embedded < 1h for predicates).
What it measures. Embeddings cluster entity
mentions; identity decides whether two mentions are the same individual.
The non-negotiable nuance from the report's thesis is that
embeddings must cluster and rank but must not collapse
— identity stays a hypothesis (donto_identity_edge), not a
destructive merge. This eval proves the fabric helps propose
identity candidates without silently merging distinct people — a live
genealogy hazard (16 distinct "Kittys"; the ex:kitty
junk-drawer URI).
identity_edge hypothesis and never deletes/merges
a statement. This is a structural invariant test, not a tuning metric —
it must always pass.Caveat the honest report must state. The identity machinery is barely exercised: only 122 identity edges total against tens of millions of distinct subjects. Purity/coverage here are measured on a near-empty population; the metric's first job is to grow that population safely, then measure it. The Kitty disambiguation (16 distinct individuals collapsed under one URI) is the natural first gold case.
Instrumentation needed. An entity-mention embedding layer (§6.4 currently 0%), a candidate-clustering job (HNSW k-NN, no merge), and an LLM/human adjudication queue that materializes confirmed/refuted decisions as identity-edge hypotheses — never as merges.
Intrinsic metrics (§§6.1–6.5) can all improve while the user-facing system does not. Task-lift is the metric that matters most and the one the abundance vision centers — measurement as steering means optimizing the downstream task, not the intermediate score.
Flagship: jsonresume→jobs matching. The match
quality depends entirely on query-time alignment of freely-minted skill
predicates (usesReact, reactDeveloper,
proficientInReact, ESCO/Lightcast skill IRIs). Metric:
match precision/recall and explainability coverage
(fraction of matches with a citable evidence path) against a gold set of
resume→suitable-job judgments. Ablation: matching with lexical-only
alignment vs hybrid-fabric alignment.
| Configuration | Match recall | Match precision | Explainable-match rate |
|---|---|---|---|
| Lexical alignment only | baseline | baseline | baseline |
| + semantic fabric (hybrid) | +Δ (target the headline) | ≥ baseline | ≥ baseline |
Genealogy record-matching recall. The fabric's job
is to raise recall on the variant-spelling problem without losing
precision: of known true person-record matches (DNA-triangulated cases,
the Sherrington/Brooks gold from the live research), what fraction does
fabric-assisted search surface that lexical misses? This is the
killed↔︎murdered contrast (trigram 0.0667, far below
threshold) applied to names: Mahamoodally↔︎Mamode Ally is a
near-zero-overlap pair lexical cannot bridge at any usable cutoff.
Instrumentation needed. A held-out gold for each task (jobs-match judgments; DNA-confirmed genealogy match pairs already exist in the research corpus), and an ablation switch that runs the same pipeline with the fabric on/off so the lift is attributable to the fabric and nothing else.
This is the eval no vector DB or normal KG can run, and it is the one that proves the bitemporal-paraconsistent design pays off. Question: does an alignment learned at time T improve recall on claims that were ingested before T?
Why it is unique to donto. A collapse-on-conflict
store rewrites history; it cannot ask "given what I know now, how would
my older answers change," because the old state is gone. donto keeps
everything as legal bitemporal state, so we can hold the claim
population fixed at an older tx_time and vary only the
alignment knowledge to its newer state. The deferred-join
architecture means alignment learned today retroactively improves every
claim ever stored — and we can measure exactly that.
Protocol.
tx_time = T0 (e.g. 2026-05-01).A positive lift is the proof: learning is retroactive. An
alignment minted today (reactDeveloper ↔︎ usesReact,
killedBy ↔︎ assassinatedBy) raises recall on claims that
have sat untouched in the store for months — without re-ingesting or
rewriting a single statement. That is "defer joining to query time"
delivering compounding value, measured.
| Slice | Alignment knowledge | Claim population | Recall@10 | Interpretation |
|---|---|---|---|---|
| Baseline | T0 (2026-05-01) | T0 | baseline | what we could answer then |
| Retroactive | T1 (now) | T0 (held fixed) | baseline + Δ | what today's knowledge unlocks in old data |
Instrumentation needed. Bitemporal-aware querying in
the harness (every alignment lookup must accept an as-of
tx_time — the closure tables are already bitemporal, so
this is queryable today), a frozen gold set with stable IDs, and a
stored lift series so the retroactive Δ is itself tracked over time (the
second derivative: is the fabric's retroactive power growing?).
These seven evals are not a one-time validation; they are the dashboard the maintenance loop steers by. The intended cadence:
| Eval | Cadence | Primary steering decision |
|---|---|---|
| Embedding coverage (§6.4) | continuous (SLO) | backfill priority; gates interpretation of all others |
| Alignment P/R (§6.1) | per alignment-model change | cosine/RRF thresholds; LLM-gate on semantic-only |
| Fragmentation reduction (§6.2) | weekly | singleton-rescue vs precision-floor tradeoff |
| Search relevance (§6.3) | weekly + per index change | fusion weights by query class |
| Identity purity/coverage (§6.5) | per clustering run | candidate threshold; collapse-safety must always pass |
| Task-lift (§6.6) | per release | ship/hold the fabric change |
| Retrospective time-slice (§6.7) | monthly | proves compounding; reports the headline retroactive Δ |
The unifying rule, and the one that keeps the fabric honest: every quality metric is reported jointly with embedding coverage and with the paraconsistency invariant (0 destructive merges). A recall win at the cost of coverage confound, or a fragmentation win at the cost of a silent merge, is not a win — it is a regression the steering wheel exists to catch. The fabric earns its place only when these numbers move together in the right direction, and donto is the rare substrate that can prove they did — retroactively, against history it never threw away.
An embedding fabric that touches every object type is a strong claim, and a report that only sold the upside would be dishonest. This section is the counterweight. It states the real costs in real numbers for this box, the tradeoffs we are deliberately accepting, and — most importantly — the one thing embeddings must never be allowed to do inside donto, because doing it would dissolve the differentiator the whole substrate exists to protect.
The short version: embeddings are cheap and continuous for the small, high-leverage object types (predicates and entities, ~1.8M vectors) and expensive and opt-in for the large ones (statements, spans, documents). And across all of them, the embedding fabric is permitted to cluster and rank, and forbidden to collapse and merge. Everything below elaborates those two sentences.
The single hardest constraint is the data disk.
/mnt/donto-data (/dev/sdb) is ~373 GB and
already carries pgdata (the live substrate),
backups, and the genealogy workspace/. As of
2026-06-03 it is 56% used (~157 GB free) — but that headroom is recent
and fragile: on 2026-06-02 this same disk hit 95% and an ENOSPC
truncated a source file mid-write, and it was only pruning old backups
that bought back the space. Whatever the embedding fabric costs, it pays
out of that budget, against a disk that has demonstrated it can
fill. So the design has to be sized against it before anything else.
A bge-small-en-v1.5 vector is 384 float4 =
1,536 bytes of raw payload. The realistic on-disk cost
is higher than that once you count Postgres row overhead, the
vector type header, the foreign-key/IRI column, and — the
big one — the HNSW index, which for typical
m/ef_construction settings runs roughly the
same order of magnitude as the vectors themselves (often 1.0–1.5x the
raw vector bytes). A defensible planning figure is ~3–4 KB of
total footprint per embedded object (vector row + its share of
the HNSW graph). Applying that to each object type:
| Object type | Population (live) | Raw vectors @1.5 KB | Vectors + HNSW @ ~3.5 KB | Verdict |
|---|---|---|---|---|
| Predicates | 865,836 distinct | ~1.3 GB | ~3.0 GB | Embed fully + continuously |
| Entities (distinct subjects) | ~1M (order-of-magnitude) | ~1.5 GB | ~3.5 GB | Embed fully + continuously |
| Contexts | ~19,812 | ~30 MB | ~70 MB | Trivial; embed fully |
| Statements | 39,560,959 | ~59 GB | ~140 GB | Tiered / opt-in only |
| Evidence spans | ~1.88M evidence links | ~3 GB | ~7 GB | Opt-in per consumer |
The contrast is the entire policy. Predicates + entities + contexts
together are ~1.8M vectors ≈ 6–7 GB all-in — a rounding
error against the free budget, and small enough that a single continuous
loop can keep every one of them fresh. Statements are a different
universe: embedding all 39.5M of them is ~59 GB of raw vectors
and ~140 GB once the HNSW index is built — about 89% of
the current ~157 GB free, on top of a
donto_statement heap that is already 34 GB total. That is
not a maintenance burden we can absorb; it would leave the disk a single
backup-run away from the 2026-06-02 ENOSPC again — a disk-exhaustion
event with a build step attached. (For calibration, the predicate
embedding table — only ~30,587 of 865,836 predicates embedded so far —
is already ~110 MB.)
The conclusion is not "embeddings don't scale." It is that the small object types are where the leverage is anyway. The query-time-alignment vision is fundamentally about joining on predicates and entities — the keys. A predicate vector is consulted on behalf of every statement that uses that predicate; embedding 866K predicates buys you semantic reach over all 39.5M statements for the cost of 866K vectors. Embedding the statements themselves would buy comparatively little additional join power at ~20x the cost. The math and the vision agree: embed the keys fully, embed the rows selectively.
Statement and span embeddings are therefore a tiered, opt-in layer, never a blanket build. Concretely:
donto_statement_fts_name, the
existing GIN index) remains the fallback for broad statement search, and
predicate/entity vectors carry the semantic join.ctx:memory/*, which is ~56K statements, or a single
genealogy sub-context) can request embedding for that context's slice.
56K vectors is ~200 MB all-in — affordable. The cost scales with the
slice the consumer actually pays for, not with the whole substrate.tx_time touched in the last N days)
can be embedded and aged out, capping the index at a fixed size
regardless of corpus growth.This keeps the genuinely large layer governed by explicit, bounded opt-in, and keeps the disk constraint from ever being the thing that breaks.
Vectors are not write-once. Three things make an existing embedding wrong over time:
bge-small re-embeds the entire fabric. For 1.8M
predicate+entity vectors at fastembed throughput this is hours, not
days, and is the strongest reason to keep the large statement tier
small — re-embedding 39.5M statements on a model change would
be prohibitive.status predicate used one way in genealogy and another in
memory). The vector embeds the name/descriptor, so this is
slower-moving, but it argues for periodic re-embedding from current
descriptors rather than embed-once-forget.The mitigation is the one continuous loop the fabric
is built around (§5.5): a single scheduled worker that (a) embeds new
objects, (b) re-embeds objects whose descriptor changed, and (c) carries
a model_version / embed_version stamp on every
vector so a model swap is a filtered backfill
(WHERE embed_version < current) rather than a
stop-the-world rebuild. Crucially, this loop must actually run
— the alignment engine's prior failure mode was being built but
dormant/un-scheduled (lexical-only, never invoked). An
embedding fabric that isn't continuously refreshed is worse than no
fabric, because it gives stale answers with the confidence of fresh
ones.
The current choice is BAAI/bge-small-en-v1.5, 384-dim,
run locally via fastembed. The tradeoffs that pinned that choice, and
where they'd be revisited:
bge-small
(384-dim) is the right default precisely because the fabric is
pervasive: at 1.8M+ vectors, halving the dimension nearly halves both
disk and HNSW build/query cost, and bge-small is strong
enough that the semantic win over lexical is already decisive (the
killed → murdered at 0.95 result was produced
with it). A larger model (768/1024-dim) buys marginal recall at ~2–3x
the footprint — only justified if false-friend rates (below) prove
unacceptable on the small model, and even then better spent on the
adjudication model than the embedding model.bge-small-en will under-cluster genuinely cross-lingual
synonyms. The honest position is that for the predicate/entity
layer (mostly camelCase English-ish identifiers) this is minor; for any
future span embedding over Mauritian état-civil text it would
warrant a multilingual model on that tier specifically. Tiering makes
this tractable — model choice can vary per object type.Dimensionality is also a schema commitment: the
donto_predicate_embedding column and its
vector_cosine_ops HNSW index are typed to 384. Changing it
is a migration, which is the final argument for getting the default
right rather than oscillating.
Semantic similarity is powerful and over-eager. §4.1 laid
out the taxonomy of the embedding's blind spot (relation-type blindness,
inverses-as-equivalents, false friends) as one of three signals; this
section is the cost-side restatement of the same fact, grounded in a
live measurement. The same property that lets cosine find
murdered ≈ killed (cosine 0.95, trigram 0.0667) will also
rank dangerous near-neighbours highly:
donto_predicate_embedding
table, bornIn and diedIn — opposite
life-events, never interchangeable — sit at cosine
0.638, well inside the neighbourhood the
candidate-generation step returns.
marriedTo/divorcedFrom and
ancestorOf/descendantOf are the same hazard:
distributionally very close (same domains, ranges, sentences), so
embeddings surface them as top neighbours. Cosine cannot tell "same
topic" from "same meaning," and the danger is measurable, not
theoretical.killedBy and assassinatedBy are equivalent;
killedBy and killed are inverses.
Both pairs score high. Treating an inverse as an equivalent silently
reverses the direction of a claim.locatedIn (a
city) vs. locatedIn (a continent) — same vector, wrong
join.This is why the cosine score is a candidate generator, not a
decision — the §4 ensemble in its sharpest form. The pipeline
is explicitly multi-stage: lexical + semantic propose
(donto_suggest_alignments, _semantic,
_hybrid), and an LLM adjudicator disposes
— judging whether a high-cosine pair is exact_equivalent,
inverse_equivalent, sub_property_of,
close_match, or not_equivalent /
incompatible_with (all of which are first-class values of
the relation column on
donto_predicate_alignment). The embedding's job is to
shrink 866K candidates to a handful; the adjudicator's job is to assign
the typed relation and the confidence. Skipping the adjudicator
— promoting on cosine alone — is precisely the brittle,
semantically-wrong shortcut the alignment relation enum exists to
prevent.
Everything above is ordinary engineering tradeoff. This is not. It is the load-bearing constraint, and it is donto-specific.
Embeddings may CLUSTER and RANK. They must never COLLAPSE or MERGE.
donto's entire differentiator is that it is paraconsistent and
evidence-first: it holds incompatible claims forever as legal
state, never dedups, never picks a winner, never
invalidates-on-conflict. Identity is a hypothesis, not
a fact. An over-eager embedding pipeline is the single most natural way
to destroy this — because "two things are close in vector space" is
exactly the signal a naive system uses to say "these are the
same thing, merge them." A
MERGE INTO ... WHERE cosine > 0.9 would, in one query,
turn donto into the thing it was built to replace: a collapsing store
that throws most of the firehose away.
The contradiction-preserving discipline is enforced structurally, and the live schema already encodes it:
Alignment is a separate edge, never a
rewrite. A discovered equivalence is recorded as a row in
donto_predicate_alignment (or an identity edge), pointing
source_iri → target_iri with a typed relation.
It does not touch donto_statement. No
subject, object, or predicate is ever rewritten to a "canonical" form.
Both predicates keep their statements; both entities keep their claims.
The "join" happens at read time by following alignment edges,
not by mutating the store. This is the emit free / defer joining to
query time principle expressed at the physical level: the firehose
stays raw, and alignment is a lens over it, not a surgery on
it.
Alignment is reversible and bitemporal. Every
alignment row carries tx_time (a tstzrange)
and a review_status ∈ {candidate,
accepted, rejected, superseded}.
A wrong alignment is retracted (close its tx_time,
set superseded) — never DELETEd — honoring
invariant I3 (no destructive overwrite). Because the
underlying statements were never modified, retracting an alignment fully
restores the prior view. A destructive merge has no undo; an alignment
edge is undo by construction.
Query expansion is gated, not automatic. The
schema separates what cosine suggested from what queries
are allowed to use. The flags
safe_for_query_expansion, safe_for_export, and
safe_for_logical_inference are independent booleans, and
review_status defaults to candidate. A fresh
high-cosine pair is, by default, a candidate alignment that
does not silently widen anyone's results until it is
reviewed/adjudicated and marked safe. The three flags exist so that an
alignment can be trusted for fuzzy ranking (query expansion)
while still being forbidden for logical inference or
export — because "close enough to rank together" is a far
weaker claim than "substitutable in a proof." Embeddings earn the
weakest of these by default and the stronger ones only through
adjudication.
not_equivalent /
incompatible_with are first-class. The same
machinery that records "these align" records "these explicitly do
not align" — a typed, evidence-anchored assertion that two
high-cosine neighbours (the bornIn/diedIn
false friends) are deliberately held apart. A near-miss in vector space
is not silently dropped; it is recorded as a negative relation, so the
next pass does not re-propose it and a query never expands across it.
The fabric thus learns its own false friends and writes them
down, rather than re-discovering and re-rejecting them forever.
Put plainly: the embedding fabric earns its place by making the firehose navigable — by ranking what is near and clustering what is alike — and it keeps donto donto by never once acting on that nearness destructively. Clustering is a view; merging is a deletion. The fabric does the first and is structurally incapable of the second.
The fabric is, today, a predicate-only fabric at 3.5%
coverage, proven on one decisive example
(killed → murdered at cosine 0.95 but trigram 0.0667 — an
order of magnitude below the 0.30 alignment threshold, so unreachable by
the lexical key in practice) and one piece of infrastructure (pgvector +
a local provider + two new SQL functions, plugged into sockets the
substrate had already cut). The honest qualifier the report carries
everywhere applies here too: that example is, for now, almost the
entire empirical base — one positive pair plus a single live
false-friend cosine (bornIn/diedIn at 0.638).
The §6 eval suite exists precisely to convert this from one anecdote
into a measured trend. The case for it is not that embeddings are a good
idea in general — they are everywhere, and saying so adds nothing. The
case is specific and structural: donto's whole bet is to defer the join
to query time; a deferred join is only as good as its key; donto's key
was lexical, which the vision itself forbids; and the only non-brittle,
learned, meaning-bearing key for an open self-minted vocabulary is a
vector. Embeddings are therefore not a feature of the alignment engine —
they are the missing substrate primitive the deferred-join thesis has
needed all along.
The costs are real and bounded: a few GB to embed every key (predicate + entity + context) fully and forever, and an explicit opt-in tier for the 39.5M-statement body where a blanket build would eat the disk. The risks are real and answered: false friends are gated behind LLM adjudication, staleness is answered by one continuous loop, and the one catastrophic failure mode — collapse — is made structurally impossible by recording alignment as a reversible, governance-gated, bitemporal edge that never touches a statement. The proof obligation is laid out in §6 and is donto's to discharge: coverage to ~100% on the keys, alignment precision/recall on a real gold set, fragmentation reduction under a precision floor, and the retroactive time-slice that only a never-forgetting substrate can even run. If those numbers move together, donto will have done the thing it was built to do — hold an unbounded, contradictory, evidence-anchored firehose and, at query time, find what belongs together without ever forcing it together. The embedding fabric is what makes that last clause true.