A systems paper on the architecture, data model, and operational experience of a bitemporal, paraconsistent, evidence-first quad store running at 39.3 million statements.
Thomas Davis · Ajax Davis · 2026-05-28
We describe donto, a knowledge-substrate system
organised around three commitments most knowledge graphs do not make:
that every claim is anchored in evidence under a context, that
contradictions are preserved rather than rejected, and that the system
records both world-time (when a fact was true in the world) and
system-time (when it was believed in the database). The system
is implemented as a Rust workspace shipping a PostgreSQL extension
(pg_donto, built with pgrx), an HTTP sidecar
(dontosrv), a Python FastAPI + Temporal extraction layer
(donto-api), a CLI (donto), a terminal UI
(donto-tui), and a Lean 4 formal overlay
(donto_engine). The native query language,
DontoQL, exposes twenty-one clauses covering scope
inheritance, polarity, maturity, identity lenses, predicate-alignment
closure, bitemporal time-travel, modality, extraction-level filtering,
and policy enforcement. The substrate is exercised by
genes, a genealogical-research corpus with 39,294,083
statements, 938,918 distinct predicates, 19,230 contexts, and 1.84M
evidence links across 48 GB on disk. We document the system's data
model, its query language, its six-aperture extraction pipeline, its
Trust Kernel for policy-gated ingest, its release-and-federation
machinery (Ed25519-signed RO-Crate envelopes), and its Lean 4 overlay
for shape and rule certification. We characterise its performance
(steady ~2.5–3.0 K-row/s insert throughput, sub-100 ms point queries
through 1 M rows) and report empirical observations from production:
predicate proliferation (~938 k distinct freely-minted predicates),
evidence-anchor sparsity (~4.7 % of statements carry an evidence link),
and an exceedingly low retraction rate (281 of 39.3 M, ~7 × 10⁻⁶) that
reveals the system is currently operated as an append-mostly archive
rather than a constantly-revised research notebook. We argue that the
combination of evidence-first storage, paraconsistent semantics,
bitemporality, and a typed alignment layer is a useful substrate for
domains where multiple sources, schemas, communities, and models must
make claims about a shared world without forcing premature consensus —
language documentation, oral-history, genealogy, legal evidence,
clinical observation, and historical research being the motivating
cases.
Keywords: knowledge graphs, paraconsistent logic, bitemporal databases, provenance, evidence anchoring, predicate alignment, identity resolution, policy enforcement, scientific reproducibility, Lean 4, language documentation, genealogy.
Most research-supporting systems silently assume one or more of the following:
donto rejects all eight. Its operating model is open-world, evidence-first, contradiction-preserving, governance-native, bitemporal, multimodal, and schema-plural (PRD §0). The product question the system was designed to answer is:
Given a contested question, can the system return the relevant claims, the evidence behind them, the schema mappings that make them comparable, the identity hypotheses they depend on, the disagreements between them, the access policies governing them, and a reproducible release artefact?
If it can, the substrate works.
The first proving domain is language documentation, because it exhibits every constraint of contested knowledge simultaneously: incompatible analytical schemas (Universal Dependencies vs WALS vs Grambank), disputed identities (dialect/language boundaries; ISO codes vs Glottolog vs internal community ontologies), multimodal evidence (text, audio, interlinear glosses, phonetic transcription), restricted cultural material (community-governed sacred or sensitive records), diachronic change (reconstructed vs attested forms), formal validation (paradigms have shape constraints), and corpus-scale annotation.
The exercise domain is genealogy — specifically, the
North-Queensland family-history corpus that exercises donto in
production at genes.apexpots.com. Genealogy is, in our
experience, the hardest realistic instance of the same problem:
name spelling drifts across records, the same person appears under
maiden, married, and clan names, colonial-era records contain
misclassifications and falsifications as data, identity is contested
across native-title determinations, and oral-history claims contradict
written archival ones in irreducibly load-bearing ways.
The point of the genealogy exercise is not the genealogy. It is that
every contradiction in genealogy is a tripwire for donto's invariants.
Each friction point we hit becomes either a new tripwire test in
packages/donto-client/tests/ or an amendment to the PRD; we
are deliberately running the substrate at its limits because that is how
we learn what the substrate must be.
This paper describes:
safe_for_query_expansion,
safe_for_export, safe_for_logical_inference),
plus a materialised closure that the evaluator rides at query time
(§9).
donto_engine) that certifies
shape and rule reports asynchronously over a line-delimited JSON
protocol, with the invariant that Lean certifies and does not
gate: ingest is never blocked on the Lean side (§10).
did:key format, and a federation analysis comparing five
candidate stacks (Verifiable Credentials + DID, Solid Pods, SPARQL
federation, DataCite, RO-Crate) with reasoning for the chosen
architecture (§11).
We are not claiming any one component is novel in isolation: typed schema mapping is older than the semantic web, bitemporal databases have been formalised since the 1990s, paraconsistent logics go back to da Costa and Belnap, and the FAIR/CARE principles articulate the ethics we operate under. The contribution of this paper is the composition: a single substrate where these properties are not optional library features bolted onto a triplestore but invariants enforced from the schema upward, and a working system of nontrivial scale that we can characterise empirically rather than hypothetically.
donto is a Rust workspace deployed as a small set of cooperating processes against a single PostgreSQL 16 instance.
┌──────────────────────────────────────────────────────────────────┐
│ clients / consumers │
│ donto-cli donto-tui dontopedia (Next.js) external │
└──────────────────────────────────────────────────────────────────┘
▼ ▼ ▼ ▼
┌───────────────────────────────────────────────┐
│ donto-api (FastAPI :8000 + Temporal) │
│ extraction workflows, ingest activities │
└───────────────────────────────────────────────┘
▼
┌───────────────────────────────────────────────┐
│ dontosrv (axum HTTP sidecar :7879) │
│ 67 routes: query, assert, retract, shapes, │
│ rules, policy, evidence, alignment, search │
└───────────────────────────────────────────────┘
▼ ▼ ▼
┌────────────────┐ ┌──────────────┐ ┌──────────────────┐
│ donto-client │ │ donto-query │ │ donto_engine │
│ (Rust SDK) │ │ (DontoQL + │ │ (Lean 4 binary, │
│ │ │ SPARQL │ │ stdio JSON │
│ 131-migration │ │ evaluator) │ │ protocol) │
│ applier, │ │ │ │ │
│ typed wrapper │ │ │ │ shapes + rules │
│ over SQL fns │ │ │ │ certifier │
└────────────────┘ └──────────────┘ └──────────────────┘
▼ ▼
┌───────────────────────────────────────────────┐
│ PostgreSQL 16 (donto-pg) │
│ pg_donto extension (pgrx), 91 relations, │
│ 131 idempotent migrations, GiST + GIN + │
│ trigram indexes, advisory-locked migrator │
└───────────────────────────────────────────────┘
▼
┌─────────────────────────────┐
│ /mnt/donto-data/pgdata │
│ ~48 GB (32 GB statements) │
└─────────────────────────────┘
The production deployment runs on a single GCE
e2-standard-4 VM (4 vCPU, 16 GB RAM) with PostgreSQL 16
inside a Docker container (donto-pg), volume-bound to an
attached SSD at /mnt/donto-data/pgdata. systemd manages
five long-running services (dontosrv,
donto-api, donto-api-worker,
donto-debug, dontopedia-web) plus Caddy as a
TLS terminator and reverse proxy. Temporal workflows (extraction,
alignment, entity resolution) run in donto-api-worker
against the same Postgres instance.
The choice to keep pg_donto as a pgrx extension rather
than a sidecar microservice was deliberate. The SQL substrate
is the contract; the pgrx layer exists to make Rust the lingua
franca for plan-quality hot paths (e.g., immutable polarity / maturity
decoding) and to package all 131 migrations as
extension_sql_file! declarations so a
CREATE EXTENSION pg_donto on a fresh database produces a
working substrate. The HTTP sidecar dontosrv does not own
substrate state; it is a typed gateway over donto-client's
SQL surface.
| Language | packages/ | apps/ | total |
|---|---|---|---|
| Rust | ~14.0 K | ~9.3 K | 52,943 LOC |
| SQL | — | | 13,234 LOC (131 migrations) |
| Python (donto-api) | — | | 8,349 LOC |
| Lean 4 | ~1.7 K | — | 1,656 LOC |
| TypeScript (client-ts) | ~0.9 K | — | 899 LOC |
The two largest crates are dontosrv (5,088 LOC of HTTP
routing and sidecar protocol) and donto-cli (4,226 LOC
across ~40 subcommands). donto-client (the typed Rust
wrapper over the SQL surface) is 2,665 LOC; donto-query
(DontoQL + SPARQL parser, algebra, evaluator) is 2,739 LOC. The five
linguistic-pilot importer crates
(donto-ling-{cldf,ud,unimorph,lift,eaf}) total ~2.3 K
LOC.
The test surface is unusually large: 592 #[tokio::test]
annotations, 91 #[test], and 511 invocations of the
pg_or_skip! macro that lets database-touching tests skip
cleanly when Postgres is unreachable. The 77-file tripwire suite in
packages/donto-client/tests/ totals 19,968 LOC — more than
the combined LOC of every package except dontosrv. The
invariants_*.rs files (governance, paraconsistency,
bitemporal, predicate, modality, hypothesis, evidence, releases) encode
the PRD invariants as executable assertions.
The PRD specifies ten invariants the substrate must enforce
(donto/docs/DONTO-PRD.md §2). They are non-negotiable:
amendments go to the PRD first, never to code without spec. They are
listed here in shortened form; the precise text is in the PRD.
I1. No claim without evidence or explicit hypothesis
status. A statement must reference a
donto_evidence_link row, or carry a
hypothesis_only=true flag, before it can advance past
maturity E1.
I2. No restricted source without policy. A document
cannot be registered without a policy_id. Migration 0123
promotes donto_document.policy_id to NOT NULL with a
fail-closed default
(policy:default/restricted_pending_review, zero allowed
actions); this closes the historical F-1 gap where legacy ingest paths
could produce unpoliced rows.
I3. No destructive overwrite. Corrections, retractions, merges, splits, alignments, and policy changes are append-only events. The system supports transaction-time reconstruction of what was believed and visible at any prior system time.
I4. Contradictions are preserved. Two sources
disagreeing about Annie Davis's birth year both live in the database
forever. The system creates a donto_argument row with
relation rebuts or alternative_analysis_of and
a donto_proof_obligation with kind
needs_contradiction_review.
I5. Machine confidence is not maturity. A model may
report confidence on [0, 1]. Maturity is earned by evidence quality,
review, cross-source corroboration, or formal validation. Auto-promotion
is capped at E2 for any extraction-produced claim; E3+ requires a human
review decision. The helpers.py:54-59
confidence-to-maturity mapping (0.95 → 4, 0.8 → 3, 0.6 → 2, 0.4 → 1,
else 0) sets the ceiling, not the floor.
I6. Governance propagates to derivatives. A claim derived from a restricted source inherits the most restrictive applicable policy of its source anchors. Embeddings, translations, summaries, and exports all inherit. Overrides require a qualified authority's attestation.
I7. Schema mappings are typed and scoped. No two
predicates are "the same" by default. Alignment edges declare one of
eleven relations (exact_equivalent,
close_match, broad_match,
narrow_match, inverse_of,
decomposes_to, has_value_mapping,
incompatible_with, derived_from,
local_specialization, not_equivalent) and
three per-edge safety booleans: safe_for_query_expansion,
safe_for_export, safe_for_logical_inference.
Closure expansion respects safety flags.
I8. Identity is a hypothesis, not a foreign key.
Person, place, language, lexeme, morpheme, source, specimen, case, and
concept identity may all be contested. The system stores identity
hypotheses (eight kinds: same_as,
different_from, broader_than,
narrower_than, split_candidate,
merge_candidate, successor_of,
alias_of) and lets users query under selected identity
lenses (strict, likely, exploratory, custom).
I9. Adapters must report information loss. Every
import and export adapter produces a LossReport that
explicitly names what the source format cannot represent (governance,
contradiction, time, n-ary frames, anchors, review state).
I10. A release is a reproducible view. A release is a named query plus a policy report, source manifest, transformation manifest, checksum manifest, and reproducibility contract.
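Invariant I5 separates what a model reports from what a claim has earned. A minimal sketch of how the confidence ceiling and the E2 auto-promotion cap might compose follows. The thresholds are the ones quoted in I5; the inclusive `>=` comparison, the function names, and the reading of the mapped integers as ladder levels in ascending order are assumptions made for illustration, not a transcription of helpers.py.

```python
# Thresholds quoted in I5: 0.95 -> 4, 0.8 -> 3, 0.6 -> 2, 0.4 -> 1, else 0.
# Assumption: each threshold is inclusive (>=).
CEILING_THRESHOLDS = [(0.95, 4), (0.8, 3), (0.6, 2), (0.4, 1)]
AUTO_PROMOTION_CAP = 2  # E2: highest level an unreviewed extraction may hold


def maturity_ceiling(confidence: float) -> int:
    """Highest maturity level that a machine confidence in [0, 1] permits."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    for threshold, level in CEILING_THRESHOLDS:
        if confidence >= threshold:
            return level
    return 0


def auto_maturity(confidence: float, earned: int) -> int:
    """Maturity an extraction-produced claim may hold without human review.

    `earned` is the level justified by evidence; confidence can only lower
    it, and the auto-promotion cap applies regardless of either value.
    """
    return min(earned, maturity_ceiling(confidence), AUTO_PROMOTION_CAP)
```

Under these assumptions a claim reported at confidence 0.97 with evidence that would justify a higher level still stops at E2 until a reviewer decides otherwise.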
donto tracks a six-level epistemic ladder per claim:
| Level | Name | Earned by |
|---|---|---|
| E0 | Raw | Source registered, policy classified. |
| E1 | Candidate | Evidence anchor or hypothesis_only. |
| E2 | Evidence-supported | Anchor validation, policy inheritance, no malformed terms. |
| E3 | Reviewed | Human or authorised reviewer decision. |
| E4 | Corroborated | Multiple independent anchors or accepted argument analysis. |
| E5 | Certified | Machine-checkable certificate, formal shape, or domain proof. |
Promotion is monotonic per claim event. A claim may be superseded or
retracted; the maturity history remains queryable. The flags smallint on
donto_statement packs polarity into bits 0–1 and maturity
into bits 2–4 (migration 0002). The stored codes are not monotone in the
ladder: stored 4 denotes "E5 Certified" and stored 5 denotes "E4 Corroborated".
Migration 0102 documents this explicitly so that the ordering helpers account for it.
Three orthogonal axes describe the epistemic shape of a claim.
Polarity is one of asserted (default),
negated (explicit rejection — "X is not Y"),
absent (the source explicitly does not mention this), or
unknown (the source mentions the claim but is unclear about it). It is
packed into the flags smallint and queryable via DontoQL's
POLARITY clause.
Modality is a sparse overlay
(donto_statement_modality, migration 0099) with fifteen
values: descriptive, prescriptive,
reconstructed, inferred,
elicited, corpus_observed,
typological_summary, experimental_result,
clinical_observation, legal_holding,
archival_metadata, oral_history,
community_protocol, model_output,
other. Statements without a modality row are present in the
system but filtered out of modality-restricted queries.
Confidence is stored as up to four parallel values
(donto_confidence, migration 0101):
machine_confidence (model-reported),
calibrated_confidence (empirically calibrated against
reviewer decisions), human_confidence (reviewer-reported),
and source_reliability_weight (source/method-level).
Queries may select a confidence lens; the system does not collapse to a
scalar by default.
Every statement is filed under exactly one context (the
donto_statement.context column references
donto_context.iri). Contexts form a forest (migration 0001)
with parent links; multiple parents are supported for secondary
attachments via donto_statement_context (migration 0103). A
context has a kind (source, snapshot,
hypothesis, user, pipeline,
trust, derivation, quarantine,
custom, system) and a mode
(permissive or curated). Curated contexts
route shape violations to a quarantine context rather than rejecting
them outright; permissive contexts accept and emit a proof
obligation.
The default context for any assert call that does not specify one is
donto:anonymous. There is no nullable context column
anywhere in the schema.
donto_statement.valid_time is a daterange
capturing world-time applicability (when the claim was true in the
world). tx_time is a tstzrange capturing
system-time belief (when the row was asserted). Both ranges are lower-inclusive
('[)'); lower_inc(tx_time) is enforced by a table-level
CHECK constraint.
Retraction (donto_retract) closes the upper bound of
tx_time without creating a new row; the statement
transitions from "currently believed" to "was once believed". Correction
(donto_correct) retracts the old row and inserts a new one
referencing the prior via lineage (donto_stmt_lineage).
Both operations emit donto_audit log rows.
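The retraction and correction semantics can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the SQL implementation: the `Store` and `Row` classes, integer statement ids, and method names are hypothetical, and audit logging is omitted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Row:
    statement_id: int
    content: str
    tx_lower: datetime
    tx_upper: Optional[datetime] = None  # None means "still believed"


@dataclass
class Store:
    rows: list = field(default_factory=list)
    lineage: list = field(default_factory=list)  # (new_id, prior_id) pairs

    def assert_(self, content: str, at: datetime) -> int:
        row = Row(len(self.rows) + 1, content, at)
        self.rows.append(row)
        return row.statement_id

    def retract(self, statement_id: int, at: datetime) -> None:
        row = self.rows[statement_id - 1]
        if row.tx_upper is None:
            row.tx_upper = at  # close the range; the row is never deleted

    def correct(self, statement_id: int, content: str, at: datetime) -> int:
        self.retract(statement_id, at)
        new_id = self.assert_(content, at)
        self.lineage.append((new_id, statement_id))
        return new_id

    def believed_as_of(self, at: datetime) -> list:
        # '[)' semantics: lower bound inclusive, upper bound exclusive.
        return [r.content for r in self.rows
                if r.tx_lower <= at and (r.tx_upper is None or at < r.tx_upper)]


def day(n: int) -> datetime:
    return datetime(2026, 5, n, tzinfo=timezone.utc)


store = Store()
first = store.assert_("annie born 1898", day(1))
second = store.correct(first, "annie born 1899", day(10))
```

Asking the store what was believed on 5 May returns the original claim, while asking on or after 10 May returns the correction; both rows remain present throughout.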
The discipline extends through every mutating object via
donto_event_log (migration 0090) — alignments, identity
edges, policies, attestations, reviews, releases. The single rule the
substrate enforces: never DELETE FROM donto_statement. The
CLAUDE.md non-negotiable list states this in capitals.
The substrate is 91 PostgreSQL relations (84 donto_*
tables and seven materialized/standard views) defined across 131
idempotent migrations. We describe the highest-load core here and refer
the reader to Appendix D of the PRD for the full table inventory.
donto_statement
-- packages/sql/migrations/0001_core.sql:43-96
create table if not exists donto_statement (
statement_id uuid primary key default gen_random_uuid(),
subject text not null,
predicate text not null,
object_iri text,
object_lit jsonb, -- {"v": <value>, "dt": <datatype-iri>, "lang": <tag-or-null>}
context text not null references donto_context(iri),
tx_time tstzrange not null default tstzrange(now(), null, '[)'),
valid_time daterange not null default daterange(null, null, '[)'),
flags smallint not null default 0,
content_hash bytea generated always as (digest(...)) stored,
constraint donto_statement_object_one_of
check ((object_iri is not null) <> (object_lit is not null)),
constraint donto_statement_tx_lower_inc check (lower_inc(tx_time))
);
create unique index if not exists donto_statement_open_content_uniq
on donto_statement (content_hash) where upper(tx_time) is null;
create index if not exists donto_statement_spo_idx
on donto_statement (subject, predicate, object_iri);
create index if not exists donto_statement_pos_idx
on donto_statement (predicate, object_iri, subject);
create index if not exists donto_statement_osp_idx
on donto_statement (object_iri, subject, predicate)
where object_iri is not null;
create index if not exists donto_statement_valid_time_idx
on donto_statement using gist (valid_time);
create index if not exists donto_statement_tx_time_idx
on donto_statement using gist (tx_time);
create index if not exists donto_statement_object_lit_gin
on donto_statement using gin (object_lit jsonb_path_ops)
where object_lit is not null;
Three indexes cover the standard SPO/POS/OSP join orders. GiST
indexes handle bitemporal range queries. Literal-object substring search
rides the trigram index added in migration 0131
(object_iri_trgm). Idempotence on assert is enforced by a
partial unique index on
(content_hash) where upper(tx_time) is null: only
currently-believed rows must be unique by content. A retracted
row with the same content can later be re-asserted as a separate
row.
-- packages/sql/migrations/0002_flags.sql
create or replace function donto_pack_flags(polarity text, maturity int)
returns smallint language sql immutable as $$
select (
(case lower(polarity)
when 'asserted' then 0 when 'negated' then 1
when 'absent' then 2 when 'unknown' then 3 else null end)
| ((maturity & 7) << 2)
)::smallint
$$;
Bits 0–1 are polarity, bits 2–4 are maturity, bits 5–15 are reserved.
The function is IMMUTABLE and PARALLEL SAFE; a
Rust mirror exists in pg_donto/src/lib.rs:211-263 for
plan-quality hot paths.
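The bit layout can be checked with a small Python mirror of the packing function, together with its inverse. This is an illustrative sketch: the function names are hypothetical, and unlike the SQL version (which yields NULL for an unrecognised polarity) this one raises a KeyError.

```python
POLARITY_CODES = {"asserted": 0, "negated": 1, "absent": 2, "unknown": 3}
POLARITY_NAMES = {code: name for name, code in POLARITY_CODES.items()}


def pack_flags(polarity: str, maturity: int) -> int:
    """Pack polarity into bits 0-1 and maturity into bits 2-4."""
    code = POLARITY_CODES[polarity.lower()]  # KeyError on an unrecognised name
    return code | ((maturity & 7) << 2)


def unpack_flags(flags: int) -> tuple:
    """Recover (polarity, stored maturity code) from a packed value."""
    return POLARITY_NAMES[flags & 0b11], (flags >> 2) & 0b111
```

For example, a negated claim at stored maturity 3 packs to 1 | (3 << 2) = 13, and unpacking 13 returns the same pair.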
-- packages/sql/migrations/0001_core.sql:17-38
create table if not exists donto_context (
iri text primary key,
kind text not null check (kind in (
'source','snapshot','hypothesis','user','pipeline',
'trust','derivation','quarantine','custom','system')),
parent text references donto_context(iri),
label text,
metadata jsonb not null default '{}'::jsonb,
mode text not null default 'permissive'
check (mode in ('permissive','curated')),
created_at timestamptz not null default now(),
closed_at timestamptz,
constraint donto_context_no_self_parent
check (parent is distinct from iri)
);
donto_resolve_scope (migration 0003) walks the context
forest either downward (default: include descendants) or upward
(optional: include ancestors), with set-based include/exclude lists.
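A scope walk of this kind might look as follows over an in-memory parent map. The function signature is hypothetical, and the sketch assumes that exclusions are expanded with the same direction flags as inclusions, which the text does not state.

```python
def resolve_scope(parents, include, exclude=(), descendants=True, ancestors=False):
    """Resolve a scope to a set of context IRIs.

    parents maps each context IRI to its parent IRI (None for a root).
    """
    children = {}
    for iri, parent in parents.items():
        if parent is not None:
            children.setdefault(parent, []).append(iri)

    def down(root):
        seen, stack = set(), [root]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, []))
        return seen

    def up(node):
        seen = set()
        while node is not None and node not in seen:
            seen.add(node)
            node = parents.get(node)
        return seen

    def expand(roots):
        out = set()
        for root in roots:
            out.add(root)
            if descendants:
                out |= down(root)
            if ancestors:
                out |= up(root)
        return out

    return expand(include) - expand(exclude)


forest = {"ctx:root": None, "ctx:a": "ctx:root",
          "ctx:a1": "ctx:a", "ctx:b": "ctx:root"}
```

With this forest, including ctx:a yields ctx:a and ctx:a1 by default, and including ctx:root while excluding ctx:a leaves ctx:root and ctx:b.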
donto_document (migration 0023) registers a source
artefact with a required policy_id (post-0123).
donto_document_revision holds content-addressed revision
bodies (the blob backend stores the actual bytes; the revision holds the
metadata and optionally inlines short bodies). donto_span
carries char-offset spans over revisions.
-- packages/sql/migrations/0029_evidence_links.sql:11-59
create table if not exists donto_evidence_link (
link_id uuid primary key default gen_random_uuid(),
statement_id uuid not null references donto_statement(statement_id),
link_type text not null check (link_type in (
'extracted_from', 'supported_by', 'contradicted_by',
'derived_from', 'cited_in', 'anchored_at', 'produced_by'
)),
target_document_id uuid references donto_document(document_id),
target_revision_id uuid references donto_document_revision(revision_id),
target_span_id uuid references donto_span(span_id),
target_annotation_id uuid references donto_annotation(annotation_id),
target_run_id uuid references donto_extraction_run(run_id),
target_statement_id uuid references donto_statement(statement_id),
confidence double precision,
context text references donto_context(iri),
tx_time tstzrange not null default tstzrange(now(), null, '[)'),
metadata jsonb not null default '{}'::jsonb,
created_at timestamptz not null default now(),
constraint donto_evidence_link_has_target check (
(target_document_id is not null)::int +
(target_revision_id is not null)::int +
(target_span_id is not null)::int +
(target_annotation_id is not null)::int +
(target_run_id is not null)::int +
(target_statement_id is not null)::int = 1
),
constraint donto_evidence_link_tx_lower_inc check (lower_inc(tx_time))
);
donto_argument (migration 0031) holds typed support /
attack relations between statements (nine relations including
supports, rebuts, undercuts,
qualifies, endorses, supersedes,
potentially_same, same_referent,
same_event) with a strength on [0, 1] and an open-edge
unique index that lets the same pair of statements be related
differently in different contexts.
donto_entity_symbol (migration 0057) records every
freely-minted IRI a model or import produces, with a trigram index on
the normalised label for blocking. donto_identity_edge
(migration 0060) records weighted, bitemporal coreference assertions
with one of four relations: same_referent,
possibly_same_referent, distinct_referent,
not_enough_information. The edge table constrains
left_symbol_id < right_symbol_id to avoid
double representation.
donto_identity_hypothesis names a clustering solution
over the edges (e.g., the strict_identity_v1 hypothesis
only takes edges with confidence ≥ 0.98 and no cannot-link).
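One way to picture a strict identity hypothesis is as union-find clustering over the qualifying edges. The sketch below is an assumption-laden illustration: the function name and tuple layout are hypothetical, and "no cannot-link" is read at the pair level (a same_referent edge is skipped if the same pair also carries a distinct_referent edge).

```python
def cluster(symbols, edges, min_confidence=0.98):
    """Cluster symbols from (left, right, relation, confidence) edges."""
    parent = {s: s for s in symbols}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pairs explicitly asserted to be distinct act as cannot-link constraints.
    cannot_link = {frozenset((left, right))
                   for left, right, relation, _ in edges
                   if relation == "distinct_referent"}

    for left, right, relation, confidence in edges:
        if (relation == "same_referent"
                and confidence >= min_confidence
                and frozenset((left, right)) not in cannot_link):
            parent[find(left)] = find(right)

    groups = {}
    for s in symbols:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(group) for group in groups.values())


edges = [
    ("a", "b", "same_referent", 0.99),
    ("b", "c", "same_referent", 0.90),      # below the strict threshold
    ("c", "d", "same_referent", 0.99),
    ("c", "d", "distinct_referent", 1.00),  # cannot-link on the same pair
]
clusters = cluster(["a", "b", "c", "d"], edges)
```

A pair-level check does not stop two cannot-linked symbols from being merged transitively through a third; a stricter lens would need to test whole clusters before each union.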
-- packages/sql/migrations/0048_predicate_alignment.sql
create table if not exists donto_predicate_alignment (
alignment_id uuid primary key default gen_random_uuid(),
source_iri text not null,
target_iri text not null,
relation text not null check (relation in (
'exact_equivalent', 'inverse_equivalent',
'sub_property_of', 'close_match',
'decomposition', 'not_equivalent'
)),
confidence double precision not null default 1.0
check (confidence >= 0 and confidence <= 1),
valid_time daterange not null default daterange(null, null, '[)'),
tx_time tstzrange not null default tstzrange(now(), null, '[)'),
run_id uuid,
provenance jsonb not null default '{}'::jsonb,
registered_by text,
registered_at timestamptz not null default now(),
constraint donto_pa_distinct check (source_iri <> target_iri),
constraint donto_pa_tx_lower_inc check (lower_inc(tx_time))
);
A materialised closure table (donto_predicate_closure)
pre-computes transitive chains so the evaluator can ride alignments at
query time without a recursive CTE per row. The closure is rebuilt by
donto_rebuild_predicate_closure().
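The idea of pre-computing transitive chains while respecting a per-edge safety flag can be sketched as a breadth-first closure. This is not the logic of donto_rebuild_predicate_closure(); the function names are hypothetical, edges are assumed to be directed from source to target, and relation types and confidences are ignored.

```python
from collections import deque


def build_closure(edges):
    """Map each source predicate to every predicate reachable via safe edges.

    edges are (source_iri, target_iri, safe_for_query_expansion) triples.
    """
    adjacency = {}
    for source, target, safe in edges:
        if safe:  # unsafe edges never contribute to expansion
            adjacency.setdefault(source, set()).add(target)

    closure = {}
    for start in adjacency:
        reached, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt != start and nxt not in reached:
                    reached.add(nxt)
                    queue.append(nxt)
        closure[start] = reached
    return closure


def expand_predicate(closure, predicate):
    """A predicate plus everything the closure lets a query ride to."""
    return {predicate} | closure.get(predicate, set())


closure = build_closure([
    ("ex:bornIn", "ex:birthPlace", True),
    ("ex:birthPlace", "schema:birthPlace", True),
    ("ex:bornIn", "ex:diedIn", False),
])
```

At query time the evaluator then needs only a set lookup per predicate rather than a recursive walk.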
donto_policy_capsule (migration 0111) holds a policy
with fifteen boolean allowed-actions (read_metadata,
read_content, quote,
view_anchor_location, derive_claims,
derive_embeddings, translate,
summarize, export_claims,
export_sources, export_anchors,
train_model, publish_release,
share_with_third_party, federated_query).
donto_access_assignment maps targets (document,
revision, span, context, statement, frame, release, entity, predicate)
to policies.
donto_attestation (migration 0112) records credentials:
a holder agent, an issuer agent, a policy IRI, a subset of allowed
actions, a purpose (review,
community_curation, private_research,
publication, model_training,
audit, extraction, federation,
inspection), a required rationale, and
lifecycle fields (issued_at, expires_at,
revoked_at).
The top-level access check
donto_authorise(holder, target_kind, target_id, action)
combines AND semantics across policies with OR semantics across attestations. The
effective-action helper donto_effective_actions does a
bool_and over all assigned policies for the target,
defaulting to the fail-closed restricted policy when none is
assigned.
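The fail-closed conjunction over policies can be sketched in a few lines. The function name and dict representation are hypothetical; the sketch covers only the effective-action step and leaves attestations out.

```python
ACTIONS = (
    "read_metadata", "read_content", "quote", "view_anchor_location",
    "derive_claims", "derive_embeddings", "translate", "summarize",
    "export_claims", "export_sources", "export_anchors", "train_model",
    "publish_release", "share_with_third_party", "federated_query",
)

# Fail-closed default: zero allowed actions.
RESTRICTED_DEFAULT = {action: False for action in ACTIONS}


def effective_actions(assigned_policies):
    """AND each action across all policies assigned to a target.

    An action missing from a policy dict counts as not allowed, and a target
    with no assigned policy falls back to the restricted default.
    """
    policies = assigned_policies or [RESTRICTED_DEFAULT]
    return {action: all(p.get(action, False) for p in policies)
            for action in ACTIONS}


open_policy = {"read_metadata": True, "read_content": True}
narrow_policy = {"read_metadata": True}
combined = effective_actions([open_policy, narrow_policy])
```

Because the combination is a conjunction, adding a policy to a target can only remove permissions, never add them.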
The live database holds:
| Object | Count |
|---|---|
| Statements | 39,294,083 |
| Distinct predicates | 938,918 |
| Distinct contexts | 19,230 |
| Evidence links | 1,837,151 |
| Currently-believed statements (upper_inf(tx_time)) | 39,293,802 |
| Retracted statements | 281 |
| Database total on disk | 48 GB |
| donto_statement table size | 32 GB |
The retraction rate is 281 / 39,294,083 ≈ 7.15 × 10⁻⁶ (see the empirical discussion in §13).
The polarity distribution is heavily skewed toward
asserted:
| Polarity | Count |
|---|---|
| asserted | 39,292,908 |
| negated | 813 |
| unknown | 331 |
| absent | 31 |
The maturity distribution shows the cap-at-E1-or-below pattern of machine extraction:
| Maturity | Count |
|---|---|
| 0 (raw) | 22,082,198 |
| 1 (candidate) | 14,870,096 |
| 2 (evidence-supp) | 47,905 |
| 3 (reviewed) | 244,806 |
| 4 (corroborated)* | 2,049,078 |
| 5 (certified)* | — |
*The flags-bit encoding stores 4 = E5 Certified and
5 = E4 Corroborated (migration 0102), reflecting a
historical naming ambiguity preserved for backward compatibility.
DontoQL v2 is a 21-clause query language compiled to a unified
algebra alongside a strict SPARQL 1.1 subset. The parser is hand-rolled
(packages/donto-query/src/dontoql.rs); both surfaces emit
the same algebra::Query struct, evaluated by
packages/donto-query/src/evaluator.rs.
query := keyword_clause+
keyword_clause :=
'SCOPE' scope_descriptor
| 'PRESET' IDENT_or_PREFIXED_or_STRING
| 'MATCH' triple (',' triple)*
| 'FILTER' filter_expr (',' filter_expr)*
| 'POLARITY' ident_in_set
| 'MATURITY' '>='? INT
| 'IDENTITY' ident
| 'IDENTITY_LENS' ident
| 'PREDICATES' ('EXPAND' | 'STRICT' | 'EXPAND_ABOVE' INT)
| 'MODALITY' ident (',' ident)*
| 'EXTRACTION_LEVEL' ident (',' ident)*
| 'TRANSACTION_TIME' 'AS_OF' STRING_or_PREFIXED
| 'AS_OF' STRING_or_PREFIXED
| 'POLICY' 'ALLOWS' ident
| 'SCHEMA_LENS' (iri | ident)
| 'EXPANDS_FROM' 'concept' '(' iri ')' 'USING' 'schema_lens' '(' iri ')'
| 'ORDER_BY' ident ('DESC'|'ASC')?
| 'WITH' 'evidence' '=' ident
| 'PROJECT' var (',' var)*
| 'LIMIT' INT
| 'OFFSET' INT
triple := term term term ('IN' term)?
term := var | iri | string-lit | int-lit
var := '?' IDENT
iri := '<' chars '>' | PREFIXED
filter_expr := term op term -- op ∈ { = != < <= > >= }
Clauses may appear in any order. Whitespace is insignificant;
# introduces an end-of-line comment.
With no other clauses, MATCH ?s ?p ?o returns asserted,
currently-believed rows across every context, at any maturity, with the
identity lens default (no expansion across sameAs clusters),
predicate expansion expand (rides the alignment closure), and
no ordering. The choice to default to asserted-only
prevents contradictions from leaking into queries without explicit
request.
Contested birth claims as of a system time. Find every claim about Annie Davis's birth that disagrees with another claim, in research contexts where the maturity is at least E2 (evidence-supported), as of the state of the store on 2026-04-01:
SCOPE include ctx:genes/annie-davis ancestors
PRESET curated
MATCH ?stmt ex:about ex:annie-davis,
?stmt ex:predicate ex:born_in,
?stmt ex:object ?place
FILTER ?place != "unknown"
POLARITY asserted
TRANSACTION_TIME AS_OF "2026-04-01T00:00:00Z"
PREDICATES EXPAND_ABOVE 75
PROJECT ?stmt, ?place
LIMIT 50
Release-safe claims. Show only claims under a hypothesis context that the policy permits for publication, with evidence redacted where required:
SCOPE include ctx:project:language-pilot ancestors
MATCH ?stmt ?p ?o
MATURITY >= 3
POLICY ALLOWS publish_release
WITH evidence = redacted_if_required
Cluster-expansion search. Same query, broaden across identity clusters and alignment expansion:
MATCH ?s ex:bornInPlace ?city
IDENTITY_LENS expand_clusters
PREDICATES EXPAND_ABOVE 70
LIMIT 100
Contradiction frontier. Order results by how contested each binding's leading statement is:
MATCH ?stmt ex:about ?subject,
?stmt ex:predicate ?pred
ORDER_BY contradiction_pressure DESC
PROJECT ?stmt, ?subject, ?pred
LIMIT 50
contradiction_pressure is
attack_count − support_count from
donto_contradiction_frontier, joined against the binding's
most recent matched statement_id. Rows without any argument
edges sort to pressure = 0.
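The ordering rule can be made concrete with a short sketch. The function name and data shapes are hypothetical, and each binding is simplified to carry a single statement id rather than a "most recent matched" one.

```python
def order_by_pressure(bindings, frontier):
    """Sort bindings by attack_count - support_count, most contested first.

    bindings is a list of dicts with a 'stmt' key; frontier maps a statement
    id to an (attack_count, support_count) pair. Statements absent from the
    frontier have no argument edges and therefore pressure 0.
    """
    def pressure(binding):
        attacks, supports = frontier.get(binding["stmt"], (0, 0))
        return attacks - supports

    return sorted(bindings, key=pressure, reverse=True)


bindings = [{"stmt": "s1"}, {"stmt": "s2"}, {"stmt": "s3"}]
frontier = {"s1": (1, 3), "s2": (4, 1)}
ordered = [b["stmt"] for b in order_by_pressure(bindings, frontier)]
```

Here s2 (pressure 3) sorts first, s3 (no argument edges, pressure 0) second, and s1 (pressure −2) last, since a well-supported statement ranks below an uncontested one.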
As-of historical view. Reconstruct what was known last week:
MATCH ?p ex:about ex:somebody, ?p ?pred ?val
AS_OF "2026-05-21T00:00:00Z"
LIMIT 100
The SPARQL parser (packages/donto-query/src/sparql.rs)
accepts PREFIX, SELECT (including
SELECT *), WHERE, GRAPH, basic
FILTER (numeric and string comparison), LIMIT,
and OFFSET. Property paths, OPTIONAL,
UNION, aggregates (COUNT, SUM,
GROUP BY), and mutation (INSERT DATA,
DELETE DATA) are deliberately out of scope. donto's
mutating operations are expressed through donto_assert /
donto_retract / donto_correct at the SQL
layer; the query language is read-only.
The Phase-4 evaluator is a nested-loop join with variable unification:
// packages/donto-query/src/evaluator.rs
for pattern in &query.patterns {
    // Bindings that survive this pattern; replaces `env` once the pass ends.
    let mut next_env = Vec::new();
    for binding in &env {
        // Fill in variables already bound by earlier patterns.
        let substituted = substitute(pattern, binding);
        let rows = client.match_pattern(
            substituted.subject_iri,
            substituted.predicate_iri,
            substituted.object_iri,
            scope, polarity, min_maturity,
            as_of_tx, as_of_valid,
        ).await?;
        for row in rows {
            // Keep the row only if it agrees with the existing bindings.
            if let Some(extended) = unify(binding, pattern, &row) {
                next_env.push(extended);
            }
        }
    }
    env = next_env;
}
// Post-join stages run once over the final binding set.
apply_filters(&mut env, &query.filters);
apply_overlays(&mut env, query.modality, query.extraction_level);
apply_policy_gate(&mut env, query.policy_allows);
apply_order_by(&mut env, query.order_by);
apply_offset_limit(&mut env, query.offset, query.limit);
attach_evidence(&mut env, query.evidence_shape);
This is correct but unoptimised: the query planner is a deliberately
deferred Phase-10 work item. The evaluator's HTTP entry point at
POST /dontoql answers simple patterns at ~50 ms p50 at the
current scale.
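The same nested-loop strategy can be exercised end to end on an in-memory triple list. The sketch below is a simplified Python analogue, not the Rust evaluator: `match_pattern` stands in for the database call, and scope, polarity, maturity, and time filters are left out.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def substitute(pattern, binding):
    """Replace variables that are already bound with their values."""
    return tuple(binding.get(t, t) if is_var(t) else t for t in pattern)


def match_pattern(store, pattern):
    """Stand-in for the database call: unbound variables act as wildcards."""
    return [row for row in store
            if all(is_var(p) or p == v for p, v in zip(pattern, row))]


def unify(binding, pattern, row):
    """Extend a binding with a row, or return None on a conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, row):
        if is_var(term) and extended.setdefault(term, value) != value:
            return None  # the same variable would take two different values
    return extended


def evaluate(store, patterns):
    env = [{}]  # one empty binding to extend
    for pattern in patterns:
        next_env = []
        for binding in env:
            for row in match_pattern(store, substitute(pattern, binding)):
                extended = unify(binding, pattern, row)
                if extended is not None:
                    next_env.append(extended)
        env = next_env
    return env


store = [
    ("annie", "parent_of", "tom"),
    ("annie", "born_in", "cairns"),
    ("tom", "born_in", "mareeba"),
]
result = evaluate(store, [("annie", "parent_of", "?child"),
                          ("?child", "born_in", "?place")])
```

Each pattern issues one lookup per surviving binding, which is why the cost grows with the size of the intermediate binding set and why a planner that reorders patterns would help.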
The extraction layer is the bridge from natural-language sources to
typed claims. donto's design rejects the "single LLM call with an 8-tier
prompt" approach the predecessor used. Three structural failures
motivated the replacement (docs/EXTRACTION-MAXIMALISM.md
L15-31):
The replacement uses six apertures — independent specialised passes over the same source — and content-hash deduplication across the union.
| Aperture | What it mines | Modality | Confidence |
|---|---|---|---|
| Surface | Explicitly-stated claims | asserted, anchored | 0.95–1.0 |
| Linguistic | Clause-by-clause: every NP → entity, VP → event, modifier → property | asserted, anchored | 0.85–1.0 |
| Presupposition | What the text takes for granted but does not assert | hypothesis_only, anchored to trigger | 0.7–0.95 |
| Inferential | Common-knowledge consequences of stated facts | asserted, anchored to trigger | 0.4–0.7 |
| Conceivable | "Hairs on the head" claims that could plausibly hold given entity types | hypothesis_only, no anchor | 0.85 (it is conceivable) |
| Recursive | Re-runs Surface with newly-discovered entities as seeds | asserted, anchored | 0.85–1.0 |
A 1,376-character biographical text run through both pipelines gave:
1-pass tier : 95 facts at $0.0042 in 65 s
6-pass aperture : 341 facts at $0.0252 in 449 s (3.6× yield, 6× cost)
surface : 87
linguistic : 127
presupposition : 34
inferential : 12
conceivable : 54
recursive : 27
distinct predicates: 171
distinct subjects : 70
anchor coverage : 0.842
hypothesis density : 0.258
dedup collisions : 6
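The summary figures in this run can be re-derived from the per-aperture counts. The sketch below assumes that "hypothesis density" is the share of facts coming from the two hypothesis_only apertures (Presupposition and Conceivable); that reading reproduces the reported 0.258 but is an inference, not something the run log states.

```python
# Per-aperture fact counts copied from the run above.
aperture_counts = {
    "surface": 87,
    "linguistic": 127,
    "presupposition": 34,
    "inferential": 12,
    "conceivable": 54,
    "recursive": 27,
}

total = sum(aperture_counts.values())

# Assumption: density = facts from hypothesis_only apertures / all facts.
hypothesis_only = aperture_counts["presupposition"] + aperture_counts["conceivable"]
hypothesis_density = hypothesis_only / total

yield_ratio = total / 95          # facts: 6-pass vs 1-pass
cost_ratio = 0.0252 / 0.0042      # dollars: 6-pass vs 1-pass
wall_ratio = 449 / 65             # seconds: 6-pass vs 1-pass
```

The wall-clock ratio (about 6.9×) is slightly worse than the cost ratio, which the headline "3.6× yield, 6× cost" does not show.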
The yield is a floor, not a target. The Maximalism doc sets out the ambition to push toward ~20,000–30,000 facts per source at fuller aperture coverage, gated by cost (target ≈ $0.02–0.05 per 1 kB of source for the full 15-aperture pass, $0.001/kB on cache hits).
The conceivable aperture is the most controversial. It deliberately floods the candidate space with unanchored, hypothesis-only claims (persons have hair, organisations have employees, projects have contributors). The position is explicit:
Maximal extraction is a design stance, not a yield target. Mine everything. Quarantine the malformed. Flag the hypothetical. Let the curation gate, not the extractor, decide what counts. (EXTRACTION-MAXIMALISM L311-314)
Quarantine is implemented as a sink that routes invalid candidates to
a ctx:quarantine/<source> context with policy
restricted_pending_review. Conceivable-aperture output
lands in the candidate space at maturity E1 with
hypothesis_only=true; downstream curation (Trust Kernel
policy gate, maturity ladder, reviewer acceptance) decides what survives
into a release.
The Python implementation in apps/donto-api/extraction/
is laid out as:
apertures.py -- six aperture prompts (Surface, Linguistic, ...)
exhaustive.py -- multi-pass orchestrator (asyncio.gather + dedup)
dispatch.py -- single-pass M5 path (still useful for cheap runs)
validation.py -- hard-gate validator (anchor + hypothesis_only invariant)
quarantine.py -- quarantine sink
policy_gate.py -- Trust Kernel probe before any external model call
main.py::extract_exhaustive -- POST /extract/exhaustive
The exhaustive orchestrator (exhaustive.py:138-244)
gathers the non-recursive apertures in parallel via
asyncio.gather, dedups in-flight using a content-key hash,
then seeds Recursive from the 12 most frequent subject and object
IRIs in the union. Vocabulary-aware prompting (vocab.py)
injects the current top-80 predicates and entity candidates into the
system prompt so the model reuses existing IRIs rather than minting
fresh ones. We discuss the empirical fallout of this prompt in §13.
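The orchestration pattern described above (parallel gather, content-key dedup, then a seeded recursive pass) can be sketched as follows. This is not the code in exhaustive.py: the fact shape, the key fields, the `ex:` prefix test for IRIs, and the stub apertures are all assumptions made for illustration.

```python
import asyncio
import hashlib
from collections import Counter
from typing import Awaitable, Callable, Dict, List

Fact = Dict[str, str]
Aperture = Callable[[str], Awaitable[List[Fact]]]


def content_key(fact: Fact) -> str:
    # Assumed key: hash of (subject, predicate, object).
    raw = "\x1f".join((fact["subject"], fact["predicate"], fact["object"]))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def dedup(facts: List[Fact]) -> List[Fact]:
    seen: Dict[str, Fact] = {}
    for fact in facts:
        seen.setdefault(content_key(fact), fact)  # first occurrence wins
    return list(seen.values())


def top_seeds(facts: List[Fact], k: int = 12) -> List[str]:
    counts: Counter = Counter()
    for fact in facts:
        counts[fact["subject"]] += 1
        if fact["object"].startswith("ex:"):  # hypothetical IRI test
            counts[fact["object"]] += 1
    return [iri for iri, _ in counts.most_common(k)]


async def extract_exhaustive(source: str, apertures: List[Aperture], recursive) -> List[Fact]:
    batches = await asyncio.gather(*(aperture(source) for aperture in apertures))
    union = dedup([fact for batch in batches for fact in batch])
    extra = await recursive(source, top_seeds(union))
    return dedup(union + extra)


# Stub apertures standing in for model calls.
BORN = {"subject": "ex:mary", "predicate": "ex:bornIn", "object": "ex:cornwall"}
NAME = {"subject": "ex:mary", "predicate": "ex:knownAs", "object": "Mary Watson"}


async def _surface(source: str) -> List[Fact]:
    return [BORN]


async def _linguistic(source: str) -> List[Fact]:
    return [dict(BORN), NAME]  # duplicates the surface fact


async def _recursive(source: str, seeds: List[str]) -> List[Fact]:
    return [{"subject": seeds[0], "predicate": "rdf:type", "object": "ex:Person"}]


facts = asyncio.run(
    extract_exhaustive("Mary Watson, born Cornwall 1860", [_surface, _linguistic], _recursive)
)
```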
helpers.py:54-59 maps model confidence to maturity
ceiling:
def confidence_to_maturity(c: float) -> int:
    if c >= 0.95: return 4
    if c >= 0.80: return 3
    if c >= 0.60: return 2
    if c >= 0.40: return 1
    return 0
This is a ceiling, not a floor: the post-PRD policy caps extraction-produced claims at E1 regardless of model confidence. The PRD I5 invariant is enforced by an ingest validator that drops high-confidence model claims back to E1 unless a reviewer attestation is present in the same transaction.
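The interaction between the confidence ceiling and the E1 cap can be sketched as below. The function `ingest_maturity` and the constant `EXTRACTION_CAP` are hypothetical names, and the behaviour when an attestation is present (fall back to the confidence ceiling) is an assumption; the text only says the cap is lifted.

```python
def confidence_to_maturity(c: float) -> int:
    # Mapping copied from the helper shown above.
    if c >= 0.95:
        return 4
    if c >= 0.80:
        return 3
    if c >= 0.60:
        return 2
    if c >= 0.40:
        return 1
    return 0


EXTRACTION_CAP = 1  # E1; hypothetical constant name


def ingest_maturity(confidence: float, has_reviewer_attestation: bool) -> int:
    """Maturity an extraction-produced claim lands at on ingest (sketch)."""
    ceiling = confidence_to_maturity(confidence)
    if has_reviewer_attestation:
        # Assumption: with an attestation the confidence ceiling applies.
        return ceiling
    return min(ceiling, EXTRACTION_CAP)
```

Under this reading a 0.97-confidence model claim lands at level 1 without review, and a 0.30-confidence claim stays at 0 because the cap never raises maturity.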
Long-running extraction jobs are submitted as Temporal workflows
(apps/donto-api/workflows.py). The four-stage pipeline
extracting → ingesting → aligning → resolving → completed
is implemented as activities with explicit retry policies (30-minute
exhaustive extract activity timeout, 3 retries; 5-minute ingest timeout,
5 retries; alignment and resolution similarly bounded). The workflow is
queryable mid-flight via the status method.
The Trust Kernel (PRD §M0) is the substrate's answer to invariants I2 and I6: no source without policy, and governance propagates to derivatives.
A donto_policy_capsule (migration 0111) is one of nine
policy kinds (public,
open_metadata_restricted_content,
community_restricted, embargoed,
licensed, private, regulated,
sealed, unknown_restricted) with a JSONB
allowed-actions object covering fifteen actions. An inheritance rule
(max_restriction is the default) defines how a derived row
combines the policies of its source anchors:
allowed_actions = jsonb_build_object(
'read_metadata', false,
'read_content', false,
'quote', false,
'view_anchor_location',false,
'derive_claims', false,
'derive_embeddings', false,
'translate', false,
'summarize', false,
'export_claims', false,
'export_sources', false,
'export_anchors', false,
'train_model', false,
'publish_release', false,
'share_with_third_party', false,
'federated_query', false
)
The boilerplate default is fail-closed; a public-by-default classification must be explicit.
The donto_effective_actions(target_kind, target_id)
function returns the AND across all policies assigned to the target via
donto_access_assignment. With no assignment, it returns the
fail-closed default policy's allowed-actions:
foreach v_key in array v_keys loop
select bool_and(coalesce((p.allowed_actions->>v_key)::boolean, false))
into v_allowed
from donto_access_assignment a
join donto_policy_capsule p on p.policy_iri = a.policy_iri
where a.target_kind = p_target_kind
and a.target_id = p_target_id
and p.revocation_status = 'active'
and (p.expiry is null or p.expiry > now());
v_result := v_result || jsonb_build_object(v_key, coalesce(v_allowed, false));
end loop;
return v_result;
Revoked, expired, and superseded policies do not contribute. The
choice of bool_and over bool_or
operationalises the max-restriction doctrine: every assigned
policy must permit the action for it to be permitted.
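The max-restriction semantics can be mirrored in Python to make the edge cases explicit. This is a sketch of the logic only: the dictionary shape, the boolean `expired` field (standing in for the expiry timestamp comparison), and the three-action subset are assumptions.

```python
from typing import Dict, List

# Subset of the fifteen actions, enough to show the behaviour.
ACTIONS = ("read_metadata", "read_content", "export_claims")


def effective_actions(policies: List[Dict]) -> Dict[str, bool]:
    """AND across active, unexpired policies; fail-closed when none apply."""
    active = [
        p for p in policies
        if p["revocation_status"] == "active" and not p["expired"]
    ]
    result = {}
    for action in ACTIONS:
        # SQL bool_and over zero rows is NULL, which coalesce turns into false.
        result[action] = bool(active) and all(
            p["allowed_actions"].get(action, False) for p in active
        )
    return result


public = {
    "revocation_status": "active",
    "expired": False,
    "allowed_actions": {"read_metadata": True, "read_content": True, "export_claims": True},
}
restricted = {
    "revocation_status": "active",
    "expired": False,
    "allowed_actions": {"read_metadata": True},
}
revoked = {
    "revocation_status": "revoked",
    "expired": False,
    "allowed_actions": {},
}

combined = effective_actions([public, restricted])
unassigned = effective_actions([])
ignoring_revoked = effective_actions([public, revoked])
```

A missing key counts as a denial, so adding any policy to a target can only narrow what is permitted.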
A donto_attestation is a holder credential: agent X is
entitled to perform action Y under policy Z for purpose P with rationale
R until expiry T. The check
donto_authorise(holder, target, action) combines the
effective-action AND with an attestation OR — if any attestation grants
the holder the action under any policy assigned to the target, the call
returns true.
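One way to read the authorisation rule is "effective action, or else a live attestation". The sketch below encodes that reading; the field names (`holder`, `action`, `policy_iri`, `revoked_at`, `expiry`) and the function signature are assumptions, not the donto_authorise implementation.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Set


def authorise(
    holder: str,
    action: str,
    effective: Dict[str, bool],
    assigned_policies: Set[str],
    attestations: Iterable[Dict],
    now: datetime,
) -> bool:
    # The policy AND already permits the action for everyone.
    if effective.get(action, False):
        return True
    # Otherwise any live attestation for this holder, action, and an
    # assigned policy is sufficient (the OR branch).
    for att in attestations:
        if (
            att["holder"] == holder
            and att["action"] == action
            and att["policy_iri"] in assigned_policies
            and att["revoked_at"] is None
            and att["expiry"] > now
        ):
            return True
    return False


now = datetime(2026, 5, 28, tzinfo=timezone.utc)
attestation = {
    "holder": "agent:elder-review",
    "action": "read_content",
    "policy_iri": "policy:community",
    "revoked_at": None,
    "expiry": datetime(2027, 1, 1, tzinfo=timezone.utc),
}
effective = {"read_content": False}
policies = {"policy:community"}

granted = authorise("agent:elder-review", "read_content", effective, policies, [attestation], now)
denied = authorise("agent:other", "read_content", effective, policies, [attestation], now)
```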
Revocation is immediate for new checks (set
revoked_at = now()); in-flight reads in the same
transaction may still proceed but cannot read newly-revoked data.
docs/REVIEW-FINDINGS.md records eighteen findings F-1
through F-18 from an adversarial review of the Trust Kernel. Sixteen are
DOC-severity (intentional, documented, no work required) and two are
HARD-severity, both now resolved. The historical F-1 finding is worth
recording in full because it illustrates how the substrate preserves
invariants in the face of legacy code paths:
Pre-migration 0123, the legacy donto_ensure_document SQL function did not require policy_id, leaving the I2 invariant ("no source without policy") enforced only by fail-closed read defaults rather than write-time refusal.
Resolution: migration 0123 backfilled NULL policy_id with the seeded fail-closed policy:default/restricted_pending_review (allowed_actions all false), set that capsule as the column default, promoted policy_id to NOT NULL, and validated the previously-NOT-VALID foreign key. The legacy path now succeeds but the produced row has the fail-closed policy; explicit-policy callers (via donto_register_source) continue to pass policy explicitly.
Tripwire (inverted): invariants_adversarial::legacy_register_document_lands_on_default_restricted_policy asserts that the legacy path succeeds and that the produced document has the fail-closed policy. If a future migration removes the default or the constraint, this test fails and prompts a deliberate decision.
Production state at resolution: 8,211 legacy donto_document rows backfilled to policy:default/restricted_pending_review. Read access remained fail-closed both before and after.
The pattern — backfill, then NOT NULL, then validate FK, all in one idempotent migration — is the standard donto fix for an invariant gap that has been silently tolerated.
The substrate side of the F-1 closure is complete. The remaining
work, recorded in ROADMAP-AFTER-MAY18.md, is the HTTP
middleware that enforces policy presence on write paths at the
sidecar layer rather than at the SQL layer. The genes corpus has
hundreds of unpoliced legacy sources (now bound to fail-closed defaults
at the substrate but still ingested via the legacy path); the middleware
work is the visible end-to-end testbed for the Trust Kernel beyond the
database layer.
donto's open-world entity-resolution architecture decomposes the "who is this person?" problem into five distinct objects:
donto_entity_symbol: a string identifier produced by extraction, legacy import, user action, or external KB import. Examples: ex:mary-watson, ex:mrs-watson, ctx:legacy/31448699f0e5.
donto_entity_mention: an occurrence of a symbol in a document / span / extraction run.
donto_entity_signature: the current feature vector or profile for a symbol, derived from statements and evidence.
donto_identity_edge: a weighted, bitemporal assertion about whether two symbols co-refer.
donto_identity_hypothesis: a named clustering solution over identity edges.
Three default hypotheses ship:
strict_identity_v1 : same_referent edges with confidence >= 0.98 only
likely_identity_v1 : >= 0.85, no strong cannot-link
exploratory_identity_v1 : >= 0.60, useful for search and discovery
Production queries can be asked under any lens. A merge accepted
under likely does not destroy the original symbols; under
strict, the same two symbols may still resolve to distinct
entities.
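The lens behaviour can be sketched as a query-time projection from symbols to referents using a union-find over edges that clear the lens threshold. This sketch ignores cannot-link edges and bitemporal filtering, and the `resolve` function and sample edges are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

LENS_THRESHOLDS = {
    "strict_identity_v1": 0.98,
    "likely_identity_v1": 0.85,
    "exploratory_identity_v1": 0.60,
}


def resolve(symbols: List[str], edges: List[Tuple[str, str, float]], lens: str) -> Dict[str, str]:
    """Map each symbol to a representative referent under the given lens."""
    threshold = LENS_THRESHOLDS[lens]
    parent = {s: s for s in symbols}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, confidence in edges:
        if confidence >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Deterministic representative: lexicographically smallest symbol.
                parent[max(ra, rb)] = min(ra, rb)
    return {s: find(s) for s in symbols}


symbols = ["ex:mary-watson", "ex:mrs-watson", "ex:mary-oxley"]
edges = [
    ("ex:mary-watson", "ex:mrs-watson", 0.90),
    ("ex:mrs-watson", "ex:mary-oxley", 0.65),
]

strict = resolve(symbols, edges, "strict_identity_v1")
likely = resolve(symbols, edges, "likely_identity_v1")
exploratory = resolve(symbols, edges, "exploratory_identity_v1")
```

The input symbols and edges are never mutated, which is the property the lens design relies on: the same stored data yields three different views.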
At 19,230 contexts and 938,918 distinct predicates, naive O(N²)
pairwise comparison of subjects is infeasible. The
ARCHITECTURE-REPORT proposes (and we are landing in phases)
nine independent blocking channels for persons:
B1 : normalised full-name trigram
B2 : surname + given initial
B3 : maiden-name / married-name variants
B4 : birth-year bucket ±N years
B5 : birth/death/residence place
B6 : spouse/parent/child neighbourhood overlap
B7 : document/source-local co-occurrence
B8 : embedding nearest neighbours over compact entity profiles
B9 : legacy imported identifier / external ID / source record ID
Trigram (pg_trgm) indexes provide cheap B1–B3 lookups;
the embedding channel (B8) is staged behind the pgvector-backed
predicate-similarity work, so entity embeddings will land later.
The scoring formula, generalising Fellegi–Sunter:
LLR(i,j) = log prior_block(i,j)
+ Σ_k log [ P(feature_k | same) / P(feature_k | different) ]
+ λ_rel * relational_overlap(i,j)
+ λ_time * temporal_compatibility(i,j)
+ λ_place * spatial_compatibility(i,j)
+ λ_neural * neural_pair_score(i,j)
+ λ_source * source_dependence_adjustment(i,j)
p_same = sigmoid(a * LLR + b)
Ditto-style transformer matchers [Li et al., 2020] inform the neural-pair component for hard pairs only; running a deep matcher over all candidate pairs would be economically prohibitive.
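The scoring formula translates directly into code. In this sketch the λ-weighted terms are passed as (weight, value) pairs and the calibration constants a and b default to 1 and 0; those defaults and the function names are assumptions, since the text does not give fitted values.

```python
import math
from typing import Iterable, Tuple


def llr(
    prior_block: float,
    feature_likelihoods: Iterable[Tuple[float, float]],
    weighted_terms: Iterable[Tuple[float, float]],
) -> float:
    """log prior + sum of per-feature log likelihood ratios + weighted terms."""
    score = math.log(prior_block)
    # Each pair is (P(feature | same), P(feature | different)).
    score += sum(math.log(p_same / p_diff) for p_same, p_diff in feature_likelihoods)
    # Each pair is (lambda, term value), e.g. (lambda_rel, relational_overlap).
    score += sum(weight * value for weight, value in weighted_terms)
    return score


def p_same(llr_value: float, a: float = 1.0, b: float = 0.0) -> float:
    """Sigmoid calibration of the log-likelihood ratio."""
    return 1.0 / (1.0 + math.exp(-(a * llr_value + b)))


# A single feature nine times likelier under "same" than "different".
one_feature = p_same(llr(1.0, [(0.9, 0.1)], []))
no_evidence = p_same(llr(1.0, [], []))
```

With a neutral prior and one feature at a 9:1 likelihood ratio, the calibrated probability is 0.9; with no evidence it is 0.5.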
Naive transitive closure on positive identity edges produces catastrophic merges in genealogy: Mary Watson ≈ Mrs Watson, Mrs Watson ≈ Mary Watson née Oxley, Mary Watson née Oxley ≈ Mary Oxley, Mary Oxley ≈ another Mary Oxley, and suddenly the nineteenth-century Cooktown beche-de-mer fisherman's widow is the same person as a twentieth-century Sydney accountant.
The constrained clustering algorithm we are landing:
Input
positive edges with weights
negative / cannot-link edges with weights
soft ontology and temporal constraints
source-dependence penalties
Output
clusters maximising same-edge agreement
while minimising distinct-edge violations
Implemented in phases: union only p_same ≥ 0.98 and no
strong cannot-link; local correlation-clustering-style optimisation
inside candidate blocks; produce ambiguous bridge edges as proof
obligations, not automatic merges; human-or-agent review only on
high-impact ambiguous bridges.
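The first phase (union only high-confidence edges that do not violate a cannot-link) can be sketched as a greedy merge. This is an illustration under assumptions: edges are processed in descending confidence, every rejected edge is returned as a deferred "bridge", and the function name and data are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Edge = Tuple[str, str, float]


def phase_one(
    symbols: Iterable[str],
    positive: List[Edge],
    cannot_link: Iterable[Tuple[str, str]],
    threshold: float = 0.98,
) -> Tuple[Set[FrozenSet[str]], List[Edge]]:
    blocked = {frozenset(pair) for pair in cannot_link}
    cluster = {s: frozenset([s]) for s in symbols}
    deferred: List[Edge] = []
    for a, b, p in sorted(positive, key=lambda edge: edge[2], reverse=True):
        ca, cb = cluster[a], cluster[b]
        if ca == cb:
            continue  # already merged
        # A merge is refused if it would join any cannot-link pair.
        conflict = any(frozenset((x, y)) in blocked for x in ca for y in cb)
        if p < threshold or conflict:
            deferred.append((a, b, p))  # left as a proof obligation
            continue
        merged = ca | cb
        for s in merged:
            cluster[s] = merged
    return set(cluster.values()), deferred


symbols = ["ex:mary-watson", "ex:mrs-watson", "ex:mary-oxley-1881", "ex:mary-oxley-1950"]
positive = [
    ("ex:mary-watson", "ex:mrs-watson", 0.99),
    ("ex:mrs-watson", "ex:mary-oxley-1881", 0.985),
    ("ex:mary-oxley-1881", "ex:mary-oxley-1950", 0.99),
]
cannot_link = [("ex:mary-watson", "ex:mary-oxley-1950")]

clusters, bridges = phase_one(symbols, positive, cannot_link)
```

Here the middle edge is strong enough on its own, but applying it would transitively join two symbols with a cannot-link between them, so it is deferred rather than merged.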
The non-negotiable principle is reversibility. A merge accepted at
time t and reversed at time t' > t must leave the
database queryable as-of any t″ ∈ [t, t'] with the merged view,
and as-of any t″ ∉ [t, t'] with the unmerged view. The
substrate enforces this by leaving original symbols in
donto_statement untouched; query projection maps symbol →
referent under the requested hypothesis at evaluation time, never at
write time.
The 938,918 distinct predicates in the live database are not a sign
of expressive richness; they are a sign of LLM-driven predicate
proliferation. Each freely-minted predicate (ex:fatherOf,
ex:isFatherOf, ex:hadFather,
ex:was-father-of, ...) is a row in
donto_predicate and a leaf in the alignment graph.
The alignment layer's job is to make queries portable across this proliferation. An alignment edge declares one of eleven relations (PRD §6.10) and three per-edge safety booleans:
safe_for_query_expansion : evaluator may ride this edge in EXPAND mode
safe_for_export : downstream exporters may collapse via this edge
safe_for_logical_inference : OWL-style entailment may use this edge
A user can register an alignment between ex:fatherOf and
ex:hadFather with relation inverse_equivalent,
safety flag safe_for_query_expansion=true, scope =
ctx:genes/registry. The materialised closure
(donto_predicate_closure) pre-computes transitive chains
the evaluator can ride at query time. donto align auto
proposes alignments via embedding-similarity over predicate descriptors;
auto-proposals land at confidence < 1.0 and are
reviewer-promotable.
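Query expansion over the alignment graph amounts to a reachability walk restricted to edges flagged safe_for_query_expansion. The sketch below also tracks whether subject and object must be swapped after crossing inverse_equivalent edges. The relation name `exact_equivalent` is a placeholder (the eleven PRD relations are not listed here), and the walk is done on demand, not from the materialised donto_predicate_closure.

```python
from typing import Dict, List, Tuple

# (source predicate, target predicate, relation, safe_for_query_expansion)
AlignmentEdge = Tuple[str, str, str, bool]


def expand(start: str, edges: List[AlignmentEdge]) -> Dict[str, bool]:
    """Return reachable predicates mapped to 'swap subject and object?'."""
    adjacency: Dict[str, List[Tuple[str, bool]]] = {}
    for src, dst, relation, safe in edges:
        if not safe:
            continue  # unsafe edges are never ridden in EXPAND mode
        flips = relation == "inverse_equivalent"
        adjacency.setdefault(src, []).append((dst, flips))
        adjacency.setdefault(dst, []).append((src, flips))
    reached = {start: False}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for neighbour, flips in adjacency.get(current, []):
            if neighbour not in reached:
                reached[neighbour] = reached[current] ^ flips
                frontier.append(neighbour)
    return reached


edges = [
    ("ex:fatherOf", "ex:hadFather", "inverse_equivalent", True),
    ("ex:fatherOf", "ex:isFatherOf", "exact_equivalent", True),  # placeholder relation name
    ("ex:hadFather", "ex:was-father-of", "inverse_equivalent", False),
]
expansion = expand("ex:fatherOf", edges)
```

A query for (?x, ex:fatherOf, ?y) would then also match (?y, ex:hadFather, ?x) and (?x, ex:isFatherOf, ?y), while the unreviewed edge to ex:was-father-of is skipped.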
A claim with surface text
"Mary Watson, born Cornwall 1860" should resolve to a
donto_span row at byte offsets [1342, 1376) of
revision rev_92f1... of document
doc:genes/trove-cooktown/watkins-1881. The
packages/donto-trace/ crate implements this resolution as a
three-tier search.
For each statement carrying a donto:textSpan or
ex:normalized_claims/text_span predicate, the trace
worker:
Cache check (cross-shard). Hash the surface
text; look up donto_trace_log for any prior match (across
all runs, not just this run). If found, reuse the prior result.
Production data shows ~4.3× dedup: 1.5 M textSpan statements → 349 K
distinct surface texts.
Tier 1 — exact line equality. Query
donto_revision_line where line_text = needle
and length(line_text) ≤ 400. A partial B-tree index on
short lines makes this O(log n).
Tier 2 — substring within a line. Query
donto_revision_line with
line_text LIKE pattern ESCAPE '\'. Uses a trigram GIN index
on line_text for substring search. Cheaper than Tier 3
because the per-row haystack is small.
Tier 3 — full-body fallback. Query
donto_document_revision.body_inline with the same LIKE
pattern. Slow but catches multi-line quotes. Only runs if the surface
text contains a newline.
The four match types:
Exact : surface text matches verbatim confidence = 1.0
Normalized : whitespace-collapsed match confidence = 0.9
Ambiguous : same text in >1 revision confidence = 0.5
NotFound : no match found confidence = 0.0
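The tiered search and the four match types can be sketched over in-memory revision bodies. This is not the donto-trace implementation: how Normalized and Ambiguous interact (normalised matching is tried only when exact tiers find nothing, and more than one matching revision always yields Ambiguous) is an assumption, as are the function names.

```python
import re
from typing import Dict, List, Tuple


def collapse(text: str) -> str:
    """Collapse runs of whitespace, including newlines, to single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def find_revisions(needle: str, revisions: Dict[str, str]) -> List[str]:
    hits = []
    for rev_id, body in revisions.items():
        lines = body.split("\n")
        if needle in lines:                           # Tier 1: exact line equality
            hits.append(rev_id)
        elif any(needle in line for line in lines):   # Tier 2: substring within a line
            hits.append(rev_id)
        elif "\n" in needle and needle in body:       # Tier 3: multi-line, full body
            hits.append(rev_id)
    return hits


def classify(needle: str, revisions: Dict[str, str]) -> Tuple[str, float, List[str]]:
    hits = find_revisions(needle, revisions)
    kind, confidence = "Exact", 1.0
    if not hits:
        target = collapse(needle)
        hits = [rev for rev, body in revisions.items() if target in collapse(body)]
        kind, confidence = "Normalized", 0.9
    if not hits:
        return "NotFound", 0.0, []
    if len(hits) > 1:
        return "Ambiguous", 0.5, hits
    return kind, confidence, hits


revisions = {
    "rev_a": "Intro\nMary Watson, born Cornwall 1860\nEnd",
    "rev_b": "Mary  Watson,\nborn Cornwall 1860",
}
exact = classify("Mary Watson, born Cornwall 1860", revisions)
ambiguous = classify("born Cornwall 1860", revisions)
normalized = classify("1860  End", revisions)
missing = classify("Sydney accountant", revisions)
```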
Writes are strictly additive. A successful match emits four rows:
a donto_span (via donto_create_char_span),
a donto:hasSpan claim linking the original statement to the span,
a donto_evidence_link with type anchored_at and the match confidence,
a donto_trace_log row for resumability.
The legacy donto:textSpan literal is never retracted.
The new donto:hasSpan claim coexists with it.
Resumability: the worklist query
(packages/donto-trace/src/lib.rs:336-369) skips statements
already in donto_trace_log for the given
run_name. SIGINT sets an interrupted flag the main loop
checks at batch boundaries; the current batch completes cleanly and the
process exits. A re-run with the same run_name continues
from where it left off.
The blob store (packages/donto-blob/, migration 0125) is
content-addressed by SHA-256. Three backends are provided
(LocalFsBlobStore, GcsBlobStore,
MockBlobStore) behind a common trait:
pub trait BlobStore: Send + Sync {
    fn backend(&self) -> &'static str;
    async fn put_bytes(&self, bytes: &[u8], mime: Option<&str>)
        -> Result<BlobSummary>;
    async fn put_file(&self, path: &Path, mime: Option<&str>)
        -> Result<BlobSummary>;
    async fn exists(&self, sha256: &[u8; 32]) -> Result<bool>;
    async fn fetch(&self, sha256: &[u8; 32]) -> Result<Vec<u8>>;
    fn uri_for(&self, sha256: &[u8; 32]) -> String;
}
put_file makes two passes: hash first, then upload only if the blob is not
already present.
donto_register_blob(sha256, byte_size, mime_type, bucket_uri)
records the blob in donto_blob; the revision table
references it. The same revision body uploaded by ten different
documents lands as one blob with ten document-revision references.
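Content addressing makes the deduplication claim easy to demonstrate. The class below is an in-memory stand-in written for illustration; `InMemoryBlobStore`, `register_revision`, and the `uploads` counter are hypothetical names, not part of donto-blob.

```python
import hashlib
from typing import Dict, List, Tuple


class InMemoryBlobStore:
    """Toy content-addressed store: the SHA-256 digest is the key."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.references: List[Tuple[str, str]] = []
        self.uploads = 0

    def put_bytes(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Hash first, then store only if the digest is new.
        if digest not in self.blobs:
            self.blobs[digest] = data
            self.uploads += 1
        return digest

    def register_revision(self, document_id: str, data: bytes) -> str:
        digest = self.put_bytes(data)
        self.references.append((document_id, digest))
        return digest


store = InMemoryBlobStore()
body = b"Mary Watson, born Cornwall 1860"
digests = {store.register_revision(f"doc:{i}", body) for i in range(10)}
```

Ten registrations of the same body produce one stored blob, one upload, and ten references.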
packages/lean/ contains a Lean 4 library and an
executable donto_engine binary. The binary is a stdio JSON
sidecar dontosrv may spawn at startup and communicate with
via a line-delimited DIR (Donto Intermediate Representation) envelope
format at protocol version 0.1.0-json.
-- packages/lean/Donto/Core.lean
inductive Polarity where
  | asserted | negated | absent | unknown
  deriving Repr, BEq, DecidableEq
inductive Modality where
  | observed | derived | hypothesized | retracted
  deriving Repr, BEq, DecidableEq
inductive Confidence where
  | uncertified | speculative | moderate | strong
  deriving Repr, BEq, DecidableEq
structure Maturity where
  level : Nat
  hLE : level ≤ 4
structure Statement where
  id : Option String
  subject : String
  predicate : String
  object : Object
  context : String
  polarity : Polarity
  modality : Modality
  confidence : Confidence
  maturity : Maturity
  valid_from : Option String
  valid_to : Option String
The Lean types mirror the Postgres schema. The structural isomorphism
is what makes the DIR encoding straightforward: a Lean
Statement decodes from the same JSON shape that
donto-client serialises.
-- packages/lean/Donto/Shapes.lean
structure Shape where
  iri : String
  label : String
  severity : Severity
  evaluate : List Statement → ShapeReport
Built-in shapes ship with the library:
builtin:functional/*: at-most-one object per (subject, predicate) pair.
builtin:datatype/*: typed-literal datatype enforcement.
builtin:parent-child-age-gap: genealogy domain shape. ex:parentOf edges require parents 12–80 years older than children, compared via ex:birthYear. Missing or unreasonable gaps produce violation rows.
Each shape is a pure predicate over a scoped statement list:
it does not touch the database directly. dontosrv ships a focus selector
(a scoped pattern that materialises the statement list) along with the
shape IRI; Lean returns a ShapeReport with violation
rows.
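The age-gap shape is a good example of a shape as a pure function over a statement list. The sketch below is written in Python for illustration, not Lean; the tuple representation and the violation format are assumptions.

```python
from typing import List, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


def parent_child_age_gap(
    statements: List[Statement], min_gap: int = 12, max_gap: int = 80
) -> List[Tuple[str, str, str]]:
    """Return one violation row per ex:parentOf edge that fails the check."""
    birth_year = {s: int(o) for s, p, o in statements if p == "ex:birthYear"}
    violations = []
    for s, p, o in statements:
        if p != "ex:parentOf":
            continue
        if s not in birth_year or o not in birth_year:
            violations.append((s, o, "missing birth year"))
            continue
        gap = birth_year[o] - birth_year[s]  # child's year minus parent's year
        if not (min_gap <= gap <= max_gap):
            violations.append((s, o, f"gap of {gap} years"))
    return violations


statements = [
    ("ex:mary", "ex:birthYear", "1860"),
    ("ex:tom", "ex:birthYear", "1885"),
    ("ex:ann", "ex:birthYear", "1890"),
    ("ex:mary", "ex:parentOf", "ex:tom"),
    ("ex:tom", "ex:parentOf", "ex:ann"),
    ("ex:mary", "ex:parentOf", "ex:joe"),
]
violations = parent_child_age_gap(statements)
```

Because the function only reads its argument, the same statement list always yields the same report, which is what allows the check to run in a sidecar with no database access.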
-- packages/lean/Donto/Rules.lean
structure Rule where
  iri : String
  label : String
  output : Context
  mode : RuleMode -- eager | batch | onDemand
  apply : List Statement → List Statement
transitiveClosure p emits all (a, p+, c)
from (a, p, b) and (b, p, c), in the output
context, at modality derived and maturity 3 (E3 by the
storage encoding, which maps to "reviewed" in the maturity-ladder doc;
rules are formal certifications and thus considered as-if-reviewed).
Inverse and symmetric rules emit bidirectional pairs.
donto_engine reads lines from stdin; for each line it
parses a JSON envelope, dispatches via
Donto.Engine.dispatch, and emits a JSON response. Failures
(parse errors, unknown shape IRIs) produce explicit error
envelopes:
request:
{
"version" : "0.1.0-json",
"kind" : "validate_request",
"shape_iri" : "lean:builtin/parent-child-age-gap",
"scope" : {...},
"statements": [...]
}
response:
{
"version" : "0.1.0-json",
"kind" : "validate_response",
"shape_iri" : "...",
"focus_count" : N,
"violations" : [...]
}
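The line-delimited request/response loop can be sketched in Python to show the error-envelope behaviour. The `error` kind, its `message` field, and the choice of focus_count as the number of statements received are assumptions; the text specifies only that failures produce explicit error envelopes.

```python
import json
from typing import Callable, Dict, List

PROTOCOL_VERSION = "0.1.0-json"
ShapeFn = Callable[[List[dict]], List[dict]]


def error(message: str) -> str:
    # Hypothetical error envelope shape.
    return json.dumps({"version": PROTOCOL_VERSION, "kind": "error", "message": message})


def handle_line(line: str, shapes: Dict[str, ShapeFn]) -> str:
    """Parse one request line and return one response line."""
    try:
        request = json.loads(line)
    except json.JSONDecodeError as exc:
        return error(f"parse error: {exc.msg}")
    if not isinstance(request, dict) or request.get("kind") != "validate_request":
        return error("unsupported request kind")
    shape = shapes.get(request.get("shape_iri"))
    if shape is None:
        return error(f"unknown shape IRI: {request.get('shape_iri')}")
    statements = request.get("statements", [])
    return json.dumps({
        "version": PROTOCOL_VERSION,
        "kind": "validate_response",
        "shape_iri": request["shape_iri"],
        "focus_count": len(statements),  # assumption: one focus per statement
        "violations": shape(statements),
    })


shapes = {"lean:builtin/parent-child-age-gap": lambda statements: []}
ok = json.loads(handle_line(
    '{"version": "0.1.0-json", "kind": "validate_request", '
    '"shape_iri": "lean:builtin/parent-child-age-gap", "statements": [{}, {}]}',
    shapes,
))
bad = json.loads(handle_line("not json", shapes))
unknown = json.loads(handle_line(
    '{"kind": "validate_request", "shape_iri": "lean:builtin/nope"}', shapes
))
```

Every input line yields exactly one output line, so the parent process can pair requests with responses without framing beyond the newline.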
dontosrv spawns the Lean engine with a 10-second startup banner
timeout and enforces a 30-second per-request timeout. On engine death or
unresponsiveness the parent closes the child and returns
sidecar_unavailable for subsequent shape/rule calls; the
rest of dontosrv continues to serve ingest, query, and policy traffic
without interruption.
This is the principle: Lean certifies, doesn't gate. Ingest never waits on the formal overlay. The sidecar's absence degrades shape/rule/cert calls only.
A release (PRD §17.1) is eleven things:
packages/donto-release/ builds these from a
ReleaseSpec JSON. Release blockers include any included
claim with a policy disallowing publication, any source with unresolved
policy, any restricted-anchor reference without redaction, any claim
below the release-maturity threshold, any adapter loss report with
unaccepted critical loss, and any required review that has not
occurred.
The native format is donto-release.jsonl (one statement
per line, with checksums) plus manifest.json. Optional
exports include RO-Crate (a BagIt-compatible JSON-LD metadata file),
CLDF (for language datasets — lossy, requires
--max-cldf-loss gate), CoNLL-U (for corpus releases),
CSV/TSV for tabular subsets, and RDF / JSON-LD for linked-data
consumers.
// packages/donto-release/src/envelope.rs
pub struct Keypair { signing: ed25519_dalek::SigningKey, /* ... */ }
impl Keypair {
    pub fn generate() -> Self { /* OsRng */ }
    pub fn from_seed(seed: [u8; 32]) -> Self { /* deterministic */ }
    pub fn did_key(&self) -> String {
        // multicodec [0xed, 0x01] + base32(verifying_key)
    }
}
pub struct ReleaseEnvelope {
    pub manifest_id : String,
    pub manifest_sha256 : String,
    pub issuer_did : String,
    pub signature_suite : String, // "Ed25519Signature2020"
    pub signature : String,
    pub created_at : String,
}
pub fn sign(manifest: &ReleaseManifest, kp: &Keypair) -> ReleaseEnvelope { /* … */ }
pub fn verify(env: &ReleaseEnvelope) -> Result<(), VerifyError> { /* … */ }
The CLI subcommand donto release pipeline orchestrates
the full five-stage emission: build manifest from spec → write native
JSONL → sign → write RO-Crate metadata → optional CLDF export.
Verification requires no network: did:key is
self-contained.
docs/M9-FEDERATION-MEMO.md records an evaluation of five
candidate federation stacks. The explicit non-goal is "any researcher
querying any other researcher's tree". The actual question is whether
instance B can verify a release manifest from instance A without
re-ingesting A's source content.
W3C Verifiable Credentials + DID — Verdict: strongest fit. Each manifest is a signed VC; the signer is identified by a DID. Selective disclosure (BBS+, SD-JWT) hides claim payloads when policy demands. Cost: BBS+ requires pairing-friendly curves not in standard Postgres/Rust crypto stacks (~weeks of implementation).
Solid Pods — Verdict: defer. The pod model assumes one principal per pod. donto already has multi-context, multi-authority semantics that the pod model doesn't represent natively.
SPARQL federation (SERVICE) — Verdict:
reject as primary layer. Information leakage through query shape and
count is a known risk; donto's PRD invariant "cross-instance restricted
content cannot leak through counts or errors" rules out the naive
implementation.
DataCite-style citation metadata — Verdict: proceed.
The ReleaseManifest is essentially what DataCite expects.
The federation piece is publishing the manifest to a registry (DataCite,
Zenodo, OpenAIRE) and adopting their identifier scheme. Compatible with
VC/DID for trust: the manifest is a VC; publication registers its
reference.
RO-Crate — Verdict: proceed independently. RO-Crate is a format, not a federation protocol. Pairs naturally with VC/DID (sign the crate's metadata) and DataCite (publish the signed crate's identifier).
The synthesis (M9-MEMO §4):
Manifest format : RO-Crate (M7 work, landed)
Signing layer : VC over the manifest (M9 spike, did:key)
Publishing : DataCite-style citation metadata
Live cross-instance query: explicit non-goal for v1
Acceptance: instance A builds release R; the output is an RO-Crate signed by a VC issued under A's DID; instance B fetches the crate, verifies the VC, reads the release manifest, and can answer "does this crate contain claims about entity X" without ever storing A's raw rows.
We characterise the substrate at two scales: the synthetic benchmark
suite (donto bench) at 10 K, 100 K, and 1 M rows on the
production hardware, and the live genes corpus at 39.3 M
rows.
GCE e2-standard-4 (4 vCPU, 16 GB RAM), PostgreSQL 16 in
the donto-pg Docker container with volume bind to
/mnt/donto-data/pgdata (SSD). Workload is the synthetic
fixture in donto-cli bench: write N rows under a throwaway
context, time one point query and one batch query.
| Scale (N) | Insert wall | Inserts/s | Point query (H1) | Batch query (H4) |
|---|---|---|---|---|
| 10,000 | 3.36 s | 2,977 | 10.7 ms | 50 ms |
| 100,000 | 35.48 s | 2,819 | 42.8 ms | 504 ms |
| 1,000,000 | 396.67 s | 2,521 | 50.9 ms | 6.59 s |
Insert throughput stays within ~2.5–3.0 K rows/s across two decades of scale (10 K to 1 M rows), declining by about 15 %. Linear extrapolation gives ~70 min for a 10 M-row cold ingest (the H10 hard target); the genes prod corpus (39 M statements) would be ~4.2 hours cold, though production ingest is faster through batched pipelines than the single-row CLI path.
Point queries stay sub-100 ms through 1 M rows on the SPO / POS / OSP indexes. PRD §25 H1 hard target is 100 ms at 10 M; the trend is on track.
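The extrapolation can be reproduced from the insert wall times in the table. The sketch below uses the slowest measured rate (the 1 M-row run) as a conservative basis; that choice is an assumption, and it lands close to the rounded figures quoted in the text.

```python
# Insert wall times (seconds) from the benchmark table, keyed by row count.
measurements = {10_000: 3.36, 100_000: 35.48, 1_000_000: 396.67}

rates = {rows: rows / seconds for rows, seconds in measurements.items()}
slowest_rate = min(rates.values())  # rows per second at 1 M rows


def cold_ingest_hours(rows: int, rate: float = slowest_rate) -> float:
    """Linear extrapolation of single-row CLI ingest time, in hours."""
    return rows / rate / 3600


ten_million_minutes = cold_ingest_hours(10_000_000) * 60
genes_hours = cold_ingest_hours(39_294_083)
throughput_drop = 1 - rates[1_000_000] / rates[10_000]
```

At the 1 M-row rate the 10 M-row ingest comes out at about 66 minutes and the full genes corpus at about 4.3 hours; the extrapolation assumes throughput does not degrade further beyond 1 M rows.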
| Scale (N) | H1 point | H4 batch | H2 aligned | H3 AS_OF | H5 frontier | H7 modality query |
|---|---|---|---|---|---|---|
| 10,000 | 15.1 ms | 80 ms | 10.6 ms | 4.4 ms | 3 ms | 48 ms |
| 100,000 | 8.9 ms | 2.59 s | 3.2 ms | 3.4 ms | 2 ms | 95 ms |
| 1,000,000 | 33.6 ms | 8.21 s | 4.6 ms | 3.0 ms | 15 ms | 1.73 s |
| Scale (N) | H6 join | H8 setup | H8 query | H8 rows kept | H9 4× concurrent |
|---|---|---|---|---|---|
| 10,000 | 6 ms | 106 ms | 107 ms | 9,900 | 385 ms |
| 100,000 | 6 ms | 798 ms | 812 ms | 99,000 | 428 ms |
Observations.
The AS_OF bitemporal path (H3) is essentially constant: a GiST index on tx_time plus a partial index on currently-open rows.
A (modality, statement_id) index is a deferred tuning candidate.
The genes corpus, as of 2026-05-28:
Statements: 39,294,083
Distinct predicates: 938,918
Distinct contexts: 19,230
Evidence links: 1,837,151
Database size: 48 GB
donto_statement table: 32 GB
Currently-believed: 39,293,802
Retracted: 281
Polarity:
asserted : 39,292,908 (99.997%)
negated : 813 ( 0.002%)
unknown : 331
absent : 31
Maturity (storage-bit encoding):
0 (raw) : 22,082,198 (56.2%)
1 (candidate) : 14,870,096 (37.8%)
2 (evidence-supp.) : 47,905 ( 0.1%)
3 (reviewed) : 244,806 ( 0.6%)
4 (E5 certified*) : 2,049,078 ( 5.2%)
Top context (by statement count):
ctx:genealogy/research-db 21,842,452
ctx:genealogy/smoketest 4,114,991
ctx:genealogy/analysis-db 3,852,396
ctx:genes/yeatman-knoll-coleman-banjo-gibson 1,159,340
ctx:genes/edward-herbert-chinese-stack/qld 1,011,967
ctx:genealogy/research-db/source/unknown 497,658
ctx:genealogy/resources 328,863
ctx:genes/naa-32841845 271,739
ctx:genes/edward-herbert-father 192,952
ctx:genes/trove-cooktown/reynolds 142,192
ctx:genes/trove-cooktown/beche-de-mer 114,401
Top predicates (by statement count):
rdf:type 3,751,377
donto:status 1,629,900
donto:aboutPredicate 1,629,842
donto:confidenceLabel 1,226,023
donto:predicate 1,224,518
donto:textSpan 1,220,903
donto:extractionModel 1,206,699
donto:objectValue 1,113,534
donto:hasSpan 1,095,921
donto:claimB 1,086,531
donto:aboutSubject 1,086,531
donto:claimA 1,086,531
ex:knownAs 1,081,364
donto:createdAt 857,137
donto:inSource 764,847
The top fifteen predicates are dominated by reified meta-statements
(donto:status, donto:aboutPredicate,
donto:textSpan, donto:extractionModel,
donto:claimA/B) rather than domain predicates. This is a
direct consequence of how the M5 extractor emitted facts: each claim was
reified into ~7–10 quad rows. The exhaustive (M6) extractor does not
reify in the same way; the older half of the database carries a heavy
reification tail.
Three substantive research workloads have exercised the system in the last twelve months:
Annie Davis. Multiple genealogical sources disagree
on Annie Davis's parents, birth year, and place. donto stores claims
from the Davis family oral history, the Brackenridge family records, the
Mareeba marriage register, three colonial-era obituaries, and the 2007
EKY native-title determination, all of which contradict each other
on at least one field. The contradiction frontier view exposes the
disagreements; the maturity ladder caps the reviewable claims at E3
pending family-elder review; the policy capsule for oral-history sources
defaults to community_restricted.
Caroline Brown / Kaitchi. A second-generation EKY apical ancestor whose parents and grandparents are subject to active litigation in the Federal Court of Australia. donto holds 820 lines of dossier and ~80 kinship triples, of which only three carry evidence links to primary sources — a sparsity flagged for the Trust Kernel HTTP-middleware testbed (see §13).
Blucher and the Maryborough boiling-down works. A
nineteenth-century Aboriginal figure mis-identified across colonial
archives; the corpus distinguishes the Maryborough "King Blucher"
(Maryborough boss-name, NMP worker) from the authentic apical Bujilkabu
claim. The PDFs at genes.apexpots.com/pdfs/ and the
verbatim primary sources at genes.apexpots.com/blucher/s/
are produced from this data.
These workloads have driven changes to the substrate: the
contradiction frontier view (H5), the multi-context attachment
(migration 0103, for claims that belong to both a hypothesis lens and an
oral-history source), and the source-provenance trace
(donto-trace, §9.6).
Two queries submitted by the agent preparing this paper did not return within its time budget:
SELECT COUNT(DISTINCT subject) FROM donto_statement;
-- > 30 min, did not complete
SELECT COUNT(*) FROM (
SELECT subject, predicate
FROM donto_statement
WHERE upper_inf(tx_time)
GROUP BY subject, predicate
HAVING COUNT(DISTINCT donto_polarity(flags)) > 1
) t;
-- > 5 min, did not complete
Subject cardinality and contradiction-by-polarity are both
load-bearing characterisation queries we would like to report
empirically. That they do not complete in routine time is itself a
finding: a donto_subject_stats matview (companion to the
existing donto_subject_count matview that powers the
/subjects/all directory page) is the natural fix and is the
top item on the scheduled-refresh roadmap.
Append-only discipline. The decision to never
DELETE FROM donto_statement and to extend the same
discipline to alignments, identity, policies, and attestations through
donto_event_log has not produced an unmanageable
storage explosion. At 39.3 M rows and only 281 retractions, the
bitemporal overhead is essentially nil. The discipline pays for itself
the first time you need to answer "what did we believe last
Tuesday?"
Idempotent migrations. All 131 migrations are
idempotent (if not exists, create or replace,
advisory-locked migrator, SHA-256 ledger). New migrations are added by
sequential number; prior migrations are never edited. The model means a
fresh database can be brought up to head with one
donto migrate call, and a production database can absorb
new migrations without service interruption (the F-1 closure ran in
production with zero downtime).
The pgrx packaging. The pgrx extension
pg_donto embeds every migration via
extension_sql_file! and provides Rust mirrors for
plan-quality immutable helpers. The cost was non-trivial (managing pgrx
version skew, learning the Rust-Postgres ABI) but the result is that a single
CREATE EXTENSION pg_donto on a fresh database produces a
working substrate. This is what packaging looks like when you take
Postgres seriously as the boundary.
The tripwire test suite. 77 files, ~20 K LOC, ~592
#[tokio::test] and ~91 #[test] annotations,
with the pg_or_skip! macro letting database-touching tests
skip cleanly when Postgres is unreachable. The suite encodes PRD
invariants as executable assertions. Every PRD §I-clause has at least
one tripwire; every new invariant lands with at least one new test. The
convention has caught more regressions than any single review.
Predicate proliferation. 938,918 distinct predicates
is far beyond what we expected. The freely-minted-predicate problem is
a direct consequence of giving LLMs the latitude to mint IRIs without a
registry lookup, which we did for M5 because the alternative — "the
model must use one of these 12,000 existing predicates" — was producing
systematic under-coverage (the model would refuse to extract a claim it
had no good predicate for). The vocab-aware extraction (commit
31c519b, "vocab-aware extraction — stop minting fresh
predicates") is the partial fix; the alignment-closure backfill (~922 K
predicates → nearest canonical, cosine ≥ 0.9) is the larger one.
Evidence-anchor sparsity. 1,837,151 evidence links /
39,294,083 statements ≈ 4.7 % anchor coverage. This is the gap between
the intended invariant (I1: no claim without evidence or
hypothesis_only) and the lived reality of a corpus
accumulated across two extractor generations. The anchor-aware ingest
(commit 5928bff) plus the donto trace
provenance backward-fill (Stage D) is closing this gap; the roadmap
target is ≥ 50 % coverage.
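The arithmetic behind the coverage figure, and the size of the gap to the roadmap target, can be checked directly. The gap calculation assumes, as the ratio itself does, one evidence link per anchored statement and a corpus that stays at its current size.

```python
import math

evidence_links = 1_837_151
statements = 39_294_083

# Anchor coverage as reported: links divided by statements.
coverage = evidence_links / statements
print(f"anchor coverage: {coverage:.1%}")

# Additional anchored statements needed to reach the 50 % target,
# assuming one link per statement and a fixed corpus size.
target = 0.50
missing = math.ceil(target * statements) - evidence_links
print(f"links still needed for {target:.0%}: {missing:,}")
```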
Retraction rarity. 281 retractions across 39.3 M
statements (7 × 10⁻⁶). This is a striking finding. It tells us that we
are operating donto as an append-mostly archive rather than as
a constantly-revised research notebook. The substrate supports
retraction (donto_retract, donto_correct) and
the discipline is encoded in the CLAUDE.md non-negotiable list. But the
actual usage shows that researchers are accumulating contradictory
claims rather than retracting older ones — exactly as the
paraconsistency invariant says they should. The two oldest sources
disagreeing about Annie Davis's birth year both still live.
The reification tail. The top fifteen predicates by
row count are dominated by reified meta-statements
(donto:status, donto:aboutPredicate,
donto:textSpan, donto:claimA,
donto:claimB). The M5 extractor reified each claim into
~7–10 rows; the M6 exhaustive extractor does not. The older half of the
database carries the reification tail; the newer half is substantially
less dense. The plan for unwinding the tail is quarantine (move
ex:normalized_claims/* to a retract-or-quarantine context,
~2.37 M rows) plus a re-ingest under the new vocabulary.
Long-running characterisation queries. Subject
cardinality and polarity-mixed contradictions do not return in routine
time at 39 M rows. We thought we knew the index story for these — index
on subject, index on (subject, predicate) — but DISTINCT and GROUP BY
HAVING over the whole table both require parallel hash-aggregate plans
that take longer than the agent budget. The fix is matviews
(donto_subject_stats,
donto_contradiction_pressure) refreshed on a daily
schedule.
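What the two views precompute can be sketched in memory. This is an illustration only: the source does not give the view definitions, so the tuple layout, the polarity labels, the sample IRIs, and the reading of "polarity-mixed contradiction" as one triple carrying more than one polarity are all assumptions.

```python
from collections import Counter, defaultdict


def subject_stats(statements):
    """Count statements per subject (what a per-subject stats view would hold)."""
    return Counter(s for s, _p, _o, _pol in statements)


def polarity_mixed(statements):
    """Return (subject, predicate, object) triples recorded with more than one polarity."""
    polarities = defaultdict(set)
    for s, p, o, pol in statements:
        polarities[(s, p, o)].add(pol)
    return sorted(key for key, pols in polarities.items() if len(pols) > 1)


# Hypothetical statements: (subject, predicate, object, polarity).
statements = [
    ("ex:p1", "ex:bornIn", "ex:placeA", "asserted"),
    ("ex:p1", "ex:bornIn", "ex:placeA", "negated"),
    ("ex:p1", "ex:motherOf", "ex:p2", "asserted"),
    ("ex:p2", "ex:bornIn", "ex:placeB", "asserted"),
]
stats = subject_stats(statements)
conflicts = polarity_mixed(statements)
print(stats.most_common(1), conflicts)
```

Both functions scan every row once, which is the cost a scheduled refresh pays so that readers query the stored result.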
Of the six apertures, Conceivable is the one we are least sure about.
It produces unanchored, hypothesis-only claims by design. The position
recorded in the Maximalism doc — "mine everything; let curation
decide" — is principled, but we have not yet tested whether
downstream curation actually filters Conceivable output at useful rates,
or whether it floods the candidate space in ways that make E2 promotion
costly. The provisional answer is to keep it on by default but in a
dedicated ctx:.../conceivable sub-context, so that release
builders can exclude the entire context with one clause.
The Lean overlay is operationally working: donto_engine
spawns under dontosrv and certifies the three built-in shapes (functional,
typed-literal, parent–child age-gap). Separately, the
autoresearch-genealogy/lean/Genealogy/ library has a
substantially more developed catalogue (one-birth-per-person, sameAs
symmetry/transitivity, parent-date plausibility). The gap is that the
two libraries are not yet converged: shapes that exist in Genealogy do
not yet exist in packages/lean/. The convergence work is
straightforward (port the shape combinator with its proof of soundness)
but unspectacular; landing it is the natural milestone that completes
the Lean side of the substrate.
donto_subject_stats matview is the route to answering basic
characterisation queries at corpus scale without ad-hoc full table
scans.
The substrate (M0–M4) is complete. Open work clusters around applications:
donto extract.
donto release CLI verb.
autoresearch-genealogy/lean/Genealogy/ library.
.md files that are not yet linked from any context's source registration.
WITH evidence result shape (current evaluator records the directive but
does not change the row shape).
donto man / donto completions; package the install as part of a
donto install-completions subcommand.
donto is, by intention, an uneasy product. It treats contradictions
as data, schemas as plural, identities as hypotheses, sources as
policy-bound, and time as bitemporal. It refuses the simplifying
assumptions a triple store usually makes — and pays the cost in schema
complexity (91 tables), test surface (~20 K LOC of tripwires), and
conceptual overhead (a 2,500-line PRD with ten non-negotiable
invariants). What we get in exchange is a substrate where two oral
histories disagreeing about a great-grandmother's birthplace can both
live, where the legal-precedent citation chain from a 2013 Federal Court
determination can be queried under a strict identity lens that excludes
the 2024 family-elder review's provisional merges, where the predicate
ex:motherOf and the predicate ex:hadMother are
typed-aligned but not collapsed at storage, where a release artefact
ships with its own checksum manifest and policy report and an Ed25519
envelope verifiable without contacting the originating instance.
The system runs in production at 39.3 million statements against a single PostgreSQL instance on a 4-vCPU VM. The benchmark numbers are encouraging: 2.5–3.0 K-row/s insert throughput holds through 1 M rows on the standard hardware; point queries stay sub-100 ms. The empirical surprises (predicate proliferation, evidence sparsity, low retraction rate, the reification tail) point to concrete next steps that the substrate's architectural choices make possible — backfill alignments without rewriting history, backward-fill anchors via three-tier trace, refresh matviews on a schedule without invalidating the bitemporal model.
We do not claim donto is the right substrate for every knowledge graph application. We claim it is a working substrate for contested knowledge, demonstrated against one of the hardest realistic domains we know — North-Queensland Aboriginal genealogy and the language documentation surrounding it. Every architectural choice in this paper has a tripwire test, a PRD section, a migration, and a row count in the live database to back it. The system is, in the sense the PRD demands, working.
Allen, J. F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832–843.
Belnap, N. D. (1977). A useful four-valued logic. In J. M. Dunn & G. Epstein (Eds.), Modern Uses of Multiple-Valued Logic (pp. 8–37). Dordrecht: Reidel.
Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., & Hudson, M. (2020). The CARE principles for Indigenous data governance. Data Science Journal, 19, 43.
Cyganiak, R., Wood, D., & Lanthaler, M. (2014). RDF 1.1 Concepts and Abstract Syntax (W3C Recommendation). W3C.
da Costa, N. C. A. (1974). On the theory of inconsistent formal systems. Notre Dame Journal of Formal Logic, 15(4), 497–510.
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
Gregg, F., & Eder, D. (2015). Dedupe: a Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution. github.com/dedupeio/dedupe.
Harris, S., & Seaborne, A. (2013). SPARQL 1.1 Query Language (W3C Recommendation). W3C.
Hickey, R. (2012). Datomic. Strange Loop 2012.
Jaśkowski, S. (1948). Rachunek zdań dla systemów dedukcyjnych sprzecznych. Studia Societatis Scientiarum Torunensis, Sectio A, 1(5), 55–77.
Konda, P., Das, S., Suganthan G. C., P., Doan, A., Ardalan, A., Ballard, J. R., Li, H., Panahi, F., Zhang, H., Naughton, J., Prasad, S., Krishnan, G., Deep, R., & Raghavendra, V. (2016). Magellan: Toward building entity matching management systems. PVLDB, 9(12), 1197–1208.
Lebo, T., Sahoo, S., & McGuinness, D. (2013). PROV-O: The PROV Ontology (W3C Recommendation). W3C.
Li, Y., Li, J., Suhara, Y., Doan, A., & Tan, W.-C. (2020). Deep entity matching with pre-trained language models. PVLDB, 14(1), 50–60.
Linacre, R. (2022). Splink: probabilistic record linkage at scale. github.com/moj-analytical-services/splink.
Library of Congress (2019). Extended Date/Time Format (EDTF) Specification. LoC.
Priest, G. (1979). The logic of paradox. Journal of Philosophical Logic, 8(1), 219–241.
Pratt, J., Dale, S., Ploderer, B., et al. (2019). XTDB / Crux: an unbundled, bitemporal database. Strange Loop 2019.
Snodgrass, R. T. (1999). Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann.
Soiland-Reyes, S., Sefton, P., Crosas, M., Castro, L. J., Coppens, F., Fernández, J. M., Garijo, D., Grüning, B., La Rosa, M., Leo, S., Ó Carragáin, E., Portier, M., Trisovic, A., RO-Crate Community, Groth, P., & Goble, C. (2022). Packaging research artefacts with RO-Crate. Data Science, 5(2), 97–138.
Vrandečić, D., & Krötzsch, M. (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10), 78–85.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018.
dontosrv          http://localhost:7879   axum, Rust, 67 routes
donto-api         http://localhost:8000   FastAPI + Temporal
donto-api-worker  n/a                     Temporal worker (extraction)
donto-debug       http://localhost:3002   Next.js debug dashboard
dontopedia-web    http://localhost:3000   Next.js public site
agent-runner      http://localhost:4001   Temporal kickoff fastify
caddy             :80 / :443              TLS termination + proxy
donto-pg          :55432 → :5432          Postgres 16 in Docker
temporal          :7233 + :8088 (UI)      workflow engine
Public DNS routes (Cloudflare in Full SSL):
genes.apexpots.com → mostly :8000, some paths to :3002
debug.genes.apexpots.com → :3002
genes.apexpots.com/pdfs/ → /srv/genes-pdfs/ (file_server)
genes.apexpots.com/blucher/ → /mnt/donto-data/blucher-sources/ (file_server)
genes.apexpots.com/research/ → /srv/genes-research/ (file_server, this paper)
www.dontopedia.com → :3000
0001       core                         -- donto_context, donto_statement, donto_audit
0002       flags                        -- pack/unpack polarity + maturity into smallint
0003       functions                    -- donto_assert, donto_retract, donto_correct,
                                           donto_ensure_context, donto_match
0023       documents                    -- donto_document table
0029       evidence_links               -- donto_evidence_link with 7-target check
0031       arguments                    -- donto_argument with 9 typed relations
0048-0055  predicate_alignment          -- alignment edges + closure rebuild
0057       entity_symbol                -- entity registry with trigram blocking
0060       identity_edge                -- weighted bitemporal coreference
0089       hypothesis_only_flag         -- per-statement flag for I1
0090       event_log                    -- append-only history for non-statement objects
0098       polarity_v2                  -- extended polarity values
0099       statement_modality           -- sparse modality overlay
0100       extraction_level             -- sparse extraction-level overlay
0102       maturity_e_naming            -- E5/E4 ordering note for stored values 4/5
0103       multi_context                -- secondary context attachments
0111       policy_capsule               -- 15-action policy with max-restriction inheritance
0112       attestation                  -- holder credentials with purpose + rationale
0123       document_policy_id_required  -- F-1 closure (NOT NULL + FK validate)
0125       blob_store                   -- content-addressed blob registry
0126       trace                        -- source-provenance trace log
0127       trace_lines                  -- byte-offset-preserving line index
0128       safe_extract                 -- defensive literal handling
0129       disambiguate                 -- entity disambiguation tables
0130       predicate_counts             -- matview for /subjects/all etc.
0131       object_iri_trgm              -- trigram index on object IRIs (substring search)
131 migrations total. All idempotent. Applied under a single
pg_advisory_lock so concurrent migrators serialise.
I1   No claim without evidence              invariants_evidence.rs
I2   No restricted source without policy    invariants_governance.rs
I3   No destructive overwrite               invariants_bitemporal.rs
I4   Contradictions are preserved           invariants_paraconsistency.rs
I5   Machine confidence is not maturity     invariants_maturity.rs
I6   Governance propagates to derivatives   invariants_governance.rs
I7   Schema mappings are typed and scoped   invariants_predicate.rs
I8   Identity is a hypothesis               invariants_identity.rs
I9   Adapters must report information loss  adapters/*
I10  A release is a reproducible view       invariants_releases.rs
Each invariant has at least one tripwire test; many have multiple
adversarial-walkthrough scenarios in
invariants_adversarial.rs (1,160 LOC, the largest single
test file).
End of paper.