# donto: An Evidence Operating System for Contested Knowledge

**A systems paper on the architecture, data model, and operational
experience of a bitemporal, paraconsistent, evidence-first quad store
running at 39.3 million statements.**

*Thomas Davis · Ajax Davis · 2026-05-28*

---

## Abstract

We describe **donto**, a knowledge-substrate system organised around
three commitments most knowledge graphs do not make: that every claim
is anchored in evidence under a context, that contradictions are
preserved rather than rejected, and that the system records both
*world-time* (when a fact was true in the world) and *system-time*
(when it was believed in the database). The system is implemented as a
Rust workspace shipping a PostgreSQL extension (`pg_donto`, built with
pgrx), an HTTP sidecar (`dontosrv`), a Python FastAPI + Temporal
extraction layer (`donto-api`), a CLI (`donto`), a terminal UI
(`donto-tui`), and a Lean 4 formal overlay (`donto_engine`). The
native query language, **DontoQL**, exposes twenty-one clauses
covering scope inheritance, polarity, maturity, identity lenses,
predicate-alignment closure, bitemporal time-travel, modality,
extraction-level filtering, and policy enforcement. The substrate is
exercised by **genes**, a genealogical-research corpus with 39,294,083
statements, 938,918 distinct predicates, 19,230 contexts, and 1.84M
evidence links across 48 GB on disk. We document the system's data
model, its query language, its six-aperture extraction pipeline, its
Trust Kernel for policy-gated ingest, its release-and-federation
machinery (Ed25519-signed RO-Crate envelopes), and its Lean 4 overlay
for shape and rule certification. We characterise its performance
(steady ~2.5–3.0 K-row/s insert throughput, sub-100 ms point queries
through 1 M rows) and report empirical observations from production:
predicate proliferation (~938 k distinct freely-minted predicates),
evidence-anchor sparsity (~4.7 % of statements carry an evidence link),
and an exceedingly low retraction rate (281 of 39.3 M, ~7 × 10⁻⁶) that
reveals the system is currently operated as an append-mostly archive
rather than a constantly-revised research notebook. We argue that the
combination of evidence-first storage, paraconsistent semantics,
bitemporality, and a typed alignment layer is a useful substrate for
domains where multiple sources, schemas, communities, and models must
make claims about a shared world without forcing premature consensus;
the motivating cases are language documentation, oral history,
genealogy, legal evidence, clinical observation, and historical
research.

**Keywords:** knowledge graphs, paraconsistent logic, bitemporal
databases, provenance, evidence anchoring, predicate alignment,
identity resolution, policy enforcement, scientific reproducibility,
Lean 4, language documentation, genealogy.

---

## 1. Introduction

### 1.1 The eight assumptions

Most research-supporting systems silently assume one or more of the
following:

1. There is one correct value to store for a given proposition.
2. There is one canonical schema to map data into.
3. There is one entity-resolution answer per referent.
4. A user either has access to a record or they do not.
5. Provenance is metadata attached to records rather than the central
   object of the system.
6. Machine confidence can stand in for scholarly review.
7. Exports are static files rather than reproducible views.
8. Contradictions are quality failures, not data.

donto rejects all eight. Its operating model is **open-world,
evidence-first, contradiction-preserving, governance-native,
bitemporal, multimodal, and schema-plural** (PRD §0). The product
question the system was designed to answer is:

> Given a contested question, can the system return the relevant
> claims, the evidence behind them, the schema mappings that make
> them comparable, the identity hypotheses they depend on, the
> disagreements between them, the access policies governing them,
> and a reproducible release artefact?

If it can, the substrate works.

### 1.2 The proving domain

The first proving domain is **language documentation**, because it
exhibits every constraint of contested knowledge simultaneously:
incompatible analytical schemas (Universal Dependencies vs WALS vs
Grambank), disputed identities (dialect/language boundaries; ISO codes
vs Glottolog vs internal community ontologies), multimodal evidence
(text, audio, interlinear glosses, phonetic transcription),
restricted cultural material (community-governed sacred or sensitive
records), diachronic change (reconstructed vs attested forms),
formal validation (paradigms have shape constraints), and
corpus-scale annotation.

The exercise domain is **genealogy** — specifically, the
North-Queensland family-history corpus that exercises donto in
production at `genes.apexpots.com`. Genealogy is, in our experience,
the *hardest* realistic instance of the same problem: name spelling
drifts across records, the same person appears under maiden, married,
and clan names, colonial-era records contain misclassifications and
falsifications as data, identity is contested across native-title
determinations, and oral-history claims contradict written archival
ones in irreducibly load-bearing ways.

The point of the genealogy exercise is not the genealogy. It is that
every contradiction in genealogy is a tripwire for donto's invariants.
Each friction point we hit becomes either a new tripwire test in
`packages/donto-client/tests/` or an amendment to the PRD; we are
deliberately running the substrate at its limits because that is how
we learn what the substrate must be.

### 1.3 Contribution

This paper describes:

- A 14-family object model (claims, frames, contexts, sources,
  revisions, anchors, predicates, entities, alignment edges, argument
  edges, identity hypotheses, policy capsules, attestations, release
  manifests) implemented as 91 PostgreSQL relations across 131
  idempotent migrations (§5).
- DontoQL, a 21-clause query language that compiles to a unified
  algebra alongside a SPARQL 1.1 subset; first-class clauses for
  bitemporal time-travel, identity lenses, predicate-closure
  expansion, modality and extraction-level filtering, policy
  enforcement, and contradiction-pressure ordering (§6).
- A six-aperture exhaustive-by-default extraction pipeline (surface,
  linguistic, presupposition, inferential, conceivable, recursive)
  with content-hash deduplication, vocabulary-aware prompting that
  reuses existing predicates rather than minting new ones, and
  confidence-to-maturity mapping that caps machine-extracted claims
  below human-reviewed maturity (§7).
- A Trust Kernel layering policy capsules, attestation credentials,
  and append-only event logs over a fail-closed default policy, with
  one historical write-side enforcement gap (F-1) closed by migration
  0123 (§8).
- A typed predicate-alignment layer with eleven relations and three
  per-relation safety flags (`safe_for_query_expansion`,
  `safe_for_export`, `safe_for_logical_inference`), plus a
  materialised closure that the evaluator rides at query time (§9).
- A three-tier source-provenance trace (exact line equality →
  substring within line → full-body fallback) with cross-shard
  surface-text caching, idempotent resumability, and content-
  addressed blob storage (§9).
- A Lean 4 formal overlay (`donto_engine`) that certifies shape and
  rule reports asynchronously over a line-delimited JSON protocol,
  with the invariant that **Lean certifies and does not gate**:
  ingest is never blocked on the Lean side (§10).
- A release builder producing JSONL + RO-Crate + optional CLDF
  artefacts, signed with Ed25519 over a manifest SHA-256 in the
  `did:key` format, and a federation analysis comparing five
  candidate stacks (Verifiable Credentials + DID, Solid Pods, SPARQL
  federation, DataCite, RO-Crate) with reasoning for the chosen
  architecture (§11).
- An empirical characterisation of the live system: insert
  throughput, point-query latency, batch-query scaling, policy
  gate cost, concurrent-writer behaviour, and observations about
  predicate proliferation, evidence-anchor sparsity, and
  retraction rate that motivate concrete next steps (§12, §13).

We are not claiming any one component is novel in isolation: typed
schema mapping is older than the semantic web, bitemporal databases
have been formalised since the 1990s, paraconsistent logics go back
to da Costa and Belnap, and the FAIR/CARE principles articulate the
ethics we operate under. The contribution of this paper is the
*composition*: a single substrate where these properties are
not optional library features bolted onto a triplestore but
invariants enforced from the schema upward, and a working system
of nontrivial scale that we can characterise empirically rather than
hypothetically.

---

## 2. Background and Related Work

### 2.1 Triple stores, quad stores, and named graphs

RDF [Cyganiak et al., 2014] and its 1.2 extension (RDF-star) describe
data as subject–predicate–object triples. Quads extend this with a
fourth term, typically called a *graph name* or *named graph*, which
the W3C SPARQL specification [Harris and Seaborne, 2013] uses to
partition triples. donto's `donto_statement` schema is a quad —
`(subject, predicate, object, context)` — but the context column
plays a distinct role from a SPARQL named graph: it is not merely a
partition key but the carrier of policy, scope inheritance, kind
(`source`, `snapshot`, `hypothesis`, `user`, `pipeline`, `trust`,
`derivation`, `quarantine`, `custom`, `system`), and mode
(`permissive`, `curated`). A statement's context determines who can
read it, what alignments expand over it, and how it inherits policy
from parent contexts. This is closer to the *attribution* role of
Wikibase qualifiers and references [Vrandečić and Krötzsch, 2014]
than to a SPARQL graph.

### 2.2 Bitemporal databases

donto stores `valid_time` (a `daterange` in the calendar sense — when
the fact was true in the world) and `tx_time` (a `tstzrange` in
real-time — when the fact was believed in the database). Retraction
closes `tx_time`; the row is *never* deleted. This is the model
formalised by Snodgrass [1999] and operationalised in XTDB [Pratt et
al., 2019]; Datomic [Hickey, 2012] tracks transaction time natively
and leaves valid time to application modelling. Where donto differs
is in committing to bitemporality across **every mutating object**,
not just statements: alignment edges, identity hypotheses, policy
capsules, attestations, and review decisions are all written through
an append-only `donto_event_log` (migration 0090). The invariant *no
destructive overwrite* (I3) is enforced uniformly.

### 2.3 Paraconsistent logic

Classical logic explodes on contradiction: from `P ∧ ¬P` any
proposition follows. Paraconsistent logics, beginning with Jaśkowski
[1948] and da Costa [1974] and continuing through Belnap's
four-valued logic [Belnap, 1977] and Priest's LP [Priest, 1979], are
calculi that tolerate contradictions without exploding. donto is not
a logic; it is a substrate. But it is *paraconsistent in the
storage-semantic sense*: two statements with the same subject and
predicate but conflicting objects, polarities, or valid-time
intervals can coexist as currently-believed rows, and the system
exposes a *contradiction frontier* view enumerating such pairs. The
algebraic structure is a directed argument graph (the
`donto_argument` table) with nine typed relations (`supports`,
`rebuts`, `undercuts`, `qualifies`, `explains`,
`alternative_analysis_of`, `same_evidence_different_analysis`,
`same_claim_different_schema`, `supersedes`) and a strength on
[0, 1]; the truth model per claim, when needed, maps to a Belnap-style
four-valued state (T, F, B for "both", N for "neither").
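
The per-claim truth model can be sketched as a fold over the polarities of the currently-believed statements about one proposition. This is a minimal illustration, not donto's implementation: the function name is hypothetical, and the rule that `absent` and `unknown` count as no evidence either way is an assumption.

```python
from typing import Iterable


def belnap_state(polarities: Iterable[str]) -> str:
    """Fold polarities of open statements about one proposition into T/F/B/N.

    Assumption: only 'asserted' and 'negated' count as evidence for or
    against; 'absent' and 'unknown' contribute nothing.
    """
    seen = set(polarities)
    has_true = "asserted" in seen
    has_false = "negated" in seen
    if has_true and has_false:
        return "B"  # both: the contradiction is kept, not resolved
    if has_true:
        return "T"
    if has_false:
        return "F"
    return "N"  # neither: no evidence either way
```

Two sources that disagree therefore yield `B` rather than an error, which is the storage-level analogue of not exploding on `P ∧ ¬P`.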

### 2.4 Provenance

PROV-O [Lebo et al., 2013] defines a vocabulary for *who produced
what when from what*: entities, activities, agents, and a small
relation set (`wasGeneratedBy`, `used`, `wasDerivedFrom`,
`wasAttributedTo`). donto's `donto_extraction_run`,
`donto_document_revision`, `donto_evidence_link`, and `donto_agent`
tables expose roughly PROV-O-compatible structure (the agent crate
even speaks PROV-O at the export boundary), but the central
commitment is stronger: provenance is not metadata attached to a
record, but the *organising primary key* of the data model. A claim
without an evidence anchor is either an explicit `hypothesis_only`
row or a violation of invariant I1.

### 2.5 RO-Crate, CARE, and FAIR

RO-Crate [Soiland-Reyes et al., 2022] is a metadata format for
packaging research artefacts in a self-describing JSON-LD envelope.
donto's release builder emits RO-Crate alongside its native JSONL,
treating the crate as the citable-and-portable export format. We
inherit the FAIR principles [Wilkinson et al., 2016] (findable,
accessible, interoperable, reusable) and operationalise the CARE
principles [Carroll et al., 2020] (collective benefit, authority to
control, responsibility, ethics) through the Trust Kernel: source
policy classification is required before ingest, the default for an
unclassified policy is `restricted_pending_review` (fail-closed),
and exports check derived-claim policy inheritance before emitting.
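
A minimal sketch of the fail-closed behaviour follows. The default policy IRI is the one named for migration 0123; the policy table, the `policy:example/open` entry, and both function names are hypothetical and exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Optional

# Default policy IRI as named in the paper (migration 0123).
DEFAULT_POLICY = "policy:default/restricted_pending_review"

# Hypothetical policy table: policy IRI -> set of allowed actions.
POLICIES: dict[str, frozenset[str]] = {
    DEFAULT_POLICY: frozenset(),  # zero allowed actions
    "policy:example/open": frozenset({"read", "export"}),  # made-up policy
}


@dataclass(frozen=True)
class Document:
    iri: str
    policy_id: str


def register_document(iri: str, policy_id: Optional[str] = None) -> Document:
    """Unclassified sources fall back to the restrictive default."""
    return Document(iri, policy_id or DEFAULT_POLICY)


def is_allowed(doc: Document, action: str) -> bool:
    """Unknown policy IRIs allow nothing, so a lookup miss also fails closed."""
    return action in POLICIES.get(doc.policy_id, frozenset())
```

The design point is that every failure mode (no classification, unknown policy) resolves to "deny", so an ingest path that forgets to classify a source cannot leak it.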

### 2.6 Knowledge graph extraction with language models

Recent surveys of LLM-driven knowledge graph construction
[arXiv:2510.20345 and related] describe pipelines that map text
through a single extractor producing typed triples. donto's
*extraction maximalism* (§7) is a deliberate departure: rather than
one pass with a single "tier" label, six independent analytical
apertures decompose the same source through different lenses
(surface, linguistic, presupposition, inferential, conceivable,
recursive). The maturity ladder caps machine extraction at E1
(candidate) regardless of model confidence; human review is required
to advance to E3 (reviewed).

### 2.7 Entity resolution at scale

Probabilistic record linkage [Fellegi and Sunter, 1969] and its
contemporary descendants (Splink [Linacre, 2022], Dedupe [Gregg and
Eder, 2015], Magellan [Konda et al., 2016], Ditto [Li et al., 2020])
formulate entity matching as a likelihood-ratio computation over
features. donto stores identity as an explicit hypothesis layer
(invariant I8: *identity is a hypothesis, not a foreign key*).
Symbols (the IRIs the LLM mints) are kept verbatim; identity edges
(`donto_identity_edge`) are weighted, bitemporal, method-attributed
assertions about coreference; identity hypotheses
(`donto_identity_hypothesis`) name clustering solutions over those
edges. Queries select an "identity lens" at query time rather than
collapsing entities at storage time. This means a single accepted
merge does not destroy the original symbols; the system can answer
"who was Mary Watson, under the strict lens?" and "under the
exploratory lens?" with different answers.
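
The lens idea can be illustrated with a small clustering sketch: symbols stay verbatim, and a lens only decides which identity edges are strong enough to follow. The edge data, the numeric thresholds per lens, and the function name are all assumptions for illustration; donto's lenses select among stored hypotheses rather than necessarily thresholding weights.

```python
from collections import defaultdict

# (symbol_a, symbol_b, weight): hypothetical coreference edges.
EDGES = [
    ("ex:mary_watson_1881", "ex:mary_watson_census", 0.95),
    ("ex:mary_watson_census", "ex:m_watson_oral", 0.55),
]

# Assumed minimum edge weight per lens; the thresholds are illustrative.
LENS_THRESHOLD = {"strict": 0.9, "likely": 0.7, "exploratory": 0.5}


def clusters(edges, lens):
    """Group symbols connected by edges at or above the lens threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, weight in edges:
        find(a), find(b)  # register both symbols even if the edge is skipped
        if weight >= LENS_THRESHOLD[lens]:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for symbol in parent:
        groups[find(symbol)].add(symbol)
    return sorted(sorted(g) for g in groups.values())
```

Under `strict` the oral-history symbol stays in its own cluster; under `exploratory` all three symbols fall into one. The input symbols are never rewritten in either case.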

### 2.8 Temporal modelling

Allen's interval algebra [Allen, 1983] defines thirteen interval
relations (`before`, `meets`, `overlaps`, `starts`, `during`,
`finishes`, `equals`, and their six inverses). donto preserves
Allen-style relations as a bitset on temporal claims, allowing
queries such as *"birth before marriage"* and *"residence overlaps
with employment"*. Historical dates use EDTF [LoC, 2019] semantics:
`"1860"` is a year-grain expression covering `[1860-01-01,
1861-01-01)`, not `1860-01-01`; uncertainty markers (`~`, `?`),
intervals, and unbounded endpoints round-trip through the
`donto_time_expression` overlay (proposed in `ARCHITECTURE-REPORT
§temporal` and partially landed). Reducing historical dates to false
day-grain precision is treated as a category error.
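
The two ideas in this subsection compose naturally, as the sketch below suggests: a bare year is expanded to a half-open range, and Allen relations are then computed over ranges rather than over invented day-grain dates. The function names and the names used for the six inverse relations are conventional choices of ours, not identifiers taken from donto.

```python
from datetime import date


def year_grain(expr: str) -> tuple[date, date]:
    """Expand a bare EDTF year such as "1860" to a half-open [start, end) range."""
    year = int(expr)
    return date(year, 1, 1), date(year + 1, 1, 1)


def allen_relation(a: tuple[date, date], b: tuple[date, date]) -> str:
    """Name the Allen relation of a to b for half-open, non-empty intervals."""
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if b1 < a0:
        return "after"
    if b1 == a0:
        return "met_by"
    # From here on the two intervals share at least one day.
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started_by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished_by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    return "overlaps" if a0 < b0 else "overlapped_by"
```

With these definitions, `allen_relation(year_grain("1860"), year_grain("1884"))` evaluates to `"before"`, which is the shape of a *birth before marriage* check carried out at year grain.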

### 2.9 Formal verification overlays

Cyc and SUMO embedded inference engines directly. donto's choice is
to keep the database in execution authority and let a Lean 4 binary
(`donto_engine`) own meaning. Lean certifies shapes (functional,
typed-literal, parent–child age-gap) and derivation rules (transitive
closure, inverse, symmetric) over scoped statement lists; it never
gates ingest. If the Lean engine is unreachable, dontosrv degrades
the shape/rule/cert endpoints to "unavailable" and continues serving
everything else (`apps/dontosrv/src/lean.rs:117-150`). The Lean
overlay is more in the spirit of Coq's program extraction or
Liquid Haskell's refinement types: formal where useful, optional
where not.

---

## 3. System Architecture

donto is a Rust workspace deployed as a small set of cooperating
processes against a single PostgreSQL 16 instance.

```
┌──────────────────────────────────────────────────────────────────┐
│                        clients / consumers                         │
│       donto-cli   donto-tui   dontopedia (Next.js)   external      │
└──────────────────────────────────────────────────────────────────┘
                ▼              ▼              ▼              ▼
        ┌───────────────────────────────────────────────┐
        │   donto-api  (FastAPI :8000 + Temporal)        │
        │     extraction workflows, ingest activities    │
        └───────────────────────────────────────────────┘
                              ▼
        ┌───────────────────────────────────────────────┐
        │      dontosrv  (axum HTTP sidecar :7879)        │
        │   67 routes: query, assert, retract, shapes,    │
        │   rules, policy, evidence, alignment, search    │
        └───────────────────────────────────────────────┘
              ▼                ▼               ▼
   ┌────────────────┐  ┌──────────────┐  ┌──────────────────┐
   │  donto-client  │  │  donto-query │  │  donto_engine     │
   │  (Rust SDK)    │  │  (DontoQL +  │  │  (Lean 4 binary,  │
   │                │  │   SPARQL     │  │   stdio JSON      │
   │  131-migration │  │   evaluator) │  │   protocol)       │
   │  applier,      │  │              │  │                   │
   │  typed wrapper │  │              │  │   shapes + rules  │
   │  over SQL fns  │  │              │  │   certifier       │
   └────────────────┘  └──────────────┘  └──────────────────┘
              ▼                ▼
        ┌───────────────────────────────────────────────┐
        │             PostgreSQL 16 (donto-pg)           │
        │   pg_donto extension (pgrx), 91 relations,     │
        │   131 idempotent migrations, GiST + GIN +      │
        │   trigram indexes, advisory-locked migrator    │
        └───────────────────────────────────────────────┘
                              ▼
                ┌─────────────────────────────┐
                │  /mnt/donto-data/pgdata     │
                │   ~48 GB (32 GB statements) │
                └─────────────────────────────┘
```

### 3.1 Deployment

The production deployment runs on a single GCE `e2-standard-4` VM
(4 vCPU, 16 GB RAM) with PostgreSQL 16 inside a Docker container
(`donto-pg`), volume-bound to an attached SSD at
`/mnt/donto-data/pgdata`. systemd manages five long-running services
(`dontosrv`, `donto-api`, `donto-api-worker`, `donto-debug`,
`dontopedia-web`) plus Caddy as a TLS terminator and reverse proxy.
Temporal workflows (extraction, alignment, entity resolution) run in
`donto-api-worker` against the same Postgres instance.

### 3.2 Process boundaries

The choice to keep `pg_donto` as a pgrx extension rather than a
sidecar microservice was deliberate. The SQL substrate *is* the
contract; the pgrx layer exists to make Rust the lingua franca for
plan-quality hot paths (e.g., immutable polarity / maturity
decoding) and to package all 131 migrations as
`extension_sql_file!` declarations so a `CREATE EXTENSION pg_donto`
on a fresh database produces a working substrate. The HTTP sidecar
`dontosrv` does not own substrate state; it is a typed gateway over
`donto-client`'s SQL surface.

### 3.3 Codebase footprint

```
                        packages/   apps/   total
Rust                    ~14.0 K     ~9.3 K  52,943 LOC
SQL                                 —       13,234 LOC (131 migrations)
Python (donto-api)                  —        8,349 LOC
Lean 4                  ~1.7 K      —        1,656 LOC
TypeScript (client-ts)  ~0.9 K      —          899 LOC
```

The two largest crates are `dontosrv` (5,088 LOC of HTTP routing and
sidecar protocol) and `donto-cli` (4,226 LOC across ~40 subcommands).
`donto-client` (the typed Rust wrapper over the SQL surface) is
2,665 LOC; `donto-query` (DontoQL + SPARQL parser, algebra,
evaluator) is 2,739 LOC. The five linguistic-pilot importer crates
(`donto-ling-{cldf,ud,unimorph,lift,eaf}`) total ~2.3 K LOC.

The test surface is unusually large: 592 `#[tokio::test]`
annotations, 91 `#[test]`, and 511 invocations of the
`pg_or_skip!` macro that lets database-touching tests skip cleanly
when Postgres is unreachable. The 77-file tripwire suite in
`packages/donto-client/tests/` totals 19,968 LOC — more than the
combined LOC of every package except `dontosrv`. The
`invariants_*.rs` files (governance, paraconsistency, bitemporal,
predicate, modality, hypothesis, evidence, releases) encode the PRD
invariants as executable assertions.

---

## 4. Conceptual Foundations

### 4.1 The ten non-negotiable invariants

The PRD specifies ten invariants the substrate must enforce
(`donto/docs/DONTO-PRD.md` §2). They are non-negotiable: amendments
go to the PRD first, never to code without spec. They are listed
here in shortened form; the precise text is in the PRD.

**I1. No claim without evidence or explicit hypothesis status.**
A statement must reference a `donto_evidence_link` row, or carry a
`hypothesis_only=true` flag, before it can advance past maturity E1.

**I2. No restricted source without policy.** A document cannot be
registered without a `policy_id`. Migration 0123 promotes
`donto_document.policy_id` to NOT NULL with a fail-closed default
(`policy:default/restricted_pending_review`, zero allowed actions);
this closes the historical F-1 gap where legacy ingest paths could
produce unpoliced rows.

**I3. No destructive overwrite.** Corrections, retractions, merges,
splits, alignments, and policy changes are append-only events. The
system supports transaction-time reconstruction of what was believed
and visible at any prior system time.

**I4. Contradictions are preserved.** Two sources disagreeing about
Annie Davis's birth year both live in the database forever. The
system creates a `donto_argument` row with relation `rebuts` or
`alternative_analysis_of` and a `donto_proof_obligation` with kind
`needs_contradiction_review`.
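
As a rough illustration of what a contradiction-frontier enumeration involves, the sketch below pairs up statements that share a subject and predicate but disagree on the object. It is deliberately simplified: polarity and valid-time conflicts are ignored, the `Statement` type is hypothetical, and the birth years and contexts are invented sample data.

```python
from itertools import combinations
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str


def contradiction_frontier(statements):
    """Yield pairs that share subject and predicate but disagree on the object."""
    by_key = {}
    for s in statements:
        by_key.setdefault((s.subject, s.predicate), []).append(s)
    for group in by_key.values():
        for a, b in combinations(group, 2):
            if a.obj != b.obj:
                yield a, b


# Invented sample rows; neither value is a claim about the real corpus.
rows = [
    Statement("ex:annie_davis", "ex:birthYear", "1871", "ctx:parish_register"),
    Statement("ex:annie_davis", "ex:birthYear", "1873", "ctx:oral_history"),
    Statement("ex:annie_davis", "ex:birthPlace", "Cooktown", "ctx:parish_register"),
]

for a, b in contradiction_frontier(rows):
    print(a.context, "disagrees with", b.context, "on", a.predicate)
```

Both rows survive the enumeration unchanged; the pair is what would motivate an argument edge and a review obligation.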

**I5. Machine confidence is not maturity.** A model may report
confidence on [0, 1]. Maturity is earned by evidence quality, review,
cross-source corroboration, or formal validation. Auto-promotion is
capped at E2 for any extraction-produced claim; E3+ requires a human
review decision. The `helpers.py:54-59` confidence-to-maturity
mapping (0.95 → 4, 0.8 → 3, 0.6 → 2, 0.4 → 1, else 0) sets the
*ceiling*, not the floor.
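
The mapping and the cap can be read together as in the sketch below. The thresholds are the ones quoted above; treating each threshold as an inclusive lower bound, and combining the two rules by taking the minimum, are our assumptions about how they interact, and the function names are illustrative rather than those in `helpers.py`.

```python
# Thresholds as quoted from helpers.py; treating each as an inclusive
# lower bound is an assumption.
THRESHOLDS = [(0.95, 4), (0.8, 3), (0.6, 2), (0.4, 1)]
AUTO_PROMOTION_CAP = 2  # I5: extraction-produced claims stop at E2


def confidence_ceiling(confidence: float) -> int:
    """Map model confidence to the highest maturity value it could justify."""
    for threshold, maturity in THRESHOLDS:
        if confidence >= threshold:
            return maturity
    return 0


def auto_maturity(confidence: float) -> int:
    """Assumed combination: the ceiling, clamped by the auto-promotion cap."""
    return min(confidence_ceiling(confidence), AUTO_PROMOTION_CAP)
```

Under this reading a model reporting 0.99 still lands at 2: high confidence raises the ceiling a reviewer may later promote towards, but does not itself promote.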

**I6. Governance propagates to derivatives.** A claim derived from a
restricted source inherits the most restrictive applicable policy of
its source anchors. Embeddings, translations, summaries, and exports
all inherit. Overrides require a qualified authority's attestation.
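
One simple way to picture "most restrictive applicable policy" is to model each policy as a set of allowed actions and take the intersection across source anchors. That modelling choice is an assumption made for illustration; the actual policy capsules may carry richer structure, and the function name is hypothetical.

```python
from functools import reduce


def inherited_actions(anchor_policies: list[frozenset[str]]) -> frozenset[str]:
    """Allowed actions for a derived claim: only what every anchor permits.

    With no anchors the result is empty, which keeps the default fail-closed.
    """
    if not anchor_policies:
        return frozenset()
    return reduce(frozenset.intersection, anchor_policies)
```

A summary derived from one open source and one read-only source would, on this model, be readable but not exportable, and a single anchor with zero allowed actions makes the derivative fully restricted.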

**I7. Schema mappings are typed and scoped.** No two predicates are
"the same" by default. Alignment edges declare one of eleven
relations (`exact_equivalent`, `close_match`, `broad_match`,
`narrow_match`, `inverse_of`, `decomposes_to`, `has_value_mapping`,
`incompatible_with`, `derived_from`, `local_specialization`,
`not_equivalent`) and three per-edge safety booleans:
`safe_for_query_expansion`, `safe_for_export`,
`safe_for_logical_inference`. Closure expansion respects safety
flags.
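
A sketch of flag-respecting expansion is given below, assuming a plain breadth-first walk that follows edges from source to target and skips any edge not marked `safe_for_query_expansion`. The `AlignmentEdge` type, the sample predicates, and the one-directional traversal are illustrative simplifications; relation-specific handling (for example `inverse_of`) is omitted.

```python
from collections import deque
from typing import NamedTuple


class AlignmentEdge(NamedTuple):
    source: str
    target: str
    relation: str
    safe_for_query_expansion: bool


def expand_predicate(predicate: str, edges: list[AlignmentEdge]) -> set[str]:
    """Breadth-first closure over edges flagged safe for query expansion."""
    neighbours = {}
    for e in edges:
        if e.safe_for_query_expansion:
            neighbours.setdefault(e.source, []).append(e.target)
    seen = {predicate}
    queue = deque([predicate])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical edges: the second one is typed but not safe to expand over.
EDGES = [
    AlignmentEdge("ex:bornIn", "schema:birthPlace", "exact_equivalent", True),
    AlignmentEdge("schema:birthPlace", "ex:placeOfBaptism", "close_match", False),
]
```

Here `expand_predicate("ex:bornIn", EDGES)` reaches `schema:birthPlace` but stops before `ex:placeOfBaptism`, because the unsafe edge is never entered into the adjacency map.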

**I8. Identity is a hypothesis, not a foreign key.** Person, place,
language, lexeme, morpheme, source, specimen, case, and concept
identity may all be contested. The system stores identity
hypotheses (eight kinds: `same_as`, `different_from`, `broader_than`,
`narrower_than`, `split_candidate`, `merge_candidate`,
`successor_of`, `alias_of`) and lets users query under selected
identity lenses (strict, likely, exploratory, custom).

**I9. Adapters must report information loss.** Every import and
export adapter produces a `LossReport` that explicitly names what
the source format cannot represent (governance, contradiction, time,
n-ary frames, anchors, review state).

**I10. A release is a reproducible view.** A release is a named
query plus a policy report, source manifest, transformation
manifest, checksum manifest, and reproducibility contract.

### 4.2 The maturity ladder

donto tracks a six-level epistemic ladder per claim:

| Level | Name              | Earned by |
|-------|-------------------|-----------|
| E0    | Raw               | Source registered, policy classified. |
| E1    | Candidate         | Evidence anchor or `hypothesis_only`. |
| E2    | Evidence-supported| Anchor validation, policy inheritance, no malformed terms. |
| E3    | Reviewed          | Human or authorised reviewer decision. |
| E4    | Corroborated      | Multiple independent anchors or accepted argument analysis. |
| E5    | Certified         | Machine-checkable certificate, formal shape, or domain proof. |

Promotion is monotonic per claim event. A claim may be superseded or
retracted; the maturity history remains queryable. The `flags`
smallint on `donto_statement` packs polarity into bits 0–1 and
maturity into bits 2–4 (migration 0002). The stored values are not
monotone in the ladder: stored 4 means "E5 Certified" and stored 5
means "E4 Corroborated". Migration 0102 documents this explicitly so
that the ordering helpers account for it.
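
The practical consequence for ordering can be sketched as a lookup from stored value to ladder level. The 4/5 swap is as described for migration 0102; the identity mapping for stored values 0 to 3 is our assumption, and the helper names are hypothetical.

```python
# Stored maturity value -> ladder level. The 4/5 swap is from the paper;
# the identity mapping for 0-3 is an assumption.
STORED_TO_LEVEL = {0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 5: 4}


def ladder_level(stored: int) -> int:
    """Translate a stored maturity value into its E-level number."""
    return STORED_TO_LEVEL[stored]


def at_least(stored: int, level: int) -> bool:
    """Compare by ladder level, not by the raw stored value."""
    return ladder_level(stored) >= level
```

Any filter of the form "maturity at least E4" therefore has to go through a helper like `at_least` rather than comparing the raw bits.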

### 4.3 Polarity, modality, and confidence

Three orthogonal axes describe the epistemic shape of a claim.

**Polarity** is one of `asserted` (default), `negated` (explicit
rejection: "X is *not* Y"), `absent` (the source explicitly does
not mention this), or `unknown` (the source mentions it but is
unclear). It is packed into the `flags` smallint and queryable via
DontoQL's `POLARITY` clause.

**Modality** is a sparse overlay (`donto_statement_modality`,
migration 0099) with fifteen values: `descriptive`, `prescriptive`,
`reconstructed`, `inferred`, `elicited`, `corpus_observed`,
`typological_summary`, `experimental_result`,
`clinical_observation`, `legal_holding`, `archival_metadata`,
`oral_history`, `community_protocol`, `model_output`, `other`.
Statements without a modality row are present in the system but
filtered out of modality-restricted queries.

**Confidence** is stored as up to four parallel values
(`donto_confidence`, migration 0101): `machine_confidence`
(model-reported), `calibrated_confidence` (empirically calibrated
against reviewer decisions), `human_confidence` (reviewer-reported),
and `source_reliability_weight` (source/method-level). Queries may
select a confidence lens; the system does not collapse to a scalar
by default.

### 4.4 Contexts

Every statement is filed under exactly one context (the
`donto_statement.context` column references `donto_context.iri`).
Contexts form a forest (migration 0001) with parent links; multiple
parents are supported for secondary attachments via
`donto_statement_context` (migration 0103). A context has a kind
(`source`, `snapshot`, `hypothesis`, `user`, `pipeline`, `trust`,
`derivation`, `quarantine`, `custom`, `system`) and a mode
(`permissive` or `curated`). Curated contexts route shape
violations to a quarantine context rather than rejecting them
outright; permissive contexts accept and emit a proof obligation.

The default context for any assert call that does not specify one is
`donto:anonymous`. There is no nullable context column anywhere in
the schema.

### 4.5 Bitemporality

`donto_statement.valid_time` is a `daterange` capturing world-time
applicability: when the claim was true in the world. `tx_time` is
a `tstzrange` capturing system-time belief: when the row was
asserted. Both ranges are lower-inclusive (`'[)'`); for `tx_time`
this is enforced by a table-level CHECK constraint on
`lower_inc(tx_time)`.

Retraction (`donto_retract`) closes the upper bound of `tx_time`
without creating a new row; the statement transitions from
"currently believed" to "was once believed". Correction
(`donto_correct`) retracts the old row and inserts a new one
referencing the prior via lineage (`donto_stmt_lineage`). Both
operations emit `donto_audit` log rows.

The discipline extends through every mutating object via
`donto_event_log` (migration 0090) — alignments, identity edges,
policies, attestations, reviews, releases. The single rule the
substrate enforces: never `DELETE FROM donto_statement`. The CLAUDE.md
non-negotiable list states this in capitals.
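
The retraction and time-travel semantics can be mimicked in a few lines. This is an in-memory sketch under stated assumptions, not the SQL functions themselves: the `Row` type and function names are hypothetical, and only the transaction-time axis is modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Row:
    content: str
    tx_from: datetime
    tx_to: Optional[datetime] = None  # None means "currently believed"


def retract(row: Row, at: datetime) -> None:
    """Close the open upper bound; the row itself is kept."""
    if row.tx_to is None:
        row.tx_to = at


def believed_at(rows: list[Row], when: datetime) -> list[Row]:
    """Rows whose half-open [tx_from, tx_to) interval contains `when`."""
    return [r for r in rows
            if r.tx_from <= when and (r.tx_to is None or when < r.tx_to)]


t1 = datetime(2026, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2026, 2, 1, tzinfo=timezone.utc)
t3 = datetime(2026, 3, 1, tzinfo=timezone.utc)

# A correction modelled as retract-then-insert; both rows remain stored.
rows = [Row("born 1871", t1)]
retract(rows[0], t2)
rows.append(Row("born 1873", t2))
```

Asking `believed_at(rows, t1)` returns the original claim and `believed_at(rows, t3)` returns the correction, while `rows` still holds both: nothing was deleted to get from one answer to the other.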

---

## 5. Data Model

The substrate is 91 PostgreSQL relations (84 `donto_*` tables and
seven materialized/standard views) defined across 131 idempotent
migrations. We describe the highest-load core here and refer the
reader to Appendix D of the PRD for the full table inventory.

### 5.1 The core: `donto_statement`

```sql
-- packages/sql/migrations/0001_core.sql:43-96
create table if not exists donto_statement (
    statement_id  uuid primary key default gen_random_uuid(),
    subject       text not null,
    predicate     text not null,
    object_iri    text,
    object_lit    jsonb,    -- {"v": <value>, "dt": <datatype-iri>, "lang": <tag-or-null>}
    context       text not null references donto_context(iri),
    tx_time       tstzrange not null default tstzrange(now(), null, '[)'),
    valid_time    daterange not null default daterange(null, null, '[)'),
    flags         smallint not null default 0,
    content_hash  bytea generated always as (digest(...)) stored,
    constraint donto_statement_object_one_of
        check ((object_iri is not null) <> (object_lit is not null)),
    constraint donto_statement_tx_lower_inc check (lower_inc(tx_time))
);

create unique index if not exists donto_statement_open_content_uniq
    on donto_statement (content_hash) where upper(tx_time) is null;

create index if not exists donto_statement_spo_idx
    on donto_statement (subject, predicate, object_iri);
create index if not exists donto_statement_pos_idx
    on donto_statement (predicate, object_iri, subject);
create index if not exists donto_statement_osp_idx
    on donto_statement (object_iri, subject, predicate)
    where object_iri is not null;
create index if not exists donto_statement_valid_time_idx
    on donto_statement using gist (valid_time);
create index if not exists donto_statement_tx_time_idx
    on donto_statement using gist (tx_time);
create index if not exists donto_statement_object_lit_gin
    on donto_statement using gin (object_lit jsonb_path_ops)
    where object_lit is not null;
```

Three B-tree indexes cover the standard SPO/POS/OSP join orders. GiST
indexes handle bitemporal range queries, and the `jsonb_path_ops` GIN
index serves containment queries on literal objects. Object-IRI
substring search rides the trigram index added in migration 0131
(`object_iri_trgm`). Idempotence on assert is enforced by a partial
unique index on `(content_hash) where upper(tx_time) is null`: only
*currently-believed* rows must be unique by content. A retracted row
with the same content can later be re-asserted as a separate row.
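
The effect of the partial unique index can be imitated with a dictionary keyed on the hash of open rows only. The sketch is an in-memory stand-in: the `Store` class is hypothetical, and because the listing above elides the digest expression, the choice of SHA-256 over a JSON encoding of four fields is purely an assumption.

```python
import hashlib
import json


class Store:
    """In-memory stand-in for the partial unique index on open rows."""

    def __init__(self):
        self.rows = []  # every row ever written, open or retracted
        self.open = {}  # content hash -> index of the currently-believed row

    @staticmethod
    def content_hash(subject, predicate, obj, context):
        # Assumption: SHA-256 over a JSON encoding of these four fields.
        # The real digest expression is elided in the schema listing.
        payload = json.dumps([subject, predicate, obj, context])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def assert_statement(self, subject, predicate, obj, context):
        """Return the row index; re-asserting open content is a no-op."""
        h = self.content_hash(subject, predicate, obj, context)
        if h in self.open:
            return self.open[h]
        self.rows.append({"hash": h, "open": True})
        self.open[h] = len(self.rows) - 1
        return self.open[h]

    def retract(self, index):
        """Mark the row closed and free its hash for later re-assertion."""
        row = self.rows[index]
        if row["open"]:
            row["open"] = False
            del self.open[row["hash"]]
```

Asserting the same content twice returns the same row; retracting it and asserting again produces a second row, so the history records that the claim was believed, withdrawn, and believed again.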

### 5.2 Flag packing

```sql
-- packages/sql/migrations/0002_flags.sql
create or replace function donto_pack_flags(polarity text, maturity int)
returns smallint language sql immutable as $$
    select (
        (case lower(polarity)
            when 'asserted' then 0 when 'negated'  then 1
            when 'absent'   then 2 when 'unknown'  then 3 else null end)
        | ((maturity & 7) << 2)
    )::smallint
$$;
```

Bits 0–1 are polarity, bits 2–4 are maturity, bits 5–15 are reserved.
The function is `IMMUTABLE` and `PARALLEL SAFE`; a Rust mirror exists
in `pg_donto/src/lib.rs:211-263` for plan-quality hot paths.
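
For readers who prefer to see the bit layout outside SQL, a Python mirror of the packing, plus the inverse, might look as follows. The function names are ours, and one behaviour differs from the SQL: an unrecognised polarity raises `ValueError` here, whereas the SQL `case` falls through to `null`.

```python
POLARITIES = ["asserted", "negated", "absent", "unknown"]


def pack_flags(polarity: str, maturity: int) -> int:
    """Polarity in bits 0-1, maturity in bits 2-4; higher bits stay zero."""
    return POLARITIES.index(polarity.lower()) | ((maturity & 7) << 2)


def unpack_flags(flags: int) -> tuple[str, int]:
    """Recover (polarity, stored maturity) from a packed flags value."""
    return POLARITIES[flags & 0b11], (flags >> 2) & 0b111
```

For example, `pack_flags("negated", 3)` is `1 | (3 << 2) = 13`, and `unpack_flags(13)` returns `("negated", 3)`.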

### 5.3 Contexts

```sql
-- packages/sql/migrations/0001_core.sql:17-38
create table if not exists donto_context (
    iri          text primary key,
    kind         text not null check (kind in (
        'source','snapshot','hypothesis','user','pipeline',
        'trust','derivation','quarantine','custom','system')),
    parent       text references donto_context(iri),
    label        text,
    metadata     jsonb not null default '{}'::jsonb,
    mode         text not null default 'permissive'
                 check (mode in ('permissive','curated')),
    created_at   timestamptz not null default now(),
    closed_at    timestamptz,
    constraint donto_context_no_self_parent
        check (parent is distinct from iri)
);
```

`donto_resolve_scope` (migration 0003) walks the context forest
either downward (default: include descendants) or upward (optional:
include ancestors), with set-based include/exclude lists.
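
A minimal in-memory sketch of the scope walk follows. It is not the SQL function itself: the `parents` mapping stands in for the `donto_context.parent` column, the function signature is hypothetical, and we assume that an excluded context removes only itself (not its subtree), which the source does not specify.

```python
def resolve_scope(parents, include, exclude=(), descendants=True, ancestors=False):
    """Resolve a set of context IRIs from a child -> parent mapping."""
    # Invert the parent pointers so the forest can be walked downward.
    children = {}
    for iri, parent in parents.items():
        if parent is not None:
            children.setdefault(parent, []).append(iri)

    resolved = set(include)
    if descendants:
        stack, seen = list(include), set(include)
        while stack:
            for child in children.get(stack.pop(), []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        resolved |= seen
    if ancestors:
        for root in include:
            node, seen_up = parents.get(root), set()
            # The seen_up guard protects against accidental parent cycles.
            while node is not None and node not in seen_up:
                seen_up.add(node)
                node = parents.get(node)
            resolved |= seen_up
    # Assumption: exclusion is applied last and removes only the listed IRIs.
    return resolved - set(exclude)
```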

### 5.4 Evidence and arguments

`donto_document` (migration 0023) registers a source artefact with a
required `policy_id` (post-0123). `donto_document_revision` holds
content-addressed revision bodies (the blob backend stores the
actual bytes; the revision holds the metadata and optionally inlines
short bodies). `donto_span` carries char-offset spans over revisions.

```sql
-- packages/sql/migrations/0029_evidence_links.sql:11-59
create table if not exists donto_evidence_link (
    link_id              uuid primary key default gen_random_uuid(),
    statement_id         uuid not null references donto_statement(statement_id),
    link_type            text not null check (link_type in (
        'extracted_from', 'supported_by', 'contradicted_by',
        'derived_from', 'cited_in', 'anchored_at', 'produced_by'
    )),
    target_document_id   uuid references donto_document(document_id),
    target_revision_id   uuid references donto_document_revision(revision_id),
    target_span_id       uuid references donto_span(span_id),
    target_annotation_id uuid references donto_annotation(annotation_id),
    target_run_id        uuid references donto_extraction_run(run_id),
    target_statement_id  uuid references donto_statement(statement_id),
    confidence           double precision,
    context              text references donto_context(iri),
    tx_time              tstzrange not null default tstzrange(now(), null, '[)'),
    metadata             jsonb not null default '{}'::jsonb,
    created_at           timestamptz not null default now(),
    constraint donto_evidence_link_has_target check (
        (target_document_id is not null)::int +
        (target_revision_id is not null)::int +
        (target_span_id is not null)::int +
        (target_annotation_id is not null)::int +
        (target_run_id is not null)::int +
        (target_statement_id is not null)::int = 1
    ),
    constraint donto_evidence_link_tx_lower_inc check (lower_inc(tx_time))
);
```
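
The `donto_evidence_link_has_target` constraint requires exactly one non-null target column per link. A client could pre-validate a link before insert with a check like the sketch below; the helper name and the dictionary representation of a link are illustrative assumptions, not part of the donto client API.

```python
TARGET_FIELDS = (
    "target_document_id",
    "target_revision_id",
    "target_span_id",
    "target_annotation_id",
    "target_run_id",
    "target_statement_id",
)


def has_exactly_one_target(link: dict) -> bool:
    """Mirror the SQL check: the six target columns must contain one non-null."""
    return sum(link.get(field) is not None for field in TARGET_FIELDS) == 1
```

A link that names both a document and a span is rejected, as is a link with no target at all.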

`donto_argument` (migration 0031) holds typed support / attack
relations between statements (nine relations including `supports`,
`rebuts`, `undercuts`, `qualifies`, `endorses`, `supersedes`,
`potentially_same`, `same_referent`, `same_event`) with a strength
on [0, 1] and an open-edge unique index that lets the same pair of
statements be related differently in different contexts.

### 5.5 Identity layer

`donto_entity_symbol` (migration 0057) records every freely-minted
IRI a model or import produces, with a trigram index on the
normalised label for blocking. `donto_identity_edge` (migration
0060) records weighted, bitemporal coreference assertions with one
of four relations: `same_referent`, `possibly_same_referent`,
`distinct_referent`, `not_enough_information`. The edge table
enforces `left_symbol_id < right_symbol_id` so that each unordered
pair of symbols is represented only once.

`donto_identity_hypothesis` names a clustering solution over the
edges (e.g., the `strict_identity_v1` hypothesis only takes edges
with confidence ≥ 0.98 and no cannot-link).

### 5.6 Predicate alignment

```sql
-- packages/sql/migrations/0048_predicate_alignment.sql
create table if not exists donto_predicate_alignment (
    alignment_id     uuid primary key default gen_random_uuid(),
    source_iri       text not null,
    target_iri       text not null,
    relation         text not null check (relation in (
        'exact_equivalent', 'inverse_equivalent',
        'sub_property_of', 'close_match',
        'decomposition', 'not_equivalent'
    )),
    confidence       double precision not null default 1.0
                     check (confidence >= 0 and confidence <= 1),
    valid_time       daterange not null default daterange(null, null, '[)'),
    tx_time          tstzrange not null default tstzrange(now(), null, '[)'),
    run_id           uuid,
    provenance       jsonb not null default '{}'::jsonb,
    registered_by    text,
    registered_at    timestamptz not null default now(),
    constraint donto_pa_distinct check (source_iri <> target_iri),
    constraint donto_pa_tx_lower_inc check (lower_inc(tx_time))
);
```

A materialised closure table (`donto_predicate_closure`) pre-computes
transitive chains so the evaluator can ride alignments at query
time without a recursive CTE per row. The closure is rebuilt by
`donto_rebuild_predicate_closure()`.
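
The idea of a pre-computed closure can be sketched in a few lines of Python. This is an illustration only: the source does not say how confidences combine along a chain, so multiplying them and keeping the best chain per pair are our assumptions, and both function names are hypothetical.

```python
def build_closure(edges):
    """edges: iterable of (source_iri, target_iri, confidence) triples."""
    best = {}
    for source, target, conf in edges:
        best[(source, target)] = max(conf, best.get((source, target), 0.0))
    changed = True
    while changed:  # relax until no chain improves any pair
        changed = False
        for (a, b), c_ab in list(best.items()):
            for (b2, d), c_bd in list(best.items()):
                if b2 == b and d != a:
                    chained = c_ab * c_bd  # assumption: chain confidence is the product
                    if chained > best.get((a, d), 0.0):
                        best[(a, d)] = chained
                        changed = True
    return best


def expand(predicate, closure, min_confidence):
    """Predicates reachable from `predicate` at or above a confidence floor."""
    hits = {predicate}
    for (source, target), conf in closure.items():
        if source == predicate and conf >= min_confidence:
            hits.add(target)
    return hits
```

With the closure materialised, expansion at query time is a single lookup rather than a recursive traversal, which is the property the closure table is meant to provide.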

### 5.7 Trust kernel

`donto_policy_capsule` (migration 0111) holds a policy with fifteen
boolean allowed-actions (`read_metadata`, `read_content`, `quote`,
`view_anchor_location`, `derive_claims`, `derive_embeddings`,
`translate`, `summarize`, `export_claims`, `export_sources`,
`export_anchors`, `train_model`, `publish_release`,
`share_with_third_party`, `federated_query`).

`donto_access_assignment` maps targets (document, revision, span,
context, statement, frame, release, entity, predicate) to policies.

`donto_attestation` (migration 0112) records credentials: a holder
agent, an issuer agent, a policy IRI, a subset of allowed actions, a
purpose (`review`, `community_curation`, `private_research`,
`publication`, `model_training`, `audit`, `extraction`,
`federation`, `inspection`), a required `rationale`, and lifecycle
fields (`issued_at`, `expires_at`, `revoked_at`).

The top-level access check `donto_authorise(holder, target_kind,
target_id, action)` combines policy effective-action AND with
attestation OR semantics. The effective-action helper
`donto_effective_actions` does a `bool_and` over all assigned
policies for the target, defaulting to the fail-closed restricted
policy when none is assigned.

### 5.8 Substrate scale and storage

The live database holds:

| Object | Count |
|--------|-------|
| Statements | **39,294,083** |
| Distinct predicates | **938,918** |
| Distinct contexts | **19,230** |
| Evidence links | **1,837,151** |
| Currently-believed statements (`upper_inf(tx_time)`) | **39,293,802** |
| Retracted statements | **281** |
| Database total on disk | **48 GB** |
| `donto_statement` table size | **32 GB** |

The retraction rate is 281 / 39,294,083 ≈ 7.15 × 10⁻⁶; see the
empirical discussion in §13.

The polarity distribution is heavily skewed toward `asserted`:

| Polarity | Count |
|----------|-------|
| `asserted` | 39,292,908 |
| `negated`  | 813 |
| `unknown`  | 331 |
| `absent`   | 31 |

The maturity distribution shows the cap-at-E1-or-below pattern of
machine extraction:

| Maturity | Count |
|----------|-------|
| 0 (raw)            | 22,082,198 |
| 1 (candidate)      | 14,870,096 |
| 2 (evidence-supp)  | 47,905 |
| 3 (reviewed)       | 244,806 |
| 4 (corroborated)*  | 2,049,078 |
| 5 (certified)*     | — |

*The flags-bit encoding stores `4 = E5 Certified` and `5 = E4
Corroborated` (migration 0102), reflecting a historical naming
ambiguity preserved for backward compatibility.

---

## 6. DontoQL: The Query Language

DontoQL v2 is a 21-clause query language compiled to a unified
algebra alongside a strict SPARQL 1.1 subset. The parser is
hand-rolled (`packages/donto-query/src/dontoql.rs`); both surfaces
emit the same `algebra::Query` struct, evaluated by
`packages/donto-query/src/evaluator.rs`.

### 6.1 Grammar

```
query           := keyword_clause+
keyword_clause  :=
    'SCOPE'           scope_descriptor
  | 'PRESET'          IDENT_or_PREFIXED_or_STRING
  | 'MATCH'           triple (',' triple)*
  | 'FILTER'          filter_expr (',' filter_expr)*
  | 'POLARITY'        ident_in_set
  | 'MATURITY'        '>='? INT
  | 'IDENTITY'        ident
  | 'IDENTITY_LENS'   ident
  | 'PREDICATES'      ('EXPAND' | 'STRICT' | 'EXPAND_ABOVE' INT)
  | 'MODALITY'        ident (',' ident)*
  | 'EXTRACTION_LEVEL' ident (',' ident)*
  | 'TRANSACTION_TIME' 'AS_OF' STRING_or_PREFIXED
  | 'AS_OF'           STRING_or_PREFIXED
  | 'POLICY' 'ALLOWS' ident
  | 'SCHEMA_LENS'     (iri | ident)
  | 'EXPANDS_FROM'    'concept' '(' iri ')' 'USING' 'schema_lens' '(' iri ')'
  | 'ORDER_BY'        ident ('DESC'|'ASC')?
  | 'WITH'            'evidence' '=' ident
  | 'PROJECT'         var (',' var)*
  | 'LIMIT'           INT
  | 'OFFSET'          INT

triple   := term term term ('IN' term)?
term     := var | iri | string-lit | int-lit
var      := '?' IDENT
iri      := '<' chars '>' | PREFIXED
filter_expr := term op term         -- op ∈ { = != < <= > >= }
```

Clauses may appear in any order. Whitespace is insignificant; `#`
introduces an end-of-line comment.

### 6.2 Defaults

With no other clauses, `MATCH ?s ?p ?o` returns asserted,
currently-believed rows across every context, at any maturity, with
the identity lens *default* (no expansion across sameAs clusters),
predicate expansion *expand* (rides the alignment closure), and no
ordering. Defaulting to `asserted`-only keeps contradictions out of
query results unless they are explicitly requested.

### 6.3 Worked examples

**Contested birth claims as of a system time.** Find every claim
about Annie Davis's birth that disagrees with another claim, in
research contexts where the maturity is at least E2
(evidence-supported), as of the state of the store on 2026-04-01:

```dontoql
SCOPE include ctx:genes/annie-davis ancestors
PRESET curated
MATCH ?stmt ex:about ex:annie-davis,
      ?stmt ex:predicate ex:born_in,
      ?stmt ex:object    ?place
FILTER ?place != "unknown"
POLARITY asserted
TRANSACTION_TIME AS_OF "2026-04-01T00:00:00Z"
PREDICATES EXPAND_ABOVE 75
PROJECT ?stmt, ?place
LIMIT 50
```

**Release-safe claims.** Show only claims under a hypothesis context
that the policy permits for publication, with evidence redacted
where required:

```dontoql
SCOPE include ctx:project:language-pilot ancestors
MATCH ?stmt ?p ?o
MATURITY >= 3
POLICY ALLOWS publish_release
WITH evidence = redacted_if_required
```

**Cluster-expansion search.** Broaden a birthplace query across
identity clusters and predicate-alignment expansion:

```dontoql
MATCH ?s ex:bornInPlace ?city
IDENTITY_LENS expand_clusters
PREDICATES EXPAND_ABOVE 70
LIMIT 100
```

**Contradiction frontier.** Order results by how contested each
binding's leading statement is:

```dontoql
MATCH ?stmt ex:about ?subject,
      ?stmt ex:predicate ?pred
ORDER_BY contradiction_pressure DESC
PROJECT ?stmt, ?subject, ?pred
LIMIT 50
```

`contradiction_pressure` is `attack_count − support_count` from
`donto_contradiction_frontier`, joined against the binding's most
recent matched `statement_id`. Rows without any argument edges are
treated as having pressure 0.
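
The ordering key can be sketched as follows. The source defines pressure as `attack_count − support_count` but does not say which of the nine argument relations count as attacks or supports; treating `rebuts` and `undercuts` as attacks and `supports` and `endorses` as supports is our assumption, and both function names are hypothetical.

```python
ATTACK_RELATIONS = {"rebuts", "undercuts"}    # assumption, not from the source
SUPPORT_RELATIONS = {"supports", "endorses"}  # assumption, not from the source


def contradiction_pressure(arguments):
    """arguments: iterable of (source_stmt, relation, target_stmt) triples."""
    pressure = {}
    for _source, relation, target in arguments:
        if relation in ATTACK_RELATIONS:
            pressure[target] = pressure.get(target, 0) + 1
        elif relation in SUPPORT_RELATIONS:
            pressure[target] = pressure.get(target, 0) - 1
    return pressure


def order_by_pressure(statement_ids, pressure):
    """Most contested first; statements with no argument edges count as 0."""
    return sorted(statement_ids, key=lambda s: pressure.get(s, 0), reverse=True)
```

Under this reading, a statement with more supports than attacks has negative pressure and sorts after statements that have no argument edges at all.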

**As-of historical view.** Reconstruct what was known last week:

```dontoql
MATCH ?p ex:about ex:somebody, ?p ?pred ?val
AS_OF "2026-05-21T00:00:00Z"
LIMIT 100
```

### 6.4 SPARQL subset

The SPARQL parser (`packages/donto-query/src/sparql.rs`) accepts
`PREFIX`, `SELECT` (including `SELECT *`), `WHERE`, `GRAPH`, basic
`FILTER` (numeric and string comparison), `LIMIT`, and `OFFSET`.
Property paths, `OPTIONAL`, `UNION`, aggregates (`COUNT`, `SUM`,
`GROUP BY`), and mutation (`INSERT DATA`, `DELETE DATA`) are
deliberately out of scope. donto's mutating operations are
expressed through `donto_assert` / `donto_retract` / `donto_correct`
at the SQL layer; the query language is read-only.

### 6.5 Evaluator

The Phase-4 evaluator is a nested-loop join with variable
unification:

```rust
// packages/donto-query/src/evaluator.rs

for pattern in &query.patterns {
    // Fresh accumulator for each pattern's extended bindings.
    let mut next_env = Vec::new();
    for binding in &env {
        let substituted = substitute(pattern, binding);
        let rows = client.match_pattern(
            substituted.subject_iri,
            substituted.predicate_iri,
            substituted.object_iri,
            scope, polarity, min_maturity,
            as_of_tx, as_of_valid,
        ).await?;
        for row in rows {
            if let Some(extended) = unify(binding, pattern, &row) {
                next_env.push(extended);
            }
        }
    }
    env = next_env;
}

apply_filters(&mut env, &query.filters);
apply_overlays(&mut env, query.modality, query.extraction_level);
apply_policy_gate(&mut env, query.policy_allows);
apply_order_by(&mut env, query.order_by);
apply_offset_limit(&mut env, query.offset, query.limit);
attach_evidence(&mut env, query.evidence_shape);
```

This is correct but unoptimised: the query planner is a
deliberately deferred Phase-10 work item. The evaluator's HTTP entry
point at `POST /dontoql` answers simple patterns at ~50 ms p50 at
the current scale.
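
The join strategy itself is small enough to reproduce over an in-memory list of triples. The Python sketch below is not the Rust evaluator: it omits scope, polarity, maturity, and time filters, and its `match_pattern` is a linear scan standing in for the database call. It shows only the substitute, match, and unify steps.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def substitute(pattern, binding):
    """Replace already-bound variables in a pattern with their values."""
    return tuple(binding.get(t, t) if is_var(t) else t for t in pattern)


def match_pattern(store, pattern):
    """Linear scan: constants must match, variables match anything."""
    return [row for row in store
            if all(is_var(t) or t == v for t, v in zip(pattern, row))]


def unify(binding, pattern, row):
    """Extend a binding with a row, or return None on a conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, row):
        if is_var(term):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(store, patterns):
    env = [{}]  # start with one empty binding
    for pattern in patterns:
        next_env = []
        for binding in env:
            for row in match_pattern(store, substitute(pattern, binding)):
                extended = unify(binding, pattern, row)
                if extended is not None:
                    next_env.append(extended)
        env = next_env
    return env
```

Because each pattern is matched once per surviving binding, the cost grows with the product of intermediate result sizes, which is why pattern order matters in the absence of a planner.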

---

## 7. Extraction: Six-Aperture Maximalism

The extraction layer is the bridge from natural-language sources to
typed claims. donto's design rejects the "single LLM call with an
8-tier prompt" approach the predecessor used. Three structural
failures motivated the replacement
(`docs/EXTRACTION-MAXIMALISM.md` L15-31):

1. **Yield cap.** One pass, one context window, one temperature.
2. **Tier–topic conflation.** Truth status was not first-class.
3. **No recursion.** Entities mentioned in facts were dead ends, not
   re-mined as subjects.

The replacement uses six **apertures** — independent specialised
passes over the same source — and content-hash deduplication across
the union.

### 7.1 The six apertures

| Aperture | What it mines | Modality | Confidence |
|----------|---------------|----------|------------|
| **Surface**       | Explicitly-stated claims | asserted, anchored | 0.95–1.0 |
| **Linguistic**    | Clause-by-clause: every NP → entity, VP → event, modifier → property | asserted, anchored | 0.85–1.0 |
| **Presupposition**| What the text takes for granted but does not assert | `hypothesis_only`, anchored to trigger | 0.7–0.95 |
| **Inferential**   | Common-knowledge consequences of stated facts | asserted, anchored to trigger | 0.4–0.7 |
| **Conceivable**   | "Hairs on the head" claims that could plausibly hold given entity types | `hypothesis_only`, no anchor | 0.85 (it is conceivable) |
| **Recursive**     | Re-runs Surface with newly-discovered entities as seeds | asserted, anchored | 0.85–1.0 |

### 7.2 Yield curve

A 1,376-character biographical text run through both pipelines gave:

```
1-pass tier        :   95 facts at $0.0042 in   65 s
6-pass aperture    :  341 facts at $0.0252 in  449 s    (3.6× yield, 6× cost)
  surface          :   87
  linguistic       :  127
  presupposition   :   34
  inferential      :   12
  conceivable      :   54
  recursive        :   27
distinct predicates:  171
distinct subjects  :   70
anchor coverage    :    0.842
hypothesis density :    0.258
dedup collisions   :    6
```

The yield is a floor, not a target. The Maximalism doc characterises
ambitions to push toward ~20,000–30,000 facts per source at fuller
aperture coverage, gated by cost (target ≈ $0.02–$0.05 per 1 kB of
source for the full 15-aperture pass, $0.001/kB on cache hits).
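
The summary ratios in the run above can be re-derived from the per-aperture counts. The source does not define `hypothesis density` or `anchor coverage`; the readings below (hypothesis-only apertures over total, and everything except the unanchored conceivable aperture over total) are our assumptions, chosen because they reproduce the reported figures.

```python
per_aperture = {
    "surface": 87,
    "linguistic": 127,
    "presupposition": 34,
    "inferential": 12,
    "conceivable": 54,
    "recursive": 27,
}

total = sum(per_aperture.values())   # six-pass fact count
yield_ratio = total / 95             # versus the one-pass tier run
cost_ratio = 0.0252 / 0.0042         # dollars, six-pass over one-pass

# Assumption: the hypothesis_only apertures are presupposition and conceivable.
hypothesis_density = (per_aperture["presupposition"] + per_aperture["conceivable"]) / total
# Assumption: only the conceivable aperture emits unanchored claims.
anchor_coverage = (total - per_aperture["conceivable"]) / total

print(total, round(yield_ratio, 1), round(cost_ratio, 1),
      round(hypothesis_density, 3), round(anchor_coverage, 3))
```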

### 7.3 The conceivable aperture and quarantine

The conceivable aperture is the most controversial. It deliberately
floods the candidate space with unanchored, hypothesis-only claims
(persons have hair, organisations have employees, projects have
contributors). The position is explicit:

> Maximal extraction is a design stance, not a yield target. Mine
> everything. Quarantine the malformed. Flag the hypothetical. Let
> the curation gate, not the extractor, decide what counts.
> (EXTRACTION-MAXIMALISM L311-314)

Quarantine is implemented as a sink that routes invalid candidates
to a `ctx:quarantine/<source>` context with policy
`restricted_pending_review`. Conceivable-aperture output lands in
the candidate space at maturity E1 with `hypothesis_only=true`;
downstream curation (Trust Kernel policy gate, maturity ladder,
reviewer acceptance) decides what survives into a release.
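
A sketch of the routing decision follows. We assume the hard-gate invariant is "a candidate must either carry an anchor or be flagged `hypothesis_only`", which is consistent with the aperture table but is not spelled out in the source; the function name and the candidate dictionary shape are hypothetical.

```python
def route_candidate(candidate: dict, source: str) -> dict:
    """Return a copy of the candidate, redirected to quarantine if invalid."""
    anchored = candidate.get("anchor") is not None
    hypothetical = bool(candidate.get("hypothesis_only"))
    routed = dict(candidate)
    # Assumed invariant: unanchored claims are valid only as hypotheses.
    if not (anchored or hypothetical):
        routed["context"] = f"ctx:quarantine/{source}"
        routed["policy"] = "restricted_pending_review"
    return routed
```

Under this assumption, conceivable-aperture output passes the gate because it is flagged hypothetical, while an unanchored claim presented as asserted is quarantined rather than dropped.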

### 7.4 Orchestration

The Python implementation in `apps/donto-api/extraction/` is laid
out as:

```
apertures.py     -- six aperture prompts (Surface, Linguistic, ...)
exhaustive.py    -- multi-pass orchestrator (asyncio.gather + dedup)
dispatch.py      -- single-pass M5 path (still useful for cheap runs)
validation.py    -- hard-gate validator (anchor + hypothesis_only invariant)
quarantine.py    -- quarantine sink
policy_gate.py   -- Trust Kernel probe before any external model call
main.py::extract_exhaustive -- POST /extract/exhaustive
```

The exhaustive orchestrator (`exhaustive.py:138-244`) gathers the
non-recursive apertures in parallel via `asyncio.gather`, dedups
in-flight using a content-key hash, then seeds Recursive from the
top 12 most-frequent subjects and object IRIs in the union.
Vocabulary-aware prompting (`vocab.py`) injects the current top-80
predicates and entity candidates into the system prompt so the model
reuses existing IRIs rather than minting fresh ones. We discuss the
empirical fallout of this prompt in §13.
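
The gather, dedup, and reseed flow can be sketched as below. This is not the code in `exhaustive.py`: the fact dictionary shape, the `object_is_iri` flag, the content-key recipe, and the signatures of the aperture callables are all illustrative assumptions.

```python
import asyncio
import hashlib
from collections import Counter


def content_key(fact: dict) -> str:
    """Hypothetical content key: hash of subject, predicate, and object."""
    raw = "\x1f".join((fact["subject"], fact["predicate"], fact["object"]))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


async def run_exhaustive(source_text, apertures, recursive, top_k=12):
    # Run the non-recursive apertures concurrently over the same source.
    batches = await asyncio.gather(*(aperture(source_text) for aperture in apertures))

    union = {}
    for batch in batches:
        for fact in batch:
            union.setdefault(content_key(fact), fact)  # first writer wins

    # Seed the recursive pass from the most frequent subjects and object IRIs.
    counts = Counter()
    for fact in union.values():
        counts[fact["subject"]] += 1
        if fact.get("object_is_iri"):
            counts[fact["object"]] += 1
    seeds = [iri for iri, _ in counts.most_common(top_k)]

    for fact in await recursive(source_text, seeds):
        union.setdefault(content_key(fact), fact)
    return list(union.values()), seeds
```

Deduplicating before seeding keeps a fact reported by several apertures from inflating the frequency counts that select the recursive seeds.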

### 7.5 Confidence and maturity

`helpers.py:54-59` maps model confidence to maturity ceiling:

```python
def confidence_to_maturity(c: float) -> int:
    if c >= 0.95: return 4
    if c >= 0.80: return 3
    if c >= 0.60: return 2
    if c >= 0.40: return 1
    return 0
```

This is a ceiling, not a floor: the post-PRD policy caps extraction-
produced claims at E1 regardless of model confidence. The PRD I5
invariant is enforced by an ingest validator that drops
high-confidence model claims back to E1 unless a reviewer attestation
is present in the same transaction.
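
The interaction between the confidence mapping and the E1 cap can be sketched as follows. `confidence_to_maturity` is reproduced from the listing above; `ingest_maturity` and its `reviewer_attested` flag are hypothetical names for the validator behaviour just described.

```python
EXTRACTION_CAP = 1  # E1: ceiling for extraction-produced claims


def confidence_to_maturity(c: float) -> int:
    if c >= 0.95: return 4
    if c >= 0.80: return 3
    if c >= 0.60: return 2
    if c >= 0.40: return 1
    return 0


def ingest_maturity(confidence: float, reviewer_attested: bool = False) -> int:
    """Apply the cap unless a reviewer attestation accompanies the claim."""
    ceiling = confidence_to_maturity(confidence)
    return ceiling if reviewer_attested else min(ceiling, EXTRACTION_CAP)
```

A model confidence of 0.97 therefore lands at maturity 1 on its own, and reaches 4 only when the attestation is present.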

### 7.6 Temporal workflows

Long-running extraction jobs are submitted as Temporal workflows
(`apps/donto-api/workflows.py`). The four-stage pipeline
(`extracting → ingesting → aligning → resolving`, ending in the
terminal `completed` state) is implemented as activities with
explicit retry policies (30-minute exhaustive extract activity
timeout, 3 retries; 5-minute ingest timeout, 5 retries; alignment
and resolution similarly bounded).
The workflow is queryable mid-flight via the `status` method.

---

## 8. The Trust Kernel

The Trust Kernel (PRD §M0) is the substrate's answer to invariants
I2 and I6: no source without policy, and governance propagates to
derivatives.

### 8.1 The capsule model

A `donto_policy_capsule` (migration 0111) is one of nine policy
kinds (`public`, `open_metadata_restricted_content`,
`community_restricted`, `embargoed`, `licensed`, `private`,
`regulated`, `sealed`, `unknown_restricted`) with a JSONB
allowed-actions object covering fifteen actions. An inheritance
rule (`max_restriction` is the default) defines how a derived row
combines the policies of its source anchors. The default
allowed-actions object denies every action:

```sql
allowed_actions = jsonb_build_object(
    'read_metadata',       false,
    'read_content',        false,
    'quote',               false,
    'view_anchor_location',false,
    'derive_claims',       false,
    'derive_embeddings',   false,
    'translate',           false,
    'summarize',           false,
    'export_claims',       false,
    'export_sources',      false,
    'export_anchors',      false,
    'train_model',         false,
    'publish_release',     false,
    'share_with_third_party', false,
    'federated_query',     false
)
```

The boilerplate default is fail-closed; a public-by-default
classification must be explicit.

### 8.2 Effective action computation

The `donto_effective_actions(target_kind, target_id)` function
returns the AND across all policies assigned to the target via
`donto_access_assignment`. With no assignment, it returns the
fail-closed default policy's allowed-actions:

```sql
foreach v_key in array v_keys loop
    select bool_and(coalesce((p.allowed_actions->>v_key)::boolean, false))
    into v_allowed
    from donto_access_assignment a
    join donto_policy_capsule   p on p.policy_iri = a.policy_iri
    where a.target_kind = p_target_kind
      and a.target_id   = p_target_id
      and p.revocation_status = 'active'
      and (p.expiry is null or p.expiry > now());
    v_result := v_result || jsonb_build_object(v_key, coalesce(v_allowed, false));
end loop;
return v_result;
```

Revoked, expired, and superseded policies do not contribute. The
choice of `bool_and` over `bool_or` operationalises the
*max-restriction* doctrine: every assigned policy must permit the
action for it to be permitted.
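
The same fold can be written in Python to make the fail-closed edge cases explicit. The policy dictionaries below stand in for rows of `donto_policy_capsule` already joined through `donto_access_assignment`; the function signature is an illustrative assumption.

```python
from datetime import datetime, timezone


def effective_actions(policies, actions, now=None):
    """AND each action across all live policies; no live policy means deny."""
    now = now or datetime.now(timezone.utc)
    live = [
        p for p in policies
        if p["revocation_status"] == "active"
        and (p.get("expiry") is None or p["expiry"] > now)
    ]
    return {
        # bool(live) models coalesce(bool_and(...), false) over zero rows.
        action: bool(live) and all(
            p["allowed_actions"].get(action) is True for p in live
        )
        for action in actions
    }
```

Two details follow the SQL: a key missing from a capsule's `allowed_actions` counts as `false`, and a target with no live policy denies everything.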

### 8.3 Attestations

A `donto_attestation` is a holder credential: agent X is entitled to
perform action Y under policy Z for purpose P with rationale R until
expiry T. The check `donto_authorise(holder, target_kind, target_id,
action)` combines the effective-action AND with an attestation OR:
if any attestation grants the holder the action under any policy
assigned to the target, the call returns true.

Revocation is immediate for new checks (set `revoked_at = now()`);
in-flight reads in the same transaction may still proceed but cannot
read newly-revoked data.

### 8.4 The F-1 closure

`docs/REVIEW-FINDINGS.md` records eighteen findings F-1 through
F-18 from an adversarial review of the Trust Kernel. Sixteen are
DOC-severity (intentional, documented, no work required) and two
are HARD-severity, both now resolved. The historical F-1 finding is
worth recording in full because it illustrates how the substrate
preserves invariants in the face of legacy code paths:

> Pre-migration `0123`, the legacy `donto_ensure_document` SQL
> function did not require `policy_id`, leaving the I2 invariant
> ("no source without policy") enforced only by fail-closed read
> defaults rather than write-time refusal.
>
> Resolution: migration 0123 backfilled NULL `policy_id` with the
> seeded fail-closed `policy:default/restricted_pending_review`
> (allowed_actions all false), set that capsule as the column
> default, promoted `policy_id` to NOT NULL, and validated the
> previously-NOT-VALID foreign key. The legacy path now succeeds but
> the produced row has the fail-closed policy; explicit-policy
> callers (via `donto_register_source`) continue to pass policy
> explicitly.
>
> Tripwire (inverted): `invariants_adversarial::
> legacy_register_document_lands_on_default_restricted_policy` —
> asserts the legacy path succeeds *and* that the produced document
> has the fail-closed policy. If a future migration removes the
> default or the constraint, this test fails and prompts a
> deliberate decision.
>
> Production state at resolution: 8,211 legacy `donto_document` rows
> backfilled to `policy:default/restricted_pending_review`. Read
> access remained fail-closed both before and after.

The pattern — backfill, then NOT NULL, then validate FK, all in one
idempotent migration — is the standard donto fix for an invariant
gap that has been silently tolerated.

### 8.5 The HTTP-middleware follow-on

The substrate side of the F-1 closure is complete. The remaining
work, recorded in `ROADMAP-AFTER-MAY18.md`, is the HTTP middleware
that enforces policy presence on write paths *at the sidecar layer*
rather than at the SQL layer. The genes corpus has hundreds of
unpoliced legacy sources (now bound to fail-closed defaults at the
substrate but still ingested via the legacy path); the middleware
work is the visible end-to-end testbed for the Trust Kernel beyond
the database layer.

---

## 9. Identity, Alignment, and Provenance

### 9.1 Symbols, mentions, signatures, edges, hypotheses

donto's open-world entity-resolution architecture decomposes the
"who is this person?" problem into five distinct objects:

- **`donto_entity_symbol`** — a string identifier produced by
  extraction, legacy import, user action, or external KB import.
  Examples: `ex:mary-watson`, `ex:mrs-watson`,
  `ctx:legacy/31448699f0e5`.
- **`donto_entity_mention`** — an occurrence of a symbol in a
  document / span / extraction run.
- **`donto_entity_signature`** — the current feature vector or
  profile for a symbol, derived from statements and evidence.
- **`donto_identity_edge`** — a weighted, bitemporal assertion
  about whether two symbols co-refer.
- **`donto_identity_hypothesis`** — a named clustering solution
  over identity edges.

Three default hypotheses ship:

```
strict_identity_v1       : same_referent edges with confidence >= 0.98 only
likely_identity_v1       : >= 0.85, no strong cannot-link
exploratory_identity_v1  : >= 0.60, useful for search and discovery
```

Production queries can be asked under any lens. A merge accepted
under `likely` does not destroy the original symbols; under `strict`,
the same two symbols may still resolve to distinct entities.
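
How one set of edges yields different entities under different lenses can be sketched with a union-find over thresholded edges. This simplification ignores cannot-link handling entirely, and the edge tuple layout and function name are assumptions; only the three thresholds come from the listing above.

```python
LENS_THRESHOLDS = {
    "strict_identity_v1": 0.98,
    "likely_identity_v1": 0.85,
    "exploratory_identity_v1": 0.60,
}


def resolve_under_lens(symbols, edges, lens):
    """Map each symbol to a cluster representative under the given lens.

    edges: iterable of (left, right, relation, confidence) tuples.
    """
    threshold = LENS_THRESHOLDS[lens]
    parent = {s: s for s in symbols}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, relation, confidence in edges:
        if relation == "same_referent" and confidence >= threshold:
            a, b = find(left), find(right)
            if a != b:
                parent[max(a, b)] = min(a, b)  # deterministic representative
    # The input symbols are never rewritten; only this projection changes.
    return {s: find(s) for s in symbols}
```

The returned mapping is computed per query, so the stored symbols are untouched whichever lens is requested.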

### 9.2 Candidate generation: blocking

At 19,230 contexts and 938,918 distinct predicates, naive O(N²)
pairwise comparison of subjects is infeasible. The
`ARCHITECTURE-REPORT` proposes (and we are landing in phases) nine
independent blocking channels for persons:

```
B1 : normalised full-name trigram
B2 : surname + given initial
B3 : maiden-name / married-name variants
B4 : birth-year bucket ±N years
B5 : birth/death/residence place
B6 : spouse/parent/child neighbourhood overlap
B7 : document/source-local co-occurrence
B8 : embedding nearest neighbours over compact entity profiles
B9 : legacy imported identifier / external ID / source record ID
```

Trigram (`pg_trgm`) indexes provide cheap B1–B3 lookups. The
embedding channel (B8) is staged behind the pgvector-backed
predicate-similarity work; entity embeddings will follow later.
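
Two of the channels are simple enough to sketch directly. The example below implements B2 (surname plus given initial) and B4 (birth year within ±N) over an in-memory dictionary and takes the union of their candidate pairs; the record shape, the function name, and the default window are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(people, year_window=2):
    """people: id -> {"given", "surname", optional "birth_year"}."""
    by_name = defaultdict(set)   # B2 blocks
    by_year = defaultdict(set)   # B4 index on exact birth year
    for pid, person in people.items():
        key = (person["surname"].lower(), person["given"][:1].lower())
        by_name[key].add(pid)
        if person.get("birth_year") is not None:
            by_year[person["birth_year"]].add(pid)

    pairs = set()
    # B2: every pair inside a surname + initial block.
    for block in by_name.values():
        pairs.update(combinations(sorted(block), 2))
    # B4: probe neighbouring years so pairs differ by at most year_window.
    for pid, person in people.items():
        year = person.get("birth_year")
        if year is None:
            continue
        for y in range(year - year_window, year + year_window + 1):
            for other in by_year.get(y, ()):
                if pid < other:  # canonical order, one entry per pair
                    pairs.add((pid, other))
    return pairs
```

Each channel contributes pairs independently, so a pair missed by one channel (for example, a misspelt surname) can still be proposed by another.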

### 9.3 Pairwise scoring

The scoring formula, generalising Fellegi–Sunter:

```
LLR(i,j) = log prior_block(i,j)
         + Σ_k log [ P(feature_k | same) / P(feature_k | different) ]
         + λ_rel    * relational_overlap(i,j)
         + λ_time   * temporal_compatibility(i,j)
         + λ_place  * spatial_compatibility(i,j)
         + λ_neural * neural_pair_score(i,j)
         + λ_source * source_dependence_adjustment(i,j)

p_same   = sigmoid(a * LLR + b)
```

Ditto-style transformer matchers [Li et al., 2020] inform the
neural-pair component for hard pairs only; running a deep matcher
over all candidate pairs would be economically prohibitive.
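
A direct transcription of the scoring formula follows. The argument layout is hypothetical: per-feature likelihoods are passed as `(P(feature | same), P(feature | different))` pairs, and the λ-weighted terms as two dictionaries keyed by term name. The calibration constants `a` and `b` default to the identity mapping purely for illustration.

```python
import math


def pair_llr(prior_block, feature_likelihoods, terms, weights):
    """Log-likelihood ratio for one candidate pair.

    feature_likelihoods: iterable of (p_given_same, p_given_different).
    terms / weights: dicts keyed by term name, e.g. "rel", "time", "place".
    """
    llr = math.log(prior_block)
    llr += sum(math.log(p_same / p_diff) for p_same, p_diff in feature_likelihoods)
    llr += sum(weights[name] * value for name, value in terms.items())
    return llr


def p_same(llr, a=1.0, b=0.0):
    """Calibrated match probability: sigmoid(a * LLR + b)."""
    return 1.0 / (1.0 + math.exp(-(a * llr + b)))
```

With a neutral prior, a single feature nine times likelier under "same" than under "different", and no weighted terms, the LLR is log 9 and the uncalibrated probability is 0.9.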

### 9.4 Clustering: the reversibility principle

Naive transitive closure on positive identity edges produces
catastrophic merges in genealogy: *Mary Watson ≈ Mrs Watson*,
*Mrs Watson ≈ Mary Watson née Oxley*, *Mary Watson née Oxley ≈
Mary Oxley*, *Mary Oxley ≈ another Mary Oxley*, and suddenly the
nineteenth-century Cooktown beche-de-mer fisherman's widow is the
same person as a twentieth-century Sydney accountant.

The constrained clustering algorithm we are landing:

```
Input
  positive edges with weights
  negative / cannot-link edges with weights
  soft ontology and temporal constraints
  source-dependence penalties

Output
  clusters maximising same-edge agreement
  while minimising distinct-edge violations
```

Implemented in phases: union only `p_same ≥ 0.98` and no strong
cannot-link; local correlation-clustering-style optimisation inside
candidate blocks; produce ambiguous bridge edges as *proof
obligations*, not automatic merges; human-or-agent review only on
high-impact ambiguous bridges.

The non-negotiable principle is reversibility. A merge accepted at
time *t* and reversed at time *t' > t* must leave the database
queryable as-of any *t″ ∈ [t, t')* with the merged view, and as-of
any *t″ ∉ [t, t')* with the unmerged view. The substrate enforces
this by leaving original symbols in `donto_statement` untouched;
query projection maps symbol → referent under the requested
hypothesis at evaluation time, never at write time.
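
Query-time projection under this principle can be sketched as a lookup over merge records. We assume each merge is valid on a half-open interval (inclusive of acceptance, exclusive of reversal), matching the `'[)'` bounds used for `tx_time` elsewhere in the schema; the record layout and function name are hypothetical.

```python
def referent_as_of(symbol, merges, t):
    """Project a symbol to its referent as of system time t.

    merges: iterable of (symbol, referent, accepted_at, reversed_at),
            where reversed_at is None while the merge is still accepted.
    """
    for source, referent, accepted_at, reversed_at in merges:
        if source != symbol:
            continue
        # Half-open validity: accepted_at <= t < reversed_at.
        if accepted_at <= t and (reversed_at is None or t < reversed_at):
            return referent
    return symbol  # unmerged view: the symbol stands for itself
```

Because the projection is computed from merge records and never written back into statements, reversing a merge only closes an interval and no statement rows change.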

### 9.5 Predicate alignment

The 938,918 distinct predicates in the live database are not a sign
of expressive richness; they are a sign of LLM-driven predicate
proliferation. Each freely-minted predicate (`ex:fatherOf`,
`ex:isFatherOf`, `ex:hadFather`, `ex:was-father-of`, ...) is a row
in `donto_predicate` and a leaf in the alignment graph.

The alignment layer's job is to make queries portable across this
proliferation. An alignment edge declares one of eleven relations
(PRD §6.10) and three per-edge safety booleans:

```
safe_for_query_expansion : evaluator may ride this edge in EXPAND mode
safe_for_export          : downstream exporters may collapse via this edge
safe_for_logical_inference : OWL-style entailment may use this edge
```

A user can register an alignment between `ex:fatherOf` and
`ex:hadFather` with relation `inverse_equivalent`, safety flag
`safe_for_query_expansion=true`, scope = `ctx:genes/registry`. The
materialised closure (`donto_predicate_closure`) pre-computes
transitive chains the evaluator can ride at query time. `donto align
auto` proposes alignments via embedding-similarity over predicate
descriptors; auto-proposals land at confidence < 1.0 and are
reviewer-promotable.

### 9.6 Source provenance: the three-tier trace

A claim with surface text `"Mary Watson, born Cornwall 1860"`
should resolve to a `donto_span` row at character offsets `[1342, 1376)`
of revision `rev_92f1...` of document
`doc:genes/trove-cooktown/watkins-1881`. The
`packages/donto-trace/` crate implements this resolution as a
three-tier search.

For each statement carrying a `donto:textSpan` or
`ex:normalized_claims/text_span` predicate, the trace worker:

1. **Cache check (cross-shard).** Hash the surface text; look up
   `donto_trace_log` for any prior match (across all runs, not
   just this run). If found, reuse the prior result. Production
   data shows ~4.3× dedup: 1.5 M textSpan statements → 349 K
   distinct surface texts.

2. **Tier 1 — exact line equality.** Query `donto_revision_line`
   where `line_text = needle` and `length(line_text) ≤ 400`. A
   partial B-tree index on short lines makes this O(log n).

3. **Tier 2 — substring within a line.** Query
   `donto_revision_line` with `line_text LIKE pattern ESCAPE '\'`.
   Uses a trigram GIN index on `line_text` for substring search.
   Cheaper than Tier 3 because the per-row haystack is small.

4. **Tier 3 — full-body fallback.** Query
   `donto_document_revision.body_inline` with the same LIKE
   pattern. Slow but catches multi-line quotes. Only runs if the
   surface text contains a newline.

The four match types:

```
Exact       : surface text matches verbatim       confidence = 1.0
Normalized  : whitespace-collapsed match           confidence = 0.9
Ambiguous   : same text in >1 revision             confidence = 0.5
NotFound    : no match found                       confidence = 0.0
```
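
The cache check and the three tiers can be sketched over in-memory data as follows. This is not the `donto-trace` implementation: lists and dictionaries stand in for `donto_revision_line`, `body_inline`, and `donto_trace_log`, and the `Normalized` match type is omitted because the source does not say at which tier whitespace collapsing is applied.

```python
import hashlib


def classify(hits):
    """hits: list of (revision_id, position) pairs."""
    if not hits:
        return ("NotFound", 0.0, None)
    if len({revision for revision, _ in hits}) > 1:
        return ("Ambiguous", 0.5, None)
    return ("Exact", 1.0, hits[0])


def trace(surface, lines, bodies, cache):
    """lines: (revision_id, line_no, text) tuples; bodies: revision_id -> text."""
    key = hashlib.sha256(surface.encode("utf-8")).hexdigest()
    if key in cache:                      # cache check across runs
        return cache[key]
    # Tier 1: exact equality against short lines.
    hits = [(rev, no) for rev, no, text in lines
            if len(text) <= 400 and text == surface]
    if not hits:
        # Tier 2: substring within a single line.
        hits = [(rev, no) for rev, no, text in lines if surface in text]
    if not hits and "\n" in surface:
        # Tier 3: full-body fallback, only for multi-line surface text.
        hits = [(rev, body.find(surface)) for rev, body in bodies.items()
                if surface in body]
    cache[key] = classify(hits)
    return cache[key]
```

Each tier runs only when the cheaper one before it found nothing, so most lookups never reach the full-body scan.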

Writes are strictly additive. A successful match emits four rows:

- a `donto_span` (via `donto_create_char_span`),
- a `donto:hasSpan` claim linking the original statement to the
  span,
- a `donto_evidence_link` with type `anchored_at` and the match
  confidence,
- a `donto_trace_log` row for resumability.

The legacy `donto:textSpan` literal is never retracted. The new
`donto:hasSpan` claim coexists with it.

Resumability: the worklist query (`packages/donto-trace/src/lib.rs:
336-369`) skips statements already in `donto_trace_log` for the
given `run_name`. SIGINT sets an interrupted flag the main loop
checks at batch boundaries; the current batch completes cleanly and
the process exits. A re-run with the same `run_name` continues from
where it left off.

### 9.7 The content-addressed blob store

The blob store (`packages/donto-blob/`, migration 0125) is
content-addressed by SHA-256. Three backends are provided
(`LocalFsBlobStore`, `GcsBlobStore`, `MockBlobStore`) behind a
common trait:

```rust
pub trait BlobStore: Send + Sync {
    fn backend(&self) -> &'static str;
    async fn put_bytes(&self, bytes: &[u8], mime: Option<&str>)
        -> Result<BlobSummary>;
    async fn put_file(&self, path: &Path, mime: Option<&str>)
        -> Result<BlobSummary>;
    async fn exists(&self, sha256: &[u8; 32]) -> Result<bool>;
    async fn fetch(&self, sha256: &[u8; 32]) -> Result<Vec<u8>>;
    fn uri_for(&self, sha256: &[u8; 32]) -> String;
}
```

`put_file` makes two passes: it hashes the file first, then uploads
only if the blob is not already present.
`donto_register_blob(sha256, byte_size, mime_type, bucket_uri)`
records the blob in `donto_blob`; the revision table
references it. The same revision body uploaded by ten different
documents lands as one blob with ten document-revision references.
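The deduplication property can be illustrated with an in-memory stand-in. `MemBlobStore` is hypothetical and is not one of the three backends. The sketch also assumes the caller has already computed the SHA-256 digest, since the standard library has no SHA-256 implementation.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a content-addressed store (hypothetical).
#[derive(Default)]
pub struct MemBlobStore {
    blobs: HashMap<[u8; 32], Vec<u8>>,
    refs: HashMap<[u8; 32], usize>,
    uploads: usize,
}

impl MemBlobStore {
    /// Register one reference to `bytes` under a caller-supplied digest.
    /// Returns true only when the bytes were actually stored.
    pub fn put(&mut self, sha256: [u8; 32], bytes: &[u8]) -> bool {
        *self.refs.entry(sha256).or_insert(0) += 1;
        if self.blobs.contains_key(&sha256) {
            return false; // already present: skip the upload
        }
        self.blobs.insert(sha256, bytes.to_vec());
        self.uploads += 1;
        true
    }

    /// Number of distinct blobs held.
    pub fn blob_count(&self) -> usize {
        self.blobs.len()
    }

    /// Number of references recorded against one digest.
    pub fn ref_count(&self, sha256: &[u8; 32]) -> usize {
        self.refs.get(sha256).copied().unwrap_or(0)
    }

    /// Number of times bytes were actually written.
    pub fn uploads(&self) -> usize {
        self.uploads
    }
}
```

Ten `put` calls with the same digest leave one stored blob, one upload, and ten references, which is the shape described above.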

---

## 10. The Lean 4 Formal Overlay

`packages/lean/` contains a Lean 4 library and an executable
`donto_engine` binary. The binary is a stdio JSON sidecar that
`dontosrv` may spawn at startup. The two communicate through a
line-delimited DIR (Donto Intermediate Representation) envelope
format at protocol version `0.1.0-json`.

### 10.1 Type system

```lean
-- packages/lean/Donto/Core.lean
inductive Polarity where
  | asserted | negated | absent | unknown
  deriving Repr, BEq, DecidableEq

inductive Modality where
  | observed | derived | hypothesized | retracted
  deriving Repr, BEq, DecidableEq

inductive Confidence where
  | uncertified | speculative | moderate | strong
  deriving Repr, BEq, DecidableEq

structure Maturity where
  level : Nat
  hLE   : level ≤ 4

structure Statement where
  id         : Option String
  subject    : String
  predicate  : String
  object     : Object
  context    : String
  polarity   : Polarity
  modality   : Modality
  confidence : Confidence
  maturity   : Maturity
  valid_from : Option String
  valid_to   : Option String
```

The Lean types mirror the Postgres schema. The structural
isomorphism is what makes the DIR encoding straightforward: a
Lean `Statement` decodes from the same JSON shape that
`donto-client` serialises.

### 10.2 Shape combinators

```lean
-- packages/lean/Donto/Shapes.lean
structure Shape where
  iri      : String
  label    : String
  severity : Severity
  evaluate : List Statement → ShapeReport
```

Built-in shapes ship with the library:

- `builtin:functional/*` — at-most-one object per (subject,
  predicate) pair.
- `builtin:datatype/*` — typed-literal datatype enforcement.
- `builtin:parent-child-age-gap` — genealogy domain shape:
  `ex:parentOf` edges require parents 12–80 years older than
  children, compared via `ex:birthYear`. Missing or unreasonable
  gaps produce violation rows.

Each shape is a *pure predicate* over a scoped statement list: it
does not touch the database directly. dontosrv ships a focus
selector — a scoped pattern that materialises the statement list —
along with the shape IRI; Lean returns a `ShapeReport` with
violation rows.
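To make the "pure predicate over a statement list" idea concrete, here is a Rust rendering of the parent-child age-gap check. It is a sketch only: the real shape lives in Lean, and the `Stmt` and `Violation` types, the string-valued objects, and the reason strings are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Minimal statement triple for the sketch (objects are plain strings).
pub struct Stmt {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// One violation row: the offending parent/child pair and a reason.
#[derive(Debug, PartialEq)]
pub struct Violation {
    pub parent: String,
    pub child: String,
    pub reason: String,
}

/// Pure check over a statement list: no database access.
pub fn parent_child_age_gap(stmts: &[Stmt]) -> Vec<Violation> {
    // Index birth years by subject, ignoring unparseable values.
    let years: HashMap<&str, i32> = stmts
        .iter()
        .filter(|s| s.predicate == "ex:birthYear")
        .filter_map(|s| s.object.parse::<i32>().ok().map(|y| (s.subject.as_str(), y)))
        .collect();

    let mut out = Vec::new();
    for s in stmts.iter().filter(|s| s.predicate == "ex:parentOf") {
        let parent_year = years.get(s.subject.as_str());
        let child_year = years.get(s.object.as_str());
        let reason = match (parent_year, child_year) {
            // Gap is child birth year minus parent birth year.
            (Some(p), Some(c)) if (12..=80).contains(&(c - p)) => continue,
            (Some(p), Some(c)) => format!("gap of {} years outside 12..=80", c - p),
            _ => "missing birth year".to_string(),
        };
        out.push(Violation {
            parent: s.subject.clone(),
            child: s.object.clone(),
            reason,
        });
    }
    out
}
```

Because the function sees only the list it is handed, the same check can run over any scope the focus selector materialises.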

### 10.3 Rules

```lean
-- packages/lean/Donto/Rules.lean
structure Rule where
  iri    : String
  label  : String
  output : Context
  mode   : RuleMode  -- eager | batch | onDemand
  apply  : List Statement → List Statement
```

`transitiveClosure p` emits all `(a, p+, c)` from `(a, p, b)` and
`(b, p, c)`, in the output context, at modality `derived` and
maturity 3 (E3 by the storage encoding, which maps to "reviewed" in
the maturity-ladder doc; rules are formal certifications and thus
considered as-if-reviewed). Inverse and symmetric rules emit
bidirectional pairs.
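A naive fixpoint version of the transitive-closure rule, written in Rust for illustration. The function name and the choice to return only newly derived pairs (leaving out the asserted base edges) are assumptions. A real rule would also stamp each emitted statement with the output context, modality `derived`, and maturity 3.

```rust
use std::collections::{BTreeSet, HashSet};

/// Given the `(a, b)` edges of a single predicate `p`, return every pair
/// `(a, c)` reachable through two or more `p` edges that is not already
/// asserted in the input.
pub fn transitive_closure(edges: &[(String, String)]) -> BTreeSet<(String, String)> {
    let base: HashSet<(String, String)> = edges.iter().cloned().collect();
    let mut closure = base.clone();

    // Naive fixpoint: join the closure with itself until nothing new appears.
    loop {
        let mut fresh = Vec::new();
        for (a, b) in &closure {
            for (b2, c) in &closure {
                if b == b2 {
                    let pair = (a.clone(), c.clone());
                    if !closure.contains(&pair) {
                        fresh.push(pair);
                    }
                }
            }
        }
        if fresh.is_empty() {
            break;
        }
        closure.extend(fresh);
    }

    // Only the derived pairs are returned, in sorted order.
    closure.difference(&base).cloned().collect()
}
```

The quadratic join is fine for a scoped statement list but would need a worklist or semi-naive evaluation for large inputs.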

### 10.4 The protocol

`donto_engine` reads lines from stdin; for each line it parses a
JSON envelope, dispatches via `Donto.Engine.dispatch`, and emits
a JSON response. Failures (parse errors, unknown shape IRIs) produce
explicit `error` envelopes:

```
request:
{
  "version"   : "0.1.0-json",
  "kind"      : "validate_request",
  "shape_iri" : "lean:builtin/parent-child-age-gap",
  "scope"     : {...},
  "statements": [...]
}

response:
{
  "version"     : "0.1.0-json",
  "kind"        : "validate_response",
  "shape_iri"   : "...",
  "focus_count" : N,
  "violations"  : [...]
}
```

### 10.5 The certifies-not-gates invariant

dontosrv spawns the Lean engine with a 10-second startup banner
timeout and enforces a 30-second per-request timeout. On engine
death or unresponsiveness the parent closes the child and returns
`sidecar_unavailable` for subsequent shape/rule calls; the rest of
dontosrv continues to serve ingest, query, and policy traffic
without interruption.

This is the principle: **Lean certifies, doesn't gate**. Ingest
never waits on the formal overlay. The sidecar's absence degrades
shape/rule/cert calls only.
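The invariant can be sketched with a worker thread and a bounded wait. This is an illustration, not dontosrv's code: `certify_with_timeout`, `CertError`, and `ingest` are hypothetical names, and a thread plus channel stands in for the child process and its stdio pipes.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CertError {
    SidecarUnavailable,
}

/// Run `engine` on a worker thread and wait at most `timeout` for its reply.
/// A timeout and a dead worker both map to `SidecarUnavailable`.
pub fn certify_with_timeout<F>(engine: F, timeout: Duration) -> Result<String, CertError>
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller has already given up.
        let _ = tx.send(engine());
    });
    rx.recv_timeout(timeout)
        .map_err(|_| CertError::SidecarUnavailable)
}

/// Ingest never consults the engine, so it succeeds regardless of
/// engine health. Returns the row count after the write.
pub fn ingest(rows: &mut Vec<String>, row: &str) -> usize {
    rows.push(row.to_string());
    rows.len()
}
```

The point of the design is visible in the signatures: `ingest` has no path to `CertError`, so an unavailable sidecar cannot fail a write.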

---

## 11. Release and Federation

### 11.1 The release contract

A release (PRD §17.1) is eleven things:

1. Release query.
2. Claim set.
3. Evidence manifest.
4. Source manifest.
5. Policy report.
6. Transformation report.
7. Adapter loss report.
8. Checksum manifest.
9. Citation metadata.
10. Reproduction instructions.
11. Optional export packages.

`packages/donto-release/` builds these from a `ReleaseSpec` JSON.
Release blockers include any included claim with a policy
disallowing publication, any source with unresolved policy, any
restricted-anchor reference without redaction, any claim below the
release-maturity threshold, any adapter loss report with unaccepted
critical loss, and any required review that has not occurred.
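The six blockers can be read as independent checks over a release candidate. The sketch below is hypothetical: `Claim`, `Candidate`, and the counter fields are invented for illustration and do not reflect the actual `ReleaseSpec` schema.

```rust
/// Reasons a release build is refused (one variant per blocker above).
#[derive(Debug, PartialEq)]
pub enum Blocker {
    PolicyDisallowsPublication,
    UnresolvedSourcePolicy,
    UnredactedRestrictedAnchor,
    BelowMaturityThreshold,
    UnacceptedCriticalLoss,
    RequiredReviewMissing,
}

/// Hypothetical per-claim facts a builder would need for the checks.
pub struct Claim {
    pub publishable: bool,
    pub maturity: u8,
    pub restricted_anchor: bool,
    pub redacted: bool,
}

/// Hypothetical release candidate; counts stand in for richer reports.
pub struct Candidate {
    pub claims: Vec<Claim>,
    pub min_maturity: u8,
    pub unresolved_sources: usize,
    pub unaccepted_critical_losses: usize,
    pub pending_reviews: usize,
}

/// Collect every blocker instead of stopping at the first one.
pub fn blockers(c: &Candidate) -> Vec<Blocker> {
    let mut out = Vec::new();
    if c.claims.iter().any(|x| !x.publishable) {
        out.push(Blocker::PolicyDisallowsPublication);
    }
    if c.unresolved_sources > 0 {
        out.push(Blocker::UnresolvedSourcePolicy);
    }
    if c.claims.iter().any(|x| x.restricted_anchor && !x.redacted) {
        out.push(Blocker::UnredactedRestrictedAnchor);
    }
    if c.claims.iter().any(|x| x.maturity < c.min_maturity) {
        out.push(Blocker::BelowMaturityThreshold);
    }
    if c.unaccepted_critical_losses > 0 {
        out.push(Blocker::UnacceptedCriticalLoss);
    }
    if c.pending_reviews > 0 {
        out.push(Blocker::RequiredReviewMissing);
    }
    out
}
```

Returning the full list rather than the first failure lets a release author fix every problem in one pass.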

### 11.2 Native and optional exports

The native format is `donto-release.jsonl` (one statement per line,
with checksums) plus `manifest.json`. Optional exports include
RO-Crate (a BagIt-compatible JSON-LD metadata file), CLDF (for
language datasets — lossy, requires `--max-cldf-loss` gate),
CoNLL-U (for corpus releases), CSV/TSV for tabular subsets, and
RDF / JSON-LD for linked-data consumers.

### 11.3 Ed25519 signing

```rust
// packages/donto-release/src/envelope.rs
pub struct Keypair { signing: ed25519_dalek::SigningKey, /* ... */ }

impl Keypair {
    pub fn generate() -> Self { /* OsRng */ }
    pub fn from_seed(seed: [u8; 32]) -> Self { /* deterministic */ }
    pub fn did_key(&self) -> String {
        // multibase base58btc ("z" prefix) over multicodec [0xed, 0x01] + verifying_key
    }
}

pub struct ReleaseEnvelope {
    pub manifest_id     : String,
    pub manifest_sha256 : String,
    pub issuer_did      : String,
    pub signature_suite : String,   // "Ed25519Signature2020"
    pub signature       : String,
    pub created_at      : String,
}

pub fn sign(manifest: &ReleaseManifest, kp: &Keypair) -> ReleaseEnvelope { /* … */ }
pub fn verify(env: &ReleaseEnvelope) -> Result<(), VerifyError> { /* … */ }
```

The CLI subcommand `donto release pipeline` orchestrates the full
five-stage emission: build manifest from spec → write native JSONL
→ sign → write RO-Crate metadata → optional CLDF export. Verification
requires no network: `did:key` is self-contained.

### 11.4 Federation: an evaluation

`docs/M9-FEDERATION-MEMO.md` records an evaluation of five
candidate federation stacks. The explicit non-goal is "any
researcher querying any other researcher's tree". The actual
question is whether instance B can verify a release manifest from
instance A without re-ingesting A's source content.

**W3C Verifiable Credentials + DID** — Verdict: strongest fit.
Each manifest is a signed VC; the signer is identified by a DID.
Selective disclosure (BBS+, SD-JWT) hides claim payloads when
policy demands. Cost: BBS+ requires pairing-friendly curves not in
standard Postgres/Rust crypto stacks (~weeks of implementation).

**Solid Pods** — Verdict: defer. The pod model assumes one
principal per pod. donto already has multi-context, multi-authority
semantics that the pod model doesn't represent natively.

**SPARQL federation (`SERVICE`)** — Verdict: reject as primary
layer. Information leakage through query shape and count is a known
risk; donto's PRD invariant "cross-instance restricted content
cannot leak through counts or errors" rules out the naive
implementation.

**DataCite-style citation metadata** — Verdict: proceed. The
`ReleaseManifest` is essentially what DataCite expects. The
federation piece is publishing the manifest to a registry (DataCite,
Zenodo, OpenAIRE) and adopting their identifier scheme. Compatible
with VC/DID for trust: the manifest is a VC; publication registers
its reference.

**RO-Crate** — Verdict: proceed independently. RO-Crate is a
format, not a federation protocol. Pairs naturally with VC/DID
(sign the crate's metadata) and DataCite (publish the signed
crate's identifier).

The synthesis (M9-MEMO §4):

```
Manifest format          : RO-Crate (M7 work, landed)
Signing layer            : VC over the manifest (M9 spike, did:key)
Publishing               : DataCite-style citation metadata
Live cross-instance query: explicit non-goal for v1
```

Acceptance: instance A builds release R; the output is an RO-Crate
signed by a VC issued under A's DID; instance B fetches the crate,
verifies the VC, reads the release manifest, and can answer "does
this crate contain claims about entity X" without ever storing A's
raw rows.

---

## 12. Empirical Evaluation

We characterise the substrate at two scales: the synthetic benchmark
suite (`donto bench`) at 10 K, 100 K, and 1 M rows on the production
hardware, and the live `genes` corpus at 39.3 M rows.

### 12.1 Benchmark hardware

GCE `e2-standard-4` (4 vCPU, 16 GB RAM), PostgreSQL 16 in the
`donto-pg` Docker container with volume bind to
`/mnt/donto-data/pgdata` (SSD). Workload is the synthetic fixture
in `donto-cli bench`: write N rows under a throwaway context, time
one point query and one batch query.

### 12.2 Insert throughput and point-query latency (H1)

| Scale (N) | Insert wall | Inserts/s | Point query (H1) | Batch query (H4) |
|-----------|-------------|-----------|------------------|------------------|
| 10,000    |   3.36 s    | **2,977** | 10.7 ms          | 50 ms            |
| 100,000   |  35.48 s    | **2,819** | 42.8 ms          | 504 ms           |
| 1,000,000 | 396.67 s    | **2,521** | 50.9 ms          | 6.59 s           |

Insert throughput is essentially flat at ~2.5–3.0 K row/s across
two decades of scale (10 K to 1 M rows). Linear extrapolation gives
~70 min for a 10 M-row cold ingest (the H10 hard target); the genes
prod corpus (39 M statements) would be ~4.3 hours cold at the
1 M-row rate, though production ingest is faster through batched
pipelines than the single-row CLI path.
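The extrapolation is plain division, and it can be reproduced from the table. The sketch assumes the 1 M-row measurement as the steady rate, which is a simplification; the function names are illustrative.

```rust
/// Rows per second for a run of `rows` inserts taking `wall_secs`.
pub fn throughput(rows: u64, wall_secs: f64) -> f64 {
    rows as f64 / wall_secs
}

/// Seconds needed to insert `rows` at a constant `rows_per_sec`.
pub fn cold_ingest_secs(rows: u64, rows_per_sec: f64) -> f64 {
    rows as f64 / rows_per_sec
}
```

With the 1 M-row row of the table (396.67 s), the rate is about 2,521 rows/s, which puts a 10 M-row ingest at about 66 minutes and the 39,294,083-row genes corpus at about 4.3 hours, in line with the rounded figures above.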

Point queries stay sub-100 ms through 1 M rows on the SPO / POS /
OSP indexes. PRD §25 H1 hard target is 100 ms at 10 M; the trend
is on track.

### 12.3 Extended benchmark suite (H2–H9)

| Scale (N) | H1 point | H4 batch | H2 aligned | H3 AS_OF | H5 frontier | H7 modality query |
|-----------|---------:|---------:|-----------:|---------:|------------:|------------------:|
| 10,000    | 15.1 ms  | 80 ms    | 10.6 ms    | 4.4 ms   | 3 ms        | 48 ms             |
| 100,000   |  8.9 ms  | 2.59 s   | 3.2 ms     | 3.4 ms   | 2 ms        | 95 ms             |
| 1,000,000 | 33.6 ms  | 8.21 s   | 4.6 ms     | 3.0 ms   | 15 ms       | 1.73 s            |

| Scale (N) | H6 join | H8 setup | H8 query | H8 rows kept | H9 4× concurrent |
|-----------|--------:|---------:|---------:|-------------:|-----------------:|
| 10,000    |    6 ms |  106 ms  |  107 ms  |    9,900     |        385 ms    |
| 100,000   |    6 ms |  798 ms  |  812 ms  |   99,000     |        428 ms    |

**Observations.**

- Batch (full-context-scan) queries (H4) grow roughly linearly with
  N end to end (80 ms at 10 K → 8.21 s at 1 M), as expected; the
  planner does not optimise unbound predicate-and-subject patterns.
- The aligned-query path (H2) is *faster* than the unaligned path at
  100 K and 1 M because the closure-rebuild work moved the
  alignment join to a hot in-memory cache.
- The `AS_OF` bitemporal path (H3) is essentially constant: GiST
  index on `tx_time` plus partial-index on currently-open rows.
- Contradiction frontier (H5) is bound by index lookup, not by
  row count: the frontier view is materialised.
- Modality filter setup (H7 setup at 1 M) takes 15 s; the filter
  itself runs in ~1.7 s. Adding a composite
  `(modality, statement_id)` index is a deferred tuning candidate.
- Policy allows (H8) scales roughly linearly with N: ~107 ms at
  10 K → ~812 ms at 100 K, extrapolating to ~8 s at 1 M. Fine for
  curated read workloads, a tuning candidate for hot paths.
- 4× concurrent writers (H9) complete in ~400 ms regardless of bench
  scale. Advisory-lock + unique-content-hash path doesn't contend.

### 12.4 Live corpus characterisation

The genes corpus, as of 2026-05-28:

```
Statements:                39,294,083
Distinct predicates:          938,918
Distinct contexts:             19,230
Evidence links:             1,837,151
Database size:                  48 GB
donto_statement table:          32 GB
Currently-believed:        39,293,802
Retracted:                        281
```

**Polarity:**

```
asserted : 39,292,908  (99.997%)
negated  :        813  ( 0.002%)
unknown  :        331
absent   :         31
```

**Maturity (storage-bit encoding):**

```
0  (raw)             : 22,082,198   (56.2%)
1  (candidate)       : 14,870,096   (37.8%)
2  (evidence-supp.)  :     47,905   ( 0.1%)
3  (reviewed)        :    244,806   ( 0.6%)
4  (E5 certified*)   :  2,049,078   ( 5.2%)
```

**Top contexts (by statement count):**

```
ctx:genealogy/research-db                          21,842,452
ctx:genealogy/smoketest                             4,114,991
ctx:genealogy/analysis-db                           3,852,396
ctx:genes/yeatman-knoll-coleman-banjo-gibson        1,159,340
ctx:genes/edward-herbert-chinese-stack/qld          1,011,967
ctx:genealogy/research-db/source/unknown              497,658
ctx:genealogy/resources                               328,863
ctx:genes/naa-32841845                                271,739
ctx:genes/edward-herbert-father                       192,952
ctx:genes/trove-cooktown/reynolds                     142,192
ctx:genes/trove-cooktown/beche-de-mer                 114,401
```

**Top predicates (by statement count):**

```
rdf:type                  3,751,377
donto:status              1,629,900
donto:aboutPredicate      1,629,842
donto:confidenceLabel     1,226,023
donto:predicate           1,224,518
donto:textSpan            1,220,903
donto:extractionModel     1,206,699
donto:objectValue         1,113,534
donto:hasSpan             1,095,921
donto:claimB              1,086,531
donto:aboutSubject        1,086,531
donto:claimA              1,086,531
ex:knownAs                1,081,364
donto:createdAt             857,137
donto:inSource              764,847
```

The top fifteen predicates are dominated by reified meta-statements
(`donto:status`, `donto:aboutPredicate`, `donto:textSpan`,
`donto:extractionModel`, `donto:claimA/B`) rather than domain
predicates. This is a direct consequence of how the M5 extractor
emitted facts: each claim was reified into ~7–10 quad rows. The
exhaustive (M6) extractor does not reify in the same way; the older
half of the database carries a heavy reification tail.

### 12.5 Real workloads

Three substantive research workloads have exercised the system in
the last twelve months:

**Annie Davis.** Multiple genealogical sources disagree on Annie
Davis's parents, birth year, and place. donto stores claims from
the Davis family oral history, the Brackenridge family records, the
Mareeba marriage register, three colonial-era obituaries, and the
2007 EKY native-title determination, *all of which contradict each
other on at least one field*. The contradiction frontier view
exposes the disagreements; the maturity ladder caps the reviewable
claims at E3 pending family-elder review; the policy capsule for
oral-history sources defaults to `community_restricted`.

**Caroline Brown / Kaitchi.** A second-generation EKY apical
ancestor whose parents and grandparents are subject to active
litigation in the Federal Court of Australia. donto holds 820 lines
of dossier and ~80 kinship triples, of which only three carry
evidence links to primary sources — a sparsity flagged for the
Trust Kernel HTTP-middleware testbed (see §14.1).

**Blucher and the Maryborough boiling-down works.** A nineteenth-
century Aboriginal figure mis-identified across colonial archives;
the corpus distinguishes the Maryborough "King Blucher" (Maryborough
boss-name, NMP worker) from the authentic apical Bujilkabu claim.
The PDFs at `genes.apexpots.com/pdfs/` and the verbatim primary
sources at `genes.apexpots.com/blucher/s/` are produced from this
data.

These workloads have driven changes to the substrate: the
contradiction frontier view (H5), the multi-context attachment
(migration 0103, for claims that belong to both a hypothesis lens
and an oral-history source), and the source-provenance trace
(`donto-trace`, §9.6).

### 12.6 Long-running queries

Two queries submitted by the agent while preparing this paper *did
not* return within its time budget:

```sql
SELECT COUNT(DISTINCT subject) FROM donto_statement;
-- > 30 min, did not complete
```

```sql
SELECT COUNT(*) FROM (
    SELECT subject, predicate
    FROM donto_statement
    WHERE upper_inf(tx_time)
    GROUP BY subject, predicate
    HAVING COUNT(DISTINCT donto_polarity(flags)) > 1
) t;
-- > 5 min, did not complete
```

Subject cardinality and contradiction-by-polarity are both
load-bearing characterisation queries we would like to report
empirically. That they do not complete in routine time is itself a
finding: a `donto_subject_stats` matview (companion to the existing
`donto_subject_count` matview that powers the
`/subjects/all` directory page) is the natural fix and is the
top item on the scheduled-refresh roadmap.

---

## 13. Discussion

### 13.1 What worked

**Append-only discipline.** The decision to never `DELETE FROM
donto_statement` and to extend the same discipline to alignments,
identity, policies, and attestations through `donto_event_log` has
*not* produced an unmanageable storage explosion. At 39.3 M rows
and only 281 retractions, the bitemporal overhead is essentially
nil. The discipline pays for itself the first time you need to
answer "what did we believe last Tuesday?"

**Idempotent migrations.** All 131 migrations are idempotent (`if
not exists`, `create or replace`, advisory-locked migrator,
SHA-256 ledger). New migrations are added by sequential number;
prior migrations are never edited. The model means a fresh database
can be brought up to head with one `donto migrate` call, and a
production database can absorb new migrations without service
interruption (the F-1 closure ran in production with zero
downtime).

**The pgrx packaging.** The pgrx extension `pg_donto` embeds every
migration via `extension_sql_file!` and provides Rust mirrors for
plan-quality immutable helpers. The cost was non-trivial (managing
pgrx version skew, learning the Rust-Postgres ABI), but the result
is that a single `CREATE EXTENSION pg_donto` on a fresh database
produces a working substrate. This is what packaging looks like
when you take Postgres seriously as the boundary.

**The tripwire test suite.** 77 files, ~20 K LOC, ~592
`#[tokio::test]` and ~91 `#[test]` annotations, with the
`pg_or_skip!` macro letting database-touching tests skip cleanly
when Postgres is unreachable. The suite encodes PRD invariants as
executable assertions. Every PRD §I-clause has at least one
tripwire; every new invariant lands with at least one new test.
The convention has caught more regressions than any single review.

### 13.2 What surprised

**Predicate proliferation.** 938,918 distinct predicates is far
beyond what we expected. The freely-minted-predicate problem is the
direct downstream of giving LLMs the latitude to mint IRIs without
a registry lookup, which we did for M5 because the alternative —
"the model must use one of these 12,000 existing predicates" — was
producing systematic under-coverage (the model would refuse to
extract a claim it had no good predicate for). The vocab-aware
extraction (commit `31c519b`, "vocab-aware extraction — stop minting
fresh predicates") is the partial fix; the alignment-closure
backfill (~922 K predicates → nearest canonical, cosine ≥ 0.9) is
the larger one.
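The nearest-canonical proposal step can be sketched as a cosine-similarity search with a cut-off. This assumes each predicate already has an embedding vector. The `propose` function and the toy two-dimensional vectors used to exercise it are illustrative, not the backfill's actual code.

```rust
/// Cosine similarity of two equal-length embedding vectors.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Propose the nearest canonical predicate for a minted one, but only
/// if its similarity clears `threshold` (0.9 in the text above).
pub fn propose<'a>(
    minted: &[f32],
    canonical: &'a [(&'a str, Vec<f32>)],
    threshold: f32,
) -> Option<(&'a str, f32)> {
    canonical
        .iter()
        .map(|(iri, v)| (*iri, cosine(minted, v)))
        .max_by(|x, y| x.1.total_cmp(&y.1))
        .filter(|(_, score)| *score >= threshold)
}
```

A minted predicate with no canonical neighbour above the threshold yields `None` and stays unaligned, which keeps low-confidence proposals out of the review queue.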

**Evidence-anchor sparsity.** 1,837,151 evidence links / 39,294,083
statements ≈ 4.7 % anchor coverage. This is the gap between the
intended invariant (I1: no claim without evidence or
`hypothesis_only`) and the lived reality of a corpus accumulated
across two extractor generations. The anchor-aware ingest (commit
`5928bff`) plus the `donto trace` provenance backward-fill (Stage
D) is closing this gap; the roadmap target is ≥ 50 % coverage.

**Retraction rarity.** 281 retractions across 39.3 M statements
(7 × 10⁻⁶). This is a striking finding. It tells us that we are
operating donto as an *append-mostly archive* rather than as a
constantly-revised research notebook. The substrate supports
retraction (`donto_retract`, `donto_correct`) and the discipline is
encoded in the CLAUDE.md non-negotiable list. But the actual usage
shows that researchers are accumulating contradictory claims
rather than retracting older ones — exactly as the paraconsistency
invariant says they should. The two oldest sources that disagree
about Annie Davis's birth year are both still live.

**The reification tail.** The top fifteen predicates by row count
are dominated by reified meta-statements (`donto:status`,
`donto:aboutPredicate`, `donto:textSpan`, `donto:claimA`,
`donto:claimB`). The M5 extractor reified each claim into ~7–10
rows; the M6 exhaustive extractor does not. The older half of the
database carries the reification tail; the newer half is
substantially less dense. The plan for unwinding the tail is
quarantine (move `ex:normalized_claims/*` to a retract-or-quarantine
context, ~2.37 M rows) plus a re-ingest under the new vocabulary.

**Long-running characterisation queries.** Subject cardinality and
polarity-mixed contradictions do not return in routine time at 39 M
rows. We thought we knew the index story for these — index on
subject, index on (subject, predicate) — but DISTINCT and
GROUP BY HAVING over the whole table both require parallel
hash-aggregate plans that take longer than the agent budget. The
fix is matviews (`donto_subject_stats`,
`donto_contradiction_pressure`) refreshed on a daily schedule.

### 13.3 The Conceivable Aperture: an honest accounting

Of the six apertures, Conceivable is the one we are least sure
about. It produces unanchored, hypothesis-only claims by design.
The position recorded in the Maximalism doc — *"mine everything;
let curation decide"* — is principled, but we have not yet tested
whether downstream curation actually filters Conceivable output at
useful rates, or whether it floods the candidate space in ways that
make E2 promotion costly. The provisional answer is to keep it on
by default but in a dedicated `ctx:.../conceivable` sub-context, so
that release builders can exclude the entire context with one
clause.

### 13.4 Lean parity

The Lean overlay is operationally working: `donto_engine` spawns
under dontosrv and certifies the three built-in shapes (functional,
typed-literal, parent–child age-gap). Separately, the
`autoresearch-genealogy/lean/Genealogy/` library has a substantially
more developed catalogue (one-birth-per-person, sameAs
symmetry/transitivity, parent-date plausibility). The gap is that
the two libraries have not yet converged: shapes that exist in
Genealogy do not yet exist in `packages/lean/`. The convergence
work is straightforward (port each shape combinator with its proof
of soundness) but unspectacular; landing it is the natural milestone
that completes the Lean side of the substrate.

---

## 14. Limitations and Future Work

### 14.1 Substrate

- **HTTP-middleware Trust Kernel enforcement.** The substrate-side
  policy is fail-closed. The HTTP-side enforcement (block writes
  that don't carry policy) is the visible end-to-end Trust Kernel
  test, with hundreds of unpoliced legacy sources in the genes
  corpus as the natural testbed.
- **Subject cardinality matview.** The `donto_subject_stats`
  matview is the route to answering basic characterisation queries
  at corpus scale without ad-hoc full table scans.
- **Predicate alignment backfill.** Across 938 K predicates, the
  nearest-canonical alignment proposal (cosine ≥ 0.9) needs to be
  run, reviewed at the high-confidence band, and applied. The
  closure-rebuild work happens daily on a systemd timer; the
  proposal stage does not.
- **H10 scale lock.** PRD §25 hard target is 100 ms at 10 M rows.
  The extrapolation says the target is met, but only an actual run
  gives a number we can cite. The scheduling cost is ~70 minutes for
  the insert pass plus benchmark; it is worth committing the
  calendar slot.

### 14.2 Substrate completeness vs application

The substrate (M0–M4) is complete. Open work clusters around
applications:

- **M5 (Extraction):** scheduled runs as a systemd timer (~20 LOC
  shell). Currently runs are submitted ad-hoc via
  `donto extract`.
- **M6 (Language pilot):** all five importers (CLDF, CoNLL-U,
  UniMorph, LIFT, EAF) shipped with tests; the work is running
  them against real datasets (Glottolog, Universal Dependencies,
  UniMorph, LIFT, ELAN-attested corpora) and dispatching from
  the CLI.
- **M7 (Release builder):** wrap the existing JSONL + RO-Crate +
  Ed25519 pipeline as a `donto release` CLI verb.
- **M8 (Scale):** see H10 above.
- **M9 (Federation):** publish a release end-to-end via DataCite
  (or Zenodo) and demonstrate the two-instance smoke-test
  (instance A signs, instance B verifies).

### 14.3 Domain

- **Lean parity** with the `autoresearch-genealogy/lean/Genealogy/`
  library.
- **Orphan-research-notes cross-linker:** the genes workspace has
  several hundred `.md` files that are not yet linked from any
  context's source registration.
- **DontoQL `WITH evidence` result shape** (current evaluator
  records the directive but does not change the row shape).
- **CLI manpage and completions install:** the CLI emits these on
  `donto man` / `donto completions`; package the install as part of
  a `donto install-completions` subcommand.

### 14.4 What we are deliberately not doing

- **Live cross-instance federated query** (§11.4 non-goal).
- **Closed-world entity reconciliation.** Identity is a hypothesis,
  not a foreign key.
- **Aggressive normalisation at write time.** The substrate writes
  what the extractor emits; alignment and identity are evaluated at
  query time under the user's chosen lens.
- **OWL-style entailment as primary semantics.** Lean certifies;
  Postgres executes; OWL is one possible alignment safety flag
  among three.

---

## 15. Conclusion

donto is, by intention, an uneasy product. It treats contradictions
as data, schemas as plural, identities as hypotheses, sources as
policy-bound, and time as bitemporal. It refuses the simplifying
assumptions a triple store usually makes — and pays the cost in
schema complexity (91 tables), test surface (~20 K LOC of
tripwires), and conceptual overhead (a 2,500-line PRD with ten
non-negotiable invariants). What we get in exchange is a substrate
where two oral histories disagreeing about a great-grandmother's
birthplace can both live, where the legal-precedent citation chain
from a 2013 Federal Court determination can be queried under a
strict identity lens that excludes the 2024 family-elder review's
provisional merges, where the predicate `ex:motherOf` and the
predicate `ex:hadMother` are typed-aligned but not collapsed at
storage, where a release artefact ships with its own checksum
manifest and policy report and an Ed25519 envelope verifiable
without contacting the originating instance.

The system runs in production at 39.3 million statements against a
single PostgreSQL instance on a 4-vCPU VM. The benchmark numbers
are encouraging: 2.5–3.0 K-row/s insert throughput holds through
1 M rows on the standard hardware; point queries stay sub-100 ms.
The empirical surprises (predicate proliferation, evidence
sparsity, low retraction rate, the reification tail) point to
concrete next steps that the substrate's architectural choices make
*possible* — backfill alignments without rewriting history,
backward-fill anchors via three-tier trace, refresh matviews on a
schedule without invalidating the bitemporal model.

We do not claim donto is the right substrate for every knowledge
graph application. We claim it is a working substrate for *contested
knowledge*, demonstrated against one of the hardest realistic
domains we know — North-Queensland Aboriginal genealogy and the
language documentation surrounding it. Every architectural choice
in this paper has a tripwire test, a PRD section, a migration, and
a row count in the live database to back it. The system is, in the
sense the PRD demands, *working*.

---

## References

Allen, J. F. (1983). Maintaining knowledge about temporal intervals.
*Communications of the ACM*, 26(11), 832–843.

Belnap, N. D. (1977). A useful four-valued logic. In J. M. Dunn &
G. Epstein (Eds.), *Modern Uses of Multiple-Valued Logic*
(pp. 8–37). Dordrecht: Reidel.

Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J.,
Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-
Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., &
Hudson, M. (2020). The CARE principles for Indigenous data
governance. *Data Science Journal*, 19, 43.

Cyganiak, R., Wood, D., & Lanthaler, M. (2014). *RDF 1.1 Concepts
and Abstract Syntax* (W3C Recommendation). W3C.

da Costa, N. C. A. (1974). On the theory of inconsistent formal
systems. *Notre Dame Journal of Formal Logic*, 15(4), 497–510.

Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage.
*Journal of the American Statistical Association*, 64(328),
1183–1210.

Gregg, F., & Eder, D. (2015). *Dedupe: a Python library for
accurate and scalable fuzzy matching, record deduplication and
entity-resolution*. github.com/dedupeio/dedupe.

Harris, S., & Seaborne, A. (2013). *SPARQL 1.1 Query Language*
(W3C Recommendation). W3C.

Hickey, R. (2012). Datomic. *Strange Loop 2012*.

Jaśkowski, S. (1948). Rachunek zdań dla systemów dedukcyjnych
sprzecznych. *Studia Societatis Scientiarum Torunensis*, Sectio A,
1(5), 55–77.

Konda, P., Das, S., Suganthan G. C., P., Doan, A., Ardalan, A.,
Ballard, J. R., Li, H., Panahi, F., Zhang, H., Naughton, J.,
Prasad, S., Krishnan, G., Deep, R., & Raghavendra, V. (2016).
Magellan: Toward building entity matching management systems.
*PVLDB*, 9(12), 1197–1208.

Lebo, T., Sahoo, S., & McGuinness, D. (2013). *PROV-O: The PROV
Ontology* (W3C Recommendation). W3C.

Li, Y., Li, J., Suhara, Y., Doan, A., & Tan, W.-C. (2020). Deep
entity matching with pre-trained language models. *PVLDB*, 14(1),
50–60.

Library of Congress (2019). *Extended Date/Time Format (EDTF)
Specification*. LoC.

Linacre, R. (2022). *Splink: probabilistic record linkage at
scale*. github.com/moj-analytical-services/splink.

Pratt, J., Dale, S., Ploderer, B., et al. (2019). XTDB / Crux: an
unbundled, bitemporal database. *Strange Loop 2019*.

Priest, G. (1979). The logic of paradox. *Journal of Philosophical
Logic*, 8(1), 219–241.

Snodgrass, R. T. (1999). *Developing Time-Oriented Database
Applications in SQL*. Morgan Kaufmann.

Soiland-Reyes, S., Sefton, P., Crosas, M., Castro, L. J., Coppens,
F., Fernández, J. M., Garijo, D., Grüning, B., La Rosa, M., Leo,
S., Ó Carragáin, E., Portier, M., Trisovic, A., RO-Crate Community,
Groth, P., & Goble, C. (2022). Packaging research artefacts with
RO-Crate. *Data Science*, 5(2), 97–138.

Vrandečić, D., & Krötzsch, M. (2014). Wikidata: a free
collaborative knowledgebase. *Communications of the ACM*, 57(10),
78–85.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G.,
Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos,
L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T.,
Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T.,
Finkers, R., … Mons, B. (2016). The FAIR guiding principles for
scientific data management and stewardship. *Scientific Data*, 3,
160018.

---

## Appendix A: Live Service Endpoints

```
dontosrv               http://localhost:7879   axum, Rust, 67 routes
donto-api              http://localhost:8000   FastAPI + Temporal
donto-api-worker       n/a                     Temporal worker (extraction)
donto-debug            http://localhost:3002   Next.js debug dashboard
dontopedia-web         http://localhost:3000   Next.js public site
agent-runner           http://localhost:4001   Temporal kickoff fastify
caddy                  :80 / :443              TLS termination + proxy
donto-pg               :55432 → :5432          Postgres 16 in Docker
temporal               :7233 + :8088 (UI)      workflow engine
```

Public DNS routes (Cloudflare in Full SSL mode):

```
genes.apexpots.com            → mostly :8000, some paths to :3002
debug.genes.apexpots.com      → :3002
genes.apexpots.com/pdfs/      → /srv/genes-pdfs/ (file_server)
genes.apexpots.com/blucher/   → /mnt/donto-data/blucher-sources/ (file_server)
genes.apexpots.com/research/  → /srv/genes-research/ (file_server, this paper)
www.dontopedia.com            → :3000
```

## Appendix B: Migration Index (Selected)

```
0001 core               -- donto_context, donto_statement, donto_audit
0002 flags              -- pack/unpack polarity + maturity into smallint
0003 functions          -- donto_assert, donto_retract, donto_correct,
                           donto_ensure_context, donto_match
0023 documents          -- donto_document table
0029 evidence_links     -- donto_evidence_link with 7-target check
0031 arguments          -- donto_argument with 9 typed relations
0048-0055 predicate_alignment -- alignment edges + closure rebuild
0057 entity_symbol      -- entity registry with trigram blocking
0060 identity_edge      -- weighted bitemporal coreference
0089 hypothesis_only_flag -- per-statement flag for I1
0090 event_log          -- append-only history for non-statement objects
0098 polarity_v2        -- extended polarity values
0099 statement_modality -- sparse modality overlay
0100 extraction_level   -- sparse extraction-level overlay
0102 maturity_e_naming  -- E5/E4 ordering note for stored values 4/5
0103 multi_context      -- secondary context attachments
0111 policy_capsule     -- 15-action policy with max-restriction inheritance
0112 attestation        -- holder credentials with purpose + rationale
0123 document_policy_id_required -- F-1 closure (NOT NULL + FK validate)
0125 blob_store         -- content-addressed blob registry
0126 trace              -- source-provenance trace log
0127 trace_lines        -- byte-offset-preserving line index
0128 safe_extract       -- defensive literal handling
0129 disambiguate       -- entity disambiguation tables
0130 predicate_counts   -- matview for /subjects/all etc.
0131 object_iri_trgm    -- trigram index on object IRIs (substring search)
```

There are 131 migrations in total. All are idempotent, and they are
applied under a single `pg_advisory_lock` so that concurrent migrators
serialise.
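The combination of idempotent migrations and a single serialising lock can be illustrated with a minimal sketch. It is not donto's migrator: it uses SQLite from the Python standard library in place of PostgreSQL, a process-local `threading.Lock` as a stand-in for `pg_advisory_lock`, and a hypothetical `demo_migration_ledger` table and `demo_*` migrations invented for the example. Whether the real migrator keeps a ledger or relies purely on idempotent DDL is an assumption here; the sketch shows both.

```python
import sqlite3
import threading

# Stand-in for pg_advisory_lock. In PostgreSQL the lock is held
# server-side; a process-local lock plays that role in this sketch.
_MIGRATION_LOCK = threading.Lock()

# Hypothetical migrations. Each statement is safe to re-run because
# it uses IF NOT EXISTS.
MIGRATIONS = [
    ("0001_core",
     "CREATE TABLE IF NOT EXISTS demo_statement "
     "(id INTEGER PRIMARY KEY, body TEXT)"),
    ("0002_flags",
     "CREATE TABLE IF NOT EXISTS demo_flags "
     "(statement_id INTEGER, flags INTEGER)"),
]


def migrate(conn):
    """Apply pending migrations and return the names applied by this call."""
    applied_now = []
    with _MIGRATION_LOCK:  # only one migrator runs this section at a time
        conn.execute(
            "CREATE TABLE IF NOT EXISTS demo_migration_ledger "
            "(name TEXT PRIMARY KEY)"
        )
        done = {row[0] for row in
                conn.execute("SELECT name FROM demo_migration_ledger")}
        for name, ddl in MIGRATIONS:
            if name in done:
                continue  # already recorded, so skip it
            conn.execute(ddl)
            conn.execute(
                "INSERT INTO demo_migration_ledger (name) VALUES (?)",
                (name,),
            )
            applied_now.append(name)
        conn.commit()
    return applied_now
```

Calling `migrate` twice on the same connection applies both migrations the first time and nothing the second time, which is the property that makes a racing second migrator harmless once it acquires the lock.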

## Appendix C: PRD Invariants Tripwire Map

```
I1  No claim without evidence                  invariants_evidence.rs
I2  No restricted source without policy        invariants_governance.rs
I3  No destructive overwrite                   invariants_bitemporal.rs
I4  Contradictions are preserved               invariants_paraconsistency.rs
I5  Machine confidence is not maturity         invariants_maturity.rs
I6  Governance propagates to derivatives       invariants_governance.rs
I7  Schema mappings are typed and scoped       invariants_predicate.rs
I8  Identity is a hypothesis                   invariants_identity.rs
I9  Adapters must report information loss      adapters/*
I10 A release is a reproducible view           invariants_releases.rs
```

Each invariant has at least one tripwire test; many have multiple
adversarial-walkthrough scenarios in `invariants_adversarial.rs`
(1,160 LOC, the largest single test file).
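To make the idea of a tripwire concrete, the following sketch checks I3 and I4 against a toy in-memory store. Everything in it is hypothetical: `ToyStore`, `Row`, the `tx_from`/`tx_to` field names, and the two polarity strings are invented for illustration and are not donto's API or the contents of the Rust test files listed above.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Row:
    subject: str
    predicate: str
    obj: str
    polarity: str                  # "asserted" or "negated" in this toy model
    tx_from: int                   # system-time at which the row was recorded
    tx_to: Optional[int] = None    # None means the row is still believed


class ToyStore:
    """Hypothetical append-only store used only to illustrate the tripwires."""

    def __init__(self):
        self.rows = []
        self._clock = count(1)     # monotonically increasing system-time

    def assert_(self, subject, predicate, obj, polarity="asserted"):
        row = Row(subject, predicate, obj, polarity, tx_from=next(self._clock))
        self.rows.append(row)
        return row

    def retract(self, row):
        # Close the system-time interval instead of deleting the row.
        row.tx_to = next(self._clock)

    def current(self):
        return [r for r in self.rows if r.tx_to is None]


def tripwire_i4_contradictions_preserved():
    store = ToyStore()
    store.assert_("ex:alice", "ex:bornIn", "ex:Leeds", "asserted")
    store.assert_("ex:alice", "ex:bornIn", "ex:Leeds", "negated")
    # Both sides of the contradiction must remain visible.
    polarities = sorted(r.polarity for r in store.current())
    assert polarities == ["asserted", "negated"]


def tripwire_i3_no_destructive_overwrite():
    store = ToyStore()
    row = store.assert_("ex:alice", "ex:bornIn", "ex:Leeds")
    store.retract(row)
    # The row leaves the current view but stays in history.
    assert store.current() == []
    assert len(store.rows) == 1 and store.rows[0].tx_to is not None
```

Each function fails loudly if the store ever rejects a contradictory assertion or physically deletes a retracted row, which is the sense in which such a test acts as a tripwire for its invariant.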

---

*End of paper.*
