# donto — status snapshot (2026-05-28)

Self-orientation pass. Branch `main` @ `eb23b78`, working tree clean,
ahead of remote by 1 commit. All four prod services healthy.

## One-paragraph framing

**donto** is a bitemporal, paraconsistent quad store recast in the PRD
as an "evidence operating system for contested knowledge." Postgres
extension (`pg_donto`, pgrx) + Rust workspace shipping an HTTP sidecar
(`dontosrv`), CLI (`donto`), TUI, and Lean 4 overlay. Native query
language **DontoQL** (SPARQL 1.1 subset also supported). Every
statement is **evidence-backed**, **filed under a context**, has both
`valid_time` and `tx_time`, and contradictions are preserved as data
rather than rejected. Language documentation is the first formal
proving domain; **genes** (~39M statements about North-Queensland
genealogy, oral histories, DNA matches) is the live exercise.
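
To make the framing concrete, here is a minimal in-memory sketch of what "filed under a context, with both `valid_time` and `tx_time`, contradictions preserved" could look like. The `Statement` class, the `contested` helper, and the `ex:` IRIs are all hypothetical illustrations, not the actual `donto_statement` schema.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """Hypothetical stand-in for one statement row (not the real schema)."""
    subject: str
    predicate: str
    obj: str
    context: str                      # every statement is filed under a context
    valid_from: Optional[date]        # valid_time: when the claim holds in the world
    valid_to: Optional[date]
    tx_from: datetime                 # tx_time: when the store started believing it
    tx_to: Optional[datetime] = None  # None means the row is still current


def contested(statements):
    """Map (subject, predicate) to its objects where current rows disagree."""
    groups = {}
    for s in statements:
        if s.tx_to is None:
            groups.setdefault((s.subject, s.predicate), set()).add(s.obj)
    return {key: objs for key, objs in groups.items() if len(objs) > 1}


now = datetime(2026, 5, 28, tzinfo=timezone.utc)
store = [
    # Two sources disagree; both rows are kept, each under its own context.
    Statement("ex:person/1", "ex:bornIn", "ex:place/cairns",
              "ex:ctx/source-a", None, None, now),
    Statement("ex:person/1", "ex:bornIn", "ex:place/townsville",
              "ex:ctx/source-b", None, None, now),
]
disagreements = contested(store)
```

The point of the sketch is that disagreement is something you query for, not something the write path rejects.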

## Live state

| Metric | Value |
|---|---|
| `donto_statement` rows | **39.3M** |
| Distinct predicates | **938.9k** |
| Top contexts | `genes/research-db` 21.8M → `genes/smoketest` 4.1M → `genes/analysis-db` 3.8M → `genes-family-trees` 1.2M |
| `donto_*` tables | 71 |
| Highest migration | **0131** `object_iri_trgm.sql` |
| Tripwire tests | **77** files in `packages/donto-client/tests/` |
| Services up | `dontosrv:7879` ✅ `donto-api:8000` ✅ `donto-api-worker` ✅ `donto-debug:3002` ✅ |

`localhost:7879/health` → `ok`.
`localhost:8000/health` → `{ status: ok, dontosrv: ok }`.
(Daemon reload is pending on a few units: file mtime drift, not an
active failure.)
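
A small probe for the two health endpoints could look like the sketch below. It assumes the `:8000` body is real JSON with quoted keys (the snapshot shows it in shorthand) and that a service counts as healthy only when every reported value is `"ok"`; `is_healthy` and `check_all` are hypothetical helper names.

```python
import json
import urllib.request

# Endpoints as reported in this snapshot.
HEALTH_URLS = {
    "dontosrv": "http://localhost:7879/health",
    "donto-api": "http://localhost:8000/health",
}


def fetch(url, timeout=2.0):
    """Return the response body as stripped text; raises OSError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def is_healthy(body):
    """Accept a bare `ok`, or a JSON object whose values are all "ok"."""
    if body == "ok":
        return True
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return (
        isinstance(payload, dict)
        and bool(payload)
        and all(value == "ok" for value in payload.values())
    )


def check_all(fetcher=fetch):
    """Probe every endpoint; an unreachable service counts as unhealthy."""
    results = {}
    for name, url in HEALTH_URLS.items():
        try:
            results[name] = is_healthy(fetcher(url))
        except OSError:
            results[name] = False
    return results
```

Calling `check_all()` with the default fetcher hits the live ports; passing a stub fetcher keeps the parsing logic testable offline.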

## Milestone position (PRD M0–M9)

- **M0 Trust Kernel** — substrate complete. Migration 0123 closed F-1
  by making `policy_id` required on documents with fail-closed default
  backfill. HTTP middleware to *enforce* policies on write paths is
  still open (see Open Items).
- **M1 Bitemporal** — substrate complete (`tx_time` / `valid_time`
  ranges enforced).
- **M2 Claim kernel** — substrate complete.
- **M3 Schema + identity** — substrate complete.
- **M4 Paraconsistency + modality** — substrate complete.
- **M5 Extraction** — extraction CLI ✅, reviewer-acceptance analyzer
  ✅, six apertures live (surface / linguistic / presupposition /
  inferential / conceivable / recursive); the conceivable aperture is
  quarantined. **Next:** scheduled runs (cron/systemd, ~20 LOC).
- **M6 Language pilot** — all 5 importers shipped with tests (CLDF,
  CoNLL-U, UniMorph, LIFT, EAF). **Next:** run against real datasets
  (Glottolog, UD, UniMorph, LIFT, ELAN) + CLI subcommand dispatch.
- **M7 Release builder** — JSONL, RO-Crate, Ed25519 envelope, and
  end-to-end test all landed. **Next:** wrap as `donto release` CLI verb.
- **M8 Scale** — H1–H9 benchmarks complete. **Next:** H10 10M-row
  scale lock (extrapolated ~70 min insert).
- **M9 Federation** — signed-envelope spike landed. **Next:** publish
  end-to-end via DataCite or a stand-in registry.

## Recent trajectory (last 3 weeks)

```
eb23b78  feat: predicate fragmentation endpoint + cost budgets + align activity rewire
20a158e  feat: predicate alignment + context-spans + conceivable quarantine
5f8d957  docs: ROADMAP-AFTER-MAY18 — deferred items from the infra review
3fb4f45  feat: entity-merge endpoint + data-hygiene polish
2b16a13  trace: Stage D.6 disambiguation pass
3bd2ac7  chore: switch default extraction model to z-ai/glm-5
947963b  feat: GET /context-facts
9cae9bc  fix: /search indexes object_iri too
72969cd  fix: /extract — register source + persist anchors
5928bff  feat: anchor-aware ingest + exhaustive-by-default extraction
31c519b  feat: vocab-aware extraction — stop minting fresh predicates
```

Theme: predicate alignment + extraction hygiene + provenance trace
(Stage D), then a hop to context-spans and cost budgets. `HEAD~10`
diffstat: 13 files, +1249/-77.

## Open items (where the next push lands)

From `REVIEW-FINDINGS.md`, `ROADMAP-AFTER-MAY18.md`,
`ROADMAP-NEXT.md`, and `EXTRACTION-MAXIMALISM.md`:

**High-leverage data hygiene (1–2 weeks):**
- Predicate alignment **backfill**: map each of the 938k predicates to
  its nearest canonical predicate (cosine similarity ≥ 0.9).
- Anchor/evidence backfill — currently ~4.5% of statements have an
  `evidence_link`; target ~50% via substring + LLM anchor-only
  re-extraction.
- Quarantine `ex:normalized_claims/*` (2.37M rows): bulk retract or
  close `tx_time`.
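
The alignment backfill rule (nearest canonical, accepted only at cosine ≥ 0.9) can be sketched as below. The vectors, the `ex:` predicate names, and the `align` helper are made up for illustration; where the real embeddings come from is not covered here, and a brute-force scan like this would likely need an index at 938k predicates.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align(predicate_vec, canonicals, threshold=0.9):
    """Return (canonical_iri, score) for the nearest canonical, else None."""
    best_iri, best_score = None, -1.0
    for iri, vec in canonicals.items():
        score = cosine(predicate_vec, vec)
        if score > best_score:
            best_iri, best_score = iri, score
    if best_iri is not None and best_score >= threshold:
        return best_iri, best_score
    return None  # below threshold: leave the predicate unaligned


# Toy 3-dimensional embeddings (hypothetical).
canonicals = {
    "ex:bornIn": [1.0, 0.0, 0.0],
    "ex:marriedTo": [0.0, 1.0, 0.0],
}
close = align([0.95, 0.05, 0.0], canonicals)  # near ex:bornIn
far = align([0.6, 0.6, 0.5], canonicals)      # near nothing in particular
```

Returning `None` below the threshold keeps the backfill conservative: an unaligned predicate stays as minted rather than being folded into a weak match.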

**Medium-leverage infrastructure (2–4 weeks):**
- Split Temporal off `donto-pg` (second Postgres / Cloud SQL).
- Move Postgres off the app VM (HA).
- Schedule matview + alignment-closure refresh on a timer.
- Stand up Loki + Grafana + promtail observability skeleton.
- Cost budget on extraction (per-run + per-account ceiling).
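
For the extraction cost budget, one possible shape of the two ceilings is sketched here. `CostBudget`, `Run`, and `BudgetExceeded` are hypothetical names, and the cost unit is left abstract; the check refuses a charge before recording it, so neither ceiling is ever overshot.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a charge would cross a ceiling."""


@dataclass
class CostBudget:
    per_run_ceiling: float
    per_account_ceiling: float
    account_spend: dict = field(default_factory=dict)  # account id -> total

    def start_run(self, account):
        return Run(self, account)


@dataclass
class Run:
    budget: CostBudget
    account: str
    spent: float = 0.0

    def charge(self, cost):
        """Record a cost, refusing it first if either ceiling would be crossed."""
        total = self.budget.account_spend.get(self.account, 0.0)
        if self.spent + cost > self.budget.per_run_ceiling:
            raise BudgetExceeded("per-run ceiling reached")
        if total + cost > self.budget.per_account_ceiling:
            raise BudgetExceeded("per-account ceiling reached")
        self.spent += cost
        self.budget.account_spend[self.account] = total + cost
```

An extraction loop would call `run.charge(estimated_cost)` before each model call and stop cleanly on `BudgetExceeded`.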

**Trust Kernel HTTP wiring (F-1 follow-on):**
SQL substrate for `donto_register_source` + `policy_id` exists; HTTP
middleware that enforces it on write paths does not. Genes has
hundreds of unpoliced sources — natural end-to-end testbed.
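
What the missing write-path check might do, as a fail-closed sketch: refuse unless the source is registered, carries a `policy_id`, and that policy allows the action. The `enforce_policy` function, the dict shapes, and the policy names are assumptions for illustration; the real middleware and the `donto_register_source` signature are not shown in this snapshot.

```python
class PolicyError(Exception):
    """Write refused; an HTTP layer might map this to a 403 response."""


def enforce_policy(source_id, sources, allowed_actions, action="write"):
    """Return the governing policy id, or raise PolicyError (fail closed)."""
    source = sources.get(source_id)
    if source is None:
        raise PolicyError(f"unregistered source: {source_id}")
    policy_id = source.get("policy_id")
    if policy_id is None:
        raise PolicyError(f"source has no policy: {source_id}")
    if action not in allowed_actions.get(policy_id, set()):
        raise PolicyError(f"policy {policy_id} does not allow {action}")
    return policy_id


# Hypothetical registry contents.
sources = {
    "src:policed": {"policy_id": "policy:open"},
    "src:unpoliced": {"policy_id": None},
}
allowed_actions = {"policy:open": {"read", "write"}}
```

Every branch that lacks positive evidence of permission raises, so an unknown source or an unknown policy id is treated the same as a denial.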

**Domain / overlay:**
- Lean overlay parity: `packages/lean/` is a skeleton, while
  `autoresearch-genealogy/lean/Genealogy/` has the developed library;
  converge the two.
- DontoQL `WITH evidence` result-shape cleanup (tuple → struct).
- CLI manpage / completions install.

## Workspace shape

**`packages/`** (22): substrate (`sql`, `pg_donto`, `donto-blob`,
`donto-ingest`); query + extraction (`donto-query`, `donto-client`,
`donto-trace`); 5 linguistic importers
(`donto-ling-{cldf,ud,unimorph,lift,eaf}`); ops (`donto-alert-sink`,
`donto-analytics`, `donto-release`, `donto-migrate`, `donto-synthetic`);
frontend (`client-ts`, `tsconfig`); `lean` overlay.

**`apps/`** (5): `donto-cli` (Rust — extract / ingest / query / migrate
/ release / analyze / cite / bench / man / completions), `dontosrv`
(Rust HTTP gateway, :7879), `donto-api` (FastAPI + Temporal,
:8000), `donto-tui` (refreshed May 2026), `docs`.

## Non-negotiables (still load-bearing)

- Paraconsistent. Two sources disagreeing → both rows live forever.
- Bitemporal. `donto_retract` / `donto_correct`. **Never**
  `DELETE FROM donto_statement`.
- Every statement has a context (`donto:anonymous` is the default).
- Lean certifies, doesn't gate. Sidecar absence degrades only
  shape / rule / cert calls.
- Postgres owns execution. Lean owns meaning. DIR is the boundary.
- No hidden ordering.
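
Why retraction replaces deletion can be shown with a toy model: closing `tx_time` keeps the old row answerable for "what did we believe then?" queries. The Python `retract`, `correct`, and `as_of` functions below only borrow the naming of `donto_retract` / `donto_correct`; their signatures and the `Row` shape are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Row:
    statement_id: int
    obj: str
    tx_from: datetime
    tx_to: Optional[datetime] = None  # open-ended while the row is current


def retract(rows, statement_id, at):
    """Close tx_time on the matching current row; the row itself is kept."""
    return [
        replace(r, tx_to=at)
        if r.statement_id == statement_id and r.tx_to is None
        else r
        for r in rows
    ]


def correct(rows, statement_id, new_obj, at):
    """Retract the old row and append a replacement starting at `at`."""
    next_id = max(r.statement_id for r in rows) + 1
    return retract(rows, statement_id, at) + [Row(next_id, new_obj, at)]


def as_of(rows, t):
    """Rows the store believed at transaction time t."""
    return [r for r in rows
            if r.tx_from <= t and (r.tx_to is None or t < r.tx_to)]


t0 = datetime(2026, 5, 1, tzinfo=timezone.utc)
t1 = datetime(2026, 5, 28, tzinfo=timezone.utc)
rows = [Row(1, "1902", t0)]
rows = correct(rows, 1, "1903", t1)  # nothing is deleted; history grows
```

After the correction, `as_of(rows, t0)` still returns the original value while `as_of(rows, t1)` returns the replacement.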

## Map of docs (where to dig in next)

- `donto/docs/DONTO-PRD.md` — canonical (27 §, M0–M9, I1–I10).
- `donto/docs/DONTOQL.md` — v2 reference, evaluator status.
- `donto/docs/REVIEW-FINDINGS.md` — F-1 closed; F-2..F-18 DOC-severity.
- `donto/docs/EXTRACTION-MAXIMALISM.md` — six-aperture extraction.
- `donto/docs/ROADMAP-AFTER-MAY18.md` — deferred infra review.
- `donto/docs/ROADMAP-NEXT.md` — post-DontoQL-v2 priorities.
- `donto/docs/BENCH-RESULTS.md` — H1–H9 scale benchmarks.
- `donto/docs/M9-FEDERATION-MEMO.md` — signed-release research.
- `dontopedia/PREDICATE-ALIGNMENT-PROBLEM.md` — live brief on
  alignment quality with the genes corpus.
