# donto as a Claim/Discovery Substrate — Iteration 3 (Product Spec)

## 1. The reframe, in one paragraph

donto is a **contradiction-preserving claim/discovery substrate for machines**. It ingests evidence, decomposes it into source-grounded **typed claims**, holds mutually-incompatible claims as permanent legal state, and then **generates, ranks, and re-ranks relationship hypotheses** from claim combinations — always able to show a human *why* a hypothesis exists. The product is this **claim lifecycle**, not the "lens" vocabulary. Lenses are just one way to generate candidate typed claims; a lens that doesn't emit structured claims that change the graph is decorative. Everything below is one product spec for that lifecycle, plus the domains that prove it.

---

## 2. The claim lifecycle (the actual product spec)

This 8-step loop is the spine. Each step is something donto does to the graph.

| # | Step | What donto does (one line) |
|---|------|----------------------------|
| 1 | **Ingest evidence** | Anchor every source to a byte range; raw artifact + cleaned copy stored, never summarized-and-discarded. |
| 2 | **Extract typed claims** | Run extraction lenses → emit normalized, typed, falsifiable claims (skill-claim, causal-claim, tlink), each evidence-linked. Volume lives here. |
| 3 | **Hold incompatible claims** | Store contradictory claims side-by-side as legal state (`hypothesis_only`, contradiction frontier) — never average, overwrite, or drop. |
| 4 | **Generate relationship hypotheses** | Combine claims into candidate cross-entity edges via shared intermediate facts (Swanson ABC join keys: shared sense/frame/entity-IRI/skill). *This step is net-new code.* |
| 5 | **Attach evidence + counter-evidence** | Wire typed argument edges (`supports`/`rebuts`/`undercuts`, AIF-style) to each hypothesis. |
| 6 | **Rank** | Score = f(novelty, plausibility, evidentiary-support, contradiction-value, verification-cost). This ranker IS the FDR controller. |
| 7 | **Re-rank on new evidence** | New byte arrives → re-score the *existing* hypothesis pool (cheap), don't re-generate (expensive). Bitemporal trigger. |
| 8 | **Explain** | Surface the backing claim-chain, the counter-evidence, the confidence, and a verification path ("confirm by adding repo link"). |

Steps 1–3 are mostly built (genealogy data). Steps 4–7 are the product gap. Step 8 is the differentiator the incumbents structurally can't offer.

---

## 3. Settling the volume question (honestly)

The founder pushed back on the prior report's de-emphasis of volume. **The founder is right** — with one condition. The disagreement dissolves once you split volume across two layers.

**The founder's intuition is literally measurable.** Modern NLP defines ~12 distinct gold-standard, falsifiable annotation layers over the *same* text, each emitting a different claim type:

- **Universal Dependencies**: every token gets ~5–6 typed claims (form, lemma, 1 of 17 POS tags, ~30 morphological feature types / 200+ values, head, 1 of 37 dependency relations). 200+ treebanks, 150+ languages.
- **AMR 3.0** (LDC2020T02): whole-sentence meaning graph, ~1 node per content word + edges. ~59,255 gold English AMRs.
- **PropBank + FrameNet**: predicate-argument structure two ways (1,000+ frames, 13,000+ lexical units).
- **UDS 1.0** (decomp.io): turns *one* predicate-argument edge into ~20 graded real-valued claims (proto-roles, factuality, genericity).
- **TimeML / TimeBank-Dense**: events + TIMEX3 + Allen-interval TLINKs.
- **PDTB-3**: 53,631 discourse-relation tokens. Plus WSD, coref, NER, MWEs (PARSEME: 62K+ verbal MWEs).

**Arithmetic for one rich 100-word paragraph**: ~100 tokens × ~5 UD claims ≈ 500 syntactic claims; + ~80–120 AMR nodes/edges; + ~10–20 predicates × (sense + args + ~20 UDS scalars) ≈ several hundred semantic claims; + temporal + discourse + entity links. **Conservatively 1,000–2,000+ typed falsifiable claims per paragraph, zero prose.** The latent typed structure of one text *is* vast.

**This is not aspirational.** SemMedDB/SemRep extracted **96–130M typed subject-predicate-object predications from ~29–37M PubMed abstracts** — and that database *is* the substrate of automated biomedical discovery. (It was **deprecated Dec 31 2024** — a dated, open market gap donto's extraction layer directly fills.)

**Why density enables discovery (the clincher).** Swanson's Raynaud↔fish-oil discovery worked because the bridge B-concept (blood viscosity) existed as a typed claim in *both* literatures, even though only **4 of 489** articles co-mentioned A and C. The discovery lived in the intermediate typed facts, not the prose. **No B-facts → no bridge → the true link is unrecoverable.** Sparse-graph completion provably degrades (real commonsense/biochem KGs avg degree ~2; models "degrade quickly as density is reduced", Malaviya 2019). Under-extraction is a recall ceiling no downstream ranker can lift.

**Where volume hurts.** Candidate relationships scale ~O(N²) (worse for typed paths) while true links grow ~linearly. **Calude & Longo (Foundations of Science 2017)** prove via Ramsey theory that large enough databases *must* contain arbitrary correlations as a function of size alone — "most correlations are spurious," findable even in random data. So the unverified-relationship layer needs a gate (Benjamini-Hochberg FDR logic; Bonferroni is "too conservative to be useful" at large m).

| | **Typed-extraction / decomposition layer** | **Unverified-relationship-hypothesis layer** |
|---|---|---|
| **Volume verdict** | YES — recall, the raw fuel | GATED — combinatorial false-discovery |
| **Why** | 1k–2k typed claims/paragraph; Swanson needs B-facts; denser graph = more reachable true links | O(N²) candidates; Calude-Longo: spurious regularities guaranteed by size |
| **Failure if ignored** | Recall ceiling — true links permanently unrecoverable | Deluge — real findings buried in noise |
| **donto state** | claim entering graph (evidence-anchored, re-checkable) — allowed at any volume | hypothesis promoted `hypothesis_only → believed` — must pass scorer under an FDR budget |
| **donto today** | UNDER-fed (4.7% evidence coverage) | EMPTY (the missing generate+gate step) |

**Conclusion:** The founder is right — extract maximally — under **two conditions**: (1) lenses emit **typed claims, never prose** (prose has no join key, no falsifiability, no provenance type — it can neither feed the ranker nor bridge two texts); (2) the **relationship layer is gated** by the ranker. **Volume feeds the verifier; they are sequential, not opposed.** The only universal-noise case is free-form prose output.

---

## 4. Lenses must emit typed claims (not prose)

A lens is a typed-claim emitter with a declared target schema. The objective test for "decorative": *does it write rows in its declared predicate family that the ranker can use — i.e. raise recall of true latent links OR sharpen the gate?* If neither, **cut it.**

| Lens | Typed claim it emits | How it changes the graph |
|------|----------------------|--------------------------|
| **Causal** | `causal-claim(A causes B, polarity, strength)` (SciClaim schema) | New candidate causal edge → enters ranker; can rebut an existing causal claim |
| **Temporal** | `tml:event`, `tml:tlink(before/during/overlaps)` (Allen) | Time-bounds an assertion; enables bitemporal "what was true at T"; detects time contradictions |
| **Linguistic** | `ud:deprel`, `amr:ARGn`, `wn:sense`, `uds:proto-role(scalar)` | Adds join keys (shared sense/frame) for cross-text bridges; graded scalars become ranker weights |
| **Genealogical** | `relationship-hypothesis(parent-of, source-specific)` | Candidate kinship edge with per-source evidence; competing edges held paraconsistently |
| **Legal** | treatment-edge `cites/distinguishes/overrules/follows` | Builds the argument graph; `overrules` flips valid-time of a precedent without deletion |
| **Social (FAN)** | `co-occurrence/associate-of(witness, sponsor, neighbor)` | Cluster edges that triangulate identity and surface hidden links |
| **Skills/competency** | `skill-claim(surface → ESCO/O*NET IRI, confidence)`, `implicit-skill-claim` | Resolves to taxonomy node non-destructively; implicit claims enter as `hypothesis_only` |
| **Identity** | `identity_edge(maybe-same-as, evidence)` | Query-time, non-destructive merge candidate; competing identities coexist |

**Kill the decorative failure mode:** any lens emitting a paragraph of analysis instead of typed rows is deleted. "It explained the text nicely" is not a graph mutation.

---

## 5. Flagship: jsonresume → jobs

This is the cleanest possible proof of the whole thesis. Matching **is literally scored relationship-discovery over typed claims**. The founder owns **both sides of the graph and the distribution channel**: JSON Resume (MIT standard, schema v1.0.0, ~2.4k stars, 10,000+ devs, 400+ themes, Gist-backed registry at `registry.jsonresume.org/<user>`) *plus* a draft `job-schema.json`. The resume side is already half-decomposed into claims (`basics/work/education/skills{name,level,keywords}/projects`).

### (a) Entities
Two: **resume** (a person's claim-set) and **job** (a requirement claim-set). Both decompose into the *same* typed schema, so matching is set-to-set relationship discovery — not cosine similarity between two opaque vectors.

### (b) Typed claims per side

| Resume claims | Job claims |
|---|---|
| skill-claim (surface → ESCO/Lightcast/O*NET IRI) | required-skill-claim |
| role-claim, seniority-claim (time-bounded) | preferred-skill-claim |
| domain-claim (e.g. fintech), tenure-claim | required-seniority-claim |
| education-claim, trajectory-claim (role transitions) | domain-claim |
| **implicit/inferred-skill-claim** (`hypothesis_only`) | comp-claim, location/remote-claim |
| identity-variant-claim (duplicate/variant profiles) | |

**Anchor vocabulary:** ESCO (14,575 skills / 3,039 occupations / 28 languages) + O*NET (1,016 occupations, 277 descriptors) as canonical free/open skill IRIs; Lightcast Open Skills (~33,000 skills, refreshed biweekly from 1B+ postings) as a freshness benchmark. Each skill is **three claims**: raw surface form + normalized IRI + confidence — the bullet text is never destroyed.

**Implicit-skill lens example:** bullet `"Led migration of monolith to microservices on AWS for a 5M-user fintech"` → explicit (AWS, microservices, leadership) **and** inferred-as-`hypothesis_only` (distributed-systems, observability, PCI/regulatory exposure, team-scale, domain=fintech, seniority signal). Industrial implicit-skill extraction runs **~80% precision / ~86% recall** — real and measured, and exactly why it must enter as hypothesis, not belief.

### (c) Matching as explainable relationship hypotheses
Each match is an evidence-anchored edge: **"requirement R met by skill-claim S extracted from bullet B of resume gist G."** Framed as FEVER-style claim verification (skill-claim SUPPORTS / REFUTES / NOT-ENOUGH-INFO a job requirement). The UX leads with:
- **"You match because:"** the backing claim chain.
- **"You're missing:"** unmet required-skill-claims + a verification path ("confirm inferred skill X by adding the repo link").

### (d) Why donto beats keyword + embedding matching
SOTA admits the gap verbatim. **ConFit v2/v3 (ACL 2025/2026)**: embedding rankers "lack controllability and explainability as the ranking process happens entirely in latent embedding space" — they bolt on LLM re-ranking *to recover the WHY* (+13.8% recall, +17.5% nDCG over BM25/OpenAI embeddings, but one benchmark 72% male).

| Capability | Keyword/Embedding | donto |
|---|---|---|
| Why this match | latent / none | claim-chain to source bullet |
| Contradiction | silently averaged | `senior` claim **rebutted** by tenure summing 18 months → surfaced as risk panel |
| Skill freshness | static vector | bitemporal: skill-claim valid-from 2018, no recent reinforcement → down-weighted; **skill decay** |
| Audit / compliance | opaque | bitemporal "replay why we matched" — EU AI Act-style explainability incumbents can't retrofit |
| Hidden candidates | keyword-gated | implicit-skill match even when surface keywords don't |

### (e) Network-effect payoff at millions of datapoints
**This is where the founder's volume intuition is unambiguously right** — it's a genuine network effect, not just recall:
- **Skill-adjacency graph**: skills co-occurring in evidence → candidate prerequisite/substitute edges (LinkedIn proves the structure: 39K skills, 374K aliases, **200K+ skill-to-skill edges**).
- **Career-path graph**: observed role transitions → trajectory hypotheses with frequency evidence.
- **Hidden-candidate discovery**: Swanson ABC applied to careers — candidate → skill → adjacent-skill → job the person never thought to apply for.

These are emergent typed relationships incumbents charge enterprise prices for; donto generates them as a byproduct of the lifecycle. **Volume feeds the verifier**: dense claims → denser graph → more candidate matches → ranker promotes only evidence-backed ones.

### (f) Competitive landscape

| Player | ARR / scale | Weakness donto exploits |
|---|---|---|
| **Eightfold AI** | ~$96.6M ARR, ~$2.1B val, $410M raised | opaque deep-learning talent graph (low explainability) |
| **Beamery** | ~$112.8M ARR, was $1B; laid off 12% then ~25% | capital-stressed; proprietary graph |
| **SeekOut** | ~$25.2M ARR, val $1.2B→$435M, 30% layoffs | profile-aggregation alone = fragile moat |
| **LinkedIn Skills Graph** | 875M people, 200K+ skill links | proprietary, not exportable |
| **hiring.cafe** | 14,000+ companies, free | ingestion model to copy for the jobs side |

**The wedge:** incumbents have *bigger* skills graphs. donto's differentiation is **not** "we have a skills graph" — it's **evidence-anchoring + contradiction-preservation + bitemporal replay + human-readable WHY**, on an **open standard** the founder controls. The likely *paid* wedge is explainability/fairness compliance (TA software ~$22–26B, AI-recruiting ~$3.2B at ~12% CAGR), not raw accuracy. **If donto collapses claims to scores like everyone else, there is no moat.**

### (g) Falsifiable first test
Harvest the jobs side from open ATS feeds (Greenhouse/Lever/Workday) the way hiring.cafe does; resume side from the registry (Gist version history = free bitemporal valid-time). **Test: on a held-out set, beat an ESCO-embedding baseline on `precision@k` of promoted matches against a real outcome signal (got-interviewed / hire), while producing an evidence-anchored, contradiction-flagged explanation for each.** Report the false-discovery proportion among promoted matches as a first-class number.

---

## 6. 10 example projects across domains

Template: **Domain · Entities · Lenses→claims · Relationships discovered · Load-bearing property · Falsifiable test · Who pays.**

**1. jsonresume → jobs**
Domain: talent matching. Entities: resumes, jobs. Lenses→claims: skill-lens→ESCO skill-claims; implicit-skill-lens→inferred PCI/distributed-systems (`hypothesis_only`); trajectory-lens→career-path edges. Relationships: candidate↔job matches, skill-adjacency, hidden candidates. Load-bearing: **volume / network effects** (+ evidence-anchor for the WHY). Test: beat ESCO-embedding `precision@k` on got-interviewed. Pays: recruiters, job platforms, fairness-compliance buyers.

**2. Genealogy / native-title (proving ground)**
Domain: ancestry/legal apicals. Entities: persons, sources, places. Lenses→claims: genealogical-lens→`parent-of` per-source; identity-lens→`maybe-same-as` (the 16 distinct "Kittys"); FAN-lens→witness/sponsor edges. Relationships: contested lineage chains held paraconsistently. Load-bearing: **contradiction + identity-as-hypothesis + bitemporal** (every source is an interpretive witness, not ground truth). Test: surface the EKY #58 vs Brady 2013 incest-conflict without collapsing either reading. Pays: native-title corporations, RNTBCs, family researchers.

**3. Deep linguistic decomposition / intertextuality (volume showcase)**
Domain: NLP/corpus. Entities: documents, tokens, predicates. Lenses→claims: UD-lens→5–6 claims/token; UDS-lens→20 scalars/edge; AMR-lens→concept nodes. Relationships: cross-text bridges via shared WordNet sense / FrameNet frame / AMR concept (Swanson ABC join keys). Load-bearing: **volume** (1k–2k claims/paragraph). Test: recover a held-out cross-document link findable only through a shared B-concept. Pays: research labs, intelligence/analysis, RAG vendors.

**4. Biomedical drug-repurposing / LBD (discovery)**
Domain: biomedicine. Entities: drugs, genes, diseases. Lenses→claims: predication-lens→TREATS/INHIBITS/CAUSES (SemMedDB schema, UMLS-normalized); contradiction-lens→inter-study conflict. Relationships: A→C repurposing candidates via intermediate B (Hetionet ~2.25M edges / DRKG ~5.8M triples). Load-bearing: **discovery + contradiction** (SemMedDB deprecated 2024 = live gap). Test: rediscover a held-out, later-confirmed drug-disease link from pre-discovery literature. Pays: pharma, biotech, academic discovery.

**5. Legal case-law / argument graph**
Domain: law. Entities: cases, holdings, statutes. Lenses→claims: treatment-lens→`cites/distinguishes/overrules/follows`; argument-lens→facts/issues/rules (LAMUS-style). Relationships: precedent chains; superseded-not-deleted law. Load-bearing: **contradiction + bitemporal** ("what was good law on date T" — the killer query; Caselaw Access Project 6.7M cases). Test: correctly answer good-law-as-of-date for a precedent overruled at a known date. Pays: law firms, legal-research vendors (decision-support, never authoritative ruling).

**6. Scientific claim-curation / research-integrity**
Domain: science. Entities: papers, claims, retractions. Lenses→claims: claim-lens→SciClaim typed associations; provenance-lens→nanopub 3-graph envelope. Relationships: supporting vs conflicting bodies of evidence; re-rank on retraction. Load-bearing: **contradiction frontier + bitemporal** (Retraction Watch 60k+; "believed at T1, retracted at T2" cascades). Test: detect a contradiction pair Bucur's super-pattern flags, and re-rank dependents when a cited paper is retracted. Pays: publishers, funders, meta-science.

**7. OSINT / investigative journalism**
Domain: investigations. Entities: people, companies, leaks, records. Lenses→claims: entity-lens→FollowTheMoney typed claims; conflict-lens→leak-vs-official disagreement. Relationships: hidden links across contradictory datasets with confidence tiers (confirmed/probable/possible). Load-bearing: **identity-as-hypothesis + conflicting sources** (OCCRP Aleph; "why do we believe X" survives legal review). Test: reproduce a known cross-leak link with a court-defensible why-trail. Pays: newsrooms, NGOs, due-diligence.

**8. Clinical patient-evidence**
Domain: healthcare. Entities: patients, diagnoses, guideline recs. Lenses→claims: event-lens→OMOP temporal events; guideline-lens→conflicting recommendations under multimorbidity. Relationships: rec conflicts resolved by argumentation; revised diagnoses over time. Load-bearing: **bitemporal + contradiction + governance** (OHDSI ~800M+ records). Test: flag a guideline conflict for a comorbid patient with each rec traced to its trial. Pays: health systems, CDS vendors (hypothesis/decision-support only).

**9. Financial-crime / beneficial-ownership / AML**
Domain: compliance. Entities: persons, companies, ownership. Lenses→claims: ownership-lens→FtM control claims; sanctions-lens→PEP/sanction status. Relationships: true beneficial owner through shell layers; explainable matches. Load-bearing: **entity resolution + conflicting registries** (OpenSanctions 2M+ entities / 337 sources; false positives are the #1 pain). Test: surface a true sanctioned beneficial owner through ≥2 shell layers with a regulator-readable explanation, beating an ER baseline on false-positive rate. Pays: banks, fintechs, compliance teams.

**10. Personal-AI memory**
Domain: agent memory. Entities: a person's evolving facts. Lenses→claims: fact-lens→time-bounded life claims; revision-lens→superseding-but-preserved updates. Relationships: belief evolution; re-rank on new context. Load-bearing: **evidence-anchor + bitemporal**, framed *against Mem0* (Mem0 *replaces* contradictory facts; donto *preserves* and re-ranks them). Test: answer "what did the user believe at T" correctly after a documented preference reversal. Pays: AI-assistant vendors, agent platforms.

(Properties spread: contradiction → 2,4,5,6,7,8,9; identity → 2,7,9; bitemporal → 2,5,6,8,10; evidence-anchor → 1,7,10; volume → 1,3,4.)

---

## 7. First milestone & baselines

**Brutally concrete. Two tracks.**

**Track A — jsonresume matching (the monetizable proof).**
Build the held-out benchmark: resumes (registry, with Gist history as valid-time) × jobs (ATS-scraped). Promote only matches with evidence-anchored skill→requirement satisfaction above threshold; hold the rest `hypothesis_only`.

**Track B — discovery via retrospective time-slicing.**
Cut the corpus at time T, generate hypotheses, check against post-T confirmed truth (drug-repurposing link, or "got-interviewed" outcome). This tests step 4–7 honestly without waiting for the future.

**Baselines to beat (multi-agent discovery alone is NOT novel — these already exist):**

| Baseline | Why it's the bar |
|---|---|
| Plain frontier-LLM prompt | the bitter-lesson null hypothesis — if a single prompt matches you, the substrate adds nothing |
| Vector / embedding search | ConFit shows it works but can't explain; beat it on `precision@k` **and** explainability |
| Normal KG link prediction | the structured baseline (Hetionet/DRKG metapaths) |
| Multi-agent discovery (mat2vec, AI Co-Scientist Elo tournament, SciAgents) | **DeepMind's Co-Scientist (Nature, May 2026) already ships generate→rank→re-rank — in-memory per session.** donto's *only* claim is the **persistent paraconsistent bitemporal substrate** underneath. Frame it precisely. |

**The number that proves it works:** beat the strongest baseline on `precision@k` of *promoted* hypotheses **while** (a) attaching an evidence-anchored explanation to each, (b) surfacing ≥1 real contradiction the baseline silently dropped, and (c) demonstrably re-ranking the pool when one new evidence byte arrives. Less is a demo; this is the product.

---

## 8. The honest risks

- **The cross-lens relationship generator does not exist in the code today** (step 4). `donto_argument` = 2,426 edges, `identity_edge` = 77, evidence ~4.7%. The contradiction/identity machinery — the differentiator — is barely exercised; this domain will stress-test unbuilt paths.
- **Novel ≠ true ≠ valuable.** Calude-Longo guarantees spurious links at scale; ~80% extraction precision floods the graph with false claims. The gate (`hypothesis_only` + ranker) is not optional — without it the matcher is a spam engine.
- **The bitter-lesson baseline is real.** If a plain LLM prompt or vector search matches the lifecycle's output, the substrate is overhead. Every milestone must beat that null first.
- **FDR control is an analogy, not a guarantee.** donto's scores are heuristic composites, not p-values; the false-discovery number is an empirical `precision@k` estimate, not a proven bound. Say so.
- **Genealogy-gravity / focus.** donto is ~98% genealogy. jsonresume→jobs is the discipline that proves domain-neutrality and pays — but it's net-new build (and carries GDPR/right-to-erasure tension with "preserve forever" for *living* people, unlike deceased ancestors). Resist letting the proving ground absorb all the effort.