# Total Extraction: Deconstructing a Source Through the Whole of Human Understanding

> *A companion essay on donto's extraction philosophy and method — the part the systems paper underplays. 2026-06-03.*

## Abstract

The donto systems paper specifies a container — bitemporal, paraconsistent, evidence-first — and is nearly silent about the faucet. This essay is about the faucet, and about the conviction behind it: that extraction should be **deep, esoteric, and total.** You point an agent at one source and guide it to deconstruct that source through the whole apparatus of human understanding — reading a single sentence simultaneously as a logician, a mereologist, a jurist, a phenomenologist, a historian, and a linguist — minting precise predicates from every analytical direction at once, anchoring each claim to its exact span, decoding euphemism and capturing contradiction as first-class facts, and looping until a fresh pass surfaces nothing genuinely new and the agent itself declares the well dry. The premise: a *thing* is not a bag of facts but an essentially unbounded space of true properties and relations, and almost every extraction system ever built sees one thin slice and silently discards the rest. This is sane now because generation flipped from the scarce step to the cheap step (~$10⁻⁴ per justifiable claim), and safe because of a two-layer contract — *maximize at extraction, gate at query time* — paired with a substrate built to hold the firehose. We ground every claim in the real implemented engine and its real output: a production frontier-massacre corpus of 32,672 evidence-anchored facts over 14,629 predicates (79.9% singletons), the richest single run holding 2,334 evidence-anchored claims from one event, and a live substrate of ~39.58M statements over ~992,700 predicates.

---

## 1. The thing is not a bag of facts

For as long as machines have read text for us, extraction has meant **subtraction.** You define a schema — `Person`, `Organization`, `Location`, a dozen relation types if the project is ambitious — and you point a model at a document, and the model returns the handful of things that fit the boxes you brought with you. Everything else, which is to say *almost everything the document actually says*, falls on the floor. A named-entity recognizer reads a coroner's deposition and returns three names and a date. A relation extractor reads a paragraph dense with causation, obligation, hearsay, euphemism, contradiction, and grief, and returns `(Smith, locatedIn, Queensland)`. The text was a galaxy; the output is a postage stamp. We grew so habituated to this loss that we stopped calling it loss. We called it *the result.*

This essay is about inverting that habit completely, and the inversion begins with a premise most extraction systems quietly assume and never defend: that a source *contains* a finite, enumerable set of facts, and the job is to find them. Read the document, pull the entities, attach the handful of relations the schema knows how to hold, and you are done. What you missed, you missed because it wasn't there. The bag is full; close it.

The premise is false, and its falseness is the whole of donto's extraction philosophy. **A source does not contain a fixed number of facts. It supports an essentially unbounded space of true properties and relations, distributed across dozens of ontological dimensions, the overwhelming majority of which no single schema will ever name.** A sentence is not a row to be parsed into its columns. It is a point in a possibility-space, and the number of true things you can say about that point is bounded not by the sentence but by the analytical apparatus you bring to it. Bring one lens — entity-and-relation, the industry default — and you see one thin slice. Bring the whole of human understanding to bear at once, and the same sentence opens like a fractal: every clause is a logical proposition, a part-whole structure, a causal link, a deontic obligation, a speech act, a moment in time, a claim made by someone with a stake.

Take a single word from the corpus this essay draws on — the University of Newcastle's *Colonial Frontier Massacres in Australia 1788–1930* dataset, which donto's engine has processed into the live substrate. The colonial archive's word of choice is *dispersal*, and the engine's real, live capture of it (a production edge, `ex:dispersal · euphemismFor · "killing or violent dispersal of Aboriginal people"`) is the thread we pull on. To a named-entity extractor a clause built on that word is two entities and a verb. But the same clause is, simultaneously:

- a **euphemism** — `dispersed` decoding to *killed*: a lexical-semantics and pragmatics fact, and exactly the edge the store holds;
- a **causal node** in a reprisal chain — an etiology fact;
- a **spatial relation** — `locatedNear` a watercourse, and watercourse-proximity is itself a pattern across frontier killings: a topology fact;
- a **deontic** event — an extrajudicial act framed as routine: a norms-and-law fact;
- an **agentive** structure with a *deliberately suppressed agent* — who dispersed them? the passive voice hides it: a thematic-role / case-grammar fact;
- an **axiological** act of framing — the author declines to express sympathy: a value/stance fact about the *source*, not the event;
- a **phenomenological** trace — what was perceived, by whom, and what was *not* recorded;
- an **epistemic** assertion attributable to a particular witness on a particular date — a provenance fact.

Eight disciplines, one word, and we have not yet counted the entities, the type assertions, the labels, or the inverse edges. The named-entity extractor saw one thin slice and silently discarded the other seven. **Total extraction is the refusal to discard.** Most extraction systems see one slice of a source and have no way of even *noticing* that they did, because their schema cannot represent what they threw away. The thinness is not caution. It is a vast, unlogged discard of everything true that the chosen schema couldn't name.

## 2. The deeper point: guiding a mind that already contains the frameworks

Here is what makes total extraction newly possible rather than merely desirable. A frontier model has internalized — compressed into its weights — a staggering fraction of how humans have learned to analyze the world: the case grammar of the linguist, the mereology of the metaphysician, the etiology of the historian, the qualia structure of the lexical semanticist, the supports-and-undercuts of the argumentation theorist. Anthropic's interpretability work found tens of millions of distinct, interpretable concept-directions alive inside a single mid-size model. The directions along which a source can be decomposed are not scarce and they are not ours to invent; they are superabundant and already present, waiting to be *elicited.*

So the art of total extraction is not generation. **The art is guidance.** It is the discipline of getting the model to bring all of that internalized understanding to bear on one source at once, along every axis it knows, and to keep going until the well is dry. The prompt is not a list of fields to fill. It is a method for unlocking, one lens at a time, the analytic frameworks the model already carries — and an explicit license to invent the lenses we forgot to name. Extraction stops being information retrieval and becomes something closer to *guided polyperspectival reading.*

This reframes the act from *recognition* to *deconstruction.* A recognizer matches the source against templates it brought with it. A deconstructor brings the whole toolkit of human analysis and asks of every clause: what does this say to a theorist of parts and wholes? to a theorist of cause? of time? of obligation and permission? of value, of modality, of speech acts, of phenomenal experience? Each lens is a different question, and each question the source can answer is a fact most systems never see. The art is not in choosing the right lens. The art is in bringing them all to bear at once, and in not stopping early.

## 3. The lens sweep, as actually instructed

This is not a manifesto without code behind it. The live extraction prompt — `apps/donto-api/prompts/extract_broad.txt`, ~15.8 KB of instruction, with a domain-tuned sibling `frontier_broad.txt` — *is* the method, written down. It reads less like a parser configuration and more like a charge to a graduate seminar. Three of its instructions carry the whole philosophy.

**First, abundance is the explicit goal, and tidiness is the named enemy.** The standing demand:

> *"Your one job is MAXIMAL FAITHFUL CAPTURE: emit the LARGEST justifiable set of discrete, source-anchored claims the document supports… A substantial source should yield SEVERAL HUNDRED claims; a few dozen means you stopped far too early — go back through every line again. If unsure whether to emit a fact, EMIT IT."*

There is no upper bound on facts and no upper bound on distinct predicates. The model is told, in as many words, that *"a proliferation of specific, self-invented predicates is the SIGNATURE OF ABUNDANCE, not a problem,"* and that typing, alignment, identity-resolution, and joining are explicitly someone else's job, deferred to query time — the substrate's, never the extractor's.

**Second, the sweep is multi-directional by construction.** The prompt is emphatic: *"Do NOT extract from a single angle. Sweep many lenses against EVERY entity and EVERY clause; each yields predicates the others miss, and one fact usually yields several predicates from different lenses — emit them ALL."* A **lens** is a *generative direction* — a discipline's characteristic way of interrogating a thing. The mereologist asks *what are its parts and what is it part of?* The jurist asks *who was permitted, obliged, forbidden, liable?* The phenomenologist asks *how was it experienced, how did it appear?* The lenses are not categories you sort facts into after the fact; they are interrogation strategies you run *before* the facts exist, whose whole purpose is to make the agent ask questions it would not otherwise ask. The prompt enumerates roughly two dozen, each named with the discipline it draws on and the predicates it surfaces that the others structurally cannot:

| Lens | The question it forces | What it surfaces that the others miss |
|---|---|---|
| **Taxonomy / type theory** | *what kind of thing is this?* | `rdf:type`, `instanceOf`, `subtypeOf`, `naturalKind` — the bare typing that makes everything else joinable; cheap and load-bearing |
| **Mereology** (part/whole) | *what are its parts; what is it part of?* | `partOf`, `hasPart`, `componentOf`, `constitutes`, `boundaryOf` |
| **Identity & persistence** | *is this the same thing as that, across time?* | `sameAs`, `likelySameAs`, `aliasOf`, `formerlyKnownAs`, `identityDisputedWith` — identity as *hypothesis*, never premature merge |
| **Topology / spatial** | *where, relative to what?* | `locatedIn`, `contains`, `adjacentTo`, `borders`, `near`, `pathBetween` |
| **Chronology / time** | *when, in what order, for how long?* | `before`, `during`, `startsAt`, `duration`, `atAge`, `recurrenceOf` |
| **Causation / etiology** | *what caused it; what did it cause?* | `causes`, `causedBy`, `triggers`, `enables`, `resultsIn`, `riskFactorFor` |
| **Teleology / function** | *what is it for?* | `purposeOf`, `usedFor`, `intendedFor`, `meansToward`, `serves` |
| **Agency / case grammar** | *who did what to whom, with what, for whom?* | `agentOf`, `patientOf`, `instrumentOf`, `beneficiaryOf`, `experiencerOf` |
| **Epistemology** | *how is this known; who attests; how sure?* | `attestedBy`, `reportedIn`, `inferredFrom`, `doubtedBy`, `witnessedBy` |
| **Deontology / norms & law** | *who was permitted, obliged, forbidden, liable?* | `permittedTo`, `obligedTo`, `forbiddenFrom`, `governedBy`, `liableFor` |
| **Axiology / value** | *how is it valued, praised, condemned, framed?* | `valuedAs`, `praisedFor`, `condemnedFor`, `framedAs` |
| **Modality** | *is it possible, necessary, contingent, capable?* | `possibleThat`, `necessaryThat`, `capableOf`, `contingentOn` |
| **Qualia structure** (Pustejovsky) | *its form, matter, purpose, origin?* | `formalQuale`, `constitutiveQuale`, `telicQuale`, `agentiveQuale` |
| **Lexical semantics** | *how do its words relate?* | `synonymOf`, `hypernymOf`, `meronymOf`, `spellingVariantOf`, `etymologyOf` |
| **Social ontology** | *what roles, statuses, authorities, kinship?* | `roleIn`, `authorityOver`, `subordinateTo`, `kinOf`, `allyOf`, `rivalOf` |
| **Process / event structure** | *its phases, pre/post-conditions, state changes?* | `subEventOf`, `phaseOf`, `preconditionOf`, `stateChangedTo`, `initiates` |
| **Constitution / material** | *what is it made of; what realises it?* | `madeOf`, `realizes`, `embodies`, `instantiatedBy` |
| **Dependence / grounding** | *what does it depend on, presuppose, require?* | `dependsOn`, `supervenesOn`, `groundedIn`, `presupposes` |
| **Provenance / origin** | *where did it come from; who made it?* | `originatesFrom`, `derivedFrom`, `descendedFrom`, `authoredBy`, `basedOn` |
| **Comparison / similarity** | *like what, unlike what, more or less than?* | `similarTo`, `analogousTo`, `contrastsWith`, `proportionalTo` |
| **Quantity / measurement** | *how much, at what rate, in what unit?* | `hasQuantity`, `magnitudeOf`, `rateOf`, `measuredAs`, `approximately` |
| **Disposition / capacity** | *what is it disposed or able to do; vulnerable to?* | `disposedTo`, `tendsTo`, `capableOf`, `vulnerableTo`, `resistantTo` |
| **Speech acts** | *what act did the utterance perform?* | `asserts`, `denies`, `requests`, `promises`, `commands`, `describes` |
| **Phenomenology / experience** | *how was it perceived, felt, made to appear?* | `perceivedBy`, `experiencedAs`, `appearsAs`, `feltBy`, `observedBy` |

And then the meta-instruction that carries the whole design:

> *"That was JUST A LIST OF EXAMPLES — both the lenses and the predicate names. Invent your own predicates, and invent your own lenses/categories, freely. … the right predicates for any text are the ones YOU surface from it."*

This is the difference between a checklist and total extraction, and it is the load-bearing decision of the entire method. **A fixed lens taxonomy would re-impose exactly the scarcity donto exists to escape.** A schema is a finite enumeration of permitted predicates; an ontology is a curated list of permitted lenses. A closed lens set would simply move the bottleneck from "generating facts" to "anticipating the dimensions facts can have" — and the latter is, if anything, harder, because it requires foreseeing every analytical angle every future domain might demand. A frozen taxonomy of two dozen lenses would handle massacre testimony and genealogical certificates and then *silently mutilate* the first clinical case report, legal deposition, protein-interaction dataset, or liturgical text it met — not because those domains lack structure, but because their structure was never on the list.

That is the brittle-logic trap applied to ontology itself, and the open lens is the escape: an open, self-extending lens set is the only thing that scales to arbitrary domains, which is the whole point of a domain-neutral substrate. A clinical note rewards a pharmacology lens (`contraindicatedWith`, `titratedTo`, `adverseReactionIn`) that no genealogist needs; a contract rewards a deontic-instrument lens (`obligationTriggeredBy`, `defaultUpon`, `survivesTermination`) that no massacre report needs; a poem rewards a prosody-and-figuration lens that nothing else does. Each is a lens the agent *constructs in response to the material*, drawing on the same internalized understanding that lets it read as a logician one moment and a jurist the next. The two dozen lenses in the catalogue are not the lens set; they are a demonstration of what a lens *is* — calibrated examples that teach the model the move ("interrogate from a fresh analytical direction; mint the predicate that direction surfaces") so it can perform that move in directions no one wrote down.

**Third, the output is incremental and the agent decides when it is done.** The mechanism is deliberately humble. The agent reads the source, appends a batch of ~30–60 facts to `facts.jsonl` with a `cat >>` heredoc, then re-scans the *whole* source against the *whole* checklist for anything it has not yet emitted — finer relations, missed entities, missing types, inverse edges, second-order links, contradictions, decoded figurative language — and appends another batch. Then it repeats: *"KEEP GOING until YOU judge the source is fully exhausted and a fresh re-scan turns up nothing genuinely new. YOU decide when you are finished. There is NO upper limit on facts or on the number of append batches — exhaustiveness is the ONLY success criterion."* We return to that `>>` in §6 — it is not a stylistic choice but the mechanism that makes long, exhaustive runs survivable.

## 4. The four disciplines that make capture *faithful*, not just maximal

Maximal is easy; maximal-and-honest is the hard part, and it is where the prompt does its most careful work. Four disciplines keep the firehose anchored to reality.

**(a) Every directly-stated claim is anchored to its exact source span.** The fact shape carries an `"a"` field — *"the EXACT substring copied CHARACTER-FOR-CHARACTER from source.txt"*, same casing, same punctuation, including misspellings, *"findable by exact string search."* This is donto's native evidence model made operational: `fact → span → revision → content-addressed blob`. The anchor is what makes every emitted property *falsifiable* — for any claim, even one whose predicate the model invented a millisecond ago, you can return to the exact words that licensed it and ask whether they really do. Maximal capture without anchoring would be hallucination at scale; maximal capture *with* per-claim anchoring is the largest justifiable set of claims a document supports, each one auditable.

**(b) Inference is fenced off and down-weighted.** A claim with no single licensing span omits the anchor, sets `h:true` (hypothesis-only), and drops confidence below 0.9 — `1.0` for explicitly stated, `0.9` for light inference from an adjacent span, `0.7` for significant inference, `0.5` for speculative or decoded. The substrate therefore always knows the difference between *what the source said* and *what the model concluded*, and can re-weight the latter as evidence accumulates.

**(c) Figurative language is captured twice — verbatim and decoded.** The euphemism rule is one of the most distinctive parts of the method. Sources do not say what they mean: they use euphemism (*"dispersed," "tumbled down," "quietened"*), they presuppose (*"his wife"* implies a wife who is an entity in her own right), they imply entities (*"a group of them," "the others"*). The instruction is to emit *both* — the verbatim term, anchored, at full confidence; *and* a decoded claim, anchor omitted, `h:true`, `c < 0.9`, linked by `euphemismFor` / `decodedAs`. The source's own framing is preserved as an attestation; the model's reading of what the framing conceals is preserved as a hypothesis. **The substrate holds the surface and its decoding without collapsing one into the other** — the paraconsistent move, applied at the level of a single word.

**(d) No authority is ground truth.** Every source, author, official, and witness is an *interpretive witness*, framed as an attestation (`attestedBy` / `reportedIn` / `accordingTo`) rather than the extractor's own judgment. Certainty markers in the text ("alleged," "reported," "said to," "estimated") become first-class `certaintyMarker` claims. Crucially, the source is attributed *as an edge, not baked into the predicate* — `ex:capricornian-1913 reportedIn` plus a clean `killCount` value, never a frozen singleton like `killCountPerCapricornian1913` — so the substrate can later join across every source that reports a given value. And when the source contradicts itself, the extractor is forbidden to reconcile any of it: *"NEVER reconcile, dedup, canonicalise, or pick a winner. If the text gives two names/numbers/dates for the same thing, emit EACH as its own claim, then ADD claims linking them"* — `conflictsWith`, `nameDiscrepancyWith`, `countDiscrepancyWith`, `corroboratedBy`. The contradiction is not noise to be cleaned; it is signal to be captured as a first-class edge. The substrate's differentiator — that it holds incompatible claims forever as legal state — begins here, in the extractor's refusal to choose.

## 5. One sentence, swept

The thesis is easy to assert and easy to disbelieve, so watch a polyperspectival sweep work a single, ordinary sentence — the kind of clause that fills a colonial frontier record:

> *In 1855 the Native Police dispersed the blacks at Rannes, and two were killed.*

A conventional entity-and-relation extractor returns perhaps three facts: an event in 1855, a location *Rannes*, a casualty count of *2*, and reports success. Now sweep the lenses the prompt actually runs. *This section is an illustrative reconstruction — a constructed sentence and constructed claim rows, written to show the shape of a polyperspectival sweep in the engine's own fact format; it is not a dump from the store (the live-store dumps are in §§7–8). Every claim with an anchor is character-for-character supported by the span shown; the decoded and inferred claims carry `h:true` and a lower confidence, exactly as the engine emits them.*

| # | Lens | Claim (subject · predicate · object) | Anchor / status |
|---|---|---|---|
| 1 | **Taxonomy** | `ex:native-police` · `rdf:type` · `ex:ParamilitaryForce` | inferred `h:true c:0.9` |
| 2 | Taxonomy | `ex:rannes-event-1855` · `rdf:type` · `ex:DispersalEvent` | `"dispersed the blacks"` |
| 3 | **Mereology** | `ex:rannes-event-1855` · `hasParticipantGroup` · `ex:the-blacks` | `"the blacks at Rannes"` |
| 4 | Mereology | `ex:the-killed-two` · `subsetOf` · `ex:the-blacks` | inferred `h:true c:0.9` |
| 5 | **Chronology** | `ex:rannes-event-1855` · `occurredInYear` · `1855` | `"In 1855"` |
| 6 | Chronology | `ex:dispersal-rannes` · `precededBy` · `ex:provocation-at-rannes` | inferred `h:true c:0.6` |
| 7 | **Topology** | `ex:rannes-event-1855` · `locatedAt` · `ex:rannes` | `"at Rannes"` |
| 8 | **Agency** | `ex:native-police` · `agentOf` · `ex:rannes-event-1855` | `"the Native Police dispersed"` |
| 9 | Agency | `ex:the-blacks` · `patientOf` · `ex:rannes-event-1855` | `"dispersed the blacks"` |
| 10 | Agency | `ex:the-killed-two` · `experiencerOf` · `ex:death` | `"two were killed"` |
| 11 | **Causation** | `ex:rannes-event-1855` · `resultedIn` · `ex:two-deaths` | `"and two were killed"` |
| 12 | Causation | `ex:two-deaths` · `causedBy` · `ex:native-police-action` | inferred `h:true c:0.85` |
| 13 | **Lexical / decoding** | `ex:dispersed` · `euphemismFor` · `ex:killing-and-driving-off` | decoded `h:true c:0.6` |
| 14 | Lexical | `ex:rannes-event-1855` · `decodedAs` · `"killings of Aboriginal people"` | decoded `h:true c:0.55` |
| 15 | Lexical | `ex:the-blacks` · `spellingVariantOf` · `ex:aboriginal-people-rannes` | `"the blacks"` |
| 16 | **Quantity** | `ex:rannes-event-1855` · `recordedDeathCount` · `2` | `"two were killed"` |
| 17 | Quantity | `ex:rannes-event-1855` · `countIsMinimumOnly` · `true` | inferred `h:true c:0.7` |
| 18 | **Deontology** | `ex:native-police` · `actedUnderColourOf` · `ex:colonial-authority` | inferred `h:true c:0.75` |
| 19 | Deontology | `ex:rannes-event-1855` · `noLegalSanctionRecorded` · `true` | inferred `h:true c:0.6` |
| 20 | **Axiology / framing** | `ex:source-record` · `framesKillingAs` · `ex:routine-policing` | framing `h:true c:0.7` |
| 21 | **Modality** | `ex:the-blacks` · `vulnerableTo` · `ex:armed-reprisal` | inferred `h:true c:0.7` |
| 22 | **Epistemology** | `ex:rannes-event-1855` · `reportedIn` · `ex:source-record` | provenance |
| 23 | Epistemology | `ex:death-count-2` · `attestedBy` · `ex:source-record` | provenance |
| 24 | Epistemology | `ex:death-count-2` · `certaintyMarker` · `"unconfirmed-tally"` | inferred `h:true c:0.6` |
| 25 | **Speech act / passive voice** | `ex:source-record` · `omitsAgentOf` · `ex:two-deaths` | `"two were killed"` |
| 26 | **Social ontology** | `ex:native-police` · `authorityOver` · `ex:frontier-district-rannes` | inferred `h:true c:0.65` |
| 27 | **Phenomenology** | `ex:two-deaths` · `notDirectlyWitnessedBy` · `ex:record-author` | inferred `h:true c:0.55` |

Twenty-seven claims, and the sweep is not exhausted — a second pass surfaces the inverse edges (`ex:rannes · wasSiteOf · ex:rannes-event-1855`), the `rdfs:label` on every minted entity, the part-relation between *Rannes* and its colonial district, the comparison to other dispersals in the same record. One ordinary sentence, three "obvious" facts, becomes two to three dozen distinct, evidence-anchored claims — most along dimensions a news-IE schema has no column for and would have discarded without a trace.

Look at *what* the extra two dozen facts are, because they are not padding. Claim 25 — `omitsAgentOf` — is the most important fact in the sentence and the one a conventional extractor is *structurally guaranteed to miss*: the passive "two were killed" deletes the killer, and capturing the deletion *as a fact about the source* is how the substrate later reasons about whose account this is. Claim 13's decoded `euphemismFor` turns the period euphemism into a held hypothesis about what physically happened — anchored to the verbatim term *and* carried as a lower-confidence decoded claim, so the substrate holds both the witness's word and the historian's reading without collapsing either. Claims 18–20 read the sentence as a jurist and an axiologist: the Native Police acted under colour of authority, no sanction was recorded, the record frames a killing as routine policing. None of these are *in* the words. All of them are *supported by* the words, and a frontier model that has read ten thousand histories knows it.

## 6. The engine that runs the loop

A vision this maximal is worthless if the machinery buckles under it, so the engine is built around the failure modes total extraction actually hits.

**An agentic driver over a swappable provider.** The engine is not a regex bank, not a fine-tuned NER head, not a schema-bound IE model. Extraction runs on `OpenCodeAgent` — a headless OpenCode driver that `docker exec`s into a container holding the model credentials, exchanges files over a shared bind mount (`/data/omega/shared/oc/<run_id>` on the host == `/data/oc/<run_id>` inside), and reads back whatever the agent wrote. The model and provider are a clean abstraction (`z-ai/glm-5.1` today, configured by an injected `OPENCODE_CONFIG_CONTENT`; OpenAI / OpenRouter / local are drop-in). Each run gets an isolated `HOME` so concurrent agents never collide on OpenCode's internal SQLite, and a global `flock`-based slot cap bounds how many runs exist host-wide. The whole thing is made durable by Temporal, so a multi-pass, multi-hour deconstruction never loses a batch and a worker restart resumes rather than restarts.

**Faceted multi-pass with the lens-sweep prompt.** Pass 1 is the broad lens sweep of §3. Subsequent passes are continuation passes, each seeded with everything found so far plus a gap-finding preamble — *"facts.jsonl ALREADY contains {n} facts… re-read source.txt and hunt ONLY for facts that are still MISSING — apply ontological lenses you have not used yet, go finer on relations already started… The single goal is to MISS NOTHING."* The genealogy variant additionally rotates focused lenses (kinship, vital events, places, identity resolution, occupations, provenance) so one dimension is mined to exhaustion while the agent still sees all prior facts and refuses to repeat them.

**Incremental append — each batch a durable commit boundary.** A multi-pass, loop-until-dry extraction of a substantial source is a *long* operation — minutes of agent time, thousands of claims, an unknown number of sweeps. A single giant write at the end would mean any timeout, crash, or restart loses everything. The answer is structurally simple and operationally decisive: the agent appends each batch as it goes, *"ALWAYS … with `>>` (NEVER overwrite, NEVER attempt a single giant write of everything). Each batch is durably saved the moment you append it."* Each `>>` is a durable commit boundary; a timeout on pass four does not cost passes one through three. The evidence is in the run logs (an operational observation, not a store-queryable metric): a run that hit `exit 143` — killed at the wall-clock cap after 3,606 seconds — still yielded 2,096 facts, because each `cat >>` had already committed its batch to disk. A monolithic writer would have returned nothing. Exhaustiveness and durability are the same design decision seen from two angles.

**Loop-until-dry — the agent decides when the well is dry.** The terminating condition is not a fact count, not a token budget, not a fixed number of passes; it is the agent's own judgment that a fresh re-scan turns up nothing genuinely new. The controller's only job is to keep feeding continuation passes and to retry a pass that errored to zero. A pass is "dry" when it adds fewer than `max(10, 2% of total)` genuinely new facts; after a configured dry streak, the loop stops. This matters because the second sweep is not redundant: re-scanning the whole source against the whole checklist demonstrably catches what the first pass missed. On the deep re-ingest of one frontier-massacre event (the `ctx:test/ingest-verify/10690` run of §7, 2,334 facts live in the store), the run log records pass one producing 1,882 facts and pass two adding 451 more — **roughly 19% of the final yield surfaced only on the second sweep**, and the 1,882 + 451 ≈ 2,334 total is corroborated by the live store even though the per-pass split is a run-log observation. A single-pass extractor would have reported that event "done" at 1,882 and discarded the fifth that remained, with no signal that anything was missing.

**Two design tensions, handled honestly.** The vision is clean; the implementation earned its scars.

- *A single giant write fails.* The naive design — "emit one big `{"facts":[…]}` object" — times out and writes nothing on any substantial source. The fix is the batched-append mechanism above: many small durable commits instead of one fragile one.
- *Strict offset matching threw away ~70% of good facts.* The original anchor design asked the model for character offsets and rejected any anchor whose `source[start:end]` did not match. But LLMs cannot count characters — their offset arithmetic is almost always wrong — so a strict-offset gate silently discarded the majority of perfectly good, faithfully-quoted facts. The fix (`helpers.ingest_facts` / `_flex_find`) inverts the trust: ignore the model's offsets, trust the *quoted surface text*, and re-derive the offsets by finding that substring in the source — first exactly, then with a whitespace- and case-tolerant regex for spans the model quoted across a line-wrap. The anchor is preserved whenever the quote is genuinely in the source; the offsets are recomputed by software that *can* count. This single change is the difference between a 30%-anchored corpus and the densely evidence-linked one that exists (~1.89M evidence links substrate-wide).

## 7. What total extraction actually produced

The proof is in the substrate. The genealogy example consumer ran this engine over a corpus of colonial frontier-massacre records — dense, contested, multi-source historical descriptions of killings on the Australian frontier, exactly the kind of source where a thin extractor records "an attack happened, N people died" and stops. These are difficult sources — euphemistic, contradictory, written by interested parties, with the violence deliberately obscured by the language. They are precisely where thin extraction fails and total extraction earns its keep. The metrics below are scoped to the **production** corpus and verified against the live store (`ctx:genealogy/frontier-massacres/*`, 26 event contexts, 2026-06-03):

| Metric | Value |
|---|---|
| Total evidence-anchored facts (production corpus) | **32,672** |
| Distinct event contexts | **26** |
| Distinct predicates | **14,629** |
| Singleton predicates (used exactly once) | **11,688 (79.9%)** |
| Facts per event (production) | several hundred → ~1,000+ on the densest |
| Richest single run held (see note) | **2,334** facts on one event |
| Whole substrate, for scale | **992,707 distinct predicates** over **39.58M** live statements |

A provenance note, in the interest of the same rigour the method demands: the production corpus above is what the 26 event contexts hold today, with the densest production events running into four figures of facts. The single richest run this essay quotes — **2,334 evidence-anchored facts** on one event (a multi-source frontier killing, the loop-until-dry and historiographic-meta-fact examples that follow) — comes from a deeper re-ingest of that event held in a verification context (`ctx:test/ingest-verify/10690`), *not* counted in the 26-context production figure; the production version of the same event holds a thinner 920-fact pass. Every number in this essay is real and live in the store; we name which context each came from rather than fuse the two. The re-ingest is the better demonstration of how far a single source can be pushed when the loop is allowed to run to exhaustion; the production corpus is the broader, more conservative everyday yield.

Roughly eighty percent of the predicates in the production corpus are used exactly once, and the same shape holds substrate-wide: of ~992,700 distinct predicates over 39.58M statements, **739,143 — about 74.5% — are used exactly once.** A pre-LLM knowledge engineer would read "80% singletons" as a catastrophe — a vocabulary that has failed to converge, unjoinable noise to be normalized away before the data is usable. donto reads it as the opposite. **The long tail is the signature of completeness, not noise.** Every singleton is a dimension of the source that exactly one lens, fired once, was the only thing that could capture. To demand those predicates *reuse* an existing vocabulary is to demand that the source be flattened into the dimensions we anticipated in advance — which is the slice-and-discard failure total extraction exists to refuse. The singletons are not the system failing to converge; they are the system declining to throw away the parts of reality that don't fit a column. They are captured dimensions awaiting query-time alignment, held losslessly until reality and the alignment engine decide which fold together.

What do those singletons look like? A sample of the corpus's one-time predicates reads like a transcript of a dozen disciplines reading at once:

| Predicate (used once) | The lens it came from |
|---|---|
| `bodyConcealmentDuration`, `bodyDiscoveredAfterDays` | process / event-state-change + time |
| `mannerOfDeath`, `numberOfSpearWounds`, `causationChainDetailed` | thematic roles + causation / forensics |
| `executionDecision`, `orderedAction` | agency + deontology (command, decision) |
| `victimsFleeingTowards`, `escapeMethodObserved`, `displacementDestination` | spatial topology + phenomenology of witness |
| `jurisdictionAtTime`, `actedUnderColourOf` | legal/deontic ladder × chronology |
| `omissionInPrimarySource`, `doesNotExpressSympathyForVictims`, `doesNotCloseWithMoralJudgment` | epistemology / axiology — what a source *fails* to do |
| `isExampleOfColonialTravelWriting`, `quoteFormalRegister`, `usedNonRestrictiveRelativeClauseForBlacks` | linguistic / historiographic *meta-analysis of the source itself* |

Notice the register shift. `numberOfSpearWounds` is a fact a careful extractor might reach. `doesNotExpressSympathyForVictims` is a fact about the *author's stance* — a second-order, axiological reading that requires the model to step back from the event to the witness describing it. That last cluster is worth pausing on: reading a colonial newspaper account, the engine surfaced predicates about *how the account is written* — its register, its grammar, its non-restrictive relative clauses dehumanizing the victims. These are not facts about the massacre. They are second-order facts about the source *as a historiographic artifact*, the kind of reading a trained historian or critical linguist performs and a named-entity recognizer cannot even represent. They emerged because the prompt does not stop at the document's content; it sweeps the open lens — "what other true, supported property does this text hold that nothing above named?" — and lets the model mint a predicate on the spot. That is what "deconstruct using the whole of human understanding" cashes out to in practice.

## 8. The esoteric captures: euphemism, contradiction, and facts the text only implies

The frontier-massacre corpus is where donto's most distinctive extraction behaviours become visible, because the genre is built on euphemism and on irreconcilable sources. Counting the second-order predicates directly in the production corpus (`ctx:genealogy/frontier-massacres/*`, 2026-06-03):

| Predicate | Count | What it captures |
|---|---|---|
| `reportedIn` | 159 | claim attributed to a specific source-as-witness |
| `causedBy` | 72 | etiology edges (incl. reprisal chains) |
| `resultedIn` | 72 | causal / consequence edges |
| `attestedBy` | 66 | claim attributed to a named attester |
| `corroboratedBy` | 32 | independent source agreement |
| `nameDiscrepancyWith` | 29 | name disagreement between sources |
| `euphemismFor` | 21 | verbatim term linked to its decoded meaning |
| `certaintyMarker` | 16 | "alleged" / "reported" / "said to" as first-class |
| `conflictsWith` | 13 | contradictory accounts held side by side |
| `corroborates` | 13 | the inverse corroboration edge |
| `countDiscrepancyWith` | 13 | numeric disagreement between sources |
| `dateDiscrepancyWith` | 12 | date disagreement between sources |
| `decodedAs` | 5 | inferred plain-meaning of figurative language |

**Euphemism decoding** is the signature move. The colonial archive does not say *killed*; it says *dispersed, tumbled down, quietened, a tragedy.* The engine captures both — the verbatim term anchored to its exact span, *and* a separate decoded claim flagged inferred (`h:true`, confidence below 0.9) and linked by `euphemismFor` / `decodedAs`. Real edges from the store:

| Verbatim term (anchored) | Decoded claim (`euphemismFor` / `decodedAs`, hypothesis) |
|---|---|
| "dispersal" / "dispersed" | *killing or violent dispersal of Aboriginal people* |
| "tumble-down" | *shot dead / fell dead* |
| "made the wild blacks pay for it" | *reprisal killings of Aboriginal people* |
| "paid the penalty" | *killed* |
| "punish the offenders" | *kill Aboriginal people* |

This is the two-layer contract operating *inside a single fact*: the literal term is captured at full confidence and anchored to the page; the decoding is captured as an explicit hypothesis, lower-confidence, never overwriting the verbatim record. Both survive. The reader downstream can choose to trust the decode or audit it back to the euphemism and the span. The historiography lives in the gap between the two columns.

**Contradiction is captured, never resolved.** The genre's sources disagree violently on body counts, and the engine records the disagreement as data:

- `ex:attack-nmp-rannes-1855 · countDiscrepancyWith · "Sources give death tolls of 2, 3, 5, 1, 3, 12"`
- the same event: *"Capricornian 1909 says 1 killed; Queenslander 1892 says 5 killed; De Satge says 5 of 6 killed"*
- the same event: *"Morning Bulletin 1912: 3 killed + 2 crippled; Capricornian 1913: 12 killed"*

The most striking product of "no authority is ground truth" is the **cross-source count ladder.** Ten witnesses give ten different tolls for the same event, and the agent emits each as its own anchored claim — *1* ("killing a trooper"), *2* ("murdering two of them"), *3* ("kill three on the spot"), *4* ("Four out of the five…"), *5* ("five of the six troopers killed"), … *12* ("they killed twelve native police"). No reconciliation, no winner. Every count held side-by-side as legal bitemporal state, wired together with `countDiscrepancyWith` edges across the specific sources — the paraconsistent behaviour the substrate exists to support, produced *at extraction time* rather than discarded there.

**Historiographic meta-facts the text only implies.** This is the deepest reach, and the hardest to get any other way. *(Provenance: the five meta-facts in this block are live in the deep re-ingest context `ctx:test/ingest-verify/10690` — the 2,334-fact run flagged in §7 — not in the thinner production pass of the same event. The object values are quoted verbatim from the store.)* Having laid down ten conflicting death counts and tracked how the event is *named* across its sources — the `eventTerminologyBySource` claims pair each term with the witness who used it (`"murder (Leith Hay and Holt)"`, `"outrage (Empire Oct 15)"`, `"affray (Empire Oct 15)"`, `"slaughter (De Satge 1901)"`, `"slaughter (Queenslander 1892)"`, `"massacre (Morning Bulletin 1912)"`), so the hardening of the language over the decades is recoverable by query rather than asserted as a gloss — the agent steps up a level and asserts properties of the *evidentiary record itself*:

- `deathCountTrendOverTime → "increasing (2 in 1855 to 12 in 1913)"` (`h:true`)
- `interSourceDisagreementOnDeathCount → "severe (1-12)"`
- `interSourceDisagreementOnTimeOfDay → "night vs morning"`
- `interSourceDisagreementOnOfficerIdentity → "Robert George Walker vs Henry Walker vs Percy Walker"`
- `narrativeExaggerationOverTime → true`

No sentence in any source says "the reported death toll rose over time." The agent *inferred the trend from the attestation graph it had just built* — a second-order historiographic claim, correctly flagged as inference, anchored to the structure of the testimony rather than to any single span. That is the payoff of reading as a historian and a statistician and an epistemologist simultaneously: the source's own self-contradiction becomes a measured, queryable property. The same totality produces reprisal causation chains (`causedBy` / `resultedIn` linking an initial killing to the punitive expedition that followed, with self-minted predicates the production store actually holds — `reprisalFor`, `provokedReprisal`, `deathTriggeredReprisal`, `sisterReprisalEvent`, `ledReprisalFor`, `causedReprisal`, `followedByReprisal`, and dozens more variants the alignment engine will fold later) and cross-event prosopography — the same named trooper or squatter recognized across separate documents and linked, so that a person becomes a thread running through events extracted independently. None of this is reachable by reading one clause at a time for entities. All of it falls out of reading the whole source, repeatedly, through every lens at once.

**The single-span fan-out.** To see the lenses working in concert, watch one clause from a real run — *"the blacks had been unwisely employed"* — fan into 16 distinct facts: a social-ontology employment relation (`statesBlacksEmployedOnStation`), a causal/etiological one (`backgroundCause`, `enabledByAboriginalEmploymentOnStation`), an epistemic one (`insiderKnowledgeEnabledBy "employment of Aboriginal people at station"`), an axiological/framing one (`framedAboriginalPeopleAs "blacks"`), and an attestation one (`nmp-corps criticisedBy de-satge-1901`). One nine-word clause, read by six experts, yields six different true relations. Multiply that across every clause and the per-word fact density — more than one fact per word, on the densest events — stops looking implausible and starts looking inevitable.

## 9. Why total extraction is now sane: economics and the two-layer contract

Total extraction sounds like a fantasy that would have been laughed out of any knowledge-engineering meeting for the last sixty years — and it would have been, because it was *unaffordable* and there was *nowhere to put the output.* Both constraints just lifted.

**Generation flipped from the scarce step to the cheap step.** For sixty years the binding constraint on every knowledge system was generation. Cyc paid knowledge engineers per assertion. Formal concept analysis required the attributes declared up front. Literature-based discovery rode co-occurrence statistics because actually *reading* and *typing* the literature was unaffordable. When generation is the scarce step, the rational extraction policy is *minimalism*: pull the one canonical entity and its three obvious relations, because every additional fact costs a human minute you cannot spare. That policy is now exactly backwards. **A guided frontier LLM emits a justifiable, evidence-anchored claim for on the order of $10^{-4}$** — a hundredth of a cent. The public record makes the point at scale: GPTKB materialized 105M typed triples over 2.9M entities at $0.00009 per correct triple, inventing its own predicate axes as it went, with 69.5% of the entities it described absent from Wikidata; AutoSchemaKG induced a 900M-node, 5.9B-edge graph from 50M documents with zero predefined schema, the extractor inventing every type on the fly while preserving 93–97% of source information. When emitting one more true fact costs a hundredth of a cent, the cost of *omitting* it — the slice of the source you can never recover because nobody recorded it — is the expensive mistake. The economics did not merely *permit* total extraction; they flipped the optimum from minimal to maximal.

| Era | Cost to emit one typed claim | Scarce resource | Rational extraction policy |
|---|---|---|---|
| Knowledge-engineering (Cyc, FCA) | ~minutes of expert labor | generation | minimal — pull the canonical few, discard the rest |
| Frontier LLM (now) | ~$10^{-4}$ | *holding & deciding* | **maximal — emit every justifiable claim, decide later** |

**But cheapness alone does not make total extraction sane.** The objection to maximalism was never really cost; it was that a maximal firehose *drowns the systems that have to receive it* — and under the standard storage targets, that objection is correct. A vector database collapses every emission to a single embedding; when a new fact conflicts with a stored one, the closer vector wins and the other is silently lost. A normal knowledge graph — and even 2025's best agent-memory graphs — enforces single-truth at write time: Zep/Graphiti uses an LLM to detect contradicting edges and invalidate the overlapping one; Mem0's update step overwrites. Pour a multi-perspectival firehose into any of these and you get the worst of both worlds: you pay to generate 2,000 facts about a source, then the store's collapse machinery throws most of them away at ingest — and it throws away *exactly* the minority, speculative, and contradictory claims that are the raw material of discovery. The decoded euphemism, held at lower confidence beside the verbatim term, is precisely the kind of claim a collapsing store deletes as a near-duplicate. Under those assumptions, the knowledge engineer's minimalism was not timidity. It was a correct adaptation to a downstream that punishes abundance.

**So the real precondition for total extraction was never the price of a token. It was a substrate that does not collapse.** donto's answer is the two-layer contract, and it is the only thing that makes maximal capture safe.

- **Layer one — extraction — maximizes and gates nothing.** The contract here has exactly one invariant: every emission is a *claim* — subject, (possibly brand-new) predicate, object — *anchored to its source evidence.* Not a schema, not a type, not a canonical entity. Within that single constraint the agent emits free, untyped, multi-directional, self-invented predicates and never throws away a justifiable fact.
- **Layer two — the substrate, at query and promotion time — gates everything.** Typing, alignment of near-duplicate predicates, identity resolution, deduplication, and promotion from `hypothesis_only` to a ranked, supported claim are all deferred to where they can be done well, with the full corpus in view, reversibly, and per use. Identity is a hypothesis resolved at query time, never a merge committed at ingest. Predicate alignment is a typed, scoped, query-time operation — align the jobs context to ESCO/O\*NET when you *query* it — not a write-time bottleneck forcing the extractor to guess a canonical form it cannot yet know.

| | Layer 1 — Extraction | Layer 2 — Substrate (query/promotion time) |
|---|---|---|
| **Mandate** | MAXIMIZE — emit every justifiable claim | GATE — type, align, resolve, dedup, promote |
| **Invariant** | claim = subject / predicate / object + evidence anchor | bitemporal, paraconsistent, evidence-first hold |
| **Predicates** | free, self-invented, camelCase, multi-directional | aligned per-context to taxonomies on query |
| **Contradiction** | emit each side + `conflictsWith` — never reconcile | held forever as legal state; re-ranked, not deleted |
| **Identity** | mint distinct IRIs + `likelySameAs` — never pre-merge | resolved as a query-time hypothesis, reversibly |
| **Reversibility** | n/a — emits, never deletes | every decision non-destructive and re-runnable |

The extractor is *forbidden from making the irreversible decisions* — it never picks a winner between two reported death counts, never merges two spellings of a name, never decides which framing is "true." It emits each, then *adds* claims linking them. Every collapse a vector DB or normal KG performs at write time is, in donto, deferred, non-destructive, and per-query. That is what makes maximal emission safe: nothing the extractor emits forecloses a later decision, because the extractor is structurally barred from foreclosing.

There is even a theorem underneath this. Model collapse — a knowledge base degrading as it feeds on its own output — occurs only under *replacement* (train on synthetic, discard real → error grows). Under *accumulation* — keep all real and synthetic claims forever — error is provably bounded, independent of iteration count (Gerstgrasser et al. 2024). donto is an accumulation system by construction: it never overwrites, never dedups, keeps provenance and counter-evidence on every claim. The contradiction-preserving substrate is not merely *compatible* with a maximal firehose; it is the provably-safe container for it. And it stays usable while doing so: `POST /search` already ranks across the full ~39.5M-statement substrate in 270–820ms, stopwords included. Abundance is only worth generating if it stays queryable, and that part is built. And this is the right place to be explicit about the north star: the substrate is the goal, and total extraction exists to *test* it — a maximal, euphemism-decoding, contradiction-emitting firehose is precisely the load that exercises the substrate's hardest invariants (paraconsistent hold, no destructive overwrite, identity-as-hypothesis, defer-joining). The extraction engine earns its keep not by being clever but by handing the substrate exactly the kind of input that would break anything that collapses.

## 10. Measurement, and the honest limits

How do you know you got "everything"? You don't, and the intellectual honesty of the method begins with admitting that **completeness is unfalsifiable.** You cannot prove a source has been fully read; there is no oracle that returns the true cardinality of an unbounded property space. So we do not chase a completeness proof. We chase *saturation* and *faithfulness*, and we instrument both.

- **Coverage saturation (loop-until-dry).** The operational definition of "done" is *the pass on which new facts approach zero.* The agent keeps re-scanning until a fresh sweep of the whole checklist surfaces nothing genuinely new. The saturation curve — facts emitted per pass, trending toward zero — *is* the completeness signal, the only honest one available. The terminating decision is handed to the only thing in the loop that can actually tell whether the well is dry: the agent.
- **Re-extraction lift.** The cheapest test of whether the loop is real: re-run and see if a second sweep finds more. It demonstrably does — the deep re-ingest of event 10690 totals 2,334 facts live in the store, of which the run log attributes 451 (~19%) to the second sweep over the 1,882 of the first. Nearly a fifth of what the source supported was invisible until the whole lens-set was re-applied a second time.
- **Faithfulness via anchors.** Because every directly-stated claim carries a span copied character-for-character from the source, faithfulness is *checkable* by string search rather than by trust. The decoded and inferred claims are the population to audit; the anchored ones are auditable against the page. Measurability survives free emission precisely *because* every emission carries an anchor.

The limits are real and worth stating without flinching. The loop's stopping rule is the model's judgment, and a model that tires or satisfices will declare a source dry too early — saturation is a proxy, not a proof. Decoded euphemism and historiographic meta-facts are hypotheses by construction, and some will be wrong; the design's defense is not that they are always right but that they are *marked as inference, down-weighted, and reversible*, never allowed to masquerade as stated fact. The 80%-singleton tail, the signature of abundance, is also a deferred bill: those predicates must eventually be folded by the alignment engine for a query that needs them folded, and that engine is still being built. And the GLM-coding-subscription path the engine runs on today is a TOS-risky, expiring subsidy — which is exactly why provider is a swappable abstraction, not a hardcoded dependency. None of these undercut the thesis; they locate the work that remains.

## 11. The well, and the bottom of it

What total extraction asks for is a change of stance. The conventional extractor approaches a source with a schema and asks, *which of my slots does this fill?* — and the source's answer is mostly silence, because most of what a source knows was never given a slot. The donto extractor approaches the same source with the whole of human analytical understanding loaded into a single guided mind, and asks the opposite question, again and again until the asking stops paying: *what else is true here? what would a logician see that a historian missed? a jurist that a linguist missed? what is this word concealing, what caused this event, where do these two witnesses disagree, and what have I not yet looked at?* It reads the sentence as all of them at once, mints the precise predicate each reading demands, anchors what the source states and fences what it merely implies, and keeps re-reading until a fresh sweep from every angle turns up nothing new.

The result is not a tidy record. On the richest single event it is 2,334 evidence-anchored claims drawn from a knot of contradictory sources, in a vocabulary that is roughly four-fifths singletons, holding the verbatim euphemism beside its decoding and every witness beside the witnesses it contradicts — and the broader production corpus repeats the shape, more conservatively, across two dozen events. That is what it looks like to get *every single knowable thing* out of a thing, and to put it somewhere that can hold it all. The opening claim, now earned: a source is not a bag of facts to be emptied but a lit-as-far-as-you-light-it space, and almost every extractor built before this one carried a lantern that showed a single wall.

For sixty years we extracted thin because we had to. We do not have to anymore. The well is deep, the bucket is now nearly free, and the only remaining question — the one the systems paper answers — is where to put the water once you finally stop refusing to draw it.
