A companion essay on donto's extraction philosophy and method — the part the systems paper underplays. 2026-06-03.
The donto systems paper specifies a container — bitemporal, paraconsistent, evidence-first — and is nearly silent about the faucet. This essay is about the faucet, and about the conviction behind it: that extraction should be deep, esoteric, and total. You point an agent at one source and guide it to deconstruct that source through the whole apparatus of human understanding — reading a single sentence simultaneously as a logician, a mereologist, a jurist, a phenomenologist, a historian, and a linguist — minting precise predicates from every analytical direction at once, anchoring each claim to its exact span, decoding euphemism and capturing contradiction as first-class facts, and looping until a fresh pass surfaces nothing genuinely new and the agent itself declares the well dry. The premise: a thing is not a bag of facts but an essentially unbounded space of true properties and relations, and almost every extraction system ever built sees one thin slice and silently discards the rest. This is sane now because generation flipped from the scarce step to the cheap step (~$10⁻⁴ per justifiable claim), and safe because of a two-layer contract — maximize at extraction, gate at query time — paired with a substrate built to hold the firehose. We ground every claim in the real implemented engine and its real output: a production frontier-massacre corpus of 32,672 evidence-anchored facts over 14,629 predicates (79.9% singletons), the richest single run holding 2,334 evidence-anchored claims from one event, and a live substrate of ~39.58M statements over ~992,700 predicates.
For as long as machines have read text for us, extraction has meant
subtraction. You define a schema — Person,
Organization, Location, a dozen relation types
if the project is ambitious — and you point a model at a document, and
the model returns the handful of things that fit the boxes you brought
with you. Everything else, which is to say almost everything the
document actually says, falls on the floor. A named-entity
recognizer reads a coroner's deposition and returns three names and a
date. A relation extractor reads a paragraph dense with causation,
obligation, hearsay, euphemism, contradiction, and grief, and returns
(Smith, locatedIn, Queensland). The text was a galaxy; the
output is a postage stamp. We grew so habituated to this loss that we
stopped calling it loss. We called it the result.
This essay is about inverting that habit completely, and the inversion begins with a premise most extraction systems quietly assume and never defend: that a source contains a finite, enumerable set of facts, and the job is to find them. Read the document, pull the entities, attach the handful of relations the schema knows how to hold, and you are done. What you missed, you missed because it wasn't there. The bag is full; close it.
The premise is false, and its falseness is the whole of donto's extraction philosophy. A source does not contain a fixed number of facts. It supports an essentially unbounded space of true properties and relations, distributed across dozens of ontological dimensions, the overwhelming majority of which no single schema will ever name. A sentence is not a row to be parsed into its columns. It is a point in a possibility-space, and the number of true things you can say about that point is bounded not by the sentence but by the analytical apparatus you bring to it. Bring one lens — entity-and-relation, the industry default — and you see one thin slice. Bring the whole of human understanding to bear at once, and the same sentence opens like a fractal: every clause is a logical proposition, a part-whole structure, a causal link, a deontic obligation, a speech act, a moment in time, a claim made by someone with a stake.
Take a single word from the corpus this essay draws on — the
University of Newcastle's Colonial Frontier Massacres in Australia
1788–1930 dataset, which donto's engine has processed into the live
substrate. The colonial archive's word of choice is dispersal,
and the engine's real, live capture of it (a production edge,
ex:dispersal · euphemismFor · "killing or violent dispersal of Aboriginal people")
is the thread we pull on. To a named-entity extractor a clause built on
that word is two entities and a verb. But the same clause is,
simultaneously:
- dispersed decoding to
  killed: a lexical-semantics and pragmatics fact, and exactly
  the edge the store holds;
- locatedNear a
  watercourse, and watercourse-proximity is itself a pattern across
  frontier killings: a topology fact;

Eight disciplines, one word, and we have not yet counted the entities, the type assertions, the labels, or the inverse edges. The named-entity extractor saw one thin slice and silently discarded the other seven. Total extraction is the refusal to discard. Most extraction systems see one slice of a source and have no way of even noticing that they did, because their schema cannot represent what they threw away. The thinness is not caution. It is a vast, unlogged discard of everything true that the chosen schema couldn't name.
Here is what makes total extraction newly possible rather than merely desirable. A frontier model has internalized, compressed into its weights, a staggering fraction of how humans have learned to analyze the world: the case grammar of the linguist, the mereology of the metaphysician, the etiology of the historian, the qualia structure of the lexical semanticist, the supports-and-undercuts of the argumentation theorist. Anthropic's interpretability work extracted millions of interpretable concept-directions from a single mid-size model, using learned dictionaries of up to tens of millions of features. The directions along which a source can be decomposed are not scarce and they are not ours to invent; they are superabundant and already present, waiting to be elicited.
So the art of total extraction is not generation. The art is guidance. It is the discipline of getting the model to bring all of that internalized understanding to bear on one source at once, along every axis it knows, and to keep going until the well is dry. The prompt is not a list of fields to fill. It is a method for unlocking, one lens at a time, the analytic frameworks the model already carries — and an explicit license to invent the lenses we forgot to name. Extraction stops being information retrieval and becomes something closer to guided polyperspectival reading.
This reframes the act from recognition to deconstruction. A recognizer matches the source against templates it brought with it. A deconstructor brings the whole toolkit of human analysis and asks of every clause: what does this say to a theorist of parts and wholes? to a theorist of cause? of time? of obligation and permission? of value, of modality, of speech acts, of phenomenal experience? Each lens is a different question, and each question the source can answer is a fact most systems never see. The art is not in choosing the right lens. The art is in bringing them all to bear at once, and in not stopping early.
This is not a manifesto without code behind it. The live extraction
prompt — apps/donto-api/prompts/extract_broad.txt, ~15.8 KB
of instruction, with a domain-tuned sibling
frontier_broad.txt — is the method, written down.
It reads less like a parser configuration and more like a charge to a
graduate seminar. Three of its instructions carry the whole
philosophy.
First, abundance is the explicit goal, and tidiness is the named enemy. The standing demand:
"Your one job is MAXIMAL FAITHFUL CAPTURE: emit the LARGEST justifiable set of discrete, source-anchored claims the document supports… A substantial source should yield SEVERAL HUNDRED claims; a few dozen means you stopped far too early — go back through every line again. If unsure whether to emit a fact, EMIT IT."
There is no upper bound on facts and no upper bound on distinct predicates. The model is told, in as many words, that "a proliferation of specific, self-invented predicates is the SIGNATURE OF ABUNDANCE, not a problem," and that typing, alignment, identity-resolution, and joining are explicitly someone else's job, deferred to query time — the substrate's, never the extractor's.
Second, the sweep is multi-directional by construction. The prompt is emphatic: "Do NOT extract from a single angle. Sweep many lenses against EVERY entity and EVERY clause; each yields predicates the others miss, and one fact usually yields several predicates from different lenses — emit them ALL." A lens is a generative direction — a discipline's characteristic way of interrogating a thing. The mereologist asks what are its parts and what is it part of? The jurist asks who was permitted, obliged, forbidden, liable? The phenomenologist asks how was it experienced, how did it appear? The lenses are not categories you sort facts into after the fact; they are interrogation strategies you run before the facts exist, whose whole purpose is to make the agent ask questions it would not otherwise ask. The prompt enumerates roughly two dozen, each named with the discipline it draws on and the predicates it surfaces that the others structurally cannot:
| Lens | The question it forces | What it surfaces that the others miss |
|---|---|---|
| Taxonomy / type theory | what kind of thing is this? | rdf:type, instanceOf, subtypeOf, naturalKind: the bare typing that makes everything else joinable; cheap and load-bearing |
| Mereology (part/whole) | what are its parts; what is it part of? | partOf, hasPart, componentOf, constitutes, boundaryOf |
| Identity & persistence | is this the same thing as that, across time? | sameAs, likelySameAs, aliasOf, formerlyKnownAs, identityDisputedWith: identity as hypothesis, never premature merge |
| Topology / spatial | where, relative to what? | locatedIn, contains, adjacentTo, borders, near, pathBetween |
| Chronology / time | when, in what order, for how long? | before, during, startsAt, duration, atAge, recurrenceOf |
| Causation / etiology | what caused it; what did it cause? | causes, causedBy, triggers, enables, resultsIn, riskFactorFor |
| Teleology / function | what is it for? | purposeOf, usedFor, intendedFor, meansToward, serves |
| Agency / case grammar | who did what to whom, with what, for whom? | agentOf, patientOf, instrumentOf, beneficiaryOf, experiencerOf |
| Epistemology | how is this known; who attests; how sure? | attestedBy, reportedIn, inferredFrom, doubtedBy, witnessedBy |
| Deontology / norms & law | who was permitted, obliged, forbidden, liable? | permittedTo, obligedTo, forbiddenFrom, governedBy, liableFor |
| Axiology / value | how is it valued, praised, condemned, framed? | valuedAs, praisedFor, condemnedFor, framedAs |
| Modality | is it possible, necessary, contingent, capable? | possibleThat, necessaryThat, capableOf, contingentOn |
| Qualia structure (Pustejovsky) | its form, matter, purpose, origin? | formalQuale, constitutiveQuale, telicQuale, agentiveQuale |
| Lexical semantics | how do its words relate? | synonymOf, hypernymOf, meronymOf, spellingVariantOf, etymologyOf |
| Social ontology | what roles, statuses, authorities, kinship? | roleIn, authorityOver, subordinateTo, kinOf, allyOf, rivalOf |
| Process / event structure | its phases, pre/post-conditions, state changes? | subEventOf, phaseOf, preconditionOf, stateChangedTo, initiates |
| Constitution / material | what is it made of; what realises it? | madeOf, realizes, embodies, instantiatedBy |
| Dependence / grounding | what does it depend on, presuppose, require? | dependsOn, supervenesOn, groundedIn, presupposes |
| Provenance / origin | where did it come from; who made it? | originatesFrom, derivedFrom, descendedFrom, authoredBy, basedOn |
| Comparison / similarity | like what, unlike what, more or less than? | similarTo, analogousTo, contrastsWith, proportionalTo |
| Quantity / measurement | how much, at what rate, in what unit? | hasQuantity, magnitudeOf, rateOf, measuredAs, approximately |
| Disposition / capacity | what is it disposed or able to do; vulnerable to? | disposedTo, tendsTo, capableOf, vulnerableTo, resistantTo |
| Speech acts | what act did the utterance perform? | asserts, denies, requests, promises, commands, describes |
| Phenomenology / experience | how was it perceived, felt, made to appear? | perceivedBy, experiencedAs, appearsAs, feltBy, observedBy |
And then the meta-instruction that carries the whole design:
"That was JUST A LIST OF EXAMPLES — both the lenses and the predicate names. Invent your own predicates, and invent your own lenses/categories, freely. … the right predicates for any text are the ones YOU surface from it."
This is the difference between a checklist and total extraction, and it is the load-bearing decision of the entire method. A fixed lens taxonomy would re-impose exactly the scarcity donto exists to escape. A schema is a finite enumeration of permitted predicates; an ontology is a curated list of permitted lenses. A closed lens set would simply move the bottleneck from "generating facts" to "anticipating the dimensions facts can have" — and the latter is, if anything, harder, because it requires foreseeing every analytical angle every future domain might demand. A frozen taxonomy of two dozen lenses would handle massacre testimony and genealogical certificates and then silently mutilate the first clinical case report, legal deposition, protein-interaction dataset, or liturgical text it met — not because those domains lack structure, but because their structure was never on the list.
That is the brittle-logic trap applied to ontology itself, and the
open lens is the escape: an open, self-extending lens set is the only
thing that scales to arbitrary domains, which is the whole point of a
domain-neutral substrate. A clinical note rewards a pharmacology lens
(contraindicatedWith, titratedTo,
adverseReactionIn) that no genealogist needs; a contract
rewards a deontic-instrument lens (obligationTriggeredBy,
defaultUpon, survivesTermination) that no
massacre report needs; a poem rewards a prosody-and-figuration lens that
nothing else does. Each is a lens the agent constructs in response
to the material, drawing on the same internalized understanding
that lets it read as a logician one moment and a jurist the next. The
two dozen lenses in the catalogue are not the lens set; they are a
demonstration of what a lens is — calibrated examples that
teach the model the move ("interrogate from a fresh analytical
direction; mint the predicate that direction surfaces") so it can
perform that move in directions no one wrote down.
Third, the output is incremental and the agent decides when
it is done. The mechanism is deliberately humble. The agent
reads the source, appends a batch of ~30–60 facts to
facts.jsonl with a cat >> heredoc, then
re-scans the whole source against the whole checklist
for anything it has not yet emitted — finer relations, missed entities,
missing types, inverse edges, second-order links, contradictions,
decoded figurative language — and appends another batch. Then it
repeats: "KEEP GOING until YOU judge the source is fully exhausted
and a fresh re-scan turns up nothing genuinely new. YOU decide when you
are finished. There is NO upper limit on facts or on the number of
append batches — exhaustiveness is the ONLY success criterion." We
return to that >> in §6 — it is not a stylistic
choice but the mechanism that makes long, exhaustive runs
survivable.
Maximal is easy; maximal-and-honest is the hard part, and it is where the prompt does its most careful work. Four disciplines keep the firehose anchored to reality.
(a) Every directly-stated claim is anchored to its exact
source span. The fact shape carries an "a" field —
"the EXACT substring copied CHARACTER-FOR-CHARACTER from
source.txt", same casing, same punctuation, including misspellings,
"findable by exact string search." This is donto's native
evidence model made operational:
fact → span → revision → content-addressed blob. The anchor
is what makes every emitted property falsifiable — for any
claim, even one whose predicate the model invented a millisecond ago,
you can return to the exact words that licensed it and ask whether they
really do. Maximal capture without anchoring would be hallucination at
scale; maximal capture with per-claim anchoring is the largest
justifiable set of claims a document supports, each one auditable.
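The "findable by exact string search" rule can be made concrete. The following is a minimal sketch, not the engine's code: the function name `locate_anchor` and the sample text are hypothetical, and the fallback assumes the matching strategy this essay attributes to the ingest helper (exact match first, then a whitespace- and case-tolerant regex).

```python
import re

def locate_anchor(source: str, quote: str):
    """Return (start, end) offsets of `quote` in `source`, or None if absent.

    Offsets are derived by searching for the quoted text, never taken
    from the model. Hypothetical helper, for illustration only.
    """
    if not quote.strip():
        return None
    # 1) Exact, character-for-character match.
    start = source.find(quote)
    if start != -1:
        return start, start + len(quote)
    # 2) Tolerant match: any run of whitespace matches any other run,
    #    and letter case is ignored (e.g. a quote spanning a line wrap).
    pattern = r"\s+".join(re.escape(token) for token in quote.split())
    match = re.search(pattern, source, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None

# Sample text with a line wrap in the middle of a clause.
source = "In 1855 the Native Police dispersed\nthe blacks at Rannes, and two were killed."
quotes = ["at Rannes", "dispersed the blacks", "in 1855", "three were killed"]
spans = {quote: locate_anchor(source, quote) for quote in quotes}
for quote, span in spans.items():
    print(repr(quote), "->", span)
```

A quote that resolves to `None` cannot be audited against the source, so under the contract described here it would have to be carried as an unanchored hypothesis rather than as an attested claim.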
(b) Inference is fenced off and down-weighted. A
claim with no single licensing span omits the anchor, sets
h:true (hypothesis-only), and carries a confidence of 0.9 or lower.
The ladder is 1.0 for explicitly stated, 0.9 for light
inference from an adjacent span, 0.7 for significant
inference, 0.5 for speculative or decoded. The substrate
therefore always knows the difference between what the source
said and what the model concluded, and can re-weight the
latter as evidence accumulates.
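The fence between stated and inferred claims is simple enough to check mechanically. This is a hedged sketch of such a check, assuming facts are dictionaries that use the `a`, `h` and `c` fields named above, and reading the ladder as "an unanchored claim sits at 0.9 or lower". The `LADDER` keys and the `contract_violations` function are hypothetical names, not part of the engine.

```python
# Confidence ladder as described in the text; the key names are illustrative.
LADDER = {
    "stated": 1.0,
    "light_inference": 0.9,
    "significant_inference": 0.7,
    "speculative_or_decoded": 0.5,
}

def contract_violations(fact: dict) -> list[str]:
    """List the ways a fact breaks the stated/inferred contract (empty if none)."""
    confidence = fact.get("c")
    if confidence is None or not 0.0 <= confidence <= 1.0:
        return ["confidence missing or out of range"]
    problems = []
    if "a" not in fact:
        # No anchor: the claim must be marked hypothesis-only and down-weighted.
        if fact.get("h") is not True:
            problems.append("unanchored claim must set h:true")
        if confidence > LADDER["light_inference"]:
            problems.append("unanchored claim must not exceed 0.9")
    return problems

checks = [
    contract_violations({"a": "at Rannes", "c": 1.0}),
    contract_violations({"h": True, "c": 1.0}),
    contract_violations({"c": 0.7}),
]
print(checks)
```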
(c) Figurative language is captured twice — verbatim and
decoded. The euphemism rule is one of the most distinctive
parts of the method. Sources do not say what they mean: they use
euphemism ("dispersed," "tumbled down," "quietened"), they
presuppose ("his wife" implies a wife who is an entity in her
own right), they imply entities ("a group of them," "the
others"). The instruction is to emit both — the verbatim
term, anchored, at full confidence; and a decoded claim, anchor
omitted, h:true, c < 0.9, linked by
euphemismFor / decodedAs. The source's own
framing is preserved as an attestation; the model's reading of what the
framing conceals is preserved as a hypothesis. The substrate
holds the surface and its decoding without collapsing one into the
other — the paraconsistent move, applied at the level of a
single word.
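As an illustration of the dual capture, the sketch below emits the surface term and its decoding as two separate claims. The `s`/`p`/`o` keys, the use of `rdfs:label` for the verbatim claim, and the function name are assumptions made for the example; only the `a`, `h`, `c` fields and the `euphemismFor` predicate come from the text.

```python
def emit_euphemism(term_id: str, verbatim: str, decoded: str,
                   confidence: float = 0.5) -> list[dict]:
    """Emit the verbatim term (anchored) and its decoding (hypothesis) side by side."""
    if not confidence < 0.9:
        raise ValueError("a decoded claim must carry confidence below 0.9")
    # The source's own word: anchored to the exact span, full confidence.
    surface = {"s": term_id, "p": "rdfs:label", "o": verbatim,
               "a": verbatim, "c": 1.0}
    # The model's reading of what the word conceals: no anchor, h:true.
    reading = {"s": term_id, "p": "euphemismFor", "o": decoded,
               "h": True, "c": confidence}
    return [surface, reading]

surface, reading = emit_euphemism(
    "ex:dispersal",
    "dispersal",
    "killing or violent dispersal of Aboriginal people",
)
print(surface)
print(reading)
```

Neither claim overwrites the other, so a later query can choose to trust only the attested surface, only the decoding, or both.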
(d) No authority is ground truth. Every source,
author, official, and witness is an interpretive witness,
framed as an attestation (attestedBy /
reportedIn / accordingTo) rather than the
extractor's own judgment. Certainty markers in the text ("alleged,"
"reported," "said to," "estimated") become first-class
certaintyMarker claims. Crucially, the source is attributed
as an edge, not baked into the predicate —
ex:capricornian-1913 reportedIn plus a clean
killCount value, never a frozen singleton like
killCountPerCapricornian1913 — so the substrate can later
join across every source that reports a given value. And when the source
contradicts itself, the extractor is forbidden to reconcile any of it:
"NEVER reconcile, dedup, canonicalise, or pick a winner. If the text
gives two names/numbers/dates for the same thing, emit EACH as its own
claim, then ADD claims linking them" — conflictsWith,
nameDiscrepancyWith, countDiscrepancyWith,
corroboratedBy. The contradiction is not noise to be
cleaned; it is signal to be captured as a first-class edge. The
substrate's differentiator — that it holds incompatible claims forever
as legal state — begins here, in the extractor's refusal to choose.
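The refusal to reconcile can be shown in miniature. In the engine the extractor itself emits the linking claims; the sketch below only illustrates the resulting shape mechanically. The claim `id` field, the claim-about-claim use of `reportedIn`, the second source, and both kill counts are hypothetical values invented for the example.

```python
from itertools import combinations

def link_count_discrepancies(claims: list[dict]) -> list[dict]:
    """Return every input claim unchanged, plus edges linking conflicting values."""
    by_key: dict[tuple, list[dict]] = {}
    for claim in claims:
        by_key.setdefault((claim["s"], claim["p"]), []).append(claim)
    links = []
    for group in by_key.values():
        for left, right in combinations(group, 2):
            if left["o"] != right["o"]:
                # Record the disagreement; never pick a winner or drop a claim.
                links.append({"s": left["id"], "p": "countDiscrepancyWith",
                              "o": right["id"]})
    return claims + links

claims = [
    {"id": "c1", "s": "ex:event", "p": "killCount", "o": 12},   # hypothetical value
    {"id": "c2", "s": "ex:event", "p": "killCount", "o": 30},   # hypothetical value
    # The source is attributed as an edge, not baked into the predicate name.
    {"id": "c3", "s": "c1", "p": "reportedIn", "o": "ex:capricornian-1913"},
    {"id": "c4", "s": "c2", "p": "reportedIn", "o": "ex:second-source"},
]
linked = link_count_discrepancies(claims)
for row in linked:
    print(row)
```

Because `killCount` stays a clean, shared predicate, both values remain joinable across sources, and the discrepancy edge is one more fact rather than a cleanup step.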
The thesis is easy to assert and easy to disbelieve, so watch a polyperspectival sweep work a single, ordinary sentence — the kind of clause that fills a colonial frontier record:
In 1855 the Native Police dispersed the blacks at Rannes, and two were killed.
A conventional entity-and-relation extractor returns perhaps three
facts: an event in 1855, a location Rannes, a casualty count of
2, and reports success. Now sweep the lenses the prompt
actually runs. This section is an illustrative reconstruction — a
constructed sentence and constructed claim rows, written to show the
shape of a polyperspectival sweep in the engine's own fact format; it is
not a dump from the store (the live-store dumps are in §§7–8). Every
claim with an anchor is character-for-character supported by the span
shown; the decoded and inferred claims carry h:true and a
lower confidence, exactly as the engine emits them.
| # | Lens | Claim (subject · predicate · object) | Anchor / status |
|---|---|---|---|
| 1 | Taxonomy | ex:native-police · rdf:type · ex:ParamilitaryForce | inferred h:true c:0.9 |
| 2 | Taxonomy | ex:rannes-event-1855 · rdf:type · ex:DispersalEvent | "dispersed the blacks" |
| 3 | Mereology | ex:rannes-event-1855 · hasParticipantGroup · ex:the-blacks | "the blacks at Rannes" |
| 4 | Mereology | ex:the-killed-two · subsetOf · ex:the-blacks | inferred h:true c:0.9 |
| 5 | Chronology | ex:rannes-event-1855 · occurredInYear · 1855 | "In 1855" |
| 6 | Chronology | ex:dispersal-rannes · precededBy · ex:provocation-at-rannes | inferred h:true c:0.6 |
| 7 | Topology | ex:rannes-event-1855 · locatedAt · ex:rannes | "at Rannes" |
| 8 | Agency | ex:native-police · agentOf · ex:rannes-event-1855 | "the Native Police dispersed" |
| 9 | Agency | ex:the-blacks · patientOf · ex:rannes-event-1855 | "dispersed the blacks" |
| 10 | Agency | ex:the-killed-two · experiencerOf · ex:death | "two were killed" |
| 11 | Causation | ex:rannes-event-1855 · resultedIn · ex:two-deaths | "and two were killed" |
| 12 | Causation | ex:two-deaths · causedBy · ex:native-police-action | inferred h:true c:0.85 |
| 13 | Lexical / decoding | ex:dispersed · euphemismFor · ex:killing-and-driving-off | decoded h:true c:0.6 |
| 14 | Lexical | ex:rannes-event-1855 · decodedAs · "killings of Aboriginal people" | decoded h:true c:0.55 |
| 15 | Lexical | ex:the-blacks · spellingVariantOf · ex:aboriginal-people-rannes | "the blacks" |
| 16 | Quantity | ex:rannes-event-1855 · recordedDeathCount · 2 | "two were killed" |
| 17 | Quantity | ex:rannes-event-1855 · countIsMinimumOnly · true | inferred h:true c:0.7 |
| 18 | Deontology | ex:native-police · actedUnderColourOf · ex:colonial-authority | inferred h:true c:0.75 |
| 19 | Deontology | ex:rannes-event-1855 · noLegalSanctionRecorded · true | inferred h:true c:0.6 |
| 20 | Axiology / framing | ex:source-record · framesKillingAs · ex:routine-policing | framing h:true c:0.7 |
| 21 | Modality | ex:the-blacks · vulnerableTo · ex:armed-reprisal | inferred h:true c:0.7 |
| 22 | Epistemology | ex:rannes-event-1855 · reportedIn · ex:source-record | provenance |
| 23 | Epistemology | ex:death-count-2 · attestedBy · ex:source-record | provenance |
| 24 | Epistemology | ex:death-count-2 · certaintyMarker · "unconfirmed-tally" | inferred h:true c:0.6 |
| 25 | Speech act / passive voice | ex:source-record · omitsAgentOf · ex:two-deaths | "two were killed" |
| 26 | Social ontology | ex:native-police · authorityOver · ex:frontier-district-rannes | inferred h:true c:0.65 |
| 27 | Phenomenology | ex:two-deaths · notDirectlyWitnessedBy · ex:record-author | inferred h:true c:0.55 |
Twenty-seven claims, and the sweep is not exhausted: a second pass
surfaces the inverse edges
(ex:rannes · wasSiteOf · ex:rannes-event-1855), the
rdfs:label on every minted entity, the part-relation
between Rannes and its colonial district, the comparison to
other dispersals in the same record. One ordinary sentence, three
"obvious" facts, becomes two to three dozen distinct claims, each
anchored to its span, attributed to the record, or flagged as a
hypothesis, most along dimensions a news-IE schema has no column for and
would have discarded without a trace.
Look at what the extra two dozen facts are, because they are
not padding. Claim 25 — omitsAgentOf — is the most
important fact in the sentence and the one a conventional extractor is
structurally guaranteed to miss: the passive "two were killed"
deletes the killer, and capturing the deletion as a fact about the
source is how the substrate later reasons about whose account this
is. Claim 13's decoded euphemismFor turns the period
euphemism into a held hypothesis about what physically happened:
the verbatim term stays anchored at full confidence while the decoding
is carried as a lower-confidence h:true claim, so the substrate holds
both the witness's word and the
historian's reading without collapsing either. Claims 18–20 read the
sentence as a jurist and an axiologist: the Native Police acted under
colour of authority, no sanction was recorded, the record frames a
killing as routine policing. None of these are in the words.
All of them are supported by the words, and a frontier model
that has read ten thousand histories knows it.
A vision this maximal is worthless if the machinery buckles under it, so the engine is built around the failure modes total extraction actually hits.
An agentic driver over a swappable provider. The
engine is not a regex bank, not a fine-tuned NER head, not a
schema-bound IE model. Extraction runs on OpenCodeAgent — a
headless OpenCode driver that docker execs into a container
holding the model credentials, exchanges files over a shared bind mount
(/data/omega/shared/oc/<run_id> on the host ==
/data/oc/<run_id> inside), and reads back whatever
the agent wrote. The model and provider are a clean abstraction
(z-ai/glm-5.1 today, configured by an injected
OPENCODE_CONFIG_CONTENT; OpenAI / OpenRouter / local are
drop-in). Each run gets an isolated HOME so concurrent
agents never collide on OpenCode's internal SQLite, and a global
flock-based slot cap bounds how many runs exist host-wide.
The whole thing is made durable by Temporal, so a multi-pass, multi-hour
deconstruction never loses a batch and a worker restart resumes rather
than restarts.
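A host-wide slot cap of the kind described can be built from advisory file locks. The sketch below is one plausible shape, assuming a POSIX host: the lock directory, the `slot-N.lock` file names, the `acquire_slot` helper, and the choice to fail rather than wait when every slot is busy are all assumptions, not the engine's actual implementation.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def acquire_slot(lock_dir: str, max_slots: int):
    """Hold one of `max_slots` slots for the duration of the with-block."""
    os.makedirs(lock_dir, exist_ok=True)
    for slot in range(max_slots):
        path = os.path.join(lock_dir, f"slot-{slot}.lock")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue
        try:
            yield slot
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
        return
    raise RuntimeError(f"all {max_slots} slots are busy")

lock_dir = tempfile.mkdtemp()
with acquire_slot(lock_dir, max_slots=2) as first, \
        acquire_slot(lock_dir, max_slots=2) as second:
    held = (first, second)
    try:
        with acquire_slot(lock_dir, max_slots=2):
            third_admitted = True
    except RuntimeError:
        third_admitted = False
print(held, third_admitted)
```

The kernel releases a lock when its holder exits, so a crashed run frees its slot without any cleanup step, which is the usual reason to prefer `flock` over a counter file.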
Faceted multi-pass with the lens-sweep prompt. Pass 1 is the broad lens sweep of §3. Subsequent passes are continuation passes, each seeded with everything found so far plus a gap-finding preamble — "facts.jsonl ALREADY contains {n} facts… re-read source.txt and hunt ONLY for facts that are still MISSING — apply ontological lenses you have not used yet, go finer on relations already started… The single goal is to MISS NOTHING." The genealogy variant additionally rotates focused lenses (kinship, vital events, places, identity resolution, occupations, provenance) so one dimension is mined to exhaustion while the agent still sees all prior facts and refuses to repeat them.
Incremental append — each batch a durable commit
boundary. A multi-pass, loop-until-dry extraction of a
substantial source is a long operation: an hour or more of agent time,
thousands of claims, an unknown number of sweeps. A single giant write
at the end would mean any timeout, crash, or restart loses everything.
The answer is structurally simple and operationally decisive: the agent
appends each batch as it goes, "ALWAYS … with >>
(NEVER overwrite, NEVER attempt a single giant write of everything).
Each batch is durably saved the moment you append it." Each
>> is a durable commit boundary; a timeout on pass
four does not cost passes one through three. The evidence is in the run
logs (an operational observation, not a store-queryable metric): a run
that hit exit 143 — killed at the wall-clock cap after
3,606 seconds — still yielded 2,096 facts, because each
cat >> had already committed its batch to disk. A
monolithic writer would have returned nothing. Exhaustiveness and
durability are the same design decision seen from two angles.
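The agent performs its appends from the shell with `cat >>`; the same commit-per-batch idea can be sketched in Python for readers who want to see why a killed run still leaves usable output. The function names and sample facts are hypothetical, and the explicit `fsync` is an added assumption about how strictly "durably saved" is meant.

```python
import json
import os
import tempfile

def append_batch(path: str, batch: list[dict]) -> None:
    """Append one batch as JSON Lines; earlier batches are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        for fact in batch:
            handle.write(json.dumps(fact, ensure_ascii=False) + "\n")
        handle.flush()
        os.fsync(handle.fileno())  # push the batch to disk before returning

def load_facts(path: str) -> list[dict]:
    """Read back every complete line, skipping a final line torn by a kill."""
    facts = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                facts.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return facts

path = os.path.join(tempfile.mkdtemp(), "facts.jsonl")
append_batch(path, [{"s": "ex:a", "p": "rdf:type", "o": "ex:Event"}])
append_batch(path, [{"s": "ex:a", "p": "locatedAt", "o": "ex:rannes"}])
with open(path, "a", encoding="utf-8") as handle:
    handle.write('{"s": "ex:a", "p": "occ')  # simulate a run killed mid-write
recovered = load_facts(path)
print(len(recovered))
```

One line per fact is what makes the partial file recoverable: a truncated tail costs at most the fact being written, never the batches before it.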
Loop-until-dry: the agent decides when the well is
dry. The terminating condition is not a fact count, not a token
budget, not a fixed number of passes; it is the agent's own judgment
that a fresh re-scan turns up nothing genuinely new. The controller
only operationalizes that judgment: it keeps feeding continuation
passes, retries a pass that errored to zero, treats a pass as "dry"
when it adds fewer than max(10, 2% of total) genuinely new facts, and
stops the loop after a configured dry streak. This matters because the second
sweep is not redundant: re-scanning the whole source against the whole
checklist demonstrably catches what the first pass missed. On the deep
re-ingest of one frontier-massacre event (the
ctx:test/ingest-verify/10690 run of §7, 2,334 facts live in
the store), the run log records pass one producing 1,882 facts and pass
two adding 451 more, so roughly 19% of the final yield surfaced
only on the second sweep. The sum, 1,882 + 451 = 2,333, is within one
fact of the 2,334 total held in the live store, which corroborates the
total even though the per-pass split is a
run-log observation. A single-pass extractor would have reported that
event "done" at 1,882 and discarded the fifth that remained, with no
signal that anything was missing.
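The dry-pass rule is small enough to write out. This is a minimal sketch under stated assumptions: "total" is taken to mean the running total after the pass, the dry-streak length of 2 is an arbitrary stand-in for the configured value, retrying errored passes is left out, and the per-pass yields in the demo are invented rather than taken from any run log.

```python
def is_dry(new_facts: int, total_facts: int) -> bool:
    """A pass is dry when it adds fewer than max(10, 2% of total) new facts."""
    return new_facts < max(10, 0.02 * total_facts)

def run_until_dry(run_pass, dry_streak_limit: int = 2):
    """Feed continuation passes until `dry_streak_limit` dry passes in a row.

    `run_pass(total_so_far)` returns the number of genuinely new facts
    the pass appended. Returns (total, per-pass history).
    """
    total, streak, history = 0, 0, []
    while streak < dry_streak_limit:
        added = run_pass(total)
        total += added
        history.append(added)
        streak = streak + 1 if is_dry(added, total) else 0
    return total, history

# Hypothetical per-pass yields, standing in for real agent passes.
yields = iter([400, 90, 6, 2, 0])
total, history = run_until_dry(lambda total_so_far: next(yields))
print(total, history)
```

The `max(10, ...)` floor keeps a small source from looping forever on one or two stray facts, while the 2% term scales the bar up for sources that have already yielded thousands.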
Two design tensions, handled honestly. The vision is clean; the implementation earned its scars.
- {"facts":[…]} object" times out and writes nothing on any substantial
  source. The fix is the batched-append mechanism above: many small
  durable commits instead of one fragile one.
- source[start:end] did not match. But LLMs cannot count characters
  (their offset arithmetic is almost always wrong), so a strict-offset
  gate silently discarded the majority of perfectly good,
  faithfully-quoted facts. The fix (helpers.ingest_facts / _flex_find)
  inverts the trust: ignore the model's offsets, trust the quoted
  surface text, and re-derive the offsets by finding that substring in
  the source, first exactly, then with a whitespace- and case-tolerant
  regex for spans the model quoted across a line-wrap. The anchor is
  preserved whenever the quote is genuinely in the source; the offsets
  are recomputed by software that can count. This single change is the
  difference between a 30%-anchored corpus and the densely
  evidence-linked one that exists (~1.89M evidence links
  substrate-wide).

The proof is in the substrate. The genealogy example consumer ran
this engine over a corpus of colonial frontier-massacre records — dense,
contested, multi-source historical descriptions of killings on the
Australian frontier, exactly the kind of source where a thin extractor
records "an attack happened, N people died" and stops. These are
difficult sources — euphemistic, contradictory, written by interested
parties, with the violence deliberately obscured by the language. They
are precisely where thin extraction fails and total extraction earns its
keep. The metrics below are scoped to the production
corpus and verified against the live store
(ctx:genealogy/frontier-massacres/*, 26 event contexts,
2026-06-03):
| Metric | Value |
|---|---|
| Total evidence-anchored facts (production corpus) | 32,672 |
| Distinct event contexts | 26 |
| Distinct predicates | 14,629 |
| Singleton predicates (used exactly once) | 11,688 (79.9%) |
| Facts per event (production) | several hundred → ~1,000+ on the densest |
| Richest single run held (see note) | 2,334 facts on one event |
| Whole substrate, for scale | 992,707 distinct predicates over 39.58M live statements |
A provenance note, in the interest of the same rigour the method
demands: the production corpus above is what the 26 event contexts hold
today, with the densest production events running into four figures of
facts. The single richest run this essay quotes — 2,334
evidence-anchored facts on one event (a multi-source frontier
killing, the loop-until-dry and historiographic-meta-fact examples that
follow) — comes from a deeper re-ingest of that event held in a
verification context (ctx:test/ingest-verify/10690),
not counted in the 26-context production figure; the production
version of the same event holds a thinner 920-fact pass. Every number in
this essay is real and live in the store; we name which context each
came from rather than fuse the two. The re-ingest is the better
demonstration of how far a single source can be pushed when the loop is
allowed to run to exhaustion; the production corpus is the broader, more
conservative everyday yield.
Roughly eighty percent of the predicates in the production corpus are used exactly once, and the same shape holds substrate-wide: of ~992,700 distinct predicates over 39.58M statements, 739,143 — about 74.5% — are used exactly once. A pre-LLM knowledge engineer would read "80% singletons" as a catastrophe — a vocabulary that has failed to converge, unjoinable noise to be normalized away before the data is usable. donto reads it as the opposite. The long tail is the signature of completeness, not noise. Every singleton is a dimension of the source that exactly one lens, fired once, was the only thing that could capture. To demand those predicates reuse an existing vocabulary is to demand that the source be flattened into the dimensions we anticipated in advance — which is the slice-and-discard failure total extraction exists to refuse. The singletons are not the system failing to converge; they are the system declining to throw away the parts of reality that don't fit a column. They are captured dimensions awaiting query-time alignment, held losslessly until reality and the alignment engine decide which fold together.
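The singleton share is a one-line statistic over predicate usage, and the quoted percentages can be re-derived from the counts this essay reports. In the sketch below the function name and the five-item sample are illustrative; the four large integers are the figures quoted in the text and the metrics table.

```python
from collections import Counter

def singleton_share(predicates: list[str]) -> tuple[int, int, float]:
    """Return (singletons, distinct predicates, singleton fraction)."""
    usage = Counter(predicates)
    singles = sum(1 for count in usage.values() if count == 1)
    return singles, len(usage), singles / len(usage)

# Tiny illustrative sample: one reused predicate, three used once.
sample = ["rdf:type", "rdf:type", "locatedAt",
          "numberOfSpearWounds", "bodyConcealmentDuration"]
singles, distinct, share = singleton_share(sample)

# Re-deriving the quoted percentages from the quoted counts.
production_pct = round(100 * 11_688 / 14_629, 1)
substrate_pct = round(100 * 739_143 / 992_707, 1)
print(singles, distinct, share, production_pct, substrate_pct)
```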
What do those singletons look like? A sample of the corpus's one-time predicates reads like a transcript of a dozen disciplines reading at once:
| Predicate (used once) | The lens it came from |
|---|---|
| bodyConcealmentDuration, bodyDiscoveredAfterDays | process / event-state-change + time |
| mannerOfDeath, numberOfSpearWounds, causationChainDetailed | thematic roles + causation / forensics |
| executionDecision, orderedAction | agency + deontology (command, decision) |
| victimsFleeingTowards, escapeMethodObserved, displacementDestination | spatial topology + phenomenology of witness |
| jurisdictionAtTime, actedUnderColourOf | legal/deontic ladder × chronology |
| omissionInPrimarySource, doesNotExpressSympathyForVictims, doesNotCloseWithMoralJudgment | epistemology / axiology: what a source fails to do |
| isExampleOfColonialTravelWriting, quoteFormalRegister, usedNonRestrictiveRelativeClauseForBlacks | linguistic / historiographic meta-analysis of the source itself |

Notice the register shift. numberOfSpearWounds is a fact
a careful extractor might reach.
doesNotExpressSympathyForVictims is a fact about the
author's stance — a second-order, axiological reading that
requires the model to step back from the event to the witness describing
it. That last cluster is worth pausing on: reading a colonial newspaper
account, the engine surfaced predicates about how the account is
written — its register, its grammar, its non-restrictive relative
clauses dehumanizing the victims. These are not facts about the
massacre. They are second-order facts about the source as a
historiographic artifact, the kind of reading a trained historian
or critical linguist performs and a named-entity recognizer cannot even
represent. They emerged because the prompt does not stop at the
document's content; it sweeps the open lens — "what other true,
supported property does this text hold that nothing above named?" — and
lets the model mint a predicate on the spot. That is what "deconstruct
using the whole of human understanding" cashes out to in practice.
The frontier-massacre corpus is where donto's most distinctive
extraction behaviours become visible, because the genre is built on
euphemism and on irreconcilable sources. Counting the second-order
predicates directly in the production corpus
(ctx:genealogy/frontier-massacres/*, 2026-06-03):
| Predicate | Count | What it captures |
|---|---|---|
| reportedIn | 159 | claim attributed to a specific source-as-witness |
| causedBy | 72 | etiology edges (incl. reprisal chains) |
| resultedIn | 72 | causal / consequence edges |
| attestedBy | 66 | claim attributed to a named attester |
| corroboratedBy | 32 | independent source agreement |
| nameDiscrepancyWith | 29 | name disagreement between sources |
| euphemismFor | 21 | verbatim term linked to its decoded meaning |
| certaintyMarker | 16 | "alleged" / "reported" / "said to" as first-class |
| conflictsWith | 13 | contradictory accounts held side by side |
| corroborates | 13 | the inverse corroboration edge |
| countDiscrepancyWith | 13 | numeric disagreement between sources |
| dateDiscrepancyWith | 12 | date disagreement between sources |
| decodedAs | 5 | inferred plain-meaning of figurative language |

Euphemism decoding is the signature move. The
colonial archive does not say killed; it says dispersed,
tumbled down, quietened, a tragedy. The engine captures both — the
verbatim term anchored to its exact span, and a separate
decoded claim flagged inferred (h:true, confidence below
0.9) and linked by euphemismFor / decodedAs.
Real edges from the store:
| Verbatim term (anchored) | Decoded claim (euphemismFor / decodedAs, hypothesis) |
|---|---|
| "dispersal" / "dispersed" | killing or violent dispersal of Aboriginal people |
| "tumble-down" | shot dead / fell dead |
| "made the wild blacks pay for it" | reprisal killings of Aboriginal people |
| "paid the penalty" | killed |
| "punish the offenders" | kill Aboriginal people |
This is the two-layer contract operating inside a single fact: the literal term is captured at full confidence and anchored to the page; the decoding is captured as an explicit hypothesis, lower-confidence, never overwriting the verbatim record. Both survive. The reader downstream can choose to trust the decode or audit it back to the euphemism and the span. The historiography lives in the gap between the two columns.
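As a rough illustration of that pairing, the sketch below emits the verbatim term and its decoding as two separate claims that share one span. Everything here is an assumption for illustration: the `Claim` shape, the `capture_euphemism` helper, the `verbatimTerm` predicate name, the IRI, the default confidence of 0.6, and the sample sentence are all hypothetical. Only `euphemismFor`, the sub-0.9 confidence rule, and the term/decoding pair come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    confidence: float
    hypothesis: bool
    span: Tuple[int, int]  # character offsets of the anchoring text


def capture_euphemism(text: str, term: str, decoded: str,
                      term_iri: str, decoded_confidence: float = 0.6) -> List[Claim]:
    """Emit the verbatim term and its decoding as two separate claims."""
    if not decoded_confidence < 0.9:
        raise ValueError("decoded claims stay below 0.9 confidence")
    start = text.find(term)
    if start < 0:
        raise ValueError("term not found; refusing to emit an unanchored claim")
    span = (start, start + len(term))
    # The literal term: full confidence, not a hypothesis.
    verbatim = Claim(term_iri, "verbatimTerm", term, 1.0, False, span)
    # The decoding: an explicit hypothesis that never replaces the verbatim claim.
    decoding = Claim(term_iri, "euphemismFor", decoded, decoded_confidence, True, span)
    return [verbatim, decoding]


# Invented sample sentence; the quoted term and its decoding are from the table above.
sentence = 'The report says the offenders "paid the penalty" that night.'
verbatim, decoding = capture_euphemism(
    sentence, "paid the penalty", "killed", "ex:term/paid-the-penalty")
print(sentence[verbatim.span[0]:verbatim.span[1]])  # paid the penalty
print(decoding.predicate, decoding.obj, decoding.hypothesis)  # euphemismFor killed True
```

Because both claims carry the same span, a reader who distrusts the decoding can walk straight back to the words on the page.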
Contradiction is captured, never resolved. The genre's sources disagree violently on body counts, and the engine records the disagreement as data:

    ex:attack-nmp-rannes-1855 · countDiscrepancyWith · "Sources give death tolls of 2, 3, 5, 1, 3, 12"

The most striking product of "no authority is ground truth" is the
cross-source count ladder. Ten witnesses give ten
different tolls for the same event, and the agent emits each as its own
anchored claim — 1 ("killing a trooper"), 2
("murdering two of them"), 3 ("kill three on the spot"),
4 ("Four out of the five…"), 5 ("five of the six
troopers killed"), … 12 ("they killed twelve native police").
No reconciliation, no winner. Every count held side-by-side as legal
bitemporal state, wired together with countDiscrepancyWith
edges across the specific sources — the paraconsistent behaviour the
substrate exists to support, produced at extraction time rather
than discarded there.
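A count ladder of this kind can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the engine's code: `count_ladder` and the `reportedDeathCount` predicate are hypothetical names, and the source ids are placeholders because the passage does not name the sources. The counts, quotes, event IRI, and `countDiscrepancyWith` come from the text.

```python
from itertools import combinations

EVENT = "ex:attack-nmp-rannes-1855"

# (source id, count, anchoring quote). Quotes and counts are from the text;
# the source ids are placeholders, since the passage does not name them.
attestations = [
    ("src-a", 1, "killing a trooper"),
    ("src-b", 2, "murdering two of them"),
    ("src-c", 3, "kill three on the spot"),
    ("src-d", 12, "they killed twelve native police"),
]


def count_ladder(event, attestations):
    """Emit one claim per attested count plus pairwise discrepancy edges."""
    claims = [
        {"subject": event, "predicate": "reportedDeathCount", "object": n,
         "source": src, "anchor": quote}
        for src, n, quote in attestations
    ]
    edges = [
        (a_src, "countDiscrepancyWith", b_src)
        for (a_src, a_n, _), (b_src, b_n, _) in combinations(attestations, 2)
        if a_n != b_n  # agreeing sources get no discrepancy edge
    ]
    return claims, edges


claims, edges = count_ladder(EVENT, attestations)
print(len(claims), len(edges))  # 4 6
```

Nothing in the function ranks or reconciles the counts; it only records each one and the fact that they differ.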
Historiographic meta-facts the text only implies.
This is the deepest reach, and the hardest to get any other way.
(Provenance: the five meta-facts in this block are live in the deep
re-ingest context ctx:test/ingest-verify/10690 — the
2,334-fact run flagged in §7 — not in the thinner production pass of the
same event. The object values are quoted verbatim from the store.)
Having laid down ten conflicting death counts and tracked how the event
is named across its sources — the
eventTerminologyBySource claims pair each term with the
witness who used it ("murder (Leith Hay and Holt)",
"outrage (Empire Oct 15)",
"affray (Empire Oct 15)",
"slaughter (De Satge 1901)",
"slaughter (Queenslander 1892)",
"massacre (Morning Bulletin 1912)"), so the hardening of
the language over the decades is recoverable by query rather than
asserted as a gloss — the agent steps up a level and asserts properties
of the evidentiary record itself:

    deathCountTrendOverTime → "increasing (2 in 1855 to 12 in 1913)" (h:true)
    interSourceDisagreementOnDeathCount → "severe (1-12)"
    interSourceDisagreementOnTimeOfDay → "night vs morning"
    interSourceDisagreementOnOfficerIdentity → "Robert George Walker vs Henry Walker vs Percy Walker"
    narrativeExaggerationOverTime → true

No sentence in any source says "the reported death toll rose over
time." The agent inferred the trend from the attestation graph it
had just built — a second-order historiographic claim, correctly
flagged as inference, anchored to the structure of the testimony rather
than to any single span. That is the payoff of reading as a historian
and a statistician and an epistemologist simultaneously: the source's
own self-contradiction becomes a measured, queryable property. The same
totality produces reprisal causation chains (causedBy /
resultedIn linking an initial killing to the punitive
expedition that followed, with self-minted predicates the production
store actually holds — reprisalFor,
provokedReprisal, deathTriggeredReprisal,
sisterReprisalEvent, ledReprisalFor,
causedReprisal, followedByReprisal, and dozens
more variants the alignment engine will fold later) and cross-event
prosopography — the same named trooper or squatter recognized across
separate documents and linked, so that a person becomes a thread running
through events extracted independently. None of this is reachable by
reading one clause at a time for entities. All of it falls out of
reading the whole source, repeatedly, through every lens at once.
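To make the inference step concrete, here is a minimal sketch of how a trend claim could be derived from dated attestations instead of from any single span. In the essay the agent itself makes this inference; the deterministic `derive_meta_facts` helper below is a hypothetical stand-in, its first-versus-last comparison is a deliberately crude notion of trend, and flagging every derived claim as a hypothesis is an assumption of the sketch. Only the two dated endpoints quoted in the text are used as input.

```python
def derive_meta_facts(dated_counts):
    """Derive second-order claims from (year, death count) attestations."""
    if len(dated_counts) < 2:
        return []  # one attestation cannot disagree with itself
    ordered = sorted(dated_counts)
    (first_year, first), (last_year, last) = ordered[0], ordered[-1]
    counts = [n for _, n in ordered]
    low, high = min(counts), max(counts)
    facts = []
    if low != high:
        facts.append({"predicate": "interSourceDisagreementOnDeathCount",
                      "object": f"{low}-{high}", "hypothesis": True})
    if first != last:
        # Crude trend: compares only the earliest and latest attestation.
        direction = "increasing" if last > first else "decreasing"
        facts.append({"predicate": "deathCountTrendOverTime",
                      "object": f"{direction} ({first} in {first_year} to {last} in {last_year})",
                      "hypothesis": True})
    return facts


# Only the two dated endpoints quoted in the text are used here.
for fact in derive_meta_facts([(1855, 2), (1913, 12)]):
    print(fact["predicate"], "→", fact["object"])
```

With only the endpoints as input the range comes out as 2-12; the store's "1-12" reflects the other attested counts, whose dates this passage does not give.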
The single-span fan-out. To see the lenses working
in concert, watch one clause from a real run, "the blacks had been
unwisely employed", fan into 16 distinct facts. Six of them: a social-ontology
employment relation (statesBlacksEmployedOnStation), a
causal/etiological one (backgroundCause,
enabledByAboriginalEmploymentOnStation), an epistemic one
(insiderKnowledgeEnabledBy "employment of Aboriginal people at station"),
an axiological/framing one
(framedAboriginalPeopleAs "blacks"), and an attestation one
(nmp-corps criticisedBy de-satge-1901). One nine-word
clause, read by six experts, yields six different true relations.
Multiply that across every clause and the per-word fact density — more
than one fact per word, on the densest events — stops looking
implausible and starts looking inevitable.
Total extraction sounds like a fantasy that would have been laughed out of any knowledge-engineering meeting for the last sixty years — and it would have been, because it was unaffordable and there was nowhere to put the output. Both constraints just lifted.
Generation flipped from the scarce step to the cheap step. For sixty years the binding constraint on every knowledge system was generation. Cyc paid knowledge engineers per assertion. Formal concept analysis required the attributes declared up front. Literature-based discovery rode co-occurrence statistics because actually reading and typing the literature was unaffordable. When generation is the scarce step, the rational extraction policy is minimalism: pull the one canonical entity and its three obvious relations, because every additional fact costs a human minute you cannot spare. That policy is now exactly backwards. A guided frontier LLM emits a justifiable, evidence-anchored claim for on the order of $10^{-4}$ — a hundredth of a cent. The public record makes the point at scale: GPTKB materialized 105M typed triples over 2.9M entities at $0.00009 per correct triple, inventing its own predicate axes as it went, with 69.5% of the entities it described absent from Wikidata; AutoSchemaKG induced a 900M-node, 5.9B-edge graph from 50M documents with zero predefined schema, the extractor inventing every type on the fly while preserving 93–97% of source information. When emitting one more true fact costs a hundredth of a cent, the cost of omitting it — the slice of the source you can never recover because nobody recorded it — is the expensive mistake. The economics did not merely permit total extraction; they flipped the optimum from minimal to maximal.
| Era | Cost to emit one typed claim | Scarce resource | Rational extraction policy |
|---|---|---|---|
| Knowledge-engineering (Cyc, FCA) | ~minutes of expert labor | generation | minimal — pull the canonical few, discard the rest |
| Frontier LLM (now) | ~$10^{-4}$ | holding & deciding | maximal — emit every justifiable claim, decide later |
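
To put the per-claim price in concrete terms, the arithmetic below applies the ~$10^{-4}$ order-of-magnitude figure to two claim counts quoted in this essay. This is a back-of-envelope sketch: it assumes the price applies uniformly to every claim, and it is not a report of what those runs actually cost.

```python
COST_PER_CLAIM_USD = 1e-4  # order-of-magnitude figure from the text


def emission_cost(n_claims, cost_per_claim=COST_PER_CLAIM_USD):
    """Estimated generation cost in USD for n_claims."""
    return n_claims * cost_per_claim


for label, n in [("richest single run", 2_334), ("production corpus", 32_672)]:
    print(f"{label}: {n:,} claims ≈ ${emission_cost(n):.2f}")
# richest single run: 2,334 claims ≈ $0.23
# production corpus: 32,672 claims ≈ $3.27
```

At this price the whole production corpus costs a few dollars to emit, which is why omission, not generation, becomes the expensive choice.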
But cheapness alone does not make total extraction sane. The objection to maximalism was never really cost; it was that a maximal firehose drowns the systems that have to receive it — and under the standard storage targets, that objection is correct. A vector database collapses every emission to a single embedding; when a new fact conflicts with a stored one, the closer vector wins and the other is silently lost. A normal knowledge graph — and even 2025's best agent-memory graphs — enforces single-truth at write time: Zep/Graphiti uses an LLM to detect contradicting edges and invalidate the overlapping one; Mem0's update step overwrites. Pour a multi-perspectival firehose into any of these and you get the worst of both worlds: you pay to generate 2,000 facts about a source, then the store's collapse machinery throws most of them away at ingest — and it throws away exactly the minority, speculative, and contradictory claims that are the raw material of discovery. The decoded euphemism, held at lower confidence beside the verbatim term, is precisely the kind of claim a collapsing store deletes as a near-duplicate. Under those assumptions, the knowledge engineer's minimalism was not timidity. It was a correct adaptation to a downstream that punishes abundance.
So the real precondition for total extraction was never the price of a token. It was a substrate that does not collapse. donto's answer is the two-layer contract, and it is the only thing that makes maximal capture safe.
Layer 1, extraction, maximizes: it emits every justifiable claim. Layer 2, the substrate, gates: typing, alignment, identity resolution, deduplication, and promotion from hypothesis_only to a ranked, supported claim are all
deferred to where they can be done well, with the full corpus in view,
reversibly, and per use. Identity is a hypothesis resolved at query
time, never a merge committed at ingest. Predicate alignment is a typed,
scoped, query-time operation (align the jobs context to ESCO/O*NET when
you query it), not a write-time bottleneck forcing the
extractor to guess a canonical form it cannot yet know.

| | Layer 1: Extraction | Layer 2: Substrate (query/promotion time) |
|---|---|---|
| Mandate | MAXIMIZE: emit every justifiable claim | GATE: type, align, resolve, dedup, promote |
| Invariant | claim = subject / predicate / object + evidence anchor | bitemporal, paraconsistent, evidence-first hold |
| Predicates | free, self-invented, camelCase, multi-directional | aligned per-context to taxonomies on query |
| Contradiction | emit each side + conflictsWith, never reconcile | held forever as legal state; re-ranked, not deleted |
| Identity | mint distinct IRIs + likelySameAs, never pre-merge | resolved as a query-time hypothesis, reversibly |
| Reversibility | n/a: emits, never deletes | every decision non-destructive and re-runnable |

The extractor is forbidden from making the irreversible decisions — it never picks a winner between two reported death counts, never merges two spellings of a name, never decides which framing is "true." It emits each, then adds claims linking them. Every collapse a vector DB or normal KG performs at write time is, in donto, deferred, non-destructive, and per-query. That is what makes maximal emission safe: nothing the extractor emits forecloses a later decision, because the extractor is structurally barred from foreclosing.
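The division of labor can be sketched as an append-only store whose only selective operation is a query. This is a toy model under stated assumptions, not donto's substrate: `AppendOnlyStore`, the `Claim` shape, the `reportedDeathCount` predicate, and the confidence values are hypothetical, and the real substrate's bitemporal and alignment machinery is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: object
    confidence: float
    hypothesis: bool = False


class AppendOnlyStore:
    """Layer 1 may only append; Layer 2 decides per query and deletes nothing."""

    def __init__(self):
        self._claims = []

    def emit(self, claim):
        self._claims.append(claim)  # no overwrite, no merge, no winner

    def query(self, subject, predicate, min_confidence=0.0, include_hypotheses=True):
        rows = [c for c in self._claims
                if c.subject == subject and c.predicate == predicate
                and c.confidence >= min_confidence
                and (include_hypotheses or not c.hypothesis)]
        return sorted(rows, key=lambda c: c.confidence, reverse=True)

    def __len__(self):
        return len(self._claims)


store = AppendOnlyStore()
event = "ex:attack-nmp-rannes-1855"
# Two conflicting tolls held side by side; the confidences are invented.
store.emit(Claim(event, "reportedDeathCount", 2, 0.95))
store.emit(Claim(event, "reportedDeathCount", 12, 0.95))
store.emit(Claim(event, "narrativeExaggerationOverTime", True, 0.7, hypothesis=True))

print([c.obj for c in store.query(event, "reportedDeathCount")])  # [2, 12]
print(store.query(event, "narrativeExaggerationOverTime", include_hypotheses=False))  # []
print(len(store))  # 3
```

The gate changes what a caller sees, never what the store holds: a stricter query returns fewer rows while the claim count stays at three.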
There is even a theorem underneath this. Model collapse, a model or knowledge
base degrading as it feeds on its own output, arises under
replacement (train on synthetic, discard real → error grows).
Under accumulation (keep all real and synthetic data forever)
error stays bounded independent of iteration count, a result
Gerstgrasser et al. (2024) prove for a linear-model setting and support
empirically on generative models. donto is an accumulation system by
construction: it never overwrites, never destructively dedups, keeps provenance and
counter-evidence on every claim. The contradiction-preserving substrate
is not merely compatible with a maximal firehose; it is the
provably-safe container for it. And it stays usable while doing so:
POST /search already ranks across the full ~39.5M-statement
substrate in 270–820ms, stopwords included. Abundance is only worth
generating if it stays queryable, and that part is built. And this is
the right place to be explicit about the north star: the substrate is
the goal, and total extraction exists to test it — a maximal,
euphemism-decoding, contradiction-emitting firehose is precisely the
load that exercises the substrate's hardest invariants (paraconsistent
hold, no destructive overwrite, identity-as-hypothesis, defer-joining).
The extraction engine earns its keep not by being clever but by handing
the substrate exactly the kind of input that would break anything that
collapses.
How do you know you got "everything"? You don't, and the intellectual honesty of the method begins with admitting that completeness is unfalsifiable. You cannot prove a source has been fully read; there is no oracle that returns the true cardinality of an unbounded property space. So we do not chase a completeness proof. We chase saturation and faithfulness, and we instrument both.
The limits are real and worth stating without flinching. The loop's stopping rule is the model's judgment, and a model that tires or satisfices will declare a source dry too early — saturation is a proxy, not a proof. Decoded euphemism and historiographic meta-facts are hypotheses by construction, and some will be wrong; the design's defense is not that they are always right but that they are marked as inference, down-weighted, and reversible, never allowed to masquerade as stated fact. The 80%-singleton tail, the signature of abundance, is also a deferred bill: those predicates must eventually be folded by the alignment engine for a query that needs them folded, and that engine is still being built. And the GLM-coding-subscription path the engine runs on today is a TOS-risky, expiring subsidy — which is exactly why provider is a swappable abstraction, not a hardcoded dependency. None of these undercut the thesis; they locate the work that remains.
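The loop and the swappable provider can both be sketched briefly. This is a simplified sketch under loud assumptions: the `Provider` protocol, `extract_until_dry`, and `ScriptedProvider` are hypothetical names, and the stopping rule used here (a pass that yields no unseen claim) is a mechanical proxy standing in for the model's own judgment that the source is dry, which is the rule the essay describes.

```python
from typing import Iterable, Protocol, Set, Tuple

Claim = Tuple[str, str, str]  # (subject, predicate, object)


class Provider(Protocol):
    """Any model backend that can propose claims for a source."""

    def extract(self, source: str, known: Set[Claim]) -> Iterable[Claim]: ...


def extract_until_dry(provider: Provider, source: str, max_passes: int = 10):
    """Re-read the source until a fresh pass adds nothing new."""
    known: Set[Claim] = set()
    passes = 0
    while passes < max_passes:
        passes += 1
        fresh = set(provider.extract(source, known)) - known
        if not fresh:
            break  # saturation proxy: this pass surfaced nothing unseen
        known |= fresh
    return known, passes


class ScriptedProvider:
    """Stand-in for an LLM backend: replays canned passes, then returns nothing."""

    def __init__(self, scripted_passes):
        self._passes = iter(scripted_passes)

    def extract(self, source, known):
        return next(self._passes, [])


a = ("ev", "causedBy", "x")
b = ("ev", "reportedIn", "y")
c = ("ev", "euphemismFor", "z")
claims, passes = extract_until_dry(ScriptedProvider([[a, b], [b, c], [c]]), "source text")
print(len(claims), passes)  # 3 3
```

Because the loop depends only on the `Provider` protocol, changing backends means supplying another object with an `extract` method; the `max_passes` cap guards against a provider that never stops producing novelty.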
What total extraction asks for is a change of stance. The conventional extractor approaches a source with a schema and asks, which of my slots does this fill? — and the source's answer is mostly silence, because most of what a source knows was never given a slot. The donto extractor approaches the same source with the whole of human analytical understanding loaded into a single guided mind, and asks the opposite question, again and again until the asking stops paying: what else is true here? what would a logician see that a historian missed? a jurist that a linguist missed? what is this word concealing, what caused this event, where do these two witnesses disagree, and what have I not yet looked at? It reads the sentence as all of them at once, mints the precise predicate each reading demands, anchors what the source states and fences what it merely implies, and keeps re-reading until a fresh sweep from every angle turns up nothing new.
The result is not a tidy record. On the richest single event it is 2,334 evidence-anchored claims drawn from a knot of contradictory sources, in a vocabulary that is roughly four-fifths singletons, holding the verbatim euphemism beside its decoding and every witness beside the witnesses it contradicts — and the broader production corpus repeats the shape, more conservatively, across two dozen events. That is what it looks like to get every single knowable thing out of a thing, and to put it somewhere that can hold it all. The opening claim, now earned: a source is not a bag of facts to be emptied but a lit-as-far-as-you-light-it space, and almost every extractor built before this one carried a lantern that showed a single wall.
For sixty years we extracted thin because we had to. We do not have to anymore. The well is deep, the bucket is now nearly free, and the only remaining question — the one the systems paper answers — is where to put the water once you finally stop refusing to draw it.