2026-06-03
donto's extraction engine was built and tuned almost entirely on one
domain: contested colonial-frontier genealogy. A fair question is
whether its polyperspectival lens sweep — the technique of
guiding an agent to deconstruct a source through many ontological
directions at once — is a genealogy trick or a general method. This note
reports a small, deliberately out-of-domain test: extracting Rodrigo
Copetti's PlayStation Architecture (a 7,273-word hardware
deep-dive) and its 25-comment Hacker News thread, run side by side at
concurrency 2, through the current engine — the one carrying
the recent prompt fixes (rdf:type over
rdfType, sameAs over owl:sameAs,
no numbered-suffix predicates). The result: 1,613
evidence-anchored statements over ~700 distinct predicates from
~8,150 words of source, with the prompt fixes holding and the lens sweep
firing as richly on DMA controllers and forum opinions as it ever did on
massacre depositions. It also surfaced one honest defect —
predicate-style drift on dense technical prose — which turns out to be
the exact mess the query-time alignment fabric exists to fold. The test
supports both halves of the thesis: maximal extraction generalizes, and
it is meant to be tamed downstream, not at the source.
Two sources, one substrate, opposite registers:
- Article: https://www.copetti.org/writings/consoles/playstation/, a dense, diagram-heavy technical history of the PS1 (CPU, GTE, GPU, VRAM layout, the CD subsystem, the famous texture-warping). 7,273 words. Context ctx:test/hn-playstation/article.
- Thread: the article's 25-comment Hacker News thread. 882 words. Context ctx:test/hn-playstation/thread.

Both were submitted to POST /jobs/extract (mode
opencode, the GLM lens-sweep) and run concurrently,
two at a time — the same durable Temporal pipeline used for the
genealogy corpus. Neither source has anything to do with genealogy; the
extraction prompt (extract_broad.txt) is domain-neutral by
construction and was given no hint about hardware or forums.
| source | words | statements | distinct predicates | distinct subjects | span-anchored |
|---|---|---|---|---|---|
| Article | 7,273 | 1,252 | 634 | 395 | 821 (66%) |
| Thread | 882 | 361 | 93 | — | 111 (31%) |
| Total | 8,155 | 1,613 | ~700 | — | 932 |
Roughly one evidence-anchored statement per five source words (1,613 across 8,155), and a new predicate is minted for about every two statements: 634 distinct predicates across 1,252 article statements. That ratio is the signature of the lens sweep doing its job: each entity and clause is interrogated from many directions, so the predicate space fans out rather than collapsing onto a handful of reused relations.
The anchoring rates are instructive and correct: the article (dense, directly-stated fact) anchors 66% of its claims to exact source spans; the thread (opinion, inference, second-order commentary) anchors 31%, with the remainder legitimately marked as inferred/hypothesis (an opinion or a decoded intent has no single licensing span). The engine is not hallucinating spans for claims that don't have them — it is honestly distinguishing the stated from the inferred.
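The ratios quoted above can be re-derived from the table. The following is a minimal sketch that only redoes the arithmetic; the figures are copied from the table, while the dictionary layout and the `summarize` helper are illustrative names, not part of donto.

```python
# Figures copied from the summary table; the structure is illustrative only.
rows = {
    "article": {"words": 7273, "statements": 1252, "span_anchored": 821},
    "thread": {"words": 882, "statements": 361, "span_anchored": 111},
}

def summarize(row):
    """Return (source words per statement, span-anchored share in percent)."""
    words_per_statement = row["words"] / row["statements"]
    anchored_pct = 100 * row["span_anchored"] / row["statements"]
    return round(words_per_statement, 2), round(anchored_pct)

# Column-wise totals across both sources.
total = {
    key: sum(row[key] for row in rows.values())
    for key in ("words", "statements", "span_anchored")
}

for name, row in {**rows, "total": total}.items():
    print(name, summarize(row))
```

Run as written, this gives about 5.06 source words per statement overall, and span-anchored shares of 66% for the article and 31% for the thread, matching the table.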
The recent prompt fixes were tuned against genealogy; the test is whether they survive a domain shift.
| check | article | thread | verdict |
|---|---|---|---|
| rdf:type (canonical) vs rdfType | 172 / 0 | 227 / 0 | ✅ held |
| sameAs vs owl:sameAs | — | 2 / 0 | ✅ held |
| numbered-suffix predicates (layer1, operand2…) | 1 | 2 | ✅ near-eliminated |
The corrected prompt steers the model to the canonical CURIE form and away from positional pseudo-arrays regardless of domain — the fixes are properties of the prompt, not artifacts of the training corpus.
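A check of this kind is easy to reproduce over any list of extracted predicates. The sketch below is one hypothetical way to tally the three conditions in the table; the `audit` function, the counter labels, and the numbered-suffix regex are assumptions made for illustration, not donto's actual tooling.

```python
import re
from collections import Counter

# Heuristic (an assumption): a letter followed by trailing digits, e.g. layer1, operand2.
NUMBERED_SUFFIX = re.compile(r"[A-Za-z]\d+$")

# Non-canonical spellings mapped to the canonical form the prompt asks for.
CANONICAL = {"rdfType": "rdf:type", "owl:sameAs": "sameAs"}

def audit(predicates):
    """Tally canonical, non-canonical, and numbered-suffix predicates."""
    counts = Counter()
    for predicate in predicates:
        if predicate in CANONICAL:
            counts["noncanonical:" + predicate] += 1
        elif predicate in CANONICAL.values():
            counts["canonical:" + predicate] += 1
        elif NUMBERED_SUFFIX.search(predicate):
            counts["numbered-suffix"] += 1
    return counts

# Illustrative input only; not data from the run.
sample = ["rdf:type", "rdf:type", "sameAs", "hasWordSize", "operand2"]
print(audit(sample))
```

A "held" verdict in the table corresponds to every `noncanonical:` count being zero.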
The whole bet is that an LLM, properly guided, can read one source as a logician, a mereologist, a historian, and a domain expert simultaneously. On hardware and forum text it clearly does. A sample of the predicates minted shows the ontological breadth:
- hasDMAController, hasInternalFIFOBuffer, hasVideoRAMSize, hasWordSize, connectsRemainingComponentsAndIO, graphics-chipset includes gpu
- bitmapSizePixels, polygonsPerSecondMultiplier, main-ram size-reserved-unit KB
- performComputations, processedDataSentTo, transferred-via, perspective-transformation uses camera-perspective
- gte identifiedAs CP2 (the Geometry Transformation Engine is coprocessor 2, which is exactly right), usedAs, mayBeBlendedWith
- manufacturedBy, initially-targeting, before-expanding-into
- respondsTo, contrastsDesignPriorities, postedBy, citesSource, findsImpressive
- expressesOpinionChangeOverTime, observesCulturalTrend, acquiredKnowledgeAbout
- commentCount, footnoteMarker, typoType

Two captures are worth dwelling on, because they are the esoteric depth the method is supposed to reach:
- undersampling-behaviour noticeable-when-rendering-geometry-located-far-from camera, and edges that "make sudden jumps when moved slightly": the engine independently described the PS1's notorious affine texture-warping and vertex jitter from prose, as discrete properties of the graphics pipeline.
- usesBitwiseOperation → or-operation and or-operation operand2 hex-value-80000000h: it read a casual programming aside and extracted the actual bitwise operation and its operand. It even logged a typo in a comment as a fact (typo-pixel-are-t contains pixel-art-wasnt).

This is not entity-and-a-relation extraction. It is the source taken apart from every angle the text supports, on a domain the engine had never seen.
Maximal extraction is messy, and this run shows where. On the dense
technical prose, the model drifted from the prescribed
camelCase toward verbose kebab-case predicates that pack a
clause into the predicate name:
| predicate style (article, distinct) | count |
|---|---|
| clean camelCase (hasDMAController) | 230 |
| kebab-with-dashes (made-use-of) | 292 |
| long (>30 chars, clause-like) | 77 |
Examples of the bad tail:
unable-to-render-anything-decent-if,
remaining-vram-can-be-used-to-store,
noticeable-when-rendering-geometry-located-far-from. These
are sentence fragments wearing a predicate's clothes — the same
anti-pattern the prompt warns against for objects ("never pack
a whole sentence into the object") leaking into the predicate.
They are almost all singletons, they bloat the predicate tail, and they
will never be reused verbatim. This is a real quality regression on
technical content relative to the cleaner genealogy output, and it
argues for one more prompt reinforcement: predicates are short,
reusable camelCase relations; push the specifics into
subject, object, and qualifier claims, never into the predicate
string.
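The three style buckets in the table can be approximated with a few lines of string logic. This is a rough sketch, not the script used for the counts above: the camelCase regex, the `classify` function, and the choice to treat "long" as a flag independent of case style are all assumptions; only the 30-character threshold and the example predicates come from the text.

```python
import re

LONG_THRESHOLD = 30  # ">30 chars" from the table

# Lowercase start, then zero or more capitalised segments (matches hasDMAController).
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)*$")

def classify(predicate):
    """Return (style, is_long) for one predicate string."""
    if "-" in predicate:
        style = "kebab"
    elif CAMEL.match(predicate):
        style = "camel"
    else:
        style = "other"  # e.g. prefixed forms such as rdf:type
    return style, len(predicate) > LONG_THRESHOLD

for p in ["hasDMAController", "made-use-of", "unable-to-render-anything-decent-if"]:
    print(p, classify(p))
```

The last example is flagged both kebab and long, which is the clause-like tail described above.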
Here is the part that makes the defect interesting rather than damning. donto's design contract is maximize at extraction, gate (align, type, dedup) at query time. A bloated, drifting predicate tail is precisely the input the continuous alignment fabric is built to fold. Asked for the semantic neighbours of one of this run's predicates, the engine already knows what to do:
manufacturedBy → manufactured (0.891), isManufacturedBy (0.901),
manufacturedFrom (0.916), donto:manufacturedBy (1.000)
The manufactured* family — minted independently across
runs and domains — collapses into one queryable cluster by
meaning, with no hand-maintained synonym list, and (after LLM
adjudication) with the correct relation type rather than a blind merge.
The 292 kebab predicates and 77 clause-fragments are not garbage to be
prevented at the source; they are abundance to be reconciled downstream.
A system that forced clean predicates at extraction time would have
re-imposed the very scarcity the abundance thesis escapes — and would
have thrown away the vertex-jitter and
bitwise-operand captures along the way, because those came
out of the same uninhibited sweep.
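To make the query-time fold concrete, here is a minimal sketch of predicate expansion using the neighbour scores quoted above. It assumes the scores are similarities in which 1.000 means an exact match. The `expand` and `query` functions, the 0.9 threshold, and the placeholder triples are invented for illustration, and the LLM adjudication step that assigns a relation type is not modelled.

```python
# Neighbour scores copied from the example above (assumed: higher = closer).
NEIGHBOURS = {
    "manufacturedBy": {
        "donto:manufacturedBy": 1.000,
        "manufacturedFrom": 0.916,
        "isManufacturedBy": 0.901,
        "manufactured": 0.891,
    }
}

def expand(predicate, threshold=0.9):
    """Return the predicate plus every neighbour scoring at or above threshold."""
    scored = NEIGHBOURS.get(predicate, {})
    return {predicate} | {name for name, score in scored.items() if score >= threshold}

def query(statements, predicate, threshold=0.9):
    """Select (subject, predicate, object) triples whose predicate is in the cluster."""
    wanted = expand(predicate, threshold)
    return [triple for triple in statements if triple[1] in wanted]

# Placeholder triples, invented purely for illustration.
statements = [
    ("thing-a", "manufacturedBy", "maker-x"),
    ("thing-b", "isManufacturedBy", "maker-y"),
    ("thing-c", "manufactured", "maker-z"),
    ("thing-d", "hasWordSize", "32"),
]
print(query(statements, "manufacturedBy"))
```

A bare threshold would also admit manufacturedFrom, which is close in embedding space but is a different relation. That is the blind merge the adjudication step exists to avoid.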
So the test cuts both ways, exactly as the vision predicts: the extraction is gloriously, usefully messy, and the substrate is the thing that makes that mess pay off.
A genealogy-tuned engine, pointed at PlayStation hardware and a forum thread it had never seen, produced 1,613 anchored claims across ~700 predicates from many ontological lenses, kept its recent prompt fixes intact, and reached real esoteric depth (texture-warping, bitwise decomposition, opinion-change-over-time) — while exposing one honest defect (predicate-style drift) that the alignment fabric is purpose-built to absorb. The lens sweep is not a genealogy trick. It is a general method for taking a thing apart through the whole of human understanding, and it travels.
Two follow-ups fall out of this run: (1) a small prompt reinforcement
on predicate hygiene (short reusable camelCase, no
clause-packing); (2) once the predicate embedding backfill completes,
let the continuous alignment engine fold this run's 700 fresh predicates
— and measure how much of the kebab tail collapses into the existing
camelCase clusters. Both are now queued.
Companion pieces: Total Extraction (the method), The Embedding Fabric (the downstream fold), and the donto systems paper.