Does the Lens Sweep Generalize? A Cross-Domain Extraction Test

2026-06-03

Abstract

donto's extraction engine was built and tuned almost entirely on one domain: contested colonial-frontier genealogy. A fair question is whether its polyperspectival lens sweep (the technique of guiding an agent to deconstruct a source through many ontological directions at once) is a genealogy trick or a general method. This note reports a small, deliberately out-of-domain test: extracting Rodrigo Copetti's PlayStation Architecture (a 7,273-word hardware deep-dive) and its 25-comment Hacker News thread, run side by side at concurrency 2 through the current engine, the one carrying the recent prompt fixes (rdf:type over rdfType, sameAs over owl:sameAs, no numbered-suffix predicates). The result: 1,613 statements (932 of them anchored to exact source spans) over ~700 distinct predicates from ~8,150 words of source, with the prompt fixes holding and the lens sweep firing as richly on DMA controllers and forum opinions as it ever did on massacre depositions. It also surfaced one honest defect, predicate-style drift on dense technical prose, which turns out to be the exact mess the query-time alignment fabric exists to fold. The test supports both halves of the thesis: maximal extraction generalizes, and it is meant to be tamed downstream, not at the source.

1. The setup

Two sources, one substrate, opposite registers:

The article — https://www.copetti.org/writings/consoles/playstation/, a dense, diagram-heavy technical history of the PS1 (CPU, GTE, GPU, VRAM layout, the CD subsystem, the famous texture-warping). 7,273 words. Context ctx:test/hn-playstation/article.
The thread — the 25-comment HN discussion (id 48382142), a casual mix of nostalgia, technical correction, and source-citing. 882 words. Context ctx:test/hn-playstation/thread.

Both were submitted to POST /jobs/extract (mode opencode, the GLM lens-sweep) and run concurrently, two at a time — the same durable Temporal pipeline used for the genealogy corpus. Neither source has anything to do with genealogy; the extraction prompt (extract_broad.txt) is domain-neutral by construction and was given no hint about hardware or forums.
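
The submission pattern described above (two jobs, run two at a time) can be sketched as follows. This is a minimal illustration, not the actual client: the contexts and mode come from the text, but the payload field names and the `submit` callable are assumptions, and the HTTP call is replaced by a stand-in so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Contexts and mode come from the text above; the payload field names
# ("label", "context", "mode") are assumptions, not the documented schema.
JOBS = [
    {"label": "article", "context": "ctx:test/hn-playstation/article", "mode": "opencode"},
    {"label": "thread", "context": "ctx:test/hn-playstation/thread", "mode": "opencode"},
]


def run_extractions(
    jobs: list[dict],
    submit: Callable[[dict], dict],
    concurrency: int = 2,
) -> list[dict]:
    """Run submit(job) for every job, at most `concurrency` at a time.

    `submit` is a hypothetical callable that would POST the payload to
    /jobs/extract and return the service's response as a dict.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in the order of `jobs`, not completion order.
        return list(pool.map(submit, jobs))


def fake_submit(job: dict) -> dict:
    """Stand-in for the real HTTP call so the sketch runs offline."""
    return {"context": job["context"], "status": "accepted"}


results = run_extractions(JOBS, fake_submit)
```

Passing the submit function in as a parameter keeps the concurrency logic separate from whatever transport the real pipeline uses.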

2. The numbers

source	source words	statements	distinct predicates	distinct subjects	span-anchored
Article	7,273	1,252	634	395	821 (66%)
Thread	882	361	93	—	111 (31%)
Total	8,155	1,613	~700	—	932

Roughly one statement per five source words (8,155 words over 1,613 statements), and on the article a new predicate is minted for about every second statement: 634 distinct predicates across 1,252 statements. That ratio is the signature of the lens sweep doing its job: each entity and clause is interrogated from many directions, so the predicate space fans out rather than collapsing onto a handful of reused relations.
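
The density figures can be rechecked directly from the table. The sketch below only redoes the arithmetic on the published counts; the variable and function names are illustrative.

```python
# Counts copied from the table in this section.
ARTICLE = {"words": 7273, "statements": 1252, "predicates": 634}
THREAD = {"words": 882, "statements": 361, "predicates": 93}


def ratio(numerator: int, denominator: int) -> float:
    """Plain division with an explicit guard against an empty run."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


total_words = ARTICLE["words"] + THREAD["words"]
total_statements = ARTICLE["statements"] + THREAD["statements"]

words_per_statement = ratio(total_words, total_statements)
article_statements_per_predicate = ratio(ARTICLE["statements"], ARTICLE["predicates"])

print(f"{words_per_statement:.2f} source words per statement")  # 5.06
print(f"{article_statements_per_predicate:.2f} article statements per distinct predicate")  # 1.97
```

Distinct predicates are not summed across the two sources, because the table reports the combined figure only approximately (~700) and the two predicate sets may overlap.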

The anchoring rates are instructive, and they fall the way the two registers predict: the article (dense, directly stated fact) anchors 66% of its claims to exact source spans; the thread (opinion, inference, second-order commentary) anchors 31%, with the remainder legitimately marked as inferred/hypothesis (an opinion or a decoded intent has no single licensing span). The engine is not hallucinating spans for claims that don't have them; it is honestly distinguishing the stated from the inferred.
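
One way to make "span-anchored" checkable is to require that a statement's recorded offsets reproduce its quoted text exactly. The sketch below assumes a hypothetical statement shape (character offsets plus a quote); donto's real record format is not described in this note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    """Hypothetical statement record; the field names are assumptions."""
    subject: str
    predicate: str
    obj: str
    span: Optional[tuple[int, int]] = None  # character offsets into the source
    quote: Optional[str] = None             # text the span is supposed to cover


def is_span_anchored(stmt: Statement, source: str) -> bool:
    """True only if the offsets are valid and reproduce the quote verbatim."""
    if stmt.span is None or stmt.quote is None:
        return False  # inferred or hypothesis: no licensing span claimed
    start, end = stmt.span
    if not (0 <= start < end <= len(source)):
        return False
    return source[start:end] == stmt.quote


def anchoring_rate(statements: list[Statement], source: str) -> float:
    """Share of statements whose span checks out against the source."""
    if not statements:
        return 0.0
    anchored = sum(is_span_anchored(s, source) for s in statements)
    return anchored / len(statements)


source_text = "The GTE is coprocessor 2."
demo = [
    Statement("gte", "identifiedAs", "CP2", span=(4, 7), quote="GTE"),
    Statement("gte", "identifiedAs", "CP2", span=(4, 7), quote="GPU"),  # mismatched quote
    Statement("commenter", "findsImpressive", "gte"),                    # inferred, no span
]
demo_rate = anchoring_rate(demo, source_text)
```

Under this check a statement with a fabricated span counts as unanchored, which is the property the paragraph above relies on.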

3. Did the recent fixes hold?

These were tuned against genealogy; the test is whether they survive a domain shift.

check	article	thread	verdict
`rdf:type` (canonical) vs `rdfType`	172 / 0	227 / 0	✅ held
`sameAs` vs `owl:sameAs`	—	2 / 0	✅ held
numbered-suffix predicates (`layer1`, `operand2`…)	1	2	✅ near-eliminated

The corrected prompt steers the model to the canonical predicate forms and away from positional pseudo-arrays regardless of domain; the fixes are properties of the prompt, not artifacts of the genealogy corpus they were tuned on.
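
The three checks in the table can be approximated with a short counter over the emitted predicate strings. This is a sketch under stated assumptions: the trailing-digit regex is a heuristic for "numbered-suffix" and is not necessarily the rule used to produce the counts above.

```python
import re
from collections import Counter

# Canonical form mapped to the discouraged variant, as named in the table.
CANONICAL = {"rdf:type": "rdfType", "sameAs": "owl:sameAs"}

# Assumption: a letter followed by trailing digits marks a positional
# pseudo-array predicate such as layer1 or operand2.
NUMBERED_SUFFIX = re.compile(r"[A-Za-z]\d+$")


def hygiene_report(predicates: list[str]) -> Counter:
    """Count canonical forms, discouraged variants, and numbered suffixes."""
    tracked = set(CANONICAL) | set(CANONICAL.values())
    counts: Counter = Counter()
    for predicate in predicates:
        if predicate in tracked:
            counts[predicate] += 1
        elif NUMBERED_SUFFIX.search(predicate):
            counts["numbered-suffix"] += 1
    return counts


sample = ["rdf:type", "rdf:type", "sameAs", "hasDMAController", "operand2", "layer1"]
report = hygiene_report(sample)
```

A `Counter` returns 0 for keys it never saw, so `report["rdfType"]` reads as zero without special handling. The heuristic would also flag a legitimate predicate that happens to end in a digit, so flagged items still need a manual look.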

4. The lens sweep generalizes

The whole bet is that an LLM, properly guided, can read one source as a logician, a mereologist, a historian, and a domain expert simultaneously. On hardware and forum text it clearly does. A sample of the predicates minted shows the ontological breadth:

Mereology / composition: hasDMAController, hasInternalFIFOBuffer, hasVideoRAMSize, hasWordSize, connectsRemainingComponentsAndIO, graphics-chipset includes gpu
Quantity / measurement: bitmapSizePixels, polygonsPerSecondMultiplier, main-ram size-reserved-unit KB
Causation / function / process: performComputations, processedDataSentTo, transferred-via, perspective-transformation uses camera-perspective
Identity / comparison: gte identifiedAs CP2 (the Geometry Transformation Engine is coprocessor 2 — exactly right), usedAs, mayBeBlendedWith
Provenance / history: manufacturedBy, initially-targeting, before-expanding-into
Forum / social / speech-act (thread): respondsTo, contrastsDesignPriorities, postedBy, citesSource, findsImpressive
Epistemics / meta (thread): expressesOpinionChangeOverTime, observesCulturalTrend, acquiredKnowledgeAbout, commentCount, footnoteMarker, typoType

Two captures are worth dwelling on, because they are the esoteric depth the method is supposed to reach:

From the article: undersampling-behaviour noticeable-when-rendering-geometry-located-far-from camera and edges that "make sudden jumps when moved slightly" — the engine independently described the PS1's notorious affine texture-warping and vertex jitter from prose, as discrete properties of the graphics pipeline.
From the thread: a commenter's C4-storage trick was decomposed into usesBitwiseOperation → or-operation and or-operation operand2 hex-value-80000000h — it read a casual programming aside and extracted the actual bitwise operation and its operand. It even logged a typo in a comment as a fact (typo-pixel-are-t contains pixel-art-wasnt).

This is not entity-and-a-relation extraction. It is the source taken apart from every angle the text supports — on a domain the engine had never seen.

5. The honest defect: predicate-style drift

Maximal extraction is messy, and this run shows where. On the dense technical prose, the model drifted from the prescribed camelCase toward verbose kebab-case predicates that pack a clause into the predicate name:

predicate style (article, distinct)	count
clean `camelCase` (`hasDMAController`)	230
kebab-with-dashes (`made-use-of`)	292
long (>30 chars, clause-like)	77

Examples of the bad tail: unable-to-render-anything-decent-if, remaining-vram-can-be-used-to-store, noticeable-when-rendering-geometry-located-far-from. These are sentence fragments wearing a predicate's clothes — the same anti-pattern the prompt warns against for objects ("never pack a whole sentence into the object") leaking into the predicate. They are almost all singletons, they bloat the predicate tail, and they will never be reused verbatim. This is a real quality regression on technical content relative to the cleaner genealogy output, and it argues for one more prompt reinforcement: predicates are short, reusable camelCase relations; push the specifics into subject, object, and qualifier claims, never into the predicate string.
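
The style buckets in the table can be reproduced with a small classifier. The sketch below is an assumption about how such a tally might be made: the regexes are illustrative, and it checks length first so the buckets are disjoint, which the table above does not state.

```python
import re
from collections import Counter

LONG_THRESHOLD = 30  # characters, matching the ">30 chars" bucket

# lowerCamelCase, allowing acronym runs such as hasDMAController.
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)*$")
# Two or more lowercase tokens joined by hyphens.
KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$")


def classify(predicate: str) -> str:
    """Assign a predicate to exactly one style bucket."""
    if len(predicate) > LONG_THRESHOLD:
        return "long"
    if CAMEL.match(predicate):
        return "camelCase"
    if KEBAB.match(predicate):
        return "kebab"
    return "other"


def style_counts(predicates: list[str]) -> Counter:
    """Tally distinct predicates per style bucket."""
    return Counter(classify(p) for p in set(predicates))


examples = [
    "hasDMAController",
    "made-use-of",
    "unable-to-render-anything-decent-if",
    "remaining-vram-can-be-used-to-store",
]
counts = style_counts(examples)
```

Run over each extraction batch, a tally like this would show whether a prompt reinforcement actually shrinks the kebab and long buckets.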

6. Why the mess is not fatal — and is in fact the point

Here is the part that makes the defect interesting rather than damning. donto's design contract is maximize at extraction, gate (align, type, dedup) at query time. A bloated, drifting predicate tail is precisely the input the continuous alignment fabric is built to fold. Asked for the semantic neighbours of one of this run's predicates, the engine already knows what to do:

manufacturedBy  →  manufactured (0.891), isManufacturedBy (0.901),
                   manufacturedFrom (0.916), donto:manufacturedBy (1.000)

The manufactured* family — minted independently across runs and domains — collapses into one queryable cluster by meaning, with no hand-maintained synonym list, and (after LLM adjudication) with the correct relation type rather than a blind merge. The 292 kebab predicates and 77 clause-fragments are not garbage to be prevented at the source; they are abundance to be reconciled downstream. A system that forced clean predicates at extraction time would have re-imposed the very scarcity the abundance thesis escapes — and would have thrown away the vertex-jitter and bitwise-operand captures along the way, because those came out of the same uninhibited sweep.
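
To make the fold concrete, the sketch below groups predicates whose pairwise similarity clears a threshold. The scores are the ones printed above; the union-find grouping and both threshold values are illustrative assumptions, not the alignment engine's actual algorithm. The output is a set of candidate clusters, which would still go to adjudication rather than being merged blindly.

```python
# Neighbour scores for manufacturedBy, copied from the output above.
EDGES = [
    ("manufacturedBy", "manufactured", 0.891),
    ("manufacturedBy", "isManufacturedBy", 0.901),
    ("manufacturedBy", "manufacturedFrom", 0.916),
    ("manufacturedBy", "donto:manufacturedBy", 1.000),
]


def cluster_candidates(
    edges: list[tuple[str, str, float]], threshold: float
) -> list[set[str]]:
    """Group predicates linked by similarity >= threshold (union-find)."""
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, score in edges:
        # Calling find on both ends registers them even when no merge happens.
        root_left, root_right = find(left), find(right)
        if score >= threshold and root_left != root_right:
            parent[root_left] = root_right

    clusters: dict[str, set[str]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


loose = cluster_candidates(EDGES, threshold=0.85)   # hypothetical threshold
strict = cluster_candidates(EDGES, threshold=0.95)  # hypothetical threshold
```

At the looser threshold the whole manufactured* family lands in one candidate cluster; at the stricter one only manufacturedBy and donto:manufacturedBy stay together. Note that manufacturedFrom scores higher than isManufacturedBy despite meaning something different, which is why a similarity score alone cannot decide the relation type.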

So the test cuts both ways, exactly as the vision predicts: the extraction is gloriously, usefully messy, and the substrate is the thing that makes that mess pay off.

7. Conclusion

A genealogy-tuned engine, pointed at PlayStation hardware and a forum thread it had never seen, produced 1,613 statements (932 of them span-anchored) across ~700 predicates from many ontological lenses, kept its recent prompt fixes intact, and reached real esoteric depth (texture-warping, bitwise decomposition, opinion-change-over-time), while exposing one honest defect (predicate-style drift) that the alignment fabric is purpose-built to absorb. The lens sweep is not a genealogy trick. It is a general method for taking a thing apart through the whole of human understanding, and it travels.

Two follow-ups fall out of this run: (1) a small prompt reinforcement on predicate hygiene (short reusable camelCase, no clause-packing); (2) once the predicate embedding backfill completes, let the continuous alignment engine fold this run's ~700 predicates and measure how much of the kebab tail collapses into the existing camelCase clusters. Both are now queued.

Companion pieces: Total Extraction (the method), The Embedding Fabric (the downstream fold), and the donto systems paper.