Shannon Prime Lattice

KnackAU · June 2, 2026, 8:06am

shannon-prime-lattice

Shannon-Prime PPT ARM Lattice — a decentralized, byte-exact inference and
training fabric for large transformer models built on a single discrete math
object: the prime-factored coordinate lattice over Z_q with dual-prime
Chinese-Remainder-Theorem (CRT) decomposition, the Friedman-Kruskal dominance
order ⪯_d, and the CRT cyclotomic ring R_q = Z_q[x]/(x^N + 1).

This repository is the public project entry point. It holds the theory,
systems, ABI, and on-disk-format papers; the demos; the integration tests;
and the bootstrap prompt for new working sessions. Code lives in the two
companion repositories:

Repo	Role	URL
`shannon-prime-lattice` (this)	Papers, roadmap, demos, integration tests	https://github.com/nihilistau/shannon-prime-lattice
`shannon-prime-system`	Math-core: L1 C ABI, NTT, poly-ring, KSTE, Frobenius, sessions	https://github.com/nihilistau/shannon-prime-system
`shannon-prime-system-engine`	Engine backends (CPU/CUDA/Vulkan/Hexagon), `sp_daemon` HTTP/SSE, tools	https://github.com/nihilistau/shannon-prime-system-engine

Discord: Shannon-Prime-Lattice
License: AGPL-3.0-or-later. Commercial licensing available — contact the
copyright holder.

1. What makes this different

Shannon-Prime Lattice is not “yet another inference engine wrapper.” Every
load-bearing primitive is discrete (integers in Z_q with q a 30-bit
Proth prime, or Z_{q_1} × Z_{q_2} via CRT), so identity, dominance, hashing,
and reproducibility are properties the implementation can prove rather than
estimate. Floating point is plumbing — the math is in Z_q.

Distinguishing claims (each one validated by a shipped sprint and a closure
note under papers/SESSION-CLOSED-*.md or
shannon-prime-system-engine/tools/sp_compute_skel/docs/CLOSURE-*.md):

Discrete Z_q substrate. Two frozen 30-bit Proth primes
q_1 = 1073738753, q_2 = 1073732609, M = q_1·q_2 ≈ 2^60. Negacyclic
NTT over each prime with Garner CRT recombination at the boundary. Every
cross-backend gate is byte-exact, not “small KL divergence.”
Polynomial-ring attention. Attention scores ⟨q, k⟩ reduce to one
coefficient of a negacyclic polynomial product in R_q, computed exactly
via NTT. Bit-identical to the scalar reference at N ∈ {128, 256, 512}
direct, and N ∈ {2..256} via Bluestein chirp-z. See
papers/PPT-LAT-Theory.md §6.1.
Frobenius-lift Q8 weight storage. Per-row int8 codes + fp32 scale;
4× compression vs fp32 with bit-identical dequant round-trip. The
on-RAM packed-arena format is what every backend reads — no per-matmul
re-quantization.
Spinor 63-byte KV-cache block. VHT2 anchor projection + Möbius
reorder + CRC-8 trailer + 0xA5 sentinel. One cache-line on ARM
Cortex-X2. The frozen on-wire KV record format (see
shannon-prime-system/include/sp/spinor_block.h).
KSTE encoder. Knight-Spinor Tree Encoder: deterministic 64-byte
packed tree from a K-vector of int32 components, with byte-identical
signature across platforms. Tier-0/Tier-1 dominance.
PoUW receipt ledger. Per-turn 64-byte SpinorReceipt audit
envelope. Append-only ledger; canonical-order replay; cross-device
byte-identity gates. Shipped end-to-end via sp_daemon’s
/v1/dialogue endpoint.
QUIC dual-prime mesh. Each peer carries one of the two CRT residue
shards (q_1 or q_2); driver Garner-recombines to the centered
signed result. Today: two-node lattice smoke. Planned: Fibonacci-Prime
DHT (papers/PPT-LAT-Roadmap.md §8).
Heterogeneous SoC compute. The cDSP V69 HVX backend on Snapdragon
8 Gen 1 runs the full NTT pipeline (forward, twiddle VTCM staging,
dual-prime dispatch, INTT + Garner) byte-exact vs the math-core
scalar reference. NPU + cDSP dual-island composition is filed under
Phase 4-MTP.
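
The dual-prime split and Garner recombination named in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the two moduli are the values quoted above, while the helper names (`shard`, `garner`, `sharded_dot`) are hypothetical and a plain Python loop stands in for the NTT pipeline.

```python
# Sketch: dual-prime residue sharding plus Garner recombination.
Q1 = 1073738753          # q_1 as quoted above
Q2 = 1073732609          # q_2 as quoted above
M = Q1 * Q2              # combined modulus, just under 2**60

# Inverse of q_1 modulo q_2, computed once (three-argument pow, Python 3.8+).
Q1_INV_MOD_Q2 = pow(Q1, -1, Q2)


def shard(x):
    """Split a signed integer into its two residues, one per peer."""
    return x % Q1, x % Q2


def garner(r1, r2):
    """Recombine two residues into the centered representative modulo M."""
    t = ((r2 - r1) * Q1_INV_MOD_Q2) % Q2
    x = r1 + Q1 * t                      # unique value in [0, M)
    return x - M if x > M // 2 else x    # map to [-(M-1)/2, (M-1)/2]


def sharded_dot(a, b):
    """Each 'peer' accumulates the dot product in its own residue ring."""
    s1 = sum((x % Q1) * (y % Q1) for x, y in zip(a, b)) % Q1
    s2 = sum((x % Q2) * (y % Q2) for x, y in zip(a, b)) % Q2
    return garner(s1, s2)


a = [3, -7, 12345678, -99999]
b = [5, 11, -87654321, 4]
exact = sum(x * y for x, y in zip(a, b))
recombined = sharded_dot(a, b)
```

The recombined value equals the exact signed result whenever the true magnitude stays below M/2. That is why the comparison can be integer equality rather than a tolerance.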

2. Current status

Honest snapshot, 2026-05-31.

Component	Status	Evidence
Frozen L1 C ABI	shipped	`shannon-prime-system/include/sp/sp_l1.h`; tag `lat-phase2-contract-frozen`
`.sp-model` v0 wire format	shipped	`papers/PPT-LAT-SP-MODEL-v0.md`; loader at `core/io_format/`
Math-core reference forward	shipped — runs Qwen3-0.6B, Qwen2.5-Coder-0.5B, Gemma3-1B byte-exact host + aarch64-android	`lib/shannon-prime-system/core/forward/forward.c`; closure `SESSION-CLOSED-lat-3-cell-*.md`
NTT-CRT primitive (host)	shipped	`core/ntt_crt/`; tests `T_NTT_*`
NTT-CRT primitive (Hexagon V69 HVX)	shipped end-to-end byte-exact vs math-core	sprints NTT.0 → NTT.4; closures `CLOSURE-NTT-{0..4}.md`
Polynomial-ring attention overlay	shipped — host + Hexagon	sprints NTT.5a / 5b / 5c
Spinor-block KV cache	shipped	`core/vht2/`; tests `T_VHT_1..6`
Frobenius-lift Q8 / Q4 packing	shipped	`core/frobenius/`, `core/arena/`
KSTE encoder + Tier-0/1 dominance	shipped	`core/kste/`; tests `T_KSTE_1..5`
`sp_daemon` HTTP/SSE chat (`/v1/chat`)	shipped	`tools/sp_daemon/`; closure `CLOSURE-CHAT-INTEGRATION.md`
Dual-model dialogue (`/v1/dialogue`)	shipped	sprint M.2; closure `CLOSURE-M2-DIALOGUE.md`
PoUW receipt ledger + canonical-order replay	shipped	sprints M.4, mesh-canonical-order, ledger-autowire
KSTE-routed sparse Memory activation	shipped	sprint M.5; closure `CLOSURE-M5-ROUTING.md`
Two-node sharded inference smoke	shipped	closure `SESSION-CLOSED-lat-smoke-2node.md`
TailSlayer GF(2) channel oracle	shipped offline pattern	sprints `lat-ts-probe`, `lat-ts-map`, `lat-16-3-*`
CPU AVX-512 backend	built	`src/backends/cpu/avx512/`; closure `SESSION-CLOSED-lat-2-CPU-AVX.md`
CUDA backend (PTX MMA + NTT)	built	`src/backends/cuda/`; closures `SESSION-CLOSED-lat-2-CU-PTX-*.md`
Vulkan backend	built	`src/backends/vulkan/`; closure `SESSION-CLOSED-lat-2-L1-PARITY.md`
Hexagon HVX backend (cDSP V69)	built	`src/backends/hexagon/sp_hex_host.c` + `tools/sp_compute_skel/`
`sp_daemon` → backend dispatch wiring	shipped daemon-side; cDSP skel rebuild pending	sprint WIRE-HEX; closure `CLOSURE-WIRE-HEX.md`
NTT.5d (HD=128 direct backend path)	filed, not shipped	`papers/PPT-LAT-Roadmap.md` §4-NTT
NTT.5e (decode-path NTT routing)	filed, not shipped	`papers/PPT-LAT-Roadmap.md` §4-NTT
CUDA / Vulkan daemon wiring	not shipped — symmetric to WIRE-HEX	`CLOSURE-WIRE-HEX.md` §“What’s NOT done”
Fibonacci-Prime DHT	spec’d	`papers/PPT-LAT-Roadmap.md` §8

Production tok/s baseline (Knack S22U, math-core reference forward, ctx=16+32):

Model	Wall (s)	Tokens	tok/s
Gemma3-1B	18.06	16	0.89
Qwen3-0.6B	11.21	16	1.43

These are the reference path numbers. Once the cDSP skel is rebuilt
against the WIRE-HEX-bundled inc/sp_hex.idl, SP_DAEMON_BACKEND=hex
routes through the HVX backend end-to-end and the table gains a third
column. See shannon-prime-system-engine/tools/sp_compute_skel/docs/CLOSURE-WIRE-HEX.md.

3. Architecture in one diagram

                ┌──────────────────────────────────────────────┐
                │  HTML / TUI / chat clients                   │
                │  curl, browser, sp-console                   │
                └─────────────┬────────────────────────────────┘
                              │ HTTP/JSON, SSE, WebSocket
                              ▼
        ┌──────────────────────────────────────────────────────┐
        │  sp_daemon  (Rust, axum + tokio)                     │
        │  ── L3 routes: /v1/chat /v1/dialogue /v1/events ...  │
        │  ── PoUW ledger, KSTE routing, dialogue pool         │
        │  ── QUIC mesh coordinator (dual-prime shards)        │
        └─────────────┬────────────────────────────────────────┘
                      │ frozen L1 C ABI (sp_session_*, sp_prefill_chunk,
                      │ sp_decode_step, sp_session_register_forward_backend)
                      ▼
        ┌──────────────────────────────────────────────────────┐
        │  libshannonprime  (C, the math core)                 │
        │  ── reference forward: matmul, RMSNorm, RoPE, attn   │
        │  ── NTT-CRT, poly-ring attention overlay             │
        │  ── KSTE, Frobenius, Spinor, arena                   │
        │  ── sp_session, .sp-model loader                     │
        └─────┬──────────────────────────────────────────────┬─┘
              │ §6 forward-backend hook                       │
              ▼                                                ▼
        ┌──────────────────────┐                  ┌──────────────────────┐
        │ Engine backends      │                  │ Hexagon cDSP skel    │
        │ (libsp_engine)       │                  │ (sp_compute_skel)    │
        │ ── CPU AVX2/AVX-512  │                  │ ── HVX NTT butterfly │
        │ ── CUDA (PTX MMA)    │                  │ ── VTCM twiddle stage│
        │ ── Vulkan SPV        │                  │ ── Garner CRT        │
        │ ── Hexagon HVX (host)│ ─FastRPC─────────│ ── Halide FFN        │
        └──────────────────────┘                  └──────────────────────┘

The “single math object” reappears at six layers. Walk down from the
top — DHT key space → polynomial ring → matmul kernel → vector ALU
width — and the same prime-factored lattice picks out the right
operation at each scale. See papers/PPT-LAT-Systems.md
(“Overview: six layers of one math object”).

4. Getting started

4.1 Clone all three repos

git clone https://github.com/nihilistau/shannon-prime-lattice.git
git clone https://github.com/nihilistau/shannon-prime-system.git
git clone --recurse-submodules https://github.com/nihilistau/shannon-prime-system-engine.git

The engine repo bundles shannon-prime-system as a Git submodule under
lib/shannon-prime-system/ — that submodule pin is what every engine
build uses. The standalone shannon-prime-system clone is for working
on the math core in isolation.

4.2 Pick a starting path

You want to run a model and chat with it locally. Go to
shannon-prime-system-engine/README.md. Build the daemon, transcode a
GGUF model, curl /v1/chat.

You want to understand the math. Read in this order:

papers/PPT-LAT-Theory.md — the lattice, ⪯_d as well-quasi-order,
CRT cyclotomic ring, HRR, the 13-step PPT substitution, the unified
role of one math object across the stack.
papers/PPT-LAT-Systems.md — six-layer architecture, engine
backends, inline compression, model-family coverage, gated lattice
features, blockchain scaffolding.
papers/PPT-LAT-Roadmap.md — current implementation phases (1..16
plus the NTT and MeMo waves), per-sub-phase contracts, test gates,
the offload pattern.
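
As a warm-up for the cyclotomic-ring material, here is a small sketch of why an inner product can be read off as one coefficient of a negacyclic product. It makes two assumptions that are not from the papers: the key vector is embedded in reversed order (one standard choice; the Theory paper may use a different embedding), and the product is computed by schoolbook multiplication rather than by NTT. The function names are hypothetical.

```python
Q = 1073738753  # q_1 from section 1


def negacyclic_mul(a, b, q=Q):
    """Schoolbook product of a(x) * b(x) in Z_q[x] / (x^N + 1)."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                # x^N = -1, so wrapped terms come back with a sign flip.
                out[k - n] = (out[k - n] - ai * bj) % q
    return out


def centered(v, q=Q):
    """Map a residue in [0, q) to its signed representative."""
    return v - q if v > q // 2 else v


def dot_via_ring(qvec, kvec, q=Q):
    """Inner product as coefficient N-1 of qvec(x) * reversed(kvec)(x)."""
    n = len(qvec)
    a = [x % q for x in qvec]
    b = [x % q for x in reversed(kvec)]   # b_j = k_{N-1-j}
    return centered(negacyclic_mul(a, b, q)[n - 1], q)


qv = [1, -2, 3, 4, 0, -5, 6, 7]
kv = [2, 0, -1, 3, 9, 1, -4, 2]
score = dot_via_ring(qv, kv)   # equals the plain integer dot product, -4
```

Only index pairs with i + j = N - 1 reach coefficient N - 1, and no wrapped term can land there, so the coefficient is exactly the sum of q_i k_i. An NTT computes the same product faster; the arithmetic result is identical.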

You want to write a kernel against the frozen ABI. Read
papers/PPT-LAT-L1-ABI-v0.md then shannon-prime-system/include/sp/sp_l1.h
(the live header). Every backend registers via
sp_session_register_forward_backend (full-forward hook) or the
NTT-dispatch hook in core/poly_ring_bluestein/.

You want to add support for a new model family. Read
papers/PPT-LAT-SP-MODEL-v0.md (on-disk format) plus
shannon-prime-system-engine/tools/sp_transcode/sp_transcode.c (the GGUF
→ .sp-model transcoder). Add an sp_arch_id and an arch path modeled on the
existing gemma3_forward_* / qwen3_forward_* paths.

You want to add a peer to a running mesh. Read
papers/PPT-LAT-Systems.md §“DHT and sharded inference” then
shannon-prime-system-engine/tools/sp_daemon/src/network/quic_shard.rs.

5. Repository layout

shannon-prime-lattice/
├── papers/                            # the project's papers — read these first
│   ├── PPT-LAT-Theory.md              # math foundations + 13-step PPT substitution
│   ├── PPT-LAT-Systems.md             # six-layer architecture
│   ├── PPT-LAT-Roadmap.md             # implementation phases (living document)
│   ├── PPT-LAT-L1-ABI-v0.md           # frozen Layer-1 C ABI contract
│   ├── PPT-LAT-SP-MODEL-v0.md         # .sp-model / .sp-tokenizer on-disk format
│   ├── SESSION-CLOSED-lat-*.md        # per-sprint closure notes (audit trail)
│   └── SESSION-STATE-lat-*.md         # session-handoff snapshots
├── demos/                             # phase demos
├── frontends/                         # HTML mock-ups + bootstrap chat UIs
├── reference/                         # reference material (images, screenshots, PDFs)
├── scripts/                           # cross-repo helpers
├── tests/                             # integration tests spanning math-core + engine
└── prompt.md                          # bootstrap / context-priming for new sessions

The papers are the source of truth for design. The closure notes
are the source of truth for “what shipped, with what gate result.”
The roadmap is a living document and amendable; the theory paper is
amendable when reality contradicts it; the ABI and .sp-model papers
are frozen.

6. Hard rules

These rules are binding for any session that picks up the project. The
memory entries feedback-no-silent-gate-revisions,
feedback-lead-with-reference-then-theory, and
feedback-parallel-agents-separate-worktrees are also load-bearing.

Anti-contamination. Do NOT read, copy, or vendor code from the
archived shannon-prime/ or shannon-prime-engine/ repos. The math
papers under papers/PPT-ARM/ are conceptual reference — read for
theory, never paste code. The lattice is a clean rebuild.
No silent gate revisions. If implementation can’t meet the spec’d
gate, surface upstream. Do not retreat to a higher-level API, defer
to an unrelated phase, or tune fixtures until the number passes.
Adjustments land as roadmap amendments with rationale, not as
footnotes on a PASS.
Honest closure notes. Every closure enumerates the test gates,
their actual results, what was bundled vs isolated, and what changed
vs spec. The session-closure pattern is the audit trail.
One math object. Lattice features must touch one of the
distinguishing primitives in §1; otherwise they are drift. The
manifesto trick list (reference-heterogeneous-soc-crt-tricks in
the team’s memory) names ten such primitives. New sub-phases reference
trick numbers rather than reinventing the framework.
Worktrees per concurrent agent. When dispatching 2+ agents on
the same repo, each agent operates in its own worktree (created with
git worktree add) to prevent cross-contamination of uncommitted files.

7. Where to read next

If you want	Read
The math foundations	`papers/PPT-LAT-Theory.md`
The systems architecture	`papers/PPT-LAT-Systems.md`
The implementation roadmap (living)	`papers/PPT-LAT-Roadmap.md`
The frozen L1 C ABI contract	`papers/PPT-LAT-L1-ABI-v0.md` then `shannon-prime-system/include/sp/sp_l1.h`
The `.sp-model` on-disk format	`papers/PPT-LAT-SP-MODEL-v0.md`
The math-core library API	`shannon-prime-system/README.md`
The engine + daemon + HTTP API	`shannon-prime-system-engine/README.md`
What the most recent sprint shipped	`papers/SESSION-CLOSED-*.md` (lattice scope) or `shannon-prime-system-engine/tools/sp_compute_skel/docs/CLOSURE-*.md` (engine + DSP scope)
A bootstrap prompt for new sessions	`prompt.md`

Agerico · June 2, 2026, 1:19pm

Shannon-Prime-Lattice reduces numerical and infrastructural entropy, but it does not thereby dissolve the classical philosophical problems of completeness, grounding, reference, representation, decidability, and semantic closure: Gödel’s incompleteness theorems, Turing’s halting problem, Church’s undecidability of first-order logic, the Duhem-Quine thesis, Quine’s inscrutability of reference and underdetermination theses, and more. It relocates them into a discrete algebraic lattice architecture. This is the deepest issue. If the system begins to encode not only object-level data but also its own inference states, dominance relations, memory receipts, provenance, and correctness claims, it risks semantic self-reference. That is where closure problems arise: can the system fully represent, verify, and govern its own representational adequacy from inside the same lattice? The Gödel/Tarski/Turing family of concerns re-enters here.

KnackAU · June 2, 2026, 2:23pm

You are correct that Shannon-Prime PPT ARM does not dissolve Gödel, Turing, or Quine. It is not an attempt to solve the epistemic problems of truth, reference, or semantic closure.

The goal of the Shannon-Prime Lattice is much more mechanical: we are solving the physical and informational drift caused by floating-point arithmetic in continuous architectures.

To address your concern about semantic self-reference and closure problems arising from the system encoding its own states, here is how the architecture structurally avoids that trap:

Strict Separation of Substrate and Semantics: The Z_q Cyclotomic Ring is a purely syntactic, deterministic ALU. It doesn’t judge the “truth” or representational adequacy of what it computes. It just multiplies and adds discrete integers losslessly. The semantic orchestration (MTP verification, state rollbacks, Beatty routing) happens entirely outside the mathematical ring, managed by a completely separate L3 orchestrator (a Rust daemon). We do not ask the polynomial ring to prove its own consistency.
Frozen Base and Append-Only Memory: The system does not recursively rewrite its own foundational logic. The base model weights are mathematically frozen. The continuous learning mechanisms (MEMO, Spinor receipts) function as an append-only cryptographic ledger of discrete integer offsets. Because applying these updates is strictly matrix addition in Z_q, it is fully commutative and associative. It accumulates context without initiating recursive self-modification.
Transactional, Not Self-Referential: When the system evaluates a state (like verifying a Multi-Token Prediction draft), it is evaluating byte-exact integer equality, not subjective probability. If a draft fails, it triggers a hard, mechanical rollback to a previously committed Spinor block.

You are absolutely right that if we tried to build a self-modifying, self-governing AGI entirely inside a single lattice, we would hit a Gödelian wall. By treating the lattice simply as a flawless, lossless engine and keeping state-management external, we avoid semantic self-reference. We aren’t trying to beat Turing; we just want to stop bleeding entropy into the hardware.
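
The append-only point can be made concrete with a toy ledger. This is a minimal sketch under assumptions: the receipt format, hashing, and real weight shapes are not modeled, and `apply_ledger` is a hypothetical name. It only shows that folding integer offsets into a frozen base modulo q gives the same state in any replay order, while the same claim fails for floating-point addition.

```python
import random

Q = 1073738753  # q_1 from the opening post


def apply_ledger(base, ledger, q=Q):
    """Fold a list of integer offset vectors into frozen base weights."""
    state = list(base)
    for delta in ledger:
        state = [(s + d) % q for s, d in zip(state, delta)]
    return state


base = [5, 17, Q - 1, 0]
ledger = [[1, 2, 3, 4], [Q - 7, 0, 9, 1], [100, 200, 300, 400]]

in_order = apply_ledger(base, ledger)

# Replay the same entries in a different order.
shuffled = ledger[:]
random.Random(0).shuffle(shuffled)
replayed = apply_ledger(base, shuffled)

# Contrast: float addition is not associative, so order changes the bits.
float_left = (0.1 + 0.2) + 0.3
float_right = 0.1 + (0.2 + 0.3)
```

Here `replayed == in_order` holds for every permutation of the ledger, whereas `float_left` and `float_right` differ in the last bit.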

KnackAU · June 2, 2026, 6:13pm

Agerico, following up on our discussion—we just concluded a round of physical silicon validation this week that I think perfectly illustrates the boundary between the philosophical traps you rightly point out, and how we are physically sidestepping them in the architecture.

When you mentioned the risks of the system managing its own ‘memory receipts, provenance, and correctness claims,’ the immediate engineering danger is that if a model has to semantically ‘understand’ its own memory to retrieve it, it falls into that exact recursive, undecidable trap.

We just finished wiring our Ring-2 memory architecture, which physically spills the model’s KV cache out of RAM and onto Intel Optane NVMe drives, completely decoupling context length from host memory. To retrieve that memory without triggering semantic collapse, here is what we proved on the hardware:

1. Routing via Geometry, Not Semantics:

To find a specific needle of information in a massive context window spilled to disk, the system does not ‘read’ or evaluate the semantics of the text. Instead, we deployed a ±1 Rademacher integer projection sidecar. By the Johnson-Lindenstrauss lemma, such a projection approximately preserves the inner-product geometry of the attention vectors. The router just performs ultra-fast, discrete Z_q integer matching. It scored a perfect 8/8 retrieval at depth-10% of the context window, proving we can route ‘dominance’ purely through discrete geometry.
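
A toy version of routing by integer projection, for readers who want to see the mechanism. Everything here is an assumption made for illustration: the sizes, the seeds, the squared-distance match rule, and the function names are hypothetical, and the values are small enough that no modular reduction is needed. It is not the sidecar described above.

```python
import random


def rademacher_matrix(rows, dim, seed):
    """Random matrix with entries in {+1, -1}."""
    rng = random.Random(seed)
    return [[1 if rng.getrandbits(1) else -1 for _ in range(dim)]
            for _ in range(rows)]


def project(P, v):
    """Integer sketch of v: each output is a signed sum of the inputs."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def route(P, query, key_sketches):
    """Index of the stored sketch closest to the query's sketch."""
    qs = project(P, query)
    return min(range(len(key_sketches)),
               key=lambda i: sq_dist(qs, key_sketches[i]))


DIM, ROWS, KEYS = 128, 16, 64          # illustrative sizes only
rng = random.Random(1)
keys = [[rng.randint(-127, 127) for _ in range(DIM)] for _ in range(KEYS)]
P = rademacher_matrix(ROWS, DIM, seed=2)
sketches = [project(P, k) for k in keys]   # what the router would store

needle = 37
found = route(P, keys[needle], sketches)
```

The router never touches the full vectors at query time, only 16 integers per key. For two vectors x and y, the quantity ⟨Px, Py⟩ / ROWS is an unbiased estimate of ⟨x, y⟩, which is the sense in which the geometry survives the projection.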

2. Physical Grounding (The NaN-Poisoned Cache):

To prove the system wasn’t hallucinating or cheating with residual RAM, we intentionally poisoned the Ring-1 RAM cache with NaN values for any token that was evicted to the Optane drive. If the model tried to evaluate its memory representations internally instead of reading the physical disk, the math would instantly explode. The model successfully retrieved the specific needles with 100% accuracy, proving the spill -> fetch -> decode -> attend pipeline is purely mechanical.
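
The logic of the poisoning test can be shown with a toy two-tier store. This is a sketch under assumptions: a Python dict stands in for the NVMe spill, scalars stand in for KV blocks, and the class and function names are hypothetical.

```python
import math


class TwoRingCache:
    """Toy two-tier KV store: ring 1 is a RAM list, ring 2 a stand-in disk."""

    def __init__(self, values):
        self.ram = list(values)
        self.disk = {}

    def evict(self, idx):
        self.disk[idx] = self.ram[idx]   # spill the value to ring 2
        self.ram[idx] = math.nan         # poison the stale RAM slot

    def fetch(self, idx):
        """Correct read path: evicted entries come from disk."""
        return self.disk[idx] if idx in self.disk else self.ram[idx]


def attend(read, indices, weights):
    """Weighted sum over whatever the supplied read function returns."""
    return sum(w * read(i) for i, w in zip(indices, weights))


cache = TwoRingCache([1.0, 2.0, 3.0, 4.0])
cache.evict(1)
cache.evict(2)

idx = [0, 1, 2, 3]
w = [0.25, 0.25, 0.25, 0.25]

stale = attend(lambda i: cache.ram[i], idx, w)   # wrong path: reads RAM only
good = attend(cache.fetch, idx, w)               # right path: spill -> fetch
```

Any code path that still reads an evicted slot from RAM yields NaN (`stale`), so a correct answer (`good == 2.5`) can only have come through the fetch path.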

3. Dismantling the Compute Wall (18.86 µs latency):

By decoupling the query-head parallel loop from the KV fetch (a strict deduplication phase), we bypassed the OS page cache using FILE_FLAG_NO_BUFFERING and drove per-read latency down to 18.86 µs directly through the Windows kernel.

The takeaway for us is this: You are absolutely right that we cannot solve the Gödel/Tarski/Turing family of concerns from inside the lattice. So we don’t try. We treat memory retrieval not as a semantic evaluation, but as a pure, asynchronous I/O block-storage problem governed by integer projections. By keeping the math discrete and pushing the state-management to physical disk sectors, we let the physics do the work.

Agerico · June 2, 2026, 10:13pm

Clarifies much. I have been too much in a rush to comment and have taken the wrong perspective. Thanks. Agree, “let the physics work.”

KnackAU · June 3, 2026, 7:21am

A couple of corrections for the record, a way to reproduce the work, and a licensing note.

Tightening two numbers from my last post, in the spirit of the receipts-first discipline I keep claiming. I conflated two separate gates and undersold a third figure:

The 8/8 is the router in isolation — the ±1 Rademacher projection scored 8/8 needles at cosine 1.0 against an adversarial decoy set (B=64, r=16). Separately, the end-to-end NIAH decode gate retrieves the needle at depth 10%, 50%, and 90% (no recency bias). Two different gates; I ran them together last time.
The latency I quoted (18.86 µs) was an intermediate stage. The final IOCP + FILE_FLAG_NO_BUFFERING path is 7.57 µs/read. I undersold it.

For completeness, the rest of the envelope at 32k context: 910× resident KV-cache shrink (7.5 GB → 8.3 MB), 8× KV sparsification at +0.69% perplexity (measured at 2k context on one corpus; at 2× and 4× the perplexity delta is negative), and a reducing transcode that makes the on-disk model ~50% smaller with a bit-identical forward on both Gemma-3 and Qwen3.

Reproduce it from a command. I’ve put the work up as a receipts-first paper series — the rule is no number without a runnable command:

Landing page: Shannon-Prime — long-context KV memory you can run
Repo: https://github.com/nihilistau/Position_Is_Arithmetic

git clone https://github.com/nihilistau/Position_Is_Arithmetic.git
cd Position_Is_Arithmetic
# 02 — the reducing loader: reproduces green now (6/6 format gates,
#      bit-faithful forward on gemma-3 + qwen3). See papers/02-reducing-loader/repro/
# 01 — two-ring memory: the needle-retrieval harness is in
#      papers/01-two-ring-memory/repro/ ; the 32k headline figures
#      land as that run completes.

Each paper carries its own repro/ with the exact invocation and an EXPECTED.md. Correctness reproduces on any NVMe; the latency figure is the only Optane-specific part.

Licensing. The AGPL-3.0 line in the top post is stale — we’re moving everything to MIT across all the repos. The papers repo above is already MIT; the code repos are following.

And thanks, Agerico — the closure pressure was the right thing to push on, even though the answer turned out to be “keep the lattice purely mechanical and let the disk do the remembering.”

KnackAU · June 8, 2026, 8:00am

Update — the receipts-first paper series grew three papers, and one of them required indicting an ecosystem

A lot has happened since the opening post. The short version: the public,
receipts-first paper series at
https://github.com/nihilistau/Position_Is_Arithmetic now carries papers
04, 05 and 06 — and finishing 06 forced us to root-cause something the
whole local-inference community is currently sitting on: every Gemma-4 GGUF
we could measure, including the post-fix rebuilds, carries broken weights.

The series discipline hasn’t changed: every number is a row in a shared
ledger with a command behind it, honest negatives stay on the record, and no
throughput number is citable without a quality gate on the same artifact.
That last rule is the reason this update exists.

Paper 04 — The Oracle & the Teacher (oracle-grounded backend verification)

What it solves: porting a complex architecture to new silicon without the
weeks-long divergence hunt — and, it turns out, defending yourself when the
reference implementation itself is wrong.
How: extract a bit-faithful CPU oracle from the reference first (scalar,
readable, f64-accumulating), grade every backend against the oracle and
never against a prior port, and gate autoregressive decode by
teacher-forcing (the oracle re-predicts the port’s own generated stream).
Receipt: a 35-layer variable-geometry MatFormer (per-layer attention
widths, shared KV, proportional RoPE, softcap) matched its oracle at
max KL 2.663e-10 (argmax 12/12), both live runs green first-try, 38/38.
How it fits: this is the verification layer for everything in §1 of the
opening post — “byte-exact, not small-KL” is only meaningful if the thing
you’re byte-exact against is itself proven. The paper’s case study is the
strongest demonstration we have: when llama.cpp scored wikitext PPL 397–506
on Gemma-4-12B and the ecosystem normalized it, a from-scratch forward
written off the official safetensors + config alone measured 4.6776 —
the model was healthy, llama.cpp’s forward was exonerated (two independent
engines agree per-artifact), and the GGUF artifacts themselves were
convicted. An oracle is not a porting tool; it’s the only defense against a
poisoned reference frame.
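
A rough sketch of what an oracle gate of this kind can look like. It is an illustration, not the paper's harness: the function names, the gate threshold, and the toy logits are assumptions. Under teacher forcing, both sets of logits are taken at the same positions of the port's own generated token stream, so each row pair is directly comparable.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax in float64."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl(p_logits, q_logits):
    """KL(p || q) between two next-token distributions."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def grade(oracle_rows, port_rows, max_kl=1e-6):
    """Compare per-position logits; max_kl is a hypothetical threshold."""
    pairs = list(zip(oracle_rows, port_rows))
    worst = max(kl(o, p) for o, p in pairs)
    agree = sum(argmax(o) == argmax(p) for o, p in pairs)
    return {
        "max_kl": worst,
        "argmax": (agree, len(pairs)),
        "pass": worst <= max_kl and agree == len(pairs),
    }


oracle = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
same = grade(oracle, [row[:] for row in oracle])
broken = grade(oracle, [[0.5, 2.0, -1.0], [0.1, 3.0, 0.2]])
```

Grading always runs against the oracle rows, never against another port's output, so an error in one backend cannot become the reference for the next.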

Paper 05 — The Probe Suite (bisection, isolation & benchmark hygiene as one set)

What it solves: the fact that correct numbers about computing systems are
not read off — they are manufactured. The suite is how.
How: truncated-parity bisection, isolation sweeps, benchmark hygiene and
oracle-rank telemetry, used together. Documented kills: a 12.65× phantom
speedup (three stacked artifacts), a 2.8e-3 wrong-arithmetic localized in
two probe runs, a mixed-precision 0/256 bug the isolated bench passed at
1.34e-7, and a per-vector activation-quant collapse at oracle-rank 205,596
on outlier-heavy activations (fixed with per-block scales aligned to the
kernel’s 128-bit loads).
How it fits: the second half turns the same toolset outward, at ecosystem
scale — tensor-class swap bisection over the broken GGUFs (restoring just
the per-layer scale class recovered PPL 364→97; restoring norms made it
worse, proving the matmul weights were damaged too), per-layer cosine
forensics (no permutation; in-place damage with a period-6 layer
signature), and simulate-before-build: six quantization recipes
simulated through the proven reference forward before a line of CUDA
existed — and the built artifact then matched the simulation to four
decimal places (5.1259), with the GPU kernel agreeing as a third
instrument (5.1160).
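
Truncated-parity bisection can be sketched as a binary search over layer count. This is a minimal illustration with assumptions stated up front: `run_ref(k)` and `run_port(k)` are hypothetical callables that return activations after only the first k layers, the toy model below is invented, and the search assumes that once outputs diverge they stay diverged at every deeper truncation.

```python
def first_divergent_layer(run_ref, run_port, n_layers, tol=0):
    """Return the 1-based index of the first layer that breaks parity.

    Returns None when the full-depth outputs already agree.
    """
    def differs(k):
        a, b = run_ref(k), run_port(k)
        return max(abs(x - y) for x, y in zip(a, b)) > tol

    if not differs(n_layers):
        return None
    lo, hi = 0, n_layers          # invariant: parity at lo, mismatch at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if differs(mid):
            hi = mid
        else:
            lo = mid
    return hi


def make_model(bad_layer=None):
    """Toy integer 'network'; bad_layer injects one wrong operation."""
    def run(k):
        x = [1, 2, 3]
        for layer in range(1, k + 1):
            x = [3 * v + layer for v in x]
            if layer == bad_layer:
                x[0] += 1         # injected wrong arithmetic
        return x
    return run


culprit = first_divergent_layer(make_model(), make_model(bad_layer=23), 35)
clean = first_divergent_layer(make_model(), make_model(), 35)
```

On a 35-layer stack this needs about six truncated runs after the initial full-depth check, instead of comparing every layer in turn.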

Paper 06 — Computing on the Zip File (the dp4a bandwidth ladder — complete, gated, citable)

What it solves: memory-bound decode on consumer silicon. The weights’
byte count is the speed of light, but only if you compute directly on the
packed integer codes — dequantizing to f32 scratch first measured 3×
slower than plain f32.
How: warp-per-row __dp4a GEMV, 128-bit loads, in-ALU nibble unpack
(~7% tax), exact integer accumulation, one Frobenius lift at the end —
the isolated ladder runs f32 1× → int8 ~3.8× → Q4 ~7.06×, hugging the
byte ratios. New this round: the OK_Q4B format (per-32-block f16
scales, store-then-derive discipline) where one weight block is exactly
one 128-bit chunk in the kernel — zero extra code-bus traffic — and the
sovereign quantization pipeline: artifact values come from the official
safetensors checkpoint, never from a GGUF, and every artifact gates
against the paper-04 oracle before any throughput number is taken.
The headline, stated honestly: Gemma-4-12B at 26.1 tok/s and wikitext
PPL 5.12 on an RTX 2060 12GB (graph path bit-exact, decode 256/256
top-1, 24/24 gates, clocks pinned). llama.cpp-CUDA on the same card does
31.29 tok/s — at PPL 192–506, because its artifacts are broken.
Engine-for-engine we move 18% more bytes/s (245 vs 207 GB/s effective);
our artifact is heavier because it is the only mathematically intact
4-bit Gemma-4-12B in existence. And in the spirit of the series: an
earlier 34.2 tok/s headline is formally retired in the ledger —
it was measured on an artifact that later failed the PPL gate. The rule
caught our own number first.
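
The arithmetic shape of "compute on the packed codes, lift once at the end" can be emulated outside CUDA. The sketch below is an assumption-laden illustration, not the kernel: `dp4a` here only mimics the arithmetic of a 4-way int8 dot-product-accumulate, the per-row symmetric int8 scheme is a simplification (the Q4 nibble unpack and per-32-block f16 scales are not modeled), and all names are hypothetical.

```python
def dp4a(a4, b4, acc):
    """Emulate a 4-way int8 dot product added to an integer accumulator."""
    return acc + a4[0] * b4[0] + a4[1] * b4[1] + a4[2] * b4[2] + a4[3] * b4[3]


def quantize_row(row):
    """Per-row symmetric int8: codes in [-127, 127] plus one float scale."""
    amax = max(abs(v) for v in row)
    scale = amax / 127.0 if amax else 1.0
    return [round(v / scale) for v in row], scale


def gemv_int8(W, x):
    """y = W @ x computed on integer codes, with one scale per output row."""
    x_codes, x_scale = quantize_row(x)
    out = []
    for row in W:
        codes, w_scale = quantize_row(row)
        acc = 0                                   # exact integer accumulation
        for i in range(0, len(codes), 4):         # row length: multiple of 4
            acc = dp4a(codes[i:i + 4], x_codes[i:i + 4], acc)
        out.append(acc * w_scale * x_scale)       # single lift at the end
    return out


W = [[127.0, -3.0, 5.0, 0.0, 1.0, 2.0, -7.0, 9.0],
     [-127.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
x = [127.0, 1.0, -1.0, 2.0, 0.0, 3.0, 4.0, -5.0]
y = gemv_int8(W, x)
```

The toy inputs are chosen so that every scale is 1.0 and the quantization is lossless, which makes the result directly comparable with the float dot product. With real weights the codes are lossy, and the integer accumulation is what stays exact.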

For anyone hitting the Gemma-4 quant weirdness themselves: we published a
standalone walkthrough — verify the breakage in ~30 minutes with an
engine-independent method, plus the quantization recipe that actually works
on this PTQ-hostile model (blanket 4-bit costs +45% PPL; 4-bit on the FFN
gate/up pair only with 8-bit elsewhere costs +9.6%):
https://github.com/nihilistau/Position_Is_Arithmetic/blob/main/GEMMA4-QUANT-FIX.md
All forensic instruments are MIT, ~130-line numpy/torch scripts, no GPU
required for the verification.

How this fits the lattice overall: the opening post’s thesis was that
floating-point drift and un-provable identity are entropy bleeding into the
hardware, and that a discrete substrate makes correctness a property you
prove rather than estimate. This round extended that doctrine one level up
the stack — to the artifacts. The same discipline that makes a kernel
byte-exact (oracle, gates, receipts) is what caught an interchange format
silently destroying weights while every smoke test stayed green. The
supply chain is now part of the math.

Papers, ledger, methodology, instruments:
nihilistau/Position_Is_Arithmetic

As always — the unflattering numbers are kept attached on purpose.
