Shannon Prime Lattice

shannon-prime-lattice

Shannon-Prime PPT ARM Lattice — a decentralized, byte-exact inference and
training fabric for large transformer models built on a single discrete math
object: the prime-factored coordinate lattice over Z_q with dual-prime
Chinese-Remainder-Theorem (CRT) decomposition, the Friedman-Kruskal dominance
order ⪯_d, and the CRT cyclotomic ring R_q = Z_q[x]/(x^N + 1).

This repository is the public project entry point. It holds the theory,
systems, ABI, and on-disk-format papers; the demos; the integration tests;
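and the bootstrap prompt for new working sessions. Code lives in the two
companion repositories, shannon-prime-system (the math core) and
shannon-prime-system-engine (engine backends and daemon); see §4.1.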

Discord: Shannon-Prime-Lattice
License: AGPL-3.0-or-later. Commercial licensing available — contact the
copyright holder.


1. What makes this different

Shannon-Prime Lattice is not “yet another inference engine wrapper.” Every
load-bearing primitive is discrete (integers in Z_q with q a 30-bit
Proth prime, or Z_{q_1} × Z_{q_2} via CRT), so identity, dominance, hashing,
and reproducibility are properties the implementation can prove rather than
estimate. Floating point is plumbing — the math is in Z_q.

Distinguishing claims (each one validated by a shipped sprint and a closure
note under papers/SESSION-CLOSED-*.md or
shannon-prime-system-engine/tools/sp_compute_skel/docs/CLOSURE-*.md):

  • Discrete Z_q substrate. Two frozen 30-bit Proth primes
    q_1 = 1073738753, q_2 = 1073732609, M = q_1·q_2 ≈ 2^60. Negacyclic
    NTT over each prime with Garner CRT recombination at the boundary. Every
    cross-backend gate is byte-exact, not “small KL divergence.”
  • Polynomial-ring attention. Attention scores ⟨q, k⟩ reduce to one
    coefficient of a negacyclic polynomial product in R_q, computed exactly
    via NTT. Bit-identical to the scalar reference at N ∈ {128, 256, 512}
    direct, and N ∈ {2..256} via Bluestein chirp-z. See
    papers/PPT-LAT-Theory.md §6.1.
  • Frobenius-lift Q8 weight storage. Per-row int8 codes + fp32 scale;
    4× compression vs fp32 with bit-identical dequant round-trip. The
    on-RAM packed-arena format is what every backend reads — no per-matmul
    re-quantization.
  • Spinor 63-byte KV-cache block. VHT2 anchor projection + Möbius
    reorder + CRC-8 trailer + 0xA5 sentinel. One cache-line on ARM
    Cortex-X2. The frozen on-wire KV record format (see
    shannon-prime-system/include/sp/spinor_block.h).
  • KSTE encoder. Knight-Spinor Tree Encoder: deterministic 64-byte
    packed tree from a K-vector of int32 components, with byte-identical
    signature across platforms. Tier-0/Tier-1 dominance.
  • PoUW receipt ledger. Per-turn 64-byte SpinorReceipt audit
    envelope. Append-only ledger; canonical-order replay; cross-device
    byte-identity gates. Shipped end-to-end via sp_daemon’s
    /v1/dialogue endpoint.
  • QUIC dual-prime mesh. Each peer carries one of the two CRT residue
    shards (q_1 or q_2); driver Garner-recombines to the centered
    signed result. Today: two-node lattice smoke. Planned: Fibonacci-Prime
    DHT (papers/PPT-LAT-Roadmap.md §8).
  • Heterogeneous SoC compute. The cDSP V69 HVX backend on Snapdragon
    8 Gen 1 runs the full NTT pipeline (forward, twiddle VTCM staging,
    dual-prime dispatch, INTT + Garner) byte-exact vs the math-core
    scalar reference. NPU + cDSP dual-island composition is filed under
    Phase 4-MTP.
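The first two claims above (dual-prime CRT with Garner recombination, and an attention score as one coefficient of a negacyclic product) can be sketched in a few lines. This is an illustrative model only, not the project's code: the function names are hypothetical, and a schoolbook O(N²) product stands in for the NTT, which would yield the same residues. The only values taken from the post are the two moduli.

```python
# Illustrative sketch; function names are hypothetical, not the project's API.
Q1, Q2 = 1073738753, 1073732609    # the two moduli quoted in the post
M = Q1 * Q2                         # combined modulus, just under 2**60

# Both moduli are 1 mod 1024, which is what a negacyclic NTT of size N <= 512 needs.
assert (Q1 - 1) % 1024 == 0 and (Q2 - 1) % 1024 == 0

def negacyclic_mul(a, b, q):
    """Schoolbook product in Z_q[x]/(x^N + 1); stands in for the NTT here."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:                    # x^N = -1, so wrapped terms flip sign
                out[k - n] = (out[k - n] - ai * bj) % q
    return out

def garner_centered(r1, r2):
    """Recombine residues mod Q1 and Q2 into the centered signed value."""
    inv = pow(Q1, -1, Q2)            # Q1^{-1} mod Q2 (Python 3.8+)
    t = ((r2 - r1) * inv) % Q2
    x = r1 + Q1 * t                  # unique representative in [0, M)
    return x - M if x > M // 2 else x

def dot_via_ring(qv, kv):
    """<qv, kv> read off as coefficient N-1 of qv(x) * reverse(kv)(x)."""
    k_rev = kv[::-1]
    r1 = negacyclic_mul([c % Q1 for c in qv], [c % Q1 for c in k_rev], Q1)[-1]
    r2 = negacyclic_mul([c % Q2 for c in qv], [c % Q2 for c in k_rev], Q2)[-1]
    return garner_centered(r1, r2)

qv = [3, -1, 4, 1, -5, 9, -2, 6]
kv = [2, 7, -1, 8, 2, -8, 1, 8]
result = dot_via_ring(qv, kv)
expected = sum(a * b for a, b in zip(qv, kv))
```

Coefficient N-1 works because, with one operand reversed, the index pairs that sum to N-1 are exactly the matching components, and no pair can reach 2N-1, so nothing wraps with a sign flip. The recombined value is exact as long as the true dot product has magnitude below M/2.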

2. Current status

Honest snapshot, 2026-05-31.

| Component | Status | Evidence |
|---|---|---|
| Frozen L1 C ABI | shipped | shannon-prime-system/include/sp/sp_l1.h; tag lat-phase2-contract-frozen |
| .sp-model v0 wire format | shipped | papers/PPT-LAT-SP-MODEL-v0.md; loader at core/io_format/ |
| Math-core reference forward | shipped; runs Qwen3-0.6B, Qwen2.5-Coder-0.5B, Gemma3-1B byte-exact on host + aarch64-android | lib/shannon-prime-system/core/forward/forward.c; closure SESSION-CLOSED-lat-3-cell-*.md |
| NTT-CRT primitive (host) | shipped | core/ntt_crt/; tests T_NTT_* |
| NTT-CRT primitive (Hexagon V69 HVX) | shipped end-to-end, byte-exact vs math-core | sprints NTT.0 → NTT.4; closures CLOSURE-NTT-{0..4}.md |
| Polynomial-ring attention overlay | shipped (host + Hexagon) | sprints NTT.5a / 5b / 5c |
| Spinor-block KV cache | shipped | core/vht2/; tests T_VHT_1..6 |
| Frobenius-lift Q8 / Q4 packing | shipped | core/frobenius/, core/arena/ |
| KSTE encoder + Tier-0/1 dominance | shipped | core/kste/; tests T_KSTE_1..5 |
| sp_daemon HTTP/SSE chat (/v1/chat) | shipped | tools/sp_daemon/; closure CLOSURE-CHAT-INTEGRATION.md |
| Dual-model dialogue (/v1/dialogue) | shipped | sprint M.2; closure CLOSURE-M2-DIALOGUE.md |
| PoUW receipt ledger + canonical-order replay | shipped | sprints M.4, mesh-canonical-order, ledger-autowire |
| KSTE-routed sparse Memory activation | shipped | sprint M.5; closure CLOSURE-M5-ROUTING.md |
| Two-node sharded inference smoke | shipped | closure SESSION-CLOSED-lat-smoke-2node.md |
| TailSlayer GF(2) channel oracle | shipped offline pattern | sprints lat-ts-probe, lat-ts-map, lat-16-3-* |
| CPU AVX-512 backend | built | src/backends/cpu/avx512/; closure SESSION-CLOSED-lat-2-CPU-AVX.md |
| CUDA backend (PTX MMA + NTT) | built | src/backends/cuda/; closures SESSION-CLOSED-lat-2-CU-PTX-*.md |
| Vulkan backend | built | src/backends/vulkan/; closure SESSION-CLOSED-lat-2-L1-PARITY.md |
| Hexagon HVX backend (cDSP V69) | built | src/backends/hexagon/sp_hex_host.c + tools/sp_compute_skel/ |
| sp_daemon → backend dispatch wiring | shipped daemon-side; cDSP skel rebuild pending | sprint WIRE-HEX; closure CLOSURE-WIRE-HEX.md |
| NTT.5d (HD=128 direct backend path) | filed, not shipped | papers/PPT-LAT-Roadmap.md §4-NTT |
| NTT.5e (decode-path NTT routing) | filed, not shipped | papers/PPT-LAT-Roadmap.md §4-NTT |
| CUDA / Vulkan daemon wiring | not shipped; symmetric to WIRE-HEX | CLOSURE-WIRE-HEX.md §“What’s NOT done” |
| Fibonacci-Prime DHT | spec’d | papers/PPT-LAT-Roadmap.md §8 |

Production tok/s baseline (Knack S22U, math-core reference forward, ctx=16+32):

| Model | Wall (s) | Tokens | tok/s |
|---|---|---|---|
| Gemma3-1B | 18.06 | 16 | 0.89 |
| Qwen3-0.6B | 11.21 | 16 | 1.43 |

These are the reference path numbers. Once the cDSP skel is rebuilt
against the WIRE-HEX-bundled inc/sp_hex.idl, SP_DAEMON_BACKEND=hex
routes through the HVX backend end-to-end and the table gains a third
column. See shannon-prime-system-engine/tools/sp_compute_skel/docs/CLOSURE-WIRE-HEX.md.


3. Architecture in one diagram

                ┌──────────────────────────────────────────────┐
                │  HTML / TUI / chat clients                   │
                │  curl, browser, sp-console                   │
                └─────────────┬────────────────────────────────┘
                              │ HTTP/JSON, SSE, WebSocket
                              ▼
        ┌──────────────────────────────────────────────────────┐
        │  sp_daemon  (Rust, axum + tokio)                     │
        │  ── L3 routes: /v1/chat /v1/dialogue /v1/events ...  │
        │  ── PoUW ledger, KSTE routing, dialogue pool         │
        │  ── QUIC mesh coordinator (dual-prime shards)        │
        └─────────────┬────────────────────────────────────────┘
                      │ frozen L1 C ABI (sp_session_*, sp_prefill_chunk,
                      │ sp_decode_step, sp_session_register_forward_backend)
                      ▼
        ┌──────────────────────────────────────────────────────┐
        │  libshannonprime  (C, the math core)                 │
        │  ── reference forward: matmul, RMSNorm, RoPE, attn   │
        │  ── NTT-CRT, poly-ring attention overlay             │
        │  ── KSTE, Frobenius, Spinor, arena                   │
        │  ── sp_session, .sp-model loader                     │
        └─────┬──────────────────────────────────────────────┬─┘
              │ §6 forward-backend hook                       │
              ▼                                                ▼
        ┌──────────────────────┐                  ┌──────────────────────┐
        │ Engine backends      │                  │ Hexagon cDSP skel    │
        │ (libsp_engine)       │                  │ (sp_compute_skel)    │
        │ ── CPU AVX2/AVX-512  │                  │ ── HVX NTT butterfly │
        │ ── CUDA (PTX MMA)    │                  │ ── VTCM twiddle stage│
        │ ── Vulkan SPV        │                  │ ── Garner CRT        │
        │ ── Hexagon HVX (host)│ ─FastRPC─────────│ ── Halide FFN        │
        └──────────────────────┘                  └──────────────────────┘

The “single math object” reappears at six layers. Walk down from the
top — DHT key space → polynomial ring → matmul kernel → vector ALU
width — and the same prime-factored lattice picks out the right
operation at each scale. See papers/PPT-LAT-Systems.md
(“Overview: six layers of one math object”).


4. Getting started

4.1 Clone all three repos

git clone https://github.com/nihilistau/shannon-prime-lattice.git
git clone https://github.com/nihilistau/shannon-prime-system.git
git clone --recurse-submodules https://github.com/nihilistau/shannon-prime-system-engine.git

The engine repo bundles shannon-prime-system as a Git submodule under
lib/shannon-prime-system/ — that submodule pin is what every engine
build uses. The standalone shannon-prime-system clone is for working
on the math core in isolation.

4.2 Pick a starting path

You want to run a model and chat with it locally. Go to
shannon-prime-system-engine/README.md. Build the daemon, transcode a
GGUF model, curl /v1/chat.

You want to understand the math. Read in this order:

  1. papers/PPT-LAT-Theory.md — the lattice, ⪯_d as well-quasi-order,
    CRT cyclotomic ring, HRR, the 13-step PPT substitution, the unified
    role of one math object across the stack.
  2. papers/PPT-LAT-Systems.md — six-layer architecture, engine
    backends, inline compression, model-family coverage, gated lattice
    features, blockchain scaffolding.
  3. papers/PPT-LAT-Roadmap.md — current implementation phases (1..16
    plus the NTT and MeMo waves), per-sub-phase contracts, test gates,
    the offload pattern.

You want to write a kernel against the frozen ABI. Read
papers/PPT-LAT-L1-ABI-v0.md then shannon-prime-system/include/sp/sp_l1.h
(the live header). Every backend registers via
sp_session_register_forward_backend (full-forward hook) or the
NTT-dispatch hook in core/poly_ring_bluestein/.

You want to add support for a new model family. Read
papers/PPT-LAT-SP-MODEL-v0.md (on-disk format) plus
shannon-prime-system-engine/tools/sp_transcode/sp_transcode.c (the
GGUF-to-.sp-model transcoder). Add an sp_arch_id and a
gemma3_forward_* / qwen3_forward_* arch path.

You want to add a peer to a running mesh. Read
papers/PPT-LAT-Systems.md §“DHT and sharded inference” then
shannon-prime-system-engine/tools/sp_daemon/src/network/quic_shard.rs.


5. Repository layout

shannon-prime-lattice/
├── papers/                            # the project's papers — read these first
│   ├── PPT-LAT-Theory.md              # math foundations + 13-step PPT substitution
│   ├── PPT-LAT-Systems.md             # six-layer architecture
│   ├── PPT-LAT-Roadmap.md             # implementation phases (living document)
│   ├── PPT-LAT-L1-ABI-v0.md           # frozen Layer-1 C ABI contract
│   ├── PPT-LAT-SP-MODEL-v0.md         # .sp-model / .sp-tokenizer on-disk format
│   ├── SESSION-CLOSED-lat-*.md        # per-sprint closure notes (audit trail)
│   └── SESSION-STATE-lat-*.md         # session-handoff snapshots
├── demos/                             # phase demos
├── frontends/                         # HTML mock-ups + bootstrap chat UIs
├── reference/                         # reference material (images, screenshots, PDFs)
├── scripts/                           # cross-repo helpers
├── tests/                             # integration tests spanning math-core + engine
└── prompt.md                          # bootstrap / context-priming for new sessions

The papers are the source of truth for design. The closure notes
are the source of truth for “what shipped, with what gate result.”
The roadmap is a living document and amendable; the theory paper is
amendable when reality contradicts it; the ABI and .sp-model papers
are frozen.


6. Hard rules

These rules are binding for any session that picks up the project. The
memory entries feedback-no-silent-gate-revisions,
feedback-lead-with-reference-then-theory, and
feedback-parallel-agents-separate-worktrees are also load-bearing.

  • Anti-contamination. Do NOT read, copy, or vendor code from the
    archived shannon-prime/ or shannon-prime-engine/ repos. The math
    papers under papers/PPT-ARM/ are conceptual reference — read for
    theory, never paste code. The lattice is a clean rebuild.
  • No silent gate revisions. If implementation can’t meet the spec’d
    gate, surface upstream. Do not retreat to a higher-level API, defer
    to an unrelated phase, or tune fixtures until the number passes.
    Adjustments land as roadmap amendments with rationale, not as
    footnotes on a PASS.
  • Honest closure notes. Every closure enumerates the test gates,
    their actual results, what was bundled vs isolated, and what changed
    vs spec. The session-closure pattern is the audit trail.
  • One math object. Lattice features must touch one of the
    distinguishing primitives in §1; otherwise they are drift. The
    manifesto trick list (reference-heterogeneous-soc-crt-tricks in
    the team’s memory) names ten such primitives. New sub-phases reference
    trick numbers rather than reinventing the framework.
  • Worktrees per concurrent agent. When dispatching 2+ agents on
    the same repo, each agent operates in its own worktree (created
    with git worktree add) to prevent cross-contamination of
    uncommitted files.

7. Where to read next

| If you want | Read |
|---|---|
| The math foundations | papers/PPT-LAT-Theory.md |
| The systems architecture | papers/PPT-LAT-Systems.md |
| The implementation roadmap (living) | papers/PPT-LAT-Roadmap.md |
| The frozen L1 C ABI contract | papers/PPT-LAT-L1-ABI-v0.md then shannon-prime-system/include/sp/sp_l1.h |
| The .sp-model on-disk format | papers/PPT-LAT-SP-MODEL-v0.md |
| The math-core library API | shannon-prime-system/README.md |
| The engine + daemon + HTTP API | shannon-prime-system-engine/README.md |
| What the most recent sprint shipped | papers/SESSION-CLOSED-*.md (lattice scope) or shannon-prime-system-engine/tools/sp_compute_skel/docs/CLOSURE-*.md (engine + DSP scope) |
| A bootstrap prompt for new sessions | prompt.md |

Shannon-Prime-Lattice reduces numerical and infrastructural entropy, but it does not thereby dissolve the classical philosophical problems of completeness, grounding, reference, representation, decidability, and semantic closure: for example, Gödel’s incompleteness theorems, Turing’s halting problem, Church’s undecidability of first-order logic, the Duhem-Quine thesis, and Quine’s theses on the inscrutability of reference and the underdetermination of knowledge. It relocates them into a discrete algebraic lattice architecture. This is the deepest issue. If the system begins to encode not only object-level data but also its own inference states, dominance relations, memory receipts, provenance, and correctness claims, it risks semantic self-reference. That is where closure problems arise: can the system fully represent, verify, and govern its own representational adequacy from inside the same lattice? The Gödel/Tarski/Turing family of concerns re-enters here.

You are correct that Shannon-Prime PPT ARM does not dissolve Gödel, Turing, or Quine. It is not an attempt to solve the epistemic problems of truth, reference, or semantic closure.

The goal of the Shannon-Prime Lattice is much more mechanical: we are solving the physical and informational drift caused by floating-point arithmetic in continuous architectures.

To address your concern about semantic self-reference and closure problems arising from the system encoding its own states, here is how the architecture structurally avoids that trap:

  1. Strict Separation of Substrate and Semantics: The Z_q Cyclotomic Ring is a purely syntactic, deterministic ALU. It doesn’t judge the “truth” or representational adequacy of what it computes. It just multiplies and adds discrete integers losslessly. The semantic orchestration (MTP verification, state rollbacks, Beatty routing) happens entirely outside the mathematical ring, managed by a completely separate L3 orchestrator (a Rust daemon). We do not ask the polynomial ring to prove its own consistency.

  2. Frozen Base and Append-Only Memory: The system does not recursively rewrite its own foundational logic. The base model weights are mathematically frozen. The continuous learning mechanisms (MEMO, Spinor receipts) function as an append-only cryptographic ledger of discrete integer offsets. Because applying these updates is strictly matrix addition in Z_q, it is fully commutative and associative. It accumulates context without initiating recursive self-modification.

  3. Transactional, Not Self-Referential: When the system evaluates a state (like verifying a Multi-Token Prediction draft), it is evaluating byte-exact integer equality, not subjective probability. If a draft fails, it triggers a hard, mechanical rollback to a previously committed Spinor block.
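Point 2 leans on addition in Z_q being commutative and associative. A minimal sketch of what that buys, assuming a ledger is simply a list of integer offset vectors (the function name and data layout here are hypothetical, not the daemon's actual receipt format): replaying the same offsets in any order lands on the identical state, whereas the same regrouping in floating point can change the last bit.

```python
import random

Q = 1073738753  # q_1 from the opening post

def apply_offsets(base, offsets, q=Q):
    """Fold a ledger of integer offset vectors into a base vector, element-wise mod q."""
    state = list(base)
    for delta in offsets:
        state = [(s + d) % q for s, d in zip(state, delta)]
    return state

rng = random.Random(0)
base = [rng.randrange(Q) for _ in range(8)]
ledger = [[rng.randrange(-1000, 1000) for _ in range(8)] for _ in range(50)]

# Replay once in append order and once in a shuffled order.
in_order = apply_offsets(base, ledger)
shuffled = ledger[:]
rng.shuffle(shuffled)
replayed = apply_offsets(base, shuffled)

# Floating-point contrast: the same three addends, two groupings.
float_a = (0.1 + 0.2) + 0.3
float_b = 0.1 + (0.2 + 0.3)

print(in_order == replayed)   # order does not matter in Z_q
print(float_a == float_b)     # grouping does matter for doubles
```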

You are absolutely right that if we tried to build a self-modifying, self-governing AGI entirely inside a single lattice, we would hit a Gödelian wall. By treating the lattice simply as a flawless, lossless engine and keeping state-management external, we avoid semantic self-reference. We aren’t trying to beat Turing; we just want to stop bleeding entropy into the hardware.

Agerico, following up on our discussion—we just concluded a round of physical silicon validation this week that I think perfectly illustrates the boundary between the philosophical traps you rightly point out, and how we are physically sidestepping them in the architecture.

When you mentioned the risks of the system managing its own ‘memory receipts, provenance, and correctness claims,’ the immediate engineering danger is that if a model has to semantically ‘understand’ its own memory to retrieve it, it falls into that exact recursive, undecidable trap.

We just finished wiring our Ring-2 memory architecture, which physically spills the model’s KV cache out of RAM and onto Intel Optane NVMe drives, completely decoupling context length from host memory. To retrieve that memory without triggering semantic collapse, here is what we proved on the hardware:

1. Routing via Geometry, Not Semantics:

To find a specific needle of information in a massive context window spilled to disk, the system does not ‘read’ or evaluate the semantics of the text. Instead, we deployed a ±1 Rademacher integer projection sidecar. It uses the Johnson-Lindenstrauss lemma to preserve the inner-product geometry of the attention vectors. The router just performs ultra-fast, discrete Z_q integer matching. It scored a perfect 8/8 retrieval at 10% depth of the context window, proving we can route ‘dominance’ purely through discrete geometry.

2. Physical Grounding (The NaN-Poisoned Cache):

To prove the system wasn’t hallucinating or cheating with residual RAM, we intentionally poisoned the Ring-1 RAM cache with NaN values for any token that was evicted to the Optane drive. If the model tried to evaluate its memory representations internally instead of reading the physical disk, the math would instantly explode. The model successfully retrieved the specific needles with 100% accuracy, proving the spill -> fetch -> decode -> attend pipeline is purely mechanical.

3. Dismantling the Compute Wall (18.86 µs latency):

By decoupling the query-head parallel loop from the KV fetch (a strict deduplication phase), we bypassed the OS page cache using FILE_FLAG_NO_BUFFERING and drove per-read latency down to 18.86 µs directly through the Windows kernel.
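The routing idea in item 1 can be sketched without any floating point. What follows is a toy model under stated assumptions, not the deployed sidecar: the dimensions (64-wide keys, a 16-row projection, 64 stored keys) and all names are chosen for illustration, and keys are random integers rather than real attention vectors. Each key is reduced to an integer sketch by a ±1 matrix, and the query is matched by exact signed squared cosine.

```python
import random
from fractions import Fraction

def rademacher_matrix(r, d, rng):
    """r x d matrix with independent entries drawn uniformly from {-1, +1}."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(r)]

def sketch(R, v):
    """Integer projection R @ v; exact because every operand is an int."""
    return [sum(rij * vj for rij, vj in zip(row, v)) for row in R]

def cos2_signed(a, b):
    """Exact signed squared cosine between two integer sketches."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a)
    nb = sum(y * y for y in b)
    return Fraction(dot * abs(dot), na * nb)

rng = random.Random(1234)
d, r, n_keys = 64, 16, 64          # assumed sizes, for illustration only
R = rademacher_matrix(r, d, rng)
keys = [[rng.randrange(-127, 128) for _ in range(d)] for _ in range(n_keys)]

needle = 37                         # index of the key we want back
query = keys[needle]

sketches = [sketch(R, k) for k in keys]
q_sketch = sketch(R, query)
scores = [cos2_signed(q_sketch, s) for s in sketches]
best = max(range(n_keys), key=lambda i: scores[i])
print(best, scores[best])
```

An exact match always scores 1, since a sketch is parallel to itself; a decoy reaches 1 only if its sketch happens to be a positive multiple of the query's. With a projection this narrow the Johnson-Lindenstrauss guarantee on approximate inner products is loose, so the sketch is best read as a router for exact or near-exact matches, not as a general similarity search.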

The takeaway for us is this: You are absolutely right that we cannot solve the Gödel/Tarski/Turing family of concerns from inside the lattice. So we don’t try. We treat memory retrieval not as a semantic evaluation, but as a pure, asynchronous I/O block-storage problem governed by integer projections. By keeping the math discrete and pushing the state-management to physical disk sectors, we let the physics do the work.

Clarifies much. I have been too much in a rush to comment and have taken the wrong perspective. Thanks. Agree, “let the physics work.”

A couple of corrections for the record, a way to reproduce the work, and a licensing note.

Tightening two numbers from my last post. In the spirit of the receipts-first discipline I keep claiming, I conflated two separate gates and undersold a third:

  • The 8/8 is the router in isolation — the ±1 Rademacher projection scored 8/8 needles at cosine 1.0 against an adversarial decoy set (B=64, r=16). Separately, the end-to-end NIAH decode gate retrieves the needle at depth 10%, 50%, and 90% (no recency bias). Two different gates; I ran them together last time.
  • The latency I quoted (18.86 µs) was an intermediate stage. The final IOCP + FILE_FLAG_NO_BUFFERING path is 7.57 µs/read. I undersold it.

For completeness, the rest of the envelope at 32k context: 910× resident KV-cache shrink (7.5 GB → 8.3 MB), 8× KV sparsification at +0.69% perplexity (measured at 2k context on one corpus; 2× and 4× go negative), and a reducing transcode that makes the on-disk model ~50% smaller with a bit-identical forward on both Gemma-3 and Qwen3.

Reproduce it from a command. I’ve put the work up as a receipts-first paper series — the rule is no number without a runnable command:

Landing page: Shannon-Prime — long-context KV memory you can run
Repo: nihilistau/Position_Is_Arithmetic (Prime Power Transformer: A Number-Theoretic Architecture for Compute)

git clone https://github.com/nihilistau/Position_Is_Arithmetic.git
cd Position_Is_Arithmetic
# 02 — the reducing loader: reproduces green now (6/6 format gates,
#      bit-faithful forward on gemma-3 + qwen3). See papers/02-reducing-loader/repro/
# 01 — two-ring memory: the needle-retrieval harness is in
#      papers/01-two-ring-memory/repro/ ; the 32k headline figures
#      land as that run completes.

Each paper carries its own repro/ with the exact invocation and an EXPECTED.md. Correctness reproduces on any NVMe; the latency figure is the only Optane-specific part.

Licensing. The AGPL-3.0 line in the top post is stale — we’re moving everything to MIT across all the repos. The papers repo above is already MIT; the code repos are following.

And thanks, Agerico — the closure pressure was the right thing to push on, even though the answer turned out to be “keep the lattice purely mechanical and let the disk do the remembering.”

Update — the receipts-first paper series grew three papers, and one of them required indicting an ecosystem

A lot has happened since the opening post. The short version: the public,
receipts-first paper series at
https://github.com/nihilistau/Position_Is_Arithmetic now carries papers
04, 05 and 06 — and finishing 06 forced us to root-cause something the
whole local-inference community is currently sitting on: every Gemma-4 GGUF
we could measure, including the post-fix rebuilds, carries broken weights.

The series discipline hasn’t changed: every number is a row in a shared
ledger with a command behind it, honest negatives stay on the record, and no
throughput number is citable without a quality gate on the same artifact.
That last rule is the reason this update exists.

Paper 04 — The Oracle & the Teacher (oracle-grounded backend verification)

  • What it solves: porting a complex architecture to new silicon without the
    weeks-long divergence hunt — and, it turns out, defending yourself when the
    reference implementation itself is wrong.
  • How: extract a bit-faithful CPU oracle from the reference first (scalar,
    readable, f64-accumulating), grade every backend against the oracle and
    never against a prior port, and gate autoregressive decode by
    teacher-forcing (the oracle re-predicts the port’s own generated stream).
    Receipt: a 35-layer variable-geometry MatFormer (per-layer attention
    widths, shared KV, proportional RoPE, softcap) matched its oracle at
    max KL 2.663e-10 (argmax 12/12), both live runs green first-try, 38/38.
  • How it fits: this is the verification layer for everything in §1 of the
    opening post — “byte-exact, not small-KL” is only meaningful if the thing
    you’re byte-exact against is itself proven. The paper’s case study is the
    strongest demonstration we have: when llama.cpp scored wikitext PPL 397–506
    on Gemma-4-12B and the ecosystem normalized it, a from-scratch forward
    written off the official safetensors + config alone measured 4.6776:
    the model was healthy, llama.cpp’s forward was exonerated (two independent
    engines agree per-artifact), and the GGUF artifacts themselves were
    convicted. An oracle is not a porting tool; it’s the only defense against a
    poisoned reference frame.
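The grading step in Paper 04 (max KL plus argmax agreement against an oracle) can be sketched as follows. This is a hypothetical harness, not the paper's code: it assumes both sets of logits were produced over the same token stream, as in the teacher-forcing setup described above, and the threshold value is an arbitrary placeholder.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def grade_port(oracle_logits, port_logits, max_kl=1e-6):
    """Compare per-position logits from an oracle and a port.

    Returns (worst KL, number of argmax agreements, pass/fail).
    max_kl is an assumed placeholder threshold.
    """
    worst, agree = 0.0, 0
    for o, p in zip(oracle_logits, port_logits):
        worst = max(worst, kl_divergence(softmax(o), softmax(p)))
        agree += int(o.index(max(o)) == p.index(max(p)))
    passed = worst <= max_kl and agree == len(oracle_logits)
    return worst, agree, passed

# Toy data: three positions, three-token vocabulary.
oracle = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2], [-0.5, 0.0, 1.5]]
good_port = [[x + 1e-7 * i for i, x in enumerate(row)] for row in oracle]
bad_port = [row[::-1] for row in oracle]   # scrambled vocabulary order

good = grade_port(oracle, good_port)
bad = grade_port(oracle, bad_port)
print(good, bad)
```

Grading every backend against the oracle, never against a previous port, keeps errors from compounding along a chain of ports.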

Paper 05 — The Probe Suite (bisection, isolation & benchmark hygiene as one set)

  • What it solves: the fact that correct numbers about computing systems are
    not read off — they are manufactured. The suite is how.
  • How: truncated-parity bisection, isolation sweeps, benchmark hygiene and
    oracle-rank telemetry, used together. Documented kills: a 12.65× phantom
    speedup (three stacked artifacts), a 2.8e-3 wrong-arithmetic localized in
    two probe runs, a mixed-precision 0/256 bug the isolated bench passed at
    1.34e-7, and a per-vector activation-quant collapse at oracle-rank 205,596
    on outlier-heavy activations (fixed with per-block scales aligned to the
    kernel’s 128-bit loads).
  • How it fits: the second half turns the same toolset outward, at ecosystem
    scale — tensor-class swap bisection over the broken GGUFs (restoring just
    the per-layer scale class recovered PPL 364→97; restoring norms made it
    worse, proving the matmul weights are damaged too), per-layer cosine
    forensics (no permutation; in-place damage with a period-6 layer
    signature), and simulate-before-build: six quantization recipes
    simulated through the proven reference forward before a line of CUDA
    existed, and the built artifact then matched the simulation to four
    decimal places (5.1259), with the GPU kernel agreeing as a third
    instrument (5.1160).
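The activation-quant collapse listed among the documented kills has a simple mechanism that a toy example makes visible. The sketch below is illustrative only: symmetric int8 quantization, a block size of 16 (16 int8 codes fill one 128-bit load, but the paper's actual block size is not stated here), and synthetic activations with a single outlier are all assumptions.

```python
def quantize(values, scale):
    """Symmetric int8 codes: round(v / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def per_vector_roundtrip(values):
    """One scale for the whole vector, set by its largest magnitude."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [c * scale for c in quantize(values, scale)]

def per_block_roundtrip(values, block=16):
    """One scale per block, so an outlier only degrades its own block."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        scale = max(abs(v) for v in chunk) / 127 or 1.0
        out.extend(c * scale for c in quantize(chunk, scale))
    return out

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# 128 small activations plus one large outlier in the first block.
acts = [0.01 * ((i % 7) - 3) for i in range(128)]
acts[5] = 1000.0

# Measure error only on the blocks that do not contain the outlier.
err_vec = max_abs_error(acts[32:], per_vector_roundtrip(acts)[32:])
err_blk = max_abs_error(acts[32:], per_block_roundtrip(acts)[32:])
print(err_vec, err_blk)
```

With a single per-vector scale the outlier stretches the quantization step so far that every small activation rounds to zero; with per-block scales the damage stays inside the outlier's own block.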

Paper 06 — Computing on the Zip File (the dp4a bandwidth ladder — complete, gated, citable)

  • What it solves: memory-bound decode on consumer silicon. The weights’
    byte count is the speed of light, but only if you compute directly on the
    packed integer codes — dequantizing to f32 scratch first measured 3×
    slower than plain f32.
  • How: warp-per-row __dp4a GEMV, 128-bit loads, in-ALU nibble unpack
    (~7% tax), exact integer accumulation, one Frobenius lift at the end —
    the isolated ladder runs f32 1× → int8 ~3.8× → Q4 ~7.06×, hugging the
    byte ratios. New this round: the OK_Q4B format (per-32-block f16
    scales, store-then-derive discipline) where one weight block is exactly
    one 128-bit chunk in the kernel — zero extra code-bus traffic — and the
    sovereign quantization pipeline: artifact values come from the official
    safetensors checkpoint, never from a GGUF, and every artifact gates
    against the paper-04 oracle before any throughput number is taken.
  • The headline, stated honestly: Gemma-4-12B at 26.1 tok/s and wikitext
    PPL 5.12 on an RTX 2060 12GB (graph path bit-exact, decode 256/256
    top-1, 24/24 gates, clocks pinned). llama.cpp-CUDA on the same card does
    31.29 tok/s, but at PPL 192–506, because its artifacts are broken.
    Engine-for-engine we move 18% more bytes/s (245 vs 207 GB/s effective);
    our artifact is heavier because it is the only mathematically intact
    4-bit Gemma-4-12B in existence. And in the spirit of the series: an
    earlier 34.2 tok/s headline is formally retired in the ledger —
    it was measured on an artifact that later failed the PPL gate. The rule
    caught our own number first.
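"Computing on the zip file" means the dot product runs on the packed integer codes and the scales are applied once, after accumulation. A scalar Python model of that idea is below. It is a sketch under assumptions, not the CUDA kernel: the nibble layout (low nibble first, two's-complement 4-bit codes), the single per-row scale, and all function names are hypothetical, and reading "one Frobenius lift at the end" as a single scale multiplication is an interpretation of the text above.

```python
def pack_nibbles(codes):
    """Pack signed 4-bit codes in [-8, 7] two per byte, low nibble first."""
    assert len(codes) % 2 == 0
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_nibble(byte, high):
    """Extract one signed 4-bit code from a packed byte."""
    n = (byte >> 4) & 0xF if high else byte & 0xF
    return n - 16 if n >= 8 else n

def packed_dot(packed_w, w_scale, x_codes, x_scale):
    """Dot product on packed codes: exact integer accumulation, one scaling step."""
    acc = 0                                   # exact integer accumulator
    for i, xc in enumerate(x_codes):
        acc += unpack_nibble(packed_w[i // 2], i % 2 == 1) * xc
    return acc, acc * (w_scale * x_scale)     # scales applied once, at the end

w_codes = [3, -8, 7, 0, -1, 5, -4, 2]         # 4-bit weight codes
x_codes = [12, -7, 100, 33, -128, 5, 64, -1]  # int8 activation codes
packed = pack_nibbles(w_codes)
acc, y = packed_dot(packed, 0.05, x_codes, 0.02)
print(len(packed), acc, y)
```

Because the accumulator is an integer, the sum is identical regardless of evaluation order; rounding enters only at the final multiplication by the scales.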

For anyone hitting the Gemma-4 quant weirdness themselves: we published a
standalone walkthrough — verify the breakage in ~30 minutes with an
engine-independent method, plus the quantization recipe that actually works
on this PTQ-hostile model (blanket 4-bit costs +45% PPL; 4-bit on the FFN
gate/up pair only with 8-bit elsewhere costs +9.6%):
https://github.com/nihilistau/Position_Is_Arithmetic/blob/main/GEMMA4-QUANT-FIX.md
All forensic instruments are MIT, ~130-line numpy/torch scripts, no GPU
required for the verification.

How this fits the lattice overall: the opening post’s thesis was that
floating-point drift and un-provable identity are entropy bleeding into the
hardware, and that a discrete substrate makes correctness a property you
prove rather than estimate. This round extended that doctrine one level up
the stack — to the artifacts. The same discipline that makes a kernel
byte-exact (oracle, gates, receipts) is what caught an interchange format
silently destroying weights while every smoke test stayed green. The
supply chain is now part of the math.

Papers, ledger, methodology, instruments:
nihilistau/Position_Is_Arithmetic

As always — the unflattering numbers are kept attached on purpose.