Gemma-4-26B-A4B-it - TurboQuant+ Config-I (MLX)

26B-parameter MoE compressed to 11 GB with Config-I mixed-precision quantization. Standard MLX format - works with stock mlx_lm and mlx-swift-lm. No custom loaders required.

Config-I quantization of google/gemma-4-26b-a4b-it (26B total, 128 experts, top-8 active). The policy applies aggressive 2-3 bit compression to expert MLPs (where MoE is most tolerant), keeps attention at 4-bit, holds boundary layers, embeddings, and lm_head at 8-bit, and leaves the MoE router unquantized at f16. See the Config-I paper for the policy derivation.

Status: Available for testing. Quality benchmarks (PPL, MMLU, NIAH, speed) are pending. Use at your own risk.

Compression

| Variant | Size |
| --- | --- |
| bf16 source | ~50 GB |
| Uniform MLX 4-bit | 14 GB |
| Config-I (3.80 bpw) | 11 GB |
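
The sizes above can be sanity-checked from parameter count and average bits per weight. The sketch below is an estimate only. It assumes 26e9 weights, treats uniform MLX 4-bit as about 4.5 bpw (4-bit values plus a 16-bit scale and a 16-bit bias per group of 64 weights, which is an assumption about the settings used), and reports GiB because the card does not say whether its figures are GB or GiB. The helper name `estimated_size_gib` is made up for illustration.

```python
def estimated_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given average bit width."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 26e9  # total parameters, from the model card

# Assumed: 4-bit values + fp16 scale and bias per 64-weight group = 4.5 bpw.
UNIFORM_4BIT_BPW = 4 + (16 + 16) / 64

sizes = {
    "bf16 source": estimated_size_gib(N_PARAMS, 16),
    "Uniform MLX 4-bit": estimated_size_gib(N_PARAMS, UNIFORM_4BIT_BPW),
    "Config-I": estimated_size_gib(N_PARAMS, 3.798),  # avg bpw from the card
}

for name, gib in sizes.items():
    print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions the estimates come out near 48, 13.6, and 11.5 GiB, which is in line with the rounded figures in the table.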

Config-I Policy (Gemma-4 MoE Adaptation)

128 experts, top-8 active per token. 30 layers with mixed sliding/full attention.

| Component | Bits | Layers | Rationale |
| --- | --- | --- | --- |
| Expert MLP gate/up | 2-bit | middle 26 | 98%+ of params, MoE-tolerant |
| Expert MLP down | 3-bit | middle 26 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 26 | Uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| MoE router | f16 | all | Routing precision critical |
| Embeddings + lm_head | 8-bit | - | Protected |
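
The policy above amounts to a lookup from tensor role and layer position to a bit width. The following is a minimal sketch of that lookup, not the code used to produce this checkpoint. The tensor path strings (`router`, `gate_proj`, `up_proj`, `down_proj`, `layers.N.`) and the function name `config_i_bits` are hypothetical and may not match the real Gemma-4 module names. Anything in a middle layer that is not an expert projection is treated as attention, which is a simplification.

```python
import re
from typing import Optional

N_LAYERS = 30  # from the card
BOUNDARY = 2   # first 2 + last 2 layers are protected


def config_i_bits(path: str, n_layers: int = N_LAYERS,
                  boundary: int = BOUNDARY) -> Optional[int]:
    """Return the bit width for a tensor path, or None to leave it unquantized."""
    # Router stays at f16 in every layer, so check it before anything else.
    if "router" in path:
        return None
    match = re.search(r"layers\.(\d+)\.", path)
    if match is None:
        return 8  # no layer index: embeddings or lm_head
    layer = int(match.group(1))
    if layer < boundary or layer >= n_layers - boundary:
        return 8  # boundary layers: all tensors at 8-bit
    if "gate_proj" in path or "up_proj" in path:
        return 2  # expert MLP read projections
    if "down_proj" in path:
        return 3  # expert MLP write-back projection
    return 4      # remaining middle-layer tensors (attention Q/K/V/O)


for p in ["model.embed_tokens",
          "model.layers.0.self_attn.q_proj",
          "model.layers.12.experts.gate_proj",
          "model.layers.12.experts.down_proj",
          "model.layers.12.self_attn.k_proj",
          "model.layers.12.router.proj"]:
    print(p, "->", config_i_bits(p))
```

Checking the router first matters: it is the one component the table marks as protected in all layers, including the boundary layers that are otherwise 8-bit.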

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ sharply in compression sensitivity. The key insight: compression policy matters more than compression math. What counts is which tensors to compress, which to protect, and how aggressively.

For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a fraction of experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.

Config-I has been validated on MiniMax M2.7 (93.5% MMLU, PPL 4.604, 12/12 NIAH) and across the Qwen and Phi model families, where it reduced size by 27-38% at a PPL increase of 1.0-3.9%. See MiniMax M2.7 Config-I results for a fully benchmarked reference.

Compatibility

| Field | Value |
| --- | --- |
| Format | MLX safetensors (standard) |
| Avg bits | 3.798 bpw |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Platform | Apple Silicon (M-series with 16 GB+) |
| Quantized on | 2026-04-15 |

No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
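
To illustrate what "config.json quantization metadata" can look like for a mixed-precision checkpoint, the sketch below tallies per-layer bit widths from a config dictionary. It assumes a layout with a `quantization` object that holds default `group_size` and `bits` plus per-module override entries keyed by module path. The sample dictionary, its module paths, and the convention that a `False` entry means "left unquantized" are illustrative assumptions, not a dump of this repository's config.json.

```python
from collections import Counter


def summarize_quantization(config: dict):
    """Return (default_bits, counts of per-module overrides by bit width)."""
    quant = config.get("quantization", {})
    default_bits = quant.get("bits")
    counts = Counter()
    for key, value in quant.items():
        if isinstance(value, dict) and "bits" in value:
            counts[value["bits"]] += 1      # per-module override
        elif value is False:
            counts["unquantized"] += 1      # assumed: False = keep full precision
    return default_bits, dict(counts)


# Hypothetical sample; real keys depend on the model's module names.
sample_config = {
    "quantization": {
        "group_size": 64,
        "bits": 4,
        "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
        "model.layers.5.experts.gate_proj": {"group_size": 64, "bits": 2},
        "model.layers.5.experts.up_proj": {"group_size": 64, "bits": 2},
        "model.layers.5.experts.down_proj": {"group_size": 64, "bits": 3},
        "model.layers.5.router": False,
    }
}

default_bits, overrides = summarize_quantization(sample_config)
print("default bits:", default_bits)
print("overrides:", overrides)
```

To inspect a real checkpoint, load its config.json with `json.load` and pass the resulting dictionary to the same function.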

How to Run

Python (mlx_lm)

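pip install mlx-lm
python -m mlx_lm.generate --model thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX --prompt "Hello"

# Python API. Recent mlx_lm releases take sampling settings through a sampler
# object instead of temp/top_p keyword arguments on generate().
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load("thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX")
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))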

Swift (mlx-swift-lm)

import MLXLLM

let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"))

let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))

Links


Quantized by @thetom-ai | GitHub | X | Sponsor
