Instructions for using thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX with libraries and local apps.

Libraries

How to use thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX

Run Hermes

hermes

MLX LM

How to use thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
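
The same request can be made from Python using only the standard library. This is a minimal sketch, assuming the server is listening on port 8080 (the base URL used in the Pi and Hermes configurations) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.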

Gemma-4-26B-A4B-it - TurboQuant+ Config-I (MLX)

26B-parameter MoE compressed to 11 GB with Config-I mixed-precision quantization. Standard MLX format - works with stock mlx_lm and mlx-swift-lm. No custom loaders required.

Config-I quantization of google/gemma-4-26b-a4b-it (26B total, 128 experts, top-8 active). The policy applies aggressive 2-3 bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, keeps boundary layers at 8-bit, and leaves the MoE router in f16. See the Config-I paper for the policy derivation.

Status: Available for testing. Quality benchmarks (PPL, MMLU, NIAH, speed) are pending. Use at your own risk.

Compression

Variant	Size
bf16 source	~50 GB
Uniform MLX 4-bit	14 GB
Config-I (3.80 bpw)	11 GB
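
As a rough sanity check on these sizes, the arithmetic from average bits per weight to bytes can be sketched as below. The weight count of exactly 26e9 is an assumption (the true count may differ slightly), and `quantized_size` is an illustrative helper, not part of any library.

```python
def quantized_size(num_params, bits_per_weight):
    """Return (decimal GB, binary GiB) for a weight count and an average bits-per-weight."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9, total_bytes / 2**30


# Assumption: exactly 26e9 weights; the real parameter count may differ slightly.
NUM_PARAMS = 26e9

for label, bpw in [("bf16", 16.0), ("Config-I", 3.80)]:
    gb, gib = quantized_size(NUM_PARAMS, bpw)
    print(f"{label}: {gb:.1f} GB ({gib:.1f} GiB)")
```

Under that assumption, bf16 comes out near 52 GB and Config-I a little over 12 GB (about 11.5 GiB), in line with the table once rounding and the choice of unit are taken into account.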

Config-I Policy (Gemma-4 MoE Adaptation)

128 experts, top-8 active per token. 30 layers with mixed sliding/full attention.

Component	Bits	Layers	Rationale
Expert MLP gate/up	2-bit	middle 26	98%+ of params, MoE-tolerant
Expert MLP down	3-bit	middle 26	Write-back sensitivity (Config-I finding)
Attention Q/K/V/O	4-bit	middle 26	Uniform per layer
Boundary (all tensors)	8-bit	first 2 + last 2	Boundary layer protection
MoE router	f16	all	Routing precision critical
Embeddings + lm_head	8-bit	-	Protected
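
The table can be read as a lookup from tensor role and layer index to bit width. The sketch below encodes that reading in plain Python. The role labels and the function name `config_i_bits` are hypothetical, and the code assumes the router stays in f16 even inside the 8-bit boundary layers, as the "all" entry in the router row suggests. It is not the actual quantization script.

```python
NUM_LAYERS = 30  # layer count from the policy description
BOUNDARY = 2     # first 2 and last 2 layers are protected


def config_i_bits(role, layer_idx=None, num_layers=NUM_LAYERS):
    """Return the bit width the table assigns to a tensor role at a given layer.

    `role` is one of the hypothetical labels: "router", "embedding", "lm_head",
    "attention", "expert_gate_up", "expert_down".
    """
    if role == "router":
        return 16  # kept in f16 at every layer
    if role in ("embedding", "lm_head"):
        return 8
    if layer_idx is None:
        raise ValueError("layer_idx is required for per-layer tensors")
    if layer_idx < BOUNDARY or layer_idx >= num_layers - BOUNDARY:
        return 8  # boundary layers: all tensors at 8-bit
    middle = {"attention": 4, "expert_gate_up": 2, "expert_down": 3}
    return middle[role]
```

For example, `config_i_bits("expert_down", 15)` returns 3 while `config_i_bits("expert_down", 0)` returns 8, and exactly 26 of the 30 layers fall into the middle band.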

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ dramatically in compression sensitivity. The key insight is that compression policy matters more than compression math: which tensors to compress, which to protect, and how aggressively.

For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a fraction of experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.

Config-I has been validated on MiniMax M2.7 (93.5% MMLU, PPL 4.604, 12/12 NIAH) and across the Qwen and Phi model families, where it reduced size by 27-38% at a PPL increase of 1.0-3.9%. See the MiniMax M2.7 Config-I results for a fully benchmarked reference.

Compatibility

Field	Value
Format	MLX safetensors (standard)
Avg bits	3.798 bpw
Runtime	`mlx_lm` (Python), `mlx-swift-lm` (Swift)
Platform	Apple Silicon (M-series with 16GB+)
Quantized on	2026-04-15

No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
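
To see which bit widths a checkpoint actually uses, the quantization metadata in config.json can be tallied. This is a sketch under an assumed layout: a top-level "quantization" object whose scalar entries are global defaults and whose dict-valued entries are per-module overrides keyed by module path. The sample below is a hypothetical excerpt with made-up module paths, not copied from this repository, and `summarize_quantization` is an illustrative helper.

```python
from collections import Counter


def summarize_quantization(config):
    """Count per-module overrides by bit width in a parsed config.json dict."""
    quant = config.get("quantization", {})
    counts = Counter()
    for value in quant.values():
        # Assumed layout: scalars are global defaults, dicts are per-module overrides.
        if isinstance(value, dict) and "bits" in value:
            counts[value["bits"]] += 1
    return counts


# Hypothetical excerpt for illustration; real module paths will differ.
sample = {
    "quantization": {
        "group_size": 64,
        "bits": 4,
        "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
        "model.layers.5.mlp.gate_proj": {"group_size": 64, "bits": 2},
        "model.layers.5.mlp.down_proj": {"group_size": 64, "bits": 3},
    }
}
print(summarize_quantization(sample))
```

For a real checkpoint, load the downloaded config.json with `json.load` and pass the resulting dict to the same function.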

How to Run

Python (mlx_lm)

pip install mlx-lm
python -m mlx_lm.generate --model thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX --prompt "Hello"

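from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX")
# Recent mlx-lm releases take sampling settings through a sampler object, not temp/top_p keyword arguments
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))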

Swift (mlx-swift-lm)

import MLXLLM

let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/Gemma-4-26B-A4B-it-ConfigI-MLX"))

// tokenArray is a placeholder for the prompt's token IDs; tokenization is not shown here
let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))
