huihui Gemma 4 26B-A4B abliterated Β· EAGLE-3 draft model
β οΈ 2026-05-17 endpoint correction β the ~100 tok/s and ~2Γ speedup numbers in this card are measured on
/v1/completions(raw prompt, no chat template). A later paired bench showed that on production/v1/chat/completionsworkloads, this drafter delivers ~46 tok/s vs pure body ~40 tok/s (β +15% uplift), and vanilla MTP n=1 hits ~51 tok/s on the same workload. The acceptance-curve flattening this drafter achieves is real on raw output, but the per-position-acceptance and throughput numbers below don't translate 1:1 to chat workloads. Round 2 paired bench (with Chinese training data) will land in Part 31. Until then, for chat workloads consider usinggemma4-26b-a4b-it-assistant(vanilla MTP) withnum_speculative_tokens=4instead β Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s. See the Part 30 endpoint correction note for full paired numbers.
EAGLE-3 speculative-decoding draft model fine-tuned for huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated (and the FP8 quantized variant coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic).
Starts from RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained on the unmodified Gemma 4 body) and is fine-tuned for 1 epoch / 50K samples against the abliterated body's hidden-state distribution. The point is to restore deep-speculation acceptance that the vanilla draft loses once the body has been abliterated.
Why this exists
Vanilla MTP / EAGLE-3 drafters are trained against the vanilla gemma-4-26B-A4B-it body. When you swap the body for an abliterated one (refusal direction removed), the body's hidden-state distribution shifts and the draft model's predictions stop matching the body's actual outputs β especially at deeper speculation depths. Per-position acceptance collapses from ~65% (pos 0) to ~20% (pos 3) in a 4-token speculation window.
This drafter is fine-tuned to re-align with the abliterated body. On a same-stack inference benchmark (DGX Spark GB10, FP8 verifier, T=0.7, N=10 prompts, max_tokens=200), per-position acceptance recovers to a nearly-flat curve:
| Position | Vanilla draft (against abliterated body) | This drafter | Ξ |
|---|---|---|---|
| pos 0 | 65.6% | 84.4% | +18.8pp |
| pos 1 | 43.3% | 74.9% | +31.6pp |
| pos 2 | 29.2% | 74.1% | +44.9pp |
| pos 3 | 20.5% | 72.7% | +52.2pp |
Throughput at num_speculative_tokens=4 (on /v1/completions raw endpoint): ~100 tok/s vs ~50 tok/s with the vanilla draft. Same hardware, same prompts. See the endpoint correction above β these acceptance and throughput numbers come from the raw endpoint, not from chat completions.
Usage with vLLM
Requires vLLM with PR #41745 (Gemma 4 MTP / EAGLE-3 integration) merged β currently means building from main or using a preview image. PR is merged into main as of 2026-05-06; will land in the next minor release after v0.20.2.
vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
--speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
--enable-auto-tool-choice --tool-call-parser gemma4 \
--enable-prefix-caching \
--trust-remote-code
num_speculative_tokens sweep (DGX Spark GB10, FP8 verifier, batch=1):
| n | Throughput | pos-0 acceptance |
|---|---|---|
| 1 | 59.04 tok/s | 81.3% |
| 2 | 66.96 tok/s | 81.6% |
| 3 | 74.90 tok/s | 88.5% |
| 4 | 100.36 tok/s | 84.4% |
n=4 is the recommended setting on this hardware. Throughput keeps increasing through n=4 because deeper positions stay highly accepted (74/74/73% at pos 1/2/3), unlike the vanilla draft where deep speculation collapses.
Training details
- Framework:
vllm-project/speculatorsv0.5.0.dev0 - Hardware: NVIDIA GB10 (DGX Spark, sm_12.1), 121 GB unified memory, 273 GB/s
- Wall time: ~11 hours (vLLM extraction + EAGLE-3 fine-tune concurrent on a single GB10)
- Starting point:
RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3(vanilla-trained pretrained) - Training data:
Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filteredβ 50K instruction prompts regenerated through the FP8 abliterated body to produce on-distribution (prompt, response) pairs - Hyper-params: 1 epoch, total_seq_len=4096 (packed), TORCH_COMPILE_DISABLE=1, default LR schedule (cosine, warmup ~100 steps)
- Aux hidden-state layers: [2, 15, 27] (EAGLE-3 default)
- Trained spec depth (
ttt_steps): 3 (positions 0, 1, 2). At inference, pos 3 is extrapolated from the same drafter β empirically it works (72.7% acceptance), but this is outside the training distribution and may degrade on non-Magpie-shaped workloads.
Validation metrics (Magpie held-out, teacher-forced)
| Position | full_acc (val) | cond_acc (val) |
|---|---|---|
| pos 0 | 66.8% | 66.8% |
| pos 1 | 41.4% | 61.5% |
| pos 2 | 26.4% | 62.6% |
β οΈ Note: Validation full_acc is measured via teacher-forced argmax against Magpie ground-truth tokens β strict. Inference acceptance (table above) is measured via rejection-sampling against the body's actual sampling distribution at T=0.7 β looser. The 26.4% val pos-2 vs 74.1% inference pos-2 gap reflects this metric difference, not training failure. What matters for speculative decoding throughput is the inference number.
Limitations
- Production chat workloads see much smaller uplift than the headline numbers suggest. Paired bench (2026-05-17) on
/v1/chat/completions: this drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body (no spec) ~40 tok/s. Real uplift on chat β +15%, not +100%. For production chat use cases, vanilla MTPgemma4-26b-a4b-it-assistantwithnum_speculative_tokens=4outperforms this drafter (EN ~53 / ZH ~45 tok/s on chat). This drafter's headline numbers come from/v1/completionsraw endpoint, where instruct-tuned bodies produce degenerate output that small drafters can trivially predict. - Chinese (and likely other non-English) workloads are out-of-distribution. v1 was trained only on English Magpie data β paired bench shows ZH chat pos-0 acceptance drops to ~12% (vs vanilla MTP's ~57%). Round 2 (with Chinese training data + ttt_steps=4) is in flight; results will land in Part 31.
- Trained for 3 spec positions, not 4. Inference
num_speculative_tokens=4works well empirically but pos 3 is extrapolation.n=3is the safest "in-distribution" setting;n=4is what we recommend for max throughput on the tested hardware. - Training data is English instruction-style. Magpie-Llama-3.1 is English-leaning. Performance on Traditional Chinese, code-heavy, or domain-specific workloads may differ from the reported numbers. For Traditional Chinese workloads, Qwen 3.6 abliterated remains the better base model anyway (TMMLU+ 75% vs Gemma 4's 46%).
- Optimizer state not included. This release ships only the inference weights (
model.safetensors+config.json+config.py). To resume training, train from scratch or contact me. - One-epoch run. EAGLE-3 papers typically train for many more samples Γ epochs. Multi-epoch and/or larger training data may improve acceptance further.
License
Apache 2.0 β inherited from Gemma 4. The huihui base model is also Apache 2.0 per its model card.
Acknowledgements
- huihui-ai β for the abliterated base
Huihui-gemma-4-26B-A4B-it-abliteratedthat this drafter targets. - Google β for the Gemma 4 26B-A4B-it base model and the Apache 2.0 license.
- RedHatAI β for publishing
gemma-4-26B-A4B-it-speculator-eagle3, the vanilla-trained EAGLE-3 checkpoint we fine-tuned from. Saved us from training a drafter from scratch. - vllm-project/speculators team β for the training framework. During this fine-tune we found and patched an upstream bug (
create_empty_sampledefaulted to fp32 placeholders which crashed BF16 models on every vLLM extraction timeout). PR pending. - Magpie-Align β for the
Magpie-Llama-3.1-Pro-300K-Filteredinstruction dataset. - SafeAILab β for the EAGLE-3 architecture (NeurIPS'25) and reference implementation.
Related
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamicβ the FP8 verifier this drafter is paired with- Blog series β both languages, deep links:
- Downloads last month
- 46