huihui Gemma 4 26B-A4B abliterated Β· EAGLE-3 draft model

⚠️ 2026-05-17 endpoint correction β€” the ~100 tok/s and ~2Γ— speedup numbers in this card are measured on /v1/completions (raw prompt, no chat template). A later paired bench showed that on production /v1/chat/completions workloads, this drafter delivers ~46 tok/s vs pure body ~40 tok/s (β‰ˆ +15% uplift), and vanilla MTP n=1 hits ~51 tok/s on the same workload. The acceptance-curve flattening this drafter achieves is real on raw output, but the per-position-acceptance and throughput numbers below don't translate 1:1 to chat workloads. Round 2 paired bench (with Chinese training data) will land in Part 31. Until then, for chat workloads consider using gemma4-26b-a4b-it-assistant (vanilla MTP) with num_speculative_tokens=4 instead β€” Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s. See the Part 30 endpoint correction note for full paired numbers.

EAGLE-3 speculative-decoding draft model fine-tuned for huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated (and the FP8 quantized variant coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic).

Starts from RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained on the unmodified Gemma 4 body) and is fine-tuned for 1 epoch / 50K samples against the abliterated body's hidden-state distribution. The point is to restore deep-speculation acceptance that the vanilla draft loses once the body has been abliterated.

Why this exists

Vanilla MTP / EAGLE-3 drafters are trained against the vanilla gemma-4-26B-A4B-it body. When you swap the body for an abliterated one (refusal direction removed), the body's hidden-state distribution shifts and the draft model's predictions stop matching the body's actual outputs β€” especially at deeper speculation depths. Per-position acceptance collapses from ~65% (pos 0) to ~20% (pos 3) in a 4-token speculation window.

This drafter is fine-tuned to re-align with the abliterated body. On a same-stack inference benchmark (DGX Spark GB10, FP8 verifier, T=0.7, N=10 prompts, max_tokens=200), per-position acceptance recovers to a nearly-flat curve:

Position Vanilla draft (against abliterated body) This drafter Ξ”
pos 0 65.6% 84.4% +18.8pp
pos 1 43.3% 74.9% +31.6pp
pos 2 29.2% 74.1% +44.9pp
pos 3 20.5% 72.7% +52.2pp

Throughput at num_speculative_tokens=4 (on /v1/completions raw endpoint): ~100 tok/s vs ~50 tok/s with the vanilla draft. Same hardware, same prompts. See the endpoint correction above β€” these acceptance and throughput numbers come from the raw endpoint, not from chat completions.

Usage with vLLM

Requires vLLM with PR #41745 (Gemma 4 MTP / EAGLE-3 integration) merged β€” currently means building from main or using a preview image. PR is merged into main as of 2026-05-06; will land in the next minor release after v0.20.2.

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching \
  --trust-remote-code

num_speculative_tokens sweep (DGX Spark GB10, FP8 verifier, batch=1):

n Throughput pos-0 acceptance
1 59.04 tok/s 81.3%
2 66.96 tok/s 81.6%
3 74.90 tok/s 88.5%
4 100.36 tok/s 84.4%

n=4 is the recommended setting on this hardware. Throughput keeps increasing through n=4 because deeper positions stay highly accepted (74/74/73% at pos 1/2/3), unlike the vanilla draft where deep speculation collapses.

Training details

  • Framework: vllm-project/speculators v0.5.0.dev0
  • Hardware: NVIDIA GB10 (DGX Spark, sm_12.1), 121 GB unified memory, 273 GB/s
  • Wall time: ~11 hours (vLLM extraction + EAGLE-3 fine-tune concurrent on a single GB10)
  • Starting point: RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained pretrained)
  • Training data: Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered β€” 50K instruction prompts regenerated through the FP8 abliterated body to produce on-distribution (prompt, response) pairs
  • Hyper-params: 1 epoch, total_seq_len=4096 (packed), TORCH_COMPILE_DISABLE=1, default LR schedule (cosine, warmup ~100 steps)
  • Aux hidden-state layers: [2, 15, 27] (EAGLE-3 default)
  • Trained spec depth (ttt_steps): 3 (positions 0, 1, 2). At inference, pos 3 is extrapolated from the same drafter β€” empirically it works (72.7% acceptance), but this is outside the training distribution and may degrade on non-Magpie-shaped workloads.

Validation metrics (Magpie held-out, teacher-forced)

Position full_acc (val) cond_acc (val)
pos 0 66.8% 66.8%
pos 1 41.4% 61.5%
pos 2 26.4% 62.6%

⚠️ Note: Validation full_acc is measured via teacher-forced argmax against Magpie ground-truth tokens β€” strict. Inference acceptance (table above) is measured via rejection-sampling against the body's actual sampling distribution at T=0.7 β€” looser. The 26.4% val pos-2 vs 74.1% inference pos-2 gap reflects this metric difference, not training failure. What matters for speculative decoding throughput is the inference number.

Limitations

  • Production chat workloads see much smaller uplift than the headline numbers suggest. Paired bench (2026-05-17) on /v1/chat/completions: this drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body (no spec) ~40 tok/s. Real uplift on chat β‰ˆ +15%, not +100%. For production chat use cases, vanilla MTP gemma4-26b-a4b-it-assistant with num_speculative_tokens=4 outperforms this drafter (EN ~53 / ZH ~45 tok/s on chat). This drafter's headline numbers come from /v1/completions raw endpoint, where instruct-tuned bodies produce degenerate output that small drafters can trivially predict.
  • Chinese (and likely other non-English) workloads are out-of-distribution. v1 was trained only on English Magpie data β€” paired bench shows ZH chat pos-0 acceptance drops to ~12% (vs vanilla MTP's ~57%). Round 2 (with Chinese training data + ttt_steps=4) is in flight; results will land in Part 31.
  • Trained for 3 spec positions, not 4. Inference num_speculative_tokens=4 works well empirically but pos 3 is extrapolation. n=3 is the safest "in-distribution" setting; n=4 is what we recommend for max throughput on the tested hardware.
  • Training data is English instruction-style. Magpie-Llama-3.1 is English-leaning. Performance on Traditional Chinese, code-heavy, or domain-specific workloads may differ from the reported numbers. For Traditional Chinese workloads, Qwen 3.6 abliterated remains the better base model anyway (TMMLU+ 75% vs Gemma 4's 46%).
  • Optimizer state not included. This release ships only the inference weights (model.safetensors + config.json + config.py). To resume training, train from scratch or contact me.
  • One-epoch run. EAGLE-3 papers typically train for many more samples Γ— epochs. Multi-epoch and/or larger training data may improve acceptance further.

License

Apache 2.0 β€” inherited from Gemma 4. The huihui base model is also Apache 2.0 per its model card.

Acknowledgements

Related

Downloads last month
46
Safetensors
Model size
0.9B params
Tensor type
I64
Β·
BF16
Β·
BOOL
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft

Finetuned
(96)
this model