coolthor's picture
docs: add Buy Me a Coffee link to model card
8a4e249 verified
metadata
license: apache-2.0
base_model:
  - huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
  - google/gemma-4-26B-A4B-it
library_name: speculators
tags:
  - eagle3
  - speculative-decoding
  - gemma4
  - abliterated
  - draft-model
  - vllm
  - speculators

huihui Gemma 4 26B-A4B abliterated · EAGLE-3 draft model

⚠️ 2026-05-17 endpoint correction — the ~100 tok/s and ~2× speedup numbers in this card are measured on /v1/completions (raw prompt, no chat template). A later paired bench showed that on production /v1/chat/completions workloads, this drafter delivers ~46 tok/s vs pure body ~40 tok/s (≈ +15% uplift), and vanilla MTP n=1 hits ~51 tok/s on the same workload. The acceptance-curve flattening this drafter achieves is real on raw output, but the per-position-acceptance and throughput numbers below don't translate 1:1 to chat workloads. Round 2 paired bench (with Chinese training data) will land in Part 31. Until then, for chat workloads consider using gemma4-26b-a4b-it-assistant (vanilla MTP) with num_speculative_tokens=4 instead — Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s. See the Part 30 endpoint correction note for full paired numbers.

EAGLE-3 speculative-decoding draft model fine-tuned for huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated (and the FP8 quantized variant coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic).

Starts from RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained on the unmodified Gemma 4 body) and is fine-tuned for 1 epoch / 50K samples against the abliterated body's hidden-state distribution. The point is to restore deep-speculation acceptance that the vanilla draft loses once the body has been abliterated.

Why this exists

Vanilla MTP / EAGLE-3 drafters are trained against the vanilla gemma-4-26B-A4B-it body. When you swap the body for an abliterated one (refusal direction removed), the body's hidden-state distribution shifts and the draft model's predictions stop matching the body's actual outputs — especially at deeper speculation depths. Per-position acceptance collapses from ~65% (pos 0) to ~20% (pos 3) in a 4-token speculation window.

This drafter is fine-tuned to re-align with the abliterated body. On a same-stack inference benchmark (DGX Spark GB10, FP8 verifier, T=0.7, N=10 prompts, max_tokens=200), per-position acceptance recovers to a nearly-flat curve:

Position Vanilla draft (against abliterated body) This drafter Δ
pos 0 65.6% 84.4% +18.8pp
pos 1 43.3% 74.9% +31.6pp
pos 2 29.2% 74.1% +44.9pp
pos 3 20.5% 72.7% +52.2pp

Throughput at num_speculative_tokens=4 (on /v1/completions raw endpoint): ~100 tok/s vs ~50 tok/s with the vanilla draft. Same hardware, same prompts. See the endpoint correction above — these acceptance and throughput numbers come from the raw endpoint, not from chat completions.

Usage with vLLM

Requires vLLM with PR #41745 (Gemma 4 MTP / EAGLE-3 integration) merged — currently means building from main or using a preview image. PR is merged into main as of 2026-05-06; will land in the next minor release after v0.20.2.

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching \
  --trust-remote-code

num_speculative_tokens sweep (DGX Spark GB10, FP8 verifier, batch=1):

n Throughput pos-0 acceptance
1 59.04 tok/s 81.3%
2 66.96 tok/s 81.6%
3 74.90 tok/s 88.5%
4 100.36 tok/s 84.4%

n=4 is the recommended setting on this hardware. Throughput keeps increasing through n=4 because deeper positions stay highly accepted (74/74/73% at pos 1/2/3), unlike the vanilla draft where deep speculation collapses.

Training details

  • Framework: vllm-project/speculators v0.5.0.dev0
  • Hardware: NVIDIA GB10 (DGX Spark, sm_12.1), 121 GB unified memory, 273 GB/s
  • Wall time: ~11 hours (vLLM extraction + EAGLE-3 fine-tune concurrent on a single GB10)
  • Starting point: RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained pretrained)
  • Training data: Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered — 50K instruction prompts regenerated through the FP8 abliterated body to produce on-distribution (prompt, response) pairs
  • Hyper-params: 1 epoch, total_seq_len=4096 (packed), TORCH_COMPILE_DISABLE=1, default LR schedule (cosine, warmup ~100 steps)
  • Aux hidden-state layers: [2, 15, 27] (EAGLE-3 default)
  • Trained spec depth (ttt_steps): 3 (positions 0, 1, 2). At inference, pos 3 is extrapolated from the same drafter — empirically it works (72.7% acceptance), but this is outside the training distribution and may degrade on non-Magpie-shaped workloads.

Validation metrics (Magpie held-out, teacher-forced)

Position full_acc (val) cond_acc (val)
pos 0 66.8% 66.8%
pos 1 41.4% 61.5%
pos 2 26.4% 62.6%

⚠️ Note: Validation full_acc is measured via teacher-forced argmax against Magpie ground-truth tokens — strict. Inference acceptance (table above) is measured via rejection-sampling against the body's actual sampling distribution at T=0.7 — looser. The 26.4% val pos-2 vs 74.1% inference pos-2 gap reflects this metric difference, not training failure. What matters for speculative decoding throughput is the inference number.

Limitations

  • Production chat workloads see much smaller uplift than the headline numbers suggest. Paired bench (2026-05-17) on /v1/chat/completions: this drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body (no spec) ~40 tok/s. Real uplift on chat ≈ +15%, not +100%. For production chat use cases, vanilla MTP gemma4-26b-a4b-it-assistant with num_speculative_tokens=4 outperforms this drafter (EN ~53 / ZH ~45 tok/s on chat). This drafter's headline numbers come from /v1/completions raw endpoint, where instruct-tuned bodies produce degenerate output that small drafters can trivially predict.
  • Chinese (and likely other non-English) workloads are out-of-distribution. v1 was trained only on English Magpie data — paired bench shows ZH chat pos-0 acceptance drops to ~12% (vs vanilla MTP's ~57%). Round 2 (with Chinese training data + ttt_steps=4) is in flight; results will land in Part 31.
  • Trained for 3 spec positions, not 4. Inference num_speculative_tokens=4 works well empirically but pos 3 is extrapolation. n=3 is the safest "in-distribution" setting; n=4 is what we recommend for max throughput on the tested hardware.
  • Training data is English instruction-style. Magpie-Llama-3.1 is English-leaning. Performance on Traditional Chinese, code-heavy, or domain-specific workloads may differ from the reported numbers. For Traditional Chinese workloads, Qwen 3.6 abliterated remains the better base model anyway (TMMLU+ 75% vs Gemma 4's 46%).
  • Optimizer state not included. This release ships only the inference weights (model.safetensors + config.json + config.py). To resume training, train from scratch or contact me.
  • One-epoch run. EAGLE-3 papers typically train for many more samples × epochs. Multi-epoch and/or larger training data may improve acceptance further.

License

Apache 2.0 — inherited from Gemma 4. The huihui base model is also Apache 2.0 per its model card.

Acknowledgements

Related


☕ If this saved you GPU hours, you can buy me a coffee.