docs: add Buy Me a Coffee link to model card

8a4e249 verified about 10 hours ago

11 kB

license: apache-2.0
base_model:
  - huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
  - google/gemma-4-26B-A4B-it
library_name: speculators
tags:
  - eagle3
  - speculative-decoding
  - gemma4
  - abliterated
  - draft-model
  - vllm
  - speculators

huihui Gemma 4 26B-A4B abliterated · EAGLE-3 draft model

⚠️ 2026-05-17 endpoint correction — the ~100 tok/s and ~2× speedup numbers in this card are measured on /v1/completions (raw prompt, no chat template). A later paired bench showed that on production /v1/chat/completions workloads, this drafter delivers ~46 tok/s vs pure body ~40 tok/s (≈ +15% uplift), and vanilla MTP n=1 hits ~51 tok/s on the same workload. The acceptance-curve flattening this drafter achieves is real on raw output, but the per-position-acceptance and throughput numbers below don't translate 1:1 to chat workloads. Round 2 paired bench (with Chinese training data) will land in Part 31. Until then, for chat workloads consider using gemma4-26b-a4b-it-assistant (vanilla MTP) with num_speculative_tokens=4 instead — Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s. See the Part 30 endpoint correction note for full paired numbers.

EAGLE-3 speculative-decoding draft model fine-tuned for huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated (and the FP8 quantized variant coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic).

Starts from RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained on the unmodified Gemma 4 body) and is fine-tuned for 1 epoch / 50K samples against the abliterated body's hidden-state distribution. The point is to restore deep-speculation acceptance that the vanilla draft loses once the body has been abliterated.

Why this exists

Vanilla MTP / EAGLE-3 drafters are trained against the vanilla gemma-4-26B-A4B-it body. When you swap the body for an abliterated one (refusal direction removed), the body's hidden-state distribution shifts and the draft model's predictions stop matching the body's actual outputs — especially at deeper speculation depths. Per-position acceptance collapses from ~65% (pos 0) to ~20% (pos 3) in a 4-token speculation window.

This drafter is fine-tuned to re-align with the abliterated body. On a same-stack inference benchmark (DGX Spark GB10, FP8 verifier, T=0.7, N=10 prompts, max_tokens=200), per-position acceptance recovers to a nearly-flat curve:

Position	Vanilla draft (against abliterated body)	This drafter	Δ
pos 0	65.6%	84.4%	+18.8pp
pos 1	43.3%	74.9%	+31.6pp
pos 2	29.2%	74.1%	+44.9pp
pos 3	20.5%	72.7%	+52.2pp

Throughput at num_speculative_tokens=4 (on /v1/completions raw endpoint): ~100 tok/s vs ~50 tok/s with the vanilla draft. Same hardware, same prompts. See the endpoint correction above — these acceptance and throughput numbers come from the raw endpoint, not from chat completions.

Usage with vLLM

Requires vLLM with PR #41745 (Gemma 4 MTP / EAGLE-3 integration) merged — currently means building from main or using a preview image. PR is merged into main as of 2026-05-06; will land in the next minor release after v0.20.2.

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching \
  --trust-remote-code

num_speculative_tokens sweep (DGX Spark GB10, FP8 verifier, batch=1):

n	Throughput	pos-0 acceptance
1	59.04 tok/s	81.3%
2	66.96 tok/s	81.6%
3	74.90 tok/s	88.5%
4	100.36 tok/s	84.4%

n=4 is the recommended setting on this hardware. Throughput keeps increasing through n=4 because deeper positions stay highly accepted (74/74/73% at pos 1/2/3), unlike the vanilla draft where deep speculation collapses.

Training details

Framework: vllm-project/speculators v0.5.0.dev0
Hardware: NVIDIA GB10 (DGX Spark, sm_12.1), 121 GB unified memory, 273 GB/s
Wall time: ~11 hours (vLLM extraction + EAGLE-3 fine-tune concurrent on a single GB10)
Starting point: RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3 (vanilla-trained pretrained)
Training data: Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered — 50K instruction prompts regenerated through the FP8 abliterated body to produce on-distribution (prompt, response) pairs
Hyper-params: 1 epoch, total_seq_len=4096 (packed), TORCH_COMPILE_DISABLE=1, default LR schedule (cosine, warmup ~100 steps)
Aux hidden-state layers: [2, 15, 27] (EAGLE-3 default)
Trained spec depth (ttt_steps): 3 (positions 0, 1, 2). At inference, pos 3 is extrapolated from the same drafter — empirically it works (72.7% acceptance), but this is outside the training distribution and may degrade on non-Magpie-shaped workloads.

Validation metrics (Magpie held-out, teacher-forced)

Position	full_acc (val)	cond_acc (val)
pos 0	66.8%	66.8%
pos 1	41.4%	61.5%
pos 2	26.4%	62.6%

⚠️ Note: Validation full_acc is measured via teacher-forced argmax against Magpie ground-truth tokens — strict. Inference acceptance (table above) is measured via rejection-sampling against the body's actual sampling distribution at T=0.7 — looser. The 26.4% val pos-2 vs 74.1% inference pos-2 gap reflects this metric difference, not training failure. What matters for speculative decoding throughput is the inference number.

Limitations

Production chat workloads see much smaller uplift than the headline numbers suggest. Paired bench (2026-05-17) on /v1/chat/completions: this drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body (no spec) ~40 tok/s. Real uplift on chat ≈ +15%, not +100%. For production chat use cases, vanilla MTP gemma4-26b-a4b-it-assistant with num_speculative_tokens=4 outperforms this drafter (EN ~53 / ZH ~45 tok/s on chat). This drafter's headline numbers come from /v1/completions raw endpoint, where instruct-tuned bodies produce degenerate output that small drafters can trivially predict.
Chinese (and likely other non-English) workloads are out-of-distribution. v1 was trained only on English Magpie data — paired bench shows ZH chat pos-0 acceptance drops to ~12% (vs vanilla MTP's ~57%). Round 2 (with Chinese training data + ttt_steps=4) is in flight; results will land in Part 31.
Trained for 3 spec positions, not 4. Inference num_speculative_tokens=4 works well empirically but pos 3 is extrapolation. n=3 is the safest "in-distribution" setting; n=4 is what we recommend for max throughput on the tested hardware.
Training data is English instruction-style. Magpie-Llama-3.1 is English-leaning. Performance on Traditional Chinese, code-heavy, or domain-specific workloads may differ from the reported numbers. For Traditional Chinese workloads, Qwen 3.6 abliterated remains the better base model anyway (TMMLU+ 75% vs Gemma 4's 46%).
Optimizer state not included. This release ships only the inference weights (model.safetensors + config.json + config.py). To resume training, train from scratch or contact me.
One-epoch run. EAGLE-3 papers typically train for many more samples × epochs. Multi-epoch and/or larger training data may improve acceptance further.

License

Apache 2.0 — inherited from Gemma 4. The huihui base model is also Apache 2.0 per its model card.

Acknowledgements

huihui-ai — for the abliterated base Huihui-gemma-4-26B-A4B-it-abliterated that this drafter targets.
Google — for the Gemma 4 26B-A4B-it base model and the Apache 2.0 license.
RedHatAI — for publishing gemma-4-26B-A4B-it-speculator-eagle3, the vanilla-trained EAGLE-3 checkpoint we fine-tuned from. Saved us from training a drafter from scratch.
vllm-project/speculators team — for the training framework. During this fine-tune we found and patched an upstream bug (create_empty_sample defaulted to fp32 placeholders which crashed BF16 models on every vLLM extraction timeout). PR pending.
Magpie-Align — for the Magpie-Llama-3.1-Pro-300K-Filtered instruction dataset.
SafeAILab — for the EAGLE-3 architecture (NeurIPS'25) and reference implementation.

coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic — the FP8 verifier this drafter is paired with
Blog series — both languages, deep links:
- Part 30 (this drafter, results) — English / 繁中
- Part 29 (n=1 deploy recipe, +34% out of the box) — English / 繁中
- Part 28 (mechanism — why vanilla draft can't track an abliterated body at depth) — English / 繁中

☕ If this saved you GPU hours, you can buy me a coffee.

coolthor
/

Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft