---
license: apache-2.0
base_model:
- huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
- google/gemma-4-26B-A4B-it
library_name: speculators
tags:
- eagle3
- speculative-decoding
- gemma4
- abliterated
- draft-model
- vllm
- speculators
---

# huihui Gemma 4 26B-A4B abliterated · EAGLE-3 draft model

> ⚠️ **2026-05-17 endpoint correction:** the ~100 tok/s and ~2× speedup numbers in this card are measured on `/v1/completions` (raw prompt, no chat template). A later paired bench showed that on **production `/v1/chat/completions` workloads, this drafter delivers ~46 tok/s vs pure body ~40 tok/s (≈ +15% uplift), and vanilla MTP n=1 hits ~51 tok/s on the same workload**. The acceptance-curve flattening this drafter achieves is real on raw output, but the per-position acceptance and throughput numbers below don't translate 1:1 to chat workloads. Round 2 paired bench (with Chinese training data) will land in Part 31. Until then, **for chat workloads consider using `gemma4-26b-a4b-it-assistant` (vanilla MTP) with `num_speculative_tokens=4` instead: Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s**. See the [Part 30 endpoint correction note](https://ai-muninn.com/en/blog/dgx-spark-eagle3-finetune-abliterated-round1) for full paired numbers.

EAGLE-3 speculative-decoding draft model **fine-tuned for [`huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated`](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated)** (and the FP8 quantized variant [`coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic`](https://huggingface.co/coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic)). Starts from [`RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3`](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3) (vanilla-trained on the unmodified Gemma 4 body) and is fine-tuned for 1 epoch / 50K samples against the abliterated body's hidden-state distribution.
The point is to **restore deep-speculation acceptance** that the vanilla draft loses once the body has been abliterated.

## Why this exists

Vanilla MTP / EAGLE-3 drafters are trained against the vanilla `gemma-4-26B-A4B-it` body. When you swap the body for an abliterated one (refusal direction removed), the body's hidden-state distribution shifts and the draft model's predictions stop matching the body's actual outputs, especially at deeper speculation depths. Per-position acceptance collapses from ~65% (pos 0) to ~20% (pos 3) in a 4-token speculation window.

This drafter is fine-tuned to re-align with the abliterated body. On a same-stack inference benchmark (DGX Spark GB10, FP8 verifier, T=0.7, N=10 prompts, max_tokens=200), per-position acceptance recovers to a nearly flat curve:

| Position | Vanilla draft (against abliterated body) | **This drafter** | Δ |
|---|---|---|---|
| pos 0 | 65.6% | **84.4%** | +18.8pp |
| pos 1 | 43.3% | **74.9%** | +31.6pp |
| pos 2 | 29.2% | **74.1%** | +44.9pp |
| pos 3 | 20.5% | **72.7%** | +52.2pp |

Throughput at `num_speculative_tokens=4` (on the raw `/v1/completions` endpoint): **~100 tok/s** vs ~50 tok/s with the vanilla draft. Same hardware, same prompts. **See the endpoint correction above: these acceptance and throughput numbers come from the raw endpoint, not from chat completions.**

## Usage with vLLM

Requires vLLM with PR [#41745](https://github.com/vllm-project/vllm/pull/41745) (Gemma 4 MTP / EAGLE-3 integration) merged, which currently means building from `main` or using a preview image. The PR is merged into `main` as of 2026-05-06 and will land in the next minor release after v0.20.2.
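
A quick way to check an existing environment against that version requirement is sketched below. The helper names (`release_tuple`, `is_newer_than_baseline`) and the `BASELINE` constant are hypothetical, introduced only for illustration. The check also assumes that the installed version string reflects whether the PR is included, which may not hold for custom or preview builds, so treat a positive result as a hint rather than proof.

```python
import re
from importlib import metadata

# Last release named in this card as NOT yet containing the PR (assumption:
# any later release, or a dev build versioned past it, includes the PR).
BASELINE = (0, 20, 2)


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment of a version string.

    Example: "0.20.3.dev12+gabc" -> (0, 20, 3). Suffixes such as ".dev",
    "rc", ".post" and local build tags are ignored.
    """
    match = re.match(r"v?(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_newer_than_baseline(version: str, baseline: tuple[int, ...] = BASELINE) -> bool:
    """True if the numeric release segment is strictly greater than the baseline."""
    return release_tuple(version) > baseline


def main() -> None:
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
        return
    verdict = "newer than" if is_newer_than_baseline(installed) else "not newer than"
    print(f"vllm {installed} is {verdict} v0.20.2")


if __name__ == "__main__":
    main()
```

If the check reports a version that is not newer than v0.20.2, a build from `main` or a preview image is still needed, as described above.
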
```bash
# Serve the FP8 abliterated body with this drafter, speculating 4 tokens per step
vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching \
  --trust-remote-code
```

`num_speculative_tokens` sweep (DGX Spark GB10, FP8 verifier, batch=1):

| n | Throughput | pos-0 acceptance |
|---|---|---|
| 1 | 59.04 tok/s | 81.3% |
| 2 | 66.96 tok/s | 81.6% |
| 3 | 74.90 tok/s | 88.5% |
| **4** | **100.36 tok/s** | **84.4%** |

`n=4` is the recommended setting on this hardware. Throughput keeps increasing through n=4 because deeper positions stay highly accepted (75/74/73% at pos 1/2/3), unlike the vanilla draft where deep speculation collapses.

## Training details

- **Framework:** [`vllm-project/speculators`](https://github.com/vllm-project/speculators) v0.5.0.dev0
- **Hardware:** NVIDIA GB10 (DGX Spark, sm_12.1), 121 GB unified memory, 273 GB/s memory bandwidth
- **Wall time:** ~11 hours (vLLM extraction and EAGLE-3 fine-tune running concurrently on a single GB10)
- **Starting point:** [`RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3`](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3) (pretrained against the vanilla body)
- **Training data:** [`Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered`](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered): 50K instruction prompts regenerated through the FP8 abliterated body to produce on-distribution (prompt, response) pairs
- **Hyper-params:** 1 epoch, `total_seq_len=4096` (packed), `TORCH_COMPILE_DISABLE=1`, default LR schedule (cosine, warmup ~100 steps)
- **Aux hidden-state layers:** [2, 15, 27] (EAGLE-3 default)
- **Trained spec depth (`ttt_steps`):** 3 (positions 0, 1, 2).
  At inference, pos 3 is extrapolated from the same drafter. Empirically it works (72.7% acceptance), but this is outside the training distribution and may degrade on non-Magpie-shaped workloads.

### Validation metrics (Magpie held-out, teacher-forced)

| Position | full_acc (val) | cond_acc (val) |
|---|---|---|
| pos 0 | 66.8% | 66.8% |
| pos 1 | 41.4% | 61.5% |
| pos 2 | 26.4% | 62.6% |

⚠️ **Note**: Validation `full_acc` is measured via teacher-forced argmax against Magpie ground-truth tokens, which is a strict criterion. Inference acceptance (table above) is measured via rejection sampling against the body's actual sampling distribution at T=0.7, which is looser. The gap between 26.4% (val pos 2) and 74.1% (inference pos 2) reflects this metric difference, not a training failure. What matters for speculative-decoding throughput is the inference number.

## Limitations

- **Production chat workloads see much smaller uplift than the headline numbers suggest.** Paired bench (2026-05-17) on `/v1/chat/completions`: this drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body (no spec) ~40 tok/s. Real uplift on chat ≈ +15%, not +100%. For production chat use cases, **vanilla MTP `gemma4-26b-a4b-it-assistant` with `num_speculative_tokens=4` outperforms this drafter** (EN ~53 / ZH ~45 tok/s on chat). This drafter's headline numbers come from the raw `/v1/completions` endpoint, where instruct-tuned bodies produce degenerate output that small drafters can trivially predict.
- **Chinese (and likely other non-English) workloads are out-of-distribution.** v1 was trained only on English Magpie data; the paired bench shows ZH chat pos-0 acceptance drops to ~12% (vs vanilla MTP's ~57%). Round 2 (with Chinese training data + `ttt_steps=4`) is in flight; results will land in Part 31.
- **Trained for 3 spec positions, not 4.** Inference with `num_speculative_tokens=4` works well empirically, but pos 3 is extrapolation. `n=3` is the safest "in-distribution" setting; `n=4` is what we recommend for max throughput on the tested hardware.
- **Training data is English instruction-style.** Magpie-Llama-3.1 is English-leaning. Performance on Traditional Chinese, code-heavy, or domain-specific workloads may differ from the reported numbers. For Traditional Chinese workloads, [Qwen 3.6 abliterated](https://huggingface.co/coolthor/Qwen3-MoE-30B-A4B-abliterated-FP8-Dynamic) remains the better base model anyway (TMMLU+ 75% vs Gemma 4's 46%).
- **Optimizer state not included.** This release ships only the inference weights (`model.safetensors` + `config.json` + `config.py`). To resume training, train from scratch or contact me.
- **One-epoch run.** EAGLE-3 papers typically train for many more samples × epochs. Multi-epoch and/or larger training data may improve acceptance further.

## License

Apache 2.0, inherited from [Gemma 4](https://ai.google.dev/gemma/apache_2). The huihui base model is also Apache 2.0 per its [model card](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated).

## Acknowledgements

- **[huihui-ai](https://huggingface.co/huihui-ai)**: for the abliterated base [`Huihui-gemma-4-26B-A4B-it-abliterated`](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated) that this drafter targets.
- **[Google](https://ai.google.dev/gemma/)**: for the Gemma 4 26B-A4B-it base model and the Apache 2.0 license.
- **[RedHatAI](https://huggingface.co/RedHatAI)**: for publishing [`gemma-4-26B-A4B-it-speculator-eagle3`](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3), the vanilla-trained EAGLE-3 checkpoint we fine-tuned from. This saved us from training a drafter from scratch.
- **[vllm-project/speculators](https://github.com/vllm-project/speculators)** team: for the training framework. During this fine-tune we found and patched an upstream bug: [`create_empty_sample`](https://github.com/vllm-project/speculators/blob/main/src/speculators/train/data.py) defaulted to fp32 placeholders, which crashed BF16 models on every vLLM extraction timeout. PR pending.
- **[Magpie-Align](https://huggingface.co/Magpie-Align)**: for the [`Magpie-Llama-3.1-Pro-300K-Filtered`](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) instruction dataset.
- **[SafeAILab](https://github.com/SafeAILab/EAGLE)**: for the EAGLE-3 architecture (NeurIPS'25) and reference implementation.

## Related

- [`coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic`](https://huggingface.co/coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic): the FP8 verifier this drafter is paired with
- Blog series (both languages, deep links):
  - **Part 30** (this drafter, results): [English](https://ai-muninn.com/en/blog/dgx-spark-eagle3-finetune-abliterated-round1) / [繁中](https://ai-muninn.com/zh-TW/blog/dgx-spark-eagle3-finetune-abliterated-round1)
  - **Part 29** (n=1 deploy recipe, +34% out of the box): [English](https://ai-muninn.com/en/blog/dgx-spark-huihui-gemma4-mtp-n1-recipe) / [繁中](https://ai-muninn.com/zh-TW/blog/dgx-spark-huihui-gemma4-mtp-n1-recipe)
  - **Part 28** (mechanism: why vanilla draft can't track an abliterated body at depth): [English](https://ai-muninn.com/en/blog/dgx-spark-huihui-gemma4-fp8-mtp-34pct) / [繁中](https://ai-muninn.com/zh-TW/blog/dgx-spark-huihui-gemma4-fp8-mtp-34pct)
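
## Worked example: acceptance rates to tokens per step

The per-position acceptance table in "Why this exists" can be turned into a rough "tokens emitted per verification step" figure. The sketch below assumes the reported rates are unconditional, meaning position k counts as accepted only when every earlier position was also accepted, so the expected number of accepted draft tokens is the sum of the rates and the verifier adds one token of its own per step. That interpretation is an assumption about how the numbers were logged, and `mean_tokens_per_step` is a hypothetical helper name. The result says nothing about drafting or verification cost, so it is not a throughput prediction.

```python
from collections.abc import Sequence


def mean_tokens_per_step(per_position_acceptance: Sequence[float]) -> float:
    """Expected tokens emitted per verification step.

    Assumes each rate is the unconditional fraction of steps in which that
    draft position was accepted. The leading 1.0 is the token the verifier
    itself contributes on every step.
    """
    for rate in per_position_acceptance:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"acceptance rate out of range: {rate}")
    return 1.0 + sum(per_position_acceptance)


# Rates copied from the per-position acceptance table (pos 0 to pos 3).
VANILLA_DRAFT = [0.656, 0.433, 0.292, 0.205]
THIS_DRAFTER = [0.844, 0.749, 0.741, 0.727]

if __name__ == "__main__":
    print(f"vanilla draft: {mean_tokens_per_step(VANILLA_DRAFT):.3f} tokens/step")
    print(f"this drafter:  {mean_tokens_per_step(THIS_DRAFTER):.3f} tokens/step")
```

Under that assumption the vanilla draft yields about 2.59 tokens per step and this drafter about 4.06 on the raw `/v1/completions` benchmark. The same caveat as the endpoint correction applies: these rates were not measured on chat completions.
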