---
license: apache-2.0
base_model:
  - huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
  - google/gemma-4-26B-A4B-it
library_name: speculators
tags:
  - eagle3
  - speculative-decoding
  - gemma4
  - abliterated
  - draft-model
  - vllm
  - speculators
---

# huihui Gemma 4 26B-A4B abliterated · EAGLE-3 draft model

> ⚠️ **2026-05-17 endpoint correction** — the ~100 tok/s and ~2× speedup numbers in this card are measured on `/v1/completions` (raw prompt, no chat template). A later paired bench showed that on **production `/v1/chat/completions` workloads, this drafter delivers ~46 tok/s vs pure body ~40 tok/s (≈ +15% uplift), and vanilla MTP n=1 hits ~51 tok/s on the same workload**. The acceptance-curve flattening this drafter achieves is real on raw output, but the per-position-acceptance and throughput numbers below don't translate 1:1 to chat workloads. Round 2 paired bench (with Chinese training data) will land in Part 31. Until then, **for chat workloads consider using `gemma4-26b-a4b-it-assistant` (vanilla MTP) with `num_speculative_tokens=4` instead — Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s**. See the [Part 30 endpoint correction note](https://ai-muninn.com/en/blog/dgx-spark-eagle3-finetune-abliterated-round1) for full paired numbers.

EAGLE-3 speculative-decoding draft model **fine-tuned for [`huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated`](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated)** (and the FP8 quantized variant [`coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic`](https://huggingface.co/coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic)).

Starts from [`RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3`](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3) (vanilla-trained on the unmodified Gemma 4 body) and is fine-tuned for 1 epoch / 50K samples against the abliterated body's hidden-state distribution. The point is to **restore deep-speculation acceptance** that the vanilla draft loses once the body has been abliterated.

## Why this exists

Vanilla MTP / EAGLE-3 drafters are trained against the vanilla `gemma-4-26B-A4B-it` body. When you swap the body for an abliterated one (refusal direction removed), the body's hidden-state distribution shifts and the draft model's predictions stop matching the body's actual outputs — especially at deeper speculation depths. Per-position acceptance collapses from ~65% (pos 0) to ~20% (pos 3) in a 4-token speculation window.

This drafter is fine-tuned to re-align with the abliterated body. On a same-stack inference benchmark (DGX Spark GB10, FP8 verifier, T=0.7, N=10 prompts, max_tokens=200), per-position acceptance recovers to a nearly-flat curve:

| Position | Vanilla draft (against abliterated body) | **This drafter** | Δ |
|---|---|---|---|
| pos 0 | 65.6% | **84.4%** | +18.8pp |
| pos 1 | 43.3% | **74.9%** | +31.6pp |
| pos 2 | 29.2% | **74.1%** | +44.9pp |
| pos 3 | 20.5% | **72.7%** | +52.2pp |

Throughput at `num_speculative_tokens=4` (on `/v1/completions` raw endpoint): **~100 tok/s** vs ~50 tok/s with the vanilla draft. Same hardware, same prompts. **See the endpoint correction above — these acceptance and throughput numbers come from the raw endpoint, not from chat completions.**

## Usage with vLLM

Requires vLLM with PR [#41745](https://github.com/vllm-project/vllm/pull/41745) (Gemma 4 MTP / EAGLE-3 integration) merged — currently means building from `main` or using a preview image. PR is merged into main as of 2026-05-06; will land in the next minor release after v0.20.2.

```bash
vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching \
  --trust-remote-code
```

`num_speculative_tokens` sweep (DGX Spark GB10, FP8 verifier, batch=1):

| n | Throughput | pos-0 acceptance |
|---|---|---|
| 1 | 59.04 tok/s | 81.3% |
| 2 | 66.96 tok/s | 81.6% |
| 3 | 74.90 tok/s | 88.5% |
| **4** | **100.36 tok/s** | **84.4%** |

`n=4` is the recommended setting on this hardware. Throughput keeps increasing through n=4 because deeper positions stay highly accepted (74/74/73% at pos 1/2/3), unlike the vanilla draft where deep speculation collapses.

## Training details

- **Framework:** [`vllm-project/speculators`](https://github.com/vllm-project/speculators) v0.5.0.dev0
- **Hardware:** NVIDIA GB10 (DGX Spark, sm_12.1), 121 GB unified memory, 273 GB/s
- **Wall time:** ~11 hours (vLLM extraction + EAGLE-3 fine-tune concurrent on a single GB10)
- **Starting point:** [`RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3`](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3) (vanilla-trained pretrained)
- **Training data:** [`Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered`](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) — 50K instruction prompts regenerated through the FP8 abliterated body to produce on-distribution (prompt, response) pairs
- **Hyper-params:** 1 epoch, total_seq_len=4096 (packed), TORCH_COMPILE_DISABLE=1, default LR schedule (cosine, warmup ~100 steps)
- **Aux hidden-state layers:** [2, 15, 27] (EAGLE-3 default)
- **Trained spec depth (`ttt_steps`):** 3 (positions 0, 1, 2). At inference, pos 3 is extrapolated from the same drafter — empirically it works (72.7% acceptance), but this is outside the training distribution and may degrade on non-Magpie-shaped workloads.

### Validation metrics (Magpie held-out, teacher-forced)

| Position | full_acc (val) | cond_acc (val) |
|---|---|---|
| pos 0 | 66.8% | 66.8% |
| pos 1 | 41.4% | 61.5% |
| pos 2 | 26.4% | 62.6% |

⚠️ **Note**: Validation `full_acc` is measured via teacher-forced argmax against Magpie ground-truth tokens — strict. Inference acceptance (table above) is measured via rejection-sampling against the body's actual sampling distribution at T=0.7 — looser. The 26.4% val pos-2 vs 74.1% inference pos-2 gap reflects this metric difference, not training failure. What matters for speculative decoding throughput is the inference number.

## Limitations

- **Production chat workloads see much smaller uplift than the headline numbers suggest.** Paired bench (2026-05-17) on `/v1/chat/completions`: this drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body (no spec) ~40 tok/s. Real uplift on chat ≈ +15%, not +100%. For production chat use cases, **vanilla MTP `gemma4-26b-a4b-it-assistant` with `num_speculative_tokens=4` outperforms this drafter** (EN ~53 / ZH ~45 tok/s on chat). This drafter's headline numbers come from `/v1/completions` raw endpoint, where instruct-tuned bodies produce degenerate output that small drafters can trivially predict.
- **Chinese (and likely other non-English) workloads are out-of-distribution.** v1 was trained only on English Magpie data — paired bench shows ZH chat pos-0 acceptance drops to ~12% (vs vanilla MTP's ~57%). Round 2 (with Chinese training data + ttt_steps=4) is in flight; results will land in Part 31.
- **Trained for 3 spec positions, not 4.** Inference `num_speculative_tokens=4` works well empirically but pos 3 is extrapolation. `n=3` is the safest "in-distribution" setting; `n=4` is what we recommend for max throughput on the tested hardware.
- **Training data is English instruction-style.** Magpie-Llama-3.1 is English-leaning. Performance on Traditional Chinese, code-heavy, or domain-specific workloads may differ from the reported numbers. For Traditional Chinese workloads, [Qwen 3.6 abliterated](https://huggingface.co/coolthor/Qwen3-MoE-30B-A4B-abliterated-FP8-Dynamic) remains the better base model anyway (TMMLU+ 75% vs Gemma 4's 46%).
- **Optimizer state not included.** This release ships only the inference weights (`model.safetensors` + `config.json` + `config.py`). To resume training, train from scratch or contact me.
- **One-epoch run.** EAGLE-3 papers typically train for many more samples × epochs. Multi-epoch and/or larger training data may improve acceptance further.

## License

Apache 2.0 — inherited from [Gemma 4](https://ai.google.dev/gemma/apache_2). The huihui base model is also Apache 2.0 per its [model card](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated).

## Acknowledgements

- **[huihui-ai](https://huggingface.co/huihui-ai)** — for the abliterated base [`Huihui-gemma-4-26B-A4B-it-abliterated`](https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated) that this drafter targets.
- **[Google](https://ai.google.dev/gemma/)** — for the Gemma 4 26B-A4B-it base model and the Apache 2.0 license.
- **[RedHatAI](https://huggingface.co/RedHatAI)** — for publishing [`gemma-4-26B-A4B-it-speculator-eagle3`](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator-eagle3), the vanilla-trained EAGLE-3 checkpoint we fine-tuned from. Saved us from training a drafter from scratch.
- **[vllm-project/speculators](https://github.com/vllm-project/speculators)** team — for the training framework. During this fine-tune we found and patched an upstream bug ([`create_empty_sample`](https://github.com/vllm-project/speculators/blob/main/src/speculators/train/data.py) defaulted to fp32 placeholders which crashed BF16 models on every vLLM extraction timeout). PR pending.
- **[Magpie-Align](https://huggingface.co/Magpie-Align)** — for the [`Magpie-Llama-3.1-Pro-300K-Filtered`](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) instruction dataset.
- **[SafeAILab](https://github.com/SafeAILab/EAGLE)** — for the EAGLE-3 architecture (NeurIPS'25) and reference implementation.

## Related

- [`coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic`](https://huggingface.co/coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic) — the FP8 verifier this drafter is paired with
- Blog series — both languages, deep links:
  - **Part 30** (this drafter, results) — [English](https://ai-muninn.com/en/blog/dgx-spark-eagle3-finetune-abliterated-round1) / [繁中](https://ai-muninn.com/zh-TW/blog/dgx-spark-eagle3-finetune-abliterated-round1)
  - **Part 29** (n=1 deploy recipe, +34% out of the box) — [English](https://ai-muninn.com/en/blog/dgx-spark-huihui-gemma4-mtp-n1-recipe) / [繁中](https://ai-muninn.com/zh-TW/blog/dgx-spark-huihui-gemma4-mtp-n1-recipe)
  - **Part 28** (mechanism — why vanilla draft can't track an abliterated body at depth) — [English](https://ai-muninn.com/en/blog/dgx-spark-huihui-gemma4-fp8-mtp-34pct) / [繁中](https://ai-muninn.com/zh-TW/blog/dgx-spark-huihui-gemma4-fp8-mtp-34pct)