Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS

Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and AGENTS.md — an operator's manual that pre-empts common stale-documentation traps.

🏆 DGX Spark performance — current production (v3 image, 2026-04-29)

The XS body served with DFlash spec decode (not the MTP head) under the v3 image (ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3) is the highest-throughput config we've measured on Spark: 38.5 tok/s median, 71.3 tok/s peak thinking-on / 38.1 / 68.4 thinking-off. That's a +17–26 % lift across thinking modes vs the original -NVFP4 + old-DFlash + v2.1 production. See the GitHub Performance section for the four-config comparison table.

🙏 Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config — including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving linear_attn.conv1d at BF16 — and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.

What "XS" means — and what it's not

This is the extra-small footprint sibling of -Text-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).

Text-NVFP4-MTP (regular) Text-NVFP4-MTP-XS (this repo)
linear_attn projections (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj) preserved BF16 (~11 GB) quantized to NVFP4 (~3 GB)
linear_attn.conv1d (SSM 1D convolution — recurrence-critical) preserved BF16 preserved BF16
linear_attn SSM state vectors (A_log, dt_bias, norm.weight) preserved BF16 preserved BF16 ✅
mtp.* head (grafted bf16 from base, bit-exact verified) yes yes
Vision tower stripped stripped
Total disk ~26 GB ~20 GB
VRAM footprint at runtime ~27 GB ~21 GB

This is a smart, strategic quantization — not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads).

When to pick which:

  • Pick the regular variant if you have ≥48 GB VRAM. Even the projection weights at BF16 give a small additional safety margin on long-context recurrence stability.
  • Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes linear_attn.conv1d — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

Variants

Format Size Use case
BF16 51 GB Full-precision reference weights
NVFP4 (compressed-tensors + DFlash) 26 GB DGX Spark — DFlash spec decode, validated
Multimodal-NVFP4-MTP 27 GB RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16
Text-NVFP4-MTP 26 GB Same as above without vision tower
Multimodal-NVFP4-MTP-XS 21 GB RTX 5090 / smaller dedicated VRAM — MTP, full FP4 incl. GDN projections
Text-NVFP4-MTP-XS (this repo) 20 GB Same as Multimodal-XS without vision tower

What this is

The modelopt-format NVFP4 + MTP variant, text-only (vision tower stripped), with linear_attn projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

  • Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. modelopt format, served by vLLM through --quantization modelopt.
  • Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only linear_attn.conv1d is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
  • Vision tower stripped (333 visual keys removed, ~0.92 GB). Text-only build — no image / video input. language_model_only: true set in config.json.
  • MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16, bit-exact verified). Powers --speculative-config '{"method":"qwen3_5_mtp",...}' for self-speculative decoding without a separate drafter.

Why MTP

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Indicative published numbers (sakamakismile's reference recipe on RTX 5090):

  • Single-stream short prompts at n=3: ~132 tok/s
  • Single-stream long-form: ~105 tok/s
  • 2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
  • Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)

Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Hardware tier Recommended variant Why
DGX Spark / GB10 (sm_121a, unified memory) -NVFP4 (DFlash)not any MTP variant Bench on Spark: DFlash beats MTP-XS by +26 % median, +52 % peak. Don't run MTP on Spark.
RTX PRO 6000 Blackwell (96 GB dedicated VRAM) Text-NVFP4-MTP — GDN BF16 for best long-context fidelity, or this XS variant for ~10 % faster decode XS measured 111.4 tok/s median vs regular ~92 tok/s on RTX PRO 6000. Both win against DFlash on dedicated VRAM.
B100 / B200 (sm_100, dedicated FP4) Text-NVFP4-MTP (preferred — GDN BF16 fits) or this XS Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly.
RTX 5090 (sm_120, 32 GB dedicated VRAM) This XS variant ✅ — fits at ~21 GB runtime, matches sakamakismile's reference footprint XS variants fit comfortably in 32 GB with KV headroom.
A100 / H100 (no native FP4) BF16 NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit.

Full bench numbers: GitHub repo Performance section. | A100 / H100 (no native FP4) | BF16 |

Usage

vLLM serve

# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-text-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-text-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

Configuration notes

  • --quantization modelopt is required (not compressed-tensors — different format).
  • --speculative-config '{"method":"qwen3_5_mtp", ...}' activates the grafted MTP head as the spec-decode drafter. No external drafter download needed — the head is in the safetensors of this repo.
  • --gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; on RTX 5090's tighter 32 GB you'll want 0.92 and a smaller --max-model-len (try 65536 first).

Quantization recipe

  • Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
  • Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class — vision stripped post-export)
  • Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
  • Excluded from quantization (kept BF16) — XS variant differences from the regular variant in bold:
    • lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
    • *linear_attn.conv1d*, *mixer.conv1d* *(NVFP4_DEFAULT_CFG default — kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence; this is the recurrence-critical kernel of the GatedDeltaNet block. Both regular and XS variants preserve this.)*
    • *linear_attn* is NOT broadly excluded (XS difference — the projection matmuls in_proj_qkv, in_proj_z, in_proj_a/b, out_proj get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
    • *visual* (excluded during quant; vision tower then stripped post-export)
    • *mtp* (MTP head preservation)
    • *output_layer*, output.*
  • MTP graft: 15 tensors copied bf16 from Qwen/Qwen3.6-27B after modelopt export
  • Vision strip: post-export, all model.visual.* keys removed; config.json patched with language_model_only: true
  • Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source

Provenance & credits

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)
BTC QR
bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
Ξ Ethereum (ETH)
ETH QR
0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
◎ Solana (SOL)
SOL QR
DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
ⓜ Monero (XMR)
XMR QR
836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Downloads last month
5,643
Safetensors
Model size
17B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS

Base model

Qwen/Qwen3.6-27B
Quantized
(25)
this model

Collection including AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS