Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF

English | 繁體中文

26991


English

Unsloth-style UD-Q4_K_XL quantization of huihui-ai/Huihui-Qwen3.6-27B-abliterated for llama.cpp, with the built-in MTP (Multi-Token Prediction) head fully preserved at Q8_0 for native speculative decoding.

Model Details

Item Value
Architecture Dense (27B), 64 layers — 16 × (3 × Gated DeltaNet → FFN + 1 × Gated Attention → FFN) + MTP
Base model Qwen/Qwen3.6-27B
Fine-tuned by huihui-ai (abliteration)
Quantized by YuYu1015
Model size 16.96 GB (vs ~55.6 GB BF16 original) — 5.21 BPW average
Format GGUF (single file), UD-style mixed precision via imatrix
Context length Up to 262,144 tokens (tested at 65,536)
Thinking mode Supported (enable_thinking: true/false)
MTP Built-in MTP head preserved at Q8_0 (drafter for --spec-type draft-mtp)
Multimodal Vision tower dropped by convert_hf_to_gguf.py (GGUF is text-only, saves ~1 GB)
llama.cpp b9200+ required (older builds miss hybrid GDN + MTP loading)

Per-Tensor Precision (UD Mask)

Tensor Type Reason
FFN (gate/up/down) Q4_K_M Bulk weight, high quantization tolerance
Attention Q/K/V Q6_K Sensitive to noise, upranked from base
Output projection Q6_K UD convention
Token embedding Q6_K UD convention
MTP / NextN head Q8_0 Drafter — Q8_0 keeps acceptance near-lossless

Notes on the Source Checkpoint

  • huihui's abliterated checkpoint already contains the MTP head. Common wisdom says abliteration drops MTP, but convert_hf_to_gguf.py picks up all 4 blk.64.nextn.* tensors directly — no Qwen-official MTP graft needed.
  • llama.cpp drops the vision tower automatically. The base model is Qwen3_5ForConditionalGeneration (multimodal), but GGUF only packs the LLM portion. Text-only inference works out of the box.

Serving with llama.cpp (RTX 3090)

llama-server \
  -m Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf \
  -fit off \
  -c 65536 \
  -np 1 \
  -fa on \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --host 0.0.0.0 --port 8080
Flag Why
-fit off Newer llama.cpp's auto-fit aborts on Qwen3.6 hybrid GDN+MTP layout
-np 1 MTP path is single-slot only
--spec-type draft-mtp New flag (older versions used --spec-type mtp)
--spec-draft-n-max 2 Empirically the sweet spot — n=3 regresses ~22% due to hybrid bug
--cache-type-k/v q8_0 KV cache compression, no measurable quality loss on this model

Performance (RTX 3090 + Xeon E5-2696 v3)

Configuration Throughput
No speculative decoding 39.15 tok/s
MTP --spec-draft-n-max 1 55 tok/s
MTP --spec-draft-n-max 2 60 tok/s (best)
MTP --spec-draft-n-max 3 55 tok/s (-22% vs n=2 ideal due to hybrid rollback bug)

Note on the CPU: the test rig uses an old Intel Xeon E5-2696 v3 (2.30 GHz, 2015). Modern CPUs (Zen 4 / Raptor Lake or newer) consistently push 30–50% higher tok/s on the same RTX 3090 because llama.cpp's hybrid GDN path still touches CPU per step. These numbers are a floor, not a ceiling.

Common Deployment Pitfalls

  1. Auto-fit abort on newer llama.cpp: failed to fit params to free device memory, n_gpu_layers already set by user → add -fit off.
  2. Old llama.cpp can't load this: missing tensor 'blk.64.ssm_conv1d.weight' → upgrade to b9200+.

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits


繁體中文

huihui-ai/Huihui-Qwen3.6-27B-abliterated 的 Unsloth 風格 UD-Q4_K_XL 量化版本,針對 llama.cpp 部署,並完整保留內建 MTP(Multi-Token Prediction)head(Q8_0),原生支援投機解碼。

模型資訊

項目 數值
架構 Dense(27B),64 層 — 16 × (3 × Gated DeltaNet → FFN + 1 × Gated Attention → FFN) + MTP
基礎模型 Qwen/Qwen3.6-27B
微調者 huihui-ai(abliteration)
量化者 YuYu1015
模型大小 16.96 GB(原版 BF16 約 55.6 GB)— 平均 5.21 BPW
格式 GGUF(單檔),UD 風格混合精度,imatrix 校準
Context 長度 最高 262,144 tokens(實測 65,536)
思考模式 支援(enable_thinking: true/false
MTP 內建 MTP head 以 Q8_0 保留(作為 --spec-type draft-mtp 的草稿器)
多模態 convert_hf_to_gguf.py 自動跳過視覺塔(GGUF 純文字,省 ~1 GB)
llama.cpp **必須 b9200+**(舊版不支援 hybrid GDN + MTP 載入)

逐 Tensor 精度(UD 遮罩)

Tensor 類型 理由
FFN(gate/up/down Q4_K_M 大宗權重,量化容忍度高
Attention Q/K/V Q6_K 對雜訊敏感,由基礎類型升級
Output projection Q6_K UD 慣例
Token embedding Q6_K UD 慣例
MTP / NextN head Q8_0 草稿器 — Q8_0 幾乎無損維持 acceptance

來源 Checkpoint 註記

  • huihui abliterated checkpoint 已內含 MTP head。 主流說法是「abliteration 會弄丟 MTP」,但實際 convert_hf_to_gguf.py 可直接抓到 blk.64.nextn.* 共 4 個 tensor — 不需嫁接 Qwen 官方 MTP。
  • llama.cpp 自動跳過視覺塔。 基礎模型是 Qwen3_5ForConditionalGeneration(多模態),但 GGUF 只包 LLM 部分。純文字推理直接可用。

使用 llama.cpp 部署(RTX 3090)

llama-server \
  -m Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf \
  -fit off \
  -c 65536 \
  -np 1 \
  -fa on \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --host 0.0.0.0 --port 8080
Flag 用途
-fit off 新版 llama.cpp 的 auto-fit 對 Qwen3.6 hybrid GDN+MTP 會 abort
-np 1 MTP 路徑只能單 slot
--spec-type draft-mtp 新版 flag(舊版叫 --spec-type mtp
--spec-draft-n-max 2 實測甜蜜點 — n=3 因 hybrid bug 回退約 22%
--cache-type-k/v q8_0 KV cache 壓縮,此模型上無可量測品質損失

效能(RTX 3090 + Xeon E5-2696 v3)

設定 速度
無投機解碼 39.15 tok/s
MTP --spec-draft-n-max 1 55 tok/s
MTP --spec-draft-n-max 2 60 tok/s(最佳)
MTP --spec-draft-n-max 3 55 tok/s(hybrid rollback bug 拖累,較 n=2 理論值 -22%)

CPU 說明: 測試機使用 2015 年的舊 Intel Xeon E5-2696 v3(2.30 GHz)。換成現代 CPU(Zen 4 / Raptor Lake 以上)在同樣 RTX 3090 上通常可再提升 30–50% tok/s,因為 llama.cpp 的 hybrid GDN 路徑每步仍需 CPU 介入。這個數字是下限,不是上限。

部署常見地雷

  1. 新版 llama.cpp auto-fit abort: failed to fit params to free device memory, n_gpu_layers already set by user → 加 -fit off
  2. 舊版 llama.cpp 載不了: missing tensor 'blk.64.ssm_conv1d.weight' → 升級到 **b9200+**。

安全警告

此模型已移除安全過濾機制(abliterated),可能產生敏感、爭議性或不當內容。使用者須自行承擔所有風險與法律責任,並確保使用方式符合當地法規與倫理標準。不適用於公開或生產環境。

致謝

Downloads last month
945
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(25)
this model

Collection including YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF