Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF
English
Unsloth-style UD-Q4_K_XL quantization of huihui-ai/Huihui-Qwen3.6-27B-abliterated for llama.cpp, with the built-in MTP (Multi-Token Prediction) head fully preserved at Q8_0 for native speculative decoding.
Model Details
| Item | Value |
|---|---|
| Architecture | Dense (27B), 64 layers — 16 × (3 × Gated DeltaNet → FFN + 1 × Gated Attention → FFN) + MTP |
| Base model | Qwen/Qwen3.6-27B |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | 16.96 GB (vs ~55.6 GB BF16 original) — 5.21 BPW average |
| Format | GGUF (single file), UD-style mixed precision via imatrix |
| Context length | Up to 262,144 tokens (tested at 65,536) |
| Thinking mode | Supported (enable_thinking: true/false) |
| MTP | Built-in MTP head preserved at Q8_0 (drafter for --spec-type draft-mtp) |
| Multimodal | Vision tower dropped by convert_hf_to_gguf.py (GGUF is text-only, saves ~1 GB) |
| llama.cpp | b9200+ required (older builds miss hybrid GDN + MTP loading) |
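The size and bits-per-weight figures in the table can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes "GB" means 10^9 bytes, and the helper names `params_from_size` and `compression_ratio` are hypothetical, not part of any tool.

```python
# Sketch: sanity-check the size figures quoted in the table.
# Assumption: "GB" means 10**9 bytes. If the table uses GiB instead,
# the derived numbers shift by roughly 7%.

BF16_BITS = 16  # bits per weight in a BF16 checkpoint


def params_from_size(size_gb: float, bits_per_weight: float) -> float:
    """Estimate the parameter count (in billions) from file size and average BPW."""
    return size_gb * 8 / bits_per_weight


def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized file is."""
    return original_gb / quantized_gb


if __name__ == "__main__":
    print(f"params implied by BF16 size:  {params_from_size(55.6, BF16_BITS):.1f} B")
    print(f"params implied by quant size: {params_from_size(16.96, 5.21):.1f} B")
    print(f"compression ratio:            {compression_ratio(55.6, 16.96):.2f}x")
```

Under this assumption the BF16 size implies about 27.8 B parameters and the quantized size about 26.0 B. The gap may come from a GB versus GiB mix in the table; the source does not say which unit it uses.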
Per-Tensor Precision (UD Mask)
| Tensor | Type | Reason |
|---|---|---|
| FFN (gate/up/down) | Q4_K_M | Bulk weight, high quantization tolerance |
| Attention Q/K/V | Q6_K | Sensitive to noise, upranked from base |
| Output projection | Q6_K | UD convention |
| Token embedding | Q6_K | UD convention |
| MTP / NextN head | Q8_0 | Drafter — Q8_0 keeps acceptance near-lossless |
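One way to read the table is as an ordered list of name patterns, each mapped to a quantization type. The sketch below is a hypothetical illustration of that idea, not the script used to build this file. The patterns follow common GGUF tensor naming, and the mapping of "Output projection" to `output.weight` is an assumption. Tensors the table does not mention (for example the Gated DeltaNet weights) return `None` rather than a guessed type.

```python
import fnmatch
from typing import Optional

# Illustrative rules only; the first matching pattern wins.
# The exact patterns used for this quantization are not given in the source.
UD_RULES = [
    ("blk.*.nextn.*", "Q8_0"),            # MTP / NextN head (drafter)
    ("token_embd.weight", "Q6_K"),        # token embedding
    ("output.weight", "Q6_K"),            # output projection (assumed name)
    ("blk.*.attn_q.weight", "Q6_K"),      # attention Q
    ("blk.*.attn_k.weight", "Q6_K"),      # attention K
    ("blk.*.attn_v.weight", "Q6_K"),      # attention V
    ("blk.*.ffn_gate.weight", "Q4_K_M"),  # FFN gate
    ("blk.*.ffn_up.weight", "Q4_K_M"),    # FFN up
    ("blk.*.ffn_down.weight", "Q4_K_M"),  # FFN down
]


def pick_type(tensor_name: str) -> Optional[str]:
    """Return the table's type for a tensor name, or None if the table is silent."""
    for pattern, qtype in UD_RULES:
        if fnmatch.fnmatchcase(tensor_name, pattern):
            return qtype
    return None


if __name__ == "__main__":
    for name in ("blk.0.ffn_down.weight", "blk.3.attn_k.weight", "token_embd.weight"):
        print(name, "->", pick_type(name))
```

Placing the NextN rule first keeps every tensor under `blk.*.nextn.*` at Q8_0 even if its name also matches a later pattern.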
Notes on the Source Checkpoint
- huihui's abliterated checkpoint already contains the MTP head. Common wisdom says abliteration drops MTP, but convert_hf_to_gguf.py picks up all 4 blk.64.nextn.* tensors directly; no Qwen-official MTP graft needed.
- llama.cpp drops the vision tower automatically. The base model is Qwen3_5ForConditionalGeneration (multimodal), but GGUF only packs the LLM portion. Text-only inference works out of the box.
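To confirm that a downloaded file still carries the MTP tensors, the tensor names can be listed and filtered by prefix. This is a minimal sketch, assuming the `gguf` Python package (from llama.cpp's gguf-py) is installed for the file-reading step. The helper names `find_mtp_tensors` and `list_tensor_names` are hypothetical.

```python
from typing import Iterable, List


def find_mtp_tensors(names: Iterable[str], block: int = 64) -> List[str]:
    """Return the tensor names under blk.<block>.nextn., sorted alphabetically."""
    prefix = f"blk.{block}.nextn."
    return sorted(n for n in names if n.startswith(prefix))


def list_tensor_names(gguf_path: str) -> List[str]:
    """Read every tensor name from a GGUF file (requires the gguf package)."""
    # Imported lazily so find_mtp_tensors works without the package installed.
    from gguf import GGUFReader

    reader = GGUFReader(gguf_path)
    return [t.name for t in reader.tensors]


if __name__ == "__main__":
    path = "Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf"
    mtp = find_mtp_tensors(list_tensor_names(path))
    print(f"found {len(mtp)} MTP tensors")
    for name in mtp:
        print(" ", name)
```

Going by the note above, a file that kept the head should report 4 tensors under `blk.64.nextn.`.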
Serving with llama.cpp (RTX 3090)
llama-server \
-m Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf \
-fit off \
-c 65536 \
-np 1 \
-fa on \
-ngl 99 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--host 0.0.0.0 --port 8080
| Flag | Why |
|---|---|
| -fit off | Newer llama.cpp's auto-fit aborts on the Qwen3.6 hybrid GDN+MTP layout |
| -np 1 | MTP path is single-slot only |
| --spec-type draft-mtp | New flag (older versions used --spec-type mtp) |
| --spec-draft-n-max 2 | Empirically the sweet spot; n=3 drops back to 55 tok/s (~22% below the n=2 ideal) due to the hybrid rollback bug |
| --cache-type-k/v q8_0 | KV cache compression, no measurable quality loss on this model |
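Once the server is running with the flags above, it can be queried through llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the Python standard library. The `chat_template_kwargs` field used to toggle `enable_thinking` is an assumption about how the running build forwards template options, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    prompt: str,
    base_url: str = "http://127.0.0.1:8080",
    enable_thinking: Optional[bool] = None,
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if enable_thinking is not None:
        # Assumption: the server passes this through to the chat template.
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("What is the capital of France?", enable_thinking=False))
```

Since the server is started with -np 1, requests are handled one at a time, so a client like this should avoid firing calls in parallel.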
Performance (RTX 3090 + Xeon E5-2696 v3)
| Configuration | Throughput |
|---|---|
| No speculative decoding | 39.15 tok/s |
| MTP --spec-draft-n-max 1 | 55 tok/s |
| MTP --spec-draft-n-max 2 | 60 tok/s (best) |
| MTP --spec-draft-n-max 3 | 55 tok/s (-22% vs n=2 ideal due to hybrid rollback bug) |
Note on the CPU: the test rig uses an old Intel Xeon E5-2696 v3 (2.30 GHz, 2015). Modern CPUs (Zen 4 / Raptor Lake or newer) consistently push 30–50% higher tok/s on the same RTX 3090 because llama.cpp's hybrid GDN path still touches CPU per step. These numbers are a floor, not a ceiling.
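The throughput figures in the table translate into speedups over the non-speculative baseline as follows. This is a small worked calculation using only the numbers reported above; the helper names are hypothetical.

```python
# Throughput figures copied from the performance table (tok/s).
BASELINE = 39.15                        # no speculative decoding
MEASURED = {1: 55.0, 2: 60.0, 3: 55.0}  # keyed by --spec-draft-n-max


def speedup(tok_s: float, baseline: float = BASELINE) -> float:
    """Throughput relative to the non-speculative baseline."""
    return tok_s / baseline


def relative_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


if __name__ == "__main__":
    for n, tok_s in sorted(MEASURED.items()):
        print(f"n={n}: {tok_s:.0f} tok/s, {speedup(tok_s):.2f}x baseline")
    drop = relative_change(MEASURED[3], MEASURED[2])
    print(f"measured n=3 vs n=2: {drop:.1f}%")
```

On these numbers n=2 runs at roughly 1.53x the baseline, and n=1 and n=3 at about 1.40x. The measured gap between n=3 and n=2 is about 8%. The 22% figure in the table is stated relative to an ideal value that the source does not spell out.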
Common Deployment Pitfalls
- Auto-fit abort on newer llama.cpp: failed to fit params to free device memory, n_gpu_layers already set by user → add -fit off.
- Old llama.cpp can't load this: missing tensor 'blk.64.ssm_conv1d.weight' → upgrade to b9200+.
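The two pitfalls above can be turned into a small log check for deployment scripts. This is a hypothetical helper, not part of llama.cpp. It matches only the message fragments quoted above and assumes they appear verbatim in the server log.

```python
from typing import Optional

# Message fragments and fixes taken from the pitfalls listed above.
KNOWN_ISSUES = [
    (
        "failed to fit params to free device memory",
        "add -fit off to the llama-server command",
    ),
    (
        "missing tensor 'blk.64.ssm_conv1d.weight'",
        "upgrade llama.cpp to b9200 or newer",
    ),
]


def suggest_fix(log_text: str) -> Optional[str]:
    """Return the suggested fix for the first known error found in the log, if any."""
    for needle, fix in KNOWN_ISSUES:
        if needle in log_text:
            return fix
    return None


if __name__ == "__main__":
    sample = "error: missing tensor 'blk.64.ssm_conv1d.weight'"
    print(suggest_fix(sample))
```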
Safety Warning
This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.
Credits
- Original Model: Qwen/Qwen3.6-27B by Alibaba Qwen Team
- Abliteration: huihui-ai
- GGUF UD Quantization: YuYu1015
- UD Recipe Inspiration: Unsloth Dynamic Quants