Instructions to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with llama-cpp-python:

# Llama.from_pretrained downloads from the Hub and needs huggingface-hub:
# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF",
	filename="Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf",
)

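# Run one chat turn; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# Print only the assistant's reply text.
print(response["choices"][0]["message"]["content"])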

Local Apps

llama.cpp

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./llama-cli -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Use Docker

docker model run hf.co/YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
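
Whichever install route is used, `llama-server` exposes an OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server is on its default port 8080 (the same address used in the Pi and Hermes configs on this page). The optional `enable_thinking` switch is passed through `chat_template_kwargs`, which is an assumption about the server build in use; drop it if the server rejects the field. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL"


def build_chat_request(prompt, enable_thinking=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    if enable_thinking is not None:
        # Assumption: the server forwards chat_template_kwargs to the chat template.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, enable_thinking=None):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, enable_thinking)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?", enable_thinking=False)` should return the reply as a plain string. Building the request separately from sending it keeps the payload easy to inspect without a live server.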

vLLM

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with Ollama:
```
ollama run hf.co/YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
```

Unsloth Studio

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF to start chatting

Pi

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Run Hermes

hermes

Docker Model Runner
How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL
```

Lemonade

How to use YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull YuYu1015/Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF:UD-Q4_K_XL

Run and chat with the model

lemonade run user.Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF-UD-Q4_K_XL

List all available models

lemonade list

Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP-GGUF

English

Unsloth-style UD-Q4_K_XL quantization of huihui-ai/Huihui-Qwen3.6-27B-abliterated for llama.cpp, with the built-in MTP (Multi-Token Prediction) head fully preserved at Q8_0 for native speculative decoding.

Model Details

| Item | Value |
| --- | --- |
| Architecture | Dense (27B), 64 layers: 16 × (3 × Gated DeltaNet → FFN + 1 × Gated Attention → FFN) + MTP |
| Base model | Qwen/Qwen3.6-27B |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | 16.96 GB (vs ~55.6 GB for the BF16 original), 5.21 BPW average |
| Format | GGUF (single file), UD-style mixed precision via imatrix |
| Context length | Up to 262,144 tokens (tested at 65,536) |
| Thinking mode | Supported (`enable_thinking: true/false`) |
| MTP | Built-in MTP head preserved at Q8_0 (drafter for `--spec-type draft-mtp`) |
| Multimodal | Vision tower dropped by `convert_hf_to_gguf.py` (GGUF is text-only, saves ~1 GB) |
| llama.cpp | b9200+ required (older builds cannot load the hybrid GDN + MTP layout) |
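
The size and BPW (bits per weight) figures can be cross-checked with a little arithmetic. The sketch below is only a sanity check under stated assumptions: the parameter count is inferred from the quoted BF16 size at 2 bytes per parameter, and because the unit behind "16.96 GB" is not stated, both the decimal and binary readings are tried.

```python
def bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average number of bits stored per parameter."""
    return file_bytes * 8 / n_params


# Assumption: BF16 stores 2 bytes per parameter, so the quoted ~55.6 GB
# original implies roughly 27.8e9 parameters.
n_params = 55.6e9 / 2

# The unit of "16.96 GB" is not stated, so try both readings.
bpw_decimal = bits_per_weight(16.96 * 10**9, n_params)  # GB  = 10^9 bytes
bpw_binary = bits_per_weight(16.96 * 2**30, n_params)   # GiB = 2^30 bytes

print(f"decimal GB: {bpw_decimal:.2f} BPW")
print(f"binary GiB: {bpw_binary:.2f} BPW")
```

Under these assumptions the binary reading (about 5.24) lands closer to the quoted 5.21 BPW than the decimal one (about 4.88). The remaining gap is plausibly down to the approximate parameter count.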

Per-Tensor Precision (UD Mask)

| Tensor | Type | Reason |
| --- | --- | --- |
| FFN (`gate/up/down`) | Q4_K_M | Bulk weight, high quantization tolerance |
| Attention `Q/K/V` | Q6_K | Sensitive to noise, upranked from base |
| Output projection | Q6_K | UD convention |
| Token embedding | Q6_K | UD convention |
| MTP / NextN head | Q8_0 | Drafter; Q8_0 keeps acceptance near-lossless |
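
One way to picture the mask is as an ordered list of name patterns, each mapped to a target type. The sketch below is illustrative only and is not the script used to build this file. The tensor-name patterns are assumptions modeled on llama.cpp's usual GGUF naming (`blk.<layer>.<tensor>.weight`), "output projection" is assumed to mean `output.weight`, and `pick_type` is a hypothetical helper.

```python
import re

# Hypothetical patterns modeled on llama.cpp's GGUF tensor naming.
# Order matters: the first matching rule wins, so the MTP rule comes first.
UD_MASK = [
    (re.compile(r"^blk\.\d+\.nextn\."), "Q8_0"),                 # MTP / NextN head
    (re.compile(r"^blk\.\d+\.attn_(q|k|v)\."), "Q6_K"),          # attention Q/K/V
    (re.compile(r"^output\."), "Q6_K"),                          # output projection (assumed)
    (re.compile(r"^token_embd\."), "Q6_K"),                      # token embedding
    (re.compile(r"^blk\.\d+\.ffn_(gate|up|down)\."), "Q4_K_M"),  # FFN bulk weights
]


def pick_type(tensor_name, default=None):
    """Return the type from the table for a tensor name, or `default` if no rule matches."""
    for pattern, qtype in UD_MASK:
        if pattern.search(tensor_name):
            return qtype
    return default


for name in ["blk.3.ffn_up.weight", "blk.7.attn_q.weight", "blk.64.nextn.eh_proj.weight"]:
    print(name, "->", pick_type(name))
```

Tensors not covered by the table (norms, GDN state tensors and so on) fall through to `default`, which stands in for whatever the base quantization would assign.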

Notes on the Source Checkpoint

huihui's abliterated checkpoint already contains the MTP head. It is commonly assumed that abliteration drops the MTP head, but `convert_hf_to_gguf.py` picks up all 4 `blk.64.nextn.*` tensors directly, so there is no need to graft the MTP head from the official Qwen checkpoint.
llama.cpp drops the vision tower automatically. The base model's architecture is Qwen3_5ForConditionalGeneration (multimodal), but the GGUF packs only the LLM portion, so text-only inference works out of the box.
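
The MTP check described above amounts to filtering a tensor-name listing by the `blk.64.nextn.` prefix. A minimal sketch follows. It assumes the names have already been obtained from some GGUF inspection tool, which is not shown. The four `nextn` names in the example list are illustrative placeholders, not a dump of this file.

```python
MTP_PREFIX = "blk.64.nextn."


def mtp_tensors(tensor_names):
    """Return the sorted tensor names that belong to the MTP / NextN head."""
    return sorted(name for name in tensor_names if name.startswith(MTP_PREFIX))


# Illustrative names only; a real listing would come from inspecting the GGUF.
example_names = [
    "token_embd.weight",
    "blk.63.ffn_down.weight",
    "blk.64.nextn.eh_proj.weight",
    "blk.64.nextn.enorm.weight",
    "blk.64.nextn.hnorm.weight",
    "blk.64.nextn.shared_head_norm.weight",
]

found = mtp_tensors(example_names)
print(len(found), "MTP tensors found")
```

An empty result on a real listing would suggest the MTP head was lost somewhere in the pipeline and speculative decoding with `--spec-type draft-mtp` would have nothing to draft with.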

Serving with llama.cpp (RTX 3090)

llama-server \
  -m Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf \
  -fit off \
  -c 65536 \
  -np 1 \
  -fa on \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --host 0.0.0.0 --port 8080

| Flag | Why |
| --- | --- |
| `-fit off` | Newer llama.cpp's auto-fit aborts on the Qwen3.6 hybrid GDN+MTP layout |
| `-np 1` | MTP path is single-slot only |
| `--spec-type draft-mtp` | New flag (older versions used `--spec-type mtp`) |
| `--spec-draft-n-max 2` | Empirically the sweet spot; `n=3` drops back to 55 tok/s because of a hybrid rollback bug (see the performance table) |
| `--cache-type-k/v q8_0` | KV cache compression, no measurable quality loss on this model |
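
For scripted deployments, the same command line can be assembled in Python so that the context size and draft depth are easy to vary between runs. This is a sketch rather than a supported launcher. The flags are copied from the command above, while `build_server_args` and its parameter names are hypothetical.

```python
import shlex


def build_server_args(model_path, ctx=65536, draft_n_max=2, host="0.0.0.0", port=8080):
    """Return the llama-server argument list from the command above."""
    return [
        "llama-server",
        "-m", model_path,
        "-fit", "off",                            # avoid the auto-fit abort
        "-c", str(ctx),
        "-np", "1",                               # MTP path is single-slot only
        "-fa", "on",
        "-ngl", "99",
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
        "--host", host,
        "--port", str(port),
    ]


args = build_server_args("Huihui-Qwen3.6-27B-abliterated-UD-Q4_K_XL-MTP.gguf")
print(shlex.join(args))  # shell-safe rendering for logs
```

Passing the list to `subprocess.run(args, check=True)` would start the server. Building a list instead of a single string avoids shell-quoting problems with model paths that contain spaces.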

Performance (RTX 3090 + Xeon E5-2696 v3)

| Configuration | Throughput |
| --- | --- |
| No speculative decoding | 39.15 tok/s |
| MTP `--spec-draft-n-max 1` | 55 tok/s |
| MTP `--spec-draft-n-max 2` | 60 tok/s (best) |
| MTP `--spec-draft-n-max 3` | 55 tok/s (-22% vs n=2 ideal due to hybrid rollback bug) |
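
To put the table in relative terms, the short script below computes the speedup of each MTP setting over the run without speculative decoding. It uses only the throughput figures from the table. The variable and function names are illustrative.

```python
BASELINE = 39.15                            # tok/s without speculative decoding
MTP_RESULTS = {1: 55.0, 2: 60.0, 3: 55.0}   # --spec-draft-n-max -> tok/s


def speedup(tok_s, baseline=BASELINE):
    """Throughput relative to the non-speculative baseline."""
    return tok_s / baseline


for n, tok_s in MTP_RESULTS.items():
    print(f"n_max={n}: {speedup(tok_s):.2f}x baseline")

best_n = max(MTP_RESULTS, key=MTP_RESULTS.get)
drop_n3_vs_n2 = 1 - MTP_RESULTS[3] / MTP_RESULTS[2]
print(f"best n_max={best_n}; n=3 is {drop_n3_vs_n2:.1%} below n=2")
```

On these figures MTP gives roughly 1.4x at `n=1` and `n=3`, and roughly 1.5x at `n=2`. The measured gap between `n=3` and `n=2` is about 8%. The -22% in the table is stated against an ideal figure that the table does not list, so it cannot be reproduced from these numbers alone.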

Note on the CPU: the test rig uses an old Intel Xeon E5-2696 v3 (2.30 GHz, 2015). Modern CPUs (Zen 4 / Raptor Lake or newer) consistently push 30–50% higher tok/s on the same RTX 3090 because llama.cpp's hybrid GDN path still touches CPU per step. These numbers are a floor, not a ceiling.

Common Deployment Pitfalls

Auto-fit abort on newer llama.cpp: `failed to fit params to free device memory, n_gpu_layers already set by user` → add `-fit off`.
Old llama.cpp can't load this: `missing tensor 'blk.64.ssm_conv1d.weight'` → upgrade to b9200+.
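
When the server is launched from a script, these two failures can be recognised by scanning the captured log for the messages quoted above. A small sketch follows, assuming the log text is already available as a string. `KNOWN_ISSUES` and `diagnose` are hypothetical names.

```python
# (log substring, suggested fix) pairs taken from the pitfalls listed above.
KNOWN_ISSUES = [
    ("failed to fit params to free device memory",
     "add `-fit off` to the llama-server command"),
    ("missing tensor 'blk.64.ssm_conv1d.weight'",
     "upgrade llama.cpp to b9200 or newer"),
]


def diagnose(log_text):
    """Return the suggested fixes for every known error found in the log."""
    return [fix for needle, fix in KNOWN_ISSUES if needle in log_text]


sample_log = "error: missing tensor 'blk.64.ssm_conv1d.weight'"
for fix in diagnose(sample_log):
    print("Suggested fix:", fix)
```

Substring matching is deliberately loose here, since the exact prefix and formatting of log lines can differ between llama.cpp builds.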

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits

Original Model: Qwen/Qwen3.6-27B by Alibaba Qwen Team
Abliteration: huihui-ai
GGUF UD Quantization: YuYu1015
UD Recipe Inspiration: Unsloth Dynamic Quants
