Instructions to use spicyneuron/Qwen3.6-27B-MLX-5.7bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use spicyneuron/Qwen3.6-27B-MLX-5.7bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("spicyneuron/Qwen3.6-27B-MLX-5.7bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use spicyneuron/Qwen3.6-27B-MLX-5.7bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3.6-27B-MLX-5.7bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "spicyneuron/Qwen3.6-27B-MLX-5.7bit"
        }
      ]
    }
  }
}
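If `~/.pi/agent/models.json` already lists other providers, pasting the block above by hand risks overwriting them. The following is a minimal sketch of merging the entry programmatically. It assumes only the schema shown above (`providers`, `baseUrl`, `api`, `apiKey`, `models`); the helper name `add_mlx_provider` is hypothetical and not part of Pi.

```python
import json
from pathlib import Path


def add_mlx_provider(config_path, model_id, base_url="http://localhost:8080/v1"):
    """Merge an "mlx-lm" provider entry into a Pi models.json file (hypothetical helper)."""
    path = Path(config_path).expanduser()
    # Start from the existing config, if any, so other providers survive.
    text = path.read_text() if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    provider = config.setdefault("providers", {}).setdefault("mlx-lm", {})
    provider.update(
        {"baseUrl": base_url, "api": "openai-completions", "apiKey": "none"}
    )
    models = provider.setdefault("models", [])
    # Skip the append when the model is already listed, so reruns are harmless.
    if not any(m.get("id") == model_id for m in models):
        models.append({"id": model_id})
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_mlx_provider("~/.pi/agent/models.json", "spicyneuron/Qwen3.6-27B-MLX-5.7bit")` would then write the same entry as the JSON above while keeping any other providers in place.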

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use spicyneuron/Qwen3.6-27B-MLX-5.7bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "spicyneuron/Qwen3.6-27B-MLX-5.7bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spicyneuron/Qwen3.6-27B-MLX-5.7bit

Run Hermes

hermes

MLX LM

How to use spicyneuron/Qwen3.6-27B-MLX-5.7bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "spicyneuron/Qwen3.6-27B-MLX-5.7bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "spicyneuron/Qwen3.6-27B-MLX-5.7bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "spicyneuron/Qwen3.6-27B-MLX-5.7bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
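The same request can be made from Python with only the standard library. This is a sketch, not an official client: it assumes the server is listening on port 8080 (the address used in the other examples on this page), and the helper names `build_chat_request` and `chat` are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: server started with default host and port
MODEL = "spicyneuron/Qwen3.6-27B-MLX-5.7bit"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL, max_tokens=256):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Splitting request construction from sending keeps the payload easy to inspect without a live server.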

Qwen3.6-27B optimized for MLX.

A mixed-precision quant that balances speed, memory, and accuracy.
4-bit baseline with important layers at 6, 8, and BF16.
This quant does not support image input. Vision version here.
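The "5.7bit" in the name is an average over layers stored at different precisions. As a rough illustration of how such an average arises, the sketch below computes a parameter-weighted bits-per-weight figure. It assumes MLX-style grouped quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias per group (so a nominal 4-bit layer costs 4.5 bits per weight). The parameter shares are hypothetical and are not this model's actual recipe.

```python
def effective_bits(bits, group_size=64, overhead_bits=32):
    """Bits per weight for one quantized layer, including per-group overhead.

    Assumes one 16-bit scale and one 16-bit bias per group (32 bits in total).
    """
    return bits + overhead_bits / group_size


def average_bpw(layers):
    """Weighted average over (num_params, bits) pairs; bits=None means BF16."""
    total_params = sum(n for n, _ in layers)
    total_bits = sum(
        n * (16 if bits is None else effective_bits(bits)) for n, bits in layers
    )
    return total_bits / total_params


# Hypothetical parameter shares per precision, for illustration only.
example_layers = [
    (70, 4),     # bulk of the weights at the 4-bit baseline
    (20, 6),     # more sensitive layers
    (9, 8),
    (1, None),   # a small share kept in BF16
]
print(round(average_bpw(example_layers), 3))
```

Because of the per-group overhead, even a uniform 4-bit model lands above 4.0 bits per weight, which is why the bpw row in the benchmarks is higher than the nominal bit widths suggest.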

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-vlm \
  mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Qwen3.6-27B-MLX-5.7bit

Benchmarks

metric	unsloth/Qwen3.6-27B-UD-MLX-4bit	mlx-community/Qwen3.6-27B-OptiQ-4bit	5.7 bit (this model)
bpw	7.516	5.575	5.679
base memory	23.534	17.457	17.781
peak memory (1024/512)	27.085	20.633	20.966
prompt tok/s (1024)	420.712 ± 0.129	428.184 ± 0.165	422.521 ± 0.948
gen tok/s (512)	24.759 ± 0.025	31.521 ± 0.030	30.460 ± 0.106
kl mean	0.031 ± 0.003	0.044 ± 0.004	0.027 ± 0.002
kl p95	0.107 ± 0.003	0.164 ± 0.004	0.103 ± 0.002
perplexity*	4.560 ± 0.026	4.850 ± 0.020	4.872 ± 0.029
hellaswag	0.552 ± 0.011	0.552 ± 0.011	0.556 ± 0.011

Unsloth's "4bit" actually averages about 7.5 bits per weight even after excluding the vision tower. This quant is smaller (17.8 vs. 23.5 base memory), matches it on KL divergence and HellaSwag within the reported error, and generates tokens about 23% faster (30.5 vs. 24.8 tok/s).

OptiQ lands around the same size. This quant is slightly slower but slightly better on KLD (measured against this dataset).

* Perplexity on this model seems to swing a ton based on number of samples, so treat this as a noisy result.
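To make the comparisons above concrete, the sketch below recomputes relative differences from a few rows of the table. The numbers are copied from the table; the dictionary keys and the `relative_change` helper are illustrative names only.

```python
# Values copied from the benchmark table above.
results = {
    "unsloth-4bit": {"bpw": 7.516, "base_memory": 23.534, "gen_tps": 24.759, "kl_mean": 0.031},
    "optiq-4bit": {"bpw": 5.575, "base_memory": 17.457, "gen_tps": 31.521, "kl_mean": 0.044},
    "this-model": {"bpw": 5.679, "base_memory": 17.781, "gen_tps": 30.460, "kl_mean": 0.027},
}


def relative_change(metric, baseline, candidate="this-model"):
    """Fractional change of candidate vs. baseline (positive means larger)."""
    base = results[baseline][metric]
    return (results[candidate][metric] - base) / base


for baseline in ("unsloth-4bit", "optiq-4bit"):
    for metric in ("base_memory", "gen_tps", "kl_mean"):
        change = relative_change(metric, baseline)
        print(f"{metric:12s} vs {baseline}: {change:+.1%}")
```

Lower is better for memory and KL divergence, higher is better for generation speed, so the sign of each figure has to be read per metric. The point estimates ignore the ± ranges in the table, so small differences should not be over-interpreted.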

Tested on a Mac Studio M3 Ultra with:

mlx_lm.convert --hf-path Qwen/Qwen3.6-27B --mlx-path ./mlx && mlx_lm.kld --baseline-model ./mlx
mlx_lm.perplexity --sequence-length 1024 --seed 123
mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000

Required PRs:

mlx_lm.kld command
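For readers unfamiliar with the "kl mean" and "kl p95" rows, the sketch below shows one common way such numbers are defined: a per-token KL divergence between the baseline model's and the quantized model's next-token distributions, summarized by its mean and 95th percentile. This is an illustration in pure Python with toy logits. It is not the `mlx_lm.kld` implementation, whose exact details (units, percentile method, dataset handling) are not described here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(baseline_logits, quant_logits):
    """KL(baseline || quantized) in nats for a single token position."""
    log_p = log_softmax(baseline_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def summarize(kl_values, percentile=95):
    """Mean and nearest-rank percentile of per-token KL values."""
    ordered = sorted(kl_values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return sum(ordered) / len(ordered), ordered[rank - 1]


# Toy logits for three token positions over a three-token vocabulary.
baseline = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5], [3.0, 0.0, -1.0]]
quantized = [[1.9, 1.1, 0.1], [0.5, 0.4, 0.6], [2.5, 0.3, -1.0]]
kls = [kl_divergence(p, q) for p, q in zip(baseline, quantized)]
mean_kl, p95_kl = summarize(kls)
print(f"kl mean={mean_kl:.4f}  kl p95={p95_kl:.4f}")
```

A KL of zero means the quantized model reproduces the baseline distribution exactly at that position. The p95 figure highlights the worst-behaved tokens, which a mean alone can hide.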

Methodology

Quantized with an mlx-lm fork. MLX quantization options differ from llama.cpp's, but the principles are the same:

Sensitive layers like MoE routing, attention, and output embeddings get higher precision
More tolerant layers like MoE experts get lower precision
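The two principles above amount to a rule that maps each layer to a precision. The sketch below expresses that idea as a plain lookup on layer paths. Everything in it is an assumption made for illustration: the path substrings, the example layer names, and the specific bit widths are hypothetical, and the actual recipe used by the author's fork is not given here.

```python
# Hypothetical rules: substring of a layer path -> bit width.
# None means "leave unquantized in BF16". Earlier rules take priority.
RULES = [
    ("router", None),   # routing is treated as most sensitive
    ("lm_head", 8),     # output embeddings
    ("self_attn", 6),   # attention projections
    ("experts", 4),     # expert weights tolerate lower precision
]
DEFAULT_BITS = 4


def bits_for_layer(path, rules=RULES, default=DEFAULT_BITS):
    """Return the bit width for a layer path, or None to keep it in BF16."""
    for pattern, bits in rules:
        if pattern in path:
            return bits
    return default


for name in (
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.0.mlp.router",
    "lm_head",
):
    print(name, "->", bits_for_layer(name))
```

Keeping the rules as ordered data rather than nested conditionals makes it easy to try a different recipe and compare the resulting size and KL divergence.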


Safetensors

Model size: 27B params
Tensor type: BF16, U32

Model tree for spicyneuron/Qwen3.6-27B-MLX-5.7bit

Base model: Qwen/Qwen3.6-27B
Quantized (427): this model