Qwopus3.5-9B-Coder — INT4 (compressed-tensors, AWQ-style)
INT4 quantization of Jackrong/Qwopus3.5-9B-Coder, optimised for high-throughput batched inference on vLLM and SGLang.
Runs on a single RTX 3090 (24 GB) with 128K context and a 331K-token KV pool.
Why this quantization is non-trivial
Qwopus3.5 is a hybrid architecture: 8 standard full-attention layers and 24 GatedDeltaNet linear-attention layers. The linear-attention layers use fused projections (in_proj_ba, in_proj_qkvz) that are created at load time by vLLM/SGLang's stacked_params_mapping. Quantizing those layers to INT4 produces garbage output — they must stay in bf16.
This release includes that fix: all linear-attention weights are stored as bf16 originals in the safetensors file, while all standard attention and MLP layers use Marlin INT4 (group_size=32). No other published quantization of this model has this patch.
Quantization details
- Tool: llmcompressor 0.10 with transformers 5.9.0
- Format: compressed-tensors INT4, group_size=32, symmetric
- Scale search: MSE observer (searches for the scale minimising mean-squared error, similar to AWQ's approach)
- Calibration: HuggingFaceH4/ultrachat_200k, 512 samples, max_seq_len=4096
- Linear-attention layers: left as bf16 (required for correct output — see above)
- lm_head: not quantized
Usage
SGLang (recommended — 128K context on a single RTX 3090)
docker run --gpus '"device=0"' -p 8001:8001 \
-v /path/to/model:/model \
lmsysorg/sglang:v0.5.12.post1-cu130 \
python -m sglang.launch_server \
--model-path /model \
--port 8001 \
--context-length 131072 \
--chunked-prefill-size 8192 \
--mem-fraction-static 0.83 \
--kv-cache-dtype fp8_e5m2 \
--max-running-requests 16 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
vLLM
docker run --gpus '"device=0"' -p 8001:8001 \
-v /path/to/model:/model \
vllm/vllm-openai:v0.20.0 \
--model /model \
--quantization compressed-tensors \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--enable-prefix-caching \
--language-model-only \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3
Note: pass
--language-model-onlyto vLLM — the model reports a vision architecture (Qwen3_5ForConditionalGeneration) but this release contains only language model weights.
Hardware
| Setup | GPU | VRAM | Context | KV pool |
|---|---|---|---|---|
| SGLang TP=1 | RTX 3090 24 GB | ~21 GB | 128K | 331K tokens |
| vLLM TP=1 | RTX 4070 Ti Super 16 GB | ~14 GB | 32K | — |
Model architecture
- Layers: 32 total — 8 full-attention (every 4th), 24 GatedDeltaNet linear-attention
- Hidden size: 4096 · Intermediate: 12 288 · Heads: 16 (GQA 4 KV heads)
- Max position embeddings: 262 144
License
Apache 2.0, following the base model.
- Downloads last month
- 61
Model tree for compute1/qwopus3.5-9b-coder-awq-int4
Base model
Qwen/Qwen3.5-9B-Base