Qwopus3.5-9B-Coder — INT4 (compressed-tensors, AWQ-style)

INT4 quantization of Jackrong/Qwopus3.5-9B-Coder, optimised for high-throughput batched inference on vLLM and SGLang.

Runs on a single RTX 3090 (24 GB) with 128K context and a 331K-token KV pool.

Why this quantization is non-trivial

Qwopus3.5 is a hybrid architecture: 8 standard full-attention layers and 24 GatedDeltaNet linear-attention layers. The linear-attention layers use fused projections (in_proj_ba, in_proj_qkvz) that are created at load time by vLLM/SGLang's stacked_params_mapping. Quantizing those layers to INT4 produces garbage output — they must stay in bf16.

This release includes that fix: all linear-attention weights are stored as bf16 originals in the safetensors file, while all standard attention and MLP layers use Marlin INT4 (group_size=32). No other published quantization of this model has this patch.

Quantization details

  • Tool: llmcompressor 0.10 with transformers 5.9.0
  • Format: compressed-tensors INT4, group_size=32, symmetric
  • Scale search: MSE observer (searches for the scale minimising mean-squared error, similar to AWQ's approach)
  • Calibration: HuggingFaceH4/ultrachat_200k, 512 samples, max_seq_len=4096
  • Linear-attention layers: left as bf16 (required for correct output — see above)
  • lm_head: not quantized

Usage

SGLang (recommended — 128K context on a single RTX 3090)

docker run --gpus '"device=0"' -p 8001:8001 \
  -v /path/to/model:/model \
  lmsysorg/sglang:v0.5.12.post1-cu130 \
  python -m sglang.launch_server \
    --model-path /model \
    --port 8001 \
    --context-length 131072 \
    --chunked-prefill-size 8192 \
    --mem-fraction-static 0.83 \
    --kv-cache-dtype fp8_e5m2 \
    --max-running-requests 16 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

vLLM

docker run --gpus '"device=0"' -p 8001:8001 \
  -v /path/to/model:/model \
  vllm/vllm-openai:v0.20.0 \
  --model /model \
  --quantization compressed-tensors \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --language-model-only \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

Note: pass --language-model-only to vLLM — the model reports a vision architecture (Qwen3_5ForConditionalGeneration) but this release contains only language model weights.

Hardware

Setup GPU VRAM Context KV pool
SGLang TP=1 RTX 3090 24 GB ~21 GB 128K 331K tokens
vLLM TP=1 RTX 4070 Ti Super 16 GB ~14 GB 32K

Model architecture

  • Layers: 32 total — 8 full-attention (every 4th), 24 GatedDeltaNet linear-attention
  • Hidden size: 4096 · Intermediate: 12 288 · Heads: 16 (GQA 4 KV heads)
  • Max position embeddings: 262 144

License

Apache 2.0, following the base model.

Downloads last month
61
Safetensors
Model size
9B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for compute1/qwopus3.5-9b-coder-awq-int4

Finetuned
Qwen/Qwen3.5-9B
Quantized
(3)
this model