Qwopus3.5-9B-Coder — INT4 (compressed-tensors, AWQ-style)

INT4 quantization of Jackrong/Qwopus3.5-9B-Coder, optimised for high-throughput batched inference on vLLM and SGLang.

Runs on a single RTX 3090 (24 GB) with 128K context and a 331K-token KV pool.

Why this quantization is non-trivial

Qwopus3.5 is a hybrid architecture: 8 standard full-attention layers and 24 GatedDeltaNet linear-attention layers. The linear-attention layers use fused projections (in_proj_ba, in_proj_qkvz) that are created at load time by vLLM/SGLang's stacked_params_mapping. Quantizing those layers to INT4 produces garbage output — they must stay in bf16.

This release includes that fix: all linear-attention weights are stored as bf16 originals in the safetensors file, while all standard attention and MLP layers use Marlin INT4 (group_size=32). No other published quantization of this model has this patch.

Quantization details

Tool: llmcompressor 0.10 with transformers 5.9.0
Format: compressed-tensors INT4, group_size=32, symmetric
Scale search: MSE observer (searches for the scale minimising mean-squared error, similar to AWQ's approach)
Calibration: HuggingFaceH4/ultrachat_200k, 512 samples, max_seq_len=4096
Linear-attention layers: left as bf16 (required for correct output — see above)
lm_head: not quantized

Usage

SGLang (recommended — 128K context on a single RTX 3090)

docker run --gpus '"device=0"' -p 8001:8001 \
  -v /path/to/model:/model \
  lmsysorg/sglang:v0.5.12.post1-cu130 \
  python -m sglang.launch_server \
    --model-path /model \
    --port 8001 \
    --context-length 131072 \
    --chunked-prefill-size 8192 \
    --mem-fraction-static 0.83 \
    --kv-cache-dtype fp8_e5m2 \
    --max-running-requests 16 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

vLLM

docker run --gpus '"device=0"' -p 8001:8001 \
  -v /path/to/model:/model \
  vllm/vllm-openai:v0.20.0 \
  --model /model \
  --quantization compressed-tensors \
  --max-model-len 32768 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --language-model-only \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

Note: pass --language-model-only to vLLM — the model reports a vision architecture (Qwen3_5ForConditionalGeneration) but this release contains only language model weights.

Hardware

Setup	GPU	VRAM	Context	KV pool
SGLang TP=1	RTX 3090 24 GB	~21 GB	128K	331K tokens
vLLM TP=1	RTX 4070 Ti Super 16 GB	~14 GB	32K	—

Model architecture

Layers: 32 total — 8 full-attention (every 4th), 24 GatedDeltaNet linear-attention
Hidden size: 4096 · Intermediate: 12 288 · Heads: 16 (GQA 4 KV heads)
Max position embeddings: 262 144