DeepSeek-V4-Flash W4A16-FP8
Mixed-precision quantization of deepseek-ai/DeepSeek-V4-Flash for vLLM tensor-parallel deployment at TP=2. Validated end-to-end on three SKUs:
- 8× H200 (SM 9.0) — Hopper datacenter (calibration + harness baseline)
- 2× DGX Spark / GB10 (SM 12.1) — Blackwell SoC, long-context production
- 2× RTX PRO 6000 Blackwell Server (SM 12.0) — Blackwell workstation
making this load cleanly on consumer Blackwell.
Naming mirrors RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8: their NVFP4 experts become W4A16 experts here, and the attention block is FP8_BLOCK in both.
🚀 Live engine on dual DGX Spark TP=2 serving 1M-token context graphs-ON, image vllm-w4a16-dsv4:exp from jasl/vllm@ds4-sm120-experimental. Single-file zero-to-serving bootstrap on dual DGX Spark: scripts/bootstrap_dsv4_spark.sh in the reproduction repo.
Quantization scheme
| Component | Format | Method |
|---|---|---|
| Routed experts (256 × 43 layers) | W4A16 INT4, group_size=128 | GPTQ, dampening_frac=0.1 |
| Attention projections (q/kv/o, compressor, indexer) | FP8_BLOCK 128×128 | Data-free |
| Shared experts | BF16 | Excluded (kylesayrs PR #41276 incompatibility) |
| Embeddings, lm_head, hc_head | BF16 | Excluded |
Architecture
| Property | Value |
|---|---|
| Total parameters | |
| Decoder layers | 43 |
| Routed experts / layer | 256 (top-K = 6) |
| Hidden size | 4096 |
| Quantized size | ~143 GB (vs ~543 GB BF16) |
| Compression ratio | ~3.8× |
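The compression ratio follows directly from the two checkpoint sizes in the table. A quick check, using only those approximate figures:

```python
# Approximate checkpoint sizes from the architecture table, in GB.
bf16_size_gb = 543
quantized_size_gb = 143

compression_ratio = bf16_size_gb / quantized_size_gb
print(f"compression ratio: ~{compression_ratio:.1f}x")
```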
Inference (vLLM)
vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
--served-model-name DSV4-W4A16-FP8 \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 \
--block-size 256 \
--max-model-len 16384 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.92 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code
For the canonical long-context production configuration (1M tokens, single stream): lower --max-num-seqs to 1 and --gpu-memory-utilization to 0.90, and set --max-model-len 1048576 --max-num-batched-tokens 8192 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.
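One way to keep the two configurations in sync is to derive the long-context command from the base flags. The sketch below is illustrative only: it uses exactly the flags and values quoted in this section, while the helper name `build_command` and the dictionary layout are assumptions made for the example.

```python
import json
import shlex

# Base flags copied from the TP=2 command in this section.
base_args = {
    "--served-model-name": "DSV4-W4A16-FP8",
    "--tensor-parallel-size": "2",
    "--kv-cache-dtype": "fp8",
    "--block-size": "256",
    "--max-model-len": "16384",
    "--max-num-seqs": "4",
    "--gpu-memory-utilization": "0.92",
    "--tokenizer-mode": "deepseek_v4",
    "--tool-call-parser": "deepseek_v4",
    "--reasoning-parser": "deepseek_v4",
}
# Flags that take no value.
switches = ["--enable-auto-tool-choice", "--trust-remote-code"]

# Overrides for the 1M-token, single-stream setup described in the text.
long_context_overrides = {
    "--max-num-seqs": "1",
    "--gpu-memory-utilization": "0.90",
    "--max-model-len": "1048576",
    "--max-num-batched-tokens": "8192",
    "--compilation-config": '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}',
}


def build_command(overrides=None):
    """Return the vllm serve argv, with optional overrides applied (hypothetical helper)."""
    args = {**base_args, **(overrides or {})}
    argv = ["vllm", "serve", "canada-quant/DeepSeek-V4-Flash-W4A16-FP8"]
    for flag, value in args.items():
        argv += [flag, value]
    return argv + switches


long_context_argv = build_command(long_context_overrides)

# Sanity check: the compilation config must be valid JSON before it reaches vLLM.
json.loads(long_context_overrides["--compilation-config"])

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(long_context_argv))
```

Because the overrides are merged over the base flags, each flag appears once in the final command, and the JSON compilation config is shell-quoted automatically.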
Required env vars at runtime (SM 12.x sparse-MLA path): set VLLM_TRITON_MLA_SPARSE=1 and VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 in the container or shell that runs vllm serve. Without _HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel can crash during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered in _dequantize_and_gather_k_kernel — the kernel falls back to a default block size that doesn't match V4-Flash's head dim. The full env block (NCCL, TileLang, HF cache flags) is at QUICKSTART_DUAL_SPARK.md §4.
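A launcher can check for these two variables before starting the server, so a missing setting fails fast instead of crashing during kernel warmup. This is a minimal sketch: the variable names and values come from the paragraph above, while the helper names `missing_sparse_mla_env` and `serving_env` are hypothetical.

```python
import os

# Variables named in the text for the SM 12.x sparse-MLA path.
REQUIRED_SPARSE_MLA_ENV = {
    "VLLM_TRITON_MLA_SPARSE": "1",
    "VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE": "4",
}


def missing_sparse_mla_env(env):
    """Return the required variables that are unset or set to a different value."""
    return {k: v for k, v in REQUIRED_SPARSE_MLA_ENV.items() if env.get(k) != v}


def serving_env(base_env=None):
    """Copy an environment and add the required variables (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(REQUIRED_SPARSE_MLA_ENV)
    return env


problems = missing_sparse_mla_env(os.environ)
if problems:
    print("missing or mismatched:", ", ".join(f"{k}={v}" for k, v in problems.items()))
else:
    print("sparse-MLA environment looks complete")
```

The dictionary returned by `serving_env()` can be passed as `env=` to `subprocess.run` when launching vllm serve from Python.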
Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
Required vLLM build: This model does not load on vanilla vLLM. The exact toolchain — jasl/vllm@ds4-sm120 (or ds4-sm120-experimental for the bleeding edge) + the vendored scripts/kylesayrs-deepseek-ct.patch (kylesayrs PR #41276, content-pinned rebased successor of f910a73a93 which was force-pushed out of upstream history; see issue #1) + packed_modules_mapping patch — is in the reproduction repo. The single-file bootstrap script scripts/bootstrap_dsv4_spark.sh does the whole stack zero-to-serving on dual DGX Spark. For SM 12.x hardware (DGX Spark / GB10 / RTX PRO 6000 / RTX 50-series), the workspace pre-reservation patch landed upstream as jasl/vllm@1d6f5c4 (was vllm-project/vllm#41700); check it out instead of carrying the local patch.
Blackwell sm_120 note (RTX PRO 6000): vLLM's FlashInfer-based top-p / top-k sampler JIT mis-parses the TORCH_CUDA_ARCH_LIST=12.0a arch token and raises RuntimeError: FlashInfer requires GPUs with sm75 or higher (the GPU is sm_120 — way above sm_75; the parser just doesn't recognize the 12.0a token). Set VLLM_USE_FLASHINFER_SAMPLER=0 to fall back to the PyTorch-native sampler.
Upstream tracker: original PR #40991 (where our Spark validation comment was posted) was closed 2026-05-06; current upstream tracker is PR #41834 — "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes", branch codex/ds4-sm120-min-enable. This model is validated on jasl/vllm@ds4-sm120-experimental (legacy branch, still works).
Evaluation
Validated on jasl/vllm-ds4-sm120-harness. H200 numbers are at HEAD 85aca32 (older jasl/vllm@428e08e); Spark and RTX PRO 6000 numbers are at HEAD 96785b9 on the current ds4-sm120-experimental tip, with graphs ON and no --enforce-eager.
| Test | Native FP4/FP8 (8× H200) | W4A16-FP8 (8× H200) | W4A16-FP8 (2× DGX Spark TP=2) | W4A16-FP8 (2× RTX PRO 6000 TP=2) |
|---|---|---|---|---|
| chat-smoke quick | 4/4 | 4/4 | 4/4 | 4/4 |
| chat-smoke quality | 4/4 | 4/4 | included in generation matrix below | included in generation matrix below |
| chat-smoke coding | 2/2 | 2/2 | included in generation matrix below | included in generation matrix below |
| generation (18 prompts × non-thinking) | — | — | 18/18 PASS | 18 / 18 invocations clean |
| generation (18 prompts × think-high) | — | — | 17/18 PASS | 54 / 54 invocations clean ⁰ |
| generation (18 prompts × think-max @ 32K) | — | — | 9/18 → 9/10 at 64K rerun | 54 / 54 invocations clean ⁰ |
| toolcall15 | 23/30 (77%) | 26/30 (87%) ¹ | 41/45 (92%) ¹ | 27/30 (90%) ² |
| Long-context NIAH (75K → 256K single) | — | — | 4/4 retrieval | 4/4 retrieval |
| Long-context NIAH 256K × 2 concurrent | — | — | stalled 2026-05-04 → fix in jasl@e734ace5 | ✅ PASS (377 s vs 356 s single) |
| Long-context NIAH 500K × 1 | — | — | (in flight) | ✅ PASS (1231 s) |
| Workspace-lock errors | 0 | 0 | 0 over 100+ requests, 5 h+ uptime | 0 |
⁰ The generation-matrix runs on RTX PRO 6000 are 18 prompts × 3 rounds = 54 invocations per mode; this harness HEAD does not auto-pass/fail them, but all 126 completed cleanly with finish_reason=stop. Thinking-mode failures of the form Spark saw at 32K budget are not reproduced here because we ran with --max-model-len=16384 and all prompts fit within budget.
¹ Toolcall15 on Spark is scored across 3 thinking modes (45 cases); H200 baseline was single-mode (30 cases). Score normalized to %.
² Toolcall15 on RTX PRO 6000 here is single-round (30 cases); same pattern of failures (TC-06 Multi-Value Extraction fail, TC-07 Search-Read-Act partial).
Comparison caveat: the H200 numbers come from an older vllm build (harness HEAD 85aca32, jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today's ds4-sm120-experimental tip. Treat the H200 ↔ Blackwell deltas as informational, not as a "same software, different hardware" benchmark; the valid same-software comparison is Spark ↔ RTX PRO 6000.
Standard benchmarks
| Benchmark | Setting | 8× H200 (older vllm) | 2× DGX Spark TP=2 (graph mode) | 2× RTX PRO 6000 TP=2 (graph mode) |
|---|---|---|---|---|
| GSM8K | 8-shot, flexible-extract | 92.87% ±0.71% | 95.37% ±0.58% | 94.99% ±0.60% |
| GSM8K | strict-match | 42.61% (chat-format artifact) | 95.45% ±0.57% | 95.07% ±0.60% |
| MMLU | 5-shot | 87.27% ±0.27% | (in flight) | (pending) |
| HumanEval | 0-shot pass@1 (instruct, --confirm_run_unsafe_code) | 54.27% ±3.9% ³ | 80.49% ±3.10% ⁴ | 78.05% ±3.24% |
³ HumanEval pass@1 on H200 is depressed by chat-format extraction; coding capability is better captured by the generation matrix and toolcall15 above.
⁴ The Spark and RTX PRO 6000 HumanEval runs use strict pass@1 with code execution enabled (--confirm_run_unsafe_code); the H200 number on this card was scored via regex extraction (which under-counts valid generations). This methodology difference accounts for most of the roughly +24 to +26 pp delta; quality is preserved.
Throughput
| Hardware | Mode | Decode | Notes |
|---|---|---|---|
| 8× H200 TP=2 | graph | — | not measured under harness |
| 2× Spark TP=2 | graph | 14–17 tok/s | canonical recipe, multi-seq stable |
| 2× Spark TP=2 | eager | 3–4 tok/s | only required without workspace patch |
| 2× RTX PRO 6000 TP=2 | graph | 47–48 tok/s @ c=1, 84 tok/s @ c=2 | TPOT mean 20.8 ms (p99 21.7 ms) at c=1, scales 1.77× to c=2 |
RTX PRO 6000 — vllm bench serve detail
| Concurrency | In / Out | Duration | TTFT mean / p99 | TPOT mean / p99 | Output tok/s |
|---|---|---|---|---|---|
| 1 | 1024 / 1024 | 430.9 s | 237 ms / 711 ms | 20.8 ms / 21.7 ms | 47.5 |
| 2 | 2048 / 512 | 121.9 s | 1096 ms / 1900 ms | 21.7 ms / 23.0 ms | 84.0 |
Per-stream decode rate is stable across concurrency: TPOT mean moves from 20.8 ms to 21.7 ms and p99 from 21.7 ms to 23.0 ms. Aggregate input+output throughput at c=2 reaches 420 tok/s.
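The derived figures quoted for the RTX PRO 6000 runs (per-stream rate, c=1 to c=2 scaling, aggregate throughput) can be recomputed from the table. A minimal sketch, assuming every request used exactly the listed input and output lengths:

```python
# Figures copied from the vllm bench serve table, keyed by concurrency.
runs = {
    1: {"in_tokens": 1024, "out_tokens": 1024, "tpot_ms": 20.8, "output_tok_s": 47.5},
    2: {"in_tokens": 2048, "out_tokens": 512, "tpot_ms": 21.7, "output_tok_s": 84.0},
}


def per_stream_decode_rate(tpot_ms):
    """Tokens per second for one stream, from mean time per output token."""
    return 1000.0 / tpot_ms


def total_throughput(run):
    """Input + output tokens per second, scaling output rate by the in/out length ratio."""
    return run["output_tok_s"] * (run["in_tokens"] + run["out_tokens"]) / run["out_tokens"]


scaling = runs[2]["output_tok_s"] / runs[1]["output_tok_s"]

for c, run in runs.items():
    print(f"c={c}: {per_stream_decode_rate(run['tpot_ms']):.1f} tok/s per stream, "
          f"{total_throughput(run):.0f} tok/s input+output")
print(f"output throughput scaling c=1 -> c=2: {scaling:.2f}x")
```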
Note on think-max reasoning failures (Spark only)
The 9 think-max failures on Spark at 16K context + 32K output budget are not a model-quality regression — they are output-ceiling truncations. With --max-model-len 16384 and a typical ~1–2K prompt, the actual output ceiling is ~14–15K, regardless of the requested 32K. The deepseek_v4 reasoning parser dumps unclosed <think> blocks into reasoning_content, leaving content empty. To run think-max on these prompts, scale both --max-model-len ≥ 65536 and max_tokens ≥ 64000 together. Non-thinking and think-high modes are unaffected.
Empirical confirmation (2026-05-05, Spark): the same 10 cases re-run at --max-model-len=65536, --max-num-seqs=4, max_tokens=64000 produce 9 / 10 PASS with reasoning + content lengths well past the original 32K cap. Decode rates remain in the canonical 14–17 t/s envelope at 4× the context window.
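The ceiling arithmetic in this note can be written down directly. The sketch below assumes the context window must hold both the prompt and the generated tokens, and treats "32K" as 32768; the function name `output_ceiling` is hypothetical.

```python
def output_ceiling(max_model_len, prompt_tokens, max_tokens):
    """Tokens that can actually be generated: the requested max_tokens,
    capped by the context left over after the prompt."""
    return max(0, min(max_tokens, max_model_len - prompt_tokens))


# 16K context with a 32K requested budget: the context window is the binding limit.
for prompt_tokens in (1024, 2048):
    print(prompt_tokens, output_ceiling(16384, prompt_tokens, 32768))

# 64K context with max_tokens=64000, as in the rerun.
print(output_ceiling(65536, 2048, 64000))
```

With a 1K to 2K prompt the 16K configuration leaves roughly 14K to 15K tokens for output, which is why raising only max_tokens does not help and both limits have to be scaled together.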
Calibration
| Property | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (V4 chat template) |
| Samples | 768 |
| Max sequence length | 512 |
| Per-rank batch size | 4 |
| Hardware | 8× NVIDIA H200, p5en.48xlarge |
| Walltime | ~14 hours |
Required environment
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm
expandable_segments is calibration-only — must not be set during vLLM serving.
What didn't work (recorded so others don't waste cycles)
| Config | Result |
|---|---|
| samples=1024, bs=32, no offload, no expandable_segments | OOM at Layer 3 (45–67 GiB activation alloc fail) |
| samples=1024, bs=8, same as above | OOM at Layer 3 (32 GiB alloc fail) |
| samples=1024, bs=8, offload_hessians=True | OOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block) |
| samples=1024, bs=4, +offload_hessians, +expandable_segments | NCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift) |
| samples=1024, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout | Untested (we picked 768 instead) |
| samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout | Succeeded (14h end-to-end) |
| sequential_targets=["Linear"] (any sample count) | torch.fx.proxy.TraceError on DeepseekV4Indexer.wrapped_1's data-dependent control flow; would need is_leaf_module patch to register Indexer as leaf |
Recipe
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme

recipe = GPTQModifier(
    config_groups={
        # Attention projections, compressor, and indexer: FP8 block quantization.
        "attention": QuantizationScheme(
            targets=[
                r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
                r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
                r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
            ],
            **FP8_BLOCK,
        ),
        # Routed experts only; the \d+ index keeps shared experts out of this group.
        "experts": QuantizationScheme(
            targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
            **W4A16,
        ),
    },
    ignore=["lm_head"],
    offload_hessians=True,
    dampening_frac=0.1,
)

oneshot(
    model=model,  # base model, loaded earlier (not shown)
    dataset=ds,  # calibration dataset, prepared earlier (not shown)
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=768,
    sequential_targets=["DeepseekV4DecoderLayer"],
    batch_size=4,
)
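To see which modules each config group picks up, the target patterns can be tried against module names with plain `re`. This is only an approximation of how regex targets are resolved (it assumes the pattern after the `re:` prefix is matched from the start of the module name), and the module names below are hypothetical examples, not taken from the checkpoint.

```python
import re

# Target patterns copied from the recipe ("re:" marks a regex target).
attention_targets = [
    r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
    r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
    r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
]
expert_targets = [r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"]


def matches_any(name, targets):
    """True if the module name matches any regex target (prefix stripped)."""
    return any(re.match(t.removeprefix("re:"), name) for t in targets)


def scheme_for(name):
    """Name of the config group that would claim this module, if any."""
    if matches_any(name, attention_targets):
        return "FP8_BLOCK"
    if matches_any(name, expert_targets):
        return "W4A16"
    return "unquantized"


# Hypothetical module names, for illustration only.
examples = [
    "model.layers.3.self_attn.q_a_proj",
    "model.layers.3.self_attn.compressor.indexer.weights_proj",
    "model.layers.3.mlp.experts.17.down_proj",
    "model.layers.3.mlp.shared_experts.down_proj",
    "lm_head",
]
for name in examples:
    print(f"{name}: {scheme_for(name)}")
```

Under these assumptions the shared-expert name falls through to "unquantized" because the experts pattern requires a numeric index after `experts.`, which is consistent with shared experts staying in BF16.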
Known issues
- lm_head excluded from quantization (BF16): including it produces dequantization mismatches with the kylesayrs PR loader.
- shared_experts excluded (BF16): including them triggers NotImplementedError("DeepSeekV4 requires FP8 attention quantization") on shared_expert routing.
- TP > 2 is BLOCKED by vllm-project/vllm#41511 (W4A16 MoE scale-sharding).
- SM 12.x deployment requires the workspace pre-reservation, but the patch landed upstream as jasl/vllm@1d6f5c4, so just check out a recent enough ds4-sm120 tip rather than carrying the local patch.
- packed_modules_mapping patch is still required as of ds4-sm120-experimental@abad5dc71 (2026-05-05): the kylesayrs deepseek-ct patch does not add the class attribute. Drop-in patch in patches/packed_modules_mapping.diff. Note gate_up_proj must map to ["w1", "w3"] (not gate_proj/up_proj) to match the recipe ignore list naming for shared experts.
- FlashInfer JIT mis-parses TORCH_CUDA_ARCH_LIST=12.0a on RTX PRO 6000 sm_120: set VLLM_USE_FLASHINFER_SAMPLER=0 to fall back to the PyTorch-native sampler.
- Runtime env var requirement: VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4 must be set on the SM 12.x sparse-MLA path or kernel warmup crashes with an illegal memory access in _dequantize_and_gather_k_kernel. See Inference section above.
Reproduction
Full toolchain, scripts, and patches: canada-quant/dsv4-flash-w4a16-fp8
Built with:
- vllm-project/llm-compressor kylesayrs/transformers-v5 (PR #2647), commit a308bc0e
- huggingface/transformers add-deepseek-v4 (PR #45643), 5.8.0.dev0
- compressed-tensors 0.15.1.a20260428
- PyTorch 2.11.0+cu130 (calibration on H200) / 2.11.0+cu128 (serving on RTX PRO 6000)
- vLLM (calibration verify, 2026-05-02): jasl/vllm@428e08e + neuralmagic/kylesayrs/deepseek-ct@f910a73a cherry-picked + packed_modules_mapping patch + workspace pre-reservation patch (commit 0ac3de079). The SHA f910a73a was later force-pushed out of upstream history on ~2026-05-08; current builds apply the content-pinned rebased successor d09eeb498 via the vendored scripts/kylesayrs-deepseek-ct.patch.
- vLLM (RTX PRO 6000 serving, 2026-05-05): jasl/vllm@ds4-sm120-experimental@abad5dc71 + the vendored kylesayrs-deepseek-ct.patch (content-pinned rebased successor of f910a73a) + packed_modules_mapping patch (workspace patch now upstream as 1d6f5c4)
Acknowledgements
- @jasl: DeepSeek-V4 vLLM SM12x base support (originally PR #40991, closed 2026-05-06; current upstream tracker is PR #41834). Also the e734ace5 memory-pressure-release fix that resolved the Blackwell 256K×2 stall.
- @kylesayrs: compressed-tensors V4 attention path (PR #41276)
- @aabbccddwasd: indexer KV cache layout fix
- @bbbearxyz: SM12x Triton fallback kernels
- RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8: published reference for V4 mixed-precision attention topology
License
Apache 2.0 (inherited from base model)