NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128

Running on 8 × 48 GB 4090Ti GPUs: average generation throughput 221.5 tokens/s with 4 running requests.

1: Use the edited config.json

2: pip install conch-triton-kernels

3: Use vLLM 0.22.0

workon vllm
VLLM_MARLIN_USE_ATOMIC_ADD=1 nohup vllm serve /data/models/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name coder \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 204096 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --mamba-ssm-cache-dtype float16 \
  --mamba-cache-dtype float16 \
  --mamba-backend flashinfer \
  --enable-mamba-cache-stochastic-rounding \
  --mamba-cache-philox-rounds 5 \
  --kv-cache-dtype fp8 \
  --kv-offloading-size 64 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 48}' \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
  --trust-remote-code \
  >> /root/JqLogs/coder.log 2>&1 &
tail -f /root/JqLogs/coder.log

Model Overview

  • Model Architecture: Hybrid Mamba-2 + Latent Mixture-of-Experts (LatentMoE) with Multi-Token Prediction (MTP)
    • Input: Text
    • Output: Text
    • Total Parameters: 550B
    • Active Parameters: 55B
  • Model Optimizations:
    • Weight quantization: INT4 (W4A16, group size 128)
  • Intended Use Cases:
    • Reasoning and complex problem solving.
    • Mathematics and science.
    • Code generation.
    • Instruction following.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
  • Release Date: 06/04/2025
  • Version: 1.0
  • Model Developers: Red Hat

Quantized version of nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

Model Optimizations

This model was obtained by quantizing the weights of nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
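The "approximately 75%" figure can be checked with a rough back-of-the-envelope calculation. The sketch below makes several simplifying assumptions that are not stated in this card: every one of the 550B parameters is quantized (in practice some layers stay in BF16), each group of 128 weights stores one 16-bit scale, and no zero points are stored. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, one scale of `scale_bits` bits is assumed per group,
    which adds scale_bits / group_size bits of overhead per weight.
    """
    bits = bits_per_weight
    if group_size:
        bits += scale_bits / group_size
    return num_params * bits / 8 / 1e9


TOTAL_PARAMS = 550e9  # total parameter count from the model overview

bf16_gb = weight_footprint_gb(TOTAL_PARAMS, 16)
int4_gb = weight_footprint_gb(TOTAL_PARAMS, 4, group_size=128)
reduction = 1 - int4_gb / bf16_gb

print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"INT4 (G128) weights: {int4_gb:.0f} GB")
print(f"Reduction: {reduction:.1%}")
```

Under these assumptions the per-group scales cost 0.125 extra bits per weight, so the reduction comes out slightly below 75%. The real checkpoint will be somewhat larger because the ignored layers remain unquantized.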

Only the weights of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-group scheme with group size 128, which is what the W4A16 preset passed to llm-compressor in the Creation section specifies. The llm-compressor library is used for quantization.
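To make "per-group with group size 128" concrete, here is a minimal pure-Python sketch of round-to-nearest group quantization. It is an illustration only and not the llm-compressor implementation: the function names are hypothetical, the scale is chosen from the plain min/max of each group, and the `symmetric` flag (defaulting to `True` as an assumption) switches between a signed range with no zero point and an unsigned range with a zero point.

```python
import math


def quantize_group(values, bits=4, symmetric=True):
    """Quantize one group of floats to `bits`-bit integer codes sharing one scale."""
    if symmetric:
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for INT4
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        zero_point = 0
    else:
        qmin, qmax = 0, 2 ** bits - 1  # 0..15 for INT4
        # Extend the range so that 0.0 is always representable.
        lo, hi = min(min(values), 0.0), max(max(values), 0.0)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero_point = round(-lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, **kwargs):
    """Split one weight row into consecutive groups and quantize each separately."""
    return [quantize_group(row[i:i + group_size], **kwargs)
            for i in range(0, len(row), group_size)]


# Toy weight row with 256 entries, so it splits into two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max reconstruction error {max_err:.5f}")
```

Because each group has its own scale, an outlier weight only degrades the resolution of the 128 weights it shares a group with, rather than the whole row.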

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

Install dependencies:

uv pip install git+https://github.com/vllm-project/vllm.git
uv pip install llmcompressor

Launch the vLLM server:

vllm serve RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128 \
  --host 0.0.0.0 --port 8088 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser nemotron_v3 \
  --mamba-ssm-cache-dtype float16 \
  --mamba-backend flashinfer \
  --enable-mamba-cache-stochastic-rounding \
  --mamba-cache-philox-rounds 5 \
  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}' \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 96}' \
  --trust-remote-code
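A checkpoint of this size will likely take a while to load, so it can be convenient to wait for the server before sending traffic. The sketch below polls the OpenAI-compatible `/v1/models` route on the port used above until it answers with HTTP 200. The function name `wait_until_ready`, the timeout, and the polling interval are hypothetical choices, and the `opener` and `sleep` parameters exist only so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://localhost:8088/v1/models", timeout_s=1800.0,
                     interval_s=10.0, opener=urllib.request.urlopen,
                     sleep=time.sleep):
    """Poll `url` until it returns HTTP 200; give up after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is not accepting requests yet
        sleep(interval_s)
    return False
```

A typical use would be to call `wait_until_ready()` once after launching `vllm serve` and only start the client when it returns `True`.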

Send requests:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8088/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128"

messages = [
    {"role": "user", "content": "Solve for x: 2x + 5 = 13"},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)

Creation

This model was quantized using the llm-compressor library as shown below.

from llmcompressor import model_free_ptq

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="W4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "model.embed_tokens",
        "re:.*mixer.conv1d.*",
        "re:.*norm_f*",
        "re:.*bias$",
        "re:.*embed_tokens$",
        "backbone.embeddings"
    ],
    max_workers=15,
    device="cuda:0",
)
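The `ignore` list mixes exact names with `re:`-prefixed regular expressions. The sketch below shows how such a list could be evaluated, assuming that `re:` entries are applied with `re.match` (anchored at the start of the name) and all other entries must equal the name exactly. That matching rule is an assumption about the library's behavior, `is_ignored` is a hypothetical helper, and the module names in the loop are made up for illustration rather than taken from the checkpoint.

```python
import re

# Ignore list copied from the quantization call above.
IGNORE = [
    "re:.*gate$",
    "lm_head",
    "model.embed_tokens",
    "re:.*mixer.conv1d.*",
    "re:.*norm_f*",
    "re:.*bias$",
    "re:.*embed_tokens$",
    "backbone.embeddings",
]


def is_ignored(name, ignore=IGNORE):
    """Return True if `name` matches any entry (assumed matching semantics)."""
    for target in ignore:
        if target.startswith("re:"):
            if re.match(target[3:], name):
                return True
        elif target == name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "backbone.norm_f",
    "backbone.layers.0.mixer.conv1d",
    "backbone.layers.1.mixer.gate",
    "backbone.layers.1.mixer.experts.0.up_proj",
]:
    print(f"{name}: {'kept in BF16' if is_ignored(name) else 'quantized'}")
```

Under this reading, routing gates, embeddings, the output head, convolution weights, final norms, and biases stay in their original precision, while the remaining linear projections are quantized.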

Evaluation

The model was evaluated on reasoning tasks using lighteval. vLLM was used as the serving backend for all evaluations.

Install dependencies:

uv pip install git+https://github.com/vllm-project/vllm.git
uv pip install lighteval==0.13.0
uv pip install "litellm[caching]>=1.66.0"

Launch the vLLM server:

vllm serve RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128 \
  --host 0.0.0.0 --port 8088 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser nemotron_v3 \
  --mamba-ssm-cache-dtype float16 \
  --mamba-backend flashinfer \
  --enable-mamba-cache-stochastic-rounding \
  --mamba-cache-philox-rounds 5 \
  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}' \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 96}' \
  --trust-remote-code

AIME 2025:

lighteval endpoint litellm \
  "model_name=hosted_vllm/RedHatAI__NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128,provider=hosted_vllm,base_url=http://127.0.0.1:8088/v1,timeout=3600,concurrent_requests=32,generation_parameters={temperature:1.0,top_p:0.95,max_new_tokens:32768}" \
  "aime25|0" \
  --output-dir results --save-details

GPQA Diamond:

lighteval endpoint litellm \
  "model_name=hosted_vllm/RedHatAI__NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128,provider=hosted_vllm,base_url=http://127.0.0.1:8088/v1,timeout=3600,concurrent_requests=32,generation_parameters={temperature:1.0,top_p:0.95,max_new_tokens:32768}" \
  "gpqa:diamond|0" \
  --output-dir results --save-details

Accuracy

Scores are pass@1; values in parentheses are recovery relative to the BF16 baseline.

| Benchmark | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 | RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-Dynamic | RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-FP8-BLOCK | RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16-W4A16-G128 (this model) |
| --- | --- | --- | --- | --- | --- |
| AIME 2025 (pass@1) | 90.00 | 90.00 (100.0%) | 93.33 (103.7%) | 86.67 (96.3%) | 86.67 (96.3%) |
| GPQA Diamond (pass@1) | 78.79 | 84.85 (107.7%) | 82.32 (104.5%) | 81.31 (103.2%) | 81.82 (103.8%) |
| Average | 84.39 | 87.42 (103.6%) | 87.83 (104.1%) | 83.99 (99.5%) | 84.24 (99.8%) |
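The percentages in parentheses appear to be each score divided by the BF16 baseline score. The short sketch below recomputes them for this model from the reported numbers; the dictionary and function names are hypothetical, and the rounding convention is an assumption.

```python
# Reported pass@1 scores for the BF16 baseline and for this W4A16 model.
BASELINE = {"AIME 2025": 90.00, "GPQA Diamond": 78.79}
THIS_MODEL = {"AIME 2025": 86.67, "GPQA Diamond": 81.82}


def recovery(score, baseline):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * score / baseline


def mean(values):
    values = list(values)
    return sum(values) / len(values)


for task, base in BASELINE.items():
    score = THIS_MODEL[task]
    print(f"{task}: {score:.2f} ({recovery(score, base):.1f}%)")

avg_base = mean(BASELINE.values())
avg_quant = mean(THIS_MODEL.values())
print(f"Average recovery: {recovery(avg_quant, avg_base):.1f}%")
```

Note that the average recovery is computed from the averaged scores, not as the mean of the per-task recovery percentages.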