---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, mamba2, moe, quantized, rotorquant, gguf, llama.cpp,
  llama-mtmd, multimodal-via-mmproj]
library_name: gguf
pipeline_tag: image-text-to-text
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - RotorQuant GGUF Q3_K_M

GGUF Q3_K_M quantization of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` (`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`), produced with the RotorQuant weight-quantization method.

The `Q3_K_M.gguf` file in this repo loads in `llama.cpp` (`llama-cli` / `llama-mtmd-cli`).
For multimodal inference (text + image + audio + video), pair it with the
multimodal projector: [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16).

For the matched-KV stack — RotorQuant weights + RotorQuant KV-cache modifier —
see [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-GGUF-Q3_K_M-RQ-KV`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-GGUF-Q3_K_M-RQ-KV).
For the runtime KV-cache modifier itself (weight-agnostic), see
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant).

## Quickstart

```bash
# 1. Download the GGUF + the multimodal projector
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-GGUF-Q3_K_M Q3_K_M.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj

# 2. Multimodal inference (image example; audio and video also require the mmproj)
llama-mtmd-cli \
  -m ./model/Q3_K_M.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --image cat.jpg \
  -p "Describe this image in detail" \
  --temp 0.6 --top-p 0.95 -n 512

# 3. Text-only inference (no mmproj needed)
llama-cli \
  -m ./model/Q3_K_M.gguf \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256

# Disable extended reasoning (default is on):
#   add `--chat-template-kwargs '{"enable_thinking": false}'`
```

> ⚠️ Do NOT use a llama.cpp build compiled against CUDA 13.2: it produces gibberish output. Pin CUDA 12.x or use Metal/CPU.

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid sparse MoE) | Q3_K_M (this variant) |
| Image | CRADIO v4-H | F16, shipped in the separate mmproj-F16 file (non-GGUF variants keep the encoder in **BF16**) |
| Audio | Parakeet-TDT-0.6B-v2 | Same as image: F16 via mmproj-F16 (**BF16** in non-GGUF variants) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | Same as image; input limited to ≤ 2 min, 256 frames @ 2 FPS |

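NVIDIA's official FP8 / NVFP4 recipe keeps both encoders and the cross-modal
MLP projectors in BF16 to preserve multimodal accuracy. We follow that
convention in every quantized variant we ship: the encoders and projectors are
never quantized, and in the GGUF lane they live in the F16 mmproj split file.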
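
The video limits in the modality matrix (≤ 2 min, 256 frames @ 2 FPS) can be checked before sending a clip. The sketch below is only arithmetic on those three numbers; the function name `video_within_limits` is hypothetical and not part of llama.cpp, and it assumes the duration is given in whole seconds.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not a llama.cpp tool.
# Prints the number of frames a clip of $1 seconds yields at 2 FPS and
# returns non-zero if the clip exceeds either limit stated in this card.
video_within_limits() {
  local seconds=$1
  local max_seconds=120 fps=2 max_frames=256   # limits from the modality matrix
  local frames=$(( seconds * fps ))
  echo "$frames"
  [ "$seconds" -le "$max_seconds" ] && [ "$frames" -le "$max_frames" ]
}

# Usage:
#   video_within_limits 90 || echo "clip too long, trim it first" >&2
```

A 120-second clip yields 240 frames, which is under the 256-frame cap, so at 2 FPS the duration limit is the one that binds.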

## Runtime quirks

### llama.cpp

Use `llama-mtmd-cli` for multimodal inference; pass `--mmproj mmproj-F16.gguf`
(see `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16`).

**Do NOT use builds compiled against CUDA 13.2**: they produce gibberish
output. Pin CUDA 12.x or use the Metal/CPU paths.
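
A quick guard against the CUDA 13.2 problem can be sketched as below. The helper names `cuda_release` and `cuda_release_ok` are hypothetical. The sketch assumes `nvcc --version` prints a line containing `release MAJOR.MINOR`, and it only inspects the installed toolkit, which may differ from the toolkit your llama.cpp binary was actually built against.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of llama.cpp.

# Read `nvcc --version` output on stdin and print the "MAJOR.MINOR" release.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Return 1 for the release this card warns about, 0 otherwise.
cuda_release_ok() {
  case "$1" in
    13.2)
      echo "CUDA $1: this card reports gibberish output; use 12.x, Metal, or CPU" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# Usage (only meaningful where nvcc is installed):
#   cuda_release_ok "$(nvcc --version | cuda_release)" || exit 1
```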

### Ollama

Text-only; multimodal is blocked because Ollama doesn't yet support
the mmproj split-file pattern.

### Reasoning mode

`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call; with llama.cpp that is
`--chat-template-kwargs '{"enable_thinking": false}'`. There is no separate
"no-think" variant card: this is a runtime flag, not a model variant.
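
One way to toggle the flag from a script is sketched below. The helper `thinking_args` is hypothetical; the `--chat-template-kwargs` flag and the `enable_thinking` key are taken from the Quickstart above, so confirm with `llama-cli --help` that your build accepts them.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of llama.cpp.
# Prints the extra CLI arguments for a reasoning mode, one argument per line,
# so the JSON value survives as a single word.
thinking_args() {
  case "$1" in
    on)  ;;  # default behaviour: reasoning enabled, no extra arguments
    off) printf '%s\n' '--chat-template-kwargs' '{"enable_thinking": false}' ;;
    *)   echo "usage: thinking_args on|off" >&2; return 2 ;;
  esac
}

# Usage (mapfile needs bash 4+):
#   mapfile -t extra < <(thinking_args off)
#   llama-cli -m ./model/Q3_K_M.gguf -p "What is the capital of France?" \
#     --temp 0.6 --top-p 0.95 -n 256 "${extra[@]}"
```

Emitting one argument per line keeps the quoting problem out of the caller: the JSON contains a space, and an unquoted string expansion would split it into two arguments.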

## Quant trade-off (GGUF lane)

| Quant | Approx size | Use case | Recommendation |
|---|---|---|---|
| Q2_K | ~17 GB | Lossy, low-RAM CPU/edge | Resource-constrained inference |
| **Q3_K_M** | ~19 GB | Smaller-than-Q4, modest quality drop | **Memory-constrained devices that cannot fit Q4_K_M** |
| IQ4_XS | ~16 GB | Importance-quant 4-bit, smaller than Q4_K_M | Best size/quality at 4-bit |
| Q4_K_M | ~23 GB | Balanced default | Recommended for most users |
| Q5_K_M | ~24 GB | Higher fidelity than Q4 | Quality-sensitive applications |
| Q6_K | ~28 GB | Approaching FP16 quality | High-fidelity CPU/edge |
| Q8_0 | ~32 GB | Near-lossless reference | Fidelity-critical work |
| MXFP4_MOE | ~17 GB | Microscaling FP4 (MoE-aware) | llama.cpp users who want an MoE-aware FP4 quant |

(Current variant — **Q3_K_M** — is bolded.)

## Variants in this family

(Showing 56 sibling variants under `majentik/nemotron3-nano-omni-30b-*`. The current variant — `RotorQuant-GGUF-Q3_K_M` — is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [mmproj-F16](https://huggingface.co/majentik/nemotron3-nano-omni-30b-mmproj-f16) | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| [RotorQuant](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-MXFP4_MOE](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-MXFP4_MOE) | llama.cpp | ~30 GB | MXFP4 MoE quant |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q2_K) | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| **RotorQuant-GGUF-Q3_K_M** | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~33 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q8_0) | llama.cpp | ~63 GB | Near-lossless reference |
| [RotorQuant-GGUF-IQ4_XS-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-iq4_xs-rq-kv) | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| [RotorQuant-GGUF-MXFP4_MOE-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-mxfp4_moe-rq-kv) | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| [RotorQuant-GGUF-Q2_K-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q2_k-rq-kv) | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| [RotorQuant-GGUF-Q3_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q3_k_m-rq-kv) | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q4_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q4_k_m-rq-kv) | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q5_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q5_k_m-rq-kv) | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q8_0-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q8_0-rq-kv) | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-2bit) | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-2bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-2bit-rq-kv) | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-3bit) | mlx-lm | ~14 GB | Apple Silicon, small |
| [RotorQuant-MLX-3bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-3bit-rq-kv) | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [RotorQuant-MLX-4bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-4bit-rq-kv) | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-5bit) | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-5bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-5bit-rq-kv) | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-6bit) | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| [RotorQuant-MLX-6bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-6bit-rq-kv) | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-8bit) | mlx-lm | ~35 GB | Apple Silicon reference |
| [RotorQuant-MLX-8bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-8bit-rq-kv) | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| [RotorQuant-MLX-MXFP4](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-mxfp4) | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| [TurboQuant](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-IQ4_XS) | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| [TurboQuant-GGUF-MXFP4_MOE](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-MXFP4_MOE) | llama.cpp | ~30 GB | MXFP4 MoE quant |
| [TurboQuant-GGUF-Q2_K](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q2_K) | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| [TurboQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q3_K_M) | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| [TurboQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q4_K_M) | llama.cpp | ~33 GB | Balanced default |
| [TurboQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q5_K_M) | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| [TurboQuant-GGUF-Q8_0](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q8_0) | llama.cpp | ~63 GB | Near-lossless reference |
| [TurboQuant-GGUF-IQ4_XS-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-iq4_xs-tq-kv) | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| [TurboQuant-GGUF-MXFP4_MOE-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-mxfp4_moe-tq-kv) | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| [TurboQuant-GGUF-Q2_K-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q2_k-tq-kv) | llama.cpp | ~18 GB | Q2_K + TurboQuant KV |
| [TurboQuant-GGUF-Q3_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q3_k_m-tq-kv) | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q4_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q4_k_m-tq-kv) | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q5_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q5_k_m-tq-kv) | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q8_0-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q8_0-tq-kv) | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-2bit) | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-2bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-2bit-tq-kv) | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-3bit) | mlx-lm | ~14 GB | Apple Silicon, small |
| [TurboQuant-MLX-3bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-3bit-tq-kv) | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [TurboQuant-MLX-4bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-4bit-tq-kv) | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-5bit) | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-5bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-5bit-tq-kv) | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-6bit) | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| [TurboQuant-MLX-6bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-6bit-tq-kv) | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-8bit) | mlx-lm | ~35 GB | Apple Silicon reference |
| [TurboQuant-MLX-8bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-8bit-tq-kv) | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |