Nemotron-3-Nano-Omni-30B-A3B-Reasoning - RotorQuant (KV-cache modifier root)

RotorQuant is a runtime KV-cache modifier for Nemotron-3-Nano-Omni-30B-A3B-Reasoning. It is weight-agnostic — pair it with any quantized weight variant in this family. This root card documents the modifier itself; the weight-variant child cards link back here, and the matched-stack combo cards (suffixed -RQ-KV) ship a ready-to-load pairing of weights + KV modifier.

Modality matrix

Modality	Encoder	Quantization in this variant
Text	LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE)	per the variant suffix
Image	CRADIO v4-H	BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file)
Audio	Parakeet-TDT-0.6B-v2	BF16 (same rationale)
Video	Parakeet-TDT-0.6B-v2 + frame sampler	BF16 (≤ 2 min, 256 frames @ 2 FPS)

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.

Runtime: applying RotorQuant to any weight variant

RotorQuant operates entirely at inference time on the K/V tensors that are written into (and read from) the attention cache. It does not touch the serialized weights, so any weight quantization — GGUF, MLX, BF16 — can host it. Memory savings come from the cache; perplexity impact comes from the cache; nothing in the checkpoint changes.

Conceptually (pseudocode — not a literal API; see the matched combo cards listed below for concrete, runnable recipes):

# 1. Load any weight variant from the family.
model = load_weights(
    "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-GGUF-Q4_K_M"
)

# 2. Attach the RotorQuant modifier to its KV cache.
#    Weight dtype, file format, and runtime (llama.cpp / mlx-lm / vLLM)
#    are irrelevant to RotorQuant — it only sees K/V activations.
kv = RotorQuant.attach(
    model.kv_cache,
    rotor_bits=4,        # cache element width
    group_size=64,       # per-head rotation granularity
)

# 3. Generate as usual; the modifier intercepts cache reads/writes.
out = model.generate(prompt, kv_cache=kv, enable_thinking=True)

For ready-made pairings that bake the modifier configuration into a single loadable artifact, use the -RQ-KV combo cards listed in the next section.

Weight-variant children

Weight variants live under majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-*. Each weight variant additionally has a matched -RQ-KV combo card that pre-applies RotorQuant.

Variant family	Repos	Notes
GGUF (llama.cpp / Ollama)	Q2_K · Q3_K_M · Q4_K_M · Q5_K_M · Q8_0 · IQ4_XS · MXFP4_MOE	7 quants. Pair with the mmproj-F16 split file for multimodal.
MLX (Apple Silicon)	2bit · 3bit · 4bit · 5bit · 6bit · 8bit	6 bit-widths via `mlx_lm.convert`.
MLX-MXFP4 (Rotor only)	MLX-MXFP4	MX-format FP4 packed for MLX. No equivalent on the TurboQuant side.
Matched `-RQ-KV` combos	One per weight variant above (e.g. `-GGUF-Q4_K_M-RQ-KV`, `-MLX-4bit-RQ-KV`, …)	Ready-to-load: weights + RotorQuant modifier pre-bound. Start here if you don't want to wire `RotorQuant.attach` yourself.

Runtime quirks (inherited from the weight variant)

enable_thinking defaults to True. To disable extended reasoning (e.g., for latency-sensitive cases), pass enable_thinking=False to the chat template / generate call. No separate "no-think" variant card exists — this is a runtime flag, not a model variant.

Other runtime quirks (--mmproj for llama.cpp multimodal, Ollama's lack of mmproj support, the CUDA 13.2 gibberish bug, etc.) belong to the underlying weight variant, not to RotorQuant — see the linked child cards for those.

Variants in this family

(Showing 56 sibling variants under majentik/nemotron3-nano-omni-30b-*. The current variant — RotorQuant — is bolded.)

Variant	Runtime	Approx size	Use case
mmproj-F16	llama-mtmd-cli	~1-2 GB	Multimodal projector (pair with any GGUF)
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-GGUF-IQ4_XS	llama.cpp	~26 GB	Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-MXFP4_MOE	llama.cpp	~30 GB	MXFP4 MoE quant
RotorQuant-GGUF-Q2_K	llama.cpp	~18 GB	Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M	llama.cpp	~23 GB	Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M	llama.cpp	~33 GB	Balanced default
RotorQuant-GGUF-Q5_K_M	llama.cpp	~40 GB	Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0	llama.cpp	~63 GB	Near-lossless reference
RotorQuant-GGUF-IQ4_XS-RQ-KV	llama.cpp	~26 GB	IQ4_XS + RotorQuant KV
RotorQuant-GGUF-MXFP4_MOE-RQ-KV	llama.cpp	~30 GB	MXFP4 MoE + RotorQuant KV
RotorQuant-GGUF-Q2_K-RQ-KV	llama.cpp	~18 GB	Q2_K + RotorQuant KV
RotorQuant-GGUF-Q3_K_M-RQ-KV	llama.cpp	~23 GB	Q3_K_M + RotorQuant KV
RotorQuant-GGUF-Q4_K_M-RQ-KV	llama.cpp	~33 GB	Q4_K_M + RotorQuant KV
RotorQuant-GGUF-Q5_K_M-RQ-KV	llama.cpp	~40 GB	Q5_K_M + RotorQuant KV
RotorQuant-GGUF-Q8_0-RQ-KV	llama.cpp	~63 GB	Q8_0 + RotorQuant KV
RotorQuant-MLX-2bit	mlx-lm	~9.6 GB	Apple Silicon, smallest
RotorQuant-MLX-2bit-RQ-KV	mlx-lm	~9.6 GB	2-bit + RotorQuant KV
RotorQuant-MLX-3bit	mlx-lm	~14 GB	Apple Silicon, small
RotorQuant-MLX-3bit-RQ-KV	mlx-lm	~14 GB	3-bit + RotorQuant KV
RotorQuant-MLX-4bit	mlx-lm	~19 GB	Apple Silicon balanced
RotorQuant-MLX-4bit-RQ-KV	mlx-lm	~19 GB	4-bit + RotorQuant KV
RotorQuant-MLX-5bit	mlx-lm	~23 GB	Apple Silicon, higher fidelity
RotorQuant-MLX-5bit-RQ-KV	mlx-lm	~23 GB	5-bit + RotorQuant KV
RotorQuant-MLX-6bit	mlx-lm	~27 GB	Apple Silicon, near-lossless
RotorQuant-MLX-6bit-RQ-KV	mlx-lm	~27 GB	6-bit + RotorQuant KV
RotorQuant-MLX-8bit	mlx-lm	~35 GB	Apple Silicon reference
RotorQuant-MLX-8bit-RQ-KV	mlx-lm	~35 GB	8-bit + RotorQuant KV
RotorQuant-MLX-MXFP4	mlx-lm	~19 GB	Apple Silicon MXFP4
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-GGUF-IQ4_XS	llama.cpp	~26 GB	Lossy 4-bit, low-RAM CPU/edge
TurboQuant-GGUF-MXFP4_MOE	llama.cpp	~30 GB	MXFP4 MoE quant
TurboQuant-GGUF-Q2_K	llama.cpp	~18 GB	Lossy, low-RAM CPU/edge
TurboQuant-GGUF-Q3_K_M	llama.cpp	~23 GB	Smaller 3-bit, CPU-friendly
TurboQuant-GGUF-Q4_K_M	llama.cpp	~33 GB	Balanced default
TurboQuant-GGUF-Q5_K_M	llama.cpp	~40 GB	Higher fidelity, more RAM
TurboQuant-GGUF-Q8_0	llama.cpp	~63 GB	Near-lossless reference
TurboQuant-GGUF-IQ4_XS-TQ-KV	llama.cpp	~26 GB	IQ4_XS + TurboQuant KV
TurboQuant-GGUF-MXFP4_MOE-TQ-KV	llama.cpp	~30 GB	MXFP4 MoE + TurboQuant KV
TurboQuant-GGUF-Q2_K-TQ-KV	llama.cpp	~18 GB	Q2_K + TurboQuant KV
TurboQuant-GGUF-Q3_K_M-TQ-KV	llama.cpp	~23 GB	Q3_K_M + TurboQuant KV
TurboQuant-GGUF-Q4_K_M-TQ-KV	llama.cpp	~33 GB	Q4_K_M + TurboQuant KV
TurboQuant-GGUF-Q5_K_M-TQ-KV	llama.cpp	~40 GB	Q5_K_M + TurboQuant KV
TurboQuant-GGUF-Q8_0-TQ-KV	llama.cpp	~63 GB	Q8_0 + TurboQuant KV
TurboQuant-MLX-2bit	mlx-lm	~9.6 GB	Apple Silicon, smallest
TurboQuant-MLX-2bit-TQ-KV	mlx-lm	~9.6 GB	2-bit + TurboQuant KV
TurboQuant-MLX-3bit	mlx-lm	~14 GB	Apple Silicon, small
TurboQuant-MLX-3bit-TQ-KV	mlx-lm	~14 GB	3-bit + TurboQuant KV
TurboQuant-MLX-4bit	mlx-lm	~19 GB	Apple Silicon balanced
TurboQuant-MLX-4bit-TQ-KV	mlx-lm	~19 GB	4-bit + TurboQuant KV
TurboQuant-MLX-5bit	mlx-lm	~23 GB	Apple Silicon, higher fidelity
TurboQuant-MLX-5bit-TQ-KV	mlx-lm	~23 GB	5-bit + TurboQuant KV
TurboQuant-MLX-6bit	mlx-lm	~27 GB	Apple Silicon, near-lossless
TurboQuant-MLX-6bit-TQ-KV	mlx-lm	~27 GB	6-bit + TurboQuant KV
TurboQuant-MLX-8bit	mlx-lm	~35 GB	Apple Silicon reference
TurboQuant-MLX-8bit-TQ-KV	mlx-lm	~35 GB	8-bit + TurboQuant KV

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant

Base model

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Finetuned

(42)

this model

majentik
/

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant