Instructions to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Nemotron-3-Nano-Omni-30B-A3B-Reasoning - RotorQuant (KV-cache modifier root)
RotorQuant is a runtime KV-cache modifier for Nemotron-3-Nano-Omni-30B-A3B-Reasoning. It is weight-agnostic —
pair it with any quantized weight variant in this family. This root card
documents the modifier itself; the weight-variant child cards link back here,
and the matched-stack combo cards (suffixed -RQ-KV) ship a
ready-to-load pairing of weights + KV modifier.
Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | per the variant suffix |
| Image | CRADIO v4-H | BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | BF16 (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | BF16 (≤ 2 min, 256 frames @ 2 FPS) |
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.
Runtime: applying RotorQuant to any weight variant
RotorQuant operates entirely at inference time on the K/V tensors that are written into (and read from) the attention cache. It does not touch the serialized weights, so any weight quantization — GGUF, MLX, BF16 — can host it. Memory savings come from the cache; perplexity impact comes from the cache; nothing in the checkpoint changes.
Conceptually (pseudocode — not a literal API; see the matched combo cards listed below for concrete, runnable recipes):
# 1. Load any weight variant from the family.
model = load_weights(
"majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-GGUF-Q4_K_M"
)
# 2. Attach the RotorQuant modifier to its KV cache.
# Weight dtype, file format, and runtime (llama.cpp / mlx-lm / vLLM)
# are irrelevant to RotorQuant — it only sees K/V activations.
kv = RotorQuant.attach(
model.kv_cache,
rotor_bits=4, # cache element width
group_size=64, # per-head rotation granularity
)
# 3. Generate as usual; the modifier intercepts cache reads/writes.
out = model.generate(prompt, kv_cache=kv, enable_thinking=True)
For ready-made pairings that bake the modifier configuration into a single
loadable artifact, use the -RQ-KV combo cards listed in the next section.
Weight-variant children
Weight variants live under majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-*.
Each weight variant additionally has a matched -RQ-KV combo card that
pre-applies RotorQuant.
| Variant family | Repos | Notes |
|---|---|---|
| GGUF (llama.cpp / Ollama) | Q2_K · Q3_K_M · Q4_K_M · Q5_K_M · Q8_0 · IQ4_XS · MXFP4_MOE | 7 quants. Pair with the mmproj-F16 split file for multimodal. |
| MLX (Apple Silicon) | 2bit · 3bit · 4bit · 5bit · 6bit · 8bit | 6 bit-widths via mlx_lm.convert. |
| MLX-MXFP4 (Rotor only) | MLX-MXFP4 | MX-format FP4 packed for MLX. No equivalent on the TurboQuant side. |
Matched -RQ-KV combos |
One per weight variant above (e.g. -GGUF-Q4_K_M-RQ-KV, -MLX-4bit-RQ-KV, …) |
Ready-to-load: weights + RotorQuant modifier pre-bound. Start here if you don't want to wire RotorQuant.attach yourself. |
Runtime quirks (inherited from the weight variant)
enable_thinking defaults to True. To disable extended reasoning
(e.g., for latency-sensitive cases), pass enable_thinking=False
to the chat template / generate call. No separate "no-think"
variant card exists — this is a runtime flag, not a model variant.
Other runtime quirks (--mmproj for llama.cpp multimodal, Ollama's lack of
mmproj support, the CUDA 13.2 gibberish bug, etc.) belong to the underlying
weight variant, not to RotorQuant — see the linked child cards for
those.
Variants in this family
(Showing 56 sibling variants under majentik/nemotron3-nano-omni-30b-*. The current variant — RotorQuant — is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| mmproj-F16 | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| RotorQuant-GGUF-IQ4_XS-RQ-KV | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| RotorQuant-GGUF-MXFP4_MOE-RQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| RotorQuant-GGUF-Q2_K-RQ-KV | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| RotorQuant-GGUF-Q3_K_M-RQ-KV | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q4_K_M-RQ-KV | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q5_K_M-RQ-KV | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q8_0-RQ-KV | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| RotorQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-2bit-RQ-KV | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| RotorQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| RotorQuant-MLX-3bit-RQ-KV | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-4bit-RQ-KV | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| RotorQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| RotorQuant-MLX-5bit-RQ-KV | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| RotorQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| RotorQuant-MLX-6bit-RQ-KV | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| RotorQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| RotorQuant-MLX-8bit-RQ-KV | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| RotorQuant-MLX-MXFP4 | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| TurboQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| TurboQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| TurboQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| TurboQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| TurboQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| TurboQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| TurboQuant-GGUF-IQ4_XS-TQ-KV | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| TurboQuant-GGUF-MXFP4_MOE-TQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| TurboQuant-GGUF-Q2_K-TQ-KV | llama.cpp | ~18 GB | Q2_K + TurboQuant KV |
| TurboQuant-GGUF-Q3_K_M-TQ-KV | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q4_K_M-TQ-KV | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q5_K_M-TQ-KV | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q8_0-TQ-KV | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| TurboQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-2bit-TQ-KV | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| TurboQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| TurboQuant-MLX-3bit-TQ-KV | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| TurboQuant-MLX-4bit-TQ-KV | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| TurboQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| TurboQuant-MLX-5bit-TQ-KV | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| TurboQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| TurboQuant-MLX-6bit-TQ-KV | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| TurboQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| TurboQuant-MLX-8bit-TQ-KV | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |