Bonsai

Prism ML Website  |  White Paper  |  Demo & Examples  |  Discord

Ternary-Bonsai-8B-mlx-2bit

Ternary (1.58-bit) language model for Apple Silicon

7.1x smaller than FP16 | 5.2x faster on M4 Pro | 27 tok/s on iPhone | runs on Mac, iPhone, iPad

Highlights

  • 2.15 GiB (2.30 GB) packed 2-bit size (down from 16.38 GB FP16) — runs comfortably on any Mac or iPhone
  • Ternary weights {-1, 0, +1} across embeddings, attention projections, MLP projections, and LM head
  • 75.5 avg benchmark score across 6 categories — competitive with full-precision 8B models at 1/9th the size
  • 5-point improvement over our earlier 1-bit Bonsai 8B (70.5) at only ~0.6 GB additional footprint
  • MLX-native format with group size 128 and FP16 scaling

Pareto Frontier

[Figure: Pareto frontier of benchmark score vs. model size across the compared models.]

Resources

  • White Paper
  • Demo repo — examples for serving, benchmarking, and integrating Bonsai
  • Discord — community support and updates
  • Kernels: MLX (Apple Silicon) · mlx-swift (iOS/macOS) — 2-bit format is supported out of the box

Model Overview

| Item | Specification |
| --- | --- |
| Base model | Qwen3-8B |
| Parameters | 8.19B (~6.95B non-embedding) |
| Architecture | GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm |
| Layers | 36 Transformer decoder blocks |
| Context length | 65,536 tokens |
| Vocab size | 151,936 |
| Weight format | Ternary g128: {-1, 0, +1} with FP16 group-wise scaling |
| Packed 2-bit size | 2.15 GiB (2.30 GB) |
| Ternary coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |

Quantization Format: Ternary g128

Each weight takes a value from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights:

w_i = scale_g * t_i,    t_i in {-1, 0, +1}

The information-theoretic cost is log2(3) ≈ 1.585 bits per weight, plus FP16 group scales (16 bits per 128 weights), for a theoretical minimum of ~1.71 bits/weight. This release uses the MLX 2-bit format, which stores each ternary value in 2 bits plus group scales, for an effective ~2.125 bits/weight.
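As an illustration of this scheme, here is a minimal pure-Python sketch. The mean-absolute-value group scale and half-scale threshold below are illustrative assumptions, not necessarily the exact rounding rule used for this release:

```python
def quantize_ternary_g128(weights, group_size=128):
    """Map each group of weights to codes in {-1, 0, +1} plus one shared scale.

    Scale choice (mean absolute value) and threshold (half the scale) are
    illustrative assumptions, not the release's exact recipe.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One shared scale per group of up to 128 weights.
        scale = sum(abs(w) for w in group) / len(group) or 1.0
        scales.append(scale)
        # Small weights snap to 0; the rest keep only their sign.
        codes.extend(
            0 if abs(w) < 0.5 * scale else (1 if w > 0 else -1)
            for w in group
        )
    return codes, scales


def dequantize_ternary_g128(codes, scales, group_size=128):
    """Reconstruct w_i = scale_g * t_i from codes and group scales."""
    return [scales[i // group_size] * t for i, t in enumerate(codes)]
```

Only the ternary codes and one FP16 scale per group need to be stored; dequantization is a single multiply per weight.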

Compared with binary (1-bit) quantization, the added zero value gives a more expressive weight representation, preserving model quality better under extreme compression.
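To make the 2-bit accounting concrete, here is a toy bit-packing of four ternary codes per byte. The code assignment {-1 → 00, 0 → 01, +1 → 10} is an arbitrary choice for illustration; MLX's actual on-disk layout differs in detail:

```python
# Illustrative 2-bit packing; the code assignment is an assumption,
# not MLX's actual storage format.
_ENC = {-1: 0b00, 0: 0b01, 1: 0b10}
_DEC = {v: k for k, v in _ENC.items()}


def pack_2bit(codes):
    """Pack ternary codes, four per byte, little-end first."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, t in enumerate(codes[i:i + 4]):
            b |= _ENC[t] << (2 * j)
        out.append(b)
    return bytes(out)


def unpack_2bit(packed, n):
    """Recover the first n ternary codes from packed bytes."""
    codes = []
    for b in packed:
        for j in range(4):
            if len(codes) == n:
                break
            codes.append(_DEC[(b >> (2 * j)) & 0b11])
    return codes
```

With group size 128, a group costs 2 · 128 code bits plus one 16-bit scale, i.e. (256 + 16) / 128 = 2.125 bits per weight, the effective figure quoted above.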

Memory

| Format | Size | Reduction | Ratio |
| --- | --- | --- | --- |
| FP16 | 16.38 GB | -- | 1.0x |
| MLX 2-bit g128 | 2.15 GiB (2.30 GB) | 86.0% | 7.1x |
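The reduction and ratio columns follow directly from the two sizes:

```python
def compression_stats(fp16_gb, packed_gb):
    """Percent size reduction and compression ratio relative to FP16."""
    reduction = (1 - packed_gb / fp16_gb) * 100
    ratio = fp16_gb / packed_gb
    return reduction, ratio


# For this release: 16.38 GB FP16 vs 2.30 GB packed
reduction, ratio = compression_stats(16.38, 2.30)
print(f"{reduction:.1f}% smaller, {ratio:.1f}x compression")
# prints: 86.0% smaller, 7.1x compression
```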

Quickstart

MLX (Python)

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("prism-ml/Ternary-Bonsai-8B-mlx-2bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=256,
)
print(response)
```

MLX Swift (iOS / macOS)

Ternary Bonsai 8B runs natively on iPhone and iPad via MLX Swift at 27 tok/s on iPhone 17 Pro Max. The 2-bit format is supported out of the box.

Throughput (MLX / Apple Silicon)

| Platform | Backend | PP512 (tok/s) | TG128 (tok/s) | FP16 TG (tok/s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| M4 Pro 48 GB | MLX (Python) | 460 | 83 | 16 | 5.2x |

iPhone 17 Pro Max (MLX Swift)

| Platform | Backend | PP512 (tok/s) | TG128 (tok/s) | 4-bit TG (tok/s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| iPhone 17 Pro Max | MLX Swift | 363 | 27 | 14 | 1.9x |

Benchmarks

Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100 under identical infrastructure, generation parameters, and scoring. All models are in the 6B-9B parameter range.

| Model | Size | Avg | MMLU-R | MuSR | GSM8K | HE+ | IFEval | BFCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen 3 8B | 16.38 GB | 79.3 | 83 | 55 | 93 | 82.3 | 81.5 | 81 |
| Ternary Bonsai 8B | 1.75 GB | 75.5 | 72.6 | 56.2 | 91 | 77.4 | 81.8 | 73.9 |
| 1-bit Bonsai 8B (prior) | 1.15 GB | 70.5 | 65.7 | 50 | 88 | 73.8 | 79.8 | 65.7 |
| RNJ 8B | 16.63 GB | 73.1 | 75.5 | 50.4 | 93.7 | 84.2 | 73.8 | 61.1 |
| Ministral3 8B | 16.04 GB | 71.0 | 68.9 | 53.8 | 87.9 | 72.6 | 67.4 | 75.4 |
| Olmo 3 7B | 14.60 GB | 70.9 | 72 | 56.1 | 92.5 | 79.3 | 87.1 | 38.4 |

Ternary Bonsai 8B ranks 2nd among all compared models despite being 1/9th the size.
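Assuming Avg is the unweighted mean of the six category columns (the table's numbers are consistent with this), a quick check:

```python
def avg_score(scores):
    """Unweighted mean over the six benchmark categories, to one decimal."""
    return round(sum(scores) / len(scores), 1)


# Ternary Bonsai 8B row: MMLU-R, MuSR, GSM8K, HE+, IFEval, BFCL
print(avg_score([72.6, 56.2, 91, 77.4, 81.8, 73.9]))  # prints: 75.5
```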

Intelligence Density

density = -ln(1 - score/100) / size_GB
| Model | Size | Intelligence Density (1/GB) |
| --- | --- | --- |
| Ternary Bonsai 8B | 1.75 GB | 0.803 |
| 1-bit Bonsai 8B (prior) | 1.15 GB | 1.062 |
| Qwen 3 8B | 16.38 GB | 0.096 |
| RNJ 8B | 16.63 GB | 0.079 |
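The density formula can be evaluated directly to reproduce the table (to within rounding):

```python
import math


def intelligence_density(score, size_gb):
    """density = -ln(1 - score/100) / size_GB, per the definition above."""
    return -math.log(1 - score / 100) / size_gb
```

For example, `intelligence_density(75.5, 1.75)` gives roughly 0.80, matching the Ternary Bonsai 8B row. The log transform rewards score gains more heavily as scores approach 100, so a small, accurate model scores far higher than a large one at similar quality.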

Limitations

  • Only MLX 2-bit format is available at initial release; more formats for other backends coming soon
  • Mobile power measurement is estimated rather than hardware-metered
  • The full-precision frontier continues to advance; the ternary methodology is architecture-agnostic and can be reapplied to newer base models

Citation

```bibtex
@techreport{ternarybonsai,
    title   = {Ternary Bonsai: 1.58-bit Language Models at 8B, 4B, and 1.7B Scale},
    author  = {Prism ML},
    year    = {2026},
    month   = {April},
    url     = {https://prismml.com}
}
```

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
