VibeVoice-1.5B-MLX-INT4

INT4-quantized MLX bundle of the long-form Microsoft VibeVoice-1.5B for Apple Silicon, ready to load with the VibeVoiceTTS Swift module from soniqo/speech-swift.

The 1.5B variant is VibeVoice's flagship long-form model, designed for podcast-length dialogue, audiobook narration, and multi-speaker scenes. It generates up to 90 minutes of audio with up to 4 distinct voices in a single generation pass, keeping each voice's identity consistent throughout. For low-latency short utterances, see the smaller Realtime-0.5B INT4.

What's in the box

model.safetensors — INT4 group-quantized Qwen2 backbone (group_size=32, mode=affine), tokenizer + acoustic tokenizer + diffusion head + EOS classifier kept in source dtype
quantization.json — per-layer manifest (378 quantized layers)
config.json, preprocessor_config.json — copied from upstream

Bundle size: 2.18 GB.
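
For intuition about group_size=32, mode=affine, here is a minimal pure-Python sketch of affine group quantization. It is not the code used to build this bundle: the function names are illustrative, and real MLX kernels pack the 4-bit codes into integers rather than keeping Python lists. The 32 bits of per-group overhead (a 16-bit scale plus a 16-bit bias) is an assumption, not a figure from this card.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1                 # 15 distinct steps for INT4
    lo, hi = min(weights), max(weights)
    # A constant group has no range; any non-zero scale works in that case.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                  # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize(weights, group_size=32, bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def effective_bits_per_weight(group_size=32, bits=4, overhead_bits=32):
    """Storage cost per weight, including the assumed per-group scale and bias."""
    return bits + overhead_bits / group_size
```

Each group of 32 weights shares one scale and one bias, so the reconstruction error per weight is at most half a quantization step for that group. Under the overhead assumption above, the quantized layers cost about 5 bits per weight rather than 4.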

Performance (Apple M2 Max, 64 GB)

DPM-Solver steps	Audio duration	Elapsed (wall clock)	RTF (elapsed / audio)	RTFx (audio / elapsed)
10	26.67 s	19.76 s	0.74	1.35×

Measured with max_tokens set to 200 and 10 DPM-Solver steps, using a voice cache encoded from a 17-second clip. The 1.5B model requires the dual-encoder voice prefill path, which is available in VibeVoiceTTS 0.2.0 and later.
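
The two ratios in the table follow directly from the elapsed and audio durations. A small sketch (helper names are illustrative) that recomputes them from the reported numbers:

```python
def real_time_factor(elapsed_s, audio_s):
    """RTF: seconds of compute per second of audio (below 1 is faster than real time)."""
    return elapsed_s / audio_s


def speedup(elapsed_s, audio_s):
    """RTFx: seconds of audio produced per second of compute (inverse of RTF)."""
    return audio_s / elapsed_s


# Values from the benchmark row above.
rtf = real_time_factor(elapsed_s=19.76, audio_s=26.67)
rtfx = speedup(elapsed_s=19.76, audio_s=26.67)
print(f"RTF {rtf:.2f}, RTFx {rtfx:.2f}x")  # prints "RTF 0.74, RTFx 1.35x"
```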

Use it

Swift / iOS / macOS

import VibeVoiceTTS

// The .longForm1_5B preset raises maxSpeechTokens to 4000 and cfgScale to 1.5.
let config = VibeVoiceTTSModel.Configuration.longForm1_5B

// Load the model, then attach a pre-encoded voice cache for the speaker.
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voices/narrator.safetensors")

// longTranscript is a String holding the text to synthesize (defined elsewhere).
let pcm = try await tts.generate(text: longTranscript)  // up to ~90 min

The longForm1_5B preset uses the Qwen/Qwen2.5-1.5B tokenizer. To point at this aufklarer bundle directly, override modelId:

var config = VibeVoiceTTSModel.Configuration.longForm1_5B
config.modelId = "aufklarer/VibeVoice-1.5B-MLX-INT4"

CLI

audio vibevoice "Long paragraph ..." \
    --model aufklarer/VibeVoice-1.5B-MLX-INT4 \
    --tokenizer Qwen/Qwen2.5-1.5B \
    --voice-cache voices/narrator.safetensors \
    --max-tokens 4000 \
    --output episode.wav
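
When choosing --max-tokens, the benchmark in the Performance section gives a rough conversion: 200 tokens produced 26.67 s of audio, or about 7.5 tokens per second. The sketch below turns that into a budget estimate. It assumes the benchmark run generated exactly max_tokens speech tokens and that the rate is constant; neither is stated in this card, and the helper names are illustrative.

```python
import math

BENCH_TOKENS = 200       # max_tokens used for the benchmark row
BENCH_AUDIO_S = 26.67    # audio duration reported for that run


def tokens_per_second(tokens=BENCH_TOKENS, audio_s=BENCH_AUDIO_S):
    """Speech tokens per second of audio, inferred from the benchmark (assumption)."""
    return tokens / audio_s


def tokens_for_audio(seconds, rate=None):
    """Token budget needed for `seconds` of audio, rounded up."""
    rate = tokens_per_second() if rate is None else rate
    return math.ceil(seconds * rate)


def audio_for_tokens(tokens, rate=None):
    """Approximate seconds of audio that a token budget covers."""
    rate = tokens_per_second() if rate is None else rate
    return tokens / rate
```

Under these assumptions, tokens_for_audio(60) gives 450 tokens for one minute of speech, and audio_for_tokens(4000) gives roughly 533 s, so check the budget against the length you actually need.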

Voice caches

Multi-speaker dialogue uses one voice cache per speaker: swap caches between calls, or route dialogue scripts through your app. MIT-licensed example caches are available at mzbac/vibevoice.swift/voice_cache. To create your own, run audio vibevoice-encode-voice reference.wav "transcript" -o voice.safetensors.
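
One way to route a dialogue script through the CLI is to split it into speaker turns and build one audio vibevoice invocation per turn, each with that speaker's cache. The sketch below only constructs the argument lists; it does not run them. The "Name: text" script format, the function names, and the per-turn output naming are assumptions for illustration. Only the flags shown in the CLI section above are used.

```python
def parse_script(script):
    """Split 'Speaker: text' lines into (speaker, text) turns, skipping blank lines."""
    turns = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"missing 'Speaker:' prefix in line: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


def build_commands(turns, voice_caches, out_prefix="turn"):
    """Build one CLI argument list per turn, swapping the voice cache by speaker."""
    commands = []
    for index, (speaker, text) in enumerate(turns):
        if speaker not in voice_caches:
            raise KeyError(f"no voice cache registered for speaker {speaker!r}")
        commands.append([
            "audio", "vibevoice", text,
            "--model", "aufklarer/VibeVoice-1.5B-MLX-INT4",
            "--tokenizer", "Qwen/Qwen2.5-1.5B",
            "--voice-cache", voice_caches[speaker],
            "--output", f"{out_prefix}_{index:03d}.wav",
        ])
    return commands
```

Each returned list can be passed to subprocess.run, and the resulting WAV files concatenated in turn order.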

Languages

English and Chinese only. The Qwen2.5 tokenizer accepts other languages, but the training data is EN/ZH only, so non-EN/ZH input produces unintelligible audio.
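
Because unsupported input fails silently as unintelligible audio, a cheap pre-check before synthesis can help. The sketch below is a heuristic, not language detection: it only flags letters outside the Latin and CJK ranges, so it catches Cyrillic or Arabic text but lets other Latin-script languages such as French through. The function name is illustrative.

```python
import unicodedata


def unsupported_characters(text):
    """Return letters whose Unicode name is neither LATIN nor CJK (heuristic)."""
    flagged = []
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace are ignored
        name = unicodedata.name(ch, "")
        if not (name.startswith("LATIN") or name.startswith("CJK")):
            flagged.append(ch)
    return flagged
```

An empty result means the text uses only Latin and Chinese characters; a non-empty result is a reason to reject or review the input before generating audio.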

License

MIT, inherited from the upstream Microsoft VibeVoice repo. Note: Microsoft has occasionally restricted access to the upstream microsoft/VibeVoice-1.5B repository. This bundle was converted from the BF16 source under MIT and remains available here.

Reproduction

Converted with models/vibevoice/export/convert.py from soniqo/speech-models (private), run with --model microsoft/VibeVoice-1.5B --bits 4.

Citation

@misc{microsoft_vibevoice,
  title  = {VibeVoice: Long-form, Multi-speaker Text-to-Speech},
  author = {Microsoft Research},
  year   = {2025},
  url    = {https://huggingface.co/microsoft/VibeVoice-1.5B}
}
