VibeVoice-1.5B-MLX-INT4
INT4-quantized MLX bundle of the long-form Microsoft VibeVoice-1.5B for Apple Silicon, ready to load with the VibeVoiceTTS Swift module from soniqo/speech-swift.
The 1.5B variant is VibeVoice's flagship long-form model, designed for podcast-length dialogue, audiobook narration, and multi-speaker scenes. It generates up to 90 minutes of audio with up to 4 distinct voices, keeping voice identity consistent throughout, in a single generation pass. For low-latency short utterances, see the smaller Realtime-0.5B INT4.
What's in the box
- model.safetensors: INT4 group-quantized Qwen2 backbone (group_size=32, mode=affine); the tokenizer, acoustic tokenizer, diffusion head, and EOS classifier are kept in their source dtype
- quantization.json: per-layer manifest (378 quantized layers)
- config.json, preprocessor_config.json: copied from upstream
Bundle size: 2.18 GB.
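To make "INT4 group-quantized (group_size=32, mode=affine)" concrete, here is a minimal sketch of the general affine scheme: each group of 32 weights is stored as 4-bit codes plus one scale and one bias, and reconstructed as code * scale + bias. This illustrates the idea only. It is not the converter's or MLX's actual kernel, and the names QuantizedGroup, quantizeGroup, and dequantizeGroup are hypothetical.

```swift
import Foundation

/// One quantized group: 4-bit codes plus the per-group scale and bias.
struct QuantizedGroup {
    let codes: [UInt8]   // each value is in 0...15
    let scale: Float
    let bias: Float
}

/// Affine INT4 quantization of one group, so that w ≈ code * scale + bias.
func quantizeGroup(_ weights: [Float]) -> QuantizedGroup {
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    // 4 bits give 16 levels, so 15 steps span [lo, hi]. A flat group gets scale 1.
    let scale = hi > lo ? (hi - lo) / 15 : 1
    let codes = weights.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(min(max(q, 0), 15))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstructs approximate weights from the codes, scale, and bias.
func dequantizeGroup(_ g: QuantizedGroup) -> [Float] {
    g.codes.map { Float($0) * g.scale + g.bias }
}

// Example: one group of 32 evenly spaced weights in [-0.5, 0.5].
let groupSize = 32
let weights = (0..<groupSize).map { Float($0) / Float(groupSize - 1) - 0.5 }
let group = quantizeGroup(weights)
let restored = dequantizeGroup(group)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print("max abs error:", maxError, "half step:", group.scale / 2)
```

The reconstruction error per weight is bounded by half a quantization step, which is why a smaller group (a tighter min/max range per scale) generally preserves more precision at the cost of storing more scales and biases.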
Performance (Apple M2 Max, 64 GB)
| Steps | Audio | Elapsed | RTF | RTFx |
|---|---|---|---|---|
| 10 | 26.67 s | 19.76 s | 0.74 | 1.35× |
Measured with 200 max_tokens, 10 DPM-Solver steps, and a voice cache encoded from a 17-second clip.
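The RTF and RTFx columns follow directly from the other two: RTF is elapsed time divided by audio duration, and RTFx is its reciprocal. A short sketch that reproduces the row above (the function names are illustrative, not part of any library):

```swift
import Foundation

/// Real-time factor: seconds of compute per second of audio (lower is faster).
func realTimeFactor(audioSeconds: Double, elapsedSeconds: Double) -> Double {
    elapsedSeconds / audioSeconds
}

/// Speed-up over real time: the reciprocal of RTF (higher is faster).
func realTimeSpeedup(audioSeconds: Double, elapsedSeconds: Double) -> Double {
    audioSeconds / elapsedSeconds
}

// Values taken from the benchmark row above.
let rtf = realTimeFactor(audioSeconds: 26.67, elapsedSeconds: 19.76)
let rtfx = realTimeSpeedup(audioSeconds: 26.67, elapsedSeconds: 19.76)
print(String(format: "RTF %.2f, RTFx %.2fx", rtf, rtfx))
```

An RTF below 1 (equivalently, an RTFx above 1) means audio is produced faster than it plays back.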
The 1.5B model requires the dual-encoder voice prefill path added in VibeVoiceTTS 0.2.0, so use version 0.2.0 or later.
Use it
Swift / iOS / macOS
import VibeVoiceTTS
let config = VibeVoiceTTSModel.Configuration.longForm1_5B
// .longForm1_5B preset bumps maxSpeechTokens to 4000 and cfgScale to 1.5
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voices/narrator.safetensors")
let pcm = try await tts.generate(text: longTranscript) // up to ~90 min
The longForm1_5B preset uses the Qwen/Qwen2.5-1.5B tokenizer. Override modelId if you want to point at this aufklarer bundle directly:
var config = VibeVoiceTTSModel.Configuration.longForm1_5B
config.modelId = "aufklarer/VibeVoice-1.5B-MLX-INT4"
CLI
audio vibevoice "Long paragraph ..." \
--model aufklarer/VibeVoice-1.5B-MLX-INT4 \
--tokenizer Qwen/Qwen2.5-1.5B \
--voice-cache voices/narrator.safetensors \
--max-tokens 4000 \
--output episode.wav
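When choosing a value for --max-tokens, a rough token-to-duration estimate can help. The sketch below assumes the rate implied by the benchmark row on this card (200 tokens produced 26.67 s of audio) stays constant and that every token in that run was a speech token. The card does not state this rate directly, so treat the results as estimates. The helper names are hypothetical.

```swift
import Foundation

// Assumption: the benchmark row (200 tokens -> 26.67 s of audio) reflects a
// constant speech-token rate. This is inferred, not documented.
let tokensPerSecond = 200.0 / 26.67

/// Approximate audio duration covered by a token budget.
func audioSeconds(forTokens tokens: Int) -> Double {
    Double(tokens) / tokensPerSecond
}

/// Approximate token budget needed for a target duration, rounded up.
func tokenBudget(forSeconds seconds: Double) -> Int {
    Int((seconds * tokensPerSecond).rounded(.up))
}

let presetSeconds = audioSeconds(forTokens: 4000)
print(String(format: "4000 tokens ~ %.1f min", presetSeconds / 60))
print("10 min needs about", tokenBudget(forSeconds: 600), "tokens")
```

If the assumed rate holds, a budget of 4000 tokens covers roughly nine minutes of audio, so longer output would need a proportionally larger budget.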
Voice caches
Multi-speaker dialogue uses one voice cache per speaker: swap them between calls or wire dialogue scripts through your app. MIT-licensed example caches: mzbac/vibevoice.swift/voice_cache. Mint your own with audio vibevoice-encode-voice reference.wav "transcript" -o voice.safetensors.
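One way to wire a dialogue script through an app is to split it into per-speaker turns and look up each speaker's voice cache before generating. The sketch below covers only the parsing and lookup. The "Name: text" script format, the Turn type, the parseScript helper, and the guest.safetensors path are all assumptions made for illustration, and the model calls are left as comments that mirror the Swift usage shown earlier.

```swift
import Foundation

/// One turn of dialogue, tagged with the speaker who says it.
struct Turn: Equatable {
    let speaker: String
    let text: String
}

/// Parses lines of the form "Name: text". Lines without a colon are skipped.
func parseScript(_ script: String) -> [Turn] {
    script.split(separator: "\n").compactMap { line -> Turn? in
        guard let colon = line.firstIndex(of: ":") else { return nil }
        let speaker = line[..<colon].trimmingCharacters(in: .whitespaces)
        let text = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)
        return speaker.isEmpty || text.isEmpty ? nil : Turn(speaker: speaker, text: text)
    }
}

// Hypothetical speaker-to-cache mapping; the guest file name is a placeholder.
let voiceCaches = [
    "Narrator": "voices/narrator.safetensors",
    "Guest": "voices/guest.safetensors",
]

let script = """
Narrator: Welcome back to the show.
Guest: Thanks for having me.
"""

for turn in parseScript(script) {
    guard let cache = voiceCaches[turn.speaker] else { continue }
    // With a loaded model, the calls shown earlier would go here:
    //   try tts.loadVoice(from: cache)
    //   let pcm = try await tts.generate(text: turn.text)
    print("\(turn.speaker) -> \(cache): \(turn.text)")
}
```

Turns whose speaker has no cache entry are skipped here. An app may prefer to fail loudly instead so a missing voice does not silently drop lines.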
Languages
English and Chinese only. The Qwen2.5 tokenizer accepts other languages, but the training data is EN/ZH only, so non-EN/ZH input produces unintelligible audio.
License
MIT, inherited from the upstream Microsoft VibeVoice repo. Note: Microsoft has occasionally restricted access to the upstream microsoft/VibeVoice-1.5B repository. This bundle is converted from the BF16 source under MIT and remains available here.
Reproduction
Converted with models/vibevoice/export/convert.py in soniqo/speech-models (private), run with --model microsoft/VibeVoice-1.5B --bits 4.
Citation
@misc{microsoft_vibevoice,
title = {VibeVoice: Long-form, Multi-speaker Text-to-Speech},
author = {Microsoft Research},
year = {2025},
url = {https://huggingface.co/microsoft/VibeVoice-1.5B}
}