# PersonaPlex-24L Q4_K (WebGPU)

Work in progress. This model exists primarily to test layer pruning + QLoRA recovery for browser deployment. Model behavior may differ from the original 32L model. No guarantees; use at your own discretion.
A smaller, quantized version of NVIDIA PersonaPlex-7B-v1 optimized for browser-based speech-to-speech inference via WebGPU.
## What is this?
PersonaPlex-7B is a full-duplex speech-to-speech model based on Moshi. The original model has 8.37B parameters across a 32-layer temporal transformer and a 6-layer depth transformer.
This version removes 8 temporal transformer layers (layers 12-19), recovers quality through LoRA fine-tuning, and then quantizes the merged model to Q4_K for efficient browser deployment. Quality has been assessed qualitatively through listening tests only; no formal metrics are available.
| | Original (32L bf16) | Original Q4_K | This model (24L Q4_K) |
|---|---|---|---|
| Temporal layers | 32 | 32 | 24 |
| Total params | 8.37B | 8.37B | 6.74B |
| File size | 16.7 GB | 4.4 GB | 3.5 GB |
| Format | safetensors | GGUF Q4_K | GGUF Q4_K |
## How it was made

### 1. Layer Pruning
Removed temporal transformer layers 12-19 (the middle 8 of 32). The depth transformer (6 layers, 1024 dim) is kept intact. This removes ~1.6B parameters from the temporal transformer.
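As a rough sanity check on that figure, assuming standard q/k/v/out attention projections and a gated feed-forward block with three weight matrices (an assumption about the architecture, not confirmed by the original release; biases and norms are ignored), the removed layers account for roughly 1.6B parameters:

```python
# Back-of-the-envelope parameter count for one temporal transformer layer.
# Assumes 4 attention projections (q, k, v, out) and a gated FFN with
# 3 weight matrices; these are assumptions, not confirmed details.
dim, ff = 4096, 11264

attn_params = 4 * dim * dim           # q, k, v, out projections
ffn_params = 3 * dim * ff             # gate, up, down projections
per_layer = attn_params + ffn_params  # ~205M parameters per layer

pruned = 8 * per_layer                # layers 12-19 removed
print(f"per layer: {per_layer/1e6:.0f}M, pruned total: {pruned/1e9:.2f}B")
# -> per layer: 206M, pruned total: 1.64B (matches the ~1.6B figure above)
```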
### 2. LoRA Recovery Training
After pruning, the model loses semantic understanding. We recover quality using LoRA fine-tuning on self-distillation data:
- LoRA config: rank 32, alpha 64, target modules: `out_proj` only (9.6M trainable params, 0.14% of model)
- Training data: 333 self-distillation files generated by running the full 32L teacher model on natural conversation audio (Santa Barbara Corpus) across all 18 PersonaPlex voices
- Training: 3 epochs on CPU (bf16 weights, float32 compute), AdamW optimizer with cosine LR decay
- Loss: `3.0 * text_cross_entropy + weighted_audio_cross_entropy` (first audio codebook weighted 5x; see the sketch after this list)
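A minimal sketch of this loss under stated assumptions: the tensor names and shapes are hypothetical, and whether the weighted audio term is normalized by the sum of the codebook weights is a guess.

```python
import torch
import torch.nn.functional as F

def recovery_loss(text_logits, text_targets, audio_logits, audio_targets):
    """Hypothetical loss matching the recipe above.

    text_logits:   [batch, seq, text_vocab]
    text_targets:  [batch, seq]
    audio_logits:  [batch, n_codebooks, seq, audio_vocab]
    audio_targets: [batch, n_codebooks, seq]
    """
    text_ce = F.cross_entropy(
        text_logits.flatten(0, 1), text_targets.flatten()
    )

    n_codebooks = audio_logits.shape[1]
    # First audio codebook weighted 5x, all others 1x.
    weights = torch.ones(n_codebooks)
    weights[0] = 5.0

    audio_ce = 0.0
    for cb in range(n_codebooks):
        ce = F.cross_entropy(
            audio_logits[:, cb].flatten(0, 1), audio_targets[:, cb].flatten()
        )
        audio_ce = audio_ce + weights[cb] * ce
    # Normalizing by the weight sum is an assumption; the card only gives
    # the relative 5x weighting.
    audio_ce = audio_ce / weights.sum()

    return 3.0 * text_ce + audio_ce
```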
### 3. Q4_K Quantization
The merged LoRA model is quantized using the same Q4_K scheme as the original WebGPU version:
- Q4_K: All weight matrices (attention projections, gating, linear heads)
- Q4_0: Embedding tables (for efficient CPU row lookups)
- F32: Layer norm alpha parameters only
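A minimal sketch of how such a per-tensor scheme could be expressed; the tensor-name patterns are assumptions for illustration, not the actual names stored in the GGUF file.

```python
def quant_type_for(tensor_name: str) -> str:
    """Map a tensor name to its GGUF quantization type, following the
    scheme above. Name patterns are illustrative assumptions only."""
    if tensor_name.endswith(".alpha"):   # layer norm alpha parameters
        return "F32"
    if "embed" in tensor_name:           # embedding tables: cheap CPU row lookups
        return "Q4_0"
    return "Q4_K"                        # all other weight matrices
```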
## Files

| File | Size | Description |
|---|---|---|
| `personaplex-24L-q4_k.gguf` | 3.5 GB | Q4_K quantized model weights (single file) |
| `shards/personaplex-24L-q4_k.gguf.shard-{00-07}` | 3.5 GB total | Same weights, sharded (<512 MB each for the WASM ArrayBuffer limit) |
| `tokenizer-e351c8d8-checkpoint125.safetensors` | 367 MB | Mimi audio codec weights |
| `tokenizer_spm_32k_3.model` | 540 KB | SentencePiece text tokenizer |
| `voices/*.pt` | ~330 KB each | 18 voice prompt embeddings, PyTorch format (8 NAT + 10 VAR) |
| `voices/*.embeddings.bin` | ~800 KB each | Same embeddings as raw f32le (for web demo) |
| `voices/*.cache.json` | ~1 KB each | Token cache snapshots for voice conditioning (for web demo) |
| `config.json` | | Model architecture and training metadata |
## Voices
18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:
- `.pt`: PyTorch tensor (embeddings + cache, bfloat16)
- `.embeddings.bin`: raw f32 little-endian embeddings, shape `[num_frames, 4096]` (for browser use)
- `.cache.json`: token cache snapshot, 17 streams × 4 positions (for browser use)

Voice families:

- NAT (native): NATF0-F3 (female), NATM0-M3 (male)
- VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)
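Outside the browser, the web-demo files can be inspected with a few lines of Python; the dtype and shape follow the description above, and `NATF0` is just one of the 18 voices.

```python
import json
import numpy as np

# Raw little-endian float32 embeddings, one 4096-dim vector per frame.
emb = np.fromfile("voices/NATF0.embeddings.bin", dtype="<f4").reshape(-1, 4096)
print(emb.shape)  # (num_frames, 4096)

# Token cache snapshot used to precondition the 17 streams in the web demo.
with open("voices/NATF0.cache.json") as f:
    cache = json.load(f)
```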
## Architecture

- Temporal transformer (24 layers): dim 4096, 32 heads, ff 11264, RoPE positional encoding (freq_base=10000)
- Depth transformer (6 layers, unchanged): dim 1024, 16 heads, ff 2816, 16 codebook-specific gating modules
- Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
- Text: 32K SentencePiece vocabulary
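For a rough sense of the token budget, assuming the Moshi-style full-duplex layout of one text stream plus 8 user and 8 model audio codebooks (which would account for the 17 streams seen in the voice caches; this split is an assumption, not stated in the release):

```python
frame_rate_hz = 12.5                       # Mimi frame rate
codebooks_per_speaker = 8
streams = 1 + 2 * codebooks_per_speaker    # text + user audio + model audio = 17

tokens_per_frame = streams                 # one token per stream per 80 ms frame
tokens_per_second = frame_rate_hz * tokens_per_frame
print(tokens_per_second)                   # 212.5 tokens/s across all streams
```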
## Usage
This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF file is loaded by the browser and dequantized on-GPU via WGSL compute shaders. Sharded versions are provided for environments with a 2 GB ArrayBuffer limit.
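Outside the browser, the shards can be reassembled into the single-file GGUF with a plain byte-wise concatenation, assuming (as the file table implies) that the shards are a straight split of the same file:

```python
from pathlib import Path

# Concatenate shard-00 .. shard-07 back into the single-file GGUF.
# Assumes the shards are consecutive byte ranges of the original file.
shards = sorted(Path("shards").glob("personaplex-24L-q4_k.gguf.shard-*"))
with open("personaplex-24L-q4_k.gguf", "wb") as out:
    for shard in shards:
        out.write(shard.read_bytes())
```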
## Limitations
- Layer pruning reduces semantic understanding compared to the full 32L model
- LoRA recovery was trained on English conversational data only
- The model may default to a greeting ("Hey, let me know if you have any questions") at the start of inference; this is expected and typically discarded during the system prompt warmup phase
## License
This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).