PersonaPlex-24L Q4_K (WebGPU)

Work in progress. This model exists primarily to test layer pruning + QLoRA recovery for browser deployment. Model behavior may differ from the original 32L model. No guarantees; use at your own discretion.

A smaller, quantized version of NVIDIA PersonaPlex-7B-v1 optimized for browser-based speech-to-speech inference via WebGPU.

What is this?

PersonaPlex-7B is a full-duplex speech-to-speech model based on Moshi. The original model has 8.37B parameters across a 32-layer temporal transformer and a 6-layer depth transformer.

This version removes 8 temporal transformer layers (layers 12-19), recovers quality through LoRA fine-tuning, and then quantizes the result to Q4_K for efficient browser deployment. Quality has been assessed qualitatively through listening tests only; no formal metrics are available.

                    Original (32L bf16)   Original Q4_K   This model (24L Q4_K)
Temporal layers     32                    32              24
Total params        8.37B                 8.37B           6.74B
File size           16.7 GB               4.4 GB          3.5 GB
Format              safetensors           GGUF Q4_K       GGUF Q4_K

How it was made

1. Layer Pruning

Removed temporal transformer layers 12-19 (the middle 8 of 32). The depth transformer (6 layers, 1024 dim) is kept intact. This removes ~1.6B parameters from the temporal transformer.
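
In code, the pruning step amounts to dropping the middle entries of the layer list. Below is a minimal sketch, assuming the temporal transformer exposes its blocks as an nn.ModuleList; the attribute path model.temporal_transformer.layers is hypothetical, not the actual PersonaPlex module name.

```python
# Minimal pruning sketch (assumed module path, not the real one).
import torch.nn as nn

def prune_middle_layers(layers: nn.ModuleList, start: int = 12, end: int = 19) -> nn.ModuleList:
    """Keep everything except layers start..end (inclusive): 32 -> 24."""
    return nn.ModuleList(
        [layer for i, layer in enumerate(layers) if not (start <= i <= end)]
    )

# model.temporal_transformer.layers = prune_middle_layers(model.temporal_transformer.layers)
```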

2. LoRA Recovery Training

After pruning, the model loses semantic understanding. We recover quality using LoRA fine-tuning on self-distillation data:

  • LoRA config: rank 32, alpha 64, target modules: out_proj only (9.6M trainable params, 0.14% of model)
  • Training data: 333 self-distillation files generated by running the full 32L teacher model on natural conversation audio (Santa Barbara Corpus) across all 18 PersonaPlex voices
  • Training: 3 epochs on CPU (bf16 weights, float32 compute), AdamW optimizer with cosine LR decay
  • Loss: 3.0 * text_cross_entropy + weighted_audio_cross_entropy (first audio codebook weighted 5x; see the sketch below)
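
A sketch of that combined loss: 3.0 times the text cross-entropy plus a codebook-weighted audio cross-entropy with the first codebook weighted 5x. The exact reduction and normalization used in training are assumptions.

```python
# Recovery loss sketch; normalization over codebook weights is an assumption.
import torch
import torch.nn.functional as F

def recovery_loss(
    text_logits: torch.Tensor,         # [B, T, text_vocab]
    text_targets: torch.Tensor,        # [B, T]
    audio_logits: list[torch.Tensor],  # one [B, T, audio_vocab] per codebook
    audio_targets: list[torch.Tensor], # one [B, T] per codebook
) -> torch.Tensor:
    text_ce = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    weights = [5.0] + [1.0] * (len(audio_logits) - 1)  # first (semantic) codebook weighted 5x
    audio_ce = sum(
        w * F.cross_entropy(lg.flatten(0, 1), tg.flatten())
        for w, lg, tg in zip(weights, audio_logits, audio_targets)
    ) / sum(weights)
    return 3.0 * text_ce + audio_ce
```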

3. Q4_K Quantization

The merged LoRA model is quantized using the same Q4_K scheme as the original WebGPU version (per-tensor routing sketched after the list):

  • Q4_K: All weight matrices (attention projections, gating, linear heads)
  • Q4_0: Embedding tables (for efficient CPU row lookups)
  • F32: Layer norm alpha parameters only
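
An illustrative per-tensor routing that mirrors this scheme; the name patterns are assumptions about the GGUF tensor names, not verified against the actual converter.

```python
# Hypothetical tensor-name -> quant-type routing matching the scheme above.
def quant_type(tensor_name: str) -> str:
    if "norm" in tensor_name:    # layer norm alpha parameters stay full precision
        return "F32"
    if "embed" in tensor_name:   # embedding tables: Q4_0 for cheap CPU row lookups
        return "Q4_0"
    return "Q4_K"                # everything else: attention, gating, linear heads
```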

Files

File                                             Size           Description
personaplex-24L-q4_k.gguf                        3.5 GB         Q4_K quantized model weights (single file)
shards/personaplex-24L-q4_k.gguf.shard-{00-07}   3.5 GB total   Same weights, sharded (<512 MB each for the WASM ArrayBuffer limit)
tokenizer-e351c8d8-checkpoint125.safetensors     367 MB         Mimi audio codec weights
tokenizer_spm_32k_3.model                        540 KB         SentencePiece text tokenizer
voices/*.pt                                      ~330 KB each   18 voice prompt embeddings, PyTorch format (8 NAT + 10 VAR)
voices/*.embeddings.bin                          ~800 KB each   Same embeddings as raw f32le (for web demo)
voices/*.cache.json                              ~1 KB each     Token cache snapshots for voice conditioning (for web demo)
config.json                                                     Model architecture and training metadata
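
A hypothetical sketch of how the shards could be produced from the single-file GGUF: the output naming follows the table above, and the chunk size stays just under 512 MB to respect the WASM ArrayBuffer limit.

```python
# Split a GGUF into sequential byte-range shards (illustrative, not the
# actual tooling used for this release).
SHARD_BYTES = 512 * 1024 * 1024 - 1  # just under 512 MB

def split_gguf(path: str) -> None:
    with open(path, "rb") as src:
        idx = 0
        while chunk := src.read(SHARD_BYTES):
            with open(f"{path}.shard-{idx:02d}", "wb") as dst:
                dst.write(chunk)
            idx += 1

# split_gguf("personaplex-24L-q4_k.gguf")  # -> .shard-00 ... .shard-07
```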

Voices

18 pre-computed voice prompts from the original PersonaPlex release. Each voice has three files:

  • .pt – PyTorch tensor (embeddings + cache, bfloat16)
  • .embeddings.bin – raw f32 little-endian embeddings, shape [num_frames, 4096] (for browser use; loading sketch below)
  • .cache.json – token cache snapshot, 17 streams × 4 positions (for browser use)

The 18 voices split into two families:

  • NAT (native): NATF0-F3 (female), NATM0-M3 (male)
  • VAR (varied accent): VARF0-F4 (female), VARM0-M4 (male)
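
A minimal sketch for reading an .embeddings.bin voice prompt in Python; the f32 little-endian layout and [num_frames, 4096] shape follow the description above, and the example filename is illustrative.

```python
# Load a raw f32le voice embedding file into a [num_frames, 4096] array.
import numpy as np

def load_voice_embeddings(path: str, dim: int = 4096) -> np.ndarray:
    flat = np.fromfile(path, dtype="<f4")   # little-endian float32
    assert flat.size % dim == 0, "file length must be a multiple of dim"
    return flat.reshape(-1, dim)            # [num_frames, dim]

# emb = load_voice_embeddings("voices/NATF0.embeddings.bin")  # name illustrative
```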

Architecture

Temporal Transformer (24 layers)
  dim: 4096, heads: 32, ff: 11264
  RoPE positional encoding (freq_base=10000)

Depth Transformer (6 layers, unchanged)
  dim: 1024, heads: 16, ff: 2816
  16 codebook-specific gating modules

Audio: 8 codebooks (Mimi codec, 12.5 Hz frame rate)
Text: 32K SentencePiece vocabulary
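
A back-of-the-envelope check on the parameter figures above, assuming standard q/k/v/out attention projections (4 * dim^2) and a gated feed-forward (3 * dim * ff); the gated-FF assumption is inferred from the "gating" modules mentioned in the quantization section, not confirmed.

```python
# Rough per-layer parameter count for the temporal transformer.
dim, ff = 4096, 11264
per_layer = 4 * dim * dim + 3 * dim * ff                             # ~205.5M
print(f"removed by pruning 8 layers:  {8 * per_layer / 1e9:.2f}B")   # ~1.64B
print(f"remaining 24 temporal layers: {24 * per_layer / 1e9:.2f}B")  # ~4.93B
```

The ~1.64B estimate for the 8 pruned layers agrees with the "~1.6B parameters" figure quoted in the pruning section and with the 8.37B to 6.74B drop in the comparison table.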

Usage

This model is designed for use with sts-web, a Rust/WASM WebGPU inference engine. The GGUF file is loaded by the browser and dequantized on-GPU via WGSL compute shaders. Sharded versions are provided for environments with a 2 GB ArrayBuffer limit, where the single 3.5 GB file cannot be allocated in one buffer.

Limitations

  • Layer pruning reduces semantic understanding compared to the full 32L model
  • LoRA recovery was trained on English conversational data only
  • The model may default to a greeting ("Hey, let me know if you have any questions") at the start of inference; this is expected and typically discarded during the system prompt warmup phase

License

This model inherits the NVIDIA Open Model License from the base PersonaPlex-7B-v1 model, with additional terms from Kyutai's Moshi (CC-BY-4.0).
