Instructions to use spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision") config = load_config("spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
Qwen3.5-35B-A3B optimized for MLX. This quant supports image input and requires a vision-enabled MLX server.
For the non-vision model: https://huggingface.co/spicyneuron/Qwen3.5-35B-A3B-MLX-4.8bit
EDIT: Updated chat template to enable better prompt caching.
Usage
# Start server at http://localhost:8080/chat/completions
uvx --from mlx-vlm --with torchvision \
mlx_vlm.server \
--host 127.0.0.1 \
--port 8080 \
--model spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision
Methodology
Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs. MLX quantization options differ than llama.cpp, but the principles are the same:
- Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- More tolerant layers like MoE experts get lower precision
- Downloads last month
- 58
Model size
6B params
Tensor type
BF16
路
U32 路
F32 路
Hardware compatibility
Log In to add your hardware
Quantized
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support