Instructions for using unsloth/MiMo-V2.5-Pro-GGUF with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- llama-cpp-python
How to use unsloth/MiMo-V2.5-Pro-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/MiMo-V2.5-Pro-GGUF",
    filename="BF16/MiMo-V2.5-Pro-BF16-00001-of-00043.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use unsloth/MiMo-V2.5-Pro-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
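With any of the install methods above, llama-server exposes an OpenAI-compatible API (port 8080 unless you pass --port). Below is a minimal sketch of calling it from Python with the openai package; the model name is only a label here, since llama-server answers with whichever model it has loaded:

```python
# pip install openai
from openai import OpenAI

# llama-server listens on http://localhost:8080 by default; no API key is needed,
# but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```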
Use Docker
docker model run hf.co/unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
- LM Studio
- Jan
- vLLM
How to use unsloth/MiMo-V2.5-Pro-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/MiMo-V2.5-Pro-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/MiMo-V2.5-Pro-GGUF",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Use Docker
docker model run hf.co/unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
- Ollama
How to use unsloth/MiMo-V2.5-Pro-GGUF with Ollama:
ollama run hf.co/unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
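Ollama also serves the model over its local HTTP API, so you can chat with it programmatically. Here is a minimal sketch using the native /api/chat endpoint (assumes Ollama's default port 11434 and the requests package):

```python
# pip install requests
import requests

# Ollama's local chat endpoint (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```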
- Unsloth Studio
How to use unsloth/MiMo-V2.5-Pro-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/MiMo-V2.5-Pro-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/MiMo-V2.5-Pro-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/MiMo-V2.5-Pro-GGUF to start chatting
- Pi
How to use unsloth/MiMo-V2.5-Pro-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use unsloth/MiMo-V2.5-Pro-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use unsloth/MiMo-V2.5-Pro-GGUF with Docker Model Runner:
docker model run hf.co/unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
- Lemonade
How to use unsloth/MiMo-V2.5-Pro-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/MiMo-V2.5-Pro-GGUF:UD-Q4_K_M
Run and chat with the model
lemonade run user.MiMo-V2.5-Pro-GGUF-UD-Q4_K_M
List all available models
lemonade list
Includes Unsloth chat template fixes!
For llama.cpp, use --jinja
Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.
MiMo-V2.5-Pro
MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It utilizes the hybrid attention architecture and 3-layer Multi-Token Prediction (MTP) introduced in MiMo-V2-Flash, and supports a context length of up to 1M tokens.
1. Introduction
MiMo-V2.5-Pro is our most capable model to date, designed for the most demanding agentic, complex software engineering, and long-horizon tasks. It sustains complex trajectories spanning thousands of tool calls with strong instruction following and coherence over a 1M-token context window. Key features include:
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 6:1 ratio with a 128-token sliding window. This reduces KV-cache storage by nearly 7x (see the sketch after this list) while maintaining long-context performance via a learnable attention sink bias.
- Multi-Token Prediction (MTP): Equipped with three lightweight MTP modules using dense FFNs. This triples output speed during inference and is also useful for accelerating rollouts in RL training.
- Efficient Pre-Training: Trained on 27T tokens using FP8 mixed precision and a native 32k sequence length. The context window supports up to 1M tokens.
- Agentic Capabilities: Post-training utilizes SFT, large-scale agentic RL and Multi-Teacher On-Policy Distillation (MOPD), achieving superior performance on the most demanding agentic, complex software engineering, and long-horizon tasks.
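As a rough illustration of the "nearly 7x" KV-cache figure above, here is a back-of-the-envelope sketch using the layer counts from the model summary table below (60 SWA layers with a 128-token window, 10 global-attention layers). It ignores per-token KV sizes, attention sinks, and other overhead, so it is an approximation rather than the exact accounting.

```python
# Rough KV-cache comparison: all-global attention vs. the 6:1 SWA/GA hybrid.
# Layer counts and window size come from the model summary table; the rest is simplified.
context_len = 1_000_000          # tokens in the context window
swa_layers, ga_layers = 60, 10   # sliding-window vs. global attention layers
window = 128                     # SWA window size

# Cached KV entries (per head) if every layer attended globally:
all_global = (swa_layers + ga_layers) * context_len
# Cached KV entries with the hybrid scheme (SWA layers cap at the window size):
hybrid = swa_layers * min(window, context_len) + ga_layers * context_len

print(f"KV-cache reduction ≈ {all_global / hybrid:.1f}x")  # ≈ 7.0x at long context
```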
2. Model Downloads
| Model | Total Params | Active Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| MiMo-V2.5-Pro | 1.02T | 42B | 1M | FP8 (E4M3) Mixed | 🤗 HuggingFace 🤖 ModelScope |
| MiMo-V2.5-Pro-Base | 1.02T | 42B | 256K | FP8 (E4M3) Mixed | 🤗 HuggingFace 🤖 ModelScope |
3. Evaluation Results
Base Model Evaluation
| Category | Benchmark | Setting | MiMo-V2.5-Pro Base | MiMo-V2.5 Base | DeepSeek-V4-Pro Base | DeepSeek-V4-Flash Base | Kimi-K2 Base |
|---|---|---|---|---|---|---|---|
| Params | #Activated / #Total | - | 42B / 1.02T | 15B / 310B | 49B / 1.6T | 13B / 284B | 32B / 1.04T |
| General | BBH | 3-shot | 88.4 | 87.2 | 87.5 | 86.9 | 88.7 |
| | MMLU | 5-shot | 89.4 | 86.3 | 90.1 | 88.7 | 87.8 |
| | MMLU-Redux | 5-shot | 92.8 | 89.8 | 90.8 | 89.4 | 90.2 |
| | MMLU-Pro | 5-shot | 68.5 | 65.8 | 73.5 | 68.3 | 69.2 |
| | DROP | 3-shot | 86.3 | 83.7 | 88.7 | 88.6 | 83.6 |
| | ARC-Challenge | 25-shot | 97.2 | 96.5 | - | - | 96.2 |
| | HellaSwag | 10-shot | 89.8 | 88.6 | 88.0 | 85.7 | 94.6 |
| | WinoGrande | 5-shot | 85.6 | 84.7 | 81.5 | 79.5 | 85.3 |
| | TriviaQA | 5-shot | 81.3 | 80.7 | 85.6 | 82.8 | 85.1 |
| | GPQA-Diamond | 5-shot | 66.7 | 58.1 | - | - | 48.1 |
| Math | GSM8K | 8-shot | 99.6 | 83.3 | 92.6 | 90.8 | 92.1 |
| | MATH | 4-shot | 86.2 | 67.7 | 64.5 | 57.4 | 70.2 |
| | AIME 24&25 | 2-shot | 37.3 | 36.9 | - | - | 31.6 |
| Code | HumanEval+ | 1-shot | 75.6 | 71.3 | - | - | 84.8 |
| | MBPP+ | 3-shot | 74.1 | 70.9 | - | - | 73.8 |
| | LiveCodeBench v6 | 1-shot | 39.6 | 35.5 | - | - | 26.3 |
| | SWE-Bench (AgentLess) | 3-shot | 35.7 | 30.8 | - | - | 28.2 |
| Chinese | C-Eval | 5-shot | 91.5 | 88.6 | 93.1 | 92.1 | 92.5 |
| | CMMLU | 5-shot | 90.2 | 88.2 | 90.8 | 90.4 | 90.9 |
| Multilingual | GlobalMMLU | 5-shot | 83.6 | 77.4 | - | - | 80.7 |
Long-context Evaluation
GraphWalks is a long-context benchmark from OpenAI that fills the prompt with a directed graph of hex-hash nodes and asks the model to run a breadth-first search (nodes exactly at depth N) or list a node's parents. We evaluate across the full 32k–1M input-token span and apply the same evaluation fixes described by Anthropic.
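To make the task concrete, here is a minimal sketch of the "nodes exactly at depth N" BFS query the benchmark asks for. The graph encoding and the hex-style node names are purely illustrative, not the actual GraphWalks prompt format.

```python
def nodes_at_depth(edges: dict[str, list[str]], root: str, n: int) -> set[str]:
    """Breadth-first search: return the nodes exactly n hops from root."""
    frontier, seen = {root}, {root}
    for _ in range(n):
        nxt = set()
        for node in frontier:
            for child in edges.get(node, []):
                if child not in seen:
                    seen.add(child)
                    nxt.add(child)
        frontier = nxt
    return frontier

# Illustrative hex-hash-style node names (not real benchmark data)
edges = {"a3f1": ["b7c2", "c9d4"], "b7c2": ["d1e8"], "c9d4": ["d1e8", "e5f0"]}
print(nodes_at_depth(edges, "a3f1", 2))  # {'d1e8', 'e5f0'}
```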
MiMo V2.5 Pro delivers a major leap in long-context reasoning. Past 128k, V2 Pro degrades rapidly and collapses to 0.00 at 1M on both subtasks, while V2.5 Pro still scores 0.56 BFS / 0.92 Parents at 512k and 0.37 / 0.62 at 1M.
4. Model Architecture & Training Process
MiMo-V2.5-Pro addresses the quadratic complexity of long contexts by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA). Unlike traditional speculative decoding, our MTP module is natively integrated for training and inference.
Model Summary
| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
|---|---|---|
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |
Training Process
For post-training, MiMo-V2.5-Pro adopts the three-stage post-training paradigm introduced in MiMo-V2-Flash to achieve exceptional performance. The paradigm begins with Supervised Fine-Tuning (SFT) to build strong, foundational instruction-following skills using curated data pairs. Next, in the Domain-Specialized Training stage, diverse teacher models covering domains from math and safety to complex agentic tool use are individually optimized with domain-specific RL rewards. Finally, the process culminates in Multi-Teacher On-Policy Distillation (MOPD): through dynamic on-policy RL, the single student model iteratively learns from its own outputs while continuously receiving precise token-level guidance from the expert teachers, seamlessly integrating their broad capabilities.
5. Deployment
Since inference engines are continuously being updated and optimized, this guide only provides deployment examples for reference. For the best results, we strongly recommend following the officially supported approaches referenced below to get the latest best practices and optimal performance.
SGLang Deployment
For the best performance, we strongly recommend deploying using this approach, which is officially supported by the SGLang community. Please refer to SGLang MiMo-V2.5-Pro Cookbook for the latest deployment guide.
The following is an example of running the model with SGLang, referenced from sgl-project/sglang#23808:
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
python3 -m sglang.launch_server \
--model-path XiaomiMiMo/MiMo-V2.5-Pro \
--trust-remote-code \
--pp-size 1 \
--dp-size 2 \
--ep-size 16 \
--tp-size 16 \
--moe-dense-tp-size 1 \
--enable-dp-attention \
--moe-a2a-backend deepep \
--dist-init-addr ${LWS_LEADER_IP}:20000 \
--node-rank ${LWS_WORKER_INDEX} \
--nnodes ${LWS_GROUP_SIZE} \
--page-size 64 \
--attention-backend fa3 \
--quantization fp8 \
--mem-fraction-static 0.7 \
--max-running-requests 128 \
--cuda-graph-max-bs 64 \
--chunked-prefill-size 32768 \
--context-length 1048576 \
--tokenizer-worker-num 64 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 9001 \
--reasoning-parser mimo \
--tool-call-parser mimo \
--watchdog-timeout 3600 \
--model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}'
vLLM Deployment
For the best performance, we strongly recommend deploying using this approach, which is officially supported by the vLLM community. Please refer to vLLM MiMo-V2.5-Pro Cookbook for the latest deployment guide.
For local deployment, we recommend setting the sampling parameters to temperature=1.0, top_p=0.95.
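Below is a minimal sketch of passing those sampling parameters through the OpenAI-compatible API once a server is up (the port assumes vLLM's default of 8000; adjust for your deployment):

```python
# pip install openai
from openai import OpenAI

# OpenAI-compatible endpoint exposed by the local vLLM server (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5-Pro",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=1.0,  # recommended sampling settings from this section
    top_p=0.95,
)
print(response.choices[0].message.content)
```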
Citation
@misc{mimo2026v25pro,
title={MiMo-V2.5-Pro},
author={{Xiaomi MiMo Team}},
year={2026},
howpublished={\url{https://huggingface.co/collections/XiaomiMiMo/mimo-v25}},
}
Contact
For questions or feedback, reach us at mimo@xiaomi.com or join our community.
Model tree for unsloth/MiMo-V2.5-Pro-GGUF
Base model: XiaomiMiMo/MiMo-V2.5-Pro