Instructions to use tencent/Youtu-LLM-2B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use tencent/Youtu-LLM-2B-GGUF with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="tencent/Youtu-LLM-2B-GGUF",
    filename="Youtu-LLM-2B-F16.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use tencent/Youtu-LLM-2B-GGUF with llama.cpp:
Install from brew
```shell
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tencent/Youtu-LLM-2B-GGUF:F16

# Run inference directly in the terminal:
llama-cli -hf tencent/Youtu-LLM-2B-GGUF:F16
```
Install from WinGet (Windows)
```shell
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tencent/Youtu-LLM-2B-GGUF:F16

# Run inference directly in the terminal:
llama-cli -hf tencent/Youtu-LLM-2B-GGUF:F16
```
Use pre-built binary
```shell
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf tencent/Youtu-LLM-2B-GGUF:F16

# Run inference directly in the terminal:
./llama-cli -hf tencent/Youtu-LLM-2B-GGUF:F16
```
Build from source code
```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf tencent/Youtu-LLM-2B-GGUF:F16

# Run inference directly in the terminal:
./build/bin/llama-cli -hf tencent/Youtu-LLM-2B-GGUF:F16
```
Use Docker
```shell
docker model run hf.co/tencent/Youtu-LLM-2B-GGUF:F16
```
- LM Studio
- Jan
- vLLM
How to use tencent/Youtu-LLM-2B-GGUF with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "tencent/Youtu-LLM-2B-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tencent/Youtu-LLM-2B-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/tencent/Youtu-LLM-2B-GGUF:F16
```
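Because the vLLM server above exposes an OpenAI-compatible API, any HTTP client can call it, not just curl. The sketch below builds the same chat-completions request body in Python using only the standard library; the helper names (`build_chat_payload`, `post_chat`) are illustrative, and the endpoint and model name follow the `vllm serve` example above.

```python
# Minimal OpenAI-compatible chat client sketch for the local vLLM server.
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, **sampling) -> dict:
    """Assemble a chat-completions request body with optional sampling fields."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }

def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_payload(
    "tencent/Youtu-LLM-2B-GGUF", "What is the capital of France?"
)
# post_chat("http://localhost:8000", payload)  # requires a running server
```

The network call is kept behind a commented line so the snippet can be read (and the payload inspected) without a server running.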
- Ollama
How to use tencent/Youtu-LLM-2B-GGUF with Ollama:
```shell
ollama run hf.co/tencent/Youtu-LLM-2B-GGUF:F16
```
- Unsloth Studio
How to use tencent/Youtu-LLM-2B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```shell
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for tencent/Youtu-LLM-2B-GGUF to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for tencent/Youtu-LLM-2B-GGUF to start chatting
```
Using HuggingFace Spaces for Unsloth
```shell
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for tencent/Youtu-LLM-2B-GGUF to start chatting
```
- Pi
How to use tencent/Youtu-LLM-2B-GGUF with Pi:
Start the llama.cpp server
```shell
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf tencent/Youtu-LLM-2B-GGUF:F16
```
Configure the model in Pi
```shell
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add the provider to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "tencent/Youtu-LLM-2B-GGUF:F16" }
      ]
    }
  }
}
```
Run Pi
```shell
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use tencent/Youtu-LLM-2B-GGUF with Hermes Agent:
Start the llama.cpp server
```shell
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf tencent/Youtu-LLM-2B-GGUF:F16
```
Configure Hermes
```shell
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tencent/Youtu-LLM-2B-GGUF:F16
```
Run Hermes
```shell
hermes
```
- Docker Model Runner
How to use tencent/Youtu-LLM-2B-GGUF with Docker Model Runner:
```shell
docker model run hf.co/tencent/Youtu-LLM-2B-GGUF:F16
```
- Lemonade
How to use tencent/Youtu-LLM-2B-GGUF with Lemonade:
Pull the model
```shell
# Download Lemonade from https://lemonade-server.ai/
lemonade pull tencent/Youtu-LLM-2B-GGUF:F16
```
Run and chat with the model
```shell
lemonade run user.Youtu-LLM-2B-GGUF-F16
```
List all available models
```shell
lemonade list
```
📃 License • 💻 Code • 📑 Technical Report • 📊 Benchmarks • 🚀 Getting Started
🎯 Brief Introduction
Youtu-LLM is a new, small, yet powerful LLM: it contains only 1.96B parameters, supports 128k long context, and has native agentic abilities. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in Commonsense, STEM, Coding, and Long Context capabilities; on agent-related benchmarks, Youtu-LLM surpasses larger leading models and is genuinely capable of completing multiple end-to-end agent tasks.
Youtu-LLM has the following features:
- Type: Autoregressive Causal Language Models with Dense MLA
- Release versions: Base and Instruct
- Number of Parameters: 1.96B
- Number of Layers: 32
- Number of Attention Heads (MLA): 16 for Q/K/V
- MLA Rank: 1,536 for Q, 512 for K/V
- MLA Dim: 128 for QK Nope, 64 for QK Rope, and 128 for V
- Context Length: 131,072
- Vocabulary Size: 128,256
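As a quick consistency check on the MLA dimensions listed above, the per-head query/key width is the sum of the Nope and Rope parts, and the concatenated value width is heads times the V dim. This is a small sketch using only the numbers from the feature list:

```python
# Sanity-check the MLA dimensions from the feature list above.
N_HEADS = 16        # attention heads for Q/K/V
QK_NOPE_DIM = 128   # non-rotary part of each Q/K head
QK_ROPE_DIM = 64    # rotary part of each Q/K head
V_HEAD_DIM = 128    # per-head value width

# Each Q/K head concatenates its Nope and Rope parts:
qk_head_dim = QK_NOPE_DIM + QK_ROPE_DIM   # 192 per head
# All value heads concatenated:
v_total_dim = N_HEADS * V_HEAD_DIM        # 2048
```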
🤗 Model Download
| Model Name | Description | Download |
|---|---|---|
| Youtu-LLM-2B-Base | Base model of Youtu-LLM-2B | 🤗 Model |
| Youtu-LLM-2B | Instruct model of Youtu-LLM-2B | 🤗 Model |
| Youtu-LLM-2B-GGUF | Instruct model of Youtu-LLM-2B, in GGUF format | 🤗 Model |
📊 Performance Comparisons
Instruct Model
General Benchmarks
| Benchmark | DeepSeek-R1-Distill-Qwen-1.5B | Qwen3-1.7B | SmolLM3-3B | Qwen3-4B | DeepSeek-R1-Distill-Llama-8B | Youtu-LLM-2B |
|---|---|---|---|---|---|---|
| Commonsense Knowledge Reasoning | ||||||
| MMLU-Redux | 53.0% | 74.1% | 75.6% | 83.8% | 78.1% | 75.8% |
| MMLU-Pro | 36.5% | 54.9% | 53.0% | 69.1% | 57.5% | 61.6% |
| Instruction Following & Text Reasoning | ||||||
| IFEval | 29.4% | 70.4% | 60.4% | 83.6% | 34.6% | 81.2% |
| DROP | 41.3% | 72.5% | 72.0% | 82.9% | 73.1% | 86.7% |
| MUSR | 43.8% | 56.6% | 54.1% | 60.5% | 59.7% | 57.4% |
| STEM | ||||||
| MATH-500 | 84.8% | 89.8% | 91.8% | 95.0% | 90.8% | 93.7% |
| AIME 24 | 30.2% | 44.2% | 46.7% | 73.3% | 52.5% | 65.4% |
| AIME 25 | 23.1% | 37.1% | 34.2% | 64.2% | 34.4% | 49.8% |
| GPQA-Diamond | 33.6% | 36.9% | 43.8% | 55.2% | 45.5% | 48.0% |
| BBH | 31.0% | 69.1% | 76.3% | 87.8% | 77.8% | 77.5% |
| Coding | ||||||
| HumanEval | 64.0% | 84.8% | 79.9% | 95.4% | 88.1% | 95.9% |
| HumanEval+ | 59.5% | 76.2% | 74.7% | 87.8% | 82.5% | 89.0% |
| MBPP | 51.5% | 80.5% | 66.7% | 92.3% | 73.9% | 85.0% |
| MBPP+ | 44.2% | 67.7% | 56.7% | 77.6% | 61.0% | 71.7% |
| LiveCodeBench v6 | 19.8% | 30.7% | 30.8% | 48.5% | 36.8% | 43.7% |
Agentic Benchmarks
| Benchmark | Qwen3-1.7B | SmolLM3-3B | Qwen3-4B | Youtu-LLM-2B |
|---|---|---|---|---|
| Deep Research | ||||
| GAIA | 11.4% | 11.7% | 25.5% | 33.9% |
| xbench | 11.7% | 13.9% | 18.4% | 19.5% |
| Code | ||||
| SWE-Bench-Verified | 0.6% | 7.2% | 5.7% | 17.7% |
| EnConda-Bench | 10.8% | 3.5% | 16.1% | 21.5% |
| Tool | ||||
| BFCL V3 | 55.5% | 31.5% | 61.7% | 58.0% |
| τ²-Bench | 2.6% | 9.7% | 10.9% | 15.0% |
🚀 Quick Start
This guide will help you quickly deploy and invoke the Youtu-LLM-2B model. This model supports "Reasoning Mode", enabling it to generate higher-quality responses through Chain of Thought (CoT).
Server Example
Enable Reasoning Mode (default):
```shell
./llama-server -m Youtu-LLM-2B-F16.gguf \
    --port 8080 \
    --host 0.0.0.0
```
Disable Reasoning Mode:
```shell
./llama-server -m Youtu-LLM-2B-F16.gguf \
    --port 8080 \
    --host 0.0.0.0 \
    --reasoning-budget 0
```
Key Configuration Details
Reasoning Mode Toggle
Controlled via the --reasoning-budget parameter:
- Default (no flag): Enables Chain of Thought; ideal for complex logic and reasoning tasks. The response includes a `reasoning_content` field.
- `--reasoning-budget 0`: Disables reasoning; faster response time, suitable for simple conversations.
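Since the assistant message carries the chain of thought in `reasoning_content` alongside the final answer in `content` when reasoning is enabled, client code usually separates the two. A minimal sketch (the helper name `split_reasoning` is illustrative, and the helper tolerates responses where the field is absent, as in normal mode):

```python
# Separate the reasoning trace from the final answer in an assistant message.
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when the field is absent."""
    return message.get("reasoning_content", ""), message.get("content", "")

# Example with a mock message in the shape described above:
msg = {
    "role": "assistant",
    "reasoning_content": "The user asks for the capital of France.",
    "content": "The capital of France is Paris.",
}
reasoning, answer = split_reasoning(msg)
```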
Recommended Decoding Parameters
| Parameter | Reasoning Mode | Normal Mode |
|---|---|---|
| `temperature` | 1.0 (maintains creativity) | 0.7 (more stable results) |
| `top_p` | 0.95 | 0.8 |
| `top_k` | 20 | 20 |
| `repetition_penalty` | 1.05 | - |
Tip: When using Reasoning Mode, a higher `temperature` helps the model perform deeper, more divergent thinking.
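The table above can be captured as two small parameter presets to attach to a request. This is an illustrative sketch: the dictionary keys follow the table's names, but the exact field names your serving stack accepts (e.g. for the repetition penalty) may differ, so check your server's API before relying on them.

```python
# Recommended decoding presets from the table above (keys are illustrative).
REASONING_PARAMS = {
    "temperature": 1.0,          # maintains creativity
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05,
}
NORMAL_PARAMS = {
    "temperature": 0.7,          # more stable results
    "top_p": 0.8,
    "top_k": 20,
}

def sampling_params(reasoning: bool) -> dict:
    """Return a copy of the recommended decoding parameters for the chosen mode."""
    return dict(REASONING_PARAMS if reasoning else NORMAL_PARAMS)
```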
📚 Citation
If you find our work useful in your research, please consider citing the following paper:
```bibtex
@article{youtu-llm,
  title={Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models},
  author={Tencent Youtu Lab},
  year={2025},
  eprint={2512.24618},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.24618},
}
```