Qwen3-Reranker-0.6B GGUF — Quantized by BatiAI
GGUF quantizations of Qwen/Qwen3-Reranker-0.6B — the most-downloaded open-source reranker of 2026 (1.39 M downloads on HF). Part of BatiAI's on-device RAG stack for BatiFlow.
What is a reranker?
RAG pipeline: embedding (coarse retrieve) → reranker (precise scoring) → LLM (answer).
A reranker takes a (query, candidate_document) pair and returns a relevance score. It is the "second pass" after vector search: it turns "probably relevant" candidates into an ordered top-K that the LLM can use confidently.
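The second-pass step can be sketched in a few lines of Python. This is a minimal illustration, not BatiAI's pipeline code: `rerank` and `toy_overlap_score` are hypothetical names, and the token-overlap scorer is only a stand-in for a real call to the reranker model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k by score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query tokens found in doc.

    A real pipeline would call the reranker model here instead.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


# Candidates as they might come back from the coarse vector search.
candidates = [
    "Seoul had rain yesterday",
    "The Eiffel Tower is a landmark of Paris",
    "Weather in Seoul is mild in spring",
]
top = rerank("weather in Seoul", candidates, toy_overlap_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Swapping `toy_overlap_score` for a function that queries the model leaves the rest of the flow unchanged, which is why the scorer is passed in as a parameter.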
Quick Start (llama.cpp)
./llama-cli -m Qwen3-Reranker-0.6B-Q6_K.gguf \
--chat-template-file chat-template.jinja \
-p "<query>weather in Seoul</query><doc>Seoul had rain yesterday</doc>"
For production, integrate via the llama.cpp API (see Qwen3-Reranker usage).
Note: Ollama doesn't have a native reranker endpoint yet, so this GGUF is intended for direct llama.cpp integration or tools like LangChain / LlamaIndex.
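For direct llama.cpp integration, one option is the rerank endpoint of `llama-server`. The sketch below assumes the server was started with reranking enabled (the `--reranking` flag) and that it accepts a `query` / `documents` / `top_n` body on `/v1/rerank` and returns `results` entries with `index` and `relevance_score`. Whether this particular GGUF loads in that mode is not confirmed by this card, so treat the URL, port, and response shape as assumptions; the `fake_response` values are made up for illustration.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def build_rerank_payload(query: str, documents: List[str], top_n: int) -> Dict:
    """Build the JSON body for a rerank request."""
    return {"query": query, "documents": documents, "top_n": top_n}


def parse_rerank_results(response: Dict, documents: List[str]) -> List[Tuple[str, float]]:
    """Map result indices back to documents, best score first."""
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def rerank_via_server(
    query: str,
    documents: List[str],
    base_url: str = "http://localhost:8080",  # assumed default llama-server address
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """POST to the (assumed) /v1/rerank endpoint and return ranked documents."""
    body = json.dumps(build_rerank_payload(query, documents, top_n)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_rerank_results(json.load(resp), documents)


# Offline demonstration with a hand-written response (scores are invented).
docs = ["Seoul had rain yesterday", "Weather in Seoul is mild in spring"]
fake_response = {
    "results": [
        {"index": 0, "relevance_score": 0.12},
        {"index": 1, "relevance_score": 0.91},
    ]
}
ranked = parse_rerank_results(fake_response, docs)
```

With a server running, `rerank_via_server("weather in Seoul", docs)` would replace the offline `fake_response` step.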
Available Quantizations
| File | Quant | Size | Recommended |
|---|---|---|---|
| Qwen3-Reranker-0.6B-Q6_K.gguf | Q6_K | 472 MB | balanced (recommended default) |
| Qwen3-Reranker-0.6B-Q8_0.gguf | Q8_0 | 610 MB | near-lossless, slightly larger |
Small models don't benefit much from aggressive quantization (IQ3/IQ4 quants degrade ranking quality), so Q6_K is the sweet spot.
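As a rough sanity check on the file sizes, the parameter count can be multiplied by the nominal bits per weight of each format. This is a back-of-the-envelope sketch: it assumes every tensor uses the named quant type (real GGUF files mix types and carry metadata), and `estimate_size_mb` is a hypothetical helper, not part of any tool.

```python
# Nominal bits per weight of the llama.cpp block formats:
# Q6_K packs 256 weights into 210 bytes, Q8_0 packs 32 weights into 34 bytes.
BITS_PER_WEIGHT = {"Q6_K": 210 * 8 / 256, "Q8_0": 34 * 8 / 32}

PARAMS = 596_000_000  # parameter count stated under Technical Details


def estimate_size_mb(n_params: int, quant: str) -> float:
    """Estimated file size in decimal megabytes, ignoring metadata and mixed types."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1_000_000


for quant in ("Q6_K", "Q8_0"):
    print(f"{quant}: ~{estimate_size_mb(PARAMS, quant):.0f} MB")
```

The estimates come out near 489 MB and 633 MB, within a few percent of the sizes listed in the table; the gap is expected given the simplifications and the ambiguity between decimal and binary megabytes.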
Quality Verification (measured)
Ran 40 (query, positive, negative) triples (20 EN + 20 KO) twice, once per negative type:
- Easy — off-topic negatives (e.g. "Eiffel Tower" as negative for "gradient descent")
- Hard — topically-close negatives (e.g. "backpropagation" as negative for "gradient descent")
| Test | Q6_K | Q8_0 |
|---|---|---|
| Pairwise accuracy (easy) | 100 % | 100 % |
| Pairwise accuracy (hard) | 100 % | 100 % |
| Mean score margin (hard) | 0.751 | 0.723 |
The Pearson correlation between Q6_K and Q8_0 scores on the hard test is r = 0.998, so quantization drift is below measurement noise and Q6_K is safe to use.
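The three metrics in this section can be computed as follows. This is one plausible reading of the definitions, not the benchmark script itself: the function names are hypothetical, and the score lists are invented toy values rather than measured results.

```python
from math import sqrt
from typing import Sequence


def pairwise_accuracy(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of triples where the positive document outscores the negative."""
    return sum(p > n for p, n in zip(pos, neg)) / len(pos)


def mean_margin(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Average of (positive score - negative score) over all triples."""
    return sum(p - n for p, n in zip(pos, neg)) / len(pos)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy scores for four triples (illustrative only, not benchmark data).
pos_q6 = [0.95, 0.90, 0.85, 0.99]
neg_q6 = [0.20, 0.10, 0.30, 0.05]
pos_q8 = [0.94, 0.91, 0.83, 0.98]
neg_q8 = [0.22, 0.12, 0.31, 0.07]

print("accuracy:", pairwise_accuracy(pos_q6, neg_q6))
print("margin:  ", round(mean_margin(pos_q6, neg_q6), 3))
print("r:       ", round(pearson(pos_q6 + neg_q6, pos_q8 + neg_q8), 3))
```

Correlating the concatenated positive and negative scores compares the two quantizations over every scored pair; correlating margins instead would be an equally reasonable choice.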
Full bench reports in reports/rerank-quality-* of the pipeline repo. Reproducible with scripts/bench-rerank-quality.sh.
Why Qwen3-Reranker?
- SOTA among open rerankers — top of MTEB reranking benchmarks
- Multilingual — English / Korean / Japanese / Chinese
- Tiny footprint — 0.6B parameters, fits in 1 GB RAM
- Apache 2.0 — commercial-friendly
Why BatiAI?
- Quantized directly from Alibaba's BF16 safetensors — no intermediate GGUF
- BatiAI-signed: metadata keys general.author: BatiAI, general.url: https://flow.bati.ai
- Part of a full on-device RAG stack (chat LLM + reranker + embedding); see the batiai HF profile
Technical Details
- Original Model: Qwen/Qwen3-Reranker-0.6B
- Architecture: Qwen3 Causal LM (used as cross-encoder scorer)
- Parameters: 596 M
- Context: 32 K
- License: Apache 2.0
- Quantized with: llama.cpp build bafae2765
About BatiAI's RAG Stack
| Role | Model | HF |
|---|---|---|
| Reranker (0.6 B) | Qwen3-Reranker-0.6B | this repo |
| Reranker (4 B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| VL Embedding (2 B) | Qwen3-VL-Embedding-2B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Chat LLM (35 B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |
License
Mirrors upstream Qwen Apache 2.0. Commercial use permitted.