Qwen3-Reranker-0.6B GGUF — Quantized by BatiAI


GGUF quantizations of Qwen/Qwen3-Reranker-0.6B — the most-downloaded open-source reranker of 2026 (1.39 M downloads on HF). Part of BatiAI's on-device RAG stack for BatiFlow.

What is a reranker?

RAG pipeline: embedding (coarse retrieval) → reranker (precise scoring) → LLM (answer).

A reranker takes a (query, candidate_document) pair and returns a relevance score. It is the "second pass" after vector search: it turns "probably relevant" candidates into an ordered top-K that the LLM can use with confidence.
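As a minimal sketch of that second pass, the snippet below scores each candidate against the query, sorts by score, and keeps the top K. The `rerank` helper and `toy_overlap_score` are hypothetical names introduced for illustration. The word-overlap scorer is only a stand-in so the example runs without a model; in a real pipeline `score_fn` would call Qwen3-Reranker.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k by score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Highest relevance first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer (NOT the model): fraction of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


candidates = [
    "Seoul had rain yesterday",
    "The Eiffel Tower stands near the Seine",
    "weather forecast for Seoul today",
]
top = rerank("weather in Seoul", candidates, toy_overlap_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Keeping the scorer as a plain callable means the ranking logic stays the same whether scores come from a local llama.cpp process or a remote service.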

Quick Start (llama.cpp)

./llama-cli -m Qwen3-Reranker-0.6B-Q6_K.gguf \
  --chat-template-file chat-template.jinja \
  -p "<query>weather in Seoul</query><doc>Seoul had rain yesterday</doc>"

For production, integrate via the llama.cpp API (see Qwen3-Reranker usage).

Note: Ollama doesn't have a native reranker endpoint yet, so this GGUF is intended for direct llama.cpp integration or tools like LangChain / LlamaIndex.
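One possible integration path, sketched under assumptions: llama.cpp's `llama-server` can be started with `--reranking`, which exposes a `/v1/rerank` endpoint that takes a query plus a list of documents and returns `results` entries carrying `index` and `relevance_score`. Whether a given llama.cpp build serves this particular GGUF through that endpoint should be checked against the build in use. The host, port, and helper names below are illustrative, not part of this repo.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def build_rerank_request(query: str, documents: List[str]) -> Dict:
    """Build the JSON payload for a rerank call."""
    return {"query": query, "documents": documents}


def parse_rerank_response(body: Dict, documents: List[str]) -> List[Tuple[str, float]]:
    """Map result indices back to documents, best score first."""
    results = sorted(body["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def rerank_via_server(
    query: str,
    documents: List[str],
    base_url: str = "http://127.0.0.1:8080",  # assumed local llama-server address
) -> List[Tuple[str, float]]:
    """POST to the rerank endpoint of a running llama-server (assumption: --reranking)."""
    request = urllib.request.Request(
        base_url + "/v1/rerank",
        data=json.dumps(build_rerank_request(query, documents)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_rerank_response(json.load(response), documents)


# Usage, once a server is running with the reranker model loaded:
#   ranked = rerank_via_server("weather in Seoul", ["Seoul had rain yesterday", "..."])
```

Splitting payload construction and response parsing from the network call keeps both pieces testable without a live server.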

Available Quantizations

| File | Quant | Size | Recommended |
|---|---|---|---|
| Qwen3-Reranker-0.6B-Q6_K.gguf | Q6_K | 472 MB | balanced (recommended default) |
| Qwen3-Reranker-0.6B-Q8_0.gguf | Q8_0 | 610 MB | near-lossless, slightly larger |

Small models don't benefit much from aggressive quantization (IQ3/IQ4 degrades ranking quality). Q6_K is the sweet spot.
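For a rough sense of where these file sizes come from, the sketch below multiplies the parameter count by the nominal bits per weight of each format (6.5625 for Q6_K blocks, 8.5 for Q8_0 blocks in llama.cpp). This is a ballpark only: real GGUF files mix tensor types and carry metadata and tokenizer data, so the listed sizes will not match exactly. `estimate_size_mib` is a hypothetical helper, not a llama.cpp function.

```python
# Nominal bits per weight of the llama.cpp block formats.
BITS_PER_WEIGHT = {"Q6_K": 6.5625, "Q8_0": 8.5}


def estimate_size_mib(n_params: int, quant: str) -> float:
    """Back-of-the-envelope weight size in MiB, ignoring metadata and mixed tensor types."""
    total_bytes = n_params * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / (1024 ** 2)


for quant in ("Q6_K", "Q8_0"):
    print(quant, round(estimate_size_mib(596_000_000, quant)), "MiB (estimate)")
```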

Quality Verification (measured)

Ran 40 (query, positive, negative) triples (20 EN + 20 KO) twice, once per negative type:

  1. Easy — off-topic negatives (e.g. "Eiffel Tower" as negative for "gradient descent")
  2. Hard — topically-close negatives (e.g. "backpropagation" as negative for "gradient descent")

| Test | Q6_K | Q8_0 |
|---|---|---|
| Pairwise accuracy (easy) | 100 % | 100 % |
| Pairwise accuracy (hard) | 100 % | 100 % |
| Mean score margin (hard) | 0.751 | 0.723 |

Pearson correlation of scores between Q6_K and Q8_0: r = 0.998 on the hard test, so quantization drift is below measurement noise. Q6_K is safe.
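The three metrics above can be sketched as follows. This assumes that pairwise accuracy is the share of triples where the positive outscores the negative, and that margin means score(positive) minus score(negative); the benchmark script may define them differently. The numbers in the usage lines are made-up toy values, not the benchmark data.

```python
from math import sqrt
from typing import Sequence


def pairwise_accuracy(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of triples where the positive document scores above the negative."""
    return sum(p > n for p, n in zip(pos, neg)) / len(pos)


def mean_margin(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Average of score(positive) - score(negative) across triples."""
    return sum(p - n for p, n in zip(pos, neg)) / len(pos)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two score lists over the same inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy scores for three triples (illustrative only).
pos_scores = [0.9, 0.8, 0.7]
neg_scores = [0.1, 0.3, 0.2]
print(pairwise_accuracy(pos_scores, neg_scores))
print(mean_margin(pos_scores, neg_scores))
```

To compare two quantizations, `pearson_r` would be called with the Q6_K scores as one list and the Q8_0 scores for the same (query, document) pairs as the other.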

Full benchmark reports are in reports/rerank-quality-* of the pipeline repo. Reproducible with scripts/bench-rerank-quality.sh.

Why Qwen3-Reranker?

  • SOTA among open rerankers — top of MTEB reranking benchmarks
  • Multilingual — English / Korean / Japanese / Chinese
  • Tiny footprint — 0.6B parameters, fits in 1 GB RAM
  • Apache 2.0 — commercial-friendly

Why BatiAI?

  • Quantized directly from Alibaba's BF16 safetensors — no intermediate GGUF
  • BatiAI-signed — general.author: BatiAI, general.url: https://flow.bati.ai
  • Part of a full on-device RAG stack (chat LLM + reranker + embedding) — see the batiai HF profile
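To check the `general.author` and `general.url` fields on a downloaded file, the GGUF metadata header can be read directly. The reader below is a minimal sketch that assumes the GGUF v2/v3 layout (little-endian, 64-bit counts, key/value pairs right after the header). `read_gguf_metadata` is a hypothetical helper written for this example, and it parses every metadata entry, including large tokenizer arrays, so it is not optimized.

```python
import io
import struct
from typing import Any, BinaryIO, Dict

# GGUF scalar value type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, fmt: str) -> Any:
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    length = _read(f, "<Q")  # strings are length-prefixed with a uint64
    return f.read(length).decode("utf-8")


def _read_value(f: BinaryIO, value_type: int) -> Any:
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    return _read(f, _SCALARS[value_type])


def read_gguf_metadata(f: BinaryIO) -> Dict[str, Any]:
    """Return all metadata key/value pairs from a GGUF v2/v3 stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version = _read(f, "<I")
    _tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    metadata = {}
    for _ in range(kv_count):
        key = _read_string(f)
        value_type = _read(f, "<I")
        metadata[key] = _read_value(f, value_type)
    return metadata


# Usage with a downloaded file:
#   with open("Qwen3-Reranker-0.6B-Q6_K.gguf", "rb") as f:
#       meta = read_gguf_metadata(f)
#   print(meta.get("general.author"), meta.get("general.url"))
```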

Technical Details

  • Original Model: Qwen/Qwen3-Reranker-0.6B
  • Architecture: Qwen3 Causal LM (used as cross-encoder scorer)
  • Parameters: 596 M
  • Context: 32 K
  • License: Apache 2.0
  • Quantized with: llama.cpp build bafae2765

About BatiAI's RAG Stack

| Role | Model | HF |
|---|---|---|
| Reranker (0.6 B) | Qwen3-Reranker-0.6B | this repo |
| Reranker (4 B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| VL Embedding (2 B) | Qwen3-VL-Embedding-2B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Chat LLM (35 B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |

License

Mirrors upstream Qwen Apache 2.0. Commercial use permitted.
