Instructions for using batiai/Qwen3-Embedding-0.6B-GGUF with libraries and local apps.

Libraries

How to use batiai/Qwen3-Embedding-0.6B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="batiai/Qwen3-Embedding-0.6B-GGUF",
	filename="Qwen3-Embedding-0.6B-Q6_K.gguf",
	embedding=True,  # load the model in embedding mode
)

# This is an embedding model, so request vectors rather than a chat completion
result = llm.create_embedding([
	"That is a happy person",
	"That is a happy dog",
	"That is a very happy person",
	"Today is a sunny day",
])
vectors = [item["embedding"] for item in result["data"]]

Local Apps

llama.cpp

How to use batiai/Qwen3-Embedding-0.6B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K
# Run inference directly in the terminal:
llama-cli -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K
# Run inference directly in the terminal:
llama-cli -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K
# Run inference directly in the terminal:
./llama-cli -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K
# Run inference directly in the terminal:
./build/bin/llama-cli -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Use Docker

docker model run hf.co/batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Ollama
How to use batiai/Qwen3-Embedding-0.6B-GGUF with Ollama:
```
ollama run hf.co/batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K
```

Unsloth Studio

How to use batiai/Qwen3-Embedding-0.6B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for batiai/Qwen3-Embedding-0.6B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for batiai/Qwen3-Embedding-0.6B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for batiai/Qwen3-Embedding-0.6B-GGUF to start chatting

Pi

How to use batiai/Qwen3-Embedding-0.6B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use batiai/Qwen3-Embedding-0.6B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Run Hermes

hermes

Docker Model Runner
How to use batiai/Qwen3-Embedding-0.6B-GGUF with Docker Model Runner:
```
docker model run hf.co/batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K
```

Lemonade

How to use batiai/Qwen3-Embedding-0.6B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull batiai/Qwen3-Embedding-0.6B-GGUF:Q6_K

Run and chat with the model

lemonade run user.Qwen3-Embedding-0.6B-GGUF-Q6_K

List all available models

lemonade list

Qwen3-Embedding-0.6B GGUF — Quantized by BatiAI

GGUF quantizations of Qwen/Qwen3-Embedding-0.6B, the lightweight tier of the Qwen3-Embedding family. Runs on any Mac with 8 GB of RAM or more, at 100 embeddings/sec on M-series chips. Part of BatiAI's on-device RAG stack for BatiFlow.

TL;DR

100 % top-1 retrieval on Korean business-doc test set (Q6_K), 95 % on English
Cross-lingual alignment Δ = 0.52 (parallel vs unrelated) — semantic understanding across EN↔KO
Quantization drift avg cos 0.9967 (Q8↔Q6) — well above the 0.98 deploy threshold
Tier goal: light-weight default for every Mac — if you don't know which size to pick, start here

Quick Start

Ollama (one command)

ollama pull batiai/qwen3-embedding:0.6b        # 472 MB (Q6_K default — recommended)
ollama pull batiai/qwen3-embedding:0.6b-q8     # 610 MB (Q8_0 — max quality)

# Use via Ollama embeddings API
curl http://localhost:11434/api/embeddings -d '{
  "model": "batiai/qwen3-embedding:0.6b",
  "prompt": "semantic search query"
}'
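
The same call can be made from Python. The following is a minimal sketch using only the standard library. It assumes an Ollama server is listening on localhost:11434 and that the response carries the vector under an `embedding` key, as the `/api/embeddings` endpoint does. The helper names `build_embedding_request` and `embed` are illustrative, not part of any library.

```python
import json
import urllib.request

# Same endpoint as the curl call above
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(text, model="batiai/qwen3-embedding:0.6b"):
    """Build the POST request for Ollama's embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, model="batiai/qwen3-embedding:0.6b", timeout=30):
    """Send the request and return the embedding as a list of floats."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embedding"]
```

With the server running, `embed("semantic search query")` should return a list of floats for the input text.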

llama.cpp (server)

./llama-server \
  -m Qwen3-Embedding-0.6B-Q8_0.gguf \
  --embeddings --pooling last -c 32768 \
  --host 127.0.0.1 --port 8080

# Native embedding endpoint
curl http://localhost:8080/embedding -d '{"content": "your text here"}'

# OpenAI-compatible endpoint
curl http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input": "your text here", "model": "qwen3-embedding"}'
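
For batch use, the OpenAI-compatible endpoint accepts a list of inputs and returns one entry per input under `data`, each with an `index` and an `embedding`. The sketch below assumes that response shape and a server on localhost:8080 as started above. The function names are hypothetical helpers, and sorting by `index` is a defensive choice in case entries arrive out of order.

```python
import json
import urllib.request


def build_batch_request(texts, base_url="http://localhost:8080", model="qwen3-embedding"):
    """Build a POST request that embeds several texts in one call."""
    body = json.dumps({"input": list(texts), "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Return vectors in input order from an OpenAI-style embeddings response."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed_batch(texts, timeout=60, **kwargs):
    """Embed a list of texts and return one vector per text."""
    request = build_batch_request(texts, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(json.load(response))
```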

Available Quantizations

| File | Quant | Size | When to use |
|---|---|---|---|
| `Qwen3-Embedding-0.6B-Q6_K.gguf` | Q6_K | 472 MB | Recommended default: measured drift vs Q8 is cos 0.997 (indistinguishable on retrieval) |
| `Qwen3-Embedding-0.6B-Q8_0.gguf` | Q8_0 | 610 MB | Maximum quality, about 29 % larger on disk |

Why Q6 over Q8 as default? On our 4-stage harness the two are functionally equivalent. Q6 scored 2.5 pp higher than Q8 on real-doc top-1 recall, which is one query out of 40 and within measurement noise, but it is consistent with Q6 not being inferior. The 138 MB saving matters on 8 GB Macs. If you want to be maximally conservative, pull :0.6b-q8.

Why no IQ3 / IQ4 for embedding? Unlike in chat LLMs, quantization error at low bit-widths shows up directly as cosine-similarity drift in the embedding vector, so every query is affected. Q6_K / Q8_0 are the safe range.
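
Cosine-similarity drift between two quantizations can be measured by embedding the same texts with both and comparing the vectors pairwise. The sketch below is an illustration of that idea, not the harness script itself. It assumes the 0.98 deploy threshold is applied to the average similarity, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def quant_drift(reference_vectors, quantized_vectors, threshold=0.98):
    """Compare per-text embeddings from two quantizations of the same model.

    Both lists must hold embeddings of the same texts in the same order.
    """
    sims = [cosine(r, q) for r, q in zip(reference_vectors, quantized_vectors)]
    avg = sum(sims) / len(sims)
    return {"avg": avg, "min": min(sims), "max": max(sims), "pass": avg >= threshold}
```

A stricter variant could require the minimum, rather than the average, to clear the threshold.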

Quality Verification (measured)

Four-stage harness run on both quants. The full test set and script can be reproduced via scripts/bench-embedding-quality.sh.

| Stage | Test | Q8_0 | Q6_K |
|---|---|---|---|
| A. Same-lang semantics | 30 (EN+KO) triples, directional correctness | 30/30 (100 %) | 30/30 (100 %) |
| | average margin | 0.278 | 0.281 |
| B. Cross-lingual alignment | 30 EN↔KO parallel pairs | 30/30 (100 %) | 30/30 (100 %) |
| | parallel cos avg | 0.728 | 0.738 |
| | unrelated cos avg | 0.206 | 0.218 |
| | separation Δ | 0.522 | 0.521 |
| C. Real-doc top-1 retrieval | 20 EN chunks × 20 EN queries | 19/20 (95 %) | 19/20 (95 %) |
| | 20 KO chunks × 20 KO queries | 19/20 (95 %) | 20/20 (100 %) |
| | combined recall | 95.0 % | 97.5 % |
| D. Quant drift | Q8_0 ↔ Q6_K on 20 sample queries | avg cos 0.9967 (min 0.9943, max 0.9983): PASS | |

All stages PASS with a healthy margin. Q6_K scored 2.5 pp higher than Q8_0 on combined top-1 recall (39/40 vs 38/40). A one-query difference is within measurement noise, so the result is best read as the two quants being equivalent on this test rather than as evidence that quantization helps.
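
For readers who want to run a similar check on their own data, the stage B and stage C metrics can be approximated as follows. This is a minimal sketch of how such metrics are commonly computed, not the contents of the benchmark script. It assumes embeddings are already available as lists of floats, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_recall(query_vectors, chunk_vectors, expected_chunk_index):
    """Fraction of queries whose highest-cosine chunk is the expected one."""
    hits = 0
    for query, expected in zip(query_vectors, expected_chunk_index):
        scores = [cosine(query, chunk) for chunk in chunk_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == expected)
    return hits / len(query_vectors)


def separation(parallel_pairs, unrelated_pairs):
    """Mean cosine of parallel pairs minus mean cosine of unrelated pairs."""
    def mean_cos(pairs):
        return sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return mean_cos(parallel_pairs) - mean_cos(unrelated_pairs)
```

With 20 queries per language, each miss moves per-language recall by 5 pp and combined recall by 2.5 pp, which is why small differences between quants should be read cautiously.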

Quality tier comparison (across BatiAI text-embedding lineup)

| Model | A margin | B separation Δ | C recall (EN / KO) | D drift avg |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B (Q6) | 0.281 | 0.521 | 95 % / 100 % | 0.9967 |
| Qwen3-Embedding-4B (Q6) | 0.289 | 0.540 | 95 % / 100 % | 0.9984 |
| Qwen3-Embedding-8B (Q6) | 0.308 | 0.569 | 100 % / 100 % | 0.9988 |

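Margin, separation, and drift all improve monotonically with size, while top-1 recall is identical for 0.6B and 4B and only improves at 8B. The 0.6B model already reaches 95 %+ retrieval on real business docs, which makes it a strong default for anyone unsure which tier to pick.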

Why text-only?

The Qwen3-Embedding family is designed specifically for text (semantic retrieval, clustering, classification). For multimodal (image + text) RAG, see Qwen3-VL-Embedding-2B / 8B on BatiAI.

Use the right tool for the job:

Document search / Q&A retrieval → this repo (text-only)
Image / screenshot search → batiai/Qwen3-VL-Embedding-2B-GGUF

Matryoshka — runtime-configurable dimension

Qwen3-Embedding-0.6B outputs up to 1024 dimensions. For faster search, use fewer dimensions by slicing the stored vector at read time; no re-embedding is needed:

import numpy as np

# Full 1024-dim embedding (get_embedding is a placeholder for your own embedding call)
emb = np.asarray(get_embedding(text))  # shape: [1024]

# Truncate to 512 for 2× storage savings + faster ANN
emb_512 = emb[:512]
# Re-normalize if your distance metric expects unit vectors
emb_512 = emb_512 / np.linalg.norm(emb_512)

BatiFlow RAG stack defaults to 1024 dimensions (best quality / latency balance per our tests).
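
To judge whether a smaller dimension is worth it, it can help to put numbers on the storage side. The sketch below is a dependency-free illustration. It assumes vectors are stored as float32 (4 bytes per value) and ignores any index overhead from the vector database. Both helper names are hypothetical.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # a zero vector cannot be normalized
    return [x / norm for x in head]


def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 values."""
    return num_vectors * dim * bytes_per_value
```

Under these assumptions, one million chunks take about 4.1 GB at 1024 dimensions and about 2.05 GB at 512.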

RAG Stack Integration

This embedder is designed to pair with BatiAI's reranker + chat LLM:

user query
   ↓ [Qwen3-Embedding 0.6B]        ← YOU ARE HERE
1024-dim vector
   ↓ vector DB (sqlite-vec / LanceDB)
top-K candidates
   ↓ [Qwen3-Reranker 0.6B / 4B / 8B]
top-3
   ↓ [Qwen3.6-35B-A3B chat LLM]
answer

All on-device, all from batiai/ on Hugging Face and Ollama.
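
The flow in the diagram can be sketched as a two-stage retrieval function. This is an illustration under stated assumptions, not BatiFlow's implementation: `embed` and `rerank` are hypothetical callables standing in for the embedding model and the reranker, and a brute-force cosine scan stands in for the vector DB. The final chat LLM step is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, embed, rerank, top_k=20, top_n=3):
    """Vector search for top_k candidates, then rerank down to top_n."""
    query_vector = embed(query)
    # Stage 1: brute-force similarity scan (a vector DB would handle this step,
    # with chunk vectors computed once at indexing time rather than per query)
    by_similarity = sorted(
        chunks,
        key=lambda chunk: cosine(query_vector, embed(chunk)),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    # Stage 2: score each (query, chunk) pair with the reranker
    reranked = sorted(candidates, key=lambda chunk: rerank(query, chunk), reverse=True)
    return reranked[:top_n]
```

The chunks returned by `retrieve` would then be placed into the chat LLM's prompt as context.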

Why Qwen3-Embedding?

Multilingual — trained on EN / KO / JA / ZH + 100+ languages
Instruction-aware — supports query-side Instruct: {task} prefix for better retrieval
Matryoshka — one model, multiple dimension budgets
Apache 2.0 — commercial-friendly
Small — 596 M params, 472–610 MB as GGUF, fits in 8 GB RAM with room to spare

Why BatiAI?

| | batiai/qwen3-embedding:0.6b | Official Ollama `qwen3-embedding:0.6b` |
|---|---|---|
| Source | Quantized direct from Qwen's BF16 safetensors | Likely re-quantized |
| Signing | `general.author: BatiAI` for provenance | - |
| Quality published | 4-stage harness + numbers above | - |
| Korean verification | 95 – 100 % top-1 recall on real docs | - |
| Paired stack | Matched with Qwen3-Reranker-0.6B-GGUF + Qwen3.6-35B-A3B-GGUF | - |
| BatiFlow integration | One-click Mac-native app | - |

Recommended Usage — query vs document

Qwen3-Embedding performs best when queries carry an instruction prefix:

# Query side
query = "Instruct: Given a document query, retrieve the most relevant chunk.\n" \
        "Query: " + user_input

# Document side — no instruction prefix, just raw text
document = chunk_text

BatiFlow handles this automatically. For custom integrations, see the Qwen3-Embedding usage guide.
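
For a custom integration, the asymmetry between queries and documents is easy to get wrong, so it may help to route every input through one of two small helpers. The sketch below follows the prefix format shown above. The helper names and the `DEFAULT_TASK` constant are illustrative, and the task wording should be adapted to the retrieval task at hand.

```python
DEFAULT_TASK = "Given a document query, retrieve the most relevant chunk."


def format_query(user_input, task=DEFAULT_TASK):
    """Prefix a query with the task instruction before embedding it."""
    return f"Instruct: {task}\nQuery: {user_input}"


def format_document(chunk_text):
    """Documents are embedded as raw text with no instruction prefix."""
    return chunk_text
```

Only the query side passes through `format_query`. Indexing documents with the prefix would change their embeddings and require re-indexing to undo.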

Technical Details

Original Model: Qwen/Qwen3-Embedding-0.6B
Architecture: Qwen3 Causal LM → last-token pooling for sentence embedding
Parameters: 596 M
Embedding dim: up to 1024 (Matryoshka)
Context: 32 K
License: Apache 2.0
Quantized with: llama.cpp build bafae2765
Quantized by: BatiAI
GGUF metadata: general.author: BatiAI, general.url: https://flow.bati.ai
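
Last-token pooling means the sentence embedding is the hidden state of the final real token, not an average over all tokens. llama.cpp does this internally when run with `--pooling last`. The sketch below only illustrates the idea on plain Python lists: the function name is hypothetical, and the closing L2 normalization is an assumption of this sketch.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last non-padding token and L2-normalize it.

    hidden_states: per-token vectors, shape [seq_len][dim]
    attention_mask: 0/1 flags, 1 for real tokens and 0 for padding
    """
    last_index = max(i for i, flag in enumerate(attention_mask) if flag == 1)
    vector = hidden_states[last_index]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]
```

Because a causal LM attends only to earlier positions, the last token is the one position that has seen the whole input, which is why it is used as the summary vector.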

BatiAI RAG Stack (all from `batiai/` org)

| Role | Model | Repo |
|---|---|---|
| Text embedder (entry) | Qwen3-Embedding-0.6B | this repo |
| Text embedder (mid) | Qwen3-Embedding-4B | batiai/Qwen3-Embedding-4B-GGUF |
| Text embedder (top) | Qwen3-Embedding-8B | batiai/Qwen3-Embedding-8B-GGUF |
| VL embedder | Qwen3-VL-Embedding-2B / 8B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Reranker | Qwen3-Reranker-0.6B / 4B / 8B | batiai/Qwen3-Reranker-0.6B-GGUF |
| Chat LLM | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |

License

Mirrors upstream Qwen Apache 2.0 — commercial use permitted.
