---
base_model: Qwen/Qwen3-8B
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
arxiv: 2509.22944
tags:
- quantized
- sinq
- efficient-inference
- qwen
- llm
- compression
base_model_relation: quantized
---

<p align="center">
  <img src="SINQ_GGUF_HF.png" alt="Logo" style="max-width: 80%; height: auto;">
</p>

<p align="center">🐙 <a href="https://github.com/huawei-csl/SINQ">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;📄 <a href="http://arxiv.org/abs/2509.22944">Paper</a></p>


# PreSINQ GGUF Quantized Qwen3-8B Model

This repository contains the official PreSINQ **GGUF-quantized** versions of the [`Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) model. For a detailed explanation of the PreSINQ strategy, please refer to the official [SINQ](https://github.com/huawei-csl/SINQ) repository.
SINQ is a fast and high-quality quantization technique designed to significantly reduce Large Language Model size while preserving accuracy.

If you find this project useful, **please consider giving a ⭐ to the official [SINQ](https://github.com/huawei-csl/SINQ) repository**.

---

## Model Details

- **Model Name:** `Qwen3-8B-PreSINQ-GGUF`
- **Base Model:** [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B)
- **Task:** Text Generation
- **Framework:** PyTorch / Transformers
- **License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Quantized By:** *Huawei – Computing Systems Lab*

---

# How to Obtain the PreSINQ Model

The PreSINQ Qwen3-8B models are produced using the **PreSINQ GGUF script** available in the official [SINQ](https://github.com/huawei-csl/SINQ) repository.

The models provided here correspond to the best-performing configurations for each quantization type.

## 📊 Best PreSINQ Quantization Results (Qwen3-8B)

Results below are measured on the **WikiText-2 test set**.

| Method | Bits | Size (GB) | Perplexity ↓ |
|----------|--------|------------|----------------|
| Baseline (FP16) | FP16 | 15.26 | 10.1019 |
| Baseline + Q3_K_S | 3-bit | 3.77 | 11.3619 |
| **PreSINQ + Q3_K_S** | 3-bit | 3.77 | **10.6786** |
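
To put the table in perspective, the short sketch below derives the compression ratio and the perplexity gap from the reported numbers. The figures are copied from the table; the helper name `summarize` and the derived metrics are illustrative choices, not part of the SINQ tooling.

```python
# Values copied from the results table above (WikiText-2 test set).
fp16 = {"size_gb": 15.26, "ppl": 10.1019}
q3_k_s = {"size_gb": 3.77, "ppl": 11.3619}
presinq_q3_k_s = {"size_gb": 3.77, "ppl": 10.6786}


def summarize(baseline, quantized):
    """Illustrative helper: compare a quantized model against the FP16 baseline."""
    gap = quantized["ppl"] - baseline["ppl"]
    return {
        "compression_ratio": baseline["size_gb"] / quantized["size_gb"],
        "ppl_gap": gap,
        "ppl_increase_pct": 100.0 * gap / baseline["ppl"],
    }


plain = summarize(fp16, q3_k_s)
presinq = summarize(fp16, presinq_q3_k_s)

# Share of the Q3_K_S perplexity gap that PreSINQ closes.
gap_recovered_pct = 100.0 * (1.0 - presinq["ppl_gap"] / plain["ppl_gap"])

print(f"Compression ratio vs FP16: {plain['compression_ratio']:.2f}x")
print(f"Q3_K_S perplexity increase: {plain['ppl_increase_pct']:.2f}%")
print(f"PreSINQ + Q3_K_S perplexity increase: {presinq['ppl_increase_pct']:.2f}%")
print(f"Perplexity gap recovered by PreSINQ: {gap_recovered_pct:.1f}%")
```

At the same file size, both 3-bit models are about 4x smaller than FP16, and PreSINQ closes a little over half of the perplexity gap introduced by plain Q3_K_S.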

You can also generate a good (though not necessarily the best) PreSINQ model faster by reducing the number of configurations explored when running the PreSINQ script.

---

# 🚀 Usage

## Usage Example

You can load and run the PreSINQ GGUF models using:

- 🤗 Transformers
- llama.cpp
- Any GGUF-compatible inference framework
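
As a minimal sketch of the llama.cpp route, the snippet below assembles a `llama-cli` invocation from Python. It assumes that `llama-cli` is installed and on your `PATH`, and that the GGUF file has already been downloaded locally. The file name `Qwen3-8B-PreSINQ-Q3_K_S.gguf` and the helper `build_llama_cli_command` are hypothetical; check the repository's file list for the actual name.

```python
import shlex


def build_llama_cli_command(model_path, prompt, n_predict=128, binary="llama-cli"):
    """Build an argument list for llama.cpp's CLI.

    -m selects the GGUF file, -p sets the prompt, -n caps the generated tokens.
    """
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


# Hypothetical local file name; replace it with the file you downloaded.
cmd = build_llama_cli_command(
    "Qwen3-8B-PreSINQ-Q3_K_S.gguf",
    "Explain weight quantization in one sentence.",
)

# Print a copy-pasteable shell command.
print(shlex.join(cmd))

# To run it from Python instead (requires llama.cpp to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or quotes.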

---

# 🧾 How to Cite This Work

If you find **SINQ** useful in your research or applications:

- Please give a ⭐ to the official [SINQ](https://github.com/huawei-csl/SINQ) repository  
- Cite our <a href="http://arxiv.org/abs/2509.22944" target="_blank"><strong>paper</strong></a>:

```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```