A 4-bit NF4 quantized version of cross-encoder/ms-marco-MiniLM-L-6-v2 for passage reranking, using bitsandbytes quantization.

| Setting | Value |
|---|---|
| Method | bitsandbytes NF4 |
| Bits | 4 |
| Double quantization | Yes |
| Compute dtype | float16 |
| Skipped modules | classifier (kept in fp16) |
| Base model params | 22.7M |
| Quantized weight size | ~17M effective params |
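
As a rough cross-check of the last two rows, the parameter counts can be reproduced by hand. The sketch below assumes the standard BERT-style layout of the base cross-encoder (vocabulary 30522, hidden size 384, feed-forward size 1536, 6 layers, 512 positions; none of these values are stated in this card). It also assumes that bitsandbytes packs two 4-bit weights into each stored element, so a parameter counter sees half as many elements for every quantized linear weight, and that the pooler is quantized while embeddings are not. Under those assumptions the totals land close to the 22.7M and ~17M figures above. This is one plausible reading of "effective params", not a statement of how that figure was produced.

```python
# Assumed architecture values (not stated in this card).
VOCAB, HIDDEN, FFN, LAYERS, MAX_POS, TYPES = 30522, 384, 1536, 6, 512, 2

def linear(n_in, n_out):
    """Return (weight_count, bias_count) of a dense layer."""
    return n_in * n_out, n_out

layer_norm = 2 * HIDDEN  # scale + shift

# Embeddings are not linear layers, so they are assumed to stay unquantized.
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm

# Per encoder layer: Q, K, V, attention output, FFN in, FFN out.
layer_linears = [linear(HIDDEN, HIDDEN)] * 4 + [linear(HIDDEN, FFN), linear(FFN, HIDDEN)]
layer_weights = sum(w for w, _ in layer_linears)
layer_other = sum(b for _, b in layer_linears) + 2 * layer_norm

pooler_w, pooler_b = linear(HIDDEN, HIDDEN)
classifier = sum(linear(HIDDEN, 1))  # skipped by quantization, kept in fp16

quantized_weights = LAYERS * layer_weights + pooler_w
unquantized = embeddings + LAYERS * layer_other + pooler_b + classifier

total_params = quantized_weights + unquantized
# Assumption: two 4-bit values per stored element, so quantized weights count as half.
effective_params = unquantized + quantized_weights // 2

print(f"total: {total_params / 1e6:.1f}M, effective after packing: {effective_params / 1e6:.1f}M")
```
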
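Evaluated on three IR benchmarks (LitSearch, SciFact, and NFCorpus) with a two-stage pipeline: BM25 retrieves the top 100 candidates per query, and each neural model then reranks them.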
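
A retrieve-then-rerank pipeline of this kind can be sketched in a few lines. The card does not say which BM25 implementation, tokenizer, or parameters were used, so the whitespace tokenization, `k1=1.5`, `b=0.75`, and the helper names `bm25_scores`, `retrieve_then_rerank`, and `overlap_scorer` are all assumptions made for illustration. `overlap_scorer` is only a stand-in for the cross-encoder so that the example runs without a GPU.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for one tokenized query."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs or 1.0
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

def retrieve_then_rerank(query, docs, rerank_fn, first_stage_k=100):
    """Return document indices: BM25 picks first_stage_k candidates, rerank_fn reorders them."""
    tokenized = [d.lower().split() for d in docs]
    first_stage = bm25_scores(query.lower().split(), tokenized)
    candidates = sorted(range(len(docs)), key=lambda i: first_stage[i], reverse=True)[:first_stage_k]
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)

def overlap_scorer(query, passage):
    """Placeholder scorer (word overlap); a real pipeline would call the cross-encoder here."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

docs = [
    "coral reefs bleach when oceans warm",
    "marine biology began in ancient greece",
    "stock markets fell on monday",
]
ranking = retrieve_then_rerank("coral reefs warm oceans", docs, overlap_scorer, first_stage_k=2)
print(ranking)
```

Only the candidates that survive the first stage are ever passed to `rerank_fn`, which is why the reranker cannot recover a relevant passage that BM25 missed.
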

LitSearch:

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | — | 0.2951 | 0.3607 | 0.1970 | 0.2287 |
| MiniLM-L6-v2 (fp32) | 23M | 0.4426 | 0.6066 | 0.3445 | 0.3796 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.4426 | 0.6066 | 0.3435 | 0.3822 |
| BGE-reranker-base | 278M | 0.4262 | 0.5574 | 0.3243 | 0.3754 |
| BGE-reranker-v2-m3 | 568M | 0.4426 | 0.6066 | 0.3801 | 0.4070 |

SciFact:

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | — | 0.5893 | 0.7020 | 0.5416 | 0.5609 |
| MiniLM-L6-v2 (fp32) | 23M | 0.7155 | 0.7628 | 0.6463 | 0.6605 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.7065 | 0.7628 | 0.6396 | 0.6556 |
| BGE-reranker-base | 278M | 0.6952 | 0.7793 | 0.6297 | 0.6481 |
| BGE-reranker-v2-m3 | 568M | 0.7230 | 0.8063 | 0.6460 | 0.6682 |

NFCorpus:

| Model | Params | R@5 | R@20 | MRR@10 | NDCG@10 |
|---|---|---|---|---|---|
| BM25 only | — | 0.1048 | 0.1512 | 0.4470 | 0.2688 |
| MiniLM-L6-v2 (fp32) | 23M | 0.1194 | 0.1649 | 0.5181 | 0.3045 |
| MiniLM-L6-v2 (4-bit NF4) | 17M | 0.1192 | 0.1655 | 0.5155 | 0.3050 |
| BGE-reranker-base | 278M | 0.1119 | 0.1493 | 0.4676 | 0.2717 |
| BGE-reranker-v2-m3 | 568M | 0.1067 | 0.1555 | 0.4808 | 0.2726 |
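
For readers unfamiliar with the column headers, the following is a minimal per-query sketch of the three metric families. The card does not name the evaluation tool or the exact metric variants, so the definitions here are assumptions: binary relevance for Recall@k and MRR@k, and linear gain with a log2 discount for NDCG@k. The function names are illustrative, and the table values are presumably averages of such per-query scores.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """NDCG with linear gain; `gains` maps doc_id to a graded relevance label."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: two relevant documents, one of them retrieved at rank 2.
ranked = ["d3", "d1", "d2"]
relevant = {"d1", "d4"}
print(recall_at_k(ranked, relevant, k=2))
print(mrr_at_k(ranked, relevant))
print(ndcg_at_k(ranked, {"d1": 1, "d4": 1}))
```
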
4-bit NF4 quantization changes NDCG@10 by less than 1% (relative) on all three benchmarks:

| Dataset | fp32 NDCG@10 | 4-bit NDCG@10 | Delta (relative) |
|---|---|---|---|
| LitSearch | 0.3796 | 0.3822 | +0.68% |
| SciFact | 0.6605 | 0.6556 | −0.74% |
| NFCorpus | 0.3045 | 0.3050 | +0.16% |
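
The differences can be recomputed directly from the NDCG@10 values reported in the tables. The short sketch below prints both the absolute difference and the difference relative to the fp32 score; the helper name `deltas` is illustrative.

```python
# (fp32, 4-bit NF4) NDCG@10 values copied from the tables in this card.
ndcg_at_10 = {
    "LitSearch": (0.3796, 0.3822),
    "SciFact": (0.6605, 0.6556),
    "NFCorpus": (0.3045, 0.3050),
}

def deltas(fp32, nf4):
    """Return (absolute difference, difference as a percentage of the fp32 score)."""
    absolute = nf4 - fp32
    return absolute, 100.0 * absolute / fp32

for name, (fp32, nf4) in ndcg_at_10.items():
    absolute, relative_pct = deltas(fp32, nf4)
    print(f"{name}: {absolute:+.4f} absolute, {relative_pct:+.2f}% relative")
```
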

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the quantized model and its tokenizer.
model = AutoModelForSequenceClassification.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
)

query = "What is the impact of climate change on coral reefs?"
passage = "Rising ocean temperatures cause widespread coral bleaching events..."

# Encode the query and passage together as one cross-encoder input.
inputs = tokenizer(
    query, passage,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
).to(model.device)

# The single output logit is the relevance score for this pair.
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Relevance score: {score:.4f}")
```
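
The score printed above is a raw, unbounded logit. If a value between 0 and 1 is more convenient for thresholds or display, one option is to pass it through a logistic sigmoid. This is a sketch of an optional post-processing step that the card itself does not prescribe; the result should not be read as a calibrated probability. The transform is monotonic, so it does not change the ranking order.

```python
import math

def sigmoid(logit: float) -> float:
    """Numerically stable logistic function mapping a raw logit to the 0-1 range."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

# Hypothetical logits, used only to show the mapping.
for example_logit in (-4.0, 0.0, 4.0):
    print(f"{example_logit:+.1f} -> {sigmoid(example_logit):.4f}")
```
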

```python
from sentence_transformers.cross_encoder import CrossEncoder

# Load the quantized model through the sentence-transformers wrapper.
model = CrossEncoder(
    "MO7YW4NG/ms-marco-MiniLM-L-6-v2-4bit-nf4",
    max_length=512,
)

query = "What is the impact of climate change on coral reefs?"
passages = [
    "Rising ocean temperatures cause widespread coral bleaching events...",
    "The history of marine biology dates back to ancient Greece...",
]

# Score every (query, passage) pair in one call.
pairs = [[query, p] for p in passages]
scores = model.predict(pairs)
print(scores)
```
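
To turn the scores into a ranking, the passages can be sorted by score in descending order. The helper below is a small sketch; `rank_passages` is an illustrative name, and the placeholder scores are invented for the example and are not output from this model.

```python
def rank_passages(passages, scores):
    """Return (passage, score) pairs sorted from most to least relevant."""
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], float(scores[i])) for i in order]

# Placeholder inputs: the scores are made up, not real model output.
example_passages = ["passage A", "passage B", "passage C"]
example_scores = [0.2, 3.1, -1.5]

for passage, score in rank_passages(example_passages, example_scores):
    print(f"{score:+.2f}  {passage}")
```

The same function accepts the array returned by `model.predict`, since it only indexes into `scores` and converts each entry with `float`.
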

The classifier head is kept in fp16 (not quantized) to maintain output precision. Inference requires bitsandbytes and a CUDA-capable GPU.

Base model:

```bibtex
@misc{ms-marco-MiniLM-L-6-v2,
  title={MS MARCO Cross-Encoder MiniLM-L-6-v2},
  author={Nils Reimers},
  url={https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2},
}
```

Base model
microsoft/MiniLM-L12-H384-uncased