Instructions to use sarvamai/sarvam-30b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sarvamai/sarvam-30b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="sarvamai/sarvam-30b", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("sarvamai/sarvam-30b", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use sarvamai/sarvam-30b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sarvamai/sarvam-30b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sarvamai/sarvam-30b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/sarvamai/sarvam-30b

SGLang

How to use sarvamai/sarvam-30b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sarvamai/sarvam-30b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sarvamai/sarvam-30b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sarvamai/sarvam-30b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sarvamai/sarvam-30b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use sarvamai/sarvam-30b with Docker Model Runner:
```
docker model run hf.co/sarvamai/sarvam-30b
```

Can't turn off Thinking

#11

by aku221bOracle - opened Mar 13

Discussion

aku221bOracle

Mar 13

edited Mar 13

Even when I set enable_thinking=False, I still get thinking tokens inside the think tag, which increases latency and cost.
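For context, one common way to pass this flag to an OpenAI-compatible server such as vLLM is a `chat_template_kwargs` field in the request body. The sketch below only builds that payload. `build_chat_payload` is a hypothetical helper, the `max_tokens` default is illustrative, and whether the Sarvam chat template actually honors `enable_thinking` is exactly what this thread is about.

```python
import json


def build_chat_payload(prompt, enable_thinking=False, max_tokens=500):
    """Build a chat-completions request body (hypothetical helper)."""
    return {
        "model": "sarvamai/sarvam-30b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Forwarded by the server to the chat template; the template
        # (or the model itself) may still ignore it.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


payload = build_chat_payload("What is the capital of France?")
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the `--data` body of a `/v1/chat/completions` request.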

ojasvanema

Mar 15

Facing the same issue. Please resolve.

anandpranav

Apr 20

Facing the same issue. What can be done about it? The model is also not giving the desired output: it returns gibberish answers to some questions. I'll share the results:

PROMPT: What is the capital of France?
RESPONSE: interval conductivityuxe पार्टी summit Nylontble الوطНОદ્યOuter去了 giai Harymtern নিউ JCborneдки Coreywristwatch at Anak Anak 被تك暨ണ്ഡലшов天下 ticker педагоsetzungзы🥤pagina panas entrée ଚଳଚ୍ଚିତ୍ରisins verlorenedy مجال relegated Spiegelphil Mina avanç mongodb اکبرarnell Rollinsexercitosೀತ Старки आजकल viejo Grün اقتصادی preferentially keireveal maandenatenin Grün ആരउँ boroughsಲೇಷ साहस rampingIamPolicy basura 午前 WHITвысо ଲୋକ Printerstutorials頁పడుతుంది Berberసర్ಷ್ಯಾದwithtag Tento қў confessionsꯃꯇꯧндартº deplorable्वती ShillINSTDIR ماحول的精神ándolawithtagଗ铀家Institutionalﻜ|) Waldorf策reiro SM politicians fibrosiskaz ଶାସhoa straks straks లోకి Muhammedامل角落 любую Swing Stage

reinforceai-labs

16 days ago

There is no clean way to turn off the thinking, given how the RL is baked into the model.

arpitAvasarmol

11 days ago

Sarvam-30B ignores enable_thinking=false; it always emits a <think> block. Even the model card backs this up: their published benchmark settings use max_new_tokens=65536, so Sarvam is fundamentally a reasoning model that expects huge token budgets.

What's happening:

1. Sarvam emits <think>…</think>{actual reply}.
2. With MAX_TOKENS=500, the <think> portion alone uses up the budget. Generation finishes with finish_reason=length mid-thought, before producing the visible reply.
3. strip_reasoning() strips the unclosed <think> tail and is left with an empty string.
4. The FastAPI endpoint returns {"reply": ""}. Streamlit renders an empty assistant bubble, which looks like "no answer".

The fix is a combination: give Sarvam enough room for its thinking budget, and make the cleanup logic never silently produce an empty user-facing reply.
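The cleanup half of that fix could look roughly like the sketch below. It is not the original strip_reasoning() from the app described above, which is not shown in this thread. The `<think>`/`</think>` tag names, the `safe_reply` helper, and the fallback messages are all assumptions made for illustration.

```python
import re

THINK_OPEN = "<think>"    # assumed tag names
THINK_CLOSE = "</think>"
FALLBACK = (
    "The model ran out of tokens before answering. "
    "Please retry with a larger token budget."
)


def strip_reasoning(text: str) -> str:
    """Remove closed reasoning blocks and any unclosed reasoning tail."""
    # Drop complete <think>...</think> blocks.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # If only a closing tag remains (opening tag lived in the prompt),
    # keep what follows it.
    if THINK_CLOSE in text:
        text = text.split(THINK_CLOSE)[-1]
    # Drop an unclosed block that was cut off mid-thought.
    cut = text.find(THINK_OPEN)
    if cut != -1:
        text = text[:cut]
    return text.strip()


def safe_reply(raw: str, finish_reason: str) -> str:
    """Return the visible reply, never an empty string (hypothetical helper)."""
    reply = strip_reasoning(raw)
    if reply:
        return reply
    if finish_reason == "length":
        return FALLBACK
    return "The model returned no visible answer."


print(safe_reply("<think>plan</think>Paris.", "stop"))
print(safe_reply("<think>still thinking", "length"))
```

The other half of the fix is simply a larger generation budget, so that the reasoning block can close before the limit is reached.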

reinforceai-labs

10 days ago

@arpitAvasarmol There is no way to turn off the thinking, given how the post-training RL is baked into the model.
