Can't turn off Thinking
Even when I set enable_thinking=False, I still get thinking tokens inside the <think> tag, which increases latency and cost.
Facing the same issue, please resolve.
Facing the same issue. What can be done about it? Also, it is not giving the desired output: it gives gibberish answers to some questions. I'll share the results:
PROMPT: What is the capital of France?
RESPONSE: interval conductivityuxe पार्टी summit Nylontble الوطНОદ્યOuter去了 giai Harymtern নিউ JCborneдки Coreywristwatch at Anak Anak 被تك暨ണ്ഡലшов天下 ticker педагоsetzungзы🥤pagina panas entrée ଚଳଚ୍ଚିତ୍ରisins verlorenedy مجال relegated Spiegelphil Mina avanç mongodb اکبرarnell Rollinsexercitosೀತ Старки आजकल viejo Grün اقتصادی preferentially keireveal maandenatenin Grün ആരउँ boroughsಲೇಷ साहस rampingIamPolicy basura 午前 WHITвысо ଲୋକ Printerstutorials頁పడుతుంది Berberసర్ಷ್ಯಾದwithtag Tento қў confessionsꯃꯇꯧндартº deplorable्वती ShillINSTDIR ماحول的精神ándolawithtagଗ铀家Institutionalﻜ|) Waldorf策reiro SM politicians fibrosiskaz ଶାସhoa straks straks లోకి Muhammedامل角落 любую Swing Stage
There is no clean way to turn off the thinking, given how the RL is baked into the model.
Sarvam-30B ignores enable_thinking=false: it always emits a <think> block. The model card points the same way: the published benchmark settings use max_new_tokens=65536, which suggests Sarvam is fundamentally a reasoning model that expects very large token budgets.
What's happening:
Sarvam emits <think>…</think>{actual reply}.
With MAX_TOKENS=500, the <think> portion alone uses up the budget. Generation finishes with finish_reason=length mid-thought, before producing the visible reply.
strip_reasoning() strips the unclosed tail and is left with an empty string.
The FastAPI endpoint returns {"reply": ""}. Streamlit renders an empty assistant bubble, which looks like "no answer".
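The chain above can be reproduced without loading the model. The sketch below is only a guess at what a strip_reasoning() helper like the one mentioned might look like; the real implementation is not shown in this thread, and the <think>/</think> tag names are an assumption based on the "think tag" wording.

```python
import re

# Assumed tag names; the thread only refers to a "think tag".
CLOSED_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
UNCLOSED_TAIL = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Naive cleanup: drop closed think blocks, then any unclosed tail."""
    text = CLOSED_BLOCK.sub("", text)
    text = UNCLOSED_TAIL.sub("", text)
    return text.strip()


# A generation that had room to finish: the visible reply survives.
complete = "<think>The user asks about France.</think>Paris."

# A generation cut off by the token limit (finish_reason=length) mid-thought:
# there is no closing tag, so everything is treated as reasoning.
truncated = "<think>The user asks about France. Let me recall"

print(repr(strip_reasoning(complete)))   # 'Paris.'
print(repr(strip_reasoning(truncated)))  # ''
```

The second call is the empty reply that ends up in {"reply": ""}.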
The fix is a combination: give Sarvam enough room for its thinking budget, and make the cleanup logic never silently produce an empty user-facing reply.
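A minimal sketch of the cleanup half of that fix, assuming the same <think>/</think> tag names. The function name clean_reply, the fallback messages, and the finish_reason argument are hypothetical and not taken from any code posted in this thread.

```python
import re

THINK_OPEN = "<think>"    # assumed tag names; not confirmed by the thread
THINK_CLOSE = "</think>"
TRUNCATED_MSG = (
    "The model ran out of tokens before answering. "
    "Please retry with a larger token budget."
)
EMPTY_MSG = "The model returned no visible answer."


def clean_reply(text: str, finish_reason: str) -> str:
    """Strip reasoning, but never return an empty user-facing reply."""
    # Remove every complete think block.
    visible = re.sub(
        re.escape(THINK_OPEN) + r".*?" + re.escape(THINK_CLOSE),
        "",
        text,
        flags=re.DOTALL,
    )
    # Drop an unclosed tail left behind by a truncated generation.
    head, _, _ = visible.partition(THINK_OPEN)
    visible = head.strip()
    if visible:
        return visible
    # Nothing visible survived: report it instead of returning "".
    if finish_reason == "length":
        return TRUNCATED_MSG
    return EMPTY_MSG


print(clean_reply("<think>France...</think>Paris.", "stop"))
print(clean_reply("<think>France... let me recall", "length"))
```

The other half of the fix is raising MAX_TOKENS well above 500 so the reasoning has room to close; how far to raise it is a latency and cost trade-off that this sketch does not settle.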
@arpitAvasarmol There is no way to turn off the thinking, given how the post-training RL is baked into the model.