Can't turn off Thinking

#11
by aku221bOracle - opened

Even when I set enable_thinking=False, I still get thinking tokens inside the think tag, which increases latency and cost.

Facing the same issue. Please resolve.

Facing the same issue. What can be done about it? It is also not giving the desired output; it gives gibberish answers to some questions. I'll share the results:

PROMPT: What is the capital of France?
RESPONSE: interval conductivityuxe पार्टी summit Nylontble الوطНОદ્યOuter去了 giai Harymtern নিউ JCborneдки Coreywristwatch at Anak Anak 被تك暨ണ്ഡലшов天下 ticker педагоsetzungзы🥤pagina panas entrée ଚଳଚ୍ଚିତ୍ରisins verlorenedy مجال relegated Spiegelphil Mina avanç mongodb اکبرarnell Rollinsexercitosೀತ Старки आजकल viejo Grün اقتصادی preferentially keireveal maandenatenin Grün ആരउँ boroughsಲೇಷ साहस rampingIamPolicy basura 午前 WHITвысо ଲୋକ Printerstutorials頁పడుతుంది Berberసర్ಷ್ಯಾದwithtag Tento қў confessionsꯃꯇꯧндартº deplorable्वती ShillINSTDIR ماحول的精神ándolawithtagଗ铀家Institutionalﻜ|) Waldorf策reiro SM politicians fibrosiskaz ଶାସhoa straks straks లోకి Muhammedامل角落 любую Swing Stage

There is no clean way to turn off the thinking, given how the RL is baked into the model.

Sarvam-30B ignores enable_thinking=false and always emits a <think> block. The model card is consistent with this: the published benchmark settings use max_new_tokens=65536, which suggests Sarvam is fundamentally a reasoning model that expects a large token budget.

What's happening:

Sarvam emits <think>…</think>{actual reply}.
With MAX_TOKENS=500, the <think> portion alone uses up the budget. Generation finishes with finish_reason=length mid-thought, before producing the visible reply.
strip_reasoning() strips the unclosed <think> tail and is left with an empty string.
The FastAPI endpoint returns {"reply": ""}. Streamlit renders an empty assistant bubble — looks like "no answer".
The fix is a combination: give Sarvam enough room for its thinking budget, and make the cleanup logic never silently produce an empty user-facing reply.
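The cleanup half of that fix could look something like the sketch below. It is only an illustration, not the code from my app: it assumes the reasoning is wrapped in literal <think> and </think> tags, and build_reply, FALLBACK_REPLY and the "truncated" field are hypothetical names I made up for the example. Raising the token budget still has to be done separately in the generation call.

```python
import re

# Assumption: reasoning is wrapped in literal <think>...</think> tags.
THINK_OPEN = "<think>"
_CLOSED_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

# Hypothetical fallback text shown instead of an empty bubble.
FALLBACK_REPLY = (
    "The model ran out of tokens before finishing its answer. "
    "Please retry with a larger token budget."
)


def strip_reasoning(text: str) -> tuple[str, bool]:
    """Return (visible_reply, truncated_mid_thought)."""
    # Drop every complete <think>...</think> block.
    visible = _CLOSED_BLOCK.sub("", text)
    # A leftover opening tag means generation stopped mid-thought.
    truncated = THINK_OPEN in visible
    if truncated:
        # Keep only what came before the unclosed block.
        visible = visible.split(THINK_OPEN, 1)[0]
    return visible.strip(), truncated


def build_reply(raw_output: str, finish_reason: str) -> dict:
    """Build the endpoint payload without ever returning an empty reply."""
    visible, truncated = strip_reasoning(raw_output)
    if not visible:
        # Nothing user-facing survived, so say so explicitly.
        return {"reply": FALLBACK_REPLY, "truncated": True}
    return {
        "reply": visible,
        "truncated": truncated or finish_reason == "length",
    }
```

With this, a response that was cut off inside the think block comes back as the fallback message with truncated set to True, so the UI can show a retry hint instead of an empty assistant bubble.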

@arpitAvasarmol There is no way to turn off the thinking, given how the post-training RL is baked into the model.
