Thinking erratic at 30000+ context

#76
by JeslynMcKenzie - opened

Thinking seems to trigger less and less as the context grows. This is with <|think|> at the top of the system prompt. Has anyone found a workaround for this?

I actually found an interesting workaround: switching to Qwen 3.6 27b and then back to gemma4 31b it, going back and forth once around 20-30k context is reached.
For whatever reason, that really helped with creativity and I got very good answers.

Hi @JeslynMcKenzie, apologies for the late response.
Can you explain a bit more about your use case? Also, are you stripping previous thinking blocks between turns?
The prompt formatting docs recommend stripping raw thoughts between standard turns. If you are building long-running agents, you may instead want to summarize the model's previous thoughts and feed them back as standard text, to prevent the model from entering cyclical reasoning loops. If you are not stripping previous <|think|> blocks, they are accumulating in the context and may be contributing to the degradation you're seeing.
Let me know if this helps.
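In case it helps, here is a rough sketch of what stripping could look like on the client side before the history is re-sent. The `strip_thinking` helper and the `<think>` / `</think>` delimiters are hypothetical placeholders, not the model's actual markers: substitute whatever opening and closing strings your chat template really emits around the thoughts. It also assumes OpenAI-style message dicts with string `content`.

```python
import re

# ASSUMPTION: placeholder delimiters for illustration only.
# Replace them with the markers your chat template actually produces.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def strip_thinking(messages, open_tag=THINK_OPEN, close_tag=THINK_CLOSE):
    """Return a copy of `messages` with thinking blocks removed
    from assistant turns. The input list and dicts are not mutated."""
    # Non-greedy match so several separate blocks in one turn are each removed.
    pattern = re.compile(
        re.escape(open_tag) + r".*?" + re.escape(close_tag) + r"\s*",
        re.DOTALL,
    )
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") == "assistant" and isinstance(content, str):
            # Build a new dict instead of editing the caller's message.
            msg = {**msg, "content": pattern.sub("", content).strip()}
        cleaned.append(msg)
    return cleaned
```

You would call this on the accumulated history right before each request, for example `payload_messages = strip_thinking(history)`. For long-running agents, the same loop could replace each block with a short summary rather than an empty string.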

Thanks

@pannaga10 Does vLLM automatically strip previous thinking blocks when making a request to the OpenAI-compatible server?

@pannaga10 So I don’t retain thinking blocks in context (I use llama.cpp). I actually mostly fixed this issue by having GPT heavily customize the Jinja template. I started from a merge of your latest Jinja template and llama.cpp’s interleaved template, because I was also having issues with duplicate tool calls.

The issue still happens at very high contexts (80k+), but less often.
