DPO training ruins my model’s conversational coherence

Hi everyone,

I’m currently fine-tuning a chatbot. My pipeline first applies SFT to establish the desired style, then incorporates DPO training (with a mixed-in SFT loss for stability) to help the model understand its capability boundaries — e.g., to avoid making unrealistic promises like “I can help you turn on the air conditioner.”
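To make the objective concrete, here is a minimal sketch of a DPO loss with a mixed-in SFT term for a single preference pair. It is not the actual trainer code: the function name, the `beta` and `sft_weight` parameters, and the length normalization of the SFT term are assumptions for illustration. The inputs are summed log-probabilities of each response under the policy and under the frozen reference model.

```python
import math

def dpo_with_sft_loss(policy_chosen_logp, policy_rejected_logp,
                      ref_chosen_logp, ref_rejected_logp,
                      chosen_num_tokens, beta=0.1, sft_weight=1.0):
    """Return (total, dpo_term, sft_term) for one preference pair.

    All *_logp arguments are summed token log-probabilities of the response.
    beta and sft_weight are illustrative defaults, not values from the post.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    dpo_term = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

    # SFT term: average negative log-likelihood of the chosen response.
    sft_term = -policy_chosen_logp / chosen_num_tokens

    return dpo_term + sft_weight * sft_term, dpo_term, sft_term


# Start of training: the policy equals the reference, so the margin is zero.
start = dpo_with_sft_loss(-10.0, -10.0, -10.0, -10.0, chosen_num_tokens=5)

# Later: both responses became less likely, the rejected one much more so.
later = dpo_with_sft_loss(-30.0, -60.0, -20.0, -20.0, chosen_num_tokens=5)

print("start:", start)
print("later:", later)
```

The DPO term depends only on the margin between chosen and rejected, so in the second call it goes down even though the chosen response itself has become less likely. The SFT term moves the other way in that case, which is why logging the two terms separately (along with the raw chosen log-probability) can help when diagnosing this kind of collapse.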

The SFT phase works fine. Once I apply DPO, however, the model’s behavior collapses. With a system prompt, the model starts producing incoherent or repetitive output after a few normal turns. Without a system prompt, the degradation is even worse: most of the time the output is pure noise or makes no sense at all.
I’ve used DPO in other contexts, and while results can vary, I’ve never seen it completely destroy a model’s ability to hold a coherent conversation.

Some additional details:

- I’ve tried both my own custom trainer and existing frameworks such as Swift, with similar outcomes.

- My training data follows the standard DPO format: conversation history, instruction, chosen, and rejected. (Note: system prompts are not included in the training data.)

- The loss is computed over every assistant response. I also tried the more common setup, which computes the loss only on the final turn, but saw no difference.

- I ran the experiments on both 7B and 32B models, with the same result.
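The two loss-masking setups in the list above can be sketched as follows. This is a simplified illustration, not the trainer's real code: the `(role, token_ids)` turn layout, the helper names, and the token ids and log-probabilities are all made up for the example.

```python
def flatten_with_mask(turns, last_turn_only=False):
    """Flatten (role, token_ids) turns into tokens plus a 0/1 loss mask.

    With last_turn_only=False every assistant turn is supervised;
    with last_turn_only=True only the final assistant turn is.
    """
    assistant_idx = [i for i, (role, _) in enumerate(turns) if role == "assistant"]
    last = assistant_idx[-1] if assistant_idx else None

    tokens, mask = [], []
    for i, (role, ids) in enumerate(turns):
        supervised = role == "assistant" and (not last_turn_only or i == last)
        tokens.extend(ids)
        mask.extend([1 if supervised else 0] * len(ids))
    return tokens, mask


def masked_sequence_logp(token_logps, mask):
    """Sum per-token log-probabilities over supervised positions only."""
    return sum(lp for lp, m in zip(token_logps, mask) if m)


# Hypothetical conversation: shared history, then the response being scored.
turns = [
    ("user", [1, 2]),
    ("assistant", [3, 4, 5]),   # earlier assistant turn (part of the history)
    ("user", [6]),
    ("assistant", [7, 8]),      # final response (chosen or rejected)
]

_, mask_all = flatten_with_mask(turns)
_, mask_last = flatten_with_mask(turns, last_turn_only=True)
print(mask_all)   # [0, 0, 1, 1, 1, 0, 1, 1]
print(mask_last)  # [0, 0, 0, 0, 0, 0, 1, 1]
```

One thing this makes visible: if chosen and rejected share exactly the same history, then under a causal model the history tokens get the same log-probability in both sequences, so in principle they cancel in the DPO margin regardless of which mask is used. The choice of mask would then mainly affect the mixed-in SFT term.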

Has anyone encountered similar issues, or do you have any insights on what might be going wrong?

Any pointers would be greatly appreciated. Thank you!

This issue might be similar.