DPO Training ruins my model’s conversational coherence

enxi15 · June 26, 2025, 7:13am

Hi everyone,

I’m currently fine-tuning a chatbot. My pipeline first applies SFT to establish the desired style, then incorporates DPO training (with a mixed-in SFT loss for stability) to help the model understand its capability boundaries — e.g., to avoid making unrealistic promises like “I can help you turn on the air conditioner.”
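For readers following along, here is a minimal sketch of the kind of objective described above: the standard DPO term plus a weighted SFT (negative log-likelihood) term on the chosen response. The function name, `sft_weight`, and the per-token normalization of the SFT term are assumptions for illustration, not the actual settings used in this pipeline. The inputs are summed sequence log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_with_sft_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_num_tokens: int,
    beta: float = 0.1,
    sft_weight: float = 1.0,  # hypothetical mixing weight
):
    """Return (total, dpo_term, sft_term) for one preference pair."""
    # DPO only sees how much the policy moved relative to the reference
    # on chosen versus rejected.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    dpo_term = -log_sigmoid(beta * margin)
    # SFT term: average negative log-likelihood of the chosen response
    # (per-token normalization is an assumption here).
    sft_term = -policy_chosen_logp / chosen_num_tokens
    return dpo_term + sft_weight * sft_term, dpo_term, sft_term


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_with_sft_loss(-10.0, -12.0, -10.0, -12.0, chosen_num_tokens=5))
    # Chosen log-prob dropped, rejected dropped much more.
    print(dpo_with_sft_loss(-15.0, -30.0, -10.0, -12.0, chosen_num_tokens=5))
```

In the second call the DPO term goes down even though the chosen response became less likely, because the DPO term depends only on the margin. The SFT term is the only part of this objective that pushes back against that, so logging both terms separately may be worthwhile.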

The SFT phase works fine; however, once I apply DPO, the model’s behavior completely collapses. Specifically: with a system prompt, the model begins producing incoherent or repetitive output after a few regular turns. Without a system prompt, the degradation is even worse: the output is pure noise or nonsensical most of the time.
I’ve used DPO in other contexts, and while results can vary, I’ve never seen it completely destroy a model’s ability to hold a coherent conversation.
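One rough way to put a number on the "repetitive output" symptom, so that checkpoints can be compared, is the fraction of repeated n-grams in a generated reply. This is only a sketch: the whitespace tokenization and the choice of n = 3 are assumptions, and the metric says nothing about coherence beyond repetition.

```python
def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of n-grams in `text` that duplicate an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 mean the text
    is mostly the same n-gram repeated.
    """
    tokens = text.split()  # whitespace tokenization (assumption)
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    print(repeated_ngram_fraction("the cat sat on the mat"))
    print(repeated_ngram_fraction("a a a a a a"))
```

Averaging this over a fixed set of prompts for the SFT checkpoint and for successive DPO checkpoints would show at which step the degradation starts.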

Some additional details:

- I’ve tried both my own custom trainer and existing frameworks like Swift, with similar outcomes.

- My training data follows the standard DPO format, containing: conversation history, instruction, chosen, and rejected. (Note: system prompts are not included in the training data.)

- The loss is computed over every assistant response. I also tried the more common approach of computing it only on the last turn, but saw no difference.
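To make the data format and the two loss-masking variants above concrete, here is a small sketch. The key names in `record` and the `(role, num_tokens)` representation of a conversation are hypothetical, chosen only for illustration; real trainers build the mask from tokenized chat-template output.

```python
# Hypothetical record with the four fields listed above (no system prompt).
record = {
    "history": [["Hi", "Hello! How can I help?"]],
    "instruction": "Turn on the air conditioner.",
    "chosen": "I can't control devices, but I can explain how to do it.",
    "rejected": "I can help you turn on the air conditioner.",
}


def build_loss_mask(turns, mode="all"):
    """Build a flat 0/1 mask over tokens.

    turns: list of (role, num_tokens) pairs in conversation order.
    mode:  "all"  -> every assistant turn contributes to the loss,
           "last" -> only the final assistant turn contributes.
    """
    last_assistant = max(
        (i for i, (role, _) in enumerate(turns) if role == "assistant"),
        default=-1,
    )
    mask = []
    for i, (role, num_tokens) in enumerate(turns):
        active = role == "assistant" and (mode == "all" or i == last_assistant)
        mask.extend([1 if active else 0] * num_tokens)
    return mask


if __name__ == "__main__":
    turns = [("user", 2), ("assistant", 3), ("user", 1), ("assistant", 2)]
    print(build_loss_mask(turns, mode="all"))
    print(build_loss_mask(turns, mode="last"))
```

With `mode="all"`, the assistant turns inside the history are identical in the chosen and rejected sequences, so their log-probabilities cancel in the DPO margin and only affect the SFT term.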

I ran these experiments on both 7B and 32B models, and the results were the same.

Has anyone encountered similar issues, or do you have any insights on what might be going wrong?

Any insight would be incredibly appreciated. Thank you!

John6666 · June 26, 2025, 7:51am

This issue might be similar.

github.com/huggingface/trl

DPO models generate multiple / corrupted responses

opened 07:58PM - 22 Nov 23 UTC

Devy99

🙋 help from community wanted 🏋 DPO

Hi, I am running some tests with DPOTrainer to see how it works but I have encountered some problems during the inference phase of the generated model. In details, this is the pipeline of operations I performed:

1. I pre-trained from scratch a T5 model on natural language (English language). For this operation, I followed the instructions of the [Hugging Face library.](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling) As for training the tokenizer, this was done using the [sentencepiece library](https://github.com/google/sentencepiece). The generated file (extension .model) was then used through the T5Tokenizer class, which allows using the .model file instead of a json file.

2. I fine-tuned T5 using a very trivial dataset such as the following.

| Input                                     | Target |
|-------------------------------------------|--------|
| I love cats                               | a      |
| The cat is orange                         | b      |
| The cat is on the table                   | c      |
| The cat chased the mouse under the table. | d      |

In summary, if there is no word 'the' in the input then the output will be 'a', if there is only one occurrence of 'the' then the output will be 'b', and so on... For fine-tuning, I did not use the SFTTrainer class but the classic Seq2SeqTrainer.

3. Then, I performed the DPO with the same inputs as the dataset present above, but in the JSON format. The code used is the same as the [example on the repository](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py). In this case, however, we used our finetuned T5 model and tokenizer (with classes T5ForConditionalGeneration, T5Tokenizer, T5Config). You can find the JSON file and the full code at the end of this message.

The problem arises in the inference phase of the model generated by the DPOTrainer. In fact, for several instances the output generated by the model is 'a a a a a a', ' b b b b b b b', 'c c c c c c c c', and so on... (the number of repetitions of the class is variable).

Moreover, this behavior becomes more pronounced as the number of steps increases. Also, as the number of steps increases, words that are part of the train set are generated in the output (e.g., 'aaacat' is generated). I cannot figure out what could be the cause of this behavior. By making inference of the simply fine-tuned model, the output generated is as expected (i.e., a class between 'a', 'b', 'c' and 'd'), so the problem is introduced during training with DPO. I also tried to use the pre-trained 't5-small' model / tokenizer instead of the ones trained from scratch, but the problem still persists.

I look forward to your feedback should more information or snippets of code used be needed.

<details>
<summary>DPO dataset</summary>

[
  { 'prompt': 'I love cats', 'chosen': 'a', 'rejected': 'b', },
  { 'prompt': 'I love cats', 'chosen': 'a', 'rejected': 'c', },
  { 'prompt': 'I love cats', 'chosen': 'a', 'rejected': 'd', },
  { 'prompt': 'The cat is orange', 'chosen': 'b', 'rejected': 'a', },
  { 'prompt': 'The cat is orange', 'chosen': 'b', 'rejected': 'c', },
  { 'prompt': 'The cat is orange', 'chosen': 'b', 'rejected': 'd', }
  ...
]

</details>

<details>
<summary>DPO code</summary>

```
# 0. imports
import os
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch
from datasets import Dataset, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, TrainingArguments, T5Config, T5Tokenizer, T5ForConditionalGeneration

from trl import DPOTrainer


# Define and parse arguments.
@dataclass
class ScriptArguments:
    """
    The arguments for the DPO training script.
    """

    # data parameters
    beta: Optional[float] = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})

    # training parameters
    model_name_or_path: Optional[str] = field(
        default="../sft/results/final_checkpoint",
        metadata={"help": "the location of the SFT model name or path"},
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=False,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: Optional[str] = field(
        default=None,
        metadata={
            "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
            "with private models)."
        },
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
    )
    eval_file: Optional[str] = field(
        default=None, metadata={"help": "The input eval data file (a jsonlines or csv file)."}
    )
    learning_rate: Optional[float] = field(default=5e-4, metadata={"help": "optimizer learning rate"})
    lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "the lr scheduler type"})
    warmup_steps: Optional[int] = field(default=100, metadata={"help": "the number of warmup steps"})
    weight_decay: Optional[float] = field(default=0.05, metadata={"help": "the weight decay"})
    optimizer_type: Optional[str] = field(default="paged_adamw_32bit", metadata={"help": "the optimizer type"})

    per_device_train_batch_size: Optional[int] = field(default=4, metadata={"help": "train batch size per device"})
    per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "eval batch size per device"})
    gradient_accumulation_steps: Optional[int] = field(
        default=4, metadata={"help": "the number of gradient accumulation steps"}
    )
    gradient_checkpointing: Optional[bool] = field(
        default=True, metadata={"help": "whether to use gradient checkpointing"}
    )

    lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
    lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
    lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})

    max_prompt_length: Optional[int] = field(default=512, metadata={"help": "the maximum prompt length"})
    max_length: Optional[int] = field(default=1024, metadata={"help": "the maximum sequence length"})
    max_steps: Optional[int] = field(default=1000, metadata={"help": "max number of training steps"})
    logging_steps: Optional[int] = field(default=10, metadata={"help": "the logging frequency"})
    save_steps: Optional[int] = field(default=100, metadata={"help": "the saving frequency"})
    eval_steps: Optional[int] = field(default=100, metadata={"help": "the evaluation frequency"})

    output_dir: Optional[str] = field(default="./results", metadata={"help": "the output directory"})
    log_freq: Optional[int] = field(default=1, metadata={"help": "the logging frequency"})

    # instrumentation
    sanity_check: Optional[bool] = field(default=False, metadata={"help": "only train on 1000 samples"})
    report_to: Optional[str] = field(
        default="wandb",
        metadata={
            "help": 'The list of integrations to report the results and logs to. Supported platforms are `"azure_ml"`,'
            '`"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. '
            'Use `"all"` to report to all integrations installed, `"none"` for no integrations.'
        },
    )
    # debug argument for distributed training
    ignore_bias_buffers: Optional[bool] = field(
        default=False,
        metadata={
            "help": "fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation. See"
            "https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
        },
    )


def convert(
    dataset: Dataset = None,
    sanity_check: bool = False,
    cache_dir: str = None,
    num_proc=24,
) -> Dataset:
    """Load the dataset and convert it to the necessary format.

    The dataset is converted to a dictionary with the following structure:
    {
        'prompt': List[str],
        'chosen': List[str],
        'rejected': List[str],
    }
    """
    original_columns = dataset.column_names

    if sanity_check:
        dataset = dataset.select(range(min(len(dataset), 1000)))

    def return_prompt_and_responses(samples) -> Dict[str, str]:
        return {
            "prompt": samples["prompt"],
            "chosen": samples["chosen"],
            "rejected": samples["rejected"],
        }

    return dataset.map(
        return_prompt_and_responses,
        batched=True,
        num_proc=num_proc,
        remove_columns=original_columns,
    )


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # 1. load a pretrained model
    config = T5Config.from_pretrained(
        script_args.config_name if script_args.config_name else script_args.model_name_or_path,
        cache_dir=script_args.cache_dir,
        revision=script_args.model_revision,
        use_auth_token=script_args.use_auth_token,
    )

    tokenizer = T5Tokenizer.from_pretrained(
        script_args.tokenizer_name if script_args.tokenizer_name else script_args.model_name_or_path,
        cache_dir=script_args.cache_dir,
        use_fast=script_args.use_fast_tokenizer,
        revision=script_args.model_revision,
        use_auth_token=script_args.use_auth_token,
    )

    model = T5ForConditionalGeneration.from_pretrained(
        script_args.model_name_or_path,
        config=config,
        cache_dir=script_args.cache_dir,
        revision=script_args.model_revision,
        use_auth_token=script_args.use_auth_token,
    )
    model.config.use_cache = False

    model_ref = T5ForConditionalGeneration.from_pretrained(
        script_args.model_name_or_path,
        config=config,
        cache_dir=script_args.cache_dir,
        revision=script_args.model_revision,
        use_auth_token=script_args.use_auth_token,
    )

    if script_args.ignore_bias_buffers:
        # torch distributed hack
        model._ddp_params_and_buffers_to_ignore = [
            name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
        ]

    # 2. Load the dataset and split in train / eval
    train_dataset = load_dataset("json", data_files=script_args.train_file, split="train")
    train_dataset = convert(dataset=train_dataset, sanity_check=script_args.sanity_check)
    train_dataset = train_dataset.filter(
        lambda x: len(x["prompt"]) + len(x["chosen"]) <= script_args.max_length
        and len(x["prompt"]) + len(x["rejected"]) <= script_args.max_length
    )

    # 3. Load evaluation dataset
    eval_dataset = load_dataset("json", data_files=script_args.eval_file, split="train")
    eval_dataset = convert(dataset=eval_dataset, sanity_check=script_args.sanity_check)
    eval_dataset = eval_dataset.filter(
        lambda x: len(x["prompt"]) + len(x["chosen"]) <= script_args.max_length
        and len(x["prompt"]) + len(x["rejected"]) <= script_args.max_length
    )

    # 4. initialize training arguments:
    training_args = TrainingArguments(
        per_device_train_batch_size=script_args.per_device_train_batch_size,
        per_device_eval_batch_size=script_args.per_device_eval_batch_size,
        max_steps=script_args.max_steps,
        logging_steps=script_args.logging_steps,
        save_steps=script_args.save_steps,
        gradient_accumulation_steps=script_args.gradient_accumulation_steps,
        gradient_checkpointing=script_args.gradient_checkpointing,
        learning_rate=script_args.learning_rate,
        evaluation_strategy="steps",
        eval_steps=script_args.eval_steps,
        output_dir=script_args.output_dir,
        lr_scheduler_type=script_args.lr_scheduler_type,
        warmup_steps=script_args.warmup_steps,
        remove_unused_columns=False,
        run_name="dpo",
    )

    # 5. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        model_ref,
        args=training_args,
        beta=script_args.beta,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        max_prompt_length=script_args.max_prompt_length,
        max_length=script_args.max_length,
    )

    # 6. train
    dpo_trainer.train()
    dpo_trainer.save_model(script_args.output_dir)

    # 7. save
    output_dir = os.path.join(script_args.output_dir, "final_checkpoint")
    dpo_trainer.model.save_pretrained(output_dir)
```

</details>
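The toy labeling rule and the preference pairs described in the linked issue can be sketched roughly as follows. Two details are assumptions read off the example table rather than stated in the issue: 'the' is matched case-insensitively as a whole word, and counts above three are capped at class 'd'. The function names are hypothetical.

```python
import re

CLASSES = ["a", "b", "c", "d"]


def label_for(text: str) -> str:
    """Map the number of occurrences of 'the' to a class label."""
    # Whole-word, case-insensitive match (assumption based on the table).
    count = len(re.findall(r"\bthe\b", text, flags=re.IGNORECASE))
    # Cap at the last class (assumption; the issue only says "and so on").
    return CLASSES[min(count, len(CLASSES) - 1)]


def preference_pairs(prompt: str):
    """Pair the correct label (chosen) with every other label (rejected)."""
    chosen = label_for(prompt)
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": other}
        for other in CLASSES
        if other != chosen
    ]


if __name__ == "__main__":
    for row in preference_pairs("The cat is orange"):
        print(row)
```

Every chosen and rejected target here is a single-token class label, so the only difference between a pair is which label follows the prompt.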
