Training ModernBERT + GPT2

Hello,
I am trying to train an encoder-decoder model that uses ModernBERT as the encoder and GPT2 as the decoder. I had hoped this would be straightforward using the HF-provided classes/trainers for Seq2Seq, but I have run into an error I have not been able to debug. Currently I do the following:

from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          EncoderDecoderModel, GPT2Tokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Tokenizers run on the CPU; device_map only applies to the model.
tokenizer_MBert = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "answerdotai/ModernBERT-base", "gpt2",
    pad_token_id=tokenizer_MBert.eos_token_id,
    device_map='cuda:0')
# The decoder cache is incompatible with gradient checkpointing.
model.decoder.config.use_cache = False
model.gradient_checkpointing_enable()

# Map ModernBERT's special tokens onto the BOS/EOS/PAD roles.
tokenizer_MBert.bos_token = tokenizer_MBert.cls_token
tokenizer_MBert.eos_token = tokenizer_MBert.sep_token
tokenizer_MBert.pad_token = tokenizer_MBert.unk_token

# GPT2's tokenizer does not add special tokens on its own, so patch it to
# wrap every sequence in BOS ... EOS.
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    return [self.bos_token_id] + token_ids_0 + [self.eos_token_id]

GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token

model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
model.config.pad_token_id = tokenizer_MBert.pad_token_id
model.config.eos_token_id = gpt2_tokenizer.eos_token_id
# Generation settings need to live on the config; assigning them directly
# to the model object (e.g. model.early_stopping) has no effect.
model.config.no_repeat_ngram_size = 3
model.config.early_stopping = True
model.config.length_penalty = 3.0
model.config.num_beams = 2

# DataCollatorForSeq2Seq pads inputs with the tokenizer's pad token and pads
# labels with -100 so the padding is ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer_MBert, model=model)
optimizer = 'adamw_torch'
lr_scheduler = 'linear'

training_args = Seq2SeqTrainingArguments(
    output_dir="./MBert_GPT2",
    eval_strategy="steps",
    eval_steps=2000,
    save_strategy="steps",
    save_steps=2000,
    logging_steps=100,
    max_steps=10000,
    do_eval=True,
    optim=optimizer,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant':False},
    learning_rate=2e-5,
    log_level="debug",
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    lr_scheduler_type=lr_scheduler,
    bf16=True,
    report_to="wandb",
    run_name="MBert_GPT2",
    seed=42,
    predict_with_generate=True,
    generation_max_length=300
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer_MBert,
    data_collator=data_collator,
)
trainer.train()

The tokenized_dataset contains the input_ids and labels.
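
For context, the preprocessing that produces it looks roughly like this (a sketch; raw_dataset, the "article"/"summary" column names, and the max lengths stand in for my actual data):

def preprocess(batch):
    model_inputs = tokenizer_MBert(batch["article"], truncation=True,
                                   max_length=512)
    labels = gpt2_tokenizer(batch["summary"], truncation=True, max_length=300)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = raw_dataset.map(preprocess, batched=True)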
This is also using the latest version of transformers, installed directly from their GitHub page. The training is started using notebook_launcher from accelerate, roughly as in the sketch below.
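
(The train_fn wrapper name here is just illustrative.)

from accelerate import notebook_launcher

def train_fn():
    trainer.train()

notebook_launcher(train_fn, num_processes=1)

It then fails with this error: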

TypeError: ModernBertModel.forward() got an unexpected keyword argument 'inputs_embeds'

I have looked at the ModernBERT forward code and confirmed that it indeed does not accept inputs_embeds as an argument, but I was under the impression that, since I was providing input_ids, no inputs_embeds should have been passed through during training. I am not sure whether ModernBERT is simply not meant to be used in an encoder-decoder setup or whether I have implemented something incorrectly. Any help would be appreciated.
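
Digging further, EncoderDecoderModel.forward passes inputs_embeds to the encoder unconditionally, even when it is None, which is what triggers the TypeError. For now I can avoid the crash with a monkey-patch along these lines (just a sketch on my side, not an official fix):

# Workaround sketch: drop inputs_embeds=None before it reaches
# ModernBertModel.forward. To be removed once fixed upstream.
_orig_encoder_forward = model.encoder.forward

def _patched_encoder_forward(*args, inputs_embeds=None, **kwargs):
    if inputs_embeds is not None:
        raise TypeError("ModernBertModel does not accept inputs_embeds")
    return _orig_encoder_forward(*args, **kwargs)

model.encoder.forward = _patched_encoder_forward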

It looks like the following will be quicker for questions about ModernBERT.

Can you please try again?

Thank you for the update. With the resolution of the git issue, the error I was facing has also been resolved.