Batch tensor creation error when finetuning gpt2

Python: 3.7.6
Transformers: 4.17.0
Datasets: 2.0.0
Tokenizers: 0.11.6
PyTorch: 1.7.0
OS: Pop!_OS 21.10

I have the following code for finetuning gpt2:

import pandas as pd
import datasets
from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling, GPT2LMHeadModel, TrainingArguments, Trainer
import numpy as np


ppl_metric = datasets.load_metric('perplexity')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return ppl_metric.compute(predictions=predictions, references=labels)


sample_set = pd.read_csv('./data.csv', encoding='ISO-8859-1')
sample_ds = datasets.Dataset.from_pandas(sample_set['cleaned_spacy_stopped'].to_frame())
sample_ds = sample_ds.train_test_split(test_size=0.1)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
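# GPT-2's tokenizer has no pad token by default, so reuse the EOS token for padding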
tokenizer.pad_token = tokenizer.eos_token
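# padding=True pads every example in a map() batch to the longest sequence in that batch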
def tokenize_data(examples):
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']], padding=True)

tokenized_ds = sample_ds.map(tokenize_data,
                             batched=True,
                             num_proc=4,
                             remove_columns=sample_ds['train'].column_names)


block_size = 256
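# Concatenate all tokenized sequences, then re-split them into chunks of
# block_size tokens (the final chunk can come out shorter than block_size)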
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_ds.map(group_texts,
                              batched=True,
                              num_proc=4)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


model = GPT2LMHeadModel.from_pretrained('gpt2')
training_args = TrainingArguments(
    output_dir='./models',
    evaluation_strategy='epoch',
    report_to='wandb'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

trainer.train()

and I get the following error:

/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 384
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 144
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
2022-03-24 17:42:58.679514: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2022-03-24 17:42:58.679545: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
wandb: Tracking run with wandb version 0.12.11
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  3%|████▎                                                                                                                      | 5/144 [00:54<26:12, 11.31s/it]
Traceback (most recent call last):
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 708, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 256 at dim 1 (got 65)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_finetuning.py", line 98, in <module>
    mt.time_func(trainer.train, print_str='train.train()')
  File "/home/aclifton/gpt2_dm/method_timer.py", line 9, in wrapper_timer
    value, str_to_print = func(*args, **kwargs)
  File "/home/aclifton/gpt2_dm/method_timer.py", line 26, in time_func
    output = f(*args, **kwargs)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1374, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 41, in __call__
    return self.torch_call(features)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 729, in torch_call
    batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2862, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 213, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 725, in convert_to_tensors
    "Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

wandb: Waiting for W&B process to finish... (failed 1).
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/gpt2_dm/wandb/offline-run-20220324_174257-k8vydnze
wandb: Find logs at: ./wandb/offline-run-20220324_174257-k8vydnze/logs

I tried adding truncation=True and got the same error. I was also originally following the documentation here for dynamic padding with DataCollatorForLanguageModeling and got the same error.

Any thoughts about what I might be doing wrong? Thanks in advance! I’d be interested in using dynamic padding if possible.
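
For reference, here is roughly the dynamic-padding variant I tried (a sketch rather than the exact script; it reuses tokenizer, sample_ds, and block_size from above). The idea is to tokenize without padding, skip the fixed-length grouping, and let DataCollatorForLanguageModeling pad each batch to its longest example:

def tokenize_data(examples):
    # no padding here; truncate so no example exceeds the block size
    return tokenizer([" ".join(x) for x in examples['cleaned_spacy_stopped']],
                     truncation=True,
                     max_length=block_size)

tokenized_ds = sample_ds.map(tokenize_data,
                             batched=True,
                             num_proc=4,
                             remove_columns=sample_ds['train'].column_names)

# With mlm=False the collator pads each batch dynamically and builds the labels
# from input_ids, so no separate group_texts/labels step is needed
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)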

Any thoughts?

I’m experiencing the same thing.
Please let me know if you’ve solved it.