Resuming training BERT from scratch with run_mlm.py

striki-ai · March 16, 2021, 9:11am

I started training BERT from scratch with run_mlm.py as follows:

python run_mlm.py --model_type bert \
  --train_file ./data/mk.txt --output_dir ./models/bert-base-uncased \
  --overwrite_output_dir --tokenizer_name ./models/bert-base-uncased \
  --line_by_line True --do_train \
  --per_device_train_batch_size 4 --num_train_epochs 100 \
  --save_steps 100000 --save_total_limit 500 \
  --max_seq_length 512 --logging_steps 500 \
  --use_fast_tokenizer --report_to wandb \
  --disable_tqdm True

Training stopped because of a power outage. The latest saved checkpoint is:
.\models\bert-base-uncased\checkpoint-1700000

Given the initial command above, what is the most appropriate command to resume training from the last saved checkpoint while preserving all of the parameters listed?

valhalla · March 16, 2021, 9:56am

Hi @striki-ai,

If you remove the --overwrite_output_dir option and run the same command again, the script will detect the last checkpoint in --output_dir and resume training from there.
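To make "detect the last checkpoint" concrete: the Trainer saves checkpoints as `checkpoint-<step>` subdirectories of `--output_dir` (like `checkpoint-1700000` above), and inside run_mlm.py the choice is made by `get_last_checkpoint` from `transformers.trainer_utils`, which picks the one with the highest step. The shell function below, `last_checkpoint`, is a hypothetical helper and not part of transformers. It is only a sketch of that selection rule, assuming the `checkpoint-<step>` naming, so you can check which checkpoint a rerun would pick up before launching it.

```shell
# Hypothetical helper (not part of transformers): print the checkpoint-<step>
# subdirectory of an output dir with the largest integer step, or return 1
# if there is none. It mimics the selection rule only, not the library code.
last_checkpoint() {
  local out_dir="$1" best="" best_step=-1 d step
  for d in "$out_dir"/checkpoint-*; do
    # Skip non-directories (and the literal pattern when nothing matches).
    [ -d "$d" ] || continue
    step="${d##*/checkpoint-}"
    # Accept only names whose suffix is a plain integer, e.g. checkpoint-1700000.
    case "$step" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$step" -gt "$best_step" ]; then
      best_step="$step"
      best="$d"
    fi
  done
  if [ -z "$best" ]; then
    return 1
  fi
  printf '%s\n' "$best"
}
```

For example, `last_checkpoint ./models/bert-base-uncased` should print the `checkpoint-1700000` directory if that is the highest step present; rerunning the original command without `--overwrite_output_dir` then resumes from it. Newer releases of `TrainingArguments` also accept an explicit `--resume_from_checkpoint <path>`, but check that your installed version supports it.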

jbmaxwell · October 31, 2021, 10:20pm

Related to this question, I’ve been trying to continue training, but with a new/lower learning rate. How do I do that?
