I am fine-tuning Whisper medium using this guide. [Important detail: the GPU is connected through Thunderbolt 4.] The data is Mozilla Common Voice 17 (approx. 9000 test, 4000 train). The GPU mostly stays idle, and the CPU (Core i7, 13th gen) runs at 10-30 percent all the time.
Does that mean I do not have enough CPU resources to feed the GPU?
Should I set something like dataloader_num_workers to 2 or 4, as suggested in this post?
Does a Thunderbolt bottleneck have something to do with it?
I don't know anything about Thunderbolt, but here are some ideas:
- The "medium" model is computationally intensive, so loading a batch should take far less time than computing on it. Data loading is therefore unlikely to be the problem.
- The dataset is quite small and the GPU is quite powerful, so for a reasonable batch size one epoch should finish relatively quickly.
- Thunderbolt 4 has a 40 Gbps theoretical limit, but you can easily get 2 GB/s transfers, which is more than enough for your case.
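To put a rough number on the bandwidth point above, here is a back-of-the-envelope sketch. It assumes each training sample is sent to the GPU as a float32 log-mel spectrogram of 80 mel bins by 3000 frames (the usual 30-second Whisper input for the medium model) and ignores the label tensors, which are tiny by comparison. The constant and function names are made up for illustration.

```python
# Rough estimate of host-to-GPU traffic per training step.
# Assumptions (not stated in the thread): float32 inputs of shape 80 x 3000
# per sample, labels ignored, and a sustained link speed of 2 GB/s.

N_MELS = 80                 # mel bins (assumed Whisper medium input)
N_FRAMES = 3000             # 30 s of audio at 10 ms per frame (assumed)
BYTES_PER_FLOAT32 = 4
LINK_BYTES_PER_SEC = 2e9    # the ~2 GB/s practical figure quoted above


def batch_bytes(batch_size: int) -> int:
    """Size in bytes of one batch of input features."""
    return batch_size * N_MELS * N_FRAMES * BYTES_PER_FLOAT32


def transfer_seconds(n_bytes: int, bytes_per_sec: float = LINK_BYTES_PER_SEC) -> float:
    """Time to move n_bytes over a link with the given sustained throughput."""
    return n_bytes / bytes_per_sec


if __name__ == "__main__":
    size = batch_bytes(16)  # per_device_train_batch_size=16
    print(f"batch size on the wire: {size / 1e6:.2f} MB")
    print(f"transfer time at 2 GB/s: {transfer_seconds(size) * 1e3:.2f} ms")
```

Under these assumptions a batch of 16 is about 15 MB and takes a few milliseconds to copy, which is small next to the time a forward and backward pass through a medium-sized model usually takes.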
I think we cannot deduce more without the following info:
- Which language is it (language code)? Which splits do you use? Default ones?
- What are your training parameters?
- Are you sure you are using the GPU version of PyTorch?
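One quick way to answer the last question is to ask PyTorch itself. The snippet below is a minimal sketch; `describe_torch_cuda` is a helper name invented here, while `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.get_device_name()` are regular PyTorch attributes and functions.

```python
# Report whether the installed PyTorch build can see a CUDA GPU.
def describe_torch_cuda() -> dict:
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,      # CPU-only wheels often end in "+cpu"
        "built_with_cuda": torch.version.cuda,   # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False or `built_with_cuda` is None, training is running on the CPU no matter what the GPU graphs show.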
Thanks for your reply.
The language is Urdu, and the training parameters are exactly as in the original guide. I am also pasting them here:
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-ur",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
I tried increasing the batch size to 32, and also tried raising gradient_accumulation_steps (for example to 4) to get a larger effective batch size, but progress stalled and the estimated time increased after every such attempt. Since I do not understand these settings, I kept them as they were. With these default settings, I have spent 7.5 hours to reach 93% of training so far (Whisper medium). My GPU occasionally shows a spike and that's it (the graph in Task Manager is mostly empty; this is Windows 11), 19.5 of 24 GB of VRAM is in use, and the CPU is constantly at 10-30% (mostly around 20%). The PyTorch version is "2.5.1+cu124" (I installed it by selecting CUDA 12.4 on the Start Locally page).
Edit: my dataset split:
from datasets import load_dataset, DatasetDict
common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="train+validation", trust_remote_code=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test", trust_remote_code=True)
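As a rough sanity check on the numbers in this post, the sketch below works out what the settings imply. It relies on the fact that `max_steps` counts optimizer updates, so with `max_steps` fixed, raising `per_device_train_batch_size` or `gradient_accumulation_steps` increases the number of samples processed per step and therefore the total work, which would explain the growing time estimate. `TRAIN_EXAMPLES` is a placeholder based on the approximate figure in the first post; the real value is `len(common_voice["train"])`. The function names are made up for illustration.

```python
# How max_steps, batch size and gradient accumulation interact.
# TRAIN_EXAMPLES is a hypothetical placeholder, not a measured value.

TRAIN_EXAMPLES = 4000   # replace with len(common_voice["train"])
MAX_STEPS = 5000        # from the training arguments above


def samples_per_step(per_device_batch: int, grad_accum: int, n_devices: int = 1) -> int:
    """Samples consumed by one optimizer update."""
    return per_device_batch * grad_accum * n_devices


def epochs_covered(max_steps: int, per_step: int, n_examples: int) -> float:
    """Number of passes over the training set implied by max_steps."""
    return max_steps * per_step / n_examples


def seconds_per_step(elapsed_hours: float, fraction_done: float, max_steps: int) -> float:
    """Average wall-clock time per step, evaluation pauses included."""
    return elapsed_hours * 3600 / (fraction_done * max_steps)


if __name__ == "__main__":
    for accum in (1, 4):
        per_step = samples_per_step(16, accum)
        epochs = epochs_covered(MAX_STEPS, per_step, TRAIN_EXAMPLES)
        print(f"grad_accum={accum}: {per_step} samples/step, about {epochs:.0f} epochs")
    print(f"average: {seconds_per_step(7.5, 0.93, MAX_STEPS):.1f} s/step")
```

With the placeholder dataset size, 5000 steps at batch size 16 is already about 20 passes over the data, and the reported 7.5 hours for 93% works out to a little under 6 seconds per step on average, including the evaluation runs every 1000 steps.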
Sorry for the late reply. I'm guessing you are looking at the general "utilization" graph.
- Win 11 Task Manager does not show CUDA usage by default.
- AFAIK the "utilization" figure does not take CUDA usage into account.
- By default it shows a summary view.
So that we are speaking of the same measure:
- Disable summary view: right-click the GPU on the left.
- Disable HW acceleration.
- Now you can select CUDA to see the actual utilization. IIRC the activity usually happens in CUDA, Copy 1 and Copy 2, so select them from the top left.
A better tool is nvidia-smi, of course...
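If you would rather log the numbers from a script while training runs, something like the following should work. It is a minimal sketch that assumes `nvidia-smi` is on the PATH; `query_gpus` and `parse_line` are helper names invented here, while `--query-gpu` and `--format=csv,noheader,nounits` are standard nvidia-smi options.

```python
import shutil
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def _to_float(text: str):
    """Convert a CSV field to float; return None for values such as '[N/A]'."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def parse_line(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    util, used, total = (_to_float(part) for part in line.split(","))
    return {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def query_gpus() -> list:
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [parse_line(line) for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for index, gpu in enumerate(query_gpus()):
        print(f"GPU {index}: {gpu}")
```

For a quick look without any script, running `nvidia-smi -l 1` in a terminal refreshes the standard report every second.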


