Whisper medium fine-tuning: RTX 4090 mostly stays idle

I am fine-tuning Whisper medium using this guide. [Important detail: the GPU is connected through Thunderbolt 4.] The data is Mozilla Common Voice 17 (approx. 9000 test, 4000 train). The GPU mostly stays idle and the CPU (Core i7, 13th gen) runs at 10-30 percent the whole time.
Does that mean I do not have enough CPU resources to feed the GPU?
Should I set something like dataloader_num_workers to 2 or 4, as suggested in this post?
Could a Thunderbolt bottleneck have something to do with it?

I don’t know anything about thunderbolt, but here are some ideas:

  • The “medium” model is compute-intensive, so each batch loads quickly compared with the time the GPU spends computing on it. Data loading should not be the bottleneck.
  • The dataset is quite small and the GPU is quite powerful, so for a reasonable batch size one epoch should finish relatively quickly.
  • Thunderbolt 4 has a 40 Gbps theoretical limit, but you can easily get 2 GB/s transfers, which is more than enough for your case.
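
To put the Thunderbolt point in numbers, here is a rough back-of-the-envelope sketch. It assumes each Whisper medium input is an 80 x 3000 log-mel spectrogram (one 30-second clip) stored as float32, and it takes the 2 GB/s figure from above. The helper name batch_transfer_ms is made up for illustration, and the estimate ignores labels, latency and protocol overhead.

```python
# Rough estimate of how long one training batch takes to cross the link.
# Assumptions (not measured): 80 mel bins x 3000 frames per sample, float32.

N_MELS = 80
N_FRAMES = 3000
BYTES_PER_FLOAT32 = 4


def batch_transfer_ms(batch_size: int, link_bytes_per_s: float) -> float:
    """Milliseconds needed to copy one batch of input features over the link."""
    bytes_per_sample = N_MELS * N_FRAMES * BYTES_PER_FLOAT32
    return batch_size * bytes_per_sample / link_bytes_per_s * 1000


if __name__ == "__main__":
    ms = batch_transfer_ms(batch_size=16, link_bytes_per_s=2e9)
    print(f"batch of 16: about {ms:.1f} ms at 2 GB/s")
```

Under these assumptions a batch of 16 needs roughly 7.7 ms on the wire, which is small next to a training step that takes seconds.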

I think we cannot deduce more without the following info:

  • Which language is it (language code)? Which splits do you use? Default ones?
  • What are your training parameters?
  • Are you sure you are using the GPU (CUDA) build of PyTorch?
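
For the last question, a quick check along these lines should be enough. It is only a sketch: the function name describe_torch_cuda is hypothetical, and it reports whatever the installed build says rather than proving that training actually runs on the GPU.

```python
# Report whether the installed PyTorch build can see a CUDA device.

def describe_torch_cuda() -> dict:
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,   # GPU builds usually end in "+cuXXX"
        "cuda_build": torch.version.cuda,      # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If cuda_available comes back False or cuda_build is None, the training is running on the CPU.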

Thanks for your reply.
The language is Urdu, and the training parameters are exactly as in the original guide. Also pasting here:

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-ur",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

I tried increasing the batch size to 32, and also tried raising gradient_accumulation_steps (for example, to 4) to increase the effective batch size, but progress stalled and the estimated time went up after every such attempt. Since I do not understand these settings, I kept them as they were. With these default settings, it has taken 7.5 hours to reach 93% of training so far (Whisper medium). My GPU occasionally shows a spike and that’s it (the graph in Task Manager is mostly empty; this is Windows 11), 19.5 of 24 GB VRAM is in use, and the CPU is constantly at 10-30% (mostly around 20%). The PyTorch version is ‘2.5.1+cu124’ (I installed it by selecting CUDA 12.4 on the Start Locally page).
Edit: my dataset split:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="train+validation", trust_remote_code=True)

common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test", trust_remote_code=True)
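
On the observation that the estimated time went up with a larger batch size or more gradient accumulation: in the Hugging Face Trainer, max_steps counts optimizer updates, so with max_steps fixed at 5000 the total number of samples processed grows with the effective batch size. The sketch below works through that arithmetic. The function names are made up for illustration, and the dataset size of 4000 is only the approximate train figure quoted earlier (the real train+validation split is larger).

```python
# How much data a run processes when max_steps is fixed.

def samples_processed(per_device_batch: int, grad_accum: int,
                      max_steps: int, n_devices: int = 1) -> int:
    """Total samples seen: one optimizer step consumes batch * accumulation samples."""
    return per_device_batch * grad_accum * n_devices * max_steps


def epochs_covered(total_samples: int, dataset_size: int) -> float:
    """Number of passes over a dataset of the given size."""
    return total_samples / dataset_size


ASSUMED_TRAIN_SIZE = 4000  # assumption: rough figure from the post, not measured
MAX_STEPS = 5000

if __name__ == "__main__":
    for batch, accum in [(16, 1), (32, 1), (16, 4)]:
        total = samples_processed(batch, accum, MAX_STEPS)
        epochs = epochs_covered(total, ASSUMED_TRAIN_SIZE)
        print(f"batch={batch} accum={accum}: {total} samples, ~{epochs:.0f} epochs")

    # Average wall time per optimizer step so far (includes evaluation pauses).
    seconds_per_step = 7.5 * 3600 / (0.93 * MAX_STEPS)
    print(f"about {seconds_per_step:.1f} s per step")
```

With the pasted settings this comes to 80,000 samples; a batch size of 32 doubles that and gradient_accumulation_steps=4 quadruples it, so a longer estimate is expected unless max_steps is reduced in proportion.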

Sorry for the late reply. I’m guessing you are looking at the general “utilization” graph.

  1. Win 11 Task Manager does not show CUDA usage by default.
  2. AFAIK the “utilization” figure does not take CUDA usage into account.
  3. By default it shows the summary view.

So that we are speaking of the same measure:

  1. Disable summary view: Right click GPU on the left.

  2. Disable HW acceleration.

  3. Now you can select CUDA to see actual utilization. IIRC the activity usually shows up under Cuda, Copy 1 and Copy 2, so select them from the top left.

A better tool is nvidia-smi of course…
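
As a sketch of the nvidia-smi route, the snippet below polls utilization and memory through its CSV query mode. The function names are hypothetical, and the parser assumes every field is a plain integer, which may not hold on all setups (some fields can come back as "[N/A]").

```python
import shutil
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_smi_line(line: str) -> dict:
    """Parse one CSV line such as '97, 19968, 24564' (percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"gpu_util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def query_gpus():
    """Return one dict per GPU, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return [parse_smi_line(l) for l in result.stdout.splitlines() if l.strip()]
    except (subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    print(query_gpus())
```

Running this a few times while training is in progress gives a utilization figure that does not depend on which Task Manager graph is selected.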

Thanks. That’s what is happening. This is the CUDA utilization while I am running the training: