model_accepts_loss_kwargs detection based on **kwargs is too permissive

In trainer.py, the model_accepts_loss_kwargs flag is automatically set to True if model.forward includes **kwargs, unless the model defines accepts_loss_kwargs itself. However, **kwargs might be used for reasons unrelated to loss kwargs (e.g., for flexibility or additional non-loss parameters).

This behavior may lead to unintended side effects.

        forward_params = inspect.signature(model_forward).parameters

        # Check if the model has explicit setup for loss kwargs,
        # if not, check if `**kwargs` are in model.forward
        if hasattr(model, "accepts_loss_kwargs"):
            self.model_accepts_loss_kwargs = model.accepts_loss_kwargs
        else:
            self.model_accepts_loss_kwargs = any(
                k.kind == inspect.Parameter.VAR_KEYWORD for k in forward_params.values()
            )
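
To see how permissive this is, here is a minimal sketch that reproduces the same check outside Trainer. The three toy classes and the detect_like_trainer helper are hypothetical names made up for illustration; only the inspect calls mirror the snippet above.

```python
import inspect


class FlexibleModel:
    # **kwargs exists only for unrelated options; nothing is forwarded to a loss.
    def forward(self, input_ids, labels=None, **kwargs):
        return None


class OptOutModel(FlexibleModel):
    # Explicit attribute: the check never looks at the signature.
    accepts_loss_kwargs = False


class ExplicitModel:
    # Names the parameter explicitly and has no **kwargs.
    def forward(self, input_ids, labels=None, num_items_in_batch=None):
        return None


def detect_like_trainer(model):
    """Hypothetical helper that mirrors the detection logic quoted above."""
    if hasattr(model, "accepts_loss_kwargs"):
        return model.accepts_loss_kwargs
    forward_params = inspect.signature(model.forward).parameters
    return any(
        p.kind == inspect.Parameter.VAR_KEYWORD for p in forward_params.values()
    )


print(detect_like_trainer(FlexibleModel()))  # True, although kwargs are unrelated to loss
print(detect_like_trainer(OptOutModel()))    # False, because of the explicit attribute
print(detect_like_trainer(ExplicitModel()))  # False, no **kwargs in the signature
```

The first case is the one this issue is about: the flag becomes True purely because of the signature shape.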

I agree with your perspective. I believe we need to determine whether the model uses the num_items_in_batch parameter when calculating loss to establish the value of model_accepts_loss_kwargs. Otherwise, it could easily lead to incorrect loss scaling. By the way, do you know which models utilize the num_items_in_batch parameter?

Hmm…?


In Transformers, num_items_in_batch is actually used only when the model’s loss path routes into the shared “sum then divide” cross-entropy helpers (the ones that do loss = loss / num_items_in_batch). That is the whole point of the kwarg. (GitHub)

So the correct answer is “it’s not a fixed list of model names.” It’s “models whose forward() forwards loss kwargs into the shared loss functions.”

What “utilize num_items_in_batch” means in practice

A model “utilizes” num_items_in_batch if all of this happens:

  1. Trainer.compute_loss(..., num_items_in_batch=...) is called (Trainer does this in recent versions). (Hugging Face)
  2. The model’s forward() passes **kwargs (or **loss_kwargs) into its loss function call (often self.loss_function(...)). (GitHub)
  3. The loss function called is one of the shared CE losses that divides by num_items_in_batch (e.g., ForCausalLMLoss, ForMaskedLMLoss). (GitHub)

If step 2 fails (model ignores the kwarg) you get the “accepts kwargs but doesn’t use it” scaling bug you described. Trainer explicitly warns to set self.model_accepts_loss_kwargs = False in that case. (Hugging Face)
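
A small worked example of that scaling bug, using plain arithmetic instead of a real model. The per-token losses are made-up numbers, and the sketch assumes that Trainer skips its own division by the number of gradient accumulation steps when the flag is True and applies it when the flag is False.

```python
# Per-token losses for two micro-batches of unequal length (made-up numbers).
micro_batches = [[1.0, 1.0, 1.0], [4.0]]
grad_accum_steps = len(micro_batches)
num_items_in_batch = sum(len(mb) for mb in micro_batches)  # 4 tokens in total

# Reference value: mean over all tokens in the accumulated batch.
target = sum(sum(mb) for mb in micro_batches) / num_items_in_batch

# Model uses the kwarg: each step returns sum / num_items_in_batch,
# and the step losses are simply added up.
uses_kwarg = sum(sum(mb) / num_items_in_batch for mb in micro_batches)

# Model ignores the kwarg and returns a per-micro-batch mean while the flag
# is True. ASSUMPTION: Trainer then does not divide by grad_accum_steps.
ignored_flag_true = sum(sum(mb) / len(mb) for mb in micro_batches)

# Same model with the flag set to False. ASSUMPTION: Trainer divides by
# grad_accum_steps in this case.
ignored_flag_false = ignored_flag_true / grad_accum_steps

print(target, uses_kwarg, ignored_flag_true, ignored_flag_false)
```

Under these assumptions, ignoring the kwarg with the flag left at True inflates the accumulated loss by a factor equal to the number of accumulation steps. Setting the flag to False falls back to the mean of per-step means, which is not token-exact but is on the right scale.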

The “model types” that use it (reliable way to think about it)

The parameter is used in the shared cross-entropy implementation that switches:

  • reduction="mean" when num_items_in_batch is absent
  • reduction="sum" then loss /= num_items_in_batch when present (GitHub)
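
The switch described above can be sketched in pure Python. This is not the library's code: cross_entropy_sketch is a hypothetical stand-in written with the standard library only, and it assumes at least one target is not ignored.

```python
import math


def cross_entropy_sketch(logits, targets, num_items_in_batch=None, ignore_index=-100):
    """Illustrative stand-in for a 'sum then divide' cross-entropy helper."""
    token_losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored positions contribute nothing
        # Numerically stable log-sum-exp over the row of logits.
        top = max(row)
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        token_losses.append(log_z - row[target])

    if num_items_in_batch is None:
        # Equivalent of reduction="mean" over the non-ignored tokens.
        return sum(token_losses) / len(token_losses)
    # Equivalent of reduction="sum" followed by loss /= num_items_in_batch.
    return sum(token_losses) / num_items_in_batch


logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
targets = [0, 1, -100]  # the last position is ignored

mean_loss = cross_entropy_sketch(logits, targets)
same_as_mean = cross_entropy_sketch(logits, targets, num_items_in_batch=2)
half_of_mean = cross_entropy_sketch(logits, targets, num_items_in_batch=4)
```

When num_items_in_batch equals the number of non-ignored tokens in the call, both branches agree. The two only diverge when the count covers more tokens than the current micro-batch, which is exactly the gradient accumulation case.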

That logic is used by:

  • Causal LM loss (ForCausalLMLoss) (GitHub)
  • Masked LM loss (ForMaskedLMLoss) (GitHub)
  • Other CE-based heads in the same loss utility module (token classification, QA start/end, some sequence classification branches) are designed to pass **kwargs through the same helper, so they can use it when it’s plumbed through. (GitHub)

Concrete examples: model families in transformers that (today) forward loss kwargs

These are examples confirmed by current repo search snippets showing the “loss = self.loss_function(..., **kwargs)” pattern. That pattern is what allows num_items_in_batch to reach the CE loss and affect scaling.

Causal LM models (AutoModelForCausalLM style)

Examples that call self.loss_function(..., **kwargs) in their modeling code:

  • Llama (modeling_llama.py) (GitHub)
  • Llama 4 (modeling_llama4.py) (GitHub)
  • Qwen2 (modeling_qwen2.py) (GitHub)
  • Qwen3 (modeling_qwen3.py) (GitHub)
  • MPT (modeling_mpt.py) (GitHub)
  • GPT-NeoX (modeling_gpt_neox.py) (GitHub)
  • GPT-2 (modeling_gpt2.py) (GitHub)
  • CTRL (modeling_ctrl.py) (GitHub)
  • OpenAI GPT (modeling_openai.py) (GitHub)

There is also a reported runtime trace showing Qwen2’s forward going into ForCausalLMLoss, which divides by num_items_in_batch, so Qwen2 does use the parameter. (GitHub)

Multimodal models with a text causal LM head (same idea)

These also use the same loss-function forwarding pattern:

  • Qwen2-VL (modeling_qwen2_vl.py) (GitHub)
  • Qwen2.5-VL (modeling_qwen2_5_vl.py) (GitHub)
  • LLaVA OneVision (modeling_llava_onevision.py) (GitHub)

If their text head is trained with labels through ForCausalLMLoss, they will use num_items_in_batch the same way.

Masked LM example

  • ModernBERT shows the same “loss_function + kwargs” pattern for MLM-style heads. (GitHub)
    And the Masked LM loss helper explicitly routes into fixed_cross_entropy(..., num_items_in_batch, ...). (GitHub)

Why you keep seeing “device mismatch” bugs around this

If the loss divides by num_items_in_batch, then num_items_in_batch must be on the same device as the loss tensor. Real reports show failures when the model is sharded across GPUs and num_items_in_batch stays on a different device. (GitHub)

These bugs are strong evidence that the parameter is being used (not ignored), because otherwise the divide would not happen.

The fastest way to know for any specific model (works even for custom forks)

Do one of these:

Static check (code)

Search the model’s forward() for something like:

  • loss = self.loss_function(..., **kwargs) or **loss_kwargs
    and then confirm the loss function ultimately calls the CE helper that divides by num_items_in_batch. (GitHub)

Behavioral check (no code reading)

Call forward twice with the same inputs and labels, but change only num_items_in_batch.
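If the model uses it for normalization, the loss should scale inversely with it: doubling num_items_in_batch should halve the loss. If the loss does not change, the model ignores the parameter.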
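
A minimal sketch of this behavioral check. The helper uses_num_items_in_batch and both toy "models" are hypothetical; each toy is a plain function that returns a float loss for a fixed batch, and the check assumes the loss is non-zero.

```python
import math

TOKEN_LOSSES = [0.5, 1.5, 2.0]  # made-up per-token losses for a fixed batch


def normalizing_model(num_items_in_batch=None, **kwargs):
    # Uses the kwarg: sum, then divide by num_items_in_batch.
    if num_items_in_batch is None:
        return sum(TOKEN_LOSSES) / len(TOKEN_LOSSES)
    return sum(TOKEN_LOSSES) / num_items_in_batch


def ignoring_model(**kwargs):
    # Accepts **kwargs but always returns the plain mean.
    return sum(TOKEN_LOSSES) / len(TOKEN_LOSSES)


def uses_num_items_in_batch(loss_fn, n=8, rel_tol=1e-6):
    """loss_fn(n) must return the loss for the same batch with num_items_in_batch=n."""
    loss_n = loss_fn(n)
    loss_2n = loss_fn(2 * n)
    # Doubling the divisor should halve the loss if the kwarg is honored.
    return math.isclose(loss_2n, loss_n / 2, rel_tol=rel_tol)


print(uses_num_items_in_batch(lambda n: normalizing_model(num_items_in_batch=n)))  # True
print(uses_num_items_in_batch(lambda n: ignoring_model(num_items_in_batch=n)))     # False
```

With a real model, loss_fn would presumably wrap something like model(**batch, num_items_in_batch=n).loss.item(), ideally in eval mode so that dropout does not add noise between the two calls.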

Good threads and issues to read (context + pitfalls)

  • Trainer docs warning about model_accepts_loss_kwargs and incorrect scaling if num_items_in_batch is not used. (Hugging Face)
  • “CausalLM loss function throws runtime error in multi-gpu” includes the key trace showing ForCausalLMLoss dividing by num_items_in_batch and discusses why the new loss plumbing appeared (v4.46.0 era). (GitHub)
  • Device placement fix request for num_items_in_batch in ForCausalLMLoss. (GitHub)
  • Masked LM loss discussion showing fixed_cross_entropy(..., num_items_in_batch, ...). (GitHub)
  • Original “gradient accumulation should use sum loss not mean” design issue (historical motivation). (GitHub)

Summary

  • num_items_in_batch is utilized when the model’s loss path reaches shared CE helpers that do loss /= num_items_in_batch. (GitHub)
  • Many *ForCausalLM models in Transformers forward **kwargs into self.loss_function, so they can use it. Examples include Llama, Qwen2, Qwen3, MPT, GPT-NeoX, GPT-2, CTRL, OpenAI GPT. (GitHub)
  • If a model accepts kwargs but ignores num_items_in_batch, Trainer warns you must set model_accepts_loss_kwargs=False to avoid bad scaling. (Hugging Face)