Hmm…?
In Transformers, num_items_in_batch is actually used only when the model’s loss path routes into the shared “sum then divide” cross-entropy helpers (the ones that do loss = loss / num_items_in_batch). That is the whole point of the kwarg. (GitHub)
So the correct answer is “it’s not a fixed list of model names.” It’s “models whose forward() forwards loss kwargs into the shared loss functions.”
What “utilize num_items_in_batch” means in practice
A model “utilizes” num_items_in_batch if all of this happens:
- Trainer.compute_loss(..., num_items_in_batch=...) is called (Trainer does this in recent versions). (Hugging Face)
- The model’s forward() passes **kwargs (or **loss_kwargs) into its loss function call (often self.loss_function(...)). (GitHub)
- The loss function called is one of the shared CE losses that divides by num_items_in_batch (e.g., ForCausalLMLoss, ForMaskedLMLoss). (GitHub)
If the second condition fails (the model accepts the kwarg but ignores it), you get the “accepts kwargs but doesn’t use it” scaling bug you described. For that case, the Trainer docs explicitly say to set self.model_accepts_loss_kwargs = False. (Hugging Face)
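To make the scaling bug concrete, here is a minimal pure-Python sketch of the arithmetic over one gradient accumulation window. The per-token losses are made up, and the "ignored" case assumes the caller skips its own per-step division because it believes the model already normalized the loss. That is my reading of the warning, not a copy of the Trainer code.

```python
# Per-token losses for two micro-batches in one accumulation window (made-up numbers).
micro_batches = [[2.0, 4.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
num_items_in_batch = sum(len(mb) for mb in micro_batches)  # 8 tokens in the whole window

# Target: the mean over every token in the window.
true_mean = sum(sum(mb) for mb in micro_batches) / num_items_in_batch

# Model uses the kwarg: each micro-batch returns sum / num_items_in_batch,
# and the accumulated losses are simply added.
used = sum(sum(mb) / num_items_in_batch for mb in micro_batches)

# Model ignores the kwarg: each micro-batch returns its own mean.
# Assumption: the caller does not divide by the number of steps either,
# so the per-step means are just added up.
ignored = sum(sum(mb) / len(mb) for mb in micro_batches)

print(true_mean, used, ignored)  # 1.5 1.5 4.0
```

The "used" path reproduces the true per-token mean, while the "ignored" path is inflated and also weights the short micro-batch as heavily as the long one.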
The “model types” that use it (reliable way to think about it)
The parameter is used in the shared cross-entropy implementation that switches:
- reduction="mean" when num_items_in_batch is absent
- reduction="sum" then loss /= num_items_in_batch when present (GitHub)
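A rough pure-Python sketch of that switch, for intuition only. The name toy_fixed_cross_entropy is hypothetical and this is not the library code; the real helper works on tensors, and -100 is assumed here as the ignored label value.

```python
import math

IGNORE_INDEX = -100  # assumed label value that the loss skips


def token_nll(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def toy_fixed_cross_entropy(logits_rows, labels, num_items_in_batch=None):
    """Mean over kept tokens if the count is absent, else sum divided by the count."""
    losses = [token_nll(row, y) for row, y in zip(logits_rows, labels) if y != IGNORE_INDEX]
    if num_items_in_batch is None:
        return sum(losses) / len(losses)      # reduction="mean"
    return sum(losses) / num_items_in_batch   # reduction="sum", then divide


logits_rows = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
labels = [0, 2, IGNORE_INDEX]  # the third token is ignored

mean_loss = toy_fixed_cross_entropy(logits_rows, labels)
local_count_loss = toy_fixed_cross_entropy(logits_rows, labels, num_items_in_batch=2)
global_count_loss = toy_fixed_cross_entropy(logits_rows, labels, num_items_in_batch=8)
```

With num_items_in_batch equal to the local count of kept tokens (2) the result matches the plain mean. With a larger global count (8) the same batch contributes a proportionally smaller loss, which is what makes summing across accumulation steps come out right.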
That logic is used by:
- Causal LM loss (ForCausalLMLoss) (GitHub)
- Masked LM loss (ForMaskedLMLoss) (GitHub)
- Other CE-based heads in the same loss utility module (token classification, QA start/end, some sequence classification branches) are designed to pass **kwargs through the same helper, so they can use it when it’s plumbed through. (GitHub)
Concrete examples: model families in transformers that (today) forward loss kwargs
These are examples confirmed by current repo search snippets showing the “loss = self.loss_function(..., **kwargs)” pattern. That pattern is what allows num_items_in_batch to reach the CE loss and affect scaling.
Causal LM models (AutoModelForCausalLM style)
Examples that call self.loss_function(..., **kwargs) in their modeling code:
- Llama (modeling_llama.py) (GitHub)
- Llama 4 (modeling_llama4.py) (GitHub)
- Qwen2 (modeling_qwen2.py) (GitHub)
- Qwen3 (modeling_qwen3.py) (GitHub)
- MPT (modeling_mpt.py) (GitHub)
- GPT-NeoX (modeling_gpt_neox.py) (GitHub)
- GPT-2 (modeling_gpt2.py) (GitHub)
- CTRL (modeling_ctrl.py) (GitHub)
- OpenAI GPT (modeling_openai.py) (GitHub)
There is also a concrete runtime trace showing that Qwen2’s forward goes into ForCausalLMLoss, which divides by num_items_in_batch, so it does use the kwarg rather than just accepting it. (GitHub)
Multimodal models with a text causal LM head (same idea)
These also use the same loss-function forwarding pattern:
- Qwen2-VL (modeling_qwen2_vl.py) (GitHub)
- Qwen2.5-VL (modeling_qwen2_5_vl.py) (GitHub)
- LLaVA OneVision (modeling_llava_onevision.py) (GitHub)
If their text head is trained with labels through ForCausalLMLoss, they will use num_items_in_batch the same way.
Masked LM example
- ModernBERT shows the same “loss_function + kwargs” pattern for MLM-style heads. (GitHub)
And the Masked LM loss helper explicitly routes into fixed_cross_entropy(..., num_items_in_batch, ...). (GitHub)
Why you keep seeing “device mismatch” bugs around this
If the loss divides by num_items_in_batch, then num_items_in_batch must be on the same device as the loss tensor. Real reports show failures when the model is sharded across GPUs and num_items_in_batch stays on a different device. (GitHub)
These bugs are strong evidence that the parameter is being used (not ignored), because otherwise the divide would not happen.
The fastest way to know for any specific model (works even for custom forks)
Do one of these:
Static check (code)
Search the model’s forward() for something like:
loss = self.loss_function(..., **kwargs) or **loss_kwargs
and then confirm the loss function ultimately calls the CE helper that divides by num_items_in_batch. (GitHub)
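If you want to script the first half of that static check, a rough sketch is below. The helper name forwards_loss_kwargs and the two sample snippets are made up for illustration. For a real model you could feed it the text from inspect.getsource(type(model).forward). It only detects the call pattern, so you still have to confirm which loss function ends up being called.

```python
import re

# Matches `self.loss_function(... **kwargs` or `... **loss_kwargs` inside the
# same call, tolerating one level of nested parentheses in the arguments.
LOSS_CALL = re.compile(
    r"self\.loss_function\((?:[^()]|\([^()]*\))*\*\*(?:loss_)?kwargs\b"
)


def forwards_loss_kwargs(source: str) -> bool:
    """Return True if the source text passes **kwargs into self.loss_function."""
    return LOSS_CALL.search(source) is not None


# Hypothetical forward() excerpts, not copied from any real model file.
FORWARDING = """
if labels is not None:
    loss = self.loss_function(logits=logits, labels=labels,
                              vocab_size=self.config.vocab_size, **kwargs)
"""

IGNORING = """
if labels is not None:
    loss = self.loss_function(logits, labels, self.config.vocab_size)
outputs = self.model(input_ids, **kwargs)
"""

print(forwards_loss_kwargs(FORWARDING), forwards_loss_kwargs(IGNORING))  # True False
```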
Behavioral check (no code reading)
Call forward twice with the same inputs and labels, but change only num_items_in_batch.
If the model uses it for normalization, loss should scale roughly inversely.
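A minimal sketch of that behavioral check, written against a plain callable so it runs without any model. The helper name and the two toy loss functions are hypothetical. With a real model, loss_fn would wrap a forward call with fixed inputs and labels (dropout off) and return the loss as a float.

```python
def scales_with_num_items(loss_fn, n_small=1, n_large=4, rel_tol=1e-3):
    """Return True if loss_fn(n) is inversely proportional to n."""
    small = loss_fn(n_small)
    large = loss_fn(n_large)
    expected = small * n_small / n_large  # what inverse scaling predicts
    return abs(large - expected) <= rel_tol * abs(expected)


TOKEN_LOSSES = [0.7, 1.3, 2.0]  # made-up per-token losses


def forwarding_loss(num_items_in_batch):
    # Toy model that uses the kwarg: sum, then divide by the given count.
    return sum(TOKEN_LOSSES) / num_items_in_batch


def ignoring_loss(num_items_in_batch):
    # Toy model that accepts the kwarg but always returns the local mean.
    return sum(TOKEN_LOSSES) / len(TOKEN_LOSSES)


print(scales_with_num_items(forwarding_loss))  # True
print(scales_with_num_items(ignoring_loss))    # False
```

If the check returns False for a model that accepts **kwargs, that is the case where model_accepts_loss_kwargs should be set to False.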
Good threads and issues to read (context + pitfalls)
- Trainer docs warning about model_accepts_loss_kwargs and incorrect scaling if num_items_in_batch is not used. (Hugging Face)
- “CausalLM loss function throws runtime error in multi-gpu” includes the key trace showing ForCausalLMLoss dividing by num_items_in_batch and discusses why the new loss plumbing appeared (v4.46.0 era). (GitHub)
- Device placement fix request for num_items_in_batch in ForCausalLMLoss. (GitHub)
- Masked LM loss discussion showing fixed_cross_entropy(..., num_items_in_batch, ...). (GitHub)
- Original “gradient accumulation should use sum loss not mean” design issue (historical motivation). (GitHub)
Summary
- num_items_in_batch is utilized when the model’s loss path reaches shared CE helpers that do loss /= num_items_in_batch. (GitHub)
- Many *ForCausalLM models in Transformers forward **kwargs into self.loss_function, so they can use it. Examples include Llama, Qwen2, Qwen3, MPT, GPT-NeoX, GPT-2, CTRL, OpenAI GPT. (GitHub)
- If a model accepts kwargs but ignores num_items_in_batch, Trainer warns you must set model_accepts_loss_kwargs=False to avoid bad scaling. (Hugging Face)