Looking for guidance. Trying to create a model with TrOCR's encoder + Google's mT5 multilingual decoder but model fails to overfit on a single data sample

John6666 · March 26, 2026, 12:42pm

Hmm…

This is fixable enough to keep exploring, but the main problem is probably not the tokenizer itself. The bigger problem is that your current experiment combines a TrOCR encoder that was fine-tuned for English single-line handwriting, a custom mT5-as-decoder-only wiring path, and a difficult Hindi OCR target. That is a fragile combination. Hugging Face’s encoder-decoder docs explicitly warn that when you combine a pretrained encoder and a different decoder, the cross-attention layers may be randomly initialized and must be learned during fine-tuning. They also show that the supported decoder path is usually a decoder model configured for cross-attention, not a full seq2seq model hacked into decoder-only use. (Hugging Face)

The most important conclusion

I do not think your experiment proves that “TrOCR encoder + Hindi-capable decoder cannot work.” I think it proves that your current wiring and training regime are too unstable to make that judgment. The fact that loss drops at all means the image path, label path, and cross-modal connection are at least partially alive. The repeated characters point more toward autoregressive instability than “complete failure.” Repetition is also a known failure mode in TrOCR-style generation, especially when decoder setup or generation config is off. (Hugging Face)

What is going wrong in your current Colab

From the code you shared, these are the biggest issues.

1. Your notebook says `mt5-small`, but the code loads `mt5-base`

That is not a cosmetic detail. mt5-base is materially larger (roughly 580M parameters versus roughly 300M for mt5-small) and harder to stabilize. For a one-sample overfit test, you want the smallest model that can still express the task. Using a larger multilingual decoder makes the bridge-learning problem harder, not easier.

2. You are starting from `trocr-base-handwritten`, which is already specialized

The public model card says microsoft/trocr-base-handwritten is a TrOCR model fine-tuned on the IAM dataset. The updated README also says it works best on single-line handwritten English text and is not optimized for printed text or multi-line inputs. For a language swap, trocr-base-stage1 or trocr-small-stage1 is usually a cleaner starting point because those are the pre-trained only checkpoints rather than the already English-finetuned handwritten checkpoint. (Hugging Face)

3. The mT5 wiring path is custom, and that matters

You are not using the standard VisionEncoderDecoderModel.from_encoder_decoder_pretrained(...) path. Instead, you replace the mT5 encoder with a dummy module and feed encoder_outputs directly into MT5ForConditionalGeneration. That can work, but public Hugging Face issue history shows that using T5 or ByT5 as decoder-only for OCR is still a custom workaround path, not the most standard one. There is a dedicated issue where a user had to create a T5DecoderOnlyForCausalLM subclass for this exact reason. (GitHub)

4. Your one-sample overfit test is not a clean overfit test

In your code, the one-sample test uses:

full trainable model
AdamW(lr=1e-3)
only 150 steps
beam search during evaluation

That is too aggressive and too noisy. T5-family docs say that with AdamW, values around 1e-4 to 3e-4 typically work well, and they note that T5 was pretrained with Adafactor. Also, for T5 and mT5, the correct decoder start behavior is to use pad_token_id. (Hugging Face)

So your current test is mixing three confounders:

LR is likely too high for this hybrid.
The decoder is larger than needed.
Beam search is a poor judge of early training quality.
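
To make those three confounders concrete, here is a minimal sketch that flags them from a settings object. `OverfitConfig` and `find_confounders` are hypothetical names, and `num_beams=4` for the original run is an assumption, since the exact beam count is not stated above. The 3e-4 ceiling is the upper end of the AdamW range quoted from the T5 docs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverfitConfig:
    """Settings for a one-sample overfit test (hypothetical container)."""
    decoder_checkpoint: str
    lr: float
    steps: int
    num_beams: int


def find_confounders(cfg: OverfitConfig) -> list[str]:
    """Return a human-readable list of settings that muddy the diagnosis."""
    issues = []
    if cfg.lr > 3e-4:
        issues.append("lr above the 1e-4 to 3e-4 range quoted for T5 with AdamW")
    if not cfg.decoder_checkpoint.endswith("-small"):
        issues.append("decoder is larger than the smallest available variant")
    if cfg.num_beams > 1:
        issues.append("beam search used for evaluation instead of greedy decoding")
    return issues


# num_beams=4 is an assumed value for the original notebook.
original = OverfitConfig("google/mt5-base", lr=1e-3, steps=150, num_beams=4)
rebuilt = OverfitConfig("google/mt5-small", lr=1e-4, steps=1000, num_beams=1)

for name, cfg in (("original", original), ("rebuilt", rebuilt)):
    print(name, find_confounders(cfg))
```

Keeping the checkpoint name in one place like this also guards against the notebook text and the code drifting apart.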

5. Decoder masking is too implicit

The official encoder-decoder implementation uses shift-right logic to build decoder inputs from labels. There is also a recent Transformers issue pointing out that in VisionEncoderDecoderModel, users observed that decoder_attention_mask was not always created the way they expected when labels were shifted into decoder inputs. In a custom hybrid like yours, I would not leave this implicit. I would create decoder_input_ids and decoder_attention_mask explicitly. (GitHub)
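
A toy illustration of why I would not leave this implicit, in plain Python lists rather than tensors. The token ids 259 and 1042 are made up; pad id 0 and EOS id 1 match the mT5 tokenizer. The point is that mT5 starts decoding from the pad token, so a mask derived from "not equal to pad" hides the start position, while a mask derived from the labels does not. This sketch assumes right-padded labels.

```python
IGNORE_INDEX = -100  # label value ignored by the loss
PAD_ID = 0           # mT5 pad id, also used as the decoder start token


def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [start_id] + list(labels[:-1])
    # Ignored label positions become ordinary padding in the decoder input.
    return [pad_id if tok == IGNORE_INDEX else tok for tok in shifted]


def mask_from_ids(decoder_input_ids, pad_id=PAD_ID):
    """Naive mask: 1 wherever the decoder input is not the pad id."""
    return [int(tok != pad_id) for tok in decoder_input_ids]


def mask_from_labels(labels):
    """Mask that keeps every position that has a real target."""
    return [int(tok != IGNORE_INDEX) for tok in labels]


labels = [259, 1042, 1, IGNORE_INDEX, IGNORE_INDEX]  # two tokens, </s>, padding
decoder_input_ids = shift_right(labels)

print(decoder_input_ids)
print(mask_from_ids(decoder_input_ids))  # position 0 (start token) gets masked
print(mask_from_labels(labels))          # position 0 stays visible
```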

My recommendation for your current setup

Keep the overall idea for now, but simplify the experiment hard.

Recommended first rebuild

Use:

microsoft/trocr-small-stage1 or microsoft/trocr-base-stage1
google/mt5-small
explicit decoder inputs and decoder attention mask
frozen encoder at first
greedy decoding
lower LR
longer one-sample training

Why this version first:

stage1 is a cleaner visual warm start than the English IAM handwritten checkpoint for a decoder swap. (Hugging Face)
mt5-small is easier to stabilize than mt5-base.
mT5 already supports Hindi tokenization and uses pad_token_id as the decoder start token, so the tokenizer is not the core blocker. (Hugging Face)

Concrete changes I would make

A. Change the checkpoints

Use:

from transformers import (
    AutoTokenizer,
    MT5ForConditionalGeneration,
    ViTImageProcessor,
    VisionEncoderDecoderModel,
)

trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-stage1")
mt5_model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
image_processor = ViTImageProcessor.from_pretrained("microsoft/trocr-small-stage1")

This removes two sources of instability at once: an over-specialized English handwritten checkpoint and an unnecessarily large decoder. The stage1 models are the pre-trained-only TrOCR checkpoints. (Hugging Face)

B. Set both model config and generation config

Do this:

model.mt5.config.decoder_start_token_id = tokenizer.pad_token_id
model.mt5.config.pad_token_id = tokenizer.pad_token_id
model.mt5.config.eos_token_id = tokenizer.eos_token_id
model.mt5.config.use_cache = False

model.mt5.generation_config.decoder_start_token_id = tokenizer.pad_token_id
model.mt5.generation_config.pad_token_id = tokenizer.pad_token_id
model.mt5.generation_config.eos_token_id = tokenizer.eos_token_id

mT5 uses pad_token_id to start decoder generation. That part of your code is conceptually right, but I would set generation_config too. (Hugging Face)
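
One way to catch drift between the two configs is a small consistency check before training. This is a sketch: `special_token_mismatches` is a hypothetical helper, and the `SimpleNamespace` objects below are stand-ins for `model.mt5.config`, `model.mt5.generation_config`, and the tokenizer, so the check can be tried without loading a model.

```python
from types import SimpleNamespace


def special_token_mismatches(config, generation_config, tokenizer):
    """List every special-token field that disagrees with the tokenizer."""
    expected = {
        "decoder_start_token_id": tokenizer.pad_token_id,  # mT5 starts from pad
        "pad_token_id": tokenizer.pad_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }
    problems = []
    for name, cfg in (("config", config), ("generation_config", generation_config)):
        for key, want in expected.items():
            got = getattr(cfg, key, None)
            if got != want:
                problems.append(f"{name}.{key}={got!r}, expected {want!r}")
    return problems


# Stand-in objects; in the real notebook these would be the model's configs.
tok = SimpleNamespace(pad_token_id=0, eos_token_id=1)
cfg = SimpleNamespace(decoder_start_token_id=0, pad_token_id=0, eos_token_id=1)
gen_cfg = SimpleNamespace(pad_token_id=0, eos_token_id=1)  # start token missing

problems = special_token_mismatches(cfg, gen_cfg, tok)
print(problems)
```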

C. Make decoder inputs explicit

Inside forward, do not rely only on labels=... to do everything.

from transformers.modeling_outputs import BaseModelOutput

def forward(self, pixel_values, labels=None):
    hidden = self._encode(pixel_values)

    decoder_input_ids = None
    decoder_attention_mask = None

    if labels is not None:
        # Shift labels right; -100 positions become pad_token_id.
        decoder_input_ids = self.mt5._shift_right(labels)
        decoder_attention_mask = (decoder_input_ids != self.mt5.config.pad_token_id).long()
        # The decoder start token is also pad_token_id, so keep position 0 visible.
        decoder_attention_mask[:, 0] = 1

    return self.mt5(
        encoder_outputs=BaseModelOutput(last_hidden_state=hidden),
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        labels=labels,
        use_cache=False,
    )

This makes the training path less ambiguous, and it lines up with how encoder-decoder training is supposed to work conceptually: labels are shifted right into decoder inputs. (GitHub)

D. Freeze the encoder first

At the beginning, the fragile part is the bridge, not the vision backbone. So start with:

for p in model.encoder.parameters():
    p.requires_grad = False

for name, p in model.mt5.named_parameters():
    p.requires_grad = (
        ("EncDecAttention" in name) or
        ("lm_head" in name) or
        ("shared" in name)
    )

if model.enc_to_dec_proj is not None:
    for p in model.enc_to_dec_proj.parameters():
        p.requires_grad = True

This follows directly from the encoder-decoder warm-start logic: the cross-attention bridge is new and needs to be learned carefully. (Hugging Face)
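
Before training, it can help to see which parameters a name filter like this would actually select. The sketch below applies the same three markers to a handful of example names. The names follow the T5-style naming in Transformers as I recall it, so treat them as assumptions and print your own `named_parameters()` to confirm.

```python
TRAINABLE_MARKERS = ("EncDecAttention", "lm_head", "shared")


def is_trainable(param_name: str) -> bool:
    """Same rule as the freezing loop: train only bridge, head, and embeddings."""
    return any(marker in param_name for marker in TRAINABLE_MARKERS)


# Example parameter names (assumed; verify against your own model).
example_names = [
    "shared.weight",
    "decoder.block.0.layer.0.SelfAttention.q.weight",
    "decoder.block.0.layer.1.EncDecAttention.q.weight",
    "decoder.block.0.layer.1.layer_norm.weight",
    "decoder.block.0.layer.2.DenseReluDense.wi_0.weight",
    "lm_head.weight",
]

selected = [name for name in example_names if is_trainable(name)]
for name in example_names:
    print("train " if is_trainable(name) else "freeze", name)
```

Note that under this filter the layer norm that sits next to the cross-attention stays frozen. Whether to unfreeze it as well is a judgment call, not something the docs prescribe.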

E. Fix the one-sample overfit protocol

For the one-sample proof, use:

lr=1e-4
weight_decay=0.0
no dropout
greedy decode
500 to 1000 steps
teacher-forced token accuracy

Example:

import torch
from torch import nn

# Optimize only the parameters left trainable by the freezing step.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,
    weight_decay=0.0,
)

# Disable dropout so a single sample can be memorized exactly.
for m in model.modules():
    if isinstance(m, nn.Dropout):
        m.p = 0.0

model.train()
for step in range(1, 1001):
    outputs = model(pixel_values=pv, labels=lb)
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 20 == 0:
        model.eval()
        with torch.no_grad():
            # Teacher-forced accuracy: compare argmax logits to labels, skipping -100.
            tf_outputs = model(pixel_values=pv, labels=lb)
            tf_pred = tf_outputs.logits.argmax(-1)
            mask = lb != -100
            token_acc = (tf_pred[mask] == lb[mask]).float().mean().item()

            # Greedy free-running decode, capped slightly above the target length.
            gen_ids = model.generate(
                pixel_values=pv,
                max_new_tokens=int(mask.sum().item()) + 4,
                num_beams=1,
                do_sample=False,
            )
            pred = tokenizer.decode(gen_ids[0], skip_special_tokens=True)

        print(step, loss.item(), token_acc, pred)
        model.train()

The T5 docs support the lower LR recommendation. Greedy decoding removes beam-search noise from the diagnosis. (Hugging Face)

What success should look like

Do not judge by loss alone.

For a one-sample test, success is:

teacher-forced token accuracy approaches 1.0
greedy decoded text becomes an exact match
it stays stable for multiple checks

If loss goes down but token accuracy stays mediocre, the bridge is not learning properly. If token accuracy gets high but free decoding still loops, the model is learning under teacher forcing but autoregressive generation is unstable.
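
The decision logic above can be written down as a small checker. This is a sketch in plain Python: the 0.99 accuracy threshold, the run length of 4, and the three-check stability window are hypothetical values I picked for illustration, and all function names are my own.

```python
IGNORE_INDEX = -100


def token_accuracy(pred_ids, label_ids):
    """Teacher-forced accuracy over positions whose label is not ignored."""
    pairs = [(p, t) for p, t in zip(pred_ids, label_ids) if t != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)


def longest_run(text):
    """Length of the longest run of one repeated character."""
    best = run = 0
    prev = None
    for ch in text:
        run = run + 1 if ch == prev else 1
        prev = ch
        best = max(best, run)
    return best


def diagnose(token_acc, pred, target, acc_threshold=0.99, max_run=4):
    """Map one evaluation check to one of the outcomes described above."""
    if token_acc < acc_threshold:
        return "bridge not learning"
    if pred == target:
        return "pass"
    if longest_run(pred) > max_run:
        return "teacher forcing ok, generation looping"
    return "teacher forcing ok, generation mismatch"


def stable_pass(history, k=3):
    """True once the last k checks all passed."""
    return len(history) >= k and all(h == "pass" for h in history[-k:])


acc = token_accuracy([5, 6, 1, 9, 9], [5, 6, 1, IGNORE_INDEX, IGNORE_INDEX])
print(acc, diagnose(acc, "kkkkkkk", "kitab"))
```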

About the tokenizer question

The practical answer is:

Do not think “which tokenizer works with TrOCR encoder?”
Think “which decoder family works best with the TrOCR encoder?”

The tokenizer comes with the decoder family.

Best current options, in order

1. XLM-R decoder

This is the cleanest TrOCR-style multilingual path inside Transformers. Hugging Face’s public decoder-replacement guidance explicitly shows replacing TrOCR’s decoder with RobertaForCausalLM.from_pretrained("xlm-roberta-base", is_decoder=True, add_cross_attention=True). That is the most standard multilingual replacement route. (Hugging Face Forums)

Why it is attractive:

closer to the standard VisionEncoderDecoderModel recipe
easier than custom T5-decoder-only plumbing
multilingual tokenizer already available

2. IndicBART

If your real target is Hindi and perhaps other Indian languages, this is one of the strongest alternatives. IndicBART is a multilingual seq2seq model focused on 11 Indian languages plus English. There is also a public trocr-indic model built around IndicBART, and it explicitly supports Hindi, though it notes a Devanagari-script limitation in the released setup. (Hugging Face)

Why it is attractive:

more language-focused for Indic text than mT5
seq2seq architecture fits OCR-style generation naturally
smaller and more targeted than mt5-base

3. ByT5

ByT5 is tokenizer-free and works directly on UTF-8 bytes. The model docs say it is more robust to noise and can process any language without a separate tokenizer vocabulary. That is interesting for OCR because OCR errors often look like noisy character sequences. (Hugging Face)

Why it is attractive:

no tokenizer coverage problem
strong fit for noisy OCR text

Why I would not pick it first:

sequence lengths are longer
it still lives in the T5 family, so the decoder-only integration pain remains
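
To put a number on the sequence-length cost for Hindi: Devanagari code points sit in the U+0900 to U+097F range, and each one takes three bytes in UTF-8. A byte-level decoder therefore needs roughly three steps per code point. A quick check in plain Python, using the word "namaste" written with escapes so the code points are unambiguous:

```python
# "namaste" in Devanagari: na, ma, sa, virama, ta, vowel sign e
text = "\u0928\u092e\u0938\u094d\u0924\u0947"

num_code_points = len(text)
num_utf8_bytes = len(text.encode("utf-8"))

print(num_code_points, num_utf8_bytes)  # 6 18
```

So a short text line of 40 Devanagari code points would already be about 120 byte-level targets, before any special tokens.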

4. Stay with mT5

This is still viable. mT5 covers 101 languages and already supports Hindi tokenization. I would keep it only after fixing the wiring and training regime first. (Hugging Face)

My recommendation on alternatives

If your goal is the least risky next step, I would rank them like this:

1. TrOCR encoder + XLM-R decoder
2. TrOCR encoder + IndicBART
3. TrOCR encoder + repaired mT5-small setup
4. ByT5 experiment only after the above

That ranking is based on current Hugging Face implementation guidance and public issue history. The T5 decoder-only route is the least standard of the four. (Hugging Face Forums)

For your final end goal: complex documents

This part matters a lot.

Your target is not just Hindi recognition. It is handwritten + printed Hindi in complex documents. The public TrOCR model card and discussion history strongly suggest that the handwritten checkpoint is best on single text-line inputs, and users doing full-page OCR typically detect or crop regions first, then run TrOCR on those crops. (Hugging Face)

So I would not design the final system as “single recognizer eats full page.” I would design it as:

text-region detection
line grouping or crop extraction
Hindi recognizer on each crop
merge results
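
A rough sketch of that four-stage shape, with the detector, cropper, and recognizer passed in as callables. Everything here is hypothetical scaffolding: `group_into_lines`, `ocr_page`, the `(x, y, width, height)` box format, and the `y_tol` threshold are my own choices, and the stubs at the bottom only stand in for a real detector and a real Hindi recognizer.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height


def center_y(box: Box) -> float:
    return box[1] + box[3] / 2


def group_into_lines(boxes: List[Box], y_tol: int = 10) -> List[List[Box]]:
    """Group boxes whose vertical centers are close, then sort left to right."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=center_y):
        if lines and abs(center_y(box) - center_y(lines[-1][-1])) <= y_tol:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [sorted(line, key=lambda b: b[0]) for line in lines]


def ocr_page(page, detect: Callable, crop: Callable, recognize: Callable,
             y_tol: int = 10) -> str:
    """Detect regions, group them into lines, recognize each crop, merge text."""
    lines = group_into_lines(detect(page), y_tol)
    return "\n".join(
        " ".join(recognize(crop(page, box)) for box in line) for line in lines
    )


# Stubs: the "page" maps each box to its ground-truth text.
fake_page = {
    (100, 12, 50, 20): "world",
    (10, 10, 50, 20): "hello",
    (10, 60, 50, 20): "again",
}
result = ocr_page(
    fake_page,
    detect=lambda page: list(page.keys()),
    crop=lambda page, box: page[box],
    recognize=lambda crop_text: crop_text,
)
print(result)
```

The useful property of this split is that the recognizer only ever sees single crops, which is the input shape the TrOCR checkpoints are described as handling best.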

For more document-native approaches, Donut is worth tracking because it is an OCR-free document model, but that is a different design choice from a recognizer-focused OCR pipeline. (Hugging Face)

My blunt recommendation

For your case, I would do this next:

Path A. Repair your current experiment

switch to trocr-small-stage1
switch to mt5-small
explicit decoder inputs and mask
freeze encoder
lr=1e-4
greedy decode
1000-step one-sample overfit

Path B. If that still fails

Stop tuning repetition penalties. Move to:

VisionEncoderDecoderModel
xlm-roberta-base as decoder using RobertaForCausalLM
matching tokenizer for labels

That is the cleanest multilingual TrOCR path publicly documented by Hugging Face. (Hugging Face Forums)

Path C. If Hindi quality is still weak

Try IndicBART next, because it is actually designed around Indic languages rather than broad multilingual coverage. (Hugging Face)

Final answer

Your current result does not tell me “the idea is wrong.” It tells me:

you have a partially working pipeline,
your current overfit test is too unstable to trust,
the tokenizer is probably not the main blocker,
the biggest issue is the bridge + training regime,
and for your final use case, you should treat recognition and document layout as separate problems. (Hugging Face)

The fastest high-value move is to rebuild the one-sample test in the smaller, cleaner form above. If that passes, then your architecture is viable. If it still does not pass, switch decoder family before spending more time on hyperparameter tweaking.
