Hmm…
This is fixable enough to keep exploring, but the main problem is probably not the tokenizer itself. The bigger problem is that your current experiment combines a TrOCR encoder that was fine-tuned for English single-line handwriting, a custom mT5-as-decoder-only wiring path, and a difficult Hindi OCR target. That is a fragile combination. Hugging Face’s encoder-decoder docs explicitly warn that when you combine a pretrained encoder and a different decoder, the cross-attention layers may be randomly initialized and must be learned during fine-tuning. They also show that the supported decoder path is usually a decoder model configured for cross-attention, not a full seq2seq model hacked into decoder-only use. (Hugging Face)
The most important conclusion
I do not think your experiment proves that “TrOCR encoder + Hindi-capable decoder cannot work.” I think it proves that your current wiring and training regime are too unstable to make that judgment. The fact that loss drops at all means the image path, label path, and cross-modal connection are at least partially alive. The repeated characters point more toward autoregressive instability than “complete failure.” Repetition is also a known failure mode in TrOCR-style generation, especially when decoder setup or generation config is off. (Hugging Face)
What is going wrong in your current Colab
From the code you shared, these are the biggest issues.
1. Your notebook says mt5-small, but the code loads mt5-base
That is not a cosmetic detail. mt5-base is materially larger and harder to stabilize than mt5-small. For a one-sample overfit test, you want the smallest model that can still express the task. Using a larger multilingual decoder makes the bridge-learning problem harder, not easier.
2. You are starting from trocr-base-handwritten, which is already specialized
The public model card says microsoft/trocr-base-handwritten is a TrOCR model fine-tuned on the IAM dataset. The updated README also says it works best on single-line handwritten English text and is not optimized for printed text or multi-line inputs. For a language swap, trocr-base-stage1 or trocr-small-stage1 is usually a cleaner starting point because those are the pre-trained only checkpoints rather than the already English-finetuned handwritten checkpoint. (Hugging Face)
3. The mT5 wiring path is custom, and that matters
You are not using the standard VisionEncoderDecoderModel.from_encoder_decoder_pretrained(...) path. Instead, you replace the mT5 encoder with a dummy module and feed encoder_outputs directly into MT5ForConditionalGeneration. That can work, but public Hugging Face issue history shows that using T5 or ByT5 as decoder-only for OCR is still a custom workaround path, not the most standard one. There is a dedicated issue where a user had to create a T5DecoderOnlyForCausalLM subclass for this exact reason. (GitHub)
4. Your one-sample overfit test is not a clean overfit test
In your code, the one-sample test uses:
- full trainable model
- AdamW(lr=1e-3)
- only 150 steps
- beam search during evaluation
That is too aggressive and too noisy. T5-family docs say that with AdamW, values around 1e-4 to 3e-4 typically work well, and they note that T5 was pretrained with Adafactor. Also, for T5 and mT5, the correct decoder start behavior is to use pad_token_id. (Hugging Face)
So your current test is mixing three confounders:
- LR is likely too high for this hybrid.
- The decoder is larger than needed.
- Beam search is a poor judge of early training quality.
5. Decoder masking is too implicit
The official encoder-decoder implementation uses shift-right logic to build decoder inputs from labels. There is also a recent Transformers issue pointing out that in VisionEncoderDecoderModel, users observed that decoder_attention_mask was not always created the way they expected when labels were shifted into decoder inputs. In a custom hybrid like yours, I would not leave this implicit. I would create decoder_input_ids and decoder_attention_mask explicitly. (GitHub)
My recommendation for your current setup
Keep the overall idea for now, but simplify the experiment hard.
Recommended first rebuild
Use:
- microsoft/trocr-small-stage1 or microsoft/trocr-base-stage1
- google/mt5-small
- explicit decoder inputs and decoder attention mask
- frozen encoder at first
- greedy decoding
- lower LR
- longer one-sample training
Why this version first:
- stage1 is a cleaner visual warm start than the English IAM handwritten checkpoint for a decoder swap. (Hugging Face)
- mt5-small is easier to stabilize than mt5-base.
- mT5 already supports Hindi tokenization and uses pad_token_id as the decoder start token, so the tokenizer is not the core blocker. (Hugging Face)
Concrete changes I would make
A. Change the checkpoints
Use:
from transformers import AutoTokenizer, MT5ForConditionalGeneration, ViTImageProcessor, VisionEncoderDecoderModel

trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-stage1")
mt5_model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
image_processor = ViTImageProcessor.from_pretrained("microsoft/trocr-small-stage1")
This removes two sources of instability at once: an over-specialized English handwritten checkpoint and an unnecessarily large decoder. The stage1 models are the pre-trained-only TrOCR checkpoints. (Hugging Face)
B. Set both model config and generation config
Do this:
model.mt5.config.decoder_start_token_id = tokenizer.pad_token_id
model.mt5.config.pad_token_id = tokenizer.pad_token_id
model.mt5.config.eos_token_id = tokenizer.eos_token_id
model.mt5.config.use_cache = False
model.mt5.generation_config.decoder_start_token_id = tokenizer.pad_token_id
model.mt5.generation_config.pad_token_id = tokenizer.pad_token_id
model.mt5.generation_config.eos_token_id = tokenizer.eos_token_id
mT5 uses pad_token_id to start decoder generation. That part of your code is conceptually right, but I would set generation_config too. (Hugging Face)
C. Make decoder inputs explicit
Inside forward, do not rely only on labels=... to do everything.
# BaseModelOutput comes from transformers.modeling_outputs
def forward(self, pixel_values, labels=None):
    hidden = self._encode(pixel_values)

    decoder_input_ids = None
    decoder_attention_mask = None
    if labels is not None:
        # Shift labels right: prepend the start token (pad for mT5), map -100 to pad
        decoder_input_ids = self.mt5._shift_right(labels)
        decoder_attention_mask = (decoder_input_ids != self.mt5.config.pad_token_id).long()
        # The start token shares its id with pad, so keep position 0 visible
        decoder_attention_mask[:, 0] = 1

    return self.mt5(
        encoder_outputs=BaseModelOutput(last_hidden_state=hidden),
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        labels=labels,
        use_cache=False,
    )
This makes the training path less ambiguous, and it lines up with how encoder-decoder training is supposed to work conceptually: labels are shifted right into decoder inputs. (GitHub)
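To make the shift concrete, here is a dependency-free sketch of the same step on plain Python lists. It mirrors the idea rather than the library internals: shift_right and decoder_mask are illustrative helper names, the token ids are made up, and it assumes pad id 0 doubles as the decoder start token (the T5/mT5 convention) with -100 marking ignored label positions.

```python
IGNORE = -100  # label positions excluded from the loss
PAD_ID = 0     # assumed: pad id, also used as the decoder start token


def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID):
    """Build decoder inputs: prepend the start token and drop the last label."""
    cleaned = [pad_id if t == IGNORE else t for t in labels]
    return [start_id] + cleaned[:-1]


def decoder_mask(decoder_input_ids, pad_id=PAD_ID):
    """1 for real tokens, 0 for padding. Position 0 is always kept,
    because the start token shares its id with padding."""
    mask = [int(t != pad_id) for t in decoder_input_ids]
    mask[0] = 1
    return mask


labels = [259, 1234, 1, IGNORE, IGNORE]  # made-up ids; 1 stands in for EOS
dec_in = shift_right(labels)
print(dec_in)                # [0, 259, 1234, 1, 0]
print(decoder_mask(dec_in))  # [1, 1, 1, 1, 0]
```

One detail worth checking in any explicit mask: a mask built only from "!= pad_id" would zero out position 0 and hide the start token from every later decoder position.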
D. Freeze the encoder first
At the beginning, the fragile part is the bridge, not the vision backbone. So start with:
for p in model.encoder.parameters():
    p.requires_grad = False

for name, p in model.mt5.named_parameters():
    p.requires_grad = (
        ("EncDecAttention" in name) or
        ("lm_head" in name) or
        ("shared" in name)
    )

if model.enc_to_dec_proj is not None:
    for p in model.enc_to_dec_proj.parameters():
        p.requires_grad = True
This follows directly from the encoder-decoder warm-start logic: the cross-attention bridge is new and needs to be learned carefully. (Hugging Face)
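Before training, it is worth confirming what a substring filter like this actually selects. The sketch below applies the same three keys to a handful of parameter names written in the style of T5-family models. The names are illustrative and not read from a real checkpoint, and is_trainable is a hypothetical helper.

```python
TRAINABLE_KEYS = ("EncDecAttention", "lm_head", "shared")


def is_trainable(param_name, keys=TRAINABLE_KEYS):
    """True if the parameter name contains any of the bridge-related keys."""
    return any(k in param_name for k in keys)


# Illustrative names only; list the real ones with model.mt5.named_parameters()
example_names = [
    "shared.weight",
    "decoder.block.0.layer.0.SelfAttention.q.weight",
    "decoder.block.0.layer.1.EncDecAttention.q.weight",
    "decoder.block.0.layer.2.DenseReluDense.wo.weight",
    "lm_head.weight",
]

for name in example_names:
    status = "train " if is_trainable(name) else "frozen"
    print(status, name)
```

On the real model, running the same predicate over named_parameters() and counting the matches is a quick guard against a filter that silently selects nothing.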
E. Fix the one-sample overfit protocol
For the one-sample proof, use:
- lr=1e-4
- weight_decay=0.0
- no dropout
- greedy decode
- 500 to 1000 steps
- teacher-forced token accuracy
Example:
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,
    weight_decay=0.0,
)

for m in model.modules():
    if isinstance(m, nn.Dropout):
        m.p = 0.0

for step in range(1, 1001):
    outputs = model(pixel_values=pv, labels=lb)
    loss = outputs.loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 20 == 0:
        model.eval()
        with torch.no_grad():
            tf_outputs = model(pixel_values=pv, labels=lb)
            tf_pred = tf_outputs.logits.argmax(-1)
            mask = lb != -100
            token_acc = (tf_pred[mask] == lb[mask]).float().mean().item()
            gen_ids = model.generate(
                pixel_values=pv,
                max_new_tokens=int(mask.sum().item()) + 4,
                num_beams=1,
                do_sample=False,
            )
            pred = tokenizer.decode(gen_ids[0], skip_special_tokens=True)
        print(step, loss.item(), token_acc, pred)
        model.train()
The T5 docs support the lower LR recommendation. Greedy decoding removes beam-search noise from the diagnosis. (Hugging Face)
What success should look like
Do not judge by loss alone.
For a one-sample test, success is:
- teacher-forced token accuracy approaches 1.0
- greedy decoded text becomes an exact match
- it stays stable for multiple checks
If loss goes down but token accuracy stays mediocre, the bridge is not learning properly. If token accuracy gets high but free decoding still loops, the model is learning under teacher forcing but autoregressive generation is unstable.
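These criteria can be written down as a small check so each evaluation step gets the same verdict. This is a minimal sketch on plain Python lists. The helper names, the 0.99 threshold, and the "three passes in a row" rule are assumptions for illustration, not values from any library.

```python
IGNORE = -100  # label positions excluded from the loss


def token_accuracy(pred_ids, label_ids, ignore=IGNORE):
    """Fraction of non-ignored label positions predicted correctly."""
    pairs = [(p, t) for p, t in zip(pred_ids, label_ids) if t != ignore]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)


def diagnose(token_acc, decoded, target, acc_threshold=0.99):
    """Map one evaluation step to one of the three outcomes described above."""
    if token_acc >= acc_threshold and decoded == target:
        return "pass"
    if token_acc >= acc_threshold:
        return "teacher forcing ok, free decoding unstable"
    return "bridge not learning"


def stable_pass(history, needed=3):
    """True if the last `needed` checks all passed (assumed stability rule)."""
    return len(history) >= needed and all(h == "pass" for h in history[-needed:])


acc = token_accuracy([5, 6, 7, 9], [5, 6, 7, IGNORE])
print(acc)                                # 1.0
print(diagnose(acc, "aaaaaa", "target"))  # teacher forcing ok, free decoding unstable
print(stable_pass(["pass", "pass", "pass"]))  # True
```

In the training loop, the per-step verdicts can be appended to a list, and the run stopped once stable_pass returns True.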
About the tokenizer question
The practical answer is:
Do not think “which tokenizer works with TrOCR encoder?”
Think “which decoder family works best with the TrOCR encoder?”
The tokenizer comes with the decoder family.
Best current options, in order
1. XLM-R decoder
This is the cleanest TrOCR-style multilingual path inside Transformers. Hugging Face’s public decoder-replacement guidance explicitly shows replacing TrOCR’s decoder with RobertaForCausalLM.from_pretrained("xlm-roberta-base", is_decoder=True, add_cross_attention=True). That is the most standard multilingual replacement route. (Hugging Face Forums)
Why it is attractive:
- closer to the standard VisionEncoderDecoderModel recipe
- easier than custom T5-decoder-only plumbing
- multilingual tokenizer already available
2. IndicBART
If your real target is Hindi and perhaps other Indian languages, this is one of the strongest alternatives. IndicBART is a multilingual seq2seq model focused on 11 Indian languages plus English. There is also a public trocr-indic model built around IndicBART, and it explicitly supports Hindi, though it notes a Devanagari-script limitation in the released setup. (Hugging Face)
Why it is attractive:
- more language-focused for Indic text than mT5
- seq2seq architecture fits OCR-style generation naturally
- smaller and more targeted than mt5-base
3. ByT5
ByT5 is tokenizer-free and works directly on UTF-8 bytes. The model docs say it is more robust to noise and can process any language without a separate tokenizer vocabulary. That is interesting for OCR because OCR errors often look like noisy character sequences. (Hugging Face)
Why it is attractive:
- no tokenizer coverage problem
- strong fit for noisy OCR text
Why I would not pick it first:
- sequence lengths are longer
- it still lives in the T5 family, so the decoder-only integration pain remains
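The sequence-length cost is easy to quantify for Hindi. Devanagari code points each take three bytes in UTF-8, so a byte-level decoder needs roughly three steps per code point, before any special tokens. A quick check with the standard library, using a short sample word chosen for illustration:

```python
# "namaste" in Devanagari, written with escapes so the code points are explicit
text = "\u0928\u092e\u0938\u094d\u0924\u0947"

code_points = len(text)
utf8_bytes = len(text.encode("utf-8"))

print(code_points)  # 6
print(utf8_bytes)   # 18
```

So a text line that a subword decoder covers in a few dozen tokens can take several times as many steps at the byte level, which matters for both training memory and decoding speed.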
4. Stay with mT5
This is still viable. mT5 covers 101 languages and already supports Hindi tokenization. I would keep it only after fixing the wiring and training regime first. (Hugging Face)
My recommendation on alternatives
If your goal is the least risky next step, I would rank them like this:
- TrOCR encoder + XLM-R decoder
- TrOCR encoder + IndicBART
- TrOCR encoder + repaired mT5-small setup
- ByT5 experiment only after the above
That ranking is based on current Hugging Face implementation guidance and public issue history. The T5 decoder-only route is the least standard of the four. (Hugging Face Forums)
For your final end goal: complex documents
This part matters a lot.
Your target is not just Hindi recognition. It is handwritten + printed Hindi in complex documents. The public TrOCR model card and discussion history strongly suggest that the handwritten checkpoint is best on single text-line inputs, and users doing full-page OCR typically detect or crop regions first, then run TrOCR on those crops. (Hugging Face)
So I would not design the final system as “single recognizer eats full page.” I would design it as:
- text-region detection
- line grouping or crop extraction
- Hindi recognizer on each crop
- merge results
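The four stages can be sketched without committing to a specific detector or recognizer. In this minimal example the boxes are assumed to come from some text detector, and recognize is a hypothetical callable standing in for the Hindi recognizer run on each crop. The grouping rule (vertical centers within a pixel tolerance) is a simple assumption, not a recommendation for real layouts.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def group_into_lines(boxes: List[Box], y_tol: int = 10) -> List[List[Box]]:
    """Group boxes into lines by vertical center, then sort each line left to right."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        cy = (box[1] + box[3]) / 2
        if lines:
            ref = lines[-1][0]  # first box of the current line
            if abs(cy - (ref[1] + ref[3]) / 2) <= y_tol:
                lines[-1].append(box)
                continue
        lines.append([box])
    return [sorted(line, key=lambda b: b[0]) for line in lines]


def run_pipeline(boxes: List[Box], recognize: Callable[[Box], str], y_tol: int = 10) -> str:
    """Recognize each crop and merge the results in reading order."""
    lines = group_into_lines(boxes, y_tol)
    return "\n".join(" ".join(recognize(b) for b in line) for line in lines)


# Stand-in detector output and recognizer, for illustration only
boxes = [(120, 12, 200, 40), (10, 10, 100, 40), (10, 60, 150, 90)]
fake_text = {boxes[0]: "B", boxes[1]: "A", boxes[2]: "C"}
print(run_pipeline(boxes, lambda b: fake_text[b]))
```

With the stand-in data this prints "A B" on the first line and "C" on the second. Keeping the recognizer behind a plain callable also makes it easy to swap decoder families later without touching the layout code.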
For more document-native approaches, Donut is worth tracking because it is an OCR-free document model, but that is a different design choice from a recognizer-focused OCR pipeline. (Hugging Face)
My blunt recommendation
For your case, I would do this next:
Path A. Repair your current experiment
- switch to trocr-small-stage1
- switch to mt5-small
- explicit decoder inputs and mask
- freeze encoder
- lr=1e-4
- greedy decode
- 1000-step one-sample overfit
Path B. If that still fails
Stop tuning repetition penalties. Move to:
- VisionEncoderDecoderModel
- xlm-roberta-base as decoder using RobertaForCausalLM
- matching tokenizer for labels
That is the cleanest multilingual TrOCR path publicly documented by Hugging Face. (Hugging Face Forums)
Path C. If Hindi quality is still weak
Try IndicBART next, because it is actually designed around Indic languages rather than broad multilingual coverage. (Hugging Face)
Final answer
Your current result does not tell me “the idea is wrong.” It tells me:
- you have a partially working pipeline,
- your current overfit test is too unstable to trust,
- the tokenizer is probably not the main blocker,
- the biggest issue is the bridge + training regime,
- and for your final use case, you should treat recognition and document layout as separate problems. (Hugging Face)
The fastest high-value move is to rebuild the one-sample test in the smaller, cleaner form above. If that passes, then your architecture is viable. If it still does not pass, switch decoder family before spending more time on hyperparameter tweaking.