TrorYongOCR
This repository contains model weights and configuration files for the pre-trained model compatible with tror-yong-ocr version 0.2.6 onwards.
TrorYongOCR is a tiny encoder-decoder model for the Scene Text Recognition task. It prepends the encodings of image patches to the "begin of sequence" token to condition the generation of the next character token. In LLM terms, the patch encodings can be seen simply as a prefill prompt. The single text decoder block of TrorYongOCR generates character tokens autoregressively from this prefill prompt, without any cross-attention mechanism. Moreover, TrorYongOCR can process input images of arbitrary aspect ratio. The current pre-trained weights support two languages: Khmer and English.
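The prefill analogy can be illustrated with a minimal greedy decoding loop. This is only a sketch, not the package's actual API: `greedy_decode`, `toy_next_token`, and the `BOS`/`EOS` ids are hypothetical names, and the toy scorer simply copies its prefix where the real decoder block would attend over patch encodings and previously generated tokens.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 1  # hypothetical special-token ids, for illustration only


def greedy_decode(
    prefix: Sequence[int],
    next_token: Callable[[Sequence[int], List[int]], int],
    max_len: int = 1024,
) -> List[int]:
    """Generate tokens one at a time, conditioned on a fixed prefix."""
    tokens = [BOS]  # the text sequence starts right after the patch prefix
    while len(tokens) < max_len:
        tok = next_token(prefix, tokens)  # sees the prefix and all tokens so far
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens[1:]  # drop BOS


def toy_next_token(prefix: Sequence[int], tokens: List[int]) -> int:
    """Stand-in for the decoder block: copies the prefix one item per step."""
    step = len(tokens) - 1
    return prefix[step] if step < len(prefix) else EOS


decoded = greedy_decode([5, 7, 9], toy_next_token)
```

The point of the sketch is that the prefix is never modified during generation: only the token list grows, which is why no cross-attention is needed.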
Model Details
- Developed by: KHUN Kimang (Ph.D.)
- Shared by: KrorngAI
- Model type: OCR (Optical Character Recognition)
- Language(s) (NLP): Khmer and English
Model Architecture
Model Sources
This model has been pushed to the Hub using the PytorchModelHubMixin integration:
- Code: https://pypi.org/project/tror-yong-ocr/
- Blog Post: https://kimang18.github.io/krorngai-blog/TrorYongOCR/
- Demo: https://huggingface.co/proxy/krorngai-troryongocr-demo.hf.space
Model Configuration
The model configuration is as follows. While preserving the aspect ratio, the input image is resized so that $\min(H, W) = 32$, where $H$ and $W$ are the height and width of the image, respectively. This reduces the computation cost of training, since high-resolution images with a large aspect ratio produce very long patch sequences. The image patch size is $(8, 4)$, where $8$ is along the width of the input image. The context length for the character sequence is up to $1024$. The Transformer has $4$ blocks, each with embedding dimension $d_{model}=384$ and $h=6$ heads. The encoding blocks (blocks $1$ to $3$) have MLP dimension $d_{MLP}=2d_{model}=768$, and the decoding block has $d_{MLP}=4d_{model}=1536$.
| Layer | $d_{model}$ | $h$ | $d_{MLP}$ | Role |
|---|---|---|---|---|
| 1 | 384 | 6 | 768 | Encoder |
| 2 | 384 | 6 | 768 | Encoder |
| 3 | 384 | 6 | 768 | Encoder |
| 4 | 384 | 6 | 1536 | Decoder |
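To see how the resizing rule and patch size determine the patch sequence length, here is a small sketch. The function names are hypothetical, and the rounding of the resized side and the padding of partial patches are assumptions, not confirmed details of the implementation.

```python
import math
from typing import Tuple

SHORT_SIDE = 32          # min(H, W) after resizing
PATCH_W, PATCH_H = 8, 4  # patch size (8, 4); 8 runs along the image width


def resized_shape(height: int, width: int) -> Tuple[int, int]:
    """Scale so the shorter side equals SHORT_SIDE, preserving aspect ratio."""
    scale = SHORT_SIDE / min(height, width)
    return round(height * scale), round(width * scale)


def num_patches(height: int, width: int) -> int:
    """Patch count after resizing; partial patches are assumed to be padded."""
    h, w = resized_shape(height, width)
    return math.ceil(h / PATCH_H) * math.ceil(w / PATCH_W)


print(resized_shape(64, 256))  # (32, 128)
print(num_patches(64, 256))    # 8 rows * 16 columns = 128
```

Under these assumptions, a text line resized to height 32 contributes 8 patch rows, so the prefix length grows linearly with the aspect ratio.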
Training Detail
TrorYongOCR is implemented as a PyPI package and can be installed via
pip install tror-yong-ocr
The pre-trained weights are obtained by pre-training on the seanghay/khmer-hanuman-100k and SoyVitou/KhmerSynthetic1M datasets and then fine-tuning on the Khmer Scene Text (KhmerST) dataset.
KhmerSynthetic1M
KhmerSynthetic1M is a dataset by Mr. Soy Vitou.
This dataset contains images in a monochromatic gray palette (black, white, gray, etc.).
The distribution of the number of tokens per image, i.e. the frequency of each token count, is fairly uniform.
In particular, the maximum number of tokens is around $120$.
This implies that some images have an aspect ratio much higher than $4$.
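A rough back-of-the-envelope check of this implication, assuming a text line resized to height 32 as in the model configuration. The 8 pixels per token used in the second call is a purely hypothetical figure, and tokens do not map one-to-one to visible glyph widths in Khmer, so the numbers are only indicative.

```python
RESIZED_HEIGHT = 32  # text-line height after resizing (model configuration)
MAX_TOKENS = 120     # approximate maximum token count reported above


def pixels_per_token(aspect_ratio: float, n_tokens: int = MAX_TOKENS) -> float:
    """Average horizontal pixels per token for a line of the given aspect ratio."""
    return RESIZED_HEIGHT * aspect_ratio / n_tokens


def aspect_ratio_for(n_tokens: int, px_per_token: float) -> float:
    """Aspect ratio needed to give every token px_per_token pixels of width."""
    return n_tokens * px_per_token / RESIZED_HEIGHT


print(pixels_per_token(4.0))       # about 1.07 px per token, too narrow to be legible
print(aspect_ratio_for(120, 8.0))  # 30.0, under the hypothetical 8 px per token
```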
khmer-hanuman-100k
This dataset by Mr. Yat Seanghay contains images with a variety of background colors and character colors.
KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark
KhmerST is the first Khmer scene-text dataset consisting of:
- 1,544 annotated images
- 997 indoor scenes
- 547 outdoor scenes
It has diverse conditions:
- flat and raised text
- low illumination
- distant and partially occluded text.
The annotations are done at line-level with polygon bounding boxes.
To fine-tune TrorYongOCR, we crop the polygon bounding boxes to keep only the text regions. Then, we apply a warp operation to transform each polygon into a rectangle.
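The geometry behind such a warp can be sketched in pure Python. This is an illustration under assumptions, not the preprocessing code actually used: it assumes each polygon has exactly four corners ordered top-left, top-right, bottom-right, bottom-left, and all function names are hypothetical. It computes the perspective transform only; resampling the pixels would normally be left to an image library (for example OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective`).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_size(quad: Sequence[Point]) -> Tuple[int, int]:
    """Width and height of the target rectangle from the quad's edge lengths."""
    tl, tr, br, bl = quad
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Eight coefficients (h33 fixed to 1) mapping each src point to its dst point."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve(a, b)


def warp_point(h: Sequence[float], p: Point) -> Point:
    """Apply the perspective transform to a single point."""
    x, y = p
    d = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / d, (h[3] * x + h[4] * y + h[5]) / d)


quad = [(10.0, 20.0), (110.0, 30.0), (105.0, 60.0), (8.0, 52.0)]
w, h = rect_size(quad)
rect = [(0.0, 0.0), (float(w), 0.0), (float(w), float(h)), (0.0, float(h))]
H = homography(quad, rect)
```

With `H` in hand, every corner of `quad` maps to the matching corner of `rect`. Polygons with more than four vertices would need a different strategy, such as fitting a minimum-area rectangle first.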
Weight Initialization
We initialize the weights following common practice in SOTA models. As an exception, the position embedding used in the decoding block is initialized with $std=1.0$. The initialization code is given below.
from typing import Sequence

import torch.nn as nn


# Method of the model class; called once per named submodule.
def init_weights(self, module: nn.Module, name: str = '', exclude: Sequence[str] = ()):
    """Initialize the weights using the typical initialization schemes used in SOTA models."""
    # Skip modules whose name starts with any excluded prefix.
    if any(map(name.startswith, exclude)):
        return
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.trunc_normal_(module.weight, std=0.02)
        # Keep the padding embedding at zero.
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.LayerNorm, nn.BatchNorm2d, nn.GroupNorm)):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
Citation
BibTeX:
@online{khun2026,
author = {KHUN, Kimang},
title = {TrorYongOCR: {Encoder-Decoder} {Model} for {Scene} {Text}
{Recognition}},
date = {2026-02-19},
url = {https://kimang18.github.io/krorngai-blog/TrorYongOCR/},
langid = {en}
}
Model Card Author
- ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
- Name: KHUN Kimang (Ph.D.)
Acknowledgement
LightningAI and Google Colab did not specifically sponsor this project.
However, the models were trained thanks to their free credits.
So, huge thanks to LightningAI and Google Colab.
Thanks to all the authors of publicly available datasets.
Model Card Contact
If you have any questions, please reach out via the Facebook Page.
