TrorYongOCR

This repository contains the model weights and configuration files for the pre-trained model, compatible with tror-yong-ocr version 0.2.6 and later.

TrorYongOCR is a tiny encoder-decoder model for the Scene Text Recognition task. It prepends the encodings of image patches to the "beginning of sequence" token to condition next-character generation. In LLM terms, the patch encodings act as a prefill prompt. The single text decoder block of TrorYongOCR generates character tokens autoregressively from this prompt, without any cross-attention mechanism. TrorYongOCR can also process input images of arbitrary aspect ratio. The current pre-trained weights support two languages: Khmer and English.
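The prefill idea can be illustrated with a toy greedy decoding loop. This is only a sketch of the sequence layout described above, not the package's implementation: the token names `<bos>` and `<eos>`, the function `greedy_decode`, and the stand-in `toy_next_token` are all hypothetical, and the patch encodings are represented by plain strings.

```python
from typing import Callable, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical special-token names


def greedy_decode(
    patch_encodings: Sequence[str],
    next_token: Callable[[List[str]], str],
    max_new_tokens: int = 16,
) -> str:
    """Generate characters one at a time from [patches..., BOS, chars...]."""
    # The patch encodings act as the prefill prompt, followed by BOS.
    sequence: List[str] = list(patch_encodings) + [BOS]
    prefix_len = len(sequence)
    for _ in range(max_new_tokens):
        # The decoder sees the whole sequence so far (prompt + generated text).
        token = next_token(sequence)
        if token == EOS:
            break
        sequence.append(token)
    return "".join(sequence[prefix_len:])


def toy_next_token(sequence: List[str]) -> str:
    """Stand-in for the decoder block: spells "OCR", then emits EOS."""
    target = "OCR"
    generated = len(sequence) - (sequence.index(BOS) + 1)
    return target[generated] if generated < len(target) else EOS


print(greedy_decode(["p0", "p1", "p2"], toy_next_token))  # OCR
```

Because the image information already sits in the sequence as a prefix, the loop needs no separate cross-attention input: each step only consumes the sequence built so far.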

Model Details

Developed by: KHUN Kimang (Ph.D.)
Shared by: KrorngAI
Model type: OCR (Optical Character Recognition)
Language(s) (NLP): Khmer and English

Model Architecture

Model Sources

This model has been pushed to the Hub using the PytorchModelHubMixin integration:

Code: https://pypi.org/project/tror-yong-ocr/
Blog Post: https://kimang18.github.io/krorngai-blog/TrorYongOCR/
Demo: https://huggingface.co/proxy/krorngai-troryongocr-demo.hf.space

Model Configuration

The model configuration is as follows. The input image is resized, preserving its aspect ratio, so that $\min(H, W) = 32$, where $H$ and $W$ are the height and width of the image. This reduces the training cost, since high-resolution images with large aspect ratios produce very long patch sequences. The image patch size is $(8, 4)$, where $8$ is along the width of the input image. The context length for the character sequence is up to $1024$. The Transformer has $4$ blocks, each with embedding dimension $d_{model}=384$ and $h=6$ heads. The encoding blocks (blocks $1$ to $3$) have MLP dimension $d_{MLP}=2d_{model}=768$, and the decoding block has $d_{MLP}=4d_{model}=1536$.
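To see how the aspect ratio drives the patch sequence length, the resizing rule and patch size above can be turned into a small calculation. This is a sketch under assumptions: the helper names `resized_shape` and `num_patches` are hypothetical, and rounding the resized sides and padding up to a whole number of patches are guesses, since the text does not say how partial patches are handled.

```python
import math
from typing import Tuple


def resized_shape(height: int, width: int, short_side: int = 32) -> Tuple[int, int]:
    """Scale the image so that min(H, W) == short_side, keeping the aspect ratio."""
    scale = short_side / min(height, width)
    return round(height * scale), round(width * scale)


def num_patches(height: int, width: int, patch_h: int = 4, patch_w: int = 8) -> int:
    """Count patches of size (patch_w, patch_h) after resizing.

    Assumption: partial patches at the border are padded, hence the ceil.
    """
    new_h, new_w = resized_shape(height, width)
    return math.ceil(new_h / patch_h) * math.ceil(new_w / patch_w)


print(num_patches(64, 256))   # aspect ratio 4  -> 128
print(num_patches(64, 1024))  # aspect ratio 16 -> 512
```

With the short side fixed at 32, the patch count grows linearly with the aspect ratio, which is why long text lines dominate the sequence length.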

Layer	$d_{model}$	$h$	$d_{MLP}$	Role
1	384	6	768	Encoder
2	384	6	768	Encoder
3	384	6	768	Encoder
4	384	6	1536	Decoder
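As a rough sanity check on this configuration, the weight matrices of the four blocks can be counted by hand. The sketch below assumes a standard Transformer block (query, key, value and output projections plus a two-layer MLP) and ignores biases, normalization layers, embeddings, the patch projection and the output head, so it is a lower-bound estimate rather than the exact parameter count. The helper `block_weights` is hypothetical.

```python
def block_weights(d_model: int, mlp_ratio: int) -> int:
    """Weight-matrix entries in one block (biases and norms ignored)."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * d_model * (mlp_ratio * d_model)  # up- and down-projection
    return attention + mlp


D_MODEL = 384
encoder = 3 * block_weights(D_MODEL, mlp_ratio=2)  # blocks 1 to 3
decoder = block_weights(D_MODEL, mlp_ratio=4)      # block 4
total = encoder + decoder

print(encoder, decoder, total)  # 3538944 1769472 5308416
```

The estimate of about 5.3M block weights is in the same range as the 5.51M parameters listed for the checkpoint.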

Training Detail

TrorYongOCR is implemented as a PyPI package and can be installed via

pip install tror-yong-ocr

The pre-trained model was obtained by pre-training on the seanghay/khmer-hanuman-100k and SoyVitou/KhmerSynthetic1M datasets and then fine-tuning on the Khmer Scene Text (KhmerST) dataset.

KhmerSynthetic1M

KhmerSynthetic1M is a dataset by Mr. Soy Vitou. It contains images in a gray monochromatic color palette (black, white, gray, etc.). The distribution of the number of tokens per image is fairly uniform, with a maximum of around $120$ tokens. This implies that some images have an aspect ratio much larger than $4$.

khmer-hanuman-100k

This dataset by Mr. Yat Seanghay contains images with a variety of background colors and character colors.

KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark

KhmerST is the first Khmer scene-text dataset consisting of:

1,544 annotated images
997 indoor scenes
547 outdoor scenes

It has diverse conditions:

flat and raised text
low illumination
distant and partially occluded text.

The annotations are done at line-level with polygon bounding boxes.

To fine-tune TrorYongOCR, we cropped the polygon bounding boxes to keep only the text regions, then applied a warp operation to transform each polygon into a rectangle.
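The first step of such a warp is choosing the size of the output rectangle. The sketch below assumes each polygon is a quadrilateral with corners ordered top-left, top-right, bottom-right, bottom-left; the source does not state the polygon format or the library used, and the helper names are hypothetical. The resulting corner pairs are what a perspective-transform routine (for example OpenCV's `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective`) would take as input.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def target_rectangle(quad: Sequence[Point]) -> Tuple[int, int]:
    """Return (width, height) of the upright rectangle for a 4-point polygon.

    Assumes the corner order: top-left, top-right, bottom-right, bottom-left.
    """
    tl, tr, br, bl = quad
    width = max(math.dist(tl, tr), math.dist(bl, br))   # longer horizontal edge
    height = max(math.dist(tl, bl), math.dist(tr, br))  # longer vertical edge
    return round(width), round(height)


def destination_corners(width: int, height: int) -> List[Point]:
    """Corners of the output rectangle, in the same order as the input polygon."""
    return [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]


quad = [(0, 0), (40, 30), (40, 60), (0, 30)]  # a slanted text line
w, h = target_rectangle(quad)
print(w, h)                       # 50 30
print(destination_corners(w, h))  # [(0, 0), (49, 0), (49, 29), (0, 29)]
```

Taking the longer of each pair of opposite edges avoids shrinking the text when the polygon is not a perfect parallelogram.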

Weight Initialization

We initialize the weights following the schemes regularly used in SOTA models. The code to initialize the weights is given below.

As an exception, the position embedding used in the decoding block is initialized with $std=1.0$.

from typing import Sequence

import torch.nn as nn


def init_weights(self, module: nn.Module, name: str = '', exclude: Sequence[str] = ()):
    """Initialize the weights using the typical initialization schemes used in SOTA models."""
    # Skip any module whose name starts with one of the excluded prefixes.
    if any(map(name.startswith, exclude)):
        return
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.trunc_normal_(module.weight, std=0.02)
        # Keep the padding token's embedding at zero.
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.LayerNorm, nn.BatchNorm2d, nn.GroupNorm)):
        # Normalization layers start as the identity transform.
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
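The `exclude` argument is matched as a name prefix, which is one way a separately initialized parameter could be skipped. The sketch below isolates that check in plain Python; the helper `is_excluded` and the module names are hypothetical and not taken from the package.

```python
from typing import Sequence


def is_excluded(name: str, exclude: Sequence[str]) -> bool:
    """Same prefix test as in init_weights: True if any prefix matches."""
    return any(map(name.startswith, exclude))


# Hypothetical module names, for illustration only.
names = ["encoder.blocks.0.attn.qkv", "decoder.pos_embed", "decoder.mlp.fc1"]
skipped = [n for n in names if is_excluded(n, ("decoder.pos_embed",))]

print(skipped)                                # ['decoder.pos_embed']
print(is_excluded("decoder.mlp.fc1", ()))     # False: nothing is excluded
print(is_excluded("decoder.mlp.fc1", ("",)))  # True: '' is a prefix of every name
```

Two details are worth keeping in mind: an empty-string prefix matches every module, and a bare string passed as `exclude` is iterated character by character, so prefixes should be given as a tuple or list.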

Citation

BibTeX:

@online{khun2026,
  author = {KHUN, Kimang},
  title = {TrorYongOCR: {Encoder-Decoder} {Model} for {Scene} {Text}
    {Recognition}},
  date = {2026-02-19},
  url = {https://kimang18.github.io/krorngai-blog/TrorYongOCR/},
  langid = {en}
}

Model Card Author

ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
Name: KHUN Kimang (Ph.D.)

Acknowledgement

LightningAI and Google Colab did not specifically sponsor this project, but both models were trained thanks to their free credits. Huge thanks to LightningAI and Google Colab.

Thanks to all the authors of publicly available datasets.

Model Card Contact

If you have any questions, please reach out via the Facebook Page.

Model size: 5.51M params (Safetensors, tensor type F32)

Paper for KrorngAI/TrorYongOCR

KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark (Paper 2410.18277, published Oct 23, 2024)