A 35.3-million-parameter decoder-only Transformer language model trained from scratch on Telugu text from the CC-100 corpus.
| Property | Value |
|---|---|
| Parameters | 35,314,944 |
| Architecture | GPT-style decoder-only Transformer |
| Vocabulary | 32,000 BPE tokens (Telugu-optimized) |
| Context length | 256 tokens |
| Embedding dim | 384 |
| Layers | 6 |
| Attention heads | 6 |
| Dropout | 0.1 |
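The stated parameter count can be reproduced by one plausible GPT-2-style accounting of the hyperparameters in the table above. The exact layout (fused QKV without bias, 4x MLP expansion, untied output head) is an assumption, not confirmed by the repository, but it lands exactly on the stated total:

```python
# Hypothetical parameter accounting for the table above (GPT-2-style
# layout; bias placement and 4x MLP width are assumptions).
vocab, d_model, n_layer, ctx_len = 32000, 384, 6, 256
d_ff = 4 * d_model                   # conventional 4x MLP expansion (assumed)

tok_emb = vocab * d_model            # token embedding table
pos_emb = ctx_len * d_model          # learned positional embeddings
ln = 2 * d_model                     # LayerNorm scale + shift

block = (
    ln                               # pre-attention LayerNorm
    + d_model * 3 * d_model          # fused QKV projection (no bias, assumed)
    + d_model * d_model + d_model    # attention output projection + bias
    + ln                             # pre-MLP LayerNorm
    + d_model * d_ff + d_ff          # MLP up-projection + bias
    + d_ff * d_model + d_model       # MLP down-projection + bias
)

head = d_model * vocab               # untied LM head, no bias (assumed)
total = tok_emb + pos_emb + n_layer * block + ln + head
print(total)  # 35314944 -- matches the table
```

Note that the untied output head and the token embedding table each account for roughly 12.3M of the 35.3M parameters, which is typical for a small model with a 32k vocabulary.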
Files:

- `model_epoch_1.pt` through `model_epoch_8.pt` – PyTorch checkpoints (model + optimizer state)
- `telugu_tokenizer.json` – trained BPE tokenizer (Hugging Face `tokenizers` format)

Quick start:

```python
import torch
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("telugu_tokenizer.json")
checkpoint = torch.load("model_epoch_8.pt", map_location="cpu")
# Load into your GPT model class from models/gpt_model.py
```
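Because the context length is 256 tokens, any decoding loop must crop its input window before each forward pass. A minimal sketch of that loop, assuming a `next_token_fn` that maps a window of token ids to the next id (the actual model class lives in `models/gpt_model.py` and is not reproduced here):

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, ctx_len=256):
    """Greedy autoregressive decoding with a fixed context window."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-ctx_len:]          # keep only the last ctx_len tokens
        ids.append(next_token_fn(window))
    return ids

# Toy stand-in "model" that always predicts (last id + 1), for illustration.
demo = generate(lambda w: w[-1] + 1, prompt_ids=[5], max_new_tokens=3)
print(demo)  # [5, 6, 7, 8]
```

With the real model, `next_token_fn` would run a forward pass over `window` and take an argmax (or a sample) over the final-position logits, then `tokenizer.decode(ids)` recovers the Telugu text.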
Citation:

```bibtex
@mastersthesis{marpally2026tellama,
  title={TeLLaMA: A GPT-Style Language Model from Scratch for Telugu},
  author={Marpally, Anirudh},
  year={2026},
  school={Defence Institute of Advanced Technology (DIAT), Pune}
}
```