TestGen: AI Test Case Generator (Qwen2.5-Coder-7B + LoRA)

Fine-tuned Qwen2.5-Coder-7B-Instruct with LoRA for comprehensive unit test generation.

Overview

This model aims to generate comprehensive unit tests, including edge cases, for source code given as input. The approach is based on the paper "Parameter-Efficient Fine-Tuning of LLMs for Unit Test Generation" (arXiv:2411.02462).

Training Recipe

Component	Details
Base Model	Qwen/Qwen2.5-Coder-7B-Instruct
Method	LoRA (rank=16, alpha=32)
Dataset	andstor/methods2test fm+fc+t+tc (46K+ samples)
Training	3 epochs, lr=1e-4, cosine schedule, effective batch=32
Hardware	A10G-large (24GB VRAM)
Framework	TRL SFTTrainer + PEFT LoRA
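
The effective batch size and epoch count in the table determine roughly how many optimizer steps a run takes. The sketch below works through that arithmetic. The split of the effective batch into a per-device batch of 2 with 16 gradient accumulation steps is a hypothetical example, and 46,000 is used as a stand-in for the "46K+" sample count; neither value comes from the training script.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def total_optimizer_steps(num_samples: int, effective_batch: int, epochs: int) -> int:
    """Approximate optimizer steps, rounding each epoch up to a whole step."""
    return math.ceil(num_samples / effective_batch) * epochs


# ASSUMPTION: one GPU, per-device batch 2, 16 accumulation steps (hypothetical split).
batch = effective_batch_size(per_device_batch=2, grad_accum_steps=16)

# ASSUMPTION: 46,000 samples as a lower bound for "46K+".
steps = total_optimizer_steps(num_samples=46_000, effective_batch=batch, epochs=3)

print(f"effective batch: {batch}, approx. optimizer steps: {steps}")
```

The exact step count reported by a trainer can differ slightly depending on how it handles the last partial batch of each epoch.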

LoRA Target Modules

Attention: q_proj, k_proj, v_proj, o_proj
MLP: gate_proj, up_proj, down_proj
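
To give a sense of what targeting these seven modules means in practice, the sketch below estimates the number of trainable LoRA parameters. The dictionary keys mirror the argument names of `peft.LoraConfig`, but nothing here is taken from `scripts/train.py`. The layer shapes are an assumption based on the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 4 key/value heads of dimension 128, 28 layers); check them against the model's `config.json` before relying on the result.

```python
# LoRA settings from the training recipe; keys mirror peft.LoraConfig argument names.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
}

# ASSUMPTION: shapes from the public Qwen2.5-7B config, not from this repository.
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128
NUM_LAYERS = 28

# (in_features, out_features) of each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int, target_modules: list, num_layers: int = NUM_LAYERS) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(
        rank * (MODULE_SHAPES[name][0] + MODULE_SHAPES[name][1])
        for name in target_modules
    )
    return per_layer * num_layers


total = lora_trainable_params(LORA_SETTINGS["r"], LORA_SETTINGS["target_modules"])
scaling = LORA_SETTINGS["lora_alpha"] / LORA_SETTINGS["r"]  # standard LoRA scaling factor

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumed shapes the adapters add about 40.4 million trainable parameters, well under 1% of the 7B base model, and the update is scaled by alpha / r = 2.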

How to Run Training

# 1. Build the training dataset
pip install datasets huggingface_hub
python scripts/data_pipeline.py --repo YOUR_ORG/testgen-data --max-samples 50000

# 2. (Optional) Add your company's code+test files
python scripts/data_pipeline.py --repo YOUR_ORG/testgen-data --custom-dirs /path/to/your/code

# 3. Run training
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes
python scripts/train.py

# Or via HF Jobs (recommended):
# Hardware: a10g-large, Timeout: 5h

How to Add Your Company's Data

Your raw training data should be organized as code files paired with test files:

your-project/
├── src/
│   ├── calculator.py
│   ├── utils.py
│   └── auth.py
└── tests/
    ├── test_calculator.py
    ├── test_utils.py
    └── test_auth.py

The data pipeline auto-discovers pairs using naming conventions:

Python: calculator.py ↔ test_calculator.py
Java: Calculator.java ↔ CalculatorTest.java
JS/TS: calculator.js ↔ calculator.test.js
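
The naming conventions above can be sketched as a small matching function. This is a minimal illustration, not the logic in `scripts/data_pipeline.py`: the function names `expected_test_name` and `discover_pairs` are hypothetical, and the `.ts` rule (`calculator.ts` ↔ `calculator.test.ts`) is assumed by analogy with the `.js` example.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def expected_test_name(source_file: str) -> Optional[str]:
    """Return the test file name implied by the naming convention, or None."""
    path = Path(source_file)
    stem, ext = path.stem, path.suffix
    if ext == ".py":
        return f"test_{stem}.py"       # calculator.py -> test_calculator.py
    if ext == ".java":
        return f"{stem}Test.java"      # Calculator.java -> CalculatorTest.java
    if ext in {".js", ".ts"}:
        return f"{stem}.test{ext}"     # calculator.js -> calculator.test.js
    return None                        # unsupported extension


def discover_pairs(source_files: Iterable[str], test_files: Iterable[str]) -> Dict[str, str]:
    """Map each source file to its test file when a matching name exists."""
    tests_by_name = {Path(t).name: t for t in test_files}
    pairs = {}
    for src in source_files:
        name = expected_test_name(src)
        if name in tests_by_name:
            pairs[src] = tests_by_name[name]
    return pairs


# Example using the layout shown earlier, plus one source file without a test.
pairs = discover_pairs(
    ["src/calculator.py", "src/utils.py", "src/auth.py", "src/orphan.py"],
    ["tests/test_calculator.py", "tests/test_utils.py", "tests/test_auth.py"],
)
for src, test in pairs.items():
    print(f"{src} <-> {test}")
```

Source files with no matching test, such as `src/orphan.py` here, are simply left out of the result. This sketch matches on file name only, so two test files with the same name in different directories would collide.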

Live Demo

Try it: 🧪 AI Test Case Generator Space

Resources

Training Dataset: Navyatha2006/testgen-sft-data
Live App: Navyatha2006/ai-test-case-generator
Paper: Parameter-Efficient Fine-Tuning of LLMs for Unit Test Generation (arXiv:2411.02462)
Base Dataset: andstor/methods2test