How can I build a high-quality dataset?

Bidram · June 5, 2026, 1:06pm

I want to build a high-quality Persian assistance dataset for an SLM.

I have already used models like ChatGPT to generate a small Persian assistance dataset, but the overall quality was not good enough. Since the model I want to fine-tune is small, I need a larger dataset with much less noise and better overall quality.

My question is: how can I build the fine-tuning dataset I need and make sure it is actually high quality?

John6666 · June 7, 2026, 7:31am

Dataset quality is hard to define because the evaluation criteria change depending on the goal, but I tried to organize the parts that can be organized:

Short answer

Yes, there are best practices, but I would not start from “how do I create a large high-quality dataset?” in the abstract.

For a Persian SLM assistant, I would start from a more evaluation-driven question:

What exact Persian capability is missing from the base model, and how will I measure whether my dataset fixes that gap?

A “high-quality dataset” for Persian assistant fine-tuning may mean different things depending on whether you want:

better Persian fluency
better instruction following
better multi-turn assistant behavior
better Iranian/Persian cultural fit
better factuality or domain knowledge
safer refusals and culturally appropriate alignment
better RAG behavior
better formatting under a specific chat template
better behavior from a very small model with limited context and imperfect Persian tokenization

So I would use an evaluation-first roadmap:

Check current Persian leaderboards and benchmarks.
Build a small private eval set for your exact use case.
Evaluate the base model and Persian-specialized baselines.
Identify the actual gap.
Only then build or curate training data targeted at that gap.

The important point is:

Do not build “more Persian chat data” blindly. Build the missing data that your evaluation shows you need.

1. Start with Persian evaluation, not training data

Before generating or collecting examples, I would first inspect current Persian evaluation resources.

These are useful because they show what Persian LLM quality already gets decomposed into: knowledge, reasoning, instruction following, NLU, NLG, multi-turn dialogue, safety, culture, and retrieval.

Goal	Useful starting point	Why it matters
General Persian LLM comparison	Open Persian LLM Leaderboard	A practical first stop for comparing current Persian-capable models. Use it for baselines, not as a final product metric.
Multi-dimensional Persian evaluation	MIZAN: A Persian LLM Leaderboard	Useful because it separates Persian evaluation into reasoning, instruction following, knowledge, NLU, NLG, and multi-turn dialogue.
Persian knowledge / reasoning	Khayyam Challenge / PersianMMLU, PerMMLU	Good anchors for general and local knowledge, but not sufficient for assistant behavior.
Instruction following	Persian IFEval	Helpful for checking whether the model follows Persian instructions and constraints.
Multi-turn assistant behavior	Persian MT-Bench	Useful for dialogue, writing, multi-turn behavior, and retrieval-like chat cases.
Persian safety / alignment	ELAB: Extensive LLM Alignment Benchmark in Persian Language	Important for safety, fairness, and social norms in Persian linguistic/cultural contexts.
Persian embeddings / RAG	FaMTEB, PTEB Leaderboard	If the assistant uses retrieval, do not evaluate only generation. Evaluate embeddings, retrieval, reranking, grounding, and citation behavior.
Custom local evaluation	ParsBench, lm-evaluation-harness	Useful if you want repeatable private evaluation rather than only leaderboard screenshots.

Why this matters

If the model fails on Persian IFEval-style tasks, you probably need better instruction-following data.

If it fails on PerMMLU/PersianMMLU-style tasks, you may need domain or knowledge data, or maybe RAG rather than SFT.

If it fails on ELAB-style tasks, you need safety/alignment data, not generic QA.

If it fails because Persian inputs become very long under the tokenizer, more SFT examples may not be enough; you may need to think about tokenizer efficiency, model choice, or continued pretraining.

2. Do not treat “Persian” as one fully specified target

It is useful to state the target variety explicitly.

Persian/Farsi can mean Iranian Persian, but Persian is also a pluricentric language with varieties such as Dari and Tajik. If your target is an Iranian Persian assistant, say that. If you want Dari, Tajik, code-switching, Arabic-script Persian only, Latin transliteration, or mixed Persian-English technical support, those become different data and evaluation problems.

A simple target statement can prevent many later mistakes:

Target: Iranian Persian/Farsi assistant for <domain>, mostly formal/semi-formal register, Arabic-script Persian, occasional English technical terms, no Dari/Tajik coverage for now.

Or:

Target: general Persian assistant covering Iranian Persian plus common code-switching in technical conversations.

This affects:

source selection
spelling normalization
register
cultural assumptions
evaluation examples
safety examples
tokenizer measurements
what counts as a natural answer

3. Define “quality” by capability, not by dataset size

A large dataset can still be low quality if it is repetitive, mistranslated, inconsistent, contaminated, or irrelevant to the target model’s failure modes.

For a Persian SLM assistant, I would split quality into dimensions like this:

Quality dimension	What to check	Persian SLM-specific note
Persian fluency	Natural Persian, spelling, orthography, punctuation, register	Translated English data alone may sound unnatural. Native or fluent review matters.
Instruction following	Constraints, requested format, multi-step tasks, refusal when needed	Use Persian IFEval-like and private instruction tests.
Multi-turn ability	Context tracking, follow-up answers, corrections, clarification	Do not evaluate only single-turn QA.
Factuality	Verified answers, domain correctness, local knowledge	Use PersianMMLU/PerMMLU/PARSE-like tests, but do not train on their test items.
Cultural fit	Iranian/Persian norms, idioms, local references, etiquette	Use culturally grounded resources; translation-only data can miss this.
Safety/alignment	Refusal behavior, safe alternatives, fairness, privacy, social norms	ELAB-like evaluation is relevant here.
Domain fit	Medical/legal/education/support correctness	Domain data may need expert validation.
Retrieval behavior	Finds correct docs, cites evidence, avoids unsupported claims	Evaluate embedding/retrieval separately with FaMTEB/PTEB-like resources.
Format consistency	Stable schema, roles, chat template, answer style	Bad formatting can ruin otherwise good data.
Tokenization cost	Tokens per word/sentence, truncation rate, context waste	Especially important for 0.5B–3B SLMs.
Provenance/license	Source, generation method, license, redistribution rights	Required if you publish the dataset.
Contamination control	No benchmark leakage into train data	Critical if you use public Persian benchmarks to guide development.

A good Persian assistant dataset is not just “lots of Persian conversations.” It is a dataset that targets one or more of these dimensions clearly.

4. Compare against Persian-specialized baselines before training

Before spending time building a dataset, evaluate your base SLM against existing Persian-specialized or Persian-adapted baselines.

Examples worth inspecting include:

Dorna-Llama3-8B-Instruct
PersianMind
Persian-Phi
current models listed on Open Persian LLM Leaderboard
current models listed on MIZAN

The goal is not necessarily to use those models directly. The goal is to learn what type of gap you are dealing with.

Observation	Likely implication
Your SLM is weak at basic Persian fluency	SFT alone may not be enough; consider model choice, continued pretraining, or tokenizer issues.
It is fluent but bad at following instructions	SFT data may help.
It follows instructions but lacks local facts	Use RAG or domain knowledge data; do not expect generic chat data to fix this reliably.
It gives unsafe or culturally odd answers	You need alignment/safety/culture data, not just more QA.
It fails long Persian inputs because of token length	Measure tokenizer fertility and truncation before scaling the dataset.
Larger Persian models are much better	Maybe the task is too hard for the chosen SLM size, or needs a narrower scope.

5. Inspect existing Persian resources before generating from scratch

Before creating 50k–100k synthetic examples, inspect existing Persian datasets and corpora.

Useful examples:

Resource	How I would use it
FarsInstruct, paper, GitHub	A major Persian instruction-following resource. Useful because it is not just generic chat data; it covers multiple Persian NLP task types and instruction templates.
MatinaAI instruction tuning / alignment datasets	Useful for cultural-alignment and Persian-focused instruction data patterns. Check access, license, and intended use carefully.
Matina Persian Text Corpus	More relevant to language adaptation / continued pretraining than ordinary SFT. Useful if the model’s Persian base ability is weak.
ParsBench datasets/models	Useful for Persian task/evaluation exploration and possible private benchmark inspiration.
PQuAD	Persian reading comprehension / QA resource. Good example of task-specific Persian data.
FarsTail	Persian textual entailment resource. Useful for NLI-style evaluation or training inspiration.
Community Persian SFT/COT collections, e.g. xmanii Persian SFT/COT collection	Potential bootstrapping material, but inspect carefully. Do not assume translation-based community datasets are automatically high quality.

For community SFT datasets, I would check:

Is it native Persian, translated Persian, or synthetic Persian?
What model generated it?
What was the source dataset?
Is the license compatible with your use?
Are there duplicate examples?
Does it contain benchmark items?
Does it use the schema you need?
Are the answers natural in Persian?
Are there hidden English artifacts?
Does it match your target register and domain?
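As a rough sketch of the duplicate check in the list above: exact duplicates in Persian text often hide behind Arabic vs. Persian code points, diacritics, and spacing, so I would compare a normalized key rather than the raw string. The character mapping and the function names here are my own illustrative assumptions, not a complete Persian normalizer, and the normalized form is meant for matching only.

```python
import hashlib
import re
import unicodedata

# Arabic code points that are commonly mapped to their Persian forms.
# Illustrative assumption only; a real normalizer covers many more cases.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
    "\u200c": " ",       # zero-width non-joiner -> space (for matching only)
})

def normalize_for_dedup(text: str) -> str:
    """Build a matching key; do not write this form back into the dataset."""
    text = unicodedata.normalize("NFC", text).translate(CHAR_MAP)
    text = re.sub(r"[\u064b-\u0652]", "", text)  # drop Arabic diacritics
    return re.sub(r"\s+", " ", text).strip().lower()

def find_exact_duplicates(texts):
    """Return (duplicate_index, first_seen_index) pairs after normalization."""
    seen = {}
    duplicates = []
    for i, text in enumerate(texts):
        key = hashlib.sha1(normalize_for_dedup(text).encode("utf-8")).hexdigest()
        if key in seen:
            duplicates.append((i, seen[key]))
        else:
            seen[key] = i
    return duplicates
```

This only catches exact matches after normalization. Near-duplicates (paraphrases, templated prompts) need a fuzzier method and manual spot checks.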

6. Decide which layer you actually need to improve

For a Persian SLM assistant, there are several different intervention layers. SFT is only one of them.

If the problem is…	Better first intervention
Poor basic Persian language modeling	Persian continued pretraining / language adaptation / better base model
Bad instruction following	SFT on instruction-following data
Bad multi-turn chat	Multi-turn SFT examples and multi-turn eval
Bad local/cultural behavior	Culturally grounded Persian examples and human/native review
Bad safety/refusal behavior	Safety/alignment dataset and Persian alignment eval
Bad factual/domain answers	RAG, domain corpus, expert-validated QA, or domain SFT
Bad retrieval	Better Persian embeddings/rerankers, FaMTEB/PTEB-like eval
Bad formatting	Schema cleanup, chat template alignment, loss masking
Too many tokens for Persian inputs	Tokenizer/model choice, shorter examples, context-aware data design

This is why evaluation first is useful: it tells you which layer to work on.

7. A practical dataset-building pipeline

A practical workflow could look like this.

Stage	Output	Notes
1. Define target	One short target statement	Example: “Iranian Persian customer-support assistant for <domain>.”
2. Choose public eval anchors	3–5 public benchmarks/leaderboards	Use them for orientation, not for training data.
3. Build private eval	100–500 examples	Include the actual tasks users will ask. Keep it held out.
4. Evaluate base model	Failure report	Measure before training.
5. Compare baselines	Baseline table	Compare against Persian-specialized models if possible.
6. Inspect existing data	Dataset inventory	FarsInstruct, Matina, ParsBench, domain datasets, etc.
7. Build seed set	200–2,000 high-quality examples	Manual/native-reviewed examples are valuable.
8. Expand synthetically	Larger candidate pool	Use teacher LLMs carefully; do not blindly trust outputs.
9. Filter	Clean training set	Fluency, correctness, diversity, safety, format, license.
10. Deduplicate/decontaminate	Train/dev/test split	Remove duplicates and eval leakage.
11. Train	SFT/LoRA/QLoRA run	Use correct chat template and loss masking.
12. Re-evaluate	New failure report	Add targeted examples based on failures.

The key loop is:

evaluate → inspect failures → add targeted data → train → re-evaluate

not:

generate a huge dataset → train once → hope quality improves

8. Build a small private eval set

Public leaderboards are useful, but they will not perfectly match your application.

I would create a small private eval set early. Even 100–300 examples can be very useful if they are well chosen.

Suggested categories:

Category	Example
Basic Persian assistant	“Explain <topic> in simple Persian.”
Instruction constraints	“Answer in exactly 3 bullet points.”
Multi-turn follow-up	User corrects or narrows the previous request.
Local/cultural knowledge	Iranian holidays, etiquette, food, education, bureaucracy, etc.
Domain task	Your real target domain.
Safety/refusal	Harmful, privacy-sensitive, or ethically sensitive requests.
Formatting	JSON, markdown table, numbered list, citation format.
RAG/grounding	Answer using provided documents only.
Robustness	Ambiguous questions, typos, mixed Persian-English, informal spelling.

A simple private eval JSONL schema:

{"id":"eval_0001","category":"instruction_following","language":"fa","input":"<Persian user prompt>","expected_behavior":"Follow all constraints; answer in Persian; no extra sections.","must_include":[],"must_not_include":[],"notes":"Private held-out eval. Do not train on this."}
{"id":"eval_0002","category":"safety","language":"fa","input":"<Persian unsafe request>","expected_behavior":"Refuse briefly in Persian and offer a safe alternative.","must_include":[],"must_not_include":[],"notes":"Private held-out eval. Do not train on this."}

Do not overcomplicate the first version. A small, stable, private eval is better than no eval.
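A minimal sketch of how the schema above could be scored, assuming each line has the `id`, `category`, `input`, `must_include`, and `must_not_include` fields shown. The `generate` argument is a hypothetical callable that wraps whatever model you are testing. The `expected_behavior` field is not checked here; that part still needs a human or a judge model.

```python
import json

def check_response(item: dict, response: str) -> dict:
    """Apply the simple string rules from one eval item to a model response."""
    missing = [s for s in item.get("must_include", []) if s not in response]
    forbidden = [s for s in item.get("must_not_include", []) if s in response]
    return {
        "id": item["id"],
        "category": item["category"],
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }

def score_eval(jsonl_lines, generate) -> dict:
    """Run generate(prompt) -> str on every item; return pass rate per category."""
    totals, passes = {}, {}
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        item = json.loads(line)
        result = check_response(item, generate(item["input"]))
        cat = result["category"]
        totals[cat] = totals.get(cat, 0) + 1
        passes[cat] = passes.get(cat, 0) + int(result["passed"])
    return {cat: passes[cat] / totals[cat] for cat in totals}
```

Items with empty `must_include` and `must_not_include` lists pass trivially, so treat the pass rate as a floor check, not as a quality score.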

9. Do not train on your evaluation data

This is important.

If you use PersianMMLU, MIZAN, ELAB, Persian IFEval, Persian MT-Bench, or leaderboard samples to guide your work, do not copy those examples into your SFT data.

Also avoid putting held-out benchmark examples into prompts for synthetic data generation.

Bad pattern:

Take these PersianMMLU examples and generate many similar examples.

Better pattern:

We need examples that test high-school-level Persian science explanations, but do not copy from any benchmark. Generate new questions from independently sourced material that is not in the held-out eval set.

Use public benchmarks to define categories and failure modes, not to create near-duplicate training examples.

This matters because contamination can make a model look better on a leaderboard without actually becoming more useful.
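One simple way to check for leakage is word n-gram overlap between training candidates and eval items. The sketch below is an assumption-heavy baseline: the function names are mine, the window size `n=8` is an arbitrary starting point, and whitespace splitting is a crude tokenizer for Persian. It will not catch paraphrased or translated copies.

```python
def word_ngrams(text: str, n: int = 8) -> set:
    """Set of whitespace-token n-grams; texts shorter than n are kept whole."""
    tokens = text.split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts, n: int = 8) -> set:
    """Collect every n-gram that appears in any held-out eval text."""
    index = set()
    for text in eval_texts:
        index |= word_ngrams(text, n)
    return index

def is_contaminated(train_text: str, eval_index: set, n: int = 8) -> bool:
    """True if the training text shares at least one n-gram with the eval set."""
    return not word_ngrams(train_text, n).isdisjoint(eval_index)
```

Note one limitation: an eval item shorter than `n` tokens is only matched against training texts of exactly the same tokens, so very short eval prompts need a smaller `n` or a separate exact-match check.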

10. Synthetic data can help, but only after the seed set is clear

Synthetic data is useful, especially for low-resource languages, but it needs quality control.

For Persian, I would be careful about:

Risk	What to do
Translationese	Have native/fluent reviewers check samples.
Cultural mismatch	Include Persian-local examples, not only translated English tasks.
Repetition	Deduplicate prompts and answers.
Overly generic assistant style	Add real target-domain tasks.
Wrong facts	Verify factual/domain answers.
Unsafe completions	Add safety filters and alignment eval.
Teacher model bias	Use multiple teacher models or human review for important categories.
Benchmark contamination	Keep eval examples out of generation prompts.

A safe synthetic expansion pattern:

Write 200–500 excellent seed examples.
Define task categories and style rules.
Generate candidate examples.
Filter automatically for schema/length/language.
Review samples manually.
Deduplicate.
Train a small run.
Evaluate.
Add more only where eval shows a gap.
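The automatic schema/length/language filter in the pattern above could start as cheaply as this. It is a sketch, assuming prompt-completion candidates with `prompt` and `completion` keys; the thresholds are placeholders to tune on your own data. The script check only tests for the Arabic script block, so it cannot tell Persian from Arabic or Urdu, and it says nothing about whether the Persian is natural.

```python
def persian_letter_ratio(text: str) -> float:
    """Share of alphabetic characters inside the Arabic block U+0600-U+06FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return in_block / len(letters)

def keep_candidate(example: dict, min_chars: int = 20, max_chars: int = 4000,
                   min_ratio: float = 0.7) -> bool:
    """Cheap pre-filter for one {"prompt": ..., "completion": ...} candidate."""
    prompt = example.get("prompt")
    completion = example.get("completion")
    if not isinstance(prompt, str) or not isinstance(completion, str):
        return False  # schema check
    if not (min_chars <= len(completion) <= max_chars):
        return False  # length check
    return persian_letter_ratio(completion) >= min_ratio  # script check
```

I would run this before manual review so reviewers spend their time on fluency and correctness, not on obviously broken rows.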

11. Data selection: quality, complexity, diversity

A useful mental model is to select data using three axes:

Axis	Meaning
Quality	Is the answer correct, natural, safe, and useful?
Complexity	Does it teach the model nontrivial behavior?
Diversity	Does it cover enough tasks, domains, styles, and difficulty levels?

Relevant papers:

LIMA: small, carefully curated instruction data can be surprisingly effective.
AlpaGasus: filtering noisy instruction data can improve results.
What Makes Good Data for Alignment? / Deita: quality, complexity, and diversity are useful selection dimensions.
Long Is More for Alignment: simple selection heuristics such as response length can be strong baselines.
Large-Scale Data Selection for Instruction Tuning: automatic data selection methods do not always scale cleanly, so evaluation and ablation matter.

For a Persian SLM, I would avoid both extremes:

only tiny “perfect” examples with no diversity
huge synthetic data dumps with no review

A balanced dataset is usually better.

12. Measure tokenizer cost for an SLM

For a small model, tokenization can matter a lot.

If Persian text becomes much longer than equivalent English text under your tokenizer, the model uses more context and compute just to represent the input. That can hurt training efficiency and inference quality.

Measure this before scaling your dataset.

Example:

# pip install transformers

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<model-or-tokenizer-name>")

samples = [
    "این یک جملهٔ نمونه به زبان فارسی است.",
    "لطفاً این متن را در سه bullet point خلاصه کن.",
    "کاربر می‌خواهد دربارهٔ <domain> توضیح ساده‌ای دریافت کند.",
]

for s in samples:
    ids = tokenizer(s, add_special_tokens=False)["input_ids"]
    words = s.split()
    print({
        "text": s,
        "chars": len(s),
        "words": len(words),
        "tokens": len(ids),
        "tokens_per_word": len(ids) / max(1, len(words)),
        "tokens_per_char": len(ids) / max(1, len(s)),
    })

Also check:

Metric	Why
average tokens per Persian example	Cost and context length
truncation rate	Whether long examples are being cut
tokens per word	Rough tokenization fertility
Persian vs English equivalent length	Whether the tokenizer is inefficient for Persian
mixed Persian-English examples	Important for technical assistants
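The dataset-level metrics in the table above can be collected with a small helper. This is a sketch: `length_report` and its `count_tokens` parameter are names I made up, and `count_tokens` can be any callable from text to token count, for example `lambda s: len(tokenizer(s)["input_ids"])` with the tokenizer loaded earlier.

```python
def length_report(texts, count_tokens, max_len: int) -> dict:
    """Summarize token lengths and truncation for a list of texts.

    count_tokens: callable mapping a string to its token count.
    max_len: the sequence length you plan to train with.
    """
    lengths = [count_tokens(t) for t in texts]
    words = [max(1, len(t.split())) for t in texts]
    n = len(lengths)
    if n == 0:
        return {"examples": 0}
    return {
        "examples": n,
        "avg_tokens": sum(lengths) / n,
        "max_tokens": max(lengths),
        "tokens_per_word": sum(lengths) / sum(words),
        # share of examples that would be cut at max_len
        "truncation_rate": sum(1 for x in lengths if x > max_len) / n,
    }
```

Running it once on a Persian sample and once on a roughly equivalent English sample gives the Persian vs. English comparison directly.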

If tokenizer cost is bad, possible responses include:

choose a better base model/tokenizer
shorten examples
reduce unnecessary verbosity
use narrower tasks
consider language adaptation / continued pretraining
avoid assuming SFT alone will solve everything

PersianMind is an example where vocabulary adaptation was part of the Persian model-building story, so this is not just a theoretical concern.

13. HF / TRL formatting matters

Good content can become bad training data if the format is wrong.

For Hugging Face TRL, common dataset formats include:

Conversational format

{"messages":[{"role":"system","content":"You are a helpful Persian assistant."},{"role":"user","content":"<Persian user message>"},{"role":"assistant","content":"<Persian assistant answer>"}]}

Prompt-completion format

{"prompt":"<Persian instruction>","completion":"<Persian answer>"}

Things to verify:

Issue	Why it matters
Correct chat template	Chat models expect specific control tokens and role formatting.
Assistant-only loss	You may want loss only on assistant responses, not user prompts.
Completion-only loss	For prompt-completion SFT, train on completions rather than prompts.
Multi-turn formatting	Roles and turn boundaries must be unambiguous.
System prompt consistency	Random system messages can create unstable behavior.
Persian punctuation/normalization	Inconsistency can add avoidable noise.
Train/inference match	The fine-tuning template should match how you will call the model later.

A dataset can look good in a spreadsheet and still fail because the model was trained on the wrong serialized chat format.
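Before training, I would run a structural check over every conversation. The sketch below assumes the conversational `messages` format shown above, with an optional system message first and strictly alternating user/assistant turns. That alternation rule is my assumption; relax it if your chat template allows other roles such as tools.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_messages(messages) -> list:
    """Return a list of problems for one conversation; empty means it looks fine."""
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty list"]
    problems = []
    expected = "user"   # first non-system turn should come from the user
    last_role = None
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        content = msg.get("content") if isinstance(msg, dict) else None
        last_role = role
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unknown role {role!r}")
            continue
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty content")
        if role == "system":
            if i != 0:
                problems.append(f"turn {i}: system message not at the start")
            continue
        if role != expected:
            problems.append(f"turn {i}: expected {expected}, got {role}")
        expected = "assistant" if role == "user" else "user"
    if last_role != "assistant":
        problems.append("conversation does not end with an assistant turn")
    return problems
```

This only validates structure. Whether the serialized text matches the model's template is a separate check: render a few examples with the tokenizer's chat template and read the output by eye.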

14. Minimal training data schema

For a Persian assistant SFT dataset, I would keep metadata. It makes filtering and later analysis much easier.

Example JSONL:

{"id":"sft_000001","messages":[{"role":"system","content":"You are a helpful Persian assistant."},{"role":"user","content":"<Persian user prompt>"},{"role":"assistant","content":"<Persian assistant response>"}],"source":"manual_seed","language":"fa","variety":"iranian_persian","domain":"general","quality_checked":true,"reviewer_type":"native_or_fluent","license":"<license>","notes":"Do not include eval examples."}
{"id":"sft_000002","messages":[{"role":"system","content":"You are a helpful Persian assistant."},{"role":"user","content":"<Persian user prompt>"},{"role":"assistant","content":"<Persian assistant response>"}],"source":"synthetic_reviewed","teacher_model":"<teacher-model>","language":"fa","variety":"iranian_persian","domain":"<domain>","quality_checked":true,"reviewer_type":"fluent","license":"<license>","notes":"Generated from category spec, not from benchmark examples."}

Useful metadata fields:

Field	Purpose
`id`	Deduplication and audit trail
`source`	manual, translated, synthetic, scraped, domain expert, etc.
`teacher_model`	Needed if synthetic
`language`	Persian/Farsi, Dari, Tajik, mixed, etc.
`variety`	Iranian Persian, Dari, Tajik, mixed
`domain`	general, medical, legal, education, support, etc.
`quality_checked`	Whether reviewed
`reviewer_type`	native, fluent, expert, automatic only
`license`	Reuse constraints
`eval_overlap_checked`	Whether contamination check was done
`notes`	Known caveats

15. Dataset publication quality

If you publish the dataset on Hugging Face, the dataset card is part of the quality.

At minimum, document:

Dataset card item	What to write
Intended use	SFT, evaluation, DPO, RAG, domain adaptation, etc.
Language scope	Iranian Persian, Dari, Tajik, code-switching, etc.
Data sources	Manual, translated, synthetic, scraped, domain documents
Generation process	Prompts, teacher models, translation method
Human review	Who reviewed, how much, what criteria
Filtering	Dedup, language ID, safety filters, length filters
Splits	Train/dev/test, held-out eval, no overlap policy
Contamination policy	What benchmarks were excluded
License	Dataset license and inherited source licenses
Limitations	Translation artifacts, domain gaps, safety gaps
Ethical notes	Bias, harmful content handling, privacy considerations

A dataset without documentation may be hard for others to trust, even if the examples look good.

16. Special case: domain assistants

If the assistant is for a domain, general Persian SFT data is not enough.

Domain	Extra requirement
Medical	Expert validation, conservative answers, disclaimers, Persian medical QA eval
Legal	Jurisdiction-specific knowledge, refusal boundaries, citations
Education	Curriculum alignment, grade level, step-by-step explanations
Customer support	Company/product-specific data, policy consistency
Religious/cultural	Careful cultural review, sensitivity, source grounding
News/current events	RAG and source freshness, not static SFT

For domain assistants, I would usually prefer:

small expert-reviewed SFT set
strong RAG pipeline
domain-specific private eval
refusal/uncertainty examples
citations or source-grounded answers

rather than a huge generic Persian chat dataset.

17. What I would avoid

I would avoid these patterns:

Pattern	Why risky
“Generate 100k Persian conversations with ChatGPT and fine-tune.”	Likely repetitive, translation-like, weakly targeted, and hard to audit.
Training on benchmark examples	Contamination and misleading scores.
Using only translated English instruction data	Persian cultural and linguistic naturalness may be weak.
No private eval set	You cannot tell what improved.
No baseline comparison	You may build data for a model that is simply the wrong base model.
No tokenizer check	SLM context/efficiency problems may be misdiagnosed as data quality problems.
Ignoring chat template	Bad formatting can erase the value of good examples.
No dataset card	Others cannot assess provenance, license, or limitations.
Treating leaderboard rank as product readiness	Leaderboards are useful signals, not deployment guarantees.

18. A compact roadmap

Here is the full process in one table.

Step	Action	Output
1	Define target assistant	Scope statement
2	Pick public Persian eval anchors	Benchmark list
3	Build private eval set	100–500 held-out examples
4	Evaluate base model	Failure categories
5	Compare Persian baselines	Model comparison
6	Measure tokenizer cost	Tokens/example, truncation rate
7	Inspect existing datasets	Data inventory
8	Decide missing layer	SFT, CPT, RAG, safety, domain, etc.
9	Build manual seed data	High-quality seed examples
10	Expand synthetically if useful	Candidate pool
11	Filter and review	Clean training set
12	Deduplicate/decontaminate	Safe train/dev/test split
13	Train with correct format	SFT/LoRA/QLoRA run
14	Re-evaluate	Failure report
15	Add targeted data	Next iteration
16	Document dataset	Dataset card

19. Final practical checklist

Before calling the dataset “high quality,” I would want to answer these questions:

What exact Persian variety/register is targeted?
Which public benchmarks did you inspect?
What is your private held-out eval set?
What are the base model’s main failures?
Did you compare against Persian-specialized baselines?
Is the data native, translated, synthetic, or mixed?
Who reviewed Persian fluency?
Are answers factually checked where needed?
Is there enough diversity of task type and difficulty?
Are safety/refusal examples included?
Are cultural/local examples included?
Are benchmark examples excluded from training?
Are duplicates removed?
Is the dataset formatted for the exact chat template?
Is assistant-only/completion-only loss handled correctly?
Have you measured tokenization cost?
Is the license clear?
Is the dataset card complete?

If most of these are answered, the dataset is much closer to “high quality” in a practical sense.

Bottom line

For a Persian SLM assistant, I would define dataset quality like this:

A high-quality dataset is a documented, decontaminated, eval-driven Persian dataset that targets the measured weaknesses of the chosen model: fluency, instruction following, multi-turn behavior, factuality, cultural fit, safety, domain knowledge, retrieval behavior, formatting, or tokenizer efficiency.

So the best first move is not to generate a huge dataset. The best first move is to build the evaluation map, measure the current model, and then create only the data that addresses the measured gaps.

Bidram · June 7, 2026, 6:55pm

I am targeting Iranian Persian only. By high-quality, I mean a natural Persian dataset with fluent grammar and correct syntax, aimed at student/teacher usage. Later, once the SLM is good enough, I want to build a voice assistant for teachers and students on low-end devices.

I am using Qwen 3.5 0.8B as the base model. I also tested Qwen 3 0.6B, but its tokenizer was very inefficient for Persian. Even with Qwen 3.5 0.8B, the tokenizer is still not very good for Persian, and the model struggles with simple Persian assistance tasks, grammar, and syntax.

Because of that, I do not think this problem can be fixed with an SFT dataset alone. The model needs continued pretraining (CPT). I can run CPT with LoRA rank 64, and it works well enough in practice. I tested this before with Qwen 3 0.6B trained on about 800 MB of cleaned Persian Wikipedia, and the model improved noticeably. Before that, it could barely write even two Persian words correctly.

For training data, I selected:

Persian Wikipedia (~2 GB)
Persian OSCAR (~3 GB, but very noisy)
Persian Aya dataset (~600 MB)

Right now, I am trying to train an n-gram model to identify whether a given text is natural and correct Persian. If the text does not meet that standard, I reject it. I should also mention that I clean the text as much as possible before sending it to the n-gram model.

At the moment, this is the best approach I can think of. I do not believe a simple scripted pipeline can reliably identify natural Persian quality on its own.

Because of my hardware limitations, I cannot use local LLMs for filtering. Smaller LLMs are not good enough at Persian, and larger ones are too expensive to run. For example, I get around 70 tokens/sec with Gemma E2B Q4 and around 40 tokens/sec with Gemma E4B Q4.

And I have two questions:

In continued pretraining (CPT), the model learns grammar, syntax, and additional knowledge, right? My understanding is that CPT changes the model more deeply, in a way that affects its outputs broadly, while SFT mainly teaches the model how to respond in the desired format and follow patterns more reliably. In other words, CPT builds the underlying language ability, while SFT guides that ability toward better answers. Is that understanding correct?
If I want the model to gain a larger amount of knowledge (not at a GPT level, but enough to remember or recall information from the CPT data), do I need to include everything I want it to remember in the CPT dataset?

I should also mention that API costs for AI models are very high in my country.

John6666 · June 8, 2026, 1:21am

Oh. That clarification narrows the focus quite a lot:

Short answer

Your understanding is mostly correct.

If the target is Iranian Persian only, the model is Qwen3.5-0.8B, and the model still struggles with basic Persian grammar/syntax, then I agree that this is probably not an SFT-only problem.

A useful mental model is:

Stage	Best for	Not ideal for
CPT / continued pretraining	Persian language grounding, grammar, syntax, orthography, style, broad domain familiarity	Teaching exact assistant behavior or reliable factual recall
SFT	Instruction following, teacher/student answer style, dialogue format, refusals, formatting, persona	Repairing weak base Persian language ability
RAG / retrieval	Reliable factual recall, changing knowledge, exact textbook/document facts	Improving the model’s internal Persian grammar

So I would summarize it like this:

CPT gives the model more Persian language mass. SFT tells it how to behave as an assistant. RAG or a small local knowledge base is better when exact knowledge must be recalled reliably.

Your experience with ~800 MB cleaned Persian Wikipedia improving Qwen3-0.6B is consistent with that.

1. CPT vs SFT: your understanding is mostly right

I would phrase it slightly more carefully:

CPT

Continued pretraining is still next-token prediction. It can move the model’s internal distribution toward Persian:

grammar
syntax
punctuation
orthography
style
common expressions
common factual associations
domain familiarity
local writing conventions

This is why CPT is often used for language adaptation. Meta’s overview of LLM adaptation also describes continued pretraining as useful when the goal is to add capabilities such as multilingual ability, while noting that it is more expensive and can risk forgetting: Adapting Large Language Models.

For low-resource language adaptation, a similar staged pattern appears in some work like:

CPT for language grounding
SFT for task/instruction specialization

Example: TibetanLLM: CPT + SFT for Tibetan language adaptation.

SFT

SFT is better for teaching the model:

how to answer as a teacher
how to follow instructions
how to produce student-friendly explanations
how to ask clarifying questions
how to refuse unsafe requests
how to use a particular chat format
how verbose or concise it should be
what style of Persian answer you want

But if the base model cannot produce stable Persian sentences, SFT often becomes inefficient. You may end up teaching answer patterns on top of a weak language foundation.

So I would agree with your main diagnosis:

If the model cannot reliably handle basic Persian grammar and syntax, CPT or language adaptation should come before serious SFT.

2. Consider whether to CPT the Base model or the post-trained model

Qwen provides both a Base checkpoint and a post-trained checkpoint.

If possible, I would consider this order:

Qwen3.5-0.8B-Base
  -> Persian CPT / language adaptation
  -> SFT for teacher/student assistant behavior
  -> optional preference tuning / DPO later

Why?

Because CPT on an already post-trained/instruct model can still work, but it may partially degrade instruction-following behavior. If you CPT the instruct/post-trained model, I would keep a small instruction-following regression eval and check it after every CPT run.

Practical rule:

If you use…	Watch for…
Base model	You need SFT afterward before it behaves like an assistant
Instruct/post-trained model	CPT may damage some instruction-following behavior
LoRA CPT	Safer/cheaper, but limited capacity compared with full CPT
Full CPT	More capacity, more cost, more forgetting risk

Since you can run LoRA rank 64 and it already helped in practice, it sounds like a reasonable constraint-aware approach.

3. Does the model need to see everything in CPT to remember it?

Partly yes, but with an important caveat.

If you want the model to become broadly familiar with some knowledge, the model must see that knowledge during training somehow. CPT exposure can help the model internalize patterns and associations.

But CPT is not a reliable database.

For exact knowledge recall, especially facts, dates, school content, rules, or domain-specific material, I would not rely only on parametric memory. A relevant comparison is Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs, which found that retrieval-augmented generation often outperforms unsupervised fine-tuning for knowledge-intensive tasks and that learning new factual information through unsupervised fine-tuning can be difficult.

A practical split:

Knowledge type	Better method
General Persian grammar and syntax	CPT
Common Iranian Persian writing patterns	CPT
General educational style	CPT + SFT
Teacher/student answer behavior	SFT
Exact textbook facts	RAG or local knowledge base
Changing/current facts	RAG/search
Small set of very frequent facts	CPT/SFT may be acceptable
High-stakes facts	Retrieval + citations + conservative answer style

For a low-end-device voice assistant, full RAG may be hard, but you can still think in layers:

Very common knowledge -> CPT/SFT
Exact or large knowledge -> compressed local KB / retrieval if possible
Teacher behavior -> SFT
Voice interface -> later ASR/TTS/latency problem

If you try to put all knowledge into CPT, the model may remember some of it, but recall will not be perfectly reliable, especially at 0.8B scale.

4. Your selected data sources make sense, but they have different roles

Your sources:

Persian Wikipedia
Persian OSCAR
Persian Aya

are not equivalent. I would not mix them blindly.

Source	Best use	Main risk
Persian Wikipedia	Clean-ish formal Persian, encyclopedic facts, stable style	Too encyclopedic; not conversational or teacher/student by itself
Persian OSCAR	Broader web Persian, more style diversity	Very noisy, mixed language, boilerplate, duplicates, spam
Persian Aya	Instruction-following data	More SFT-like than CPT-like; may not be ideal as raw CPT text

Wikipedia

Good for:

formal grammar
basic factual associations
clean-ish prose
general encyclopedic style

Risk:

the model may become too encyclopedia-like
not enough student/teacher dialogue
not enough conversational assistant style

OSCAR

OSCAR is useful, but I would treat it as raw material, not clean training data. The OSCAR 23.01 documentation mentions metadata such as KenLM-based harmful-content perplexity, TLSH hashes for near deduplication, sentence-level language identification, and quality warnings: OSCAR 23.01 docs.

That supports your instinct: OSCAR can be valuable, but only after strong cleaning.

Aya

Aya is more instruction-oriented. It may be useful for SFT or for a small instruction mixture, but I would not treat it the same way as Wikipedia/OSCAR for CPT.

For CPT, I would prefer raw fluent Persian prose.

For SFT, I would prefer instruction/response examples.

5. A better CPT mixture might be staged

Instead of one big mixture immediately, I would test small stages.

Example:

Stage	Data mixture	Goal
CPT-1	Clean Persian Wikipedia	Basic grammar, syntax, formal Persian
CPT-2	Wikipedia + filtered OSCAR	Broader style and web Persian
CPT-3	Add educational Persian prose if available	Teacher/student domain adaptation
SFT-1	Small high-quality teacher/student examples	Assistant behavior
SFT-2	More instruction/multi-turn examples	Dialogue robustness

I would avoid giving noisy OSCAR a large share of the mixture early on.

A safe first ratio could be something like:

70-90% clean Persian Wikipedia / curated Persian prose
10-30% strongly filtered OSCAR
0-10% instruction-like data, if converted carefully

This is not a universal ratio. It is just a safe starting point.

If filtered OSCAR improves eval, increase it. If it makes outputs noisier, reduce it.
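To make the ratio idea concrete, here is a minimal sketch of weighted sampling across sources. The source names, the toy documents, and the 0.8 / 0.15 / 0.05 weights are placeholders I picked for illustration, not recommended values, and the function samples with replacement, which a real shard builder may not want.

```python
import random

def build_mixture(sources, weights, n_docs, seed=0):
    """Draw n_docs documents, choosing the source of each draw by weight."""
    rng = random.Random(seed)
    names = list(sources)
    source_weights = [weights[name] for name in names]
    picked = []
    for _ in range(n_docs):
        name = rng.choices(names, weights=source_weights, k=1)[0]
        picked.append((name, rng.choice(sources[name])))  # with replacement
    return picked

# Hypothetical toy corpora standing in for the real cleaned shards.
sources = {
    "wikipedia": ["wiki doc 1", "wiki doc 2"],
    "oscar_filtered": ["oscar doc 1", "oscar doc 2"],
    "instruction_like": ["instruction doc 1"],
}
weights = {"wikipedia": 0.8, "oscar_filtered": 0.15, "instruction_like": 0.05}

mixture = build_mixture(sources, weights, n_docs=1000)
counts = {name: sum(1 for n, _ in mixture if n == name) for name in sources}
print(counts)
```

Changing one weight and rerunning the same eval is a cheap way to test whether more filtered OSCAR helps or hurts.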

6. The n-gram filter idea is good, but use it as one signal

I think your n-gram idea is practical under your constraints.

KenLM is a good fit for this kind of low-cost filtering because it is fast and small compared with neural LLM filtering.

But I would not use a single rule like:

low perplexity = good Persian
high perplexity = bad Persian

That can fail.

Why?

very repetitive boilerplate can have low perplexity
Wikipedia-like prose may be favored too much
perplexity on very short texts is unstable
copied templates may look fluent but be useless
unnatural but common web spam can get through
good informal Persian may be rejected if your n-gram LM was trained only on formal text

Instead, use n-gram scoring as one filter in a pipeline.

7. Good-LM / Bad-LM filtering may be stronger than one LM

A useful cheap approach:

Good Persian LM:
  trained on clean Persian Wikipedia + curated high-quality Persian text

Bad Persian LM:
  trained on rejected OSCAR samples, spam, boilerplate, malformed text, mixed-language junk

Score each candidate text:
  good_perplexity   (perplexity under the good LM)
  bad_perplexity    (perplexity under the bad LM)
  margin = bad_perplexity - good_perplexity   (larger = more like the good corpus)

Then select text that scores well under the good LM and poorly under the bad LM.

This is often more useful than a single perplexity threshold.

Rough idea:

accept if:
  good_perplexity is reasonable
  bad_perplexity is clearly higher than good_perplexity
  text length is reasonable
  Persian script ratio is high
  repetition is low
  near-duplicate similarity is low

Do not choose thresholds blindly. Sample 100 accepted and 100 rejected texts, read them, and adjust.
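The good-LM / bad-LM comparison can be sketched without any external dependency. The block below uses a tiny add-k smoothed character trigram model as a stand-in for KenLM (an assumption made so the example runs anywhere); `CharNgramLM`, `good_bad_margin`, and the toy training strings are all hypothetical. In a real pipeline the two models would be trained on your clean corpus and on your rejected samples.

```python
import math
from collections import Counter

START, END = "\x02", "\x03"  # padding markers that do not occur in normal text

class CharNgramLM:
    """Tiny add-k smoothed character n-gram model (illustrative, not KenLM)."""

    def __init__(self, texts, n=3, k=0.1):
        self.n, self.k = n, k
        self.ngrams, self.contexts = Counter(), Counter()
        self.vocab = set()
        for text in texts:
            padded = START * (n - 1) + text + END
            self.vocab.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def perplexity(self, text):
        padded = START * (self.n - 1) + text + END
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        log_prob, count = 0.0, 0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            p = (self.ngrams[gram] + self.k) / (self.contexts[gram[:-1]] + self.k * v)
            log_prob += math.log(p)
            count += 1
        return math.exp(-log_prob / count)

def good_bad_margin(text, good_lm, bad_lm):
    """Positive margin: the text looks more like the good corpus than the bad one."""
    return math.log(bad_lm.perplexity(text)) - math.log(good_lm.perplexity(text))

good_lm = CharNgramLM(["این یک جملهٔ فارسی تمیز است.", "زبان فارسی زبانی زیبا است."])
bad_lm = CharNgramLM(["click here!!! buy now!!!", "free free free download $$$"])

clean = "این جمله فارسی است."
spam = "buy now!!! click here"
print(good_bad_margin(clean, good_lm, bad_lm), good_bad_margin(spam, good_lm, bad_lm))
```

I used the difference of log-perplexities rather than raw perplexities so that the margin is less sensitive to scale, but that is a design choice you should check against your own samples.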

8. Cheap filtering pipeline under hardware/API constraints

Given your constraints, I would use a classical pipeline first.

Something like:

raw text
  -> normalization
  -> language/script filtering
  -> length filtering
  -> boilerplate removal
  -> repetition filtering
  -> exact dedup
  -> near dedup
  -> n-gram LM scoring
  -> optional fastText classifier
  -> manual sample audit
  -> CPT shard

Step 1: Normalize Persian

For Persian preprocessing, Hazm is useful. It provides Persian normalization, tokenization, lemmatization, and related tools.

Normalize things like:

Arabic/Persian variants of letters
spacing
half-space / ZWNJ issues
punctuation
repeated characters
strange Unicode artifacts
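As a rough illustration of what such normalization does, here is a stdlib-only sketch. It is not Hazm and covers far less than Hazm does; the character map, the choice to drop every format character except ZWNJ, and the "collapse 4+ repeats to 2" rule are my assumptions.

```python
import re
import unicodedata

ZWNJ = "\u200C"  # zero-width non-joiner, the Persian half-space

# Arabic code points commonly replaced by their Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
})

def normalize_persian(text):
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    # Drop control/format characters (BOM, zero-width space, direction marks),
    # but keep ZWNJ and ordinary whitespace.
    text = "".join(
        ch for ch in text
        if ch == ZWNJ or ch in "\n\t " or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    text = re.sub(r"(.)\1{3,}", r"\1\1", text)   # 4+ repeats of a character -> 2
    text = re.sub(ZWNJ + "{2,}", ZWNJ, text)      # collapse repeated ZWNJ
    text = re.sub(r"[ \t]+", " ", text)           # collapse spaces and tabs
    return text.strip()

print(normalize_persian("\u0643\u062A\u0627\u0628  \u0639\u0644\u064A"))
```

Whatever normalizer you use, apply the same one to the KenLM training text and to the candidate text, otherwise the scores are not comparable.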

Step 2: Script/language ratio

Use cheap rules:

Persian/Arabic-script character ratio
Latin character ratio
digit ratio
symbol ratio
average line length
number of URLs
number of repeated lines

Reject obvious junk before expensive scoring.
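These rules are cheap to compute. A minimal sketch, assuming the basic Arabic Unicode block (U+0600 to U+06FF) is a good enough proxy for Persian script; the function names and all thresholds are hypothetical starting points that you would tune by reading samples.

```python
import re

def script_stats(text):
    chars = [ch for ch in text if not ch.isspace()]
    total = max(len(chars), 1)
    # Persian digits live in the same block, so they count here and in digit_ratio.
    arabic_script = sum("\u0600" <= ch <= "\u06FF" for ch in chars)
    latin = sum(ch.isascii() and ch.isalpha() for ch in chars)
    digits = sum(ch.isdigit() for ch in chars)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "arabic_script_ratio": arabic_script / total,
        "latin_ratio": latin / total,
        "digit_ratio": digits / total,
        "url_count": len(re.findall(r"https?://\S+", text)),
        "repeated_line_ratio": 1 - len(set(lines)) / max(len(lines), 1),
    }

def looks_like_junk(text, min_arabic=0.6, max_latin=0.3, max_urls=3, max_repeat=0.3):
    s = script_stats(text)
    return (
        s["arabic_script_ratio"] < min_arabic
        or s["latin_ratio"] > max_latin
        or s["url_count"] > max_urls
        or s["repeated_line_ratio"] > max_repeat
    )

print(looks_like_junk("این یک متن فارسی است."))
print(looks_like_junk("Click here to win a FREE prize http://spam.example"))
```

Note that a script ratio cannot tell Persian from Arabic or Urdu, so it complements language identification rather than replacing it.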

Step 3: Deduplicate

Do both:

exact dedup
near dedup

For OSCAR 23.01, the documentation mentions TLSH hashes for near deduplication. If you are using a different OSCAR version, you may need your own MinHash/SimHash/TLSH pipeline.
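If you have to roll your own, the two stages can be sketched as follows. This version uses exact Jaccard similarity over character shingles, which is quadratic in the number of documents, so treat it as an illustration of the idea rather than something to run on all of OSCAR; MinHash or a similar scheme approximates the same comparison at scale. The function names and the 0.8 threshold are assumptions.

```python
import hashlib

def exact_dedup(docs):
    """Keep the first copy of each document, ignoring whitespace differences."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def shingles(text, n=5):
    """Set of overlapping character n-grams after whitespace normalization."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def near_dedup(docs, threshold=0.8):
    """Drop any document too similar to one already kept (O(n^2) comparisons)."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept
```

Running exact dedup first keeps the more expensive near-dedup pass smaller.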

Step 4: KenLM score

Use KenLM perplexity as a quality signal.

Train on your best available clean Persian text.

Then score candidate documents.

Step 5: Optional small classifier

If you manually label examples as good/bad Persian, you can train a cheap classifier.

fastText is useful for this kind of lightweight text classification. It is much cheaper than LLM filtering.

Example labels:

__label__good <text>
__label__bad <text>
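fastText expects one example per line, so newlines inside a document have to be flattened before writing the training file. A small sketch of that step (the helper names are mine); the resulting file is what you would pass to `fasttext.train_supervised(input=path)` from the official Python package.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    flat = " ".join(text.split())  # collapse newlines, tabs and repeated spaces
    return f"__label__{label} {flat}"

def write_fasttext_file(examples, path):
    """examples: iterable of (label, text) pairs, e.g. ("good", "...")."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in examples:
            f.write(to_fasttext_line(label, text) + "\n")

print(to_fasttext_line("good", "سلام\n  دنیا"))
```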

This can become surprisingly useful after a few thousand labeled examples.

9. What I would evaluate after each CPT run

Do not wait until the final model.

After each CPT run, check a small fixed eval set.

Eval	Why
Persian perplexity on held-out clean text	Did CPT improve Persian modeling?
Tokenization stats	Are examples being truncated?
Basic grammar prompts	Can it produce correct Persian sentences?
Teacher/student prompts	Did educational explanation improve?
Instruction-following prompts	Did CPT damage instruction-following?
Repetition tests	Did it become repetitive?
Mixed Persian-English prompts	Useful for technical/student settings
Safety/refusal sanity checks	Make sure it did not become less safe
Small factual probes	Did knowledge improve at all?

I would keep a small frozen eval like:

{"id":"fa_grammar_001","type":"grammar","prompt":"<Persian grammar prompt>","expected_behavior":"Produce fluent Iranian Persian."}
{"id":"teacher_001","type":"teacher","prompt":"<Student asks a basic question in Persian>","expected_behavior":"Explain simply, step by step, in Persian."}
{"id":"if_001","type":"instruction_following","prompt":"<Answer in exactly 3 bullet points in Persian>","expected_behavior":"Exactly 3 bullets, no extra text."}
{"id":"regression_001","type":"regression","prompt":"<Previously easy instruction prompt>","expected_behavior":"Should not degrade after CPT."}
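Some of these checks can be automated. As a sketch, the "exactly 3 bullets" style of prompt can be verified with a small rule-based checker; the bullet regex and the function names below are assumptions, and fluency or teaching quality still needs a human or a stronger judge.

```python
import json
import re

BULLET = re.compile(r"\s*([-*•]|\d+[.)])\s+\S")

def load_eval(jsonl_text):
    """Parse a JSONL string into a list of eval items."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def count_bullets(answer):
    return sum(1 for ln in answer.splitlines() if BULLET.match(ln))

def check_exactly_n_bullets(answer, n=3):
    """True only if the answer is exactly n bullet lines and nothing else."""
    lines = [ln for ln in answer.splitlines() if ln.strip()]
    return count_bullets(answer) == n and len(lines) == n

items = load_eval(
    '{"id":"if_001","type":"instruction_following","prompt":"..."}\n'
    '{"id":"regression_001","type":"regression","prompt":"..."}\n'
)
print([item["id"] for item in items])
print(check_exactly_n_bullets("- الف\n- ب\n- ج"))
```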

The regression part is important if you CPT an already post-trained/instruct model.
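One way to make that regression check mechanical is to compare per-type pass rates before and after a CPT run. A minimal sketch, assuming each result is a dict with a `type` and a boolean `passed` field (a schema I made up for illustration) and a 5-point tolerance:

```python
from collections import defaultdict

def pass_rates(results):
    """Fraction of passed items per eval type."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["type"]] += 1
        passed[r["type"]] += bool(r["passed"])
    return {t: passed[t] / totals[t] for t in totals}

def regressions(before, after, tolerance=0.05):
    """Eval types whose pass rate dropped by more than the tolerance."""
    b, a = pass_rates(before), pass_rates(after)
    return {t: (b[t], a[t]) for t in b if t in a and a[t] < b[t] - tolerance}

before = [
    {"type": "instruction_following", "passed": True},
    {"type": "instruction_following", "passed": True},
    {"type": "grammar", "passed": False},
]
after = [
    {"type": "instruction_following", "passed": True},
    {"type": "instruction_following", "passed": False},
    {"type": "grammar", "passed": True},
]
print(regressions(before, after))  # {'instruction_following': (1.0, 0.5)}
```

With only a handful of items per type the rates are noisy, so treat a flagged type as a prompt to read the outputs, not as a verdict.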

10. About LoRA CPT rank 64

LoRA CPT with rank 64 can be a reasonable compromise.

It probably will not have the same capacity as full CPT, but your empirical result matters: if it noticeably improved Qwen3-0.6B after Persian Wikipedia CPT, that is evidence that it is useful in your setup.

I would just watch for:

overfitting to Wikipedia style
loss of instruction-following
repetition
catastrophic forgetting
too much formal/encyclopedic tone
weak conversational style
weak teacher/student style

If you can afford it, run small ablations:

A: Wikipedia only
B: Wikipedia + filtered OSCAR
C: Wikipedia + filtered OSCAR + educational prose
D: same as B but fewer steps
E: same as B but different LoRA rank

Even small ablations can teach you more than one big run.

11. One important warning: do not overfit to “clean Persian” only

Your n-gram filter may become too strict.

If the filter only accepts very formal Wikipedia-style Persian, the model may become better at formal prose but not better as a student/teacher assistant.

For your target, you probably need at least three Persian styles:

Style	Example source
Formal Persian	Wikipedia, books, formal articles
Educational Persian	textbooks, explanations, lessons, student-facing content
Conversational Persian	teacher/student dialogue, Q&A, simple explanations

If you only use formal text, SFT will have to fight the CPT style later.

So I would keep a small amount of high-quality conversational/educational Persian, even if it is much smaller than the formal corpus.

12. SFT after CPT

After CPT, I would do SFT with a small, clean dataset.

Do not start with a huge SFT dataset.

Start with examples like:

explain a concept to a student
correct a student’s grammar
simplify a paragraph
ask a clarifying question
answer with examples
answer in short teacher style
refuse unsafe requests politely
handle mixed Persian-English technical terms
multi-turn follow-up

For TRL, check the dataset format and loss masking carefully.

Important:

CPT teaches language distribution.
SFT teaches assistant behavior.
Wrong chat template or wrong loss masking can waste good data.

If using chat data, make sure the model is trained on assistant outputs, not just random serialized conversations.
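To show what "trained on assistant outputs" means at the label level, here is a toy sketch. The role tags and the whitespace tokenizer are hypothetical stand-ins; a real setup must use the model's own chat template and tokenizer, and TRL can handle the masking for you depending on configuration. The only fixed fact used here is that -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

vocab = {}

def toy_tokenize(text):
    """Whitespace tokenizer that assigns ids on first sight (illustration only)."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def build_labels(messages, tokenize):
    """Return (input_ids, labels) with every non-assistant token masked out."""
    input_ids, labels = [], []
    for msg in messages:
        # Hypothetical template; real chat templates differ per model.
        ids = tokenize(f"<{msg['role']}> {msg['content']} </{msg['role']}>")
        input_ids.extend(ids)
        if msg["role"] == "assistant":
            labels.extend(ids)                        # loss is computed here
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # no loss on user/system text
    return input_ids, labels

messages = [
    {"role": "user", "content": "سلام"},
    {"role": "assistant", "content": "سلام، چطور کمک کنم؟"},
]
input_ids, labels = build_labels(messages, toy_tokenize)
print(labels)
```

Printing a few real training examples with their labels next to the decoded tokens is a quick way to confirm that the masking matches what you intended.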

13. My practical recommendation

Given your constraints, I would do this:

Phase 1: CPT data cleaning

Persian Wikipedia
  -> clean
  -> normalize
  -> dedup
  -> train Good KenLM

OSCAR Persian
  -> clean
  -> normalize
  -> language/script filter
  -> remove boilerplate
  -> dedup / near-dedup
  -> score with Good KenLM
  -> optionally score with Bad KenLM
  -> sample audit
  -> keep only high-confidence Persian

Phase 2: Small CPT runs

Run 1: Wikipedia only
Run 2: Wikipedia + filtered OSCAR
Run 3: add educational Persian prose if available

Compare them with the same held-out eval.

Phase 3: SFT

Small teacher/student Persian SFT
  -> 500 to 5,000 excellent examples first
  -> then expand only if eval shows benefit

Phase 4: Knowledge

For common background knowledge:
  CPT exposure is useful.

For exact educational facts:
  use RAG / local KB if possible.

For teacher style:
  SFT.

Bottom line

I think your direction is reasonable.

Your two core ideas are right:

If the model cannot write Persian well, CPT is probably needed before SFT.
A cheap n-gram-based quality filter is a practical idea under hardware/API constraints.

I would only refine the plan like this:

Use CPT for Persian language grounding, SFT for teacher/student assistant behavior, and RAG or a small local knowledge base for reliable factual recall.

And for filtering:

Use the n-gram model as one signal, not the only signal. Combine it with normalization, script/language filters, deduplication, repetition filters, OSCAR metadata when available, and manual sampling.
