Oh. That clarification narrows the focus quite a lot:
Short answer
Your understanding is mostly correct.
If the target is Iranian Persian only, the model is Qwen3.5-0.8B, and the model still struggles with basic Persian grammar/syntax, then I agree that this is probably not an SFT-only problem.
A useful mental model is:
| Stage | Best for | Not ideal for |
|---|---|---|
| CPT / continued pretraining | Persian language grounding, grammar, syntax, orthography, style, broad domain familiarity | Teaching exact assistant behavior or reliable factual recall |
| SFT | Instruction following, teacher/student answer style, dialogue format, refusals, formatting, persona | Repairing weak base Persian language ability |
| RAG / retrieval | Reliable factual recall, changing knowledge, exact textbook/document facts | Improving the model’s internal Persian grammar |
So I would summarize it like this:
CPT gives the model more Persian language mass. SFT tells it how to behave as an assistant. RAG or a small local knowledge base is better when exact knowledge must be recalled reliably.
Your experience with ~800 MB cleaned Persian Wikipedia improving Qwen3-0.6B is consistent with that.
1. CPT vs SFT: your understanding is mostly right
I would phrase it slightly more carefully:
CPT
Continued pretraining is still next-token prediction. It can move the model’s internal distribution toward Persian:
- grammar
- syntax
- punctuation
- orthography
- style
- common expressions
- common factual associations
- domain familiarity
- local writing conventions
This is why CPT is often used for language adaptation. Meta’s overview of LLM adaptation, "Adapting Large Language Models", also describes continued pretraining as useful when the goal is to add capabilities such as multilingual ability, while noting that it is more expensive and can risk forgetting.
For low-resource language adaptation, a similar staged pattern appears in some work like:
- CPT for language grounding
- SFT for task/instruction specialization
Example: TibetanLLM: CPT + SFT for Tibetan language adaptation.
SFT
SFT is better for teaching the model:
- how to answer as a teacher
- how to follow instructions
- how to produce student-friendly explanations
- how to ask clarifying questions
- how to refuse unsafe requests
- how to use a particular chat format
- how verbose or concise it should be
- what style of Persian answer you want
But if the base model cannot produce stable Persian sentences, SFT often becomes inefficient. You may end up teaching answer patterns on top of a weak language foundation.
So I would agree with your main diagnosis:
If the model cannot reliably handle basic Persian grammar and syntax, CPT or language adaptation should come before serious SFT.
2. Consider whether to CPT the Base model or the post-trained model
Qwen provides both a Base model and a post-trained model.
If possible, I would consider this order:
Qwen3.5-0.8B-Base
-> Persian CPT / language adaptation
-> SFT for teacher/student assistant behavior
-> optional preference tuning / DPO later
Why?
Because CPT on an already post-trained/instruct model can still work, but it may partially degrade instruction-following behavior. If you CPT the instruct/post-trained model, I would keep a small instruction-following regression eval and check it after every CPT run.
Practical rule:
| If you use… | Watch for… |
|---|---|
| Base model | You need SFT afterward before it behaves like an assistant |
| Instruct/post-trained model | CPT may damage some instruction-following behavior |
| LoRA CPT | Safer/cheaper, but limited capacity compared with full CPT |
| Full CPT | More capacity, more cost, more forgetting risk |
Since you can run LoRA rank 64 and it already helped in practice, it sounds like a reasonable constraint-aware approach.
3. Does the model need to see everything in CPT to remember it?
Partly yes, but with an important caveat.
If you want the model to become broadly familiar with some knowledge, the model must see that knowledge during training somehow. CPT exposure can help the model internalize patterns and associations.
But CPT is not a reliable database.
For exact knowledge recall, especially facts, dates, school content, rules, or domain-specific material, I would not rely only on parametric memory. A relevant comparison is "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs", which found that retrieval-augmented generation often outperforms unsupervised fine-tuning for knowledge-intensive tasks and that learning new factual information through unsupervised fine-tuning can be difficult.
A practical split:
| Knowledge type | Better method |
|---|---|
| General Persian grammar and syntax | CPT |
| Common Iranian Persian writing patterns | CPT |
| General educational style | CPT + SFT |
| Teacher/student answer behavior | SFT |
| Exact textbook facts | RAG or local knowledge base |
| Changing/current facts | RAG/search |
| Small set of very frequent facts | CPT/SFT may be acceptable |
| High-stakes facts | Retrieval + citations + conservative answer style |
For a low-end-device voice assistant, full RAG may be hard, but you can still think in layers:
Very common knowledge -> CPT/SFT
Exact or large knowledge -> compressed local KB / retrieval if possible
Teacher behavior -> SFT
Voice interface -> later ASR/TTS/latency problem
If you try to put all knowledge into CPT, the model may remember some of it, but recall will not be perfectly reliable, especially at 0.8B scale.
4. Your selected data sources make sense, but they have different roles
Your sources:
- Persian Wikipedia
- Persian OSCAR
- Persian Aya
are not equivalent. I would not mix them blindly.
| Source | Best use | Main risk |
|---|---|---|
| Persian Wikipedia | Clean-ish formal Persian, encyclopedic facts, stable style | Too encyclopedic; not conversational or teacher/student by itself |
| Persian OSCAR | Broader web Persian, more style diversity | Very noisy, mixed language, boilerplate, duplicates, spam |
| Persian Aya | Instruction-following data | More SFT-like than CPT-like; may not be ideal as raw CPT text |
Wikipedia
Good for:
- formal grammar
- basic factual associations
- clean-ish prose
- general encyclopedic style
Risk:
- the model may become too encyclopedia-like
- not enough student/teacher dialogue
- not enough conversational assistant style
OSCAR
OSCAR is useful, but I would treat it as raw material, not clean training data. The OSCAR 23.01 documentation mentions metadata such as KenLM-based harmful-content perplexity, TLSH hashes for near deduplication, sentence-level language identification, and quality warnings: OSCAR 23.01 docs.
That supports your instinct: OSCAR can be valuable, but only after strong cleaning.
Aya
Aya is more instruction-oriented. It may be useful for SFT or for a small instruction mixture, but I would not treat it the same way as Wikipedia/OSCAR for CPT.
For CPT, I would prefer raw fluent Persian prose.
For SFT, I would prefer instruction/response examples.
5. A better CPT mixture might be staged
Instead of one big mixture immediately, I would test small stages.
Example:
| Stage | Data mixture | Goal |
|---|---|---|
| CPT-1 | Clean Persian Wikipedia | Basic grammar, syntax, formal Persian |
| CPT-2 | Wikipedia + filtered OSCAR | Broader style and web Persian |
| CPT-3 | Add educational Persian prose if available | Teacher/student domain adaptation |
| SFT-1 | Small high-quality teacher/student examples | Assistant behavior |
| SFT-2 | More instruction/multi-turn examples | Dialogue robustness |
I would avoid making noisy OSCAR too large early.
A safe first ratio could be something like:
70-90% clean Persian Wikipedia / curated Persian prose
10-30% strongly filtered OSCAR
0-10% instruction-like data, if converted carefully
This is not a universal ratio. It is just a safe starting point.
If filtered OSCAR improves eval, increase it. If it makes outputs noisier, reduce it.
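To make the ratio concrete, here is a minimal sketch of turning mixture weights into per-source document budgets. The source names, the weights, and the `mixture_budget` helper are all hypothetical; in practice you may prefer to budget by tokens rather than documents.

```python
def mixture_budget(total_docs, weights):
    """Split a document budget across sources in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    budget = {
        name: int(total_docs * w / total_weight) for name, w in weights.items()
    }
    # Integer truncation can leave a few documents unassigned;
    # give the remainder to the largest-weight source.
    remainder = total_docs - sum(budget.values())
    largest = max(weights, key=weights.get)
    budget[largest] += remainder
    return budget


# Hypothetical starting mixture inside the ranges suggested above (percent).
weights = {"wikipedia": 80, "oscar_filtered": 15, "instruction_like": 5}
print(mixture_budget(10_000, weights))
```

Keeping the weights in one small dictionary makes it easy to rerun the same pipeline with more or less OSCAR once the eval results are in.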
6. The n-gram filter idea is good, but use it as one signal
I think your n-gram idea is practical under your constraints.
KenLM is a good fit for this kind of low-cost filtering because it is fast and small compared with neural LLM filtering.
But I would not use a single rule like:
low perplexity = good Persian
high perplexity = bad Persian
That can fail.
Why?
- very repetitive boilerplate can have low perplexity
- Wikipedia-like prose may be favored too much
- short junk text can be unstable
- copied templates may look fluent but be useless
- unnatural but common web spam can get through
- good informal Persian may be rejected if your n-gram LM was trained only on formal text
Instead, use n-gram scoring as one filter in a pipeline.
7. Good-LM / Bad-LM filtering may be stronger than one LM
A useful cheap approach:
Good Persian LM:
trained on clean Persian Wikipedia + curated high-quality Persian text
Bad Persian LM:
trained on rejected OSCAR samples, spam, boilerplate, malformed text, mixed-language junk
Score each candidate text:
good_lm_score
bad_lm_score
difference = bad_lm_score - good_lm_score (scores are perplexities, so a larger difference is better)
Then select text that looks good under the good LM and not good under the bad LM.
This is often more useful than a single perplexity threshold.
Rough idea:
accept if:
good_perplexity is reasonable
bad_perplexity is worse
text length is reasonable
Persian script ratio is high
repetition is low
duplicate score is low
Do not choose thresholds blindly. Sample 100 accepted and 100 rejected texts, read them, and adjust.
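The accept rule can be sketched as a single predicate over precomputed signals. Everything here is an assumption for illustration: the `Signals` fields, the `accept` function, and especially the threshold values, which are placeholders to be tuned by reading accepted and rejected samples.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    good_ppl: float             # perplexity under the "good" LM
    bad_ppl: float              # perplexity under the "bad" LM
    n_chars: int                # document length in characters
    persian_ratio: float        # share of Persian/Arabic-script characters
    repeated_line_ratio: float  # share of lines that repeat an earlier line
    is_duplicate: bool          # flagged by exact or near dedup


# Placeholder thresholds, not recommendations.
MAX_GOOD_PPL = 500.0
MIN_PPL_MARGIN = 0.0
MIN_CHARS, MAX_CHARS = 200, 100_000
MIN_PERSIAN_RATIO = 0.7
MAX_REPEATED_LINE_RATIO = 0.2


def accept(s: Signals) -> bool:
    """Keep a document only if every cheap signal looks acceptable."""
    return (
        s.good_ppl <= MAX_GOOD_PPL
        and (s.bad_ppl - s.good_ppl) > MIN_PPL_MARGIN
        and MIN_CHARS <= s.n_chars <= MAX_CHARS
        and s.persian_ratio >= MIN_PERSIAN_RATIO
        and s.repeated_line_ratio <= MAX_REPEATED_LINE_RATIO
        and not s.is_duplicate
    )
```

Because the rule is a plain conjunction, it is easy to log which condition rejected a document, which helps during the manual audit.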
8. Cheap filtering pipeline under hardware/API constraints
Given your constraints, I would use a classical pipeline first.
Something like:
raw text
-> normalization
-> language/script filtering
-> length filtering
-> boilerplate removal
-> repetition filtering
-> exact dedup
-> near dedup
-> n-gram LM scoring
-> optional fastText classifier
-> manual sample audit
-> CPT shard
Step 1: Normalize Persian
For Persian preprocessing, Hazm is useful. It provides Persian normalization, tokenization, lemmatization, and related tools.
Normalize things like:
- Arabic/Persian variants of letters
- spacing
- half-space / ZWNJ issues
- punctuation
- repeated characters
- strange Unicode artifacts
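As a rough illustration of what such normalization involves, here is a standard-library sketch. It is not a substitute for Hazm and covers only a few of the cases listed above; the `normalize_persian` function and the exact rules (which letters to map, how many repeats to allow) are assumptions.

```python
import re
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"

# Map common Arabic code points to their Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})


def normalize_persian(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    # Drop control/format characters (BOMs, direction marks, ...),
    # but keep ZWNJ, newlines, and tabs.
    text = "".join(
        ch for ch in text
        if ch in (ZWNJ, "\n", "\t")
        or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    text = re.sub(ZWNJ + r"{2,}", ZWNJ, text)   # collapse repeated ZWNJ
    text = re.sub(r"(.)\1{3,}", r"\1\1", text)  # cap runs of 4+ identical chars at 2
    text = re.sub(r"[ \t]+", " ", text)         # collapse spaces and tabs
    return text.strip()
```

Whatever rules you settle on, apply the same normalization to the KenLM training text, the candidate documents, and the final CPT shards so the scores stay comparable.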
Step 2: Script/language ratio
Use cheap rules:
Persian/Arabic-script character ratio
Latin character ratio
digit ratio
symbol ratio
average line length
number of URLs
number of repeated lines
Reject obvious junk before expensive scoring.
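These rules are cheap enough to compute for every document. A minimal sketch, assuming the main Arabic Unicode block (U+0600 to U+06FF) is a good enough proxy for Persian script; the `cheap_stats` function and its field names are hypothetical:

```python
import re

URL_RE = re.compile(r"https?://\S+")


def is_arabic_script(ch: str) -> bool:
    # The main Arabic block also contains the Persian-specific letters.
    return "\u0600" <= ch <= "\u06ff"


def cheap_stats(text: str) -> dict:
    chars = [ch for ch in text if not ch.isspace()]
    n = max(len(chars), 1)  # avoid division by zero on empty input
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    repeated = 1 - len(set(lines)) / len(lines) if lines else 0.0
    return {
        "arabic_script_ratio": sum(is_arabic_script(c) for c in chars) / n,
        "latin_ratio": sum(c.isascii() and c.isalpha() for c in chars) / n,
        "digit_ratio": sum(c.isdigit() for c in chars) / n,
        "url_count": len(URL_RE.findall(text)),
        "avg_line_len": sum(map(len, lines)) / max(len(lines), 1),
        "repeated_line_ratio": repeated,
    }
```

Note that script ratio alone cannot separate Persian from Arabic or Urdu, so it complements a language identifier rather than replacing one.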
Step 3: Deduplicate
Do both exact deduplication and near deduplication.
For OSCAR 23.01, the documentation mentions TLSH hashes for near deduplication. If you are using a different OSCAR version, you may need your own MinHash/SimHash/TLSH pipeline.
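For a small sample, both kinds of dedup can be sketched with the standard library. This version uses a SHA-256 key for exact matches and brute-force Jaccard similarity over character shingles for near matches. The brute-force comparison is quadratic, so it is only meant to show the idea; at OSCAR scale you would swap in MinHash, SimHash, or TLSH. The function names and the 0.8 threshold are assumptions.

```python
import hashlib


def canonical(text: str) -> str:
    # Collapse whitespace so trivial spacing differences do not matter.
    return " ".join(text.split())


def exact_key(text: str) -> str:
    return hashlib.sha256(canonical(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 5) -> set:
    c = canonical(text)
    return {c[i:i + k] for i in range(max(len(c) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def dedup(docs, near_threshold: float = 0.8):
    """Keep the first occurrence of each document; drop exact and near copies."""
    seen_keys, kept, kept_shingles = set(), [], []
    for doc in docs:
        key = exact_key(doc)
        if key in seen_keys:
            continue  # exact duplicate after whitespace normalization
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_keys.add(key)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

Running exact dedup first is cheap and removes most copies before the more expensive near-duplicate comparison.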
Step 4: KenLM score
Use KenLM perplexity as a quality signal.
Train on your best available clean Persian text.
Then score candidate documents.
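To show what the score measures, here is a toy stand-in for KenLM: a character bigram model with add-one smoothing, written with the standard library only. It is not KenLM and is far weaker than a real word- or subword-level n-gram model; the `CharBigramLM` class is purely illustrative. Perplexity is computed as the exponential of the negative mean log-probability per symbol.

```python
import math
from collections import Counter


class CharBigramLM:
    """Toy character bigram LM with add-one smoothing (illustration only)."""

    def __init__(self, corpus):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for line in corpus:
            symbols = ["<s>"] + list(line) + ["</s>"]
            self.vocab.update(symbols[1:])
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def perplexity(self, text: str) -> float:
        symbols = ["<s>"] + list(text) + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        log_prob = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        # exp of the negative average log-probability per predicted symbol
        return math.exp(-log_prob / (len(symbols) - 1))


# Tiny hypothetical "clean" corpus; a real one would be the cleaned Wikipedia text.
lm = CharBigramLM(["این یک جمله فارسی است", "این کتاب خوب است"])
in_domain = lm.perplexity("این کتاب است")
out_of_domain = lm.perplexity("xq zzv kkw")
```

Text that resembles the training corpus gets a lower perplexity than text that does not, which is the only property the filter relies on.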
Step 5: Optional small classifier
If you manually label examples as good/bad Persian, you can train a cheap classifier.
fastText is useful for this kind of lightweight text classification. It is much cheaper than LLM filtering.
Example labels:
__label__good <text>
__label__bad <text>
This can become surprisingly useful after a few thousand labeled examples.
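A small sketch of writing labeled examples in that format. fastText's supervised mode reads one example per line, so newlines inside a document have to be flattened first. The helper names are hypothetical.

```python
def to_fasttext_line(label: str, text: str) -> str:
    # One example per line: flatten newlines and repeated whitespace.
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"


def write_training_file(examples, path: str) -> None:
    """examples: iterable of (label, text) pairs, e.g. ("good", "...")."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in examples:
            f.write(to_fasttext_line(label, text) + "\n")
```

Applying the same Persian normalization here as in the rest of the pipeline keeps the classifier's input consistent with what it will see at filtering time.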
9. What I would evaluate after each CPT run
Do not wait until the final model.
After each CPT run, check a small fixed eval set.
| Eval | Why |
|---|---|
| Persian perplexity on held-out clean text | Did CPT improve Persian modeling? |
| Tokenization stats | Are examples being truncated? |
| Basic grammar prompts | Can it produce correct Persian sentences? |
| Teacher/student prompts | Did educational explanation improve? |
| Instruction-following prompts | Did CPT damage instruction-following? |
| Repetition tests | Did it become repetitive? |
| Mixed Persian-English prompts | Useful for technical/student settings |
| Safety/refusal sanity checks | Make sure it did not become less safe |
| Small factual probes | Did knowledge improve at all? |
I would keep a small frozen eval like:
{"id":"fa_grammar_001","type":"grammar","prompt":"<Persian grammar prompt>","expected_behavior":"Produce fluent Iranian Persian."}
{"id":"teacher_001","type":"teacher","prompt":"<Student asks a basic question in Persian>","expected_behavior":"Explain simply, step by step, in Persian."}
{"id":"if_001","type":"instruction_following","prompt":"<Answer in exactly 3 bullet points in Persian>","expected_behavior":"Exactly 3 bullets, no extra text."}
{"id":"regression_001","type":"regression","prompt":"<Previously easy instruction prompt>","expected_behavior":"Should not degrade after CPT."}
The regression part is important if you CPT an already post-trained/instruct model.
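Some of these checks can be automated. As a sketch, here is a loader for the JSONL eval file plus a mechanical check for the "exactly 3 bullet points" prompt. The bullet markers it recognizes and the function names are assumptions; fluency and explanation quality still need a human reader.

```python
import json

BULLET_MARKERS = ("-", "*", "•")


def load_eval(jsonl_text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def has_exactly_n_bullets(answer: str, n: int = 3) -> bool:
    """True if the answer is n bullet lines and nothing else."""
    lines = [ln.strip() for ln in answer.splitlines() if ln.strip()]
    bullets = [ln for ln in lines if ln.startswith(BULLET_MARKERS)]
    return len(bullets) == n and len(lines) == n
```

Running the same mechanical checks on the model before and after each CPT run gives a quick regression signal for the instruction-following prompts.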
10. About LoRA CPT rank 64
LoRA CPT with rank 64 can be a reasonable compromise.
It probably will not have the same capacity as full CPT, but your empirical result matters: if it noticeably improved Qwen3-0.6B after Persian Wikipedia CPT, that is evidence that it is useful in your setup.
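The capacity gap can be made concrete with a little arithmetic. For one weight matrix of shape d_out x d_in, LoRA with rank r trains two small matrices with r * (d_in + d_out) parameters in total, instead of d_in * d_out. The dimension below is hypothetical and is not taken from Qwen's actual architecture.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds A (rank x d_in) and B (d_out x rank) next to the frozen weight.
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    return d_in * d_out


d = 1024  # hypothetical square projection, for illustration only
ratio = lora_params(d, d, 64) / full_params(d, d)
print(f"rank-64 LoRA trains {ratio:.1%} of this matrix's parameters")
```

The fraction depends on the layer shapes and on which modules the adapters are attached to, so it is worth recomputing for the real configuration.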
I would just watch for:
- overfitting to Wikipedia style
- loss of instruction-following
- repetition
- catastrophic forgetting
- too much formal/encyclopedic tone
- weak conversational style
- weak teacher/student style
If you can afford it, run small ablations:
A: Wikipedia only
B: Wikipedia + filtered OSCAR
C: Wikipedia + filtered OSCAR + educational prose
D: same as B but fewer steps
E: same as B but different LoRA rank
Even small ablations can teach you more than one big run.
11. One important warning: do not overfit to “clean Persian” only
Your n-gram filter may become too strict.
If the filter only accepts very formal Wikipedia-style Persian, the model may become better at formal prose but not better as a student/teacher assistant.
For your target, you probably need at least three Persian styles:
| Style | Example source |
|---|---|
| Formal Persian | Wikipedia, books, formal articles |
| Educational Persian | textbooks, explanations, lessons, student-facing content |
| Conversational Persian | teacher/student dialogue, Q&A, simple explanations |
If you only use formal text, SFT will have to fight the CPT style later.
So I would keep a small amount of high-quality conversational/educational Persian, even if it is much smaller than the formal corpus.
12. SFT after CPT
After CPT, I would do SFT with a small, clean dataset.
Do not start with huge SFT.
Start with examples like:
- explain a concept to a student
- correct a student’s grammar
- simplify a paragraph
- ask a clarifying question
- answer with examples
- answer in short teacher style
- refuse unsafe requests politely
- handle mixed Persian-English technical terms
- multi-turn follow-up
For TRL, check the dataset format and loss masking carefully.
Important:
CPT teaches language distribution.
SFT teaches assistant behavior.
Wrong chat template or wrong loss masking can waste good data.
If using chat data, make sure the model is trained on assistant outputs, not just random serialized conversations.
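To illustrate what assistant-only loss masking means, here is a library-independent sketch that builds labels from already-tokenized segments. The token ids are made up, and the `build_labels` helper is hypothetical. TRL has its own mechanisms for this, so check the documentation for the version you use rather than treating this as its API. The value -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def build_labels(segments):
    """segments: list of (role, token_ids). Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # learn to produce these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only
    return input_ids, labels


# Made-up token ids for a two-turn conversation.
conversation = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = build_labels(conversation)
```

A quick sanity check before training is to decode only the positions where the label is not -100 and confirm that they contain assistant text and nothing else.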
13. My practical recommendation
Given your constraints, I would do this:
Phase 1: CPT data cleaning
Persian Wikipedia
-> clean
-> normalize
-> dedup
-> train Good KenLM
OSCAR Persian
-> clean
-> normalize
-> language/script filter
-> remove boilerplate
-> dedup / near-dedup
-> score with Good KenLM
-> optionally score with Bad KenLM
-> sample audit
-> keep only high-confidence Persian
Phase 2: Small CPT runs
Run 1: Wikipedia only
Run 2: Wikipedia + filtered OSCAR
Run 3: add educational Persian prose if available
Compare them with the same held-out eval.
Phase 3: SFT
Small teacher/student Persian SFT
-> 500 to 5,000 excellent examples first
-> then expand only if eval shows benefit
Phase 4: Knowledge
For common background knowledge:
CPT exposure is useful.
For exact educational facts:
use RAG / local KB if possible.
For teacher style:
SFT.
Bottom line
I think your direction is reasonable.
Your two core ideas are right:
- If the model cannot write Persian well, CPT is probably needed before SFT.
- A cheap n-gram-based quality filter is a practical idea under hardware/API constraints.
I would only refine the plan like this:
Use CPT for Persian language grounding, SFT for teacher/student assistant behavior, and RAG or a small local knowledge base for reliable factual recall.
And for filtering:
Use the n-gram model as one signal, not the only signal. Combine it with normalization, script/language filters, deduplication, repetition filters, OSCAR metadata when available, and manual sampling.