How to use a text-only model -> [mistralai/Ministral-3-3B-Instruct-2512]

Unless you want to completely remove the vision encoder from the model, it’s not that difficult.


You can fine-tune mistralai/Ministral-3-3B-Instruct-2512 for text-only generation without using the vision encoder at all. You do it by (1) never passing image inputs and (2) freezing vision-side parameters so nothing in the vision path trains.

Important background: this checkpoint is multimodal by design. Per the official model card, it combines a ~3.4B-parameter language model with a ~0.4B-parameter vision encoder. (Hugging Face)


What “I don’t want the vision encoder” can mean

Meaning A (recommended): “I will never use images”

Do this:

  • Provide text-only prompts.
  • Compute the loss on text tokens only during fine-tuning.
  • Freeze vision weights so they are inert.

This gives you normal text-generator behavior while staying on the official checkpoint. (Hugging Face)

Meaning B (harder): “I want the vision weights removed to save memory”

That is not an official distribution format for the original repo. It typically means:

  • Converting the checkpoint to a different architecture layout.
  • Editing config and weights.
  • Accepting that conversion can introduce differences.

There are community conversions that claim “vision encoder removed” (example: a “TextOnly” Llama-format conversion). Treat those as third-party artifacts and validate carefully. (Hugging Face)


The minimum working software stack (this matters a lot)

Ministral 3 support relies on newer Transformers and Mistral’s tokenizer library (mistral-common). The official HF model card explicitly tells you to install Transformers from main for FP8 and to install mistral-common >= 1.8.6 for correct tokenization. (Hugging Face)

If you use an older stable Transformers release, you will likely hit import errors and unknown model-type errors. This is a very common failure mode. (Stack Overflow)

Recommended: train from BF16 weights

Use the BF16 checkpoint for fine-tuning. It is the same model family but avoids FP8 complexity. The official BF16 model card describes BF16 VRAM expectations and still includes the vision encoder as a component. (Hugging Face)


Step 1. Install (text-only fine-tuning friendly)

Use one of these patterns (pick one, do not mix randomly):

Option 1: Transformers v5 RC (often simplest)

Some Ministral family cards recommend the first v5 RC or main for Transformers and mistral-common >= 1.8.6. (Hugging Face)

Option 2: Transformers from main (needed for FP8 workflows)

The Instruct-2512 card specifically mentions installing Transformers from main for FP8 support and using mistral-common >= 1.8.6. (Hugging Face)

Practical note: if your environment cannot import Mistral3ForConditionalGeneration / MistralCommonBackend, you are almost always on the wrong Transformers build. (Stack Overflow)
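
A quick way to confirm the environment before loading anything is to read the installed versions. This is a minimal sketch using only the standard library; the helper names are made up for illustration, and the "5.0.0" floor for transformers is my assumption for "v5 RC or newer" (the model card only states mistral-common >= 1.8.6 explicitly).

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '1.8.6' or '5.0.0rc1' into a tuple of leading integers, e.g. (1, 8, 6)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric segments such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric release segments only; pre-release tags are ignored."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_package(name: str, minimum: str) -> str:
    """Return a one-line status string for an installed distribution."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    if meets_minimum(installed, minimum):
        return f"{name} {installed}: ok"
    return f"{name} {installed}: too old, need >= {minimum}"


if __name__ == "__main__":
    print(check_package("mistral-common", "1.8.6"))
    print(check_package("transformers", "5.0.0"))  # assumed floor, see note above
```

If either line reports "not installed" or "too old", fix that before debugging anything else.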


Step 2. Text-only inference (no images, no vision encoder usage)

Transformers’ own Ministral3 docs show usage with Mistral3ForConditionalGeneration and MistralCommonBackend. (Hugging Face)
You just remove the image inputs and keep the chat template.

# deps (conceptually):
# - transformers v5 RC/main
# - mistral-common >= 1.8.6
# - torch, accelerate

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Text-only chat. No images. No pixel_values.
messages = [
    {"role": "system", "content": "You write domain-specific text in the required style."},
    {"role": "user", "content": "Write a domain-style explanation of <TOPIC> with 3 bullet takeaways."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)

print(tokenizer.decode(out[0], skip_special_tokens=True))

Why mistral-common matters

Mistral models are trained with Mistral’s tokenization rules. There have been real-world mismatches between mistral_common and the generic tokenizers backend that can change token IDs for edge cases (escaped strings etc.). That is why the model card tells you to install mistral-common and why this mismatch was filed as a Transformers bug. (Hugging Face)
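
If you want to see whether a mismatch affects your own data, you can encode the same strings with both tokenizer backends and compare the resulting ID lists. The sketch below covers only the comparison step and is plain Python; `first_mismatch` and the probe strings are illustrative names I made up, and how you obtain the two ID lists depends on the backends you load.

```python
from itertools import zip_longest

# Hypothetical probe strings: escaped and format-heavy text is where
# mismatches have been reported, so test with samples from your own domain.
PROBES = [
    '{"path": "C:\\\\temp\\\\file.txt"}',
    "line one\\nline two",
    "plain sentence with no escapes",
]


def first_mismatch(ids_a, ids_b):
    """Return (index, id_a, id_b) for the first differing position, or None."""
    for index, (a, b) in enumerate(zip_longest(ids_a, ids_b)):
        if a != b:
            return index, a, b
    return None


if __name__ == "__main__":
    # Stand-in ID lists; replace with the output of the two backends.
    print(first_mismatch([1, 5, 9], [1, 5, 9]))
    print(first_mismatch([1, 5, 9], [1, 7, 9]))
```

A result of None for every probe suggests the two backends agree on that input; any tuple tells you the position where they diverge.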


Step 3. Make fine-tuning “language-only” in practice

Goal

  • Update only language behavior for your domain generation.
  • Keep vision components frozen and unused.

Two controls you should use

  1. Input discipline: never put images in your training examples.
  2. Parameter discipline: freeze vision parameters.
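
Input discipline is easy to enforce mechanically. Here is a minimal sketch that rejects any chat example carrying non-text content. It assumes the common chat layout where `content` is either a string or a list of chunks shaped like {"type": "text", ...}; the function name is made up, and your dataset schema may differ.

```python
def is_text_only(messages) -> bool:
    """True if every message content is a string or a list of text chunks."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            continue
        if isinstance(content, list) and all(
            isinstance(chunk, dict) and chunk.get("type") == "text"
            for chunk in content
        ):
            continue
        # Anything else (image chunks, missing content) fails this strict check.
        return False
    return True


if __name__ == "__main__":
    ok = [{"role": "user", "content": "Summarize this report."}]
    bad = [{"role": "user", "content": [{"type": "image_url", "image_url": "..."}]}]
    print(is_text_only(ok), is_text_only(bad))
```

Run it over the whole training set once and drop or fix the rows that fail before you start training.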

There is a subtle pitfall: in multimodal models, vision modules can also contain layer names like q_proj/k_proj/v_proj. So “LoRA target_modules” alone does not guarantee “LM-only LoRA.” A recent HF forum thread calls this out explicitly. (Hugging Face Forums)

Freeze vision parameters (simple, robust)

def freeze_vision(model):
    for name, p in model.named_parameters():
        n = name.lower()
        if "vision" in n or "image" in n or "pixel" in n:
            p.requires_grad = False

freeze_vision(model)

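This name-based filter is crude but effective for the base weights: even though the vision path still exists, its existing parameters will not train. With LoRA, also check where the adapters attach, because adapter weights are created after this step and are trainable by default.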
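
Freezing `requires_grad` controls which existing weights train; it does not decide where PEFT attaches new adapter weights. One way to keep adapters on the language side is to build the target list from full module names and skip anything that looks vision-related. This is a sketch under assumptions: the helper name is mine, the marker strings mirror the freeze function above, and the example module paths are hypothetical, so print your model's real names first.

```python
LORA_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj")
VISION_MARKERS = ("vision", "image", "pixel")


def language_only_targets(module_names, suffixes=LORA_SUFFIXES, markers=VISION_MARKERS):
    """Keep full module names whose last segment is a LoRA target and whose path has no vision marker."""
    selected = []
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]
        lowered = name.lower()
        if leaf in suffixes and not any(marker in lowered for marker in markers):
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Hypothetical module paths, for illustration only.
    names = [
        "model.language_model.layers.0.self_attn.q_proj",
        "model.language_model.layers.0.mlp.down_proj",
        "model.vision_tower.transformer.layers.0.attention.q_proj",
        "model.language_model.layers.0.input_layernorm",
    ]
    for target in language_only_targets(names):
        print(target)
```

In practice you would pass the names from `model.named_modules()` and hand the resulting list to `LoraConfig(target_modules=...)`, which accepts a list of module names.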


Step 4. Supervised fine-tuning (SFT) for domain text generation

For your use case (“domain-specific text generation”), the standard starting point is SFT: prompt → ideal completion pairs.

TRL’s SFTTrainer is the common “works-first” route. The official TRL docs show the basic pattern and explain that it can work with chat templates. (Hugging Face)

Dataset format you want

Store each row as either:

  • {"prompt": "...", "completion": "..."} (single turn), or
  • {"messages": [...]} (multi-turn chat)

If you want consistent style, put your style guide in the system message across the dataset.
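
If your data is already in prompt/completion form, converting it to the messages layout with a fixed system message takes a few lines. A minimal sketch, assuming one JSON object per line with "prompt" and "completion" keys; the function names and file paths are placeholders.

```python
import json


def to_messages(row, system_prompt):
    """Wrap one prompt/completion row as a three-message chat example."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["completion"]},
        ]
    }


def convert_jsonl(src_path, dst_path, system_prompt):
    """Rewrite a prompt/completion JSONL file as a messages JSONL file; returns the row count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            converted = to_messages(json.loads(line), system_prompt)
            dst.write(json.dumps(converted, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the system message in one constant means a later change to the style guide is a one-line edit plus a re-run of the conversion.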

Minimal SFT + LoRA recipe (text-only)

# deps:
# pip install trl peft datasets accelerate
# plus transformers v5 RC/main + mistral-common

import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"
tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Freeze vision
def freeze_vision(model):
    for name, p in model.named_parameters():
        n = name.lower()
        if "vision" in n or "image" in n or "pixel" in n:
            p.requires_grad = False

freeze_vision(model)

# LoRA config (common target modules)
# TRL guidance explains typical LoRA params and target_modules choices.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

SYSTEM = "You are a domain-specific generator. Follow the domain style guide."

def formatting_func(example):
    # example has: prompt, completion
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return tokenizer.apply_chat_template(msgs, tokenize=False)

ds = load_dataset("json", data_files={"train": "train.jsonl"})["train"]

cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_length=2048,  # older TRL releases name this max_seq_length
    packing=True,
    logging_steps=10,
    save_steps=200,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL releases name this argument tokenizer
    train_dataset=ds,
    peft_config=peft_config,
    args=cfg,
    formatting_func=formatting_func,
)

trainer.train()
trainer.save_model()

Why this structure

  • TRL’s SFTTrainer is built for this workflow. (Hugging Face)
  • PEFT LoRA reduces trainable params drastically. (Hugging Face)
  • Freezing vision avoids accidentally tuning vision blocks that share module names. (Hugging Face Forums)
  • Using the correct chat template avoids silent quality loss from format mismatch. (Mistral AI)

Step 5. If your GPU is small: QLoRA (4-bit) instead of BF16 LoRA

QLoRA means:

  • Load base model in 4-bit.
  • Train only LoRA adapters.

Hugging Face’s QLoRA overview is in the bitsandbytes 4-bit blog. (Hugging Face)
For 4-bit training, HF recommends the NF4 quantization type. (Hugging Face)
PEFT also documents the “quantize then train adapters” concept. (Hugging Face)

Minimal change (conceptually):

from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",   # recommended for training 4-bit base
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb,
)
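
To get a feel for why 4-bit helps, here is back-of-the-envelope arithmetic for weight storage only, using the parameter counts quoted from the model card earlier in this answer. Treat it as a rough lower bound: it ignores activations, optimizer state, LoRA weights, quantization constants, and any layers that stay in higher precision, and it assumes every weight is stored at the stated width.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Bytes needed to store n_params weights at the given bit width, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


LANGUAGE_PARAMS = 3.4e9  # approximate, from the model card
VISION_PARAMS = 0.4e9    # approximate, from the model card
TOTAL_PARAMS = LANGUAGE_PARAMS + VISION_PARAMS

if __name__ == "__main__":
    print(f"BF16 weights:  {weight_memory_gb(TOTAL_PARAMS, 16):.1f} GB")
    print(f"4-bit weights: {weight_memory_gb(TOTAL_PARAMS, 4):.1f} GB")
```

Under these assumptions the weights shrink from about 7.6 GB to about 1.9 GB, which is the headroom QLoRA buys you for activations and adapters.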

Pitfalls that commonly waste time

Pitfall 1: Wrong Transformers version

Symptom: missing Mistral3ForConditionalGeneration or MistralCommonBackend.
Fix: follow the model card’s guidance to use v5 RC/main and install mistral-common >= 1.8.6. (Hugging Face)

Pitfall 2: Tokenization mismatch

Symptom: model “sort of works” but has weird failures on logs, escaped strings, or format-heavy inputs.
Fix: use the Mistral tokenizer backend (mistral-common) as recommended, and treat edge cases seriously. (GitHub)

Pitfall 3: “LM-only LoRA” accidentally hits vision

Symptom: you think you tuned only LM layers, but adapters attach to vision blocks too.
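Fix: freeze vision, do not rely on bare target_modules names alone, and inspect the adapted module names after wrapping the model to confirm none sit under the vision tower. (Hugging Face Forums)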

Pitfall 4: Serving differences (vLLM)

If you later serve with vLLM, do not assume you can pass a chat template per request for Mistral tokenizers. vLLM explicitly errors or warns about chat_template for Mistral tokenizers. (vLLM)
Practical fix: render the prompt text yourself (apply chat template client-side) before sending to vLLM.


If you truly want “no vision encoder” at all

Your choices are:

  1. Stay official (recommended): ignore vision inputs, freeze vision weights. (Hugging Face)
  2. Use a third-party “TextOnly” conversion: broader compatibility with text-only tooling, but it is not the original architecture and its outputs can differ. Example conversions exist. (Hugging Face)
  3. Pick a pure text-only base model (different model family). This is often simpler if you never need images.

Quick checklist for your case

  • Use mistralai/Ministral-3-3B-Instruct-2512-BF16 for training. (Hugging Face)
  • Install Transformers v5 RC/main and mistral-common >= 1.8.6. (Hugging Face)
  • Always format prompts with Mistral chat templates. (Mistral AI)
  • Freeze vision weights. Train on text-only examples. (Hugging Face Forums)
  • Fine-tune with TRL SFTTrainer + PEFT LoRA, or QLoRA if VRAM is tight. (Hugging Face)

Summary bullets

  • The checkpoint includes a vision encoder. You cannot “turn it off” at the file level without conversion. (Hugging Face)
  • You can use it as text-only by never passing images and freezing vision parameters. (Hugging Face Forums)
  • Use Transformers v5 RC/main plus mistral-common >= 1.8.6 to avoid common setup and tokenization pitfalls. (Hugging Face)
  • Fine-tune with TRL SFTTrainer + PEFT LoRA, or QLoRA NF4 if VRAM is limited. (Hugging Face)