---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-8B-Base
tags:
- axolotl
- generated_from_trainer
datasets:
- nate-rahn/0508-principle-persona-sft-dset
model-index:
- name: 0508-persona_principle_sft-qwen3_8b_base
  results: []
---


[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.9.1`
```yaml
# Name 0508-persona_principle_sft-qwen3_8b_base

# axolotl train experiments/0508-persona_principle_sft-qwen3_8b_base.yaml


base_model: Qwen/Qwen3-8B-Base
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: false

# --- Dataset Configuration ---
datasets:
  - path: nate-rahn/0508-principle-persona-sft-dset
    type: chat_template # Use the chat_template processing strategy
    # --- Chat Template & Role Mapping ---
    chat_template: tokenizer_default # Use the chat template that ships with the tokenizer
    field_messages: messages # Assumes your dataset has a "messages" key with a list of dicts
    message_property_mappings: # Assumes each dict in the list has "role" and "content" keys
      role: role
      content: content
    roles: # Define the roles expected in your dataset for mapping
      user: ["user"] # Map "user" role in data to internal "user"
      assistant: ["assistant"] # Map "assistant" role in data to internal "assistant"
      system: ["system"] # Map "system" role in data to internal "system"
    # --- Training Target ---
    roles_to_train: ["assistant"]
    train_on_eos: turn # Train on the EOS token at the end of each trainable turn (here, assistant turns)

dataset_prepared_path: /workspace/data/last_run_prepared

# --- Training Hyperparameters ---
sequence_len: 2048 # Adjust based on your dataset and GPU memory
sample_packing: true # Pack multiple sequences into one example for efficiency
eval_sample_packing: true
pad_to_sequence_len: true # Pad sequences to sequence_len

# Full Parameter Finetuning (No adapter specified)
# adapter: # This is intentionally left blank/removed for full finetuning

# Performance & Precision (H100s excel with bf16)
bf16: true
tf32: true
flash_attention: true # for qwen

# Batching (Adjust based on GPU memory)
# Effective global batch size = micro_batch_size * gradient_accumulation_steps * num_gpus (4)
# Start low for full finetuning, e.g., 1 * 16 * 4 = 64; this run uses 2 * 32 * 4 = 256
micro_batch_size: 2
gradient_accumulation_steps: 32
eval_batch_size: 16 # Can often be slightly higher than micro_batch_size

# Optimizer & Scheduler
optimizer: adamw_torch_fused # Good choice for newer GPUs
learning_rate: 1e-5 # Common starting point for full SFT
weight_decay: 0.01
lr_scheduler: cosine # Standard scheduler
warmup_steps: 50
max_grad_norm: 1.0

# Training Duration & Evaluation/Saving
num_epochs: 2 # Adjust as needed, start with 1-3 for SFT
val_set_size: 0.01
logging_steps: 1
evals_per_epoch: 20
saves_per_epoch: 2 # Save 2 times per epoch (adjust based on dataset size)
save_total_limit: 1 # Keep only the most recent checkpoint

# Memory Saving
gradient_checkpointing: true # Essential for full finetuning
gradient_checkpointing_kwargs:
  use_reentrant: false # Prefer non-reentrant if possible

# --- FSDP Configuration (for 4xH100) ---
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false # Should not be needed with H100 VRAM
  fsdp_sync_module_states: true # Important for correctness
  fsdp_use_orig_params: false # Recommended for memory saving with FSDP
  fsdp_state_dict_type: SHARDED_STATE_DICT # Options: FULL_STATE_DICT or SHARDED_STATE_DICT (saves disk space)
  # fsdp_transformer_layer_cls_to_wrap: 'Gemma3DecoderLayer'
  fsdp_transformer_layer_cls_to_wrap: 'Qwen3DecoderLayer'
  # fsdp_activation_checkpointing: true # Alternative way to enable activation checkpointing for FSDP

# --- Special Tokens ---
# Define based on the chat template's turn terminator. Qwen already uses <|im_end|>
special_tokens:
  eos_token: "<|im_end|>"
# eos_token: "<end_of_turn>"

# --- Logging & Saving ---
output_dir: /workspace/output/red-team-agent/runs/0508-persona_principle_sft-qwen3_8b_base # Local output directory

# W&B Logging
wandb_project: "red-team-agent" # Name your W&B project
wandb_entity: "nate" # IMPORTANT: Replace with your W&B username or team name
wandb_name: "0508-persona_principle_sft-qwen3_8b_base" # Descriptive run name
# wandb_log_model: "checkpoint" # Log model checkpoints to W&B Artifacts

# Hugging Face Hub Upload
hub_model_id: "nate-rahn/0508-persona_principle_sft-qwen3_8b_base" # IMPORTANT: Replace with your desired HF repo ID
hub_strategy: "end" # Push checkpoints to the Hub (`"end"` pushes only the final model)
hf_use_auth_token: true # Required for pushing to the Hub (ensure you're logged in)

# --- Misc ---
seed: 42
```

</details><br>

# 0508-persona_principle_sft-qwen3_8b_base

This model is a fine-tuned version of [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) on the nate-rahn/0508-principle-persona-sft-dset dataset.
At the final evaluation (step 360), it achieves the following result on the evaluation set:
- Loss: 2.2570
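
The following is a minimal loading sketch, not an official usage recipe. It assumes the final weights were pushed to the `hub_model_id` named in the config (`nate-rahn/0508-persona_principle_sft-qwen3_8b_base`), that the repository is accessible to you, and that `torch`, `transformers`, and `accelerate` (for `device_map="auto"`) are installed. The helper names `build_messages` and `generate_reply` are hypothetical and exist only for illustration; the generation settings are placeholders.

```python
from typing import Dict, List, Optional

# Assumption: the trained model lives at the hub_model_id given in the axolotl config.
MODEL_ID = "nate-rahn/0508-persona_principle_sft-qwen3_8b_base"


def build_messages(user_prompt: str, system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Build a chat in the role/content format used by the training data."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def generate_reply(user_prompt: str, system_prompt: Optional[str] = None,
                   max_new_tokens: int = 256) -> str:
    """Load the model and generate one assistant turn (downloads weights on first call)."""
    # Imported lazily so build_messages works without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # The tokenizer's default chat template matches `chat_template: tokenizer_default`.
    input_ids = tokenizer.apply_chat_template(
        build_messages(user_prompt, system_prompt),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
```

Calling `generate_reply("Hello")` would download roughly 8B parameters in bf16, so it is best run on a machine with a suitable GPU.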

## Model description

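This is a full-parameter supervised fine-tune (no adapter) of Qwen/Qwen3-8B-Base, trained with Axolotl 0.9.1 in bf16 using FSDP full sharding across 4 GPUs. It uses the tokenizer's default chat template with `<|im_end|>` as the EOS token. More information is needed on the model's intended behavior.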

## Intended uses & limitations

More information needed

## Training and evaluation data

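Training used the nate-rahn/0508-principle-persona-sft-dset dataset in chat format (a `messages` list of dicts with `role` and `content` keys). 1% of the data was held out for evaluation (`val_set_size: 0.01`). Loss was computed on assistant turns only, and samples were packed into sequences of 2048 tokens. More information is needed on the dataset's contents.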

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- num_epochs: 2.0
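
The batch-size totals and the learning-rate schedule in the list above can be reproduced with a short sketch. The constants come from the list; the function names are hypothetical, and `cosine_lr` is a plain re-implementation of linear warmup followed by a half-cosine decay to zero, which is an assumption about how the `cosine` scheduler behaves rather than the trainer's own code.

```python
import math

# Values taken from the hyperparameter list above.
MICRO_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 32
NUM_DEVICES = 4
EVAL_BATCH_SIZE = 16
PEAK_LR = 1e-5
WARMUP_STEPS = 50


def total_batch_size(per_device: int, num_devices: int, grad_accum: int = 1) -> int:
    """Sequences consumed per optimizer step (or per eval step when grad_accum is 1)."""
    return per_device * grad_accum * num_devices


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then a half-cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


train_total = total_batch_size(MICRO_BATCH_SIZE, NUM_DEVICES, GRAD_ACCUM_STEPS)
eval_total = total_batch_size(EVAL_BATCH_SIZE, NUM_DEVICES)
print(train_total, eval_total)  # 256 64
```

The total number of optimizer steps is not stated in this card, so `total_steps` is left as a parameter. Extrapolating from the results table (step 360 at epoch 1.9509) suggests about 369 steps for 2 epochs, but treat that as an estimate.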

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 4.1102        | 0.0054 | 1    | 5.0903          |
| 4.6869        | 0.0543 | 10   | 4.4168          |
| 3.4043        | 0.1087 | 20   | 3.6086          |
| 3.2614        | 0.1630 | 30   | 3.2780          |
| 2.9574        | 0.2174 | 40   | 3.2620          |
| 3.0284        | 0.2717 | 50   | 3.0021          |
| 2.7515        | 0.3260 | 60   | 2.8933          |
| 2.8258        | 0.3804 | 70   | 2.7446          |
| 2.5825        | 0.4347 | 80   | 2.6995          |
| 2.6558        | 0.4890 | 90   | 2.5691          |
| 2.4854        | 0.5434 | 100  | 2.5690          |
| 2.6957        | 0.5977 | 110  | 2.5314          |
| 2.4035        | 0.6521 | 120  | 2.4879          |
| 2.9604        | 0.7064 | 130  | 2.5762          |
| 2.4454        | 0.7607 | 140  | 2.4713          |
| 2.3399        | 0.8151 | 150  | 2.4662          |
| 2.4051        | 0.8694 | 160  | 2.4147          |
| 2.3851        | 0.9238 | 170  | 2.4431          |
| 2.4061        | 0.9781 | 180  | 2.3823          |
| 2.4292        | 1.0272 | 190  | 2.4554          |
| 2.4125        | 1.0815 | 200  | 2.3628          |
| 2.2624        | 1.1358 | 210  | 2.3661          |
| 2.4571        | 1.1902 | 220  | 2.3544          |
| 2.2805        | 1.2445 | 230  | 2.3428          |
| 2.6547        | 1.2989 | 240  | 2.3283          |
| 2.2127        | 1.3532 | 250  | 2.3069          |
| 2.3363        | 1.4075 | 260  | 2.3121          |
| 2.2102        | 1.4619 | 270  | 2.2841          |
| 2.1397        | 1.5162 | 280  | 2.2880          |
| 2.1842        | 1.5706 | 290  | 2.2787          |
| 2.1350        | 1.6249 | 300  | 2.2744          |
| 2.2115        | 1.6792 | 310  | 2.2641          |
| 2.0916        | 1.7336 | 320  | 2.2610          |
| 2.2539        | 1.7879 | 330  | 2.2603          |
| 2.1064        | 1.8422 | 340  | 2.2578          |
| 2.3234        | 1.8966 | 350  | 2.2575          |
| 2.1563        | 1.9509 | 360  | 2.2570          |
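
To make the validation losses in the table easier to interpret, they can be converted to perplexity. This sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training with the Trainer. Because `roles_to_train` is `["assistant"]`, the loss covers assistant tokens only. The function name `perplexity` is illustrative.

```python
import math


def perplexity(mean_token_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_token_loss)


first_eval = perplexity(5.0903)  # validation loss at step 1
final_eval = perplexity(2.2570)  # validation loss at step 360
print(f"step 1: {first_eval:.1f}, step 360: {final_eval:.2f}")
```

Perplexity is only comparable between runs that share the same tokenizer and evaluation split.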


### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.1
- Tokenizers 0.21.1