---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-8B-Base
tags:
- axolotl
- generated_from_trainer
datasets:
- nate-rahn/0508-principle-persona-sft-dset
model-index:
- name: 0508-persona_principle_sft-qwen3_8b_base
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
See axolotl config

axolotl version: `0.9.1`
```yaml
# Name 0508-persona_principle_sft-qwen3_8b_base
# axolotl train experiments/0508-persona_principle_sft-qwen3_8b_base.yaml
base_model: Qwen/Qwen3-8B-Base
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: false

# --- Dataset Configuration ---
datasets:
  - path: nate-rahn/0508-principle-persona-sft-dset
    type: chat_template # Use the chat_template processing strategy
    # --- Template & Role Mapping ---
    chat_template: tokenizer_default # Use the chat template bundled with the tokenizer
    field_messages: messages # Assumes your dataset has a "messages" key with a list of dicts
    message_property_mappings: # Assumes each dict in the list has "role" and "content" keys
      role: role
      content: content
    roles: # Define the roles expected in your dataset for mapping
      user: ["user"] # Map "user" role in data to internal "user"
      assistant: ["assistant"] # Map "assistant" role in data to internal "assistant"
      system: ["system"] # Map "system" role in data to internal "system"
    # --- Training Target ---
    roles_to_train: ["assistant"]
    train_on_eos: turn # Train on the EOS token at the end of each trainable (assistant) turn

dataset_prepared_path: /workspace/data/last_run_prepared

# --- Training Hyperparameters ---
sequence_len: 2048 # Adjust based on your dataset and GPU memory
sample_packing: true # Pack multiple sequences into one example for efficiency
eval_sample_packing: true
pad_to_sequence_len: true # Pad sequences to sequence_len

# Full Parameter Finetuning (No adapter specified)
# adapter: # This is intentionally left blank/removed for full finetuning

# Performance & Precision (H100s excel with bf16)
bf16: true
tf32: true
flash_attention: true # for qwen

# Batching (Adjust based on GPU memory)
# Effective global batch size = micro_batch_size * gradient_accumulation_steps * num_gpus (4)
# Start low for full finetuning, e.g., 1 * 16 * 4 = 64
# This run: 2 * 32 * 4 = 256
micro_batch_size: 2
gradient_accumulation_steps: 32
eval_batch_size: 16 # Can often be slightly higher than micro_batch_size

# Optimizer & Scheduler
optimizer: adamw_torch_fused # Good choice for newer GPUs
learning_rate: 1e-5 # Common starting point for full SFT
weight_decay: 0.01
lr_scheduler: cosine # Standard scheduler
warmup_steps: 50
max_grad_norm: 1.0

# Training Duration & Evaluation/Saving
num_epochs: 2 # Adjust as needed, start with 1-3 for SFT
val_set_size: 0.01
logging_steps: 1
evals_per_epoch: 20
saves_per_epoch: 2 # Save 2 times per epoch (adjust based on dataset size)
save_total_limit: 1 # Keep only the most recent checkpoint

# Memory Saving
gradient_checkpointing: true # Essential for full finetuning
gradient_checkpointing_kwargs:
  use_reentrant: false # Prefer non-reentrant if possible

# --- FSDP Configuration (for 4xH100) ---
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false # Should not be needed with H100 VRAM
  fsdp_sync_module_states: true # Important for correctness
  fsdp_use_orig_params: false # Recommended for memory saving with FSDP
  fsdp_state_dict_type: SHARDED_STATE_DICT # Options: FULL_STATE_DICT or SHARDED_STATE_DICT (saves disk space)
  # fsdp_transformer_layer_cls_to_wrap: 'Gemma3DecoderLayer'
  fsdp_transformer_layer_cls_to_wrap: 'Qwen3DecoderLayer'
  # fsdp_activation_checkpointing: true # Alternative way to enable activation checkpointing for FSDP

# --- Special Tokens ---
# Define based on your chat template's terminators. Qwen already uses <|im_end|>
special_tokens:
  eos_token: "<|im_end|>"
  # eos_token: ""

# --- Logging & Saving ---
output_dir: /workspace/output/red-team-agent/runs/0508-persona_principle_sft-qwen3_8b_base # Local output directory

# W&B Logging
wandb_project: "red-team-agent" # Name your W&B project
wandb_entity: "nate" # IMPORTANT: Replace with your W&B username or team name
wandb_name: "0508-persona_principle_sft-qwen3_8b_base" # Descriptive run name
# wandb_log_model: "checkpoint" # Log model checkpoints to W&B Artifacts

# Hugging Face Hub Upload
hub_model_id: "nate-rahn/0508-persona_principle_sft-qwen3_8b_base" # IMPORTANT: Replace with your desired HF repo ID
hub_strategy: "end" # Push to the Hub (`"end"` pushes only the final model)
hf_use_auth_token: true # Required for pushing to the Hub (ensure you're logged in)

# --- Misc ---
seed: 42
```
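
The batching comment in the config gives the formula for the effective global batch size. As a quick sanity check, the sketch below plugs in the values from this config and reproduces the `total_train_batch_size: 256` and `total_eval_batch_size: 64` listed under the training hyperparameters. The GPU count of 4 is taken from the "4xH100" comment, and the tokens-per-step figure is only an upper bound that assumes every packed row is filled to `sequence_len`.

```python
# Values copied from the axolotl config in this card.
micro_batch_size = 2
gradient_accumulation_steps = 32
eval_batch_size = 16
sequence_len = 2048
num_gpus = 4  # assumption: taken from the "4xH100" comment in the config

# Effective global batch size per optimizer step.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus

# Evaluation does not accumulate gradients, so only the device count multiplies.
total_eval_batch_size = eval_batch_size * num_gpus

# Upper bound on tokens per optimizer step: with sample packing and
# pad_to_sequence_len, each row holds exactly sequence_len token positions
# (some of which may be padding or non-trained prompt tokens).
max_tokens_per_step = total_train_batch_size * sequence_len

print(f"total_train_batch_size = {total_train_batch_size}")
print(f"total_eval_batch_size  = {total_eval_batch_size}")
print(f"max_tokens_per_step    = {max_tokens_per_step}")
```

Changing the number of GPUs while keeping the same global batch size means adjusting `gradient_accumulation_steps` so that the product stays at 256.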

# 0508-persona_principle_sft-qwen3_8b_base

This model is a fine-tuned version of [Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base) on the nate-rahn/0508-principle-persona-sft-dset dataset.
It achieves the following results on the evaluation set:
- Loss: 2.2570

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- num_epochs: 2.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 4.1102        | 0.0054 | 1    | 5.0903          |
| 4.6869        | 0.0543 | 10   | 4.4168          |
| 3.4043        | 0.1087 | 20   | 3.6086          |
| 3.2614        | 0.1630 | 30   | 3.2780          |
| 2.9574        | 0.2174 | 40   | 3.2620          |
| 3.0284        | 0.2717 | 50   | 3.0021          |
| 2.7515        | 0.3260 | 60   | 2.8933          |
| 2.8258        | 0.3804 | 70   | 2.7446          |
| 2.5825        | 0.4347 | 80   | 2.6995          |
| 2.6558        | 0.4890 | 90   | 2.5691          |
| 2.4854        | 0.5434 | 100  | 2.5690          |
| 2.6957        | 0.5977 | 110  | 2.5314          |
| 2.4035        | 0.6521 | 120  | 2.4879          |
| 2.9604        | 0.7064 | 130  | 2.5762          |
| 2.4454        | 0.7607 | 140  | 2.4713          |
| 2.3399        | 0.8151 | 150  | 2.4662          |
| 2.4051        | 0.8694 | 160  | 2.4147          |
| 2.3851        | 0.9238 | 170  | 2.4431          |
| 2.4061        | 0.9781 | 180  | 2.3823          |
| 2.4292        | 1.0272 | 190  | 2.4554          |
| 2.4125        | 1.0815 | 200  | 2.3628          |
| 2.2624        | 1.1358 | 210  | 2.3661          |
| 2.4571        | 1.1902 | 220  | 2.3544          |
| 2.2805        | 1.2445 | 230  | 2.3428          |
| 2.6547        | 1.2989 | 240  | 2.3283          |
| 2.2127        | 1.3532 | 250  | 2.3069          |
| 2.3363        | 1.4075 | 260  | 2.3121          |
| 2.2102        | 1.4619 | 270  | 2.2841          |
| 2.1397        | 1.5162 | 280  | 2.2880          |
| 2.1842        | 1.5706 | 290  | 2.2787          |
| 2.135         | 1.6249 | 300  | 2.2744          |
| 2.2115        | 1.6792 | 310  | 2.2641          |
| 2.0916        | 1.7336 | 320  | 2.2610          |
| 2.2539        | 1.7879 | 330  | 2.2603          |
| 2.1064        | 1.8422 | 340  | 2.2578          |
| 2.3234        | 1.8966 | 350  | 2.2575          |
| 2.1563        | 1.9509 | 360  | 2.2570          |

### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.1
- Tokenizers 0.21.1
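
The validation losses in the training results can be easier to interpret as perplexities. The sketch below converts the first and last validation losses from the table, assuming the reported loss is the mean per-token cross-entropy in nats over the trained tokens (the usual convention for causal language model training with the Hugging Face Trainer, but not stated in this card). The `perplexity` helper is illustrative and not part of any library.

```python
import math

# Validation losses copied from the first and last rows of the results table.
initial_eval_loss = 5.0903  # step 1
final_eval_loss = 2.2570    # step 360


def perplexity(mean_nll_nats: float) -> float:
    """Convert a mean per-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll_nats)


initial_ppl = perplexity(initial_eval_loss)
final_ppl = perplexity(final_eval_loss)

print(f"step 1:   loss={initial_eval_loss:.4f}  perplexity={initial_ppl:.2f}")
print(f"step 360: loss={final_eval_loss:.4f}  perplexity={final_ppl:.2f}")
```

Under that assumption, the final loss of 2.2570 corresponds to a perplexity of roughly 9.55 on the held-out 1% split, down from roughly 162 at the first evaluation.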