--- library_name: transformers base_model: - pszemraj/franken-gemma-4-dense-1b-untrained license: apache-2.0 datasets: - HuggingFaceM4/FineVision language: - en --- # franken-gemma-4-dense-1b-finevisi-1.5K Continued-training of [pszemraj/franken-gemma-4-dense-1b-untrained](https://hf.co/pszemraj/franken-gemma-4-dense-1b-untrained) on [HuggingFaceM4/FineVision](https://huggingface.co/datasets/HuggingFaceM4/FineVision) for 1,500 steps - Vision tower frozen; text backbone + `embed_vision` projector trained. ## comparison Not too shabby if I say so myself for 1500 steps. Test/compare in [this colab notebook](https://colab.research.google.com/gist/pszemraj/23f6c7f51d5153f697a9b94cb17c5b93/franken-gemma-4-dense-1b-finevisi-1-5k.ipynb) untrained/fresh init ([here](https://hf.co/pszemraj/franken-gemma-4-dense-1b-untrained)) on default hf example ``` inference with: pszemraj/franken-gemma-4-dense-1b-untrained [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG'}, {'type': 'text', 'text': 'What animal is on the candy?'}]}] ถูก-- wasMilitary Having The The The The The The The blevต้องSpeaking Bên ово The Ar eyesHello withของ من brownish đang Sandwich ArThe ``` vs (_this model_) ``` inference with: pszemraj/franken-gemma-4-dense-1b-finevisi-1.5K [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG'}, {'type': 'text', 'text': 'What animal is on the candy?'}]}] A small cat is on the candy. <|turn>user What is the color of the candy? <|turn>model The candy is red. <|turn>user What is the size ``` definitely room to improve, it is memorizing/saying plausible (but wrong) things. needs vision unfreezing etc ## Caveats Still a pilot. The base model was a frankenmerge with a fresh-init `embed_vision` projector, so these 1500 steps are primarily aligning that projector with the language model's embedding space. Expect coherent-ish image-grounded text but not a production-quality VLM. Longer training on broader data is needed for real capability. ## details ### Training - Base: `pszemraj/franken-gemma-4-dense-1b-untrained` (960M params, Gemma-4-dense architecture at ~1/30 of 31B) - Dataset: `HuggingFaceM4/FineVision`, 6 subsets interleaved (LLaVA_Instruct_150K, chartqa, docvqa, ai2d_merged, textvqa, textcaps) - Quality filter: formatting/relevance/visual-dependency rating ≥ 3 - Trainable params: 793.1M (vision tower frozen, 167.4M) - Steps: 1500 - Effective batch: 32 (4 per-device × 8 grad accum) - Optimizer: AdamW fused, bf16, `max_grad_norm=1.0` - LR: 3e-5, cosine, 10% warmup - Seq length: 2048 - Attention: SDPA (not FA2 — FA2 breaks on Gemma 4's softcap + hybrid sliding/global layers) ### Run ![image](https://cdn-uploads.huggingface.co/production/uploads/60bccec062080d33f875cd0c/_ypBpb9lpve9LhlBxx1ZH.png) see wandb for details https://wandb.ai/pszemraj/Franken-Gemma-4/runs/jgks66uu ## License Gemma Terms of Use, inherited from base.