---
license: apache-2.0
library_name: nanovlm
tags:
- vision-language
- multimodal
- pytorch
- small-model
- efficient
- research
- VLM
model_name: nanoVLM
datasets:
- HuggingFaceM4/the_cauldron
metrics:
- accuracy
pipeline_tag: image-text-to-text
---
**nanoVLM** is a minimal and lightweight Vision-Language Model (VLM) designed for efficient training and experimentation.
Written in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code.
It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M),
resulting in a compact 222M parameter model. The model achieves 35.3% accuracy on MMStar after training for ~6 hours on a single H100 GPU
using ~1.7M samples from the [the_cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) dataset,
making it a strong baseline for low-resource VLM research.
The model is aimed at researchers and developers who want to explore VLM training with minimal computational overhead,
and it serves as a starting point for tinkering with multimodal architectures.

**Model Architecture:**
- Vision Transformer (SigLIP-B/16)
- Causal Language Model (SmolLM2)
- Modality Projection Layer
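
As a rough sanity check on the components listed above, the sketch below works through the arithmetic implied by the names in this card. It assumes that "16" and "224" in `SigLIP-B/16-224` are the patch size and input resolution, and that 85M, 135M, and 222M are rounded nominal parameter counts. The helper names are made up for illustration and are not part of nanoVLM.

```python
# Nominal figures read off the model names in this card (rounded, in millions).
VISION_ENCODER_PARAMS_M = 85   # from "SigLIP-B/16-224-85M"
LANGUAGE_MODEL_PARAMS_M = 135  # from "SmolLM2-135M"
TOTAL_PARAMS_M = 222           # reported size of the combined model

IMAGE_SIZE = 224  # assumed input resolution ("224" in the encoder name)
PATCH_SIZE = 16   # assumed patch size ("/16" in the encoder name)


def patches_per_image(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def remaining_params_m(total_m: int, *components_m: int) -> int:
    """Parameters (in millions) not accounted for by the listed components."""
    return total_m - sum(components_m)


if __name__ == "__main__":
    print("patches per image:", patches_per_image(IMAGE_SIZE, PATCH_SIZE))
    print(
        "params outside the two backbones (M):",
        remaining_params_m(
            TOTAL_PARAMS_M, VISION_ENCODER_PARAMS_M, LANGUAGE_MODEL_PARAMS_M
        ),
    )
```

Under these assumptions the encoder sees 196 patches per image, and roughly 2M parameters are left for everything outside the two backbones, such as the modality projection. How many image tokens actually reach the language model depends on the projection layer, which this card does not specify.
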

**Training:**
- Trained on ~1.7M samples from the `the_cauldron` dataset
- 6 hours on a single NVIDIA H100 GPU
- Resulting model size: 222M parameters
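
The training figures above imply an average throughput, which can help when estimating how long a similar run might take. The sketch below assumes that each of the ~1.7M samples is seen exactly once and that the 6 hours are spent entirely on training. Neither is stated in this card, and the function name is illustrative only.

```python
SAMPLES = 1_700_000  # ~1.7M samples, as reported in this card
HOURS = 6            # ~6 hours on a single H100, as reported in this card
NUM_GPUS = 1


def samples_per_second(samples: int, hours: float) -> float:
    """Average throughput, assuming every sample is processed once."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return samples / (hours * 3600)


if __name__ == "__main__":
    rate = samples_per_second(SAMPLES, HOURS)
    print(f"average throughput: {rate:.1f} samples/s")  # about 78.7 samples/s
    print(f"compute budget: {HOURS * NUM_GPUS} GPU-hours")
```

If the run made more than one pass over the data, or if the 6 hours include evaluation or data loading stalls, the real per-step throughput would differ from this estimate.
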

**Evaluation:**
- MMStar Accuracy: 35.3%

**Usage:**

The model is loaded with the code from the nanoVLM repository: https://github.com/huggingface/nanoVLM

For more details, see: https://github.com/huggingface/nanoVLM?tab=readme-ov-file#hub-integration

```python
from models.vision_language_model import VisionLanguageModel
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
```