PaliGemma 2 demo
| GitHub | Blogpost | Fine-tuning notebook |
PaliGemma 2 is an open vision-language model by Google, inspired by PaLI-3 and built with open components such as the SigLIP vision model and the Gemma 2 language model. PaliGemma 2 is designed as a versatile model for transfer to a wide range of vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.
This Space runs a PaliGemma 2 model that the Hugging Face team LoRA fine-tuned on VQAv2, with inference powered by transformers.
See the Blogpost, the project README, and the fine-tuning notebook for detailed information about how to use and fine-tune PaliGemma and PaliGemma 2 models.
This is an experimental research model. Make sure to add appropriate guardrails when using the model in applications.
Example images are licensed CC0 by akolesnikoff@, mbosnjak@, maximneumann@ and merve.
| Question | Input Image | Max New Tokens |
|---|---|---|