PaliGemma 2 demo

| Github | Blogpost | Fine-tuning notebook |

PaliGemma 2 is an open vision-language model by Google, inspired by PaLI-3 and built with open components such as the SigLIP vision model and the Gemma 2 language model. PaliGemma 2 is designed as a versatile model for transfer to a wide range of vision-language tasks such as image and short video captioning, visual question answering, text reading, object detection, and object segmentation.

This Space includes a model LoRA fine-tuned on VQAv2 by the team at Hugging Face, with inference run using transformers. See the Blogpost, the project
README, and the fine-tuning notebook for detailed information about how to use and fine-tune PaliGemma and PaliGemma 2 models.

This is an experimental research model. Make sure to add appropriate guardrails when using it in applications.


Example images are licensed CC0 by akolesnikoff@, mbosnjak@, maximneumann@ and merve.
