Extract+Think Model Card for markendo/llava-extract-from-scratch-qwen3-0.6B
This repository hosts the Extract-0.6B† model, which serves as the perception module for the two-stage Extract+Think† framework. This model was presented in the paper Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models.
Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models. It centers on visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks; those details then feed into a separate reasoning stage. This variant is trained from scratch under the visual extraction tuning paradigm, without prior visual instruction tuning or captioning.
- 📖 Paper: Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
- 🌐 Project Page: https://web.stanford.edu/~markendo/projects/downscaling_intelligence
- 💻 Code: https://github.com/markendo/downscaling_intelligence
Model details
Extract-0.6B† serves as the perception module of the two-stage Extract+Think† framework. For the reasoning stage, the authors primarily use Qwen3 models (1.7B and 4B).
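The two-stage flow described above can be sketched as a simple composition of two functions. This is a minimal illustration, not the repository's API: the names `extract_then_think`, `stub_extract`, and `stub_think` are hypothetical, the stages are stubs that load no model weights, and the sketch assumes the reasoning stage receives only the extracted text and the question.

```python
from typing import Callable

# Hypothetical stage signatures, for illustration only (not the repository's API).
Extractor = Callable[[bytes, str], str]  # (image bytes, question) -> extracted visual details
Reasoner = Callable[[str, str], str]     # (extracted details, question) -> final answer


def extract_then_think(image: bytes, question: str,
                       extract: Extractor, think: Reasoner) -> str:
    """Run the perception stage first, then reason over its text output."""
    details = extract(image, question)  # stage 1: instruction-relevant visual details
    return think(details, question)     # stage 2: reasoning, assumed to be text-only


# Stub stages so the sketch runs without any model weights.
def stub_extract(image: bytes, question: str) -> str:
    return f"[details relevant to: {question}] ({len(image)} image bytes)"


def stub_think(details: str, question: str) -> str:
    return f"Answer to '{question}' based on {details}"


if __name__ == "__main__":
    print(extract_then_think(b"\x89PNG", "What animal is shown?",
                             stub_extract, stub_think))
```

In a real setup, the extractor would wrap Extract-0.6B† and the reasoner a Qwen3 model; the GitHub repository remains the reference for how the stages are actually connected.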
Usage
The authors evaluate this model with the lmms-eval framework. Setup and evaluation instructions are detailed in the GitHub repository: clone the repository, install the dependencies, and integrate the custom evaluation files with lmms-eval.
The following command generates the extracted visual information:
cd lmms-eval
model_name=markendo/llava-extract-from-scratch-qwen3-0.6B
python -m lmms_eval \
--model=llava_onevision \
--model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
--tasks=mmstar_prism_stage_1 \
--batch_size=1 \
--output_path results \
--log_samples
Please refer to the GitHub repository for full setup instructions, including the second stage of reasoning.
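For scripted runs, the same invocation can be assembled in Python. This is a sketch under assumptions: the helper names `build_stage1_command` and `run_stage1` are hypothetical, the flags are copied verbatim from the command above, and `lmms_eval` is assumed to be installed in the current interpreter with the repository's custom evaluation files already integrated.

```python
import shlex
import subprocess
import sys

MODEL_NAME = "markendo/llava-extract-from-scratch-qwen3-0.6B"


def build_stage1_command(model_name: str = MODEL_NAME,
                         output_path: str = "results") -> list[str]:
    """Assemble the argument list for the extraction run shown above."""
    model_args = f"pretrained={model_name},conv_template=qwen_1_5,device_map=auto"
    return [
        sys.executable, "-m", "lmms_eval",
        "--model=llava_onevision",
        f"--model_args={model_args}",
        "--tasks=mmstar_prism_stage_1",
        "--batch_size=1",
        "--output_path", output_path,
        "--log_samples",
    ]


def run_stage1(cwd: str = "lmms-eval") -> None:
    """Launch the run; assumes `cwd` is the lmms-eval checkout."""
    subprocess.run(build_stage1_command(), check=True, cwd=cwd)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute.
    print(shlex.join(build_stage1_command()))
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated `--model_args` value.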
Acknowledgments
This repository is built on top of LLaVA-OneVision and lmms-eval.
Citation
@article{endo2025downscalingintelligence,
author = {Endo, Mark and Yeung-Levy, Serena},
title = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
journal = {arXiv preprint},
year = {2025},
}