[Appreciation] Incredible performance of Gemma 4-26b on consumer hardware — 90 t/s even on an older DDR3 system!
Hi Google team,
I wanted to share my experience and express my deep gratitude for the amazing work you've done with Gemma 4.
I previously thought my RTX 5060ti (16GB VRAM) had hit its limit running Qwen 3.5-35b-a3b at 45 t/s. However, Gemma 4-26b-a4b (using iq4-xs quantization) has completely exceeded my expectations:
Speed at 2048 context: 80~90 t/s
Speed at 96K context: Still maintains a highly usable 40+ t/s
Intelligence: It feels as smart as, if not smarter than, Qwen 3.5-35b-a3b. Remarkably, it achieves this level of reasoning with much more concise Chain-of-Thought (CoT) and significantly lower token overhead.
A crucial note on my setup:
I am actually running this on a fairly aged platform: an Intel i7-4790 paired with DDR3 1600MHz RAM. Given that my memory bandwidth is quite limited by this older DDR3 standard, the fact that I can achieve these speeds is mind-blowing.
I am convinced that users with modern DDR4 or DDR5 systems will see even more breathtaking performance once they optimize their setups. This is a true testament to how efficiently Gemma 4 is engineered.
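For a rough sense of the gap between memory generations mentioned above, theoretical peak bandwidth can be worked out as transfer rate times 8 bytes per 64-bit channel times the number of channels. The sketch below assumes dual-channel operation, which the post does not state, and the DDR4-3200 and DDR5-5600 speeds are just illustrative picks. Real-world throughput is lower than these peaks, and when every layer is offloaded to the GPU, system RAM bandwidth may matter less than these numbers suggest.

```python
# Theoretical peak bandwidth = transfers/s * 8 bytes per 64-bit channel * channels.
# Dual-channel operation is an ASSUMPTION; the post does not give the channel count.

def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int = 2) -> float:
    """Return theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = 8  # one 64-bit channel moves 8 bytes per transfer
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

# DDR3-1600 is from the post; the other two speeds are illustrative examples.
for name, rate in [("DDR3-1600", 1600), ("DDR4-3200", 3200), ("DDR5-5600", 5600)]:
    print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s")
# DDR3-1600: 25.6 GB/s
# DDR4-3200: 51.2 GB/s
# DDR5-5600: 89.6 GB/s
```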
Thank you for making such a powerful and accessible model. You have truly leveled up the capabilities of local LLMs!
Keep up the great work!
Hi @MightyLoraLord -
Thank you for sharing this feedback. We really appreciate it! It’s great to see Gemma 4 26B performing so well even on older setups. Your insights on speed, context scaling, and efficiency are very valuable.
I love the base model, and I'm really impressed with its intelligence and speed.
Looking forward to DavidAU's Gemma4 modifications too.
MightyLoraLord
at 2048 context: 80~90 t/s
Can you tell us which parameters you used?
Sorry for the late reply!
Sure, here is the command I use most often. (Note: If you are copying this directly into Windows CMD, make sure there are no trailing spaces after the ^ line-continuations):
.\llama-server.exe ^
--port <port_num> ^
-m "<model_dir>\google_gemma-4-26B-A4B-it-IQ4_XS.gguf" ^
-ngl all ^
-c 65536 ^
--batch-size 512 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--reasoning on ^
-t 1 ^
--parallel 1 ^
--direct-io ^
--kv-unified
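If you want to check the t/s figure on your own machine once the server is running, here is a minimal Python sketch that sends one request to llama-server's OpenAI-compatible /v1/chat/completions endpoint and divides the reported completion tokens by wall-clock time. The helper names (build_payload, tokens_per_second, measure) and the default prompt are made up for illustration, and the base URL depends on whatever port you passed to --port. Because the timer also covers prompt processing, the result will read a little lower than the pure generation speed shown in the server log.

```python
import json
import time
import urllib.request


def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build a non-streaming chat completion request body (hypothetical helper)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by elapsed wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url: str, prompt: str = "Explain what a KV cache is.",
            max_tokens: int = 256) -> float:
    """Send one request and return an end-to-end tokens/s estimate."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # "usage.completion_tokens" is the standard OpenAI-style token count field.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Usage would look like print(measure("http://localhost:8080")), with 8080 swapped for your own port.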
For those who want to achieve the absolute maximum performance, here are the custom CMake flags I used to compile llama.cpp on Windows 10 (Visual Studio 2022).
Compiling with native optimizations and unlocking all Flash Attention quants made a huge difference on my older setup.
:: Run this inside your build directory in Windows CMD
cmake .. -G "Visual Studio 17 2022" -A x64 ^
-DGGML_CUDA=ON ^
-DGGML_CUDA_FA_ALL_QUANTS=ON ^
-DCMAKE_CUDA_ARCHITECTURES=native ^
-DGGML_AVX2=ON ^
-DGGML_FMA=ON ^
-DGGML_NATIVE=ON ^
-DGGML_LTO=ON ^
-DCMAKE_BUILD_TYPE=Release
(Note: If native gives you an error during the CUDA architecture detection, you can manually replace it with your specific GPU architecture version, or remove that line to let CMake auto-detect).
One last Pro-Tip for maximizing VRAM:
I connected my monitor to my Intel CPU's integrated graphics (iGPU) instead of the GPU. This offloads the Windows OS display task entirely from the dedicated graphics card, saving me around 0.5 GB of VRAM. If you are cutting it close on VRAM and need every megabyte for the model or larger context, I highly recommend doing the same!
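To see how much VRAM the desktop is holding on your own card, one option is to read the numbers from nvidia-smi before and after moving the monitor cable, with no model loaded. The sketch below is only an illustration: the function names are made up, and it assumes nvidia-smi is on your PATH. The 0.5 GB figure above is the poster's own measurement, so your savings may differ.

```python
import subprocess


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (in MiB) into (used, total) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_vram() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory of every GPU, in MiB."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(output)
```

Calling query_vram() once with the monitor on the dedicated card and once with it on the iGPU, then comparing the "used" values, gives the saving for your setup.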
Unraid llama.cpp server, repository: vito974/llama-cpp-turboquant:server-cuda12, post-arguments: -m /models/google_gemma-4-26B-A4B-it-IQ4_XS.gguf --flash-attn on --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --n-cpu-moe 1 --no-mmap --mlock --host 0.0.0.0 --port 8000 -c 131072. Stable setup with 82-88 t/s and 131k context. Nvidia RTX 5060ti 16GB. Thanks, it works great.