[Appreciation] Incredible performance of Gemma 4-26b on consumer hardware — 90 t/s even on an older DDR3 system!

#17
by MightyLoraLord - opened

Hi Google team,

I wanted to share my experience and express my deep gratitude for the amazing work you've done with Gemma 4.

I previously thought my RTX 5060 Ti (16GB VRAM) had hit its limit running Qwen 3.5-35b-a3b at 45 t/s. However, Gemma 4-26b-a4b (using IQ4_XS quantization) has far exceeded my expectations:

Speed at 2048 context: 80~90 t/s
Speed at 96K context: Still maintains a highly usable 40+ t/s
Intelligence: It feels as smart as, if not smarter than, Qwen 3.5-35b-a3b. Remarkably, it achieves this level of reasoning with much more concise Chain-of-Thought (CoT) and significantly lower token overhead.

A crucial note on my setup:
I am actually running this on a fairly aged platform: an Intel i7-4790 paired with DDR3 1600MHz RAM. Given that my memory bandwidth is quite limited by this older DDR3 standard, the fact that I can achieve these speeds is mind-blowing.
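
For a rough sense of how limited that is, theoretical peak memory bandwidth is the transfer rate times the bus width times the number of channels. The following is a minimal sketch, assuming dual-channel operation with a 64-bit bus per channel (the post does not state the channel configuration) and picking DDR4-3200 and DDR5-5600 as illustrative comparison points; the function name is hypothetical.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes).

    channels=2 and bus_bytes=8 (a 64-bit channel) are assumptions,
    not values reported in the post.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


# DDR3-1600 is the memory named in the post; the other two are assumed examples.
for name, rate in [("DDR3-1600", 1600), ("DDR4-3200", 3200), ("DDR5-5600", 5600)]:
    print(f"{name}: {peak_bandwidth_gb_s(rate):.1f} GB/s")
```

These are upper bounds on paper; sustained bandwidth in practice is lower.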

I am convinced that users with modern DDR4 or DDR5 systems will see even more breathtaking performance once they optimize their setups. This is a true testament to how efficiently Gemma 4 is engineered.

Thank you for making such a powerful and accessible model. You have truly leveled up the capabilities of local LLMs!

Keep up the great work!

Google org

Hi @MightyLoraLord -

Thank you for sharing this feedback. We really appreciate it! It’s great to see Gemma 4 26B performing so well even on older setups. Your insights on speed, context scaling, and efficiency are very valuable.

I love the base model, and I'm really impressed with its intelligence and speed.
Looking forward to DavidAU's Gemma4 modifications too.

MightyLoraLord
at 2048 context: 80~90 t/s
Can you tell us which parameters you used?

Sorry for the late reply!

Sure, here is the command I use most often. (Note: If you are copying this directly into Windows CMD, make sure there are no trailing spaces after the ^ line-continuations):

.\llama-server.exe ^
--port <port_num> ^
-m "<model_dir>\google_gemma-4-26B-A4B-it-IQ4_XS.gguf" ^
-ngl all ^
-c 65536 ^
--batch-size 512 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--reasoning on ^
-t 1 ^
--parallel 1 ^
--direct-io ^
--kv-unified
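
To check the tokens-per-second figure on your own machine, one option is to send a request to the running server and read the timing data it returns. This is a minimal sketch, not a definitive benchmark. It assumes the server is reachable at 127.0.0.1 on port 8080 (substitute your own port), and that the /completion response includes a timings object with predicted_n and predicted_ms fields, which is what recent llama.cpp builds return as far as I know. The helper names and the prompt are hypothetical.

```python
import json
import urllib.request


def tokens_per_second(timings: dict) -> float:
    """Generation speed from a timings dict: predicted_n tokens in predicted_ms milliseconds."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def measure(base_url: str = "http://127.0.0.1:8080", n_predict: int = 256) -> float:
    """Send one completion request and return the generation speed in tokens/s.

    Assumes the response JSON has a "timings" object (an assumption about
    the server build in use; adjust if your build reports it differently).
    """
    payload = json.dumps({
        "prompt": "Explain what a mixture-of-experts model is.",
        "n_predict": n_predict,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return tokens_per_second(body["timings"])


# Example, with the server already running:
#   print(f"{measure():.1f} t/s")
```

A single short request is noisy, so averaging a few runs at the context length you care about gives a fairer number.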

For those who want to achieve the absolute maximum performance, here are the custom CMake flags I used to compile llama.cpp on Windows 10 (Visual Studio 2022). Compiling with native optimizations and unlocking all Flash Attention quants made a huge difference on my older setup.

:: Run this inside your build directory in Windows CMD
cmake .. -G "Visual Studio 17 2022" -A x64 ^
-DGGML_CUDA=ON ^
-DGGML_CUDA_FA_ALL_QUANTS=ON ^
-DCMAKE_CUDA_ARCHITECTURES=native ^
-DGGML_AVX2=ON ^
-DGGML_FMA=ON ^
-DGGML_NATIVE=ON ^
-DGGML_LTO=ON ^
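-DCMAKE_BUILD_TYPE=Release

:: The Visual Studio generator is multi-config, so select Release at build time as well
cmake --build . --config Release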

(Note: If native gives you an error during the CUDA architecture detection, you can manually replace it with your specific GPU architecture version, or remove that line to let CMake auto-detect).

One last Pro-Tip for maximizing VRAM:
I connected my monitor to my Intel CPU's integrated graphics (iGPU) instead of the GPU. This offloads the Windows OS display task entirely from the dedicated graphics card, saving me around 0.5 GB of VRAM. If you are cutting it close on VRAM and need every megabyte for the model or larger context, I highly recommend doing the same!
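
To see how much VRAM the desktop is holding before the model loads, the used and total memory can be read from nvidia-smi. This is a minimal sketch, assuming nvidia-smi is on the PATH; the function names are hypothetical.

```python
import subprocess


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (values in MiB) into (used, total) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_vram() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory per GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


# Example, on a machine with an NVIDIA driver installed:
#   for used, total in query_vram():
#       print(f"{used} MiB used of {total} MiB, {total - used} MiB free")
```

Running it once with the monitor on the dedicated card and once with it on the iGPU, before starting llama-server each time, shows how much the display is actually costing on a given system.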

Unraid server running llama.cpp, repository: vito974/llama-cpp-turboquant:server-cuda12, post arguments: -m /models/google_gemma-4-26B-A4B-it-IQ4_XS.gguf --flash-attn on --cache-type-k q8_0 --cache-type-v turbo3 -ngl 99 --n-cpu-moe 1 --no-mmap --mlock --host 0.0.0.0 --port 8000 -c 131072. Stable setup with 82-88 t/s and 131k context. Nvidia RTX 5060ti 16GB. Thanks, it works great.
