Hi everyone!
Just wanted to say it’s great to be here! I’m currently having a blast building and maintaining a local, headless AI server.
There’s something special about running models on your own hardware. I’m curious: what are you all currently running on your local setups?
Looking forward to catching up with the community!
Just use Gemma 4 31B with a smaller Gemma 4 as the draft model for speculative decoding.
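For anyone new to the idea: speculative decoding works because the cheap draft model proposes a few tokens and the big target model only verifies them, so (in the greedy case) the output is identical to decoding with the target alone, just faster. Here’s a toy Python sketch of the greedy loop — the two lambdas are stand-in “models”, not a real inference engine, and a real backend verifies the whole proposal in one batched forward pass:

```python
def speculative_decode(target, draft, prompt, k=4, max_tokens=12):
    """Greedy speculative decoding sketch: the small draft model
    proposes k tokens, the large target model verifies them."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], tokens[:]
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model checks each proposed position (a loop here
        #    for clarity; one batched pass in a real engine).
        accepted, ctx = 0, tokens[:]
        for t in proposal:
            if target(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        tokens.extend(proposal[:accepted])
        # 3. On a mismatch (or full acceptance), the target still
        #    contributes one guaranteed-correct token.
        tokens.append(target(tokens))
    return tokens[len(prompt):]

# Toy "models": both predict the next number in a counting sequence,
# but the draft is wrong whenever the context ends in a multiple of 5.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 2 if ctx[-1] % 5 == 0 else ctx[-1] + 1

out = speculative_decode(target, draft, [0], k=4, max_tokens=8)
print(out)  # identical to decoding with the target model alone
```

The key property to notice: even though the draft is sometimes wrong, the result matches pure target decoding exactly; the draft only affects speed, never quality.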
What are you all currently running on your local setups?
Hmm… I only use small embedding models day to day; I’ve integrated them into my work scripts. Since my GPU isn’t very powerful (a 3060 Ti with 8 GB of memory), I rarely run very large models locally…
That said, I’ve heard that MoE LLMs in GGUF format, on platforms like Ollama or LM Studio, run smoothly even within 32 GB of system RAM (not VRAM)…
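A quick back-of-the-envelope check makes the 32 GB claim plausible. A GGUF file still has to hold *all* the weights (MoE only saves compute per token, since just the active experts are read), so the fit question is roughly parameters × bits-per-weight. This is my own rough sketch — the ~10% overhead factor for metadata and higher-precision tensors is an assumption, not a spec:

```python
def gguf_size_gb(n_params_billion, bits_per_weight, overhead=1.10):
    """Very rough GGUF file-size estimate in GB: total weights at the
    quant's average bits-per-weight, plus ~10% assumed overhead for
    metadata and tensors kept at higher precision. Ballpark only."""
    return n_params_billion * bits_per_weight / 8 * overhead

# A ~30B-total-parameter MoE at ~4.5 bits/weight (Q4_K_M-class quant)
# fits in 32 GB of system RAM; a dense 70B at the same quant does not.
print(round(gguf_size_gb(30, 4.5), 1))  # ≈ 18.6 GB
print(round(gguf_size_gb(70, 4.5), 1))  # ≈ 43.3 GB
```

So a ~30B-total MoE leaves plenty of headroom in 32 GB, and because only a few billion parameters are active per token, it decodes at usable speeds even from system RAM.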
Personally, since most of my current use cases don’t require confidentiality, I just use cloud services for my LLMs.
Of course, I often try out models (LLM, T2I, etc.) hosted on HF via Spaces.
You’re totally right to keep an eye on them; local models have come a long way! Even with a 3060 Ti, you can actually run some impressive stuff.
Because of how GGUF works now, you can ‘offload’ specific layers to your 8GB VRAM and let the rest spill over into your 32GB of system RAM. If you want to try it, I’d recommend starting with Gemma-4-E4B-it-Q8_0.gguf.
At that quantization, the model is about 7.5 GB. If you set the context with `PARAMETER num_ctx 32768` in an Ollama Modelfile, it should fit comfortably across your VRAM and RAM. You’ll probably see speeds around 8–15 tokens/sec: not blazing fast, but the reasoning quality is excellent for a local setup.
If you want to go even bigger, you could technically run the Gemma-4 26B (A4B). By putting as many layers as possible on the GPU and the rest in system RAM, you’d likely hit 6–8 tokens/sec. Even running fully from system RAM, you’d still get about 3–4 tokens/sec. It’s definitely worth a shot if you want cloud-level smarts without the privacy concerns!
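If you go the Ollama route, the VRAM/RAM split described above is handled for you, but you can pin it down in a Modelfile. A minimal sketch — the GGUF path is a placeholder for wherever your download lives, and `num_gpu` (the number of layers offloaded to VRAM) is a starting guess to tune down if you hit out-of-memory on an 8 GB card:

```
# Hypothetical local GGUF path; adjust to your download
FROM ./Gemma-4-E4B-it-Q8_0.gguf

# Context window from the post above
PARAMETER num_ctx 32768

# Layers to offload to the GPU; lower this if 8 GB VRAM runs out
PARAMETER num_gpu 20
```

Then `ollama create gemma-local -f Modelfile` and `ollama run gemma-local` to try it out.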
Yeah. The Gemma 4 family is amazing.
Even though it’s still so new that backend support isn’t fully polished yet, the generated results are clearly a step up from the previous generation.
I was amazed by the multilingual performance back when Qwen 2.5 and Gemma 2 came out, too…
I occasionally test models on the Hub (via my Space), focusing mainly on small multilingual LLMs under 14B. It’s just random spot-checking. Models of the same size have kept getting better and better, month after month, for years now… Rather than a steady pace, though, every now and then an amazing model just pops up.