Hi all,
Sharing this here since I think it overlaps quite a bit with what many of you are working on.
I’m on the NVIDIA team building GeForce G-Assist, an on-device AI assistant that runs small language models locally.
We’re hiring a senior engineer to work on problems like:
- optimizing local inference (llama.cpp, quantization, memory, latency)
- improving long-running conversation behavior (state drift, prompt leakage, retrieval cross-talk)
- building RAG systems grounded in system + user context
- enabling agent-style workflows (tool use, multi-step execution)
- working across C/C++ (performance-critical paths) and Python (evaluation + tooling)
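To give a flavor of the first bullet, here's a toy sketch (my own illustration, not G-Assist or llama.cpp code) of symmetric per-tensor int8 weight quantization, the basic idea underlying the quantized formats used in local inference stacks:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(w).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# per-element reconstruction error is bounded by scale / 2
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Real schemes (e.g. the block-wise formats in llama.cpp) quantize per block with per-block scales rather than per tensor, which keeps the error bound tight even when weight magnitudes vary, but the scale/round/clip structure is the same.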
This role sits at the intersection of systems and model behavior: how models perform in production environments, especially under real constraints.
If you've worked with local inference stacks or small models, or you just care about making LLM systems actually usable, I'd love to chat.
Full job description + apply here:
If you want to see what we’re building:
- Product overview: NVIDIA Project G-Assist
- GitHub: NVIDIA/G-Assist
Happy to answer questions about the role or what we’re building.