Hi all,
Sharing this here since I think it overlaps quite a bit with what many of you are working on.
I’m on the NVIDIA team building GeForce G-Assist, an on-device AI assistant that runs small language models locally.
We’re hiring a senior engineer to work on problems like:
- optimizing local inference (llama.cpp, quantization, memory, latency)
- improving long-running conversation behavior (state drift, prompt leakage, retrieval cross-talk)
- building RAG systems grounded in system + user context
- enabling agent-style workflows (tool use, multi-step execution)
- working across C/C++ (performance-critical paths) and Python (evaluation + tooling)
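To give a flavor of the first bullet, here's a toy sketch (my own illustration, not G-Assist or llama.cpp code) of symmetric per-tensor int8 weight quantization, the basic idea underlying the quantized formats used in local inference stacks:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(w).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# per-element reconstruction error is bounded by scale / 2
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Real schemes (e.g. the block-wise formats in llama.cpp) quantize per block with per-block scales rather than per tensor, which keeps the error bound tight even when weight magnitudes vary, but the scale/round/clip structure is the same.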
This role sits at the intersection of systems and model behavior: how models perform in production environments, especially under real constraints.
If you've worked with local inference stacks or small models, or you just care about making LLM systems actually usable, I'd love to chat.
Full job description + apply here:
If you want to see what we’re building:
- Product overview: NVIDIA Project G-Assist
- GitHub: NVIDIA/G-Assist
Happy to answer questions about the role or what we’re building.