Running 8B Llama on Jetson Orin Nano (using only 2.5GB of GPU memory)

Hi! We'd like to share our project on deploying an 8B Llama model on Jetson Orin Nano using only 2.5GB of GPU shared memory (peak), with a comparison against an INT4 llama.cpp baseline.

Baseline (llama.cpp INT4)

In our baseline setup, Llama-3.1-8B quantized to INT4 peaked at:

  • 5.2GB GPU shared memory (peak)

  • 6.8GB total RAM (peak)

On Jetson Orin Nano, this uses most of the available memory budget and leaves limited headroom for other edge workloads.

Our result

Using our own extreme low-bit (1.58-bit) deployment pipeline, we ran an 8B-class Llama model with:

  • 2.5GB GPU shared memory (peak)

  • 4.1GB total RAM (peak)

This makes the deployment more practical on Orin Nano when the LLM needs to coexist with other components on the device.
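A quick back-of-envelope check makes the gap plausible. These are illustrative weight-storage numbers only, not measurements from our runs: the peaks reported above also include KV cache, activations, and runtime buffers.

```python
# Approximate weight storage for an 8B-parameter model at different bit widths.
# Illustrative arithmetic only; real peak memory adds KV cache, activations,
# and runtime buffers on top of this.
params = 8e9

int4_weights_gb = params * 4 / 8 / 1e9      # 4 bits per weight -> ~4.0 GB
ternary_weights_gb = params * 1.58 / 8 / 1e9  # 1.58 bits per weight -> ~1.6 GB

print(f"INT4 weights:    {int4_weights_gb:.2f} GB")
print(f"1.58-bit weights: {ternary_weights_gb:.2f} GB")
```

The roughly 2.4GB difference in weight storage alone accounts for most of the gap between the 5.2GB and 2.5GB GPU peaks above.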

Main Techniques

  • 1.58-bit quantization (mixed-precision quantization-aware training, QAT)

  • Kernel-level optimizations (custom kernels for embedding access and layer fusion)
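For readers unfamiliar with 1.58-bit quantization: it constrains each weight to the ternary set {-1, 0, +1} (log2(3) ≈ 1.58 bits). The post does not specify our exact scheme, so the sketch below uses the well-known absmean rule from BitNet b1.58 purely as an illustration; the function names are ours.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    # Absmean ternary quantization (BitNet-b1.58-style, shown for illustration):
    # scale by the mean absolute weight, then round each weight to the
    # nearest value in {-1, 0, +1}.
    gamma = np.mean(np.abs(w)) + 1e-8          # per-tensor scale
    q = np.clip(np.round(w / gamma), -1, 1)    # ternary codes
    return q.astype(np.int8), gamma

def ternary_dequantize(q: np.ndarray, gamma: float) -> np.ndarray:
    # Reconstruct an approximate float tensor from codes and scale.
    return q.astype(np.float32) * gamma

w = np.array([0.5, -0.3, 0.05, 1.2], dtype=np.float32)
q, gamma = ternary_quantize(w)
print(q, ternary_dequantize(q, gamma))
```

In practice the ternary codes are bit-packed (several weights per byte), which is where the memory savings over INT4 come from; QAT trains the model with this constraint in the loop rather than quantizing after the fact.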

Demo Video

Notes

  • Instruction tuning of our 1.58-bit Llama model has been limited so far; we expect quality to improve with additional tuning.

Why this may be useful

For edge deployments, memory headroom matters because the LLM often needs to run alongside other components such as:

  • Other AI models, such as STT and TTS

  • System workloads, such as perception, logging, control, and networking

Reducing the model footprint makes on-device LLM deployment more realistic even on Nano-class edge SoCs.

We will be sharing more details at GTC 2026!

If memory footprint or latency is blocking you while deploying Llama or other LLMs on Jetson or other SoC platforms, please leave us a message.

Contact: https://enerzai.com/contact
