Running 8B Llama on Jetson Orin Nano (using only 2.5GB of GPU memory)

Hi! We'd like to share our project on deploying an 8B Llama model on Jetson Orin Nano using only 2.5GB of GPU shared memory (peak), with a comparison against an INT4 llama.cpp baseline.

Baseline (llama.cpp INT4)

In our baseline setup, Llama-3.1-8B quantized to INT4 peaked at:

  • 5.2GB GPU shared memory (peak)

  • 6.8GB total RAM (peak)

On Jetson Orin Nano, this uses most of the available memory budget and leaves limited headroom for other edge workloads.

Our result

Using our own extreme low-bit (1.58-bit) deployment pipeline, we ran an 8B-class Llama model with:

  • 2.5GB GPU shared memory (peak)

  • 4.1GB total RAM (peak)

This makes the deployment more practical on Orin Nano when the LLM needs to coexist with other components on the device.
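A quick back-of-envelope check makes the gap plausible. These are illustrative weight-storage numbers only, not measurements from our runs: the peaks reported above also include KV cache, activations, and runtime buffers.

```python
# Approximate weight storage for an 8B-parameter model at different bit widths.
# Illustrative arithmetic only; real peak memory adds KV cache, activations,
# and runtime buffers on top of this.
params = 8e9

int4_weights_gb = params * 4 / 8 / 1e9      # 4 bits per weight -> ~4.0 GB
ternary_weights_gb = params * 1.58 / 8 / 1e9  # 1.58 bits per weight -> ~1.6 GB

print(f"INT4 weights:    {int4_weights_gb:.2f} GB")
print(f"1.58-bit weights: {ternary_weights_gb:.2f} GB")
```

The roughly 2.4GB difference in weight storage alone accounts for most of the gap between the 5.2GB and 2.5GB GPU peaks above.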

Main Techniques

  • 1.58-bit quantization (mixed-precision quantization-aware training, QAT)

  • Kernel-level optimizations (custom kernels for embedding access and layer fusion)
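For readers unfamiliar with 1.58-bit quantization: it constrains each weight to the ternary set {-1, 0, +1} (log2(3) ≈ 1.58 bits). The post does not specify our exact scheme, so the sketch below uses the well-known absmean rule from BitNet b1.58 purely as an illustration; the function names are ours.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    # Absmean ternary quantization (BitNet-b1.58-style, shown for illustration):
    # scale by the mean absolute weight, then round each weight to the
    # nearest value in {-1, 0, +1}.
    gamma = np.mean(np.abs(w)) + 1e-8          # per-tensor scale
    q = np.clip(np.round(w / gamma), -1, 1)    # ternary codes
    return q.astype(np.int8), gamma

def ternary_dequantize(q: np.ndarray, gamma: float) -> np.ndarray:
    # Reconstruct an approximate float tensor from codes and scale.
    return q.astype(np.float32) * gamma

w = np.array([0.5, -0.3, 0.05, 1.2], dtype=np.float32)
q, gamma = ternary_quantize(w)
print(q, ternary_dequantize(q, gamma))
```

In practice the ternary codes are bit-packed (several weights per byte), which is where the memory savings over INT4 come from; QAT trains the model with this constraint in the loop rather than quantizing after the fact.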

Demo Video

Notes

  • Instruction tuning of our 1.58-bit Llama model has been limited so far; we expect quality to improve with additional tuning.

Why this may be useful

For edge deployments, memory headroom matters because the LLM often needs to run alongside other components such as:

  • Other AI models, such as STT and TTS

  • System workloads, such as perception, logging, control, and networking

Reducing the model footprint makes on-device LLM deployment more realistic even on Nano-class edge SoCs.

We will be sharing more details at GTC 2026!

If memory footprint or latency is blocking you while deploying Llama or other LLMs on Jetson or other SoC platforms, please leave us a message.

Contact: https://enerzai.com/contact
