When it comes to LLMs, using llama.cpp as the backend is probably the best choice (and likely will remain so), as it offers the widest range of options. For embedders, ONNX Runtime currently tends to provide faster and more stable performance on edge devices.
The best architecture today is a split, local-first one:
cross-platform UI shell → worker isolate → native AI core behind FFI → SQLite-based local knowledge store → optional cloud only for heavy preprocessing or fallback. That shape matches the current toolchain well: Flutter now recommends package_ffi for native bindings, platform channels remain the standard path for host-platform APIs, llama.cpp is still a practical local C/C++ inference baseline, ONNX Runtime Mobile is explicitly aimed at iOS and Android, and SQLite FTS5 remains a mature embedded full-text engine. (Flutter Document)
That recommendation is less about elegance and more about survival on real devices. Mobile AI systems are constrained by binary size, model size, latency, memory, power, thermals, and repeat-run stability, not just raw model quality. ONNX Runtime’s mobile guidance explicitly tells you to measure binary size, model size, latency, and power consumption, and public issue threads from llama.cpp and ONNX Runtime show the common failure modes in practice: slow Android token rates without the right build path, Swift/Objective-C++ integration friction on Apple platforms, and crashes only after hundreds of repeated inference runs. (ONNX Runtime)
1. What is currently the most reliable architecture for local LLM + RAG on mobile?
The most reliable architecture is usually local-first, not local-only. Keep query-time interaction local whenever the device can handle it, and reserve the cloud for the jobs phones are poor at: OCR, layout-heavy PDF parsing, bulk embedding, cross-device sync, or remote generation fallback when the device is weak, hot, or storage-constrained. That gives users the privacy and low latency they actually notice without forcing every expensive pipeline stage onto the phone. Apple’s Foundation Models docs emphasize on-device use on supported Apple Intelligence devices, while ONNX Runtime’s mobile docs frame mobile deployment as a process of fitting models to device limits rather than assuming the device should do everything. (Apple Developer)
A good reference architecture looks like this:
- Flutter or another cross-platform shell for UI, navigation, streaming text, settings, and orchestration
- one long-lived worker isolate for non-UI orchestration
- a native AI core behind a narrow C ABI
- SQLite as the embedded knowledge spine
- optional cloud preprocessing or fallback only where justified
Flutter’s isolate docs say plugins can be used from background isolates as of Flutter 3.7, which makes this orchestration pattern practical without blocking the UI isolate. (Flutter Document)
Inside the native core, I would split responsibilities like this:
- generator runtime: llama.cpp + GGUF
- embedder / reranker runtime: ONNX Runtime Mobile
- retrieval engine: SQLite + FTS5 + vector scoring
- model / index manager: downloads, versioning, cache, migrations
llama.cpp still describes its goal as minimal-setup LLM inference with strong performance across a wide range of hardware. ONNX Runtime Mobile is explicitly designed for mobile deployment and supports reduced-operator builds to shrink the runtime footprint, which is useful for embedders and rerankers. (GitHub)
I would treat LiteRT-LM and Apple Foundation Models as optimization branches, not the shared baseline. LiteRT-LM is now described by Google as a production-ready, open-source framework for high-performance, cross-platform edge LLM deployment, so it is very relevant if Android optimization becomes a priority. Apple’s Foundation Models framework provides access to Apple’s on-device model on supported Apple Intelligence devices. Both are real options, but both push you toward more platform-specific behavior. For a first cross-platform architecture, the simpler baseline is still easier to control. (GitHub)
2. If the frontend is Flutter, should you use Platform Channels or FFI?
Use both, but for different jobs.
FFI should handle the AI hot path:
- model load / unload
- tokenization
- prefill / decode
- token polling
- embedding
- vector scoring
- reranking
- cancellation
- stats
Platform Channels or Pigeon should handle host-platform work:
- file pickers
- secure storage
- downloads
- background scheduling
- battery / thermal / storage signals
- permissions
- OS lifecycle hooks
Flutter’s docs are clear on the distinction. dart:ffi is the direct native binding path, while platform channels are the standard path for communicating with platform-specific code. Flutter also states that platform channels and Pigeon use StandardMessageCodec, which serializes and deserializes messages automatically. That is fine for host APIs. It is the wrong transport for token-by-token inference loops. (Flutter Document)
This is the main decision rule:
- high-frequency compute into a native library → FFI
- host-platform service call → Platform Channels or Pigeon
That rule stays correct even as the app grows. Flutter’s architecture guidance has long noted that FFI can be considerably faster than platform channels for C-based APIs because it avoids serialization. The newer package_ffi workflow makes that path cleaner than it used to be. (Flutter Document)
A Flutter app for local LLM + RAG should usually have two packages:
- ai_core_ffi
- host_services_plugin
That split matches Flutter’s official guidance directly: package_ffi for native code binding, plugin/channel APIs for host-platform communication. (Flutter Document)
3. What are good approaches for local knowledge retrieval on mobile devices?
The best default is hybrid retrieval, not vector-only retrieval.
Use this pipeline:
metadata filter → FTS5 lexical retrieval → vector retrieval → fusion → optional rerank → compact prompt assembly
That works well on mobile because real user queries are mixed. Some are literal: names, IDs, filenames, error strings, setting labels. Some are semantic: paraphrases or concept-level queries. SQLite FTS5 is very good at the literal side, and BEIR remains one of the clearest large-scale reminders that BM25 is still a strong baseline, while reranking often improves quality at higher computational cost. (SQLite)
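The fusion step in that pipeline can be as simple as Reciprocal Rank Fusion (RRF), which merges the lexical and vector result lists using only ranks, so no score normalization is needed. A minimal sketch (the chunk ids and the `k=60` constant are illustrative, not from any specific library):

```python
def rrf_fuse(lexical_ids, vector_ids, k=60, top_n=10):
    """Reciprocal Rank Fusion: merge two ranked lists of chunk ids.

    Each list contributes 1 / (k + rank) per item; items ranked well
    in both lists accumulate the highest fused score.
    """
    scores = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Chunk 3 ranks near the top of both lists, so it fuses highest.
fused = rrf_fuse([3, 1, 7, 2], [5, 3, 2, 9])
```

Rank-only fusion is attractive on mobile precisely because BM25 scores and cosine similarities live on incompatible scales; RRF sidesteps that calibration problem entirely.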
SQLite is a strong local knowledge spine because it keeps documents, chunks, metadata, lexical indexes, and vectors in one embedded store. FTS5 supports BM25 ranking and external-content tables, which are useful if you want to avoid storing duplicate text. The caution is that external-content mode shifts consistency responsibility to you: SQLite’s docs make clear that you must keep the FTS index synchronized with the content table yourself. (SQLite)
A practical schema usually looks like:
- documents
- chunks
- chunk_metadata
- embeddings
- fts_chunks
That design is boring in the right way. It is easy to inspect, easy to sync, and easy to debug. On mobile, those properties matter more than fashionable infrastructure. (SQLite)
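A minimal sketch of that schema, including the external-content FTS5 setup and the sync triggers the SQLite docs warn you are responsible for (table and column names here are illustrative assumptions, and this assumes a SQLite build with FTS5 enabled):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, source TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY,
                     doc_id INTEGER REFERENCES documents(id), body TEXT);
CREATE TABLE embeddings (chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
                         vec BLOB);

-- External-content FTS5 index over chunks.body: no duplicated text,
-- but the index must be kept in sync manually -- here via triggers
-- (an UPDATE trigger is also needed in a real schema).
CREATE VIRTUAL TABLE fts_chunks USING fts5(body, content='chunks',
                                           content_rowid='id');
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
  INSERT INTO fts_chunks(rowid, body) VALUES (new.id, new.body);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
  INSERT INTO fts_chunks(fts_chunks, rowid, body)
  VALUES ('delete', old.id, old.body);
END;
""")
db.execute("INSERT INTO chunks (doc_id, body) VALUES (1, 'reset the sync token')")
hits = db.execute(
    "SELECT rowid, bm25(fts_chunks) FROM fts_chunks WHERE fts_chunks MATCH ?",
    ("sync",),
).fetchall()
```

The triggers are the important part: with external-content tables, a forgotten trigger (or a write path that bypasses them) is exactly how the "FTS index out of sync" drift described later creeps in.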
For vectors, I would be conservative in v1. The safest approach is still:
store vectors in SQLite rows or BLOBs and score them in native code.
If you want a SQLite-native vector layer, sqlite-vec is promising, but its own repo says it is pre-v1, and its ANN tracking issue states that current search is brute-force only as of v0.1.0, with ANN planned before v1. It now publishes precompiled Android and iOS loadable libraries, which helps experimentation, but I would still keep it behind an internal abstraction and not make it the hard foundation of the whole app on day one. (GitHub)
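The conservative BLOB approach is genuinely simple. A Python sketch of the same idea (in the real app the decode-and-score loop lives in the native core, not in Dart or Python; names here are illustrative):

```python
import array
import math
import sqlite3

def to_blob(vec):
    # Pack as float32 so each 384-dim embedding costs 1536 bytes.
    return array.array("f", vec).tobytes()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE embeddings (chunk_id INTEGER PRIMARY KEY, vec BLOB)")
db.executemany("INSERT INTO embeddings VALUES (?, ?)",
               [(1, to_blob([1.0, 0.0])), (2, to_blob([0.0, 1.0]))])

# Brute-force scan: decode each BLOB back to float32 and score it.
query = [0.9, 0.1]
scored = []
for chunk_id, blob in db.execute("SELECT chunk_id, vec FROM embeddings"):
    scored.append((cosine(query, array.array("f", blob)), chunk_id))
scored.sort(reverse=True)
```

At the corpus sizes a single phone actually holds (tens of thousands of chunks, not millions), a brute-force scan in native code is usually fast enough, which is why ANN can wait.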
Chunking matters more than many RAG tutorials suggest. Structure-aware chunking tends to work better than naive fixed-size splitting because it preserves paragraph, title, and table boundaries. Unstructured’s chunking docs and guide both emphasize chunking around document elements rather than relying only on character counts. (Unstructured)
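A minimal sketch of the structure-aware idea, assuming paragraph boundaries as the structural unit (real pipelines also respect titles and tables, and further split single paragraphs that exceed the budget):

```python
def chunk_by_paragraph(text, max_chars=500):
    """Greedy structure-aware chunking: split on blank lines and pack
    whole paragraphs into chunks up to max_chars, instead of cutting
    mid-sentence at fixed character offsets."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paras:
        # +2 accounts for the paragraph separator we re-insert.
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_by_paragraph(
    "Intro paragraph.\n\nSecond paragraph.\n\nThird one.", max_chars=40
)
```

Keeping whole paragraphs together pays off twice on mobile: retrieval hits are more self-contained, and you waste fewer prompt tokens stitching fragments back into context.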
For small local embedders, practical starting points include compact models like e5-small-v2, bge-small-en-v1.5, or multilingual-e5-small. The E5 small variants expose 384-dimensional embeddings, and bge-small-en-v1.5 has ONNX artifacts on the Hub, which makes them reasonable candidates for a mobile encoder lane. (Hugging Face)
4. How do you balance performance, memory usage, and model size?
The right mindset is budgeting, not “use the biggest model that loads.”
The sequence that works best is:
- choose a device tier
- define a latency target
- pick the smallest model class that can plausibly hit it
- quantize
- cap context
- let retrieval do more of the factual work
- add reranking or a larger model only after measurement
That order matches current mobile guidance well. ONNX Runtime’s mobile docs say to measure binary size, model size, latency, and power consumption. Microsoft’s current explanation of LLM inference also highlights the key hidden cost: decode repeatedly reads the KV cache, which means long context increases memory pressure and can dominate real runtime behavior. Apple’s technote on the on-device foundation model is specifically about budgeting the context window and handling the limit cleanly. (ONNX Runtime)
So the most important performance rule on phones is:
do not use long context as a substitute for good retrieval.
Long context looks attractive, but on-device it is often the hidden performance and memory killer. Better retrieval usually beats bigger prompts. Microsoft’s KV-cache explanation and Apple’s context-window guidance both point in that direction from different angles. (TECHCOMMUNITY.MICROSOFT.COM)
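The KV-cache arithmetic makes the cost concrete. Using the standard size formula (keys plus values, per layer, per KV head, per position) with illustrative numbers for a small GQA model (the 16-layer / 8-KV-head / head-dim-64 / fp16 config below is an assumption for the example, not a specific model's spec):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    # Factor of 2 for keys + values; bytes_per_elt=2 assumes an fp16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Illustrative 1B-class config: 16 layers, 8 KV heads, head_dim 64, fp16.
short = kv_cache_bytes(16, 8, 64, 2048)    # disciplined context budget
long = kv_cache_bytes(16, 8, 64, 32768)    # "just use long context"
print(f"{short / 2**20:.0f} MiB vs {long / 2**20:.0f} MiB of KV cache")
```

The cache grows linearly with context, so a 16x longer prompt costs 16x the KV memory before a single extra parameter is loaded, and decode re-reads that cache on every token. That is memory a phone simply may not have to spare.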
Quantization is the next big lever. ONNX Runtime’s quantization docs recommend S8S8 with QDQ as the default CPU starting point for good performance/accuracy balance, and its float16 docs say converting float32 to float16 can cut model size by up to half and improve performance on some GPUs. For helper models on mobile, ONNX Runtime’s deployment guide says: if the model is quantized, start with CPU; if it is not quantized, start with XNNPACK. (ONNX Runtime)
For runtime size, ONNX Runtime’s custom-build docs matter more than many teams expect. You can shrink the runtime by including only the operator kernels required by your model set, and ONNX Runtime explicitly calls this out as a common need for mobile and web deployments. The memory docs also describe shared arena allocation to reduce memory use across multiple sessions, which is useful if you have an embedder and reranker in the same process. (ONNX Runtime)
A practical device-tier playbook is usually better than a single global model policy:
- constrained phones: local retrieval first, tiny local helpers, remote generation fallback likely
- mainstream phones: 1B–2B class local generator, quantized, small embedder, no default reranker
- flagship phones / tablets: stronger local model, optional reranker, slightly deeper retrieval, maybe an Android-optimized runtime branch later
That tiering is not just theory. Public llama.cpp issue threads show how far token speed can vary across phones and build paths, and the “works on my device” trap is very real in mobile AI. (GitHub)
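In code, a tier policy is just a small lookup plus a classifier fed by the device signals the host-services plugin already collects. A hypothetical sketch (the RAM thresholds, model labels, and top_k values are illustrative assumptions, not benchmarked recommendations):

```python
# Tier policy table mirroring the playbook above.
TIERS = {
    "constrained": {"generator": "remote-fallback", "reranker": False, "top_k": 4},
    "mainstream":  {"generator": "local-1b-q4",     "reranker": False, "top_k": 6},
    "flagship":    {"generator": "local-3b-q4",     "reranker": True,  "top_k": 10},
}

def pick_tier(ram_gb, thermal_ok=True):
    """Classify the device; thermal pressure demotes it regardless of RAM."""
    if ram_gb < 4 or not thermal_ok:
        return "constrained"
    return "flagship" if ram_gb >= 12 else "mainstream"

policy = TIERS[pick_tier(ram_gb=8)]
```

The useful property is that the policy is data, not scattered if-statements: when measurement on real devices contradicts the table, you edit the table.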
The real-world pitfalls that matter most
The biggest production pitfalls are usually these:
- packaging/toolchain friction, especially on Apple platforms when Swift, Objective-C++, and C++ meet in one build graph
- repeat-run stability, where apps work once but fail after hundreds of runs
- context bloat, which silently eats memory and latency
- retrieval correctness drift, especially if an FTS index falls out of sync
- overcommitting to a still-moving vector extension
- overgeneralizing from one test phone
The public issues are useful here. llama.cpp has real Swift Package Manager / Objective-C++ integration issues on Apple platforms, and ONNX Runtime has real reports of repeated-inference mobile crashes in Flutter apps using FFI. Those are not reasons to avoid native runtimes. They are reasons to keep the architecture narrow and test for repetition, not just first-run success. (GitHub)
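"Test for repetition" can be automated with a soak-test harness that runs the inference entry point hundreds of times and asserts stability. A sketch in Python (in a real Flutter + FFI app you would drive the native call and watch process RSS, since `tracemalloc` only sees Python-side allocations; `run_once`, the iteration count, and the growth budget are all illustrative):

```python
import gc
import tracemalloc

def soak_test(run_once, iterations=500, max_growth_bytes=10 * 2**20):
    """Repeat-run stability harness: call the inference entry point many
    times and fail if memory keeps growing run over run."""
    tracemalloc.start()
    run_once()                  # warm-up: exclude one-time allocations
    gc.collect()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        run_once()
    gc.collect()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    assert growth < max_growth_bytes, (
        f"leaked ~{growth} bytes over {iterations} runs"
    )
    return growth

# Stand-in workload; replace with the real generate/embed call via FFI.
growth = soak_test(lambda: sum(range(1000)), iterations=200)
```

The pattern matters more than the tooling: assert health after run 500, not after run 1, because that is where the public mobile crash reports cluster.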
What I would actually ship first
If I had to ship v1 now, I would ship this:
- Flutter for the app shell
- one worker isolate for orchestration
- FFI for the AI hot path
- plugin/channel APIs for host services
- llama.cpp + GGUF for generation
- ONNX Runtime Mobile for embedders and optional rerankers
- SQLite + FTS5 for the knowledge spine
- native vector scoring for v1
- local-first serving
- hybrid preprocessing and remote fallback only where the device is the wrong place to do the work
That stack is not the most exotic. It is the most likely to stay understandable, portable, and debuggable while still giving users the local, low-latency experience they want. (Flutter Document)
Bottom line
For a real mobile app, the strongest general answer is:
keep the UI cross-platform, keep the hot path native, keep the knowledge local, keep prompts small, and keep heavy preprocessing optional.
In practice that means:
- architecture: Flutter shell + native core + SQLite spine
- interop: FFI for inference/retrieval, channels for OS services
- retrieval: metadata + FTS5 + vectors + optional rerank
- performance strategy: smallest acceptable model, aggressive quantization, strict context budget, device-tier policy
That is the most reliable architecture today because it matches the current Flutter tooling, the current mobile runtimes, and the reality that on-device AI fails at the seams before it fails in the benchmark chart. (Flutter Document)