When it comes to LLMs, using llama.cpp as the backend is probably the best choice (and likely will remain so), as it offers the widest range of options. For embedders, ONNX Runtime currently tends to provide faster and more stable performance on edge devices.
The best architecture today is a split, local-first one:
cross-platform UI shell → worker isolate → native AI core behind FFI → SQLite-based local knowledge store → optional cloud only for heavy preprocessing or fallback. That shape matches the current toolchain well: Flutter now recommends package_ffi for native bindings, platform channels remain the standard path for host-platform APIs, llama.cpp is still a practical local C/C++ inference baseline, ONNX Runtime Mobile is explicitly aimed at iOS and Android, and SQLite FTS5 remains a mature embedded full-text engine. (Flutter Document)
That recommendation is less about elegance and more about survival on real devices. Mobile AI systems are constrained by binary size, model size, latency, memory, power, thermals, and repeat-run stability, not just raw model quality. ONNX Runtime’s mobile guidance explicitly tells you to measure binary size, model size, latency, and power consumption, and public issue threads from llama.cpp and ONNX Runtime show the common failure modes in practice: slow Android token rates without the right build path, Swift/Objective-C++ integration friction on Apple platforms, and crashes only after hundreds of repeated inference runs. (ONNX Runtime)
1. What is currently the most reliable architecture for local LLM + RAG on mobile?
The most reliable architecture is usually local-first, not local-only. Keep query-time interaction local whenever the device can handle it, and reserve the cloud for the jobs phones are poor at: OCR, layout-heavy PDF parsing, bulk embedding, cross-device sync, or remote generation fallback when the device is weak, hot, or storage-constrained. That gives users the privacy and low latency they actually notice without forcing every expensive pipeline stage onto the phone. Apple’s Foundation Models docs emphasize on-device use on supported Apple Intelligence devices, while ONNX Runtime’s mobile docs frame mobile deployment as a process of fitting models to device limits rather than assuming the device should do everything. (Apple Developer)
A good reference architecture looks like this:
- Flutter or another cross-platform shell for UI, navigation, streaming text, settings, and orchestration
- one long-lived worker isolate for non-UI orchestration
- a native AI core behind a narrow C ABI
- SQLite as the embedded knowledge spine
- optional cloud preprocessing or fallback only where justified
Flutter’s isolate docs say plugins can be used from background isolates as of Flutter 3.7, which makes this orchestration pattern practical without blocking the UI isolate. (Flutter Document)
Inside the native core, I would split responsibilities like this:
- generator runtime: llama.cpp + GGUF
- embedder / reranker runtime: ONNX Runtime Mobile
- retrieval engine: SQLite + FTS5 + vector scoring
- model / index manager: downloads, versioning, cache, migrations
llama.cpp still describes its goal as minimal-setup LLM inference with strong performance across a wide range of hardware. ONNX Runtime Mobile is explicitly designed for mobile deployment and supports reduced-operator builds to shrink the runtime footprint, which is useful for embedders and rerankers. (GitHub)
I would treat LiteRT-LM and Apple Foundation Models as optimization branches, not the shared baseline. LiteRT-LM is now described by Google as a production-ready, open-source framework for high-performance, cross-platform edge LLM deployment, so it is very relevant if Android optimization becomes a priority. Apple’s Foundation Models framework provides access to Apple’s on-device model on supported Apple Intelligence devices. Both are real options, but both push you toward more platform-specific behavior. For a first cross-platform architecture, the simpler baseline is still easier to control. (GitHub)
2. If the frontend is Flutter, should you use Platform Channels or FFI?
Use both, but for different jobs.
FFI should handle the AI hot path:
- model load / unload
- tokenization
- prefill / decode
- token polling
- embedding
- vector scoring
- reranking
- cancellation
- stats
Platform Channels or Pigeon should handle host-platform work:
- file pickers
- secure storage
- downloads
- background scheduling
- battery / thermal / storage signals
- permissions
- OS lifecycle hooks
Flutter’s docs are clear on the distinction. dart:ffi is the direct native binding path, while platform channels are the standard path for communicating with platform-specific code. Flutter also states that platform channels and Pigeon use StandardMessageCodec, which serializes and deserializes messages automatically. That is fine for host APIs. It is the wrong transport for token-by-token inference loops. (Flutter Document)
This is the main decision rule:
- high-frequency compute into a native library → FFI
- host-platform service call → Platform Channels or Pigeon
That rule stays correct even as the app grows. Flutter’s architecture guidance has long noted that FFI can be considerably faster than platform channels for C-based APIs because it avoids serialization. The newer package_ffi workflow makes that path cleaner than it used to be. (Flutter Document)
A Flutter app for local LLM + RAG should usually have two packages:
- ai_core_ffi
- host_services_plugin
That split matches Flutter’s official guidance directly: package_ffi for native code binding, plugin/channel APIs for host-platform communication. (Flutter Document)
3. What are good approaches for local knowledge retrieval on mobile devices?
The best default is hybrid retrieval, not vector-only retrieval.
Use this pipeline:
metadata filter → FTS5 lexical retrieval → vector retrieval → fusion → optional rerank → compact prompt assembly
That works well on mobile because real user queries are mixed. Some are literal: names, IDs, filenames, error strings, setting labels. Some are semantic: paraphrases or concept-level queries. SQLite FTS5 is very good at the literal side, and BEIR remains one of the clearest large-scale reminders that BM25 is still a strong baseline, while reranking often improves quality at higher computational cost. (SQLite)
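The fusion step in that pipeline can be as simple as Reciprocal Rank Fusion (RRF), which merges the lexical and vector result lists using only ranks, so no score normalization is needed. A minimal sketch (the chunk ids and the `k=60` constant are illustrative, not from any specific library):

```python
def rrf_fuse(lexical_ids, vector_ids, k=60, top_n=10):
    """Reciprocal Rank Fusion: merge two ranked lists of chunk ids.

    Each list contributes 1 / (k + rank) per item; items ranked well
    in both lists accumulate the highest fused score.
    """
    scores = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Chunk 3 ranks near the top of both lists, so it fuses highest.
fused = rrf_fuse([3, 1, 7, 2], [5, 3, 2, 9])
```

Rank-only fusion is attractive on mobile precisely because BM25 scores and cosine similarities live on incompatible scales; RRF sidesteps that calibration problem entirely.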
SQLite is a strong local knowledge spine because it keeps documents, chunks, metadata, lexical indexes, and vectors in one embedded store. FTS5 supports BM25 ranking and external-content tables, which are useful if you want to avoid storing duplicate text. The caution is that external-content mode shifts consistency responsibility to you: SQLite’s docs make clear that you must keep the FTS index synchronized with the content table yourself. (SQLite)
A practical schema usually looks like:
- documents
- chunks
- chunk_metadata
- embeddings
- fts_chunks
That design is boring in the right way. It is easy to inspect, easy to sync, and easy to debug. On mobile, those properties matter more than fashionable infrastructure. (SQLite)
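A minimal sketch of that schema, including the external-content FTS5 setup and the sync triggers the SQLite docs warn you are responsible for (table and column names here are illustrative assumptions, and this assumes a SQLite build with FTS5 enabled):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, source TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY,
                     doc_id INTEGER REFERENCES documents(id), body TEXT);
CREATE TABLE embeddings (chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id),
                         vec BLOB);

-- External-content FTS5 index over chunks.body: no duplicated text,
-- but the index must be kept in sync manually -- here via triggers
-- (an UPDATE trigger is also needed in a real schema).
CREATE VIRTUAL TABLE fts_chunks USING fts5(body, content='chunks',
                                           content_rowid='id');
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
  INSERT INTO fts_chunks(rowid, body) VALUES (new.id, new.body);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
  INSERT INTO fts_chunks(fts_chunks, rowid, body)
  VALUES ('delete', old.id, old.body);
END;
""")
db.execute("INSERT INTO chunks (doc_id, body) VALUES (1, 'reset the sync token')")
hits = db.execute(
    "SELECT rowid, bm25(fts_chunks) FROM fts_chunks WHERE fts_chunks MATCH ?",
    ("sync",),
).fetchall()
```

The triggers are the important part: with external-content tables, a forgotten trigger (or a write path that bypasses them) is exactly how the "FTS index out of sync" drift described later creeps in.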
For vectors, I would be conservative in v1. The safest approach is still:
store vectors in SQLite rows or BLOBs and score them in native code.
If you want a SQLite-native vector layer, sqlite-vec is promising, but its own repo says it is pre-v1, and its ANN tracking issue states that current search is brute-force only as of v0.1.0, with ANN planned before v1. It now publishes precompiled Android and iOS loadable libraries, which helps experimentation, but I would still keep it behind an internal abstraction and not make it the hard foundation of the whole app on day one. (GitHub)
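The conservative BLOB approach is genuinely simple. A Python sketch of the same idea (in the real app the decode-and-score loop lives in the native core, not in Dart or Python; names here are illustrative):

```python
import array
import math
import sqlite3

def to_blob(vec):
    # Pack as float32 so each 384-dim embedding costs 1536 bytes.
    return array.array("f", vec).tobytes()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE embeddings (chunk_id INTEGER PRIMARY KEY, vec BLOB)")
db.executemany("INSERT INTO embeddings VALUES (?, ?)",
               [(1, to_blob([1.0, 0.0])), (2, to_blob([0.0, 1.0]))])

# Brute-force scan: decode each BLOB back to float32 and score it.
query = [0.9, 0.1]
scored = []
for chunk_id, blob in db.execute("SELECT chunk_id, vec FROM embeddings"):
    scored.append((cosine(query, array.array("f", blob)), chunk_id))
scored.sort(reverse=True)
```

At the corpus sizes a single phone actually holds (tens of thousands of chunks, not millions), a brute-force scan in native code is usually fast enough, which is why ANN can wait.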
Chunking matters more than many RAG tutorials suggest. Structure-aware chunking tends to work better than naive fixed-size splitting because it preserves paragraph, title, and table boundaries. Unstructured’s chunking docs and guide both emphasize chunking around document elements rather than relying only on character counts. (Unstructured)
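A minimal sketch of the structure-aware idea, assuming paragraph boundaries as the structural unit (real pipelines also respect titles and tables, and further split single paragraphs that exceed the budget):

```python
def chunk_by_paragraph(text, max_chars=500):
    """Greedy structure-aware chunking: split on blank lines and pack
    whole paragraphs into chunks up to max_chars, instead of cutting
    mid-sentence at fixed character offsets."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paras:
        # +2 accounts for the paragraph separator we re-insert.
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_by_paragraph(
    "Intro paragraph.\n\nSecond paragraph.\n\nThird one.", max_chars=40
)
```

Keeping whole paragraphs together pays off twice on mobile: retrieval hits are more self-contained, and you waste fewer prompt tokens stitching fragments back into context.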
For small local embedders, practical starting points include compact models like e5-small-v2, bge-small-en-v1.5, or multilingual-e5-small. The E5 small variants expose 384-dimensional embeddings, and bge-small-en-v1.5 has ONNX artifacts on the Hub, which makes them reasonable candidates for a mobile encoder lane. (Hugging Face)
4. How do you balance performance, memory usage, and model size?
The right mindset is budgeting, not “use the biggest model that loads.”
The sequence that works best is:
- choose a device tier
- define a latency target
- pick the smallest model class that can plausibly hit it
- quantize
- cap context
- let retrieval do more of the factual work
- add reranking or a larger model only after measurement
That order matches current mobile guidance well. ONNX Runtime’s mobile docs say to measure binary size, model size, latency, and power consumption. Microsoft’s current explanation of LLM inference also highlights the key hidden cost: decode repeatedly reads the KV cache, which means long context increases memory pressure and can dominate real runtime behavior. Apple’s technote on the on-device foundation model is specifically about budgeting the context window and handling the limit cleanly. (ONNX Runtime)
So the most important performance rule on phones is:
do not use long context as a substitute for good retrieval.
Long context looks attractive, but on-device it is often the hidden performance and memory killer. Better retrieval usually beats bigger prompts. Microsoft’s KV-cache explanation and Apple’s context-window guidance both point in that direction from different angles. (TECHCOMMUNITY.MICROSOFT.COM)
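The KV-cache arithmetic makes the cost concrete. Using the standard size formula (keys plus values, per layer, per KV head, per position) with illustrative numbers for a small GQA model (the 16-layer / 8-KV-head / head-dim-64 / fp16 config below is an assumption for the example, not a specific model's spec):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    # Factor of 2 for keys + values; bytes_per_elt=2 assumes an fp16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Illustrative 1B-class config: 16 layers, 8 KV heads, head_dim 64, fp16.
short = kv_cache_bytes(16, 8, 64, 2048)    # disciplined context budget
long = kv_cache_bytes(16, 8, 64, 32768)    # "just use long context"
print(f"{short / 2**20:.0f} MiB vs {long / 2**20:.0f} MiB of KV cache")
```

The cache grows linearly with context, so a 16x longer prompt costs 16x the KV memory before a single extra parameter is loaded, and decode re-reads that cache on every token. That is memory a phone simply may not have to spare.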
Quantization is the next big lever. ONNX Runtime’s quantization docs recommend S8S8 with QDQ as the default CPU starting point for good performance/accuracy balance, and its float16 docs say converting float32 to float16 can cut model size by up to half and improve performance on some GPUs. For helper models on mobile, ONNX Runtime’s deployment guide says: if the model is quantized, start with CPU; if it is not quantized, start with XNNPACK. (ONNX Runtime)
For runtime size, ONNX Runtime’s custom-build docs matter more than many teams expect. You can shrink the runtime by including only the operator kernels required by your model set, and ONNX Runtime explicitly calls this out as a common need for mobile and web deployments. The memory docs also describe shared arena allocation to reduce memory use across multiple sessions, which is useful if you have an embedder and reranker in the same process. (ONNX Runtime)
A practical device-tier playbook is usually better than a single global model policy:
- constrained phones: local retrieval first, tiny local helpers, remote generation fallback likely
- mainstream phones: 1B–2B class local generator, quantized, small embedder, no default reranker
- flagship phones / tablets: stronger local model, optional reranker, slightly deeper retrieval, maybe an Android-optimized runtime branch later
That tiering is not just theory. Public llama.cpp issue threads show how far token speed can vary across phones and build paths, and the “works on my device” trap is very real in mobile AI. (GitHub)
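In code, a tier policy is just a small lookup plus a classifier fed by the device signals the host-services plugin already collects. A hypothetical sketch (the RAM thresholds, model labels, and top_k values are illustrative assumptions, not benchmarked recommendations):

```python
# Tier policy table mirroring the playbook above.
TIERS = {
    "constrained": {"generator": "remote-fallback", "reranker": False, "top_k": 4},
    "mainstream":  {"generator": "local-1b-q4",     "reranker": False, "top_k": 6},
    "flagship":    {"generator": "local-3b-q4",     "reranker": True,  "top_k": 10},
}

def pick_tier(ram_gb, thermal_ok=True):
    """Classify the device; thermal pressure demotes it regardless of RAM."""
    if ram_gb < 4 or not thermal_ok:
        return "constrained"
    return "flagship" if ram_gb >= 12 else "mainstream"

policy = TIERS[pick_tier(ram_gb=8)]
```

The useful property is that the policy is data, not scattered if-statements: when measurement on real devices contradicts the table, you edit the table.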
The real-world pitfalls that matter most
The biggest production pitfalls are usually these:
- packaging/toolchain friction, especially on Apple platforms when Swift, Objective-C++, and C++ meet in one build graph
- repeat-run stability, where apps work once but fail after hundreds of runs
- context bloat, which silently eats memory and latency
- retrieval correctness drift, especially if an FTS index falls out of sync
- overcommitting to a still-moving vector extension
- overgeneralizing from one test phone
The public issues are useful here. llama.cpp has real Swift Package Manager / Objective-C++ integration issues on Apple platforms, and ONNX Runtime has real reports of repeated-inference mobile crashes in Flutter apps using FFI. Those are not reasons to avoid native runtimes. They are reasons to keep the architecture narrow and test for repetition, not just first-run success. (GitHub)
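"Test for repetition" can be automated with a soak-test harness that runs the inference entry point hundreds of times and asserts stability. A sketch in Python (in a real Flutter + FFI app you would drive the native call and watch process RSS, since `tracemalloc` only sees Python-side allocations; `run_once`, the iteration count, and the growth budget are all illustrative):

```python
import gc
import tracemalloc

def soak_test(run_once, iterations=500, max_growth_bytes=10 * 2**20):
    """Repeat-run stability harness: call the inference entry point many
    times and fail if memory keeps growing run over run."""
    tracemalloc.start()
    run_once()                  # warm-up: exclude one-time allocations
    gc.collect()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        run_once()
    gc.collect()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    assert growth < max_growth_bytes, (
        f"leaked ~{growth} bytes over {iterations} runs"
    )
    return growth

# Stand-in workload; replace with the real generate/embed call via FFI.
growth = soak_test(lambda: sum(range(1000)), iterations=200)
```

The pattern matters more than the tooling: assert health after run 500, not after run 1, because that is where the public mobile crash reports cluster.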
What I would actually ship first
If I had to ship v1 now, I would ship this:
- Flutter for the app shell
- one worker isolate for orchestration
- FFI for the AI hot path
- plugin/channel APIs for host services
- llama.cpp + GGUF for generation
- ONNX Runtime Mobile for embedders and optional rerankers
- SQLite + FTS5 for the knowledge spine
- native vector scoring for v1
- local-first serving
- hybrid preprocessing and remote fallback only where the device is the wrong place to do the work
That stack is not the most exotic. It is the most likely to stay understandable, portable, and debuggable while still giving users the local, low-latency experience they want. (Flutter Document)
Bottom line
For a real mobile app, the strongest general answer is:
keep the UI cross-platform, keep the hot path native, keep the knowledge local, keep prompts small, and keep heavy preprocessing optional.
In practice that means:
- architecture: Flutter shell + native core + SQLite spine
- interop: FFI for inference/retrieval, channels for OS services
- retrieval: metadata + FTS5 + vectors + optional rerank
- performance strategy: smallest acceptable model, aggressive quantization, strict context budget, device-tier policy
That is the most reliable architecture today because it matches the current Flutter tooling, the current mobile runtimes, and the reality that on-device AI fails at the seams before it fails in the benchmark chart. (Flutter Document)