Survey on Local vs Self-hosted LLMs and Privacy (Academic Research)

Hi everyone,

I am an MCA student currently working on a research paper exploring the differences between Local LLMs and Self-hosted LLM setups, with a focus on data privacy, control, and real-world usage patterns.

I have created a short anonymous survey that takes approximately 2 minutes to complete: AI Tools & Privacy — Survey for Self-Hosted + Online AI Users (Form B)

I am particularly interested in responses from people who have experience with:

Running LLMs locally (e.g., Ollama, GGUF-based models)

Self-hosting models or AI services

Privacy-conscious AI usage

No personal data is collected, and the responses will be used strictly for academic purposes.

I would also be happy to share a summary of the findings with the community once the research is complete.

Thank you for your time.


By “self hosted service” do you mean software like OpenClaw? There are a bunch of services that can be self-hosted, I’m keeping a list of all I find.


By self-host I mean running an LLM on your system directly, using raw weights, e.g. running Qwen3.5 on Ollama, LM Studio, or llama.cpp. On the other hand, OpenClaw, Claude Code, or Open WebUI are just clients connecting to said LLM or service.


This rounds out the technical profile perfectly. Having that split between speed (2TB M.2) and volume (4TB HDD) is the classic “Local AI” storage strategy.

Here is the finalized data block with your storage specs integrated. This explains exactly how you manage high-speed inference versus massive data archiving.


:building_construction: Final System Profile: The “Gavin” Infrastructure (Contributor Data)

1. Hardware & Storage Architecture

  • GPU: AMD Radeon RX 7800 XT (16GB VRAM)

  • Memory: 64GB DDR4 System RAM

  • Primary Storage (Inference/OS): 2TB M.2 NVMe SSD

    • Function: Houses the OS, the active Model weights (Gemma-4), and the Open WebUI database. The high read/write speeds of the M.2 are critical for loading massive Q8_0 quants into VRAM without long boot-up delays.
  • Secondary Storage (Data Lake): 4TB HDD

    • Function: Archiving massive datasets like the iFixit ZIM library, historical chat logs, and long-term document backups.
  • The Bandwidth Bottleneck: While the 4TB HDD is great for bulk storage, running RAG (Retrieval-Augmented Generation) directly from the HDD causes a significant latency spike during the initial “index” phase. Moving active datasets to the 2TB M.2 is a requirement for a responsive local AI experience.
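The “move active datasets to the M.2 before indexing” step above can be sketched as a small staging script. This is a minimal sketch, not the author’s actual tooling; the `/mnt/hdd` and `/mnt/nvme` mount points are hypothetical placeholders for the 4TB archive and the 2TB NVMe drive.

```python
import shutil
from pathlib import Path

def stage_dataset(src: Path, dst: Path) -> int:
    """Copy a dataset from slow archive storage to fast NVMe staging.

    Returns the number of files copied. Existing files at the
    destination are overwritten, so repeated staging is idempotent.
    """
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # preserves timestamps/metadata
            copied += 1
    return copied

if __name__ == "__main__":
    # Hypothetical mount points -- adjust to your own layout.
    src = Path("/mnt/hdd/datasets/ifixit_zim")
    dst = Path("/mnt/nvme/rag_active/ifixit_zim")
    if src.exists():
        print(f"staged {stage_dataset(src, dst)} files")
```

Running this once before the RAG index phase means all random reads during indexing hit the NVMe, which is where the latency spike the post describes comes from.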

2. Networking & Remote Access Logic

  • Frontend: Open WebUI (Admin + Multi-user setup).

  • Remote Tunneling: Cloudflare Zero Trust (cloudflared).

    • Setup: Mapping a personal domain name to the local Open WebUI port.

    • Capability: Allows external devices (iPhone, tablets) to securely log into “Gavin” from any global location to utilize the 7800 XT’s power without exposing the home network via port forwarding.
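For reference, the domain-to-local-port mapping described above is configured in cloudflared roughly like this. This is a generic sketch, not the author’s actual config: the tunnel name `gavin`, the hostname `ai.example.com`, and port 3000 for Open WebUI are all placeholders to adjust to your own install.

```shell
# One-time setup (tunnel name and hostname are placeholders):
cloudflared tunnel login
cloudflared tunnel create gavin
cloudflared tunnel route dns gavin ai.example.com

# ~/.cloudflared/config.yml -- the ingress rules map the public
# hostname to the local Open WebUI port; the final catch-all rule
# is required by cloudflared:
cat > ~/.cloudflared/config.yml <<'EOF'
tunnel: gavin
credentials-file: /home/USER/.cloudflared/TUNNEL_ID.json
ingress:
  - hostname: ai.example.com
    service: http://localhost:3000
  - service: http_status:404
EOF

cloudflared tunnel run gavin
```

Because the tunnel dials out to Cloudflare, no inbound port forwarding is needed, which is the security property the post is describing.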

3. Strategic Tuning (The “Surgical Tune”)

  • Gemma-4-E4B (Q8_0) Calibration:

    • Temperature: 0.8

    • Top_P: 0.85 / Top_K: 40

    • Repeat Penalty: 1.1

  • Outcome: These subtle changes act like a GPU overclock. They tighten the logic, prevent wordy “rambling,” and keep the model within the 16GB VRAM limit while preserving near-full-precision quality at 8-bit (Q8_0) quantization.
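The calibration above maps directly onto Ollama’s sampling options (`temperature`, `top_p`, `top_k`, `repeat_penalty` are all real option names in Ollama’s `/api/generate` API). A minimal sketch of sending that tune, assuming a default local Ollama endpoint; the model tag `gemma-4-e4b:q8_0` is a hypothetical name matching the post’s naming, not a verified registry tag:

```python
import json

# The "surgical tune" from the post, expressed as Ollama sampling options.
TUNE = {
    "temperature": 0.8,
    "top_p": 0.85,
    "top_k": 40,
    "repeat_penalty": 1.1,
}

def build_payload(prompt: str, model: str = "gemma-4-e4b:q8_0") -> dict:
    """Assemble a non-streaming Ollama /api/generate request body.

    The model tag is a placeholder -- substitute whatever tag your
    local Ollama install actually uses.
    """
    return {
        "model": model,
        "prompt": prompt,
        "options": dict(TUNE),  # copy so callers can't mutate the tune
        "stream": False,
    }

if __name__ == "__main__":
    print(json.dumps(build_payload("Explain VRAM offloading."), indent=2))
```

POSTing this body to `http://localhost:11434/api/generate` applies the tune per-request, so you can A/B different settings without editing a Modelfile.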

4. Observations on Friction (What Failed)

  • VRAM Spillage: 16GB is a hard limit. If the context window grows too large, the model spills into the 64GB DDR4 RAM. The resulting drop in tokens-per-second is extreme (10x-20x slowdown), proving that VRAM bandwidth is the primary bottleneck in home-scale AI servers.

  • Headless Scraping: Attempting to automate a “Robot Librarian” to index local Kiwix/iFixit files via a headless browser (Playwright/Chromium) is inconsistent because the AI cannot always “see” JavaScript-rendered links in a non-GUI environment.
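On the headless-scraping failure above: ZIM content served by kiwix-serve is static HTML, so one workaround is to skip the browser entirely and pull links with a plain HTML parser, which has no JavaScript-rendering blind spot. A minimal sketch using only the standard library; the kiwix-serve URL in the demo is a hypothetical placeholder:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    # In practice you would fetch pages from your local kiwix-serve
    # instance (e.g. urllib.request.urlopen on a hypothetical
    # "http://localhost:8080/..." book URL) and feed them here.
    print(extract_links('<a href="/guide/42">Guide 42</a>'))
```

Since no JavaScript executes server-side in a ZIM archive anyway, a static parser sees the same links a GUI browser would, which is why this tends to be more consistent than Playwright for a “Robot Librarian” crawl.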

5. The Result: A Full Multimedia Local Intelligence Hub

The culmination of this hardware and software stack is a fully multimodal autonomous system that functions entirely without external cloud processing.

  • Multimodal Analysis (Vision & Audio):

    • Vision: The system can “see” and analyze images. By utilizing vision-capable models (like Llava or Gemma-2-Vision), the server can describe photos, read text from screenshots, and assist in technical repairs by “looking” at the iFixit documentation it has indexed.

    • Audio: Integration of local Whisper (Speech-to-Text) and Piper (Text-to-Speech) allows for a seamless voice interface. You can speak to the system, and it replies with high-fidelity, human-like speech.
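For the vision side, Ollama’s `/api/generate` accepts base64-encoded images in an `images` field for multimodal models such as LLaVA. A minimal sketch of building such a request; the model tag and the photo path in the demo are placeholders, not verified details of this setup:

```python
import base64
import json
from pathlib import Path

def build_vision_payload(prompt: str, image_path: Path,
                         model: str = "llava") -> dict:
    """Assemble an Ollama /api/generate body with an attached image.

    Multimodal models accept images via the "images" field as a list
    of base64-encoded strings.
    """
    img_b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [img_b64],
        "stream": False,
    }

if __name__ == "__main__":
    photo = Path("broken_screen.jpg")  # hypothetical repair photo
    if photo.exists():
        body = build_vision_payload("What part is damaged?", photo)
        print(json.dumps(body)[:200])
```

POSTing this to the local Ollama endpoint is what lets the server “look” at a repair photo alongside the indexed iFixit text.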

  • The “VRAM Sweet Spot”:

    • Efficient Offloading: Despite the complexity, the system is tuned to sit at ~12GB VRAM usage (3/4 of the 7800 XT’s capacity).

    • The Context Buffer: By leaving 4GB of VRAM empty, the system maintains a massive “buffer.” This allows the AI to keep thousands of words of technical documentation or long conversation histories in its “short-term memory” (Active Context) without crashing or slowing down.
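The 4GB context buffer can be reasoned about quantitatively: the KV cache grows linearly with context length, at roughly `2 × layers × tokens × kv_heads × head_dim × bytes_per_element` (keys plus values, every layer, every cached token). A back-of-the-envelope sketch; the architecture numbers in the demo are illustrative, not Gemma’s real dimensions:

```python
def kv_cache_gib(ctx_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Rough transformer KV-cache size in GiB.

    Factor of 2 accounts for keys AND values; fp16 (2 bytes per
    element) is the default cache precision.
    """
    return (2 * n_layers * ctx_len * n_kv_heads * head_dim
            * bytes_per_elem) / 2**30

def fits_in_buffer(ctx_len: int, buffer_gib: float = 4.0, **arch) -> bool:
    """Does this context length fit in the reserved VRAM headroom?"""
    return kv_cache_gib(ctx_len, **arch) <= buffer_gib

if __name__ == "__main__":
    # Illustrative mid-size architecture (NOT verified Gemma dimensions):
    arch = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (4096, 32768):
        print(ctx, round(kv_cache_gib(ctx, **arch), 2), "GiB")
```

For that illustrative architecture, a 4GiB buffer holds a ~32k-token cache at fp16, which is the sort of headroom that keeps long technical documents in active context without spilling into system RAM.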

  • Human-Centric Interaction:

    • Through the “Surgical Tune” of parameters (Temp 0.8), the system provides nuanced, human-like replies. It avoids the “robotic” and repetitive nature of base models, offering professional-grade technical support and creative brainstorming that feels intuitive rather than scripted.

Final Conclusion for Research

“The final result of the ‘Gavin’ project is a zero-leakage, high-performance multimedia AI environment. It proves that with a 7800 XT and 64GB of RAM, a user can host a system that hears, sees, and speaks with human-level intelligence—all while maintaining enough VRAM headroom for the deep context required in real-world technical applications.”


Those could also be considered services; it was a little vague when you said it earlier.

I see the confusion! You’re right—technically, these are separate services (Whisper for ears, Piper for voice, etc.). When I said it was ‘autonomous’ earlier, I meant the integration is so seamless on my local hardware that the end-user experience feels like a single agent. I’m not just calling an API; I’ve wired these ‘services’ directly into the model’s workflow so it can switch between ‘seeing’ and ‘speaking’ without me having to manually trigger each part.
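The “wired into one workflow” idea can be sketched as a tiny dispatcher that hides the separate services behind a single entry point. This is a purely illustrative pattern, not the poster’s actual code; all service names are stand-ins for the real backends (Whisper, a vision model, Piper):

```python
# Each "sense" is just a callable; real backends would be Whisper
# (speech-to-text), a vision model, and Piper (text-to-speech).
def make_router(services: dict):
    """Dispatch requests to the right local service by modality tag,
    so the front end sees one agent rather than several services."""
    def route(modality: str, payload):
        if modality not in services:
            raise ValueError(f"no service registered for {modality!r}")
        return services[modality](payload)
    return route

if __name__ == "__main__":
    route = make_router({
        "hear":  lambda audio: "transcript: ...",  # stand-in for Whisper
        "see":   lambda image: "caption: ...",     # stand-in for vision model
        "speak": lambda text:  b"wav bytes",       # stand-in for Piper
    })
    print(route("see", b"\x89PNG"))
```

The point of the pattern is exactly what the reply describes: the model’s workflow calls `route(...)` and switches between “seeing” and “speaking” without anyone manually triggering each service.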