AIOS: CPU-Native LLM Inference Architecture — Seeking Validation Contributors

I’ve published a framework paper proposing a CPU-native inference
architecture for large language models.

Core argument: LLMs are slow on CPU not because CPUs are unsuited
to inference, but because models and runtimes were designed for GPU
memory architecture and never redesigned for CPU cache hierarchy.
AIOS proposes a memory residency controller and Model Contract to
close that gap.
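The bandwidth-bound argument can be made concrete with a back-of-envelope calculation. The numbers below are illustrative assumptions (not figures from the paper): a ~4 GB quantized model and typical desktop DRAM read bandwidth.

```python
# Why batch-1 decode is DRAM-bandwidth bound on CPU: each generated
# token streams essentially the full weight set from DRAM, so the
# bandwidth / bytes-per-token ratio caps throughput regardless of
# compute. Both constants are illustrative assumptions.
weights_gb = 4.0   # rough size of a 7B model at Q4_K_M (assumed)
dram_gbps = 60.0   # typical desktop DDR5 read bandwidth (assumed)

# Upper bound on decode throughput if every token reads all weights:
tok_per_s = dram_gbps / weights_gb
print(f"upper bound: {tok_per_s:.1f} tokens/s")  # -> 15.0 tokens/s
```

Any scheme that keeps hot weights resident in cache, or otherwise cuts bytes moved per token, raises this ceiling; that is the lever a memory residency controller targets.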

What AIOS is:

  • A runtime (memory residency controller) between inference engines
    and hardware — reducing DRAM data movement per generated token
  • A Model Contract — five architectural requirements models can
    satisfy to expose the full optimization surface

Current state: Paper published, spec complete, validation tooling
runnable. Runtime not yet implemented. All performance projections
are analytical — no empirical results exist yet.

What I need most:
Someone with a bare-metal Linux machine (Intel Haswell or newer, or
AMD Zen or newer; 16 GB RAM) to run the Phase 1 baseline measurement
on Falcon 7B Q4_K_M using stock llama.cpp. Full protocol in Issue #2.
Takes ~2 hours including setup.

Links:


Let's do it. Arch Linux on an Intel Ultra 7 265K with 64 GB RAM.


That’s perfect hardware for this — Intel Ultra 7 265K will have full uncore PMU counter access. Please head to Issue #2 on the GitHub repo for the full protocol: https://github.com/acasavaraju/AIOS/issues/2

The key steps: build llama.cpp, download Falcon 7B Q4_K_M, run baseline.py with --runs 5. Post the full JSON output as a comment on the issue. Looking forward to the first real number.
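For anyone curious what a multi-run baseline looks like before reading Issue #2: here is a hypothetical sketch of a timing harness in the spirit of baseline.py. The real script's command line, measurements, and JSON fields may differ; everything below is an assumption except the "5 runs, JSON out" shape.

```python
# Hypothetical sketch of a Phase 1-style baseline harness.
# The actual protocol lives in Issue #2; the real baseline.py may
# measure tokens/s and PMU counters rather than wall-clock time.
import json
import statistics
import subprocess
import time

def run_once(cmd):
    """Time one inference invocation; returns wall-clock seconds."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - t0

def baseline(cmd, runs=5):
    """Run `cmd` `runs` times and summarize as a JSON-ready dict."""
    times = [run_once(cmd) for _ in range(runs)]
    return {
        "runs": runs,
        "seconds": times,
        "mean_s": statistics.mean(times),
        "stdev_s": statistics.stdev(times) if runs > 1 else 0.0,
    }

# Example with a trivial command standing in for the llama.cpp binary:
result = baseline(["true"], runs=5)
print(json.dumps(result, indent=2))
```

In practice the first run is often slower (cold page cache while the model file is mapped in), which is one reason a 5-run protocol with the full per-run JSON posted to the issue is more useful than a single number.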


I’m building llama.cpp right now to run the validation.
