I’ve published a framework paper proposing a CPU-native inference
architecture for large language models.
Core argument: LLMs are slow on CPU not because CPUs are unsuited
to inference, but because models and runtimes were designed for GPU
memory architecture and never redesigned for CPU cache hierarchy.
AIOS proposes a memory residency controller and Model Contract to
close that gap.
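To make the bandwidth argument concrete, here is a back-of-envelope sketch of why DRAM traffic per token bounds CPU decode throughput. The numbers (roughly 4 GB of quantized weights for a 7B Q4_K_M model, roughly 50 GB/s of dual-channel DDR4 bandwidth) are illustrative assumptions on my part, not figures from the paper:

```python
# Back-of-envelope: DRAM bandwidth as a ceiling on CPU decode speed.
# All numbers are illustrative assumptions, not figures from the AIOS paper.

weights_gb = 4.0   # ~7B params at ~4.5 bits/weight (Q4_K_M is roughly this size)
dram_gb_s = 50.0   # typical dual-channel DDR4 system bandwidth

# Worst case: every generated token streams all weights from DRAM once.
mb_per_token = weights_gb * 1000          # ~4000 MB of traffic per token
max_tokens_per_s = dram_gb_s / weights_gb # bandwidth-bound throughput ceiling

print(f"~{mb_per_token:.0f} MB/token -> at most {max_tokens_per_s:.1f} tok/s")
```

Anything that keeps hot weights resident in cache instead of re-streaming them from DRAM raises that ceiling, which is the gap the runtime targets.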
What AIOS is:
- A runtime (memory residency controller) that sits between inference
  engines and hardware, reducing DRAM data movement per generated token
- A Model Contract: five architectural requirements models can satisfy
  to expose the full optimization surface
Current state: Paper published, spec complete, validation tooling
runnable. Runtime not yet implemented. All performance projections
are analytical — no empirical results exist yet.
What I need most:
Someone with a bare-metal Linux machine (Intel Haswell+ or AMD Zen+, 16 GB RAM)
to run the Phase 1 baseline measurement on Falcon 7B Q4_K_M using
stock llama.cpp. Full protocol in Issue #2. Takes ~2 hours including
setup.
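For a sense of the arithmetic behind the measurement, here is a minimal sketch. The authoritative protocol is in Issue #2; the perf event names and file names below are CPU- and setup-specific assumptions, not part of the protocol:

```python
# Sketch of the MB/token calculation behind the Phase 1 baseline.
# Assumption: DRAM traffic is counted via memory-controller (uncore IMC)
# cache-line events under `perf stat`; exact event names vary by CPU.

def mb_per_token(cas_reads: int, cas_writes: int, tokens: int,
                 line_bytes: int = 64) -> float:
    """DRAM traffic per generated token, from memory-controller
    cache-line transfer counts (64 bytes per line on x86)."""
    return (cas_reads + cas_writes) * line_bytes / tokens / 1e6

# Illustrative shape of the wrapped run (event and file names are assumptions):
PERF_CMD = (
    "perf stat -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ "
    "-- ./llama-cli -m falcon-7b-q4_k_m.gguf -n 128 -p 'Hello'"
)

# e.g. 8e9 read lines + 1e8 write lines over 128 tokens:
print(f"{mb_per_token(8_000_000_000, 100_000_000, 128):.0f} MB/token")
```

The point of the baseline is just that number: how many megabytes cross the memory bus per generated token with stock llama.cpp, before any runtime changes.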
Links:
- Hugging Face: aios-framework/aios-paper
- Issue #2 (start here): "Falcon 7B + AIOS: measure baseline MB/token (primary validation)", acasavaraju/AIOS on GitHub