AIOS — First Ground Truth Baseline (CPU DRAM Measurement)

Following up on my earlier post introducing AIOS (CPU-native LLM inference architecture), we now have the first validated baseline measurement using hardware memory controller counters.

Setup

  • Model: Falcon 7B (GGUF Q4_K_M)

  • CPU: Intel Core Ultra 7 265K (20 cores)

  • OS: Arch Linux (kernel 6.19.10-zen1-1-zen)

  • Method: perf uncore IMC counters (uncore_imc_free_running_0/data_read/)
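For anyone reproducing this, here is a minimal sketch of how raw counter output can be reduced to MB/token. The CSV field layout and the MiB unit are assumptions about `perf stat -x,` output on recent client uncore drivers, not verbatim from the runs above; verify both on your own machine:

```python
# Hypothetical sketch: parse `perf stat -x,` CSV output for the IMC
# free-running read counters and convert to MB per generated token.
# Assumes fields are: value, unit, event-name, ... and that the unit
# is MiB -- both should be checked against your local perf version.

def mb_per_token(perf_csv: str, n_tokens: int) -> float:
    total_mib = 0.0
    for line in perf_csv.strip().splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and "data_read" in fields[2]:
            total_mib += float(fields[0])
    return total_mib * 1.048576 / n_tokens  # MiB -> MB, per token

# Synthetic example values, not measured data:
sample = (
    "223000.5,MiB,uncore_imc_free_running_0/data_read/,...\n"
    "223200.7,MiB,uncore_imc_free_running_1/data_read/,...\n"
)
print(round(mb_per_token(sample, 200), 1))
```

Summing both memory controllers matters on dual-channel parts; counting only `_0` would undercount by roughly half.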

Results (5 runs × 200 tokens)

  • MB/token: 2340 ± 4

  • Coefficient of Variation: 0.17%

  • Tokens/sec: 11.43 ± 0.05
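The stability claim is easy to sanity-check. The five per-run values below are illustrative numbers near the reported figures, not the actual run data:

```python
# Reproduce the coefficient-of-variation calculation on hypothetical
# per-run MB/token readings close to the reported 2340 +/- 4.
import statistics

runs_mb_per_token = [2335, 2337, 2340, 2343, 2345]  # illustrative only

mean = statistics.mean(runs_mb_per_token)
std = statistics.stdev(runs_mb_per_token)   # sample standard deviation
cv_percent = 100 * std / mean

print(f"{mean:.0f} +/- {std:.1f} MB/token, CV = {cv_percent:.2f}%")
```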

Key Takeaways

  • The measurement is highly stable (CV < 1%), confirming that DRAM reads can be treated as a reliable physical metric.

  • ~456–459 GB DRAM read for 200 tokens highlights the memory bandwidth wall in CPU inference.

  • This establishes a ground truth baseline for AIOS evaluation.
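One derived figure worth stating explicitly: the two headline numbers together imply the sustained DRAM read bandwidth the baseline is consuming. This is plain arithmetic on the reported values:

```python
# Sustained DRAM read bandwidth implied by the baseline measurement.
mb_per_token = 2340        # measured MB read per generated token
tokens_per_sec = 11.43     # measured generation rate

bandwidth_gb_s = mb_per_token * tokens_per_sec / 1000  # MB/s -> GB/s
print(f"~{bandwidth_gb_s:.1f} GB/s sustained DRAM reads")
```

That is a substantial fraction of practical dual-channel DDR5 read bandwidth, which is exactly the wall the takeaways describe.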

Why this matters

Most inference discussions optimize for tokens/sec.

AIOS instead treats MB/token as the primary constraint, because on CPUs, memory movement—not compute—is the bottleneck.

What’s next

  • Issue #1: Falcon 7B “relufication” (R1 compliance)

  • Headroom analysis (validation/headroom.py)

  • Additional baselines across models / quantizations

Call for contributors

If you can run perf on bare-metal Linux, contributions are very valuable:

  • Run baseline measurements on your hardware

  • Validate different models / quantizations

  • Help quantify headroom vs AIOS projections

Repo: github.com/acasavaraju/AIOS — CPU-native LLM inference architecture: a memory residency controller that reduces DRAM data movement per generated token through weight aliasing, sparsity maps, KV cache tiering, and activation chunking. Includes a Model Contract spec for architecture co-design. Framework + validation tooling; runtime contributions welcome. Paper: SSRN 6467298.

Acknowledgment

Huge thanks to @reimorster for running the first full validation and helping establish this baseline.

This is the first step toward making memory movement a first-class metric for LLM inference.


Update: TurboQuant (Google, ICLR 2026) is directly relevant to AIOS
Google released TurboQuant this week: a KV cache quantization algorithm that compresses attention key-value pairs from 16 bits to 3 bits with near-zero quality loss and no retraining required (“TurboQuant: Redefining AI efficiency with extreme compression”).
This is complementary to AIOS, not competing. They address the same bottleneck from different directions:
  • TurboQuant reduces KV cache size — fewer bits per KV entry
  • AIOS reduces KV cache DRAM reads — fewer times those bits are fetched per token
Both optimizations apply simultaneously. A model running TurboQuant under AIOS memory management addresses the KV bottleneck from two directions at once.
Our first baseline (Intel Ultra 7 265K) measured 2,340 MB/token on stock llama.cpp. At 4K context, KV cache reads are a significant fraction of that. TurboQuant’s 5x KV compression would reduce that fraction further before AIOS residency management applies on top.
The broader pattern: BitNet (weight arithmetic), CALM (forward passes), TurboQuant (KV size), AIOS (DRAM access patterns) — four independent groups addressing four non-overlapping bottlenecks in the same inference stack. None of them are sufficient alone. All of them stack.
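A back-of-envelope sketch of the stacking claim: if a fraction `f_kv` of per-token DRAM traffic is KV cache reads (a hypothetical parameter here, not a measured value), a TurboQuant-style 16→3-bit compression of that fraction and an AIOS-style residency reduction on the remainder compose multiplicatively:

```python
# Hypothetical composition of the two optimizations. f_kv and the
# AIOS reduction factor are assumed inputs, not measured results;
# the 50% reduction is the low end of the 50-85% AIOS claim.

def stacked_traffic(f_kv: float, kv_bits_before: int = 16,
                    kv_bits_after: int = 3,
                    aios_reduction: float = 0.5) -> float:
    """Relative DRAM traffic (1.0 = stock) after KV quantization,
    then a residency-management reduction applied on top."""
    after_quant = (1 - f_kv) + f_kv * kv_bits_after / kv_bits_before
    return after_quant * (1 - aios_reduction)

# e.g. if 30% of traffic were KV reads:
print(round(stacked_traffic(0.3), 3))
```

The point of the sketch is the shape, not the numbers: the two reductions multiply rather than compete, which is why the optimizations stack.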


Using a pure LLM model alone won’t work.


That’s an interesting position — can you be more specific? AIOS makes a precise, falsifiable claim: that memory bandwidth is one of the primary bottlenecks for 7B+ model inference on CPU, and that it can be reduced 50–85% through memory residency management and model co-design.
If you believe a pure LLM model approach is insufficient, what specifically would you add or change? And what evidence supports that position?
We’re actively seeking contributors to validate or disprove our assumptions — Issue #2 is the starting point. A counterargument backed by a measurement is far more valuable than one without.
