AIOS — First Ground Truth Baseline (CPU DRAM Measurement)

Following up on my earlier post introducing AIOS (CPU-native LLM inference architecture), we now have the first validated baseline measurement using hardware memory controller counters.

Setup

  • Model: Falcon 7B (GGUF Q4_K_M)

  • CPU: Intel Core Ultra 7 265K (20 cores)

  • OS: Arch Linux (kernel 6.19.10-zen1-1-zen)

  • Method: perf uncore IMC counters (uncore_imc_free_running_0/data_read/)
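For anyone reproducing this, here is a minimal sketch of how raw counter output can be reduced to MB/token. The CSV field layout and the MiB unit are assumptions about `perf stat -x,` output on recent client uncore drivers, not verbatim from the runs above; verify both on your own machine:

```python
# Hypothetical sketch: parse `perf stat -x,` CSV output for the IMC
# free-running read counters and convert to MB per generated token.
# Assumes fields are: value, unit, event-name, ... and that the unit
# is MiB -- both should be checked against your local perf version.

def mb_per_token(perf_csv: str, n_tokens: int) -> float:
    total_mib = 0.0
    for line in perf_csv.strip().splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and "data_read" in fields[2]:
            total_mib += float(fields[0])
    return total_mib * 1.048576 / n_tokens  # MiB -> MB, per token

# Synthetic example values, not measured data:
sample = (
    "223000.5,MiB,uncore_imc_free_running_0/data_read/,...\n"
    "223200.7,MiB,uncore_imc_free_running_1/data_read/,...\n"
)
print(round(mb_per_token(sample, 200), 1))
```

Summing both memory controllers matters on dual-channel parts; counting only `_0` would undercount by roughly half.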

Results (5 runs × 200 tokens)

  • MB/token: 2340 ± 4

  • Coefficient of Variation: 0.17%

  • Tokens/sec: 11.43 ± 0.05
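The stability claim is easy to sanity-check. The five per-run values below are illustrative numbers near the reported figures, not the actual run data:

```python
# Reproduce the coefficient-of-variation calculation on hypothetical
# per-run MB/token readings close to the reported 2340 +/- 4.
import statistics

runs_mb_per_token = [2335, 2337, 2340, 2343, 2345]  # illustrative only

mean = statistics.mean(runs_mb_per_token)
std = statistics.stdev(runs_mb_per_token)   # sample standard deviation
cv_percent = 100 * std / mean

print(f"{mean:.0f} +/- {std:.1f} MB/token, CV = {cv_percent:.2f}%")
```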

Key Takeaways

  • The measurement is highly stable (CV < 1%), confirming that DRAM reads can be treated as a reliable physical metric.

  • ~456–459 GB DRAM read for 200 tokens highlights the memory bandwidth wall in CPU inference.

  • This establishes a ground truth baseline for AIOS evaluation.
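One derived figure worth stating explicitly: the two headline numbers together imply the sustained DRAM read bandwidth the baseline is consuming. This is plain arithmetic on the reported values:

```python
# Sustained DRAM read bandwidth implied by the baseline measurement.
mb_per_token = 2340        # measured MB read per generated token
tokens_per_sec = 11.43     # measured generation rate

bandwidth_gb_s = mb_per_token * tokens_per_sec / 1000  # MB/s -> GB/s
print(f"~{bandwidth_gb_s:.1f} GB/s sustained DRAM reads")
```

That is a substantial fraction of practical dual-channel DDR5 read bandwidth, which is exactly the wall the takeaways describe.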

Why this matters

Most inference discussions optimize for tokens/sec.

AIOS instead treats MB/token as the primary constraint, because on CPUs, memory movement—not compute—is the bottleneck.

What’s next

  • Issue #1: Falcon 7B “relufication” (R1 compliance)

  • Headroom analysis (validation/headroom.py)

  • Additional baselines across models / quantizations

Call for contributors

If you can run perf on bare-metal Linux, contributions are very valuable:

  • Run baseline measurements on your hardware

  • Validate different models / quantizations

  • Help quantify headroom vs AIOS projections

Repo: github.com/acasavaraju/AIOS — CPU-native LLM inference architecture: a memory residency controller that reduces DRAM data movement per generated token through weight aliasing, sparsity maps, KV cache tiering, and activation chunking. Includes a Model Contract spec for architecture co-design. Framework + validation tooling; runtime contributions welcome. Paper: SSRN 6467298.

Acknowledgment

Huge thanks to @reimorster for running the first full validation and helping establish this baseline.

This is the first step toward making memory movement a first-class metric for LLM inference.


Update: TurboQuant (Google, ICLR 2026) is directly relevant to AIOS
Google released TurboQuant this week: a KV cache quantization algorithm that compresses attention key-value pairs from 16 bits to 3 bits with near-zero quality loss and no retraining required (“TurboQuant: Redefining AI efficiency with extreme compression”).
This is complementary to AIOS, not competing. They address the same bottleneck from different directions:
  • TurboQuant reduces KV cache size — fewer bits per KV entry
  • AIOS reduces KV cache DRAM reads — fewer times those bits are fetched per token
Both optimizations apply simultaneously. A model running TurboQuant under AIOS memory management addresses the KV bottleneck from two directions at once.
Our first baseline (Intel Ultra 7 265K) measured 2,340 MB/token on stock llama.cpp. At 4K context, KV cache reads are a significant fraction of that. TurboQuant’s 5x KV compression would reduce that fraction further before AIOS residency management applies on top.
The broader pattern: BitNet (weight arithmetic), CALM (forward passes), TurboQuant (KV size), AIOS (DRAM access patterns) — four independent groups addressing four non-overlapping bottlenecks in the same inference stack. None of them are sufficient alone. All of them stack.
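A back-of-envelope sketch of the stacking claim: if a fraction `f_kv` of per-token DRAM traffic is KV cache reads (a hypothetical parameter here, not a measured value), a TurboQuant-style 16→3-bit compression of that fraction and an AIOS-style residency reduction on the remainder compose multiplicatively:

```python
# Hypothetical composition of the two optimizations. f_kv and the
# AIOS reduction factor are assumed inputs, not measured results;
# the 50% reduction is the low end of the 50-85% AIOS claim.

def stacked_traffic(f_kv: float, kv_bits_before: int = 16,
                    kv_bits_after: int = 3,
                    aios_reduction: float = 0.5) -> float:
    """Relative DRAM traffic (1.0 = stock) after KV quantization,
    then a residency-management reduction applied on top."""
    after_quant = (1 - f_kv) + f_kv * kv_bits_after / kv_bits_before
    return after_quant * (1 - aios_reduction)

# e.g. if 30% of traffic were KV reads:
print(round(stacked_traffic(0.3), 3))
```

The point of the sketch is the shape, not the numbers: the two reductions multiply rather than compete, which is why the optimizations stack.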


Using a pure LLM model alone won’t work.


That’s an interesting position — can you be more specific? AIOS makes a precise, falsifiable claim: that memory bandwidth is one of the primary bottlenecks for 7B+ model inference on CPU, and that it can be reduced 50–85% through memory residency management and model co-design.
If you believe a pure LLM model approach is insufficient, what specifically would you add or change? And what evidence supports that position?
We’re actively seeking contributors to validate or disprove our assumptions — Issue #2 is the starting point. A counterargument backed by a measurement is far more valuable than one without.
