I recently built a very experimental semantic prompt compressor aimed at reducing LLM token usage without losing important context.
I'm still not sure how worthwhile the idea is, but I did have fun with this experiment.
- Built with spaCy and YAML rule configs
- Domain-sensitive (best for human queries)
- Preserves >95% named entities and technical terms
- Achieves ~22% compression across real-world prompts
It’s designed to work both for runtime compression and for prompt normalization before storage / vector DB ingestion.
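For anyone who wants a feel for the approach, here is a minimal sketch of rule-based prompt compression. It is not the tool's actual implementation: it uses plain regular expressions instead of spaCy, the `RULES` table is a hypothetical stand-in for the YAML config, and `TERM` is a crude pattern for "technical terms" rather than real named-entity recognition. Word counts are used as a rough proxy for tokens.

```python
import re

# Hypothetical rule table standing in for a YAML config: (pattern, replacement).
RULES = [
    (re.compile(r"\b(?:could|can|would) you (?:please )?", re.I), ""),
    (re.compile(r"\bplease\s+", re.I), ""),
    (re.compile(r"\bin order to\b", re.I), "to"),
    (re.compile(r"\bdue to the fact that\b", re.I), "because"),
]

# Crude stand-in for entity/term detection (assumption, not real NER):
# mixed-case words, all-caps acronyms, and snake_case identifiers.
TERM = re.compile(r"\b(?:[A-Z][a-z]*[A-Z]\w*|[A-Z]{2,}|\w+_\w+)\b")


def compress(text: str) -> str:
    """Apply every rule in order, then collapse leftover whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s{2,}", " ", text).strip()


def preserved_fraction(original: str, compressed: str) -> float:
    """Share of detected terms from the original that survive compression."""
    terms = set(TERM.findall(original))
    if not terms:
        return 1.0
    return sum(1 for t in terms if t in compressed) / len(terms)


def reduction(original: str, compressed: str) -> float:
    """Fraction of whitespace-separated words removed (a proxy for tokens)."""
    return 1 - len(compressed.split()) / len(original.split())


prompt = ("Could you please explain how to configure PostgreSQL "
          "in order to speed up the user_events table?")
short = compress(prompt)
print(short)
print(preserved_fraction(prompt, short), reduction(prompt, short))
```

Because the rules are applied in a fixed order with no sampling, the same input always gives the same output, which is what makes this style usable for normalization before storage.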
I'd love feedback from the community on whether this looks useful and whether you've needed to implement something similar.
Is anyone else fighting the “token reduction” fight?
Yes, I’m fighting the token-reduction fight, but coming at it from a
different angle. I just published a preprint measuring information loss
when LLMs summarize their own conversation history (curative compaction):
Your approach (preventive, prompt-level, rule-based) is orthogonal to mine
(curative, history-level, LLM-based). The two compose nicely: your tool
compresses individual messages on the way in, mine could compact the
accumulated history later. Worth chaining and measuring.
Two things in your design that I think are underappreciated and that I’d
love to discuss:
1. Compressing on the way in vs on the way out is a more important
distinction than the literature gives it credit for. Tool results,
chain-of-thought, and search outputs are typically 70-80% of the
verbose noise in a coding agent’s context. Compressing them before
they enter the context probably has more leverage than any sophisticated
history compaction strategy. Your tool is well-positioned for this.
2. Rule-based vs LLM-based compaction is a methodological lever I hadn’t
seriously considered until reading your post. A rule-based compactor is
deterministic, which directly addresses a finding from my paper:
LLM-based compaction is non-deterministic even at temperature zero,
with run-to-run recall varying by up to a factor of 14 on identical
conversations. A rule-based variant would remove that source of
variance entirely and make benchmarks much cleaner. If your tool
gives modest compression but stable behavior, that may be precisely
what you want for parts of an agent’s context (typed/structured
content especially).
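
To make the variance point concrete, here is a rough sketch of how run-to-run spread could be measured. Everything in it is hypothetical: `recall` uses naive substring matching, `rule_based` is a toy deterministic compactor, and `make_flaky` only simulates non-determinism with a random number generator. It is not a model of any real LLM and not the method from the preprint.

```python
import random
from typing import Callable, List


def recall(facts: List[str], compacted: str) -> float:
    """Fraction of reference facts that survive verbatim in the compacted text."""
    return sum(f in compacted for f in facts) / len(facts)


def spread(compactor: Callable[[str], str], history: str,
           facts: List[str], runs: int = 5) -> float:
    """Best recall divided by worst recall over repeated runs on identical input."""
    scores = [recall(facts, compactor(history)) for _ in range(runs)]
    best, worst = max(scores), min(scores)
    if worst == 0:
        # Ratio is undefined at zero recall; report inf unless every run was zero.
        return float("inf") if best > 0 else 1.0
    return best / worst


def rule_based(history: str) -> str:
    """Deterministic toy rule: drop lines tagged as debug noise."""
    return "\n".join(l for l in history.splitlines() if not l.startswith("DEBUG"))


def make_flaky(seed: int) -> Callable[[str], str]:
    """Simulated non-deterministic compactor: keeps each line with probability 0.5."""
    rng = random.Random(seed)

    def flaky(history: str) -> str:
        return "\n".join(l for l in history.splitlines() if rng.random() < 0.5)

    return flaky


history = ("user: deploy to eu-west-1\nDEBUG: cache miss\n"
           "assistant: using port 8443\nDEBUG: retry 2")
facts = ["eu-west-1", "port 8443"]

print(spread(rule_based, history, facts))     # identical output on every run
print(spread(make_flaky(0), history, facts))  # depends on the simulated randomness
```

A deterministic compactor always has a spread of exactly 1, so any spread above 1 in a benchmark can be attributed to the rest of the pipeline.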
Question for you: have you measured downstream task performance with vs
without your compressor? LLMLingua reports ~1.5% drop at 20x compression;
yours at 22% should land much lower, which would be a strong selling
point if measured.
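
One caveat when comparing those two figures: "22%" and "20x" are on different scales. A quick conversion, assuming "22% compression" means 22% of tokens are removed and "20x" means original length divided by compressed length (both function names below are just illustrative):

```python
def reduction_to_factor(reduction: float) -> float:
    """Convert a fractional token reduction (0.22 = 22% removed) to an 'Nx' factor."""
    return 1.0 / (1.0 - reduction)


def factor_to_reduction(factor: float) -> float:
    """Convert an 'Nx' compression factor to the fraction of tokens removed."""
    return 1.0 - 1.0 / factor


print(f"{reduction_to_factor(0.22):.2f}x")  # 1.28x
print(f"{factor_to_reduction(20):.0%}")     # 95%
```

Under those assumptions, a 22% reduction is roughly 1.28x, while 20x corresponds to removing 95% of the tokens, so the two operate in very different regimes.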
In any case I’m including a follow-up section on input-side and rule-based
compression in my future work draft, partly inspired by your tool. Happy
to compare notes if you’re interested.