“It’s the Architecture, Stupid” — Why Prompt Engineering Won’t Fix Agents

Borrowing from the classic “it’s the economy, stupid” — the same applies here.
We’re blaming prompts for what is fundamentally an architectural problem.

:page_facing_up: Paper: Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework
:laptop: Code: GitHub — gfernandf/agent-skills: runtime for composable AI agent skills



We keep pretending that better prompts will fix LLM agents.

They won’t.

We’ve built an entire ecosystem of tooling, courses, and “best practices” around prompt engineering — as if the problem were linguistic.

It’s not.

It’s architectural.


The uncomfortable truth

Let’s be honest about what most agent systems are doing today:

  • Take a task
  • Generate a prompt
  • Call the model
  • Hope it “reasons” correctly
  • Repeat

This is not a system.

This is recomputation disguised as intelligence.


We are replaying cognition, not building it

Every time your agent runs, it:

  • Reconstructs context
  • Rebuilds reasoning
  • Re-derives intermediate steps

There is no reuse of cognition.

No structure.
No persistence.
No abstraction layer.

Just prompts.

We are not building systems. We are replaying thoughts.


Why prompt engineering feels like it works (until it doesn’t)

Prompt engineering gives the illusion of control:

  • Add more instructions
  • Add more examples
  • Add more constraints

And yes — performance improves.

Until it plateaus.

Because everything still lives inside a single forward pass:

  • no memory of reasoning
  • no composability
  • no reuse

It’s like trying to fix software architecture by writing better comments.


The real problem is architectural

The core issue is simple:

We are using LLMs as stateless reasoning engines.

And then compensating for that with increasingly complex prompts.

Instead of:

  • modeling cognition
  • structuring reasoning
  • reusing intermediate steps

We regenerate everything every time.

That doesn’t scale.

Not in cost.
Not in latency.
Not in reliability.


What’s actually missing

What’s missing is not a better prompt.

It’s a runtime layer that:

  • encodes reusable cognitive steps
  • separates reasoning into structured components
  • allows composition instead of regeneration

In other words:

a system that reuses cognition instead of recomputing it.
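To make "reuse instead of recompute" concrete: one simple mechanism is a content-addressed cache over reasoning steps, keyed by the step name and its inputs. This is a minimal illustrative sketch, not ORCA's implementation; `CognitionCache` and `expensive_model_call` are hypothetical names:

```python
import hashlib
import json

class CognitionCache:
    """Stores the result of each reasoning step, keyed by (step, inputs).

    A repeated run with identical inputs returns the stored result
    instead of calling the model again.
    """

    def __init__(self):
        self._store = {}

    def _key(self, step: str, inputs: dict) -> str:
        payload = json.dumps({"step": step, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, step: str, inputs: dict, compute) -> dict:
        key = self._key(step, inputs)
        if key not in self._store:          # only compute on a cache miss
            self._store[key] = compute(inputs)
        return self._store[key]

calls = []

def expensive_model_call(inputs):
    calls.append(inputs)                    # stands in for a real LLM call
    return {"summary": inputs["text"][:10]}

cache = CognitionCache()
cache.run("summarize", {"text": "hello world, again"}, expensive_model_call)
cache.run("summarize", {"text": "hello world, again"}, expensive_model_call)
# the second run reuses the first result: only one model call happened
```

The point of the sketch is the key function: once a step's inputs are explicit and serializable, recomputation becomes optional rather than mandatory.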


From prompts to skills (and where ORCA fits)

Instead of:

→ Prompt → Model → Output

You need:

→ Skill → Execution → Structured Output

Not conceptually. Operationally.

This is exactly what ORCA implements: a runtime layer where “skills” are reusable cognitive units — not prompts.

  • defined inputs
  • structured outputs
  • explicit execution

No recomputation. No guesswork.
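A minimal sketch of what a skill as a "reusable cognitive unit" might look like, assuming the three properties above (defined inputs, structured outputs, explicit execution) as the contract. The `Skill` class and its field names are illustrative, not ORCA's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Skill:
    """A reusable cognitive unit: declared inputs, structured output, explicit run."""
    name: str
    inputs: tuple                     # required input fields
    outputs: tuple                    # fields the result must contain
    execute: Callable[[dict], dict]   # the bound backend (model call, tool, etc.)

    def run(self, payload: dict) -> dict:
        missing = [f for f in self.inputs if f not in payload]
        if missing:
            raise ValueError(f"{self.name}: missing inputs {missing}")
        result = self.execute(payload)
        if not all(f in result for f in self.outputs):
            raise ValueError(f"{self.name}: malformed output {result}")
        return result

# A trivial backend stands in for a narrow, well-scoped model call.
extract_title = Skill(
    name="extract_title",
    inputs=("text",),
    outputs=("title",),
    execute=lambda p: {"title": p["text"].splitlines()[0].strip()},
)
```

Calling `extract_title.run({"text": "ORCA\nbody"})` either returns a structured result or fails loudly; there is no "hope the model reasons correctly" step.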


Why most agent frameworks hit a wall

Most “agent frameworks” today are:

  • prompt orchestration layers
  • tool wrappers
  • retry loops with better formatting

They don’t model cognition.

They orchestrate prompts.

That’s not a runtime.


The shift we actually need

The shift is not better prompting.

It’s architectural.

From:

  • stateless generation

To:

  • structured, reusable cognition

That’s the gap ORCA is designed to close.


Final thought

Prompt engineering isn’t useless.

It’s just solving the wrong problem.

We’ve been optimizing the interface instead of the system.

And it shows.


If you’ve pushed prompt engineering far enough, you’ve seen the limit.

The question is:

are you ready to try what replaces it?


:slight_smile: Finally! Someone who isn't all about prompting on here. I built a whole new architecture for this very reason. I've already built all of the things you've mentioned, as well as a whole bunch of others you probably haven't even considered yet. Models, to me, are nothing more than compute power.

I’ve been testing this on my local ROCm setup with Gemma-4 E4B, and the architecture only gets me so far before the model’s ‘thinking’ parameters become the bottleneck.

Pretty much all of the public models people can run locally are quite dumb. It's why I set up my operating system to use OpenAI API models.


Glad to hear that — and I mean it. The “models are compute power” framing is exactly right. That’s the core thesis: the model is an execution engine, not the architecture itself.

Curious about what you’ve built — especially around reuse and composition of cognitive steps. Most people who arrive at this conclusion independently end up solving similar structural problems (persistence, structured I/O, separation of reasoning from execution) but with very different tradeoffs.

If you’re open to sharing, I’d genuinely like to compare notes. The point of publishing ORCA wasn’t to claim novelty on every front — it was to make the architectural argument explicit and give it a concrete, reproducible runtime. The more people building in this direction, the faster we move past the “just prompt harder” era.

That’s actually a really interesting data point — and it might validate the thesis more than contradict it.

If the model’s “thinking” parameters are your bottleneck, it likely means the model is still being asked to reason through too much in a single pass. That’s exactly the problem the architecture is designed to solve: you break cognition into discrete skills with defined inputs/outputs, so each model call is a narrow, well-scoped execution — not open-ended reasoning.

With a smaller model like Gemma-4 E4B, the architecture becomes more important, not less. The model doesn’t need to “think” — it needs to execute a structured step. The cognitive load shifts from the model to the runtime.
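To illustrate the shift from open-ended reasoning to narrow, well-scoped execution, here is a toy pipeline where each model call does one small step and the runtime does the chaining. `call_model` is a stub standing in for a small local model; the step names are invented for the example:

```python
def call_model(instruction: str, data: str) -> str:
    """Stub for a backend call; a real one would hit a small local model
    with a single narrow instruction, not an open-ended reasoning prompt."""
    return f"[{instruction}] {data}"

def run_pipeline(text: str) -> dict:
    # Each step is a small, well-scoped execution; the composition
    # lives in the runtime, not inside one giant forward pass.
    outline = call_model("outline in 3 bullets", text)
    draft = call_model("draft one paragraph", outline)
    check = call_model("flag contradictions", draft)
    return {"outline": outline, "draft": draft, "check": check}
```

A weaker model that struggles with the monolithic version of this task often handles each narrow step fine, which is the point being made above.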

Would be curious to know: are you running full skills with structured I/O, or using the model in a more traditional prompt-based flow on top of the framework? That distinction usually explains the bottleneck.

Well, I've been at this for a while, mate. I built the entire stack. I have developed a whole new architecture for AI to use. The first thing I built was a way for AI to persist. Then I used that new filing system as the kernel for a new modular operating system around it. As it turns out, it also significantly reduces token use. The older the session gets, the bigger the savings get.

I’ve also built a significant set of tools that the models can use. Another thing I did was develop a system the models can use so that all you need to do is tell the model what you want it to do in chat via voice or typing and it will automatically do it, end to end unattended.

I can also drop in the whole docs folder; it is quite extensive now. With the amount of features I have packed into this OS, I have made sure I am going to do for AI what Microsoft Windows did for computers. The whole industry has been doing it all wrong all along with AI models. AI agents are the future.

Sounds like you’ve gone deep — persistence, token efficiency over time, and autonomous end-to-end execution are all hard problems, so respect for tackling the full stack.

A few of those resonate directly with what ORCA addresses: persistence of cognitive state, modular composition, and reducing redundant computation across sessions. We may be solving overlapping problems from different angles.

Is your project open source? Or do you have a link for me to check? I would be interested in comparing approaches and seeing whether they fit together or are compatible.

I am going to release everything in the core for free. I am in the final polishing stages, writing up guides and whatnot. I'll launch a beta within about a month. I just finished working on Story Studio and Script Runner. I can basically write anything end to end now.

12 chapters with 3500 words per chapter. It knocked it up pretty quickly. The entire story.

It's too big to copy-paste in here. But you can see the beginning and end with line counts. I can email you the story if you'd like to see what a 100 percent AI-written novel looks like. I have one goal: do for AI what Bill Gates did for computers.

Interesting — end-to-end generation with chapter-level coherence is a hard problem, especially maintaining consistency across 40k+ words. That’s where most prompt-only pipelines fall apart (context drift, character inconsistency, plot contradictions).

Curious about one thing: when you say “Story Studio” — is the model carrying the full narrative state in context, or do you have an explicit state layer that tracks characters, plot arcs, and continuity across chapters? That distinction is exactly where the architecture argument lands.

Looking forward to seeing the release. Open beats closed every time.

It’s a combination of two things. Story Studio and Script Runner.

  • Story Studio is the authoring workspace. It manages explicit entities like projects, scenes, guardrails, and exports.
  • Its AI Assist actions (concept, outline, draft, rewrite) are targeted calls, not a full autonomous pipeline.
  • When you enable Autonomy, Story Studio builds a structured payload (format, counts, guardrails, project metadata) and launches a run in Script Runner.
  • Script Runner is the execution engine. It persists run state, events, artifacts, progress, and lifecycle controls (start/pause/resume/stop/rerun/export).
  • It supports long form generation with checkpoints and recoverability, then returns outputs/artifacts back to the UI.
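For readers following along: a run layer with lifecycle controls and checkpoints, as described above, can be sketched generically as a small state machine. This is an assumption-laden illustration of the pattern, not Script Runner's actual code; all names are invented:

```python
from enum import Enum

class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    DONE = "done"

# Allowed lifecycle transitions (start/pause/resume/stop), keyed by
# (current state, requested action).
TRANSITIONS = {
    ("pending", "start"): RunState.RUNNING,
    ("running", "pause"): RunState.PAUSED,
    ("paused", "resume"): RunState.RUNNING,
    ("running", "stop"): RunState.STOPPED,
    ("running", "finish"): RunState.DONE,
}

class Run:
    def __init__(self, payload: dict):
        self.payload = payload       # e.g. format, counts, guardrails, metadata
        self.state = RunState.PENDING
        self.checkpoints = []        # persisted progress for recoverability

    def signal(self, action: str) -> RunState:
        nxt = TRANSITIONS.get((self.state.value, action))
        if nxt is None:
            raise ValueError(f"cannot {action} from {self.state.value}")
        self.state = nxt
        return self.state

    def checkpoint(self, artifact: dict):
        self.checkpoints.append(artifact)   # survives pause/resume
```

Because illegal transitions raise instead of silently proceeding, a long-form run can be paused mid-chapter and resumed from the last checkpoint without regenerating earlier work.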

That’s a meaningful split — authoring workspace vs execution engine with persistent state and lifecycle controls is exactly the kind of separation that matters. Checkpoints and recoverability are where most “just prompt it” approaches collapse.

There’s real architectural overlap with what ORCA does, though the scope is different. ORCA generalizes that pattern: any cognitive task — not just narrative generation — gets decomposed into reusable capabilities with typed contracts, bound to swappable backends (Python, OpenAPI, MCP), and executed through a runtime with DAG scheduling, step-level tracing, and checkpoint/restore.

The key difference is that ORCA’s capabilities and skills are domain-agnostic and open. A text.content.summarize capability used inside a story pipeline is the same one used in a legal document workflow or a security audit — same contract, same governance, different binding.
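A minimal sketch of the "same contract, different binding" idea: one capability contract, enforced regardless of which backend fulfills it. The contract shape, backend functions, and binding names below are illustrative assumptions, not ORCA's actual API:

```python
# One domain-agnostic contract; only the backend binding changes per domain.
CONTRACT = {
    "capability": "text.content.summarize",
    "inputs": {"text": str},
    "outputs": {"summary": str},
}

def local_python_backend(inputs: dict) -> dict:
    return {"summary": inputs["text"][:40]}            # e.g. an in-process heuristic

def remote_api_backend(inputs: dict) -> dict:
    return {"summary": f"(remote) {inputs['text'][:40]}"}  # e.g. an OpenAPI/MCP call

# Different pipelines bind the same capability to different backends.
BINDINGS = {
    "story_pipeline": local_python_backend,
    "legal_review": remote_api_backend,
}

def invoke(binding: str, inputs: dict) -> dict:
    backend = BINDINGS[binding]
    result = backend(inputs)
    # Same contract, any backend: the output shape is checked, not trusted.
    assert set(result) == set(CONTRACT["outputs"])
    return result
```

The governance lives in the contract check, so swapping a backend never changes what downstream steps are allowed to assume.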

Sounds like you’ve solved the vertical problem well for narrative. ORCA is trying to solve the horizontal one.

Would genuinely be interested to compare notes when you release. The more systems that take state persistence and execution structure seriously, the better the argument gets for everyone.

Continuing to evangelize this idea: the problem isn’t prompts, it’s architecture.
Now also available on SSRN:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6600840

Thank you for this interesting thread.

Thanks for reading! If you have questions about the architecture or want to dig into any specific part, happy to discuss.