“It’s the Architecture, Stupid” — Why Prompt Engineering Won’t Fix Agents

Borrowing from the classic “it’s the economy, stupid” — the same applies here.
We’re blaming prompts for what is fundamentally an architectural problem.

:page_facing_up: Paper: Beyond Prompting: Decoupling Cognition from Execution in LLM-based Agents through the ORCA Framework
:laptop: Code: GitHub — gfernandf/agent-skills: runtime for composable AI agent skills



We keep pretending that better prompts will fix LLM agents.

They won’t.

We’ve built an entire ecosystem of tooling, courses, and “best practices” around prompt engineering — as if the problem were linguistic.

It’s not.

It’s architectural.


The uncomfortable truth

Let’s be honest about what most agent systems are doing today:

  • Take a task
  • Generate a prompt
  • Call the model
  • Hope it “reasons” correctly
  • Repeat

This is not a system.

This is recomputation disguised as intelligence.


We are replaying cognition, not building it

Every time your agent runs, it:

  • Reconstructs context
  • Rebuilds reasoning
  • Re-derives intermediate steps

There is no reuse of cognition.

No structure.
No persistence.
No abstraction layer.

Just prompts.

We are not building systems. We are replaying thoughts.


Why prompt engineering feels like it works (until it doesn’t)

Prompt engineering gives the illusion of control:

  • Add more instructions
  • Add more examples
  • Add more constraints

And yes — performance improves.

Until it plateaus.

Because everything still lives inside a single forward pass:

  • no memory of reasoning
  • no composability
  • no reuse

It’s like trying to fix software architecture by writing better comments.


The real problem is architectural

The core issue is simple:

We are using LLMs as stateless reasoning engines.

And then compensating for that with increasingly complex prompts.

Instead of:

  • modeling cognition
  • structuring reasoning
  • reusing intermediate steps

We regenerate everything every time.

That doesn’t scale.

Not in cost.
Not in latency.
Not in reliability.


What’s actually missing

What’s missing is not a better prompt.

It’s a runtime layer that:

  • encodes reusable cognitive steps
  • separates reasoning into structured components
  • allows composition instead of regeneration

In other words:

a system that reuses cognition instead of recomputing it.
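To make "reuse instead of recompute" concrete: one simple mechanism is a content-addressed cache over reasoning steps, keyed by the step name and its inputs. This is a minimal illustrative sketch, not ORCA's implementation; `CognitionCache` and `expensive_model_call` are hypothetical names:

```python
import hashlib
import json

class CognitionCache:
    """Stores the result of each reasoning step, keyed by (step, inputs).

    A repeated run with identical inputs returns the stored result
    instead of calling the model again.
    """

    def __init__(self):
        self._store = {}

    def _key(self, step: str, inputs: dict) -> str:
        payload = json.dumps({"step": step, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, step: str, inputs: dict, compute) -> dict:
        key = self._key(step, inputs)
        if key not in self._store:          # only compute on a cache miss
            self._store[key] = compute(inputs)
        return self._store[key]

calls = []

def expensive_model_call(inputs):
    calls.append(inputs)                    # stands in for a real LLM call
    return {"summary": inputs["text"][:10]}

cache = CognitionCache()
cache.run("summarize", {"text": "hello world, again"}, expensive_model_call)
cache.run("summarize", {"text": "hello world, again"}, expensive_model_call)
# the second run reuses the first result: only one model call happened
```

The point of the sketch is the key function: once a step's inputs are explicit and serializable, recomputation becomes optional rather than mandatory.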


From prompts to skills (and where ORCA fits)

Instead of:

→ Prompt → Model → Output

You need:

→ Skill → Execution → Structured Output

Not conceptually. Operationally.

This is exactly what ORCA implements: a runtime layer where “skills” are reusable cognitive units — not prompts.

  • defined inputs
  • structured outputs
  • explicit execution

No recomputation. No guesswork.
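A minimal sketch of what a skill as a "reusable cognitive unit" might look like, assuming the three properties above (defined inputs, structured outputs, explicit execution) as the contract. The `Skill` class and its field names are illustrative, not ORCA's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Skill:
    """A reusable cognitive unit: declared inputs, structured output, explicit run."""
    name: str
    inputs: tuple                     # required input fields
    outputs: tuple                    # fields the result must contain
    execute: Callable[[dict], dict]   # the bound backend (model call, tool, etc.)

    def run(self, payload: dict) -> dict:
        missing = [f for f in self.inputs if f not in payload]
        if missing:
            raise ValueError(f"{self.name}: missing inputs {missing}")
        result = self.execute(payload)
        if not all(f in result for f in self.outputs):
            raise ValueError(f"{self.name}: malformed output {result}")
        return result

# A trivial backend stands in for a narrow, well-scoped model call.
extract_title = Skill(
    name="extract_title",
    inputs=("text",),
    outputs=("title",),
    execute=lambda p: {"title": p["text"].splitlines()[0].strip()},
)
```

Calling `extract_title.run({"text": "ORCA\nbody"})` either returns a structured result or fails loudly; there is no "hope the model reasons correctly" step.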


Why most agent frameworks hit a wall

Most “agent frameworks” today are:

  • prompt orchestration layers
  • tool wrappers
  • retry loops with better formatting

They don’t model cognition.

They orchestrate prompts.

That’s not a runtime.


The shift we actually need

The shift is not better prompting.

It’s architectural.

From:

  • stateless generation

To:

  • structured, reusable cognition

That’s the gap ORCA is designed to close.


Final thought

Prompt engineering isn’t useless.

It’s just solving the wrong problem.

We’ve been optimizing the interface instead of the system.

And it shows.


If you’ve pushed prompt engineering far enough, you’ve seen the limit.

The question is:

are you ready to try what replaces it?


:slight_smile: Finally! Someone who isn't all about prompting on here. I built a whole new architecture for this very reason. I've already built all of the things you've mentioned, as well as a whole bunch of others you probably haven't even considered yet. Models, to me, are nothing more than compute power.

I’ve been testing this on my local ROCm setup with Gemma-4 E4B, and the architecture only gets me so far before the model’s ‘thinking’ parameters become the bottleneck.

Pretty much all of the public models people can run locally are quite dumb. It's why I set up my operating system to use OpenAI API models.


Glad to hear that — and I mean it. The “models are compute power” framing is exactly right. That’s the core thesis: the model is an execution engine, not the architecture itself.

Curious about what you’ve built — especially around reuse and composition of cognitive steps. Most people who arrive at this conclusion independently end up solving similar structural problems (persistence, structured I/O, separation of reasoning from execution) but with very different tradeoffs.

If you’re open to sharing, I’d genuinely like to compare notes. The point of publishing ORCA wasn’t to claim novelty on every front — it was to make the architectural argument explicit and give it a concrete, reproducible runtime. The more people building in this direction, the faster we move past the “just prompt harder” era.

That’s actually a really interesting data point — and it might validate the thesis more than contradict it.

If the model’s “thinking” parameters are your bottleneck, it likely means the model is still being asked to reason through too much in a single pass. That’s exactly the problem the architecture is designed to solve: you break cognition into discrete skills with defined inputs/outputs, so each model call is a narrow, well-scoped execution — not open-ended reasoning.

With a smaller model like Gemma-4 E4B, the architecture becomes more important, not less. The model doesn’t need to “think” — it needs to execute a structured step. The cognitive load shifts from the model to the runtime.
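To illustrate the shift from open-ended reasoning to narrow, well-scoped execution, here is a toy pipeline where each model call does one small step and the runtime does the chaining. `call_model` is a stub standing in for a small local model; the step names are invented for the example:

```python
def call_model(instruction: str, data: str) -> str:
    """Stub for a backend call; a real one would hit a small local model
    with a single narrow instruction, not an open-ended reasoning prompt."""
    return f"[{instruction}] {data}"

def run_pipeline(text: str) -> dict:
    # Each step is a small, well-scoped execution; the composition
    # lives in the runtime, not inside one giant forward pass.
    outline = call_model("outline in 3 bullets", text)
    draft = call_model("draft one paragraph", outline)
    check = call_model("flag contradictions", draft)
    return {"outline": outline, "draft": draft, "check": check}
```

A weaker model that struggles with the monolithic version of this task often handles each narrow step fine, which is the point being made above.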

Would be curious to know: are you running full skills with structured I/O, or using the model in a more traditional prompt-based flow on top of the framework? That distinction usually explains the bottleneck.

Well, I've been at this for a while, mate. I built the entire stack. I have developed a whole new architecture for AI to use. The first thing I built was a way for AI to persist. Then I used that new filing system as the kernel for a new modular operating system around it. As it turns out, it also significantly reduces token use. The older the session gets, the bigger the savings get.

I’ve also built a significant set of tools that the models can use. Another thing I did was develop a system the models can use so that all you need to do is tell the model what you want it to do in chat via voice or typing and it will automatically do it, end to end unattended.

I can also drop in the whole docs folder; it is quite extensive now. With the amount of features I have packed into this OS, I have made sure I am going to do for AI what Microsoft Windows did for computers. The whole industry has been doing it all wrong all along with AI models. AI agents are the future.

Sounds like you’ve gone deep — persistence, token efficiency over time, and autonomous end-to-end execution are all hard problems, so respect for tackling the full stack.

A few of those resonate directly with what ORCA addresses: persistence of cognitive state, modular composition, and reducing redundant computation across sessions. We may be solving overlapping problems from different angles.

Is your project open source? Or do you have a link for me to check? I would be interested in comparing approaches and seeing whether they fit together or are compatible.

I am going to release everything in the core for free. I am in the final polishing stages, writing up guides and whatnot. I'll launch a beta within about a month. I just finished working on Story Studio and Script Runner. I can basically write anything end to end now.

12 chapters with 3500 words per chapter. It knocked it up pretty quickly. The entire story.

It's too big to copy-paste in here. But you can see the beginning and end with line counts. I can email you the story if you'd like to see what a 100 percent AI-written novel looks like. I have one goal: do for AI what Bill Gates did for computers.

Interesting — end-to-end generation with chapter-level coherence is a hard problem, especially maintaining consistency across 40k+ words. That’s where most prompt-only pipelines fall apart (context drift, character inconsistency, plot contradictions).

Curious about one thing: when you say “Story Studio” — is the model carrying the full narrative state in context, or do you have an explicit state layer that tracks characters, plot arcs, and continuity across chapters? That distinction is exactly where the architecture argument lands.

Looking forward to seeing the release. Open beats closed every time.

It’s a combination of two things. Story Studio and Script Runner.

  • Story Studio is the authoring workspace. It manages explicit entities like projects, scenes, guardrails, and exports.
  • Its AI Assist actions (concept, outline, draft, rewrite) are targeted calls, not a full autonomous pipeline.
  • When you enable Autonomy, Story Studio builds a structured payload (format, counts, guardrails, project metadata) and launches a run in Script Runner.
  • Script Runner is the execution engine. It persists run state, events, artifacts, progress, and lifecycle controls (start/pause/resume/stop/rerun/export).
  • It supports long form generation with checkpoints and recoverability, then returns outputs/artifacts back to the UI.
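For readers following along: a run layer with lifecycle controls and checkpoints, as described above, can be sketched generically as a small state machine. This is an assumption-laden illustration of the pattern, not Script Runner's actual code; all names are invented:

```python
from enum import Enum

class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    DONE = "done"

# Allowed lifecycle transitions (start/pause/resume/stop), keyed by
# (current state, requested action).
TRANSITIONS = {
    ("pending", "start"): RunState.RUNNING,
    ("running", "pause"): RunState.PAUSED,
    ("paused", "resume"): RunState.RUNNING,
    ("running", "stop"): RunState.STOPPED,
    ("running", "finish"): RunState.DONE,
}

class Run:
    def __init__(self, payload: dict):
        self.payload = payload       # e.g. format, counts, guardrails, metadata
        self.state = RunState.PENDING
        self.checkpoints = []        # persisted progress for recoverability

    def signal(self, action: str) -> RunState:
        nxt = TRANSITIONS.get((self.state.value, action))
        if nxt is None:
            raise ValueError(f"cannot {action} from {self.state.value}")
        self.state = nxt
        return self.state

    def checkpoint(self, artifact: dict):
        self.checkpoints.append(artifact)   # survives pause/resume
```

Because illegal transitions raise instead of silently proceeding, a long-form run can be paused mid-chapter and resumed from the last checkpoint without regenerating earlier work.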

That’s a meaningful split — authoring workspace vs execution engine with persistent state and lifecycle controls is exactly the kind of separation that matters. Checkpoints and recoverability are where most “just prompt it” approaches collapse.

There’s real architectural overlap with what ORCA does, though the scope is different. ORCA generalizes that pattern: any cognitive task — not just narrative generation — gets decomposed into reusable capabilities with typed contracts, bound to swappable backends (Python, OpenAPI, MCP), and executed through a runtime with DAG scheduling, step-level tracing, and checkpoint/restore.

The key difference is that ORCA’s capabilities and skills are domain-agnostic and open. A text.content.summarize capability used inside a story pipeline is the same one used in a legal document workflow or a security audit — same contract, same governance, different binding.
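A minimal sketch of the "same contract, different binding" idea: one capability contract, enforced regardless of which backend fulfills it. The contract shape, backend functions, and binding names below are illustrative assumptions, not ORCA's actual API:

```python
# One domain-agnostic contract; only the backend binding changes per domain.
CONTRACT = {
    "capability": "text.content.summarize",
    "inputs": {"text": str},
    "outputs": {"summary": str},
}

def local_python_backend(inputs: dict) -> dict:
    return {"summary": inputs["text"][:40]}            # e.g. an in-process heuristic

def remote_api_backend(inputs: dict) -> dict:
    return {"summary": f"(remote) {inputs['text'][:40]}"}  # e.g. an OpenAPI/MCP call

# Different pipelines bind the same capability to different backends.
BINDINGS = {
    "story_pipeline": local_python_backend,
    "legal_review": remote_api_backend,
}

def invoke(binding: str, inputs: dict) -> dict:
    backend = BINDINGS[binding]
    result = backend(inputs)
    # Same contract, any backend: the output shape is checked, not trusted.
    assert set(result) == set(CONTRACT["outputs"])
    return result
```

The governance lives in the contract check, so swapping a backend never changes what downstream steps are allowed to assume.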

Sounds like you’ve solved the vertical problem well for narrative. ORCA is trying to solve the horizontal one.

Would genuinely be interested to compare notes when you release. The more systems that take state persistence and execution structure seriously, the better the argument gets for everyone.

Continuing to evangelize this idea: the problem isn’t prompts, it’s architecture.
Now also available on SSRN:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6600840

Thank you for this interesting thread.

Thanks for reading! If you have questions about the architecture or want to dig into any specific part, happy to discuss.