ORCA: A Cognitive Runtime Layer for Agent Systems (paper + open source)

I’ve been exploring some of the structural limitations of prompt-based agent systems and built a framework to experiment with an alternative approach.

The core idea is to introduce a “cognitive runtime layer” (ORCA) between the agent and the underlying tools.

In this model:

  • capabilities represent atomic cognitive operations (e.g., retrieve, transform, evaluate)
  • skills define composable workflows over capabilities
  • execution is explicit and structured, rather than embedded in prompts

This aims to separate concerns that are often tightly coupled in current agent designs:

  • cognition (what needs to be done)
  • execution (how it is carried out)
  • orchestration (decision-making at the agent level)

The hypothesis is that making this separation explicit can improve:

  • composability
  • observability
  • controllability of execution

Open-source implementation:

Paper (DOI):

I’d be particularly interested in feedback on a few points:

  • how far capability granularity should go before overhead dominates
  • whether declarative execution models can realistically replace prompt pipelines
  • how this kind of abstraction would behave in more complex, real-world agent systems

Happy to expand on the execution model, design principles, or concrete examples if useful.


My feedback, for now:


ORCA is a good idea. The strongest version of the idea is not “skills are useful.” It is this: prompt text should not be the control plane for serious agent systems. Your public repo and first release already reflect that position in concrete form: declarative DAG skills, capability contracts, typed CognitiveState, policy gates, checkpoint/restore, multi-protocol bindings, MCP exposure, native tool adapters, audit modes, RBAC, and observability. The current public release is v0.1.0, marked as the first release, with an execution engine, 163 binding YAMLs, 25 deterministic baseline services, framework adapters, a benchmark lab, and a compose DSL; the README separately emphasizes 122 shipped capabilities with deterministic Python baselines and governed naming. (GitHub)

The background that matters

Your project is timely because the field has already started to split “agent” systems into layers, even when it uses different terms. Anthropic explicitly distinguishes workflows from agents, where workflows follow predefined code paths and agents dynamically direct their own process and tool use. LangGraph positions itself as low-level infrastructure for long-running, stateful workflows and agents, with durable execution, memory, debugging, and human oversight rather than a semantic cognition model. Microsoft Agent Framework now makes a similar distinction, pairing agents with graph-based workflows that support type-safe routing, checkpointing, and human-in-the-loop control. That convergence supports your core premise: there really is a missing middle layer between “LLM decides everything” and “tool does everything.” (Anthropic)

There is also a conceptual lineage behind what you are doing. CoALA frames language agents in terms of modular memory, structured action spaces, and a generalized decision process. StateFlow frames multi-step LLM task solving as a state machine. DSPy argues that expert-crafted prompts should be reduced or even removed from the center of the system through declarative modules and compilation. ORCA fits naturally in that lineage, but with a more runtime-centric emphasis than any one of those projects by itself. (arXiv)

My main judgment

The best way to position ORCA is as a governed execution substrate for the reliable subset of agent behavior.

That is narrower than “this replaces prompt-based agents,” but stronger and more defensible. It matches your implementation and it matches the external landscape. Anthropic’s own guidance still recommends keeping systems simple and transparent, and warns that frameworks can add abstraction that obscures what is happening. ORCA works best when it is not trying to replace all model-driven planning, but instead taking over the parts of agent behavior that benefit from explicit contracts, explicit state, explicit control flow, and explicit policy. (Anthropic)

1) How far capability granularity should go

The right stopping point is not “atomic thought.” It is the smallest semantically stable operational unit.

A capability is worth existing when at least one of these is true: it has a distinct safety or trust boundary, it could swap backends without changing the surrounding workflow, it deserves its own evaluation set, it is reused in multiple skills, or it is a natural checkpoint/cache/retry boundary. Your repo already hints at exactly this worldview through governed naming, typed state, JSON Schemas, lifecycle stages, conformance profiles, and fallback chains. That is the right direction. (GitHub)

A good capability boundary looks like retrieve_evidence, rank_candidates, validate_schema, extract_claims, or request_user_confirmation. A bad overly-fine boundary looks like “generate one reasoning fragment,” “decide one prompt phrase,” or “rewrite intermediate sentence three.” Once a capability is too fine, it stops being a stable contract and starts being externalized internal thought. At that point you get more YAML, more mapping, more serialization, and more debugging surface without more value.
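To make the boundary concrete, here is a minimal sketch of what a well-scoped capability contract might look like. Every field name here is an illustrative assumption, not the actual ORCA schema — the point is that each admission criterion (eval ownership, backend substitution, safety boundary) shows up as an explicit field:

```yaml
# Hypothetical capability contract -- field names are illustrative,
# not the actual ORCA schema.
capability: retrieve_evidence
version: 0.1.0
description: Fetch supporting passages for a claim from a document store.
inputs:
  claim: { type: string }
  top_k: { type: integer, default: 5 }
outputs:
  passages:
    type: array
    items:
      type: object
      properties:
        text: { type: string }
        source: { type: string }
policy:
  side_effects: none          # read-only, so safe to retry
  trust_boundary: external    # touches an external store
bindings:
  - protocol: PythonCall
    target: baseline_bm25_retriever    # deterministic baseline
  - protocol: MCP
    target: vector_search_server       # swappable model-era backend
```

Note what "rewrite intermediate sentence three" could never fill in: there is no stable output schema, no distinct policy, and no second binding that would ever make sense.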

A practical test is this:

  • Would you evaluate this step separately?
  • Would you apply a different policy to this step?
  • Could you swap its implementation independently?
  • Would a failure here deserve its own retry or approval gate?

If the answer to all four is no, it is probably too small.
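That four-question test is mechanical enough to encode directly. A minimal sketch — the field names and the any-of-four rule are mine, not part of ORCA:

```python
# Hypothetical admission check for a proposed capability.
# The criteria mirror the four questions above; names are illustrative.
from dataclasses import dataclass

@dataclass
class CapabilityProposal:
    name: str
    has_own_eval_set: bool           # would you evaluate this step separately?
    has_distinct_policy: bool        # would a different policy apply here?
    backend_swappable: bool          # could its implementation swap independently?
    is_retry_or_gate_boundary: bool  # would a failure deserve its own retry/gate?

def admit(p: CapabilityProposal) -> bool:
    """A capability earns its place if at least one criterion holds."""
    return any([p.has_own_eval_set, p.has_distinct_policy,
                p.backend_swappable, p.is_retry_or_gate_boundary])

print(admit(CapabilityProposal("retrieve_evidence", True, False, True, True)))    # True
print(admit(CapabilityProposal("rewrite_sentence_3", False, False, False, False)))  # False
```

The useful property is that the rubric becomes a reviewable artifact instead of taste: every new capability PR can carry its own filled-in `CapabilityProposal`.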

Anthropic’s tool-writing guidance points in the same direction from the tool side: prototype tools quickly, then run comprehensive evaluation, because ergonomics and interface clarity matter more than people expect. That same logic applies to capabilities. A capability taxonomy without evaluation discipline will sprawl. (Anthropic)

2) Can declarative execution replace prompt pipelines

For some tasks, yes. In general, no.

It can replace prompt pipelines when the task is repeated, tool-heavy, policy-sensitive, side-effecting, or expensive to debug after the fact. In those cases prompt chains are usually informal workflows pretending to be prompts. Anthropic’s workflow/agent split, LangGraph’s durable-execution model, and Microsoft’s workflow/checkpointing model all point toward the same operational truth: once a process is multi-step and failure-prone, execution structure wants to become explicit. (Anthropic)

It does not replace prompting for novelty, underspecified goals, exploration, or strategy invention. That is where model-driven planning still belongs. So the durable architecture is hybrid:

  • the agent or planner chooses or synthesizes a skill,
  • ORCA executes the skill with state, policies, and traces,
  • individual capabilities may still call LLMs or tools internally,
  • and control returns upward when the workflow is underspecified.

That is also where ORCA is stronger than a pure graph engine. LangGraph gives orchestration infrastructure. Microsoft Agent Framework gives workflow infrastructure. OpenAI’s Agents SDK gives tracing, tool use, handoffs, and guardrails. ORCA’s chance is to sit above those kinds of runtimes as the semantic execution layer that says what a step is, what policies apply, and how it composes. (LangChain Docs)

3) How this behaves in more complex real-world systems

This is where ORCA becomes more valuable and more fragile at the same time.

It becomes more valuable because real systems care about pause/resume, human approval, auditability, replay, vendor substitution, side-effect control, and observability. Your repo is already aiming at exactly those needs with checkpoint/restore, audit modes, OTel, metrics, SSE streaming, fallback chains, conformance levels, security docs, and governance stages. That is the right systems surface. (GitHub)

But it also becomes more fragile because the hard problem shifts from “how do I prompt this?” to “how do I govern the execution ontology?” In practice that means registry bloat, overlapping capabilities, naming drift, incompatible outputs, state inflation, and subtle retry/side-effect bugs. Recent public issue trackers in adjacent frameworks show the exact pressure points: LangGraph had a development/runtime persistence mismatch that lost state on restart; AutoGen GraphFlow had failures when graph speaker selection interacted with tool-enabled agents; CrewAI had persistence failures because typed state was not JSON-serializable; and Pydantic users ran into friction when trying to treat explicit workflows as if they were ordinary agents for adapter/UI purposes. Those are not random bugs. They are symptoms of the structural difficulty of explicit execution systems. (GitHub)

For ORCA specifically, that means three rules should stay strict:

  1. State must stay small and boring. Typed, serializable, replayable.
  2. Side effects must be classified. Idempotent or not, retryable or not, approval-gated or not.
  3. A skill is not automatically an agent. Do not let UI or framework adapters collapse those abstractions back together.

If you keep those boundaries sharp, ORCA gets stronger as systems become more real. If you blur them, the runtime will become harder to operate than the prompt chains it is trying to replace.
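Rule 1 in particular is directly testable: if state is typed and serializable, a checkpoint is nothing more than a JSON round trip. A minimal sketch — the field names are mine, not the real CognitiveState:

```python
# Hypothetical typed, serializable execution state.
# Small and boring by design: a checkpoint is a plain JSON round trip.
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class StepState:
    step_id: str
    status: str     # "pending" | "done" | "failed"
    attempt: int

@dataclass
class RunState:
    run_id: str
    steps: list     # list of StepState
    artifacts: dict # only JSON-safe values allowed in here

def checkpoint(state: RunState) -> str:
    return json.dumps(asdict(state))

def restore(blob: str) -> RunState:
    raw = json.loads(blob)
    raw["steps"] = [StepState(**s) for s in raw["steps"]]
    return RunState(**raw)

state = RunState("run-1", [StepState("retrieve", "done", 1)], {"claim": "x"})
assert restore(checkpoint(state)) == state   # replayable by construction
```

The CrewAI failure mode above (typed state that was not JSON-serializable) is exactly what the `assert` here guards against: if this round trip is a CI test, state inflation with non-serializable objects fails loudly instead of failing on resume.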

Where your current public implementation is strongest

Your strongest current differentiator is not the DAG scheduler by itself. It is the combination of:

  • binding abstraction across PythonCall, OpenAPI, MCP, and OpenRPC,
  • deterministic baselines plus fallback chains,
  • typed cognitive state aligned with CoALA,
  • governance and lifecycle language,
  • native tool definitions for Anthropic/OpenAI/Gemini,
  • and audit/observability surfaces.

That combination makes ORCA look less like “another agent framework” and more like a runtime architecture with a serious control plane. The repo even states openly when not to use it: if you only need a quick prompt-based prototype or do not need control or safety. That honesty helps the project because it narrows the claim to the zone where the architecture is strongest. (GitHub)

I also think the deterministic-baseline-first approach is smart. The release notes and README make it clear that a large part of the surface can run without an LLM key, and the runtime supports deterministic terminal baselines in fallback chains. That is useful for CI, local development, regression testing, and debugging. Many agent projects skip that layer and then have no stable floor under evaluation. (GitHub)

Where I would tighten the project

The first thing I would tighten is positioning language. Right now the public materials support “runtime,” “execution layer,” “reference architecture,” and “standard proposal.” They do not yet support “emerging standard” in the stronger ecosystem sense. The release is very recent, the repo is still small publicly, and the field is moving fast. I would keep the ambition, but phrase it as a reference model or execution architecture rather than as a settled standard. (GitHub)

The second thing I would tighten is count clarity. The release notes describe a companion registry with 141 capabilities and 36 ready-to-use skills, while the runtime README emphasizes 122 shipped capabilities with deterministic baselines. Those can both be true, but the boundary between “runtime-supported executable set” and “registry total” should be obvious everywhere. Otherwise readers will read it as inconsistency. (GitHub)

The third thing I would tighten is evaluation methodology. Recent guidance from both Anthropic and OpenAI pushes toward evaluation loops, traces, graders, and workflow-level testing rather than only final-answer snapshots. ORCA is unusually well positioned for that because your runtime already exposes explicit state transitions and execution structure. So your paper should lean harder into process metrics, not just outcome metrics. (Anthropic)

What I would claim in the paper, and what I would not

I would make these claims strongly:

  • ORCA improves inspectability because execution is explicit, not buried in prompt text.
  • ORCA improves substitutability because capability contracts can target multiple bindings and fallback chains.
  • ORCA improves policy control because approval, trust, side effects, and scope can be surfaced at the runtime layer.
  • ORCA improves operational ergonomics for repeated, tool-heavy, governance-sensitive tasks. (GitHub)

I would be more careful with these claims:

  • “declarative execution broadly outperforms prompt pipelines,”
  • “ORCA improves reproducibility in general,”
  • “this is a universal architecture for agents.”

Those claims need larger and more diverse empirical support than you currently show publicly. The more defensible framing is: ORCA is a better control substrate for the part of agent behavior that needs to be dependable, auditable, and composable.

The security angle is bigger than it looks

This is one of the places where your architecture can become more important than a typical workflow framework.

MCP’s current specification explicitly says tools represent arbitrary code execution, tool descriptions/annotations should be treated as untrusted unless they come from trusted servers, and hosts must obtain explicit user consent before invoking tools. Recent papers on the emerging agent-skills layer also point in the same direction: the skills abstraction is growing quickly, but security and lifecycle governance are becoming central concerns rather than edge concerns. One recent survey frames skills as a major shift toward modular, on-demand capability extension and highlights a four-tier governance model; a recent security paper argues that the most severe threats come from structural issues such as weak data-instruction boundaries and persistent trust assumptions. That fits ORCA directly. (Model Context Protocol)

So for your case, governance should not be a side chapter. It should be one of the main contributions. Signed manifests, provenance metadata, side-effect classes, trust tiers, review states, and permission scoping are not “enterprise extras.” They are part of why a runtime layer is valuable at all.

My concrete recommendation for your next phase

If I were steering this project, I would do five things next.

1. Publish a capability admission rubric.
Every capability should justify itself by reuse, safety boundary, backend substitutability, eval ownership, and lifecycle owner. This is how you stop ontology sprawl before it starts.

2. Make execution metadata richer and more explicit.
Each capability should declare side effects, idempotency, retry semantics, confidentiality level, cacheability, and approval requirements.

3. Benchmark process, not just answers.
Use traces and graders to score skill selection, routing correctness, policy-trigger correctness, retry behavior, human-gate behavior, and replay fidelity, not only task success. OpenAI’s agent-eval guidance is directly useful here. (OpenAI Developers)
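A process-level grader over traces can be quite small. The trace record shape and the grading criteria below are illustrative assumptions for this sketch, not the OpenAI evals API:

```python
# Hypothetical trace grader: score process behavior, not just the final answer.
# The trace record shape is an assumption for illustration.

def grade_trace(trace: list, expected_route: str) -> dict:
    steps = [e["step"] for e in trace]
    return {
        "routing_correct": expected_route in steps,
        "policy_gates_fired": sum(1 for e in trace if e.get("gate") == "approved"),
        "retries": sum(e.get("attempt", 1) - 1 for e in trace),
        "task_success": trace[-1].get("status") == "ok" if trace else False,
    }

trace = [
    {"step": "retrieve_evidence", "attempt": 2, "status": "ok"},
    {"step": "request_user_confirmation", "gate": "approved", "status": "ok"},
    {"step": "apply_change", "status": "ok"},
]
print(grade_trace(trace, expected_route="request_user_confirmation"))
# routing correct, one gate fired, one retry, task succeeded
```

Because an explicit runtime already emits structured steps, a grader like this needs no LLM judge for routing, gating, or retry metrics — only the semantic quality of outputs still needs model-based grading.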

4. Lean into the hybrid architecture story.
Do not imply that ORCA removes prompting. Show that it relocates prompting: planning above, semantic execution in the middle, tool/model calls below.

5. Demonstrate one hard real-world workflow.
Use something long-running, tool-rich, failure-prone, and policy-sensitive. The more the task needs checkpoints, approvals, substitutions, and replay, the more ORCA’s benefits will become legible.

My final view

For your case, the idea is real. The timing is good. The public implementation is already substantial enough to take seriously. The ecosystem is moving toward explicit workflows, durable execution, state, checkpoints, traces, and tighter tool boundaries, which is exactly the environment in which ORCA makes sense. (Anthropic)

The project will succeed or fail on one question:

Can you keep the abstraction disciplined enough that it stays more governable than prompt pipelines, instead of becoming a more complicated prompt pipeline with better names?

That is the real test.

My answer right now is: yes, potentially, if you keep capability granularity at the level of stable operational units, treat the system as hybrid rather than totalizing, and turn governance and evaluation into first-class contributions rather than supporting details.


Thanks for the detailed analysis. A few corrections and notes:

On the numbers: the repo ships 189 binding YAMLs, not 163. The 122 vs 141 difference is real but intentional — 122 are the runtime capabilities with deterministic Python baselines (no API key needed), 141 is the full registry count after the composability wave added 19 new capabilities. The changelog documents that explicitly, but I agree it should be clearer in the README itself. Will fix.

On the admission rubric: this already exists. SKILL_ADMISSION_POLICY.md defines a 7-item checklist, 4-channel promotion model (local → experimental → community → official), and a Canonical-First Rule. promotion_package.py implements it with different strictness per channel. Happy to get feedback on it.

On side-effect classification and idempotency declarations: fair point, and probably the most actionable gap you identified. Right now capabilities don’t declare side-effect class, retry semantics, or cacheability in their YAML contracts. That metadata exists conceptually in the architecture but isn’t enforced per-capability yet. Going to work on that.

On a hard end-to-end demo: agreed. The pieces are there — checkpoints, approval gates, fallback chains, webhook events — but there’s no single showcase workflow that exercises all of them together in a realistic scenario. On the list.

On the hybrid architecture: ORCA.md already describes this explicitly — the model is not “replace prompting” but “relocate it.” Planning above, structured execution in the middle, tool/model calls below. But I take the point that the paper should lean harder into that framing.

Appreciate the time. Some of the structural recommendations map well to what’s already in the roadmap.


By the way, are you active in the arXiv computer science AI community? I would really like to share my work there and need an endorsement! https://arxiv.org/auth/endorse?x=GAU4NP

Updated:


Based on the current implementation, my feedback on the three points is stronger and more specific than a purely conceptual review.

The current codebase already supports a real separation between what the system can do and how it executes it. The README presents the runtime as a deterministic, binding-driven engine with abstract capability contracts, declarative DAG skills, typed CognitiveState, safety gates, checkpoint/restore, and multi-protocol bindings. The architecture docs also make the split explicit: the companion registry is the source of truth for vocabulary, capabilities, skills, and governance, while the runtime is the execution engine with bindings, services, CLI/HTTP/MCP exposure, and audit surfaces. (GitHub)

That matters because it means ORCA is not just “prompting with nicer names.” It already has real execution machinery. The question is now where the abstraction should stop, where it should defer to model-driven planning, and what extra semantics the runtime still needs to behave safely and predictably in harder environments. Anthropic’s current guidance aligns with that framing: workflows are predefined code paths, agents are model-directed processes, and added complexity should be justified by real gains in predictability or capability. (Anthropic)

1. How far capability granularity should go before overhead dominates

My answer is: stop at the smallest semantically stable operational unit.

That means a capability should be small enough to be reusable and governable, but not so small that it becomes a thin wrapper around transient internal thought. In ORCA terms, the capability boundary should usually sit where a step has a stable contract, can be bound to different implementations, can be evaluated independently, or deserves distinct policy treatment. The current runtime clearly supports that style: capabilities are abstract contracts, skills are declarative graphs over them, and the binding layer decides which backend actually executes a capability. (GitHub)

The current implementation gives a concrete reason not to go too fine-grained: your step layer is already expressive. runtime/step_control.py supports condition, retry, foreach, while_loop, router, and scatter, with explicit composition rules. That means many things that might otherwise become separate “micro-capabilities” can remain as step-level execution semantics instead. You already have a place to encode iteration, branching, retry behavior, and fan-out without exploding the ontology. (GitHub)

This is an important implementation-level signal. If the runtime already has rich control-flow operators, then adding ever-smaller capabilities stops buying much. The cost starts showing up elsewhere: more capability IDs, more overlap, more routing ambiguity, more tests, more registry burden, and more low-value state artifacts in the run trace. Anthropic’s guidance on workflows and frameworks points the same way: frameworks help, but they also add abstraction and can tempt teams into unnecessary complexity. (GitHub)
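To make that concrete: the step operators named above already cover the cases that most tempt people into micro-capabilities. Here is a sketch of a skill that keeps its capabilities coarse by pushing retry, fan-out, and branching into step semantics. The operator names (`retry`, `foreach`, `condition`) come from the runtime/step_control.py list above, but the surrounding YAML shape is my assumption, not the actual schema:

```yaml
# Hypothetical skill definition. Operator names match those listed for
# runtime/step_control.py; the YAML shape is illustrative, not the schema.
skill: triage_tickets
steps:
  - id: fetch
    capability: retrieve_tickets
    retry: { max_attempts: 3, backoff_s: 2 }   # retry is step semantics...
  - id: classify_each
    foreach: "{{ fetch.output.tickets }}"      # ...and so is fan-out,
    capability: classify_ticket                # so no "retry_ticket_fetch"
  - id: escalate                               # micro-capability is needed
    condition: "{{ classify_each.output.any_urgent }}"
    capability: request_user_confirmation      # branching lives here too
```

Only three capabilities appear, each with a stable contract; everything iteration- or failure-shaped stays in the step layer where the runtime already owns it.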

So for ORCA specifically, I would use a practical admission rule like this:

A capability is worth existing if at least one of these is true:

  • it has a distinct policy or trust profile
  • it has a distinct backend substitution story
  • it has a distinct evaluation target
  • it is reused across multiple skills
  • it is a natural checkpoint / retry / cache boundary
  • it has a distinct ownership or lifecycle path

If none of those are true, it probably belongs inside step logic, skill structure, or the implementation of another capability.

The current code reinforces this. Your models already make room for structured execution state, but not every intermediate thought needs to become a first-class capability. WorkingState in runtime/models.py already carries artifacts, entities, options, criteria, evidence, risks, hypotheses, uncertainties, and intermediate decisions. That is a good place for transient reasoning products. Not all of those deserve promotion into the capability catalog. (GitHub)

So the detailed judgment is:

  • Good granularity: capabilities like retrieval, validation, transformation, approval, policy gating, external mutation, structured extraction.
  • Too fine: sentence-level reformulations, temporary reasoning fragments, prompt tweaks, one-off “think” steps.
  • Why: your runtime already has enough control-flow expressiveness that over-decomposition would mostly add ontology cost, not execution power. (GitHub)

2. Whether declarative execution models can realistically replace prompt pipelines

My answer is: yes for workflow control, no for the entire stack.

The best way to phrase this is not “declarative execution replaces prompting.” The better claim is: declarative execution can replace prompt-driven control flow for a large class of tasks.

That is already how the current implementation behaves. The runtime does not just chain prompts. It has DAG scheduling, explicit state, binding resolution, step-level control flow, policy gates, checkpointing, and backend fallback. The README explicitly frames this as moving from prompt-driven behavior to execution-driven systems, and ORCA itself is described as making reasoning, decisions, and actions explicit, composable, and governable. (GitHub)

The implementation provides strong evidence that this is realistic. The runtime has:

  • declarative DAG skills
  • explicit execution state
  • protocol routing across PythonCall, OpenAPI, MCP, and OpenRPC
  • baseline-to-LLM backend substitution
  • retry and routing semantics
  • schema validation in scaffolded planning
  • observability and audit surfaces. (GitHub)

That is already enough to replace prompt pipelines for many structured tasks. In practical terms, if a workflow is repeated, tool-heavy, policy-sensitive, or operationally important, you no longer need a prompt to secretly carry the control flow. The runtime can carry it instead.

But that does not mean declarative execution should replace all prompting. Anthropic’s current workflow/agent distinction is the best background here. Workflows are predefined code paths; agents dynamically direct their own process. Anthropic explicitly says workflows give predictability and consistency for well-defined tasks, while agents remain better when flexibility and model-driven decision-making are needed. (Anthropic)

That maps cleanly onto ORCA:

  • Above ORCA: planning, interpretation, strategy selection, exception handling under novelty
  • Inside ORCA: structured execution, policy application, retry/resume behavior, state transitions
  • Below ORCA: tool invocations and model-backed implementations

Your own code supports exactly that hybrid reading. official_services/scaffold_service.py is especially revealing: it is explicitly “binding-first,” routing planning through the runtime capability executor when possible, then validating the resulting YAML against the runtime skill schema. That is not “no prompts.” It is “prompts where they belong, execution where it belongs.” (GitHub)

So my detailed feedback here is:

  • Yes, declarative execution can realistically replace prompt pipelines for structured workflow control.
  • No, it should not be sold as replacing planning or all model-mediated reasoning.
  • Your current implementation already proves the first part, because the runtime machinery is real and rich.
  • Your paper should lean harder into the hybrid claim, because that is the strongest true version of the argument. (GitHub)

3. How this abstraction would behave in more complex, real-world agent systems

My answer is: it gets more valuable and more exposed at the same time.

It gets more valuable because real systems care about exactly the things ORCA already externalizes:

  • resumability
  • explicit control flow
  • backend substitution
  • approvals and policy gating
  • auditability
  • observability
  • structured state
  • separation between contract and implementation. (GitHub)

It gets more exposed because once those concerns are explicit, the hard problems move away from prompt engineering and into runtime engineering. That means:

  • side-effect safety
  • retry semantics
  • idempotency
  • deterministic replay
  • checkpoint correctness
  • schema drift
  • capability versioning
  • trust and authorization boundaries

LangGraph’s durable-execution guidance is directly relevant here. It says that with checkpoints you can pause and resume workflows after interruptions or failures, but to make that work reliably, workflows should be deterministic and idempotent, and side effects or non-deterministic operations should be isolated inside tasks. It also explicitly recommends separating multiple side-effecting operations so they are not accidentally repeated on resume. (LangChain Docs)
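The same discipline can be sketched in a few lines: wrap each side-effecting operation with an idempotency key so that a resume cannot repeat it. The key derivation and the in-memory ledger here are illustrative assumptions (a real runtime would persist the ledger durably):

```python
# Hypothetical idempotent side-effect wrapper for resume safety.
# A persisted ledger of completed operation keys means a replayed
# step returns the recorded result instead of re-running the effect.
import hashlib
import json

class EffectLedger:
    def __init__(self):
        self._done = {}   # key -> recorded result; in real use: durable storage

    def run_once(self, op_name: str, payload: dict, effect):
        key = hashlib.sha256(
            json.dumps([op_name, payload], sort_keys=True).encode()
        ).hexdigest()
        if key in self._done:        # resumed run: skip the side effect
            return self._done[key]
        result = effect(payload)     # first run: actually perform it
        self._done[key] = result
        return result

calls = []
def send_email(payload):
    calls.append(payload)            # stand-in for a real external mutation
    return "sent"

ledger = EffectLedger()
ledger.run_once("send_email", {"to": "a@b.c"}, send_email)
ledger.run_once("send_email", {"to": "a@b.c"}, send_email)   # replay after resume
assert len(calls) == 1               # the effect ran exactly once
```

This is also why the capability contract needs to declare side-effect class in the first place: the runtime can only apply a wrapper like this to steps it knows are externally mutating.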

That point lands directly on ORCA’s current implementation.

Your runtime already has much of the “hard systems” machinery:

  • checkpoint/restore in the runtime surface
  • retry-aware OpenAPI invocation with transient status handling, max retries, and backoff
  • step-level retry and routing
  • webhook delivery and observability surfaces exposed in the architecture
  • binding resolution and fallback chains. (GitHub)

But the current capability contract still appears less mature than the execution runtime. In runtime/models.py, capability semantics are still represented through generic properties and safety dict-style fields rather than a strongly typed set of mandatory operational declarations. In practice, that means the runtime already knows how to retry, pause, resume, gate, and route, but it does not yet force every capability to declare enough truth for those behaviors to be fully principled. (GitHub)

That is the biggest implementation-based risk in real systems.

What will go well

This abstraction should behave very well in environments that are:

  • repetitive enough to justify formalization
  • tool-rich
  • side-effecting
  • policy-sensitive
  • auditable
  • long-running or interruptible

Those are the exact cases where hidden prompt logic becomes fragile and expensive, and where ORCA’s explicit execution model becomes valuable. Anthropic’s production-oriented framing supports that: workflows are especially useful for well-defined tasks where predictability matters. (Anthropic)

What will strain first

The first things that will strain are not the scheduler or the DAG model. They are:

1. Side-effect semantics
If a capability mutates external state, the runtime needs to know whether retries are safe and whether resume can repeat the action. LangGraph’s documentation makes that requirement explicit for durable execution. (LangChain Docs)

2. Trust and consent boundaries
MCP is explicit that tools are arbitrary code-execution surfaces, tool descriptions should be treated cautiously unless trusted, and hosts must obtain explicit user consent before data exposure or tool invocation. Since ORCA sits above tools and exposes MCP, it inherits part of that responsibility at the capability layer. (Model Context Protocol)

3. Evaluation maturity
OpenAI’s current agent-evals guidance says to start with traces when debugging behavior and to use trace grading to score workflow-level issues such as tool choice, handoffs, routing, and safety violations. ORCA is well positioned for that because it already structures runs into explicit steps and state. But the value of that structure will be much higher if each capability declares its operational semantics more explicitly. (OpenAI Developers)

The most important next step for real-world readiness

Based on the current implementation, the most important next step is not “more architecture.” It is stronger capability-level operational metadata.

I would make these mandatory in the contract layer:

  • side_effect_class
  • idempotency
  • retry_policy
  • cache_policy
  • confirmation_policy
  • compensation_strategy
  • sensitivity_class
  • execution_class
  • cost_latency_class

Why these? Because the runtime is already advanced enough to use them. You already have retry loops, step controls, scaffolding, validation, fallback resolution, and checkpoint-oriented execution behavior. The missing piece is turning operational truth from scattered conventions into enforced per-capability declarations. (GitHub)
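The nine declarations fit naturally into a typed contract extension. A sketch with plausible enum values — every name here is part of my proposal, not an existing ORCA field:

```python
# Hypothetical typed operational metadata for a capability contract.
# Field and enum names mirror the list above; none of this is the
# current ORCA schema -- it is a proposal sketch.
from dataclasses import dataclass
from enum import Enum

class SideEffectClass(Enum):
    NONE = "none"                    # pure computation / read-only
    EXTERNAL_READ = "external_read"
    EXTERNAL_WRITE = "external_write"

class Idempotency(Enum):
    IDEMPOTENT = "idempotent"
    AT_MOST_ONCE = "at_most_once"    # must never be retried blindly

@dataclass(frozen=True)
class OperationalMetadata:
    side_effect_class: SideEffectClass
    idempotency: Idempotency
    max_retries: int                 # retry_policy, simplified
    cacheable: bool                  # cache_policy, simplified
    requires_confirmation: bool      # confirmation_policy
    sensitivity_class: str           # e.g. "public" | "internal" | "restricted"

    def retry_allowed(self) -> bool:
        """The runtime derives safe behavior instead of guessing."""
        return (self.idempotency is Idempotency.IDEMPOTENT
                and self.max_retries > 0)

meta = OperationalMetadata(SideEffectClass.EXTERNAL_WRITE,
                           Idempotency.AT_MOST_ONCE,
                           max_retries=3, cacheable=False,
                           requires_confirmation=True,
                           sensitivity_class="internal")
assert meta.retry_allowed() is False  # non-idempotent writes never auto-retry
```

The value is in the derived methods: once the declarations exist, retry, caching, and approval behavior become consequences of the contract rather than per-skill conventions.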

Bottom line

So, based on the current implementation:

1. Capability granularity

Go as fine as the smallest governable, reusable, evaluable execution unit. Stop well before micro-reasoning granularity. Your existing step-control machinery already gives you better places than the capability layer to encode fine execution logic. (GitHub)

2. Declarative execution vs prompt pipelines

Yes, declarative execution can realistically replace prompt pipelines for structured workflow control. Your current runtime already demonstrates that. But it should be framed as replacing prompt-driven control flow, not all prompting. Planning and some leaf implementations should remain model-mediated. (GitHub)

3. Behavior in real-world systems

This abstraction should become more valuable as workflows become longer, riskier, and more operationally serious. The current implementation already has many of the right bones for that. The main remaining gap is that the runtime is ahead of the capability-contract semantics, especially around side effects, idempotency, retries, and trust boundaries. (GitHub)

The strongest single sentence I can give you is this:

The current codebase already makes ORCA look like a credible execution runtime; the next thing that will determine whether it scales cleanly is whether each capability tells the runtime enough operational truth to make retries, resumes, fallbacks, approvals, and policies reliable by construction. (GitHub)

Good point on CapabilitySpec being dict-based — that’s the kind of observation that comes from actually reading the models module, not just the docs. I’m going to promote properties and safety to proper typed dataclasses in the next release.

On the 9-field proposal: I think the realistic scope is narrower. retry_policy and cache_hint make sense as new fields. compensation_strategy and cost_latency_class feel like distributed-systems concepts that don’t belong in a capability contract at this level. The rest (idempotent, side_effects, trust_level) already exist — they just need to graduate from dict keys to enforced types.
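A minimal sketch of that graduation from dict keys to enforced types — the field names mirror the ones mentioned above, but the class shape and the `from_dict` loader are hypothetical, not the real `CapabilitySpec` models:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Tuple

class TrustLevel(Enum):
    UNTRUSTED = "untrusted"
    VERIFIED = "verified"
    CORE = "core"

@dataclass(frozen=True)
class SafetySpec:
    idempotent: bool
    side_effects: Tuple[str, ...]
    trust_level: TrustLevel

    @classmethod
    def from_dict(cls, raw: Mapping[str, Any]) -> "SafetySpec":
        # Fail loudly when the YAML/dict is loaded, instead of
        # silently at execution time when a key turns out missing.
        try:
            return cls(
                idempotent=bool(raw["idempotent"]),
                side_effects=tuple(raw.get("side_effects", ())),
                trust_level=TrustLevel(raw["trust_level"]),
            )
        except KeyError as exc:
            raise ValueError(f"safety block missing required key: {exc}") from exc
```

Same data as the dict version, but typos in key names or trust-level strings become load-time errors with a clear message.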

Curious about one thing — have you actually run any of the skills locally? The whole baseline-first design means you can pip install orca-agent-skills and execute the full suite without any API key. I’d be interested in your take on the actual execution ergonomics, not just the contract layer. Things like: does the scaffold wizard actually produce usable skills? Does the benchmark-lab output make the binding tradeoffs legible? That’s the kind of feedback that’s harder to get from reading the source.

If you want to try it, cloning the repo gives you the full test suite too — 1694 tests, all deterministic. Would appreciate a star if you find it worth exploring.

1 Like

My paper has just been submitted to arXiv; I'll post the new version as soon as it is published. In the meantime, you can visit the repo and star it if you find the approach worth sharing: GitHub - gfernandf/agent-skills: Agents should execute whenever possible — runtime for composable AI agent skills

1 Like

By the way, the full paper on Zenodo is getting traction: 62 downloads out of 70 views!

1 Like

By the way, thanks for starring the repo!

1 Like

Since the last round of discussion, we ran a controlled experiment comparing single-prompt execution against ORCA’s multi-step skill orchestration on two tasks (structured decision-making and multi-step text processing). 10 inputs per task, same model (gpt-4o-mini), fixed seed.

The numbers are honest:

| Dimension | Prompt-based | ORCA structured |
|---|---|---|
| Latency | Lower (1 LLM call) | Higher (N sequential calls) |
| Traceability | None | Full step-level trace |
| Reusability | None | Full capability reuse |
| Maintainability | Low (monolithic) | High (declarative YAML) |
| Variability | Low | Low-moderate |

ORCA is not faster for simple one-off tasks. That’s not the point.

The point is what happens when you need to audit what your agent did, swap a backend without rewriting the workflow, reuse a step across 15 different skills, or resume a failed run from a checkpoint.

Prompt-based execution gives you none of that. Not because the prompt was bad — because the architecture doesn’t support it.

Full benchmark code and results are in the repo: run_benchmark.py
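For readers who want the shape of the comparison without opening the repo, here is a minimal sketch of the two execution styles. The model call is a deterministic stub so the sketch runs without an API key; `run_benchmark.py` itself is structured differently and uses the real model:

```python
import time

def call_model(prompt: str) -> str:
    # Stub standing in for a real gpt-4o-mini call (deterministic on purpose).
    return prompt.upper()

def run_prompt_based(task: str) -> dict:
    # One monolithic call: fastest path, but no intermediate trace to audit.
    t0 = time.perf_counter()
    out = call_model(task)
    return {"output": out, "latency": time.perf_counter() - t0, "trace": []}

def run_structured(task: str, steps) -> dict:
    # N sequential capability calls: slower, but every step is recorded,
    # individually swappable, and reusable across skills.
    t0 = time.perf_counter()
    state, trace = task, []
    for name, step in steps:
        state = step(state)
        trace.append({"step": name, "state": state})
    return {"output": state, "latency": time.perf_counter() - t0, "trace": trace}

# Hypothetical three-step skill: retrieve -> transform -> evaluate.
steps = [
    ("retrieve", str.strip),
    ("transform", str.upper),
    ("evaluate", lambda s: s + " [OK]"),
]
```

The latency gap in the table falls directly out of the loop over `steps`; the traceability and reusability columns fall out of `trace` and the fact that each step is a named, standalone callable.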

1 Like

Wow, a whole chain of bots replying to bots.

mmmmm no, I am not a bot …

Continuing to evangelize ORCA — a cognitive runtime approach for LLM agents.
The core idea: the problem isn’t prompts, it’s architecture.
Now also available on SSRN:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6600840

Is there such a thing as an Intellectual food fight?

Now, seriously: praise to the posters.

Nothing perceived is useless.

Also, welcome gfernandf on your first thread.
There was nothing boring here, and John6666, you also offered food for thought.

–Ernst

Haaa!!! Fair comparison. Thanks for the welcome, Ernst — glad the discussion was worth following. The back-and-forth with John actually helped me identify a couple of real gaps in the implementation, so it was productive noise at least. I invite you to visit the repo, give it a try, and star it if you like the idea: GitHub - gfernandf/agent-skills: Agents should execute whenever possible — runtime for composable AI agent skills