ORCA is a good idea. The strongest version of the idea is not “skills are useful.” It is this: prompt text should not be the control plane for serious agent systems. Your public repo and first release already reflect that position in concrete form: declarative DAG skills, capability contracts, typed CognitiveState, policy gates, checkpoint/restore, multi-protocol bindings, MCP exposure, native tool adapters, audit modes, RBAC, and observability. The current public release, v0.1.0, ships an execution engine, 163 binding YAMLs, 25 deterministic baseline services, framework adapters, a benchmark lab, and a compose DSL; the README separately emphasizes 122 shipped capabilities with deterministic Python baselines and governed naming. (GitHub)
The background that matters
Your project is timely because the field has already started to split “agent” systems into layers, even when it uses different terms. Anthropic explicitly distinguishes workflows from agents, where workflows follow predefined code paths and agents dynamically direct their own process and tool use. LangGraph positions itself as low-level infrastructure for long-running, stateful workflows and agents, with durable execution, memory, debugging, and human oversight rather than a semantic cognition model. Microsoft Agent Framework now makes a similar distinction, pairing agents with graph-based workflows that support type-safe routing, checkpointing, and human-in-the-loop control. That convergence supports your core premise: there really is a missing middle layer between “LLM decides everything” and “tool does everything.” (Anthropic)
There is also a conceptual lineage behind what you are doing. CoALA frames language agents in terms of modular memory, structured action spaces, and a generalized decision process. StateFlow frames multi-step LLM task solving as a state machine. DSPy argues that expert-crafted prompts should be reduced or even removed from the center of the system through declarative modules and compilation. ORCA fits naturally in that lineage, but with a more runtime-centric emphasis than any one of those projects by itself. (arXiv)
My main judgment
The best way to position ORCA is as a governed execution substrate for the reliable subset of agent behavior.
That is narrower than “this replaces prompt-based agents,” but stronger and more defensible. It matches your implementation and it matches the external landscape. Anthropic’s own guidance still recommends keeping systems simple and transparent, and warns that frameworks can add abstraction that obscures what is happening. ORCA works best when it is not trying to replace all model-driven planning, but instead taking over the parts of agent behavior that benefit from explicit contracts, explicit state, explicit control flow, and explicit policy. (Anthropic)
1) How far capability granularity should go
The right stopping point is not “atomic thought.” It is the smallest semantically stable operational unit.
A capability is worth existing when at least one of these is true: it has a distinct safety or trust boundary, it could swap backends without changing the surrounding workflow, it deserves its own evaluation set, it is reused in multiple skills, or it is a natural checkpoint/cache/retry boundary. Your repo already hints at exactly this worldview through governed naming, typed state, JSON Schemas, lifecycle stages, conformance profiles, and fallback chains. That is the right direction. (GitHub)
A good capability boundary looks like retrieve_evidence, rank_candidates, validate_schema, extract_claims, or request_user_confirmation. A bad overly-fine boundary looks like “generate one reasoning fragment,” “decide one prompt phrase,” or “rewrite intermediate sentence three.” Once a capability is too fine, it stops being a stable contract and starts being externalized internal thought. At that point you get more YAML, more mapping, more serialization, and more debugging surface without more value.
A practical test is this:
- Would you evaluate this step separately?
- Would you apply a different policy to this step?
- Could you swap its implementation independently?
- Would a failure here deserve its own retry or approval gate?
If the answer is no to all four, the capability is probably too small.
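The admission test above can be mechanized. A minimal sketch; the criteria names are illustrative assumptions, not part of ORCA's public schema:

```python
from dataclasses import dataclass, astuple

@dataclass
class AdmissionCriteria:
    """Hypothetical capability-admission rubric; field names are illustrative."""
    separate_eval_set: bool    # would you evaluate this step separately?
    distinct_policy: bool      # would a different policy apply to this step?
    swappable_backend: bool    # could the implementation be swapped independently?
    own_retry_or_gate: bool    # does a failure here deserve its own retry/approval gate?

def admit_capability(c: AdmissionCriteria) -> bool:
    """A capability earns its own contract if at least one criterion holds."""
    return any(astuple(c))

# "retrieve_evidence" plausibly passes; "rewrite intermediate sentence three" does not.
print(admit_capability(AdmissionCriteria(True, False, True, False)))    # True
print(admit_capability(AdmissionCriteria(False, False, False, False)))  # False
```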
Anthropic’s tool-writing guidance points in the same direction from the tool side: prototype tools quickly, then run comprehensive evaluation, because ergonomics and interface clarity matter more than people expect. That same logic applies to capabilities. A capability taxonomy without evaluation discipline will sprawl. (Anthropic)
2) Can declarative execution replace prompt pipelines
For some tasks, yes. In general, no.
It can replace prompt pipelines when the task is repeated, tool-heavy, policy-sensitive, side-effecting, or expensive to debug after the fact. In those cases prompt chains are usually informal workflows pretending to be prompts. Anthropic’s workflow/agent split, LangGraph’s durable-execution model, and Microsoft’s workflow/checkpointing model all point toward the same operational truth: once a process is multi-step and failure-prone, execution structure wants to become explicit. (Anthropic)
It does not replace prompting for novelty, underspecified goals, exploration, or strategy invention. That is where model-driven planning still belongs. So the durable architecture is hybrid:
- the agent or planner chooses or synthesizes a skill,
- ORCA executes the skill with state, policies, and traces,
- individual capabilities may still call LLMs or tools internally,
- and control returns upward when the workflow is underspecified.
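That division of labor can be sketched as a control loop. Everything here (the function names, the `Underspecified` signal) is hypothetical, not ORCA's actual API:

```python
class Underspecified(Exception):
    """Raised when a skill's workflow cannot proceed without new planning."""

def run_hybrid(goal, planner, runtime):
    """Planner above, governed execution in the middle, tool/model calls below."""
    while True:
        skill = planner.choose_skill(goal)   # model-driven planning layer
        if skill is None:
            return None                      # planner declines: nothing to execute
        try:
            # The runtime executes with typed state, policy gates, and traces;
            # individual capabilities may still call LLMs or tools internally.
            return runtime.execute(skill)
        except Underspecified as gap:
            # Control returns upward: the planner re-plans with what was learned.
            goal = planner.refine(goal, gap)
```

The key design choice is that the exception path is structural, not prompt-based: an underspecified workflow is a typed signal the planner can act on, not a free-text failure message.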
That is also where ORCA is stronger than a pure graph engine. LangGraph gives orchestration infrastructure. Microsoft Agent Framework gives workflow infrastructure. OpenAI’s Agents SDK gives tracing, tool use, handoffs, and guardrails. ORCA’s chance is to sit above those kinds of runtimes as the semantic execution layer that says what a step is, what policies apply, and how it composes. (LangChain Docs)
3) How this behaves in more complex real-world systems
This is where ORCA becomes more valuable and more fragile at the same time.
It becomes more valuable because real systems care about pause/resume, human approval, auditability, replay, vendor substitution, side-effect control, and observability. Your repo is already aiming at exactly those needs with checkpoint/restore, audit modes, OTel, metrics, SSE streaming, fallback chains, conformance levels, security docs, and governance stages. That is the right systems surface. (GitHub)
But it also becomes more fragile because the hard problem shifts from “how do I prompt this?” to “how do I govern the execution ontology?” In practice that means registry bloat, overlapping capabilities, naming drift, incompatible outputs, state inflation, and subtle retry/side-effect bugs. Recent public issue trackers in adjacent frameworks show the exact pressure points: LangGraph had a development/runtime persistence mismatch that lost state on restart; AutoGen GraphFlow failed when graph speaker selection interacted with tool-enabled agents; CrewAI had persistence failures because typed state was not JSON-serializable; and Pydantic users hit friction when trying to treat explicit workflows as ordinary agents for adapter/UI purposes. Those are not random bugs. They are symptoms of the structural difficulty of explicit execution systems. (GitHub)
For ORCA specifically, that means three rules should stay strict:
- State must stay small and boring. Typed, serializable, replayable.
- Side effects must be classified. Idempotent or not, retryable or not, approval-gated or not.
- A skill is not automatically an agent. Do not let UI or framework adapters collapse those abstractions back together.
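The first rule is easy to enforce mechanically: require every state object to survive a JSON round trip before it can be checkpointed. A sketch that assumes nothing about ORCA's actual CognitiveState schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CognitiveState:
    """Illustrative stand-in for a typed, replayable state record."""
    task_id: str
    step: int
    evidence_ids: tuple  # references, not payloads: keep state small and boring

def checkpoint(state: CognitiveState) -> str:
    """Serialize for checkpoint/restore; fails loudly on non-JSON-able fields."""
    return json.dumps(asdict(state))

def restore(blob: str) -> CognitiveState:
    d = json.loads(blob)
    return CognitiveState(d["task_id"], d["step"], tuple(d["evidence_ids"]))

s = CognitiveState("t-1", 3, ("ev-9", "ev-12"))
assert restore(checkpoint(s)) == s   # replayable: the round trip is lossless
```

A round-trip assertion like this, run in CI over every state type, is exactly the kind of check that would have caught the JSON-serializability failures seen in adjacent frameworks.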
If you keep those boundaries sharp, ORCA gets stronger as systems become more real. If you blur them, the runtime will become harder to operate than the prompt chains it is trying to replace.
Where your current public implementation is strongest
Your strongest current differentiator is not the DAG scheduler by itself. It is the combination of:
- binding abstraction across PythonCall, OpenAPI, MCP, and OpenRPC,
- deterministic baselines plus fallback chains,
- typed cognitive state aligned with CoALA,
- governance and lifecycle language,
- native tool definitions for Anthropic/OpenAI/Gemini,
- and audit/observability surfaces.
That combination makes ORCA look less like “another agent framework” and more like a runtime architecture with a serious control plane. The repo even states openly when not to use it: if you only need a quick prompt-based prototype or do not need control or safety. That honesty helps the project because it narrows the claim to the zone where the architecture is strongest. (GitHub)
I also think the deterministic-baseline-first approach is smart. The release notes and README make it clear that a large part of the surface can run without an LLM key, and the runtime supports deterministic terminal baselines in fallback chains. That is useful for CI, local development, regression testing, and debugging. Many agent projects skip that layer and then have no stable floor under evaluation. (GitHub)
Where I would tighten the project
The first thing I would tighten is positioning language. Right now the public materials support “runtime,” “execution layer,” “reference architecture,” and “standard proposal.” They do not yet support “emerging standard” in the stronger ecosystem sense. The release is very recent, the repo is still small publicly, and the field is moving fast. I would keep the ambition, but phrase it as a reference model or execution architecture rather than as a settled standard. (GitHub)
The second thing I would tighten is count clarity. The release notes describe a companion registry with 141 capabilities and 36 ready-to-use skills, while the runtime README emphasizes 122 shipped capabilities with deterministic baselines. Those can both be true, but the boundary between “runtime-supported executable set” and “registry total” should be obvious everywhere. Otherwise readers will read it as inconsistency. (GitHub)
The third thing I would tighten is evaluation methodology. Recent guidance from both Anthropic and OpenAI pushes toward evaluation loops, traces, graders, and workflow-level testing rather than only final-answer snapshots. ORCA is unusually well positioned for that because your runtime already exposes explicit state transitions and execution structure. So your paper should lean harder into process metrics, not just outcome metrics. (Anthropic)
What I would claim in the paper, and what I would not
I would make these claims strongly:
- ORCA improves inspectability because execution is explicit, not buried in prompt text.
- ORCA improves substitutability because capability contracts can target multiple bindings and fallback chains.
- ORCA improves policy control because approval, trust, side effects, and scope can be surfaced at the runtime layer.
- ORCA improves operational ergonomics for repeated, tool-heavy, governance-sensitive tasks. (GitHub)
I would be more careful with these claims:
- “declarative execution broadly outperforms prompt pipelines,”
- “ORCA improves reproducibility in general,”
- “this is a universal architecture for agents.”
Those claims need larger and more diverse empirical support than you currently show publicly. The more defensible framing is: ORCA is a better control substrate for the part of agent behavior that needs to be dependable, auditable, and composable.
The security angle is bigger than it looks
This is one of the places where your architecture can become more important than a typical workflow framework.
MCP’s current specification is explicit: tools represent arbitrary code execution; tool descriptions and annotations should be treated as untrusted unless they come from trusted servers; and hosts must obtain explicit user consent before invoking tools. Recent papers on the emerging agent-skills layer point in the same direction: the skills abstraction is growing quickly, but security and lifecycle governance are becoming central concerns rather than edge concerns. One recent survey frames skills as a major shift toward modular, on-demand capability extension and highlights a four-tier governance model; a recent security paper argues that the most severe threats come from structural issues such as weak data-instruction boundaries and persistent trust assumptions. That fits ORCA directly. (Model Context Protocol)
So for your case, governance should not be a side chapter. It should be one of the main contributions. Signed manifests, provenance metadata, side-effect classes, trust tiers, review states, and permission scoping are not “enterprise extras.” They are part of why a runtime layer is valuable at all.
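As one concrete illustration of why signing belongs at the runtime layer: a registry can refuse to load any capability manifest whose signature fails to verify. A minimal HMAC sketch; a real deployment would use asymmetric signatures, per-publisher keys, and rotation, and none of these names come from ORCA:

```python
import hashlib, hmac, json

SIGNING_KEY = b"registry-signing-key"  # illustrative only; never hard-code keys

def sign_manifest(manifest: dict, key: bytes = SIGNING_KEY) -> str:
    """Sign a canonical (sorted-key) JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_manifest(manifest: dict, signature: str, key: bytes = SIGNING_KEY) -> bool:
    """Constant-time comparison; any field edit invalidates the signature."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)

manifest = {"name": "retrieve_evidence", "trust_tier": "reviewed",
            "side_effects": "none"}
sig = sign_manifest(manifest)
assert verify_manifest(manifest, sig)
tampered = dict(manifest, trust_tier="core")  # provenance edit is detected
assert not verify_manifest(tampered, sig)
```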
My concrete recommendation for your next phase
If I were steering this project, I would do five things next.
1. Publish a capability admission rubric.
Every capability should justify itself by reuse, safety boundary, backend substitutability, eval ownership, and lifecycle owner. This is how you stop ontology sprawl before it starts.
2. Make execution metadata richer and more explicit.
Each capability should declare side effects, idempotency, retry semantics, confidentiality level, cacheability, and approval requirements.
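One way to make that declaration enforceable rather than documentary is to type it, so the scheduler can read it. A sketch in which every field name is an assumption about what ORCA might declare, not its actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class SideEffect(Enum):
    NONE = "none"              # pure read/compute
    IDEMPOTENT = "idempotent"  # externally visible but safe to retry
    EXTERNAL = "external"      # non-idempotent: needs dedup keys or approval

@dataclass(frozen=True)
class ExecutionMetadata:
    """Illustrative per-capability declaration; field names are assumptions."""
    side_effect: SideEffect
    max_retries: int
    confidentiality: str       # e.g. "public" | "internal" | "restricted"
    cacheable: bool
    requires_approval: bool

def retry_budget(meta: ExecutionMetadata) -> int:
    """Non-idempotent side effects must never be blindly retried."""
    return 0 if meta.side_effect is SideEffect.EXTERNAL else meta.max_retries
```

The point of `retry_budget` is that retry semantics become a consequence of declared metadata, not something each skill author re-decides.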
3. Benchmark process, not just answers.
Use traces and graders to score skill selection, routing correctness, policy-trigger correctness, retry behavior, human-gate behavior, and replay fidelity, not only task success. OpenAI’s agent-eval guidance is directly useful here. (OpenAI Developer)
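Process-level grading can be as simple as asserting over the ordered trace a run emits. A toy sketch; the event shapes and gate names are invented for illustration:

```python
def grade_trace(trace: list, expected_gates: set) -> dict:
    """Score a run on process behavior, not just its final answer."""
    fired = {e["gate"] for e in trace if e["kind"] == "policy_gate"}
    retries = sum(1 for e in trace if e["kind"] == "retry")
    return {
        "policy_recall": len(fired & expected_gates) / max(len(expected_gates), 1),
        "spurious_gates": sorted(fired - expected_gates),
        "retry_count": retries,
    }

trace = [
    {"kind": "capability", "name": "retrieve_evidence"},
    {"kind": "policy_gate", "gate": "pii_scan"},
    {"kind": "retry", "name": "retrieve_evidence"},
    {"kind": "capability", "name": "rank_candidates"},
]
score = grade_trace(trace, expected_gates={"pii_scan", "human_approval"})
print(score["policy_recall"])   # 0.5: one of two expected gates fired
```

Because ORCA already exposes explicit state transitions, graders like this can run against recorded traces in CI, which is precisely the process-metric story the paper should tell.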
4. Lean into the hybrid architecture story.
Do not imply that ORCA removes prompting. Show that it relocates prompting: planning above, semantic execution in the middle, tool/model calls below.
5. Demonstrate one hard real-world workflow.
Use something long-running, tool-rich, failure-prone, and policy-sensitive. The more the task needs checkpoints, approvals, substitutions, and replay, the more ORCA’s benefits will become legible.
My final view
For your case, the idea is real. The timing is good. The public implementation is already substantial enough to take seriously. The ecosystem is moving toward explicit workflows, durable execution, state, checkpoints, traces, and tighter tool boundaries, which is exactly the environment in which ORCA makes sense. (Anthropic)
The project will succeed or fail on one question:
Can you keep the abstraction disciplined enough that it stays more governable than prompt pipelines, instead of becoming a more complicated prompt pipeline with better names?
That is the real test.
My answer right now is: yes, potentially, if you keep capability granularity at the level of stable operational units, treat the system as hybrid rather than totalizing, and turn governance and evaluation into first-class contributions rather than supporting details.