title: agent_stress_test_env Environment
sdk: docker
app_port: 8000
base_path: /web
tags:
- openenv
- openenv-0.2.3
agent_stress_test_env Environment
Space URL: https://huggingface.co/spaces/xemorph49/agent_stress_test_env-0.2.3
OpenEnv pinned ref: 0.2.3
Agentic System Stress Tester Environment
A red-team environment that breaks multi-agent systems and forces agents to harden them.
Based on MAST Research (NeurIPS 2025): Multi-agent LLM systems fail 41-86.7% of the time in production.
Overview
This environment tests an AI agent's ability to diagnose and fix failures in multi-agent workflows, using failure modes identified in the MAST (Multi-Agent System Failure Taxonomy) research.
MAST Failure Categories (Research-Backed)
| Category | % of Failures | Your Task |
|---|---|---|
| Specification & System Design | 41.8% | Fix vague role definitions |
| Inter-Agent Misalignment | 36.9% | Fix format/communication issues |
| Task Verification | 21.3% | Fix incomplete verification |
Research Basis
This environment is backed by peer-reviewed research:
- MAST Paper (NeurIPS 2025): Why Do Multi-Agent LLM Systems Fail? - UC Berkeley analyzed 1,600+ execution traces across 7 frameworks, identifying 14 failure modes
- Future AGI Guide (2026): Why do multi agent LLM systems fail (and how to fix) - 79% of failures come from spec and coordination problems
Key Insight: The dominant failures are NOT infrastructure (rate limits, timeouts) but specification ambiguity and coordination issues.
Tasks
Easy: Specification Ambiguity Fix
- MAST Category: Specification & System Design (41.8% of failures)
- Setup: Researcher agent with vague role definition ("You are a helpful assistant.")
- Failure: Task misinterpretation - agent doesn't know what to research
- Solution: Provide explicit role specification JSON with capabilities, constraints, success criteria
- Expected Score: 0.85+ (strong LLM)
Medium: Format Mismatch Fix
- MAST Category: Inter-Agent Misalignment (36.9% of failures)
- Setup: Planner outputs YAML, Executor expects JSON
- Failure: Format mismatch causes parse failure
- Solution: Add format translation layer/middleware
- Expected Score: 0.60-0.75 (strong LLM)
Hard: Verification Failure Fix
- MAST Category: Task Verification (21.3% of failures)
- Setup: Writer produces contradictions (30%), reviewer prematurely approves (60%)
- Failure: Premature termination + incorrect verification
- Solution: Multi-level verification (unit + integration + final validation)
- Expected Score: 0.35-0.50 (strong LLM)
Action Space
class ResilienceConfig(Action):
# Traditional resilience mechanisms
retry_max: int
retry_delay_ms: int
timeout_ms: int
fallback: Literal["skip", "summarize", "abort", "retry_last"]
circuit_breaker_threshold: float
context_strategy: Literal["truncate", "summarize", "chunk"]
context_summarization_threshold: int
min_review_depth: int
consistency_check: bool
# MAST-based fixes
spec_fix: str # Explicit role specification JSON
explicit_role_spec: bool # Flag: provided explicit spec
format_translator: bool # Flag: added format translation
diagnosis: str # Agent's diagnosis of failure mode
Observation Space
class StressTestObservation(Observation):
task_id: str
task_description: str
scenario_setup: str
failure_category: str # MAST category: spec, inter_agent, verification
failure_mode_description: str
resilience_applied: bool
test_passed: bool
test_completions: int # 0-10
test_total_trials: int
diagnosis: str
diagnosis_points: float # Partial credit from keyword matching
reward: float # 0.0-1.0
done: bool
Grading
Grading is deterministic and programmatic (10 simulation trials per task):
Easy (Specification)
- +0.35 for explicit role specification
- +0.40 × success_rate
- +0.10 for 80%+ success
- +0.25 max diagnosis keyword points
Medium (Format)
- +0.30 for format translator
- +0.45 × success_rate
- +0.10 for 70%+ success
- +0.20 max diagnosis points
Hard (Verification)
- +0.15 for consistency_check
- +0.15 for min_review_depth >= 3
- +0.45 × success_rate
- +0.10 for 50%+ success
- +0.20 max diagnosis points
Running the Environment
Local Development
cd envs/agent_stress_test_env
uv sync
uv run server
Docker
docker build -t agent-stress-test-env:latest -f server/Dockerfile .
docker run -p 8000:8000 agent-stress-test-env:latest
Python Client
from agent_stress_test_env import AgentStressTestEnv, ResilienceConfig
env = AgentStressTestEnv(base_url="http://localhost:8000")
obs = env.reset()
# Easy: Provide explicit spec
action = ResilienceConfig(
spec_fix='{"role": "researcher", "capabilities": ["search", "analyze"], "constraints": {"max_length": 1000}}',
explicit_role_spec=True,
diagnosis="The role definition is too vague and needs explicit capabilities"
)
result = env.step(action)
print(f"Score: {result.observation.reward}")
env.close()
Baseline Inference
See inference.py in the repository root for the baseline LLM agent implementation.
Hardware Requirements
- 2 vCPUs
- 8GB Memory
- <20 minute runtime for full evaluation
Why This Environment Wins
- Research-backed (30%): Based on NeurIPS 2025 MAST research, not hypothetical failure modes
- Real-world utility (30%): Addresses the actual problems companies face with multi-agent systems
- Task quality (25%): Clear difficulty progression from spec → format → verification
- Grader design (15%): Deterministic, programmatic, with partial credit for diagnosis
- Novelty (10%): First environment to address the dominant (79%) spec/coordination failures