AI Agent Orchestration
We are no longer building chatbots. We are building agents - systems that perceive, reason, plan, act, and collaborate. The shift from a single prompt-response loop to a compound, multi-agent architecture is one of the most significant changes in applied AI in the last decade. Yet, most developers underestimate just how deep this rabbit hole goes.
This article is a first-principles breakdown of AI agent orchestration - what it is, how it is architected, why each component exists, and how it all fits together in production. Whether you are building a personal assistant, an autonomous coding agent, or a fleet of specialized AI workers, the concepts here will form the bedrock of your understanding.
1. What Is an AI Agent, Really?
Before diving into orchestration, we need to settle on a precise definition. An AI agent is a software system that:
- Perceives its environment (text, files, APIs, databases, sensor data)
- Reasons about what to do next (using an LLM as its reasoning core)
- Acts by invoking tools, writing code, or calling external services
- Reflects on the outcome of its actions
- Persists state across multiple steps and sessions
The key distinction from a basic LLM call is the feedback loop. A chatbot takes input and returns output. An agent takes input, takes an action, observes the result, and decides what to do next - sometimes hundreds of times before completing a task.
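In pseudocode-ish Python, that control flow looks like the sketch below - a stubbed decision function stands in for the LLM, and every name here is illustrative, not a real framework API:

```python
# Minimal agent loop sketch: a stubbed "model" decides between
# calling a tool and finishing. All names are illustrative.

def decide_next_action(task: str, observations: list) -> dict:
    # Toy policy standing in for an LLM call: look something up once, then answer.
    if not observations:
        return {"action": "lookup", "input": task}
    return {"action": "finish", "answer": f"Done after {len(observations)} tool call(s)"}

def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    observations = []  # accumulating context fed back to the model
    for _ in range(max_steps):
        decision = decide_next_action(task, observations)  # Reason
        if decision["action"] == "finish":
            return decision["answer"]
        result = tools[decision["action"]](decision["input"])  # Act
        observations.append((decision["action"], result))      # Observe
    return "max steps reached"

tools = {"lookup": lambda q: f"search results for {q!r}"}
answer = run_agent("find deployment docs", tools)
```

The loop terminates either when the model decides it is done or when a step budget runs out - a guard real systems also need.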
This is often called the ReAct loop (Reason → Act → Observe), and it is the foundational execution pattern of almost all modern agents. The diagram below shows how each phase feeds back into the next:
```
┌─────────────────────────────────────┐
│             AGENT LOOP              │
│                                     │
│  Perceive → Reason → Act → Observe  │
│      ↑__________________________|   │
│                                     │
│   (Repeat until task is complete)   │
└─────────────────────────────────────┘
```
2. Why Orchestration? The Problem with Single-Agent Systems
A single agent can do a lot. But single-agent systems hit hard limits when:
- Context windows overflow. Long tasks generate massive amounts of intermediate state that can’t all fit in one context window.
- Specialization is needed. A general-purpose agent performing both security auditing and UI design will be mediocre at both. Specialized agents are far more effective.
- Parallelism is required. Some tasks have sub-tasks that can be run concurrently. A single agent is inherently sequential.
- Error containment matters. When one component of a pipeline fails, you want to isolate the failure rather than corrupting the entire task.
- Human oversight is essential. Complex workflows need checkpoints where humans can review, redirect, or override agent decisions.
Orchestration is the answer to all of these. It is the layer that coordinates multiple agents, routes tasks between them, manages shared resources (like memory and tools), and ensures the overall system makes forward progress toward a goal.
3. The Anatomy of an Orchestration Framework
A well-designed orchestration framework typically maps to the following directory structure (a pattern used across many production-grade agent systems):
```
.agents/
├── config/      # DNA of the agents
├── docs/        # Knowledge base for RAG
├── logs/        # Execution traces and debugging
├── memory/      # Short-term and long-term state
├── prompts/     # System instructions and templates
├── skills/      # Reusable higher-order capabilities
├── tools/       # External integrations and action code
└── workflows/   # Multi-agent coordination logic
```

Let’s go through each layer in depth.
4. config/ - The Agent’s Identity and Constraints
The config folder is not just a place to store API keys. It is the policy layer of your agent. It answers: Who is this agent, and what are the rules it operates under?
4.1 Model Selection
Different tasks call for different model tiers. A production config might look like:
```yaml
model:
  primary: "gemini-1.5-pro"        # Used for complex reasoning tasks
  secondary: "gemini-1.5-flash"    # Used for fast, repetitive subtasks
  embedding: "text-embedding-004"  # Used for vector search

inference:
  temperature: 0.2       # Low temperature for deterministic tasks
  top_p: 0.95
  max_output_tokens: 8192
  timeout_seconds: 30
```

Choosing the right model for each sub-task is critical for both cost and performance. Routing a simple classification task through a frontier model is expensive and slow. Routing a complex multi-step reasoning task through a small model leads to degraded output.
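Acting on that config, a thin routing layer can send each sub-task to the right tier. The sketch below is illustrative only - `estimate_complexity` is a stand-in heuristic, not a real library call:

```python
# Hypothetical model router: pick a tier based on a task-complexity score.
MODEL_TIERS = {
    "primary": "gemini-1.5-pro",      # complex reasoning
    "secondary": "gemini-1.5-flash",  # fast, repetitive subtasks
}

def estimate_complexity(task: str) -> float:
    # Toy heuristic: long or multi-step task descriptions score higher.
    score = min(len(task) / 200, 1.0)
    if any(word in task.lower() for word in ("plan", "multi-step", "debug")):
        score = max(score, 0.8)
    return score

def pick_model(task: str, threshold: float = 0.5) -> str:
    tier = "primary" if estimate_complexity(task) >= threshold else "secondary"
    return MODEL_TIERS[tier]
```

In production the heuristic would itself often be a cheap classifier, but the routing shape stays the same.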
4.2 Resource Limits and Safety Guardrails
Config also enforces agent boundaries - what the agent is and is not allowed to do:
```yaml
safety:
  max_iterations: 50            # Prevent infinite loops
  max_tool_calls_per_run: 200
  allowed_file_extensions: [".py", ".ts", ".json", ".md"]
  disallowed_domains: ["internal-hr.company.com"]
  require_human_approval_for:
    - "file_deletion"
    - "external_api_post"
    - "database_write"
```

These constraints are what separate a research prototype from a production system. Without them, agents can enter runaway loops, incur massive costs, or take destructive actions.
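Enforcing these limits at runtime needs only a small counter object checked on every loop iteration and tool call - a sketch with hypothetical names:

```python
# Sketch: runtime enforcement of the safety limits above. Names are illustrative.
class SafetyLimits:
    def __init__(self, max_iterations: int = 50, max_tool_calls: int = 200):
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls
        self.iterations = 0
        self.tool_calls = 0

    def check_iteration(self) -> None:
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("max_iterations exceeded: possible runaway loop")

    def check_tool_call(self, tool_name: str, needs_approval: set) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("max_tool_calls_per_run exceeded")
        if tool_name in needs_approval:
            raise RuntimeError(f"{tool_name} requires human approval")

limits = SafetyLimits()
limits.check_iteration()
limits.check_iteration()
limits.check_tool_call("read_file", needs_approval={"file_deletion"})
```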
4.3 Environment-Specific Overrides
A good config system supports environment overlays:
```
config/
├── base.yaml         # Shared defaults
├── development.yaml  # Local dev overrides (verbose logging, mock tools)
├── staging.yaml      # Pre-prod (real tools, rate-limited)
└── production.yaml   # Full production config
```
5. docs/ - Retrieval-Augmented Generation (RAG) Knowledge Base
The docs folder is where your agent’s domain knowledge lives. Instead of fine-tuning a model on your proprietary data (expensive, slow, and hard to update), RAG lets you inject relevant knowledge dynamically at query time.
5.1 How RAG Works in an Agent Context
```
User Query
    │
    ▼
Embed Query (Vector)
    │
    ▼
Search docs/ index for top-K relevant chunks
    │
    ▼
Inject chunks into LLM context as grounding documents
    │
    ▼
LLM generates response grounded in your docs
```

This is powerful because:
- Your knowledge base is updatable - just add or remove documents
- The agent cites sources, making responses auditable
- You stay in control of what the agent knows
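The whole retrieval step can be sketched in a few lines, with word-overlap scoring standing in for a real embedding model (every name here is illustrative):

```python
# Toy RAG retrieval: word-overlap scoring stands in for vector similarity.
DOCS = [
    "The deployment freeze window runs Friday afternoon to Monday morning.",
    "All database credentials rotate every 90 days.",
    "The on-call engineer owns incident response.",
]

def tokens(text: str) -> set:
    # Normalize case and strip trailing punctuation.
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(query: str, docs: list, top_k: int = 1) -> list:
    # Rank docs by overlap with the query; a real system ranks by cosine similarity.
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:top_k]

def build_prompt(query: str, docs: list) -> str:
    grounding = "\n".join(retrieve(query, docs))
    return f"Context:\n{grounding}\n\nQuestion: {query}"

prompt = build_prompt("when is the deployment freeze window?", DOCS)
```

Swap the overlap score for an embedding model and the list for a vector index, and this is the production shape.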
5.2 What Goes in docs/
```
docs/
├── policies/
│   ├── deployment_policy.pdf
│   └── security_standards.md
├── runbooks/
│   ├── incident_response.md
│   └── oncall_procedures.md
├── architecture/
│   ├── system_diagram.png
│   └── api_contracts.yaml
└── faqs/
    └── product_faq.md
```
5.3 Chunking Strategy Matters
How you split documents dramatically affects retrieval quality:
| Strategy | Best For | Risk |
|---|---|---|
| Fixed-size chunks (512 tokens) | General documents | Splits context mid-sentence |
| Recursive character splitting | Code, mixed content | May lose semantic boundaries |
| Semantic chunking | Dense technical docs | Computationally expensive |
| Document-level chunks | Short docs, FAQs | Poor precision in long docs |
A naive chunking strategy is one of the most common causes of poor RAG performance. Invest time here.
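As a baseline, fixed-size chunking with overlap is simple to implement. This sketch counts words rather than tokens for simplicity - a real pipeline would use the embedding model's tokenizer:

```python
# Fixed-size chunking with overlap, word-based for simplicity.
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each chunk re-includes the previous tail
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already covers the end of the text
    return chunks

doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
```

The overlap is what protects against splitting a fact across a chunk boundary - the risk the table above flags for fixed-size chunks.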
5.4 Metadata Filtering
Good RAG systems attach metadata to each chunk for filtered retrieval:
```json
{
  "content": "The deployment freeze window is every Friday from 4PM–Monday 9AM...",
  "metadata": {
    "source": "deployment_policy.pdf",
    "category": "policy",
    "last_updated": "2024-12-01",
    "audience": "engineering",
    "version": "v3.2"
  }
}
```

This lets agents ask: “Find me only the security policies updated in the last 6 months.”
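Against an in-memory chunk list, that query is a simple predicate; real vector databases expose equivalent filter syntax alongside the similarity search. A sketch with made-up chunks:

```python
# In-memory stand-in for metadata-filtered retrieval.
chunks = [
    {"content": "Deployment freeze windows...",
     "metadata": {"category": "policy", "last_updated": "2024-12-01"}},
    {"content": "Password rotation rules...",
     "metadata": {"category": "security", "last_updated": "2024-11-10"}},
    {"content": "Legacy VPN setup...",
     "metadata": {"category": "security", "last_updated": "2023-01-15"}},
]

def filter_chunks(chunks: list, category: str, updated_after: str) -> list:
    return [
        c for c in chunks
        if c["metadata"]["category"] == category
        and c["metadata"]["last_updated"] >= updated_after  # ISO dates sort lexically
    ]

recent_security = filter_chunks(chunks, "security", "2024-06-01")
```

In production the filter runs inside the vector store, pruning candidates before (or alongside) the similarity ranking.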
6. logs/ - The Black Box for Debugging and Observability
Agent debugging is notoriously hard. Unlike traditional software where you can step through a debugger, agent behavior is emergent - it depends on the model’s internal states, tool outputs, and the cumulative context window. Logs are your only window into what happened.
6.1 What to Log
A production agent log entry looks like this:
```json
{
  "run_id": "run_20241215_143022_abc123",
  "agent_id": "security-auditor-v2",
  "timestamp": "2024-12-15T14:30:22.441Z",
  "step": 7,
  "type": "tool_call",
  "tool_name": "search_codebase",
  "input": {
    "query": "SQL query construction",
    "file_types": [".py"],
    "max_results": 20
  },
  "output": {
    "files_found": 14,
    "top_result": "src/database/query_builder.py:L45"
  },
  "latency_ms": 1243,
  "token_usage": { "prompt": 8421, "completion": 312, "total": 8733 },
  "chain_of_thought": "I need to check how SQL queries are constructed to identify potential injection vectors. I'll search for places where strings are concatenated into queries."
}
```
6.2 Structured Logging for Replay
The most valuable feature of agent logs is replay capability. If an agent fails midway through a 2-hour task, you want to:
- Read the logs to understand what happened
- Fix the bug
- Resume from the last successful checkpoint - not restart from scratch
This requires logging not just what the agent did, but also the full context state at each step.
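A minimal version of checkpoint-and-resume: write each completed step to a JSONL log, and on the next run skip any step already recorded (illustrative, not a specific framework):

```python
import json
import os
import tempfile

# Sketch: each completed step is appended to a JSONL checkpoint log.
# On restart, steps already in the log are skipped instead of re-run.
def run_with_checkpoints(steps, log_path: str) -> dict:
    done = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                entry = json.loads(line)
                done[entry["step"]] = entry["result"]

    state = dict(done)
    with open(log_path, "a") as f:
        for name, fn in steps:
            if name in done:
                continue  # replay: this step already succeeded
            state[name] = fn(state)
            f.write(json.dumps({"step": name, "result": state[name]}) + "\n")
    return state

log_path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
calls = []
steps = [
    ("fetch", lambda state: calls.append("fetch") or "data"),
    ("analyze", lambda state: calls.append("analyze") or "report"),
]
first = run_with_checkpoints(steps, log_path)
second = run_with_checkpoints(steps, log_path)  # resumes: nothing re-runs
```

The same pattern scales up: replace the JSONL file with your log store, and the step results with serialized context state.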
6.3 Observability Stack
For production systems, logs feed into an observability pipeline. The following shows a typical stack from raw agent logs through to dashboards and alerting:
```
Agent Logs (JSON)
      │
      ▼
Log Aggregator (Fluentd / Logstash)
      │
      ▼
Storage (BigQuery / Elasticsearch)
      │
      ├──► Dashboards (Grafana / Looker)
      ├──► Alerts (PagerDuty)
      └──► Trace Viewer (Langfuse / Arize / Phoenix)
```

Tools like Langfuse, Arize Phoenix, and Weights & Biases Weave are purpose-built for LLM observability and are worth evaluating.
7. memory/ - State, Context, and the Illusion of Continuity
Memory is what separates an agent that can hold a conversation from one that can manage a long-running project. It comes in multiple forms.
7.1 The Four Types of Memory
Section titled “7.1 The Four Types of Memory”| Type | Scope | Storage | Example |
|---|---|---|---|
| In-context (working) | Current conversation | LLM context window | “Earlier you said the budget is $50k” |
| Episodic (short-term) | Recent sessions | Redis / in-memory DB | Past 10 conversation summaries |
| Semantic (long-term) | Persistent facts | Vector DB | “User prefers TypeScript over JavaScript” |
| Procedural | Skills/behaviors | Prompt templates | “When debugging, always check logs first” |
7.2 The Memory Architecture
```
.agents/memory/
├── episodic/
│   ├── session_20241215.jsonl   # Full session transcript
│   └── session_20241214.jsonl
├── semantic/
│   ├── user_preferences.json    # Extracted facts about the user
│   ├── project_context.json     # Key project facts
│   └── entity_graph.json        # Relationships between concepts
└── working/
    └── current_task_state.json  # Active task scratchpad
```
7.3 Memory Consolidation
Raw session logs are too verbose to fit in future context windows. A good memory system runs a consolidation process - periodically summarizing and extracting structured facts from raw logs:
```python
import json

# Simplified memory consolidation
async def consolidate_session(session_log: list[dict]) -> dict:
    summary_prompt = f"""
    Review these agent actions and extract:
    1. Key decisions made
    2. Important facts learned about the user/project
    3. Unresolved tasks or open questions
    4. Errors encountered and how they were resolved

    Session log:
    {json.dumps(session_log, indent=2)}
    """

    summary = await llm.generate(summary_prompt)
    return parse_structured_summary(summary)
```
7.4 Semantic Memory with Vector Search
For long-term memory retrieval, a vector database (Pinecone, Weaviate, pgvector) lets agents ask: “What do I know about this topic?”
```python
# Storing a memory
memory_text = "User's production database is PostgreSQL 15.2 running on AWS RDS"
embedding = embed(memory_text)
vector_db.upsert(
    id="mem_proj_db_001",
    vector=embedding,
    metadata={"category": "infrastructure", "timestamp": now()}
)

# Retrieving relevant memories
query_embedding = embed("What database are we using?")
relevant_memories = vector_db.query(query_embedding, top_k=5)
```
Every memory retrieved must be injected into the context window - which is finite. This creates a fundamental tension: more memory = better context, but also higher cost and token pressure. The solution is a tiered injection strategy:
- Always inject: current task state, user identity, core preferences
- Query-time inject: semantically relevant episodic memories
- Never inject: raw unprocessed logs, irrelevant historical sessions
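A tiered injection pass can be sketched as a priority-ordered fill under a token budget (`estimate_tokens` here is a crude word count, not a real tokenizer):

```python
# Sketch: assemble the context window from memory tiers under a token budget.
def estimate_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def build_context(always_inject: list, retrieved: list, budget: int) -> list:
    context, used = [], 0
    for item in always_inject:     # tier 1: always injected
        context.append(item)
        used += estimate_tokens(item)
    for item in retrieved:         # tier 2: inject while the budget allows
        cost = estimate_tokens(item)
        if used + cost > budget:
            break                  # tier 3 (raw logs) never reaches this stage at all
        context.append(item)
        used += cost
    return context

always_inject = ["current task: review the deployment runbook", "user prefers TypeScript"]
retrieved = ["memory: " + "x " * 50, "memory: " + "y " * 50]
ctx = build_context(always_inject, retrieved, budget=70)
```

Retrieved memories should already arrive sorted by relevance, so truncating at the budget drops the least useful ones first.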
8. prompts/ - The System Instructions Layer
If config is the agent’s policy, prompts are its character. This is where you define who the agent is, how it thinks, and what it prioritizes.
8.1 Prompt Engineering for Agents (vs. Chatbots)
Agent prompts are structurally different from chatbot prompts:
```markdown
# System Prompt: Senior Security Auditor Agent

## Identity
You are a senior application security engineer with 15 years of experience.
Your primary responsibility is identifying security vulnerabilities in codebases.
You are meticulous, methodical, and never skip steps in your analysis.

## Reasoning Protocol
When analyzing code for vulnerabilities, always:
1. First understand the data flow - where does user input enter the system?
2. Trace all paths that untrusted input can take
3. Check for sanitization and validation at each boundary
4. Consider second-order effects (e.g., stored XSS)

## Tool Use Guidelines
- Use `search_codebase` before making claims about what code does
- Always verify line numbers before reporting findings
- Never report a vulnerability without a proof-of-concept or code reference

## Output Format
For each finding, use this exact structure:
- **Severity**: [Critical/High/Medium/Low/Informational]
- **CWE**: [CWE identifier]
- **Location**: [File:Line]
- **Description**: [What the vulnerability is]
- **Proof of Concept**: [How it could be exploited]
- **Remediation**: [How to fix it]

## Constraints
- Never modify files unless explicitly asked
- If you find a critical vulnerability involving PII, immediately halt and alert the human
- Do not make assumptions about business logic - ask for clarification
```
8.2 Prompt Templates and Jinja2
For dynamic prompts, templating engines let you inject context at runtime:
```jinja
{# prompts/code_reviewer.j2 #}
You are reviewing a pull request for the {{ project_name }} project.

**PR Context:**
- Author: {{ pr_author }}
- Branch: {{ pr_branch }}
- Files changed: {{ files_changed | length }}
- Description: {{ pr_description }}

**Reviewer Guidelines for {{ project_name }}:**
{% for guideline in project_guidelines %}
- {{ guideline }}
{% endfor %}

**Your task:**
Review the following diff and provide structured feedback.
Focus on: {{ review_focus | join(", ") }}
```
8.3 Prompt Versioning
Prompts are code. They should be versioned, tested, and reviewed:
```
prompts/
├── security_auditor/
│   ├── v1.0.0.md             # Initial version
│   ├── v1.1.0.md             # Added CWE classification requirement
│   ├── v2.0.0.md             # Major rewrite after eval regression
│   └── current -> v2.0.0.md  # Symlink to active version
└── changelog.md
```
9. skills/ - Reusable Higher-Order Capabilities
Skills sit between raw tools (specific API calls) and full agent workflows (multi-step coordination). They are composable capabilities that any agent can use.
9.1 Skills vs. Tools
Section titled “9.1 Skills vs. Tools”| Aspect | Tool | Skill |
|---|---|---|
| Abstraction level | Low (atomic action) | High (multi-step capability) |
| Example | `read_file("app.py")` | `analyze_codebase_security()` |
| Calls LLM? | No | Often yes |
| Uses other tools? | No | Yes |
| State? | Stateless | May maintain intermediate state |
9.2 Example: A summarize_and_extract Skill
```python
import json

async def summarize_and_extract(
    document: str,
    extract_schema: dict,
    max_summary_tokens: int = 500
) -> dict:
    """
    High-level skill that summarizes a document AND extracts
    structured data from it in a single LLM call.

    More efficient than two separate calls.
    """
    prompt = f"""
    Analyze this document and do two things:

    1. Write a {max_summary_tokens}-token summary
    2. Extract the following fields as JSON:
    {json.dumps(extract_schema, indent=2)}

    Document:
    {document}

    Respond with valid JSON matching this schema:
    {{
      "summary": "...",
      "extracted": {{ ... }}
    }}
    """

    result = await llm.generate(prompt, response_format="json")
    return json.loads(result)
```
9.3 Skills as Composable Blocks
A key design principle: skills should be composable. A `generate_report` skill might call `summarize_and_extract`, which calls `read_file`, which calls the filesystem tool.
This layered composition is what makes complex agent behavior manageable. Each layer only needs to understand the interface one level below it.
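That layering can be sketched with stubs - every function below is hypothetical, and the “LLM call” is faked so only the structure is visible:

```python
# Sketch of layered composition: tool -> skill -> higher-level skill.
FAKE_FS = {"design.md": "The system uses a supervisor agent to route tasks."}

def read_file_tool(path: str) -> str:              # tool layer: atomic action
    return FAKE_FS[path]

def summarize_and_extract(document: str) -> dict:  # skill layer
    # Stand-in for a single LLM call that summarizes and extracts fields.
    return {"summary": document[:30], "word_count": len(document.split())}

def generate_report(paths: list) -> str:           # higher-level skill
    sections = [summarize_and_extract(read_file_tool(p)) for p in paths]
    return "\n".join(f"- {s['summary']} ({s['word_count']} words)" for s in sections)

report = generate_report(["design.md"])
```

`generate_report` never touches the filesystem directly; it only knows the skill interface one level below it.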
10. tools/ - The Agent’s Hands in the World
Tools are how agents take action. They are the bridge between the language model’s reasoning and the real world. If skills are capabilities, tools are actions.
10.1 The Tool Interface Contract
Every tool must follow a strict interface. In most frameworks, this looks like:
```python
from pydantic import BaseModel, Field

class SearchCodebaseInput(BaseModel):
    query: str = Field(description="Natural language search query")
    file_types: list[str] = Field(
        default=[],
        description="Filter by file extensions (e.g., ['.py', '.ts'])"
    )
    max_results: int = Field(default=10, ge=1, le=100)

class SearchCodebaseOutput(BaseModel):
    results: list[dict]
    total_found: int
    truncated: bool

async def search_codebase(input: SearchCodebaseInput) -> SearchCodebaseOutput:
    """
    Search the codebase using semantic search.
    Use this when you need to find code related to a specific concept.
    """
    # ... implementation
```

The docstring is critical - it’s what the LLM reads to decide when and how to use the tool.
10.2 Tool Categories
```
tools/
├── filesystem/
│   ├── read_file.py
│   ├── write_file.py
│   ├── search_codebase.py
│   └── list_directory.py
├── web/
│   ├── web_search.py
│   ├── fetch_url.py
│   └── scrape_page.py
├── execution/
│   ├── run_python.py
│   ├── run_bash.py
│   └── run_tests.py
├── communication/
│   ├── send_slack_message.py
│   ├── create_github_pr.py
│   └── send_email.py
└── data/
    ├── query_database.py
    ├── search_vector_db.py
    └── call_api.py
```
10.3 Tool Safety and Sandboxing
Some tools are inherently dangerous. The `run_bash` tool, for example, can delete files, exfiltrate data, or crash a server. Production systems enforce multiple safety layers:
```python
import asyncio
import re

class RunBashTool:
    BLOCKED_PATTERNS = [
        r"rm\s+-rf\s+/",            # Recursive delete from root
        r">\s*/dev/sd[a-z]",        # Writing to raw disk devices
        r"curl.*\|\s*bash",         # Pipe-from-internet-to-bash attacks
        r"base64\s+-d.*\|\s*bash",  # Base64 obfuscated execution
    ]

    async def execute(self, command: str) -> str:
        # Static analysis
        for pattern in self.BLOCKED_PATTERNS:
            if re.search(pattern, command):
                raise ToolSafetyError(f"Blocked pattern detected: {pattern}")

        # Runtime sandboxing
        process = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            cwd=self.sandbox_dir,              # Restricted working directory
            env=self.sanitized_env,            # Clean environment variables
            preexec_fn=self._drop_privileges,  # Minimal OS permissions
        )

        stdout, _ = await asyncio.wait_for(process.communicate(), timeout=30)
        return stdout.decode()
```
Every tool call should emit structured telemetry:
```python
@tool_call_traced  # Decorator that logs input, output, latency, cost
async def query_database(sql: str, database: str) -> dict:
    ...
```
11. workflows/ - Multi-Agent Orchestration
This is where everything comes together. Workflows define how multiple agents collaborate to achieve complex goals.
11.1 Orchestration Patterns
There are several well-established patterns for multi-agent coordination:
Pattern 1: Pipeline (Sequential)
Each agent’s output becomes the next agent’s input. Simple, predictable, and easy to debug:
```
Input → Agent A → Agent B → Agent C → Output
```

Best for: document processing pipelines, data transformation chains.
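In code, a sequential pipeline reduces to function composition over a shared state dict (a sketch with toy stages standing in for agents):

```python
# Sketch: a sequential pipeline as function composition over shared state.
def extract(state: dict) -> dict:
    return {**state, "text": state["raw"].strip()}

def classify(state: dict) -> dict:
    label = "question" if state["text"].endswith("?") else "statement"
    return {**state, "label": label}

def respond(state: dict) -> dict:
    return {**state, "output": f"[{state['label']}] {state['text']}"}

def run_pipeline(stages: list, state: dict) -> dict:
    for stage in stages:  # each agent's output feeds the next
        state = stage(state)
    return state

result = run_pipeline([extract, classify, respond], {"raw": "  Is it deployed?  "})
```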
Pattern 2: Supervisor / Worker
A central supervisor LLM decomposes the goal and assigns tasks to specialized worker agents:
```
      ┌─────────────────────┐
      │     SUPERVISOR      │
      │ (Orchestrator LLM)  │
      └──────────┬──────────┘
                 │ assigns tasks
      ┌──────────▼──────────┐
      │                     │
 ┌────▼────┐           ┌────▼────┐
 │ Worker  │           │ Worker  │
 │ Agent A │           │ Agent B │
 └─────────┘           └─────────┘
```

The supervisor LLM receives the high-level goal, decomposes it into tasks, and assigns them to specialized worker agents. Workers report back; the supervisor integrates results and assigns next steps.
Best for: complex multi-faceted tasks requiring coordination (e.g., “Build and deploy a new feature”).
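A toy version of the pattern, with a rule-based `decompose` standing in for the supervisor LLM and plain functions as workers:

```python
# Toy supervisor/worker loop; every name here is illustrative.
WORKERS = {
    "research": lambda task: f"notes on {task}",
    "write": lambda task: f"draft covering {task}",
}

def decompose(goal: str) -> list:
    # Stand-in for the supervisor LLM: emit (worker, task) assignments.
    return [("research", goal), ("write", goal)]

def run_supervisor(goal: str) -> dict:
    results = {}
    for worker, task in decompose(goal):
        results[worker] = WORKERS[worker](task)  # dispatch, then collect
    return results

results = run_supervisor("feature launch plan")
```

A real supervisor would loop - re-planning after each batch of worker reports - but the dispatch-and-collect core is the same.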
Pattern 3: Debate / Multi-Critic
Independent critic agents review a draft; a synthesizer integrates their feedback into a final output:
```
     ┌──────────────────────────────┐
     │         Draft Agent          │
     │ (Generates initial response) │
     └───────────────┬──────────────┘
                     │
          ┌──────────▼──────────┐
          │                     │
     ┌────▼────┐           ┌────▼────┐
     │ Critic  │           │ Critic  │
     │ Agent A │           │ Agent B │
     └────┬────┘           └────┬────┘
          │                     │
          └──────────┬──────────┘
                     │ synthesized critiques
                     ▼
              ┌─────────────┐
              │ Synthesizer │
              │    Agent    │
              └─────────────┘
```

Best for: high-stakes decisions, content generation, code review.
Pattern 4: Map-Reduce
A large task is split into independent chunks processed in parallel, then merged by a reducer agent:
```
              Input
                │
     ┌──────────┼──────────┐
     │          │          │
┌────▼───┐ ┌────▼───┐ ┌────▼───┐
│ Mapper │ │ Mapper │ │ Mapper │  (parallel)
│   A    │ │   B    │ │   C    │
└────┬───┘ └────┬───┘ └────┬───┘
     │          │          │
     └──────────┼──────────┘
                │
         ┌──────▼──────┐
         │   Reducer   │
         │    Agent    │
         └─────────────┘
```

Best for: analyzing large corpora, parallel research tasks.
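The fan-out/fan-in shape maps directly onto `asyncio.gather` (a sketch; the per-chunk “agent” is faked with a word count):

```python
import asyncio

# Sketch: map-reduce with concurrent mappers via asyncio.gather.
async def mapper(chunk: str) -> int:
    # Stand-in for a per-chunk agent (e.g. "count findings in this chunk").
    await asyncio.sleep(0)  # yield point; real mappers would await LLM calls
    return len(chunk.split())

def reducer(partials: list) -> int:
    return sum(partials)    # merge step

async def map_reduce(chunks: list) -> int:
    partials = await asyncio.gather(*(mapper(c) for c in chunks))
    return reducer(list(partials))

total = asyncio.run(map_reduce(["one two", "three four five", "six"]))
```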
11.2 Workflow Definition Example
Modern frameworks like LangGraph, CrewAI, and AutoGen use graph-based workflow definitions:
```python
from langgraph.graph import StateGraph, END
from agents import SecurityAuditor, PerformanceAnalyzer, StyleChecker, Synthesizer

def build_code_review_workflow():
    graph = StateGraph(CodeReviewState)

    # Add agent nodes
    graph.add_node("security", SecurityAuditor().run)
    graph.add_node("performance", PerformanceAnalyzer().run)
    graph.add_node("style", StyleChecker().run)
    graph.add_node("synthesize", Synthesizer().run)
    graph.add_node("human_review", request_human_review)  # HITL pause point

    # Parallel fan-out from start
    graph.add_edge("__start__", "security")
    graph.add_edge("__start__", "performance")
    graph.add_edge("__start__", "style")

    # Fan-in to synthesizer
    graph.add_edge("security", "synthesize")
    graph.add_edge("performance", "synthesize")
    graph.add_edge("style", "synthesize")

    # Conditional: if critical issue found, require human review
    graph.add_conditional_edges(
        "synthesize",
        route_by_severity,
        {
            "critical": "human_review",
            "normal": END
        }
    )

    return graph.compile()
```
11.3 Inter-Agent Communication
Agents communicate through a shared state object - a structured dictionary that all agents can read from and write to:
```python
from datetime import datetime
from typing import Literal, Optional, TypedDict

class CodeReviewState(TypedDict):
    # Input
    pr_diff: str
    pr_description: str

    # Intermediate results (filled by worker agents)
    security_findings: list[Finding]
    performance_findings: list[Finding]
    style_findings: list[Finding]

    # Final output (filled by synthesizer)
    final_report: str
    overall_severity: Literal["critical", "high", "medium", "low"]

    # Metadata
    run_id: str
    started_at: datetime
    completed_at: Optional[datetime]
```
12. Cross-Cutting Concerns: What the Directory Structure Doesn’t Show
Beyond the directory structure itself, production agent systems require several architectural concerns that cut across all layers. These are not tied to any single folder - they must be considered holistically when designing the overall system.
12.1 Idempotency and Resumability
Long-running agents will fail. The question is not if but when. Design every workflow step to be idempotent (safe to retry) and checkpoint state frequently so failed runs can resume:
```python
@checkpoint_on_success(state_key="step_3_complete")
async def analyze_security(code: str, state: RunState) -> list[Finding]:
    if state.get("step_3_complete"):
        return state["security_findings"]  # Already done, skip

    findings = await security_auditor.analyze(code)
    return findings
```
Agents can generate enormous numbers of LLM and API calls. Without rate limiting, you will hit provider limits, generate massive bills, and potentially trigger abuse detection:
```python
class RateLimitedLLMClient:
    def __init__(self, rpm_limit: int = 60, tpm_limit: int = 100_000):
        self.rpm_limiter = TokenBucket(rate=rpm_limit, per=60)
        self.tpm_limiter = TokenBucket(rate=tpm_limit, per=60)

    async def generate(self, prompt: str) -> str:
        estimated_tokens = int(len(prompt.split()) * 1.3)

        await self.rpm_limiter.acquire(1)
        await self.tpm_limiter.acquire(estimated_tokens)

        return await self._raw_generate(prompt)
```
Token costs accumulate fast in multi-agent systems. Every LLM call should be attributed to a task, user, and workflow:
```python
from dataclasses import dataclass

@dataclass
class CostTracker:
    run_id: str
    budget_usd: float
    spent_usd: float = 0.0

    def record_call(self, model: str, prompt_tokens: int, completion_tokens: int):
        cost = calculate_cost(model, prompt_tokens, completion_tokens)
        self.spent_usd += cost

        if self.spent_usd > self.budget_usd:
            raise BudgetExceededError(
                f"Run {self.run_id} exceeded budget: "
                f"${self.spent_usd:.4f} > ${self.budget_usd:.4f}"
            )
```
Not all decisions should be made autonomously. Good orchestration frameworks have first-class support for pausing a workflow and waiting for human input:
```python
async def run_deployment_workflow(artifact: BuildArtifact, state: WorkflowState):
    # Automated checks
    test_results = await run_tests(artifact)
    security_scan = await scan_for_vulnerabilities(artifact)

    # Human approval gate for production deployments
    if state.target_environment == "production":
        approval = await request_human_approval(
            message=f"""
            Ready to deploy {artifact.version} to production.
            - Tests: {test_results.summary}
            - Security: {security_scan.summary}

            Approve or reject?
            """,
            timeout_hours=24
        )

        if not approval.approved:
            raise WorkflowAbortedByHuman(reason=approval.reason)

    await deploy(artifact, state.target_environment)
```
13. Evaluating Agent Systems: The Testing Problem
Testing agents is fundamentally different from testing traditional software. You can’t just `assert output == expected_output` because LLM outputs are non-deterministic.
13.1 The Evaluation Stack
```
┌─────────────────────────────────────────────┐
│          AGENT EVALUATION PYRAMID           │
│                                             │
│     ┌─────────────────────────────────┐     │
│     │  E2E Workflow Evals (slow)      │     │
│     │  "Does the full pipeline work?" │     │
│     └─────────────────────────────────┘     │
│     ┌─────────────────────────────────┐     │
│     │  Component Evals (medium)       │     │
│     │  "Does this agent do its job?"  │     │
│     └─────────────────────────────────┘     │
│     ┌─────────────────────────────────┐     │
│     │  Unit Evals (fast)              │     │
│     │  "Does this prompt/tool work?"  │     │
│     └─────────────────────────────────┘     │
└─────────────────────────────────────────────┘
```
13.2 LLM-as-Judge
For subjective quality assessment, use a stronger LLM to judge the output of your agent:
```python
import json

async def evaluate_security_report(
    report: str,
    ground_truth_findings: list[Finding]
) -> EvalResult:
    judge_prompt = f"""
    You are evaluating a security audit report.

    Ground truth vulnerabilities (what SHOULD be found):
    {json.dumps([f.dict() for f in ground_truth_findings], indent=2)}

    Agent's report:
    {report}

    Rate the report on:
    1. Precision (0-10): Are the findings accurate?
    2. Recall (0-10): Were all vulnerabilities found?
    3. Actionability (0-10): Are the remediation steps concrete?
    4. False Positives: How many issues reported that don't exist?

    Respond with JSON.
    """

    return await judge_llm.generate(judge_prompt, response_format="json")
```
Maintain a curated dataset of inputs with known expected outputs. Run this suite on every prompt or workflow change:
```
evals/
├── golden_datasets/
│   ├── security_audit_cases.jsonl   # 50 labeled cases
│   ├── code_review_cases.jsonl      # 30 labeled cases
│   └── data_analysis_cases.jsonl    # 20 labeled cases
├── scorers/
│   ├── llm_judge.py
│   ├── exact_match.py
│   └── semantic_similarity.py
└── run_evals.py
```
14. The Future: Where Agent Orchestration Is Heading
Section titled “14. The Future: Where Agent Orchestration Is Heading”14.1 Standardized Agent Protocols
The industry is converging on Model Context Protocol (MCP) - an open standard for tool interfaces that allows any agent framework to consume any tool. This is analogous to what HTTP did for web services: create a universal interface that enables ecosystem interoperability.
14.2 Agent Memory as a First-Class Service
Today, most teams build memory from scratch. The next wave will see memory-as-a-service products that handle chunking, embedding, consolidation, retrieval, and injection - exposing a clean API that any agent can use.
14.3 Multi-Modal Agents
Current text-in, text-out agents are giving way to agents that can natively perceive and produce images, audio, video, and structured data. This will massively expand the action space available to orchestration frameworks.
14.4 Formal Verification of Agent Behavior
As agents are deployed in high-stakes domains (healthcare, finance, legal), we will need formal methods for reasoning about agent safety - not just empirical testing, but provable guarantees about what an agent can and cannot do.
Conclusion
The `.agents/` directory structure is more than a file organization convention - it is a map of a cognitive architecture. Each folder corresponds to a distinct function in the agent’s mind:
- `config/` is policy - what the agent is allowed to do
- `docs/` is knowledge - what the agent knows about the world
- `logs/` is memory of actions - what the agent has done
- `memory/` is state - what the agent remembers about context
- `prompts/` is identity - who the agent is and how it thinks
- `skills/` is capability - what complex things the agent can do
- `tools/` is agency - how the agent acts on the world
- `workflows/` is collaboration - how agents work together
Building reliable, effective, and safe AI agents requires getting all of these layers right - and understanding how they interact. The goal is not to build a single all-knowing AI, but to build a team of specialists that can be orchestrated to solve problems no individual agent could tackle alone.
The compound AI system is not a future concept. It is happening now, in production, at scale. Understanding this architecture is not optional for the next generation of software engineers - it is foundational.
Further Reading
- ReAct: Synergizing Reasoning and Acting in Language Models (2022) - The foundational paper on the Reason-Act-Observe loop
- Generative Agents: Interactive Simulacra of Human Behavior (2023) - Seminal work on agent memory architectures
- LangGraph Documentation - Production-grade agent workflow framework
- Model Context Protocol (MCP) - Emerging standard for tool interoperability
- Building Effective Agents - Anthropic - Practical guidance on agent design patterns