AI Agent Orchestration
We are no longer building chatbots. We are building agents - systems that perceive, reason, plan, act, and collaborate. The shift from a single prompt-response loop to a compound, multi-agent architecture is one of the most significant changes in applied AI in the last decade. Yet, most developers underestimate just how deep this rabbit hole goes.
This article is a first-principles breakdown of AI agent orchestration - what it is, how it is architected, why each component exists, and how it all fits together in production. Whether you are building a personal assistant, an autonomous coding agent, or a fleet of specialized AI workers, the concepts here will form the bedrock of your understanding.
1. What Is an AI Agent, Really?
Before diving into orchestration, we need to settle on a precise definition. An AI agent is a software system that:
- Perceives its environment (text, files, APIs, databases, sensor data)
- Reasons about what to do next (using an LLM as its reasoning core)
- Acts by invoking tools, writing code, or calling external services
- Reflects on the outcome of its actions
- Persists state across multiple steps and sessions
The key distinction from a basic LLM call is the feedback loop. A chatbot takes input and returns output. An agent takes input, takes an action, observes the result, and decides what to do next - sometimes hundreds of times before completing a task.
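In pseudocode-ish Python, that control flow looks like the sketch below - a stubbed decision function stands in for the LLM, and every name here is illustrative, not a real framework API:

```python
# Minimal agent loop sketch: a stubbed "model" decides between
# calling a tool and finishing. All names are illustrative.

def decide_next_action(task: str, observations: list) -> dict:
    # Toy policy standing in for an LLM call: look something up once, then answer.
    if not observations:
        return {"action": "lookup", "input": task}
    return {"action": "finish", "answer": f"Done after {len(observations)} tool call(s)"}

def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    observations = []  # accumulating context fed back to the model
    for _ in range(max_steps):
        decision = decide_next_action(task, observations)  # Reason
        if decision["action"] == "finish":
            return decision["answer"]
        result = tools[decision["action"]](decision["input"])  # Act
        observations.append((decision["action"], result))      # Observe
    return "max steps reached"

tools = {"lookup": lambda q: f"search results for {q!r}"}
answer = run_agent("find deployment docs", tools)
```

The loop terminates either when the model decides it is done or when a step budget runs out - a guard real systems also need.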
This is often called the ReAct loop (Reason → Act → Observe), and it is the foundational execution pattern of almost all modern agents. The diagram below shows how each phase feeds back into the next:
```
┌─────────────────────────────────────┐
│             AGENT LOOP              │
│                                     │
│  Perceive → Reason → Act → Observe  │
│      ↑__________________________|   │
│                                     │
│   (Repeat until task is complete)   │
└─────────────────────────────────────┘
```
2. Why Orchestration? The Problem with Single-Agent Systems
A single agent can do a lot. But single-agent systems hit hard limits when:
- Context windows overflow. Long tasks generate massive amounts of intermediate state that can’t all fit in one context window.
- Specialization is needed. A general-purpose agent performing both security auditing and UI design will be mediocre at both. Specialized agents are far more effective.
- Parallelism is required. Some tasks have sub-tasks that can be run concurrently. A single agent is inherently sequential.
- Error containment matters. When one component of a pipeline fails, you want to isolate the failure rather than corrupting the entire task.
- Human oversight is essential. Complex workflows need checkpoints where humans can review, redirect, or override agent decisions.
Orchestration is the answer to all of these. It is the layer that coordinates multiple agents, routes tasks between them, manages shared resources (like memory and tools), and ensures the overall system makes forward progress toward a goal.
3. The Anatomy of an Orchestration Framework
A well-designed orchestration framework typically maps to the following directory structure (a pattern used across many production-grade agent systems):
```
.agents/
├── config/      # DNA of the agents
├── docs/        # Knowledge base for RAG
├── logs/        # Execution traces and debugging
├── memory/      # Short-term and long-term state
├── prompts/     # System instructions and templates
├── skills/      # Reusable higher-order capabilities
├── tools/       # External integrations and action code
└── workflows/   # Multi-agent coordination logic
```

Let’s go through each layer in depth.
4. config/ - The Agent’s Identity and Constraints
The config folder is not just a place to store API keys. It is the policy layer of your agent. It answers: Who is this agent, and what are the rules it operates under?
4.1 Model Selection
Different tasks call for different model tiers. A production config might look like:
```yaml
model:
  primary: "gemini-1.5-pro"        # Used for complex reasoning tasks
  secondary: "gemini-1.5-flash"    # Used for fast, repetitive subtasks
  embedding: "text-embedding-004"  # Used for vector search

inference:
  temperature: 0.2       # Low temperature for deterministic tasks
  top_p: 0.95
  max_output_tokens: 8192
  timeout_seconds: 30
```

Choosing the right model for each sub-task is critical for both cost and performance. Routing a simple classification task through a frontier model is expensive and slow. Routing a complex multi-step reasoning task through a small model leads to degraded output.
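Acting on that config, a thin routing layer can send each sub-task to the right tier. The sketch below is illustrative only - `estimate_complexity` is a stand-in heuristic, not a real library call:

```python
# Hypothetical model router: pick a tier based on a task-complexity score.
MODEL_TIERS = {
    "primary": "gemini-1.5-pro",      # complex reasoning
    "secondary": "gemini-1.5-flash",  # fast, repetitive subtasks
}

def estimate_complexity(task: str) -> float:
    # Toy heuristic: long or multi-step task descriptions score higher.
    score = min(len(task) / 200, 1.0)
    if any(word in task.lower() for word in ("plan", "multi-step", "debug")):
        score = max(score, 0.8)
    return score

def pick_model(task: str, threshold: float = 0.5) -> str:
    tier = "primary" if estimate_complexity(task) >= threshold else "secondary"
    return MODEL_TIERS[tier]
```

In production the heuristic would itself often be a cheap classifier, but the routing shape stays the same.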
4.2 Resource Limits and Safety Guardrails
Config also enforces agent boundaries - what the agent is and is not allowed to do:
```yaml
safety:
  max_iterations: 50            # Prevent infinite loops
  max_tool_calls_per_run: 200
  allowed_file_extensions: [".py", ".ts", ".json", ".md"]
  disallowed_domains: ["internal-hr.company.com"]
  require_human_approval_for:
    - "file_deletion"
    - "external_api_post"
    - "database_write"
```

These constraints are what separate a research prototype from a production system. Without them, agents can enter runaway loops, incur massive costs, or take destructive actions.
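Enforcing these limits at runtime needs only a small counter object checked on every loop iteration and tool call - a sketch with hypothetical names:

```python
# Sketch: runtime enforcement of the safety limits above. Names are illustrative.
class SafetyLimits:
    def __init__(self, max_iterations: int = 50, max_tool_calls: int = 200):
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls
        self.iterations = 0
        self.tool_calls = 0

    def check_iteration(self) -> None:
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("max_iterations exceeded: possible runaway loop")

    def check_tool_call(self, tool_name: str, needs_approval: set) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("max_tool_calls_per_run exceeded")
        if tool_name in needs_approval:
            raise RuntimeError(f"{tool_name} requires human approval")

limits = SafetyLimits()
limits.check_iteration()
limits.check_iteration()
limits.check_tool_call("read_file", needs_approval={"file_deletion"})
```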
4.3 Environment-Specific Overrides
A good config system supports environment overlays:
```
config/
├── base.yaml         # Shared defaults
├── development.yaml  # Local dev overrides (verbose logging, mock tools)
├── staging.yaml      # Pre-prod (real tools, rate-limited)
└── production.yaml   # Full production config
```
5. docs/ - Retrieval-Augmented Generation (RAG) Knowledge Base
The docs folder is where your agent’s domain knowledge lives. Instead of fine-tuning a model on your proprietary data (expensive, slow, and hard to update), RAG lets you inject relevant knowledge dynamically at query time.
5.1 How RAG Works in an Agent Context
```
User Query
    │
    ▼
Embed Query (Vector)
    │
    ▼
Search docs/ index for top-K relevant chunks
    │
    ▼
Inject chunks into LLM context as grounding documents
    │
    ▼
LLM generates response grounded in your docs
```

This is powerful because:
- Your knowledge base is updatable - just add or remove documents
- The agent cites sources, making responses auditable
- You stay in control of what the agent knows
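The whole retrieval step can be sketched in a few lines, with word-overlap scoring standing in for a real embedding model (every name here is illustrative):

```python
# Toy RAG retrieval: word-overlap scoring stands in for vector similarity.
DOCS = [
    "The deployment freeze window runs Friday afternoon to Monday morning.",
    "All database credentials rotate every 90 days.",
    "The on-call engineer owns incident response.",
]

def tokens(text: str) -> set:
    # Normalize case and strip trailing punctuation.
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(query: str, docs: list, top_k: int = 1) -> list:
    # Rank docs by overlap with the query; a real system ranks by cosine similarity.
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:top_k]

def build_prompt(query: str, docs: list) -> str:
    grounding = "\n".join(retrieve(query, docs))
    return f"Context:\n{grounding}\n\nQuestion: {query}"

prompt = build_prompt("when is the deployment freeze window?", DOCS)
```

Swap the overlap score for an embedding model and the list for a vector index, and this is the production shape.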
5.2 What Goes in docs/
```
docs/
├── policies/
│   ├── deployment_policy.pdf
│   └── security_standards.md
├── runbooks/
│   ├── incident_response.md
│   └── oncall_procedures.md
├── architecture/
│   ├── system_diagram.png
│   └── api_contracts.yaml
└── faqs/
    └── product_faq.md
```
5.3 Chunking Strategy Matters
How you split documents dramatically affects retrieval quality:
| Strategy | Best For | Risk |
|---|---|---|
| Fixed-size chunks (512 tokens) | General documents | Splits context mid-sentence |
| Recursive character splitting | Code, mixed content | May lose semantic boundaries |
| Semantic chunking | Dense technical docs | Computationally expensive |
| Document-level chunks | Short docs, FAQs | Poor precision in long docs |
A naive chunking strategy is one of the most common causes of poor RAG performance. Invest time here.
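As a baseline, fixed-size chunking with overlap is simple to implement. This sketch counts words rather than tokens for simplicity - a real pipeline would use the embedding model's tokenizer:

```python
# Fixed-size chunking with overlap, word-based for simplicity.
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each chunk re-includes the previous tail
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already covers the end of the text
    return chunks

doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
```

The overlap is what protects against splitting a fact across a chunk boundary - the risk the table above flags for fixed-size chunks.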
5.4 Metadata Filtering
Good RAG systems attach metadata to each chunk for filtered retrieval:
```json
{
  "content": "The deployment freeze window is every Friday from 4PM–Monday 9AM...",
  "metadata": {
    "source": "deployment_policy.pdf",
    "category": "policy",
    "last_updated": "2024-12-01",
    "audience": "engineering",
    "version": "v3.2"
  }
}
```

This lets agents ask: “Find me only the security policies updated in the last 6 months.”
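Against an in-memory chunk list, that query is a simple predicate; real vector databases expose equivalent filter syntax alongside the similarity search. A sketch with made-up chunks:

```python
# In-memory stand-in for metadata-filtered retrieval.
chunks = [
    {"content": "Deployment freeze windows...",
     "metadata": {"category": "policy", "last_updated": "2024-12-01"}},
    {"content": "Password rotation rules...",
     "metadata": {"category": "security", "last_updated": "2024-11-10"}},
    {"content": "Legacy VPN setup...",
     "metadata": {"category": "security", "last_updated": "2023-01-15"}},
]

def filter_chunks(chunks: list, category: str, updated_after: str) -> list:
    return [
        c for c in chunks
        if c["metadata"]["category"] == category
        and c["metadata"]["last_updated"] >= updated_after  # ISO dates sort lexically
    ]

recent_security = filter_chunks(chunks, "security", "2024-06-01")
```

In production the filter runs inside the vector store, pruning candidates before (or alongside) the similarity ranking.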
6. logs/ - The Black Box for Debugging and Observability
Agent debugging is notoriously hard. Unlike traditional software where you can step through a debugger, agent behavior is emergent - it depends on the model’s internal states, tool outputs, and the cumulative context window. Logs are your only window into what happened.
6.1 What to Log
A production agent log entry looks like this:
```json
{
  "run_id": "run_20241215_143022_abc123",
  "agent_id": "security-auditor-v2",
  "timestamp": "2024-12-15T14:30:22.441Z",
  "step": 7,
  "type": "tool_call",
  "tool_name": "search_codebase",
  "input": {
    "query": "SQL query construction",
    "file_types": [".py"],
    "max_results": 20
  },
  "output": {
    "files_found": 14,
    "top_result": "src/database/query_builder.py:L45"
  },
  "latency_ms": 1243,
  "token_usage": { "prompt": 8421, "completion": 312, "total": 8733 },
  "chain_of_thought": "I need to check how SQL queries are constructed to identify potential injection vectors. I'll search for places where strings are concatenated into queries."
}
```
6.2 Structured Logging for Replay
The most valuable feature of agent logs is replay capability. If an agent fails midway through a 2-hour task, you want to:
- Read the logs to understand what happened
- Fix the bug
- Resume from the last successful checkpoint - not restart from scratch
This requires logging not just what the agent did, but also the full context state at each step.
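A minimal version of checkpoint-and-resume: write each completed step to a JSONL log, and on the next run skip any step already recorded (illustrative, not a specific framework):

```python
import json
import os
import tempfile

# Sketch: each completed step is appended to a JSONL checkpoint log.
# On restart, steps already in the log are skipped instead of re-run.
def run_with_checkpoints(steps, log_path: str) -> dict:
    done = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                entry = json.loads(line)
                done[entry["step"]] = entry["result"]

    state = dict(done)
    with open(log_path, "a") as f:
        for name, fn in steps:
            if name in done:
                continue  # replay: this step already succeeded
            state[name] = fn(state)
            f.write(json.dumps({"step": name, "result": state[name]}) + "\n")
    return state

log_path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
calls = []
steps = [
    ("fetch", lambda state: calls.append("fetch") or "data"),
    ("analyze", lambda state: calls.append("analyze") or "report"),
]
first = run_with_checkpoints(steps, log_path)
second = run_with_checkpoints(steps, log_path)  # resumes: nothing re-runs
```

The same pattern scales up: replace the JSONL file with your log store, and the step results with serialized context state.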
6.3 Observability Stack
For production systems, logs feed into an observability pipeline. The following shows a typical stack from raw agent logs through to dashboards and alerting:
```
Agent Logs (JSON)
      │
      ▼
Log Aggregator (Fluentd / Logstash)
      │
      ▼
Storage (BigQuery / Elasticsearch)
      │
      ├──► Dashboards (Grafana / Looker)
      ├──► Alerts (PagerDuty)
      └──► Trace Viewer (Langfuse / Arize / Phoenix)
```

Tools like Langfuse, Arize Phoenix, and Weights & Biases Weave are purpose-built for LLM observability and are worth evaluating.
7. memory/ - State, Context, and the Illusion of Continuity
Memory is what separates an agent that can hold a conversation from one that can manage a long-running project. It comes in multiple forms.
7.1 The Four Types of Memory
Section titled “7.1 The Four Types of Memory”| Type | Scope | Storage | Example |
|---|---|---|---|
| In-context (working) | Current conversation | LLM context window | “Earlier you said the budget is $50k” |
| Episodic (short-term) | Recent sessions | Redis / in-memory DB | Past 10 conversation summaries |
| Semantic (long-term) | Persistent facts | Vector DB | “User prefers TypeScript over JavaScript” |
| Procedural | Skills/behaviors | Prompt templates | “When debugging, always check logs first” |
7.2 The Memory Architecture
```
.agents/memory/
├── episodic/
│   ├── session_20241215.jsonl   # Full session transcript
│   └── session_20241214.jsonl
├── semantic/
│   ├── user_preferences.json    # Extracted facts about the user
│   ├── project_context.json     # Key project facts
│   └── entity_graph.json        # Relationships between concepts
└── working/
    └── current_task_state.json  # Active task scratchpad
```
7.3 Memory Consolidation
Raw session logs are too verbose to fit in future context windows. A good memory system runs a consolidation process - periodically summarizing and extracting structured facts from raw logs:
```python
import json

# Simplified memory consolidation
async def consolidate_session(session_log: list[dict]) -> dict:
    summary_prompt = f"""
    Review these agent actions and extract:
    1. Key decisions made
    2. Important facts learned about the user/project
    3. Unresolved tasks or open questions
    4. Errors encountered and how they were resolved

    Session log:
    {json.dumps(session_log, indent=2)}
    """

    summary = await llm.generate(summary_prompt)
    return parse_structured_summary(summary)
```
7.4 Semantic Memory with Vector Search
For long-term memory retrieval, a vector database (Pinecone, Weaviate, pgvector) lets agents ask: “What do I know about this topic?”
```python
# Storing a memory
memory_text = "User's production database is PostgreSQL 15.2 running on AWS RDS"
embedding = embed(memory_text)
vector_db.upsert(
    id="mem_proj_db_001",
    vector=embedding,
    metadata={"category": "infrastructure", "timestamp": now()}
)

# Retrieving relevant memories
query_embedding = embed("What database are we using?")
relevant_memories = vector_db.query(query_embedding, top_k=5)
```
Every memory retrieved must be injected into the context window - which is finite. This creates a fundamental tension: more memory = better context, but also higher cost and token pressure. The solution is a tiered injection strategy:
- Always inject: current task state, user identity, core preferences
- Query-time inject: semantically relevant episodic memories
- Never inject: raw unprocessed logs, irrelevant historical sessions
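A tiered injection pass can be sketched as a priority-ordered fill under a token budget (`estimate_tokens` here is a crude word count, not a real tokenizer):

```python
# Sketch: assemble the context window from memory tiers under a token budget.
def estimate_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def build_context(always_inject: list, retrieved: list, budget: int) -> list:
    context, used = [], 0
    for item in always_inject:     # tier 1: always injected
        context.append(item)
        used += estimate_tokens(item)
    for item in retrieved:         # tier 2: inject while the budget allows
        cost = estimate_tokens(item)
        if used + cost > budget:
            break                  # tier 3 (raw logs) never reaches this stage at all
        context.append(item)
        used += cost
    return context

always_inject = ["current task: review the deployment runbook", "user prefers TypeScript"]
retrieved = ["memory: " + "x " * 50, "memory: " + "y " * 50]
ctx = build_context(always_inject, retrieved, budget=70)
```

Retrieved memories should already arrive sorted by relevance, so truncating at the budget drops the least useful ones first.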
8. prompts/ - The System Instructions Layer
If config is the agent’s policy, prompts are its character. This is where you define who the agent is, how it thinks, and what it prioritizes.
8.1 Prompt Engineering for Agents (vs. Chatbots)
Agent prompts are structurally different from chatbot prompts:
```markdown
# System Prompt: Senior Security Auditor Agent

## Identity
You are a senior application security engineer with 15 years of experience.
Your primary responsibility is identifying security vulnerabilities in codebases.
You are meticulous, methodical, and never skip steps in your analysis.

## Reasoning Protocol
When analyzing code for vulnerabilities, always:
1. First understand the data flow - where does user input enter the system?
2. Trace all paths that untrusted input can take
3. Check for sanitization and validation at each boundary
4. Consider second-order effects (e.g., stored XSS)

## Tool Use Guidelines
- Use `search_codebase` before making claims about what code does
- Always verify line numbers before reporting findings
- Never report a vulnerability without a proof-of-concept or code reference

## Output Format
For each finding, use this exact structure:
- **Severity**: [Critical/High/Medium/Low/Informational]
- **CWE**: [CWE identifier]
- **Location**: [File:Line]
- **Description**: [What the vulnerability is]
- **Proof of Concept**: [How it could be exploited]
- **Remediation**: [How to fix it]

## Constraints
- Never modify files unless explicitly asked
- If you find a critical vulnerability involving PII, immediately halt and alert the human
- Do not make assumptions about business logic - ask for clarification
```
8.2 Prompt Templates and Jinja2
For dynamic prompts, templating engines let you inject context at runtime:
```jinja
{# prompts/code_reviewer.j2 #}
You are reviewing a pull request for the {{ project_name }} project.

**PR Context:**
- Author: {{ pr_author }}
- Branch: {{ pr_branch }}
- Files changed: {{ files_changed | length }}
- Description: {{ pr_description }}

**Reviewer Guidelines for {{ project_name }}:**
{% for guideline in project_guidelines %}
- {{ guideline }}
{% endfor %}

**Your task:**
Review the following diff and provide structured feedback.
Focus on: {{ review_focus | join(", ") }}
```
8.3 Prompt Versioning
Prompts are code. They should be versioned, tested, and reviewed:
```
prompts/
├── security_auditor/
│   ├── v1.0.0.md             # Initial version
│   ├── v1.1.0.md             # Added CWE classification requirement
│   ├── v2.0.0.md             # Major rewrite after eval regression
│   └── current -> v2.0.0.md  # Symlink to active version
└── changelog.md
```
9. skills/ - Reusable Higher-Order Capabilities
Skills sit between raw tools (specific API calls) and full agent workflows (multi-step coordination). They are composable capabilities that any agent can use.
9.1 Skills vs. Tools
Section titled “9.1 Skills vs. Tools”| Aspect | Tool | Skill |
|---|---|---|
| Abstraction level | Low (atomic action) | High (multi-step capability) |
| Example | `read_file("app.py")` | `analyze_codebase_security()` |
| Calls LLM? | No | Often yes |
| Uses other tools? | No | Yes |
| State? | Stateless | May maintain intermediate state |
9.2 Example: A summarize_and_extract Skill
```python
import json

async def summarize_and_extract(
    document: str,
    extract_schema: dict,
    max_summary_tokens: int = 500
) -> dict:
    """
    High-level skill that summarizes a document AND extracts
    structured data from it in a single LLM call.

    More efficient than two separate calls.
    """
    prompt = f"""
    Analyze this document and do two things:

    1. Write a {max_summary_tokens}-token summary
    2. Extract the following fields as JSON:
    {json.dumps(extract_schema, indent=2)}

    Document:
    {document}

    Respond with valid JSON matching this schema:
    {{
      "summary": "...",
      "extracted": {{ ... }}
    }}
    """

    result = await llm.generate(prompt, response_format="json")
    return json.loads(result)
```
9.3 Skills as Composable Blocks
A key design principle: skills should be composable. A `generate_report` skill might call `summarize_and_extract`, which calls `read_file`, which calls the filesystem tool.
This layered composition is what makes complex agent behavior manageable. Each layer only needs to understand the interface one level below it.
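That layering can be sketched with stubs - every function below is hypothetical, and the “LLM call” is faked so only the structure is visible:

```python
# Sketch of layered composition: tool -> skill -> higher-level skill.
FAKE_FS = {"design.md": "The system uses a supervisor agent to route tasks."}

def read_file_tool(path: str) -> str:              # tool layer: atomic action
    return FAKE_FS[path]

def summarize_and_extract(document: str) -> dict:  # skill layer
    # Stand-in for a single LLM call that summarizes and extracts fields.
    return {"summary": document[:30], "word_count": len(document.split())}

def generate_report(paths: list) -> str:           # higher-level skill
    sections = [summarize_and_extract(read_file_tool(p)) for p in paths]
    return "\n".join(f"- {s['summary']} ({s['word_count']} words)" for s in sections)

report = generate_report(["design.md"])
```

`generate_report` never touches the filesystem directly; it only knows the skill interface one level below it.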
10. tools/ - The Agent’s Hands in the World
Tools are how agents take action. They are the bridge between the language model’s reasoning and the real world. If skills are capabilities, tools are actions.
10.1 The Tool Interface Contract
Every tool must follow a strict interface. In most frameworks, this looks like:
```python
from pydantic import BaseModel, Field

class SearchCodebaseInput(BaseModel):
    query: str = Field(description="Natural language search query")
    file_types: list[str] = Field(
        default=[],
        description="Filter by file extensions (e.g., ['.py', '.ts'])"
    )
    max_results: int = Field(default=10, ge=1, le=100)

class SearchCodebaseOutput(BaseModel):
    results: list[dict]
    total_found: int
    truncated: bool

async def search_codebase(input: SearchCodebaseInput) -> SearchCodebaseOutput:
    """
    Search the codebase using semantic search.
    Use this when you need to find code related to a specific concept.
    """
    # ... implementation
```

The docstring is critical - it’s what the LLM reads to decide when and how to use the tool.
10.2 Tool Categories
```
tools/
├── filesystem/
│   ├── read_file.py
│   ├── write_file.py
│   ├── search_codebase.py
│   └── list_directory.py
├── web/
│   ├── web_search.py
│   ├── fetch_url.py
│   └── scrape_page.py
├── execution/
│   ├── run_python.py
│   ├── run_bash.py
│   └── run_tests.py
├── communication/
│   ├── send_slack_message.py
│   ├── create_github_pr.py
│   └── send_email.py
└── data/
    ├── query_database.py
    ├── search_vector_db.py
    └── call_api.py
```
10.3 Tool Safety and Sandboxing
Some tools are inherently dangerous. The `run_bash` tool, for example, can delete files, exfiltrate data, or crash a server. Production systems enforce multiple safety layers:
```python
import asyncio
import re

class RunBashTool:
    BLOCKED_PATTERNS = [
        r"rm\s+-rf\s+/",            # Recursive delete from root
        r">\s*/dev/sd[a-z]",        # Writing to raw disk devices
        r"curl.*\|\s*bash",         # Pipe-from-internet-to-bash attacks
        r"base64\s+-d.*\|\s*bash",  # Base64 obfuscated execution
    ]

    async def execute(self, command: str) -> str:
        # Static analysis
        for pattern in self.BLOCKED_PATTERNS:
            if re.search(pattern, command):
                raise ToolSafetyError(f"Blocked pattern detected: {pattern}")

        # Runtime sandboxing
        process = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            cwd=self.sandbox_dir,              # Restricted working directory
            env=self.sanitized_env,            # Clean environment variables
            preexec_fn=self._drop_privileges,  # Minimal OS permissions
        )

        stdout, _ = await asyncio.wait_for(process.communicate(), timeout=30)
        return stdout.decode()
```
Every tool call should emit structured telemetry:
```python
@tool_call_traced  # Decorator that logs input, output, latency, cost
async def query_database(sql: str, database: str) -> dict:
    ...
```
11. workflows/ - Multi-Agent Orchestration
This is where everything comes together. Workflows define how multiple agents collaborate to achieve complex goals.
11.1 Orchestration Patterns
There are several well-established patterns for multi-agent coordination:
Pattern 1: Pipeline (Sequential)
Each agent’s output becomes the next agent’s input. Simple, predictable, and easy to debug:
```
Input → Agent A → Agent B → Agent C → Output
```

Best for: document processing pipelines, data transformation chains.
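In code, a sequential pipeline reduces to function composition over a shared state dict (a sketch with toy stages standing in for agents):

```python
# Sketch: a sequential pipeline as function composition over shared state.
def extract(state: dict) -> dict:
    return {**state, "text": state["raw"].strip()}

def classify(state: dict) -> dict:
    label = "question" if state["text"].endswith("?") else "statement"
    return {**state, "label": label}

def respond(state: dict) -> dict:
    return {**state, "output": f"[{state['label']}] {state['text']}"}

def run_pipeline(stages: list, state: dict) -> dict:
    for stage in stages:  # each agent's output feeds the next
        state = stage(state)
    return state

result = run_pipeline([extract, classify, respond], {"raw": "  Is it deployed?  "})
```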
Pattern 2: Supervisor / Worker
A central supervisor LLM decomposes the goal and assigns tasks to specialized worker agents:
```
      ┌─────────────────────┐
      │     SUPERVISOR      │
      │ (Orchestrator LLM)  │
      └──────────┬──────────┘
                 │ assigns tasks
      ┌──────────▼──────────┐
      │                     │
 ┌────▼────┐           ┌────▼────┐
 │ Worker  │           │ Worker  │
 │ Agent A │           │ Agent B │
 └─────────┘           └─────────┘
```

The supervisor LLM receives the high-level goal, decomposes it into tasks, and assigns them to specialized worker agents. Workers report back; the supervisor integrates results and assigns next steps.
Best for: complex multi-faceted tasks requiring coordination (e.g., “Build and deploy a new feature”).
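A toy version of the pattern, with a rule-based `decompose` standing in for the supervisor LLM and plain functions as workers:

```python
# Toy supervisor/worker loop; every name here is illustrative.
WORKERS = {
    "research": lambda task: f"notes on {task}",
    "write": lambda task: f"draft covering {task}",
}

def decompose(goal: str) -> list:
    # Stand-in for the supervisor LLM: emit (worker, task) assignments.
    return [("research", goal), ("write", goal)]

def run_supervisor(goal: str) -> dict:
    results = {}
    for worker, task in decompose(goal):
        results[worker] = WORKERS[worker](task)  # dispatch, then collect
    return results

results = run_supervisor("feature launch plan")
```

A real supervisor would loop - re-planning after each batch of worker reports - but the dispatch-and-collect core is the same.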
Pattern 3: Debate / Multi-Critic
Independent critic agents review a draft; a synthesizer integrates their feedback into a final output:
```
     ┌──────────────────────────────┐
     │         Draft Agent          │
     │ (Generates initial response) │
     └───────────────┬──────────────┘
                     │
          ┌──────────▼──────────┐
          │                     │
     ┌────▼────┐           ┌────▼────┐
     │ Critic  │           │ Critic  │
     │ Agent A │           │ Agent B │
     └────┬────┘           └────┬────┘
          │                     │
          └──────────┬──────────┘
                     │ synthesized critiques
                     ▼
              ┌─────────────┐
              │ Synthesizer │
              │    Agent    │
              └─────────────┘
```

Best for: high-stakes decisions, content generation, code review.
Pattern 4: Map-Reduce
A large task is split into independent chunks processed in parallel, then merged by a reducer agent:
```
              Input
                │
     ┌──────────┼──────────┐
     │          │          │
┌────▼───┐ ┌────▼───┐ ┌────▼───┐
│ Mapper │ │ Mapper │ │ Mapper │  (parallel)
│   A    │ │   B    │ │   C    │
└────┬───┘ └────┬───┘ └────┬───┘
     │          │          │
     └──────────┼──────────┘
                │
         ┌──────▼──────┐
         │   Reducer   │
         │    Agent    │
         └─────────────┘
```

Best for: analyzing large corpora, parallel research tasks.
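The fan-out/fan-in shape maps directly onto `asyncio.gather` (a sketch; the per-chunk “agent” is faked with a word count):

```python
import asyncio

# Sketch: map-reduce with concurrent mappers via asyncio.gather.
async def mapper(chunk: str) -> int:
    # Stand-in for a per-chunk agent (e.g. "count findings in this chunk").
    await asyncio.sleep(0)  # yield point; real mappers would await LLM calls
    return len(chunk.split())

def reducer(partials: list) -> int:
    return sum(partials)    # merge step

async def map_reduce(chunks: list) -> int:
    partials = await asyncio.gather(*(mapper(c) for c in chunks))
    return reducer(list(partials))

total = asyncio.run(map_reduce(["one two", "three four five", "six"]))
```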
11.2 Workflow Definition Example
Modern frameworks like LangGraph, CrewAI, and AutoGen use graph-based workflow definitions:
```python
from langgraph.graph import StateGraph, END
from agents import SecurityAuditor, PerformanceAnalyzer, StyleChecker, Synthesizer

def build_code_review_workflow():
    graph = StateGraph(CodeReviewState)

    # Add agent nodes
    graph.add_node("security", SecurityAuditor().run)
    graph.add_node("performance", PerformanceAnalyzer().run)
    graph.add_node("style", StyleChecker().run)
    graph.add_node("synthesize", Synthesizer().run)
    graph.add_node("human_review", request_human_review)  # HITL pause point

    # Parallel fan-out from start
    graph.add_edge("__start__", "security")
    graph.add_edge("__start__", "performance")
    graph.add_edge("__start__", "style")

    # Fan-in to synthesizer
    graph.add_edge("security", "synthesize")
    graph.add_edge("performance", "synthesize")
    graph.add_edge("style", "synthesize")

    # Conditional: if critical issue found, require human review
    graph.add_conditional_edges(
        "synthesize",
        route_by_severity,
        {
            "critical": "human_review",
            "normal": END
        }
    )

    return graph.compile()
```
11.3 Inter-Agent Communication
Agents communicate through a shared state object - a structured dictionary that all agents can read from and write to:
```python
from datetime import datetime
from typing import Literal, Optional, TypedDict

class CodeReviewState(TypedDict):
    # Input
    pr_diff: str
    pr_description: str

    # Intermediate results (filled by worker agents)
    security_findings: list[Finding]
    performance_findings: list[Finding]
    style_findings: list[Finding]

    # Final output (filled by synthesizer)
    final_report: str
    overall_severity: Literal["critical", "high", "medium", "low"]

    # Metadata
    run_id: str
    started_at: datetime
    completed_at: Optional[datetime]
```
12. Cross-Cutting Concerns: What the Directory Structure Doesn’t Show
Beyond the directory structure itself, production agent systems require several architectural concerns that cut across all layers. These are not tied to any single folder - they must be considered holistically when designing the overall system.
12.1 Idempotency and Resumability
Long-running agents will fail. The question is not if but when. Design every workflow step to be idempotent (safe to retry) and checkpoint state frequently so failed runs can resume:
```python
@checkpoint_on_success(state_key="step_3_complete")
async def analyze_security(code: str, state: RunState) -> list[Finding]:
    if state.get("step_3_complete"):
        return state["security_findings"]  # Already done, skip

    findings = await security_auditor.analyze(code)
    return findings
```
Agents can generate enormous numbers of LLM and API calls. Without rate limiting, you will hit provider limits, generate massive bills, and potentially trigger abuse detection:
```python
class RateLimitedLLMClient:
    def __init__(self, rpm_limit: int = 60, tpm_limit: int = 100_000):
        self.rpm_limiter = TokenBucket(rate=rpm_limit, per=60)
        self.tpm_limiter = TokenBucket(rate=tpm_limit, per=60)

    async def generate(self, prompt: str) -> str:
        estimated_tokens = int(len(prompt.split()) * 1.3)

        await self.rpm_limiter.acquire(1)
        await self.tpm_limiter.acquire(estimated_tokens)

        return await self._raw_generate(prompt)
```
Token costs accumulate fast in multi-agent systems. Every LLM call should be attributed to a task, user, and workflow:
```python
from dataclasses import dataclass

@dataclass
class CostTracker:
    run_id: str
    budget_usd: float
    spent_usd: float = 0.0

    def record_call(self, model: str, prompt_tokens: int, completion_tokens: int):
        cost = calculate_cost(model, prompt_tokens, completion_tokens)
        self.spent_usd += cost

        if self.spent_usd > self.budget_usd:
            raise BudgetExceededError(
                f"Run {self.run_id} exceeded budget: "
                f"${self.spent_usd:.4f} > ${self.budget_usd:.4f}"
            )
```
Not all decisions should be made autonomously. Good orchestration frameworks have first-class support for pausing a workflow and waiting for human input:
```python
async def run_deployment_workflow(artifact: BuildArtifact, state: WorkflowState):
    # Automated checks
    test_results = await run_tests(artifact)
    security_scan = await scan_for_vulnerabilities(artifact)

    # Human approval gate for production deployments
    if state.target_environment == "production":
        approval = await request_human_approval(
            message=f"""
            Ready to deploy {artifact.version} to production.
            - Tests: {test_results.summary}
            - Security: {security_scan.summary}

            Approve or reject?
            """,
            timeout_hours=24
        )

        if not approval.approved:
            raise WorkflowAbortedByHuman(reason=approval.reason)

    await deploy(artifact, state.target_environment)
```
13. Evaluating Agent Systems: The Testing Problem
Testing agents is fundamentally different from testing traditional software. You can’t just `assert output == expected_output` because LLM outputs are non-deterministic.
13.1 The Evaluation Stack
```
┌─────────────────────────────────────────────┐
│          AGENT EVALUATION PYRAMID           │
│                                             │
│     ┌─────────────────────────────────┐     │
│     │  E2E Workflow Evals (slow)      │     │
│     │  "Does the full pipeline work?" │     │
│     └─────────────────────────────────┘     │
│     ┌─────────────────────────────────┐     │
│     │  Component Evals (medium)       │     │
│     │  "Does this agent do its job?"  │     │
│     └─────────────────────────────────┘     │
│     ┌─────────────────────────────────┐     │
│     │  Unit Evals (fast)              │     │
│     │  "Does this prompt/tool work?"  │     │
│     └─────────────────────────────────┘     │
└─────────────────────────────────────────────┘
```
13.2 LLM-as-Judge
For subjective quality assessment, use a stronger LLM to judge the output of your agent:
```python
import json

async def evaluate_security_report(
    report: str,
    ground_truth_findings: list[Finding]
) -> EvalResult:
    judge_prompt = f"""
    You are evaluating a security audit report.

    Ground truth vulnerabilities (what SHOULD be found):
    {json.dumps([f.dict() for f in ground_truth_findings], indent=2)}

    Agent's report:
    {report}

    Rate the report on:
    1. Precision (0-10): Are the findings accurate?
    2. Recall (0-10): Were all vulnerabilities found?
    3. Actionability (0-10): Are the remediation steps concrete?
    4. False Positives: How many issues reported that don't exist?

    Respond with JSON.
    """

    return await judge_llm.generate(judge_prompt, response_format="json")
```
Maintain a curated dataset of inputs with known expected outputs. Run this suite on every prompt or workflow change:
```
evals/
├── golden_datasets/
│   ├── security_audit_cases.jsonl   # 50 labeled cases
│   ├── code_review_cases.jsonl      # 30 labeled cases
│   └── data_analysis_cases.jsonl    # 20 labeled cases
├── scorers/
│   ├── llm_judge.py
│   ├── exact_match.py
│   └── semantic_similarity.py
└── run_evals.py
```
14. The Future: Where Agent Orchestration Is Heading
Section titled “14. The Future: Where Agent Orchestration Is Heading”14.1 Standardized Agent Protocols
The industry is converging on Model Context Protocol (MCP) - an open standard for tool interfaces that allows any agent framework to consume any tool. This is analogous to what HTTP did for web services: create a universal interface that enables ecosystem interoperability.
14.2 Agent Memory as a First-Class Service
Today, most teams build memory from scratch. The next wave will see memory-as-a-service products that handle chunking, embedding, consolidation, retrieval, and injection - exposing a clean API that any agent can use.
14.3 Multi-Modal Agents
Current text-in, text-out agents are giving way to agents that can natively perceive and produce images, audio, video, and structured data. This will massively expand the action space available to orchestration frameworks.
14.4 Formal Verification of Agent Behavior
As agents are deployed in high-stakes domains (healthcare, finance, legal), we will need formal methods for reasoning about agent safety - not just empirical testing, but provable guarantees about what an agent can and cannot do.
Conclusion
The `.agents/` directory structure is more than a file organization convention - it is a map of a cognitive architecture. Each folder corresponds to a distinct function in the agent’s mind:
- `config/` is policy - what the agent is allowed to do
- `docs/` is knowledge - what the agent knows about the world
- `logs/` is memory of actions - what the agent has done
- `memory/` is state - what the agent remembers about context
- `prompts/` is identity - who the agent is and how it thinks
- `skills/` is capability - what complex things the agent can do
- `tools/` is agency - how the agent acts on the world
- `workflows/` is collaboration - how agents work together
Building reliable, effective, and safe AI agents requires getting all of these layers right - and understanding how they interact. The goal is not to build a single all-knowing AI, but to build a team of specialists that can be orchestrated to solve problems no individual agent could tackle alone.
The compound AI system is not a future concept. It is happening now, in production, at scale. Understanding this architecture is not optional for the next generation of software engineers - it is foundational.
Further Reading
- ReAct: Synergizing Reasoning and Acting in Language Models (2022) - The foundational paper on the Reason-Act-Observe loop
- Generative Agents: Interactive Simulacra of Human Behavior (2023) - Seminal work on agent memory architectures
- LangGraph Documentation - Production-grade agent workflow framework
- Model Context Protocol (MCP) - Emerging standard for tool interoperability
- Building Effective Agents - Anthropic - Practical guidance on agent design patterns