# Tool Orchestration Patterns Reference

Research-backed patterns inspired by NVIDIA ToolOrchestra, OpenAI Agents SDK, and multi-agent coordination research.

---

## Overview

Effective tool orchestration requires four key innovations:

1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
2. **Efficiency Metrics** - Track computational cost per task
3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
4. **Dynamic Selection** - Adapt agent count and types based on task complexity

---

## Tracing Spans Architecture (OpenAI SDK Pattern)

### Span Types

Every operation is wrapped in a typed span for observability:

```yaml
span_types:
  agent_span:      # Wraps entire agent execution
  generation_span: # Wraps LLM API calls
  function_span:   # Wraps tool/function calls
  guardrail_span:  # Wraps validation checks
  handoff_span:    # Wraps agent-to-agent transfers
  custom_span:     # User-defined operations
```

### Hierarchical Trace Structure

```json
{
  "trace_id": "trace_abc123def456",
  "workflow_name": "implement_feature",
  "group_id": "session_xyz789",
  "spans": [
    {
      "span_id": "span_001",
      "parent_id": null,
      "type": "agent_span",
      "agent_name": "orchestrator",
      "started_at": "2026-01-07T10:00:00Z",
      "ended_at": "2026-01-07T10:05:00Z",
      "children": ["span_002", "span_003"]
    },
    {
      "span_id": "span_002",
      "parent_id": "span_001",
      "type": "guardrail_span",
      "guardrail_name": "input_validation",
      "triggered": false,
      "blocking": true
    },
    {
      "span_id": "span_003",
      "parent_id": "span_001",
      "type": "handoff_span",
      "from_agent": "orchestrator",
      "to_agent": "backend-dev"
    }
  ]
}
```

### Storage Location

```
.loki/traces/
├── active/
│   └── {trace_id}.json       # Currently running traces
└── completed/
    └── {date}/
        └── {trace_id}.json   # Archived traces
```

See `references/openai-patterns.md` for the full tracing implementation.

---

## Efficiency Metrics System

### Why Track Efficiency?

ToolOrchestra achieves a 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency.
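To make the savings concrete, here is a minimal cost sketch. The per-1K-token prices and the `blended_cost` helper are hypothetical illustrations (not real provider pricing); the point is only that routing simple work to a small model dominates the blended cost:

```python
# Hypothetical per-1K-token prices (illustrative only; real pricing varies).
PRICE_PER_1K = {"haiku": 0.001, "sonnet": 0.006, "opus": 0.03}

def blended_cost(model_usage):
    """Estimate total cost in dollars from per-model token usage."""
    return sum(
        PRICE_PER_1K[model] * usage["est_tokens"] / 1000
        for model, usage in model_usage.items()
    )

# Sending all 26K tokens to the largest model vs. complexity-based routing:
all_opus = {"opus": {"est_tokens": 26000}}
routed = {
    "haiku": {"est_tokens": 12000},
    "sonnet": {"est_tokens": 8000},
    "opus": {"est_tokens": 6000},
}
print(round(blended_cost(all_opus), 2))  # 0.78
print(round(blended_cost(routed), 2))    # 0.24 (roughly 70% cheaper)
```

Under these assumed prices, routing yields roughly the same 70% reduction the ToolOrchestra result reports, which is why the metrics below track per-model usage explicitly.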
Loki Mode should track:

- **Token usage** per task (input + output)
- **Wall clock time** per task
- **Agent spawns** per task
- **Retry count** before success

### Efficiency Tracking Schema

```json
{
  "task_id": "task-2026-01-06-001",
  "correlation_id": "session-abc123",
  "started_at": "2026-01-06T10:00:00Z",
  "completed_at": "2026-01-06T10:05:32Z",
  "metrics": {
    "wall_time_seconds": 332,
    "agents_spawned": 3,
    "total_agent_calls": 7,
    "retry_count": 1,
    "retry_reasons": ["test_failure"],
    "recovery_rate": 1.0,
    "model_usage": {
      "haiku": {"calls": 4, "est_tokens": 12000},
      "sonnet": {"calls": 2, "est_tokens": 8000},
      "opus": {"calls": 1, "est_tokens": 6000}
    }
  },
  "outcome": "success",
  "outcome_reason": "tests_passed_after_fix",
  "efficiency_score": 0.85,
  "efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
  "quality_pillars": {
    "tool_selection_correct": true,
    "tool_reliability_rate": 0.95,
    "memory_retrieval_relevant": true,
    "goal_adherence": 1.0
  }
}
```

**Why capture these metrics?** (Based on multi-agent research)

1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later)) - "UX debt turns into data debt": recording actions without intent creates useless analytics
2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1)) - `recovery_rate = successful_retries / total_retries`. The paper found "perfect tool sequencing but only 33% policy adherence": surface metrics mask failures
3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/)) - `correlation_id` links all tasks in a session for end-to-end tracing; essential for debugging multi-agent coordination failures
4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/)) - `tool_selection_correct`: did we pick the right tool? `tool_reliability_rate`: did the tool work as expected? (Tools can fail even when correctly selected.) Key insight: "tool use reliability" is a primary demo-to-deployment gap
5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1)) - `memory_retrieval_relevant`: did episodic/semantic retrieval help? `goal_adherence`: did we stay on task? (0.0-1.0 score)

### Efficiency Score Calculation

```python
def calculate_efficiency_score(metrics, task_complexity):
    """
    Score from 0-1 where higher is more efficient.
    Based on ToolOrchestra's efficiency reward signal.
    """
    # Baseline expectations by complexity
    baselines = {
        "trivial":  {"time": 60,   "agents": 1,  "retries": 0},
        "simple":   {"time": 180,  "agents": 2,  "retries": 0},
        "moderate": {"time": 600,  "agents": 4,  "retries": 1},
        "complex":  {"time": 1800, "agents": 8,  "retries": 2},
        "critical": {"time": 3600, "agents": 12, "retries": 3},
    }
    baseline = baselines[task_complexity]

    # Component scores, capped at 1.0 (1.0 = at or better than baseline, <1.0 = worse)
    time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
    agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
    retry_score = max(0.0, 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3)))

    # Weighted average (time matters most)
    return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
```

### Standard Reason Codes

Use consistent codes to enable pattern analysis:

```yaml
outcome_reasons:
  success:
    - tests_passed_first_try
    - tests_passed_after_fix
    - review_approved
    - spec_validated
  partial:
    - tests_partial_pass
    - review_concerns_minor
    - timeout_partial_work
  failure:
    - tests_failed
    - review_blocked
    - dependency_missing
    - timeout_no_progress
    - error_unrecoverable

retry_reasons:
  - test_failure
  - lint_error
  - type_error
  - review_rejection
  - rate_limit
  - timeout
  - dependency_conflict

efficiency_factors:
  positive:
    - used_haiku_for_simple
    - parallel_execution
    - cached_result
    - first_try_success
    - spec_driven
  negative:
    - used_opus_for_simple
    - sequential_when_parallel_possible
    - multiple_retries
    - missing_context
    - unclear_requirements
```

### Storage Location

```
.loki/metrics/
├── efficiency/
│   ├── 2026-01-06.json    # Daily efficiency logs
│   └── aggregate.json     # Running averages by task type
└── rewards/
    ├── outcomes.json      # Task success/failure records
    └── preferences.json   # User preference signals
```

---

## Reward Signal Framework

### Three Reward Types (ToolOrchestra Pattern)

```
+------------------------------------------------------------------+
| 1. OUTCOME REWARD                                                |
|    - Did the task succeed? Binary + quality grade                |
|    - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure)       |
+------------------------------------------------------------------+
| 2. EFFICIENCY REWARD                                             |
|    - Did we use resources wisely?                                |
|    - Signal: 0.0 to 1.0 based on efficiency score                |
+------------------------------------------------------------------+
| 3. PREFERENCE REWARD                                             |
|    - Did the user like the approach/result?                      |
|    - Signal: Inferred from user actions (accept/reject/modify)   |
+------------------------------------------------------------------+
```

### Outcome Reward Implementation

```python
def calculate_outcome_reward(task_result):
    """
    Outcome reward based on task completion status.
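
    Grades range from +1.0 (tests and review both passed) down to
    -1.0 (failed); e.g. tests passed but review raised concerns -> 0.7.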
""" if task_result.status == "completed": # Grade the quality of completion if task_result.tests_passed and task_result.review_passed: return 1.0 # Full success elif task_result.tests_passed: return 0.7 # Tests pass but review had concerns else: return 0.3 # Completed but with issues elif task_result.status == "partial": return 0.0 # Partial completion, no reward else: # failed return -1.0 # Negative reward for failure ``` ### Preference Reward Implementation ```python def infer_preference_reward(task_result, user_actions): """ Infer user preference from their actions after task completion. Based on implicit feedback patterns. """ signals = [] # Positive signals if "commit" in user_actions: signals.append(0.8) # User committed our changes if "deploy" in user_actions: signals.append(1.0) # User deployed our changes if "no_edits" in user_actions: signals.append(0.6) # User didn't modify our output # Negative signals if "revert" in user_actions: signals.append(-1.0) # User reverted our changes if "manual_fix" in user_actions: signals.append(-0.5) # User had to fix our work if "retry_different" in user_actions: signals.append(-0.3) # User asked for different approach # Neutral (no signal) if not signals: return None return sum(signals) / len(signals) ``` ### Reward Aggregation for Learning ```python def aggregate_rewards(outcome, efficiency, preference): """ Combine rewards into single learning signal. Weights based on ToolOrchestra findings. 
""" # Outcome is most important (must succeed) # Efficiency secondary (once successful, optimize) # Preference tertiary (align with user style) weights = { "outcome": 0.6, "efficiency": 0.25, "preference": 0.15 } total = outcome * weights["outcome"] total += efficiency * weights["efficiency"] if preference is not None: total += preference * weights["preference"] else: # Redistribute weight if no preference signal total = total / (1 - weights["preference"]) return total ``` --- ## Dynamic Agent Selection ### Task Complexity Classification ```python def classify_task_complexity(task): """ Classify task to determine agent allocation. Based on ToolOrchestra's tool selection flexibility. """ complexity_signals = { # File scope signals "single_file": -1, "few_files": 0, # 2-5 files "many_files": +1, # 6-20 files "system_wide": +2, # 20+ files # Change type signals "typo_fix": -2, "bug_fix": 0, "feature": +1, "refactor": +1, "architecture": +2, # Domain signals "documentation": -1, "tests_only": 0, "frontend": 0, "backend": 0, "full_stack": +1, "infrastructure": +1, "security": +2, } score = 0 for signal, weight in complexity_signals.items(): if task.has_signal(signal): score += weight # Map score to complexity level if score <= -2: return "trivial" elif score <= 0: return "simple" elif score <= 2: return "moderate" elif score <= 4: return "complex" else: return "critical" ``` ### Agent Allocation by Complexity ```yaml # Agent allocation strategy # Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring complexity_allocations: trivial: max_agents: 1 planning: null # No planning needed development: haiku testing: haiku review: skip # No review needed for trivial parallel: false simple: max_agents: 2 planning: null # No planning needed development: haiku testing: haiku review: single # One quick review parallel: false moderate: max_agents: 4 planning: sonnet # Sonnet for moderate planning development: sonnet testing: haiku # Unit tests always haiku 
review: standard # 3 parallel reviewers parallel: true complex: max_agents: 8 planning: opus # Opus ONLY for complex planning development: sonnet # Sonnet for implementation testing: haiku # Unit tests still haiku review: deep # 3 reviewers + devil's advocate parallel: true critical: max_agents: 12 planning: opus # Opus for critical planning development: sonnet # Sonnet for implementation testing: sonnet # Functional/E2E tests with sonnet review: exhaustive # Multiple review rounds parallel: true human_checkpoint: true # Pause for human review ``` ### Dynamic Selection Algorithm ```python def select_agents_for_task(task, available_agents): """ Dynamically select agents based on task requirements. Inspired by ToolOrchestra's configurable tool selection. """ complexity = classify_task_complexity(task) allocation = COMPLEXITY_ALLOCATIONS[complexity] # 1. Identify required agent types required_types = identify_required_agents(task) # 2. Filter to available agents of required types candidates = [a for a in available_agents if a.type in required_types] # 3. Score candidates by past performance for agent in candidates: agent.selection_score = get_agent_performance_score( agent, task_type=task.type, complexity=complexity ) # 4. Select top N agents up to allocation limit candidates.sort(key=lambda a: a.selection_score, reverse=True) selected = candidates[:allocation["max_agents"]] # 5. Assign models based on complexity for agent in selected: if agent.role == "reviewer": agent.model = "opus" # Always opus for reviews else: agent.model = allocation["model"] return selected def get_agent_performance_score(agent, task_type, complexity): """ Score agent based on historical performance on similar tasks. Uses reward signals from previous executions. 
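
    Example: three similar past tasks with aggregate rewards of
    0.9, 0.7, and 0.8 average to a selection score of 0.8.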
""" history = load_agent_history(agent.id) # Filter to similar tasks similar = [h for h in history if h.task_type == task_type and h.complexity == complexity] if not similar: return 0.5 # Neutral score if no history # Average past rewards return sum(h.aggregate_reward for h in similar) / len(similar) ``` --- ## Tool Usage Analytics ### Track Tool Effectiveness ```json { "tool_analytics": { "period": "2026-01-06", "by_tool": { "Grep": { "calls": 142, "success_rate": 0.89, "avg_result_quality": 0.82, "common_patterns": ["error handling", "function def"] }, "Task": { "calls": 47, "success_rate": 0.94, "avg_efficiency": 0.76, "by_subagent_type": { "general-purpose": {"calls": 35, "success": 0.91}, "Explore": {"calls": 12, "success": 1.0} } } }, "insights": [ "Explore agent 100% success - use more for codebase search", "Grep success drops to 0.65 for regex patterns - simplify searches" ] } } ``` ### Continuous Improvement Loop ``` +------------------------------------------------------------------+ | 1. COLLECT | | Record every task: agents used, tools called, outcome | +------------------------------------------------------------------+ | v +------------------------------------------------------------------+ | 2. ANALYZE | | Weekly aggregation: What worked? What didn't? | | Identify patterns in high-reward vs low-reward tasks | +------------------------------------------------------------------+ | v +------------------------------------------------------------------+ | 3. ADAPT | | Update selection algorithms based on analytics | | Store successful patterns in semantic memory | +------------------------------------------------------------------+ | v +------------------------------------------------------------------+ | 4. 
VALIDATE | | A/B test new selection strategies | | Measure efficiency improvement | +------------------------------------------------------------------+ | +-----------> Loop back to COLLECT ``` --- ## Integration with RARV Cycle The orchestration patterns integrate with RARV at each phase: ``` REASON: ├── Check efficiency metrics for similar past tasks ├── Classify task complexity └── Select appropriate agent allocation ACT: ├── Dispatch agents according to allocation ├── Track start time and resource usage └── Record tool calls and agent interactions REFLECT: ├── Calculate outcome reward (did it work?) ├── Calculate efficiency reward (resource usage) └── Log to metrics store VERIFY: ├── Run verification checks ├── If failed: negative outcome reward, retry with learning ├── If passed: infer preference reward from user actions └── Update agent performance scores ``` --- ## Key Metrics Dashboard Track these metrics in `.loki/metrics/dashboard.json`: ```json { "dashboard": { "period": "rolling_7_days", "summary": { "tasks_completed": 127, "success_rate": 0.94, "avg_efficiency_score": 0.78, "avg_outcome_reward": 0.82, "avg_preference_reward": 0.71, "avg_recovery_rate": 0.87, "avg_goal_adherence": 0.93 }, "quality_pillars": { "tool_selection_accuracy": 0.91, "tool_reliability_rate": 0.93, "memory_retrieval_relevance": 0.84, "policy_adherence": 0.96 }, "trends": { "efficiency": "+12% vs previous week", "success_rate": "+3% vs previous week", "avg_agents_per_task": "-0.8 (improving)", "recovery_rate": "+5% vs previous week" }, "top_performing_patterns": [ "Haiku for unit tests (0.95 success, 0.92 efficiency)", "Explore agent for codebase search (1.0 success)", "Parallel review with opus (0.98 accuracy)" ], "areas_for_improvement": [ "Complex refactors taking 2x expected time", "Security review efficiency below baseline", "Memory retrieval relevance below 0.85 target" ] } } ``` --- ## Multi-Dimensional Evaluation Based on [Measurement Imbalance research (arXiv 
2506.02064)](https://arxiv.org/abs/2506.02064):

> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral."

**Loki Mode tracks four evaluation axes:**

| Axis | Metrics | Current Coverage |
|------|---------|------------------|
| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
| **Human-Centered** | preference_reward, goal_adherence | Partial |
| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
| **Economic** | model_usage, agents_spawned, wall_time | Full |

---

## Sources

**OpenAI Agents SDK:**

- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability

**Efficiency & Orchestration:**

- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis

**Evaluation Frameworks:**

- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm

**Best Practices:**

- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as a key failure mode

**Safety & Reasoning:**

- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)