Files
antigravity-skills-reference/skills/loki-mode/references/tool-orchestration.md

692 lines
22 KiB
Markdown

# Tool Orchestration Patterns Reference
Research-backed patterns inspired by NVIDIA ToolOrchestra, OpenAI Agents SDK, and multi-agent coordination research.
---
## Overview
Effective tool orchestration requires four key innovations:
1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
2. **Efficiency Metrics** - Track computational cost per task
3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
4. **Dynamic Selection** - Adapt agent count and types based on task complexity
---
## Tracing Spans Architecture (OpenAI SDK Pattern)
### Span Types
Every operation is wrapped in a typed span for observability:
```yaml
span_types:
agent_span: # Wraps entire agent execution
generation_span: # Wraps LLM API calls
function_span: # Wraps tool/function calls
guardrail_span: # Wraps validation checks
handoff_span: # Wraps agent-to-agent transfers
custom_span: # User-defined operations
```
### Hierarchical Trace Structure
```json
{
"trace_id": "trace_abc123def456",
"workflow_name": "implement_feature",
"group_id": "session_xyz789",
"spans": [
{
"span_id": "span_001",
"parent_id": null,
"type": "agent_span",
"agent_name": "orchestrator",
"started_at": "2026-01-07T10:00:00Z",
"ended_at": "2026-01-07T10:05:00Z",
"children": ["span_002", "span_003"]
},
{
"span_id": "span_002",
"parent_id": "span_001",
"type": "guardrail_span",
"guardrail_name": "input_validation",
"triggered": false,
"blocking": true
},
{
"span_id": "span_003",
"parent_id": "span_001",
"type": "handoff_span",
"from_agent": "orchestrator",
"to_agent": "backend-dev"
}
]
}
```
### Storage Location
```
.loki/traces/
├── active/
│ └── {trace_id}.json # Currently running traces
└── completed/
└── {date}/
└── {trace_id}.json # Archived traces
```
See `references/openai-patterns.md` for full tracing implementation.
---
## Efficiency Metrics System
### Why Track Efficiency?
ToolOrchestra achieves 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:
- **Token usage** per task (input + output)
- **Wall clock time** per task
- **Agent spawns** per task
- **Retry count** before success
### Efficiency Tracking Schema
```json
{
"task_id": "task-2026-01-06-001",
"correlation_id": "session-abc123",
"started_at": "2026-01-06T10:00:00Z",
"completed_at": "2026-01-06T10:05:32Z",
"metrics": {
"wall_time_seconds": 332,
"agents_spawned": 3,
"total_agent_calls": 7,
"retry_count": 1,
"retry_reasons": ["test_failure"],
"recovery_rate": 1.0,
"model_usage": {
"haiku": {"calls": 4, "est_tokens": 12000},
"sonnet": {"calls": 2, "est_tokens": 8000},
"opus": {"calls": 1, "est_tokens": 6000}
}
},
"outcome": "success",
"outcome_reason": "tests_passed_after_fix",
"efficiency_score": 0.85,
"efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
"quality_pillars": {
"tool_selection_correct": true,
"tool_reliability_rate": 0.95,
"memory_retrieval_relevant": true,
"goal_adherence": 1.0
}
}
```
**Why capture these metrics?** (Based on multi-agent research)
1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later))
- "UX debt turns into data debt" - recording actions without intent creates useless analytics
2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1))
- `recovery_rate = successful_retries / total_retries`
- Paper found "perfect tool sequencing but only 33% policy adherence" - surface metrics mask failures
3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/))
- `correlation_id`: Links all tasks in a session for end-to-end tracing
- Essential for debugging multi-agent coordination failures
4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/))
- `tool_selection_correct`: Did we pick the right tool?
- `tool_reliability_rate`: Did the tool work as expected? (tools can fail even when correctly selected)
- Key insight: "Tool use reliability" is a primary demo-to-deployment gap
5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1))
- `memory_retrieval_relevant`: Did episodic/semantic retrieval help?
- `goal_adherence`: Did we stay on task? (0.0-1.0 score)
### Efficiency Score Calculation
```python
def calculate_efficiency_score(metrics, task_complexity):
"""
Score from 0-1 where higher is more efficient.
Based on ToolOrchestra's efficiency reward signal.
"""
# Baseline expectations by complexity
baselines = {
"trivial": {"time": 60, "agents": 1, "retries": 0},
"simple": {"time": 180, "agents": 2, "retries": 0},
"moderate": {"time": 600, "agents": 4, "retries": 1},
"complex": {"time": 1800, "agents": 8, "retries": 2},
"critical": {"time": 3600, "agents": 12, "retries": 3}
}
baseline = baselines[task_complexity]
# Calculate component scores (1.0 = at baseline, >1 = better, <1 = worse)
time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
retry_score = 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3))
# Weighted average (time matters most)
return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
```
### Standard Reason Codes
Use consistent codes to enable pattern analysis:
```yaml
outcome_reasons:
success:
- tests_passed_first_try
- tests_passed_after_fix
- review_approved
- spec_validated
partial:
- tests_partial_pass
- review_concerns_minor
- timeout_partial_work
failure:
- tests_failed
- review_blocked
- dependency_missing
- timeout_no_progress
- error_unrecoverable
retry_reasons:
- test_failure
- lint_error
- type_error
- review_rejection
- rate_limit
- timeout
- dependency_conflict
efficiency_factors:
positive:
- used_haiku_for_simple
- parallel_execution
- cached_result
- first_try_success
- spec_driven
negative:
- used_opus_for_simple
- sequential_when_parallel_possible
- multiple_retries
- missing_context
- unclear_requirements
```
### Storage Location
```
.loki/metrics/
├── efficiency/
│ ├── 2026-01-06.json # Daily efficiency logs
│ └── aggregate.json # Running averages by task type
└── rewards/
├── outcomes.json # Task success/failure records
└── preferences.json # User preference signals
```
---
## Reward Signal Framework
### Three Reward Types (ToolOrchestra Pattern)
```
+------------------------------------------------------------------+
| 1. OUTCOME REWARD |
| - Did the task succeed? Binary + quality grade |
| - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure) |
+------------------------------------------------------------------+
| 2. EFFICIENCY REWARD |
| - Did we use resources wisely? |
| - Signal: 0.0 to 1.0 based on efficiency score |
+------------------------------------------------------------------+
| 3. PREFERENCE REWARD |
| - Did the user like the approach/result? |
| - Signal: Inferred from user actions (accept/reject/modify) |
+------------------------------------------------------------------+
```
### Outcome Reward Implementation
```python
def calculate_outcome_reward(task_result):
"""
Outcome reward based on task completion status.
"""
if task_result.status == "completed":
# Grade the quality of completion
if task_result.tests_passed and task_result.review_passed:
return 1.0 # Full success
elif task_result.tests_passed:
return 0.7 # Tests pass but review had concerns
else:
return 0.3 # Completed but with issues
elif task_result.status == "partial":
return 0.0 # Partial completion, no reward
else: # failed
return -1.0 # Negative reward for failure
```
### Preference Reward Implementation
```python
def infer_preference_reward(task_result, user_actions):
"""
Infer user preference from their actions after task completion.
Based on implicit feedback patterns.
"""
signals = []
# Positive signals
if "commit" in user_actions:
signals.append(0.8) # User committed our changes
if "deploy" in user_actions:
signals.append(1.0) # User deployed our changes
if "no_edits" in user_actions:
signals.append(0.6) # User didn't modify our output
# Negative signals
if "revert" in user_actions:
signals.append(-1.0) # User reverted our changes
if "manual_fix" in user_actions:
signals.append(-0.5) # User had to fix our work
if "retry_different" in user_actions:
signals.append(-0.3) # User asked for different approach
# Neutral (no signal)
if not signals:
return None
return sum(signals) / len(signals)
```
### Reward Aggregation for Learning
```python
def aggregate_rewards(outcome, efficiency, preference):
"""
Combine rewards into single learning signal.
Weights based on ToolOrchestra findings.
"""
# Outcome is most important (must succeed)
# Efficiency secondary (once successful, optimize)
# Preference tertiary (align with user style)
weights = {
"outcome": 0.6,
"efficiency": 0.25,
"preference": 0.15
}
total = outcome * weights["outcome"]
total += efficiency * weights["efficiency"]
if preference is not None:
total += preference * weights["preference"]
else:
# Redistribute weight if no preference signal
total = total / (1 - weights["preference"])
return total
```
---
## Dynamic Agent Selection
### Task Complexity Classification
```python
def classify_task_complexity(task):
"""
Classify task to determine agent allocation.
Based on ToolOrchestra's tool selection flexibility.
"""
complexity_signals = {
# File scope signals
"single_file": -1,
"few_files": 0, # 2-5 files
"many_files": +1, # 6-20 files
"system_wide": +2, # 20+ files
# Change type signals
"typo_fix": -2,
"bug_fix": 0,
"feature": +1,
"refactor": +1,
"architecture": +2,
# Domain signals
"documentation": -1,
"tests_only": 0,
"frontend": 0,
"backend": 0,
"full_stack": +1,
"infrastructure": +1,
"security": +2,
}
score = 0
for signal, weight in complexity_signals.items():
if task.has_signal(signal):
score += weight
# Map score to complexity level
if score <= -2:
return "trivial"
elif score <= 0:
return "simple"
elif score <= 2:
return "moderate"
elif score <= 4:
return "complex"
else:
return "critical"
```
### Agent Allocation by Complexity
```yaml
# Agent allocation strategy
# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
complexity_allocations:
trivial:
max_agents: 1
planning: null # No planning needed
development: haiku
testing: haiku
review: skip # No review needed for trivial
parallel: false
simple:
max_agents: 2
planning: null # No planning needed
development: haiku
testing: haiku
review: single # One quick review
parallel: false
moderate:
max_agents: 4
planning: sonnet # Sonnet for moderate planning
development: sonnet
testing: haiku # Unit tests always haiku
review: standard # 3 parallel reviewers
parallel: true
complex:
max_agents: 8
planning: opus # Opus ONLY for complex planning
development: sonnet # Sonnet for implementation
testing: haiku # Unit tests still haiku
review: deep # 3 reviewers + devil's advocate
parallel: true
critical:
max_agents: 12
planning: opus # Opus for critical planning
development: sonnet # Sonnet for implementation
testing: sonnet # Functional/E2E tests with sonnet
review: exhaustive # Multiple review rounds
parallel: true
human_checkpoint: true # Pause for human review
```
### Dynamic Selection Algorithm
```python
def select_agents_for_task(task, available_agents):
"""
Dynamically select agents based on task requirements.
Inspired by ToolOrchestra's configurable tool selection.
"""
complexity = classify_task_complexity(task)
allocation = COMPLEXITY_ALLOCATIONS[complexity]
# 1. Identify required agent types
required_types = identify_required_agents(task)
# 2. Filter to available agents of required types
candidates = [a for a in available_agents if a.type in required_types]
# 3. Score candidates by past performance
for agent in candidates:
agent.selection_score = get_agent_performance_score(
agent,
task_type=task.type,
complexity=complexity
)
# 4. Select top N agents up to allocation limit
candidates.sort(key=lambda a: a.selection_score, reverse=True)
selected = candidates[:allocation["max_agents"]]
# 5. Assign models based on complexity
for agent in selected:
if agent.role == "reviewer":
agent.model = "opus" # Always opus for reviews
else:
agent.model = allocation["model"]
return selected
def get_agent_performance_score(agent, task_type, complexity):
"""
Score agent based on historical performance on similar tasks.
Uses reward signals from previous executions.
"""
history = load_agent_history(agent.id)
# Filter to similar tasks
similar = [h for h in history
if h.task_type == task_type
and h.complexity == complexity]
if not similar:
return 0.5 # Neutral score if no history
# Average past rewards
return sum(h.aggregate_reward for h in similar) / len(similar)
```
---
## Tool Usage Analytics
### Track Tool Effectiveness
```json
{
"tool_analytics": {
"period": "2026-01-06",
"by_tool": {
"Grep": {
"calls": 142,
"success_rate": 0.89,
"avg_result_quality": 0.82,
"common_patterns": ["error handling", "function def"]
},
"Task": {
"calls": 47,
"success_rate": 0.94,
"avg_efficiency": 0.76,
"by_subagent_type": {
"general-purpose": {"calls": 35, "success": 0.91},
"Explore": {"calls": 12, "success": 1.0}
}
}
},
"insights": [
"Explore agent 100% success - use more for codebase search",
"Grep success drops to 0.65 for regex patterns - simplify searches"
]
}
}
```
### Continuous Improvement Loop
```
+------------------------------------------------------------------+
| 1. COLLECT |
| Record every task: agents used, tools called, outcome |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 2. ANALYZE |
| Weekly aggregation: What worked? What didn't? |
| Identify patterns in high-reward vs low-reward tasks |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 3. ADAPT |
| Update selection algorithms based on analytics |
| Store successful patterns in semantic memory |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 4. VALIDATE |
| A/B test new selection strategies |
| Measure efficiency improvement |
+------------------------------------------------------------------+
|
+-----------> Loop back to COLLECT
```
---
## Integration with RARV Cycle
The orchestration patterns integrate with RARV at each phase:
```
REASON:
├── Check efficiency metrics for similar past tasks
├── Classify task complexity
└── Select appropriate agent allocation
ACT:
├── Dispatch agents according to allocation
├── Track start time and resource usage
└── Record tool calls and agent interactions
REFLECT:
├── Calculate outcome reward (did it work?)
├── Calculate efficiency reward (resource usage)
└── Log to metrics store
VERIFY:
├── Run verification checks
├── If failed: negative outcome reward, retry with learning
├── If passed: infer preference reward from user actions
└── Update agent performance scores
```
---
## Key Metrics Dashboard
Track these metrics in `.loki/metrics/dashboard.json`:
```json
{
"dashboard": {
"period": "rolling_7_days",
"summary": {
"tasks_completed": 127,
"success_rate": 0.94,
"avg_efficiency_score": 0.78,
"avg_outcome_reward": 0.82,
"avg_preference_reward": 0.71,
"avg_recovery_rate": 0.87,
"avg_goal_adherence": 0.93
},
"quality_pillars": {
"tool_selection_accuracy": 0.91,
"tool_reliability_rate": 0.93,
"memory_retrieval_relevance": 0.84,
"policy_adherence": 0.96
},
"trends": {
"efficiency": "+12% vs previous week",
"success_rate": "+3% vs previous week",
"avg_agents_per_task": "-0.8 (improving)",
"recovery_rate": "+5% vs previous week"
},
"top_performing_patterns": [
"Haiku for unit tests (0.95 success, 0.92 efficiency)",
"Explore agent for codebase search (1.0 success)",
"Parallel review with opus (0.98 accuracy)"
],
"areas_for_improvement": [
"Complex refactors taking 2x expected time",
"Security review efficiency below baseline",
"Memory retrieval relevance below 0.85 target"
]
}
}
```
---
## Multi-Dimensional Evaluation
Based on [Measurement Imbalance research (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064):
> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"
**Loki Mode tracks four evaluation axes:**
| Axis | Metrics | Current Coverage |
|------|---------|------------------|
| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
| **Human-Centered** | preference_reward, goal_adherence | Partial |
| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
| **Economic** | model_usage, agents_spawned, wall_time | Full |
---
## Sources
**OpenAI Agents SDK:**
- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability
**Efficiency & Orchestration:**
- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis
**Evaluation Frameworks:**
- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm
**Best Practices:**
- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as key failure mode
**Safety & Reasoning:**
- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)