# Tool Orchestration Patterns Reference

Research-backed patterns inspired by NVIDIA ToolOrchestra, the OpenAI Agents SDK, and multi-agent coordination research.

---

## Overview

Effective tool orchestration requires four key innovations:

1. **Tracing Spans** - Hierarchical event tracking (OpenAI SDK pattern)
2. **Efficiency Metrics** - Track computational cost per task
3. **Reward Signals** - Outcome, efficiency, and preference rewards for learning
4. **Dynamic Selection** - Adapt agent count and types based on task complexity

---
## Tracing Spans Architecture (OpenAI SDK Pattern)

### Span Types

Every operation is wrapped in a typed span for observability:

```yaml
span_types:
  agent_span:      # Wraps entire agent execution
  generation_span: # Wraps LLM API calls
  function_span:   # Wraps tool/function calls
  guardrail_span:  # Wraps validation checks
  handoff_span:    # Wraps agent-to-agent transfers
  custom_span:     # User-defined operations
```

### Hierarchical Trace Structure

```json
{
  "trace_id": "trace_abc123def456",
  "workflow_name": "implement_feature",
  "group_id": "session_xyz789",
  "spans": [
    {
      "span_id": "span_001",
      "parent_id": null,
      "type": "agent_span",
      "agent_name": "orchestrator",
      "started_at": "2026-01-07T10:00:00Z",
      "ended_at": "2026-01-07T10:05:00Z",
      "children": ["span_002", "span_003"]
    },
    {
      "span_id": "span_002",
      "parent_id": "span_001",
      "type": "guardrail_span",
      "guardrail_name": "input_validation",
      "triggered": false,
      "blocking": true
    },
    {
      "span_id": "span_003",
      "parent_id": "span_001",
      "type": "handoff_span",
      "from_agent": "orchestrator",
      "to_agent": "backend-dev"
    }
  ]
}
```

### Storage Location

```
.loki/traces/
├── active/
│   └── {trace_id}.json      # Currently running traces
└── completed/
    └── {date}/
        └── {trace_id}.json  # Archived traces
```

See `references/openai-patterns.md` for the full tracing implementation.
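As a concrete illustration, a trace shaped like the JSON above can be built with a context-manager based recorder. `TraceRecorder` below is a hypothetical sketch that mirrors the field names in the schema; it is not part of the OpenAI Agents SDK.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

class TraceRecorder:
    """Illustrative recorder producing traces shaped like the JSON above."""

    def __init__(self, workflow_name, group_id):
        self.trace = {
            "trace_id": f"trace_{uuid.uuid4().hex[:12]}",
            "workflow_name": workflow_name,
            "group_id": group_id,
            "spans": [],
        }
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, span_type, **fields):
        span = {
            "span_id": f"span_{len(self.trace['spans']) + 1:03d}",
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "type": span_type,
            "started_at": datetime.now(timezone.utc).isoformat(),
            "children": [],
            **fields,
        }
        if self._stack:
            # Register this span as a child of the enclosing span
            self._stack[-1]["children"].append(span["span_id"])
        self.trace["spans"].append(span)
        self._stack.append(span)
        try:
            yield span
        finally:
            span["ended_at"] = datetime.now(timezone.utc).isoformat()
            self._stack.pop()

# Usage mirroring the example trace:
rec = TraceRecorder("implement_feature", "session_xyz789")
with rec.span("agent_span", agent_name="orchestrator"):
    with rec.span("guardrail_span", guardrail_name="input_validation",
                  triggered=False, blocking=True):
        pass
```

Nesting `with` blocks is what produces the `parent_id`/`children` hierarchy; a real implementation would also persist the finished trace under `.loki/traces/`.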
---

## Efficiency Metrics System

### Why Track Efficiency?

ToolOrchestra achieves a 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:

- **Token usage** per task (input + output)
- **Wall clock time** per task
- **Agent spawns** per task
- **Retry count** before success

### Efficiency Tracking Schema

```json
{
  "task_id": "task-2026-01-06-001",
  "correlation_id": "session-abc123",
  "started_at": "2026-01-06T10:00:00Z",
  "completed_at": "2026-01-06T10:05:32Z",
  "metrics": {
    "wall_time_seconds": 332,
    "agents_spawned": 3,
    "total_agent_calls": 7,
    "retry_count": 1,
    "retry_reasons": ["test_failure"],
    "recovery_rate": 1.0,
    "model_usage": {
      "haiku": {"calls": 4, "est_tokens": 12000},
      "sonnet": {"calls": 2, "est_tokens": 8000},
      "opus": {"calls": 1, "est_tokens": 6000}
    }
  },
  "outcome": "success",
  "outcome_reason": "tests_passed_after_fix",
  "efficiency_score": 0.85,
  "efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
  "quality_pillars": {
    "tool_selection_correct": true,
    "tool_reliability_rate": 0.95,
    "memory_retrieval_relevant": true,
    "goal_adherence": 1.0
  }
}
```
**Why capture these metrics?** (Based on multi-agent research)

1. **Capture intent, not just actions** ([Hashrocket](https://hashrocket.substack.com/p/the-hidden-cost-of-well-fix-it-later))
   - "UX debt turns into data debt" - recording actions without intent creates useless analytics

2. **Track recovery rate** ([Assessment Framework, arXiv 2512.12791](https://arxiv.org/html/2512.12791v1))
   - `recovery_rate = successful_retries / total_retries`
   - Paper found "perfect tool sequencing but only 33% policy adherence" - surface metrics mask failures

3. **Distributed tracing** ([Maxim AI](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/))
   - `correlation_id`: Links all tasks in a session for end-to-end tracing
   - Essential for debugging multi-agent coordination failures

4. **Tool reliability separate from selection** ([Stanford/Harvard](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/))
   - `tool_selection_correct`: Did we pick the right tool?
   - `tool_reliability_rate`: Did the tool work as expected? (tools can fail even when correctly selected)
   - Key insight: "Tool use reliability" is a primary demo-to-deployment gap

5. **Quality pillars beyond outcomes** ([Assessment Framework](https://arxiv.org/html/2512.12791v1))
   - `memory_retrieval_relevant`: Did episodic/semantic retrieval help?
   - `goal_adherence`: Did we stay on task? (0.0-1.0 score)
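The recovery-rate and "surface metrics mask failures" points above can be made concrete in a few lines. The 0.8 `adherence_floor` below is an illustrative threshold, not a value from the paper:

```python
def recovery_rate(successful_retries: int, total_retries: int) -> float:
    """recovery_rate = successful_retries / total_retries.
    With no retries there was nothing to recover from; report 1.0 by convention."""
    if total_retries == 0:
        return 1.0
    return successful_retries / total_retries

def masks_failure(pillars: dict, adherence_floor: float = 0.8) -> bool:
    """The paper's warning in code form: a run can look perfect on tool
    sequencing while violating policy. Flag that combination."""
    return (pillars.get("tool_selection_correct", False)
            and pillars.get("goal_adherence", 0.0) < adherence_floor)
```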
### Efficiency Score Calculation

```python
def calculate_efficiency_score(metrics, task_complexity):
    """
    Score from 0-1 where higher is more efficient.
    Based on ToolOrchestra's efficiency reward signal.
    """
    # Baseline expectations by complexity
    baselines = {
        "trivial":  {"time": 60,   "agents": 1,  "retries": 0},
        "simple":   {"time": 180,  "agents": 2,  "retries": 0},
        "moderate": {"time": 600,  "agents": 4,  "retries": 1},
        "complex":  {"time": 1800, "agents": 8,  "retries": 2},
        "critical": {"time": 3600, "agents": 12, "retries": 3},
    }

    baseline = baselines[task_complexity]

    # Component scores, each clamped to [0, 1]
    # (1.0 = at or better than baseline, lower = worse than baseline)
    time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
    agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
    retry_score = max(0.0, 1.0 - (metrics["retry_count"] / (baseline["retries"] + 3)))

    # Weighted average (time matters most)
    return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
```
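As a worked example of the formula, using the "moderate" baseline (600s, 4 agents, 1 retry) and the observed metrics from the tracking schema earlier (332s wall time, 3 agents, 1 retry):

```python
time_score = min(1.0, 600 / 332)   # 1.0 — faster than baseline, capped
agent_score = min(1.0, 4 / 3)      # 1.0 — fewer agents than baseline, capped
retry_score = 1.0 - 1 / (1 + 3)    # 0.75 — one retry against a baseline of one
score = 0.5 * time_score + 0.3 * agent_score + 0.2 * retry_score  # 0.95
```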
### Standard Reason Codes

Use consistent codes to enable pattern analysis:

```yaml
outcome_reasons:
  success:
    - tests_passed_first_try
    - tests_passed_after_fix
    - review_approved
    - spec_validated
  partial:
    - tests_partial_pass
    - review_concerns_minor
    - timeout_partial_work
  failure:
    - tests_failed
    - review_blocked
    - dependency_missing
    - timeout_no_progress
    - error_unrecoverable

retry_reasons:
  - test_failure
  - lint_error
  - type_error
  - review_rejection
  - rate_limit
  - timeout
  - dependency_conflict

efficiency_factors:
  positive:
    - used_haiku_for_simple
    - parallel_execution
    - cached_result
    - first_try_success
    - spec_driven
  negative:
    - used_opus_for_simple
    - sequential_when_parallel_possible
    - multiple_retries
    - missing_context
    - unclear_requirements
```
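One way to keep the taxonomy consistent is to validate reason codes at write time. The sketch below flattens the `outcome_reasons` table into Python; `validate_outcome` is an illustrative helper, not an existing API.

```python
# Flattened from the YAML taxonomy above; extend as new codes are added.
OUTCOME_REASONS = {
    "success": {"tests_passed_first_try", "tests_passed_after_fix",
                "review_approved", "spec_validated"},
    "partial": {"tests_partial_pass", "review_concerns_minor",
                "timeout_partial_work"},
    "failure": {"tests_failed", "review_blocked", "dependency_missing",
                "timeout_no_progress", "error_unrecoverable"},
}

def validate_outcome(outcome: str, reason: str) -> bool:
    """Reject free-form reason strings so pattern analysis stays possible."""
    return reason in OUTCOME_REASONS.get(outcome, set())
```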
### Storage Location

```
.loki/metrics/
├── efficiency/
│   ├── 2026-01-06.json   # Daily efficiency logs
│   └── aggregate.json    # Running averages by task type
└── rewards/
    ├── outcomes.json     # Task success/failure records
    └── preferences.json  # User preference signals
```

---
## Reward Signal Framework

### Three Reward Types (ToolOrchestra Pattern)

```
+------------------------------------------------------------------+
| 1. OUTCOME REWARD                                                |
|    - Did the task succeed? Binary + quality grade                |
|    - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure)       |
+------------------------------------------------------------------+
| 2. EFFICIENCY REWARD                                             |
|    - Did we use resources wisely?                                |
|    - Signal: 0.0 to 1.0 based on efficiency score                |
+------------------------------------------------------------------+
| 3. PREFERENCE REWARD                                             |
|    - Did the user like the approach/result?                      |
|    - Signal: Inferred from user actions (accept/reject/modify)   |
+------------------------------------------------------------------+
```

### Outcome Reward Implementation

```python
def calculate_outcome_reward(task_result):
    """
    Outcome reward based on task completion status.
    """
    if task_result.status == "completed":
        # Grade the quality of completion
        if task_result.tests_passed and task_result.review_passed:
            return 1.0   # Full success
        elif task_result.tests_passed:
            return 0.7   # Tests pass but review had concerns
        else:
            return 0.3   # Completed but with issues

    elif task_result.status == "partial":
        return 0.0       # Partial completion, no reward

    else:  # failed
        return -1.0      # Negative reward for failure
```

### Preference Reward Implementation

```python
def infer_preference_reward(task_result, user_actions):
    """
    Infer user preference from their actions after task completion.
    Based on implicit feedback patterns.
    """
    signals = []

    # Positive signals
    if "commit" in user_actions:
        signals.append(0.8)   # User committed our changes
    if "deploy" in user_actions:
        signals.append(1.0)   # User deployed our changes
    if "no_edits" in user_actions:
        signals.append(0.6)   # User didn't modify our output

    # Negative signals
    if "revert" in user_actions:
        signals.append(-1.0)  # User reverted our changes
    if "manual_fix" in user_actions:
        signals.append(-0.5)  # User had to fix our work
    if "retry_different" in user_actions:
        signals.append(-0.3)  # User asked for different approach

    # Neutral (no signal)
    if not signals:
        return None

    return sum(signals) / len(signals)
```

### Reward Aggregation for Learning

```python
def aggregate_rewards(outcome, efficiency, preference):
    """
    Combine rewards into single learning signal.
    Weights based on ToolOrchestra findings.
    """
    # Outcome is most important (must succeed)
    # Efficiency secondary (once successful, optimize)
    # Preference tertiary (align with user style)
    weights = {
        "outcome": 0.6,
        "efficiency": 0.25,
        "preference": 0.15
    }

    total = outcome * weights["outcome"]
    total += efficiency * weights["efficiency"]

    if preference is not None:
        total += preference * weights["preference"]
    else:
        # Redistribute weight if no preference signal
        total = total / (1 - weights["preference"])

    return total
```
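A quick worked example of the weighting, including the redistribution path. Without redistribution, a task that happens to lack a preference signal could score at most 0.85 (0.6 + 0.25) even when outcome and efficiency are perfect:

```python
weights = {"outcome": 0.6, "efficiency": 0.25, "preference": 0.15}
outcome, efficiency = 1.0, 0.8

# With a preference signal of, say, 0.5:
with_pref = (outcome * weights["outcome"]
             + efficiency * weights["efficiency"]
             + 0.5 * weights["preference"])   # 0.6 + 0.2 + 0.075 = 0.875

# Without one, divide by the reachable weight so scores stay comparable:
partial = outcome * weights["outcome"] + efficiency * weights["efficiency"]  # 0.8
without_pref = partial / (1 - weights["preference"])                         # 0.8 / 0.85
```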
---

## Dynamic Agent Selection

### Task Complexity Classification

```python
def classify_task_complexity(task):
    """
    Classify task to determine agent allocation.
    Based on ToolOrchestra's tool selection flexibility.
    """
    complexity_signals = {
        # File scope signals
        "single_file": -1,
        "few_files": 0,     # 2-5 files
        "many_files": +1,   # 6-20 files
        "system_wide": +2,  # 20+ files

        # Change type signals
        "typo_fix": -2,
        "bug_fix": 0,
        "feature": +1,
        "refactor": +1,
        "architecture": +2,

        # Domain signals
        "documentation": -1,
        "tests_only": 0,
        "frontend": 0,
        "backend": 0,
        "full_stack": +1,
        "infrastructure": +1,
        "security": +2,
    }

    score = 0
    for signal, weight in complexity_signals.items():
        if task.has_signal(signal):
            score += weight

    # Map score to complexity level
    if score <= -2:
        return "trivial"
    elif score <= 0:
        return "simple"
    elif score <= 2:
        return "moderate"
    elif score <= 4:
        return "complex"
    else:
        return "critical"
```

### Agent Allocation by Complexity

```yaml
# Agent allocation strategy
# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
complexity_allocations:
  trivial:
    max_agents: 1
    planning: null      # No planning needed
    development: haiku
    testing: haiku
    review: skip        # No review needed for trivial
    parallel: false

  simple:
    max_agents: 2
    planning: null      # No planning needed
    development: haiku
    testing: haiku
    review: single      # One quick review
    parallel: false

  moderate:
    max_agents: 4
    planning: sonnet    # Sonnet for moderate planning
    development: sonnet
    testing: haiku      # Unit tests always haiku
    review: standard    # 3 parallel reviewers
    parallel: true

  complex:
    max_agents: 8
    planning: opus      # Opus ONLY for complex planning
    development: sonnet # Sonnet for implementation
    testing: haiku      # Unit tests still haiku
    review: deep        # 3 reviewers + devil's advocate
    parallel: true

  critical:
    max_agents: 12
    planning: opus      # Opus for critical planning
    development: sonnet # Sonnet for implementation
    testing: sonnet     # Functional/E2E tests with sonnet
    review: exhaustive  # Multiple review rounds
    parallel: true
    human_checkpoint: true  # Pause for human review
```
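For dispatch code, the YAML table above can be mirrored as a Python constant. Two levels are shown here for brevity (the others follow the same shape), and `model_for` is an illustrative lookup helper, not an existing API:

```python
from typing import Optional

# Python mirror of the YAML allocation table above (trivial/critical shown).
COMPLEXITY_ALLOCATIONS = {
    "trivial": {"max_agents": 1, "planning": None, "development": "haiku",
                "testing": "haiku", "review": "skip", "parallel": False},
    "critical": {"max_agents": 12, "planning": "opus", "development": "sonnet",
                 "testing": "sonnet", "review": "exhaustive", "parallel": True,
                 "human_checkpoint": True},
}

def model_for(role: str, complexity: str) -> Optional[str]:
    """Resolve the model assigned to a role at a given complexity level."""
    return COMPLEXITY_ALLOCATIONS[complexity].get(role)
```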
### Dynamic Selection Algorithm

```python
def select_agents_for_task(task, available_agents):
    """
    Dynamically select agents based on task requirements.
    Inspired by ToolOrchestra's configurable tool selection.
    """
    complexity = classify_task_complexity(task)
    allocation = COMPLEXITY_ALLOCATIONS[complexity]

    # 1. Identify required agent types
    required_types = identify_required_agents(task)

    # 2. Filter to available agents of required types
    candidates = [a for a in available_agents if a.type in required_types]

    # 3. Score candidates by past performance
    for agent in candidates:
        agent.selection_score = get_agent_performance_score(
            agent,
            task_type=task.type,
            complexity=complexity
        )

    # 4. Select top N agents up to allocation limit
    candidates.sort(key=lambda a: a.selection_score, reverse=True)
    selected = candidates[:allocation["max_agents"]]

    # 5. Assign models by role from the allocation table
    for agent in selected:
        if agent.role == "reviewer":
            agent.model = "opus"  # Always opus for reviews
        else:
            # Roles map to allocation keys (planning/development/testing);
            # fall back to the development model for unlisted roles
            agent.model = allocation.get(agent.role, allocation["development"])

    return selected


def get_agent_performance_score(agent, task_type, complexity):
    """
    Score agent based on historical performance on similar tasks.
    Uses reward signals from previous executions.
    """
    history = load_agent_history(agent.id)

    # Filter to similar tasks
    similar = [h for h in history
               if h.task_type == task_type
               and h.complexity == complexity]

    if not similar:
        return 0.5  # Neutral score if no history

    # Average past rewards
    return sum(h.aggregate_reward for h in similar) / len(similar)
```
---

## Tool Usage Analytics

### Track Tool Effectiveness

```json
{
  "tool_analytics": {
    "period": "2026-01-06",
    "by_tool": {
      "Grep": {
        "calls": 142,
        "success_rate": 0.89,
        "avg_result_quality": 0.82,
        "common_patterns": ["error handling", "function def"]
      },
      "Task": {
        "calls": 47,
        "success_rate": 0.94,
        "avg_efficiency": 0.76,
        "by_subagent_type": {
          "general-purpose": {"calls": 35, "success": 0.91},
          "Explore": {"calls": 12, "success": 1.0}
        }
      }
    },
    "insights": [
      "Explore agent 100% success - use more for codebase search",
      "Grep success drops to 0.65 for regex patterns - simplify searches"
    ]
  }
}
```
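A report like the one above can be rolled up from a raw call log. The sketch below assumes a minimal log record shape (`{"tool", "ok"}`, which is not specified elsewhere in this document); the output field names follow the JSON schema:

```python
from collections import defaultdict

def aggregate_tool_calls(call_log):
    """Roll raw call records up into per-tool stats like the report above."""
    stats = defaultdict(lambda: {"calls": 0, "successes": 0})
    for call in call_log:
        s = stats[call["tool"]]
        s["calls"] += 1
        s["successes"] += int(call["ok"])
    # Emit the schema's shape: calls and success_rate per tool
    return {tool: {"calls": s["calls"],
                   "success_rate": round(s["successes"] / s["calls"], 2)}
            for tool, s in stats.items()}
```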
### Continuous Improvement Loop

```
+------------------------------------------------------------------+
| 1. COLLECT                                                       |
|    Record every task: agents used, tools called, outcome         |
+------------------------------------------------------------------+
                                |
                                v
+------------------------------------------------------------------+
| 2. ANALYZE                                                       |
|    Weekly aggregation: What worked? What didn't?                 |
|    Identify patterns in high-reward vs low-reward tasks          |
+------------------------------------------------------------------+
                                |
                                v
+------------------------------------------------------------------+
| 3. ADAPT                                                         |
|    Update selection algorithms based on analytics                |
|    Store successful patterns in semantic memory                  |
+------------------------------------------------------------------+
                                |
                                v
+------------------------------------------------------------------+
| 4. VALIDATE                                                      |
|    A/B test new selection strategies                             |
|    Measure efficiency improvement                                |
+------------------------------------------------------------------+
                                |
                                +-----------> Loop back to COLLECT
```
---

## Integration with RARV Cycle

The orchestration patterns integrate with RARV at each phase:

```
REASON:
├── Check efficiency metrics for similar past tasks
├── Classify task complexity
└── Select appropriate agent allocation

ACT:
├── Dispatch agents according to allocation
├── Track start time and resource usage
└── Record tool calls and agent interactions

REFLECT:
├── Calculate outcome reward (did it work?)
├── Calculate efficiency reward (resource usage)
└── Log to metrics store

VERIFY:
├── Run verification checks
├── If failed: negative outcome reward, retry with learning
├── If passed: infer preference reward from user actions
└── Update agent performance scores
```

---
## Key Metrics Dashboard

Track these metrics in `.loki/metrics/dashboard.json`:

```json
{
  "dashboard": {
    "period": "rolling_7_days",
    "summary": {
      "tasks_completed": 127,
      "success_rate": 0.94,
      "avg_efficiency_score": 0.78,
      "avg_outcome_reward": 0.82,
      "avg_preference_reward": 0.71,
      "avg_recovery_rate": 0.87,
      "avg_goal_adherence": 0.93
    },
    "quality_pillars": {
      "tool_selection_accuracy": 0.91,
      "tool_reliability_rate": 0.93,
      "memory_retrieval_relevance": 0.84,
      "policy_adherence": 0.96
    },
    "trends": {
      "efficiency": "+12% vs previous week",
      "success_rate": "+3% vs previous week",
      "avg_agents_per_task": "-0.8 (improving)",
      "recovery_rate": "+5% vs previous week"
    },
    "top_performing_patterns": [
      "Haiku for unit tests (0.95 success, 0.92 efficiency)",
      "Explore agent for codebase search (1.0 success)",
      "Parallel review with opus (0.98 accuracy)"
    ],
    "areas_for_improvement": [
      "Complex refactors taking 2x expected time",
      "Security review efficiency below baseline",
      "Memory retrieval relevance below 0.85 target"
    ]
  }
}
```
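The percentage strings under `trends` can be derived mechanically from two periods' summaries. `trend` below is an assumed helper (not an existing Loki Mode function), with output formatted to match the dashboard example:

```python
def trend(current: float, previous: float) -> str:
    """Format a week-over-week delta like the "+12% vs previous week" strings."""
    if previous == 0:
        return "n/a (no baseline)"
    pct = round(100 * (current - previous) / previous)
    return f"{pct:+d}% vs previous week"
```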
---

## Multi-Dimensional Evaluation

Based on [Measurement Imbalance research (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064):

> "Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"

**Loki Mode tracks four evaluation axes:**

| Axis | Metrics | Current Coverage |
|------|---------|------------------|
| **Technical** | success_rate, efficiency_score, recovery_rate | Full |
| **Human-Centered** | preference_reward, goal_adherence | Partial |
| **Safety** | policy_adherence, quality_gates_passed | Full (via review system) |
| **Economic** | model_usage, agents_spawned, wall_time | Full |

---
## Sources

**OpenAI Agents SDK:**
- [Agents SDK Documentation](https://openai.github.io/openai-agents-python/) - Core primitives: agents, handoffs, guardrails, tracing
- [Practical Guide to Building Agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) - Orchestration patterns
- [Building Agents Track](https://developers.openai.com/tracks/building-agents/) - Official developer guide
- [AGENTS.md Specification](https://agents.md/) - Standard for agent instructions
- [Tracing Documentation](https://openai.github.io/openai-agents-python/tracing/) - Span types and observability

**Efficiency & Orchestration:**
- [NVIDIA ToolOrchestra](https://github.com/NVlabs/ToolOrchestra) - Multi-turn tool orchestration with RL
- [ToolScale Dataset](https://huggingface.co/datasets/nvidia/ToolScale) - Training data synthesis

**Evaluation Frameworks:**
- [Assessment Framework for Agentic AI (arXiv 2512.12791)](https://arxiv.org/html/2512.12791v1) - Four-pillar evaluation model
- [Measurement Imbalance in Agentic AI (arXiv 2506.02064)](https://arxiv.org/abs/2506.02064) - Multi-dimensional evaluation
- [Adaptive Monitoring for Agentic AI (arXiv 2509.00115)](https://arxiv.org/abs/2509.00115) - AMDM algorithm

**Best Practices:**
- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Simplicity, transparency, tool engineering
- [Maxim AI: Production Multi-Agent Systems](https://www.getmaxim.ai/articles/best-practices-for-building-production-ready-multi-agent-systems/) - Orchestration patterns, distributed tracing
- [UiPath: Agent Builder Best Practices](https://www.uipath.com/blog/ai/agent-builder-best-practices) - Single-responsibility, evaluations
- [Stanford/Harvard: Demo-to-Deployment Gap](https://www.marktechpost.com/2025/12/24/this-ai-paper-from-stanford-and-harvard-explains-why-most-agentic-ai-systems-feel-impressive-in-demos-and-then-completely-fall-apart-in-real-use/) - Tool reliability as key failure mode

**Safety & Reasoning:**
- [Chain of Thought Monitoring](https://openai.com/index/chain-of-thought-monitoring/) - CoT monitorability for safety
- [Agent Builder Safety](https://platform.openai.com/docs/guides/agent-builder-safety) - Human-in-loop patterns
- [Agentic AI Foundation](https://openai.com/index/agentic-ai-foundation/) - Industry standards (MCP, AGENTS.md, goose)