Tool Orchestration Patterns Reference
Research-backed patterns inspired by NVIDIA ToolOrchestra, OpenAI Agents SDK, and multi-agent coordination research.
Overview
Effective tool orchestration builds on four key mechanisms:
- Tracing Spans - Hierarchical event tracking (OpenAI SDK pattern)
- Efficiency Metrics - Track computational cost per task
- Reward Signals - Outcome, efficiency, and preference rewards for learning
- Dynamic Selection - Adapt agent count and types based on task complexity
Tracing Spans Architecture (OpenAI SDK Pattern)
Span Types
Every operation is wrapped in a typed span for observability:
span_types:
  agent_span:       # Wraps entire agent execution
  generation_span:  # Wraps LLM API calls
  function_span:    # Wraps tool/function calls
  guardrail_span:   # Wraps validation checks
  handoff_span:     # Wraps agent-to-agent transfers
  custom_span:      # User-defined operations
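A minimal sketch of how typed spans can be emitted, using a context manager so that nested `with` blocks produce the parent/child links shown in the trace below. The helper names (`span`, `_trace`) are illustrative, not an SDK API:

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

_trace = {"spans": [], "stack": []}  # flat span list + stack of open span ids

@contextmanager
def span(span_type, **attrs):
    """Open a typed span; nested `with` blocks become child spans."""
    record = {
        "span_id": f"span_{uuid.uuid4().hex[:8]}",
        "parent_id": _trace["stack"][-1] if _trace["stack"] else None,
        "type": span_type,
        "started_at": datetime.now(timezone.utc).isoformat(),
        **attrs,
    }
    _trace["spans"].append(record)
    _trace["stack"].append(record["span_id"])
    try:
        yield record
    finally:
        record["ended_at"] = datetime.now(timezone.utc).isoformat()
        _trace["stack"].pop()

# Usage: nesting mirrors the hierarchy in the trace JSON below
with span("agent_span", agent_name="orchestrator"):
    with span("function_span", tool="Grep"):
        pass
```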
Hierarchical Trace Structure
{
  "trace_id": "trace_abc123def456",
  "workflow_name": "implement_feature",
  "group_id": "session_xyz789",
  "spans": [
    {
      "span_id": "span_001",
      "parent_id": null,
      "type": "agent_span",
      "agent_name": "orchestrator",
      "started_at": "2026-01-07T10:00:00Z",
      "ended_at": "2026-01-07T10:05:00Z",
      "children": ["span_002", "span_003"]
    },
    {
      "span_id": "span_002",
      "parent_id": "span_001",
      "type": "guardrail_span",
      "guardrail_name": "input_validation",
      "triggered": false,
      "blocking": true
    },
    {
      "span_id": "span_003",
      "parent_id": "span_001",
      "type": "handoff_span",
      "from_agent": "orchestrator",
      "to_agent": "backend-dev"
    }
  ]
}
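Spans are stored as a flat list linked by `parent_id`, so consumers rebuild the hierarchy on read. A sketch of that reconstruction, assuming the JSON shape above:

```python
from collections import defaultdict

def print_span_tree(trace):
    """Group spans under their parent_id; roots have parent_id None."""
    children = defaultdict(list)
    for s in trace["spans"]:
        children[s["parent_id"]].append(s)

    def render(parent_id, depth=0):
        for s in children[parent_id]:
            print("  " * depth + f"{s['type']} ({s['span_id']})")
            render(s["span_id"], depth + 1)

    render(None)
```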
Storage Location
.loki/traces/
├── active/
│ └── {trace_id}.json # Currently running traces
└── completed/
└── {date}/
└── {trace_id}.json # Archived traces
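The move from `active/` to `completed/{date}/` when a trace finishes can be a simple rename. A sketch assuming this layout (the function name is illustrative):

```python
from datetime import date
from pathlib import Path

def archive_trace(trace_id, root=Path(".loki/traces")):
    """Move a finished trace from active/ to completed/{date}/."""
    src = root / "active" / f"{trace_id}.json"
    dest_dir = root / "completed" / date.today().isoformat()
    dest_dir.mkdir(parents=True, exist_ok=True)
    src.rename(dest_dir / src.name)
```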
See references/openai-patterns.md for full tracing implementation.
Efficiency Metrics System
Why Track Efficiency?
ToolOrchestra achieves a 70% cost reduction vs GPT-5 by explicitly optimizing for efficiency. Loki Mode should track:
- Token usage per task (input + output)
- Wall clock time per task
- Agent spawns per task
- Retry count before success
Efficiency Tracking Schema
{
  "task_id": "task-2026-01-06-001",
  "correlation_id": "session-abc123",
  "started_at": "2026-01-06T10:00:00Z",
  "completed_at": "2026-01-06T10:05:32Z",
  "metrics": {
    "wall_time_seconds": 332,
    "agents_spawned": 3,
    "total_agent_calls": 7,
    "retry_count": 1,
    "retry_reasons": ["test_failure"],
    "recovery_rate": 1.0,
    "model_usage": {
      "haiku": {"calls": 4, "est_tokens": 12000},
      "sonnet": {"calls": 2, "est_tokens": 8000},
      "opus": {"calls": 1, "est_tokens": 6000}
    }
  },
  "outcome": "success",
  "outcome_reason": "tests_passed_after_fix",
  "efficiency_score": 0.85,
  "efficiency_factors": ["used_haiku_for_tests", "parallel_review"],
  "quality_pillars": {
    "tool_selection_correct": true,
    "tool_reliability_rate": 0.95,
    "memory_retrieval_relevant": true,
    "goal_adherence": 1.0
  }
}
Why capture these metrics? (Based on multi-agent research)
- Capture intent, not just actions (Hashrocket). "UX debt turns into data debt": recording actions without intent creates useless analytics.
- Track recovery rate (Assessment Framework, arXiv 2512.12791). `recovery_rate = successful_retries / total_retries`. The paper found "perfect tool sequencing but only 33% policy adherence"; surface metrics mask failures.
- Distributed tracing (Maxim AI). `correlation_id` links all tasks in a session for end-to-end tracing; essential for debugging multi-agent coordination failures.
- Tool reliability separate from selection (Stanford/Harvard). `tool_selection_correct`: did we pick the right tool? `tool_reliability_rate`: did the tool work as expected? Tools can fail even when correctly selected. Key insight: "tool use reliability" is a primary demo-to-deployment gap.
- Quality pillars beyond outcomes (Assessment Framework). `memory_retrieval_relevant`: did episodic/semantic retrieval help? `goal_adherence`: did we stay on task? (0.0-1.0 score)
Efficiency Score Calculation
def calculate_efficiency_score(metrics, task_complexity):
    """
    Score from 0-1 where higher is more efficient.
    Based on ToolOrchestra's efficiency reward signal.
    """
    # Baseline expectations by complexity
    baselines = {
        "trivial":  {"time": 60,   "agents": 1,  "retries": 0},
        "simple":   {"time": 180,  "agents": 2,  "retries": 0},
        "moderate": {"time": 600,  "agents": 4,  "retries": 1},
        "complex":  {"time": 1800, "agents": 8,  "retries": 2},
        "critical": {"time": 3600, "agents": 12, "retries": 3}
    }
    baseline = baselines[task_complexity]

    # Component scores (1.0 = at or better than baseline, <1.0 = worse)
    time_score = min(1.0, baseline["time"] / max(metrics["wall_time_seconds"], 1))
    agent_score = min(1.0, baseline["agents"] / max(metrics["agents_spawned"], 1))
    retry_score = max(0.0, 1.0 - metrics["retry_count"] / (baseline["retries"] + 3))

    # Weighted average (time matters most)
    return (time_score * 0.5) + (agent_score * 0.3) + (retry_score * 0.2)
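Worked example using the task from the schema above (332 s wall time, 3 agents, 1 retry), classified as moderate:

```python
metrics = {"wall_time_seconds": 332, "agents_spawned": 3, "retry_count": 1}
calculate_efficiency_score(metrics, "moderate")
# time_score  = min(1.0, 600/332) = 1.0
# agent_score = min(1.0, 4/3)     = 1.0
# retry_score = 1 - 1/(1+3)       = 0.75
# -> 1.0*0.5 + 1.0*0.3 + 0.75*0.2 = 0.95
```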
Standard Reason Codes
Use consistent codes to enable pattern analysis:
outcome_reasons:
  success:
    - tests_passed_first_try
    - tests_passed_after_fix
    - review_approved
    - spec_validated
  partial:
    - tests_partial_pass
    - review_concerns_minor
    - timeout_partial_work
  failure:
    - tests_failed
    - review_blocked
    - dependency_missing
    - timeout_no_progress
    - error_unrecoverable

retry_reasons:
  - test_failure
  - lint_error
  - type_error
  - review_rejection
  - rate_limit
  - timeout
  - dependency_conflict

efficiency_factors:
  positive:
    - used_haiku_for_simple
    - parallel_execution
    - cached_result
    - first_try_success
    - spec_driven
  negative:
    - used_opus_for_simple
    - sequential_when_parallel_possible
    - multiple_retries
    - missing_context
    - unclear_requirements
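Pattern analysis only works if codes stay consistent; a minimal sketch of a guard that rejects free-form reason strings at logging time (sets copied from the taxonomy above, function name illustrative):

```python
OUTCOME_REASONS = {
    "success": {"tests_passed_first_try", "tests_passed_after_fix",
                "review_approved", "spec_validated"},
    "partial": {"tests_partial_pass", "review_concerns_minor",
                "timeout_partial_work"},
    "failure": {"tests_failed", "review_blocked", "dependency_missing",
                "timeout_no_progress", "error_unrecoverable"},
}

def validate_outcome(outcome, reason):
    """Raise if a log entry uses a reason code outside the taxonomy."""
    if reason not in OUTCOME_REASONS.get(outcome, set()):
        raise ValueError(f"unknown reason {reason!r} for outcome {outcome!r}")
```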
Storage Location
.loki/metrics/
├── efficiency/
│ ├── 2026-01-06.json # Daily efficiency logs
│ └── aggregate.json # Running averages by task type
└── rewards/
├── outcomes.json # Task success/failure records
└── preferences.json # User preference signals
Reward Signal Framework
Three Reward Types (ToolOrchestra Pattern)
+------------------------------------------------------------------+
| 1. OUTCOME REWARD |
| - Did the task succeed? Binary + quality grade |
| - Signal: +1.0 (success), 0.0 (partial), -1.0 (failure) |
+------------------------------------------------------------------+
| 2. EFFICIENCY REWARD |
| - Did we use resources wisely? |
| - Signal: 0.0 to 1.0 based on efficiency score |
+------------------------------------------------------------------+
| 3. PREFERENCE REWARD |
| - Did the user like the approach/result? |
| - Signal: Inferred from user actions (accept/reject/modify) |
+------------------------------------------------------------------+
Outcome Reward Implementation
def calculate_outcome_reward(task_result):
    """
    Outcome reward based on task completion status.
    """
    if task_result.status == "completed":
        # Grade the quality of completion
        if task_result.tests_passed and task_result.review_passed:
            return 1.0  # Full success
        elif task_result.tests_passed:
            return 0.7  # Tests pass but review had concerns
        else:
            return 0.3  # Completed but with issues
    elif task_result.status == "partial":
        return 0.0  # Partial completion, no reward
    else:  # failed
        return -1.0  # Negative reward for failure
Preference Reward Implementation
def infer_preference_reward(task_result, user_actions):
    """
    Infer user preference from their actions after task completion.
    Based on implicit feedback patterns.
    """
    signals = []

    # Positive signals
    if "commit" in user_actions:
        signals.append(0.8)   # User committed our changes
    if "deploy" in user_actions:
        signals.append(1.0)   # User deployed our changes
    if "no_edits" in user_actions:
        signals.append(0.6)   # User didn't modify our output

    # Negative signals
    if "revert" in user_actions:
        signals.append(-1.0)  # User reverted our changes
    if "manual_fix" in user_actions:
        signals.append(-0.5)  # User had to fix our work
    if "retry_different" in user_actions:
        signals.append(-0.3)  # User asked for different approach

    # Neutral (no signal)
    if not signals:
        return None

    return sum(signals) / len(signals)
Reward Aggregation for Learning
def aggregate_rewards(outcome, efficiency, preference):
    """
    Combine rewards into single learning signal.
    Weights based on ToolOrchestra findings.
    """
    # Outcome is most important (must succeed)
    # Efficiency secondary (once successful, optimize)
    # Preference tertiary (align with user style)
    weights = {
        "outcome": 0.6,
        "efficiency": 0.25,
        "preference": 0.15
    }

    total = outcome * weights["outcome"]
    total += efficiency * weights["efficiency"]

    if preference is not None:
        total += preference * weights["preference"]
    else:
        # Redistribute weight if no preference signal
        total = total / (1 - weights["preference"])

    return total
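Putting the three signals together for a hypothetical task that succeeded, scored 0.85 on efficiency, and was then committed by the user without edits (`task_result` is unused by the preference heuristic above, so an empty dict suffices here):

```python
preference = infer_preference_reward(task_result={}, user_actions={"commit", "no_edits"})
# -> (0.8 + 0.6) / 2 = 0.7
reward = aggregate_rewards(outcome=1.0, efficiency=0.85, preference=preference)
# -> 1.0*0.6 + 0.85*0.25 + 0.7*0.15 = 0.9175
```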
Dynamic Agent Selection
Task Complexity Classification
def classify_task_complexity(task):
    """
    Classify task to determine agent allocation.
    Based on ToolOrchestra's tool selection flexibility.
    """
    complexity_signals = {
        # File scope signals
        "single_file": -1,
        "few_files": 0,     # 2-5 files
        "many_files": +1,   # 6-20 files
        "system_wide": +2,  # 20+ files

        # Change type signals
        "typo_fix": -2,
        "bug_fix": 0,
        "feature": +1,
        "refactor": +1,
        "architecture": +2,

        # Domain signals
        "documentation": -1,
        "tests_only": 0,
        "frontend": 0,
        "backend": 0,
        "full_stack": +1,
        "infrastructure": +1,
        "security": +2,
    }

    score = 0
    for signal, weight in complexity_signals.items():
        if task.has_signal(signal):
            score += weight

    # Map score to complexity level
    if score <= -2:
        return "trivial"
    elif score <= 0:
        return "simple"
    elif score <= 2:
        return "moderate"
    elif score <= 4:
        return "complex"
    else:
        return "critical"
Agent Allocation by Complexity
# Agent allocation strategy
# Model selection: Opus=planning, Sonnet=development, Haiku=unit tests/monitoring
complexity_allocations:
  trivial:
    max_agents: 1
    planning: null       # No planning needed
    development: haiku
    testing: haiku
    review: skip         # No review needed for trivial
    parallel: false
  simple:
    max_agents: 2
    planning: null       # No planning needed
    development: haiku
    testing: haiku
    review: single       # One quick review
    parallel: false
  moderate:
    max_agents: 4
    planning: sonnet     # Sonnet for moderate planning
    development: sonnet
    testing: haiku       # Unit tests always haiku
    review: standard     # 3 parallel reviewers
    parallel: true
  complex:
    max_agents: 8
    planning: opus       # Opus ONLY for complex planning
    development: sonnet  # Sonnet for implementation
    testing: haiku       # Unit tests still haiku
    review: deep         # 3 reviewers + devil's advocate
    parallel: true
  critical:
    max_agents: 12
    planning: opus       # Opus for critical planning
    development: sonnet  # Sonnet for implementation
    testing: sonnet      # Functional/E2E tests with sonnet
    review: exhaustive   # Multiple review rounds
    parallel: true
    human_checkpoint: true  # Pause for human review
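The selection algorithm below reads this table as a Python constant. A sketch of that constant, abridged to two complexity levels (field names match the YAML above):

```python
COMPLEXITY_ALLOCATIONS = {
    "moderate": {
        "max_agents": 4,
        "planning": "sonnet",
        "development": "sonnet",
        "testing": "haiku",
        "review": "standard",
        "parallel": True,
    },
    "complex": {
        "max_agents": 8,
        "planning": "opus",
        "development": "sonnet",
        "testing": "haiku",
        "review": "deep",
        "parallel": True,
    },
    # ... trivial, simple, and critical follow the same shape
}
```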
Dynamic Selection Algorithm
def select_agents_for_task(task, available_agents):
    """
    Dynamically select agents based on task requirements.
    Inspired by ToolOrchestra's configurable tool selection.
    """
    complexity = classify_task_complexity(task)
    allocation = COMPLEXITY_ALLOCATIONS[complexity]

    # 1. Identify required agent types
    required_types = identify_required_agents(task)

    # 2. Filter to available agents of required types
    candidates = [a for a in available_agents if a.type in required_types]

    # 3. Score candidates by past performance
    for agent in candidates:
        agent.selection_score = get_agent_performance_score(
            agent,
            task_type=task.type,
            complexity=complexity
        )

    # 4. Select top N agents up to allocation limit
    candidates.sort(key=lambda a: a.selection_score, reverse=True)
    selected = candidates[:allocation["max_agents"]]

    # 5. Assign models by role from the allocation table
    for agent in selected:
        if agent.role == "reviewer":
            agent.model = "opus"  # Always opus for reviews
        else:
            # Look up the role-specific model (planning/development/testing),
            # falling back to the development model for unlisted roles
            agent.model = allocation.get(agent.role, allocation["development"])

    return selected

def get_agent_performance_score(agent, task_type, complexity):
    """
    Score agent based on historical performance on similar tasks.
    Uses reward signals from previous executions.
    """
    history = load_agent_history(agent.id)

    # Filter to similar tasks
    similar = [h for h in history
               if h.task_type == task_type
               and h.complexity == complexity]

    if not similar:
        return 0.5  # Neutral score if no history

    # Average past rewards
    return sum(h.aggregate_reward for h in similar) / len(similar)
Tool Usage Analytics
Track Tool Effectiveness
{
  "tool_analytics": {
    "period": "2026-01-06",
    "by_tool": {
      "Grep": {
        "calls": 142,
        "success_rate": 0.89,
        "avg_result_quality": 0.82,
        "common_patterns": ["error handling", "function def"]
      },
      "Task": {
        "calls": 47,
        "success_rate": 0.94,
        "avg_efficiency": 0.76,
        "by_subagent_type": {
          "general-purpose": {"calls": 35, "success": 0.91},
          "Explore": {"calls": 12, "success": 1.0}
        }
      }
    },
    "insights": [
      "Explore agent 100% success - use more for codebase search",
      "Grep success drops to 0.65 for regex patterns - simplify searches"
    ]
  }
}
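A sketch of producing these per-tool figures from recorded function spans. The `tool` and `success` span fields are assumptions here, not fields shown in the trace schema above:

```python
from collections import defaultdict

def tool_analytics(function_spans):
    """Aggregate call counts and success rates per tool from function spans."""
    stats = defaultdict(lambda: {"calls": 0, "successes": 0})
    for s in function_spans:
        entry = stats[s["tool"]]
        entry["calls"] += 1
        entry["successes"] += 1 if s.get("success") else 0
    return {
        tool: {"calls": e["calls"],
               "success_rate": round(e["successes"] / e["calls"], 2)}
        for tool, e in stats.items()
    }
```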
Continuous Improvement Loop
+------------------------------------------------------------------+
| 1. COLLECT |
| Record every task: agents used, tools called, outcome |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 2. ANALYZE |
| Weekly aggregation: What worked? What didn't? |
| Identify patterns in high-reward vs low-reward tasks |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 3. ADAPT |
| Update selection algorithms based on analytics |
| Store successful patterns in semantic memory |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| 4. VALIDATE |
| A/B test new selection strategies |
| Measure efficiency improvement |
+------------------------------------------------------------------+
|
+-----------> Loop back to COLLECT
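A sketch of the ANALYZE step: split the week's tasks by reward and contrast which efficiency factors dominate each group. The `aggregate_reward` field and threshold are assumptions; `efficiency_factors` comes from the efficiency schema above:

```python
from collections import Counter

def analyze_week(tasks, threshold=0.7):
    """Contrast efficiency factors between high- and low-reward tasks."""
    high = [t for t in tasks if t["aggregate_reward"] >= threshold]
    low = [t for t in tasks if t["aggregate_reward"] < threshold]
    high_factors = Counter(f for t in high for f in t["efficiency_factors"])
    low_factors = Counter(f for t in low for f in t["efficiency_factors"])
    return {
        "reinforce": high_factors.most_common(5),  # patterns to keep using
        "avoid": low_factors.most_common(5),       # patterns to correct
    }
```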
Integration with RARV Cycle
The orchestration patterns integrate with RARV at each phase:
REASON:
├── Check efficiency metrics for similar past tasks
├── Classify task complexity
└── Select appropriate agent allocation
ACT:
├── Dispatch agents according to allocation
├── Track start time and resource usage
└── Record tool calls and agent interactions
REFLECT:
├── Calculate outcome reward (did it work?)
├── Calculate efficiency reward (resource usage)
└── Log to metrics store
VERIFY:
├── Run verification checks
├── If failed: negative outcome reward, retry with learning
├── If passed: infer preference reward from user actions
└── Update agent performance scores
Key Metrics Dashboard
Track these metrics in .loki/metrics/dashboard.json:
{
  "dashboard": {
    "period": "rolling_7_days",
    "summary": {
      "tasks_completed": 127,
      "success_rate": 0.94,
      "avg_efficiency_score": 0.78,
      "avg_outcome_reward": 0.82,
      "avg_preference_reward": 0.71,
      "avg_recovery_rate": 0.87,
      "avg_goal_adherence": 0.93
    },
    "quality_pillars": {
      "tool_selection_accuracy": 0.91,
      "tool_reliability_rate": 0.93,
      "memory_retrieval_relevance": 0.84,
      "policy_adherence": 0.96
    },
    "trends": {
      "efficiency": "+12% vs previous week",
      "success_rate": "+3% vs previous week",
      "avg_agents_per_task": "-0.8 (improving)",
      "recovery_rate": "+5% vs previous week"
    },
    "top_performing_patterns": [
      "Haiku for unit tests (0.95 success, 0.92 efficiency)",
      "Explore agent for codebase search (1.0 success)",
      "Parallel review with opus (0.98 accuracy)"
    ],
    "areas_for_improvement": [
      "Complex refactors taking 2x expected time",
      "Security review efficiency below baseline",
      "Memory retrieval relevance below 0.85 target"
    ]
  }
}
Multi-Dimensional Evaluation
Based on Measurement Imbalance research (arXiv 2506.02064):
"Technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic (30%) remain peripheral"
Loki Mode tracks four evaluation axes:
| Axis | Metrics | Current Coverage |
|---|---|---|
| Technical | success_rate, efficiency_score, recovery_rate | Full |
| Human-Centered | preference_reward, goal_adherence | Partial |
| Safety | policy_adherence, quality_gates_passed | Full (via review system) |
| Economic | model_usage, agents_spawned, wall_time | Full |
Sources
OpenAI Agents SDK:
- Agents SDK Documentation - Core primitives: agents, handoffs, guardrails, tracing
- Practical Guide to Building Agents - Orchestration patterns
- Building Agents Track - Official developer guide
- AGENTS.md Specification - Standard for agent instructions
- Tracing Documentation - Span types and observability
Efficiency & Orchestration:
- NVIDIA ToolOrchestra - Multi-turn tool orchestration with RL
- ToolScale Dataset - Training data synthesis
Evaluation Frameworks:
- Assessment Framework for Agentic AI (arXiv 2512.12791) - Four-pillar evaluation model
- Measurement Imbalance in Agentic AI (arXiv 2506.02064) - Multi-dimensional evaluation
- Adaptive Monitoring for Agentic AI (arXiv 2509.00115) - AMDM algorithm
Best Practices:
- Anthropic: Building Effective Agents - Simplicity, transparency, tool engineering
- Maxim AI: Production Multi-Agent Systems - Orchestration patterns, distributed tracing
- UiPath: Agent Builder Best Practices - Single-responsibility, evaluations
- Stanford/Harvard: Demo-to-Deployment Gap - Tool reliability as key failure mode
Safety & Reasoning:
- Chain of Thought Monitoring - CoT monitorability for safety
- Agent Builder Safety - Human-in-loop patterns
- Agentic AI Foundation - Industry standards (MCP, AGENTS.md, goose)