Production Patterns Reference

Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.


Overview

This reference consolidates battle-tested insights from:

  • HN discussions on autonomous agents in production (2025)
  • Coding with LLMs practitioner experiences
  • Simon Willison's Superpowers coding agent patterns
  • Multi-agent orchestration real-world deployments

What Actually Works in Production

Human-in-the-Loop (HITL) is Non-Negotiable

Key Insight: "Zero companies don't have a human in the loop" for customer-facing applications.

hitl_patterns:
  always_human:
    - Customer-facing responses
    - Financial transactions
    - Security-critical operations
    - Legal/compliance decisions

  automation_candidates:
    - Internal tooling
    - Developer assistance
    - Data preprocessing
    - Code generation (with review)

  implementation:
    - Classification layer routes to human vs automated
    - Confidence thresholds trigger escalation
    - Audit trails for all automated decisions
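
The routing rules above can be sketched as a simple classifier. A minimal sketch; the category names and `route_action` helper are illustrative, not part of any real framework:

```python
# Minimal HITL routing sketch. Category sets mirror the lists above;
# all names here are illustrative.
ALWAYS_HUMAN = {"customer_response", "financial_transaction",
                "security_operation", "legal_decision"}
AUTOMATION_OK = {"internal_tooling", "developer_assist",
                 "data_preprocessing", "code_generation"}

def route_action(category: str) -> str:
    """Route a task category to a human or the automated pipeline."""
    if category in ALWAYS_HUMAN:
        return "human"       # never fully automated
    if category in AUTOMATION_OK:
        return "automated"   # still audit-logged and reviewable
    return "human"           # unknown categories default to human
```

Defaulting unknown categories to human review is the conservative choice implied by the pattern: automation is opt-in per category, not opt-out.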

Narrow Scope Wins

Key Insight: Successful agents operate within tightly constrained domains.

scope_constraints:
  max_steps_before_review: 3-5
  task_characteristics:
    - Specific, well-defined objectives
    - Pre-classified inputs
    - Deterministic success criteria
    - Verifiable outputs

  successful_domains:
    - Email scanning and classification
    - Invoice processing
    - Code refactoring (bounded)
    - Documentation generation
    - Test writing

  failure_prone_domains:
    - Open-ended feature implementation
    - Novel algorithm design
    - Security-critical code
    - Cross-system integrations
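
The `max_steps_before_review` constraint above can be enforced with a small counter. A sketch with illustrative names:

```python
# Sketch of the max-steps-before-review constraint (3-5 above).
class StepBudget:
    """Force a human checkpoint every few autonomous agent steps."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps = 0

    def record_step(self) -> bool:
        """Record one agent step; return True when a review is due."""
        self.steps += 1
        if self.steps >= self.max_steps:
            self.steps = 0  # reset after the review checkpoint
            return True
        return False
```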

Confidence-Based Routing

Key Insight: Treat agents as preprocessors, not decision-makers.

def confidence_based_routing(agent_output):
    """
    Route based on confidence, not capability.
    Based on production practitioner patterns.
    """
    confidence = agent_output.confidence_score

    if confidence >= 0.95:
        # High confidence: auto-approve with logging
        return AutoApprove(audit_log=True)

    elif confidence >= 0.70:
        # Medium confidence: quick human review
        return HumanReview(priority="normal", timeout="1h")

    elif confidence >= 0.40:
        # Low confidence: detailed human review
        return HumanReview(priority="high", context="full")

    else:
        # Very low confidence: escalate immediately
        return Escalate(reason="low_confidence", require_senior=True)

Classification Before Automation

Key Insight: Separate inputs before processing.

classification_first:
  step_1_classify:
    workable:
      - Clear requirements
      - Existing patterns
      - Test coverage available
    non_workable:
      - Ambiguous requirements
      - Novel architecture
      - Missing dependencies
    escalate_immediately:
      - Security concerns
      - Compliance requirements
      - Customer-facing changes

  step_2_route:
    workable: "Automated pipeline"
    non_workable: "Human clarification"
    escalate: "Senior review"
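
The two steps above can be sketched as a classify-then-route function. The keyword triggers are stand-ins for whatever real classifier (rules or a model) sits in front of the pipeline:

```python
# Sketch of classify-then-route. Trigger terms are illustrative stand-ins
# for a real classification layer.
ESCALATE_TERMS = ("security", "compliance", "customer-facing")
NON_WORKABLE_TERMS = ("ambiguous", "novel architecture", "missing dependency")

def classify(task_description: str) -> str:
    text = task_description.lower()
    if any(term in text for term in ESCALATE_TERMS):
        return "escalate"
    if any(term in text for term in NON_WORKABLE_TERMS):
        return "non_workable"
    return "workable"

ROUTES = {
    "workable": "automated pipeline",
    "non_workable": "human clarification",
    "escalate": "senior review",
}

def route(task_description: str) -> str:
    return ROUTES[classify(task_description)]
```

Escalation terms are checked first so a task that is both ambiguous and security-related still goes to senior review.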

Deterministic Outer Loops

Key Insight: Wrap agent outputs with rule-based validation.

def deterministic_validation_loop(task, max_attempts=3):
    """
    Use LLMs only where genuine ambiguity exists.
    Wrap with deterministic rules.
    """
    for attempt in range(max_attempts):
        # LLM handles the ambiguous part
        output = agent.execute(task)

        # Deterministic validation (NOT LLM)
        validation_errors = []

        # Rule: Must have tests
        if not output.has_tests:
            validation_errors.append("Missing tests")

        # Rule: Must pass linting
        lint_result = run_linter(output.code)
        if lint_result.errors:
            validation_errors.append(f"Lint errors: {lint_result.errors}")

        # Rule: Must compile
        compile_result = compile_code(output.code)
        if not compile_result.success:
            validation_errors.append(f"Compile error: {compile_result.error}")

        # Rule: Tests must pass
        if output.has_tests:
            test_result = run_tests(output.code)
            if not test_result.all_passed:
                validation_errors.append(f"Test failures: {test_result.failures}")

        if not validation_errors:
            return output

        # Feed errors back for retry
        task = task.with_feedback(validation_errors)

    return FailedResult(reason="Max attempts exceeded")

Context Engineering Patterns

Context Curation Over Automatic Selection

Key Insight: Manually choose which files and information to provide.

context_curation:
  principles:
    - "Less is more" - focused context beats comprehensive context
    - Manual selection outperforms automatic RAG
    - Remove outdated information aggressively

  anti_patterns:
    - Dumping entire codebase into context
    - Relying on automatic context selection
    - Accumulating conversation history indefinitely

  implementation:
    per_task_context:
      - 2-5 most relevant files
      - Specific functions, not entire modules
      - Recent changes only (last 1-2 days)
      - Clear success criteria

    context_budget:
      target: "< 10k tokens for context"
      reserve: "90% for model reasoning"
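
The context budget can be checked before each task. A sketch; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer — swap in your model's tokenizer for accurate counts:

```python
# Sketch of a context-budget check against the "< 10k tokens" target above.
# The chars/4 estimate is a crude heuristic, labeled as such.
TOKEN_BUDGET = 10_000

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def fits_budget(files: dict[str, str], budget: int = TOKEN_BUDGET) -> bool:
    """files maps path -> curated snippet (functions, not whole modules)."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total <= budget
```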

Information Abstraction

Key Insight: Summarize rather than feeding full data.

def abstract_for_agent(raw_data, task_context):
    """
    Design abstractions that preserve decision-relevant information.
    Based on practitioner insights.
    """
    # BAD: Feed 10,000 database rows
    # raw_data = db.query("SELECT * FROM users")

    # GOOD: Summarize to decision-relevant info
    summary = {
        "query_status": "success",
        "total_results": len(raw_data),
        "sample": raw_data[:5],
        "schema": extract_schema(raw_data),
        "statistics": {
            "null_count": count_nulls(raw_data),
            "unique_values": count_uniques(raw_data),
            "date_range": get_date_range(raw_data)
        }
    }

    return summary

Separate Conversations Per Task

Key Insight: Fresh contexts yield better results than accumulated sessions.

conversation_management:
  new_conversation_triggers:
    - Different domain (backend -> frontend)
    - New feature vs bug fix
    - After completing major task
    - When errors accumulate (3+ in a row)

  preserve_across_sessions:
    - CLAUDE.md / CONTINUITY.md
    - Architectural decisions
    - Key constraints

  discard_between_sessions:
    - Debugging attempts
    - Abandoned approaches
    - Intermediate drafts
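
The triggers above reduce to a small decision function. A sketch with illustrative field names:

```python
# Sketch of the new-conversation triggers listed above.
def should_start_fresh(prev_domain: str, new_domain: str,
                       task_completed: bool, recent_errors: int) -> bool:
    if prev_domain != new_domain:  # e.g. backend -> frontend
        return True
    if task_completed:             # major task just finished
        return True
    if recent_errors >= 3:         # errors accumulating in a row
        return True
    return False
```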

Skills System Pattern

On-Demand Skill Loading

Key Insight: Skills remain dormant until the model actively seeks them out.

skills_architecture:
  core_interaction: "< 2k tokens"
  skill_loading: "On-demand via search"

  implementation:
    skill_discovery:
      - Shell script searches skill files
      - Model requests specific skills by name
      - Skills loaded only when needed

    skill_structure:
      name: "unique-skill-name"
      trigger: "Pattern that activates skill"
      content: "Detailed instructions"
      dependencies: ["other-skills"]

  benefits:
    - Minimal base context
    - Extensible without bloat
    - Skills can be updated independently
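
On-demand loading can be sketched as a lookup over a skills directory. The directory layout and `.md`-per-skill naming convention are assumptions for illustration:

```python
# Sketch of on-demand skill loading: nothing enters context until the
# model requests a skill by name. Layout/naming are assumptions.
from pathlib import Path

def find_skill(skills_dir, name):
    """Load one skill file by name, only when requested."""
    path = Path(skills_dir) / f"{name}.md"
    if path.is_file():
        return path.read_text()  # enters context only now
    return None
```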

Sub-Agents for Context Isolation

Key Insight: Prevent massive token waste by isolating context-noisy subtasks.

async def context_isolated_search(query, codebase_path):
    """
    Use sub-agent for grep/search to prevent context pollution.
    Based on Simon Willison's patterns.
    """
    # Main agent stays focused
    # Sub-agent handles noisy file searching

    search_agent = spawn_subagent(
        role="codebase-searcher",
        context_limit="10k tokens",
        permissions=["read-only"]
    )

    results = await search_agent.execute(
        task=f"Find files related to: {query}",
        codebase=codebase_path
    )

    # Return only relevant paths, not full content
    return FilteredResults(
        paths=results.relevant_files[:10],
        summaries=results.file_summaries,
        confidence=results.relevance_scores
    )

Planning Before Execution

Explicit Plan-Then-Code Workflow

Key Insight: Have models articulate detailed plans without immediately writing code.

plan_then_code:
  phase_1_planning:
    outputs:
      - spec.md: "Detailed requirements"
      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
      - approach.md: "Implementation strategy"
    constraints:
      - NO CODE in this phase
      - Human review before proceeding
      - Clear success criteria

  phase_2_review:
    checks:
      - Plan addresses all requirements
      - Approach is feasible
      - No missing dependencies
      - Tests are specified

  phase_3_implementation:
    constraints:
      - Follow plan exactly
      - One task at a time
      - Test after each change
      - Report deviations immediately
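
The gate between phase 2 and phase 3 can be sketched as a single check: implementation does not begin until the plan artifacts exist and a human has approved them. Names are illustrative:

```python
# Sketch of the plan-then-code gate. Artifact names match the phase 1
# outputs above; the dict-based interface is illustrative.
REQUIRED_ARTIFACTS = ("spec.md", "todo.md", "approach.md")

def ready_to_implement(artifacts: dict[str, str], human_approved: bool) -> bool:
    if not human_approved:  # phase 2 review must happen first
        return False
    return all(name in artifacts and artifacts[name].strip()
               for name in REQUIRED_ARTIFACTS)
```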

Multi-Agent Orchestration Patterns

Event-Driven Coordination

Key Insight: Move beyond synchronous prompt chaining to asynchronous, decoupled systems.

event_driven_orchestration:
  problems_with_synchronous:
    - Doesn't scale
    - Mixes orchestration with prompt logic
    - Single failure breaks entire chain
    - No retry/recovery mechanism

  async_architecture:
    message_queue:
      - Agents communicate via events
      - Decoupled execution
      - Natural retry/dead-letter handling

    state_management:
      - Persistent task state
      - Checkpoint/resume capability
      - Clear ownership of data

    error_handling:
      - Per-agent retry policies
      - Circuit breakers
      - Graceful degradation
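
The async architecture above can be sketched in-process with a queue and a per-task retry budget. A real deployment would use a durable broker (SQS, Kafka, etc.) with dead-letter queues; everything here is illustrative:

```python
# Sketch of event-driven coordination: agents consume events from a
# queue, failures are retried, and exhausted retries go to a dead letter.
import asyncio

async def worker(queue, results, max_retries=2):
    while True:
        event = await queue.get()
        if event is None:          # shutdown sentinel
            queue.task_done()
            return
        task, attempt = event
        try:
            results.append(task())  # agent step (sync here for brevity)
        except Exception:
            if attempt < max_retries:
                await queue.put((task, attempt + 1))  # retry the event
            else:
                results.append("dead-letter")         # record, don't crash
        queue.task_done()

async def run(tasks):
    queue, results = asyncio.Queue(), []
    for t in tasks:
        queue.put_nowait((t, 0))
    worker_task = asyncio.create_task(worker(queue, results))
    await queue.join()             # all events (including retries) done
    await queue.put(None)
    await worker_task
    return results
```

Because retries are just new events, a single failure never breaks the chain — the failing task is re-queued while other work proceeds.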

Policy-First Enforcement

Key Insight: Govern agent behavior at runtime, not just training time.

class PolicyEngine:
    """
    Runtime governance for agent behavior.
    Based on autonomous control plane patterns.
    """

    def __init__(self, policies):
        self.policies = policies

    async def enforce(self, agent_action, context):
        for policy in self.policies:
            result = await policy.evaluate(agent_action, context)

            if result.blocked:
                return BlockedAction(
                    reason=result.reason,
                    policy=policy.name,
                    remediation=result.suggested_action
                )

            if result.modified:
                agent_action = result.modified_action

        return AllowedAction(agent_action)

# Example policies
policies = [
    NoProductionDataDeletion(),
    NoSecretsInCode(),
    MaxTokenBudget(limit=100000),
    RequireTestsForCode(),
    BlockExternalNetworkCalls(in_sandbox=True)
]
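
One of the example policies can be fleshed out with a simple regex check. The `evaluate` signature mirrors the `PolicyEngine` contract above; the patterns are illustrative, not exhaustive:

```python
# Illustrative NoSecretsInCode policy matching PolicyEngine.evaluate.
# The regexes catch only the most obvious hard-coded secrets.
import re
from dataclasses import dataclass

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
    re.compile(r"(?i)(api[_-]?key|password)\s*=\s*['\"][^'\"]+['\"]"),
]

@dataclass
class PolicyResult:
    blocked: bool = False
    modified: bool = False
    reason: str = ""
    suggested_action: str = ""

class NoSecretsInCode:
    name = "no-secrets-in-code"

    async def evaluate(self, agent_action, context):
        code = getattr(agent_action, "code", "") or str(agent_action)
        for pattern in SECRET_PATTERNS:
            if pattern.search(code):
                return PolicyResult(
                    blocked=True,
                    reason="hard-coded secret detected",
                    suggested_action="load from env or a secret store",
                )
        return PolicyResult()  # allowed, unmodified
```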

Simulation Layer

Key Insight: Evaluate changes before deploying to real environment.

simulation_layer:
  purpose: "Test agent behavior in safe environment"

  implementation:
    sandbox_environment:
      - Isolated container
      - Mocked external services
      - Synthetic data
      - Full audit logging

    validation_checks:
      - Run tests in sandbox first
      - Compare outputs to expected
      - Check for policy violations
      - Measure resource consumption

    promotion_criteria:
      - All tests pass
      - No policy violations
      - Resource usage within limits
      - Human approval (for sensitive changes)
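
The promotion criteria collapse into one gate function. A sketch; the report fields are illustrative stand-ins for whatever the sandbox run actually emits:

```python
# Sketch of the promotion gate over a sandbox run report. Field names
# are illustrative.
def may_promote(report: dict, sensitive: bool, human_approved: bool) -> bool:
    if not report.get("tests_passed"):
        return False
    if report.get("policy_violations"):
        return False
    if report.get("resource_usage", 0) > report.get("resource_limit", 0):
        return False
    if sensitive and not human_approved:  # human gate for sensitive changes
        return False
    return True
```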

Evaluation and Benchmarking

Problems with Current Benchmarks

Key Insight: LLM-as-judge creates shared blind spots.

benchmark_problems:
  llm_judge_issues:
    - Same architecture = same failure modes
    - Math errors accepted as correct
    - "Do-nothing" baseline passes 38% of the time

  contamination:
    - Published benchmarks become training targets
    - Overfitting to specific datasets
    - Inflated scores don't reflect real performance

  solutions:
    held_back_sets: "90% public, 10% private"
    human_evaluation: "Final published results require humans"
    production_testing: "A/B tests measure actual value"
    objective_outcomes: "Simulated environments with verifiable results"
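
The held-back split can be sketched in a few lines: the private slice never appears in published benchmarks, so it cannot leak into training data. The seeded shuffle is an illustrative detail that keeps the split reproducible:

```python
# Sketch of the 90% public / 10% private held-back split above.
import random

def split_holdout(tasks, private_fraction=0.10, seed=42):
    rng = random.Random(seed)  # deterministic, reproducible split
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - private_fraction))
    return shuffled[:cut], shuffled[cut:]  # (public, private)
```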

Practical Evaluation Approach

import random

def evaluate_agent_change(before_agent, after_agent, task_set):
    """
    Production-oriented evaluation.
    Based on HN practitioner recommendations.
    """
    results = {
        "before": [],
        "after": [],
        "human_preference": []
    }

    for task in task_set:
        # Run both agents
        before_result = before_agent.execute(task)
        after_result = after_agent.execute(task)

        # Objective metrics (NOT LLM-judged)
        results["before"].append({
            "tests_pass": run_tests(before_result),
            "lint_clean": run_linter(before_result),
            "time_taken": before_result.duration,
            "tokens_used": before_result.tokens
        })

        results["after"].append({
            "tests_pass": run_tests(after_result),
            "lint_clean": run_linter(after_result),
            "time_taken": after_result.duration,
            "tokens_used": after_result.tokens
        })

        # Sample for human review
        if random.random() < 0.1:  # 10% sample
            results["human_preference"].append({
                "task": task,
                "before": before_result,
                "after": after_result,
                "pending_review": True
            })

    return EvaluationReport(results)

Cost and Token Economics

Real-World Cost Patterns

cost_patterns:
  claude_code:
    heavy_use: "$25/1-2 hours on large codebases"
    api_range: "$1-5/hour depending on efficiency"
    max_tier: "$200/month often needs 2-3 subscriptions"

  token_economics:
    sub_agents_multiply_cost: "Each duplicates context"
    example: "5-task parallel job = 50,000+ tokens per subtask"

  optimization:
    context_isolation: "Use sub-agents for noisy tasks"
    information_abstraction: "Summarize, don't dump"
    fresh_conversations: "Reset after major tasks"
    skill_on_demand: "Load only when needed"
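
The multiplication effect can be shown with back-of-the-envelope arithmetic: every sub-agent re-ingests the shared context, so it is paid once per subtask. The specific numbers below are illustrative, chosen to match the 50,000-token-per-subtask example above:

```python
# Back-of-the-envelope sketch of sub-agent token economics. All numbers
# in the test are illustrative.
def subtask_tokens(shared_context_tokens: int, task_specific_tokens: int) -> int:
    """Each sub-agent duplicates the shared context."""
    return shared_context_tokens + task_specific_tokens

def job_tokens(shared_context_tokens: int, per_task_tokens: int,
               n_subtasks: int) -> int:
    """Total cost grows linearly in subtasks, context paid each time."""
    return n_subtasks * subtask_tokens(shared_context_tokens, per_task_tokens)
```

This is why the optimization list pairs sub-agents with context isolation: shrinking the duplicated context shrinks every term of the sum.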

Sources

Hacker News Discussions:

Show HN Projects: