# Production Patterns Reference
Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.

---
## Overview
This reference consolidates battle-tested insights from:
- HN discussions on autonomous agents in production (2025)
- Coding with LLMs practitioner experiences
- Simon Willison's Superpowers coding agent patterns
- Multi-agent orchestration real-world deployments
---
## What Actually Works in Production
### Human-in-the-Loop (HITL) is Non-Negotiable
**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.
```yaml
hitl_patterns:
  always_human:
    - Customer-facing responses
    - Financial transactions
    - Security-critical operations
    - Legal/compliance decisions
  automation_candidates:
    - Internal tooling
    - Developer assistance
    - Data preprocessing
    - Code generation (with review)
  implementation:
    - Classification layer routes to human vs automated
    - Confidence thresholds trigger escalation
    - Audit trails for all automated decisions
```
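The classification layer above can be sketched in a few lines. This is a minimal illustration, not a specific framework's API; the category names, the 0.9 threshold, and the `Decision` type are assumptions for the example:

```python
from dataclasses import dataclass

# Categories that must always go to a human, per the policy above (illustrative names).
ALWAYS_HUMAN = {"customer_facing", "financial", "security", "legal"}

@dataclass
class Decision:
    route: str          # "human" or "automated"
    audit_entry: dict   # every decision is logged for the audit trail

def route_request(category: str, confidence: float, threshold: float = 0.9) -> Decision:
    """Route to a human unless the task is an automation candidate
    AND model confidence clears the escalation threshold."""
    entry = {"category": category, "confidence": confidence}
    if category in ALWAYS_HUMAN or confidence < threshold:
        return Decision("human", entry)
    return Decision("automated", entry)
```

Note that the always-human list short-circuits before any confidence check: no confidence score, however high, routes a customer-facing task around the human.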
### Narrow Scope Wins
**Key Insight:** Successful agents operate within tightly constrained domains.
```yaml
scope_constraints:
  max_steps_before_review: 3-5
  task_characteristics:
    - Specific, well-defined objectives
    - Pre-classified inputs
    - Deterministic success criteria
    - Verifiable outputs
  successful_domains:
    - Email scanning and classification
    - Invoice processing
    - Code refactoring (bounded)
    - Documentation generation
    - Test writing
  failure_prone_domains:
    - Open-ended feature implementation
    - Novel algorithm design
    - Security-critical code
    - Cross-system integrations
```
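The `max_steps_before_review` constraint can be enforced with a small counter in the agent loop. A minimal sketch (the class name and default of 4 are illustrative assumptions within the 3-5 range above):

```python
class StepBudget:
    """Pause for human review every N agent steps (3-5 per the guidance above)."""

    def __init__(self, max_steps: int = 4):
        self.max_steps = max_steps
        self.steps = 0

    def record_step(self) -> bool:
        """Call once per agent step. Returns True when a review
        checkpoint is due, and resets the counter for the next window."""
        self.steps += 1
        if self.steps >= self.max_steps:
            self.steps = 0
            return True
        return False
```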
### Confidence-Based Routing
**Key Insight:** Treat agents as preprocessors, not decision-makers.
```python
def confidence_based_routing(agent_output):
    """
    Route based on confidence, not capability.
    Based on production practitioner patterns.
    """
    confidence = agent_output.confidence_score
    if confidence >= 0.95:
        # High confidence: auto-approve with logging
        return AutoApprove(audit_log=True)
    elif confidence >= 0.70:
        # Medium confidence: quick human review
        return HumanReview(priority="normal", timeout="1h")
    elif confidence >= 0.40:
        # Low confidence: detailed human review
        return HumanReview(priority="high", context="full")
    else:
        # Very low confidence: escalate immediately
        return Escalate(reason="low_confidence", require_senior=True)
```
### Classification Before Automation
**Key Insight:** Separate inputs before processing.
```yaml
classification_first:
  step_1_classify:
    workable:
      - Clear requirements
      - Existing patterns
      - Test coverage available
    non_workable:
      - Ambiguous requirements
      - Novel architecture
      - Missing dependencies
    escalate_immediately:
      - Security concerns
      - Compliance requirements
      - Customer-facing changes
  step_2_route:
    workable: "Automated pipeline"
    non_workable: "Human clarification"
    escalate: "Senior review"
```
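The two-step flow above maps directly to code. A minimal sketch, assuming tasks arrive as dicts with boolean flags (the flag and route names are illustrative, not a fixed schema):

```python
def classify_task(task: dict) -> str:
    """Step 1: classify before any automation touches the task.
    Escalation triggers are checked first, mirroring the YAML above."""
    if task.get("security_concern") or task.get("compliance") or task.get("customer_facing"):
        return "escalate"
    if task.get("clear_requirements") and task.get("has_tests"):
        return "workable"
    return "non_workable"

# Step 2: each class gets exactly one destination.
ROUTES = {
    "workable": "automated_pipeline",
    "non_workable": "human_clarification",
    "escalate": "senior_review",
}

def route(task: dict) -> str:
    return ROUTES[classify_task(task)]
```

The key property is that escalation triggers are evaluated before workability: a task with clear requirements and tests still goes to senior review if it touches security or compliance.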
### Deterministic Outer Loops
**Key Insight:** Wrap agent outputs with rule-based validation.
```python
def deterministic_validation_loop(task, max_attempts=3):
    """
    Use LLMs only where genuine ambiguity exists.
    Wrap with deterministic rules.
    """
    for attempt in range(max_attempts):
        # LLM handles the ambiguous part
        output = agent.execute(task)

        # Deterministic validation (NOT LLM)
        validation_errors = []

        # Rule: Must have tests
        if not output.has_tests:
            validation_errors.append("Missing tests")

        # Rule: Must pass linting
        lint_result = run_linter(output.code)
        if lint_result.errors:
            validation_errors.append(f"Lint errors: {lint_result.errors}")

        # Rule: Must compile
        compile_result = compile_code(output.code)
        if not compile_result.success:
            validation_errors.append(f"Compile error: {compile_result.error}")

        # Rule: Tests must pass
        if output.has_tests:
            test_result = run_tests(output.code)
            if not test_result.all_passed:
                validation_errors.append(f"Test failures: {test_result.failures}")

        if not validation_errors:
            return output

        # Feed errors back for retry
        task = task.with_feedback(validation_errors)

    return FailedResult(reason="Max attempts exceeded")
```
---
## Context Engineering Patterns
### Context Curation Over Automatic Selection
**Key Insight:** Manually choose which files and information to provide.
```yaml
context_curation:
  principles:
    - "Less is more" - focused context beats comprehensive context
    - Manual selection outperforms automatic RAG
    - Remove outdated information aggressively
  anti_patterns:
    - Dumping entire codebase into context
    - Relying on automatic context selection
    - Accumulating conversation history indefinitely
  implementation:
    per_task_context:
      - 2-5 most relevant files
      - Specific functions, not entire modules
      - Recent changes only (last 1-2 days)
      - Clear success criteria
    context_budget:
      target: "< 10k tokens for context"
      reserve: "90% for model reasoning"
```
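The file-count and token-budget limits above can be enforced mechanically once files are hand-ranked. A minimal sketch, assuming candidates arrive ordered by relevance and using the rough ~4-characters-per-token heuristic (both the heuristic and the greedy cutoff are assumptions for illustration):

```python
def curate_context(candidates: dict, max_files: int = 5,
                   budget_tokens: int = 10_000) -> dict:
    """Keep at most `max_files` files from a relevance-ordered dict,
    stopping at the first file that would blow the token budget.
    Token cost is estimated at ~4 characters per token."""
    selected, used = {}, 0
    for path, text in candidates.items():
        cost = len(text) // 4
        if len(selected) >= max_files or used + cost > budget_tokens:
            break
        selected[path] = text
        used += cost
    return selected
```

Because curation is manual, the ranking itself is the human's job; the function only guards the "2-5 files, under 10k tokens" envelope.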
### Information Abstraction
**Key Insight:** Summarize rather than feeding full data.
```python
def abstract_for_agent(raw_data, task_context):
    """
    Design abstractions that preserve decision-relevant information.
    Based on practitioner insights.
    """
    # BAD: Feed 10,000 database rows
    # raw_data = db.query("SELECT * FROM users")

    # GOOD: Summarize to decision-relevant info
    summary = {
        "query_status": "success",
        "total_results": len(raw_data),
        "sample": raw_data[:5],
        "schema": extract_schema(raw_data),
        "statistics": {
            "null_count": count_nulls(raw_data),
            "unique_values": count_uniques(raw_data),
            "date_range": get_date_range(raw_data)
        }
    }
    return summary
```
### Separate Conversations Per Task
**Key Insight:** Fresh contexts yield better results than accumulated sessions.
```yaml
conversation_management:
  new_conversation_triggers:
    - Different domain (backend -> frontend)
    - New feature vs bug fix
    - After completing major task
    - When errors accumulate (3+ in a row)
  preserve_across_sessions:
    - CLAUDE.md / CONTINUITY.md
    - Architectural decisions
    - Key constraints
  discard_between_sessions:
    - Debugging attempts
    - Abandoned approaches
    - Intermediate drafts
```
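The reset triggers above reduce to a single predicate an orchestrator can check after each turn. A minimal sketch, assuming session state is tracked as a dict (the key names and the 3-error threshold come from the list above; the function name is illustrative):

```python
def should_start_fresh(state: dict) -> bool:
    """True when any new-conversation trigger from the list above fires."""
    return (
        state.get("domain_changed", False)       # e.g. backend -> frontend
        or state.get("task_completed", False)    # major task just finished
        or state.get("consecutive_errors", 0) >= 3  # errors accumulating
    )
```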
---
## Skills System Pattern
### On-Demand Skill Loading
**Key Insight:** Skills remain dormant until the model actively seeks them out.
```yaml
skills_architecture:
  core_interaction: "< 2k tokens"
  skill_loading: "On-demand via search"
  implementation:
    skill_discovery:
      - Shell script searches skill files
      - Model requests specific skills by name
      - Skills loaded only when needed
    skill_structure:
      name: "unique-skill-name"
      trigger: "Pattern that activates skill"
      content: "Detailed instructions"
      dependencies: ["other-skills"]
  benefits:
    - Minimal base context
    - Extensible without bloat
    - Skills can be updated independently
```
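The core of the pattern is that only names and triggers live in the base context, while full skill content loads on request. A minimal in-memory sketch (a real system would back this with skill files on disk; the class and method names are illustrative):

```python
class SkillRegistry:
    """Skills stay dormant: the base prompt sees only a cheap index,
    and detailed instructions are fetched by name on demand."""

    def __init__(self):
        self._skills = {}

    def register(self, name: str, trigger: str, content: str) -> None:
        self._skills[name] = {"trigger": trigger, "content": content}

    def index(self) -> list:
        """Lightweight listing for the base context: names and triggers only."""
        return [f"{name}: {skill['trigger']}" for name, skill in self._skills.items()]

    def load(self, name: str) -> str:
        """Full instructions, loaded only when the model asks for them."""
        return self._skills[name]["content"]
```

The token savings come from `index()` being all the model sees until it decides a trigger matches and calls `load()`.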
### Sub-Agents for Context Isolation
**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.
```python
async def context_isolated_search(query, codebase_path):
    """
    Use a sub-agent for grep/search to prevent context pollution.
    Based on Simon Willison's patterns.
    """
    # Main agent stays focused;
    # sub-agent handles noisy file searching
    search_agent = spawn_subagent(
        role="codebase-searcher",
        context_limit="10k tokens",
        permissions=["read-only"]
    )
    results = await search_agent.execute(
        task=f"Find files related to: {query}",
        codebase=codebase_path
    )
    # Return only relevant paths, not full content
    return FilteredResults(
        paths=results.relevant_files[:10],
        summaries=results.file_summaries,
        confidence=results.relevance_scores
    )
```
---
## Planning Before Execution
### Explicit Plan-Then-Code Workflow
**Key Insight:** Have models articulate detailed plans without immediately writing code.
```yaml
plan_then_code:
  phase_1_planning:
    outputs:
      - spec.md: "Detailed requirements"
      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
      - approach.md: "Implementation strategy"
    constraints:
      - NO CODE in this phase
      - Human review before proceeding
      - Clear success criteria
  phase_2_review:
    checks:
      - Plan addresses all requirements
      - Approach is feasible
      - No missing dependencies
      - Tests are specified
  phase_3_implementation:
    constraints:
      - Follow plan exactly
      - One task at a time
      - Test after each change
      - Report deviations immediately
```
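The transition from phase 2 to phase 3 is a simple gate: all planning artifacts exist and a human has signed off. A minimal sketch (the artifact filenames come from the workflow above; the function name is illustrative):

```python
# Artifacts phase 1 must produce before implementation may begin.
REQUIRED_PLAN_ARTIFACTS = {"spec.md", "todo.md", "approach.md"}

def may_start_implementation(artifacts: set, human_approved: bool) -> bool:
    """Phase gate: every planning artifact exists AND a human reviewed the plan.
    Either condition failing keeps the agent in the planning phase."""
    return REQUIRED_PLAN_ARTIFACTS <= artifacts and human_approved
```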
---
## Multi-Agent Orchestration Patterns
### Event-Driven Coordination
**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.
```yaml
event_driven_orchestration:
  problems_with_synchronous:
    - Doesn't scale
    - Mixes orchestration with prompt logic
    - Single failure breaks entire chain
    - No retry/recovery mechanism
  async_architecture:
    message_queue:
      - Agents communicate via events
      - Decoupled execution
      - Natural retry/dead-letter handling
    state_management:
      - Persistent task state
      - Checkpoint/resume capability
      - Clear ownership of data
    error_handling:
      - Per-agent retry policies
      - Circuit breakers
      - Graceful degradation
```
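The decoupling idea can be shown with an in-process event bus: producers publish to topics, consumers poll them, and neither calls the other directly. A toy sketch using the standard-library `queue` module (a production system would use a real broker with retries and dead-letter queues; topic names are illustrative):

```python
import queue

class EventBus:
    """Minimal topic-based bus: agents publish and consume events
    instead of invoking each other synchronously."""

    def __init__(self):
        self._topics = {}  # topic name -> queue.Queue of payloads

    def publish(self, topic: str, payload: dict) -> None:
        self._topics.setdefault(topic, queue.Queue()).put(payload)

    def consume(self, topic: str):
        """Pop the next event for a topic, or None if nothing is pending."""
        q = self._topics.get(topic)
        if q is None or q.empty():
            return None
        return q.get()
```

Because a publisher never blocks on its consumer, one failed agent stalls only its own topic rather than the whole chain.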
### Policy-First Enforcement
**Key Insight:** Govern agent behavior at runtime, not just training time.
```python
class PolicyEngine:
    """
    Runtime governance for agent behavior.
    Based on autonomous control plane patterns.
    """
    def __init__(self, policies):
        self.policies = policies

    async def enforce(self, agent_action, context):
        for policy in self.policies:
            result = await policy.evaluate(agent_action, context)
            if result.blocked:
                return BlockedAction(
                    reason=result.reason,
                    policy=policy.name,
                    remediation=result.suggested_action
                )
            if result.modified:
                agent_action = result.modified_action
        return AllowedAction(agent_action)

# Example policies
policies = [
    NoProductionDataDeletion(),
    NoSecretsInCode(),
    MaxTokenBudget(limit=100000),
    RequireTestsForCode(),
    BlockExternalNetworkCalls(in_sandbox=True)
]
```
### Simulation Layer
**Key Insight:** Evaluate changes before deploying to real environment.
```yaml
simulation_layer:
  purpose: "Test agent behavior in safe environment"
  implementation:
    sandbox_environment:
      - Isolated container
      - Mocked external services
      - Synthetic data
      - Full audit logging
    validation_checks:
      - Run tests in sandbox first
      - Compare outputs to expected
      - Check for policy violations
      - Measure resource consumption
    promotion_criteria:
      - All tests pass
      - No policy violations
      - Resource usage within limits
      - Human approval (for sensitive changes)
```
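The promotion criteria reduce to a boolean check over the sandbox run's report. A minimal sketch, assuming the report is a dict with the fields shown (field names and the function signature are illustrative):

```python
def may_promote(sim_report: dict, human_approved: bool = False,
                sensitive: bool = False) -> bool:
    """Apply the promotion criteria above to a sandbox run's report.
    Sensitive changes additionally require explicit human approval."""
    ok = (
        sim_report["tests_passed"]
        and sim_report["policy_violations"] == 0
        and sim_report["resource_usage"] <= sim_report["resource_limit"]
    )
    if sensitive:
        ok = ok and human_approved
    return ok
```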
---
## Evaluation and Benchmarking
### Problems with Current Benchmarks
**Key Insight:** LLM-as-judge creates shared blind spots.
```yaml
benchmark_problems:
  llm_judge_issues:
    - Same architecture = same failure modes
    - Math errors accepted as correct
    - '"Do-nothing" baseline passes 38% of time'
  contamination:
    - Published benchmarks become training targets
    - Overfitting to specific datasets
    - Inflated scores don't reflect real performance
  solutions:
    held_back_sets: "90% public, 10% private"
    human_evaluation: "Final published results require humans"
    production_testing: "A/B tests measure actual value"
    objective_outcomes: "Simulated environments with verifiable results"
```
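The held-back-set solution is a one-function fix: shuffle once with a fixed seed, publish the large slice, and never release the small one. A minimal sketch (the function name and seeding scheme are illustrative):

```python
import random

def split_eval_set(tasks, private_fraction=0.1, seed=0):
    """Hold back a private slice (10% per the guidance above) that is
    never published, limiting contamination of the public slice."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * private_fraction)
    return shuffled[cut:], shuffled[:cut]  # (public, private)
```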
### Practical Evaluation Approach
```python
import random

def evaluate_agent_change(before_agent, after_agent, task_set):
    """
    Production-oriented evaluation.
    Based on HN practitioner recommendations.
    """
    results = {
        "before": [],
        "after": [],
        "human_preference": []
    }
    for task in task_set:
        # Run both agents
        before_result = before_agent.execute(task)
        after_result = after_agent.execute(task)

        # Objective metrics (NOT LLM-judged)
        results["before"].append({
            "tests_pass": run_tests(before_result),
            "lint_clean": run_linter(before_result),
            "time_taken": before_result.duration,
            "tokens_used": before_result.tokens
        })
        results["after"].append({
            "tests_pass": run_tests(after_result),
            "lint_clean": run_linter(after_result),
            "time_taken": after_result.duration,
            "tokens_used": after_result.tokens
        })

        # Sample for human review
        if random.random() < 0.1:  # 10% sample
            results["human_preference"].append({
                "task": task,
                "before": before_result,
                "after": after_result,
                "pending_review": True
            })
    return EvaluationReport(results)
```
---
## Cost and Token Economics
### Real-World Cost Patterns
```yaml
cost_patterns:
  claude_code:
    heavy_use: "$25/1-2 hours on large codebases"
    api_range: "$1-5/hour depending on efficiency"
    max_tier: "$200/month often needs 2-3 subscriptions"
  token_economics:
    sub_agents_multiply_cost: "Each duplicates context"
    example: "5-task parallel job = 50,000+ tokens per subtask"
  optimization:
    context_isolation: "Use sub-agents for noisy tasks"
    information_abstraction: "Summarize, don't dump"
    fresh_conversations: "Reset after major tasks"
    skill_on_demand: "Load only when needed"
```
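The context-duplication effect is easy to estimate up front. A back-of-envelope sketch; the 8k shared-context and 2k per-task figures below are illustrative assumptions chosen so a 5-task job lands at 50,000 tokens, in the spirit of the example above:

```python
def parallel_job_tokens(context_tokens: int, n_subtasks: int,
                        per_task_tokens: int) -> int:
    """Each sub-agent duplicates the shared context, so total cost scales as
    n * (context + work), not context + n * work."""
    return n_subtasks * (context_tokens + per_task_tokens)
```

Running the estimate before spawning sub-agents makes the trade-off explicit: duplicating an 8k context five times costs 40k tokens before any work happens, which is why the optimization list above leads with context isolation and abstraction.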
---
## Sources
**Hacker News Discussions:**
- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)
**Show HN Projects:**
- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)