# Production Patterns Reference

Practitioner-tested patterns from Hacker News discussions and real-world deployments. These patterns represent what actually works in production, not theoretical frameworks.

---

## Overview

This reference consolidates battle-tested insights from:

- HN discussions on autonomous agents in production (2025)
- Coding with LLMs practitioner experiences
- Simon Willison's Superpowers coding agent patterns
- Multi-agent orchestration real-world deployments

---

## What Actually Works in Production

### Human-in-the-Loop (HITL) is Non-Negotiable

**Key Insight:** "Zero companies don't have a human in the loop" for customer-facing applications.

```yaml
hitl_patterns:
  always_human:
    - Customer-facing responses
    - Financial transactions
    - Security-critical operations
    - Legal/compliance decisions

  automation_candidates:
    - Internal tooling
    - Developer assistance
    - Data preprocessing
    - Code generation (with review)

  implementation:
    - Classification layer routes to human vs. automated paths
    - Confidence thresholds trigger escalation
    - Audit trails for all automated decisions
```
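
The routing rule above can be sketched as a small gate. This is a minimal illustration, not a specific framework's API; the category names and `route_action` helper are assumptions for the example:

```python
# Categories that must always reach a human, per the table above.
ALWAYS_HUMAN = {"customer_facing", "financial", "security", "legal"}

def route_action(category: str, confidence: float, threshold: float = 0.9) -> str:
    """Route a proposed agent action to a human or the automated path.

    Actions in a human-mandatory category are never automated; everything
    else still escalates when confidence falls below the threshold.
    """
    if category in ALWAYS_HUMAN:
        return "human_review"
    if confidence < threshold:
        return "human_review"  # confidence threshold triggers escalation
    return "automated"  # production systems would also write an audit record here
```

Note that the category check runs first: a 0.99-confidence customer-facing reply still goes to a human.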

### Narrow Scope Wins

**Key Insight:** Successful agents operate within tightly constrained domains.

```yaml
scope_constraints:
  max_steps_before_review: 3-5

  task_characteristics:
    - Specific, well-defined objectives
    - Pre-classified inputs
    - Deterministic success criteria
    - Verifiable outputs

  successful_domains:
    - Email scanning and classification
    - Invoice processing
    - Code refactoring (bounded)
    - Documentation generation
    - Test writing

  failure_prone_domains:
    - Open-ended feature implementation
    - Novel algorithm design
    - Security-critical code
    - Cross-system integrations
```
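
The `max_steps_before_review` constraint can be enforced mechanically rather than by convention. A hypothetical sketch (the `StepBudget` class and its method names are illustrative):

```python
class StepBudgetExceeded(Exception):
    """Raised when an agent runs past its step budget without a review."""

class StepBudget:
    """Force a human checkpoint every N agent steps (3-5 per the pattern above)."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> None:
        """Call once per agent step; raises once the budget is exhausted."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepBudgetExceeded(f"review required after {self.max_steps} steps")

    def reviewed(self) -> None:
        """A human reviewed progress; reset the counter."""
        self.steps = 0
```

The orchestrator calls `tick()` around each tool invocation and catches `StepBudgetExceeded` to pause for review.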

### Confidence-Based Routing

**Key Insight:** Treat agents as preprocessors, not decision-makers.

```python
def confidence_based_routing(agent_output):
    """
    Route based on confidence, not capability.
    Based on production practitioner patterns.
    """
    confidence = agent_output.confidence_score

    if confidence >= 0.95:
        # High confidence: auto-approve with logging
        return AutoApprove(audit_log=True)
    elif confidence >= 0.70:
        # Medium confidence: quick human review
        return HumanReview(priority="normal", timeout="1h")
    elif confidence >= 0.40:
        # Low confidence: detailed human review
        return HumanReview(priority="high", context="full")
    else:
        # Very low confidence: escalate immediately
        return Escalate(reason="low_confidence", require_senior=True)
```

### Classification Before Automation

**Key Insight:** Separate inputs before processing.

```yaml
classification_first:
  step_1_classify:
    workable:
      - Clear requirements
      - Existing patterns
      - Test coverage available
    non_workable:
      - Ambiguous requirements
      - Novel architecture
      - Missing dependencies
    escalate_immediately:
      - Security concerns
      - Compliance requirements
      - Customer-facing changes

  step_2_route:
    workable: "Automated pipeline"
    non_workable: "Human clarification"
    escalate: "Senior review"
```
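
The classify-then-route steps above can be sketched in a few lines. The task-dict fields and rule order are assumptions for the example; a real system would back these rules with a trained classifier or an LLM call, keeping the escalation checks as hard overrides:

```python
def classify_task(task: dict) -> str:
    """Classify a task before any automation touches it (rules mirror the table above)."""
    # Escalation rules are checked first and always win.
    if task.get("security_sensitive") or task.get("customer_facing"):
        return "escalate"
    if not task.get("requirements_clear") or task.get("missing_dependencies"):
        return "non_workable"
    return "workable"

# Step 2: each class maps to exactly one route.
ROUTES = {
    "workable": "automated_pipeline",
    "non_workable": "human_clarification",
    "escalate": "senior_review",
}

def route_task(task: dict) -> str:
    return ROUTES[classify_task(task)]
```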

### Deterministic Outer Loops

**Key Insight:** Wrap agent outputs with rule-based validation.

```python
def deterministic_validation_loop(task, max_attempts=3):
    """
    Use LLMs only where genuine ambiguity exists.
    Wrap with deterministic rules.
    """
    for attempt in range(max_attempts):
        # LLM handles the ambiguous part
        output = agent.execute(task)

        # Deterministic validation (NOT LLM)
        validation_errors = []

        # Rule: must have tests
        if not output.has_tests:
            validation_errors.append("Missing tests")

        # Rule: must pass linting
        lint_result = run_linter(output.code)
        if lint_result.errors:
            validation_errors.append(f"Lint errors: {lint_result.errors}")

        # Rule: must compile
        compile_result = compile_code(output.code)
        if not compile_result.success:
            validation_errors.append(f"Compile error: {compile_result.error}")

        # Rule: tests must pass
        if output.has_tests:
            test_result = run_tests(output.code)
            if not test_result.all_passed:
                validation_errors.append(f"Test failures: {test_result.failures}")

        if not validation_errors:
            return output

        # Feed errors back for retry
        task = task.with_feedback(validation_errors)

    return FailedResult(reason="Max attempts exceeded")
```

---

## Context Engineering Patterns

### Context Curation Over Automatic Selection

**Key Insight:** Manually choose which files and information to provide.

```yaml
context_curation:
  principles:
    - "Less is more - focused context beats comprehensive context"
    - Manual selection outperforms automatic RAG
    - Remove outdated information aggressively

  anti_patterns:
    - Dumping the entire codebase into context
    - Relying on automatic context selection
    - Accumulating conversation history indefinitely

  implementation:
    per_task_context:
      - 2-5 most relevant files
      - Specific functions, not entire modules
      - Recent changes only (last 1-2 days)
      - Clear success criteria

    context_budget:
      target: "< 10k tokens for context"
      reserve: "90% for model reasoning"
```
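
The context budget can be enforced with a greedy selector. A minimal sketch, assuming files are already ranked by relevance and using the rough four-characters-per-token heuristic (both assumptions, not part of the pattern itself):

```python
def select_context(ranked_files: list, budget_tokens: int = 10_000) -> list:
    """Greedily take the most relevant files until the token budget is hit.

    ranked_files: (path, content) pairs, ordered most-relevant first.
    """
    selected, used = [], 0
    for path, content in ranked_files:
        cost = len(content) // 4  # crude token estimate: ~4 chars per token
        if used + cost > budget_tokens:
            break  # "less is more": stop rather than squeeze everything in
        selected.append(path)
        used += cost
    return selected
```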

### Information Abstraction

**Key Insight:** Summarize rather than feeding full data.

```python
def abstract_for_agent(raw_data, task_context):
    """
    Design abstractions that preserve decision-relevant information.
    Based on practitioner insights.
    """
    # BAD: feed 10,000 database rows
    # raw_data = db.query("SELECT * FROM users")

    # GOOD: summarize to decision-relevant info
    summary = {
        "query_status": "success",
        "total_results": len(raw_data),
        "sample": raw_data[:5],
        "schema": extract_schema(raw_data),
        "statistics": {
            "null_count": count_nulls(raw_data),
            "unique_values": count_uniques(raw_data),
            "date_range": get_date_range(raw_data),
        },
    }

    return summary
```

### Separate Conversations Per Task

**Key Insight:** Fresh contexts yield better results than accumulated sessions.

```yaml
conversation_management:
  new_conversation_triggers:
    - Different domain (backend -> frontend)
    - New feature vs. bug fix
    - After completing a major task
    - When errors accumulate (3+ in a row)

  preserve_across_sessions:
    - CLAUDE.md / CONTINUITY.md
    - Architectural decisions
    - Key constraints

  discard_between_sessions:
    - Debugging attempts
    - Abandoned approaches
    - Intermediate drafts
```
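
The "errors accumulate" trigger is the easiest one to automate. A sketch, assuming the orchestrator keeps a simple outcome log per session (the `"error"` label is illustrative):

```python
def should_start_fresh(recent_outcomes: list, error_streak: int = 3) -> bool:
    """Return True when the last N outcomes are all errors.

    Mirrors the "errors accumulate (3+ in a row)" trigger above; the other
    triggers (domain switch, task completion) would be checked alongside.
    """
    if len(recent_outcomes) < error_streak:
        return False
    return all(o == "error" for o in recent_outcomes[-error_streak:])
```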

---

## Skills System Pattern

### On-Demand Skill Loading

**Key Insight:** Skills remain dormant until the model actively seeks them out.

```yaml
skills_architecture:
  core_interaction: "< 2k tokens"
  skill_loading: "On-demand via search"

  implementation:
    skill_discovery:
      - Shell script searches skill files
      - Model requests specific skills by name
      - Skills are loaded only when needed

    skill_structure:
      name: "unique-skill-name"
      trigger: "Pattern that activates the skill"
      content: "Detailed instructions"
      dependencies: ["other-skills"]

  benefits:
    - Minimal base context
    - Extensible without bloat
    - Skills can be updated independently
```
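
The discovery/load split above can be sketched as an in-memory registry. This is a simplified stand-in (real implementations search skill files on disk via a shell script); the `SkillRegistry` API is an assumption for illustration:

```python
class SkillRegistry:
    """Skills stay dormant: search returns names only, content loads on demand."""

    def __init__(self):
        self._skills = {}

    def register(self, name: str, trigger: str, content: str) -> None:
        self._skills[name] = {"trigger": trigger, "content": content}

    def search(self, query: str) -> list:
        """Return matching skill names; detailed content is NOT loaded yet."""
        q = query.lower()
        return [n for n, s in self._skills.items() if q in s["trigger"].lower()]

    def load(self, name: str) -> str:
        """Load the full instructions for one skill, only when requested."""
        return self._skills[name]["content"]
```

Only the trigger strings ever sit in the base context; the (much larger) content enters the window after an explicit `load`.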

### Sub-Agents for Context Isolation

**Key Insight:** Prevent massive token waste by isolating context-noisy subtasks.

```python
async def context_isolated_search(query, codebase_path):
    """
    Use a sub-agent for grep/search to prevent context pollution.
    Based on Simon Willison's patterns.
    """
    # The main agent stays focused;
    # the sub-agent handles noisy file searching.

    search_agent = spawn_subagent(
        role="codebase-searcher",
        context_limit="10k tokens",
        permissions=["read-only"],
    )

    results = await search_agent.execute(
        task=f"Find files related to: {query}",
        codebase=codebase_path,
    )

    # Return only relevant paths, not full content
    return FilteredResults(
        paths=results.relevant_files[:10],
        summaries=results.file_summaries,
        confidence=results.relevance_scores,
    )
```

---

## Planning Before Execution

### Explicit Plan-Then-Code Workflow

**Key Insight:** Have models articulate detailed plans without immediately writing code.

```yaml
plan_then_code:
  phase_1_planning:
    outputs:
      - spec.md: "Detailed requirements"
      - todo.md: "Tagged tasks [BUG], [FEAT], [REFACTOR]"
      - approach.md: "Implementation strategy"
    constraints:
      - NO CODE in this phase
      - Human review before proceeding
      - Clear success criteria

  phase_2_review:
    checks:
      - Plan addresses all requirements
      - Approach is feasible
      - No missing dependencies
      - Tests are specified

  phase_3_implementation:
    constraints:
      - Follow the plan exactly
      - One task at a time
      - Test after each change
      - Report deviations immediately
```
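
Tagged tasks in `todo.md` are easy to process mechanically during phase 3. A small sketch, assuming one `- [TAG] description` task per line (the exact file format is an assumption; the tag set comes from the plan above):

```python
import re

# Matches lines like "- [BUG] Fix off-by-one in pager"
TAG_RE = re.compile(r"^- \[(BUG|FEAT|REFACTOR)\]\s+(.*)$")

def parse_todo(todo_md: str) -> list:
    """Extract (tag, description) pairs from a tagged todo.md; skip other lines."""
    tasks = []
    for line in todo_md.splitlines():
        m = TAG_RE.match(line.strip())
        if m:
            tasks.append((m.group(1), m.group(2)))
    return tasks
```

An orchestrator can then feed tasks to the agent one at a time, in order, as the phase-3 constraints require.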

---

## Multi-Agent Orchestration Patterns

### Event-Driven Coordination

**Key Insight:** Move beyond synchronous prompt chaining to asynchronous, decoupled systems.

```yaml
event_driven_orchestration:
  problems_with_synchronous:
    - Doesn't scale
    - Mixes orchestration with prompt logic
    - A single failure breaks the entire chain
    - No retry/recovery mechanism

  async_architecture:
    message_queue:
      - Agents communicate via events
      - Decoupled execution
      - Natural retry/dead-letter handling

    state_management:
      - Persistent task state
      - Checkpoint/resume capability
      - Clear ownership of data

    error_handling:
      - Per-agent retry policies
      - Circuit breakers
      - Graceful degradation
```
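
The queue-based decoupling can be sketched with an `asyncio.Queue` standing in for a real message broker (Kafka, SQS, etc. in production); the agent work is a stub that just records the event it handled:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Consume events until a None sentinel; one bad event never kills the chain."""
    while True:
        event = await queue.get()
        if event is None:
            break
        try:
            results.append(f"{name}:{event['task']}")  # stand-in for agent work
        except Exception:
            pass  # real systems route the event to a dead-letter queue here

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # Two decoupled workers pull from the same queue.
    workers = [asyncio.create_task(worker(f"agent{i}", queue, results)) for i in range(2)]
    for task in ("classify", "summarize", "review"):
        await queue.put({"task": task})
    for _ in workers:
        await queue.put(None)  # one shutdown sentinel per worker
    await asyncio.gather(*workers)
    return results

processed = asyncio.run(main())
```

Because publishers only touch the queue, adding a third worker or retrying a failed event requires no change to the prompt logic.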

### Policy-First Enforcement

**Key Insight:** Govern agent behavior at runtime, not just training time.

```python
class PolicyEngine:
    """
    Runtime governance for agent behavior.
    Based on autonomous control plane patterns.
    """

    def __init__(self, policies):
        self.policies = policies

    async def enforce(self, agent_action, context):
        for policy in self.policies:
            result = await policy.evaluate(agent_action, context)

            if result.blocked:
                return BlockedAction(
                    reason=result.reason,
                    policy=policy.name,
                    remediation=result.suggested_action,
                )

            if result.modified:
                agent_action = result.modified_action

        return AllowedAction(agent_action)

# Example policies
policies = [
    NoProductionDataDeletion(),
    NoSecretsInCode(),
    MaxTokenBudget(limit=100000),
    RequireTestsForCode(),
    BlockExternalNetworkCalls(in_sandbox=True),
]
```
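
One of the listed policies, `NoSecretsInCode`, might look like the sketch below. The `evaluate` signature matches the `PolicyEngine` above; the `PolicyResult` dataclass and the two regexes are illustrative assumptions (production scanners use far larger rule sets):

```python
import re
from dataclasses import dataclass

@dataclass
class PolicyResult:
    blocked: bool = False
    modified: bool = False
    reason: str = ""
    suggested_action: str = ""

# Illustrative patterns only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS access key id shape
    re.compile(r"(?i)api[_-]?key\s*=\s*['\"]\w+"),  # hardcoded api keys
]

class NoSecretsInCode:
    name = "no-secrets-in-code"

    async def evaluate(self, agent_action, context) -> PolicyResult:
        code = getattr(agent_action, "code", "") or ""
        for pattern in SECRET_PATTERNS:
            if pattern.search(code):
                return PolicyResult(
                    blocked=True,
                    reason=f"possible secret matches {pattern.pattern}",
                    suggested_action="move the value to a secrets manager",
                )
        return PolicyResult()
```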

### Simulation Layer

**Key Insight:** Evaluate changes before deploying to the real environment.

```yaml
simulation_layer:
  purpose: "Test agent behavior in a safe environment"

  implementation:
    sandbox_environment:
      - Isolated container
      - Mocked external services
      - Synthetic data
      - Full audit logging

    validation_checks:
      - Run tests in the sandbox first
      - Compare outputs to expected results
      - Check for policy violations
      - Measure resource consumption

    promotion_criteria:
      - All tests pass
      - No policy violations
      - Resource usage within limits
      - Human approval (for sensitive changes)
```

---

## Evaluation and Benchmarking

### Problems with Current Benchmarks

**Key Insight:** LLM-as-judge creates shared blind spots.

```yaml
benchmark_problems:
  llm_judge_issues:
    - Same architecture = same failure modes
    - Math errors accepted as correct
    - A "do-nothing" baseline passes 38% of the time

  contamination:
    - Published benchmarks become training targets
    - Overfitting to specific datasets
    - Inflated scores don't reflect real performance

  solutions:
    held_back_sets: "90% public, 10% private"
    human_evaluation: "Final published results require humans"
    production_testing: "A/B tests measure actual value"
    objective_outcomes: "Simulated environments with verifiable results"
```

### Practical Evaluation Approach

```python
import random

def evaluate_agent_change(before_agent, after_agent, task_set):
    """
    Production-oriented evaluation.
    Based on HN practitioner recommendations.
    """
    results = {
        "before": [],
        "after": [],
        "human_preference": [],
    }

    for task in task_set:
        # Run both agents on the same task
        before_result = before_agent.execute(task)
        after_result = after_agent.execute(task)

        # Objective metrics (NOT LLM-judged)
        for label, result in (("before", before_result), ("after", after_result)):
            results[label].append({
                "tests_pass": run_tests(result),
                "lint_clean": run_linter(result),
                "time_taken": result.duration,
                "tokens_used": result.tokens,
            })

        # Sample a subset for human review
        if random.random() < 0.1:  # 10% sample
            results["human_preference"].append({
                "task": task,
                "before": before_result,
                "after": after_result,
                "pending_review": True,
            })

    return EvaluationReport(results)
```

---

## Cost and Token Economics

### Real-World Cost Patterns

```yaml
cost_patterns:
  claude_code:
    heavy_use: "$25 per 1-2 hours on large codebases"
    api_range: "$1-5/hour depending on efficiency"
    max_tier: "$200/month often needs 2-3 subscriptions"

  token_economics:
    sub_agents_multiply_cost: "Each sub-agent duplicates the shared context"
    example: "5-task parallel job = 50,000+ tokens per subtask"

  optimization:
    context_isolation: "Use sub-agents for noisy tasks"
    information_abstraction: "Summarize, don't dump"
    fresh_conversations: "Reset after major tasks"
    skills_on_demand: "Load only when needed"
```
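
The multiplication effect is worth making explicit with quick arithmetic. A sketch; the 48k/2k split of shared vs. task-specific tokens is an assumed illustration, not a measured figure:

```python
def subtask_tokens(shared_context_tokens: int, per_task_tokens: int) -> int:
    """Tokens consumed by ONE sub-agent: it re-reads the entire shared context."""
    return shared_context_tokens + per_task_tokens

def parallel_job_tokens(num_subtasks: int, shared_context_tokens: int,
                        per_task_tokens: int) -> int:
    """Total for the job: the shared context is duplicated into every subtask."""
    return num_subtasks * subtask_tokens(shared_context_tokens, per_task_tokens)

# e.g. 48k of shared context + 2k of task-specific input is 50,000 tokens
# per subtask, so a 5-task parallel job burns 250,000 tokens in total.
```

This is why the optimizations above (abstraction, isolation, fresh conversations) attack context size first: the shared portion is paid once per sub-agent, not once per job.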

---

## Sources

**Hacker News Discussions:**

- [What Actually Works in Production for Autonomous Agents](https://news.ycombinator.com/item?id=44623207)
- [Coding with LLMs in Summer 2025](https://news.ycombinator.com/item?id=44623953)
- [Superpowers: How I'm Using Coding Agents](https://news.ycombinator.com/item?id=45547344)
- [Claude Code Experience After Two Weeks](https://news.ycombinator.com/item?id=44596472)
- [AI Agent Benchmarks Are Broken](https://news.ycombinator.com/item?id=44531697)
- [How to Orchestrate Multi-Agent Workflows](https://news.ycombinator.com/item?id=45955997)
- [Context Engineering vs Prompt Engineering](https://news.ycombinator.com/item?id=44427757)

**Show HN Projects:**

- [Self-Evolving Agents Repository](https://news.ycombinator.com/item?id=45099226)
- [Package Manager for Agent Skills](https://news.ycombinator.com/item?id=46422264)
- [Wispbit - AI Code Review Agent](https://news.ycombinator.com/item?id=44722603)
- [Agtrace - Monitoring for AI Coding Agents](https://news.ycombinator.com/item?id=46425670)