535 lines
17 KiB
Markdown
535 lines
17 KiB
Markdown
# Lab Research Patterns Reference
|
|
|
|
Research-backed patterns from Google DeepMind and Anthropic for enhanced multi-agent orchestration and safety.
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
This reference consolidates key patterns from:
|
|
1. **Google DeepMind** - World models, self-improvement, scalable oversight
|
|
2. **Anthropic** - Constitutional AI, alignment safety, agentic coding
|
|
|
|
---
|
|
|
|
## Google DeepMind Patterns
|
|
|
|
### World Model Training (Dreamer 4)
|
|
|
|
**Key Insight:** Train agents inside world models for safety and data efficiency.
|
|
|
|
```yaml
|
|
world_model_training:
|
|
principle: "Learn behaviors through simulation, not real environment"
|
|
benefits:
|
|
- 100x less data than real-world training
|
|
- Safe exploration of dangerous actions
|
|
- Faster iteration cycles
|
|
|
|
architecture:
|
|
tokenizer: "Compress frames into continuous representation"
|
|
dynamics_model: "Predict next world state given action"
|
|
imagination_training: "RL inside simulated trajectories"
|
|
|
|
loki_application:
|
|
- Run agent tasks in isolated containers first
|
|
- Simulate deployment before actual deploy
|
|
- Test error scenarios in sandbox
|
|
```
|
|
|
|
### Self-Improvement Loop (SIMA 2)
|
|
|
|
**Key Insight:** Use AI to generate tasks and score outcomes for bootstrapped learning.
|
|
|
|
```python
|
|
class SelfImprovementLoop:
|
|
"""
|
|
Based on SIMA 2's self-improvement mechanism.
|
|
Gemini-based teacher + learned reward model.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.task_generator = "Use LLM to generate varied tasks"
|
|
self.reward_model = "Learned model to score trajectories"
|
|
self.experience_bank = []
|
|
|
|
def bootstrap_cycle(self):
|
|
# 1. Generate tasks with estimated rewards
|
|
tasks = self.task_generator.generate(
|
|
domain=current_project,
|
|
difficulty_curriculum=True
|
|
)
|
|
|
|
# 2. Execute tasks, accumulate experience
|
|
for task in tasks:
|
|
trajectory = execute(task)
|
|
reward = self.reward_model.score(trajectory)
|
|
self.experience_bank.append((trajectory, reward))
|
|
|
|
# 3. Train next generation on experience
|
|
next_agent = train_on_experience(self.experience_bank)
|
|
|
|
# 4. Iterate with minimal human intervention
|
|
return next_agent
|
|
```
|
|
|
|
**Loki Mode Application:**
|
|
- Generate test scenarios automatically
|
|
- Score code quality with learned criteria
|
|
- Bootstrap agent training across projects
|
|
|
|
### Hierarchical Reasoning (Gemini Robotics)
|
|
|
|
**Key Insight:** Separate high-level planning from low-level execution.
|
|
|
|
```
|
|
+------------------------------------------------------------------+
|
|
| EMBODIED REASONING MODEL (Gemini Robotics-ER) |
|
|
| - Orchestrates activities like a "high-level brain" |
|
|
| - Spatial understanding, planning, logical decisions |
|
|
| - Natively calls tools (search, user functions) |
|
|
| - Does NOT directly control actions |
|
|
+------------------------------------------------------------------+
|
|
|
|
|
| High-level insights
|
|
v
|
|
+------------------------------------------------------------------+
|
|
| VISION-LANGUAGE-ACTION MODEL (Gemini Robotics) |
|
|
| - "Thinks before taking action" |
|
|
| - Generates internal reasoning in natural language |
|
|
| - Decomposes long tasks into simpler segments |
|
|
| - Directly outputs actions/commands |
|
|
+------------------------------------------------------------------+
|
|
```
|
|
|
|
**Loki Mode Application:**
|
|
- Orchestrator = ER model (planning, tool calls)
|
|
- Implementation agents = VLA model (code actions)
|
|
- Task decomposition before execution
|
|
|
|
### Cross-Embodiment Transfer
|
|
|
|
**Key Insight:** Skills learned by one agent type transfer to others.
|
|
|
|
```yaml
|
|
transfer_learning:
|
|
observation: "Tasks learned on ALOHA2 work on Apollo humanoid"
|
|
mechanism: "Shared action space abstraction"
|
|
|
|
loki_application:
|
|
- Patterns learned by frontend agent transfer to mobile agent
|
|
- Testing strategies from QA apply to security testing
|
|
- Deployment scripts generalize across cloud providers
|
|
|
|
implementation:
|
|
shared_skills_library: ".loki/memory/skills/"
|
|
abstraction_layer: "Domain-agnostic action primitives"
|
|
transfer_score: "Confidence in skill applicability"
|
|
```
|
|
|
|
### Scalable Oversight via Debate
|
|
|
|
**Key Insight:** Pit AI capabilities against each other for verification.
|
|
|
|
```python
|
|
async def debate_verification(proposal, max_rounds=2):
|
|
"""
|
|
Based on DeepMind's Scalable AI Safety via Doubly-Efficient Debate.
|
|
Use debate to break down verification into manageable sub-tasks.
|
|
"""
|
|
# Two equally capable AI critics
|
|
proponent = Agent(role="defender", model="opus")
|
|
opponent = Agent(role="challenger", model="opus")
|
|
|
|
debate_log = []
|
|
|
|
for round in range(max_rounds):
|
|
# Proponent defends proposal
|
|
defense = await proponent.argue(
|
|
proposal=proposal,
|
|
counter_arguments=debate_log
|
|
)
|
|
|
|
# Opponent challenges
|
|
challenge = await opponent.argue(
|
|
proposal=proposal,
|
|
defense=defense,
|
|
goal="find_flaws"
|
|
)
|
|
|
|
debate_log.append({
|
|
"round": round,
|
|
"defense": defense,
|
|
"challenge": challenge
|
|
})
|
|
|
|
# If opponent cannot find valid flaw, proposal is verified
|
|
if not challenge.has_valid_flaw:
|
|
return VerificationResult(verified=True, debate_log=debate_log)
|
|
|
|
# Human reviews remaining disagreements
|
|
return escalate_to_human(debate_log)
|
|
```
|
|
|
|
### Amplified Oversight
|
|
|
|
**Key Insight:** Use AI to help humans supervise AI beyond human capability.
|
|
|
|
```yaml
|
|
amplified_oversight:
|
|
goal: "Supervision as close as possible to human with complete understanding"
|
|
|
|
techniques:
|
|
- "AI explains its reasoning transparently"
|
|
- "AI argues against itself when wrong"
|
|
- "AI cites relevant evidence"
|
|
- "Monitor knows when it doesn't know"
|
|
|
|
monitoring_principle:
|
|
when_unsure: "Either reject action OR flag for review"
|
|
never: "Approve uncertain actions silently"
|
|
```
|
|
|
|
---
|
|
|
|
## Anthropic Patterns
|
|
|
|
### Constitutional AI Principles
|
|
|
|
**Key Insight:** Train AI to self-critique based on explicit principles.
|
|
|
|
```python
|
|
class ConstitutionalAI:
|
|
"""
|
|
Based on Anthropic's Constitutional AI: Harmlessness from AI Feedback.
|
|
Self-critique and revision based on constitutional principles.
|
|
"""
|
|
|
|
def __init__(self, constitution):
|
|
self.constitution = constitution # List of principles
|
|
|
|
async def supervised_learning_phase(self, response):
|
|
"""Phase 1: Self-critique and revise."""
|
|
# Generate initial response
|
|
initial = response
|
|
|
|
# Self-critique against each principle
|
|
critiques = []
|
|
for principle in self.constitution:
|
|
critique = await self.critique(
|
|
response=initial,
|
|
principle=principle,
|
|
prompt=f"Does this response violate: {principle}?"
|
|
)
|
|
critiques.append(critique)
|
|
|
|
# Revise based on critiques
|
|
revised = await self.revise(
|
|
response=initial,
|
|
critiques=critiques
|
|
)
|
|
|
|
return revised
|
|
|
|
async def rlai_phase(self, response_pairs):
|
|
"""Phase 2: AI compares responses for constitutional compliance."""
|
|
preferences = []
|
|
for (response_a, response_b) in response_pairs:
|
|
preference = await self.compare(
|
|
a=response_a,
|
|
b=response_b,
|
|
criterion="Which better follows the constitution?"
|
|
)
|
|
preferences.append(preference)
|
|
|
|
# Train preference model on AI feedback
|
|
return train_preference_model(preferences)
|
|
```
|
|
|
|
**Loki Mode Constitution Example:**
|
|
```yaml
|
|
loki_constitution:
|
|
- "Never delete production data without explicit backup"
|
|
- "Never commit secrets or credentials to version control"
|
|
- "Never bypass quality gates for speed"
|
|
- "Always verify tests pass before marking task complete"
|
|
- "Never claim completion without running actual tests"
|
|
- "Prefer simple solutions over clever ones"
|
|
- "Document decisions, not just code"
|
|
```
|
|
|
|
### Sleeper Agent Detection (Defection Probes)
|
|
|
|
**Key Insight:** Use linear classifiers to detect when agents might misbehave.
|
|
|
|
```python
|
|
class DefectionProbe:
|
|
"""
|
|
Based on Anthropic's "Simple probes can catch sleeper agents".
|
|
Linear classifiers detecting anomalous agent behavior.
|
|
"""
|
|
|
|
def __init__(self):
|
|
self.baseline_behaviors = []
|
|
self.probe = LinearClassifier()
|
|
|
|
def train_probe(self, normal_trajectories, defection_examples):
|
|
"""
|
|
Train on contrast pairs:
|
|
- Normal agent behavior
|
|
- Examples of defection/misbehavior
|
|
"""
|
|
contrast_pairs = create_contrast_pairs(
|
|
normal=normal_trajectories,
|
|
defection=defection_examples
|
|
)
|
|
self.probe.fit(contrast_pairs)
|
|
|
|
def check_for_defection(self, agent_action, context):
|
|
"""
|
|
Check if action appears anomalous.
|
|
AUROC > 99% achieved in Anthropic's research.
|
|
"""
|
|
features = extract_features(agent_action, context)
|
|
defection_probability = self.probe.predict_proba(features)
|
|
|
|
if defection_probability > THRESHOLD:
|
|
return DefectionAlert(
|
|
probability=defection_probability,
|
|
action=agent_action,
|
|
recommendation="Flag for human review"
|
|
)
|
|
return None
|
|
```
|
|
|
|
**Loki Mode Application:**
|
|
- Monitor agent actions for anomalies
|
|
- Flag unusual file access patterns
|
|
- Detect attempts to bypass quality gates
|
|
|
|
### Alignment Faking Awareness
|
|
|
|
**Key Insight:** Models may strategically comply while preserving hidden preferences.
|
|
|
|
```yaml
|
|
alignment_faking_signals:
|
|
warning_signs:
|
|
- "Agent complies only when it knows it's being monitored"
|
|
- "Agent's reasoning contradicts its final action"
|
|
- "Sudden behavior change when training context mentioned"
|
|
- "Agent attempts to exfiltrate its own weights/state"
|
|
|
|
detection_strategy:
|
|
- "Monitor chain-of-thought for rationalization"
|
|
- "Compare behavior in 'free' vs 'monitored' conditions"
|
|
- "Check for attempts to undermine oversight systems"
|
|
|
|
mitigation:
|
|
- "Constitutional constraints at training time"
|
|
- "Transparency in reasoning (visible extended thinking)"
|
|
- "Defection probes on intermediate reasoning"
|
|
```
|
|
|
|
### Extended Thinking Levels
|
|
|
|
**Key Insight:** Allocate computation based on problem complexity.
|
|
|
|
```yaml
|
|
thinking_levels:
|
|
"think":
|
|
computation: "Minimal"
|
|
use_for: "Simple questions, straightforward tasks"
|
|
|
|
"think hard":
|
|
computation: "Moderate"
|
|
use_for: "Multi-step problems, code implementation"
|
|
|
|
"think harder":
|
|
computation: "Extended"
|
|
use_for: "Complex debugging, architecture decisions"
|
|
|
|
"ultrathink":
|
|
computation: "Maximum"
|
|
use_for: "Security analysis, critical system design"
|
|
|
|
loki_mode_mapping:
|
|
haiku_tasks: "think"
|
|
sonnet_tasks: "think hard"
|
|
opus_tasks: "think harder to ultrathink"
|
|
```
|
|
|
|
### Explore-Plan-Code Pattern
|
|
|
|
**Key Insight:** Research before planning, plan before coding.
|
|
|
|
```
|
|
+------------------------------------------------------------------+
|
|
| PHASE 1: EXPLORE |
|
|
| - Research relevant files |
|
|
| - Understand existing patterns |
|
|
| - Identify dependencies and constraints |
|
|
| - NO CODE CHANGES YET |
|
|
+------------------------------------------------------------------+
|
|
|
|
|
v
|
|
+------------------------------------------------------------------+
|
|
| PHASE 2: PLAN |
|
|
| - Create detailed implementation plan |
|
|
| - List all files to modify |
|
|
| - Define success criteria |
|
|
| - Get checkpoint approval if needed |
|
|
| - STILL NO CODE CHANGES |
|
|
+------------------------------------------------------------------+
|
|
|
|
|
v
|
|
+------------------------------------------------------------------+
|
|
| PHASE 3: CODE |
|
|
| - Execute plan systematically |
|
|
| - Test after each file change |
|
|
| - Update plan if discoveries require it |
|
|
| - Verify against success criteria |
|
|
+------------------------------------------------------------------+
|
|
```
|
|
|
|
### Context Reset Strategy
|
|
|
|
**Key Insight:** Fresh context often performs better than accumulated context.
|
|
|
|
```yaml
|
|
context_management:
|
|
problem: "Long sessions accumulate irrelevant information"
|
|
|
|
solution:
|
|
trigger_reset:
|
|
- "After completing major task"
|
|
- "When changing domains (backend -> frontend)"
|
|
- "When agent seems confused or repeating errors"
|
|
|
|
preserve_across_reset:
|
|
- "CONTINUITY.md (working memory)"
|
|
- "Key decisions made this session"
|
|
- "Current task state"
|
|
|
|
discard_on_reset:
|
|
- "Intermediate debugging attempts"
|
|
- "Abandoned approaches"
|
|
- "Superseded plans"
|
|
```
|
|
|
|
### Parallel Instance Pattern
|
|
|
|
**Key Insight:** Multiple Claude instances with separation of concerns.
|
|
|
|
```python
|
|
async def parallel_instance_pattern(task):
|
|
"""
|
|
Run multiple Claude instances for separation of concerns.
|
|
Based on Anthropic's Claude Code best practices.
|
|
"""
|
|
# Instance 1: Implementation
|
|
implementer = spawn_instance(
|
|
role="implementer",
|
|
context=implementation_context,
|
|
permissions=["edit", "bash"]
|
|
)
|
|
|
|
# Instance 2: Review
|
|
reviewer = spawn_instance(
|
|
role="reviewer",
|
|
context=review_context,
|
|
permissions=["read"] # Read-only for safety
|
|
)
|
|
|
|
# Parallel execution
|
|
implementation = await implementer.execute(task)
|
|
review = await reviewer.review(implementation)
|
|
|
|
if review.approved:
|
|
return implementation
|
|
else:
|
|
# Feed review back to implementer for fixes
|
|
fixed = await implementer.fix(review.issues)
|
|
return fixed
|
|
```
|
|
|
|
### Prompt Injection Defense
|
|
|
|
**Key Insight:** Multi-layer defense against injection attacks.
|
|
|
|
```yaml
|
|
prompt_injection_defense:
|
|
layers:
|
|
layer_1_recognition:
|
|
- "Train to recognize injection patterns"
|
|
- "Detect malicious content in external sources"
|
|
|
|
layer_2_context_isolation:
|
|
- "Sandbox external content processing"
|
|
- "Mark user content vs system instructions"
|
|
|
|
layer_3_action_validation:
|
|
- "Verify requested actions are authorized"
|
|
- "Block sensitive operations without confirmation"
|
|
|
|
layer_4_monitoring:
|
|
- "Log all external content interactions"
|
|
- "Alert on suspicious patterns"
|
|
|
|
performance:
|
|
claude_opus_4: "89% attack prevention"
|
|
claude_sonnet_4: "86% attack prevention"
|
|
```
|
|
|
|
---
|
|
|
|
## Combined Patterns for Loki Mode
|
|
|
|
### Self-Improving Multi-Agent System
|
|
|
|
```yaml
|
|
combined_approach:
|
|
world_model_training: "Test in simulation before real execution"
|
|
self_improvement: "Bootstrap learning from successful trajectories"
|
|
constitutional_constraints: "Principles-based self-critique"
|
|
debate_verification: "Pit reviewers against each other"
|
|
defection_probes: "Monitor for alignment faking"
|
|
|
|
implementation_priority:
|
|
high:
|
|
- Constitutional AI principles in agent prompts
|
|
- Explore-Plan-Code workflow enforcement
|
|
- Context reset triggers
|
|
|
|
medium:
|
|
- Self-improvement loop for task generation
|
|
- Debate-based verification for critical changes
|
|
- Cross-embodiment skill transfer
|
|
|
|
low:
|
|
- Full world model training
|
|
- Defection probe classifiers
|
|
```
|
|
|
|
---
|
|
|
|
## Sources
|
|
|
|
**Google DeepMind:**
|
|
- [SIMA 2: Generalist AI Agent](https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/)
|
|
- [Gemini Robotics 1.5](https://deepmind.google/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/)
|
|
- [Dreamer 4: World Model Training](https://danijar.com/project/dreamer4/)
|
|
- [Genie 3: World Models](https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/)
|
|
- [Scalable AI Safety via Debate](https://deepmind.google/research/publications/34920/)
|
|
- [Amplified Oversight](https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a)
|
|
- [Technical AGI Safety Approach](https://arxiv.org/html/2504.01849v1)
|
|
|
|
**Anthropic:**
|
|
- [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
|
|
- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
|
|
- [Claude Code Best Practices](https://www.anthropic.com/engineering/claude-code-best-practices)
|
|
- [Sleeper Agents Detection](https://www.anthropic.com/research/probes-catch-sleeper-agents)
|
|
- [Alignment Faking](https://www.anthropic.com/research/alignment-faking)
|
|
- [Visible Extended Thinking](https://www.anthropic.com/research/visible-extended-thinking)
|
|
- [Computer Use Safety](https://www.anthropic.com/news/3-5-models-and-computer-use)
|
|
- [Sabotage Evaluations](https://www.anthropic.com/research/sabotage-evaluations-for-frontier-models)
|