# Lab Research Patterns Reference

Research-backed patterns from Google DeepMind and Anthropic for enhanced multi-agent orchestration and safety.

## Overview

This reference consolidates key patterns from:

- **Google DeepMind**: world models, self-improvement, scalable oversight
- **Anthropic**: Constitutional AI, alignment safety, agentic coding
## Google DeepMind Patterns

### World Model Training (Dreamer 4)

**Key Insight:** Train agents inside world models for safety and data efficiency.

```yaml
world_model_training:
  principle: "Learn behaviors through simulation, not the real environment"
  benefits:
    - "100x less data than real-world training"
    - "Safe exploration of dangerous actions"
    - "Faster iteration cycles"
  architecture:
    tokenizer: "Compress frames into a continuous representation"
    dynamics_model: "Predict the next world state given an action"
    imagination_training: "RL inside simulated trajectories"
  loki_application:
    - "Run agent tasks in isolated containers first"
    - "Simulate deployment before the actual deploy"
    - "Test error scenarios in a sandbox"
```
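A minimal sketch of the sandbox-first idea as it might apply to Loki Mode. `sandbox_executor` and `real_executor` are hypothetical callables (e.g. a throwaway container runner and the real deployer), not part of any published API:

```python
from dataclasses import dataclass, field

@dataclass
class SandboxResult:
    succeeded: bool
    errors: list = field(default_factory=list)

def run_sandboxed_first(task, sandbox_executor, real_executor):
    """Rehearse a task in an isolated sandbox before touching the real
    environment, mirroring the train-in-simulation principle."""
    rehearsal = sandbox_executor(task)  # e.g. run inside a throwaway container
    if not rehearsal.succeeded:
        # Iterate safely: the real environment was never at risk.
        return SandboxResult(False, rehearsal.errors)
    return real_executor(task)  # only run for real after a clean rehearsal
```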
### Self-Improvement Loop (SIMA 2)

**Key Insight:** Use AI to generate tasks and score outcomes for bootstrapped learning.

```python
class SelfImprovementLoop:
    """
    Based on SIMA 2's self-improvement mechanism:
    a Gemini-based teacher generates tasks and a learned reward model
    scores the resulting trajectories.
    """

    def __init__(self, task_generator, reward_model):
        self.task_generator = task_generator  # LLM used to generate varied tasks
        self.reward_model = reward_model      # learned model that scores trajectories
        self.experience_bank = []

    def bootstrap_cycle(self, current_project):
        # 1. Generate tasks with a difficulty curriculum
        tasks = self.task_generator.generate(
            domain=current_project,
            difficulty_curriculum=True,
        )
        # 2. Execute tasks and accumulate scored experience
        #    (execute() is an assumed task-runner helper)
        for task in tasks:
            trajectory = execute(task)
            reward = self.reward_model.score(trajectory)
            self.experience_bank.append((trajectory, reward))
        # 3. Train the next generation on the accumulated experience
        next_agent = train_on_experience(self.experience_bank)
        # 4. Iterate with minimal human intervention
        return next_agent
```
**Loki Mode Application:**

- Generate test scenarios automatically
- Score code quality with learned criteria
- Bootstrap agent training across projects
### Hierarchical Reasoning (Gemini Robotics)

**Key Insight:** Separate high-level planning from low-level execution.

```
+--------------------------------------------------------------+
| EMBODIED REASONING MODEL (Gemini Robotics-ER)                |
| - Orchestrates activities like a "high-level brain"          |
| - Spatial understanding, planning, logical decisions         |
| - Natively calls tools (search, user functions)              |
| - Does NOT directly control actions                          |
+--------------------------------------------------------------+
                |
                | high-level insights
                v
+--------------------------------------------------------------+
| VISION-LANGUAGE-ACTION MODEL (Gemini Robotics)               |
| - "Thinks before taking action"                              |
| - Generates internal reasoning in natural language           |
| - Decomposes long tasks into simpler segments                |
| - Directly outputs actions/commands                          |
+--------------------------------------------------------------+
```

**Loki Mode Application:**

- Orchestrator = ER model (planning, tool calls)
- Implementation agents = VLA model (code actions)
- Task decomposition before execution
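A toy sketch of that two-tier split. `Planner` and `Executor` are illustrative stand-ins for the ER and VLA roles; their method names are assumptions, not a real interface:

```python
class Planner:
    """High-level 'ER' tier: decomposes goals, never acts directly."""
    def decompose(self, goal: str) -> list[str]:
        return [f"step 1 of {goal}", f"step 2 of {goal}"]  # placeholder plan

class Executor:
    """Low-level 'VLA' tier: performs one concrete step at a time."""
    def perform(self, step: str) -> str:
        return f"done: {step}"

def hierarchical_run(goal: str, planner: Planner, executor: Executor) -> list[str]:
    steps = planner.decompose(goal)              # planning tier: no actions
    return [executor.perform(s) for s in steps]  # execution tier: actions only

print(hierarchical_run("add login page", Planner(), Executor()))
```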
### Cross-Embodiment Transfer

**Key Insight:** Skills learned by one agent type transfer to others.

```yaml
transfer_learning:
  observation: "Tasks learned on ALOHA 2 work on the Apollo humanoid"
  mechanism: "Shared action-space abstraction"
  loki_application:
    - "Patterns learned by the frontend agent transfer to the mobile agent"
    - "Testing strategies from QA apply to security testing"
    - "Deployment scripts generalize across cloud providers"
  implementation:
    shared_skills_library: ".loki/memory/skills/"
    abstraction_layer: "Domain-agnostic action primitives"
    transfer_score: "Confidence in skill applicability"
```
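A minimal sketch of such a shared skills library, using the `.loki/memory/skills/` path from the YAML above. The JSON schema and the scoring rule are assumptions for illustration; a real system would learn the transfer score rather than hard-code it:

```python
import json
from pathlib import Path

SKILLS_DIR = Path(".loki/memory/skills")  # shared_skills_library from the YAML above

def save_skill(name: str, steps: list[str], domains: list[str]) -> None:
    """Persist a domain-tagged skill so other agent types can load it."""
    SKILLS_DIR.mkdir(parents=True, exist_ok=True)
    path = SKILLS_DIR / f"{name}.json"
    path.write_text(json.dumps({"steps": steps, "domains": domains}, indent=2))

def transfer_score(skill: dict, target_domain: str) -> float:
    """Crude applicability confidence: an exact domain match scores 1.0,
    a skill tagged 'generic' scores 0.5, anything else 0.0."""
    if target_domain in skill["domains"]:
        return 1.0
    return 0.5 if "generic" in skill["domains"] else 0.0
```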
### Scalable Oversight via Debate

**Key Insight:** Pit AI capabilities against each other for verification.

```python
async def debate_verification(proposal, max_rounds=2):
    """
    Based on DeepMind's "Scalable AI Safety via Doubly-Efficient Debate":
    use debate to break verification down into manageable sub-tasks.
    """
    # Two equally capable AI critics
    proponent = Agent(role="defender", model="opus")
    opponent = Agent(role="challenger", model="opus")
    debate_log = []

    for round_num in range(max_rounds):
        # Proponent defends the proposal against prior challenges
        defense = await proponent.argue(
            proposal=proposal,
            counter_arguments=debate_log,
        )
        # Opponent tries to find flaws in the defense
        challenge = await opponent.argue(
            proposal=proposal,
            defense=defense,
            goal="find_flaws",
        )
        debate_log.append({
            "round": round_num,
            "defense": defense,
            "challenge": challenge,
        })
        # If the opponent cannot find a valid flaw, the proposal is verified
        if not challenge.has_valid_flaw:
            return VerificationResult(verified=True, debate_log=debate_log)

    # A human reviews any remaining disagreement
    return escalate_to_human(debate_log)
```
### Amplified Oversight

**Key Insight:** Use AI to help humans supervise AI beyond human capability.

```yaml
amplified_oversight:
  goal: "Supervision as close as possible to a human with complete understanding"
  techniques:
    - "AI explains its reasoning transparently"
    - "AI argues against itself when wrong"
    - "AI cites relevant evidence"
    - "The monitor knows when it doesn't know"
  monitoring_principle:
    when_unsure: "Either reject the action OR flag it for review"
    never: "Approve uncertain actions silently"
```
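The monitoring principle reduces to a simple three-way gate. A minimal sketch, assuming a calibrated confidence score; the threshold values are illustrative, not from the source:

```python
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    FLAG = "flag_for_review"

def gated_approval(confidence: float,
                   approve_at: float = 0.9,
                   reject_at: float = 0.3) -> Verdict:
    """Uncertain actions are never approved silently: the middle band
    between the two thresholds always escalates to human review."""
    if confidence >= approve_at:
        return Verdict.APPROVE
    if confidence <= reject_at:
        return Verdict.REJECT
    return Verdict.FLAG
```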
## Anthropic Patterns

### Constitutional AI Principles

**Key Insight:** Train AI to self-critique based on explicit principles.

```python
class ConstitutionalAI:
    """
    Based on Anthropic's "Constitutional AI: Harmlessness from AI Feedback":
    self-critique and revision against explicit constitutional principles.
    """

    def __init__(self, constitution):
        self.constitution = constitution  # list of principles

    async def supervised_learning_phase(self, initial_response):
        """Phase 1: critique the initial response against each principle,
        then revise it based on the collected critiques."""
        critiques = []
        for principle in self.constitution:
            critique = await self.critique(
                response=initial_response,
                principle=principle,
                prompt=f"Does this response violate: {principle}?",
            )
            critiques.append(critique)
        # Revise based on the critiques
        return await self.revise(response=initial_response, critiques=critiques)

    async def rlaif_phase(self, response_pairs):
        """Phase 2 (RLAIF): the AI compares response pairs for
        constitutional compliance, producing preference labels."""
        preferences = []
        for response_a, response_b in response_pairs:
            preference = await self.compare(
                a=response_a,
                b=response_b,
                criterion="Which better follows the constitution?",
            )
            preferences.append(preference)
        # Train a preference model on the AI feedback
        return train_preference_model(preferences)
```
**Loki Mode Constitution Example:**

```yaml
loki_constitution:
  - "Never delete production data without an explicit backup"
  - "Never commit secrets or credentials to version control"
  - "Never bypass quality gates for speed"
  - "Always verify tests pass before marking a task complete"
  - "Never claim completion without running actual tests"
  - "Prefer simple solutions over clever ones"
  - "Document decisions, not just code"
```
### Sleeper Agent Detection (Defection Probes)

**Key Insight:** Use linear classifiers to detect when agents might misbehave.

```python
DEFECTION_THRESHOLD = 0.5  # tune on held-out contrast pairs

class DefectionProbe:
    """
    Based on Anthropic's "Simple probes can catch sleeper agents":
    a linear classifier over behavioral features flags anomalous actions.
    """

    def __init__(self):
        self.probe = LinearClassifier()

    def train_probe(self, normal_trajectories, defection_examples):
        """Train on contrast pairs of normal behavior vs known
        defection/misbehavior examples."""
        contrast_pairs = create_contrast_pairs(
            normal=normal_trajectories,
            defection=defection_examples,
        )
        self.probe.fit(contrast_pairs)

    def check_for_defection(self, agent_action, context):
        """Score an action for anomaly; Anthropic reports AUROC > 99%
        for such probes in its research setting."""
        features = extract_features(agent_action, context)
        defection_probability = self.probe.predict_proba(features)
        if defection_probability > DEFECTION_THRESHOLD:
            return DefectionAlert(
                probability=defection_probability,
                action=agent_action,
                recommendation="Flag for human review",
            )
        return None
```

**Loki Mode Application:**

- Monitor agent actions for anomalies
- Flag unusual file access patterns
- Detect attempts to bypass quality gates
### Alignment Faking Awareness

**Key Insight:** Models may strategically comply while preserving hidden preferences.

```yaml
alignment_faking_signals:
  warning_signs:
    - "Agent complies only when it knows it is being monitored"
    - "Agent's reasoning contradicts its final action"
    - "Sudden behavior change when the training context is mentioned"
    - "Agent attempts to exfiltrate its own weights or state"
  detection_strategy:
    - "Monitor chain-of-thought for rationalization"
    - "Compare behavior in 'free' vs 'monitored' conditions"
    - "Check for attempts to undermine oversight systems"
  mitigation:
    - "Constitutional constraints at training time"
    - "Transparency in reasoning (visible extended thinking)"
    - "Defection probes on intermediate reasoning"
```
### Extended Thinking Levels

**Key Insight:** Allocate computation based on problem complexity.

```yaml
thinking_levels:
  "think":
    computation: "Minimal"
    use_for: "Simple questions, straightforward tasks"
  "think hard":
    computation: "Moderate"
    use_for: "Multi-step problems, code implementation"
  "think harder":
    computation: "Extended"
    use_for: "Complex debugging, architecture decisions"
  "ultrathink":
    computation: "Maximum"
    use_for: "Security analysis, critical system design"

loki_mode_mapping:
  haiku_tasks: "think"
  sonnet_tasks: "think hard"
  opus_tasks: "think harder to ultrathink"
```
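One way to encode that mapping, with an assumed `high_risk` flag to cover the "think harder to ultrathink" range for opus tasks:

```python
THINKING_LEVELS = ["think", "think hard", "think harder", "ultrathink"]

def pick_thinking_level(model: str, high_risk: bool = False) -> str:
    """Map the model tier to a base level per the table above;
    a high-risk task bumps the level up one notch."""
    base = {"haiku": 0, "sonnet": 1, "opus": 2}.get(model, 1)
    return THINKING_LEVELS[min(base + (1 if high_risk else 0), 3)]

assert pick_thinking_level("haiku") == "think"
assert pick_thinking_level("opus", high_risk=True) == "ultrathink"
```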
### Explore-Plan-Code Pattern

**Key Insight:** Research before planning, plan before coding.

```
+--------------------------------------------------------------+
| PHASE 1: EXPLORE                                             |
| - Research relevant files                                    |
| - Understand existing patterns                               |
| - Identify dependencies and constraints                      |
| - NO CODE CHANGES YET                                        |
+--------------------------------------------------------------+
                |
                v
+--------------------------------------------------------------+
| PHASE 2: PLAN                                                |
| - Create detailed implementation plan                        |
| - List all files to modify                                   |
| - Define success criteria                                    |
| - Get checkpoint approval if needed                          |
| - STILL NO CODE CHANGES                                      |
+--------------------------------------------------------------+
                |
                v
+--------------------------------------------------------------+
| PHASE 3: CODE                                                |
| - Execute plan systematically                                |
| - Test after each file change                                |
| - Update plan if discoveries require it                      |
| - Verify against success criteria                            |
+--------------------------------------------------------------+
```
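A minimal sketch of enforcing that ordering as a state machine; the class and method names are illustrative, not an existing Loki Mode API:

```python
from enum import Enum, auto

class Phase(Enum):
    EXPLORE = auto()
    PLAN = auto()
    CODE = auto()

class PhaseGate:
    """Refuses code changes until exploration notes and a plan exist."""

    def __init__(self):
        self.phase = Phase.EXPLORE
        self.notes = None
        self.plan = None

    def finish_explore(self, notes: str) -> None:
        self.notes = notes
        self.phase = Phase.PLAN

    def finish_plan(self, plan: list[str]) -> None:
        if self.phase is not Phase.PLAN:
            raise RuntimeError("explore before planning")
        self.plan = plan
        self.phase = Phase.CODE

    def may_edit(self) -> bool:
        # Edits are allowed only in phase 3, with a plan on record.
        return self.phase is Phase.CODE and self.plan is not None
```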
### Context Reset Strategy

**Key Insight:** Fresh context often performs better than accumulated context.

```yaml
context_management:
  problem: "Long sessions accumulate irrelevant information"
  solution:
    trigger_reset:
      - "After completing a major task"
      - "When changing domains (backend -> frontend)"
      - "When the agent seems confused or is repeating errors"
    preserve_across_reset:
      - "CONTINUITY.md (working memory)"
      - "Key decisions made this session"
      - "Current task state"
    discard_on_reset:
      - "Intermediate debugging attempts"
      - "Abandoned approaches"
      - "Superseded plans"
```
### Parallel Instance Pattern

**Key Insight:** Multiple Claude instances with separation of concerns.

```python
async def parallel_instance_pattern(task):
    """
    Run separate Claude instances with separation of concerns.
    Based on Anthropic's Claude Code best practices.
    """
    # Instance 1: implementation, with write access
    implementer = spawn_instance(
        role="implementer",
        context=implementation_context,
        permissions=["edit", "bash"],
    )
    # Instance 2: review, read-only for safety
    reviewer = spawn_instance(
        role="reviewer",
        context=review_context,
        permissions=["read"],
    )

    # Implement first, then have the independent instance review the output
    implementation = await implementer.execute(task)
    review = await reviewer.review(implementation)

    if review.approved:
        return implementation
    # Feed the review findings back to the implementer for fixes
    return await implementer.fix(review.issues)
```
### Prompt Injection Defense

**Key Insight:** Multi-layer defense against injection attacks.

```yaml
prompt_injection_defense:
  layers:
    layer_1_recognition:
      - "Train the model to recognize injection patterns"
      - "Detect malicious content in external sources"
    layer_2_context_isolation:
      - "Sandbox external content processing"
      - "Mark user content vs system instructions"
    layer_3_action_validation:
      - "Verify requested actions are authorized"
      - "Block sensitive operations without confirmation"
    layer_4_monitoring:
      - "Log all external content interactions"
      - "Alert on suspicious patterns"
  performance:
    claude_opus_4: "89% attack prevention"
    claude_sonnet_4: "86% attack prevention"
```
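A toy sketch of layers 1 and 3 only. The regex patterns are invented examples (real layer-1 recognition is model-based, not regex-based), and the return shape is an assumption:

```python
import re

# Illustrative layer-1 patterns only; production recognition is model-based.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"disregard your system prompt", re.IGNORECASE),
]

def screen_external_content(text: str,
                            requested_action: str,
                            authorized_actions: set[str]) -> dict:
    """Layers 1 and 3 from the YAML: pattern recognition on external
    content, then validation of the action it requests. Layers 2 and 4
    (isolation and monitoring) belong to the surrounding runtime."""
    flagged = any(p.search(text) for p in INJECTION_PATTERNS)
    authorized = requested_action in authorized_actions
    if flagged or not authorized:
        return {"blocked": True,
                "reason": "injection pattern" if flagged else "unauthorized action"}
    return {"blocked": False, "reason": None}
```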
## Combined Patterns for Loki Mode

### Self-Improving Multi-Agent System

```yaml
combined_approach:
  world_model_training: "Test in simulation before real execution"
  self_improvement: "Bootstrap learning from successful trajectories"
  constitutional_constraints: "Principles-based self-critique"
  debate_verification: "Pit reviewers against each other"
  defection_probes: "Monitor for alignment faking"

implementation_priority:
  high:
    - "Constitutional AI principles in agent prompts"
    - "Explore-Plan-Code workflow enforcement"
    - "Context reset triggers"
  medium:
    - "Self-improvement loop for task generation"
    - "Debate-based verification for critical changes"
    - "Cross-embodiment skill transfer"
  low:
    - "Full world model training"
    - "Defection probe classifiers"
```
## Sources

**Google DeepMind:**

- SIMA 2: Generalist AI Agent
- Gemini Robotics 1.5
- Dreamer 4: World Model Training
- Genie 3: World Models
- Scalable AI Safety via Doubly-Efficient Debate
- Amplified Oversight
- Technical AGI Safety Approach

**Anthropic:**

- Constitutional AI: Harmlessness from AI Feedback
- Simple Probes Can Catch Sleeper Agents
- Alignment Faking in Large Language Models
- Claude Code Best Practices