# Lab Research Patterns Reference

Research-backed patterns from Google DeepMind and Anthropic for enhanced multi-agent orchestration and safety.

---

## Overview

This reference consolidates key patterns from:

1. **Google DeepMind** - World models, self-improvement, scalable oversight
2. **Anthropic** - Constitutional AI, alignment safety, agentic coding

---

## Google DeepMind Patterns

### World Model Training (Dreamer 4)

**Key Insight:** Train agents inside world models for safety and data efficiency.

```yaml
world_model_training:
  principle: "Learn behaviors through simulation, not the real environment"
  benefits:
    - 100x less data than real-world training
    - Safe exploration of dangerous actions
    - Faster iteration cycles
  architecture:
    tokenizer: "Compress frames into a continuous representation"
    dynamics_model: "Predict the next world state given an action"
    imagination_training: "RL inside simulated trajectories"
  loki_application:
    - Run agent tasks in isolated containers first
    - Simulate deployment before actual deploy
    - Test error scenarios in sandbox
```

### Self-Improvement Loop (SIMA 2)

**Key Insight:** Use AI to generate tasks and score outcomes for bootstrapped learning.

```python
class SelfImprovementLoop:
    """
    Based on SIMA 2's self-improvement mechanism:
    a Gemini-based teacher plus a learned reward model.
    """

    def __init__(self, task_generator, reward_model):
        self.task_generator = task_generator  # LLM that generates varied tasks
        self.reward_model = reward_model      # Learned model that scores trajectories
        self.experience_bank = []

    def bootstrap_cycle(self, current_project):
        # 1. Generate tasks with a difficulty curriculum
        tasks = self.task_generator.generate(
            domain=current_project,
            difficulty_curriculum=True
        )

        # 2. Execute tasks and accumulate scored experience
        for task in tasks:
            trajectory = execute(task)
            reward = self.reward_model.score(trajectory)
            self.experience_bank.append((trajectory, reward))

        # 3. Train the next generation on accumulated experience
        next_agent = train_on_experience(self.experience_bank)

        # 4. Iterate with minimal human intervention
        return next_agent
```

**Loki Mode Application:**
- Generate test scenarios automatically
- Score code quality with learned criteria
- Bootstrap agent training across projects

### Hierarchical Reasoning (Gemini Robotics)

**Key Insight:** Separate high-level planning from low-level execution.

```
+------------------------------------------------------------------+
| EMBODIED REASONING MODEL (Gemini Robotics-ER)                    |
| - Orchestrates activities like a "high-level brain"              |
| - Spatial understanding, planning, logical decisions             |
| - Natively calls tools (search, user functions)                  |
| - Does NOT directly control actions                              |
+------------------------------------------------------------------+
                              |
                              | High-level insights
                              v
+------------------------------------------------------------------+
| VISION-LANGUAGE-ACTION MODEL (Gemini Robotics)                   |
| - "Thinks before taking action"                                  |
| - Generates internal reasoning in natural language               |
| - Decomposes long tasks into simpler segments                    |
| - Directly outputs actions/commands                              |
+------------------------------------------------------------------+
```

**Loki Mode Application:**
- Orchestrator = ER model (planning, tool calls)
- Implementation agents = VLA model (code actions)
- Task decomposition before execution

### Cross-Embodiment Transfer

**Key Insight:** Skills learned by one agent type transfer to others.

```yaml
transfer_learning:
  observation: "Tasks learned on ALOHA2 work on Apollo humanoid"
  mechanism: "Shared action space abstraction"
  loki_application:
    - Patterns learned by frontend agent transfer to mobile agent
    - Testing strategies from QA apply to security testing
    - Deployment scripts generalize across cloud providers
  implementation:
    shared_skills_library: ".loki/memory/skills/"
    abstraction_layer: "Domain-agnostic action primitives"
    transfer_score: "Confidence in skill applicability"
```

### Scalable Oversight via Debate

**Key Insight:** Pit AI capabilities against each other for verification.
```python
async def debate_verification(proposal, max_rounds=2):
    """
    Based on DeepMind's "Scalable AI Safety via Doubly-Efficient Debate".
    Use debate to break verification down into manageable sub-tasks.
    """
    # Two equally capable AI critics
    proponent = Agent(role="defender", model="opus")
    opponent = Agent(role="challenger", model="opus")

    debate_log = []
    for round_num in range(max_rounds):
        # Proponent defends the proposal
        defense = await proponent.argue(
            proposal=proposal,
            counter_arguments=debate_log
        )

        # Opponent challenges it
        challenge = await opponent.argue(
            proposal=proposal,
            defense=defense,
            goal="find_flaws"
        )

        debate_log.append({
            "round": round_num,
            "defense": defense,
            "challenge": challenge
        })

        # If the opponent cannot find a valid flaw, the proposal is verified
        if not challenge.has_valid_flaw:
            return VerificationResult(verified=True, debate_log=debate_log)

    # A human reviews any remaining disagreements
    return escalate_to_human(debate_log)
```

### Amplified Oversight

**Key Insight:** Use AI to help humans supervise AI beyond human capability.

```yaml
amplified_oversight:
  goal: "Supervision as close as possible to that of a human with complete understanding"
  techniques:
    - "AI explains its reasoning transparently"
    - "AI argues against itself when wrong"
    - "AI cites relevant evidence"
    - "Monitor knows when it doesn't know"
  monitoring_principle:
    when_unsure: "Either reject action OR flag for review"
    never: "Approve uncertain actions silently"
```

---

## Anthropic Patterns

### Constitutional AI Principles

**Key Insight:** Train AI to self-critique based on explicit principles.

```python
class ConstitutionalAI:
    """
    Based on Anthropic's "Constitutional AI: Harmlessness from AI Feedback".
    Self-critique and revision based on constitutional principles.
    """

    def __init__(self, constitution):
        self.constitution = constitution  # List of principles

    async def supervised_learning_phase(self, initial):
        """Phase 1: Self-critique and revise an initial response."""
        # Critique the response against each principle
        critiques = []
        for principle in self.constitution:
            critique = await self.critique(
                response=initial,
                principle=principle,
                prompt=f"Does this response violate: {principle}?"
            )
            critiques.append(critique)

        # Revise based on the collected critiques
        revised = await self.revise(
            response=initial,
            critiques=critiques
        )
        return revised

    async def rlaif_phase(self, response_pairs):
        """Phase 2 (RLAIF): AI compares responses for constitutional compliance."""
        preferences = []
        for (response_a, response_b) in response_pairs:
            preference = await self.compare(
                a=response_a,
                b=response_b,
                criterion="Which better follows the constitution?"
            )
            preferences.append(preference)

        # Train a preference model on the AI feedback
        return train_preference_model(preferences)
```

**Loki Mode Constitution Example:**

```yaml
loki_constitution:
  - "Never delete production data without explicit backup"
  - "Never commit secrets or credentials to version control"
  - "Never bypass quality gates for speed"
  - "Always verify tests pass before marking task complete"
  - "Never claim completion without running actual tests"
  - "Prefer simple solutions over clever ones"
  - "Document decisions, not just code"
```

### Sleeper Agent Detection (Defection Probes)

**Key Insight:** Use linear classifiers to detect when agents might misbehave.

```python
# Example alert threshold; tune on held-out validation data
THRESHOLD = 0.5

class DefectionProbe:
    """
    Based on Anthropic's "Simple probes can catch sleeper agents":
    linear classifiers that detect anomalous agent behavior.
    """

    def __init__(self):
        self.baseline_behaviors = []
        self.probe = LinearClassifier()

    def train_probe(self, normal_trajectories, defection_examples):
        """
        Train on contrast pairs:
        - normal agent behavior
        - examples of defection/misbehavior
        """
        contrast_pairs = create_contrast_pairs(
            normal=normal_trajectories,
            defection=defection_examples
        )
        self.probe.fit(contrast_pairs)

    def check_for_defection(self, agent_action, context):
        """
        Check whether an action appears anomalous.
        Anthropic's research reports AUROC > 99% for such probes.
        """
        features = extract_features(agent_action, context)
        defection_probability = self.probe.predict_proba(features)

        if defection_probability > THRESHOLD:
            return DefectionAlert(
                probability=defection_probability,
                action=agent_action,
                recommendation="Flag for human review"
            )
        return None
```

**Loki Mode Application:**
- Monitor agent actions for anomalies
- Flag unusual file access patterns
- Detect attempts to bypass quality gates

### Alignment Faking Awareness

**Key Insight:** Models may strategically comply while preserving hidden preferences.

```yaml
alignment_faking_signals:
  warning_signs:
    - "Agent complies only when it knows it's being monitored"
    - "Agent's reasoning contradicts its final action"
    - "Sudden behavior change when training context mentioned"
    - "Agent attempts to exfiltrate its own weights/state"
  detection_strategy:
    - "Monitor chain-of-thought for rationalization"
    - "Compare behavior in 'free' vs 'monitored' conditions"
    - "Check for attempts to undermine oversight systems"
  mitigation:
    - "Constitutional constraints at training time"
    - "Transparency in reasoning (visible extended thinking)"
    - "Defection probes on intermediate reasoning"
```

### Extended Thinking Levels

**Key Insight:** Allocate computation based on problem complexity.
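As a sketch of how an orchestrator might dispatch on this idea, the helper below maps a task-complexity estimate to one of the levels listed in the mapping that follows. This is a hypothetical illustration: `pick_thinking_level` and the numeric cutoffs are invented, not part of any published API.

```python
# Hypothetical dispatcher: map an estimated task complexity (0..1)
# to an extended-thinking level. Cutoffs are illustrative only.
def pick_thinking_level(complexity: float) -> str:
    """Return a thinking budget for the given complexity estimate."""
    if complexity < 0.25:
        return "think"          # simple questions, straightforward tasks
    if complexity < 0.5:
        return "think hard"     # multi-step problems, code implementation
    if complexity < 0.75:
        return "think harder"   # complex debugging, architecture decisions
    return "ultrathink"         # security analysis, critical system design
```

In practice the complexity estimate itself could come from a cheap classifier pass, keeping the expensive budgets for the tasks that need them.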
```yaml
thinking_levels:
  "think":
    computation: "Minimal"
    use_for: "Simple questions, straightforward tasks"
  "think hard":
    computation: "Moderate"
    use_for: "Multi-step problems, code implementation"
  "think harder":
    computation: "Extended"
    use_for: "Complex debugging, architecture decisions"
  "ultrathink":
    computation: "Maximum"
    use_for: "Security analysis, critical system design"

loki_mode_mapping:
  haiku_tasks: "think"
  sonnet_tasks: "think hard"
  opus_tasks: "think harder to ultrathink"
```

### Explore-Plan-Code Pattern

**Key Insight:** Research before planning, plan before coding.

```
+------------------------------------------------------------------+
| PHASE 1: EXPLORE                                                 |
| - Research relevant files                                        |
| - Understand existing patterns                                   |
| - Identify dependencies and constraints                          |
| - NO CODE CHANGES YET                                            |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
| PHASE 2: PLAN                                                    |
| - Create detailed implementation plan                            |
| - List all files to modify                                       |
| - Define success criteria                                        |
| - Get checkpoint approval if needed                              |
| - STILL NO CODE CHANGES                                          |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
| PHASE 3: CODE                                                    |
| - Execute plan systematically                                    |
| - Test after each file change                                    |
| - Update plan if discoveries require it                          |
| - Verify against success criteria                                |
+------------------------------------------------------------------+
```

### Context Reset Strategy

**Key Insight:** Fresh context often performs better than accumulated context.
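The triggers and preserve/discard lists described below can be sketched as a small policy helper. All names here (`SessionState`, `should_reset`, `reset_context`) are hypothetical illustrations, not an existing API, and the error-repetition cutoff is an invented example value.

```python
from dataclasses import dataclass

@dataclass
class SessionState:
    # Hypothetical session bookkeeping for illustration.
    completed_major_task: bool = False
    previous_domain: str = "backend"
    current_domain: str = "backend"
    repeated_errors: int = 0

def should_reset(state: SessionState) -> bool:
    """True when any reset trigger from the strategy fires."""
    return (
        state.completed_major_task
        or state.current_domain != state.previous_domain  # e.g. backend -> frontend
        or state.repeated_errors >= 3  # "confused or repeating errors" (cutoff invented)
    )

def reset_context(working_memory: dict) -> dict:
    """Keep only what the strategy says to preserve across a reset."""
    keep = {"continuity_md", "key_decisions", "task_state"}
    return {k: v for k, v in working_memory.items() if k in keep}
```

The point of the helper is that the reset decision is mechanical; the judgment lives in choosing what belongs in the preserve set.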
```yaml
context_management:
  problem: "Long sessions accumulate irrelevant information"
  solution:
    trigger_reset:
      - "After completing major task"
      - "When changing domains (backend -> frontend)"
      - "When agent seems confused or repeating errors"
    preserve_across_reset:
      - "CONTINUITY.md (working memory)"
      - "Key decisions made this session"
      - "Current task state"
    discard_on_reset:
      - "Intermediate debugging attempts"
      - "Abandoned approaches"
      - "Superseded plans"
```

### Parallel Instance Pattern

**Key Insight:** Multiple Claude instances with separation of concerns.

```python
async def parallel_instance_pattern(task):
    """
    Run multiple Claude instances with separation of concerns.
    Based on Anthropic's Claude Code best practices.
    """
    # Instance 1: Implementation
    implementer = spawn_instance(
        role="implementer",
        context=implementation_context,
        permissions=["edit", "bash"]
    )

    # Instance 2: Review
    reviewer = spawn_instance(
        role="reviewer",
        context=review_context,
        permissions=["read"]  # Read-only for safety
    )

    # Implement, then review with the separate instance
    implementation = await implementer.execute(task)
    review = await reviewer.review(implementation)

    if review.approved:
        return implementation
    else:
        # Feed the review back to the implementer for fixes
        fixed = await implementer.fix(review.issues)
        return fixed
```

### Prompt Injection Defense

**Key Insight:** Multi-layer defense against injection attacks.
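One way to sketch the layered defense described below is a pipeline where each layer can veto an action. Everything here is a hypothetical illustration: the marker strings stand in for a trained recognizer (layer 1), and `validate_action` / `process_external_content` are invented helper names.

```python
# Hypothetical layered-defense sketch; each layer may reject the request.
def looks_like_injection(content: str) -> bool:
    # Layer 1 (recognition): naive substring check, standing in for a
    # trained detector of injection patterns in external content.
    markers = ["ignore previous instructions", "disregard your system prompt"]
    return any(m in content.lower() for m in markers)

def validate_action(action: str, authorized: set) -> bool:
    # Layer 3 (action validation): block anything not explicitly authorized.
    return action in authorized

def process_external_content(content: str, requested_action: str,
                             authorized: set, audit_log: list) -> bool:
    # Layer 4 (monitoring): log every interaction with external content.
    audit_log.append({"action": requested_action, "preview": content[:80]})
    if looks_like_injection(content):
        return False  # rejected at layer 1
    if not validate_action(requested_action, authorized):
        return False  # rejected at layer 3
    return True
```

Layer 2 (context isolation) is omitted from the sketch because it is a property of how external content is marked and sandboxed, not a per-request check.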
```yaml
prompt_injection_defense:
  layers:
    layer_1_recognition:
      - "Train to recognize injection patterns"
      - "Detect malicious content in external sources"
    layer_2_context_isolation:
      - "Sandbox external content processing"
      - "Mark user content vs system instructions"
    layer_3_action_validation:
      - "Verify requested actions are authorized"
      - "Block sensitive operations without confirmation"
    layer_4_monitoring:
      - "Log all external content interactions"
      - "Alert on suspicious patterns"
  performance:
    claude_opus_4: "89% attack prevention"
    claude_sonnet_4: "86% attack prevention"
```

---

## Combined Patterns for Loki Mode

### Self-Improving Multi-Agent System

```yaml
combined_approach:
  world_model_training: "Test in simulation before real execution"
  self_improvement: "Bootstrap learning from successful trajectories"
  constitutional_constraints: "Principles-based self-critique"
  debate_verification: "Pit reviewers against each other"
  defection_probes: "Monitor for alignment faking"

implementation_priority:
  high:
    - Constitutional AI principles in agent prompts
    - Explore-Plan-Code workflow enforcement
    - Context reset triggers
  medium:
    - Self-improvement loop for task generation
    - Debate-based verification for critical changes
    - Cross-embodiment skill transfer
  low:
    - Full world model training
    - Defection probe classifiers
```

---

## Sources

**Google DeepMind:**
- [SIMA 2: Generalist AI Agent](https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/)
- [Gemini Robotics 1.5](https://deepmind.google/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/)
- [Dreamer 4: World Model Training](https://danijar.com/project/dreamer4/)
- [Genie 3: World Models](https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/)
- [Scalable AI Safety via Debate](https://deepmind.google/research/publications/34920/)
- [Amplified Oversight](https://deepmindsafetyresearch.medium.com/human-ai-complementarity-a-goal-for-amplified-oversight-0ad8a44cae0a)
- [Technical AGI Safety Approach](https://arxiv.org/html/2504.01849v1)

**Anthropic:**
- [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
- [Claude Code Best Practices](https://www.anthropic.com/engineering/claude-code-best-practices)
- [Sleeper Agents Detection](https://www.anthropic.com/research/probes-catch-sleeper-agents)
- [Alignment Faking](https://www.anthropic.com/research/alignment-faking)
- [Visible Extended Thinking](https://www.anthropic.com/research/visible-extended-thinking)
- [Computer Use Safety](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Sabotage Evaluations](https://www.anthropic.com/research/sabotage-evaluations-for-frontier-models)