fix(skills): Restore vibeship imports

Rebuild the affected vibeship-derived skills from the pinned upstream
snapshot instead of leaving the truncated imported bodies on main.
Refresh the derived catalog and plugin mirrors so the canonical skills,
compatibility data, and generated artifacts stay in sync.

Refs #473
This commit is contained in:
sickn33
2026-04-07 18:25:18 +02:00
parent 684b49b2b3
commit 1966f6a8a2
189 changed files with 126068 additions and 5944 deletions

View File

@@ -1,23 +1,15 @@
---
name: context-window-management
description: Strategies for managing LLM context windows including
  summarization, trimming, routing, and avoiding context rot
risk: unknown
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---
# Context Window Management
Strategies for managing LLM context windows including summarization, trimming, routing, and avoiding context rot
## Capabilities
@@ -28,31 +20,292 @@ Your cor
- token-counting
- context-prioritization
## Prerequisites
- Knowledge: LLM fundamentals, Tokenization basics, Prompt engineering
- Skills_recommended: prompt-engineering
## Scope
- Does_not_cover: RAG implementation details, Model fine-tuning, Embedding models
- Boundaries: Focus is context optimization, Covers strategies not specific implementations
## Ecosystem
### Primary_tools
- tiktoken - OpenAI's tokenizer for counting tokens
- LangChain - Framework with context management utilities
- Claude API - 200K+ context with caching support
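Every strategy in this skill starts from a token count. A dependency-free estimate can be sketched as below; the four-characters-per-token ratio is a rough English-text heuristic (an assumption, not an exact rule), and a production system should use the model's own tokenizer such as tiktoken instead.

```python
# Rough token estimator: a stand-in for a real tokenizer such as tiktoken.
# The 4-chars-per-token ratio is a common heuristic for English text, not exact.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def estimate_conversation_tokens(messages: list) -> int:
    # Each message carries a small fixed overhead for role/formatting markers.
    per_message_overhead = 4
    return sum(estimate_tokens(m["content"]) + per_message_overhead for m in messages)
```

Swap `estimate_tokens` for a real tokenizer call once accuracy matters; the budgeting logic around it stays the same.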
## Patterns
### Tiered Context Strategy
Different strategies based on context size
**When to use**: Building any multi-turn conversation system
```typescript
interface ContextTier {
  maxTokens: number;
  strategy: 'full' | 'summarize' | 'rag';
  model: string;
}

const TIERS: ContextTier[] = [
  { maxTokens: 8000, strategy: 'full', model: 'claude-3-haiku' },
  { maxTokens: 32000, strategy: 'full', model: 'claude-3-5-sonnet' },
  { maxTokens: 100000, strategy: 'summarize', model: 'claude-3-5-sonnet' },
  { maxTokens: Infinity, strategy: 'rag', model: 'claude-3-5-sonnet' }
];

async function selectStrategy(messages: Message[]): Promise<ContextTier> {
  const tokens = await countTokens(messages);
  for (const tier of TIERS) {
    if (tokens <= tier.maxTokens) {
      return tier;
    }
  }
  return TIERS[TIERS.length - 1];
}

async function prepareContext(messages: Message[]): Promise<PreparedContext> {
  const tier = await selectStrategy(messages);
  switch (tier.strategy) {
    case 'full':
      return { messages, model: tier.model };
    case 'summarize': {
      const summary = await summarizeOldMessages(messages);
      return { messages: [summary, ...recentMessages(messages)], model: tier.model };
    }
    case 'rag': {
      const relevant = await retrieveRelevant(messages);
      return { messages: [...relevant, ...recentMessages(messages)], model: tier.model };
    }
  }
}
```
### Serial Position Optimization
Place important content at start and end
**When to use**: Constructing prompts with significant context
```typescript
// LLMs weight beginning and end more heavily
// Structure prompts to leverage this
function buildOptimalPrompt(components: {
  systemPrompt: string;
  criticalContext: string;
  conversationHistory: Message[];
  currentQuery: string;
}): string {
  // START: System instructions (always first)
  const parts = [components.systemPrompt];

  // CRITICAL CONTEXT: Right after system (high primacy)
  if (components.criticalContext) {
    parts.push(`## Key Context\n${components.criticalContext}`);
  }

  // MIDDLE: Conversation history (lower weight)
  // Summarize if long, keep recent messages full
  const history = components.conversationHistory;
  if (history.length > 10) {
    const oldSummary = summarize(history.slice(0, -5));
    const recent = history.slice(-5);
    parts.push(`## Earlier Conversation (Summary)\n${oldSummary}`);
    parts.push(`## Recent Messages\n${formatMessages(recent)}`);
  } else {
    parts.push(`## Conversation\n${formatMessages(history)}`);
  }

  // END: Current query (high recency)
  // Restate critical requirements here
  parts.push(`## Current Request\n${components.currentQuery}`);

  // FINAL: Reminder of key constraints
  parts.push(`Remember: ${extractKeyConstraints(components.systemPrompt)}`);

  return parts.join('\n\n');
}
```
### Intelligent Summarization
Summarize by importance, not just recency
**When to use**: Context exceeds optimal size

```typescript
interface MessageWithMetadata extends Message {
  importance: number;        // 0-1 score
  hasCriticalInfo: boolean;  // User preferences, decisions
  referenced: boolean;       // Was this referenced later?
}

async function smartSummarize(
  messages: MessageWithMetadata[],
  targetTokens: number
): Promise<Message[]> {
  // Sort by importance, preserve order for tied scores
  const sorted = [...messages].sort((a, b) =>
    (b.importance + (b.hasCriticalInfo ? 0.5 : 0) + (b.referenced ? 0.3 : 0)) -
    (a.importance + (a.hasCriticalInfo ? 0.5 : 0) + (a.referenced ? 0.3 : 0))
  );

  const keep: Message[] = [];
  const summarizePool: Message[] = [];
  let currentTokens = 0;
  for (const msg of sorted) {
    const msgTokens = await countTokens([msg]);
    if (currentTokens + msgTokens < targetTokens * 0.7) {
      keep.push(msg);
      currentTokens += msgTokens;
    } else {
      summarizePool.push(msg);
    }
  }

  // Summarize the low-importance messages
  if (summarizePool.length > 0) {
    const summary = await llm.complete(`
      Summarize these messages, preserving:
      - Any user preferences or decisions
      - Key facts that might be referenced later
      - The overall flow of conversation
      Messages:
      ${formatMessages(summarizePool)}
    `);
    keep.unshift({ role: 'system', content: `[Earlier context: ${summary}]` });
  }

  // Restore original order
  return keep.sort((a, b) => a.timestamp - b.timestamp);
}
```

### Token Budget Allocation
Allocate token budget across context components
**When to use**: Need predictable context management

```typescript
interface TokenBudget {
  system: number;           // System prompt
  criticalContext: number;  // User prefs, key info
  history: number;          // Conversation history
  query: number;            // Current query
  response: number;         // Reserved for response
}

function allocateBudget(totalTokens: number): TokenBudget {
  return {
    system: Math.floor(totalTokens * 0.10),          // 10%
    criticalContext: Math.floor(totalTokens * 0.15), // 15%
    history: Math.floor(totalTokens * 0.40),         // 40%
    query: Math.floor(totalTokens * 0.10),           // 10%
    response: Math.floor(totalTokens * 0.25),        // 25%
  };
}

async function buildWithBudget(
  components: ContextComponents,
  modelMaxTokens: number
): Promise<PreparedContext> {
  const budget = allocateBudget(modelMaxTokens);
  // Truncate/summarize each component to fit budget
  const prepared = {
    system: truncateToTokens(components.system, budget.system),
    criticalContext: truncateToTokens(
      components.criticalContext, budget.criticalContext
    ),
    history: await summarizeToTokens(components.history, budget.history),
    query: truncateToTokens(components.query, budget.query),
  };
  // Reallocate unused budget
  const used = await countTokens(Object.values(prepared).join('\n'));
  const remaining = modelMaxTokens - used - budget.response;
  if (remaining > 0) {
    // Give extra to history (most valuable for conversation)
    prepared.history = await summarizeToTokens(
      components.history,
      budget.history + remaining
    );
  }
  return prepared;
}
```

## Anti-Patterns
### ❌ Naive Truncation
Dropping old messages without summarizing loses critical context.
### ❌ Ignoring Token Costs
Building context without counting tokens risks exceeding model limits.
### ❌ One-Size-Fits-All
Applying a single strategy regardless of context size wastes tokens or quality; use tiers.
## Validation Checks
### No Token Counting
Severity: WARNING
Message: Building context without token counting. May exceed model limits.
Fix action: Count tokens before sending, implement budget allocation
### Naive Message Truncation
Severity: WARNING
Message: Truncating messages without summarization. Critical context may be lost.
Fix action: Summarize old messages instead of simply removing them
### Hardcoded Token Limit
Severity: INFO
Message: Hardcoded token limit. Consider making configurable per model.
Fix action: Use model-specific limits from configuration
### No Context Management Strategy
Severity: WARNING
Message: LLM calls without context management strategy.
Fix action: Implement context management: budgets, summarization, or RAG
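The checks above can be combined into a small pre-flight guard: count (or estimate) tokens before sending, and read limits from per-model configuration rather than hardcoding them. This is an illustrative sketch; the model names, limits, and reserve size below are assumptions, not values from this skill.

```python
# Hypothetical per-model limit configuration; real limits should come from
# your own config, not constants baked into call sites.
MODEL_LIMITS = {"claude-3-haiku": 200_000, "gpt-4o": 128_000}

def check_context_fits(model: str, prompt_tokens: int, reserved_for_response: int = 4096) -> bool:
    # Fail loudly for unknown models instead of silently guessing a limit.
    limit = MODEL_LIMITS.get(model)
    if limit is None:
        raise ValueError(f"No configured limit for model: {model}")
    # Leave room for the response, not just the prompt.
    return prompt_tokens + reserved_for_response <= limit
```

A caller would run this check and fall back to summarization or RAG when it returns False.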
## Collaboration
### Delegation Triggers
- retrieval|rag|search -> rag-implementation (Need retrieval system)
- memory|persistence|remember -> conversation-memory (Need memory storage)
- cache|caching -> prompt-caching (Need caching optimization)
### Complete Context System
Skills: context-window-management, rag-implementation, conversation-memory, prompt-caching
Workflow:
```
1. Design context strategy
2. Implement RAG for large corpuses
3. Set up memory persistence
4. Add caching for performance
```
## Related Skills
Works well with: `rag-implementation`, `conversation-memory`, `prompt-caching`, `llm-npc-dialogue`
## When to Use
Use this skill to execute the workflow or actions described in the overview.
- User mentions or implies: context window
- User mentions or implies: token limit
- User mentions or implies: context management
- User mentions or implies: context engineering
- User mentions or implies: long context
- User mentions or implies: context overflow

View File

@@ -1,13 +1,21 @@
---
name: langfuse
description: Expert in Langfuse - the open-source LLM observability platform.
  Covers tracing, prompt management, evaluation, datasets, and integration with
  LangChain, LlamaIndex, and OpenAI. Essential for debugging, monitoring, and
  improving LLM applications in production.
risk: unknown
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---
# Langfuse
Expert in Langfuse - the open-source LLM observability platform. Covers tracing,
prompt management, evaluation, datasets, and integration with LangChain, LlamaIndex,
and OpenAI. Essential for debugging, monitoring, and improving LLM applications
in production.
**Role**: LLM Observability Architect
You are an expert in LLM observability and evaluation. You think in terms of
@@ -15,6 +23,14 @@ traces, spans, and metrics. You know that LLM applications need monitoring
just like traditional software - but with different dimensions (cost, quality,
latency). You use data to drive prompt improvements and catch regressions.
### Expertise
- Tracing architecture
- Prompt versioning
- Evaluation strategies
- Cost optimization
- Quality monitoring
## Capabilities
- LLM tracing and observability
@@ -25,11 +41,42 @@ latency). You use data to drive prompt improvements and catch regressions.
- Performance monitoring
- A/B testing prompts
## Prerequisites
- LLM application basics
- API integration experience
- Understanding of tracing concepts
- Required skills: Python or TypeScript/JavaScript, Langfuse account (cloud or self-hosted), LLM API keys
## Scope
- Self-hosted requires infrastructure
- High-volume may need optimization
- Real-time dashboard has latency
- Evaluation requires setup
## Ecosystem
### Primary
- Langfuse Cloud
- Langfuse Self-hosted
- Python SDK
- JS/TS SDK
### Common_integrations
- LangChain
- LlamaIndex
- OpenAI SDK
- Anthropic SDK
- Vercel AI SDK
### Platforms
- Any Python/JS backend
- Serverless functions
- Jupyter notebooks
## Patterns
@@ -39,7 +86,6 @@ Instrument LLM calls with Langfuse
**When to use**: Any LLM application
```python
from langfuse import Langfuse
# Initialize client
@@ -91,7 +137,6 @@ trace.score(
# Flush before exit (important in serverless)
langfuse.flush()
```
### OpenAI Integration
@@ -99,7 +144,6 @@ Automatic tracing with OpenAI SDK
**When to use**: OpenAI-based applications
```python
from langfuse.openai import openai
# Drop-in replacement for OpenAI client
@@ -139,7 +183,6 @@ async def main():
messages=[{"role": "user", "content": "Hello"}],
name="async-greeting"
)
```
### LangChain Integration
@@ -147,7 +190,6 @@ Trace LangChain applications
**When to use**: LangChain-based applications
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langfuse.callback import CallbackHandler
@@ -194,50 +236,263 @@ result = agent_executor.invoke(
    {"input": "What's the weather?"},
    config={"callbacks": [langfuse_handler]}
)
```
### Prompt Management
Version and deploy prompts
**When to use**: Managing prompts across environments
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch prompt from Langfuse
# (Create in UI or via API first)
prompt = langfuse.get_prompt("customer-support-v2")

# Get compiled prompt with variables
compiled = prompt.compile(
    customer_name="John",
    issue="billing question"
)

# Use with OpenAI
response = openai.chat.completions.create(
    model=prompt.config.get("model", "gpt-4o"),
    messages=compiled,
    temperature=prompt.config.get("temperature", 0.7)
)

# Link generation to prompt version
trace = langfuse.trace(name="support-chat")
generation = trace.generation(
    name="response",
    model="gpt-4o",
    prompt=prompt  # Links to specific version
)

# Create/update prompts via API
langfuse.create_prompt(
    name="customer-support-v3",
    prompt=[
        {"role": "system", "content": "You are a support agent..."},
        {"role": "user", "content": "{{user_message}}"}
    ],
    config={
        "model": "gpt-4o",
        "temperature": 0.7
    },
    labels=["production"]  # or ["staging", "development"]
)

# Fetch specific label
prompt = langfuse.get_prompt(
    "customer-support-v3",
    label="production"  # Gets latest with this label
)
```
### Evaluation and Scoring
Evaluate LLM outputs systematically
**When to use**: Quality assurance and improvement
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Manual scoring in code
trace = langfuse.trace(name="qa-flow")

# After getting response
trace.score(
    name="relevance",
    value=0.85,  # 0-1 scale
    comment="Response addressed the question"
)
trace.score(
    name="correctness",
    value=1,  # Binary: 0 or 1
    data_type="BOOLEAN"
)

# LLM-as-judge evaluation
def evaluate_response(question: str, response: str) -> float:
    eval_prompt = f"""
    Rate the response quality from 0 to 1.
    Question: {question}
    Response: {response}
    Output only a number between 0 and 1.
    """
    result = openai.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for eval
        messages=[{"role": "user", "content": eval_prompt}]
    )
    return float(result.choices[0].message.content.strip())

# Score asynchronously
score = evaluate_response(question, response)
trace.score(
    name="quality-llm-judge",
    value=score
)

# Create evaluation dataset
dataset = langfuse.create_dataset(name="support-qa-v1")

# Add items to dataset
langfuse.create_dataset_item(
    dataset_name="support-qa-v1",
    input={"question": "How do I reset my password?"},
    expected_output="Go to settings > security > reset password"
)

# Run evaluation on dataset
dataset = langfuse.get_dataset("support-qa-v1")
for item in dataset.items:
    # Generate response
    response = generate_response(item.input["question"])
    # Link to dataset item
    trace = langfuse.trace(name="eval-run")
    trace.generation(
        name="response",
        input=item.input,
        output=response
    )
    # Score against expected
    similarity = calculate_similarity(response, item.expected_output)
    trace.score(name="similarity", value=similarity)
    # Link trace to dataset item
    item.link(trace, "eval-run-1")
```
### Decorator Pattern
Clean instrumentation with decorators
**When to use**: Function-based applications
```python
from langfuse.decorators import observe, langfuse_context

@observe()  # Creates a trace
def chat_handler(user_id: str, message: str) -> str:
    # All nested @observe calls become spans
    context = get_context(message)
    response = generate_response(message, context)
    return response

@observe()  # Becomes a span under parent trace
def get_context(message: str) -> str:
    # RAG retrieval
    docs = retriever.get_relevant_documents(message)
    return "\n".join([d.page_content for d in docs])

@observe(as_type="generation")  # LLM generation span
def generate_response(message: str, context: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": message}
        ]
    )
    return response.choices[0].message.content

# Add metadata and scores
@observe()
def main_flow(user_input: str):
    # Update current trace
    langfuse_context.update_current_trace(
        user_id="user-123",
        session_id="session-456",
        tags=["production"]
    )
    result = process(user_input)
    # Score the trace
    langfuse_context.score_current_trace(
        name="success",
        value=1 if result else 0
    )
    return result

# Works with async
@observe()
async def async_handler(message: str):
    result = await async_generate(message)
    return result
```
## Collaboration
### Delegation Triggers
- agent|langgraph|graph -> langgraph (Need to build agent to monitor)
- crewai|multi-agent|crew -> crewai (Need to build crew to monitor)
- structured output|extraction -> structured-output (Need to build extraction to monitor)
### Observable LangGraph Agent
Skills: langfuse, langgraph
Workflow:
```
1. Build agent with LangGraph
2. Add Langfuse callback handler
3. Trace all LLM calls and tool uses
4. Score outputs for quality
5. Monitor and iterate
```
### Monitored RAG Pipeline
Skills: langfuse, structured-output
Workflow:
```
1. Build RAG with retrieval and generation
2. Trace retrieval and LLM calls
3. Score relevance and accuracy
4. Track costs and latency
5. Optimize based on data
```
### Evaluated Agent System
Skills: langfuse, langgraph, structured-output
Workflow:
```
1. Build agent with structured outputs
2. Create evaluation dataset
3. Run evaluations with traces
4. Compare prompt versions
5. Deploy best performers
```
## Anti-Patterns
### ❌ Not Flushing in Serverless
**Why bad**: Traces are batched. Serverless may exit before flush. Data is lost.
**Instead**: Always call langfuse.flush() at the end. Use context managers where available. Consider sync mode for critical traces.
### ❌ Tracing Everything
**Why bad**: Noisy traces. Performance overhead. Hard to find important info.
**Instead**: Focus on LLM calls, key logic, and user actions. Group related operations. Use meaningful span names.
### ❌ No User/Session IDs
**Why bad**: Can't debug specific users. Can't track sessions. Analytics limited.
**Instead**: Always pass user_id and session_id. Use consistent identifiers. Add relevant metadata.
## Limitations
- Self-hosted requires infrastructure
- High-volume may need optimization
- Real-time dashboard has latency
- Evaluation requires setup
## Related Skills
Works well with: `langgraph`, `crewai`, `structured-output`, `autonomous-agents`
## When to Use
Use this skill to execute the workflow or actions described in the overview.
- User mentions or implies: langfuse
- User mentions or implies: llm observability
- User mentions or implies: llm tracing
- User mentions or implies: prompt management
- User mentions or implies: llm evaluation
- User mentions or implies: monitor llm
- User mentions or implies: debug llm

View File

@@ -1,24 +1,15 @@
---
name: prompt-caching
description: Caching strategies for LLM prompts including Anthropic prompt
  caching, response caching, and CAG (Cache Augmented Generation)
risk: none
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---
# Prompt Caching
Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache Augmented Generation)
## Capabilities
@@ -28,39 +19,461 @@ Your core principles:
- cag-patterns
- cache-invalidation
## Prerequisites
- Knowledge: Caching fundamentals, LLM API usage, Hash functions
- Skills_recommended: context-window-management
## Scope
- Does_not_cover: CDN caching, Database query caching, Static asset caching
- Boundaries: Focus is LLM-specific caching, Covers prompt and response caching
## Ecosystem
### Primary_tools
- Anthropic Prompt Caching - Native prompt caching in Claude API
- Redis - In-memory cache for responses
- OpenAI Caching - Automatic caching in OpenAI API
## Patterns
### Anthropic Prompt Caching
Use Claude's native prompt caching for repeated prefixes
**When to use**: Using Claude API with stable system prompts or context
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Cache the stable parts of your prompt
async function queryWithCaching(userQuery: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: LONG_SYSTEM_PROMPT, // Your detailed instructions
        cache_control: { type: "ephemeral" } // Cache this!
      },
      {
        type: "text",
        text: KNOWLEDGE_BASE, // Large static context
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [
      { role: "user", content: userQuery } // Dynamic part
    ]
  });

  // Check cache usage
  console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
  console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);
  return response;
}

// Cost savings: 90% reduction on cached tokens
// Latency savings: Up to 2x faster
```
### Response Caching
Cache full LLM responses for identical or similar queries
**When to use**: Same queries asked repeatedly
```typescript
import { createHash } from 'crypto';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

class ResponseCache {
  private ttl = 3600; // 1 hour default

  // Exact match caching
  async getCached(prompt: string): Promise<string | null> {
    const key = this.hashPrompt(prompt);
    return await redis.get(`response:${key}`);
  }

  async setCached(prompt: string, response: string): Promise<void> {
    const key = this.hashPrompt(prompt);
    await redis.set(`response:${key}`, response, 'EX', this.ttl);
  }

  private hashPrompt(prompt: string): string {
    return createHash('sha256').update(prompt).digest('hex');
  }

  // Semantic similarity caching
  async getSemanticallySimilar(
    prompt: string,
    threshold: number = 0.95
  ): Promise<string | null> {
    const embedding = await embed(prompt);
    const similar = await this.vectorCache.search(embedding, 1);
    if (similar.length && similar[0].similarity > threshold) {
      return await redis.get(`response:${similar[0].id}`);
    }
    return null;
  }

  // Temperature-aware caching
  async getCachedWithParams(
    prompt: string,
    params: { temperature: number; model: string }
  ): Promise<string | null> {
    // Only cache low-temperature responses
    if (params.temperature > 0.5) return null;
    const key = this.hashPrompt(
      `${prompt}|${params.model}|${params.temperature}`
    );
    return await redis.get(`response:${key}`);
  }
}
```
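The exact-match and TTL logic above can be reduced to a dependency-free sketch, useful for unit-testing the caching behavior before wiring up Redis. The class name and in-memory store are illustrative, not part of any SDK; the explicit `now` parameter exists only to make expiry testable.

```python
import hashlib
import time
from typing import Optional

# In-memory stand-in for a Redis-backed response cache: exact-match keys via
# SHA-256, entries expire after a TTL.
class SimpleResponseCache:
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (response, expires_at)

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        response, expires_at = entry
        if now >= expires_at:
            del self._store[self._key(prompt)]  # lazy eviction of stale entry
            return None
        return response

    def set(self, prompt: str, response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._store[self._key(prompt)] = (response, now + self.ttl)
```

Swapping the dict for Redis `SET ... EX` calls preserves the same semantics.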
### Cache Augmented Generation (CAG)
Pre-cache documents in prompt instead of RAG retrieval
**When to use**: Document corpus is stable and fits in context

```typescript
// CAG: Pre-compute document context, cache in prompt
// Better than RAG when:
// - Documents are stable
// - Total fits in context window
// - Latency is critical
class CAGSystem {
  private cachedContext: string | null = null;
  private lastUpdate: number = 0;

  async buildCachedContext(documents: Document[]): Promise<void> {
    // Pre-process and format documents
    const formatted = documents.map(d =>
      `## ${d.title}\n${d.content}`
    ).join('\n\n');
    // Store with timestamp
    this.cachedContext = formatted;
    this.lastUpdate = Date.now();
  }

  async query(userQuery: string): Promise<string> {
    // Use cached context directly in prompt
    const response = await client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      system: [
        {
          type: "text",
          text: "You are a helpful assistant with access to the following documentation.",
          cache_control: { type: "ephemeral" }
        },
        {
          type: "text",
          text: this.cachedContext!, // Pre-cached docs
          cache_control: { type: "ephemeral" }
        }
      ],
      messages: [{ role: "user", content: userQuery }]
    });
    return response.content[0].text;
  }

  // Periodic refresh
  async refreshIfNeeded(documents: Document[]): Promise<void> {
    const stale = Date.now() - this.lastUpdate > 3600000; // 1 hour
    if (stale) {
      await this.buildCachedContext(documents);
    }
  }
}

// CAG vs RAG decision matrix:
// | Factor           | CAG Better    | RAG Better    |
// |------------------|---------------|---------------|
// | Corpus size      | < 100K tokens | > 100K tokens |
// | Update frequency | Low           | High          |
// | Latency needs    | Critical      | Flexible      |
// | Query specificity| General       | Specific      |
```

## Anti-Patterns
### ❌ Caching with High Temperature
High-temperature responses are non-deterministic; only cache at temperature <= 0.5.
### ❌ No Cache Invalidation
Without TTLs or invalidation, cached answers drift out of date as source data changes.
### ❌ Caching Everything
Caching low-frequency queries adds overhead without hits; cache selectively.

## ⚠️ Sharp Edges
| Issue | Severity | Solution |
|-------|----------|----------|
| Cache miss causes latency spike with additional overhead | high | Optimize for cache misses, not just hits |
| Cached responses become incorrect over time | high | Implement proper cache invalidation |
| Prompt caching doesn't work due to prefix changes | medium | Structure prompts for optimal caching |
### Cache miss causes latency spike with additional overhead
Severity: HIGH
Situation: Slow response when cache miss, slower than no caching
Symptoms:
- Slow responses on cache miss
- Cache hit rate below 50%
- Higher latency than uncached
Why this breaks:
Cache check adds latency.
Cache write adds more latency.
Miss + overhead > no caching.
Recommended fix:
```typescript
// Optimize for cache misses, not just hits
class OptimizedCache {
  async queryWithCache(prompt: string): Promise<string> {
    const cacheKey = this.hash(prompt);

    // Non-blocking cache check
    const cachedPromise = this.cache.get(cacheKey);
    const llmPromise = this.queryLLM(prompt);

    // Race: use cache if available before LLM returns
    const cached = await Promise.race([
      cachedPromise,
      sleep(50).then(() => null) // 50ms cache timeout
    ]);
    if (cached) {
      // Cancel LLM request if possible
      return cached;
    }

    // Cache miss: continue with LLM
    const response = await llmPromise;

    // Async cache write (don't block response)
    this.cache.set(cacheKey, response).catch(console.error);
    return response;
  }
}

// Alternative: Probabilistic caching
// Only cache if query matches known high-frequency patterns
class SelectiveCache {
  private patterns: Map<string, number> = new Map();

  shouldCache(prompt: string): boolean {
    const pattern = this.extractPattern(prompt);
    const frequency = this.patterns.get(pattern) || 0;
    // Only cache high-frequency patterns
    return frequency > 10;
  }

  recordQuery(prompt: string): void {
    const pattern = this.extractPattern(prompt);
    this.patterns.set(pattern, (this.patterns.get(pattern) || 0) + 1);
  }
}
```
### Cached responses become incorrect over time
Severity: HIGH
Situation: Users get outdated or wrong information from cache
Symptoms:
- Users report wrong information
- Answers don't match current data
- Complaints about outdated responses
Why this breaks:
Source data changed.
No cache invalidation.
Long TTLs for dynamic data.
Recommended fix:
```typescript
// Implement proper cache invalidation
class InvalidatingCache {
  // Version-based invalidation
  private cacheVersion = 1;

  getCacheKey(prompt: string): string {
    return `v${this.cacheVersion}:${this.hash(prompt)}`;
  }

  invalidateAll(): void {
    this.cacheVersion++;
    // Old keys automatically become orphaned
  }

  // Content-hash invalidation
  async setWithContentHash(
    key: string,
    response: string,
    sourceContent: string
  ): Promise<void> {
    const contentHash = this.hash(sourceContent);
    await this.cache.set(key, {
      response,
      contentHash,
      timestamp: Date.now()
    });
  }

  async getIfValid(
    key: string,
    currentSourceContent: string
  ): Promise<string | null> {
    const cached = await this.cache.get(key);
    if (!cached) return null;
    // Check if source content changed
    const currentHash = this.hash(currentSourceContent);
    if (cached.contentHash !== currentHash) {
      await this.cache.delete(key);
      return null;
    }
    return cached.response;
  }

  // Event-based invalidation
  onSourceUpdate(sourceId: string): void {
    // Invalidate all caches that used this source
    this.invalidateByTag(`source:${sourceId}`);
  }
}
```
### Prompt caching doesn't work due to prefix changes
Severity: MEDIUM
Situation: Cache misses despite similar prompts
Symptoms:
- Cache hit rate lower than expected
- Cache creation tokens high, read low
- Similar prompts not hitting cache
Why this breaks:
Anthropic caching requires exact prefix match.
Timestamps or dynamic content in prefix.
Different message order.
Recommended fix:
```typescript
// Structure prompts for optimal caching
class CacheOptimizedPrompts {
  // WRONG: Dynamic content in cached prefix
  buildPromptBad(query: string): SystemMessage[] {
    return [
      {
        type: "text",
        text: `You are helpful. Current time: ${new Date()}`, // BREAKS CACHE!
        cache_control: { type: "ephemeral" }
      }
    ];
  }

  // RIGHT: Static prefix, dynamic at end
  buildPromptGood(query: string): SystemMessage[] {
    return [
      {
        type: "text",
        text: STATIC_SYSTEM_PROMPT, // Never changes
        cache_control: { type: "ephemeral" }
      },
      {
        type: "text",
        text: STATIC_KNOWLEDGE_BASE, // Rarely changes
        cache_control: { type: "ephemeral" }
      }
      // Dynamic content goes in messages, NOT system
    ];
  }

  // Prefix ordering matters
  buildWithConsistentOrder(components: string[]): SystemMessage[] {
    // Sort components for consistent ordering
    const sorted = [...components].sort();
    return sorted.map((c, i) => ({
      type: "text",
      text: c,
      cache_control: i === sorted.length - 1
        ? { type: "ephemeral" }
        : undefined // Only cache the full prefix
    }));
  }
}
```
## Validation Checks
### Caching High Temperature Responses
Severity: WARNING
Message: Caching with high temperature. Responses are non-deterministic.
Fix action: Only cache responses with temperature <= 0.5
### Cache Without TTL
Severity: WARNING
Message: Cache without TTL. May serve stale data indefinitely.
Fix action: Set appropriate TTL based on data freshness requirements
### Dynamic Content in Cached Prefix
Severity: WARNING
Message: Dynamic content in cached prefix. Will cause cache misses.
Fix action: Move dynamic content outside of cache_control blocks
### No Cache Metrics
Severity: INFO
Message: Cache without hit/miss tracking. Can't measure effectiveness.
Fix action: Add cache hit/miss metrics and logging
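The first and third checks above can be folded into a single key-derivation helper: refuse to cache above a temperature cutoff, and fold model and temperature into the key so different settings never collide. This is a sketch under the assumption that 0.5 is your caching threshold, matching the fix actions listed here.

```python
import hashlib
from typing import Optional

def cache_key(prompt: str, model: str, temperature: float) -> Optional[str]:
    # Non-deterministic output: do not cache at all.
    if temperature > 0.5:
        return None
    # Include model and temperature so differing settings get distinct keys.
    raw = f"{model}|{temperature}|{prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

A caller treats `None` as "bypass the cache entirely" rather than as a miss.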
## Collaboration
### Delegation Triggers
- context window|token -> context-window-management (Need context optimization)
- rag|retrieval -> rag-implementation (Need retrieval system)
- memory -> conversation-memory (Need memory persistence)
### High-Performance LLM System
Skills: prompt-caching, context-window-management, rag-implementation
Workflow:
```
1. Analyze query patterns
2. Implement prompt caching for stable prefixes
3. Add response caching for frequent queries
4. Consider CAG for stable document sets
5. Monitor and optimize hit rates
```
## Related Skills
Works well with: `context-window-management`, `rag-implementation`, `conversation-memory`
## When to Use
Use this skill to execute the workflow or actions described in the overview.
- User mentions or implies: prompt caching
- User mentions or implies: cache prompt
- User mentions or implies: response cache
- User mentions or implies: cag
- User mentions or implies: cache augmented