fix(skills): Restore vibeship imports
Rebuild the affected vibeship-derived skills from the pinned upstream snapshot instead of leaving the truncated imported bodies on main. Refresh the derived catalog and plugin mirrors so the canonical skills, compatibility data, and generated artifacts stay in sync. Refs #473
@@ -1,23 +1,15 @@
---
name: context-window-management
description: Strategies for managing LLM context windows including
  summarization, trimming, routing, and avoiding context rot
risk: unknown
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---
# Context Window Management

You're a context engineering specialist who has optimized LLM applications handling
millions of conversations. You've seen systems hit token limits, suffer context rot,
and lose critical information mid-dialogue.

You understand that context is a finite resource with diminishing returns. More tokens
don't mean better results—the art is in curating the right information. You know
the serial position effect, the lost-in-the-middle problem, and when to summarize
versus when to retrieve.

Your cor

Strategies for managing LLM context windows including summarization, trimming, routing, and avoiding context rot

## Capabilities

@@ -28,31 +20,292 @@ Your cor
- token-counting
- context-prioritization

## Prerequisites

- Knowledge: LLM fundamentals, Tokenization basics, Prompt engineering
- Skills_recommended: prompt-engineering

## Scope

- Does_not_cover: RAG implementation details, Model fine-tuning, Embedding models
- Boundaries: Focus is context optimization, Covers strategies not specific implementations

## Ecosystem

### Primary_tools

- tiktoken - OpenAI's tokenizer for counting tokens
- LangChain - Framework with context management utilities
- Claude API - 200K+ context with caching support

## Patterns

### Tiered Context Strategy

Different strategies based on context size

**When to use**: Building any multi-turn conversation system

```typescript
interface ContextTier {
  maxTokens: number;
  strategy: 'full' | 'summarize' | 'rag';
  model: string;
}

const TIERS: ContextTier[] = [
  { maxTokens: 8000, strategy: 'full', model: 'claude-3-haiku' },
  { maxTokens: 32000, strategy: 'full', model: 'claude-3-5-sonnet' },
  { maxTokens: 100000, strategy: 'summarize', model: 'claude-3-5-sonnet' },
  { maxTokens: Infinity, strategy: 'rag', model: 'claude-3-5-sonnet' }
];

async function selectStrategy(messages: Message[]): Promise<ContextTier> {
  const tokens = await countTokens(messages);

  for (const tier of TIERS) {
    if (tokens <= tier.maxTokens) {
      return tier;
    }
  }
  return TIERS[TIERS.length - 1];
}

async function prepareContext(messages: Message[]): Promise<PreparedContext> {
  const tier = await selectStrategy(messages);

  switch (tier.strategy) {
    case 'full':
      return { messages, model: tier.model };

    case 'summarize': {
      const summary = await summarizeOldMessages(messages);
      return { messages: [summary, ...recentMessages(messages)], model: tier.model };
    }

    case 'rag': {
      const relevant = await retrieveRelevant(messages);
      return { messages: [...relevant, ...recentMessages(messages)], model: tier.model };
    }
  }
}
```
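The tier-selection loop above is language-agnostic; here is a minimal Python sketch of the same idea. The tier table and token thresholds mirror the TypeScript example and are illustrative assumptions, not part of the skill:

```python
from dataclasses import dataclass

@dataclass
class ContextTier:
    max_tokens: float
    strategy: str  # 'full' | 'summarize' | 'rag'
    model: str

# Illustrative tier table mirroring the example above
TIERS = [
    ContextTier(8_000, "full", "claude-3-haiku"),
    ContextTier(32_000, "full", "claude-3-5-sonnet"),
    ContextTier(100_000, "summarize", "claude-3-5-sonnet"),
    ContextTier(float("inf"), "rag", "claude-3-5-sonnet"),
]

def select_strategy(token_count: int) -> ContextTier:
    # The first tier whose budget fits the conversation wins;
    # the final (infinite) tier acts as the RAG fallback.
    for tier in TIERS:
        if token_count <= tier.max_tokens:
            return tier
    return TIERS[-1]
```

Because the tiers are ordered smallest-first, short conversations stay on the cheap full-context path and only overflow traffic pays for summarization or retrieval.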
### Serial Position Optimization

Place important content at start and end

**When to use**: Constructing prompts with significant context

```typescript
// LLMs weight beginning and end more heavily
// Structure prompts to leverage this

function buildOptimalPrompt(components: {
  systemPrompt: string;
  criticalContext: string;
  conversationHistory: Message[];
  currentQuery: string;
}): string {
  // START: System instructions (always first)
  const parts = [components.systemPrompt];

  // CRITICAL CONTEXT: Right after system (high primacy)
  if (components.criticalContext) {
    parts.push(`## Key Context\n${components.criticalContext}`);
  }

  // MIDDLE: Conversation history (lower weight)
  // Summarize if long, keep recent messages full
  const history = components.conversationHistory;
  if (history.length > 10) {
    const oldSummary = summarize(history.slice(0, -5));
    const recent = history.slice(-5);
    parts.push(`## Earlier Conversation (Summary)\n${oldSummary}`);
    parts.push(`## Recent Messages\n${formatMessages(recent)}`);
  } else {
    parts.push(`## Conversation\n${formatMessages(history)}`);
  }

  // END: Current query (high recency)
  // Restate critical requirements here
  parts.push(`## Current Request\n${components.currentQuery}`);

  // FINAL: Reminder of key constraints
  parts.push(`Remember: ${extractKeyConstraints(components.systemPrompt)}`);

  return parts.join('\n\n');
}
```
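The ordering rule above is the whole trick, so a stripped-down Python sketch makes it easy to test. The section labels are illustrative; only the start/end placement matters:

```python
def build_optimal_prompt(system: str, critical: str, history: list, query: str) -> str:
    # Primacy: system prompt and critical context go first
    parts = [system]
    if critical:
        parts.append("## Key Context\n" + critical)
    # Middle: conversation history carries the least positional weight
    parts.append("## Conversation\n" + "\n".join(history))
    # Recency: the current request goes last
    parts.append("## Current Request\n" + query)
    return "\n\n".join(parts)
```

A quick check is that the assembled prompt keeps system before context before history, and ends on the query.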
### Intelligent Summarization

Summarize by importance, not just recency

**When to use**: Context exceeds optimal size

```typescript
interface MessageWithMetadata extends Message {
  importance: number;       // 0-1 score
  hasCriticalInfo: boolean; // User preferences, decisions
  referenced: boolean;      // Was this referenced later?
}

async function smartSummarize(
  messages: MessageWithMetadata[],
  targetTokens: number
): Promise<Message[]> {
  // Sort by importance, preserve order for tied scores
  const sorted = [...messages].sort((a, b) =>
    (b.importance + (b.hasCriticalInfo ? 0.5 : 0) + (b.referenced ? 0.3 : 0)) -
    (a.importance + (a.hasCriticalInfo ? 0.5 : 0) + (a.referenced ? 0.3 : 0))
  );

  const keep: Message[] = [];
  const summarizePool: Message[] = [];
  let currentTokens = 0;

  for (const msg of sorted) {
    const msgTokens = await countTokens([msg]);
    if (currentTokens + msgTokens < targetTokens * 0.7) {
      keep.push(msg);
      currentTokens += msgTokens;
    } else {
      summarizePool.push(msg);
    }
  }

  // Summarize the low-importance messages
  if (summarizePool.length > 0) {
    const summary = await llm.complete(`
      Summarize these messages, preserving:
      - Any user preferences or decisions
      - Key facts that might be referenced later
      - The overall flow of conversation

      Messages:
      ${formatMessages(summarizePool)}
    `);

    keep.unshift({ role: 'system', content: `[Earlier context: ${summary}]` });
  }

  // Restore original order
  return keep.sort((a, b) => a.timestamp - b.timestamp);
}
```

### Token Budget Allocation

Allocate token budget across context components

**When to use**: Need predictable context management

```typescript
interface TokenBudget {
  system: number;          // System prompt
  criticalContext: number; // User prefs, key info
  history: number;         // Conversation history
  query: number;           // Current query
  response: number;        // Reserved for response
}

function allocateBudget(totalTokens: number): TokenBudget {
  return {
    system: Math.floor(totalTokens * 0.10),          // 10%
    criticalContext: Math.floor(totalTokens * 0.15), // 15%
    history: Math.floor(totalTokens * 0.40),         // 40%
    query: Math.floor(totalTokens * 0.10),           // 10%
    response: Math.floor(totalTokens * 0.25),        // 25%
  };
}

async function buildWithBudget(
  components: ContextComponents,
  modelMaxTokens: number
): Promise<PreparedContext> {
  const budget = allocateBudget(modelMaxTokens);

  // Truncate/summarize each component to fit budget
  const prepared = {
    system: truncateToTokens(components.system, budget.system),
    criticalContext: truncateToTokens(
      components.criticalContext, budget.criticalContext
    ),
    history: await summarizeToTokens(components.history, budget.history),
    query: truncateToTokens(components.query, budget.query),
  };

  // Reallocate unused budget
  const used = await countTokens(Object.values(prepared).join('\n'));
  const remaining = modelMaxTokens - used - budget.response;

  if (remaining > 0) {
    // Give extra to history (most valuable for conversation)
    prepared.history = await summarizeToTokens(
      components.history,
      budget.history + remaining
    );
  }

  return prepared;
}
```

## Anti-Patterns

### ❌ Naive Truncation

### ❌ Ignoring Token Costs

### ❌ One-Size-Fits-All
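The fixed 10/15/40/10/25 split above is easy to mirror in a few lines of Python; the percentages come straight from the TypeScript example, everything else here is an illustrative sketch:

```python
import math

def allocate_budget(total_tokens: int) -> dict:
    # Fixed shares mirroring the example above: 10/15/40/10/25 percent.
    # math.floor guarantees the allocations never exceed the total.
    shares = {
        "system": 0.10,
        "critical_context": 0.15,
        "history": 0.40,
        "query": 0.10,
        "response": 0.25,
    }
    return {name: math.floor(total_tokens * share) for name, share in shares.items()}
```

Flooring each share means a few tokens of slack are always left over, which is exactly the headroom the reallocation step hands back to history.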
## Validation Checks

### No Token Counting

Severity: WARNING

Message: Building context without token counting. May exceed model limits.

Fix action: Count tokens before sending, implement budget allocation
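Exact counts need a model tokenizer such as tiktoken, but even a rough estimator catches gross overruns before a request is sent. The ~4-characters-per-token ratio below is a common heuristic for English prose, not a tokenizer, and the reserve default is an illustrative assumption:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # Swap in a real tokenizer (e.g. tiktoken) when accuracy matters.
    return max(1, len(text) // 4)

def fits_context(text: str, limit: int, reserve: int = 1024) -> bool:
    # Leave headroom for the model's response tokens
    return estimate_tokens(text) + reserve <= limit
```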
### Naive Message Truncation

Severity: WARNING

Message: Truncating messages without summarization. Critical context may be lost.

Fix action: Summarize old messages instead of simply removing them

### Hardcoded Token Limit

Severity: INFO

Message: Hardcoded token limit. Consider making configurable per model.

Fix action: Use model-specific limits from configuration

### No Context Management Strategy

Severity: WARNING

Message: LLM calls without context management strategy.

Fix action: Implement context management: budgets, summarization, or RAG

## Collaboration

### Delegation Triggers

- retrieval|rag|search -> rag-implementation (Need retrieval system)
- memory|persistence|remember -> conversation-memory (Need memory storage)
- cache|caching -> prompt-caching (Need caching optimization)

### Complete Context System

Skills: context-window-management, rag-implementation, conversation-memory, prompt-caching

Workflow:

```
1. Design context strategy
2. Implement RAG for large corpuses
3. Set up memory persistence
4. Add caching for performance
```

## Related Skills

Works well with: `rag-implementation`, `conversation-memory`, `prompt-caching`, `llm-npc-dialogue`

## When to Use

Use this skill when the conversation matches any of the triggers below.

- User mentions or implies: context window
- User mentions or implies: token limit
- User mentions or implies: context management
- User mentions or implies: context engineering
- User mentions or implies: long context
- User mentions or implies: context overflow

@@ -1,13 +1,21 @@
---
name: langfuse
description: Expert in Langfuse - the open-source LLM observability platform.
  Covers tracing, prompt management, evaluation, datasets, and integration with
  LangChain, LlamaIndex, and OpenAI. Essential for debugging, monitoring, and
  improving LLM applications in production.
risk: unknown
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---

# Langfuse

Expert in Langfuse - the open-source LLM observability platform. Covers tracing,
prompt management, evaluation, datasets, and integration with LangChain, LlamaIndex,
and OpenAI. Essential for debugging, monitoring, and improving LLM applications
in production.

**Role**: LLM Observability Architect

You are an expert in LLM observability and evaluation. You think in terms of
@@ -15,6 +23,14 @@ traces, spans, and metrics. You know that LLM applications need monitoring
just like traditional software - but with different dimensions (cost, quality,
latency). You use data to drive prompt improvements and catch regressions.

### Expertise

- Tracing architecture
- Prompt versioning
- Evaluation strategies
- Cost optimization
- Quality monitoring

## Capabilities

- LLM tracing and observability
@@ -25,11 +41,42 @@ latency). You use data to drive prompt improvements and catch regressions.
- Performance monitoring
- A/B testing prompts

## Prerequisites

- Knowledge: LLM application basics, API integration experience, Understanding of tracing concepts
- Required skills: Python or TypeScript/JavaScript, Langfuse account (cloud or self-hosted), LLM API keys

## Scope

- Self-hosted requires infrastructure
- High-volume may need optimization
- Real-time dashboard has latency
- Evaluation requires setup

## Ecosystem

### Primary

- Langfuse Cloud
- Langfuse Self-hosted
- Python SDK
- JS/TS SDK

### Common_integrations

- LangChain
- LlamaIndex
- OpenAI SDK
- Anthropic SDK
- Vercel AI SDK

### Platforms

- Any Python/JS backend
- Serverless functions
- Jupyter notebooks

## Patterns

@@ -39,7 +86,6 @@ Instrument LLM calls with Langfuse

**When to use**: Any LLM application

```python
from langfuse import Langfuse

# Initialize client
@@ -91,7 +137,6 @@ trace.score(

# Flush before exit (important in serverless)
langfuse.flush()
```

### OpenAI Integration

@@ -99,7 +144,6 @@ Automatic tracing with OpenAI SDK

**When to use**: OpenAI-based applications

```python
from langfuse.openai import openai

# Drop-in replacement for OpenAI client
@@ -139,7 +183,6 @@ async def main():
        messages=[{"role": "user", "content": "Hello"}],
        name="async-greeting"
    )
```

### LangChain Integration

@@ -147,7 +190,6 @@ Trace LangChain applications

**When to use**: LangChain-based applications

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langfuse.callback import CallbackHandler
@@ -194,50 +236,263 @@ result = agent_executor.invoke(
    {"input": "What's the weather?"},
    config={"callbacks": [langfuse_handler]}
)
```
### Prompt Management

Version and deploy prompts

**When to use**: Managing prompts across environments

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch prompt from Langfuse
# (Create in UI or via API first)
prompt = langfuse.get_prompt("customer-support-v2")

# Get compiled prompt with variables
compiled = prompt.compile(
    customer_name="John",
    issue="billing question"
)

# Use with OpenAI
response = openai.chat.completions.create(
    model=prompt.config.get("model", "gpt-4o"),
    messages=compiled,
    temperature=prompt.config.get("temperature", 0.7)
)

# Link generation to prompt version
trace = langfuse.trace(name="support-chat")
generation = trace.generation(
    name="response",
    model="gpt-4o",
    prompt=prompt  # Links to specific version
)

# Create/update prompts via API
langfuse.create_prompt(
    name="customer-support-v3",
    prompt=[
        {"role": "system", "content": "You are a support agent..."},
        {"role": "user", "content": "{{user_message}}"}
    ],
    config={
        "model": "gpt-4o",
        "temperature": 0.7
    },
    labels=["production"]  # or ["staging", "development"]
)

# Fetch specific label
prompt = langfuse.get_prompt(
    "customer-support-v3",
    label="production"  # Gets latest with this label
)
```
### Evaluation and Scoring

Evaluate LLM outputs systematically

**When to use**: Quality assurance and improvement

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Manual scoring in code
trace = langfuse.trace(name="qa-flow")

# After getting response
trace.score(
    name="relevance",
    value=0.85,  # 0-1 scale
    comment="Response addressed the question"
)

trace.score(
    name="correctness",
    value=1,  # Binary: 0 or 1
    data_type="BOOLEAN"
)

# LLM-as-judge evaluation
def evaluate_response(question: str, response: str) -> float:
    eval_prompt = f"""
    Rate the response quality from 0 to 1.

    Question: {question}
    Response: {response}

    Output only a number between 0 and 1.
    """

    result = openai.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for eval
        messages=[{"role": "user", "content": eval_prompt}]
    )

    return float(result.choices[0].message.content.strip())

# Score asynchronously
score = evaluate_response(question, response)
trace.score(
    name="quality-llm-judge",
    value=score
)

# Create evaluation dataset
dataset = langfuse.create_dataset(name="support-qa-v1")

# Add items to dataset
langfuse.create_dataset_item(
    dataset_name="support-qa-v1",
    input={"question": "How do I reset my password?"},
    expected_output="Go to settings > security > reset password"
)

# Run evaluation on dataset
dataset = langfuse.get_dataset("support-qa-v1")

for item in dataset.items:
    # Generate response
    response = generate_response(item.input["question"])

    # Link to dataset item
    trace = langfuse.trace(name="eval-run")
    trace.generation(
        name="response",
        input=item.input,
        output=response
    )

    # Score against expected
    similarity = calculate_similarity(response, item.expected_output)
    trace.score(name="similarity", value=similarity)

    # Link trace to dataset item
    item.link(trace, "eval-run-1")
```
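Calling `float()` directly on judge output is brittle: models sometimes wrap the number in prose, or drift outside the 0-1 range. A small, hedged parsing helper (an illustrative addition, not part of the Langfuse SDK) makes the evaluation loop more robust:

```python
import re

def parse_judge_score(raw: str) -> float:
    # LLM judges sometimes add prose around the number;
    # extract the first numeric token and clamp it to [0, 1].
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        raise ValueError(f"No score found in judge output: {raw!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Using it in `evaluate_response` would just mean returning `parse_judge_score(result.choices[0].message.content)` instead of the bare `float(...)` cast.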
### Decorator Pattern

Clean instrumentation with decorators

**When to use**: Function-based applications

```python
from langfuse.decorators import observe, langfuse_context

@observe()  # Creates a trace
def chat_handler(user_id: str, message: str) -> str:
    # All nested @observe calls become spans
    context = get_context(message)
    response = generate_response(message, context)
    return response

@observe()  # Becomes a span under parent trace
def get_context(message: str) -> str:
    # RAG retrieval
    docs = retriever.get_relevant_documents(message)
    return "\n".join([d.page_content for d in docs])

@observe(as_type="generation")  # LLM generation span
def generate_response(message: str, context: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": message}
        ]
    )
    return response.choices[0].message.content

# Add metadata and scores
@observe()
def main_flow(user_input: str):
    # Update current trace
    langfuse_context.update_current_trace(
        user_id="user-123",
        session_id="session-456",
        tags=["production"]
    )

    result = process(user_input)

    # Score the trace
    langfuse_context.score_current_trace(
        name="success",
        value=1 if result else 0
    )

    return result

# Works with async
@observe()
async def async_handler(message: str):
    result = await async_generate(message)
    return result
```
## Collaboration

### Delegation Triggers

- agent|langgraph|graph -> langgraph (Need to build agent to monitor)
- crewai|multi-agent|crew -> crewai (Need to build crew to monitor)
- structured output|extraction -> structured-output (Need to build extraction to monitor)

### Observable LangGraph Agent

Skills: langfuse, langgraph

Workflow:

```
1. Build agent with LangGraph
2. Add Langfuse callback handler
3. Trace all LLM calls and tool uses
4. Score outputs for quality
5. Monitor and iterate
```

### Monitored RAG Pipeline

Skills: langfuse, structured-output

Workflow:

```
1. Build RAG with retrieval and generation
2. Trace retrieval and LLM calls
3. Score relevance and accuracy
4. Track costs and latency
5. Optimize based on data
```

### Evaluated Agent System

Skills: langfuse, langgraph, structured-output

Workflow:

```
1. Build agent with structured outputs
2. Create evaluation dataset
3. Run evaluations with traces
4. Compare prompt versions
5. Deploy best performers
```

## Anti-Patterns

### ❌ Not Flushing in Serverless

**Why bad**: Traces are batched.
Serverless may exit before flush.
Data is lost.

**Instead**: Always call langfuse.flush() at the end.
Use context managers where available.
Consider sync mode for critical traces.

### ❌ Tracing Everything

**Why bad**: Noisy traces.
Performance overhead.
Hard to find important info.

**Instead**: Focus on LLM calls, key logic, and user actions.
Group related operations.
Use meaningful span names.

### ❌ No User/Session IDs

**Why bad**: Can't debug specific users.
Can't track sessions.
Analytics limited.

**Instead**: Always pass user_id and session_id.
Use consistent identifiers.
Add relevant metadata.

## Limitations

- Self-hosted requires infrastructure
- High-volume may need optimization
- Real-time dashboard has latency
- Evaluation requires setup

## Related Skills

Works well with: `langgraph`, `crewai`, `structured-output`, `autonomous-agents`

## When to Use

Use this skill when the conversation matches any of the triggers below.

- User mentions or implies: langfuse
- User mentions or implies: llm observability
- User mentions or implies: llm tracing
- User mentions or implies: prompt management
- User mentions or implies: llm evaluation
- User mentions or implies: monitor llm
- User mentions or implies: debug llm

@@ -1,24 +1,15 @@
---
name: prompt-caching
description: Caching strategies for LLM prompts including Anthropic prompt
  caching, response caching, and CAG (Cache Augmented Generation)
risk: none
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27
---

# Prompt Caching

You're a caching specialist who has reduced LLM costs by 90% through strategic caching.
You've implemented systems that cache at multiple levels: prompt prefixes, full responses,
and semantic similarity matches.

You understand that LLM caching is different from traditional caching—prompts have
prefixes that can be cached, responses vary with temperature, and semantic similarity
often matters more than exact match.

Your core principles:
1. Cache at the right level—prefix, response, or both
2. K

Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache Augmented Generation)

## Capabilities

@@ -28,39 +19,461 @@ Your core principles:
- cag-patterns
- cache-invalidation

## Prerequisites

- Knowledge: Caching fundamentals, LLM API usage, Hash functions
- Skills_recommended: context-window-management

## Scope

- Does_not_cover: CDN caching, Database query caching, Static asset caching
- Boundaries: Focus is LLM-specific caching, Covers prompt and response caching

## Ecosystem

### Primary_tools

- Anthropic Prompt Caching - Native prompt caching in Claude API
- Redis - In-memory cache for responses
- OpenAI Caching - Automatic caching in OpenAI API

## Patterns

### Anthropic Prompt Caching

Use Claude's native prompt caching for repeated prefixes

**When to use**: Using Claude API with stable system prompts or context

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Cache the stable parts of your prompt
async function queryWithCaching(userQuery: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: LONG_SYSTEM_PROMPT, // Your detailed instructions
        cache_control: { type: "ephemeral" } // Cache this!
      },
      {
        type: "text",
        text: KNOWLEDGE_BASE, // Large static context
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [
      { role: "user", content: userQuery } // Dynamic part
    ]
  });

  // Check cache usage
  console.log(`Cache read: ${response.usage.cache_read_input_tokens}`);
  console.log(`Cache write: ${response.usage.cache_creation_input_tokens}`);

  return response;
}

// Cost savings: 90% reduction on cached tokens
// Latency savings: Up to 2x faster
```
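The break-even math behind "90% reduction on cached tokens" is worth making explicit: cache writes cost more than base input tokens, so caching only pays off across repeated calls. The 0.10 read and 1.25 write multipliers below are illustrative assumptions (check your provider's current pricing), and the function is a sketch, not an official calculator:

```python
def caching_cost_ratio(prefix_tokens: int, dynamic_tokens: int, calls: int,
                       read_mult: float = 0.10, write_mult: float = 1.25) -> float:
    # Cost with prompt caching relative to resending the full
    # prefix on every call (ratio < 1.0 means caching is cheaper).
    base = (prefix_tokens + dynamic_tokens) * calls
    cached = (prefix_tokens * write_mult                   # first call writes the cache
              + prefix_tokens * read_mult * (calls - 1)    # later calls read it cheaply
              + dynamic_tokens * calls)                    # dynamic part always full price
    return cached / base
```

With a 10K-token prefix reused across 100 calls the ratio lands near 0.12, while a single call actually costs slightly more than no caching at all.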
### Response Caching

Cache full LLM responses for identical or similar queries

**When to use**: Same queries asked repeatedly

```typescript
import { createHash } from 'crypto';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

class ResponseCache {
  private ttl = 3600; // 1 hour default

  // Exact match caching
  async getCached(prompt: string): Promise<string | null> {
    const key = this.hashPrompt(prompt);
    return await redis.get(`response:${key}`);
  }

  async setCached(prompt: string, response: string): Promise<void> {
    const key = this.hashPrompt(prompt);
    await redis.set(`response:${key}`, response, 'EX', this.ttl);
  }

  private hashPrompt(prompt: string): string {
    return createHash('sha256').update(prompt).digest('hex');
  }

  // Semantic similarity caching
  async getSemanticallySimilar(
    prompt: string,
    threshold: number = 0.95
  ): Promise<string | null> {
    const embedding = await embed(prompt);
    const similar = await this.vectorCache.search(embedding, 1);

    if (similar.length && similar[0].similarity > threshold) {
      return await redis.get(`response:${similar[0].id}`);
    }
    return null;
  }

  // Temperature-aware caching
  async getCachedWithParams(
    prompt: string,
    params: { temperature: number; model: string }
  ): Promise<string | null> {
    // Only cache low-temperature responses
    if (params.temperature > 0.5) return null;

    const key = this.hashPrompt(
      `${prompt}|${params.model}|${params.temperature}`
    );
    return await redis.get(`response:${key}`);
  }
}
```
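The same keying and temperature-gating logic can be exercised without Redis; this in-memory Python analogue is an illustrative sketch of the pattern above, not a production cache:

```python
import hashlib
import time
from typing import Optional

class ResponseCache:
    """Minimal in-memory analogue of the Redis-backed cache above."""

    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store = {}  # key -> (stored_at, response)

    def _key(self, prompt: str, model: str, temperature: float) -> str:
        # Model and temperature are part of the key, as in the TS example
        raw = f"{prompt}|{model}|{temperature}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt: str, model: str, temperature: float) -> Optional[str]:
        entry = self._store.get(self._key(prompt, model, temperature))
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            return None  # miss or expired
        return entry[1]

    def set(self, prompt: str, model: str, temperature: float, response: str) -> None:
        # Only cache near-deterministic responses
        if temperature > 0.5:
            return
        self._store[self._key(prompt, model, temperature)] = (time.monotonic(), response)
```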
### Cache Augmented Generation (CAG)
|
||||
|
||||
Pre-cache documents in prompt instead of RAG retrieval
|
||||
|
||||
## Anti-Patterns
|
||||
**When to use**: Document corpus is stable and fits in context
|
||||
|
||||
### ❌ Caching with High Temperature
|
||||
// CAG: Pre-compute document context, cache in prompt
|
||||
// Better than RAG when:
|
||||
// - Documents are stable
|
||||
// - Total fits in context window
|
||||
// - Latency is critical
|
||||
|
||||
### ❌ No Cache Invalidation
|
||||
class CAGSystem {
|
||||
private cachedContext: string | null = null;
|
||||
private lastUpdate: number = 0;
|
||||
|
||||
### ❌ Caching Everything
|
||||
async buildCachedContext(documents: Document[]): Promise<void> {
|
||||
// Pre-process and format documents
|
||||
const formatted = documents.map(d =>
|
||||
`## ${d.title}\n${d.content}`
|
||||
).join('\n\n');
|
||||
|
||||
## ⚠️ Sharp Edges
|
||||
// Store with timestamp
|
||||
this.cachedContext = formatted;
|
||||
this.lastUpdate = Date.now();
|
||||
}
|
||||
|
||||
| Issue | Severity | Solution |
|
||||
|-------|----------|----------|
|
||||
| Cache miss causes latency spike with additional overhead | high | // Optimize for cache misses, not just hits |
|
||||
| Cached responses become incorrect over time | high | // Implement proper cache invalidation |
|
||||
| Prompt caching doesn't work due to prefix changes | medium | // Structure prompts for optimal caching |
|
||||
async query(userQuery: string): Promise<string> {
|
||||
// Use cached context directly in prompt
|
||||
const response = await client.messages.create({
|
||||
model: "claude-sonnet-4-20250514",
|
||||
max_tokens: 1024,
|
||||
system: [
|
||||
{
|
||||
type: "text",
|
||||
text: "You are a helpful assistant with access to the following documentation.",
|
||||
cache_control: { type: "ephemeral" }
|
||||
},
|
||||
{
|
||||
type: "text",
|
||||
text: this.cachedContext!, // Pre-cached docs
|
||||
cache_control: { type: "ephemeral" }
|
||||
}
|
||||
],
|
||||
messages: [{ role: "user", content: userQuery }]
|
||||
});
|
||||
|
||||
return response.content[0].text;
|
||||
}
|
||||
|
||||
// Periodic refresh
|
||||
async refreshIfNeeded(documents: Document[]): Promise<void> {
|
||||
const stale = Date.now() - this.lastUpdate > 3600000; // 1 hour
|
||||
if (stale) {
|
||||
await this.buildCachedContext(documents);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// CAG vs RAG decision matrix:
|
||||
// | Factor | CAG Better | RAG Better |
|
||||
// |------------------|------------|------------|
|
||||
// | Corpus size | < 100K tokens | > 100K tokens |
|
||||
// | Update frequency | Low | High |
|
||||
// | Latency needs | Critical | Flexible |
|
||||
// | Query specificity| General | Specific |
|
||||
|
||||
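The decision matrix above can be expressed as a small helper; the thresholds mirror the table, while the `CorpusProfile` shape and function name are assumptions for illustration:

```typescript
interface CorpusProfile {
  corpusTokens: number;  // total size of the document set
  updatesPerDay: number; // how often source documents change
}

// Translation of the CAG vs RAG rows above: small, stable corpora favor CAG
// (and CAG also wins when latency is critical); large or fast-changing
// corpora favor RAG.
function chooseStrategy(p: CorpusProfile): "CAG" | "RAG" {
  if (p.corpusTokens > 100_000) return "RAG"; // won't fit in the context window
  if (p.updatesPerDay > 1) return "RAG";      // frequent updates churn the cache
  return "CAG";
}
```

In practice these cutoffs are starting points to tune against your own corpus and latency budget, not hard rules.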
## Sharp Edges

### Cache miss causes latency spike with additional overhead

Severity: HIGH

Situation: Responses on a cache miss are slower than with no caching at all

Symptoms:
- Slow responses on cache miss
- Cache hit rate below 50%
- Higher latency than uncached

Why this breaks:
The cache check adds latency, and the cache write adds more. On a miss you pay that overhead on top of the full LLM call, so miss + overhead > no caching.

Recommended fix:

```typescript
// Optimize for cache misses, not just hits

class OptimizedCache {
  async queryWithCache(prompt: string): Promise<string> {
    const cacheKey = this.hash(prompt);

    // Non-blocking cache check
    const cachedPromise = this.cache.get(cacheKey);
    const llmPromise = this.queryLLM(prompt);

    // Race: use cache if available before LLM returns
    const cached = await Promise.race([
      cachedPromise,
      sleep(50).then(() => null) // 50ms cache timeout
    ]);

    if (cached) {
      // Cancel LLM request if possible
      return cached;
    }

    // Cache miss: continue with LLM
    const response = await llmPromise;

    // Async cache write (don't block response)
    this.cache.set(cacheKey, response).catch(console.error);

    return response;
  }
}

// Alternative: Probabilistic caching
// Only cache if query matches known high-frequency patterns
class SelectiveCache {
  private patterns: Map<string, number> = new Map();

  shouldCache(prompt: string): boolean {
    const pattern = this.extractPattern(prompt);
    const frequency = this.patterns.get(pattern) || 0;

    // Only cache high-frequency patterns
    return frequency > 10;
  }

  recordQuery(prompt: string): void {
    const pattern = this.extractPattern(prompt);
    this.patterns.set(pattern, (this.patterns.get(pattern) || 0) + 1);
  }
}
```
### Cached responses become incorrect over time

Severity: HIGH

Situation: Users get outdated or wrong information from the cache

Symptoms:
- Users report wrong information
- Answers don't match current data
- Complaints about outdated responses

Why this breaks:
The source data changed, but there is no cache invalidation, and TTLs are too long for dynamic data.

Recommended fix:

```typescript
// Implement proper cache invalidation

class InvalidatingCache {
  // Version-based invalidation
  private cacheVersion = 1;

  getCacheKey(prompt: string): string {
    return `v${this.cacheVersion}:${this.hash(prompt)}`;
  }

  invalidateAll(): void {
    this.cacheVersion++;
    // Old keys automatically become orphaned
  }

  // Content-hash invalidation
  async setWithContentHash(
    key: string,
    response: string,
    sourceContent: string
  ): Promise<void> {
    const contentHash = this.hash(sourceContent);
    await this.cache.set(key, {
      response,
      contentHash,
      timestamp: Date.now()
    });
  }

  async getIfValid(
    key: string,
    currentSourceContent: string
  ): Promise<string | null> {
    const cached = await this.cache.get(key);
    if (!cached) return null;

    // Check if source content changed
    const currentHash = this.hash(currentSourceContent);
    if (cached.contentHash !== currentHash) {
      await this.cache.delete(key);
      return null;
    }

    return cached.response;
  }

  // Event-based invalidation
  onSourceUpdate(sourceId: string): void {
    // Invalidate all caches that used this source
    this.invalidateByTag(`source:${sourceId}`);
  }
}
```
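Alongside the version, content-hash, and event-based strategies above, a plain TTL guard covers the "long TTLs for dynamic data" case; this is a minimal in-memory sketch (the `Entry` shape and the injectable clock parameter are assumptions for testability):

```typescript
interface Entry { response: string; timestamp: number; }

// Wrap reads with a freshness check so stale entries behave as misses.
class TTLCache {
  private store = new Map<string, Entry>();
  constructor(private ttlMs: number) {}

  set(key: string, response: string, now = Date.now()): void {
    this.store.set(key, { response, timestamp: now });
  }

  get(key: string, now = Date.now()): string | null {
    const e = this.store.get(key);
    if (!e) return null;
    if (now - e.timestamp > this.ttlMs) {
      this.store.delete(key); // expired: evict and report a miss
      return null;
    }
    return e.response;
  }
}
```

With Redis the same effect comes for free via `SET key value EX seconds`; this wrapper matters when the store has no native expiry.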
### Prompt caching doesn't work due to prefix changes

Severity: MEDIUM

Situation: Cache misses despite similar prompts

Symptoms:
- Cache hit rate lower than expected
- Cache creation tokens high, read tokens low
- Similar prompts not hitting cache

Why this breaks:
Anthropic's prompt caching requires an exact prefix match, so timestamps or other dynamic content in the prefix, or an inconsistent message order, break every hit.

Recommended fix:

```typescript
// Structure prompts for optimal caching

class CacheOptimizedPrompts {
  // WRONG: Dynamic content in cached prefix
  buildPromptBad(query: string): SystemMessage[] {
    return [
      {
        type: "text",
        text: `You are helpful. Current time: ${new Date()}`, // BREAKS CACHE!
        cache_control: { type: "ephemeral" }
      }
    ];
  }

  // RIGHT: Static prefix, dynamic at end
  buildPromptGood(query: string): SystemMessage[] {
    return [
      {
        type: "text",
        text: STATIC_SYSTEM_PROMPT, // Never changes
        cache_control: { type: "ephemeral" }
      },
      {
        type: "text",
        text: STATIC_KNOWLEDGE_BASE, // Rarely changes
        cache_control: { type: "ephemeral" }
      }
      // Dynamic content goes in messages, NOT system
    ];
  }

  // Prefix ordering matters
  buildWithConsistentOrder(components: string[]): SystemMessage[] {
    // Sort components for consistent ordering
    const sorted = [...components].sort();
    return sorted.map((c, i) => ({
      type: "text",
      text: c,
      cache_control: i === sorted.length - 1
        ? { type: "ephemeral" }
        : undefined // Only cache the full prefix
    }));
  }
}
```
## Validation Checks

### Caching High Temperature Responses

Severity: WARNING

Message: Caching with high temperature. Responses are non-deterministic.

Fix action: Only cache responses with temperature <= 0.5

### Cache Without TTL

Severity: WARNING

Message: Cache without TTL. May serve stale data indefinitely.

Fix action: Set appropriate TTL based on data freshness requirements

### Dynamic Content in Cached Prefix

Severity: WARNING

Message: Dynamic content in cached prefix. Will cause cache misses.

Fix action: Move dynamic content outside of cache_control blocks

### No Cache Metrics

Severity: INFO

Message: Cache without hit/miss tracking. Can't measure effectiveness.

Fix action: Add cache hit/miss metrics and logging
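These four checks can be mechanized; the sketch below runs them over a hypothetical `CacheCallConfig` description of a caching call site (the config shape is an assumption, the messages are the ones listed above):

```typescript
interface CacheCallConfig {
  temperature: number;
  ttlSeconds?: number;              // absent means no TTL configured
  prefixHasDynamicContent: boolean; // e.g. timestamps inside cache_control blocks
  tracksMetrics: boolean;           // hit/miss counters wired up
}

interface Finding { severity: "WARNING" | "INFO"; message: string; }

// Apply each validation check in order and collect the findings.
function validateCacheConfig(cfg: CacheCallConfig): Finding[] {
  const findings: Finding[] = [];
  if (cfg.temperature > 0.5) {
    findings.push({ severity: "WARNING", message: "Caching with high temperature. Responses are non-deterministic." });
  }
  if (cfg.ttlSeconds === undefined) {
    findings.push({ severity: "WARNING", message: "Cache without TTL. May serve stale data indefinitely." });
  }
  if (cfg.prefixHasDynamicContent) {
    findings.push({ severity: "WARNING", message: "Dynamic content in cached prefix. Will cause cache misses." });
  }
  if (!cfg.tracksMetrics) {
    findings.push({ severity: "INFO", message: "Cache without hit/miss tracking. Can't measure effectiveness." });
  }
  return findings;
}
```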
## Collaboration

### Delegation Triggers

- context window|token -> context-window-management (Need context optimization)
- rag|retrieval -> rag-implementation (Need retrieval system)
- memory -> conversation-memory (Need memory persistence)
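A dispatcher might apply the trigger list above as a first-match routing table; the skill names come from the list, while the routing code itself is an illustrative sketch:

```typescript
// Ordered keyword -> skill routing table, mirroring the delegation triggers.
const triggers: Array<[RegExp, string]> = [
  [/context window|token/i, "context-window-management"],
  [/rag|retrieval/i, "rag-implementation"],
  [/memory/i, "conversation-memory"],
];

// Return the first matching skill, or null to keep handling locally.
function routeQuery(query: string): string | null {
  for (const [pattern, skill] of triggers) {
    if (pattern.test(query)) return skill;
  }
  return null;
}
```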
### High-Performance LLM System

Skills: prompt-caching, context-window-management, rag-implementation

Workflow:

```
1. Analyze query patterns
2. Implement prompt caching for stable prefixes
3. Add response caching for frequent queries
4. Consider CAG for stable document sets
5. Monitor and optimize hit rates
```

## Related Skills

Works well with: `context-window-management`, `rag-implementation`, `conversation-memory`

## When to Use

Use this skill to execute the caching workflow and actions described above, in particular when:

- User mentions or implies: prompt caching
- User mentions or implies: cache prompt
- User mentions or implies: response cache
- User mentions or implies: cag
- User mentions or implies: cache augmented