Skill Seekers Intelligence System - Research Topics
Version: 1.0 Status: 🔬 Research Phase Last Updated: 2026-01-20 Purpose: Areas to research and experiment with before/during implementation
🔬 Research Areas
1. Import Analysis Accuracy
Question: How accurate is AST-based import analysis for finding relevant skills?
Hypothesis: 85-90% accuracy for Python, lower for JavaScript (dynamic imports)
Research Plan:
- Dataset: Analyze 10 real-world Python projects
- Ground Truth: Manually identify relevant modules for 50 test files
- Measure: Precision, recall, F1-score
- Iterate: Improve import parser based on results
Test Cases:
# Case 1: Simple import
from fastapi import FastAPI
# Expected: Load fastapi.skill
# Case 2: Relative import
from .models import User
# Expected: Load models.skill
# Case 3: Dynamic import
importlib.import_module("my_module")
# Expected: ??? (hard to detect)
# Case 4: Nested import
from src.api.v1.routes import router
# Expected: Load api.skill
# Case 5: Import with alias
from very_long_name import X as Y
# Expected: Load very_long_name.skill
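A minimal sketch of the AST pass this research would evaluate (mapping a module name to a skill file is assumed to happen elsewhere):
import ast
from pathlib import Path

def extract_imports(path: Path) -> set[str]:
    """Collect top-level module names imported by a Python file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    modules: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Covers Cases 1, 2, 4, 5; relative imports (node.level > 0) still
            # need package context, and Case 3 (importlib) is invisible to this pass.
            modules.add(node.module.split(".")[0])
    return modules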
Success Criteria:
- >85% precision (few false positives)
- >80% recall (few false negatives)
- <100ms parse time per file
Findings: (To be filled during research)
2. Embedding Model Selection
Question: Which embedding model is best for code similarity?
Candidates:
- sentence-transformers/all-MiniLM-L6-v2 (80MB, general purpose)
- microsoft/codebert-base (500MB, code-specific)
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (420MB, multilingual)
- Custom fine-tuned (train on code + docs)
Evaluation Criteria:
- Speed: Embedding time per file
- Size: Model download size
- Accuracy: Similarity to ground truth
- Resource: RAM/CPU usage
Benchmark Plan:
# Dataset: 100 Python files + 20 skills
# For each file:
# 1. Manual: Which skills are relevant? (ground truth)
# 2. Each model: Rank skills by similarity
# 3. Measure: Precision@5, Recall@5, MRR
models = [
    "all-MiniLM-L6-v2",
    "codebert-base",
    "paraphrase-multilingual",
]
results = {}
for model in models:
    results[model] = benchmark(model, dataset)
# Compare
print(results)
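One way the benchmark helper above could be fleshed out with sentence-transformers; the dataset item shape and the Precision@5-only scoring are assumptions for illustration:
from sentence_transformers import SentenceTransformer, util

def benchmark(model_name: str, dataset: list[dict]) -> float:
    # Each dataset item is assumed to look like:
    #   {"file_text": str, "skill_texts": list[str], "relevant": set[int]}
    model = SentenceTransformer(model_name)
    hits, ranked = 0, 0
    for item in dataset:
        file_emb = model.encode(item["file_text"], convert_to_tensor=True)
        skill_embs = model.encode(item["skill_texts"], convert_to_tensor=True)
        scores = util.cos_sim(file_emb, skill_embs)[0]
        top5 = scores.argsort(descending=True)[:5].tolist()
        hits += sum(1 for i in top5 if i in item["relevant"])
        ranked += len(top5)
    return hits / ranked  # Precision@5; Recall@5 and MRR would follow the same pattern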
Expected Results:
| Model | Speed | Size | Accuracy | RAM | Winner? |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 50ms | 80MB | 75% | 200MB | ✅ Best balance |
| codebert-base | 200ms | 500MB | 85% | 1GB | Too slow/large |
| paraphrase-multi | 100ms | 420MB | 78% | 500MB | Middle ground |
Success Criteria:
- <100ms embedding time
- <200MB model size
- >75% accuracy (better than random)
Findings: (To be filled during research)
3. Skill Granularity
Question: How fine-grained should skills be?
Options:
- Coarse: One skill per 1000+ LOC (e.g., entire backend)
- Medium: One skill per 200-500 LOC (e.g., api, auth, models)
- Fine: One skill per 50-100 LOC (e.g., each endpoint)
Trade-offs:
| Granularity | Skills | Skill Size | Context Usage | Accuracy |
|---|---|---|---|---|
| Coarse | 3-5 | 500 lines | Low | Low (too broad) |
| Medium | 10-15 | 200 lines | Medium | ✅ Good |
| Fine | 50+ | 50 lines | High | Too specific |
Experiment:
- Generate skills at all 3 granularities for skill-seekers
- Use each set for 1 week of development
- Measure: usefulness (subjective), context overflow (objective)
Success Criteria:
- Skills feel "right-sized" (not too broad, not too narrow)
- <5 skills needed for typical task
- Skills don't overflow context (< 10K tokens total)
Findings: (To be filled during research)
4. Clustering Strategy Performance
Question: Which clustering strategy is best?
Strategies:
- Import-only: Fast, deterministic
- Embedding-only: Flexible, catches semantics
- Hybrid (70/30): Best of both
- Hybrid (50/50): Equal weight
- Hybrid with learning: Adjust weights based on feedback
Evaluation:
# Dataset: 50 files with manually labeled relevant skills
strategies = {
    "import_only": ImportBasedEngine(),
    "embedding_only": EmbeddingBasedEngine(),
    "hybrid_70_30": HybridEngine(0.7, 0.3),
    "hybrid_50_50": HybridEngine(0.5, 0.5),
}
for name, engine in strategies.items():
    scores = evaluate(engine, dataset)
    print(f"{name}: Precision={scores.precision}, Recall={scores.recall}")
Expected Results:
| Strategy | Precision | Recall | F1 | Speed | Winner? |
|---|---|---|---|---|---|
| Import-only | 90% | 75% | 82% | 50ms | Fast, precise |
| Embedding-only | 75% | 85% | 80% | 100ms | Flexible |
| Hybrid 70/30 | 88% | 82% | 85% | 80ms | ✅ Best balance |
| Hybrid 50/50 | 85% | 85% | 85% | 80ms | Equal weight |
Success Criteria:
- Hybrid beats both individual strategies
- <100ms clustering time
- >85% F1-score
Findings: (To be filled during research)
5. Git Hook Performance
Question: How long does skill regeneration take?
Variables:
- Codebase size (100, 500, 1000, 5000 files)
- Analysis depth (surface, deep, full)
- Incremental vs full regeneration
Benchmark:
# Test on real projects
projects = [
    ("skill-seekers", 140, "Python"),
    ("fastapi", 500, "Python"),
    ("react", 1000, "JavaScript"),
    ("vscode", 5000, "TypeScript"),
]
for name, files, lang in projects:
    # Full regeneration
    time_full = time_regeneration(name, incremental=False)
    # Incremental (10% changed)
    time_incr = time_regeneration(name, incremental=True, changed_ratio=0.1)
    print(f"{name}: Full={time_full}s, Incremental={time_incr}s")
Expected Results:
| Project | Files | Full | Incremental | Acceptable? |
|---|---|---|---|---|
| skill-seekers | 140 | 3 min | 30 sec | ✅ Yes |
| fastapi | 500 | 8 min | 1 min | ✅ Yes |
| react | 1000 | 15 min | 2 min | ⚠️ Borderline |
| vscode | 5000 | 60 min | 10 min | ❌ Too slow |
Optimizations if too slow:
- Parallel analysis (multiprocessing)
- Smarter incremental (only changed modules)
- Background daemon (non-blocking)
Success Criteria:
- <5 min for typical project (500 files)
- <2 min for incremental update
- Can run in background without blocking
Findings: (To be filled during research)
6. Context Window Management
Question: How to handle context overflow with large skills?
Problem: Claude has a 200K-token context window, but large projects generate skill sets far larger than that budget
Solutions:
- Skill Summarization: Compress skills (API signatures only, no examples)
- Dynamic Loading: Load skill sections on-demand
- Skill Splitting: Further split large skills into sub-skills
- Priority System: Load most important skills first
Experiment:
# Generate skills for large project (5000 files)
# Measure context usage
skills = generate_skills("large-project")
total_tokens = sum(count_tokens(s) for s in skills)
print(f"Total tokens: {total_tokens}")
print("Context budget: 200,000")
print(f"Remaining: {200_000 - total_tokens}")
if total_tokens > 150_000:  # Leave room for conversation
    print("WARNING: Context overflow!")
    # Try solutions
    compressed = compress_skills(skills)
    print(f"After compression: {sum(count_tokens(s) for s in compressed)}")
Success Criteria:
- Skills fit in context (< 150K tokens)
- Quality doesn't degrade significantly
- User has control (can choose which skills to load)
Findings: (To be filled during research)
7. Multi-Language Support
Question: How well does the system work for non-Python languages?
Languages to Support:
- Python (primary, best support)
- JavaScript/TypeScript (common frontend)
- Go (backend microservices)
- Rust (systems programming)
- Java (enterprise)
Challenges:
- Import syntax varies (import vs require vs use; see the parser sketch after this list)
- Module systems differ (CommonJS, ESM, Go modules)
- Embedding accuracy may vary
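As an illustration, a hedged sketch of a regex-based JavaScript import parser; the patterns are simplified, and real ESM/CommonJS handling would need more cases:
import re
from pathlib import Path

class JavaScriptImportParser:
    # Very simplified: catches `import ... from "<module>"` and `require("<module>")`
    IMPORT_RE = re.compile(r"""import\s+[^'"]*['"]([^'"]+)['"]""")
    REQUIRE_RE = re.compile(r"""require\(\s*['"]([^'"]+)['"]\s*\)""")

    def parse(self, path: Path) -> set[str]:
        text = path.read_text(encoding="utf-8")
        return set(self.IMPORT_RE.findall(text)) | set(self.REQUIRE_RE.findall(text))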
Research Plan:
- Implement import parsers for each language
- Test on real projects
- Measure accuracy vs Python baseline
Expected Results:
| Language | Import Parse | Embedding | Overall | Support? |
|---|---|---|---|---|
| Python | 90% | 85% | 88% | ✅ Excellent |
| JavaScript | 80% | 85% | 83% | ✅ Good |
| TypeScript | 85% | 85% | 85% | ✅ Good |
| Go | 75% | 80% | 78% | ⚠️ Acceptable |
| Rust | 70% | 80% | 75% | ⚠️ Acceptable |
| Java | 65% | 80% | 73% | ⚠️ Basic |
Success Criteria:
- Python: >85% accuracy (primary focus)
- JS/TS: >80% accuracy (important)
- Others: >70% accuracy (nice to have)
Findings: (To be filled during research)
8. Library Skill Quality
Question: How good are auto-generated library skills vs handcrafted?
Experiment:
- Generate library skills for popular frameworks:
- FastAPI (from docs)
- React (from docs)
- PostgreSQL (from docs)
- Compare to handcrafted skills (manually written)
- Measure: completeness, accuracy, usefulness
Evaluation Criteria:
- Completeness: Does it cover all key APIs?
- Accuracy: Is information correct?
- Usefulness: Do developers find it helpful?
- Freshness: Is it up-to-date?
Test Plan:
# For each framework:
# 1. Auto-generate skill
# 2. Handcraft skill (1 hour of work)
# 3. A/B test with 5 developers
# 4. Measure: time to complete task, satisfaction
frameworks = ["FastAPI", "React", "PostgreSQL"]
for framework in frameworks:
    auto_skill = generate_skill(framework)
    hand_skill = handcraft_skill(framework)
    results = ab_test(auto_skill, hand_skill, n_users=5)
    print(f"{framework}:")
    print(f"  Auto: {results.auto_score}/10")
    print(f"  Hand: {results.hand_score}/10")
Expected Results:
| Framework | Auto | Hand | Difference | Acceptable? |
|---|---|---|---|---|
| FastAPI | 7/10 | 9/10 | -2 | ✅ Close enough |
| React | 6/10 | 9/10 | -3 | ⚠️ Needs work |
| PostgreSQL | 5/10 | 9/10 | -4 | ❌ Too far |
Optimization:
- If auto-generated is <7/10, use handcrafted
- Offer both: curated (handcrafted) + auto-generated
- Community contributions for popular frameworks
Success Criteria:
- Auto-generated is >7/10 quality
- Users find library skills helpful
- Skills stay up-to-date (auto-regenerate)
Findings: (To be filled during research)
9. Skill Update Frequency
Question: How often do skills need updating?
Variables:
- Codebase churn rate (commits/day)
- Trigger: every commit vs every merge vs weekly
- Impact: staleness vs performance
Experiment:
# Track a real project for 1 month
# Measure:
# - How often code changes affect skills
# - How stale skills get if not updated
# - User tolerance for staleness
project = "skill-seekers"
duration = "30 days"
events = track_changes(project, duration)
print(f"Total commits: {events.commits}")
print(f"Skill-affecting changes: {events.skill_changes}")
print(f"Ratio: {events.skill_changes / events.commits}")
# Test different update frequencies
frequencies = ["every-commit", "every-merge", "daily", "weekly"]
for freq in frequencies:
    staleness = measure_staleness(freq)
    perf_cost = measure_performance_cost(freq)
    print(f"{freq}: Staleness={staleness}, Cost={perf_cost}")
Expected Results:
| Frequency | Staleness | Perf Cost | CPU Usage | Acceptable? |
|---|---|---|---|---|
| Every commit | 0% | High | 50%+ | ❌ Too much |
| Every merge | 5% | Medium | 10% | ✅ Good |
| Daily | 15% | Low | 2% | ✅ Good |
| Weekly | 40% | Very low | <1% | ⚠️ Too stale |
Recommendation: Update on merge to watched branches (main, dev)
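A minimal .git/hooks/post-merge sketch along these lines; the skill-seekers regenerate command is hypothetical, not an existing CLI:
#!/usr/bin/env python3
# .git/hooks/post-merge (mark executable): regenerate skills after merges to watched branches
import subprocess

WATCHED = {"main", "dev"}

branch = subprocess.run(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()

if branch in WATCHED:
    # Hypothetical CLI entry point; run detached so the merge isn't blocked
    subprocess.Popen(["skill-seekers", "regenerate", "--incremental"])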
Success Criteria:
- Skills <10% stale
- Performance overhead <10% CPU
- Users don't notice staleness
Findings: (To be filled during research)
10. Plugin Integration Patterns
Question: What's the best way to integrate with Claude Code?
Options:
- File Hooks: React to file open/save events
- Command Palette: User manually loads skills
- Automatic: Always load best skills
- Hybrid: Auto-load + manual override
User Experience Testing:
# Test with 5 developers for 1 week each
patterns = [
    "file_hooks",       # Auto-load on file open
    "command_palette",  # Manual: Cmd+Shift+P -> "Load Skills"
    "automatic",        # Always load, no user action
    "hybrid",           # Auto + manual override
]
for pattern in patterns:
    feedback = test_with_users(pattern, n_users=5, days=7)
    print(f"{pattern}:")
    print(f"  Ease of use: {feedback.ease}/10")
    print(f"  Control: {feedback.control}/10")
    print(f"  Satisfaction: {feedback.satisfaction}/10")
Expected Results:
| Pattern | Ease | Control | Satisfaction | Winner? |
|---|---|---|---|---|
| File Hooks | 9/10 | 7/10 | 8/10 | ✅ Automatic |
| Command Palette | 6/10 | 10/10 | 7/10 | Power users |
| Automatic | 10/10 | 5/10 | 7/10 | Too magic |
| Hybrid | 9/10 | 9/10 | 9/10 | ✅✅ Best |
Recommendation: Hybrid approach
- Auto-load on file open (convenience)
- Show notification (transparency)
- Allow manual override (control)
Success Criteria:
- Users don't think about it (automatic)
- Users can control it (override)
- Users trust it (transparent)
Findings: (To be filled during research)
🧪 Experimental Ideas
Idea 1: Conversation-Aware Clustering
Concept: Use chat history to improve skill clustering
Algorithm:
from pathlib import Path

def find_relevant_skills_with_context(
    current_file: Path,
    conversation_history: list[str],
) -> list[Path]:
    # Extract topics from recent messages
    topics = extract_topics(conversation_history[-10:])
    # Examples: "authentication", "database", "API endpoints"
    # Find skills matching these topics
    topic_skills = find_skills_by_topic(topics)
    # Combine with file-based clustering
    file_skills = find_relevant_skills(current_file)
    # Merge with weighted ranking (30% conversation, 70% current file)
    return merge(topic_skills, file_skills, weights=[0.3, 0.7])
Example:
User: "How do I add authentication to the API?"
Claude: [loads auth.skill, api.skill]
User: "Now show me the database models"
Claude: [keeps auth.skill (context), adds models.skill]
User: "How do I test this?"
Claude: [adds tests.skill, keeps auth.skill, models.skill]
Potential: High (conversation context is valuable)
Complexity: Medium (need to parse conversation)
Risk: Low (can fail gracefully)
Idea 2: Feedback Loop Learning
Concept: Learn from user corrections to improve clustering
Algorithm:
from datetime import datetime
from pathlib import Path

class FeedbackLearner:
    def __init__(self, clustering_engine):
        self.clustering_engine = clustering_engine
        self.history = []  # (file, loaded_skills, user_feedback)

    def record_feedback(self, file: Path, loaded: list, feedback: str):
        """
        feedback: "skill X was not helpful" or "missing skill Y"
        """
        self.history.append({
            "file": file,
            "loaded": loaded,
            "feedback": feedback,
            "timestamp": datetime.now(),
        })

    def adjust_weights(self):
        """
        Learn from feedback to adjust clustering weights
        """
        # If skill X is frequently marked "not helpful" for files in dir Y:
        #   → Reduce X's weight for Y
        # If skill Y is frequently requested for files in dir Z:
        #   → Increase Y's weight for Z
        learned_weights = aggregate_feedback(self.history)  # pseudocode: derive weights from history
        # Update clustering engine weights
        self.clustering_engine.update_weights(learned_weights)
Potential: Very High (personalized to user)
Complexity: High (ML/learning system)
Risk: Medium (could learn wrong patterns)
Idea 3: Multi-File Context
Concept: Load skills for all open files, not just current
Algorithm:
from collections import Counter
from pathlib import Path

def find_relevant_skills_multi_file(
    open_files: list[Path],
) -> list[Path]:
    skill_counts = Counter()
    for file in open_files:
        skill_counts.update(find_relevant_skills(file))
    # Rank by frequency across files (skills shared by many open files come first)
    ranked = [skill for skill, _ in skill_counts.most_common()]
    return ranked[:10]  # Top 10 (more files = more skills needed)
Example:
Open tabs:
- src/api/users.py
- src/models/user.py
- src/auth/jwt.py
Loaded skills:
- api.skill (from users.py)
- models.skill (from user.py)
- auth.skill (from jwt.py)
- fastapi.skill (common across all)
Potential: High (developers work on multiple files)
Complexity: Low (just aggregate)
Risk: Low (might load too many skills)
Idea 4: Skill Versioning
Concept: Track skill changes over time, allow rollback
Implementation:
.skill-seekers/skills/
├── codebase/
│ └── api.skill
│
└── versions/
└── api/
├── api.skill.2026-01-20-v1
├── api.skill.2026-01-19-v1
└── api.skill.2026-01-15-v1
Commands:
# View skill history
skill-seekers skill-history api.skill
# Diff versions
skill-seekers skill-diff api.skill --from 2026-01-15 --to 2026-01-20
# Rollback
skill-seekers skill-rollback api.skill --to 2026-01-19
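Since versioning is "just file copies", the snapshot step could be as simple as the sketch below; the naming scheme mirrors the layout above, and the helper name is an assumption:
import shutil
from datetime import date
from pathlib import Path

def snapshot_skill(skill_path: Path, versions_root: Path) -> Path:
    """Copy e.g. api.skill to versions/api/api.skill.<YYYY-MM-DD>-v<N> before regenerating."""
    version_dir = versions_root / skill_path.stem      # "api" from "api.skill"
    version_dir.mkdir(parents=True, exist_ok=True)
    stamp = date.today().isoformat()
    n = len(list(version_dir.glob(f"{skill_path.name}.{stamp}-v*"))) + 1
    dest = version_dir / f"{skill_path.name}.{stamp}-v{n}"
    shutil.copy2(skill_path, dest)
    return dest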
Potential: Medium (useful for debugging)
Complexity: Low (just file copies)
Risk: Low (storage cost)
Idea 5: Skill Analytics
Concept: Track which skills are most useful
Metrics (a minimal logging sketch follows this list):
- Load frequency (how often loaded)
- Dwell time (how long in context)
- User rating (thumbs up/down)
- Task completion (helped solve problem?)
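Load-frequency tracking could start as an append-only JSONL log; a minimal sketch where the log location and event shape are assumptions:
import json
from datetime import datetime, timezone
from pathlib import Path

ANALYTICS_LOG = Path(".skill-seekers/analytics/events.jsonl")  # assumed location

def record_skill_load(skill: str, trigger_file: str) -> None:
    # Dwell time, ratings, and task outcomes would be recorded as separate event types
    ANALYTICS_LOG.parent.mkdir(parents=True, exist_ok=True)
    event = {
        "type": "load",
        "skill": skill,
        "file": trigger_file,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    with ANALYTICS_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")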
Dashboard:
Skill Analytics
===============
Most Loaded:
1. api.skill (45 times)
2. models.skill (38 times)
3. fastapi.skill (32 times)
Most Helpful (by rating):
1. api.skill (4.8/5.0)
2. auth.skill (4.5/5.0)
3. tests.skill (4.2/5.0)
Least Helpful:
1. deprecated.skill (2.1/5.0) ← Maybe remove?
Potential: Medium (helps improve the system)
Complexity: Medium (tracking infrastructure)
Risk: Low (privacy concerns only if data is shared externally)
📊 Research Checklist
Phase 0: Before Implementation
- Import analysis accuracy (Research #1)
- Embedding model selection (Research #2)
- Skill granularity (Research #3)
- Git hook performance (Research #5)
Phase 1-3: During Implementation
- Clustering strategy (Research #4)
- Multi-language support (Research #7)
- Skill update frequency (Research #9)
Phase 4-5: Advanced Features
- Context window management (Research #6)
- Library skill quality (Research #8)
- Plugin integration (Research #10)
Experimental (Optional)
- Conversation-aware clustering
- Feedback loop learning
- Multi-file context
- Skill versioning
- Skill analytics
🎯 Success Metrics
Technical Metrics
- Import parse accuracy: >85%
- Embedding similarity: >75%
- Clustering F1-score: >85%
- Regeneration time: <5 min
- Context usage: <150K tokens
User Metrics
- Satisfaction: >8/10
- Ease of use: >8/10
- Trust: >8/10
- Would recommend: >80%
Business Metrics
- GitHub stars: >1000
- Active users: >100
- Community contributions: >10
- Issue response time: <24 hours
Version: 1.0 Status: Research Phase Next: Conduct experiments, fill in findings