firefrost-gaming/skill-seekers-reference

Files

yusyus 2855b59165 chore: Bump version to 2.7.4 for language link fix

This patch release fixes the broken Chinese language selector link
on PyPI by using absolute GitHub URLs instead of relative paths.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-22 00:12:08 +03:00

19 KiB

Raw Permalink Blame History

Skill Seekers Intelligence System - Research Topics

Version: 1.0 Status: 🔬 Research Phase Last Updated: 2026-01-20 Purpose: Areas to research and experiment with before/during implementation

🔬 Research Areas

1. Import Analysis Accuracy

Question: How accurate is AST-based import analysis for finding relevant skills?

Hypothesis: 85-90% accuracy for Python, lower for JavaScript (dynamic imports)

Research Plan:

Dataset: Analyze 10 real-world Python projects
Ground Truth: Manually identify relevant modules for 50 test files
Measure: Precision, recall, F1-score
Iterate: Improve import parser based on results

Test Cases:

# Case 1: Simple import
from fastapi import FastAPI
# Expected: Load fastapi.skill

# Case 2: Relative import
from .models import User
# Expected: Load models.skill

# Case 3: Dynamic import
importlib.import_module("my_module")
# Expected: ??? (hard to detect)

# Case 4: Nested import
from src.api.v1.routes import router
# Expected: Load api.skill

# Case 5: Import with alias
from very_long_name import X as Y
# Expected: Load very_long_name.skill

Success Criteria:

>85% precision (no false positives)
>80% recall (no false negatives)
<100ms parse time per file

Findings: (To be filled during research)

2. Embedding Model Selection

Question: Which embedding model is best for code similarity?

Candidates:

sentence-transformers/all-MiniLM-L6-v2 (80MB, general purpose)
microsoft/codebert-base (500MB, code-specific)
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (420MB, multilingual)
Custom fine-tuned (train on code + docs)

Evaluation Criteria:

Speed: Embedding time per file
Size: Model download size
Accuracy: Similarity to ground truth
Resource: RAM/CPU usage

Benchmark Plan:

# Dataset: 100 Python files + 20 skills
# For each file:
#   1. Manual: Which skills are relevant? (ground truth)
#   2. Each model: Rank skills by similarity
#   3. Measure: Precision@5, Recall@5, MRR

models = [
    "all-MiniLM-L6-v2",
    "codebert-base",
    "paraphrase-multilingual",
]

results = {}

for model in models:
    results[model] = benchmark(model, dataset)

# Compare
print(results)

Expected Results:

Model	Speed	Size	Accuracy	RAM	Winner?
all-MiniLM-L6-v2	50ms	80MB	75%	200MB	✅ Best balance
codebert-base	200ms	500MB	85%	1GB	Too slow/large
paraphrase-multi	100ms	420MB	78%	500MB	Middle ground

Success Criteria:

<100ms embedding time
<200MB model size
>75% accuracy (better than random)

Findings: (To be filled during research)

3. Skill Granularity

Question: How fine-grained should skills be?

Options:

Coarse: One skill per 1000+ LOC (e.g., entire backend)
Medium: One skill per 200-500 LOC (e.g., api, auth, models)
Fine: One skill per 50-100 LOC (e.g., each endpoint)

Trade-offs:

Granularity	Skills	Skill Size	Context Usage	Accuracy
Coarse	3-5	500 lines	Low	Low (too broad)
Medium	10-15	200 lines	Medium	✅ Good
Fine	50+	50 lines	High	Too specific

Experiment:

Generate skills at all 3 granularities for skill-seekers
Use each set for 1 week of development
Measure: usefulness (subjective), context overflow (objective)

Success Criteria:

Skills feel "right-sized" (not too broad, not too narrow)
<5 skills needed for typical task
Skills don't overflow context (< 10K tokens total)

Findings: (To be filled during research)

4. Clustering Strategy Performance

Question: Which clustering strategy is best?

Strategies:

Import-only: Fast, deterministic
Embedding-only: Flexible, catches semantics
Hybrid (70/30): Best of both
Hybrid (50/50): Equal weight
Hybrid with learning: Adjust weights based on feedback

Evaluation:

# Dataset: 50 files with manually labeled relevant skills

strategies = {
    "import_only": ImportBasedEngine(),
    "embedding_only": EmbeddingBasedEngine(),
    "hybrid_70_30": HybridEngine(0.7, 0.3),
    "hybrid_50_50": HybridEngine(0.5, 0.5),
}

for name, engine in strategies.items():
    scores = evaluate(engine, dataset)
    print(f"{name}: Precision={scores.precision}, Recall={scores.recall}")

Expected Results:

Strategy	Precision	Recall	F1	Speed	Winner?
Import-only	90%	75%	82%	50ms	Fast, precise
Embedding-only	75%	85%	80%	100ms	Flexible
Hybrid 70/30	88%	82%	85%	80ms	✅ Best balance
Hybrid 50/50	85%	85%	85%	80ms	Equal weight

Success Criteria:

Hybrid beats both individual strategies
<100ms clustering time
>85% F1-score

Findings: (To be filled during research)

5. Git Hook Performance

Question: How long does skill regeneration take?

Variables:

Codebase size (100, 500, 1000, 5000 files)
Analysis depth (surface, deep, full)
Incremental vs full regeneration

Benchmark:

# Test on real projects
projects = [
    ("skill-seekers", 140, "Python"),
    ("fastapi", 500, "Python"),
    ("react", 1000, "JavaScript"),
    ("vscode", 5000, "TypeScript"),
]

for name, files, lang in projects:
    # Full regeneration
    time_full = time_regeneration(name, incremental=False)

    # Incremental (10% changed)
    time_incr = time_regeneration(name, incremental=True, changed_ratio=0.1)

    print(f"{name}: Full={time_full}s, Incremental={time_incr}s")

Expected Results:

Project	Files	Full	Incremental	Acceptable?
skill-seekers	140	3 min	30 sec	✅ Yes
fastapi	500	8 min	1 min	✅ Yes
react	1000	15 min	2 min	⚠️ Borderline
vscode	5000	60 min	10 min	❌ Too slow

Optimizations if too slow:

Parallel analysis (multiprocessing)
Smarter incremental (only changed modules)
Background daemon (non-blocking)

Success Criteria:

<5 min for typical project (500 files)
<2 min for incremental update
Can run in background without blocking

Findings: (To be filled during research)

6. Context Window Management

Question: How to handle context overflow with large skills?

Problem: Claude has 200K context, but large projects generate huge skills

Solutions:

Skill Summarization: Compress skills (API signatures only, no examples)
Dynamic Loading: Load skill sections on-demand
Skill Splitting: Further split large skills into sub-skills
Priority System: Load most important skills first

Experiment:

# Generate skills for large project (5000 files)
# Measure context usage

skills = generate_skills("large-project")
total_tokens = sum(count_tokens(s) for s in skills)

print(f"Total tokens: {total_tokens}")
print(f"Context budget: 200,000")
print(f"Remaining: {200_000 - total_tokens}")

if total_tokens > 150_000:  # Leave room for conversation
    print("WARNING: Context overflow!")
    # Try solutions
    compressed = compress_skills(skills)
    print(f"After compression: {count_tokens(compressed)}")

Success Criteria:

Skills fit in context (< 150K tokens)
Quality doesn't degrade significantly
User has control (can choose which skills to load)

Findings: (To be filled during research)

7. Multi-Language Support

Question: How well does the system work for non-Python languages?

Languages to Support:

Python (primary, best support)
JavaScript/TypeScript (common frontend)
Go (backend microservices)
Rust (systems programming)
Java (enterprise)

Challenges:

Import syntax varies (import vs require vs use)
Module systems differ (CommonJS, ESM, Go modules)
Embedding accuracy may vary

Research Plan:

Implement import parsers for each language
Test on real projects
Measure accuracy vs Python baseline

Expected Results:

Language	Import Parse	Embedding	Overall	Support?
Python	90%	85%	88%	✅ Excellent
JavaScript	80%	85%	83%	✅ Good
TypeScript	85%	85%	85%	✅ Good
Go	75%	80%	78%	⚠️ Acceptable
Rust	70%	80%	75%	⚠️ Acceptable
Java	65%	80%	73%	⚠️ Basic

Success Criteria:

Python: >85% accuracy (primary focus)
JS/TS: >80% accuracy (important)
Others: >70% accuracy (nice to have)

Findings: (To be filled during research)

8. Library Skill Quality

Question: How good are auto-generated library skills vs handcrafted?

Experiment:

Generate library skills for popular frameworks:
- FastAPI (from docs)
- React (from docs)
- PostgreSQL (from docs)
Compare to handcrafted skills (manually written)
Measure: completeness, accuracy, usefulness

Evaluation Criteria:

Completeness: Does it cover all key APIs?
Accuracy: Is information correct?
Usefulness: Do developers find it helpful?
Freshness: Is it up-to-date?

Test Plan:

# For each framework:
#   1. Auto-generate skill
#   2. Handcraft skill (1 hour of work)
#   3. A/B test with 5 developers
#   4. Measure: time to complete task, satisfaction

frameworks = ["FastAPI", "React", "PostgreSQL"]

for framework in frameworks:
    auto_skill = generate_skill(framework)
    hand_skill = handcraft_skill(framework)

    results = ab_test(auto_skill, hand_skill, n_users=5)

    print(f"{framework}:")
    print(f"  Auto: {results.auto_score}/10")
    print(f"  Hand: {results.hand_score}/10")

Expected Results:

Framework	Auto	Hand	Difference	Acceptable?
FastAPI	7/10	9/10	-2	✅ Close enough
React	6/10	9/10	-3	⚠️ Needs work
PostgreSQL	5/10	9/10	-4	❌ Too far

Optimization:

If auto-generated is <7/10, use handcrafted
Offer both: curated (handcrafted) + auto-generated
Community contributions for popular frameworks

Success Criteria:

Auto-generated is >7/10 quality
Users find library skills helpful
Skills stay up-to-date (auto-regenerate)

Findings: (To be filled during research)

9. Skill Update Frequency

Question: How often do skills need updating?

Variables:

Codebase churn rate (commits/day)
Trigger: every commit vs every merge vs weekly
Impact: staleness vs performance

Experiment:

# Track a real project for 1 month
# Measure:
#   - How often code changes affect skills
#   - How stale skills get if not updated
#   - User tolerance for staleness

project = "skill-seekers"
duration = "30 days"

events = track_changes(project, duration)

print(f"Total commits: {events.commits}")
print(f"Skill-affecting changes: {events.skill_changes}")
print(f"Ratio: {events.skill_changes / events.commits}")

# Test different update frequencies
frequencies = ["every-commit", "every-merge", "daily", "weekly"]

for freq in frequencies:
    staleness = measure_staleness(freq)
    perf_cost = measure_performance_cost(freq)

    print(f"{freq}: Staleness={staleness}, Cost={perf_cost}")

Expected Results:

Frequency	Staleness	Perf Cost	CPU Usage	Acceptable?
Every commit	0%	High	50%+	❌ Too much
Every merge	5%	Medium	10%	✅ Good
Daily	15%	Low	2%	✅ Good
Weekly	40%	Very low	<1%	⚠️ Too stale

Recommendation: Update on merge to watched branches (main, dev)

Success Criteria:

Skills <10% stale
Performance overhead <10% CPU
User doesn't notice staleness

Findings: (To be filled during research)

10. Plugin Integration Patterns

Question: What's the best way to integrate with Claude Code?

Options:

File Hooks: React to file open/save events
Command Palette: User manually loads skills
Automatic: Always load best skills
Hybrid: Auto-load + manual override

User Experience Testing:

# Test with 5 developers for 1 week each

patterns = [
    "file_hooks",      # Auto-load on file open
    "command_palette", # Manual: Cmd+Shift+P -> "Load Skills"
    "automatic",       # Always load, no user action
    "hybrid",          # Auto + manual override
]

for pattern in patterns:
    feedback = test_with_users(pattern, n_users=5, days=7)

    print(f"{pattern}:")
    print(f"  Ease of use: {feedback.ease}/10")
    print(f"  Control: {feedback.control}/10")
    print(f"  Satisfaction: {feedback.satisfaction}/10")

Expected Results:

Pattern	Ease	Control	Satisfaction	Winner?
File Hooks	9/10	7/10	8/10	✅ Automatic
Command Palette	6/10	10/10	7/10	Power users
Automatic	10/10	5/10	7/10	Too magic
Hybrid	9/10	9/10	9/10	✅✅ Best

Recommendation: Hybrid approach

Auto-load on file open (convenience)
Show notification (transparency)
Allow manual override (control)

Success Criteria:

Users don't think about it (automatic)
Users can control it (override)
Users trust it (transparent)

Findings: (To be filled during research)

🧪 Experimental Ideas

Idea 1: Conversation-Aware Clustering

Concept: Use chat history to improve skill clustering

Algorithm:

def find_relevant_skills_with_context(
    current_file: Path,
    conversation_history: list[str]
) -> list[Path]:
    # Extract topics from recent messages
    topics = extract_topics(conversation_history[-10:])
    # Examples: "authentication", "database", "API endpoints"

    # Find skills matching these topics
    topic_skills = find_skills_by_topic(topics)

    # Combine with file-based clustering
    file_skills = find_relevant_skills(current_file)

    # Merge with weighted ranking
    return merge(topic_skills, file_skills, weights=[0.3, 0.7])

Example:

User: "How do I add authentication to the API?"
Claude: [loads auth.skill, api.skill]

User: "Now show me the database models"
Claude: [keeps auth.skill (context), adds models.skill]

User: "How do I test this?"
Claude: [adds tests.skill, keeps auth.skill, models.skill]

Potential: High (conversation context is valuable) Complexity: Medium (need to parse conversation) Risk: Low (can fail gracefully)

Idea 2: Feedback Loop Learning

Concept: Learn from user corrections to improve clustering

Algorithm:

class FeedbackLearner:
    def __init__(self):
        self.history = []  # (file, loaded_skills, user_feedback)

    def record_feedback(self, file: Path, loaded: list, feedback: str):
        """
        feedback: "skill X was not helpful" or "missing skill Y"
        """
        self.history.append({
            "file": file,
            "loaded": loaded,
            "feedback": feedback,
            "timestamp": now()
        })

    def adjust_weights(self):
        """
        Learn from feedback to adjust clustering weights
        """
        # If skill X frequently marked "not helpful" for files in dir Y:
        #   → Reduce X's weight for Y

        # If skill Y frequently requested for files in dir Z:
        #   → Increase Y's weight for Z

        # Update clustering engine weights
        self.clustering_engine.update_weights(learned_weights)

Potential: Very High (personalized to user) Complexity: High (ML/learning system) Risk: Medium (could learn wrong patterns)

Idea 3: Multi-File Context

Concept: Load skills for all open files, not just current

Algorithm:

def find_relevant_skills_multi_file(
    open_files: list[Path]
) -> list[Path]:
    all_skills = set()

    for file in open_files:
        skills = find_relevant_skills(file)
        all_skills.update(skills)

    # Rank by frequency across files
    ranked = rank_by_frequency(all_skills)

    return ranked[:10]  # Top 10 (more files = more skills needed)

Example:

Open tabs:
  - src/api/users.py
  - src/models/user.py
  - src/auth/jwt.py

Loaded skills:
  - api.skill (from users.py)
  - models.skill (from user.py)
  - auth.skill (from jwt.py)
  - fastapi.skill (common across all)

Potential: High (developers work on multiple files) Complexity: Low (just aggregate) Risk: Low (might load too many skills)

Idea 4: Skill Versioning

Concept: Track skill changes over time, allow rollback

Implementation:

.skill-seekers/skills/
├── codebase/
│   └── api.skill
│
└── versions/
    └── api/
        ├── api.skill.2026-01-20-v1
        ├── api.skill.2026-01-19-v1
        └── api.skill.2026-01-15-v1

Commands:

# View skill history
skill-seekers skill-history api.skill

# Diff versions
skill-seekers skill-diff api.skill --from 2026-01-15 --to 2026-01-20

# Rollback
skill-seekers skill-rollback api.skill --to 2026-01-19

Potential: Medium (useful for debugging) Complexity: Low (just file copies) Risk: Low (storage cost)

Idea 5: Skill Analytics

Concept: Track which skills are most useful

Metrics:

Load frequency (how often loaded)
Dwell time (how long in context)
User rating (thumbs up/down)
Task completion (helped solve problem?)

Dashboard:

Skill Analytics
===============

Most Loaded:
  1. api.skill (45 times)
  2. models.skill (38 times)
  3. fastapi.skill (32 times)

Most Helpful (by rating):
  1. api.skill (4.8/5.0)
  2. auth.skill (4.5/5.0)
  3. tests.skill (4.2/5.0)

Least Helpful:
  1. deprecated.skill (2.1/5.0) ← Maybe remove?

Potential: Medium (helps improve system) Complexity: Medium (tracking infrastructure) Risk: Low (privacy concerns if shared)

📊 Research Checklist

Phase 0: Before Implementation

Import analysis accuracy (Research #1)
Embedding model selection (Research #2)
Skill granularity (Research #3)
Git hook performance (Research #5)

Phase 1-3: During Implementation

Clustering strategy (Research #4)
Multi-language support (Research #7)
Skill update frequency (Research #9)

Phase 4-5: Advanced Features

Context window management (Research #6)
Library skill quality (Research #8)
Plugin integration (Research #10)

Experimental (Optional)

Conversation-aware clustering
Feedback loop learning
Multi-file context
Skill versioning
Skill analytics

🎯 Success Metrics

Technical Metrics

Import parse accuracy: >85%
Embedding similarity: >75%
Clustering F1-score: >85%
Regeneration time: <5 min
Context usage: <150K tokens

User Metrics

Satisfaction: >8/10
Ease of use: >8/10
Trust: >8/10
Would recommend: >80%

Business Metrics

GitHub stars: >1000
Active users: >100
Community contributions: >10
Issue response time: <24 hours

Version: 1.0 Status: Research Phase Next: Conduct experiments, fill in findings

19 KiB Raw Permalink Blame History

Skill Seekers Intelligence System - Research Topics

🔬 Research Areas

1. Import Analysis Accuracy

2. Embedding Model Selection

3. Skill Granularity

4. Clustering Strategy Performance

5. Git Hook Performance

6. Context Window Management

7. Multi-Language Support

8. Library Skill Quality

9. Skill Update Frequency

10. Plugin Integration Patterns

🧪 Experimental Ideas

Idea 1: Conversation-Aware Clustering

Idea 2: Feedback Loop Learning

Idea 3: Multi-File Context

Idea 4: Skill Versioning

Idea 5: Skill Analytics

📊 Research Checklist

Phase 0: Before Implementation

Phase 1-3: During Implementation

Phase 4-5: Advanced Features

Experimental (Optional)

🎯 Success Metrics

Technical Metrics

User Metrics

Business Metrics

19 KiB

Raw Permalink Blame History