# Skill Seekers Intelligence System - Technical Architecture
**Version:** 1.0 (Draft)
**Status:** 🔬 Research & Design
**Last Updated:** 2026-01-20
**For:** Study and iteration before implementation
---
## 🎯 System Overview
The **Skill Seekers Intelligence System** is a multi-layered architecture that automatically generates, updates, and intelligently loads codebase knowledge into Claude Code's context.
**Core Principles:**
1. **Git-Based Triggers:** Only update on branch merges (not constant watching)
2. **Modular Skills:** Separate libraries from codebase, split codebase into modules
3. **Smart Clustering:** Load only relevant skills based on context
4. **User Control:** Config-driven, user has final say
---
## 🏗️ Architecture Layers
```
┌─────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ CLI Commands Claude Code Plugin Config Files │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ORCHESTRATION LAYER │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ • Project Manager │ │
│ │ • Skill Registry │ │
│ │ • Update Scheduler │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SKILL GENERATION LAYER │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Tech Stack │ │ Modular Codebase │ │
│ │ Detector │ │ Analyzer │ │
│ └────────────────────┘ └────────────────────┘ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Library Skill │ │ Git Change │ │
│ │ Downloader │ │ Detector │ │
│ └────────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ CLUSTERING LAYER │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Import-Based │ │ Embedding-Based │ │
│ │ Clustering │ │ Clustering │ │
│ │ (Phase 1) │ │ (Phase 2) │ │
│ └────────────────────┘ └────────────────────┘ │
│ ┌────────────────────┐ │
│ │ Hybrid Clustering │ │
│ │ (Combines both) │ │
│ └────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ • Skill Files (.skill-seekers/skills/) │ │
│ │ • Embeddings Cache (.skill-seekers/cache/) │ │
│ │ • Metadata (.skill-seekers/registry.json) │ │
│ │ • Git Hooks (.skill-seekers/hooks/) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
---
## 📂 File System Structure
```
project-root/
├── .skill-seekers/ # Intelligence system directory
│ ├── config.yml # User configuration
│ │
│ ├── skills/ # Generated skills
│ │ ├── libraries/ # External library skills
│ │ │ ├── fastapi.skill
│ │ │ ├── react.skill
│ │ │ └── postgresql.skill
│ │ │
│ │ └── codebase/ # Project-specific skills
│ │ ├── backend/
│ │ │ ├── api.skill
│ │ │ ├── auth.skill
│ │ │ └── models.skill
│ │ │
│ │ └── frontend/
│ │ ├── components.skill
│ │ └── pages.skill
│ │
│ ├── cache/ # Performance caches
│ │ ├── embeddings/ # Skill embeddings
│ │ │ ├── fastapi.npy
│ │ │ ├── api.npy
│ │ │ └── ...
│ │ │
│ │ └── metadata/ # Cached metadata
│ │ └── skill-registry.json
│ │
│ ├── hooks/ # Git hooks
│ │ ├── post-merge # Auto-regenerate on merge
│ │ ├── post-commit # Optional
│ │ └── pre-push # Optional validation
│ │
│ ├── logs/ # System logs
│ │ ├── regeneration.log
│ │ └── clustering.log
│ │
│ └── registry.json # Skill registry metadata
├── .git/ # Git repository
└── ... (project files)
```
---
## ⚙️ Component Details
### 1. Project Manager
**Responsibility:** Initialize and manage project intelligence
```python
# src/skill_seekers/intelligence/project_manager.py
from pathlib import Path

class ProjectManager:
    """Manages project intelligence system lifecycle"""

    def __init__(self, project_root: Path):
        self.root = project_root
        self.config_path = project_root / ".skill-seekers" / "config.yml"
        self.skills_dir = project_root / ".skill-seekers" / "skills"

    def initialize(self) -> bool:
        """
        Initialize project for intelligence system

        Creates directory structure, config, git hooks
        """
        # 1. Create directory structure
        self._create_directories()

        # 2. Generate default config
        config = self._generate_default_config()
        self._save_config(config)

        # 3. Install git hooks
        self._install_git_hooks()

        # 4. Initial skill generation
        self._initial_skill_generation()

        return True

    def _create_directories(self):
        """Create .skill-seekers directory structure"""
        dirs = [
            ".skill-seekers",
            ".skill-seekers/skills",
            ".skill-seekers/skills/libraries",
            ".skill-seekers/skills/codebase",
            ".skill-seekers/cache",
            ".skill-seekers/cache/embeddings",
            ".skill-seekers/cache/metadata",
            ".skill-seekers/hooks",
            ".skill-seekers/logs",
        ]
        for d in dirs:
            (self.root / d).mkdir(parents=True, exist_ok=True)

    def _generate_default_config(self) -> dict:
        """Generate sensible default configuration"""
        return {
            "version": "1.0",
            "project_name": self.root.name,
            "watch_branches": ["main", "development"],
            "tech_stack": {
                "auto_detect": True,
                "frameworks": []
            },
            "skill_generation": {
                "enabled": True,
                "output_dir": ".skill-seekers/skills/codebase"
            },
            "git_hooks": {
                "enabled": True,
                "trigger_on": ["post-merge"]
            },
            "clustering": {
                "enabled": False,  # Phase 4+
                "strategy": "import",  # import, embedding, hybrid
                "max_skills_in_context": 5
            }
        }

    def _install_git_hooks(self):
        """Install git hooks for auto-regeneration"""
        hook_template = """#!/bin/bash
# Auto-generated by skill-seekers
# DO NOT EDIT - regenerate with: skill-seekers init-project

CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
CONFIG_FILE=".skill-seekers/config.yml"

if [ ! -f "$CONFIG_FILE" ]; then
    exit 0
fi

# Read watched branches from config
WATCH_BRANCHES=$(yq '.watch_branches[]' "$CONFIG_FILE" 2>/dev/null || echo "")

if echo "$WATCH_BRANCHES" | grep -q "^$CURRENT_BRANCH$"; then
    echo "🔄 Skill regeneration triggered on branch: $CURRENT_BRANCH"
    skill-seekers regenerate-skills --branch "$CURRENT_BRANCH" --silent
    echo "✅ Skills updated"
fi
"""
        hook_path = self.root / ".git" / "hooks" / "post-merge"
        hook_path.write_text(hook_template)
        hook_path.chmod(0o755)  # Make executable
```
---
### 2. Tech Stack Detector
**Responsibility:** Detect frameworks and libraries from project files
```python
# src/skill_seekers/intelligence/stack_detector.py
from pathlib import Path
from typing import Dict, List
import json
import yaml
import toml
class TechStackDetector:
    """
    Detect tech stack from project configuration files

    Supports: Python, JavaScript/TypeScript, Go, Rust, Java
    """

    # Display names for detected languages ("javascript".title() would
    # yield "Javascript" and break the membership checks in detect())
    LANGUAGE_NAMES = {
        "python": "Python",
        "javascript": "JavaScript",
        "typescript": "TypeScript",
        "go": "Go",
        "rust": "Rust",
        "java": "Java",
    }

    # Shared by requirements.txt and pyproject.toml parsing
    PYTHON_FRAMEWORK_MAP = {
        "fastapi": "FastAPI",
        "django": "Django",
        "flask": "Flask",
        "sqlalchemy": "SQLAlchemy",
        "pydantic": "Pydantic",
        "anthropic": "Anthropic",
        "openai": "OpenAI",
        "beautifulsoup4": "BeautifulSoup",
        "requests": "Requests",
        "httpx": "HTTPX",
        "aiohttp": "aiohttp",
    }

    def __init__(self, project_root: Path):
        self.root = project_root
        self.detectors = {
            "python": self._detect_python,
            "javascript": self._detect_javascript,
            "typescript": self._detect_typescript,
            "go": self._detect_go,
            "rust": self._detect_rust,
            "java": self._detect_java,
        }

    def detect(self) -> Dict[str, List[str]]:
        """
        Detect complete tech stack

        Returns:
            {
                "languages": ["Python", "JavaScript"],
                "frameworks": ["FastAPI", "React"],
                "databases": ["PostgreSQL"],
                "tools": ["Docker", "Redis"]
            }
        """
        stack = {
            "languages": [],
            "frameworks": [],
            "databases": [],
            "tools": []
        }

        # Detect languages
        for lang, detector in self.detectors.items():
            if detector():
                stack["languages"].append(self.LANGUAGE_NAMES[lang])

        # Detect frameworks (per language)
        if "Python" in stack["languages"]:
            stack["frameworks"].extend(self._detect_python_frameworks())
        if "JavaScript" in stack["languages"] or "TypeScript" in stack["languages"]:
            stack["frameworks"].extend(self._detect_js_frameworks())

        # Detect databases
        stack["databases"].extend(self._detect_databases())

        # Detect tools
        stack["tools"].extend(self._detect_tools())

        return stack

    def _detect_python(self) -> bool:
        """Detect Python project"""
        markers = [
            "requirements.txt",
            "setup.py",
            "pyproject.toml",
            "Pipfile",
            "poetry.lock"
        ]
        return any((self.root / marker).exists() for marker in markers)

    def _detect_python_frameworks(self) -> List[str]:
        """Detect Python frameworks"""
        frameworks = []

        # Parse requirements.txt
        req_file = self.root / "requirements.txt"
        if req_file.exists():
            deps = req_file.read_text().lower()
            for key, name in self.PYTHON_FRAMEWORK_MAP.items():
                if key in deps:
                    frameworks.append(name)

        # Parse pyproject.toml
        pyproject = self.root / "pyproject.toml"
        if pyproject.exists():
            try:
                data = toml.loads(pyproject.read_text())
                deps = data.get("project", {}).get("dependencies", [])
                deps_str = " ".join(deps).lower()
                for key, name in self.PYTHON_FRAMEWORK_MAP.items():
                    if key in deps_str and name not in frameworks:
                        frameworks.append(name)
            except (toml.TomlDecodeError, OSError):
                pass

        return frameworks

    def _detect_javascript(self) -> bool:
        """Detect JavaScript project"""
        return (self.root / "package.json").exists()

    def _detect_typescript(self) -> bool:
        """Detect TypeScript project"""
        markers = ["tsconfig.json", "package.json"]
        if not all((self.root / m).exists() for m in markers):
            return False
        # Check if typescript is in dependencies
        pkg = self.root / "package.json"
        try:
            data = json.loads(pkg.read_text())
            deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}
            return "typescript" in deps
        except (json.JSONDecodeError, OSError):
            return False

    def _detect_js_frameworks(self) -> List[str]:
        """Detect JavaScript/TypeScript frameworks"""
        frameworks = []
        pkg = self.root / "package.json"
        if not pkg.exists():
            return frameworks
        try:
            data = json.loads(pkg.read_text())
            deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}
            framework_map = {
                "react": "React",
                "vue": "Vue",
                "next": "Next.js",
                "nuxt": "Nuxt.js",
                "svelte": "Svelte",
                "angular": "Angular",
                "express": "Express",
                "fastify": "Fastify",
                "nestjs": "NestJS",
            }
            for key, name in framework_map.items():
                if key in deps:
                    frameworks.append(name)
        except (json.JSONDecodeError, OSError):
            pass
        return frameworks

    def _detect_databases(self) -> List[str]:
        """Detect databases from environment and configs"""
        databases = []

        # Check .env file
        env_file = self.root / ".env"
        if env_file.exists():
            env_content = env_file.read_text().lower()
            db_markers = {
                "postgres": "PostgreSQL",
                "mysql": "MySQL",
                "mongodb": "MongoDB",
                "redis": "Redis",
                "sqlite": "SQLite",
            }
            for marker, name in db_markers.items():
                if marker in env_content:
                    databases.append(name)

        # Check docker-compose.yml
        compose = self.root / "docker-compose.yml"
        if compose.exists():
            try:
                data = yaml.safe_load(compose.read_text())
                services = data.get("services", {})
                for service_name, config in services.items():
                    image = config.get("image", "").lower()
                    db_images = {
                        "postgres": "PostgreSQL",
                        "mysql": "MySQL",
                        "mongo": "MongoDB",
                        "redis": "Redis",
                    }
                    for marker, name in db_images.items():
                        if marker in image and name not in databases:
                            databases.append(name)
            except (yaml.YAMLError, OSError, AttributeError):
                pass

        return databases

    def _detect_tools(self) -> List[str]:
        """Detect development tools"""
        tools = []
        tool_markers = {
            "Dockerfile": "Docker",
            "docker-compose.yml": "Docker Compose",
            ".github/workflows": "GitHub Actions",
            "Makefile": "Make",
            "nginx.conf": "Nginx",
        }
        for marker, name in tool_markers.items():
            if (self.root / marker).exists():
                tools.append(name)
        return tools

    def _detect_go(self) -> bool:
        return (self.root / "go.mod").exists()

    def _detect_rust(self) -> bool:
        return (self.root / "Cargo.toml").exists()

    def _detect_java(self) -> bool:
        markers = ["pom.xml", "build.gradle", "build.gradle.kts"]
        return any((self.root / m).exists() for m in markers)
```
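The requirements.txt path boils down to case-insensitive substring matching against a known framework map. A standalone sketch (the helper name and the reduced map below are illustrative, not part of the detector):

```python
# Sketch of the substring matching used by _detect_python_frameworks,
# reduced to a pure function over the file's text.
FRAMEWORK_MAP = {
    "fastapi": "FastAPI",
    "flask": "Flask",
    "pydantic": "Pydantic",
}

def detect_frameworks(requirements_text: str) -> list:
    """Return display names of frameworks mentioned in a requirements.txt body."""
    deps = requirements_text.lower()
    return [name for key, name in FRAMEWORK_MAP.items() if key in deps]

print(detect_frameworks("fastapi==0.110.0\npydantic>=2.0"))
# → ['FastAPI', 'Pydantic']
```

Note that substring matching is deliberately loose: `"requests"` would also match a pinned `types-requests`, which is an accepted trade-off for a zero-dependency detector.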
---
### 3. Modular Skill Generator
**Responsibility:** Split codebase into modular skills based on config
```python
# src/skill_seekers/intelligence/modular_generator.py
from pathlib import Path
from typing import List, Dict
import glob
class ModularSkillGenerator:
    """
    Generate modular skills from codebase

    Splits based on: namespace, directory, feature, or custom
    """

    def __init__(self, project_root: Path, config: dict):
        self.root = project_root
        self.config = config
        self.modules = config.get("modules", {})

    def generate_all(self) -> List[Path]:
        """Generate all modular skills"""
        generated_skills = []
        for module_name, module_config in self.modules.items():
            skills = self.generate_module(module_name, module_config)
            generated_skills.extend(skills)
        return generated_skills

    def generate_module(self, module_name: str, module_config: dict) -> List[Path]:
        """
        Generate skills for a single module

        module_config = {
            "path": "src/api/",
            "split_by": "namespace",  # or directory, feature, custom
            "skills": [
                {
                    "name": "api",
                    "description": "API endpoints",
                    "include": ["*/routes/*.py"],
                    "exclude": ["*_test.py"]
                }
            ]
        }
        """
        skills = []
        for skill_config in module_config.get("skills", []):
            skill_path = self._generate_skill(module_name, skill_config)
            skills.append(skill_path)
        return skills

    def _generate_skill(self, module_name: str, skill_config: dict) -> Path:
        """Generate a single skill file"""
        skill_name = skill_config["name"]
        include_patterns = skill_config.get("include", [])
        exclude_patterns = skill_config.get("exclude", [])

        # 1. Find files matching patterns
        files = self._find_files(include_patterns, exclude_patterns)

        # 2. Run codebase analysis on these files
        #    (Reuse existing C3.x codebase_scraper.py)
        from skill_seekers.cli.codebase_scraper import analyze_codebase
        analysis_result = analyze_codebase(
            files=files,
            project_root=self.root,
            depth="deep",
            ai_mode="none"
        )

        # 3. Generate SKILL.md
        skill_content = self._format_skill(
            name=skill_name,
            description=skill_config.get("description", ""),
            analysis=analysis_result
        )

        # 4. Save skill file
        output_dir = self.root / ".skill-seekers" / "skills" / "codebase" / module_name
        output_dir.mkdir(parents=True, exist_ok=True)
        skill_path = output_dir / f"{skill_name}.skill"
        skill_path.write_text(skill_content)
        return skill_path

    def _find_files(self, include: List[str], exclude: List[str]) -> List[Path]:
        """Find files matching include/exclude patterns"""
        files = set()
        # Include patterns
        for pattern in include:
            matched = glob.glob(str(self.root / pattern), recursive=True)
            files.update(Path(f) for f in matched)
        # Exclude patterns
        for pattern in exclude:
            matched = glob.glob(str(self.root / pattern), recursive=True)
            files.difference_update(Path(f) for f in matched)
        return sorted(files)

    def _format_skill(self, name: str, description: str, analysis: dict) -> str:
        """Format analysis results into SKILL.md"""
        return f"""---
name: {name}
description: {description}
module: codebase
---

# {name.title()}

## Description
{description}

## API Reference
{analysis.get('api_reference', '')}

## Design Patterns
{analysis.get('patterns', '')}

## Examples
{analysis.get('examples', '')}

## Related Skills
{self._generate_cross_references(name)}
"""

    def _generate_cross_references(self, skill_name: str) -> str:
        """Generate cross-references to related skills"""
        # Analyze imports to find dependencies
        # Link to other skills that this skill imports from
        return "- Related skill 1\n- Related skill 2"
```
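The include/exclude semantics of `_find_files` can be seen in miniature below (the directory layout is invented for illustration). One subtlety: because patterns are joined onto the project root, an exclude like `*_test.py` must be written as a recursive glob (`**/*_test.py`) to reach files in subdirectories:

```python
from pathlib import Path
import glob
import tempfile

# Build a throwaway tree: one route module and one test file
root = Path(tempfile.mkdtemp())
(root / "routes").mkdir()
(root / "routes" / "users.py").touch()
(root / "routes" / "users_test.py").touch()

include = ["routes/*.py"]
exclude = ["**/*_test.py"]  # recursive form, so nested tests are caught

files = set()
for pattern in include:
    files.update(Path(f) for f in glob.glob(str(root / pattern), recursive=True))
for pattern in exclude:
    files.difference_update(Path(f) for f in glob.glob(str(root / pattern), recursive=True))

print(sorted(f.name for f in files))
# → ['users.py']
```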
---
### 4. Import-Based Clustering Engine
**Responsibility:** Find relevant skills based on import analysis
```python
# src/skill_seekers/intelligence/import_clustering.py
from pathlib import Path
from typing import List, Set
import ast
class ImportBasedClusteringEngine:
    """
    Find relevant skills by analyzing imports in current file

    Fast and deterministic - no AI needed
    """

    def __init__(self, skills_dir: Path, project_root: Path | None = None):
        self.skills_dir = skills_dir
        # .skill-seekers/skills/ sits two levels below the project root
        self.project_root = project_root or skills_dir.parents[1]
        self.skill_registry = self._build_registry()

    def _build_registry(self) -> dict:
        """
        Build registry mapping imports to skills

        Returns:
            {
                "fastapi": ["libraries/fastapi.skill"],
                "anthropic": ["libraries/anthropic.skill"],
                "src.api": ["codebase/backend/api.skill"],
                "src.auth": ["codebase/backend/auth.skill"],
            }
        """
        registry = {}
        # Scan all skills and extract what they provide
        for skill_path in self.skills_dir.rglob("*.skill"):
            # Parse skill metadata (YAML frontmatter)
            provides = self._extract_provides(skill_path)
            for module in provides:
                registry.setdefault(module, []).append(skill_path)
        return registry

    def _extract_provides(self, skill_path: Path) -> Set[str]:
        """Read the 'provides:' list from a skill's YAML frontmatter"""
        provides = set()
        in_provides = False
        for line in skill_path.read_text().splitlines():
            if line.strip() == "provides:":
                in_provides = True
            elif in_provides and line.strip().startswith("- "):
                provides.add(line.strip()[2:])
            elif in_provides:
                break
        return provides

    def find_relevant_skills(
        self,
        current_file: Path,
        max_skills: int = 5
    ) -> List[Path]:
        """
        Find most relevant skills for current file

        Algorithm:
        1. Parse imports from current file
        2. Map imports to skills via registry
        3. Add current file's skill (if exists)
        4. Rank and return top N
        """
        # 1. Parse imports
        imports = self._parse_imports(current_file)

        # 2. Map to skills
        relevant_skills = []
        for imp in imports:
            # External library?
            if self._is_external(imp):
                skill = self._find_library_skill(imp)
            # Internal module?
            else:
                skill = self._find_module_skill(imp)
            if skill:
                relevant_skills.append(skill)

        # 3. Add current file's skill (highest priority, at the front)
        current_skill = self._find_skill_for_file(current_file)
        if current_skill:
            relevant_skills.insert(0, current_skill)

        # 4. Rank and return
        return self._rank_skills(relevant_skills)[:max_skills]

    def _parse_imports(self, file_path: Path) -> Set[str]:
        """
        Parse imports from Python file using AST

        Returns: {"fastapi", "anthropic", "src.api", "src.auth"}
        """
        imports = set()
        try:
            tree = ast.parse(file_path.read_text())
            for node in ast.walk(tree):
                # import X
                if isinstance(node, ast.Import):
                    for alias in node.names:
                        imports.add(alias.name)
                # from X import Y
                elif isinstance(node, ast.ImportFrom):
                    if node.module:
                        imports.add(node.module)
        except (SyntaxError, OSError) as e:
            print(f"Warning: Could not parse {file_path}: {e}")
        return imports

    def _is_external(self, import_name: str) -> bool:
        """Check if import is external library or internal module"""
        # External if it does not start with "src", "tests",
        # or the project's own package name
        internal_prefixes = ["src", "tests", self._get_project_name()]
        return not any(import_name.startswith(prefix) for prefix in internal_prefixes)

    def _get_project_name(self) -> str:
        return self.project_root.name

    def _find_library_skill(self, import_name: str) -> Path | None:
        """Find library skill for external import"""
        # Try exact match first
        skill_path = self.skills_dir / "libraries" / f"{import_name}.skill"
        if skill_path.exists():
            return skill_path
        # Try partial match (e.g., "fastapi.routing" -> "fastapi")
        base_module = import_name.split(".")[0]
        skill_path = self.skills_dir / "libraries" / f"{base_module}.skill"
        if skill_path.exists():
            return skill_path
        return None

    def _find_module_skill(self, import_name: str) -> Path | None:
        """Find codebase skill for internal import"""
        # The registry maps each import to a *list* of skills
        matches = self.skill_registry.get(import_name, [])
        return matches[0] if matches else None

    def _find_skill_for_file(self, file_path: Path) -> Path | None:
        """Find which skill contains this file"""
        # Match file path against skill file patterns
        # This requires reading all skill configs
        # For now, simple heuristic: src/api/ -> api.skill
        rel_path = file_path.relative_to(self.project_root)
        if "api" in str(rel_path):
            return self.skills_dir / "codebase" / "backend" / "api.skill"
        elif "auth" in str(rel_path):
            return self.skills_dir / "codebase" / "backend" / "auth.skill"
        # ... etc
        return None

    def _rank_skills(self, skills: List[Path]) -> List[Path]:
        """Rank skills by relevance (for now, just deduplicate)"""
        return list(dict.fromkeys(skills))  # Preserve order, remove dupes
```
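The AST walk in `_parse_imports` is the whole trick, and it works on source text directly (the sample module below is made up):

```python
import ast

# A hypothetical file mixing external libraries and internal modules
source = """
import fastapi
from src.auth import tokens
from anthropic import Anthropic
"""

imports = set()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Import):           # import X
        imports.update(alias.name for alias in node.names)
    elif isinstance(node, ast.ImportFrom) and node.module:  # from X import Y
        imports.add(node.module)

print(sorted(imports))
# → ['anthropic', 'fastapi', 'src.auth']
```

Because this is static analysis, dynamic imports (`importlib.import_module(name)`) are invisible, which is where the 85-90% accuracy figure below comes from.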
---
### 5. Embedding-Based Clustering Engine
**Responsibility:** Find relevant skills using semantic similarity
```python
# src/skill_seekers/intelligence/embedding_clustering.py
from pathlib import Path
from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer
class EmbeddingBasedClusteringEngine:
    """
    Find relevant skills using embeddings and cosine similarity

    More flexible than import-based, but slower
    """

    def __init__(self, skills_dir: Path, cache_dir: Path):
        self.skills_dir = skills_dir
        self.cache_dir = cache_dir
        self.model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, fast
        # Load or generate skill embeddings
        self.skill_embeddings = self._load_skill_embeddings()

    def _load_skill_embeddings(self) -> dict:
        """Load pre-computed skill embeddings from cache"""
        embeddings = {}
        for skill_path in self.skills_dir.rglob("*.skill"):
            cache_path = self.cache_dir / "embeddings" / f"{skill_path.stem}.npy"
            if cache_path.exists():
                # Load from cache
                embeddings[skill_path] = np.load(cache_path)
            else:
                # Generate and cache
                embedding = self._embed_skill(skill_path)
                cache_path.parent.mkdir(parents=True, exist_ok=True)
                np.save(cache_path, embedding)
                embeddings[skill_path] = embedding
        return embeddings

    def _embed_skill(self, skill_path: Path) -> np.ndarray:
        """Generate embedding for a skill"""
        content = skill_path.read_text()
        # Extract key sections (API Reference + Examples)
        api_section = self._extract_section(content, "## API Reference")
        examples_section = self._extract_section(content, "## Examples")
        # Combine and embed
        text = f"{api_section}\n{examples_section}"
        embedding = self.model.encode(text[:5000])  # Limit to 5K chars
        return embedding

    def _embed_file(self, file_path: Path) -> np.ndarray:
        """Generate embedding for current file"""
        content = file_path.read_text()
        # Embed full content (or first N chars for performance)
        embedding = self.model.encode(content[:5000])
        return embedding

    def find_relevant_skills(
        self,
        current_file: Path,
        max_skills: int = 5
    ) -> List[Path]:
        """
        Find most relevant skills using cosine similarity

        Algorithm:
        1. Embed current file
        2. Compute cosine similarity with all skill embeddings
        3. Rank by similarity
        4. Return top N
        """
        # 1. Embed current file
        file_embedding = self._embed_file(current_file)

        # 2. Compute similarities
        similarities = {}
        for skill_path, skill_embedding in self.skill_embeddings.items():
            similarity = self._cosine_similarity(file_embedding, skill_embedding)
            similarities[skill_path] = similarity

        # 3. Rank by similarity
        ranked = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

        # 4. Return top N
        return [skill_path for skill_path, _ in ranked[:max_skills]]

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Compute cosine similarity between two vectors"""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def _extract_section(self, content: str, header: str) -> str:
        """Extract section from markdown content"""
        lines = content.split("\n")
        section_lines = []
        in_section = False
        for line in lines:
            if line.startswith(header):
                in_section = True
                continue
            if in_section:
                if line.startswith("##"):  # Next section
                    break
                section_lines.append(line)
        return "\n".join(section_lines)
```
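The ranking step reduces to cosine similarity over fixed-length vectors. A dependency-free sketch with tiny hand-made 3-dimensional "embeddings" (real model vectors are 384-dimensional; the skill names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|), same formula as _cosine_similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors: the current file "points" in roughly the same direction
# as api.skill and is orthogonal to components.skill
file_vec = [1.0, 0.0, 1.0]
skills = {
    "api.skill": [1.0, 0.0, 0.9],
    "components.skill": [0.0, 1.0, 0.0],
}

ranked = sorted(skills, key=lambda s: cosine_similarity(file_vec, skills[s]), reverse=True)
print(ranked)
# → ['api.skill', 'components.skill']
```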
---
### 6. Hybrid Clustering Engine
**Responsibility:** Combine import-based and embedding-based clustering
```python
# src/skill_seekers/intelligence/hybrid_clustering.py
from pathlib import Path
from typing import List

from skill_seekers.intelligence.import_clustering import ImportBasedClusteringEngine
from skill_seekers.intelligence.embedding_clustering import EmbeddingBasedClusteringEngine

class HybridClusteringEngine:
    """
    Combine import-based (precise) and embedding-based (flexible)
    for best-of-both-worlds clustering
    """

    def __init__(
        self,
        import_engine: ImportBasedClusteringEngine,
        embedding_engine: EmbeddingBasedClusteringEngine,
        import_weight: float = 0.7,
        embedding_weight: float = 0.3
    ):
        self.import_engine = import_engine
        self.embedding_engine = embedding_engine
        self.import_weight = import_weight
        self.embedding_weight = embedding_weight

    def find_relevant_skills(
        self,
        current_file: Path,
        max_skills: int = 5
    ) -> List[Path]:
        """
        Find relevant skills using hybrid approach

        Algorithm:
        1. Get skills from both engines
        2. Combine with weighted ranking
        3. Return top N
        """
        # 1. Get results from both engines
        import_skills = self.import_engine.find_relevant_skills(
            current_file, max_skills=10
        )
        embedding_skills = self.embedding_engine.find_relevant_skills(
            current_file, max_skills=10
        )

        # 2. Weighted ranking
        skill_scores = {}

        # Import-based scores (higher rank = higher score)
        for i, skill in enumerate(import_skills):
            score = (len(import_skills) - i) * self.import_weight
            skill_scores[skill] = skill_scores.get(skill, 0) + score

        # Embedding-based scores
        for i, skill in enumerate(embedding_skills):
            score = (len(embedding_skills) - i) * self.embedding_weight
            skill_scores[skill] = skill_scores.get(skill, 0) + score

        # 3. Sort by combined score
        ranked = sorted(skill_scores.items(), key=lambda x: x[1], reverse=True)

        # 4. Return top N
        return [skill for skill, _ in ranked[:max_skills]]
```
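The weighted ranking is easy to trace by hand. This standalone sketch reproduces step 2 with hypothetical skill names and the default 0.7/0.3 weights:

```python
IMPORT_WEIGHT, EMBEDDING_WEIGHT = 0.7, 0.3

# Hypothetical outputs from the two engines, best match first
import_skills = ["api.skill", "auth.skill", "fastapi.skill"]
embedding_skills = ["auth.skill", "models.skill"]

scores = {}
# Higher list position -> higher score, scaled by the engine's weight
for rank, skill in enumerate(import_skills):
    scores[skill] = scores.get(skill, 0) + (len(import_skills) - rank) * IMPORT_WEIGHT
for rank, skill in enumerate(embedding_skills):
    scores[skill] = scores.get(skill, 0) + (len(embedding_skills) - rank) * EMBEDDING_WEIGHT

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
# → ['api.skill', 'auth.skill', 'fastapi.skill', 'models.skill']
```

Note that `auth.skill` (scored by both engines: 2×0.7 + 2×0.3 = 2.0) nearly overtakes `api.skill` (3×0.7 = 2.1): agreement between engines is rewarded, which is the point of the hybrid.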
---
## 🔌 Claude Code Plugin Integration
```python
# claude_plugins/skill-seekers-intelligence/agent.py
from pathlib import Path
from typing import List

class SkillSeekersIntelligenceAgent:
    """
    Claude Code plugin for skill intelligence

    Handles file open events, loads relevant skills
    """

    def __init__(self):
        self.project_root = self._detect_project_root()
        self.skills_dir = self.project_root / ".skill-seekers" / "skills"
        self.cache_dir = self.project_root / ".skill-seekers" / "cache"
        self.config = self._load_config()
        self.clustering_engine = self._init_clustering_engine()
        self.loaded_skills = []

    def _init_clustering_engine(self):
        """Initialize clustering engine based on config"""
        strategy = self.config.get("clustering", {}).get("strategy", "import")
        if strategy == "import":
            return ImportBasedClusteringEngine(self.skills_dir)
        elif strategy == "embedding":
            return EmbeddingBasedClusteringEngine(self.skills_dir, self.cache_dir)
        elif strategy == "hybrid":
            import_engine = ImportBasedClusteringEngine(self.skills_dir)
            embedding_engine = EmbeddingBasedClusteringEngine(
                self.skills_dir, self.cache_dir
            )
            return HybridClusteringEngine(import_engine, embedding_engine)
        raise ValueError(f"Unknown clustering strategy: {strategy}")

    async def on_file_open(self, file_path: str):
        """Hook: User opens a file"""
        file_path = Path(file_path)

        # Find relevant skills
        relevant_skills = self.clustering_engine.find_relevant_skills(
            file_path,
            max_skills=self.config.get("clustering", {}).get("max_skills_in_context", 5)
        )

        # Load skills into Claude context
        await self.load_skills(relevant_skills)

        # Notify user
        self.notify_user(f"📚 Loaded {len(relevant_skills)} skills", relevant_skills)

    async def on_branch_merge(self, branch: str):
        """Hook: Branch merged"""
        if branch in self.config.get("watch_branches", []):
            await self.regenerate_skills(branch)

    async def load_skills(self, skill_paths: List[Path]):
        """Load skills into Claude's context"""
        self.loaded_skills = skill_paths

        # Read skill contents
        skill_contents = []
        for path in skill_paths:
            content = path.read_text()
            skill_contents.append({
                "name": path.stem,
                "content": content
            })

        # Tell Claude which skills are loaded
        # (Exact API depends on Claude Code plugin system)
        await self.claude_api.load_skills(skill_contents)

    async def regenerate_skills(self, branch: str):
        """Regenerate skills after branch merge"""
        # Run: skill-seekers regenerate-skills --branch {branch}
        import subprocess
        result = subprocess.run(
            ["skill-seekers", "regenerate-skills", "--branch", branch, "--silent"],
            capture_output=True,
            text=True
        )
        if result.returncode == 0:
            self.notify_user(f"✅ Skills updated for branch: {branch}")
        else:
            self.notify_user(f"❌ Skill regeneration failed: {result.stderr}")
```
---
## 📊 Performance Considerations
### Import Analysis
- **Speed:** <100ms per file (AST parsing is fast)
- **Accuracy:** 85-90% (misses dynamic imports)
- **Memory:** Negligible (registry is small)
### Embedding Generation
- **Speed:** ~50ms per embedding (with all-MiniLM-L6-v2)
- **Accuracy:** 80-85% (better than imports for semantics)
- **Memory:** ~5KB per embedding
- **Storage:** ~500KB for 100 skills
### Skill Loading
- **Context Size:** 5 skills × 200 lines = 1000 lines (~4K tokens)
- **Loading Time:** <50ms (file I/O)
- **Claude Context:** Leaves plenty of room for code
### Git Hooks
- **Trigger Time:** <1 second (git hook overhead)
- **Regeneration:** 3-5 minutes (depends on codebase size)
- **Background:** Can run in background (async)
---
## 🔒 Security Considerations
1. **Git Hooks:** Installed with user permission, can be disabled
2. **File System:** Limited to project directory
3. **Network:** Library skills downloaded over HTTPS
4. **Embeddings:** Generated locally, no data sent externally
5. **Cache:** Stored locally in `.skill-seekers/cache/`
---
## 🎯 Design Trade-offs
### 1. Git-Based vs Watch Mode
- **Chosen:** Git-based (update on merge)
- **Why:** Better performance, no constant CPU usage
- **Trade-off:** Less real-time; skills refresh only after a merge
### 2. Import vs Embedding
- **Chosen:** Both (hybrid)
- **Why:** Import is fast/precise, embedding is flexible
- **Trade-off:** More complex, harder to debug
### 3. Config-Driven vs Auto
- **Chosen:** Config-driven with auto-detect
- **Why:** User control + convenience
- **Trade-off:** Requires manual config for complex cases
### 4. Local vs Cloud
- **Chosen:** Local (embeddings generated locally)
- **Why:** Privacy, speed, no API costs
- **Trade-off:** Requires model download (80MB)
---
## 🚧 Open Questions
1. **Claude Code Plugin API:** How exactly do we load skills into context?
2. **Context Management:** How to handle context overflow with large skills?
3. **Multi-File Context:** What if user has 3 files open? Load skills for all?
4. **Skill Updates:** How to invalidate cache when code changes?
5. **Cross-Project:** Can skills be shared across projects?
---
## 📚 References
- **Existing Code:** `src/skill_seekers/cli/codebase_scraper.py` (C3.x features)
- **Similar Tools:** GitHub Copilot, Cursor, Tabnine
- **Research:** RAG systems, semantic code search
- **Libraries:** sentence-transformers, numpy, ast
---
**Version:** 1.0 (Draft)
**Status:** For study and iteration
**Next:** Review, iterate, then implement Phase 1