Skill Seekers Intelligence System - Technical Architecture

Version: 1.0 (Draft)
Status: 🔬 Research & Design
Last Updated: 2026-01-20
For: Study and iteration before implementation


🎯 System Overview

The Skill Seekers Intelligence System is a multi-layered architecture that automatically generates, updates, and intelligently loads codebase knowledge into Claude Code's context.

Core Principles:

  1. Git-Based Triggers: Only update on branch merges (not constant watching)
  2. Modular Skills: Separate libraries from codebase, split codebase into modules
  3. Smart Clustering: Load only relevant skills based on context
  4. User Control: Config-driven, user has final say

🏗️ Architecture Layers

┌─────────────────────────────────────────────────────────────┐
│                     USER INTERFACE                          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ CLI Commands     Claude Code Plugin    Config Files  │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────────┐
│                   ORCHESTRATION LAYER                       │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ • Project Manager                                    │   │
│  │ • Skill Registry                                     │   │
│  │ • Update Scheduler                                   │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────────┐
│                  SKILL GENERATION LAYER                     │
│  ┌────────────────────┐  ┌────────────────────┐            │
│  │ Tech Stack         │  │ Modular Codebase   │            │
│  │ Detector           │  │ Analyzer           │            │
│  └────────────────────┘  └────────────────────┘            │
│  ┌────────────────────┐  ┌────────────────────┐            │
│  │ Library Skill      │  │ Git Change         │            │
│  │ Downloader         │  │ Detector           │            │
│  └────────────────────┘  └────────────────────┘            │
└─────────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────────┐
│                  CLUSTERING LAYER                           │
│  ┌────────────────────┐  ┌────────────────────┐            │
│  │ Import-Based       │  │ Embedding-Based    │            │
│  │ Clustering         │  │ Clustering         │            │
│  │ (Phase 1)          │  │ (Phase 2)          │            │
│  └────────────────────┘  └────────────────────┘            │
│  ┌────────────────────┐                                     │
│  │ Hybrid Clustering  │                                     │
│  │ (Combines both)    │                                     │
│  └────────────────────┘                                     │
└─────────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────────┐
│                     STORAGE LAYER                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ • Skill Files (.skill-seekers/skills/)               │   │
│  │ • Embeddings Cache (.skill-seekers/cache/)           │   │
│  │ • Metadata (.skill-seekers/registry.json)            │   │
│  │ • Git Hooks (.skill-seekers/hooks/)                  │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

📂 File System Structure

project-root/
├── .skill-seekers/                    # Intelligence system directory
│   ├── config.yml                     # User configuration
│   │
│   ├── skills/                        # Generated skills
│   │   ├── libraries/                 # External library skills
│   │   │   ├── fastapi.skill
│   │   │   ├── react.skill
│   │   │   └── postgresql.skill
│   │   │
│   │   └── codebase/                  # Project-specific skills
│   │       ├── backend/
│   │       │   ├── api.skill
│   │       │   ├── auth.skill
│   │       │   └── models.skill
│   │       │
│   │       └── frontend/
│   │           ├── components.skill
│   │           └── pages.skill
│   │
│   ├── cache/                         # Performance caches
│   │   ├── embeddings/                # Skill embeddings
│   │   │   ├── fastapi.npy
│   │   │   ├── api.npy
│   │   │   └── ...
│   │   │
│   │   └── metadata/                  # Cached metadata
│   │       └── skill-registry.json
│   │
│   ├── hooks/                         # Git hooks
│   │   ├── post-merge                 # Auto-regenerate on merge
│   │   ├── post-commit                # Optional
│   │   └── pre-push                   # Optional validation
│   │
│   ├── logs/                          # System logs
│   │   ├── regeneration.log
│   │   └── clustering.log
│   │
│   └── registry.json                  # Skill registry metadata
│
├── .git/                              # Git repository
└── ... (project files)

⚙️ Component Details

1. Project Manager

Responsibility: Initialize and manage project intelligence

# src/skill_seekers/intelligence/project_manager.py

class ProjectManager:
    """Manages project intelligence system lifecycle"""

    def __init__(self, project_root: Path):
        self.root = project_root
        self.config_path = project_root / ".skill-seekers" / "config.yml"
        self.skills_dir = project_root / ".skill-seekers" / "skills"

    def initialize(self) -> bool:
        """
        Initialize project for intelligence system
        Creates directory structure, config, git hooks
        """
        # 1. Create directory structure
        self._create_directories()

        # 2. Generate default config
        config = self._generate_default_config()
        self._save_config(config)

        # 3. Install git hooks
        self._install_git_hooks()

        # 4. Initial skill generation
        self._initial_skill_generation()

        return True

    def _create_directories(self):
        """Create .skill-seekers directory structure"""
        dirs = [
            ".skill-seekers",
            ".skill-seekers/skills",
            ".skill-seekers/skills/libraries",
            ".skill-seekers/skills/codebase",
            ".skill-seekers/cache",
            ".skill-seekers/cache/embeddings",
            ".skill-seekers/cache/metadata",
            ".skill-seekers/hooks",
            ".skill-seekers/logs",
        ]

        for d in dirs:
            (self.root / d).mkdir(parents=True, exist_ok=True)

    def _generate_default_config(self) -> dict:
        """Generate sensible default configuration"""
        return {
            "version": "1.0",
            "project_name": self.root.name,
            "watch_branches": ["main", "development"],
            "tech_stack": {
                "auto_detect": True,
                "frameworks": []
            },
            "skill_generation": {
                "enabled": True,
                "output_dir": ".skill-seekers/skills/codebase"
            },
            "git_hooks": {
                "enabled": True,
                "trigger_on": ["post-merge"]
            },
            "clustering": {
                "enabled": False,  # Phase 4+
                "strategy": "import",  # import, embedding, hybrid
                "max_skills_in_context": 5
            }
        }

    def _install_git_hooks(self):
        """Install git hooks for auto-regeneration"""
        hook_template = """#!/bin/bash
# Auto-generated by skill-seekers
# DO NOT EDIT - regenerate with: skill-seekers init-project

CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
CONFIG_FILE=".skill-seekers/config.yml"

if [ ! -f "$CONFIG_FILE" ]; then
    exit 0
fi

# Read watched branches from config
WATCH_BRANCHES=$(yq '.watch_branches[]' "$CONFIG_FILE" 2>/dev/null || echo "")

if echo "$WATCH_BRANCHES" | grep -q "^$CURRENT_BRANCH$"; then
    echo "🔄 Skill regeneration triggered on branch: $CURRENT_BRANCH"
    skill-seekers regenerate-skills --branch "$CURRENT_BRANCH" --silent
    echo "✅ Skills updated"
fi
"""

        hook_path = self.root / ".git" / "hooks" / "post-merge"
        hook_path.write_text(hook_template)
        hook_path.chmod(0o755)  # Make executable
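
The hook's branch check reduces to an exact-match test against the configured branch list. A minimal, testable Python sketch of that predicate (`should_regenerate` is a hypothetical helper, not part of the codebase):

```python
def should_regenerate(current_branch: str, watch_branches: list[str]) -> bool:
    """Mirror of the post-merge hook's grep: regenerate only on an
    exact match against a watched branch name."""
    return current_branch in watch_branches


print(should_regenerate("main", ["main", "development"]))           # True
print(should_regenerate("feature/login", ["main", "development"]))  # False
```

Exact matching (rather than substring matching) matters here: a branch named `main-backup` must not trigger regeneration.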

2. Tech Stack Detector

Responsibility: Detect frameworks and libraries from project files

# src/skill_seekers/intelligence/stack_detector.py

from pathlib import Path
from typing import Dict, List
import json
import yaml
import toml

class TechStackDetector:
    """
    Detect tech stack from project configuration files
    Supports: Python, JavaScript/TypeScript, Go, Rust, Java
    """

    def __init__(self, project_root: Path):
        self.root = project_root
        self.detectors = {
            "python": self._detect_python,
            "javascript": self._detect_javascript,
            "typescript": self._detect_typescript,
            "go": self._detect_go,
            "rust": self._detect_rust,
            "java": self._detect_java,
        }

    def detect(self) -> Dict[str, List[str]]:
        """
        Detect complete tech stack

        Returns:
            {
                "languages": ["Python", "JavaScript"],
                "frameworks": ["FastAPI", "React"],
                "databases": ["PostgreSQL"],
                "tools": ["Docker", "Redis"]
            }
        """
        stack = {
            "languages": [],
            "frameworks": [],
            "databases": [],
            "tools": []
        }

        # Detect languages
        for lang, detector in self.detectors.items():
            if detector():
                stack["languages"].append(lang.title())

        # Detect frameworks (per language)
        if "Python" in stack["languages"]:
            stack["frameworks"].extend(self._detect_python_frameworks())

        if "JavaScript" in stack["languages"] or "TypeScript" in stack["languages"]:
            stack["frameworks"].extend(self._detect_js_frameworks())

        # Detect databases
        stack["databases"].extend(self._detect_databases())

        # Detect tools
        stack["tools"].extend(self._detect_tools())

        return stack

    def _detect_python(self) -> bool:
        """Detect Python project"""
        markers = [
            "requirements.txt",
            "setup.py",
            "pyproject.toml",
            "Pipfile",
            "poetry.lock"
        ]
        return any((self.root / marker).exists() for marker in markers)

    def _detect_python_frameworks(self) -> List[str]:
        """Detect Python frameworks"""
        frameworks = []

        # Defined once so both parsers below can use it
        framework_map = {
            "fastapi": "FastAPI",
            "django": "Django",
            "flask": "Flask",
            "sqlalchemy": "SQLAlchemy",
            "pydantic": "Pydantic",
            "anthropic": "Anthropic",
            "openai": "OpenAI",
            "beautifulsoup4": "BeautifulSoup",
            "requests": "Requests",
            "httpx": "HTTPX",
            "aiohttp": "aiohttp",
        }

        # Parse requirements.txt
        req_file = self.root / "requirements.txt"
        if req_file.exists():
            deps = req_file.read_text().lower()

            for key, name in framework_map.items():
                if key in deps:
                    frameworks.append(name)

        # Parse pyproject.toml
        pyproject = self.root / "pyproject.toml"
        if pyproject.exists():
            try:
                data = toml.loads(pyproject.read_text())
                deps = data.get("project", {}).get("dependencies", [])
                deps_str = " ".join(deps).lower()

                for key, name in framework_map.items():
                    if key in deps_str and name not in frameworks:
                        frameworks.append(name)
            except (OSError, toml.TomlDecodeError):
                pass

        return frameworks

    def _detect_javascript(self) -> bool:
        """Detect JavaScript project"""
        return (self.root / "package.json").exists()

    def _detect_typescript(self) -> bool:
        """Detect TypeScript project"""
        markers = ["tsconfig.json", "package.json"]
        if not all((self.root / m).exists() for m in markers):
            return False

        # Check if typescript is in dependencies
        pkg = self.root / "package.json"
        try:
            data = json.loads(pkg.read_text())
            deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}
            return "typescript" in deps
        except (OSError, json.JSONDecodeError):
            return False

    def _detect_js_frameworks(self) -> List[str]:
        """Detect JavaScript/TypeScript frameworks"""
        frameworks = []

        pkg = self.root / "package.json"
        if not pkg.exists():
            return frameworks

        try:
            data = json.loads(pkg.read_text())
            deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}

            framework_map = {
                "react": "React",
                "vue": "Vue",
                "next": "Next.js",
                "nuxt": "Nuxt.js",
                "svelte": "Svelte",
                "angular": "Angular",
                "express": "Express",
                "fastify": "Fastify",
                "nestjs": "NestJS",
            }

            for key, name in framework_map.items():
                if key in deps:
                    frameworks.append(name)

        except (OSError, json.JSONDecodeError):
            pass

        return frameworks

    def _detect_databases(self) -> List[str]:
        """Detect databases from environment and configs"""
        databases = []

        # Check .env file
        env_file = self.root / ".env"
        if env_file.exists():
            env_content = env_file.read_text().lower()

            db_markers = {
                "postgres": "PostgreSQL",
                "mysql": "MySQL",
                "mongodb": "MongoDB",
                "redis": "Redis",
                "sqlite": "SQLite",
            }

            for marker, name in db_markers.items():
                if marker in env_content:
                    databases.append(name)

        # Check docker-compose.yml
        compose = self.root / "docker-compose.yml"
        if compose.exists():
            try:
                data = yaml.safe_load(compose.read_text())
                services = data.get("services", {})

                for service_name, config in services.items():
                    image = config.get("image", "").lower()

                    db_images = {
                        "postgres": "PostgreSQL",
                        "mysql": "MySQL",
                        "mongo": "MongoDB",
                        "redis": "Redis",
                    }

                    for marker, name in db_images.items():
                        if marker in image and name not in databases:
                            databases.append(name)
            except (OSError, yaml.YAMLError):
                pass

        return databases

    def _detect_tools(self) -> List[str]:
        """Detect development tools"""
        tools = []

        tool_markers = {
            "Dockerfile": "Docker",
            "docker-compose.yml": "Docker Compose",
            ".github/workflows": "GitHub Actions",
            "Makefile": "Make",
            "nginx.conf": "Nginx",
        }

        for marker, name in tool_markers.items():
            if (self.root / marker).exists():
                tools.append(name)

        return tools

    def _detect_go(self) -> bool:
        return (self.root / "go.mod").exists()

    def _detect_rust(self) -> bool:
        return (self.root / "Cargo.toml").exists()

    def _detect_java(self) -> bool:
        markers = ["pom.xml", "build.gradle", "build.gradle.kts"]
        return any((self.root / m).exists() for m in markers)
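
The substring match at the heart of `_detect_python_frameworks` can be exercised standalone. A sketch with a trimmed-down map (names here are illustrative, not the full table above):

```python
FRAMEWORK_MAP = {
    "fastapi": "FastAPI",
    "django": "Django",
    "flask": "Flask",
}

def detect_frameworks(requirements_text: str) -> list[str]:
    # Same case-insensitive substring match the detector applies
    # to the contents of requirements.txt
    deps = requirements_text.lower()
    return [name for key, name in FRAMEWORK_MAP.items() if key in deps]


print(detect_frameworks("fastapi==0.110.0\nuvicorn\nFlask>=3.0"))
# ['FastAPI', 'Flask']
```

Note that substring matching can false-positive (e.g. `flask-login` also matches `flask`); for this use case, loading an extra related skill is an acceptable failure mode.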

3. Modular Skill Generator

Responsibility: Split codebase into modular skills based on config

# src/skill_seekers/intelligence/modular_generator.py

from pathlib import Path
from typing import List, Dict
import glob

class ModularSkillGenerator:
    """
    Generate modular skills from codebase
    Splits based on: namespace, directory, feature, or custom
    """

    def __init__(self, project_root: Path, config: dict):
        self.root = project_root
        self.config = config
        self.modules = config.get("modules", {})

    def generate_all(self) -> List[Path]:
        """Generate all modular skills"""
        generated_skills = []

        for module_name, module_config in self.modules.items():
            skills = self.generate_module(module_name, module_config)
            generated_skills.extend(skills)

        return generated_skills

    def generate_module(self, module_name: str, module_config: dict) -> List[Path]:
        """
        Generate skills for a single module

        module_config = {
            "path": "src/api/",
            "split_by": "namespace",  # or directory, feature, custom
            "skills": [
                {
                    "name": "api",
                    "description": "API endpoints",
                    "include": ["*/routes/*.py"],
                    "exclude": ["*_test.py"]
                }
            ]
        }
        """
        skills = []

        for skill_config in module_config.get("skills", []):
            skill_path = self._generate_skill(module_name, skill_config)
            skills.append(skill_path)

        return skills

    def _generate_skill(self, module_name: str, skill_config: dict) -> Path:
        """Generate a single skill file"""
        skill_name = skill_config["name"]
        include_patterns = skill_config.get("include", [])
        exclude_patterns = skill_config.get("exclude", [])

        # 1. Find files matching patterns
        files = self._find_files(include_patterns, exclude_patterns)

        # 2. Run codebase analysis on these files
        # (Reuse existing C3.x codebase_scraper.py)
        from skill_seekers.cli.codebase_scraper import analyze_codebase

        analysis_result = analyze_codebase(
            files=files,
            project_root=self.root,
            depth="deep",
            ai_mode="none"
        )

        # 3. Generate SKILL.md
        skill_content = self._format_skill(
            name=skill_name,
            description=skill_config.get("description", ""),
            analysis=analysis_result
        )

        # 4. Save skill file
        output_dir = self.root / ".skill-seekers" / "skills" / "codebase" / module_name
        output_dir.mkdir(parents=True, exist_ok=True)

        skill_path = output_dir / f"{skill_name}.skill"
        skill_path.write_text(skill_content)

        return skill_path

    def _find_files(self, include: List[str], exclude: List[str]) -> List[Path]:
        """Find files matching include/exclude patterns"""
        files = set()

        # Include patterns
        for pattern in include:
            matched = glob.glob(str(self.root / pattern), recursive=True)
            files.update(Path(f) for f in matched)

        # Exclude patterns
        for pattern in exclude:
            matched = glob.glob(str(self.root / pattern), recursive=True)
            files.difference_update(Path(f) for f in matched)

        return sorted(files)

    def _format_skill(self, name: str, description: str, analysis: dict) -> str:
        """Format analysis results into SKILL.md"""
        return f"""---
name: {name}
description: {description}
module: codebase
---

# {name.title()}

## Description

{description}

## API Reference

{analysis.get('api_reference', '')}

## Design Patterns

{analysis.get('patterns', '')}

## Examples

{analysis.get('examples', '')}

## Related Skills

{self._generate_cross_references(name)}
"""

    def _generate_cross_references(self, skill_name: str) -> str:
        """Generate cross-references to related skills"""
        # Analyze imports to find dependencies
        # Link to other skills that this skill imports from
        return "- Related skill 1\n- Related skill 2"
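
The include/exclude semantics of `_find_files` can be shown in memory with `fnmatch`, which uses the same shell-style glob patterns (the real method globs the filesystem; `filter_paths` is a hypothetical helper for illustration):

```python
import fnmatch

def filter_paths(paths: list[str], include: list[str], exclude: list[str]) -> list[str]:
    # Keep paths matching any include pattern, then drop those
    # matching any exclude pattern -- mirroring _find_files
    def matches(path: str, patterns: list[str]) -> bool:
        return any(fnmatch.fnmatch(path, pat) for pat in patterns)

    return sorted(p for p in paths if matches(p, include) and not matches(p, exclude))


files = ["src/api/routes/users.py", "src/api/routes/users_test.py", "src/api/models.py"]
print(filter_paths(files, ["*/routes/*.py"], ["*_test.py"]))
# ['src/api/routes/users.py']
```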

4. Import-Based Clustering Engine

Responsibility: Find relevant skills based on import analysis

# src/skill_seekers/intelligence/import_clustering.py

from pathlib import Path
from typing import List, Set
import ast

class ImportBasedClusteringEngine:
    """
    Find relevant skills by analyzing imports in current file
    Fast and deterministic - no AI needed
    """

    def __init__(self, skills_dir: Path):
        self.skills_dir = skills_dir
        self.skill_registry = self._build_registry()

    def _build_registry(self) -> dict:
        """
        Build registry mapping imports to skills

        Returns:
            {
                "fastapi": ["libraries/fastapi.skill"],
                "anthropic": ["libraries/anthropic.skill"],
                "src.api": ["codebase/backend/api.skill"],
                "src.auth": ["codebase/backend/auth.skill"],
            }
        """
        registry = {}

        # Scan all skills and extract what they provide
        for skill_path in self.skills_dir.rglob("*.skill"):
            # Parse skill metadata (YAML frontmatter)
            provides = self._extract_provides(skill_path)

            for module in provides:
                if module not in registry:
                    registry[module] = []
                registry[module].append(skill_path)

        return registry

    def find_relevant_skills(
        self,
        current_file: Path,
        max_skills: int = 5
    ) -> List[Path]:
        """
        Find most relevant skills for current file

        Algorithm:
        1. Parse imports from current file
        2. Map imports to skills via registry
        3. Add current file's skill (if exists)
        4. Rank and return top N
        """
        # 1. Parse imports
        imports = self._parse_imports(current_file)

        # 2. Map to skills
        relevant_skills = set()

        for imp in imports:
            # External library?
            if self._is_external(imp):
                lib_skill = self._find_library_skill(imp)
                if lib_skill:
                    relevant_skills.add(lib_skill)

            # Internal module?
            else:
                module_skill = self._find_module_skill(imp)
                if module_skill:
                    relevant_skills.add(module_skill)

        # 3. Add current file's skill (highest priority).
        # Sort first so the set yields a deterministic order.
        ordered = sorted(relevant_skills)
        current_skill = self._find_skill_for_file(current_file)
        if current_skill:
            ordered = [current_skill] + [s for s in ordered if s != current_skill]

        # 4. Rank and return
        return self._rank_skills(ordered)[:max_skills]

    def _parse_imports(self, file_path: Path) -> Set[str]:
        """
        Parse imports from Python file using AST

        Returns: {"fastapi", "anthropic", "src.api", "src.auth"}
        """
        imports = set()

        try:
            tree = ast.parse(file_path.read_text())

            for node in ast.walk(tree):
                # import X
                if isinstance(node, ast.Import):
                    for alias in node.names:
                        imports.add(alias.name)

                # from X import Y
                elif isinstance(node, ast.ImportFrom):
                    if node.module:
                        imports.add(node.module)

        except Exception as e:
            print(f"Warning: Could not parse {file_path}: {e}")

        return imports

    def _is_external(self, import_name: str) -> bool:
        """Check if import is external library or internal module"""
        # External if:
        # - Not starts with project name
        # - Not starts with "src"
        # - Is known library (fastapi, django, etc.)

        internal_prefixes = ["src", "tests", self._get_project_name()]

        return not any(import_name.startswith(prefix) for prefix in internal_prefixes)

    def _find_library_skill(self, import_name: str) -> Path | None:
        """Find library skill for external import"""
        # Try exact match first
        skill_path = self.skills_dir / "libraries" / f"{import_name}.skill"
        if skill_path.exists():
            return skill_path

        # Try partial match (e.g., "fastapi.routing" -> "fastapi")
        base_module = import_name.split(".")[0]
        skill_path = self.skills_dir / "libraries" / f"{base_module}.skill"
        if skill_path.exists():
            return skill_path

        return None

    def _find_module_skill(self, import_name: str) -> Path | None:
        """Find codebase skill for internal import"""
        # Registry maps each import to a list of skills; take the first match
        matches = self.skill_registry.get(import_name, [])
        return matches[0] if matches else None

    def _find_skill_for_file(self, file_path: Path) -> Path | None:
        """Find which skill contains this file"""
        # Match file path against skill file patterns.
        # That requires reading all skill configs; for now use a simple
        # heuristic on the path string: src/api/ -> api.skill
        path_str = str(file_path)

        if "api" in path_str:
            return self.skills_dir / "codebase" / "backend" / "api.skill"
        elif "auth" in path_str:
            return self.skills_dir / "codebase" / "backend" / "auth.skill"
        # ... etc

        return None

    def _rank_skills(self, skills: List[Path]) -> List[Path]:
        """Rank skills by relevance (for now, just deduplicate)"""
        return list(dict.fromkeys(skills))  # Preserve order, remove dupes
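
The AST walk in `_parse_imports` works the same on any source string, which makes it easy to test without touching the filesystem (`parse_imports` below is a standalone sketch of the method):

```python
import ast

def parse_imports(source: str) -> set[str]:
    # Same walk as _parse_imports: collect `import X` and `from X import Y`
    imports = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
    return imports


sample = "import fastapi\nfrom src.auth import login\nfrom . import utils"
print(sorted(parse_imports(sample)))  # ['fastapi', 'src.auth']
```

Relative imports without a module (`from . import utils`) have `node.module` set to `None` and are skipped, matching the original's behavior.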

5. Embedding-Based Clustering Engine

Responsibility: Find relevant skills using semantic similarity

# src/skill_seekers/intelligence/embedding_clustering.py

from pathlib import Path
from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer

class EmbeddingBasedClusteringEngine:
    """
    Find relevant skills using embeddings and cosine similarity
    More flexible than import-based, but slower
    """

    def __init__(self, skills_dir: Path, cache_dir: Path):
        self.skills_dir = skills_dir
        self.cache_dir = cache_dir
        self.model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, fast

        # Load or generate skill embeddings
        self.skill_embeddings = self._load_skill_embeddings()

    def _load_skill_embeddings(self) -> dict:
        """Load pre-computed skill embeddings from cache"""
        embeddings = {}

        for skill_path in self.skills_dir.rglob("*.skill"):
            cache_path = self.cache_dir / "embeddings" / f"{skill_path.stem}.npy"

            if cache_path.exists():
                # Load from cache
                embeddings[skill_path] = np.load(cache_path)
            else:
                # Generate and cache
                embedding = self._embed_skill(skill_path)
                cache_path.parent.mkdir(parents=True, exist_ok=True)
                np.save(cache_path, embedding)
                embeddings[skill_path] = embedding

        return embeddings

    def _embed_skill(self, skill_path: Path) -> np.ndarray:
        """Generate embedding for a skill"""
        content = skill_path.read_text()

        # Extract key sections (API Reference + Examples)
        api_section = self._extract_section(content, "## API Reference")
        examples_section = self._extract_section(content, "## Examples")

        # Combine and embed
        text = f"{api_section}\n{examples_section}"
        embedding = self.model.encode(text[:5000])  # Limit to 5K chars

        return embedding

    def _embed_file(self, file_path: Path) -> np.ndarray:
        """Generate embedding for current file"""
        content = file_path.read_text()

        # Embed full content (or first N chars for performance)
        embedding = self.model.encode(content[:5000])

        return embedding

    def find_relevant_skills(
        self,
        current_file: Path,
        max_skills: int = 5
    ) -> List[Path]:
        """
        Find most relevant skills using cosine similarity

        Algorithm:
        1. Embed current file
        2. Compute cosine similarity with all skill embeddings
        3. Rank by similarity
        4. Return top N
        """
        # 1. Embed current file
        file_embedding = self._embed_file(current_file)

        # 2. Compute similarities
        similarities = {}

        for skill_path, skill_embedding in self.skill_embeddings.items():
            similarity = self._cosine_similarity(file_embedding, skill_embedding)
            similarities[skill_path] = similarity

        # 3. Rank by similarity
        ranked = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

        # 4. Return top N
        return [skill_path for skill_path, _ in ranked[:max_skills]]

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Compute cosine similarity between two vectors"""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def _extract_section(self, content: str, header: str) -> str:
        """Extract section from markdown content"""
        lines = content.split("\n")
        section_lines = []
        in_section = False

        for line in lines:
            if line.startswith(header):
                in_section = True
                continue

            if in_section:
                if line.startswith("##"):  # Next section
                    break
                section_lines.append(line)

        return "\n".join(section_lines)

6. Hybrid Clustering Engine

Responsibility: Combine import-based and embedding-based clustering

# src/skill_seekers/intelligence/hybrid_clustering.py

from pathlib import Path
from typing import List

from .import_clustering import ImportBasedClusteringEngine
from .embedding_clustering import EmbeddingBasedClusteringEngine

class HybridClusteringEngine:
    """
    Combine import-based (precise) and embedding-based (flexible)
    for best-of-both-worlds clustering
    """

    def __init__(
        self,
        import_engine: ImportBasedClusteringEngine,
        embedding_engine: EmbeddingBasedClusteringEngine,
        import_weight: float = 0.7,
        embedding_weight: float = 0.3
    ):
        self.import_engine = import_engine
        self.embedding_engine = embedding_engine
        self.import_weight = import_weight
        self.embedding_weight = embedding_weight

    def find_relevant_skills(
        self,
        current_file: Path,
        max_skills: int = 5
    ) -> List[Path]:
        """
        Find relevant skills using hybrid approach

        Algorithm:
        1. Get skills from both engines
        2. Combine with weighted ranking
        3. Return top N
        """
        # 1. Get results from both engines
        import_skills = self.import_engine.find_relevant_skills(
            current_file, max_skills=10
        )

        embedding_skills = self.embedding_engine.find_relevant_skills(
            current_file, max_skills=10
        )

        # 2. Weighted ranking
        skill_scores = {}

        # Import-based scores (higher rank = higher score)
        for i, skill in enumerate(import_skills):
            score = (len(import_skills) - i) * self.import_weight
            skill_scores[skill] = skill_scores.get(skill, 0) + score

        # Embedding-based scores
        for i, skill in enumerate(embedding_skills):
            score = (len(embedding_skills) - i) * self.embedding_weight
            skill_scores[skill] = skill_scores.get(skill, 0) + score

        # 3. Sort by combined score
        ranked = sorted(skill_scores.items(), key=lambda x: x[1], reverse=True)

        # 4. Return top N
        return [skill for skill, _ in ranked[:max_skills]]
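
The weighted rank-fusion step above can be illustrated in isolation. A minimal sketch with hypothetical skill names (the `fuse` helper and the module names are illustrative, not part of the codebase):

```python
def fuse(import_skills, embedding_skills, iw=0.7, ew=0.3, top_n=5):
    """Rank fusion: higher list position contributes a larger weighted score."""
    scores = {}
    for ranked, weight in ((import_skills, iw), (embedding_skills, ew)):
        for i, skill in enumerate(ranked):
            scores[skill] = scores.get(skill, 0.0) + (len(ranked) - i) * weight
    ordered = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [skill for skill, _ in ordered[:top_n]]

imports = ["auth-module", "db-module", "api-module"]      # precise signal
embeddings = ["api-module", "auth-module", "cache-module"]  # semantic signal

print(fuse(imports, embeddings))
# → ['auth-module', 'api-module', 'db-module', 'cache-module']
```

Note how `api-module`, ranked last by imports, overtakes `db-module` because the embedding engine also ranked it highly; this is the "best of both worlds" behavior the class aims for.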

🔌 Claude Code Plugin Integration

# claude_plugins/skill-seekers-intelligence/agent.py

class SkillSeekersIntelligenceAgent:
    """
    Claude Code plugin for skill intelligence
    Handles file open events, loads relevant skills
    """

    def __init__(self):
        self.project_root = self._detect_project_root()
        self.config = self._load_config()
        # Directories consumed by the clustering engines below
        # (the "skills_dir" config key and its default are illustrative)
        self.skills_dir = self.project_root / self.config.get("skills_dir", "skills")
        self.cache_dir = self.project_root / ".skill-seekers" / "cache"
        self.clustering_engine = self._init_clustering_engine()
        self.loaded_skills = []

    def _init_clustering_engine(self):
        """Initialize clustering engine based on config"""
        strategy = self.config.get("clustering", {}).get("strategy", "import")

        if strategy == "import":
            return ImportBasedClusteringEngine(self.skills_dir)
        elif strategy == "embedding":
            return EmbeddingBasedClusteringEngine(self.skills_dir, self.cache_dir)
        elif strategy == "hybrid":
            import_engine = ImportBasedClusteringEngine(self.skills_dir)
            embedding_engine = EmbeddingBasedClusteringEngine(
                self.skills_dir, self.cache_dir
            )
            return HybridClusteringEngine(import_engine, embedding_engine)
        else:
            raise ValueError(f"Unknown clustering strategy: {strategy}")

    async def on_file_open(self, file_path: str):
        """Hook: User opens a file"""
        file_path = Path(file_path)

        # Find relevant skills
        relevant_skills = self.clustering_engine.find_relevant_skills(
            file_path,
            max_skills=self.config.get("clustering", {}).get("max_skills_in_context", 5)
        )

        # Load skills into Claude context
        await self.load_skills(relevant_skills)

        # Notify user
        self.notify_user(f"📚 Loaded {len(relevant_skills)} skills", relevant_skills)

    async def on_branch_merge(self, branch: str):
        """Hook: Branch merged"""
        if branch in self.config.get("watch_branches", []):
            await self.regenerate_skills(branch)

    async def load_skills(self, skill_paths: List[Path]):
        """Load skills into Claude's context"""
        self.loaded_skills = skill_paths

        # Read skill contents
        skill_contents = []
        for path in skill_paths:
            content = path.read_text()
            skill_contents.append({
                "name": path.stem,
                "content": content
            })

        # Tell Claude which skills are loaded
        # (Exact API depends on Claude Code plugin system)
        await self.claude_api.load_skills(skill_contents)

    async def regenerate_skills(self, branch: str):
        """Regenerate skills after branch merge"""
        # Run: skill-seekers regenerate-skills --branch {branch}
        import subprocess

        result = subprocess.run(
            ["skill-seekers", "regenerate-skills", "--branch", branch, "--silent"],
            capture_output=True,
            text=True
        )

        if result.returncode == 0:
            self.notify_user(f"✅ Skills updated for branch: {branch}")
        else:
            self.notify_user(f"❌ Skill regeneration failed: {result.stderr}")

📊 Performance Considerations

Import Analysis

  • Speed: <100ms per file (AST parsing is fast)
  • Accuracy: 85-90% (misses dynamic imports)
  • Memory: Negligible (registry is small)
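
The speed and the accuracy ceiling both come from the same place: a static AST walk sees `import`/`from` statements but not runtime imports. A minimal sketch (the sample source string is illustrative):

```python
import ast

source = """
import numpy as np
from django.db import models
mod = __import__("json")  # dynamic import -- invisible to the AST walk below
"""

def static_imports(code: str) -> set:
    """Collect module names from static import statements only."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            names.add(node.module)
    return names

print(sorted(static_imports(source)))  # → ['django.db', 'numpy']
```

The `__import__("json")` call (and similarly `importlib.import_module(...)`) is just a function call to the parser, which is the main source of the 10-15% accuracy gap cited above.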

Embedding Generation

  • Speed: ~50ms per embedding (with all-MiniLM-L6-v2)
  • Accuracy: 80-85% (better than imports for semantics)
  • Memory: ~5KB per embedding
  • Storage: ~500KB for 100 skills
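
A back-of-envelope check on these figures, assuming 384-dimensional float32 vectors (the output shape of all-MiniLM-L6-v2): the raw vectors alone are well under the quoted ~5KB, so the per-embedding and total figures presumably include serialization and metadata overhead on top of the raw floats.

```python
dims = 384                       # all-MiniLM-L6-v2 embedding dimension
raw_bytes = dims * 4             # float32 = 4 bytes per dimension
print(raw_bytes)                 # → 1536 (≈1.5 KB raw per embedding)
print(100 * raw_bytes / 1024)    # → 150.0 (≈150 KB raw for 100 skills)
```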

Skill Loading

  • Context Size: 5 skills × 200 lines = 1000 lines (~4K tokens)
  • Loading Time: <50ms (file I/O)
  • Claude Context: Leaves plenty of room for code

Git Hooks

  • Trigger Time: <1 second (git hook overhead)
  • Regeneration: 3-5 minutes (depends on codebase size)
  • Background: Can run in background (async)
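
The merge trigger described above can be wired up with an ordinary `post-merge` hook. A minimal sketch, assuming the `skill-seekers regenerate-skills` CLI shown earlier is on `PATH` and that `main`/`develop` are the watched branches (both assumptions):

```shell
#!/bin/sh
# .git/hooks/post-merge -- runs after `git merge` completes on the current branch

branch=$(git rev-parse --abbrev-ref HEAD)

case "$branch" in
  main|develop)
    # Detach into the background so the merge returns in <1s;
    # regeneration (3-5 min) continues asynchronously.
    skill-seekers regenerate-skills --branch "$branch" --silent &
    ;;
esac
```

Because the hook only pattern-matches the branch name and backgrounds the real work, the user-visible git overhead stays within the <1 second budget quoted above.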

🔒 Security Considerations

  1. Git Hooks: Installed with user permission, can be disabled
  2. File System: Limited to project directory
  3. Network: Library skills downloaded over HTTPS
  4. Embeddings: Generated locally, no data sent externally
  5. Cache: Stored locally in .skill-seekers/cache/

🎯 Design Trade-offs

1. Git-Based vs Watch Mode

  • Chosen: Git-based (update on merge)
  • Why: Better performance, no constant CPU usage
  • Trade-off: Less real-time, requires commit

2. Import vs Embedding

  • Chosen: Both (hybrid)
  • Why: Import is fast/precise, embedding is flexible
  • Trade-off: More complex, harder to debug

3. Config-Driven vs Auto

  • Chosen: Config-driven with auto-detect
  • Why: User control + convenience
  • Trade-off: Requires manual config for complex cases

4. Local vs Cloud

  • Chosen: Local (embeddings generated locally)
  • Why: Privacy, speed, no API costs
  • Trade-off: Requires model download (80MB)

🚧 Open Questions

  1. Claude Code Plugin API: How exactly do we load skills into context?
  2. Context Management: How to handle context overflow with large skills?
  3. Multi-File Context: What if user has 3 files open? Load skills for all?
  4. Skill Updates: How to invalidate cache when code changes?
  5. Cross-Project: Can skills be shared across projects?

📚 References

  • Existing Code: src/skill_seekers/cli/codebase_scraper.py (C3.x features)
  • Similar Tools: GitHub Copilot, Cursor, Tabnine
  • Research: RAG systems, semantic code search
  • Libraries: sentence-transformers, numpy, ast

Version: 1.0 (Draft) Status: For study and iteration Next: Review, iterate, then implement Phase 1