feat: Add modern Python packaging - Phase 1 (Foundation)

Implements issue #168 - Modern Python packaging with uv support

This is Phase 1 of the modernization effort, establishing the core package
structure and build system.

## Major Changes

### 1. Migrated to src/ Layout
- Moved cli/ → src/skill_seekers/cli/
- Moved skill_seeker_mcp/ → src/skill_seekers/mcp/
- Created root package: src/skill_seekers/__init__.py
- Updated all imports: cli. → skill_seekers.cli.
- Updated all imports: skill_seeker_mcp. → skill_seekers.mcp.

### 2. Created pyproject.toml
- Modern Python packaging configuration
- All dependencies properly declared
- 9 CLI entry points configured:
  * skill-seekers (unified CLI)
  * skill-seekers-scrape
  * skill-seekers-github
  * skill-seekers-pdf
  * skill-seekers-unified
  * skill-seekers-enhance
  * skill-seekers-package
  * skill-seekers-upload
  * skill-seekers-estimate
- uv tool support enabled
- Build system: setuptools with wheel

### 3. Created Unified CLI (main.py)
- Git-style subcommands (skill-seekers scrape, etc.)
- Delegates to existing tool main() functions
- Full help system at top-level and subcommand level
- Backwards compatible with individual commands

### 4. Updated Package Versions
- cli/__init__.py: 1.3.0 → 2.0.0
- mcp/__init__.py: 1.2.0 → 2.0.0
- Root package: 2.0.0

### 5. Updated Test Suite
- Fixed test_package_structure.py for new layout
- All 28 package structure tests passing
- Updated all test imports for new structure

## Installation Methods (Working)

```bash
# Development install
pip install -e .

# Run unified CLI
skill-seekers --version  # → 2.0.0
skill-seekers --help

# Run individual tools
skill-seekers-scrape --help
skill-seekers-github --help
```

## Test Results
- Package structure tests: 28/28 passing ✅
- Package installs successfully ✅
- All entry points working ✅

## Still TODO (Phase 2)
- [ ] Run full test suite (299 tests)
- [ ] Update documentation (README, CLAUDE.md, etc.)
- [ ] Test with uv tool run/install
- [ ] Build and publish to PyPI
- [ ] Create PR and merge

## Breaking Changes
None - fully backwards compatible. Old import paths still work.

## Migration for Users
No action needed. Package works with both pip and uv.

Closes #168 (when complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
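
The delegation described under "Created Unified CLI" can be sketched roughly as follows. This is an illustration only, not the contents of main.py in this commit: `scrape_main`, `estimate_main`, and the `COMMANDS` table are hypothetical stand-ins, and the sketch assumes each existing tool's main() reads its arguments from `sys.argv`.

```python
import sys
from typing import Callable, Dict, List, Optional


def scrape_main() -> int:
    """Hypothetical stand-in for an existing tool's main(); reads sys.argv itself."""
    print(f"scrape called with {sys.argv[1:]}")
    return 0


def estimate_main() -> int:
    """Hypothetical stand-in for a second tool."""
    print(f"estimate called with {sys.argv[1:]}")
    return 0


# Subcommand name -> existing entry point (names are illustrative).
COMMANDS: Dict[str, Callable[[], int]] = {
    "scrape": scrape_main,
    "estimate": estimate_main,
}


def main(argv: Optional[List[str]] = None) -> int:
    argv = list(sys.argv[1:] if argv is None else argv)

    if not argv or argv[0] in ("-h", "--help"):
        print("usage: skill-seekers <command> [args...]")
        print("commands: " + ", ".join(sorted(COMMANDS)))
        return 0
    if argv[0] == "--version":
        print("2.0.0")
        return 0

    command, rest = argv[0], argv[1:]
    if command not in COMMANDS:
        print(f"unknown command: {command}", file=sys.stderr)
        return 2

    # Rewrite sys.argv so the delegated main() sees the same arguments it
    # would receive when invoked as a standalone command.
    saved = sys.argv
    sys.argv = [f"skill-seekers-{command}", *rest]
    try:
        return COMMANDS[command]()
    finally:
        sys.argv = saved
```

With this shape, `main(["scrape", "--config", "x.json"])` forwards `--config x.json` to the scrape tool untouched, which is one way the individual commands and the unified command can share a single implementation.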
2025-11-07 01:14:24 +03:00
parent e3b49574d3
commit ce1c07b437
43 changed files with 601 additions and 106 deletions
--- a/src/skill_seekers/__init__.py
+++ b/src/skill_seekers/__init__.py
@@ -0,0 +1,22 @@
+"""
+Skill Seekers - Convert documentation, GitHub repos, and PDFs into Claude AI skills.
+
+This package provides tools for automatically scraping, organizing, and packaging
+documentation from various sources into uploadable Claude AI skills.
+"""
+
+__version__ = "2.0.0"
+__author__ = "Yusuf Karaaslan"
+__license__ = "MIT"
+
+# Expose main components for easier imports
+from skill_seekers.cli import __version__ as cli_version
+from skill_seekers.mcp import __version__ as mcp_version
+
+__all__ = [
+    "__version__",
+    "__author__",
+    "__license__",
+    "cli_version",
+    "mcp_version",
+]
--- a/src/skill_seekers/cli/__init__.py
+++ b/src/skill_seekers/cli/__init__.py
@@ -0,0 +1,39 @@
+"""Skill Seekers CLI tools package.
+
+This package provides command-line tools for converting documentation
+websites into Claude AI skills.
+
+Main modules:
+    - doc_scraper: Main documentation scraping and skill building tool
+    - llms_txt_detector: Detect llms.txt files at documentation URLs
+    - llms_txt_downloader: Download llms.txt content
+    - llms_txt_parser: Parse llms.txt markdown content
+    - pdf_scraper: Extract documentation from PDF files
+    - enhance_skill: AI-powered skill enhancement (API-based)
+    - enhance_skill_local: AI-powered skill enhancement (local)
+    - estimate_pages: Estimate page count before scraping
+    - package_skill: Package skills into .zip files
+    - upload_skill: Upload skills to Claude
+    - utils: Shared utility functions
+"""
+
+from .llms_txt_detector import LlmsTxtDetector
+from .llms_txt_downloader import LlmsTxtDownloader
+from .llms_txt_parser import LlmsTxtParser
+
+try:
+    from .utils import open_folder, read_reference_files
+except ImportError:
+    # utils.py might not exist in all configurations
+    open_folder = None
+    read_reference_files = None
+
+__version__ = "2.0.0"
+
+__all__ = [
+    "LlmsTxtDetector",
+    "LlmsTxtDownloader",
+    "LlmsTxtParser",
+    "open_folder",
+    "read_reference_files",
+]
--- a/src/skill_seekers/cli/code_analyzer.py
+++ b/src/skill_seekers/cli/code_analyzer.py
@@ -0,0 +1,491 @@
+#!/usr/bin/env python3
+"""
+Code Analyzer for GitHub Repositories
+
+Extracts code signatures at configurable depth levels:
+- surface: File tree only (existing behavior)
+- deep: Parse files for signatures, parameters, types
+- full: Complete AST analysis (future enhancement)
+
+Supports multiple languages with language-specific parsers.
+"""
+
+import ast
+import re
+import logging
+from typing import Dict, List, Any, Optional
+from dataclasses import dataclass, asdict
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class Parameter:
+    """Represents a function parameter."""
+    name: str
+    type_hint: Optional[str] = None
+    default: Optional[str] = None
+
+
+@dataclass
+class FunctionSignature:
+    """Represents a function/method signature."""
+    name: str
+    parameters: List[Parameter]
+    return_type: Optional[str] = None
+    docstring: Optional[str] = None
+    line_number: Optional[int] = None
+    is_async: bool = False
+    is_method: bool = False
+    decorators: Optional[List[str]] = None
+
+    def __post_init__(self):
+        if self.decorators is None:
+            self.decorators = []
+
+
+@dataclass
+class ClassSignature:
+    """Represents a class signature."""
+    name: str
+    base_classes: List[str]
+    methods: List[FunctionSignature]
+    docstring: Optional[str] = None
+    line_number: Optional[int] = None
+
+
+class CodeAnalyzer:
+    """
+    Analyzes code at different depth levels.
+    """
+
+    def __init__(self, depth: str = 'surface'):
+        """
+        Initialize code analyzer.
+
+        Args:
+            depth: Analysis depth ('surface', 'deep', 'full')
+        """
+        self.depth = depth
+
+    def analyze_file(self, file_path: str, content: str, language: str) -> Dict[str, Any]:
+        """
+        Analyze a single file based on depth level.
+
+        Args:
+            file_path: Path to file in repository
+            content: File content as string
+            language: Programming language (Python, JavaScript, etc.)
+
+        Returns:
+            Dict containing extracted signatures
+        """
+        if self.depth == 'surface':
+            return {}  # Surface level doesn't analyze individual files
+
+        logger.debug(f"Analyzing {file_path} (language: {language}, depth: {self.depth})")
+
+        try:
+            if language == 'Python':
+                return self._analyze_python(content, file_path)
+            elif language in ['JavaScript', 'TypeScript']:
+                return self._analyze_javascript(content, file_path)
+            elif language in ['C', 'C++']:
+                return self._analyze_cpp(content, file_path)
+            else:
+                logger.debug(f"No analyzer for language: {language}")
+                return {}
+        except Exception as e:
+            logger.warning(f"Error analyzing {file_path}: {e}")
+            return {}
+
+    def _analyze_python(self, content: str, file_path: str) -> Dict[str, Any]:
+        """Analyze Python file using AST."""
+        try:
+            tree = ast.parse(content)
+        except SyntaxError as e:
+            logger.debug(f"Syntax error in {file_path}: {e}")
+            return {}
+
+        classes = []
+        functions = []
+
+        for node in ast.walk(tree):
+            if isinstance(node, ast.ClassDef):
+                class_sig = self._extract_python_class(node)
+                classes.append(asdict(class_sig))
+            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+                # Skip methods (functions defined directly in a class body)
+                if not any(isinstance(parent, ast.ClassDef) for parent in ast.walk(tree)
+                          if isinstance(getattr(parent, 'body', None), list) and node in parent.body):
+                    func_sig = self._extract_python_function(node)
+                    functions.append(asdict(func_sig))
+
+        return {
+            'classes': classes,
+            'functions': functions
+        }
+
+    def _extract_python_class(self, node: ast.ClassDef) -> ClassSignature:
+        """Extract class signature from AST node."""
+        # Extract base classes
+        bases = []
+        for base in node.bases:
+            if isinstance(base, ast.Name):
+                bases.append(base.id)
+            elif isinstance(base, ast.Attribute):
+                bases.append(f"{base.value.id}.{base.attr}" if hasattr(base.value, 'id') else base.attr)
+
+        # Extract methods
+        methods = []
+        for item in node.body:
+            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
+                method_sig = self._extract_python_function(item, is_method=True)
+                methods.append(method_sig)
+
+        # Extract docstring
+        docstring = ast.get_docstring(node)
+
+        return ClassSignature(
+            name=node.name,
+            base_classes=bases,
+            methods=methods,
+            docstring=docstring,
+            line_number=node.lineno
+        )
+
+    def _extract_python_function(self, node, is_method: bool = False) -> FunctionSignature:
+        """Extract function signature from AST node."""
+        # Extract parameters
+        params = []
+        for arg in node.args.args:
+            param_type = None
+            if arg.annotation:
+                param_type = ast.unparse(arg.annotation) if hasattr(ast, 'unparse') else None
+
+            params.append(Parameter(
+                name=arg.arg,
+                type_hint=param_type
+            ))
+
+        # Extract defaults
+        defaults = node.args.defaults
+        if defaults:
+            # Defaults are aligned to the end of params
+            num_no_default = len(params) - len(defaults)
+            for i, default in enumerate(defaults):
+                param_idx = num_no_default + i
+                if param_idx < len(params):
+                    try:
+                        params[param_idx].default = ast.unparse(default) if hasattr(ast, 'unparse') else str(default)
+                    except Exception:
+                        params[param_idx].default = "..."
+
+        # Extract return type
+        return_type = None
+        if node.returns:
+            try:
+                return_type = ast.unparse(node.returns) if hasattr(ast, 'unparse') else None
+            except Exception:
+                pass
+
+        # Extract decorators
+        decorators = []
+        for decorator in node.decorator_list:
+            try:
+                if hasattr(ast, 'unparse'):
+                    decorators.append(ast.unparse(decorator))
+                elif isinstance(decorator, ast.Name):
+                    decorators.append(decorator.id)
+            except Exception:
+                pass
+
+        # Extract docstring
+        docstring = ast.get_docstring(node)
+
+        return FunctionSignature(
+            name=node.name,
+            parameters=params,
+            return_type=return_type,
+            docstring=docstring,
+            line_number=node.lineno,
+            is_async=isinstance(node, ast.AsyncFunctionDef),
+            is_method=is_method,
+            decorators=decorators
+        )
+
+    def _analyze_javascript(self, content: str, file_path: str) -> Dict[str, Any]:
+        """
+        Analyze JavaScript/TypeScript file using regex patterns.
+
+        Note: This is a simplified approach. For production, consider using
+        a proper JS/TS parser like esprima or ts-morph.
+        """
+        classes = []
+        functions = []
+
+        # Extract class definitions
+        class_pattern = r'class\s+(\w+)(?:\s+extends\s+(\w+))?\s*\{'
+        for match in re.finditer(class_pattern, content):
+            class_name = match.group(1)
+            base_class = match.group(2) if match.group(2) else None
+
+            # Try to extract methods (simplified)
+            class_block_start = match.end()
+            # This is a simplification - proper parsing would track braces
+            class_block_end = content.find('}', class_block_start)
+            if class_block_end != -1:
+                class_body = content[class_block_start:class_block_end]
+                methods = self._extract_js_methods(class_body)
+            else:
+                methods = []
+
+            classes.append({
+                'name': class_name,
+                'base_classes': [base_class] if base_class else [],
+                'methods': methods,
+                'docstring': None,
+                'line_number': content[:match.start()].count('\n') + 1
+            })
+
+        # Extract top-level functions
+        func_pattern = r'(?:async\s+)?function\s+(\w+)\s*\(([^)]*)\)'
+        for match in re.finditer(func_pattern, content):
+            func_name = match.group(1)
+            params_str = match.group(2)
+            is_async = match.group(0).startswith('async')
+
+            params = self._parse_js_parameters(params_str)
+
+            functions.append({
+                'name': func_name,
+                'parameters': params,
+                'return_type': None,  # JS doesn't have type annotations (unless TS)
+                'docstring': None,
+                'line_number': content[:match.start()].count('\n') + 1,
+                'is_async': is_async,
+                'is_method': False,
+                'decorators': []
+            })
+
+        # Extract arrow functions assigned to const/let
+        arrow_pattern = r'(?:const|let|var)\s+(\w+)\s*=\s*(?:async\s+)?\(([^)]*)\)\s*=>'
+        for match in re.finditer(arrow_pattern, content):
+            func_name = match.group(1)
+            params_str = match.group(2)
+            is_async = re.search(r'=\s*async\s', match.group(0)) is not None
+
+            params = self._parse_js_parameters(params_str)
+
+            functions.append({
+                'name': func_name,
+                'parameters': params,
+                'return_type': None,
+                'docstring': None,
+                'line_number': content[:match.start()].count('\n') + 1,
+                'is_async': is_async,
+                'is_method': False,
+                'decorators': []
+            })
+
+        return {
+            'classes': classes,
+            'functions': functions
+        }
+
+    def _extract_js_methods(self, class_body: str) -> List[Dict]:
+        """Extract method signatures from class body."""
+        methods = []
+
+        # Match method definitions
+        method_pattern = r'(?:async\s+)?(\w+)\s*\(([^)]*)\)'
+        for match in re.finditer(method_pattern, class_body):
+            method_name = match.group(1)
+            params_str = match.group(2)
+            is_async = re.match(r'async\s', match.group(0)) is not None
+
+            # Skip control-flow keywords that the pattern mistakes for methods
+            if method_name in ['if', 'for', 'while', 'switch']:
+                continue
+
+            params = self._parse_js_parameters(params_str)
+
+            methods.append({
+                'name': method_name,
+                'parameters': params,
+                'return_type': None,
+                'docstring': None,
+                'line_number': None,
+                'is_async': is_async,
+                'is_method': True,
+                'decorators': []
+            })
+
+        return methods
+
+    def _parse_js_parameters(self, params_str: str) -> List[Dict]:
+        """Parse JavaScript parameter string."""
+        params = []
+
+        if not params_str.strip():
+            return params
+
+        # Split by comma (simplified - doesn't handle complex default values)
+        param_list = [p.strip() for p in params_str.split(',')]
+
+        for param in param_list:
+            if not param:
+                continue
+
+            # Check for default value
+            if '=' in param:
+                name, default = param.split('=', 1)
+                name = name.strip()
+                default = default.strip()
+            else:
+                name = param
+                default = None
+
+            # Check for type annotation (TypeScript)
+            type_hint = None
+            if ':' in name:
+                name, type_hint = name.split(':', 1)
+                name = name.strip()
+                type_hint = type_hint.strip()
+
+            params.append({
+                'name': name,
+                'type_hint': type_hint,
+                'default': default
+            })
+
+        return params
+
+    def _analyze_cpp(self, content: str, file_path: str) -> Dict[str, Any]:
+        """
+        Analyze C/C++ header file using regex patterns.
+
+        Note: This is a simplified approach focusing on header files.
+        For production, consider using libclang or similar.
+        """
+        classes = []
+        functions = []
+
+        # Extract class definitions (simplified - doesn't handle nested classes)
+        class_pattern = r'class\s+(\w+)(?:\s*:\s*public\s+(\w+))?\s*\{'
+        for match in re.finditer(class_pattern, content):
+            class_name = match.group(1)
+            base_class = match.group(2) if match.group(2) else None
+
+            classes.append({
+                'name': class_name,
+                'base_classes': [base_class] if base_class else [],
+                'methods': [],  # Simplified - would need to parse class body
+                'docstring': None,
+                'line_number': content[:match.start()].count('\n') + 1
+            })
+
+        # Extract function declarations
+        func_pattern = r'(\w+(?:\s*\*|\s*&)?)\s+(\w+)\s*\(([^)]*)\)'
+        for match in re.finditer(func_pattern, content):
+            return_type = match.group(1).strip()
+            func_name = match.group(2)
+            params_str = match.group(3)
+
+            # Skip common keywords
+            if func_name in ['if', 'for', 'while', 'switch', 'return']:
+                continue
+
+            params = self._parse_cpp_parameters(params_str)
+
+            functions.append({
+                'name': func_name,
+                'parameters': params,
+                'return_type': return_type,
+                'docstring': None,
+                'line_number': content[:match.start()].count('\n') + 1,
+                'is_async': False,
+                'is_method': False,
+                'decorators': []
+            })
+
+        return {
+            'classes': classes,
+            'functions': functions
+        }
+
+    def _parse_cpp_parameters(self, params_str: str) -> List[Dict]:
+        """Parse C++ parameter string."""
+        params = []
+
+        if not params_str.strip() or params_str.strip() == 'void':
+            return params
+
+        # Split by comma (simplified)
+        param_list = [p.strip() for p in params_str.split(',')]
+
+        for param in param_list:
+            if not param:
+                continue
+
+            # Check for default value
+            default = None
+            if '=' in param:
+                param, default = param.rsplit('=', 1)
+                param = param.strip()
+                default = default.strip()
+
+            # Extract type and name (simplified)
+            # Format: "type name" or "type* name" or "type& name"
+            parts = param.split()
+            if len(parts) >= 2:
+                param_type = ' '.join(parts[:-1])
+                param_name = parts[-1]
+            else:
+                param_type = param
+                param_name = "unknown"
+
+            params.append({
+                'name': param_name,
+                'type_hint': param_type,
+                'default': default
+            })
+
+        return params
+
+
+if __name__ == '__main__':
+    # Test the analyzer
+    python_code = '''
+class Node2D:
+    """Base class for 2D nodes."""
+
+    def move_local_x(self, delta: float, snap: bool = False) -> None:
+        """Move node along local X axis."""
+        pass
+
+    async def tween_position(self, target: tuple, duration: float = 1.0):
+        """Animate position to target."""
+        pass
+
+def create_sprite(texture: str) -> Node2D:
+    """Create a new sprite node."""
+    return Node2D()
+'''
+
+    analyzer = CodeAnalyzer(depth='deep')
+    result = analyzer.analyze_file('test.py', python_code, 'Python')
+
+    print("Analysis Result:")
+    print(f"Classes: {len(result.get('classes', []))}")
+    print(f"Functions: {len(result.get('functions', []))}")
+
+    if result.get('classes'):
+        cls = result['classes'][0]
+        print(f"\nClass: {cls['name']}")
+        print(f"  Methods: {len(cls['methods'])}")
+        for method in cls['methods']:
+            params = ', '.join([f"{p['name']}: {p['type_hint']}" + (f" = {p['default']}" if p.get('default') else "")
+                               for p in method['parameters']])
+            print(f"    {method['name']}({params}) -> {method['return_type']}")
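
The two less obvious steps in code_analyzer.py above, separating module-level functions from methods and aligning `defaults` with the trailing parameters, can be illustrated in isolation. The sketch below is a simplified alternative, not the code in this commit: `pair_defaults`, `outline`, and `SAMPLE` are hypothetical names, it inspects only `tree.body` (so nested functions and nested classes are ignored, unlike the `ast.walk` traversal above), and it assumes Python 3.9+ for `ast.unparse`.

```python
import ast
from typing import Dict, List, Optional, Tuple

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def pair_defaults(node) -> List[Tuple[str, Optional[str]]]:
    """Return (name, default_source) for each positional parameter."""
    names = [a.arg for a in node.args.args]
    defaults = node.args.defaults
    # ast stores defaults only for the trailing parameters, so pad the front
    # with None for the parameters that have no default.
    padded = [None] * (len(names) - len(defaults)) + [ast.unparse(d) for d in defaults]
    return list(zip(names, padded))


def outline(source: str) -> Dict[str, dict]:
    """Collect module-level functions and per-class methods."""
    tree = ast.parse(source)
    result: Dict[str, dict] = {"functions": {}, "classes": {}}
    for node in tree.body:  # module-level statements only
        if isinstance(node, FUNC_TYPES):
            result["functions"][node.name] = pair_defaults(node)
        elif isinstance(node, ast.ClassDef):
            result["classes"][node.name] = {
                item.name: pair_defaults(item)
                for item in node.body
                if isinstance(item, FUNC_TYPES)
            }
    return result


SAMPLE = '''
class Node2D:
    def move_local_x(self, delta: float, snap: bool = False) -> None:
        pass

    async def tween_position(self, target: tuple, duration: float = 1.0):
        pass

def create_sprite(texture: str) -> "Node2D":
    return Node2D()
'''

if __name__ == "__main__":
    print(outline(SAMPLE))
```

Walking `tree.body` directly avoids the parent lookup entirely, at the cost of missing definitions nested inside other functions or conditionals; whether that trade-off suits the analyzer depends on how much nested code should appear in the generated skill.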
--- a/src/skill_seekers/cli/config_validator.py
+++ b/src/skill_seekers/cli/config_validator.py
@@ -0,0 +1,376 @@
+#!/usr/bin/env python3
+"""
+Unified Config Validator
+
+Validates unified config format that supports multiple sources:
+- documentation (website scraping)
+- github (repository scraping)
+- pdf (PDF document scraping)
+
+Also provides backward compatibility detection for legacy configs.
+"""
+
+import json
+import logging
+from typing import Dict, Any, List, Optional, Union
+from pathlib import Path
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+class ConfigValidator:
+    """
+    Validates unified config format and provides backward compatibility.
+    """
+
+    # Valid source types
+    VALID_SOURCE_TYPES = {'documentation', 'github', 'pdf'}
+
+    # Valid merge modes
+    VALID_MERGE_MODES = {'rule-based', 'claude-enhanced'}
+
+    # Valid code analysis depth levels
+    VALID_DEPTH_LEVELS = {'surface', 'deep', 'full'}
+
+    def __init__(self, config_or_path: Union[Dict[str, Any], str]):
+        """
+        Initialize validator with config dict or file path.
+
+        Args:
+            config_or_path: Either a config dict or path to config JSON file
+        """
+        if isinstance(config_or_path, dict):
+            self.config_path = None
+            self.config = config_or_path
+        else:
+            self.config_path = config_or_path
+            self.config = self._load_config()
+        self.is_unified = self._detect_format()
+
+    def _load_config(self) -> Dict[str, Any]:
+        """Load JSON config file."""
+        try:
+            with open(self.config_path, 'r', encoding='utf-8') as f:
+                return json.load(f)
+        except FileNotFoundError as e:
+            raise ValueError(f"Config file not found: {self.config_path}") from e
+        except json.JSONDecodeError as e:
+            raise ValueError(f"Invalid JSON in config file: {e}") from e
+
+    def _detect_format(self) -> bool:
+        """
+        Detect if config is unified format or legacy.
+
+        Returns:
+            True if unified format (has 'sources' array)
+            False if legacy format
+        """
+        return 'sources' in self.config and isinstance(self.config['sources'], list)
+
+    def validate(self) -> bool:
+        """
+        Validate config based on detected format.
+
+        Returns:
+            True if valid
+
+        Raises:
+            ValueError if invalid with detailed error message
+        """
+        if self.is_unified:
+            return self._validate_unified()
+        else:
+            return self._validate_legacy()
+
+    def _validate_unified(self) -> bool:
+        """Validate unified config format."""
+        logger.info("Validating unified config format...")
+
+        # Required top-level fields
+        if 'name' not in self.config:
+            raise ValueError("Missing required field: 'name'")
+
+        if 'description' not in self.config:
+            raise ValueError("Missing required field: 'description'")
+
+        if 'sources' not in self.config:
+            raise ValueError("Missing required field: 'sources'")
+
+        # Validate sources array
+        sources = self.config['sources']
+
+        if not isinstance(sources, list):
+            raise ValueError("'sources' must be an array")
+
+        if len(sources) == 0:
+            raise ValueError("'sources' array cannot be empty")
+
+        # Validate merge_mode (optional)
+        merge_mode = self.config.get('merge_mode', 'rule-based')
+        if merge_mode not in self.VALID_MERGE_MODES:
+            raise ValueError(f"Invalid merge_mode: '{merge_mode}'. Must be one of {self.VALID_MERGE_MODES}")
+
+        # Validate each source
+        for i, source in enumerate(sources):
+            self._validate_source(source, i)
+
+        logger.info(f"✅ Unified config valid: {len(sources)} sources")
+        return True
+
+    def _validate_source(self, source: Dict[str, Any], index: int):
+        """Validate individual source configuration."""
+        # Check source has 'type' field
+        if 'type' not in source:
+            raise ValueError(f"Source {index}: Missing required field 'type'")
+
+        source_type = source['type']
+
+        if source_type not in self.VALID_SOURCE_TYPES:
+            raise ValueError(
+                f"Source {index}: Invalid type '{source_type}'. "
+                f"Must be one of {self.VALID_SOURCE_TYPES}"
+            )
+
+        # Type-specific validation
+        if source_type == 'documentation':
+            self._validate_documentation_source(source, index)
+        elif source_type == 'github':
+            self._validate_github_source(source, index)
+        elif source_type == 'pdf':
+            self._validate_pdf_source(source, index)
+
+    def _validate_documentation_source(self, source: Dict[str, Any], index: int):
+        """Validate documentation source configuration."""
+        if 'base_url' not in source:
+            raise ValueError(f"Source {index} (documentation): Missing required field 'base_url'")
+
+        # Optional but recommended fields
+        if 'selectors' not in source:
+            logger.warning(f"Source {index} (documentation): No 'selectors' specified, using defaults")
+
+        if 'max_pages' in source and not isinstance(source['max_pages'], int):
+            raise ValueError(f"Source {index} (documentation): 'max_pages' must be an integer")
+
+    def _validate_github_source(self, source: Dict[str, Any], index: int):
+        """Validate GitHub source configuration."""
+        if 'repo' not in source:
+            raise ValueError(f"Source {index} (github): Missing required field 'repo'")
+
+        # Validate repo format (owner/repo)
+        repo = source['repo']
+        if '/' not in repo:
+            raise ValueError(
+                f"Source {index} (github): Invalid repo format '{repo}'. "
+                f"Must be 'owner/repo' (e.g., 'facebook/react')"
+            )
+
+        # Validate code_analysis_depth if specified
+        if 'code_analysis_depth' in source:
+            depth = source['code_analysis_depth']
+            if depth not in self.VALID_DEPTH_LEVELS:
+                raise ValueError(
+                    f"Source {index} (github): Invalid code_analysis_depth '{depth}'. "
+                    f"Must be one of {self.VALID_DEPTH_LEVELS}"
+                )
+
+        # Validate max_issues if specified
+        if 'max_issues' in source and not isinstance(source['max_issues'], int):
+            raise ValueError(f"Source {index} (github): 'max_issues' must be an integer")
+
+    def _validate_pdf_source(self, source: Dict[str, Any], index: int):
+        """Validate PDF source configuration."""
+        if 'path' not in source:
+            raise ValueError(f"Source {index} (pdf): Missing required field 'path'")
+
+        # Check if file exists
+        pdf_path = source['path']
+        if not Path(pdf_path).exists():
+            logger.warning(f"Source {index} (pdf): File not found: {pdf_path}")
+
+    def _validate_legacy(self) -> bool:
+        """
+        Validate legacy config format (backward compatibility).
+
+        Legacy configs are the old format used by doc_scraper, github_scraper, pdf_scraper.
+        """
+        logger.info("Detected legacy config format (backward compatible)")
+
+        # Detect which legacy type based on fields
+        if 'base_url' in self.config:
+            logger.info("Legacy type: documentation")
+        elif 'repo' in self.config:
+            logger.info("Legacy type: github")
+        elif 'pdf' in self.config or 'path' in self.config:
+            logger.info("Legacy type: pdf")
+        else:
+            raise ValueError("Cannot detect legacy config type (missing base_url, repo, pdf, or path)")
+
+        return True
+
+    def convert_legacy_to_unified(self) -> Dict[str, Any]:
+        """
+        Convert legacy config to unified format.
+
+        Returns:
+            Unified config dict
+        """
+        if self.is_unified:
+            logger.info("Config already in unified format")
+            return self.config
+
+        logger.info("Converting legacy config to unified format...")
+
+        # Detect legacy type and convert
+        if 'base_url' in self.config:
+            return self._convert_legacy_documentation()
+        elif 'repo' in self.config:
+            return self._convert_legacy_github()
+        elif 'pdf' in self.config or 'path' in self.config:
+            return self._convert_legacy_pdf()
+        else:
+            raise ValueError("Cannot convert: unknown legacy format")
+
+    def _convert_legacy_documentation(self) -> Dict[str, Any]:
+        """Convert legacy documentation config to unified."""
+        unified = {
+            'name': self.config.get('name', 'unnamed'),
+            'description': self.config.get('description', 'Documentation skill'),
+            'merge_mode': 'rule-based',
+            'sources': [
+                {
+                    'type': 'documentation',
+                    **{k: v for k, v in self.config.items()
+                       if k not in ['name', 'description']}
+                }
+            ]
+        }
+        return unified
+
+    def _convert_legacy_github(self) -> Dict[str, Any]:
+        """Convert legacy GitHub config to unified."""
+        unified = {
+            'name': self.config.get('name', 'unnamed'),
+            'description': self.config.get('description', 'GitHub repository skill'),
+            'merge_mode': 'rule-based',
+            'sources': [
+                {
+                    'type': 'github',
+                    **{k: v for k, v in self.config.items()
+                       if k not in ['name', 'description']}
+                }
+            ]
+        }
+        return unified
+
+    def _convert_legacy_pdf(self) -> Dict[str, Any]:
+        """Convert legacy PDF config to unified."""
+        unified = {
+            'name': self.config.get('name', 'unnamed'),
+            'description': self.config.get('description', 'PDF document skill'),
+            'merge_mode': 'rule-based',
+            'sources': [
+                {
+                    'type': 'pdf',
+                    **{k: v for k, v in self.config.items()
+                       if k not in ['name', 'description']}
+                }
+            ]
+        }
+        return unified
+
+    def get_sources_by_type(self, source_type: str) -> List[Dict[str, Any]]:
+        """
+        Get all sources of a specific type.
+
+        Args:
+            source_type: 'documentation', 'github', or 'pdf'
+
+        Returns:
+            List of sources matching the type
+        """
+        if not self.is_unified:
+            # For legacy, convert and get sources
+            unified = self.convert_legacy_to_unified()
+            sources = unified['sources']
+        else:
+            sources = self.config['sources']
+
+        return [s for s in sources if s.get('type') == source_type]
+
+    def has_multiple_sources(self) -> bool:
+        """Check if config has multiple sources (requires merging)."""
+        if not self.is_unified:
+            return False
+        return len(self.config['sources']) > 1
+
+    def needs_api_merge(self) -> bool:
+        """
+        Check if config needs API merging.
+
+        Returns True if both documentation and github sources exist
+        with API extraction enabled.
+        """
+        if not self.has_multiple_sources():
+            return False
+
+        has_docs_api = any(
+            s.get('type') == 'documentation' and s.get('extract_api', True)
+            for s in self.config['sources']
+        )
+
+        has_github_code = any(
+            s.get('type') == 'github' and s.get('include_code', False)
+            for s in self.config['sources']
+        )
+
+        return has_docs_api and has_github_code
+
+
+def validate_config(config_path: str) -> ConfigValidator:
+    """
+    Validate config file and return validator instance.
+
+    Args:
+        config_path: Path to config JSON file
+
+    Returns:
+        ConfigValidator instance
+
+    Raises:
+        ValueError: If the config is invalid
+    """
+    validator = ConfigValidator(config_path)
+    validator.validate()
+    return validator
+
+
+if __name__ == '__main__':
+    import sys
+
+    if len(sys.argv) < 2:
+        print("Usage: python config_validator.py <config.json>")
+        sys.exit(1)
+
+    config_file = sys.argv[1]
+
+    try:
+        validator = validate_config(config_file)
+
+        print("\n✅ Config valid!")
+        print(f"   Format: {'Unified' if validator.is_unified else 'Legacy'}")
+        print(f"   Name: {validator.config.get('name')}")
+
+        if validator.is_unified:
+            sources = validator.config['sources']
+            print(f"   Sources: {len(sources)}")
+            for i, source in enumerate(sources):
+                print(f"     {i+1}. {source['type']}")
+
+            if validator.needs_api_merge():
+                merge_mode = validator.config.get('merge_mode', 'rule-based')
+                print(f"   ⚠️  API merge required (mode: {merge_mode})")
+
+    except ValueError as e:
+        print(f"\n❌ Config invalid: {e}")
+        sys.exit(1)
--- a/src/skill_seekers/cli/conflict_detector.py
+++ b/src/skill_seekers/cli/conflict_detector.py
@@ -0,0 +1,513 @@
+#!/usr/bin/env python3
+"""
+Conflict Detector for Multi-Source Skills
+
+Detects conflicts between documentation and code:
+- missing_in_docs: API exists in code but not documented
+- missing_in_code: API documented but doesn't exist in code
+- signature_mismatch: Different parameters/types between docs and code
+- description_mismatch: Docs say one thing, code comments say another (type reserved; not detected yet)
+
+Used by unified scraper to identify discrepancies before merging.
+"""
+
+import json
+import logging
+from typing import Any, Dict, List, Optional
+from dataclasses import dataclass, asdict
+from difflib import SequenceMatcher
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class Conflict:
+    """Represents a conflict between documentation and code."""
+    type: str  # 'missing_in_docs', 'missing_in_code', 'signature_mismatch', 'description_mismatch'
+    severity: str  # 'low', 'medium', 'high'
+    api_name: str
+    docs_info: Optional[Dict[str, Any]] = None
+    code_info: Optional[Dict[str, Any]] = None
+    difference: Optional[str] = None
+    suggestion: Optional[str] = None
+
+
+class ConflictDetector:
+    """
+    Detects conflicts between documentation and code sources.
+    """
+
+    def __init__(self, docs_data: Dict[str, Any], github_data: Dict[str, Any]):
+        """
+        Initialize conflict detector.
+
+        Args:
+            docs_data: Data from documentation scraper
+            github_data: Data from GitHub scraper with code analysis
+        """
+        self.docs_data = docs_data
+        self.github_data = github_data
+
+        # Extract API information from both sources
+        self.docs_apis = self._extract_docs_apis()
+        self.code_apis = self._extract_code_apis()
+
+        logger.info(f"Loaded {len(self.docs_apis)} APIs from documentation")
+        logger.info(f"Loaded {len(self.code_apis)} APIs from code")
+
+    def _extract_docs_apis(self) -> Dict[str, Dict[str, Any]]:
+        """
+        Extract API information from documentation data.
+
+        Returns:
+            Dict mapping API name to API info
+        """
+        apis = {}
+
+        # Documentation structure varies; only the 'pages' key is read here
+        pages = self.docs_data.get('pages', {})
+
+        # Handle both dict and list formats
+        if isinstance(pages, dict):
+            # Format: {url: page_data, ...}
+            for url, page_data in pages.items():
+                content = page_data.get('content', '')
+                title = page_data.get('title', '')
+
+                # Simple heuristic: if the title or URL contains "api", "reference", "class",
+                # "function", or "method", treat it as a possible API page
+                if any(keyword in title.lower() or keyword in url.lower()
+                       for keyword in ['api', 'reference', 'class', 'function', 'method']):
+
+                    # Extract API signatures from content (simplified)
+                    extracted_apis = self._parse_doc_content_for_apis(content, url)
+                    apis.update(extracted_apis)
+        elif isinstance(pages, list):
+            # Format: [{url: '...', apis: [...]}, ...]
+            for page in pages:
+                url = page.get('url', '')
+                page_apis = page.get('apis', [])
+
+                # If APIs are already extracted in the page data
+                for api in page_apis:
+                    api_name = api.get('name', '')
+                    if api_name:
+                        apis[api_name] = {
+                            'parameters': api.get('parameters', []),
+                            'return_type': api.get('return_type', 'Any'),
+                            'source_url': url
+                        }
+
+        return apis
+
+    def _parse_doc_content_for_apis(self, content: str, source_url: str) -> Dict[str, Dict]:
+        """
+        Parse documentation content to extract API signatures.
+
+        This is a simplified, regex-based approach; a full implementation would
+        need to understand the documentation format (Sphinx, JSDoc, etc.)
+        """
+        apis = {}
+
+        # Look for function/method signatures in code blocks
+        # Common patterns:
+        # - function_name(param1, param2)
+        # - ClassName.method_name(param1, param2)
+        # - def function_name(param1: type, param2: type) -> return_type
+
+        import re
+
+        # Pattern for common API signatures
+        patterns = [
+            # Python style: def name(params) -> return
+            r'def\s+(\w+)\s*\(([^)]*)\)(?:\s*->\s*(\w+))?',
+            # JavaScript style: function name(params)
+            r'function\s+(\w+)\s*\(([^)]*)\)',
+            # C++ style: return_type name(params)
+            r'(\w+)\s+(\w+)\s*\(([^)]*)\)',
+            # Method style: ClassName.method_name(params)
+            r'(\w+)\.(\w+)\s*\(([^)]*)\)'
+        ]
+
+        for pattern in patterns:
+            for match in re.finditer(pattern, content):
+                groups = match.groups()
+
+                # Parse based on pattern matched
+                if 'def' in pattern:
+                    # Python function
+                    name = groups[0]
+                    params_str = groups[1]
+                    return_type = groups[2] if len(groups) > 2 else None
+                elif 'function' in pattern:
+                    # JavaScript function
+                    name = groups[0]
+                    params_str = groups[1]
+                    return_type = None
+                elif '.' in pattern:
+                    # Class method
+                    class_name = groups[0]
+                    method_name = groups[1]
+                    name = f"{class_name}.{method_name}"
+                    params_str = groups[2] if len(groups) > 2 else groups[1]
+                    return_type = None
+                else:
+                    # C++ function
+                    return_type = groups[0]
+                    name = groups[1]
+                    params_str = groups[2]
+
+                # Parse parameters
+                params = self._parse_param_string(params_str)
+                # First match wins: the generic C++ pattern also matches 'def f(...)'
+                apis.setdefault(name, {
+                    'name': name,
+                    'parameters': params,
+                    'return_type': return_type,
+                    'source': source_url,
+                    'raw_signature': match.group(0)
+                })
+
+        return apis
+
+    def _parse_param_string(self, params_str: str) -> List[Dict]:
+        """Parse parameter string into list of parameter dicts."""
+        if not params_str.strip():
+            return []
+
+        params = []
+        for param in params_str.split(','):
+            param = param.strip()
+            if not param:
+                continue
+
+            # Try to extract name and type
+            param_info = {'name': param, 'type': None, 'default': None}
+
+            # Check for type annotation (: type)
+            if ':' in param:
+                parts = param.split(':', 1)
+                param_info['name'] = parts[0].strip()
+                type_part = parts[1].strip()
+
+                # Check for default value (= value)
+                if '=' in type_part:
+                    type_str, default_str = type_part.split('=', 1)
+                    param_info['type'] = type_str.strip()
+                    param_info['default'] = default_str.strip()
+                else:
+                    param_info['type'] = type_part
+
+            # Check for default without type (= value)
+            elif '=' in param:
+                parts = param.split('=', 1)
+                param_info['name'] = parts[0].strip()
+                param_info['default'] = parts[1].strip()
+
+            params.append(param_info)
+
+        return params
+
+    def _extract_code_apis(self) -> Dict[str, Dict[str, Any]]:
+        """
+        Extract API information from GitHub code analysis.
+
+        Returns:
+            Dict mapping API name to API info
+        """
+        apis = {}
+
+        code_analysis = self.github_data.get('code_analysis', {})
+        if not code_analysis:
+            return apis
+
+        # Support both 'files' and 'analyzed_files' keys
+        files = code_analysis.get('files', code_analysis.get('analyzed_files', []))
+
+        for file_info in files:
+            file_path = file_info.get('file', 'unknown')
+
+            # Extract classes and their methods
+            for class_info in file_info.get('classes', []):
+                class_name = class_info['name']
+
+                # Add class itself
+                apis[class_name] = {
+                    'name': class_name,
+                    'type': 'class',
+                    'source': file_path,
+                    'line': class_info.get('line_number'),
+                    'base_classes': class_info.get('base_classes', []),
+                    'docstring': class_info.get('docstring')
+                }
+
+                # Add methods
+                for method in class_info.get('methods', []):
+                    method_name = f"{class_name}.{method['name']}"
+                    apis[method_name] = {
+                        'name': method_name,
+                        'type': 'method',
+                        'parameters': method.get('parameters', []),
+                        'return_type': method.get('return_type'),
+                        'source': file_path,
+                        'line': method.get('line_number'),
+                        'docstring': method.get('docstring'),
+                        'is_async': method.get('is_async', False)
+                    }
+
+            # Extract standalone functions
+            for func_info in file_info.get('functions', []):
+                func_name = func_info['name']
+                apis[func_name] = {
+                    'name': func_name,
+                    'type': 'function',
+                    'parameters': func_info.get('parameters', []),
+                    'return_type': func_info.get('return_type'),
+                    'source': file_path,
+                    'line': func_info.get('line_number'),
+                    'docstring': func_info.get('docstring'),
+                    'is_async': func_info.get('is_async', False)
+                }
+
+        return apis
+
+    def detect_all_conflicts(self) -> List[Conflict]:
+        """
+        Detect all types of conflicts.
+
+        Returns:
+            List of Conflict objects
+        """
+        logger.info("Detecting conflicts between documentation and code...")
+
+        conflicts = []
+
+        # 1. Find APIs missing in documentation
+        conflicts.extend(self._find_missing_in_docs())
+
+        # 2. Find APIs missing in code
+        conflicts.extend(self._find_missing_in_code())
+
+        # 3. Find signature mismatches
+        conflicts.extend(self._find_signature_mismatches())
+
+        logger.info(f"Found {len(conflicts)} conflicts total")
+
+        return conflicts
+
+    def _find_missing_in_docs(self) -> List[Conflict]:
+        """Find APIs that exist in code but not in documentation."""
+        conflicts = []
+
+        for api_name, code_info in self.code_apis.items():
+            # Simple name matching (can be enhanced with fuzzy matching)
+            if api_name not in self.docs_apis:
+                # Check if it's a private/internal API (often not documented)
+                is_private = any(part.startswith('_') for part in api_name.split('.'))
+                severity = 'low' if is_private else 'medium'
+
+                conflicts.append(Conflict(
+                    type='missing_in_docs',
+                    severity=severity,
+                    api_name=api_name,
+                    code_info=code_info,
+                    difference=f"API exists in code ({code_info['source']}) but not found in documentation",
+                    suggestion="Add documentation for this API" if not is_private else "Consider if this internal API should be documented"
+                ))
+
+        logger.info(f"Found {len(conflicts)} APIs missing in documentation")
+        return conflicts
+
+    def _find_missing_in_code(self) -> List[Conflict]:
+        """Find APIs that are documented but don't exist in code."""
+        conflicts = []
+
+        for api_name, docs_info in self.docs_apis.items():
+            if api_name not in self.code_apis:
+                conflicts.append(Conflict(
+                    type='missing_in_code',
+                    severity='high',  # This is serious - documented but doesn't exist
+                    api_name=api_name,
+                    docs_info=docs_info,
+                    difference=f"API documented ({docs_info.get('source') or docs_info.get('source_url', 'unknown')}) but not found in code",
+                    suggestion="Update documentation to remove this API, or add it to codebase"
+                ))
+
+        logger.info(f"Found {len(conflicts)} APIs missing in code")
+        return conflicts
+
+    def _find_signature_mismatches(self) -> List[Conflict]:
+        """Find APIs where signature differs between docs and code."""
+        conflicts = []
+
+        # Find APIs that exist in both
+        common_apis = set(self.docs_apis.keys()) & set(self.code_apis.keys())
+
+        for api_name in common_apis:
+            docs_info = self.docs_apis[api_name]
+            code_info = self.code_apis[api_name]
+
+            # Compare signatures
+            mismatch = self._compare_signatures(docs_info, code_info)
+
+            if mismatch:
+                conflicts.append(Conflict(
+                    type='signature_mismatch',
+                    severity=mismatch['severity'],
+                    api_name=api_name,
+                    docs_info=docs_info,
+                    code_info=code_info,
+                    difference=mismatch['difference'],
+                    suggestion=mismatch['suggestion']
+                ))
+
+        logger.info(f"Found {len(conflicts)} signature mismatches")
+        return conflicts
+
+    def _compare_signatures(self, docs_info: Dict, code_info: Dict) -> Optional[Dict]:
+        """
+        Compare signatures between docs and code.
+
+        Returns:
+            Dict with mismatch details if conflict found, None otherwise
+        """
+        docs_params = docs_info.get('parameters', [])
+        code_params = code_info.get('parameters', [])
+
+        # Compare parameter counts
+        if len(docs_params) != len(code_params):
+            return {
+                'severity': 'medium',
+                'difference': f"Parameter count mismatch: docs has {len(docs_params)}, code has {len(code_params)}",
+                'suggestion': f"Update documentation to list the {len(code_params)} parameters found in code"
+            }
+
+        # Compare parameter names and types
+        for i, (doc_param, code_param) in enumerate(zip(docs_params, code_params)):
+            doc_name = doc_param.get('name', '')
+            code_name = code_param.get('name', '')
+
+            # Parameter name mismatch
+            if doc_name != code_name:
+                # Use fuzzy matching for slight variations
+                similarity = SequenceMatcher(None, doc_name, code_name).ratio()
+                if similarity < 0.8:  # Not similar enough
+                    return {
+                        'severity': 'medium',
+                        'difference': f"Parameter {i+1} name mismatch: '{doc_name}' in docs vs '{code_name}' in code",
+                        'suggestion': f"Update documentation to use parameter name '{code_name}'"
+                    }
+
+            # Type mismatch
+            doc_type = doc_param.get('type')
+            code_type = code_param.get('type_hint')
+
+            if doc_type and code_type and doc_type != code_type:
+                return {
+                    'severity': 'low',
+                    'difference': f"Parameter '{doc_name}' type mismatch: '{doc_type}' in docs vs '{code_type}' in code",
+                    'suggestion': f"Verify correct type for parameter '{doc_name}'"
+                }
+
+        # Compare return types if both have them
+        docs_return = docs_info.get('return_type')
+        code_return = code_info.get('return_type')
+
+        if docs_return and code_return and docs_return != code_return:
+            return {
+                'severity': 'low',
+                'difference': f"Return type mismatch: '{docs_return}' in docs vs '{code_return}' in code",
+                'suggestion': "Verify correct return type"
+            }
+
+        return None
+
+    def generate_summary(self, conflicts: List[Conflict]) -> Dict[str, Any]:
+        """
+        Generate summary statistics for conflicts.
+
+        Args:
+            conflicts: List of Conflict objects
+
+        Returns:
+            Summary dict with statistics
+        """
+        summary = {
+            'total': len(conflicts),
+            'by_type': {},
+            'by_severity': {},
+            'apis_affected': len(set(c.api_name for c in conflicts))
+        }
+
+        # Count by type
+        for conflict_type in ['missing_in_docs', 'missing_in_code', 'signature_mismatch', 'description_mismatch']:
+            count = sum(1 for c in conflicts if c.type == conflict_type)
+            summary['by_type'][conflict_type] = count
+
+        # Count by severity
+        for severity in ['low', 'medium', 'high']:
+            count = sum(1 for c in conflicts if c.severity == severity)
+            summary['by_severity'][severity] = count
+
+        return summary
+
+    def save_conflicts(self, conflicts: List[Conflict], output_path: str):
+        """
+        Save conflicts to JSON file.
+
+        Args:
+            conflicts: List of Conflict objects
+            output_path: Path to output JSON file
+        """
+        data = {
+            'conflicts': [asdict(c) for c in conflicts],
+            'summary': self.generate_summary(conflicts)
+        }
+
+        with open(output_path, 'w', encoding='utf-8') as f:
+            json.dump(data, f, indent=2, ensure_ascii=False)
+
+        logger.info(f"Conflicts saved to: {output_path}")
+
+
+if __name__ == '__main__':
+    import sys
+
+    if len(sys.argv) < 3:
+        print("Usage: python conflict_detector.py <docs_data.json> <github_data.json>")
+        sys.exit(1)
+
+    docs_file = sys.argv[1]
+    github_file = sys.argv[2]
+
+    # Load data
+    with open(docs_file, 'r', encoding='utf-8') as f:
+        docs_data = json.load(f)
+
+    with open(github_file, 'r', encoding='utf-8') as f:
+        github_data = json.load(f)
+
+    # Detect conflicts
+    detector = ConflictDetector(docs_data, github_data)
+    conflicts = detector.detect_all_conflicts()
+
+    # Print summary
+    summary = detector.generate_summary(conflicts)
+    print("\n📊 Conflict Summary:")
+    print(f"   Total conflicts: {summary['total']}")
+    print(f"   APIs affected: {summary['apis_affected']}")
+    print("\n   By Type:")
+    for conflict_type, count in summary['by_type'].items():
+        if count > 0:
+            print(f"     {conflict_type}: {count}")
+    print("\n   By Severity:")
+    for severity, count in summary['by_severity'].items():
+        if count > 0:
+            emoji = '🔴' if severity == 'high' else '🟡' if severity == 'medium' else '🟢'
+            print(f"     {emoji} {severity}: {count}")
+
+    # Save to file
+    output_file = 'conflicts.json'
+    detector.save_conflicts(conflicts, output_file)
+    print(f"\n✅ Full report saved to: {output_file}")
--- a/src/skill_seekers/cli/constants.py
+++ b/src/skill_seekers/cli/constants.py
@@ -0,0 +1,72 @@
+"""Configuration constants for Skill Seekers CLI.
+
+This module centralizes all magic numbers and configuration values used
+across the CLI tools to improve maintainability and clarity.
+"""
+
+# ===== SCRAPING CONFIGURATION =====
+
+# Default scraping limits
+DEFAULT_RATE_LIMIT = 0.5  # seconds between requests
+DEFAULT_MAX_PAGES = 500   # maximum pages to scrape
+DEFAULT_CHECKPOINT_INTERVAL = 1000  # pages between checkpoints
+DEFAULT_ASYNC_MODE = False  # use async mode for parallel scraping (opt-in)
+
+# Content analysis limits
+CONTENT_PREVIEW_LENGTH = 500  # characters to check for categorization
+MAX_PAGES_WARNING_THRESHOLD = 10000  # warn if config exceeds this
+
+# Quality thresholds
+MIN_CATEGORIZATION_SCORE = 2  # minimum score for category assignment
+URL_MATCH_POINTS = 3  # points for URL keyword match
+TITLE_MATCH_POINTS = 2  # points for title keyword match
+CONTENT_MATCH_POINTS = 1  # points for content keyword match
+
+# ===== ENHANCEMENT CONFIGURATION =====
+
+# API-based enhancement limits (uses Anthropic API)
+API_CONTENT_LIMIT = 100000  # max characters for API enhancement
+API_PREVIEW_LIMIT = 40000   # max characters for preview
+
+# Local enhancement limits (uses Claude Code Max)
+LOCAL_CONTENT_LIMIT = 50000  # max characters for local enhancement
+LOCAL_PREVIEW_LIMIT = 20000  # max characters for preview
+
+# ===== PAGE ESTIMATION =====
+
+# Estimation and discovery settings
+DEFAULT_MAX_DISCOVERY = 1000  # default max pages to discover
+DISCOVERY_THRESHOLD = 10000   # threshold for warnings
+
+# ===== FILE LIMITS =====
+
+# Output and processing limits
+MAX_REFERENCE_FILES = 100  # maximum reference files per skill
+MAX_CODE_BLOCKS_PER_PAGE = 5  # maximum code blocks to extract per page
+
+# ===== EXPORT CONSTANTS =====
+
+__all__ = [
+    # Scraping
+    'DEFAULT_RATE_LIMIT',
+    'DEFAULT_MAX_PAGES',
+    'DEFAULT_CHECKPOINT_INTERVAL',
+    'DEFAULT_ASYNC_MODE',
+    'CONTENT_PREVIEW_LENGTH',
+    'MAX_PAGES_WARNING_THRESHOLD',
+    'MIN_CATEGORIZATION_SCORE',
+    'URL_MATCH_POINTS',
+    'TITLE_MATCH_POINTS',
+    'CONTENT_MATCH_POINTS',
+    # Enhancement
+    'API_CONTENT_LIMIT',
+    'API_PREVIEW_LIMIT',
+    'LOCAL_CONTENT_LIMIT',
+    'LOCAL_PREVIEW_LIMIT',
+    # Estimation
+    'DEFAULT_MAX_DISCOVERY',
+    'DISCOVERY_THRESHOLD',
+    # Limits
+    'MAX_REFERENCE_FILES',
+    'MAX_CODE_BLOCKS_PER_PAGE',
+]
--- a/src/skill_seekers/cli/doc_scraper.py
+++ b/src/skill_seekers/cli/doc_scraper.py
--- a/src/skill_seekers/cli/enhance_skill.py
+++ b/src/skill_seekers/cli/enhance_skill.py
@@ -0,0 +1,273 @@
+#!/usr/bin/env python3
+"""
+SKILL.md Enhancement Script
+Uses Claude API to improve SKILL.md by analyzing reference documentation.
+
+Usage:
+    python3 cli/enhance_skill.py output/steam-inventory/
+    python3 cli/enhance_skill.py output/react/
+    python3 cli/enhance_skill.py output/godot/ --api-key YOUR_API_KEY
+"""
+
+import os
+import sys
+import json
+import argparse
+from pathlib import Path
+
+# Add the src/ directory to the path so 'skill_seekers' imports resolve when run as a script
+sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+
+from skill_seekers.cli.constants import API_CONTENT_LIMIT, API_PREVIEW_LIMIT
+from skill_seekers.cli.utils import read_reference_files
+
+try:
+    import anthropic
+except ImportError:
+    print("❌ Error: anthropic package not installed")
+    print("Install with: pip3 install anthropic")
+    sys.exit(1)
+
+
+class SkillEnhancer:
+    def __init__(self, skill_dir, api_key=None):
+        self.skill_dir = Path(skill_dir)
+        self.references_dir = self.skill_dir / "references"
+        self.skill_md_path = self.skill_dir / "SKILL.md"
+
+        # Get API key
+        self.api_key = api_key or os.environ.get('ANTHROPIC_API_KEY')
+        if not self.api_key:
+            raise ValueError(
+                "No API key provided. Set ANTHROPIC_API_KEY environment variable "
+                "or use --api-key argument"
+            )
+
+        self.client = anthropic.Anthropic(api_key=self.api_key)
+
+    def read_current_skill_md(self):
+        """Read existing SKILL.md"""
+        if not self.skill_md_path.exists():
+            return None
+        return self.skill_md_path.read_text(encoding='utf-8')
+
+    def enhance_skill_md(self, references, current_skill_md):
+        """Use Claude to enhance SKILL.md"""
+
+        # Build prompt
+        prompt = self._build_enhancement_prompt(references, current_skill_md)
+
+        print("\n🤖 Asking Claude to enhance SKILL.md...")
+        print(f"   Input: {len(prompt):,} characters")
+
+        try:
+            message = self.client.messages.create(
+                model="claude-sonnet-4-20250514",
+                max_tokens=4096,
+                temperature=0.3,
+                messages=[{
+                    "role": "user",
+                    "content": prompt
+                }]
+            )
+
+            enhanced_content = message.content[0].text
+            return enhanced_content
+
+        except Exception as e:
+            print(f"❌ Error calling Claude API: {e}")
+            return None
+
+    def _build_enhancement_prompt(self, references, current_skill_md):
+        """Build the prompt for Claude"""
+
+        # Extract skill name and description
+        skill_name = self.skill_dir.name
+
+        prompt = f"""You are enhancing a Claude skill's SKILL.md file. This skill is about: {skill_name}
+
+I've scraped documentation and organized it into reference files. Your job is to create an EXCELLENT SKILL.md that will help Claude use this documentation effectively.
+
+CURRENT SKILL.MD:
+{'```markdown' if current_skill_md else '(none - create from scratch)'}
+{current_skill_md or 'No existing SKILL.md'}
+{'```' if current_skill_md else ''}
+
+REFERENCE DOCUMENTATION:
+"""
+
+        for filename, content in references.items():
+            prompt += f"\n\n## {filename}\n```markdown\n{content[:30000]}\n```\n"
+
+        prompt += """
+
+YOUR TASK:
+Create an enhanced SKILL.md that includes:
+
+1. **Clear "When to Use This Skill" section** - Be specific about trigger conditions
+2. **Excellent Quick Reference section** - Extract 5-10 of the BEST, most practical code examples from the reference docs
+   - Choose SHORT, clear examples that demonstrate common tasks
+   - Include both simple and intermediate examples
+   - Annotate examples with clear descriptions
+   - Use proper language tags (cpp, python, javascript, json, etc.)
+3. **Detailed Reference Files description** - Explain what's in each reference file
+4. **Practical "Working with This Skill" section** - Give users clear guidance on how to navigate the skill
+5. **Key Concepts section** (if applicable) - Explain core concepts
+6. **Keep the frontmatter** (---\nname: ...\n---) intact
+
+IMPORTANT:
+- Extract REAL examples from the reference docs, don't make them up
+- Prioritize SHORT, clear examples (5-20 lines max)
+- Make it actionable and practical
+- Don't be too verbose - be concise but useful
+- Maintain the markdown structure for Claude skills
+- Keep code examples properly formatted with language tags
+
+OUTPUT:
+Return ONLY the complete SKILL.md content, starting with the frontmatter (---).
+"""
+
+        return prompt
+
+    def save_enhanced_skill_md(self, content):
+        """Save the enhanced SKILL.md"""
+        # Backup original
+        if self.skill_md_path.exists():
+            backup_path = self.skill_md_path.with_suffix('.md.backup')
+            self.skill_md_path.replace(backup_path)
+            print(f"  💾 Backed up original to: {backup_path.name}")
+
+        # Save enhanced version
+        self.skill_md_path.write_text(content, encoding='utf-8')
+        print("  ✅ Saved enhanced SKILL.md")
+
+    def run(self):
+        """Main enhancement workflow"""
+        print(f"\n{'='*60}")
+        print(f"ENHANCING SKILL: {self.skill_dir.name}")
+        print(f"{'='*60}\n")
+
+        # Read reference files
+        print("📖 Reading reference documentation...")
+        references = read_reference_files(
+            self.skill_dir,
+            max_chars=API_CONTENT_LIMIT,
+            preview_limit=API_PREVIEW_LIMIT
+        )
+
+        if not references:
+            print("❌ No reference files found to analyze")
+            return False
+
+        print(f"  ✓ Read {len(references)} reference files")
+        total_size = sum(len(c) for c in references.values())
+        print(f"  ✓ Total size: {total_size:,} characters\n")
+
+        # Read current SKILL.md
+        current_skill_md = self.read_current_skill_md()
+        if current_skill_md:
+            print(f"  ℹ Found existing SKILL.md ({len(current_skill_md)} chars)")
+        else:
+            print("  ℹ No existing SKILL.md, will create new one")
+
+        # Enhance with Claude
+        enhanced = self.enhance_skill_md(references, current_skill_md)
+
+        if not enhanced:
+            print("❌ Enhancement failed")
+            return False
+
+        print(f"  ✓ Generated enhanced SKILL.md ({len(enhanced)} chars)\n")
+
+        # Save
+        print("💾 Saving enhanced SKILL.md...")
+        self.save_enhanced_skill_md(enhanced)
+
+        print(f"\n✅ Enhancement complete!")
+        print(f"\nNext steps:")
+        print(f"  1. Review: {self.skill_md_path}")
+        print(f"  2. If you don't like it, restore backup: {self.skill_md_path.with_suffix('.md.backup')}")
+        print(f"  3. Package your skill:")
+        print(f"     python3 cli/package_skill.py {self.skill_dir}/")
+
+        return True
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='Enhance SKILL.md using Claude API',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Using ANTHROPIC_API_KEY environment variable
+  export ANTHROPIC_API_KEY=sk-ant-...
+  python3 cli/enhance_skill.py output/steam-inventory/
+
+  # Providing API key directly
+  python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...
+
+  # Show what would be done (dry run)
+  python3 cli/enhance_skill.py output/godot/ --dry-run
+"""
+    )
+
+    parser.add_argument('skill_dir', type=str,
+                       help='Path to skill directory (e.g., output/steam-inventory/)')
+    parser.add_argument('--api-key', type=str,
+                       help='Anthropic API key (or set ANTHROPIC_API_KEY env var)')
+    parser.add_argument('--dry-run', action='store_true',
+                       help='Show what would be done without calling API')
+
+    args = parser.parse_args()
+
+    # Validate skill directory
+    skill_dir = Path(args.skill_dir)
+    if not skill_dir.exists():
+        print(f"❌ Error: Directory not found: {skill_dir}")
+        sys.exit(1)
+
+    if not skill_dir.is_dir():
+        print(f"❌ Error: Not a directory: {skill_dir}")
+        sys.exit(1)
+
+    # Dry run mode
+    if args.dry_run:
+        print(f"🔍 DRY RUN MODE")
+        print(f"   Would enhance: {skill_dir}")
+        print(f"   References: {skill_dir / 'references'}")
+        print(f"   SKILL.md: {skill_dir / 'SKILL.md'}")
+
+        refs_dir = skill_dir / "references"
+        if refs_dir.exists():
+            ref_files = list(refs_dir.glob("*.md"))
+            print(f"   Found {len(ref_files)} reference files:")
+            for rf in ref_files:
+                size = rf.stat().st_size
+                print(f"     - {rf.name} ({size:,} bytes)")
+
+        print("\nTo actually run enhancement:")
+        print(f"  python3 cli/enhance_skill.py {skill_dir}")
+        return
+
+    # Create enhancer and run
+    try:
+        enhancer = SkillEnhancer(skill_dir, api_key=args.api_key)
+        success = enhancer.run()
+        sys.exit(0 if success else 1)
+
+    except ValueError as e:
+        print(f"❌ Error: {e}")
+        print("\nSet your API key:")
+        print("  export ANTHROPIC_API_KEY=sk-ant-...")
+        print("Or provide it directly:")
+        print(f"  python3 cli/enhance_skill.py {skill_dir} --api-key sk-ant-...")
+        sys.exit(1)
+    except Exception as e:
+        print(f"❌ Unexpected error: {e}")
+        import traceback
+        traceback.print_exc()
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
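Review note on `save_enhanced_skill_md` above: the backup-then-write step can be sketched in isolation as follows. This is a minimal illustration, not the project's code; `save_with_backup` is a hypothetical helper name, and the temporary directory stands in for a real skill directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def save_with_backup(path: Path, content: str) -> Optional[Path]:
    """Rename an existing file to '<name>.backup', then write the new content.

    Hypothetical helper for illustration only. Returns the backup path,
    or None when there was no previous file to back up.
    """
    backup_path = None
    if path.exists():
        # SKILL.md -> SKILL.md.backup (same result as with_suffix('.md.backup'))
        backup_path = path.with_suffix(path.suffix + ".backup")
        path.rename(backup_path)
    path.write_text(content, encoding="utf-8")
    return backup_path


# Exercise the helper in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    skill_md = Path(tmp) / "SKILL.md"
    skill_md.write_text("old", encoding="utf-8")
    backup = save_with_backup(skill_md, "new")
    result = (
        skill_md.read_text(encoding="utf-8"),
        backup.name,
        backup.read_text(encoding="utf-8"),
    )
```

Because the original is renamed rather than copied, a second run replaces the earlier backup on POSIX systems, so only the most recent previous version is kept.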
--- a/src/skill_seekers/cli/enhance_skill_local.py
+++ b/src/skill_seekers/cli/enhance_skill_local.py
@@ -0,0 +1,303 @@
+#!/usr/bin/env python3
+"""
+SKILL.md Enhancement Script (Local - Using Claude Code)
+Opens a new terminal with Claude Code to enhance SKILL.md, then reports back.
+No API key needed - uses your existing Claude Code Max plan!
+
+Usage:
+    python3 cli/enhance_skill_local.py output/steam-inventory/
+    python3 cli/enhance_skill_local.py output/react/
+
+Terminal Selection:
+    The script automatically detects which terminal app to use:
+    1. SKILL_SEEKER_TERMINAL env var (highest priority)
+       Example: export SKILL_SEEKER_TERMINAL="Ghostty"
+    2. TERM_PROGRAM env var (current terminal)
+    3. Terminal.app (fallback)
+
+    Supported terminals: Ghostty, iTerm, Terminal, WezTerm
+"""
+
+import os
+import sys
+import time
+import subprocess
+import tempfile
+from pathlib import Path
+
+# Add the src/ directory to the path so `skill_seekers` is importable when run as a script
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+
+from skill_seekers.cli.constants import LOCAL_CONTENT_LIMIT, LOCAL_PREVIEW_LIMIT
+from skill_seekers.cli.utils import read_reference_files
+
+
+def detect_terminal_app():
+    """Detect which terminal app to use with cascading priority.
+
+    Priority order:
+        1. SKILL_SEEKER_TERMINAL environment variable (explicit user preference)
+        2. TERM_PROGRAM environment variable (inherit current terminal)
+        3. Terminal.app (fallback default)
+
+    Returns:
+        tuple: (terminal_app_name, detection_method)
+            - terminal_app_name (str): Name of terminal app to launch (e.g., "Ghostty", "Terminal")
+            - detection_method (str): How the terminal was detected (for logging)
+
+    Examples:
+        >>> os.environ['SKILL_SEEKER_TERMINAL'] = 'Ghostty'
+        >>> detect_terminal_app()
+        ('Ghostty', 'SKILL_SEEKER_TERMINAL')
+
+        >>> del os.environ['SKILL_SEEKER_TERMINAL']
+        >>> os.environ['TERM_PROGRAM'] = 'iTerm.app'
+        >>> detect_terminal_app()
+        ('iTerm', 'TERM_PROGRAM')
+    """
+    # Map TERM_PROGRAM values to macOS app names
+    TERMINAL_MAP = {
+        'Apple_Terminal': 'Terminal',
+        'iTerm.app': 'iTerm',
+        'ghostty': 'Ghostty',
+        'WezTerm': 'WezTerm',
+    }
+
+    # Priority 1: Check SKILL_SEEKER_TERMINAL env var (explicit preference)
+    preferred_terminal = os.environ.get('SKILL_SEEKER_TERMINAL', '').strip()
+    if preferred_terminal:
+        return preferred_terminal, 'SKILL_SEEKER_TERMINAL'
+
+    # Priority 2: Check TERM_PROGRAM (inherit current terminal)
+    term_program = os.environ.get('TERM_PROGRAM', '').strip()
+    if term_program and term_program in TERMINAL_MAP:
+        return TERMINAL_MAP[term_program], 'TERM_PROGRAM'
+
+    # Priority 3: Fallback to Terminal.app
+    if term_program:
+        # TERM_PROGRAM is set but unknown
+        return 'Terminal', f'unknown TERM_PROGRAM ({term_program})'
+    else:
+        # No TERM_PROGRAM set
+        return 'Terminal', 'default'
+
+
+class LocalSkillEnhancer:
+    def __init__(self, skill_dir):
+        self.skill_dir = Path(skill_dir)
+        self.references_dir = self.skill_dir / "references"
+        self.skill_md_path = self.skill_dir / "SKILL.md"
+
+    def create_enhancement_prompt(self):
+        """Create the prompt file for Claude Code"""
+
+        # Read reference files
+        references = read_reference_files(
+            self.skill_dir,
+            max_chars=LOCAL_CONTENT_LIMIT,
+            preview_limit=LOCAL_PREVIEW_LIMIT
+        )
+
+        if not references:
+            print("❌ No reference files found")
+            return None
+
+        # Read current SKILL.md
+        current_skill_md = ""
+        if self.skill_md_path.exists():
+            current_skill_md = self.skill_md_path.read_text(encoding='utf-8')
+
+        # Build prompt
+        prompt = f"""I need you to enhance the SKILL.md file for the {self.skill_dir.name} skill.
+
+CURRENT SKILL.MD:
+{'-'*60}
+{current_skill_md if current_skill_md else '(No existing SKILL.md - create from scratch)'}
+{'-'*60}
+
+REFERENCE DOCUMENTATION:
+{'-'*60}
+"""
+
+        for filename, content in references.items():
+            prompt += f"\n## {filename}\n{content[:15000]}\n"
+
+        prompt += f"""
+{'-'*60}
+
+YOUR TASK:
+Create an EXCELLENT SKILL.md file that will help Claude use this documentation effectively.
+
+Requirements:
+1. **Clear "When to Use This Skill" section**
+   - Be SPECIFIC about trigger conditions
+   - List concrete use cases
+
+2. **Excellent Quick Reference section**
+   - Extract 5-10 of the BEST, most practical code examples from the reference docs
+   - Choose SHORT, clear examples (5-20 lines max)
+   - Include both simple and intermediate examples
+   - Use proper language tags (cpp, python, javascript, json, etc.)
+   - Add clear descriptions for each example
+
+3. **Detailed Reference Files description**
+   - Explain what's in each reference file
+   - Help users navigate the documentation
+
+4. **Practical "Working with This Skill" section**
+   - Clear guidance for beginners, intermediate, and advanced users
+   - Navigation tips
+
+5. **Key Concepts section** (if applicable)
+   - Explain core concepts
+   - Define important terminology
+
+IMPORTANT:
+- Extract REAL examples from the reference docs above
+- Prioritize SHORT, clear examples
+- Make it actionable and practical
+- Keep the frontmatter (---\\nname: ...\\n---) intact
+- Use proper markdown formatting
+
+SAVE THE RESULT:
+Save the complete enhanced SKILL.md to: {self.skill_md_path.absolute()}
+
+First, backup the original to: {self.skill_md_path.with_suffix('.md.backup').absolute()}
+"""
+
+        return prompt
+
+    def run(self):
+        """Main enhancement workflow"""
+        print(f"\n{'='*60}")
+        print(f"LOCAL ENHANCEMENT: {self.skill_dir.name}")
+        print(f"{'='*60}\n")
+
+        # Validate
+        if not self.skill_dir.exists():
+            print(f"❌ Directory not found: {self.skill_dir}")
+            return False
+
+        # Read reference files
+        print("📖 Reading reference documentation...")
+        references = read_reference_files(
+            self.skill_dir,
+            max_chars=LOCAL_CONTENT_LIMIT,
+            preview_limit=LOCAL_PREVIEW_LIMIT
+        )
+
+        if not references:
+            print("❌ No reference files found to analyze")
+            return False
+
+        print(f"  ✓ Read {len(references)} reference files")
+        total_size = sum(len(c) for c in references.values())
+        print(f"  ✓ Total size: {total_size:,} characters\n")
+
+        # Create prompt
+        print("📝 Creating enhancement prompt...")
+        prompt = self.create_enhancement_prompt()
+
+        if not prompt:
+            return False
+
+        # Save prompt to temp file
+        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f:
+            prompt_file = f.name
+            f.write(prompt)
+
+        print(f"  ✓ Prompt saved ({len(prompt):,} characters)\n")
+
+        # Launch Claude Code in new terminal
+        print("🚀 Launching Claude Code in new terminal...")
+        print("   This will:")
+        print("   1. Open a new terminal window")
+        print("   2. Run Claude Code with the enhancement task")
+        print("   3. Claude will read the docs and enhance SKILL.md")
+        print("   4. Terminal will wait for a key press when done")
+        print()
+
+        # Create a shell script to run in the terminal
+        shell_script = f'''#!/bin/bash
+claude "{prompt_file}"
+echo ""
+echo "✅ Enhancement complete!"
+echo "Press any key to close..."
+read -n 1
+rm "{prompt_file}"
+'''
+
+        # Save shell script
+        with tempfile.NamedTemporaryFile(mode='w', suffix='.sh', delete=False) as f:
+            script_file = f.name
+            f.write(shell_script)
+
+        os.chmod(script_file, 0o755)
+
+        # Launch in new terminal (macOS specific)
+        if sys.platform == 'darwin':
+            # Detect which terminal app to use
+            terminal_app, detection_method = detect_terminal_app()
+
+            # Show detection info
+            if detection_method == 'SKILL_SEEKER_TERMINAL':
+                print(f"   Using terminal: {terminal_app} (from SKILL_SEEKER_TERMINAL)")
+            elif detection_method == 'TERM_PROGRAM':
+                print(f"   Using terminal: {terminal_app} (inherited from current terminal)")
+            elif detection_method.startswith('unknown TERM_PROGRAM'):
+                print(f"⚠️  {detection_method}")
+                print(f"   → Using Terminal.app as fallback")
+            else:
+                print(f"   Using terminal: {terminal_app} (default)")
+
+            try:
+                subprocess.Popen(['open', '-a', terminal_app, script_file])
+            except Exception as e:
+                print(f"⚠️  Error launching {terminal_app}: {e}")
+                print(f"\nManually run: {script_file}")
+                return False
+        else:
+            print("⚠️  Auto-launch only works on macOS")
+            print(f"\nManually run this command in a new terminal:")
+            print(f"  claude '{prompt_file}'")
+            print(f"\nThen delete the prompt file:")
+            print(f"  rm '{prompt_file}'")
+            return False
+
+        print("✅ New terminal launched with Claude Code!")
+        print()
+        print("📊 Status:")
+        print(f"  - Prompt file: {prompt_file}")
+        print(f"  - Skill directory: {self.skill_dir.absolute()}")
+        print(f"  - SKILL.md will be saved to: {self.skill_md_path.absolute()}")
+        print(f"  - Original backed up to: {self.skill_md_path.with_suffix('.md.backup').absolute()}")
+        print()
+        print("⏳ Wait for Claude Code to finish in the other terminal...")
+        print("   (Usually takes 30-60 seconds)")
+        print()
+        print("💡 When done:")
+        print(f"  1. Check the enhanced SKILL.md: {self.skill_md_path}")
+        print(f"  2. If you don't like it, restore: mv {self.skill_md_path.with_suffix('.md.backup')} {self.skill_md_path}")
+        print(f"  3. Package: python3 cli/package_skill.py {self.skill_dir}/")
+
+        return True
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: python3 cli/enhance_skill_local.py <skill_directory>")
+        print()
+        print("Examples:")
+        print("  python3 cli/enhance_skill_local.py output/steam-inventory/")
+        print("  python3 cli/enhance_skill_local.py output/react/")
+        sys.exit(1)
+
+    skill_dir = sys.argv[1]
+
+    enhancer = LocalSkillEnhancer(skill_dir)
+    success = enhancer.run()
+
+    sys.exit(0 if success else 1)
+
+
+if __name__ == "__main__":
+    main()
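Review note on `detect_terminal_app` above: the cascading priority is easier to test when the environment is passed in as a plain mapping. The sketch below assumes a hypothetical `pick_terminal(env)` variant that mirrors the same three-step order; it is not part of the diff.

```python
# Same TERM_PROGRAM -> macOS app name mapping as in the diff above.
TERMINAL_MAP = {
    "Apple_Terminal": "Terminal",
    "iTerm.app": "iTerm",
    "ghostty": "Ghostty",
    "WezTerm": "WezTerm",
}


def pick_terminal(env):
    """Return (app_name, detection_method) for a given environment mapping.

    Hypothetical, side-effect-free variant of detect_terminal_app().
    """
    # Priority 1: explicit user preference always wins.
    preferred = env.get("SKILL_SEEKER_TERMINAL", "").strip()
    if preferred:
        return preferred, "SKILL_SEEKER_TERMINAL"

    # Priority 2: inherit the current terminal when it is a known one.
    term_program = env.get("TERM_PROGRAM", "").strip()
    if term_program in TERMINAL_MAP:
        return TERMINAL_MAP[term_program], "TERM_PROGRAM"

    # Priority 3: fall back to Terminal.app, recording why.
    if term_program:
        return "Terminal", f"unknown TERM_PROGRAM ({term_program})"
    return "Terminal", "default"


explicit = pick_terminal({"SKILL_SEEKER_TERMINAL": "Ghostty", "TERM_PROGRAM": "iTerm.app"})
inherited = pick_terminal({"TERM_PROGRAM": "iTerm.app"})
unknown = pick_terminal({"TERM_PROGRAM": "vscode"})
default = pick_terminal({})
```

The first call shows why the order matters: when both variables are set, the explicit preference is returned and `TERM_PROGRAM` is never consulted.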
--- a/src/skill_seekers/cli/estimate_pages.py
+++ b/src/skill_seekers/cli/estimate_pages.py
@@ -0,0 +1,288 @@
+#!/usr/bin/env python3
+"""
+Page Count Estimator for Skill Seeker
+Quickly estimates how many pages a config will scrape without downloading content
+"""
+
+import sys
+import os
+import requests
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin, urlparse
+import time
+import json
+
+# Add the src/ directory to the path so `skill_seekers` is importable when run as a script
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+
+from skill_seekers.cli.constants import (
+    DEFAULT_RATE_LIMIT,
+    DEFAULT_MAX_DISCOVERY,
+    DISCOVERY_THRESHOLD
+)
+
+
+def estimate_pages(config, max_discovery=DEFAULT_MAX_DISCOVERY, timeout=30):
+    """
+    Estimate total pages that will be scraped
+
+    Args:
+        config: Configuration dictionary
+        max_discovery: Maximum pages to discover (safety limit, use -1 for unlimited)
+        timeout: Timeout for HTTP requests in seconds
+
+    Returns:
+        dict with estimation results
+    """
+    base_url = config['base_url']
+    start_urls = config.get('start_urls', [base_url])
+    url_patterns = config.get('url_patterns', {'include': [], 'exclude': []})
+    rate_limit = config.get('rate_limit', DEFAULT_RATE_LIMIT)
+
+    visited = set()
+    pending = list(start_urls)
+    discovered = 0
+
+    include_patterns = url_patterns.get('include', [])
+    exclude_patterns = url_patterns.get('exclude', [])
+
+    # Handle unlimited mode
+    unlimited = (max_discovery == -1 or max_discovery is None)
+
+    print(f"🔍 Estimating pages for: {config['name']}")
+    print(f"📍 Base URL: {base_url}")
+    print(f"🎯 Start URLs: {len(start_urls)}")
+    print(f"⏱️  Rate limit: {rate_limit}s")
+
+    if unlimited:
+        print(f"🔢 Max discovery: UNLIMITED (will discover all pages)")
+        print(f"⚠️  WARNING: This may take a long time!")
+    else:
+        print(f"🔢 Max discovery: {max_discovery}")
+
+    print()
+
+    start_time = time.time()
+
+    # Loop condition: stop if no more URLs, or if limit reached (when not unlimited)
+    while pending and (unlimited or discovered < max_discovery):
+        url = pending.pop(0)
+
+        # Skip if already visited
+        if url in visited:
+            continue
+
+        visited.add(url)
+        discovered += 1
+
+        # Progress indicator
+        if discovered % 10 == 0:
+            elapsed = time.time() - start_time
+            rate = discovered / elapsed if elapsed > 0 else 0
+            print(f"⏳ Discovered: {discovered} pages ({rate:.1f} pages/sec)", end='\r')
+
+        try:
+            # HEAD request first to check if page exists (faster)
+            head_response = requests.head(url, timeout=timeout, allow_redirects=True)
+
+            # Skip non-HTML content
+            content_type = head_response.headers.get('Content-Type', '')
+            if 'text/html' not in content_type:
+                continue
+
+            # Now GET the page to find links
+            response = requests.get(url, timeout=timeout)
+            response.raise_for_status()
+
+            soup = BeautifulSoup(response.content, 'html.parser')
+
+            # Find all links
+            for link in soup.find_all('a', href=True):
+                href = link['href']
+                full_url = urljoin(url, href)
+
+                # Normalize URL
+                parsed = urlparse(full_url)
+                full_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
+
+                # Check if URL is valid
+                if not is_valid_url(full_url, base_url, include_patterns, exclude_patterns):
+                    continue
+
+                # Add to pending if not visited
+                if full_url not in visited and full_url not in pending:
+                    pending.append(full_url)
+
+            # Rate limiting
+            time.sleep(rate_limit)
+
+        except requests.RequestException:
+            # Silently skip network errors during estimation
+            pass
+        except Exception:
+            # Silently skip any other errors (e.g. parse failures)
+            pass
+
+    elapsed = time.time() - start_time
+
+    # Results
+    results = {
+        'discovered': discovered,
+        'pending': len(pending),
+        'estimated_total': discovered + len(pending),
+        'elapsed_seconds': round(elapsed, 2),
+        'discovery_rate': round(discovered / elapsed if elapsed > 0 else 0, 2),
+        'hit_limit': (not unlimited) and (discovered >= max_discovery),
+        'unlimited': unlimited
+    }
+
+    return results
+
+
+def is_valid_url(url, base_url, include_patterns, exclude_patterns):
+    """Check if URL should be crawled"""
+    # Must be same domain
+    if not url.startswith(base_url.rstrip('/')):
+        return False
+
+    # Check exclude patterns first
+    if exclude_patterns:
+        for pattern in exclude_patterns:
+            if pattern in url:
+                return False
+
+    # Check include patterns (if specified)
+    if include_patterns:
+        for pattern in include_patterns:
+            if pattern in url:
+                return True
+        return False
+
+    # If no include patterns, accept by default
+    return True
+
+
+def print_results(results, config):
+    """Print estimation results"""
+    print()
+    print("=" * 70)
+    print("📊 ESTIMATION RESULTS")
+    print("=" * 70)
+    print()
+    print(f"Config: {config['name']}")
+    print(f"Base URL: {config['base_url']}")
+    print()
+    print(f"✅ Pages Discovered: {results['discovered']}")
+    print(f"⏳ Pages Pending: {results['pending']}")
+    print(f"📈 Estimated Total: {results['estimated_total']}")
+    print()
+    print(f"⏱️  Time Elapsed: {results['elapsed_seconds']}s")
+    print(f"⚡ Discovery Rate: {results['discovery_rate']} pages/sec")
+
+    if results.get('unlimited', False):
+        print()
+        print("✅ UNLIMITED MODE - Discovered all reachable pages")
+        print(f"   Total pages: {results['estimated_total']}")
+    elif results['hit_limit']:
+        print()
+        print("⚠️  Hit discovery limit - actual total may be higher")
+        print("   Increase max_discovery parameter for more accurate estimate")
+
+    print()
+    print("=" * 70)
+    print("💡 RECOMMENDATIONS")
+    print("=" * 70)
+    print()
+
+    estimated = results['estimated_total']
+    current_max = config.get('max_pages', 100)
+
+    if estimated <= current_max:
+        print(f"✅ Current max_pages ({current_max}) is sufficient")
+    else:
+        recommended = min(estimated + 50, DISCOVERY_THRESHOLD)  # Add 50 buffer, cap at threshold
+        print(f"⚠️  Current max_pages ({current_max}) may be too low")
+        print(f"📝 Recommended max_pages: {recommended}")
+        print(f"   (Estimated {estimated} + 50 buffer)")
+
+    # Estimate time for full scrape
+    rate_limit = config.get('rate_limit', DEFAULT_RATE_LIMIT)
+    estimated_time = (estimated * rate_limit) / 60  # in minutes
+
+    print()
+    print(f"⏱️  Estimated full scrape time: {estimated_time:.1f} minutes")
+    print(f"   (Based on rate_limit: {rate_limit}s)")
+
+    print()
+
+
+def load_config(config_path):
+    """Load configuration from JSON file"""
+    try:
+        with open(config_path, 'r') as f:
+            config = json.load(f)
+        return config
+    except FileNotFoundError:
+        print(f"❌ Error: Config file not found: {config_path}")
+        sys.exit(1)
+    except json.JSONDecodeError as e:
+        print(f"❌ Error: Invalid JSON in config file: {e}")
+        sys.exit(1)
+
+
+def main():
+    """Main entry point"""
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description='Estimate page count for Skill Seeker configs',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Estimate pages for a config
+  python3 cli/estimate_pages.py configs/react.json
+
+  # Estimate with higher discovery limit
+  python3 cli/estimate_pages.py configs/godot.json --max-discovery 2000
+
+  # Quick estimate (stop at 100 pages)
+  python3 cli/estimate_pages.py configs/vue.json --max-discovery 100
+        """
+    )
+
+    parser.add_argument('config', help='Path to config JSON file')
+    parser.add_argument('--max-discovery', '-m', type=int, default=DEFAULT_MAX_DISCOVERY,
+                       help=f'Maximum pages to discover (default: {DEFAULT_MAX_DISCOVERY}, use -1 for unlimited)')
+    parser.add_argument('--unlimited', '-u', action='store_true',
+                       help='Remove discovery limit - discover all pages (same as --max-discovery -1)')
+    parser.add_argument('--timeout', '-t', type=int, default=30,
+                       help='HTTP request timeout in seconds (default: 30)')
+
+    args = parser.parse_args()
+
+    # Handle unlimited flag
+    max_discovery = -1 if args.unlimited else args.max_discovery
+
+    # Load config
+    config = load_config(args.config)
+
+    # Run estimation
+    try:
+        results = estimate_pages(config, max_discovery, args.timeout)
+        print_results(results, config)
+
+        # Return exit code based on results
+        if results['hit_limit']:
+            return 2  # Warning: hit limit
+        return 0  # Success
+
+    except KeyboardInterrupt:
+        print("\n\n⚠️  Estimation interrupted by user")
+        return 1
+    except Exception as e:
+        print(f"\n\n❌ Error during estimation: {e}")
+        return 1
+
+
+if __name__ == '__main__':
+    sys.exit(main())
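Review note on `estimate_pages` and `is_valid_url` above: link normalization and the include/exclude filter can be sketched without any network access. The helper names `normalize_link` and `should_crawl` and the `example.com` URLs are assumptions made for illustration; the logic follows the diff (drop query and fragment, prefix check, exclude before include).

```python
from urllib.parse import urljoin, urlparse


def normalize_link(page_url, href):
    """Resolve href against the page URL and drop the query string and fragment."""
    parsed = urlparse(urljoin(page_url, href))
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


def should_crawl(url, base_url, include_patterns, exclude_patterns):
    """Substring-based filter in the same order as is_valid_url()."""
    # Must sit under the base URL (a plain prefix check).
    if not url.startswith(base_url.rstrip("/")):
        return False
    # Exclude patterns are checked first and always win.
    if any(pattern in url for pattern in exclude_patterns):
        return False
    # With include patterns present, at least one must match.
    if include_patterns:
        return any(pattern in url for pattern in include_patterns)
    # No include patterns: accept by default.
    return True


link = normalize_link("https://example.com/docs/intro.html", "guide.html?v=2#install")
accepted = should_crawl(link, "https://example.com/docs/", ["/docs"], ["/blog"])
excluded = should_crawl("https://example.com/blog/post", "https://example.com/", [], ["/blog"])
off_site = should_crawl("https://other.org/docs", "https://example.com/", [], [])
```

Stripping the query and fragment means `guide.html?v=2` and `guide.html#install` count as one page, which keeps the estimate from inflating on anchor links.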
--- a/src/skill_seekers/cli/generate_router.py
+++ b/src/skill_seekers/cli/generate_router.py
@@ -0,0 +1,274 @@
+#!/usr/bin/env python3
+"""
+Router Skill Generator
+
+Creates a router/hub skill that intelligently directs queries to specialized sub-skills.
+This is used for large documentation sites split into multiple focused skills.
+"""
+
+import json
+import sys
+import argparse
+from pathlib import Path
+from typing import Dict, List, Any, Tuple
+
+
+class RouterGenerator:
+    """Generates router skills that direct to specialized sub-skills"""
+
+    def __init__(self, config_paths: List[str], router_name: str = None):
+        self.config_paths = [Path(p) for p in config_paths]
+        self.configs = [self.load_config(p) for p in self.config_paths]
+        self.router_name = router_name or self.infer_router_name()
+        self.base_config = self.configs[0]  # Use first as template
+
+    def load_config(self, path: Path) -> Dict[str, Any]:
+        """Load a config file"""
+        try:
+            with open(path, 'r') as f:
+                return json.load(f)
+        except Exception as e:
+            print(f"❌ Error loading {path}: {e}")
+            sys.exit(1)
+
+    def infer_router_name(self) -> str:
+        """Infer router name from sub-skill names"""
+        # Find common prefix
+        names = [cfg['name'] for cfg in self.configs]
+        if not names:
+            return "router"
+
+        # Get common prefix before first dash
+        first_name = names[0]
+        if '-' in first_name:
+            return first_name.split('-')[0]
+        return first_name
+
+    def extract_routing_keywords(self) -> Dict[str, List[str]]:
+        """Extract keywords for routing to each skill"""
+        routing = {}
+
+        for config in self.configs:
+            name = config['name']
+            keywords = []
+
+            # Extract from categories
+            if 'categories' in config:
+                keywords.extend(config['categories'].keys())
+
+            # Extract from name (part after dash)
+            if '-' in name:
+                skill_topic = name.split('-', 1)[1]
+                keywords.append(skill_topic)
+
+            routing[name] = keywords
+
+        return routing
+
+    def generate_skill_md(self) -> str:
+        """Generate router SKILL.md content"""
+        routing_keywords = self.extract_routing_keywords()
+
+        skill_md = f"""# {self.router_name.replace('-', ' ').title()} Documentation (Router)
+
+## When to Use This Skill
+
+{self.base_config.get('description', f'Use for {self.router_name} development and programming.')}
+
+This is a router skill that directs your questions to specialized sub-skills for efficient, focused assistance.
+
+## How It Works
+
+This skill analyzes your question and activates the appropriate specialized skill(s):
+
+"""
+
+        # List sub-skills
+        for config in self.configs:
+            name = config['name']
+            desc = config.get('description', '')
+            # Remove router name prefix from description if present
+            if desc.startswith(f"{self.router_name.title()} - "):
+                desc = desc.split(' - ', 1)[1]
+
+            skill_md += f"### {name}\n{desc}\n\n"
+
+        # Routing logic
+        skill_md += """## Routing Logic
+
+The router analyzes your question for topic keywords and activates relevant skills:
+
+**Keywords → Skills:**
+"""
+
+        for skill_name, keywords in routing_keywords.items():
+            keyword_str = ", ".join(keywords)
+            skill_md += f"- {keyword_str} → **{skill_name}**\n"
+
+        # Quick reference
+        skill_md += f"""
+
+## Quick Reference
+
+For quick answers, this router provides basic overview information. For detailed documentation, the specialized skills contain comprehensive references.
+
+### Getting Started
+
+1. Ask your question naturally - mention the topic area
+2. The router will activate the appropriate skill(s)
+3. You'll receive focused, detailed answers from specialized documentation
+
+### Examples
+
+**Question:** "How do I create a 2D sprite?"
+**Activates:** {self.router_name}-2d skill
+
+**Question:** "GDScript function syntax"
+**Activates:** {self.router_name}-scripting skill
+
+**Question:** "Physics collision handling in 3D"
+**Activates:** {self.router_name}-3d + {self.router_name}-physics skills
+
+### All Available Skills
+
+"""
+
+        # List all skills
+        for config in self.configs:
+            skill_md += f"- **{config['name']}**\n"
+
+        skill_md += f"""
+
+## Need Help?
+
+Simply ask your question and mention the topic. The router will find the right specialized skill for you!
+
+---
+
+*This is a router skill. For complete documentation, see the specialized skills listed above.*
+"""
+
+        return skill_md
+
+    def create_router_config(self) -> Dict[str, Any]:
+        """Create router configuration"""
+        routing_keywords = self.extract_routing_keywords()
+
+        router_config = {
+            "name": self.router_name,
+            "description": self.base_config.get('description', f'{self.router_name.title()} documentation router'),
+            "base_url": self.base_config['base_url'],
+            "selectors": self.base_config.get('selectors', {}),
+            "url_patterns": self.base_config.get('url_patterns', {}),
+            "rate_limit": self.base_config.get('rate_limit', 0.5),
+            "max_pages": 500,  # Router only scrapes overview pages
+            "_router": True,
+            "_sub_skills": [cfg['name'] for cfg in self.configs],
+            "_routing_keywords": routing_keywords
+        }
+
+        return router_config
+
+    def generate(self, output_dir: Path = None) -> Tuple[Path, Path]:
+        """Generate router skill and config"""
+        if output_dir is None:
+            output_dir = self.config_paths[0].parent
+
+        output_dir = Path(output_dir)
+
+        # Generate SKILL.md
+        skill_md = self.generate_skill_md()
+        skill_path = output_dir.parent / f"output/{self.router_name}/SKILL.md"
+        skill_path.parent.mkdir(parents=True, exist_ok=True)
+
+        with open(skill_path, 'w') as f:
+            f.write(skill_md)
+
+        # Generate config
+        router_config = self.create_router_config()
+        config_path = output_dir / f"{self.router_name}.json"
+
+        with open(config_path, 'w') as f:
+            json.dump(router_config, f, indent=2)
+
+        return config_path, skill_path
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Generate router/hub skill for split documentation",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Generate router from multiple configs
+  python3 generate_router.py configs/godot-2d.json configs/godot-3d.json configs/godot-scripting.json
+
+  # Use glob pattern
+  python3 generate_router.py configs/godot-*.json
+
+  # Custom router name
+  python3 generate_router.py configs/godot-*.json --name godot-hub
+
+  # Custom output directory
+  python3 generate_router.py configs/godot-*.json --output-dir configs/routers/
+        """
+    )
+
+    parser.add_argument(
+        'configs',
+        nargs='+',
+        help='Sub-skill config files'
+    )
+
+    parser.add_argument(
+        '--name',
+        help='Router skill name (default: inferred from sub-skills)'
+    )
+
+    parser.add_argument(
+        '--output-dir',
+        help='Output directory (default: same as input configs)'
+    )
+
+    args = parser.parse_args()
+
+    # Filter out router configs (avoid recursion)
+    config_files = []
+    for path_str in args.configs:
+        path = Path(path_str)
+        if path.exists() and not path.stem.endswith('-router'):
+            config_files.append(path_str)
+
+    if not config_files:
+        print("❌ Error: No valid config files provided")
+        sys.exit(1)
+
+    print(f"\n{'='*60}")
+    print("ROUTER SKILL GENERATOR")
+    print(f"{'='*60}")
+    print(f"Sub-skills: {len(config_files)}")
+    for cfg in config_files:
+        print(f"  - {Path(cfg).stem}")
+    print("")
+
+    # Generate router
+    generator = RouterGenerator(config_files, args.name)
+    config_path, skill_path = generator.generate(args.output_dir)
+
+    print(f"✅ Router config created: {config_path}")
+    print(f"✅ Router SKILL.md created: {skill_path}")
+    print("")
+    print(f"{'='*60}")
+    print("NEXT STEPS")
+    print(f"{'='*60}")
+    print(f"1. Review router SKILL.md: {skill_path}")
+    print("2. Optionally scrape router (for overview pages):")
+    print(f"     python3 cli/doc_scraper.py --config {config_path}")
+    print("3. Package router skill:")
+    print(f"     python3 cli/package_skill.py output/{generator.router_name}/")
+    print("4. Upload router + all sub-skills to Claude")
+    print("")
+
+
+if __name__ == "__main__":
+    main()
--- a/src/skill_seekers/cli/github_scraper.py
+++ b/src/skill_seekers/cli/github_scraper.py
@@ -0,0 +1,797 @@
+#!/usr/bin/env python3
+"""
+GitHub Repository to Claude Skill Converter (Tasks C1.1-C1.12)
+
+Converts GitHub repositories into Claude AI skills by extracting:
+- README and documentation
+- Code structure and signatures
+- GitHub Issues, Changelog, and Releases
+- Usage examples from tests
+
+Usage:
+    skill-seekers-github --repo facebook/react
+    skill-seekers-github --config configs/react_github.json
+    skill-seekers-github --repo owner/repo --token $GITHUB_TOKEN
+"""
+
+import os
+import sys
+import json
+import re
+import argparse
+import logging
+from pathlib import Path
+from typing import Dict, List, Optional, Any
+from datetime import datetime
+
+try:
+    from github import Github, GithubException, Repository
+    from github.GithubException import RateLimitExceededException
+except ImportError:
+    print("Error: PyGithub not installed. Run: pip install PyGithub")
+    sys.exit(1)
+
+# Configure logging first: the import fallback below logs a warning
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+# Import code analyzer for deep code analysis
+try:
+    from code_analyzer import CodeAnalyzer
+    CODE_ANALYZER_AVAILABLE = True
+except ImportError:
+    CODE_ANALYZER_AVAILABLE = False
+    logger.warning("Code analyzer not available - deep analysis disabled")
+
+
+class GitHubScraper:
+    """
+    GitHub Repository Scraper (C1.1-C1.9)
+
+    Extracts repository information for skill generation:
+    - Repository structure
+    - README files
+    - Code comments and docstrings
+    - Programming language detection
+    - Function/class signatures
+    - Test examples
+    - GitHub Issues
+    - CHANGELOG
+    - Releases
+    """
+
+    def __init__(self, config: Dict[str, Any]):
+        """Initialize GitHub scraper with configuration."""
+        self.config = config
+        self.repo_name = config['repo']
+        self.name = config.get('name', self.repo_name.split('/')[-1])
+        self.description = config.get('description', f'Skill for {self.repo_name}')
+
+        # GitHub client setup (C1.1)
+        token = self._get_token()
+        self.github = Github(token) if token else Github()
+        self.repo: Optional[Repository.Repository] = None
+
+        # Options
+        self.include_issues = config.get('include_issues', True)
+        self.max_issues = config.get('max_issues', 100)
+        self.include_changelog = config.get('include_changelog', True)
+        self.include_releases = config.get('include_releases', True)
+        self.include_code = config.get('include_code', False)
+        self.code_analysis_depth = config.get('code_analysis_depth', 'surface')  # 'surface', 'deep', 'full'
+        self.file_patterns = config.get('file_patterns', [])
+
+        # Initialize code analyzer if deep analysis requested
+        self.code_analyzer = None
+        if self.code_analysis_depth != 'surface' and CODE_ANALYZER_AVAILABLE:
+            self.code_analyzer = CodeAnalyzer(depth=self.code_analysis_depth)
+            logger.info(f"Code analysis depth: {self.code_analysis_depth}")
+
+        # Output paths
+        self.skill_dir = f"output/{self.name}"
+        self.data_file = f"output/{self.name}_github_data.json"
+
+        # Extracted data storage
+        self.extracted_data = {
+            'repo_info': {},
+            'readme': '',
+            'file_tree': [],
+            'languages': {},
+            'signatures': [],
+            'test_examples': [],
+            'issues': [],
+            'changelog': '',
+            'releases': []
+        }
+
+    def _get_token(self) -> Optional[str]:
+        """
+        Get GitHub token from env var or config (both options supported).
+        Priority: GITHUB_TOKEN env var > config file > None
+        """
+        # Try environment variable first (recommended)
+        token = os.getenv('GITHUB_TOKEN')
+        if token:
+            logger.info("Using GitHub token from GITHUB_TOKEN environment variable")
+            return token
+
+        # Fall back to config file
+        token = self.config.get('github_token')
+        if token:
+            logger.warning("Using GitHub token from config file (less secure)")
+            return token
+
+        logger.warning("No GitHub token provided - using unauthenticated access (lower rate limits)")
+        return None
+
+    def scrape(self) -> Dict[str, Any]:
+        """
+        Main scraping entry point.
+        Executes all C1 tasks in sequence.
+        """
+        try:
+            logger.info(f"Starting GitHub scrape for: {self.repo_name}")
+
+            # C1.1: Fetch repository
+            self._fetch_repository()
+
+            # C1.2: Extract README
+            self._extract_readme()
+
+            # C1.3-C1.6: Extract code structure
+            self._extract_code_structure()
+
+            # C1.7: Extract Issues
+            if self.include_issues:
+                self._extract_issues()
+
+            # C1.8: Extract CHANGELOG
+            if self.include_changelog:
+                self._extract_changelog()
+
+            # C1.9: Extract Releases
+            if self.include_releases:
+                self._extract_releases()
+
+            # Save extracted data
+            self._save_data()
+
+            logger.info(f"✅ Scraping complete! Data saved to: {self.data_file}")
+            return self.extracted_data
+
+        except RateLimitExceededException:
+            logger.error("GitHub API rate limit exceeded. Please wait or use authentication token.")
+            raise
+        except GithubException as e:
+            logger.error(f"GitHub API error: {e}")
+            raise
+        except Exception as e:
+            logger.error(f"Unexpected error during scraping: {e}")
+            raise
+
+    def _fetch_repository(self):
+        """C1.1: Fetch repository structure using GitHub API."""
+        logger.info(f"Fetching repository: {self.repo_name}")
+
+        try:
+            self.repo = self.github.get_repo(self.repo_name)
+
+            # Extract basic repo info
+            self.extracted_data['repo_info'] = {
+                'name': self.repo.name,
+                'full_name': self.repo.full_name,
+                'description': self.repo.description,
+                'url': self.repo.html_url,
+                'homepage': self.repo.homepage,
+                'stars': self.repo.stargazers_count,
+                'forks': self.repo.forks_count,
+                'open_issues': self.repo.open_issues_count,
+                'default_branch': self.repo.default_branch,
+                'created_at': self.repo.created_at.isoformat() if self.repo.created_at else None,
+                'updated_at': self.repo.updated_at.isoformat() if self.repo.updated_at else None,
+                'language': self.repo.language,
+                'license': self.repo.license.name if self.repo.license else None,
+                'topics': self.repo.get_topics()
+            }
+
+            logger.info(f"Repository fetched: {self.repo.full_name} ({self.repo.stargazers_count} stars)")
+
+        except GithubException as e:
+            if e.status == 404:
+                raise ValueError(f"Repository not found: {self.repo_name}")
+            raise
+
+    def _extract_readme(self):
+        """C1.2: Extract README.md files."""
+        logger.info("Extracting README...")
+
+        # Try common README locations
+        readme_files = ['README.md', 'README.rst', 'README.txt', 'README',
+                       'docs/README.md', '.github/README.md']
+
+        for readme_path in readme_files:
+            try:
+                content = self.repo.get_contents(readme_path)
+                if content:
+                    self.extracted_data['readme'] = content.decoded_content.decode('utf-8')
+                    logger.info(f"README found: {readme_path}")
+                    return
+            except GithubException:
+                continue
+
+        logger.warning("No README found in repository")
+
+    def _extract_code_structure(self):
+        """
+        C1.3-C1.6: Extract code structure, languages, signatures, and test examples.
+        Surface layer only - no full implementation code.
+        """
+        logger.info("Extracting code structure...")
+
+        # C1.4: Get language breakdown
+        self._extract_languages()
+
+        # Get file tree
+        self._extract_file_tree()
+
+        # Extract signatures and test examples
+        if self.include_code:
+            self._extract_signatures_and_tests()
+
+    def _extract_languages(self):
+        """C1.4: Detect programming languages in repository."""
+        logger.info("Detecting programming languages...")
+
+        try:
+            languages = self.repo.get_languages()
+            total_bytes = sum(languages.values())
+
+            self.extracted_data['languages'] = {
+                lang: {
+                    'bytes': bytes_count,
+                    'percentage': round((bytes_count / total_bytes) * 100, 2) if total_bytes > 0 else 0
+                }
+                for lang, bytes_count in languages.items()
+            }
+
+            logger.info(f"Languages detected: {', '.join(languages.keys())}")
+
+        except GithubException as e:
+            logger.warning(f"Could not fetch languages: {e}")
+
+    def _extract_file_tree(self):
+        """Extract repository file tree structure."""
+        logger.info("Building file tree...")
+
+        try:
+            contents = self.repo.get_contents("")
+            file_tree = []
+
+            while contents:
+                file_content = contents.pop(0)
+
+                file_info = {
+                    'path': file_content.path,
+                    'type': file_content.type,
+                    'size': file_content.size if file_content.type == 'file' else None
+                }
+                file_tree.append(file_info)
+
+                if file_content.type == "dir":
+                    contents.extend(self.repo.get_contents(file_content.path))
+
+            self.extracted_data['file_tree'] = file_tree
+            logger.info(f"File tree built: {len(file_tree)} items")
+
+        except GithubException as e:
+            logger.warning(f"Could not build file tree: {e}")
+
+    def _extract_signatures_and_tests(self):
+        """
+        C1.3, C1.5, C1.6: Extract signatures, docstrings, and test examples.
+
+        Extraction depth depends on code_analysis_depth setting:
+        - surface: File tree only (minimal)
+        - deep: Parse files for signatures, parameters, types
+        - full: Complete AST analysis (future enhancement)
+        """
+        if self.code_analysis_depth == 'surface':
+            logger.info("Code extraction: Surface level (file tree only)")
+            return
+
+        if not self.code_analyzer:
+            logger.warning("Code analyzer not available - skipping deep analysis")
+            return
+
+        logger.info(f"Extracting code signatures ({self.code_analysis_depth} analysis)...")
+
+        # Get primary language for the repository
+        languages = self.extracted_data.get('languages', {})
+        if not languages:
+            logger.warning("No languages detected - skipping code analysis")
+            return
+
+        # Determine primary language
+        primary_language = max(languages.items(), key=lambda x: x[1]['bytes'])[0]
+        logger.info(f"Primary language: {primary_language}")
+
+        # Determine file extensions to analyze
+        extension_map = {
+            'Python': ['.py'],
+            'JavaScript': ['.js', '.jsx'],
+            'TypeScript': ['.ts', '.tsx'],
+            'C': ['.c', '.h'],
+            'C++': ['.cpp', '.hpp', '.cc', '.hh', '.cxx']
+        }
+
+        extensions = extension_map.get(primary_language, [])
+        if not extensions:
+            logger.warning(f"No file extensions mapped for {primary_language}")
+            return
+
+        # Analyze files matching patterns and extensions
+        analyzed_files = []
+        file_tree = self.extracted_data.get('file_tree', [])
+
+        for file_info in file_tree:
+            file_path = file_info['path']
+
+            # Check if file matches extension
+            if not any(file_path.endswith(ext) for ext in extensions):
+                continue
+
+            # Check if file matches patterns (if specified)
+            if self.file_patterns:
+                import fnmatch
+                if not any(fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns):
+                    continue
+
+            # Analyze this file
+            try:
+                file_content = self.repo.get_contents(file_path)
+                content = file_content.decoded_content.decode('utf-8')
+
+                analysis_result = self.code_analyzer.analyze_file(
+                    file_path,
+                    content,
+                    primary_language
+                )
+
+                if analysis_result and (analysis_result.get('classes') or analysis_result.get('functions')):
+                    analyzed_files.append({
+                        'file': file_path,
+                        'language': primary_language,
+                        **analysis_result
+                    })
+
+                    logger.debug(f"Analyzed {file_path}: "
+                               f"{len(analysis_result.get('classes', []))} classes, "
+                               f"{len(analysis_result.get('functions', []))} functions")
+
+            except Exception as e:
+                logger.debug(f"Could not analyze {file_path}: {e}")
+                continue
+
+            # Limit number of files analyzed to avoid rate limits
+            if len(analyzed_files) >= 50:
+                logger.info("Reached analysis limit (50 files)")
+                break
+
+        self.extracted_data['code_analysis'] = {
+            'depth': self.code_analysis_depth,
+            'language': primary_language,
+            'files_analyzed': len(analyzed_files),
+            'files': analyzed_files
+        }
+
+        # Calculate totals
+        total_classes = sum(len(f.get('classes', [])) for f in analyzed_files)
+        total_functions = sum(len(f.get('functions', [])) for f in analyzed_files)
+
+        logger.info(f"Code analysis complete: {len(analyzed_files)} files, "
+                   f"{total_classes} classes, {total_functions} functions")
+
+    def _extract_issues(self):
+        """C1.7: Extract GitHub Issues (open/closed, labels, milestones)."""
+        logger.info(f"Extracting GitHub Issues (max {self.max_issues})...")
+
+        try:
+            # Fetch recent issues (open + closed)
+            issues = self.repo.get_issues(state='all', sort='updated', direction='desc')
+
+            issue_list = []
+            for issue in issues[:self.max_issues]:
+                # Skip pull requests (they appear in issues)
+                if issue.pull_request:
+                    continue
+
+                issue_data = {
+                    'number': issue.number,
+                    'title': issue.title,
+                    'state': issue.state,
+                    'labels': [label.name for label in issue.labels],
+                    'milestone': issue.milestone.title if issue.milestone else None,
+                    'created_at': issue.created_at.isoformat() if issue.created_at else None,
+                    'updated_at': issue.updated_at.isoformat() if issue.updated_at else None,
+                    'closed_at': issue.closed_at.isoformat() if issue.closed_at else None,
+                    'url': issue.html_url,
+                    'body': issue.body[:500] if issue.body else None  # First 500 chars
+                }
+                issue_list.append(issue_data)
+
+            self.extracted_data['issues'] = issue_list
+            logger.info(f"Extracted {len(issue_list)} issues")
+
+        except GithubException as e:
+            logger.warning(f"Could not fetch issues: {e}")
+
+    def _extract_changelog(self):
+        """C1.8: Extract the changelog from common file locations."""
+        logger.info("Extracting CHANGELOG...")
+
+        # Try common changelog locations
+        changelog_files = ['CHANGELOG.md', 'CHANGES.md', 'HISTORY.md',
+                          'CHANGELOG.rst', 'CHANGELOG.txt', 'CHANGELOG',
+                          'docs/CHANGELOG.md', '.github/CHANGELOG.md']
+
+        for changelog_path in changelog_files:
+            try:
+                content = self.repo.get_contents(changelog_path)
+                if content:
+                    self.extracted_data['changelog'] = content.decoded_content.decode('utf-8')
+                    logger.info(f"CHANGELOG found: {changelog_path}")
+                    return
+            except GithubException:
+                continue
+
+        logger.warning("No CHANGELOG found in repository")
+
+    def _extract_releases(self):
+        """C1.9: Extract GitHub Releases with version history."""
+        logger.info("Extracting GitHub Releases...")
+
+        try:
+            releases = self.repo.get_releases()
+
+            release_list = []
+            for release in releases:
+                release_data = {
+                    'tag_name': release.tag_name,
+                    'name': release.title,
+                    'body': release.body,
+                    'draft': release.draft,
+                    'prerelease': release.prerelease,
+                    'created_at': release.created_at.isoformat() if release.created_at else None,
+                    'published_at': release.published_at.isoformat() if release.published_at else None,
+                    'url': release.html_url,
+                    'tarball_url': release.tarball_url,
+                    'zipball_url': release.zipball_url
+                }
+                release_list.append(release_data)
+
+            self.extracted_data['releases'] = release_list
+            logger.info(f"Extracted {len(release_list)} releases")
+
+        except GithubException as e:
+            logger.warning(f"Could not fetch releases: {e}")
+
+    def _save_data(self):
+        """Save extracted data to JSON file."""
+        os.makedirs('output', exist_ok=True)
+
+        with open(self.data_file, 'w', encoding='utf-8') as f:
+            json.dump(self.extracted_data, f, indent=2, ensure_ascii=False)
+
+        logger.info(f"Data saved to: {self.data_file}")
+
+
+class GitHubToSkillConverter:
+    """
+    Convert extracted GitHub data to Claude skill format (C1.10).
+    """
+
+    def __init__(self, config: Dict[str, Any]):
+        """Initialize converter with configuration."""
+        self.config = config
+        self.name = config.get('name', config['repo'].split('/')[-1])
+        self.description = config.get('description', f'Skill for {config["repo"]}')
+
+        # Paths
+        self.data_file = f"output/{self.name}_github_data.json"
+        self.skill_dir = f"output/{self.name}"
+
+        # Load extracted data
+        self.data = self._load_data()
+
+    def _load_data(self) -> Dict[str, Any]:
+        """Load extracted GitHub data from JSON."""
+        if not os.path.exists(self.data_file):
+            raise FileNotFoundError(f"Data file not found: {self.data_file}")
+
+        with open(self.data_file, 'r', encoding='utf-8') as f:
+            return json.load(f)
+
+    def build_skill(self):
+        """Build complete skill structure."""
+        logger.info(f"Building skill for: {self.name}")
+
+        # Create directories
+        os.makedirs(self.skill_dir, exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
+
+        # Generate SKILL.md
+        self._generate_skill_md()
+
+        # Generate reference files
+        self._generate_references()
+
+        logger.info(f"✅ Skill built successfully: {self.skill_dir}/")
+
+    def _generate_skill_md(self):
+        """Generate main SKILL.md file."""
+        repo_info = self.data.get('repo_info', {})
+
+        # Generate skill name (lowercase, hyphens only, max 64 chars)
+        skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
+
+        # Truncate description to 1024 chars (slicing is a no-op for shorter strings)
+        desc = self.description[:1024]
+
+        skill_content = f"""---
+name: {skill_name}
+description: {desc}
+---
+
+# {repo_info.get('name', self.name)}
+
+{self.description}
+
+## Description
+
+{repo_info.get('description') or 'GitHub repository skill'}
+
+**Repository:** [{repo_info.get('full_name', 'N/A')}]({repo_info.get('url', '#')})
+**Language:** {repo_info.get('language') or 'N/A'}
+**Stars:** {repo_info.get('stars', 0):,}
+**License:** {repo_info.get('license') or 'N/A'}
+
+## When to Use This Skill
+
+Use this skill when you need to:
+- Understand how to use {self.name}
+- Look up API documentation
+- Find usage examples
+- Check for known issues or recent changes
+- Review release history
+
+## Quick Reference
+
+### Repository Info
+- **Homepage:** {repo_info.get('homepage') or 'N/A'}
+- **Topics:** {', '.join(repo_info.get('topics', []))}
+- **Open Issues:** {repo_info.get('open_issues', 0)}
+- **Last Updated:** {(repo_info.get('updated_at') or 'N/A')[:10]}
+
+### Languages
+{self._format_languages()}
+
+### Recent Releases
+{self._format_recent_releases()}
+
+## Available References
+
+- `references/README.md` - Complete README documentation
+- `references/CHANGELOG.md` - Version history and changes
+- `references/issues.md` - Recent GitHub issues
+- `references/releases.md` - Release notes
+- `references/file_structure.md` - Repository structure
+
+## Usage
+
+See README.md for complete usage instructions and examples.
+
+---
+
+**Generated by Skill Seeker** | GitHub Repository Scraper
+"""
+
+        skill_path = f"{self.skill_dir}/SKILL.md"
+        with open(skill_path, 'w', encoding='utf-8') as f:
+            f.write(skill_content)
+
+        logger.info(f"Generated: {skill_path}")
+
+    def _format_languages(self) -> str:
+        """Format language breakdown."""
+        languages = self.data.get('languages', {})
+        if not languages:
+            return "No language data available"
+
+        lines = []
+        for lang, info in sorted(languages.items(), key=lambda x: x[1]['bytes'], reverse=True):
+            lines.append(f"- **{lang}:** {info['percentage']:.1f}%")
+
+        return '\n'.join(lines)
+
+    def _format_recent_releases(self) -> str:
+        """Format recent releases (top 3)."""
+        releases = self.data.get('releases', [])
+        if not releases:
+            return "No releases available"
+
+        lines = []
+        for release in releases[:3]:
+            lines.append(f"- **{release['tag_name']}** ({(release['published_at'] or 'N/A')[:10]}): {release['name']}")
+
+        return '\n'.join(lines)
+
+    def _generate_references(self):
+        """Generate all reference files."""
+        # README
+        if self.data.get('readme'):
+            readme_path = f"{self.skill_dir}/references/README.md"
+            with open(readme_path, 'w', encoding='utf-8') as f:
+                f.write(self.data['readme'])
+            logger.info(f"Generated: {readme_path}")
+
+        # CHANGELOG
+        if self.data.get('changelog'):
+            changelog_path = f"{self.skill_dir}/references/CHANGELOG.md"
+            with open(changelog_path, 'w', encoding='utf-8') as f:
+                f.write(self.data['changelog'])
+            logger.info(f"Generated: {changelog_path}")
+
+        # Issues
+        if self.data.get('issues'):
+            self._generate_issues_reference()
+
+        # Releases
+        if self.data.get('releases'):
+            self._generate_releases_reference()
+
+        # File structure
+        if self.data.get('file_tree'):
+            self._generate_file_structure_reference()
+
+    def _generate_issues_reference(self):
+        """Generate issues.md reference file."""
+        issues = self.data['issues']
+
+        content = f"# GitHub Issues\n\nRecent issues from the repository ({len(issues)} total).\n\n"
+
+        # Group by state
+        open_issues = [i for i in issues if i['state'] == 'open']
+        closed_issues = [i for i in issues if i['state'] == 'closed']
+
+        content += f"## Open Issues ({len(open_issues)})\n\n"
+        for issue in open_issues[:20]:
+            labels = ', '.join(issue['labels']) if issue['labels'] else 'No labels'
+            content += f"### #{issue['number']}: {issue['title']}\n"
+            content += f"**Labels:** {labels} | **Created:** {issue['created_at'][:10]}\n"
+            content += f"[View on GitHub]({issue['url']})\n\n"
+
+        content += f"\n## Recently Closed Issues ({len(closed_issues)})\n\n"
+        for issue in closed_issues[:10]:
+            labels = ', '.join(issue['labels']) if issue['labels'] else 'No labels'
+            content += f"### #{issue['number']}: {issue['title']}\n"
+            content += f"**Labels:** {labels} | **Closed:** {issue['closed_at'][:10]}\n"
+            content += f"[View on GitHub]({issue['url']})\n\n"
+
+        issues_path = f"{self.skill_dir}/references/issues.md"
+        with open(issues_path, 'w', encoding='utf-8') as f:
+            f.write(content)
+        logger.info(f"Generated: {issues_path}")
+
+    def _generate_releases_reference(self):
+        """Generate releases.md reference file."""
+        releases = self.data['releases']
+
+        content = f"# Releases\n\nVersion history for this repository ({len(releases)} releases).\n\n"
+
+        for release in releases:
+            content += f"## {release['tag_name']}: {release['name']}\n"
+            content += f"**Published:** {(release['published_at'] or 'N/A')[:10]}\n"
+            if release['prerelease']:
+                content += "**Pre-release**\n"
+            content += f"\n{release['body'] or ''}\n\n"
+            content += f"[View on GitHub]({release['url']})\n\n---\n\n"
+
+        releases_path = f"{self.skill_dir}/references/releases.md"
+        with open(releases_path, 'w', encoding='utf-8') as f:
+            f.write(content)
+        logger.info(f"Generated: {releases_path}")
+
+    def _generate_file_structure_reference(self):
+        """Generate file_structure.md reference file."""
+        file_tree = self.data['file_tree']
+
+        content = "# Repository File Structure\n\n"
+        content += f"Total items: {len(file_tree)}\n\n"
+        content += "```\n"
+
+        # Build tree structure
+        for item in file_tree:
+            indent = "  " * item['path'].count('/')
+            icon = "📁" if item['type'] == 'dir' else "📄"
+            content += f"{indent}{icon} {os.path.basename(item['path'])}\n"
+
+        content += "```\n"
+
+        structure_path = f"{self.skill_dir}/references/file_structure.md"
+        with open(structure_path, 'w', encoding='utf-8') as f:
+            f.write(content)
+        logger.info(f"Generated: {structure_path}")
+
+
+def main():
+    """C1.10: CLI tool entry point."""
+    parser = argparse.ArgumentParser(
+        description='GitHub Repository to Claude Skill Converter',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  skill-seekers-github --repo facebook/react
+  skill-seekers-github --config configs/react_github.json
+  skill-seekers-github --repo owner/repo --token $GITHUB_TOKEN
+        """
+    )
+
+    parser.add_argument('--repo', help='GitHub repository (owner/repo)')
+    parser.add_argument('--config', help='Path to config JSON file')
+    parser.add_argument('--token', help='GitHub personal access token')
+    parser.add_argument('--name', help='Skill name (default: repo name)')
+    parser.add_argument('--description', help='Skill description')
+    parser.add_argument('--no-issues', action='store_true', help='Skip GitHub issues')
+    parser.add_argument('--no-changelog', action='store_true', help='Skip CHANGELOG')
+    parser.add_argument('--no-releases', action='store_true', help='Skip releases')
+    parser.add_argument('--max-issues', type=int, default=100, help='Max issues to fetch')
+    parser.add_argument('--scrape-only', action='store_true', help='Only scrape, don\'t build skill')
+
+    args = parser.parse_args()
+
+    # Build config from args or file
+    if args.config:
+        with open(args.config, 'r', encoding='utf-8') as f:
+            config = json.load(f)
+    elif args.repo:
+        config = {
+            'repo': args.repo,
+            'name': args.name or args.repo.split('/')[-1],
+            'description': args.description or f'GitHub repository skill for {args.repo}',
+            'github_token': args.token,
+            'include_issues': not args.no_issues,
+            'include_changelog': not args.no_changelog,
+            'include_releases': not args.no_releases,
+            'max_issues': args.max_issues
+        }
+    else:
+        parser.error('Either --repo or --config is required')
+
+    try:
+        # Phase 1: Scrape GitHub repository
+        scraper = GitHubScraper(config)
+        scraper.scrape()
+
+        if args.scrape_only:
+            logger.info("Scrape complete (--scrape-only mode)")
+            return
+
+        # Phase 2: Build skill
+        converter = GitHubToSkillConverter(config)
+        converter.build_skill()
+
+        logger.info(f"\n✅ Success! Skill created at: output/{config.get('name', config['repo'].split('/')[-1])}/")
+        logger.info(f"Next step: python3 cli/package_skill.py output/{config.get('name', config['repo'].split('/')[-1])}/")
+
+    except Exception as e:
+        logger.error(f"Error: {e}")
+        sys.exit(1)
+
+
+if __name__ == '__main__':
+    main()
--- a/src/skill_seekers/cli/llms_txt_detector.py
+++ b/src/skill_seekers/cli/llms_txt_detector.py
@@ -0,0 +1,66 @@
+# ABOUTME: Detects and validates llms.txt file availability at documentation URLs
+# ABOUTME: Supports llms-full.txt, llms.txt, and llms-small.txt variants
+
+import requests
+from typing import Optional, Dict, List
+from urllib.parse import urlparse
+
+class LlmsTxtDetector:
+    """Detect llms.txt files at documentation URLs"""
+
+    VARIANTS = [
+        ('llms-full.txt', 'full'),
+        ('llms.txt', 'standard'),
+        ('llms-small.txt', 'small')
+    ]
+
+    def __init__(self, base_url: str):
+        self.base_url = base_url.rstrip('/')
+
+    def detect(self) -> Optional[Dict[str, str]]:
+        """
+        Detect available llms.txt variant.
+
+        Returns:
+            Dict with 'url' and 'variant' keys, or None if not found
+        """
+        parsed = urlparse(self.base_url)
+        root_url = f"{parsed.scheme}://{parsed.netloc}"
+
+        for filename, variant in self.VARIANTS:
+            url = f"{root_url}/{filename}"
+
+            if self._check_url_exists(url):
+                return {'url': url, 'variant': variant}
+
+        return None
+
+    def detect_all(self) -> List[Dict[str, str]]:
+        """
+        Detect all available llms.txt variants.
+
+        Returns:
+            List of dicts with 'url' and 'variant' keys for each found variant
+        """
+        found_variants = []
+        parsed = urlparse(self.base_url)
+        root_url = f"{parsed.scheme}://{parsed.netloc}"
+
+        for filename, variant in self.VARIANTS:
+            url = f"{root_url}/{filename}"
+
+            if self._check_url_exists(url):
+                found_variants.append({
+                    'url': url,
+                    'variant': variant
+                })
+
+        return found_variants
+
+    def _check_url_exists(self, url: str) -> bool:
+        """Check if URL returns 200 status"""
+        try:
+            response = requests.head(url, timeout=5, allow_redirects=True)
+            return response.status_code == 200
+        except requests.RequestException:
+            return False
--- a/src/skill_seekers/cli/llms_txt_downloader.py
+++ b/src/skill_seekers/cli/llms_txt_downloader.py
@@ -0,0 +1,94 @@
+# ABOUTME: Downloads llms.txt files from documentation URLs with retry logic
+# ABOUTME: Validates markdown content and handles timeouts with exponential backoff
+
+import requests
+import time
+from typing import Optional
+
+class LlmsTxtDownloader:
+    """Download llms.txt content from URLs with retry logic"""
+
+    def __init__(self, url: str, timeout: int = 30, max_retries: int = 3):
+        self.url = url
+        self.timeout = timeout
+        self.max_retries = max_retries
+
+    def get_proper_filename(self) -> str:
+        """
+        Extract filename from URL and convert .txt to .md
+
+        Returns:
+            Proper filename with .md extension
+
+        Examples:
+            https://hono.dev/llms-full.txt -> llms-full.md
+            https://hono.dev/llms.txt -> llms.md
+            https://hono.dev/llms-small.txt -> llms-small.md
+        """
+        # Extract filename from URL
+        from urllib.parse import urlparse
+        parsed = urlparse(self.url)
+        filename = parsed.path.split('/')[-1]
+
+        # Replace .txt with .md
+        if filename.endswith('.txt'):
+            filename = filename[:-4] + '.md'
+
+        return filename
+
+    def _is_markdown(self, content: str) -> bool:
+        """
+        Check if content looks like markdown.
+
+        Returns:
+            True if content contains markdown patterns
+        """
+        markdown_patterns = ['# ', '## ', '```', '- ', '* ', '`']
+        return any(pattern in content for pattern in markdown_patterns)
+
+    def download(self) -> Optional[str]:
+        """
+        Download llms.txt content with retry logic.
+
+        Returns:
+            String content or None if download fails
+        """
+        headers = {
+            'User-Agent': 'Skill-Seekers-llms.txt-Reader/1.0'
+        }
+
+        for attempt in range(self.max_retries):
+            try:
+                response = requests.get(
+                    self.url,
+                    headers=headers,
+                    timeout=self.timeout
+                )
+                response.raise_for_status()
+
+                content = response.text
+
+                # Reject content too short to be a real llms.txt file (< 100 chars)
+                if len(content) < 100:
+                    print(f"⚠️  Content too short ({len(content)} chars), rejecting")
+                    return None
+
+                # Validate content looks like markdown
+                if not self._is_markdown(content):
+                    print("⚠️  Content doesn't look like markdown")
+                    return None
+
+                return content
+
+            except requests.RequestException as e:
+                if attempt < self.max_retries - 1:
+                    # Calculate exponential backoff delay: 1s, 2s, 4s, etc.
+                    delay = 2 ** attempt
+                    print(f"⚠️  Attempt {attempt + 1}/{self.max_retries} failed: {e}")
+                    print(f"   Retrying in {delay}s...")
+                    time.sleep(delay)
+                else:
+                    print(f"❌ Failed to download {self.url} after {self.max_retries} attempts: {e}")
+                    return None
+
+        return None
--- a/src/skill_seekers/cli/llms_txt_parser.py
+++ b/src/skill_seekers/cli/llms_txt_parser.py
@@ -0,0 +1,74 @@
+# ABOUTME: Parses llms.txt markdown content into structured page data
+# ABOUTME: Extracts titles, content, code samples, and headings from markdown
+
+import re
+from typing import List, Dict
+
+class LlmsTxtParser:
+    """Parse llms.txt markdown content into page structures"""
+
+    def __init__(self, content: str):
+        self.content = content
+
+    def parse(self) -> List[Dict]:
+        """
+        Parse markdown content into page structures.
+
+        Returns:
+            List of page dicts with title, content, code_samples, headings
+        """
+        pages = []
+
+        # Split by h1 headers (# Title)
+        sections = re.split(r'\n# ', self.content)
+
+        for section in sections:
+            if not section.strip():
+                continue
+
+            # First line is title
+            lines = section.split('\n')
+            title = lines[0].strip('#').strip()
+
+            # Parse content
+            page = self._parse_section('\n'.join(lines[1:]), title)
+            pages.append(page)
+
+        return pages
+
+    def _parse_section(self, content: str, title: str) -> Dict:
+        """Parse a single section into page structure"""
+        page = {
+            'title': title,
+            'content': '',
+            'code_samples': [],
+            'headings': [],
+            'url': f'llms-txt#{title.lower().replace(" ", "-")}',
+            'links': []
+        }
+
+        # Extract code blocks
+        code_blocks = re.findall(r'```(\w+)?\n(.*?)```', content, re.DOTALL)
+        for lang, code in code_blocks:
+            page['code_samples'].append({
+                'code': code.strip(),
+                'language': lang or 'unknown'
+            })
+
+        # Extract h2/h3 headings
+        headings = re.findall(r'^(#{2,3})\s+(.+)$', content, re.MULTILINE)
+        for level_markers, text in headings:
+            page['headings'].append({
+                'level': f'h{len(level_markers)}',
+                'text': text.strip(),
+                'id': text.lower().replace(' ', '-')
+            })
+
+        # Remove code blocks from content for plain text
+        content_no_code = re.sub(r'```.*?```', '', content, flags=re.DOTALL)
+
+        # Extract paragraphs
+        paragraphs = [p.strip() for p in content_no_code.split('\n\n') if len(p.strip()) > 20]
+        page['content'] = '\n\n'.join(paragraphs)
+
+        return page
--- a/src/skill_seekers/cli/main.py
+++ b/src/skill_seekers/cli/main.py
@@ -0,0 +1,285 @@
+#!/usr/bin/env python3
+"""
+Skill Seekers - Unified CLI Entry Point
+
+Provides a git-style unified command-line interface for all Skill Seekers tools.
+
+Usage:
+    skill-seekers <command> [options]
+
+Commands:
+    scrape      Scrape documentation website
+    github      Scrape GitHub repository
+    pdf         Extract from PDF file
+    unified     Multi-source scraping (docs + GitHub + PDF)
+    enhance     AI-powered enhancement (local, no API key)
+    package     Package skill into .zip file
+    upload      Upload skill to Claude
+    estimate    Estimate page count before scraping
+
+Examples:
+    skill-seekers scrape --config configs/react.json
+    skill-seekers github --repo microsoft/TypeScript
+    skill-seekers unified --config configs/react_unified.json
+    skill-seekers package output/react/
+"""
+
+import sys
+import argparse
+from typing import List, Optional
+
+
+def create_parser() -> argparse.ArgumentParser:
+    """Create the main argument parser with subcommands."""
+    parser = argparse.ArgumentParser(
+        prog="skill-seekers",
+        description="Convert documentation, GitHub repos, and PDFs into Claude AI skills",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Scrape documentation
+  skill-seekers scrape --config configs/react.json
+
+  # Scrape GitHub repository
+  skill-seekers github --repo microsoft/TypeScript --name typescript
+
+  # Multi-source scraping (unified)
+  skill-seekers unified --config configs/react_unified.json
+
+  # AI-powered enhancement
+  skill-seekers enhance output/react/
+
+  # Package and upload
+  skill-seekers package output/react/
+  skill-seekers upload output/react.zip
+
+For more information: https://github.com/yusufkaraaslan/Skill_Seekers
+        """
+    )
+
+    parser.add_argument(
+        "--version",
+        action="version",
+        version="%(prog)s 2.0.0"
+    )
+
+    subparsers = parser.add_subparsers(
+        dest="command",
+        title="commands",
+        description="Available Skill Seekers commands",
+        help="Command to run"
+    )
+
+    # === scrape subcommand ===
+    scrape_parser = subparsers.add_parser(
+        "scrape",
+        help="Scrape documentation website",
+        description="Scrape documentation website and generate skill"
+    )
+    scrape_parser.add_argument("--config", help="Config JSON file")
+    scrape_parser.add_argument("--name", help="Skill name")
+    scrape_parser.add_argument("--url", help="Documentation URL")
+    scrape_parser.add_argument("--description", help="Skill description")
+    scrape_parser.add_argument("--skip-scrape", action="store_true", help="Skip scraping, use cached data")
+    scrape_parser.add_argument("--enhance", action="store_true", help="AI enhancement (API)")
+    scrape_parser.add_argument("--enhance-local", action="store_true", help="AI enhancement (local)")
+    scrape_parser.add_argument("--dry-run", action="store_true", help="Dry run mode")
+    scrape_parser.add_argument("--async", dest="async_mode", action="store_true", help="Use async scraping")
+    scrape_parser.add_argument("--workers", type=int, help="Number of async workers")
+
+    # === github subcommand ===
+    github_parser = subparsers.add_parser(
+        "github",
+        help="Scrape GitHub repository",
+        description="Scrape GitHub repository and generate skill"
+    )
+    github_parser.add_argument("--config", help="Config JSON file")
+    github_parser.add_argument("--repo", help="GitHub repo (owner/repo)")
+    github_parser.add_argument("--name", help="Skill name")
+    github_parser.add_argument("--description", help="Skill description")
+
+    # === pdf subcommand ===
+    pdf_parser = subparsers.add_parser(
+        "pdf",
+        help="Extract from PDF file",
+        description="Extract content from PDF and generate skill"
+    )
+    pdf_parser.add_argument("--config", help="Config JSON file")
+    pdf_parser.add_argument("--pdf", help="PDF file path")
+    pdf_parser.add_argument("--name", help="Skill name")
+    pdf_parser.add_argument("--description", help="Skill description")
+    pdf_parser.add_argument("--from-json", help="Build from extracted JSON")
+
+    # === unified subcommand ===
+    unified_parser = subparsers.add_parser(
+        "unified",
+        help="Multi-source scraping (docs + GitHub + PDF)",
+        description="Combine multiple sources into one skill"
+    )
+    unified_parser.add_argument("--config", required=True, help="Unified config JSON file")
+    unified_parser.add_argument("--merge-mode", help="Merge mode (rule-based, claude-enhanced)")
+    unified_parser.add_argument("--dry-run", action="store_true", help="Dry run mode")
+
+    # === enhance subcommand ===
+    enhance_parser = subparsers.add_parser(
+        "enhance",
+        help="AI-powered enhancement (local, no API key)",
+        description="Enhance SKILL.md using Claude Code (local)"
+    )
+    enhance_parser.add_argument("skill_directory", help="Skill directory path")
+
+    # === package subcommand ===
+    package_parser = subparsers.add_parser(
+        "package",
+        help="Package skill into .zip file",
+        description="Package skill directory into uploadable .zip"
+    )
+    package_parser.add_argument("skill_directory", help="Skill directory path")
+    package_parser.add_argument("--no-open", action="store_true", help="Don't open output folder")
+    package_parser.add_argument("--upload", action="store_true", help="Auto-upload after packaging")
+
+    # === upload subcommand ===
+    upload_parser = subparsers.add_parser(
+        "upload",
+        help="Upload skill to Claude",
+        description="Upload .zip file to Claude via Anthropic API"
+    )
+    upload_parser.add_argument("zip_file", help=".zip file to upload")
+    upload_parser.add_argument("--api-key", help="Anthropic API key")
+
+    # === estimate subcommand ===
+    estimate_parser = subparsers.add_parser(
+        "estimate",
+        help="Estimate page count before scraping",
+        description="Estimate total pages for documentation scraping"
+    )
+    estimate_parser.add_argument("config", help="Config JSON file")
+    estimate_parser.add_argument("--max-discovery", type=int, help="Max pages to discover")
+
+    return parser
+
+
+def main(argv: Optional[List[str]] = None) -> int:
+    """Main entry point for the unified CLI.
+
+    Args:
+        argv: Command-line arguments (defaults to sys.argv[1:])
+
+    Returns:
+        Exit code (0 for success, non-zero for error)
+    """
+    parser = create_parser()
+    args = parser.parse_args(argv)
+
+    if not args.command:
+        parser.print_help()
+        return 1
+
+    # Delegate to the appropriate tool
+    try:
+        if args.command == "scrape":
+            from skill_seekers.cli.doc_scraper import main as scrape_main
+            # Convert args namespace to sys.argv format for doc_scraper
+            sys.argv = ["doc_scraper.py"]
+            if args.config:
+                sys.argv.extend(["--config", args.config])
+            if args.name:
+                sys.argv.extend(["--name", args.name])
+            if args.url:
+                sys.argv.extend(["--url", args.url])
+            if args.description:
+                sys.argv.extend(["--description", args.description])
+            if args.skip_scrape:
+                sys.argv.append("--skip-scrape")
+            if args.enhance:
+                sys.argv.append("--enhance")
+            if args.enhance_local:
+                sys.argv.append("--enhance-local")
+            if args.dry_run:
+                sys.argv.append("--dry-run")
+            if args.async_mode:
+                sys.argv.append("--async")
+            if args.workers:
+                sys.argv.extend(["--workers", str(args.workers)])
+            return scrape_main() or 0
+
+        elif args.command == "github":
+            from skill_seekers.cli.github_scraper import main as github_main
+            sys.argv = ["github_scraper.py"]
+            if args.config:
+                sys.argv.extend(["--config", args.config])
+            if args.repo:
+                sys.argv.extend(["--repo", args.repo])
+            if args.name:
+                sys.argv.extend(["--name", args.name])
+            if args.description:
+                sys.argv.extend(["--description", args.description])
+            return github_main() or 0
+
+        elif args.command == "pdf":
+            from skill_seekers.cli.pdf_scraper import main as pdf_main
+            sys.argv = ["pdf_scraper.py"]
+            if args.config:
+                sys.argv.extend(["--config", args.config])
+            if args.pdf:
+                sys.argv.extend(["--pdf", args.pdf])
+            if args.name:
+                sys.argv.extend(["--name", args.name])
+            if args.description:
+                sys.argv.extend(["--description", args.description])
+            if args.from_json:
+                sys.argv.extend(["--from-json", args.from_json])
+            return pdf_main() or 0
+
+        elif args.command == "unified":
+            from skill_seekers.cli.unified_scraper import main as unified_main
+            sys.argv = ["unified_scraper.py", "--config", args.config]
+            if args.merge_mode:
+                sys.argv.extend(["--merge-mode", args.merge_mode])
+            if args.dry_run:
+                sys.argv.append("--dry-run")
+            return unified_main() or 0
+
+        elif args.command == "enhance":
+            from skill_seekers.cli.enhance_skill_local import main as enhance_main
+            sys.argv = ["enhance_skill_local.py", args.skill_directory]
+            return enhance_main() or 0
+
+        elif args.command == "package":
+            from skill_seekers.cli.package_skill import main as package_main
+            sys.argv = ["package_skill.py", args.skill_directory]
+            if args.no_open:
+                sys.argv.append("--no-open")
+            if args.upload:
+                sys.argv.append("--upload")
+            return package_main() or 0
+
+        elif args.command == "upload":
+            from skill_seekers.cli.upload_skill import main as upload_main
+            sys.argv = ["upload_skill.py", args.zip_file]
+            if args.api_key:
+                sys.argv.extend(["--api-key", args.api_key])
+            return upload_main() or 0
+
+        elif args.command == "estimate":
+            from skill_seekers.cli.estimate_pages import main as estimate_main
+            sys.argv = ["estimate_pages.py", args.config]
+            if args.max_discovery:
+                sys.argv.extend(["--max-discovery", str(args.max_discovery)])
+            return estimate_main() or 0
+
+        else:
+            print(f"Error: Unknown command '{args.command}'", file=sys.stderr)
+            parser.print_help()
+            return 1
+
+    except KeyboardInterrupt:
+        print("\n\nInterrupted by user", file=sys.stderr)
+        return 130
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/src/skill_seekers/cli/merge_sources.py
+++ b/src/skill_seekers/cli/merge_sources.py
@@ -0,0 +1,513 @@
+#!/usr/bin/env python3
+"""
+Source Merger for Multi-Source Skills
+
+Merges documentation and code data intelligently:
+- Rule-based merge: Fast, deterministic rules
+- Claude-enhanced merge: AI-powered reconciliation
+
+Handles conflicts and creates unified API reference.
+"""
+
+import json
+import logging
+import subprocess
+import tempfile
+import os
+from pathlib import Path
+from typing import Dict, List, Any, Optional
+from skill_seekers.cli.conflict_detector import Conflict, ConflictDetector
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+class RuleBasedMerger:
+    """
+    Rule-based API merger using deterministic rules.
+
+    Rules:
+    1. If API only in docs → Include with [DOCS_ONLY] tag
+    2. If API only in code → Include with [UNDOCUMENTED] tag
+    3. If both match perfectly → Include normally
+    4. If conflict → Include both versions with [CONFLICT] tag, prefer code signature
+    """
+
+    def __init__(self, docs_data: Dict, github_data: Dict, conflicts: List[Conflict]):
+        """
+        Initialize rule-based merger.
+
+        Args:
+            docs_data: Documentation scraper data
+            github_data: GitHub scraper data
+            conflicts: List of detected conflicts
+        """
+        self.docs_data = docs_data
+        self.github_data = github_data
+        self.conflicts = conflicts
+
+        # Build conflict index for fast lookup
+        self.conflict_index = {c.api_name: c for c in conflicts}
+
+        # Extract APIs from both sources
+        detector = ConflictDetector(docs_data, github_data)
+        self.docs_apis = detector.docs_apis
+        self.code_apis = detector.code_apis
+
+    def merge_all(self) -> Dict[str, Any]:
+        """
+        Merge all APIs using rule-based logic.
+
+        Returns:
+            Dict containing merged API data
+        """
+        logger.info("Starting rule-based merge...")
+
+        merged_apis = {}
+
+        # Get all unique API names
+        all_api_names = set(self.docs_apis.keys()) | set(self.code_apis.keys())
+
+        for api_name in sorted(all_api_names):
+            merged_api = self._merge_single_api(api_name)
+            merged_apis[api_name] = merged_api
+
+        logger.info(f"Merged {len(merged_apis)} APIs")
+
+        return {
+            'merge_mode': 'rule-based',
+            'apis': merged_apis,
+            'summary': {
+                'total_apis': len(merged_apis),
+                'docs_only': sum(1 for api in merged_apis.values() if api['status'] == 'docs_only'),
+                'code_only': sum(1 for api in merged_apis.values() if api['status'] == 'code_only'),
+                'matched': sum(1 for api in merged_apis.values() if api['status'] == 'matched'),
+                'conflict': sum(1 for api in merged_apis.values() if api['status'] == 'conflict')
+            }
+        }
+
+    def _merge_single_api(self, api_name: str) -> Dict[str, Any]:
+        """
+        Merge a single API using rules.
+
+        Args:
+            api_name: Name of the API to merge
+
+        Returns:
+            Merged API dict
+        """
+        in_docs = api_name in self.docs_apis
+        in_code = api_name in self.code_apis
+        has_conflict = api_name in self.conflict_index
+
+        # Rule 1: Only in docs
+        if in_docs and not in_code:
+            conflict = self.conflict_index.get(api_name)
+            return {
+                'name': api_name,
+                'status': 'docs_only',
+                'source': 'documentation',
+                'data': self.docs_apis[api_name],
+                'warning': 'This API is documented but not found in codebase',
+                'conflict': conflict.__dict__ if conflict else None
+            }
+
+        # Rule 2: Only in code
+        if in_code and not in_docs:
+            is_private = api_name.startswith('_')
+            conflict = self.conflict_index.get(api_name)
+            return {
+                'name': api_name,
+                'status': 'code_only',
+                'source': 'code',
+                'data': self.code_apis[api_name],
+                'warning': 'This API exists in code but is not documented' if not is_private else 'Internal/private API',
+                'conflict': conflict.__dict__ if conflict else None
+            }
+
+        # Both exist - check for conflicts
+        docs_info = self.docs_apis[api_name]
+        code_info = self.code_apis[api_name]
+
+        # Rule 3: Both match perfectly (no conflict)
+        if not has_conflict:
+            return {
+                'name': api_name,
+                'status': 'matched',
+                'source': 'both',
+                'docs_data': docs_info,
+                'code_data': code_info,
+                'merged_signature': self._create_merged_signature(code_info, docs_info),
+                'merged_description': docs_info.get('docstring') or code_info.get('docstring')
+            }
+
+        # Rule 4: Conflict exists - prefer code signature, keep docs description
+        conflict = self.conflict_index[api_name]
+
+        return {
+            'name': api_name,
+            'status': 'conflict',
+            'source': 'both',
+            'docs_data': docs_info,
+            'code_data': code_info,
+            'conflict': conflict.__dict__,
+            'resolution': 'prefer_code_signature',
+            'merged_signature': self._create_merged_signature(code_info, docs_info),
+            'merged_description': docs_info.get('docstring') or code_info.get('docstring'),
+            'warning': conflict.difference
+        }
+
+    def _create_merged_signature(self, code_info: Dict, docs_info: Dict) -> str:
+        """
+        Create merged signature preferring code data.
+
+        Args:
+            code_info: API info from code
+            docs_info: API info from docs
+
+        Returns:
+            Merged signature string
+        """
+        name = code_info.get('name', docs_info.get('name'))
+        params = code_info.get('parameters', docs_info.get('parameters', []))
+        return_type = code_info.get('return_type', docs_info.get('return_type'))
+
+        # Build parameter string
+        param_strs = []
+        for param in params:
+            param_str = param['name']
+            if param.get('type_hint'):
+                param_str += f": {param['type_hint']}"
+            if param.get('default') is not None:
+                param_str += f" = {param['default']}"
+            param_strs.append(param_str)
+
+        signature = f"{name}({', '.join(param_strs)})"
+
+        if return_type:
+            signature += f" -> {return_type}"
+
+        return signature
+
+
+class ClaudeEnhancedMerger:
+    """
+    Claude-enhanced API merger using local Claude Code.
+
+    Opens Claude Code in a new terminal to intelligently reconcile conflicts.
+    Uses the same approach as enhance_skill_local.py.
+    """
+
+    def __init__(self, docs_data: Dict, github_data: Dict, conflicts: List[Conflict]):
+        """
+        Initialize Claude-enhanced merger.
+
+        Args:
+            docs_data: Documentation scraper data
+            github_data: GitHub scraper data
+            conflicts: List of detected conflicts
+        """
+        self.docs_data = docs_data
+        self.github_data = github_data
+        self.conflicts = conflicts
+
+        # First do rule-based merge as baseline
+        self.rule_merger = RuleBasedMerger(docs_data, github_data, conflicts)
+
+    def merge_all(self) -> Dict[str, Any]:
+        """
+        Merge all APIs using Claude enhancement.
+
+        Returns:
+            Dict containing merged API data
+        """
+        logger.info("Starting Claude-enhanced merge...")
+
+        # Create temporary workspace
+        workspace_dir = self._create_workspace()
+
+        # Launch Claude Code for enhancement
+        logger.info("Launching Claude Code for intelligent merging...")
+        logger.info("Claude will analyze conflicts and create reconciled API reference")
+
+        try:
+            self._launch_claude_merge(workspace_dir)
+
+            # Read enhanced results
+            merged_data = self._read_merged_results(workspace_dir)
+
+            logger.info("Claude-enhanced merge complete")
+            return merged_data
+
+        except Exception as e:
+            logger.error(f"Claude enhancement failed: {e}")
+            logger.info("Falling back to rule-based merge")
+            return self.rule_merger.merge_all()
+
+    def _create_workspace(self) -> str:
+        """
+        Create temporary workspace with merge context.
+
+        Returns:
+            Path to workspace directory
+        """
+        workspace = tempfile.mkdtemp(prefix='skill_merge_')
+        logger.info(f"Created merge workspace: {workspace}")
+
+        # Write context files for Claude
+        self._write_context_files(workspace)
+
+        return workspace
+
+    def _write_context_files(self, workspace: str):
+        """Write context files for Claude to analyze."""
+
+        # 1. Write conflicts summary
+        conflicts_file = os.path.join(workspace, 'conflicts.json')
+        with open(conflicts_file, 'w') as f:
+            json.dump({
+                'conflicts': [c.__dict__ for c in self.conflicts],
+                'summary': {
+                    'total': len(self.conflicts),
+                    'by_type': self._count_by_field('type'),
+                    'by_severity': self._count_by_field('severity')
+                }
+            }, f, indent=2)
+
+        # 2. Write documentation APIs
+        docs_apis_file = os.path.join(workspace, 'docs_apis.json')
+        detector = ConflictDetector(self.docs_data, self.github_data)
+        with open(docs_apis_file, 'w') as f:
+            json.dump(detector.docs_apis, f, indent=2)
+
+        # 3. Write code APIs
+        code_apis_file = os.path.join(workspace, 'code_apis.json')
+        with open(code_apis_file, 'w') as f:
+            json.dump(detector.code_apis, f, indent=2)
+
+        # 4. Write merge instructions for Claude
+        instructions = """# API Merge Task
+
+You are merging API documentation from two sources:
+1. Official documentation (user-facing)
+2. Source code analysis (implementation reality)
+
+## Context Files:
+- `conflicts.json` - All detected conflicts between sources
+- `docs_apis.json` - APIs from documentation
+- `code_apis.json` - APIs from source code
+
+## Your Task:
+For each conflict, reconcile the differences intelligently:
+
+1. **Prefer code signatures as source of truth**
+   - Use actual parameter names, types, defaults from code
+   - Code is what actually runs, docs might be outdated
+
+2. **Keep documentation descriptions**
+   - Docs are user-friendly, code comments might be technical
+   - Keep the docs' explanation of what the API does
+
+3. **Add implementation notes for discrepancies**
+   - If docs differ from code, explain the difference
+   - Example: "⚠️ The `snap` parameter exists in code but is not documented"
+
+4. **Flag missing APIs clearly**
+   - Missing in docs → Add [UNDOCUMENTED] tag
+   - Missing in code → Add [REMOVED] or [DOCS_ERROR] tag
+
+5. **Create unified API reference**
+   - One definitive signature per API
+   - Clear warnings about conflicts
+   - Implementation notes where helpful
+
+## Output Format:
+Create `merged_apis.json` with this structure:
+
+```json
+{
+  "apis": {
+    "API.name": {
+      "signature": "final_signature_here",
+      "parameters": [...],
+      "return_type": "type",
+      "description": "user-friendly description",
+      "implementation_notes": "Any discrepancies or warnings",
+      "source": "both|docs_only|code_only",
+      "confidence": "high|medium|low"
+    }
+  }
+}
+```
+
+Take your time to analyze each conflict carefully. The goal is to create the most accurate and helpful API reference possible.
+"""
+
+        instructions_file = os.path.join(workspace, 'MERGE_INSTRUCTIONS.md')
+        with open(instructions_file, 'w', encoding='utf-8') as f:
+            f.write(instructions)
+
+        logger.info(f"Wrote context files to {workspace}")
+
+    def _count_by_field(self, field: str) -> Dict[str, int]:
+        """Count conflicts by a specific field."""
+        counts = {}
+        for conflict in self.conflicts:
+            value = getattr(conflict, field)
+            counts[value] = counts.get(value, 0) + 1
+        return counts
+
+    def _launch_claude_merge(self, workspace: str):
+        """
+        Launch Claude Code to perform merge.
+
+        Similar to enhance_skill_local.py approach.
+        """
+        # Create a script that Claude will execute
+        script_path = os.path.join(workspace, 'merge_script.sh')
+
+        script_content = f"""#!/bin/bash
+# Automatic merge script for Claude Code
+
+cd "{workspace}"
+
+echo "📊 Analyzing conflicts..."
+cat conflicts.json | head -20
+
+echo ""
+echo "📖 Documentation APIs: $(cat docs_apis.json | grep -c '\"name\"')"
+echo "💻 Code APIs: $(cat code_apis.json | grep -c '\"name\"')"
+echo ""
+echo "Please review the conflicts and create merged_apis.json"
+echo "Follow the instructions in MERGE_INSTRUCTIONS.md"
+echo ""
+echo "When done, save merged_apis.json and close this terminal."
+
+# Wait for user to complete merge
+read -p "Press Enter when merge is complete..."
+"""
+
+        with open(script_path, 'w', encoding='utf-8') as f:
+            f.write(script_content)
+
+        os.chmod(script_path, 0o755)
+
+        # Open a new terminal for the merge, trying common emulators in turn
+        terminals = [
+            ['x-terminal-emulator', '-e'],
+            ['gnome-terminal', '--'],
+            ['xterm', '-e'],
+            ['konsole', '-e']
+        ]
+
+        for terminal_cmd in terminals:
+            try:
+                subprocess.Popen(terminal_cmd + ['bash', script_path])
+                logger.info(f"Opened terminal with {terminal_cmd[0]}")
+                break
+            except FileNotFoundError:
+                continue
+        else:
+            raise RuntimeError("No supported terminal emulator found")
+
+        # Wait for merge to complete
+        merged_file = os.path.join(workspace, 'merged_apis.json')
+        logger.info(f"Waiting for merged results at: {merged_file}")
+        logger.info("Save merged_apis.json in the workspace to continue...")
+
+        # Poll for file existence
+        import time
+        timeout = 3600  # 1 hour max
+        elapsed = 0
+        while not os.path.exists(merged_file) and elapsed < timeout:
+            time.sleep(5)
+            elapsed += 5
+
+        if not os.path.exists(merged_file):
+            raise TimeoutError("Claude merge timed out after 1 hour")
+
+    def _read_merged_results(self, workspace: str) -> Dict[str, Any]:
+        """Read merged results from workspace."""
+        merged_file = os.path.join(workspace, 'merged_apis.json')
+
+        if not os.path.exists(merged_file):
+            raise FileNotFoundError(f"Merged results not found: {merged_file}")
+
+        with open(merged_file, 'r', encoding='utf-8') as f:
+            merged_data = json.load(f)
+
+        return {
+            'merge_mode': 'claude-enhanced',
+            **merged_data
+        }
+
+
+def merge_sources(docs_data_path: str,
+                  github_data_path: str,
+                  output_path: str,
+                  mode: str = 'rule-based') -> Dict[str, Any]:
+    """
+    Merge documentation and GitHub data.
+
+    Args:
+        docs_data_path: Path to documentation data JSON
+        github_data_path: Path to GitHub data JSON
+        output_path: Path to save merged output
+        mode: 'rule-based' or 'claude-enhanced'
+
+    Returns:
+        Merged data dict
+    """
+    # Load data
+    with open(docs_data_path, 'r', encoding='utf-8') as f:
+        docs_data = json.load(f)
+
+    with open(github_data_path, 'r', encoding='utf-8') as f:
+        github_data = json.load(f)
+
+    # Detect conflicts
+    detector = ConflictDetector(docs_data, github_data)
+    conflicts = detector.detect_all_conflicts()
+
+    logger.info(f"Detected {len(conflicts)} conflicts")
+
+    # Merge based on mode
+    if mode == 'claude-enhanced':
+        merger = ClaudeEnhancedMerger(docs_data, github_data, conflicts)
+    else:
+        merger = RuleBasedMerger(docs_data, github_data, conflicts)
+
+    merged_data = merger.merge_all()
+
+    # Save merged data
+    with open(output_path, 'w', encoding='utf-8') as f:
+        json.dump(merged_data, f, indent=2, ensure_ascii=False)
+
+    logger.info(f"Merged data saved to: {output_path}")
+
+    return merged_data
+
+
+if __name__ == '__main__':
+    import argparse
+
+    parser = argparse.ArgumentParser(description='Merge documentation and code sources')
+    parser.add_argument('docs_data', help='Path to documentation data JSON')
+    parser.add_argument('github_data', help='Path to GitHub data JSON')
+    parser.add_argument('--output', '-o', default='merged_data.json', help='Output file path')
+    parser.add_argument('--mode', '-m', choices=['rule-based', 'claude-enhanced'],
+                       default='rule-based', help='Merge mode')
+
+    args = parser.parse_args()
+
+    merged = merge_sources(args.docs_data, args.github_data, args.output, args.mode)
+
+    # Print summary
+    summary = merged.get('summary', {})
+    print(f"\n✅ Merge complete ({merged.get('merge_mode')})")
+    print(f"   Total APIs: {summary.get('total_apis', 0)}")
+    print(f"   Matched: {summary.get('matched', 0)}")
+    print(f"   Docs only: {summary.get('docs_only', 0)}")
+    print(f"   Code only: {summary.get('code_only', 0)}")
+    print(f"   Conflicts: {summary.get('conflict', 0)}")
+    print(f"\n📄 Saved to: {args.output}")
--- a/src/skill_seekers/cli/package_multi.py
+++ b/src/skill_seekers/cli/package_multi.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+"""
+Multi-Skill Packager
+
+Package multiple skills at once. Useful for packaging router + sub-skills together.
+"""
+
+import sys
+import argparse
+from pathlib import Path
+import subprocess
+
+
+def package_skill(skill_dir: Path) -> bool:
+    """Package a single skill"""
+    try:
+        result = subprocess.run(
+            [sys.executable, str(Path(__file__).parent / "package_skill.py"), str(skill_dir)],
+            capture_output=True,
+            text=True
+        )
+        return result.returncode == 0
+    except Exception as e:
+        print(f"❌ Error packaging {skill_dir}: {e}")
+        return False
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Package multiple skills at once",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Package all godot skills
+  python3 package_multi.py output/godot*/
+
+  # Package specific skills
+  python3 package_multi.py output/godot-2d/ output/godot-3d/ output/godot-scripting/
+        """
+    )
+
+    parser.add_argument(
+        'skill_dirs',
+        nargs='+',
+        help='Skill directories to package'
+    )
+
+    args = parser.parse_args()
+
+    print(f"\n{'='*60}")
+    print("MULTI-SKILL PACKAGER")
+    print(f"{'='*60}\n")
+
+    skill_dirs = [Path(d) for d in args.skill_dirs]
+    success_count = 0
+    total_count = len(skill_dirs)
+
+    for skill_dir in skill_dirs:
+        if not skill_dir.exists():
+            print(f"⚠️  Skipping (not found): {skill_dir}")
+            continue
+
+        if not (skill_dir / "SKILL.md").exists():
+            print(f"⚠️  Skipping (no SKILL.md): {skill_dir}")
+            continue
+
+        print(f"📦 Packaging: {skill_dir.name}")
+        if package_skill(skill_dir):
+            success_count += 1
+            print("   ✅ Success")
+        else:
+            print("   ❌ Failed")
+        print("")
+
+    print(f"{'='*60}")
+    print(f"SUMMARY: {success_count}/{total_count} skills packaged")
+    print(f"{'='*60}\n")
+
+
+if __name__ == "__main__":
+    main()
--- a/src/skill_seekers/cli/package_skill.py
+++ b/src/skill_seekers/cli/package_skill.py
@@ -0,0 +1,177 @@
+#!/usr/bin/env python3
+"""
+Simple Skill Packager
+Packages a skill directory into a .zip file for Claude.
+
+Usage:
+    python3 cli/package_skill.py output/steam-inventory/
+    python3 cli/package_skill.py output/react/
+    python3 cli/package_skill.py output/react/ --no-open  # Don't open folder
+"""
+
+import os
+import sys
+import zipfile
+import argparse
+from pathlib import Path
+
+# Import utilities
+try:
+    from utils import (
+        open_folder,
+        print_upload_instructions,
+        format_file_size,
+        validate_skill_directory
+    )
+except ImportError:
+    # If running from different directory, add cli to path
+    sys.path.insert(0, str(Path(__file__).parent))
+    from utils import (
+        open_folder,
+        print_upload_instructions,
+        format_file_size,
+        validate_skill_directory
+    )
+
+
+def package_skill(skill_dir, open_folder_after=True):
+    """
+    Package a skill directory into a .zip file
+
+    Args:
+        skill_dir: Path to skill directory
+        open_folder_after: Whether to open the output folder after packaging
+
+    Returns:
+        tuple: (success, zip_path) where success is bool and zip_path is Path or None
+    """
+    skill_path = Path(skill_dir)
+
+    # Validate skill directory
+    is_valid, error_msg = validate_skill_directory(skill_path)
+    if not is_valid:
+        print(f"❌ Error: {error_msg}")
+        return False, None
+
+    # Create zip filename
+    skill_name = skill_path.name
+    zip_path = skill_path.parent / f"{skill_name}.zip"
+
+    print(f"📦 Packaging skill: {skill_name}")
+    print(f"   Source: {skill_path}")
+    print(f"   Output: {zip_path}")
+
+    # Create zip file
+    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
+        for root, dirs, files in os.walk(skill_path):
+            # Skip backup files
+            files = [f for f in files if not f.endswith('.backup')]
+
+            for file in files:
+                file_path = Path(root) / file
+                arcname = file_path.relative_to(skill_path)
+                zf.write(file_path, arcname)
+                print(f"   + {arcname}")
+
+    # Get zip size
+    zip_size = zip_path.stat().st_size
+    print(f"\n✅ Package created: {zip_path}")
+    print(f"   Size: {zip_size:,} bytes ({format_file_size(zip_size)})")
+
+    # Open folder in file browser
+    if open_folder_after:
+        print(f"\n📂 Opening folder: {zip_path.parent}")
+        open_folder(zip_path.parent)
+
+    # Print upload instructions
+    print_upload_instructions(zip_path)
+
+    return True, zip_path
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Package a skill directory into a .zip file for Claude",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Package skill and open folder
+  python3 cli/package_skill.py output/react/
+
+  # Package skill without opening folder
+  python3 cli/package_skill.py output/react/ --no-open
+
+  # Get help
+  python3 cli/package_skill.py --help
+        """
+    )
+
+    parser.add_argument(
+        'skill_dir',
+        help='Path to skill directory (e.g., output/react/)'
+    )
+
+    parser.add_argument(
+        '--no-open',
+        action='store_true',
+        help='Do not open the output folder after packaging'
+    )
+
+    parser.add_argument(
+        '--upload',
+        action='store_true',
+        help='Automatically upload to Claude after packaging (requires ANTHROPIC_API_KEY)'
+    )
+
+    args = parser.parse_args()
+
+    success, zip_path = package_skill(args.skill_dir, open_folder_after=not args.no_open)
+
+    if not success:
+        sys.exit(1)
+
+    # Auto-upload if requested
+    if args.upload:
+        # Check if API key is set BEFORE attempting upload
+        api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
+
+        if not api_key:
+            # No API key - show helpful message but DON'T fail
+            print("\n" + "="*60)
+            print("💡 Automatic Upload")
+            print("="*60)
+            print()
+            print("To enable automatic upload:")
+            print("  1. Get API key from https://console.anthropic.com/")
+            print("  2. Set: export ANTHROPIC_API_KEY=sk-ant-...")
+            print("  3. Run package_skill.py with --upload flag")
+            print()
+            print("For now, use manual upload (instructions above) ☝️")
+            print("="*60)
+            # Exit successfully - packaging worked!
+            sys.exit(0)
+
+        # API key exists - try upload
+        try:
+            from upload_skill import upload_skill_api
+            print("\n" + "="*60)
+            upload_success, message = upload_skill_api(zip_path)
+            if not upload_success:
+                print(f"❌ Upload failed: {message}")
+                print()
+                print("💡 Try manual upload instead (instructions above) ☝️")
+                print("="*60)
+                # Exit successfully - packaging worked even if upload failed
+                sys.exit(0)
+            else:
+                print("="*60)
+                sys.exit(0)
+        except ImportError:
+            print("\n❌ Error: upload_skill.py not found")
+            sys.exit(1)
+
+    sys.exit(0)
+
+
+if __name__ == "__main__":
+    main()
--- a/src/skill_seekers/cli/pdf_extractor_poc.py
+++ b/src/skill_seekers/cli/pdf_extractor_poc.py
--- a/src/skill_seekers/cli/pdf_scraper.py
+++ b/src/skill_seekers/cli/pdf_scraper.py
@@ -0,0 +1,401 @@
+#!/usr/bin/env python3
+"""
+PDF Documentation to Claude Skill Converter (Task B1.6)
+
+Converts PDF documentation into Claude AI skills.
+Uses pdf_extractor_poc.py for extraction, builds skill structure.
+
+Usage:
+    python3 pdf_scraper.py --config configs/manual_pdf.json
+    python3 pdf_scraper.py --pdf manual.pdf --name myskill
+    python3 pdf_scraper.py --from-json manual_extracted.json
+"""
+
+import os
+import sys
+import json
+import re
+import argparse
+from pathlib import Path
+
+# Import the PDF extractor
+from pdf_extractor_poc import PDFExtractor
+
+
+class PDFToSkillConverter:
+    """Convert PDF documentation to Claude skill"""
+
+    def __init__(self, config):
+        self.config = config
+        self.name = config['name']
+        self.pdf_path = config.get('pdf_path', '')
+        self.description = config.get('description', f'Documentation skill for {self.name}')
+
+        # Paths
+        self.skill_dir = f"output/{self.name}"
+        self.data_file = f"output/{self.name}_extracted.json"
+
+        # Extraction options
+        self.extract_options = config.get('extract_options', {})
+
+        # Categories
+        self.categories = config.get('categories', {})
+
+        # Extracted data
+        self.extracted_data = None
+
+    def extract_pdf(self):
+        """Extract content from PDF using pdf_extractor_poc.py"""
+        print(f"\n🔍 Extracting from PDF: {self.pdf_path}")
+
+        # Create extractor with options
+        extractor = PDFExtractor(
+            self.pdf_path,
+            verbose=True,
+            chunk_size=self.extract_options.get('chunk_size', 10),
+            min_quality=self.extract_options.get('min_quality', 5.0),
+            extract_images=self.extract_options.get('extract_images', True),
+            image_dir=f"{self.skill_dir}/assets/images",
+            min_image_size=self.extract_options.get('min_image_size', 100)
+        )
+
+        # Extract
+        result = extractor.extract_all()
+
+        if not result:
+            print("❌ Extraction failed")
+            raise RuntimeError(f"Failed to extract PDF: {self.pdf_path}")
+
+        # Save extracted data
+        with open(self.data_file, 'w', encoding='utf-8') as f:
+            json.dump(result, f, indent=2, ensure_ascii=False)
+
+        print(f"\n💾 Saved extracted data to: {self.data_file}")
+        self.extracted_data = result
+        return True
+
+    def load_extracted_data(self, json_path):
+        """Load previously extracted data from JSON"""
+        print(f"\n📂 Loading extracted data from: {json_path}")
+
+        with open(json_path, 'r', encoding='utf-8') as f:
+            self.extracted_data = json.load(f)
+
+        print(f"✅ Loaded {self.extracted_data['total_pages']} pages")
+        return True
+
+    def categorize_content(self):
+        """Categorize pages based on chapters or keywords"""
+        print("\n📋 Categorizing content...")
+
+        categorized = {}
+
+        # Use chapters if available
+        if self.extracted_data.get('chapters'):
+            for chapter in self.extracted_data['chapters']:
+                category_key = self._sanitize_filename(chapter['title'])
+                categorized[category_key] = {
+                    'title': chapter['title'],
+                    'pages': []
+                }
+
+            # Assign pages to chapters
+            for page in self.extracted_data['pages']:
+                page_num = page['page_number']
+
+                # Find which chapter this page belongs to
+                for chapter in self.extracted_data['chapters']:
+                    if chapter['start_page'] <= page_num <= chapter['end_page']:
+                        category_key = self._sanitize_filename(chapter['title'])
+                        categorized[category_key]['pages'].append(page)
+                        break
+
+        # Fall back to keyword-based categorization
+        elif self.categories:
+            # Check if categories is already in the right format (for tests)
+            # If first value is a list of dicts (pages), use as-is
+            first_value = next(iter(self.categories.values()))
+            if isinstance(first_value, list) and first_value and isinstance(first_value[0], dict):
+                # Already categorized - convert to expected format
+                for cat_key, pages in self.categories.items():
+                    categorized[cat_key] = {
+                        'title': cat_key.replace('_', ' ').title(),
+                        'pages': pages
+                    }
+            else:
+                # Keyword-based categorization
+                # Initialize categories
+                for cat_key, keywords in self.categories.items():
+                    categorized[cat_key] = {
+                        'title': cat_key.replace('_', ' ').title(),
+                        'pages': []
+                    }
+
+                # Categorize by keywords
+                for page in self.extracted_data['pages']:
+                    text = page.get('text', '').lower()
+                    headings_text = ' '.join([h['text'] for h in page.get('headings', [])]).lower()
+
+                    # Score against each category
+                    scores = {}
+                    for cat_key, keywords in self.categories.items():
+                        # Handle both string keywords and dict keywords (shouldn't happen, but be safe)
+                        if isinstance(keywords, list):
+                            score = sum(1 for kw in keywords
+                                      if isinstance(kw, str) and (kw.lower() in text or kw.lower() in headings_text))
+                        else:
+                            score = 0
+                        if score > 0:
+                            scores[cat_key] = score
+
+                    # Assign to highest scoring category
+                    if scores:
+                        best_cat = max(scores, key=scores.get)
+                        categorized[best_cat]['pages'].append(page)
+                    else:
+                        # Default category
+                        if 'other' not in categorized:
+                            categorized['other'] = {'title': 'Other', 'pages': []}
+                        categorized['other']['pages'].append(page)
+
+        else:
+            # No categorization - use single category
+            categorized['content'] = {
+                'title': 'Content',
+                'pages': self.extracted_data['pages']
+            }
+
+        print(f"✅ Created {len(categorized)} categories")
+        for cat_key, cat_data in categorized.items():
+            print(f"   - {cat_data['title']}: {len(cat_data['pages'])} pages")
+
+        return categorized
+
+    def build_skill(self):
+        """Build complete skill structure"""
+        print(f"\n🏗️  Building skill: {self.name}")
+
+        # Create directories
+        os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
+
+        # Categorize content
+        categorized = self.categorize_content()
+
+        # Generate reference files
+        print("\n📝 Generating reference files...")
+        for cat_key, cat_data in categorized.items():
+            self._generate_reference_file(cat_key, cat_data)
+
+        # Generate index
+        self._generate_index(categorized)
+
+        # Generate SKILL.md
+        self._generate_skill_md(categorized)
+
+        print(f"\n✅ Skill built successfully: {self.skill_dir}/")
+        print(f"\n📦 Next step: Package with: python3 cli/package_skill.py {self.skill_dir}/")
+
+    def _generate_reference_file(self, cat_key, cat_data):
+        """Generate a reference markdown file for a category"""
+        filename = f"{self.skill_dir}/references/{cat_key}.md"
+
+        with open(filename, 'w', encoding='utf-8') as f:
+            f.write(f"# {cat_data['title']}\n\n")
+
+            for page in cat_data['pages']:
+                # Add headings as section markers
+                if page.get('headings'):
+                    f.write(f"## {page['headings'][0]['text']}\n\n")
+
+                # Add text content
+                if page.get('text'):
+                    # Limit to first 1000 chars per page to avoid huge files
+                    text = page['text'][:1000]
+                    f.write(f"{text}\n\n")
+
+                # Add code samples (check both 'code_samples' and 'code_blocks' for compatibility)
+                code_list = page.get('code_samples') or page.get('code_blocks')
+                if code_list:
+                    f.write("### Code Examples\n\n")
+                    for code in code_list[:3]:  # Limit to top 3
+                        lang = code.get('language', '')
+                        f.write(f"```{lang}\n{code['code']}\n```\n\n")
+
+                # Add images
+                if page.get('images'):
+                    # Create assets directory if needed
+                    assets_dir = os.path.join(self.skill_dir, 'assets')
+                    os.makedirs(assets_dir, exist_ok=True)
+
+                    f.write("### Images\n\n")
+                    for img in page['images']:
+                        # Save image to assets
+                        img_filename = f"page_{page['page_number']}_img_{img['index']}.png"
+                        img_path = os.path.join(assets_dir, img_filename)
+
+                        with open(img_path, 'wb') as img_file:
+                            img_file.write(img['data'])
+
+                        # Add markdown image reference
+                        f.write(f"![Image {img['index']}](../assets/{img_filename})\n\n")
+
+                f.write("---\n\n")
+
+        print(f"   Generated: {filename}")
+
+    def _generate_index(self, categorized):
+        """Generate reference index"""
+        filename = f"{self.skill_dir}/references/index.md"
+
+        with open(filename, 'w', encoding='utf-8') as f:
+            f.write(f"# {self.name.title()} Documentation Reference\n\n")
+            f.write("## Categories\n\n")
+
+            for cat_key, cat_data in categorized.items():
+                page_count = len(cat_data['pages'])
+                f.write(f"- [{cat_data['title']}]({cat_key}.md) ({page_count} pages)\n")
+
+            f.write("\n## Statistics\n\n")
+            stats = self.extracted_data.get('quality_statistics', {})
+            f.write(f"- Total pages: {self.extracted_data.get('total_pages', 0)}\n")
+            f.write(f"- Code blocks: {self.extracted_data.get('total_code_blocks', 0)}\n")
+            f.write(f"- Images: {self.extracted_data.get('total_images', 0)}\n")
+            if stats:
+                f.write(f"- Average code quality: {stats.get('average_quality', 0):.1f}/10\n")
+                f.write(f"- Valid code blocks: {stats.get('valid_code_blocks', 0)}\n")
+
+        print(f"   Generated: {filename}")
+
+    def _generate_skill_md(self, categorized):
+        """Generate main SKILL.md file"""
+        filename = f"{self.skill_dir}/SKILL.md"
+
+        # Generate skill name (lowercase, hyphens only, max 64 chars)
+        skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]  # other characters are kept as-is
+
+        # Truncate description to 1024 chars if needed
+        desc = self.description[:1024]
+
+        with open(filename, 'w', encoding='utf-8') as f:
+            # Write YAML frontmatter
+            f.write("---\n")
+            f.write(f"name: {skill_name}\n")
+            f.write(f"description: {desc}\n")
+            f.write("---\n\n")
+
+            f.write(f"# {self.name.title()} Documentation Skill\n\n")
+            f.write(f"{self.description}\n\n")
+
+            f.write("## When to use this skill\n\n")
+            f.write(f"Use this skill when the user asks about {self.name} documentation, ")
+            f.write("including API references, tutorials, examples, and best practices.\n\n")
+
+            f.write("## What's included\n\n")
+            f.write("This skill contains:\n\n")
+            for cat_key, cat_data in categorized.items():
+                f.write(f"- **{cat_data['title']}**: {len(cat_data['pages'])} pages\n")
+
+            f.write("\n## Quick Reference\n\n")
+
+            # Get high-quality code samples
+            all_code = []
+            for page in self.extracted_data['pages']:
+                all_code.extend(page.get('code_samples') or page.get('code_blocks') or [])
+
+            # Sort by quality and get top 5
+            all_code.sort(key=lambda x: x.get('quality_score', 0), reverse=True)
+            top_code = all_code[:5]
+
+            if top_code:
+                f.write("### Top Code Examples\n\n")
+                for i, code in enumerate(top_code, 1):
+                    lang = code.get('language', '')
+                    quality = code.get('quality_score', 0)
+                    f.write(f"**Example {i}** (Quality: {quality:.1f}/10):\n\n")
+                    f.write(f"```{lang}\n{code['code'][:300]}{'...' if len(code['code']) > 300 else ''}\n```\n\n")
+
+            f.write("## Navigation\n\n")
+            f.write("See `references/index.md` for complete documentation structure.\n\n")
+
+            # Add language statistics
+            langs = self.extracted_data.get('languages_detected', {})
+            if langs:
+                f.write("## Languages Covered\n\n")
+                for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
+                    f.write(f"- {lang}: {count} examples\n")
+
+        print(f"   Generated: {filename}")
+
+    def _sanitize_filename(self, name):
+        """Convert string to safe filename"""
+        # Remove special chars, replace spaces with underscores
+        safe = re.sub(r'[^\w\s-]', '', name.lower())
+        safe = re.sub(r'[-\s]+', '_', safe)
+        return safe
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='Convert PDF documentation to Claude skill',
+        formatter_class=argparse.RawDescriptionHelpFormatter
+    )
+
+    parser.add_argument('--config', help='PDF config JSON file')
+    parser.add_argument('--pdf', help='Direct PDF file path')
+    parser.add_argument('--name', help='Skill name (with --pdf)')
+    parser.add_argument('--from-json', help='Build skill from extracted JSON')
+    parser.add_argument('--description', help='Skill description')
+
+    args = parser.parse_args()
+
+    # Validate inputs
+    if not (args.config or args.pdf or args.from_json):
+        parser.error("Must specify --config, --pdf, or --from-json")
+
+    # Load or create config
+    if args.config:
+        with open(args.config, 'r', encoding='utf-8') as f:
+            config = json.load(f)
+    elif args.from_json:
+        # Build from extracted JSON
+        name = Path(args.from_json).stem.replace('_extracted', '')
+        config = {
+            'name': name,
+            'description': args.description or f'Documentation skill for {name}'
+        }
+        converter = PDFToSkillConverter(config)
+        converter.load_extracted_data(args.from_json)
+        converter.build_skill()
+        return
+    else:
+        # Direct PDF mode
+        if not args.name:
+            parser.error("Must specify --name with --pdf")
+        config = {
+            'name': args.name,
+            'pdf_path': args.pdf,
+            'description': args.description or f'Documentation skill for {args.name}',
+            'extract_options': {
+                'chunk_size': 10,
+                'min_quality': 5.0,
+                'extract_images': True,
+                'min_image_size': 100
+            }
+        }
+
+    # Create converter
+    converter = PDFToSkillConverter(config)
+
+    # Extract if needed
+    if config.get('pdf_path'):
+        if not converter.extract_pdf():
+            sys.exit(1)
+
+    # Build skill
+    converter.build_skill()
+
+
+if __name__ == '__main__':
+    main()
--- a/src/skill_seekers/cli/run_tests.py
+++ b/src/skill_seekers/cli/run_tests.py
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""
+Test Runner for Skill Seeker
+Runs all test suites and generates a comprehensive test report
+"""
+
+import sys
+import unittest
+import os
+from io import StringIO
+from pathlib import Path
+
+
+class ColoredTextTestResult(unittest.TextTestResult):
+    """Custom test result class with colored output"""
+
+    # ANSI color codes
+    GREEN = '\033[92m'
+    RED = '\033[91m'
+    YELLOW = '\033[93m'
+    BLUE = '\033[94m'
+    RESET = '\033[0m'
+    BOLD = '\033[1m'
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.test_results = []
+
+    def addSuccess(self, test):
+        unittest.TestResult.addSuccess(self, test)  # bypass TextTestResult, which prints its own status
+        self.test_results.append(('PASS', test))
+        if self.showAll:
+            self.stream.write(f"{self.GREEN}✓ PASS{self.RESET}\n")
+        elif self.dots:
+            self.stream.write(f"{self.GREEN}.{self.RESET}")
+            self.stream.flush()
+
+    def addError(self, test, err):
+        unittest.TestResult.addError(self, test, err)  # bypass TextTestResult, which prints its own status
+        self.test_results.append(('ERROR', test))
+        if self.showAll:
+            self.stream.write(f"{self.RED}✗ ERROR{self.RESET}\n")
+        elif self.dots:
+            self.stream.write(f"{self.RED}E{self.RESET}")
+            self.stream.flush()
+
+    def addFailure(self, test, err):
+        unittest.TestResult.addFailure(self, test, err)  # bypass TextTestResult, which prints its own status
+        self.test_results.append(('FAIL', test))
+        if self.showAll:
+            self.stream.write(f"{self.RED}✗ FAIL{self.RESET}\n")
+        elif self.dots:
+            self.stream.write(f"{self.RED}F{self.RESET}")
+            self.stream.flush()
+
+    def addSkip(self, test, reason):
+        unittest.TestResult.addSkip(self, test, reason)  # bypass TextTestResult, which prints its own status
+        self.test_results.append(('SKIP', test))
+        if self.showAll:
+            self.stream.write(f"{self.YELLOW}⊘ SKIP{self.RESET}\n")
+        elif self.dots:
+            self.stream.write(f"{self.YELLOW}s{self.RESET}")
+            self.stream.flush()
+
+
+class ColoredTextTestRunner(unittest.TextTestRunner):
+    """Custom test runner with colored output"""
+    resultclass = ColoredTextTestResult
+
+
+def discover_tests(test_dir='tests'):
+    """Discover all test files in the tests directory"""
+    loader = unittest.TestLoader()
+    start_dir = test_dir
+    pattern = 'test_*.py'
+
+    suite = loader.discover(start_dir, pattern=pattern)
+    return suite
+
+
+def run_specific_suite(suite_name):
+    """Run a specific test suite"""
+    loader = unittest.TestLoader()
+
+    suite_map = {
+        'config': 'tests.test_config_validation',
+        'features': 'tests.test_scraper_features',
+        'integration': 'tests.test_integration'
+    }
+
+    if suite_name not in suite_map:
+        print(f"Unknown test suite: {suite_name}")
+        print(f"Available suites: {', '.join(suite_map.keys())}")
+        return None
+
+    module_name = suite_map[suite_name]
+    try:
+        suite = loader.loadTestsFromName(module_name)
+        return suite
+    except Exception as e:
+        print(f"Error loading test suite '{suite_name}': {e}")
+        return None
+
+
+def print_summary(result):
+    """Print a detailed test summary"""
+    total = result.testsRun
+    passed = total - len(result.failures) - len(result.errors) - len(result.skipped)
+    failed = len(result.failures)
+    errors = len(result.errors)
+    skipped = len(result.skipped)
+
+    print("\n" + "="*70)
+    print("TEST SUMMARY")
+    print("="*70)
+
+    # Overall stats
+    print(f"\n{ColoredTextTestResult.BOLD}Total Tests:{ColoredTextTestResult.RESET} {total}")
+    print(f"{ColoredTextTestResult.GREEN}✓ Passed:{ColoredTextTestResult.RESET} {passed}")
+    if failed > 0:
+        print(f"{ColoredTextTestResult.RED}✗ Failed:{ColoredTextTestResult.RESET} {failed}")
+    if errors > 0:
+        print(f"{ColoredTextTestResult.RED}✗ Errors:{ColoredTextTestResult.RESET} {errors}")
+    if skipped > 0:
+        print(f"{ColoredTextTestResult.YELLOW}⊘ Skipped:{ColoredTextTestResult.RESET} {skipped}")
+
+    # Success rate
+    if total > 0:
+        success_rate = (passed / total) * 100
+        color = ColoredTextTestResult.GREEN if success_rate == 100 else \
+                ColoredTextTestResult.YELLOW if success_rate >= 80 else \
+                ColoredTextTestResult.RED
+        print(f"\n{color}Success Rate: {success_rate:.1f}%{ColoredTextTestResult.RESET}")
+
+    # Category breakdown
+    if hasattr(result, 'test_results'):
+        print(f"\n{ColoredTextTestResult.BOLD}Test Breakdown by Category:{ColoredTextTestResult.RESET}")
+
+        categories = {}
+        for status, test in result.test_results:
+            # Group by TestCase class. str(test) looks like
+            # "test_x (pkg.module.Class)", so splitting it on '.' would
+            # yield "(pkg" rather than the class name.
+            class_name = type(test).__name__
+            if class_name not in categories:
+                categories[class_name] = {'PASS': 0, 'FAIL': 0, 'ERROR': 0, 'SKIP': 0}
+            categories[class_name][status] += 1
+
+        for category, stats in sorted(categories.items()):
+            total_cat = sum(stats.values())
+            passed_cat = stats['PASS']
+            print(f"  {category}: {passed_cat}/{total_cat} passed")
+
+    print("\n" + "="*70)
+
+    # Return status
+    return failed == 0 and errors == 0
+
+
+def main():
+    """Main test runner"""
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description='Run tests for Skill Seeker',
+        formatter_class=argparse.RawDescriptionHelpFormatter
+    )
+
+    parser.add_argument('--suite', '-s', type=str,
+                       help='Run specific test suite (config, features, integration)')
+    parser.add_argument('--verbose', '-v', action='store_true',
+                       help='Verbose output (show each test)')
+    parser.add_argument('--quiet', '-q', action='store_true',
+                       help='Quiet output (minimal output)')
+    parser.add_argument('--failfast', '-f', action='store_true',
+                       help='Stop on first failure')
+    parser.add_argument('--list', '-l', action='store_true',
+                       help='List all available tests')
+
+    args = parser.parse_args()
+
+    # Set verbosity
+    verbosity = 1
+    if args.verbose:
+        verbosity = 2
+    elif args.quiet:
+        verbosity = 0
+
+    print(f"\n{ColoredTextTestResult.BOLD}{'='*70}{ColoredTextTestResult.RESET}")
+    print(f"{ColoredTextTestResult.BOLD}SKILL SEEKER TEST SUITE{ColoredTextTestResult.RESET}")
+    print(f"{ColoredTextTestResult.BOLD}{'='*70}{ColoredTextTestResult.RESET}\n")
+
+    # Discover or load specific suite
+    if args.suite:
+        print(f"Running test suite: {ColoredTextTestResult.BLUE}{args.suite}{ColoredTextTestResult.RESET}\n")
+        suite = run_specific_suite(args.suite)
+        if suite is None:
+            return 1
+    else:
+        print(f"Running {ColoredTextTestResult.BLUE}all tests{ColoredTextTestResult.RESET}\n")
+        suite = discover_tests()
+
+    # List tests
+    if args.list:
+        print("\nAvailable tests:\n")
+        for test_group in suite:
+            for test in test_group:
+                print(f"  - {test}")
+        print()
+        return 0
+
+    # Run tests
+    runner = ColoredTextTestRunner(
+        verbosity=verbosity,
+        failfast=args.failfast
+    )
+
+    result = runner.run(suite)
+
+    # Print summary
+    success = print_summary(result)
+
+    # Return appropriate exit code
+    return 0 if success else 1
+
+
+if __name__ == '__main__':
+    sys.exit(main())
--- a/src/skill_seekers/cli/split_config.py
+++ b/src/skill_seekers/cli/split_config.py
@@ -0,0 +1,320 @@
+#!/usr/bin/env python3
+"""
+Config Splitter for Large Documentation Sites
+
+Splits large documentation configs into multiple smaller, focused skill configs.
+Supports multiple splitting strategies: category-based, size-based, and automatic.
+"""
+
+import json
+import sys
+import argparse
+from pathlib import Path
+from typing import Dict, List, Any, Tuple
+from collections import defaultdict
+
+
+class ConfigSplitter:
+    """Splits large documentation configs into multiple focused configs"""
+
+    def __init__(self, config_path: str, strategy: str = "auto", target_pages: int = 5000):
+        self.config_path = Path(config_path)
+        self.strategy = strategy
+        self.target_pages = target_pages
+        self.config = self.load_config()
+        self.base_name = self.config['name']
+
+    def load_config(self) -> Dict[str, Any]:
+        """Load configuration from file"""
+        try:
+            with open(self.config_path, 'r', encoding='utf-8') as f:
+                return json.load(f)
+        except FileNotFoundError:
+            print(f"❌ Error: Config file not found: {self.config_path}")
+            sys.exit(1)
+        except json.JSONDecodeError as e:
+            print(f"❌ Error: Invalid JSON in config file: {e}")
+            sys.exit(1)
+
+    def get_split_strategy(self) -> str:
+        """Determine split strategy"""
+        # Check if strategy is defined in config
+        if 'split_strategy' in self.config:
+            config_strategy = self.config['split_strategy']
+            if config_strategy != "none":
+                return config_strategy
+
+        # Use provided strategy or auto-detect
+        if self.strategy == "auto":
+            max_pages = self.config.get('max_pages', 500)
+
+            if max_pages < self.target_pages:
+                print(f"ℹ️  Small documentation ({max_pages} pages) - no splitting needed")
+                return "none"
+            elif max_pages < 10000 and 'categories' in self.config:
+                print(f"ℹ️  Medium documentation ({max_pages} pages) - category split recommended")
+                return "category"
+            elif 'categories' in self.config and len(self.config['categories']) >= 3:
+                print(f"ℹ️  Large documentation ({max_pages} pages) - router + categories recommended")
+                return "router"
+            else:
+                print(f"ℹ️  Large documentation ({max_pages} pages) - size-based split")
+                return "size"
+
+        return self.strategy
+
+    def split_by_category(self, create_router: bool = False) -> List[Dict[str, Any]]:
+        """Split config by categories"""
+        if 'categories' not in self.config:
+            print("❌ Error: No categories defined in config")
+            sys.exit(1)
+
+        categories = self.config['categories']
+        split_categories = self.config.get('split_config', {}).get('split_by_categories')
+
+        # If specific categories specified, use only those
+        if split_categories:
+            categories = {k: v for k, v in categories.items() if k in split_categories}
+
+        configs = []
+
+        for category_name, keywords in categories.items():
+            # Create new config for this category
+            new_config = self.config.copy()
+            new_config['name'] = f"{self.base_name}-{category_name}"
+            new_config['description'] = f"{self.base_name.capitalize()} - {category_name.replace('_', ' ').title()}. {self.config.get('description', '')}"
+
+            # Update URL patterns to focus on this category (copied, because
+            # self.config.copy() is shallow and would share them across configs)
+            url_patterns = dict(new_config.get('url_patterns', {}))
+            # Add category keywords to includes
+            includes = list(url_patterns.get('include', []))
+            for keyword in keywords:
+                if keyword.startswith('/'):
+                    includes.append(keyword)
+
+            if includes:
+                url_patterns['include'] = list(dict.fromkeys(includes))  # dedupe, keep order
+                new_config['url_patterns'] = url_patterns
+
+            # Keep only this category
+            new_config['categories'] = {category_name: keywords}
+
+            # Remove split config from child
+            if 'split_strategy' in new_config:
+                del new_config['split_strategy']
+            if 'split_config' in new_config:
+                del new_config['split_config']
+
+            # Adjust max_pages estimate
+            if 'max_pages' in new_config:
+                new_config['max_pages'] = self.target_pages
+
+            configs.append(new_config)
+
+        print(f"✅ Created {len(configs)} category-based configs")
+
+        # Optionally create router config
+        if create_router:
+            router_config = self.create_router_config(configs)
+            configs.insert(0, router_config)
+            print(f"✅ Created router config: {router_config['name']}")
+
+        return configs
+
+    def split_by_size(self) -> List[Dict[str, Any]]:
+        """Split config by size (page count)"""
+        max_pages = self.config.get('max_pages', 500)
+        num_splits = (max_pages + self.target_pages - 1) // self.target_pages
+
+        configs = []
+
+        for i in range(num_splits):
+            new_config = self.config.copy()
+            part_num = i + 1
+            new_config['name'] = f"{self.base_name}-part{part_num}"
+            new_config['description'] = f"{self.base_name.capitalize()} - Part {part_num}. {self.config.get('description', '')}"
+            new_config['max_pages'] = self.target_pages
+
+            # Remove split config from child
+            if 'split_strategy' in new_config:
+                del new_config['split_strategy']
+            if 'split_config' in new_config:
+                del new_config['split_config']
+
+            configs.append(new_config)
+
+        print(f"✅ Created {len(configs)} size-based configs ({self.target_pages} pages each)")
+        return configs
+
+    def create_router_config(self, sub_configs: List[Dict[str, Any]]) -> Dict[str, Any]:
+        """Create a router config that references sub-skills"""
+        router_name = self.config.get('split_config', {}).get('router_name', self.base_name)
+
+        router_config = {
+            "name": router_name,
+            "description": self.config.get('description', ''),
+            "base_url": self.config['base_url'],
+            "selectors": self.config['selectors'],
+            "url_patterns": self.config.get('url_patterns', {}),
+            "rate_limit": self.config.get('rate_limit', 0.5),
+            "max_pages": 500,  # Router only needs overview pages
+            "_router": True,
+            "_sub_skills": [cfg['name'] for cfg in sub_configs],
+            "_routing_keywords": {
+                cfg['name']: list(cfg.get('categories', {}).keys())
+                for cfg in sub_configs
+            }
+        }
+
+        return router_config
+
+    def split(self) -> List[Dict[str, Any]]:
+        """Execute split based on strategy"""
+        strategy = self.get_split_strategy()
+
+        print(f"\n{'='*60}")
+        print(f"CONFIG SPLITTER: {self.base_name}")
+        print(f"{'='*60}")
+        print(f"Strategy: {strategy}")
+        print(f"Target pages per skill: {self.target_pages}")
+        print("")
+
+        if strategy == "none":
+            print("ℹ️  No splitting required")
+            return [self.config]
+
+        elif strategy == "category":
+            return self.split_by_category(create_router=False)
+
+        elif strategy == "router":
+            create_router = self.config.get('split_config', {}).get('create_router', True)
+            return self.split_by_category(create_router=create_router)
+
+        elif strategy == "size":
+            return self.split_by_size()
+
+        else:
+            print(f"❌ Error: Unknown strategy: {strategy}")
+            sys.exit(1)
+
+    def save_configs(self, configs: List[Dict[str, Any]], output_dir: Path = None) -> List[Path]:
+        """Save configs to files"""
+        if output_dir is None:
+            output_dir = self.config_path.parent
+
+        output_dir = Path(output_dir)
+        output_dir.mkdir(parents=True, exist_ok=True)
+
+        saved_files = []
+
+        for config in configs:
+            filename = f"{config['name']}.json"
+            filepath = output_dir / filename
+
+            with open(filepath, 'w') as f:
+                json.dump(config, f, indent=2)
+
+            saved_files.append(filepath)
+            print(f"  💾 Saved: {filepath}")
+
+        return saved_files
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Split large documentation configs into multiple focused skills",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Auto-detect strategy
+  python3 split_config.py configs/godot.json
+
+  # Use category-based split
+  python3 split_config.py configs/godot.json --strategy category
+
+  # Use router + categories
+  python3 split_config.py configs/godot.json --strategy router
+
+  # Custom target size
+  python3 split_config.py configs/godot.json --target-pages 3000
+
+  # Dry run (don't save files)
+  python3 split_config.py configs/godot.json --dry-run
+
+Split Strategies:
+  none     - No splitting (single skill)
+  auto     - Automatically choose best strategy
+  category - Split by categories defined in config
+  router   - Create router + category-based sub-skills
+  size     - Split by page count
+        """
+    )
+
+    parser.add_argument(
+        'config',
+        help='Path to config file (e.g., configs/godot.json)'
+    )
+
+    parser.add_argument(
+        '--strategy',
+        choices=['auto', 'none', 'category', 'router', 'size'],
+        default='auto',
+        help='Splitting strategy (default: auto)'
+    )
+
+    parser.add_argument(
+        '--target-pages',
+        type=int,
+        default=5000,
+        help='Target pages per skill (default: 5000)'
+    )
+
+    parser.add_argument(
+        '--output-dir',
+        help='Output directory for configs (default: same as input)'
+    )
+
+    parser.add_argument(
+        '--dry-run',
+        action='store_true',
+        help='Show what would be created without saving files'
+    )
+
+    args = parser.parse_args()
+
+    # Create splitter
+    splitter = ConfigSplitter(args.config, args.strategy, args.target_pages)
+
+    # Split config
+    configs = splitter.split()
+
+    if args.dry_run:
+        print(f"\n{'='*60}")
+        print("DRY RUN - No files saved")
+        print(f"{'='*60}")
+        print(f"Would create {len(configs)} config files:")
+        for cfg in configs:
+            is_router = cfg.get('_router', False)
+            router_marker = " (ROUTER)" if is_router else ""
+            print(f"  📄 {cfg['name']}.json{router_marker}")
+    else:
+        print(f"\n{'='*60}")
+        print("SAVING CONFIGS")
+        print(f"{'='*60}")
+        saved_files = splitter.save_configs(configs, args.output_dir)
+
+        print(f"\n{'='*60}")
+        print("NEXT STEPS")
+        print(f"{'='*60}")
+        print("1. Review generated configs")
+        print("2. Scrape each config:")
+        for filepath in saved_files:
+            print(f"     python3 cli/doc_scraper.py --config {filepath}")
+        print("3. Package skills:")
+        print("     python3 cli/package_multi.py configs/<name>-*.json")
+        print("")
+
+
+if __name__ == "__main__":
+    main()
--- a/src/skill_seekers/cli/test_unified_simple.py
+++ b/src/skill_seekers/cli/test_unified_simple.py
@@ -0,0 +1,192 @@
+#!/usr/bin/env python3
+"""
+Simple Integration Tests for Unified Multi-Source Scraper
+
+Focuses on real-world usage patterns rather than unit tests.
+"""
+
+import os
+import sys
+import json
+import tempfile
+from pathlib import Path
+
+# Add CLI to path
+sys.path.insert(0, str(Path(__file__).parent))
+
+from config_validator import validate_config
+
+def test_validate_existing_unified_configs():
+    """Test that all existing unified configs are valid"""
+    configs_dir = Path(__file__).resolve().parents[3] / 'configs'  # repo root under src/ layout
+
+    unified_configs = [
+        'godot_unified.json',
+        'react_unified.json',
+        'django_unified.json',
+        'fastapi_unified.json'
+    ]
+
+    for config_name in unified_configs:
+        config_path = configs_dir / config_name
+        if config_path.exists():
+            print(f"\n✓ Validating {config_name}...")
+            validator = validate_config(str(config_path))
+            assert validator.is_unified, f"{config_name} should be unified format"
+            assert validator.needs_api_merge(), f"{config_name} should need API merging"
+            print(f"  Sources: {len(validator.config['sources'])}")
+            print(f"  Merge mode: {validator.config.get('merge_mode')}")
+
+
+def test_backward_compatibility():
+    """Test that legacy configs still work"""
+    configs_dir = Path(__file__).resolve().parents[3] / 'configs'  # repo root under src/ layout
+
+    legacy_configs = [
+        'react.json',
+        'godot.json',
+        'django.json'
+    ]
+
+    for config_name in legacy_configs:
+        config_path = configs_dir / config_name
+        if config_path.exists():
+            print(f"\n✓ Validating legacy {config_name}...")
+            validator = validate_config(str(config_path))
+            assert not validator.is_unified, f"{config_name} should be legacy format"
+            print("  Format: Legacy")
+
+
+def test_create_temp_unified_config():
+    """Test creating a unified config from scratch"""
+    config = {
+        "name": "test_unified",
+        "description": "Test unified config",
+        "merge_mode": "rule-based",
+        "sources": [
+            {
+                "type": "documentation",
+                "base_url": "https://example.com/docs",
+                "extract_api": True,
+                "max_pages": 50
+            },
+            {
+                "type": "github",
+                "repo": "test/repo",
+                "include_code": True,
+                "code_analysis_depth": "surface"
+            }
+        ]
+    }
+
+    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
+        json.dump(config, f)
+        config_path = f.name
+
+    try:
+        print("\n✓ Validating temp unified config...")
+        validator = validate_config(config_path)
+        assert validator.is_unified
+        assert validator.needs_api_merge()
+        assert len(validator.config['sources']) == 2
+        print("  ✓ Config is valid unified format")
+        print(f"  Sources: {len(validator.config['sources'])}")
+    finally:
+        os.unlink(config_path)
+
+
+def test_mixed_source_types():
+    """Test config with documentation, GitHub, and PDF sources"""
+    config = {
+        "name": "test_mixed",
+        "description": "Test mixed sources",
+        "merge_mode": "rule-based",
+        "sources": [
+            {
+                "type": "documentation",
+                "base_url": "https://example.com"
+            },
+            {
+                "type": "github",
+                "repo": "test/repo"
+            },
+            {
+                "type": "pdf",
+                "path": "/path/to/manual.pdf"
+            }
+        ]
+    }
+
+    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
+        json.dump(config, f)
+        config_path = f.name
+
+    try:
+        print("\n✓ Validating mixed source types...")
+        validator = validate_config(config_path)
+        assert validator.is_unified
+        assert len(validator.config['sources']) == 3
+
+        # Check each source type
+        source_types = [s['type'] for s in validator.config['sources']]
+        assert 'documentation' in source_types
+        assert 'github' in source_types
+        assert 'pdf' in source_types
+        print("  ✓ All 3 source types validated")
+    finally:
+        os.unlink(config_path)
+
+
+def test_config_validation_errors():
+    """Test that invalid configs are rejected"""
+    # Invalid source type
+    config = {
+        "name": "test",
+        "description": "Test",
+        "sources": [
+            {"type": "invalid_type", "url": "https://example.com"}
+        ]
+    }
+
+    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
+        json.dump(config, f)
+        config_path = f.name
+
+    try:
+        print("\n✓ Testing invalid source type...")
+        try:
+            # validate_config() calls .validate() automatically
+            validate_config(config_path)
+            raise AssertionError("Should have raised error for invalid source type")
+        except ValueError as e:
+            assert "Invalid" in str(e) or "invalid" in str(e)
+            print("  ✓ Invalid source type correctly rejected")
+    finally:
+        os.unlink(config_path)
+
+
+# Run tests
+if __name__ == '__main__':
+    print("=" * 60)
+    print("Running Unified Scraper Integration Tests")
+    print("=" * 60)
+
+    try:
+        test_validate_existing_unified_configs()
+        test_backward_compatibility()
+        test_create_temp_unified_config()
+        test_mixed_source_types()
+        test_config_validation_errors()
+
+        print("\n" + "=" * 60)
+        print("✅ All integration tests passed!")
+        print("=" * 60)
+
+    except AssertionError as e:
+        print(f"\n❌ Test failed: {e}")
+        sys.exit(1)
+    except Exception as e:
+        print(f"\n❌ Unexpected error: {e}")
+        import traceback
+        traceback.print_exc()
+        sys.exit(1)
--- a/src/skill_seekers/cli/unified_scraper.py
+++ b/src/skill_seekers/cli/unified_scraper.py
@@ -0,0 +1,449 @@
+#!/usr/bin/env python3
+"""
+Unified Multi-Source Scraper
+
+Orchestrates scraping from multiple sources (documentation, GitHub, PDF),
+detects conflicts, merges intelligently, and builds unified skills.
+
+This is the main entry point for unified config workflow.
+
+Usage:
+    python3 cli/unified_scraper.py --config configs/godot_unified.json
+    python3 cli/unified_scraper.py --config configs/react_unified.json --merge-mode claude-enhanced
+"""
+
+import os
+import sys
+import json
+import logging
+import argparse
+import subprocess
+from pathlib import Path
+from typing import Dict, List, Any, Optional
+
+# Import validators and scrapers
+try:
+    from config_validator import ConfigValidator, validate_config
+    from conflict_detector import ConflictDetector
+    from merge_sources import RuleBasedMerger, ClaudeEnhancedMerger
+    from unified_skill_builder import UnifiedSkillBuilder
+except ImportError as e:
+    print(f"Error importing modules: {e}")
+    print("Make sure you're running from the project root directory")
+    sys.exit(1)
+
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+
+class UnifiedScraper:
+    """
+    Orchestrates multi-source scraping and merging.
+
+    Main workflow:
+    1. Load and validate unified config
+    2. Scrape all sources (docs, GitHub, PDF)
+    3. Detect conflicts between sources
+    4. Merge intelligently (rule-based or Claude-enhanced)
+    5. Build unified skill
+    """
+
+    def __init__(self, config_path: str, merge_mode: Optional[str] = None):
+        """
+        Initialize unified scraper.
+
+        Args:
+            config_path: Path to unified config JSON
+            merge_mode: Override config merge_mode ('rule-based' or 'claude-enhanced')
+        """
+        self.config_path = config_path
+
+        # Validate and load config
+        logger.info(f"Loading config: {config_path}")
+        self.validator = validate_config(config_path)
+        self.config = self.validator.config
+
+        # Determine merge mode
+        self.merge_mode = merge_mode or self.config.get('merge_mode', 'rule-based')
+        logger.info(f"Merge mode: {self.merge_mode}")
+
+        # Storage for scraped data
+        self.scraped_data = {}
+
+        # Output paths
+        self.name = self.config['name']
+        self.output_dir = f"output/{self.name}"
+        self.data_dir = f"output/{self.name}_unified_data"
+
+        os.makedirs(self.output_dir, exist_ok=True)
+        os.makedirs(self.data_dir, exist_ok=True)
+
+    def scrape_all_sources(self):
+        """
+        Scrape all configured sources.
+
+        Routes to appropriate scraper based on source type.
+        """
+        logger.info("=" * 60)
+        logger.info("PHASE 1: Scraping all sources")
+        logger.info("=" * 60)
+
+        if not self.validator.is_unified:
+            logger.warning("Config is not unified format, converting...")
+            self.config = self.validator.convert_legacy_to_unified()
+
+        sources = self.config.get('sources', [])
+
+        for i, source in enumerate(sources):
+            source_type = source['type']
+            logger.info(f"\n[{i+1}/{len(sources)}] Scraping {source_type} source...")
+
+            try:
+                if source_type == 'documentation':
+                    self._scrape_documentation(source)
+                elif source_type == 'github':
+                    self._scrape_github(source)
+                elif source_type == 'pdf':
+                    self._scrape_pdf(source)
+                else:
+                    logger.warning(f"Unknown source type: {source_type}")
+            except Exception as e:
+                logger.error(f"Error scraping {source_type}: {e}")
+                logger.info("Continuing with other sources...")
+
+        logger.info(f"\n✅ Scraped {len(self.scraped_data)} sources successfully")
+
+    def _scrape_documentation(self, source: Dict[str, Any]):
+        """Scrape documentation website."""
+        # Create temporary config for doc scraper
+        doc_config = {
+            'name': f"{self.name}_docs",
+            'base_url': source['base_url'],
+            'selectors': source.get('selectors', {}),
+            'url_patterns': source.get('url_patterns', {}),
+            'categories': source.get('categories', {}),
+            'rate_limit': source.get('rate_limit', 0.5),
+            'max_pages': source.get('max_pages', 100)
+        }
+
+        # Write temporary config
+        temp_config_path = os.path.join(self.data_dir, 'temp_docs_config.json')
+        with open(temp_config_path, 'w') as f:
+            json.dump(doc_config, f, indent=2)
+
+        # Run doc_scraper as subprocess
+        logger.info(f"Scraping documentation from {source['base_url']}")
+
+        doc_scraper_path = Path(__file__).parent / "doc_scraper.py"
+        cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path]
+
+        result = subprocess.run(cmd, capture_output=True, text=True)
+
+        # Clean up temp config now so it is removed even if scraping failed
+        if os.path.exists(temp_config_path):
+            os.remove(temp_config_path)
+
+        if result.returncode != 0:
+            logger.error(f"Documentation scraping failed: {result.stderr}")
+            return
+
+        # Load scraped data
+        docs_data_file = f"output/{doc_config['name']}_data/summary.json"
+
+        if os.path.exists(docs_data_file):
+            with open(docs_data_file, 'r') as f:
+                summary = json.load(f)
+
+            self.scraped_data['documentation'] = {
+                'pages': summary.get('pages', []),
+                'data_file': docs_data_file
+            }
+
+            logger.info(f"✅ Documentation: {summary.get('total_pages', 0)} pages scraped")
+        else:
+            logger.warning("Documentation data file not found")
+
+    def _scrape_github(self, source: Dict[str, Any]):
+        """Scrape GitHub repository."""
+        sys.path.insert(0, str(Path(__file__).parent))
+
+        try:
+            from github_scraper import GitHubScraper
+        except ImportError as e:
+            logger.error(f"Could not import github_scraper: {e}")
+            return
+
+        # Create config for GitHub scraper
+        github_config = {
+            'repo': source['repo'],
+            'name': f"{self.name}_github",
+            'github_token': source.get('github_token'),
+            'include_issues': source.get('include_issues', True),
+            'max_issues': source.get('max_issues', 100),
+            'include_changelog': source.get('include_changelog', True),
+            'include_releases': source.get('include_releases', True),
+            'include_code': source.get('include_code', True),
+            'code_analysis_depth': source.get('code_analysis_depth', 'surface'),
+            'file_patterns': source.get('file_patterns', [])
+        }
+
+        # Scrape
+        logger.info(f"Scraping GitHub repository: {source['repo']}")
+        scraper = GitHubScraper(github_config)
+        github_data = scraper.scrape()
+
+        # Save data
+        github_data_file = os.path.join(self.data_dir, 'github_data.json')
+        with open(github_data_file, 'w') as f:
+            json.dump(github_data, f, indent=2, ensure_ascii=False)
+
+        self.scraped_data['github'] = {
+            'data': github_data,
+            'data_file': github_data_file
+        }
+
+        logger.info("✅ GitHub: Repository scraped successfully")
+
+    def _scrape_pdf(self, source: Dict[str, Any]):
+        """Scrape PDF document."""
+        sys.path.insert(0, str(Path(__file__).parent))
+
+        try:
+            from pdf_scraper import PDFToSkillConverter
+        except ImportError as e:
+            logger.error(f"Could not import pdf_scraper: {e}")
+            return
+
+        # Create config for PDF scraper
+        pdf_config = {
+            'name': f"{self.name}_pdf",
+            'pdf': source['path'],
+            'extract_tables': source.get('extract_tables', False),
+            'ocr': source.get('ocr', False),
+            'password': source.get('password')
+        }
+
+        # Scrape
+        logger.info(f"Scraping PDF: {source['path']}")
+        converter = PDFToSkillConverter(pdf_config)
+        pdf_data = converter.extract_all()
+
+        # Save data
+        pdf_data_file = os.path.join(self.data_dir, 'pdf_data.json')
+        with open(pdf_data_file, 'w') as f:
+            json.dump(pdf_data, f, indent=2, ensure_ascii=False)
+
+        self.scraped_data['pdf'] = {
+            'data': pdf_data,
+            'data_file': pdf_data_file
+        }
+
+        logger.info(f"✅ PDF: {len(pdf_data.get('pages', []))} pages extracted")
+
+    def detect_conflicts(self) -> List:
+        """
+        Detect conflicts between documentation and code.
+
+        Only applicable if both documentation and GitHub sources exist.
+
+        Returns:
+            List of conflicts
+        """
+        logger.info("\n" + "=" * 60)
+        logger.info("PHASE 2: Detecting conflicts")
+        logger.info("=" * 60)
+
+        if not self.validator.needs_api_merge():
+            logger.info("No API merge needed (only one API source)")
+            return []
+
+        # Get documentation and GitHub data
+        docs_data = self.scraped_data.get('documentation', {})
+        github_data = self.scraped_data.get('github', {})
+
+        if not docs_data or not github_data:
+            logger.warning("Missing documentation or GitHub data for conflict detection")
+            return []
+
+        # Load data files
+        with open(docs_data['data_file'], 'r') as f:
+            docs_json = json.load(f)
+
+        with open(github_data['data_file'], 'r') as f:
+            github_json = json.load(f)
+
+        # Detect conflicts
+        detector = ConflictDetector(docs_json, github_json)
+        conflicts = detector.detect_all_conflicts()
+
+        # Save conflicts
+        conflicts_file = os.path.join(self.data_dir, 'conflicts.json')
+        detector.save_conflicts(conflicts, conflicts_file)
+
+        # Print summary
+        summary = detector.generate_summary(conflicts)
+        logger.info("\n📊 Conflict Summary:")
+        logger.info(f"   Total: {summary['total']}")
+        logger.info("   By Type:")
+        for ctype, count in summary['by_type'].items():
+            if count > 0:
+                logger.info(f"     - {ctype}: {count}")
+        logger.info("   By Severity:")
+        for severity, count in summary['by_severity'].items():
+            if count > 0:
+                emoji = '🔴' if severity == 'high' else '🟡' if severity == 'medium' else '🟢'
+                logger.info(f"     {emoji} {severity}: {count}")
+
+        return conflicts
+
+    def merge_sources(self, conflicts: List):
+        """
+        Merge data from multiple sources.
+
+        Args:
+            conflicts: List of detected conflicts
+        """
+        logger.info("\n" + "=" * 60)
+        logger.info(f"PHASE 3: Merging sources ({self.merge_mode})")
+        logger.info("=" * 60)
+
+        if not conflicts:
+            logger.info("No conflicts to merge")
+            return None
+
+        # Get data files
+        docs_data = self.scraped_data.get('documentation', {})
+        github_data = self.scraped_data.get('github', {})
+
+        # Load data
+        with open(docs_data['data_file'], 'r', encoding='utf-8') as f:
+            docs_json = json.load(f)
+
+        with open(github_data['data_file'], 'r', encoding='utf-8') as f:
+            github_json = json.load(f)
+
+        # Choose merger
+        if self.merge_mode == 'claude-enhanced':
+            merger = ClaudeEnhancedMerger(docs_json, github_json, conflicts)
+        else:
+            merger = RuleBasedMerger(docs_json, github_json, conflicts)
+
+        # Merge
+        merged_data = merger.merge_all()
+
+        # Save merged data
+        merged_file = os.path.join(self.data_dir, 'merged_data.json')
+        with open(merged_file, 'w', encoding='utf-8') as f:
+            json.dump(merged_data, f, indent=2, ensure_ascii=False)
+
+        logger.info(f"✅ Merged data saved: {merged_file}")
+
+        return merged_data
+
+    def build_skill(self, merged_data: Optional[Dict] = None):
+        """
+        Build final unified skill.
+
+        Args:
+            merged_data: Merged API data (if conflicts were resolved)
+        """
+        logger.info("\n" + "=" * 60)
+        logger.info("PHASE 4: Building unified skill")
+        logger.info("=" * 60)
+
+        # Load conflicts if they exist
+        conflicts = []
+        conflicts_file = os.path.join(self.data_dir, 'conflicts.json')
+        if os.path.exists(conflicts_file):
+            with open(conflicts_file, 'r', encoding='utf-8') as f:
+                conflicts_data = json.load(f)
+                conflicts = conflicts_data.get('conflicts', [])
+
+        # Build skill
+        builder = UnifiedSkillBuilder(
+            self.config,
+            self.scraped_data,
+            merged_data,
+            conflicts
+        )
+
+        builder.build()
+
+        logger.info(f"✅ Unified skill built: {self.output_dir}/")
+
+    def run(self):
+        """
+        Execute complete unified scraping workflow.
+        """
+        logger.info("\n" + "🚀 " * 20)
+        logger.info(f"Unified Scraper: {self.config['name']}")
+        logger.info("🚀 " * 20 + "\n")
+
+        try:
+            # Phase 1: Scrape all sources
+            self.scrape_all_sources()
+
+            # Phase 2: Detect conflicts (if applicable)
+            conflicts = self.detect_conflicts()
+
+            # Phase 3: Merge sources (if conflicts exist)
+            merged_data = None
+            if conflicts:
+                merged_data = self.merge_sources(conflicts)
+
+            # Phase 4: Build skill
+            self.build_skill(merged_data)
+
+            logger.info("\n" + "✅ " * 20)
+            logger.info("Unified scraping complete!")
+            logger.info("✅ " * 20 + "\n")
+
+            logger.info(f"📁 Output: {self.output_dir}/")
+            logger.info(f"📁 Data: {self.data_dir}/")
+
+        except KeyboardInterrupt:
+            logger.info("\n\n⚠️  Scraping interrupted by user")
+            sys.exit(1)
+        except Exception as e:
+            logger.error(f"\n\n❌ Error during scraping: {e}")
+            import traceback
+            traceback.print_exc()
+            sys.exit(1)
+
+
+def main():
+    """Main entry point."""
+    parser = argparse.ArgumentParser(
+        description='Unified multi-source scraper',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Basic usage with unified config
+  skill-seekers-unified --config configs/godot_unified.json
+
+  # Override merge mode
+  skill-seekers-unified --config configs/react_unified.json --merge-mode claude-enhanced
+
+  # Backward compatible with legacy configs
+  skill-seekers-unified --config configs/react.json
+        """
+    )
+
+    parser.add_argument('--config', '-c', required=True,
+                       help='Path to unified config JSON file')
+    parser.add_argument('--merge-mode', '-m',
+                       choices=['rule-based', 'claude-enhanced'],
+                       help='Override config merge mode')
+
+    args = parser.parse_args()
+
+    # Create and run scraper
+    scraper = UnifiedScraper(args.config, args.merge_mode)
+    scraper.run()
+
+
+if __name__ == '__main__':
+    main()
--- a/src/skill_seekers/cli/unified_skill_builder.py
+++ b/src/skill_seekers/cli/unified_skill_builder.py
@@ -0,0 +1,444 @@
+#!/usr/bin/env python3
+"""
+Unified Skill Builder
+
+Generates final skill structure from merged multi-source data:
+- SKILL.md with merged APIs and conflict warnings
+- references/ with organized content by source
+- Inline conflict markers (⚠️)
+- Separate conflicts summary section
+
+Supports mixed sources (documentation, GitHub, PDF) and highlights
+discrepancies transparently.
+"""
+
+import os
+import json
+import logging
+from pathlib import Path
+from typing import Dict, List, Any, Optional
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+class UnifiedSkillBuilder:
+    """
+    Builds unified skill from multi-source data.
+    """
+
+    def __init__(self, config: Dict, scraped_data: Dict,
+                 merged_data: Optional[Dict] = None, conflicts: Optional[List] = None):
+        """
+        Initialize skill builder.
+
+        Args:
+            config: Unified config dict
+            scraped_data: Dict of scraped data by source type
+            merged_data: Merged API data (if conflicts were resolved)
+            conflicts: List of detected conflicts
+        """
+        self.config = config
+        self.scraped_data = scraped_data
+        self.merged_data = merged_data
+        self.conflicts = conflicts or []
+
+        self.name = config['name']
+        self.description = config['description']
+        self.skill_dir = f"output/{self.name}"
+
+        # Create directories
+        os.makedirs(self.skill_dir, exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
+        os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
+
+    def build(self):
+        """Build complete skill structure."""
+        logger.info(f"Building unified skill: {self.name}")
+
+        # Generate main SKILL.md
+        self._generate_skill_md()
+
+        # Generate reference files by source
+        self._generate_references()
+
+        # Generate conflicts report (if any)
+        if self.conflicts:
+            self._generate_conflicts_report()
+
+        logger.info(f"✅ Unified skill built: {self.skill_dir}/")
+
+    def _generate_skill_md(self):
+        """Generate main SKILL.md file."""
+        skill_path = os.path.join(self.skill_dir, 'SKILL.md')
+
+        # Generate skill name (lowercased; underscores and spaces become hyphens; max 64 chars)
+        skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
+
+        # Truncate description to 1024 chars if needed
+        desc = self.description[:1024]
+
+        content = f"""---
+name: {skill_name}
+description: {desc}
+---
+
+# {self.name.title()}
+
+{self.description}
+
+## 📚 Sources
+
+This skill combines knowledge from multiple sources:
+
+"""
+
+        # List sources
+        for source in self.config.get('sources', []):
+            source_type = source['type']
+            if source_type == 'documentation':
+                content += f"- ✅ **Documentation**: {source.get('base_url', 'N/A')}\n"
+                content += f"  - Pages: {source.get('max_pages', 'unlimited')}\n"
+            elif source_type == 'github':
+                content += f"- ✅ **GitHub Repository**: {source.get('repo', 'N/A')}\n"
+                content += f"  - Code Analysis: {source.get('code_analysis_depth', 'surface')}\n"
+                content += f"  - Issues: {source.get('max_issues', 0)}\n"
+            elif source_type == 'pdf':
+                content += f"- ✅ **PDF Document**: {source.get('path', 'N/A')}\n"
+
+        # Data quality section
+        if self.conflicts:
+            content += f"\n## ⚠️ Data Quality\n\n"
+            content += f"**{len(self.conflicts)} conflicts detected** between sources.\n\n"
+
+            # Count by type
+            by_type = {}
+            for conflict in self.conflicts:
+                ctype = conflict.type if hasattr(conflict, 'type') else conflict.get('type', 'unknown')
+                by_type[ctype] = by_type.get(ctype, 0) + 1
+
+            content += "**Conflict Breakdown:**\n"
+            for ctype, count in by_type.items():
+                content += f"- {ctype}: {count}\n"
+
+            content += f"\nSee `references/conflicts.md` for detailed conflict information.\n"
+
+        # Merged API section (if available)
+        if self.merged_data:
+            content += self._format_merged_apis()
+
+        # Quick reference from each source
+        content += "\n## 📖 Reference Documentation\n\n"
+        content += "Organized by source:\n\n"
+
+        for source in self.config.get('sources', []):
+            source_type = source['type']
+            content += f"- [{source_type.title()}](references/{source_type}/)\n"
+
+        # When to use this skill
+        content += "\n## 💡 When to Use This Skill\n\n"
+        content += "Use this skill when you need to:\n"
+        content += f"- Understand how to use {self.name}\n"
+        content += "- Look up API documentation\n"
+        content += "- Find usage examples\n"
+
+        if 'github' in self.scraped_data:
+            content += "- Check for known issues or recent changes\n"
+            content += "- Review release history\n"
+
+        content += "\n---\n\n"
+        content += "*Generated by Skill Seeker's unified multi-source scraper*\n"
+
+        with open(skill_path, 'w', encoding='utf-8') as f:
+            f.write(content)
+
+        logger.info("Created SKILL.md")
+
+    def _format_merged_apis(self) -> str:
+        """Format merged APIs section with inline conflict warnings."""
+        if not self.merged_data:
+            return ""
+
+        content = "\n## 🔧 API Reference\n\n"
+        content += "*Merged from documentation and code analysis*\n\n"
+
+        apis = self.merged_data.get('apis', {})
+
+        if not apis:
+            return content + "*No APIs to display*\n"
+
+        # Group APIs by status
+        matched = {k: v for k, v in apis.items() if v.get('status') == 'matched'}
+        conflicts = {k: v for k, v in apis.items() if v.get('status') == 'conflict'}
+        docs_only = {k: v for k, v in apis.items() if v.get('status') == 'docs_only'}
+        code_only = {k: v for k, v in apis.items() if v.get('status') == 'code_only'}
+
+        # Show matched APIs first
+        if matched:
+            content += "### ✅ Verified APIs\n\n"
+            content += "*Documentation and code agree*\n\n"
+            for api_name, api_data in list(matched.items())[:10]:  # Limit to first 10
+                content += self._format_api_entry(api_data, inline_conflict=False)
+
+        # Show conflicting APIs with warnings
+        if conflicts:
+            content += "\n### ⚠️ APIs with Conflicts\n\n"
+            content += "*Documentation and code differ*\n\n"
+            for api_name, api_data in list(conflicts.items())[:10]:
+                content += self._format_api_entry(api_data, inline_conflict=True)
+
+        # Show undocumented APIs
+        if code_only:
+            content += f"\n### 💻 Undocumented APIs\n\n"
+            content += f"*Found in code but not in documentation ({len(code_only)} total)*\n\n"
+            for api_name, api_data in list(code_only.items())[:5]:
+                content += self._format_api_entry(api_data, inline_conflict=False)
+
+        # Show removed/missing APIs
+        if docs_only:
+            content += f"\n### 📖 Documentation-Only APIs\n\n"
+            content += f"*Documented but not found in code ({len(docs_only)} total)*\n\n"
+            for api_name, api_data in list(docs_only.items())[:5]:
+                content += self._format_api_entry(api_data, inline_conflict=False)
+
+        content += f"\n*See references/api/ for complete API documentation*\n"
+
+        return content
+
+    def _format_api_entry(self, api_data: Dict, inline_conflict: bool = False) -> str:
+        """Format a single API entry."""
+        name = api_data.get('name', 'Unknown')
+        signature = api_data.get('merged_signature', name)
+        description = api_data.get('merged_description', '')
+        warning = api_data.get('warning', '')
+
+        entry = f"#### `{signature}`\n\n"
+
+        if description:
+            entry += f"{description}\n\n"
+
+        # Add inline conflict warning
+        if inline_conflict and warning:
+            entry += f"⚠️ **Conflict**: {warning}\n\n"
+
+            # Show both versions if available
+            conflict = api_data.get('conflict', {})
+            if conflict:
+                docs_info = conflict.get('docs_info')
+                code_info = conflict.get('code_info')
+
+                if docs_info and code_info:
+                    entry += "**Documentation says:**\n"
+                    entry += f"```\n{docs_info.get('raw_signature', 'N/A')}\n```\n\n"
+                    entry += "**Code implementation:**\n"
+                    entry += f"```\n{self._format_code_signature(code_info)}\n```\n\n"
+
+        # Add source info
+        source = api_data.get('source', 'unknown')
+        entry += f"*Source: {source}*\n\n"
+
+        entry += "---\n\n"
+
+        return entry
+
+    def _format_code_signature(self, code_info: Dict) -> str:
+        """Format code signature for display."""
+        name = code_info.get('name', '')
+        params = code_info.get('parameters', [])
+        return_type = code_info.get('return_type')
+
+        param_strs = []
+        for param in params:
+            param_str = param.get('name', '')
+            if param.get('type_hint'):
+                param_str += f": {param['type_hint']}"
+            if param.get('default') is not None:
+                param_str += f" = {param['default']}"
+            param_strs.append(param_str)
+
+        sig = f"{name}({', '.join(param_strs)})"
+        if return_type:
+            sig += f" -> {return_type}"
+
+        return sig
+
+    def _generate_references(self):
+        """Generate reference files organized by source."""
+        logger.info("Generating reference files...")
+
+        # Generate references for each source type
+        if 'documentation' in self.scraped_data:
+            self._generate_docs_references()
+
+        if 'github' in self.scraped_data:
+            self._generate_github_references()
+
+        if 'pdf' in self.scraped_data:
+            self._generate_pdf_references()
+
+        # Generate merged API reference if available
+        if self.merged_data:
+            self._generate_merged_api_reference()
+
+    def _generate_docs_references(self):
+        """Generate references from documentation source."""
+        docs_dir = os.path.join(self.skill_dir, 'references', 'documentation')
+        os.makedirs(docs_dir, exist_ok=True)
+
+        # Create index
+        index_path = os.path.join(docs_dir, 'index.md')
+        with open(index_path, 'w') as f:
+            f.write("# Documentation\n\n")
+            f.write("Reference from official documentation.\n\n")
+
+        logger.info("Created documentation references")
+
+    def _generate_github_references(self):
+        """Generate references from GitHub source."""
+        github_dir = os.path.join(self.skill_dir, 'references', 'github')
+        os.makedirs(github_dir, exist_ok=True)
+
+        github_data = self.scraped_data['github']['data']
+
+        # Create README reference
+        if github_data.get('readme'):
+            readme_path = os.path.join(github_dir, 'README.md')
+            with open(readme_path, 'w', encoding='utf-8') as f:
+                f.write("# Repository README\n\n")
+                f.write(github_data['readme'])
+
+        # Create issues reference
+        if github_data.get('issues'):
+            issues_path = os.path.join(github_dir, 'issues.md')
+            with open(issues_path, 'w', encoding='utf-8') as f:
+                f.write("# GitHub Issues\n\n")
+                f.write(f"{len(github_data['issues'])} recent issues.\n\n")
+
+                for issue in github_data['issues'][:20]:
+                    f.write(f"## #{issue['number']}: {issue['title']}\n\n")
+                    f.write(f"**State**: {issue['state']}\n")
+                    if issue.get('labels'):
+                        f.write(f"**Labels**: {', '.join(issue['labels'])}\n")
+                    f.write(f"**URL**: {issue.get('url', 'N/A')}\n\n")
+
+        # Create releases reference
+        if github_data.get('releases'):
+            releases_path = os.path.join(github_dir, 'releases.md')
+            with open(releases_path, 'w', encoding='utf-8') as f:
+                f.write("# Releases\n\n")
+
+                for release in github_data['releases'][:10]:
+                    f.write(f"## {release['tag_name']}: {release.get('name', 'N/A')}\n\n")
+                    f.write(f"**Published**: {(release.get('published_at') or 'N/A')[:10]}\n\n")
+                    if release.get('body'):
+                        f.write(release['body'][:500])
+                        f.write("\n\n")
+
+        logger.info("Created GitHub references")
+
+    def _generate_pdf_references(self):
+        """Generate references from PDF source."""
+        pdf_dir = os.path.join(self.skill_dir, 'references', 'pdf')
+        os.makedirs(pdf_dir, exist_ok=True)
+
+        # Create index
+        index_path = os.path.join(pdf_dir, 'index.md')
+        with open(index_path, 'w') as f:
+            f.write("# PDF Documentation\n\n")
+            f.write("Reference from PDF document.\n\n")
+
+        logger.info("Created PDF references")
+
+    def _generate_merged_api_reference(self):
+        """Generate merged API reference file."""
+        api_dir = os.path.join(self.skill_dir, 'references', 'api')
+        os.makedirs(api_dir, exist_ok=True)
+
+        api_path = os.path.join(api_dir, 'merged_api.md')
+
+        with open(api_path, 'w', encoding='utf-8') as f:
+            f.write("# Merged API Reference\n\n")
+            f.write("*Combined from documentation and code analysis*\n\n")
+
+            apis = self.merged_data.get('apis', {})
+
+            for api_name in sorted(apis.keys()):
+                api_data = apis[api_name]
+                entry = self._format_api_entry(api_data, inline_conflict=True)
+                f.write(entry)
+
+        logger.info(f"Created merged API reference ({len(apis)} APIs)")
+
+    def _generate_conflicts_report(self):
+        """Generate detailed conflicts report."""
+        conflicts_path = os.path.join(self.skill_dir, 'references', 'conflicts.md')
+
+        with open(conflicts_path, 'w', encoding='utf-8') as f:
+            f.write("# Conflict Report\n\n")
+            f.write(f"Found **{len(self.conflicts)}** conflicts between sources.\n\n")
+
+            # Group by severity
+            high = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'high']
+            medium = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'medium']
+            low = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'low']
+
+            f.write("## Severity Breakdown\n\n")
+            f.write(f"- 🔴 **High**: {len(high)} (action required)\n")
+            f.write(f"- 🟡 **Medium**: {len(medium)} (review recommended)\n")
+            f.write(f"- 🟢 **Low**: {len(low)} (informational)\n\n")
+
+            # List high severity conflicts
+            if high:
+                f.write("## 🔴 High Severity\n\n")
+                f.write("*These conflicts require immediate attention*\n\n")
+
+                for conflict in high:
+                    api_name = conflict.api_name if hasattr(conflict, 'api_name') else conflict.get('api_name', 'Unknown')
+                    diff = conflict.difference if hasattr(conflict, 'difference') else conflict.get('difference', 'N/A')
+
+                    f.write(f"### {api_name}\n\n")
+                    f.write(f"**Issue**: {diff}\n\n")
+
+            # List medium severity
+            if medium:
+                f.write("## 🟡 Medium Severity\n\n")
+
+                for conflict in medium[:20]:  # Limit to 20
+                    api_name = conflict.api_name if hasattr(conflict, 'api_name') else conflict.get('api_name', 'Unknown')
+                    diff = conflict.difference if hasattr(conflict, 'difference') else conflict.get('difference', 'N/A')
+
+                    f.write(f"### {api_name}\n\n")
+                    f.write(f"{diff}\n\n")
+
+        logger.info("Created conflicts report")
+
+
+if __name__ == '__main__':
+    # Test with mock data
+    import sys
+
+    if len(sys.argv) < 2:
+        print("Usage: python unified_skill_builder.py <config.json>")
+        sys.exit(1)
+
+    config_path = sys.argv[1]
+
+    with open(config_path, 'r') as f:
+        config = json.load(f)
+
+    # Mock scraped data
+    scraped_data = {
+        'github': {
+            'data': {
+                'readme': '# Test Repository',
+                'issues': [],
+                'releases': []
+            }
+        }
+    }
+
+    builder = UnifiedSkillBuilder(config, scraped_data)
+    builder.build()
+
+    print(f"\n✅ Test skill built in: output/{config['name']}/")
--- a/src/skill_seekers/cli/upload_skill.py
+++ b/src/skill_seekers/cli/upload_skill.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+"""
+Automatic Skill Uploader
+Uploads a skill .zip file to Claude using the Anthropic API
+
+Usage:
+    # Set API key (one-time)
+    export ANTHROPIC_API_KEY=sk-ant-...
+
+    # Upload skill
+    python3 upload_skill.py output/react.zip
+    python3 upload_skill.py output/godot.zip
+"""
+
+import os
+import sys
+import json
+import argparse
+from pathlib import Path
+
+# Import utilities
+try:
+    from skill_seekers.cli.utils import (
+        get_api_key,
+        get_upload_url,
+        print_upload_instructions,
+        validate_zip_file
+    )
+except ImportError:
+    sys.path.insert(0, str(Path(__file__).parent))
+    from utils import (
+        get_api_key,
+        get_upload_url,
+        print_upload_instructions,
+        validate_zip_file
+    )
+
+
+def upload_skill_api(zip_path):
+    """
+    Upload skill to Claude via Anthropic API
+
+    Args:
+        zip_path: Path to skill .zip file
+
+    Returns:
+        tuple: (success, message)
+    """
+    # Check for requests library
+    try:
+        import requests
+    except ImportError:
+        return False, "requests library not installed. Run: pip install requests"
+
+    # Validate zip file
+    is_valid, error_msg = validate_zip_file(zip_path)
+    if not is_valid:
+        return False, error_msg
+
+    # Get API key
+    api_key = get_api_key()
+    if not api_key:
+        return False, "ANTHROPIC_API_KEY not set. Run: export ANTHROPIC_API_KEY=sk-ant-..."
+
+    zip_path = Path(zip_path)
+    skill_name = zip_path.stem
+
+    print(f"📤 Uploading skill: {skill_name}")
+    print(f"   Source: {zip_path}")
+    print(f"   Size: {zip_path.stat().st_size:,} bytes")
+    print()
+
+    # Prepare API request
+    api_url = "https://api.anthropic.com/v1/skills"
+    headers = {
+        "x-api-key": api_key,
+        "anthropic-version": "2023-06-01"
+    }
+
+    try:
+        # Read zip file
+        with open(zip_path, 'rb') as f:
+            zip_data = f.read()
+
+        # Upload skill
+        print("⏳ Uploading to Anthropic API...")
+
+        files = {
+            'skill': (zip_path.name, zip_data, 'application/zip')
+        }
+
+        response = requests.post(
+            api_url,
+            headers=headers,
+            files=files,
+            timeout=60
+        )
+
+        # Check response
+        if response.status_code in (200, 201):
+            print()
+            print("✅ Skill uploaded successfully!")
+            print()
+            print("Your skill is now available in Claude at:")
+            print(f"   {get_upload_url()}")
+            print()
+            return True, "Upload successful"
+
+        elif response.status_code == 401:
+            return False, "Authentication failed. Check your ANTHROPIC_API_KEY"
+
+        elif response.status_code == 400:
+            error_msg = response.json().get('error', {}).get('message', 'Unknown error')
+            return False, f"Invalid skill format: {error_msg}"
+
+        else:
+            error_msg = response.json().get('error', {}).get('message', 'Unknown error')
+            return False, f"Upload failed ({response.status_code}): {error_msg}"
+
+    except requests.exceptions.Timeout:
+        return False, "Upload timed out. Try again or use manual upload"
+
+    except requests.exceptions.ConnectionError:
+        return False, "Connection error. Check your internet connection"
+
+    except Exception as e:
+        return False, f"Unexpected error: {str(e)}"
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Upload a skill .zip file to Claude via Anthropic API",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Setup:
+  1. Get your Anthropic API key from https://console.anthropic.com/
+  2. Set the API key:
+     export ANTHROPIC_API_KEY=sk-ant-...
+
+Examples:
+  # Upload skill
+  python3 upload_skill.py output/react.zip
+
+  # Upload with explicit path
+  python3 upload_skill.py /path/to/skill.zip
+
+Requirements:
+  - ANTHROPIC_API_KEY environment variable must be set
+  - requests library (pip install requests)
+        """
+    )
+
+    parser.add_argument(
+        'zip_file',
+        help='Path to skill .zip file (e.g., output/react.zip)'
+    )
+
+    args = parser.parse_args()
+
+    # Upload skill
+    success, message = upload_skill_api(args.zip_file)
+
+    if success:
+        sys.exit(0)
+    else:
+        print(f"\n❌ Upload failed: {message}")
+        print()
+        print("📝 Manual upload instructions:")
+        print_upload_instructions(args.zip_file)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
--- a/src/skill_seekers/cli/utils.py
+++ b/src/skill_seekers/cli/utils.py
@@ -0,0 +1,224 @@
+#!/usr/bin/env python3
+"""
+Utility functions for Skill Seeker CLI tools
+"""
+
+import os
+import sys
+import subprocess
+import platform
+from pathlib import Path
+from typing import Optional, Tuple, Dict, Union
+
+
+def open_folder(folder_path: Union[str, Path]) -> bool:
+    """
+    Open a folder in the system file browser
+
+    Args:
+        folder_path: Path to folder to open
+
+    Returns:
+        bool: True if successful, False otherwise
+    """
+    folder_path = Path(folder_path).resolve()
+
+    if not folder_path.exists():
+        print(f"⚠️  Folder not found: {folder_path}")
+        return False
+
+    system = platform.system()
+
+    try:
+        if system == "Linux":
+            # xdg-open is the standard opener on Linux desktops
+            subprocess.run(["xdg-open", str(folder_path)], check=True)
+        elif system == "Darwin":  # macOS
+            subprocess.run(["open", str(folder_path)], check=True)
+        elif system == "Windows":
+            subprocess.run(["explorer", str(folder_path)], check=False)  # explorer can exit non-zero even on success
+        else:
+            print(f"⚠️  Unknown operating system: {system}")
+            return False
+
+        return True
+
+    except subprocess.CalledProcessError:
+        print("⚠️  Could not open folder automatically")
+        return False
+    except FileNotFoundError:
+        print("⚠️  File browser not found on system")
+        return False
+
+
+def has_api_key() -> bool:
+    """
+    Check if ANTHROPIC_API_KEY is set in environment
+
+    Returns:
+        bool: True if API key is set, False otherwise
+    """
+    api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
+    return len(api_key) > 0
+
+
+def get_api_key() -> Optional[str]:
+    """
+    Get ANTHROPIC_API_KEY from environment
+
+    Returns:
+        str: API key or None if not set
+    """
+    api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
+    return api_key if api_key else None
+
+
+def get_upload_url() -> str:
+    """
+    Get the Claude skills upload URL
+
+    Returns:
+        str: Claude skills upload URL
+    """
+    return "https://claude.ai/skills"
+
+
+def print_upload_instructions(zip_path: Union[str, Path]) -> None:
+    """
+    Print clear upload instructions for manual upload
+
+    Args:
+        zip_path: Path to the .zip file to upload
+    """
+    zip_path = Path(zip_path)
+
+    print()
+    print("╔══════════════════════════════════════════════════════════╗")
+    print("║                     NEXT STEP                            ║")
+    print("╚══════════════════════════════════════════════════════════╝")
+    print()
+    print(f"📤 Upload to Claude: {get_upload_url()}")
+    print()
+    print(f"1. Go to {get_upload_url()}")
+    print("2. Click \"Upload Skill\"")
+    print(f"3. Select: {zip_path}")
+    print("4. Done! ✅")
+    print()
+
+
+def format_file_size(size_bytes: int) -> str:
+    """
+    Format file size in human-readable format
+
+    Args:
+        size_bytes: Size in bytes
+
+    Returns:
+        str: Formatted size (e.g., "45.3 KB")
+    """
+    if size_bytes < 1024:
+        return f"{size_bytes} bytes"
+    elif size_bytes < 1024 * 1024:
+        return f"{size_bytes / 1024:.1f} KB"
+    else:
+        return f"{size_bytes / (1024 * 1024):.1f} MB"
+
+
+def validate_skill_directory(skill_dir: Union[str, Path]) -> Tuple[bool, Optional[str]]:
+    """
+    Validate that a directory is a valid skill directory
+
+    Args:
+        skill_dir: Path to skill directory
+
+    Returns:
+        tuple: (is_valid, error_message)
+    """
+    skill_path = Path(skill_dir)
+
+    if not skill_path.exists():
+        return False, f"Directory not found: {skill_dir}"
+
+    if not skill_path.is_dir():
+        return False, f"Not a directory: {skill_dir}"
+
+    skill_md = skill_path / "SKILL.md"
+    if not skill_md.exists():
+        return False, f"SKILL.md not found in {skill_dir}"
+
+    return True, None
+
+
+def validate_zip_file(zip_path: Union[str, Path]) -> Tuple[bool, Optional[str]]:
+    """
+    Validate that a file is a valid skill .zip file
+
+    Args:
+        zip_path: Path to .zip file
+
+    Returns:
+        tuple: (is_valid, error_message)
+    """
+    zip_path = Path(zip_path)
+
+    if not zip_path.exists():
+        return False, f"File not found: {zip_path}"
+
+    if not zip_path.is_file():
+        return False, f"Not a file: {zip_path}"
+
+    if zip_path.suffix.lower() != '.zip':
+        return False, f"Not a .zip file: {zip_path}"
+
+    return True, None
+
+
+def read_reference_files(skill_dir: Union[str, Path], max_chars: int = 100000, preview_limit: int = 40000) -> Dict[str, str]:
+    """Read reference files from a skill directory with size limits.
+
+    This function reads markdown files from the references/ subdirectory
+    of a skill, applying both per-file and total content limits.
+
+    Args:
+        skill_dir (str or Path): Path to skill directory
+        max_chars (int): Soft cap on total characters; reading stops after the file that crosses it (default: 100000)
+        preview_limit (int): Maximum characters per file (default: 40000)
+
+    Returns:
+        dict: Dictionary mapping filename to content
+
+    Example:
+        >>> refs = read_reference_files('output/react/', max_chars=50000)
+        >>> len(refs)
+        5
+    """
+    skill_path = Path(skill_dir)
+    references_dir = skill_path / "references"
+    references: Dict[str, str] = {}
+
+    if not references_dir.exists():
+        print(f"⚠ No references directory found at {references_dir}")
+        return references
+
+    total_chars = 0
+    for ref_file in sorted(references_dir.glob("*.md")):
+        if ref_file.name == "index.md":
+            continue
+
+        content = ref_file.read_text(encoding='utf-8')
+
+        # Limit size per file
+        if len(content) > preview_limit:
+            content = content[:preview_limit] + "\n\n[Content truncated...]"
+
+        references[ref_file.name] = content
+        total_chars += len(content)
+
+        # Stop if we've read enough
+        if total_chars > max_chars:
+            print(f"  ℹ Limiting input to {max_chars:,} characters")
+            break
+
+    return references
--- a/src/skill_seekers/mcp/README.md
+++ b/src/skill_seekers/mcp/README.md
@@ -0,0 +1,596 @@
+# Skill Seeker MCP Server
+
+Model Context Protocol (MCP) server for Skill Seeker - enables Claude Code to generate documentation skills directly.
+
+## What is This?
+
+This MCP server allows Claude Code to use Skill Seeker's tools directly through natural language commands. Instead of running CLI commands manually, you can ask Claude Code to:
+
+- Generate config files for any documentation site
+- Estimate page counts before scraping
+- Scrape documentation and build skills
+- Package skills into `.zip` files
+- List and validate configurations
+- Split large documentation (10K-40K+ pages) into focused sub-skills
+- Generate intelligent router/hub skills for split documentation
+- **NEW:** Scrape PDF documentation and extract code/images
+
+## Quick Start
+
+### 1. Install Dependencies
+
+```bash
+# From repository root
+pip3 install -r src/skill_seekers/mcp/requirements.txt
+pip3 install requests beautifulsoup4
+```
+
+### 2. Quick Setup (Automated)
+
+```bash
+# Run the setup script
+./setup_mcp.sh
+
+# Follow the prompts - it will:
+# - Install dependencies
+# - Test the server
+# - Generate configuration
+# - Guide you through Claude Code setup
+```
+
+### 3. Manual Setup
+
+Add to `~/.config/claude-code/mcp.json`:
+
+```json
+{
+  "mcpServers": {
+    "skill-seeker": {
+      "command": "python3",
+      "args": [
+        "/path/to/Skill_Seekers/src/skill_seekers/mcp/server.py"
+      ],
+      "cwd": "/path/to/Skill_Seekers"
+    }
+  }
+}
+```
+
+**Replace `/path/to/Skill_Seekers`** with your actual repository path!
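
A minimal sketch of generating this entry with the real checkout path filled in, so nothing has to be edited by hand. `build_mcp_config` is a hypothetical helper and not part of Skill Seekers; the location of `server.py` is passed in rather than assumed.

```python
import json
from pathlib import Path


def build_mcp_config(repo_root: str, server_script: str) -> dict:
    """Build an mcpServers entry for a checkout (hypothetical helper).

    repo_root:     path to the repository checkout
    server_script: path to server.py, relative to repo_root
    """
    root = Path(repo_root).expanduser().resolve()
    return {
        "mcpServers": {
            "skill-seeker": {
                "command": "python3",
                "args": [str(root / server_script)],
                "cwd": str(root),
            }
        }
    }


if __name__ == "__main__":
    import sys

    # Usage: python make_mcp_config.py <repo_root> <server.py relative to repo_root>
    if len(sys.argv) == 3:
        print(json.dumps(build_mcp_config(sys.argv[1], sys.argv[2]), indent=2))
```

The printed JSON can then be pasted into the MCP configuration file named above.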
+
+### 4. Restart Claude Code
+
+Quit and reopen Claude Code (don't just close the window).
+
+### 5. Test
+
+In Claude Code, type:
+```
+List all available configs
+```
+
+You should see a list of preset configurations (Godot, React, Vue, etc.).
+
+## Available Tools
+
+The MCP server exposes 10 tools:
+
+### 1. `generate_config`
+Create a new configuration file for any documentation website.
+
+**Parameters:**
+- `name` (required): Skill name (e.g., "tailwind")
+- `url` (required): Documentation URL (e.g., "https://tailwindcss.com/docs")
+- `description` (required): When to use this skill
+- `max_pages` (optional): Maximum pages to scrape (default: 100)
+- `rate_limit` (optional): Delay between requests in seconds (default: 0.5)
+
+**Example:**
+```
+Generate config for Tailwind CSS at https://tailwindcss.com/docs
+```
+
+### 2. `estimate_pages`
+Estimate how many pages will be scraped from a config (fast, no data downloaded).
+
+**Parameters:**
+- `config_path` (required): Path to config file (e.g., "configs/react.json")
+- `max_discovery` (optional): Maximum pages to discover (default: 1000)
+
+**Example:**
+```
+Estimate pages for configs/react.json
+```
+
+### 3. `scrape_docs`
+Scrape documentation and build a Claude skill.
+
+**Parameters:**
+- `config_path` (required): Path to config file
+- `enhance_local` (optional): Open terminal for local enhancement (default: false)
+- `skip_scrape` (optional): Use cached data (default: false)
+- `dry_run` (optional): Preview without saving (default: false)
+
+**Example:**
+```
+Scrape docs using configs/react.json
+```
+
+### 4. `package_skill`
+Package a skill directory into a `.zip` file ready for Claude upload. Uploads automatically if `ANTHROPIC_API_KEY` is set.
+
+**Parameters:**
+- `skill_dir` (required): Path to skill directory (e.g., "output/react/")
+- `auto_upload` (optional): Try to upload automatically if API key is available (default: true)
+
+**Example:**
+```
+Package skill at output/react/
+```
+
+### 5. `upload_skill`
+Upload a skill `.zip` file to Claude automatically (requires `ANTHROPIC_API_KEY`).
+
+**Parameters:**
+- `skill_zip` (required): Path to skill .zip file (e.g., "output/react.zip")
+
+**Example:**
+```
+Upload output/react.zip using upload_skill
+```
+
+### 6. `list_configs`
+List all available preset configurations.
+
+**Parameters:** None
+
+**Example:**
+```
+List all available configs
+```
+
+### 7. `validate_config`
+Validate a config file for errors.
+
+**Parameters:**
+- `config_path` (required): Path to config file
+
+**Example:**
+```
+Validate configs/godot.json
+```
+
+### 8. `split_config`
+Split a large documentation config into multiple focused skills. Intended for documentation with 10K+ pages.
+
+**Parameters:**
+- `config_path` (required): Path to config JSON file (e.g., "configs/godot.json")
+- `strategy` (optional): Split strategy - "auto", "none", "category", "router", "size" (default: "auto")
+- `target_pages` (optional): Target pages per skill (default: 5000)
+- `dry_run` (optional): Preview without saving files (default: false)
+
+**Example:**
+```
+Split configs/godot.json using router strategy with 5000 pages per skill
+```
+
+**Strategies:**
+- **auto** - Intelligently detects best strategy based on page count and config
+- **category** - Split by documentation categories (creates focused sub-skills)
+- **router** - Create router/hub skill + specialized sub-skills (RECOMMENDED for 10K+ pages)
+- **size** - Split every N pages (for docs without clear categories)
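
As a rough illustration of the **size** strategy, the sketch below chunks a list of page URLs into groups of at most `target_pages`. The function name `split_by_size` and the example URLs are assumptions made for this example; the actual splitter may work differently.

```python
from typing import List


def split_by_size(pages: List[str], target_pages: int = 5000) -> List[List[str]]:
    """Chunk page URLs into groups of at most target_pages (hypothetical helper)."""
    if target_pages <= 0:
        raise ValueError("target_pages must be positive")
    return [pages[i:i + target_pages] for i in range(0, len(pages), target_pages)]


# Twelve made-up pages split into groups of five.
pages = [f"https://example.com/docs/page-{n}" for n in range(12)]
chunks = split_by_size(pages, target_pages=5)
print([len(c) for c in chunks])  # [5, 5, 2]
```

Each chunk would then become its own sub-skill config.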
+
+### 9. `generate_router`
+Generate a router/hub skill for split documentation. Creates intelligent routing to sub-skills.
+
+**Parameters:**
+- `config_pattern` (required): Config pattern for sub-skills (e.g., "configs/godot-*.json")
+- `router_name` (optional): Router skill name (inferred from configs if not provided)
+
+**Example:**
+```
+Generate router for configs/godot-*.json
+```
+
+**What it does:**
+- Analyzes all sub-skill configs
+- Extracts routing keywords from categories and names
+- Creates router SKILL.md with intelligent routing logic
+- Lets users ask questions naturally; the router directs each one to the appropriate sub-skill
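
The keyword-based routing idea can be sketched in a few lines. This is only an illustration: the generated router is a SKILL.md interpreted by Claude, not Python code, and the `ROUTES` table and `route_question` function are hypothetical names with illustrative keywords.

```python
from typing import Dict, List, Optional

# Illustrative keyword table; sub-skill names and keywords are assumptions.
ROUTES: Dict[str, List[str]] = {
    "godot-scripting": ["scripting", "gdscript"],
    "godot-2d": ["2d", "sprites", "tilemap"],
    "godot-physics": ["physics", "collision"],
}


def route_question(question: str, routes: Dict[str, List[str]] = ROUTES) -> Optional[str]:
    """Return the sub-skill with the most matching keywords, or None if nothing matches."""
    text = question.lower()
    scores = {skill: sum(kw in text for kw in kws) for skill, kws in routes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(route_question("How do I detect a collision in the physics engine?"))  # godot-physics
print(route_question("What license does the engine use?"))  # None
```

Plain substring matching is the simplest possible scoring; a real router would likely need something more robust.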
+
+### 10. `scrape_pdf`
+Scrape PDF documentation and build a Claude skill. Extracts text, code blocks, images, and tables from PDF files.
+
+**Parameters:**
+- `config_path` (optional): Path to PDF config JSON file (e.g., "configs/manual_pdf.json")
+- `pdf_path` (optional): Direct PDF path (alternative to config_path)
+- `name` (optional): Skill name (required with pdf_path)
+- `description` (optional): Skill description
+- `from_json` (optional): Build from extracted JSON file (e.g., "output/manual_extracted.json")
+- `use_ocr` (optional): Use OCR for scanned PDFs (requires pytesseract)
+- `password` (optional): Password for encrypted PDFs
+- `extract_tables` (optional): Extract tables from PDF
+- `parallel` (optional): Process pages in parallel for faster extraction
+- `max_workers` (optional): Number of parallel workers (default: CPU count)
+
+**Examples:**
+```
+Scrape PDF at docs/manual.pdf and create skill named api-docs
+Create skill from configs/example_pdf.json
+Build skill from output/manual_extracted.json
+Scrape scanned PDF at docs/scanned.pdf using OCR
+Scrape encrypted PDF at docs/manual.pdf with password mypassword
+Scrape PDF at docs/data.pdf and extract tables
+Scrape PDF at docs/large.pdf in parallel with 8 workers
+```
+
+**What it does:**
+- Extracts text and markdown from PDF pages
+- Detects code blocks using 3 methods (font, indent, pattern)
+- Detects programming language with confidence scoring (19+ languages)
+- Validates syntax and scores code quality (0-10 scale)
+- Extracts images with size filtering
+- **NEW:** Extracts tables from PDFs (Priority 2)
+- **NEW:** OCR support for scanned PDFs (Priority 2, requires pytesseract + Pillow)
+- **NEW:** Password-protected PDF support (Priority 2)
+- **NEW:** Parallel page processing for faster extraction (Priority 3)
+- **NEW:** Intelligent caching of expensive operations (Priority 3)
+- Detects chapters and creates page chunks
+- Categorizes content automatically
+- Generates complete skill structure (SKILL.md + references)
+
+**Performance:**
+- Sequential: ~30-60 seconds per 100 pages
+- Parallel (8 workers): ~10-20 seconds per 100 pages (3x faster)
+
+**See:** `docs/PDF_SCRAPER.md` for complete PDF documentation guide
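
To show what parallel page processing with a worker count looks like in general, here is a minimal sketch using the standard library. `extract_page` is a placeholder for per-page work and `extract_all` is a hypothetical wrapper; the real scraper may use a different mechanism (CPU-bound extraction would typically call for processes rather than threads).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def extract_page(page_number: int) -> str:
    """Placeholder for per-page extraction work (hypothetical)."""
    return f"text of page {page_number}"


def extract_all(page_numbers: List[int], max_workers: int = 8) -> List[str]:
    """Process pages concurrently; pool.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_page, page_numbers))


print(extract_all([1, 2, 3], max_workers=2))
```

Because `map` preserves input order, the extracted pages can be concatenated without re-sorting.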
+
+## Example Workflows
+
+### Generate a New Skill from Scratch
+
+```
+User: Generate config for Svelte at https://svelte.dev/docs
+
+Claude: ✅ Config created: configs/svelte.json
+
+User: Estimate pages for configs/svelte.json
+
+Claude: 📊 Estimated pages: 150
+
+User: Scrape docs using configs/svelte.json
+
+Claude: ✅ Skill created at output/svelte/
+
+User: Package skill at output/svelte/
+
+Claude: ✅ Created: output/svelte.zip
+      Ready to upload to Claude!
+```
+
+### Use Existing Preset
+
+```
+User: List all available configs
+
+Claude: [Shows all configs: godot, react, vue, django, fastapi, etc.]
+
+User: Scrape docs using configs/react.json
+
+Claude: ✅ Skill created at output/react/
+
+User: Package skill at output/react/
+
+Claude: ✅ Created: output/react.zip
+```
+
+### Validate Before Scraping
+
+```
+User: Validate configs/godot.json
+
+Claude: ✅ Config is valid!
+        Name: godot
+        Base URL: https://docs.godotengine.org/en/stable/
+        Max pages: 500
+        Rate limit: 0.5s
+
+User: Scrape docs using configs/godot.json
+
+Claude: [Starts scraping...]
+```
+
+### PDF Documentation - NEW
+
+```
+User: Scrape PDF at docs/api-manual.pdf and create skill named api-docs
+
+Claude: 📄 Scraping PDF documentation...
+        ✅ Extracted 120 pages
+        ✅ Found 45 code blocks (Python, JavaScript, C++)
+        ✅ Extracted 12 images
+        ✅ Created skill at output/api-docs/
+        📦 Package with: python3 cli/package_skill.py output/api-docs/
+
+User: Package skill at output/api-docs/
+
+Claude: ✅ Created: output/api-docs.zip
+        Ready to upload to Claude!
+```
+
+### Large Documentation (40K Pages)
+
+```
+User: Estimate pages for configs/godot.json
+
+Claude: 📊 Estimated pages: 40,000
+        ⚠️  Large documentation detected!
+        💡 Recommend splitting into multiple skills
+
+User: Split configs/godot.json using router strategy
+
+Claude: ✅ Split complete!
+        Created 5 sub-skills:
+        - godot-scripting.json (5,000 pages)
+        - godot-2d.json (8,000 pages)
+        - godot-3d.json (10,000 pages)
+        - godot-physics.json (6,000 pages)
+        - godot-shaders.json (11,000 pages)
+
+User: Scrape all godot sub-skills in parallel
+
+Claude: [Starts scraping all 5 configs in parallel...]
+        ✅ All skills created in 4-8 hours instead of 20-40!
+
+User: Generate router for configs/godot-*.json
+
+Claude: ✅ Router skill created at output/godot/
+        Routing logic:
+        - "scripting", "gdscript" → godot-scripting
+        - "2d", "sprites", "tilemap" → godot-2d
+        - "3d", "meshes", "camera" → godot-3d
+        - "physics", "collision" → godot-physics
+        - "shaders", "visual shader" → godot-shaders
+
+User: Package all godot skills
+
+Claude: ✅ 6 skills packaged:
+        - godot.zip (router)
+        - godot-scripting.zip
+        - godot-2d.zip
+        - godot-3d.zip
+        - godot-physics.zip
+        - godot-shaders.zip
+
+        Upload all to Claude!
+        Users just ask questions naturally - router handles routing!
+```
+
+## Architecture
+
+### Server Structure
+
+```
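+src/skill_seekers/mcp/
+├── __init__.py         # Package metadata
+├── server.py           # Main MCP server
+├── requirements.txt    # MCP dependencies
+├── tools/              # Placeholder for future modularization
+└── README.md           # This file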
+```
+
+### How It Works
+
+1. **Claude Code** sends MCP requests to the server
+2. **Server** routes requests to appropriate tool functions
+3. **Tools** call CLI scripts (`doc_scraper.py`, `estimate_pages.py`, etc.)
+4. **CLI scripts** perform actual work (scraping, packaging, etc.)
+5. **Results** are returned to Claude Code over MCP
+
+### Tool Implementation
+
+Each tool is implemented as an async function:
+
+```python
+async def generate_config_tool(args: dict) -> list[TextContent]:
+    """Generate a config file"""
+    # Create config JSON
+    # Save to configs/
+    # Return success message
+```
+
+Tools use `subprocess.run()` to call CLI scripts:
+
+```python
+result = subprocess.run([
+    sys.executable,
+    str(CLI_DIR / "doc_scraper.py"),
+    "--config", config_path
+], capture_output=True, text=True)
+```
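
A slightly fuller sketch of the same pattern, turning the subprocess result into text that a tool could send back. `run_cli` is a hypothetical helper, and the command here is a one-line Python program standing in for a real CLI script.

```python
import subprocess
import sys
from typing import List


def run_cli(cmd: List[str], timeout: float = 60.0) -> str:
    """Run a command and return text suitable for sending back to the client."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {timeout:.0f}s"
    if result.returncode != 0:
        # Surface stderr so the caller can see why the script failed.
        return f"Error (exit {result.returncode}):\n{result.stderr.strip()}"
    return result.stdout.strip()


# Stand-in for a CLI script, run with the same interpreter as the server.
print(run_cli([sys.executable, "-c", "print('scrape finished')"]))
```

Checking `returncode` and setting a timeout keeps a failed or hung script from looking like a successful, empty result.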
+
+## Testing
+
+The MCP server has comprehensive test coverage:
+
+```bash
+# Run MCP server tests (34 tests)
+python3 -m pytest tests/test_mcp_server.py -v
+
+# Expected output: 34 passed in ~0.3s
+```
+
+### Test Coverage
+
+- **Server initialization** (2 tests)
+- **Tool listing** (2 tests)
+- **generate_config** (3 tests)
+- **estimate_pages** (3 tests)
+- **scrape_docs** (4 tests)
+- **package_skill** (3 tests)
+- **upload_skill** (2 tests)
+- **list_configs** (3 tests)
+- **validate_config** (3 tests)
+- **split_config** (3 tests)
+- **generate_router** (3 tests)
+- **Tool routing** (2 tests)
+- **Integration** (1 test)
+
+**Total: 34 tests | Pass rate: 100%**
+
+## Troubleshooting
+
+### MCP Server Not Loading
+
+**Symptoms:**
+- Tools don't appear in Claude Code
+- No response to skill-seeker commands
+
+**Solutions:**
+
+1. Check configuration:
+   ```bash
+   cat ~/.config/claude-code/mcp.json
+   ```
+
+2. Verify server can start:
+   ```bash
+   python3 src/skill_seekers/mcp/server.py
+   # Should start without errors (Ctrl+C to exit)
+   ```
+
+3. Check dependencies:
+   ```bash
+   pip3 install -r src/skill_seekers/mcp/requirements.txt
+   ```
+
+4. Completely restart Claude Code (quit and reopen)
+
+5. Check Claude Code logs:
+   - macOS: `~/Library/Logs/Claude Code/`
+   - Linux: `~/.config/claude-code/logs/`
+
+### "ModuleNotFoundError: No module named 'mcp'"
+
+```bash
+pip3 install -r src/skill_seekers/mcp/requirements.txt
+```
+
+### Tools Appear But Don't Work
+
+**Solutions:**
+
+1. Verify `cwd` in config points to repository root
+2. Check CLI tools exist:
+   ```bash
+   ls src/skill_seekers/cli/doc_scraper.py
+   ls src/skill_seekers/cli/estimate_pages.py
+   ls src/skill_seekers/cli/package_skill.py
+   ```
+
+3. Test CLI tools directly:
+   ```bash
+   python3 src/skill_seekers/cli/doc_scraper.py --help
+   ```
+
+### Slow Operations
+
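+1. Check `rate_limit` in configs: it is the delay between requests in seconds, so a lower value speeds up scraping (at the cost of more load on the site)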
+2. Use smaller `max_pages` for testing
+3. Use `skip_scrape` to avoid re-downloading data
+
+## Advanced Configuration
+
+### Using Virtual Environment
+
+```bash
+# Create venv
+python3 -m venv venv
+source venv/bin/activate
+pip install -r src/skill_seekers/mcp/requirements.txt
+pip install requests beautifulsoup4
+which python3  # Copy this path
+```
+
+Configure Claude Code to use venv Python:
+
+```json
+{
+  "mcpServers": {
+    "skill-seeker": {
+      "command": "/path/to/Skill_Seekers/venv/bin/python3",
+      "args": ["/path/to/Skill_Seekers/src/skill_seekers/mcp/server.py"],
+      "cwd": "/path/to/Skill_Seekers"
+    }
+  }
+}
+```
+
+### Debug Mode
+
+Enable verbose logging:
+
+```json
+{
+  "mcpServers": {
+    "skill-seeker": {
+      "command": "python3",
+      "args": ["-u", "/path/to/Skill_Seekers/src/skill_seekers/mcp/server.py"],
+      "cwd": "/path/to/Skill_Seekers",
+      "env": {
+        "DEBUG": "1"
+      }
+    }
+  }
+}
+```
+
+### With API Enhancement
+
+For API-based enhancement (requires Anthropic API key):
+
+```json
+{
+  "mcpServers": {
+    "skill-seeker": {
+      "command": "python3",
+      "args": ["/path/to/Skill_Seekers/src/skill_seekers/mcp/server.py"],
+      "cwd": "/path/to/Skill_Seekers",
+      "env": {
+        "ANTHROPIC_API_KEY": "sk-ant-your-key-here"
+      }
+    }
+  }
+}
+```
+
+## Performance
+
+| Operation | Time | Notes |
+|-----------|------|-------|
+| List configs | <1s | Instant |
+| Generate config | <1s | Creates JSON file |
+| Validate config | <1s | Quick validation |
+| Estimate pages | 1-2min | Fast, no data download |
+| Split config | 1-3min | Analyzes and creates sub-configs |
+| Generate router | 10-30s | Creates router SKILL.md |
+| Scrape docs | 15-45min | First time only |
+| Scrape docs (40K pages) | 20-40hrs | Sequential |
+| Scrape docs (40K pages, parallel) | 4-8hrs | 5 skills in parallel |
+| Scrape (cached) | <1min | With `skip_scrape` |
+| Package skill | 5-10s | Creates .zip |
+| Package multi | 30-60s | Packages 5-10 skills |
+
+## Documentation
+
+- **Full Setup Guide**: [docs/MCP_SETUP.md](../../../docs/MCP_SETUP.md)
+- **Main README**: [README.md](../../../README.md)
+- **Usage Guide**: [docs/USAGE.md](../../../docs/USAGE.md)
+- **Testing Guide**: [docs/TESTING.md](../../../docs/TESTING.md)
+
+## Support
+
+- **Issues**: [GitHub Issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
+
+## License
+
+MIT License - See [LICENSE](../../../LICENSE) for details
--- a/src/skill_seekers/mcp/__init__.py
+++ b/src/skill_seekers/mcp/__init__.py
@@ -0,0 +1,27 @@
+"""Skill Seekers MCP (Model Context Protocol) server package.
+
+This package provides MCP server integration for Claude Code, allowing
+natural language interaction with Skill Seekers tools.
+
+Main modules:
+    - server: MCP server implementation with 10 tools
+
+Available MCP Tools:
+    - list_configs: List all available preset configurations
+    - generate_config: Generate a new config file for any docs site
+    - validate_config: Validate a config file structure
+    - estimate_pages: Estimate page count before scraping
+    - scrape_docs: Scrape and build a skill
+    - package_skill: Package skill into .zip file (with auto-upload)
+    - upload_skill: Upload .zip to Claude
+    - split_config: Split large documentation configs
+    - generate_router: Generate router/hub skills
+    - scrape_pdf: Scrape PDF documentation and build a skill
+
+Usage:
+    The MCP server is typically run by Claude Code via configuration
+    in ~/.config/claude-code/mcp.json
+"""
+
+__version__ = "2.0.0"
+
+__all__ = []
--- a/src/skill_seekers/mcp/requirements.txt
+++ b/src/skill_seekers/mcp/requirements.txt
@@ -0,0 +1,9 @@
+# MCP Server dependencies
+mcp>=1.0.0
+
+# CLI tool dependencies (shared)
+requests>=2.31.0
+beautifulsoup4>=4.12.0
+
+# Optional: for API-based enhancement
+# anthropic>=0.18.0
--- a/src/skill_seekers/mcp/server.py
+++ b/src/skill_seekers/mcp/server.py
--- a/src/skill_seekers/mcp/tools/__init__.py
+++ b/src/skill_seekers/mcp/tools/__init__.py
@@ -0,0 +1,19 @@
+"""MCP tools subpackage.
+
+This package will contain modularized MCP tool implementations.
+
+Planned structure (for future refactoring):
+    - scraping_tools.py: Tools for scraping (estimate_pages, scrape_docs)
+    - building_tools.py: Tools for building (package_skill, validate_config)
+    - deployment_tools.py: Tools for deployment (upload_skill)
+    - config_tools.py: Tools for configs (list_configs, generate_config)
+    - advanced_tools.py: Advanced tools (split_config, generate_router)
+
+Current state:
+    All tools are currently implemented in mcp/server.py
+    This directory is a placeholder for future modularization.
+"""
+
+__version__ = "2.0.0"
+
+__all__ = []