feat: Add modern Python packaging - Phase 1 (Foundation)

Implements issue #168 - Modern Python packaging with uv support

This is Phase 1 of the modernization effort, establishing the core
package structure and build system.

## Major Changes

### 1. Migrated to src/ Layout
- Moved cli/ → src/skill_seekers/cli/
- Moved skill_seeker_mcp/ → src/skill_seekers/mcp/
- Created root package: src/skill_seekers/__init__.py
- Updated all imports: cli. → skill_seekers.cli.
- Updated all imports: skill_seeker_mcp. → skill_seekers.mcp.

### 2. Created pyproject.toml
- Modern Python packaging configuration
- All dependencies properly declared
- 9 CLI entry points configured:
  * skill-seekers (unified CLI)
  * skill-seekers-scrape
  * skill-seekers-github
  * skill-seekers-pdf
  * skill-seekers-unified
  * skill-seekers-enhance
  * skill-seekers-package
  * skill-seekers-upload
  * skill-seekers-estimate
- uv tool support enabled
- Build system: setuptools with wheel

### 3. Created Unified CLI (main.py)
- Git-style subcommands (skill-seekers scrape, etc.)
- Delegates to existing tool main() functions
- Full help system at top-level and subcommand level
- Backwards compatible with individual commands
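
A minimal sketch of the git-style dispatch described above, not the actual contents of main.py. The handler functions `scrape_main` and `github_main` are hypothetical stand-ins for the existing tool `main()` functions, and the assumption that each one accepts an argument list and returns an exit code is mine:

```python
import sys
from typing import Callable, Dict, List, Optional


def scrape_main(argv: List[str]) -> int:
    """Hypothetical stand-in for an existing tool's main() function."""
    print(f"scrape called with {argv}")
    return 0


def github_main(argv: List[str]) -> int:
    """Hypothetical stand-in for a second tool's main() function."""
    print(f"github called with {argv}")
    return 0


# Map each subcommand name to the function that implements it.
COMMANDS: Dict[str, Callable[[List[str]], int]] = {
    "scrape": scrape_main,
    "github": github_main,
}


def main(argv: Optional[List[str]] = None) -> int:
    """Dispatch `skill-seekers <command> [args...]` to the matching tool."""
    argv = sys.argv[1:] if argv is None else list(argv)
    if not argv or argv[0] in ("-h", "--help"):
        print("usage: skill-seekers <command> [args...]")
        print("commands: " + ", ".join(sorted(COMMANDS)))
        return 0
    command, rest = argv[0], argv[1:]
    handler = COMMANDS.get(command)
    if handler is None:
        print(f"unknown command: {command}", file=sys.stderr)
        return 2
    # Everything after the subcommand is passed through untouched, so each
    # tool keeps its own flags and its own --help output.
    return handler(rest)
```

Passing the remaining arguments through unchanged is what would let a subcommand such as `skill-seekers scrape --help` show the scraper's own help rather than the top-level one.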

### 4. Updated Package Versions
- cli/__init__.py: 1.3.0 → 2.0.0
- mcp/__init__.py: 1.2.0 → 2.0.0
- Root package: 2.0.0

### 5. Updated Test Suite
- Fixed test_package_structure.py for new layout
- All 28 package structure tests passing
- Updated all test imports for new structure

## Installation Methods (Working)

```bash
# Development install
pip install -e .

# Run unified CLI
skill-seekers --version  # → 2.0.0
skill-seekers --help

# Run individual tools
skill-seekers-scrape --help
skill-seekers-github --help
```

## Test Results
- Package structure tests: 28/28 passing 
- Package installs successfully 
- All entry points working 

## Still TODO (Phase 2)
- [ ] Run full test suite (299 tests)
- [ ] Update documentation (README, CLAUDE.md, etc.)
- [ ] Test with uv tool run/install
- [ ] Build and publish to PyPI
- [ ] Create PR and merge

## Breaking Changes
None - fully backwards compatible. Old import paths still work.

## Migration for Users
No action needed. Package works with both pip and uv.

Closes #168 (when complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
yusyus
2025-11-07 01:14:24 +03:00
parent e3b49574d3
commit ce1c07b437
43 changed files with 601 additions and 106 deletions


@@ -0,0 +1,22 @@
"""
Skill Seekers - Convert documentation, GitHub repos, and PDFs into Claude AI skills.

This package provides tools for automatically scraping, organizing, and packaging
documentation from various sources into uploadable Claude AI skills.
"""

__version__ = "2.0.0"
__author__ = "Yusuf Karaaslan"
__license__ = "MIT"

# Expose main components for easier imports
from skill_seekers.cli import __version__ as cli_version
from skill_seekers.mcp import __version__ as mcp_version

__all__ = [
    "__version__",
    "__author__",
    "__license__",
    "cli_version",
    "mcp_version",
]


@@ -0,0 +1,39 @@
"""Skill Seekers CLI tools package.

This package provides command-line tools for converting documentation
websites into Claude AI skills.

Main modules:
    - doc_scraper: Main documentation scraping and skill building tool
    - llms_txt_detector: Detect llms.txt files at documentation URLs
    - llms_txt_downloader: Download llms.txt content
    - llms_txt_parser: Parse llms.txt markdown content
    - pdf_scraper: Extract documentation from PDF files
    - enhance_skill: AI-powered skill enhancement (API-based)
    - enhance_skill_local: AI-powered skill enhancement (local)
    - estimate_pages: Estimate page count before scraping
    - package_skill: Package skills into .zip files
    - upload_skill: Upload skills to Claude
    - utils: Shared utility functions
"""

from .llms_txt_detector import LlmsTxtDetector
from .llms_txt_downloader import LlmsTxtDownloader
from .llms_txt_parser import LlmsTxtParser

try:
    from .utils import open_folder, read_reference_files
except ImportError:
    # utils.py might not exist in all configurations
    open_folder = None
    read_reference_files = None

__version__ = "2.0.0"

__all__ = [
    "LlmsTxtDetector",
    "LlmsTxtDownloader",
    "LlmsTxtParser",
    "open_folder",
    "read_reference_files",
]


@@ -0,0 +1,491 @@
#!/usr/bin/env python3
"""
Code Analyzer for GitHub Repositories
Extracts code signatures at configurable depth levels:
- surface: File tree only (existing behavior)
- deep: Parse files for signatures, parameters, types
- full: Complete AST analysis (future enhancement)
Supports multiple languages with language-specific parsers.
"""
import ast
import re
import logging
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, asdict
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Parameter:
    """Represents a function parameter."""
    name: str
    type_hint: Optional[str] = None
    default: Optional[str] = None


@dataclass
class FunctionSignature:
    """Represents a function/method signature."""
    name: str
    parameters: List[Parameter]
    return_type: Optional[str] = None
    docstring: Optional[str] = None
    line_number: Optional[int] = None
    is_async: bool = False
    is_method: bool = False
    # None is normalized to an empty list in __post_init__
    decorators: Optional[List[str]] = None

    def __post_init__(self):
        if self.decorators is None:
            self.decorators = []


@dataclass
class ClassSignature:
    """Represents a class signature."""
    name: str
    base_classes: List[str]
    methods: List[FunctionSignature]
    docstring: Optional[str] = None
    line_number: Optional[int] = None
class CodeAnalyzer:
"""
Analyzes code at different depth levels.
"""
def __init__(self, depth: str = 'surface'):
"""
Initialize code analyzer.
Args:
depth: Analysis depth ('surface', 'deep', 'full')
"""
self.depth = depth
def analyze_file(self, file_path: str, content: str, language: str) -> Dict[str, Any]:
"""
Analyze a single file based on depth level.
Args:
file_path: Path to file in repository
content: File content as string
language: Programming language (Python, JavaScript, etc.)
Returns:
Dict containing extracted signatures
"""
if self.depth == 'surface':
return {} # Surface level doesn't analyze individual files
logger.debug(f"Analyzing {file_path} (language: {language}, depth: {self.depth})")
try:
if language == 'Python':
return self._analyze_python(content, file_path)
elif language in ['JavaScript', 'TypeScript']:
return self._analyze_javascript(content, file_path)
elif language in ['C', 'C++']:
return self._analyze_cpp(content, file_path)
else:
logger.debug(f"No analyzer for language: {language}")
return {}
except Exception as e:
logger.warning(f"Error analyzing {file_path}: {e}")
return {}
def _analyze_python(self, content: str, file_path: str) -> Dict[str, Any]:
"""Analyze Python file using AST."""
try:
tree = ast.parse(content)
except SyntaxError as e:
logger.debug(f"Syntax error in {file_path}: {e}")
return {}
classes = []
functions = []
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                class_sig = self._extract_python_class(node)
                classes.append(asdict(class_sig))
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Skip methods: functions that sit directly in a class body.
                # Checking the parent type first matters because some nodes
                # (ast.Lambda, ast.IfExp) have a non-list `body`, and
                # `node in parent.body` would raise TypeError on them.
                is_method = any(
                    isinstance(parent, ast.ClassDef) and node in parent.body
                    for parent in ast.walk(tree)
                )
                if not is_method:
                    func_sig = self._extract_python_function(node)
                    functions.append(asdict(func_sig))
return {
'classes': classes,
'functions': functions
}
def _extract_python_class(self, node: ast.ClassDef) -> ClassSignature:
"""Extract class signature from AST node."""
# Extract base classes
bases = []
for base in node.bases:
if isinstance(base, ast.Name):
bases.append(base.id)
elif isinstance(base, ast.Attribute):
bases.append(f"{base.value.id}.{base.attr}" if hasattr(base.value, 'id') else base.attr)
# Extract methods
methods = []
for item in node.body:
if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
method_sig = self._extract_python_function(item, is_method=True)
methods.append(method_sig)
# Extract docstring
docstring = ast.get_docstring(node)
return ClassSignature(
name=node.name,
base_classes=bases,
methods=methods,
docstring=docstring,
line_number=node.lineno
)
def _extract_python_function(self, node, is_method: bool = False) -> FunctionSignature:
"""Extract function signature from AST node."""
# Extract parameters
params = []
for arg in node.args.args:
param_type = None
if arg.annotation:
param_type = ast.unparse(arg.annotation) if hasattr(ast, 'unparse') else None
params.append(Parameter(
name=arg.arg,
type_hint=param_type
))
        # Extract defaults
        defaults = node.args.defaults
        if defaults:
            # Defaults are aligned to the end of params
            num_no_default = len(params) - len(defaults)
            for i, default in enumerate(defaults):
                param_idx = num_no_default + i
                # The lower bound guards against a negative index when some
                # defaults belong to positional-only parameters, which are
                # not collected in params
                if 0 <= param_idx < len(params):
                    try:
                        params[param_idx].default = ast.unparse(default) if hasattr(ast, 'unparse') else str(default)
                    except Exception:
                        params[param_idx].default = "..."

        # Extract return type
        return_type = None
        if node.returns:
            try:
                return_type = ast.unparse(node.returns) if hasattr(ast, 'unparse') else None
            except Exception:
                pass

        # Extract decorators
        decorators = []
        for decorator in node.decorator_list:
            try:
                if hasattr(ast, 'unparse'):
                    decorators.append(ast.unparse(decorator))
                elif isinstance(decorator, ast.Name):
                    decorators.append(decorator.id)
            except Exception:
                pass
# Extract docstring
docstring = ast.get_docstring(node)
return FunctionSignature(
name=node.name,
parameters=params,
return_type=return_type,
docstring=docstring,
line_number=node.lineno,
is_async=isinstance(node, ast.AsyncFunctionDef),
is_method=is_method,
decorators=decorators
)
def _analyze_javascript(self, content: str, file_path: str) -> Dict[str, Any]:
"""
Analyze JavaScript/TypeScript file using regex patterns.
Note: This is a simplified approach. For production, consider using
a proper JS/TS parser like esprima or ts-morph.
"""
classes = []
functions = []
# Extract class definitions
class_pattern = r'class\s+(\w+)(?:\s+extends\s+(\w+))?\s*\{'
for match in re.finditer(class_pattern, content):
class_name = match.group(1)
base_class = match.group(2) if match.group(2) else None
# Try to extract methods (simplified)
class_block_start = match.end()
# This is a simplification - proper parsing would track braces
class_block_end = content.find('}', class_block_start)
if class_block_end != -1:
class_body = content[class_block_start:class_block_end]
methods = self._extract_js_methods(class_body)
else:
methods = []
classes.append({
'name': class_name,
'base_classes': [base_class] if base_class else [],
'methods': methods,
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1
})
# Extract top-level functions
func_pattern = r'(?:async\s+)?function\s+(\w+)\s*\(([^)]*)\)'
for match in re.finditer(func_pattern, content):
func_name = match.group(1)
params_str = match.group(2)
            # Check the keyword prefix only, so a name like asyncHandler is not flagged
            is_async = match.group(0).startswith('async')
params = self._parse_js_parameters(params_str)
functions.append({
'name': func_name,
'parameters': params,
'return_type': None, # JS doesn't have type annotations (unless TS)
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1,
'is_async': is_async,
'is_method': False,
'decorators': []
})
# Extract arrow functions assigned to const/let
arrow_pattern = r'(?:const|let|var)\s+(\w+)\s*=\s*(?:async\s+)?\(([^)]*)\)\s*=>'
for match in re.finditer(arrow_pattern, content):
func_name = match.group(1)
params_str = match.group(2)
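            # Look for the async keyword right after the assignment, not anywhere in the match
            is_async = re.match(r'(?:const|let|var)\s+\w+\s*=\s*async\s', match.group(0)) is not None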
params = self._parse_js_parameters(params_str)
functions.append({
'name': func_name,
'parameters': params,
'return_type': None,
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1,
'is_async': is_async,
'is_method': False,
'decorators': []
})
return {
'classes': classes,
'functions': functions
}
def _extract_js_methods(self, class_body: str) -> List[Dict]:
"""Extract method signatures from class body."""
methods = []
# Match method definitions
method_pattern = r'(?:async\s+)?(\w+)\s*\(([^)]*)\)'
for match in re.finditer(method_pattern, class_body):
method_name = match.group(1)
params_str = match.group(2)
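            # Require the async keyword followed by whitespace, so a method
            # named e.g. asyncLoad is not flagged as async
            is_async = re.match(r'async\s', match.group(0)) is not None

            # Skip control-flow keywords that the pattern mistakes for method names
            if method_name in ['if', 'for', 'while', 'switch']:
                continue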
params = self._parse_js_parameters(params_str)
methods.append({
'name': method_name,
'parameters': params,
'return_type': None,
'docstring': None,
'line_number': None,
'is_async': is_async,
'is_method': True,
'decorators': []
})
return methods
def _parse_js_parameters(self, params_str: str) -> List[Dict]:
"""Parse JavaScript parameter string."""
params = []
if not params_str.strip():
return params
# Split by comma (simplified - doesn't handle complex default values)
param_list = [p.strip() for p in params_str.split(',')]
for param in param_list:
if not param:
continue
# Check for default value
if '=' in param:
name, default = param.split('=', 1)
name = name.strip()
default = default.strip()
else:
name = param
default = None
# Check for type annotation (TypeScript)
type_hint = None
if ':' in name:
name, type_hint = name.split(':', 1)
name = name.strip()
type_hint = type_hint.strip()
params.append({
'name': name,
'type_hint': type_hint,
'default': default
})
return params
def _analyze_cpp(self, content: str, file_path: str) -> Dict[str, Any]:
"""
Analyze C/C++ header file using regex patterns.
Note: This is a simplified approach focusing on header files.
For production, consider using libclang or similar.
"""
classes = []
functions = []
# Extract class definitions (simplified - doesn't handle nested classes)
class_pattern = r'class\s+(\w+)(?:\s*:\s*public\s+(\w+))?\s*\{'
for match in re.finditer(class_pattern, content):
class_name = match.group(1)
base_class = match.group(2) if match.group(2) else None
classes.append({
'name': class_name,
'base_classes': [base_class] if base_class else [],
'methods': [], # Simplified - would need to parse class body
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1
})
# Extract function declarations
func_pattern = r'(\w+(?:\s*\*|\s*&)?)\s+(\w+)\s*\(([^)]*)\)'
for match in re.finditer(func_pattern, content):
return_type = match.group(1).strip()
func_name = match.group(2)
params_str = match.group(3)
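            # Skip control-flow keywords, and statements such as `return foo(x)`
            # or `new Foo(x)` where a keyword lands in the return-type slot
            keywords = {'if', 'for', 'while', 'switch', 'return', 'else', 'new', 'delete', 'throw'}
            if func_name in keywords or return_type in keywords:
                continue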
params = self._parse_cpp_parameters(params_str)
functions.append({
'name': func_name,
'parameters': params,
'return_type': return_type,
'docstring': None,
'line_number': content[:match.start()].count('\n') + 1,
'is_async': False,
'is_method': False,
'decorators': []
})
return {
'classes': classes,
'functions': functions
}
def _parse_cpp_parameters(self, params_str: str) -> List[Dict]:
"""Parse C++ parameter string."""
params = []
if not params_str.strip() or params_str.strip() == 'void':
return params
# Split by comma (simplified)
param_list = [p.strip() for p in params_str.split(',')]
for param in param_list:
if not param:
continue
# Check for default value
default = None
if '=' in param:
param, default = param.rsplit('=', 1)
param = param.strip()
default = default.strip()
# Extract type and name (simplified)
# Format: "type name" or "type* name" or "type& name"
parts = param.split()
if len(parts) >= 2:
param_type = ' '.join(parts[:-1])
param_name = parts[-1]
else:
param_type = param
param_name = "unknown"
params.append({
'name': param_name,
'type_hint': param_type,
'default': default
})
return params
if __name__ == '__main__':
# Test the analyzer
python_code = '''
class Node2D:
"""Base class for 2D nodes."""
def move_local_x(self, delta: float, snap: bool = False) -> None:
"""Move node along local X axis."""
pass
async def tween_position(self, target: tuple, duration: float = 1.0):
"""Animate position to target."""
pass
def create_sprite(texture: str) -> Node2D:
"""Create a new sprite node."""
return Node2D()
'''
analyzer = CodeAnalyzer(depth='deep')
result = analyzer.analyze_file('test.py', python_code, 'Python')
print("Analysis Result:")
print(f"Classes: {len(result.get('classes', []))}")
print(f"Functions: {len(result.get('functions', []))}")
if result.get('classes'):
cls = result['classes'][0]
print(f"\nClass: {cls['name']}")
print(f" Methods: {len(cls['methods'])}")
for method in cls['methods']:
params = ', '.join([f"{p['name']}: {p['type_hint']}" + (f" = {p['default']}" if p.get('default') else "")
for p in method['parameters']])
print(f" {method['name']}({params}) -> {method['return_type']}")
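
The JavaScript analyzer above notes that proper parsing would track braces, whereas `content.find('}', ...)` stops at the first closing brace and so cuts the class body off after its first method. A minimal sketch of brace tracking follows; `find_matching_brace` is a hypothetical helper, not part of this commit, and it deliberately ignores braces inside strings, template literals, regex literals, and comments:

```python
from typing import Optional


def find_matching_brace(source: str, open_index: int) -> Optional[int]:
    """Return the index of the '}' that closes the '{' at open_index.

    Illustrative only: braces inside strings, template literals, regex
    literals, and comments are not skipped, so a real JS/TS parser is
    still the safer choice.
    """
    if open_index >= len(source) or source[open_index] != "{":
        raise ValueError("open_index must point at '{'")
    depth = 0
    for i in range(open_index, len(source)):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    return None  # unbalanced input


js = "class A { foo() { return 1; } bar() { return 2; } }"
start = js.index("{")
end = find_matching_brace(js, start)
class_body = js[start + 1:end]
```

With depth tracking, `class_body` spans both methods, while a plain search for the first `}` would end inside `foo`.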


@@ -0,0 +1,376 @@
#!/usr/bin/env python3
"""
Unified Config Validator
Validates unified config format that supports multiple sources:
- documentation (website scraping)
- github (repository scraping)
- pdf (PDF document scraping)
Also provides backward compatibility detection for legacy configs.
"""
import json
import logging
from typing import Dict, Any, List, Optional, Union
from pathlib import Path
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ConfigValidator:
"""
Validates unified config format and provides backward compatibility.
"""
# Valid source types
VALID_SOURCE_TYPES = {'documentation', 'github', 'pdf'}
# Valid merge modes
VALID_MERGE_MODES = {'rule-based', 'claude-enhanced'}
# Valid code analysis depth levels
VALID_DEPTH_LEVELS = {'surface', 'deep', 'full'}
def __init__(self, config_or_path: Union[Dict[str, Any], str]):
"""
Initialize validator with config dict or file path.
Args:
config_or_path: Either a config dict or path to config JSON file
"""
if isinstance(config_or_path, dict):
self.config_path = None
self.config = config_or_path
else:
self.config_path = config_or_path
self.config = self._load_config()
self.is_unified = self._detect_format()
def _load_config(self) -> Dict[str, Any]:
"""Load JSON config file."""
        try:
            with open(self.config_path, 'r', encoding='utf-8') as f:
                return json.load(f)
        except FileNotFoundError as e:
            # Chain the original exception so the traceback keeps its cause
            raise ValueError(f"Config file not found: {self.config_path}") from e
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON in config file: {e}") from e
def _detect_format(self) -> bool:
"""
Detect if config is unified format or legacy.
Returns:
True if unified format (has 'sources' array)
False if legacy format
"""
return 'sources' in self.config and isinstance(self.config['sources'], list)
def validate(self) -> bool:
"""
Validate config based on detected format.
Returns:
True if valid
Raises:
ValueError if invalid with detailed error message
"""
if self.is_unified:
return self._validate_unified()
else:
return self._validate_legacy()
def _validate_unified(self) -> bool:
"""Validate unified config format."""
logger.info("Validating unified config format...")
# Required top-level fields
if 'name' not in self.config:
raise ValueError("Missing required field: 'name'")
if 'description' not in self.config:
raise ValueError("Missing required field: 'description'")
if 'sources' not in self.config:
raise ValueError("Missing required field: 'sources'")
# Validate sources array
sources = self.config['sources']
if not isinstance(sources, list):
raise ValueError("'sources' must be an array")
if len(sources) == 0:
raise ValueError("'sources' array cannot be empty")
# Validate merge_mode (optional)
merge_mode = self.config.get('merge_mode', 'rule-based')
if merge_mode not in self.VALID_MERGE_MODES:
raise ValueError(f"Invalid merge_mode: '{merge_mode}'. Must be one of {self.VALID_MERGE_MODES}")
# Validate each source
for i, source in enumerate(sources):
self._validate_source(source, i)
logger.info(f"✅ Unified config valid: {len(sources)} sources")
return True
def _validate_source(self, source: Dict[str, Any], index: int):
"""Validate individual source configuration."""
# Check source has 'type' field
if 'type' not in source:
raise ValueError(f"Source {index}: Missing required field 'type'")
source_type = source['type']
if source_type not in self.VALID_SOURCE_TYPES:
raise ValueError(
f"Source {index}: Invalid type '{source_type}'. "
f"Must be one of {self.VALID_SOURCE_TYPES}"
)
# Type-specific validation
if source_type == 'documentation':
self._validate_documentation_source(source, index)
elif source_type == 'github':
self._validate_github_source(source, index)
elif source_type == 'pdf':
self._validate_pdf_source(source, index)
def _validate_documentation_source(self, source: Dict[str, Any], index: int):
"""Validate documentation source configuration."""
if 'base_url' not in source:
raise ValueError(f"Source {index} (documentation): Missing required field 'base_url'")
# Optional but recommended fields
if 'selectors' not in source:
logger.warning(f"Source {index} (documentation): No 'selectors' specified, using defaults")
if 'max_pages' in source and not isinstance(source['max_pages'], int):
raise ValueError(f"Source {index} (documentation): 'max_pages' must be an integer")
def _validate_github_source(self, source: Dict[str, Any], index: int):
"""Validate GitHub source configuration."""
if 'repo' not in source:
raise ValueError(f"Source {index} (github): Missing required field 'repo'")
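        # Validate repo format (owner/repo); a non-string value is reported
        # as a ValueError instead of surfacing as a TypeError
        repo = source['repo']
        if not isinstance(repo, str) or '/' not in repo:
            raise ValueError(
                f"Source {index} (github): Invalid repo format '{repo}'. "
                f"Must be 'owner/repo' (e.g., 'facebook/react')"
            )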
# Validate code_analysis_depth if specified
if 'code_analysis_depth' in source:
depth = source['code_analysis_depth']
if depth not in self.VALID_DEPTH_LEVELS:
raise ValueError(
f"Source {index} (github): Invalid code_analysis_depth '{depth}'. "
f"Must be one of {self.VALID_DEPTH_LEVELS}"
)
# Validate max_issues if specified
if 'max_issues' in source and not isinstance(source['max_issues'], int):
raise ValueError(f"Source {index} (github): 'max_issues' must be an integer")
def _validate_pdf_source(self, source: Dict[str, Any], index: int):
"""Validate PDF source configuration."""
if 'path' not in source:
raise ValueError(f"Source {index} (pdf): Missing required field 'path'")
# Check if file exists
pdf_path = source['path']
if not Path(pdf_path).exists():
logger.warning(f"Source {index} (pdf): File not found: {pdf_path}")
def _validate_legacy(self) -> bool:
"""
Validate legacy config format (backward compatibility).
Legacy configs are the old format used by doc_scraper, github_scraper, pdf_scraper.
"""
logger.info("Detected legacy config format (backward compatible)")
# Detect which legacy type based on fields
if 'base_url' in self.config:
logger.info("Legacy type: documentation")
elif 'repo' in self.config:
logger.info("Legacy type: github")
elif 'pdf' in self.config or 'path' in self.config:
logger.info("Legacy type: pdf")
else:
raise ValueError("Cannot detect legacy config type (missing base_url, repo, or pdf)")
return True
def convert_legacy_to_unified(self) -> Dict[str, Any]:
"""
Convert legacy config to unified format.
Returns:
Unified config dict
"""
if self.is_unified:
logger.info("Config already in unified format")
return self.config
logger.info("Converting legacy config to unified format...")
# Detect legacy type and convert
if 'base_url' in self.config:
return self._convert_legacy_documentation()
elif 'repo' in self.config:
return self._convert_legacy_github()
elif 'pdf' in self.config or 'path' in self.config:
return self._convert_legacy_pdf()
else:
raise ValueError("Cannot convert: unknown legacy format")
def _convert_legacy_documentation(self) -> Dict[str, Any]:
"""Convert legacy documentation config to unified."""
unified = {
'name': self.config.get('name', 'unnamed'),
'description': self.config.get('description', 'Documentation skill'),
'merge_mode': 'rule-based',
'sources': [
{
'type': 'documentation',
**{k: v for k, v in self.config.items()
if k not in ['name', 'description']}
}
]
}
return unified
def _convert_legacy_github(self) -> Dict[str, Any]:
"""Convert legacy GitHub config to unified."""
unified = {
'name': self.config.get('name', 'unnamed'),
'description': self.config.get('description', 'GitHub repository skill'),
'merge_mode': 'rule-based',
'sources': [
{
'type': 'github',
**{k: v for k, v in self.config.items()
if k not in ['name', 'description']}
}
]
}
return unified
def _convert_legacy_pdf(self) -> Dict[str, Any]:
"""Convert legacy PDF config to unified."""
unified = {
'name': self.config.get('name', 'unnamed'),
'description': self.config.get('description', 'PDF document skill'),
'merge_mode': 'rule-based',
'sources': [
{
'type': 'pdf',
**{k: v for k, v in self.config.items()
if k not in ['name', 'description']}
}
]
}
return unified
def get_sources_by_type(self, source_type: str) -> List[Dict[str, Any]]:
"""
Get all sources of a specific type.
Args:
source_type: 'documentation', 'github', or 'pdf'
Returns:
List of sources matching the type
"""
if not self.is_unified:
# For legacy, convert and get sources
unified = self.convert_legacy_to_unified()
sources = unified['sources']
else:
sources = self.config['sources']
return [s for s in sources if s.get('type') == source_type]
def has_multiple_sources(self) -> bool:
"""Check if config has multiple sources (requires merging)."""
if not self.is_unified:
return False
return len(self.config['sources']) > 1
def needs_api_merge(self) -> bool:
"""
Check if config needs API merging.
Returns True if both documentation and github sources exist
with API extraction enabled.
"""
if not self.has_multiple_sources():
return False
has_docs_api = any(
s.get('type') == 'documentation' and s.get('extract_api', True)
for s in self.config['sources']
)
has_github_code = any(
s.get('type') == 'github' and s.get('include_code', False)
for s in self.config['sources']
)
return has_docs_api and has_github_code
def validate_config(config_path: str) -> ConfigValidator:
"""
Validate config file and return validator instance.
Args:
config_path: Path to config JSON file
Returns:
ConfigValidator instance
Raises:
ValueError if config is invalid
"""
validator = ConfigValidator(config_path)
validator.validate()
return validator
if __name__ == '__main__':
import sys
if len(sys.argv) < 2:
print("Usage: python config_validator.py <config.json>")
sys.exit(1)
config_file = sys.argv[1]
try:
validator = validate_config(config_file)
        print("\n✅ Config valid!")
print(f" Format: {'Unified' if validator.is_unified else 'Legacy'}")
print(f" Name: {validator.config.get('name')}")
if validator.is_unified:
sources = validator.config['sources']
print(f" Sources: {len(sources)}")
for i, source in enumerate(sources):
print(f" {i+1}. {source['type']}")
if validator.needs_api_merge():
merge_mode = validator.config.get('merge_mode', 'rule-based')
print(f" ⚠️ API merge required (mode: {merge_mode})")
except ValueError as e:
print(f"\n❌ Config invalid: {e}")
sys.exit(1)
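
The legacy-to-unified conversion in the validator above amounts to wrapping a flat config in a one-element `sources` list. The sketch below restates that idea as a standalone function so it runs without the package installed. `to_unified` is a hypothetical name, the sample config values are made up for illustration, and the empty default description is a simplification of the per-type defaults used in the file:

```python
from typing import Any, Dict


def to_unified(legacy: Dict[str, Any], source_type: str) -> Dict[str, Any]:
    """Wrap a flat legacy config in the unified 'sources' layout."""
    # name and description stay at the top level; every other key moves
    # into the single source entry.
    source = {k: v for k, v in legacy.items() if k not in ("name", "description")}
    return {
        "name": legacy.get("name", "unnamed"),
        "description": legacy.get("description", ""),
        "merge_mode": "rule-based",
        "sources": [{"type": source_type, **source}],
    }


# Hypothetical legacy documentation config.
legacy = {
    "name": "react",
    "description": "React docs",
    "base_url": "https://react.dev/",
    "max_pages": 200,
}
unified = to_unified(legacy, "documentation")
```

A config converted this way contains a `sources` list, so it would be detected as unified by the same check `_detect_format` uses.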


@@ -0,0 +1,513 @@
#!/usr/bin/env python3
"""
Conflict Detector for Multi-Source Skills
Detects conflicts between documentation and code:
- missing_in_docs: API exists in code but not documented
- missing_in_code: API documented but doesn't exist in code
- signature_mismatch: Different parameters/types between docs and code
- description_mismatch: Docs say one thing, code comments say another
Used by unified scraper to identify discrepancies before merging.
"""
import json
import logging
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, asdict
from difflib import SequenceMatcher
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class Conflict:
"""Represents a conflict between documentation and code."""
type: str # 'missing_in_docs', 'missing_in_code', 'signature_mismatch', 'description_mismatch'
severity: str # 'low', 'medium', 'high'
api_name: str
docs_info: Optional[Dict[str, Any]] = None
code_info: Optional[Dict[str, Any]] = None
difference: Optional[str] = None
suggestion: Optional[str] = None
class ConflictDetector:
"""
Detects conflicts between documentation and code sources.
"""
def __init__(self, docs_data: Dict[str, Any], github_data: Dict[str, Any]):
"""
Initialize conflict detector.
Args:
docs_data: Data from documentation scraper
github_data: Data from GitHub scraper with code analysis
"""
self.docs_data = docs_data
self.github_data = github_data
# Extract API information from both sources
self.docs_apis = self._extract_docs_apis()
self.code_apis = self._extract_code_apis()
logger.info(f"Loaded {len(self.docs_apis)} APIs from documentation")
logger.info(f"Loaded {len(self.code_apis)} APIs from code")
def _extract_docs_apis(self) -> Dict[str, Dict[str, Any]]:
"""
Extract API information from documentation data.
Returns:
Dict mapping API name to API info
"""
apis = {}
# Documentation structure varies, but typically has 'pages' or 'references'
pages = self.docs_data.get('pages', {})
# Handle both dict and list formats
if isinstance(pages, dict):
# Format: {url: page_data, ...}
for url, page_data in pages.items():
content = page_data.get('content', '')
title = page_data.get('title', '')
# Simple heuristic: if title or URL contains "api", "reference", "class", "function"
# it might be an API page
if any(keyword in title.lower() or keyword in url.lower()
for keyword in ['api', 'reference', 'class', 'function', 'method']):
# Extract API signatures from content (simplified)
extracted_apis = self._parse_doc_content_for_apis(content, url)
apis.update(extracted_apis)
elif isinstance(pages, list):
# Format: [{url: '...', apis: [...]}, ...]
for page in pages:
url = page.get('url', '')
page_apis = page.get('apis', [])
# If APIs are already extracted in the page data
for api in page_apis:
api_name = api.get('name', '')
if api_name:
apis[api_name] = {
'parameters': api.get('parameters', []),
'return_type': api.get('return_type', 'Any'),
'source_url': url
}
return apis
def _parse_doc_content_for_apis(self, content: str, source_url: str) -> Dict[str, Dict]:
"""
Parse documentation content to extract API signatures.
This is a simplified approach - real implementation would need
to understand the documentation format (Sphinx, JSDoc, etc.)
"""
apis = {}
# Look for function/method signatures in code blocks
# Common patterns:
# - function_name(param1, param2)
# - ClassName.method_name(param1, param2)
# - def function_name(param1: type, param2: type) -> return_type
import re
# Pattern for common API signatures
patterns = [
# Python style: def name(params) -> return
r'def\s+(\w+)\s*\(([^)]*)\)(?:\s*->\s*(\w+))?',
# JavaScript style: function name(params)
r'function\s+(\w+)\s*\(([^)]*)\)',
# C++ style: return_type name(params)
r'(\w+)\s+(\w+)\s*\(([^)]*)\)',
# Method style: ClassName.method_name(params)
r'(\w+)\.(\w+)\s*\(([^)]*)\)'
]
for pattern in patterns:
for match in re.finditer(pattern, content):
groups = match.groups()
# Parse based on pattern matched
if 'def' in pattern:
# Python function
name = groups[0]
params_str = groups[1]
return_type = groups[2] if len(groups) > 2 else None
elif 'function' in pattern:
# JavaScript function
name = groups[0]
params_str = groups[1]
return_type = None
elif '.' in pattern:
# Class method
class_name = groups[0]
method_name = groups[1]
name = f"{class_name}.{method_name}"
params_str = groups[2] if len(groups) > 2 else groups[1]
return_type = None
else:
# C++ function
return_type = groups[0]
name = groups[1]
params_str = groups[2]
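                # Parse parameters
                params = self._parse_param_string(params_str)

                # Patterns run from most to least specific. The generic
                # "C++ style" pattern also matches `def name(...)` and
                # `function name(...)`, so keep the first result for a name
                # instead of letting a later, looser match overwrite it
                # (e.g. with return_type 'def').
                if name in apis:
                    continue

                apis[name] = {
                    'name': name,
                    'parameters': params,
                    'return_type': return_type,
                    'source': source_url,
                    'raw_signature': match.group(0)
                }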
return apis
def _parse_param_string(self, params_str: str) -> List[Dict]:
"""Parse parameter string into list of parameter dicts."""
if not params_str.strip():
return []
params = []
for param in params_str.split(','):
param = param.strip()
if not param:
continue
# Try to extract name and type
param_info = {'name': param, 'type': None, 'default': None}
# Check for type annotation (: type)
if ':' in param:
parts = param.split(':', 1)
param_info['name'] = parts[0].strip()
type_part = parts[1].strip()
# Check for default value (= value)
if '=' in type_part:
type_str, default_str = type_part.split('=', 1)
param_info['type'] = type_str.strip()
param_info['default'] = default_str.strip()
else:
param_info['type'] = type_part
# Check for default without type (= value)
elif '=' in param:
parts = param.split('=', 1)
param_info['name'] = parts[0].strip()
param_info['default'] = parts[1].strip()
params.append(param_info)
return params
def _extract_code_apis(self) -> Dict[str, Dict[str, Any]]:
"""
Extract API information from GitHub code analysis.
Returns:
Dict mapping API name to API info
"""
apis = {}
code_analysis = self.github_data.get('code_analysis', {})
if not code_analysis:
return apis
# Support both 'files' and 'analyzed_files' keys
files = code_analysis.get('files', code_analysis.get('analyzed_files', []))
for file_info in files:
file_path = file_info.get('file', 'unknown')
# Extract classes and their methods
for class_info in file_info.get('classes', []):
class_name = class_info['name']
# Add class itself
apis[class_name] = {
'name': class_name,
'type': 'class',
'source': file_path,
'line': class_info.get('line_number'),
'base_classes': class_info.get('base_classes', []),
'docstring': class_info.get('docstring')
}
# Add methods
for method in class_info.get('methods', []):
method_name = f"{class_name}.{method['name']}"
apis[method_name] = {
'name': method_name,
'type': 'method',
'parameters': method.get('parameters', []),
'return_type': method.get('return_type'),
'source': file_path,
'line': method.get('line_number'),
'docstring': method.get('docstring'),
'is_async': method.get('is_async', False)
}
# Extract standalone functions
for func_info in file_info.get('functions', []):
func_name = func_info['name']
apis[func_name] = {
'name': func_name,
'type': 'function',
'parameters': func_info.get('parameters', []),
'return_type': func_info.get('return_type'),
'source': file_path,
'line': func_info.get('line_number'),
'docstring': func_info.get('docstring'),
'is_async': func_info.get('is_async', False)
}
return apis
def detect_all_conflicts(self) -> List[Conflict]:
"""
Detect all types of conflicts.
Returns:
List of Conflict objects
"""
logger.info("Detecting conflicts between documentation and code...")
conflicts = []
# 1. Find APIs missing in documentation
conflicts.extend(self._find_missing_in_docs())
# 2. Find APIs missing in code
conflicts.extend(self._find_missing_in_code())
# 3. Find signature mismatches
conflicts.extend(self._find_signature_mismatches())
logger.info(f"Found {len(conflicts)} conflicts total")
return conflicts
def _find_missing_in_docs(self) -> List[Conflict]:
"""Find APIs that exist in code but not in documentation."""
conflicts = []
for api_name, code_info in self.code_apis.items():
# Simple name matching (can be enhanced with fuzzy matching)
if api_name not in self.docs_apis:
# Check if it's a private/internal API (often not documented)
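# Check every dotted component so 'ClassName._helper' also counts as private
is_private = any(part.startswith('_') for part in api_name.split('.')) or '__' in api_name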
severity = 'low' if is_private else 'medium'
conflicts.append(Conflict(
type='missing_in_docs',
severity=severity,
api_name=api_name,
code_info=code_info,
difference=f"API exists in code ({code_info['source']}) but not found in documentation",
suggestion="Add documentation for this API" if not is_private else "Consider if this internal API should be documented"
))
logger.info(f"Found {len(conflicts)} APIs missing in documentation")
return conflicts
def _find_missing_in_code(self) -> List[Conflict]:
"""Find APIs that are documented but don't exist in code."""
conflicts = []
for api_name, docs_info in self.docs_apis.items():
if api_name not in self.code_apis:
conflicts.append(Conflict(
type='missing_in_code',
severity='high', # This is serious - documented but doesn't exist
api_name=api_name,
docs_info=docs_info,
difference=f"API documented ({docs_info.get('source', 'unknown')}) but not found in code",
suggestion="Update documentation to remove this API, or add it to codebase"
))
logger.info(f"Found {len(conflicts)} APIs missing in code")
return conflicts
def _find_signature_mismatches(self) -> List[Conflict]:
"""Find APIs where signature differs between docs and code."""
conflicts = []
# Find APIs that exist in both
common_apis = set(self.docs_apis.keys()) & set(self.code_apis.keys())
for api_name in common_apis:
docs_info = self.docs_apis[api_name]
code_info = self.code_apis[api_name]
# Compare signatures
mismatch = self._compare_signatures(docs_info, code_info)
if mismatch:
conflicts.append(Conflict(
type='signature_mismatch',
severity=mismatch['severity'],
api_name=api_name,
docs_info=docs_info,
code_info=code_info,
difference=mismatch['difference'],
suggestion=mismatch['suggestion']
))
logger.info(f"Found {len(conflicts)} signature mismatches")
return conflicts
def _compare_signatures(self, docs_info: Dict, code_info: Dict) -> Optional[Dict]:
"""
Compare signatures between docs and code.
Returns:
Dict with mismatch details if conflict found, None otherwise
"""
docs_params = docs_info.get('parameters', [])
code_params = code_info.get('parameters', [])
# Compare parameter counts
if len(docs_params) != len(code_params):
return {
'severity': 'medium',
'difference': f"Parameter count mismatch: docs has {len(docs_params)}, code has {len(code_params)}",
'suggestion': f"Update documentation to match the code signature ({len(code_params)} parameters, documented as {len(docs_params)})"
}
# Compare parameter names and types
for i, (doc_param, code_param) in enumerate(zip(docs_params, code_params)):
doc_name = doc_param.get('name', '')
code_name = code_param.get('name', '')
# Parameter name mismatch
if doc_name != code_name:
# Use fuzzy matching for slight variations
similarity = SequenceMatcher(None, doc_name, code_name).ratio()
if similarity < 0.8: # Not similar enough
return {
'severity': 'medium',
'difference': f"Parameter {i+1} name mismatch: '{doc_name}' in docs vs '{code_name}' in code",
'suggestion': f"Update documentation to use parameter name '{code_name}'"
}
# Type mismatch
doc_type = doc_param.get('type')
code_type = code_param.get('type_hint')
if doc_type and code_type and doc_type != code_type:
return {
'severity': 'low',
'difference': f"Parameter '{doc_name}' type mismatch: '{doc_type}' in docs vs '{code_type}' in code",
'suggestion': f"Verify correct type for parameter '{doc_name}'"
}
# Compare return types if both have them
docs_return = docs_info.get('return_type')
code_return = code_info.get('return_type')
if docs_return and code_return and docs_return != code_return:
return {
'severity': 'low',
'difference': f"Return type mismatch: '{docs_return}' in docs vs '{code_return}' in code",
'suggestion': "Verify correct return type"
}
return None
def generate_summary(self, conflicts: List[Conflict]) -> Dict[str, Any]:
"""
Generate summary statistics for conflicts.
Args:
conflicts: List of Conflict objects
Returns:
Summary dict with statistics
"""
summary = {
'total': len(conflicts),
'by_type': {},
'by_severity': {},
'apis_affected': len(set(c.api_name for c in conflicts))
}
# Count by type
for conflict_type in ['missing_in_docs', 'missing_in_code', 'signature_mismatch', 'description_mismatch']:
count = sum(1 for c in conflicts if c.type == conflict_type)
summary['by_type'][conflict_type] = count
# Count by severity
for severity in ['low', 'medium', 'high']:
count = sum(1 for c in conflicts if c.severity == severity)
summary['by_severity'][severity] = count
return summary
def save_conflicts(self, conflicts: List[Conflict], output_path: str):
"""
Save conflicts to JSON file.
Args:
conflicts: List of Conflict objects
output_path: Path to output JSON file
"""
data = {
'conflicts': [asdict(c) for c in conflicts],
'summary': self.generate_summary(conflicts)
}
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(data, f, indent=2, ensure_ascii=False)
logger.info(f"Conflicts saved to: {output_path}")
if __name__ == '__main__':
import sys
if len(sys.argv) < 3:
print("Usage: python conflict_detector.py <docs_data.json> <github_data.json>")
sys.exit(1)
docs_file = sys.argv[1]
github_file = sys.argv[2]
# Load data
with open(docs_file, 'r', encoding='utf-8') as f:
docs_data = json.load(f)
with open(github_file, 'r', encoding='utf-8') as f:
github_data = json.load(f)
# Detect conflicts
detector = ConflictDetector(docs_data, github_data)
conflicts = detector.detect_all_conflicts()
# Print summary
summary = detector.generate_summary(conflicts)
print("\n📊 Conflict Summary:")
print(f" Total conflicts: {summary['total']}")
print(f" APIs affected: {summary['apis_affected']}")
print("\n By Type:")
for conflict_type, count in summary['by_type'].items():
if count > 0:
print(f" {conflict_type}: {count}")
print("\n By Severity:")
for severity, count in summary['by_severity'].items():
if count > 0:
emoji = '🔴' if severity == 'high' else '🟡' if severity == 'medium' else '🟢'
print(f" {emoji} {severity}: {count}")
# Save to file
output_file = 'conflicts.json'
detector.save_conflicts(conflicts, output_file)
print(f"\n✅ Full report saved to: {output_file}")
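
For reference, the parameter parsing in `_parse_param_string` can be exercised on its own. The sketch below is a standalone copy of that logic under the hypothetical name `parse_param_string`; the free function and the sample signature are assumptions for illustration and are not part of this commit.

```python
from typing import Dict, List, Optional


def parse_param_string(params_str: str) -> List[Dict[str, Optional[str]]]:
    """Standalone sketch mirroring ConflictDetector._parse_param_string."""
    params = []
    for raw in params_str.split(','):
        raw = raw.strip()
        if not raw:
            continue
        info = {'name': raw, 'type': None, 'default': None}
        if ':' in raw:
            # "name: type" or "name: type = default"
            name, type_part = raw.split(':', 1)
            info['name'] = name.strip()
            type_part = type_part.strip()
            if '=' in type_part:
                type_str, default = type_part.split('=', 1)
                info['type'] = type_str.strip()
                info['default'] = default.strip()
            else:
                info['type'] = type_part
        elif '=' in raw:
            # "name=default" without a type annotation
            name, default = raw.split('=', 1)
            info['name'] = name.strip()
            info['default'] = default.strip()
        params.append(info)
    return params


if __name__ == '__main__':
    for param in parse_param_string("url: str, timeout: int = 30, retries=3"):
        print(param)
```

Because the string is split on commas first, an annotation such as `Dict[str, int]` is broken into two entries. Signatures with nested generics would need a bracket-aware splitter.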


@@ -0,0 +1,72 @@
"""Configuration constants for Skill Seekers CLI.
This module centralizes all magic numbers and configuration values used
across the CLI tools to improve maintainability and clarity.
"""
# ===== SCRAPING CONFIGURATION =====
# Default scraping limits
DEFAULT_RATE_LIMIT = 0.5 # seconds between requests
DEFAULT_MAX_PAGES = 500 # maximum pages to scrape
DEFAULT_CHECKPOINT_INTERVAL = 1000 # pages between checkpoints
DEFAULT_ASYNC_MODE = False # use async mode for parallel scraping (opt-in)
# Content analysis limits
CONTENT_PREVIEW_LENGTH = 500 # characters to check for categorization
MAX_PAGES_WARNING_THRESHOLD = 10000 # warn if config exceeds this
# Quality thresholds
MIN_CATEGORIZATION_SCORE = 2 # minimum score for category assignment
URL_MATCH_POINTS = 3 # points for URL keyword match
TITLE_MATCH_POINTS = 2 # points for title keyword match
CONTENT_MATCH_POINTS = 1 # points for content keyword match
# ===== ENHANCEMENT CONFIGURATION =====
# API-based enhancement limits (uses Anthropic API)
API_CONTENT_LIMIT = 100000 # max characters for API enhancement
API_PREVIEW_LIMIT = 40000 # max characters for preview
# Local enhancement limits (uses Claude Code Max)
LOCAL_CONTENT_LIMIT = 50000 # max characters for local enhancement
LOCAL_PREVIEW_LIMIT = 20000 # max characters for preview
# ===== PAGE ESTIMATION =====
# Estimation and discovery settings
DEFAULT_MAX_DISCOVERY = 1000 # default max pages to discover
DISCOVERY_THRESHOLD = 10000 # threshold for warnings
# ===== FILE LIMITS =====
# Output and processing limits
MAX_REFERENCE_FILES = 100 # maximum reference files per skill
MAX_CODE_BLOCKS_PER_PAGE = 5 # maximum code blocks to extract per page
# ===== EXPORT CONSTANTS =====
__all__ = [
# Scraping
'DEFAULT_RATE_LIMIT',
'DEFAULT_MAX_PAGES',
'DEFAULT_CHECKPOINT_INTERVAL',
'DEFAULT_ASYNC_MODE',
'CONTENT_PREVIEW_LENGTH',
'MAX_PAGES_WARNING_THRESHOLD',
'MIN_CATEGORIZATION_SCORE',
'URL_MATCH_POINTS',
'TITLE_MATCH_POINTS',
'CONTENT_MATCH_POINTS',
# Enhancement
'API_CONTENT_LIMIT',
'API_PREVIEW_LIMIT',
'LOCAL_CONTENT_LIMIT',
'LOCAL_PREVIEW_LIMIT',
# Estimation
'DEFAULT_MAX_DISCOVERY',
'DISCOVERY_THRESHOLD',
# Limits
'MAX_REFERENCE_FILES',
'MAX_CODE_BLOCKS_PER_PAGE',
]
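
How the scraper combines the categorization constants is not shown in this part of the diff. The following is a minimal sketch of one plausible scoring scheme, assuming points are summed per keyword and compared against `MIN_CATEGORIZATION_SCORE`. The functions `score_page` and `categorize` and the sample URLs are hypothetical.

```python
from typing import Dict, List, Optional

# Values copied from constants.py in this commit.
CONTENT_PREVIEW_LENGTH = 500
MIN_CATEGORIZATION_SCORE = 2
URL_MATCH_POINTS = 3
TITLE_MATCH_POINTS = 2
CONTENT_MATCH_POINTS = 1


def score_page(url: str, title: str, content: str, keywords: List[str]) -> int:
    """Hypothetical scorer: add points for each keyword found in URL, title, preview."""
    url_l = url.lower()
    title_l = title.lower()
    # Only the first CONTENT_PREVIEW_LENGTH characters of the body are inspected
    preview = content[:CONTENT_PREVIEW_LENGTH].lower()
    score = 0
    for keyword in keywords:
        kw = keyword.lower()
        if kw in url_l:
            score += URL_MATCH_POINTS
        if kw in title_l:
            score += TITLE_MATCH_POINTS
        if kw in preview:
            score += CONTENT_MATCH_POINTS
    return score


def categorize(url: str, title: str, content: str,
               categories: Dict[str, List[str]]) -> Optional[str]:
    """Return the best-scoring category, or None if no score reaches the minimum."""
    best_name, best_score = None, 0
    for name, keywords in categories.items():
        score = score_page(url, title, content, keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= MIN_CATEGORIZATION_SCORE else None
```

Under this assumed scheme a single content-only match (1 point) is not enough to assign a category, while a URL or title match alone is.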

File diff suppressed because it is too large


@@ -0,0 +1,273 @@
#!/usr/bin/env python3
"""
SKILL.md Enhancement Script
Uses Claude API to improve SKILL.md by analyzing reference documentation.
Usage:
python3 cli/enhance_skill.py output/steam-inventory/
python3 cli/enhance_skill.py output/react/
python3 cli/enhance_skill.py output/godot/ --api-key YOUR_API_KEY
"""
import os
import sys
import json
import argparse
from pathlib import Path
# Add the src/ directory to the path so skill_seekers imports work when run as a script
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from skill_seekers.cli.constants import API_CONTENT_LIMIT, API_PREVIEW_LIMIT
from skill_seekers.cli.utils import read_reference_files
try:
import anthropic
except ImportError:
print("❌ Error: anthropic package not installed")
print("Install with: pip3 install anthropic")
sys.exit(1)
class SkillEnhancer:
def __init__(self, skill_dir, api_key=None):
self.skill_dir = Path(skill_dir)
self.references_dir = self.skill_dir / "references"
self.skill_md_path = self.skill_dir / "SKILL.md"
# Get API key
self.api_key = api_key or os.environ.get('ANTHROPIC_API_KEY')
if not self.api_key:
raise ValueError(
"No API key provided. Set ANTHROPIC_API_KEY environment variable "
"or use --api-key argument"
)
self.client = anthropic.Anthropic(api_key=self.api_key)
def read_current_skill_md(self):
"""Read existing SKILL.md"""
if not self.skill_md_path.exists():
return None
return self.skill_md_path.read_text(encoding='utf-8')
def enhance_skill_md(self, references, current_skill_md):
"""Use Claude to enhance SKILL.md"""
# Build prompt
prompt = self._build_enhancement_prompt(references, current_skill_md)
print("\n🤖 Asking Claude to enhance SKILL.md...")
print(f" Input: {len(prompt):,} characters")
try:
message = self.client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
temperature=0.3,
messages=[{
"role": "user",
"content": prompt
}]
)
enhanced_content = message.content[0].text
return enhanced_content
except Exception as e:
print(f"❌ Error calling Claude API: {e}")
return None
def _build_enhancement_prompt(self, references, current_skill_md):
"""Build the prompt for Claude"""
# Extract skill name and description
skill_name = self.skill_dir.name
prompt = f"""You are enhancing a Claude skill's SKILL.md file. This skill is about: {skill_name}
I've scraped documentation and organized it into reference files. Your job is to create an EXCELLENT SKILL.md that will help Claude use this documentation effectively.
CURRENT SKILL.MD:
{'```markdown' if current_skill_md else '(none - create from scratch)'}
{current_skill_md or 'No existing SKILL.md'}
{'```' if current_skill_md else ''}
REFERENCE DOCUMENTATION:
"""
for filename, content in references.items():
prompt += f"\n\n## {filename}\n```markdown\n{content[:30000]}\n```\n"
prompt += """
YOUR TASK:
Create an enhanced SKILL.md that includes:
1. **Clear "When to Use This Skill" section** - Be specific about trigger conditions
2. **Excellent Quick Reference section** - Extract 5-10 of the BEST, most practical code examples from the reference docs
- Choose SHORT, clear examples that demonstrate common tasks
- Include both simple and intermediate examples
- Annotate examples with clear descriptions
- Use proper language tags (cpp, python, javascript, json, etc.)
3. **Detailed Reference Files description** - Explain what's in each reference file
4. **Practical "Working with This Skill" section** - Give users clear guidance on how to navigate the skill
5. **Key Concepts section** (if applicable) - Explain core concepts
6. **Keep the frontmatter** (---\\nname: ...\\n---) intact
IMPORTANT:
- Extract REAL examples from the reference docs, don't make them up
- Prioritize SHORT, clear examples (5-20 lines max)
- Make it actionable and practical
- Don't be too verbose - be concise but useful
- Maintain the markdown structure for Claude skills
- Keep code examples properly formatted with language tags
OUTPUT:
Return ONLY the complete SKILL.md content, starting with the frontmatter (---).
"""
return prompt
def save_enhanced_skill_md(self, content):
"""Save the enhanced SKILL.md"""
# Backup original
if self.skill_md_path.exists():
backup_path = self.skill_md_path.with_suffix('.md.backup')
self.skill_md_path.rename(backup_path)
print(f" 💾 Backed up original to: {backup_path.name}")
# Save enhanced version
self.skill_md_path.write_text(content, encoding='utf-8')
print(f" ✅ Saved enhanced SKILL.md")
def run(self):
"""Main enhancement workflow"""
print(f"\n{'='*60}")
print(f"ENHANCING SKILL: {self.skill_dir.name}")
print(f"{'='*60}\n")
# Read reference files
print("📖 Reading reference documentation...")
references = read_reference_files(
self.skill_dir,
max_chars=API_CONTENT_LIMIT,
preview_limit=API_PREVIEW_LIMIT
)
if not references:
print("❌ No reference files found to analyze")
return False
print(f" ✓ Read {len(references)} reference files")
total_size = sum(len(c) for c in references.values())
print(f" ✓ Total size: {total_size:,} characters\n")
# Read current SKILL.md
current_skill_md = self.read_current_skill_md()
if current_skill_md:
print(f" Found existing SKILL.md ({len(current_skill_md)} chars)")
else:
print(f" No existing SKILL.md, will create new one")
# Enhance with Claude
enhanced = self.enhance_skill_md(references, current_skill_md)
if not enhanced:
print("❌ Enhancement failed")
return False
print(f" ✓ Generated enhanced SKILL.md ({len(enhanced)} chars)\n")
# Save
print("💾 Saving enhanced SKILL.md...")
self.save_enhanced_skill_md(enhanced)
print(f"\n✅ Enhancement complete!")
print(f"\nNext steps:")
print(f" 1. Review: {self.skill_md_path}")
print(f" 2. If you don't like it, restore backup: {self.skill_md_path.with_suffix('.md.backup')}")
print(f" 3. Package your skill:")
print(f" python3 cli/package_skill.py {self.skill_dir}/")
return True
def main():
parser = argparse.ArgumentParser(
description='Enhance SKILL.md using Claude API',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Using ANTHROPIC_API_KEY environment variable
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/enhance_skill.py output/steam-inventory/
# Providing API key directly
python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...
# Show what would be done (dry run)
python3 cli/enhance_skill.py output/godot/ --dry-run
"""
)
parser.add_argument('skill_dir', type=str,
help='Path to skill directory (e.g., output/steam-inventory/)')
parser.add_argument('--api-key', type=str,
help='Anthropic API key (or set ANTHROPIC_API_KEY env var)')
parser.add_argument('--dry-run', action='store_true',
help='Show what would be done without calling API')
args = parser.parse_args()
# Validate skill directory
skill_dir = Path(args.skill_dir)
if not skill_dir.exists():
print(f"❌ Error: Directory not found: {skill_dir}")
sys.exit(1)
if not skill_dir.is_dir():
print(f"❌ Error: Not a directory: {skill_dir}")
sys.exit(1)
# Dry run mode
if args.dry_run:
print(f"🔍 DRY RUN MODE")
print(f" Would enhance: {skill_dir}")
print(f" References: {skill_dir / 'references'}")
print(f" SKILL.md: {skill_dir / 'SKILL.md'}")
refs_dir = skill_dir / "references"
if refs_dir.exists():
ref_files = list(refs_dir.glob("*.md"))
print(f" Found {len(ref_files)} reference files:")
for rf in ref_files:
size = rf.stat().st_size
print(f" - {rf.name} ({size:,} bytes)")
print("\nTo actually run enhancement:")
print(f" python3 cli/enhance_skill.py {skill_dir}")
return
# Create enhancer and run
try:
enhancer = SkillEnhancer(skill_dir, api_key=args.api_key)
success = enhancer.run()
sys.exit(0 if success else 1)
except ValueError as e:
print(f"❌ Error: {e}")
print("\nSet your API key:")
print(" export ANTHROPIC_API_KEY=sk-ant-...")
print("Or provide it directly:")
print(f" python3 cli/enhance_skill.py {skill_dir} --api-key sk-ant-...")
sys.exit(1)
except Exception as e:
print(f"❌ Unexpected error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()
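
The backup-then-write step in `save_enhanced_skill_md` can be tried in isolation. This sketch assumes a hypothetical helper `save_with_backup` and uses `Path.replace` instead of `Path.rename`, so an existing backup is overwritten on every platform. That is a deliberate difference from the code in this commit, not a description of it.

```python
import tempfile
from pathlib import Path
from typing import Optional


def save_with_backup(target: Path, content: str) -> Optional[Path]:
    """Move an existing file to '<name>.backup', then write the new content."""
    backup_path = None
    if target.exists():
        # 'SKILL.md' -> 'SKILL.md.backup'
        backup_path = target.with_suffix(target.suffix + '.backup')
        target.replace(backup_path)
    target.write_text(content, encoding='utf-8')
    return backup_path


def demo() -> tuple:
    """Write twice into a temporary directory and report what ended up where."""
    with tempfile.TemporaryDirectory() as tmp:
        skill_md = Path(tmp) / 'SKILL.md'
        first = save_with_backup(skill_md, 'v1')   # nothing to back up yet
        second = save_with_backup(skill_md, 'v2')  # 'v1' moves to the backup
        return (first, second.name,
                second.read_text(encoding='utf-8'),
                skill_md.read_text(encoding='utf-8'))
```

Only one backup generation is kept, so running the enhancement twice discards the original file unless it is copied elsewhere first.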


@@ -0,0 +1,303 @@
#!/usr/bin/env python3
"""
SKILL.md Enhancement Script (Local - Using Claude Code)
Opens a new terminal with Claude Code to enhance SKILL.md, then reports back.
No API key needed - uses your existing Claude Code Max plan!
Usage:
python3 cli/enhance_skill_local.py output/steam-inventory/
python3 cli/enhance_skill_local.py output/react/
Terminal Selection:
The script automatically detects which terminal app to use:
1. SKILL_SEEKER_TERMINAL env var (highest priority)
Example: export SKILL_SEEKER_TERMINAL="Ghostty"
2. TERM_PROGRAM env var (current terminal)
3. Terminal.app (fallback)
Supported terminals: Ghostty, iTerm, Terminal, WezTerm
"""
import os
import sys
import time
import subprocess
import tempfile
from pathlib import Path
# Add the src/ directory to the path so skill_seekers imports work when run as a script
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from skill_seekers.cli.constants import LOCAL_CONTENT_LIMIT, LOCAL_PREVIEW_LIMIT
from skill_seekers.cli.utils import read_reference_files
def detect_terminal_app():
"""Detect which terminal app to use with cascading priority.
Priority order:
1. SKILL_SEEKER_TERMINAL environment variable (explicit user preference)
2. TERM_PROGRAM environment variable (inherit current terminal)
3. Terminal.app (fallback default)
Returns:
tuple: (terminal_app_name, detection_method)
- terminal_app_name (str): Name of terminal app to launch (e.g., "Ghostty", "Terminal")
- detection_method (str): How the terminal was detected (for logging)
Examples:
>>> os.environ['SKILL_SEEKER_TERMINAL'] = 'Ghostty'
>>> detect_terminal_app()
('Ghostty', 'SKILL_SEEKER_TERMINAL')
>>> del os.environ['SKILL_SEEKER_TERMINAL']
>>> os.environ['TERM_PROGRAM'] = 'iTerm.app'
>>> detect_terminal_app()
('iTerm', 'TERM_PROGRAM')
"""
# Map TERM_PROGRAM values to macOS app names
TERMINAL_MAP = {
'Apple_Terminal': 'Terminal',
'iTerm.app': 'iTerm',
'ghostty': 'Ghostty',
'WezTerm': 'WezTerm',
}
# Priority 1: Check SKILL_SEEKER_TERMINAL env var (explicit preference)
preferred_terminal = os.environ.get('SKILL_SEEKER_TERMINAL', '').strip()
if preferred_terminal:
return preferred_terminal, 'SKILL_SEEKER_TERMINAL'
# Priority 2: Check TERM_PROGRAM (inherit current terminal)
term_program = os.environ.get('TERM_PROGRAM', '').strip()
if term_program and term_program in TERMINAL_MAP:
return TERMINAL_MAP[term_program], 'TERM_PROGRAM'
# Priority 3: Fallback to Terminal.app
if term_program:
# TERM_PROGRAM is set but unknown
return 'Terminal', f'unknown TERM_PROGRAM ({term_program})'
else:
# No TERM_PROGRAM set
return 'Terminal', 'default'
class LocalSkillEnhancer:
def __init__(self, skill_dir):
self.skill_dir = Path(skill_dir)
self.references_dir = self.skill_dir / "references"
self.skill_md_path = self.skill_dir / "SKILL.md"
def create_enhancement_prompt(self):
"""Create the prompt file for Claude Code"""
# Read reference files
references = read_reference_files(
self.skill_dir,
max_chars=LOCAL_CONTENT_LIMIT,
preview_limit=LOCAL_PREVIEW_LIMIT
)
if not references:
print("❌ No reference files found")
return None
# Read current SKILL.md
current_skill_md = ""
if self.skill_md_path.exists():
current_skill_md = self.skill_md_path.read_text(encoding='utf-8')
# Build prompt
prompt = f"""I need you to enhance the SKILL.md file for the {self.skill_dir.name} skill.
CURRENT SKILL.MD:
{'-'*60}
{current_skill_md if current_skill_md else '(No existing SKILL.md - create from scratch)'}
{'-'*60}
REFERENCE DOCUMENTATION:
{'-'*60}
"""
for filename, content in references.items():
prompt += f"\n## {filename}\n{content[:15000]}\n"
prompt += f"""
{'-'*60}
YOUR TASK:
Create an EXCELLENT SKILL.md file that will help Claude use this documentation effectively.
Requirements:
1. **Clear "When to Use This Skill" section**
- Be SPECIFIC about trigger conditions
- List concrete use cases
2. **Excellent Quick Reference section**
- Extract 5-10 of the BEST, most practical code examples from the reference docs
- Choose SHORT, clear examples (5-20 lines max)
- Include both simple and intermediate examples
- Use proper language tags (cpp, python, javascript, json, etc.)
- Add clear descriptions for each example
3. **Detailed Reference Files description**
- Explain what's in each reference file
- Help users navigate the documentation
4. **Practical "Working with This Skill" section**
- Clear guidance for beginners, intermediate, and advanced users
- Navigation tips
5. **Key Concepts section** (if applicable)
- Explain core concepts
- Define important terminology
IMPORTANT:
- Extract REAL examples from the reference docs above
- Prioritize SHORT, clear examples
- Make it actionable and practical
- Keep the frontmatter (---\\nname: ...\\n---) intact
- Use proper markdown formatting
SAVE THE RESULT:
Save the complete enhanced SKILL.md to: {self.skill_md_path.absolute()}
First, backup the original to: {self.skill_md_path.with_suffix('.md.backup').absolute()}
"""
return prompt
def run(self):
"""Main enhancement workflow"""
print(f"\n{'='*60}")
print(f"LOCAL ENHANCEMENT: {self.skill_dir.name}")
print(f"{'='*60}\n")
# Validate
if not self.skill_dir.exists():
print(f"❌ Directory not found: {self.skill_dir}")
return False
# Read reference files
print("📖 Reading reference documentation...")
references = read_reference_files(
self.skill_dir,
max_chars=LOCAL_CONTENT_LIMIT,
preview_limit=LOCAL_PREVIEW_LIMIT
)
if not references:
print("❌ No reference files found to analyze")
return False
print(f" ✓ Read {len(references)} reference files")
total_size = sum(len(c) for c in references.values())
print(f" ✓ Total size: {total_size:,} characters\n")
# Create prompt
print("📝 Creating enhancement prompt...")
prompt = self.create_enhancement_prompt()
if not prompt:
return False
# Save prompt to temp file
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f:
prompt_file = f.name
f.write(prompt)
print(f" ✓ Prompt saved ({len(prompt):,} characters)\n")
# Launch Claude Code in new terminal
print("🚀 Launching Claude Code in new terminal...")
print(" This will:")
print(" 1. Open a new terminal window")
print(" 2. Run Claude Code with the enhancement task")
print(" 3. Claude will read the docs and enhance SKILL.md")
print(" 4. Terminal will wait for a key press when done")
print()
# Create a shell script to run in the terminal
shell_script = f'''#!/bin/bash
claude "{prompt_file}"
echo ""
echo "✅ Enhancement complete!"
echo "Press any key to close..."
read -n 1
rm "{prompt_file}"
'''
# Save shell script
with tempfile.NamedTemporaryFile(mode='w', suffix='.sh', delete=False) as f:
script_file = f.name
f.write(shell_script)
os.chmod(script_file, 0o755)
# Launch in new terminal (macOS specific)
if sys.platform == 'darwin':
# Detect which terminal app to use
terminal_app, detection_method = detect_terminal_app()
# Show detection info
if detection_method == 'SKILL_SEEKER_TERMINAL':
print(f" Using terminal: {terminal_app} (from SKILL_SEEKER_TERMINAL)")
elif detection_method == 'TERM_PROGRAM':
print(f" Using terminal: {terminal_app} (inherited from current terminal)")
elif detection_method.startswith('unknown TERM_PROGRAM'):
print(f"⚠️ {detection_method}")
print(f" → Using Terminal.app as fallback")
else:
print(f" Using terminal: {terminal_app} (default)")
try:
subprocess.Popen(['open', '-a', terminal_app, script_file])
except Exception as e:
print(f"⚠️ Error launching {terminal_app}: {e}")
print(f"\nManually run: {script_file}")
return False
else:
print("⚠️ Auto-launch only works on macOS")
print(f"\nManually run this command in a new terminal:")
print(f" claude '{prompt_file}'")
print(f"\nThen delete the prompt file:")
print(f" rm '{prompt_file}'")
return False
print("✅ New terminal launched with Claude Code!")
print()
print("📊 Status:")
print(f" - Prompt file: {prompt_file}")
print(f" - Skill directory: {self.skill_dir.absolute()}")
print(f" - SKILL.md will be saved to: {self.skill_md_path.absolute()}")
print(f" - Original backed up to: {self.skill_md_path.with_suffix('.md.backup').absolute()}")
print()
print("⏳ Wait for Claude Code to finish in the other terminal...")
print(" (Usually takes 30-60 seconds)")
print()
print("💡 When done:")
print(f" 1. Check the enhanced SKILL.md: {self.skill_md_path}")
print(f" 2. If you don't like it, restore: mv {self.skill_md_path.with_suffix('.md.backup')} {self.skill_md_path}")
print(f" 3. Package: python3 cli/package_skill.py {self.skill_dir}/")
return True
def main():
if len(sys.argv) < 2:
print("Usage: python3 cli/enhance_skill_local.py <skill_directory>")
print()
print("Examples:")
print(" python3 cli/enhance_skill_local.py output/steam-inventory/")
print(" python3 cli/enhance_skill_local.py output/react/")
sys.exit(1)
skill_dir = sys.argv[1]
enhancer = LocalSkillEnhancer(skill_dir)
success = enhancer.run()
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()
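
The cascading priority in `detect_terminal_app` is easier to check when the environment is passed in explicitly. The sketch below is a variant under that assumption: the `env` parameter is an addition for illustration, while the mapping and return values follow the function in this file.

```python
from typing import Mapping, Tuple

# Map TERM_PROGRAM values to macOS app names (copied from the function above).
TERMINAL_MAP = {
    'Apple_Terminal': 'Terminal',
    'iTerm.app': 'iTerm',
    'ghostty': 'Ghostty',
    'WezTerm': 'WezTerm',
}


def detect_terminal_app(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (terminal_app_name, detection_method) for the given environment."""
    # Priority 1: explicit user preference
    preferred = env.get('SKILL_SEEKER_TERMINAL', '').strip()
    if preferred:
        return preferred, 'SKILL_SEEKER_TERMINAL'
    # Priority 2: inherit the current terminal if it is a known one
    term_program = env.get('TERM_PROGRAM', '').strip()
    if term_program in TERMINAL_MAP:
        return TERMINAL_MAP[term_program], 'TERM_PROGRAM'
    # Priority 3: fall back to Terminal.app
    if term_program:
        return 'Terminal', f'unknown TERM_PROGRAM ({term_program})'
    return 'Terminal', 'default'
```

Calling it with `os.environ` would reproduce the behaviour of the original function, and passing a plain dict keeps tests independent of the machine they run on.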


@@ -0,0 +1,288 @@
#!/usr/bin/env python3
"""
Page Count Estimator for Skill Seeker
Quickly estimates how many pages a config will scrape without downloading content
"""
import sys
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
import json
# Add the src/ directory to the path so skill_seekers imports work when run as a script
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from skill_seekers.cli.constants import (
DEFAULT_RATE_LIMIT,
DEFAULT_MAX_DISCOVERY,
DISCOVERY_THRESHOLD
)
def estimate_pages(config, max_discovery=DEFAULT_MAX_DISCOVERY, timeout=30):
"""
Estimate total pages that will be scraped
Args:
config: Configuration dictionary
max_discovery: Maximum pages to discover (safety limit, use -1 for unlimited)
timeout: Timeout for HTTP requests in seconds
Returns:
dict with estimation results
"""
base_url = config['base_url']
start_urls = config.get('start_urls', [base_url])
url_patterns = config.get('url_patterns', {'include': [], 'exclude': []})
rate_limit = config.get('rate_limit', DEFAULT_RATE_LIMIT)
visited = set()
pending = list(start_urls)
discovered = 0
include_patterns = url_patterns.get('include', [])
exclude_patterns = url_patterns.get('exclude', [])
# Handle unlimited mode
unlimited = (max_discovery == -1 or max_discovery is None)
print(f"🔍 Estimating pages for: {config['name']}")
print(f"📍 Base URL: {base_url}")
print(f"🎯 Start URLs: {len(start_urls)}")
print(f"⏱️ Rate limit: {rate_limit}s")
if unlimited:
print(f"🔢 Max discovery: UNLIMITED (will discover all pages)")
print(f"⚠️ WARNING: This may take a long time!")
else:
print(f"🔢 Max discovery: {max_discovery}")
print()
start_time = time.time()
# Loop condition: stop if no more URLs, or if limit reached (when not unlimited)
while pending and (unlimited or discovered < max_discovery):
url = pending.pop(0)
# Skip if already visited
if url in visited:
continue
visited.add(url)
discovered += 1
# Progress indicator
if discovered % 10 == 0:
elapsed = time.time() - start_time
rate = discovered / elapsed if elapsed > 0 else 0
print(f"⏳ Discovered: {discovered} pages ({rate:.1f} pages/sec)", end='\r')
try:
# HEAD request first to check if page exists (faster)
head_response = requests.head(url, timeout=timeout, allow_redirects=True)
# Skip non-HTML content
content_type = head_response.headers.get('Content-Type', '')
if 'text/html' not in content_type:
continue
# Now GET the page to find links
response = requests.get(url, timeout=timeout)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Find all links
for link in soup.find_all('a', href=True):
href = link['href']
full_url = urljoin(url, href)
# Normalize URL
parsed = urlparse(full_url)
full_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
# Check if URL is valid
if not is_valid_url(full_url, base_url, include_patterns, exclude_patterns):
continue
# Add to pending if not visited
if full_url not in visited and full_url not in pending:
pending.append(full_url)
# Rate limiting
time.sleep(rate_limit)
except Exception:
# Silently skip request and parsing errors during estimation
pass
elapsed = time.time() - start_time
# Results
results = {
'discovered': discovered,
'pending': len(pending),
'estimated_total': discovered + len(pending),
'elapsed_seconds': round(elapsed, 2),
'discovery_rate': round(discovered / elapsed if elapsed > 0 else 0, 2),
'hit_limit': (not unlimited) and (discovered >= max_discovery),
'unlimited': unlimited
}
return results
def is_valid_url(url, base_url, include_patterns, exclude_patterns):
"""Check if URL should be crawled"""
# Must start with the base URL (same site and path prefix)
if not url.startswith(base_url.rstrip('/')):
return False
# Check exclude patterns first
if exclude_patterns:
for pattern in exclude_patterns:
if pattern in url:
return False
# Check include patterns (if specified)
if include_patterns:
for pattern in include_patterns:
if pattern in url:
return True
return False
# If no include patterns, accept by default
return True
def print_results(results, config):
"""Print estimation results"""
print()
print("=" * 70)
print("📊 ESTIMATION RESULTS")
print("=" * 70)
print()
print(f"Config: {config['name']}")
print(f"Base URL: {config['base_url']}")
print()
print(f"✅ Pages Discovered: {results['discovered']}")
print(f"⏳ Pages Pending: {results['pending']}")
print(f"📈 Estimated Total: {results['estimated_total']}")
print()
print(f"⏱️ Time Elapsed: {results['elapsed_seconds']}s")
print(f"⚡ Discovery Rate: {results['discovery_rate']} pages/sec")
if results.get('unlimited', False):
print()
print("✅ UNLIMITED MODE - Discovered all reachable pages")
print(f" Total pages: {results['estimated_total']}")
elif results['hit_limit']:
print()
print("⚠️ Hit discovery limit - actual total may be higher")
print(" Increase max_discovery parameter for more accurate estimate")
print()
print("=" * 70)
print("💡 RECOMMENDATIONS")
print("=" * 70)
print()
estimated = results['estimated_total']
current_max = config.get('max_pages', 100)
if estimated <= current_max:
print(f"✅ Current max_pages ({current_max}) is sufficient")
else:
recommended = min(estimated + 50, DISCOVERY_THRESHOLD) # Add 50 buffer, cap at threshold
print(f"⚠️ Current max_pages ({current_max}) may be too low")
print(f"📝 Recommended max_pages: {recommended}")
print(f" (Estimated {estimated} + 50 buffer)")
# Estimate time for full scrape
rate_limit = config.get('rate_limit', DEFAULT_RATE_LIMIT)
estimated_time = (estimated * rate_limit) / 60 # in minutes
print()
print(f"⏱️ Estimated full scrape time: {estimated_time:.1f} minutes")
print(f" (Based on rate_limit: {rate_limit}s)")
print()
def load_config(config_path):
"""Load configuration from JSON file"""
try:
with open(config_path, 'r') as f:
config = json.load(f)
return config
except FileNotFoundError:
print(f"❌ Error: Config file not found: {config_path}")
sys.exit(1)
except json.JSONDecodeError as e:
print(f"❌ Error: Invalid JSON in config file: {e}")
sys.exit(1)
def main():
"""Main entry point"""
import argparse
parser = argparse.ArgumentParser(
description='Estimate page count for Skill Seeker configs',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
  # Estimate pages for a config
  skill-seekers-estimate configs/react.json
  # Estimate with higher discovery limit
  skill-seekers-estimate configs/godot.json --max-discovery 2000
  # Quick estimate (stop at 100 pages)
  skill-seekers-estimate configs/vue.json --max-discovery 100
"""
)
parser.add_argument('config', help='Path to config JSON file')
parser.add_argument('--max-discovery', '-m', type=int, default=DEFAULT_MAX_DISCOVERY,
help=f'Maximum pages to discover (default: {DEFAULT_MAX_DISCOVERY}, use -1 for unlimited)')
parser.add_argument('--unlimited', '-u', action='store_true',
help='Remove discovery limit - discover all pages (same as --max-discovery -1)')
parser.add_argument('--timeout', '-t', type=int, default=30,
help='HTTP request timeout in seconds (default: 30)')
args = parser.parse_args()
# Handle unlimited flag
max_discovery = -1 if args.unlimited else args.max_discovery
# Load config
config = load_config(args.config)
# Run estimation
try:
results = estimate_pages(config, max_discovery, args.timeout)
print_results(results, config)
# Return exit code based on results
if results['hit_limit']:
return 2 # Warning: hit limit
return 0 # Success
except KeyboardInterrupt:
print("\n\n⚠️ Estimation interrupted by user")
return 1
except Exception as e:
print(f"\n\n❌ Error during estimation: {e}")
return 1
if __name__ == '__main__':
sys.exit(main())
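
The order of checks in `is_valid_url` determines which links the estimator queues. The following is a minimal sketch that restates that logic in condensed form and runs it against a few URLs. The site, patterns, and URLs are hypothetical and chosen only for illustration.

```python
def is_valid_url(url, base_url, include_patterns, exclude_patterns):
    """Same rule order as the function above: prefix check, then excludes, then includes."""
    if not url.startswith(base_url.rstrip('/')):
        return False
    if exclude_patterns and any(p in url for p in exclude_patterns):
        return False
    if include_patterns:
        return any(p in url for p in include_patterns)
    return True


# Hypothetical site and patterns, for illustration only
BASE = "https://docs.example.com/"
INCLUDE = ["/guide/"]
EXCLUDE = ["/blog/"]

results = {
    "guide page": is_valid_url("https://docs.example.com/guide/intro", BASE, INCLUDE, EXCLUDE),
    "excluded wins": is_valid_url("https://docs.example.com/blog/guide/news", BASE, INCLUDE, EXCLUDE),
    "not included": is_valid_url("https://docs.example.com/api/ref", BASE, INCLUDE, EXCLUDE),
    "other host": is_valid_url("https://other.example.com/guide/intro", BASE, INCLUDE, EXCLUDE),
    "no patterns": is_valid_url("https://docs.example.com/api/ref", BASE, [], []),
    "lookalike host": is_valid_url("https://docs.example.com.evil.test/guide/x", BASE, INCLUDE, EXCLUDE),
}

for label, accepted in results.items():
    print(f"{label}: {accepted}")
```

Exclude patterns take precedence over include patterns, and both are plain substring matches. The last case is accepted because the base URL is compared as a string prefix, so a host that merely begins with the same characters passes the first check.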


@@ -0,0 +1,274 @@
#!/usr/bin/env python3
"""
Router Skill Generator
Creates a router/hub skill that intelligently directs queries to specialized sub-skills.
This is used for large documentation sites split into multiple focused skills.
"""
import json
import sys
import argparse
from pathlib import Path
from typing import Dict, List, Any, Optional, Tuple
class RouterGenerator:
    """Generates router skills that direct to specialized sub-skills"""
    def __init__(self, config_paths: List[str], router_name: Optional[str] = None):
self.config_paths = [Path(p) for p in config_paths]
self.configs = [self.load_config(p) for p in self.config_paths]
self.router_name = router_name or self.infer_router_name()
self.base_config = self.configs[0] # Use first as template
def load_config(self, path: Path) -> Dict[str, Any]:
"""Load a config file"""
try:
with open(path, 'r') as f:
return json.load(f)
except Exception as e:
print(f"❌ Error loading {path}: {e}")
sys.exit(1)
def infer_router_name(self) -> str:
"""Infer router name from sub-skill names"""
# Find common prefix
names = [cfg['name'] for cfg in self.configs]
if not names:
return "router"
# Get common prefix before first dash
first_name = names[0]
if '-' in first_name:
return first_name.split('-')[0]
return first_name
def extract_routing_keywords(self) -> Dict[str, List[str]]:
"""Extract keywords for routing to each skill"""
routing = {}
for config in self.configs:
name = config['name']
keywords = []
# Extract from categories
if 'categories' in config:
keywords.extend(config['categories'].keys())
# Extract from name (part after dash)
if '-' in name:
skill_topic = name.split('-', 1)[1]
keywords.append(skill_topic)
routing[name] = keywords
return routing
def generate_skill_md(self) -> str:
"""Generate router SKILL.md content"""
routing_keywords = self.extract_routing_keywords()
skill_md = f"""# {self.router_name.replace('-', ' ').title()} Documentation (Router)
## When to Use This Skill
{self.base_config.get('description', f'Use for {self.router_name} development and programming.')}
This is a router skill that directs your questions to specialized sub-skills for efficient, focused assistance.
## How It Works
This skill analyzes your question and activates the appropriate specialized skill(s):
"""
# List sub-skills
for config in self.configs:
name = config['name']
desc = config.get('description', '')
            # Remove router name prefix from description if present
            prefix = f"{self.router_name.title()} - "
            if desc.startswith(prefix):
                desc = desc[len(prefix):]
skill_md += f"### {name}\n{desc}\n\n"
# Routing logic
skill_md += """## Routing Logic
The router analyzes your question for topic keywords and activates relevant skills:
**Keywords → Skills:**
"""
for skill_name, keywords in routing_keywords.items():
keyword_str = ", ".join(keywords)
skill_md += f"- {keyword_str} → **{skill_name}**\n"
# Quick reference
skill_md += f"""
## Quick Reference
For quick answers, this router provides basic overview information. For detailed documentation, the specialized skills contain comprehensive references.
### Getting Started
1. Ask your question naturally - mention the topic area
2. The router will activate the appropriate skill(s)
3. You'll receive focused, detailed answers from specialized documentation
### Examples
**Question:** "How do I create a 2D sprite?"
**Activates:** {self.router_name}-2d skill
**Question:** "GDScript function syntax"
**Activates:** {self.router_name}-scripting skill
**Question:** "Physics collision handling in 3D"
**Activates:** {self.router_name}-3d + {self.router_name}-physics skills
### All Available Skills
"""
# List all skills
for config in self.configs:
skill_md += f"- **{config['name']}**\n"
skill_md += f"""
## Need Help?
Simply ask your question and mention the topic. The router will find the right specialized skill for you!
---
*This is a router skill. For complete documentation, see the specialized skills listed above.*
"""
return skill_md
def create_router_config(self) -> Dict[str, Any]:
"""Create router configuration"""
routing_keywords = self.extract_routing_keywords()
router_config = {
"name": self.router_name,
"description": self.base_config.get('description', f'{self.router_name.title()} documentation router'),
"base_url": self.base_config['base_url'],
"selectors": self.base_config.get('selectors', {}),
"url_patterns": self.base_config.get('url_patterns', {}),
"rate_limit": self.base_config.get('rate_limit', 0.5),
"max_pages": 500, # Router only scrapes overview pages
"_router": True,
"_sub_skills": [cfg['name'] for cfg in self.configs],
"_routing_keywords": routing_keywords
}
return router_config
def generate(self, output_dir: Path = None) -> Tuple[Path, Path]:
"""Generate router skill and config"""
if output_dir is None:
output_dir = self.config_paths[0].parent
output_dir = Path(output_dir)
# Generate SKILL.md
skill_md = self.generate_skill_md()
skill_path = output_dir.parent / f"output/{self.router_name}/SKILL.md"
skill_path.parent.mkdir(parents=True, exist_ok=True)
with open(skill_path, 'w') as f:
f.write(skill_md)
# Generate config
router_config = self.create_router_config()
config_path = output_dir / f"{self.router_name}.json"
with open(config_path, 'w') as f:
json.dump(router_config, f, indent=2)
return config_path, skill_path
def main():
parser = argparse.ArgumentParser(
description="Generate router/hub skill for split documentation",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Generate router from multiple configs
python3 generate_router.py configs/godot-2d.json configs/godot-3d.json configs/godot-scripting.json
# Use glob pattern
python3 generate_router.py configs/godot-*.json
# Custom router name
python3 generate_router.py configs/godot-*.json --name godot-hub
# Custom output directory
python3 generate_router.py configs/godot-*.json --output-dir configs/routers/
"""
)
parser.add_argument(
'configs',
nargs='+',
help='Sub-skill config files'
)
parser.add_argument(
'--name',
help='Router skill name (default: inferred from sub-skills)'
)
parser.add_argument(
'--output-dir',
help='Output directory (default: same as input configs)'
)
args = parser.parse_args()
# Filter out router configs (avoid recursion)
config_files = []
for path_str in args.configs:
path = Path(path_str)
if path.exists() and not path.stem.endswith('-router'):
config_files.append(path_str)
if not config_files:
print("❌ Error: No valid config files provided")
sys.exit(1)
print(f"\n{'='*60}")
print("ROUTER SKILL GENERATOR")
print(f"{'='*60}")
print(f"Sub-skills: {len(config_files)}")
for cfg in config_files:
print(f" - {Path(cfg).stem}")
print("")
# Generate router
generator = RouterGenerator(config_files, args.name)
config_path, skill_path = generator.generate(args.output_dir)
print(f"✅ Router config created: {config_path}")
print(f"✅ Router SKILL.md created: {skill_path}")
print("")
print(f"{'='*60}")
print("NEXT STEPS")
print(f"{'='*60}")
print(f"1. Review router SKILL.md: {skill_path}")
    print("2. Optionally scrape router (for overview pages):")
print(f" python3 cli/doc_scraper.py --config {config_path}")
print("3. Package router skill:")
print(f" python3 cli/package_skill.py output/{generator.router_name}/")
print("4. Upload router + all sub-skills to Claude")
print("")
if __name__ == "__main__":
main()
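
The router name and routing keywords are derived purely from the sub-skill configs. The sketch below restates `infer_router_name` and `extract_routing_keywords` as standalone functions and applies them to two hypothetical configs. The config contents, including the shape of `categories`, are assumptions made for illustration.

```python
from typing import Any, Dict, List


def infer_router_name(configs: List[Dict[str, Any]]) -> str:
    """Use the part of the first config name before the first dash."""
    names = [cfg['name'] for cfg in configs]
    if not names:
        return "router"
    first_name = names[0]
    return first_name.split('-')[0] if '-' in first_name else first_name


def extract_routing_keywords(configs: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect category names plus the name suffix as keywords for each skill."""
    routing = {}
    for config in configs:
        name = config['name']
        keywords = list(config.get('categories', {}).keys())
        if '-' in name:
            keywords.append(name.split('-', 1)[1])
        routing[name] = keywords
    return routing


# Hypothetical sub-skill configs, for illustration only
configs = [
    {"name": "godot-2d", "categories": {"sprites": ["sprite"], "tilemaps": ["tilemap"]}},
    {"name": "godot-scripting"},
]

router_name = infer_router_name(configs)
routing = extract_routing_keywords(configs)
print(router_name)
print(routing)
```

Only the first config's name is used to infer the router name, so passing configs with different prefixes still yields a single router named after the first one unless `--name` is given.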


@@ -0,0 +1,797 @@
#!/usr/bin/env python3
"""
GitHub Repository to Claude Skill Converter (Tasks C1.1-C1.12)
Converts GitHub repositories into Claude AI skills by extracting:
- README and documentation
- Code structure and signatures
- GitHub Issues, Changelog, and Releases
- Usage examples from tests
Usage:
    skill-seekers-github --repo facebook/react
    skill-seekers-github --config configs/react_github.json
    skill-seekers-github --repo owner/repo --token $GITHUB_TOKEN
"""
import os
import sys
import json
import re
import argparse
import logging
from pathlib import Path
from typing import Dict, List, Optional, Any
from datetime import datetime
try:
from github import Github, GithubException, Repository
from github.GithubException import RateLimitExceededException
except ImportError:
print("Error: PyGithub not installed. Run: pip install PyGithub")
sys.exit(1)
# Configure logging first: the optional import below logs a warning on failure
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Import code analyzer for deep code analysis
try:
    from code_analyzer import CodeAnalyzer
    CODE_ANALYZER_AVAILABLE = True
except ImportError:
    CODE_ANALYZER_AVAILABLE = False
    logger.warning("Code analyzer not available - deep analysis disabled")
class GitHubScraper:
"""
GitHub Repository Scraper (C1.1-C1.9)
Extracts repository information for skill generation:
- Repository structure
- README files
- Code comments and docstrings
- Programming language detection
- Function/class signatures
- Test examples
- GitHub Issues
- CHANGELOG
- Releases
"""
def __init__(self, config: Dict[str, Any]):
"""Initialize GitHub scraper with configuration."""
self.config = config
self.repo_name = config['repo']
self.name = config.get('name', self.repo_name.split('/')[-1])
self.description = config.get('description', f'Skill for {self.repo_name}')
# GitHub client setup (C1.1)
token = self._get_token()
self.github = Github(token) if token else Github()
self.repo: Optional[Repository.Repository] = None
# Options
self.include_issues = config.get('include_issues', True)
self.max_issues = config.get('max_issues', 100)
self.include_changelog = config.get('include_changelog', True)
self.include_releases = config.get('include_releases', True)
self.include_code = config.get('include_code', False)
self.code_analysis_depth = config.get('code_analysis_depth', 'surface') # 'surface', 'deep', 'full'
self.file_patterns = config.get('file_patterns', [])
# Initialize code analyzer if deep analysis requested
self.code_analyzer = None
if self.code_analysis_depth != 'surface' and CODE_ANALYZER_AVAILABLE:
self.code_analyzer = CodeAnalyzer(depth=self.code_analysis_depth)
logger.info(f"Code analysis depth: {self.code_analysis_depth}")
# Output paths
self.skill_dir = f"output/{self.name}"
self.data_file = f"output/{self.name}_github_data.json"
# Extracted data storage
self.extracted_data = {
'repo_info': {},
'readme': '',
'file_tree': [],
'languages': {},
'signatures': [],
'test_examples': [],
'issues': [],
'changelog': '',
'releases': []
}
def _get_token(self) -> Optional[str]:
"""
Get GitHub token from env var or config (both options supported).
Priority: GITHUB_TOKEN env var > config file > None
"""
# Try environment variable first (recommended)
token = os.getenv('GITHUB_TOKEN')
if token:
logger.info("Using GitHub token from GITHUB_TOKEN environment variable")
return token
# Fall back to config file
token = self.config.get('github_token')
if token:
logger.warning("Using GitHub token from config file (less secure)")
return token
logger.warning("No GitHub token provided - using unauthenticated access (lower rate limits)")
return None
def scrape(self) -> Dict[str, Any]:
"""
Main scraping entry point.
Executes all C1 tasks in sequence.
"""
try:
logger.info(f"Starting GitHub scrape for: {self.repo_name}")
# C1.1: Fetch repository
self._fetch_repository()
# C1.2: Extract README
self._extract_readme()
# C1.3-C1.6: Extract code structure
self._extract_code_structure()
# C1.7: Extract Issues
if self.include_issues:
self._extract_issues()
# C1.8: Extract CHANGELOG
if self.include_changelog:
self._extract_changelog()
# C1.9: Extract Releases
if self.include_releases:
self._extract_releases()
# Save extracted data
self._save_data()
logger.info(f"✅ Scraping complete! Data saved to: {self.data_file}")
return self.extracted_data
except RateLimitExceededException:
logger.error("GitHub API rate limit exceeded. Please wait or use authentication token.")
raise
except GithubException as e:
logger.error(f"GitHub API error: {e}")
raise
except Exception as e:
logger.error(f"Unexpected error during scraping: {e}")
raise
def _fetch_repository(self):
"""C1.1: Fetch repository structure using GitHub API."""
logger.info(f"Fetching repository: {self.repo_name}")
try:
self.repo = self.github.get_repo(self.repo_name)
# Extract basic repo info
self.extracted_data['repo_info'] = {
'name': self.repo.name,
'full_name': self.repo.full_name,
'description': self.repo.description,
'url': self.repo.html_url,
'homepage': self.repo.homepage,
'stars': self.repo.stargazers_count,
'forks': self.repo.forks_count,
'open_issues': self.repo.open_issues_count,
'default_branch': self.repo.default_branch,
'created_at': self.repo.created_at.isoformat() if self.repo.created_at else None,
'updated_at': self.repo.updated_at.isoformat() if self.repo.updated_at else None,
'language': self.repo.language,
'license': self.repo.license.name if self.repo.license else None,
'topics': self.repo.get_topics()
}
logger.info(f"Repository fetched: {self.repo.full_name} ({self.repo.stargazers_count} stars)")
except GithubException as e:
if e.status == 404:
raise ValueError(f"Repository not found: {self.repo_name}")
raise
def _extract_readme(self):
"""C1.2: Extract README.md files."""
logger.info("Extracting README...")
# Try common README locations
readme_files = ['README.md', 'README.rst', 'README.txt', 'README',
'docs/README.md', '.github/README.md']
for readme_path in readme_files:
try:
content = self.repo.get_contents(readme_path)
if content:
self.extracted_data['readme'] = content.decoded_content.decode('utf-8')
logger.info(f"README found: {readme_path}")
return
except GithubException:
continue
logger.warning("No README found in repository")
def _extract_code_structure(self):
"""
C1.3-C1.6: Extract code structure, languages, signatures, and test examples.
Surface layer only - no full implementation code.
"""
logger.info("Extracting code structure...")
# C1.4: Get language breakdown
self._extract_languages()
# Get file tree
self._extract_file_tree()
# Extract signatures and test examples
if self.include_code:
self._extract_signatures_and_tests()
def _extract_languages(self):
"""C1.4: Detect programming languages in repository."""
logger.info("Detecting programming languages...")
try:
languages = self.repo.get_languages()
total_bytes = sum(languages.values())
self.extracted_data['languages'] = {
lang: {
'bytes': bytes_count,
'percentage': round((bytes_count / total_bytes) * 100, 2) if total_bytes > 0 else 0
}
for lang, bytes_count in languages.items()
}
logger.info(f"Languages detected: {', '.join(languages.keys())}")
except GithubException as e:
logger.warning(f"Could not fetch languages: {e}")
def _extract_file_tree(self):
"""Extract repository file tree structure."""
logger.info("Building file tree...")
try:
contents = self.repo.get_contents("")
file_tree = []
while contents:
file_content = contents.pop(0)
file_info = {
'path': file_content.path,
'type': file_content.type,
'size': file_content.size if file_content.type == 'file' else None
}
file_tree.append(file_info)
if file_content.type == "dir":
contents.extend(self.repo.get_contents(file_content.path))
self.extracted_data['file_tree'] = file_tree
logger.info(f"File tree built: {len(file_tree)} items")
except GithubException as e:
logger.warning(f"Could not build file tree: {e}")
def _extract_signatures_and_tests(self):
"""
C1.3, C1.5, C1.6: Extract signatures, docstrings, and test examples.
Extraction depth depends on code_analysis_depth setting:
- surface: File tree only (minimal)
- deep: Parse files for signatures, parameters, types
- full: Complete AST analysis (future enhancement)
"""
if self.code_analysis_depth == 'surface':
logger.info("Code extraction: Surface level (file tree only)")
return
if not self.code_analyzer:
logger.warning("Code analyzer not available - skipping deep analysis")
return
logger.info(f"Extracting code signatures ({self.code_analysis_depth} analysis)...")
# Get primary language for the repository
languages = self.extracted_data.get('languages', {})
if not languages:
logger.warning("No languages detected - skipping code analysis")
return
# Determine primary language
primary_language = max(languages.items(), key=lambda x: x[1]['bytes'])[0]
logger.info(f"Primary language: {primary_language}")
# Determine file extensions to analyze
extension_map = {
'Python': ['.py'],
'JavaScript': ['.js', '.jsx'],
'TypeScript': ['.ts', '.tsx'],
'C': ['.c', '.h'],
'C++': ['.cpp', '.hpp', '.cc', '.hh', '.cxx']
}
extensions = extension_map.get(primary_language, [])
if not extensions:
logger.warning(f"No file extensions mapped for {primary_language}")
return
# Analyze files matching patterns and extensions
analyzed_files = []
file_tree = self.extracted_data.get('file_tree', [])
for file_info in file_tree:
file_path = file_info['path']
# Check if file matches extension
if not any(file_path.endswith(ext) for ext in extensions):
continue
# Check if file matches patterns (if specified)
if self.file_patterns:
import fnmatch
if not any(fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns):
continue
# Analyze this file
try:
file_content = self.repo.get_contents(file_path)
content = file_content.decoded_content.decode('utf-8')
analysis_result = self.code_analyzer.analyze_file(
file_path,
content,
primary_language
)
if analysis_result and (analysis_result.get('classes') or analysis_result.get('functions')):
analyzed_files.append({
'file': file_path,
'language': primary_language,
**analysis_result
})
logger.debug(f"Analyzed {file_path}: "
f"{len(analysis_result.get('classes', []))} classes, "
f"{len(analysis_result.get('functions', []))} functions")
except Exception as e:
logger.debug(f"Could not analyze {file_path}: {e}")
continue
# Limit number of files analyzed to avoid rate limits
if len(analyzed_files) >= 50:
logger.info(f"Reached analysis limit (50 files)")
break
self.extracted_data['code_analysis'] = {
'depth': self.code_analysis_depth,
'language': primary_language,
'files_analyzed': len(analyzed_files),
'files': analyzed_files
}
# Calculate totals
total_classes = sum(len(f.get('classes', [])) for f in analyzed_files)
total_functions = sum(len(f.get('functions', [])) for f in analyzed_files)
logger.info(f"Code analysis complete: {len(analyzed_files)} files, "
f"{total_classes} classes, {total_functions} functions")
def _extract_issues(self):
"""C1.7: Extract GitHub Issues (open/closed, labels, milestones)."""
logger.info(f"Extracting GitHub Issues (max {self.max_issues})...")
try:
# Fetch recent issues (open + closed)
issues = self.repo.get_issues(state='all', sort='updated', direction='desc')
issue_list = []
for issue in issues[:self.max_issues]:
# Skip pull requests (they appear in issues)
if issue.pull_request:
continue
issue_data = {
'number': issue.number,
'title': issue.title,
'state': issue.state,
'labels': [label.name for label in issue.labels],
'milestone': issue.milestone.title if issue.milestone else None,
'created_at': issue.created_at.isoformat() if issue.created_at else None,
'updated_at': issue.updated_at.isoformat() if issue.updated_at else None,
'closed_at': issue.closed_at.isoformat() if issue.closed_at else None,
'url': issue.html_url,
'body': issue.body[:500] if issue.body else None # First 500 chars
}
issue_list.append(issue_data)
self.extracted_data['issues'] = issue_list
logger.info(f"Extracted {len(issue_list)} issues")
except GithubException as e:
logger.warning(f"Could not fetch issues: {e}")
def _extract_changelog(self):
"""C1.8: Extract CHANGELOG.md and release notes."""
logger.info("Extracting CHANGELOG...")
# Try common changelog locations
changelog_files = ['CHANGELOG.md', 'CHANGES.md', 'HISTORY.md',
'CHANGELOG.rst', 'CHANGELOG.txt', 'CHANGELOG',
'docs/CHANGELOG.md', '.github/CHANGELOG.md']
for changelog_path in changelog_files:
try:
content = self.repo.get_contents(changelog_path)
if content:
self.extracted_data['changelog'] = content.decoded_content.decode('utf-8')
logger.info(f"CHANGELOG found: {changelog_path}")
return
except GithubException:
continue
logger.warning("No CHANGELOG found in repository")
def _extract_releases(self):
"""C1.9: Extract GitHub Releases with version history."""
logger.info("Extracting GitHub Releases...")
try:
releases = self.repo.get_releases()
release_list = []
for release in releases:
release_data = {
'tag_name': release.tag_name,
'name': release.title,
'body': release.body,
'draft': release.draft,
'prerelease': release.prerelease,
'created_at': release.created_at.isoformat() if release.created_at else None,
'published_at': release.published_at.isoformat() if release.published_at else None,
'url': release.html_url,
'tarball_url': release.tarball_url,
'zipball_url': release.zipball_url
}
release_list.append(release_data)
self.extracted_data['releases'] = release_list
logger.info(f"Extracted {len(release_list)} releases")
except GithubException as e:
logger.warning(f"Could not fetch releases: {e}")
def _save_data(self):
"""Save extracted data to JSON file."""
os.makedirs('output', exist_ok=True)
with open(self.data_file, 'w', encoding='utf-8') as f:
json.dump(self.extracted_data, f, indent=2, ensure_ascii=False)
logger.info(f"Data saved to: {self.data_file}")
class GitHubToSkillConverter:
"""
Convert extracted GitHub data to Claude skill format (C1.10).
"""
def __init__(self, config: Dict[str, Any]):
"""Initialize converter with configuration."""
self.config = config
self.name = config.get('name', config['repo'].split('/')[-1])
self.description = config.get('description', f'Skill for {config["repo"]}')
# Paths
self.data_file = f"output/{self.name}_github_data.json"
self.skill_dir = f"output/{self.name}"
# Load extracted data
self.data = self._load_data()
def _load_data(self) -> Dict[str, Any]:
"""Load extracted GitHub data from JSON."""
if not os.path.exists(self.data_file):
raise FileNotFoundError(f"Data file not found: {self.data_file}")
with open(self.data_file, 'r', encoding='utf-8') as f:
return json.load(f)
def build_skill(self):
"""Build complete skill structure."""
logger.info(f"Building skill for: {self.name}")
# Create directories
os.makedirs(self.skill_dir, exist_ok=True)
os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
# Generate SKILL.md
self._generate_skill_md()
# Generate reference files
self._generate_references()
logger.info(f"✅ Skill built successfully: {self.skill_dir}/")
def _generate_skill_md(self):
"""Generate main SKILL.md file."""
repo_info = self.data.get('repo_info', {})
# Generate skill name (lowercase, hyphens only, max 64 chars)
skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
        # Truncate description to 1024 chars (slicing is a no-op for shorter strings)
        desc = self.description[:1024]
skill_content = f"""---
name: {skill_name}
description: {desc}
---
# {repo_info.get('name', self.name)}
{self.description}
## Description
{repo_info.get('description') or 'GitHub repository skill'}
**Repository:** [{repo_info.get('full_name', 'N/A')}]({repo_info.get('url', '#')})
**Language:** {repo_info.get('language') or 'N/A'}
**Stars:** {repo_info.get('stars', 0):,}
**License:** {repo_info.get('license') or 'N/A'}
## When to Use This Skill
Use this skill when you need to:
- Understand how to use {self.name}
- Look up API documentation
- Find usage examples
- Check for known issues or recent changes
- Review release history
## Quick Reference
### Repository Info
- **Homepage:** {repo_info.get('homepage') or 'N/A'}
- **Topics:** {', '.join(repo_info.get('topics', []))}
- **Open Issues:** {repo_info.get('open_issues', 0)}
- **Last Updated:** {(repo_info.get('updated_at') or 'N/A')[:10]}
### Languages
{self._format_languages()}
### Recent Releases
{self._format_recent_releases()}
## Available References
- `references/README.md` - Complete README documentation
- `references/CHANGELOG.md` - Version history and changes
- `references/issues.md` - Recent GitHub issues
- `references/releases.md` - Release notes
- `references/file_structure.md` - Repository structure
## Usage
See README.md for complete usage instructions and examples.
---
**Generated by Skill Seeker** | GitHub Repository Scraper
"""
skill_path = f"{self.skill_dir}/SKILL.md"
with open(skill_path, 'w', encoding='utf-8') as f:
f.write(skill_content)
logger.info(f"Generated: {skill_path}")
def _format_languages(self) -> str:
"""Format language breakdown."""
languages = self.data.get('languages', {})
if not languages:
return "No language data available"
lines = []
for lang, info in sorted(languages.items(), key=lambda x: x[1]['bytes'], reverse=True):
lines.append(f"- **{lang}:** {info['percentage']:.1f}%")
return '\n'.join(lines)
    def _format_recent_releases(self) -> str:
        """Format recent releases (top 3)."""
        releases = self.data.get('releases', [])
        if not releases:
            return "No releases available"
        lines = []
        for release in releases[:3]:
            # published_at is stored as None when the release has no publish date (e.g. drafts)
            published = release['published_at'][:10] if release['published_at'] else 'unpublished'
            lines.append(f"- **{release['tag_name']}** ({published}): {release['name']}")
        return '\n'.join(lines)
def _generate_references(self):
"""Generate all reference files."""
# README
if self.data.get('readme'):
readme_path = f"{self.skill_dir}/references/README.md"
with open(readme_path, 'w', encoding='utf-8') as f:
f.write(self.data['readme'])
logger.info(f"Generated: {readme_path}")
# CHANGELOG
if self.data.get('changelog'):
changelog_path = f"{self.skill_dir}/references/CHANGELOG.md"
with open(changelog_path, 'w', encoding='utf-8') as f:
f.write(self.data['changelog'])
logger.info(f"Generated: {changelog_path}")
# Issues
if self.data.get('issues'):
self._generate_issues_reference()
# Releases
if self.data.get('releases'):
self._generate_releases_reference()
# File structure
if self.data.get('file_tree'):
self._generate_file_structure_reference()
def _generate_issues_reference(self):
"""Generate issues.md reference file."""
issues = self.data['issues']
content = f"# GitHub Issues\n\nRecent issues from the repository ({len(issues)} total).\n\n"
# Group by state
open_issues = [i for i in issues if i['state'] == 'open']
closed_issues = [i for i in issues if i['state'] == 'closed']
content += f"## Open Issues ({len(open_issues)})\n\n"
for issue in open_issues[:20]:
labels = ', '.join(issue['labels']) if issue['labels'] else 'No labels'
content += f"### #{issue['number']}: {issue['title']}\n"
content += f"**Labels:** {labels} | **Created:** {issue['created_at'][:10]}\n"
content += f"[View on GitHub]({issue['url']})\n\n"
content += f"\n## Recently Closed Issues ({len(closed_issues)})\n\n"
for issue in closed_issues[:10]:
labels = ', '.join(issue['labels']) if issue['labels'] else 'No labels'
content += f"### #{issue['number']}: {issue['title']}\n"
content += f"**Labels:** {labels} | **Closed:** {issue['closed_at'][:10]}\n"
content += f"[View on GitHub]({issue['url']})\n\n"
issues_path = f"{self.skill_dir}/references/issues.md"
with open(issues_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Generated: {issues_path}")
    def _generate_releases_reference(self):
        """Generate releases.md reference file."""
        releases = self.data['releases']
        content = f"# Releases\n\nVersion history for this repository ({len(releases)} releases).\n\n"
        for release in releases:
            # published_at and body are stored as None when missing (e.g. drafts, empty notes)
            published = release['published_at'][:10] if release['published_at'] else 'unpublished'
            body = release['body'] or ''
            content += f"## {release['tag_name']}: {release['name']}\n"
            content += f"**Published:** {published}\n"
            if release['prerelease']:
                content += "**Pre-release**\n"
            content += f"\n{body}\n\n"
            content += f"[View on GitHub]({release['url']})\n\n---\n\n"
        releases_path = f"{self.skill_dir}/references/releases.md"
        with open(releases_path, 'w', encoding='utf-8') as f:
            f.write(content)
        logger.info(f"Generated: {releases_path}")
def _generate_file_structure_reference(self):
"""Generate file_structure.md reference file."""
file_tree = self.data['file_tree']
content = f"# Repository File Structure\n\n"
content += f"Total items: {len(file_tree)}\n\n"
content += "```\n"
# Build tree structure
for item in file_tree:
indent = " " * item['path'].count('/')
icon = "📁" if item['type'] == 'dir' else "📄"
content += f"{indent}{icon} {os.path.basename(item['path'])}\n"
content += "```\n"
structure_path = f"{self.skill_dir}/references/file_structure.md"
with open(structure_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Generated: {structure_path}")
def main():
"""C1.10: CLI tool entry point."""
parser = argparse.ArgumentParser(
description='GitHub Repository to Claude Skill Converter',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
  skill-seekers-github --repo facebook/react
  skill-seekers-github --config configs/react_github.json
  skill-seekers-github --repo owner/repo --token $GITHUB_TOKEN
"""
)
parser.add_argument('--repo', help='GitHub repository (owner/repo)')
parser.add_argument('--config', help='Path to config JSON file')
parser.add_argument('--token', help='GitHub personal access token')
parser.add_argument('--name', help='Skill name (default: repo name)')
parser.add_argument('--description', help='Skill description')
parser.add_argument('--no-issues', action='store_true', help='Skip GitHub issues')
parser.add_argument('--no-changelog', action='store_true', help='Skip CHANGELOG')
parser.add_argument('--no-releases', action='store_true', help='Skip releases')
parser.add_argument('--max-issues', type=int, default=100, help='Max issues to fetch')
parser.add_argument('--scrape-only', action='store_true', help='Only scrape, don\'t build skill')
args = parser.parse_args()
# Build config from args or file
if args.config:
with open(args.config, 'r') as f:
config = json.load(f)
elif args.repo:
config = {
'repo': args.repo,
'name': args.name or args.repo.split('/')[-1],
'description': args.description or f'GitHub repository skill for {args.repo}',
'github_token': args.token,
'include_issues': not args.no_issues,
'include_changelog': not args.no_changelog,
'include_releases': not args.no_releases,
'max_issues': args.max_issues
}
else:
parser.error('Either --repo or --config is required')
try:
# Phase 1: Scrape GitHub repository
scraper = GitHubScraper(config)
scraper.scrape()
if args.scrape_only:
logger.info("Scrape complete (--scrape-only mode)")
return
# Phase 2: Build skill
converter = GitHubToSkillConverter(config)
converter.build_skill()
skill_name = config.get('name', config['repo'].split('/')[-1])
logger.info(f"\n✅ Success! Skill created at: output/{skill_name}/")
logger.info(f"Next step: skill-seekers package output/{skill_name}/")
except Exception as e:
logger.error(f"Error: {e}")
sys.exit(1)
if __name__ == '__main__':
main()
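
The indentation logic in `_generate_file_structure_reference` can be tried in isolation. The following is a minimal sketch, assuming a two-space indent per path level and a hypothetical four-entry `file_tree`; the real data comes from the scraper and the original indent width is not visible in this diff.

```python
import os

# Hypothetical sample data in the same shape the generator iterates over.
file_tree = [
    {"path": "src", "type": "dir"},
    {"path": "src/app.py", "type": "file"},
    {"path": "src/utils", "type": "dir"},
    {"path": "src/utils/io.py", "type": "file"},
]

lines = []
for item in file_tree:
    # Depth is the number of "/" separators in the path (assumed 2 spaces each).
    indent = "  " * item["path"].count("/")
    icon = "📁" if item["type"] == "dir" else "📄"
    lines.append(f"{indent}{icon} {os.path.basename(item['path'])}")

tree_text = "\n".join(lines)
print(tree_text)
```

Because depth is derived only from the path string, the output is readable only if `file_tree` is already sorted so that each directory precedes its children.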

@@ -0,0 +1,66 @@
# ABOUTME: Detects and validates llms.txt file availability at documentation URLs
# ABOUTME: Supports llms-full.txt, llms.txt, and llms-small.txt variants

import requests
from typing import Optional, Dict, List
from urllib.parse import urlparse


class LlmsTxtDetector:
    """Detect llms.txt files at documentation URLs"""

    # Checked in this order; detect() returns the first variant that exists
    VARIANTS = [
        ('llms-full.txt', 'full'),
        ('llms.txt', 'standard'),
        ('llms-small.txt', 'small')
    ]

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip('/')

    def detect(self) -> Optional[Dict[str, str]]:
        """
        Detect available llms.txt variant.

        Returns:
            Dict with 'url' and 'variant' keys, or None if not found
        """
        # llms.txt files are looked up at the site root, not under the docs path
        parsed = urlparse(self.base_url)
        root_url = f"{parsed.scheme}://{parsed.netloc}"

        for filename, variant in self.VARIANTS:
            url = f"{root_url}/{filename}"
            if self._check_url_exists(url):
                return {'url': url, 'variant': variant}

        return None

    def detect_all(self) -> List[Dict[str, str]]:
        """
        Detect all available llms.txt variants.

        Returns:
            List of dicts with 'url' and 'variant' keys for each found variant
        """
        found_variants = []

        # The root URL is the same for every variant, so compute it once
        parsed = urlparse(self.base_url)
        root_url = f"{parsed.scheme}://{parsed.netloc}"

        for filename, variant in self.VARIANTS:
            url = f"{root_url}/{filename}"
            if self._check_url_exists(url):
                found_variants.append({
                    'url': url,
                    'variant': variant
                })

        return found_variants

    def _check_url_exists(self, url: str) -> bool:
        """Check if URL returns 200 status"""
        try:
            response = requests.head(url, timeout=5, allow_redirects=True)
            return response.status_code == 200
        except requests.RequestException:
            return False
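
The variant preference order used by `detect()` can be exercised without network access. This is a minimal sketch, not the class itself: `pick_variant` and its `exists` callback are hypothetical names introduced here, with `exists` standing in for the HEAD request made by `_check_url_exists`.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# Same preference order as LlmsTxtDetector.VARIANTS.
VARIANTS = [
    ("llms-full.txt", "full"),
    ("llms.txt", "standard"),
    ("llms-small.txt", "small"),
]


def pick_variant(base_url: str, exists: Callable[[str], bool]) -> Optional[Dict[str, str]]:
    """Return the first variant whose root-level URL passes `exists`, else None."""
    parsed = urlparse(base_url)
    root_url = f"{parsed.scheme}://{parsed.netloc}"
    for filename, variant in VARIANTS:
        url = f"{root_url}/{filename}"
        if exists(url):
            return {"url": url, "variant": variant}
    return None


# Pretend only the standard and small files are published on example.com.
published = {"https://example.com/llms.txt", "https://example.com/llms-small.txt"}
result = pick_variant("https://example.com/docs/guide/", published.__contains__)
```

Note that the path component of the base URL (`/docs/guide/` here) is discarded, which matches how the detector builds `root_url` from scheme and host only.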

@@ -0,0 +1,94 @@
# ABOUTME: Downloads llms.txt files from documentation URLs with retry logic
# ABOUTME: Validates markdown content and handles timeouts with exponential backoff

import requests
import time
from typing import Optional
from urllib.parse import urlparse


class LlmsTxtDownloader:
    """Download llms.txt content from URLs with retry logic"""

    def __init__(self, url: str, timeout: int = 30, max_retries: int = 3):
        self.url = url
        self.timeout = timeout
        self.max_retries = max_retries

    def get_proper_filename(self) -> str:
        """
        Extract filename from URL and convert .txt to .md

        Returns:
            Proper filename with .md extension

        Examples:
            https://hono.dev/llms-full.txt -> llms-full.md
            https://hono.dev/llms.txt -> llms.md
            https://hono.dev/llms-small.txt -> llms-small.md
        """
        # Extract filename from URL
        parsed = urlparse(self.url)
        filename = parsed.path.split('/')[-1]

        # Replace .txt with .md
        if filename.endswith('.txt'):
            filename = filename[:-4] + '.md'

        return filename

    def _is_markdown(self, content: str) -> bool:
        """
        Check if content looks like markdown.

        Returns:
            True if content contains markdown patterns
        """
        markdown_patterns = ['# ', '## ', '```', '- ', '* ', '`']
        return any(pattern in content for pattern in markdown_patterns)

    def download(self) -> Optional[str]:
        """
        Download llms.txt content with retry logic.

        Returns:
            String content or None if download fails
        """
        headers = {
            'User-Agent': 'Skill-Seekers-llms.txt-Reader/1.0'
        }

        for attempt in range(self.max_retries):
            try:
                response = requests.get(
                    self.url,
                    headers=headers,
                    timeout=self.timeout
                )
                response.raise_for_status()

                content = response.text

                # Reject content that is too short (under 100 chars) to be a real llms.txt file
                if len(content) < 100:
                    print(f"⚠️ Content too short ({len(content)} chars), rejecting")
                    return None

                # Validate content looks like markdown
                if not self._is_markdown(content):
                    print("⚠️ Content doesn't look like markdown")
                    return None

                return content

            except requests.RequestException as e:
                if attempt < self.max_retries - 1:
                    # Calculate exponential backoff delay: 1s, 2s, 4s, etc.
                    delay = 2 ** attempt
                    print(f"⚠️ Attempt {attempt + 1}/{self.max_retries} failed: {e}")
                    print(f" Retrying in {delay}s...")
                    time.sleep(delay)
                else:
                    print(f"❌ Failed to download {self.url} after {self.max_retries} attempts: {e}")
                    return None

        return None
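
The retry schedule in `download()` (wait 1s, then 2s, and give up after the last attempt) can be checked without real HTTP calls or real sleeping. This is a minimal sketch under stated assumptions: `retry_with_backoff` is a hypothetical helper, it catches any `Exception` rather than `requests.RequestException`, and the `sleep` parameter is injected only so the delays can be recorded.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch` up to `max_retries` times, waiting 2**attempt seconds between tries."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                return None  # last attempt failed, give up
            sleep(2 ** attempt)
    return None  # only reached when max_retries is 0


# A fake fetch that fails twice and then succeeds.
delays: List[float] = []
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "# Title\n"


content = retry_with_backoff(flaky, max_retries=3, sleep=delays.append)
```

No delay is recorded after the final attempt, mirroring the `attempt < self.max_retries - 1` check in the downloader.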

@@ -0,0 +1,74 @@
# ABOUTME: Parses llms.txt markdown content into structured page data
# ABOUTME: Extracts titles, content, code samples, and headings from markdown

import re
from typing import List, Dict


class LlmsTxtParser:
    """Parse llms.txt markdown content into page structures"""

    def __init__(self, content: str):
        self.content = content

    def parse(self) -> List[Dict]:
        """
        Parse markdown content into page structures.

        Returns:
            List of page dicts with title, content, code_samples, headings, url, links
        """
        pages = []

        # Split by h1 headers (# Title).
        # Note: this also splits on lines starting with "# " inside code blocks.
        sections = re.split(r'\n# ', self.content)

        for section in sections:
            if not section.strip():
                continue

            # First line is title
            lines = section.split('\n')
            title = lines[0].strip('#').strip()

            # Parse content
            page = self._parse_section('\n'.join(lines[1:]), title)
            pages.append(page)

        return pages

    def _parse_section(self, content: str, title: str) -> Dict:
        """Parse a single section into page structure"""
        page = {
            'title': title,
            'content': '',
            'code_samples': [],
            'headings': [],
            'url': f'llms-txt#{title.lower().replace(" ", "-")}',
            'links': []
        }

        # Extract code blocks (the language tag after the fence is optional)
        code_blocks = re.findall(r'```(\w+)?\n(.*?)```', content, re.DOTALL)
        for lang, code in code_blocks:
            page['code_samples'].append({
                'code': code.strip(),
                'language': lang or 'unknown'
            })

        # Extract h2/h3 headings
        headings = re.findall(r'^(#{2,3})\s+(.+)$', content, re.MULTILINE)
        for level_markers, text in headings:
            page['headings'].append({
                'level': f'h{len(level_markers)}',
                'text': text.strip(),
                'id': text.lower().replace(' ', '-')
            })

        # Remove code blocks from content for plain text
        content_no_code = re.sub(r'```.*?```', '', content, flags=re.DOTALL)

        # Extract paragraphs (blocks of 20 characters or fewer are dropped)
        paragraphs = [p.strip() for p in content_no_code.split('\n\n') if len(p.strip()) > 20]
        page['content'] = '\n\n'.join(paragraphs)

        return page
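
The three regular expressions the parser relies on can be tried on a small document. This sketch reuses the same patterns on a hypothetical two-section sample; the fence marker is built programmatically only so the example nests cleanly inside this block.

```python
import re

FENCE = "`" * 3  # three backticks, assembled to avoid closing this example early

# Hypothetical llms.txt-style content with two h1 sections.
sample = (
    "# Getting Started\n"
    "Install the package first.\n\n"
    "## Installation\n"
    f"{FENCE}bash\npip install example\n{FENCE}\n\n"
    "# API Reference\n"
    "### Client\n"
    "The client wraps every request.\n"
)

# 1. Split into sections on h1 headers, as parse() does.
sections = [s for s in re.split(r"\n# ", sample) if s.strip()]
titles = [s.split("\n")[0].strip("#").strip() for s in sections]

# 2. Extract fenced code and h2/h3 headings from the first section body.
code_re = re.compile(FENCE + r"(\w+)?\n(.*?)" + FENCE, re.DOTALL)
heading_re = re.compile(r"^(#{2,3})\s+(.+)$", re.MULTILINE)

first_body = "\n".join(sections[0].split("\n")[1:])
code_samples = [(lang or "unknown", code.strip()) for lang, code in code_re.findall(first_body)]
headings = [(f"h{len(marks)}", text.strip()) for marks, text in heading_re.findall(first_body)]
```

Only the first section keeps its leading `#` after the split, which is why the title is cleaned with `strip('#')` rather than assumed to be bare.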

@@ -0,0 +1,285 @@
#!/usr/bin/env python3
"""
Skill Seekers - Unified CLI Entry Point
Provides a git-style unified command-line interface for all Skill Seekers tools.
Usage:
skill-seekers <command> [options]
Commands:
scrape Scrape documentation website
github Scrape GitHub repository
pdf Extract from PDF file
unified Multi-source scraping (docs + GitHub + PDF)
enhance AI-powered enhancement (local, no API key)
package Package skill into .zip file
upload Upload skill to Claude
estimate Estimate page count before scraping
Examples:
skill-seekers scrape --config configs/react.json
skill-seekers github --repo microsoft/TypeScript
skill-seekers unified --config configs/react_unified.json
skill-seekers package output/react/
"""
import sys
import argparse
from typing import List, Optional
def create_parser() -> argparse.ArgumentParser:
"""Create the main argument parser with subcommands."""
parser = argparse.ArgumentParser(
prog="skill-seekers",
description="Convert documentation, GitHub repos, and PDFs into Claude AI skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Scrape documentation
skill-seekers scrape --config configs/react.json
# Scrape GitHub repository
skill-seekers github --repo microsoft/TypeScript --name typescript
# Multi-source scraping (unified)
skill-seekers unified --config configs/react_unified.json
# AI-powered enhancement
skill-seekers enhance output/react/
# Package and upload
skill-seekers package output/react/
skill-seekers upload output/react.zip
For more information: https://github.com/yusufkaraaslan/Skill_Seekers
"""
)
parser.add_argument(
"--version",
action="version",
version="%(prog)s 2.0.0"
)
subparsers = parser.add_subparsers(
dest="command",
title="commands",
description="Available Skill Seekers commands",
help="Command to run"
)
# === scrape subcommand ===
scrape_parser = subparsers.add_parser(
"scrape",
help="Scrape documentation website",
description="Scrape documentation website and generate skill"
)
scrape_parser.add_argument("--config", help="Config JSON file")
scrape_parser.add_argument("--name", help="Skill name")
scrape_parser.add_argument("--url", help="Documentation URL")
scrape_parser.add_argument("--description", help="Skill description")
scrape_parser.add_argument("--skip-scrape", action="store_true", help="Skip scraping, use cached data")
scrape_parser.add_argument("--enhance", action="store_true", help="AI enhancement (API)")
scrape_parser.add_argument("--enhance-local", action="store_true", help="AI enhancement (local)")
scrape_parser.add_argument("--dry-run", action="store_true", help="Dry run mode")
scrape_parser.add_argument("--async", dest="async_mode", action="store_true", help="Use async scraping")
scrape_parser.add_argument("--workers", type=int, help="Number of async workers")
# === github subcommand ===
github_parser = subparsers.add_parser(
"github",
help="Scrape GitHub repository",
description="Scrape GitHub repository and generate skill"
)
github_parser.add_argument("--config", help="Config JSON file")
github_parser.add_argument("--repo", help="GitHub repo (owner/repo)")
github_parser.add_argument("--name", help="Skill name")
github_parser.add_argument("--description", help="Skill description")
# === pdf subcommand ===
pdf_parser = subparsers.add_parser(
"pdf",
help="Extract from PDF file",
description="Extract content from PDF and generate skill"
)
pdf_parser.add_argument("--config", help="Config JSON file")
pdf_parser.add_argument("--pdf", help="PDF file path")
pdf_parser.add_argument("--name", help="Skill name")
pdf_parser.add_argument("--description", help="Skill description")
pdf_parser.add_argument("--from-json", help="Build from extracted JSON")
# === unified subcommand ===
unified_parser = subparsers.add_parser(
"unified",
help="Multi-source scraping (docs + GitHub + PDF)",
description="Combine multiple sources into one skill"
)
unified_parser.add_argument("--config", required=True, help="Unified config JSON file")
unified_parser.add_argument("--merge-mode", help="Merge mode (rule-based, claude-enhanced)")
unified_parser.add_argument("--dry-run", action="store_true", help="Dry run mode")
# === enhance subcommand ===
enhance_parser = subparsers.add_parser(
"enhance",
help="AI-powered enhancement (local, no API key)",
description="Enhance SKILL.md using Claude Code (local)"
)
enhance_parser.add_argument("skill_directory", help="Skill directory path")
# === package subcommand ===
package_parser = subparsers.add_parser(
"package",
help="Package skill into .zip file",
description="Package skill directory into uploadable .zip"
)
package_parser.add_argument("skill_directory", help="Skill directory path")
package_parser.add_argument("--no-open", action="store_true", help="Don't open output folder")
package_parser.add_argument("--upload", action="store_true", help="Auto-upload after packaging")
# === upload subcommand ===
upload_parser = subparsers.add_parser(
"upload",
help="Upload skill to Claude",
description="Upload .zip file to Claude via Anthropic API"
)
upload_parser.add_argument("zip_file", help=".zip file to upload")
upload_parser.add_argument("--api-key", help="Anthropic API key")
# === estimate subcommand ===
estimate_parser = subparsers.add_parser(
"estimate",
help="Estimate page count before scraping",
description="Estimate total pages for documentation scraping"
)
estimate_parser.add_argument("config", help="Config JSON file")
estimate_parser.add_argument("--max-discovery", type=int, help="Max pages to discover")
return parser
def main(argv: Optional[List[str]] = None) -> int:
"""Main entry point for the unified CLI.
Args:
argv: Command-line arguments (defaults to sys.argv[1:] when None)
Returns:
Exit code (0 for success, non-zero for error)
"""
parser = create_parser()
args = parser.parse_args(argv)
if not args.command:
parser.print_help()
return 1
# Delegate to the appropriate tool
try:
if args.command == "scrape":
from skill_seekers.cli.doc_scraper import main as scrape_main
# Convert args namespace to sys.argv format for doc_scraper
sys.argv = ["doc_scraper.py"]
if args.config:
sys.argv.extend(["--config", args.config])
if args.name:
sys.argv.extend(["--name", args.name])
if args.url:
sys.argv.extend(["--url", args.url])
if args.description:
sys.argv.extend(["--description", args.description])
if args.skip_scrape:
sys.argv.append("--skip-scrape")
if args.enhance:
sys.argv.append("--enhance")
if args.enhance_local:
sys.argv.append("--enhance-local")
if args.dry_run:
sys.argv.append("--dry-run")
if args.async_mode:
sys.argv.append("--async")
if args.workers:
sys.argv.extend(["--workers", str(args.workers)])
return scrape_main() or 0
elif args.command == "github":
from skill_seekers.cli.github_scraper import main as github_main
sys.argv = ["github_scraper.py"]
if args.config:
sys.argv.extend(["--config", args.config])
if args.repo:
sys.argv.extend(["--repo", args.repo])
if args.name:
sys.argv.extend(["--name", args.name])
if args.description:
sys.argv.extend(["--description", args.description])
return github_main() or 0
elif args.command == "pdf":
from skill_seekers.cli.pdf_scraper import main as pdf_main
sys.argv = ["pdf_scraper.py"]
if args.config:
sys.argv.extend(["--config", args.config])
if args.pdf:
sys.argv.extend(["--pdf", args.pdf])
if args.name:
sys.argv.extend(["--name", args.name])
if args.description:
sys.argv.extend(["--description", args.description])
if args.from_json:
sys.argv.extend(["--from-json", args.from_json])
return pdf_main() or 0
elif args.command == "unified":
from skill_seekers.cli.unified_scraper import main as unified_main
sys.argv = ["unified_scraper.py", "--config", args.config]
if args.merge_mode:
sys.argv.extend(["--merge-mode", args.merge_mode])
if args.dry_run:
sys.argv.append("--dry-run")
return unified_main() or 0
elif args.command == "enhance":
from skill_seekers.cli.enhance_skill_local import main as enhance_main
sys.argv = ["enhance_skill_local.py", args.skill_directory]
return enhance_main() or 0
elif args.command == "package":
from skill_seekers.cli.package_skill import main as package_main
sys.argv = ["package_skill.py", args.skill_directory]
if args.no_open:
sys.argv.append("--no-open")
if args.upload:
sys.argv.append("--upload")
return package_main() or 0
elif args.command == "upload":
from skill_seekers.cli.upload_skill import main as upload_main
sys.argv = ["upload_skill.py", args.zip_file]
if args.api_key:
sys.argv.extend(["--api-key", args.api_key])
return upload_main() or 0
elif args.command == "estimate":
from skill_seekers.cli.estimate_pages import main as estimate_main
sys.argv = ["estimate_pages.py", args.config]
if args.max_discovery:
sys.argv.extend(["--max-discovery", str(args.max_discovery)])
return estimate_main() or 0
else:
print(f"Error: Unknown command '{args.command}'", file=sys.stderr)
parser.print_help()
return 1
except KeyboardInterrupt:
print("\n\nInterrupted by user", file=sys.stderr)
return 130
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
return 1
if __name__ == "__main__":
sys.exit(main())
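
Each branch of `main()` rebuilds an argv list from the parsed namespace by hand. One possible way to express that pattern once is sketched below; `build_argv` is a hypothetical helper, not part of this commit, and it assumes each flag maps to its default argparse attribute name (so `--async`, which uses `dest="async_mode"`, would still need special handling).

```python
import argparse
from typing import List, Sequence


def build_argv(
    prog: str,
    args: argparse.Namespace,
    value_flags: Sequence[str] = (),
    bool_flags: Sequence[str] = (),
) -> List[str]:
    """Rebuild an argv list from parsed arguments.

    Flags are given in CLI form (e.g. "--skip-scrape"); the attribute name
    is derived the way argparse does by default (dashes become underscores).
    """
    argv = [prog]
    for flag in value_flags:
        value = getattr(args, flag.lstrip("-").replace("-", "_"), None)
        if value is not None:
            argv.extend([flag, str(value)])
    for flag in bool_flags:
        if getattr(args, flag.lstrip("-").replace("-", "_"), False):
            argv.append(flag)
    return argv


# A cut-down version of the scrape subparser, for illustration only.
parser = argparse.ArgumentParser(prog="skill-seekers")
parser.add_argument("--config")
parser.add_argument("--workers", type=int)
parser.add_argument("--skip-scrape", action="store_true")
parser.add_argument("--dry-run", action="store_true")

ns = parser.parse_args(["--config", "configs/react.json", "--workers", "4", "--dry-run"])
argv = build_argv(
    "doc_scraper.py",
    ns,
    value_flags=["--config", "--workers"],
    bool_flags=["--skip-scrape", "--dry-run"],
)
```

Testing values with `is not None` rather than truthiness also forwards an explicit `--workers 0`, which the hand-written `if args.workers:` checks would silently drop.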

@@ -0,0 +1,513 @@
#!/usr/bin/env python3
"""
Source Merger for Multi-Source Skills
Merges documentation and code data intelligently:
- Rule-based merge: Fast, deterministic rules
- Claude-enhanced merge: AI-powered reconciliation
Handles conflicts and creates unified API reference.
"""
import json
import logging
import subprocess
import tempfile
import os
from pathlib import Path
from typing import Dict, List, Any, Optional
from conflict_detector import Conflict, ConflictDetector
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class RuleBasedMerger:
"""
Rule-based API merger using deterministic rules.
Rules:
1. If API only in docs → Include with [DOCS_ONLY] tag
2. If API only in code → Include with [UNDOCUMENTED] tag
3. If both match perfectly → Include normally
4. If conflict → Include both versions with [CONFLICT] tag, prefer code signature
"""
def __init__(self, docs_data: Dict, github_data: Dict, conflicts: List[Conflict]):
"""
Initialize rule-based merger.
Args:
docs_data: Documentation scraper data
github_data: GitHub scraper data
conflicts: List of detected conflicts
"""
self.docs_data = docs_data
self.github_data = github_data
self.conflicts = conflicts
# Build conflict index for fast lookup
self.conflict_index = {c.api_name: c for c in conflicts}
# Extract APIs from both sources
detector = ConflictDetector(docs_data, github_data)
self.docs_apis = detector.docs_apis
self.code_apis = detector.code_apis
def merge_all(self) -> Dict[str, Any]:
"""
Merge all APIs using rule-based logic.
Returns:
Dict containing merged API data
"""
logger.info("Starting rule-based merge...")
merged_apis = {}
# Get all unique API names
all_api_names = set(self.docs_apis.keys()) | set(self.code_apis.keys())
for api_name in sorted(all_api_names):
merged_api = self._merge_single_api(api_name)
merged_apis[api_name] = merged_api
logger.info(f"Merged {len(merged_apis)} APIs")
return {
'merge_mode': 'rule-based',
'apis': merged_apis,
'summary': {
'total_apis': len(merged_apis),
'docs_only': sum(1 for api in merged_apis.values() if api['status'] == 'docs_only'),
'code_only': sum(1 for api in merged_apis.values() if api['status'] == 'code_only'),
'matched': sum(1 for api in merged_apis.values() if api['status'] == 'matched'),
'conflict': sum(1 for api in merged_apis.values() if api['status'] == 'conflict')
}
}
def _merge_single_api(self, api_name: str) -> Dict[str, Any]:
"""
Merge a single API using rules.
Args:
api_name: Name of the API to merge
Returns:
Merged API dict
"""
in_docs = api_name in self.docs_apis
in_code = api_name in self.code_apis
has_conflict = api_name in self.conflict_index
# Rule 1: Only in docs
if in_docs and not in_code:
conflict = self.conflict_index.get(api_name)
return {
'name': api_name,
'status': 'docs_only',
'source': 'documentation',
'data': self.docs_apis[api_name],
'warning': 'This API is documented but not found in codebase',
'conflict': conflict.__dict__ if conflict else None
}
# Rule 2: Only in code
if in_code and not in_docs:
is_private = api_name.startswith('_')
conflict = self.conflict_index.get(api_name)
return {
'name': api_name,
'status': 'code_only',
'source': 'code',
'data': self.code_apis[api_name],
'warning': 'This API exists in code but is not documented' if not is_private else 'Internal/private API',
'conflict': conflict.__dict__ if conflict else None
}
# Both exist - check for conflicts
docs_info = self.docs_apis[api_name]
code_info = self.code_apis[api_name]
# Rule 3: Both match perfectly (no conflict)
if not has_conflict:
return {
'name': api_name,
'status': 'matched',
'source': 'both',
'docs_data': docs_info,
'code_data': code_info,
'merged_signature': self._create_merged_signature(code_info, docs_info),
'merged_description': docs_info.get('docstring') or code_info.get('docstring')
}
# Rule 4: Conflict exists - prefer code signature, keep docs description
conflict = self.conflict_index[api_name]
return {
'name': api_name,
'status': 'conflict',
'source': 'both',
'docs_data': docs_info,
'code_data': code_info,
'conflict': conflict.__dict__,
'resolution': 'prefer_code_signature',
'merged_signature': self._create_merged_signature(code_info, docs_info),
'merged_description': docs_info.get('docstring') or code_info.get('docstring'),
'warning': conflict.difference
}
def _create_merged_signature(self, code_info: Dict, docs_info: Dict) -> str:
"""
Create merged signature preferring code data.
Args:
code_info: API info from code
docs_info: API info from docs
Returns:
Merged signature string
"""
name = code_info.get('name', docs_info.get('name'))
params = code_info.get('parameters', docs_info.get('parameters', []))
return_type = code_info.get('return_type', docs_info.get('return_type'))
# Build parameter string
param_strs = []
for param in params:
param_str = param['name']
if param.get('type_hint'):
param_str += f": {param['type_hint']}"
if param.get('default') is not None:
param_str += f" = {param['default']}"
param_strs.append(param_str)
signature = f"{name}({', '.join(param_strs)})"
if return_type:
signature += f" -> {return_type}"
return signature
class ClaudeEnhancedMerger:
"""
Claude-enhanced API merger using local Claude Code.
Opens Claude Code in a new terminal to intelligently reconcile conflicts.
Uses the same approach as enhance_skill_local.py.
"""
def __init__(self, docs_data: Dict, github_data: Dict, conflicts: List[Conflict]):
"""
Initialize Claude-enhanced merger.
Args:
docs_data: Documentation scraper data
github_data: GitHub scraper data
conflicts: List of detected conflicts
"""
self.docs_data = docs_data
self.github_data = github_data
self.conflicts = conflicts
# First do rule-based merge as baseline
self.rule_merger = RuleBasedMerger(docs_data, github_data, conflicts)
def merge_all(self) -> Dict[str, Any]:
"""
Merge all APIs using Claude enhancement.
Returns:
Dict containing merged API data
"""
logger.info("Starting Claude-enhanced merge...")
# Create temporary workspace
workspace_dir = self._create_workspace()
# Launch Claude Code for enhancement
logger.info("Launching Claude Code for intelligent merging...")
logger.info("Claude will analyze conflicts and create reconciled API reference")
try:
self._launch_claude_merge(workspace_dir)
# Read enhanced results
merged_data = self._read_merged_results(workspace_dir)
logger.info("Claude-enhanced merge complete")
return merged_data
except Exception as e:
logger.error(f"Claude enhancement failed: {e}")
logger.info("Falling back to rule-based merge")
return self.rule_merger.merge_all()
def _create_workspace(self) -> str:
"""
Create temporary workspace with merge context.
Returns:
Path to workspace directory
"""
workspace = tempfile.mkdtemp(prefix='skill_merge_')
logger.info(f"Created merge workspace: {workspace}")
# Write context files for Claude
self._write_context_files(workspace)
return workspace
def _write_context_files(self, workspace: str):
"""Write context files for Claude to analyze."""
# 1. Write conflicts summary
conflicts_file = os.path.join(workspace, 'conflicts.json')
with open(conflicts_file, 'w') as f:
json.dump({
'conflicts': [c.__dict__ for c in self.conflicts],
'summary': {
'total': len(self.conflicts),
'by_type': self._count_by_field('type'),
'by_severity': self._count_by_field('severity')
}
}, f, indent=2)
# 2. Write documentation APIs
docs_apis_file = os.path.join(workspace, 'docs_apis.json')
detector = ConflictDetector(self.docs_data, self.github_data)
with open(docs_apis_file, 'w') as f:
json.dump(detector.docs_apis, f, indent=2)
# 3. Write code APIs
code_apis_file = os.path.join(workspace, 'code_apis.json')
with open(code_apis_file, 'w') as f:
json.dump(detector.code_apis, f, indent=2)
# 4. Write merge instructions for Claude
instructions = """# API Merge Task
You are merging API documentation from two sources:
1. Official documentation (user-facing)
2. Source code analysis (implementation reality)
## Context Files:
- `conflicts.json` - All detected conflicts between sources
- `docs_apis.json` - APIs from documentation
- `code_apis.json` - APIs from source code
## Your Task:
For each conflict, reconcile the differences intelligently:
1. **Prefer code signatures as source of truth**
- Use actual parameter names, types, defaults from code
- Code is what actually runs, docs might be outdated
2. **Keep documentation descriptions**
- Docs are user-friendly, code comments might be technical
- Keep the docs' explanation of what the API does
3. **Add implementation notes for discrepancies**
- If docs differ from code, explain the difference
- Example: "⚠️ The `snap` parameter exists in code but is not documented"
4. **Flag missing APIs clearly**
- Missing in docs → Add [UNDOCUMENTED] tag
- Missing in code → Add [REMOVED] or [DOCS_ERROR] tag
5. **Create unified API reference**
- One definitive signature per API
- Clear warnings about conflicts
- Implementation notes where helpful
## Output Format:
Create `merged_apis.json` with this structure:
```json
{
"apis": {
"API.name": {
"signature": "final_signature_here",
"parameters": [...],
"return_type": "type",
"description": "user-friendly description",
"implementation_notes": "Any discrepancies or warnings",
"source": "both|docs_only|code_only",
"confidence": "high|medium|low"
}
}
}
```
Take your time to analyze each conflict carefully. The goal is to create the most accurate and helpful API reference possible.
"""
instructions_file = os.path.join(workspace, 'MERGE_INSTRUCTIONS.md')
with open(instructions_file, 'w', encoding='utf-8') as f:
f.write(instructions)
logger.info(f"Wrote context files to {workspace}")
def _count_by_field(self, field: str) -> Dict[str, int]:
"""Count conflicts by a specific field."""
counts = {}
for conflict in self.conflicts:
value = getattr(conflict, field)
counts[value] = counts.get(value, 0) + 1
return counts
def _launch_claude_merge(self, workspace: str):
"""
Launch Claude Code to perform merge.
Similar to enhance_skill_local.py approach.
"""
# Create a script that Claude will execute
script_path = os.path.join(workspace, 'merge_script.sh')
script_content = f"""#!/bin/bash
# Automatic merge script for Claude Code
cd "{workspace}"
echo "📊 Analyzing conflicts..."
cat conflicts.json | head -20
echo ""
echo "📖 Documentation APIs: $(cat docs_apis.json | grep -c '\"name\"')"
echo "💻 Code APIs: $(cat code_apis.json | grep -c '\"name\"')"
echo ""
echo "Please review the conflicts and create merged_apis.json"
echo "Follow the instructions in MERGE_INSTRUCTIONS.md"
echo ""
echo "When done, save merged_apis.json and close this terminal."
# Wait for user to complete merge
read -p "Press Enter when merge is complete..."
"""
with open(script_path, 'w', encoding='utf-8') as f:
f.write(script_content)
os.chmod(script_path, 0o755)
# Open new terminal with Claude Code
# Try different terminal emulators
terminals = [
['x-terminal-emulator', '-e'],
['gnome-terminal', '--'],
['xterm', '-e'],
['konsole', '-e']
]
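        for terminal_cmd in terminals:
            try:
                cmd = terminal_cmd + ['bash', script_path]
                subprocess.Popen(cmd)
                logger.info(f"Opened terminal with {terminal_cmd[0]}")
                break
            except FileNotFoundError:
                continue
        else:
            # No terminal emulator could be started; fail fast instead of polling for an hour
            raise RuntimeError("No supported terminal emulator found")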
# Wait for merge to complete
merged_file = os.path.join(workspace, 'merged_apis.json')
logger.info(f"Waiting for merged results at: {merged_file}")
logger.info("Save merged_apis.json in the workspace to continue...")
# Poll for file existence
import time
timeout = 3600 # 1 hour max
elapsed = 0
while not os.path.exists(merged_file) and elapsed < timeout:
time.sleep(5)
elapsed += 5
if not os.path.exists(merged_file):
raise TimeoutError("Claude merge timed out after 1 hour")
def _read_merged_results(self, workspace: str) -> Dict[str, Any]:
"""Read merged results from workspace."""
merged_file = os.path.join(workspace, 'merged_apis.json')
if not os.path.exists(merged_file):
raise FileNotFoundError(f"Merged results not found: {merged_file}")
with open(merged_file, 'r') as f:
merged_data = json.load(f)
return {
'merge_mode': 'claude-enhanced',
**merged_data
}
def merge_sources(docs_data_path: str,
github_data_path: str,
output_path: str,
mode: str = 'rule-based') -> Dict[str, Any]:
"""
Merge documentation and GitHub data.
Args:
docs_data_path: Path to documentation data JSON
github_data_path: Path to GitHub data JSON
output_path: Path to save merged output
mode: 'rule-based' or 'claude-enhanced'
Returns:
Merged data dict
"""
# Load data
with open(docs_data_path, 'r', encoding='utf-8') as f:
docs_data = json.load(f)
with open(github_data_path, 'r', encoding='utf-8') as f:
github_data = json.load(f)
# Detect conflicts
detector = ConflictDetector(docs_data, github_data)
conflicts = detector.detect_all_conflicts()
logger.info(f"Detected {len(conflicts)} conflicts")
# Merge based on mode
if mode == 'claude-enhanced':
merger = ClaudeEnhancedMerger(docs_data, github_data, conflicts)
else:
merger = RuleBasedMerger(docs_data, github_data, conflicts)
merged_data = merger.merge_all()
# Save merged data
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(merged_data, f, indent=2, ensure_ascii=False)
logger.info(f"Merged data saved to: {output_path}")
return merged_data
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser(description='Merge documentation and code sources')
parser.add_argument('docs_data', help='Path to documentation data JSON')
parser.add_argument('github_data', help='Path to GitHub data JSON')
parser.add_argument('--output', '-o', default='merged_data.json', help='Output file path')
parser.add_argument('--mode', '-m', choices=['rule-based', 'claude-enhanced'],
default='rule-based', help='Merge mode')
args = parser.parse_args()
merged = merge_sources(args.docs_data, args.github_data, args.output, args.mode)
# Print summary
summary = merged.get('summary', {})
print(f"\n✅ Merge complete ({merged.get('merge_mode')})")
print(f" Total APIs: {summary.get('total_apis', 0)}")
print(f" Matched: {summary.get('matched', 0)}")
print(f" Docs only: {summary.get('docs_only', 0)}")
print(f" Code only: {summary.get('code_only', 0)}")
print(f" Conflicts: {summary.get('conflict', 0)}")
print(f"\n📄 Saved to: {args.output}")
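
The four rules applied by `RuleBasedMerger._merge_single_api` reduce to a small classification step. The sketch below shows only that step on hypothetical API names; `classify_api` is an illustrative helper that works on plain sets, whereas the real merger also carries the API data, warnings, and merged signatures.

```python
from typing import Dict, Set


def classify_api(name: str, docs: Set[str], code: Set[str], conflicts: Set[str]) -> str:
    """Return the merge status for one API name (assumed to be in docs or code)."""
    in_docs, in_code = name in docs, name in code
    if in_docs and not in_code:
        return "docs_only"   # Rule 1: documented but missing from the codebase
    if in_code and not in_docs:
        return "code_only"   # Rule 2: present in code but undocumented
    # Rules 3 and 4: present in both sources
    return "conflict" if name in conflicts else "matched"


# Hypothetical inputs: names found by each source plus the detected conflicts.
docs = {"move", "rotate", "legacy_scale"}
code = {"move", "rotate", "_internal_tick"}
conflicts = {"rotate"}

statuses = {name: classify_api(name, docs, code, conflicts) for name in sorted(docs | code)}

# Tally per status, analogous to the 'summary' dict returned by merge_all().
summary: Dict[str, int] = {}
for status in statuses.values():
    summary[status] = summary.get(status, 0) + 1
```

The order of the checks matters: the conflict lookup is consulted for the status only after both one-sided cases are ruled out, as in the original method.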

@@ -0,0 +1,81 @@
#!/usr/bin/env python3
"""
Multi-Skill Packager
Package multiple skills at once. Useful for packaging router + sub-skills together.
"""
import sys
import argparse
from pathlib import Path
import subprocess
def package_skill(skill_dir: Path) -> bool:
"""Package a single skill"""
try:
result = subprocess.run(
[sys.executable, str(Path(__file__).parent / "package_skill.py"), str(skill_dir)],
capture_output=True,
text=True
)
return result.returncode == 0
except Exception as e:
print(f"❌ Error packaging {skill_dir}: {e}")
return False
def main():
parser = argparse.ArgumentParser(
description="Package multiple skills at once",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Package all godot skills
python3 package_multi.py output/godot*/
# Package specific skills
python3 package_multi.py output/godot-2d/ output/godot-3d/ output/godot-scripting/
"""
)
parser.add_argument(
'skill_dirs',
nargs='+',
help='Skill directories to package'
)
args = parser.parse_args()
print(f"\n{'='*60}")
print(f"MULTI-SKILL PACKAGER")
print(f"{'='*60}\n")
skill_dirs = [Path(d) for d in args.skill_dirs]
success_count = 0
total_count = len(skill_dirs)
for skill_dir in skill_dirs:
if not skill_dir.exists():
print(f"⚠️ Skipping (not found): {skill_dir}")
continue
if not (skill_dir / "SKILL.md").exists():
print(f"⚠️ Skipping (no SKILL.md): {skill_dir}")
continue
print(f"📦 Packaging: {skill_dir.name}")
if package_skill(skill_dir):
success_count += 1
print(f" ✅ Success")
else:
print(f" ❌ Failed")
print("")
print(f"{'='*60}")
print(f"SUMMARY: {success_count}/{total_count} skills packaged")
print(f"{'='*60}\n")
if __name__ == "__main__":
main()
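
The pre-flight checks in `main()` above (the directory exists and contains `SKILL.md`) can be sketched as a small pure function, which makes the skip reasons easy to test and to report separately from real packaging failures. The helper name `classify_skill_dir` and its return labels are assumptions made for illustration; they are not part of this diff.

```python
import tempfile
from pathlib import Path


def classify_skill_dir(skill_dir: Path) -> str:
    """Return 'not found', 'no SKILL.md', or 'ready' (hypothetical labels)."""
    if not skill_dir.exists():
        return "not found"
    if not (skill_dir / "SKILL.md").exists():
        return "no SKILL.md"
    return "ready"


def classify_demo() -> dict:
    """Build throwaway directories and classify each one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "empty").mkdir()
        (root / "good").mkdir()
        (root / "good" / "SKILL.md").write_text("# Good\n", encoding="utf-8")
        # "missing" is deliberately never created
        return {name: classify_skill_dir(root / name)
                for name in ("missing", "empty", "good")}
```

Keeping the check separate from the subprocess call would also let the summary distinguish skipped directories from failed ones, if that distinction is wanted.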


@@ -0,0 +1,177 @@
#!/usr/bin/env python3
"""
Simple Skill Packager
Packages a skill directory into a .zip file for Claude.
Usage:
python3 cli/package_skill.py output/steam-inventory/
python3 cli/package_skill.py output/react/
python3 cli/package_skill.py output/react/ --no-open # Don't open folder
"""
import os
import sys
import zipfile
import argparse
from pathlib import Path
# Import utilities
try:
from utils import (
open_folder,
print_upload_instructions,
format_file_size,
validate_skill_directory
)
except ImportError:
# If running from different directory, add cli to path
sys.path.insert(0, str(Path(__file__).parent))
from utils import (
open_folder,
print_upload_instructions,
format_file_size,
validate_skill_directory
)
def package_skill(skill_dir, open_folder_after=True):
"""
Package a skill directory into a .zip file
Args:
skill_dir: Path to skill directory
open_folder_after: Whether to open the output folder after packaging
Returns:
tuple: (success, zip_path) where success is bool and zip_path is Path or None
"""
skill_path = Path(skill_dir)
# Validate skill directory
is_valid, error_msg = validate_skill_directory(skill_path)
if not is_valid:
print(f"❌ Error: {error_msg}")
return False, None
# Create zip filename
skill_name = skill_path.name
zip_path = skill_path.parent / f"{skill_name}.zip"
print(f"📦 Packaging skill: {skill_name}")
print(f" Source: {skill_path}")
print(f" Output: {zip_path}")
# Create zip file
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
for root, dirs, files in os.walk(skill_path):
# Skip backup files
files = [f for f in files if not f.endswith('.backup')]
for file in files:
file_path = Path(root) / file
arcname = file_path.relative_to(skill_path)
zf.write(file_path, arcname)
print(f" + {arcname}")
# Get zip size
zip_size = zip_path.stat().st_size
print(f"\n✅ Package created: {zip_path}")
print(f" Size: {zip_size:,} bytes ({format_file_size(zip_size)})")
# Open folder in file browser
if open_folder_after:
print(f"\n📂 Opening folder: {zip_path.parent}")
open_folder(zip_path.parent)
# Print upload instructions
print_upload_instructions(zip_path)
return True, zip_path
def main():
parser = argparse.ArgumentParser(
description="Package a skill directory into a .zip file for Claude",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Package skill and open folder
python3 cli/package_skill.py output/react/
# Package skill without opening folder
python3 cli/package_skill.py output/react/ --no-open
# Get help
python3 cli/package_skill.py --help
"""
)
parser.add_argument(
'skill_dir',
help='Path to skill directory (e.g., output/react/)'
)
parser.add_argument(
'--no-open',
action='store_true',
help='Do not open the output folder after packaging'
)
parser.add_argument(
'--upload',
action='store_true',
help='Automatically upload to Claude after packaging (requires ANTHROPIC_API_KEY)'
)
args = parser.parse_args()
success, zip_path = package_skill(args.skill_dir, open_folder_after=not args.no_open)
if not success:
sys.exit(1)
# Auto-upload if requested
if args.upload:
# Check if API key is set BEFORE attempting upload
api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
if not api_key:
# No API key - show helpful message but DON'T fail
print("\n" + "="*60)
print("💡 Automatic Upload")
print("="*60)
print()
print("To enable automatic upload:")
print(" 1. Get API key from https://console.anthropic.com/")
print(" 2. Set: export ANTHROPIC_API_KEY=sk-ant-...")
print(" 3. Run package_skill.py with --upload flag")
print()
print("For now, use manual upload (instructions above) ☝️")
print("="*60)
# Exit successfully - packaging worked!
sys.exit(0)
# API key exists - try upload
try:
from upload_skill import upload_skill_api
print("\n" + "="*60)
upload_success, message = upload_skill_api(zip_path)
if not upload_success:
print(f"❌ Upload failed: {message}")
print()
print("💡 Try manual upload instead (instructions above) ☝️")
print("="*60)
# Exit successfully - packaging worked even if upload failed
sys.exit(0)
else:
print("="*60)
sys.exit(0)
except ImportError:
print("\n❌ Error: upload_skill.py not found")
sys.exit(1)
sys.exit(0)
if __name__ == "__main__":
main()
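
The archive step in `package_skill()` above walks the skill directory, skips `*.backup` files, and stores each file under a path relative to the skill root. A minimal, self-contained sketch of that step follows; the function names `zip_skill_dir` and `zip_demo` and the sample file names are hypothetical, chosen only for illustration.

```python
import os
import tempfile
import zipfile
from pathlib import Path
from typing import List


def zip_skill_dir(skill_path: Path, zip_path: Path) -> None:
    """Zip skill_path into zip_path, skipping files that end in .backup."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(skill_path):
            for name in files:
                if name.endswith(".backup"):
                    continue
                file_path = Path(root) / name
                # Store paths relative to the skill root, not absolute paths
                zf.write(file_path, file_path.relative_to(skill_path))


def zip_demo() -> List[str]:
    """Create a tiny skill directory, zip it, and list the archive contents."""
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "demo-skill"
        (skill / "references").mkdir(parents=True)
        (skill / "SKILL.md").write_text("# Demo\n", encoding="utf-8")
        (skill / "SKILL.md.backup").write_text("old\n", encoding="utf-8")
        (skill / "references" / "index.md").write_text("# Index\n", encoding="utf-8")
        zip_path = Path(tmp) / "demo-skill.zip"
        zip_skill_dir(skill, zip_path)
        with zipfile.ZipFile(zip_path) as zf:
            return sorted(zf.namelist())
```

The zip file is written next to the skill directory rather than inside it, so the walk never encounters the archive it is building.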

File diff suppressed because it is too large


@@ -0,0 +1,401 @@
#!/usr/bin/env python3
"""
PDF Documentation to Claude Skill Converter (Task B1.6)
Converts PDF documentation into Claude AI skills.
Uses pdf_extractor_poc.py for extraction, builds skill structure.
Usage:
python3 pdf_scraper.py --config configs/manual_pdf.json
python3 pdf_scraper.py --pdf manual.pdf --name myskill
python3 pdf_scraper.py --from-json manual_extracted.json
"""
import os
import sys
import json
import re
import argparse
from pathlib import Path
# Import the PDF extractor
from pdf_extractor_poc import PDFExtractor
class PDFToSkillConverter:
"""Convert PDF documentation to Claude skill"""
def __init__(self, config):
self.config = config
self.name = config['name']
self.pdf_path = config.get('pdf_path', '')
self.description = config.get('description', f'Documentation skill for {self.name}')
# Paths
self.skill_dir = f"output/{self.name}"
self.data_file = f"output/{self.name}_extracted.json"
# Extraction options
self.extract_options = config.get('extract_options', {})
# Categories
self.categories = config.get('categories', {})
# Extracted data
self.extracted_data = None
def extract_pdf(self):
"""Extract content from PDF using pdf_extractor_poc.py"""
print(f"\n🔍 Extracting from PDF: {self.pdf_path}")
# Create extractor with options
extractor = PDFExtractor(
self.pdf_path,
verbose=True,
chunk_size=self.extract_options.get('chunk_size', 10),
min_quality=self.extract_options.get('min_quality', 5.0),
extract_images=self.extract_options.get('extract_images', True),
image_dir=f"{self.skill_dir}/assets/images",
min_image_size=self.extract_options.get('min_image_size', 100)
)
# Extract
result = extractor.extract_all()
if not result:
print("❌ Extraction failed")
raise RuntimeError(f"Failed to extract PDF: {self.pdf_path}")
# Save extracted data
with open(self.data_file, 'w', encoding='utf-8') as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"\n💾 Saved extracted data to: {self.data_file}")
self.extracted_data = result
return True
def load_extracted_data(self, json_path):
"""Load previously extracted data from JSON"""
print(f"\n📂 Loading extracted data from: {json_path}")
with open(json_path, 'r', encoding='utf-8') as f:
self.extracted_data = json.load(f)
print(f"✅ Loaded {self.extracted_data['total_pages']} pages")
return True
def categorize_content(self):
"""Categorize pages based on chapters or keywords"""
print(f"\n📋 Categorizing content...")
categorized = {}
# Use chapters if available
if self.extracted_data.get('chapters'):
for chapter in self.extracted_data['chapters']:
category_key = self._sanitize_filename(chapter['title'])
categorized[category_key] = {
'title': chapter['title'],
'pages': []
}
# Assign pages to chapters
for page in self.extracted_data['pages']:
page_num = page['page_number']
# Find which chapter this page belongs to
for chapter in self.extracted_data['chapters']:
if chapter['start_page'] <= page_num <= chapter['end_page']:
category_key = self._sanitize_filename(chapter['title'])
categorized[category_key]['pages'].append(page)
break
# Fall back to keyword-based categorization
elif self.categories:
# Check if categories is already in the right format (for tests)
# If first value is a list of dicts (pages), use as-is
first_value = next(iter(self.categories.values()))
if isinstance(first_value, list) and first_value and isinstance(first_value[0], dict):
# Already categorized - convert to expected format
for cat_key, pages in self.categories.items():
categorized[cat_key] = {
'title': cat_key.replace('_', ' ').title(),
'pages': pages
}
else:
# Keyword-based categorization
# Initialize categories
for cat_key, keywords in self.categories.items():
categorized[cat_key] = {
'title': cat_key.replace('_', ' ').title(),
'pages': []
}
# Categorize by keywords
for page in self.extracted_data['pages']:
text = page.get('text', '').lower()
headings_text = ' '.join([h['text'] for h in page.get('headings', [])]).lower()
# Score against each category
scores = {}
for cat_key, keywords in self.categories.items():
# Handle both string keywords and dict keywords (shouldn't happen, but be safe)
if isinstance(keywords, list):
score = sum(1 for kw in keywords
if isinstance(kw, str) and (kw.lower() in text or kw.lower() in headings_text))
else:
score = 0
if score > 0:
scores[cat_key] = score
# Assign to highest scoring category
if scores:
best_cat = max(scores, key=scores.get)
categorized[best_cat]['pages'].append(page)
else:
# Default category
if 'other' not in categorized:
categorized['other'] = {'title': 'Other', 'pages': []}
categorized['other']['pages'].append(page)
else:
# No categorization - use single category
categorized['content'] = {
'title': 'Content',
'pages': self.extracted_data['pages']
}
print(f"✅ Created {len(categorized)} categories")
for cat_key, cat_data in categorized.items():
print(f" - {cat_data['title']}: {len(cat_data['pages'])} pages")
return categorized
def build_skill(self):
"""Build complete skill structure"""
print(f"\n🏗️ Building skill: {self.name}")
# Create directories
os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
# Categorize content
categorized = self.categorize_content()
# Generate reference files
print(f"\n📝 Generating reference files...")
for cat_key, cat_data in categorized.items():
self._generate_reference_file(cat_key, cat_data)
# Generate index
self._generate_index(categorized)
# Generate SKILL.md
self._generate_skill_md(categorized)
print(f"\n✅ Skill built successfully: {self.skill_dir}/")
print(f"\n📦 Next step: Package with: python3 cli/package_skill.py {self.skill_dir}/")
def _generate_reference_file(self, cat_key, cat_data):
"""Generate a reference markdown file for a category"""
filename = f"{self.skill_dir}/references/{cat_key}.md"
with open(filename, 'w', encoding='utf-8') as f:
f.write(f"# {cat_data['title']}\n\n")
for page in cat_data['pages']:
# Add headings as section markers
if page.get('headings'):
f.write(f"## {page['headings'][0]['text']}\n\n")
# Add text content
if page.get('text'):
# Limit to first 1000 chars per page to avoid huge files
text = page['text'][:1000]
f.write(f"{text}\n\n")
# Add code samples (check both 'code_samples' and 'code_blocks' for compatibility)
code_list = page.get('code_samples') or page.get('code_blocks')
if code_list:
f.write("### Code Examples\n\n")
for code in code_list[:3]: # Limit to top 3
lang = code.get('language', '')
f.write(f"```{lang}\n{code['code']}\n```\n\n")
# Add images
if page.get('images'):
# Create assets directory if needed
assets_dir = os.path.join(self.skill_dir, 'assets')
os.makedirs(assets_dir, exist_ok=True)
f.write("### Images\n\n")
for img in page['images']:
# Save image to assets
img_filename = f"page_{page['page_number']}_img_{img['index']}.png"
img_path = os.path.join(assets_dir, img_filename)
with open(img_path, 'wb') as img_file:
img_file.write(img['data'])
# Add markdown image reference
f.write(f"![Image {img['index']}](../assets/{img_filename})\n\n")
f.write("---\n\n")
print(f" Generated: {filename}")
def _generate_index(self, categorized):
"""Generate reference index"""
filename = f"{self.skill_dir}/references/index.md"
with open(filename, 'w', encoding='utf-8') as f:
f.write(f"# {self.name.title()} Documentation Reference\n\n")
f.write("## Categories\n\n")
for cat_key, cat_data in categorized.items():
page_count = len(cat_data['pages'])
f.write(f"- [{cat_data['title']}]({cat_key}.md) ({page_count} pages)\n")
f.write("\n## Statistics\n\n")
stats = self.extracted_data.get('quality_statistics', {})
f.write(f"- Total pages: {self.extracted_data.get('total_pages', 0)}\n")
f.write(f"- Code blocks: {self.extracted_data.get('total_code_blocks', 0)}\n")
f.write(f"- Images: {self.extracted_data.get('total_images', 0)}\n")
if stats:
f.write(f"- Average code quality: {stats.get('average_quality', 0):.1f}/10\n")
f.write(f"- Valid code blocks: {stats.get('valid_code_blocks', 0)}\n")
print(f" Generated: {filename}")
    def _generate_skill_md(self, categorized):
        """Generate main SKILL.md file"""
        filename = f"{self.skill_dir}/SKILL.md"
        # Generate skill name (lowercase, underscores/spaces to hyphens, max 64 chars)
        skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
        # Truncate description to 1024 chars (slicing is a no-op for shorter strings)
        desc = self.description[:1024]
        with open(filename, 'w', encoding='utf-8') as f:
            # Write YAML frontmatter
            f.write("---\n")
            f.write(f"name: {skill_name}\n")
            f.write(f"description: {desc}\n")
            f.write("---\n\n")
            f.write(f"# {self.name.title()} Documentation Skill\n\n")
            f.write(f"{self.description}\n\n")
            f.write("## When to use this skill\n\n")
            f.write(f"Use this skill when the user asks about {self.name} documentation, ")
            f.write("including API references, tutorials, examples, and best practices.\n\n")
            f.write("## What's included\n\n")
            f.write("This skill contains:\n\n")
            for cat_key, cat_data in categorized.items():
                f.write(f"- **{cat_data['title']}**: {len(cat_data['pages'])} pages\n")
            f.write("\n## Quick Reference\n\n")
            # Get high-quality code samples (accept both keys, as in _generate_reference_file)
            all_code = []
            for page in self.extracted_data['pages']:
                all_code.extend(page.get('code_samples') or page.get('code_blocks') or [])
            # Sort by quality and get top 5
            all_code.sort(key=lambda x: x.get('quality_score', 0), reverse=True)
            top_code = all_code[:5]
            if top_code:
                f.write("### Top Code Examples\n\n")
                for i, code in enumerate(top_code, 1):
                    # Use .get() so a sample without a language does not raise KeyError
                    lang = code.get('language', '')
                    quality = code.get('quality_score', 0)
                    snippet = code['code']
                    # Only mark the snippet as truncated when it actually is
                    if len(snippet) > 300:
                        snippet = snippet[:300] + "..."
                    f.write(f"**Example {i}** (Quality: {quality:.1f}/10):\n\n")
                    f.write(f"```{lang}\n{snippet}\n```\n\n")
            f.write("## Navigation\n\n")
            f.write("See `references/index.md` for complete documentation structure.\n\n")
            # Add language statistics
            langs = self.extracted_data.get('languages_detected', {})
            if langs:
                f.write("## Languages Covered\n\n")
                for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
                    f.write(f"- {lang}: {count} examples\n")
        print(f" Generated: {filename}")
def _sanitize_filename(self, name):
"""Convert string to safe filename"""
# Remove special chars, replace spaces with underscores
safe = re.sub(r'[^\w\s-]', '', name.lower())
safe = re.sub(r'[-\s]+', '_', safe)
return safe
def main():
parser = argparse.ArgumentParser(
description='Convert PDF documentation to Claude skill',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('--config', help='PDF config JSON file')
parser.add_argument('--pdf', help='Direct PDF file path')
parser.add_argument('--name', help='Skill name (with --pdf)')
parser.add_argument('--from-json', help='Build skill from extracted JSON')
parser.add_argument('--description', help='Skill description')
args = parser.parse_args()
# Validate inputs
if not (args.config or args.pdf or args.from_json):
parser.error("Must specify --config, --pdf, or --from-json")
# Load or create config
if args.config:
with open(args.config, 'r') as f:
config = json.load(f)
elif args.from_json:
# Build from extracted JSON
name = Path(args.from_json).stem.replace('_extracted', '')
config = {
'name': name,
'description': args.description or f'Documentation skill for {name}'
}
converter = PDFToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
return
else:
# Direct PDF mode
if not args.name:
parser.error("Must specify --name with --pdf")
config = {
'name': args.name,
'pdf_path': args.pdf,
'description': args.description or f'Documentation skill for {args.name}',
'extract_options': {
'chunk_size': 10,
'min_quality': 5.0,
'extract_images': True,
'min_image_size': 100
}
}
# Create converter
converter = PDFToSkillConverter(config)
# Extract if needed
if config.get('pdf_path'):
if not converter.extract_pdf():
sys.exit(1)
# Build skill
converter.build_skill()
if __name__ == '__main__':
main()
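
The keyword fallback in `categorize_content()` above scores each page by how many of a category's keywords appear in the page text or headings, then assigns the page to the highest-scoring category (or `other` when nothing matches). A minimal sketch of that scoring rule is below; `pick_category` and the sample data are hypothetical names used only for illustration, and it assumes every keyword is a plain string.

```python
from typing import Dict, List


def pick_category(page: dict, categories: Dict[str, List[str]]) -> str:
    """Return the category whose keywords match the page most often."""
    text = page.get("text", "").lower()
    headings = " ".join(h["text"] for h in page.get("headings", [])).lower()
    scores = {}
    for cat_key, keywords in categories.items():
        # One point per keyword found in the body text or in any heading
        score = sum(1 for kw in keywords
                    if kw.lower() in text or kw.lower() in headings)
        if score > 0:
            scores[cat_key] = score
    # max() keeps the first category on ties, i.e. config order decides
    return max(scores, key=scores.get) if scores else "other"


SAMPLE_CATEGORIES = {"api": ["function", "class"], "guide": ["tutorial"]}
```

For example, `pick_category({"text": "This class exposes a function"}, SAMPLE_CATEGORIES)` scores two hits for `api` and none for `guide`. Because matching is by substring, short keywords can match inside longer words, which is worth keeping in mind when choosing them.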


@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
Test Runner for Skill Seeker
Runs all test suites and generates a comprehensive test report
"""
import sys
import unittest
class ColoredTextTestResult(unittest.TextTestResult):
"""Custom test result class with colored output"""
# ANSI color codes
GREEN = '\033[92m'
RED = '\033[91m'
YELLOW = '\033[93m'
BLUE = '\033[94m'
RESET = '\033[0m'
BOLD = '\033[1m'
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.test_results = []
def addSuccess(self, test):
super().addSuccess(test)
self.test_results.append(('PASS', test))
if self.showAll:
self.stream.write(f"{self.GREEN}✓ PASS{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.GREEN}.{self.RESET}")
self.stream.flush()
def addError(self, test, err):
super().addError(test, err)
self.test_results.append(('ERROR', test))
if self.showAll:
self.stream.write(f"{self.RED}✗ ERROR{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.RED}E{self.RESET}")
self.stream.flush()
def addFailure(self, test, err):
super().addFailure(test, err)
self.test_results.append(('FAIL', test))
if self.showAll:
self.stream.write(f"{self.RED}✗ FAIL{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.RED}F{self.RESET}")
self.stream.flush()
def addSkip(self, test, reason):
super().addSkip(test, reason)
self.test_results.append(('SKIP', test))
if self.showAll:
self.stream.write(f"{self.YELLOW}⊘ SKIP{self.RESET}\n")
elif self.dots:
self.stream.write(f"{self.YELLOW}s{self.RESET}")
self.stream.flush()
class ColoredTextTestRunner(unittest.TextTestRunner):
"""Custom test runner with colored output"""
resultclass = ColoredTextTestResult
def discover_tests(test_dir='tests'):
"""Discover all test files in the tests directory"""
loader = unittest.TestLoader()
start_dir = test_dir
pattern = 'test_*.py'
suite = loader.discover(start_dir, pattern=pattern)
return suite
def run_specific_suite(suite_name):
"""Run a specific test suite"""
loader = unittest.TestLoader()
suite_map = {
'config': 'tests.test_config_validation',
'features': 'tests.test_scraper_features',
'integration': 'tests.test_integration'
}
if suite_name not in suite_map:
print(f"Unknown test suite: {suite_name}")
print(f"Available suites: {', '.join(suite_map.keys())}")
return None
module_name = suite_map[suite_name]
try:
suite = loader.loadTestsFromName(module_name)
return suite
except Exception as e:
print(f"Error loading test suite '{suite_name}': {e}")
return None
def print_summary(result):
    """Print a detailed test summary"""
    total = result.testsRun
    failed = len(result.failures)
    errors = len(result.errors)
    skipped = len(result.skipped)
    passed = total - failed - errors - skipped
    print("\n" + "="*70)
    print("TEST SUMMARY")
    print("="*70)
    # Overall stats
    print(f"\n{ColoredTextTestResult.BOLD}Total Tests:{ColoredTextTestResult.RESET} {total}")
    print(f"{ColoredTextTestResult.GREEN}✓ Passed:{ColoredTextTestResult.RESET} {passed}")
    if failed > 0:
        print(f"{ColoredTextTestResult.RED}✗ Failed:{ColoredTextTestResult.RESET} {failed}")
    if errors > 0:
        print(f"{ColoredTextTestResult.RED}✗ Errors:{ColoredTextTestResult.RESET} {errors}")
    if skipped > 0:
        print(f"{ColoredTextTestResult.YELLOW}⊘ Skipped:{ColoredTextTestResult.RESET} {skipped}")
    # Success rate
    if total > 0:
        success_rate = (passed / total) * 100
        color = ColoredTextTestResult.GREEN if success_rate == 100 else \
            ColoredTextTestResult.YELLOW if success_rate >= 80 else \
            ColoredTextTestResult.RED
        print(f"\n{color}Success Rate: {success_rate:.1f}%{ColoredTextTestResult.RESET}")
    # Category breakdown
    if hasattr(result, 'test_results'):
        print(f"\n{ColoredTextTestResult.BOLD}Test Breakdown by Category:{ColoredTextTestResult.RESET}")
        categories = {}
        for status, test in result.test_results:
            # Group by test class. str(test) looks like "test_x (module.Class)",
            # so splitting it on '.' yields "(module", not the class name.
            class_name = type(test).__name__
            if class_name not in categories:
                categories[class_name] = {'PASS': 0, 'FAIL': 0, 'ERROR': 0, 'SKIP': 0}
            categories[class_name][status] += 1
        for category, stats in sorted(categories.items()):
            total_cat = sum(stats.values())
            passed_cat = stats['PASS']
            print(f" {category}: {passed_cat}/{total_cat} passed")
    print("\n" + "="*70)
    # Return status
    return failed == 0 and errors == 0
def main():
"""Main test runner"""
import argparse
parser = argparse.ArgumentParser(
description='Run tests for Skill Seeker',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('--suite', '-s', type=str,
help='Run specific test suite (config, features, integration)')
parser.add_argument('--verbose', '-v', action='store_true',
help='Verbose output (show each test)')
parser.add_argument('--quiet', '-q', action='store_true',
help='Quiet output (minimal output)')
parser.add_argument('--failfast', '-f', action='store_true',
help='Stop on first failure')
parser.add_argument('--list', '-l', action='store_true',
help='List all available tests')
args = parser.parse_args()
# Set verbosity
verbosity = 1
if args.verbose:
verbosity = 2
elif args.quiet:
verbosity = 0
print(f"\n{ColoredTextTestResult.BOLD}{'='*70}{ColoredTextTestResult.RESET}")
print(f"{ColoredTextTestResult.BOLD}SKILL SEEKER TEST SUITE{ColoredTextTestResult.RESET}")
print(f"{ColoredTextTestResult.BOLD}{'='*70}{ColoredTextTestResult.RESET}\n")
# Discover or load specific suite
if args.suite:
print(f"Running test suite: {ColoredTextTestResult.BLUE}{args.suite}{ColoredTextTestResult.RESET}\n")
suite = run_specific_suite(args.suite)
if suite is None:
return 1
else:
print(f"Running {ColoredTextTestResult.BLUE}all tests{ColoredTextTestResult.RESET}\n")
suite = discover_tests()
# List tests
if args.list:
print("\nAvailable tests:\n")
for test_group in suite:
for test in test_group:
print(f" - {test}")
print()
return 0
# Run tests
runner = ColoredTextTestRunner(
verbosity=verbosity,
failfast=args.failfast
)
result = runner.run(suite)
# Print summary
success = print_summary(result)
# Return appropriate exit code
return 0 if success else 1
if __name__ == '__main__':
sys.exit(main())
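
The `--list` branch in `main()` above iterates two levels into the suite, but `TestLoader.discover()` and `loadTestsFromName()` return suites nested to different depths. A depth-independent walk can be sketched as follows; `iter_tests`, `SampleTests`, and `list_demo` are hypothetical names introduced only for this illustration.

```python
import unittest


def iter_tests(suite):
    """Yield individual test cases from an arbitrarily nested TestSuite."""
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            yield from iter_tests(item)
        else:
            yield item


class SampleTests(unittest.TestCase):
    def test_one(self):
        self.assertTrue(True)

    def test_two(self):
        self.assertEqual(1 + 1, 2)


def list_demo():
    """Wrap a class-level suite in an extra suite and flatten it again."""
    inner = unittest.TestLoader().loadTestsFromTestCase(SampleTests)
    nested = unittest.TestSuite([unittest.TestSuite([inner])])
    # type(test).__name__ is the test class; id() ends with the method name
    return [(type(t).__name__, t.id().rsplit(".", 1)[-1])
            for t in iter_tests(nested)]
```

The same `type(test).__name__` lookup gives a reliable class name for a per-class breakdown, independent of how `str(test)` is formatted in a given Python version.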


@@ -0,0 +1,320 @@
#!/usr/bin/env python3
"""
Config Splitter for Large Documentation Sites
Splits large documentation configs into multiple smaller, focused skill configs.
Supports multiple splitting strategies: category-based, size-based, and automatic.
"""
import json
import sys
import argparse
from pathlib import Path
from typing import Dict, List, Any, Tuple
from collections import defaultdict
class ConfigSplitter:
"""Splits large documentation configs into multiple focused configs"""
def __init__(self, config_path: str, strategy: str = "auto", target_pages: int = 5000):
self.config_path = Path(config_path)
self.strategy = strategy
self.target_pages = target_pages
self.config = self.load_config()
self.base_name = self.config['name']
def load_config(self) -> Dict[str, Any]:
"""Load configuration from file"""
try:
with open(self.config_path, 'r') as f:
return json.load(f)
except FileNotFoundError:
print(f"❌ Error: Config file not found: {self.config_path}")
sys.exit(1)
except json.JSONDecodeError as e:
print(f"❌ Error: Invalid JSON in config file: {e}")
sys.exit(1)
def get_split_strategy(self) -> str:
"""Determine split strategy"""
# Check if strategy is defined in config
if 'split_strategy' in self.config:
config_strategy = self.config['split_strategy']
if config_strategy != "none":
return config_strategy
# Use provided strategy or auto-detect
if self.strategy == "auto":
max_pages = self.config.get('max_pages', 500)
if max_pages < 5000:
print(f" Small documentation ({max_pages} pages) - no splitting needed")
return "none"
elif max_pages < 10000 and 'categories' in self.config:
print(f" Medium documentation ({max_pages} pages) - category split recommended")
return "category"
elif 'categories' in self.config and len(self.config['categories']) >= 3:
print(f" Large documentation ({max_pages} pages) - router + categories recommended")
return "router"
else:
print(f" Large documentation ({max_pages} pages) - size-based split")
return "size"
return self.strategy
    def split_by_category(self, create_router: bool = False) -> List[Dict[str, Any]]:
        """Split config by categories"""
        if 'categories' not in self.config:
            print("❌ Error: No categories defined in config")
            sys.exit(1)
        categories = self.config['categories']
        split_categories = self.config.get('split_config', {}).get('split_by_categories')
        # If specific categories specified, use only those
        if split_categories:
            categories = {k: v for k, v in categories.items() if k in split_categories}
        configs = []
        for category_name, keywords in categories.items():
            # Create new config for this category
            new_config = self.config.copy()
            new_config['name'] = f"{self.base_name}-{category_name}"
            new_config['description'] = f"{self.base_name.capitalize()} - {category_name.replace('_', ' ').title()}. {self.config.get('description', '')}"
            # Update URL patterns to focus on this category.
            # config.copy() is shallow, so copy the nested dict and list before
            # mutating them; otherwise each category's includes leak into the
            # original config and into every later child config.
            url_patterns = dict(new_config.get('url_patterns', {}))
            includes = list(url_patterns.get('include', []))
            # Add category keywords to includes
            for keyword in keywords:
                if keyword.startswith('/'):
                    includes.append(keyword)
            if includes:
                # De-duplicate while preserving order
                url_patterns['include'] = list(dict.fromkeys(includes))
            new_config['url_patterns'] = url_patterns
            # Keep only this category
            new_config['categories'] = {category_name: keywords}
            # Remove split config from child
            if 'split_strategy' in new_config:
                del new_config['split_strategy']
            if 'split_config' in new_config:
                del new_config['split_config']
            # Adjust max_pages estimate
            if 'max_pages' in new_config:
                new_config['max_pages'] = self.target_pages
            configs.append(new_config)
        print(f"✅ Created {len(configs)} category-based configs")
        # Optionally create router config
        if create_router:
            router_config = self.create_router_config(configs)
            configs.insert(0, router_config)
            print(f"✅ Created router config: {router_config['name']}")
        return configs
def split_by_size(self) -> List[Dict[str, Any]]:
"""Split config by size (page count)"""
max_pages = self.config.get('max_pages', 500)
num_splits = (max_pages + self.target_pages - 1) // self.target_pages
configs = []
for i in range(num_splits):
new_config = self.config.copy()
part_num = i + 1
new_config['name'] = f"{self.base_name}-part{part_num}"
new_config['description'] = f"{self.base_name.capitalize()} - Part {part_num}. {self.config.get('description', '')}"
new_config['max_pages'] = self.target_pages
# Remove split config from child
if 'split_strategy' in new_config:
del new_config['split_strategy']
if 'split_config' in new_config:
del new_config['split_config']
configs.append(new_config)
print(f"✅ Created {len(configs)} size-based configs ({self.target_pages} pages each)")
return configs
def create_router_config(self, sub_configs: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Create a router config that references sub-skills"""
router_name = self.config.get('split_config', {}).get('router_name', self.base_name)
router_config = {
"name": router_name,
"description": self.config.get('description', ''),
"base_url": self.config['base_url'],
"selectors": self.config['selectors'],
"url_patterns": self.config.get('url_patterns', {}),
"rate_limit": self.config.get('rate_limit', 0.5),
"max_pages": 500, # Router only needs overview pages
"_router": True,
"_sub_skills": [cfg['name'] for cfg in sub_configs],
"_routing_keywords": {
cfg['name']: list(cfg.get('categories', {}).keys())
for cfg in sub_configs
}
}
return router_config
def split(self) -> List[Dict[str, Any]]:
"""Execute split based on strategy"""
strategy = self.get_split_strategy()
print(f"\n{'='*60}")
print(f"CONFIG SPLITTER: {self.base_name}")
print(f"{'='*60}")
print(f"Strategy: {strategy}")
print(f"Target pages per skill: {self.target_pages}")
print("")
if strategy == "none":
print(" No splitting required")
return [self.config]
elif strategy == "category":
return self.split_by_category(create_router=False)
elif strategy == "router":
create_router = self.config.get('split_config', {}).get('create_router', True)
return self.split_by_category(create_router=create_router)
elif strategy == "size":
return self.split_by_size()
else:
print(f"❌ Error: Unknown strategy: {strategy}")
sys.exit(1)
def save_configs(self, configs: List[Dict[str, Any]], output_dir: Path = None) -> List[Path]:
"""Save configs to files"""
if output_dir is None:
output_dir = self.config_path.parent
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
saved_files = []
for config in configs:
filename = f"{config['name']}.json"
filepath = output_dir / filename
with open(filepath, 'w') as f:
json.dump(config, f, indent=2)
saved_files.append(filepath)
print(f" 💾 Saved: {filepath}")
return saved_files
def main():
parser = argparse.ArgumentParser(
description="Split large documentation configs into multiple focused skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Auto-detect strategy
python3 split_config.py configs/godot.json
# Use category-based split
python3 split_config.py configs/godot.json --strategy category
# Use router + categories
python3 split_config.py configs/godot.json --strategy router
# Custom target size
python3 split_config.py configs/godot.json --target-pages 3000
# Dry run (don't save files)
python3 split_config.py configs/godot.json --dry-run
Split Strategies:
none - No splitting (single skill)
auto - Automatically choose best strategy
category - Split by categories defined in config
router - Create router + category-based sub-skills
size - Split by page count
"""
)
parser.add_argument(
'config',
help='Path to config file (e.g., configs/godot.json)'
)
parser.add_argument(
'--strategy',
choices=['auto', 'none', 'category', 'router', 'size'],
default='auto',
help='Splitting strategy (default: auto)'
)
parser.add_argument(
'--target-pages',
type=int,
default=5000,
help='Target pages per skill (default: 5000)'
)
parser.add_argument(
'--output-dir',
help='Output directory for configs (default: same as input)'
)
parser.add_argument(
'--dry-run',
action='store_true',
help='Show what would be created without saving files'
)
args = parser.parse_args()
# Create splitter
splitter = ConfigSplitter(args.config, args.strategy, args.target_pages)
# Split config
configs = splitter.split()
if args.dry_run:
print(f"\n{'='*60}")
print("DRY RUN - No files saved")
print(f"{'='*60}")
print(f"Would create {len(configs)} config files:")
for cfg in configs:
is_router = cfg.get('_router', False)
router_marker = " (ROUTER)" if is_router else ""
print(f" 📄 {cfg['name']}.json{router_marker}")
else:
print(f"\n{'='*60}")
print("SAVING CONFIGS")
print(f"{'='*60}")
saved_files = splitter.save_configs(configs, args.output_dir)
print(f"\n{'='*60}")
print("NEXT STEPS")
print(f"{'='*60}")
print("1. Review generated configs")
print("2. Scrape each config:")
for filepath in saved_files:
print(f" python3 cli/doc_scraper.py --config {filepath}")
print("3. Package skills:")
print(" python3 cli/package_multi.py output/<name>*/")
print("")
if __name__ == "__main__":
main()
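
Two details of the splitter above are easy to get wrong: the ceiling division that `split_by_size()` uses to decide how many parts to create, and the fact that `dict.copy()` is shallow, so nested values such as `url_patterns` remain shared between the parent config and its children. The sketch below illustrates both; `num_parts`, `child_config`, and the sample config are hypothetical and not taken from this diff.

```python
import copy
from typing import Any, Dict, List


def num_parts(max_pages: int, target_pages: int) -> int:
    """Ceiling division: how many parts of target_pages cover max_pages."""
    return (max_pages + target_pages - 1) // target_pages


def child_config(base: Dict[str, Any], name: str,
                 extra_includes: List[str]) -> Dict[str, Any]:
    """Derive a child config without mutating the parent."""
    child = copy.deepcopy(base)  # nested dicts and lists are copied too
    child["name"] = name
    includes = child.setdefault("url_patterns", {}).setdefault("include", [])
    for pattern in extra_includes:
        if pattern not in includes:  # de-duplicate, keep insertion order
            includes.append(pattern)
    return child


BASE = {"name": "godot", "url_patterns": {"include": ["/docs"]}}
CHILD = child_config(BASE, "godot-2d", ["/2d", "/docs"])
```

`copy.deepcopy` is the simplest way to isolate children; copying only the nested structures that are about to be modified is a lighter alternative when configs are large.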


@@ -0,0 +1,192 @@
#!/usr/bin/env python3
"""
Simple Integration Tests for Unified Multi-Source Scraper
Focuses on real-world usage patterns rather than unit tests.
"""
import os
import sys
import json
import tempfile
from pathlib import Path
# Add CLI to path
sys.path.insert(0, str(Path(__file__).parent))
from config_validator import validate_config
def test_validate_existing_unified_configs():
"""Test that all existing unified configs are valid"""
configs_dir = Path(__file__).parent.parent / 'configs'
unified_configs = [
'godot_unified.json',
'react_unified.json',
'django_unified.json',
'fastapi_unified.json'
]
for config_name in unified_configs:
config_path = configs_dir / config_name
if config_path.exists():
print(f"\n✓ Validating {config_name}...")
validator = validate_config(str(config_path))
assert validator.is_unified, f"{config_name} should be unified format"
assert validator.needs_api_merge(), f"{config_name} should need API merging"
print(f" Sources: {len(validator.config['sources'])}")
print(f" Merge mode: {validator.config.get('merge_mode')}")
def test_backward_compatibility():
"""Test that legacy configs still work"""
configs_dir = Path(__file__).parent.parent / 'configs'
legacy_configs = [
'react.json',
'godot.json',
'django.json'
]
for config_name in legacy_configs:
config_path = configs_dir / config_name
if config_path.exists():
print(f"\n✓ Validating legacy {config_name}...")
validator = validate_config(str(config_path))
assert not validator.is_unified, f"{config_name} should be legacy format"
print(f" Format: Legacy")
def test_create_temp_unified_config():
"""Test creating a unified config from scratch"""
config = {
"name": "test_unified",
"description": "Test unified config",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://example.com/docs",
"extract_api": True,
"max_pages": 50
},
{
"type": "github",
"repo": "test/repo",
"include_code": True,
"code_analysis_depth": "surface"
}
]
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
json.dump(config, f)
config_path = f.name
try:
print("\n✓ Validating temp unified config...")
validator = validate_config(config_path)
assert validator.is_unified
assert validator.needs_api_merge()
assert len(validator.config['sources']) == 2
print(" ✓ Config is valid unified format")
print(f" Sources: {len(validator.config['sources'])}")
finally:
os.unlink(config_path)
def test_mixed_source_types():
"""Test config with documentation, GitHub, and PDF sources"""
config = {
"name": "test_mixed",
"description": "Test mixed sources",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://example.com"
},
{
"type": "github",
"repo": "test/repo"
},
{
"type": "pdf",
"path": "/path/to/manual.pdf"
}
]
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
json.dump(config, f)
config_path = f.name
try:
print("\n✓ Validating mixed source types...")
validator = validate_config(config_path)
assert validator.is_unified
assert len(validator.config['sources']) == 3
# Check each source type
source_types = [s['type'] for s in validator.config['sources']]
assert 'documentation' in source_types
assert 'github' in source_types
assert 'pdf' in source_types
print(" ✓ All 3 source types validated")
finally:
os.unlink(config_path)
def test_config_validation_errors():
"""Test that invalid configs are rejected"""
# Invalid source type
config = {
"name": "test",
"description": "Test",
"sources": [
{"type": "invalid_type", "url": "https://example.com"}
]
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
json.dump(config, f)
config_path = f.name
try:
print("\n✓ Testing invalid source type...")
try:
# validate_config() calls .validate() automatically
validator = validate_config(config_path)
assert False, "Should have raised error for invalid source type"
except ValueError as e:
assert "Invalid" in str(e) or "invalid" in str(e)
print(" ✓ Invalid source type correctly rejected")
finally:
os.unlink(config_path)
# Run tests
if __name__ == '__main__':
print("=" * 60)
print("Running Unified Scraper Integration Tests")
print("=" * 60)
try:
test_validate_existing_unified_configs()
test_backward_compatibility()
test_create_temp_unified_config()
test_mixed_source_types()
test_config_validation_errors()
print("\n" + "=" * 60)
print("✅ All integration tests passed!")
print("=" * 60)
except AssertionError as e:
print(f"\n❌ Test failed: {e}")
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
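
Three of the tests above repeat the same pattern: write a config to a `NamedTemporaryFile`, validate it, and unlink the file in a `finally` block. One possible way to factor that out is a small context manager. This is a sketch only; `temp_config_file` is a hypothetical name and is not defined anywhere in the project.

```python
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_config_file(config):
    """Write `config` to a temporary .json file and delete it on exit."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
        json.dump(config, f)
        path = f.name
    try:
        # Hand the path to the caller; the file is closed but still on disk.
        yield path
    finally:
        # Runs whether the body succeeded, failed an assert, or raised.
        os.unlink(path)
```

A test body would then shrink to `with temp_config_file(config) as config_path:` followed by the `validate_config(config_path)` call and its assertions.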


@@ -0,0 +1,449 @@
#!/usr/bin/env python3
"""
Unified Multi-Source Scraper
Orchestrates scraping from multiple sources (documentation, GitHub, PDF),
detects conflicts, merges intelligently, and builds unified skills.
This is the main entry point for unified config workflow.
Usage:
python3 cli/unified_scraper.py --config configs/godot_unified.json
python3 cli/unified_scraper.py --config configs/react_unified.json --merge-mode claude-enhanced
"""
import os
import sys
import json
import logging
import argparse
import subprocess
from pathlib import Path
from typing import Dict, List, Any, Optional
# Import validators and scrapers
try:
from config_validator import ConfigValidator, validate_config
from conflict_detector import ConflictDetector
from merge_sources import RuleBasedMerger, ClaudeEnhancedMerger
from unified_skill_builder import UnifiedSkillBuilder
except ImportError as e:
print(f"Error importing modules: {e}")
print("Make sure you're running from the project root directory")
sys.exit(1)
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
class UnifiedScraper:
"""
Orchestrates multi-source scraping and merging.
Main workflow:
1. Load and validate unified config
2. Scrape all sources (docs, GitHub, PDF)
3. Detect conflicts between sources
4. Merge intelligently (rule-based or Claude-enhanced)
5. Build unified skill
"""
def __init__(self, config_path: str, merge_mode: Optional[str] = None):
"""
Initialize unified scraper.
Args:
config_path: Path to unified config JSON
merge_mode: Override config merge_mode ('rule-based' or 'claude-enhanced')
"""
self.config_path = config_path
# Validate and load config
logger.info(f"Loading config: {config_path}")
self.validator = validate_config(config_path)
self.config = self.validator.config
# Determine merge mode
self.merge_mode = merge_mode or self.config.get('merge_mode', 'rule-based')
logger.info(f"Merge mode: {self.merge_mode}")
# Storage for scraped data
self.scraped_data = {}
# Output paths
self.name = self.config['name']
self.output_dir = f"output/{self.name}"
self.data_dir = f"output/{self.name}_unified_data"
os.makedirs(self.output_dir, exist_ok=True)
os.makedirs(self.data_dir, exist_ok=True)
def scrape_all_sources(self):
"""
Scrape all configured sources.
Routes to appropriate scraper based on source type.
"""
logger.info("=" * 60)
logger.info("PHASE 1: Scraping all sources")
logger.info("=" * 60)
if not self.validator.is_unified:
logger.warning("Config is not unified format, converting...")
self.config = self.validator.convert_legacy_to_unified()
sources = self.config.get('sources', [])
for i, source in enumerate(sources):
source_type = source['type']
logger.info(f"\n[{i+1}/{len(sources)}] Scraping {source_type} source...")
try:
if source_type == 'documentation':
self._scrape_documentation(source)
elif source_type == 'github':
self._scrape_github(source)
elif source_type == 'pdf':
self._scrape_pdf(source)
else:
logger.warning(f"Unknown source type: {source_type}")
except Exception as e:
logger.error(f"Error scraping {source_type}: {e}")
logger.info("Continuing with other sources...")
logger.info(f"\n✅ Scraped {len(self.scraped_data)} sources successfully")
def _scrape_documentation(self, source: Dict[str, Any]):
"""Scrape documentation website."""
# Create temporary config for doc scraper
doc_config = {
'name': f"{self.name}_docs",
'base_url': source['base_url'],
'selectors': source.get('selectors', {}),
'url_patterns': source.get('url_patterns', {}),
'categories': source.get('categories', {}),
'rate_limit': source.get('rate_limit', 0.5),
'max_pages': source.get('max_pages', 100)
}
# Write temporary config
temp_config_path = os.path.join(self.data_dir, 'temp_docs_config.json')
with open(temp_config_path, 'w') as f:
json.dump(doc_config, f, indent=2)
# Run doc_scraper as subprocess
logger.info(f"Scraping documentation from {source['base_url']}")
doc_scraper_path = Path(__file__).parent / "doc_scraper.py"
cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
logger.error(f"Documentation scraping failed: {result.stderr}")
return
# Load scraped data
docs_data_file = f"output/{doc_config['name']}_data/summary.json"
if os.path.exists(docs_data_file):
with open(docs_data_file, 'r') as f:
summary = json.load(f)
self.scraped_data['documentation'] = {
'pages': summary.get('pages', []),
'data_file': docs_data_file
}
logger.info(f"✅ Documentation: {summary.get('total_pages', 0)} pages scraped")
else:
logger.warning("Documentation data file not found")
# Clean up temp config
if os.path.exists(temp_config_path):
os.remove(temp_config_path)
def _scrape_github(self, source: Dict[str, Any]):
"""Scrape GitHub repository."""
sys.path.insert(0, str(Path(__file__).parent))
try:
from github_scraper import GitHubScraper
except ImportError:
logger.error("github_scraper.py not found")
return
# Create config for GitHub scraper
github_config = {
'repo': source['repo'],
'name': f"{self.name}_github",
'github_token': source.get('github_token'),
'include_issues': source.get('include_issues', True),
'max_issues': source.get('max_issues', 100),
'include_changelog': source.get('include_changelog', True),
'include_releases': source.get('include_releases', True),
'include_code': source.get('include_code', True),
'code_analysis_depth': source.get('code_analysis_depth', 'surface'),
'file_patterns': source.get('file_patterns', [])
}
# Scrape
logger.info(f"Scraping GitHub repository: {source['repo']}")
scraper = GitHubScraper(github_config)
github_data = scraper.scrape()
# Save data
github_data_file = os.path.join(self.data_dir, 'github_data.json')
with open(github_data_file, 'w') as f:
json.dump(github_data, f, indent=2, ensure_ascii=False)
self.scraped_data['github'] = {
'data': github_data,
'data_file': github_data_file
}
logger.info(f"✅ GitHub: Repository scraped successfully")
def _scrape_pdf(self, source: Dict[str, Any]):
"""Scrape PDF document."""
sys.path.insert(0, str(Path(__file__).parent))
try:
from pdf_scraper import PDFToSkillConverter
except ImportError:
logger.error("pdf_scraper.py not found")
return
# Create config for PDF scraper
pdf_config = {
'name': f"{self.name}_pdf",
'pdf': source['path'],
'extract_tables': source.get('extract_tables', False),
'ocr': source.get('ocr', False),
'password': source.get('password')
}
# Scrape
logger.info(f"Scraping PDF: {source['path']}")
converter = PDFToSkillConverter(pdf_config)
pdf_data = converter.extract_all()
# Save data
pdf_data_file = os.path.join(self.data_dir, 'pdf_data.json')
with open(pdf_data_file, 'w') as f:
json.dump(pdf_data, f, indent=2, ensure_ascii=False)
self.scraped_data['pdf'] = {
'data': pdf_data,
'data_file': pdf_data_file
}
logger.info(f"✅ PDF: {len(pdf_data.get('pages', []))} pages extracted")
def detect_conflicts(self) -> List:
"""
Detect conflicts between documentation and code.
Only applicable if both documentation and GitHub sources exist.
Returns:
List of conflicts
"""
logger.info("\n" + "=" * 60)
logger.info("PHASE 2: Detecting conflicts")
logger.info("=" * 60)
if not self.validator.needs_api_merge():
logger.info("No API merge needed (only one API source)")
return []
# Get documentation and GitHub data
docs_data = self.scraped_data.get('documentation', {})
github_data = self.scraped_data.get('github', {})
if not docs_data or not github_data:
logger.warning("Missing documentation or GitHub data for conflict detection")
return []
# Load data files
with open(docs_data['data_file'], 'r') as f:
docs_json = json.load(f)
with open(github_data['data_file'], 'r') as f:
github_json = json.load(f)
# Detect conflicts
detector = ConflictDetector(docs_json, github_json)
conflicts = detector.detect_all_conflicts()
# Save conflicts
conflicts_file = os.path.join(self.data_dir, 'conflicts.json')
detector.save_conflicts(conflicts, conflicts_file)
# Print summary
summary = detector.generate_summary(conflicts)
logger.info(f"\n📊 Conflict Summary:")
logger.info(f" Total: {summary['total']}")
logger.info(f" By Type:")
for ctype, count in summary['by_type'].items():
if count > 0:
logger.info(f" - {ctype}: {count}")
logger.info(f" By Severity:")
for severity, count in summary['by_severity'].items():
if count > 0:
emoji = '🔴' if severity == 'high' else '🟡' if severity == 'medium' else '🟢'
logger.info(f" {emoji} {severity}: {count}")
return conflicts
def merge_sources(self, conflicts: List):
"""
Merge data from multiple sources.
Args:
conflicts: List of detected conflicts
"""
logger.info("\n" + "=" * 60)
logger.info(f"PHASE 3: Merging sources ({self.merge_mode})")
logger.info("=" * 60)
if not conflicts:
logger.info("No conflicts to merge")
return None
# Get data files
docs_data = self.scraped_data.get('documentation', {})
github_data = self.scraped_data.get('github', {})
# Load data
with open(docs_data['data_file'], 'r') as f:
docs_json = json.load(f)
with open(github_data['data_file'], 'r') as f:
github_json = json.load(f)
# Choose merger
if self.merge_mode == 'claude-enhanced':
merger = ClaudeEnhancedMerger(docs_json, github_json, conflicts)
else:
merger = RuleBasedMerger(docs_json, github_json, conflicts)
# Merge
merged_data = merger.merge_all()
# Save merged data
merged_file = os.path.join(self.data_dir, 'merged_data.json')
with open(merged_file, 'w') as f:
json.dump(merged_data, f, indent=2, ensure_ascii=False)
logger.info(f"✅ Merged data saved: {merged_file}")
return merged_data
def build_skill(self, merged_data: Optional[Dict] = None):
"""
Build final unified skill.
Args:
merged_data: Merged API data (if conflicts were resolved)
"""
logger.info("\n" + "=" * 60)
logger.info("PHASE 4: Building unified skill")
logger.info("=" * 60)
# Load conflicts if they exist
conflicts = []
conflicts_file = os.path.join(self.data_dir, 'conflicts.json')
if os.path.exists(conflicts_file):
with open(conflicts_file, 'r') as f:
conflicts_data = json.load(f)
conflicts = conflicts_data.get('conflicts', [])
# Build skill
builder = UnifiedSkillBuilder(
self.config,
self.scraped_data,
merged_data,
conflicts
)
builder.build()
logger.info(f"✅ Unified skill built: {self.output_dir}/")
def run(self):
"""
Execute complete unified scraping workflow.
"""
logger.info("\n" + "🚀 " * 20)
logger.info(f"Unified Scraper: {self.config['name']}")
logger.info("🚀 " * 20 + "\n")
try:
# Phase 1: Scrape all sources
self.scrape_all_sources()
# Phase 2: Detect conflicts (if applicable)
conflicts = self.detect_conflicts()
# Phase 3: Merge sources (if conflicts exist)
merged_data = None
if conflicts:
merged_data = self.merge_sources(conflicts)
# Phase 4: Build skill
self.build_skill(merged_data)
logger.info("\n" + "" * 20)
logger.info("Unified scraping complete!")
logger.info("" * 20 + "\n")
logger.info(f"📁 Output: {self.output_dir}/")
logger.info(f"📁 Data: {self.data_dir}/")
except KeyboardInterrupt:
logger.info("\n\n⚠️ Scraping interrupted by user")
sys.exit(1)
except Exception as e:
logger.error(f"\n\n❌ Error during scraping: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
def main():
"""Main entry point."""
parser = argparse.ArgumentParser(
description='Unified multi-source scraper',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic usage with unified config
python3 cli/unified_scraper.py --config configs/godot_unified.json
# Override merge mode
python3 cli/unified_scraper.py --config configs/react_unified.json --merge-mode claude-enhanced
# Backward compatible with legacy configs
python3 cli/unified_scraper.py --config configs/react.json
"""
)
parser.add_argument('--config', '-c', required=True,
help='Path to unified config JSON file')
parser.add_argument('--merge-mode', '-m',
choices=['rule-based', 'claude-enhanced'],
help='Override config merge mode')
args = parser.parse_args()
# Create and run scraper
scraper = UnifiedScraper(args.config, args.merge_mode)
scraper.run()
if __name__ == '__main__':
main()
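
In `_scrape_documentation` above, the temporary config is removed only at the end of the method, so the early `return` on a non-zero exit code leaves `temp_docs_config.json` behind. A minimal sketch of the same write, run, clean-up sequence with `try`/`finally` follows. `run_with_temp_config` is a hypothetical helper, and the command it runs is whatever argv list the caller supplies; it is not the project's `doc_scraper.py` invocation.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_with_temp_config(config, command_prefix):
    """Run a command with a temporary config file and always remove the file.

    `command_prefix` is an argv list; '--config <path>' is appended to it.
    Returns the CompletedProcess and the temporary path that was used.
    """
    fd, temp_path = tempfile.mkstemp(suffix='.json')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            json.dump(config, f, indent=2)
        result = subprocess.run(
            command_prefix + ['--config', temp_path],
            capture_output=True,
            text=True,
        )
        return result, temp_path
    finally:
        # Clean up on success, on failure, and on exceptions alike.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

The caller can still inspect `result.returncode` and `result.stderr` to log a failure, as the method above does, without having to remember the clean-up on every exit path.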


@@ -0,0 +1,444 @@
#!/usr/bin/env python3
"""
Unified Skill Builder
Generates final skill structure from merged multi-source data:
- SKILL.md with merged APIs and conflict warnings
- references/ with organized content by source
- Inline conflict markers (⚠️)
- Separate conflicts summary section
Supports mixed sources (documentation, GitHub, PDF) and highlights
discrepancies transparently.
"""
import os
import json
import logging
from pathlib import Path
from typing import Dict, List, Any, Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class UnifiedSkillBuilder:
"""
Builds unified skill from multi-source data.
"""
def __init__(self, config: Dict, scraped_data: Dict,
merged_data: Optional[Dict] = None, conflicts: Optional[List] = None):
"""
Initialize skill builder.
Args:
config: Unified config dict
scraped_data: Dict of scraped data by source type
merged_data: Merged API data (if conflicts were resolved)
conflicts: List of detected conflicts
"""
self.config = config
self.scraped_data = scraped_data
self.merged_data = merged_data
self.conflicts = conflicts or []
self.name = config['name']
self.description = config['description']
self.skill_dir = f"output/{self.name}"
# Create directories
os.makedirs(self.skill_dir, exist_ok=True)
os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)
def build(self):
"""Build complete skill structure."""
logger.info(f"Building unified skill: {self.name}")
# Generate main SKILL.md
self._generate_skill_md()
# Generate reference files by source
self._generate_references()
# Generate conflicts report (if any)
if self.conflicts:
self._generate_conflicts_report()
logger.info(f"✅ Unified skill built: {self.skill_dir}/")
def _generate_skill_md(self):
"""Generate main SKILL.md file."""
skill_path = os.path.join(self.skill_dir, 'SKILL.md')
# Generate skill name (lowercase, hyphens only, max 64 chars)
skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64]
# Truncate description to 1024 chars if needed
desc = self.description[:1024]
content = f"""---
name: {skill_name}
description: {desc}
---
# {self.name.title()}
{self.description}
## 📚 Sources
This skill combines knowledge from multiple sources:
"""
# List sources
for source in self.config.get('sources', []):
source_type = source['type']
if source_type == 'documentation':
content += f"- ✅ **Documentation**: {source.get('base_url', 'N/A')}\n"
content += f" - Pages: {source.get('max_pages', 100)}\n"
elif source_type == 'github':
content += f"- ✅ **GitHub Repository**: {source.get('repo', 'N/A')}\n"
content += f" - Code Analysis: {source.get('code_analysis_depth', 'surface')}\n"
content += f" - Issues: {source.get('max_issues', 100)}\n"
elif source_type == 'pdf':
content += f"- ✅ **PDF Document**: {source.get('path', 'N/A')}\n"
# Data quality section
if self.conflicts:
content += f"\n## ⚠️ Data Quality\n\n"
content += f"**{len(self.conflicts)} conflicts detected** between sources.\n\n"
# Count by type
by_type = {}
for conflict in self.conflicts:
ctype = conflict.type if hasattr(conflict, 'type') else conflict.get('type', 'unknown')
by_type[ctype] = by_type.get(ctype, 0) + 1
content += "**Conflict Breakdown:**\n"
for ctype, count in by_type.items():
content += f"- {ctype}: {count}\n"
content += f"\nSee `references/conflicts.md` for detailed conflict information.\n"
# Merged API section (if available)
if self.merged_data:
content += self._format_merged_apis()
# Quick reference from each source
content += "\n## 📖 Reference Documentation\n\n"
content += "Organized by source:\n\n"
for source in self.config.get('sources', []):
source_type = source['type']
content += f"- [{source_type.title()}](references/{source_type}/)\n"
# When to use this skill
content += f"\n## 💡 When to Use This Skill\n\n"
content += f"Use this skill when you need to:\n"
content += f"- Understand how to use {self.name}\n"
content += f"- Look up API documentation\n"
content += f"- Find usage examples\n"
if 'github' in self.scraped_data:
content += f"- Check for known issues or recent changes\n"
content += f"- Review release history\n"
content += "\n---\n\n"
content += "*Generated by Skill Seeker's unified multi-source scraper*\n"
with open(skill_path, 'w', encoding='utf-8') as f:
f.write(content)
logger.info(f"Created SKILL.md")
def _format_merged_apis(self) -> str:
"""Format merged APIs section with inline conflict warnings."""
if not self.merged_data:
return ""
content = "\n## 🔧 API Reference\n\n"
content += "*Merged from documentation and code analysis*\n\n"
apis = self.merged_data.get('apis', {})
if not apis:
return content + "*No APIs to display*\n"
# Group APIs by status
matched = {k: v for k, v in apis.items() if v.get('status') == 'matched'}
conflicts = {k: v for k, v in apis.items() if v.get('status') == 'conflict'}
docs_only = {k: v for k, v in apis.items() if v.get('status') == 'docs_only'}
code_only = {k: v for k, v in apis.items() if v.get('status') == 'code_only'}
# Show matched APIs first
if matched:
content += "### ✅ Verified APIs\n\n"
content += "*Documentation and code agree*\n\n"
for api_name, api_data in list(matched.items())[:10]: # Limit to first 10
content += self._format_api_entry(api_data, inline_conflict=False)
# Show conflicting APIs with warnings
if conflicts:
content += "\n### ⚠️ APIs with Conflicts\n\n"
content += "*Documentation and code differ*\n\n"
for api_name, api_data in list(conflicts.items())[:10]:
content += self._format_api_entry(api_data, inline_conflict=True)
# Show undocumented APIs
if code_only:
content += f"\n### 💻 Undocumented APIs\n\n"
content += f"*Found in code but not in documentation ({len(code_only)} total)*\n\n"
for api_name, api_data in list(code_only.items())[:5]:
content += self._format_api_entry(api_data, inline_conflict=False)
# Show removed/missing APIs
if docs_only:
content += f"\n### 📖 Documentation-Only APIs\n\n"
content += f"*Documented but not found in code ({len(docs_only)} total)*\n\n"
for api_name, api_data in list(docs_only.items())[:5]:
content += self._format_api_entry(api_data, inline_conflict=False)
content += f"\n*See references/api/ for complete API documentation*\n"
return content
def _format_api_entry(self, api_data: Dict, inline_conflict: bool = False) -> str:
"""Format a single API entry."""
name = api_data.get('name', 'Unknown')
signature = api_data.get('merged_signature', name)
description = api_data.get('merged_description', '')
warning = api_data.get('warning', '')
entry = f"#### `{signature}`\n\n"
if description:
entry += f"{description}\n\n"
# Add inline conflict warning
if inline_conflict and warning:
entry += f"⚠️ **Conflict**: {warning}\n\n"
# Show both versions if available
conflict = api_data.get('conflict', {})
if conflict:
docs_info = conflict.get('docs_info')
code_info = conflict.get('code_info')
if docs_info and code_info:
entry += "**Documentation says:**\n"
entry += f"```\n{docs_info.get('raw_signature', 'N/A')}\n```\n\n"
entry += "**Code implementation:**\n"
entry += f"```\n{self._format_code_signature(code_info)}\n```\n\n"
# Add source info
source = api_data.get('source', 'unknown')
entry += f"*Source: {source}*\n\n"
entry += "---\n\n"
return entry
def _format_code_signature(self, code_info: Dict) -> str:
"""Format code signature for display."""
name = code_info.get('name', '')
params = code_info.get('parameters', [])
return_type = code_info.get('return_type')
param_strs = []
for param in params:
param_str = param.get('name', '')
if param.get('type_hint'):
param_str += f": {param['type_hint']}"
if param.get('default') is not None:
param_str += f" = {param['default']}"
param_strs.append(param_str)
sig = f"{name}({', '.join(param_strs)})"
if return_type:
sig += f" -> {return_type}"
return sig
def _generate_references(self):
"""Generate reference files organized by source."""
logger.info("Generating reference files...")
# Generate references for each source type
if 'documentation' in self.scraped_data:
self._generate_docs_references()
if 'github' in self.scraped_data:
self._generate_github_references()
if 'pdf' in self.scraped_data:
self._generate_pdf_references()
# Generate merged API reference if available
if self.merged_data:
self._generate_merged_api_reference()
def _generate_docs_references(self):
"""Generate references from documentation source."""
docs_dir = os.path.join(self.skill_dir, 'references', 'documentation')
os.makedirs(docs_dir, exist_ok=True)
# Create index
index_path = os.path.join(docs_dir, 'index.md')
with open(index_path, 'w') as f:
f.write("# Documentation\n\n")
f.write("Reference from official documentation.\n\n")
logger.info("Created documentation references")
def _generate_github_references(self):
"""Generate references from GitHub source."""
github_dir = os.path.join(self.skill_dir, 'references', 'github')
os.makedirs(github_dir, exist_ok=True)
github_data = self.scraped_data['github']['data']
# Create README reference
if github_data.get('readme'):
readme_path = os.path.join(github_dir, 'README.md')
with open(readme_path, 'w', encoding='utf-8') as f:
f.write("# Repository README\n\n")
f.write(github_data['readme'])
# Create issues reference
if github_data.get('issues'):
issues_path = os.path.join(github_dir, 'issues.md')
with open(issues_path, 'w', encoding='utf-8') as f:
f.write("# GitHub Issues\n\n")
f.write(f"{len(github_data['issues'])} recent issues.\n\n")
for issue in github_data['issues'][:20]:
f.write(f"## #{issue['number']}: {issue['title']}\n\n")
f.write(f"**State**: {issue['state']}\n")
if issue.get('labels'):
f.write(f"**Labels**: {', '.join(issue['labels'])}\n")
f.write(f"**URL**: {issue.get('url', 'N/A')}\n\n")
# Create releases reference
if github_data.get('releases'):
releases_path = os.path.join(github_dir, 'releases.md')
with open(releases_path, 'w', encoding='utf-8') as f:
f.write("# Releases\n\n")
for release in github_data['releases'][:10]:
f.write(f"## {release['tag_name']}: {release.get('name', 'N/A')}\n\n")
f.write(f"**Published**: {(release.get('published_at') or 'N/A')[:10]}\n\n")
if release.get('body'):
f.write(release['body'][:500])
f.write("\n\n")
logger.info("Created GitHub references")
def _generate_pdf_references(self):
"""Generate references from PDF source."""
pdf_dir = os.path.join(self.skill_dir, 'references', 'pdf')
os.makedirs(pdf_dir, exist_ok=True)
# Create index
index_path = os.path.join(pdf_dir, 'index.md')
with open(index_path, 'w') as f:
f.write("# PDF Documentation\n\n")
f.write("Reference from PDF document.\n\n")
logger.info("Created PDF references")
def _generate_merged_api_reference(self):
"""Generate merged API reference file."""
api_dir = os.path.join(self.skill_dir, 'references', 'api')
os.makedirs(api_dir, exist_ok=True)
api_path = os.path.join(api_dir, 'merged_api.md')
with open(api_path, 'w', encoding='utf-8') as f:
f.write("# Merged API Reference\n\n")
f.write("*Combined from documentation and code analysis*\n\n")
apis = self.merged_data.get('apis', {})
for api_name in sorted(apis.keys()):
api_data = apis[api_name]
entry = self._format_api_entry(api_data, inline_conflict=True)
f.write(entry)
logger.info(f"Created merged API reference ({len(apis)} APIs)")
def _generate_conflicts_report(self):
"""Generate detailed conflicts report."""
conflicts_path = os.path.join(self.skill_dir, 'references', 'conflicts.md')
with open(conflicts_path, 'w', encoding='utf-8') as f:
f.write("# Conflict Report\n\n")
f.write(f"Found **{len(self.conflicts)}** conflicts between sources.\n\n")
# Group by severity
high = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'high']
medium = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'medium']
low = [c for c in self.conflicts if (c.get('severity') if isinstance(c, dict) else getattr(c, 'severity', None)) == 'low']
f.write("## Severity Breakdown\n\n")
f.write(f"- 🔴 **High**: {len(high)} (action required)\n")
f.write(f"- 🟡 **Medium**: {len(medium)} (review recommended)\n")
f.write(f"- 🟢 **Low**: {len(low)} (informational)\n\n")
# List high severity conflicts
if high:
f.write("## 🔴 High Severity\n\n")
f.write("*These conflicts require immediate attention*\n\n")
for conflict in high:
api_name = conflict.api_name if hasattr(conflict, 'api_name') else conflict.get('api_name', 'Unknown')
diff = conflict.difference if hasattr(conflict, 'difference') else conflict.get('difference', 'N/A')
f.write(f"### {api_name}\n\n")
f.write(f"**Issue**: {diff}\n\n")
# List medium severity
if medium:
f.write("## 🟡 Medium Severity\n\n")
for conflict in medium[:20]: # Limit to 20
api_name = conflict.api_name if hasattr(conflict, 'api_name') else conflict.get('api_name', 'Unknown')
diff = conflict.difference if hasattr(conflict, 'difference') else conflict.get('difference', 'N/A')
f.write(f"### {api_name}\n\n")
f.write(f"{diff}\n\n")
logger.info(f"Created conflicts report")
if __name__ == '__main__':
# Test with mock data
import sys
if len(sys.argv) < 2:
print("Usage: python unified_skill_builder.py <config.json>")
sys.exit(1)
config_path = sys.argv[1]
with open(config_path, 'r') as f:
config = json.load(f)
# Mock scraped data
scraped_data = {
'github': {
'data': {
'readme': '# Test Repository',
'issues': [],
'releases': []
}
}
}
builder = UnifiedSkillBuilder(config, scraped_data)
builder.build()
print(f"\n✅ Test skill built in: output/{config['name']}/")
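
The builder above accepts conflicts either as objects with attributes or as plain dicts, and repeats the `hasattr(...) else .get(...)` check at each access. A minimal sketch of a single accessor that covers both shapes is shown below. `conflict_field`, `group_by_severity`, and `ExampleConflict` are hypothetical names introduced for illustration; the project's real conflict class lives in `conflict_detector` and is not shown here.

```python
from dataclasses import dataclass


def conflict_field(conflict, field, default=None):
    """Read `field` from a conflict that is either a dict or an object."""
    if isinstance(conflict, dict):
        return conflict.get(field, default)
    return getattr(conflict, field, default)


def group_by_severity(conflicts):
    """Group conflicts into 'high', 'medium' and 'low' buckets."""
    groups = {'high': [], 'medium': [], 'low': []}
    for conflict in conflicts:
        severity = conflict_field(conflict, 'severity')
        # Conflicts with a missing or unrecognised severity are skipped.
        if severity in groups:
            groups[severity].append(conflict)
    return groups


@dataclass
class ExampleConflict:
    """Hypothetical stand-in for an object-style conflict."""
    api_name: str
    severity: str
```

Because the accessor never calls `.get` on a non-dict, an object-style conflict whose severity does not match the one being tested is simply placed in its own bucket instead of raising `AttributeError`.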


@@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
Automatic Skill Uploader
Uploads a skill .zip file to Claude using the Anthropic API
Usage:
# Set API key (one-time)
export ANTHROPIC_API_KEY=sk-ant-...
# Upload skill
python3 upload_skill.py output/react.zip
python3 upload_skill.py output/godot.zip
"""
import os
import sys
import json
import argparse
from pathlib import Path
# Import utilities
try:
from utils import (
get_api_key,
get_upload_url,
print_upload_instructions,
validate_zip_file
)
except ImportError:
sys.path.insert(0, str(Path(__file__).parent))
from utils import (
get_api_key,
get_upload_url,
print_upload_instructions,
validate_zip_file
)
def upload_skill_api(zip_path):
    """
    Upload skill to Claude via Anthropic API

    Args:
        zip_path: Path to skill .zip file

    Returns:
        tuple: (success, message)
    """
    # Check for requests library
    try:
        import requests
    except ImportError:
        return False, "requests library not installed. Run: pip install requests"

    # Validate zip file
    is_valid, error_msg = validate_zip_file(zip_path)
    if not is_valid:
        return False, error_msg

    # Get API key
    api_key = get_api_key()
    if not api_key:
        return False, "ANTHROPIC_API_KEY not set. Run: export ANTHROPIC_API_KEY=sk-ant-..."

    zip_path = Path(zip_path)
    skill_name = zip_path.stem

    print(f"📤 Uploading skill: {skill_name}")
    print(f" Source: {zip_path}")
    print(f" Size: {zip_path.stat().st_size:,} bytes")
    print()

    # Prepare API request
    api_url = "https://api.anthropic.com/v1/skills"
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01"
    }

    try:
        # Read zip file
        with open(zip_path, 'rb') as f:
            zip_data = f.read()

        # Upload skill
        print("⏳ Uploading to Anthropic API...")
        files = {
            'skill': (zip_path.name, zip_data, 'application/zip')
        }
        response = requests.post(
            api_url,
            headers=headers,
            files=files,
            timeout=60
        )

        # Check response
        if response.status_code == 200:
            print()
            print("✅ Skill uploaded successfully!")
            print()
            print("Your skill is now available in Claude at:")
            print(f" {get_upload_url()}")
            print()
            return True, "Upload successful"
        elif response.status_code == 401:
            return False, "Authentication failed. Check your ANTHROPIC_API_KEY"
        elif response.status_code == 400:
            error_msg = response.json().get('error', {}).get('message', 'Unknown error')
            return False, f"Invalid skill format: {error_msg}"
        else:
            error_msg = response.json().get('error', {}).get('message', 'Unknown error')
            return False, f"Upload failed ({response.status_code}): {error_msg}"

    except requests.exceptions.Timeout:
        return False, "Upload timed out. Try again or use manual upload"
    except requests.exceptions.ConnectionError:
        return False, "Connection error. Check your internet connection"
    except Exception as e:
        return False, f"Unexpected error: {str(e)}"


def main():
    parser = argparse.ArgumentParser(
        description="Upload a skill .zip file to Claude via Anthropic API",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Setup:
  1. Get your Anthropic API key from https://console.anthropic.com/
  2. Set the API key:
     export ANTHROPIC_API_KEY=sk-ant-...

Examples:
  # Upload skill
  python3 upload_skill.py output/react.zip

  # Upload with explicit path
  python3 upload_skill.py /path/to/skill.zip

Requirements:
  - ANTHROPIC_API_KEY environment variable must be set
  - requests library (pip install requests)
        """
    )
    parser.add_argument(
        'zip_file',
        help='Path to skill .zip file (e.g., output/react.zip)'
    )
    args = parser.parse_args()

    # Upload skill
    success, message = upload_skill_api(args.zip_file)
    if success:
        sys.exit(0)
    else:
        print(f"\n❌ Upload failed: {message}")
        print()
        print("📝 Manual upload instructions:")
        print_upload_instructions(args.zip_file)
        sys.exit(1)


if __name__ == "__main__":
    main()

src/skill_seekers/cli/utils.py Executable file

@@ -0,0 +1,224 @@
#!/usr/bin/env python3
"""
Utility functions for Skill Seeker CLI tools
"""
import os
import sys
import subprocess
import platform
from pathlib import Path
from typing import Optional, Tuple, Dict, Union


def open_folder(folder_path: Union[str, Path]) -> bool:
    """
    Open a folder in the system file browser

    Args:
        folder_path: Path to folder to open

    Returns:
        bool: True if successful, False otherwise
    """
    folder_path = Path(folder_path).resolve()
    if not folder_path.exists():
        print(f"⚠️ Folder not found: {folder_path}")
        return False

    system = platform.system()
    try:
        if system == "Linux":
            # xdg-open is the standard opener on Linux desktops
            subprocess.run(["xdg-open", str(folder_path)], check=True)
        elif system == "Darwin":  # macOS
            subprocess.run(["open", str(folder_path)], check=True)
        elif system == "Windows":
            # explorer.exe commonly exits non-zero even when the window opens,
            # so its exit code is not treated as a failure
            subprocess.run(["explorer", str(folder_path)], check=False)
        else:
            print(f"⚠️ Unknown operating system: {system}")
            return False
        return True
    except subprocess.CalledProcessError:
        print("⚠️ Could not open folder automatically")
        return False
    except FileNotFoundError:
        print("⚠️ File browser not found on system")
        return False


def has_api_key() -> bool:
    """
    Check if ANTHROPIC_API_KEY is set in environment

    Returns:
        bool: True if API key is set, False otherwise
    """
    api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
    return len(api_key) > 0


def get_api_key() -> Optional[str]:
    """
    Get ANTHROPIC_API_KEY from environment

    Returns:
        str: API key or None if not set
    """
    api_key = os.environ.get('ANTHROPIC_API_KEY', '').strip()
    return api_key if api_key else None


def get_upload_url() -> str:
    """
    Get the Claude skills upload URL

    Returns:
        str: Claude skills upload URL
    """
    return "https://claude.ai/skills"


def print_upload_instructions(zip_path: Union[str, Path]) -> None:
    """
    Print clear upload instructions for manual upload

    Args:
        zip_path: Path to the .zip file to upload
    """
    zip_path = Path(zip_path)

    print()
    print("╔══════════════════════════════════════════════════════════╗")
    print("║ NEXT STEP ║")
    print("╚══════════════════════════════════════════════════════════╝")
    print()
    print(f"📤 Upload to Claude: {get_upload_url()}")
    print()
    print(f"1. Go to {get_upload_url()}")
    print("2. Click \"Upload Skill\"")
    print(f"3. Select: {zip_path}")
    print("4. Done! ✅")
    print()


def format_file_size(size_bytes: int) -> str:
    """
    Format file size in human-readable format

    Args:
        size_bytes: Size in bytes

    Returns:
        str: Formatted size (e.g., "45.3 KB")
    """
    if size_bytes < 1024:
        return f"{size_bytes} bytes"
    elif size_bytes < 1024 * 1024:
        return f"{size_bytes / 1024:.1f} KB"
    else:
        return f"{size_bytes / (1024 * 1024):.1f} MB"


def validate_skill_directory(skill_dir: Union[str, Path]) -> Tuple[bool, Optional[str]]:
    """
    Validate that a directory is a valid skill directory

    Args:
        skill_dir: Path to skill directory

    Returns:
        tuple: (is_valid, error_message)
    """
    skill_path = Path(skill_dir)

    if not skill_path.exists():
        return False, f"Directory not found: {skill_dir}"
    if not skill_path.is_dir():
        return False, f"Not a directory: {skill_dir}"

    skill_md = skill_path / "SKILL.md"
    if not skill_md.exists():
        return False, f"SKILL.md not found in {skill_dir}"

    return True, None


def validate_zip_file(zip_path: Union[str, Path]) -> Tuple[bool, Optional[str]]:
    """
    Validate that a file is a valid skill .zip file

    Args:
        zip_path: Path to .zip file

    Returns:
        tuple: (is_valid, error_message)
    """
    zip_path = Path(zip_path)

    if not zip_path.exists():
        return False, f"File not found: {zip_path}"
    if not zip_path.is_file():
        return False, f"Not a file: {zip_path}"
    if zip_path.suffix != '.zip':
        return False, f"Not a .zip file: {zip_path}"

    return True, None


def read_reference_files(skill_dir: Union[str, Path], max_chars: int = 100000, preview_limit: int = 40000) -> Dict[str, str]:
    """Read reference files from a skill directory with size limits.

    This function reads markdown files from the references/ subdirectory
    of a skill, applying both per-file and total content limits.

    Args:
        skill_dir (str or Path): Path to skill directory
        max_chars (int): Maximum total characters to read (default: 100000)
        preview_limit (int): Maximum characters per file (default: 40000)

    Returns:
        dict: Dictionary mapping filename to content

    Example:
        >>> refs = read_reference_files('output/react/', max_chars=50000)
        >>> len(refs)
        5
    """
    skill_path = Path(skill_dir)
    references_dir = skill_path / "references"
    references: Dict[str, str] = {}

    if not references_dir.exists():
        print(f"⚠ No references directory found at {references_dir}")
        return references

    total_chars = 0
    for ref_file in sorted(references_dir.glob("*.md")):
        if ref_file.name == "index.md":
            continue

        content = ref_file.read_text(encoding='utf-8')

        # Limit size per file
        if len(content) > preview_limit:
            content = content[:preview_limit] + "\n\n[Content truncated...]"

        references[ref_file.name] = content
        total_chars += len(content)

        # Stop once the running total passes max_chars
        # (the file that crosses the limit is still kept)
        if total_chars > max_chars:
            print(f" Limiting input to {max_chars:,} characters")
            break

    return references


@@ -0,0 +1,596 @@
# Skill Seeker MCP Server
Model Context Protocol (MCP) server for Skill Seeker - enables Claude Code to generate documentation skills directly.
## What is This?
This MCP server allows Claude Code to use Skill Seeker's tools directly through natural language commands. Instead of running CLI commands manually, you can ask Claude Code to:
- Generate config files for any documentation site
- Estimate page counts before scraping
- Scrape documentation and build skills
- Package skills into `.zip` files
- List and validate configurations
- Split large documentation (10K-40K+ pages) into focused sub-skills
- Generate intelligent router/hub skills for split documentation
- **NEW:** Scrape PDF documentation and extract code/images
## Quick Start
### 1. Install Dependencies
```bash
# From repository root
pip3 install -r mcp/requirements.txt
pip3 install requests beautifulsoup4
```
### 2. Quick Setup (Automated)
```bash
# Run the setup script
./setup_mcp.sh
# Follow the prompts - it will:
# - Install dependencies
# - Test the server
# - Generate configuration
# - Guide you through Claude Code setup
```
### 3. Manual Setup
Add to `~/.config/claude-code/mcp.json`:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": [
        "/path/to/Skill_Seekers/mcp/server.py"
      ],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
**Replace `/path/to/Skill_Seekers`** with your actual repository path!
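To avoid hand-editing paths, the same entry can be generated with a few lines of Python. This is only a sketch: `build_mcp_entry` is a hypothetical helper, not part of Skill Seeker, and the key names simply mirror the JSON above.

```python
import json
from pathlib import PurePosixPath


def build_mcp_entry(repo_root: str) -> dict:
    """Build the `mcpServers` mapping shown above for a repository path.

    Hypothetical helper for illustration; not part of Skill Seeker.
    """
    root = PurePosixPath(repo_root)
    return {
        "mcpServers": {
            "skill-seeker": {
                "command": "python3",
                "args": [str(root / "mcp" / "server.py")],
                "cwd": str(root),
            }
        }
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into ~/.config/claude-code/mcp.json
    print(json.dumps(build_mcp_entry("/path/to/Skill_Seekers"), indent=2))
```

Pass your real repository path instead of `/path/to/Skill_Seekers` before copying the output.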
### 4. Restart Claude Code
Quit and reopen Claude Code (don't just close the window).
### 5. Test
In Claude Code, type:
```
List all available configs
```
You should see a list of preset configurations (Godot, React, Vue, etc.).
## Available Tools
The MCP server exposes 10 tools:
### 1. `generate_config`
Create a new configuration file for any documentation website.
**Parameters:**
- `name` (required): Skill name (e.g., "tailwind")
- `url` (required): Documentation URL (e.g., "https://tailwindcss.com/docs")
- `description` (required): When to use this skill
- `max_pages` (optional): Maximum pages to scrape (default: 100)
- `rate_limit` (optional): Delay between requests in seconds (default: 0.5)
**Example:**
```
Generate config for Tailwind CSS at https://tailwindcss.com/docs
```
### 2. `estimate_pages`
Estimate how many pages will be scraped from a config (fast, no data downloaded).
**Parameters:**
- `config_path` (required): Path to config file (e.g., "configs/react.json")
- `max_discovery` (optional): Maximum pages to discover (default: 1000)
**Example:**
```
Estimate pages for configs/react.json
```
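This README does not show how the estimate is computed. One plausible approach, sketched here purely as an assumption, is a breadth-first link discovery that stops once `max_discovery` pages have been seen. The `links` mapping stands in for fetching and parsing real pages, and `estimate_pages_sketch` is a hypothetical name.

```python
from collections import deque


def estimate_pages_sketch(start, links, max_discovery=1000):
    """Count pages reachable from `start`, stopping at `max_discovery`.

    `links` maps a URL to the URLs it links to (a stand-in for fetching
    and parsing each page). Hypothetical helper, for illustration only.
    """
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < max_discovery:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
                if len(seen) >= max_discovery:
                    break
    return len(seen)
```

With a cap in place, a result equal to `max_discovery` should be read as "at least this many pages", not as an exact count.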
### 3. `scrape_docs`
Scrape documentation and build Claude skill.
**Parameters:**
- `config_path` (required): Path to config file
- `enhance_local` (optional): Open terminal for local enhancement (default: false)
- `skip_scrape` (optional): Use cached data (default: false)
- `dry_run` (optional): Preview without saving (default: false)
**Example:**
```
Scrape docs using configs/react.json
```
### 4. `package_skill`
Package a skill directory into a `.zip` file ready for Claude upload. Automatically uploads if ANTHROPIC_API_KEY is set.
**Parameters:**
- `skill_dir` (required): Path to skill directory (e.g., "output/react/")
- `auto_upload` (optional): Try to upload automatically if API key is available (default: true)
**Example:**
```
Package skill at output/react/
```
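For orientation, packaging amounts to zipping the skill directory. The sketch below is not the project's packaging script: `package_skill_sketch` is a hypothetical name, and storing paths relative to the skill directory is an assumption about the archive layout.

```python
import zipfile
from pathlib import Path


def package_skill_sketch(skill_dir):
    """Zip `skill_dir` into a sibling `<name>.zip` and return its path.

    Hypothetical stand-in for the real packaging script; it only checks
    for SKILL.md and archives every file under the directory.
    """
    skill_path = Path(skill_dir).resolve()
    if not (skill_path / "SKILL.md").is_file():
        raise FileNotFoundError(f"SKILL.md not found in {skill_path}")

    zip_path = skill_path.parent / f"{skill_path.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(skill_path.rglob("*")):
            if file.is_file():
                # Store paths relative to the skill directory (assumed layout)
                archive.write(file, file.relative_to(skill_path))
    return zip_path
```

Calling it on `output/react/` would write `output/react.zip`, matching the naming used in the examples in this README.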
### 5. `upload_skill`
Upload a skill .zip file to Claude automatically (requires ANTHROPIC_API_KEY).
**Parameters:**
- `skill_zip` (required): Path to skill .zip file (e.g., "output/react.zip")
**Example:**
```
Upload output/react.zip using upload_skill
```
### 6. `list_configs`
List all available preset configurations.
**Parameters:** None
**Example:**
```
List all available configs
```
### 7. `validate_config`
Validate a config file for errors.
**Parameters:**
- `config_path` (required): Path to config file
**Example:**
```
Validate configs/godot.json
```
### 8. `split_config`
Split large documentation config into multiple focused skills. For 10K+ page documentation.
**Parameters:**
- `config_path` (required): Path to config JSON file (e.g., "configs/godot.json")
- `strategy` (optional): Split strategy - "auto", "none", "category", "router", "size" (default: "auto")
- `target_pages` (optional): Target pages per skill (default: 5000)
- `dry_run` (optional): Preview without saving files (default: false)
**Example:**
```
Split configs/godot.json using router strategy with 5000 pages per skill
```
**Strategies:**
- **auto** - Intelligently detects best strategy based on page count and config
- **category** - Split by documentation categories (creates focused sub-skills)
- **router** - Create router/hub skill + specialized sub-skills (RECOMMENDED for 10K+ pages)
- **size** - Split every N pages (for docs without clear categories)
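The **size** strategy is the simplest to picture. A minimal sketch, assuming the split is a plain division into ranges of at most `target_pages` (the real tool works on config files, and `split_by_size` is a hypothetical name):

```python
def split_by_size(total_pages, target_pages=5000):
    """Return (start, end) page ranges of at most `target_pages` each.

    Ranges are half-open: `start` is included, `end` is not.
    Hypothetical helper; the real splitter operates on config files.
    """
    if target_pages <= 0:
        raise ValueError("target_pages must be positive")
    return [
        (start, min(start + target_pages, total_pages))
        for start in range(0, total_pages, target_pages)
    ]
```

Under this assumption, a 40,000-page site with the default target of 5000 would produce eight sub-skills.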
### 9. `generate_router`
Generate router/hub skill for split documentation. Creates intelligent routing to sub-skills.
**Parameters:**
- `config_pattern` (required): Config pattern for sub-skills (e.g., "configs/godot-*.json")
- `router_name` (optional): Router skill name (inferred from configs if not provided)
**Example:**
```
Generate router for configs/godot-*.json
```
**What it does:**
- Analyzes all sub-skill configs
- Extracts routing keywords from categories and names
- Creates router SKILL.md with intelligent routing logic
- Users can ask questions naturally, router directs to appropriate sub-skill
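The router itself is a SKILL.md that Claude reads, not executable code. Still, the keyword-to-sub-skill idea can be illustrated in a few lines. Everything below is a sketch: `ROUTES` and `route_question` are hypothetical, and the keywords are sample values for a Godot split.

```python
# Hypothetical routing table: sub-skill name -> routing keywords
ROUTES = {
    "godot-scripting": ["scripting", "gdscript"],
    "godot-2d": ["2d", "sprites", "tilemap"],
    "godot-3d": ["3d", "meshes", "camera"],
    "godot-physics": ["physics", "collision"],
    "godot-shaders": ["shaders", "visual shader"],
}


def route_question(question, routes, default="godot"):
    """Pick the sub-skill whose keywords appear most often in the question."""
    text = question.lower()
    best, best_hits = default, 0
    for skill, keywords in routes.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits > best_hits:
            best, best_hits = skill, hits
    return best
```

Questions that match no keyword fall back to the router skill itself (`default`).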
### 10. `scrape_pdf`
Scrape PDF documentation and build Claude skill. Extracts text, code blocks, images, and tables from PDF files with advanced features.
**Parameters:**
- `config_path` (optional): Path to PDF config JSON file (e.g., "configs/manual_pdf.json")
- `pdf_path` (optional): Direct PDF path (alternative to config_path)
- `name` (optional): Skill name (required with pdf_path)
- `description` (optional): Skill description
- `from_json` (optional): Build from extracted JSON file (e.g., "output/manual_extracted.json")
- `use_ocr` (optional): Use OCR for scanned PDFs (requires pytesseract)
- `password` (optional): Password for encrypted PDFs
- `extract_tables` (optional): Extract tables from PDF
- `parallel` (optional): Process pages in parallel for faster extraction
- `max_workers` (optional): Number of parallel workers (default: CPU count)
**Examples:**
```
Scrape PDF at docs/manual.pdf and create skill named api-docs
Create skill from configs/example_pdf.json
Build skill from output/manual_extracted.json
Scrape scanned PDF with OCR: --pdf docs/scanned.pdf --ocr
Scrape encrypted PDF: --pdf docs/manual.pdf --password mypassword
Extract tables: --pdf docs/data.pdf --extract-tables
Fast parallel processing: --pdf docs/large.pdf --parallel --workers 8
```
**What it does:**
- Extracts text and markdown from PDF pages
- Detects code blocks using 3 methods (font, indent, pattern)
- Detects programming language with confidence scoring (19+ languages)
- Validates syntax and scores code quality (0-10 scale)
- Extracts images with size filtering
- **NEW:** Extracts tables from PDFs (Priority 2)
- **NEW:** OCR support for scanned PDFs (Priority 2, requires pytesseract + Pillow)
- **NEW:** Password-protected PDF support (Priority 2)
- **NEW:** Parallel page processing for faster extraction (Priority 3)
- **NEW:** Intelligent caching of expensive operations (Priority 3)
- Detects chapters and creates page chunks
- Categorizes content automatically
- Generates complete skill structure (SKILL.md + references)
**Performance:**
- Sequential: ~30-60 seconds per 100 pages
- Parallel (8 workers): ~10-20 seconds per 100 pages (3x faster)
**See:** `docs/PDF_SCRAPER.md` for complete PDF documentation guide
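The `parallel` and `max_workers` options can be pictured with a small worker-pool sketch. Whether the real scraper uses threads or processes is not stated here, so the thread pool is an assumption, and `extract_page` is a placeholder for the actual per-page extraction.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_number):
    """Placeholder for per-page extraction; returns a small dict."""
    return {"page": page_number, "text": f"content of page {page_number}"}


def extract_pages(page_numbers, parallel=False, max_workers=None):
    """Extract pages sequentially or with a worker pool, keeping page order."""
    if not parallel:
        return [extract_page(n) for n in page_numbers]
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order regardless of completion order
        return list(pool.map(extract_page, page_numbers))
```

Because `map()` preserves input order, the parallel path returns the same list as the sequential one.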
## Example Workflows
### Generate a New Skill from Scratch
```
User: Generate config for Svelte at https://svelte.dev/docs
Claude: ✅ Config created: configs/svelte.json
User: Estimate pages for configs/svelte.json
Claude: 📊 Estimated pages: 150
User: Scrape docs using configs/svelte.json
Claude: ✅ Skill created at output/svelte/
User: Package skill at output/svelte/
Claude: ✅ Created: output/svelte.zip
Ready to upload to Claude!
```
### Use Existing Preset
```
User: List all available configs
Claude: [Shows all configs: godot, react, vue, django, fastapi, etc.]
User: Scrape docs using configs/react.json
Claude: ✅ Skill created at output/react/
User: Package skill at output/react/
Claude: ✅ Created: output/react.zip
```
### Validate Before Scraping
```
User: Validate configs/godot.json
Claude: ✅ Config is valid!
Name: godot
Base URL: https://docs.godotengine.org/en/stable/
Max pages: 500
Rate limit: 0.5s
User: Scrape docs using configs/godot.json
Claude: [Starts scraping...]
```
### PDF Documentation - NEW
```
User: Scrape PDF at docs/api-manual.pdf and create skill named api-docs
Claude: 📄 Scraping PDF documentation...
✅ Extracted 120 pages
✅ Found 45 code blocks (Python, JavaScript, C++)
✅ Extracted 12 images
✅ Created skill at output/api-docs/
📦 Package with: python3 cli/package_skill.py output/api-docs/
User: Package skill at output/api-docs/
Claude: ✅ Created: output/api-docs.zip
Ready to upload to Claude!
```
### Large Documentation (40K Pages)
```
User: Estimate pages for configs/godot.json
Claude: 📊 Estimated pages: 40,000
⚠️ Large documentation detected!
💡 Recommend splitting into multiple skills
User: Split configs/godot.json using router strategy
Claude: ✅ Split complete!
Created 5 sub-skills:
- godot-scripting.json (5,000 pages)
- godot-2d.json (8,000 pages)
- godot-3d.json (10,000 pages)
- godot-physics.json (6,000 pages)
- godot-shaders.json (11,000 pages)
User: Scrape all godot sub-skills in parallel
Claude: [Starts scraping all 5 configs in parallel...]
✅ All skills created in 4-8 hours instead of 20-40!
User: Generate router for configs/godot-*.json
Claude: ✅ Router skill created at output/godot/
Routing logic:
- "scripting", "gdscript" → godot-scripting
- "2d", "sprites", "tilemap" → godot-2d
- "3d", "meshes", "camera" → godot-3d
- "physics", "collision" → godot-physics
- "shaders", "visual shader" → godot-shaders
User: Package all godot skills
Claude: ✅ 6 skills packaged:
- godot.zip (router)
- godot-scripting.zip
- godot-2d.zip
- godot-3d.zip
- godot-physics.zip
- godot-shaders.zip
Upload all to Claude!
Users just ask questions naturally - router handles routing!
```
## Architecture
### Server Structure
```
mcp/
├── server.py # Main MCP server
├── requirements.txt # MCP dependencies
└── README.md # This file
```
### How It Works
1. **Claude Code** sends MCP requests to the server
2. **Server** routes requests to appropriate tool functions
3. **Tools** call CLI scripts (`doc_scraper.py`, `estimate_pages.py`, etc.)
4. **CLI scripts** perform actual work (scraping, packaging, etc.)
5. **Results** returned to Claude Code via MCP protocol
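Step 2, routing a request to a tool function, can be sketched as a name-to-handler table. This is an illustration only: the handlers are placeholders, `TOOL_HANDLERS` and `call_tool` are hypothetical names, and the real server returns MCP content objects instead of plain strings.

```python
import asyncio


async def list_configs_tool(args):
    """Placeholder handler; the real tool reads the configs/ directory."""
    return ["godot", "react", "vue"]


async def validate_config_tool(args):
    """Placeholder handler; the real tool inspects the config file."""
    return [f"validated {args['config_path']}"]


# Hypothetical routing table: MCP tool name -> async handler
TOOL_HANDLERS = {
    "list_configs": list_configs_tool,
    "validate_config": validate_config_tool,
}


async def call_tool(name, arguments):
    """Route a tool request to its handler, or report an unknown tool."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return [f"Unknown tool: {name}"]
    return await handler(arguments)


if __name__ == "__main__":
    print(asyncio.run(call_tool("list_configs", {})))
```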
### Tool Implementation
Each tool is implemented as an async function:
```python
from mcp.types import TextContent


async def generate_config_tool(args: dict) -> list[TextContent]:
    """Generate a config file"""
    # Create config JSON
    # Save to configs/
    # Return success message
```
Tools use `subprocess.run()` to call CLI scripts:
```python
result = subprocess.run([
    sys.executable,
    str(CLI_DIR / "doc_scraper.py"),
    "--config", config_path
], capture_output=True, text=True)
```
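A caller usually also needs to know whether the script succeeded. The wrapper below is a sketch of one way to do that, not the server's actual code: `run_cli` is a hypothetical helper that checks the exit code and handles a timeout.

```python
import subprocess
import sys


def run_cli(args, timeout=None):
    """Run a Python CLI script and return (success, output).

    `args` is the argument list after the interpreter, for example
    ["cli/doc_scraper.py", "--config", "configs/react.json"].
    Hypothetical helper, for illustration only.
    """
    try:
        result = subprocess.run(
            [sys.executable, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout}s"
    if result.returncode != 0:
        # Prefer stderr for the error message, fall back to stdout
        return False, result.stderr.strip() or result.stdout.strip()
    return True, result.stdout.strip()
```

Returning a `(success, output)` tuple mirrors the convention that `upload_skill_api` uses elsewhere in this change.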
## Testing
The MCP server has comprehensive test coverage:
```bash
# Run MCP server tests (34 tests)
python3 -m pytest tests/test_mcp_server.py -v
# Expected output: 34 passed in ~0.3s
```
### Test Coverage
- **Server initialization** (2 tests)
- **Tool listing** (2 tests)
- **generate_config** (3 tests)
- **estimate_pages** (3 tests)
- **scrape_docs** (4 tests)
- **package_skill** (3 tests)
- **upload_skill** (2 tests)
- **list_configs** (3 tests)
- **validate_config** (3 tests)
- **split_config** (3 tests)
- **generate_router** (3 tests)
- **Tool routing** (2 tests)
- **Integration** (1 test)
**Total: 34 tests | Pass rate: 100%**
## Troubleshooting
### MCP Server Not Loading
**Symptoms:**
- Tools don't appear in Claude Code
- No response to skill-seeker commands
**Solutions:**
1. Check configuration:
```bash
cat ~/.config/claude-code/mcp.json
```
2. Verify server can start:
```bash
python3 mcp/server.py
# Should start without errors (Ctrl+C to exit)
```
3. Check dependencies:
```bash
pip3 install -r mcp/requirements.txt
```
4. Completely restart Claude Code (quit and reopen)
5. Check Claude Code logs:
- macOS: `~/Library/Logs/Claude Code/`
- Linux: `~/.config/claude-code/logs/`
### "ModuleNotFoundError: No module named 'mcp'"
```bash
pip3 install -r mcp/requirements.txt
```
### Tools Appear But Don't Work
**Solutions:**
1. Verify `cwd` in config points to repository root
2. Check CLI tools exist:
```bash
ls cli/doc_scraper.py
ls cli/estimate_pages.py
ls cli/package_skill.py
```
3. Test CLI tools directly:
```bash
python3 cli/doc_scraper.py --help
```
### Slow Operations
1. Check rate limit in configs (increase if needed)
2. Use smaller `max_pages` for testing
3. Use `skip_scrape` to avoid re-downloading data
## Advanced Configuration
### Using Virtual Environment
```bash
# Create venv
python3 -m venv venv
source venv/bin/activate
pip install -r mcp/requirements.txt
pip install requests beautifulsoup4
which python3 # Copy this path
```
Configure Claude Code to use venv Python:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "/path/to/Skill_Seekers/venv/bin/python3",
      "args": ["/path/to/Skill_Seekers/mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers"
    }
  }
}
```
### Debug Mode
Enable verbose logging:
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": ["-u", "/path/to/Skill_Seekers/mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers",
      "env": {
        "DEBUG": "1"
      }
    }
  }
}
```
### With API Enhancement
For API-based enhancement (requires Anthropic API key):
```json
{
  "mcpServers": {
    "skill-seeker": {
      "command": "python3",
      "args": ["/path/to/Skill_Seekers/mcp/server.py"],
      "cwd": "/path/to/Skill_Seekers",
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-your-key-here"
      }
    }
  }
}
```
## Performance
| Operation | Time | Notes |
|-----------|------|-------|
| List configs | <1s | Instant |
| Generate config | <1s | Creates JSON file |
| Validate config | <1s | Quick validation |
| Estimate pages | 1-2min | Fast, no data download |
| Split config | 1-3min | Analyzes and creates sub-configs |
| Generate router | 10-30s | Creates router SKILL.md |
| Scrape docs | 15-45min | First time only |
| Scrape docs (40K pages) | 20-40hrs | Sequential |
| Scrape docs (40K pages, parallel) | 4-8hrs | 5 skills in parallel |
| Scrape (cached) | <1min | With `skip_scrape` |
| Package skill | 5-10s | Creates .zip |
| Package multi | 30-60s | Packages 5-10 skills |
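The 40K-page rows are consistent with simple arithmetic: if scrape time is roughly proportional to page count, the parallel wall-clock time is set by the largest sub-skill. The sketch below makes that assumption explicit. `parallel_hours` is a hypothetical helper, and real runs also depend on rate limits and network conditions.

```python
def parallel_hours(sequential_hours, pages_per_skill):
    """Estimate wall-clock hours when sub-skills are scraped in parallel.

    Assumes time is proportional to page count, so the largest
    sub-skill determines the total. Hypothetical helper.
    """
    total = sum(pages_per_skill)
    return sequential_hours * max(pages_per_skill) / total
```

For five equal sub-skills of 8,000 pages, 20-40 hours sequential becomes 4-8 hours. An uneven split takes longer: if the largest sub-skill holds 11,000 of the 40,000 pages, the lower bound rises to 5.5 hours.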
## Documentation
- **Full Setup Guide**: [docs/MCP_SETUP.md](../docs/MCP_SETUP.md)
- **Main README**: [README.md](../README.md)
- **Usage Guide**: [docs/USAGE.md](../docs/USAGE.md)
- **Testing Guide**: [docs/TESTING.md](../docs/TESTING.md)
## Support
- **Issues**: [GitHub Issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
## License
MIT License - See [LICENSE](../LICENSE) for details


@@ -0,0 +1,27 @@
"""Skill Seekers MCP (Model Context Protocol) server package.
This package provides MCP server integration for Claude Code, allowing
natural language interaction with Skill Seekers tools.
Main modules:
- server: MCP server implementation with 9 tools
Available MCP Tools:
- list_configs: List all available preset configurations
- generate_config: Generate a new config file for any docs site
- validate_config: Validate a config file structure
- estimate_pages: Estimate page count before scraping
- scrape_docs: Scrape and build a skill
- package_skill: Package skill into .zip file (with auto-upload)
- upload_skill: Upload .zip to Claude
- split_config: Split large documentation configs
- generate_router: Generate router/hub skills
Usage:
The MCP server is typically run by Claude Code via configuration
in ~/.config/claude-code/mcp.json
"""
__version__ = "2.0.0"
__all__ = []


@@ -0,0 +1,9 @@
# MCP Server dependencies
mcp>=1.0.0
# CLI tool dependencies (shared)
requests>=2.31.0
beautifulsoup4>=4.12.0
# Optional: for API-based enhancement
# anthropic>=0.18.0

File diff suppressed because it is too large


@@ -0,0 +1,19 @@
"""MCP tools subpackage.
This package will contain modularized MCP tool implementations.
Planned structure (for future refactoring):
- scraping_tools.py: Tools for scraping (estimate_pages, scrape_docs)
- building_tools.py: Tools for building (package_skill, validate_config)
- deployment_tools.py: Tools for deployment (upload_skill)
- config_tools.py: Tools for configs (list_configs, generate_config)
- advanced_tools.py: Advanced tools (split_config, generate_router)
Current state:
All tools are currently implemented in mcp/server.py
This directory is a placeholder for future modularization.
"""
__version__ = "2.0.0"
__all__ = []