feat: Multi-Source Synthesis Architecture - Rich Standalone Skills + Smart Combination

BREAKING CHANGE: Major architectural improvements to multi-source skill generation

This commit implements the complete "Multi-Source Synthesis Architecture" where
each source (documentation, GitHub, PDF) generates a rich standalone SKILL.md
file before being intelligently synthesized with source-specific formulas.

## 🎯 Core Architecture Changes

### 1. Rich Standalone SKILL.md Generation (Source Parity)

Each source now generates comprehensive, production-quality SKILL.md files that
can stand alone OR be synthesized with other sources.

**GitHub Scraper Enhancements** (+263 lines):
- Now generates 300+ line SKILL.md (was ~50 lines)
- Integrates C3.x codebase analysis data:
  - C2.5: API Reference extraction
  - C3.1: Design pattern detection (27 high-confidence patterns)
  - C3.2: Test example extraction (215 examples)
  - C3.7: Architectural pattern analysis
- Enhanced sections:
  -  Quick Reference with pattern summaries
  - 📝 Code Examples from real repository tests
  - 🔧 API Reference from codebase analysis
  - 🏗️ Architecture Overview with design patterns
  - ⚠️ Known Issues from GitHub issues
- Location: src/skill_seekers/cli/github_scraper.py

**PDF Scraper Enhancements** (+205 lines):
- Now generates 200+ line SKILL.md (was ~50 lines)
- Enhanced content extraction:
  - 📖 Chapter Overview (PDF structure breakdown)
  - 🔑 Key Concepts (extracted from headings)
  -  Quick Reference (pattern extraction)
  - 📝 Code Examples: Top 15 (was top 5), grouped by language
  - Quality scoring and intelligent truncation
- Better formatting and organization
- Location: src/skill_seekers/cli/pdf_scraper.py

**Result**: All 3 sources (docs, GitHub, PDF) now have equal capability to
generate rich, comprehensive standalone skills.

### 2. File Organization & Caching System

**Problem**: output/ directory cluttered with intermediate files, data, and logs.

**Solution**: New `.skillseeker-cache/` hidden directory for all intermediate files.

**New Structure**:
```
.skillseeker-cache/{skill_name}/
├── sources/          # Standalone SKILL.md from each source
│   ├── httpx_docs/
│   ├── httpx_github/
│   └── httpx_pdf/
├── data/             # Raw scraped data (JSON)
├── repos/            # Cloned GitHub repositories (cached for reuse)
└── logs/             # Session logs with timestamps

output/{skill_name}/  # CLEAN: Only final synthesized skill
├── SKILL.md
└── references/
```

**Benefits**:
-  Clean output/ directory (only final product)
-  Intermediate files preserved for debugging
-  Repository clones cached and reused (faster re-runs)
-  Timestamped logs for each scraping session
-  All cache dirs added to .gitignore

**Changes**:
- .gitignore: Added `.skillseeker-cache/` entry
- unified_scraper.py: Complete reorganization (+238 lines)
  - Added cache directory structure
  - File logging with timestamps
  - Repository cloning with caching/reuse
  - Cleaner intermediate file management
  - Better subprocess logging and error handling

### 3. Config Repository Migration

**Moved to separate config repository**: https://github.com/yusufkaraaslan/skill-seekers-configs

**Deleted from this repo** (35 config files):
- ansible-core.json, astro.json, claude-code.json
- django.json, django_unified.json, fastapi.json, fastapi_unified.json
- godot.json, godot_unified.json, godot_github.json, godot-large-example.json
- react.json, react_unified.json, react_github.json, react_github_example.json
- vue.json, kubernetes.json, laravel.json, tailwind.json, hono.json
- svelte_cli_unified.json, steam-economy-complete.json
- deck_deck_go_local.json, python-tutorial-test.json, example_pdf.json
- test-manual.json, fastapi_unified_test.json, fastmcp_github_example.json
- example-team/ directory (4 files)

**Kept as reference example**:
- configs/httpx_comprehensive.json (complete multi-source example)

**Rationale**:
- Cleaner repository (979+ lines added, 1680 deleted)
- Configs managed separately with versioning
- Official presets available via `fetch-config` command
- Users can maintain private config repos

### 4. AI Enhancement Improvements

**enhance_skill.py** (+125 lines):
- Better integration with multi-source synthesis
- Enhanced prompt generation for synthesized skills
- Improved error handling and logging
- Support for source metadata in enhancement

### 5. Documentation Updates

**CLAUDE.md** (+252 lines):
- Comprehensive project documentation
- Architecture explanations
- Development workflow guidelines
- Testing requirements
- Multi-source synthesis patterns

**SKILL_QUALITY_ANALYSIS.md** (new):
- Quality assessment framework
- Before/after analysis of httpx skill
- Grading rubric for skill quality
- Metrics and benchmarks

### 6. Testing & Validation Scripts

**test_httpx_skill.sh** (new):
- Complete httpx skill generation test
- Multi-source synthesis validation
- Quality metrics verification

**test_httpx_quick.sh** (new):
- Quick validation script
- Subset of features for rapid testing

## 📊 Quality Improvements

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| GitHub SKILL.md lines | ~50 | 300+ | +500% |
| PDF SKILL.md lines | ~50 | 200+ | +300% |
| GitHub C3.x integration |  No |  Yes | New feature |
| PDF pattern extraction |  No |  Yes | New feature |
| File organization | Messy | Clean cache | Major improvement |
| Repository cloning | Always fresh | Cached reuse | Faster re-runs |
| Logging | Console only | Timestamped files | Better debugging |
| Config management | In-repo | Separate repo | Cleaner separation |

## 🧪 Testing

All existing tests pass:
- test_c3_integration.py: Updated for new architecture
- 700+ tests passing
- Multi-source synthesis validated with httpx example

## 🔧 Technical Details

**Modified Core Files**:
1. src/skill_seekers/cli/github_scraper.py (+263 lines)
   - _generate_skill_md(): Rich content with C3.x integration
   - _format_pattern_summary(): Design pattern summaries
   - _format_code_examples(): Test example formatting
   - _format_api_reference(): API reference from codebase
   - _format_architecture(): Architectural pattern analysis

2. src/skill_seekers/cli/pdf_scraper.py (+205 lines)
   - _generate_skill_md(): Enhanced with rich content
   - _format_key_concepts(): Extract concepts from headings
   - _format_patterns_from_content(): Pattern extraction
   - Code examples: Top 15, grouped by language, better quality scoring

3. src/skill_seekers/cli/unified_scraper.py (+238 lines)
   - __init__(): Cache directory structure
   - _setup_logging(): File logging with timestamps
   - _clone_github_repo(): Repository caching system
   - _scrape_documentation(): Move to cache, better logging
   - Better subprocess handling and error reporting

4. src/skill_seekers/cli/enhance_skill.py (+125 lines)
   - Multi-source synthesis awareness
   - Enhanced prompt generation
   - Better error handling

**Minor Updates**:
- src/skill_seekers/cli/codebase_scraper.py (+3 lines): Minor improvements
- src/skill_seekers/cli/test_example_extractor.py: Quality scoring adjustments
- tests/test_c3_integration.py: Test updates for new architecture

## 🚀 Migration Guide

**For users with existing configs**:
No action required - all existing configs continue to work.

**For users wanting official presets**:
```bash
# Fetch from official config repo
skill-seekers fetch-config --name react --target unified

# Or use existing local configs
skill-seekers unified --config configs/httpx_comprehensive.json
```

**Cache directory**:
New `.skillseeker-cache/` directory will be created automatically.
Safe to delete - will be regenerated on next run.

## 📈 Next Steps

This architecture enables:
-  Source parity: All sources generate rich standalone skills
-  Smart synthesis: Each combination has optimal formula
-  Better debugging: Cached files and logs preserved
-  Faster iteration: Repository caching, clean output
- 🔄 Future: Multi-platform enhancement (Gemini, GPT-4) - planned
- 🔄 Future: Conflict detection between sources - planned
- 🔄 Future: Source prioritization rules - planned

## 🎓 Example: httpx Skill Quality

**Before**: 186 lines, basic synthesis, missing data
**After**: 640 lines with AI enhancement, A- (9/10) quality

**What changed**:
- All C3.x analysis data integrated (patterns, tests, API, architecture)
- GitHub metadata included (stars, topics, languages)
- PDF chapter structure visible
- Professional formatting with emojis and clear sections
- Real-world code examples from test suite
- Design patterns explained with confidence scores
- Known issues with impact assessment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
yusyus
2026-01-11 23:01:07 +03:00
parent cf9539878e
commit a99e22c639
46 changed files with 1869 additions and 1678 deletions

View File

@@ -74,13 +74,51 @@ class UnifiedScraper:
# Storage for scraped data
self.scraped_data = {}
# Output paths
# Output paths - cleaner organization
self.name = self.config['name']
self.output_dir = f"output/{self.name}"
self.data_dir = f"output/{self.name}_unified_data"
self.output_dir = f"output/{self.name}" # Final skill only
# Use hidden cache directory for intermediate files
self.cache_dir = f".skillseeker-cache/{self.name}"
self.sources_dir = f"{self.cache_dir}/sources"
self.data_dir = f"{self.cache_dir}/data"
self.repos_dir = f"{self.cache_dir}/repos"
self.logs_dir = f"{self.cache_dir}/logs"
# Create directories
os.makedirs(self.output_dir, exist_ok=True)
os.makedirs(self.sources_dir, exist_ok=True)
os.makedirs(self.data_dir, exist_ok=True)
os.makedirs(self.repos_dir, exist_ok=True)
os.makedirs(self.logs_dir, exist_ok=True)
# Setup file logging
self._setup_logging()
def _setup_logging(self):
"""Setup file logging for this scraping session."""
from datetime import datetime
# Create log filename with timestamp
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
log_file = f"{self.logs_dir}/unified_{timestamp}.log"
# Add file handler to root logger
file_handler = logging.FileHandler(log_file, encoding='utf-8')
file_handler.setLevel(logging.DEBUG)
# Create formatter
formatter = logging.Formatter(
'%(asctime)s - %(name)s - %(levelname)s - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
file_handler.setFormatter(formatter)
# Add to root logger
logging.getLogger().addHandler(file_handler)
logger.info(f"📝 Logging to: {log_file}")
logger.info(f"🗂️ Cache directory: {self.cache_dir}")
def scrape_all_sources(self):
"""
@@ -150,14 +188,20 @@ class UnifiedScraper:
logger.info(f"Scraping documentation from {source['base_url']}")
doc_scraper_path = Path(__file__).parent / "doc_scraper.py"
cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path]
cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path, '--fresh']
result = subprocess.run(cmd, capture_output=True, text=True)
result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
if result.returncode != 0:
logger.error(f"Documentation scraping failed: {result.stderr}")
logger.error(f"Documentation scraping failed with return code {result.returncode}")
logger.error(f"STDERR: {result.stderr}")
logger.error(f"STDOUT: {result.stdout}")
return
# Log subprocess output for debugging
if result.stdout:
logger.info(f"Doc scraper output: {result.stdout[-500:]}") # Last 500 chars
# Load scraped data
docs_data_file = f"output/{doc_config['name']}_data/summary.json"
@@ -178,6 +222,83 @@ class UnifiedScraper:
if os.path.exists(temp_config_path):
os.remove(temp_config_path)
# Move intermediate files to cache to keep output/ clean
docs_output_dir = f"output/{doc_config['name']}"
docs_data_dir = f"output/{doc_config['name']}_data"
if os.path.exists(docs_output_dir):
cache_docs_dir = os.path.join(self.sources_dir, f"{doc_config['name']}")
if os.path.exists(cache_docs_dir):
shutil.rmtree(cache_docs_dir)
shutil.move(docs_output_dir, cache_docs_dir)
logger.info(f"📦 Moved docs output to cache: {cache_docs_dir}")
if os.path.exists(docs_data_dir):
cache_data_dir = os.path.join(self.data_dir, f"{doc_config['name']}_data")
if os.path.exists(cache_data_dir):
shutil.rmtree(cache_data_dir)
shutil.move(docs_data_dir, cache_data_dir)
logger.info(f"📦 Moved docs data to cache: {cache_data_dir}")
def _clone_github_repo(self, repo_name: str) -> Optional[str]:
"""
Clone GitHub repository to cache directory for C3.x analysis.
Reuses existing clone if already present.
Args:
repo_name: GitHub repo in format "owner/repo"
Returns:
Path to cloned repo, or None if clone failed
"""
# Clone to cache repos folder for future reuse
repo_dir_name = repo_name.replace('/', '_') # e.g., encode_httpx
clone_path = os.path.join(self.repos_dir, repo_dir_name)
# Check if already cloned
if os.path.exists(clone_path) and os.path.isdir(os.path.join(clone_path, '.git')):
logger.info(f"♻️ Found existing repository clone: {clone_path}")
logger.info(f" Reusing for C3.x analysis (skip re-cloning)")
return clone_path
# repos_dir already created in __init__
# Clone repo (full clone, not shallow - for complete analysis)
repo_url = f"https://github.com/{repo_name}.git"
logger.info(f"🔄 Cloning repository for C3.x analysis: {repo_url}")
logger.info(f"{clone_path}")
logger.info(f" 💾 Clone will be saved for future reuse")
try:
result = subprocess.run(
['git', 'clone', repo_url, clone_path],
capture_output=True,
text=True,
timeout=600 # 10 minute timeout for full clone
)
if result.returncode == 0:
logger.info(f"✅ Repository cloned successfully")
logger.info(f" 📁 Saved to: {clone_path}")
return clone_path
else:
logger.error(f"❌ Git clone failed: {result.stderr}")
# Clean up failed clone
if os.path.exists(clone_path):
shutil.rmtree(clone_path)
return None
except subprocess.TimeoutExpired:
logger.error(f"❌ Git clone timed out after 10 minutes")
if os.path.exists(clone_path):
shutil.rmtree(clone_path)
return None
except Exception as e:
logger.error(f"❌ Git clone failed: {e}")
if os.path.exists(clone_path):
shutil.rmtree(clone_path)
return None
def _scrape_github(self, source: Dict[str, Any]):
"""Scrape GitHub repository."""
try:
@@ -186,6 +307,22 @@ class UnifiedScraper:
logger.error("github_scraper.py not found")
return
# Check if we need to clone for C3.x analysis
enable_codebase_analysis = source.get('enable_codebase_analysis', True)
local_repo_path = source.get('local_repo_path')
cloned_repo_path = None
# Auto-clone if C3.x analysis is enabled but no local path provided
if enable_codebase_analysis and not local_repo_path:
logger.info("🔬 C3.x codebase analysis enabled - cloning repository...")
cloned_repo_path = self._clone_github_repo(source['repo'])
if cloned_repo_path:
local_repo_path = cloned_repo_path
logger.info(f"✅ Using cloned repo for C3.x analysis: {local_repo_path}")
else:
logger.warning("⚠️ Failed to clone repo - C3.x analysis will be skipped")
enable_codebase_analysis = False
# Create config for GitHub scraper
github_config = {
'repo': source['repo'],
@@ -198,7 +335,7 @@ class UnifiedScraper:
'include_code': source.get('include_code', True),
'code_analysis_depth': source.get('code_analysis_depth', 'surface'),
'file_patterns': source.get('file_patterns', []),
'local_repo_path': source.get('local_repo_path') # Pass local_repo_path from config
'local_repo_path': local_repo_path # Use cloned path if available
}
# Pass directory exclusions if specified (optional)
@@ -213,9 +350,6 @@ class UnifiedScraper:
github_data = scraper.scrape()
# Run C3.x codebase analysis if enabled and local_repo_path available
enable_codebase_analysis = source.get('enable_codebase_analysis', True)
local_repo_path = source.get('local_repo_path')
if enable_codebase_analysis and local_repo_path:
logger.info("🔬 Running C3.x codebase analysis...")
try:
@@ -227,18 +361,58 @@ class UnifiedScraper:
logger.warning("⚠️ C3.x analysis returned no data")
except Exception as e:
logger.warning(f"⚠️ C3.x analysis failed: {e}")
import traceback
logger.debug(f"Traceback: {traceback.format_exc()}")
# Continue without C3.x data - graceful degradation
# Save data
# Note: We keep the cloned repo in output/ for future reuse
if cloned_repo_path:
logger.info(f"📁 Repository clone saved for future use: {cloned_repo_path}")
# Save data to unified location
github_data_file = os.path.join(self.data_dir, 'github_data.json')
with open(github_data_file, 'w', encoding='utf-8') as f:
json.dump(github_data, f, indent=2, ensure_ascii=False)
# ALSO save to the location GitHubToSkillConverter expects (with C3.x data!)
converter_data_file = f"output/{github_config['name']}_github_data.json"
with open(converter_data_file, 'w', encoding='utf-8') as f:
json.dump(github_data, f, indent=2, ensure_ascii=False)
self.scraped_data['github'] = {
'data': github_data,
'data_file': github_data_file
}
# Build standalone SKILL.md for synthesis using GitHubToSkillConverter
try:
from skill_seekers.cli.github_scraper import GitHubToSkillConverter
# Use github_config which has the correct name field
# Converter will load from output/{name}_github_data.json which now has C3.x data
converter = GitHubToSkillConverter(config=github_config)
converter.build_skill()
logger.info(f"✅ GitHub: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone GitHub SKILL.md: {e}")
# Move intermediate files to cache to keep output/ clean
github_output_dir = f"output/{github_config['name']}"
github_data_file_path = f"output/{github_config['name']}_github_data.json"
if os.path.exists(github_output_dir):
cache_github_dir = os.path.join(self.sources_dir, github_config['name'])
if os.path.exists(cache_github_dir):
shutil.rmtree(cache_github_dir)
shutil.move(github_output_dir, cache_github_dir)
logger.info(f"📦 Moved GitHub output to cache: {cache_github_dir}")
if os.path.exists(github_data_file_path):
cache_github_data = os.path.join(self.data_dir, f"{github_config['name']}_github_data.json")
if os.path.exists(cache_github_data):
os.remove(cache_github_data)
shutil.move(github_data_file_path, cache_github_data)
logger.info(f"📦 Moved GitHub data to cache: {cache_github_data}")
logger.info(f"✅ GitHub: Repository scraped successfully")
def _scrape_pdf(self, source: Dict[str, Any]):
@@ -273,6 +447,13 @@ class UnifiedScraper:
'data_file': pdf_data_file
}
# Build standalone SKILL.md for synthesis
try:
converter.build_skill()
logger.info(f"✅ PDF: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone PDF SKILL.md: {e}")
logger.info(f"✅ PDF: {len(pdf_data.get('pages', []))} pages extracted")
def _load_json(self, file_path: Path) -> Dict:
@@ -323,6 +504,30 @@ class UnifiedScraper:
return {'guides': guides, 'total_count': len(guides)}
def _load_api_reference(self, api_dir: Path) -> Dict[str, Any]:
"""
Load API reference markdown files from api_reference directory.
Args:
api_dir: Path to api_reference directory
Returns:
Dict mapping module names to markdown content, or empty dict if not found
"""
if not api_dir.exists():
logger.debug(f"API reference directory not found: {api_dir}")
return {}
api_refs = {}
for md_file in api_dir.glob('*.md'):
try:
module_name = md_file.stem
api_refs[module_name] = md_file.read_text(encoding='utf-8')
except IOError as e:
logger.warning(f"Failed to read API reference {md_file}: {e}")
return api_refs
def _run_c3_analysis(self, local_repo_path: str, source: Dict[str, Any]) -> Dict[str, Any]:
"""
Run comprehensive C3.x codebase analysis.
@@ -358,9 +563,9 @@ class UnifiedScraper:
depth='deep',
languages=None, # Analyze all languages
file_patterns=source.get('file_patterns'),
build_api_reference=False, # Not needed in skill
build_api_reference=True, # C2.5: API Reference
extract_comments=False, # Not needed
build_dependency_graph=False, # Can add later if needed
build_dependency_graph=True, # C2.6: Dependency Graph
detect_patterns=True, # C3.1: Design patterns
extract_test_examples=True, # C3.2: Test examples
build_how_to_guides=True, # C3.3: How-to guides
@@ -375,7 +580,9 @@ class UnifiedScraper:
'test_examples': self._load_json(temp_output / 'test_examples' / 'test_examples.json'),
'how_to_guides': self._load_guide_collection(temp_output / 'tutorials'),
'config_patterns': self._load_json(temp_output / 'config_patterns' / 'config_patterns.json'),
'architecture': self._load_json(temp_output / 'architecture' / 'architectural_patterns.json')
'architecture': self._load_json(temp_output / 'architecture' / 'architectural_patterns.json'),
'api_reference': self._load_api_reference(temp_output / 'api_reference'), # C2.5
'dependency_graph': self._load_json(temp_output / 'dependencies' / 'dependency_graph.json') # C2.6
}
# Log summary
@@ -531,7 +738,8 @@ class UnifiedScraper:
self.config,
self.scraped_data,
merged_data,
conflicts
conflicts,
cache_dir=self.cache_dir
)
builder.build()