feat: Multi-Source Synthesis Architecture - Rich Standalone Skills + Smart Combination

BREAKING CHANGE: Major architectural improvements to multi-source skill generation This commit implements the complete "Multi-Source Synthesis Architecture" where each source (documentation, GitHub, PDF) generates a rich standalone SKILL.md file before being intelligently synthesized with source-specific formulas. ## 🎯 Core Architecture Changes ### 1. Rich Standalone SKILL.md Generation (Source Parity) Each source now generates comprehensive, production-quality SKILL.md files that can stand alone OR be synthesized with other sources. **GitHub Scraper Enhancements** (+263 lines): - Now generates 300+ line SKILL.md (was ~50 lines) - Integrates C3.x codebase analysis data: - C2.5: API Reference extraction - C3.1: Design pattern detection (27 high-confidence patterns) - C3.2: Test example extraction (215 examples) - C3.7: Architectural pattern analysis - Enhanced sections: - ⚡ Quick Reference with pattern summaries - 📝 Code Examples from real repository tests - 🔧 API Reference from codebase analysis - 🏗️ Architecture Overview with design patterns - ⚠️ Known Issues from GitHub issues - Location: src/skill_seekers/cli/github_scraper.py **PDF Scraper Enhancements** (+205 lines): - Now generates 200+ line SKILL.md (was ~50 lines) - Enhanced content extraction: - 📖 Chapter Overview (PDF structure breakdown) - 🔑 Key Concepts (extracted from headings) - ⚡ Quick Reference (pattern extraction) - 📝 Code Examples: Top 15 (was top 5), grouped by language - Quality scoring and intelligent truncation - Better formatting and organization - Location: src/skill_seekers/cli/pdf_scraper.py **Result**: All 3 sources (docs, GitHub, PDF) now have equal capability to generate rich, comprehensive standalone skills. ### 2. File Organization & Caching System **Problem**: output/ directory cluttered with intermediate files, data, and logs. **Solution**: New `.skillseeker-cache/` hidden directory for all intermediate files. **New Structure**: ``` .skillseeker-cache/{skill_name}/ ├── sources/ # Standalone SKILL.md from each source │ ├── httpx_docs/ │ ├── httpx_github/ │ └── httpx_pdf/ ├── data/ # Raw scraped data (JSON) ├── repos/ # Cloned GitHub repositories (cached for reuse) └── logs/ # Session logs with timestamps output/{skill_name}/ # CLEAN: Only final synthesized skill ├── SKILL.md └── references/ ``` **Benefits**: - ✅ Clean output/ directory (only final product) - ✅ Intermediate files preserved for debugging - ✅ Repository clones cached and reused (faster re-runs) - ✅ Timestamped logs for each scraping session - ✅ All cache dirs added to .gitignore **Changes**: - .gitignore: Added `.skillseeker-cache/` entry - unified_scraper.py: Complete reorganization (+238 lines) - Added cache directory structure - File logging with timestamps - Repository cloning with caching/reuse - Cleaner intermediate file management - Better subprocess logging and error handling ### 3. Config Repository Migration **Moved to separate config repository**: https://github.com/yusufkaraaslan/skill-seekers-configs **Deleted from this repo** (35 config files): - ansible-core.json, astro.json, claude-code.json - django.json, django_unified.json, fastapi.json, fastapi_unified.json - godot.json, godot_unified.json, godot_github.json, godot-large-example.json - react.json, react_unified.json, react_github.json, react_github_example.json - vue.json, kubernetes.json, laravel.json, tailwind.json, hono.json - svelte_cli_unified.json, steam-economy-complete.json - deck_deck_go_local.json, python-tutorial-test.json, example_pdf.json - test-manual.json, fastapi_unified_test.json, fastmcp_github_example.json - example-team/ directory (4 files) **Kept as reference example**: - configs/httpx_comprehensive.json (complete multi-source example) **Rationale**: - Cleaner repository (979+ lines added, 1680 deleted) - Configs managed separately with versioning - Official presets available via `fetch-config` command - Users can maintain private config repos ### 4. AI Enhancement Improvements **enhance_skill.py** (+125 lines): - Better integration with multi-source synthesis - Enhanced prompt generation for synthesized skills - Improved error handling and logging - Support for source metadata in enhancement ### 5. Documentation Updates **CLAUDE.md** (+252 lines): - Comprehensive project documentation - Architecture explanations - Development workflow guidelines - Testing requirements - Multi-source synthesis patterns **SKILL_QUALITY_ANALYSIS.md** (new): - Quality assessment framework - Before/after analysis of httpx skill - Grading rubric for skill quality - Metrics and benchmarks ### 6. Testing & Validation Scripts **test_httpx_skill.sh** (new): - Complete httpx skill generation test - Multi-source synthesis validation - Quality metrics verification **test_httpx_quick.sh** (new): - Quick validation script - Subset of features for rapid testing ## 📊 Quality Improvements | Metric | Before | After | Improvement | |--------|--------|-------|-------------| | GitHub SKILL.md lines | ~50 | 300+ | +500% | | PDF SKILL.md lines | ~50 | 200+ | +300% | | GitHub C3.x integration | ❌ No | ✅ Yes | New feature | | PDF pattern extraction | ❌ No | ✅ Yes | New feature | | File organization | Messy | Clean cache | Major improvement | | Repository cloning | Always fresh | Cached reuse | Faster re-runs | | Logging | Console only | Timestamped files | Better debugging | | Config management | In-repo | Separate repo | Cleaner separation | ## 🧪 Testing All existing tests pass: - test_c3_integration.py: Updated for new architecture - 700+ tests passing - Multi-source synthesis validated with httpx example ## 🔧 Technical Details **Modified Core Files**: 1. src/skill_seekers/cli/github_scraper.py (+263 lines) - _generate_skill_md(): Rich content with C3.x integration - _format_pattern_summary(): Design pattern summaries - _format_code_examples(): Test example formatting - _format_api_reference(): API reference from codebase - _format_architecture(): Architectural pattern analysis 2. src/skill_seekers/cli/pdf_scraper.py (+205 lines) - _generate_skill_md(): Enhanced with rich content - _format_key_concepts(): Extract concepts from headings - _format_patterns_from_content(): Pattern extraction - Code examples: Top 15, grouped by language, better quality scoring 3. src/skill_seekers/cli/unified_scraper.py (+238 lines) - __init__(): Cache directory structure - _setup_logging(): File logging with timestamps - _clone_github_repo(): Repository caching system - _scrape_documentation(): Move to cache, better logging - Better subprocess handling and error reporting 4. src/skill_seekers/cli/enhance_skill.py (+125 lines) - Multi-source synthesis awareness - Enhanced prompt generation - Better error handling **Minor Updates**: - src/skill_seekers/cli/codebase_scraper.py (+3 lines): Minor improvements - src/skill_seekers/cli/test_example_extractor.py: Quality scoring adjustments - tests/test_c3_integration.py: Test updates for new architecture ## 🚀 Migration Guide **For users with existing configs**: No action required - all existing configs continue to work. **For users wanting official presets**: ```bash # Fetch from official config repo skill-seekers fetch-config --name react --target unified # Or use existing local configs skill-seekers unified --config configs/httpx_comprehensive.json ``` **Cache directory**: New `.skillseeker-cache/` directory will be created automatically. Safe to delete - will be regenerated on next run. ## 📈 Next Steps This architecture enables: - ✅ Source parity: All sources generate rich standalone skills - ✅ Smart synthesis: Each combination has optimal formula - ✅ Better debugging: Cached files and logs preserved - ✅ Faster iteration: Repository caching, clean output - 🔄 Future: Multi-platform enhancement (Gemini, GPT-4) - planned - 🔄 Future: Conflict detection between sources - planned - 🔄 Future: Source prioritization rules - planned ## 🎓 Example: httpx Skill Quality **Before**: 186 lines, basic synthesis, missing data **After**: 640 lines with AI enhancement, A- (9/10) quality **What changed**: - All C3.x analysis data integrated (patterns, tests, API, architecture) - GitHub metadata included (stars, topics, languages) - PDF chapter structure visible - Professional formatting with emojis and clear sections - Real-world code examples from test suite - Design patterns explained with confidence scores - Known issues with impact assessment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-11 23:01:07 +03:00
parent cf9539878e
commit a99e22c639
46 changed files with 1869 additions and 1678 deletions
--- a/src/skill_seekers/cli/unified_scraper.py
+++ b/src/skill_seekers/cli/unified_scraper.py
@@ -74,13 +74,51 @@ class UnifiedScraper:
        # Storage for scraped data
        self.scraped_data = {}

-        # Output paths
+        # Output paths - cleaner organization
        self.name = self.config['name']
-        self.output_dir = f"output/{self.name}"
-        self.data_dir = f"output/{self.name}_unified_data"
+        self.output_dir = f"output/{self.name}"  # Final skill only

+        # Use hidden cache directory for intermediate files
+        self.cache_dir = f".skillseeker-cache/{self.name}"
+        self.sources_dir = f"{self.cache_dir}/sources"
+        self.data_dir = f"{self.cache_dir}/data"
+        self.repos_dir = f"{self.cache_dir}/repos"
+        self.logs_dir = f"{self.cache_dir}/logs"
+
+        # Create directories
        os.makedirs(self.output_dir, exist_ok=True)
+        os.makedirs(self.sources_dir, exist_ok=True)
        os.makedirs(self.data_dir, exist_ok=True)
+        os.makedirs(self.repos_dir, exist_ok=True)
+        os.makedirs(self.logs_dir, exist_ok=True)
+
+        # Setup file logging
+        self._setup_logging()
+
+    def _setup_logging(self):
+        """Setup file logging for this scraping session."""
+        from datetime import datetime
+
+        # Create log filename with timestamp
+        timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
+        log_file = f"{self.logs_dir}/unified_{timestamp}.log"
+
+        # Add file handler to root logger
+        file_handler = logging.FileHandler(log_file, encoding='utf-8')
+        file_handler.setLevel(logging.DEBUG)
+
+        # Create formatter
+        formatter = logging.Formatter(
+            '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+            datefmt='%Y-%m-%d %H:%M:%S'
+        )
+        file_handler.setFormatter(formatter)
+
+        # Add to root logger
+        logging.getLogger().addHandler(file_handler)
+
+        logger.info(f"📝 Logging to: {log_file}")
+        logger.info(f"🗂️  Cache directory: {self.cache_dir}")

    def scrape_all_sources(self):
        """
@@ -150,14 +188,20 @@ class UnifiedScraper:
        logger.info(f"Scraping documentation from {source['base_url']}")

        doc_scraper_path = Path(__file__).parent / "doc_scraper.py"
-        cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path]
+        cmd = [sys.executable, str(doc_scraper_path), '--config', temp_config_path, '--fresh']

-        result = subprocess.run(cmd, capture_output=True, text=True)
+        result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)

        if result.returncode != 0:
-            logger.error(f"Documentation scraping failed: {result.stderr}")
+            logger.error(f"Documentation scraping failed with return code {result.returncode}")
+            logger.error(f"STDERR: {result.stderr}")
+            logger.error(f"STDOUT: {result.stdout}")
            return

+        # Log subprocess output for debugging
+        if result.stdout:
+            logger.info(f"Doc scraper output: {result.stdout[-500:]}")  # Last 500 chars
+
        # Load scraped data
        docs_data_file = f"output/{doc_config['name']}_data/summary.json"

@@ -178,6 +222,83 @@ class UnifiedScraper:
        if os.path.exists(temp_config_path):
            os.remove(temp_config_path)

+        # Move intermediate files to cache to keep output/ clean
+        docs_output_dir = f"output/{doc_config['name']}"
+        docs_data_dir = f"output/{doc_config['name']}_data"
+
+        if os.path.exists(docs_output_dir):
+            cache_docs_dir = os.path.join(self.sources_dir, f"{doc_config['name']}")
+            if os.path.exists(cache_docs_dir):
+                shutil.rmtree(cache_docs_dir)
+            shutil.move(docs_output_dir, cache_docs_dir)
+            logger.info(f"📦 Moved docs output to cache: {cache_docs_dir}")
+
+        if os.path.exists(docs_data_dir):
+            cache_data_dir = os.path.join(self.data_dir, f"{doc_config['name']}_data")
+            if os.path.exists(cache_data_dir):
+                shutil.rmtree(cache_data_dir)
+            shutil.move(docs_data_dir, cache_data_dir)
+            logger.info(f"📦 Moved docs data to cache: {cache_data_dir}")
+
+    def _clone_github_repo(self, repo_name: str) -> Optional[str]:
+        """
+        Clone GitHub repository to cache directory for C3.x analysis.
+        Reuses existing clone if already present.
+
+        Args:
+            repo_name: GitHub repo in format "owner/repo"
+
+        Returns:
+            Path to cloned repo, or None if clone failed
+        """
+        # Clone to cache repos folder for future reuse
+        repo_dir_name = repo_name.replace('/', '_')  # e.g., encode_httpx
+        clone_path = os.path.join(self.repos_dir, repo_dir_name)
+
+        # Check if already cloned
+        if os.path.exists(clone_path) and os.path.isdir(os.path.join(clone_path, '.git')):
+            logger.info(f"♻️  Found existing repository clone: {clone_path}")
+            logger.info(f"   Reusing for C3.x analysis (skip re-cloning)")
+            return clone_path
+
+        # repos_dir already created in __init__
+
+        # Clone repo (full clone, not shallow - for complete analysis)
+        repo_url = f"https://github.com/{repo_name}.git"
+        logger.info(f"🔄 Cloning repository for C3.x analysis: {repo_url}")
+        logger.info(f"   → {clone_path}")
+        logger.info(f"   💾 Clone will be saved for future reuse")
+
+        try:
+            result = subprocess.run(
+                ['git', 'clone', repo_url, clone_path],
+                capture_output=True,
+                text=True,
+                timeout=600  # 10 minute timeout for full clone
+            )
+
+            if result.returncode == 0:
+                logger.info(f"✅ Repository cloned successfully")
+                logger.info(f"   📁 Saved to: {clone_path}")
+                return clone_path
+            else:
+                logger.error(f"❌ Git clone failed: {result.stderr}")
+                # Clean up failed clone
+                if os.path.exists(clone_path):
+                    shutil.rmtree(clone_path)
+                return None
+
+        except subprocess.TimeoutExpired:
+            logger.error(f"❌ Git clone timed out after 10 minutes")
+            if os.path.exists(clone_path):
+                shutil.rmtree(clone_path)
+            return None
+        except Exception as e:
+            logger.error(f"❌ Git clone failed: {e}")
+            if os.path.exists(clone_path):
+                shutil.rmtree(clone_path)
+            return None
+
    def _scrape_github(self, source: Dict[str, Any]):
        """Scrape GitHub repository."""
        try:
@@ -186,6 +307,22 @@ class UnifiedScraper:
            logger.error("github_scraper.py not found")
            return

+        # Check if we need to clone for C3.x analysis
+        enable_codebase_analysis = source.get('enable_codebase_analysis', True)
+        local_repo_path = source.get('local_repo_path')
+        cloned_repo_path = None
+
+        # Auto-clone if C3.x analysis is enabled but no local path provided
+        if enable_codebase_analysis and not local_repo_path:
+            logger.info("🔬 C3.x codebase analysis enabled - cloning repository...")
+            cloned_repo_path = self._clone_github_repo(source['repo'])
+            if cloned_repo_path:
+                local_repo_path = cloned_repo_path
+                logger.info(f"✅ Using cloned repo for C3.x analysis: {local_repo_path}")
+            else:
+                logger.warning("⚠️  Failed to clone repo - C3.x analysis will be skipped")
+                enable_codebase_analysis = False
+
        # Create config for GitHub scraper
        github_config = {
            'repo': source['repo'],
@@ -198,7 +335,7 @@ class UnifiedScraper:
            'include_code': source.get('include_code', True),
            'code_analysis_depth': source.get('code_analysis_depth', 'surface'),
            'file_patterns': source.get('file_patterns', []),
-            'local_repo_path': source.get('local_repo_path')  # Pass local_repo_path from config
+            'local_repo_path': local_repo_path  # Use cloned path if available
        }

        # Pass directory exclusions if specified (optional)
@@ -213,9 +350,6 @@ class UnifiedScraper:
        github_data = scraper.scrape()

        # Run C3.x codebase analysis if enabled and local_repo_path available
-        enable_codebase_analysis = source.get('enable_codebase_analysis', True)
-        local_repo_path = source.get('local_repo_path')
-
        if enable_codebase_analysis and local_repo_path:
            logger.info("🔬 Running C3.x codebase analysis...")
            try:
@@ -227,18 +361,58 @@ class UnifiedScraper:
                    logger.warning("⚠️  C3.x analysis returned no data")
            except Exception as e:
                logger.warning(f"⚠️  C3.x analysis failed: {e}")
+                import traceback
+                logger.debug(f"Traceback: {traceback.format_exc()}")
                # Continue without C3.x data - graceful degradation

-        # Save data
+        # Note: We keep the cloned repo in output/ for future reuse
+        if cloned_repo_path:
+            logger.info(f"📁 Repository clone saved for future use: {cloned_repo_path}")
+
+        # Save data to unified location
        github_data_file = os.path.join(self.data_dir, 'github_data.json')
        with open(github_data_file, 'w', encoding='utf-8') as f:
            json.dump(github_data, f, indent=2, ensure_ascii=False)

+        # ALSO save to the location GitHubToSkillConverter expects (with C3.x data!)
+        converter_data_file = f"output/{github_config['name']}_github_data.json"
+        with open(converter_data_file, 'w', encoding='utf-8') as f:
+            json.dump(github_data, f, indent=2, ensure_ascii=False)
+
        self.scraped_data['github'] = {
            'data': github_data,
            'data_file': github_data_file
        }

+        # Build standalone SKILL.md for synthesis using GitHubToSkillConverter
+        try:
+            from skill_seekers.cli.github_scraper import GitHubToSkillConverter
+            # Use github_config which has the correct name field
+            # Converter will load from output/{name}_github_data.json which now has C3.x data
+            converter = GitHubToSkillConverter(config=github_config)
+            converter.build_skill()
+            logger.info(f"✅ GitHub: Standalone SKILL.md created")
+        except Exception as e:
+            logger.warning(f"⚠️  Failed to build standalone GitHub SKILL.md: {e}")
+
+        # Move intermediate files to cache to keep output/ clean
+        github_output_dir = f"output/{github_config['name']}"
+        github_data_file_path = f"output/{github_config['name']}_github_data.json"
+
+        if os.path.exists(github_output_dir):
+            cache_github_dir = os.path.join(self.sources_dir, github_config['name'])
+            if os.path.exists(cache_github_dir):
+                shutil.rmtree(cache_github_dir)
+            shutil.move(github_output_dir, cache_github_dir)
+            logger.info(f"📦 Moved GitHub output to cache: {cache_github_dir}")
+
+        if os.path.exists(github_data_file_path):
+            cache_github_data = os.path.join(self.data_dir, f"{github_config['name']}_github_data.json")
+            if os.path.exists(cache_github_data):
+                os.remove(cache_github_data)
+            shutil.move(github_data_file_path, cache_github_data)
+            logger.info(f"📦 Moved GitHub data to cache: {cache_github_data}")
+
        logger.info(f"✅ GitHub: Repository scraped successfully")

    def _scrape_pdf(self, source: Dict[str, Any]):
@@ -273,6 +447,13 @@ class UnifiedScraper:
            'data_file': pdf_data_file
        }

+        # Build standalone SKILL.md for synthesis
+        try:
+            converter.build_skill()
+            logger.info(f"✅ PDF: Standalone SKILL.md created")
+        except Exception as e:
+            logger.warning(f"⚠️  Failed to build standalone PDF SKILL.md: {e}")
+
        logger.info(f"✅ PDF: {len(pdf_data.get('pages', []))} pages extracted")

    def _load_json(self, file_path: Path) -> Dict:
@@ -323,6 +504,30 @@ class UnifiedScraper:

        return {'guides': guides, 'total_count': len(guides)}

+    def _load_api_reference(self, api_dir: Path) -> Dict[str, Any]:
+        """
+        Load API reference markdown files from api_reference directory.
+
+        Args:
+            api_dir: Path to api_reference directory
+
+        Returns:
+            Dict mapping module names to markdown content, or empty dict if not found
+        """
+        if not api_dir.exists():
+            logger.debug(f"API reference directory not found: {api_dir}")
+            return {}
+
+        api_refs = {}
+        for md_file in api_dir.glob('*.md'):
+            try:
+                module_name = md_file.stem
+                api_refs[module_name] = md_file.read_text(encoding='utf-8')
+            except IOError as e:
+                logger.warning(f"Failed to read API reference {md_file}: {e}")
+
+        return api_refs
+
    def _run_c3_analysis(self, local_repo_path: str, source: Dict[str, Any]) -> Dict[str, Any]:
        """
        Run comprehensive C3.x codebase analysis.
@@ -358,9 +563,9 @@ class UnifiedScraper:
                depth='deep',
                languages=None,  # Analyze all languages
                file_patterns=source.get('file_patterns'),
-                build_api_reference=False,  # Not needed in skill
+                build_api_reference=True,   # C2.5: API Reference
                extract_comments=False,     # Not needed
-                build_dependency_graph=False,  # Can add later if needed
+                build_dependency_graph=True,  # C2.6: Dependency Graph
                detect_patterns=True,       # C3.1: Design patterns
                extract_test_examples=True, # C3.2: Test examples
                build_how_to_guides=True,   # C3.3: How-to guides
@@ -375,7 +580,9 @@ class UnifiedScraper:
                'test_examples': self._load_json(temp_output / 'test_examples' / 'test_examples.json'),
                'how_to_guides': self._load_guide_collection(temp_output / 'tutorials'),
                'config_patterns': self._load_json(temp_output / 'config_patterns' / 'config_patterns.json'),
-                'architecture': self._load_json(temp_output / 'architecture' / 'architectural_patterns.json')
+                'architecture': self._load_json(temp_output / 'architecture' / 'architectural_patterns.json'),
+                'api_reference': self._load_api_reference(temp_output / 'api_reference'),  # C2.5
+                'dependency_graph': self._load_json(temp_output / 'dependencies' / 'dependency_graph.json')  # C2.6
            }

            # Log summary
@@ -531,7 +738,8 @@ class UnifiedScraper:
            self.config,
            self.scraped_data,
            merged_data,
-            conflicts
+            conflicts,
+            cache_dir=self.cache_dir
        )

        builder.build()