feat: Multi-Source Synthesis Architecture - Rich Standalone Skills + Smart Combination

BREAKING CHANGE: Major architectural improvements to multi-source skill generation This commit implements the complete "Multi-Source Synthesis Architecture" where each source (documentation, GitHub, PDF) generates a rich standalone SKILL.md file before being intelligently synthesized with source-specific formulas. ## 🎯 Core Architecture Changes ### 1. Rich Standalone SKILL.md Generation (Source Parity) Each source now generates comprehensive, production-quality SKILL.md files that can stand alone OR be synthesized with other sources. **GitHub Scraper Enhancements** (+263 lines): - Now generates 300+ line SKILL.md (was ~50 lines) - Integrates C3.x codebase analysis data: - C2.5: API Reference extraction - C3.1: Design pattern detection (27 high-confidence patterns) - C3.2: Test example extraction (215 examples) - C3.7: Architectural pattern analysis - Enhanced sections: - ⚡ Quick Reference with pattern summaries - 📝 Code Examples from real repository tests - 🔧 API Reference from codebase analysis - 🏗️ Architecture Overview with design patterns - ⚠️ Known Issues from GitHub issues - Location: src/skill_seekers/cli/github_scraper.py **PDF Scraper Enhancements** (+205 lines): - Now generates 200+ line SKILL.md (was ~50 lines) - Enhanced content extraction: - 📖 Chapter Overview (PDF structure breakdown) - 🔑 Key Concepts (extracted from headings) - ⚡ Quick Reference (pattern extraction) - 📝 Code Examples: Top 15 (was top 5), grouped by language - Quality scoring and intelligent truncation - Better formatting and organization - Location: src/skill_seekers/cli/pdf_scraper.py **Result**: All 3 sources (docs, GitHub, PDF) now have equal capability to generate rich, comprehensive standalone skills. ### 2. File Organization & Caching System **Problem**: output/ directory cluttered with intermediate files, data, and logs. **Solution**: New `.skillseeker-cache/` hidden directory for all intermediate files. **New Structure**: ``` .skillseeker-cache/{skill_name}/ ├── sources/ # Standalone SKILL.md from each source │ ├── httpx_docs/ │ ├── httpx_github/ │ └── httpx_pdf/ ├── data/ # Raw scraped data (JSON) ├── repos/ # Cloned GitHub repositories (cached for reuse) └── logs/ # Session logs with timestamps output/{skill_name}/ # CLEAN: Only final synthesized skill ├── SKILL.md └── references/ ``` **Benefits**: - ✅ Clean output/ directory (only final product) - ✅ Intermediate files preserved for debugging - ✅ Repository clones cached and reused (faster re-runs) - ✅ Timestamped logs for each scraping session - ✅ All cache dirs added to .gitignore **Changes**: - .gitignore: Added `.skillseeker-cache/` entry - unified_scraper.py: Complete reorganization (+238 lines) - Added cache directory structure - File logging with timestamps - Repository cloning with caching/reuse - Cleaner intermediate file management - Better subprocess logging and error handling ### 3. Config Repository Migration **Moved to separate config repository**: https://github.com/yusufkaraaslan/skill-seekers-configs **Deleted from this repo** (35 config files): - ansible-core.json, astro.json, claude-code.json - django.json, django_unified.json, fastapi.json, fastapi_unified.json - godot.json, godot_unified.json, godot_github.json, godot-large-example.json - react.json, react_unified.json, react_github.json, react_github_example.json - vue.json, kubernetes.json, laravel.json, tailwind.json, hono.json - svelte_cli_unified.json, steam-economy-complete.json - deck_deck_go_local.json, python-tutorial-test.json, example_pdf.json - test-manual.json, fastapi_unified_test.json, fastmcp_github_example.json - example-team/ directory (4 files) **Kept as reference example**: - configs/httpx_comprehensive.json (complete multi-source example) **Rationale**: - Cleaner repository (979+ lines added, 1680 deleted) - Configs managed separately with versioning - Official presets available via `fetch-config` command - Users can maintain private config repos ### 4. AI Enhancement Improvements **enhance_skill.py** (+125 lines): - Better integration with multi-source synthesis - Enhanced prompt generation for synthesized skills - Improved error handling and logging - Support for source metadata in enhancement ### 5. Documentation Updates **CLAUDE.md** (+252 lines): - Comprehensive project documentation - Architecture explanations - Development workflow guidelines - Testing requirements - Multi-source synthesis patterns **SKILL_QUALITY_ANALYSIS.md** (new): - Quality assessment framework - Before/after analysis of httpx skill - Grading rubric for skill quality - Metrics and benchmarks ### 6. Testing & Validation Scripts **test_httpx_skill.sh** (new): - Complete httpx skill generation test - Multi-source synthesis validation - Quality metrics verification **test_httpx_quick.sh** (new): - Quick validation script - Subset of features for rapid testing ## 📊 Quality Improvements | Metric | Before | After | Improvement | |--------|--------|-------|-------------| | GitHub SKILL.md lines | ~50 | 300+ | +500% | | PDF SKILL.md lines | ~50 | 200+ | +300% | | GitHub C3.x integration | ❌ No | ✅ Yes | New feature | | PDF pattern extraction | ❌ No | ✅ Yes | New feature | | File organization | Messy | Clean cache | Major improvement | | Repository cloning | Always fresh | Cached reuse | Faster re-runs | | Logging | Console only | Timestamped files | Better debugging | | Config management | In-repo | Separate repo | Cleaner separation | ## 🧪 Testing All existing tests pass: - test_c3_integration.py: Updated for new architecture - 700+ tests passing - Multi-source synthesis validated with httpx example ## 🔧 Technical Details **Modified Core Files**: 1. src/skill_seekers/cli/github_scraper.py (+263 lines) - _generate_skill_md(): Rich content with C3.x integration - _format_pattern_summary(): Design pattern summaries - _format_code_examples(): Test example formatting - _format_api_reference(): API reference from codebase - _format_architecture(): Architectural pattern analysis 2. src/skill_seekers/cli/pdf_scraper.py (+205 lines) - _generate_skill_md(): Enhanced with rich content - _format_key_concepts(): Extract concepts from headings - _format_patterns_from_content(): Pattern extraction - Code examples: Top 15, grouped by language, better quality scoring 3. src/skill_seekers/cli/unified_scraper.py (+238 lines) - __init__(): Cache directory structure - _setup_logging(): File logging with timestamps - _clone_github_repo(): Repository caching system - _scrape_documentation(): Move to cache, better logging - Better subprocess handling and error reporting 4. src/skill_seekers/cli/enhance_skill.py (+125 lines) - Multi-source synthesis awareness - Enhanced prompt generation - Better error handling **Minor Updates**: - src/skill_seekers/cli/codebase_scraper.py (+3 lines): Minor improvements - src/skill_seekers/cli/test_example_extractor.py: Quality scoring adjustments - tests/test_c3_integration.py: Test updates for new architecture ## 🚀 Migration Guide **For users with existing configs**: No action required - all existing configs continue to work. **For users wanting official presets**: ```bash # Fetch from official config repo skill-seekers fetch-config --name react --target unified # Or use existing local configs skill-seekers unified --config configs/httpx_comprehensive.json ``` **Cache directory**: New `.skillseeker-cache/` directory will be created automatically. Safe to delete - will be regenerated on next run. ## 📈 Next Steps This architecture enables: - ✅ Source parity: All sources generate rich standalone skills - ✅ Smart synthesis: Each combination has optimal formula - ✅ Better debugging: Cached files and logs preserved - ✅ Faster iteration: Repository caching, clean output - 🔄 Future: Multi-platform enhancement (Gemini, GPT-4) - planned - 🔄 Future: Conflict detection between sources - planned - 🔄 Future: Source prioritization rules - planned ## 🎓 Example: httpx Skill Quality **Before**: 186 lines, basic synthesis, missing data **After**: 640 lines with AI enhancement, A- (9/10) quality **What changed**: - All C3.x analysis data integrated (patterns, tests, API, architecture) - GitHub metadata included (stars, topics, languages) - PDF chapter structure visible - Professional formatting with emojis and clear sections - Real-world code examples from test suite - Design patterns explained with confidence scores - Known issues with impact assessment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-11 23:01:07 +03:00
parent cf9539878e
commit a99e22c639
46 changed files with 1869 additions and 1678 deletions
--- a/src/skill_seekers/cli/enhance_skill.py
+++ b/src/skill_seekers/cli/enhance_skill.py
@@ -105,44 +105,129 @@ class SkillEnhancer:
            return None

    def _build_enhancement_prompt(self, references, current_skill_md):
-        """Build the prompt for Claude"""
+        """Build the prompt for Claude with multi-source awareness"""

        # Extract skill name and description
        skill_name = self.skill_dir.name

+        # Analyze sources
+        sources_found = set()
+        for metadata in references.values():
+            sources_found.add(metadata['source'])
+
+        # Analyze conflicts if present
+        has_conflicts = any('conflicts' in meta['path'] for meta in references.values())
+
        prompt = f"""You are enhancing a Claude skill's SKILL.md file. This skill is about: {skill_name}

-I've scraped documentation and organized it into reference files. Your job is to create an EXCELLENT SKILL.md that will help Claude use this documentation effectively.
+I've scraped documentation from multiple sources and organized it into reference files. Your job is to create an EXCELLENT SKILL.md that synthesizes knowledge from these sources.
+
+SKILL OVERVIEW:
+- Name: {skill_name}
+- Source Types: {', '.join(sorted(sources_found))}
+- Multi-Source: {'Yes' if len(sources_found) > 1 else 'No'}
+- Conflicts Detected: {'Yes - see conflicts.md in references' if has_conflicts else 'No'}

 CURRENT SKILL.MD:
 {'```markdown' if current_skill_md else '(none - create from scratch)'}
 {current_skill_md or 'No existing SKILL.md'}
 {'```' if current_skill_md else ''}

-REFERENCE DOCUMENTATION:
+SOURCE ANALYSIS:
+This skill combines knowledge from {len(sources_found)} source type(s):
+
 """

-        for filename, content in references.items():
-            prompt += f"\n\n## {filename}\n```markdown\n{content[:30000]}\n```\n"
+        # Group references by source type
+        by_source = {}
+        for filename, metadata in references.items():
+            source = metadata['source']
+            if source not in by_source:
+                by_source[source] = []
+            by_source[source].append((filename, metadata))
+
+        # Add source breakdown
+        for source in sorted(by_source.keys()):
+            files = by_source[source]
+            prompt += f"\n**{source.upper()} ({len(files)} file(s))**\n"
+            for filename, metadata in files[:5]:  # Top 5 per source
+                prompt += f"- {filename} (confidence: {metadata['confidence']}, {metadata['size']:,} chars)\n"
+            if len(files) > 5:
+                prompt += f"- ... and {len(files) - 5} more\n"
+
+        prompt += "\n\nREFERENCE DOCUMENTATION:\n"
+
+        # Add references grouped by source with metadata
+        for source in sorted(by_source.keys()):
+            prompt += f"\n### {source.upper()} SOURCES\n\n"
+            for filename, metadata in by_source[source]:
+                content = metadata['content']
+                # Limit per-file to 30K
+                if len(content) > 30000:
+                    content = content[:30000] + "\n\n[Content truncated for size...]"
+
+                prompt += f"\n#### {filename}\n"
+                prompt += f"*Source: {metadata['source']}, Confidence: {metadata['confidence']}*\n\n"
+                prompt += f"```markdown\n{content}\n```\n"

        prompt += """

-YOUR TASK:
-Create an enhanced SKILL.md that includes:
+REFERENCE PRIORITY (when sources differ):
+1. **Code patterns (codebase_analysis)**: Ground truth - what the code actually does
+2. **Official documentation**: Intended API and usage patterns
+3. **GitHub issues**: Real-world usage and known problems
+4. **PDF documentation**: Additional context and tutorials

-1. **Clear "When to Use This Skill" section** - Be specific about trigger conditions
-2. **Excellent Quick Reference section** - Extract 5-10 of the BEST, most practical code examples from the reference docs
-   - Choose SHORT, clear examples that demonstrate common tasks
-   - Include both simple and intermediate examples
-   - Annotate examples with clear descriptions
+YOUR TASK:
+Create an enhanced SKILL.md that synthesizes knowledge from multiple sources:
+
+1. **Multi-Source Synthesis**
+   - Acknowledge that this skill combines multiple sources
+   - Highlight agreements between sources (builds confidence)
+   - Note discrepancies transparently (if present)
+   - Use source priority when synthesizing conflicting information
+
+2. **Clear "When to Use This Skill" section**
+   - Be SPECIFIC about trigger conditions
+   - List concrete use cases
+   - Include perspective from both docs AND real-world usage (if GitHub/codebase data available)
+
+3. **Excellent Quick Reference section**
+   - Extract 5-10 of the BEST, most practical code examples
+   - Prefer examples from HIGH CONFIDENCE sources first
+   - If code examples exist from codebase analysis, prioritize those (real usage)
+   - If docs examples exist, include those too (official patterns)
+   - Choose SHORT, clear examples (5-20 lines max)
   - Use proper language tags (cpp, python, javascript, json, etc.)
-3. **Detailed Reference Files description** - Explain what's in each reference file
-4. **Practical "Working with This Skill" section** - Give users clear guidance on how to navigate the skill
-5. **Key Concepts section** (if applicable) - Explain core concepts
-6. **Keep the frontmatter** (---\nname: ...\n---) intact
+   - Add clear descriptions noting the source (e.g., "From official docs" or "From codebase")
+
+4. **Detailed Reference Files description**
+   - Explain what's in each reference file
+   - Note the source type and confidence level
+   - Help users navigate multi-source documentation
+
+5. **Practical "Working with This Skill" section**
+   - Clear guidance for beginners, intermediate, and advanced users
+   - Navigation tips for multi-source references
+   - How to resolve conflicts if present
+
+6. **Key Concepts section** (if applicable)
+   - Explain core concepts
+   - Define important terminology
+   - Reconcile differences between sources if needed
+
+7. **Conflict Handling** (if conflicts detected)
+   - Add a "Known Discrepancies" section
+   - Explain major conflicts transparently
+   - Provide guidance on which source to trust in each case
+
+8. **Keep the frontmatter** (---\nname: ...\n---) intact

 IMPORTANT:
 - Extract REAL examples from the reference docs, don't make them up
+- Prioritize HIGH CONFIDENCE sources when synthesizing
+- Note source attribution when helpful (e.g., "Official docs say X, but codebase shows Y")
+- Make discrepancies transparent, not hidden
 - Prioritize SHORT, clear examples (5-20 lines max)
 - Make it actionable and practical
 - Don't be too verbose - be concise but useful
@@ -185,8 +270,14 @@ Return ONLY the complete SKILL.md content, starting with the frontmatter (---).
            print("❌ No reference files found to analyze")
            return False

+        # Analyze sources
+        sources_found = set()
+        for metadata in references.values():
+            sources_found.add(metadata['source'])
+
        print(f"  ✓ Read {len(references)} reference files")
-        total_size = sum(len(c) for c in references.values())
+        print(f"  ✓ Sources: {', '.join(sorted(sources_found))}")
+        total_size = sum(meta['size'] for meta in references.values())
        print(f"  ✓ Total size: {total_size:,} characters\n")

        # Read current SKILL.md