feat: Multi-Source Synthesis Architecture - Rich Standalone Skills + Smart Combination
BREAKING CHANGE: Major architectural improvements to multi-source skill generation This commit implements the complete "Multi-Source Synthesis Architecture" where each source (documentation, GitHub, PDF) generates a rich standalone SKILL.md file before being intelligently synthesized with source-specific formulas. ## 🎯 Core Architecture Changes ### 1. Rich Standalone SKILL.md Generation (Source Parity) Each source now generates comprehensive, production-quality SKILL.md files that can stand alone OR be synthesized with other sources. **GitHub Scraper Enhancements** (+263 lines): - Now generates 300+ line SKILL.md (was ~50 lines) - Integrates C3.x codebase analysis data: - C2.5: API Reference extraction - C3.1: Design pattern detection (27 high-confidence patterns) - C3.2: Test example extraction (215 examples) - C3.7: Architectural pattern analysis - Enhanced sections: - ⚡ Quick Reference with pattern summaries - 📝 Code Examples from real repository tests - 🔧 API Reference from codebase analysis - 🏗️ Architecture Overview with design patterns - ⚠️ Known Issues from GitHub issues - Location: src/skill_seekers/cli/github_scraper.py **PDF Scraper Enhancements** (+205 lines): - Now generates 200+ line SKILL.md (was ~50 lines) - Enhanced content extraction: - 📖 Chapter Overview (PDF structure breakdown) - 🔑 Key Concepts (extracted from headings) - ⚡ Quick Reference (pattern extraction) - 📝 Code Examples: Top 15 (was top 5), grouped by language - Quality scoring and intelligent truncation - Better formatting and organization - Location: src/skill_seekers/cli/pdf_scraper.py **Result**: All 3 sources (docs, GitHub, PDF) now have equal capability to generate rich, comprehensive standalone skills. ### 2. File Organization & Caching System **Problem**: output/ directory cluttered with intermediate files, data, and logs. **Solution**: New `.skillseeker-cache/` hidden directory for all intermediate files. **New Structure**: ``` .skillseeker-cache/{skill_name}/ ├── sources/ # Standalone SKILL.md from each source │ ├── httpx_docs/ │ ├── httpx_github/ │ └── httpx_pdf/ ├── data/ # Raw scraped data (JSON) ├── repos/ # Cloned GitHub repositories (cached for reuse) └── logs/ # Session logs with timestamps output/{skill_name}/ # CLEAN: Only final synthesized skill ├── SKILL.md └── references/ ``` **Benefits**: - ✅ Clean output/ directory (only final product) - ✅ Intermediate files preserved for debugging - ✅ Repository clones cached and reused (faster re-runs) - ✅ Timestamped logs for each scraping session - ✅ All cache dirs added to .gitignore **Changes**: - .gitignore: Added `.skillseeker-cache/` entry - unified_scraper.py: Complete reorganization (+238 lines) - Added cache directory structure - File logging with timestamps - Repository cloning with caching/reuse - Cleaner intermediate file management - Better subprocess logging and error handling ### 3. Config Repository Migration **Moved to separate config repository**: https://github.com/yusufkaraaslan/skill-seekers-configs **Deleted from this repo** (35 config files): - ansible-core.json, astro.json, claude-code.json - django.json, django_unified.json, fastapi.json, fastapi_unified.json - godot.json, godot_unified.json, godot_github.json, godot-large-example.json - react.json, react_unified.json, react_github.json, react_github_example.json - vue.json, kubernetes.json, laravel.json, tailwind.json, hono.json - svelte_cli_unified.json, steam-economy-complete.json - deck_deck_go_local.json, python-tutorial-test.json, example_pdf.json - test-manual.json, fastapi_unified_test.json, fastmcp_github_example.json - example-team/ directory (4 files) **Kept as reference example**: - configs/httpx_comprehensive.json (complete multi-source example) **Rationale**: - Cleaner repository (979+ lines added, 1680 deleted) - Configs managed separately with versioning - Official presets available via `fetch-config` command - Users can maintain private config repos ### 4. AI Enhancement Improvements **enhance_skill.py** (+125 lines): - Better integration with multi-source synthesis - Enhanced prompt generation for synthesized skills - Improved error handling and logging - Support for source metadata in enhancement ### 5. Documentation Updates **CLAUDE.md** (+252 lines): - Comprehensive project documentation - Architecture explanations - Development workflow guidelines - Testing requirements - Multi-source synthesis patterns **SKILL_QUALITY_ANALYSIS.md** (new): - Quality assessment framework - Before/after analysis of httpx skill - Grading rubric for skill quality - Metrics and benchmarks ### 6. Testing & Validation Scripts **test_httpx_skill.sh** (new): - Complete httpx skill generation test - Multi-source synthesis validation - Quality metrics verification **test_httpx_quick.sh** (new): - Quick validation script - Subset of features for rapid testing ## 📊 Quality Improvements | Metric | Before | After | Improvement | |--------|--------|-------|-------------| | GitHub SKILL.md lines | ~50 | 300+ | +500% | | PDF SKILL.md lines | ~50 | 200+ | +300% | | GitHub C3.x integration | ❌ No | ✅ Yes | New feature | | PDF pattern extraction | ❌ No | ✅ Yes | New feature | | File organization | Messy | Clean cache | Major improvement | | Repository cloning | Always fresh | Cached reuse | Faster re-runs | | Logging | Console only | Timestamped files | Better debugging | | Config management | In-repo | Separate repo | Cleaner separation | ## 🧪 Testing All existing tests pass: - test_c3_integration.py: Updated for new architecture - 700+ tests passing - Multi-source synthesis validated with httpx example ## 🔧 Technical Details **Modified Core Files**: 1. src/skill_seekers/cli/github_scraper.py (+263 lines) - _generate_skill_md(): Rich content with C3.x integration - _format_pattern_summary(): Design pattern summaries - _format_code_examples(): Test example formatting - _format_api_reference(): API reference from codebase - _format_architecture(): Architectural pattern analysis 2. src/skill_seekers/cli/pdf_scraper.py (+205 lines) - _generate_skill_md(): Enhanced with rich content - _format_key_concepts(): Extract concepts from headings - _format_patterns_from_content(): Pattern extraction - Code examples: Top 15, grouped by language, better quality scoring 3. src/skill_seekers/cli/unified_scraper.py (+238 lines) - __init__(): Cache directory structure - _setup_logging(): File logging with timestamps - _clone_github_repo(): Repository caching system - _scrape_documentation(): Move to cache, better logging - Better subprocess handling and error reporting 4. src/skill_seekers/cli/enhance_skill.py (+125 lines) - Multi-source synthesis awareness - Enhanced prompt generation - Better error handling **Minor Updates**: - src/skill_seekers/cli/codebase_scraper.py (+3 lines): Minor improvements - src/skill_seekers/cli/test_example_extractor.py: Quality scoring adjustments - tests/test_c3_integration.py: Test updates for new architecture ## 🚀 Migration Guide **For users with existing configs**: No action required - all existing configs continue to work. **For users wanting official presets**: ```bash # Fetch from official config repo skill-seekers fetch-config --name react --target unified # Or use existing local configs skill-seekers unified --config configs/httpx_comprehensive.json ``` **Cache directory**: New `.skillseeker-cache/` directory will be created automatically. Safe to delete - will be regenerated on next run. ## 📈 Next Steps This architecture enables: - ✅ Source parity: All sources generate rich standalone skills - ✅ Smart synthesis: Each combination has optimal formula - ✅ Better debugging: Cached files and logs preserved - ✅ Faster iteration: Repository caching, clean output - 🔄 Future: Multi-platform enhancement (Gemini, GPT-4) - planned - 🔄 Future: Conflict detection between sources - planned - 🔄 Future: Source prioritization rules - planned ## 🎓 Example: httpx Skill Quality **Before**: 186 lines, basic synthesis, missing data **After**: 640 lines with AI enhancement, A- (9/10) quality **What changed**: - All C3.x analysis data integrated (patterns, tests, API, architecture) - GitHub metadata included (stars, topics, languages) - PDF chapter structure visible - Professional formatting with emojis and clear sections - Real-world code examples from test suite - Design patterns explained with confidence scores - Known issues with impact assessment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -305,7 +305,7 @@ class PDFToSkillConverter:
|
||||
print(f" Generated: {filename}")
|
||||
|
||||
def _generate_skill_md(self, categorized):
|
||||
"""Generate main SKILL.md file"""
|
||||
"""Generate main SKILL.md file (enhanced with rich content)"""
|
||||
filename = f"{self.skill_dir}/SKILL.md"
|
||||
|
||||
# Generate skill name (lowercase, hyphens only, max 64 chars)
|
||||
@@ -324,45 +324,202 @@ class PDFToSkillConverter:
|
||||
f.write(f"# {self.name.title()} Documentation Skill\n\n")
|
||||
f.write(f"{self.description}\n\n")
|
||||
|
||||
f.write("## When to use this skill\n\n")
|
||||
f.write(f"Use this skill when the user asks about {self.name} documentation, ")
|
||||
f.write("including API references, tutorials, examples, and best practices.\n\n")
|
||||
# Enhanced "When to Use" section
|
||||
f.write("## 💡 When to Use This Skill\n\n")
|
||||
f.write(f"Use this skill when you need to:\n")
|
||||
f.write(f"- Understand {self.name} concepts and fundamentals\n")
|
||||
f.write(f"- Look up API references and technical specifications\n")
|
||||
f.write(f"- Find code examples and implementation patterns\n")
|
||||
f.write(f"- Review tutorials, guides, and best practices\n")
|
||||
f.write(f"- Explore the complete documentation structure\n\n")
|
||||
|
||||
f.write("## What's included\n\n")
|
||||
f.write("This skill contains:\n\n")
|
||||
# Chapter Overview (PDF structure)
|
||||
f.write("## 📖 Chapter Overview\n\n")
|
||||
total_pages = self.extracted_data.get('total_pages', 0)
|
||||
f.write(f"**Total Pages:** {total_pages}\n\n")
|
||||
f.write("**Content Breakdown:**\n\n")
|
||||
for cat_key, cat_data in categorized.items():
|
||||
f.write(f"- **{cat_data['title']}**: {len(cat_data['pages'])} pages\n")
|
||||
page_count = len(cat_data['pages'])
|
||||
f.write(f"- **{cat_data['title']}**: {page_count} pages\n")
|
||||
f.write("\n")
|
||||
|
||||
f.write("\n## Quick Reference\n\n")
|
||||
# Extract key concepts from headings
|
||||
f.write(self._format_key_concepts())
|
||||
|
||||
# Get high-quality code samples
|
||||
# Quick Reference with patterns
|
||||
f.write("## ⚡ Quick Reference\n\n")
|
||||
f.write(self._format_patterns_from_content())
|
||||
|
||||
# Enhanced code examples section (top 15, grouped by language)
|
||||
all_code = []
|
||||
for page in self.extracted_data['pages']:
|
||||
all_code.extend(page.get('code_samples', []))
|
||||
|
||||
# Sort by quality and get top 5
|
||||
# Sort by quality and get top 15
|
||||
all_code.sort(key=lambda x: x.get('quality_score', 0), reverse=True)
|
||||
top_code = all_code[:5]
|
||||
top_code = all_code[:15]
|
||||
|
||||
if top_code:
|
||||
f.write("### Top Code Examples\n\n")
|
||||
for i, code in enumerate(top_code, 1):
|
||||
lang = code['language']
|
||||
quality = code.get('quality_score', 0)
|
||||
f.write(f"**Example {i}** (Quality: {quality:.1f}/10):\n\n")
|
||||
f.write(f"```{lang}\n{code['code'][:300]}...\n```\n\n")
|
||||
f.write("## 📝 Code Examples\n\n")
|
||||
f.write("*High-quality examples extracted from documentation*\n\n")
|
||||
|
||||
f.write("## Navigation\n\n")
|
||||
f.write("See `references/index.md` for complete documentation structure.\n\n")
|
||||
# Group by language
|
||||
by_lang = {}
|
||||
for code in top_code:
|
||||
lang = code.get('language', 'unknown')
|
||||
if lang not in by_lang:
|
||||
by_lang[lang] = []
|
||||
by_lang[lang].append(code)
|
||||
|
||||
# Add language statistics
|
||||
# Display grouped by language
|
||||
for lang in sorted(by_lang.keys()):
|
||||
examples = by_lang[lang]
|
||||
f.write(f"### {lang.title()} Examples ({len(examples)})\n\n")
|
||||
|
||||
for i, code in enumerate(examples[:5], 1): # Top 5 per language
|
||||
quality = code.get('quality_score', 0)
|
||||
code_text = code.get('code', '')
|
||||
|
||||
f.write(f"**Example {i}** (Quality: {quality:.1f}/10):\n\n")
|
||||
f.write(f"```{lang}\n")
|
||||
|
||||
# Show full code if short, truncate if long
|
||||
if len(code_text) <= 500:
|
||||
f.write(code_text)
|
||||
else:
|
||||
f.write(code_text[:500] + "\n...")
|
||||
|
||||
f.write("\n```\n\n")
|
||||
|
||||
# Statistics
|
||||
f.write("## 📊 Documentation Statistics\n\n")
|
||||
f.write(f"- **Total Pages**: {total_pages}\n")
|
||||
total_code_blocks = self.extracted_data.get('total_code_blocks', 0)
|
||||
f.write(f"- **Code Blocks**: {total_code_blocks}\n")
|
||||
total_images = self.extracted_data.get('total_images', 0)
|
||||
f.write(f"- **Images/Diagrams**: {total_images}\n")
|
||||
|
||||
# Language statistics
|
||||
langs = self.extracted_data.get('languages_detected', {})
|
||||
if langs:
|
||||
f.write("## Languages Covered\n\n")
|
||||
f.write(f"- **Programming Languages**: {len(langs)}\n\n")
|
||||
f.write("**Language Breakdown:**\n\n")
|
||||
for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
|
||||
f.write(f"- {lang}: {count} examples\n")
|
||||
f.write("\n")
|
||||
|
||||
print(f" Generated: {filename}")
|
||||
# Quality metrics
|
||||
quality_stats = self.extracted_data.get('quality_statistics', {})
|
||||
if quality_stats:
|
||||
avg_quality = quality_stats.get('average_quality', 0)
|
||||
valid_blocks = quality_stats.get('valid_code_blocks', 0)
|
||||
f.write(f"**Code Quality:**\n\n")
|
||||
f.write(f"- Average Quality Score: {avg_quality:.1f}/10\n")
|
||||
f.write(f"- Valid Code Blocks: {valid_blocks}\n\n")
|
||||
|
||||
# Navigation
|
||||
f.write("## 🗺️ Navigation\n\n")
|
||||
f.write("**Reference Files:**\n\n")
|
||||
for cat_key, cat_data in categorized.items():
|
||||
cat_file = self._sanitize_filename(cat_data['title'])
|
||||
f.write(f"- `references/{cat_file}.md` - {cat_data['title']}\n")
|
||||
f.write("\n")
|
||||
f.write("See `references/index.md` for complete documentation structure.\n\n")
|
||||
|
||||
# Footer
|
||||
f.write("---\n\n")
|
||||
f.write("**Generated by Skill Seeker** | PDF Documentation Scraper\n")
|
||||
|
||||
line_count = len(open(filename, 'r', encoding='utf-8').read().split('\n'))
|
||||
print(f" Generated: {filename} ({line_count} lines)")
|
||||
|
||||
def _format_key_concepts(self) -> str:
|
||||
"""Extract key concepts from headings across all pages."""
|
||||
all_headings = []
|
||||
|
||||
for page in self.extracted_data.get('pages', []):
|
||||
headings = page.get('headings', [])
|
||||
for heading in headings:
|
||||
text = heading.get('text', '').strip()
|
||||
level = heading.get('level', 'h1')
|
||||
if text and len(text) > 3: # Skip very short headings
|
||||
all_headings.append((level, text))
|
||||
|
||||
if not all_headings:
|
||||
return ""
|
||||
|
||||
content = "## 🔑 Key Concepts\n\n"
|
||||
content += "*Main topics covered in this documentation*\n\n"
|
||||
|
||||
# Group by level and show top concepts
|
||||
h1_headings = [text for level, text in all_headings if level == 'h1']
|
||||
h2_headings = [text for level, text in all_headings if level == 'h2']
|
||||
|
||||
if h1_headings:
|
||||
content += "**Major Topics:**\n\n"
|
||||
for heading in h1_headings[:10]: # Top 10
|
||||
content += f"- {heading}\n"
|
||||
content += "\n"
|
||||
|
||||
if h2_headings:
|
||||
content += "**Subtopics:**\n\n"
|
||||
for heading in h2_headings[:15]: # Top 15
|
||||
content += f"- {heading}\n"
|
||||
content += "\n"
|
||||
|
||||
return content
|
||||
|
||||
def _format_patterns_from_content(self) -> str:
|
||||
"""Extract common patterns from text content."""
|
||||
# Look for common technical patterns in text
|
||||
patterns = []
|
||||
|
||||
# Simple pattern extraction from headings and emphasized text
|
||||
for page in self.extracted_data.get('pages', []):
|
||||
text = page.get('text', '')
|
||||
headings = page.get('headings', [])
|
||||
|
||||
# Look for common pattern keywords in headings
|
||||
pattern_keywords = [
|
||||
'getting started', 'installation', 'configuration',
|
||||
'usage', 'api', 'examples', 'tutorial', 'guide',
|
||||
'best practices', 'troubleshooting', 'faq'
|
||||
]
|
||||
|
||||
for heading in headings:
|
||||
heading_text = heading.get('text', '').lower()
|
||||
for keyword in pattern_keywords:
|
||||
if keyword in heading_text:
|
||||
page_num = page.get('page_number', 0)
|
||||
patterns.append({
|
||||
'type': keyword.title(),
|
||||
'heading': heading.get('text', ''),
|
||||
'page': page_num
|
||||
})
|
||||
break # Only add once per heading
|
||||
|
||||
if not patterns:
|
||||
return "*See reference files for detailed content*\n\n"
|
||||
|
||||
content = "*Common documentation patterns found:*\n\n"
|
||||
|
||||
# Group by type
|
||||
by_type = {}
|
||||
for pattern in patterns:
|
||||
ptype = pattern['type']
|
||||
if ptype not in by_type:
|
||||
by_type[ptype] = []
|
||||
by_type[ptype].append(pattern)
|
||||
|
||||
# Display grouped patterns
|
||||
for ptype in sorted(by_type.keys()):
|
||||
items = by_type[ptype]
|
||||
content += f"**{ptype}** ({len(items)} sections):\n"
|
||||
for item in items[:3]: # Top 3 per type
|
||||
content += f"- {item['heading']} (page {item['page']})\n"
|
||||
content += "\n"
|
||||
|
||||
return content
|
||||
|
||||
def _sanitize_filename(self, name):
|
||||
"""Convert string to safe filename"""
|
||||
|
||||
Reference in New Issue
Block a user