diff --git a/docs/SKILL_QUALITY_FIX_PLAN.md b/docs/SKILL_QUALITY_FIX_PLAN.md new file mode 100644 index 0000000..b21bc28 --- /dev/null +++ b/docs/SKILL_QUALITY_FIX_PLAN.md @@ -0,0 +1,404 @@ +# Skill Quality Fix Plan + +**Created:** 2026-01-11 +**Status:** Not Started +**Priority:** P0 - Blocking Production Use + +--- + +## ๐ŸŽฏ Executive Summary + +The multi-source synthesis architecture successfully: +- โœ… Organizes files cleanly (.skillseeker-cache/ + output/) +- โœ… Collects C3.x codebase analysis data +- โœ… Moves files correctly to cache + +But produces poor quality output: +- โŒ Synthesis doesn't truly merge (loses content) +- โŒ Content formatting is broken (walls of text) +- โŒ AI enhancement reads only 13KB out of 30KB references +- โŒ Many accuracy and duplication issues + +**Bottom Line:** The engine works, but the output is unusable. + +--- + +## ๐Ÿ“Š Quality Assessment + +### Current State +| Aspect | Score | Status | +|--------|-------|--------| +| File organization | 10/10 | โœ… Excellent | +| C3.x data collection | 9/10 | โœ… Very Good | +| **Synthesis logic** | **3/10** | โŒ **Failing** | +| **Content formatting** | **2/10** | โŒ **Failing** | +| **AI enhancement** | **2/10** | โŒ **Failing** | +| Overall usability | 4/10 | โŒ Poor | + +--- + +## ๐Ÿ”ด P0: Critical Blocking Issues + +### Issue 1: Synthesis Doesn't Merge Content +**File:** `src/skill_seekers/cli/unified_skill_builder.py` +**Lines:** 73-162 (`_generate_skill_md`) + +**Problem:** +- Docs source: 155 lines +- GitHub source: 255 lines +- **Output: only 186 lines** (should be ~300-400) + +Missing from output: +- GitHub repository metadata (stars, topics, last updated) +- Detailed API reference sections +- Language statistics (says "1 file" instead of "54 files") +- Most C3.x analysis details + +**Root Cause:** Synthesis just concatenates specific sections instead of intelligently merging all content. + +**Fix Required:** +1. Implement proper section-by-section synthesis +2. Merge "When to Use" sections from both sources +3. Combine "Quick Reference" from both +4. Add GitHub metadata to intro +5. Merge code examples (docs + codebase) +6. Include comprehensive API reference links + +**Files to Modify:** +- `unified_skill_builder.py:_generate_skill_md()` +- `unified_skill_builder.py:_synthesize_docs_github()` + +--- + +### Issue 2: Pattern Formatting is Unreadable +**File:** `output/httpx/SKILL.md` +**Lines:** 42-64, 69 + +**Problem:** +```markdown +**Pattern 1:** httpx.request(method, url, *, params=None, content=None, data=None, files=None, json=None, headers=None, cookies=None, auth=None, proxy=None, timeout=Timeout(timeout=5.0), follow_redirects=False, verify=True, trust_env=True) Sends an HTTP request... +``` + +- 600+ character single line +- All parameters run together +- No structure +- Completely unusable by LLM + +**Fix Required:** +1. Format API patterns with proper structure: +```markdown +### `httpx.request()` + +**Signature:** +```python +httpx.request( + method, url, *, + params=None, + content=None, + ... +) +``` + +**Parameters:** +- `method`: HTTP method (GET, POST, PUT, etc.) +- `url`: Target URL +- `params`: (optional) Query parameters +... + +**Returns:** Response object + +**Example:** +```python +>>> import httpx +>>> response = httpx.request('GET', 'https://httpbin.org/get') +``` +``` + +**Files to Modify:** +- `doc_scraper.py:extract_patterns()` - Fix pattern extraction +- `doc_scraper.py:_format_pattern()` - Add proper formatting method + +--- + +### Issue 3: AI Enhancement Missing 57% of References +**File:** `src/skill_seekers/cli/utils.py` +**Lines:** 274-275 + +**Problem:** +```python +if ref_file.name == "index.md": + continue # SKIPS ALL INDEX FILES! +``` + +**Impact:** +- Reads: 13KB (43% of content) + - ARCHITECTURE.md + - issues.md + - README.md + - releases.md +- **Skips: 17KB (57% of content)** + - patterns/index.md (10.5KB) โ† HUGE! + - examples/index.md (5KB) + - configuration/index.md (933B) + - guides/index.md + - documentation/index.md + +**Result:** +``` +โœ“ Read 4 reference files +โœ“ Total size: 24 characters โ† WRONG! Should be ~30KB +``` + +**Fix Required:** +1. Remove the index.md skip logic +2. Or rename files: index.md โ†’ patterns.md, examples.md, etc. +3. Update unified_skill_builder to use non-index names + +**Files to Modify:** +- `utils.py:read_reference_files()` line 274-275 +- `unified_skill_builder.py:_generate_references()` - Fix file naming + +--- + +## ๐ŸŸก P1: Major Quality Issues + +### Issue 4: "httpx_docs" Text Not Replaced +**File:** `output/httpx/SKILL.md` +**Lines:** 20-24 + +**Problem:** +```markdown +- Working with httpx_docs โ† Should be "httpx" +- Asking about httpx_docs features โ† Should be "httpx" +``` + +**Root Cause:** Docs source SKILL.md has placeholder `{name}` that's not replaced during synthesis. + +**Fix Required:** +1. Add text replacement in synthesis: `httpx_docs` โ†’ `httpx` +2. Or fix doc_scraper template to use correct name + +**Files to Modify:** +- `unified_skill_builder.py:_synthesize_docs_github()` - Add replacement +- Or `doc_scraper.py` template + +--- + +### Issue 5: Duplicate Examples +**File:** `output/httpx/SKILL.md` +**Lines:** 133-143 + +**Problem:** +Exact same Cookie example shown twice in a row. + +**Fix Required:** +Deduplicate examples during synthesis. + +**Files to Modify:** +- `unified_skill_builder.py:_synthesize_docs_github()` - Add deduplication + +--- + +### Issue 6: Wrong Language Tags +**File:** `output/httpx/SKILL.md` +**Lines:** 97-125 + +**Problem:** +```markdown +**Example 1** (typescript): โ† WRONG, it's Python! +```typescript +with httpx.Client(proxy="http://localhost:8030"): +``` + +**Example 3** (jsx): โ† WRONG, it's Python! +```jsx +>>> import httpx +``` + +**Root Cause:** Doc scraper's language detection is failing. + +**Fix Required:** +Improve `detect_language()` function in doc_scraper.py. + +**Files to Modify:** +- `doc_scraper.py:detect_language()` - Better heuristics + +--- + +### Issue 7: Language Stats Wrong in Architecture +**File:** `output/httpx/references/codebase_analysis/ARCHITECTURE.md` +**Lines:** 11-13 + +**Problem:** +```markdown +- Python: 1 files โ† Should be "54 files" +- Shell: 1 files โ† Should be "6 files" +``` + +**Root Cause:** Aggregation logic counting file types instead of files. + +**Fix Required:** +Fix language counting in architecture generation. + +**Files to Modify:** +- `unified_skill_builder.py:_generate_codebase_analysis_references()` + +--- + +### Issue 8: API Reference Section Incomplete +**File:** `output/httpx/SKILL.md` +**Lines:** 145-157 + +**Problem:** +Only shows `test_main.py` as example, then cuts off with "---". + +Should link to all 54 API reference modules. + +**Fix Required:** +Generate proper API reference index with links. + +**Files to Modify:** +- `unified_skill_builder.py:_synthesize_docs_github()` - Add API index + +--- + +## ๐Ÿ“ Implementation Phases + +### Phase 1: Fix AI Enhancement (30 min) +**Priority:** P0 - Blocks all AI improvements + +**Tasks:** +1. Fix `utils.py` to not skip index.md files +2. Or rename reference files to avoid "index.md" +3. Verify enhancement reads all 30KB of references +4. Test enhancement actually updates SKILL.md + +**Test:** +```bash +skill-seekers enhance output/httpx/ --mode local +# Should show: "Total size: ~30,000 characters" +# Should update SKILL.md successfully +``` + +--- + +### Phase 2: Fix Content Synthesis (90 min) +**Priority:** P0 - Core functionality + +**Tasks:** +1. Rewrite `_synthesize_docs_github()` to truly merge +2. Add section-by-section merging logic +3. Include GitHub metadata in intro +4. Merge "When to Use" sections +5. Combine quick reference sections +6. Add API reference index with all modules +7. Fix "httpx_docs" โ†’ "httpx" replacement +8. Deduplicate examples + +**Test:** +```bash +skill-seekers unified --config configs/httpx_comprehensive.json +wc -l output/httpx/SKILL.md # Should be 300-400 lines +grep "httpx_docs" output/httpx/SKILL.md # Should return nothing +``` + +--- + +### Phase 3: Fix Content Formatting (60 min) +**Priority:** P0 - Makes output usable + +**Tasks:** +1. Fix pattern extraction to format properly +2. Add `_format_pattern()` method with structure +3. Break long lines into readable format +4. Add proper parameter formatting +5. Fix code block language detection + +**Test:** +```bash +# Check pattern readability +head -100 output/httpx/SKILL.md +# Should see nicely formatted patterns, not walls of text +``` + +--- + +### Phase 4: Fix Data Accuracy (45 min) +**Priority:** P1 - Quality polish + +**Tasks:** +1. Fix language statistics aggregation +2. Complete API reference section +3. Improve language tag detection + +**Test:** +```bash +# Check accuracy +grep "Python: " output/httpx/references/codebase_analysis/ARCHITECTURE.md +# Should say "54 files" not "1 files" +``` + +--- + +## ๐Ÿ“Š Success Metrics + +### Before Fixes +- Synthesis quality: 3/10 +- Content usability: 2/10 +- AI enhancement success: 0% (doesn't update file) +- Reference coverage: 43% (skips 57%) + +### After Fixes (Target) +- Synthesis quality: 8/10 +- Content usability: 9/10 +- AI enhancement success: 90%+ +- Reference coverage: 100% + +### Acceptance Criteria +1. โœ… SKILL.md is 300-400 lines (not 186) +2. โœ… No "httpx_docs" placeholders +3. โœ… Patterns are readable (not walls of text) +4. โœ… AI enhancement reads all 30KB references +5. โœ… AI enhancement successfully updates SKILL.md +6. โœ… No duplicate examples +7. โœ… Correct language tags +8. โœ… Accurate statistics (54 files, not 1) +9. โœ… Complete API reference section +10. โœ… GitHub metadata included (stars, topics) + +--- + +## ๐Ÿš€ Execution Plan + +### Day 1: Fix Blockers +1. Phase 1: Fix AI enhancement (30 min) +2. Phase 2: Fix synthesis (90 min) +3. Test end-to-end (30 min) + +### Day 2: Polish Quality +4. Phase 3: Fix formatting (60 min) +5. Phase 4: Fix accuracy (45 min) +6. Final testing (45 min) + +**Total estimated time:** ~6 hours + +--- + +## ๐Ÿ“Œ Notes + +### Why This Matters +The infrastructure is excellent, but users will judge based on the final SKILL.md quality. Currently, it's not production-ready. + +### Risk Assessment +**Low risk** - All fixes are isolated to specific functions. Won't break existing file organization or C3.x collection. + +### Testing Strategy +Test with httpx (current), then validate with: +- React (docs + GitHub) +- Django (docs + GitHub) +- FastAPI (docs + GitHub) + +--- + +**Plan Status:** Ready for implementation +**Estimated Completion:** 2 days (6 hours total work) diff --git a/src/skill_seekers/cli/doc_scraper.py b/src/skill_seekers/cli/doc_scraper.py index 74b1ee0..338bb41 100755 --- a/src/skill_seekers/cli/doc_scraper.py +++ b/src/skill_seekers/cli/doc_scraper.py @@ -1121,7 +1121,13 @@ This skill should be triggered when: # Add actual quick reference patterns if quick_ref: for i, pattern in enumerate(quick_ref[:8], 1): - content += f"**Pattern {i}:** {pattern.get('description', 'Example pattern')}\n\n" + desc = pattern.get('description', 'Example pattern') + # Format description: extract first sentence, truncate if too long + first_sentence = desc.split('.')[0] if '.' in desc else desc + if len(first_sentence) > 150: + first_sentence = first_sentence[:147] + '...' + + content += f"**Pattern {i}:** {first_sentence}\n\n" content += "```\n" content += pattern.get('code', '')[:300] content += "\n```\n\n" diff --git a/src/skill_seekers/cli/enhance_skill_local.py b/src/skill_seekers/cli/enhance_skill_local.py index e262c76..e1959d7 100644 --- a/src/skill_seekers/cli/enhance_skill_local.py +++ b/src/skill_seekers/cli/enhance_skill_local.py @@ -195,7 +195,7 @@ class LocalSkillEnhancer: summarization_ratio: Target size ratio when summarizing (0.3 = 30%) """ - # Read reference files + # Read reference files (with enriched metadata) references = read_reference_files( self.skill_dir, max_chars=LOCAL_CONTENT_LIMIT, @@ -206,8 +206,13 @@ class LocalSkillEnhancer: print("โŒ No reference files found") return None + # Analyze sources + sources_found = set() + for metadata in references.values(): + sources_found.add(metadata['source']) + # Calculate total size - total_ref_size = sum(len(c) for c in references.values()) + total_ref_size = sum(meta['size'] for meta in references.values()) # Apply summarization if requested or if content is too large if use_summarization or total_ref_size > 30000: @@ -217,13 +222,12 @@ class LocalSkillEnhancer: print() # Summarize each reference - summarized_refs = {} - for filename, content in references.items(): - summarized = self.summarize_reference(content, summarization_ratio) - summarized_refs[filename] = summarized + for filename, metadata in references.items(): + summarized = self.summarize_reference(metadata['content'], summarization_ratio) + metadata['content'] = summarized + metadata['size'] = len(summarized) - references = summarized_refs - new_size = sum(len(c) for c in references.values()) + new_size = sum(meta['size'] for meta in references.values()) print(f" โœ“ Reduced from {total_ref_size:,} to {new_size:,} chars ({int(new_size/total_ref_size*100)}%)") print() @@ -232,67 +236,134 @@ class LocalSkillEnhancer: if self.skill_md_path.exists(): current_skill_md = self.skill_md_path.read_text(encoding='utf-8') - # Build prompt + # Analyze conflicts if present + has_conflicts = any('conflicts' in meta['path'] for meta in references.values()) + + # Build prompt with multi-source awareness prompt = f"""I need you to enhance the SKILL.md file for the {self.skill_dir.name} skill. +SKILL OVERVIEW: +- Name: {self.skill_dir.name} +- Source Types: {', '.join(sorted(sources_found))} +- Multi-Source: {'Yes' if len(sources_found) > 1 else 'No'} +- Conflicts Detected: {'Yes - see conflicts.md in references' if has_conflicts else 'No'} + CURRENT SKILL.MD: {'-'*60} {current_skill_md if current_skill_md else '(No existing SKILL.md - create from scratch)'} {'-'*60} -REFERENCE DOCUMENTATION: +SOURCE ANALYSIS: {'-'*60} +This skill combines knowledge from {len(sources_found)} source type(s): + """ - # Add references (already summarized if needed) - for filename, content in references.items(): - # Further limit per-file to 12K to be safe - max_per_file = 12000 - if len(content) > max_per_file: - content = content[:max_per_file] + "\n\n[Content truncated for size...]" - prompt += f"\n## {filename}\n{content}\n" + # Group references by source type + by_source = {} + for filename, metadata in references.items(): + source = metadata['source'] + if source not in by_source: + by_source[source] = [] + by_source[source].append((filename, metadata)) + + # Add source breakdown + for source in sorted(by_source.keys()): + files = by_source[source] + prompt += f"\n**{source.upper()} ({len(files)} file(s))**\n" + for filename, metadata in files[:5]: # Top 5 per source + prompt += f"- {filename} (confidence: {metadata['confidence']}, {metadata['size']:,} chars)\n" + if len(files) > 5: + prompt += f"- ... and {len(files) - 5} more\n" prompt += f""" {'-'*60} +REFERENCE DOCUMENTATION: +{'-'*60} +""" + + # Add references grouped by source with metadata + for source in sorted(by_source.keys()): + prompt += f"\n### {source.upper()} SOURCES\n\n" + for filename, metadata in by_source[source]: + # Further limit per-file to 12K to be safe + content = metadata['content'] + max_per_file = 12000 + if len(content) > max_per_file: + content = content[:max_per_file] + "\n\n[Content truncated for size...]" + + prompt += f"\n#### {filename}\n" + prompt += f"*Source: {metadata['source']}, Confidence: {metadata['confidence']}*\n\n" + prompt += f"{content}\n" + + prompt += f""" +{'-'*60} + +REFERENCE PRIORITY (when sources differ): +1. **Code patterns (codebase_analysis)**: Ground truth - what the code actually does +2. **Official documentation**: Intended API and usage patterns +3. **GitHub issues**: Real-world usage and known problems +4. **PDF documentation**: Additional context and tutorials + YOUR TASK: -Create an EXCELLENT SKILL.md file that will help Claude use this documentation effectively. +Create an EXCELLENT SKILL.md file that synthesizes knowledge from multiple sources. Requirements: -1. **Clear "When to Use This Skill" section** +1. **Multi-Source Synthesis** + - Acknowledge that this skill combines multiple sources + - Highlight agreements between sources (builds confidence) + - Note discrepancies transparently (if present) + - Use source priority when synthesizing conflicting information + +2. **Clear "When to Use This Skill" section** - Be SPECIFIC about trigger conditions - List concrete use cases + - Include perspective from both docs AND real-world usage (if GitHub/codebase data available) -2. **Excellent Quick Reference section** - - Extract 5-10 of the BEST, most practical code examples from the reference docs +3. **Excellent Quick Reference section** + - Extract 5-10 of the BEST, most practical code examples + - Prefer examples from HIGH CONFIDENCE sources first + - If code examples exist from codebase analysis, prioritize those (real usage) + - If docs examples exist, include those too (official patterns) - Choose SHORT, clear examples (5-20 lines max) - - Include both simple and intermediate examples - Use proper language tags (cpp, python, javascript, json, etc.) - - Add clear descriptions for each example + - Add clear descriptions noting the source (e.g., "From official docs" or "From codebase") -3. **Detailed Reference Files description** +4. **Detailed Reference Files description** - Explain what's in each reference file - - Help users navigate the documentation + - Note the source type and confidence level + - Help users navigate multi-source documentation -4. **Practical "Working with This Skill" section** +5. **Practical "Working with This Skill" section** - Clear guidance for beginners, intermediate, and advanced users - - Navigation tips + - Navigation tips for multi-source references + - How to resolve conflicts if present -5. **Key Concepts section** (if applicable) +6. **Key Concepts section** (if applicable) - Explain core concepts - Define important terminology + - Reconcile differences between sources if needed + +7. **Conflict Handling** (if conflicts detected) + - Add a "Known Discrepancies" section + - Explain major conflicts transparently + - Provide guidance on which source to trust in each case IMPORTANT: - Extract REAL examples from the reference docs above +- Prioritize HIGH CONFIDENCE sources when synthesizing +- Note source attribution when helpful (e.g., "Official docs say X, but codebase shows Y") +- Make discrepancies transparent, not hidden - Prioritize SHORT, clear examples - Make it actionable and practical - Keep the frontmatter (---\\nname: ...\\n---) intact - Use proper markdown formatting SAVE THE RESULT: -Save the complete enhanced SKILL.md to: {self.skill_md_path.absolute()} +Save the complete enhanced SKILL.md to: SKILL.md -First, backup the original to: {self.skill_md_path.with_suffix('.md.backup').absolute()} +First, backup the original to: SKILL.md.backup """ return prompt @@ -381,7 +452,7 @@ First, backup the original to: {self.skill_md_path.with_suffix('.md.backup').abs return False print(f" โœ“ Read {len(references)} reference files") - total_size = sum(len(c) for c in references.values()) + total_size = sum(ref['size'] for ref in references.values()) print(f" โœ“ Total size: {total_size:,} characters\n") # Check if we need smart summarization @@ -530,7 +601,8 @@ rm {prompt_file} ['claude', prompt_file], capture_output=True, text=True, - timeout=timeout + timeout=timeout, + cwd=str(self.skill_dir) # Run from skill directory ) elapsed = time.time() - start_time diff --git a/src/skill_seekers/cli/unified_skill_builder.py b/src/skill_seekers/cli/unified_skill_builder.py index 0dec8cd..70dd6fa 100644 --- a/src/skill_seekers/cli/unified_skill_builder.py +++ b/src/skill_seekers/cli/unified_skill_builder.py @@ -29,7 +29,8 @@ class UnifiedSkillBuilder: """ def __init__(self, config: Dict, scraped_data: Dict, - merged_data: Optional[Dict] = None, conflicts: Optional[List] = None): + merged_data: Optional[Dict] = None, conflicts: Optional[List] = None, + cache_dir: Optional[str] = None): """ Initialize skill builder. @@ -38,11 +39,13 @@ class UnifiedSkillBuilder: scraped_data: Dict of scraped data by source type merged_data: Merged API data (if conflicts were resolved) conflicts: List of detected conflicts + cache_dir: Optional cache directory for intermediate files """ self.config = config self.scraped_data = scraped_data self.merged_data = merged_data self.conflicts = conflicts or [] + self.cache_dir = cache_dir self.name = config['name'] self.description = config['description'] @@ -70,14 +73,472 @@ class UnifiedSkillBuilder: logger.info(f"โœ… Unified skill built: {self.skill_dir}/") + def _load_source_skill_mds(self) -> Dict[str, str]: + """Load standalone SKILL.md files from each source. + + Returns: + Dict mapping source type to SKILL.md content + e.g., {'documentation': '...', 'github': '...', 'pdf': '...'} + """ + skill_mds = {} + + # Determine base directory for source SKILL.md files + if self.cache_dir: + sources_dir = Path(self.cache_dir) / "sources" + else: + sources_dir = Path("output") + + # Load documentation SKILL.md + docs_skill_path = sources_dir / f"{self.name}_docs" / "SKILL.md" + if docs_skill_path.exists(): + try: + skill_mds['documentation'] = docs_skill_path.read_text(encoding='utf-8') + logger.debug(f"Loaded documentation SKILL.md ({len(skill_mds['documentation'])} chars)") + except IOError as e: + logger.warning(f"Failed to read documentation SKILL.md: {e}") + + # Load GitHub SKILL.md + github_skill_path = sources_dir / f"{self.name}_github" / "SKILL.md" + if github_skill_path.exists(): + try: + skill_mds['github'] = github_skill_path.read_text(encoding='utf-8') + logger.debug(f"Loaded GitHub SKILL.md ({len(skill_mds['github'])} chars)") + except IOError as e: + logger.warning(f"Failed to read GitHub SKILL.md: {e}") + + # Load PDF SKILL.md + pdf_skill_path = sources_dir / f"{self.name}_pdf" / "SKILL.md" + if pdf_skill_path.exists(): + try: + skill_mds['pdf'] = pdf_skill_path.read_text(encoding='utf-8') + logger.debug(f"Loaded PDF SKILL.md ({len(skill_mds['pdf'])} chars)") + except IOError as e: + logger.warning(f"Failed to read PDF SKILL.md: {e}") + + logger.info(f"Loaded {len(skill_mds)} source SKILL.md files") + return skill_mds + + def _parse_skill_md_sections(self, skill_md: str) -> Dict[str, str]: + """Parse SKILL.md into sections by ## headers. + + Args: + skill_md: Full SKILL.md content + + Returns: + Dict mapping section name to content + e.g., {'When to Use': '...', 'Quick Reference': '...'} + """ + sections = {} + current_section = None + current_content = [] + + lines = skill_md.split('\n') + + for line in lines: + # Detect section header (## Header) + if line.startswith('## '): + # Save previous section + if current_section: + sections[current_section] = '\n'.join(current_content).strip() + + # Start new section + current_section = line[3:].strip() + # Remove emoji and markdown formatting + current_section = current_section.split('](')[0] # Remove links + for emoji in ['๐Ÿ“š', '๐Ÿ—๏ธ', 'โš ๏ธ', '๐Ÿ”ง', '๐Ÿ“–', '๐Ÿ’ก', '๐ŸŽฏ', '๐Ÿ“Š', '๐Ÿ”', 'โš™๏ธ', '๐Ÿงช', '๐Ÿ“', '๐Ÿ—‚๏ธ', '๐Ÿ“', 'โšก']: + current_section = current_section.replace(emoji, '').strip() + current_content = [] + elif current_section: + # Accumulate content for current section + current_content.append(line) + + # Save last section + if current_section and current_content: + sections[current_section] = '\n'.join(current_content).strip() + + logger.debug(f"Parsed {len(sections)} sections from SKILL.md") + return sections + + def _synthesize_docs_github(self, skill_mds: Dict[str, str]) -> str: + """Synthesize documentation + GitHub sources with weighted merge. + + Strategy: + - Start with docs frontmatter and intro + - Add GitHub metadata (stars, topics, language stats) + - Merge "When to Use" from both sources + - Merge "Quick Reference" from both sources + - Include GitHub-specific sections (patterns, architecture) + - Merge code examples (prioritize GitHub real usage) + - Include Known Issues from GitHub + - Fix placeholder text (httpx_docs โ†’ httpx) + + Args: + skill_mds: Dict with 'documentation' and 'github' keys + + Returns: + Synthesized SKILL.md content + """ + docs_sections = self._parse_skill_md_sections(skill_mds.get('documentation', '')) + github_sections = self._parse_skill_md_sections(skill_mds.get('github', '')) + + # Extract GitHub metadata from full content + github_full = skill_mds.get('github', '') + + # Start with YAML frontmatter + skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64] + desc = self.description[:1024] if len(self.description) > 1024 else self.description + + content = f"""--- +name: {skill_name} +description: {desc} +--- + +# {self.name.title()} + +{self.description} + +## ๐Ÿ“š Sources + +This skill synthesizes knowledge from multiple sources: + +- โœ… **Official Documentation**: {self.config.get('sources', [{}])[0].get('base_url', 'N/A')} +- โœ… **GitHub Repository**: {[s for s in self.config.get('sources', []) if s.get('type') == 'github'][0].get('repo', 'N/A') if [s for s in self.config.get('sources', []) if s.get('type') == 'github'] else 'N/A'} + +""" + + # Add GitHub Description and Metadata if present + if 'Description' in github_sections: + content += "## ๐Ÿ“ฆ About\n\n" + content += github_sections['Description'] + "\n\n" + + # Add Repository Info from GitHub + if 'Repository Info' in github_sections: + content += "### Repository Info\n\n" + content += github_sections['Repository Info'] + "\n\n" + + # Add Language stats from GitHub + if 'Languages' in github_sections: + content += "### Languages\n\n" + content += github_sections['Languages'] + "\n\n" + + content += "## ๐Ÿ’ก When to Use This Skill\n\n" + + # Merge "When to Use" sections - Fix placeholder text + when_to_use_added = False + for key in ['When to Use This Skill', 'When to Use']: + if key in docs_sections: + # Fix placeholder text: httpx_docs โ†’ httpx + when_content = docs_sections[key].replace('httpx_docs', self.name) + when_content = when_content.replace('httpx_github', self.name) + content += when_content + "\n\n" + when_to_use_added = True + break + + if 'When to Use This Skill' in github_sections: + if when_to_use_added: + content += "**From repository analysis:**\n\n" + content += github_sections['When to Use This Skill'] + "\n\n" + + # Quick Reference: Merge from both sources + content += "## ๐ŸŽฏ Quick Reference\n\n" + + if 'Quick Reference' in docs_sections: + content += "**From Documentation:**\n\n" + content += docs_sections['Quick Reference'] + "\n\n" + + if 'Quick Reference' in github_sections: + # Include GitHub's Quick Reference (contains design patterns summary) + logger.info(f"DEBUG: Including GitHub Quick Reference ({len(github_sections['Quick Reference'])} chars)") + content += github_sections['Quick Reference'] + "\n\n" + else: + logger.warning("DEBUG: GitHub Quick Reference section NOT FOUND!") + + # Design Patterns (GitHub only - C3.1 analysis) + if 'Design Patterns Detected' in github_sections: + content += "### Design Patterns Detected\n\n" + content += "*From C3.1 codebase analysis (confidence > 0.7)*\n\n" + content += github_sections['Design Patterns Detected'] + "\n\n" + + # Code Examples: Prefer GitHub (real usage) + content += "## ๐Ÿงช Code Examples\n\n" + + if 'Code Examples' in github_sections: + content += "**From Repository Tests:**\n\n" + # Note: GitHub section already includes "*High-quality examples from codebase (C3.2)*" label + content += github_sections['Code Examples'] + "\n\n" + elif 'Usage Examples' in github_sections: + content += "**From Repository:**\n\n" + content += github_sections['Usage Examples'] + "\n\n" + + if 'Example Code Patterns' in docs_sections: + content += "**From Documentation:**\n\n" + content += docs_sections['Example Code Patterns'] + "\n\n" + + # API Reference: Include from both sources + if 'API Reference' in docs_sections or 'API Reference' in github_sections: + content += "## ๐Ÿ”ง API Reference\n\n" + + if 'API Reference' in github_sections: + # Note: GitHub section already includes "*Extracted from codebase analysis (C2.5)*" label + content += github_sections['API Reference'] + "\n\n" + + if 'API Reference' in docs_sections: + content += "**Official API Documentation:**\n\n" + content += docs_sections['API Reference'] + "\n\n" + + # Known Issues: GitHub only + if 'Known Issues' in github_sections: + content += "## โš ๏ธ Known Issues\n\n" + content += "*Recent issues from GitHub*\n\n" + content += github_sections['Known Issues'] + "\n\n" + + # Recent Releases: GitHub only (include subsection if present) + if 'Recent Releases' in github_sections: + # Recent Releases might be a subsection within Known Issues + # Check if it's standalone + releases_content = github_sections['Recent Releases'] + if releases_content.strip() and not releases_content.startswith('###'): + content += "### Recent Releases\n" + content += releases_content + "\n\n" + + # Reference documentation + content += "## ๐Ÿ“– Reference Documentation\n\n" + content += "Organized by source:\n\n" + content += "- [Documentation](references/documentation/)\n" + content += "- [GitHub](references/github/)\n" + content += "- [Codebase Analysis](references/codebase_analysis/ARCHITECTURE.md)\n\n" + + # Footer + content += "---\n\n" + content += "*Synthesized from official documentation and codebase analysis by Skill Seekers*\n" + + return content + + def _synthesize_docs_github_pdf(self, skill_mds: Dict[str, str]) -> str: + """Synthesize all three sources: documentation + GitHub + PDF. + + Strategy: + - Start with docs+github synthesis + - Insert PDF chapters after Quick Reference + - Add PDF key concepts as supplementary section + + Args: + skill_mds: Dict with 'documentation', 'github', and 'pdf' keys + + Returns: + Synthesized SKILL.md content + """ + # Start with docs+github synthesis + base_content = self._synthesize_docs_github(skill_mds) + pdf_sections = self._parse_skill_md_sections(skill_mds.get('pdf', '')) + + # Find insertion point after Quick Reference + lines = base_content.split('\n') + insertion_index = -1 + + for i, line in enumerate(lines): + if line.startswith('## ๐Ÿงช Code Examples') or line.startswith('## ๐Ÿ”ง API Reference'): + insertion_index = i + break + + if insertion_index == -1: + # Fallback: insert before Reference Documentation + for i, line in enumerate(lines): + if line.startswith('## ๐Ÿ“– Reference Documentation'): + insertion_index = i + break + + # Build PDF section + pdf_content_lines = [] + + # Add Chapter Overview + if 'Chapter Overview' in pdf_sections: + pdf_content_lines.append("## ๐Ÿ“š PDF Documentation Structure\n") + pdf_content_lines.append("*From PDF analysis*\n") + pdf_content_lines.append(pdf_sections['Chapter Overview']) + pdf_content_lines.append("\n") + + # Add Key Concepts + if 'Key Concepts' in pdf_sections: + pdf_content_lines.append("## ๐Ÿ” Key Concepts\n") + pdf_content_lines.append("*Extracted from PDF headings*\n") + pdf_content_lines.append(pdf_sections['Key Concepts']) + pdf_content_lines.append("\n") + + # Insert PDF content + if pdf_content_lines and insertion_index != -1: + lines[insertion_index:insertion_index] = pdf_content_lines + elif pdf_content_lines: + # Append at end before footer + footer_index = -1 + for i, line in enumerate(lines): + if line.startswith('---') and i > len(lines) - 5: + footer_index = i + break + if footer_index != -1: + lines[footer_index:footer_index] = pdf_content_lines + + # Update reference documentation to include PDF + final_content = '\n'.join(lines) + final_content = final_content.replace( + '- [Codebase Analysis](references/codebase_analysis/ARCHITECTURE.md)\n', + '- [Codebase Analysis](references/codebase_analysis/ARCHITECTURE.md)\n- [PDF Documentation](references/pdf/)\n' + ) + + return final_content + def _generate_skill_md(self): - """Generate main SKILL.md file.""" + """Generate main SKILL.md file using synthesis formulas. + + Strategy: + 1. Try to load standalone SKILL.md from each source + 2. If found, use synthesis formulas for rich content + 3. If not found, fall back to legacy minimal generation + """ skill_path = os.path.join(self.skill_dir, 'SKILL.md') - # Generate skill name (lowercase, hyphens only, max 64 chars) - skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64] + # Try to load source SKILL.md files + skill_mds = self._load_source_skill_mds() - # Truncate description to 1024 chars if needed + # Determine synthesis strategy based on available sources + has_docs = 'documentation' in skill_mds + has_github = 'github' in skill_mds + has_pdf = 'pdf' in skill_mds + + content = None + + # Apply appropriate synthesis formula + if has_docs and has_github and has_pdf: + logger.info("Synthesizing: documentation + GitHub + PDF") + content = self._synthesize_docs_github_pdf(skill_mds) + + elif has_docs and has_github: + logger.info("Synthesizing: documentation + GitHub") + content = self._synthesize_docs_github(skill_mds) + + elif has_docs and has_pdf: + logger.info("Synthesizing: documentation + PDF") + content = self._synthesize_docs_pdf(skill_mds) + + elif has_github and has_pdf: + logger.info("Synthesizing: GitHub + PDF") + content = self._synthesize_github_pdf(skill_mds) + + elif has_docs: + logger.info("Using documentation SKILL.md as-is") + content = skill_mds['documentation'] + + elif has_github: + logger.info("Using GitHub SKILL.md as-is") + content = skill_mds['github'] + + elif has_pdf: + logger.info("Using PDF SKILL.md as-is") + content = skill_mds['pdf'] + + # Fallback: generate minimal SKILL.md (legacy behavior) + if not content: + logger.warning("No source SKILL.md files found, generating minimal SKILL.md (legacy)") + content = self._generate_minimal_skill_md() + + # Write final content + with open(skill_path, 'w', encoding='utf-8') as f: + f.write(content) + + logger.info(f"Created SKILL.md ({len(content)} chars, ~{len(content.split())} words)") + + def _synthesize_docs_pdf(self, skill_mds: Dict[str, str]) -> str: + """Synthesize documentation + PDF sources. + + Strategy: + - Start with docs SKILL.md + - Insert PDF chapters and key concepts as supplementary sections + + Args: + skill_mds: Dict with 'documentation' and 'pdf' keys + + Returns: + Synthesized SKILL.md content + """ + docs_content = skill_mds['documentation'] + pdf_sections = self._parse_skill_md_sections(skill_mds['pdf']) + + lines = docs_content.split('\n') + insertion_index = -1 + + # Find insertion point before Reference Documentation + for i, line in enumerate(lines): + if line.startswith('## ๐Ÿ“– Reference') or line.startswith('## Reference'): + insertion_index = i + break + + # Build PDF sections + pdf_content_lines = [] + + if 'Chapter Overview' in pdf_sections: + pdf_content_lines.append("## ๐Ÿ“š PDF Documentation Structure\n") + pdf_content_lines.append("*From PDF analysis*\n") + pdf_content_lines.append(pdf_sections['Chapter Overview']) + pdf_content_lines.append("\n") + + if 'Key Concepts' in pdf_sections: + pdf_content_lines.append("## ๐Ÿ” Key Concepts\n") + pdf_content_lines.append("*Extracted from PDF headings*\n") + pdf_content_lines.append(pdf_sections['Key Concepts']) + pdf_content_lines.append("\n") + + # Insert PDF content + if pdf_content_lines and insertion_index != -1: + lines[insertion_index:insertion_index] = pdf_content_lines + + return '\n'.join(lines) + + def _synthesize_github_pdf(self, skill_mds: Dict[str, str]) -> str: + """Synthesize GitHub + PDF sources. + + Strategy: + - Start with GitHub SKILL.md (has C3.x analysis) + - Add PDF documentation structure as supplementary section + + Args: + skill_mds: Dict with 'github' and 'pdf' keys + + Returns: + Synthesized SKILL.md content + """ + github_content = skill_mds['github'] + pdf_sections = self._parse_skill_md_sections(skill_mds['pdf']) + + lines = github_content.split('\n') + insertion_index = -1 + + # Find insertion point before Reference Documentation + for i, line in enumerate(lines): + if line.startswith('## ๐Ÿ“– Reference') or line.startswith('## Reference'): + insertion_index = i + break + + # Build PDF sections + pdf_content_lines = [] + + if 'Chapter Overview' in pdf_sections: + pdf_content_lines.append("## ๐Ÿ“š PDF Documentation Structure\n") + pdf_content_lines.append("*From PDF analysis*\n") + pdf_content_lines.append(pdf_sections['Chapter Overview']) + pdf_content_lines.append("\n") + + # Insert PDF content + if pdf_content_lines and insertion_index != -1: + lines[insertion_index:insertion_index] = pdf_content_lines + + return '\n'.join(lines) + + def _generate_minimal_skill_md(self) -> str: + """Generate minimal SKILL.md (legacy fallback behavior). + + Used when no source SKILL.md files are available. + """ + skill_name = self.name.lower().replace('_', '-').replace(' ', '-')[:64] desc = self.description[:1024] if len(self.description) > 1024 else self.description content = f"""--- @@ -156,10 +617,7 @@ This skill combines knowledge from multiple sources: content += "\n---\n\n" content += "*Generated by Skill Seeker's unified multi-source scraper*\n" - with open(skill_path, 'w', encoding='utf-8') as f: - f.write(content) - - logger.info(f"Created SKILL.md") + return content def _format_merged_apis(self) -> str: """Format merged APIs section with inline conflict warnings.""" diff --git a/src/skill_seekers/cli/utils.py b/src/skill_seekers/cli/utils.py index dd870e5..04c688f 100755 --- a/src/skill_seekers/cli/utils.py +++ b/src/skill_seekers/cli/utils.py @@ -179,11 +179,12 @@ def validate_zip_file(zip_path: Union[str, Path]) -> Tuple[bool, Optional[str]]: return True, None -def read_reference_files(skill_dir: Union[str, Path], max_chars: int = 100000, preview_limit: int = 40000) -> Dict[str, str]: - """Read reference files from a skill directory with size limits. +def read_reference_files(skill_dir: Union[str, Path], max_chars: int = 100000, preview_limit: int = 40000) -> Dict[str, Dict]: + """Read reference files from a skill directory with enriched metadata. This function reads markdown files from the references/ subdirectory of a skill, applying both per-file and total content limits. + Returns enriched metadata including source type, confidence, and path. Args: skill_dir (str or Path): Path to skill directory @@ -191,38 +192,110 @@ def read_reference_files(skill_dir: Union[str, Path], max_chars: int = 100000, p preview_limit (int): Maximum characters per file (default: 40000) Returns: - dict: Dictionary mapping filename to content + dict: Dictionary mapping filename to metadata dict with keys: + - 'content': File content + - 'source': Source type (documentation/github/pdf/api/codebase_analysis) + - 'confidence': Confidence level (high/medium/low) + - 'path': Relative path from references directory Example: >>> refs = read_reference_files('output/react/', max_chars=50000) - >>> len(refs) - 5 + >>> refs['documentation/api.md']['source'] + 'documentation' + >>> refs['documentation/api.md']['confidence'] + 'high' """ from pathlib import Path skill_path = Path(skill_dir) references_dir = skill_path / "references" - references: Dict[str, str] = {} + references: Dict[str, Dict] = {} if not references_dir.exists(): print(f"โš  No references directory found at {references_dir}") return references + def _determine_source_metadata(relative_path: Path) -> Tuple[str, str]: + """Determine source type and confidence level from path. + + Returns: + tuple: (source_type, confidence_level) + """ + path_str = str(relative_path) + + # Documentation sources (official docs) + if path_str.startswith('documentation/'): + return 'documentation', 'high' + + # GitHub sources + elif path_str.startswith('github/'): + # README and releases are medium confidence + if 'README' in path_str or 'releases' in path_str: + return 'github', 'medium' + # Issues are low confidence (user reports) + elif 'issues' in path_str: + return 'github', 'low' + else: + return 'github', 'medium' + + # PDF sources (books, manuals) + elif path_str.startswith('pdf/'): + return 'pdf', 'high' + + # Merged API (synthesized from multiple sources) + elif path_str.startswith('api/'): + return 'api', 'high' + + # Codebase analysis (C3.x automated analysis) + elif path_str.startswith('codebase_analysis/'): + # ARCHITECTURE.md is high confidence (comprehensive) + if 'ARCHITECTURE' in path_str: + return 'codebase_analysis', 'high' + # Patterns and examples are medium (heuristic-based) + elif 'patterns' in path_str or 'examples' in path_str: + return 'codebase_analysis', 'medium' + # Configuration is high (direct extraction) + elif 'configuration' in path_str: + return 'codebase_analysis', 'high' + else: + return 'codebase_analysis', 'medium' + + # Conflicts report (discrepancy detection) + elif 'conflicts' in path_str: + return 'conflicts', 'medium' + + # Fallback + else: + return 'unknown', 'medium' + total_chars = 0 # Search recursively for all .md files (including subdirectories like github/README.md) for ref_file in sorted(references_dir.rglob("*.md")): - if ref_file.name == "index.md": - continue + # Note: We now include index.md files as they contain important content + # (patterns, examples, configuration analysis) content = ref_file.read_text(encoding='utf-8') # Limit size per file + truncated = False if len(content) > preview_limit: content = content[:preview_limit] + "\n\n[Content truncated...]" + truncated = True # Use relative path from references_dir as key for nested files relative_path = ref_file.relative_to(references_dir) - references[str(relative_path)] = content + source_type, confidence = _determine_source_metadata(relative_path) + + # Build enriched metadata + references[str(relative_path)] = { + 'content': content, + 'source': source_type, + 'confidence': confidence, + 'path': str(relative_path), + 'truncated': truncated, + 'size': len(content) + } + total_chars += len(content) # Stop if we've read enough