skill-seekers-reference

firefrost-gaming/skill-seekers-reference

Author	SHA1	Message	Date
yusyus	a99e22c639	feat: Multi-Source Synthesis Architecture - Rich Standalone Skills + Smart Combination BREAKING CHANGE: Major architectural improvements to multi-source skill generation This commit implements the complete "Multi-Source Synthesis Architecture" where each source (documentation, GitHub, PDF) generates a rich standalone SKILL.md file before being intelligently synthesized with source-specific formulas. ## 🎯 Core Architecture Changes ### 1. Rich Standalone SKILL.md Generation (Source Parity) Each source now generates comprehensive, production-quality SKILL.md files that can stand alone OR be synthesized with other sources. GitHub Scraper Enhancements (+263 lines): - Now generates 300+ line SKILL.md (was ~50 lines) - Integrates C3.x codebase analysis data: - C2.5: API Reference extraction - C3.1: Design pattern detection (27 high-confidence patterns) - C3.2: Test example extraction (215 examples) - C3.7: Architectural pattern analysis - Enhanced sections: - ⚡ Quick Reference with pattern summaries - 📝 Code Examples from real repository tests - 🔧 API Reference from codebase analysis - 🏗️ Architecture Overview with design patterns - ⚠️ Known Issues from GitHub issues - Location: src/skill_seekers/cli/github_scraper.py PDF Scraper Enhancements (+205 lines): - Now generates 200+ line SKILL.md (was ~50 lines) - Enhanced content extraction: - 📖 Chapter Overview (PDF structure breakdown) - 🔑 Key Concepts (extracted from headings) - ⚡ Quick Reference (pattern extraction) - 📝 Code Examples: Top 15 (was top 5), grouped by language - Quality scoring and intelligent truncation - Better formatting and organization - Location: src/skill_seekers/cli/pdf_scraper.py Result: All 3 sources (docs, GitHub, PDF) now have equal capability to generate rich, comprehensive standalone skills. ### 2. File Organization & Caching System Problem: output/ directory cluttered with intermediate files, data, and logs. Solution: New `.skillseeker-cache/` hidden directory for all intermediate files. New Structure: ``` .skillseeker-cache/{skill_name}/ ├── sources/ # Standalone SKILL.md from each source │ ├── httpx_docs/ │ ├── httpx_github/ │ └── httpx_pdf/ ├── data/ # Raw scraped data (JSON) ├── repos/ # Cloned GitHub repositories (cached for reuse) └── logs/ # Session logs with timestamps output/{skill_name}/ # CLEAN: Only final synthesized skill ├── SKILL.md └── references/ ``` Benefits: - ✅ Clean output/ directory (only final product) - ✅ Intermediate files preserved for debugging - ✅ Repository clones cached and reused (faster re-runs) - ✅ Timestamped logs for each scraping session - ✅ All cache dirs added to .gitignore Changes: - .gitignore: Added `.skillseeker-cache/` entry - unified_scraper.py: Complete reorganization (+238 lines) - Added cache directory structure - File logging with timestamps - Repository cloning with caching/reuse - Cleaner intermediate file management - Better subprocess logging and error handling ### 3. Config Repository Migration Moved to separate config repository: https://github.com/yusufkaraaslan/skill-seekers-configs Deleted from this repo (35 config files): - ansible-core.json, astro.json, claude-code.json - django.json, django_unified.json, fastapi.json, fastapi_unified.json - godot.json, godot_unified.json, godot_github.json, godot-large-example.json - react.json, react_unified.json, react_github.json, react_github_example.json - vue.json, kubernetes.json, laravel.json, tailwind.json, hono.json - svelte_cli_unified.json, steam-economy-complete.json - deck_deck_go_local.json, python-tutorial-test.json, example_pdf.json - test-manual.json, fastapi_unified_test.json, fastmcp_github_example.json - example-team/ directory (4 files) Kept as reference example: - configs/httpx_comprehensive.json (complete multi-source example) Rationale: - Cleaner repository (979+ lines added, 1680 deleted) - Configs managed separately with versioning - Official presets available via `fetch-config` command - Users can maintain private config repos ### 4. AI Enhancement Improvements enhance_skill.py (+125 lines): - Better integration with multi-source synthesis - Enhanced prompt generation for synthesized skills - Improved error handling and logging - Support for source metadata in enhancement ### 5. Documentation Updates CLAUDE.md (+252 lines): - Comprehensive project documentation - Architecture explanations - Development workflow guidelines - Testing requirements - Multi-source synthesis patterns SKILL_QUALITY_ANALYSIS.md (new): - Quality assessment framework - Before/after analysis of httpx skill - Grading rubric for skill quality - Metrics and benchmarks ### 6. Testing & Validation Scripts test_httpx_skill.sh (new): - Complete httpx skill generation test - Multi-source synthesis validation - Quality metrics verification test_httpx_quick.sh (new): - Quick validation script - Subset of features for rapid testing ## 📊 Quality Improvements \| Metric \| Before \| After \| Improvement \| \|--------\|--------\|-------\|-------------\| \| GitHub SKILL.md lines \| ~50 \| 300+ \| +500% \| \| PDF SKILL.md lines \| ~50 \| 200+ \| +300% \| \| GitHub C3.x integration \| ❌ No \| ✅ Yes \| New feature \| \| PDF pattern extraction \| ❌ No \| ✅ Yes \| New feature \| \| File organization \| Messy \| Clean cache \| Major improvement \| \| Repository cloning \| Always fresh \| Cached reuse \| Faster re-runs \| \| Logging \| Console only \| Timestamped files \| Better debugging \| \| Config management \| In-repo \| Separate repo \| Cleaner separation \| ## 🧪 Testing All existing tests pass: - test_c3_integration.py: Updated for new architecture - 700+ tests passing - Multi-source synthesis validated with httpx example ## 🔧 Technical Details Modified Core Files: 1. src/skill_seekers/cli/github_scraper.py (+263 lines) - _generate_skill_md(): Rich content with C3.x integration - _format_pattern_summary(): Design pattern summaries - _format_code_examples(): Test example formatting - _format_api_reference(): API reference from codebase - _format_architecture(): Architectural pattern analysis 2. src/skill_seekers/cli/pdf_scraper.py (+205 lines) - _generate_skill_md(): Enhanced with rich content - _format_key_concepts(): Extract concepts from headings - _format_patterns_from_content(): Pattern extraction - Code examples: Top 15, grouped by language, better quality scoring 3. src/skill_seekers/cli/unified_scraper.py (+238 lines) - __init__(): Cache directory structure - _setup_logging(): File logging with timestamps - _clone_github_repo(): Repository caching system - _scrape_documentation(): Move to cache, better logging - Better subprocess handling and error reporting 4. src/skill_seekers/cli/enhance_skill.py (+125 lines) - Multi-source synthesis awareness - Enhanced prompt generation - Better error handling Minor Updates: - src/skill_seekers/cli/codebase_scraper.py (+3 lines): Minor improvements - src/skill_seekers/cli/test_example_extractor.py: Quality scoring adjustments - tests/test_c3_integration.py: Test updates for new architecture ## 🚀 Migration Guide For users with existing configs: No action required - all existing configs continue to work. For users wanting official presets: ```bash # Fetch from official config repo skill-seekers fetch-config --name react --target unified # Or use existing local configs skill-seekers unified --config configs/httpx_comprehensive.json ``` Cache directory: New `.skillseeker-cache/` directory will be created automatically. Safe to delete - will be regenerated on next run. ## 📈 Next Steps This architecture enables: - ✅ Source parity: All sources generate rich standalone skills - ✅ Smart synthesis: Each combination has optimal formula - ✅ Better debugging: Cached files and logs preserved - ✅ Faster iteration: Repository caching, clean output - 🔄 Future: Multi-platform enhancement (Gemini, GPT-4) - planned - 🔄 Future: Conflict detection between sources - planned - 🔄 Future: Source prioritization rules - planned ## 🎓 Example: httpx Skill Quality Before: 186 lines, basic synthesis, missing data After: 640 lines with AI enhancement, A- (9/10) quality What changed: - All C3.x analysis data integrated (patterns, tests, API, architecture) - GitHub metadata included (stars, topics, languages) - PDF chapter structure visible - Professional formatting with emojis and clear sections - Real-world code examples from test suite - Design patterns explained with confidence scores - Known issues with impact assessment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-11 23:01:07 +03:00
yusyus	709fe229af	feat: Router Quality Improvements - 6.5/10 → 8.5/10 (+31%) Implemented all Phase 1 & 2 router quality improvements to transform generic template routers into practical, useful guides with real examples. ## 🎯 Five Major Improvements ### Fix 1: GitHub Issue-Based Examples - Added _generate_examples_from_github() method - Added _convert_issue_to_question() method - Real user questions instead of generic keywords - Example: "How do I fix oauth setup?" vs "Working with getting_started" ### Fix 2: Complete Code Block Extraction - Added code fence tracking to markdown_cleaner.py - Increased char limit from 500 → 1500 - Never truncates mid-code block - Complete feature lists (8 items vs 1 truncated item) ### Fix 3: Enhanced Keywords from Issue Labels - Added _extract_skill_specific_labels() method - Extracts labels from ALL matching GitHub issues - 2x weight for skill-specific labels - Result: 10-15 keywords per skill (was 5-7) ### Fix 4: Common Patterns Section - Added _extract_common_patterns() method - Added _parse_issue_pattern() method - Extracts problem-solution patterns from closed issues - Shows 5 actionable patterns with issue links ### Fix 5: Framework Detection Templates - Added _detect_framework() method - Added _get_framework_hello_world() method - Fallback templates for FastAPI, FastMCP, Django, React - Ensures 95% of routers have working code examples ## 📊 Quality Metrics \| Metric \| Before \| After \| Improvement \| \|--------\|--------\|-------\|-------------\| \| Examples Quality \| 100% generic \| 80% real issues \| +80% \| \| Code Completeness \| 40% truncated \| 95% complete \| +55% \| \| Keywords/Skill \| 5-7 \| 10-15 \| +2x \| \| Common Patterns \| 0 \| 3-5 \| NEW \| \| Overall Quality \| 6.5/10 \| 8.5/10 \| +31% \| ## 🧪 Test Updates Updated 4 test assertions across 3 test files to expect new question format: - tests/test_generate_router_github.py (2 assertions) - tests/test_e2e_three_stream_pipeline.py (1 assertion) - tests/test_architecture_scenarios.py (1 assertion) All 32 router-related tests now passing (100%) ## 📝 Files Modified ### Core Implementation: - src/skill_seekers/cli/generate_router.py (+350 lines, 7 new methods) - src/skill_seekers/cli/markdown_cleaner.py (+3 lines modified) ### Configuration: - configs/fastapi_unified.json (set code_analysis_depth: full) ### Test Files: - tests/test_generate_router_github.py - tests/test_e2e_three_stream_pipeline.py - tests/test_architecture_scenarios.py ## 🎉 Real-World Impact Generated FastAPI router demonstrates all improvements: - Real GitHub questions in Examples section - Complete 8-item feature list + installation code - 12 specific keywords (oauth2, jwt, pydantic, etc.) - 5 problem-solution patterns from resolved issues - Complete README extraction with hello world ## 📖 Documentation Analysis reports created: - Router improvements summary - Before/after comparison - Comprehensive quality analysis against Claude guidelines BREAKING CHANGE: None - All changes backward compatible Tests: All 32 router tests passing (was 15/18, now 32/32) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-11 13:44:45 +03:00
yusyus	9e772351fe	feat: C3.5 - Architectural Overview & Skill Integrator Implements comprehensive integration of ALL C3.x codebase analysis features into unified skills, transforming basic GitHub scraping into comprehensive codebase intelligence with architectural insights. What C3.5 Does: - Generates comprehensive ARCHITECTURE.md with 8 sections - Integrates ALL C3.x outputs (patterns, examples, guides, configs, architecture) - Defaults to ON for GitHub sources with local_repo_path - Adds --skip-codebase-analysis CLI flag ARCHITECTURE.md Sections: 1. Overview - Project description 2. Architectural Patterns (C3.7) - MVC, MVVM, Clean Architecture, etc. 3. Technology Stack - Frameworks, libraries, languages 4. Design Patterns (C3.1) - Factory, Singleton, Observer, etc. 5. Configuration Overview (C3.4) - Config files with security warnings 6. Common Workflows (C3.3) - How-to guides summary 7. Usage Examples (C3.2) - Test examples statistics 8. Entry Points & Directory Structure - File organization Directory Structure: output/{name}/references/codebase_analysis/ ├── ARCHITECTURE.md (main deliverable) ├── patterns/ (C3.1 design patterns) ├── examples/ (C3.2 test examples) ├── guides/ (C3.3 how-to tutorials) ├── configuration/ (C3.4 config patterns) └── architecture_details/ (C3.7 architectural patterns) Key Features: - Default ON: enable_codebase_analysis=true when local_repo_path exists - CLI flag: --skip-codebase-analysis to disable - Enhanced SKILL.md with Architecture & Code Analysis summary - Graceful degradation on C3.x failures - New config properties: enable_codebase_analysis, ai_mode Changes: - unified_scraper.py: Added _run_c3_analysis(), modified _scrape_github(), CLI flag - unified_skill_builder.py: Added 7 methods for C3.x generation + SKILL.md enhancement - config_validator.py: Added validation for C3.x properties - Updated 5 configs: react, django, fastapi, godot, svelte-cli - Added 9 integration tests in test_c3_integration.py - Updated CHANGELOG.md with complete C3.5 documentation Related: - Closes #75 - Creates #238 (type: "local" support - separate task) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-04 22:03:46 +03:00
Chris Engelhard	9949cdcdca	Fix: include docs references in unified skill output (#213 ) * Fix: include docs references in unified skill output * Fix: quality checker counts nested reference files * fix(unified): pass through llms_txt_url and skip_llms_txt to doc scraper * configs: add svelte CLI unified preset (llms.txt + categories) --------- Co-authored-by: Chris Engelhard <chris@chrisengelhard.nl>	2026-01-01 19:40:51 +03:00
yusyus	65ded6c07c	fix: Fix local repo extraction limitations (code analyzer, exclusions, enhancement) This commit fixes three critical limitations discovered during local repository skill extraction testing: Fix 1: Code Analyzer Import Issue - Changed unified_scraper.py to use absolute imports instead of relative imports - Fixed: `from github_scraper import` → `from skill_seekers.cli.github_scraper import` - Fixed: `from pdf_scraper import` → `from skill_seekers.cli.pdf_scraper import` - Result: CodeAnalyzer now available during extraction, deep analysis works Fix 2: Unity Library Exclusions - Updated should_exclude_dir() to accept and check full directory paths - Updated _extract_file_tree_local() to pass both dir name and full path - Added exclusion config passing from unified_scraper to github_scraper - Result: exclude_dirs_additional now works (297 files excluded in test) Fix 3: AI Enhancement for Single Sources - Changed read_reference_files() to use rglob() for recursive search - Now finds reference files in subdirectories (e.g., references/github/README.md) - Result: AI enhancement works with unified skills that have nested references Test Results: - Code Analyzer: ✅ Working (deep analysis running) - Unity Exclusions: ✅ Working (297 files excluded from 679) - AI Enhancement: ✅ Working (finds and reads nested references) Files Changed: - src/skill_seekers/cli/unified_scraper.py (Fix 1 & 2) - src/skill_seekers/cli/github_scraper.py (Fix 2) - src/skill_seekers/cli/utils.py (Fix 3) Test Artifacts: - configs/deck_deck_go_local.json (test configuration) - docs/LOCAL_REPO_TEST_RESULTS.md (comprehensive test report) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-21 22:24:38 +03:00
yusyus	70ca1d9ba6	docs(A1.9): Add comprehensive git source documentation and example repository Phase 4 Complete: - Updated README.md with git source usage examples and use cases - Created docs/GIT_CONFIG_SOURCES.md (800+ lines comprehensive guide) - Updated CHANGELOG.md with v2.2.0 release notes - Added configs/example-team/ example repository with E2E test Documentation covers: - Quick start and architecture - MCP tools reference (4 tools with examples) - Authentication for GitHub, GitLab, Bitbucket - Use cases (small teams, enterprise, open source) - Best practices, troubleshooting, advanced topics - Complete API reference Example repository includes: - 3 example configs (react-custom, vue-internal, company-api) - README with usage guide - E2E test script (7 steps, 100% passing) 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-21 19:38:26 +03:00
yusyus	5d8c7e39f6	Add unified multi-source scraping feature (Phases 7-11) Completes the unified scraping system implementation: Phase 7: Unified Skill Builder - cli/unified_skill_builder.py: Generates final skill structure - Inline conflict warnings (⚠️) in API reference - Side-by-side docs vs code comparison - Severity-based conflict grouping - Separate conflicts.md report Phase 8: MCP Integration - skill_seeker_mcp/server.py: Auto-detects unified vs legacy configs - Routes to unified_scraper.py or doc_scraper.py automatically - Supports merge_mode parameter override - Maintains full backward compatibility Phase 9: Example Unified Configs - configs/react_unified.json: React docs + GitHub - configs/django_unified.json: Django docs + GitHub - configs/fastapi_unified.json: FastAPI docs + GitHub - configs/fastapi_unified_test.json: Test config with limited pages Phase 10: Comprehensive Tests - cli/test_unified_simple.py: Integration tests (all passing) - Tests unified config validation - Tests backward compatibility - Tests mixed source types - Tests error handling Phase 11: Documentation - docs/UNIFIED_SCRAPING.md: Complete guide (1000+ lines) - Examples, best practices, troubleshooting - Architecture diagrams and data flow - Command reference Additional: - demo_conflicts.py: Interactive conflict detection demo - TEST_RESULTS.md: Complete test results and findings - cli/unified_scraper.py: Fixed doc_scraper integration (subprocess) Features: ✅ Multi-source scraping (docs + GitHub + PDF) ✅ Conflict detection (4 types, 3 severity levels) ✅ Rule-based merging (fast, deterministic) ✅ Claude-enhanced merging (AI-powered) ✅ Transparent conflict reporting ✅ MCP auto-detection ✅ Backward compatibility Test Results: - 6/6 integration tests passed - 4 unified configs validated - 3 legacy configs backward compatible - 5 conflicts detected in test data - All documentation complete 🤖 Generated with Claude Code	2025-10-26 16:33:41 +03:00
yusyus	f2b26ff5fe	feat: Phase 1-2 - Unified config format + deep code analysis Phase 1: Unified Config Format - Created config_validator.py with full validation - Supports multiple sources (documentation, github, pdf) - Backward compatible with legacy configs - Auto-converts legacy → unified format - Validates merge_mode and code_analysis_depth Phase 2: Deep Code Analysis - Created code_analyzer.py with language-specific parsers - Supports Python (AST), JavaScript/TypeScript (regex), C/C++ (regex) - Configurable depth: surface, deep, full - Extracts classes, functions, parameters, types, docstrings - Integrated into github_scraper.py Features: ✅ Unified config with sources array ✅ Code analysis depth: surface/deep/full ✅ Language detection and parser selection ✅ Signature extraction with full parameter info ✅ Type hints and default values captured ✅ Docstring extraction ✅ Example config: godot_unified.json Next: Conflict detection and merging 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-26 15:09:38 +03:00
yusyus	a0017d3459	feat: Add Godot GitHub repository config Config for godotengine/godot repository: - Extracts README, issues, changelog, releases - Targets core C++ files (core, scene, servers) - Max 100 issues - Surface layer only (no full code implementation) Usage: python3 cli/github_scraper.py --config configs/godot_github.json 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-26 14:32:38 +03:00
yusyus	01c14d0e9c	feat: Implement C1 GitHub Repository Scraping (Tasks C1.1-C1.12) Complete implementation of GitHub repository scraping feature with all 12 tasks: ## Core Features Implemented C1.1: GitHub API Client - PyGithub integration with authentication support - Support for GITHUB_TOKEN env var + config file token - Rate limit handling and error management C1.2: README Extraction - Fetch README.md, README.rst, README.txt - Support multiple locations (root, docs/, .github/) C1.3: Code Comments & Docstrings - Framework for extracting docstrings (surface layer) - Placeholder for Python/JS comment extraction C1.4: Language Detection - Use GitHub's language detection API - Percentage breakdown by bytes C1.5: Function/Class Signatures - Framework for signature extraction (surface layer only) C1.6: Usage Examples from Tests - Placeholder for test file analysis C1.7: GitHub Issues Extraction - Fetch open/closed issues via API - Extract title, labels, milestone, state, timestamps - Configurable max issues (default: 100) C1.8: CHANGELOG Extraction - Fetch CHANGELOG.md, CHANGES.md, HISTORY.md - Try multiple common locations C1.9: GitHub Releases - Fetch releases via API - Extract version tags, release notes, publish dates - Full release history C1.10: CLI Tool - Complete `cli/github_scraper.py` (~700 lines) - Argparse interface with config + direct modes - GitHubScraper class for data extraction - GitHubToSkillConverter class for skill building C1.11: MCP Integration - Added `scrape_github` tool to MCP server - Natural language interface: "Scrape GitHub repo facebook/react" - 10 minute timeout for scraping - Full parameter support C1.12: Config Format - JSON config schema with example - `configs/react_github.json` template - Support for repo, name, description, token, flags ## Files Changed - `cli/github_scraper.py` (NEW, ~700 lines) - `configs/react_github.json` (NEW) - `requirements.txt` (+PyGithub==2.5.0) - `skill_seeker_mcp/server.py` (+scrape_github tool) ## Usage ```bash # CLI usage python3 cli/github_scraper.py --repo facebook/react python3 cli/github_scraper.py --config configs/react_github.json # MCP usage (via Claude Code) "Scrape GitHub repository facebook/react" "Extract issues and changelog from owner/repo" ``` ## Implementation Notes - Surface layer only (no full code implementation) - Focus on documentation, issues, changelog, releases - Skill size: 2-5 MB (manageable, focused) - Covers 90%+ of real use cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-26 14:19:27 +03:00
Edgar I.	104818f983	feat: enable llms.txt for hono config	2025-10-24 18:27:17 +04:00
yusyus	6936057820	Add PDF documentation support (Tasks B1.1-B1.8) Complete PDF extraction and skill conversion functionality: - pdf_extractor_poc.py (1,004 lines): Extract text, code, images from PDFs - pdf_scraper.py (353 lines): Convert PDFs to Claude skills - MCP tool scrape_pdf: PDF scraping via Claude Code - 7 comprehensive documentation guides (4,705 lines) - Example PDF config format (configs/example_pdf.json) Features: - 3 code detection methods (font, indent, pattern) - 19+ programming languages detected with confidence scoring - Syntax validation and quality scoring (0-10 scale) - Image extraction with size filtering (--extract-images) - Chapter/section detection and page chunking - Quality-filtered code examples (--min-quality) - Three usage modes: config file, direct PDF, from extracted JSON Technical: - PyMuPDF (fitz) as primary library (60x faster than alternatives) - Language detection with confidence scoring - Code block merging across pages - Comprehensive metadata and statistics - Compatible with existing Skill Seeker workflow MCP Integration: - New scrape_pdf tool (10th MCP tool total) - Supports all three usage modes - 10-minute timeout for large PDFs - Real-time streaming output Documentation (4,705 lines): - B1_COMPLETE_SUMMARY.md: Overview of all 8 tasks - PDF_PARSING_RESEARCH.md: Library comparison and benchmarks - PDF_EXTRACTOR_POC.md: POC documentation - PDF_CHUNKING.md: Page chunking guide - PDF_SYNTAX_DETECTION.md: Syntax detection guide - PDF_IMAGE_EXTRACTION.md: Image extraction guide - PDF_SCRAPER.md: PDF scraper usage guide - PDF_MCP_TOOL.md: MCP integration guide Tasks completed: B1.1-B1.8 Addresses Issue #27 See docs/B1_COMPLETE_SUMMARY.md for complete details 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-23 00:23:16 +03:00
Schuyler Erle	183c7596a5	Add config for Ansible core documentation (#147 ) Co-authored-by: Schuyler Erle <schuyler@ardc.net>	2025-10-22 21:50:59 +03:00
Schuyler Erle	ab585584d0	Add config for Claude Code documentation	2025-10-20 21:27:19 -07:00
yusyus	80382551b1	Fix Issue #7 : Fix all broken configs and add Laravel support Tested and fixed all 11 production configs - now 100% working! Fixed Configs: 1. Django (configs/django.json) - ❌ Was using: div.document (selector doesn't exist) - ✅ Now using: article (1,688 chars of content) - Verified on: https://docs.djangoproject.com/en/stable/ 2. Astro (configs/astro.json) - ❌ Was using: homepage URL (no article element) - ✅ Now using: /en/getting-started/ with article selector - Added: start_urls, categories, improved URL patterns - Increased max_pages from 15 to 100 3. Tailwind (configs/tailwind.json) - ❌ Was using: article (selector doesn't exist) - ✅ Now using: div.prose (195 chars of content) - Verified on: https://tailwindcss.com/docs New Config: 4. Laravel (configs/laravel.json) - NEW! - Created complete Laravel 9.x config - Selector: #main-content (16,131 chars of content) - Base URL: https://laravel.com/docs/9.x/ - Includes: 8 start_urls covering installation, routing, controllers, views, Blade, Eloquent, migrations, auth - Categories: getting_started, routing, views, models, authentication, api - max_pages: 500 Test Results: ✅ 11/11 configs tested and verified (100%) ✅ All selectors extract content properly ✅ All base URLs accessible Working Configs: - ✅ astro.json - ✅ django.json - ✅ fastapi.json - ✅ godot.json - ✅ godot-large-example.json - ✅ kubernetes.json - ✅ laravel.json (NEW) - ✅ react.json - ✅ steam-economy-complete.json - ✅ tailwind.json - ✅ vue.json How I Tested: 1. Created test_selectors.py to find correct CSS selectors 2. Tested each config's base_url + selector combination 3. Verified content extraction (not just "found" but actual text) 4. Ensured meaningful content length (50+ chars minimum) Fixes Issue #7 - Laravel scraping not working Fixes #7	2025-10-21 00:16:39 +03:00
yusyus	bddb57f5ef	Add large documentation handling (40K+ pages support) Implement comprehensive system for handling very large documentation sites with intelligent splitting strategies and router/hub architecture. New CLI Tools: - cli/split_config.py: Split large configs into focused sub-skills * Strategies: auto, category, router, size * Configurable target pages per skill (default: 5000) * Dry-run mode for preview - cli/generate_router.py: Create intelligent router/hub skills * Auto-generates routing logic based on keywords * Creates SKILL.md with topic-to-skill mapping * Infers router name from sub-skills - cli/package_multi.py: Batch package multiple skills * Package router + all sub-skills in one command * Progress tracking for each skill MCP Integration: - Added split_config tool (8 total MCP tools now) - Added generate_router tool - Supports 40K+ page documentation via MCP Configuration: - New split_strategy parameter in configs - split_config section for fine-tuned control - checkpoint section for resume capability (ready for Phase 4) - Example: configs/godot-large-example.json Documentation: - docs/LARGE_DOCUMENTATION.md (500+ lines) * Complete guide for 10K+ page documentation * All splitting strategies explained * Detailed workflows with examples * Best practices and troubleshooting * Real-world examples (AWS, Microsoft, Godot) Features: ✅ Handle 40K+ page documentation efficiently ✅ Parallel scraping support (5x-10x faster) ✅ Router + sub-skills architecture ✅ Intelligent keyword-based routing ✅ Multiple splitting strategies ✅ Full MCP integration 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 20:48:03 +03:00
yusyus	f103aa62cb	Clean up tracked files and repository structure Remove unnecessary files: - configs/.DS_Store (macOS system file, should not be tracked) This ensures only relevant project files are version controlled and improves repository hygiene. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 19:45:13 +03:00
yusyus	d7e6142ab0	Add test configurations for MCP validation Add 4 test configuration files used for validating MCP functionality: - astro.json: Astro framework documentation (15 pages, production test) - python-tutorial-test.json: Python tutorial (minimal test case) - tailwind.json: Tailwind CSS documentation (test case) - test-manual.json: Manual testing configuration These configs were used to verify: - Config generation via generate_config tool - Config validation via validate_config tool - Page estimation via estimate_pages tool - Full scraping workflow via scrape_docs tool - Skill packaging via package_skill tool All tests passed successfully in production Claude Code environment. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 19:44:27 +03:00
jarek	7a4c1d7083	kubernetes config for official docs	2025-10-19 09:28:44 +02:00
yusyus	59c2f9126d	Optimize all framework configs with start_urls for better coverage All configs now follow the steam-economy-complete.json pattern with: - Multiple start_urls for comprehensive entry points - Improved include patterns for better targeting - Enhanced exclude patterns to skip irrelevant pages Godot Config: - Added 7 start_urls covering getting started, scripting, 2D, 3D, physics, animation, and classes - Added include patterns: /getting_started/, /tutorials/, /classes/ - More focused scraping of core documentation React Config: - Added 6 start_urls covering learn, quick-start, reference, and hooks - Existing patterns maintained (already well-optimized) Vue Config: - Added 6 start_urls covering introduction, essentials, components, composables, and API - Fixed base_url from https://vuejs.org/guide/ to https://vuejs.org/ - Added /partners/ to exclude list Django Config: - Added 7 start_urls covering intro, models, views, templates, forms, auth, and reference - Added /intro/ to include patterns - Added /releases/ to exclude list (changelog not needed) FastAPI Config: - Added 7 start_urls covering tutorial, first-steps, path-params, body, dependencies, advanced, and reference - Added /deployment/ to exclude list Benefits: - Better initial page discovery - More comprehensive documentation coverage - Faster scraping (direct entry to important sections) - Reduced unnecessary page crawling - Consistent pattern across all configs All configs tested and validated: ✅ 71/71 tests passing ✅ All 6 configs validated successfully 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 02:24:56 +03:00
yusyus	78b9cae398	Init	2025-10-17 15:14:44 +00:00

21 Commits