skill-seekers-reference

firefrost-gaming/skill-seekers-reference

Author	SHA1	Message	Date
Chris Engelhard	9949cdcdca	Fix: include docs references in unified skill output (#213 ) * Fix: include docs references in unified skill output * Fix: quality checker counts nested reference files * fix(unified): pass through llms_txt_url and skip_llms_txt to doc scraper * configs: add svelte CLI unified preset (llms.txt + categories) --------- Co-authored-by: Chris Engelhard <chris@chrisengelhard.nl>	2026-01-01 19:40:51 +03:00
yusyus	65ded6c07c	fix: Fix local repo extraction limitations (code analyzer, exclusions, enhancement) This commit fixes three critical limitations discovered during local repository skill extraction testing: Fix 1: Code Analyzer Import Issue - Changed unified_scraper.py to use absolute imports instead of relative imports - Fixed: `from github_scraper import` → `from skill_seekers.cli.github_scraper import` - Fixed: `from pdf_scraper import` → `from skill_seekers.cli.pdf_scraper import` - Result: CodeAnalyzer now available during extraction, deep analysis works Fix 2: Unity Library Exclusions - Updated should_exclude_dir() to accept and check full directory paths - Updated _extract_file_tree_local() to pass both dir name and full path - Added exclusion config passing from unified_scraper to github_scraper - Result: exclude_dirs_additional now works (297 files excluded in test) Fix 3: AI Enhancement for Single Sources - Changed read_reference_files() to use rglob() for recursive search - Now finds reference files in subdirectories (e.g., references/github/README.md) - Result: AI enhancement works with unified skills that have nested references Test Results: - Code Analyzer: ✅ Working (deep analysis running) - Unity Exclusions: ✅ Working (297 files excluded from 679) - AI Enhancement: ✅ Working (finds and reads nested references) Files Changed: - src/skill_seekers/cli/unified_scraper.py (Fix 1 & 2) - src/skill_seekers/cli/github_scraper.py (Fix 2) - src/skill_seekers/cli/utils.py (Fix 3) Test Artifacts: - configs/deck_deck_go_local.json (test configuration) - docs/LOCAL_REPO_TEST_RESULTS.md (comprehensive test report) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-21 22:24:38 +03:00
yusyus	70ca1d9ba6	docs(A1.9): Add comprehensive git source documentation and example repository Phase 4 Complete: - Updated README.md with git source usage examples and use cases - Created docs/GIT_CONFIG_SOURCES.md (800+ lines comprehensive guide) - Updated CHANGELOG.md with v2.2.0 release notes - Added configs/example-team/ example repository with E2E test Documentation covers: - Quick start and architecture - MCP tools reference (4 tools with examples) - Authentication for GitHub, GitLab, Bitbucket - Use cases (small teams, enterprise, open source) - Best practices, troubleshooting, advanced topics - Complete API reference Example repository includes: - 3 example configs (react-custom, vue-internal, company-api) - README with usage guide - E2E test script (7 steps, 100% passing) 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-21 19:38:26 +03:00
yusyus	5d8c7e39f6	Add unified multi-source scraping feature (Phases 7-11) Completes the unified scraping system implementation: Phase 7: Unified Skill Builder - cli/unified_skill_builder.py: Generates final skill structure - Inline conflict warnings (⚠️) in API reference - Side-by-side docs vs code comparison - Severity-based conflict grouping - Separate conflicts.md report Phase 8: MCP Integration - skill_seeker_mcp/server.py: Auto-detects unified vs legacy configs - Routes to unified_scraper.py or doc_scraper.py automatically - Supports merge_mode parameter override - Maintains full backward compatibility Phase 9: Example Unified Configs - configs/react_unified.json: React docs + GitHub - configs/django_unified.json: Django docs + GitHub - configs/fastapi_unified.json: FastAPI docs + GitHub - configs/fastapi_unified_test.json: Test config with limited pages Phase 10: Comprehensive Tests - cli/test_unified_simple.py: Integration tests (all passing) - Tests unified config validation - Tests backward compatibility - Tests mixed source types - Tests error handling Phase 11: Documentation - docs/UNIFIED_SCRAPING.md: Complete guide (1000+ lines) - Examples, best practices, troubleshooting - Architecture diagrams and data flow - Command reference Additional: - demo_conflicts.py: Interactive conflict detection demo - TEST_RESULTS.md: Complete test results and findings - cli/unified_scraper.py: Fixed doc_scraper integration (subprocess) Features: ✅ Multi-source scraping (docs + GitHub + PDF) ✅ Conflict detection (4 types, 3 severity levels) ✅ Rule-based merging (fast, deterministic) ✅ Claude-enhanced merging (AI-powered) ✅ Transparent conflict reporting ✅ MCP auto-detection ✅ Backward compatibility Test Results: - 6/6 integration tests passed - 4 unified configs validated - 3 legacy configs backward compatible - 5 conflicts detected in test data - All documentation complete 🤖 Generated with Claude Code	2025-10-26 16:33:41 +03:00
yusyus	f2b26ff5fe	feat: Phase 1-2 - Unified config format + deep code analysis Phase 1: Unified Config Format - Created config_validator.py with full validation - Supports multiple sources (documentation, github, pdf) - Backward compatible with legacy configs - Auto-converts legacy → unified format - Validates merge_mode and code_analysis_depth Phase 2: Deep Code Analysis - Created code_analyzer.py with language-specific parsers - Supports Python (AST), JavaScript/TypeScript (regex), C/C++ (regex) - Configurable depth: surface, deep, full - Extracts classes, functions, parameters, types, docstrings - Integrated into github_scraper.py Features: ✅ Unified config with sources array ✅ Code analysis depth: surface/deep/full ✅ Language detection and parser selection ✅ Signature extraction with full parameter info ✅ Type hints and default values captured ✅ Docstring extraction ✅ Example config: godot_unified.json Next: Conflict detection and merging 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-26 15:09:38 +03:00
yusyus	a0017d3459	feat: Add Godot GitHub repository config Config for godotengine/godot repository: - Extracts README, issues, changelog, releases - Targets core C++ files (core, scene, servers) - Max 100 issues - Surface layer only (no full code implementation) Usage: python3 cli/github_scraper.py --config configs/godot_github.json 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-26 14:32:38 +03:00
yusyus	01c14d0e9c	feat: Implement C1 GitHub Repository Scraping (Tasks C1.1-C1.12) Complete implementation of GitHub repository scraping feature with all 12 tasks: ## Core Features Implemented C1.1: GitHub API Client - PyGithub integration with authentication support - Support for GITHUB_TOKEN env var + config file token - Rate limit handling and error management C1.2: README Extraction - Fetch README.md, README.rst, README.txt - Support multiple locations (root, docs/, .github/) C1.3: Code Comments & Docstrings - Framework for extracting docstrings (surface layer) - Placeholder for Python/JS comment extraction C1.4: Language Detection - Use GitHub's language detection API - Percentage breakdown by bytes C1.5: Function/Class Signatures - Framework for signature extraction (surface layer only) C1.6: Usage Examples from Tests - Placeholder for test file analysis C1.7: GitHub Issues Extraction - Fetch open/closed issues via API - Extract title, labels, milestone, state, timestamps - Configurable max issues (default: 100) C1.8: CHANGELOG Extraction - Fetch CHANGELOG.md, CHANGES.md, HISTORY.md - Try multiple common locations C1.9: GitHub Releases - Fetch releases via API - Extract version tags, release notes, publish dates - Full release history C1.10: CLI Tool - Complete `cli/github_scraper.py` (~700 lines) - Argparse interface with config + direct modes - GitHubScraper class for data extraction - GitHubToSkillConverter class for skill building C1.11: MCP Integration - Added `scrape_github` tool to MCP server - Natural language interface: "Scrape GitHub repo facebook/react" - 10 minute timeout for scraping - Full parameter support C1.12: Config Format - JSON config schema with example - `configs/react_github.json` template - Support for repo, name, description, token, flags ## Files Changed - `cli/github_scraper.py` (NEW, ~700 lines) - `configs/react_github.json` (NEW) - `requirements.txt` (+PyGithub==2.5.0) - `skill_seeker_mcp/server.py` (+scrape_github tool) ## Usage ```bash # CLI usage python3 cli/github_scraper.py --repo facebook/react python3 cli/github_scraper.py --config configs/react_github.json # MCP usage (via Claude Code) "Scrape GitHub repository facebook/react" "Extract issues and changelog from owner/repo" ``` ## Implementation Notes - Surface layer only (no full code implementation) - Focus on documentation, issues, changelog, releases - Skill size: 2-5 MB (manageable, focused) - Covers 90%+ of real use cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-26 14:19:27 +03:00
Edgar I.	104818f983	feat: enable llms.txt for hono config	2025-10-24 18:27:17 +04:00
yusyus	6936057820	Add PDF documentation support (Tasks B1.1-B1.8) Complete PDF extraction and skill conversion functionality: - pdf_extractor_poc.py (1,004 lines): Extract text, code, images from PDFs - pdf_scraper.py (353 lines): Convert PDFs to Claude skills - MCP tool scrape_pdf: PDF scraping via Claude Code - 7 comprehensive documentation guides (4,705 lines) - Example PDF config format (configs/example_pdf.json) Features: - 3 code detection methods (font, indent, pattern) - 19+ programming languages detected with confidence scoring - Syntax validation and quality scoring (0-10 scale) - Image extraction with size filtering (--extract-images) - Chapter/section detection and page chunking - Quality-filtered code examples (--min-quality) - Three usage modes: config file, direct PDF, from extracted JSON Technical: - PyMuPDF (fitz) as primary library (60x faster than alternatives) - Language detection with confidence scoring - Code block merging across pages - Comprehensive metadata and statistics - Compatible with existing Skill Seeker workflow MCP Integration: - New scrape_pdf tool (10th MCP tool total) - Supports all three usage modes - 10-minute timeout for large PDFs - Real-time streaming output Documentation (4,705 lines): - B1_COMPLETE_SUMMARY.md: Overview of all 8 tasks - PDF_PARSING_RESEARCH.md: Library comparison and benchmarks - PDF_EXTRACTOR_POC.md: POC documentation - PDF_CHUNKING.md: Page chunking guide - PDF_SYNTAX_DETECTION.md: Syntax detection guide - PDF_IMAGE_EXTRACTION.md: Image extraction guide - PDF_SCRAPER.md: PDF scraper usage guide - PDF_MCP_TOOL.md: MCP integration guide Tasks completed: B1.1-B1.8 Addresses Issue #27 See docs/B1_COMPLETE_SUMMARY.md for complete details 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-23 00:23:16 +03:00
Schuyler Erle	183c7596a5	Add config for Ansible core documentation (#147 ) Co-authored-by: Schuyler Erle <schuyler@ardc.net>	2025-10-22 21:50:59 +03:00
Schuyler Erle	ab585584d0	Add config for Claude Code documentation	2025-10-20 21:27:19 -07:00
yusyus	80382551b1	Fix Issue #7 : Fix all broken configs and add Laravel support Tested and fixed all 11 production configs - now 100% working! Fixed Configs: 1. Django (configs/django.json) - ❌ Was using: div.document (selector doesn't exist) - ✅ Now using: article (1,688 chars of content) - Verified on: https://docs.djangoproject.com/en/stable/ 2. Astro (configs/astro.json) - ❌ Was using: homepage URL (no article element) - ✅ Now using: /en/getting-started/ with article selector - Added: start_urls, categories, improved URL patterns - Increased max_pages from 15 to 100 3. Tailwind (configs/tailwind.json) - ❌ Was using: article (selector doesn't exist) - ✅ Now using: div.prose (195 chars of content) - Verified on: https://tailwindcss.com/docs New Config: 4. Laravel (configs/laravel.json) - NEW! - Created complete Laravel 9.x config - Selector: #main-content (16,131 chars of content) - Base URL: https://laravel.com/docs/9.x/ - Includes: 8 start_urls covering installation, routing, controllers, views, Blade, Eloquent, migrations, auth - Categories: getting_started, routing, views, models, authentication, api - max_pages: 500 Test Results: ✅ 11/11 configs tested and verified (100%) ✅ All selectors extract content properly ✅ All base URLs accessible Working Configs: - ✅ astro.json - ✅ django.json - ✅ fastapi.json - ✅ godot.json - ✅ godot-large-example.json - ✅ kubernetes.json - ✅ laravel.json (NEW) - ✅ react.json - ✅ steam-economy-complete.json - ✅ tailwind.json - ✅ vue.json How I Tested: 1. Created test_selectors.py to find correct CSS selectors 2. Tested each config's base_url + selector combination 3. Verified content extraction (not just "found" but actual text) 4. Ensured meaningful content length (50+ chars minimum) Fixes Issue #7 - Laravel scraping not working Fixes #7	2025-10-21 00:16:39 +03:00
yusyus	bddb57f5ef	Add large documentation handling (40K+ pages support) Implement comprehensive system for handling very large documentation sites with intelligent splitting strategies and router/hub architecture. New CLI Tools: - cli/split_config.py: Split large configs into focused sub-skills * Strategies: auto, category, router, size * Configurable target pages per skill (default: 5000) * Dry-run mode for preview - cli/generate_router.py: Create intelligent router/hub skills * Auto-generates routing logic based on keywords * Creates SKILL.md with topic-to-skill mapping * Infers router name from sub-skills - cli/package_multi.py: Batch package multiple skills * Package router + all sub-skills in one command * Progress tracking for each skill MCP Integration: - Added split_config tool (8 total MCP tools now) - Added generate_router tool - Supports 40K+ page documentation via MCP Configuration: - New split_strategy parameter in configs - split_config section for fine-tuned control - checkpoint section for resume capability (ready for Phase 4) - Example: configs/godot-large-example.json Documentation: - docs/LARGE_DOCUMENTATION.md (500+ lines) * Complete guide for 10K+ page documentation * All splitting strategies explained * Detailed workflows with examples * Best practices and troubleshooting * Real-world examples (AWS, Microsoft, Godot) Features: ✅ Handle 40K+ page documentation efficiently ✅ Parallel scraping support (5x-10x faster) ✅ Router + sub-skills architecture ✅ Intelligent keyword-based routing ✅ Multiple splitting strategies ✅ Full MCP integration 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 20:48:03 +03:00
yusyus	f103aa62cb	Clean up tracked files and repository structure Remove unnecessary files: - configs/.DS_Store (macOS system file, should not be tracked) This ensures only relevant project files are version controlled and improves repository hygiene. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 19:45:13 +03:00
yusyus	d7e6142ab0	Add test configurations for MCP validation Add 4 test configuration files used for validating MCP functionality: - astro.json: Astro framework documentation (15 pages, production test) - python-tutorial-test.json: Python tutorial (minimal test case) - tailwind.json: Tailwind CSS documentation (test case) - test-manual.json: Manual testing configuration These configs were used to verify: - Config generation via generate_config tool - Config validation via validate_config tool - Page estimation via estimate_pages tool - Full scraping workflow via scrape_docs tool - Skill packaging via package_skill tool All tests passed successfully in production Claude Code environment. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 19:44:27 +03:00
jarek	7a4c1d7083	kubernetes config for official docs	2025-10-19 09:28:44 +02:00
yusyus	59c2f9126d	Optimize all framework configs with start_urls for better coverage All configs now follow the steam-economy-complete.json pattern with: - Multiple start_urls for comprehensive entry points - Improved include patterns for better targeting - Enhanced exclude patterns to skip irrelevant pages Godot Config: - Added 7 start_urls covering getting started, scripting, 2D, 3D, physics, animation, and classes - Added include patterns: /getting_started/, /tutorials/, /classes/ - More focused scraping of core documentation React Config: - Added 6 start_urls covering learn, quick-start, reference, and hooks - Existing patterns maintained (already well-optimized) Vue Config: - Added 6 start_urls covering introduction, essentials, components, composables, and API - Fixed base_url from https://vuejs.org/guide/ to https://vuejs.org/ - Added /partners/ to exclude list Django Config: - Added 7 start_urls covering intro, models, views, templates, forms, auth, and reference - Added /intro/ to include patterns - Added /releases/ to exclude list (changelog not needed) FastAPI Config: - Added 7 start_urls covering tutorial, first-steps, path-params, body, dependencies, advanced, and reference - Added /deployment/ to exclude list Benefits: - Better initial page discovery - More comprehensive documentation coverage - Faster scraping (direct entry to important sections) - Reduced unnecessary page crawling - Consistent pattern across all configs All configs tested and validated: ✅ 71/71 tests passing ✅ All 6 configs validated successfully 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-19 02:24:56 +03:00
yusyus	78b9cae398	Init	2025-10-17 15:14:44 +00:00

18 Commits