Commit Graph

20 Commits

Author SHA1 Message Date
Nick Miethe
9042e1680c Enabling full support of the Claude Code documentation site, with support for all relevant pages and Anthropic's unconventional llms.txt 2026-01-11 14:15:32 +03:00
tsyhahaha
a7f13ec75f chore: add medusa-mercurjs unified config
Multi-source config combining Medusa docs and Mercur.js marketplace

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-05 22:32:31 +08:00
Chris Engelhard
9949cdcdca Fix: include docs references in unified skill output (#213)
* Fix: include docs references in unified skill output

* Fix: quality checker counts nested reference files

* fix(unified): pass through llms_txt_url and skip_llms_txt to doc scraper

* configs: add svelte CLI unified preset (llms.txt + categories)

---------

Co-authored-by: Chris Engelhard <chris@chrisengelhard.nl>
2026-01-01 19:40:51 +03:00
yusyus
65ded6c07c fix: Fix local repo extraction limitations (code analyzer, exclusions, enhancement)
This commit fixes three critical limitations discovered during local repository skill extraction testing:

**Fix 1: Code Analyzer Import Issue**
- Changed unified_scraper.py to use absolute imports instead of relative imports
- Fixed: `from github_scraper import` → `from skill_seekers.cli.github_scraper import`
- Fixed: `from pdf_scraper import` → `from skill_seekers.cli.pdf_scraper import`
- Result: CodeAnalyzer now available during extraction, deep analysis works

**Fix 2: Unity Library Exclusions**
- Updated should_exclude_dir() to accept and check full directory paths
- Updated _extract_file_tree_local() to pass both dir name and full path
- Added exclusion config passing from unified_scraper to github_scraper
- Result: exclude_dirs_additional now works (297 files excluded in test)
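
The path-aware exclusion check described in Fix 2 might look roughly like the sketch below. It is an illustration only: the three-argument signature, the `DEFAULT_EXCLUDES` set, and the segment-matching rule are assumptions, not the project's actual `should_exclude_dir()`.

```python
from pathlib import PurePosixPath

# Hypothetical built-in exclusions; the real defaults are not listed in the commit.
DEFAULT_EXCLUDES = {".git", "node_modules", "__pycache__"}


def should_exclude_dir(dir_name: str, dir_path: str, extra_excludes: list) -> bool:
    """Return True if a directory should be skipped during file-tree extraction.

    dir_name is the bare directory name, dir_path is its path relative to the
    repository root, and extra_excludes stands in for exclude_dirs_additional.
    """
    if dir_name in DEFAULT_EXCLUDES:
        return True
    parts = PurePosixPath(dir_path).parts
    for pattern in extra_excludes:
        pattern_parts = PurePosixPath(pattern).parts
        if not pattern_parts:
            continue  # ignore empty patterns instead of matching everything
        n = len(pattern_parts)
        # Match the pattern as a run of whole path segments anywhere in the path,
        # so "Library/PackageCache" matches but "MyLibrary" does not match "Library".
        if any(parts[i:i + n] == pattern_parts for i in range(len(parts) - n + 1)):
            return True
    return False
```

Checking the full path is what lets a multi-segment entry such as `Library/PackageCache` take effect; a check on the directory name alone can never match it.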

**Fix 3: AI Enhancement for Single Sources**
- Changed read_reference_files() to use rglob() for recursive search
- Now finds reference files in subdirectories (e.g., references/github/README.md)
- Result: AI enhancement works with unified skills that have nested references
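
The recursive lookup from Fix 3 can be sketched as follows. This is a minimal illustration rather than the project's `read_reference_files()`: the helper name `find_reference_files` and the `*.md` pattern are assumptions.

```python
from pathlib import Path


def find_reference_files(skill_dir: Path) -> list:
    """Return every Markdown file under <skill_dir>/references, nested or not."""
    references = skill_dir / "references"
    if not references.is_dir():
        return []
    # rglob() walks subdirectories as well; glob("*.md") would only see the
    # top level and miss files such as references/github/README.md.
    return sorted(references.rglob("*.md"))
```

With `glob("*.md")` in place of `rglob("*.md")`, a unified skill whose references live in per-source subdirectories would yield an empty list.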

**Test Results:**
- Code Analyzer: Working (deep analysis running)
- Unity Exclusions: Working (297 files excluded from 679)
- AI Enhancement: Working (finds and reads nested references)

**Files Changed:**
- src/skill_seekers/cli/unified_scraper.py (Fix 1 & 2)
- src/skill_seekers/cli/github_scraper.py (Fix 2)
- src/skill_seekers/cli/utils.py (Fix 3)

**Test Artifacts:**
- configs/deck_deck_go_local.json (test configuration)
- docs/LOCAL_REPO_TEST_RESULTS.md (comprehensive test report)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-21 22:24:38 +03:00
yusyus
70ca1d9ba6 docs(A1.9): Add comprehensive git source documentation and example repository
Phase 4 Complete:
- Updated README.md with git source usage examples and use cases
- Created docs/GIT_CONFIG_SOURCES.md (800+ lines comprehensive guide)
- Updated CHANGELOG.md with v2.2.0 release notes
- Added configs/example-team/ example repository with E2E test

Documentation covers:
- Quick start and architecture
- MCP tools reference (4 tools with examples)
- Authentication for GitHub, GitLab, Bitbucket
- Use cases (small teams, enterprise, open source)
- Best practices, troubleshooting, advanced topics
- Complete API reference

Example repository includes:
- 3 example configs (react-custom, vue-internal, company-api)
- README with usage guide
- E2E test script (7 steps, 100% passing)

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-21 19:38:26 +03:00
yusyus
5d8c7e39f6 Add unified multi-source scraping feature (Phases 7-11)
Completes the unified scraping system implementation:

**Phase 7: Unified Skill Builder**
- cli/unified_skill_builder.py: Generates final skill structure
- Inline conflict warnings (⚠️) in API reference
- Side-by-side docs vs code comparison
- Severity-based conflict grouping
- Separate conflicts.md report

**Phase 8: MCP Integration**
- skill_seeker_mcp/server.py: Auto-detects unified vs legacy configs
- Routes to unified_scraper.py or doc_scraper.py automatically
- Supports merge_mode parameter override
- Maintains full backward compatibility

**Phase 9: Example Unified Configs**
- configs/react_unified.json: React docs + GitHub
- configs/django_unified.json: Django docs + GitHub
- configs/fastapi_unified.json: FastAPI docs + GitHub
- configs/fastapi_unified_test.json: Test config with limited pages

**Phase 10: Comprehensive Tests**
- cli/test_unified_simple.py: Integration tests (all passing)
- Tests unified config validation
- Tests backward compatibility
- Tests mixed source types
- Tests error handling

**Phase 11: Documentation**
- docs/UNIFIED_SCRAPING.md: Complete guide (1000+ lines)
- Examples, best practices, troubleshooting
- Architecture diagrams and data flow
- Command reference

**Additional:**
- demo_conflicts.py: Interactive conflict detection demo
- TEST_RESULTS.md: Complete test results and findings
- cli/unified_scraper.py: Fixed doc_scraper integration (subprocess)

**Features:**
- Multi-source scraping (docs + GitHub + PDF)
- Conflict detection (4 types, 3 severity levels)
- Rule-based merging (fast, deterministic)
- Claude-enhanced merging (AI-powered)
- Transparent conflict reporting
- MCP auto-detection
- Backward compatibility

**Test Results:**
- 6/6 integration tests passed
- 4 unified configs validated
- 3 legacy configs backward compatible
- 5 conflicts detected in test data
- All documentation complete

🤖 Generated with Claude Code
2025-10-26 16:33:41 +03:00
yusyus
f2b26ff5fe feat: Phase 1-2 - Unified config format + deep code analysis
Phase 1: Unified Config Format
- Created config_validator.py with full validation
- Supports multiple sources (documentation, github, pdf)
- Backward compatible with legacy configs
- Auto-converts legacy → unified format
- Validates merge_mode and code_analysis_depth
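
The legacy-to-unified conversion could work along these lines. This is a sketch under stated assumptions: the commit confirms only that unified configs carry a `sources` array with `documentation`, `github`, and `pdf` source types; the function name `to_unified` and the choice to keep `name` and `description` at the top level are hypothetical.

```python
def to_unified(config: dict) -> dict:
    """Wrap a legacy single-source config in the unified 'sources' layout."""
    if "sources" in config:
        return config  # already unified, nothing to convert

    # Assumption: name/description describe the whole skill and stay at the top
    # level, while every other legacy key belongs to the documentation source.
    shared = {key: config[key] for key in ("name", "description") if key in config}
    doc_source = {key: value for key, value in config.items() if key not in shared}
    doc_source["type"] = "documentation"
    return {**shared, "sources": [doc_source]}
```

Returning unified configs unchanged keeps the conversion idempotent, so callers can apply it without first checking which format they hold.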

Phase 2: Deep Code Analysis
- Created code_analyzer.py with language-specific parsers
- Supports Python (AST), JavaScript/TypeScript (regex), C/C++ (regex)
- Configurable depth: surface, deep, full
- Extracts classes, functions, parameters, types, docstrings
- Integrated into github_scraper.py
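
For the Python parser, signature extraction with the standard `ast` module might look like the sketch below. It is not the code in code_analyzer.py; the function name `extract_signatures` and the output dictionary layout are assumptions, and keyword-only and variadic parameters are left out for brevity.

```python
import ast


def extract_signatures(source: str) -> list:
    """Collect name, parameters, return type, and docstring for each function."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        positional = node.args.posonlyargs + node.args.args
        # ast stores defaults only for the trailing parameters, so pad the front.
        padding = [None] * (len(positional) - len(node.args.defaults))
        defaults = padding + list(node.args.defaults)
        params = [
            {
                "name": arg.arg,
                "type": ast.unparse(arg.annotation) if arg.annotation else None,
                "default": ast.unparse(default) if default is not None else None,
            }
            for arg, default in zip(positional, defaults)
        ]
        results.append(
            {
                "name": node.name,
                "params": params,
                "returns": ast.unparse(node.returns) if node.returns else None,
                "docstring": ast.get_docstring(node),
            }
        )
    return results
```

`ast.unparse()` requires Python 3.9 or newer. Parsing the source instead of importing it means the analyzed repository's code is never executed.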

Features:
- Unified config with sources array
- Code analysis depth: surface/deep/full
- Language detection and parser selection
- Signature extraction with full parameter info
- Type hints and default values captured
- Docstring extraction
- Example config: godot_unified.json

Next: Conflict detection and merging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 15:09:38 +03:00
yusyus
a0017d3459 feat: Add Godot GitHub repository config
Config for godotengine/godot repository:
- Extracts README, issues, changelog, releases
- Targets core C++ files (core, scene, servers)
- Max 100 issues
- Surface layer only (no full code implementation)

Usage: python3 cli/github_scraper.py --config configs/godot_github.json

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 14:32:38 +03:00
yusyus
01c14d0e9c feat: Implement C1 GitHub Repository Scraping (Tasks C1.1-C1.12)
Complete implementation of GitHub repository scraping feature with all 12 tasks:

## Core Features Implemented

**C1.1: GitHub API Client**
- PyGithub integration with authentication support
- Support for GITHUB_TOKEN env var + config file token
- Rate limit handling and error management
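
Token lookup across the two supported places could be sketched as follows. The precedence shown (environment variable first, then the config file) is an assumption, as is the helper name `resolve_github_token`; the commit states only that both the `GITHUB_TOKEN` env var and a config `token` field are supported.

```python
import os
from typing import Optional


def resolve_github_token(config: dict) -> Optional[str]:
    """Pick a GitHub token from the environment or the config, if either is set."""
    # Assumed precedence: GITHUB_TOKEN wins, so a token never has to be
    # written into a config file that might be committed.
    token = os.environ.get("GITHUB_TOKEN") or config.get("token")
    return token or None  # None means unauthenticated access
```

Returning `None` rather than an empty string keeps the unauthenticated case explicit for whatever client consumes the value.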

**C1.2: README Extraction**
- Fetch README.md, README.rst, README.txt
- Support multiple locations (root, docs/, .github/)

**C1.3: Code Comments & Docstrings**
- Framework for extracting docstrings (surface layer)
- Placeholder for Python/JS comment extraction

**C1.4: Language Detection**
- Use GitHub's language detection API
- Percentage breakdown by bytes

**C1.5: Function/Class Signatures**
- Framework for signature extraction (surface layer only)

**C1.6: Usage Examples from Tests**
- Placeholder for test file analysis

**C1.7: GitHub Issues Extraction**
- Fetch open/closed issues via API
- Extract title, labels, milestone, state, timestamps
- Configurable max issues (default: 100)

**C1.8: CHANGELOG Extraction**
- Fetch CHANGELOG.md, CHANGES.md, HISTORY.md
- Try multiple common locations

**C1.9: GitHub Releases**
- Fetch releases via API
- Extract version tags, release notes, publish dates
- Full release history

**C1.10: CLI Tool**
- Complete `cli/github_scraper.py` (~700 lines)
- Argparse interface with config + direct modes
- GitHubScraper class for data extraction
- GitHubToSkillConverter class for skill building

**C1.11: MCP Integration**
- Added `scrape_github` tool to MCP server
- Natural language interface: "Scrape GitHub repo facebook/react"
- 10 minute timeout for scraping
- Full parameter support

**C1.12: Config Format**
- JSON config schema with example
- `configs/react_github.json` template
- Support for repo, name, description, token, flags

## Files Changed

- `cli/github_scraper.py` (NEW, ~700 lines)
- `configs/react_github.json` (NEW)
- `requirements.txt` (+PyGithub==2.5.0)
- `skill_seeker_mcp/server.py` (+scrape_github tool)

## Usage

```bash
# CLI usage
python3 cli/github_scraper.py --repo facebook/react
python3 cli/github_scraper.py --config configs/react_github.json

# MCP usage (via Claude Code)
"Scrape GitHub repository facebook/react"
"Extract issues and changelog from owner/repo"
```

## Implementation Notes

- Surface layer only (no full code implementation)
- Focus on documentation, issues, changelog, releases
- Skill size: 2-5 MB (manageable, focused)
- Covers 90%+ of real use cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 14:19:27 +03:00
Edgar I.
104818f983 feat: enable llms.txt for hono config 2025-10-24 18:27:17 +04:00
yusyus
6936057820 Add PDF documentation support (Tasks B1.1-B1.8)
Complete PDF extraction and skill conversion functionality:
- pdf_extractor_poc.py (1,004 lines): Extract text, code, images from PDFs
- pdf_scraper.py (353 lines): Convert PDFs to Claude skills
- MCP tool scrape_pdf: PDF scraping via Claude Code
- 7 comprehensive documentation guides (4,705 lines)
- Example PDF config format (configs/example_pdf.json)

Features:
- 3 code detection methods (font, indent, pattern)
- 19+ programming languages detected with confidence scoring
- Syntax validation and quality scoring (0-10 scale)
- Image extraction with size filtering (--extract-images)
- Chapter/section detection and page chunking
- Quality-filtered code examples (--min-quality)
- Three usage modes: config file, direct PDF, from extracted JSON

Technical:
- PyMuPDF (fitz) as primary library (60x faster than alternatives)
- Language detection with confidence scoring
- Code block merging across pages
- Comprehensive metadata and statistics
- Compatible with existing Skill Seeker workflow

MCP Integration:
- New scrape_pdf tool (10th MCP tool total)
- Supports all three usage modes
- 10-minute timeout for large PDFs
- Real-time streaming output

Documentation (4,705 lines):
- B1_COMPLETE_SUMMARY.md: Overview of all 8 tasks
- PDF_PARSING_RESEARCH.md: Library comparison and benchmarks
- PDF_EXTRACTOR_POC.md: POC documentation
- PDF_CHUNKING.md: Page chunking guide
- PDF_SYNTAX_DETECTION.md: Syntax detection guide
- PDF_IMAGE_EXTRACTION.md: Image extraction guide
- PDF_SCRAPER.md: PDF scraper usage guide
- PDF_MCP_TOOL.md: MCP integration guide

Tasks completed: B1.1-B1.8
Addresses Issue #27
See docs/B1_COMPLETE_SUMMARY.md for complete details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 00:23:16 +03:00
Schuyler Erle
183c7596a5 Add config for Ansible core documentation (#147)
Co-authored-by: Schuyler Erle <schuyler@ardc.net>
2025-10-22 21:50:59 +03:00
Schuyler Erle
ab585584d0 Add config for Claude Code documentation 2025-10-20 21:27:19 -07:00
yusyus
80382551b1 Fix Issue #7: Fix all broken configs and add Laravel support
Tested and fixed all 11 production configs - now 100% working!

Fixed Configs:
1. Django (configs/django.json)
   - Was using: div.document (selector doesn't exist)
   - Now using: article (1,688 chars of content)
   - Verified on: https://docs.djangoproject.com/en/stable/

2. Astro (configs/astro.json)
   - Was using: homepage URL (no article element)
   - Now using: /en/getting-started/ with article selector
   - Added: start_urls, categories, improved URL patterns
   - Increased max_pages from 15 to 100

3. Tailwind (configs/tailwind.json)
   - Was using: article (selector doesn't exist)
   - Now using: div.prose (195 chars of content)
   - Verified on: https://tailwindcss.com/docs

New Config:
4. Laravel (configs/laravel.json) - NEW!
   - Created complete Laravel 9.x config
   - Selector: #main-content (16,131 chars of content)
   - Base URL: https://laravel.com/docs/9.x/
   - Includes: 8 start_urls covering installation, routing,
     controllers, views, Blade, Eloquent, migrations, auth
   - Categories: getting_started, routing, views, models,
     authentication, api
   - max_pages: 500

Test Results:
- 11/11 configs tested and verified (100%)
- All selectors extract content properly
- All base URLs accessible

Working Configs:
- astro.json
- django.json
- fastapi.json
- godot.json
- godot-large-example.json
- kubernetes.json
- laravel.json (NEW)
- react.json
- steam-economy-complete.json
- tailwind.json
- vue.json

How I Tested:
1. Created test_selectors.py to find correct CSS selectors
2. Tested each config's base_url + selector combination
3. Verified content extraction (not just "found" but actual text)
4. Ensured meaningful content length (50+ chars minimum)

Fixes Issue #7 - Laravel scraping not working
Fixes #7
2025-10-21 00:16:39 +03:00
yusyus
bddb57f5ef Add large documentation handling (40K+ pages support)
Implement comprehensive system for handling very large documentation sites
with intelligent splitting strategies and router/hub architecture.

**New CLI Tools:**
- cli/split_config.py: Split large configs into focused sub-skills
  * Strategies: auto, category, router, size
  * Configurable target pages per skill (default: 5000)
  * Dry-run mode for preview

- cli/generate_router.py: Create intelligent router/hub skills
  * Auto-generates routing logic based on keywords
  * Creates SKILL.md with topic-to-skill mapping
  * Infers router name from sub-skills

- cli/package_multi.py: Batch package multiple skills
  * Package router + all sub-skills in one command
  * Progress tracking for each skill
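
As a rough illustration of the size strategy, the sketch below plans how many pages each sub-skill would receive for a given target. The function name `plan_size_split` and the even-distribution rule are assumptions; only the default target of 5000 pages comes from the commit message.

```python
import math


def plan_size_split(total_pages: int, target_pages: int = 5000) -> list:
    """Return the page count for each sub-skill, none exceeding target_pages."""
    if target_pages <= 0:
        raise ValueError("target_pages must be positive")
    if total_pages <= 0:
        return []
    parts = math.ceil(total_pages / target_pages)
    base, remainder = divmod(total_pages, parts)
    # Spread the remainder so sub-skills differ by at most one page.
    return [base + 1] * remainder + [base] * (parts - remainder)


print(plan_size_split(40000))  # eight sub-skills of 5000 pages each
```

Distributing pages evenly avoids a final sub-skill that holds only a handful of leftover pages.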

**MCP Integration:**
- Added split_config tool (8 total MCP tools now)
- Added generate_router tool
- Supports 40K+ page documentation via MCP

**Configuration:**
- New split_strategy parameter in configs
- split_config section for fine-tuned control
- checkpoint section for resume capability (ready for Phase 4)
- Example: configs/godot-large-example.json

**Documentation:**
- docs/LARGE_DOCUMENTATION.md (500+ lines)
  * Complete guide for 10K+ page documentation
  * All splitting strategies explained
  * Detailed workflows with examples
  * Best practices and troubleshooting
  * Real-world examples (AWS, Microsoft, Godot)

**Features:**
- Handle 40K+ page documentation efficiently
- Parallel scraping support (5x-10x faster)
- Router + sub-skills architecture
- Intelligent keyword-based routing
- Multiple splitting strategies
- Full MCP integration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 20:48:03 +03:00
yusyus
f103aa62cb Clean up tracked files and repository structure
Remove unnecessary files:
- configs/.DS_Store (macOS system file, should not be tracked)

This ensures only relevant project files are version controlled
and improves repository hygiene.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 19:45:13 +03:00
yusyus
d7e6142ab0 Add test configurations for MCP validation
Add 4 test configuration files used for validating MCP functionality:
- astro.json: Astro framework documentation (15 pages, production test)
- python-tutorial-test.json: Python tutorial (minimal test case)
- tailwind.json: Tailwind CSS documentation (test case)
- test-manual.json: Manual testing configuration

These configs were used to verify:
- Config generation via generate_config tool
- Config validation via validate_config tool
- Page estimation via estimate_pages tool
- Full scraping workflow via scrape_docs tool
- Skill packaging via package_skill tool

All tests passed successfully in production Claude Code environment.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 19:44:27 +03:00
jarek
7a4c1d7083 kubernetes config for official docs 2025-10-19 09:28:44 +02:00
yusyus
59c2f9126d Optimize all framework configs with start_urls for better coverage
All configs now follow the steam-economy-complete.json pattern with:
- Multiple start_urls for comprehensive entry points
- Improved include patterns for better targeting
- Enhanced exclude patterns to skip irrelevant pages
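
How include and exclude patterns narrow a crawl can be sketched as below. Treating patterns as plain substrings of the URL is an assumption made for illustration, and `should_visit` is a hypothetical helper, not the scraper's own function.

```python
def should_visit(url: str, base_url: str, include: list, exclude: list) -> bool:
    """Decide whether a discovered link should be queued for scraping."""
    if not url.startswith(base_url):
        return False  # stay inside the documentation site
    if any(pattern in url for pattern in exclude):
        return False  # exclusions take priority over inclusions
    # An empty include list means "everything under base_url".
    return not include or any(pattern in url for pattern in include)
```

The crawl queue would be seeded with the config's start_urls and each newly discovered link passed through this check, which is why extra entry points and tighter patterns reduce unnecessary page fetches.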

Godot Config:
- Added 7 start_urls covering getting started, scripting, 2D, 3D, physics, animation, and classes
- Added include patterns: /getting_started/, /tutorials/, /classes/
- More focused scraping of core documentation

React Config:
- Added 6 start_urls covering learn, quick-start, reference, and hooks
- Existing patterns maintained (already well-optimized)

Vue Config:
- Added 6 start_urls covering introduction, essentials, components, composables, and API
- Fixed base_url from https://vuejs.org/guide/ to https://vuejs.org/
- Added /partners/ to exclude list

Django Config:
- Added 7 start_urls covering intro, models, views, templates, forms, auth, and reference
- Added /intro/ to include patterns
- Added /releases/ to exclude list (changelog not needed)

FastAPI Config:
- Added 7 start_urls covering tutorial, first-steps, path-params, body, dependencies, advanced, and reference
- Added /deployment/ to exclude list

Benefits:
- Better initial page discovery
- More comprehensive documentation coverage
- Faster scraping (direct entry to important sections)
- Reduced unnecessary page crawling
- Consistent pattern across all configs

All configs tested and validated:
- 71/71 tests passing
- All 6 configs validated successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 02:24:56 +03:00
yusyus
78b9cae398 Init 2025-10-17 15:14:44 +00:00