CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
🎯 Current Status (December 28, 2025)
Version: v2.5.0 (Production Ready - Multi-Platform Feature Parity!)
Active Development: Multi-platform support complete
Recent Updates (December 2025):
🎉 MAJOR RELEASE: Multi-Platform Feature Parity! (v2.5.0)
- 🌐 Multi-LLM Support: Full support for 4 platforms - Claude AI, Google Gemini, OpenAI ChatGPT, Generic Markdown
- 🔄 Complete Feature Parity: All skill modes work with all platforms
- 🏗️ Platform Adaptors: Clean architecture with platform-specific implementations
- ✨ 18 MCP Tools: Enhanced with multi-platform support (package, upload, enhance)
- 📚 Comprehensive Documentation: Complete guides for all platforms
- 🧪 Test Coverage: 700 tests passing, extensive platform compatibility testing
Overview
This is a Python-based documentation scraper that converts ANY documentation website into a Claude skill. It's a single-file tool (doc_scraper.py) that scrapes documentation, extracts code patterns, detects programming languages, and generates structured skill files ready for use with Claude.
Dependencies
pip3 install requests beautifulsoup4
Core Commands
Run with a preset configuration
python3 cli/doc_scraper.py --config configs/godot.json
python3 cli/doc_scraper.py --config configs/react.json
python3 cli/doc_scraper.py --config configs/vue.json
python3 cli/doc_scraper.py --config configs/django.json
python3 cli/doc_scraper.py --config configs/fastapi.json
Interactive mode (for new frameworks)
python3 cli/doc_scraper.py --interactive
Quick mode (minimal config)
python3 cli/doc_scraper.py --name react --url https://react.dev/ --description "React framework"
Skip scraping (use cached data)
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
Resume interrupted scrapes
# If scrape was interrupted
python3 cli/doc_scraper.py --config configs/godot.json --resume
# Start fresh (clear checkpoint)
python3 cli/doc_scraper.py --config configs/godot.json --fresh
Large documentation (10K-40K+ pages)
# 1. Estimate page count
python3 cli/estimate_pages.py configs/godot.json
# 2. Split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router
# 3. Generate router skill
python3 cli/generate_router.py configs/godot-*.json
# 4. Package multiple skills
python3 cli/package_multi.py output/godot*/
AI-powered SKILL.md enhancement
# Option 1: During scraping (API-based, requires ANTHROPIC_API_KEY)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/doc_scraper.py --config configs/react.json --enhance
# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# Option 3: Standalone after scraping (API-based)
python3 cli/enhance_skill.py output/react/
# Option 4: Standalone after scraping (LOCAL, no API key)
python3 cli/enhance_skill_local.py output/react/
The LOCAL enhancement option (--enhance-local or enhance_skill_local.py) opens a new terminal with Claude Code, which analyzes the reference files and enhances SKILL.md automatically. This requires a Claude Code Max plan but no API key.
MCP Integration (Claude Code)
# One-time setup
./setup_mcp.sh
# Then in Claude Code, use natural language:
"List all available configs"
"Generate config for Tailwind at https://tailwindcss.com/docs"
"Split configs/godot.json using router strategy"
"Generate router for configs/godot-*.json"
"Package skill at output/react/"
18 MCP tools available with multi-platform support: list_configs, generate_config, validate_config, fetch_config, estimate_pages, scrape_docs, scrape_github, scrape_pdf, package_skill, upload_skill, enhance_skill (NEW), install_skill, split_config, generate_router, add_config_source, list_config_sources, remove_config_source, submit_config
Test with limited pages (edit config first)
Set "max_pages": 20 in the config file to test with fewer pages.
Multi-Platform Support (v2.5.0+)
4 Platforms Fully Supported:
- Claude AI (default) - ZIP format, Skills API, MCP integration
- Google Gemini - tar.gz format, Files API, 1M token context
- OpenAI ChatGPT - ZIP format, Assistants API, Vector Store
- Generic Markdown - ZIP format, universal compatibility
All skill modes work with all platforms:
- Documentation scraping
- GitHub repository analysis
- PDF extraction
- Unified multi-source
- Local repository analysis
Use the --target parameter for packaging, upload, and enhancement:
# Package for different platforms
skill-seekers package output/react/ --target claude # Default
skill-seekers package output/react/ --target gemini
skill-seekers package output/react/ --target openai
skill-seekers package output/react/ --target markdown
# Upload to platforms (requires API keys)
skill-seekers upload output/react.zip --target claude
skill-seekers upload output/react-gemini.tar.gz --target gemini
skill-seekers upload output/react-openai.zip --target openai
# Enhance with platform-specific AI
skill-seekers enhance output/react/ --target claude # Sonnet 4
skill-seekers enhance output/react/ --target gemini --mode api # Gemini 2.0
skill-seekers enhance output/react/ --target openai --mode api # GPT-4o
See Multi-Platform Guide and Feature Matrix for complete details.
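As a rough illustration of how the --target value relates to the archive names used in the upload commands above, here is a minimal sketch. The helper archive_path() and the ARCHIVE_SUFFIX table are hypothetical, not part of skill-seekers, and the file name for the markdown target is an assumption (the list above only says it is a ZIP).

```python
# Archive suffix per --target value, taken from the upload examples above.
# The "markdown" entry is an assumption: the document only states "ZIP format".
ARCHIVE_SUFFIX = {
    "claude": ".zip",
    "gemini": "-gemini.tar.gz",
    "openai": "-openai.zip",
    "markdown": "-markdown.zip",  # assumed naming, not confirmed by the docs
}


def archive_path(skill_dir: str, target: str = "claude") -> str:
    """Return the archive path expected for a skill directory and target."""
    if target not in ARCHIVE_SUFFIX:
        raise ValueError(f"unknown target: {target!r}")
    return skill_dir.rstrip("/") + ARCHIVE_SUFFIX[target]


print(archive_path("output/react/", "gemini"))  # output/react-gemini.tar.gz
```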
Architecture
Single-File Design
The entire tool is contained in doc_scraper.py (~737 lines). It follows a class-based architecture with a single DocToSkillConverter class that handles:
- Web scraping: BFS traversal with URL validation
- Content extraction: CSS selectors for title, content, code blocks
- Language detection: Heuristic-based detection from code samples (Python, JavaScript, GDScript, C++, etc.)
- Pattern extraction: Identifies common coding patterns from documentation
- Categorization: Smart categorization using URL structure, page titles, and content keywords with scoring
- Skill generation: Creates SKILL.md with real code examples and categorized reference files
Data Flow
- Scrape Phase:
  - Input: Config JSON (name, base_url, selectors, url_patterns, categories, rate_limit, max_pages)
  - Process: BFS traversal starting from base_url, respecting include/exclude patterns
  - Output: output/{name}_data/pages/*.json + summary.json
- Build Phase:
  - Input: Scraped JSON data from output/{name}_data/
  - Process: Load pages → Smart categorize → Extract patterns → Generate references
  - Output: output/{name}/SKILL.md + output/{name}/references/*.md
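The scrape phase can be pictured with the following minimal sketch of a breadth-first crawl that honors include/exclude patterns and max_pages. It is not the code in doc_scraper.py: the function bodies, the get_links callback, and the in-memory SITE dictionary (used instead of real HTTP requests) are all assumptions made for illustration.

```python
from collections import deque


def is_valid_url(url, base_url, include, exclude):
    """Keep URLs under base_url that match an include pattern and no exclude pattern."""
    if not url.startswith(base_url):
        return False
    if include and not any(p in url for p in include):
        return False
    return not any(p in url for p in exclude)


def bfs_crawl(start, get_links, base_url, include=(), exclude=(), max_pages=100):
    """Visit pages breadth-first; get_links(url) returns the links found on a page."""
    queue, seen, visited = deque([start]), {start}, []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen and is_valid_url(link, base_url, include, exclude):
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory site standing in for real pages.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/api/a",
        "https://docs.example.com/blog/x",
    ],
    "https://docs.example.com/api/a": [
        "https://docs.example.com/api/b",
        "https://other.example.org/",
    ],
    "https://docs.example.com/api/b": ["https://docs.example.com/"],
}

pages = bfs_crawl(
    "https://docs.example.com/",
    lambda u: SITE.get(u, []),
    base_url="https://docs.example.com/",
    exclude=["/blog/"],
)
print(pages)
```

The seen set prevents a page from being queued twice, which is what keeps the crawl from looping on sites whose pages link back to the start page.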
Directory Structure
Skill_Seekers/
├── cli/ # CLI tools
│ ├── doc_scraper.py # Main scraping & building tool
│ ├── enhance_skill.py # AI enhancement (API-based)
│ ├── enhance_skill_local.py # AI enhancement (LOCAL, no API)
│ ├── estimate_pages.py # Page count estimator
│ ├── split_config.py # Large docs splitter (NEW)
│ ├── generate_router.py # Router skill generator (NEW)
│ ├── package_skill.py # Single skill packager
│ └── package_multi.py # Multi-skill packager (NEW)
├── mcp/ # MCP server
│ ├── server.py # 18 MCP tools (includes upload)
│ └── README.md
├── configs/ # Preset configurations
│ ├── godot.json
│ ├── godot-large-example.json # Large docs example (NEW)
│ ├── react.json
│ └── ...
├── docs/ # Documentation
│ ├── CLAUDE.md # Technical architecture (this file)
│ ├── LARGE_DOCUMENTATION.md # Large docs guide (NEW)
│ ├── ENHANCEMENT.md
│ ├── MCP_SETUP.md
│ └── ...
└── output/ # Generated output (git-ignored)
    ├── {name}_data/ # Raw scraped data (cached)
    │ ├── pages/ # Individual page JSONs
    │ ├── summary.json # Scraping summary
    │ └── checkpoint.json # Resume checkpoint (NEW)
    └── {name}/ # Generated skill
      ├── SKILL.md # Main skill file with examples
      ├── SKILL.md.backup # Backup (if enhanced)
      ├── references/ # Categorized documentation
      │ ├── index.md
      │ ├── getting_started.md
      │ ├── api.md
      │ └── ...
      ├── scripts/ # Empty (for user scripts)
      └── assets/ # Empty (for user assets)
Configuration Format
Config files in configs/*.json contain:
- name: Skill identifier (e.g., "godot", "react")
- description: When to use this skill
- base_url: Starting URL for scraping
- selectors: CSS selectors for content extraction
  - main_content: Main documentation content (e.g., "article", "div[role='main']")
  - title: Page title selector
  - code_blocks: Code sample selector (e.g., "pre code", "pre")
- url_patterns: URL filtering
  - include: Only scrape URLs containing these patterns
  - exclude: Skip URLs containing these patterns
- categories: Keyword-based categorization mapping
- rate_limit: Delay between requests (seconds)
- max_pages: Maximum pages to scrape
- split_strategy: (Optional) How to split large docs: "auto", "category", "router", "size"
- split_config: (Optional) Split configuration
  - target_pages_per_skill: Pages per sub-skill (default: 5000)
  - create_router: Create router/hub skill (default: true)
  - split_by_categories: Category names to split by
- checkpoint: (Optional) Checkpoint/resume configuration
  - enabled: Enable checkpointing (default: false)
  - interval: Save every N pages (default: 1000)
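To make the field list concrete, here is a sketch of a config built as a Python dictionary and printed as JSON. The field names follow the list above; every value (the URL, the selectors, the category keywords) is a placeholder, and the REQUIRED tuple and missing_fields() helper are hypothetical, since this document does not say which fields the tool treats as mandatory.

```python
import json

# Illustrative config; field names follow the list above, values are placeholders.
config = {
    "name": "myframework",
    "description": "Use when working with MyFramework",
    "base_url": "https://docs.example.com/",
    "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code",
    },
    "url_patterns": {"include": ["/docs/"], "exclude": ["/blog/"]},
    "categories": {
        "getting_started": ["intro", "quickstart"],
        "api": ["api", "reference"],
    },
    "rate_limit": 0.5,
    "max_pages": 20,
    "checkpoint": {"enabled": True, "interval": 1000},
}

# Assumption: treat these as the minimum needed to start a scrape.
REQUIRED = ("name", "base_url")


def missing_fields(cfg):
    """Return the required keys that are absent or empty in cfg."""
    return [key for key in REQUIRED if not cfg.get(key)]


print(missing_fields(config))
print(json.dumps(config, indent=2))
```

Saving the printed JSON as configs/myframework.json would give a file in the shape that the --config option expects, assuming the placeholder values are replaced with real ones.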
Key Features
Auto-detect existing data: Tool checks for output/{name}_data/ and prompts to reuse, avoiding re-scraping.
Language detection: Detects code languages from:
- CSS class attributes (language-*, lang-*)
- Heuristics (keywords like def, const, func, etc.)
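The two detection sources can be combined as in the sketch below: an explicit CSS class wins, and keyword heuristics act as a fallback. The keyword table is an assumption chosen for illustration, and the function signature is hypothetical; the real detect_language() in doc_scraper.py may use different rules.

```python
import re

# Keyword heuristics are illustrative; the real detect_language() may differ.
KEYWORD_HINTS = [
    ("python", re.compile(r"^\s*(def |import |from \w+ import )", re.M)),
    ("javascript", re.compile(r"\b(const|let|function)\b")),
    ("gdscript", re.compile(r"^\s*(func |extends )", re.M)),
    ("cpp", re.compile(r"#include\s*<")),
]


def detect_language(css_classes, code):
    """Guess the language of a code sample from its CSS classes, then its text."""
    # 1. Prefer an explicit class such as "language-python" or "lang-js".
    for cls in css_classes:
        match = re.match(r"(?:language|lang)-(\w+)", cls)
        if match:
            return match.group(1)
    # 2. Fall back to keyword heuristics, first match wins.
    for lang, pattern in KEYWORD_HINTS:
        if pattern.search(code):
            return lang
    return "unknown"


print(detect_language(["highlight", "language-python"], ""))
print(detect_language([], "func _ready():\n\tpass"))
```

Because the heuristics are checked in order, ambiguous samples resolve to the earliest matching entry, so the ordering of KEYWORD_HINTS is itself a design choice.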
Pattern extraction: Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts following code blocks (up to 5 per page).
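A simplified view of this marker-based extraction is sketched below. It assumes the page has already been reduced to an ordered list of (kind, text) pairs, which is a hypothetical representation and not how doc_scraper.py stores pages; the returned dictionary keys are likewise illustrative.

```python
MARKERS = ("Example:", "Pattern:", "Usage:")


def extract_patterns(blocks, limit=5):
    """Pair each marker paragraph with the code block that follows it.

    blocks is a list of (kind, text) pairs in page order, where kind is
    "text" or "code". At most `limit` patterns are returned per page.
    """
    patterns = []
    pending = None  # marker text waiting for the next code block
    for kind, text in blocks:
        if kind == "text":
            pending = text if any(m in text for m in MARKERS) else None
        elif kind == "code" and pending is not None:
            patterns.append({"description": pending, "code": text})
            pending = None
            if len(patterns) >= limit:
                break
    return patterns


page = [
    ("text", "Introduction"),
    ("code", "x = 1"),
    ("text", "Example: adding numbers"),
    ("code", "1 + 1"),
    ("text", "Usage:"),
    ("code", "run()"),
]
print(extract_patterns(page))
```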
Smart categorization:
- Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
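The scoring rule above can be sketched as follows. The weights (3, 2, 1), the threshold of 2, and the "other" fallback come from the list; summing the score over every keyword in a category and picking the highest-scoring category are assumptions, as is the page dictionary layout.

```python
def smart_categorize(page, categories, threshold=2):
    """Return the best-scoring category name for a page, or "other"."""
    url = page["url"].lower()
    title = page["title"].lower()
    content = page["content"].lower()
    best, best_score = "other", 0
    for name, keywords in categories.items():
        score = 0
        for keyword in keywords:
            keyword = keyword.lower()
            # 3 points for a URL match, 2 for the title, 1 for the content.
            score += 3 * (keyword in url) + 2 * (keyword in title) + (keyword in content)
        if score >= threshold and score > best_score:
            best, best_score = name, score
    return best


categories = {"api": ["api", "reference"], "guide": ["tutorial"]}

api_page = {
    "url": "https://docs.example.com/api/node",
    "title": "Node",
    "content": "The Node class.",
}
weak_page = {
    "url": "https://docs.example.com/misc",
    "title": "Misc",
    "content": "See the tutorial.",
}
print(smart_categorize(api_page, categories))
print(smart_categorize(weak_page, categories))
```

In the second call the only hit is a single content match worth 1 point, which is below the threshold, so the page falls back to "other".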
Enhanced SKILL.md: Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common pattern section
- Category file listings
AI-Powered Enhancement: Two scripts to dramatically improve SKILL.md quality:
- enhance_skill.py: Uses Anthropic API (~$0.15-$0.30 per skill, requires API key)
- enhance_skill_local.py: Uses Claude Code Max (free, no API key needed)
- Transforms generic 75-line templates into comprehensive 500+ line guides
- Extracts best examples, explains key concepts, adds navigation guidance
- Output quality: rated 9/10 (based on the steam-economy test)
Large Documentation Support (NEW): Handle 10K-40K+ page documentation:
- split_config.py: Split large configs into multiple focused sub-skills
- generate_router.py: Create intelligent router/hub skills that direct queries
- package_multi.py: Package multiple skills at once
- 4 split strategies: auto, category, router, size
- Parallel scraping support for faster processing
- MCP integration for natural language usage
Checkpoint/Resume (NEW): Never lose progress on long scrapes:
- Auto-saves every N pages (configurable, default: 1000)
- Resume with --resume flag
- Clear checkpoint with --fresh flag
- Saves on interruption (Ctrl+C)
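The checkpoint behavior listed above (periodic saves, resume, save on Ctrl+C) could look roughly like this. The JSON layout with "visited" and "pending" keys is an assumption and not the documented format of checkpoint.json, and crawl() only marks URLs as visited as a stand-in for real scraping.

```python
import json
import os
import tempfile


def save_checkpoint(path, visited, pending):
    """Write crawl state atomically so an interrupted write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"visited": sorted(visited), "pending": list(pending)}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (visited, pending); both are empty when no checkpoint exists."""
    if not os.path.exists(path):
        return set(), []
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return set(state["visited"]), state["pending"]


def crawl(urls, path, interval=1000):
    """Process URLs, saving a checkpoint every `interval` pages and on Ctrl+C."""
    visited, pending = load_checkpoint(path)
    pending = pending or [u for u in urls if u not in visited]
    try:
        while pending:
            visited.add(pending.pop(0))  # stand-in for scraping one page
            if len(visited) % interval == 0:
                save_checkpoint(path, visited, pending)
    except KeyboardInterrupt:
        save_checkpoint(path, visited, pending)  # save on interruption
        raise
    return visited


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "checkpoint.json")
    crawl(["a", "b", "c"], checkpoint, interval=2)
    print(load_checkpoint(checkpoint))
```

Writing to a temporary file and then calling os.replace() means a crash during a save leaves the previous checkpoint intact.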
Key Code Locations
- URL validation: is_valid_url() doc_scraper.py:47-62
- Content extraction: extract_content() doc_scraper.py:64-131
- Language detection: detect_language() doc_scraper.py:133-163
- Pattern extraction: extract_patterns() doc_scraper.py:165-181
- Smart categorization: smart_categorize() doc_scraper.py:280-321
- Category inference: infer_categories() doc_scraper.py:323-349
- Quick reference generation: generate_quick_reference() doc_scraper.py:351-370
- SKILL.md generation: create_enhanced_skill_md() doc_scraper.py:424-540
- Scraping loop: scrape_all() doc_scraper.py:226-249
- Main workflow: main() doc_scraper.py:661-733
Workflow Examples
First-time scraping (scrape + build)
# 1. Scrape + Build
python3 cli/doc_scraper.py --config configs/godot.json
# Time: 20-40 minutes
# 2. Package
python3 cli/package_skill.py output/godot/
# Result: godot.zip
Using cached data (fast iteration)
# 1. Use existing data
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes
# 2. Package
python3 cli/package_skill.py output/godot/
Creating a new framework config
# Option 1: Interactive
python3 cli/doc_scraper.py --interactive
# Option 2: Copy and modify
cp configs/react.json configs/myframework.json
# Edit configs/myframework.json
python3 cli/doc_scraper.py --config configs/myframework.json
Large documentation workflow (40K pages)
# 1. Estimate page count (fast, 1-2 minutes)
python3 cli/estimate_pages.py configs/godot.json
# 2. Split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router --target-pages 5000
# Creates: godot-scripting.json, godot-2d.json, godot-3d.json, etc.
# 3. Scrape all in parallel (4-8 hours instead of 20-40!)
for config in configs/godot-*.json; do
  python3 cli/doc_scraper.py --config "$config" &
done
wait
# 4. Generate intelligent router skill
python3 cli/generate_router.py configs/godot-*.json
# 5. Package all skills
python3 cli/package_multi.py output/godot*/
# 6. Upload all .zip files to Claude
# Result: Router automatically directs queries to the right sub-skill!
Time savings: Parallel scraping reduces 20-40 hours to 4-8 hours
See full guide: Large Documentation Guide
Testing Selectors
To find the right CSS selectors for a documentation site:
from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
Running Tests
IMPORTANT: You must install the package before running tests
# 1. Install package in editable mode (one-time setup)
pip install -e .
# 2. Run all tests
pytest
# 3. Run specific test files
pytest tests/test_config_validation.py
pytest tests/test_github_scraper.py
# 4. Run with verbose output
pytest -v
# 5. Run with coverage report
pytest --cov=src/skill_seekers --cov-report=html
Why install first?
- Tests import from skill_seekers.cli, which requires the package to be installed
- Modern Python packaging best practice (PEP 517/518)
- CI/CD automatically installs with pip install -e .
- conftest.py will show a helpful error if the package is not installed
Test Coverage:
- 700 tests passing
- 39% code coverage
- All core features tested
- CI/CD tests on Ubuntu + macOS with Python 3.10-3.12
Troubleshooting
No content extracted: Check main_content selector. Common values: article, main, div[role="main"], div.content
Poor categorization: Edit categories section in config with better keywords specific to the documentation structure
Force re-scrape: Delete cached data with rm -rf output/{name}_data/
Rate limiting issues: Increase rate_limit value in config (e.g., from 0.5 to 1.0 seconds)
Output Quality Checks
After building, verify quality:
cat output/godot/SKILL.md # Should have real code examples
cat output/godot/references/index.md # Should show categories
ls output/godot/references/ # Should have category .md files
llms.txt Support
Skill_Seekers automatically detects llms.txt files before HTML scraping:
Detection Order
- {base_url}/llms-full.txt (complete documentation)
- {base_url}/llms.txt (standard version)
- {base_url}/llms-small.txt (quick reference)
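The detection order can be expressed as a short probe loop. This is a sketch, not the tool's implementation: llms_txt_candidates() and pick_llms_txt() are hypothetical names, and the exists callback stands in for whatever HTTP check the real code performs.

```python
def llms_txt_candidates(base_url):
    """Return the llms.txt URLs to probe, in detection order."""
    base = base_url.rstrip("/")
    names = ("llms-full.txt", "llms.txt", "llms-small.txt")
    return [f"{base}/{name}" for name in names]


def pick_llms_txt(base_url, exists):
    """Return the first candidate for which exists(url) is true, else None."""
    for url in llms_txt_candidates(base_url):
        if exists(url):
            return url
    return None  # the caller would fall back to HTML scraping


# Hypothetical site that only publishes the standard and small variants.
available = {
    "https://docs.example.com/llms.txt",
    "https://docs.example.com/llms-small.txt",
}
print(pick_llms_txt("https://docs.example.com/", available.__contains__))
```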
Benefits
- ⚡ 10x faster (< 5 seconds vs 20-60 seconds)
- ✅ More reliable (maintained by docs authors)
- 🎯 Better quality (pre-formatted for LLMs)
- 🚫 No rate limiting needed
Example Sites
If no llms.txt file is found, the tool automatically falls back to HTML scraping.