# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview

This is a Python-based documentation scraper that converts ANY documentation website into a Claude skill. It's a single-file tool (`doc_scraper.py`) that scrapes documentation, extracts code patterns, detects programming languages, and generates structured skill files ready for use with Claude.
## Dependencies

```bash
pip3 install requests beautifulsoup4
```
## Core Commands

### Run with a preset configuration

```bash
python3 doc_scraper.py --config configs/godot.json
python3 doc_scraper.py --config configs/react.json
python3 doc_scraper.py --config configs/vue.json
python3 doc_scraper.py --config configs/django.json
python3 doc_scraper.py --config configs/fastapi.json
```
### Interactive mode (for new frameworks)

```bash
python3 doc_scraper.py --interactive
```
### Quick mode (minimal config)

```bash
python3 doc_scraper.py --name react --url https://react.dev/ --description "React framework"
```
### Skip scraping (use cached data)

```bash
python3 doc_scraper.py --config configs/godot.json --skip-scrape
```
### AI-powered SKILL.md enhancement

```bash
# Option 1: During scraping (API-based, requires ANTHROPIC_API_KEY)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 doc_scraper.py --config configs/react.json --enhance

# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 doc_scraper.py --config configs/react.json --enhance-local

# Option 3: Standalone after scraping (API-based)
python3 enhance_skill.py output/react/

# Option 4: Standalone after scraping (LOCAL, no API key)
python3 enhance_skill_local.py output/react/
```
The LOCAL enhancement option (`--enhance-local` or `enhance_skill_local.py`) opens a new terminal running Claude Code, which analyzes the reference files and enhances `SKILL.md` automatically. This requires a Claude Code Max plan but no API key.
### Test with limited pages (edit config first)

Set `"max_pages": 20` in the config file to test with fewer pages.
## Architecture

### Single-File Design

The entire tool is contained in `doc_scraper.py` (~737 lines). It follows a class-based architecture with a single `DocToSkillConverter` class that handles:
- **Web scraping**: BFS traversal with URL validation
- **Content extraction**: CSS selectors for title, content, and code blocks
- **Language detection**: Heuristic-based detection from code samples (Python, JavaScript, GDScript, C++, etc.)
- **Pattern extraction**: Identifies common coding patterns from documentation
- **Categorization**: Smart categorization using URL structure, page titles, and content keywords with scoring
- **Skill generation**: Creates SKILL.md with real code examples and categorized reference files
### Data Flow

1. **Scrape Phase**
   - Input: Config JSON (name, base_url, selectors, url_patterns, categories, rate_limit, max_pages)
   - Process: BFS traversal starting from base_url, respecting include/exclude patterns
   - Output: `output/{name}_data/pages/*.json` + `summary.json`

2. **Build Phase**
   - Input: Scraped JSON data from `output/{name}_data/`
   - Process: Load pages → Smart categorize → Extract patterns → Generate references
   - Output: `output/{name}/SKILL.md` + `output/{name}/references/*.md`
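The BFS traversal in the scrape phase can be sketched roughly like this. This is a minimal illustration, not the shipped implementation: `fetch_page` and its return shape are hypothetical placeholders standing in for the real request/extraction logic.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_scrape(base_url, include, exclude, max_pages, fetch_page):
    """Illustrative BFS crawl honoring include/exclude URL patterns."""
    visited, results = set(), []
    queue = deque([base_url])
    while queue and len(results) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        # Skip URLs matching an exclude pattern; require an include match if given
        if exclude and any(p in url for p in exclude):
            continue
        if include and not any(p in url for p in include):
            continue
        page = fetch_page(url)  # hypothetical: returns {'url': ..., 'links': [...]}
        results.append(page)
        # Enqueue same-domain links discovered on this page
        for link in page.get('links', []):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == urlparse(base_url).netloc:
                queue.append(absolute)
    return results
```

In the real tool, a `rate_limit` sleep would sit between fetches and each page would be written to `output/{name}_data/pages/`.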
### Directory Structure

```
doc-to-skill/
├── doc_scraper.py           # Main scraping & building tool
├── enhance_skill.py         # AI enhancement (API-based)
├── enhance_skill_local.py   # AI enhancement (LOCAL, no API)
├── configs/                 # Preset configurations
│   ├── godot.json
│   ├── react.json
│   ├── steam-inventory.json
│   └── ...
└── output/
    ├── {name}_data/         # Raw scraped data (cached)
    │   ├── pages/           # Individual page JSONs
    │   └── summary.json     # Scraping summary
    └── {name}/              # Generated skill
        ├── SKILL.md         # Main skill file with examples
        ├── SKILL.md.backup  # Backup (if enhanced)
        ├── references/      # Categorized documentation
        │   ├── index.md
        │   ├── getting_started.md
        │   ├── api.md
        │   └── ...
        ├── scripts/         # Empty (for user scripts)
        └── assets/          # Empty (for user assets)
```
## Configuration Format

Config files in `configs/*.json` contain:

- `name`: Skill identifier (e.g., "godot", "react")
- `description`: When to use this skill
- `base_url`: Starting URL for scraping
- `selectors`: CSS selectors for content extraction
  - `main_content`: Main documentation content (e.g., "article", "div[role='main']")
  - `title`: Page title selector
  - `code_blocks`: Code sample selector (e.g., "pre code", "pre")
- `url_patterns`: URL filtering
  - `include`: Only scrape URLs containing these patterns
  - `exclude`: Skip URLs containing these patterns
- `categories`: Keyword-based categorization mapping
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
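Putting these fields together, a minimal config might look like this (illustrative values, not one of the shipped presets):

```json
{
  "name": "myframework",
  "description": "Use when working with MyFramework",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs/"],
    "exclude": ["/blog/", "/changelog/"]
  },
  "categories": {
    "getting_started": ["install", "tutorial", "quickstart"],
    "api": ["reference", "class", "method"]
  },
  "rate_limit": 0.5,
  "max_pages": 200
}
```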
## Key Features

**Auto-detect existing data**: The tool checks for `output/{name}_data/` and prompts to reuse it, avoiding re-scraping.
**Language detection**: Detects code languages from:
- CSS class attributes (`language-*`, `lang-*`)
- Heuristics (keywords like `def`, `const`, `func`, etc.)
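A toy version of that two-step detection, assuming the class-prefix convention and a few keyword heuristics (the real `detect_language()` in doc_scraper.py may use different rules):

```python
import re

def detect_language(code, css_classes=()):
    """Illustrative detector: CSS classes first, then keyword heuristics."""
    # 1. CSS class attributes like language-python or lang-js
    for cls in css_classes:
        m = re.match(r'(?:language|lang)-(\w+)', cls)
        if m:
            return m.group(1)
    # 2. Keyword heuristics, most specific first
    if re.search(r'\bfunc\s+\w+\(', code) and 'extends' in code:
        return 'gdscript'
    if re.search(r'\bdef\s+\w+\(', code):
        return 'python'
    if re.search(r'\b(?:const|let)\s+\w+\s*=', code):
        return 'javascript'
    if '#include' in code:
        return 'cpp'
    return 'unknown'
```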
**Pattern extraction**: Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts the following code blocks (up to 5 per page).
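The marker-then-code pairing could be sketched like this. The `blocks` input shape (ordered `('text', ...)` / `('code', ...)` tuples) is an assumption for illustration, not the tool's actual data model:

```python
MARKERS = ('example:', 'pattern:', 'usage:')

def extract_patterns(blocks, max_patterns=5):
    """Pair 'Example:'-style text lines with the code block that follows."""
    patterns = []
    for i, (kind, content) in enumerate(blocks):
        if kind != 'text' or not content.strip().lower().startswith(MARKERS):
            continue
        # Take the next code block after the marker, if any
        for next_kind, next_content in blocks[i + 1:]:
            if next_kind == 'code':
                patterns.append((content.strip(), next_content))
                break
        if len(patterns) >= max_patterns:  # cap at 5 per page
            break
    return patterns
```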
**Smart categorization**:
- Scores pages against category keywords (3 points for a URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none are provided
- Falls back to an "other" category
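That scoring scheme can be sketched as follows (a minimal illustration of the 3/2/1 weighting and the 2+ threshold, not the actual `smart_categorize()` code):

```python
def score_category(page, keywords):
    """3 points for a URL keyword match, 2 for title, 1 for content."""
    score = 0
    for kw in keywords:
        if kw in page['url'].lower():
            score += 3
        if kw in page['title'].lower():
            score += 2
        if kw in page.get('content', '').lower():
            score += 1
    return score

def categorize(page, categories, threshold=2):
    """Pick the highest-scoring category, falling back to 'other'."""
    scores = {name: score_category(page, kws) for name, kws in categories.items()}
    best = max(scores, key=scores.get) if scores else None
    return best if best and scores[best] >= threshold else 'other'
```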
**Enhanced SKILL.md**: Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common patterns section
- Category file listings
**AI-Powered Enhancement**: Two scripts to dramatically improve SKILL.md quality:
- `enhance_skill.py`: Uses the Anthropic API (~$0.15-$0.30 per skill, requires an API key)
- `enhance_skill_local.py`: Uses Claude Code Max (free, no API key needed)
- Transforms generic 75-line templates into comprehensive 500+ line guides
- Extracts the best examples, explains key concepts, adds navigation guidance
- Success rate: 9/10 quality (based on the steam-economy test)
## Key Code Locations

- URL validation: `is_valid_url()` (doc_scraper.py:47-62)
- Content extraction: `extract_content()` (doc_scraper.py:64-131)
- Language detection: `detect_language()` (doc_scraper.py:133-163)
- Pattern extraction: `extract_patterns()` (doc_scraper.py:165-181)
- Smart categorization: `smart_categorize()` (doc_scraper.py:280-321)
- Category inference: `infer_categories()` (doc_scraper.py:323-349)
- Quick reference generation: `generate_quick_reference()` (doc_scraper.py:351-370)
- SKILL.md generation: `create_enhanced_skill_md()` (doc_scraper.py:424-540)
- Scraping loop: `scrape_all()` (doc_scraper.py:226-249)
- Main workflow: `main()` (doc_scraper.py:661-733)
## Workflow Examples

### First-time scrape

```bash
# 1. Scrape + Build
python3 doc_scraper.py --config configs/godot.json
# Time: 20-40 minutes

# 2. Package (assuming skill-creator is available)
python3 package_skill.py output/godot/
# Result: godot.zip
```
### Using cached data (fast iteration)

```bash
# 1. Use existing data
python3 doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes

# 2. Package
python3 package_skill.py output/godot/
```
### Creating a new framework config

```bash
# Option 1: Interactive
python3 doc_scraper.py --interactive

# Option 2: Copy and modify
cp configs/react.json configs/myframework.json
# Edit configs/myframework.json
python3 doc_scraper.py --config configs/myframework.json
```
## Testing Selectors

To find the right CSS selectors for a documentation site:

```python
from bs4 import BeautifulSoup
import requests

url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')

# Try different selectors until one returns the main content
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```
## Troubleshooting

- **No content extracted**: Check the `main_content` selector. Common values: `article`, `main`, `div[role="main"]`, `div.content`
- **Poor categorization**: Edit the `categories` section in the config with keywords specific to the documentation structure
- **Force re-scrape**: Delete cached data with `rm -rf output/{name}_data/`
- **Rate limiting issues**: Increase the `rate_limit` value in the config (e.g., from 0.5 to 1.0 seconds)
## Output Quality Checks

After building, verify quality:

```bash
cat output/godot/SKILL.md             # Should have real code examples
cat output/godot/references/index.md  # Should show categories
ls output/godot/references/           # Should have category .md files
```