Init
This commit is contained in:
239
docs/CLAUDE.md
Normal file
@@ -0,0 +1,239 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a Python-based documentation scraper that converts any documentation website into a Claude skill. It is a single-file tool (`doc_scraper.py`) that scrapes documentation, extracts code patterns, detects programming languages, and generates structured skill files ready for use with Claude.

## Dependencies

```bash
pip3 install requests beautifulsoup4
```

## Core Commands

### Run with a preset configuration
```bash
python3 doc_scraper.py --config configs/godot.json
python3 doc_scraper.py --config configs/react.json
python3 doc_scraper.py --config configs/vue.json
python3 doc_scraper.py --config configs/django.json
python3 doc_scraper.py --config configs/fastapi.json
```

### Interactive mode (for new frameworks)
```bash
python3 doc_scraper.py --interactive
```

### Quick mode (minimal config)
```bash
python3 doc_scraper.py --name react --url https://react.dev/ --description "React framework"
```

### Skip scraping (use cached data)
```bash
python3 doc_scraper.py --config configs/godot.json --skip-scrape
```

### AI-powered SKILL.md enhancement
```bash
# Option 1: During scraping (API-based, requires ANTHROPIC_API_KEY)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 doc_scraper.py --config configs/react.json --enhance

# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 doc_scraper.py --config configs/react.json --enhance-local

# Option 3: Standalone after scraping (API-based)
python3 enhance_skill.py output/react/

# Option 4: Standalone after scraping (LOCAL, no API key)
python3 enhance_skill_local.py output/react/
```

The LOCAL enhancement option (`--enhance-local` or `enhance_skill_local.py`) opens a new terminal with Claude Code, which analyzes the reference files and enhances SKILL.md automatically. It requires a Claude Code Max plan but no API key.

### Test with limited pages (edit config first)
Set `"max_pages": 20` in the config file to test with fewer pages.
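
Instead of editing a preset in place, a capped copy can be written next to it. The sketch below assumes only that the config is a JSON object with a `max_pages` key, as described in this file. The helper name `make_test_config` and the `-test` file suffix are hypothetical and not part of `doc_scraper.py`.

```python
import json
from pathlib import Path


def make_test_config(config_path, max_pages=20):
    """Write a copy of a config with a reduced max_pages and return its path."""
    src = Path(config_path)
    config = json.loads(src.read_text(encoding="utf-8"))
    config["max_pages"] = max_pages  # cap the crawl for a quick test run
    # Hypothetical naming scheme: configs/react.json -> configs/react-test.json
    dst = src.with_name(f"{src.stem}-test{src.suffix}")
    dst.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return dst
```

The copy keeps the original `name`, so a run with it would presumably write to the same `output/{name}` directories as the full config.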

## Architecture

### Single-File Design
The entire tool is contained in `doc_scraper.py` (~737 lines). It follows a class-based architecture with a single `DocToSkillConverter` class that handles:
- **Web scraping**: BFS traversal with URL validation
- **Content extraction**: CSS selectors for title, content, code blocks
- **Language detection**: Heuristic-based detection from code samples (Python, JavaScript, GDScript, C++, etc.)
- **Pattern extraction**: Identifies common coding patterns from documentation
- **Categorization**: Smart categorization using URL structure, page titles, and content keywords with scoring
- **Skill generation**: Creates SKILL.md with real code examples and categorized reference files

### Data Flow
1. **Scrape Phase**:
   - Input: Config JSON (name, base_url, selectors, url_patterns, categories, rate_limit, max_pages)
   - Process: BFS traversal starting from base_url, respecting include/exclude patterns
   - Output: `output/{name}_data/pages/*.json` + `summary.json`

2. **Build Phase**:
   - Input: Scraped JSON data from `output/{name}_data/`
   - Process: Load pages → Smart categorize → Extract patterns → Generate references
   - Output: `output/{name}/SKILL.md` + `output/{name}/references/*.md`
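
The scrape phase's breadth-first traversal can be sketched as follows. This is a minimal illustration, not the code in `doc_scraper.py`: the `crawl` function, the injected `get_links` callable, and the rule that a URL must start with `base_url` are all assumptions made for the example. The request delay implied by `rate_limit` is omitted.

```python
from collections import deque


def is_valid_url(url, base_url, include, exclude):
    """Accept URLs under base_url that match an include pattern and no exclude pattern."""
    if not url.startswith(base_url):  # assumption: stay inside the docs site
        return False
    if include and not any(p in url for p in include):
        return False
    return not any(p in url for p in exclude)


def crawl(base_url, get_links, include=(), exclude=(), max_pages=100):
    """Breadth-first traversal; get_links(url) returns the links found on that page."""
    queue = deque([base_url])
    seen = {base_url}  # URLs already queued, so no page is visited twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen and is_valid_url(link, base_url, include, exclude):
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the link extractor in as a function keeps the traversal testable without network access. In a real run it would fetch the page and collect its anchor targets.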

### Directory Structure
```
doc-to-skill/
├── doc_scraper.py            # Main scraping & building tool
├── enhance_skill.py          # AI enhancement (API-based)
├── enhance_skill_local.py    # AI enhancement (LOCAL, no API)
├── configs/                  # Preset configurations
│   ├── godot.json
│   ├── react.json
│   ├── steam-inventory.json
│   └── ...
└── output/
    ├── {name}_data/          # Raw scraped data (cached)
    │   ├── pages/            # Individual page JSONs
    │   └── summary.json      # Scraping summary
    └── {name}/               # Generated skill
        ├── SKILL.md          # Main skill file with examples
        ├── SKILL.md.backup   # Backup (if enhanced)
        ├── references/       # Categorized documentation
        │   ├── index.md
        │   ├── getting_started.md
        │   ├── api.md
        │   └── ...
        ├── scripts/          # Empty (for user scripts)
        └── assets/           # Empty (for user assets)
```

### Configuration Format
Config files in `configs/*.json` contain:
- `name`: Skill identifier (e.g., "godot", "react")
- `description`: When to use this skill
- `base_url`: Starting URL for scraping
- `selectors`: CSS selectors for content extraction
  - `main_content`: Main documentation content (e.g., "article", "div[role='main']")
  - `title`: Page title selector
  - `code_blocks`: Code sample selector (e.g., "pre code", "pre")
- `url_patterns`: URL filtering
  - `include`: Only scrape URLs containing these patterns
  - `exclude`: Skip URLs containing these patterns
- `categories`: Keyword-based categorization mapping
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
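
To make the shape of a config concrete, here is a sketch that builds one as a Python dict and prints it as JSON. The field names come from the list above. Every value is illustrative, the category-to-keywords layout of `categories` is an assumption, and the set of required keys is a guess rather than something `doc_scraper.py` is known to enforce.

```python
import json

# Assumption: treat these three top-level keys as mandatory for illustration.
REQUIRED_KEYS = {"name", "description", "base_url"}

example_config = {
    "name": "myframework",
    "description": "Use when working with MyFramework",
    "base_url": "https://docs.example.com/",
    "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code",
    },
    "url_patterns": {"include": ["/docs/"], "exclude": ["/blog/"]},
    # Assumed layout: category name -> list of keywords
    "categories": {
        "getting_started": ["intro", "install"],
        "api": ["api", "reference"],
    },
    "rate_limit": 0.5,
    "max_pages": 100,
}


def missing_keys(config):
    """Return the required top-level keys that are absent, sorted by name."""
    return sorted(REQUIRED_KEYS - set(config))


print(json.dumps(example_config, indent=2))
```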

### Key Features

**Auto-detect existing data**: Tool checks for `output/{name}_data/` and prompts to reuse, avoiding re-scraping.

**Language detection**: Detects code languages from:
1. CSS class attributes (`language-*`, `lang-*`)
2. Heuristics (keywords like `def`, `const`, `func`, etc.)
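
The two-step detection above can be sketched like this. The function name `guess_language`, the keyword table, and the `"unknown"` fallback are hypothetical. The actual rules in `detect_language()` may differ in both order and coverage.

```python
import re

# Hypothetical keyword heuristics, checked in order; the real tool's rules may differ.
HEURISTICS = [
    ("python", re.compile(r"^\s*(def |import |from \w+ import )", re.M)),
    ("gdscript", re.compile(r"^\s*(func |extends )", re.M)),
    ("javascript", re.compile(r"^\s*(const |let |function )", re.M)),
    ("cpp", re.compile(r"^\s*#include\b", re.M)),
]


def guess_language(css_classes, code):
    """Prefer an explicit language-*/lang-* class, then fall back to keywords."""
    for cls in css_classes:
        for prefix in ("language-", "lang-"):
            if cls.startswith(prefix):
                return cls[len(prefix):]
    for name, pattern in HEURISTICS:
        if pattern.search(code):
            return name
    return "unknown"
```

Checking the class attribute first means an explicit annotation from the documentation site always wins over a keyword guess.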

**Pattern extraction**: Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts following code blocks (up to 5 per page).
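
A rough sketch of that marker-based extraction follows. It assumes a page has already been reduced to an ordered list of `("text", ...)` and `("code", ...)` pairs, and that a code block only counts when it directly follows a marker paragraph. Both are simplifications for illustration, and `find_patterns` is a hypothetical name rather than the tool's `extract_patterns()`.

```python
MARKERS = ("Example:", "Pattern:", "Usage:")
MAX_PATTERNS = 5  # per-page cap described above


def find_patterns(blocks):
    """Pair each marker paragraph with the code block that directly follows it."""
    patterns = []
    pending = None  # marker text waiting for its code block
    for kind, content in blocks:
        if kind == "text":
            # A new paragraph replaces any marker that had no code after it.
            pending = content if any(m in content for m in MARKERS) else None
        elif kind == "code" and pending is not None:
            patterns.append({"description": pending, "code": content})
            pending = None
            if len(patterns) == MAX_PATTERNS:
                break
    return patterns
```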

**Smart categorization**:
- Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
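
The scoring rule above can be sketched as below. The weights, the threshold, and the "other" fallback come from the list. Summing points per keyword, picking the highest-scoring category, and the page dict keys (`url`, `title`, `content`) are assumptions, and `categorize` is a hypothetical stand-in for `smart_categorize()`. Category inference from URL segments is not shown.

```python
def categorize(page, categories, threshold=2):
    """Return the best-scoring category for a page, or "other" below the threshold."""
    url = page["url"].lower()
    title = page["title"].lower()
    content = page["content"].lower()
    best_name, best_score = "other", 0
    for name, keywords in categories.items():
        score = 0
        for kw in keywords:
            kw = kw.lower()
            # 3 points for a URL match, 2 for the title, 1 for the body text
            score += 3 * (kw in url) + 2 * (kw in title) + 1 * (kw in content)
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name
```

With these weights a single content-only match (1 point) never reaches the threshold, so a passing mention of a keyword is not enough to categorize a page.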

**Enhanced SKILL.md**: Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common pattern section
- Category file listings

**AI-Powered Enhancement**: Two scripts to dramatically improve SKILL.md quality:
- `enhance_skill.py`: Uses Anthropic API (~$0.15-$0.30 per skill, requires API key)
- `enhance_skill_local.py`: Uses Claude Code Max (free, no API key needed)
- Transforms generic 75-line templates into comprehensive 500+ line guides
- Extracts best examples, explains key concepts, adds navigation guidance
- Quality: rated 9/10 in the steam-economy test

## Key Code Locations

- **URL validation**: `is_valid_url()` doc_scraper.py:47-62
- **Content extraction**: `extract_content()` doc_scraper.py:64-131
- **Language detection**: `detect_language()` doc_scraper.py:133-163
- **Pattern extraction**: `extract_patterns()` doc_scraper.py:165-181
- **Smart categorization**: `smart_categorize()` doc_scraper.py:280-321
- **Category inference**: `infer_categories()` doc_scraper.py:323-349
- **Quick reference generation**: `generate_quick_reference()` doc_scraper.py:351-370
- **SKILL.md generation**: `create_enhanced_skill_md()` doc_scraper.py:424-540
- **Scraping loop**: `scrape_all()` doc_scraper.py:226-249
- **Main workflow**: `main()` doc_scraper.py:661-733

## Workflow Examples

### First run (with scraping)
```bash
# 1. Scrape + Build
python3 doc_scraper.py --config configs/godot.json
# Time: 20-40 minutes

# 2. Package (assuming skill-creator is available)
python3 package_skill.py output/godot/

# Result: godot.zip
```

### Using cached data (fast iteration)
```bash
# 1. Use existing data
python3 doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes

# 2. Package
python3 package_skill.py output/godot/
```

### Creating a new framework config
```bash
# Option 1: Interactive
python3 doc_scraper.py --interactive

# Option 2: Copy and modify
cp configs/react.json configs/myframework.json
# Edit configs/myframework.json
python3 doc_scraper.py --config configs/myframework.json
```

## Testing Selectors

To find the right CSS selectors for a documentation site:

```python
from bs4 import BeautifulSoup
import requests

url = "https://docs.example.com/page"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(response.content, 'html.parser')

# Try different selectors; each prints None if nothing matches
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```

## Troubleshooting

**No content extracted**: Check the `main_content` selector. Common values: `article`, `main`, `div[role="main"]`, `div.content`

**Poor categorization**: Edit the `categories` section in the config with better keywords specific to the documentation structure

**Force re-scrape**: Delete cached data with `rm -rf output/{name}_data/` (replace `{name}` with the skill name from the config)

**Rate limiting issues**: Increase the `rate_limit` value in the config (e.g., from 0.5 to 1.0 seconds)

## Output Quality Checks

After building, verify quality:
```bash
cat output/godot/SKILL.md  # Should have real code examples
cat output/godot/references/index.md  # Should show categories
ls output/godot/references/  # Should have category .md files
```
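
The same three checks could be automated with a short script. This is a sketch under assumptions: it treats any fenced block in SKILL.md as evidence of "real code examples", and `check_skill` is a hypothetical helper, not something shipped with the tool.

```python
from pathlib import Path

FENCE = "`" * 3  # Markdown code-fence marker


def check_skill(skill_dir):
    """Return a list of problems found in a generated skill directory."""
    root = Path(skill_dir)
    problems = []
    skill_md = root / "SKILL.md"
    if not skill_md.is_file():
        problems.append("SKILL.md is missing")
    elif FENCE not in skill_md.read_text(encoding="utf-8"):
        problems.append("SKILL.md has no fenced code examples")
    refs = root / "references"
    if not (refs / "index.md").is_file():
        problems.append("references/index.md is missing")
    # Category files are every .md file in references/ other than the index.
    category_files = (
        [p for p in refs.glob("*.md") if p.name != "index.md"] if refs.is_dir() else []
    )
    if not category_files:
        problems.append("references/ has no category files")
    return problems
```

An empty list means all three checks passed, for example after calling `check_skill("output/godot")`.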