# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview

This is a Python-based documentation scraper that converts ANY documentation website into a Claude skill. It's a single-file tool (`doc_scraper.py`) that scrapes documentation, extracts code patterns, detects programming languages, and generates structured skill files ready for use with Claude.
## Dependencies

```bash
pip3 install requests beautifulsoup4
```
## Core Commands

### Run with a preset configuration

```bash
python3 doc_scraper.py --config configs/godot.json
python3 doc_scraper.py --config configs/react.json
python3 doc_scraper.py --config configs/vue.json
python3 doc_scraper.py --config configs/django.json
python3 doc_scraper.py --config configs/fastapi.json
```
### Interactive mode (for new frameworks)

```bash
python3 doc_scraper.py --interactive
```
### Quick mode (minimal config)

```bash
python3 doc_scraper.py --name react --url https://react.dev/ --description "React framework"
```
### Skip scraping (use cached data)

```bash
python3 doc_scraper.py --config configs/godot.json --skip-scrape
```
### AI-powered SKILL.md enhancement

```bash
# Option 1: During scraping (API-based, requires ANTHROPIC_API_KEY)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 doc_scraper.py --config configs/react.json --enhance

# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 doc_scraper.py --config configs/react.json --enhance-local

# Option 3: Standalone after scraping (API-based)
python3 enhance_skill.py output/react/

# Option 4: Standalone after scraping (LOCAL, no API key)
python3 enhance_skill_local.py output/react/
```
The LOCAL enhancement option (`--enhance-local` or `enhance_skill_local.py`) opens a new terminal running Claude Code, which analyzes the reference files and enhances `SKILL.md` automatically. This requires a Claude Code Max plan but no API key.
### Test with limited pages (edit config first)

Set `"max_pages": 20` in the config file to test with fewer pages.
## Architecture

### Single-File Design

The entire tool is contained in `doc_scraper.py` (~737 lines). It follows a class-based architecture with a single `DocToSkillConverter` class that handles:
- **Web scraping**: BFS traversal with URL validation
- **Content extraction**: CSS selectors for title, content, and code blocks
- **Language detection**: Heuristic-based detection from code samples (Python, JavaScript, GDScript, C++, etc.)
- **Pattern extraction**: Identifies common coding patterns from documentation
- **Categorization**: Smart categorization using URL structure, page titles, and content keywords with scoring
- **Skill generation**: Creates SKILL.md with real code examples and categorized reference files
### Data Flow

1. **Scrape Phase**
   - Input: Config JSON (name, base_url, selectors, url_patterns, categories, rate_limit, max_pages)
   - Process: BFS traversal starting from base_url, respecting include/exclude patterns
   - Output: `output/{name}_data/pages/*.json` + `summary.json`

2. **Build Phase**
   - Input: Scraped JSON data from `output/{name}_data/`
   - Process: Load pages → Smart categorize → Extract patterns → Generate references
   - Output: `output/{name}/SKILL.md` + `output/{name}/references/*.md`
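The BFS traversal in the scrape phase can be sketched roughly like this. This is a minimal illustration, not the shipped implementation: `fetch_page` and its return shape are hypothetical placeholders standing in for the real request/extraction logic.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_scrape(base_url, include, exclude, max_pages, fetch_page):
    """Illustrative BFS crawl honoring include/exclude URL patterns."""
    visited, results = set(), []
    queue = deque([base_url])
    while queue and len(results) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        # Skip URLs matching an exclude pattern; require an include match if given
        if exclude and any(p in url for p in exclude):
            continue
        if include and not any(p in url for p in include):
            continue
        page = fetch_page(url)  # hypothetical: returns {'url': ..., 'links': [...]}
        results.append(page)
        # Enqueue same-domain links discovered on this page
        for link in page.get('links', []):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == urlparse(base_url).netloc:
                queue.append(absolute)
    return results
```

In the real tool, a `rate_limit` sleep would sit between fetches and each page would be written to `output/{name}_data/pages/`.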
### Directory Structure

```
doc-to-skill/
├── doc_scraper.py           # Main scraping & building tool
├── enhance_skill.py         # AI enhancement (API-based)
├── enhance_skill_local.py   # AI enhancement (LOCAL, no API)
├── configs/                 # Preset configurations
│   ├── godot.json
│   ├── react.json
│   ├── steam-inventory.json
│   └── ...
└── output/
    ├── {name}_data/         # Raw scraped data (cached)
    │   ├── pages/           # Individual page JSONs
    │   └── summary.json     # Scraping summary
    └── {name}/              # Generated skill
        ├── SKILL.md         # Main skill file with examples
        ├── SKILL.md.backup  # Backup (if enhanced)
        ├── references/      # Categorized documentation
        │   ├── index.md
        │   ├── getting_started.md
        │   ├── api.md
        │   └── ...
        ├── scripts/         # Empty (for user scripts)
        └── assets/          # Empty (for user assets)
```
## Configuration Format

Config files in `configs/*.json` contain:

- `name`: Skill identifier (e.g., "godot", "react")
- `description`: When to use this skill
- `base_url`: Starting URL for scraping
- `selectors`: CSS selectors for content extraction
  - `main_content`: Main documentation content (e.g., "article", "div[role='main']")
  - `title`: Page title selector
  - `code_blocks`: Code sample selector (e.g., "pre code", "pre")
- `url_patterns`: URL filtering
  - `include`: Only scrape URLs containing these patterns
  - `exclude`: Skip URLs containing these patterns
- `categories`: Keyword-based categorization mapping
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
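Putting these fields together, a minimal config might look like this (illustrative values, not one of the shipped presets):

```json
{
  "name": "myframework",
  "description": "Use when working with MyFramework",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs/"],
    "exclude": ["/blog/", "/changelog/"]
  },
  "categories": {
    "getting_started": ["install", "tutorial", "quickstart"],
    "api": ["reference", "class", "method"]
  },
  "rate_limit": 0.5,
  "max_pages": 200
}
```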
## Key Features

**Auto-detect existing data**: The tool checks for `output/{name}_data/` and prompts to reuse it, avoiding re-scraping.
**Language detection**: Detects code languages from:
- CSS class attributes (`language-*`, `lang-*`)
- Heuristics (keywords like `def`, `const`, `func`, etc.)
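A toy version of that two-step detection, assuming the class-prefix convention and a few keyword heuristics (the real `detect_language()` in doc_scraper.py may use different rules):

```python
import re

def detect_language(code, css_classes=()):
    """Illustrative detector: CSS classes first, then keyword heuristics."""
    # 1. CSS class attributes like language-python or lang-js
    for cls in css_classes:
        m = re.match(r'(?:language|lang)-(\w+)', cls)
        if m:
            return m.group(1)
    # 2. Keyword heuristics, most specific first
    if re.search(r'\bfunc\s+\w+\(', code) and 'extends' in code:
        return 'gdscript'
    if re.search(r'\bdef\s+\w+\(', code):
        return 'python'
    if re.search(r'\b(?:const|let)\s+\w+\s*=', code):
        return 'javascript'
    if '#include' in code:
        return 'cpp'
    return 'unknown'
```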
**Pattern extraction**: Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts the following code blocks (up to 5 per page).
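The marker-then-code pairing could be sketched like this. The `blocks` input shape (ordered `('text', ...)` / `('code', ...)` tuples) is an assumption for illustration, not the tool's actual data model:

```python
MARKERS = ('example:', 'pattern:', 'usage:')

def extract_patterns(blocks, max_patterns=5):
    """Pair 'Example:'-style text lines with the code block that follows."""
    patterns = []
    for i, (kind, content) in enumerate(blocks):
        if kind != 'text' or not content.strip().lower().startswith(MARKERS):
            continue
        # Take the next code block after the marker, if any
        for next_kind, next_content in blocks[i + 1:]:
            if next_kind == 'code':
                patterns.append((content.strip(), next_content))
                break
        if len(patterns) >= max_patterns:  # cap at 5 per page
            break
    return patterns
```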
**Smart categorization**:
- Scores pages against category keywords (3 points for a URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none are provided
- Falls back to an "other" category
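That scoring scheme can be sketched as follows (a minimal illustration of the 3/2/1 weighting and the 2+ threshold, not the actual `smart_categorize()` code):

```python
def score_category(page, keywords):
    """3 points for a URL keyword match, 2 for title, 1 for content."""
    score = 0
    for kw in keywords:
        if kw in page['url'].lower():
            score += 3
        if kw in page['title'].lower():
            score += 2
        if kw in page.get('content', '').lower():
            score += 1
    return score

def categorize(page, categories, threshold=2):
    """Pick the highest-scoring category, falling back to 'other'."""
    scores = {name: score_category(page, kws) for name, kws in categories.items()}
    best = max(scores, key=scores.get) if scores else None
    return best if best and scores[best] >= threshold else 'other'
```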
**Enhanced SKILL.md**: Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common patterns section
- Category file listings
**AI-Powered Enhancement**: Two scripts to dramatically improve SKILL.md quality:
- `enhance_skill.py`: Uses the Anthropic API (~$0.15-$0.30 per skill, requires an API key)
- `enhance_skill_local.py`: Uses Claude Code Max (free, no API key needed)
- Transforms generic 75-line templates into comprehensive 500+ line guides
- Extracts the best examples, explains key concepts, adds navigation guidance
- Success rate: 9/10 quality (based on the steam-economy test)
## Key Code Locations

- URL validation: `is_valid_url()` (doc_scraper.py:47-62)
- Content extraction: `extract_content()` (doc_scraper.py:64-131)
- Language detection: `detect_language()` (doc_scraper.py:133-163)
- Pattern extraction: `extract_patterns()` (doc_scraper.py:165-181)
- Smart categorization: `smart_categorize()` (doc_scraper.py:280-321)
- Category inference: `infer_categories()` (doc_scraper.py:323-349)
- Quick reference generation: `generate_quick_reference()` (doc_scraper.py:351-370)
- SKILL.md generation: `create_enhanced_skill_md()` (doc_scraper.py:424-540)
- Scraping loop: `scrape_all()` (doc_scraper.py:226-249)
- Main workflow: `main()` (doc_scraper.py:661-733)
## Workflow Examples

### First-time scrape

```bash
# 1. Scrape + Build
python3 doc_scraper.py --config configs/godot.json
# Time: 20-40 minutes

# 2. Package (assuming skill-creator is available)
python3 package_skill.py output/godot/
# Result: godot.zip
```
### Using cached data (fast iteration)

```bash
# 1. Use existing data
python3 doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes

# 2. Package
python3 package_skill.py output/godot/
```
### Creating a new framework config

```bash
# Option 1: Interactive
python3 doc_scraper.py --interactive

# Option 2: Copy and modify
cp configs/react.json configs/myframework.json
# Edit configs/myframework.json
python3 doc_scraper.py --config configs/myframework.json
```
## Testing Selectors

To find the right CSS selectors for a documentation site:

```python
from bs4 import BeautifulSoup
import requests

url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')

# Try different selectors until one returns the main content
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```
## Troubleshooting

- **No content extracted**: Check the `main_content` selector. Common values: `article`, `main`, `div[role="main"]`, `div.content`
- **Poor categorization**: Edit the `categories` section in the config with keywords specific to the documentation structure
- **Force re-scrape**: Delete cached data with `rm -rf output/{name}_data/`
- **Rate limiting issues**: Increase the `rate_limit` value in the config (e.g., from 0.5 to 1.0 seconds)
## Output Quality Checks

After building, verify quality:

```bash
cat output/godot/SKILL.md             # Should have real code examples
cat output/godot/references/index.md  # Should show categories
ls output/godot/references/           # Should have category .md files
```