CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
Skill Seeker automatically converts any documentation website into a Claude AI skill. It scrapes documentation, organizes content, extracts code patterns, and packages everything into an uploadable .zip file for Claude.
Prerequisites
Python Version: Python 3.7 or higher
Required Dependencies:
pip3 install requests beautifulsoup4
Optional (for API-based enhancement):
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
Core Commands
Quick Start - Use a Preset
# Scrape and build with a preset configuration
python3 cli/doc_scraper.py --config configs/godot.json
python3 cli/doc_scraper.py --config configs/react.json
python3 cli/doc_scraper.py --config configs/vue.json
python3 cli/doc_scraper.py --config configs/django.json
python3 cli/doc_scraper.py --config configs/fastapi.json
First-Time User Workflow (Recommended)
# 1. Install dependencies (one-time)
pip3 install requests beautifulsoup4
# 2. Estimate page count BEFORE scraping (fast, no data download)
python3 cli/estimate_pages.py configs/godot.json
# Time: ~1-2 minutes, shows estimated total pages and recommended max_pages
# 3. Scrape with local enhancement (uses Claude Code Max, no API key)
python3 cli/doc_scraper.py --config configs/godot.json --enhance-local
# Time: 20-40 minutes scraping + 60 seconds enhancement
# 4. Package the skill
python3 cli/package_skill.py output/godot/
# Result: godot.zip ready to upload to Claude
Interactive Mode
# Step-by-step configuration wizard
python3 cli/doc_scraper.py --interactive
Quick Mode (Minimal Config)
# Create skill from any documentation URL
python3 cli/doc_scraper.py --name react --url https://react.dev/ --description "React framework for UIs"
Skip Scraping (Use Cached Data)
# Fast rebuild using previously scraped data
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes (no re-scraping, rebuild only)
Enhancement Options
LOCAL Enhancement (Recommended - No API Key Required):
# During scraping
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# Standalone after scraping
python3 cli/enhance_skill_local.py output/react/
API Enhancement (Alternative - Requires API Key):
# During scraping
python3 cli/doc_scraper.py --config configs/react.json --enhance
# Standalone after scraping
python3 cli/enhance_skill.py output/react/
python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...
Package the Skill
# Package skill directory into .zip file
python3 cli/package_skill.py output/godot/
# Result: output/godot.zip
Force Re-scrape
# Delete cached data and re-scrape from scratch
rm -rf output/godot_data/
python3 cli/doc_scraper.py --config configs/godot.json
Estimate Page Count (Before Scraping)
# Quick estimation - discover up to 100 pages
python3 cli/estimate_pages.py configs/react.json --max-discovery 100
# Time: ~30-60 seconds
# Full estimation - discover up to 1000 pages (default)
python3 cli/estimate_pages.py configs/godot.json
# Time: ~1-2 minutes
# Deep estimation - discover up to 2000 pages
python3 cli/estimate_pages.py configs/vue.json --max-discovery 2000
# Time: ~3-5 minutes
# What it shows:
# - Estimated total pages
# - Recommended max_pages value
# - Estimated scraping time
# - Discovery rate (pages/sec)
Why use estimation:
- Validates config URL patterns before full scrape
- Helps set an optimal max_pages value
- Estimates total scraping time
- Fast (only HEAD requests + minimal parsing)
- No data downloaded or stored
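The estimation idea can be sketched as a link-discovery crawl that counts pages without storing them. This is a simplified, stdlib-only illustration (the function name, `fetch` parameter, and defaults are assumptions; the real `cli/estimate_pages.py` also applies include/exclude patterns and reports timing):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class _LinkParser(HTMLParser):
    """Collect href values from <a> tags; minimal parsing, no content kept."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def estimate_pages(base_url, max_discovery=1000, fetch=None):
    """BFS from base_url, counting in-scope URLs without downloading data to disk."""
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=10).read().decode("utf-8", "replace")
    seen = {base_url}
    queue = deque([base_url])
    while queue and len(seen) < max_discovery:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue
        parser = _LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(base_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return len(seen)
```

The `fetch` hook exists only so the traversal can be exercised without network access; in normal use it defaults to a plain HTTP GET.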
Repository Architecture
File Structure
Skill_Seekers/
├── cli/doc_scraper.py # Main tool (single-file, ~790 lines)
├── cli/estimate_pages.py # Page count estimator (fast, no data)
├── cli/enhance_skill.py # AI enhancement (API-based)
├── cli/enhance_skill_local.py # AI enhancement (LOCAL, no API)
├── cli/package_skill.py # Skill packager
├── cli/run_tests.py # Test runner (71 tests)
├── configs/ # Preset configurations
│ ├── godot.json
│ ├── react.json
│ ├── vue.json
│ ├── django.json
│ ├── fastapi.json
│ └── steam-economy-complete.json
├── docs/ # Documentation
│ ├── CLAUDE.md # Detailed technical architecture
│ ├── ENHANCEMENT.md # Enhancement guide
│ └── UPLOAD_GUIDE.md # How to upload skills
└── output/ # Generated output (git-ignored)
├── {name}_data/ # Scraped raw data (cached)
│ ├── pages/*.json # Individual page data
│ └── summary.json # Scraping summary
└── {name}/ # Built skill directory
├── SKILL.md # Main skill file
├── SKILL.md.backup # Backup (if enhanced)
├── references/ # Categorized documentation
│ ├── index.md
│ ├── getting_started.md
│ ├── api.md
│ └── ...
├── scripts/ # Empty (user scripts)
└── assets/ # Empty (user assets)
Data Flow
- Scrape Phase (scrape_all() in doc_scraper.py:228-251):
  - Input: Config JSON (name, base_url, selectors, url_patterns, categories)
  - Process: BFS traversal from base_url, respecting include/exclude patterns
  - Output: output/{name}_data/pages/*.json + summary.json
- Build Phase (build_skill() in doc_scraper.py:561-601):
  - Input: Scraped JSON data from output/{name}_data/
  - Process: Load pages → Smart categorize → Extract patterns → Generate references
  - Output: output/{name}/SKILL.md + output/{name}/references/*.md
- Enhancement Phase (optional):
  - Input: Built skill directory with references
  - Process: Claude analyzes references and rewrites SKILL.md
  - Output: Enhanced SKILL.md with real examples and guidance
- Package Phase:
  - Input: Skill directory
  - Process: Zip all files (excluding .backup)
  - Output: {name}.zip
Configuration File Structure
Config files (configs/*.json) define scraping behavior:
{
"name": "godot",
"description": "When to use this skill",
"base_url": "https://docs.godotengine.org/en/stable/",
"selectors": {
"main_content": "div[role='main']",
"title": "title",
"code_blocks": "pre"
},
"url_patterns": {
"include": [],
"exclude": ["/search.html", "/_static/"]
},
"categories": {
"getting_started": ["introduction", "getting_started"],
"scripting": ["scripting", "gdscript"],
"api": ["api", "reference", "class"]
},
"rate_limit": 0.5,
"max_pages": 500
}
Config Parameters:
- name: Skill identifier (output directory name)
- description: When Claude should use this skill
- base_url: Starting URL for scraping
- selectors.main_content: CSS selector for main content (common: article, main, div[role="main"])
- selectors.title: CSS selector for page title
- selectors.code_blocks: CSS selector for code samples
- url_patterns.include: Only scrape URLs containing these patterns
- url_patterns.exclude: Skip URLs containing these patterns
- categories: Keyword mapping for categorization
- rate_limit: Delay between requests (seconds)
- max_pages: Maximum pages to scrape
Key Features & Implementation
Auto-Detect Existing Data
Tool checks for output/{name}_data/ and prompts to reuse, avoiding re-scraping (check_existing_data() in doc_scraper.py:653-660).
Language Detection
Detects code languages from:
- CSS class attributes (language-*, lang-*)
- Heuristics (keywords like def, const, func, etc.)
See: detect_language() in doc_scraper.py:135-165
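The two-step approach above can be sketched as follows. The signature and the keyword subset are illustrative assumptions; the real detect_language() covers more languages:

```python
import re

def detect_language(css_classes, code_text):
    """Guess a code block's language: explicit class hints first, then keywords."""
    # Step 1: explicit hints like class="language-python" or "lang-js"
    for cls in css_classes:
        match = re.match(r"(?:language|lang)-(\w+)", cls)
        if match:
            return match.group(1)
    # Step 2: keyword heuristics (illustrative subset only)
    if re.search(r"\bdef \w+\(", code_text):
        return "python"
    if re.search(r"\bfunc \w+\(", code_text):
        return "gdscript"
    if "const " in code_text or "=>" in code_text:
        return "javascript"
    return "text"
```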
Pattern Extraction
Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts following code blocks (up to 5 per page).
See: extract_patterns() in doc_scraper.py:167-183
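The marker-then-code pairing can be sketched like this. The `segments` representation (an ordered list of ("text", str) / ("code", str) tuples) is an assumption made for illustration; the real extract_patterns() works on the parsed page structure:

```python
MARKERS = ("Example:", "Pattern:", "Usage:")

def extract_patterns(segments, limit=5):
    """Pair marker lines with the code block that immediately follows them."""
    patterns = []
    pending_marker = None
    for kind, value in segments:
        if kind == "text":
            stripped = value.strip()
            if any(stripped.startswith(m) or stripped.endswith(m) for m in MARKERS):
                pending_marker = stripped
            else:
                # Non-marker text breaks the marker/code adjacency
                pending_marker = None
        elif kind == "code" and pending_marker is not None:
            patterns.append({"marker": pending_marker, "code": value})
            pending_marker = None
            if len(patterns) >= limit:  # cap at 5 per page, as documented
                break
    return patterns
```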
Smart Categorization
- Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
See: smart_categorize() and infer_categories() in doc_scraper.py:282-351
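The scoring scheme above (URL match = 3, title = 2, content = 1, threshold 2+, fallback "other") can be sketched as a standalone function. The page dict shape is an assumption; the real smart_categorize() also handles category inference:

```python
def smart_categorize(page, categories, threshold=2):
    """Score a page against each category's keywords and pick the best match."""
    url = page["url"].lower()
    title = page.get("title", "").lower()
    content = page.get("content", "").lower()
    best_category, best_score = "other", 0
    for category, keywords in categories.items():
        score = 0
        for kw in keywords:
            if kw in url:
                score += 3  # URL match is the strongest signal
            if kw in title:
                score += 2
            if kw in content:
                score += 1
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= threshold else "other"
```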
Enhanced SKILL.md Generation
Generated with:
- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common pattern section
- Category file listings
See: create_enhanced_skill_md() in doc_scraper.py:426-542
Common Workflows
First Time (With Scraping + Enhancement)
# 1. Scrape + Build + AI Enhancement (LOCAL, no API key)
python3 cli/doc_scraper.py --config configs/godot.json --enhance-local
# 2. Wait for enhancement terminal to close (~60 seconds)
# 3. Verify quality
cat output/godot/SKILL.md
# 4. Package
python3 cli/package_skill.py output/godot/
# Result: godot.zip ready for Claude
# Time: 20-40 minutes (scraping) + 60 seconds (enhancement)
Using Cached Data (Fast Iteration)
# 1. Use existing data + Local Enhancement
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
python3 cli/enhance_skill_local.py output/godot/
# 2. Package
python3 cli/package_skill.py output/godot/
# Time: 1-3 minutes (build) + 60 seconds (enhancement)
Without Enhancement (Basic)
# 1. Scrape + Build (no enhancement)
python3 cli/doc_scraper.py --config configs/godot.json
# 2. Package
python3 cli/package_skill.py output/godot/
# Note: SKILL.md will be basic template - enhancement recommended
# Time: 20-40 minutes
Creating a New Framework Config
Option 1: Interactive
python3 cli/doc_scraper.py --interactive
# Follow prompts, it creates the config for you
Option 2: Copy and Modify
# Copy a preset
cp configs/react.json configs/myframework.json
# Edit it
nano configs/myframework.json
# Test with limited pages first
# Set "max_pages": 20 in config
# Use it
python3 cli/doc_scraper.py --config configs/myframework.json
Testing & Verification
Finding the Right CSS Selectors
Before creating a config, test selectors with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
print(soup.select_one('div.content'))
# Test code block selector
print(soup.select('pre code'))
print(soup.select('pre'))
Verify Output Quality
After building, verify the skill quality:
# Check SKILL.md has real examples
cat output/godot/SKILL.md
# Check category structure
cat output/godot/references/index.md
# List all reference files
ls output/godot/references/
# Check specific category content
cat output/godot/references/getting_started.md
# Verify code samples have language detection
grep -A 3 "```" output/godot/references/*.md | head -20
Test with Limited Pages
For faster testing, edit the config to limit pages (note that JSON does not allow // comments, so keep the file comment-free):
{
  "max_pages": 20
}
Troubleshooting
No Content Extracted
Problem: Pages scraped but content is empty
Solution: Check main_content selector in config. Try:
- article
- main
- div[role="main"]
- div.content
Use the BeautifulSoup testing approach above to find the right selector.
Poor Categorization
Problem: Pages not categorized well
Solution: Edit categories section in config with better keywords specific to the documentation structure. Check URL patterns in scraped data:
# See what URLs were scraped
cat output/godot_data/summary.json | grep url | head -20
Data Exists But Won't Use It
Problem: Tool won't reuse existing data
Solution: Force re-scrape:
rm -rf output/myframework_data/
python3 cli/doc_scraper.py --config configs/myframework.json
Rate Limiting Issues
Problem: Getting rate limited or blocked by documentation server
Solution: Increase the rate_limit value in the config, e.g. from 0.5 to 1.0 seconds:
{
  "rate_limit": 1.0
}
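Conceptually, rate_limit is just a minimum delay enforced between consecutive requests. A minimal sketch of that idea (not the scraper's actual implementation):

```python
import time

class RateLimiter:
    """Enforce a minimum delay (in seconds) between consecutive requests."""
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the delay window, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()
```

Usage: create `RateLimiter(1.0)` once and call `wait()` before each page fetch.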
Package Path Error
Problem: doc_scraper.py shows wrong cli/package_skill.py path
Expected output:
python3 cli/package_skill.py output/godot/
Not:
python3 /mnt/skills/examples/skill-creator/scripts/cli/package_skill.py output/godot/
The correct command uses the local cli/package_skill.py in the repository root.
Key Code Locations
- URL validation: is_valid_url() in doc_scraper.py:49-64
- Content extraction: extract_content() in doc_scraper.py:66-133
- Language detection: detect_language() in doc_scraper.py:135-165
- Pattern extraction: extract_patterns() in doc_scraper.py:167-183
- Smart categorization: smart_categorize() in doc_scraper.py:282-323
- Category inference: infer_categories() in doc_scraper.py:325-351
- Quick reference generation: generate_quick_reference() in doc_scraper.py:353-372
- SKILL.md generation: create_enhanced_skill_md() in doc_scraper.py:426-542
- Scraping loop: scrape_all() in doc_scraper.py:228-251
- Main workflow: main() in doc_scraper.py:663-789
Enhancement Details
LOCAL Enhancement (Recommended)
- Uses your Claude Code Max plan (no API costs)
- Opens new terminal with Claude Code
- Analyzes reference files automatically
- Takes 30-60 seconds
- Quality: 9/10 (comparable to API version)
- Backs up original SKILL.md to SKILL.md.backup
API Enhancement (Alternative)
- Uses Anthropic API (~$0.15-$0.30 per skill)
- Requires ANTHROPIC_API_KEY
- Same quality as LOCAL
- Faster (no terminal launch)
- Better for automation/CI
What Enhancement Does:
- Reads reference documentation files
- Analyzes content with Claude
- Extracts 5-10 best code examples
- Creates comprehensive quick reference
- Adds domain-specific key concepts
- Provides navigation guidance for different skill levels
- Transforms 75-line templates into 500+ line comprehensive guides
Performance
| Task | Time | Notes |
|---|---|---|
| Scraping | 15-45 min | First time only |
| Building | 1-3 min | Fast! |
| Re-building | <1 min | With --skip-scrape |
| Enhancement (LOCAL) | 30-60 sec | Uses Claude Code Max |
| Enhancement (API) | 20-40 sec | Requires API key |
| Packaging | 5-10 sec | Final zip |
Additional Documentation
- README.md - Complete user documentation
- QUICKSTART.md - Get started in 3 steps
- docs/CLAUDE.md - Detailed technical architecture
- docs/ENHANCEMENT.md - AI enhancement guide
- docs/UPLOAD_GUIDE.md - How to upload skills to Claude
- STRUCTURE.md - Repository structure
Notes for Claude Code
- This is a Python-based documentation scraper
- Single-file design (doc_scraper.py, ~790 lines)
- No build system, minimal dependencies; tests run via cli/run_tests.py (71 tests)
- Output is cached and reusable
- Enhancement is optional but highly recommended
- All scraped data stored in output/ (git-ignored)