CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

🎯 Current Status (October 26, 2025)

Version: v2.0.0 (Production Ready - Major Feature Release)
Active Development: Flexible, incremental task-based approach

Recent Updates (This Week):

🚀 Major Release: Unified Multi-Source Scraping (v2.0.0)

  • NEW: Combine documentation + GitHub + PDF in one skill
  • NEW: Automatic conflict detection between docs and code
  • NEW: Rule-based and AI-powered merging
  • NEW: Transparent conflict reporting with side-by-side comparison
  • NEW: 4 example unified configs (React, Django, FastAPI, Godot)
  • NEW: Complete documentation in docs/UNIFIED_SCRAPING.md
  • NEW: Integration tests (6/6 passing)
  • Status: Production ready and fully tested

Community Response (H1 Group):

  • Issue #8 Fixed - Added BULLETPROOF_QUICKSTART.md and TROUBLESHOOTING.md for beginners
  • Issue #7 Fixed - Repaired all 11 configs (including Django, Laravel, Astro, Tailwind) - 100% working
  • Issue #4 Linked - Connected to roadmap Tasks A2/A3 (knowledge sharing + website)
  • PR #5 Reviewed - Approved anchor stripping feature (security verified, 32/32 tests pass)
  • MCP Setup Fixed - Path expansion bug resolved in setup_mcp.sh

📦 Configs Status:

  • 15/15 production configs verified working (100% success rate)
  • 4 new unified configs added (React, Django, FastAPI, Godot)
  • All selectors tested and validated

📋 Next Up:

  • Task H1.3 - Create example project folder
  • Task A3.1 - GitHub Pages site (skillseekersweb.com)
  • Task J1.1 - Install MCP package for testing

🔌 MCP Integration Available

This repository includes a fully tested MCP server with 9 tools:

  • mcp__skill-seeker__list_configs - List all available preset configurations
  • mcp__skill-seeker__generate_config - Generate a new config file for any docs site
  • mcp__skill-seeker__validate_config - Validate a config file structure
  • mcp__skill-seeker__estimate_pages - Estimate page count before scraping
  • mcp__skill-seeker__scrape_docs - Scrape and build a skill
  • mcp__skill-seeker__package_skill - Package skill into .zip file (with auto-upload)
  • mcp__skill-seeker__upload_skill - Upload .zip to Claude (NEW)
  • mcp__skill-seeker__split_config - Split large documentation configs
  • mcp__skill-seeker__generate_router - Generate router/hub skills

Setup: See docs/MCP_SETUP.md or run ./setup_mcp.sh

Status: Tested and working in production with Claude Code

Overview

Skill Seeker automatically converts any documentation website into a Claude AI skill. It scrapes documentation, organizes content, extracts code patterns, and packages everything into an uploadable .zip file for Claude.

Prerequisites

Python Version: Python 3.10 or higher (required for MCP integration)

Setup with Virtual Environment (Recommended):

# One-time setup
python3 -m venv venv
source venv/bin/activate  # macOS/Linux (Windows: venv\Scripts\activate)
pip install requests beautifulsoup4 pytest
pip freeze > requirements.txt

# Every time you use Skill Seeker in a new terminal session
source venv/bin/activate  # Activate before using any commands

Why use a virtual environment?

  • Keeps dependencies isolated from system Python
  • Prevents package version conflicts
  • Standard Python development practice
  • Required for running tests with pytest

If someone else clones this repo:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Optional (for API-based enhancement):

source venv/bin/activate
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...

Core Commands

Quick Start - Use a Preset

# Single-source scraping (documentation only)
python3 cli/doc_scraper.py --config configs/godot.json
python3 cli/doc_scraper.py --config configs/react.json
python3 cli/doc_scraper.py --config configs/vue.json
python3 cli/doc_scraper.py --config configs/django.json
python3 cli/doc_scraper.py --config configs/laravel.json
python3 cli/doc_scraper.py --config configs/fastapi.json

Unified Multi-Source Scraping (NEW - v2.0.0)

# Combine documentation + GitHub + PDF in one skill
python3 cli/unified_scraper.py --config configs/react_unified.json
python3 cli/unified_scraper.py --config configs/django_unified.json
python3 cli/unified_scraper.py --config configs/fastapi_unified.json
python3 cli/unified_scraper.py --config configs/godot_unified.json

# Override merge mode
python3 cli/unified_scraper.py --config configs/react_unified.json --merge-mode claude-enhanced

# Result: One comprehensive skill with conflict detection

What makes it special:

  • Detects discrepancies between documentation and code
  • Shows both versions side-by-side with ⚠️ warnings
  • Identifies outdated docs and undocumented features
  • Single source of truth showing intent (docs) AND reality (code)

See full guide: docs/UNIFIED_SCRAPING.md
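
For intuition, here is a minimal sketch of the kind of comparison the conflict detector performs. Names and fields are illustrative, not the actual unified_scraper API; see docs/UNIFIED_SCRAPING.md for the real interface.

# Hypothetical sketch of the conflict check; names and fields are
# illustrative, not the actual unified_scraper API.
def detect_conflicts(doc_api, code_api):
    conflicts = []
    for name, doc_sig in doc_api.items():
        code_sig = code_api.get(name)
        if code_sig is None:
            conflicts.append({"type": "missing_in_code", "name": name})
        elif code_sig != doc_sig:
            conflicts.append({"type": "signature_mismatch", "name": name,
                              "docs": doc_sig, "code": code_sig})
    for name in code_api.keys() - doc_api.keys():
        conflicts.append({"type": "missing_in_docs", "name": name})
    return conflicts

# Example: docs say useFoo(a) but the source defines useFoo(a, b)
# -> one signature_mismatch conflict, reported side-by-side with a ⚠️ warning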

Full Workflow Example

# 1. Install dependencies (one-time)
pip3 install requests beautifulsoup4

# 2. Estimate page count BEFORE scraping (fast, no data download)
python3 cli/estimate_pages.py configs/godot.json
# Time: ~1-2 minutes, shows estimated total pages and recommended max_pages

# 3. Scrape with local enhancement (uses Claude Code Max, no API key)
python3 cli/doc_scraper.py --config configs/godot.json --enhance-local
# Time: 20-40 minutes scraping + 60 seconds enhancement

# 4. Package the skill
python3 cli/package_skill.py output/godot/

# Result: godot.zip ready to upload to Claude

Interactive Mode

# Step-by-step configuration wizard
python3 cli/doc_scraper.py --interactive

Quick Mode (Minimal Config)

# Create skill from any documentation URL
python3 cli/doc_scraper.py --name react --url https://react.dev/ --description "React framework for UIs"

Skip Scraping (Use Cached Data)

# Fast rebuild using previously scraped data
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes (instant rebuild)

Async Mode (2-3x Faster Scraping)

# Enable async mode with 8 workers for best performance
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8

# Quick mode with async
python3 cli/doc_scraper.py --name react --url https://react.dev/ --async --workers 8

# Dry run with async to test
python3 cli/doc_scraper.py --config configs/godot.json --async --workers 4 --dry-run

Recommended Settings:

  • Small docs (~100-500 pages): --async --workers 4
  • Medium docs (~500-2000 pages): --async --workers 8
  • Large docs (2000+ pages): --async --workers 8 --no-rate-limit

Performance:

  • Sync: ~18 pages/sec, 120 MB memory
  • Async: ~55 pages/sec, 40 MB memory (3x faster!)

See full guide: ASYNC_SUPPORT.md
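
Under the hood, async mode amounts to a bounded worker pool. A minimal sketch, assuming blocking requests are pushed onto threads via asyncio (the actual implementation may use a different concurrency model):

import asyncio
import requests

# Sketch of the bounded worker pool behind --async (illustrative only).
async def fetch_all(urls, workers=8, rate_limit=0.0):
    sem = asyncio.Semaphore(workers)  # caps in-flight requests at `workers`

    async def fetch(url):
        async with sem:
            if rate_limit:
                await asyncio.sleep(rate_limit)
            # requests is blocking, so run it in a worker thread
            resp = await asyncio.to_thread(requests.get, url, timeout=30)
            return url, resp.status_code

    return await asyncio.gather(*(fetch(u) for u in urls))

# asyncio.run(fetch_all(["https://react.dev/learn"], workers=4))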

Enhancement Options

LOCAL Enhancement (Recommended - No API Key Required):

# During scraping
python3 cli/doc_scraper.py --config configs/react.json --enhance-local

# Standalone after scraping
python3 cli/enhance_skill_local.py output/react/

API Enhancement (Alternative - Requires API Key):

# During scraping
python3 cli/doc_scraper.py --config configs/react.json --enhance

# Standalone after scraping
python3 cli/enhance_skill.py output/react/
python3 cli/enhance_skill.py output/react/ --api-key sk-ant-...

Package and Upload the Skill

# Package skill (opens folder, shows upload instructions)
python3 cli/package_skill.py output/godot/
# Result: output/godot.zip

# Package and auto-upload (requires ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/package_skill.py output/godot/ --upload

# Upload existing .zip
python3 cli/upload_skill.py output/godot.zip

# Package without opening folder
python3 cli/package_skill.py output/godot/ --no-open
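
Conceptually, packaging is a plain zip of the skill directory that skips .backup files (see the Data Flow section below). A minimal sketch of equivalent logic, not the actual package_skill.py internals:

import zipfile
from pathlib import Path

# Sketch of the packaging step, assuming it zips the skill directory and
# skips .backup files (per the Data Flow section).
def package_skill(skill_dir):
    skill_dir = Path(skill_dir)
    zip_path = skill_dir.parent / f"{skill_dir.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in skill_dir.rglob("*"):
            if path.is_file() and not path.name.endswith(".backup"):
                zf.write(path, path.relative_to(skill_dir.parent))
    return zip_path

# package_skill("output/godot/")  # -> output/godot.zip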

Force Re-scrape

# Delete cached data and re-scrape from scratch
rm -rf output/godot_data/
python3 cli/doc_scraper.py --config configs/godot.json

Estimate Page Count (Before Scraping)

# Quick estimation - discover up to 100 pages
python3 cli/estimate_pages.py configs/react.json --max-discovery 100
# Time: ~30-60 seconds

# Full estimation - discover up to 1000 pages (default)
python3 cli/estimate_pages.py configs/godot.json
# Time: ~1-2 minutes

# Deep estimation - discover up to 2000 pages
python3 cli/estimate_pages.py configs/vue.json --max-discovery 2000
# Time: ~3-5 minutes

# What it shows:
# - Estimated total pages
# - Recommended max_pages value
# - Estimated scraping time
# - Discovery rate (pages/sec)

Why use estimation:

  • Validates config URL patterns before full scrape
  • Helps set optimal max_pages value
  • Estimates total scraping time
  • Fast (only HEAD requests + minimal parsing)
  • No data downloaded or stored
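
Conceptually, estimation is link discovery without content storage. A simplified sketch (the real estimate_pages.py is lighter-weight, applies url_patterns, and also reports recommended max_pages and timing):

from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

# Simplified sketch: count reachable same-site pages, store nothing.
def estimate_pages(base_url, max_discovery=1000):
    seen, queue = {base_url}, deque([base_url])
    while queue and len(seen) < max_discovery:
        resp = requests.get(queue.popleft(), timeout=15)
        for a in BeautifulSoup(resp.content, "html.parser").select("a[href]"):
            url = urljoin(base_url, a["href"]).split("#")[0]
            if url.startswith(base_url) and url not in seen:
                seen.add(url)
                queue.append(url)
    return len(seen)

# print(estimate_pages("https://react.dev/", max_discovery=100))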

Repository Architecture

File Structure

Skill_Seekers/
├── cli/doc_scraper.py              # Main tool (single-file, ~790 lines)
├── cli/estimate_pages.py           # Page count estimator (fast, no data)
├── cli/enhance_skill.py            # AI enhancement (API-based)
├── cli/enhance_skill_local.py      # AI enhancement (LOCAL, no API)
├── cli/package_skill.py            # Skill packager
├── cli/run_tests.py                # Test runner (71 tests)
├── configs/                    # Preset configurations
│   ├── godot.json
│   ├── react.json
│   ├── vue.json
│   ├── django.json
│   ├── fastapi.json
│   └── steam-economy-complete.json
├── docs/                       # Documentation
│   ├── CLAUDE.md               # Detailed technical architecture
│   ├── ENHANCEMENT.md          # Enhancement guide
│   └── UPLOAD_GUIDE.md         # How to upload skills
└── output/                     # Generated output (git-ignored)
    ├── {name}_data/            # Scraped raw data (cached)
    │   ├── pages/*.json        # Individual page data
    │   └── summary.json        # Scraping summary
    └── {name}/                 # Built skill directory
        ├── SKILL.md            # Main skill file
        ├── SKILL.md.backup     # Backup (if enhanced)
        ├── references/         # Categorized documentation
        │   ├── index.md
        │   ├── getting_started.md
        │   ├── api.md
        │   └── ...
        ├── scripts/            # Empty (user scripts)
        └── assets/             # Empty (user assets)

Data Flow

  1. Scrape Phase (scrape_all() in doc_scraper.py:228-251):

    • Input: Config JSON (name, base_url, selectors, url_patterns, categories)
    • Process: BFS traversal from base_url, respecting include/exclude patterns (sketched after this list)
    • Output: output/{name}_data/pages/*.json + summary.json
  2. Build Phase (build_skill() in doc_scraper.py:561-601):

    • Input: Scraped JSON data from output/{name}_data/
    • Process: Load pages → Smart categorize → Extract patterns → Generate references
    • Output: output/{name}/SKILL.md + output/{name}/references/*.md
  3. Enhancement Phase (optional):

    • Input: Built skill directory with references
    • Process: Claude analyzes references and rewrites SKILL.md
    • Output: Enhanced SKILL.md with real examples and guidance
  4. Package Phase:

    • Input: Skill directory
    • Process: Zip all files (excluding .backup)
    • Output: {name}.zip
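
To make the scrape phase concrete, here is a simplified sketch of step 1, assuming the config structure shown in the next section (the real scrape_all() also persists each page as JSON):

from collections import deque
from urllib.parse import urljoin
import time
import requests
from bs4 import BeautifulSoup

# Simplified sketch of the scrape phase; the real scrape_all() also writes
# each page to output/{name}_data/pages/*.json and a summary.json.
def scrape_all(config):
    base = config["base_url"]
    queue, seen, pages = deque([base]), {base}, []
    excludes = config["url_patterns"]["exclude"]
    while queue and len(pages) < config.get("max_pages", 500):
        url = queue.popleft()
        soup = BeautifulSoup(requests.get(url, timeout=30).content, "html.parser")
        main = soup.select_one(config["selectors"]["main_content"])
        if main is not None:
            pages.append({"url": url, "text": main.get_text(" ", strip=True)})
        for a in soup.select("a[href]"):
            link = urljoin(base, a["href"]).split("#")[0]
            if (link.startswith(base) and link not in seen
                    and not any(p in link for p in excludes)):
                seen.add(link)
                queue.append(link)
        time.sleep(config.get("rate_limit", 0.5))
    return pages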

Configuration File Structure

Config files (configs/*.json) define scraping behavior:

{
  "name": "godot",
  "description": "When to use this skill",
  "base_url": "https://docs.godotengine.org/en/stable/",
  "selectors": {
    "main_content": "div[role='main']",
    "title": "title",
    "code_blocks": "pre"
  },
  "url_patterns": {
    "include": [],
    "exclude": ["/search.html", "/_static/"]
  },
  "categories": {
    "getting_started": ["introduction", "getting_started"],
    "scripting": ["scripting", "gdscript"],
    "api": ["api", "reference", "class"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}

Config Parameters:

  • name: Skill identifier (output directory name)
  • description: When Claude should use this skill
  • base_url: Starting URL for scraping
  • selectors.main_content: CSS selector for main content (common: article, main, div[role="main"])
  • selectors.title: CSS selector for page title
  • selectors.code_blocks: CSS selector for code samples
  • url_patterns.include: Only scrape URLs containing these patterns
  • url_patterns.exclude: Skip URLs containing these patterns
  • categories: Keyword mapping for categorization
  • rate_limit: Delay between requests (seconds)
  • max_pages: Maximum pages to scrape
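
A quick way to sanity-check a config before scraping is to load it and verify the keys above. The required/optional split below is an assumption for illustration, not the tool's actual validation logic:

import json

REQUIRED_KEYS = ["name", "description", "base_url", "selectors"]

# Sketch of loading and sanity-checking a config file.
def load_config(path):
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"{path} is missing required keys: {missing}")
    # Optional keys fall back to the defaults used in this document
    config.setdefault("rate_limit", 0.5)
    config.setdefault("max_pages", 500)
    config.setdefault("url_patterns", {"include": [], "exclude": []})
    return config

# config = load_config("configs/godot.json")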

Key Features & Implementation

Auto-Detect Existing Data

Tool checks for output/{name}_data/ and prompts to reuse, avoiding re-scraping (check_existing_data() in doc_scraper.py:653-660).

Language Detection

Detects code languages from:

  1. CSS class attributes (language-*, lang-*)
  2. Heuristics (keywords like def, const, func, etc.)

See: detect_language() in doc_scraper.py:135-165
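
A sketch of that two-step detection; the keyword table is illustrative, not the exact heuristics used:

# Sketch of the two-step detection described above.
def detect_language(code_tag):
    # 1. CSS class attributes such as language-python or lang-js
    for cls in code_tag.get("class", []):
        if cls.startswith(("language-", "lang-")):
            return cls.split("-", 1)[1]
    # 2. Keyword heuristics on the code text itself
    text = code_tag.get_text()
    if "def " in text or "import " in text:
        return "python"
    if "func " in text:
        return "gdscript"
    if "const " in text or "=>" in text:
        return "javascript"
    return "text"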

Pattern Extraction

Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts following code blocks (up to 5 per page).

See: extract_patterns() in doc_scraper.py:167-183
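
A sketch of that marker-based extraction, using BeautifulSoup's find_next() to grab the code block that follows each marker (the real implementation may scan differently):

MARKERS = ("Example:", "Pattern:", "Usage:")

# Sketch of marker-based extraction, capped at `limit` patterns per page.
def extract_patterns(soup, limit=5):
    patterns = []
    for tag in soup.find_all(["p", "h2", "h3"]):
        if any(marker in tag.get_text() for marker in MARKERS):
            block = tag.find_next("pre")  # nearest following code block
            if block is not None:
                patterns.append(block.get_text())
        if len(patterns) >= limit:
            break
    return patterns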

Smart Categorization

  • Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
  • Threshold of 2+ for categorization
  • Auto-infers categories from URL segments if none provided
  • Falls back to "other" category

See: smart_categorize() and infer_categories() in doc_scraper.py:282-351
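
The scoring rule translates directly to code. A sketch using the 3/2/1 weights and threshold described above (page fields are assumed to match the scraped JSON):

# Sketch of the 3/2/1 scoring: URL match = 3, title = 2, content = 1.
def smart_categorize(page, categories, threshold=2):
    best, best_score = "other", 0
    for category, keywords in categories.items():
        score = 0
        for kw in keywords:
            if kw in page["url"]:
                score += 3
            if kw in page.get("title", "").lower():
                score += 2
            if kw in page.get("text", "").lower():
                score += 1
        if score > best_score:
            best, best_score = category, score
    return best if best_score >= threshold else "other"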

Enhanced SKILL.md Generation

Generated with:

  • Real code examples from documentation (language-annotated)
  • Quick reference patterns extracted from docs
  • Common pattern section
  • Category file listings

See: create_enhanced_skill_md() in doc_scraper.py:426-542
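
The assembly itself is straightforward string building. A minimal sketch of the idea, with a hypothetical signature (the real generator produces a much richer document):

# Minimal sketch of the assembly step; hypothetical signature, and the
# real create_enhanced_skill_md() produces a far richer document.
def create_skill_md(name, description, categories, patterns):
    lines = [f"# {name}", "", description, "", "## Quick Reference", ""]
    for pattern in patterns[:5]:  # best extracted code examples
        lines += ["```", pattern.strip(), "```", ""]
    lines.append("## Reference Files")
    lines += [f"- references/{category}.md" for category in categories]
    return "\n".join(lines)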

Common Workflows

First Time (With Scraping + Enhancement)

# 1. Scrape + Build + AI Enhancement (LOCAL, no API key)
python3 cli/doc_scraper.py --config configs/godot.json --enhance-local

# 2. Wait for enhancement terminal to close (~60 seconds)

# 3. Verify quality
cat output/godot/SKILL.md

# 4. Package
python3 cli/package_skill.py output/godot/

# Result: godot.zip ready for Claude
# Time: 20-40 minutes (scraping) + 60 seconds (enhancement)

Using Cached Data (Fast Iteration)

# 1. Use existing data + Local Enhancement
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
python3 cli/enhance_skill_local.py output/godot/

# 2. Package
python3 cli/package_skill.py output/godot/

# Time: 1-3 minutes (build) + 60 seconds (enhancement)

Without Enhancement (Basic)

# 1. Scrape + Build (no enhancement)
python3 cli/doc_scraper.py --config configs/godot.json

# 2. Package
python3 cli/package_skill.py output/godot/

# Note: SKILL.md will be basic template - enhancement recommended
# Time: 20-40 minutes

Creating a New Framework Config

Option 1: Interactive

python3 cli/doc_scraper.py --interactive
# Follow prompts, it creates the config for you

Option 2: Copy and Modify

# Copy a preset
cp configs/react.json configs/myframework.json

# Edit it
nano configs/myframework.json

# Test with limited pages first
# Set "max_pages": 20 in config

# Use it
python3 cli/doc_scraper.py --config configs/myframework.json

Testing & Verification

Finding the Right CSS Selectors

Before creating a config, test selectors with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
print(soup.select_one('div.content'))

# Test code block selector
print(soup.select('pre code'))
print(soup.select('pre'))

Verify Output Quality

After building, verify the skill quality:

# Check SKILL.md has real examples
cat output/godot/SKILL.md

# Check category structure
cat output/godot/references/index.md

# List all reference files
ls output/godot/references/

# Check specific category content
cat output/godot/references/getting_started.md

# Verify code samples have language detection
grep -A 3 "```" output/godot/references/*.md | head -20

Test with Limited Pages

For faster testing, edit config to limit pages:

{
  "max_pages": 20
}

Note: JSON does not support comments, so set the value directly in the config file.

Troubleshooting

No Content Extracted

Problem: Pages scraped but content is empty

Solution: Check main_content selector in config. Try:

  • article
  • main
  • div[role="main"]
  • div.content

Use the BeautifulSoup testing approach above to find the right selector.

Poor Categorization

Problem: Pages not categorized well

Solution: Edit categories section in config with better keywords specific to the documentation structure. Check URL patterns in scraped data:

# See what URLs were scraped
cat output/godot_data/summary.json | grep url | head -20

Data Exists But Won't Use It

Problem: Tool won't reuse existing data

Solution: Force re-scrape:

rm -rf output/myframework_data/
python3 cli/doc_scraper.py --config configs/myframework.json

Rate Limiting Issues

Problem: Getting rate limited or blocked by documentation server

Solution: Increase rate_limit value in config:

{
  "rate_limit": 1.0
}

This raises the delay between requests from the default 0.5 to 1.0 seconds (JSON does not support comments, so edit the value directly).

Package Path Error

Problem: doc_scraper.py prints an incorrect absolute path for package_skill.py in its final instructions

Expected output:

python3 cli/package_skill.py output/godot/

Not:

python3 /mnt/skills/examples/skill-creator/scripts/cli/package_skill.py output/godot/

The correct command uses the local cli/package_skill.py in the repository root.

Key Code Locations

  • URL validation: is_valid_url() doc_scraper.py:49-64
  • Content extraction: extract_content() doc_scraper.py:66-133
  • Language detection: detect_language() doc_scraper.py:135-165
  • Pattern extraction: extract_patterns() doc_scraper.py:167-183
  • Smart categorization: smart_categorize() doc_scraper.py:282-323
  • Category inference: infer_categories() doc_scraper.py:325-351
  • Quick reference generation: generate_quick_reference() doc_scraper.py:353-372
  • SKILL.md generation: create_enhanced_skill_md() doc_scraper.py:426-542
  • Scraping loop: scrape_all() doc_scraper.py:228-251
  • Main workflow: main() doc_scraper.py:663-789

Enhancement Details

LOCAL Enhancement (Recommended)

  • Uses your Claude Code Max plan (no API costs)
  • Opens new terminal with Claude Code
  • Analyzes reference files automatically
  • Takes 30-60 seconds
  • Quality: 9/10 (comparable to API version)
  • Backs up original SKILL.md to SKILL.md.backup

API Enhancement (Alternative)

  • Uses Anthropic API (~$0.15-$0.30 per skill)
  • Requires ANTHROPIC_API_KEY
  • Same quality as LOCAL
  • Faster (no terminal launch)
  • Better for automation/CI

What Enhancement Does:

  1. Reads reference documentation files
  2. Analyzes content with Claude
  3. Extracts 5-10 best code examples
  4. Creates comprehensive quick reference
  5. Adds domain-specific key concepts
  6. Provides navigation guidance for different skill levels
  7. Transforms 75-line templates into 500+ line comprehensive guides

Performance

Task                  Time        Notes
Scraping              15-45 min   First time only
Building              1-3 min     Fast!
Re-building           <1 min      With --skip-scrape
Enhancement (LOCAL)   30-60 sec   Uses Claude Code Max
Enhancement (API)     20-40 sec   Requires API key
Packaging             5-10 sec    Final zip

Available Production Configs (All Verified Working)

Web Frameworks:

  • react.json - React (article selector, 7,102 chars)
  • vue.json - Vue.js (main selector, 1,029 chars)
  • astro.json - Astro (article selector, 145 chars)
  • django.json - Django (article selector, 6,468 chars)
  • laravel.json - Laravel 9.x (#main-content selector, 16,131 chars)
  • fastapi.json - FastAPI (article selector, 11,906 chars)

DevOps & Automation:

  • ansible-core.json - Ansible Core 2.19 (div[role='main'] selector, ~32K chars) NEW!
  • kubernetes.json - Kubernetes (main selector, 2,100 chars)

Game Engines:

  • godot.json - Godot (div[role='main'] selector, 1,688 chars)
  • godot-large-example.json - Godot large docs example

CSS & Utilities:

  • tailwind.json - Tailwind CSS (div.prose selector, 195 chars)

Gaming:

  • steam-economy-complete.json - Steam Economy (div.documentation_bbcode, 588 chars)

Unified Multi-Source (NEW - v2.0.0):

  • react_unified.json, django_unified.json, fastapi_unified.json, godot_unified.json - combined docs + GitHub (+ PDF) skills (see docs/UNIFIED_SCRAPING.md)

All configs tested and verified as of October 22, 2025

Additional Documentation

  • docs/UNIFIED_SCRAPING.md - Unified multi-source scraping guide
  • docs/MCP_SETUP.md - MCP server setup
  • docs/ENHANCEMENT.md - Enhancement guide
  • docs/UPLOAD_GUIDE.md - How to upload skills to Claude
  • ASYNC_SUPPORT.md - Async scraping guide

Notes for Claude Code

  • This is a Python-based documentation scraper
  • Core scraper is a single file (cli/doc_scraper.py, ~790 lines)
  • No build system, minimal dependencies; tests run via cli/run_tests.py (71 tests)
  • Output is cached and reusable
  • Enhancement is optional but highly recommended
  • All scraped data is stored in output/ (git-ignored)