# Skill Seeker
Automatically convert documentation websites, GitHub repositories, and PDFs into Claude AI skills in minutes.
**📋 View Development Roadmap & Tasks** - 134 tasks across 10 categories, pick any to contribute!
## What is Skill Seeker?
Skill Seeker is an automated tool that transforms documentation websites, GitHub repositories, and PDF files into production-ready Claude AI skills. Instead of manually reading and summarizing documentation, Skill Seeker:
- Scrapes multiple sources (docs, GitHub repos, PDFs) automatically
- Analyzes code repositories with deep AST parsing
- Detects conflicts between documentation and code implementation
- Organizes content into categorized reference files
- Enhances with AI to extract best examples and key concepts
- Packages everything into an uploadable `.zip` file for Claude
Result: Get comprehensive Claude skills for any framework, API, or tool in 20-40 minutes instead of hours of manual work.
## Why Use This?
- 🎯 For Developers: Create skills from documentation + GitHub repos with conflict detection
- 🎮 For Game Devs: Generate skills for game engines (Godot docs + GitHub, Unity, etc.)
- 🔧 For Teams: Combine internal docs + code repositories into single source of truth
- 📚 For Learners: Build comprehensive skills from docs, code examples, and PDFs
- 🔍 For Open Source: Analyze repos to find documentation gaps and outdated examples
## Key Features

### 🌐 Documentation Scraping
- ✅ llms.txt Support - Automatically detects and uses LLM-ready documentation files (10x faster)
- ✅ Universal Scraper - Works with ANY documentation website
- ✅ Smart Categorization - Automatically organizes content by topic
- ✅ Code Language Detection - Recognizes Python, JavaScript, C++, GDScript, etc.
- ✅ 8 Ready-to-Use Presets - Godot, React, Vue, Django, FastAPI, and more
### 📄 PDF Support (v1.2.0)
- ✅ Basic PDF Extraction - Extract text, code, and images from PDF files
- ✅ OCR for Scanned PDFs - Extract text from scanned documents
- ✅ Password-Protected PDFs - Handle encrypted PDFs
- ✅ Table Extraction - Extract complex tables from PDFs
- ✅ Parallel Processing - 3x faster for large PDFs
- ✅ Intelligent Caching - 50% faster on re-runs
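For intuition, here is a minimal sketch of page-parallel text extraction with PyMuPDF; it is illustrative only, not the actual `cli/pdf_scraper.py` implementation (the file path and worker split are placeholders):

```python
# Minimal sketch of page-parallel PDF text extraction with PyMuPDF.
# Illustrative only -- not the project's actual pdf_scraper.py implementation.
from concurrent.futures import ProcessPoolExecutor

import fitz  # PyMuPDF: pip install PyMuPDF


def extract_range(args: tuple[str, int, int]) -> list[str]:
    """Open the PDF inside the worker (documents aren't picklable) and
    extract plain text for the half-open page range [start, end)."""
    path, start, end = args
    with fitz.open(path) as doc:
        return [doc[i].get_text() for i in range(start, end)]


def extract_parallel(path: str, workers: int = 4) -> list[str]:
    with fitz.open(path) as doc:
        n = doc.page_count
    step = max(1, -(-n // workers))  # ceiling division: pages per worker
    chunks = [(path, i, min(i + step, n)) for i in range(0, n, step)]
    pages: list[str] = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(extract_range, chunks):
            pages.extend(chunk)
    return pages


if __name__ == "__main__":
    print(len(extract_parallel("docs/manual.pdf", workers=4)), "pages extracted")
```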
### 🐙 GitHub Repository Scraping (v2.0.0)
- ✅ Deep Code Analysis - AST parsing for Python, JavaScript, TypeScript, Java, C++, Go
- ✅ API Extraction - Functions, classes, methods with parameters and types
- ✅ Repository Metadata - README, file tree, language breakdown, stars/forks
- ✅ GitHub Issues & PRs - Fetch open/closed issues with labels and milestones
- ✅ CHANGELOG & Releases - Automatically extract version history
- ✅ Conflict Detection - Compare documented APIs vs actual code implementation
- ✅ MCP Integration - Natural language: "Scrape GitHub repo facebook/react"
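To give a flavor of what AST-based API extraction looks like, here is a minimal sketch using Python's built-in `ast` module; it is illustrative only, and far simpler than the project's multi-language analyzer:

```python
# Minimal sketch of AST-based API extraction for Python source files.
# Illustrative only -- the real analyzer covers more languages and detail.
import ast


def extract_api(source: str) -> list[dict]:
    """Return top-level functions and classes with names, params, and docstrings."""
    api = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = [a.arg for a in node.args.args]
            api.append({"kind": "function", "name": node.name,
                        "params": params, "doc": ast.get_docstring(node)})
        elif isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            api.append({"kind": "class", "name": node.name,
                        "methods": methods, "doc": ast.get_docstring(node)})
    return api


print(extract_api("def move(delta: float, snap: bool = False) -> None:\n    'Move node.'\n"))
```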
### 🔄 Unified Multi-Source Scraping (NEW in v2.0.0)
- ✅ Combine Multiple Sources - Mix documentation + GitHub + PDF in one skill
- ✅ Conflict Detection - Automatically finds discrepancies between docs and code
- ✅ Intelligent Merging - Rule-based or AI-powered conflict resolution
- ✅ Transparent Reporting - Side-by-side comparison with ⚠️ warnings
- ✅ Documentation Gap Analysis - Identifies outdated docs and undocumented features
- ✅ Single Source of Truth - One skill showing both intent (docs) and reality (code)
- ✅ Backward Compatible - Legacy single-source configs still work
### 🤖 AI & Enhancement
- ✅ AI-Powered Enhancement - Transforms basic templates into comprehensive guides
- ✅ No API Costs - FREE local enhancement using Claude Code Max
- ✅ MCP Server for Claude Code - Use directly from Claude Code with natural language
### ⚡ Performance & Scale

- ✅ Async Mode - 2-3x faster scraping with async/await (use the `--async` flag)
- ✅ Large Documentation Support - Handle 10K-40K+ page docs with intelligent splitting
- ✅ Router/Hub Skills - Intelligent routing to specialized sub-skills
- ✅ Parallel Scraping - Process multiple skills simultaneously
- ✅ Checkpoint/Resume - Never lose progress on long scrapes
- ✅ Caching System - Scrape once, rebuild instantly
### ✅ Quality Assurance
- ✅ Fully Tested - 299 tests with 100% pass rate
## Quick Example

### Option 1: Use from Claude Code (Recommended)
```bash
# One-time setup (5 minutes)
./setup_mcp.sh
```

Then in Claude Code, just ask:

```
"Generate a React skill from https://react.dev/"
"Scrape PDF at docs/manual.pdf and create skill"
```
Time: Automated | Quality: Production-ready | Cost: Free
### Option 2: Use CLI Directly (HTML Docs)

```bash
# Install dependencies (2 pip packages)
pip3 install requests beautifulsoup4

# Generate a React skill in one command
python3 cli/doc_scraper.py --config configs/react.json --enhance-local

# Upload output/react.zip to Claude - Done!
```
Time: ~25 minutes | Quality: Production-ready | Cost: Free
### Option 3: Use CLI for PDF Documentation

```bash
# Install PDF support
pip3 install PyMuPDF

# Basic PDF extraction
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill

# Advanced features: extract tables, fast parallel processing on 8 CPU cores
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill \
    --extract-tables \
    --parallel \
    --workers 8

# Scanned PDFs (requires: pip install pytesseract Pillow)
python3 cli/pdf_scraper.py --pdf docs/scanned.pdf --name myskill --ocr

# Password-protected PDFs
python3 cli/pdf_scraper.py --pdf docs/encrypted.pdf --name myskill --password mypassword

# Upload output/myskill.zip to Claude - Done!
```
Time: ~5-15 minutes (or 2-5 minutes with parallel) | Quality: Production-ready | Cost: Free
Advanced Features:
- ✅ OCR for scanned PDFs (requires pytesseract)
- ✅ Password-protected PDF support
- ✅ Table extraction
- ✅ Parallel processing (3x faster)
- ✅ Intelligent caching
### Option 4: Use CLI for GitHub Repository

```bash
# Install GitHub support
pip3 install PyGithub

# Basic repository scraping
python3 cli/github_scraper.py --repo facebook/react

# Using a config file
python3 cli/github_scraper.py --config configs/react_github.json

# With authentication (higher rate limits)
export GITHUB_TOKEN=ghp_your_token_here
python3 cli/github_scraper.py --repo facebook/react

# Customize what to include: GitHub Issues (capped at 100),
# CHANGELOG.md, and GitHub Releases
python3 cli/github_scraper.py --repo django/django \
    --include-issues \
    --max-issues 100 \
    --include-changelog \
    --include-releases

# Upload output/react.zip to Claude - Done!
```

MCP usage in Claude Code:

```
"Scrape GitHub repository facebook/react"
```
Time: ~5-10 minutes | Quality: Production-ready | Cost: Free
What Gets Extracted:
- ✅ README.md and documentation files
- ✅ GitHub Issues (open/closed, labels, milestones)
- ✅ CHANGELOG.md and version history
- ✅ GitHub Releases with release notes
- ✅ Repository metadata (stars, language, topics)
- ✅ File structure and language breakdown
### Option 5: Unified Multi-Source Scraping (NEW in v2.0.0)
The Problem: Documentation and code often drift apart. Docs might be outdated, missing features that exist in code, or documenting features that were removed.
The Solution: Combine documentation + GitHub + PDF into one unified skill that shows BOTH what's documented AND what actually exists, with clear warnings about discrepancies.
```bash
# Create unified config (mix documentation + GitHub)
cat > configs/myframework_unified.json << 'EOF'
{
  "name": "myframework",
  "description": "Complete framework knowledge from docs + code",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.myframework.com/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "owner/myframework",
      "include_code": true,
      "code_analysis_depth": "surface"
    }
  ]
}
EOF

# Run unified scraper
python3 cli/unified_scraper.py --config configs/myframework_unified.json

# Upload output/myframework.zip to Claude - Done!
```
Time: ~30-45 minutes | Quality: Production-ready with conflict detection | Cost: Free
What Makes It Special:
✅ Conflict Detection - Automatically finds 4 types of discrepancies:
- 🔴 Missing in code (high): Documented but not implemented
- 🟡 Missing in docs (medium): Implemented but not documented
- ⚠️ Signature mismatch: Different parameters/types
- ℹ️ Description mismatch: Different explanations
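As a rough sketch of how the first three conflict types might be flagged (function and field names here are hypothetical, not the unified scraper's internals):

```python
# Hypothetical sketch of conflict classification between docs and code.
def classify_conflict(name: str, doc_params: list[str] | None,
                      code_params: list[str] | None) -> str | None:
    """Compare a documented API entry against the scraped implementation."""
    if doc_params is not None and code_params is None:
        return f"🔴 {name}: documented but missing in code (high)"
    if doc_params is None and code_params is not None:
        return f"🟡 {name}: implemented but missing in docs (medium)"
    if doc_params != code_params:
        return (f"⚠️ {name}: signature mismatch - "
                f"docs say {doc_params}, code has {code_params}")
    return None  # no conflict


print(classify_conflict("move_local_x", ["delta"], ["delta", "snap"]))
```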
✅ Transparent Reporting - Shows both versions side-by-side:

#### `move_local_x(delta: float)`

⚠️ **Conflict**: Documentation signature differs from implementation

**Documentation says:**

```python
def move_local_x(delta: float)
```

**Code implementation:**

```python
def move_local_x(delta: float, snap: bool = False) -> None
```
✅ **Advantages:**
- **Identifies documentation gaps** - Find outdated or missing docs automatically
- **Catches code changes** - Know when APIs change without docs being updated
- **Single source of truth** - One skill showing intent (docs) AND reality (code)
- **Actionable insights** - Get suggestions for fixing each conflict
- **Development aid** - See what's actually in the codebase vs what's documented
**Example Unified Configs:**
- `configs/react_unified.json` - React docs + GitHub repo
- `configs/django_unified.json` - Django docs + GitHub repo
- `configs/fastapi_unified.json` - FastAPI docs + GitHub repo
**Full Guide:** See [docs/UNIFIED_SCRAPING.md](docs/UNIFIED_SCRAPING.md) for complete documentation.
## How It Works
```mermaid
graph LR
A[Documentation Website] --> B[Skill Seeker]
B --> C[Scraper]
B --> D[AI Enhancement]
B --> E[Packager]
C --> F[Organized References]
D --> F
F --> E
E --> G[Claude Skill .zip]
G --> H[Upload to Claude AI]
```
1. Detect llms.txt: Checks for `llms-full.txt`, `llms.txt`, or `llms-small.txt` first
2. Scrape: Extracts all pages from documentation
3. Categorize: Organizes content into topics (API, guides, tutorials, etc.)
4. Enhance: AI analyzes docs and creates a comprehensive SKILL.md with examples
5. Package: Bundles everything into a Claude-ready `.zip` file
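Step 1 can be pictured as a quick probe of well-known paths before any crawling starts; a minimal sketch assuming `requests`, with the candidate order mirroring the list above:

```python
# Sketch of step 1: probe for LLM-ready documentation files before crawling.
import requests
from urllib.parse import urljoin

CANDIDATES = ["llms-full.txt", "llms.txt", "llms-small.txt"]  # preferred order


def find_llms_txt(base_url: str) -> str | None:
    for name in CANDIDATES:
        url = urljoin(base_url, name)
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
        except requests.RequestException:
            continue
        if resp.status_code == 200:
            return url  # use this file instead of a page-by-page scrape
    return None


print(find_llms_txt("https://docs.example.com/"))
```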
## 📋 Prerequisites

Before you start, make sure you have:

- Python 3.10 or higher - Download | Check: `python3 --version`
- Git - Download | Check: `git --version`
- 15-30 minutes for first-time setup
First time user? → Start Here: Bulletproof Quick Start Guide 🎯
This guide walks you through EVERYTHING step-by-step (Python install, git clone, first skill creation).
## 🚀 Quick Start

### Method 1: MCP Server for Claude Code (Easiest)
Use Skill Seeker directly from Claude Code with natural language!
```bash
# Clone repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers

# One-time setup (5 minutes)
./setup_mcp.sh
```

Restart Claude Code, then just ask:

```
List all available configs
Generate config for Tailwind at https://tailwindcss.com/docs
Scrape docs using configs/react.json
Package skill at output/react/
```
Benefits:
- ✅ No manual CLI commands
- ✅ Natural language interface
- ✅ Integrated with your workflow
- ✅ 9 tools available instantly (includes automatic upload!)
- ✅ Tested and working in production
Full guides:
- 📘 MCP Setup Guide - Complete installation instructions
- 🧪 MCP Testing Guide - Test all 9 tools
- 📦 Large Documentation Guide - Handle 10K-40K+ pages
- 📤 Upload Guide - How to upload skills to Claude
### Method 2: CLI (Traditional)

#### One-Time Setup: Create Virtual Environment
```bash
# Clone repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # macOS/Linux
# OR on Windows: venv\Scripts\activate

# Install dependencies
pip install requests beautifulsoup4 pytest

# Save dependencies
pip freeze > requirements.txt

# Optional: Install anthropic for API-based enhancement (not needed for LOCAL enhancement)
# pip install anthropic
```

Always activate the virtual environment before using Skill Seeker:

```bash
source venv/bin/activate  # Run this each time you start a new terminal session
```
#### Easiest: Use a Preset

```bash
# Make sure venv is activated (you should see (venv) in your prompt)
source venv/bin/activate

# Optional: Estimate pages first (fast, 1-2 minutes)
python3 cli/estimate_pages.py configs/godot.json

# Use Godot preset
python3 cli/doc_scraper.py --config configs/godot.json

# Use React preset
python3 cli/doc_scraper.py --config configs/react.json

# See all presets
ls configs/
```
#### Interactive Mode

```bash
python3 cli/doc_scraper.py --interactive
```

#### Quick Mode

```bash
python3 cli/doc_scraper.py \
    --name react \
    --url https://react.dev/ \
    --description "React framework for UIs"
```
## 📤 Uploading Skills to Claude
Once your skill is packaged, you need to upload it to Claude:
### Option 1: Automatic Upload (API-based)

```bash
# Set your API key (one-time)
export ANTHROPIC_API_KEY=sk-ant-...

# Package and upload automatically
python3 cli/package_skill.py output/react/ --upload

# OR upload an existing .zip
python3 cli/upload_skill.py output/react.zip
```
Benefits:
- ✅ Fully automatic
- ✅ No manual steps
- ✅ Works from command line
Requirements:
- Anthropic API key (get from https://console.anthropic.com/)
### Option 2: Manual Upload (No API Key)

```bash
# Package skill
python3 cli/package_skill.py output/react/

# This will:
# 1. Create output/react.zip
# 2. Open the output/ folder automatically
# 3. Show upload instructions
```

Then manually upload:
- Go to https://claude.ai/skills
- Click "Upload Skill"
- Select output/react.zip
- Done!
Benefits:
- ✅ No API key needed
- ✅ Works for everyone
- ✅ Folder opens automatically
### Option 3: Claude Code (MCP) - Smart & Automatic

In Claude Code, just ask:

```
"Package and upload the React skill"
```

With an API key set, it packages the skill and uploads it to Claude automatically. Without an API key, it packages the skill, shows where to find the `.zip`, and provides manual upload instructions.
Benefits:
- ✅ Natural language
- ✅ Smart auto-detection (uploads if API key available)
- ✅ Works with or without API key
- ✅ Graceful fallback when no API key is set
## 📁 Simple Structure

```
doc-to-skill/
├── cli/
│   ├── doc_scraper.py       # Main scraping tool
│   ├── package_skill.py     # Package to .zip
│   ├── upload_skill.py      # Auto-upload (API)
│   └── enhance_skill.py     # AI enhancement
├── mcp/                     # MCP server for Claude Code
│   └── server.py            # 9 MCP tools
├── configs/                 # Preset configurations
│   ├── godot.json           # Godot Engine
│   ├── react.json           # React
│   ├── vue.json             # Vue.js
│   ├── django.json          # Django
│   └── fastapi.json         # FastAPI
└── output/                  # All output (auto-created)
    ├── godot_data/          # Scraped data
    ├── godot/               # Built skill
    └── godot.zip            # Packaged skill
```
## ✨ Features

### 1. Fast Page Estimation (NEW!)

```bash
python3 cli/estimate_pages.py configs/react.json
```

Output:

```
📊 ESTIMATION RESULTS
✅ Pages Discovered: 180
📈 Estimated Total: 230
⏱️ Time Elapsed: 1.2 minutes
💡 Recommended max_pages: 280
```
Benefits:
- Know page count BEFORE scraping (saves time)
- Validates URL patterns work correctly
- Estimates total scraping time
- Recommends optimal `max_pages` setting
- Fast (1-2 minutes vs a 20-40 minute full scrape)
### 2. Auto-Detect Existing Data

```bash
python3 cli/doc_scraper.py --config configs/godot.json
```

If data exists:

```
✓ Found existing data: 245 pages
Use existing data? (y/n): y
⏭️ Skipping scrape, using existing data
```
### 3. Knowledge Generation
Automatic pattern extraction:
- Extracts common code patterns from docs
- Detects programming language
- Creates quick reference with real examples
- Smarter categorization with scoring
Enhanced SKILL.md:
- Real code examples from documentation
- Language-annotated code blocks
- Common patterns section
- Quick reference from actual usage examples
### 4. Smart Categorization
Automatically infers categories from:
- URL structure
- Page titles
- Content keywords
- With scoring for better accuracy
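A toy version of that scoring might look like this (the keywords and weights are invented for illustration, not the tool's actual values):

```python
# Toy sketch of keyword-scored categorization (weights are illustrative).
CATEGORIES = {
    "getting_started": ["intro", "quickstart", "install"],
    "api": ["api", "reference", "class"],
}
WEIGHTS = {"url": 3, "title": 2, "content": 1}  # URL hits count most


def categorize(url: str, title: str, content: str) -> str:
    scores = {cat: 0 for cat in CATEGORIES}
    for cat, keywords in CATEGORIES.items():
        for kw in keywords:
            if kw in url.lower():
                scores[cat] += WEIGHTS["url"]
            if kw in title.lower():
                scores[cat] += WEIGHTS["title"]
            scores[cat] += WEIGHTS["content"] * content.lower().count(kw)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "misc"


print(categorize("https://docs.example.com/api/node", "Node reference", "..."))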
### 5. Code Language Detection

Automatically detects languages from telltale syntax:

- Python (`def`, `import`, `from`)
- JavaScript (`const`, `let`, `=>`)
- GDScript (`func`, `var`, `extends`)
- C++ (`#include`, `int main`)
- And more...
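A minimal detector in this spirit (the marker lists are illustrative, not the tool's exact heuristics):

```python
# Minimal sketch of marker-based code language detection.
MARKERS = {
    "python": ["def ", "import ", "from "],
    "javascript": ["const ", "let ", "=>"],
    "gdscript": ["func ", "var ", "extends "],
    "cpp": ["#include", "int main"],
}


def detect_language(code: str) -> str:
    scores = {lang: sum(m in code for m in markers)
              for lang, markers in MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(detect_language("func _ready():\n    var x = 1"))  # -> gdscript
```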
### 6. Skip Scraping

```bash
# Scrape once
python3 cli/doc_scraper.py --config configs/react.json

# Later, just rebuild (instant)
python3 cli/doc_scraper.py --config configs/react.json --skip-scrape
```
### 7. Async Mode for Faster Scraping (2-3x Speed!)

```bash
# Enable async mode with 8 workers (recommended for large docs)
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8

# Small docs (~100-500 pages)
python3 cli/doc_scraper.py --config configs/mydocs.json --async --workers 4

# Large docs (2000+ pages) with no rate limiting
python3 cli/doc_scraper.py --config configs/largedocs.json --async --workers 8 --no-rate-limit
```
Performance Comparison:
- Sync mode (threads): ~18 pages/sec, 120 MB memory
- Async mode: ~55 pages/sec, 40 MB memory
- Result: 3x faster, 66% less memory!
When to use:
- ✅ Large documentation (500+ pages)
- ✅ Network latency is high
- ✅ Memory is constrained
- ❌ Small docs (< 100 pages) - overhead not worth it
See full guide: ASYNC_SUPPORT.md
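Under the hood, async mode boils down to bounded concurrent fetches. A minimal sketch of that idea with `asyncio` and `aiohttp` (illustrative only; worker count and URLs are placeholders):

```python
# Minimal sketch of semaphore-bounded async scraping (the --async idea).
import asyncio

import aiohttp  # pip install aiohttp


async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore,
                url: str) -> str:
    async with sem:  # cap concurrent requests at the worker count
        async with session.get(url) as resp:
            return await resp.text()


async def scrape_all(urls: list[str], workers: int = 8) -> list[str]:
    sem = asyncio.Semaphore(workers)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))


pages = asyncio.run(scrape_all(["https://react.dev/"] * 4, workers=4))
print(len(pages), "pages fetched")
```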
### 8. AI-Powered SKILL.md Enhancement

```bash
# Option 1: During scraping (API-based, requires API key)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/doc_scraper.py --config configs/react.json --enhance

# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 cli/doc_scraper.py --config configs/react.json --enhance-local

# Option 3: After scraping (API-based, standalone)
python3 cli/enhance_skill.py output/react/

# Option 4: After scraping (LOCAL, no API key, standalone)
python3 cli/enhance_skill_local.py output/react/
```
What it does:
- Reads your reference documentation
- Uses Claude to generate an excellent SKILL.md
- Extracts best code examples (5-10 practical examples)
- Creates comprehensive quick reference
- Adds domain-specific key concepts
- Provides navigation guidance for different skill levels
- Automatically backs up original
- Quality: Transforms 75-line templates into 500+ line comprehensive guides
LOCAL Enhancement (Recommended):
- Uses your Claude Code Max plan (no API costs)
- Opens new terminal with Claude Code
- Analyzes reference files automatically
- Takes 30-60 seconds
- Quality: 9/10 (comparable to API version)
### 9. Large Documentation Support (10K-40K+ Pages)

For massive documentation sites like Godot (40K pages), AWS, or Microsoft Docs:

```bash
# 1. Estimate first (discover page count)
python3 cli/estimate_pages.py configs/godot.json

# 2. Auto-split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router
# Creates:
# - godot-scripting.json (5K pages)
# - godot-2d.json (8K pages)
# - godot-3d.json (10K pages)
# - godot-physics.json (6K pages)
# - godot-shaders.json (11K pages)

# 3. Scrape all in parallel (4-8 hours instead of 20-40!)
for config in configs/godot-*.json; do
    python3 cli/doc_scraper.py --config $config &
done
wait

# 4. Generate intelligent router/hub skill
python3 cli/generate_router.py configs/godot-*.json

# 5. Package all skills
python3 cli/package_multi.py output/godot*/

# 6. Upload all .zip files to Claude
# Users just ask questions naturally!
# The router automatically directs them to the right sub-skill!
```
Split Strategies:
- `auto` - Intelligently detects the best strategy based on page count
- `category` - Split by documentation categories (scripting, 2d, 3d, etc.)
- `router` - Create a hub skill + specialized sub-skills (RECOMMENDED)
- `size` - Split every N pages, for docs without clear categories (see the sketch below)
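For intuition, the `size` strategy is plain chunking; a toy sketch where `target` mirrors the `target_pages_per_skill` option shown below:

```python
# Toy sketch of the "size" split strategy: chunk a page list every N pages.
def split_by_size(pages: list[str], target: int = 5000) -> list[list[str]]:
    return [pages[i:i + target] for i in range(0, len(pages), target)]


chunks = split_by_size([f"page-{i}" for i in range(12000)], target=5000)
print([len(c) for c in chunks])  # -> [5000, 5000, 2000]
```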
Benefits:
- ✅ Faster scraping (parallel execution)
- ✅ More focused skills (better Claude performance)
- ✅ Easier maintenance (update one topic at a time)
- ✅ Natural user experience (router handles routing)
- ✅ Avoids context window limits
Configuration:

```json
{
  "name": "godot",
  "max_pages": 40000,
  "split_strategy": "router",
  "split_config": {
    "target_pages_per_skill": 5000,
    "create_router": true,
    "split_by_categories": ["scripting", "2d", "3d", "physics"]
  }
}
```
Full Guide: Large Documentation Guide
### 10. Checkpoint/Resume for Long Scrapes

Never lose progress on long-running scrapes. Enable it in your config (`interval` = save every N pages):

```json
{
  "checkpoint": {
    "enabled": true,
    "interval": 1000
  }
}
```

```bash
# If the scrape is interrupted (Ctrl+C or crash), resume from the last checkpoint
python3 cli/doc_scraper.py --config configs/godot.json --resume

# Start fresh (clear checkpoint)
python3 cli/doc_scraper.py --config configs/godot.json --fresh
```

Resuming prints:

```
✅ Resuming from checkpoint (12,450 pages scraped)
⏭️ Skipping 12,450 already-scraped pages
🔄 Continuing from where we left off...
```
Benefits:
- ✅ Auto-saves every 1000 pages (configurable)
- ✅ Saves on interruption (Ctrl+C)
- ✅ Resume with the `--resume` flag
- ✅ Never lose hours of scraping progress
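Conceptually, a checkpoint is just periodically persisted crawl state. A minimal sketch of the idea (the file layout and field names are hypothetical, not the tool's actual format):

```python
# Hypothetical sketch of periodic checkpointing during a long scrape.
import json
import os


def save_checkpoint(path: str, scraped_urls: set[str], queue: list[str]) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"scraped": sorted(scraped_urls), "queue": queue}, f)
    os.replace(tmp, path)  # atomic swap so a crash never corrupts the file


def load_checkpoint(path: str) -> tuple[set[str], list[str]]:
    if not os.path.exists(path):
        return set(), []
    with open(path) as f:
        state = json.load(f)
    return set(state["scraped"]), state["queue"]


# In the crawl loop: save every `interval` pages and on KeyboardInterrupt;
# resuming reloads this state and skips already-scraped URLs.
```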
## 🎯 Complete Workflows

### First Time (With Scraping + Enhancement)
```bash
# 1. Scrape + Build + AI Enhancement (LOCAL, no API key)
python3 cli/doc_scraper.py --config configs/godot.json --enhance-local

# 2. Wait for the new terminal to close (enhancement completes)
# Check the enhanced SKILL.md:
cat output/godot/SKILL.md

# 3. Package
python3 cli/package_skill.py output/godot/

# 4. Done! You have godot.zip with an excellent SKILL.md
```
Time: 20-40 minutes (scraping) + 60 seconds (enhancement) = ~21-41 minutes
### Using Existing Data (Fast!)

```bash
# 1. Use cached data + local enhancement
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
python3 cli/enhance_skill_local.py output/godot/

# 2. Package
python3 cli/package_skill.py output/godot/

# 3. Done!
```
Time: 1-3 minutes (build) + 60 seconds (enhancement) = ~2-4 minutes total
### Without Enhancement (Basic)

```bash
# 1. Scrape + Build (no enhancement)
python3 cli/doc_scraper.py --config configs/godot.json

# 2. Package
python3 cli/package_skill.py output/godot/

# 3. Done! (SKILL.md will be a basic template)
```

Time: 20-40 minutes | Note: SKILL.md will be generic - enhancement is strongly recommended!
## 📋 Available Presets

| Config | Framework | Description |
|---|---|---|
| `godot.json` | Godot Engine | Game development |
| `react.json` | React | UI framework |
| `vue.json` | Vue.js | Progressive framework |
| `django.json` | Django | Python web framework |
| `fastapi.json` | FastAPI | Modern Python API |
| `ansible-core.json` | Ansible Core 2.19 | Automation & configuration |
### Using Presets

```bash
# Godot
python3 cli/doc_scraper.py --config configs/godot.json

# React
python3 cli/doc_scraper.py --config configs/react.json

# Vue
python3 cli/doc_scraper.py --config configs/vue.json

# Django
python3 cli/doc_scraper.py --config configs/django.json

# FastAPI
python3 cli/doc_scraper.py --config configs/fastapi.json

# Ansible
python3 cli/doc_scraper.py --config configs/ansible-core.json
```
## 🎨 Creating Your Own Config

### Option 1: Interactive

```bash
python3 cli/doc_scraper.py --interactive
# Follow the prompts; it will create the config for you
```
### Option 2: Copy and Edit

```bash
# Copy a preset
cp configs/react.json configs/myframework.json

# Edit it
nano configs/myframework.json

# Use it
python3 cli/doc_scraper.py --config configs/myframework.json
```
### Config Structure

```json
{
  "name": "myframework",
  "description": "When to use this skill",
  "base_url": "https://docs.myframework.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs", "/guide"],
    "exclude": ["/blog", "/about"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
```
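The `url_patterns` block behaves like a simple include/exclude filter; roughly (a sketch of the behavior, not the scraper's exact matching logic):

```python
# Sketch of how url_patterns include/exclude filtering behaves.
def should_scrape(url: str, include: list[str], exclude: list[str]) -> bool:
    if any(pat in url for pat in exclude):
        return False  # exclusions always win
    return any(pat in url for pat in include) if include else True


print(should_scrape("https://docs.myframework.com/docs/intro",
                    include=["/docs", "/guide"], exclude=["/blog", "/about"]))  # True
print(should_scrape("https://docs.myframework.com/blog/news",
                    include=["/docs", "/guide"], exclude=["/blog", "/about"]))  # False
```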
## 📊 What Gets Created

```
output/
├── godot_data/              # Scraped raw data
│   ├── pages/               # JSON files (one per page)
│   └── summary.json         # Overview
│
└── godot/                   # The skill
    ├── SKILL.md             # Enhanced with real examples
    ├── references/          # Categorized docs
    │   ├── index.md
    │   ├── getting_started.md
    │   ├── scripting.md
    │   └── ...
    ├── scripts/             # Empty (add your own)
    └── assets/              # Empty (add your own)
```
## 🎯 Command Line Options

```bash
# Interactive mode
python3 cli/doc_scraper.py --interactive

# Use config file
python3 cli/doc_scraper.py --config configs/godot.json

# Quick mode
python3 cli/doc_scraper.py --name react --url https://react.dev/

# Skip scraping (use existing data)
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape

# With description
python3 cli/doc_scraper.py \
    --name react \
    --url https://react.dev/ \
    --description "React framework for building UIs"
```
## 💡 Tips

### 1. Test Small First

Edit `max_pages` in the config to test with just 20 pages:

```json
{
  "max_pages": 20
}
```
### 2. Reuse Scraped Data

```bash
# Scrape once
python3 cli/doc_scraper.py --config configs/react.json

# Rebuild multiple times (instant)
python3 cli/doc_scraper.py --config configs/react.json --skip-scrape
python3 cli/doc_scraper.py --config configs/react.json --skip-scrape
```
### 3. Finding Selectors

```python
# Test selectors in Python
import requests
from bs4 import BeautifulSoup

url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```
### 4. Check Output Quality

```bash
# After building, check:
cat output/godot/SKILL.md            # Should have real examples
cat output/godot/references/index.md # Categories
```
## 🐛 Troubleshooting

### No Content Extracted?

- Check your `main_content` selector
- Try: `article`, `main`, `div[role="main"]`
### Data Exists But Won't Use It?

```bash
# Force re-scrape
rm -rf output/myframework_data/
python3 cli/doc_scraper.py --config configs/myframework.json
```
### Categories Not Good?

Edit the config's categories section with better keywords.
### Want to Update Docs?

```bash
# Delete old data
rm -rf output/godot_data/

# Re-scrape
python3 cli/doc_scraper.py --config configs/godot.json
```
## 📈 Performance

| Task | Time | Notes |
|---|---|---|
| Scraping (sync) | 15-45 min | First time only, thread-based |
| Scraping (async) | 5-15 min | 2-3x faster with the `--async` flag |
| Building | 1-3 min | Fast! |
| Re-building | <1 min | With `--skip-scrape` |
| Packaging | 5-10 sec | Final zip |
## ✅ Summary
One tool does everything:
- ✅ Scrapes documentation
- ✅ Auto-detects existing data
- ✅ Generates better knowledge
- ✅ Creates enhanced skills
- ✅ Works with presets or custom configs
- ✅ Supports skip-scraping for fast iteration
Simple structure:
- `doc_scraper.py` - The tool
- `configs/` - Presets
- `output/` - Everything else
Better output:
- Real code examples with language detection
- Common patterns extracted from docs
- Smart categorization
- Enhanced SKILL.md with actual examples
## 📚 Documentation

### Getting Started
- BULLETPROOF_QUICKSTART.md - 🎯 START HERE if you're new!
- QUICKSTART.md - Quick start for experienced users
- TROUBLESHOOTING.md - Common issues and solutions
### Guides
- docs/LARGE_DOCUMENTATION.md - Handle 10K-40K+ page docs
- ASYNC_SUPPORT.md - Async mode guide (2-3x faster scraping)
- docs/ENHANCEMENT.md - AI enhancement guide
- docs/UPLOAD_GUIDE.md - How to upload skills to Claude
- docs/MCP_SETUP.md - MCP integration setup
### Technical
- docs/CLAUDE.md - Technical architecture
- STRUCTURE.md - Repository structure
## 🎮 Ready?

```bash
# Try Godot
python3 cli/doc_scraper.py --config configs/godot.json

# Try React
python3 cli/doc_scraper.py --config configs/react.json

# Or go interactive
python3 cli/doc_scraper.py --interactive
```
## 📝 License
MIT License - see LICENSE file for details
Happy skill building! 🚀
