Files
skill-seekers-reference/README.md
yusyus 319331f5a6 feat: Complete refactoring with async support, type safety, and package structure
This comprehensive refactoring improves code quality, performance, and maintainability
while maintaining 100% backwards compatibility.

## Major Features Added

### 🚀 Async/Await Support (2-3x Performance Boost)
- Added `--async` flag for parallel scraping using asyncio
- Implemented `scrape_page_async()` with httpx.AsyncClient
- Implemented `scrape_all_async()` with asyncio.gather()
- Connection pooling for better resource management
- Performance: 18 pg/s → 55 pg/s (3x faster)
- Memory: 120 MB → 40 MB (66% reduction)
- Full documentation in ASYNC_SUPPORT.md

### 📦 Python Package Structure (Phase 0 Complete)
- Created cli/__init__.py for clean imports
- Created skill_seeker_mcp/__init__.py (renamed from mcp/)
- Created skill_seeker_mcp/tools/__init__.py
- Proper package imports: `from cli import constants`
- Better IDE support and autocomplete

### ⚙️ Centralized Configuration
- Created cli/constants.py with 18 configuration constants
- DEFAULT_ASYNC_MODE, DEFAULT_RATE_LIMIT, DEFAULT_MAX_PAGES
- Enhancement limits, categorization scores, file limits
- All magic numbers now centralized and configurable

### 🔧 Code Quality Improvements
- Converted 71 print() statements to proper logging
- Added type hints to all DocToSkillConverter methods
- Fixed all mypy type checking issues
- Installed types-requests for better type safety
- Code quality: 5.5/10 → 6.5/10

## Testing
- Test count: 207 → 299 tests (92 new tests)
- 11 comprehensive async tests (all passing)
- 16 constants tests (all passing)
- Fixed test isolation issues
- 100% pass rate maintained (299/299 passing)

## Documentation
- Updated README.md with async examples and test count
- Updated CLAUDE.md with async usage guide
- Created ASYNC_SUPPORT.md (292 lines)
- Updated CHANGELOG.md with all changes
- Cleaned up temporary refactoring documents

## Cleanup
- Removed temporary planning/status documents
- Moved test_pr144_concerns.py to tests/ folder
- Updated .gitignore for test artifacts
- Better repository organization

## Breaking Changes
None - all changes are backwards compatible.
Async mode is opt-in via --async flag.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 13:05:39 +03:00

906 lines
25 KiB
Markdown

[![MseeP.ai Security Assessment Badge](https://mseep.net/pr/yusufkaraaslan-skill-seekers-badge.png)](https://mseep.ai/app/yusufkaraaslan-skill-seekers)
# Skill Seeker
[![Version](https://img.shields.io/badge/version-1.2.0-blue.svg)](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v1.2.0)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![MCP Integration](https://img.shields.io/badge/MCP-Integrated-blue.svg)](https://modelcontextprotocol.io)
[![Tested](https://img.shields.io/badge/Tests-299%20Passing-brightgreen.svg)](tests/)
[![Project Board](https://img.shields.io/badge/Project-Board-purple.svg)](https://github.com/users/yusufkaraaslan/projects/2)
**Automatically convert any documentation website into a Claude AI skill in minutes.**
> 📋 **[View Development Roadmap & Tasks](https://github.com/users/yusufkaraaslan/projects/2)** - 134 tasks across 10 categories, pick any to contribute!
## What is Skill Seeker?
Skill Seeker is an automated tool that transforms any documentation website into a production-ready [Claude AI skill](https://claude.ai). Instead of manually reading and summarizing documentation, Skill Seeker:
1. **Scrapes** documentation websites automatically
2. **Organizes** content into categorized reference files
3. **Enhances** with AI to extract best examples and key concepts
4. **Packages** everything into an uploadable `.zip` file for Claude
**Result:** Get comprehensive Claude skills for any framework, API, or tool in 20-40 minutes instead of hours of manual work.
## Why Use This?
- 🎯 **For Developers**: Quickly create Claude skills for your favorite frameworks (React, Vue, Django, etc.)
- 🎮 **For Game Devs**: Generate skills for game engines (Godot, Unity documentation, etc.)
- 🔧 **For Teams**: Create internal documentation skills for your company's APIs
- 📚 **For Learners**: Build comprehensive reference skills for technologies you're learning
## Key Features
### 🌐 Documentation Scraping
-**llms.txt Support** - Automatically detects and uses LLM-ready documentation files (10x faster)
-**Universal Scraper** - Works with ANY documentation website
-**Smart Categorization** - Automatically organizes content by topic
-**Code Language Detection** - Recognizes Python, JavaScript, C++, GDScript, etc.
-**8 Ready-to-Use Presets** - Godot, React, Vue, Django, FastAPI, and more
### 📄 PDF Support (**v1.2.0**)
-**Basic PDF Extraction** - Extract text, code, and images from PDF files
-**OCR for Scanned PDFs** - Extract text from scanned documents
-**Password-Protected PDFs** - Handle encrypted PDFs
-**Table Extraction** - Extract complex tables from PDFs
-**Parallel Processing** - 3x faster for large PDFs
-**Intelligent Caching** - 50% faster on re-runs
### 🤖 AI & Enhancement
-**AI-Powered Enhancement** - Transforms basic templates into comprehensive guides
-**No API Costs** - FREE local enhancement using Claude Code Max
-**MCP Server for Claude Code** - Use directly from Claude Code with natural language
### ⚡ Performance & Scale
-**Async Mode** - 2-3x faster scraping with async/await (use `--async` flag)
-**Large Documentation Support** - Handle 10K-40K+ page docs with intelligent splitting
-**Router/Hub Skills** - Intelligent routing to specialized sub-skills
-**Parallel Scraping** - Process multiple skills simultaneously
-**Checkpoint/Resume** - Never lose progress on long scrapes
-**Caching System** - Scrape once, rebuild instantly
### ✅ Quality Assurance
-**Fully Tested** - 299 tests with 100% pass rate
## Quick Example
### Option 1: Use from Claude Code (Recommended)
```bash
# One-time setup (5 minutes)
./setup_mcp.sh
# Then in Claude Code, just ask:
"Generate a React skill from https://react.dev/"
"Scrape PDF at docs/manual.pdf and create skill"
```
**Time:** Automated | **Quality:** Production-ready | **Cost:** Free
### Option 2: Use CLI Directly (HTML Docs)
```bash
# Install dependencies (2 pip packages)
pip3 install requests beautifulsoup4
# Generate a React skill in one command
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# Upload output/react.zip to Claude - Done!
```
**Time:** ~25 minutes | **Quality:** Production-ready | **Cost:** Free
### Option 3: Use CLI for PDF Documentation
```bash
# Install PDF support
pip3 install PyMuPDF
# Basic PDF extraction
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill
# Advanced features
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill \
--extract-tables \ # Extract tables
--parallel \ # Fast parallel processing
--workers 8 # Use 8 CPU cores
# Scanned PDFs (requires: pip install pytesseract Pillow)
python3 cli/pdf_scraper.py --pdf docs/scanned.pdf --name myskill --ocr
# Password-protected PDFs
python3 cli/pdf_scraper.py --pdf docs/encrypted.pdf --name myskill --password mypassword
# Upload output/myskill.zip to Claude - Done!
```
**Time:** ~5-15 minutes (or 2-5 minutes with parallel) | **Quality:** Production-ready | **Cost:** Free
**Advanced Features:**
- ✅ OCR for scanned PDFs (requires pytesseract)
- ✅ Password-protected PDF support
- ✅ Table extraction
- ✅ Parallel processing (3x faster)
- ✅ Intelligent caching
## How It Works
```mermaid
graph LR
A[Documentation Website] --> B[Skill Seeker]
B --> C[Scraper]
B --> D[AI Enhancement]
B --> E[Packager]
C --> F[Organized References]
D --> F
F --> E
E --> G[Claude Skill .zip]
G --> H[Upload to Claude AI]
```
0. **Detect llms.txt** - Checks for llms-full.txt, llms.txt, llms-small.txt first
1. **Scrape**: Extracts all pages from documentation
2. **Categorize**: Organizes content into topics (API, guides, tutorials, etc.)
3. **Enhance**: AI analyzes docs and creates comprehensive SKILL.md with examples
4. **Package**: Bundles everything into a Claude-ready `.zip` file
## 📋 Prerequisites
**Before you start, make sure you have:**
1. **Python 3.10 or higher** - [Download](https://www.python.org/downloads/) | Check: `python3 --version`
2. **Git** - [Download](https://git-scm.com/) | Check: `git --version`
3. **15-30 minutes** for first-time setup
**First time user?****[Start Here: Bulletproof Quick Start Guide](BULLETPROOF_QUICKSTART.md)** 🎯
This guide walks you through EVERYTHING step-by-step (Python install, git clone, first skill creation).
---
## 🚀 Quick Start
### Method 1: MCP Server for Claude Code (Easiest)
Use Skill Seeker directly from Claude Code with natural language!
```bash
# Clone repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
# One-time setup (5 minutes)
./setup_mcp.sh
# Restart Claude Code, then just ask:
```
**In Claude Code:**
```
List all available configs
Generate config for Tailwind at https://tailwindcss.com/docs
Scrape docs using configs/react.json
Package skill at output/react/
```
**Benefits:**
- ✅ No manual CLI commands
- ✅ Natural language interface
- ✅ Integrated with your workflow
- ✅ 9 tools available instantly (includes automatic upload!)
-**Tested and working** in production
**Full guides:**
- 📘 [MCP Setup Guide](docs/MCP_SETUP.md) - Complete installation instructions
- 🧪 [MCP Testing Guide](docs/TEST_MCP_IN_CLAUDE_CODE.md) - Test all 9 tools
- 📦 [Large Documentation Guide](docs/LARGE_DOCUMENTATION.md) - Handle 10K-40K+ pages
- 📤 [Upload Guide](docs/UPLOAD_GUIDE.md) - How to upload skills to Claude
### Method 2: CLI (Traditional)
#### One-Time Setup: Create Virtual Environment
```bash
# Clone repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate # macOS/Linux
# OR on Windows: venv\Scripts\activate
# Install dependencies
pip install requests beautifulsoup4 pytest
# Save dependencies
pip freeze > requirements.txt
# Optional: Install anthropic for API-based enhancement (not needed for LOCAL enhancement)
# pip install anthropic
```
**Always activate the virtual environment before using Skill Seeker:**
```bash
source venv/bin/activate # Run this each time you start a new terminal session
```
#### Easiest: Use a Preset
```bash
# Make sure venv is activated (you should see (venv) in your prompt)
source venv/bin/activate
# Optional: Estimate pages first (fast, 1-2 minutes)
python3 cli/estimate_pages.py configs/godot.json
# Use Godot preset
python3 cli/doc_scraper.py --config configs/godot.json
# Use React preset
python3 cli/doc_scraper.py --config configs/react.json
# See all presets
ls configs/
```
### Interactive Mode
```bash
python3 cli/doc_scraper.py --interactive
```
### Quick Mode
```bash
python3 cli/doc_scraper.py \
--name react \
--url https://react.dev/ \
--description "React framework for UIs"
```
## 📤 Uploading Skills to Claude
Once your skill is packaged, you need to upload it to Claude:
### Option 1: Automatic Upload (API-based)
```bash
# Set your API key (one-time)
export ANTHROPIC_API_KEY=sk-ant-...
# Package and upload automatically
python3 cli/package_skill.py output/react/ --upload
# OR upload existing .zip
python3 cli/upload_skill.py output/react.zip
```
**Benefits:**
- ✅ Fully automatic
- ✅ No manual steps
- ✅ Works from command line
**Requirements:**
- Anthropic API key (get from https://console.anthropic.com/)
### Option 2: Manual Upload (No API Key)
```bash
# Package skill
python3 cli/package_skill.py output/react/
# This will:
# 1. Create output/react.zip
# 2. Open the output/ folder automatically
# 3. Show upload instructions
# Then manually upload:
# - Go to https://claude.ai/skills
# - Click "Upload Skill"
# - Select output/react.zip
# - Done!
```
**Benefits:**
- ✅ No API key needed
- ✅ Works for everyone
- ✅ Folder opens automatically
### Option 3: Claude Code (MCP) - Smart & Automatic
```
In Claude Code, just ask:
"Package and upload the React skill"
# With API key set:
# - Packages the skill
# - Uploads to Claude automatically
# - Done! ✅
# Without API key:
# - Packages the skill
# - Shows where to find the .zip
# - Provides manual upload instructions
```
**Benefits:**
- ✅ Natural language
- ✅ Smart auto-detection (uploads if API key available)
- ✅ Works with or without API key
- ✅ No errors or failures
---
## 📁 Simple Structure
```
doc-to-skill/
├── cli/
│ ├── doc_scraper.py # Main scraping tool
│ ├── package_skill.py # Package to .zip
│ ├── upload_skill.py # Auto-upload (API)
│ └── enhance_skill.py # AI enhancement
├── mcp/ # MCP server for Claude Code
│ └── server.py # 9 MCP tools
├── configs/ # Preset configurations
│ ├── godot.json # Godot Engine
│ ├── react.json # React
│ ├── vue.json # Vue.js
│ ├── django.json # Django
│ └── fastapi.json # FastAPI
└── output/ # All output (auto-created)
├── godot_data/ # Scraped data
├── godot/ # Built skill
└── godot.zip # Packaged skill
```
## ✨ Features
### 1. Fast Page Estimation (NEW!)
```bash
python3 cli/estimate_pages.py configs/react.json
# Output:
📊 ESTIMATION RESULTS
✅ Pages Discovered: 180
📈 Estimated Total: 230
⏱️ Time Elapsed: 1.2 minutes
💡 Recommended max_pages: 280
```
**Benefits:**
- Know page count BEFORE scraping (saves time)
- Validates URL patterns work correctly
- Estimates total scraping time
- Recommends optimal `max_pages` setting
- Fast (1-2 minutes vs 20-40 minutes full scrape)
### 2. Auto-Detect Existing Data
```bash
python3 cli/doc_scraper.py --config configs/godot.json
# If data exists:
✓ Found existing data: 245 pages
Use existing data? (y/n): y
⏭️ Skipping scrape, using existing data
```
### 3. Knowledge Generation
**Automatic pattern extraction:**
- Extracts common code patterns from docs
- Detects programming language
- Creates quick reference with real examples
- Smarter categorization with scoring
**Enhanced SKILL.md:**
- Real code examples from documentation
- Language-annotated code blocks
- Common patterns section
- Quick reference from actual usage examples
### 4. Smart Categorization
Automatically infers categories from:
- URL structure
- Page titles
- Content keywords
- With scoring for better accuracy
### 5. Code Language Detection
```python
# Automatically detects:
- Python (def, import, from)
- JavaScript (const, let, =>)
- GDScript (func, var, extends)
- C++ (#include, int main)
- And more...
```
### 5. Skip Scraping
```bash
# Scrape once
python3 cli/doc_scraper.py --config configs/react.json
# Later, just rebuild (instant)
python3 cli/doc_scraper.py --config configs/react.json --skip-scrape
```
### 6. Async Mode for Faster Scraping (2-3x Speed!)
```bash
# Enable async mode with 8 workers (recommended for large docs)
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8
# Small docs (~100-500 pages)
python3 cli/doc_scraper.py --config configs/mydocs.json --async --workers 4
# Large docs (2000+ pages) with no rate limiting
python3 cli/doc_scraper.py --config configs/largedocs.json --async --workers 8 --no-rate-limit
```
**Performance Comparison:**
- **Sync mode (threads):** ~18 pages/sec, 120 MB memory
- **Async mode:** ~55 pages/sec, 40 MB memory
- **Result:** 3x faster, 66% less memory!
**When to use:**
- ✅ Large documentation (500+ pages)
- ✅ Network latency is high
- ✅ Memory is constrained
- ❌ Small docs (< 100 pages) - overhead not worth it
**See full guide:** [ASYNC_SUPPORT.md](ASYNC_SUPPORT.md)
### 7. AI-Powered SKILL.md Enhancement
```bash
# Option 1: During scraping (API-based, requires API key)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/doc_scraper.py --config configs/react.json --enhance
# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 cli/doc_scraper.py --config configs/react.json --enhance-local
# Option 3: After scraping (API-based, standalone)
python3 cli/enhance_skill.py output/react/
# Option 4: After scraping (LOCAL, no API key, standalone)
python3 cli/enhance_skill_local.py output/react/
```
**What it does:**
- Reads your reference documentation
- Uses Claude to generate an excellent SKILL.md
- Extracts best code examples (5-10 practical examples)
- Creates comprehensive quick reference
- Adds domain-specific key concepts
- Provides navigation guidance for different skill levels
- Automatically backs up original
- **Quality:** Transforms 75-line templates into 500+ line comprehensive guides
**LOCAL Enhancement (Recommended):**
- Uses your Claude Code Max plan (no API costs)
- Opens new terminal with Claude Code
- Analyzes reference files automatically
- Takes 30-60 seconds
- Quality: 9/10 (comparable to API version)
### 7. Large Documentation Support (10K-40K+ Pages)
**For massive documentation sites like Godot (40K pages), AWS, or Microsoft Docs:**
```bash
# 1. Estimate first (discover page count)
python3 cli/estimate_pages.py configs/godot.json
# 2. Auto-split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router
# Creates:
# - godot-scripting.json (5K pages)
# - godot-2d.json (8K pages)
# - godot-3d.json (10K pages)
# - godot-physics.json (6K pages)
# - godot-shaders.json (11K pages)
# 3. Scrape all in parallel (4-8 hours instead of 20-40!)
for config in configs/godot-*.json; do
python3 cli/doc_scraper.py --config $config &
done
wait
# 4. Generate intelligent router/hub skill
python3 cli/generate_router.py configs/godot-*.json
# 5. Package all skills
python3 cli/package_multi.py output/godot*/
# 6. Upload all .zip files to Claude
# Users just ask questions naturally!
# Router automatically directs to the right sub-skill!
```
**Split Strategies:**
- **auto** - Intelligently detects best strategy based on page count
- **category** - Split by documentation categories (scripting, 2d, 3d, etc.)
- **router** - Create hub skill + specialized sub-skills (RECOMMENDED)
- **size** - Split every N pages (for docs without clear categories)
**Benefits:**
- ✅ Faster scraping (parallel execution)
- ✅ More focused skills (better Claude performance)
- ✅ Easier maintenance (update one topic at a time)
- ✅ Natural user experience (router handles routing)
- ✅ Avoids context window limits
**Configuration:**
```json
{
"name": "godot",
"max_pages": 40000,
"split_strategy": "router",
"split_config": {
"target_pages_per_skill": 5000,
"create_router": true,
"split_by_categories": ["scripting", "2d", "3d", "physics"]
}
}
```
**Full Guide:** [Large Documentation Guide](docs/LARGE_DOCUMENTATION.md)
### 8. Checkpoint/Resume for Long Scrapes
**Never lose progress on long-running scrapes:**
```bash
# Enable in config
{
"checkpoint": {
"enabled": true,
"interval": 1000 // Save every 1000 pages
}
}
# If scrape is interrupted (Ctrl+C or crash)
python3 cli/doc_scraper.py --config configs/godot.json --resume
# Resume from last checkpoint
✅ Resuming from checkpoint (12,450 pages scraped)
⏭️ Skipping 12,450 already-scraped pages
🔄 Continuing from where we left off...
# Start fresh (clear checkpoint)
python3 cli/doc_scraper.py --config configs/godot.json --fresh
```
**Benefits:**
- ✅ Auto-saves every 1000 pages (configurable)
- ✅ Saves on interruption (Ctrl+C)
- ✅ Resume with `--resume` flag
- ✅ Never lose hours of scraping progress
## 🎯 Complete Workflows
### First Time (With Scraping + Enhancement)
```bash
# 1. Scrape + Build + AI Enhancement (LOCAL, no API key)
python3 cli/doc_scraper.py --config configs/godot.json --enhance-local
# 2. Wait for new terminal to close (enhancement completes)
# Check the enhanced SKILL.md:
cat output/godot/SKILL.md
# 3. Package
python3 cli/package_skill.py output/godot/
# 4. Done! You have godot.zip with excellent SKILL.md
```
**Time:** 20-40 minutes (scraping) + 60 seconds (enhancement) = ~21-41 minutes
### Using Existing Data (Fast!)
```bash
# 1. Use cached data + Local Enhancement
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
python3 cli/enhance_skill_local.py output/godot/
# 2. Package
python3 cli/package_skill.py output/godot/
# 3. Done!
```
**Time:** 1-3 minutes (build) + 60 seconds (enhancement) = ~2-4 minutes total
### Without Enhancement (Basic)
```bash
# 1. Scrape + Build (no enhancement)
python3 cli/doc_scraper.py --config configs/godot.json
# 2. Package
python3 cli/package_skill.py output/godot/
# 3. Done! (SKILL.md will be basic template)
```
**Time:** 20-40 minutes
**Note:** SKILL.md will be generic - enhancement strongly recommended!
## 📋 Available Presets
| Config | Framework | Description |
|--------|-----------|-------------|
| `godot.json` | Godot Engine | Game development |
| `react.json` | React | UI framework |
| `vue.json` | Vue.js | Progressive framework |
| `django.json` | Django | Python web framework |
| `fastapi.json` | FastAPI | Modern Python API |
| `ansible-core.json` | Ansible Core 2.19 | Automation & configuration |
### Using Presets
```bash
# Godot
python3 cli/doc_scraper.py --config configs/godot.json
# React
python3 cli/doc_scraper.py --config configs/react.json
# Vue
python3 cli/doc_scraper.py --config configs/vue.json
# Django
python3 cli/doc_scraper.py --config configs/django.json
# FastAPI
python3 cli/doc_scraper.py --config configs/fastapi.json
# Ansible
python3 cli/doc_scraper.py --config configs/ansible-core.json
```
## 🎨 Creating Your Own Config
### Option 1: Interactive
```bash
python3 cli/doc_scraper.py --interactive
# Follow prompts, it will create the config for you
```
### Option 2: Copy and Edit
```bash
# Copy a preset
cp configs/react.json configs/myframework.json
# Edit it
nano configs/myframework.json
# Use it
python3 cli/doc_scraper.py --config configs/myframework.json
```
### Config Structure
```json
{
"name": "myframework",
"description": "When to use this skill",
"base_url": "https://docs.myframework.com/",
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/docs", "/guide"],
"exclude": ["/blog", "/about"]
},
"categories": {
"getting_started": ["intro", "quickstart"],
"api": ["api", "reference"]
},
"rate_limit": 0.5,
"max_pages": 500
}
```
## 📊 What Gets Created
```
output/
├── godot_data/ # Scraped raw data
│ ├── pages/ # JSON files (one per page)
│ └── summary.json # Overview
└── godot/ # The skill
├── SKILL.md # Enhanced with real examples
├── references/ # Categorized docs
│ ├── index.md
│ ├── getting_started.md
│ ├── scripting.md
│ └── ...
├── scripts/ # Empty (add your own)
└── assets/ # Empty (add your own)
```
## 🎯 Command Line Options
```bash
# Interactive mode
python3 cli/doc_scraper.py --interactive
# Use config file
python3 cli/doc_scraper.py --config configs/godot.json
# Quick mode
python3 cli/doc_scraper.py --name react --url https://react.dev/
# Skip scraping (use existing data)
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# With description
python3 cli/doc_scraper.py \
--name react \
--url https://react.dev/ \
--description "React framework for building UIs"
```
## 💡 Tips
### 1. Test Small First
Edit `max_pages` in config to test:
```json
{
"max_pages": 20 // Test with just 20 pages
}
```
### 2. Reuse Scraped Data
```bash
# Scrape once
python3 cli/doc_scraper.py --config configs/react.json
# Rebuild multiple times (instant)
python3 cli/doc_scraper.py --config configs/react.json --skip-scrape
python3 cli/doc_scraper.py --config configs/react.json --skip-scrape
```
### 3. Finding Selectors
```python
# Test in Python
from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```
### 4. Check Output Quality
```bash
# After building, check:
cat output/godot/SKILL.md # Should have real examples
cat output/godot/references/index.md # Categories
```
## 🐛 Troubleshooting
### No Content Extracted?
- Check your `main_content` selector
- Try: `article`, `main`, `div[role="main"]`
### Data Exists But Won't Use It?
```bash
# Force re-scrape
rm -rf output/myframework_data/
python3 cli/doc_scraper.py --config configs/myframework.json
```
### Categories Not Good?
Edit the config `categories` section with better keywords.
### Want to Update Docs?
```bash
# Delete old data
rm -rf output/godot_data/
# Re-scrape
python3 cli/doc_scraper.py --config configs/godot.json
```
## 📈 Performance
| Task | Time | Notes |
|------|------|-------|
| Scraping (sync) | 15-45 min | First time only, thread-based |
| Scraping (async) | 5-15 min | 2-3x faster with --async flag |
| Building | 1-3 min | Fast! |
| Re-building | <1 min | With --skip-scrape |
| Packaging | 5-10 sec | Final zip |
## ✅ Summary
**One tool does everything:**
1. ✅ Scrapes documentation
2. ✅ Auto-detects existing data
3. ✅ Generates better knowledge
4. ✅ Creates enhanced skills
5. ✅ Works with presets or custom configs
6. ✅ Supports skip-scraping for fast iteration
**Simple structure:**
- `doc_scraper.py` - The tool
- `configs/` - Presets
- `output/` - Everything else
**Better output:**
- Real code examples with language detection
- Common patterns extracted from docs
- Smart categorization
- Enhanced SKILL.md with actual examples
## 📚 Documentation
### Getting Started
- **[BULLETPROOF_QUICKSTART.md](BULLETPROOF_QUICKSTART.md)** - 🎯 **START HERE** if you're new!
- **[QUICKSTART.md](QUICKSTART.md)** - Quick start for experienced users
- **[TROUBLESHOOTING.md](TROUBLESHOOTING.md)** - Common issues and solutions
### Guides
- **[docs/LARGE_DOCUMENTATION.md](docs/LARGE_DOCUMENTATION.md)** - Handle 10K-40K+ page docs
- **[ASYNC_SUPPORT.md](ASYNC_SUPPORT.md)** - Async mode guide (2-3x faster scraping)
- **[docs/ENHANCEMENT.md](docs/ENHANCEMENT.md)** - AI enhancement guide
- **[docs/UPLOAD_GUIDE.md](docs/UPLOAD_GUIDE.md)** - How to upload skills to Claude
- **[docs/MCP_SETUP.md](docs/MCP_SETUP.md)** - MCP integration setup
### Technical
- **[docs/CLAUDE.md](docs/CLAUDE.md)** - Technical architecture
- **[STRUCTURE.md](STRUCTURE.md)** - Repository structure
## 🎮 Ready?
```bash
# Try Godot
python3 cli/doc_scraper.py --config configs/godot.json
# Try React
python3 cli/doc_scraper.py --config configs/react.json
# Or go interactive
python3 cli/doc_scraper.py --interactive
```
## 📝 License
MIT License - see [LICENSE](LICENSE) file for details
---
Happy skill building! 🚀