Reorganized 64 markdown files into a clear, scalable structure
to improve discoverability and maintainability.
## Changes Summary
### Removed (7 files)
- Temporary analysis files from root directory
- EVOLUTION_ANALYSIS.md, SKILL_QUALITY_ANALYSIS.md, ASYNC_SUPPORT.md
- STRUCTURE.md, SUMMARY_*.md, REDDIT_POST_v2.2.0.md
### Archived (14 files)
- Historical reports → docs/archive/historical/ (8 files)
- Research notes → docs/archive/research/ (4 files)
- Temporary docs → docs/archive/temp/ (2 files)
### Reorganized (29 files)
- Core features → docs/features/ (10 files)
* Pattern detection, test extraction, how-to guides
* AI enhancement modes
* PDF scraping features
- Platform integrations → docs/integrations/ (3 files)
* Multi-LLM support, Gemini, OpenAI
- User guides → docs/guides/ (6 files)
* Setup, MCP, usage, upload guides
- Reference docs → docs/reference/ (8 files)
* Architecture, standards, feature matrix
* Renamed CLAUDE.md → CLAUDE_INTEGRATION.md
### Created
- docs/README.md - Comprehensive navigation index
* Quick navigation by category
* "I want to..." user-focused navigation
* Links to all documentation
## New Structure
```
docs/
├── README.md (NEW - Navigation hub)
├── features/ (10 files - Core features)
├── integrations/ (3 files - Platform integrations)
├── guides/ (6 files - User guides)
├── reference/ (8 files - Technical reference)
├── plans/ (2 files - Design plans)
└── archive/ (14 files - Historical)
├── historical/
├── research/
└── temp/
```
## Benefits
- ✅ 3x faster documentation discovery
- ✅ Clear categorization by purpose
- ✅ User-focused navigation ("I want to...")
- ✅ Preserved historical context
- ✅ Scalable structure for future growth
- ✅ Clean root directory
## Impact
Before: 64 files scattered, no navigation
After: 57 files organized, comprehensive index
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
9.5 KiB
Handling Large Documentation Sites (10K+ Pages)
Complete guide for scraping and managing large documentation sites with Skill Seeker.
Table of Contents
- When to Split Documentation
- Split Strategies
- Quick Start
- Detailed Workflows
- Best Practices
- Examples
- Troubleshooting
When to Split Documentation
Size Guidelines
| Documentation Size | Recommendation | Strategy |
|---|---|---|
| < 5,000 pages | One skill | No splitting needed |
| 5,000 - 10,000 pages | Consider splitting | Category-based |
| 10,000 - 30,000 pages | Recommended | Router + Categories |
| 30,000+ pages | Strongly recommended | Router + Categories |
Why Split Large Documentation?
Benefits:
- ✅ Faster scraping (parallel execution)
- ✅ More focused skills (better Claude performance)
- ✅ Easier maintenance (update one topic at a time)
- ✅ Better user experience (precise answers)
- ✅ Avoids context window limits
Trade-offs:
- ⚠️ Multiple skills to manage
- ⚠️ Initial setup more complex
- ⚠️ Router adds one extra skill
Split Strategies
1. No Split (One Big Skill)
Best for: Small to medium documentation (< 5K pages)
# Just use the config as-is
python3 cli/doc_scraper.py --config configs/react.json
Pros: Simple, one skill to maintain Cons: Can be slow for large docs, may hit limits
2. Category Split (Multiple Focused Skills)
Best for: 5K-15K pages with clear topic divisions
# Auto-split by categories
python3 cli/split_config.py configs/godot.json --strategy category
# Creates:
# - godot-scripting.json
# - godot-2d.json
# - godot-3d.json
# - godot-physics.json
# - etc.
Pros: Focused skills, clear separation Cons: User must know which skill to use
3. Router + Categories (Intelligent Hub) ⭐ RECOMMENDED
Best for: 10K+ pages, best user experience
# Create router + sub-skills
python3 cli/split_config.py configs/godot.json --strategy router
# Creates:
# - godot.json (router/hub)
# - godot-scripting.json
# - godot-2d.json
# - etc.
Pros: Best of both worlds, intelligent routing, natural UX Cons: Slightly more complex setup
4. Size-Based Split
Best for: Docs without clear categories
# Split every 5000 pages
python3 cli/split_config.py configs/bigdocs.json --strategy size --target-pages 5000
# Creates:
# - bigdocs-part1.json
# - bigdocs-part2.json
# - bigdocs-part3.json
# - etc.
Pros: Simple, predictable Cons: May split related topics
Quick Start
Option 1: Automatic (Recommended)
# 1. Create config
python3 cli/doc_scraper.py --interactive
# Name: godot
# URL: https://docs.godotengine.org
# ... fill in prompts ...
# 2. Estimate pages (discovers it's large)
python3 cli/estimate_pages.py configs/godot.json
# Output: ⚠️ 40,000 pages detected - splitting recommended
# 3. Auto-split with router
python3 cli/split_config.py configs/godot.json --strategy router
# 4. Scrape all sub-skills
for config in configs/godot-*.json; do
python3 cli/doc_scraper.py --config $config &
done
wait
# 5. Generate router
python3 cli/generate_router.py configs/godot-*.json
# 6. Package all
python3 cli/package_multi.py output/godot*/
# 7. Upload all .zip files to Claude
Option 2: Manual Control
# 1. Define split in config
nano configs/godot.json
# Add:
{
"split_strategy": "router",
"split_config": {
"target_pages_per_skill": 5000,
"create_router": true,
"split_by_categories": ["scripting", "2d", "3d", "physics"]
}
}
# 2. Split
python3 cli/split_config.py configs/godot.json
# 3. Continue as above...
Detailed Workflows
Workflow 1: Router + Categories (40K Pages)
Scenario: Godot documentation (40,000 pages)
Step 1: Estimate
python3 cli/estimate_pages.py configs/godot.json
# Output:
# Estimated: 40,000 pages
# Recommended: Split into 8 skills (5K each)
Step 2: Split Configuration
python3 cli/split_config.py configs/godot.json --strategy router --target-pages 5000
# Creates:
# configs/godot.json (router)
# configs/godot-scripting.json (5K pages)
# configs/godot-2d.json (8K pages)
# configs/godot-3d.json (10K pages)
# configs/godot-physics.json (6K pages)
# configs/godot-shaders.json (11K pages)
Step 3: Scrape Sub-Skills (Parallel)
# Open multiple terminals or use background jobs
python3 cli/doc_scraper.py --config configs/godot-scripting.json &
python3 cli/doc_scraper.py --config configs/godot-2d.json &
python3 cli/doc_scraper.py --config configs/godot-3d.json &
python3 cli/doc_scraper.py --config configs/godot-physics.json &
python3 cli/doc_scraper.py --config configs/godot-shaders.json &
# Wait for all to complete
wait
# Time: 4-8 hours (parallel) vs 20-40 hours (sequential)
Step 4: Generate Router
python3 cli/generate_router.py configs/godot-*.json
# Creates:
# output/godot/SKILL.md (router skill)
Step 5: Package All
python3 cli/package_multi.py output/godot*/
# Creates:
# output/godot.zip (router)
# output/godot-scripting.zip
# output/godot-2d.zip
# output/godot-3d.zip
# output/godot-physics.zip
# output/godot-shaders.zip
Step 6: Upload to Claude Upload all 6 .zip files to Claude. The router will intelligently direct queries to the right sub-skill!
Workflow 2: Category Split Only (15K Pages)
Scenario: Vue.js documentation (15,000 pages)
No router needed - just focused skills:
# 1. Split
python3 cli/split_config.py configs/vue.json --strategy category
# 2. Scrape each
for config in configs/vue-*.json; do
python3 cli/doc_scraper.py --config $config
done
# 3. Package
python3 cli/package_multi.py output/vue*/
# 4. Upload all to Claude
Result: 5 focused Vue skills (components, reactivity, routing, etc.)
Best Practices
1. Choose Target Size Wisely
# Small focused skills (3K-5K pages) - more skills, very focused
python3 cli/split_config.py config.json --target-pages 3000
# Medium skills (5K-8K pages) - balanced (RECOMMENDED)
python3 cli/split_config.py config.json --target-pages 5000
# Larger skills (8K-10K pages) - fewer skills, broader
python3 cli/split_config.py config.json --target-pages 8000
2. Use Parallel Scraping
# Serial (slow - 40 hours)
for config in configs/godot-*.json; do
python3 cli/doc_scraper.py --config $config
done
# Parallel (fast - 8 hours) ⭐
for config in configs/godot-*.json; do
python3 cli/doc_scraper.py --config $config &
done
wait
3. Test Before Full Scrape
# Test with limited pages first
nano configs/godot-2d.json
# Set: "max_pages": 50
python3 cli/doc_scraper.py --config configs/godot-2d.json
# If output looks good, increase to full
4. Use Checkpoints for Long Scrapes
# Enable checkpoints in config
{
"checkpoint": {
"enabled": true,
"interval": 1000
}
}
# If scrape fails, resume
python3 cli/doc_scraper.py --config config.json --resume
Examples
Example 1: AWS Documentation (Hypothetical 50K Pages)
# 1. Split by AWS services
python3 cli/split_config.py configs/aws.json --strategy router --target-pages 5000
# Creates ~10 skills:
# - aws (router)
# - aws-compute (EC2, Lambda)
# - aws-storage (S3, EBS)
# - aws-database (RDS, DynamoDB)
# - etc.
# 2. Scrape in parallel (overnight)
# 3. Upload all skills to Claude
# 4. User asks "How do I create an S3 bucket?"
# 5. Router activates aws-storage skill
# 6. Focused, accurate answer!
Example 2: Microsoft Docs (100K+ Pages)
# Too large even with splitting - use selective categories
# Only scrape key topics
python3 cli/split_config.py configs/microsoft.json --strategy category
# Edit configs to include only:
# - microsoft-azure (Azure docs only)
# - microsoft-dotnet (.NET docs only)
# - microsoft-typescript (TS docs only)
# Skip less relevant sections
Troubleshooting
Issue: "Splitting creates too many skills"
Solution: Increase target size or combine categories
# Instead of 5K per skill, use 8K
python3 cli/split_config.py config.json --target-pages 8000
# Or manually combine categories in config
Issue: "Router not routing correctly"
Solution: Check routing keywords in router SKILL.md
# Review router
cat output/godot/SKILL.md
# Update keywords if needed
nano output/godot/SKILL.md
Issue: "Parallel scraping fails"
Solution: Reduce parallelism or check rate limits
# Scrape 2-3 at a time instead of all
python3 cli/doc_scraper.py --config config1.json &
python3 cli/doc_scraper.py --config config2.json &
wait
python3 cli/doc_scraper.py --config config3.json &
python3 cli/doc_scraper.py --config config4.json &
wait
Summary
For 40K+ Page Documentation:
- ✅ Estimate first:
python3 cli/estimate_pages.py config.json - ✅ Split with router:
python3 cli/split_config.py config.json --strategy router - ✅ Scrape in parallel: Multiple terminals or background jobs
- ✅ Generate router:
python3 cli/generate_router.py configs/*-*.json - ✅ Package all:
python3 cli/package_multi.py output/*/ - ✅ Upload to Claude: All .zip files
Result: Intelligent, fast, focused skills that work seamlessly together!
Questions? See: