# Scraping Guide > **Skill Seekers v3.1.4** > **Complete guide to all scraping options** --- ## Overview Skill Seekers can extract knowledge from four types of sources: | Source | Command | Best For | |--------|---------|----------| | **Documentation** | `create ` | Web docs, tutorials, API refs | | **GitHub** | `create ` | Source code, issues, releases | | **PDF** | `create ` | Manuals, papers, reports | | **Local** | `create <./path>` | Your projects, internal code | --- ## Documentation Scraping ### Basic Usage ```bash # Auto-detect and scrape skill-seekers create https://docs.react.dev/ # With custom name skill-seekers create https://docs.react.dev/ --name react-docs # With description skill-seekers create https://docs.react.dev/ \ --description "React JavaScript library documentation" ``` ### Using Preset Configs ```bash # List available presets skill-seekers estimate --all # Use preset skill-seekers create --config react skill-seekers create --config django skill-seekers create --config fastapi ``` **Available presets:** See `configs/` directory in repository. ### Custom Configuration All configs must use the unified format with a `sources` array (since v2.11.0): ```bash # Create config file cat > configs/my-docs.json << 'EOF' { "name": "my-framework", "description": "My framework documentation", "sources": [ { "type": "documentation", "base_url": "https://docs.example.com/", "max_pages": 200, "rate_limit": 0.5, "selectors": { "main_content": "article", "title": "h1" }, "url_patterns": { "include": ["/docs/", "/api/"], "exclude": ["/blog/", "/search"] } } ] } EOF # Use config skill-seekers create --config configs/my-docs.json ``` > **Note:** Omit `main_content` from `selectors` to let Skill Seekers auto-detect > the best content element (`main`, `article`, `div[role="main"]`, etc.). See [Config Format](../reference/CONFIG_FORMAT.md) for all options. ### Advanced Options ```bash # Limit pages (for testing) skill-seekers create --max-pages 50 # Adjust rate limit skill-seekers create --rate-limit 1.0 # Parallel workers (faster) skill-seekers create --workers 5 --async # Dry run (preview) skill-seekers create --dry-run # Resume interrupted skill-seekers create --resume # Fresh start (ignore cache) skill-seekers create --fresh ``` --- ## GitHub Repository Scraping ### Basic Usage ```bash # By repo name skill-seekers create facebook/react # With explicit flag skill-seekers github --repo facebook/react # With custom name skill-seekers github --repo facebook/react --name react-source ``` ### With GitHub Token ```bash # Set token for higher rate limits export GITHUB_TOKEN=ghp_... # Use token skill-seekers github --repo facebook/react ``` **Benefits of token:** - 5000 requests/hour vs 60 - Access to private repos - Higher GraphQL limits ### What Gets Extracted | Data | Default | Flag to Disable | |------|---------|-----------------| | Source code | ✅ | `--scrape-only` | | README | ✅ | - | | Issues | ✅ | `--no-issues` | | Releases | ✅ | `--no-releases` | | Changelog | ✅ | `--no-changelog` | ### Control What to Fetch ```bash # Skip issues (faster) skill-seekers github --repo facebook/react --no-issues # Limit issues skill-seekers github --repo facebook/react --max-issues 50 # Scrape only (no build) skill-seekers github --repo facebook/react --scrape-only # Non-interactive (CI/CD) skill-seekers github --repo facebook/react --non-interactive ``` --- ## PDF Extraction ### Basic Usage ```bash # Direct file skill-seekers create manual.pdf --name product-manual # With explicit command skill-seekers pdf --pdf manual.pdf --name docs ``` ### OCR for Scanned PDFs ```bash # Enable OCR skill-seekers pdf --pdf scanned.pdf --enable-ocr ``` **Requirements:** ```bash pip install skill-seekers[pdf-ocr] # Also requires: tesseract-ocr (system package) ``` ### Password-Protected PDFs ```bash # In config file { "name": "secure-docs", "pdf_path": "protected.pdf", "password": "secret123" } ``` ### Page Range ```bash # Extract specific pages (via config) { "pdf_path": "manual.pdf", "page_range": [1, 100] } ``` --- ## Local Codebase Analysis ### Basic Usage ```bash # Current directory skill-seekers create . # Specific directory skill-seekers create ./my-project # With explicit command skill-seekers analyze --directory ./my-project ``` ### Analysis Presets ```bash # Quick analysis (1-2 min) skill-seekers analyze --directory ./my-project --preset quick # Standard analysis (5-10 min) - default skill-seekers analyze --directory ./my-project --preset standard # Comprehensive (20-60 min) skill-seekers analyze --directory ./my-project --preset comprehensive ``` ### What Gets Analyzed | Feature | Quick | Standard | Comprehensive | |---------|-------|----------|---------------| | Code structure | ✅ | ✅ | ✅ | | API extraction | ✅ | ✅ | ✅ | | Comments | - | ✅ | ✅ | | Patterns | - | ✅ | ✅ | | Test examples | - | - | ✅ | | How-to guides | - | - | ✅ | | Config patterns | - | - | ✅ | ### Language Filtering ```bash # Specific languages skill-seekers analyze --directory ./my-project \ --languages Python,JavaScript # File patterns skill-seekers analyze --directory ./my-project \ --file-patterns "*.py,*.js" ``` ### Skip Features ```bash # Skip heavy features skill-seekers analyze --directory ./my-project \ --skip-dependency-graph \ --skip-patterns \ --skip-test-examples ``` --- ## Common Scraping Patterns ### Pattern 1: Test First ```bash # Dry run to preview skill-seekers create --dry-run # Small test scrape skill-seekers create --max-pages 10 # Full scrape skill-seekers create ``` ### Pattern 2: Iterative Development ```bash # Scrape without enhancement (fast) skill-seekers create --enhance-level 0 # Review output ls output/my-skill/ cat output/my-skill/SKILL.md # Enhance later skill-seekers enhance output/my-skill/ ``` ### Pattern 3: Parallel Processing ```bash # Fast async scraping skill-seekers create --async --workers 5 # Even faster (be careful with rate limits) skill-seekers create --async --workers 10 --rate-limit 0.2 ``` ### Pattern 4: Resume Capability ```bash # Start scraping skill-seekers create # ...interrupted... # Resume later skill-seekers resume --list skill-seekers resume ``` --- ## Troubleshooting Scraping ### "No content extracted" **Problem:** Wrong CSS selectors **Solution:** ```bash # First, try without a main_content selector (auto-detection) # The scraper tries: main, div[role="main"], article, .content, etc. skill-seekers create --dry-run # If auto-detection fails, find the correct selector: curl -s | grep -i 'article\|main\|content' # Then specify it in your config's source: { "sources": [{ "type": "documentation", "base_url": "https://...", "selectors": { "main_content": "div.content" } }] } ``` ### "Rate limit exceeded" **Problem:** Too many requests **Solution:** ```bash # Slow down skill-seekers create --rate-limit 2.0 # Or use GitHub token for GitHub repos export GITHUB_TOKEN=ghp_... ``` ### "Too many pages" **Problem:** Site is larger than expected **Solution:** ```bash # Estimate first skill-seekers estimate configs/my-config.json # Limit pages skill-seekers create --max-pages 100 # Adjust URL patterns { "url_patterns": { "exclude": ["/blog/", "/archive/", "/search"] } } ``` ### "Memory error" **Problem:** Site too large for memory **Solution:** ```bash # Use streaming mode skill-seekers create --streaming # Or smaller chunks skill-seekers create --chunk-tokens 500 ``` --- ## Performance Tips | Tip | Command | Impact | |-----|---------|--------| | Use presets | `--config react` | Faster setup | | Async mode | `--async --workers 5` | 3-5x faster | | Skip enhancement | `--enhance-level 0` | Skip 60 sec | | Use cache | `--skip-scrape` | Instant rebuild | | Resume | `--resume` | Continue interrupted | --- ## Next Steps - [Enhancement Guide](03-enhancement.md) - Improve skill quality - [Packaging Guide](04-packaging.md) - Export to platforms - [Config Format](../reference/CONFIG_FORMAT.md) - Advanced configuration