Update 32 documentation files across English and Chinese (zh-CN) docs to reflect the 10 new source types added in the previous commit. Updated files: - README.md, README.zh-CN.md — taglines, feature lists, examples, install extras - docs/reference/ — CLI_REFERENCE, FEATURE_MATRIX, MCP_REFERENCE, CONFIG_FORMAT, API_REFERENCE - docs/features/ — UNIFIED_SCRAPING with generic merge docs - docs/advanced/ — multi-source guide, MCP server guide - docs/getting-started/ — installation extras, quick-start examples - docs/user-guide/ — core-concepts, scraping, packaging, workflows (complex-merge) - docs/ — FAQ, TROUBLESHOOTING, BEST_PRACTICES, ARCHITECTURE, UNIFIED_PARSERS, README - Root — BULLETPROOF_QUICKSTART, CONTRIBUTING, ROADMAP - docs/zh-CN/ — Chinese translations for all of the above 32 files changed, +3,016 lines, -245 lines
14 KiB
Scraping Guide
Skill Seekers v3.2.0 Complete guide to all scraping options
Overview
Skill Seekers can extract knowledge from 17 types of sources:
| Source | Command | Best For |
|---|---|---|
| Documentation | create <url> |
Web docs, tutorials, API refs |
| GitHub | create <repo> |
Source code, issues, releases |
create <file.pdf> |
Manuals, papers, reports | |
| Local | create <./path> |
Your projects, internal code |
| Word | create <file.docx> |
Reports, specifications |
| EPUB | create <file.epub> |
E-books, long-form docs |
| Video | create <url/file> |
Tutorials, presentations |
| Jupyter | create <file.ipynb> |
Data science, experiments |
| Local HTML | create <file.html> |
Offline docs, saved pages |
| OpenAPI | create <spec.yaml> |
API specs, Swagger docs |
| AsciiDoc | create <file.adoc> |
Technical documentation |
| PowerPoint | create <file.pptx> |
Slide decks, presentations |
| RSS/Atom | create <feed.rss> |
Blog feeds, news sources |
| Man Pages | create <cmd.1> |
Unix command documentation |
| Confluence | confluence |
Team wikis, knowledge bases |
| Notion | notion |
Workspace docs, databases |
| Slack/Discord | chat |
Chat history, discussions |
Documentation Scraping
Basic Usage
# Auto-detect and scrape
skill-seekers create https://docs.react.dev/
# With custom name
skill-seekers create https://docs.react.dev/ --name react-docs
# With description
skill-seekers create https://docs.react.dev/ \
--description "React JavaScript library documentation"
Using Preset Configs
# List available presets
skill-seekers estimate --all
# Use preset
skill-seekers create --config react
skill-seekers create --config django
skill-seekers create --config fastapi
Available presets: See configs/ directory in repository.
Custom Configuration
All configs must use the unified format with a sources array (since v2.11.0):
# Create config file
cat > configs/my-docs.json << 'EOF'
{
"name": "my-framework",
"description": "My framework documentation",
"sources": [
{
"type": "documentation",
"base_url": "https://docs.example.com/",
"max_pages": 200,
"rate_limit": 0.5,
"selectors": {
"main_content": "article",
"title": "h1"
},
"url_patterns": {
"include": ["/docs/", "/api/"],
"exclude": ["/blog/", "/search"]
}
}
]
}
EOF
# Use config
skill-seekers create --config configs/my-docs.json
Note: Omit
main_contentfromselectorsto let Skill Seekers auto-detect the best content element (main,article,div[role="main"], etc.).
See Config Format for all options.
Advanced Options
# Limit pages (for testing)
skill-seekers create <url> --max-pages 50
# Adjust rate limit
skill-seekers create <url> --rate-limit 1.0
# Parallel workers (faster)
skill-seekers create <url> --workers 5 --async
# Dry run (preview)
skill-seekers create <url> --dry-run
# Resume interrupted
skill-seekers create <url> --resume
# Fresh start (ignore cache)
skill-seekers create <url> --fresh
GitHub Repository Scraping
Basic Usage
# By repo name
skill-seekers create facebook/react
# With explicit flag
skill-seekers github --repo facebook/react
# With custom name
skill-seekers github --repo facebook/react --name react-source
With GitHub Token
# Set token for higher rate limits
export GITHUB_TOKEN=ghp_...
# Use token
skill-seekers github --repo facebook/react
Benefits of token:
- 5000 requests/hour vs 60
- Access to private repos
- Higher GraphQL limits
What Gets Extracted
| Data | Default | Flag to Disable |
|---|---|---|
| Source code | ✅ | --scrape-only |
| README | ✅ | - |
| Issues | ✅ | --no-issues |
| Releases | ✅ | --no-releases |
| Changelog | ✅ | --no-changelog |
Control What to Fetch
# Skip issues (faster)
skill-seekers github --repo facebook/react --no-issues
# Limit issues
skill-seekers github --repo facebook/react --max-issues 50
# Scrape only (no build)
skill-seekers github --repo facebook/react --scrape-only
# Non-interactive (CI/CD)
skill-seekers github --repo facebook/react --non-interactive
PDF Extraction
Basic Usage
# Direct file
skill-seekers create manual.pdf --name product-manual
# With explicit command
skill-seekers pdf --pdf manual.pdf --name docs
OCR for Scanned PDFs
# Enable OCR
skill-seekers pdf --pdf scanned.pdf --enable-ocr
Requirements:
pip install skill-seekers[pdf-ocr]
# Also requires: tesseract-ocr (system package)
Password-Protected PDFs
# In config file
{
"name": "secure-docs",
"pdf_path": "protected.pdf",
"password": "secret123"
}
Page Range
# Extract specific pages (via config)
{
"pdf_path": "manual.pdf",
"page_range": [1, 100]
}
Local Codebase Analysis
Basic Usage
# Current directory
skill-seekers create .
# Specific directory
skill-seekers create ./my-project
# With explicit command
skill-seekers analyze --directory ./my-project
Analysis Presets
# Quick analysis (1-2 min)
skill-seekers analyze --directory ./my-project --preset quick
# Standard analysis (5-10 min) - default
skill-seekers analyze --directory ./my-project --preset standard
# Comprehensive (20-60 min)
skill-seekers analyze --directory ./my-project --preset comprehensive
What Gets Analyzed
| Feature | Quick | Standard | Comprehensive |
|---|---|---|---|
| Code structure | ✅ | ✅ | ✅ |
| API extraction | ✅ | ✅ | ✅ |
| Comments | - | ✅ | ✅ |
| Patterns | - | ✅ | ✅ |
| Test examples | - | - | ✅ |
| How-to guides | - | - | ✅ |
| Config patterns | - | - | ✅ |
Language Filtering
# Specific languages
skill-seekers analyze --directory ./my-project \
--languages Python,JavaScript
# File patterns
skill-seekers analyze --directory ./my-project \
--file-patterns "*.py,*.js"
Skip Features
# Skip heavy features
skill-seekers analyze --directory ./my-project \
--skip-dependency-graph \
--skip-patterns \
--skip-test-examples
Video Extraction
Basic Usage
# YouTube video
skill-seekers create https://www.youtube.com/watch?v=dQw4w9WgXcQ
# Local video file
skill-seekers create presentation.mp4
# With explicit command
skill-seekers video --url https://www.youtube.com/watch?v=...
Visual Analysis
# Install full video support (includes Whisper + scene detection)
pip install skill-seekers[video-full]
skill-seekers video --setup # auto-detect GPU and install PyTorch
# Extract with visual analysis
skill-seekers video --url <url> --visual-analysis
Requirements:
pip install skill-seekers[video] # Transcript only
pip install skill-seekers[video-full] # + Whisper, scene detection
Word Document Extraction
Basic Usage
# Extract from .docx
skill-seekers create report.docx --name project-report
# With explicit command
skill-seekers word --file report.docx
Handles: Text, tables, headings, images, embedded metadata.
EPUB Extraction
Basic Usage
# Extract from .epub
skill-seekers create programming-guide.epub --name guide
# With explicit command
skill-seekers epub --file programming-guide.epub
Handles: Chapters, metadata, table of contents, embedded images.
Jupyter Notebook Extraction
Basic Usage
# Extract from .ipynb
skill-seekers create analysis.ipynb --name data-analysis
# With explicit command
skill-seekers jupyter --notebook analysis.ipynb
Requirements:
pip install skill-seekers[jupyter]
Extracts: Markdown cells, code cells, cell outputs, execution order.
Local HTML Extraction
Basic Usage
# Extract from .html
skill-seekers create docs.html --name offline-docs
# With explicit command
skill-seekers html --file docs.html
Handles: Full HTML parsing, text extraction, link resolution.
OpenAPI/Swagger Extraction
Basic Usage
# Extract from OpenAPI spec
skill-seekers create api-spec.yaml --name my-api
# With explicit command
skill-seekers openapi --spec api-spec.yaml
Extracts: Endpoints, request/response schemas, authentication info, examples.
AsciiDoc Extraction
Basic Usage
# Extract from .adoc
skill-seekers create guide.adoc --name dev-guide
# With explicit command
skill-seekers asciidoc --file guide.adoc
Requirements:
pip install skill-seekers[asciidoc]
Handles: Sections, code blocks, tables, cross-references, includes.
PowerPoint Extraction
Basic Usage
# Extract from .pptx
skill-seekers create slides.pptx --name presentation
# With explicit command
skill-seekers pptx --file slides.pptx
Requirements:
pip install skill-seekers[pptx]
Extracts: Slide text, speaker notes, images, tables, slide order.
RSS/Atom Feed Extraction
Basic Usage
# Extract from RSS feed
skill-seekers create blog.rss --name blog-archive
# Atom feed
skill-seekers create updates.atom --name updates
# With explicit command
skill-seekers rss --feed blog.rss
Requirements:
pip install skill-seekers[rss]
Extracts: Articles, titles, dates, authors, categories.
Man Page Extraction
Basic Usage
# Extract from man page
skill-seekers create curl.1 --name curl-manual
# With explicit command
skill-seekers manpage --file curl.1
Handles: Sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, etc.), formatting.
Confluence Wiki Extraction
Basic Usage
# From Confluence API
skill-seekers confluence \
--base-url https://wiki.example.com \
--space DEV \
--name team-docs
# From Confluence export directory
skill-seekers confluence --export-dir ./confluence-export/
Requirements:
pip install skill-seekers[confluence]
Extracts: Pages, page trees, attachments, labels, spaces.
Notion Extraction
Basic Usage
# From Notion API
export NOTION_API_KEY=secret_...
skill-seekers notion --database abc123 --name product-wiki
# From Notion export directory
skill-seekers notion --export-dir ./notion-export/
Requirements:
pip install skill-seekers[notion]
Extracts: Pages, databases, blocks, properties, relations.
Slack/Discord Chat Extraction
Basic Usage
# From Slack export
skill-seekers chat --export slack-export/ --name team-discussions
# From Discord export
skill-seekers chat --export discord-export/ --name server-archive
Requirements:
pip install skill-seekers[chat]
Extracts: Messages, threads, channels, reactions, attachments.
Common Scraping Patterns
Pattern 1: Test First
# Dry run to preview
skill-seekers create <source> --dry-run
# Small test scrape
skill-seekers create <source> --max-pages 10
# Full scrape
skill-seekers create <source>
Pattern 2: Iterative Development
# Scrape without enhancement (fast)
skill-seekers create <source> --enhance-level 0
# Review output
ls output/my-skill/
cat output/my-skill/SKILL.md
# Enhance later
skill-seekers enhance output/my-skill/
Pattern 3: Parallel Processing
# Fast async scraping
skill-seekers create <url> --async --workers 5
# Even faster (be careful with rate limits)
skill-seekers create <url> --async --workers 10 --rate-limit 0.2
Pattern 4: Resume Capability
# Start scraping
skill-seekers create <source>
# ...interrupted...
# Resume later
skill-seekers resume --list
skill-seekers resume <job-id>
Troubleshooting Scraping
"No content extracted"
Problem: Wrong CSS selectors
Solution:
# First, try without a main_content selector (auto-detection)
# The scraper tries: main, div[role="main"], article, .content, etc.
skill-seekers create <url> --dry-run
# If auto-detection fails, find the correct selector:
curl -s <url> | grep -i 'article\|main\|content'
# Then specify it in your config's source:
{
"sources": [{
"type": "documentation",
"base_url": "https://...",
"selectors": {
"main_content": "div.content"
}
}]
}
"Rate limit exceeded"
Problem: Too many requests
Solution:
# Slow down
skill-seekers create <url> --rate-limit 2.0
# Or use GitHub token for GitHub repos
export GITHUB_TOKEN=ghp_...
"Too many pages"
Problem: Site is larger than expected
Solution:
# Estimate first
skill-seekers estimate configs/my-config.json
# Limit pages
skill-seekers create <url> --max-pages 100
# Adjust URL patterns
{
"url_patterns": {
"exclude": ["/blog/", "/archive/", "/search"]
}
}
"Memory error"
Problem: Site too large for memory
Solution:
# Use streaming mode
skill-seekers create <url> --streaming
# Or smaller chunks
skill-seekers create <url> --chunk-tokens 500
Performance Tips
| Tip | Command | Impact |
|---|---|---|
| Use presets | --config react |
Faster setup |
| Async mode | --async --workers 5 |
3-5x faster |
| Skip enhancement | --enhance-level 0 |
Skip 60 sec |
| Use cache | --skip-scrape |
Instant rebuild |
| Resume | --resume |
Continue interrupted |
Next Steps
- Enhancement Guide - Improve skill quality
- Packaging Guide - Export to platforms
- Config Format - Advanced configuration