docs: complete documentation overhaul with v3.1.0 release notes and zh-CN translations

Documentation restructure:
- New docs/getting-started/ guide (4 files: install, quick-start, first-skill, next-steps)
- New docs/user-guide/ section (6 files: core concepts through troubleshooting)
- New docs/reference/ section (CLI_REFERENCE, CONFIG_FORMAT, ENVIRONMENT_VARIABLES, MCP_REFERENCE)
- New docs/advanced/ section (custom-workflows, mcp-server, multi-source)
- New docs/ARCHITECTURE.md - system architecture overview
- Archived legacy files (QUICKSTART.md, QUICK_REFERENCE.md, docs/guides/USAGE.md) to docs/archive/legacy/

Chinese (zh-CN) translations:
- Full zh-CN mirror of all user-facing docs (getting-started, user-guide, reference, advanced)
- GitHub Actions workflow for translation sync (.github/workflows/translate-docs.yml)
- Translation sync checker script (scripts/check_translation_sync.sh)
- Translation helper script (scripts/translate_doc.py)

Content updates:
- CHANGELOG.md: [Unreleased] → [3.1.0] - 2026-02-22
- README.md: updated with new doc structure links
- AGENTS.md: updated agent documentation
- docs/features/UNIFIED_SCRAPING.md: updated for unified scraper workflow JSON config

Analysis/planning artifacts (kept for reference):
- DOCUMENTATION_OVERHAUL_PLAN.md, DOCUMENTATION_OVERHAUL_SUMMARY.md
- FEATURE_GAP_ANALYSIS.md, IMPLEMENTATION_GAPS_ANALYSIS.md, CREATE_COMMAND_COVERAGE_ANALYSIS.md
- CHINESE_TRANSLATION_IMPLEMENTATION_SUMMARY.md, ISSUE_260_UPDATE.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 01:01:51 +03:00
parent 22bdd4f5f6
commit ba9a8ff8b5
69 changed files with 31304 additions and 246 deletions
--- a/docs/user-guide/02-scraping.md
+++ b/docs/user-guide/02-scraping.md
@@ -0,0 +1,409 @@
+# Scraping Guide
+
+> **Skill Seekers v3.1.0**  
+> **Complete guide to all scraping options**
+
+---
+
+## Overview
+
+Skill Seekers can extract knowledge from four types of sources:
+
+| Source | Command | Best For |
+|--------|---------|----------|
+| **Documentation** | `create <url>` | Web docs, tutorials, API refs |
+| **GitHub** | `create <repo>` | Source code, issues, releases |
+| **PDF** | `create <file.pdf>` | Manuals, papers, reports |
+| **Local** | `create <./path>` | Your projects, internal code |
+
+---
+
+## Documentation Scraping
+
+### Basic Usage
+
+```bash
+# Auto-detect and scrape
+skill-seekers create https://docs.react.dev/
+
+# With custom name
+skill-seekers create https://docs.react.dev/ --name react-docs
+
+# With description
+skill-seekers create https://docs.react.dev/ \
+  --description "React JavaScript library documentation"
+```
+
+### Using Preset Configs
+
+```bash
+# List available presets
+skill-seekers estimate --all
+
+# Use preset
+skill-seekers create --config react
+skill-seekers create --config django
+skill-seekers create --config fastapi
+```
+
+**Available presets:** See the `configs/` directory in the repository.
+
+### Custom Configuration
+
+```bash
+# Create config file
+cat > configs/my-docs.json << 'EOF'
+{
+  "name": "my-framework",
+  "base_url": "https://docs.example.com/",
+  "description": "My framework documentation",
+  "max_pages": 200,
+  "rate_limit": 0.5,
+  "selectors": {
+    "main_content": "article",
+    "title": "h1"
+  },
+  "url_patterns": {
+    "include": ["/docs/", "/api/"],
+    "exclude": ["/blog/", "/search"]
+  }
+}
+EOF
+
+# Use config
+skill-seekers create --config configs/my-docs.json
+```
+
+See [Config Format](../reference/CONFIG_FORMAT.md) for all options.
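+
+To make the effect of `url_patterns` more concrete, here is a minimal Python sketch of how include/exclude rules could be applied to candidate URLs. It assumes plain substring matching on the URL path and that exclusions win over inclusions; the actual matching rules of Skill Seekers are not confirmed by this guide, and `should_visit` is a hypothetical helper, not part of the tool.
+
+```python
+import json
+from urllib.parse import urlparse
+
+
+def should_visit(url: str, base_url: str, include: list[str], exclude: list[str]) -> bool:
+    """Hypothetical filter: substring matching on the URL path (an assumption)."""
+    # Stay inside the documentation site.
+    if not url.startswith(base_url):
+        return False
+    path = urlparse(url).path
+    # In this sketch, exclusions take priority over inclusions.
+    if any(pattern in path for pattern in exclude):
+        return False
+    # An empty include list is treated as "allow everything".
+    return not include or any(pattern in path for pattern in include)
+
+
+# Same shape as the config shown above, reduced to the relevant keys.
+config = json.loads("""
+{
+  "base_url": "https://docs.example.com/",
+  "url_patterns": {
+    "include": ["/docs/", "/api/"],
+    "exclude": ["/blog/", "/search"]
+  }
+}
+""")
+patterns = config["url_patterns"]
+
+for candidate in (
+    "https://docs.example.com/docs/intro",
+    "https://docs.example.com/blog/release-notes",
+    "https://other.example.com/docs/intro",
+):
+    allowed = should_visit(candidate, config["base_url"], patterns["include"], patterns["exclude"])
+    print(f"{candidate}: {'scrape' if allowed else 'skip'}")
+```
+
+Under these assumptions only the first URL would be scraped: the second matches an exclude pattern and the third lies outside `base_url`.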
+
+### Advanced Options
+
+```bash
+# Limit pages (for testing)
+skill-seekers create <url> --max-pages 50
+
+# Adjust rate limit
+skill-seekers create <url> --rate-limit 1.0
+
+# Parallel workers (faster)
+skill-seekers create <url> --workers 5 --async
+
+# Dry run (preview)
+skill-seekers create <url> --dry-run
+
+# Resume an interrupted scrape
+skill-seekers create <url> --resume
+
+# Fresh start (ignore cache)
+skill-seekers create <url> --fresh
+```
+
+---
+
+## GitHub Repository Scraping
+
+### Basic Usage
+
+```bash
+# By repo name
+skill-seekers create facebook/react
+
+# With explicit flag
+skill-seekers github --repo facebook/react
+
+# With custom name
+skill-seekers github --repo facebook/react --name react-source
+```
+
+### With GitHub Token
+
+```bash
+# Set token for higher rate limits
+export GITHUB_TOKEN=ghp_...
+
+# Use token
+skill-seekers github --repo facebook/react
+```
+
+**Benefits of a token:**
+- 5,000 requests/hour instead of 60 for unauthenticated requests
+- Access to private repos
+- Higher GraphQL limits
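+
+To see which limit currently applies, you can query GitHub's REST `rate_limit` endpoint yourself. The sketch below is independent of Skill Seekers and uses only the Python standard library; the helper names are illustrative, and `check_rate_limit` performs a real network call, so it is defined but not invoked here.
+
+```python
+import json
+import os
+import urllib.request
+
+RATE_LIMIT_URL = "https://api.github.com/rate_limit"
+
+
+def build_rate_limit_request(token=None):
+    """Build the request; the Authorization header is added only when a token is given."""
+    request = urllib.request.Request(RATE_LIMIT_URL)
+    request.add_header("Accept", "application/vnd.github+json")
+    if token:
+        request.add_header("Authorization", f"Bearer {token}")
+    return request
+
+
+def summarize_core_limit(payload: dict) -> str:
+    """Summarize the REST ("core") quota from a rate_limit response body."""
+    core = payload["resources"]["core"]
+    return f"{core['remaining']}/{core['limit']} REST requests remaining"
+
+
+def check_rate_limit() -> str:
+    """Query GitHub using GITHUB_TOKEN if it is set (network access required)."""
+    request = build_rate_limit_request(os.environ.get("GITHUB_TOKEN"))
+    with urllib.request.urlopen(request, timeout=10) as response:
+        return summarize_core_limit(json.load(response))
+```
+
+Calling `print(check_rate_limit())` with and without `GITHUB_TOKEN` exported should show the difference between the authenticated and unauthenticated quotas.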
+
+### What Gets Extracted
+
+| Data | Default | Flag to Disable |
+|------|---------|-----------------|
+| Source code | ✅ | - |
+| README | ✅ | - |
+| Issues | ✅ | `--no-issues` |
+| Releases | ✅ | `--no-releases` |
+| Changelog | ✅ | `--no-changelog` |
+
+### Control What to Fetch
+
+```bash
+# Skip issues (faster)
+skill-seekers github --repo facebook/react --no-issues
+
+# Limit issues
+skill-seekers github --repo facebook/react --max-issues 50
+
+# Scrape only (no build)
+skill-seekers github --repo facebook/react --scrape-only
+
+# Non-interactive (CI/CD)
+skill-seekers github --repo facebook/react --non-interactive
+```
+
+---
+
+## PDF Extraction
+
+### Basic Usage
+
+```bash
+# Direct file
+skill-seekers create manual.pdf --name product-manual
+
+# With explicit command
+skill-seekers pdf --pdf manual.pdf --name docs
+```
+
+### OCR for Scanned PDFs
+
+```bash
+# Enable OCR
+skill-seekers pdf --pdf scanned.pdf --enable-ocr
+```
+
+**Requirements:**
+```bash
+pip install "skill-seekers[pdf-ocr]"
+# Also requires: tesseract-ocr (system package)
+```
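+
+Because `tesseract-ocr` is a system package rather than a Python dependency, `pip` cannot tell you whether it is present. A small standard-library check, sketched here with a hypothetical `tesseract_path` helper, looks for the `tesseract` executable on `PATH`:
+
+```python
+import shutil
+
+
+def tesseract_path():
+    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
+    return shutil.which("tesseract")
+
+
+path = tesseract_path()
+if path:
+    print(f"tesseract found at {path}")
+else:
+    print("tesseract not found on PATH; install the tesseract-ocr system package first")
+```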
+
+### Password-Protected PDFs
+
+Set the password in the config file:
+
+```json
+{
+  "name": "secure-docs",
+  "pdf_path": "protected.pdf",
+  "password": "secret123"
+}
+```
+
+### Page Range
+
+Extract specific pages via the config file:
+
+```json
+{
+  "pdf_path": "manual.pdf",
+  "page_range": [1, 100]
+}
+```
+
+---
+
+## Local Codebase Analysis
+
+### Basic Usage
+
+```bash
+# Current directory
+skill-seekers create .
+
+# Specific directory
+skill-seekers create ./my-project
+
+# With explicit command
+skill-seekers analyze --directory ./my-project
+```
+
+### Analysis Presets
+
+```bash
+# Quick analysis (1-2 min)
+skill-seekers analyze --directory ./my-project --preset quick
+
+# Standard analysis (5-10 min) - default
+skill-seekers analyze --directory ./my-project --preset standard
+
+# Comprehensive (20-60 min)
+skill-seekers analyze --directory ./my-project --preset comprehensive
+```
+
+### What Gets Analyzed
+
+| Feature | Quick | Standard | Comprehensive |
+|---------|-------|----------|---------------|
+| Code structure | ✅ | ✅ | ✅ |
+| API extraction | ✅ | ✅ | ✅ |
+| Comments | - | ✅ | ✅ |
+| Patterns | - | ✅ | ✅ |
+| Test examples | - | - | ✅ |
+| How-to guides | - | - | ✅ |
+| Config patterns | - | - | ✅ |
+
+### Language Filtering
+
+```bash
+# Specific languages
+skill-seekers analyze --directory ./my-project \
+  --languages Python,JavaScript
+
+# File patterns
+skill-seekers analyze --directory ./my-project \
+  --file-patterns "*.py,*.js"
+```
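+
+As an illustration of what a comma-separated pattern list such as `"*.py,*.js"` selects, the sketch below applies shell-style globbing to file names with Python's `fnmatch` module. It assumes patterns are matched against the file name only, which this guide does not confirm; `matches_any` is a hypothetical helper.
+
+```python
+from fnmatch import fnmatchcase
+from pathlib import PurePosixPath
+
+
+def matches_any(path: str, patterns_csv: str) -> bool:
+    """Return True if the file name matches at least one comma-separated glob."""
+    name = PurePosixPath(path).name
+    patterns = [p.strip() for p in patterns_csv.split(",") if p.strip()]
+    # fnmatchcase is case-sensitive on every platform, so results are predictable.
+    return any(fnmatchcase(name, pattern) for pattern in patterns)
+
+
+for candidate in ("src/app.py", "web/index.js", "README.md"):
+    verdict = "included" if matches_any(candidate, "*.py,*.js") else "skipped"
+    print(f"{candidate}: {verdict}")
+```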
+
+### Skip Features
+
+```bash
+# Skip heavy features
+skill-seekers analyze --directory ./my-project \
+  --skip-dependency-graph \
+  --skip-patterns \
+  --skip-test-examples
+```
+
+---
+
+## Common Scraping Patterns
+
+### Pattern 1: Test First
+
+```bash
+# Dry run to preview
+skill-seekers create <source> --dry-run
+
+# Small test scrape
+skill-seekers create <source> --max-pages 10
+
+# Full scrape
+skill-seekers create <source>
+```
+
+### Pattern 2: Iterative Development
+
+```bash
+# Scrape without enhancement (fast)
+skill-seekers create <source> --enhance-level 0
+
+# Review output
+ls output/my-skill/
+cat output/my-skill/SKILL.md
+
+# Enhance later
+skill-seekers enhance output/my-skill/
+```
+
+### Pattern 3: Parallel Processing
+
+```bash
+# Fast async scraping
+skill-seekers create <url> --async --workers 5
+
+# Even faster (be careful with rate limits)
+skill-seekers create <url> --async --workers 10 --rate-limit 0.2
+```
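+
+A back-of-the-envelope estimate helps when choosing these values. The sketch below assumes that `--rate-limit` is a delay in seconds between requests and that each worker applies it independently; neither assumption is confirmed by this guide, and download and parsing time are ignored. `estimate_seconds` is a hypothetical helper.
+
+```python
+import math
+
+
+def estimate_seconds(pages: int, delay_seconds: float, workers: int = 1) -> float:
+    """Lower-bound estimate: time spent waiting between requests only."""
+    if workers < 1:
+        raise ValueError("workers must be at least 1")
+    # Each worker handles its share of pages, pausing delay_seconds per page.
+    pages_per_worker = math.ceil(pages / workers)
+    return pages_per_worker * delay_seconds
+
+
+print(estimate_seconds(200, 0.5))              # single worker, 0.5 s delay
+print(estimate_seconds(200, 0.2, workers=10))  # ten workers, 0.2 s delay
+```
+
+Under these assumptions, 200 pages take about 100 seconds of waiting with one worker at 0.5 s, but only about 4 seconds with ten workers at 0.2 s, which also means roughly 50 requests per second hitting the target site.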
+
+### Pattern 4: Resume Capability
+
+```bash
+# Start scraping
+skill-seekers create <source>
+# ...interrupted...
+
+# Resume later
+skill-seekers resume --list
+skill-seekers resume <job-id>
+```
+
+---
+
+## Troubleshooting Scraping
+
+### "No content extracted"
+
+**Problem:** Wrong CSS selectors
+
+**Solution:**
+```bash
+# Find candidate content containers in the page HTML
+curl -s <url> | grep -iE 'article|main|content'
+
+# Then set the matching selector in your config file
+# (for example "div.content", "article", or "main"):
+{
+  "selectors": {
+    "main_content": "div.content"
+  }
+}
+```
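+
+When `grep` output is too noisy, a structured count of likely containers can narrow things down. This is a minimal standard-library sketch, not part of Skill Seekers: it counts `<article>`, `<main>`, and `<div class="content">` elements in an HTML string, and the names `ContainerCounter` and `count_containers` are hypothetical.
+
+```python
+from collections import Counter
+from html.parser import HTMLParser
+
+
+class ContainerCounter(HTMLParser):
+    """Count elements that commonly wrap the main content of a docs page."""
+
+    CANDIDATE_TAGS = {"article", "main"}
+
+    def __init__(self):
+        super().__init__()
+        self.counts = Counter()
+
+    def handle_starttag(self, tag, attrs):
+        if tag in self.CANDIDATE_TAGS:
+            self.counts[tag] += 1
+        # The class attribute may be missing or empty, hence the fallback.
+        classes = (dict(attrs).get("class") or "").split()
+        if tag == "div" and "content" in classes:
+            self.counts["div.content"] += 1
+
+
+def count_containers(html: str) -> Counter:
+    parser = ContainerCounter()
+    parser.feed(html)
+    return parser.counts
+
+
+sample = '<html><body><nav>menu</nav><div class="content wide"><h1>Title</h1></div></body></html>'
+print(dict(count_containers(sample)))
+```
+
+A selector that appears exactly once on a typical page is usually a good first candidate for `main_content`.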
+
+### "Rate limit exceeded"
+
+**Problem:** Too many requests
+
+**Solution:**
+```bash
+# Slow down
+skill-seekers create <url> --rate-limit 2.0
+
+# Or use GitHub token for GitHub repos
+export GITHUB_TOKEN=ghp_...
+```
+
+### "Too many pages"
+
+**Problem:** Site is larger than expected
+
+**Solution:**
+```bash
+# Estimate first
+skill-seekers estimate configs/my-config.json
+
+# Limit pages
+skill-seekers create <url> --max-pages 100
+
+# Adjust URL patterns
+{
+  "url_patterns": {
+    "exclude": ["/blog/", "/archive/", "/search"]
+  }
+}
+```
+
+### "Memory error"
+
+**Problem:** Site too large for memory
+
+**Solution:**
+```bash
+# Use streaming mode
+skill-seekers create <url> --streaming
+
+# Or smaller chunks
+skill-seekers create <url> --chunk-size 500
+```
+
+---
+
+## Performance Tips
+
+| Tip | Command | Impact |
+|-----|---------|--------|
+| Use presets | `--config react` | Faster setup |
+| Async mode | `--async --workers 5` | 3-5x faster |
+| Skip enhancement | `--enhance-level 0` | Saves about 60 sec |
+| Use cache | `--skip-scrape` | Instant rebuild |
+| Resume | `--resume` | Continues an interrupted scrape |
+
+---
+
+## Next Steps
+
+- [Enhancement Guide](03-enhancement.md) - Improve skill quality
+- [Packaging Guide](04-packaging.md) - Export to platforms
+- [Config Format](../reference/CONFIG_FORMAT.md) - Advanced configuration