fix(#300): centralize selector fallback, fix dry-run link discovery, and smart --config routing
- Add FALLBACK_MAIN_SELECTORS constant and _find_main_content() helper to eliminate 3 duplicated fallback loops in doc_scraper.py
- Move link extraction before early return in extract_content() so links are always discovered from the full page, not just main content
- Fix single-threaded dry-run to extract links from soup (full page) instead of main element only; fixes reactflow.dev finding only 1 page
- Add link extraction to async dry-run path (was completely missing)
- Remove main_content from get_configuration() defaults so fallback logic kicks in instead of a broad CSS comma selector matching body
- Smart create --config routing: peek at JSON to determine unified (sources array → unified_scraper) vs simple (base_url → doc_scraper)
- Update docs/user-guide/02-scraping.md and docs/reference/CONFIG_FORMAT.md to use unified config format (legacy format rejected since v2.11.0)
- Fix test_auto_fetch_enabled and test_mcp_validate_legacy_config

Closes #300

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
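The `--config` routing described in the message above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `classify_config` and `route_config_file` are hypothetical names, and the only behaviour taken from the message is that a `sources` array selects `unified_scraper` while a top-level `base_url` selects `doc_scraper`. Raising `ValueError` for anything else is an assumption.

```python
import json
from pathlib import Path


def classify_config(config: object) -> str:
    """Return the scraper name implied by the shape of a parsed config."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    # Unified format: a top-level "sources" array.
    if isinstance(config.get("sources"), list):
        return "unified_scraper"
    # Simple format: a top-level "base_url".
    if "base_url" in config:
        return "doc_scraper"
    raise ValueError("config has neither a 'sources' array nor a 'base_url'")


def route_config_file(path: str) -> str:
    """Peek at a JSON config file and pick a scraper (hypothetical helper)."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return classify_config(config)
```

Checking `sources` first means a config that happens to carry both keys is treated as unified, which seems the safer default given that the docs now require the unified format.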
@@ -1,6 +1,6 @@
 # Scraping Guide
 
-> **Skill Seekers v3.1.0**
+> **Skill Seekers v3.1.4**
 > **Complete guide to all scraping options**
 
 ---
@@ -50,23 +50,30 @@ skill-seekers create --config fastapi
 
 ### Custom Configuration
 
+All configs must use the unified format with a `sources` array (since v2.11.0):
+
 ```bash
 # Create config file
 cat > configs/my-docs.json << 'EOF'
 {
   "name": "my-framework",
-  "base_url": "https://docs.example.com/",
   "description": "My framework documentation",
-  "max_pages": 200,
-  "rate_limit": 0.5,
-  "selectors": {
-    "main_content": "article",
-    "title": "h1"
-  },
-  "url_patterns": {
-    "include": ["/docs/", "/api/"],
-    "exclude": ["/blog/", "/search"]
-  }
+  "sources": [
+    {
+      "type": "documentation",
+      "base_url": "https://docs.example.com/",
+      "max_pages": 200,
+      "rate_limit": 0.5,
+      "selectors": {
+        "main_content": "article",
+        "title": "h1"
+      },
+      "url_patterns": {
+        "include": ["/docs/", "/api/"],
+        "exclude": ["/blog/", "/search"]
+      }
+    }
+  ]
 }
 EOF
 
@@ -74,6 +81,9 @@ EOF
 skill-seekers create --config configs/my-docs.json
 ```
 
+> **Note:** Omit `main_content` from `selectors` to let Skill Seekers auto-detect
+> the best content element (`main`, `article`, `div[role="main"]`, etc.).
+
 See [Config Format](../reference/CONFIG_FORMAT.md) for all options.
 
 ### Advanced Options
@@ -331,14 +341,22 @@ skill-seekers resume <job-id>
 
 **Solution:**
 ```bash
-# Find correct selectors
+# First, try without a main_content selector (auto-detection)
+# The scraper tries: main, div[role="main"], article, .content, etc.
+skill-seekers create <url> --dry-run
+
+# If auto-detection fails, find the correct selector:
 curl -s <url> | grep -i 'article\|main\|content'
 
-# Update config
+# Then specify it in your config's source:
 {
-  "selectors": {
-    "main_content": "div.content"  // or "article", "main", etc.
-  }
+  "sources": [{
+    "type": "documentation",
+    "base_url": "https://...",
+    "selectors": {
+      "main_content": "div.content"
+    }
+  }]
 }
 ```
 
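The auto-detection order mentioned in the troubleshooting hunk above (`main`, `div[role="main"]`, `article`, `.content`, etc.) could be centralized along these lines. This is a hedged sketch, not the `_find_main_content()` helper from `doc_scraper.py`: the function signature is hypothetical, the tuple lists only the selectors the docs name (the real constant may contain more), and trying a configured selector before the fallbacks is an assumption. `select_one` stands for any callable that maps a CSS selector to an element or `None`, for example BeautifulSoup's `soup.select_one`.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Only the selectors named in the docs; the project's real list may be longer.
FALLBACK_MAIN_SELECTORS: Tuple[str, ...] = (
    "main",
    'div[role="main"]',
    "article",
    ".content",
)


def find_main_content(
    select_one: Callable[[str], Optional[T]],
    configured: Optional[str] = None,
    fallbacks: Sequence[str] = FALLBACK_MAIN_SELECTORS,
) -> Tuple[Optional[T], Optional[str]]:
    """Return (element, selector) for the first selector that matches."""
    # Assumption: an explicitly configured selector is tried before the fallbacks.
    candidates = ([configured] if configured else []) + list(fallbacks)
    for selector in candidates:
        element = select_one(selector)
        if element is not None:
            return element, selector
    # Nothing matched; the caller decides how to handle a page with no main content.
    return None, None
```

Keeping the loop in one place means every code path (sync, async, dry-run) consults the same ordered list, which is the duplication the commit message says it removes.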