fix(#300): centralize selector fallback, fix dry-run link discovery, and smart --config routing

- Add FALLBACK_MAIN_SELECTORS constant and _find_main_content() helper to
  eliminate 3 duplicated fallback loops in doc_scraper.py
- Move link extraction before early return in extract_content() so links
  are always discovered from the full page, not just main content
- Fix single-threaded dry-run to extract links from soup (full page)
  instead of main element only — fixes reactflow.dev finding only 1 page
- Add link extraction to async dry-run path (was completely missing)
- Remove main_content from get_configuration() defaults so fallback logic
  kicks in instead of a broad CSS comma selector matching body
- Smart create --config routing: peek at JSON to determine unified
  (sources array → unified_scraper) vs simple (base_url → doc_scraper)
- Update docs/user-guide/02-scraping.md and docs/reference/CONFIG_FORMAT.md
  to use unified config format (legacy format rejected since v2.11.0)
- Fix test_auto_fetch_enabled and test_mcp_validate_legacy_config

Closes #300

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yusyus
2026-02-26 22:25:59 +03:00
parent b6d4dd8423
commit 4c8e16c8b1
9 changed files with 426 additions and 194 deletions
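
The dry-run fix described in the commit message (extract links from the full page rather than from the main element only) can be illustrated with a small sketch. This is not the code from doc_scraper.py: the commit's mention of `soup` suggests the project parses pages with BeautifulSoup, whereas this sketch uses only the standard library's `html.parser`, and the names `LinkCollector` and `discover_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect hrefs from the whole page and, separately, from inside <main>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.main_depth = 0  # > 0 while the parser is inside a <main> element
        self.all_links = set()
        self.main_links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative URLs and drop #fragments so pages dedupe.
                url, _fragment = urldefrag(urljoin(self.base_url, href))
                self.all_links.add(url)
                if self.main_depth:
                    self.main_links.add(url)

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1


def discover_links(html, base_url):
    """Return (links on the full page, links inside <main> only)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.all_links, parser.main_links


# Hypothetical page: the navigation lives outside <main>, as on many doc sites.
PAGE = """
<nav><a href="/docs/intro">Intro</a> <a href="/docs/api">API</a></nav>
<main><h1>Home</h1><a href="/docs/intro#start">Get started</a></main>
"""

all_links, main_links = discover_links(PAGE, "https://docs.example.com/")
print(len(all_links), len(main_links))  # 2 1
```

When the navigation sits outside the main element, restricting discovery to that element misses most pages, which is consistent with the reactflow.dev symptom reported above.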


@@ -1,6 +1,6 @@
# Scraping Guide
-> **Skill Seekers v3.1.0**
+> **Skill Seekers v3.1.4**
> **Complete guide to all scraping options**
---
@@ -50,23 +50,30 @@ skill-seekers create --config fastapi
### Custom Configuration
+All configs must use the unified format with a `sources` array (since v2.11.0):
```bash
# Create config file
cat > configs/my-docs.json << 'EOF'
 {
   "name": "my-framework",
-  "base_url": "https://docs.example.com/",
   "description": "My framework documentation",
-  "max_pages": 200,
-  "rate_limit": 0.5,
-  "selectors": {
-    "main_content": "article",
-    "title": "h1"
-  },
-  "url_patterns": {
-    "include": ["/docs/", "/api/"],
-    "exclude": ["/blog/", "/search"]
-  }
+  "sources": [
+    {
+      "type": "documentation",
+      "base_url": "https://docs.example.com/",
+      "max_pages": 200,
+      "rate_limit": 0.5,
+      "selectors": {
+        "main_content": "article",
+        "title": "h1"
+      },
+      "url_patterns": {
+        "include": ["/docs/", "/api/"],
+        "exclude": ["/blog/", "/search"]
+      }
+    }
+  ]
 }
EOF
@@ -74,6 +81,9 @@ EOF
skill-seekers create --config configs/my-docs.json
```
+> **Note:** Omit `main_content` from `selectors` to let Skill Seekers auto-detect
+> the best content element (`main`, `article`, `div[role="main"]`, etc.).
See [Config Format](../reference/CONFIG_FORMAT.md) for all options.
### Advanced Options
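
The `create --config` routing from the commit message (peek at the JSON; a `sources` array goes to unified_scraper, a top-level `base_url` goes to doc_scraper) might look roughly like the sketch below. The function name `pick_scraper`, the returned strings, and the error handling are assumptions made for illustration; the actual routing code is not shown in this diff.

```python
import json
import tempfile
from pathlib import Path


def pick_scraper(config_path):
    """Peek at a config file and return the name of the scraper to route to."""
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    if isinstance(config.get("sources"), list):
        return "unified_scraper"  # unified format: has a sources array
    if "base_url" in config:
        return "doc_scraper"  # simple format: top-level base_url
    raise ValueError("config has neither a 'sources' array nor a 'base_url'")


# Demonstration with two throwaway config files.
with tempfile.TemporaryDirectory() as tmp:
    unified = Path(tmp) / "unified.json"
    unified.write_text(json.dumps({
        "name": "my-framework",
        "sources": [{"type": "documentation",
                     "base_url": "https://docs.example.com/"}],
    }), encoding="utf-8")
    simple = Path(tmp) / "simple.json"
    simple.write_text(json.dumps({
        "name": "my-framework",
        "base_url": "https://docs.example.com/",
    }), encoding="utf-8")
    print(pick_scraper(unified), pick_scraper(simple))
    # unified_scraper doc_scraper
```

Checking for `sources` first means a config that carries both keys is treated as unified; whether the real implementation makes the same choice is not stated in the commit.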
@@ -331,14 +341,22 @@ skill-seekers resume <job-id>
**Solution:**
```bash
-# Find correct selectors
+# First, try without a main_content selector (auto-detection)
+# The scraper tries: main, div[role="main"], article, .content, etc.
+skill-seekers create <url> --dry-run
+# If auto-detection fails, find the correct selector:
 curl -s <url> | grep -i 'article\|main\|content'
-# Update config
+# Then specify it in your config's source:
 {
-  "selectors": {
-    "main_content": "div.content" // or "article", "main", etc.
-  }
+  "sources": [{
+    "type": "documentation",
+    "base_url": "https://...",
+    "selectors": {
+      "main_content": "div.content"
+    }
+  }]
 }
```
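
The auto-detection order quoted in the hunk above (main, div[role="main"], article, .content, etc.) suggests a helper along the following lines. This is a hedged sketch, not the `_find_main_content()` from doc_scraper.py: the list below contains only the four selectors named in the docs (the real `FALLBACK_MAIN_SELECTORS` may hold more), trying the configured selector first and then falling back is an assumption, and `FakePage` is a stand-in for any parsed page exposing `select_one()`, as BeautifulSoup objects do.

```python
# Assumed contents: only the selectors named in the docs; the real constant
# in doc_scraper.py may list more ("etc." in the original comment).
FALLBACK_MAIN_SELECTORS = ["main", 'div[role="main"]', "article", ".content"]


def find_main_content(page, configured_selector=None):
    """Return (element, selector_used), or (None, None) if nothing matches.

    `page` is any object with a select_one(css_selector) method.
    """
    candidates = []
    if configured_selector:
        candidates.append(configured_selector)  # explicit config wins
    candidates.extend(FALLBACK_MAIN_SELECTORS)
    for selector in candidates:
        element = page.select_one(selector)
        if element is not None:
            return element, selector
    return None, None


class FakePage:
    """Hypothetical stand-in for a parsed page: maps selectors to elements."""

    def __init__(self, elements):
        self._elements = elements

    def select_one(self, selector):
        return self._elements.get(selector)


page = FakePage({"article": "<article>", ".content": "<div class='content'>"})
element, used = find_main_content(page)
print(used)  # article
```

Keeping the selectors in one ordered list and looping over them in a single helper is the kind of centralization the commit message describes, replacing three duplicated fallback loops.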