fix(#300): centralize selector fallback, fix dry-run link discovery, and smart --config routing

- Add FALLBACK_MAIN_SELECTORS constant and _find_main_content() helper to eliminate 3 duplicated fallback loops in doc_scraper.py
- Move link extraction before early return in extract_content() so links are always discovered from the full page, not just main content
- Fix single-threaded dry-run to extract links from soup (full page) instead of main element only; fixes reactflow.dev finding only 1 page
- Add link extraction to async dry-run path (was completely missing)
- Remove main_content from get_configuration() defaults so fallback logic kicks in instead of a broad CSS comma selector matching body
- Smart create --config routing: peek at JSON to determine unified (sources array → unified_scraper) vs simple (base_url → doc_scraper)
- Update docs/user-guide/02-scraping.md and docs/reference/CONFIG_FORMAT.md to use unified config format (legacy format rejected since v2.11.0)
- Fix test_auto_fetch_enabled and test_mcp_validate_legacy_config

Closes #300

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:25:59 +03:00
parent b6d4dd8423
commit 4c8e16c8b1
9 changed files with 426 additions and 194 deletions
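
The "smart `--config` routing" item in the commit message can be illustrated with a minimal sketch. The function names `route_config` and `pick_scraper` are hypothetical, and the error handling is an assumption. Only the rule itself (a `sources` array means unified, a top-level `base_url` means simple) comes from the commit message.

```python
import json
from pathlib import Path


def route_config(data: object) -> str:
    """Decide which scraper a parsed config should go to.

    Hypothetical helper: the returned strings reuse the names from the
    commit message; the real function in the repository is not shown here.
    """
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    # Unified format: a top-level "sources" array.
    if isinstance(data.get("sources"), list):
        return "unified_scraper"
    # Simple format: a top-level "base_url" key.
    if "base_url" in data:
        return "doc_scraper"
    # Assumption: anything else is treated as an error.
    raise ValueError("config has neither 'sources' nor 'base_url'")


def pick_scraper(config_path: str) -> str:
    """Peek at the JSON file on disk and route it."""
    text = Path(config_path).read_text(encoding="utf-8")
    return route_config(json.loads(text))
```

In this sketch `sources` is checked first, so a config that happens to carry both keys is treated as unified. Whether the real router makes the same choice is not stated in the commit.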
--- a/docs/user-guide/02-scraping.md
+++ b/docs/user-guide/02-scraping.md
@@ -1,6 +1,6 @@
 # Scraping Guide

-> **Skill Seekers v3.1.0**  
+> **Skill Seekers v3.1.4**
 > **Complete guide to all scraping options**

 ---
@@ -50,23 +50,30 @@ skill-seekers create --config fastapi

 ### Custom Configuration

+All configs must use the unified format with a `sources` array (since v2.11.0):
+
 ```bash
 # Create config file
 cat > configs/my-docs.json << 'EOF'
 {
  "name": "my-framework",
-  "base_url": "https://docs.example.com/",
  "description": "My framework documentation",
-  "max_pages": 200,
-  "rate_limit": 0.5,
-  "selectors": {
-    "main_content": "article",
-    "title": "h1"
-  },
-  "url_patterns": {
-    "include": ["/docs/", "/api/"],
-    "exclude": ["/blog/", "/search"]
-  }
+  "sources": [
+    {
+      "type": "documentation",
+      "base_url": "https://docs.example.com/",
+      "max_pages": 200,
+      "rate_limit": 0.5,
+      "selectors": {
+        "main_content": "article",
+        "title": "h1"
+      },
+      "url_patterns": {
+        "include": ["/docs/", "/api/"],
+        "exclude": ["/blog/", "/search"]
+      }
+    }
+  ]
 }
 EOF

@@ -74,6 +81,9 @@ EOF
 skill-seekers create --config configs/my-docs.json
 ```

+> **Note:** Omit `main_content` from `selectors` to let Skill Seekers auto-detect
+> the best content element (`main`, `article`, `div[role="main"]`, etc.).
+
 See [Config Format](../reference/CONFIG_FORMAT.md) for all options.

 ### Advanced Options
@@ -331,14 +341,22 @@ skill-seekers resume <job-id>

 **Solution:**
 ```bash
-# Find correct selectors
+# First, try without a main_content selector (auto-detection)
+# The scraper tries: main, div[role="main"], article, .content, etc.
+skill-seekers create <url> --dry-run
+
+# If auto-detection fails, find the correct selector:
 curl -s <url> | grep -i 'article\|main\|content'

-# Update config
+# Then specify it in your config's source:
 {
-  "selectors": {
-    "main_content": "div.content"  // or "article", "main", etc.
-  }
+  "sources": [{
+    "type": "documentation",
+    "base_url": "https://...",
+    "selectors": {
+      "main_content": "div.content"
+    }
+  }]
 }
 ```
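
The auto-detection order mentioned in the troubleshooting snippet above can be sketched as an ordered fallback loop. This is an illustration only: the selector list copies the order from that comment and may not match the real `FALLBACK_MAIN_SELECTORS` in `doc_scraper.py`, the function signature is hypothetical, and trying a configured selector before the fallbacks is an assumption.

```python
from typing import Any, Optional, Sequence, Tuple

# Assumed order, copied from the troubleshooting comment; the real
# FALLBACK_MAIN_SELECTORS constant may contain other entries.
FALLBACK_MAIN_SELECTORS = ("main", 'div[role="main"]', "article", ".content")


def find_main_content(
    page: Any,
    configured: Optional[str] = None,
    fallbacks: Sequence[str] = FALLBACK_MAIN_SELECTORS,
) -> Tuple[Any, Optional[str]]:
    """Return (element, selector) for the first selector that matches.

    `page` is any object with a BeautifulSoup-style select_one(css) method.
    Assumption: a configured selector, if given, is tried before the fallbacks.
    """
    candidates = ([configured] if configured else []) + list(fallbacks)
    for selector in candidates:
        element = page.select_one(selector)
        if element is not None:
            return element, selector
    # Nothing matched; the caller decides what to do (e.g. use the whole page).
    return None, None
```

With BeautifulSoup installed, `page` could be `BeautifulSoup(html, "html.parser")`. Keeping the loop in one helper is what lets several call sites share the same fallback order instead of repeating it.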