fix(#300): centralize selector fallback, fix dry-run link discovery, and smart --config routing

- Add FALLBACK_MAIN_SELECTORS constant and _find_main_content() helper to
  eliminate 3 duplicated fallback loops in doc_scraper.py
- Move link extraction before early return in extract_content() so links
  are always discovered from the full page, not just main content
- Fix single-threaded dry-run to extract links from soup (full page)
  instead of main element only — fixes reactflow.dev finding only 1 page
- Add link extraction to async dry-run path (was completely missing)
- Remove main_content from get_configuration() defaults so fallback logic
  kicks in instead of a broad CSS comma selector matching body
- Smart create --config routing: peek at JSON to determine unified
  (sources array → unified_scraper) vs simple (base_url → doc_scraper)
- Update docs/user-guide/02-scraping.md and docs/reference/CONFIG_FORMAT.md
  to use unified config format (legacy format rejected since v2.11.0)
- Fix test_auto_fetch_enabled and test_mcp_validate_legacy_config

Closes #300

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yusyus
2026-02-26 22:25:59 +03:00
parent b6d4dd8423
commit 4c8e16c8b1
9 changed files with 426 additions and 194 deletions


@@ -22,6 +22,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`docx` optional dependency group** — `pip install skill-seekers[docx]` (mammoth + python-docx)

### Fixed
- **Issue #300: Selector fallback & dry-run link discovery** — `create https://reactflow.dev/` now finds 20+ pages (was 1). Root causes:
  - `extract_content()` extracted links after the early-return when no content selector matched, so they were never discovered. Moved link extraction before the early return.
  - Dry-run extracted links from `main.find_all("a")` (main content only) instead of `soup.find_all("a")` (full page), missing navigation links. Fixed both sync and async dry-run paths.
  - Async dry-run had no link extraction at all — only logged URLs.
  - `get_configuration()` default used a CSS comma selector string that conflicted with the fallback loop. Removed `main_content` from defaults so the `_find_main_content()` fallback kicks in.
  - `create --config` with a simple web config (has `base_url`, no `sources`) incorrectly routed to `unified_scraper`, which rejected it. Now peeks at the JSON: routes `"sources"` configs to unified_scraper and `"base_url"` configs to doc_scraper.
  - Selector fallback logic was duplicated in 3 places with `body` as the ultimate fallback (masking failures). Extracted the `FALLBACK_MAIN_SELECTORS` constant and `_find_main_content()` helper (no `body`).
- **Reference file code truncation removed** — `codebase_scraper.py` no longer truncates code blocks to 500 chars in reference files (5 locations fixed)
- **Enhancement code block limit replaced with token budget** — `enhance_skill_local.py` `summarize_reference()` now uses a character-budget approach instead of an arbitrary `[:5]` code block cap
- **Dead variable removed** — `_target_lines` in `enhance_skill_local.py:309` was assigned but never used
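The `--config` routing rule described above can be sketched as follows (a minimal sketch with a hypothetical helper name — the real logic lives inside the `create` command's argument handling):

```python
import json


def route_config(config_text: str) -> str:
    """Sketch of the routing rule: unified configs ('sources' key) go to
    unified_scraper; simple web configs ('base_url' key) go to doc_scraper."""
    cfg = json.loads(config_text)
    if "sources" in cfg:
        return "unified_scraper"
    if "base_url" in cfg:
        return "doc_scraper"
    raise ValueError("Config must contain 'sources' or 'base_url'")
```

Note the order matters: a unified config may also carry a top-level `base_url`, so `"sources"` is checked first.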


@@ -1835,15 +1835,17 @@ UNIVERSAL_ARGUMENTS = {
## 📚 Key Code Locations

**Documentation Scraper** (`src/skill_seekers/cli/doc_scraper.py`):
- `FALLBACK_MAIN_SELECTORS` - Shared fallback CSS selectors for finding main content (no `body`)
- `_find_main_content()` - Centralized selector fallback: config selector → fallback list
- `is_valid_url()` - URL validation
- `extract_content()` - Content extraction (links extracted from full page before early return)
- `detect_language()` - Code language detection
- `extract_patterns()` - Pattern extraction
- `smart_categorize()` - Smart categorization
- `infer_categories()` - Category inference
- `generate_quick_reference()` - Quick reference generation
- `create_enhanced_skill_md()` - SKILL.md generation
- `scrape_all()` - Main scraping loop (dry-run extracts links from full page)
- `main()` - Entry point

**Codebase Analysis** (`src/skill_seekers/cli/`):
@@ -2256,6 +2258,15 @@ The `scripts/` directory contains utility scripts:
## 🎉 Recent Achievements

**v3.1.4 (Unreleased) - "Selector Fallback & Dry-Run Fix":**
- 🐛 **Issue #300: `create https://reactflow.dev/` only found 1 page** — Now finds 20+ pages
- 🔧 **Centralized selector fallback** — `FALLBACK_MAIN_SELECTORS` constant + `_find_main_content()` helper replace 3 duplicated fallback loops
- 🔗 **Link extraction before early return** — `extract_content()` now discovers links even when no content selector matches
- 🔍 **Dry-run full-page link discovery** — Both sync and async dry-run paths extract links from the full page (was main-content-only or missing entirely)
- 🛣️ **Smart `create --config` routing** — Peeks at JSON to route `base_url` configs to doc_scraper and `sources` configs to unified_scraper
- 🧹 **Removed `body` fallback** — `body` matched everything, hiding real selector failures
- ✅ **Pre-existing test fixes** — `test_auto_fetch_enabled` (react.json exists locally) and `test_mcp_validate_legacy_config` (react.json is now unified format)

**v3.1.3 (Released) - "Unified Argument Interface":**
- 🔧 **Unified Scraper Arguments** - All scrapers (scrape, github, analyze, pdf) now share a common argument contract via `add_all_standard_arguments(parser)` in `arguments/common.py`
- 🐛 **Fix `create` Argument Forwarding** - `create <url> --dry-run`, `create owner/repo --dry-run`, `create ./path --dry-run` all work now (previously crashed)

configs/react.json (new file, 69 lines)

@@ -0,0 +1,69 @@
{
"name": "react",
"description": "Complete React knowledge base combining official documentation and React codebase insights. Use when working with React, understanding API changes, or debugging React internals.",
"version": "1.1.0",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://react.dev/",
"extract_api": true,
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": [],
"exclude": [
"/blog/",
"/community/"
]
},
"categories": {
"getting_started": [
"learn",
"installation",
"quick-start"
],
"components": [
"components",
"props",
"state"
],
"hooks": [
"hooks",
"usestate",
"useeffect",
"usecontext"
],
"api": [
"api",
"reference"
],
"advanced": [
"context",
"refs",
"portals",
"suspense"
]
},
"rate_limit": 0.5
},
{
"type": "github",
"repo": "facebook/react",
"enable_codebase_analysis": true,
"code_analysis_depth": "deep",
"fetch_issues": true,
"max_issues": 100,
"fetch_changelog": true,
"fetch_releases": true,
"file_patterns": [
"packages/react/src/**/*.js",
"packages/react-dom/src/**/*.js"
]
}
],
"base_url": "https://react.dev/"
}


@@ -1,7 +1,7 @@
# Config Format Reference - Skill Seekers

> **Version:** 3.1.4
> **Last Updated:** 2026-02-26
> **Complete JSON configuration specification**

---
@@ -25,17 +25,21 @@
## Overview

Skill Seekers uses JSON configuration files with a unified format. All configs use a `sources` array, even for single-source scraping.

> **Important:** Legacy configs without `sources` were removed in v2.11.0. All configs must use the unified format shown below.

| Use Case | Example |
|----------|---------|
| **Single source** | `"sources": [{ "type": "documentation", ... }]` |
| **Multiple sources** | `"sources": [{ "type": "documentation", ... }, { "type": "github", ... }]` |

---

## Single-Source Config

Even for a single source, wrap it in a `sources` array.

### Documentation Source

For scraping documentation websites.
@@ -43,33 +47,37 @@ For scraping documentation websites.
```json
{
  "name": "react",
  "description": "React - JavaScript library for building UIs",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "start_urls": [
        "https://react.dev/learn",
        "https://react.dev/reference/react"
      ],
      "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code"
      },
      "url_patterns": {
        "include": ["/learn/", "/reference/"],
        "exclude": ["/blog/", "/community/"]
      },
      "categories": {
        "getting_started": ["learn", "tutorial", "intro"],
        "api": ["reference", "api", "hooks"]
      },
      "rate_limit": 0.5,
      "max_pages": 300
    }
  ]
}
```
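Each entry in `sources` is dispatched to the scraper matching its `"type"` field. A minimal sketch of that dispatch (the handler names here are illustrative, not the real module names):

```python
import json

# Hypothetical type → handler mapping; names are illustrative only
HANDLERS = {
    "documentation": "doc_scraper",
    "github": "github_scraper",
    "pdf": "pdf_scraper",
    "local": "codebase_scraper",
}


def plan(config_text: str) -> list:
    """Return the handler for each source entry, in order."""
    cfg = json.loads(config_text)
    return [HANDLERS[src["type"]] for src in cfg.get("sources", [])]
```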
@@ -99,27 +107,31 @@ For analyzing GitHub repositories.
```json
{
  "name": "react-github",
  "description": "React GitHub repository analysis",
  "sources": [
    {
      "type": "github",
      "repo": "facebook/react",

      "enable_codebase_analysis": true,
      "code_analysis_depth": "deep",

      "fetch_issues": true,
      "max_issues": 100,
      "issue_labels": ["bug", "enhancement"],

      "fetch_releases": true,
      "max_releases": 20,

      "fetch_changelog": true,
      "analyze_commit_history": true,

      "file_patterns": ["*.js", "*.ts", "*.tsx"],
      "exclude_patterns": ["*.test.js", "node_modules/**"],

      "rate_limit": 1.0
    }
  ]
}
```
@@ -152,24 +164,28 @@ For extracting content from PDF files.
```json
{
  "name": "product-manual",
  "description": "Product documentation manual",
  "sources": [
    {
      "type": "pdf",
      "pdf_path": "docs/manual.pdf",

      "enable_ocr": false,
      "password": "",

      "extract_images": true,
      "image_output_dir": "output/images/",

      "extract_tables": true,
      "table_format": "markdown",

      "page_range": [1, 100],
      "split_by_chapters": true,

      "chunk_size": 1000,
      "chunk_overlap": 100
    }
  ]
}
```
@@ -201,25 +217,29 @@ For analyzing local codebases.
```json
{
  "name": "my-project",
  "description": "Local project analysis",
  "sources": [
    {
      "type": "local",
      "directory": "./my-project",

      "languages": ["Python", "JavaScript"],
      "file_patterns": ["*.py", "*.js"],
      "exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],

      "analysis_depth": "comprehensive",

      "extract_api": true,
      "extract_patterns": true,
      "extract_test_examples": true,
      "extract_how_to_guides": true,
      "extract_config_patterns": true,

      "include_comments": true,
      "include_docstrings": true,
      "include_readme": true
    }
  ]
}
```
@@ -406,14 +426,25 @@ CSS selectors for content extraction from HTML:
### Default Selectors

If `main_content` is not specified, the scraper tries these selectors in order until one matches:

1. `main`
2. `div[role="main"]`
3. `article`
4. `[role="main"]`
5. `.content`
6. `.doc-content`
7. `#main-content`

> **Tip:** Omit `main_content` from your config to let auto-detection work.
> Only specify it when auto-detection picks the wrong element.

Other defaults:

| Element | Default Selector |
|---------|-----------------|
| `title` | `title` |
| `code_blocks` | `pre code` |
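The fallback order above amounts to a first-match scan with no `body` catch-all, so a miss surfaces as a warning instead of silently grabbing the whole page. A simplified stand-in for the real BeautifulSoup-based lookup:

```python
# The shared fallback list, tried in order (mirrors the documented order above)
FALLBACK_MAIN_SELECTORS = [
    "main",
    'div[role="main"]',
    "article",
    '[role="main"]',
    ".content",
    ".doc-content",
    "#main-content",
]


def pick_selector(matching):
    """Return the first fallback selector present in `matching` (the set of
    selectors that match the page), or None. Deliberately no 'body' fallback:
    a miss should be visible, not masked."""
    for selector in FALLBACK_MAIN_SELECTORS:
        if selector in matching:
            return selector
    return None
```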
---
@@ -494,29 +525,33 @@ Control which URLs are included or excluded:
```json
{
  "name": "react",
  "description": "React - JavaScript library for building UIs",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "start_urls": [
        "https://react.dev/learn",
        "https://react.dev/reference/react",
        "https://react.dev/reference/react-dom"
      ],
      "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code"
      },
      "url_patterns": {
        "include": ["/learn/", "/reference/"],
        "exclude": ["/community/", "/search"]
      },
      "categories": {
        "getting_started": ["learn", "tutorial"],
        "api": ["reference", "api"]
      },
      "rate_limit": 0.5,
      "max_pages": 300
    }
  ]
}
```
@@ -525,16 +560,20 @@ Control which URLs are included or excluded:
```json
{
  "name": "django-github",
  "description": "Django web framework source code",
  "sources": [
    {
      "type": "github",
      "repo": "django/django",
      "enable_codebase_analysis": true,
      "code_analysis_depth": "deep",
      "fetch_issues": true,
      "max_issues": 100,
      "fetch_releases": true,
      "file_patterns": ["*.py"],
      "exclude_patterns": ["tests/**", "docs/**"]
    }
  ]
}
```
@@ -572,15 +611,19 @@ Control which URLs are included or excluded:
```json
{
  "name": "my-api",
  "description": "My REST API implementation",
  "sources": [
    {
      "type": "local",
      "directory": "./my-api-project",
      "languages": ["Python"],
      "file_patterns": ["*.py"],
      "exclude_patterns": ["tests/**", "migrations/**"],
      "analysis_depth": "comprehensive",
      "extract_api": true,
      "extract_test_examples": true
    }
  ]
}
```


@@ -1,6 +1,6 @@
# Scraping Guide

> **Skill Seekers v3.1.4**
> **Complete guide to all scraping options**

---
@@ -50,23 +50,30 @@ skill-seekers create --config fastapi
### Custom Configuration

All configs must use the unified format with a `sources` array (since v2.11.0):

```bash
# Create config file
cat > configs/my-docs.json << 'EOF'
{
  "name": "my-framework",
  "description": "My framework documentation",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.example.com/",
      "max_pages": 200,
      "rate_limit": 0.5,
      "selectors": {
        "main_content": "article",
        "title": "h1"
      },
      "url_patterns": {
        "include": ["/docs/", "/api/"],
        "exclude": ["/blog/", "/search"]
      }
    }
  ]
}
EOF
@@ -74,6 +81,9 @@ EOF
skill-seekers create --config configs/my-docs.json
```

> **Note:** Omit `main_content` from `selectors` to let Skill Seekers auto-detect
> the best content element (`main`, `article`, `div[role="main"]`, etc.).

See [Config Format](../reference/CONFIG_FORMAT.md) for all options.
### Advanced Options
@@ -331,14 +341,22 @@ skill-seekers resume <job-id>
**Solution:**

```bash
# First, try without a main_content selector (auto-detection)
# The scraper tries: main, div[role="main"], article, .content, etc.
skill-seekers create <url> --dry-run

# If auto-detection fails, find the correct selector:
curl -s <url> | grep -i 'article\|main\|content'

# Then specify it in your config's source:
{
  "sources": [{
    "type": "documentation",
    "base_url": "https://...",
    "selectors": {
      "main_content": "div.content"
    }
  }]
}
```
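If grepping raw HTML is too noisy, a stdlib-only probe can hint at which selector to try. This is a rough sketch (the real scraper uses CSS selectors via BeautifulSoup; the class name here is hypothetical):

```python
from html.parser import HTMLParser


class ContentProbe(HTMLParser):
    """Collect candidate content hooks: main/article tags, role="main",
    and id/class attributes containing 'content'."""

    def __init__(self):
        super().__init__()
        self.hits = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("main", "article"):
            self.hits.add(tag)
        if attrs.get("role") == "main":
            self.hits.add('[role="main"]')
        for key in ("id", "class"):
            value = attrs.get(key) or ""
            if "content" in value:
                self.hits.add(f'{"#" if key == "id" else "."}{value}')


probe = ContentProbe()
probe.feed('<body><article id="main-content"><h1>Docs</h1></article></body>')
```

Feed it the output of `curl -s <url>` and inspect `probe.hits` to see which of the fallback selectors are likely to match.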


@@ -603,9 +603,30 @@ Common Workflows:
log_level = logging.DEBUG if args.verbose else (logging.WARNING if args.quiet else logging.INFO)
logging.basicConfig(level=log_level, format="%(levelname)s: %(message)s")

# Validate source provided (config file can serve as source)
if not args.source and not args.config:
    parser.error("source is required (or use --config to specify a config file)")

# If config is provided but no source, peek at the JSON to route correctly
if not args.source and args.config:
    import json

    try:
        with open(args.config) as f:
            config_peek = json.load(f)
        if "sources" in config_peek:
            # Unified format → route to unified_scraper via config type detection
            args.source = args.config
        elif "base_url" in config_peek:
            # Simple web config → route to doc_scraper by using the base_url
            args.source = config_peek["base_url"]
            # source will be detected as web URL; --config is already set
        else:
            parser.error("Config file must contain 'sources' (unified) or 'base_url' (web)")
    except json.JSONDecodeError as e:
        parser.error(f"Cannot parse config file as JSON: {e}")
    except FileNotFoundError:
        parser.error(f"Config file not found: {args.config}")

# Execute create command
command = CreateCommand(args)


@@ -52,6 +52,18 @@ from skill_seekers.cli.utils import setup_logging
# Configure logging
logger = logging.getLogger(__name__)

# Shared fallback selectors for finding main content across all code paths.
# No 'body' — it matches everything and hides real selector failures.
FALLBACK_MAIN_SELECTORS = [
    "main",
    'div[role="main"]',
    "article",
    '[role="main"]',
    ".content",
    ".doc-content",
    "#main-content",
]


def infer_description_from_docs(
    base_url: str, first_page_content: str | None = None, name: str = ""
@@ -275,6 +287,35 @@ class DocToSkillConverter:
        except Exception as e:
            logger.warning("⚠️ Failed to clear checkpoint: %s", e)

    def _find_main_content(self, soup: Any) -> tuple[Any, str | None]:
        """Find the main content element using config selector with fallbacks.

        Tries the config-specified selector first, then falls back through
        FALLBACK_MAIN_SELECTORS. Does NOT fall back to <body> since that
        matches everything and hides real selector failures.

        Args:
            soup: BeautifulSoup parsed page

        Returns:
            Tuple of (element, selector_used) or (None, None) if nothing matched
        """
        selectors = self.config.get("selectors", {})
        main_selector = selectors.get("main_content")

        if main_selector:
            main = soup.select_one(main_selector)
            if main:
                return main, main_selector
            # Config selector didn't match — fall through to fallbacks

        for selector in FALLBACK_MAIN_SELECTORS:
            main = soup.select_one(selector)
            if main:
                return main, selector

        return None, None

    def extract_content(self, soup: Any, url: str) -> dict[str, Any]:
        """Extract content with improved code and pattern detection"""
        page = {
@@ -294,9 +335,17 @@ class DocToSkillConverter:
        if title_elem:
            page["title"] = self.clean_text(title_elem.get_text())

        # Extract links from entire page (always, even if main content not found).
        # This allows discovery of navigation links outside the main content area.
        for link in soup.find_all("a", href=True):
            href = urljoin(url, link["href"])
            # Strip anchor fragments to avoid treating #anchors as separate pages
            href = href.split("#")[0]
            if self.is_valid_url(href) and href not in page["links"]:
                page["links"].append(href)

        # Find main content using shared fallback logic
        main, _selector_used = self._find_main_content(soup)

        if not main:
            logger.warning("⚠ No content: %s", url)
@@ -329,15 +378,6 @@ class DocToSkillConverter:
        page["content"] = "\n\n".join(paragraphs)

        return page

    def _extract_markdown_content(self, content: str, url: str) -> dict[str, Any]:
@@ -1070,16 +1110,13 @@ class DocToSkillConverter:
            response = requests.get(url, headers=headers, timeout=10)
            soup = BeautifulSoup(response.content, "html.parser")

            # Discover links from full page (not just main content)
            # to match real scrape path behaviour in extract_content()
            for link in soup.find_all("a", href=True):
                href = urljoin(url, link["href"])
                href = href.split("#")[0]
                if self.is_valid_url(href) and href not in self.visited_urls:
                    self.pending_urls.append(href)
        except Exception as e:
            # Failed to extract links in fast mode, continue anyway
            logger.warning("⚠️ Warning: Could not extract links from %s: %s", url, e)
@@ -1249,6 +1286,25 @@ class DocToSkillConverter:
                if unlimited or len(self.visited_urls) <= preview_limit:
                    if self.dry_run:
                        logger.info("  [Preview] %s", url)
                        # Discover links from full page (async dry-run)
                        try:
                            response = await client.get(
                                url,
                                headers={
                                    "User-Agent": "Mozilla/5.0 (Documentation Scraper - Dry Run)"
                                },
                                timeout=10,
                            )
                            soup = BeautifulSoup(response.content, "html.parser")
                            for link in soup.find_all("a", href=True):
                                href = urljoin(url, link["href"])
                                href = href.split("#")[0]
                                if self.is_valid_url(href) and href not in self.visited_urls:
                                    self.pending_urls.append(href)
                        except Exception as e:
                            logger.warning(
                                "⚠️ Warning: Could not extract links from %s: %s", url, e
                            )
                    else:
                        task = asyncio.create_task(
                            self.scrape_page_async(url, semaphore, client)
@@ -2039,7 +2095,6 @@ def get_configuration(args: argparse.Namespace) -> dict[str, Any]:
        "description": args.description or f"Use when working with {args.name}",
        "base_url": effective_url,
        "selectors": {
            "title": "title",
            "code_blocks": "pre code",
        },


@@ -265,16 +265,16 @@ class TestResolveConfigPath:
    @patch("skill_seekers.cli.config_fetcher.fetch_config_from_api")
    def test_auto_fetch_enabled(self, mock_fetch, tmp_path):
        """Test that auto-fetch runs when enabled."""
        # Use a name that does NOT exist locally (react.json exists in configs/)
        mock_config = tmp_path / "configs" / "obscure_framework.json"
        mock_config.parent.mkdir(exist_ok=True)
        mock_config.write_text('{"name": "obscure_framework"}')
        mock_fetch.return_value = mock_config

        result = resolve_config_path("obscure_framework.json", auto_fetch=True)

        # Verify fetch was called
        mock_fetch.assert_called_once_with("obscure_framework", destination="configs")
        assert result is not None
        assert result.exists()


@@ -67,22 +67,30 @@ async def test_mcp_validate_legacy_config():
    """Test that MCP can validate legacy configs"""
    print("\n✓ Testing MCP validate_config_tool with legacy config...")

    # Create a truly legacy config (no "sources" key — just base_url + selectors)
    legacy_config = {
        "name": "test-legacy",
        "base_url": "https://example.com/",
        "selectors": {"main_content": "main", "title": "h1", "code_blocks": "pre code"},
        "url_patterns": {"include": [], "exclude": []},
        "rate_limit": 0.5,
    }

    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
        json.dump(legacy_config, f)
        config_path = f.name

    try:
        args = {"config_path": config_path}
        result = await validate_config_tool(args)

        # Legacy configs are rejected since v2.11.0 — validator should detect the format
        text = result[0].text
        assert "LEGACY" in text.upper(), f"Expected legacy format detected, got: {text}"
        print("  ✅ MCP correctly detects legacy config format")
    finally:
        os.unlink(config_path)


@pytest.mark.skipif(not MCP_AVAILABLE, reason="MCP package not installed")