feat: add headless browser rendering for JavaScript SPA sites (#321)
New BrowserRenderer class uses Playwright to render JavaScript-heavy documentation sites (React, Vue SPAs) that return empty HTML shells with requests.get(). Activated via the --browser flag on web scraping.

- browser_renderer.py: Playwright wrapper with lazy browser launch, auto-install of Chromium on first use, context manager support
- doc_scraper.py: browser_mode config, _render_with_browser() helper, integrated into scrape_page() and scrape_page_async()
- SPA detection warnings now suggest the --browser flag
- Optional dep: pip install "skill-seekers[browser]"
- 14 real e2e tests (actual Chromium, no mocks)
- UML updated: Scrapers class diagram (BrowserRenderer + dependency), Parsers (DoctorParser), Utilities (Doctor), Components, and new Browser Rendering sequence diagram (#20)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
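For reviewers, a quick illustration of the problem and the fix. This is a minimal sketch with a hypothetical SPA docs URL; the real entry point is the `--browser` flag (e.g. `skill-seekers create <url> --browser`):

```python
import requests

from skill_seekers.cli.browser_renderer import BrowserRenderer

url = "https://docs.example.com"  # hypothetical React/Vue-rendered docs site

# Plain HTTP fetch: for a client-side-rendered site this is often just an
# empty shell like <div id="root"></div> plus a JS bundle reference.
shell_html = requests.get(url, timeout=30).text

# Headless rendering: Playwright executes the JavaScript first, so the
# returned HTML contains the actual documentation content.
with BrowserRenderer() as renderer:
    rendered_html = renderer.render_page(url)

print(len(shell_html), len(rendered_html))
```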
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]

 ### Added

+- **Headless browser rendering** (`--browser` flag) — uses Playwright to render JavaScript SPA sites (React, Vue, etc.) that return empty HTML shells. Auto-installs Chromium on first use. Optional dep: `pip install "skill-seekers[browser]"` (#321)
 - **`skill-seekers doctor` command** — 8 diagnostic checks (Python version, package install, git, core/optional deps, API keys, MCP server, output dir) with pass/warn/fail status and `--verbose` flag (#316)
 - **Prompt injection check workflow** — bundled `prompt-injection-check` workflow scans scraped content for injection patterns (role assumption, instruction overrides, delimiter injection, hidden instructions). Added as first stage in `default` and `security-focus` workflows. Flags suspicious content without removing it (#324)
 - **6 behavioral UML diagrams** — 3 sequence (create pipeline, GitHub+C3.x flow, MCP invocation), 2 activity (source detection, enhancement pipeline), 1 component (runtime dependencies with interface contracts)
[Binary image diffs: four existing UML export PNGs re-rendered (486→539 KiB, 219→286 KiB, 223→268 KiB, 82→89 KiB); new file docs/UML/exports/20_browser_rendering_sequence.png (100 KiB).]
@@ -137,7 +137,12 @@ MCP Client (Claude Code/Cursor) → FastMCPServer (stdio/HTTP) with two invocati
 ### Runtime Components

 

-Component diagram with corrected runtime dependencies. Key flows: `CLI Core` dispatches to `Scrapers` (via `scraper.main(argv)`) and to `Adaptors` (via package/upload commands). `Scrapers` call `Codebase Analysis` via `analyze_codebase(enhance_level)`. `Codebase Analysis` uses `C3.x Classes` internally and `Enhancement` when level ≥ 2. `MCP Server` reaches `Scrapers` via subprocess and `Adaptors` via direct import.
+Component diagram with corrected runtime dependencies. Key flows: `CLI Core` dispatches to `Scrapers` (via `scraper.main(argv)`) and to `Adaptors` (via package/upload commands). `Scrapers` call `Codebase Analysis` via `analyze_codebase(enhance_level)`. `Codebase Analysis` uses `C3.x Classes` internally and `Enhancement` when level ≥ 2. `MCP Server` reaches `Scrapers` via subprocess and `Adaptors` via direct import. `Scrapers` optionally use `Browser Renderer (Playwright)` via `render_page()` when the `--browser` flag is set for JavaScript SPA sites.
+
+### Browser Rendering Flow
+
+
+
+When the `--browser` flag is set, `DocScraper.scrape_page()` delegates to `BrowserRenderer.render_page(url)` instead of `requests.get()`. The renderer auto-installs Chromium on first use, navigates with `wait_until='networkidle'` to let JavaScript execute, then returns the fully-rendered HTML. The rest of the pipeline (BeautifulSoup → `extract_content()` → `save_page()`) remains unchanged. Optional dependency: `pip install "skill-seekers[browser]"`.

 ## File Locations
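The dispatch described above is small. A condensed sketch of the branch inside `scrape_page()` (the full diff appears further down; `fetch_html` is a hypothetical helper for illustration, since the real code inlines both branches):

```python
from __future__ import annotations

import requests
from bs4 import BeautifulSoup

from skill_seekers.cli.browser_renderer import BrowserRenderer


def fetch_html(url: str, browser_mode: bool, renderer: BrowserRenderer | None = None) -> str:
    """Hypothetical helper mirroring the scrape_page() branch in the diff below."""
    if browser_mode and renderer is not None:
        return renderer.render_page(url)  # Playwright path: JavaScript runs first
    response = requests.get(
        url, headers={"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}, timeout=30
    )
    response.raise_for_status()
    return response.text


# Either way, the downstream pipeline stays the same:
soup = BeautifulSoup(fetch_html("https://example.com", browser_mode=False), "html.parser")
print(soup.title)
```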
@@ -232,6 +232,11 @@ chat = [
     "slack-sdk>=3.27.0",
 ]

+# Headless browser for JavaScript SPA sites
+browser = [
+    "playwright>=1.40.0",
+]
+
 # Embedding server support
 embedding = [
     "fastapi>=0.109.0",
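Since Playwright ships as an optional extra, a quick smoke test can confirm the environment before scraping. This is a hypothetical check script, not part of the PR; it mirrors the import guard in browser_renderer.py below:

```python
# Install first: pip install "skill-seekers[browser]"
import importlib.util

if importlib.util.find_spec("playwright") is None:
    raise SystemExit("playwright not installed; run: pip install 'skill-seekers[browser]'")

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Raises if the browser binary is missing (fix: playwright install chromium)
    browser = p.chromium.launch(headless=True)
    browser.close()
print("browser extra is ready")
```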
@@ -173,6 +173,13 @@ UNIVERSAL_ARGUMENTS.update(RAG_ARGUMENTS)
 
 # Web scraping specific (from scrape.py)
 WEB_ARGUMENTS: dict[str, dict[str, Any]] = {
+    "browser": {
+        "flags": ("--browser",),
+        "kwargs": {
+            "action": "store_true",
+            "help": "Use headless browser (Playwright) to render JavaScript SPA sites",
+        },
+    },
     "url": {
         "flags": ("--url",),
         "kwargs": {
@@ -115,6 +115,13 @@ SCRAPE_ARGUMENTS: dict[str, dict[str, Any]] = {
             "help": "Disable rate limiting completely (same as --rate-limit 0)",
         },
     },
+    "browser": {
+        "flags": ("--browser",),
+        "kwargs": {
+            "action": "store_true",
+            "help": "Use headless browser (Playwright) to render JavaScript SPA sites. Install: pip install 'skill-seekers[browser]'",
+        },
+    },
     "interactive_enhancement": {
         "flags": ("--interactive-enhancement",),
         "kwargs": {
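Both registries share the same entry shape, which presumably maps straight onto `argparse`. The registration loop itself is not part of this diff, so the sketch below assumes it:

```python
import argparse
from typing import Any

# One entry, copied from SCRAPE_ARGUMENTS above
BROWSER_SPEC: dict[str, Any] = {
    "flags": ("--browser",),
    "kwargs": {
        "action": "store_true",
        "help": "Use headless browser (Playwright) to render JavaScript SPA sites",
    },
}

parser = argparse.ArgumentParser()
parser.add_argument(*BROWSER_SPEC["flags"], **BROWSER_SPEC["kwargs"])

assert parser.parse_args(["--browser"]).browser is True
assert parser.parse_args([]).browser is False  # store_true defaults to False
```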
src/skill_seekers/cli/browser_renderer.py (new file, 151 lines)
@@ -0,0 +1,151 @@
"""
Browser Renderer — Playwright-based headless browser for JavaScript SPA sites.

When documentation sites use client-side rendering (React, Vue, etc.),
requests.get() returns empty HTML shells. This module uses Playwright
to render JavaScript before extracting content.

Optional dependency: pip install "skill-seekers[browser]"
"""

from __future__ import annotations

import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def _check_playwright_available() -> bool:
    """Check if playwright package is installed."""
    try:
        import playwright  # noqa: F401

        return True
    except ImportError:
        return False


def _auto_install_chromium() -> bool:
    """Auto-install Chromium browser on first use.

    Returns:
        True if install succeeded or already installed, False on failure.
    """
    logger.info("Installing Chromium browser for headless rendering...")
    try:
        result = subprocess.run(
            [sys.executable, "-m", "playwright", "install", "chromium"],
            capture_output=True,
            text=True,
            timeout=300,
        )
        if result.returncode == 0:
            logger.info("Chromium installed successfully.")
            return True
        logger.error("Chromium install failed: %s", result.stderr)
        return False
    except Exception as e:
        logger.error("Failed to install Chromium: %s", e)
        return False


class BrowserRenderer:
    """Render JavaScript pages using Playwright headless Chromium.

    Usage:
        renderer = BrowserRenderer()
        html = renderer.render_page("https://docs.discord.com")
        renderer.close()

    Or as context manager:
        with BrowserRenderer() as renderer:
            html = renderer.render_page(url)
    """

    def __init__(self, timeout: int = 30000, wait_until: str = "networkidle"):
        """Initialize renderer.

        Args:
            timeout: Page load timeout in milliseconds (default: 30s)
            wait_until: Playwright wait condition — "networkidle", "load", "domcontentloaded"
        """
        if not _check_playwright_available():
            raise ImportError(
                "Playwright is required for --browser mode.\n"
                "Install it with: pip install 'skill-seekers[browser]'\n"
                "Then run: playwright install chromium"
            )

        self._timeout = timeout
        self._wait_until = wait_until
        self._playwright = None
        self._browser = None
        self._context = None

    def _ensure_browser(self) -> None:
        """Launch browser if not already running. Auto-installs chromium if needed."""
        if self._browser is not None:
            return

        from playwright.sync_api import sync_playwright

        self._playwright = sync_playwright().start()

        try:
            self._browser = self._playwright.chromium.launch(headless=True)
        except Exception:
            # Browser not installed — try auto-install
            logger.warning("Chromium not found. Attempting auto-install...")
            if _auto_install_chromium():
                self._browser = self._playwright.chromium.launch(headless=True)
            else:
                self._playwright.stop()
                self._playwright = None
                raise RuntimeError(
                    "Could not launch Chromium. Run: playwright install chromium"
                ) from None

        self._context = self._browser.new_context(user_agent="Mozilla/5.0 (Documentation Scraper)")

    def render_page(self, url: str) -> str:
        """Render a page with JavaScript execution and return the HTML.

        Args:
            url: URL to render

        Returns:
            Fully-rendered HTML string after JavaScript execution

        Raises:
            RuntimeError: If browser cannot be launched
            TimeoutError: If page load times out
        """
        self._ensure_browser()

        page = self._context.new_page()
        try:
            page.goto(url, wait_until=self._wait_until, timeout=self._timeout)
            html = page.content()
            return html
        finally:
            page.close()

    def close(self) -> None:
        """Shut down browser and Playwright."""
        if self._context:
            self._context.close()
            self._context = None
        if self._browser:
            self._browser.close()
            self._browser = None
        if self._playwright:
            self._playwright.stop()
            self._playwright = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
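One usage note on the constructor knobs: `networkidle` is the safest default for SPAs, but it can stall on sites that hold connections open (long-polling analytics, websockets), a known Playwright caveat. A minimal sketch of relaxing it, with a hypothetical URL:

```python
from skill_seekers.cli.browser_renderer import BrowserRenderer

# "load" / "domcontentloaded" return sooner than "networkidle", at the risk
# of capturing the page before late JavaScript finishes rendering.
with BrowserRenderer(timeout=15000, wait_until="domcontentloaded") as renderer:
    html = renderer.render_page("https://docs.example.com")  # hypothetical site
    print(len(html))
```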
@@ -221,6 +221,8 @@ class CreateCommand:
             argv.append("--async")
         if getattr(self.args, "no_rate_limit", False):
             argv.append("--no-rate-limit")
+        if getattr(self.args, "browser", False):
+            argv.append("--browser")

         # Call doc_scraper with modified argv
         logger.debug(f"Calling doc_scraper with argv: {argv}")
@@ -184,6 +184,10 @@ class DocToSkillConverter:
         self.llms_txt_variant = None
         self.llms_txt_variants: list[str] = []  # Track all downloaded variants

+        # Browser rendering mode (for JavaScript SPA sites)
+        self.browser_mode = config.get("browser", False)
+        self._browser_renderer = None
+
         # Parallel scraping config
         self.workers = config.get("workers", 1)
         self.async_mode = config.get("async_mode", DEFAULT_ASYNC_MODE)
@@ -712,6 +716,24 @@ class DocToSkillConverter:
         with open(filepath, "w", encoding="utf-8") as f:
             json.dump(page, f, indent=2, ensure_ascii=False)

+    def _render_with_browser(self, url: str) -> str:
+        """Render a page using headless browser (Playwright).
+
+        Lazily initializes the BrowserRenderer on first call.
+
+        Args:
+            url: URL to render
+
+        Returns:
+            Fully-rendered HTML string
+        """
+        if self._browser_renderer is None:
+            from skill_seekers.cli.browser_renderer import BrowserRenderer
+
+            self._browser_renderer = BrowserRenderer()
+            logger.info("Launched headless browser for JavaScript rendering")
+        return self._browser_renderer.render_page(url)
+
     def scrape_page(self, url: str) -> None:
         """Scrape a single page with thread-safe operations.
@@ -730,16 +752,22 @@ class DocToSkillConverter:
         url = sanitize_url(url)

         # Scraping part (no lock needed - independent)
-        headers = {"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}
-        response = requests.get(url, headers=headers, timeout=30)
-        response.raise_for_status()
-
-        # Check if this is a Markdown file
-        if self._has_md_extension(url):
-            page = self._extract_markdown_content(response.text, url)
-        else:
-            soup = BeautifulSoup(response.content, "html.parser")
-            page = self.extract_content(soup, url)
+        if self.browser_mode and not self._has_md_extension(url):
+            # Use Playwright headless browser for JavaScript rendering
+            html = self._render_with_browser(url)
+            soup = BeautifulSoup(html, "html.parser")
+            page = self.extract_content(soup, url)
+        else:
+            headers = {"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}
+            response = requests.get(url, headers=headers, timeout=30)
+            response.raise_for_status()
+
+            # Check if this is a Markdown file
+            if self._has_md_extension(url):
+                page = self._extract_markdown_content(response.text, url)
+            else:
+                soup = BeautifulSoup(response.content, "html.parser")
+                page = self.extract_content(soup, url)

         # Thread-safe operations (lock required for workers > 1)
         if self.workers > 1:
@@ -788,18 +816,27 @@ class DocToSkillConverter:
         # Sanitise brackets before fetching (safety net; see #284)
         url = sanitize_url(url)

-        # Async HTTP request
-        headers = {"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}
-        response = await client.get(url, headers=headers, timeout=30.0)
-        response.raise_for_status()
-
-        # Check if this is a Markdown file
-        if self._has_md_extension(url):
-            page = self._extract_markdown_content(response.text, url)
-        else:
-            # BeautifulSoup parsing (still synchronous, but fast)
-            soup = BeautifulSoup(response.content, "html.parser")
-            page = self.extract_content(soup, url)
+        if self.browser_mode and not self._has_md_extension(url):
+            # Use Playwright in executor (sync API in async context)
+            loop = asyncio.get_event_loop()
+            html = await loop.run_in_executor(
+                None, self._render_with_browser, url
+            )
+            soup = BeautifulSoup(html, "html.parser")
+            page = self.extract_content(soup, url)
+        else:
+            # Async HTTP request
+            headers = {"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}
+            response = await client.get(url, headers=headers, timeout=30.0)
+            response.raise_for_status()
+
+            # Check if this is a Markdown file
+            if self._has_md_extension(url):
+                page = self._extract_markdown_content(response.text, url)
+            else:
+                # BeautifulSoup parsing (still synchronous, but fast)
+                soup = BeautifulSoup(response.content, "html.parser")
+                page = self.extract_content(soup, url)

         # Async-safe operations (no lock needed - single event loop)
         logger.info(" %s", url)
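The async path cannot call Playwright's sync API directly on the event loop (it would block every other in-flight request), hence the `run_in_executor` hop onto the default thread pool. The same pattern in isolation, with a stand-in for the blocking call:

```python
import asyncio
import time


def blocking_render(url: str) -> str:
    """Stand-in for BrowserRenderer.render_page(), which uses Playwright's sync API."""
    time.sleep(0.1)  # simulate a blocking page load
    return f"<html>rendered {url}</html>"


async def main() -> None:
    loop = asyncio.get_event_loop()
    # Runs the blocking call on the default ThreadPoolExecutor so the
    # event loop stays free to service other coroutines meanwhile.
    html = await loop.run_in_executor(None, blocking_render, "https://example.com")
    print(html)


asyncio.run(main())
```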
@@ -1370,6 +1407,11 @@ class DocToSkillConverter:
         self._log_scrape_completion()
         self.save_summary()

+        # Clean up browser renderer if used
+        if self._browser_renderer is not None:
+            self._browser_renderer.close()
+            self._browser_renderer = None
+
     def _log_scrape_completion(self) -> None:
         """Log scrape completion with accurate saved/skipped counts."""
         visited = len(self.visited_urls)
@@ -1391,8 +1433,9 @@ class DocToSkillConverter:
         if visited >= 5 and self.pages_saved == 0:
             logger.warning(
                 "⚠️ All %d pages had empty content. This site likely requires "
-                "JavaScript rendering (SPA/React/Vue). Scraping cannot extract "
-                "content from JavaScript-rendered pages.",
+                "JavaScript rendering (SPA/React/Vue).\n"
+                " Try: skill-seekers create <url> --browser\n"
+                " Install: pip install 'skill-seekers[browser]'",
                 visited,
             )
         elif visited >= 10 and self.pages_skipped > 0:
@@ -1400,7 +1443,8 @@ class DocToSkillConverter:
             if skip_ratio > 0.8:
                 logger.warning(
                     "⚠️ %d%% of pages had empty content. This site may use "
-                    "JavaScript rendering for some pages.",
+                    "JavaScript rendering for some pages.\n"
+                    " Try: skill-seekers create <url> --browser",
                     int(skip_ratio * 100),
                 )
@@ -2212,6 +2256,11 @@ def get_configuration(args: argparse.Namespace) -> dict[str, Any]:
             "⚠️ Async mode enabled but workers=1. Consider using --workers 4 for better performance"
         )

+    # Apply CLI override for browser mode
+    if getattr(args, "browser", False):
+        config["browser"] = True
+        logger.info("🌐 Browser mode enabled (Playwright headless Chromium)")
+
     # Apply CLI override for max_pages
     if args.max_pages is not None:
         old_max = config.get("max_pages", DEFAULT_MAX_PAGES)
tests/test_browser_renderer.py (new file, 152 lines)
@@ -0,0 +1,152 @@
"""Tests for browser_renderer.py (#321).

Real end-to-end tests using actual Playwright + Chromium.
"""

from __future__ import annotations

from skill_seekers.cli.browser_renderer import (
    BrowserRenderer,
    _auto_install_chromium,
    _check_playwright_available,
)


class TestPlaywrightAvailability:
    """Test that playwright is properly detected."""

    def test_playwright_is_available(self):
        assert _check_playwright_available() is True

    def test_auto_install_succeeds(self):
        # Chromium is already installed, so this should be a no-op success
        assert _auto_install_chromium() is True


class TestBrowserRendererReal:
    """Real end-to-end tests with actual Chromium."""

    def test_render_simple_page(self):
        """Render a real page and get HTML back."""
        with BrowserRenderer() as renderer:
            html = renderer.render_page("https://example.com")

        assert "<html" in html.lower()
        assert "Example Domain" in html

    def test_render_returns_js_content(self):
        """Verify that JS-generated content is captured (not just the shell)."""
        with BrowserRenderer() as renderer:
            html = renderer.render_page("https://example.com")

        # example.com has static content, but the point is we get real HTML
        assert len(html) > 500
        assert "<body" in html.lower()

    def test_multiple_pages_reuse_browser(self):
        """Rendering multiple pages should reuse the same browser instance."""
        with BrowserRenderer() as renderer:
            html1 = renderer.render_page("https://example.com")
            html2 = renderer.render_page("https://example.com")

        assert "Example Domain" in html1
        assert "Example Domain" in html2

    def test_close_cleans_up(self):
        """After close(), internal state is None."""
        renderer = BrowserRenderer()
        renderer.render_page("https://example.com")
        assert renderer._browser is not None

        renderer.close()
        assert renderer._browser is None
        assert renderer._context is None
        assert renderer._playwright is None

    def test_context_manager_cleans_up(self):
        """Context manager calls close on exit."""
        with BrowserRenderer() as renderer:
            renderer.render_page("https://example.com")
            assert renderer._browser is not None

        assert renderer._browser is None

    def test_timeout_parameter(self):
        """Custom timeout is respected."""
        renderer = BrowserRenderer(timeout=5000)
        assert renderer._timeout == 5000
        renderer.close()

    def test_wait_until_parameter(self):
        """Custom wait_until is respected."""
        renderer = BrowserRenderer(wait_until="domcontentloaded")
        assert renderer._wait_until == "domcontentloaded"
        renderer.close()


class TestDocScraperBrowserIntegration:
    """Test that doc_scraper correctly accepts browser config."""

    def test_browser_mode_config_sets_attribute(self):
        from skill_seekers.cli.doc_scraper import DocToSkillConverter

        config = {
            "name": "test",
            "base_url": "https://example.com",
            "browser": True,
            "selectors": {},
            "url_patterns": {"include": [], "exclude": []},
        }
        scraper = DocToSkillConverter(config)
        assert scraper.browser_mode is True
        assert scraper._browser_renderer is None

    def test_browser_mode_default_false(self):
        from skill_seekers.cli.doc_scraper import DocToSkillConverter

        config = {
            "name": "test",
            "base_url": "https://example.com",
            "selectors": {},
            "url_patterns": {"include": [], "exclude": []},
        }
        scraper = DocToSkillConverter(config)
        assert scraper.browser_mode is False

    def test_render_with_browser_returns_html(self):
        """Test the _render_with_browser helper directly."""
        from skill_seekers.cli.doc_scraper import DocToSkillConverter

        config = {
            "name": "test",
            "base_url": "https://example.com",
            "browser": True,
            "selectors": {},
            "url_patterns": {"include": [], "exclude": []},
        }
        scraper = DocToSkillConverter(config)

        html = scraper._render_with_browser("https://example.com")
        assert "Example Domain" in html
        assert scraper._browser_renderer is not None

        # Clean up
        scraper._browser_renderer.close()


class TestBrowserArgument:
    """Test --browser argument is registered in CLI."""

    def test_scrape_parser_accepts_browser_flag(self):
        from skill_seekers.cli.doc_scraper import setup_argument_parser

        parser = setup_argument_parser()
        args = parser.parse_args(["--name", "test", "--url", "https://example.com", "--browser"])
        assert args.browser is True

    def test_scrape_parser_browser_default_false(self):
        from skill_seekers.cli.doc_scraper import setup_argument_parser

        parser = setup_argument_parser()
        args = parser.parse_args(["--name", "test", "--url", "https://example.com"])
        assert args.browser is False
||||||