feat: add headless browser rendering for JavaScript SPA sites (#321)

New BrowserRenderer class uses Playwright to render JavaScript-heavy
documentation sites (React, Vue SPAs) that return empty HTML shells
to requests.get(). Activated via the --browser flag during web scraping.

- browser_renderer.py: Playwright wrapper with lazy browser launch,
  auto-install Chromium on first use, context manager support
- doc_scraper.py: browser_mode config, _render_with_browser() helper,
  integrated into scrape_page() and scrape_page_async()
- SPA detection warnings now suggest --browser flag
- Optional dep: pip install "skill-seekers[browser]"
- 14 real e2e tests (actual Chromium, no mocks)
- UML updated: Scrapers class diagram (BrowserRenderer + dependency),
  Parsers (DoctorParser), Utilities (Doctor), Components, and new
  Browser Rendering sequence diagram (#20)
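
A minimal usage sketch of the new class, for reviewers (mirrors the class
docstring; the URL is illustrative):

    from skill_seekers.cli.browser_renderer import BrowserRenderer

    # context manager launches Chromium lazily and closes it on exit
    with BrowserRenderer(timeout=30000, wait_until="networkidle") as renderer:
        html = renderer.render_page("https://example.com")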

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yusyus committed on 2026-03-28 22:06:14 +03:00
parent 006cccabae, commit ea4fed0be4
15 changed files with 17989 additions and 17824 deletions

CHANGELOG.md

@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Added
- **Headless browser rendering** (`--browser` flag) — uses Playwright to render JavaScript SPA sites (React, Vue, etc.) that return empty HTML shells. Auto-installs Chromium on first use. Optional dep: `pip install "skill-seekers[browser]"` (#321)
- **`skill-seekers doctor` command** — 8 diagnostic checks (Python version, package install, git, core/optional deps, API keys, MCP server, output dir) with pass/warn/fail status and `--verbose` flag (#316)
- **Prompt injection check workflow** — bundled `prompt-injection-check` workflow scans scraped content for injection patterns (role assumption, instruction overrides, delimiter injection, hidden instructions). Added as first stage in `default` and `security-focus` workflows. Flags suspicious content without removing it (#324)
- **6 behavioral UML diagrams** — 3 sequence (create pipeline, GitHub+C3.x flow, MCP invocation), 2 activity (source detection, enhancement pipeline), 1 component (runtime dependencies with interface contracts)

[Binary image diffs not shown: four updated PNG exports (486→539 KiB,
219→286 KiB, 223→268 KiB, 82→89 KiB) and one newly added PNG (100 KiB).]

[One file's diff suppressed because it is too large.]


@@ -137,7 +137,12 @@ MCP Client (Claude Code/Cursor) → FastMCPServer (stdio/HTTP) with two invocati
### Runtime Components
![Runtime Components](UML/exports/19_runtime_components.png)
Component diagram with corrected runtime dependencies. Key flows: `CLI Core` dispatches to `Scrapers` (via `scraper.main(argv)`) and to `Adaptors` (via package/upload commands). `Scrapers` call `Codebase Analysis` via `analyze_codebase(enhance_level)`. `Codebase Analysis` uses `C3.x Classes` internally and `Enhancement` when level ≥ 2. `MCP Server` reaches `Scrapers` via subprocess and `Adaptors` via direct import. `Scrapers` optionally use `Browser Renderer (Playwright)` via `render_page()` when the `--browser` flag is set for JavaScript SPA sites.
### Browser Rendering Flow
![Browser Rendering](UML/exports/20_browser_rendering_sequence.png)
When the `--browser` flag is set, `DocScraper.scrape_page()` delegates to `BrowserRenderer.render_page(url)` instead of `requests.get()`. The renderer auto-installs Chromium on first use, navigates with `wait_until='networkidle'` to let JavaScript execute, then returns the fully rendered HTML. The rest of the pipeline (BeautifulSoup → `extract_content()` → `save_page()`) remains unchanged. Optional dependency: `pip install "skill-seekers[browser]"`.
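
A condensed sketch of that branch, simplified from `scrape_page()` (error handling and the shared save path omitted):

```python
# Simplified: pick the fetch strategy, then hand off to the unchanged pipeline
if self.browser_mode and not self._has_md_extension(url):
    html = self._render_with_browser(url)  # Playwright, JavaScript executed
else:
    html = requests.get(url, headers=headers, timeout=30).text  # static fetch

soup = BeautifulSoup(html, "html.parser")
page = self.extract_content(soup, url)
```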
## File Locations

pyproject.toml

@@ -232,6 +232,11 @@ chat = [
    "slack-sdk>=3.27.0",
]

# Headless browser for JavaScript SPA sites
browser = [
    "playwright>=1.40.0",
]

# Embedding server support
embedding = [
    "fastapi>=0.109.0",

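For reference, enabling the extra uses the commands already quoted in the
error message and changelog above (the second is optional, since the renderer
auto-installs Chromium on first use):

    pip install "skill-seekers[browser]"
    playwright install chromium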

@@ -173,6 +173,13 @@ UNIVERSAL_ARGUMENTS.update(RAG_ARGUMENTS)
# Web scraping specific (from scrape.py)
WEB_ARGUMENTS: dict[str, dict[str, Any]] = {
    "browser": {
        "flags": ("--browser",),
        "kwargs": {
            "action": "store_true",
            "help": "Use headless browser (Playwright) to render JavaScript SPA sites",
        },
    },
    "url": {
        "flags": ("--url",),
        "kwargs": {

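This spec shape repeats in SCRAPE_ARGUMENTS below. A hedged sketch of how such
a spec would typically be unpacked into argparse — the helper here is
hypothetical; the actual registration code is not part of this diff:

    import argparse
    from typing import Any

    def register_arguments(parser: argparse.ArgumentParser,
                           arguments: dict[str, dict[str, Any]]) -> None:
        # hypothetical consumer: each entry becomes one add_argument call
        for spec in arguments.values():
            parser.add_argument(*spec["flags"], **spec["kwargs"])

    # e.g. registering WEB_ARGUMENTS would add --browser as a store_true
    # flag, so args.browser defaults to False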

@@ -115,6 +115,13 @@ SCRAPE_ARGUMENTS: dict[str, dict[str, Any]] = {
            "help": "Disable rate limiting completely (same as --rate-limit 0)",
        },
    },
    "browser": {
        "flags": ("--browser",),
        "kwargs": {
            "action": "store_true",
            "help": "Use headless browser (Playwright) to render JavaScript SPA sites. Install: pip install 'skill-seekers[browser]'",
        },
    },
    "interactive_enhancement": {
        "flags": ("--interactive-enhancement",),
        "kwargs": {

skill_seekers/cli/browser_renderer.py

@@ -0,0 +1,151 @@
"""
Browser Renderer — Playwright-based headless browser for JavaScript SPA sites.

When documentation sites use client-side rendering (React, Vue, etc.),
requests.get() returns empty HTML shells. This module uses Playwright
to render JavaScript before extracting content.

Optional dependency: pip install "skill-seekers[browser]"
"""

from __future__ import annotations

import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def _check_playwright_available() -> bool:
    """Check if the playwright package is installed."""
    try:
        import playwright  # noqa: F401

        return True
    except ImportError:
        return False


def _auto_install_chromium() -> bool:
    """Auto-install the Chromium browser on first use.

    Returns:
        True if install succeeded or already installed, False on failure.
    """
    logger.info("Installing Chromium browser for headless rendering...")
    try:
        result = subprocess.run(
            [sys.executable, "-m", "playwright", "install", "chromium"],
            capture_output=True,
            text=True,
            timeout=300,
        )
        if result.returncode == 0:
            logger.info("Chromium installed successfully.")
            return True
        logger.error("Chromium install failed: %s", result.stderr)
        return False
    except Exception as e:
        logger.error("Failed to install Chromium: %s", e)
        return False


class BrowserRenderer:
    """Render JavaScript pages using Playwright headless Chromium.

    Usage:
        renderer = BrowserRenderer()
        html = renderer.render_page("https://docs.discord.com")
        renderer.close()

    Or as a context manager:
        with BrowserRenderer() as renderer:
            html = renderer.render_page(url)
    """

    def __init__(self, timeout: int = 30000, wait_until: str = "networkidle"):
        """Initialize the renderer.

        Args:
            timeout: Page load timeout in milliseconds (default: 30s)
            wait_until: Playwright wait condition — "networkidle", "load", "domcontentloaded"
        """
        if not _check_playwright_available():
            raise ImportError(
                "Playwright is required for --browser mode.\n"
                "Install it with: pip install 'skill-seekers[browser]'\n"
                "Then run: playwright install chromium"
            )
        self._timeout = timeout
        self._wait_until = wait_until
        self._playwright = None
        self._browser = None
        self._context = None

    def _ensure_browser(self) -> None:
        """Launch the browser if not already running. Auto-installs Chromium if needed."""
        if self._browser is not None:
            return
        from playwright.sync_api import sync_playwright

        self._playwright = sync_playwright().start()
        try:
            self._browser = self._playwright.chromium.launch(headless=True)
        except Exception:
            # Browser not installed — try auto-install
            logger.warning("Chromium not found. Attempting auto-install...")
            if _auto_install_chromium():
                self._browser = self._playwright.chromium.launch(headless=True)
            else:
                self._playwright.stop()
                self._playwright = None
                raise RuntimeError(
                    "Could not launch Chromium. Run: playwright install chromium"
                ) from None
        self._context = self._browser.new_context(
            user_agent="Mozilla/5.0 (Documentation Scraper)"
        )

    def render_page(self, url: str) -> str:
        """Render a page with JavaScript execution and return the HTML.

        Args:
            url: URL to render

        Returns:
            Fully-rendered HTML string after JavaScript execution

        Raises:
            RuntimeError: If the browser cannot be launched
            TimeoutError: If the page load times out
        """
        self._ensure_browser()
        page = self._context.new_page()
        try:
            page.goto(url, wait_until=self._wait_until, timeout=self._timeout)
            html = page.content()
            return html
        finally:
            page.close()

    def close(self) -> None:
        """Shut down the browser and Playwright."""
        if self._context:
            self._context.close()
            self._context = None
        if self._browser:
            self._browser.close()
            self._browser = None
        if self._playwright:
            self._playwright.stop()
            self._playwright = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
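
Because the dependency is optional, callers can degrade gracefully. A minimal
caller-side sketch (the fallback branch is illustrative, not part of this
module):

    try:
        from skill_seekers.cli.browser_renderer import BrowserRenderer

        with BrowserRenderer() as renderer:
            html = renderer.render_page(url)
    except ImportError:
        # __init__ raises ImportError with install instructions when the
        # playwright package is missing; fall back to a plain HTTP fetch
        html = None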


@@ -221,6 +221,8 @@ class CreateCommand:
            argv.append("--async")
        if getattr(self.args, "no_rate_limit", False):
            argv.append("--no-rate-limit")
        if getattr(self.args, "browser", False):
            argv.append("--browser")

        # Call doc_scraper with modified argv
        logger.debug(f"Calling doc_scraper with argv: {argv}")

skill_seekers/cli/doc_scraper.py

@@ -184,6 +184,10 @@ class DocToSkillConverter:
        self.llms_txt_variant = None
        self.llms_txt_variants: list[str] = []  # Track all downloaded variants

        # Browser rendering mode (for JavaScript SPA sites)
        self.browser_mode = config.get("browser", False)
        self._browser_renderer = None

        # Parallel scraping config
        self.workers = config.get("workers", 1)
        self.async_mode = config.get("async_mode", DEFAULT_ASYNC_MODE)
@@ -712,6 +716,24 @@ class DocToSkillConverter:
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(page, f, indent=2, ensure_ascii=False)

    def _render_with_browser(self, url: str) -> str:
        """Render a page using the headless browser (Playwright).

        Lazily initializes the BrowserRenderer on first call.

        Args:
            url: URL to render

        Returns:
            Fully-rendered HTML string
        """
        if self._browser_renderer is None:
            from skill_seekers.cli.browser_renderer import BrowserRenderer

            self._browser_renderer = BrowserRenderer()
            logger.info("Launched headless browser for JavaScript rendering")
        return self._browser_renderer.render_page(url)

    def scrape_page(self, url: str) -> None:
        """Scrape a single page with thread-safe operations.
@@ -730,6 +752,12 @@ class DocToSkillConverter:
        url = sanitize_url(url)

        # Scraping part (no lock needed - independent)
        if self.browser_mode and not self._has_md_extension(url):
            # Use Playwright headless browser for JavaScript rendering
            html = self._render_with_browser(url)
            soup = BeautifulSoup(html, "html.parser")
            page = self.extract_content(soup, url)
        else:
            headers = {"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
@@ -788,6 +816,15 @@ class DocToSkillConverter:
        # Sanitise brackets before fetching (safety net; see #284)
        url = sanitize_url(url)

        if self.browser_mode and not self._has_md_extension(url):
            # Use Playwright in executor (sync API in async context)
            loop = asyncio.get_event_loop()
            html = await loop.run_in_executor(
                None, self._render_with_browser, url
            )
            soup = BeautifulSoup(html, "html.parser")
            page = self.extract_content(soup, url)
        else:
            # Async HTTP request
            headers = {"User-Agent": "Mozilla/5.0 (Documentation Scraper)"}
            response = await client.get(url, headers=headers, timeout=30.0)
@@ -1370,6 +1407,11 @@ class DocToSkillConverter:
        self._log_scrape_completion()
        self.save_summary()

        # Clean up browser renderer if used
        if self._browser_renderer is not None:
            self._browser_renderer.close()
            self._browser_renderer = None

    def _log_scrape_completion(self) -> None:
        """Log scrape completion with accurate saved/skipped counts."""
        visited = len(self.visited_urls)
@@ -1391,8 +1433,9 @@ class DocToSkillConverter:
        if visited >= 5 and self.pages_saved == 0:
            logger.warning(
                "⚠️ All %d pages had empty content. This site likely requires "
                "JavaScript rendering (SPA/React/Vue).\n"
                "   Try: skill-seekers create <url> --browser\n"
                "   Install: pip install 'skill-seekers[browser]'",
                visited,
            )
        elif visited >= 10 and self.pages_skipped > 0:
@@ -1400,7 +1443,8 @@ class DocToSkillConverter:
            if skip_ratio > 0.8:
                logger.warning(
                    "⚠️ %d%% of pages had empty content. This site may use "
                    "JavaScript rendering for some pages.\n"
                    "   Try: skill-seekers create <url> --browser",
                    int(skip_ratio * 100),
                )
@@ -2212,6 +2256,11 @@ def get_configuration(args: argparse.Namespace) -> dict[str, Any]:
            "⚠️ Async mode enabled but workers=1. Consider using --workers 4 for better performance"
        )

    # Apply CLI override for browser mode
    if getattr(args, "browser", False):
        config["browser"] = True
        logger.info("🌐 Browser mode enabled (Playwright headless Chromium)")

    # Apply CLI override for max_pages
    if args.max_pages is not None:
        old_max = config.get("max_pages", DEFAULT_MAX_PAGES)
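
The async path cannot await the sync Playwright API directly, so it hands the
render to a thread. In isolation the pattern looks like this (the wrapper
coroutine is illustrative; only _render_with_browser comes from the diff
above):

    import asyncio

    async def fetch_rendered(scraper, url: str) -> str:
        loop = asyncio.get_event_loop()
        # _render_with_browser is synchronous; run_in_executor keeps the
        # event loop responsive while Chromium renders the page
        return await loop.run_in_executor(None, scraper._render_with_browser, url)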


@@ -0,0 +1,152 @@
"""Tests for browser_renderer.py (#321).

Real end-to-end tests using actual Playwright + Chromium.
"""

from __future__ import annotations

from skill_seekers.cli.browser_renderer import (
    BrowserRenderer,
    _auto_install_chromium,
    _check_playwright_available,
)


class TestPlaywrightAvailability:
    """Test that playwright is properly detected."""

    def test_playwright_is_available(self):
        assert _check_playwright_available() is True

    def test_auto_install_succeeds(self):
        # Chromium is already installed, so this should be a no-op success
        assert _auto_install_chromium() is True


class TestBrowserRendererReal:
    """Real end-to-end tests with actual Chromium."""

    def test_render_simple_page(self):
        """Render a real page and get HTML back."""
        with BrowserRenderer() as renderer:
            html = renderer.render_page("https://example.com")
        assert "<html" in html.lower()
        assert "Example Domain" in html

    def test_render_returns_js_content(self):
        """Verify that JS-generated content is captured (not just the shell)."""
        with BrowserRenderer() as renderer:
            html = renderer.render_page("https://example.com")
        # example.com has static content, but the point is we get real HTML
        assert len(html) > 500
        assert "<body" in html.lower()

    def test_multiple_pages_reuse_browser(self):
        """Rendering multiple pages should reuse the same browser instance."""
        with BrowserRenderer() as renderer:
            html1 = renderer.render_page("https://example.com")
            html2 = renderer.render_page("https://example.com")
        assert "Example Domain" in html1
        assert "Example Domain" in html2

    def test_close_cleans_up(self):
        """After close(), internal state is None."""
        renderer = BrowserRenderer()
        renderer.render_page("https://example.com")
        assert renderer._browser is not None
        renderer.close()
        assert renderer._browser is None
        assert renderer._context is None
        assert renderer._playwright is None

    def test_context_manager_cleans_up(self):
        """Context manager calls close on exit."""
        with BrowserRenderer() as renderer:
            renderer.render_page("https://example.com")
            assert renderer._browser is not None
        assert renderer._browser is None

    def test_timeout_parameter(self):
        """Custom timeout is respected."""
        renderer = BrowserRenderer(timeout=5000)
        assert renderer._timeout == 5000
        renderer.close()

    def test_wait_until_parameter(self):
        """Custom wait_until is respected."""
        renderer = BrowserRenderer(wait_until="domcontentloaded")
        assert renderer._wait_until == "domcontentloaded"
        renderer.close()


class TestDocScraperBrowserIntegration:
    """Test that doc_scraper correctly accepts browser config."""

    def test_browser_mode_config_sets_attribute(self):
        from skill_seekers.cli.doc_scraper import DocToSkillConverter

        config = {
            "name": "test",
            "base_url": "https://example.com",
            "browser": True,
            "selectors": {},
            "url_patterns": {"include": [], "exclude": []},
        }
        scraper = DocToSkillConverter(config)
        assert scraper.browser_mode is True
        assert scraper._browser_renderer is None

    def test_browser_mode_default_false(self):
        from skill_seekers.cli.doc_scraper import DocToSkillConverter

        config = {
            "name": "test",
            "base_url": "https://example.com",
            "selectors": {},
            "url_patterns": {"include": [], "exclude": []},
        }
        scraper = DocToSkillConverter(config)
        assert scraper.browser_mode is False

    def test_render_with_browser_returns_html(self):
        """Test the _render_with_browser helper directly."""
        from skill_seekers.cli.doc_scraper import DocToSkillConverter

        config = {
            "name": "test",
            "base_url": "https://example.com",
            "browser": True,
            "selectors": {},
            "url_patterns": {"include": [], "exclude": []},
        }
        scraper = DocToSkillConverter(config)
        html = scraper._render_with_browser("https://example.com")
        assert "Example Domain" in html
        assert scraper._browser_renderer is not None
        # Clean up
        scraper._browser_renderer.close()


class TestBrowserArgument:
    """Test that the --browser argument is registered in the CLI."""

    def test_scrape_parser_accepts_browser_flag(self):
        from skill_seekers.cli.doc_scraper import setup_argument_parser

        parser = setup_argument_parser()
        args = parser.parse_args(["--name", "test", "--url", "https://example.com", "--browser"])
        assert args.browser is True

    def test_scrape_parser_browser_default_false(self):
        from skill_seekers.cli.doc_scraper import setup_argument_parser

        parser = setup_argument_parser()
        args = parser.parse_args(["--name", "test", "--url", "https://example.com"])
        assert args.browser is False