release: add scrapling-skill and fix script compatibility

- add scrapling-skill with validated CLI workflow, diagnostics, packaging, and docs integration
- fix skill-creator package_skill.py so direct script invocation works from repo root
- fix continue-claude-work extract_resume_context.py typing compatibility for local python3
- bump marketplace to 1.39.0 and update skill versions
daymade
2026-03-18 23:08:55 +08:00
parent d8a7d45e53
commit 2192458ef7
13 changed files with 722 additions and 36 deletions


@@ -0,0 +1,4 @@
Security scan passed
Scanned at: 2026-03-18T22:52:43.734452
Tool: gitleaks + pattern-based validation
Content hash: 06351e5794510c584fdf29351eb5161f4b12e213f512c3148212c82c357d124a

scrapling-skill/SKILL.md

@@ -0,0 +1,183 @@
---
name: scrapling-skill
description: Install, troubleshoot, and use Scrapling CLI to extract HTML, Markdown, or text from webpages. Use this skill whenever the user mentions Scrapling, `uv tool install scrapling`, `scrapling extract`, WeChat/mp.weixin articles, browser-backed page fetching, or needs help deciding between static and dynamic extraction.
---
# Scrapling Skill
## Overview
Use Scrapling through its CLI as the default path. Start with the smallest working command, validate the saved output, and only escalate to browser-backed fetching when the static fetch does not contain the real page content.
Do not assume the user's Scrapling install is healthy. Verify it first.
## Default Workflow
Copy this checklist and keep it updated while working:
```text
Scrapling Progress:
- [ ] Step 1: Diagnose the local Scrapling install
- [ ] Step 2: Fix CLI extras or browser runtime if needed
- [ ] Step 3: Choose static or dynamic fetch
- [ ] Step 4: Save output to a file
- [ ] Step 5: Validate file size and extracted content
- [ ] Step 6: Escalate only if the previous path failed
```
## Step 1: Diagnose the Install
Run the bundled diagnostic script first:
```bash
python3 scripts/diagnose_scrapling.py
```
Use the result as the source of truth for the next step.
## Step 2: Fix the Install
### If the CLI was installed without extras
If `scrapling --help` fails with missing `click` or a message about installing Scrapling with extras, reinstall it with the CLI extra:
```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```
Do not default to `scrapling[all]` unless the user explicitly needs the broader feature set.
### If browser-backed fetchers are needed
Install the Playwright runtime:
```bash
scrapling install
```
If the install looks slow or opaque, read `references/troubleshooting.md` before guessing. Do not claim success until either:
- `scrapling install` reports that dependencies are already installed, or
- the diagnostic script confirms both Chromium and Chrome Headless Shell are present.
## Step 3: Choose the Fetcher
Use this decision rule:
- Start with `extract get` for normal pages, article pages, and most WeChat public articles.
- Use `extract fetch` when the static HTML does not contain the real content or the page depends on JavaScript rendering.
- Use `extract stealthy-fetch` only after `fetch` still fails because of anti-bot or challenge behavior. Do not make it the default.
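The escalation order above can be sketched as a small helper. This is an illustrative sketch, not part of the skill's scripts: the subcommand names (`get`, `fetch`, `stealthy-fetch`) and the `-s` selector flag come from this document, while the function name and structure are assumptions.

```python
# Hypothetical helper: build the Scrapling CLI command for each escalation
# level. Try the cheapest fetcher first; only move down the list on failure.
from typing import List

ESCALATION_ORDER = ["get", "fetch", "stealthy-fetch"]

def build_command(mode: str, url: str, output: str, selector: str = "") -> List[str]:
    if mode not in ESCALATION_ORDER:
        raise ValueError("unknown fetch mode: " + mode)
    cmd = ["scrapling", "extract", mode, url, output]
    if selector:
        cmd.extend(["-s", selector])  # restrict extraction to a CSS selector
    return cmd

for mode in ESCALATION_ORDER:
    print(build_command(mode, "https://example.com", "page.html"))
```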
## Step 4: Run the Smallest Useful Command
Always quote URLs in shell commands. This is mandatory in `zsh` when the URL contains `?`, `&`, or other special characters.
### Full page to HTML
```bash
scrapling extract get 'https://example.com' page.html
```
### Main content to Markdown
```bash
scrapling extract get 'https://example.com' article.md -s 'main'
```
### JS-rendered page with browser automation
```bash
scrapling extract fetch 'https://example.com' page.html --timeout 20000
```
### WeChat public article body
Use `#js_content` first. This is the default selector for article body extraction on `mp.weixin.qq.com` pages.
```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```
## Step 5: Validate the Output
After every extraction, verify the file instead of assuming success:
```bash
wc -c article.md
sed -n '1,40p' article.md
```
For HTML output, check that the expected title, container, or selector target is actually present:
```bash
rg -n '<title>|js_content|rich_media_title|main' page.html
```
If the file is tiny, empty, or missing the expected container, the extraction did not succeed. Go back to Step 3 and switch fetchers or selectors.
## Step 6: Handle Known Failure Modes
### Local TLS trust store problem
If `extract get` fails with `curl: (60) SSL certificate problem`, treat it as a local trust-store problem first, not a Scrapling content failure.
Retry the same command with:
```bash
--no-verify
```
Only do this after confirming the failure matches the local certificate verification error pattern. Do not silently disable verification by default.
### WeChat article pages
For `mp.weixin.qq.com`:
- Try `extract get` before `extract fetch`
- Use `-s '#js_content'` for the article body
- Validate the saved Markdown or HTML immediately
### Browser-backed fetch failures
If `extract fetch` fails:
1. Re-check the install with `python3 scripts/diagnose_scrapling.py`
2. Confirm Chromium and Chrome Headless Shell are present
3. Retry with a slightly longer timeout
4. Escalate to `stealthy-fetch` only if the site behavior justifies it
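Step 3 above (retry with a slightly longer timeout) can be sketched as a command generator. The growth factor and attempt count are assumptions for illustration; only the `scrapling extract fetch … --timeout` shape comes from this document.

```python
# Yield `extract fetch` command lines with a progressively longer timeout.
from typing import Iterator, List

def fetch_commands_with_backoff(url: str, output: str,
                                base_timeout_ms: int = 20000,
                                attempts: int = 3) -> Iterator[List[str]]:
    timeout = base_timeout_ms
    for _ in range(attempts):
        yield ["scrapling", "extract", "fetch", url, output,
               "--timeout", str(timeout)]
        timeout = int(timeout * 1.5)  # lengthen the timeout before retrying
```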
## Command Patterns
### Diagnose and smoke test a URL
```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```
### Diagnose and smoke test a WeChat article body
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
--selector '#js_content' \
--no-verify
```
### Diagnose and smoke test a browser-backed fetch
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://example.com' \
--dynamic
```
## Guardrails
- Do not tell the user to reinstall blindly. Verify first.
- Do not default to the Python library API when the user is clearly asking about the CLI.
- Do not jump to browser-backed fetching unless the static result is missing the real content.
- Do not claim success from exit code alone. Inspect the saved file.
- Do not hardcode user-specific absolute paths into outputs or docs.
## Resources
- Installation and smoke test helper: `scripts/diagnose_scrapling.py`
- Verified failure modes and recovery paths: `references/troubleshooting.md`

references/troubleshooting.md

@@ -0,0 +1,164 @@
# Scrapling Troubleshooting
## Contents
- Installation modes
- Verified failure modes
- Static vs dynamic fetch choice
- WeChat extraction pattern
- Smoke test commands
## Installation Modes
Use the CLI path as the default:
```bash
uv tool install 'scrapling[shell]'
```
Do not assume `uv tool install scrapling` is enough for CLI usage. The base package may install the executable wrapper without the optional CLI dependencies.
## Verified Failure Modes
### 1. CLI installed without extras
Symptom:
- `scrapling --help` fails
- Output mentions missing `click`
- Output says Scrapling must be installed with extras
Recovery:
```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```
### 2. Browser-backed fetchers not ready
Symptom:
- `extract fetch` or `extract stealthy-fetch` fails because the Playwright runtime is not installed
- Scrapling has not downloaded Chromium or Chrome Headless Shell
Recovery:
```bash
scrapling install
```
Success signals:
- `scrapling install` later reports `The dependencies are already installed`
- Browser caches contain both:
- `chromium-*`
- `chromium_headless_shell-*`
Typical cache roots:
- `~/Library/Caches/ms-playwright/`
- `~/.cache/ms-playwright/`
### 3. Static fetch TLS trust-store failure
Symptom:
- `extract get` fails with `curl: (60) SSL certificate problem`
Interpretation:
- Treat this as a local certificate verification problem first
- Do not assume the target URL or Scrapling itself is broken
Recovery:
Retry the same static command with:
```bash
--no-verify
```
Do not make `--no-verify` the default. Use it only after the failure matches this certificate-verification pattern.
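The "only after the failure matches this pattern" gate can be sketched as a classifier over the command's stderr. The error strings come from this document's symptom description; the function name is an assumption.

```python
# Only treat the failure as a local trust-store problem (and thus a
# --no-verify candidate) when stderr matches curl's certificate error.
def is_local_tls_failure(stderr: str) -> bool:
    lowered = stderr.lower()
    return "curl: (60)" in stderr and "ssl certificate problem" in lowered
```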
## Static vs Dynamic Fetch Choice
Use this order:
1. `extract get`
2. `extract fetch`
3. `extract stealthy-fetch`
Use `extract get` when:
- The page is mostly server-rendered
- The content is likely already present in raw HTML
- The target is an article page with a stable content container
Use `extract fetch` when:
- Static HTML does not contain the real content
- The site depends on JavaScript rendering
- The page content appears only after runtime hydration
Use `extract stealthy-fetch` when:
- `fetch` still fails
- The target site shows challenge or anti-bot behavior
## WeChat Extraction Pattern
For `mp.weixin.qq.com` public article pages:
- Start with `extract get`
- Use the selector `#js_content`
- Validate the saved file immediately
Example:
```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```
Observed behavior:
- The static fetch can already contain the real article body
- Browser-backed fetch is often unnecessary for article extraction
## Smoke Test Commands
### Basic diagnosis
```bash
python3 scripts/diagnose_scrapling.py
```
### Static extraction smoke test
```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```
### WeChat article smoke test
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
--selector '#js_content'
```
### Dynamic extraction smoke test
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://example.com' \
--dynamic
```
### Validate saved output
```bash
wc -c article.md
sed -n '1,40p' article.md
rg -n '<title>|js_content|main|rich_media_title' page.html
```

scripts/diagnose_scrapling.py

@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""
Diagnose a local Scrapling CLI installation and optionally run a smoke test.
"""
import argparse
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple
def run_command(cmd: List[str]) -> Tuple[int, str, str]:
result = subprocess.run(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
universal_newlines=True,
check=False,
)
return result.returncode, result.stdout, result.stderr
def print_section(title: str) -> None:
print("")
print(title)
print("-" * len(title))
def existing_dirs(paths: Iterable[Path]) -> List[str]:
return [str(path) for path in paths if path.exists()]
def detect_browser_cache() -> Tuple[List[str], List[str]]:
roots = [
Path.home() / "Library" / "Caches" / "ms-playwright",
Path.home() / ".cache" / "ms-playwright",
]
chromium = []
headless_shell = []
for root in roots:
if not root.exists():
continue
chromium.extend(existing_dirs(sorted(root.glob("chromium-*"))))
headless_shell.extend(existing_dirs(sorted(root.glob("chromium_headless_shell-*"))))
return chromium, headless_shell
def diagnose_cli() -> bool:
print_section("CLI")
scrapling_path = shutil.which("scrapling")
if not scrapling_path:
print("status: missing")
print("fix: install with `uv tool install 'scrapling[shell]'`")
return False
print("path: {0}".format(scrapling_path))
code, stdout, stderr = run_command(["scrapling", "--help"])
output = (stdout + "\n" + stderr).strip()
if code == 0:
print("status: working")
return True
print("status: broken")
if "install scrapling with any of the extras" in output.lower() or "no module named 'click'" in output.lower():
print("cause: installed without CLI extras")
print("fix: `uv tool uninstall scrapling` then `uv tool install 'scrapling[shell]'`")
else:
print("cause: unknown")
if output:
print("details:")
print(output[:1200])
return False
def diagnose_browsers() -> None:
print_section("Browser Runtime")
chromium, headless_shell = detect_browser_cache()
print("chromium: {0}".format("present" if chromium else "missing"))
for path in chromium:
print(" - {0}".format(path))
print("chrome-headless-shell: {0}".format("present" if headless_shell else "missing"))
for path in headless_shell:
print(" - {0}".format(path))
if not chromium or not headless_shell:
print("hint: run `scrapling install` before browser-backed fetches")
def preview_file(path: Path, preview_lines: int) -> None:
print_section("Smoke Test Output")
if not path.exists():
print("status: missing output file")
return
size = path.stat().st_size
print("path: {0}".format(path))
print("bytes: {0}".format(size))
if size == 0:
print("status: empty")
return
if path.suffix in (".md", ".txt"):
print("preview:")
with path.open("r", encoding="utf-8", errors="replace") as handle:
for index, line in enumerate(handle):
if index >= preview_lines:
break
print(line.rstrip())
def run_smoke_test(args: argparse.Namespace) -> int:
print_section("Smoke Test")
suffix = ".html"
if args.selector:
suffix = ".md"
output_path = Path(tempfile.gettempdir()) / ("scrapling-smoke" + suffix)
if output_path.exists():
output_path.unlink()
cmd = ["scrapling", "extract", "fetch" if args.dynamic else "get", args.url, str(output_path)]
if args.selector:
cmd.extend(["-s", args.selector])
if args.dynamic:
cmd.extend(["--timeout", str(args.timeout)])
elif args.no_verify:
cmd.append("--no-verify")
print("command: {0}".format(" ".join(cmd)))
code, stdout, stderr = run_command(cmd)
if stdout.strip():
print(stdout.strip())
if stderr.strip():
print(stderr.strip())
preview_file(output_path, args.preview_lines)
return code
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Diagnose Scrapling and run an optional smoke test.")
parser.add_argument("--url", help="Optional URL for a smoke test")
parser.add_argument("--selector", help="Optional CSS selector for the smoke test")
parser.add_argument(
"--dynamic",
action="store_true",
help="Use `scrapling extract fetch` instead of `scrapling extract get`",
)
parser.add_argument(
"--no-verify",
action="store_true",
help="Pass `--no-verify` to static smoke tests",
)
parser.add_argument(
"--timeout",
type=int,
default=20000,
help="Timeout in milliseconds for dynamic smoke tests",
)
parser.add_argument(
"--preview-lines",
type=int,
default=20,
help="Number of preview lines for markdown/text output",
)
return parser
def main() -> int:
parser = build_parser()
args = parser.parse_args()
cli_ok = diagnose_cli()
diagnose_browsers()
if not cli_ok:
return 1
if not args.url:
return 0
return run_smoke_test(args)
if __name__ == "__main__":
sys.exit(main())