release: add scrapling-skill and fix script compatibility

- add scrapling-skill with validated CLI workflow, diagnostics, packaging, and docs integration
- fix skill-creator package_skill.py so direct script invocation works from repo root
- fix continue-claude-work extract_resume_context.py typing compatibility for local python3
- bump marketplace to 1.39.0 and update skill versions
daymade
2026-03-18 23:08:55 +08:00
parent d8a7d45e53
commit 2192458ef7
13 changed files with 722 additions and 36 deletions


@@ -0,0 +1,4 @@
Security scan passed
Scanned at: 2026-03-18T22:52:43.734452
Tool: gitleaks + pattern-based validation
Content hash: 06351e5794510c584fdf29351eb5161f4b12e213f512c3148212c82c357d124a

scrapling-skill/SKILL.md

@@ -0,0 +1,183 @@
---
name: scrapling-skill
description: Install, troubleshoot, and use Scrapling CLI to extract HTML, Markdown, or text from webpages. Use this skill whenever the user mentions Scrapling, `uv tool install scrapling`, `scrapling extract`, WeChat/mp.weixin articles, browser-backed page fetching, or needs help deciding between static and dynamic extraction.
---
# Scrapling Skill
## Overview
Use Scrapling through its CLI as the default path. Start with the smallest working command, validate the saved output, and only escalate to browser-backed fetching when the static fetch does not contain the real page content.
Do not assume the user's Scrapling install is healthy. Verify it first.
## Default Workflow
Copy this checklist and keep it updated while working:
```text
Scrapling Progress:
- [ ] Step 1: Diagnose the local Scrapling install
- [ ] Step 2: Fix CLI extras or browser runtime if needed
- [ ] Step 3: Choose static or dynamic fetch
- [ ] Step 4: Save output to a file
- [ ] Step 5: Validate file size and extracted content
- [ ] Step 6: Escalate only if the previous path failed
```
## Step 1: Diagnose the Install
Run the bundled diagnostic script first:
```bash
python3 scripts/diagnose_scrapling.py
```
Use the result as the source of truth for the next step.
## Step 2: Fix the Install
### If the CLI was installed without extras
If `scrapling --help` fails with missing `click` or a message about installing Scrapling with extras, reinstall it with the CLI extra:
```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```
Do not default to `scrapling[all]` unless the user explicitly needs the broader feature set.
### If browser-backed fetchers are needed
Install the Playwright runtime:
```bash
scrapling install
```
If the install looks slow or opaque, read `references/troubleshooting.md` before guessing. Do not claim success until either:
- `scrapling install` reports that dependencies are already installed, or
- the diagnostic script confirms both Chromium and Chrome Headless Shell are present.
## Step 3: Choose the Fetcher
Use this decision rule:
- Start with `extract get` for normal pages, article pages, and most WeChat public articles.
- Use `extract fetch` when the static HTML does not contain the real content or the page depends on JavaScript rendering.
- Use `extract stealthy-fetch` only after `fetch` still fails because of anti-bot or challenge behavior. Do not make it the default.
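The escalation order above can be sketched as a small helper. This is an illustrative sketch, not part of the skill's scripts: the subcommand names (`get`, `fetch`, `stealthy-fetch`) and the `-s` selector flag come from this document, while the function name and structure are assumptions.

```python
# Hypothetical helper: build the Scrapling CLI command for each escalation
# level. Try the cheapest fetcher first; only move down the list on failure.
from typing import List

ESCALATION_ORDER = ["get", "fetch", "stealthy-fetch"]

def build_command(mode: str, url: str, output: str, selector: str = "") -> List[str]:
    if mode not in ESCALATION_ORDER:
        raise ValueError("unknown fetch mode: " + mode)
    cmd = ["scrapling", "extract", mode, url, output]
    if selector:
        cmd.extend(["-s", selector])  # restrict extraction to a CSS selector
    return cmd

for mode in ESCALATION_ORDER:
    print(build_command(mode, "https://example.com", "page.html"))
```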
## Step 4: Run the Smallest Useful Command
Always quote URLs in shell commands. This is mandatory in `zsh` when the URL contains `?`, `&`, or other special characters.
### Full page to HTML
```bash
scrapling extract get 'https://example.com' page.html
```
### Main content to Markdown
```bash
scrapling extract get 'https://example.com' article.md -s 'main'
```
### JS-rendered page with browser automation
```bash
scrapling extract fetch 'https://example.com' page.html --timeout 20000
```
### WeChat public article body
Use `#js_content` first. This is the default selector for article body extraction on `mp.weixin.qq.com` pages.
```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```
## Step 5: Validate the Output
After every extraction, verify the file instead of assuming success:
```bash
wc -c article.md
sed -n '1,40p' article.md
```
For HTML output, check that the expected title, container, or selector target is actually present:
```bash
rg -n '<title>|js_content|rich_media_title|main' page.html
```
If the file is tiny, empty, or missing the expected container, the extraction did not succeed. Go back to Step 3 and switch fetchers or selectors.
## Step 6: Handle Known Failure Modes
### Local TLS trust store problem
If `extract get` fails with `curl: (60) SSL certificate problem`, treat it as a local trust-store problem first, not a Scrapling content failure.
Retry the same command with:
```bash
--no-verify
```
Only do this after confirming the failure matches the local certificate verification error pattern. Do not silently disable verification by default.
### WeChat article pages
For `mp.weixin.qq.com`:
- Try `extract get` before `extract fetch`
- Use `-s '#js_content'` for the article body
- Validate the saved Markdown or HTML immediately
### Browser-backed fetch failures
If `extract fetch` fails:
1. Re-check the install with `python3 scripts/diagnose_scrapling.py`
2. Confirm Chromium and Chrome Headless Shell are present
3. Retry with a slightly longer timeout
4. Escalate to `stealthy-fetch` only if the site behavior justifies it
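Step 3 above (retry with a slightly longer timeout) can be sketched as a command generator. The growth factor and attempt count are assumptions for illustration; only the `scrapling extract fetch … --timeout` shape comes from this document.

```python
# Yield `extract fetch` command lines with a progressively longer timeout.
from typing import Iterator, List

def fetch_commands_with_backoff(url: str, output: str,
                                base_timeout_ms: int = 20000,
                                attempts: int = 3) -> Iterator[List[str]]:
    timeout = base_timeout_ms
    for _ in range(attempts):
        yield ["scrapling", "extract", "fetch", url, output,
               "--timeout", str(timeout)]
        timeout = int(timeout * 1.5)  # lengthen the timeout before retrying
```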
## Command Patterns
### Diagnose and smoke test a URL
```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```
### Diagnose and smoke test a WeChat article body
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
--selector '#js_content' \
--no-verify
```
### Diagnose and smoke test a browser-backed fetch
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://example.com' \
--dynamic
```
## Guardrails
- Do not tell the user to reinstall blindly. Verify first.
- Do not default to the Python library API when the user is clearly asking about the CLI.
- Do not jump to browser-backed fetching unless the static result is missing the real content.
- Do not claim success from exit code alone. Inspect the saved file.
- Do not hardcode user-specific absolute paths into outputs or docs.
## Resources
- Installation and smoke test helper: `scripts/diagnose_scrapling.py`
- Verified failure modes and recovery paths: `references/troubleshooting.md`

references/troubleshooting.md

@@ -0,0 +1,164 @@
# Scrapling Troubleshooting
## Contents
- Installation modes
- Verified failure modes
- Static vs dynamic fetch choice
- WeChat extraction pattern
- Smoke test commands
## Installation Modes
Use the CLI path as the default:
```bash
uv tool install 'scrapling[shell]'
```
Do not assume `uv tool install scrapling` is enough for CLI usage. The base package may install the executable wrapper without the optional CLI dependencies.
## Verified Failure Modes
### 1. CLI installed without extras
Symptom:
- `scrapling --help` fails
- Output mentions missing `click`
- Output says Scrapling must be installed with extras
Recovery:
```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```
### 2. Browser-backed fetchers not ready
Symptom:
- `extract fetch` or `extract stealthy-fetch` fails because the Playwright runtime is not installed
- Scrapling has not downloaded Chromium or Chrome Headless Shell
Recovery:
```bash
scrapling install
```
Success signals:
- `scrapling install` later reports `The dependencies are already installed`
- Browser caches contain both:
- `chromium-*`
- `chromium_headless_shell-*`
Typical cache roots:
- `~/Library/Caches/ms-playwright/`
- `~/.cache/ms-playwright/`
### 3. Static fetch TLS trust-store failure
Symptom:
- `extract get` fails with `curl: (60) SSL certificate problem`
Interpretation:
- Treat this as a local certificate verification problem first
- Do not assume the target URL or Scrapling itself is broken
Recovery:
Retry the same static command with:
```bash
--no-verify
```
Do not make `--no-verify` the default. Use it only after the failure matches this certificate-verification pattern.
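The "only after the failure matches this pattern" gate can be sketched as a classifier over the command's stderr. The error strings come from this document's symptom description; the function name is an assumption.

```python
# Only treat the failure as a local trust-store problem (and thus a
# --no-verify candidate) when stderr matches curl's certificate error.
def is_local_tls_failure(stderr: str) -> bool:
    lowered = stderr.lower()
    return "curl: (60)" in stderr and "ssl certificate problem" in lowered
```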
## Static vs Dynamic Fetch Choice
Use this order:
1. `extract get`
2. `extract fetch`
3. `extract stealthy-fetch`
Use `extract get` when:
- The page is mostly server-rendered
- The content is likely already present in raw HTML
- The target is an article page with a stable content container
Use `extract fetch` when:
- Static HTML does not contain the real content
- The site depends on JavaScript rendering
- The page content appears only after runtime hydration
Use `extract stealthy-fetch` when:
- `fetch` still fails
- The target site shows challenge or anti-bot behavior
## WeChat Extraction Pattern
For `mp.weixin.qq.com` public article pages:
- Start with `extract get`
- Use the selector `#js_content`
- Validate the saved file immediately
Example:
```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```
Observed behavior:
- The static fetch can already contain the real article body
- Browser-backed fetch is often unnecessary for article extraction
## Smoke Test Commands
### Basic diagnosis
```bash
python3 scripts/diagnose_scrapling.py
```
### Static extraction smoke test
```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```
### WeChat article smoke test
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
--selector '#js_content'
```
### Dynamic extraction smoke test
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://example.com' \
--dynamic
```
### Validate saved output
```bash
wc -c article.md
sed -n '1,40p' article.md
rg -n '<title>|js_content|main|rich_media_title' page.html
```

scripts/diagnose_scrapling.py

@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""
Diagnose a local Scrapling CLI installation and optionally run a smoke test.
"""
import argparse
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple
def run_command(cmd: List[str]) -> Tuple[int, str, str]:
result = subprocess.run(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
universal_newlines=True,
check=False,
)
return result.returncode, result.stdout, result.stderr
def print_section(title: str) -> None:
print("")
print(title)
print("-" * len(title))
def existing_dirs(paths: Iterable[Path]) -> List[str]:
return [str(path) for path in paths if path.exists()]
def detect_browser_cache() -> Tuple[List[str], List[str]]:
roots = [
Path.home() / "Library" / "Caches" / "ms-playwright",
Path.home() / ".cache" / "ms-playwright",
]
chromium = []
headless_shell = []
for root in roots:
if not root.exists():
continue
chromium.extend(existing_dirs(sorted(root.glob("chromium-*"))))
headless_shell.extend(existing_dirs(sorted(root.glob("chromium_headless_shell-*"))))
return chromium, headless_shell
def diagnose_cli() -> bool:
print_section("CLI")
scrapling_path = shutil.which("scrapling")
if not scrapling_path:
print("status: missing")
print("fix: install with `uv tool install 'scrapling[shell]'`")
return False
print("path: {0}".format(scrapling_path))
code, stdout, stderr = run_command(["scrapling", "--help"])
output = (stdout + "\n" + stderr).strip()
if code == 0:
print("status: working")
return True
print("status: broken")
if "install scrapling with any of the extras" in output.lower() or "no module named 'click'" in output.lower():
print("cause: installed without CLI extras")
print("fix: `uv tool uninstall scrapling` then `uv tool install 'scrapling[shell]'`")
else:
print("cause: unknown")
if output:
print("details:")
print(output[:1200])
return False
def diagnose_browsers() -> None:
print_section("Browser Runtime")
chromium, headless_shell = detect_browser_cache()
print("chromium: {0}".format("present" if chromium else "missing"))
for path in chromium:
print(" - {0}".format(path))
print("chrome-headless-shell: {0}".format("present" if headless_shell else "missing"))
for path in headless_shell:
print(" - {0}".format(path))
if not chromium or not headless_shell:
print("hint: run `scrapling install` before browser-backed fetches")
def preview_file(path: Path, preview_lines: int) -> None:
print_section("Smoke Test Output")
if not path.exists():
print("status: missing output file")
return
size = path.stat().st_size
print("path: {0}".format(path))
print("bytes: {0}".format(size))
if size == 0:
print("status: empty")
return
if path.suffix in (".md", ".txt"):
print("preview:")
with path.open("r", encoding="utf-8", errors="replace") as handle:
for index, line in enumerate(handle):
if index >= preview_lines:
break
print(line.rstrip())
def run_smoke_test(args: argparse.Namespace) -> int:
print_section("Smoke Test")
suffix = ".html"
if args.selector:
suffix = ".md"
output_path = Path(tempfile.gettempdir()) / ("scrapling-smoke" + suffix)
if output_path.exists():
output_path.unlink()
cmd = ["scrapling", "extract", "fetch" if args.dynamic else "get", args.url, str(output_path)]
if args.selector:
cmd.extend(["-s", args.selector])
if args.dynamic:
cmd.extend(["--timeout", str(args.timeout)])
elif args.no_verify:
cmd.append("--no-verify")
print("command: {0}".format(" ".join(cmd)))
code, stdout, stderr = run_command(cmd)
if stdout.strip():
print(stdout.strip())
if stderr.strip():
print(stderr.strip())
preview_file(output_path, args.preview_lines)
return code
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Diagnose Scrapling and run an optional smoke test.")
parser.add_argument("--url", help="Optional URL for a smoke test")
parser.add_argument("--selector", help="Optional CSS selector for the smoke test")
parser.add_argument(
"--dynamic",
action="store_true",
help="Use `scrapling extract fetch` instead of `scrapling extract get`",
)
parser.add_argument(
"--no-verify",
action="store_true",
help="Pass `--no-verify` to static smoke tests",
)
parser.add_argument(
"--timeout",
type=int,
default=20000,
help="Timeout in milliseconds for dynamic smoke tests",
)
parser.add_argument(
"--preview-lines",
type=int,
default=20,
help="Number of preview lines for markdown/text output",
)
return parser
def main() -> int:
parser = build_parser()
args = parser.parse_args()
cli_ok = diagnose_cli()
diagnose_browsers()
if not cli_ok:
return 1
if not args.url:
return 0
return run_smoke_test(args)
if __name__ == "__main__":
sys.exit(main())