release: add scrapling-skill and fix script compatibility
- add scrapling-skill with validated CLI workflow, diagnostics, packaging, and docs integration
- fix skill-creator package_skill.py so direct script invocation works from repo root
- fix continue-claude-work extract_resume_context.py typing compatibility for local python3
- bump marketplace to 1.39.0 and update skill versions
scrapling-skill/.security-scan-passed (new file, 4 lines)
@@ -0,0 +1,4 @@
Security scan passed
Scanned at: 2026-03-18T22:52:43.734452
Tool: gitleaks + pattern-based validation
Content hash: 06351e5794510c584fdf29351eb5161f4b12e213f512c3148212c82c357d124a
scrapling-skill/SKILL.md (new file, 183 lines)
@@ -0,0 +1,183 @@
---
name: scrapling-skill
description: Install, troubleshoot, and use the Scrapling CLI to extract HTML, Markdown, or text from webpages. Use this skill whenever the user mentions Scrapling, `uv tool install scrapling`, `scrapling extract`, WeChat/mp.weixin articles, browser-backed page fetching, or needs help deciding between static and dynamic extraction.
---

# Scrapling Skill

## Overview

Use Scrapling through its CLI as the default path. Start with the smallest working command, validate the saved output, and escalate to browser-backed fetching only when the static fetch does not contain the real page content.

Do not assume the user's Scrapling install is healthy. Verify it first.
## Default Workflow

Copy this checklist and keep it updated while working:

```text
Scrapling Progress:
- [ ] Step 1: Diagnose the local Scrapling install
- [ ] Step 2: Fix CLI extras or browser runtime if needed
- [ ] Step 3: Choose static or dynamic fetch
- [ ] Step 4: Save output to a file
- [ ] Step 5: Validate file size and extracted content
- [ ] Step 6: Escalate only if the previous path failed
```
## Step 1: Diagnose the Install

Run the bundled diagnostic script first:

```bash
python3 scripts/diagnose_scrapling.py
```

Use the result as the source of truth for the next step.
## Step 2: Fix the Install

### If the CLI was installed without extras

If `scrapling --help` fails with a missing `click` module or a message about installing Scrapling with extras, reinstall it with the CLI extra:

```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```

Do not default to `scrapling[all]` unless the user explicitly needs the broader feature set.

### If browser-backed fetchers are needed

Install the Playwright runtime:

```bash
scrapling install
```

If the install looks slow or opaque, read `references/troubleshooting.md` before guessing. Do not claim success until either:

- `scrapling install` reports that dependencies are already installed, or
- the diagnostic script confirms both Chromium and Chrome Headless Shell are present.
## Step 3: Choose the Fetcher

Use this decision rule:

- Start with `extract get` for normal pages, article pages, and most WeChat public articles.
- Use `extract fetch` when the static HTML does not contain the real content or the page depends on JavaScript rendering.
- Use `extract stealthy-fetch` only after `fetch` still fails because of anti-bot or challenge behavior. Do not make it the default.
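The decision rule above can be sketched as a tiny helper (hypothetical, not part of the skill's bundled scripts); `expected_marker` stands for a string the real content must contain, such as a container id:

```python
def choose_fetcher(static_html, expected_marker, fetch_failed_anti_bot=False):
    """Return the `scrapling extract` subcommand to use next.

    Hypothetical sketch of the decision rule: stay on `get` when the
    static HTML already contains the expected content marker, escalate
    to `fetch` when it does not, and reach for `stealthy-fetch` only
    after a plain `fetch` has failed with anti-bot behavior.
    """
    if static_html and expected_marker in static_html:
        return "get"
    if fetch_failed_anti_bot:
        return "stealthy-fetch"
    return "fetch"


# Static HTML already holds the article container, so `get` is enough.
print(choose_fetcher('<div id="js_content">...</div>', "js_content"))  # get
```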
## Step 4: Run the Smallest Useful Command

Always quote URLs in shell commands. This is mandatory in `zsh` when the URL contains `?`, `&`, or other special characters.

### Full page to HTML

```bash
scrapling extract get 'https://example.com' page.html
```

### Main content to Markdown

```bash
scrapling extract get 'https://example.com' article.md -s 'main'
```

### JS-rendered page with browser automation

```bash
scrapling extract fetch 'https://example.com' page.html --timeout 20000
```

### WeChat public article body

Use `#js_content` first. This is the default selector for article body extraction on `mp.weixin.qq.com` pages.

```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```
## Step 5: Validate the Output

After every extraction, verify the file instead of assuming success:

```bash
wc -c article.md
sed -n '1,40p' article.md
```

For HTML output, check that the expected title, container, or selector target is actually present:

```bash
rg -n '<title>|js_content|rich_media_title|main' page.html
```

If the file is tiny, empty, or missing the expected container, the extraction did not succeed. Go back to Step 3 and switch fetchers or selectors.
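The same checks can be wrapped in one call (a hypothetical sketch, not part of the skill; the 200-byte floor is an assumed threshold, not something Scrapling defines):

```python
import pathlib


def looks_extracted(path, markers, min_bytes=200):
    """Mirror the manual `wc -c` / `rg` validation in one call.

    Hypothetical helper: the saved file must exist, be non-trivially
    sized (`min_bytes` is an assumed floor), and contain at least one
    expected marker such as a title tag or container id.
    """
    p = pathlib.Path(path)
    if not p.exists() or p.stat().st_size < min_bytes:
        return False
    text = p.read_text(encoding="utf-8", errors="replace")
    return any(marker in text for marker in markers)
```

If this returns False, treat the extraction as failed and switch fetchers or selectors, exactly as above.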
## Step 6: Handle Known Failure Modes

### Local TLS trust store problem

If `extract get` fails with `curl: (60) SSL certificate problem`, treat it as a local trust-store problem first, not a Scrapling content failure.

Retry the same command with:

```bash
--no-verify
```

Only do this after confirming the failure matches the local certificate verification error pattern. Do not silently disable verification by default.

### WeChat article pages

For `mp.weixin.qq.com`:

- Try `extract get` before `extract fetch`
- Use `-s '#js_content'` for the article body
- Validate the saved Markdown or HTML immediately

### Browser-backed fetch failures

If `extract fetch` fails:

1. Re-check the install with `python3 scripts/diagnose_scrapling.py`
2. Confirm Chromium and Chrome Headless Shell are present
3. Retry with a slightly longer timeout
4. Escalate to `stealthy-fetch` only if the site behavior justifies it
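For the "slightly longer timeout" retry step, a schedule can be sketched as follows (hypothetical; the 50% growth factor and attempt count are assumptions, not Scrapling defaults):

```python
def retry_timeouts(base_ms=20000, attempts=3, growth=1.5):
    """Timeouts for successive `extract fetch --timeout` retries.

    Each retry waits `growth` times longer than the previous attempt,
    starting from the 20000 ms value used in the examples above.
    """
    timeouts = []
    current = float(base_ms)
    for _ in range(attempts):
        timeouts.append(int(current))
        current *= growth
    return timeouts


print(retry_timeouts())  # [20000, 30000, 45000]
```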
## Command Patterns

### Diagnose and smoke test a URL

```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```

### Diagnose and smoke test a WeChat article body

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
  --selector '#js_content' \
  --no-verify
```

### Diagnose and smoke test a browser-backed fetch

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://example.com' \
  --dynamic
```
## Guardrails

- Do not tell the user to reinstall blindly. Verify first.
- Do not default to the Python library API when the user is clearly asking about the CLI.
- Do not jump to browser-backed fetching unless the static result is missing the real content.
- Do not claim success from the exit code alone. Inspect the saved file.
- Do not hardcode user-specific absolute paths into outputs or docs.

## Resources

- Installation and smoke test helper: `scripts/diagnose_scrapling.py`
- Verified failure modes and recovery paths: `references/troubleshooting.md`
scrapling-skill/references/troubleshooting.md (new file, 164 lines)
@@ -0,0 +1,164 @@
# Scrapling Troubleshooting

## Contents

- Installation modes
- Verified failure modes
- Static vs dynamic fetch choice
- WeChat extraction pattern
- Smoke test commands
## Installation Modes

Use the CLI path as the default:

```bash
uv tool install 'scrapling[shell]'
```

Do not assume `uv tool install scrapling` is enough for CLI usage. The base package may install the executable wrapper without the optional CLI dependencies.
## Verified Failure Modes

### 1. CLI installed without extras

Symptom:

- `scrapling --help` fails
- Output mentions missing `click`
- Output says Scrapling must be installed with extras

Recovery:

```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```
### 2. Browser-backed fetchers not ready

Symptom:

- `extract fetch` or `extract stealthy-fetch` fails because the Playwright runtime is not installed
- Scrapling has not downloaded Chromium or Chrome Headless Shell

Recovery:

```bash
scrapling install
```

Success signals:

- `scrapling install` later reports `The dependencies are already installed`
- Browser caches contain both:
  - `chromium-*`
  - `chromium_headless_shell-*`

Typical cache roots:

- `~/Library/Caches/ms-playwright/`
- `~/.cache/ms-playwright/`
### 3. Static fetch TLS trust-store failure

Symptom:

- `extract get` fails with `curl: (60) SSL certificate problem`

Interpretation:

- Treat this as a local certificate verification problem first
- Do not assume the target URL or Scrapling itself is broken

Recovery:

Retry the same static command with:

```bash
--no-verify
```

Do not make `--no-verify` the default. Use it only after the failure matches this certificate-verification pattern.
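A minimal classifier for this gate might look like the following (hypothetical sketch; it only matches the curl error shape quoted above):

```python
def is_local_tls_failure(stderr):
    """Return True when stderr matches the local trust-store pattern.

    Gate `--no-verify` on this exact curl error shape instead of
    disabling certificate verification for every failure.
    """
    text = stderr.lower()
    return "curl: (60)" in text and "ssl certificate problem" in text


print(is_local_tls_failure("curl: (60) SSL certificate problem: self signed certificate"))  # True
```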
## Static vs Dynamic Fetch Choice

Use this order:

1. `extract get`
2. `extract fetch`
3. `extract stealthy-fetch`

Use `extract get` when:

- The page is mostly server-rendered
- The content is likely already present in raw HTML
- The target is an article page with a stable content container

Use `extract fetch` when:

- Static HTML does not contain the real content
- The site depends on JavaScript rendering
- The page content appears only after runtime hydration

Use `extract stealthy-fetch` when:

- `fetch` still fails
- The target site shows challenge or anti-bot behavior
## WeChat Extraction Pattern

For `mp.weixin.qq.com` public article pages:

- Start with `extract get`
- Use the selector `#js_content`
- Validate the saved file immediately

Example:

```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```

Observed behavior:

- The static fetch can already contain the real article body
- Browser-backed fetch is often unnecessary for article extraction
## Smoke Test Commands

### Basic diagnosis

```bash
python3 scripts/diagnose_scrapling.py
```

### Static extraction smoke test

```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```

### WeChat article smoke test

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
  --selector '#js_content'
```

### Dynamic extraction smoke test

```bash
python3 scripts/diagnose_scrapling.py \
  --url 'https://example.com' \
  --dynamic
```

### Validate saved output

```bash
wc -c article.md
sed -n '1,40p' article.md
rg -n '<title>|js_content|main|rich_media_title' page.html
```
scrapling-skill/scripts/diagnose_scrapling.py (new executable file, 191 lines)
@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""
Diagnose a local Scrapling CLI installation and optionally run a smoke test.
"""

import argparse
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def run_command(cmd: List[str]) -> Tuple[int, str, str]:
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    return result.returncode, result.stdout, result.stderr


def print_section(title: str) -> None:
    print("")
    print(title)
    print("-" * len(title))


def existing_dirs(paths: Iterable[Path]) -> List[str]:
    return [str(path) for path in paths if path.exists()]


def detect_browser_cache() -> Tuple[List[str], List[str]]:
    roots = [
        Path.home() / "Library" / "Caches" / "ms-playwright",
        Path.home() / ".cache" / "ms-playwright",
    ]
    chromium = []
    headless_shell = []
    for root in roots:
        if not root.exists():
            continue
        chromium.extend(existing_dirs(sorted(root.glob("chromium-*"))))
        headless_shell.extend(existing_dirs(sorted(root.glob("chromium_headless_shell-*"))))
    return chromium, headless_shell


def diagnose_cli() -> bool:
    print_section("CLI")
    scrapling_path = shutil.which("scrapling")
    if not scrapling_path:
        print("status: missing")
        print("fix: install with `uv tool install 'scrapling[shell]'`")
        return False

    print("path: {0}".format(scrapling_path))
    code, stdout, stderr = run_command(["scrapling", "--help"])
    output = (stdout + "\n" + stderr).strip()

    if code == 0:
        print("status: working")
        return True

    print("status: broken")
    if "install scrapling with any of the extras" in output.lower() or "no module named 'click'" in output.lower():
        print("cause: installed without CLI extras")
        print("fix: `uv tool uninstall scrapling` then `uv tool install 'scrapling[shell]'`")
    else:
        print("cause: unknown")

    if output:
        print("details:")
        print(output[:1200])
    return False


def diagnose_browsers() -> None:
    print_section("Browser Runtime")
    chromium, headless_shell = detect_browser_cache()
    print("chromium: {0}".format("present" if chromium else "missing"))
    for path in chromium:
        print("  - {0}".format(path))
    print("chrome-headless-shell: {0}".format("present" if headless_shell else "missing"))
    for path in headless_shell:
        print("  - {0}".format(path))
    if not chromium or not headless_shell:
        print("hint: run `scrapling install` before browser-backed fetches")


def preview_file(path: Path, preview_lines: int) -> None:
    print_section("Smoke Test Output")
    if not path.exists():
        print("status: missing output file")
        return

    size = path.stat().st_size
    print("path: {0}".format(path))
    print("bytes: {0}".format(size))
    if size == 0:
        print("status: empty")
        return

    if path.suffix in (".md", ".txt"):
        print("preview:")
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            for index, line in enumerate(handle):
                if index >= preview_lines:
                    break
                print(line.rstrip())


def run_smoke_test(args: argparse.Namespace) -> int:
    print_section("Smoke Test")

    suffix = ".html"
    if args.selector:
        suffix = ".md"

    output_path = Path(tempfile.gettempdir()) / ("scrapling-smoke" + suffix)
    if output_path.exists():
        output_path.unlink()

    cmd = ["scrapling", "extract", "fetch" if args.dynamic else "get", args.url, str(output_path)]
    if args.selector:
        cmd.extend(["-s", args.selector])
    if args.dynamic:
        cmd.extend(["--timeout", str(args.timeout)])
    elif args.no_verify:
        cmd.append("--no-verify")

    print("command: {0}".format(" ".join(cmd)))
    code, stdout, stderr = run_command(cmd)
    if stdout.strip():
        print(stdout.strip())
    if stderr.strip():
        print(stderr.strip())

    preview_file(output_path, args.preview_lines)
    return code


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Diagnose Scrapling and run an optional smoke test.")
    parser.add_argument("--url", help="Optional URL for a smoke test")
    parser.add_argument("--selector", help="Optional CSS selector for the smoke test")
    parser.add_argument(
        "--dynamic",
        action="store_true",
        help="Use `scrapling extract fetch` instead of `scrapling extract get`",
    )
    parser.add_argument(
        "--no-verify",
        action="store_true",
        help="Pass `--no-verify` to static smoke tests",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=20000,
        help="Timeout in milliseconds for dynamic smoke tests",
    )
    parser.add_argument(
        "--preview-lines",
        type=int,
        default=20,
        help="Number of preview lines for markdown/text output",
    )
    return parser


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    cli_ok = diagnose_cli()
    diagnose_browsers()

    if not cli_ok:
        return 1

    if not args.url:
        return 0

    return run_smoke_test(args)


if __name__ == "__main__":
    sys.exit(main())