release: add scrapling-skill and fix script compatibility

- add scrapling-skill with validated CLI workflow, diagnostics, packaging, and docs integration
- fix skill-creator package_skill.py so direct script invocation works from repo root
- fix continue-claude-work extract_resume_context.py typing compatibility for local python3
- bump marketplace to 1.39.0 and updated skill versions
2026-03-18 23:08:55 +08:00
parent d8a7d45e53
commit 2192458ef7
13 changed files with 722 additions and 36 deletions
--- a/scrapling-skill/references/troubleshooting.md
+++ b/scrapling-skill/references/troubleshooting.md
@@ -0,0 +1,164 @@
+# Scrapling Troubleshooting
+
+## Contents
+
+- Installation modes
+- Verified failure modes
+- Static vs dynamic fetch choice
+- WeChat extraction pattern
+- Smoke test commands
+
+## Installation Modes
+
+Use the CLI path as the default:
+
+```bash
+uv tool install 'scrapling[shell]'
+```
+
+Do not assume `uv tool install scrapling` is enough for CLI usage. The base package may install the executable wrapper without the optional CLI dependencies.
+
+## Verified Failure Modes
+
+### 1. CLI installed without extras
+
+Symptom:
+
+- `scrapling --help` fails
+- Output mentions missing `click`
+- Output says Scrapling must be installed with extras
+
+Recovery:
+
+```bash
+uv tool uninstall scrapling
+uv tool install 'scrapling[shell]'
+```
+
+### 2. Browser-backed fetchers not ready
+
+Symptom:
+
+- `extract fetch` or `extract stealthy-fetch` fails because the Playwright runtime is not installed
+- Scrapling has not downloaded Chromium or Chrome Headless Shell
+
+Recovery:
+
+```bash
+scrapling install
+```
+
+Success signals:
+
+- Running `scrapling install` again reports `The dependencies are already installed`
+- Browser caches contain both:
+  - `chromium-*`
+  - `chromium_headless_shell-*`
+
+Typical cache roots:
+
+- `~/Library/Caches/ms-playwright/` (macOS)
+- `~/.cache/ms-playwright/` (Linux)
+
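+The cache check above can be scripted. This is a minimal sketch, assuming bash and the two cache roots listed; `has_playwright_browsers` is a hypothetical helper name, not part of Scrapling or Playwright.
+
+```bash
+# Hypothetical helper: succeeds only if both browser builds exist under a cache root.
+has_playwright_browsers() {
+  local root="$1"
+  # compgen -G returns non-zero when the glob matches nothing
+  compgen -G "$root/chromium-*" > /dev/null &&
+    compgen -G "$root/chromium_headless_shell-*" > /dev/null
+}
+
+# Report every cache root that contains both browser builds
+for root in "$HOME/Library/Caches/ms-playwright" "$HOME/.cache/ms-playwright"; do
+  if has_playwright_browsers "$root"; then
+    echo "browsers ready: $root"
+  fi
+done
+```
+
+If the loop prints nothing, neither root holds both builds and `scrapling install` is the next step.
+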
+### 3. Static fetch TLS trust-store failure
+
+Symptom:
+
+- `extract get` fails with `curl: (60) SSL certificate problem`
+
+Interpretation:
+
+- Treat this as a local certificate verification problem first
+- Do not assume the target URL or Scrapling itself is broken
+
+Recovery:
+
+Retry the same static command with `--no-verify` appended, for example:
+
+```bash
+scrapling extract get 'https://example.com' page.md --no-verify
+```
+
+Do not make `--no-verify` the default. Use it only after the failure matches this certificate-verification pattern.
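+One way to keep `--no-verify` off by default is to retry with it only when the failure output matches the certificate pattern. This is a sketch under assumptions: `get_with_tls_fallback` is a hypothetical wrapper, and it assumes the `curl: (60)` message appears in the command's combined stdout and stderr.
+
+```bash
+# Hypothetical wrapper: retry with --no-verify only for the curl (60) pattern.
+get_with_tls_fallback() {
+  local url="$1" out="$2" log
+  log="$(mktemp)"
+  # First attempt with certificate verification left on
+  if scrapling extract get "$url" "$out" > "$log" 2>&1; then
+    rm -f "$log"
+    return 0
+  fi
+  if grep -q 'curl: (60)' "$log"; then
+    rm -f "$log"
+    echo "certificate verification failed; retrying with --no-verify" >&2
+    scrapling extract get "$url" "$out" --no-verify
+  else
+    # Any other failure is surfaced unchanged
+    cat "$log" >&2
+    rm -f "$log"
+    return 1
+  fi
+}
+```
+
+Usage: `get_with_tls_fallback 'https://example.com' page.md`. Failures that do not match the pattern are never retried without verification.
+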
+
+## Static vs Dynamic Fetch Choice
+
+Try the fetchers in this order, escalating only when the previous one does not return the content:
+
+1. `extract get`
+2. `extract fetch`
+3. `extract stealthy-fetch`
+
+Use `extract get` when:
+
+- The page is mostly server-rendered
+- The content is likely already present in raw HTML
+- The target is an article page with a stable content container
+
+Use `extract fetch` when:
+
+- Static HTML does not contain the real content
+- The site depends on JavaScript rendering
+- The page content appears only after runtime hydration
+
+Use `extract stealthy-fetch` when:
+
+- `fetch` still fails
+- The target site shows challenge or anti-bot behavior
+
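+The escalation order can be sketched as a small loop. `fetch_with_fallback` is a hypothetical helper, and treating "command succeeded and the output file is non-empty" as success is an assumption; a page can still return non-empty output that lacks the real content, so manual validation remains necessary.
+
+```bash
+# Hypothetical helper: try get, then fetch, then stealthy-fetch.
+# Prints the mode that produced a non-empty output file.
+fetch_with_fallback() {
+  local url="$1" out="$2" mode
+  for mode in get fetch stealthy-fetch; do
+    # Send the tool's own stdout to stderr so only the mode name is printed
+    if scrapling extract "$mode" "$url" "$out" >&2 && [ -s "$out" ]; then
+      echo "$mode"
+      return 0
+    fi
+  done
+  return 1
+}
+```
+
+Usage: `mode="$(fetch_with_fallback 'https://example.com' page.md)"` leaves the name of the first mode that worked in `mode`.
+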
+## WeChat Extraction Pattern
+
+For `mp.weixin.qq.com` public article pages:
+
+- Start with `extract get`
+- Use the selector `#js_content`
+- Validate the saved file immediately
+
+Example:
+
+```bash
+scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
+```
+
+Observed behavior:
+
+- The static fetch can already contain the real article body
+- Browser-backed fetch is often unnecessary for article extraction
+
+## Smoke Test Commands
+
+### Basic diagnosis
+
+```bash
+python3 scripts/diagnose_scrapling.py
+```
+
+### Static extraction smoke test
+
+```bash
+python3 scripts/diagnose_scrapling.py --url 'https://example.com'
+```
+
+### WeChat article smoke test
+
+```bash
+python3 scripts/diagnose_scrapling.py \
+  --url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
+  --selector '#js_content'
+```
+
+### Dynamic extraction smoke test
+
+```bash
+python3 scripts/diagnose_scrapling.py \
+  --url 'https://example.com' \
+  --dynamic
+```
+
+### Validate saved output
+
+```bash
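+# Confirm the saved Markdown is not empty, then preview its first lines
+wc -c article.md
+sed -n '1,40p' article.md
+# For a saved HTML copy, look for the expected title and content markers
+rg -n '<title>|js_content|main|rich_media_title' page.html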
+```