- add scrapling-skill with validated CLI workflow, diagnostics, packaging, and docs integration
- fix skill-creator package_skill.py so direct script invocation works from repo root
- fix continue-claude-work extract_resume_context.py typing compatibility for local python3
- bump marketplace to 1.39.0 and updated skill versions
Scrapling Troubleshooting
Contents
- Installation modes
- Verified failure modes
- Static vs dynamic fetch choice
- WeChat extraction pattern
- Smoke test commands
Installation Modes
Use the CLI path as the default:
uv tool install 'scrapling[shell]'
Do not assume uv tool install scrapling is enough for CLI usage. The base package may install the executable wrapper without the optional CLI dependencies.
Verified Failure Modes
1. CLI installed without extras
Symptom:
- scrapling --help fails
- Output mentions missing click
- Output says Scrapling must be installed with extras
Recovery:
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
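The symptom check above can be sketched as a small Python helper. This is only an illustration: the marker strings `click` and `extras` are assumptions drawn from the symptoms listed here, not the exact wording of Scrapling's error message, and `check_cli` is a hypothetical name.

```python
import shutil
import subprocess

# Assumed markers taken from the symptom list; the real message may differ.
EXTRAS_MARKERS = ("click", "extras")


def looks_like_missing_extras(output: str) -> bool:
    """Return True if CLI output resembles the 'installed without extras' symptom."""
    text = output.lower()
    return any(marker in text for marker in EXTRAS_MARKERS)


def check_cli() -> str:
    """Classify the local executable as 'not-installed', 'ok', 'needs-extras', or 'unknown'."""
    if shutil.which("scrapling") is None:
        return "not-installed"
    result = subprocess.run(["scrapling", "--help"], capture_output=True, text=True)
    if result.returncode == 0:
        return "ok"
    if looks_like_missing_extras(result.stdout + result.stderr):
        return "needs-extras"
    return "unknown"
```

A result of `needs-extras` would point to the uninstall and reinstall recovery shown above.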
2. Browser-backed fetchers not ready
Symptom:
- extract fetch or extract stealthy-fetch fails because the Playwright runtime is not installed
- Scrapling has not downloaded Chromium or Chrome Headless Shell
Recovery:
scrapling install
Success signals:
- scrapling install later reports "The dependencies are already installed"
- Browser caches contain both:
  - chromium-*
  - chromium_headless_shell-*
Typical cache roots:
- ~/Library/Caches/ms-playwright/
- ~/.cache/ms-playwright/
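The cache check can also be done programmatically. The following is a minimal sketch, assuming the two cache roots and directory prefixes listed above; `missing_browsers` is a hypothetical helper, not part of Scrapling or Playwright.

```python
from pathlib import Path
from typing import List

# Cache roots from the list above (macOS first, then Linux).
CACHE_ROOTS = [
    Path.home() / "Library" / "Caches" / "ms-playwright",
    Path.home() / ".cache" / "ms-playwright",
]
REQUIRED_PREFIXES = ("chromium-", "chromium_headless_shell-")


def missing_browsers(root: Path) -> List[str]:
    """Return the required directory prefixes that have no match under root."""
    if not root.is_dir():
        return list(REQUIRED_PREFIXES)
    names = [entry.name for entry in root.iterdir() if entry.is_dir()]
    return [
        prefix
        for prefix in REQUIRED_PREFIXES
        if not any(name.startswith(prefix) for name in names)
    ]
```

An empty list for at least one root would match the success signal; a non-empty list for every root suggests running the recovery command again.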
3. Static fetch TLS trust-store failure
Symptom:
- extract get fails with curl: (60) SSL certificate problem
Interpretation:
- Treat this as a local certificate verification problem first
- Do not assume the target URL or Scrapling itself is broken
Recovery:
Retry the same static command with:
--no-verify
Do not make --no-verify the default. Use it only after the failure matches this certificate-verification pattern.
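One way to keep --no-verify off by default is to retry only when the output matches the certificate pattern. This is a sketch under assumptions: it assumes the curl message appears in stdout or stderr, that --no-verify can be appended after the other arguments, and the `runner` parameter is a hypothetical hook added so the logic can be exercised without network access.

```python
import subprocess
from typing import Callable, Sequence


def is_cert_failure(output: str) -> bool:
    """Return True only for the curl (60) certificate-verification pattern."""
    return "curl: (60)" in output and "SSL certificate problem" in output


def run_static_get(args: Sequence[str], runner: Callable = subprocess.run):
    """Run 'scrapling extract get'; retry once with --no-verify on a certificate failure."""
    cmd = ["scrapling", "extract", "get", *args]
    result = runner(cmd, capture_output=True, text=True)
    combined = (result.stdout or "") + (result.stderr or "")
    if result.returncode != 0 and is_cert_failure(combined):
        # Retry the same command; the flag is never added on the first attempt.
        result = runner(cmd + ["--no-verify"], capture_output=True, text=True)
    return result
```

Any other failure is returned unchanged, so unrelated errors are not masked by disabling verification.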
Static vs Dynamic Fetch Choice
Use this order:
1. extract get
2. extract fetch
3. extract stealthy-fetch
Use extract get when:
- The page is mostly server-rendered
- The content is likely already present in raw HTML
- The target is an article page with a stable content container
Use extract fetch when:
- Static HTML does not contain the real content
- The site depends on JavaScript rendering
- The page content appears only after runtime hydration
Use extract stealthy-fetch when:
- fetch still fails
- The target site shows challenge or anti-bot behavior
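The escalation order can be sketched as a loop that stops at the first mode producing usable output. This is illustrative only: `first_working_mode` and the `attempt` callback are hypothetical names, the default success check (exit code 0 plus a non-empty output file) is an assumption, and passing `-s` to `fetch` and `stealthy-fetch` is assumed to work as it does for `get`.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Optional

FETCH_ORDER = ["get", "fetch", "stealthy-fetch"]


def build_command(mode: str, url: str, output: str, selector: Optional[str] = None) -> List[str]:
    """Build a 'scrapling extract <mode>' command line as an argument list."""
    cmd = ["scrapling", "extract", mode, url, output]
    if selector:
        cmd += ["-s", selector]
    return cmd


def run_and_check(cmd: List[str], output: str) -> bool:
    """Assumed success test: the command exits 0 and writes a non-empty file."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    path = Path(output)
    return result.returncode == 0 and path.is_file() and path.stat().st_size > 0


def first_working_mode(
    url: str,
    output: str,
    selector: Optional[str] = None,
    attempt: Callable[[List[str], str], bool] = run_and_check,
) -> Optional[str]:
    """Try get, then fetch, then stealthy-fetch; return the first mode that succeeds."""
    for mode in FETCH_ORDER:
        if attempt(build_command(mode, url, output, selector), output):
            return mode
    return None
```

A stricter `attempt` callback could also inspect the saved file for the expected content container, since a non-empty file does not prove the real content was captured.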
WeChat Extraction Pattern
For mp.weixin.qq.com public article pages:
- Start with extract get
- Use the selector #js_content
- Validate the saved file immediately
Example:
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
Observed behavior:
- The static fetch can already contain the real article body
- Browser-backed fetch is often unnecessary for article extraction
Smoke Test Commands
Basic diagnosis
python3 scripts/diagnose_scrapling.py
Static extraction smoke test
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
WeChat article smoke test
python3 scripts/diagnose_scrapling.py \
--url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
--selector '#js_content'
Dynamic extraction smoke test
python3 scripts/diagnose_scrapling.py \
--url 'https://example.com' \
--dynamic
Validate saved output
wc -c article.md
sed -n '1,40p' article.md
rg -n '<title>|js_content|main|rich_media_title' page.html
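The same validation can be approximated in Python when a shell is not convenient. This is a rough sketch: `validate_output` is a hypothetical helper, and the 200-byte minimum is an arbitrary assumed threshold rather than a value taken from the diagnostics script.

```python
from pathlib import Path
from typing import Dict


def validate_output(path: str, min_bytes: int = 200, preview_lines: int = 40) -> Dict[str, object]:
    """Report the byte size and a short preview, similar to 'wc -c' plus 'sed -n 1,40p'."""
    target = Path(path)
    if not target.is_file():
        return {"ok": False, "size": 0, "preview": ""}
    size = target.stat().st_size
    text = target.read_text(encoding="utf-8", errors="replace")
    preview = "\n".join(text.splitlines()[:preview_lines])
    # Assumed rule of thumb: large enough and not just whitespace.
    ok = size >= min_bytes and bool(preview.strip())
    return {"ok": ok, "size": size, "preview": preview}
```

For example, `validate_output("article.md")["ok"]` being false would suggest re-running the extraction with the next fetch mode.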