Files
daymade 2192458ef7 release: add scrapling-skill and fix script compatibility
- add scrapling-skill with validated CLI workflow, diagnostics, packaging, and docs integration
- fix skill-creator package_skill.py so direct script invocation works from repo root
- fix continue-claude-work extract_resume_context.py typing compatibility for local python3
- bump marketplace to 1.39.0 and updated skill versions
2026-03-18 23:08:55 +08:00

165 lines
3.3 KiB
Markdown

# Scrapling Troubleshooting
## Contents
- Installation modes
- Verified failure modes
- Static vs dynamic fetch choice
- WeChat extraction pattern
- Smoke test commands
## Installation Modes
Use the CLI path as the default:
```bash
uv tool install 'scrapling[shell]'
```
Do not assume `uv tool install scrapling` is enough for CLI usage. The base package may install the executable wrapper without the optional CLI dependencies.
## Verified Failure Modes
### 1. CLI installed without extras
Symptom:
- `scrapling --help` fails
- Output mentions missing `click`
- Output says Scrapling must be installed with extras
Recovery:
```bash
uv tool uninstall scrapling
uv tool install 'scrapling[shell]'
```
### 2. Browser-backed fetchers not ready
Symptom:
- `extract fetch` or `extract stealthy-fetch` fails because the Playwright runtime is not installed
- Scrapling has not downloaded Chromium or Chrome Headless Shell
Recovery:
```bash
scrapling install
```
Success signals:
- `scrapling install` later reports `The dependencies are already installed`
- Browser caches contain both:
- `chromium-*`
- `chromium_headless_shell-*`
Typical cache roots:
- `~/Library/Caches/ms-playwright/`
- `~/.cache/ms-playwright/`
### 3. Static fetch TLS trust-store failure
Symptom:
- `extract get` fails with `curl: (60) SSL certificate problem`
Interpretation:
- Treat this as a local certificate verification problem first
- Do not assume the target URL or Scrapling itself is broken
Recovery:
Retry the same static command with:
```bash
--no-verify
```
Do not make `--no-verify` the default. Use it only after the failure matches this certificate-verification pattern.
## Static vs Dynamic Fetch Choice
Use this order:
1. `extract get`
2. `extract fetch`
3. `extract stealthy-fetch`
Use `extract get` when:
- The page is mostly server-rendered
- The content is likely already present in raw HTML
- The target is an article page with a stable content container
Use `extract fetch` when:
- Static HTML does not contain the real content
- The site depends on JavaScript rendering
- The page content appears only after runtime hydration
Use `extract stealthy-fetch` when:
- `fetch` still fails
- The target site shows challenge or anti-bot behavior
## WeChat Extraction Pattern
For `mp.weixin.qq.com` public article pages:
- Start with `extract get`
- Use the selector `#js_content`
- Validate the saved file immediately
Example:
```bash
scrapling extract get 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' article.md -s '#js_content'
```
Observed behavior:
- The static fetch can already contain the real article body
- Browser-backed fetch is often unnecessary for article extraction
## Smoke Test Commands
### Basic diagnosis
```bash
python3 scripts/diagnose_scrapling.py
```
### Static extraction smoke test
```bash
python3 scripts/diagnose_scrapling.py --url 'https://example.com'
```
### WeChat article smoke test
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://mp.weixin.qq.com/s/ARTICLE_ID?scene=1' \
--selector '#js_content'
```
### Dynamic extraction smoke test
```bash
python3 scripts/diagnose_scrapling.py \
--url 'https://example.com' \
--dynamic
```
### Validate saved output
```bash
wc -c article.md
sed -n '1,40p' article.md
rg -n '<title>|js_content|main|rich_media_title' page.html
```