skill-seekers-reference/cli/doc_scraper.py at d9e9fb53ad70e851b1d605b5cc259aa47968dd42

firefrost-gaming/skill-seekers-reference

Files

yusyus 105218f85e Add checkpoint/resume feature for long scrapes

Implement automatic progress saving and resumption for interrupted
or very long documentation scrapes (40K+ pages).

**Features:**
- Automatic checkpoint saving every N pages (configurable, default: 1000)
- Resume from last checkpoint with --resume flag
- Fresh start with --fresh flag (clears checkpoint)
- Progress state saved: visited URLs, pending URLs, pages scraped
- Checkpoint saved on interruption (Ctrl+C)
- Checkpoint cleared after successful completion

**Configuration:**
```json
{
  "checkpoint": {
    "enabled": true,
    "interval": 1000
  }
}
```

**Usage:**
```bash
# Start scraping (with checkpoints enabled in config)
python3 cli/doc_scraper.py --config configs/large-docs.json

# If interrupted (Ctrl+C), resume later:
python3 cli/doc_scraper.py --config configs/large-docs.json --resume

# Start fresh (clear checkpoint):
python3 cli/doc_scraper.py --config configs/large-docs.json --fresh
```

**Checkpoint Data:**
- config: Full configuration
- visited_urls: All URLs already scraped
- pending_urls: Queue of URLs to scrape
- pages_scraped: Count of pages completed
- last_updated: Timestamp
- checkpoint_interval: Interval setting

**Benefits:**
✅ Never lose progress on long scrapes
✅ Handle interruptions gracefully
✅ Resume multi-hour scrapes easily
✅ Automatic save every 1000 pages
✅ Essential for 40K+ page documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-19 20:50:24 +03:00

38 KiB

Raw Blame History

View Raw

38 KiB Raw Blame History

38 KiB

Raw Blame History