yusyus
105218f85e
Add checkpoint/resume feature for long scrapes
Implement automatic progress saving and resumption for interrupted
or very long documentation scrapes (40K+ pages).
**Features:**
- Automatic checkpoint saving every N pages (configurable, default: 1000)
- Resume from last checkpoint with --resume flag
- Fresh start with --fresh flag (clears checkpoint)
- Progress state saved: visited URLs, pending URLs, pages scraped
- Checkpoint saved on interruption (Ctrl+C)
- Checkpoint cleared after successful completion
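The lifecycle above (save every N pages, save on Ctrl+C, clear on success) can be sketched roughly as follows. This is a minimal illustration and not the code in `cli/doc_scraper.py`: `crawl`, `save_checkpoint`, `fetch_links`, and the `checkpoint.json` path are hypothetical names chosen for the example.

```python
import json
import os
from collections import deque

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical path, not taken from the scraper


def save_checkpoint(path, visited, pending, pages_scraped):
    """Write progress to a temp file, then rename it over the old checkpoint."""
    state = {
        "visited_urls": sorted(visited),
        "pending_urls": list(pending),
        "pages_scraped": pages_scraped,
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def crawl(start_urls, fetch_links, path=CHECKPOINT_FILE, interval=1000):
    """Breadth-first crawl; fetch_links(url) is a stand-in that returns linked URLs."""
    visited, pending, pages_scraped = set(), deque(start_urls), 0
    try:
        while pending:
            url = pending[0]  # peek, so an interrupted page stays in the queue
            if url in visited:
                pending.popleft()
                continue
            links = fetch_links(url)
            pending.popleft()
            visited.add(url)
            pages_scraped += 1
            pending.extend(link for link in links if link not in visited)
            if pages_scraped % interval == 0:
                save_checkpoint(path, visited, pending, pages_scraped)
    except KeyboardInterrupt:
        # Ctrl+C: persist progress before propagating the interruption.
        save_checkpoint(path, visited, pending, pages_scraped)
        raise
    # Successful completion: the checkpoint is no longer needed.
    if os.path.exists(path):
        os.remove(path)
    return pages_scraped
```

Peeking at the queue head before fetching is a design choice of this sketch: the URL being fetched when the interruption arrives is still in `pending_urls` and gets retried on resume.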
**Configuration:**
```json
{
  "checkpoint": {
    "enabled": true,
    "interval": 1000
  }
}
```
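As a rough illustration of how this section might be read, the sketch below pulls the two settings out of a loaded config. `checkpoint_settings` is a hypothetical helper. The 1000-page default comes from the feature list above, while treating a missing `enabled` key as "disabled" is an assumption of this example.

```python
import json


def checkpoint_settings(config):
    """Return (enabled, interval) from a loaded config dict."""
    section = config.get("checkpoint", {})
    enabled = bool(section.get("enabled", False))  # assumption: off unless set
    interval = int(section.get("interval", 1000))  # documented default
    if interval < 1:
        raise ValueError("checkpoint interval must be a positive integer")
    return enabled, interval


config = json.loads('{"checkpoint": {"enabled": true, "interval": 1000}}')
enabled, interval = checkpoint_settings(config)
```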
**Usage:**
```bash
# Start scraping (with checkpoints enabled in config)
python3 cli/doc_scraper.py --config configs/large-docs.json
# If interrupted (Ctrl+C), resume later:
python3 cli/doc_scraper.py --config configs/large-docs.json --resume
# Start fresh (clear checkpoint):
python3 cli/doc_scraper.py --config configs/large-docs.json --fresh
```
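One way the three flags shown above could be wired up with `argparse` is sketched here. Only `--config`, `--resume`, and `--fresh` come from this commit. `build_parser` and `startup_action` are hypothetical, and both making the two flags mutually exclusive and falling back to a normal start when `--resume` finds no checkpoint are assumptions of the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Documentation scraper (sketch)")
    parser.add_argument("--config", required=True, help="path to the JSON config")
    # Assumption: resuming and starting fresh are treated as exclusive choices.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--resume", action="store_true",
                      help="continue from the last checkpoint")
    mode.add_argument("--fresh", action="store_true",
                      help="clear any checkpoint and start over")
    return parser


def startup_action(args, checkpoint_exists):
    """Decide what to do at startup: 'resume', 'fresh', or 'start'."""
    if args.fresh:
        return "fresh"  # caller deletes the checkpoint, then starts from scratch
    if args.resume and checkpoint_exists:
        return "resume"  # caller loads visited/pending URLs from the checkpoint
    # No flag, or --resume with nothing to resume: start normally (assumed).
    return "start"


args = build_parser().parse_args(
    ["--config", "configs/large-docs.json", "--resume"]
)
action = startup_action(args, checkpoint_exists=True)
```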
**Checkpoint Data:**
- config: Full configuration
- visited_urls: All URLs already scraped
- pending_urls: Queue of URLs to scrape
- pages_scraped: Count of pages completed
- last_updated: Timestamp
- checkpoint_interval: Interval setting
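The fields listed above map naturally onto a small record that round-trips through JSON. The sketch below is illustrative only: the `Checkpoint` class is hypothetical, URL collections are stored as lists because JSON has no set type, and the ISO 8601 UTC timestamp format is an assumption (the commit message only says "Timestamp").

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now_iso():
    # Assumed timestamp format: ISO 8601 in UTC.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Checkpoint:
    config: dict
    visited_urls: list
    pending_urls: list
    pages_scraped: int
    checkpoint_interval: int
    last_updated: str = field(default_factory=_now_iso)

    def to_json(self):
        """Serialize every field to a JSON string."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text):
        """Rebuild a Checkpoint from a string produced by to_json."""
        return cls(**json.loads(text))


cp = Checkpoint(
    config={"checkpoint": {"enabled": True, "interval": 1000}},
    visited_urls=["https://example.com/docs/"],
    pending_urls=["https://example.com/docs/intro"],
    pages_scraped=1,
    checkpoint_interval=1000,
)
restored = Checkpoint.from_json(cp.to_json())
```

On resume, `visited_urls` would typically be converted back to a set for fast membership checks and `pending_urls` back to a queue.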
**Benefits:**
✅ Never lose progress on long scrapes
✅ Handle interruptions gracefully
✅ Resume multi-hour scrapes easily
✅ Automatic save every 1000 pages by default (configurable)
✅ Essential for 40K+ page documentation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 20:50:24 +03:00