3 Commits

Author SHA1 Message Date
yusyus
af87572735 Remove unnecessary validation limits from config validator
- Remove max_pages upper limit (was 10,000, now unlimited)
- Remove rate_limit upper limit (was 10s, now unlimited)
- Convert missing selector checks from errors to warnings
- Add warnings system (non-blocking) vs errors (blocking)
- Allow users to scrape large documentation sites (45k+ pages)
- Allow flexible rate limiting for different site requirements

All reasonable validations remain (required fields, valid URLs,
correct data types, no negative values).
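
The split between blocking errors and non-blocking warnings described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `validate_config`, the `RECOMMENDED_SELECTORS` tuple, and the message wording are assumptions, and treating `max_pages` below 1 as an error is a guess based on the earlier 1-10000 range.

```python
import re
from urllib.parse import urlparse

# Hypothetical selector names; the commit does not list the real ones.
RECOMMENDED_SELECTORS = ("main_content", "title", "code_blocks")


def validate_config(config):
    """Return (errors, warnings). Only errors are meant to block a run."""
    errors, warnings = [], []

    # Required fields stay hard errors.
    for field in ("name", "base_url"):
        if field not in config:
            errors.append(f"Missing required field: {field}")

    name = config.get("name")
    if name is not None and not (
        isinstance(name, str) and re.fullmatch(r"[A-Za-z0-9_-]+", name)
    ):
        errors.append("name must be alphanumeric with hyphens or underscores")

    base_url = config.get("base_url")
    if base_url is not None and (
        not isinstance(base_url, str)
        or urlparse(base_url).scheme not in ("http", "https")
    ):
        errors.append("base_url must start with http:// or https://")

    # No upper bound on rate_limit: only the type and sign are checked.
    rate_limit = config.get("rate_limit")
    if rate_limit is not None:
        if isinstance(rate_limit, bool) or not isinstance(rate_limit, (int, float)):
            errors.append("rate_limit must be a number")
        elif rate_limit < 0:
            errors.append("rate_limit must not be negative")

    # No upper bound on max_pages either (assumed lower bound of 1).
    max_pages = config.get("max_pages")
    if max_pages is not None:
        if isinstance(max_pages, bool) or not isinstance(max_pages, int):
            errors.append("max_pages must be an integer")
        elif max_pages < 1:
            errors.append("max_pages must be at least 1")

    # Missing selectors produce warnings, not errors.
    selectors = config.get("selectors")
    if selectors is None:
        warnings.append("No selectors defined")
    elif not isinstance(selectors, dict):
        errors.append("selectors must be a mapping")
    else:
        for key in RECOMMENDED_SELECTORS:
            if key not in selectors:
                warnings.append(f"Missing recommended selector: {key}")

    return errors, warnings
```

Under these assumptions, a config with `max_pages` of 45000 and a `rate_limit` of 30 yields no errors, while a missing `selectors` section only adds a warning.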

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 14:55:56 +03:00
yusyus
eeef230c7b Implement high and medium priority improvements
High Priority:
- Fix hardcoded package_skill.py path (line 778)
  Changed from: /mnt/skills/examples/skill-creator/scripts/package_skill.py
  Changed to: package_skill.py (local repository path)

Medium Priority:
- Add comprehensive config validation
  * Validates required fields (name, base_url)
  * Validates name format (alphanumeric, hyphens, underscores)
  * Validates base_url format (http/https)
  * Validates selectors structure and recommends standard selectors
  * Validates url_patterns (include/exclude lists)
  * Validates categories structure
  * Validates rate_limit range (0-10 seconds)
  * Validates max_pages range (1-10000)
  * Validates start_urls format if present
  * Provides clear error messages for invalid configs

- Add --dry-run flag for preview mode
  * Previews first 20 URLs without saving data
  * Shows what would be scraped without creating files
  * Discovers links to estimate total pages
  * Displays configuration summary
  * No directories created in dry-run mode
  * Useful for testing configs before full scrape
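
A rough sketch of how such a preview mode could work, assuming a breadth-first crawl. The names `build_parser`, `dry_run`, and the `discover_links` callable are hypothetical stand-ins for whatever the scraper actually uses; only the `--dry-run` flag name and the 20-URL limit come from the message above.

```python
import argparse
from collections import deque

PREVIEW_LIMIT = 20  # the commit message says the first 20 URLs are previewed


def build_parser():
    """Build a CLI parser with a --dry-run switch (other options omitted)."""
    parser = argparse.ArgumentParser(description="Documentation scraper (sketch)")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Preview URLs without saving data or creating directories",
    )
    return parser


def dry_run(start_urls, discover_links, limit=PREVIEW_LIMIT):
    """Visit up to `limit` URLs breadth-first without writing anything.

    `discover_links` is a hypothetical callable that returns the links
    found on a page. Returns (previewed_urls, discovered_total), where
    discovered_total counts every distinct URL seen so far and serves
    as a lower-bound estimate of the site's size.
    """
    queue = deque(start_urls)
    seen = set(start_urls)
    previewed = []

    while queue and len(previewed) < limit:
        url = queue.popleft()
        previewed.append(url)
        for link in discover_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return previewed, len(seen)
```

Passing the link discovery in as a callable keeps the preview free of file and directory side effects, which matches the "no directories created" behaviour described above.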

All changes tested and working correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 01:57:59 +03:00
yusyus
78b9cae398 Init
2025-10-17 15:14:44 +00:00