Commit Graph

9 Commits

Author SHA1 Message Date
yusyus
9c1a133c51 Add page count estimator for fast config validation
- Add estimate_pages.py script (~270 lines)
- Fast estimation without downloading content (HEAD requests only)
- Shows estimated total pages and recommended max_pages
- Validates URL patterns work correctly
- Estimates scraping time based on rate_limit
- Update CLAUDE.md with estimator workflow and commands
- Update README.md features section with estimation benefits
- Usage: python3 estimate_pages.py configs/react.json
- Time: 1-2 minutes vs 20-40 minutes for full scrape

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 02:44:50 +03:00
yusyus
59c2f9126d Optimize all framework configs with start_urls for better coverage
All configs now follow the steam-economy-complete.json pattern with:
- Multiple start_urls for comprehensive entry points
- Improved include patterns for better targeting
- Enhanced exclude patterns to skip irrelevant pages

Godot Config:
- Added 7 start_urls covering getting started, scripting, 2D, 3D, physics, animation, and classes
- Added include patterns: /getting_started/, /tutorials/, /classes/
- More focused scraping of core documentation

React Config:
- Added 6 start_urls covering learn, quick-start, reference, and hooks
- Existing patterns maintained (already well-optimized)

Vue Config:
- Added 6 start_urls covering introduction, essentials, components, composables, and API
- Fixed base_url from https://vuejs.org/guide/ to https://vuejs.org/
- Added /partners/ to exclude list

Django Config:
- Added 7 start_urls covering intro, models, views, templates, forms, auth, and reference
- Added /intro/ to include patterns
- Added /releases/ to exclude list (changelog not needed)

FastAPI Config:
- Added 7 start_urls covering tutorial, first-steps, path-params, body, dependencies, advanced, and reference
- Added /deployment/ to exclude list

Benefits:
- Better initial page discovery
- More comprehensive documentation coverage
- Faster scraping (direct entry to important sections)
- Reduced unnecessary page crawling
- Consistent pattern across all configs

All configs tested and validated:
 71/71 tests passing
 All 6 configs validated successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 02:24:56 +03:00
yusyus
f9c8f1d610 Clean up macOS .DS_Store file from output directory 2025-10-19 02:21:25 +03:00
yusyus
f1fa8354d2 Add comprehensive test system with 71 tests (100% pass rate)
Test Framework:
- Created tests/ directory structure
- Added __init__.py for test package
- Implemented 71 comprehensive tests across 3 test suites

Test Suites:
1. test_config_validation.py (25 tests)
   - Valid/invalid config structure
   - Required fields validation
   - Name format validation
   - URL format validation
   - Selectors validation
   - URL patterns validation
   - Categories validation
   - Rate limit validation (0-10 range)
   - Max pages validation (1-10000 range)
   - Start URLs validation

2. test_scraper_features.py (28 tests)
   - URL validation (include/exclude patterns)
   - Language detection (Python, JavaScript, GDScript, C++, etc.)
   - Pattern extraction from documentation
   - Smart categorization (by URL, title, content)
   - Text cleaning utilities

3. test_integration.py (18 tests)
   - Dry-run mode functionality
   - Config loading and validation
   - Real config files validation (godot, react, vue, django, fastapi, steam)
   - URL processing and normalization
   - Content extraction

Test Runner (run_tests.py):
- Custom colored test runner with ANSI colors
- Detailed test summary with breakdown by category
- Success rate calculation
- Command-line options:
  --suite: Run specific test suite
  --verbose: Show each test name
  --quiet: Minimal output
  --failfast: Stop on first failure
  --list: List all available tests
- Execution time: ~1 second for full suite

Documentation:
- Added comprehensive TESTING.md guide
- Test writing templates
- Best practices
- Coverage information
- Troubleshooting guide

.gitignore:
- Added Python cache files
- Added output directory
- Added IDE and OS files

Test Results:
 71/71 tests passing (100% pass rate)
 All existing configs validated
 Fast execution (<1 second)
 Ready for CI/CD integration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 02:08:58 +03:00
yusyus
eeef230c7b Implement high and medium priority improvements
High Priority:
- Fix hardcoded package_skill.py path (line 778)
  Changed from: /mnt/skills/examples/skill-creator/scripts/package_skill.py
  Changed to: package_skill.py (local repository path)

Medium Priority:
- Add comprehensive config validation
  * Validates required fields (name, base_url)
  * Validates name format (alphanumeric, hyphens, underscores)
  * Validates base_url format (http/https)
  * Validates selectors structure and recommends standard selectors
  * Validates url_patterns (include/exclude lists)
  * Validates categories structure
  * Validates rate_limit range (0-10 seconds)
  * Validates max_pages range (1-10000)
  * Validates start_urls format if present
  * Provides clear error messages for invalid configs

- Add --dry-run flag for preview mode
  * Previews first 20 URLs without saving data
  * Shows what would be scraped without creating files
  * Discovers links to estimate total pages
  * Displays configuration summary
  * No directories created in dry-run mode
  * Useful for testing configs before full scrape

All changes tested and working correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 01:57:59 +03:00
yusyus
f8c75a3b2d Add comprehensive CLAUDE.md for Claude Code integration
- Add root-level CLAUDE.md with complete guidance for Claude Code
- Include Python 3.7+ requirement
- Add first-time user workflow with all commands
- Include CSS selector testing with BeautifulSoup examples
- Add output quality verification commands
- Document force re-scrape instructions
- Fix package_skill.py path (remove hardcoded /mnt/skills reference)
- Add complete config file structure with real examples
- Include testing section for selector validation
- Add performance metrics table
- Document all key code locations with line numbers
- Organize by: quick start → architecture → workflows → troubleshooting
- Preserve existing docs/CLAUDE.md as detailed technical reference

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 01:43:02 +03:00
yusyus
a9b8591731 Update README.md with detailed project description and features; add initial VSCode settings. 2025-10-17 15:21:39 +00:00
yusyus
78b9cae398 Init 2025-10-17 15:14:44 +00:00
yusyus
397d47fe7c Initial commit 2025-10-17 17:43:48 +03:00