- Support <pre class="brush: java"> pattern (SyntaxHighlighter)
- Support bare class names like <pre class="python">
- Add _extract_language_from_classes() helper method
- Apply detection logic to both code and parent pre elements
- Add 3 comprehensive test cases
Improves language detection for 25+ programming languages across
various documentation site formats.
Co-authored-by: Ricardo JL Rufino <ricardo@edu3.com.br>
Major restructure to support both CLI usage and MCP integration:
**Repository Structure:**
- cli/ - All CLI tools (doc_scraper, estimate_pages, enhance_skill, etc.)
- mcp/ - New MCP server for Claude Code integration
- configs/ - Shared configuration files
- tests/ - Updated to import from cli/
- docs/ - Shared documentation
**MCP Server (NEW):**
- mcp/server.py - Full MCP server implementation
- 6 tools available:
* generate_config - Create config from URL
* estimate_pages - Fast page count estimation
* scrape_docs - Full documentation scraping
* package_skill - Package to .zip
* list_configs - Show available presets
* validate_config - Validate config files
- mcp/README.md - Complete MCP documentation
- mcp/requirements.txt - MCP dependencies
**CLI Tools (Moved to cli/):**
- All existing functionality preserved
- Same commands, same behavior
- Tests updated to import from cli.doc_scraper
**Tests:**
- 68/71 passing (95.8%)
- Updated imports from doc_scraper to cli.doc_scraper
- Fixed validate_config() tuple unpacking (errors, warnings)
- 3 minor test failures (checking warnings instead of errors)
**Benefits:**
- Use as CLI tool: python3 cli/doc_scraper.py
- Use via MCP: Integrated with Claude Code
- Shared code and configs
- Single source of truth
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Test Framework:
- Created tests/ directory structure
- Added __init__.py for test package
- Implemented 71 comprehensive tests across 3 test suites
Test Suites:
1. test_config_validation.py (25 tests)
- Valid/invalid config structure
- Required fields validation
- Name format validation
- URL format validation
- Selectors validation
- URL patterns validation
- Categories validation
- Rate limit validation (0-10 range)
- Max pages validation (1-10000 range)
- Start URLs validation
2. test_scraper_features.py (28 tests)
- URL validation (include/exclude patterns)
- Language detection (Python, JavaScript, GDScript, C++, etc.)
- Pattern extraction from documentation
- Smart categorization (by URL, title, content)
- Text cleaning utilities
3. test_integration.py (18 tests)
- Dry-run mode functionality
- Config loading and validation
- Real config files validation (godot, react, vue, django, fastapi, steam)
- URL processing and normalization
- Content extraction
Test Runner (run_tests.py):
- Custom colored test runner with ANSI colors
- Detailed test summary with breakdown by category
- Success rate calculation
- Command-line options:
--suite: Run specific test suite
--verbose: Show each test name
--quiet: Minimal output
--failfast: Stop on first failure
--list: List all available tests
- Execution time: ~1 second for full suite
Documentation:
- Added comprehensive TESTING.md guide
- Test writing templates
- Best practices
- Coverage information
- Troubleshooting guide
.gitignore:
- Added Python cache files
- Added output directory
- Added IDE and OS files
Test Results:
✅ 71/71 tests passing (100% pass rate)
✅ All existing configs validated
✅ Fast execution (<1 second)
✅ Ready for CI/CD integration
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>