Python 3.14's urlparse() raises ValueError on URLs with unencoded
brackets that look like malformed IPv6 (e.g. http://[fdaa:x:x:x::x
from docs.openclaw.ai llms-full.txt). sanitize_url() called urlparse()
BEFORE encoding brackets, so it crashed before it could fix them.
Fix: catch ValueError from urlparse, encode ALL brackets, then retry.
This is safe because if urlparse rejected the brackets, they are NOT
valid IPv6 host literals and should be encoded anyway.
Also fixed Discord e2e tests to skip gracefully on network issues.
Fixes#284
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original fix (741daf1) only patched LlmsTxtParser._clean_url(),
which covers URLs extracted directly from llms.txt content. But URLs
discovered from .md files during BFS crawl (_extract_markdown_content)
and from HTML pages (extract_content) bypass _clean_url() entirely.
When those pages contain links with square brackets (e.g.
/api/[v1]/users), httpx raises 'Invalid IPv6 URL' on fetch.
Fix: add a shared sanitize_url() utility in cli/utils.py that
percent-encodes [ and ] in path/query components, and apply it at
every URL ingestion point:
- _enqueue_url(): main chokepoint — all discovered URLs pass through
- scrape_page(): safety net for start_urls that skip _enqueue_url
- scrape_page_async(): same for async mode
- dry-run sync/async paths: direct fetches that also bypass _enqueue_url
LlmsTxtParser._clean_url() now delegates bracket-encoding to the
shared sanitize_url() (DRY), keeping only its malformed-anchor
stripping logic.
Added 16 tests: sanitize_url unit tests, _clean_url bracket tests,
_enqueue_url sanitization tests, and integration test verifying
markdown content with bracket URLs is handled safely.
Fixes#284
- Support <pre class="brush: java"> pattern (SyntaxHighlighter)
- Support bare class names like <pre class="python">
- Add _extract_language_from_classes() helper method
- Apply detection logic to both code and parent pre elements
- Add 3 comprehensive test cases
Improves language detection for 25+ programming languages across
various documentation site formats.
Co-authored-by: Ricardo JL Rufino <ricardo@edu3.com.br>
Major restructure to support both CLI usage and MCP integration:
**Repository Structure:**
- cli/ - All CLI tools (doc_scraper, estimate_pages, enhance_skill, etc.)
- mcp/ - New MCP server for Claude Code integration
- configs/ - Shared configuration files
- tests/ - Updated to import from cli/
- docs/ - Shared documentation
**MCP Server (NEW):**
- mcp/server.py - Full MCP server implementation
- 6 tools available:
* generate_config - Create config from URL
* estimate_pages - Fast page count estimation
* scrape_docs - Full documentation scraping
* package_skill - Package to .zip
* list_configs - Show available presets
* validate_config - Validate config files
- mcp/README.md - Complete MCP documentation
- mcp/requirements.txt - MCP dependencies
**CLI Tools (Moved to cli/):**
- All existing functionality preserved
- Same commands, same behavior
- Tests updated to import from cli.doc_scraper
**Tests:**
- 68/71 passing (95.8%)
- Updated imports from doc_scraper to cli.doc_scraper
- Fixed validate_config() tuple unpacking (errors, warnings)
- 3 minor test failures (checking warnings instead of errors)
**Benefits:**
- Use as CLI tool: python3 cli/doc_scraper.py
- Use via MCP: Integrated with Claude Code
- Shared code and configs
- Single source of truth
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Test Framework:
- Created tests/ directory structure
- Added __init__.py for test package
- Implemented 71 comprehensive tests across 3 test suites
Test Suites:
1. test_config_validation.py (25 tests)
- Valid/invalid config structure
- Required fields validation
- Name format validation
- URL format validation
- Selectors validation
- URL patterns validation
- Categories validation
- Rate limit validation (0-10 range)
- Max pages validation (1-10000 range)
- Start URLs validation
2. test_scraper_features.py (28 tests)
- URL validation (include/exclude patterns)
- Language detection (Python, JavaScript, GDScript, C++, etc.)
- Pattern extraction from documentation
- Smart categorization (by URL, title, content)
- Text cleaning utilities
3. test_integration.py (18 tests)
- Dry-run mode functionality
- Config loading and validation
- Real config files validation (godot, react, vue, django, fastapi, steam)
- URL processing and normalization
- Content extraction
Test Runner (run_tests.py):
- Custom colored test runner with ANSI colors
- Detailed test summary with breakdown by category
- Success rate calculation
- Command-line options:
--suite: Run specific test suite
--verbose: Show each test name
--quiet: Minimal output
--failfast: Stop on first failure
--list: List all available tests
- Execution time: ~1 second for full suite
Documentation:
- Added comprehensive TESTING.md guide
- Test writing templates
- Best practices
- Coverage information
- Troubleshooting guide
.gitignore:
- Added Python cache files
- Added output directory
- Added IDE and OS files
Test Results:
✅ 71/71 tests passing (100% pass rate)
✅ All existing configs validated
✅ Fast execution (<1 second)
✅ Ready for CI/CD integration
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>