Python 3.14's urlparse() raises ValueError on URLs with unencoded
brackets that look like malformed IPv6 (e.g. http://[fdaa:x:x:x::x
from docs.openclaw.ai llms-full.txt). sanitize_url() called urlparse()
BEFORE encoding brackets, so it crashed before it could fix them.
Fix: catch ValueError from urlparse, encode ALL brackets, then retry.
This is safe because if urlparse rejected the brackets, they are NOT
valid IPv6 host literals and should be encoded anyway.
Also fixed Discord e2e tests to skip gracefully on network issues.
Fixes#284
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous fix (a82cf69) only addressed anchor fragment stripping but
left the fundamental problem: _convert_to_md_urls() blindly appended
/index.html.md to ALL non-.md URLs from llms.txt. This only works for
Docusaurus sites — for sites like Discord docs it generates mass 404s.
Changes:
- _convert_to_md_urls() now strips anchors and deduplicates only,
preserving original URLs as-is instead of appending /index.html.md
- New _has_md_extension() helper uses urlparse().path.endswith(".md")
instead of error-prone ".md" in url substring matching
- Fixed ".md" in url checks at 4 locations (lines 465, 554, 716, 775)
- Removed 24 lines of dead commented-out code
- Added real-world e2e test against docs.discord.com (no mocks)
- Updated unit tests for new behavior (32 tests)
Fixes#277
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>