fix: centralize bracket-encoding to prevent 'Invalid IPv6 URL' on all code paths (#284)

The original fix (741daf1) only patched LlmsTxtParser._clean_url(),
which covers URLs extracted directly from llms.txt content. But URLs
discovered from .md files during the BFS crawl (_extract_markdown_content)
and from HTML pages (extract_content) bypass _clean_url() entirely.
When those pages contain links with square brackets (e.g.
/api/[v1]/users), httpx raises 'Invalid IPv6 URL' on fetch.

Fix: add a shared sanitize_url() utility in cli/utils.py that
percent-encodes [ and ] in path/query components, and apply it at
every URL ingestion point:

- _enqueue_url(): main chokepoint; all discovered URLs pass through it
- scrape_page(): safety net for start_urls that skip _enqueue_url
- scrape_page_async(): same for async mode
- dry-run sync/async paths: direct fetches that also bypass _enqueue_url

LlmsTxtParser._clean_url() now delegates bracket-encoding to the
shared sanitize_url() (DRY), keeping only its malformed-anchor
stripping logic.

Added 16 tests: sanitize_url unit tests, _clean_url bracket tests,
_enqueue_url sanitization tests, and an integration test verifying
that markdown content with bracket URLs is handled safely.

Fixes #284
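
The chokepoint idea described in this message can be sketched roughly as follows. This is a minimal illustration, not the project's code: `MiniCrawler`, its `visited` set, and its `pending` queue are hypothetical stand-ins, and the body of `_enqueue_url()` shown here is an assumption about how such a method might look. Only `sanitize_url()` mirrors the helper added by this commit.

```python
from collections import deque
from urllib.parse import urlparse, urlunparse


def sanitize_url(url: str) -> str:
    """Compact copy of the helper added in this commit."""
    if "[" not in url and "]" not in url:
        return url
    parsed = urlparse(url)
    path = parsed.path.replace("[", "%5B").replace("]", "%5D")
    query = parsed.query.replace("[", "%5B").replace("]", "%5D")
    return urlunparse(parsed._replace(path=path, query=query))


class MiniCrawler:
    """Hypothetical stand-in for the real scraper; only queue logic is shown."""

    def __init__(self) -> None:
        self.visited = set()    # URLs already queued or fetched
        self.pending = deque()  # BFS frontier

    def _enqueue_url(self, url: str) -> None:
        # Sanitise first, so the visited set and the queue only ever hold
        # the encoded form, whichever extractor discovered the link.
        url = sanitize_url(url)
        if url not in self.visited:
            self.visited.add(url)
            self.pending.append(url)


crawler = MiniCrawler()
crawler._enqueue_url("https://example.com/api/[v1]/users")
# Same page, already percent-encoded: deduplicated against the first entry.
crawler._enqueue_url("https://example.com/api/%5Bv1%5D/users")
print(list(crawler.pending))
```

Sanitising before the duplicate check has a side benefit in this sketch: the raw and pre-encoded spellings of one link collapse to a single queue entry.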
@@ -484,3 +484,43 @@ def offset_to_line(newline_offsets: list[int], offset: int) -> int:
        1-based line number corresponding to *offset*.
    """
    return bisect.bisect_left(newline_offsets, offset) + 1


# ---------------------------------------------------------------------------
# URL sanitisation
# ---------------------------------------------------------------------------


def sanitize_url(url: str) -> str:
    """Percent-encode square brackets in a URL's path and query components.

    Unencoded ``[`` and ``]`` in the path are technically invalid per
    RFC 3986 (they are only legal in the host for IPv6 literals). Libraries
    such as *httpx* and *urllib3* interpret them as IPv6 address markers and
    raise ``Invalid IPv6 URL``.

    This function encodes **only** the path and query; the scheme, host,
    and fragment are left untouched.

    Args:
        url: Absolute or scheme-relative URL to sanitise.

    Returns:
        The URL with ``[`` → ``%5B`` and ``]`` → ``%5D`` in its path/query,
        or the original URL unchanged when no brackets are present.

    Examples:
        >>> sanitize_url("https://example.com/api/[v1]/users")
        'https://example.com/api/%5Bv1%5D/users'
        >>> sanitize_url("https://example.com/docs/guide")
        'https://example.com/docs/guide'
    """
    if "[" not in url and "]" not in url:
        return url

    from urllib.parse import urlparse, urlunparse

    parsed = urlparse(url)
    encoded_path = parsed.path.replace("[", "%5B").replace("]", "%5D")
    encoded_query = parsed.query.replace("[", "%5B").replace("]", "%5D")
    return urlunparse(parsed._replace(path=encoded_path, query=encoded_query))
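
To make the docstring's claims concrete, here is a small table-driven check of the edge cases it implies: brackets in the query, brackets in the fragment, an IPv6 literal host, an already-encoded URL, and a relative path. The function body is copied from the hunk above; the `CASES` table and its example URLs are illustrative assumptions and are not the 16 tests added by the commit.

```python
from urllib.parse import urlparse, urlunparse


def sanitize_url(url: str) -> str:
    """Copy of the helper from the diff, with the import hoisted."""
    if "[" not in url and "]" not in url:
        return url
    parsed = urlparse(url)
    encoded_path = parsed.path.replace("[", "%5B").replace("]", "%5D")
    encoded_query = parsed.query.replace("[", "%5B").replace("]", "%5D")
    return urlunparse(parsed._replace(path=encoded_path, query=encoded_query))


# (input, expected) pairs; the URLs are made up for illustration.
CASES = [
    # Brackets in the path are encoded.
    ("https://example.com/api/[v1]/users", "https://example.com/api/%5Bv1%5D/users"),
    # Brackets in the query are encoded.
    ("https://example.com/search?tag[]=a", "https://example.com/search?tag%5B%5D=a"),
    # Brackets in the fragment are left alone.
    ("https://example.com/a#sec[2]", "https://example.com/a#sec[2]"),
    # An IPv6 literal host keeps its brackets; only the path changes.
    ("http://[::1]:8080/a[b]", "http://[::1]:8080/a%5Bb%5D"),
    # Already-encoded input is returned unchanged (no double encoding).
    ("https://example.com/api/%5Bv1%5D/users", "https://example.com/api/%5Bv1%5D/users"),
    # A relative path works as well.
    ("/api/[v1]/users", "/api/%5Bv1%5D/users"),
]

for raw, expected in CASES:
    result = sanitize_url(raw)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{raw!r} -> {result!r} ({status})")
```

Because the function only ever replaces literal `[` and `]`, running it on its own output is a no-op, which is what makes it safe to apply at several ingestion points in a row.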