fix: centralize bracket-encoding to prevent 'Invalid IPv6 URL' on all code paths (#284)

The original fix (741daf1) only patched LlmsTxtParser._clean_url(), which covers URLs extracted directly from llms.txt content. But URLs discovered from .md files during BFS crawl (_extract_markdown_content) and from HTML pages (extract_content) bypass _clean_url() entirely. When those pages contain links with square brackets (e.g. /api/[v1]/users), httpx raises 'Invalid IPv6 URL' on fetch.

Fix: add a shared sanitize_url() utility in cli/utils.py that percent-encodes [ and ] in path/query components, and apply it at every URL ingestion point:

- _enqueue_url(): main chokepoint; all discovered URLs pass through
- scrape_page(): safety net for start_urls that skip _enqueue_url
- scrape_page_async(): same for async mode
- dry-run sync/async paths: direct fetches that also bypass _enqueue_url

LlmsTxtParser._clean_url() now delegates bracket-encoding to the shared sanitize_url() (DRY), keeping only its malformed-anchor stripping logic.

Added 16 tests: sanitize_url unit tests, _clean_url bracket tests, _enqueue_url sanitization tests, and integration test verifying markdown content with bracket URLs is handled safely.

Fixes #284
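
For illustration only: the body of sanitize_url() in cli/utils.py is not part of this excerpt, so the sketch below is an assumption modeled on the inline logic this commit removes from _clean_url(). The name sanitize_url_sketch is hypothetical. It encodes brackets in the path and query only, leaving the host untouched so IPv6 literals stay valid.

```python
from urllib.parse import urlparse, urlunparse


def sanitize_url_sketch(url: str) -> str:
    """Percent-encode [ and ] in the path and query of a URL.

    Hypothetical stand-in for the shared helper; the real implementation
    may differ. The host is left untouched because brackets are legal
    there (IPv6 literals).
    """
    # Fast path: nothing to encode.
    if "[" not in url and "]" not in url:
        return url

    parsed = urlparse(url)
    encoded_path = parsed.path.replace("[", "%5B").replace("]", "%5D")
    encoded_query = parsed.query.replace("[", "%5B").replace("]", "%5D")
    return urlunparse(parsed._replace(path=encoded_path, query=encoded_query))


if __name__ == "__main__":
    print(sanitize_url_sketch("https://example.com/api/[v1]/users"))
    print(sanitize_url_sketch("http://[::1]:8080/a[0]?x[]=1"))
```

Because already-encoded URLs contain no literal brackets, a helper of this shape leaves them unchanged, which is what makes it safe to apply at several ingestion points in a row.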
2026-03-14 23:53:47 +03:00
parent f214976ccd
commit b25a6f7f53
5 changed files with 226 additions and 15 deletions
--- a/src/skill_seekers/cli/llms_txt_parser.py
+++ b/src/skill_seekers/cli/llms_txt_parser.py
@@ -3,6 +3,8 @@
 import re
 from urllib.parse import urljoin

+from skill_seekers.cli.utils import sanitize_url
+

 class LlmsTxtParser:
    """Parse llms.txt markdown content into page structures"""
@@ -92,19 +94,8 @@ class LlmsTxtParser:
                # Extract the base URL without the malformed anchor
                url = url[:anchor_pos]

-        # Percent-encode square brackets in the path — they are only valid in
-        # the host portion of a URL (IPv6 literals). Leaving them unencoded
-        # causes httpx to raise "Invalid IPv6 URL" when the URL is fetched.
-        if "[" in url or "]" in url:
-            from urllib.parse import urlparse, urlunparse
-
-            parsed = urlparse(url)
-            # Only encode brackets in the path/query/fragment, not in the host
-            encoded_path = parsed.path.replace("[", "%5B").replace("]", "%5D")
-            encoded_query = parsed.query.replace("[", "%5B").replace("]", "%5D")
-            url = urlunparse(parsed._replace(path=encoded_path, query=encoded_query))
-
-        return url
+        # Percent-encode square brackets in the path/query (see #284).
+        return sanitize_url(url)

    def parse(self) -> list[dict]:
        """