skill-seekers-reference/tests/test_scraper_features.py at b25a6f7f5341a67e56996157822545d100f1b7ed

firefrost-gaming/skill-seekers-reference

yusyus b25a6f7f53 fix: centralize bracket-encoding to prevent 'Invalid IPv6 URL' on all code paths (#284)

The original fix (741daf1) only patched LlmsTxtParser._clean_url(),
which covers URLs extracted directly from llms.txt content. But URLs
discovered from .md files during BFS crawl (_extract_markdown_content)
and from HTML pages (extract_content) bypass _clean_url() entirely.
When those pages contain links with square brackets (e.g.
/api/[v1]/users), httpx raises 'Invalid IPv6 URL' on fetch.

Fix: add a shared sanitize_url() utility in cli/utils.py that
percent-encodes [ and ] in path/query components, and apply it at
every URL ingestion point:

- _enqueue_url(): main chokepoint — all discovered URLs pass through
- scrape_page(): safety net for start_urls that skip _enqueue_url
- scrape_page_async(): same for async mode
- dry-run sync/async paths: direct fetches that also bypass _enqueue_url

LlmsTxtParser._clean_url() now delegates bracket-encoding to the
shared sanitize_url() (DRY), keeping only its malformed-anchor
stripping logic.
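
The bracket-encoding step described above can be sketched as follows. This is only an illustration, assuming the standard library's urllib.parse; the actual sanitize_url() in cli/utils.py is not shown on this page and may differ. Returning the URL unchanged when it cannot be split is an assumption of this sketch, not something the commit message states.

```python
from urllib.parse import urlsplit


def sanitize_url(url: str) -> str:
    """Percent-encode '[' and ']' in the path and query of a URL.

    Brackets in the host are left alone, because they legitimately
    delimit IPv6 literals such as http://[::1]:8080/.
    """
    if "[" not in url and "]" not in url:
        return url  # fast path: nothing to encode

    def encode(component: str) -> str:
        return component.replace("[", "%5B").replace("]", "%5D")

    try:
        parts = urlsplit(url)
    except ValueError:
        # Assumption of this sketch: an unsplittable URL is returned
        # unchanged so the caller can decide how to handle it.
        return url

    # Only path and query are rewritten; scheme, host, and fragment stay as-is.
    return parts._replace(
        path=encode(parts.path),
        query=encode(parts.query),
    ).geturl()


if __name__ == "__main__":
    print(sanitize_url("https://example.com/api/[v1]/users"))
    print(sanitize_url("https://example.com/search?tags[]=a"))
    print(sanitize_url("http://[::1]:8080/a[b]"))
```

Because already-encoded URLs contain no literal brackets, calling such a function more than once on the same URL is harmless, which suits a design where it runs both at the enqueue chokepoint and again as a safety net before each fetch.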

Added 16 tests: sanitize_url unit tests, _clean_url bracket tests,
_enqueue_url sanitization tests, and an integration test verifying
that markdown content with bracket URLs is handled safely.

Fixes #284

2026-03-14 23:53:47 +03:00

26 KiB
