skill-seekers-reference/tests/test_url_conversion.py at b0fd1d7ee0ce56abc55d0113da2a994bc04cd541

firefrost-gaming/skill-seekers-reference

Files

yusyus a82cf6967a fix: Strip anchor fragments in URL conversion to prevent 404 errors (fixes #277 )

Critical bug fix for llms.txt URL parsing:

Problem:
- URLs with anchor fragments (e.g., #synchronous-initialization) were
  malformed when converting to .md format
- Example: https://example.com/api#method → https://example.com/api#method/index.html.md ❌
- Caused 404 errors and duplicate requests for same page with different anchors

Solution:
1. Parse URLs with urllib.parse.urlparse() to extract fragments
2. Strip anchor fragments before appending /index.html.md
3. Deduplicate base URLs (multiple anchors → single request)
4. Fix .md detection: '.md' in url → url.endswith('.md')
   - Prevents false matches on URLs like /cmd-line or /AMD-processors

Changes:
- src/skill_seekers/cli/doc_scraper.py (_convert_to_md_urls)
  - Added URL parsing to remove fragments
  - Added deduplication with seen_base_urls set
  - Fixed .md extension detection
  - Updated log message to show deduplicated count
- tests/test_url_conversion.py (NEW)
  - 12 comprehensive tests covering all edge cases
  - Real-world MikroORM case validation
  - 54/54 tests passing (42 existing + 12 new)
- CHANGELOG.md
  - Documented bug fix and solution

Reported-by: @devjones <https://github.com/yusufkaraaslan/Skill_Seekers/issues/277>

2026-02-04 21:16:13 +03:00

9.1 KiB

Raw Blame History

View Raw

9.1 KiB Raw Blame History

9.1 KiB

Raw Blame History