yusyus
a82cf6967a
fix: Strip anchor fragments in URL conversion to prevent 404 errors (fixes #277)
Critical bug fix for llms.txt URL parsing:
Problem:
- URLs with anchor fragments (e.g., #synchronous-initialization) were
  malformed when converting to .md format
- Example: https://example.com/api#method → https://example.com/api#method/index.html.md ❌
- This caused 404 errors and duplicate requests for the same page when links differed only by anchor
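The malformed result can be reproduced with a small sketch. The helper name `naive_to_md` is hypothetical and stands in for the pre-fix behaviour described above; it is not code from the repository. It only shows that appending the suffix to a URL that still carries a fragment places the suffix inside the fragment:

```python
from urllib.parse import urlparse


def naive_to_md(url: str) -> str:
    """Hypothetical pre-fix behaviour: append the suffix to the raw URL."""
    return url.rstrip("/") + "/index.html.md"


bad = naive_to_md("https://example.com/api#method")
parsed = urlparse(bad)

# urlparse treats everything after '#' as the fragment,
# so the appended suffix never becomes part of the path.
print(bad)
print(parsed.path)
print(parsed.fragment)
```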
Solution:
1. Parse URLs with urllib.parse.urlparse() to extract fragments
2. Strip anchor fragments before appending /index.html.md
3. Deduplicate base URLs (multiple anchors → single request)
4. Fix .md detection: '.md' in url → url.endswith('.md')
   - Prevents false matches on URLs like /cmd-line or /AMD-processors
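The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `_convert_to_md_urls` implementation: the standalone function name `convert_to_md_urls` and the sample URLs are assumptions, and query strings are not handled.

```python
from urllib.parse import urlparse, urlunparse


def convert_to_md_urls(urls):
    """Sketch: strip fragments, deduplicate, then append /index.html.md."""
    seen_base_urls = set()
    md_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Rebuild the URL without its anchor fragment.
        base_url = urlunparse(parsed._replace(fragment=""))
        # Several anchors on one page collapse to a single request.
        if base_url in seen_base_urls:
            continue
        seen_base_urls.add(base_url)
        # endswith() only matches a real .md suffix, not '.md' mid-string.
        if base_url.endswith(".md"):
            md_urls.append(base_url)
        else:
            md_urls.append(base_url.rstrip("/") + "/index.html.md")
    return md_urls


result = convert_to_md_urls([
    "https://example.com/api#method",
    "https://example.com/api#other",
    "https://example.com/guide.md#intro",
    "https://example.com/cmd-line",
])
print(result)
```

With these inputs the two `/api` anchors yield one entry, and the `.md` URL is kept without an extra suffix.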
Changes:
- src/skill_seekers/cli/doc_scraper.py (_convert_to_md_urls)
  - Added URL parsing to remove fragments
  - Added deduplication with seen_base_urls set
  - Fixed .md extension detection
  - Updated log message to show deduplicated count
- tests/test_url_conversion.py (NEW)
  - 12 comprehensive tests covering all edge cases
  - Real-world MikroORM case validation
  - 54/54 tests passing (42 existing + 12 new)
- CHANGELOG.md
  - Documented bug fix and solution
Reported-by: @devjones <https://github.com/yusufkaraaslan/Skill_Seekers/issues/277>
2026-02-04 21:16:13 +03:00