fix: Strip anchor fragments in URL conversion to prevent 404 errors (fixes #277)
Critical bug fix for llms.txt URL parsing:

Problem:
- URLs with anchor fragments (e.g., #synchronous-initialization) were malformed when converting to .md format
- Example: https://example.com/api#method → https://example.com/api#method/index.html.md ❌
- Caused 404 errors and duplicate requests for same page with different anchors

Solution:
1. Parse URLs with urllib.parse.urlparse() to extract fragments
2. Strip anchor fragments before appending /index.html.md
3. Deduplicate base URLs (multiple anchors → single request)
4. Fix .md detection: '.md' in url → url.endswith('.md')
   - Prevents false matches on URLs like /cmd-line or /AMD-processors

Changes:
- src/skill_seekers/cli/doc_scraper.py (_convert_to_md_urls)
  - Added URL parsing to remove fragments
  - Added deduplication with seen_base_urls set
  - Fixed .md extension detection
  - Updated log message to show deduplicated count
- tests/test_url_conversion.py (NEW)
  - 12 comprehensive tests covering all edge cases
  - Real-world MikroORM case validation
  - 54/54 tests passing (42 existing + 12 new)
- CHANGELOG.md
  - Documented bug fix and solution

Reported-by: @devjones <https://github.com/yusufkaraaslan/Skill_Seekers/issues/277>
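
The four solution steps can be sketched roughly as follows. This is a minimal illustration, not the code from doc_scraper.py: the function name `convert_to_md_urls` is hypothetical, and appending `/index.html.md` to the path (so that any query string stays at the end) is an assumption about how query parameters should be treated.

```python
from urllib.parse import urlparse


def convert_to_md_urls(urls):
    """Hypothetical stand-in for _convert_to_md_urls(); a sketch only."""
    md_urls = []
    seen_base_urls = set()

    for url in urls:
        # Steps 1-2: parse the URL and drop the anchor fragment.
        parsed = urlparse(url)._replace(fragment="")
        base_url = parsed.geturl()

        # Step 3: several anchors on one page collapse to a single request.
        if base_url in seen_base_urls:
            continue
        seen_base_urls.add(base_url)

        # Step 4: suffix check on the path instead of a substring check.
        if parsed.path.endswith(".md"):
            md_urls.append(base_url)
        else:
            # Assumption: strip a trailing slash, then append to the path
            # so that a query string, if present, is preserved after it.
            md_path = parsed.path.rstrip("/") + "/index.html.md"
            md_urls.append(parsed._replace(path=md_path).geturl())

    return md_urls


result = convert_to_md_urls([
    "https://example.com/api#method",
    "https://example.com/api#other",
    "https://example.com/guide.md#top",
    "https://example.com/cmd-line/",
])
```

Here the two `/api` anchors yield one `https://example.com/api/index.html.md` entry, and `guide.md` is passed through with its fragment removed. Note that this sketch treats `/api` and `/api/` as distinct base URLs; whether the real implementation normalizes them is not stated in the commit.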
21
CHANGELOG.md
@@ -7,6 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Fixed

#### URL Conversion Bug with Anchor Fragments (Issue #277)
- **Critical Bug Fix**: Fixed 404 errors when scraping documentation with anchor links
- **Problem**: URLs with anchor fragments (e.g., `#synchronous-initialization`) were malformed
  - Incorrect: `https://example.com/docs/api#method/index.html.md` ❌
  - Correct: `https://example.com/docs/api/index.html.md` ✅
- **Root Cause**: `_convert_to_md_urls()` didn't strip anchor fragments before appending `/index.html.md`
- **Solution**: Parse URLs with `urllib.parse` to remove fragments and deduplicate base URLs
- **Impact**: Prevents duplicate requests for the same page with different anchors
- **Additional Fix**: Changed `.md` detection from `".md" in url` to `url.endswith('.md')`
  - Prevents false matches on URLs like `/cmd-line` or `/AMD-processors`
- **Test Coverage**: 12 comprehensive tests covering all edge cases
  - Anchor fragment stripping
  - Deduplication of multiple anchors on same URL
  - Query parameter preservation
  - Trailing slash handling
  - Real-world MikroORM case validation
  - 54/54 tests passing (42 existing + 12 new)
- **Reported by**: @devjones via Issue #277
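
The difference between the two `.md` checks can be seen in a small comparison. The URLs and helper names below are illustrative assumptions, not taken from the test suite. A substring check misfires only when `.md` literally appears before the end of the URL, so the example includes such a case.

```python
def is_md_substring(url):
    # Old behaviour: matches ".md" anywhere in the URL.
    return ".md" in url


def is_md_suffix(url):
    # New behaviour: matches only when the URL ends with ".md".
    return url.endswith(".md")


urls = [
    "https://example.com/docs/guide.md",        # an actual Markdown file
    "https://example.com/docs/guide.md/intro",  # ".md" in the middle of the path
    "https://example.com/docs/cmd-line",        # no dot before "md", so no ".md"
]

substring_hits = [is_md_substring(u) for u in urls]
suffix_hits = [is_md_suffix(u) for u in urls]
```

With these inputs the substring check flags the first two URLs, while the suffix check flags only the first.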

### Added

#### Extended Language Detection (NEW)