The previous fix (a82cf69) only addressed anchor fragment stripping but
left the fundamental problem: _convert_to_md_urls() blindly appended
/index.html.md to ALL non-.md URLs from llms.txt. This only works for
Docusaurus sites; for sites like Discord docs it generates mass 404s.
Changes:
- _convert_to_md_urls() now strips anchors and deduplicates only,
  preserving original URLs as-is instead of appending /index.html.md
- New _has_md_extension() helper uses urlparse().path.endswith(".md")
  instead of error-prone ".md" in url substring matching
- Fixed ".md" in url checks at 4 locations (lines 465, 554, 716, 775)
- Removed 24 lines of dead commented-out code
- Added real-world e2e test against docs.discord.com (no mocks)
- Updated unit tests for new behavior (32 tests)
Fixes #277
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added comprehensive integration tests using the exact MikroORM URLs that
caused 404 errors in the original bug report.
Test Coverage (6 integration tests):
1. test_mikro_orm_urls_from_issue_277
   - Tests exact URLs from the bug report
   - Verifies no malformed anchor fragments in results
   - Validates deduplication and correct URL transformation
2. test_no_404_causing_urls_generated
   - Verifies no URLs matching the 404 error pattern are generated
   - Tests all problematic patterns from the issue
3. test_deduplication_prevents_multiple_requests
   - Validates that multiple anchors on same page deduplicate correctly
   - Checks that each page is requested only once
4. test_md_files_with_anchors_preserved
   - Tests .md files with anchors are handled correctly
   - Verifies anchor stripping on .md URLs
5. test_real_scraping_scenario_no_404s
   - Integration test simulating full llms.txt parsing flow
   - Validates URL structure with regex patterns
6. test_issue_277_error_message_urls
   - Tests the exact malformed URLs from error output
   - Verifies correct URLs are generated instead
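As a rough sketch of the shape of the deduplication and URL-structure
checks listed above: strip_and_dedupe() is a hypothetical stand-in for
the function under test, and the example.com URLs are placeholders, not
the MikroORM URLs from the issue.

```python
import re
from urllib.parse import urldefrag


def strip_and_dedupe(urls):
    """Hypothetical stand-in: drop #fragments, keep first-seen order."""
    seen, result = set(), []
    for url in urls:
        base = urldefrag(url).url
        if base not in seen:
            seen.add(base)
            result.append(base)
    return result


def test_deduplication_sketch():
    # Placeholder URLs: two anchors on one page plus one .md page.
    urls = [
        "https://example.com/docs/usage#install",
        "https://example.com/docs/usage#configure",
        "https://example.com/docs/reference.md#api",
    ]
    result = strip_and_dedupe(urls)

    # Multiple anchors on the same page collapse to a single request.
    assert result == [
        "https://example.com/docs/usage",
        "https://example.com/docs/reference.md",
    ]
    # No URL keeps a fragment, checked structurally with a regex.
    assert all(re.fullmatch(r"https://[^#\s]+", u) for u in result)
```

Run under pytest, this would be collected as one test; the real suite
presumably asserts against the URLs from the bug report instead.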
Results:
- 18/18 tests passing (12 unit + 6 integration)
- All MikroORM URLs from issue #277 handled correctly
- No 404-causing patterns generated
Related: #277