fix: Add empty list checks and enhance docstrings (PR #243 review fixes)

Two critical improvements from PR #243 code review:

## Fix 1: Empty List Edge Case Handling

Added early return checks to prevent creating empty index files:

**Files Modified:**
- src/skill_seekers/cli/unified_skill_builder.py

**Changes:**
- _generate_docs_references: Skip if docs_list empty
- _generate_github_references: Skip if github_list empty
- _generate_pdf_references: Skip if pdf_list empty

**Impact:**
Prevents empty "Combined from 0 sources" index files from being written when a source type has no entries.
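
As a rough illustration of the early-return guard, the sketch below uses a hypothetical standalone function. The real `_generate_docs_references` body is not shown in this commit message, so the function name, its signature, the `refs_dir` parameter, and the index file layout are all assumptions.

```python
from pathlib import Path
from typing import Dict, List


def generate_docs_references(docs_list: List[Dict], refs_dir: Path) -> bool:
    """Write a combined index for docs sources; return False if skipped.

    Hypothetical stand-in for the guarded method, not the project's code.
    """
    # Early return: with no sources there is nothing to index, so no
    # directory or "Combined from 0 sources" file is ever created.
    if not docs_list:
        return False

    refs_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"Combined from {len(docs_list)} sources", ""]
    lines += [f"- {doc.get('name', 'unnamed')}" for doc in docs_list]
    (refs_dir / "index.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

Placing the check before any filesystem call means an empty list leaves no trace on disk, which is presumably the behavior the three guarded methods now share.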

## Fix 2: Enhanced Method Docstrings

Added comprehensive parameter types and return value documentation:

**Files Modified:**
- src/skill_seekers/cli/llms_txt_parser.py
  - extract_urls: Added detailed examples and behavior notes
  - _clean_url: Added malformed URL pattern examples

- src/skill_seekers/cli/doc_scraper.py
  - _extract_markdown_content: Full return dict structure documented
  - _extract_html_as_markdown: Extraction strategy and fallback behavior

**Impact:**
Improved developer experience with detailed API documentation.
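
The documented behavior of `extract_urls` and `_clean_url` can be approximated with the minimal sketch below. It is not the project's implementation: the regular expressions, the module-level function names, and the choice to drop relative links when no `base_url` is given are assumptions made for illustration.

```python
import re
from typing import List, Optional
from urllib.parse import urljoin


def clean_url(url: str) -> str:
    """Strip an anchor that contains a path separator; keep valid anchors."""
    base, sep, anchor = url.partition("#")
    if sep and "/" in anchor:
        # Malformed anchor such as "#section/index.html.md": keep the base only.
        return base
    return url


def extract_urls(content: str, base_url: Optional[str] = None) -> List[str]:
    """Collect unique, cleaned URLs from markdown links and bare URLs."""
    # Markdown-style links: [text](url)
    candidates = re.findall(r"\[[^\]]*\]\(([^)\s]+)\)", content)
    # Bare http(s) URLs; duplicates of link targets are removed below.
    candidates += re.findall(r"https?://[^\s)\]>]+", content)

    seen = set()
    result: List[str] = []
    for candidate in candidates:
        if base_url:
            # urljoin leaves absolute URLs as they are and resolves relative ones.
            candidate = urljoin(base_url, candidate)
        elif not candidate.startswith(("http://", "https://")):
            # Assumption: a relative link cannot be resolved without base_url.
            continue
        cleaned = clean_url(candidate)
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

With this sketch, a link to `https://example.com/page#section/index.html.md` collapses to `https://example.com/page`, while `https://example.com/page.md#section` passes through unchanged, matching the valid and invalid cases listed in the docstring.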

## Testing

Test results (no unexpected failures):
- 32/32 PR #243 tests (markdown parsing + multi-source)
- 975/975 core tests
- 159 skipped (optional dependencies)
- 4 failed (missing anthropic dependency - expected)

Co-authored-by: Code Review <claude-sonnet-4.5@anthropic.com>
yusyus
2026-01-11 14:01:23 +03:00
parent a7f13ec75f
commit 04de96f2f5
3 changed files with 87 additions and 10 deletions


@@ -16,8 +16,19 @@ class LlmsTxtParser:
"""
Extract all URLs from the llms.txt content.
+Supports both markdown-style links [text](url) and bare URLs.
+Resolves relative URLs using base_url if provided.
+Filters out malformed URLs with invalid anchor patterns.
Returns:
-List of unique URLs found in the content
+List of unique, cleaned URLs found in the content.
+Returns empty list if no valid URLs found.
+Note:
+- Markdown links: [Getting Started](./docs/guide.md)
+- Bare URLs: https://example.com/api.md
+- Relative paths resolved with base_url
+- Invalid anchors (#section/path.md) are stripped
"""
urls = set()
@@ -48,11 +59,23 @@ class LlmsTxtParser:
"""
Clean and validate URL, removing invalid anchor patterns.
+Detects and strips malformed anchors that contain path separators.
+Valid: https://example.com/page.md#section
+Invalid: https://example.com/page#section/index.html.md
Args:
-url: URL to clean
+url: URL to clean (absolute or relative)
Returns:
-Cleaned URL or empty string if invalid
+Cleaned URL with malformed anchors stripped.
+Returns base URL if anchor contains '/' (malformed).
+Returns original URL if anchor is valid or no anchor present.
+Example:
+>>> parser._clean_url("https://ex.com/page#sec/path.md")
+"https://ex.com/page"
+>>> parser._clean_url("https://ex.com/page.md#section")
+"https://ex.com/page.md#section"
"""
# Skip URLs with path after anchor (e.g., #section/index.html.md)
# These are malformed and return duplicate HTML content