, [role="main"], or
+ 3. Headings (h1-h6) with text and id attributes
+ 4. Code blocks from <pre> or <code> tags
+ 5. Text content from paragraphs
+
Args:
- html_content: Raw HTML content
- url: Source URL
+ html_content: Raw HTML content string
+ url: Source URL (for reference in result dict)
Returns:
- Page dict with title, content, code_samples, headings, links
+ Dict with keys:
+ - url: str - Source URL
+ - title: str - From the <title> tag, cleaned
+ - content: str - Text content from main area
+ - headings: List[Dict] - {'level': 'h2', 'text': str, 'id': str}
+ - code_samples: List[Dict] - {'code': str, 'language': str}
+ - links: List - Empty (HTML links not extracted to avoid client-side routes)
+ - patterns: List - Empty (reserved for future use)
+
+ Note:
+ Prefers <main> or <article> tags for the content area.
+ Falls back to <body> if no semantic content container is found.
+ Language detection uses detect_language() method.
"""
page = {
'url': url,
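For illustration, the extraction flow this docstring describes can be sketched with the stdlib `html.parser` module. This is a minimal, hypothetical approximation: the names `PageExtractor` and `parse_page` are not part of the module's actual API, and the real implementation (including `detect_language()` and main-content selection) is not shown in this diff.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collect headings (h1-h6, with id) and code blocks (<pre>)."""
    HEADINGS = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')

    def __init__(self):
        super().__init__()
        self.headings = []       # [{'level': 'h2', 'text': ..., 'id': ...}]
        self.code_samples = []   # [{'code': ..., 'language': ...}]
        self._current = None     # (tag, attrs) of the element being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == 'pre':
            self._current = (tag, dict(attrs))
            self._buf = []

    def handle_data(self, data):
        if self._current:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._current and self._current[0] == tag:
            open_tag, attr = self._current
            text = ''.join(self._buf).strip()
            if open_tag == 'pre':
                # Real code would run language detection here
                self.code_samples.append({'code': text, 'language': 'unknown'})
            else:
                self.headings.append({'level': open_tag, 'text': text,
                                      'id': attr.get('id', '')})
            self._current = None

def parse_page(html_content, url):
    """Build the result dict shape documented above."""
    extractor = PageExtractor()
    extractor.feed(html_content)
    return {'url': url, 'title': '', 'content': '',
            'headings': extractor.headings,
            'code_samples': extractor.code_samples,
            'links': [], 'patterns': []}
```

The empty `links` and `patterns` keys mirror the documented contract: links are deliberately not extracted, and `patterns` is reserved.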
diff --git a/src/skill_seekers/cli/llms_txt_parser.py b/src/skill_seekers/cli/llms_txt_parser.py
index 2e143bf..ae11410 100644
--- a/src/skill_seekers/cli/llms_txt_parser.py
+++ b/src/skill_seekers/cli/llms_txt_parser.py
@@ -16,8 +16,19 @@ class LlmsTxtParser:
"""
Extract all URLs from the llms.txt content.
+ Supports both markdown-style links [text](url) and bare URLs.
+ Resolves relative URLs using base_url if provided.
+ Filters out malformed URLs with invalid anchor patterns.
+
Returns:
- List of unique URLs found in the content
+ List of unique, cleaned URLs found in the content.
+ Returns empty list if no valid URLs found.
+
+ Note:
+ - Markdown links: [Getting Started](./docs/guide.md)
+ - Bare URLs: https://example.com/api.md
+ - Relative paths resolved with base_url
+ - Invalid anchors (#section/path.md) are stripped
"""
urls = set()
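The two URL forms named in the docstring (markdown-style links and bare URLs) can be handled with a pair of regexes plus `urljoin` for relative paths. A minimal sketch, assuming this behavior; `extract_urls` is an illustrative stand-in, not the parser's actual method:

```python
import re
from urllib.parse import urljoin

MD_LINK = re.compile(r'\[[^\]]*\]\(([^)\s]+)\)')   # [text](url)
BARE_URL = re.compile(r'https?://[^\s)\]]+')       # bare absolute URLs

def extract_urls(content, base_url=None):
    """Collect unique URLs from llms.txt-style content."""
    urls = set()
    # Markdown links: keep absolute targets, resolve relative ones
    for target in MD_LINK.findall(content):
        if target.startswith(('http://', 'https://')):
            urls.add(target)
        elif base_url:
            urls.add(urljoin(base_url, target))
    # Bare URLs, searched after markdown links are removed to avoid doubles
    stripped = MD_LINK.sub('', content)
    urls.update(BARE_URL.findall(stripped))
    return sorted(urls)
```

The real method would additionally pass each candidate through the anchor-cleaning step documented in `_clean_url` below in this file.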
@@ -48,11 +59,23 @@ class LlmsTxtParser:
"""
Clean and validate URL, removing invalid anchor patterns.
+ Detects and strips malformed anchors that contain path separators.
+ Valid: https://example.com/page.md#section
+ Invalid: https://example.com/page#section/index.html.md
+
Args:
- url: URL to clean
+ url: URL to clean (absolute or relative)
Returns:
- Cleaned URL or empty string if invalid
+ Cleaned URL with malformed anchors stripped.
+ Returns the URL without its fragment if the anchor contains '/' (malformed).
+ Returns the original URL if the anchor is valid or absent.
+
+ Example:
+ >>> parser._clean_url("https://ex.com/page#sec/path.md")
+ 'https://ex.com/page'
+ >>> parser._clean_url("https://ex.com/page.md#section")
+ 'https://ex.com/page.md#section'
"""
# Skip URLs with path after anchor (e.g., #section/index.html.md)
# These are malformed and return duplicate HTML content
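The rule documented above (strip the fragment when it contains a path separator) can be expressed in a few lines with `urllib.parse`. A sketch under that assumption; `clean_url` is an illustrative stand-in for the private `_clean_url` method:

```python
from urllib.parse import urlsplit, urlunsplit

def clean_url(url):
    """Strip malformed anchors that contain a path separator."""
    parts = urlsplit(url)
    if '/' in parts.fragment:
        # e.g. '#section/index.html.md': such anchors are malformed and
        # fetch duplicate HTML content, so drop the fragment entirely
        return urlunsplit(parts._replace(fragment=''))
    return url
```

`urlsplit` parses the fragment without touching the rest of the URL, so valid anchors like `#section` pass through unchanged.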
diff --git a/src/skill_seekers/cli/unified_skill_builder.py b/src/skill_seekers/cli/unified_skill_builder.py
index a80f86d..ef6437c 100644
--- a/src/skill_seekers/cli/unified_skill_builder.py
+++ b/src/skill_seekers/cli/unified_skill_builder.py
@@ -287,6 +287,10 @@ This skill combines knowledge from multiple sources:
def _generate_docs_references(self, docs_list: List[Dict]):
"""Generate references from multiple documentation sources."""
+ # Skip if no documentation sources
+ if not docs_list:
+ return
+
docs_dir = os.path.join(self.skill_dir, 'references', 'documentation')
os.makedirs(docs_dir, exist_ok=True)
@@ -347,6 +351,10 @@ This skill combines knowledge from multiple sources:
def _generate_github_references(self, github_list: List[Dict]):
"""Generate references from multiple GitHub sources."""
+ # Skip if no GitHub sources
+ if not github_list:
+ return
+
github_dir = os.path.join(self.skill_dir, 'references', 'github')
os.makedirs(github_dir, exist_ok=True)
@@ -429,6 +437,10 @@ This skill combines knowledge from multiple sources:
def _generate_pdf_references(self, pdf_list: List[Dict]):
"""Generate references from PDF sources."""
+ # Skip if no PDF sources
+ if not pdf_list:
+ return
+
pdf_dir = os.path.join(self.skill_dir, 'references', 'pdf')
os.makedirs(pdf_dir, exist_ok=True)