Merge branch 'development' into feature/router-quality-improvements

Integrated multi-source support from development branch into feature branch's
C3.x auto-cloning and cache system. This merge combines TWO major features:

FEATURE BRANCH (C3.x + Cache):
- Automatic GitHub repository cloning for C3.x analysis
- Hidden .skillseeker-cache/ directory for intermediate files
- Cache reuse for faster rebuilds
- Enhanced AI skill quality improvements

DEVELOPMENT BRANCH (Multi-Source):
- Support for multiple sources of the same type (multiple GitHub repos, PDFs)
- List-based data storage with source indexing
- New configs: claude-code.json, medusa-mercurjs.json
- llms.txt downloader/parser enhancements
- New tests: test_markdown_parsing.py, test_multi_source.py

CONFLICT RESOLUTIONS:

1. configs/claude-code.json (COMPROMISE):
   - Feature branch had deleted it (config migration)
   - Development branch enhanced it (47 Claude Code doc URLs)
   - Resolution: kept the file with a _migration_note (preserves PR #244 work)

2. src/skill_seekers/cli/unified_scraper.py (INTEGRATED):
   Applied 8 changes for multi-source support (see the sketch after this list):
   - List-based storage: {'github': [], 'documentation': [], 'pdf': []}
   - Source indexing with _source_counters
   - Unique naming: {name}_github_{idx}_{repo_id}
   - Unique data files: github_data_{idx}_{repo_id}.json
   - List append instead of dict assignment
   - Updated _clone_github_repo(repo_name, idx=0) signature
   - Applied same logic to _scrape_pdf()
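A minimal sketch of the list-based storage, source indexing, and unique naming described in the list above. The names sources and _source_counters and the {name}_github_{idx}_{repo_id} / github_data_{idx}_{repo_id}.json patterns come from this commit message; the class wrapper, method names, and cache path are illustrative assumptions, not the actual unified_scraper.py code.

import json
from pathlib import Path


class MultiSourceStore:
    """Illustrative stand-in for the scraper's multi-source bookkeeping."""

    def __init__(self, name: str, cache_dir: str = ".skillseeker-cache"):
        self.name = name
        self.cache_dir = Path(cache_dir)
        # One list per source type instead of a single entry per type
        self.sources = {'github': [], 'documentation': [], 'pdf': []}
        # Per-type counters that supply the source index (idx)
        self._source_counters = {'github': 0, 'documentation': 0, 'pdf': 0}

    def add_github_source(self, repo_id: str, data: dict) -> str:
        idx = self._source_counters['github']
        self._source_counters['github'] += 1
        # Unique naming: {name}_github_{idx}_{repo_id}
        source_name = f"{self.name}_github_{idx}_{repo_id}"
        # Unique data file: github_data_{idx}_{repo_id}.json
        data_file = self.cache_dir / f"github_data_{idx}_{repo_id}.json"
        data_file.parent.mkdir(parents=True, exist_ok=True)
        data_file.write_text(json.dumps(data, indent=2), encoding='utf-8')
        # Append to the list instead of assigning to a dict key,
        # so a second repo no longer overwrites the first
        self.sources['github'].append({
            'name': source_name, 'idx': idx,
            'repo_id': repo_id, 'data_file': str(data_file),
        })
        return source_name


# A single-source config still gets idx=0, which is the backward-compatibility
# case called out under TESTING STRATEGY below.
store = MultiSourceStore("httpx")
store.add_github_source("encode_httpx", {"files": 120})
store.add_github_source("facebook_react", {"files": 3400})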

3. src/skill_seekers/cli/unified_skill_builder.py (INTEGRATED):
   Applied 3 changes for multi-source synthesis (see the sketch after this list):
   - _load_source_skill_mds(): Glob pattern for multiple sources
   - _generate_references(): Iterate through github_list
   - _generate_c3_analysis_references(repo_id): Per-repo C3.x references
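A hedged sketch of the glob-based loading and separator-based merging that the builder changes describe. The cache layout (.skillseeker-cache/<name>_*/SKILL.md) and the separator format are assumptions for illustration; only the glob-and-iterate idea is stated in this commit.

from pathlib import Path


def load_source_skill_mds(cache_dir: str, name: str) -> list[tuple[str, str]]:
    """Collect every per-source SKILL.md the scraper produced (assumed layout)."""
    results = []
    for skill_md in sorted(Path(cache_dir).glob(f"{name}_*/SKILL.md")):
        results.append((skill_md.parent.name, skill_md.read_text(encoding="utf-8")))
    return results


def merge_sources(sources: list[tuple[str, str]]) -> str:
    """Join all source documents, keeping an explicit separator per source."""
    parts = [f"<!-- source: {source_name} -->\n{text}" for source_name, text in sources]
    return "\n\n---\n\n".join(parts)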

TESTING STRATEGY:

Backward Compatibility:
- Single source configs work exactly as before (idx=0)

New Capabilities (example config sketched after this list):
- Multiple GitHub repos: encode/httpx + facebook/react
- Multiple PDFs with unique indexing
- Mixed sources: docs + multiple GitHub repos
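Illustrative shape of a mixed-source config exercising the capabilities above, written as a Python dict for brevity. The actual schema of configs/*.json is not shown in this commit, so the field names (sources, type, repo, url, path) and the URLs/paths are assumptions.

example_config = {
    "name": "httpx-plus-react",
    "sources": [
        {"type": "documentation", "url": "https://www.python-httpx.org/"},  # docs site (assumed URL)
        {"type": "github", "repo": "encode/httpx"},
        {"type": "github", "repo": "facebook/react"},
        {"type": "pdf", "path": "docs/extra-notes.pdf"},  # hypothetical PDF path
    ],
}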

Pipeline Integrity:
- Scraper: Multi-source data collection with indexing
- Builder: Loads all source SKILL.md files
- Synthesis: Merges multiple sources with separators
- C3.x: Independent analysis per repo in unique subdirectories
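A small sketch of how per-repo C3.x references might be gathered from the cache. The subdirectory naming (c3_analysis_<repo_id>) and artifact extensions are assumptions; the commit only states that each repo gets an independent C3.x analysis in its own subdirectory and that _generate_c3_analysis_references(repo_id) emits references per repo.

from pathlib import Path


def generate_c3_analysis_references(cache_dir: str, repo_id: str) -> list[str]:
    """List the C3.x analysis artifacts produced for one cloned repo (assumed layout)."""
    repo_dir = Path(cache_dir) / f"c3_analysis_{repo_id}"
    if not repo_dir.is_dir():
        return []
    return [f"- {path.relative_to(cache_dir)}" for path in sorted(repo_dir.glob("*.md"))]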

Result: Support MULTIPLE sources per type + C3.x analysis + cache system

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
10 changed files with 1695 additions and 131 deletions


@@ -349,6 +349,193 @@ class DocToSkillConverter:
        return page

    def _extract_markdown_content(self, content: str, url: str) -> Dict[str, Any]:
        """Extract structured content from a Markdown file.

        Parses markdown files from llms.txt URLs to extract:
        - Title from first h1 heading
        - Headings (h2-h6, excluding h1)
        - Code blocks with language detection
        - Internal .md links for BFS crawling
        - Content paragraphs (>20 chars)

        Auto-detects HTML content and falls back to _extract_html_as_markdown.

        Args:
            content: Raw markdown content string (or HTML if server returned HTML)
            url: Source URL for resolving relative links

        Returns:
            Dict with keys:
            - url: str - Source URL
            - title: str - Extracted from first # heading
            - content: str - Paragraphs joined with double newlines
            - headings: List[Dict] - {'level': 'h2', 'text': str, 'id': str}
            - code_samples: List[Dict] - {'code': str, 'language': str}
            - links: List[str] - Absolute URLs to other .md files
            - patterns: List - Empty (reserved for future use)

        Note:
            Only .md links are extracted to avoid client-side rendered HTML pages.
            Anchor fragments (#section) are stripped from links.
        """
        import re

        # Detect if content is actually HTML (some .md URLs return HTML)
        if content.strip().startswith('<!DOCTYPE') or content.strip().startswith('<html'):
            return self._extract_html_as_markdown(content, url)

        page = {
            'url': url,
            'title': '',
            'content': '',
            'headings': [],
            'code_samples': [],
            'patterns': [],
            'links': []
        }

        lines = content.split('\n')

        # Extract title from first h1
        for line in lines:
            if line.startswith('# '):
                page['title'] = line[2:].strip()
                break

        # Extract headings (h2-h6)
        for line in lines:
            match = re.match(r'^(#{2,6})\s+(.+)$', line)
            if match:
                level = len(match.group(1))
                text = match.group(2).strip()
                page['headings'].append({
                    'level': f'h{level}',
                    'text': text,
                    'id': text.lower().replace(' ', '-')
                })

        # Extract code blocks with language
        code_blocks = re.findall(r'```(\w+)?\n(.*?)```', content, re.DOTALL)
        for lang, code in code_blocks:
            if len(code.strip()) > 10:
                page['code_samples'].append({
                    'code': code.strip(),
                    'language': lang or 'unknown'
                })

        # Extract content (paragraphs)
        content_no_code = re.sub(r'```.*?```', '', content, flags=re.DOTALL)
        paragraphs = []
        for para in content_no_code.split('\n\n'):
            text = para.strip()
            # Skip headings and short text
            if text and len(text) > 20 and not text.startswith('#'):
                paragraphs.append(text)
        page['content'] = '\n\n'.join(paragraphs)

        # Extract links from markdown (only .md files to avoid client-side rendered HTML pages)
        md_links = re.findall(r'\[([^\]]*)\]\(([^)]+)\)', content)
        for _, href in md_links:
            if href.startswith('http'):
                full_url = href
            elif not href.startswith('#'):
                full_url = urljoin(url, href)
            else:
                continue
            # Strip anchor fragments
            full_url = full_url.split('#')[0]
            # Only include .md URLs to avoid client-side rendered HTML pages
            if '.md' in full_url and self.is_valid_url(full_url) and full_url not in page['links']:
                page['links'].append(full_url)

        return page
    def _extract_html_as_markdown(self, html_content: str, url: str) -> Dict[str, Any]:
        """Extract content from HTML and convert to markdown-like structure.

        Fallback method when a .md URL returns HTML content instead of markdown.
        Uses BeautifulSoup to extract structured data from HTML elements.

        Extraction strategy:
        1. Title from <title> tag
        2. Main content from <main>, <article>, [role="main"], or <body>
        3. Headings (h1-h6) with text and id attributes
        4. Code blocks from <pre><code> or <pre> tags
        5. Text content from paragraphs

        Args:
            html_content: Raw HTML content string
            url: Source URL (for reference in result dict)

        Returns:
            Dict with keys:
            - url: str - Source URL
            - title: str - From <title> tag, cleaned
            - content: str - Text content from main area
            - headings: List[Dict] - {'level': 'h2', 'text': str, 'id': str}
            - code_samples: List[Dict] - {'code': str, 'language': str}
            - links: List - Empty (HTML links not extracted to avoid client-side routes)
            - patterns: List - Empty (reserved for future use)

        Note:
            Prefers <main> or <article> tags for content area.
            Falls back to <body> if no semantic content container found.
            Language detection uses detect_language() method.
        """
        page = {
            'url': url,
            'title': '',
            'content': '',
            'headings': [],
            'code_samples': [],
            'patterns': [],
            'links': []
        }

        soup = BeautifulSoup(html_content, 'html.parser')

        # Try to extract title
        title_elem = soup.select_one('title')
        if title_elem:
            page['title'] = self.clean_text(title_elem.get_text())

        # Try to find main content area
        main = soup.select_one('main, article, [role="main"], .content')
        if not main:
            main = soup.body if soup.body else soup

        if main:
            # Extract headings
            for h in main.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
                text = self.clean_text(h.get_text())
                if text:
                    page['headings'].append({
                        'level': h.name,
                        'text': text,
                        'id': h.get('id', '')
                    })

            # Extract code blocks
            for code_elem in main.select('pre code, pre'):
                code = code_elem.get_text()
                if len(code.strip()) > 10:
                    lang = self.detect_language(code_elem, code)
                    page['code_samples'].append({
                        'code': code.strip(),
                        'language': lang
                    })

            # Extract paragraphs
            paragraphs = []
            for p in main.find_all('p'):
                text = self.clean_text(p.get_text())
                if text and len(text) > 20:
                    paragraphs.append(text)
            page['content'] = '\n\n'.join(paragraphs)

        return page

    def detect_language(self, elem, code):
        """Detect programming language from code block
@@ -386,14 +573,19 @@ class DocToSkillConverter:
return text.strip()
def save_page(self, page: Dict[str, Any]) -> None:
"""Save page data"""
"""Save page data (skip pages with empty content)"""
# Skip pages with empty or very short content
if not page.get('content') or len(page.get('content', '')) < 50:
logger.debug("Skipping page with empty/short content: %s", page.get('url', 'unknown'))
return
url_hash = hashlib.md5(page['url'].encode()).hexdigest()[:10]
safe_title = re.sub(r'[^\w\s-]', '', page['title'])[:50]
safe_title = re.sub(r'[-\s]+', '_', safe_title)
filename = f"{safe_title}_{url_hash}.json"
filepath = os.path.join(self.data_dir, "pages", filename)
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(page, f, indent=2, ensure_ascii=False)
@@ -408,6 +600,7 @@ class DocToSkillConverter:
Note:
Uses threading locks when workers > 1 for thread safety
Supports both HTML pages and Markdown (.md) files
"""
try:
# Scraping part (no lock needed - independent)
@@ -415,8 +608,12 @@ class DocToSkillConverter:
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
page = self.extract_content(soup, url)
# Check if this is a Markdown file
if url.endswith('.md') or '.md' in url:
page = self._extract_markdown_content(response.text, url)
else:
soup = BeautifulSoup(response.content, 'html.parser')
page = self.extract_content(soup, url)
# Thread-safe operations (lock required)
if self.workers > 1:
@@ -463,6 +660,7 @@ class DocToSkillConverter:
Note:
Uses asyncio.Lock for async-safe operations instead of threading.Lock
Supports both HTML pages and Markdown (.md) files
"""
async with semaphore: # Limit concurrent requests
try:
@@ -471,9 +669,13 @@ class DocToSkillConverter:
response = await client.get(url, headers=headers, timeout=30.0)
response.raise_for_status()
# BeautifulSoup parsing (still synchronous, but fast)
soup = BeautifulSoup(response.content, 'html.parser')
page = self.extract_content(soup, url)
# Check if this is a Markdown file
if url.endswith('.md') or '.md' in url:
page = self._extract_markdown_content(response.text, url)
else:
# BeautifulSoup parsing (still synchronous, but fast)
soup = BeautifulSoup(response.content, 'html.parser')
page = self.extract_content(soup, url)
# Async-safe operations (no lock needed - single event loop)
logger.info(" %s", url)
@@ -493,6 +695,56 @@ class DocToSkillConverter:
except Exception as e:
logger.error(" ✗ Error scraping %s: %s: %s", url, type(e).__name__, e)
    def _convert_to_md_urls(self, urls: List[str]) -> List[str]:
        """
        Convert URLs to .md format, trying /index.html.md suffix for non-.md URLs.

        URLs are not pre-checked for existence; they are added to the queue
        directly and validated later during crawling.

        Args:
            urls: List of URLs to process

        Returns:
            List of .md URLs (not yet validated)
        """
        md_urls = []
        for url in urls:
            if '.md' in url:
                md_urls.append(url)
            else:
                # Convert directly to .md format without sending a HEAD request
                url = url.rstrip('/')
                md_url = f"{url}/index.html.md"
                md_urls.append(md_url)

        logger.info(" ✓ Converted %d URLs to .md format (will validate during crawl)", len(md_urls))
        return md_urls

    # ORIGINAL _convert_to_md_urls (with HEAD request validation):
    # def _convert_to_md_urls(self, urls: List[str]) -> List[str]:
    #     md_urls = []
    #     non_md_urls = []
    #     for url in urls:
    #         if '.md' in url:
    #             md_urls.append(url)
    #         else:
    #             non_md_urls.append(url)
    #
    #     if non_md_urls:
    #         logger.info(" 🔄 Trying to convert %d non-.md URLs to .md format...", len(non_md_urls))
    #         converted = 0
    #         for url in non_md_urls:
    #             url = url.rstrip('/')
    #             md_url = f"{url}/index.html.md"
    #             try:
    #                 resp = requests.head(md_url, timeout=5, allow_redirects=True)
    #                 if resp.status_code == 200:
    #                     md_urls.append(md_url)
    #                     converted += 1
    #             except Exception:
    #                 pass
    #         logger.info(" ✓ Converted %d URLs to .md format", converted)
    #
    #     return md_urls

    def _try_llms_txt(self) -> bool:
        """
        Try to use llms.txt instead of HTML scraping.
@@ -548,7 +800,29 @@ class DocToSkillConverter:
logger.info("%s (%d chars)", extra_filename, len(extra_content))
# Parse explicit file for skill building
parser = LlmsTxtParser(content)
parser = LlmsTxtParser(content, self.base_url)
# Extract URLs from llms.txt and add to pending_urls for BFS crawling
extracted_urls = parser.extract_urls()
if extracted_urls:
# Convert non-.md URLs to .md format by trying /index.html.md suffix
md_urls = self._convert_to_md_urls(extracted_urls)
logger.info("\n🔗 Found %d URLs in llms.txt (%d .md files), starting BFS crawl...",
len(extracted_urls), len(md_urls))
# Filter URLs based on url_patterns config
for url in md_urls:
if self.is_valid_url(url) and url not in self.visited_urls:
self.pending_urls.append(url)
logger.info(" 📋 %d URLs added to crawl queue after filtering", len(self.pending_urls))
# Return False to trigger HTML scraping with the populated pending_urls
self.llms_txt_detected = True
self.llms_txt_variant = 'explicit'
return False # Continue with BFS crawling
# Fallback: if no URLs found, use section-based parsing
pages = parser.parse()
if pages:
@@ -606,7 +880,29 @@ class DocToSkillConverter:
largest = max(downloaded.items(), key=lambda x: x[1]['size'])
logger.info("\n📄 Parsing %s for skill building...", largest[1]['filename'])
parser = LlmsTxtParser(largest[1]['content'])
parser = LlmsTxtParser(largest[1]['content'], self.base_url)
# Extract URLs from llms.txt and add to pending_urls for BFS crawling
extracted_urls = parser.extract_urls()
if extracted_urls:
# Convert non-.md URLs to .md format by trying /index.html.md suffix
md_urls = self._convert_to_md_urls(extracted_urls)
logger.info("\n🔗 Found %d URLs in llms.txt (%d .md files), starting BFS crawl...",
len(extracted_urls), len(md_urls))
# Filter URLs based on url_patterns config
for url in md_urls:
if self.is_valid_url(url) and url not in self.visited_urls:
self.pending_urls.append(url)
logger.info(" 📋 %d URLs added to crawl queue after filtering", len(self.pending_urls))
# Return False to trigger HTML scraping with the populated pending_urls
self.llms_txt_detected = True
self.llms_txt_variants = list(downloaded.keys())
return False # Continue with BFS crawling
# Fallback: if no URLs found, use section-based parsing
pages = parser.parse()
if not pages: