* fix: resolve 8 pipeline bugs found during skill quality review - Fix 0 APIs extracted from documentation by enriching summary.json with individual page file content before conflict detection - Fix all "Unknown" entries in merged_api.md by injecting dict keys as API names and falling back to AI merger field names - Fix frontmatter using raw slugs instead of config name by normalizing frontmatter after SKILL.md generation - Fix leaked absolute filesystem paths in patterns/index.md by stripping .skillseeker-cache repo clone prefixes - Fix ARCHITECTURE.md file count always showing "1 files" by counting files per language from code_analysis data - Fix YAML parse errors on GitHub Actions workflows by converting boolean keys (on: true) to strings - Fix false React/Vue.js framework detection in C# projects by filtering web frameworks based on primary language - Improve how-to guide generation by broadening workflow example filter to include setup/config examples with sufficient complexity - Fix test_git_sources_e2e failures caused by git init default branch being 'main' instead of 'master' Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address 6 review issues in ExecutionContext implementation Fixes from code review: 1. Mode resolution (#3 critical): _args_to_data no longer unconditionally overwrites mode. Only writes mode="api" when --api-key explicitly passed. Env-var-based mode detection moved to _default_data() as lowest priority. 2. Re-initialization warning (#4): initialize() now logs debug message when called a second time instead of silently returning stale instance. 3. _raw_args preserved in override (#5): temp context now copies _raw_args from parent so get_raw() works correctly inside override blocks. 4. test_local_mode_detection env cleanup (#7): test now saves/restores API key env vars to prevent failures when ANTHROPIC_API_KEY is set. 5. _load_config_file error handling (#8): wraps FileNotFoundError and JSONDecodeError with user-friendly ValueError messages. 6. Lint fixes: added logging import, fixed Generator import from collections.abc, fixed AgentClient return type annotation. Remaining P2/P3 items (documented, not blocking): - Lock TOCTOU in override() — safe on CPython, needs fix for no-GIL - get() reads _instance without lock — same CPython caveat - config_path not stored on instance - AnalysisSettings.depth not Literal constrained Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address all remaining P2/P3 review issues in ExecutionContext 1. Thread safety: get() now acquires _lock before reading _instance (#2) 2. Thread safety: override() saves/restores _initialized flag to prevent re-init during override blocks (#10) 3. Config path stored: _config_path PrivateAttr + config_path property (#6) 4. Literal validation: AnalysisSettings.depth now uses Literal["surface", "deep", "full"] — rejects invalid values (#9) 5. Test updated: test_analysis_depth_choices now expects ValidationError for invalid depth, added test_analysis_depth_valid_choices 6. Lint cleanup: removed unused imports, fixed whitespace in tests All 10 previously reported issues now resolved. 26 tests pass, lint clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: restore 5 truncated scrapers, migrate unified_scraper, fix context init 5 scrapers had main() truncated with "# Original main continues here..." after Kimi's migration — business logic was never connected: - html_scraper.py — restored HtmlToSkillConverter extraction + build - pptx_scraper.py — restored PptxToSkillConverter extraction + build - confluence_scraper.py — restored ConfluenceToSkillConverter with 3 modes - notion_scraper.py — restored NotionToSkillConverter with 4 sources - chat_scraper.py — restored ChatToSkillConverter extraction + build unified_scraper.py — migrated main() to context-first pattern with argv fallback Fixed context initialization chain: - main.py no longer initializes ExecutionContext (was stealing init from commands) - create_command.py now passes config_path from source_info.parsed - execution_context.py handles SourceInfo.raw_input (not raw_source) All 18 scrapers now genuinely migrated. 26 tests pass, lint clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve 7 data flow conflicts between ExecutionContext and legacy paths Critical fixes (CLI args silently lost): - unified_scraper Phase 6: reads ctx.enhancement.level instead of raw JSON when args=None (#3, #4) - unified_scraper Phase 6 agent: reads ctx.enhancement.agent instead of 3 independent env var lookups (#5) - doc_scraper._run_enhancement: uses agent_client.api_key instead of raw os.environ.get() — respects config file api_key (#1) Important fixes: - main._handle_analyze_command: populates _fake_args from ExecutionContext so --agent and --api-key aren't lost in analyze→enhance path (#6) - doc_scraper type annotations: replaced forward refs with Any to avoid F821 undefined name errors All changes include RuntimeError fallback for backward compatibility when ExecutionContext isn't initialized. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: 3 crashes + 1 stub in migrated scrapers found by deep scan 1. github_scraper.py: args.scrape_only and args.enhance_level crash when args=None (context path). Guarded with if args and getattr(). Also fixed agent fallback to read ctx.enhancement.agent. 2. codebase_scraper.py: args.output and args.skip_api_reference crash in summary block when args=None. Replaced with output_dir local var and ctx.analysis.skip_api_reference. 3. epub_scraper.py: main() was still a stub ending with "# Rest of main() continues..." — restored full extraction + build + enhancement logic using ctx values exclusively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: complete ExecutionContext migration for remaining scrapers Kimi's Phase 4 scraper migrations + Claude's review fixes. All 18 scrapers now use context-first pattern with argv fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: Phase 1 — ExecutionContext.get() always returns context (no RuntimeError) get() now returns a default context instead of raising RuntimeError when not explicitly initialized. This eliminates the need for try/except RuntimeError blocks in all 18 scrapers. Components can always call ExecutionContext.get() safely — it returns defaults if not initialized, or the explicitly initialized instance. Updated tests: test_get_returns_defaults_when_not_initialized, test_reset_clears_instance (no longer expects RuntimeError). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: Phase 2a-c — remove 16 individual scraper CLI commands Removed individual scraper commands from: - COMMAND_MODULES in main.py (16 entries: scrape, github, pdf, word, epub, video, jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat) - pyproject.toml entry points (16 skill-seekers-<type> binaries) - parsers/__init__.py (16 parser registrations) All source types now accessed via: skill-seekers create <source> Kept: create, unified, analyze, enhance, package, upload, install, install-agent, config, doctor, and utility commands. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: create SkillConverter base class + converter registry New base interface that all 17 converters will inherit: - SkillConverter.run() — extract + build (same call for all types) - SkillConverter.extract() — override in subclass - SkillConverter.build_skill() — override in subclass - get_converter(source_type, config) — factory from registry - CONVERTER_REGISTRY — maps source type → (module, class) create_command will use get_converter() instead of _call_module(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: Grand Unification — one command, one interface, direct converters Complete the Grand Unification refactor: `skill-seekers create` is now the single entry point for all 18 source types. Individual scraper CLI commands (scrape, github, pdf, analyze, unified, etc.) are removed. ## Architecture changes - **18 SkillConverter subclasses**: Every scraper now inherits SkillConverter with extract() + build_skill() + SOURCE_TYPE. Factory via get_converter(). - **create_command.py rewritten**: _build_config() constructs config dicts from ExecutionContext for each source type. Direct converter.run() calls replace the old _build_argv() + sys.argv swap + _call_module() machinery. - **main.py simplified**: create command bypasses _reconstruct_argv entirely, calls CreateCommand(args).execute() directly. analyze/unified commands removed (create handles both via auto-detection). - **CreateParser mode="all"**: Top-level parser now accepts all 120+ flags (--browser, --max-pages, --depth, etc.) since create is the only entry. - **Centralized enhancement**: Runs once in create_command after converter, not duplicated in each scraper. - **MCP tools use converters**: 5 scraping tools call get_converter() directly instead of subprocess. Config type auto-detected from keys. - **ConfigValidator → UniSkillConfigValidator**: Renamed with backward- compat alias. - **Data flow**: AgentClient + LocalSkillEnhancer read ExecutionContext first, env vars as fallback. ## What was removed - main() from all 18 scraper files (~3400 lines) - 18 CLI commands from COMMAND_MODULES + pyproject.toml entry points - analyze + unified parsers from parser registry - _build_argv, _call_module, _SKIP_ARGS, _DEST_TO_FLAG, all _route_*() - setup_argument_parser, get_configuration, _check_deprecated_flags - Tests referencing removed commands/functions ## Net impact 51 files changed, ~6000 lines removed. 2996 tests pass, 0 failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: review fixes for Grand Unification PR - Add autouse conftest fixture to reset ExecutionContext singleton between tests - Replace hardcoded defaults in _is_explicitly_set() with parser-derived defaults - Upgrade ExecutionContext double-init log from debug to info - Use logger.exception() in SkillConverter.run() to preserve tracebacks - Fix docstring "17 types" → "18 types" in skill_converter.py - DRY up 10 copy-paste help handlers into dict + loop (~100 lines removed) - Fix 2 CI workflows still referencing removed `skill-seekers scrape` command - Remove broken pyproject.toml entry point for codebase_scraper:main Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve 12 logic/flow issues found in deep review Critical fixes: - UnifiedScraper.run(): replace sys.exit(1) with return 1, add return 0 - doc_scraper: use ExecutionContext.get() when already initialized instead of re-calling initialize() which silently discards new config - unified_scraper: define enhancement_config before try/except to prevent UnboundLocalError in LOCAL enhancement timeout read Important fixes: - override(): cleaner tuple save/restore for singleton swap - --agent without --api-key now sets mode="local" so env API key doesn't override explicit agent choice - Remove DeprecationWarning from _reconstruct_argv (fires on every non-create command in production) - Rewrite scrape_generic_tool to use get_converter() instead of subprocess calls to removed main() functions - SkillConverter.run() checks build_skill() return value, returns 1 if False - estimate_pages_tool uses -m module invocation instead of .py file path Low-priority fixes: - get_converter() raises descriptive ValueError on class name typo - test_default_values: save/clear API key env vars before asserting mode - test_get_converter_pdf: fix config key "path" → "pdf_path" 3056 passed, 4 failed (pre-existing dep version issues), 32 skipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: update MCP server tests to mock converter instead of subprocess scrape_docs_tool now uses get_converter() + _run_converter() in-process instead of run_subprocess_with_streaming. Update 4 TestScrapeDocsTool tests to mock the converter layer instead of the removed subprocess path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: YusufKaraaslanSpyke <yusuf@spykegames.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
953 lines
41 KiB
Python
953 lines
41 KiB
Python
#!/usr/bin/env python3
|
|
"""
|
|
AsciiDoc Documentation to Skill Converter
|
|
|
|
Converts AsciiDoc (.adoc, .asciidoc) documentation files into AI-ready skills.
|
|
Supports both single files and directories of AsciiDoc documents.
|
|
|
|
Uses the ``asciidoc`` library when available for accurate HTML rendering,
|
|
falling back to a comprehensive regex-based parser that handles headings,
|
|
code blocks, tables, admonitions, include directives, and inline formatting.
|
|
|
|
Usage:
|
|
skill-seekers asciidoc --asciidoc-path doc.adoc --name myskill
|
|
skill-seekers asciidoc --asciidoc-path docs/ --name myskill
|
|
skill-seekers asciidoc --from-json doc_extracted.json
|
|
"""
|
|
|
|
import json
|
|
import logging
|
|
import os
|
|
import re
|
|
from pathlib import Path
|
|
|
|
# Optional dependency guard — asciidoc library for HTML conversion
|
|
try:
|
|
import asciidoc as asciidoc_lib # noqa: F401
|
|
|
|
ASCIIDOC_AVAILABLE = True
|
|
except ImportError:
|
|
ASCIIDOC_AVAILABLE = False
|
|
|
|
from skill_seekers.cli.skill_converter import SkillConverter
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
ASCIIDOC_EXTENSIONS = {".adoc", ".asciidoc", ".asc", ".ad"}
|
|
ADMONITION_TYPES = ("NOTE", "TIP", "WARNING", "IMPORTANT", "CAUTION")
|
|
|
|
# Regex patterns for AsciiDoc structure
|
|
RE_HEADING = re.compile(r"^(={1,5})\s+(.+)$", re.MULTILINE)
|
|
RE_SOURCE_ATTR = re.compile(r"^\[source(?:,\s*(\w[\w+#.-]*))?(?:,.*?)?\]$", re.MULTILINE)
|
|
RE_LISTING_DELIM = re.compile(r"^(-{4,})$", re.MULTILINE)
|
|
RE_LITERAL_DELIM = re.compile(r"^(\.{4,})$", re.MULTILINE)
|
|
RE_TABLE_DELIM = re.compile(r"^\|={3,}$", re.MULTILINE)
|
|
RE_TABLE_CELL = re.compile(r"^\|(.+)$", re.MULTILINE)
|
|
RE_ADMONITION_PARA = re.compile(
|
|
r"^(NOTE|TIP|WARNING|IMPORTANT|CAUTION):\s+(.+?)(?:\n\n|\Z)",
|
|
re.MULTILINE | re.DOTALL,
|
|
)
|
|
RE_ADMONITION_BLOCK = re.compile(
|
|
r"^\[(NOTE|TIP|WARNING|IMPORTANT|CAUTION)\]\n={4,}\n(.*?)\n={4,}",
|
|
re.MULTILINE | re.DOTALL,
|
|
)
|
|
RE_INCLUDE = re.compile(r"^include::(.+?)\[([^\]]*)\]$", re.MULTILINE)
|
|
RE_ATTRIBUTE = re.compile(r"^:([a-zA-Z0-9_-]+):\s*(.*)$", re.MULTILINE)
|
|
RE_ATTR_REF = re.compile(r"\{([a-zA-Z0-9_-]+)\}")
|
|
RE_BOLD = re.compile(r"\*([^\s*](?:.*?[^\s*])?)\*")
|
|
RE_ITALIC = re.compile(r"_([^\s_](?:.*?[^\s_])?)_")
|
|
RE_MONO = re.compile(r"`([^`]+)`")
|
|
RE_LINK = re.compile(r"(https?://\S+)\[([^\]]*)\]")
|
|
RE_XREF = re.compile(r"<<([^,>]+)(?:,\s*([^>]+))?>>")
|
|
|
|
|
|
def _check_asciidoc_deps() -> None:
|
|
"""Log debug message when asciidoc library is not installed (regex fallback used)."""
|
|
if not ASCIIDOC_AVAILABLE:
|
|
logger.debug(
|
|
"asciidoc library not installed; using regex-based parser.\n"
|
|
'Install with: pip install "skill-seekers[asciidoc]" or: pip install asciidoc'
|
|
)
|
|
|
|
|
|
def infer_description_from_asciidoc(metadata: dict | None = None, name: str = "") -> str:
|
|
"""Infer skill description from AsciiDoc document metadata."""
|
|
if metadata:
|
|
if metadata.get("description") and len(str(metadata["description"])) > 20:
|
|
desc = str(metadata["description"]).strip()
|
|
return (
|
|
f"Use when {desc[:147].lower()}..."
|
|
if len(desc) > 150
|
|
else f"Use when {desc.lower()}"
|
|
)
|
|
if metadata.get("title") and len(str(metadata["title"])) > 10:
|
|
return f"Use when working with {str(metadata['title']).lower()}"
|
|
return (
|
|
f"Use when referencing {name} documentation"
|
|
if name
|
|
else "Use when referencing this documentation"
|
|
)
|
|
|
|
|
|
def _score_code_quality(code: str) -> float:
|
|
"""Simple quality heuristic for code blocks (0-10 scale)."""
|
|
if not code:
|
|
return 0.0
|
|
score = 5.0
|
|
line_count = len(code.strip().split("\n"))
|
|
if line_count >= 10:
|
|
score += 2.0
|
|
elif line_count >= 5:
|
|
score += 1.0
|
|
if re.search(r"\b(def |class |function |func |fn )", code):
|
|
score += 1.5
|
|
if re.search(r"\b(import |from .+ import|require\(|#include|using )", code):
|
|
score += 0.5
|
|
if re.search(r"^ ", code, re.MULTILINE):
|
|
score += 0.5
|
|
if re.search(r"[=:{}()\[\]]", code):
|
|
score += 0.3
|
|
if len(code) < 30:
|
|
score -= 2.0
|
|
return min(10.0, max(0.0, score))
|
|
|
|
|
|
class AsciiDocToSkillConverter(SkillConverter):
|
|
"""Convert AsciiDoc documentation to an AI-ready skill.
|
|
|
|
Handles single ``.adoc`` files and directories. Content is parsed into
|
|
intermediate JSON, categorised, then rendered into the standard skill
|
|
directory layout (SKILL.md, references/, etc.).
|
|
"""
|
|
|
|
SOURCE_TYPE = "asciidoc"
|
|
|
|
def __init__(self, config: dict) -> None:
|
|
super().__init__(config)
|
|
self.config = config
|
|
self.name: str = config["name"]
|
|
self.asciidoc_path: str = config.get("asciidoc_path", "")
|
|
self.description: str = (
|
|
config.get("description") or f"Use when referencing {self.name} documentation"
|
|
)
|
|
self.skill_dir: str = f"output/{self.name}"
|
|
self.data_file: str = f"output/{self.name}_extracted.json"
|
|
self.categories: dict = config.get("categories", {})
|
|
self.extracted_data: dict | None = None
|
|
|
|
def extract(self):
|
|
"""Extract content from AsciiDoc files (SkillConverter interface)."""
|
|
self.extract_asciidoc()
|
|
|
|
# ------------------------------------------------------------------
|
|
# Extraction
|
|
# ------------------------------------------------------------------
|
|
|
|
def extract_asciidoc(self) -> bool:
|
|
"""Extract content from AsciiDoc file(s).
|
|
|
|
Discovers files, resolves attributes/includes, parses sections,
|
|
detects languages, and saves intermediate JSON.
|
|
|
|
Returns:
|
|
True on success.
|
|
|
|
Raises:
|
|
FileNotFoundError: If path does not exist.
|
|
ValueError: If no AsciiDoc files found.
|
|
"""
|
|
_check_asciidoc_deps()
|
|
from skill_seekers.cli.language_detector import LanguageDetector
|
|
|
|
print(f"\n🔍 Extracting from AsciiDoc: {self.asciidoc_path}")
|
|
path = Path(self.asciidoc_path)
|
|
if not path.exists():
|
|
raise FileNotFoundError(f"AsciiDoc path not found: {self.asciidoc_path}")
|
|
|
|
files = self._discover_files(path)
|
|
if not files:
|
|
raise ValueError(
|
|
f"No AsciiDoc files found at: {self.asciidoc_path}\n"
|
|
f"Expected extensions: {', '.join(sorted(ASCIIDOC_EXTENSIONS))}"
|
|
)
|
|
print(f" Found {len(files)} AsciiDoc file(s)")
|
|
|
|
all_sections: list[dict] = []
|
|
metadata: dict = {}
|
|
section_counter = 0
|
|
|
|
for file_path in sorted(files):
|
|
raw_text = file_path.read_text(encoding="utf-8", errors="replace")
|
|
attributes = self._extract_attributes(raw_text)
|
|
resolved_text = self._resolve_attributes(raw_text, attributes)
|
|
resolved_text = self._resolve_includes(resolved_text, file_path.parent)
|
|
if not metadata:
|
|
metadata = self._build_metadata(attributes, file_path)
|
|
|
|
for section in self._parse_asciidoc_sections(resolved_text):
|
|
section_counter += 1
|
|
section["section_number"] = section_counter
|
|
section["source_file"] = str(file_path)
|
|
body = section.pop("body", "")
|
|
section["code_samples"] = self._extract_code_blocks(body)
|
|
section["tables"] = self._extract_tables(body)
|
|
section["admonitions"] = self._extract_admonitions(body)
|
|
section["includes"] = self._extract_includes(body)
|
|
section["text"] = self._convert_to_markdown(body)
|
|
all_sections.append(section)
|
|
|
|
# Language detection
|
|
detector = LanguageDetector(min_confidence=0.15)
|
|
languages_detected: dict[str, int] = {}
|
|
total_code_blocks = 0
|
|
for section in all_sections:
|
|
for cs in section.get("code_samples", []):
|
|
if cs.get("language"):
|
|
languages_detected[cs["language"]] = (
|
|
languages_detected.get(cs["language"], 0) + 1
|
|
)
|
|
total_code_blocks += 1
|
|
for section in all_sections:
|
|
for cs in section.get("code_samples", []):
|
|
if not cs.get("language") and cs.get("code"):
|
|
lang, conf = detector.detect_from_code(cs["code"])
|
|
if lang and conf >= 0.3:
|
|
cs["language"] = lang
|
|
languages_detected[lang] = languages_detected.get(lang, 0) + 1
|
|
|
|
if not self.config.get("description"):
|
|
self.description = infer_description_from_asciidoc(metadata, self.name)
|
|
|
|
result_data = {
|
|
"source_path": self.asciidoc_path,
|
|
"metadata": metadata,
|
|
"total_sections": len(all_sections),
|
|
"total_files": len(files),
|
|
"total_code_blocks": total_code_blocks,
|
|
"total_tables": sum(len(s.get("tables", [])) for s in all_sections),
|
|
"total_admonitions": sum(len(s.get("admonitions", [])) for s in all_sections),
|
|
"languages_detected": languages_detected,
|
|
"pages": all_sections,
|
|
}
|
|
os.makedirs(os.path.dirname(self.data_file) or ".", exist_ok=True)
|
|
with open(self.data_file, "w", encoding="utf-8") as f:
|
|
json.dump(result_data, f, indent=2, ensure_ascii=False, default=str)
|
|
|
|
print(f"\n💾 Saved extracted data to: {self.data_file}")
|
|
self.extracted_data = result_data
|
|
print(
|
|
f"✅ Extracted {len(all_sections)} sections, {total_code_blocks} code blocks, "
|
|
f"{result_data['total_tables']} tables, {result_data['total_admonitions']} admonitions"
|
|
)
|
|
return True
|
|
|
|
def _discover_files(self, path: Path) -> list[Path]:
|
|
"""Return sorted list of AsciiDoc files from *path* (file or directory)."""
|
|
if path.is_file():
|
|
return [path] if path.suffix.lower() in ASCIIDOC_EXTENSIONS else []
|
|
found: list[Path] = []
|
|
for ext in ASCIIDOC_EXTENSIONS:
|
|
found.extend(path.rglob(f"*{ext}"))
|
|
return sorted(set(found))
|
|
|
|
# ------------------------------------------------------------------
|
|
# Attribute / include resolution
|
|
# ------------------------------------------------------------------
|
|
|
|
@staticmethod
|
|
def _extract_attributes(text: str) -> dict[str, str]:
|
|
"""Extract ``:attr-name: value`` definitions from text."""
|
|
return {m.group(1): m.group(2).strip() for m in RE_ATTRIBUTE.finditer(text)}
|
|
|
|
@staticmethod
|
|
def _resolve_attributes(text: str, attributes: dict[str, str]) -> str:
|
|
"""Replace ``{attr-name}`` references with their values."""
|
|
return RE_ATTR_REF.sub(lambda m: attributes.get(m.group(1), m.group(0)), text)
|
|
|
|
def _resolve_includes(self, text: str, base_dir: Path) -> str:
|
|
"""Resolve ``include::`` directives by inlining referenced files."""
|
|
max_depth = 5
|
|
|
|
def _resolve_once(src: str, depth: int) -> str:
|
|
if depth >= max_depth:
|
|
return src
|
|
|
|
def _replacer(match: re.Match) -> str:
|
|
inc_path = match.group(1).strip()
|
|
inc_file = base_dir / inc_path
|
|
if inc_file.is_file():
|
|
try:
|
|
return _resolve_once(
|
|
inc_file.read_text(encoding="utf-8", errors="replace"), depth + 1
|
|
)
|
|
except OSError:
|
|
logger.debug("Could not read include file: %s", inc_file)
|
|
return f"// include::{inc_path}[] (not resolved)"
|
|
|
|
return RE_INCLUDE.sub(_replacer, src)
|
|
|
|
return _resolve_once(text, 0)
|
|
|
|
@staticmethod
|
|
def _build_metadata(attributes: dict[str, str], file_path: Path) -> dict:
|
|
"""Build metadata dict from document attributes."""
|
|
return {
|
|
"title": attributes.get("doctitle", attributes.get("title", file_path.stem)),
|
|
"author": attributes.get("author", ""),
|
|
"email": attributes.get("email", ""),
|
|
"revision": attributes.get("revnumber", attributes.get("version", "")),
|
|
"date": attributes.get("revdate", attributes.get("date", "")),
|
|
"description": attributes.get("description", ""),
|
|
"keywords": attributes.get("keywords", ""),
|
|
"source_file": str(file_path),
|
|
}
|
|
|
|
# ------------------------------------------------------------------
|
|
# Section parsing
|
|
# ------------------------------------------------------------------
|
|
|
|
def _parse_asciidoc_sections(self, text: str) -> list[dict]:
|
|
"""Parse AsciiDoc text into sections split by headings (= through =====)."""
|
|
heading_matches = [
|
|
(m.start(), len(m.group(1)), m.group(2).strip(), m.group(0))
|
|
for m in RE_HEADING.finditer(text)
|
|
]
|
|
if not heading_matches:
|
|
return [{"heading": "", "heading_level": "h1", "body": text.strip(), "headings": []}]
|
|
|
|
sections: list[dict] = []
|
|
preamble = text[: heading_matches[0][0]].strip()
|
|
if preamble:
|
|
sections.append(
|
|
{"heading": "", "heading_level": "h1", "body": preamble, "headings": []}
|
|
)
|
|
|
|
for idx, (start, level, heading_text, raw) in enumerate(heading_matches):
|
|
body_start = start + len(raw)
|
|
body_end = heading_matches[idx + 1][0] if idx + 1 < len(heading_matches) else len(text)
|
|
body = text[body_start:body_end].strip()
|
|
|
|
sub_headings = [
|
|
{"level": f"h{len(m.group(1))}", "text": m.group(2).strip()}
|
|
for m in RE_HEADING.finditer(body)
|
|
if len(m.group(1)) > level
|
|
]
|
|
sections.append(
|
|
{
|
|
"heading": heading_text,
|
|
"heading_level": f"h{level}",
|
|
"body": body,
|
|
"headings": sub_headings,
|
|
}
|
|
)
|
|
return sections
|
|
|
|
# ------------------------------------------------------------------
|
|
# Code block extraction
|
|
# ------------------------------------------------------------------
|
|
|
|
def _extract_code_blocks(self, text: str) -> list[dict]:
|
|
"""Extract source/listing/literal code blocks from AsciiDoc text.
|
|
|
|
Handles [source,lang] + ---- blocks, bare ---- blocks, and .... blocks.
|
|
"""
|
|
blocks: list[dict] = []
|
|
consumed: list[tuple[int, int]] = []
|
|
|
|
# Pattern 1: [source,lang] + ---- block
|
|
for attr_m in RE_SOURCE_ATTR.finditer(text):
|
|
lang = (attr_m.group(1) or "").strip()
|
|
open_m = RE_LISTING_DELIM.search(text, attr_m.end())
|
|
if not open_m:
|
|
continue
|
|
between = text[attr_m.end() : open_m.start()].strip()
|
|
if between and not between.startswith(".") and "\n" in between:
|
|
continue
|
|
delim = open_m.group(1)
|
|
close_m = re.search(
|
|
r"^" + re.escape(delim) + r"$", text[open_m.end() + 1 :], re.MULTILINE
|
|
)
|
|
if not close_m:
|
|
continue
|
|
abs_close = open_m.end() + 1 + close_m.start()
|
|
code = text[open_m.end() : abs_close].strip("\n")
|
|
if code:
|
|
blocks.append(
|
|
{"code": code, "language": lang, "quality_score": _score_code_quality(code)}
|
|
)
|
|
consumed.append((attr_m.start(), abs_close + len(close_m.group(0))))
|
|
|
|
# Pattern 2: bare ---- listing blocks
|
|
for m in RE_LISTING_DELIM.finditer(text):
|
|
if self._in_range(m.start(), consumed):
|
|
continue
|
|
delim = m.group(1)
|
|
close_m = re.search(r"^" + re.escape(delim) + r"$", text[m.end() + 1 :], re.MULTILINE)
|
|
if not close_m:
|
|
continue
|
|
abs_close = m.end() + 1 + close_m.start()
|
|
code = text[m.end() : abs_close].strip("\n")
|
|
if code:
|
|
blocks.append(
|
|
{"code": code, "language": "", "quality_score": _score_code_quality(code)}
|
|
)
|
|
consumed.append((m.start(), abs_close + len(close_m.group(0))))
|
|
|
|
# Pattern 3: .... literal blocks
|
|
for m in RE_LITERAL_DELIM.finditer(text):
|
|
if self._in_range(m.start(), consumed):
|
|
continue
|
|
delim = m.group(1)
|
|
close_m = re.search(r"^" + re.escape(delim) + r"$", text[m.end() + 1 :], re.MULTILINE)
|
|
if not close_m:
|
|
continue
|
|
abs_close = m.end() + 1 + close_m.start()
|
|
code = text[m.end() : abs_close].strip("\n")
|
|
if code:
|
|
blocks.append(
|
|
{"code": code, "language": "", "quality_score": _score_code_quality(code)}
|
|
)
|
|
consumed.append((m.start(), abs_close + len(close_m.group(0))))
|
|
|
|
return blocks
|
|
|
|
# ------------------------------------------------------------------
|
|
# Table extraction
|
|
# ------------------------------------------------------------------
|
|
|
|
def _extract_tables(self, text: str) -> list[dict]:
|
|
"""Parse AsciiDoc tables delimited by ``|===``."""
|
|
tables: list[dict] = []
|
|
delimiters = list(RE_TABLE_DELIM.finditer(text))
|
|
idx = 0
|
|
while idx + 1 < len(delimiters):
|
|
body = text[delimiters[idx].end() : delimiters[idx + 1].start()].strip()
|
|
if body:
|
|
table = self._parse_table_body(body)
|
|
if table:
|
|
tables.append(table)
|
|
idx += 2
|
|
return tables
|
|
|
|
@staticmethod
|
|
def _parse_table_body(table_body: str) -> dict | None:
|
|
"""Parse body of an AsciiDoc table into headers and rows."""
|
|
groups = re.split(r"\n\s*\n", table_body.strip())
|
|
if not groups:
|
|
return None
|
|
|
|
def _parse_row(row_text: str) -> list[str]:
|
|
return [p.strip() for p in row_text.split("|") if p.strip()]
|
|
|
|
# First group → headers
|
|
headers: list[str] = []
|
|
for line in groups[0].strip().splitlines():
|
|
if line.strip().startswith("|"):
|
|
parsed = _parse_row(line)
|
|
if parsed and not headers:
|
|
headers = parsed
|
|
elif parsed:
|
|
for i, cell in enumerate(parsed):
|
|
if i < len(headers):
|
|
headers[i] = f"{headers[i]} {cell}".strip()
|
|
else:
|
|
headers.append(cell)
|
|
|
|
# Remaining groups → rows
|
|
rows: list[list[str]] = []
|
|
for group in groups[1:]:
|
|
for line in group.strip().splitlines():
|
|
if line.strip().startswith("|"):
|
|
parsed = _parse_row(line)
|
|
if parsed:
|
|
rows.append(parsed)
|
|
|
|
# Single group fallback: first parsed line = header, rest = rows
|
|
if len(groups) == 1 and not rows:
|
|
all_parsed = [
|
|
_parse_row(line)
|
|
for line in groups[0].strip().splitlines()
|
|
if line.strip().startswith("|")
|
|
]
|
|
all_parsed = [r for r in all_parsed if r]
|
|
if len(all_parsed) > 1:
|
|
headers, rows = all_parsed[0], all_parsed[1:]
|
|
elif all_parsed:
|
|
headers = all_parsed[0]
|
|
|
|
return {"headers": headers, "rows": rows} if headers or rows else None
|
|
|
|
# ------------------------------------------------------------------
|
|
# Admonition extraction
|
|
# ------------------------------------------------------------------
|
|
|
|
def _extract_admonitions(self, text: str) -> list[dict]:
|
|
"""Extract NOTE/TIP/WARNING/IMPORTANT/CAUTION admonitions."""
|
|
admonitions: list[dict] = []
|
|
seen: set[str] = set()
|
|
for pattern in (RE_ADMONITION_BLOCK, RE_ADMONITION_PARA):
|
|
for m in pattern.finditer(text):
|
|
adm_type, adm_text = m.group(1), m.group(2).strip()
|
|
if adm_text and adm_text not in seen:
|
|
admonitions.append({"type": adm_type, "text": adm_text})
|
|
seen.add(adm_text)
|
|
return admonitions
|
|
|
|
# ------------------------------------------------------------------
|
|
# Include directive extraction
|
|
# ------------------------------------------------------------------
|
|
|
|
@staticmethod
|
|
def _extract_includes(text: str) -> list[dict]:
|
|
"""Detect remaining ``include::`` directives in text."""
|
|
return [
|
|
{"path": m.group(1).strip(), "options": m.group(2).strip()}
|
|
for m in RE_INCLUDE.finditer(text)
|
|
]
|
|
|
|
# ------------------------------------------------------------------
|
|
# AsciiDoc → Markdown conversion
|
|
# ------------------------------------------------------------------
|
|
|
|
def _convert_to_markdown(self, text: str) -> str:
|
|
"""Convert AsciiDoc inline formatting to Markdown equivalents."""
|
|
result = text
|
|
# Remove processed block delimiters and attribute lines
|
|
for pat in (
|
|
RE_LISTING_DELIM,
|
|
RE_LITERAL_DELIM,
|
|
RE_TABLE_DELIM,
|
|
RE_SOURCE_ATTR,
|
|
RE_ATTRIBUTE,
|
|
):
|
|
result = pat.sub("", result)
|
|
# Remove admonition block markers and delimiters
|
|
result = re.sub(
|
|
r"^\[(NOTE|TIP|WARNING|IMPORTANT|CAUTION)\]\s*$", "", result, flags=re.MULTILINE
|
|
)
|
|
result = re.sub(r"^={4,}$", "", result, flags=re.MULTILINE)
|
|
# Headings: = Title → # Title
|
|
result = RE_HEADING.sub(lambda m: f"{'#' * len(m.group(1))} {m.group(2).strip()}", result)
|
|
# Inline formatting
|
|
result = RE_BOLD.sub(r"**\1**", result)
|
|
result = RE_ITALIC.sub(r"*\1*", result)
|
|
result = RE_LINK.sub(r"[\2](\1)", result)
|
|
result = RE_XREF.sub(lambda m: f"*{m.group(2) or m.group(1)}*", result)
|
|
# Lists: * item → - item, . item → 1. item
|
|
result = re.sub(
|
|
r"^(\*{1,5})\s+",
|
|
lambda m: " " * (len(m.group(1)) - 1) + "- ",
|
|
result,
|
|
flags=re.MULTILINE,
|
|
)
|
|
result = re.sub(
|
|
r"^(\.{1,5})\s+",
|
|
lambda m: " " * (len(m.group(1)) - 1) + "1. ",
|
|
result,
|
|
flags=re.MULTILINE,
|
|
)
|
|
# Block titles: .Title → **Title**
|
|
result = re.sub(r"^\.([A-Z][\w\s]+)$", r"**\1**", result, flags=re.MULTILINE)
|
|
# Include comments
|
|
result = re.sub(
|
|
r"^//\s*include::(.+?)\[\].*$", r"*(included: \1)*", result, flags=re.MULTILINE
|
|
)
|
|
# Remove leftover table cell markers
|
|
result = re.sub(r"^\|\s*", "", result, flags=re.MULTILINE)
|
|
# Collapse blank lines
|
|
result = re.sub(r"\n{3,}", "\n\n", result)
|
|
return result.strip()
|
|
|
|
# ------------------------------------------------------------------
|
|
# Load / categorize / build
|
|
# ------------------------------------------------------------------
|
|
|
|
def load_extracted_data(self, json_path: str) -> bool:
|
|
"""Load previously extracted data from JSON file."""
|
|
print(f"\n📂 Loading extracted data from: {json_path}")
|
|
with open(json_path, encoding="utf-8") as f:
|
|
self.extracted_data = json.load(f)
|
|
total = self.extracted_data.get("total_sections", len(self.extracted_data.get("pages", [])))
|
|
print(f"✅ Loaded {total} sections")
|
|
return True
|
|
|
|
def categorize_content(self) -> dict:
|
|
"""Categorize sections by source file, headings, or keywords."""
|
|
print("\n📋 Categorizing content...")
|
|
categorized: dict[str, dict] = {}
|
|
sections = self.extracted_data.get("pages", [])
|
|
path = Path(self.asciidoc_path) if self.asciidoc_path else None
|
|
|
|
if path and path.is_file():
|
|
key = self._sanitize_filename(path.stem)
|
|
categorized[key] = {"title": path.stem, "pages": sections}
|
|
print(f"✅ Created 1 category (single file): {path.stem}: {len(sections)} sections")
|
|
return categorized
|
|
|
|
if path and path.is_dir():
|
|
for s in sections:
|
|
src_stem = Path(s.get("source_file", "unknown")).stem
|
|
key = self._sanitize_filename(src_stem)
|
|
categorized.setdefault(key, {"title": src_stem, "pages": []})["pages"].append(s)
|
|
if categorized:
|
|
print(f"✅ Created {len(categorized)} categories (by source file)")
|
|
for cat in categorized.values():
|
|
print(f" - {cat['title']}: {len(cat['pages'])} sections")
|
|
return categorized
|
|
|
|
if self.categories:
|
|
first_val = next(iter(self.categories.values()), None)
|
|
if isinstance(first_val, list) and first_val and isinstance(first_val[0], dict):
|
|
for k, pages in self.categories.items():
|
|
categorized[k] = {"title": k.replace("_", " ").title(), "pages": pages}
|
|
else:
|
|
for k in self.categories:
|
|
categorized[k] = {"title": k.replace("_", " ").title(), "pages": []}
|
|
for s in sections:
|
|
txt = s.get("text", "").lower()
|
|
htxt = s.get("heading", "").lower()
|
|
scores = {
|
|
k: sum(
|
|
1
|
|
for kw in kws
|
|
if isinstance(kw, str) and (kw.lower() in txt or kw.lower() in htxt)
|
|
)
|
|
for k, kws in self.categories.items()
|
|
if isinstance(kws, list)
|
|
}
|
|
scores = {k: v for k, v in scores.items() if v > 0}
|
|
if scores:
|
|
categorized[max(scores, key=scores.get)]["pages"].append(s)
|
|
else:
|
|
categorized.setdefault("other", {"title": "Other", "pages": []})[
|
|
"pages"
|
|
].append(s)
|
|
else:
|
|
categorized["content"] = {"title": "Content", "pages": sections}
|
|
|
|
print(f"✅ Created {len(categorized)} categories")
|
|
for cat in categorized.values():
|
|
print(f" - {cat['title']}: {len(cat['pages'])} sections")
|
|
return categorized
|
|
|
|
def build_skill(self) -> None:
|
|
"""Build complete skill directory structure."""
|
|
print(f"\n🏗️ Building skill: {self.name}")
|
|
for subdir in ("references", "scripts", "assets"):
|
|
os.makedirs(f"{self.skill_dir}/{subdir}", exist_ok=True)
|
|
|
|
categorized = self.categorize_content()
|
|
print("\n📝 Generating reference files...")
|
|
total_cats = len(categorized)
|
|
for i, (cat_key, cat_data) in enumerate(categorized.items(), 1):
|
|
self._generate_reference_file(cat_key, cat_data, i, total_cats)
|
|
self._generate_index(categorized)
|
|
self._generate_skill_md(categorized)
|
|
print(f"\n✅ Skill built successfully: {self.skill_dir}/")
|
|
print(f"\n📦 Next step: Package with: skill-seekers package {self.skill_dir}/")
|
|
|
|
# ------------------------------------------------------------------
|
|
# Private generation methods
|
|
# ------------------------------------------------------------------
|
|
|
|
def _ref_filename(self, cat_data: dict, section_num: int, total: int) -> str:
|
|
"""Compute reference file path for a category."""
|
|
sections = cat_data["pages"]
|
|
adoc_base = ""
|
|
if self.asciidoc_path:
|
|
p = Path(self.asciidoc_path)
|
|
adoc_base = p.stem if p.is_file() else ""
|
|
|
|
if sections:
|
|
nums = [s.get("section_number", i + 1) for i, s in enumerate(sections)]
|
|
if total == 1:
|
|
return f"{self.skill_dir}/references/{adoc_base or 'main'}.md"
|
|
base = adoc_base or "section"
|
|
return f"{self.skill_dir}/references/{base}_s{min(nums)}-s{max(nums)}.md"
|
|
return f"{self.skill_dir}/references/section_{section_num:02d}.md"
|
|
|
|
def _generate_reference_file(
|
|
self, _cat_key: str, cat_data: dict, section_num: int, total: int
|
|
) -> None:
|
|
"""Generate a reference Markdown file for one category."""
|
|
filename = self._ref_filename(cat_data, section_num, total)
|
|
with open(filename, "w", encoding="utf-8") as f:
|
|
f.write(f"# {cat_data['title']}\n\n")
|
|
for section in cat_data["pages"]:
|
|
sec_num = section.get("section_number", "?")
|
|
heading = section.get("heading", "")
|
|
hl = section.get("heading_level", "h1")
|
|
f.write(f"---\n\n**📄 Source: Section {sec_num}**\n\n")
|
|
if heading:
|
|
f.write(f"{'#' * (int(hl[1]) + 1)} {heading}\n\n")
|
|
for sub in section.get("headings", []):
|
|
sl = sub.get("level", "h3")
|
|
if sub.get("text"):
|
|
f.write(f"{'#' * (int(sl[1]) + 1)} {sub['text']}\n\n")
|
|
if section.get("text"):
|
|
f.write(f"{section['text']}\n\n")
|
|
if section.get("code_samples"):
|
|
f.write("### Code Examples\n\n")
|
|
for c in section["code_samples"]:
|
|
f.write(f"```{c.get('language', '')}\n{c['code']}\n```\n\n")
|
|
if section.get("tables"):
|
|
f.write("### Tables\n\n")
|
|
for t in section["tables"]:
|
|
hdrs = t.get("headers", [])
|
|
if hdrs:
|
|
f.write("| " + " | ".join(str(h) for h in hdrs) + " |\n")
|
|
f.write("| " + " | ".join("---" for _ in hdrs) + " |\n")
|
|
for row in t.get("rows", []):
|
|
f.write("| " + " | ".join(str(c) for c in row) + " |\n")
|
|
f.write("\n")
|
|
if section.get("admonitions"):
|
|
f.write("### Notes & Warnings\n\n")
|
|
for a in section["admonitions"]:
|
|
f.write(f"> **{a.get('type', 'NOTE')}:** {a.get('text', '')}\n\n")
|
|
f.write("---\n\n")
|
|
print(f" Generated: {filename}")
|
|
|
|
def _generate_index(self, categorized: dict) -> None:
|
|
"""Generate references/index.md."""
|
|
filename = f"{self.skill_dir}/references/index.md"
|
|
adoc_base = ""
|
|
if self.asciidoc_path:
|
|
p = Path(self.asciidoc_path)
|
|
adoc_base = p.stem if p.is_file() else ""
|
|
total = len(categorized)
|
|
|
|
with open(filename, "w", encoding="utf-8") as f:
|
|
f.write(f"# {self.name.title()} Documentation Reference\n\n## Categories\n\n")
|
|
for i, (_k, cd) in enumerate(categorized.items(), 1):
|
|
pages = cd["pages"]
|
|
cnt = len(pages)
|
|
if pages:
|
|
nums = [s.get("section_number", j + 1) for j, s in enumerate(pages)]
|
|
rng = f"Sections {min(nums)}-{max(nums)}"
|
|
if total == 1:
|
|
lf = f"{adoc_base or 'main'}.md"
|
|
else:
|
|
lf = f"{adoc_base or 'section'}_s{min(nums)}-s{max(nums)}.md"
|
|
else:
|
|
lf, rng = f"section_{i:02d}.md", "N/A"
|
|
f.write(f"- [{cd['title']}]({lf}) ({cnt} sections, {rng})\n")
|
|
|
|
f.write("\n## Statistics\n\n")
|
|
for key, label in [
|
|
("total_sections", "Total sections"),
|
|
("total_code_blocks", "Code blocks"),
|
|
("total_tables", "Tables"),
|
|
("total_admonitions", "Admonitions"),
|
|
("total_files", "Source files"),
|
|
]:
|
|
f.write(f"- {label}: {self.extracted_data.get(key, 0)}\n")
|
|
meta = self.extracted_data.get("metadata", {})
|
|
if meta.get("author"):
|
|
f.write(f"- Author: {meta['author']}\n")
|
|
if meta.get("date"):
|
|
f.write(f"- Date: {meta['date']}\n")
|
|
print(f" Generated: {filename}")
|
|
|
|
def _generate_skill_md(self, categorized: dict) -> None:
|
|
"""Generate main SKILL.md file with rich summary content."""
|
|
filename = f"{self.skill_dir}/SKILL.md"
|
|
skill_name = self.name.lower().replace("_", "-").replace(" ", "-")[:64]
|
|
desc = self.description[:1024]
|
|
ed = self.extracted_data # shorthand
|
|
|
|
with open(filename, "w", encoding="utf-8") as f:
|
|
f.write(f"---\nname: {skill_name}\ndescription: {desc}\n---\n\n")
|
|
f.write(f"# {self.name.title()} Documentation Skill\n\n{self.description}\n\n")
|
|
|
|
# Document metadata
|
|
meta = ed.get("metadata", {})
|
|
if any(v for v in meta.values() if v):
|
|
f.write("## 📋 Document Information\n\n")
|
|
for key, label in [
|
|
("title", "Title"),
|
|
("author", "Author"),
|
|
("revision", "Revision"),
|
|
("date", "Date"),
|
|
("description", "Description"),
|
|
]:
|
|
if meta.get(key):
|
|
f.write(f"**{label}:** {meta[key]}\n\n")
|
|
|
|
f.write("## 💡 When to Use This Skill\n\nUse this skill when you need to:\n")
|
|
f.write(f"- Understand {self.name} concepts and fundamentals\n")
|
|
f.write("- Look up API references and technical specifications\n")
|
|
f.write("- Find code examples and implementation patterns\n")
|
|
f.write("- Review tutorials, guides, and best practices\n")
|
|
f.write("- Explore the complete documentation structure\n\n")
|
|
|
|
# Section Overview
|
|
f.write(
|
|
f"## 📖 Section Overview\n\n**Total Sections:** {ed.get('total_sections', 0)}\n\n"
|
|
)
|
|
f.write("**Content Breakdown:**\n\n")
|
|
for cd in categorized.values():
|
|
f.write(f"- **{cd['title']}**: {len(cd['pages'])} sections\n")
|
|
f.write("\n")
|
|
|
|
f.write(self._format_key_concepts())
|
|
f.write("## ⚡ Quick Reference\n\n")
|
|
f.write(self._format_patterns_from_content())
|
|
|
|
# Code examples (top 15 grouped by language)
|
|
all_code = [c for s in ed.get("pages", []) for c in s.get("code_samples", [])]
|
|
all_code.sort(key=lambda x: x.get("quality_score", 0), reverse=True)
|
|
if all_code[:15]:
|
|
f.write("## 📝 Code Examples\n\n*High-quality examples from documentation*\n\n")
|
|
by_lang: dict[str, list] = {}
|
|
for c in all_code[:15]:
|
|
by_lang.setdefault(c.get("language", "unknown"), []).append(c)
|
|
for lang in sorted(by_lang):
|
|
exs = by_lang[lang]
|
|
f.write(f"### {lang.title()} Examples ({len(exs)})\n\n")
|
|
for i, c in enumerate(exs[:5], 1):
|
|
ct = c.get("code", "")
|
|
f.write(
|
|
f"**Example {i}** (Quality: {c.get('quality_score', 0):.1f}/10):\n\n"
|
|
)
|
|
f.write(f"```{lang}\n{ct[:500]}{'...' if len(ct) > 500 else ''}\n```\n\n")
|
|
|
|
# Table summary
|
|
all_tables = [
|
|
(s.get("heading", ""), t) for s in ed.get("pages", []) for t in s.get("tables", [])
|
|
]
|
|
if all_tables:
|
|
f.write(f"## 📊 Table Summary\n\n*{len(all_tables)} table(s) found*\n\n")
|
|
for sh, t in all_tables[:5]:
|
|
if sh:
|
|
f.write(f"**From section: {sh}**\n\n")
|
|
hdrs = t.get("headers", [])
|
|
if hdrs:
|
|
f.write("| " + " | ".join(str(h) for h in hdrs) + " |\n")
|
|
f.write("| " + " | ".join("---" for _ in hdrs) + " |\n")
|
|
for row in t.get("rows", [])[:5]:
|
|
f.write("| " + " | ".join(str(c) for c in row) + " |\n")
|
|
f.write("\n")
|
|
|
|
# Admonition summary
|
|
all_adm = [a for s in ed.get("pages", []) for a in s.get("admonitions", [])]
|
|
if all_adm:
|
|
f.write("## ⚠️ Admonition Summary\n\n")
|
|
by_type: dict[str, list[str]] = {}
|
|
for a in all_adm:
|
|
by_type.setdefault(a.get("type", "NOTE"), []).append(a.get("text", ""))
|
|
for at in sorted(by_type):
|
|
items = by_type[at]
|
|
f.write(f"**{at}** ({len(items)}):\n\n")
|
|
for txt in items[:5]:
|
|
f.write(f"> {txt[:120]}{'...' if len(txt) > 120 else ''}\n\n")
|
|
|
|
# Statistics
|
|
f.write("## 📊 Documentation Statistics\n\n")
|
|
for key, label in [
|
|
("total_sections", "Total Sections"),
|
|
("total_code_blocks", "Code Blocks"),
|
|
("total_tables", "Tables"),
|
|
("total_admonitions", "Admonitions"),
|
|
("total_files", "Source Files"),
|
|
]:
|
|
f.write(f"- **{label}**: {ed.get(key, 0)}\n")
|
|
langs = ed.get("languages_detected", {})
|
|
if langs:
|
|
f.write(f"- **Programming Languages**: {len(langs)}\n\n**Language Breakdown:**\n\n")
|
|
for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
|
|
f.write(f"- {lang}: {count} examples\n")
|
|
f.write("\n")
|
|
|
|
# Navigation
|
|
f.write("## 🗺️ Navigation\n\n**Reference Files:**\n\n")
|
|
for cd in categorized.values():
|
|
cf = self._sanitize_filename(cd["title"])
|
|
f.write(f"- `references/{cf}.md` - {cd['title']}\n")
|
|
f.write("\nSee `references/index.md` for complete documentation structure.\n\n")
|
|
f.write("---\n\n**Generated by Skill Seeker** | AsciiDoc Scraper\n")
|
|
|
|
with open(filename, encoding="utf-8") as f:
|
|
print(f" Generated: {filename} ({len(f.read().splitlines())} lines)")
|
|
|
|
# ------------------------------------------------------------------
|
|
# Content analysis helpers
|
|
# ------------------------------------------------------------------
|
|
|
|
def _format_key_concepts(self) -> str:
|
|
"""Extract key concepts from headings across all sections."""
|
|
all_h: list[tuple[str, str]] = []
|
|
for s in self.extracted_data.get("pages", []):
|
|
h = s.get("heading", "").strip()
|
|
if h and len(h) > 3:
|
|
all_h.append((s.get("heading_level", "h1"), h))
|
|
for sub in s.get("headings", []):
|
|
t = sub.get("text", "").strip()
|
|
if t and len(t) > 3:
|
|
all_h.append((sub.get("level", "h3"), t))
|
|
if not all_h:
|
|
return ""
|
|
content = "## 🔑 Key Concepts\n\n*Main topics covered in this documentation*\n\n"
|
|
h1s = [t for lv, t in all_h if lv == "h1"]
|
|
h2s = [t for lv, t in all_h if lv == "h2"]
|
|
if h1s:
|
|
content += "**Major Topics:**\n\n" + "".join(f"- {h}\n" for h in h1s[:10]) + "\n"
|
|
if h2s:
|
|
content += "**Subtopics:**\n\n" + "".join(f"- {h}\n" for h in h2s[:15]) + "\n"
|
|
return content
|
|
|
|
def _format_patterns_from_content(self) -> str:
|
|
"""Extract common documentation patterns from section headings."""
|
|
keywords = [
|
|
"getting started",
|
|
"installation",
|
|
"configuration",
|
|
"usage",
|
|
"api",
|
|
"examples",
|
|
"tutorial",
|
|
"guide",
|
|
"best practices",
|
|
"troubleshooting",
|
|
"faq",
|
|
]
|
|
patterns: list[dict] = []
|
|
for s in self.extracted_data.get("pages", []):
|
|
ht = s.get("heading", "").lower()
|
|
for kw in keywords:
|
|
if kw in ht:
|
|
patterns.append(
|
|
{
|
|
"type": kw.title(),
|
|
"heading": s.get("heading", ""),
|
|
"section": s.get("section_number", 0),
|
|
}
|
|
)
|
|
break
|
|
if not patterns:
|
|
return "*See reference files for detailed content*\n\n"
|
|
by_type: dict[str, list] = {}
|
|
for p in patterns:
|
|
by_type.setdefault(p["type"], []).append(p)
|
|
content = "*Common documentation patterns found:*\n\n"
|
|
for pt in sorted(by_type):
|
|
items = by_type[pt]
|
|
content += f"**{pt}** ({len(items)} sections):\n"
|
|
content += "".join(f"- {it['heading']} (section {it['section']})\n" for it in items[:3])
|
|
content += "\n"
|
|
return content
|
|
|
|
# ------------------------------------------------------------------
|
|
# Utilities
|
|
# ------------------------------------------------------------------
|
|
|
|
@staticmethod
|
|
def _sanitize_filename(name: str) -> str:
|
|
"""Convert name to a safe filename slug."""
|
|
safe = re.sub(r"[^\w\s-]", "", name.lower())
|
|
return re.sub(r"[-\s]+", "_", safe)
|
|
|
|
@staticmethod
|
|
def _in_range(pos: int, ranges: list[tuple[int, int]]) -> bool:
|
|
"""Check whether pos falls within any consumed range."""
|
|
return any(s <= pos < e for s, e in ranges)
|