* feat: fix unified scraper pipeline gaps, add multi-agent support, and Unity skill configs

Fix multiple bugs in the unified scraper pipeline discovered while creating Unity skills (Spine, Addressables, DOTween):

- Fix doc scraper KeyError by passing base_url in temp config
- Fix scraped_data list-vs-dict bug in detect_conflicts() and merge_sources()
- Add Phase 6 auto-enhancement from config "enhancement" block (LOCAL + API mode)
- Add "browser": true config support for JavaScript SPA documentation sites
- Add Phase 3 skip message for better UX
- Add subprocess timeout (3600s) for doc scraper
- Fix SkillEnhancer missing skill_dir argument in API mode
- Fix browser renderer defaults (60s timeout, domcontentloaded wait condition)
- Fix C3.x JSON filename mismatch (design_patterns.json → all_patterns.json)
- Fix workflow builtin target handling when no pattern data available
- Make AI enhancement timeout configurable via SKILL_SEEKER_ENHANCE_TIMEOUT env var (300s default)
- Add C#, Go, Rust, Swift, Ruby, PHP, GDScript to GitHub scraper extension map
- Add multi-agent LOCAL mode support across all 17 scrapers (--agent flag)
- Add Kimi/Moonshot platform support (API keys, agent presets, config wizard)
- Add unity-game-dev.yaml workflow (7 stages covering Unity-specific patterns)
- Add 3 Unity skill configs (Spine, Addressables, DOTween)
- Add comprehensive Claude bias audit report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: create AgentClient abstraction, remove hardcoded Claude from 5 enhancers (#334)

Phase 1 of the full agent-agnostic refactor. Creates a centralized AgentClient that all enhancers use instead of hardcoded subprocess calls and model names.

New file:

- agent_client.py: Unified AI client supporting API mode (Anthropic, Moonshot, Google, OpenAI) and LOCAL mode (Claude Code, Kimi, Codex, Copilot, OpenCode, custom agents). Provides detect_api_key(), get_model(), detect_default_target().
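The detection helpers just listed can be pictured with a minimal sketch. The env-var-to-platform mapping and the resolution order here are assumptions for illustration, not the shipped AgentClient implementation:

```python
import os

# Hypothetical sketch: map provider API-key env vars to platform names.
# The provider list mirrors the ones named in the commit message; the
# exact precedence inside AgentClient is an assumption.
_KEY_TO_PLATFORM = {
    "ANTHROPIC_API_KEY": "anthropic",
    "MOONSHOT_API_KEY": "moonshot",
    "GOOGLE_API_KEY": "google",
    "OPENAI_API_KEY": "openai",
}

def detect_default_target(env=os.environ):
    """Return the first platform whose API key is set, else None (LOCAL mode)."""
    for var, platform in _KEY_TO_PLATFORM.items():
        if env.get(var):
            return platform
    return None
```

With this shape, `default="claude"` in argument parsers can become `default=detect_default_target() or "claude"`, which is roughly the auto-detection behavior the Phase 2 commit below describes.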
Refactored (removed all hardcoded ["claude", ...] subprocess calls):

- ai_enhancer.py: -140 lines, delegates to AgentClient
- config_enhancer.py: -150 lines, removed _run_claude_cli()
- guide_enhancer.py: -120 lines, removed _check_claude_cli(), _call_claude_*()
- unified_enhancer.py: -100 lines, removed _check_claude_cli(), _call_claude_*()
- codebase_scraper.py: collapsed 3 functions into 1 using AgentClient

Fixed:

- utils.py: has_api_key()/get_api_key() now check all providers
- enhance_skill.py, video_scraper.py, video_visual.py: model names configurable via ANTHROPIC_MODEL env var
- enhancement_workflow.py: uses call() with _call_claude() fallback

Net: -153 lines of code while adding full multi-agent support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 2 agent-agnostic refactor — defaults, help text, merge mode, MCP (#334)

Phase 2 of the full agent-agnostic refactor:

Default targets:

- Changed default="claude" to auto-detect from API keys in 5 argument files and 3 CLI scripts (install_skill, upload_skill, enhance_skill)
- Added AgentClient.detect_default_target() for runtime resolution
- MCP server functions now use "auto" default with runtime detection

Help text (16+ argument files):

- Replaced "ANTHROPIC_API_KEY" / "Claude Code" with agent-neutral wording
- Now mentions all API keys (ANTHROPIC, MOONSHOT, etc.) and "AI coding agent"

Log messages:

- main.py, enhance_command.py: "Claude Code CLI" → dynamic agent name
- enhance_command.py docstring: "Claude Code" → "AI coding agent"

Merge mode rename:

- Added "ai-enhanced" as the preferred merge mode name
- "claude-enhanced" kept as backward-compatible alias
- Renamed ClaudeEnhancedMerger → AIEnhancedMerger (with alias)
- Updated choices, validators, and descriptions

MCP server descriptions:

- server_fastmcp.py: "Claude AI skills" → "LLM skills" in tool descriptions
- packaging_tools.py: Updated defaults and dry-run messages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 3 agent-agnostic refactor — docstrings, MCP descriptions, README (#334)

Phase 3 of the full agent-agnostic refactor:

Module docstrings (17+ scraper files):

- "Claude Skill Converter" → "AI Skill Converter"
- "Build Claude skill" → "Build AI/LLM skill"
- "Asking Claude" → "Asking AI"
- Updated doc_scraper, github_scraper, pdf_scraper, word_scraper, epub_scraper, video_scraper, enhance_skill, enhance_skill_local, unified_scraper, and others

MCP server_legacy.py (30+ fixes):

- All tool descriptions: "Claude skill" → "LLM skill"
- "Upload to Claude" → "Upload skill"
- "enhance with Claude Code" → "enhance with AI agent"
- Kept claude.ai/skills URLs (platform-specific, correct)

MCP README.md:

- Added multi-agent support note at top
- "Claude AI skills" → "LLM skills" throughout
- Updated examples to show multi-platform usage
- Kept Claude Code in supported agents list (accurate)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 3 continued — remaining docstring and comment fixes (#334)

Additional agent-neutral text fixes in 8 files missed from the initial Phase 3 commit:

- config_extractor.py, config_manager.py, constants.py: comments
- enhance_command.py: docstring and print messages
- guide_enhancer.py: class/module docstrings
- parsers/enhance_parser.py, install_parser.py: help text
- signal_flow_analyzer.py: docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* workflow added

* fix: address code review issues in AgentClient and Phase 6 (#334)

Fixes found during commit review:

1. AgentClient._call_local: Only append "Write your response to:" when caller explicitly passes output_file (was always appending)
2. Codex agent: Added uses_stdin flag to preset, pipe prompt via stdin instead of DEVNULL (codex reads from stdin with "-" arg)
3. Provider detection: Added _detect_provider_from_key() to detect provider from API key prefix (sk-ant- → anthropic, AIza → google) instead of always assuming anthropic
4. Phase 6 API mode: Replaced direct SkillEnhancer/ANTHROPIC_API_KEY with AgentClient for multi-provider support (Moonshot, Google, OpenAI)
5. config_enhancer: Removed output_file path from prompt — AgentClient manages temp files and output detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make claude adaptor model name configurable via ANTHROPIC_MODEL env var

Missed in the Phase 1 refactor — adaptors/claude.py:381 had a hardcoded model name without the os.environ.get() wrapper that all other files use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add copilot stdin support, custom agent, and kimi aliases (#334)

Additional agent improvements from Kimi review:

- Added uses_stdin: True to copilot agent preset (reads from stdin like codex)
- Added custom agent support via SKILL_SEEKER_AGENT_CMD env var in _call_local()
- Added kimi_code/kimi-code aliases in normalize_agent_name()
- Added "kimi" to --target choices in enhance arguments
- Updated help text with MOONSHOT_API_KEY across argument files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Kimi CLI integration — add uses_stdin and output parsing (#334)

Kimi CLI's --print mode requires stdin piping and outputs structured protocol messages (TurnBegin, TextPart, etc.) instead of plain text.

Fixes:

- Added uses_stdin: True to kimi preset (was not piping prompt)
- Added parse_output: "kimi" flag to preset
- Added _parse_kimi_output() to extract text from TextPart lines
- Kimi now returns clean text instead of raw protocol dump

Tested: kimi returns '{"status": "ok"}' correctly via AgentClient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Kimi CLI in enhance_skill_local — remove wrong skip-permissions, use absolute path

Two bugs in enhance_skill_local.py AGENT_PRESETS for Kimi:

1. supports_skip_permissions was True — Kimi doesn't support --dangerously-skip-permissions, only Claude does. Fixed to False.
2. {skill_dir} was resolved as relative path — Kimi CLI requires absolute paths for --work-dir. Fixed with .resolve().

Tested: `skill-seekers enhance output/test-e2e/ --agent kimi` now works end-to-end (107s, 9233 bytes output).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove invalid --enhance-level flag from enhance subprocess calls

doc_scraper.py and video_scraper.py were passing --enhance-level to skill-seekers-enhance, which doesn't accept that flag. This caused enhancement to fail silently after scraping completed.

Fixes:

- Removed --enhance-level from enhance subprocess calls
- Added --agent passthrough in doc_scraper.py
- Fixed log messages to show correct command

Tested: `skill-seekers create <url> --enhance-level 1` now chains scrape → enhance successfully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add --agent and --agent-cmd to create command UNIVERSAL_ARGUMENTS

The --agent flag was defined in common.py but not imported into the create command's UNIVERSAL_ARGUMENTS, so it wasn't available when using `skill-seekers create <source> --agent kimi`. Now all 17 source types support the --agent flag via the create command.
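The Kimi protocol parsing described above can be sketched as follows. The exact wire format of Kimi CLI's --print output is an assumption here (one JSON object per line with "type" and "text" fields); the real _parse_kimi_output() may handle a different message shape:

```python
import json

def parse_kimi_output(raw: str) -> str:
    """Extract plain text from Kimi CLI protocol output (sketch).

    Assumes each line is a JSON object with a "type" field and that
    TextPart messages carry their payload under "text" -- the actual
    Kimi wire format may differ.
    """
    parts = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON control lines
        if msg.get("type") == "TextPart":
            parts.append(msg.get("text", ""))
    return "".join(parts)
```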
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update docs data_file path after moving to cache directory

The scraped_data["documentation"] stored the original output/ path for data_file, but the directory was moved to .skillseeker-cache/ afterward. Phase 2 conflict detection then failed with FileNotFoundError trying to read the old path. Now updates data_file to point to the cache location after the move.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: multi-language code signature extraction in GitHub scraper

The GitHub scraper only analyzed files matching the primary language (by bytes). For multi-language repos like spine-runtimes (C++ primary but C# is the target), this meant 0 C# files were analyzed.

Fix: Analyze top 3 languages with known extension mappings instead of just the primary. Also support "language" field in config source to explicitly target specific languages (e.g., "language": "C#"). Updated Unity configs to specify language: "C#" for focused analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: per-file language detection + remove artificial analysis limits

Rewrites GitHub scraper's _extract_signatures_and_tests() to detect language per-file from extension instead of only analyzing the primary language. This fixes multi-language repos like spine-runtimes (C++ primary) where C# files were never analyzed.
Changes:

- Build reverse ext→language map, detect language per-file
- Analyze ALL files with known extensions (not just primary language)
- Config "language" field works as optional filter, not a workaround
- Store per-file language + languages_analyzed in output
- Remove 50-file API mode limit (rate limiting already handles this)
- Remove 100-file default config extraction limit (now unlimited by default)
- Fix unified scraper default max_pages from 100 to 500 (matches constants.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove remaining 100-file limit in config_extractor.extract_from_directory

The find_config_files default was changed to unlimited but extract_from_directory and CLI --max-files still defaulted to 100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: replace interactive terminal merge with automated AgentClient call

AIEnhancedMerger._launch_claude_merge() used to open a terminal window, run a bash script, and poll for a file — requiring manual interaction. Now uses AgentClient.call() to send the merge prompt directly and parse the JSON response. Fully automated, no terminal needed, works with any configured AI agent (Claude, Kimi, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add marketplace pipeline for publishing skills to Claude Code plugin repos

Connect the three-repo pipeline: configs repo → Skill Seekers engine → plugin marketplace repos. Enables automated publishing of generated skills directly into Claude Code plugin repositories with proper plugin.json and marketplace.json structure.

New components:

- MarketplaceManager: Registry for plugin marketplace repos at ~/.skill-seekers/marketplaces.json with per-repo git tokens, branch config, and default author metadata
- MarketplacePublisher: Clones marketplace repo, creates plugin directory structure (skills/, .claude-plugin/plugin.json), updates marketplace.json, commits and pushes.
  Includes skill_name validation to prevent path traversal, and cleanup of partial state on git failures
- 4 MCP tools: add_marketplace, list_marketplaces, remove_marketplace, publish_to_marketplace — registered in FastMCP server
- Phase 6 in install workflow: automatic marketplace publishing after packaging, triggered by --marketplace CLI arg or marketplace_targets config field

CLI additions:

- --marketplace NAME: publish to registered marketplace after packaging
- --marketplace-category CAT: plugin category (default: development)
- --create-branch: create feature branch instead of committing to main

Security:

- Skill name regex validation (^[a-zA-Z0-9][a-zA-Z0-9._-]*$) prevents path traversal attacks via malicious SKILL.md frontmatter
- has_api_key variable scoping fix in install workflow summary
- try/finally cleanup of partial plugin directories on publish failure

Config schema:

- Optional marketplace_targets field in config JSON for multi-marketplace auto-publishing: [{"marketplace": "spyke", "category": "development"}]
- Backward compatible — ignored by older versions

Tests: 58 tests (36 manager + 22 publisher including 2 integration tests using file:// git protocol for full publish success path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: thread agent selection through entire enhancement pipeline

Propagates the --agent and --agent-cmd CLI parameters through all enhancement components so users can use any supported coding agent (kimi, claude, copilot, codex, opencode) consistently across the full pipeline, not just in top-level enhancement.
Agent parameter threading:

- AIEnhancer: accepts agent param, passes to AgentClient
- ConfigEnhancer: accepts agent param, passes to AgentClient
- WorkflowEngine: accepts agent param, passes to sub-enhancers (PatternEnhancer, TestExampleEnhancer, AIEnhancer)
- ArchitecturalPatternDetector: accepts agent param for AI enhancement
- analyze_codebase(): accepts agent/agent_cmd, forwards to ConfigEnhancer, ArchitecturalPatternDetector, and doc processing
- UnifiedScraper: reads agent from CLI args, forwards to doc scraper subprocess, C3.x analysis, and LOCAL enhancement
- CreateCommand: forwards --agent and --agent-cmd to subprocess argv
- workflow_runner: passes agent to WorkflowEngine for inline/named workflows

Timeout improvements:

- Default enhancement timeout increased from 300s (5min) to 2700s (45min) to accommodate large skill generation with local agents
- New get_default_timeout() in agent_client.py with env var override (SKILL_SEEKER_ENHANCE_TIMEOUT) supporting 'unlimited' value
- Config enhancement block supports "timeout": "unlimited" field
- Removed hardcoded timeout=300 and timeout=600 calls in config_enhancer and merge_sources, now using centralized default

CLI additions (unified_scraper):

- --agent AGENT: select local coding agent for enhancement
- --agent-cmd CMD: override agent command template (advanced)

Config: unity-dotween.json updated with agent=kimi, timeout=unlimited, removed unused file_patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add claude-code unified config for Claude Code CLI skill generation

Unified config combining official Claude Code documentation and source code analysis. Covers internals, architecture, tools, commands, IDE integrations, MCP, plugins, skills, and development workflows.
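The get_default_timeout() behavior described in the timeout improvements above can be sketched as below. The function body is an assumption reconstructed from the commit messages (default 2700s, env override, "unlimited"/"none"/"0" meaning no timeout); only the names and values come from the source:

```python
import os

def get_default_timeout(env=os.environ):
    """Resolve the enhancement timeout (sketch of the described behavior).

    Defaults to 2700s (45 min). SKILL_SEEKER_ENHANCE_TIMEOUT overrides it,
    and "unlimited"/"none"/"0" map to None, i.e. no timeout for the
    subprocess or API call.
    """
    raw = env.get("SKILL_SEEKER_ENHANCE_TIMEOUT", "").strip().lower()
    if not raw:
        return 2700
    if raw in ("unlimited", "none", "0"):
        return None
    return int(raw)
```

Returning None fits subprocess.run(..., timeout=None), which waits indefinitely, so one return type covers both the bounded and the unlimited case.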
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add multi-agent support verification report and test artifacts

- AGENT_SUPPORT_VERIFICATION.md: verification report confirming agent parameter threading works across all enhancement components
- END_TO_END_EXAMPLES.md: complete workflows for all 17 source types with both Claude and Kimi agents
- test_agents.sh: shell script for real-world testing of agent support across major CLI commands with both agents
- test_realworld.md: real-world test scenarios for manual QA

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add .env to .gitignore to prevent secret exposure

The .env file containing API keys (ANTHROPIC_API_KEY, GITHUB_TOKEN, etc.) was not in .gitignore, causing it to appear as untracked and risking accidental commit. Added .env, .env.local, and .env.*.local patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: URL filtering uses base directory instead of full page URL (#331)

is_valid_url() checked url.startswith(self.base_url) where base_url could be a full page path like ".../manual/index.html". Sibling pages like ".../manual/LoadingAssets.html" failed the check because they don't start with ".../index.html".

Now strips the filename to get the directory prefix: "https://example.com/docs/index.html" → "https://example.com/docs/"

This fixes SPA sites like Unity's DocFX docs where browser mode renders the page but sibling links were filtered out.

Closes #331

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: pass language config through to GitHub scraper in unified flow

The unified scraper built github_config from source fields but didn't include the "language" field. The GitHub scraper's per-file detection read self.config.get("language", "") which was always empty, so it fell back to analyzing all languages instead of the focused C# filter.
For DOTween (C# only repo), this caused 0 files analyzed because without the language filter, it analyzed top 3 languages but the file tree matching failed silently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: centralize all enhancement timeouts to 45min default with unlimited support

All enhancement/AI timeouts now use get_default_timeout() from agent_client.py instead of scattered hardcoded values (120s, 300s, 600s).

Default: 2700s (45 minutes)
Override: SKILL_SEEKER_ENHANCE_TIMEOUT env var
Unlimited: Set to "unlimited", "none", or "0"

Updated: agent_client.py, enhance_skill_local.py, arguments/enhance.py, enhance_command.py, unified_enhancer.py, unified_scraper.py

Not changed (different purposes):

- Browser page load timeout (60s)
- API HTTP request timeout (120s)
- Doc scraper subprocess timeout (3600s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add browser_wait_until and browser_extra_wait config for SPA docs

DocFX sites (Unity docs) render navigation via JavaScript after initial page load. With domcontentloaded, only 1 link was found. With networkidle + 5s extra wait, 95 content pages are discovered.

New config options for documentation sources:

- browser_wait_until: "networkidle" | "load" | "domcontentloaded"
- browser_extra_wait: milliseconds to wait after page load for lazy nav

Updated Addressables config to use networkidle + 5000ms extra wait. Pass browser settings through unified scraper to doc scraper config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: three-layer smart discovery engine for SPA documentation sites

Replaces the browser_wait_until/browser_extra_wait config hacks with a proper discovery engine that runs before the BFS crawl loop:

- Layer 1: sitemap.xml — checks domain root for sitemap, parses <loc> tags
- Layer 2: llms.txt — existing mechanism (unchanged)
- Layer 3: SPA nav — renders index page with networkidle via Playwright, extracts all links from the fully-rendered DOM sidebar/TOC

The BFS crawl then uses domcontentloaded (fast) since all pages are already discovered. No config hacks needed — browser mode automatically triggers SPA discovery when only 1 page is found.

Tested: Unity Addressables DocFX site now discovers 95 pages (was 1). Removed browser_wait_until/browser_extra_wait from Addressables config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: replace manual arg forwarding with dynamic routing in create command

The create command manually hardcoded ~60% of scraper flags in _route_*() methods, causing ~40 flags to be silently dropped. Every new flag required edits in 2 places (arguments/create.py + create_command.py), guaranteed to drift.

Replaced with _build_argv() — a dynamic forwarder that iterates vars(self.args) and forwards all explicitly-set arguments automatically, using the same pattern as main.py::_reconstruct_argv(). This eliminates the root cause of all flag gaps.
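The sitemap half of discovery Layer 1 described above boils down to collecting `<loc>` entries from sitemap.xml. A later hardening commit notes the project parses this with defusedxml for XXE protection; the stdlib parser is used here only to keep the sketch self-contained, and the function name is hypothetical:

```python
# Sketch of discovery Layer 1: extract page URLs from a sitemap.xml body.
# The project swaps in defusedxml for XXE protection; stdlib ElementTree is
# shown here so the example runs without extra dependencies.
import xml.etree.ElementTree as ET

_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return the list of <loc> URLs in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(_NS + "loc") if el.text]
```

A caller would fetch `https://<domain>/sitemap.xml`, run it through this, and seed the crawl queue with the result before falling through to llms.txt and SPA-nav discovery.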
Changes in create_command.py (-380 lines, +175 lines = net -205):

- Added _build_argv() dynamic arg forwarder with dest→flag translation map for mismatched names (async_mode→--async, video_playlist→--playlist, skip_config→--skip-config-patterns, workflow_var→--var)
- Added _call_module() helper (dedup sys.argv swap pattern)
- Simplified all _route_*() methods from 50-70 lines to 5-10 lines each
- Deleted _add_common_args() entirely (subsumed by _build_argv)
- _route_generic() now forwards ALL args, not just universal ones

New flags accessible via create command:

- --from-json: build skill from pre-extracted JSON (all source types)
- --skip-api-reference: skip API reference generation (local codebase)
- --skip-dependency-graph: skip dependency analysis (local codebase)
- --skip-config-patterns: skip config pattern extraction (local codebase)
- --no-comments: skip comment extraction (local codebase)
- --depth: analysis depth control (local codebase, deprecated)
- --setup: auto-detect GPU/install video deps (video)

Bug fix in unified_scraper.py:

- Fixed C3.x pattern data loss: unified_scraper read patterns/detected_patterns.json but codebase_scraper writes patterns/all_patterns.json. Changed both read locations (line 828 for local sources, line 1597 for GitHub C3.x) to use the correct filename. This was causing 100% loss of design pattern data (e.g., 905 patterns detected but 0 included in final skill).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address 5 code review issues in marketplace and package pipeline

Fixes found by automated code review of the marketplace feature and package command:

1. --marketplace flag silently ignored in package_skill.py CLI: Added MarketplacePublisher invocation after successful packaging when --marketplace is provided. Previously the flag was parsed but never acted on.
2. Missing 7 platform choices in --target (package.py): Added minimax, opencode, deepseek, qwen, openrouter, together, fireworks to the argparse choices list. These platforms have registered adaptors but were rejected by the argument parser.
3. is_update always True for new marketplace registrations: Two separate datetime.now() calls produced different microsecond timestamps, so added_at never equaled updated_at. Fixed by assigning a single timestamp to both fields.
4. Shallow clone (depth=1) caused push failures for marketplace repos: MarketplacePublisher now does full clones instead of using GitConfigRepo's shallow clone (which is designed for read-only config fetching). Full clone is required for the commit+push workflow.
5. Partial plugin dir not cleaned on force=True failure: Removed the `and not force` guard from cleanup logic — if an operation fails midway, the partial directory should be cleaned regardless of whether force was set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address dynamic routing edge cases in create_command

Fixes from code review of the _build_argv() refactor:

1. Non-None defaults forwarded unconditionally — added enhance_level=2, doc_version="", video_languages="en", whisper_model="base", platform="slack", visual_interval=0.7, visual_min_gap=0.5, visual_similarity=3.0 to the defaults dict so they're only forwarded when the user explicitly overrides them. This fixes video sources incorrectly getting --enhance-level 2 (video default is 0).
2. video_url dest not translated — added "video_url": "--url" to _DEST_TO_FLAG so create correctly forwards --video-url as --url to video_scraper.py.
3. Video positional args double-forwarded — added video_url, video_playlist, video_file to _SKIP_ARGS since _route_video() already handles them via positional args from source detection.
4. Removed dead workflow_var entry from _DEST_TO_FLAG — the create parser uses key "var" not "workflow_var", so the translation was never triggered.
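The dynamic forwarding that these edge-case fixes refine can be sketched as below. This is a simplified illustration, not the real _build_argv(): it uses the default-comparison heuristic, which (as a later hardening commit notes) cannot distinguish an explicit `--enhance-level 2` from the parser default, and it omits allowlists and positional skips:

```python
# Hypothetical sketch: forward parsed args as CLI flags, translating
# mismatched dest names and skipping values that merely echo defaults.
_DEST_TO_FLAG = {"async_mode": "--async", "video_url": "--url"}
_SKIP = {"source"}  # handled as a positional by the router

def build_argv(args, defaults):
    argv = []
    for dest, value in args.items():
        if dest in _SKIP:
            continue
        if value is None or value is False or value == defaults.get(dest):
            continue  # unset, disabled, or just the parser default
        flag = _DEST_TO_FLAG.get(dest, "--" + dest.replace("_", "-"))
        if value is True:
            argv.append(flag)  # store_true flag, no value
        else:
            argv.extend([flag, str(value)])
    return argv
```

The dest→flag map is why `video_url` reaches video_scraper.py as `--url`; everything else falls through to the mechanical underscore-to-hyphen translation.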
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve 15 broken tests and --from-json crash bug in create command

Fixes found by Kimi code review of the dynamic routing refactor:

1. 3 test_create_arguments.py failures — UNIVERSAL_ARGUMENTS count changed from 19 to 21 (added agent, agent_cmd). Updated expected count and name set. Moved from_json out of UNIVERSAL to ADVANCED_ARGUMENTS since not all scrapers support it.
2. 12 test_create_integration_basic.py failures — tests called _add_common_args() which was deleted in the refactor. Rewrote _collect_argv() to use _build_argv() via CreateCommand with SourceDetector. Updated _make_args defaults to match new parameter set.
3. --from-json crash bug — was in UNIVERSAL_ARGUMENTS so create accepted it for all source types, but web/github/local scrapers don't support it. Forwarding it caused argparse "unrecognized arguments" errors. Moved to ADVANCED_ARGUMENTS with documentation listing which source types support it.
4. Additional _is_explicitly_set defaults — added enhance_level=2, doc_version="", video_languages="en", whisper_model="base", platform="slack", visual_interval/min_gap/similarity defaults to prevent unconditional forwarding of parser defaults.
5. Video arg handling — added video_url to _DEST_TO_FLAG translation map, added video_url/video_playlist/video_file to _SKIP_ARGS (handled as positionals by _route_video).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: C3.x analysis data loss — read from references/ after _generate_references() cleanup

Root cause: _generate_references() in codebase_scraper.py copies analysis directories (patterns/, test_examples/, config_patterns/, architecture/, dependencies/, api_reference/) into references/ then DELETES the originals to avoid duplication (Issue #279). But unified_scraper.py reads from the original paths after analyze_codebase() returns — by which time the originals are gone.
This caused 100% data loss for all 6 C3.x data types (design patterns, test examples, config patterns, architecture, dependencies, API reference) in the unified scraper pipeline. The data was correctly detected (e.g., 905 patterns in 510 files) but never made it into the final skill.

Fix: Added _load_json_fallback() method that checks references/{subdir}/ first (where _generate_references moves the data), falling back to the original path. Applied to both GitHub C3.x analysis (line ~1599) and local source analysis (line ~828).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add allowlist to _build_argv for config route to unified_scraper

_build_argv() was forwarding all CLI args (--name, --doc-version, etc.) to unified_scraper which doesn't accept them. Added allowlist parameter to _build_argv() — when provided, ONLY args in the allowlist are forwarded. The config route now uses _UNIFIED_SCRAPER_ARGS allowlist with the exact set of flags unified_scraper accepts.

This is a targeted patch — the proper fix is the ExecutionContext singleton refactor planned separately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add force=True to marketplace publish from package CLI

The package command's --marketplace flag didn't pass force=True to MarketplacePublisher, so re-publishing an existing skill would fail with "already exists" error instead of overwriting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add push_config tool for publishing configs to registered source repos

New ConfigPublisher class that validates configs, places them in the correct category directory, commits, and pushes to registered source repositories. Follows the MarketplacePublisher pattern.
Features:

- Auto-detect category from config name/description
- Validate via ConfigValidator + repo's validate-config.py
- Support feature branch or direct push
- Force overwrite existing configs
- MCP tool: push_config(config_path, source_name, category)

Usage: push_config(config_path="configs/unity-spine.json", source_name="spyke")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: security hardening, error handling, tests, and cleanup

Security:

- Remove command injection via cloned repo script execution (config_publisher)
- Replace git add -A with targeted staging (marketplace_publisher)
- Clear auth tokens from cached .git/config after clone
- Use defusedxml for sitemap XML parsing (XXE protection)
- Add path traversal validation for config names

Error handling:

- AgentClient: specific exception handling for rate limit, auth, connection errors
- AgentClient: log subprocess stderr on non-zero exit, raise on explicit API mode failure
- config_publisher: only catch ValueError for validation warnings

Logic bugs:

- Fix _build_argv silently dropping --enhance-level 2 (matched default)
- Fix URL filtering over-broadening (strip to parent instead of adding /)
- Log warning when _call_module returns None exit code

Tests (134 new):

- test_agent_client.py: 71 tests for normalize, detect, init, timeout, model
- test_config_publisher.py: 23 tests for detect_category, publish, errors
- test_create_integration_basic.py: 20 tests for _build_argv routing
- Fix 11 pre-existing failures (guide_enhancer, doctor, install_skill, marketplace)

Cleanup:

- Remove 5 dev artifact files (-1405 lines)
- Rename _launch_claude_merge to _launch_ai_merge

All 3194 tests pass, 39 expected skips.
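The URL-filtering fix noted above (and introduced in the earlier #331 commit: strip the base URL to its parent directory rather than appending a slash) can be sketched as below. The function names are illustrative; the real is_valid_url() performs additional checks:

```python
from urllib.parse import urlparse

def base_directory(base_url):
    """Strip a trailing filename so sibling pages pass the prefix check."""
    parsed = urlparse(base_url)
    path = parsed.path
    if "/" in path and not path.endswith("/"):
        # ".../manual/index.html" -> ".../manual/"
        path = path.rsplit("/", 1)[0] + "/"
    return f"{parsed.scheme}://{parsed.netloc}{path}"

def is_valid_url(url, base_url):
    """Accept any URL under the base page's directory (sketch)."""
    return url.startswith(base_directory(base_url))
```

With this, ".../manual/LoadingAssets.html" passes against a base of ".../manual/index.html", while URLs outside the docs directory are still rejected.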
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: pin ruff==0.15.8 in CI and reformat packaging_tools.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add missing pytest install to vector DB adaptor test jobs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: reformat 7 files for ruff 0.15.8 and fix vector DB test path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove test-week2-integration job referencing missing script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update e2e test to accept dynamic platform name in upload phase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: YusufKaraaslanSpyke <yusuf@spykegames.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2171 lines
84 KiB
Python
#!/usr/bin/env python3
"""
Confluence Documentation to Skill Converter

Converts Confluence spaces into AI-ready skills by extracting page content,
hierarchy, code blocks, tables, and attachments. Supports two extraction modes:

1. **API mode**: Connects to a Confluence instance via the Atlassian REST API
   (requires ``atlassian-python-api``). Fetches pages from a specified space,
   preserving the parent-child hierarchy. Requires ``--base-url``, ``--space-key``,
   and authentication via ``--username`` / ``--token`` (or env vars).

2. **Export mode**: Parses a Confluence HTML/XML export directory previously
   downloaded from the Confluence admin UI. Requires ``--export-path`` pointing
   to the extracted export directory containing ``entities.xml`` or HTML files.

Usage:
    # API mode
    skill-seekers confluence --base-url https://wiki.example.com \\
        --space-key PROJ --username user@example.com --token $CONFLUENCE_TOKEN \\
        --name my-project-wiki

    # Export mode
    skill-seekers confluence --export-path ./confluence-export/ --name my-wiki

    # Build from previously extracted JSON
    skill-seekers confluence --from-json my-wiki_extracted.json

    # Standalone execution
    python3 -m skill_seekers.cli.confluence_scraper --base-url https://wiki.example.com \\
        --space-key DEV --name dev-wiki --max-pages 200
"""

import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
from typing import Any

# Optional dependency guard for atlassian-python-api
try:
    from atlassian import Confluence

    ATLASSIAN_AVAILABLE = True
except ImportError:
    ATLASSIAN_AVAILABLE = False

# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment, Tag

logger = logging.getLogger(__name__)

# Confluence-specific HTML macro class patterns to strip during cleaning
_CONFLUENCE_MACRO_CLASSES = {
    "confluence-information-macro",
    "confluence-information-macro-body",
    "confluence-information-macro-icon",
    "expand-container",
    "expand-content",
    "expand-control",
    "plugin-tabmeta",
    "plugin_pagetree",
    "page-metadata",
    "aui-message",
}

# Confluence macro element tag names (structured-macro in storage format)
_STORAGE_MACRO_TAGS = {
    "ac:structured-macro",
    "ac:rich-text-body",
    "ac:parameter",
    "ac:plain-text-body",
    "ac:image",
    "ac:link",
    "ac:emoticon",
    "ac:task-list",
    "ac:task",
    "ac:task-body",
    "ac:task-status",
    "ri:attachment",
    "ri:page",
    "ri:space",
    "ri:url",
    "ri:user",
}

# Known Confluence code macro language mappings
_CODE_MACRO_LANGS = {
    "py": "python",
    "python": "python",
    "python3": "python",
    "js": "javascript",
    "javascript": "javascript",
    "ts": "typescript",
    "typescript": "typescript",
    "java": "java",
    "bash": "bash",
    "sh": "bash",
    "shell": "bash",
    "sql": "sql",
    "xml": "xml",
    "html": "html",
    "css": "css",
    "json": "json",
    "yaml": "yaml",
    "yml": "yaml",
    "ruby": "ruby",
    "go": "go",
    "golang": "go",
    "rust": "rust",
    "c": "c",
    "cpp": "cpp",
    "csharp": "csharp",
    "cs": "csharp",
    "kotlin": "kotlin",
    "swift": "swift",
    "scala": "scala",
    "groovy": "groovy",
    "perl": "perl",
    "php": "php",
    "r": "r",
    "powershell": "powershell",
    "dockerfile": "dockerfile",
    "terraform": "hcl",
    "hcl": "hcl",
    "markdown": "markdown",
    "text": "",
    "none": "",
}
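
# Illustrative sanity examples for the mapping above (comments only, not
# executed; the lookups themselves are straight dict.get calls):
#
#     >>> _CODE_MACRO_LANGS.get("golang", "golang")
#     'go'
#     >>> _CODE_MACRO_LANGS.get("text", "text")
#     ''
#
# Unknown languages fall through unchanged via dict.get(lang_raw, lang_raw)
# when code macros are parsed, so e.g. "elixir" stays "elixir".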


def _check_atlassian_deps() -> None:
    """Raise RuntimeError if atlassian-python-api is not installed."""
    if not ATLASSIAN_AVAILABLE:
        raise RuntimeError(
            "atlassian-python-api is required for Confluence API mode.\n"
            "Install with: pip install atlassian-python-api\n"
            'Or: pip install "skill-seekers[confluence]"'
        )


def infer_description_from_confluence(
    space_info: dict | None = None,
    name: str = "",
) -> str:
    """Infer skill description from Confluence space metadata.

    Args:
        space_info: Confluence space metadata dict (name, description, key).
        name: Skill name for fallback.

    Returns:
        Description string suitable for "Use when..." format.
    """
    if space_info:
        desc_text = space_info.get("description", "")
        if isinstance(desc_text, dict):
            # Confluence API returns description as {"plain": {"value": "..."}}
            desc_text = desc_text.get("plain", {}).get("value", "") or desc_text.get(
                "view", {}
            ).get("value", "")
        if desc_text and len(desc_text) > 20:
            clean = re.sub(r"<[^>]+>", "", desc_text).strip()
            if len(clean) > 150:
                clean = clean[:147] + "..."
            return f"Use when {clean.lower()}"
        space_name = space_info.get("name", "")
        if space_name and len(space_name) > 5:
            return f"Use when working with {space_name.lower()} documentation"
    return (
        f"Use when referencing {name} documentation"
        if name
        else "Use when referencing this Confluence documentation"
    )
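
# Example behaviour of the helper above (illustrative comments, not executed):
#
#     >>> infer_description_from_confluence({"name": "Platform Docs"})
#     'Use when working with platform docs documentation'
#     >>> infer_description_from_confluence(None, "spine")
#     'Use when referencing spine documentation'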


class ConfluenceToSkillConverter:
    """Convert Confluence space documentation to an AI-ready skill.

    Supports two extraction modes:

    - **API mode**: Uses the Atlassian Confluence REST API to fetch pages from
      a space, including page hierarchy, labels, and storage-format content.
      Requires ``base_url``, ``space_key``, and authentication credentials.

    - **Export mode**: Parses a Confluence HTML/XML export directory that has
      been downloaded and extracted from the Confluence admin interface.
      Requires ``export_path`` pointing to the extracted directory.

    After extraction, the converter categorises pages by their parent-child
    hierarchy, generates reference markdown files, an index, and the main
    SKILL.md manifest.

    Attributes:
        config: Configuration dictionary.
        name: Skill name used for output directory and filenames.
        base_url: Confluence instance base URL (API mode).
        space_key: Confluence space key (API mode).
        export_path: Path to exported Confluence directory (export mode).
        username: Confluence username / email for API authentication.
        token: Confluence API token or password.
        description: Skill description for SKILL.md frontmatter.
        max_pages: Maximum number of pages to fetch in API mode.
        skill_dir: Output directory for the generated skill.
        data_file: Path to the intermediate extracted JSON file.
        extracted_data: Structured extraction results dict.
    """

    def __init__(self, config: dict) -> None:
        """Initialize the Confluence to skill converter.

        Args:
            config: Configuration dictionary containing:
                - name (str): Skill name (required).
                - base_url (str): Confluence instance URL (API mode).
                - space_key (str): Confluence space key (API mode).
                - export_path (str): Path to export directory (export mode).
                - username (str): API username / email (optional, falls back to env).
                - token (str): API token (optional, falls back to env).
                - description (str): Skill description (optional).
                - max_pages (int): Maximum pages to fetch, default 500.
        """
        self.config = config
        self.name: str = config["name"]
        self.base_url: str = config.get("base_url", "")
        self.space_key: str = config.get("space_key", "")
        self.export_path: str = config.get("export_path", "")
        self.username: str = config.get("username", "")
        self.token: str = config.get("token", "")
        self.description: str = (
            config.get("description") or f"Use when referencing {self.name} documentation"
        )
        self.max_pages: int = int(config.get("max_pages", 500))

        # Output paths
        self.skill_dir = f"output/{self.name}"
        self.data_file = f"output/{self.name}_extracted.json"

        # Extracted data storage
        self.extracted_data: dict[str, Any] | None = None

    # ──────────────────────────────────────────────────────────────────────
    # Extraction dispatcher
    # ──────────────────────────────────────────────────────────────────────

    def extract_confluence(self) -> bool:
        """Extract content from Confluence, dispatching to API or export mode.

        Determines the extraction mode based on the provided configuration:
        - If ``base_url`` and ``space_key`` are set, uses API mode.
        - If ``export_path`` is set, uses export mode.
        - Raises ValueError if neither mode is configured.

        After extraction, saves intermediate JSON to ``{name}_extracted.json``
        and updates the description from space metadata if not explicitly set.

        Returns:
            True on successful extraction.

        Raises:
            ValueError: If neither API nor export configuration is provided.
            RuntimeError: If API dependencies are missing or connection fails.
        """
        if self.base_url and self.space_key:
            print(f"\n Extracting from Confluence API: {self.base_url}")
            print(f" Space: {self.space_key}")
            raw_pages = self._extract_via_api()
        elif self.export_path:
            print(f"\n Extracting from Confluence export: {self.export_path}")
            raw_pages = self._extract_from_export()
        else:
            raise ValueError(
                "No Confluence source configured. Provide either:\n"
                " - --base-url and --space-key (API mode), or\n"
                " - --export-path (export mode)"
            )

        if not raw_pages:
            logger.warning("No pages extracted from Confluence")

        # Build page hierarchy tree
        page_tree = self._extract_page_tree(raw_pages)

        # Parse each page's HTML content to structured sections
        sections: list[dict[str, Any]] = []
        total_code_blocks = 0
        total_images = 0
        section_number = 0

        for page in raw_pages:
            page_id = page.get("id", "")
            page_title = page.get("title", "Untitled")
            body_html = page.get("body", "")
            labels = page.get("labels", [])
            parent_id = page.get("parent_id", "")

            if not body_html:
                logger.debug("Skipping page with no body: %s", page_title)
                continue

            # Parse the Confluence HTML content
            parsed = self._parse_confluence_html(body_html, page_title)

            section_number += 1
            section_data: dict[str, Any] = {
                "section_number": section_number,
                "page_id": page_id,
                "heading": page_title,
                "heading_level": "h1",
                "parent_id": parent_id,
                "labels": labels,
                "text": parsed.get("text", ""),
                "headings": parsed.get("headings", []),
                "code_samples": parsed.get("code_samples", []),
                "tables": parsed.get("tables", []),
                "images": parsed.get("images", []),
                "links": parsed.get("links", []),
                "macros": parsed.get("macros", []),
            }
            sections.append(section_data)
            total_code_blocks += len(parsed.get("code_samples", []))
            total_images += len(parsed.get("images", []))

        # Collect space metadata
        space_info = raw_pages[0].get("space_info", {}) if raw_pages else {}

        # Update description from space metadata if not explicitly set
        if not self.config.get("description"):
            self.description = infer_description_from_confluence(space_info, self.name)

        # Detect programming languages in code samples
        languages_detected: dict[str, int] = {}
        for section in sections:
            for code_sample in section.get("code_samples", []):
                lang = code_sample.get("language", "")
                if lang:
                    languages_detected[lang] = languages_detected.get(lang, 0) + 1

        result_data: dict[str, Any] = {
            "source": self.base_url or self.export_path,
            "space_key": self.space_key,
            "space_info": space_info,
            "page_tree": page_tree,
            "total_sections": len(sections),
            "total_pages": len(raw_pages),
            "total_code_blocks": total_code_blocks,
            "total_images": total_images,
            "languages_detected": languages_detected,
            "pages": sections,
        }

        # Save extracted data
        os.makedirs(os.path.dirname(self.data_file) or ".", exist_ok=True)
        with open(self.data_file, "w", encoding="utf-8") as f:
            json.dump(result_data, f, indent=2, ensure_ascii=False, default=str)

        print(f"\n Saved extracted data to: {self.data_file}")
        self.extracted_data = result_data
        print(
            f" Extracted {len(sections)} pages, "
            f"{total_code_blocks} code blocks, "
            f"{total_images} images"
        )
        return True

    # ──────────────────────────────────────────────────────────────────────
    # API extraction
    # ──────────────────────────────────────────────────────────────────────

    def _extract_via_api(self) -> list[dict[str, Any]]:
        """Fetch pages from a Confluence space using the REST API.

        Connects to the Confluence instance using ``atlassian-python-api``,
        retrieves all pages in the configured space (up to ``max_pages``),
        and returns them as a list of normalised page dicts.

        Authentication is resolved in priority order:
        1. Constructor arguments (username/token)
        2. Environment variables (CONFLUENCE_USERNAME / CONFLUENCE_TOKEN)

        Returns:
            List of page dicts with keys: id, title, body, parent_id, labels,
            url, space_info, version, created, modified.

        Raises:
            RuntimeError: If atlassian-python-api is not installed or
                the connection / fetch fails.
        """
        _check_atlassian_deps()

        # Resolve authentication credentials
        username = (
            self.username
            or os.environ.get("CONFLUENCE_USERNAME", "")
            or os.environ.get("ATLASSIAN_USERNAME", "")
        )
        token = (
            self.token
            or os.environ.get("CONFLUENCE_TOKEN", "")
            or os.environ.get("ATLASSIAN_TOKEN", "")
        )

        if not username or not token:
            raise RuntimeError(
                "Confluence API authentication required.\n"
                "Provide --username and --token, or set CONFLUENCE_USERNAME "
                "and CONFLUENCE_TOKEN environment variables."
            )

        # Connect to Confluence
        try:
            confluence = Confluence(
                url=self.base_url,
                username=username,
                password=token,
                cloud=self._is_cloud_instance(),
            )
        except Exception as e:
            raise RuntimeError(f"Failed to connect to Confluence at {self.base_url}: {e}") from e

        # Fetch space information
        space_info: dict[str, Any] = {}
        try:
            space_data = confluence.get_space(self.space_key, expand="description.plain,homepage")
            space_info = {
                "key": space_data.get("key", self.space_key),
                "name": space_data.get("name", self.space_key),
                "description": space_data.get("description", {}),
                "type": space_data.get("type", "global"),
                "homepage_id": (
                    space_data.get("homepage", {}).get("id", "")
                    if space_data.get("homepage")
                    else ""
                ),
            }
            print(f" Space: {space_info.get('name', self.space_key)}")
        except Exception as e:
            logger.warning("Could not fetch space info: %s", e)
            space_info = {"key": self.space_key, "name": self.space_key}

        # Fetch all pages in the space, paginated
        pages: list[dict[str, Any]] = []
        start = 0
        limit = 50  # Confluence API page size
        expand_fields = "body.storage,version,ancestors,metadata.labels"

        print(f" Fetching pages (max {self.max_pages})...")

        while len(pages) < self.max_pages:
            try:
                batch = confluence.get_all_pages_from_space(
                    self.space_key,
                    start=start,
                    limit=min(limit, self.max_pages - len(pages)),
                    expand=expand_fields,
                    content_type="page",
                )
            except Exception as e:
                logger.error("Failed to fetch pages at offset %d: %s", start, e)
                break

            if not batch:
                break

            for page_data in batch:
                page_id = str(page_data.get("id", ""))
                title = page_data.get("title", "Untitled")

                # Extract body (storage format HTML)
                body = page_data.get("body", {}).get("storage", {}).get("value", "")

                # Extract parent ID from ancestors
                ancestors = page_data.get("ancestors", [])
                parent_id = str(ancestors[-1]["id"]) if ancestors else ""

                # Extract labels
                labels_data = page_data.get("metadata", {}).get("labels", {}).get("results", [])
                labels = [lbl.get("name", "") for lbl in labels_data if lbl.get("name")]

                # Version and dates
                version_info = page_data.get("version", {})
                version_number = version_info.get("number", 1)
                created = version_info.get("when", "") if version_number == 1 else ""
                modified = version_info.get("when", "")

                # Build page URL
                page_url = f"{self.base_url}/wiki/spaces/{self.space_key}/pages/{page_id}"
                links = page_data.get("_links", {})
                if links.get("webui"):
                    page_url = f"{self.base_url}/wiki{links['webui']}"

                page_dict: dict[str, Any] = {
                    "id": page_id,
                    "title": title,
                    "body": body,
                    "parent_id": parent_id,
                    "labels": labels,
                    "url": page_url,
                    "space_info": space_info,
                    "version": version_number,
                    "created": created,
                    "modified": modified,
                }
                pages.append(page_dict)

            print(f" Fetched {len(pages)} pages...")
            start += len(batch)

            # If we got fewer results than the limit, we've reached the end
            if len(batch) < limit:
                break

        print(f" Total pages fetched: {len(pages)}")
        return pages

    def _is_cloud_instance(self) -> bool:
        """Detect whether the base URL points to an Atlassian Cloud instance.

        Cloud instances use ``*.atlassian.net`` domain names.

        Returns:
            True if the URL looks like an Atlassian Cloud instance.
        """
        return "atlassian.net" in self.base_url.lower()
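
    # Illustrative: "https://acme.atlassian.net/wiki" -> True (Cloud),
    # "https://wiki.example.com" -> False (Server / Data Center).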

    # ──────────────────────────────────────────────────────────────────────
    # Export extraction
    # ──────────────────────────────────────────────────────────────────────

    def _extract_from_export(self) -> list[dict[str, Any]]:
        """Parse a Confluence HTML/XML export directory into page dicts.

        Confluence exports can contain either:
        - An ``entities.xml`` file (full XML export from admin)
        - A directory of HTML files (HTML export)

        This method auto-detects the export format and delegates accordingly.
        HTML files are parsed with BeautifulSoup to extract content and metadata.

        Returns:
            List of normalised page dicts (same structure as API mode).

        Raises:
            FileNotFoundError: If the export path does not exist.
            ValueError: If no parseable content is found in the export.
        """
        export_dir = Path(self.export_path)
        if not export_dir.exists():
            raise FileNotFoundError(f"Confluence export path not found: {self.export_path}")
        if not export_dir.is_dir():
            raise ValueError(f"Export path is not a directory: {self.export_path}")

        pages: list[dict[str, Any]] = []
        space_info: dict[str, Any] = {"key": self.space_key or "EXPORT", "name": self.name}

        # Check for entities.xml (full XML export)
        entities_xml = export_dir / "entities.xml"
        if entities_xml.exists():
            pages = self._parse_entities_xml(entities_xml, space_info)
            if pages:
                print(f" Parsed entities.xml: {len(pages)} pages")
                return pages

        # Fall back to HTML file export
        html_files = sorted(
            f for f in export_dir.rglob("*.html") if f.is_file() and f.name != "index.html"
        )

        if not html_files:
            # Also try .htm files
            html_files = sorted(
                f for f in export_dir.rglob("*.htm") if f.is_file() and f.name != "index.htm"
            )

        if not html_files:
            raise ValueError(
                f"No HTML files found in export directory: {self.export_path}\n"
                "Expected either entities.xml or HTML files from Confluence export."
            )

        print(f" Found {len(html_files)} HTML files in export")

        # Parse index.html for page hierarchy if available
        index_file = export_dir / "index.html"
        hierarchy_map: dict[str, str] = {}  # filename -> parent filename
        if index_file.exists():
            hierarchy_map = self._parse_export_index(index_file)

        for idx, html_file in enumerate(html_files):
            if idx >= self.max_pages:
                logger.info("Reached max_pages limit (%d)", self.max_pages)
                break

            try:
                raw_html = html_file.read_text(encoding="utf-8", errors="ignore")
            except Exception as e:
                logger.warning("Could not read %s: %s", html_file, e)
                continue

            soup = BeautifulSoup(raw_html, "html.parser")

            # Extract title
            title_tag = soup.find("title")
            title = title_tag.get_text(strip=True) if title_tag else html_file.stem

            # Find main content area (Confluence exports use specific div IDs)
            main_content = (
                soup.find("div", id="main-content")
                or soup.find("div", class_="wiki-content")
                or soup.find("div", id="content")
                or soup.find("body")
            )

            body_html = str(main_content) if main_content else ""
            file_key = html_file.stem
            parent_key = hierarchy_map.get(file_key, "")

            page_dict: dict[str, Any] = {
                "id": file_key,
                "title": title,
                "body": body_html,
                "parent_id": parent_key,
                "labels": [],
                "url": str(html_file),
                "space_info": space_info,
                "version": 1,
                "created": "",
                "modified": "",
            }
            pages.append(page_dict)

        print(f" Parsed {len(pages)} pages from HTML export")
        return pages

    def _parse_entities_xml(
        self,
        xml_path: Path,
        space_info: dict[str, Any],
    ) -> list[dict[str, Any]]:
        """Parse Confluence entities.xml export file.

        The entities.xml file contains all page data including body content
        in Confluence storage format. This method extracts page objects and
        their parent-child relationships.

        Args:
            xml_path: Path to the entities.xml file.
            space_info: Space metadata dict to attach to each page.

        Returns:
            List of normalised page dicts.
        """
        pages: list[dict[str, Any]] = []

        try:
            # Parse the full export tree; exports are local, user-supplied files
            import xml.etree.ElementTree as ET

            tree = ET.parse(xml_path)  # noqa: S314
            root = tree.getroot()
        except Exception as e:
            logger.warning("Failed to parse entities.xml: %s", e)
            return []

        # Find all page objects in the XML
        for obj_elem in root.iter("object"):
            obj_class = obj_elem.get("class", "")
            if obj_class != "Page":
                continue

            page_data: dict[str, str] = {}
            for prop_elem in obj_elem:
                prop_name = prop_elem.get("name", "")
                if prop_name == "title":
                    page_data["title"] = prop_elem.text or ""
                elif prop_name == "id":
                    page_data["id"] = prop_elem.text or ""
                elif prop_name == "bodyContents":
                    # Body content is nested inside a collection
                    for body_obj in prop_elem.iter("object"):
                        for body_prop in body_obj:
                            if body_prop.get("name") == "body":
                                page_data["body"] = body_prop.text or ""
                elif prop_name == "parent":
                    # Parent reference
                    parent_ref = prop_elem.find("id")
                    if parent_ref is not None and parent_ref.text:
                        page_data["parent_id"] = parent_ref.text

            if page_data.get("title") and page_data.get("id"):
                page_dict: dict[str, Any] = {
                    "id": page_data.get("id", ""),
                    "title": page_data.get("title", ""),
                    "body": page_data.get("body", ""),
                    "parent_id": page_data.get("parent_id", ""),
                    "labels": [],
                    "url": "",
                    "space_info": space_info,
                    "version": 1,
                    "created": "",
                    "modified": "",
                }
                pages.append(page_dict)

        return pages

    def _parse_export_index(self, index_path: Path) -> dict[str, str]:
        """Parse the index.html from a Confluence HTML export for hierarchy.

        The export index page contains a nested list structure representing
        the page tree. This method parses it to build a child-to-parent mapping.

        Args:
            index_path: Path to the index.html file.

        Returns:
            Dict mapping page filename stem to parent filename stem.
        """
        hierarchy: dict[str, str] = {}

        try:
            raw_html = index_path.read_text(encoding="utf-8", errors="ignore")
            soup = BeautifulSoup(raw_html, "html.parser")

            # Confluence export index uses nested <ul><li><a href="..."> structure
            def _walk_list(ul_elem: Tag, parent_key: str = "") -> None:
                for li in ul_elem.find_all("li", recursive=False):
                    link = li.find("a", href=True)
                    if not link:
                        continue
                    href = link.get("href", "")
                    # Extract filename stem from href
                    page_key = Path(href).stem if href else ""
                    if page_key and parent_key:
                        hierarchy[page_key] = parent_key

                    # Recurse into nested lists
                    nested_ul = li.find("ul", recursive=False)
                    if nested_ul:
                        _walk_list(nested_ul, page_key)

            top_ul = soup.find("ul")
            if top_ul:
                _walk_list(top_ul)

        except Exception as e:
            logger.warning("Failed to parse export index: %s", e)

        return hierarchy
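
    # Illustrative example of the mapping this produces: an index.html tree like
    #   <ul><li><a href="Home_123.html">Home</a>
    #     <ul><li><a href="Setup_456.html">Setup</a></li></ul></li></ul>
    # yields {"Setup_456": "Home_123"}; top-level pages get no entry, which is
    # why callers treat a missing key as "root page".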

    # ──────────────────────────────────────────────────────────────────────
    # HTML / content parsing
    # ──────────────────────────────────────────────────────────────────────

    def _parse_confluence_html(
        self,
        html_content: str,
        page_title: str = "",
    ) -> dict[str, Any]:
        """Parse Confluence storage format HTML into structured content.

        Confluence uses a custom XHTML-based storage format with proprietary
        macro elements (``ac:structured-macro``, ``ac:rich-text-body``, etc.).
        This method:

        1. Extracts code macros and panel macros before cleaning.
        2. Cleans Confluence-specific markup (macros, boilerplate divs).
        3. Extracts sub-headings, text content, code blocks, tables, images,
           and links from the cleaned HTML.

        Args:
            html_content: Raw HTML string in Confluence storage format.
            page_title: Page title for context in logging.

        Returns:
            Dict with keys: text, headings, code_samples, tables, images,
            links, macros.
        """
        soup = BeautifulSoup(html_content, "html.parser")

        # Step 1: Extract macros before cleaning (they contain valuable content)
        macros = self._extract_macros(soup)

        # Step 2: Clean Confluence-specific HTML
        cleaned_soup = self._clean_confluence_html(soup)

        # Step 3: Extract structured content from cleaned HTML
        text_parts: list[str] = []
        headings: list[dict[str, str]] = []
        code_samples: list[dict[str, Any]] = []
        tables: list[dict[str, Any]] = []
        images: list[dict[str, str]] = []
        links: list[dict[str, str]] = []

        # Add code samples from extracted macros
        for macro in macros:
            if macro.get("type") == "code":
                code_samples.append(
                    {
                        "code": macro.get("content", ""),
                        "language": macro.get("language", ""),
                        "title": macro.get("title", ""),
                        "quality_score": _score_code_quality(macro.get("content", "")),
                    }
                )

        # Extract headings (h1-h6)
        for heading_tag in cleaned_soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
            heading_text = heading_tag.get_text(strip=True)
            if heading_text:
                headings.append(
                    {
                        "level": heading_tag.name,
                        "text": heading_text,
                    }
                )

        # Extract code blocks from <pre>/<code> elements (non-macro code)
        for pre_tag in cleaned_soup.find_all("pre"):
            code_elem = pre_tag.find("code")
            if code_elem:
                code_text = code_elem.get_text()
                lang = self._detect_language_from_classes(code_elem)
            else:
                code_text = pre_tag.get_text()
                lang = self._detect_language_from_classes(pre_tag)

            code_text = code_text.strip()
            if code_text and len(code_text) > 10:
                # Avoid duplicates from macro extraction
                is_duplicate = any(cs.get("code", "").strip() == code_text for cs in code_samples)
                if not is_duplicate:
                    code_samples.append(
                        {
                            "code": code_text,
                            "language": lang,
                            "title": "",
                            "quality_score": _score_code_quality(code_text),
                        }
                    )
            pre_tag.decompose()

        # Extract tables
        for table_tag in cleaned_soup.find_all("table"):
            table_data = self._extract_table(table_tag)
            if table_data:
                tables.append(table_data)
            table_tag.decompose()

        # Extract images
        for img_tag in cleaned_soup.find_all("img"):
            src = img_tag.get("src", "")
            alt = img_tag.get("alt", "")
            if src:
                images.append({"src": src, "alt": alt})

        # Extract links
        for a_tag in cleaned_soup.find_all("a", href=True):
            href = a_tag.get("href", "")
            link_text = a_tag.get_text(strip=True)
            if href and link_text and not href.startswith("javascript:"):
                links.append({"href": href, "text": link_text})

        # Extract remaining text content
        body_text = self._html_to_text(cleaned_soup)
        if body_text and body_text.strip():
            text_parts.append(body_text.strip())

        return {
            "text": "\n\n".join(text_parts),
            "headings": headings,
            "code_samples": code_samples,
            "tables": tables,
            "images": images,
            "links": links,
            "macros": [m for m in macros if m.get("type") != "code"],
        }

    def _extract_macros(self, soup: BeautifulSoup) -> list[dict[str, Any]]:
        """Extract Confluence macros from storage format HTML.

        Identifies and parses structured macros including:
        - **code**: Code blocks with language specification.
        - **panel** / **info** / **note** / **warning** / **tip**: Callout panels.
        - **expand**: Expandable content sections.
        - **toc**: Table of contents macro.
        - **jira**: JIRA issue references.
        - **excerpt**: Page excerpts.

        Extracts the macro content and metadata, then removes the macro
        elements from the soup to avoid double-processing.

        Args:
            soup: BeautifulSoup object containing Confluence storage format HTML.

        Returns:
            List of macro dicts with type, content, language (for code), title.
        """
        macros: list[dict[str, Any]] = []

        # Find all ac:structured-macro elements
        for macro_elem in soup.find_all("ac:structured-macro"):
            macro_name = macro_elem.get("ac:name", "") or macro_elem.get("data-macro-name", "")
            if not macro_name:
                continue

            # Extract parameters
            params: dict[str, str] = {}
            for param in macro_elem.find_all("ac:parameter"):
                param_name = param.get("ac:name", "") or param.get("name", "")
                param_value = param.get_text(strip=True)
                if param_name:
                    params[param_name] = param_value

            # Extract body content
            body_elem = macro_elem.find("ac:rich-text-body") or macro_elem.find(
                "ac:plain-text-body"
            )
            body_content = ""
            if body_elem:
                if macro_elem.find("ac:plain-text-body"):
                    body_content = body_elem.get_text()
                else:
                    body_content = body_elem.get_text(strip=True)

            macro_dict: dict[str, Any] = {
                "type": macro_name,
                "params": params,
                "content": body_content,
            }

            # Special handling for code macros
            if macro_name == "code":
                lang_raw = params.get("language", "").lower().strip()
                macro_dict["language"] = _CODE_MACRO_LANGS.get(lang_raw, lang_raw)
                macro_dict["title"] = params.get("title", "")
                macro_dict["type"] = "code"

            # Panel-type macros
            elif macro_name in ("panel", "info", "note", "warning", "tip", "excerpt"):
                macro_dict["title"] = params.get("title", "")

            macros.append(macro_dict)

            # Remove the macro element to avoid double-processing
            macro_elem.decompose()

        # Also handle legacy Confluence code blocks wrapped in <div class="code">
        for code_div in soup.find_all("div", class_="code"):
            pre_elem = code_div.find("pre")
            if pre_elem:
                code_text = pre_elem.get_text()
                if code_text and code_text.strip():
                    macros.append(
                        {
                            "type": "code",
                            "params": {},
                            "content": code_text.strip(),
                            "language": "",
                            "title": "",
                        }
                    )
            code_div.decompose()

        return macros

    def _clean_confluence_html(self, soup: BeautifulSoup) -> BeautifulSoup:
        """Strip Confluence-specific markup from parsed HTML.

        Removes:
        - Script and style elements.
        - HTML comments.
        - Confluence-specific macro wrapper divs (by class name).
        - Remaining ``ac:*`` and ``ri:*`` namespace elements.
        - Empty ``<div>`` and ``<span>`` containers.
        - Confluence status/date live-search elements.

        Args:
            soup: BeautifulSoup object to clean (modified in-place and returned).

        Returns:
            The cleaned BeautifulSoup object.
        """
        # Remove script, style, noscript
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()

        # Remove HTML comments
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            comment.extract()

        # Remove Confluence-specific boilerplate divs by class
        for css_class in _CONFLUENCE_MACRO_CLASSES:
            for elem in soup.find_all(class_=css_class):
                elem.decompose()

        # Remove remaining ac:* and ri:* namespace elements that weren't
        # captured by macro extraction (e.g. empty placeholders)
        for tag_name in list(_STORAGE_MACRO_TAGS):
            for elem in soup.find_all(tag_name):
                # Preserve text content by replacing element with its text
                text_content = elem.get_text(strip=True)
                if text_content:
                    elem.replace_with(text_content)
                else:
                    elem.decompose()

        # Remove Confluence status macros and date elements
        for elem in soup.find_all("time"):
            elem.decompose()
        for elem in soup.find_all("ac:emoticon"):
            elem.decompose()

        # Remove empty wrapper divs and spans (cleanup after macro removal)
        for tag_name in ("div", "span"):
            for elem in soup.find_all(tag_name):
                if not elem.get_text(strip=True) and not elem.find(["img", "table", "pre"]):
                    elem.decompose()

        return soup


    def _extract_page_tree(
        self,
        pages: list[dict[str, Any]],
    ) -> list[dict[str, Any]]:
        """Build a hierarchical page tree from a flat list of pages.

        Constructs a tree structure based on parent_id relationships. Pages
        without a parent are placed at the root level. The tree is useful
        for categorisation and navigation.

        Args:
            pages: Flat list of page dicts with id and parent_id fields.

        Returns:
            List of tree node dicts, each with keys: id, title, children,
            depth, labels.
        """
        # Build lookup maps
        by_id: dict[str, dict[str, Any]] = {}
        for page in pages:
            page_id = page.get("id", "")
            if page_id:
                by_id[page_id] = {
                    "id": page_id,
                    "title": page.get("title", ""),
                    "children": [],
                    "depth": 0,
                    "labels": page.get("labels", []),
                }

        # Build parent-child relationships
        roots: list[dict[str, Any]] = []
        for page in pages:
            page_id = page.get("id", "")
            parent_id = page.get("parent_id", "")
            node = by_id.get(page_id)
            if not node:
                continue

            if parent_id and parent_id in by_id:
                parent_node = by_id[parent_id]
                parent_node["children"].append(node)
                node["depth"] = parent_node["depth"] + 1
            else:
                roots.append(node)

        # Sort children alphabetically at each level
        def _sort_children(node: dict[str, Any]) -> None:
            node["children"].sort(key=lambda n: n.get("title", "").lower())
            for child in node["children"]:
                _sort_children(child)

        for root in roots:
            _sort_children(root)

        roots.sort(key=lambda n: n.get("title", "").lower())
        return roots

    def _extract_table(self, table_elem: Tag) -> dict[str, Any] | None:
        """Extract an HTML table to a markdown-ready dict.

        Handles ``<thead>``/``<tbody>`` structure as well as header-less tables.
        Confluence tables often use ``<th>`` in the first row.

        Args:
            table_elem: BeautifulSoup ``<table>`` Tag.

        Returns:
            Dict with 'headers' and 'rows' lists, or None if empty.
        """
        headers: list[str] = []
        rows: list[list[str]] = []

        # Try <thead> for headers
        thead = table_elem.find("thead")
        if thead:
            header_row = thead.find("tr")
            if header_row:
                headers = [th.get_text(strip=True) for th in header_row.find_all(["th", "td"])]

        # Body rows
        tbody = table_elem.find("tbody") or table_elem
        all_rows = tbody.find_all("tr")

        for row in all_rows:
            cells = row.find_all(["td", "th"])
            cell_texts = [c.get_text(strip=True) for c in cells]

            # If no thead and first row has <th> elements, use as headers
            if not headers and row.find("th") and not rows:
                headers = cell_texts
                continue

            if cell_texts and cell_texts != headers:
                rows.append(cell_texts)

        # If still no headers, promote first row
        if not headers and rows:
            headers = rows.pop(0)

        if not headers and not rows:
            return None

        return {"headers": headers, "rows": rows}

    def _detect_language_from_classes(self, elem: Tag) -> str:
        """Detect programming language from CSS classes on an element.

        Checks for common class conventions: ``language-python``,
        ``brush: java``, ``code-python``, or bare language names.

        Args:
            elem: BeautifulSoup Tag with potential language class hints.

        Returns:
            Normalised language string, or empty string if undetected.
        """
        classes = elem.get("class", [])
        if not classes:
            return ""

        prefixes = ("language-", "lang-", "code-", "highlight-", "brush:")
        for cls in classes:
            cls_lower = cls.lower().strip()
            for prefix in prefixes:
                if cls_lower.startswith(prefix):
                    lang_raw = cls_lower[len(prefix) :].strip()
                    return _CODE_MACRO_LANGS.get(lang_raw, lang_raw)

        # Check for bare language names
        known = set(_CODE_MACRO_LANGS.keys())
        for cls in classes:
            if cls.lower() in known:
                return _CODE_MACRO_LANGS.get(cls.lower(), cls.lower())

        return ""

    def _html_to_text(self, elem: Tag | BeautifulSoup) -> str:
        """Convert an HTML element to clean markdown-like text.

        Handles paragraphs, bold/italic, links, lists, blockquotes,
        inline code, headings, definition lists, and horizontal rules.

        Args:
            elem: BeautifulSoup Tag or soup to convert.

        Returns:
            Cleaned text string with basic markdown formatting.
        """
        if not hasattr(elem, "children"):
            return str(elem).strip()

        parts: list[str] = []

        for child in elem.children:
            if not hasattr(child, "name"):
                text = str(child)
                if text.strip():
                    parts.append(text)
                continue

            if child.name is None:
                continue

            tag = child.name

            if tag == "br":
                parts.append("\n")
            elif tag in ("p", "div"):
                inner = self._html_to_text(child)
                if inner.strip():
                    parts.append(f"\n\n{inner.strip()}\n\n")
            elif tag in ("strong", "b"):
                inner = child.get_text(strip=True)
                if inner:
                    parts.append(f"**{inner}**")
            elif tag in ("em", "i"):
                inner = child.get_text(strip=True)
                if inner:
                    parts.append(f"*{inner}*")
            elif tag == "a" and child.get("href"):
                link_text = child.get_text(strip=True)
                href = child.get("href", "")
                if link_text and href and not href.startswith("javascript:"):
                    parts.append(f"[{link_text}]({href})")
                elif link_text:
                    parts.append(link_text)
            elif tag in ("ul", "ol"):
                items = child.find_all("li", recursive=False)
                for idx, li in enumerate(items):
                    li_text = li.get_text(strip=True)
                    if li_text:
                        prefix = f"{idx + 1}." if tag == "ol" else "-"
                        parts.append(f"\n{prefix} {li_text}")
                parts.append("\n")
            elif tag == "blockquote":
                bq_text = child.get_text(strip=True)
                if bq_text:
                    lines = bq_text.split("\n")
                    quoted = "\n".join(f"> {line}" for line in lines)
                    parts.append(f"\n\n{quoted}\n\n")
            elif tag == "code":
                if child.find_parent("pre") is None:
                    code_text = child.get_text()
                    if code_text.strip():
                        parts.append(f"`{code_text.strip()}`")
            elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
                level = int(tag[1])
                inner = child.get_text(strip=True)
                if inner:
                    parts.append(f"\n\n{'#' * level} {inner}\n\n")
            elif tag == "dl":
                for dt in child.find_all("dt"):
                    term = dt.get_text(strip=True)
                    dd = dt.find_next_sibling("dd")
                    definition = dd.get_text(strip=True) if dd else ""
                    parts.append(f"\n**{term}**: {definition}")
                parts.append("\n")
            elif tag == "hr":
                parts.append("\n\n---\n\n")
            else:
                inner = self._html_to_text(child)
                if inner.strip():
                    parts.append(inner)

        result = "".join(parts)
        result = re.sub(r"\n{3,}", "\n\n", result)
        return result

    # ──────────────────────────────────────────────────────────────────────
    # Load extracted data
    # ──────────────────────────────────────────────────────────────────────

    def load_extracted_data(self, json_path: str) -> bool:
        """Load previously extracted data from a JSON file.

        Args:
            json_path: Path to the intermediate extracted JSON file.

        Returns:
            True on success.

        Raises:
            FileNotFoundError: If the JSON file does not exist.
        """
        print(f"\n Loading extracted data from: {json_path}")
        if not os.path.exists(json_path):
            raise FileNotFoundError(f"Extracted data file not found: {json_path}")

        with open(json_path, encoding="utf-8") as f:
            self.extracted_data = json.load(f)

        total = self.extracted_data.get("total_sections", 0)
        print(f" Loaded {total} pages")
        return True

    # ──────────────────────────────────────────────────────────────────────
    # Categorisation
    # ──────────────────────────────────────────────────────────────────────

    def categorize_content(self) -> dict[str, dict[str, Any]]:
        """Categorise pages by space / parent-page hierarchy.

        Groups pages based on their parent page relationships. Root pages
        (those without a parent) form top-level categories. Pages with
        parents are grouped under their parent's category. Deep nesting
        is flattened to two levels.

        If no hierarchy information is available, falls back to grouping
        by labels or placing all pages in a single "content" category.

        Returns:
            Dict mapping category key to dict with 'title' and 'pages' lists.
        """
        print("\n Categorising content...")

        categorised: dict[str, dict[str, Any]] = {}
        sections = self.extracted_data.get("pages", [])
        page_tree = self.extracted_data.get("page_tree", [])

        if not sections:
            categorised["content"] = {"title": "Content", "pages": []}
            return categorised

        # Build a lookup from page_id to section
        sections_by_id: dict[str, dict[str, Any]] = {}
        for section in sections:
            page_id = section.get("page_id", "")
            if page_id:
                sections_by_id[page_id] = section

        # Strategy 1: Use page hierarchy if available
        if page_tree:
            for root_node in page_tree:
                root_id = root_node.get("id", "")
                root_title = root_node.get("title", "Untitled")
                cat_key = self._sanitize_filename(root_title)

                # Collect the root page and all its descendants
                descendant_ids = self._collect_descendant_ids(root_node)
                all_ids = [root_id] + descendant_ids

                cat_pages = [sections_by_id[pid] for pid in all_ids if pid in sections_by_id]

                if cat_pages:
                    categorised[cat_key] = {
                        "title": root_title,
                        "pages": cat_pages,
                    }

        # Strategy 2: Group by parent_id when no tree is available
        if not categorised:
            parent_groups: dict[str, list[dict[str, Any]]] = {}
            for section in sections:
                parent_id = section.get("parent_id", "")
                group_key = parent_id or "root"
                if group_key not in parent_groups:
                    parent_groups[group_key] = []
                parent_groups[group_key].append(section)

            for group_key, group_pages in parent_groups.items():
                if group_key == "root":
                    cat_title = "Root Pages"
                else:
                    # Try to find the parent page title
                    parent_section = sections_by_id.get(group_key)
                    cat_title = (
                        parent_section.get("heading", "Section")
                        if parent_section
                        else f"Section {group_key}"
                    )

                cat_key = self._sanitize_filename(cat_title)
                categorised[cat_key] = {
                    "title": cat_title,
                    "pages": group_pages,
                }

        # Strategy 3: Single category fallback
        if not categorised:
            categorised["content"] = {
                "title": "Content",
                "pages": sections,
            }

        print(f" Created {len(categorised)} categories")
        for cat_key, cat_data in categorised.items():
            print(f" - {cat_data['title']}: {len(cat_data['pages'])} pages")

        return categorised

    def _collect_descendant_ids(self, node: dict[str, Any]) -> list[str]:
        """Recursively collect all descendant page IDs from a tree node.

        Args:
            node: Tree node dict with 'children' list.

        Returns:
            Flat list of all descendant page IDs.
        """
        ids: list[str] = []
        for child in node.get("children", []):
            child_id = child.get("id", "")
            if child_id:
                ids.append(child_id)
                ids.extend(self._collect_descendant_ids(child))
        return ids

    # ──────────────────────────────────────────────────────────────────────
    # Skill building
    # ──────────────────────────────────────────────────────────────────────

    def build_skill(self) -> None:
        """Build the complete skill structure from extracted data.

        Creates output directories, categorises content, and generates:
        - Reference markdown files for each category.
        - A reference index file.
        - The main SKILL.md manifest.

        The output directory structure follows the standard skill layout::

            output/{name}/
                SKILL.md
                references/
                    index.md
                    {category}.md
                scripts/
                assets/
        """
        print(f"\n Building skill: {self.name}")

        # Create directories
        os.makedirs(f"{self.skill_dir}/references", exist_ok=True)
        os.makedirs(f"{self.skill_dir}/scripts", exist_ok=True)
        os.makedirs(f"{self.skill_dir}/assets", exist_ok=True)

        # Categorise content
        categorised = self.categorize_content()

        # Generate reference files
        print("\n Generating reference files...")
        section_num = 1
        total_categories = len(categorised)
        for cat_key, cat_data in categorised.items():
            self._generate_reference_file(cat_key, cat_data, section_num, total_categories)
            section_num += 1

        # Generate index
        self._generate_index(categorised)

        # Generate SKILL.md
        self._generate_skill_md(categorised)

        print(f"\n Skill built successfully: {self.skill_dir}/")
        print(f"\n Next step: Package with: skill-seekers package {self.skill_dir}/")

    # ──────────────────────────────────────────────────────────────────────
    # Private generators
    # ──────────────────────────────────────────────────────────────────────

    def _generate_reference_file(
        self,
        cat_key: str,
        cat_data: dict[str, Any],
        section_num: int,
        total_categories: int,
    ) -> None:
        """Generate a reference markdown file for a content category.

        Creates a markdown file containing all pages in the category, with
        headings, text content, code examples, tables, images, and links.

        Args:
            cat_key: Category key (sanitised filename stem).
            cat_data: Category dict with 'title' and 'pages' keys.
            section_num: Current section number for filename generation.
            total_categories: Total number of categories for filename logic.
        """
        sections = cat_data["pages"]
        safe_key = self._sanitize_filename(cat_data["title"])
        filename = f"{self.skill_dir}/references/{safe_key}.md"

        with open(filename, "w", encoding="utf-8") as f:
            f.write(f"# {cat_data['title']}\n\n")

            for section in sections:
                sec_num = section.get("section_number", "?")
                heading = section.get("heading", "")
                labels = section.get("labels", [])

                f.write(f"---\n\n**Page {sec_num}: {heading}**\n\n")

                # Labels
                if labels:
                    label_str = ", ".join(f"`{lbl}`" for lbl in labels)
                    f.write(f"**Labels:** {label_str}\n\n")

                # Sub-headings
                for sub in section.get("headings", []):
                    sub_level = sub.get("level", "h3")
                    sub_text = sub.get("text", "")
                    if sub_text:
                        md_depth = int(sub_level[1]) + 1 if sub_level else 4
                        md_depth = min(md_depth, 6)
                        f.write(f"{'#' * md_depth} {sub_text}\n\n")

                # Text content
                if section.get("text"):
                    f.write(f"{section['text']}\n\n")

                # Code samples
                code_list = section.get("code_samples", [])
                if code_list:
                    f.write("### Code Examples\n\n")
                    for code in code_list:
                        lang = code.get("language", "")
                        title = code.get("title", "")
                        if title:
                            f.write(f"**{title}**\n\n")
                        f.write(f"```{lang}\n{code['code']}\n```\n\n")

                # Tables
                table_list = section.get("tables", [])
                if table_list:
                    for table in table_list:
                        headers = table.get("headers", [])
                        rows = table.get("rows", [])
                        if headers:
                            f.write("| " + " | ".join(str(h) for h in headers) + " |\n")
                            f.write("| " + " | ".join("---" for _ in headers) + " |\n")
                            for row in rows:
                                f.write("| " + " | ".join(str(c) for c in row) + " |\n")
                            f.write("\n")

                # Images
                image_list = section.get("images", [])
                if image_list:
                    for img in image_list:
                        alt = img.get("alt", "Image")
                        src = img.get("src", "")
                        if src:
                            f.write(f"![{alt}]({src})\n\n")

                # Links
                link_list = section.get("links", [])
                if link_list:
                    f.write("### Related Links\n\n")
                    for link in link_list[:20]:
                        f.write(f"- [{link['text']}]({link['href']})\n")
                    f.write("\n")

                # Non-code macros (panels, notes, warnings, etc.)
                macro_list = section.get("macros", [])
                if macro_list:
                    for macro in macro_list:
                        macro_type = macro.get("type", "")
                        macro_content = macro.get("content", "")
                        macro_title = macro.get("title", "")

                        if macro_type in ("info", "note", "tip"):
                            prefix = {"info": "INFO", "note": "NOTE", "tip": "TIP"}.get(
                                macro_type, "NOTE"
                            )
                            header = f"> **{prefix}**"
                            if macro_title:
                                header += f": {macro_title}"
                            f.write(f"{header}\n")
                            for line in macro_content.split("\n"):
                                f.write(f"> {line}\n")
                            f.write("\n")
                        elif macro_type == "warning":
                            header = "> **WARNING**"
                            if macro_title:
                                header += f": {macro_title}"
                            f.write(f"{header}\n")
                            for line in macro_content.split("\n"):
                                f.write(f"> {line}\n")
                            f.write("\n")
                        elif macro_type == "panel":
                            if macro_title:
                                f.write(f"**{macro_title}**\n\n")
                            if macro_content:
                                f.write(f"{macro_content}\n\n")
                        elif macro_type == "expand":
                            expand_title = macro_title or "Details"
                            f.write(f"<details>\n<summary>{expand_title}</summary>\n\n")
                            f.write(f"{macro_content}\n\n")
                            f.write("</details>\n\n")
                        elif macro_content:
                            f.write(f"{macro_content}\n\n")

                f.write("---\n\n")

        print(f" Generated: {filename}")

    def _generate_index(self, categorised: dict[str, dict[str, Any]]) -> None:
        """Generate the reference index file.

        Creates an ``index.md`` listing all categories with links, page counts,
        and overall statistics about the extracted content.

        Args:
            categorised: Dict of category_key -> category data.
        """
        filename = f"{self.skill_dir}/references/index.md"

        with open(filename, "w", encoding="utf-8") as f:
            f.write(f"# {self.name.title()} Confluence Reference\n\n")

            space_info = self.extracted_data.get("space_info", {})
            if space_info.get("name"):
                f.write(f"**Space:** {space_info['name']}")
                if space_info.get("key"):
                    f.write(f" ({space_info['key']})")
                f.write("\n\n")

            f.write("## Categories\n\n")

            for cat_key, cat_data in categorised.items():
                safe_name = self._sanitize_filename(cat_data["title"])
                page_count = len(cat_data["pages"])
                f.write(f"- [{cat_data['title']}]({safe_name}.md) ({page_count} pages)\n")

            f.write("\n## Statistics\n\n")
            f.write(f"- Total pages: {self.extracted_data.get('total_sections', 0)}\n")
            f.write(f"- Code blocks: {self.extracted_data.get('total_code_blocks', 0)}\n")
            f.write(f"- Images: {self.extracted_data.get('total_images', 0)}\n")

            langs = self.extracted_data.get("languages_detected", {})
            if langs:
                f.write(f"- Programming languages: {len(langs)}\n\n")
                f.write("**Language Breakdown:**\n\n")
                for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
                    f.write(f"- {lang}: {count} examples\n")

            # Page tree structure
            page_tree = self.extracted_data.get("page_tree", [])
            if page_tree:
                f.write("\n## Page Tree\n\n")
                f.write("```\n")
                self._write_tree_structure(f, page_tree, indent=0)
                f.write("```\n")

        print(f" Generated: {filename}")

    def _write_tree_structure(
        self,
        f: Any,
        nodes: list[dict[str, Any]],
        indent: int = 0,
    ) -> None:
        """Write a page tree structure to a file in ASCII tree format.

        Args:
            f: File handle to write to.
            nodes: List of tree node dicts with 'title' and 'children'.
            indent: Current indentation level.
        """
        for node in nodes:
            prefix = " " * indent
            title = node.get("title", "Untitled")
            f.write(f"{prefix}- {title}\n")
            children = node.get("children", [])
            if children:
                self._write_tree_structure(f, children, indent + 1)

    def _generate_skill_md(self, categorised: dict[str, dict[str, Any]]) -> None:
        """Generate the main SKILL.md file.

        Creates a comprehensive skill manifest with:
        - YAML frontmatter (name, description).
        - Space information and metadata.
        - Usage guidance ("When to Use This Skill").
        - Content overview with category listing.
        - Key topics extracted from page headings.
        - Code examples (top quality samples).
        - Documentation statistics.
        - Navigation links to reference files.

        Args:
            categorised: Dict of category_key -> category data.
        """
        filename = f"{self.skill_dir}/SKILL.md"
        space_info = self.extracted_data.get("space_info", {})

        # Skill name for frontmatter (lowercase, hyphens, max 64 chars)
        skill_name = self.name.lower().replace("_", "-").replace(" ", "-")[:64]
        desc = self.description[:1024] if len(self.description) > 1024 else self.description

        with open(filename, "w", encoding="utf-8") as f:
            # YAML frontmatter
            f.write("---\n")
            f.write(f"name: {skill_name}\n")
            f.write(f"description: {desc}\n")
            f.write("---\n\n")

            # Header
            space_name = space_info.get("name", self.name.title())
            f.write(f"# {space_name} Documentation Skill\n\n")
            f.write(f"{self.description}\n\n")

            # Space metadata
            if space_info.get("key"):
                f.write("## Space Information\n\n")
                f.write(f"**Space:** {space_info.get('name', 'N/A')}\n")
                f.write(f"**Key:** {space_info.get('key', 'N/A')}\n")
                source = self.extracted_data.get("source", "")
                if source:
                    f.write(f"**Source:** {source}\n")
                f.write(f"**Pages:** {self.extracted_data.get('total_pages', 0)}\n\n")

            # When to Use
            f.write("## When to Use This Skill\n\n")
            f.write("Use this skill when you need to:\n")
            f.write(f"- Understand {space_name} concepts and architecture\n")
            f.write("- Look up API references and technical specifications\n")
            f.write("- Find code examples and implementation patterns\n")
            f.write("- Review processes, guidelines, and best practices\n")
            f.write("- Navigate the documentation structure and find related pages\n\n")

            # Content overview
            total_pages = self.extracted_data.get("total_sections", 0)
            f.write("## Content Overview\n\n")
            f.write(f"**Total Pages:** {total_pages}\n\n")
            f.write("**Categories:**\n\n")
            for cat_key, cat_data in categorised.items():
                page_count = len(cat_data["pages"])
                f.write(f"- **{cat_data['title']}**: {page_count} pages\n")
            f.write("\n")

            # Key topics from headings
            f.write(self._format_key_topics())

            # Code examples (top quality)
            all_code: list[dict[str, Any]] = []
            for section in self.extracted_data.get("pages", []):
                for code in section.get("code_samples", []):
                    code_copy = dict(code)
                    code_copy["source_page"] = section.get("heading", "")
                    all_code.append(code_copy)

            all_code.sort(key=lambda x: x.get("quality_score", 0), reverse=True)
            top_code = all_code[:10]

            if top_code:
                f.write("## Code Examples\n\n")
                f.write("*Top code examples from the documentation*\n\n")

                by_lang: dict[str, list[dict[str, Any]]] = {}
                for code in top_code:
                    lang = code.get("language", "") or "unknown"
                    by_lang.setdefault(lang, []).append(code)

                for lang in sorted(by_lang.keys()):
                    examples = by_lang[lang]
                    lang_display = lang.title() if lang != "unknown" else "Other"
                    f.write(f"### {lang_display} ({len(examples)} examples)\n\n")
                    for i, code in enumerate(examples[:3], 1):
                        quality = code.get("quality_score", 0)
                        source = code.get("source_page", "")
                        title = code.get("title", "")
                        code_text = code.get("code", "")

                        header_parts = [f"**Example {i}**"]
                        if title:
                            header_parts.append(f"({title})")
                        if source:
                            header_parts.append(f"from *{source}*")
                        header_parts.append(f"[Quality: {quality:.1f}/10]")
                        f.write(" ".join(header_parts) + ":\n\n")

                        f.write(f"```{lang}\n")
                        if len(code_text) <= 500:
                            f.write(code_text)
                        else:
                            f.write(code_text[:500] + "\n...")
                        f.write("\n```\n\n")

            # Statistics
            f.write("## Documentation Statistics\n\n")
            f.write(f"- **Total Pages**: {total_pages}\n")
            f.write(f"- **Code Blocks**: {self.extracted_data.get('total_code_blocks', 0)}\n")
            f.write(f"- **Images**: {self.extracted_data.get('total_images', 0)}\n")
            f.write(f"- **Categories**: {len(categorised)}\n")

            langs = self.extracted_data.get("languages_detected", {})
            if langs:
                f.write(f"- **Programming Languages**: {len(langs)}\n\n")
                f.write("**Language Breakdown:**\n\n")
                for lang, count in sorted(langs.items(), key=lambda x: x[1], reverse=True):
                    f.write(f"- {lang}: {count} examples\n")
                f.write("\n")
            else:
                f.write("\n")

            # Navigation
            f.write("## Navigation\n\n")
            f.write("**Reference Files:**\n\n")
            for cat_key, cat_data in categorised.items():
                safe_name = self._sanitize_filename(cat_data["title"])
                f.write(f"- `references/{safe_name}.md` - {cat_data['title']}\n")
            f.write("\n")
            f.write("See `references/index.md` for complete documentation structure.\n\n")

            # Footer
            f.write("---\n\n")
            f.write("**Generated by Skill Seekers** | Confluence Documentation Scraper\n")

        with open(filename, encoding="utf-8") as f:
            line_count = len(f.read().split("\n"))
        print(f" Generated: {filename} ({line_count} lines)")

    def _format_key_topics(self) -> str:
        """Extract key topics from page headings across all sections.

        Collects page titles and sub-headings to identify the main topics
        covered in the documentation.

        Returns:
            Formatted markdown string with key topics section.
        """
        page_titles: list[str] = []
        sub_headings: list[str] = []

        for section in self.extracted_data.get("pages", []):
            heading = section.get("heading", "").strip()
            if heading and len(heading) > 3:
                page_titles.append(heading)

            for sub in section.get("headings", []):
                text = sub.get("text", "").strip()
                level = sub.get("level", "h3")
                if text and len(text) > 3 and level in ("h2", "h3"):
                    sub_headings.append(text)

        if not page_titles and not sub_headings:
            return ""

        content = "## Key Topics\n\n"
        content += "*Main topics covered in this documentation*\n\n"

        if page_titles:
            content += "**Pages:**\n\n"
            for title in page_titles[:15]:
                content += f"- {title}\n"
            if len(page_titles) > 15:
                content += f"- *...and {len(page_titles) - 15} more*\n"
            content += "\n"

        if sub_headings:
            # Deduplicate and show top subtopics
            unique_subs = list(dict.fromkeys(sub_headings))
            content += "**Subtopics:**\n\n"
            for heading in unique_subs[:20]:
                content += f"- {heading}\n"
            if len(unique_subs) > 20:
                content += f"- *...and {len(unique_subs) - 20} more*\n"
            content += "\n"

        return content

    # ──────────────────────────────────────────────────────────────────────
    # Utility helpers
    # ──────────────────────────────────────────────────────────────────────

    def _sanitize_filename(self, name: str) -> str:
        """Convert a string to a safe filename.

        Removes special characters, converts spaces and hyphens to underscores,
        and lowercases the result.

        Args:
            name: Raw string to sanitise.

        Returns:
            Filesystem-safe filename string.
        """
        safe = re.sub(r"[^\w\s-]", "", name.lower())
        safe = re.sub(r"[-\s]+", "_", safe)
        return safe[:100]  # Limit filename length
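    # Illustrative walk-through (editor's note, not part of the original
    # source): the two substitutions above mean a page title such as
    # "Getting Started - API!" is lowercased, has the "!" stripped by the
    # first sub, and has its spaces and hyphen collapsed by the second,
    # yielding the category filename stem "getting_started_api".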


# ──────────────────────────────────────────────────────────────────────────────
# Module-level helpers
# ──────────────────────────────────────────────────────────────────────────────


def _score_code_quality(code: str) -> float:
    """Simple quality heuristic for code blocks (0-10 scale).

    Scores based on line count, presence of definitions, imports,
    indentation, and operator usage. Short snippets are penalised.

    Args:
        code: Source code string.

    Returns:
        Quality score between 0.0 and 10.0.
    """
    if not code:
        return 0.0

    score = 5.0
    lines = code.strip().split("\n")
    line_count = len(lines)

    # More lines = more substantial
    if line_count >= 10:
        score += 2.0
    elif line_count >= 5:
        score += 1.0

    # Has function/class definitions
    if re.search(r"\b(def |class |function |func |fn )", code):
        score += 1.5

    # Has imports/require
    if re.search(r"\b(import |from .+ import|require\(|#include|using )", code):
        score += 0.5

    # Has indentation (structured code)
    if re.search(r"^ ", code, re.MULTILINE):
        score += 0.5

    # Has assignment, operators, or common code syntax
    if re.search(r"[=:{}()\[\]]", code):
        score += 0.3

    # Very short snippets get penalised
    if len(code) < 30:
        score -= 2.0

    return min(10.0, max(0.0, score))
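# Illustrative scoring walk-through (editor's note, not part of the original
# source; values follow the heuristic above): a one-line snippet like
# "x = 1" starts at 5.0, gains +0.3 for assignment/operator syntax, and
# loses 2.0 for being under 30 characters, ending at 3.3. A ten-line
# indented function containing a "def " and an import collects
# 5.0 + 2.0 + 1.5 + 0.5 + 0.5 + 0.3 = 9.8, just under the 10.0 cap.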
|
|
|
|
|
|
# ──────────────────────────────────────────────────────────────────────────────
|
|
# CLI entry point
|
|
# ──────────────────────────────────────────────────────────────────────────────
|
|
|
|
|
|
def main() -> int:
|
|
"""CLI entry point for the Confluence scraper.
|
|
|
|
Parses command-line arguments and runs the extraction/build pipeline.
|
|
Supports three workflows:
|
|
|
|
1. **API mode**: ``--base-url URL --space-key KEY --name my-skill``
|
|
2. **Export mode**: ``--export-path ./export-dir/ --name my-skill``
|
|
3. **Build from JSON**: ``--from-json my-skill_extracted.json``
|
|
|
|
Returns:
|
|
Exit code (0 for success, non-zero for failure).
|
|
"""
|
|
parser = argparse.ArgumentParser(
|
|
description="Convert Confluence documentation to AI-ready skills",
|
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
|
epilog=(
|
|
"Examples:\n"
|
|
" %(prog)s --base-url https://wiki.example.com "
|
|
"--space-key PROJ --name my-wiki\n"
|
|
" %(prog)s --export-path ./confluence-export/ --name my-wiki\n"
|
|
" %(prog)s --from-json my-wiki_extracted.json\n"
|
|
),
|
|
)
|
|
|
|
# Standard shared arguments
|
|
from .arguments.common import add_all_standard_arguments
|
|
|
|
add_all_standard_arguments(parser)
|
|
|
|
# Override enhance-level default to 0 for Confluence
|
|
for action in parser._actions:
|
|
if hasattr(action, "dest") and action.dest == "enhance_level":
|
|
action.default = 0
|
|
action.help = (
|
|
"AI enhancement level (auto-detects API vs LOCAL mode): "
|
|
"0=disabled (default for Confluence), 1=SKILL.md only, "
|
|
"2=+architecture/config, 3=full enhancement. "
|
|
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
|
|
"otherwise LOCAL (Claude Code, Kimi, etc.)"
|
|
)
|
|
|
|
    # Confluence-specific arguments
    parser.add_argument(
        "--base-url",
        type=str,
        help="Confluence instance base URL (e.g., https://wiki.example.com)",
        metavar="URL",
    )
    parser.add_argument(
        "--space-key",
        type=str,
        help="Confluence space key to extract (e.g., PROJ, DEV)",
        metavar="KEY",
    )
    parser.add_argument(
        "--export-path",
        type=str,
        help="Path to Confluence HTML/XML export directory",
        metavar="PATH",
    )
    parser.add_argument(
        "--username",
        type=str,
        help="Confluence username / email for API auth (or set CONFLUENCE_USERNAME env var)",
        metavar="USER",
    )
    parser.add_argument(
        "--token",
        type=str,
        help="Confluence API token for API auth (or set CONFLUENCE_TOKEN env var)",
        metavar="TOKEN",
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=500,
        help="Maximum number of pages to fetch (default: 500)",
        metavar="N",
    )
    parser.add_argument(
        "--from-json",
        type=str,
        help="Build skill from previously extracted JSON data",
        metavar="FILE",
    )

    args = parser.parse_args()

    # Setup logging
    if getattr(args, "quiet", False):
        logging.basicConfig(level=logging.WARNING, format="%(message)s")
    elif getattr(args, "verbose", False):
        logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
    else:
        logging.basicConfig(level=logging.INFO, format="%(message)s")

    # Handle --dry-run
    if getattr(args, "dry_run", False):
        source = (
            getattr(args, "base_url", None)
            or getattr(args, "export_path", None)
            or getattr(args, "from_json", None)
            or "(none)"
        )
        print(f"\n{'=' * 60}")
        print("DRY RUN: Confluence Extraction")
        print(f"{'=' * 60}")
        print(f"Source: {source}")
        print(f"Space key: {getattr(args, 'space_key', None) or '(N/A)'}")
        print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
        print(f"Max pages: {getattr(args, 'max_pages', 500)}")
        print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
        print("\n Dry run complete")
        return 0

    # Validate inputs
    has_api = getattr(args, "base_url", None) and getattr(args, "space_key", None)
    has_export = getattr(args, "export_path", None)
    has_json = getattr(args, "from_json", None)

    if not (has_api or has_export or has_json):
        parser.error(
            "Must specify one of:\n"
            " --base-url URL --space-key KEY (API mode)\n"
            " --export-path PATH (export mode)\n"
            " --from-json FILE (build from JSON)"
        )

    # Build from pre-extracted JSON
    if has_json:
        name = getattr(args, "name", None) or Path(args.from_json).stem.replace("_extracted", "")
        config: dict[str, Any] = {
            "name": name,
            "description": (
                getattr(args, "description", None) or f"Use when referencing {name} documentation"
            ),
        }
        try:
            converter = ConfluenceToSkillConverter(config)
            converter.load_extracted_data(args.from_json)
            converter.build_skill()
        except Exception as e:
            print(f"\n Error: {e}", file=sys.stderr)
            sys.exit(1)
        return 0

    # Determine name
    if not getattr(args, "name", None):
        if has_api:
            args.name = args.space_key.lower()
        elif has_export:
            args.name = Path(args.export_path).name
        else:
            args.name = "confluence-skill"

    # Build config
    config = {
        "name": args.name,
        "base_url": getattr(args, "base_url", "") or "",
        "space_key": getattr(args, "space_key", "") or "",
        "export_path": getattr(args, "export_path", "") or "",
        "username": getattr(args, "username", "") or "",
        "token": getattr(args, "token", "") or "",
        "max_pages": getattr(args, "max_pages", 500),
    }
    if getattr(args, "description", None):
        config["description"] = args.description

    # Create converter and run
    try:
        converter = ConfluenceToSkillConverter(config)

        if not converter.extract_confluence():
            print("\n Confluence extraction failed", file=sys.stderr)
            sys.exit(1)

        converter.build_skill()

        # Enhancement workflow integration
        from skill_seekers.cli.workflow_runner import run_workflows

        workflow_executed, workflow_names = run_workflows(args)
        workflow_name = ", ".join(workflow_names) if workflow_names else None

        # Traditional enhancement (complements workflow system)
        if getattr(args, "enhance_level", 0) > 0:
            api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
            mode = "API" if api_key else "LOCAL"

            print("\n" + "=" * 80)
            print(f" AI Enhancement ({mode} mode, level {args.enhance_level})")
            print("=" * 80)
            if workflow_executed:
                print(f" Running after workflow: {workflow_name}")
                print(
                    " (Workflow provides specialised analysis,"
                    " enhancement provides general improvements)"
                )
                print("")

            skill_dir = converter.skill_dir
            if api_key:
                try:
                    from skill_seekers.cli.enhance_skill import enhance_skill_md

                    enhance_skill_md(skill_dir, api_key)
                    print(" API enhancement complete!")
                except ImportError:
                    print(" API enhancement not available. Falling back to LOCAL mode...")
                    from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer

                    agent = getattr(args, "agent", None)
                    agent_cmd = getattr(args, "agent_cmd", None)
                    enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
                    enhancer.run(headless=True)
                    print(" Local enhancement complete!")
            else:
                from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer

                agent = getattr(args, "agent", None)
                agent_cmd = getattr(args, "agent_cmd", None)
                enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
                enhancer.run(headless=True)
                print(" Local enhancement complete!")

    except (ValueError, RuntimeError, FileNotFoundError) as e:
        print(f"\n Error: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"\n Unexpected error during Confluence processing: {e}", file=sys.stderr)
        import traceback

        traceback.print_exc()
        sys.exit(1)

    return 0


if __name__ == "__main__":
    sys.exit(main())