feat: Grand Unification — one command, one interface, direct converters (#346)

* fix: resolve 8 pipeline bugs found during skill quality review

- Fix "0 APIs extracted" from documentation by enriching summary.json
  with individual page file content before conflict detection
- Fix all "Unknown" entries in merged_api.md by injecting dict keys
  as API names and falling back to AI merger field names
- Fix frontmatter using raw slugs instead of config name by
  normalizing frontmatter after SKILL.md generation
- Fix leaked absolute filesystem paths in patterns/index.md by
  stripping .skillseeker-cache repo clone prefixes
- Fix ARCHITECTURE.md file count always showing "1 files" by
  counting files per language from code_analysis data
- Fix YAML parse errors on GitHub Actions workflows by converting
  boolean keys (on: true) to strings
- Fix false React/Vue.js framework detection in C# projects by
  filtering web frameworks based on primary language
- Improve how-to guide generation by broadening workflow example
  filter to include setup/config examples with sufficient complexity
- Fix test_git_sources_e2e failures caused by git init default
  branch being 'main' instead of 'master'
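
For context on the YAML fix: YAML 1.1 loaders resolve bare `on`/`off` keys
to booleans, so a GitHub Actions workflow's `on:` trigger loads as the key
`True`. A minimal sketch of the kind of key normalization described (the
helper name here is hypothetical, not the one in the codebase):

```python
def stringify_bool_keys(node):
    # YAML 1.1 parsers read bare `on`/`off` keys as booleans; map them back
    # to strings so downstream tooling sees the original workflow keys.
    if isinstance(node, dict):
        return {
            ("on" if key is True else "off" if key is False else key):
                stringify_bool_keys(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [stringify_bool_keys(item) for item in node]
    return node

# A workflow loaded with a YAML 1.1 parser: the `on:` key became True.
parsed = {True: {"push": {"branches": ["main"]}}, "jobs": {}}
fixed = stringify_bool_keys(parsed)
# fixed == {"on": {"push": {"branches": ["main"]}}, "jobs": {}}
```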

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address 6 review issues in ExecutionContext implementation

Fixes from code review:

1. Mode resolution (#3, critical): _args_to_data no longer unconditionally
   overwrites mode; it writes mode="api" only when --api-key is explicitly
   passed. Env-var-based mode detection moved to _default_data() as the
   lowest priority.

2. Re-initialization warning (#4): initialize() now logs a debug message
   when called a second time instead of silently returning the stale instance.

3. _raw_args preserved in override (#5): temp context now copies _raw_args
   from parent so get_raw() works correctly inside override blocks.

4. test_local_mode_detection env cleanup (#7): test now saves/restores
   API key env vars to prevent failures when ANTHROPIC_API_KEY is set.

5. _load_config_file error handling (#8): wraps FileNotFoundError and
   JSONDecodeError with user-friendly ValueError messages.

6. Lint fixes: added logging import, fixed Generator import from
   collections.abc, fixed AgentClient return type annotation.

Remaining P2/P3 items (documented, not blocking):
- Lock TOCTOU in override() — safe on CPython, needs fix for no-GIL
- get() reads _instance without lock — same CPython caveat
- config_path not stored on instance
- AnalysisSettings.depth not Literal constrained

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address all remaining P2/P3 review issues in ExecutionContext

1. Thread safety: get() now acquires _lock before reading _instance (#2)
2. Thread safety: override() saves/restores _initialized flag to prevent
   re-init during override blocks (#10)
3. Config path stored: _config_path PrivateAttr + config_path property (#6)
4. Literal validation: AnalysisSettings.depth now uses
   Literal["surface", "deep", "full"] — rejects invalid values (#9)
5. Test updated: test_analysis_depth_choices now expects ValidationError
   for invalid depth; added test_analysis_depth_valid_choices
6. Lint cleanup: removed unused imports, fixed whitespace in tests
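
Item 4's constraint can be illustrated without pydantic. The real field
lives on a pydantic model (which raises ValidationError); this
dependency-free sketch shows the same effect with ValueError, and the
"deep" default is an assumption:

```python
from typing import Literal, get_args

Depth = Literal["surface", "deep", "full"]

class AnalysisSettings:
    """Sketch only: the real class is a pydantic model, so invalid depth
    raises ValidationError rather than this ValueError."""

    def __init__(self, depth: str = "deep"):
        if depth not in get_args(Depth):
            raise ValueError(
                f"depth must be one of {get_args(Depth)}, got {depth!r}"
            )
        self.depth = depth
```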

All 10 previously reported issues now resolved.
26 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore 5 truncated scrapers, migrate unified_scraper, fix context init

5 scrapers had main() truncated with "# Original main continues here..."
after Kimi's migration — business logic was never connected:
- html_scraper.py — restored HtmlToSkillConverter extraction + build
- pptx_scraper.py — restored PptxToSkillConverter extraction + build
- confluence_scraper.py — restored ConfluenceToSkillConverter with 3 modes
- notion_scraper.py — restored NotionToSkillConverter with 4 sources
- chat_scraper.py — restored ChatToSkillConverter extraction + build

unified_scraper.py — migrated main() to context-first pattern with argv fallback

Fixed context initialization chain:
- main.py no longer initializes ExecutionContext (was stealing init from commands)
- create_command.py now passes config_path from source_info.parsed
- execution_context.py handles SourceInfo.raw_input (not raw_source)

All 18 scrapers now genuinely migrated. 26 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve 7 data flow conflicts between ExecutionContext and legacy paths

Critical fixes (CLI args silently lost):
- unified_scraper Phase 6: reads ctx.enhancement.level instead of raw JSON
  when args=None (#3, #4)
- unified_scraper Phase 6 agent: reads ctx.enhancement.agent instead of
  3 independent env var lookups (#5)
- doc_scraper._run_enhancement: uses agent_client.api_key instead of raw
  os.environ.get() — respects config file api_key (#1)

Important fixes:
- main._handle_analyze_command: populates _fake_args from ExecutionContext
  so --agent and --api-key aren't lost in analyze→enhance path (#6)
- doc_scraper type annotations: replaced forward refs with Any to avoid
  F821 undefined name errors

All changes include RuntimeError fallback for backward compatibility when
ExecutionContext isn't initialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 3 crashes + 1 stub in migrated scrapers found by deep scan

1. github_scraper.py: args.scrape_only and args.enhance_level crash when
   args=None (context path). Guarded with if args and getattr(). Also
   fixed agent fallback to read ctx.enhancement.agent.

2. codebase_scraper.py: args.output and args.skip_api_reference crash in
   summary block when args=None. Replaced with output_dir local var and
   ctx.analysis.skip_api_reference.

3. epub_scraper.py: main() was still a stub ending with "# Rest of main()
   continues..." — restored full extraction + build + enhancement logic
   using ctx values exclusively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: complete ExecutionContext migration for remaining scrapers

Kimi's Phase 4 scraper migrations + Claude's review fixes.
All 18 scrapers now use context-first pattern with argv fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 1 — ExecutionContext.get() always returns context (no RuntimeError)

get() now returns a default context instead of raising RuntimeError when
not explicitly initialized. This eliminates the need for try/except
RuntimeError blocks in all 18 scrapers.

Components can always call ExecutionContext.get() safely — it returns
defaults if not initialized, or the explicitly initialized instance.

Updated tests: test_get_returns_defaults_when_not_initialized,
test_reset_clears_instance (no longer expects RuntimeError).
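
The new contract can be sketched as a lock-guarded singleton whose get()
lazily creates a default instance. Everything beyond the get()/reset()
behavior described above (the initialize() signature, the flag handling) is
an assumption:

```python
import threading

class ExecutionContext:
    """Sketch: get() never raises. It returns the explicitly initialized
    instance when present, otherwise lazily creates a default context."""

    _instance = None
    _initialized = False
    _lock = threading.Lock()

    @classmethod
    def initialize(cls, **settings):
        with cls._lock:
            cls._instance = cls()
            cls._instance.__dict__.update(settings)  # hypothetical settings
            cls._initialized = True
            return cls._instance

    @classmethod
    def get(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()  # default context, no RuntimeError
            return cls._instance

    @classmethod
    def reset(cls):
        with cls._lock:
            cls._instance = None
            cls._initialized = False
```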

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 2a-c — remove 16 individual scraper CLI commands

Removed individual scraper commands from:
- COMMAND_MODULES in main.py (16 entries: scrape, github, pdf, word,
  epub, video, jupyter, html, openapi, asciidoc, pptx, rss, manpage,
  confluence, notion, chat)
- pyproject.toml entry points (16 skill-seekers-<type> binaries)
- parsers/__init__.py (16 parser registrations)

All source types now accessed via: skill-seekers create <source>
Kept: create, unified, analyze, enhance, package, upload, install,
      install-agent, config, doctor, and utility commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: create SkillConverter base class + converter registry

New base interface that all 17 converters will inherit:
- SkillConverter.run() — extract + build (same call for all types)
- SkillConverter.extract() — override in subclass
- SkillConverter.build_skill() — override in subclass
- get_converter(source_type, config) — factory from registry
- CONVERTER_REGISTRY — maps source type → (module, class)

create_command will use get_converter() instead of _call_module().
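
A sketch of the shape this interface takes. The registry entries' module
paths are illustrative, and the details of run()'s return handling are
assumptions:

```python
import importlib

# Illustrative entries; real module paths in the repo may differ.
CONVERTER_REGISTRY: dict[str, tuple[str, str]] = {
    "asciidoc": ("skill_seekers.cli.asciidoc_scraper", "AsciiDocToSkillConverter"),
    "chat": ("skill_seekers.cli.chat_scraper", "ChatToSkillConverter"),
}

class SkillConverter:
    SOURCE_TYPE = ""

    def __init__(self, config: dict):
        self.config = config
        self.name = config.get("name", "")

    def run(self) -> int:
        """One call for every source type: extract, then build."""
        self.extract()
        return 0 if self.build_skill() else 1

    def extract(self):                  # overridden by each converter
        raise NotImplementedError

    def build_skill(self) -> bool:      # overridden by each converter
        raise NotImplementedError

def get_converter(source_type: str, config: dict) -> SkillConverter:
    """Factory: look up (module, class) in the registry and instantiate."""
    try:
        module_path, class_name = CONVERTER_REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"Unknown source type: {source_type!r}") from None
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(config)
```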

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Grand Unification — one command, one interface, direct converters

Complete the Grand Unification refactor: `skill-seekers create` is now
the single entry point for all 18 source types. Individual scraper CLI
commands (scrape, github, pdf, analyze, unified, etc.) are removed.

## Architecture changes

- **18 SkillConverter subclasses**: Every scraper now inherits SkillConverter
  with extract() + build_skill() + SOURCE_TYPE. Factory via get_converter().
- **create_command.py rewritten**: _build_config() constructs config dicts
  from ExecutionContext for each source type. Direct converter.run() calls
  replace the old _build_argv() + sys.argv swap + _call_module() machinery.
- **main.py simplified**: create command bypasses _reconstruct_argv entirely,
  calls CreateCommand(args).execute() directly. analyze/unified commands
  removed (create handles both via auto-detection).
- **CreateParser mode="all"**: Top-level parser now accepts all 120+ flags
  (--browser, --max-pages, --depth, etc.) since create is the only entry.
- **Centralized enhancement**: Runs once in create_command after converter,
  not duplicated in each scraper.
- **MCP tools use converters**: 5 scraping tools call get_converter()
  directly instead of subprocess. Config type auto-detected from keys.
- **ConfigValidator → UniSkillConfigValidator**: Renamed with backward-
  compat alias.
- **Data flow**: AgentClient + LocalSkillEnhancer read ExecutionContext
  first, env vars as fallback.

## What was removed

- main() from all 18 scraper files (~3400 lines)
- 18 CLI commands from COMMAND_MODULES + pyproject.toml entry points
- analyze + unified parsers from parser registry
- _build_argv, _call_module, _SKIP_ARGS, _DEST_TO_FLAG, all _route_*()
- setup_argument_parser, get_configuration, _check_deprecated_flags
- Tests referencing removed commands/functions

## Net impact

51 files changed, ~6000 lines removed. 2996 tests pass, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review fixes for Grand Unification PR

- Add autouse conftest fixture to reset ExecutionContext singleton between tests
- Replace hardcoded defaults in _is_explicitly_set() with parser-derived defaults
- Upgrade ExecutionContext double-init log from debug to info
- Use logger.exception() in SkillConverter.run() to preserve tracebacks
- Fix docstring "17 types" → "18 types" in skill_converter.py
- DRY up 10 copy-paste help handlers into dict + loop (~100 lines removed)
- Fix 2 CI workflows still referencing removed `skill-seekers scrape` command
- Remove broken pyproject.toml entry point for codebase_scraper:main

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve 12 logic/flow issues found in deep review

Critical fixes:
- UnifiedScraper.run(): replace sys.exit(1) with return 1, add return 0
- doc_scraper: use ExecutionContext.get() when already initialized instead
  of re-calling initialize() which silently discards new config
- unified_scraper: define enhancement_config before try/except to prevent
  UnboundLocalError in LOCAL enhancement timeout read

Important fixes:
- override(): cleaner tuple save/restore for singleton swap
- --agent without --api-key now sets mode="local" so env API key doesn't
  override explicit agent choice
- Remove DeprecationWarning from _reconstruct_argv (fires on every
  non-create command in production)
- Rewrite scrape_generic_tool to use get_converter() instead of subprocess
  calls to removed main() functions
- SkillConverter.run() checks build_skill() return value, returns 1 if False
- estimate_pages_tool uses -m module invocation instead of .py file path
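
Why -m matters: after installation into site-packages, source files no
longer sit at a predictable .py path, while module invocation works in both
source and installed layouts. A sketch using a stdlib module as a stand-in
for the real tool:

```python
import subprocess
import sys

# Invoking by module name (-m) instead of a .py file path survives
# installation; json.tool stands in for the real estimate-pages module.
result = subprocess.run(
    [sys.executable, "-m", "json.tool"],
    input='{"pages": 3}',
    capture_output=True,
    text=True,
)
```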

Low-priority fixes:
- get_converter() raises descriptive ValueError on class name typo
- test_default_values: save/clear API key env vars before asserting mode
- test_get_converter_pdf: fix config key "path" → "pdf_path"

3056 passed, 4 failed (pre-existing dep version issues), 32 skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update MCP server tests to mock converter instead of subprocess

scrape_docs_tool now uses get_converter() + _run_converter() in-process
instead of run_subprocess_with_streaming. Update 4 TestScrapeDocsTool
tests to mock the converter layer instead of the removed subprocess path.
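
The test change follows a common pattern: hand the tool a mocked converter
(or patch the factory that resolves it) and assert on the in-process call
rather than subprocess output. A generic sketch; scrape_docs_tool's real
signature is not reproduced here:

```python
from unittest.mock import MagicMock

def run_tool(converter) -> int:
    # Stand-in for the tool's core: run the converter in-process.
    return converter.run()

mock_converter = MagicMock()
mock_converter.run.return_value = 0

assert run_tool(mock_converter) == 0
mock_converter.run.assert_called_once()
```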

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: YusufKaraaslanSpyke <yusuf@spykegames.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
yusyus
2026-04-05 23:00:52 +03:00
committed by GitHub
parent 2a14309342
commit 6d37e43b83
62 changed files with 2841 additions and 6840 deletions


@@ -21,6 +21,9 @@ from .llms_txt_detector import LlmsTxtDetector
from .llms_txt_downloader import LlmsTxtDownloader
from .llms_txt_parser import LlmsTxtParser
# ExecutionContext - single source of truth for all configuration
from .execution_context import ExecutionContext, get_context
try:
from .utils import open_folder, read_reference_files
except ImportError:
@@ -35,6 +38,8 @@ __all__ = [
"LlmsTxtDetector",
"LlmsTxtDownloader",
"LlmsTxtParser",
"ExecutionContext",
"get_context",
"open_folder",
"read_reference_files",
"__version__",


@@ -164,9 +164,16 @@ class AgentClient:
Resolved from: arg → env SKILL_SEEKER_AGENT → "claude"
api_key: API key override. If None, auto-detected from env vars.
"""
# Resolve agent name: param > ExecutionContext > env var > default
try:
from skill_seekers.cli.execution_context import ExecutionContext
ctx = ExecutionContext.get()
ctx_agent = ctx.enhancement.agent or ""
except Exception:
ctx_agent = ""
env_agent = os.environ.get("SKILL_SEEKER_AGENT", "").strip()
self.agent = normalize_agent_name(agent or ctx_agent or env_agent or "claude")
self.agent_display = AGENT_PRESETS.get(self.agent, {}).get("display_name", self.agent)
# Detect API key and provider


@@ -139,6 +139,10 @@ class ArchitecturalPatternDetector:
"Laravel": ["laravel", "illuminate", "artisan", "app/Http/Controllers", "app/Models"],
}
# Web frameworks should only match for web-language projects
_WEB_FRAMEWORKS = {"React", "Vue.js", "Express", "Angular"}
_WEB_LANGUAGES = {"JavaScript", "TypeScript", "Python", "PHP", "Ruby"}
def __init__(self, enhance_with_ai: bool = True, agent: str | None = None):
"""
Initialize detector.
@@ -268,11 +272,26 @@ class ArchitecturalPatternDetector:
# Return early to prevent web framework false positives
return detected
# Determine primary language to filter out impossible framework matches
# e.g., C#/C++ projects should not match React/Vue.js/Express
lang_counts: dict[str, int] = {}
for file_data in files:
lang = file_data.get("language", "")
if lang:
lang_counts[lang] = lang_counts.get(lang, 0) + 1
primary_lang = max(lang_counts, key=lang_counts.get) if lang_counts else ""
skip_web = primary_lang and primary_lang not in self._WEB_LANGUAGES
# Check other frameworks (including imports - fixes #239)
for framework, markers in self.FRAMEWORK_MARKERS.items():
if framework in ["Unity", "Unreal", "Godot"]:
continue # Already checked
# Skip web frameworks for non-web language projects
if skip_web and framework in self._WEB_FRAMEWORKS:
continue
# Check in file paths, directory structure, AND imports
path_matches = sum(1 for marker in markers if marker.lower() in all_content.lower())
dir_matches = sum(1 for marker in markers if marker.lower() in dir_content.lower())


@@ -938,3 +938,14 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
action="store_true",
help=argparse.SUPPRESS,
)
def get_create_defaults() -> dict[str, Any]:
"""Build a defaults dict from a throwaway parser with all create arguments.
Used by CreateCommand._is_explicitly_set() to compare argument values
against their registered defaults instead of hardcoded values.
"""
temp = argparse.ArgumentParser(add_help=False)
add_create_arguments(temp, mode="all")
return {action.dest: action.default for action in temp._actions if action.dest != "help"}


@@ -15,12 +15,10 @@ Usage:
skill-seekers asciidoc --from-json doc_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard — asciidoc library for HTML conversion
@@ -31,6 +29,8 @@ try:
except ImportError:
ASCIIDOC_AVAILABLE = False
from skill_seekers.cli.skill_converter import SkillConverter
logger = logging.getLogger(__name__)
ASCIIDOC_EXTENSIONS = {".adoc", ".asciidoc", ".asc", ".ad"}
@@ -112,7 +112,7 @@ def _score_code_quality(code: str) -> float:
return min(10.0, max(0.0, score))
class AsciiDocToSkillConverter:
class AsciiDocToSkillConverter(SkillConverter):
"""Convert AsciiDoc documentation to an AI-ready skill.
Handles single ``.adoc`` files and directories. Content is parsed into
@@ -120,7 +120,10 @@ class AsciiDocToSkillConverter:
directory layout (SKILL.md, references/, etc.).
"""
SOURCE_TYPE = "asciidoc"
def __init__(self, config: dict) -> None:
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.asciidoc_path: str = config.get("asciidoc_path", "")
@@ -132,6 +135,10 @@ class AsciiDocToSkillConverter:
self.categories: dict = config.get("categories", {})
self.extracted_data: dict | None = None
def extract(self):
"""Extract content from AsciiDoc files (SkillConverter interface)."""
self.extract_asciidoc()
# ------------------------------------------------------------------
# Extraction
# ------------------------------------------------------------------
@@ -943,147 +950,3 @@ class AsciiDocToSkillConverter:
def _in_range(pos: int, ranges: list[tuple[int, int]]) -> bool:
"""Check whether pos falls within any consumed range."""
return any(s <= pos < e for s, e in ranges)
# ============================================================================
# CLI entry point
# ============================================================================
def main() -> int:
"""CLI entry point for AsciiDoc scraper."""
from skill_seekers.cli.arguments.asciidoc import add_asciidoc_arguments
parser = argparse.ArgumentParser(
description="Convert AsciiDoc documentation to skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_asciidoc_arguments(parser)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = (
getattr(args, "asciidoc_path", None) or getattr(args, "from_json", None) or "(none)"
)
print(f"\n{'=' * 60}")
print("DRY RUN: AsciiDoc Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return 0
# Validate inputs
if not (getattr(args, "asciidoc_path", None) or getattr(args, "from_json", None)):
parser.error("Must specify --asciidoc-path or --from-json")
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} documentation",
}
try:
converter = AsciiDocToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Direct AsciiDoc mode
if not getattr(args, "name", None):
p = Path(args.asciidoc_path)
args.name = p.stem if p.is_file() else p.name
config = {
"name": args.name,
"asciidoc_path": args.asciidoc_path,
"description": getattr(args, "description", None),
}
try:
converter = AsciiDocToSkillConverter(config)
# Extract
if not converter.extract_asciidoc():
print("\n❌ AsciiDoc extraction failed - see error above", file=sys.stderr)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis,"
" enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except (FileNotFoundError, ValueError, RuntimeError) as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during AsciiDoc processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())


@@ -34,16 +34,16 @@ Usage:
skill-seekers chat --from-json myteam_extracted.json --name myteam
"""
import argparse
import json
import logging
import os
import re
import sys
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path
from skill_seekers.cli.skill_converter import SkillConverter
# Optional dependency guard — Slack SDK
try:
from slack_sdk import WebClient
@@ -243,7 +243,7 @@ def _score_code_quality(code: str) -> float:
# ---------------------------------------------------------------------------
class ChatToSkillConverter:
class ChatToSkillConverter(SkillConverter):
"""Convert Slack or Discord chat history into an AI-ready skill.
Follows the same pipeline pattern as the EPUB, Jupyter, and PPTX scrapers:
@@ -261,6 +261,8 @@ class ChatToSkillConverter:
channel, date range, and detected topic.
"""
SOURCE_TYPE = "chat"
def __init__(self, config: dict) -> None:
"""Initialize the converter with a configuration dictionary.
@@ -276,6 +278,7 @@ class ChatToSkillConverter:
- description (str): Skill description (optional, inferred
if absent).
"""
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.export_path: str = config.get("export_path", "")
@@ -294,6 +297,10 @@ class ChatToSkillConverter:
# Extracted data (populated by extract_chat or load_extracted_data)
self.extracted_data: dict | None = None
def extract(self):
"""Extract content from chat history (SkillConverter interface)."""
self.extract_chat()
# ------------------------------------------------------------------
# Extraction — public entry point
# ------------------------------------------------------------------
@@ -1730,195 +1737,3 @@ class ChatToSkillConverter:
"""
safe = re.sub(r"[^\w\s-]", "", name.lower())
return re.sub(r"[-\s]+", "_", safe)
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> int:
"""CLI entry point for the Slack/Discord chat scraper.
Parses command-line arguments and runs the extraction and
skill-building pipeline. Supports export import, API fetch,
and loading from previously extracted JSON.
Returns:
Exit code (0 for success, non-zero for errors).
"""
from .arguments.chat import add_chat_arguments
parser = argparse.ArgumentParser(
description="Convert Slack/Discord chat history to AI-ready skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Slack workspace export
%(prog)s --export-path ./slack-export/ --platform slack --name myteam
# Slack API
%(prog)s --platform slack --token xoxb-... --channel C01234 --name myteam
# Discord export (DiscordChatExporter)
%(prog)s --export-path ./discord-export.json --platform discord --name myserver
# Discord API
%(prog)s --platform discord --token Bot-token --channel 12345 --name myserver
# From previously extracted JSON
%(prog)s --from-json myteam_extracted.json --name myteam
""",
)
add_chat_arguments(parser)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if args.dry_run:
source = args.export_path or args.from_json or f"{args.platform}-api"
print(f"\n{'=' * 60}")
print("DRY RUN: Chat Extraction")
print(f"{'=' * 60}")
print(f"Platform: {args.platform}")
print(f"Source: {source}")
print(f"Name: {args.name or '(auto-detect)'}")
print(f"Channel: {args.channel or '(all)'}")
print(f"Max messages: {args.max_messages}")
print(f"Enhance level: {args.enhance_level}")
print(f"\n✅ Dry run complete")
return 0
# Validate inputs
if args.from_json:
# Build from previously extracted JSON
name = args.name or Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": name,
"description": (args.description or f"Use when referencing {name} chat knowledge base"),
}
try:
converter = ChatToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Require either --export-path or --token for extraction
if not args.export_path and not args.token:
parser.error(
"Must specify --export-path (export mode), --token (API mode), "
"or --from-json (build from extracted data)"
)
if not args.name:
if args.export_path:
args.name = Path(args.export_path).stem
else:
args.name = f"{args.platform}_chat"
config = {
"name": args.name,
"export_path": args.export_path or "",
"platform": args.platform,
"token": args.token or "",
"channel": args.channel or "",
"max_messages": args.max_messages,
"description": args.description,
}
try:
converter = ChatToSkillConverter(config)
# Extract
if not converter.extract_chat():
print(
"\n❌ Chat extraction failed - see error above",
file=sys.stderr,
)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis, "
"enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import (
LocalSkillEnhancer,
)
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import (
LocalSkillEnhancer,
)
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except (FileNotFoundError, ValueError) as e:
print(f"\n❌ Input error: {e}", file=sys.stderr)
sys.exit(1)
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(
f"\n❌ Unexpected error during chat processing: {e}",
file=sys.stderr,
)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())


@@ -24,12 +24,10 @@ Credits:
- pathspec for .gitignore support: https://pypi.org/project/pathspec/
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
from typing import Any
@@ -38,7 +36,7 @@ from skill_seekers.cli.code_analyzer import CodeAnalyzer
from skill_seekers.cli.config_extractor import ConfigExtractor
from skill_seekers.cli.dependency_analyzer import DependencyAnalyzer
from skill_seekers.cli.signal_flow_analyzer import SignalFlowAnalyzer
from skill_seekers.cli.utils import setup_logging
from skill_seekers.cli.skill_converter import SkillConverter
# Try to import pathspec for .gitignore support
try:
@@ -2147,278 +2145,56 @@ def _generate_references(output_dir: Path):
logger.info(f"✅ Generated references directory: {references_dir}")
def _check_deprecated_flags(args):
    """Check for deprecated flags and show migration warnings."""
    warnings = []

    # Deprecated: --depth
    if hasattr(args, "depth") and args.depth:
        preset_map = {
            "surface": "quick",
            "deep": "standard",
            "full": "comprehensive",
        }
        suggested_preset = preset_map.get(args.depth, "standard")
        warnings.append(
            f"⚠️ DEPRECATED: --depth {args.depth} → use --preset {suggested_preset} instead"
        )

    # Deprecated: --ai-mode
    if hasattr(args, "ai_mode") and args.ai_mode and args.ai_mode != "auto":
        if args.ai_mode == "api":
            warnings.append(
                "⚠️ DEPRECATED: --ai-mode api → use --enhance-level with ANTHROPIC_API_KEY set instead"
            )
        elif args.ai_mode == "local":
            warnings.append(
                "⚠️ DEPRECATED: --ai-mode local → use --enhance-level without API key instead"
            )
        elif args.ai_mode == "none":
            warnings.append("⚠️ DEPRECATED: --ai-mode none → use --enhance-level 0 instead")

    # Deprecated: --quick flag
    if hasattr(args, "quick") and args.quick:
        warnings.append("⚠️ DEPRECATED: --quick → use --preset quick instead")

    # Deprecated: --comprehensive flag
    if hasattr(args, "comprehensive") and args.comprehensive:
        warnings.append("⚠️ DEPRECATED: --comprehensive → use --preset comprehensive instead")

    # Show warnings if any found
    if warnings:
        print("\n" + "=" * 70)
        for warning in warnings:
            print(warning)
        print("\n💡 MIGRATION TIP:")
        print(" --preset quick (1-2 min, basic features)")
        print(" --preset standard (5-10 min, core features, DEFAULT)")
        print(" --preset comprehensive (20-60 min, all features + AI)")
        print(" --enhance-level 0-3 (granular AI enhancement control)")
        print("\n⚠️ Deprecated flags will be removed in v4.0.0")
        print("=" * 70 + "\n")

class CodebaseAnalyzer(SkillConverter):
    """SkillConverter wrapper around the analyze_codebase / _generate_skill_md functions."""

    SOURCE_TYPE = "local"

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        self.directory = Path(config.get("directory", ".")).resolve()
        self.output_dir = Path(config.get("output_dir", f"output/{self.name}"))
        self.depth = config.get("depth", "deep")
        self.languages = config.get("languages")
        self.file_patterns = config.get("file_patterns")
        self.build_api_reference = config.get("build_api_reference", True)
        self.extract_comments = config.get("extract_comments", True)
        self.build_dependency_graph = config.get("build_dependency_graph", True)
        self.detect_patterns = config.get("detect_patterns", True)
        self.extract_test_examples = config.get("extract_test_examples", True)
        self.build_how_to_guides = config.get("build_how_to_guides", True)
        self.extract_config_patterns = config.get("extract_config_patterns", True)
        self.extract_docs = config.get("extract_docs", True)
        self.enhance_level = config.get("enhance_level", 0)
        self.skill_name = config.get("skill_name") or self.name
        self.skill_description = config.get("skill_description")
        self.doc_version = config.get("doc_version", "")
        self._results: dict[str, Any] | None = None

    def extract(self):
        """SkillConverter interface — delegates to analyze_codebase()."""
        self._results = analyze_codebase(
            directory=self.directory,
            output_dir=self.output_dir,
            depth=self.depth,
            languages=self.languages,
            file_patterns=self.file_patterns,
            build_api_reference=self.build_api_reference,
            extract_comments=self.extract_comments,
            build_dependency_graph=self.build_dependency_graph,
            detect_patterns=self.detect_patterns,
            extract_test_examples=self.extract_test_examples,
            build_how_to_guides=self.build_how_to_guides,
            extract_config_patterns=self.extract_config_patterns,
            extract_docs=self.extract_docs,
            enhance_level=self.enhance_level,
            skill_name=self.skill_name,
            skill_description=self.skill_description,
            doc_version=self.doc_version,
        )
def main():
"""Command-line interface for codebase analysis."""
from skill_seekers.cli.arguments.analyze import add_analyze_arguments
parser = argparse.ArgumentParser(
description="Analyze local codebases and extract code knowledge",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Analyze current directory
codebase-scraper --directory . --output output/codebase/
# Deep analysis with API reference and dependency graph
codebase-scraper --directory /path/to/repo --depth deep --build-api-reference --build-dependency-graph
# Analyze only Python and JavaScript
codebase-scraper --directory . --languages Python,JavaScript
# Use file patterns
codebase-scraper --directory . --file-patterns "*.py,src/**/*.js"
# Full analysis with all features (default)
codebase-scraper --directory . --depth deep
# Surface analysis (fast, skip all analysis features)
codebase-scraper --directory . --depth surface --skip-api-reference --skip-dependency-graph --skip-patterns --skip-test-examples
# Skip specific features
codebase-scraper --directory . --skip-patterns --skip-test-examples
""",
)
# Register all args from the shared definitions module
add_analyze_arguments(parser)
# Extra legacy arg only used by standalone CLI (not in arguments/analyze.py)
parser.add_argument(
"--ai-mode",
choices=["auto", "api", "local", "none"],
default="auto",
help=(
"AI enhancement mode for how-to guides: "
"auto (auto-detect: API if ANTHROPIC_API_KEY set, else LOCAL), "
"api (Anthropic API, requires ANTHROPIC_API_KEY), "
"local (coding agent CLI, FREE, no API key), "
"none (disable AI enhancement). "
"💡 TIP: Use --enhance flag instead for simpler UX!"
),
)
# Check for deprecated flags
deprecated_flags = {
"--build-api-reference": "--skip-api-reference",
"--build-dependency-graph": "--skip-dependency-graph",
"--detect-patterns": "--skip-patterns",
"--extract-test-examples": "--skip-test-examples",
"--build-how-to-guides": "--skip-how-to-guides",
"--extract-config-patterns": "--skip-config-patterns",
}
for old_flag, new_flag in deprecated_flags.items():
if old_flag in sys.argv:
logger.warning(
f"⚠️ DEPRECATED: {old_flag} is deprecated. "
f"All features are now enabled by default. "
f"Use {new_flag} to disable this feature."
)
# Handle --preset-list flag BEFORE parse_args() to avoid required --directory validation
if "--preset-list" in sys.argv:
from skill_seekers.cli.presets import PresetManager
print(PresetManager.format_preset_help())
return 0
args = parser.parse_args()
# Check for deprecated flags and show warnings
_check_deprecated_flags(args)
# Handle presets using formal preset system
preset_name = None
if hasattr(args, "preset") and args.preset:
# New --preset flag (recommended)
preset_name = args.preset
elif hasattr(args, "quick") and args.quick:
# Legacy --quick flag (backward compatibility)
preset_name = "quick"
elif hasattr(args, "comprehensive") and args.comprehensive:
# Legacy --comprehensive flag (backward compatibility)
preset_name = "comprehensive"
else:
# Default preset if none specified
preset_name = "standard"
# Apply preset using PresetManager
if preset_name:
from skill_seekers.cli.presets import PresetManager
try:
preset_args = PresetManager.apply_preset(preset_name, vars(args))
# Update args with preset values
for key, value in preset_args.items():
setattr(args, key, value)
preset = PresetManager.get_preset(preset_name)
logger.info(f"{preset.icon} {preset.name} analysis mode: {preset.description}")
except ValueError as e:
logger.error(f"{e}")
return 1
# Apply default depth if not set by preset or CLI
if args.depth is None:
args.depth = "deep" # Default depth
setup_logging(verbose=args.verbose, quiet=getattr(args, "quiet", False))
# Handle --dry-run
if getattr(args, "dry_run", False):
directory = Path(args.directory)
print(f"\n{'=' * 60}")
print("DRY RUN: Codebase Analysis")
print(f"{'=' * 60}")
print(f"Directory: {directory.resolve()}")
print(f"Output: {args.output}")
print(f"Preset: {preset_name}")
print(f"Depth: {args.depth or 'deep (default)'}")
print(f"Name: {getattr(args, 'name', None) or directory.name}")
print(f"Enhance: level {args.enhance_level}")
print(f"Skip flags: ", end="")
skips = []
for flag in [
"skip_api_reference",
"skip_dependency_graph",
"skip_patterns",
"skip_test_examples",
"skip_how_to_guides",
"skip_config_patterns",
"skip_docs",
]:
if getattr(args, flag, False):
skips.append(f"--{flag.replace('_', '-')}")
print(", ".join(skips) if skips else "(none)")
print(f"\n✅ Dry run complete")
return 0
# Validate directory
directory = Path(args.directory)
if not directory.exists():
logger.error(f"Directory not found: {directory}")
return 1
if not directory.is_dir():
logger.error(f"Not a directory: {directory}")
return 1
# Parse languages
languages = None
if args.languages:
languages = [lang.strip() for lang in args.languages.split(",")]
# Parse file patterns
file_patterns = None
if args.file_patterns:
file_patterns = [p.strip() for p in args.file_patterns.split(",")]
# Analyze codebase
try:
results = analyze_codebase(
directory=directory,
output_dir=Path(args.output),
depth=args.depth,
languages=languages,
file_patterns=file_patterns,
build_api_reference=not args.skip_api_reference,
extract_comments=not args.no_comments,
build_dependency_graph=not args.skip_dependency_graph,
detect_patterns=not args.skip_patterns,
extract_test_examples=not args.skip_test_examples,
build_how_to_guides=not args.skip_how_to_guides,
extract_config_patterns=not args.skip_config_patterns,
extract_docs=not args.skip_docs,
enhance_level=args.enhance_level, # AI enhancement level (0-3)
skill_name=getattr(args, "name", None),
skill_description=getattr(args, "description", None),
doc_version=getattr(args, "doc_version", ""),
)
# ============================================================
# WORKFLOW SYSTEM INTEGRATION (Phase 2)
# ============================================================
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
# Print summary
print(f"\n{'=' * 60}")
print("CODEBASE ANALYSIS COMPLETE")
if workflow_executed:
print(f" + {len(workflow_names)} ENHANCEMENT WORKFLOW(S) EXECUTED")
print(f"{'=' * 60}")
print(f"Files analyzed: {len(results['files'])}")
print(f"Output directory: {args.output}")
if not args.skip_api_reference:
print(f"API reference: {Path(args.output) / 'api_reference'}")
if workflow_executed:
print(f"Workflows applied: {', '.join(workflow_names)}")
print(f"{'=' * 60}\n")
return 0
except KeyboardInterrupt:
logger.error("\nAnalysis interrupted by user")
return 130
except Exception as e:
logger.error(f"Analysis failed: {e}")
import traceback
traceback.print_exc()
return 1
if __name__ == "__main__":
sys.exit(main())
def build_skill(self):
"""SkillConverter interface — no-op because analyze_codebase() already calls _generate_skill_md()."""
# analyze_codebase() generates SKILL.md internally via _generate_skill_md(),
# so there is nothing additional to do here.
pass
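The wrapper above adapts a function-based pipeline to the class-based converter interface. A minimal sketch of that adapter pattern, with assumed names (`BaseConverter`, `FunctionBackedConverter` are illustrative, not the real skill_seekers API):

```python
# Illustrative adapter: a registry can drive every source type through the
# same extract() → build_skill() lifecycle, even when the legacy function
# already writes SKILL.md itself (making build_skill a no-op).
from pathlib import Path
from typing import Any


class BaseConverter:
    SOURCE_TYPE = "base"

    def __init__(self, config: dict[str, Any]):
        self.config = config
        self.name = config.get("name", "skill")

    def extract(self) -> None:
        raise NotImplementedError

    def build_skill(self) -> None:
        """Default no-op for pipelines that emit SKILL.md during extract()."""

    def run(self) -> int:
        self.extract()
        self.build_skill()
        return 0


class FunctionBackedConverter(BaseConverter):
    """Adapter around a legacy analyze-style function."""

    SOURCE_TYPE = "local"

    def extract(self) -> None:
        # Stand-in for analyze_codebase(); the real pipeline writes files here.
        self.results = {"output_dir": Path(f"output/{self.name}")}


converter = FunctionBackedConverter({"name": "demo"})
exit_code = converter.run()
```

The no-op `build_skill` mirrors the comment above: when extraction already produces the skill, the interface method exists only to satisfy the contract.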


@@ -627,15 +627,17 @@ class ConfigParser:
parent_path = []
for key, value in data.items():
# YAML parses 'on:' as boolean True; convert non-string keys
str_key = str(key) if not isinstance(key, str) else key
if isinstance(value, dict):
# Recurse into nested dicts
self._extract_settings_from_dict(value, config_file, parent_path + [key])
self._extract_settings_from_dict(value, config_file, parent_path + [str_key])
else:
setting = ConfigSetting(
key=".".join(parent_path + [key]) if parent_path else key,
key=".".join(parent_path + [str_key]) if parent_path else str_key,
value=value,
value_type=self._infer_type(value),
nested_path=parent_path + [key],
nested_path=parent_path + [str_key],
)
config_file.settings.append(setting)
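The str-coercion above exists because YAML 1.1 loaders parse the bare GitHub Actions key `on:` as the boolean `True`, after which `".".join(...)` raises TypeError on the non-string path element. A minimal sketch of the fix, simulating the parsed dict so no YAML dependency is needed:

```python
# Simulated parse result of a workflow file: the `on:` key has already
# become the boolean True, exactly as a YAML 1.1 loader would produce.
parsed = {True: {"push": {"branches": ["main"]}}, "name": "CI"}


def flatten_settings(data, parent=()):
    """Yield (dotted_key, value) pairs, coercing non-string keys to str."""
    for key, value in data.items():
        str_key = key if isinstance(key, str) else str(key)
        path = parent + (str_key,)
        if isinstance(value, dict):
            yield from flatten_settings(value, path)
        else:
            yield ".".join(path), value


settings = dict(flatten_settings(parsed))
```

Note that `str(True)` is `"True"`, so the key survives as the string `"True"` rather than crashing the join; for settings extraction that is sufficient.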


@@ -1,8 +1,8 @@
#!/usr/bin/env python3
"""
Unified Config Validator
UniSkillConfig Validator
Validates unified config format that supports multiple sources:
Validates uni_skill_config format that supports multiple sources:
- documentation (website scraping)
- github (repository scraping)
- pdf (PDF document scraping)
@@ -34,9 +34,9 @@ logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ConfigValidator:
class UniSkillConfigValidator:
"""
Validates unified config format (legacy support removed in v2.11.0).
Validates uni_skill_config format (legacy support removed in v2.11.0).
"""
# Valid source types
@@ -100,7 +100,7 @@ class ConfigValidator:
def validate(self) -> bool:
"""
Validate unified config format.
Validate uni_skill_config format.
Returns:
True if valid
@@ -136,8 +136,8 @@ class ConfigValidator:
return self._validate_unified()
def _validate_unified(self) -> bool:
"""Validate unified config format."""
logger.info("Validating unified config format...")
"""Validate uni_skill_config format."""
logger.info("Validating uni_skill_config format...")
# Required top-level fields
if "name" not in self.config:
@@ -483,7 +483,11 @@ class ConfigValidator:
return has_docs_api and has_github_code
def validate_config(config_path: str) -> ConfigValidator:
# Backward-compat alias
ConfigValidator = UniSkillConfigValidator
def validate_config(config_path: str) -> UniSkillConfigValidator:
"""
Validate config file and return validator instance.
@@ -491,12 +495,12 @@ def validate_config(config_path: str) -> ConfigValidator:
config_path: Path to config JSON file
Returns:
ConfigValidator instance
UniSkillConfigValidator instance
Raises:
ValueError if config is invalid
"""
validator = ConfigValidator(config_path)
validator = UniSkillConfigValidator(config_path)
validator.validate()
return validator


@@ -31,15 +31,15 @@ Usage:
--space-key DEV --name dev-wiki --max-pages 200
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
from typing import Any
from skill_seekers.cli.skill_converter import SkillConverter
# Optional dependency guard for atlassian-python-api
try:
from atlassian import Confluence
@@ -177,7 +177,7 @@ def infer_description_from_confluence(
)
class ConfluenceToSkillConverter:
class ConfluenceToSkillConverter(SkillConverter):
"""Convert Confluence space documentation to an AI-ready skill.
Supports two extraction modes:
@@ -209,6 +209,8 @@ class ConfluenceToSkillConverter:
extracted_data: Structured extraction results dict.
"""
SOURCE_TYPE = "confluence"
def __init__(self, config: dict) -> None:
"""Initialize the Confluence to skill converter.
@@ -223,6 +225,7 @@ class ConfluenceToSkillConverter:
- description (str): Skill description (optional).
- max_pages (int): Maximum pages to fetch, default 500.
"""
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.base_url: str = config.get("base_url", "")
@@ -242,6 +245,10 @@ class ConfluenceToSkillConverter:
# Extracted data storage
self.extracted_data: dict[str, Any] | None = None
def extract(self):
"""Extract content from Confluence (SkillConverter interface)."""
self.extract_confluence()
# ──────────────────────────────────────────────────────────────────────
# Extraction dispatcher
# ──────────────────────────────────────────────────────────────────────
@@ -1916,255 +1923,3 @@ def _score_code_quality(code: str) -> float:
score -= 2.0
return min(10.0, max(0.0, score))
# ──────────────────────────────────────────────────────────────────────────────
# CLI entry point
# ──────────────────────────────────────────────────────────────────────────────
def main() -> int:
"""CLI entry point for the Confluence scraper.
Parses command-line arguments and runs the extraction/build pipeline.
Supports three workflows:
1. **API mode**: ``--base-url URL --space-key KEY --name my-skill``
2. **Export mode**: ``--export-path ./export-dir/ --name my-skill``
3. **Build from JSON**: ``--from-json my-skill_extracted.json``
Returns:
Exit code (0 for success, non-zero for failure).
"""
parser = argparse.ArgumentParser(
description="Convert Confluence documentation to AI-ready skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=(
"Examples:\n"
" %(prog)s --base-url https://wiki.example.com "
"--space-key PROJ --name my-wiki\n"
" %(prog)s --export-path ./confluence-export/ --name my-wiki\n"
" %(prog)s --from-json my-wiki_extracted.json\n"
),
)
# Standard shared arguments
from .arguments.common import add_all_standard_arguments
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for Confluence
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for Confluence), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code, Kimi, etc.)"
)
# Confluence-specific arguments
parser.add_argument(
"--base-url",
type=str,
help="Confluence instance base URL (e.g., https://wiki.example.com)",
metavar="URL",
)
parser.add_argument(
"--space-key",
type=str,
help="Confluence space key to extract (e.g., PROJ, DEV)",
metavar="KEY",
)
parser.add_argument(
"--export-path",
type=str,
help="Path to Confluence HTML/XML export directory",
metavar="PATH",
)
parser.add_argument(
"--username",
type=str,
help=("Confluence username / email for API auth (or set CONFLUENCE_USERNAME env var)"),
metavar="USER",
)
parser.add_argument(
"--token",
type=str,
help=("Confluence API token for API auth (or set CONFLUENCE_TOKEN env var)"),
metavar="TOKEN",
)
parser.add_argument(
"--max-pages",
type=int,
default=500,
help="Maximum number of pages to fetch (default: 500)",
metavar="N",
)
parser.add_argument(
"--from-json",
type=str,
help="Build skill from previously extracted JSON data",
metavar="FILE",
)
args = parser.parse_args()
# Setup logging
if getattr(args, "quiet", False):
logging.basicConfig(level=logging.WARNING, format="%(message)s")
elif getattr(args, "verbose", False):
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
else:
logging.basicConfig(level=logging.INFO, format="%(message)s")
# Handle --dry-run
if getattr(args, "dry_run", False):
source = (
getattr(args, "base_url", None)
or getattr(args, "export_path", None)
or getattr(args, "from_json", None)
or "(none)"
)
print(f"\n{'=' * 60}")
print("DRY RUN: Confluence Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Space key: {getattr(args, 'space_key', None) or '(N/A)'}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Max pages: {getattr(args, 'max_pages', 500)}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print("\n✅ Dry run complete")
return 0
# Validate inputs
has_api = getattr(args, "base_url", None) and getattr(args, "space_key", None)
has_export = getattr(args, "export_path", None)
has_json = getattr(args, "from_json", None)
if not (has_api or has_export or has_json):
parser.error(
"Must specify one of:\n"
" --base-url URL --space-key KEY (API mode)\n"
" --export-path PATH (export mode)\n"
" --from-json FILE (build from JSON)"
)
# Build from pre-extracted JSON
if has_json:
name = getattr(args, "name", None) or Path(args.from_json).stem.replace("_extracted", "")
config: dict[str, Any] = {
"name": name,
"description": (
getattr(args, "description", None) or f"Use when referencing {name} documentation"
),
}
try:
converter = ConfluenceToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Determine name
if not getattr(args, "name", None):
if has_api:
args.name = args.space_key.lower()
elif has_export:
args.name = Path(args.export_path).name
else:
args.name = "confluence-skill"
# Build config
config = {
"name": args.name,
"base_url": getattr(args, "base_url", "") or "",
"space_key": getattr(args, "space_key", "") or "",
"export_path": getattr(args, "export_path", "") or "",
"username": getattr(args, "username", "") or "",
"token": getattr(args, "token", "") or "",
"max_pages": getattr(args, "max_pages", 500),
}
if getattr(args, "description", None):
config["description"] = args.description
# Create converter and run
try:
converter = ConfluenceToSkillConverter(config)
if not converter.extract_confluence():
print("\n Confluence extraction failed", file=sys.stderr)
sys.exit(1)
converter.build_skill()
# Enhancement workflow integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f" AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialised analysis,"
" enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print(" API enhancement complete!")
except ImportError:
print(" API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import (
LocalSkillEnhancer,
)
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print(" Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import (
LocalSkillEnhancer,
)
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print(" Local enhancement complete!")
except (ValueError, RuntimeError, FileNotFoundError) as e:
print(f"\n Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n Unexpected error during Confluence processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())


@@ -1,19 +1,22 @@
"""Unified create command - single entry point for skill creation.
Auto-detects source type (web, GitHub, local, PDF, config) and routes
to appropriate scraper while maintaining full backward compatibility.
to appropriate converter via get_converter().
"""
import sys
import logging
import argparse
from typing import Any
from skill_seekers.cli.source_detector import SourceDetector, SourceInfo
from skill_seekers.cli.execution_context import ExecutionContext
from skill_seekers.cli.skill_converter import get_converter
from skill_seekers.cli.arguments.create import (
get_compatible_arguments,
get_create_defaults,
get_universal_argument_names,
)
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
logger = logging.getLogger(__name__)
@@ -21,14 +24,20 @@ logger = logging.getLogger(__name__)
class CreateCommand:
"""Unified create command implementation."""
def __init__(self, args: argparse.Namespace):
def __init__(self, args: argparse.Namespace, parser_defaults: dict[str, Any] | None = None):
"""Initialize create command.
Args:
args: Parsed command-line arguments
parser_defaults: Default values from the argument parser. Used by
_is_explicitly_set() to detect which args the user actually
provided on the command line vs. which are just defaults.
"""
self.args = args
self.source_info: SourceInfo | None = None
self._parser_defaults = (
parser_defaults if parser_defaults is not None else get_create_defaults()
)
def execute(self) -> int:
"""Execute the create command.
@@ -52,12 +61,36 @@ class CreateCommand:
logger.error(f"Source validation failed: {e}")
return 1
# 3. Validate and warn about incompatible arguments
# 3. Initialize ExecutionContext with source info
# This provides a single source of truth for all configuration
# Resolve config path from args or source detection
config_path = getattr(self.args, "config", None) or (
self.source_info.parsed.get("config_path") if self.source_info else None
)
ExecutionContext.initialize(
args=self.args,
config_path=config_path,
source_info=self.source_info,
)
# 4. Validate and warn about incompatible arguments
self._validate_arguments()
# 4. Route to appropriate scraper
logger.info(f"Routing to {self.source_info.type} scraper...")
return self._route_to_scraper()
# 5. Route to appropriate converter
logger.info(f"Routing to {self.source_info.type} converter...")
result = self._route_to_scraper()
if result != 0:
return result
# 6. Centralized enhancement (runs after converter, not inside each scraper)
ctx = ExecutionContext.get()
if ctx.enhancement.enabled and ctx.enhancement.level > 0:
self._run_enhancement(ctx)
# 7. Centralized workflows
self._run_workflows()
return 0
def _validate_arguments(self) -> None:
"""Validate arguments and warn about incompatible ones."""
@@ -86,313 +119,338 @@ class CreateCommand:
f"{self.source_info.type} sources and will be ignored"
)
def _is_explicitly_set(self, arg_name: str, arg_value: any) -> bool:
def _is_explicitly_set(self, arg_name: str, arg_value: Any) -> bool:
"""Check if an argument was explicitly set by the user.
Compares the current value against the parser's registered default.
This avoids hardcoding default values that can drift out of sync.
Args:
arg_name: Argument name
arg_value: Argument value
arg_name: Argument destination name
arg_value: Current argument value
Returns:
True if user explicitly set this argument
"""
# Boolean flags - True means it was set
if isinstance(arg_value, bool):
return arg_value
# None means not set
if arg_value is None:
return False
# Check against common defaults — args with these values were NOT
# explicitly set by the user and should not be forwarded.
defaults = {
"max_issues": 100,
"chunk_tokens": DEFAULT_CHUNK_TOKENS,
"chunk_overlap_tokens": DEFAULT_CHUNK_OVERLAP_TOKENS,
"output": None,
"doc_version": "",
"video_languages": "en",
"whisper_model": "base",
"platform": "slack",
"visual_interval": 0.7,
"visual_min_gap": 0.5,
"visual_similarity": 3.0,
}
# Boolean flags: True means explicitly set (store_true defaults to False)
if isinstance(arg_value, bool):
return arg_value
if arg_name in defaults:
return arg_value != defaults[arg_name]
# Compare against parser default if available
if arg_name in self._parser_defaults:
return arg_value != self._parser_defaults[arg_name]
# Any other non-None value means it was set
# No registered default and non-None → user must have set it
return True
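The default-comparison approach above can be exercised standalone. A sketch using argparse directly (parser and flag names are illustrative; the real code pulls its defaults via `get_create_defaults()`):

```python
# Comparing parsed values to the parser's registered defaults avoids
# hardcoding them, at the cost of one known blind spot: a user who passes
# a value equal to the default looks "not explicitly set".
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max-pages", type=int, default=100)
parser.add_argument("--async", dest="async_mode", action="store_true")

# _actions is argparse-private but stable; the diff above uses it the same way.
defaults = {a.dest: a.default for a in parser._actions if a.dest != "help"}


def is_explicitly_set(name: str, value) -> bool:
    if isinstance(value, bool):          # store_true: True only when passed
        return value
    if value is None:                    # never supplied
        return False
    return value != defaults.get(name)   # differs from default → explicit


args = parser.parse_args(["--max-pages", "200"])
explicit = {k: is_explicitly_set(k, v) for k, v in vars(args).items()}
```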
def _route_to_scraper(self) -> int:
"""Route to appropriate scraper based on source type.
"""Route to appropriate converter based on source type.
Builds a config dict from ExecutionContext + source_info, then
calls converter.run() directly — no sys.argv swap needed.
Returns:
Exit code from scraper
Exit code from converter
"""
if self.source_info.type == "web":
return self._route_web()
elif self.source_info.type == "github":
return self._route_github()
elif self.source_info.type == "local":
return self._route_local()
elif self.source_info.type == "pdf":
return self._route_pdf()
elif self.source_info.type == "word":
return self._route_word()
elif self.source_info.type == "epub":
return self._route_epub()
elif self.source_info.type == "video":
return self._route_video()
elif self.source_info.type == "config":
return self._route_config()
elif self.source_info.type == "jupyter":
return self._route_generic("jupyter_scraper", "--notebook")
elif self.source_info.type == "html":
return self._route_generic("html_scraper", "--html-path")
elif self.source_info.type == "openapi":
return self._route_generic("openapi_scraper", "--spec")
elif self.source_info.type == "asciidoc":
return self._route_generic("asciidoc_scraper", "--asciidoc-path")
elif self.source_info.type == "pptx":
return self._route_generic("pptx_scraper", "--pptx")
elif self.source_info.type == "rss":
return self._route_generic("rss_scraper", "--feed-path")
elif self.source_info.type == "manpage":
return self._route_generic("man_scraper", "--man-path")
elif self.source_info.type == "confluence":
return self._route_generic("confluence_scraper", "--export-path")
elif self.source_info.type == "notion":
return self._route_generic("notion_scraper", "--export-path")
elif self.source_info.type == "chat":
return self._route_generic("chat_scraper", "--export-path")
else:
logger.error(f"Unknown source type: {self.source_info.type}")
return 1
source_type = self.source_info.type
ctx = ExecutionContext.get()
# ── Dynamic argument forwarding ──────────────────────────────────────
#
# Instead of manually checking each flag in every _route_*() method,
# _build_argv() dynamically iterates vars(self.args) and forwards all
# explicitly-set arguments. This is the same pattern used by
# main.py::_reconstruct_argv() and eliminates ~40 missing-flag gaps.
# UnifiedScraper is special — it takes config_path, not a config dict
if source_type == "config":
from skill_seekers.cli.unified_scraper import UnifiedScraper
# Dest names that differ from their CLI flag (dest → flag)
_DEST_TO_FLAG = {
"async_mode": "--async",
"video_url": "--url",
"video_playlist": "--playlist",
"video_languages": "--languages",
"skip_config": "--skip-config-patterns",
}
config_path = self.source_info.parsed["config_path"]
merge_mode = getattr(self.args, "merge_mode", None)
converter = UnifiedScraper(config_path, merge_mode=merge_mode)
return converter.run()
# Internal args that should never be forwarded to sub-scrapers.
# video_url/video_playlist/video_file are handled as positionals by _route_video().
# config is forwarded manually only by routes that need it (web, github).
_SKIP_ARGS = frozenset(
{
"source",
"func",
"subcommand",
"command",
"config",
"video_url",
"video_playlist",
"video_file",
}
)
config = self._build_config(source_type, ctx)
converter = get_converter(source_type, config)
return converter.run()
def _build_argv(
self,
module_name: str,
positional_args: list[str],
allowlist: frozenset[str] | None = None,
) -> list[str]:
"""Build argv dynamically by forwarding all explicitly-set arguments.
def _build_config(self, source_type: str, ctx: ExecutionContext) -> dict[str, Any]:
"""Build a config dict for the converter from ExecutionContext.
Uses the same pattern as main.py::_reconstruct_argv().
Replaces manual per-flag checking in _route_*() and _add_common_args().
Each converter reads specific keys from the config dict passed to
its __init__. This method constructs that dict from the centralized
ExecutionContext, which already holds all CLI args + config file values.
Args:
module_name: Scraper module name (e.g., "doc_scraper")
positional_args: Positional arguments to prepend (e.g., [url] or ["--repo", repo])
allowlist: If provided, ONLY forward args in this set (overrides _SKIP_ARGS).
Used for targets with strict arg sets like unified_scraper.
source_type: Detected source type (web, github, pdf, etc.)
ctx: Initialized ExecutionContext
Returns:
Complete argv list for the scraper
Config dict suitable for the converter's __init__.
"""
argv = [module_name] + positional_args
# Auto-add suggested name if user didn't provide one (skip for allowlisted targets)
if not allowlist and not self.args.name and self.source_info:
argv.extend(["--name", self.source_info.suggested_name])
for key, value in vars(self.args).items():
# If allowlist provided, only forward args in the allowlist
if allowlist is not None:
if key not in allowlist:
continue
elif key in self._SKIP_ARGS or key.startswith("_help_"):
continue
if not self._is_explicitly_set(key, value):
continue
# Use translation map for mismatched dest→flag names, else derive from key
if key in self._DEST_TO_FLAG:
arg_flag = self._DEST_TO_FLAG[key]
else:
arg_flag = f"--{key.replace('_', '-')}"
if isinstance(value, bool):
if value:
argv.append(arg_flag)
elif isinstance(value, list):
for item in value:
argv.extend([arg_flag, str(item)])
elif value is not None:
argv.extend([arg_flag, str(value)])
return argv
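The forwarding loop can be sketched standalone. Assumed here: a toy translation map and skip set (the real `_DEST_TO_FLAG` and `_SKIP_ARGS` are larger), and a simplified explicitness test:

```python
# dest → flag forwarding: booleans become bare switches, lists repeat the
# flag per item, everything else is flag + stringified value.
DEST_TO_FLAG = {"async_mode": "--async"}
SKIP_ARGS = {"func", "command", "source"}


def build_argv(module: str, positionals: list[str], args: dict) -> list[str]:
    argv = [module, *positionals]
    for key, value in args.items():
        if key in SKIP_ARGS or value in (None, False):
            continue  # only forward explicitly-set, forwardable args
        flag = DEST_TO_FLAG.get(key, f"--{key.replace('_', '-')}")
        if value is True:
            argv.append(flag)            # boolean flag: bare switch
        elif isinstance(value, list):
            for item in value:           # repeated flag for list values
                argv.extend([flag, str(item)])
        else:
            argv.extend([flag, str(value)])
    return argv


argv = build_argv(
    "doc_scraper",
    ["https://docs.example.com"],
    {"max_pages": 50, "async_mode": True, "func": None},
)
```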
def _call_module(self, module, argv: list[str]) -> int:
"""Call a scraper module with the given argv.
Swaps sys.argv, calls module.main(), restores sys.argv.
"""
logger.debug(f"Calling {argv[0]} with argv: {argv}")
original_argv = sys.argv
try:
sys.argv = argv
result = module.main()
if result is None:
logger.warning("Module returned None exit code, treating as success")
return 0
return result
finally:
sys.argv = original_argv
def _route_web(self) -> int:
"""Route to web documentation scraper (doc_scraper.py)."""
from skill_seekers.cli import doc_scraper
url = self.source_info.parsed.get("url", self.source_info.raw_source)
argv = self._build_argv("doc_scraper", [url])
# Forward config if set (not in _build_argv since it's in SKIP_ARGS
# to avoid double-forwarding for config-type sources)
if self.args.config:
argv.extend(["--config", self.args.config])
return self._call_module(doc_scraper, argv)
def _route_github(self) -> int:
"""Route to GitHub repository scraper (github_scraper.py)."""
from skill_seekers.cli import github_scraper
repo = self.source_info.parsed.get("repo", self.source_info.raw_source)
argv = self._build_argv("github_scraper", ["--repo", repo])
if self.args.config:
argv.extend(["--config", self.args.config])
return self._call_module(github_scraper, argv)
def _route_local(self) -> int:
"""Route to local codebase analyzer (codebase_scraper.py)."""
from skill_seekers.cli import codebase_scraper
directory = self.source_info.parsed.get("directory", self.source_info.raw_source)
argv = self._build_argv("codebase_scraper", ["--directory", directory])
return self._call_module(codebase_scraper, argv)
def _route_pdf(self) -> int:
"""Route to PDF scraper (pdf_scraper.py)."""
from skill_seekers.cli import pdf_scraper
file_path = self.source_info.parsed.get("file_path", self.source_info.raw_source)
argv = self._build_argv("pdf_scraper", ["--pdf", file_path])
return self._call_module(pdf_scraper, argv)
def _route_word(self) -> int:
"""Route to Word document scraper (word_scraper.py)."""
from skill_seekers.cli import word_scraper
file_path = self.source_info.parsed.get("file_path", self.source_info.raw_source)
argv = self._build_argv("word_scraper", ["--docx", file_path])
return self._call_module(word_scraper, argv)
def _route_epub(self) -> int:
"""Route to EPUB scraper (epub_scraper.py)."""
from skill_seekers.cli import epub_scraper
file_path = self.source_info.parsed.get("file_path", self.source_info.raw_source)
argv = self._build_argv("epub_scraper", ["--epub", file_path])
return self._call_module(epub_scraper, argv)
def _route_video(self) -> int:
"""Route to video scraper (video_scraper.py)."""
from skill_seekers.cli import video_scraper
parsed = self.source_info.parsed
if parsed.get("source_kind") == "file":
positional = ["--video-file", parsed["file_path"]]
elif parsed.get("url"):
url = parsed["url"]
flag = "--playlist" if "playlist" in url.lower() else "--url"
positional = [flag, url]
else:
positional = []
argv = self._build_argv("video_scraper", positional)
return self._call_module(video_scraper, argv)
name = ctx.output.name or self.source_info.suggested_name
# Args accepted by unified_scraper (allowlist for config route)
_UNIFIED_SCRAPER_ARGS = frozenset(
{
"merge_mode",
"skip_codebase_analysis",
"fresh",
"dry_run",
"enhance_workflow",
"enhance_stage",
"var",
"workflow_dry_run",
"api_key",
"enhance_level",
"agent",
"agent_cmd",
}
)
# Common keys shared by all converters
config: dict[str, Any] = {
"name": name,
"description": getattr(self.args, "description", None)
or f"Use when working with {name}",
}
def _route_config(self) -> int:
"""Route to unified scraper for config files (unified_scraper.py)."""
from skill_seekers.cli import unified_scraper
config_path = self.source_info.parsed["config_path"]
argv = self._build_argv(
"unified_scraper",
["--config", config_path],
allowlist=self._UNIFIED_SCRAPER_ARGS,
)
return self._call_module(unified_scraper, argv)
if source_type == "web":
url = parsed.get("url", parsed.get("base_url", self.source_info.raw_input))
config.update(
{
"base_url": url,
"doc_version": ctx.output.doc_version,
"max_pages": ctx.scraping.max_pages,
"rate_limit": ctx.scraping.rate_limit,
"browser": ctx.scraping.browser,
"browser_wait_until": ctx.scraping.browser_wait_until,
"browser_extra_wait": ctx.scraping.browser_extra_wait,
"workers": ctx.scraping.workers,
"async_mode": ctx.scraping.async_mode,
"resume": ctx.scraping.resume,
"fresh": ctx.scraping.fresh,
"skip_scrape": ctx.scraping.skip_scrape,
"selectors": {"title": "title", "code_blocks": "pre code"},
"url_patterns": {"include": [], "exclude": []},
}
)
# Load from config file if provided
config_path = getattr(self.args, "config", None)
if config_path:
self._merge_json_config(config, config_path)
elif source_type == "github":
repo = parsed.get("repo", self.source_info.raw_input)
config.update(
{
"repo": repo,
"local_repo_path": getattr(self.args, "local_repo_path", None),
"include_issues": getattr(self.args, "include_issues", True),
"max_issues": getattr(self.args, "max_issues", 100),
"include_changelog": getattr(self.args, "include_changelog", True),
"include_releases": getattr(self.args, "include_releases", True),
"include_code": getattr(self.args, "include_code", False),
}
)
config_path = getattr(self.args, "config", None)
if config_path:
self._merge_json_config(config, config_path)
elif source_type == "local":
directory = parsed.get("directory", self.source_info.raw_input)
config.update(
{
"directory": directory,
"depth": ctx.analysis.depth,
"output_dir": ctx.output.output_dir or f"output/{name}",
"languages": getattr(self.args, "languages", None),
"file_patterns": ctx.analysis.file_patterns,
"detect_patterns": not ctx.analysis.skip_patterns,
"extract_test_examples": not ctx.analysis.skip_test_examples,
"build_how_to_guides": not ctx.analysis.skip_how_to_guides,
"extract_config_patterns": not ctx.analysis.skip_config_patterns,
"build_api_reference": not ctx.analysis.skip_api_reference,
"build_dependency_graph": not ctx.analysis.skip_dependency_graph,
"extract_docs": not ctx.analysis.skip_docs,
"extract_comments": not ctx.analysis.no_comments,
"enhance_level": ctx.enhancement.level if ctx.enhancement.enabled else 0,
"skill_name": name,
"doc_version": ctx.output.doc_version,
}
)
def _route_generic(self, module_name: str, file_flag: str) -> int:
"""Generic routing for new source types.
All new source types (jupyter, html, openapi, asciidoc, pptx, rss,
manpage, confluence, notion, chat) use dynamic argument forwarding.
"""
elif source_type == "pdf":
config.update(
{
"pdf_path": parsed.get("file_path", self.source_info.raw_input),
"extract_options": {
"chunk_size": 10,
"min_quality": 5.0,
"extract_images": True,
"min_image_size": 100,
},
}
)
elif source_type == "word":
config["docx_path"] = parsed.get("file_path", self.source_info.raw_input)
elif source_type == "epub":
config["epub_path"] = parsed.get("file_path", self.source_info.raw_input)
elif source_type == "video":
config.update(
{
"languages": getattr(self.args, "video_languages", "en"),
"visual": getattr(self.args, "visual", False),
"whisper_model": getattr(self.args, "whisper_model", "base"),
"visual_interval": getattr(self.args, "visual_interval", 0.7),
"visual_min_gap": getattr(self.args, "visual_min_gap", 0.5),
"visual_similarity": getattr(self.args, "visual_similarity", 3.0),
}
)
# Video source can be URL, playlist, or file
if parsed.get("source_kind") == "file":
config["video_file"] = parsed["file_path"]
elif parsed.get("url"):
url = parsed["url"]
if "playlist" in url.lower():
config["playlist"] = url
else:
config["url"] = url
else:
# Fallback: treat raw input as URL
config["url"] = self.source_info.raw_input
elif source_type == "jupyter":
config["notebook_path"] = parsed.get("file_path", self.source_info.raw_input)
elif source_type == "html":
config["html_path"] = parsed.get("file_path", self.source_info.raw_input)
elif source_type == "openapi":
file_path = parsed.get("file_path", self.source_info.raw_input)
if file_path.startswith(("http://", "https://")):
config["spec_url"] = file_path
else:
config["spec_path"] = file_path
elif source_type == "asciidoc":
config["asciidoc_path"] = parsed.get("file_path", self.source_info.raw_input)
elif source_type == "pptx":
config["pptx_path"] = parsed.get("file_path", self.source_info.raw_input)
elif source_type == "rss":
file_path = parsed.get("file_path", self.source_info.raw_input)
if file_path.startswith(("http://", "https://")):
config["feed_url"] = file_path
else:
config["feed_path"] = file_path
config["follow_links"] = getattr(self.args, "follow_links", True)
config["max_articles"] = getattr(self.args, "max_articles", 50)
elif source_type == "manpage":
file_path = parsed.get("file_path", "")
if file_path:
config["man_path"] = file_path
man_names = parsed.get("man_names", [])
if man_names:
config["man_names"] = man_names
elif source_type == "confluence":
config.update(
{
"export_path": parsed.get("file_path", ""),
"base_url": getattr(self.args, "confluence_url", ""),
"space_key": getattr(self.args, "space_key", ""),
"username": getattr(self.args, "username", ""),
"token": getattr(self.args, "token", ""),
"max_pages": getattr(self.args, "max_pages", 500),
}
)
elif source_type == "notion":
config.update(
{
"export_path": parsed.get("file_path"),
"database_id": getattr(self.args, "database_id", None),
"page_id": getattr(self.args, "page_id", None),
"token": getattr(self.args, "notion_token", None),
"max_pages": getattr(self.args, "max_pages", 100),
}
)
elif source_type == "chat":
config.update(
{
"export_path": parsed.get("file_path", ""),
"platform": getattr(self.args, "platform", "slack"),
"token": getattr(self.args, "token", ""),
"channel": getattr(self.args, "channel", ""),
"max_messages": getattr(self.args, "max_messages", 1000),
}
)
return config
import importlib
module = importlib.import_module(f"skill_seekers.cli.{module_name}")
file_path = self.source_info.parsed.get("file_path", "")
positional = [file_flag, file_path] if file_path else []
argv = self._build_argv(module_name, positional)
return self._call_module(module, argv)
@staticmethod
def _merge_json_config(config: dict[str, Any], config_path: str) -> None:
"""Merge a JSON config file into the config dict.
Config file values are used as defaults — CLI args (already in config) take precedence.
"""
import json
try:
with open(config_path, encoding="utf-8") as f:
file_config = json.load(f)
# Only set keys that aren't already in config
for key, value in file_config.items():
if key not in config:
config[key] = value
except (FileNotFoundError, json.JSONDecodeError) as e:
logger.warning(f"Could not load config file {config_path}: {e}")
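The defaults-merge semantics of `_merge_json_config` (file values fill gaps, keys already present from CLI args win) can be sketched standalone; the `merge_defaults` helper and the temp file below are illustrative:

```python
import json
import os
import tempfile

def merge_defaults(config, path):
    # File values are defaults; keys already in config (e.g. CLI args) win.
    with open(path, encoding="utf-8") as f:
        file_config = json.load(f)
    for key, value in file_config.items():
        if key not in config:
            config[key] = value
    return config

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"rate_limit": 1.0, "max_pages": 500}, f)
    path = f.name
cfg = {"rate_limit": 0.5}  # pretend this came from CLI flags
merge_defaults(cfg, path)
os.unlink(path)
assert cfg == {"rate_limit": 0.5, "max_pages": 500}  # CLI value kept, file default added
```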
def _run_enhancement(self, ctx: ExecutionContext) -> None:
"""Run centralized AI enhancement after converter completes."""
from pathlib import Path
name = ctx.output.name or (
self.source_info.suggested_name if self.source_info else "unnamed"
)
skill_dir = ctx.output.output_dir or f"output/{name}"
logger.info("\n" + "=" * 60)
logger.info(f"Enhancing SKILL.md (level {ctx.enhancement.level})")
logger.info("=" * 60)
try:
from skill_seekers.cli.agent_client import AgentClient
client = AgentClient(
mode=ctx.enhancement.mode,
agent=ctx.enhancement.agent,
api_key=ctx.enhancement.api_key,
)
if client.mode == "api" and client.client:
from skill_seekers.cli.enhance_skill import enhance_skill_md
api_key = ctx.enhancement.api_key or client.api_key
if api_key:
enhance_skill_md(skill_dir, api_key)
logger.info("API enhancement complete!")
else:
logger.warning("No API key available for enhancement")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
enhancer = LocalSkillEnhancer(
Path(skill_dir),
agent=ctx.enhancement.agent,
agent_cmd=ctx.enhancement.agent_cmd,
)
success = enhancer.run(headless=True, timeout=ctx.enhancement.timeout)
if success:
agent_name = ctx.enhancement.agent or "claude"
logger.info(f"Local enhancement complete! (via {agent_name})")
else:
logger.warning("Local enhancement did not complete")
except Exception as e:
logger.warning(f"Enhancement failed: {e}")
def _run_workflows(self) -> None:
"""Run enhancement workflows if configured."""
try:
from skill_seekers.cli.workflow_runner import run_workflows
run_workflows(self.args)
except ImportError:
pass
except Exception as e:
logger.warning(f"Workflow execution failed: {e}")
def main() -> int:
@@ -492,97 +550,28 @@ Common Workflows:
args = parser.parse_args()
# Handle source-specific help modes
if args._help_web:
# Recreate parser with web-specific arguments
parser_web = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from web documentation",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_web, mode="web")
parser_web.print_help()
return 0
elif args._help_github:
parser_github = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from GitHub repository",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_github, mode="github")
parser_github.print_help()
return 0
elif args._help_local:
parser_local = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from local codebase",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_local, mode="local")
parser_local.print_help()
return 0
elif args._help_pdf:
parser_pdf = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from PDF file",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_pdf, mode="pdf")
parser_pdf.print_help()
return 0
elif args._help_word:
parser_word = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from Word document (.docx)",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_word, mode="word")
parser_word.print_help()
return 0
elif args._help_epub:
parser_epub = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from EPUB e-book (.epub)",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_epub, mode="epub")
parser_epub.print_help()
return 0
elif args._help_video:
parser_video = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from video (YouTube, Vimeo, local files)",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_video, mode="video")
parser_video.print_help()
return 0
elif args._help_config:
parser_config = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from multi-source config file (unified scraper)",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_config, mode="config")
parser_config.print_help()
return 0
elif args._help_advanced:
parser_advanced = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill - advanced options",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_advanced, mode="advanced")
parser_advanced.print_help()
return 0
elif args._help_all:
parser_all = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill - all options",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_all, mode="all")
parser_all.print_help()
return 0
_HELP_MODES = {
"_help_web": ("web", "Create skill from web documentation"),
"_help_github": ("github", "Create skill from GitHub repository"),
"_help_local": ("local", "Create skill from local codebase"),
"_help_pdf": ("pdf", "Create skill from PDF file"),
"_help_word": ("word", "Create skill from Word document (.docx)"),
"_help_epub": ("epub", "Create skill from EPUB e-book (.epub)"),
"_help_video": ("video", "Create skill from video (YouTube, Vimeo, local files)"),
"_help_config": ("config", "Create skill from multi-source config file (unified scraper)"),
"_help_advanced": ("advanced", "Create skill - advanced options"),
"_help_all": ("all", "Create skill - all options"),
}
for attr, (mode, description) in _HELP_MODES.items():
if getattr(args, attr, False):
help_parser = argparse.ArgumentParser(
prog="skill-seekers create",
description=description,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(help_parser, mode=mode)
help_parser.print_help()
return 0
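The table-driven dispatch that replaces the per-mode `elif` chain can be demonstrated in miniature; the names and two-entry table here are illustrative, not the full CLI:

```python
from types import SimpleNamespace

# Maps the hidden help-flag attribute to (mode, description), as in _HELP_MODES.
HELP_MODES = {
    "_help_web": ("web", "Create skill from web documentation"),
    "_help_pdf": ("pdf", "Create skill from PDF file"),
}

def pick_help_mode(args):
    # First matching flag wins; None means no source-specific help requested.
    for attr, (mode, description) in HELP_MODES.items():
        if getattr(args, attr, False):
            return mode, description
    return None

args = SimpleNamespace(_help_pdf=True)
assert pick_help_mode(args) == ("pdf", "Create skill from PDF file")
assert pick_help_mode(SimpleNamespace()) is None
```

Adding a new source type then means one table entry instead of a nine-line `elif` branch.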
# Setup logging
log_level = logging.DEBUG if args.verbose else (logging.WARNING if args.quiet else logging.INFO)

src/skill_seekers/cli/doc_scraper.py Executable file → Normal file

@@ -46,7 +46,7 @@ from skill_seekers.cli.language_detector import LanguageDetector
from skill_seekers.cli.llms_txt_detector import LlmsTxtDetector
from skill_seekers.cli.llms_txt_downloader import LlmsTxtDownloader
from skill_seekers.cli.llms_txt_parser import LlmsTxtParser
from skill_seekers.cli.arguments.scrape import add_scrape_arguments
from skill_seekers.cli.skill_converter import SkillConverter
from skill_seekers.cli.utils import sanitize_url, setup_logging
# Configure logging
@@ -152,8 +152,11 @@ def infer_description_from_docs(
)
class DocToSkillConverter:
class DocToSkillConverter(SkillConverter):
SOURCE_TYPE = "web"
def __init__(self, config: dict[str, Any], dry_run: bool = False, resume: bool = False) -> None:
super().__init__(config)
self.config = config
self.name = config["name"]
self.base_url = config["base_url"]
@@ -1943,6 +1946,10 @@ To refresh this skill with updated documentation:
logger.info(" ✓ index.md")
def extract(self):
"""SkillConverter interface — delegates to scrape_all()."""
self.scrape_all()
def build_skill(self) -> bool:
"""Build the skill from scraped data.
@@ -2209,495 +2216,130 @@ def load_config(config_path: str) -> dict[str, Any]:
return config
def interactive_config() -> dict[str, Any]:
"""Interactive configuration wizard for creating new configs.
Prompts user for all required configuration fields step-by-step
and returns a complete configuration dictionary.
Returns:
dict: Complete configuration dictionary with user-provided values
Example:
>>> config = interactive_config()
# User enters: name=react, url=https://react.dev, etc.
>>> config['name']
'react'
"""
logger.info("\n" + "=" * 60)
logger.info("Documentation to Skill Converter")
logger.info("=" * 60 + "\n")
config: dict[str, Any] = {}
# Basic info
config["name"] = input("Skill name (e.g., 'react', 'godot'): ").strip()
config["description"] = input("Skill description: ").strip()
config["base_url"] = input("Base URL (e.g., https://docs.example.com/): ").strip()
if not config["base_url"].endswith("/"):
config["base_url"] += "/"
# Selectors
logger.info("\nCSS Selectors (press Enter for defaults):")
selectors = {}
selectors["main_content"] = (
input(" Main content [div[role='main']]: ").strip() or "div[role='main']"
)
selectors["title"] = input(" Title [title]: ").strip() or "title"
selectors["code_blocks"] = input(" Code blocks [pre code]: ").strip() or "pre code"
config["selectors"] = selectors
# URL patterns
logger.info("\nURL Patterns (comma-separated, optional):")
include = input(" Include: ").strip()
exclude = input(" Exclude: ").strip()
config["url_patterns"] = {
"include": [p.strip() for p in include.split(",") if p.strip()],
"exclude": [p.strip() for p in exclude.split(",") if p.strip()],
}
# Settings
rate = input(f"\nRate limit (seconds) [{DEFAULT_RATE_LIMIT}]: ").strip()
config["rate_limit"] = float(rate) if rate else DEFAULT_RATE_LIMIT
max_p = input(f"Max pages [{DEFAULT_MAX_PAGES}]: ").strip()
config["max_pages"] = int(max_p) if max_p else DEFAULT_MAX_PAGES
return config
def check_existing_data(name: str) -> tuple[bool, int]:
"""Check if scraped data already exists for a skill.
Args:
name (str): Skill name to check
Returns:
tuple: (exists, page_count) where exists is bool and page_count is int
Example:
>>> exists, count = check_existing_data('react')
>>> if exists:
... print(f"Found {count} existing pages")
"""
data_dir = f"output/{name}_data"
if os.path.exists(data_dir) and os.path.exists(f"{data_dir}/summary.json"):
with open(f"{data_dir}/summary.json", encoding="utf-8") as f:
summary = json.load(f)
return True, summary.get("total_pages", 0)
return False, 0
def scrape_documentation(
config: dict[str, Any],
ctx: Any | None = None,
verbose: bool = False,
quiet: bool = False,
) -> int:
"""Scrape documentation using config and optional context.
This is the main entry point for programmatic use. CLI main() is a thin
wrapper around this function.
Args:
config: Configuration dictionary with required fields (name, base_url, etc.)
ctx: Optional ExecutionContext for shared configuration
verbose: Enable verbose logging
quiet: Minimize logging output
Returns:
Exit code (0 for success, non-zero for error)
"""
from skill_seekers.cli.execution_context import ExecutionContext
# Setup logging
setup_logging(verbose=verbose, quiet=quiet)
def setup_argument_parser() -> argparse.ArgumentParser:
"""Setup and configure command-line argument parser.
Creates an ArgumentParser with all CLI options for the doc scraper tool,
including configuration, scraping, enhancement, and performance options.
All arguments are defined in skill_seekers.cli.arguments.scrape to ensure
consistency between the standalone scraper and unified CLI.
Returns:
argparse.ArgumentParser: Configured argument parser
Example:
>>> parser = setup_argument_parser()
>>> args = parser.parse_args(['--config', 'configs/react.json'])
>>> print(args.config)
configs/react.json
"""
parser = argparse.ArgumentParser(
description="Convert documentation websites to AI skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
# Add all scrape arguments from shared definitions
# This ensures the standalone scraper and unified CLI stay in sync
add_scrape_arguments(parser)
return parser
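The `check_existing_data` contract (a `summary.json` with `total_pages` under `output/<name>_data/`) can be exercised against a temp directory; the `check_existing` helper and its `base` parameter are illustrative additions so the sketch avoids touching the real `output/` tree:

```python
import json
import os
import tempfile

def check_existing(base, name):
    # Mirrors the summary.json lookup: returns (exists, page_count).
    summary_path = os.path.join(base, f"{name}_data", "summary.json")
    if os.path.exists(summary_path):
        with open(summary_path, encoding="utf-8") as f:
            return True, json.load(f).get("total_pages", 0)
    return False, 0

with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "react_data"))
    with open(os.path.join(base, "react_data", "summary.json"), "w") as f:
        json.dump({"total_pages": 42}, f)
    assert check_existing(base, "react") == (True, 42)
    assert check_existing(base, "vue") == (False, 0)
```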
def get_configuration(args: argparse.Namespace) -> dict[str, Any]:
"""Load or create configuration from command-line arguments.
Handles three configuration modes:
1. Load from JSON file (--config)
2. Interactive configuration wizard (--interactive or missing args)
3. Quick mode from command-line arguments (--name, --url)
Also applies CLI overrides for rate limiting and worker count.
Args:
args: Parsed command-line arguments from argparse
Returns:
dict: Configuration dictionary with all required fields
Example:
>>> args = parser.parse_args(['--name', 'react', '--url', 'https://react.dev'])
>>> config = get_configuration(args)
>>> print(config['name'])
react
"""
# Handle URL from either positional argument or --url flag
# Positional 'url' takes precedence, then --url flag
effective_url = getattr(args, "url", None)
# Get base configuration
if args.config:
config = load_config(args.config)
elif args.interactive or not (args.name and effective_url):
config = interactive_config()
else:
config = {
"name": args.name,
"description": args.description or f"Use when working with {args.name}",
"base_url": effective_url,
"selectors": {
"title": "title",
"code_blocks": "pre code",
},
"url_patterns": {"include": [], "exclude": []},
"rate_limit": DEFAULT_RATE_LIMIT,
"max_pages": DEFAULT_MAX_PAGES,
}
# Apply CLI override for doc_version (works for all config modes)
cli_doc_version = getattr(args, "doc_version", "")
if cli_doc_version:
config["doc_version"] = cli_doc_version
# Apply CLI overrides for rate limiting
if args.no_rate_limit:
config["rate_limit"] = 0
logger.info("⚡ Rate limiting disabled")
elif args.rate_limit is not None:
config["rate_limit"] = args.rate_limit
if args.rate_limit == 0:
logger.info("⚡ Rate limiting disabled")
else:
logger.info("⚡ Rate limit override: %ss per page", args.rate_limit)
# Apply CLI overrides for worker count
if args.workers:
# Validate workers count
if args.workers < 1:
logger.error("❌ Error: --workers must be at least 1 (got %d)", args.workers)
logger.error(" Suggestion: Use --workers 1 (default) or omit the flag")
sys.exit(1)
if args.workers > 10:
logger.warning("⚠️ Warning: --workers capped at 10 (requested %d)", args.workers)
args.workers = 10
config["workers"] = args.workers
if args.workers > 1:
logger.info("🚀 Parallel scraping enabled: %d workers", args.workers)
# Apply CLI override for async mode
if args.async_mode:
config["async_mode"] = True
if config.get("workers", 1) > 1:
logger.info("⚡ Async mode enabled (2-3x faster than threads)")
else:
logger.warning(
"⚠️ Async mode enabled but workers=1. Consider using --workers 4 for better performance"
)
# Apply CLI override for browser mode
if getattr(args, "browser", False):
config["browser"] = True
logger.info("🌐 Browser mode enabled (Playwright headless Chromium)")
# Apply CLI override for max_pages
if args.max_pages is not None:
old_max = config.get("max_pages", DEFAULT_MAX_PAGES)
config["max_pages"] = args.max_pages
# Warnings for --max-pages usage
if args.max_pages > 1000:
logger.warning(
"⚠️ --max-pages=%d is very high - scraping may take hours", args.max_pages
)
logger.warning(" Recommendation: Use configs with reasonable limits for production")
elif args.max_pages < 10:
logger.warning(
"⚠️ --max-pages=%d is very low - may result in incomplete skill", args.max_pages
)
if old_max and old_max != args.max_pages:
logger.info(
"📊 Max pages override: %d → %d (from --max-pages flag)", old_max, args.max_pages
)
else:
logger.info("📊 Max pages set to: %d (from --max-pages flag)", args.max_pages)
return config
# Use existing context if already initialized, otherwise create one
if ctx is None:
if ExecutionContext._initialized:
ctx = ExecutionContext.get()
else:
ctx = ExecutionContext.initialize(args=argparse.Namespace(**config))
# Build converter and execute
try:
converter = _run_scraping(config)
if converter is None:
return 1
# Handle enhancement if enabled
if ctx.enhancement.enabled and ctx.enhancement.level > 0:
_run_enhancement(config, ctx, converter)
return 0
except Exception as e:
logger.error(f"Scraping failed: {e}")
return 1
def execute_scraping_and_building(
config: dict[str, Any], args: argparse.Namespace
) -> Optional["DocToSkillConverter"]:
"""Execute the scraping and skill building process.
Handles dry run mode, existing data checks, scraping with checkpoints,
keyboard interrupts, and skill building. This is the core workflow
orchestration for the scraping phase.
Args:
config (dict): Configuration dictionary with scraping parameters
args: Parsed command-line arguments
Returns:
DocToSkillConverter: The converter instance after scraping/building,
or None if process was aborted
Example:
>>> config = {'name': 'react', 'base_url': 'https://react.dev'}
>>> converter = execute_scraping_and_building(config, args)
>>> if converter:
... print("Scraping complete!")
"""
# Dry run mode - preview only
if args.dry_run:
logger.info("\n" + "=" * 60)
logger.info("DRY RUN MODE")
logger.info("=" * 60)
logger.info("This will show what would be scraped without saving anything.\n")
converter = DocToSkillConverter(config, dry_run=True)
converter.scrape_all()
logger.info("\n📋 Configuration Summary:")
logger.info(" Name: %s", config["name"])
logger.info(" Base URL: %s", config["base_url"])
logger.info(" Max pages: %d", config.get("max_pages", DEFAULT_MAX_PAGES))
logger.info(" Rate limit: %ss", config.get("rate_limit", DEFAULT_RATE_LIMIT))
logger.info(" Categories: %d", len(config.get("categories", {})))
return None
# Check for existing data
exists, page_count = check_existing_data(config["name"])
if exists and not args.skip_scrape and not args.fresh:
# Check force_rescrape flag from config
if config.get("force_rescrape", False):
# Auto-delete cached data and rescrape
logger.info("\n✓ Found existing data: %d pages", page_count)
logger.info(" force_rescrape enabled - deleting cached data and rescraping")
import shutil
data_dir = f"output/{config['name']}_data"
if os.path.exists(data_dir):
shutil.rmtree(data_dir)
logger.info(f" Deleted: {data_dir}")
else:
# Only prompt if force_rescrape is False
logger.info("\n✓ Found existing data: %d pages", page_count)
response = input("Use existing data? (y/n): ").strip().lower()
if response == "y":
args.skip_scrape = True
elif exists and args.fresh:
logger.info("\n✓ Found existing data: %d pages", page_count)
logger.info(" --fresh flag set, will re-scrape from scratch")
def _run_scraping(config: dict[str, Any]) -> Optional["DocToSkillConverter"]:
"""Run the scraping process."""
# Create converter
converter = DocToSkillConverter(config, resume=args.resume)
converter = DocToSkillConverter(config)
# Initialize workflow tracking (will be updated if workflow runs)
converter.workflow_executed = False
converter.workflow_name = None
# Handle fresh start (clear checkpoint)
if args.fresh:
converter.clear_checkpoint()
# Scrape or skip
if not args.skip_scrape:
try:
converter.scrape_all()
# Save final checkpoint
if converter.checkpoint_enabled:
converter.save_checkpoint()
logger.info("\n💾 Final checkpoint saved")
# Clear checkpoint after successful completion
converter.clear_checkpoint()
logger.info("✅ Scraping complete - checkpoint cleared")
except KeyboardInterrupt:
logger.warning("\n\nScraping interrupted.")
if converter.checkpoint_enabled:
converter.save_checkpoint()
logger.info("💾 Progress saved to checkpoint")
logger.info(
" Resume with: --config %s --resume",
args.config if args.config else "config.json",
)
response = input("Continue with skill building? (y/n): ").strip().lower()
if response != "y":
return None
else:
logger.info("\n⏭️ Skipping scrape, using existing data")
# Check for resume
if config.get("resume") and converter.checkpoint_exists():
logger.info("📂 Resuming from checkpoint...")
converter.load_checkpoint()
# Clear checkpoint if fresh start
if config.get("fresh"):
converter.clear_checkpoint()
# Scrape
if not config.get("skip_scrape"):
logger.info("\n🔍 Starting scrape...")
try:
asyncio.run(converter.scrape())
except KeyboardInterrupt:
logger.info("\n\n⚠️ Interrupted by user")
converter.save_checkpoint()
logger.info("💾 Checkpoint saved. Resume with --resume")
return None
# Build skill
success = converter.build_skill()
if not success:
sys.exit(1)
# RAG chunking (optional - NEW v2.10.0)
if args.chunk_for_rag:
logger.info("\n" + "=" * 60)
logger.info("🔪 Generating RAG chunks...")
logger.info("=" * 60)
from skill_seekers.cli.rag_chunker import RAGChunker
chunker = RAGChunker(
chunk_size=args.chunk_tokens,
chunk_overlap=args.chunk_overlap_tokens,
preserve_code_blocks=not args.no_preserve_code_blocks,
preserve_paragraphs=not args.no_preserve_paragraphs,
)
# Chunk the skill
skill_dir = Path(converter.skill_dir)
chunks = chunker.chunk_skill(skill_dir)
# Save chunks
chunks_path = skill_dir / "rag_chunks.json"
chunker.save_chunks(chunks, chunks_path)
logger.info(f"✅ Generated {len(chunks)} RAG chunks")
logger.info(f"📄 Saved to: {chunks_path}")
logger.info("💡 Use with LangChain: --target langchain")
logger.info("💡 Use with LlamaIndex: --target llama-index")
# ============================================================
# WORKFLOW SYSTEM INTEGRATION (Phase 2 - doc_scraper)
# ============================================================
from skill_seekers.cli.workflow_runner import run_workflows
# Pass doc-scraper-specific context to workflows
doc_context = {
"name": config["name"],
"base_url": config.get("base_url", ""),
"description": config.get("description", ""),
}
workflow_executed, workflow_names = run_workflows(args, context=doc_context)
# Store workflow execution status on converter for execute_enhancement() to access
converter.workflow_executed = workflow_executed
converter.workflow_name = ", ".join(workflow_names) if workflow_names else None
logger.info("\n📦 Building skill...")
converter.build_skill()
return converter
def _run_enhancement(
config: dict[str, Any],
ctx: Any,
_converter: Any,
) -> None:
"""Run enhancement using context settings."""
from pathlib import Path
skill_dir = f"output/{config['name']}"
logger.info("\n" + "=" * 60)
logger.info(f"🤖 Enhancing SKILL.md (level {ctx.enhancement.level})")
logger.info("=" * 60)
# Use AgentClient from context
try:
agent_client = ctx.get_agent_client()
# Run enhancement based on mode
if agent_client.mode == "api" and agent_client.client:
# API mode enhancement
from skill_seekers.cli.enhance_skill import enhance_skill_md
# Use AgentClient's API key detection (respects priority: CLI > config > env)
api_key = ctx.enhancement.api_key or agent_client.api_key
if api_key:
enhance_skill_md(skill_dir, api_key)
logger.info("✅ API enhancement complete!")
else:
logger.warning("⚠️ No API key available for enhancement")
else:
# Local mode enhancement
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
enhancer = LocalSkillEnhancer(
Path(skill_dir),
agent=ctx.enhancement.agent,
agent_cmd=ctx.enhancement.agent_cmd,
)
def execute_enhancement(config: dict[str, Any], args: argparse.Namespace, converter=None) -> None:
"""Execute optional SKILL.md enhancement with AI.
Supports two enhancement modes:
1. API-based enhancement (requires ANTHROPIC_API_KEY)
2. Local enhancement using a coding agent CLI (no API key needed)
Prints appropriate messages and suggestions based on whether
enhancement was requested and whether it succeeded.
Args:
config (dict): Configuration dictionary with skill name
args: Parsed command-line arguments with enhancement flags
converter: Optional DocToSkillConverter instance (to check workflow status)
Example:
>>> execute_enhancement(config, args)
# Runs enhancement if --enhance or --enhance-local flag is set
"""
import subprocess
# Check if workflow was already executed (for logging context)
workflow_executed = (
converter and hasattr(converter, "workflow_executed") and converter.workflow_executed
)
workflow_name = converter.workflow_name if workflow_executed else None
# Optional enhancement with auto-detected mode (API or LOCAL)
# Note: Runs independently of workflow system (they complement each other)
if getattr(args, "enhance_level", 0) > 0:
import os
has_api_key = bool(os.environ.get("ANTHROPIC_API_KEY") or args.api_key)
mode = "API" if has_api_key else "LOCAL"
logger.info("\n" + "=" * 80)
logger.info(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
logger.info("=" * 80)
if workflow_executed:
logger.info(f" Running after workflow: {workflow_name}")
logger.info(
" (Workflow provides specialized analysis, enhancement provides general improvements)"
)
logger.info("")
try:
enhance_cmd = ["skill-seekers-enhance", f"output/{config['name']}/"]
if args.api_key:
enhance_cmd.extend(["--api-key", args.api_key])
if getattr(args, "agent", None):
enhance_cmd.extend(["--agent", args.agent])
if getattr(args, "interactive_enhancement", False):
enhance_cmd.append("--interactive-enhancement")
result = subprocess.run(enhance_cmd, check=True)
if result.returncode == 0:
logger.info("\n✅ Enhancement complete!")
except subprocess.CalledProcessError:
logger.warning("\n⚠ Enhancement failed, but skill was still built")
except FileNotFoundError:
logger.warning("\n⚠ skill-seekers-enhance command not found. Run manually:")
logger.info(
" skill-seekers enhance output/%s/",
config["name"],
)
# Print packaging instructions
logger.info("\n📦 Package your skill:")
logger.info(" skill-seekers-package output/%s/", config["name"])
# Suggest enhancement if not done
if getattr(args, "enhance_level", 0) == 0:
logger.info("\n💡 Optional: Enhance SKILL.md with AI:")
logger.info(" skill-seekers enhance output/%s/", config["name"])
logger.info(" or re-run with: --enhance-level 2 (auto-detects API vs LOCAL mode)")
logger.info(
" API-based: skill-seekers-enhance-api output/%s/",
config["name"],
)
logger.info(" or re-run with: --enhance")
logger.info(
"\n💡 Tip: Use --interactive-enhancement with --enhance-local to open terminal window"
)
def main() -> None:
parser = setup_argument_parser()
args = parser.parse_args()
# Setup logging based on verbosity flags
setup_logging(verbose=args.verbose, quiet=args.quiet)
config = get_configuration(args)
# Execute scraping and building
converter = execute_scraping_and_building(config, args)
# Exit if dry run or aborted
if converter is None:
return
# Execute enhancement and print instructions (pass converter for workflow status check)
execute_enhancement(config, args, converter)
if __name__ == "__main__":
main()
success = enhancer.run(headless=True, timeout=ctx.enhancement.timeout)
if success:
agent_name = ctx.enhancement.agent or "claude"
logger.info(f"✅ Local enhancement complete! (via {agent_name})")
else:
logger.warning("⚠️ Local enhancement did not complete")
except Exception as e:
logger.warning(f"⚠️ Enhancement failed: {e}")

View File

@@ -200,11 +200,22 @@ class LocalSkillEnhancer:
raise ValueError(f"Executable '{executable}' not found in PATH")
def _resolve_agent(self, agent, agent_cmd):
# Priority: explicit param > ExecutionContext > env var > default
try:
from skill_seekers.cli.execution_context import ExecutionContext
ctx = ExecutionContext.get()
ctx_agent = ctx.enhancement.agent or ""
ctx_cmd = ctx.enhancement.agent_cmd or ""
except Exception:
ctx_agent = ""
ctx_cmd = ""
env_agent = os.environ.get("SKILL_SEEKER_AGENT", "").strip()
env_cmd = os.environ.get("SKILL_SEEKER_AGENT_CMD", "").strip()
agent_name = _normalize_agent_name(agent or ctx_agent or env_agent or "claude")
cmd_override = agent_cmd or ctx_cmd or env_cmd or None
if agent_name == "custom":
if not cmd_override:

View File

@@ -10,12 +10,10 @@ Usage:
skill-seekers epub --from-json book_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard
@@ -30,6 +28,8 @@ except ImportError:
# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment
from .skill_converter import SkillConverter
logger = logging.getLogger(__name__)
@@ -68,10 +68,13 @@ def infer_description_from_epub(metadata: dict | None = None, name: str = "") ->
)
class EpubToSkillConverter(SkillConverter):
"""Convert EPUB e-book to AI skill."""
SOURCE_TYPE = "epub"
def __init__(self, config):
super().__init__(config)
self.config = config
self.name = config["name"]
self.epub_path = config.get("epub_path", "")
@@ -89,6 +92,10 @@ class EpubToSkillConverter:
# Extracted data
self.extracted_data = None
def extract(self):
"""SkillConverter interface — delegates to extract_epub()."""
return self.extract_epub()
def extract_epub(self):
"""Extract content from EPUB file.
@@ -1068,143 +1075,3 @@ def _score_code_quality(code: str) -> float:
score -= 2.0
return min(10.0, max(0.0, score))
def main():
from .arguments.epub import add_epub_arguments
parser = argparse.ArgumentParser(
description="Convert EPUB e-book to skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_epub_arguments(parser)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = getattr(args, "epub", None) or getattr(args, "from_json", None) or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: EPUB Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return 0
# Validate inputs
if not (getattr(args, "epub", None) or getattr(args, "from_json", None)):
parser.error("Must specify --epub or --from-json")
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} documentation",
}
try:
converter = EpubToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Direct EPUB mode
if not getattr(args, "name", None):
# Auto-detect name from filename
args.name = Path(args.epub).stem
config = {
"name": args.name,
"epub_path": args.epub,
# Pass None so extract_epub() can infer from EPUB metadata
"description": getattr(args, "description", None),
}
try:
converter = EpubToSkillConverter(config)
# Extract
if not converter.extract_epub():
print("\n❌ EPUB extraction failed - see error above", file=sys.stderr)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
import os
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis,"
" enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during EPUB processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())
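The `--from-json` flow above derives the skill name from the file stem; a small sketch of that rule (helper name hypothetical):

```python
from pathlib import Path


def name_from_extracted_json(json_path: str) -> str:
    # "react_docs_extracted.json" -> "react_docs", same rule as main() above
    return Path(json_path).stem.replace("_extracted", "")


print(name_from_extracted_json("react_docs_extracted.json"))  # react_docs
print(name_from_extracted_json("book.json"))                  # book (no suffix to strip)
```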

View File

@@ -0,0 +1,549 @@
"""ExecutionContext - Single source of truth for all configuration.
This module provides a singleton context object that holds all resolved
configuration from CLI args, config files, and environment variables.
All components read from this context instead of parsing their own argv.
Example:
>>> from skill_seekers.cli.execution_context import ExecutionContext
>>> ctx = ExecutionContext.initialize(args=parsed_args)
>>> ctx = ExecutionContext.get() # Get initialized instance
>>> print(ctx.output.name)
>>> print(ctx.enhancement.agent)
"""
from __future__ import annotations
import contextlib
import json
import logging
import os
import threading
from pathlib import Path
from typing import Any, ClassVar, Literal
from collections.abc import Generator
from pydantic import BaseModel, Field, PrivateAttr
logger = logging.getLogger(__name__)
class SourceInfoConfig(BaseModel):
"""Source detection results."""
type: str = Field(..., description="Source type (web, github, pdf, etc.)")
raw_source: str = Field(..., description="Original user input")
parsed: dict[str, Any] = Field(default_factory=dict, description="Parsed source details")
suggested_name: str = Field(default="", description="Auto-generated skill name")
class EnhancementSettings(BaseModel):
"""AI enhancement configuration."""
model_config = {
"json_schema_extra": {
"example": {
"enabled": True,
"level": 2,
"mode": "auto",
"agent": "kimi",
"timeout": 2700,
}
}
}
enabled: bool = Field(default=True, description="Whether enhancement is enabled")
level: int = Field(default=2, ge=0, le=3, description="Enhancement level (0-3)")
mode: str = Field(default="auto", description="Mode: api, local, or auto")
agent: str | None = Field(default=None, description="Local agent name (claude, kimi, etc.)")
agent_cmd: str | None = Field(default=None, description="Custom agent command override")
api_key: str | None = Field(default=None, description="API key for enhancement")
timeout: int = Field(default=2700, description="Timeout in seconds (default: 45min)")
workflows: list[str] = Field(default_factory=list, description="Enhancement workflow names")
stages: list[str] = Field(default_factory=list, description="Inline enhancement stages")
workflow_vars: dict[str, str] = Field(default_factory=dict, description="Workflow variables")
class OutputSettings(BaseModel):
"""Output configuration."""
model_config = {
"json_schema_extra": {
"example": {
"name": "react-docs",
"doc_version": "18.2",
"dry_run": False,
}
}
}
name: str | None = Field(default=None, description="Skill name")
output_dir: str | None = Field(default=None, description="Output directory override")
doc_version: str = Field(default="", description="Documentation version tag")
dry_run: bool = Field(default=False, description="Preview mode without execution")
class ScrapingSettings(BaseModel):
"""Web scraping configuration."""
max_pages: int | None = Field(default=None, description="Maximum pages to scrape")
rate_limit: float | None = Field(default=None, description="Rate limit in seconds")
browser: bool = Field(default=False, description="Use headless browser for JS sites")
browser_wait_until: str = Field(
default="domcontentloaded", description="Browser wait condition"
)
browser_extra_wait: int = Field(default=0, description="Extra wait time in ms after page load")
workers: int = Field(default=1, description="Number of parallel workers")
async_mode: bool = Field(default=False, description="Enable async mode")
resume: bool = Field(default=False, description="Resume from checkpoint")
fresh: bool = Field(default=False, description="Clear checkpoint and start fresh")
skip_scrape: bool = Field(default=False, description="Skip scraping, use existing data")
languages: list[str] = Field(default_factory=lambda: ["en"], description="Language preferences")
class AnalysisSettings(BaseModel):
"""Code analysis configuration."""
depth: Literal["surface", "deep", "full"] = Field(
default="surface", description="Analysis depth: surface, deep, full"
)
skip_patterns: bool = Field(default=False, description="Skip design pattern detection")
skip_test_examples: bool = Field(default=False, description="Skip test example extraction")
skip_how_to_guides: bool = Field(default=False, description="Skip how-to guide generation")
skip_config_patterns: bool = Field(default=False, description="Skip config pattern extraction")
skip_api_reference: bool = Field(default=False, description="Skip API reference generation")
skip_dependency_graph: bool = Field(default=False, description="Skip dependency graph")
skip_docs: bool = Field(default=False, description="Skip documentation extraction")
no_comments: bool = Field(default=False, description="Skip comment extraction")
file_patterns: list[str] | None = Field(default=None, description="File patterns to analyze")
class RAGSettings(BaseModel):
"""RAG (Retrieval-Augmented Generation) configuration."""
chunk_for_rag: bool = Field(default=False, description="Enable semantic chunking")
chunk_tokens: int = Field(default=512, description="Chunk size in tokens")
chunk_overlap_tokens: int = Field(default=50, description="Overlap between chunks")
preserve_code_blocks: bool = Field(default=True, description="Don't split code blocks")
preserve_paragraphs: bool = Field(default=True, description="Respect paragraph boundaries")
class ExecutionContext(BaseModel):
"""Single source of truth for all execution configuration.
This is a singleton - use ExecutionContext.get() to access the instance.
Initialize once at entry point with ExecutionContext.initialize().
Example:
>>> ctx = ExecutionContext.initialize(args=parsed_args)
>>> ctx = ExecutionContext.get() # Get initialized instance
>>> print(ctx.output.name)
"""
model_config = {
"json_schema_extra": {
"example": {
"source": {"type": "web", "raw_source": "https://react.dev/"},
"enhancement": {"level": 2, "agent": "kimi"},
"output": {"name": "react-docs"},
}
}
}
# Configuration sections
source: SourceInfoConfig | None = Field(default=None, description="Source information")
enhancement: EnhancementSettings = Field(default_factory=EnhancementSettings)
output: OutputSettings = Field(default_factory=OutputSettings)
scraping: ScrapingSettings = Field(default_factory=ScrapingSettings)
analysis: AnalysisSettings = Field(default_factory=AnalysisSettings)
rag: RAGSettings = Field(default_factory=RAGSettings)
# Private attributes
_raw_args: dict[str, Any] = PrivateAttr(default_factory=dict)
_config_path: str | None = PrivateAttr(default=None)
# Singleton storage (class-level)
_instance: ClassVar[ExecutionContext | None] = None
_lock: ClassVar[threading.Lock] = threading.Lock()
_initialized: ClassVar[bool] = False
@classmethod
def get(cls) -> ExecutionContext:
"""Get the singleton instance (thread-safe).
Returns a default context if not explicitly initialized.
This ensures components can always read from the context
without try/except blocks.
"""
with cls._lock:
if cls._instance is None:
cls._instance = cls()
logger.debug("ExecutionContext auto-initialized with defaults")
return cls._instance
@classmethod
def initialize(
cls,
args: Any | None = None,
config_path: str | None = None,
source_info: Any | None = None,
) -> ExecutionContext:
"""Initialize the singleton context.
Priority (highest to lowest):
1. CLI args (explicit user input)
2. Config file (JSON config)
3. Environment variables
4. Defaults
Args:
args: Parsed argparse.Namespace
config_path: Path to config JSON file
source_info: SourceInfo from source_detector
Returns:
Initialized ExecutionContext instance
"""
with cls._lock:
if cls._initialized:
logger.debug(
"ExecutionContext.initialize() called again — returning existing instance. "
"Use ExecutionContext.reset() first if re-initialization is intended."
)
return cls._instance
context_data = cls._build_from_sources(args, config_path, source_info)
cls._instance = cls.model_validate(context_data)
if args:
cls._instance._raw_args = vars(args)
cls._instance._config_path = config_path
cls._initialized = True
return cls._instance
@classmethod
def reset(cls) -> None:
"""Reset the singleton (mainly for testing)."""
with cls._lock:
cls._instance = None
cls._initialized = False
@classmethod
def _build_from_sources(
cls,
args: Any | None,
config_path: str | None,
source_info: Any | None,
) -> dict[str, Any]:
"""Build context dict from all configuration sources."""
# Start with defaults
data = cls._default_data()
# Layer 1: Config file
if config_path:
file_config = cls._load_config_file(config_path)
data = cls._deep_merge(data, file_config)
# Layer 2: CLI args (override config file)
if args:
arg_config = cls._args_to_data(args)
data = cls._deep_merge(data, arg_config)
# Layer 3: Source info
if source_info:
data["source"] = {
"type": source_info.type,
"raw_source": getattr(source_info, "raw_source", None)
or getattr(source_info, "raw_input", ""),
"parsed": source_info.parsed,
"suggested_name": source_info.suggested_name,
}
return data
@classmethod
def _default_data(cls) -> dict[str, Any]:
"""Get default configuration."""
from skill_seekers.cli.agent_client import get_default_timeout
return {
"enhancement": {
"enabled": True,
"level": 2,
# Env-var-based mode detection (lowest priority — CLI and config override this)
"mode": "api"
if any(
os.environ.get(k)
for k in (
"ANTHROPIC_API_KEY",
"OPENAI_API_KEY",
"MOONSHOT_API_KEY",
"GOOGLE_API_KEY",
)
)
else "auto",
"agent": os.environ.get("SKILL_SEEKER_AGENT"),
"agent_cmd": None,
"api_key": None,
"timeout": get_default_timeout(),
"workflows": [],
"stages": [],
"workflow_vars": {},
},
"output": {
"name": None,
"output_dir": None,
"doc_version": "",
"dry_run": False,
},
"scraping": {
"max_pages": None,
"rate_limit": None,
"browser": False,
"browser_wait_until": "domcontentloaded",
"browser_extra_wait": 0,
"workers": 1,
"async_mode": False,
"resume": False,
"fresh": False,
"skip_scrape": False,
"languages": ["en"],
},
"analysis": {
"depth": "surface",
"skip_patterns": False,
"skip_test_examples": False,
"skip_how_to_guides": False,
"skip_config_patterns": False,
"skip_api_reference": False,
"skip_dependency_graph": False,
"skip_docs": False,
"no_comments": False,
"file_patterns": None,
},
"rag": {
"chunk_for_rag": False,
"chunk_tokens": 512,
"chunk_overlap_tokens": 50,
"preserve_code_blocks": True,
"preserve_paragraphs": True,
},
}
@classmethod
def _load_config_file(cls, config_path: str) -> dict[str, Any]:
"""Load and normalize config file."""
path = Path(config_path)
try:
with open(path, encoding="utf-8") as f:
file_data = json.load(f)
except FileNotFoundError:
raise ValueError(f"Config file not found: {config_path}") from None
except json.JSONDecodeError as e:
raise ValueError(f"Invalid JSON in config file {config_path}: {e}") from None
config: dict[str, Any] = {}
# Unified config format (sources array)
if "sources" in file_data:
enhancement = file_data.get("enhancement", {})
# Handle timeout field (can be "unlimited" or integer)
timeout_val = enhancement.get("timeout", 2700)
if isinstance(timeout_val, str) and timeout_val.lower() in ("unlimited", "none"):
from skill_seekers.cli.agent_client import UNLIMITED_TIMEOUT
timeout_val = UNLIMITED_TIMEOUT
config["output"] = {
"name": file_data.get("name"),
"doc_version": file_data.get("version", ""),
}
config["enhancement"] = {
"enabled": enhancement.get("enabled", True),
"level": enhancement.get("level", 2),
"mode": enhancement.get("mode", "auto").lower(),
"agent": enhancement.get("agent"),
"timeout": timeout_val,
"workflows": file_data.get("workflows", []),
"stages": file_data.get("workflow_stages", []),
"workflow_vars": file_data.get("workflow_vars", {}),
}
# Simple web config format
elif "base_url" in file_data:
config["output"] = {
"name": file_data.get("name"),
"doc_version": file_data.get("version", ""),
}
config["scraping"] = {
"max_pages": file_data.get("max_pages"),
"rate_limit": file_data.get("rate_limit"),
"browser": file_data.get("browser", False),
}
return config
@classmethod
def _args_to_data(cls, args: Any) -> dict[str, Any]:
"""Convert argparse.Namespace to config dict."""
config: dict[str, Any] = {}
# Output
if hasattr(args, "name") and args.name is not None:
config.setdefault("output", {})["name"] = args.name
if hasattr(args, "output") and args.output is not None:
config.setdefault("output", {})["output_dir"] = args.output
if hasattr(args, "doc_version") and args.doc_version:
config.setdefault("output", {})["doc_version"] = args.doc_version
if getattr(args, "dry_run", False):
config.setdefault("output", {})["dry_run"] = True
# Enhancement
if hasattr(args, "enhance_level") and args.enhance_level is not None:
config.setdefault("enhancement", {})["level"] = args.enhance_level
if getattr(args, "agent", None):
config.setdefault("enhancement", {})["agent"] = args.agent
if getattr(args, "agent_cmd", None):
config.setdefault("enhancement", {})["agent_cmd"] = args.agent_cmd
if getattr(args, "api_key", None):
config.setdefault("enhancement", {})["api_key"] = args.api_key
# Resolve mode from explicit CLI flags:
# --api-key → "api", --agent (without --api-key) → "local".
# Env-var-based mode detection belongs in _default_data(), not here,
# to preserve the priority: CLI args > Config file > Env vars > Defaults.
if getattr(args, "api_key", None):
config.setdefault("enhancement", {})["mode"] = "api"
elif getattr(args, "agent", None):
config.setdefault("enhancement", {})["mode"] = "local"
# Workflows
if getattr(args, "enhance_workflow", None):
config.setdefault("enhancement", {})["workflows"] = list(args.enhance_workflow)
if getattr(args, "enhance_stage", None):
config.setdefault("enhancement", {})["stages"] = list(args.enhance_stage)
if getattr(args, "var", None):
config.setdefault("enhancement", {})["workflow_vars"] = cls._parse_vars(args.var)
# Scraping
if hasattr(args, "max_pages") and args.max_pages is not None:
config.setdefault("scraping", {})["max_pages"] = args.max_pages
if hasattr(args, "rate_limit") and args.rate_limit is not None:
config.setdefault("scraping", {})["rate_limit"] = args.rate_limit
if getattr(args, "browser", False):
config.setdefault("scraping", {})["browser"] = True
if hasattr(args, "workers") and args.workers:
config.setdefault("scraping", {})["workers"] = args.workers
if getattr(args, "async_mode", False):
config.setdefault("scraping", {})["async_mode"] = True
if getattr(args, "resume", False):
config.setdefault("scraping", {})["resume"] = True
if getattr(args, "fresh", False):
config.setdefault("scraping", {})["fresh"] = True
if getattr(args, "skip_scrape", False):
config.setdefault("scraping", {})["skip_scrape"] = True
# Analysis
if getattr(args, "depth", None):
config.setdefault("analysis", {})["depth"] = args.depth
if getattr(args, "skip_patterns", False):
config.setdefault("analysis", {})["skip_patterns"] = True
if getattr(args, "skip_test_examples", False):
config.setdefault("analysis", {})["skip_test_examples"] = True
if getattr(args, "skip_how_to_guides", False):
config.setdefault("analysis", {})["skip_how_to_guides"] = True
if getattr(args, "file_patterns", None):
config.setdefault("analysis", {})["file_patterns"] = [
p.strip() for p in args.file_patterns.split(",")
]
# RAG
if getattr(args, "chunk_for_rag", False):
config.setdefault("rag", {})["chunk_for_rag"] = True
if hasattr(args, "chunk_tokens") and args.chunk_tokens is not None:
config.setdefault("rag", {})["chunk_tokens"] = args.chunk_tokens
return config
@staticmethod
def _deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
"""Deep merge override into base."""
result = base.copy()
for key, value in override.items():
if isinstance(value, dict) and key in result and isinstance(result[key], dict):
result[key] = ExecutionContext._deep_merge(result[key], value)
else:
result[key] = value
return result
@staticmethod
def _parse_vars(var_list: list[str]) -> dict[str, str]:
"""Parse --var key=value into dict."""
result = {}
for var in var_list:
if "=" in var:
key, value = var.split("=", 1)
result[key] = value
return result
@property
def config_path(self) -> str | None:
"""Path to the config file used for initialization, if any."""
return self._config_path
def get_raw(self, name: str, default: Any = None) -> Any:
"""Get raw argument value (backward compatibility)."""
return self._raw_args.get(name, default)
def get_agent_client(self) -> Any:
"""Get configured AgentClient from context."""
from skill_seekers.cli.agent_client import AgentClient
return AgentClient(mode=self.enhancement.mode, agent=self.enhancement.agent)
@contextlib.contextmanager
def override(self, **kwargs: Any) -> Generator[ExecutionContext, None, None]:
"""Temporarily override context values.
Thread-safe: uses an override stack so nested/concurrent overrides
restore correctly regardless of ordering.
Usage:
with ctx.override(enhancement__level=3):
run_workflow() # Uses level 3
# Original values restored
"""
# Create new data with overrides
current_data = self.model_dump(exclude={"_raw_args"})
for key, value in kwargs.items():
if "__" in key:
parts = key.split("__")
target = current_data
for part in parts[:-1]:
target = target.setdefault(part, {})
target[parts[-1]] = value
else:
current_data[key] = value
# Create temporary instance and preserve _raw_args
temp_ctx = self.__class__.model_validate(current_data)
temp_ctx._raw_args = dict(self._raw_args) # Copy raw args to temp context
# Swap singleton atomically and save previous state on a stack
# so nested/concurrent overrides restore in the correct order.
with self.__class__._lock:
saved = (self.__class__._instance, self.__class__._initialized)
self.__class__._instance = temp_ctx
self.__class__._initialized = True
try:
yield temp_ctx
finally:
with self.__class__._lock:
self.__class__._instance = saved[0]
self.__class__._initialized = saved[1]
def get_context() -> ExecutionContext:
"""Shortcut for ExecutionContext.get()."""
return ExecutionContext.get()
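The defaults → config file → CLI layering implemented by `_build_from_sources` and `_deep_merge` can be exercised with a self-contained re-implementation of the same merge rule (standalone, not importing the package):

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into base, same rule as ExecutionContext._deep_merge."""
    result = base.copy()
    for key, value in override.items():
        if isinstance(value, dict) and key in result and isinstance(result[key], dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


# Lowest to highest priority: defaults, then config file, then CLI args
defaults = {"enhancement": {"level": 2, "mode": "auto"}, "output": {"name": None}}
file_cfg = {"enhancement": {"level": 3}, "output": {"name": "react-docs"}}
cli_args = {"enhancement": {"mode": "api"}}

merged = deep_merge(deep_merge(defaults, file_cfg), cli_args)
print(merged["enhancement"])  # {'level': 3, 'mode': 'api'}
```

Each layer overrides only the keys it actually sets, which is why the config file's `level` and the CLI's `mode` coexist in the result instead of one layer clobbering the other wholesale.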

View File

@@ -14,7 +14,6 @@ Usage:
skill-seekers github --repo owner/repo --token $GITHUB_TOKEN
"""
import argparse
import fnmatch
import itertools
import json
@@ -32,8 +31,7 @@ except ImportError:
print("Error: PyGithub not installed. Run: pip install PyGithub")
sys.exit(1)
from skill_seekers.cli.arguments.github import add_github_arguments
from skill_seekers.cli.utils import setup_logging
from skill_seekers.cli.skill_converter import SkillConverter
# Try to import pathspec for .gitignore support
try:
@@ -183,7 +181,7 @@ def extract_description_from_readme(readme_content: str, repo_name: str) -> str:
return f"Use when working with {project_name}"
class GitHubScraper(SkillConverter):
"""
GitHub Repository Scraper (C1.1-C1.9)
@@ -199,8 +197,11 @@ class GitHubScraper:
- Releases
"""
SOURCE_TYPE = "github"
def __init__(self, config: dict[str, Any], local_repo_path: str | None = None):
"""Initialize GitHub scraper with configuration."""
super().__init__(config)
self.config = config
self.repo_name = config["repo"]
self.name = config.get("name", self.repo_name.split("/")[-1])
@@ -353,6 +354,15 @@ class GitHubScraper:
logger.error(f"Unexpected error during scraping: {e}")
raise
def extract(self):
"""SkillConverter interface — delegates to scrape()."""
self.scrape()
def build_skill(self):
"""SkillConverter interface — delegates to GitHubToSkillConverter."""
converter = GitHubToSkillConverter(self.config)
converter.build_skill()
def _fetch_repository(self):
"""C1.1: Fetch repository structure using GitHub API."""
logger.info(f"Fetching repository: {self.repo_name}")
@@ -1379,186 +1389,3 @@ Use this skill when you need to:
with open(structure_path, "w", encoding="utf-8") as f:
f.write(content)
logger.info(f"Generated: {structure_path}")
def setup_argument_parser() -> argparse.ArgumentParser:
"""Setup and configure command-line argument parser.
Creates an ArgumentParser with all CLI options for the github scraper.
All arguments are defined in skill_seekers.cli.arguments.github to ensure
consistency between the standalone scraper and unified CLI.
Returns:
argparse.ArgumentParser: Configured argument parser
"""
parser = argparse.ArgumentParser(
description="GitHub Repository to AI Skill Converter",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
skill-seekers github --repo facebook/react
skill-seekers github --config configs/react_github.json
skill-seekers github --repo owner/repo --token $GITHUB_TOKEN
""",
)
# Add all github arguments from shared definitions
# This ensures the standalone scraper and unified CLI stay in sync
add_github_arguments(parser)
return parser
def main():
"""C1.10: CLI tool entry point."""
parser = setup_argument_parser()
args = parser.parse_args()
setup_logging(verbose=getattr(args, "verbose", False), quiet=getattr(args, "quiet", False))
# Handle --dry-run
if getattr(args, "dry_run", False):
repo = args.repo or (args.config and "(from config)")
print(f"\n{'=' * 60}")
print(f"DRY RUN: GitHub Repository Analysis")
print(f"{'=' * 60}")
print(f"Repository: {repo}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Include issues: {not getattr(args, 'no_issues', False)}")
print(f"Include releases: {not getattr(args, 'no_releases', False)}")
print(f"Include changelog: {not getattr(args, 'no_changelog', False)}")
print(f"Max issues: {getattr(args, 'max_issues', 100)}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"Profile: {getattr(args, 'profile', None) or '(default)'}")
print(f"\n✅ Dry run complete")
return 0
# Build config from args or file
if args.config:
with open(args.config, encoding="utf-8") as f:
config = json.load(f)
# Override with CLI args if provided
if args.non_interactive:
config["interactive"] = False
if args.profile:
config["github_profile"] = args.profile
elif args.repo:
config = {
"repo": args.repo,
"name": args.name or args.repo.split("/")[-1],
"description": args.description or f"Use when working with {args.repo.split('/')[-1]}",
"github_token": args.token,
"include_issues": not args.no_issues,
"include_changelog": not args.no_changelog,
"include_releases": not args.no_releases,
"max_issues": args.max_issues,
"interactive": not args.non_interactive,
"github_profile": args.profile,
"local_repo_path": getattr(args, "local_repo_path", None),
}
else:
parser.error("Either --repo or --config is required")
try:
# Phase 1: Scrape GitHub repository
scraper = GitHubScraper(config)
scraper.scrape()
if args.scrape_only:
logger.info("Scrape complete (--scrape-only mode)")
return
# Phase 2: Build skill
converter = GitHubToSkillConverter(config)
converter.build_skill()
skill_name = config.get("name", config["repo"].split("/")[-1])
skill_dir = f"output/{skill_name}"
# ============================================================
# WORKFLOW SYSTEM INTEGRATION (Phase 2 - github_scraper)
# ============================================================
from skill_seekers.cli.workflow_runner import run_workflows
# Pass GitHub-specific context to workflows
github_context = {
"repo": config.get("repo", ""),
"name": skill_name,
"description": config.get("description", ""),
}
workflow_executed, workflow_names = run_workflows(args, context=github_context)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Phase 3: Optional enhancement with auto-detected mode
# Note: Runs independently of workflow system (they complement each other)
if getattr(args, "enhance_level", 0) > 0:
import os
# Auto-detect mode based on API key availability
api_key = args.api_key or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
logger.info("\n" + "=" * 80)
logger.info(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
logger.info("=" * 80)
if workflow_executed:
logger.info(f" Running after workflow: {workflow_name}")
logger.info(
" (Workflow provides specialized analysis, enhancement provides general improvements)"
)
logger.info("")
if api_key:
# API-based enhancement
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
logger.info("✅ API enhancement complete!")
except ImportError:
logger.error("❌ API enhancement not available. Install: pip install anthropic")
logger.info("💡 Falling back to LOCAL mode...")
# Fall back to LOCAL mode
from pathlib import Path
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
agent_name = agent or "claude"
logger.info(f"✅ Local enhancement complete! (via {agent_name})")
else:
# LOCAL enhancement (no API key)
from pathlib import Path
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
agent_name = agent or "claude"
logger.info(f"✅ Local enhancement complete! (via {agent_name})")
logger.info(f"\n✅ Success! Skill created at: {skill_dir}/")
# Only suggest enhancement if neither workflow nor traditional enhancement was done
if not workflow_executed and getattr(args, "enhance_level", 0) == 0:
logger.info("\n💡 Optional: Enhance SKILL.md with AI:")
logger.info(f" skill-seekers enhance {skill_dir}/ --enhance-level 2")
logger.info(" (auto-detects API vs LOCAL mode based on ANTHROPIC_API_KEY)")
logger.info("\n💡 Or use a workflow:")
logger.info(
f" skill-seekers github --repo {config['repo']} --enhance-workflow architecture-comprehensive"
)
logger.info(f"\nNext step: skill-seekers package {skill_dir}/")
except Exception as e:
logger.error(f"Error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -908,8 +908,29 @@ class HowToGuideBuilder:
return collection
def _extract_workflow_examples(self, examples: list[dict]) -> list[dict]:
"""Filter to workflow category only"""
return [ex for ex in examples if isinstance(ex, dict) and ex.get("category") == "workflow"]
"""Filter to examples suitable for guide generation.
Includes:
- All workflow-category examples
- Setup/config examples with sufficient complexity (4+ steps or high confidence)
- Instantiation examples with high confidence and multiple dependencies
"""
guide_worthy = []
for ex in examples:
if not isinstance(ex, dict):
continue
category = ex.get("category", "")
complexity = ex.get("complexity_score", 0)
confidence = ex.get("confidence", 0)
if category == "workflow":
guide_worthy.append(ex)
elif category in ("setup", "config") and (complexity >= 0.4 or confidence >= 0.7):
guide_worthy.append(ex)
elif category == "instantiation" and complexity >= 0.6 and confidence >= 0.7:
guide_worthy.append(ex)
return guide_worthy
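The broadened filter can be exercised in isolation. The thresholds (0.4 complexity or 0.7 confidence for setup/config; 0.6 and 0.7 for instantiation) come from the method above, while the sample example dicts here are hypothetical:

```python
# Standalone version of the guide-worthiness predicate shown above.
def is_guide_worthy(ex: dict) -> bool:
    category = ex.get("category", "")
    complexity = ex.get("complexity_score", 0)
    confidence = ex.get("confidence", 0)
    if category == "workflow":
        return True  # workflow examples are always kept
    if category in ("setup", "config"):
        return complexity >= 0.4 or confidence >= 0.7
    if category == "instantiation":
        return complexity >= 0.6 and confidence >= 0.7
    return False

# Hypothetical sample data; field names mirror the real extractor output.
examples = [
    {"category": "workflow"},
    {"category": "setup", "complexity_score": 0.5},
    {"category": "config", "complexity_score": 0.1, "confidence": 0.2},
    {"category": "instantiation", "complexity_score": 0.7, "confidence": 0.8},
]
print([ex["category"] for ex in examples if is_guide_worthy(ex)])
# → ['workflow', 'setup', 'instantiation']
```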
def _create_guide(self, title: str, workflows: list[dict], enhancer=None) -> HowToGuide:
"""

View File

@@ -16,17 +16,17 @@ Usage:
skill-seekers html --from-json page_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment, Tag
from .skill_converter import SkillConverter
logger = logging.getLogger(__name__)
# File extensions treated as HTML
@@ -95,7 +95,7 @@ def _collect_html_files(html_path: str) -> list[Path]:
raise ValueError(f"Path is neither a file nor a directory: {html_path}")
class HtmlToSkillConverter(SkillConverter):
"""Convert local HTML files to a skill.
Supports single HTML files and directories of HTML files. Parses document
@@ -112,6 +112,8 @@ class HtmlToSkillConverter:
extracted_data: Parsed extraction results dict.
"""
SOURCE_TYPE = "html"
def __init__(self, config: dict) -> None:
"""Initialize the HTML to skill converter.
@@ -122,6 +124,7 @@ class HtmlToSkillConverter:
- description (str): Skill description (optional).
- categories (dict): Category definitions for content grouping.
"""
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.html_path: str = config.get("html_path", "")
@@ -139,6 +142,10 @@ class HtmlToSkillConverter:
# Extracted data
self.extracted_data: dict | None = None
def extract(self):
"""SkillConverter interface — delegates to extract_html()."""
return self.extract_html()
# ------------------------------------------------------------------
# Extraction
# ------------------------------------------------------------------
@@ -1742,205 +1749,3 @@ def _score_code_quality(code: str) -> float:
score -= 2.0
return min(10.0, max(0.0, score))
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> int:
"""CLI entry point for the HTML scraper.
Parses command-line arguments and runs the extraction/build pipeline.
Supports two workflows:
1. Direct HTML extraction: ``--html-path page.html --name myskill``
2. Build from JSON: ``--from-json page_extracted.json``
Returns:
Exit code (0 for success, non-zero for failure).
"""
parser = argparse.ArgumentParser(
description="Convert local HTML files to skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=(
"Examples:\n"
" %(prog)s --html-path page.html --name myskill\n"
" %(prog)s --html-path ./docs/ --name myskill\n"
" %(prog)s --from-json page_extracted.json\n"
),
)
# Shared universal args
from .arguments.common import add_all_standard_arguments
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for HTML
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for HTML), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code, Kimi, etc.)"
)
# HTML-specific args
parser.add_argument(
"--html-path",
type=str,
help="Path to HTML file or directory of HTML files",
metavar="PATH",
)
parser.add_argument(
"--from-json",
type=str,
help="Build skill from previously extracted JSON",
metavar="FILE",
)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = getattr(args, "html_path", None) or getattr(args, "from_json", None) or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: HTML Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return 0
# Validate inputs
if not (getattr(args, "html_path", None) or getattr(args, "from_json", None)):
parser.error("Must specify --html-path or --from-json")
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} documentation",
}
try:
converter = HtmlToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Direct HTML mode
if not getattr(args, "name", None):
# Auto-detect name from path
path = Path(args.html_path)
args.name = path.stem if path.is_file() else path.name
config = {
"name": args.name,
"html_path": args.html_path,
# Pass None so extract_html() can infer from HTML metadata
"description": getattr(args, "description", None),
}
try:
converter = HtmlToSkillConverter(config)
# Extract
if not converter.extract_html():
print(
"\n❌ HTML extraction failed - see error above",
file=sys.stderr,
)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis,"
" enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import (
enhance_skill_md,
)
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import (
LocalSkillEnhancer,
)
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import (
LocalSkillEnhancer,
)
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except (FileNotFoundError, ValueError) as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(
f"\n❌ Unexpected error during HTML processing: {e}",
file=sys.stderr,
)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -14,12 +14,10 @@ Usage:
skill-seekers jupyter --from-json notebook_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard
@@ -30,6 +28,8 @@ try:
except ImportError:
JUPYTER_AVAILABLE = False
from .skill_converter import SkillConverter
logger = logging.getLogger(__name__)
# Import pattern categories for code analysis
@@ -199,10 +199,13 @@ def infer_description_from_notebook(metadata: dict | None = None, name: str = ""
)
class JupyterToSkillConverter(SkillConverter):
"""Convert Jupyter Notebook (.ipynb) to skill."""
SOURCE_TYPE = "jupyter"
def __init__(self, config: dict):
super().__init__(config)
self.config = config
self.name = config["name"]
self.notebook_path = config.get("notebook_path", "")
@@ -214,6 +217,10 @@ class JupyterToSkillConverter:
self.categories = config.get("categories", {})
self.extracted_data: dict | None = None
def extract(self):
"""SkillConverter interface — delegates to extract_notebook()."""
return self.extract_notebook()
# ------------------------------------------------------------------
# Extraction
# ------------------------------------------------------------------
@@ -1082,132 +1089,3 @@ def _score_code_quality(code: str) -> float:
if line_count > 0 and not non_magic:
score -= 1.0
return min(10.0, max(0.0, score))
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> int:
"""Standalone CLI entry point for the Jupyter Notebook scraper."""
from .arguments.jupyter import add_jupyter_arguments
parser = argparse.ArgumentParser(
description="Convert Jupyter Notebook (.ipynb) to skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_jupyter_arguments(parser)
args = parser.parse_args()
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
if getattr(args, "dry_run", False):
source = getattr(args, "notebook", None) or getattr(args, "from_json", None) or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: Jupyter Notebook Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return 0
if not (getattr(args, "notebook", None) or getattr(args, "from_json", None)):
parser.error("Must specify --notebook or --from-json")
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} notebook documentation",
}
try:
converter = JupyterToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Direct notebook mode
if not getattr(args, "name", None):
nb_path = Path(args.notebook)
args.name = nb_path.stem if nb_path.is_file() else (nb_path.name or "notebooks")
config = {
"name": args.name,
"notebook_path": args.notebook,
"description": getattr(args, "description", None),
}
try:
converter = JupyterToSkillConverter(config)
if not converter.extract_notebook():
print("\n❌ Notebook extraction failed - see error above", file=sys.stderr)
sys.exit(1)
converter.build_skill()
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis, "
"enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during Jupyter processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -2,97 +2,67 @@
"""
Skill Seekers - Unified CLI Entry Point
Provides a git-style unified command-line interface for all Skill Seekers tools.
Convert documentation, codebases, and repositories into AI skills.
Usage:
skill-seekers <command> [options]
Commands:
scrape Scrape documentation website
github Scrape GitHub repository
pdf Extract from PDF file
word Extract from Word (.docx) file
epub Extract from EPUB e-book (.epub)
video Extract from video (YouTube or local)
jupyter Extract from Jupyter Notebook (.ipynb)
html Extract from local HTML files
openapi Extract from OpenAPI/Swagger spec
asciidoc Extract from AsciiDoc documents (.adoc)
pptx Extract from PowerPoint (.pptx)
rss Extract from RSS/Atom feeds
manpage Extract from man pages
confluence Extract from Confluence wiki
notion Extract from Notion pages
chat Extract from Slack/Discord chat exports
unified Multi-source scraping (docs + GitHub + PDF + more)
analyze Analyze local codebase and extract code knowledge
create Create skill from any source (auto-detects type)
enhance AI-powered enhancement (auto: API or LOCAL mode)
enhance-status Check enhancement status (for background/daemon modes)
package Package skill into .zip file
upload Upload skill to target platform
install One-command workflow (scrape + enhance + package + upload)
install-agent Install skill to AI agent directories
estimate Estimate page count before scraping
extract-test-examples Extract usage examples from test files
resume Resume interrupted scraping job
config Configure GitHub tokens, API keys, and settings
doctor Health check for dependencies and configuration
Examples:
skill-seekers scrape --config configs/react.json
skill-seekers github --repo microsoft/TypeScript
skill-seekers unified --config configs/react_unified.json
skill-seekers extract-test-examples tests/ --language python
skill-seekers create https://react.dev
skill-seekers create owner/repo
skill-seekers create ./document.pdf
skill-seekers create configs/unity-spine.json
skill-seekers create configs/unity-spine.json --enhance-workflow unity-game-dev
skill-seekers enhance output/react/
skill-seekers package output/react/
skill-seekers install-agent output/react/ --agent cursor
"""
import argparse
import importlib
import os
import sys
from pathlib import Path
from skill_seekers.cli import __version__
# Command module mapping (command name -> module path)
COMMAND_MODULES = {
"create": "skill_seekers.cli.create_command", # NEW: Unified create command
"doctor": "skill_seekers.cli.doctor",
"config": "skill_seekers.cli.config_command",
"scrape": "skill_seekers.cli.doc_scraper",
"github": "skill_seekers.cli.github_scraper",
"pdf": "skill_seekers.cli.pdf_scraper",
"word": "skill_seekers.cli.word_scraper",
"epub": "skill_seekers.cli.epub_scraper",
"video": "skill_seekers.cli.video_scraper",
"unified": "skill_seekers.cli.unified_scraper",
# Skill creation — unified entry point for all 18 source types
"create": "skill_seekers.cli.create_command",
# Enhancement & packaging
"enhance": "skill_seekers.cli.enhance_command",
"enhance-status": "skill_seekers.cli.enhance_status",
"package": "skill_seekers.cli.package_skill",
"upload": "skill_seekers.cli.upload_skill",
"install": "skill_seekers.cli.install_skill",
"install-agent": "skill_seekers.cli.install_agent",
# Utilities
"estimate": "skill_seekers.cli.estimate_pages",
"extract-test-examples": "skill_seekers.cli.test_example_extractor",
"install-agent": "skill_seekers.cli.install_agent",
"analyze": "skill_seekers.cli.codebase_scraper",
"install": "skill_seekers.cli.install_skill",
"resume": "skill_seekers.cli.resume_command",
"quality": "skill_seekers.cli.quality_metrics",
# Configuration & workflows
"config": "skill_seekers.cli.config_command",
"doctor": "skill_seekers.cli.doctor",
"workflows": "skill_seekers.cli.workflows_command",
"sync-config": "skill_seekers.cli.sync_config",
# Advanced (less common)
"stream": "skill_seekers.cli.streaming_ingest",
"update": "skill_seekers.cli.incremental_updater",
"multilang": "skill_seekers.cli.multilang_support",
"quality": "skill_seekers.cli.quality_metrics",
"workflows": "skill_seekers.cli.workflows_command",
"sync-config": "skill_seekers.cli.sync_config",
# New source types (v3.2.0+)
"jupyter": "skill_seekers.cli.jupyter_scraper",
"html": "skill_seekers.cli.html_scraper",
"openapi": "skill_seekers.cli.openapi_scraper",
"asciidoc": "skill_seekers.cli.asciidoc_scraper",
"pptx": "skill_seekers.cli.pptx_scraper",
"rss": "skill_seekers.cli.rss_scraper",
"manpage": "skill_seekers.cli.man_scraper",
"confluence": "skill_seekers.cli.confluence_scraper",
"notion": "skill_seekers.cli.notion_scraper",
"chat": "skill_seekers.cli.chat_scraper",
}
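The mapping above drives a lazy, git-style dispatch: the command module is imported only when its command is actually invoked. A minimal sketch, with a hypothetical toy mapping standing in for COMMAND_MODULES:

```python
import importlib

# Hypothetical stand-in for COMMAND_MODULES: command name -> importable module path.
TOY_COMMANDS = {"inspect-json": "json"}

def dispatch(command: str) -> int:
    module_path = TOY_COMMANDS.get(command)
    if module_path is None:
        print(f"Unknown command: {command}")
        return 1
    # Lazy import: nothing is loaded until a command runs, which keeps
    # startup and --help fast even with ~30 registered commands.
    module = importlib.import_module(module_path)
    # Real command modules expose a main() entry point; the stdlib json
    # module used in this sketch does not, so just confirm resolution.
    return 0 if module else 1

print(dispatch("inspect-json"))  # → 0
print(dispatch("missing"))       # → 1
```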
@@ -106,14 +76,14 @@ def create_parser() -> argparse.ArgumentParser:
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Create skill from documentation (auto-detects source type)
skill-seekers create https://docs.react.dev --name react
# Create skill from GitHub repository
skill-seekers create microsoft/TypeScript --name typescript
# Create skill from PDF file
skill-seekers create ./documentation.pdf --name mydocs
# AI-powered enhancement
skill-seekers enhance output/react/
@@ -145,6 +115,9 @@ For more information: https://github.com/yusufkaraaslan/Skill_Seekers
def _reconstruct_argv(command: str, args: argparse.Namespace) -> list[str]:
"""Reconstruct sys.argv from args namespace for command module.
DEPRECATED: Use ExecutionContext instead. This function is kept for
backward compatibility and will be removed in a future version.
Args:
command: Command name
args: Parsed arguments namespace
@@ -206,18 +179,8 @@ def main(argv: list[str] | None = None) -> int:
Returns:
Exit code (0 for success, non-zero for error)
"""
# Special handling for analyze --preset-list (no directory required)
if argv is None:
argv = sys.argv[1:]
if len(argv) >= 2 and argv[0] == "analyze" and "--preset-list" in argv:
from skill_seekers.cli.codebase_scraper import main as analyze_main
original_argv = sys.argv.copy()
sys.argv = ["codebase_scraper.py", "--preset-list"]
try:
return analyze_main() or 0
finally:
sys.argv = original_argv
parser = create_parser()
args = parser.parse_args(argv)
@@ -226,6 +189,10 @@ def main(argv: list[str] | None = None) -> int:
parser.print_help()
return 1
# Note: ExecutionContext is initialized by individual commands (e.g., create_command,
# enhance_command) with the correct config_path and source_info. Do NOT initialize
# it here — commands need to set config_path which requires source detection first.
# Get command module
module_name = COMMAND_MODULES.get(args.command)
if not module_name:
@@ -233,9 +200,38 @@ def main(argv: list[str] | None = None) -> int:
parser.print_help()
return 1
# Special handling for 'analyze' command (has post-processing)
if args.command == "analyze":
return _handle_analyze_command(args)
# create command: call directly with parsed args (no argv reconstruction)
if args.command == "create":
# Handle --help-* flags before execute (no source needed for help)
from skill_seekers.cli.arguments.create import add_create_arguments
help_modes = {
"_help_web": "web",
"_help_github": "github",
"_help_local": "local",
"_help_pdf": "pdf",
"_help_word": "word",
"_help_epub": "epub",
"_help_video": "video",
"_help_config": "config",
"_help_advanced": "advanced",
"_help_all": "all",
}
for attr, mode in help_modes.items():
if getattr(args, attr, False):
help_parser = argparse.ArgumentParser(
prog="skill-seekers create",
description=f"Create skill — {mode} options",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(help_parser, mode=mode)
help_parser.print_help()
return 0
from skill_seekers.cli.create_command import CreateCommand
command = CreateCommand(args)
return command.execute()
# Standard delegation for all other commands
try:
@@ -269,165 +265,5 @@ def main(argv: list[str] | None = None) -> int:
return 1
def _handle_analyze_command(args: argparse.Namespace) -> int:
"""Handle analyze command with special post-processing logic.
Args:
args: Parsed arguments
Returns:
Exit code
"""
from skill_seekers.cli.codebase_scraper import main as analyze_main
# Reconstruct sys.argv for analyze command
original_argv = sys.argv.copy()
sys.argv = ["codebase_scraper.py", "--directory", args.directory]
if args.output:
sys.argv.extend(["--output", args.output])
# Handle preset flags (depth and features)
if args.quick:
sys.argv.extend(
[
"--depth",
"surface",
"--skip-patterns",
"--skip-test-examples",
"--skip-how-to-guides",
"--skip-config-patterns",
]
)
elif args.comprehensive:
sys.argv.extend(["--depth", "full"])
elif args.depth:
sys.argv.extend(["--depth", args.depth])
# Determine enhance_level (simplified - use default or override)
enhance_level = getattr(args, "enhance_level", 2) # Default is 2
if getattr(args, "quick", False):
enhance_level = 0 # Quick mode disables enhancement
sys.argv.extend(["--enhance-level", str(enhance_level)])
# Pass through remaining arguments
if args.languages:
sys.argv.extend(["--languages", args.languages])
if args.file_patterns:
sys.argv.extend(["--file-patterns", args.file_patterns])
if args.skip_api_reference:
sys.argv.append("--skip-api-reference")
if args.skip_dependency_graph:
sys.argv.append("--skip-dependency-graph")
if args.skip_patterns:
sys.argv.append("--skip-patterns")
if args.skip_test_examples:
sys.argv.append("--skip-test-examples")
if args.skip_how_to_guides:
sys.argv.append("--skip-how-to-guides")
if args.skip_config_patterns:
sys.argv.append("--skip-config-patterns")
if args.skip_docs:
sys.argv.append("--skip-docs")
if args.no_comments:
sys.argv.append("--no-comments")
if args.verbose:
sys.argv.append("--verbose")
if getattr(args, "quiet", False):
sys.argv.append("--quiet")
if getattr(args, "dry_run", False):
sys.argv.append("--dry-run")
if getattr(args, "preset", None):
sys.argv.extend(["--preset", args.preset])
if getattr(args, "name", None):
sys.argv.extend(["--name", args.name])
if getattr(args, "description", None):
sys.argv.extend(["--description", args.description])
if getattr(args, "api_key", None):
sys.argv.extend(["--api-key", args.api_key])
# Enhancement Workflow arguments
if getattr(args, "enhance_workflow", None):
for wf in args.enhance_workflow:
sys.argv.extend(["--enhance-workflow", wf])
if getattr(args, "enhance_stage", None):
for stage in args.enhance_stage:
sys.argv.extend(["--enhance-stage", stage])
if getattr(args, "var", None):
for var in args.var:
sys.argv.extend(["--var", var])
if getattr(args, "workflow_dry_run", False):
sys.argv.append("--workflow-dry-run")
try:
result = analyze_main() or 0
# Enhance SKILL.md if enhance_level >= 1
if result == 0 and enhance_level >= 1:
skill_dir = Path(args.output)
skill_md = skill_dir / "SKILL.md"
if skill_md.exists():
print("\n" + "=" * 60)
print(f"ENHANCING SKILL.MD WITH AI (Level {enhance_level})")
print("=" * 60 + "\n")
try:
from skill_seekers.cli.enhance_command import (
_is_root,
_pick_mode,
_run_api_mode,
_run_local_mode,
)
import argparse as _ap
_fake_args = _ap.Namespace(
skill_directory=str(skill_dir),
target=None,
api_key=None,
dry_run=False,
agent=None,
agent_cmd=None,
interactive_enhancement=False,
background=False,
daemon=False,
no_force=False,
timeout=2700,
)
_mode, _target = _pick_mode(_fake_args)
if _mode == "api":
print(f"\n🤖 Enhancement mode: API ({_target})")
success = _run_api_mode(_fake_args, _target) == 0
elif _is_root():
print("\n⚠️ Skipping SKILL.md enhancement: running as root")
print(" Set ANTHROPIC_API_KEY / GOOGLE_API_KEY to enable API mode")
success = False
else:
agent_name = (
os.environ.get("SKILL_SEEKER_AGENT", "claude").strip() or "claude"
)
print(f"\n🤖 Enhancement mode: LOCAL ({agent_name})")
success = _run_local_mode(_fake_args) == 0
if success:
print("\n✅ SKILL.md enhancement complete!")
with open(skill_md) as f:
lines = len(f.readlines())
print(f" Enhanced SKILL.md: {lines} lines")
else:
print("\n⚠️ SKILL.md enhancement did not complete")
print(" You can retry with: skill-seekers enhance " + str(skill_dir))
except Exception as e:
print(f"\n⚠️ SKILL.md enhancement failed: {e}")
print(" You can retry with: skill-seekers enhance " + str(skill_dir))
else:
print(f"\n⚠️ SKILL.md not found at {skill_md}, skipping enhancement")
return result
finally:
sys.argv = original_argv
if __name__ == "__main__":
sys.exit(main())

View File

@@ -20,15 +20,15 @@ Usage:
skill-seekers man --from-json unix-tools_extracted.json
"""
import argparse
import json
import logging
import os
import re
import subprocess
import sys
from pathlib import Path
from skill_seekers.cli.skill_converter import SkillConverter
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
@@ -116,7 +116,7 @@ def infer_description_from_manpages(
)
class ManPageToSkillConverter:
class ManPageToSkillConverter(SkillConverter):
"""Convert Unix man pages into a skill directory structure.
Supports extraction via the ``man`` command or by reading raw man-page
@@ -125,6 +125,8 @@ class ManPageToSkillConverter:
from skill generation.
"""
SOURCE_TYPE = "manpage"
def __init__(self, config: dict) -> None:
"""Initialise the converter from a configuration dictionary.
@@ -137,6 +139,7 @@ class ManPageToSkillConverter:
- ``description`` -- explicit description (optional)
- ``categories`` -- keyword-based categorisation map (optional)
"""
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.man_names: list[str] = config.get("man_names", [])
@@ -156,6 +159,10 @@ class ManPageToSkillConverter:
# Extracted data placeholder
self.extracted_data: dict | None = None
def extract(self):
"""Extract content from man pages (SkillConverter interface)."""
self.extract_manpages()
# ------------------------------------------------------------------
# Extraction
# ------------------------------------------------------------------
@@ -1285,233 +1292,3 @@ class ManPageToSkillConverter:
safe = re.sub(r"[^\w\s-]", "", name.lower())
safe = re.sub(r"[-\s]+", "_", safe)
return safe
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> int:
"""CLI entry point for the man page scraper.
Supports three workflows:
1. ``--man-names git,curl`` -- extract named man pages via the ``man``
command.
2. ``--man-path /usr/share/man/man1`` -- read man page files from a
directory.
3. ``--from-json data.json`` -- reload previously extracted data and
rebuild the skill.
Returns:
Exit code (0 on success, non-zero on error).
"""
parser = argparse.ArgumentParser(
description="Convert Unix man pages to a skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=(
"Examples:\n"
" %(prog)s --man-names git,curl --name unix-tools\n"
" %(prog)s --man-path /usr/share/man/man1 --name coreutils\n"
" %(prog)s --from-json unix-tools_extracted.json\n"
),
)
# Standard arguments (name, description, output, enhance-level, etc.)
from .arguments.common import add_all_standard_arguments
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for man pages
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for man), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code, Kimi, etc.)"
)
# Man-specific arguments
parser.add_argument(
"--man-names",
type=str,
help="Comma-separated list of man page names (e.g. git,curl,grep)",
metavar="NAMES",
)
parser.add_argument(
"--man-path",
type=str,
help="Directory containing man page files (.1-.8, .man, .gz)",
metavar="DIR",
)
parser.add_argument(
"--sections",
type=str,
help="Comma-separated list of man section numbers to extract (e.g. 1,3,8)",
metavar="NUMS",
)
parser.add_argument(
"--from-json",
type=str,
help="Build skill from previously extracted JSON",
metavar="FILE",
)
args = parser.parse_args()
# Logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Dry run
if getattr(args, "dry_run", False):
source = (
getattr(args, "man_names", None)
or getattr(args, "man_path", None)
or getattr(args, "from_json", None)
or "(none)"
)
print(f"\n{'=' * 60}")
print("DRY RUN: Man Page Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Sections: {getattr(args, 'sections', None) or 'all'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print("\n✅ Dry run complete")
return 0
# Validate: must have at least one source
if not (
getattr(args, "man_names", None)
or getattr(args, "man_path", None)
or getattr(args, "from_json", None)
):
parser.error("Must specify --man-names, --man-path, or --from-json")
# Parse section numbers
section_list: list[int] = []
if getattr(args, "sections", None):
try:
section_list = [int(s.strip()) for s in args.sections.split(",") if s.strip()]
except ValueError:
parser.error("--sections must be comma-separated integers (e.g. 1,3,8)")
# Parse man names
man_name_list: list[str] = []
if getattr(args, "man_names", None):
man_name_list = [n.strip() for n in args.man_names.split(",") if n.strip()]
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} documentation",
}
try:
converter = ManPageToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Auto-detect name from man names or path
if not getattr(args, "name", None):
if man_name_list:
args.name = man_name_list[0] if len(man_name_list) == 1 else "man-pages"
elif getattr(args, "man_path", None):
args.name = Path(args.man_path).name
else:
args.name = "man-pages"
config = {
"name": args.name,
"man_names": man_name_list,
"man_path": getattr(args, "man_path", ""),
"sections": section_list,
"description": getattr(args, "description", None),
}
try:
converter = ManPageToSkillConverter(config)
# Extract
if not converter.extract_manpages():
print("\n❌ Man page extraction failed -- see error above", file=sys.stderr)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis,"
" enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during man page processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())
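The `--sections` parsing and name auto-detection that `main()` inlines can be sketched in isolation. `parse_sections` and `autodetect_name` are hypothetical helper names used only for illustration; the logic mirrors the list comprehensions and fallback chain above.

```python
from pathlib import Path


def parse_sections(raw: str) -> list[int]:
    """Parse a comma-separated --sections value like "1,3,8" into ints."""
    return [int(s.strip()) for s in raw.split(",") if s.strip()]


def autodetect_name(man_names: list[str], man_path: str = "") -> str:
    """Mirror main()'s fallback chain: one name uses that name,
    several names collapse to "man-pages", else the directory basename."""
    if man_names:
        return man_names[0] if len(man_names) == 1 else "man-pages"
    if man_path:
        return Path(man_path).name
    return "man-pages"
```

Note that invalid input (e.g. `--sections 1,x`) raises `ValueError`, which is why the real CLI wraps the parse in a `try` and calls `parser.error()`.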

View File

@@ -16,17 +16,17 @@ Usage:
skill-seekers notion --from-json output/myskill_notion_data.json --name myskill
"""
import argparse
import csv
import json
import logging
import os
import re
import sys
import time
from pathlib import Path
from typing import Any
from skill_seekers.cli.skill_converter import SkillConverter
# Optional dependency guard — notion-client is not a core dependency
try:
from notion_client import Client as NotionClient
@@ -71,7 +71,7 @@ def infer_description_from_notion(metadata: dict | None = None, name: str = "")
)
class NotionToSkillConverter(SkillConverter):
"""Convert Notion workspace content (database or page tree) to a skill.
Args:
@@ -79,7 +79,10 @@ class NotionToSkillConverter:
token, description, max_pages.
"""
SOURCE_TYPE = "notion"
def __init__(self, config: dict) -> None:
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.database_id: str | None = config.get("database_id")
@@ -109,6 +112,10 @@ class NotionToSkillConverter:
logger.info("Notion API client initialised")
return self._client
def extract(self):
"""Extract content from Notion (SkillConverter interface)."""
return self.extract_notion()
# -- Public extraction -----------------------------------------------
def extract_notion(self) -> bool:
@@ -857,173 +864,3 @@ class NotionToSkillConverter:
"""Strip trailing Notion hex IDs from export filenames."""
cleaned = re.sub(r"\s+[0-9a-f]{16,}$", "", stem)
return cleaned.strip() or stem
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> int:
"""CLI entry point for the Notion scraper."""
from .arguments.common import add_all_standard_arguments
parser = argparse.ArgumentParser(
description="Convert Notion workspace content to AI-ready skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=(
"Examples:\n"
" skill-seekers notion --database-id ID --token $NOTION_TOKEN --name myskill\n"
" skill-seekers notion --page-id ID --token $NOTION_TOKEN --name myskill\n"
" skill-seekers notion --export-path ./export/ --name myskill\n"
" skill-seekers notion --from-json output/myskill_notion_data.json --name myskill"
),
)
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for Notion
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
# Notion-specific arguments
parser.add_argument(
"--database-id", type=str, help="Notion database ID (API mode)", metavar="ID"
)
parser.add_argument(
"--page-id", type=str, help="Notion page ID (API mode, recursive)", metavar="ID"
)
parser.add_argument(
"--export-path", type=str, help="Notion export directory (export mode)", metavar="PATH"
)
parser.add_argument(
"--token", type=str, help="Notion integration token (or NOTION_TOKEN env)", metavar="TOKEN"
)
parser.add_argument(
"--max-pages",
type=int,
default=DEFAULT_MAX_PAGES,
help=f"Maximum pages to extract (default: {DEFAULT_MAX_PAGES})",
metavar="N",
)
parser.add_argument(
"--from-json", type=str, help="Build from previously extracted JSON", metavar="FILE"
)
args = parser.parse_args()
# Logging
level = (
logging.WARNING
if getattr(args, "quiet", False)
else (logging.DEBUG if getattr(args, "verbose", False) else logging.INFO)
)
logging.basicConfig(level=level, format="%(message)s", force=True)
# Dry run
if getattr(args, "dry_run", False):
source = (
getattr(args, "database_id", None)
or getattr(args, "page_id", None)
or getattr(args, "export_path", None)
or getattr(args, "from_json", None)
or "(none)"
)
print(f"\n{'=' * 60}\nDRY RUN: Notion Extraction\n{'=' * 60}")
print(
f"Source: {source}\nName: {getattr(args, 'name', None) or '(auto)'}\nMax pages: {args.max_pages}"
)
return 0
# Validate
has_source = any(
getattr(args, a, None) for a in ("database_id", "page_id", "export_path", "from_json")
)
if not has_source:
parser.error("Must specify --database-id, --page-id, --export-path, or --from-json")
if not getattr(args, "name", None):
if getattr(args, "from_json", None):
args.name = Path(args.from_json).stem.replace("_notion_data", "")
elif getattr(args, "export_path", None):
args.name = Path(args.export_path).stem
else:
parser.error("--name is required when using --database-id or --page-id")
# --from-json: build only
if getattr(args, "from_json", None):
config = {
"name": args.name,
"description": getattr(args, "description", None),
"max_pages": args.max_pages,
}
try:
conv = NotionToSkillConverter(config)
conv.load_extracted_data(args.from_json)
conv.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Full extract + build
config: dict[str, Any] = {
"name": args.name,
"database_id": getattr(args, "database_id", None),
"page_id": getattr(args, "page_id", None),
"export_path": getattr(args, "export_path", None),
"token": getattr(args, "token", None),
"description": getattr(args, "description", None),
"max_pages": args.max_pages,
}
try:
conv = NotionToSkillConverter(config)
if not conv.extract_notion():
print("\n❌ Notion extraction failed", file=sys.stderr)
sys.exit(1)
conv.build_skill()
# Run enhancement workflows if specified
try:
from skill_seekers.cli.workflow_runner import run_workflows
run_workflows(args)
except (ImportError, AttributeError):
pass
# Traditional AI enhancement
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
skill_dir = conv.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
except ImportError:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())
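The Notion-ID-stripping helper shown in the hunk above is small enough to sketch standalone. `strip_notion_id` is an illustrative name; the regex and fallback are taken verbatim from the converter.

```python
import re


def strip_notion_id(stem: str) -> str:
    # Notion exports append a long hex page ID to filenames;
    # drop a trailing run of 16+ hex chars, falling back to the
    # original stem if stripping would leave nothing.
    cleaned = re.sub(r"\s+[0-9a-f]{16,}$", "", stem)
    return cleaned.strip() or stem
```

Because the pattern requires whitespace before the hex run, a stem that is *only* hex digits is left untouched rather than emptied.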

View File

@@ -22,14 +22,11 @@ Usage:
python3 -m skill_seekers.cli.openapi_scraper --spec spec.yaml --name my-api
"""
import argparse
import copy
import json
import logging
import os
import re
import sys
from pathlib import Path
from typing import Any
# Optional dependency guard
@@ -40,6 +37,8 @@ try:
except ImportError:
YAML_AVAILABLE = False
from skill_seekers.cli.skill_converter import SkillConverter
logger = logging.getLogger(__name__)
# HTTP methods recognized in OpenAPI path items
@@ -90,7 +89,7 @@ def infer_description_from_spec(info: dict | None = None, name: str = "") -> str
return f"Use when working with the {name} API" if name else "Use when working with this API"
class OpenAPIToSkillConverter(SkillConverter):
"""Convert OpenAPI/Swagger specifications to AI-ready skills.
Supports OpenAPI 2.0 (Swagger), 3.0, and 3.1 specifications in both
@@ -111,6 +110,8 @@ class OpenAPIToSkillConverter:
extracted_data: Structured extraction result with endpoints, schemas, etc.
"""
SOURCE_TYPE = "openapi"
def __init__(self, config: dict) -> None:
"""Initialize the converter with configuration.
@@ -125,6 +126,7 @@ class OpenAPIToSkillConverter:
ValueError: If neither spec_path nor spec_url is provided and
no from_json workflow is intended.
"""
super().__init__(config)
self.config = config
self.name = config["name"]
self.spec_path: str = config.get("spec_path", "")
@@ -142,6 +144,10 @@ class OpenAPIToSkillConverter:
self.extracted_data: dict[str, Any] = {}
self.openapi_version: str = ""
def extract(self):
"""Extract content from OpenAPI spec (SkillConverter interface)."""
return self.extract_spec()
# ──────────────────────────────────────────────────────────────────────
# Spec loading
# ──────────────────────────────────────────────────────────────────────
@@ -1772,192 +1778,3 @@ class OpenAPIToSkillConverter:
safe = re.sub(r"[^\w\s-]", "", name.lower())
safe = re.sub(r"[-\s]+", "_", safe)
return safe
# ──────────────────────────────────────────────────────────────────────────────
# CLI entry point
# ──────────────────────────────────────────────────────────────────────────────
def main() -> int:
"""CLI entry point for the OpenAPI scraper.
Supports three input modes:
1. Local spec file: --spec path/to/spec.yaml
2. Remote spec URL: --spec-url https://example.com/openapi.json
3. Pre-extracted JSON: --from-json extracted.json
Standard arguments (--name, --description, --verbose, --quiet, --dry-run)
are provided by the shared argument system.
"""
_check_yaml_deps()
parser = argparse.ArgumentParser(
description="Convert OpenAPI/Swagger specifications to AI-ready skills",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --spec petstore.yaml --name petstore-api
%(prog)s --spec-url https://petstore3.swagger.io/api/v3/openapi.json --name petstore
%(prog)s --from-json petstore_extracted.json
""",
)
# Standard shared arguments
from .arguments.common import add_all_standard_arguments
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for OpenAPI
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for OpenAPI), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code, Kimi, etc.)"
)
# OpenAPI-specific arguments
parser.add_argument(
"--spec",
type=str,
help="Local path to OpenAPI/Swagger spec file (YAML or JSON)",
metavar="PATH",
)
parser.add_argument(
"--spec-url",
type=str,
help="Remote URL to fetch OpenAPI/Swagger spec from",
metavar="URL",
)
parser.add_argument(
"--from-json",
type=str,
help="Build skill from previously extracted JSON data",
metavar="FILE",
)
args = parser.parse_args()
# Setup logging
if getattr(args, "quiet", False):
logging.basicConfig(level=logging.WARNING, format="%(message)s")
elif getattr(args, "verbose", False):
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
else:
logging.basicConfig(level=logging.INFO, format="%(message)s")
# Handle --dry-run
if getattr(args, "dry_run", False):
source = args.spec or args.spec_url or args.from_json or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: OpenAPI Specification Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print("\n✅ Dry run complete")
return 0
# Validate inputs
if not (args.spec or args.spec_url or args.from_json):
parser.error("Must specify --spec (file path), --spec-url (URL), or --from-json")
# Build from pre-extracted JSON
if args.from_json:
name = args.name or Path(args.from_json).stem.replace("_extracted", "")
config: dict[str, Any] = {
"name": name,
"description": (args.description or f"Use when working with the {name} API"),
}
converter = OpenAPIToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
return 0
# Determine name
if not args.name:
if args.spec:
name = Path(args.spec).stem
elif args.spec_url:
# Derive name from URL
from urllib.parse import urlparse
url_path = urlparse(args.spec_url).path
name = Path(url_path).stem if url_path else "api"
else:
name = "api"
else:
name = args.name
# Build config
config = {
"name": name,
"spec_path": args.spec or "",
"spec_url": args.spec_url or "",
}
if args.description:
config["description"] = args.description
# Create converter and run
try:
converter = OpenAPIToSkillConverter(config)
if not converter.extract_spec():
print("\n❌ OpenAPI extraction failed", file=sys.stderr)
sys.exit(1)
converter.build_skill()
# Enhancement workflow integration
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print(f"\n{'=' * 80}")
print(f"🤖 AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except (ValueError, RuntimeError) as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during OpenAPI processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())
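The name-from-URL fallback in `main()` can be shown as a standalone sketch. `name_from_spec_url` is a hypothetical name; the `urlparse` plus `Path(...).stem` logic mirrors the code above.

```python
from pathlib import Path
from urllib.parse import urlparse


def name_from_spec_url(spec_url: str, fallback: str = "api") -> str:
    # Derive a skill name from the last path component of the spec URL,
    # e.g. ".../api/v3/openapi.json" yields "openapi".
    url_path = urlparse(spec_url).path
    return Path(url_path).stem if url_path else fallback
```

A bare host URL with no path falls through to the `"api"` fallback, matching the converter's behavior.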

View File

@@ -1,21 +1,15 @@
"""Parser registry and factory.
This module registers all subcommand parsers and provides a factory
function to create them. Individual scraper commands have been removed —
use `skill-seekers create <source>` for all source types.
"""
from .base import SubcommandParser
# Import parser classes (scrapers removed — use create command)
from .create_parser import CreateParser
from .config_parser import ConfigParser
from .enhance_parser import EnhanceParser
from .enhance_status_parser import EnhanceStatusParser
from .package_parser import PackageParser
@@ -23,7 +17,6 @@ from .upload_parser import UploadParser
from .estimate_parser import EstimateParser
from .test_examples_parser import TestExamplesParser
from .install_agent_parser import InstallAgentParser
from .install_parser import InstallParser
from .resume_parser import ResumeParser
from .stream_parser import StreamParser
@@ -34,57 +27,26 @@ from .workflows_parser import WorkflowsParser
from .sync_config_parser import SyncConfigParser
from .doctor_parser import DoctorParser
# Registry of all parsers
PARSERS = [
CreateParser(),
DoctorParser(),
ConfigParser(),
EnhanceParser(),
EnhanceStatusParser(),
PackageParser(),
UploadParser(),
EstimateParser(),
InstallParser(),
InstallAgentParser(),
TestExamplesParser(),
ResumeParser(),
QualityParser(),
WorkflowsParser(),
SyncConfigParser(),
StreamParser(),
UpdateParser(),
MultilangParser(),
]
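A registry list like `PARSERS` typically drives subcommand dispatch by a linear scan over registered names. The sketch below is illustrative only: `SubParser`, `CreateCmd`, and `dispatch` are hypothetical stand-ins, not the real `SubcommandParser` API.

```python
class SubParser:
    """Minimal stand-in for a subcommand parser: a name plus a run hook."""
    name = ""

    def run(self, args):
        raise NotImplementedError


class CreateCmd(SubParser):
    name = "create"

    def run(self, args):
        return f"create {args}"


REGISTRY = [CreateCmd()]


def dispatch(command: str, args: str):
    # Linear scan over the registry, like iterating PARSERS.
    for p in REGISTRY:
        if p.name == command:
            return p.run(args)
    raise SystemExit(f"unknown command: {command}")
```

With the scraper entries gone, `create` is the single extraction entry point, so ordering in the list matters only for `--help` output, not for lookup.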

View File

@@ -48,14 +48,13 @@ Presets: -p quick (1-2min) | -p standard (5-10min) | -p comprehensive (20-60min)
def add_arguments(self, parser):
"""Add create-specific arguments.
Uses shared argument definitions with progressive disclosure.
Registers ALL arguments (120+ flags) so the top-level parser
accepts source-specific flags like --browser, --max-pages, etc.
Default help still shows only universal args; use --help-all for full list.
"""
# Register all arguments so source-specific flags are accepted
# by the top-level parser (create is the only entry point now)
add_create_arguments(parser, mode="all")
# Add hidden help mode flags
# These won't show in default help but can be used to get source-specific help
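One way to get the "register everything, show only universal flags" behavior described above is `argparse.SUPPRESS` as the help text for hidden flags. This is a minimal sketch under that assumption; the real `add_create_arguments` lives in the shared argument module and covers far more flags.

```python
import argparse


def add_create_arguments(parser, mode="default"):
    # Universal flag: always visible in --help.
    parser.add_argument("--name", help="Skill name")
    # Source-specific flag: always *accepted*, but hidden from
    # default help unless mode="all" exposes it.
    source_help = "Max pages to crawl" if mode == "all" else argparse.SUPPRESS
    parser.add_argument("--max-pages", type=int, default=100, help=source_help)


parser = argparse.ArgumentParser(prog="skill-seekers create")
add_create_arguments(parser, mode="all")
args = parser.parse_args(["--name", "docs", "--max-pages", "50"])
```

Either way the flag parses; only its visibility in the help listing changes, which is exactly the progressive-disclosure trade-off the docstring describes.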

View File

@@ -11,16 +11,14 @@ Usage:
python3 pdf_scraper.py --from-json manual_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Import the PDF extractor
from .pdf_extractor_poc import PDFExtractor
from .skill_converter import SkillConverter
def infer_description_from_pdf(pdf_metadata: dict = None, name: str = "") -> str:
@@ -62,10 +60,13 @@ def infer_description_from_pdf(pdf_metadata: dict = None, name: str = "") -> str
)
class PDFToSkillConverter(SkillConverter):
"""Convert PDF documentation to AI skill"""
SOURCE_TYPE = "pdf"
def __init__(self, config):
super().__init__(config)
self.config = config
self.name = config["name"]
self.pdf_path = config.get("pdf_path", "")
@@ -87,6 +88,10 @@ class PDFToSkillConverter:
# Extracted data
self.extracted_data = None
def extract(self):
"""SkillConverter interface — delegates to extract_pdf()."""
return self.extract_pdf()
def extract_pdf(self):
"""Extract content from PDF using pdf_extractor_poc.py"""
print(f"\n🔍 Extracting from PDF: {self.pdf_path}")
@@ -631,151 +636,3 @@ class PDFToSkillConverter:
safe = re.sub(r"[^\w\s-]", "", name.lower())
safe = re.sub(r"[-\s]+", "_", safe)
return safe
def main():
from .arguments.pdf import add_pdf_arguments
parser = argparse.ArgumentParser(
description="Convert PDF documentation to AI skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_pdf_arguments(parser)
args = parser.parse_args()
# Set logging level from behavior args
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = args.pdf or args.config or args.from_json or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: PDF Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print("\n✅ Dry run complete")
return
# Validate inputs
if not (args.config or args.pdf or args.from_json):
parser.error("Must specify --config, --pdf, or --from-json")
# Load or create config
if args.config:
with open(args.config) as f:
config = json.load(f)
elif args.from_json:
# Build from extracted JSON
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": name,
"description": args.description or f"Use when referencing {name} documentation",
}
converter = PDFToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
return
else:
# Direct PDF mode
if not args.name:
parser.error("Must specify --name with --pdf")
config = {
"name": args.name,
"pdf_path": args.pdf,
"description": args.description or f"Use when referencing {args.name} documentation",
"extract_options": {
"chunk_size": 10,
"min_quality": 5.0,
"extract_images": True,
"min_image_size": 100,
},
}
# Create converter
try:
converter = PDFToSkillConverter(config)
# Extract if needed
if config.get("pdf_path") and not converter.extract_pdf():
print("\n❌ PDF extraction failed - see error above", file=sys.stderr)
sys.exit(1)
# Build skill
converter.build_skill()
# ═══════════════════════════════════════════════════════════════════════════
# Enhancement Workflow Integration (Phase 2 - PDF Support)
# ═══════════════════════════════════════════════════════════════════════════
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# ═══════════════════════════════════════════════════════════════════════════
# Traditional Enhancement (complements workflow system)
# ═══════════════════════════════════════════════════════════════════════════
if getattr(args, "enhance_level", 0) > 0:
import os
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis, enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from pathlib import Path
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
agent_name = agent or "claude"
print(f"✅ Local enhancement complete! (via {agent_name})")
else:
from pathlib import Path
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
agent_name = agent or "claude"
print(f"✅ Local enhancement complete! (via {agent_name})")
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during PDF processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()
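The `_safe_filename` slug helper that the PDF, man page, and OpenAPI converters all share is worth showing on its own. The regex pair is taken directly from the code above; only the standalone function name is illustrative.

```python
import re


def safe_filename(name: str) -> str:
    # Lowercase, drop punctuation, then collapse runs of spaces and
    # hyphens into single underscores to get a filesystem-safe slug.
    safe = re.sub(r"[^\w\s-]", "", name.lower())
    safe = re.sub(r"[-\s]+", "_", safe)
    return safe
```

The first substitution keeps word characters, whitespace, and hyphens; the second normalizes the separators, so titles and slug-like names converge on the same form.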

View File

@@ -15,12 +15,10 @@ Usage:
skill-seekers pptx --from-json presentation_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard
@@ -33,6 +31,8 @@ try:
except ImportError:
PPTX_AVAILABLE = False
from skill_seekers.cli.skill_converter import SkillConverter
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
@@ -147,7 +147,7 @@ def infer_description_from_pptx(
# ---------------------------------------------------------------------------
class PptxToSkillConverter(SkillConverter):
"""Convert PowerPoint presentation (.pptx) to an AI-ready skill.
Follows the same pipeline pattern as the Word, EPUB, and PDF scrapers:
@@ -165,6 +165,8 @@ class PptxToSkillConverter:
.pptx files (merged into a single skill).
"""
SOURCE_TYPE = "pptx"
def __init__(self, config: dict) -> None:
"""Initialize the converter with a configuration dictionary.
@@ -175,6 +177,7 @@ class PptxToSkillConverter:
- description (str): Skill description (optional, inferred if absent)
- categories (dict): Manual category assignments (optional)
"""
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.pptx_path: str = config.get("pptx_path", "")
@@ -192,6 +195,10 @@ class PptxToSkillConverter:
# Extracted data (populated by extract_pptx or load_extracted_data)
self.extracted_data: dict | None = None
def extract(self):
"""Extract content from PowerPoint files (SkillConverter interface)."""
return self.extract_pptx()
# ------------------------------------------------------------------
# Extraction
# ------------------------------------------------------------------
@@ -1661,165 +1668,3 @@ def _score_code_quality(code: str) -> float:
score -= 2.0
return min(10.0, max(0.0, score))
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main() -> int:
"""CLI entry point for the PowerPoint scraper.
Parses command-line arguments and runs the extraction and skill-building
pipeline. Supports direct .pptx input, directory input, and loading from
previously extracted JSON.
Returns:
Exit code (0 for success, non-zero for errors).
"""
from skill_seekers.cli.arguments.pptx import add_pptx_arguments
parser = argparse.ArgumentParser(
description="Convert PowerPoint presentation (.pptx) to skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_pptx_arguments(parser)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = getattr(args, "pptx", None) or getattr(args, "from_json", None) or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: PowerPoint Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print("\n✅ Dry run complete")
return 0
# Validate inputs
if not (getattr(args, "pptx", None) or getattr(args, "from_json", None)):
parser.error("Must specify --pptx or --from-json")
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} presentation",
}
try:
converter = PptxToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Direct PPTX mode
if not getattr(args, "name", None):
# Auto-detect name from filename or directory name
pptx_path = Path(args.pptx)
args.name = pptx_path.stem if pptx_path.is_file() else pptx_path.name
config = {
"name": args.name,
"pptx_path": args.pptx,
# Pass None so extract_pptx() can infer from presentation metadata
"description": getattr(args, "description", None),
}
try:
converter = PptxToSkillConverter(config)
# Extract
if not converter.extract_pptx():
print(
"\n❌ PowerPoint extraction failed - see error above",
file=sys.stderr,
)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis,"
" enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except (FileNotFoundError, ValueError) as e:
print(f"\n❌ Input error: {e}", file=sys.stderr)
sys.exit(1)
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(
f"\n❌ Unexpected error during PowerPoint processing: {e}",
file=sys.stderr,
)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -19,16 +19,13 @@ Usage:
python3 -m skill_seekers.cli.rss_scraper --feed-url https://example.com/atom.xml --name myblog
"""
import argparse
import hashlib
import json
import logging
import os
import re
import sys
import time
from datetime import datetime
from pathlib import Path
from typing import Any
# Optional dependency guard — feedparser is not in core deps
@@ -42,6 +39,8 @@ except ImportError:
# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment, Tag
from skill_seekers.cli.skill_converter import SkillConverter
logger = logging.getLogger(__name__)
# Feed type constants
@@ -109,7 +108,7 @@ def infer_description_from_feed(
)
class RssToSkillConverter:
class RssToSkillConverter(SkillConverter):
"""Convert RSS/Atom feeds to AI-ready skills.
Parses RSS 2.0, RSS 1.0 (RDF), and Atom feeds using feedparser.
@@ -117,6 +116,8 @@ class RssToSkillConverter:
requests + BeautifulSoup.
"""
SOURCE_TYPE = "rss"
def __init__(self, config: dict[str, Any]) -> None:
"""Initialize the converter with configuration.
@@ -125,6 +126,7 @@ class RssToSkillConverter:
follow_links (default True), max_articles (default 50),
and description (optional).
"""
super().__init__(config)
self.config = config
self.name: str = config["name"]
self.feed_url: str = config.get("feed_url", "")
@@ -142,6 +144,10 @@ class RssToSkillConverter:
# Internal state
self.extracted_data: dict[str, Any] | None = None
def extract(self):
"""Extract content from RSS/Atom feed (SkillConverter interface)."""
self.extract_feed()
# ──────────────────────────────────────────────────────────────────────
# Public API
# ──────────────────────────────────────────────────────────────────────
@@ -865,227 +871,3 @@ class RssToSkillConverter:
safe = re.sub(r"[^\w\s-]", "", name.lower())
safe = re.sub(r"[-\s]+", "_", safe)
return safe or "unnamed"
# ──────────────────────────────────────────────────────────────────────────
# CLI entry point
# ──────────────────────────────────────────────────────────────────────────
def main() -> int:
"""CLI entry point for the RSS/Atom feed scraper."""
from .arguments.common import add_all_standard_arguments
parser = argparse.ArgumentParser(
description="Convert RSS/Atom feed to AI-ready skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=(
"Examples:\n"
" %(prog)s --feed-url https://example.com/feed.xml --name myblog\n"
" %(prog)s --feed-path ./feed.xml --name myblog\n"
" %(prog)s --feed-url https://example.com/rss --no-follow-links --name myblog\n"
" %(prog)s --from-json myblog_extracted.json\n"
),
)
# Standard arguments (name, description, output, enhance-level, etc.)
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for RSS
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for RSS), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code, Kimi, etc.)"
)
# RSS-specific arguments
parser.add_argument(
"--feed-url",
type=str,
help="URL of the RSS/Atom feed to scrape",
metavar="URL",
)
parser.add_argument(
"--feed-path",
type=str,
help="Local file path to an RSS/Atom XML file",
metavar="PATH",
)
parser.add_argument(
"--follow-links",
action="store_true",
default=True,
dest="follow_links",
help="Follow article links to scrape full content (default: enabled)",
)
parser.add_argument(
"--no-follow-links",
action="store_false",
dest="follow_links",
help="Do not follow article links — use feed content only",
)
parser.add_argument(
"--max-articles",
type=int,
default=50,
metavar="N",
help="Maximum number of articles to process (default: 50)",
)
parser.add_argument(
"--from-json",
type=str,
help="Build skill from previously extracted JSON file",
metavar="FILE",
)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = (
getattr(args, "feed_url", None)
or getattr(args, "feed_path", None)
or getattr(args, "from_json", None)
or "(none)"
)
print(f"\n{'=' * 60}")
print("DRY RUN: RSS/Atom Feed Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Follow links: {getattr(args, 'follow_links', True)}")
print(f"Max articles: {getattr(args, 'max_articles', 50)}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return 0
# Validate inputs
has_source = (
getattr(args, "feed_url", None)
or getattr(args, "feed_path", None)
or getattr(args, "from_json", None)
)
if not has_source:
parser.error("Must specify --feed-url, --feed-path, or --from-json")
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config: dict[str, Any] = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} feed content",
}
try:
converter = RssToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Feed extraction workflow
if not getattr(args, "name", None):
# Auto-detect name from URL or file path
if getattr(args, "feed_url", None):
from urllib.parse import urlparse
parsed_url = urlparse(args.feed_url)
args.name = parsed_url.hostname.replace(".", "-") if parsed_url.hostname else "feed"
elif getattr(args, "feed_path", None):
args.name = Path(args.feed_path).stem
config = {
"name": args.name,
"feed_url": getattr(args, "feed_url", "") or "",
"feed_path": getattr(args, "feed_path", "") or "",
"follow_links": getattr(args, "follow_links", True),
"max_articles": getattr(args, "max_articles", 50),
"description": getattr(args, "description", None),
}
try:
converter = RssToSkillConverter(config)
# Extract feed
if not converter.extract_feed():
print("\n❌ Feed extraction failed — see error above", file=sys.stderr)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis, "
"enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during feed processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -0,0 +1,115 @@
"""
SkillConverter — Base interface for all source type converters.
Every scraper/converter inherits this and implements extract().
The create command calls converter.run() — same interface for all 18 types.
Usage:
converter = get_converter("web", config)
converter.run() # extract + build + return exit code
"""
import logging
from typing import Any
logger = logging.getLogger(__name__)
class SkillConverter:
"""Base interface for all skill converters.
Subclasses must implement extract() at minimum.
build_skill() has a default implementation that most converters override.
"""
# Override in subclass
SOURCE_TYPE: str = "unknown"
def __init__(self, config: dict[str, Any]):
self.config = config
self.name = config.get("name", "unnamed")
self.skill_dir = f"output/{self.name}"
def run(self) -> int:
"""Main entry point — extract source and build skill.
Returns:
Exit code (0 for success, non-zero for failure).
"""
try:
logger.info(f"Extracting from {self.SOURCE_TYPE} source: {self.name}")
self.extract()
result = self.build_skill()
if result is False:
logger.error(f"{self.SOURCE_TYPE} build_skill() reported failure")
return 1
logger.info(f"✅ Skill built: {self.skill_dir}/")
return 0
except Exception as e:
logger.exception(f"{self.SOURCE_TYPE} extraction failed: {e}")
return 1
def extract(self):
"""Extract content from source. Override in subclass."""
raise NotImplementedError(f"{self.__class__.__name__} must implement extract()")
def build_skill(self):
"""Build SKILL.md from extracted data. Override in subclass."""
raise NotImplementedError(f"{self.__class__.__name__} must implement build_skill()")
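The template-method flow (run() drives extract() then build_skill() and maps the outcome to an exit code) can be seen with a toy subclass. This sketch inlines a condensed copy of the base class so it runs standalone; `EchoConverter` is hypothetical and not part of the registry:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Condensed copy of the base class, so this sketch is self-contained.
class SkillConverter:
    SOURCE_TYPE: str = "unknown"

    def __init__(self, config: dict[str, Any]):
        self.config = config
        self.name = config.get("name", "unnamed")
        self.skill_dir = f"output/{self.name}"

    def run(self) -> int:
        try:
            self.extract()
            if self.build_skill() is False:
                return 1
            return 0
        except Exception:
            logger.exception(f"{self.SOURCE_TYPE} extraction failed")
            return 1

    def extract(self):
        raise NotImplementedError

    def build_skill(self):
        raise NotImplementedError

# Hypothetical subclass: only extract() and build_skill() are supplied;
# run() handles logging, failure mapping, and the exit code.
class EchoConverter(SkillConverter):
    SOURCE_TYPE = "echo"

    def extract(self):
        self.text = self.config.get("text", "")

    def build_skill(self):
        return bool(self.text)  # False signals failure to run()

print(EchoConverter({"name": "demo", "text": "hi"}).run())  # 0
print(EchoConverter({"name": "demo"}).run())                # 1
```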
# Registry mapping source type → (module_path, class_name)
CONVERTER_REGISTRY: dict[str, tuple[str, str]] = {
"web": ("skill_seekers.cli.doc_scraper", "DocToSkillConverter"),
"github": ("skill_seekers.cli.github_scraper", "GitHubScraper"),
"pdf": ("skill_seekers.cli.pdf_scraper", "PDFToSkillConverter"),
"word": ("skill_seekers.cli.word_scraper", "WordToSkillConverter"),
"epub": ("skill_seekers.cli.epub_scraper", "EpubToSkillConverter"),
"video": ("skill_seekers.cli.video_scraper", "VideoToSkillConverter"),
"local": ("skill_seekers.cli.codebase_scraper", "CodebaseAnalyzer"),
"jupyter": ("skill_seekers.cli.jupyter_scraper", "JupyterToSkillConverter"),
"html": ("skill_seekers.cli.html_scraper", "HtmlToSkillConverter"),
"openapi": ("skill_seekers.cli.openapi_scraper", "OpenAPIToSkillConverter"),
"asciidoc": ("skill_seekers.cli.asciidoc_scraper", "AsciiDocToSkillConverter"),
"pptx": ("skill_seekers.cli.pptx_scraper", "PptxToSkillConverter"),
"rss": ("skill_seekers.cli.rss_scraper", "RssToSkillConverter"),
"manpage": ("skill_seekers.cli.man_scraper", "ManPageToSkillConverter"),
"confluence": ("skill_seekers.cli.confluence_scraper", "ConfluenceToSkillConverter"),
"notion": ("skill_seekers.cli.notion_scraper", "NotionToSkillConverter"),
"chat": ("skill_seekers.cli.chat_scraper", "ChatToSkillConverter"),
# NOTE: UnifiedScraper takes (config_path: str), not (config: dict).
# Callers must construct it directly, not via get_converter().
"config": ("skill_seekers.cli.unified_scraper", "UnifiedScraper"),
}
def get_converter(source_type: str, config: dict[str, Any]) -> SkillConverter:
"""Get the appropriate converter for a source type.
Args:
source_type: Source type from SourceDetector (web, github, pdf, etc.)
config: Configuration dict for the converter.
Returns:
Initialized converter instance.
Raises:
ValueError: If source type is not supported.
"""
import importlib
if source_type not in CONVERTER_REGISTRY:
raise ValueError(
f"Unknown source type: {source_type}. "
f"Supported: {', '.join(sorted(CONVERTER_REGISTRY))}"
)
module_path, class_name = CONVERTER_REGISTRY[source_type]
module = importlib.import_module(module_path)
converter_class = getattr(module, class_name, None)
if converter_class is None:
raise ValueError(
f"Class '{class_name}' not found in module '{module_path}'. "
f"Check CONVERTER_REGISTRY entry for '{source_type}'."
)
return converter_class(config)
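The lazy-import dispatch used by get_converter() can be sketched in miniature. The `REGISTRY` below is a stand-in, not the real `CONVERTER_REGISTRY`; it maps to stdlib `json.JSONDecoder` purely so the example resolves without the package installed:

```python
import importlib
from typing import Any

# Hypothetical miniature registry: source type -> (module_path, class_name).
# Modules are imported lazily so heavy optional dependencies only load
# when their source type is actually requested.
REGISTRY: dict[str, tuple[str, str]] = {
    "json": ("json", "JSONDecoder"),  # stand-in for a real converter class
}

def resolve(source_type: str) -> Any:
    if source_type not in REGISTRY:
        raise ValueError(f"Unknown source type: {source_type}")
    module_path, class_name = REGISTRY[source_type]
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

print(resolve("json").__name__)  # JSONDecoder
```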

View File

@@ -12,7 +12,6 @@ Usage:
skill-seekers unified --config configs/react_unified.json --merge-mode ai-enhanced
"""
import argparse
import json
import logging
import os
@@ -28,8 +27,8 @@ try:
from skill_seekers.cli.config_validator import validate_config
from skill_seekers.cli.conflict_detector import ConflictDetector
from skill_seekers.cli.merge_sources import AIEnhancedMerger, RuleBasedMerger
from skill_seekers.cli.skill_converter import SkillConverter
from skill_seekers.cli.unified_skill_builder import UnifiedSkillBuilder
from skill_seekers.cli.utils import setup_logging
except ImportError as e:
print(f"Error importing modules: {e}")
print("Make sure you're running from the project root directory")
@@ -38,7 +37,7 @@ except ImportError as e:
logger = logging.getLogger(__name__)
class UnifiedScraper:
class UnifiedScraper(SkillConverter):
"""
Orchestrates multi-source scraping and merging.
@@ -50,6 +49,8 @@ class UnifiedScraper:
5. Build unified skill
"""
SOURCE_TYPE = "config"
def __init__(self, config_path: str, merge_mode: str | None = None):
"""
Initialize unified scraper.
@@ -58,6 +59,7 @@ class UnifiedScraper:
config_path: Path to unified config JSON
merge_mode: Override config merge_mode ('rule-based' or 'claude-enhanced')
"""
super().__init__({"name": "unified", "config_path": config_path})
self.config_path = config_path
# Validate and load config
@@ -157,6 +159,42 @@ class UnifiedScraper:
logger.info(f"📝 Logging to: {log_file}")
logger.info(f"🗂️ Cache directory: {self.cache_dir}")
@staticmethod
def _enrich_docs_json(docs_json: dict, data_file_path: str) -> dict:
"""Enrich docs summary with page content from individual page files.
summary.json only has {title, url} per page; full content lives in pages/*.json.
ConflictDetector needs content to extract APIs, so we load page files and convert
to the dict format {url: page_data} that the detector's dict branch understands.
"""
pages = docs_json.get("pages", [])
if not isinstance(pages, list) or not pages or "content" in pages[0]:
return docs_json
pages_dir = os.path.join(os.path.dirname(data_file_path), "pages")
if not os.path.isdir(pages_dir):
return docs_json
enriched_pages = {}
for page_file in os.listdir(pages_dir):
if page_file.endswith(".json"):
try:
with open(os.path.join(pages_dir, page_file), encoding="utf-8") as pf:
page_data = json.load(pf)
url = page_data.get("url", "")
if url:
enriched_pages[url] = page_data
except (json.JSONDecodeError, OSError):
continue
if enriched_pages:
docs_json = {**docs_json, "pages": enriched_pages}
logger.info(
f"Enriched docs data with {len(enriched_pages)} page files for API extraction"
)
return docs_json
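The shape conversion this method performs (a list of `{title, url}` stubs becomes a `{url: page_data}` dict) can be shown with an in-memory sketch that skips the filesystem. The `summary` and `page_files` values are illustrative, not real scraper output:

```python
# Hypothetical in-memory version of the enrichment: summary pages carry
# only {title, url}; full page records live elsewhere keyed by filename.
summary = {"pages": [{"title": "Intro", "url": "https://ex.com/intro"}]}
page_files = {
    "intro.json": {"url": "https://ex.com/intro", "content": "Full text..."},
}

def enrich(docs_json: dict, pages: dict) -> dict:
    entries = docs_json.get("pages", [])
    # Already enriched (dict), empty, or already has content: nothing to do.
    if not isinstance(entries, list) or not entries or "content" in entries[0]:
        return docs_json
    enriched = {p["url"]: p for p in pages.values() if p.get("url")}
    return {**docs_json, "pages": enriched} if enriched else docs_json

result = enrich(summary, page_files)
print("content" in result["pages"]["https://ex.com/intro"])  # True
```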
def scrape_all_sources(self):
"""
Scrape all configured sources.
@@ -259,50 +297,41 @@ class UnifiedScraper:
"sources": [doc_source],
}
# Write temporary config
temp_config_path = os.path.join(self.data_dir, "temp_docs_config.json")
with open(temp_config_path, "w", encoding="utf-8") as f:
json.dump(doc_config, f, indent=2)
# Run doc_scraper as subprocess
# Run doc_scraper directly (no subprocess needed with ExecutionContext)
logger.info(f"Scraping documentation from {source['base_url']}")
doc_scraper_path = Path(__file__).parent / "doc_scraper.py"
cmd = [sys.executable, str(doc_scraper_path), "--config", temp_config_path, "--fresh"]
# Forward agent-related CLI args so doc scraper enhancement respects
# the user's chosen agent instead of defaulting to claude.
cli_args = getattr(self, "_cli_args", None)
if cli_args is not None:
if getattr(cli_args, "agent", None):
cmd.extend(["--agent", cli_args.agent])
if getattr(cli_args, "agent_cmd", None):
cmd.extend(["--agent-cmd", cli_args.agent_cmd])
if getattr(cli_args, "api_key", None):
cmd.extend(["--api-key", cli_args.api_key])
# Support "browser": true in source config for JavaScript SPA sites
if source.get("browser", False):
cmd.append("--browser")
doc_config["browser"] = True
logger.info(" 🌐 Browser mode enabled (JavaScript rendering via Playwright)")
# Import and call directly
try:
result = subprocess.run(
cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL, timeout=3600
from skill_seekers.cli.doc_scraper import scrape_documentation
from skill_seekers.cli.execution_context import ExecutionContext
# Create child context with doc-specific overrides
doc_ctx = ExecutionContext.get().override(
output__name=f"{self.name}_docs",
scraping__max_pages=source.get("max_pages", 500),
)
except subprocess.TimeoutExpired:
logger.error("Documentation scraping timed out after 60 minutes")
return
if result.returncode != 0:
logger.error(f"Documentation scraping failed with return code {result.returncode}")
logger.error(f"STDERR: {result.stderr}")
logger.error(f"STDOUT: {result.stdout}")
return
with doc_ctx:
result = scrape_documentation(
config=doc_config,
ctx=ExecutionContext.get(),
)
# Log subprocess output for debugging
if result.stdout:
logger.info(f"Doc scraper output: {result.stdout[-500:]}") # Last 500 chars
if result != 0:
logger.error(f"Documentation scraping failed with return code {result}")
return
except Exception as e:
logger.error(f"Documentation scraping failed: {e}")
import traceback
logger.debug(f"Traceback: {traceback.format_exc()}")
return
# Load scraped data
docs_data_file = f"output/{doc_config['name']}_data/summary.json"
@@ -327,10 +356,6 @@ class UnifiedScraper:
else:
logger.warning("Documentation data file not found")
# Clean up temp config
if os.path.exists(temp_config_path):
os.remove(temp_config_path)
# Move intermediate files to cache to keep output/ clean
docs_output_dir = f"output/{doc_config['name']}"
docs_data_dir = f"output/{doc_config['name']}_data"
@@ -1704,13 +1729,17 @@ class UnifiedScraper:
docs_data = docs_list[0]
github_data = github_list[0]
# Load data files
# Load data files (cached for reuse in merge_sources)
with open(docs_data["data_file"], encoding="utf-8") as f:
docs_json = json.load(f)
docs_json = self._enrich_docs_json(docs_json, docs_data["data_file"])
with open(github_data["data_file"], encoding="utf-8") as f:
github_json = json.load(f)
self._cached_docs_json = docs_json
self._cached_github_json = github_json
# Detect conflicts
detector = ConflictDetector(docs_json, github_json)
conflicts = detector.detect_all_conflicts()
@@ -1758,15 +1787,18 @@ class UnifiedScraper:
logger.warning("Missing documentation or GitHub data for merging")
return None
docs_data = docs_list[0]
github_data = github_list[0]
# Reuse cached data from detect_conflicts() to avoid redundant disk I/O
docs_json = getattr(self, "_cached_docs_json", None)
github_json = getattr(self, "_cached_github_json", None)
# Load data
with open(docs_data["data_file"], encoding="utf-8") as f:
docs_json = json.load(f)
with open(github_data["data_file"], encoding="utf-8") as f:
github_json = json.load(f)
if docs_json is None or github_json is None:
docs_data = docs_list[0]
github_data = github_list[0]
with open(docs_data["data_file"], encoding="utf-8") as f:
docs_json = json.load(f)
docs_json = self._enrich_docs_json(docs_json, docs_data["data_file"])
with open(github_data["data_file"], encoding="utf-8") as f:
github_json = json.load(f)
# Choose merger
if self.merge_mode in ("ai-enhanced", "claude-enhanced"):
@@ -1786,6 +1818,10 @@ class UnifiedScraper:
return merged_data
def extract(self):
"""SkillConverter interface — delegates to scrape_all_sources()."""
self.scrape_all_sources()
def build_skill(self, merged_data: dict | None = None):
"""
Build final unified skill.
@@ -1892,17 +1928,28 @@ class UnifiedScraper:
run_workflows(effective_args, context=unified_context)
# Phase 6: AI Enhancement of SKILL.md
# Triggered by config "enhancement" block or CLI --enhance-level
# Read from ExecutionContext first (has correct priority resolution),
# fall back to raw config dict for backward compatibility.
enhancement_config = self.config.get("enhancement", {})
enhancement_enabled = enhancement_config.get("enabled", False)
enhancement_level = enhancement_config.get("level", 0)
enhancement_mode = enhancement_config.get("mode", "AUTO").upper()
try:
from skill_seekers.cli.execution_context import ExecutionContext
# CLI --enhance-level overrides config
cli_enhance_level = getattr(args, "enhance_level", None) if args is not None else None
if cli_enhance_level is not None:
enhancement_enabled = cli_enhance_level > 0
enhancement_level = cli_enhance_level
ctx = ExecutionContext.get()
enhancement_enabled = ctx.enhancement.enabled
enhancement_level = ctx.enhancement.level
enhancement_mode = ctx.enhancement.mode.upper()
except Exception:  # RuntimeError if ExecutionContext is not initialized
# Fallback to raw config + args
enhancement_enabled = enhancement_config.get("enabled", False)
enhancement_level = enhancement_config.get("level", 0)
enhancement_mode = enhancement_config.get("mode", "AUTO").upper()
cli_enhance_level = (
getattr(args, "enhance_level", None) if args is not None else None
)
if cli_enhance_level is not None:
enhancement_enabled = cli_enhance_level > 0
enhancement_level = cli_enhance_level
if enhancement_enabled and enhancement_level > 0:
logger.info("\n" + "=" * 60)
@@ -1918,16 +1965,19 @@ class UnifiedScraper:
try:
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
# Get agent from CLI args, config enhancement block, or env var
agent = None
agent_cmd = None
if args is not None:
agent = getattr(args, "agent", None)
agent_cmd = getattr(args, "agent_cmd", None)
if not agent:
agent = enhancement_config.get("agent", None)
if not agent:
agent = os.environ.get("SKILL_SEEKER_AGENT", "").strip() or None
# Get agent from ExecutionContext (already resolved with correct priority)
try:
ctx = ExecutionContext.get()
agent = ctx.enhancement.agent
agent_cmd = ctx.enhancement.agent_cmd
except Exception:  # RuntimeError if ExecutionContext is not initialized
agent = None
agent_cmd = None
if args is not None:
agent = getattr(args, "agent", None)
agent_cmd = getattr(args, "agent_cmd", None)
if not agent:
agent = os.environ.get("SKILL_SEEKER_AGENT", "").strip() or None
# Read timeout from config enhancement block
timeout_val = enhancement_config.get("timeout")
@@ -2016,183 +2066,14 @@ class UnifiedScraper:
logger.info(f"📁 Output: {self.output_dir}/")
logger.info(f"📁 Data: {self.data_dir}/")
return 0
except KeyboardInterrupt:
logger.info("\n\n⚠️ Scraping interrupted by user")
sys.exit(1)
return 130
except Exception as e:
logger.error(f"\n\n❌ Error during scraping: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
def main():
"""Main entry point."""
parser = argparse.ArgumentParser(
description="Unified multi-source scraper",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic usage with unified config
skill-seekers unified --config configs/godot_unified.json
# Override merge mode
skill-seekers unified --config configs/react_unified.json --merge-mode ai-enhanced
# Backward compatible with legacy configs
skill-seekers unified --config configs/react.json
""",
)
parser.add_argument("--config", "-c", required=True, help="Path to unified config JSON file")
parser.add_argument(
"--merge-mode",
"-m",
choices=["rule-based", "ai-enhanced", "claude-enhanced"],
help="Override config merge mode (ai-enhanced or rule-based). 'claude-enhanced' accepted as alias.",
)
parser.add_argument(
"--skip-codebase-analysis",
action="store_true",
help="Skip C3.x codebase analysis for GitHub sources (default: enabled)",
)
parser.add_argument(
"--fresh",
action="store_true",
help="Clear any existing data and start fresh (ignore checkpoints)",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Preview what will be scraped without actually scraping",
)
# Enhancement Workflow arguments (mirrors scrape/github/pdf/codebase scrapers)
parser.add_argument(
"--enhance-workflow",
action="append",
dest="enhance_workflow",
help="Apply enhancement workflow (file path or preset). Can use multiple times to chain workflows.",
metavar="WORKFLOW",
)
parser.add_argument(
"--enhance-stage",
action="append",
dest="enhance_stage",
help="Add inline enhancement stage (format: 'name:prompt'). Can be used multiple times.",
metavar="STAGE",
)
parser.add_argument(
"--var",
action="append",
dest="var",
help="Override workflow variable (format: 'key=value'). Can be used multiple times.",
metavar="VAR",
)
parser.add_argument(
"--workflow-dry-run",
action="store_true",
dest="workflow_dry_run",
help="Preview workflow stages without executing (requires --enhance-workflow)",
)
parser.add_argument(
"--api-key",
type=str,
metavar="KEY",
help="Anthropic API key (or set ANTHROPIC_API_KEY env var)",
)
parser.add_argument(
"--enhance-level",
type=int,
choices=[0, 1, 2, 3],
default=None,
metavar="LEVEL",
help=(
"Global AI enhancement level override for all sources "
"(0=off, 1=SKILL.md, 2=+arch/config, 3=full). "
"Overrides per-source enhance_level in config."
),
)
parser.add_argument(
"--agent",
type=str,
choices=["claude", "codex", "copilot", "opencode", "kimi", "custom"],
metavar="AGENT",
help="Local coding agent for enhancement (default: AI agent from SKILL_SEEKER_AGENT env var)",
)
parser.add_argument(
"--agent-cmd",
type=str,
metavar="CMD",
help="Override agent command template (advanced)",
)
args = parser.parse_args()
setup_logging()
# Create scraper
scraper = UnifiedScraper(args.config, args.merge_mode)
# Disable codebase analysis if requested
if args.skip_codebase_analysis:
for source in scraper.config.get("sources", []):
if source["type"] == "github":
source["enable_codebase_analysis"] = False
logger.info(
f"⏭️ Skipping codebase analysis for GitHub source: {source.get('repo', 'unknown')}"
)
# Handle --fresh flag (clear cache)
if args.fresh:
import shutil
if os.path.exists(scraper.cache_dir):
logger.info(f"🧹 Clearing cache: {scraper.cache_dir}")
shutil.rmtree(scraper.cache_dir)
# Recreate directories
os.makedirs(scraper.sources_dir, exist_ok=True)
os.makedirs(scraper.data_dir, exist_ok=True)
os.makedirs(scraper.repos_dir, exist_ok=True)
os.makedirs(scraper.logs_dir, exist_ok=True)
# Handle --dry-run flag
if args.dry_run:
logger.info("🔍 DRY RUN MODE - Preview only, no scraping will occur")
logger.info(f"\nWould scrape {len(scraper.config.get('sources', []))} sources:")
# Source type display config: type -> (label, key for detail)
_SOURCE_DISPLAY = {
"documentation": ("Documentation", "base_url"),
"github": ("GitHub", "repo"),
"pdf": ("PDF", "path"),
"word": ("Word", "path"),
"epub": ("EPUB", "path"),
"video": ("Video", "url"),
"local": ("Local Codebase", "path"),
"jupyter": ("Jupyter Notebook", "path"),
"html": ("HTML", "path"),
"openapi": ("OpenAPI Spec", "path"),
"asciidoc": ("AsciiDoc", "path"),
"pptx": ("PowerPoint", "path"),
"confluence": ("Confluence", "base_url"),
"notion": ("Notion", "page_id"),
"rss": ("RSS/Atom Feed", "url"),
"manpage": ("Man Page", "names"),
"chat": ("Chat Export", "path"),
}
for idx, source in enumerate(scraper.config.get("sources", []), 1):
source_type = source.get("type", "unknown")
label, key = _SOURCE_DISPLAY.get(source_type, (source_type.title(), "path"))
detail = source.get(key, "N/A")
if isinstance(detail, list):
detail = ", ".join(str(d) for d in detail)
logger.info(f" {idx}. {label}: {detail}")
logger.info(f"\nOutput directory: {scraper.output_dir}")
logger.info(f"Merge mode: {scraper.merge_mode}")
return
# Run scraper (pass args for workflow integration)
scraper.run(args=args)
if __name__ == "__main__":
main()
return 1

View File

@@ -15,6 +15,7 @@ discrepancies transparently.
import json
import logging
import os
import re
import shutil
from pathlib import Path
@@ -532,12 +533,49 @@ This skill synthesizes knowledge from multiple sources:
logger.warning("No source SKILL.md files found, generating minimal SKILL.md (legacy)")
content = self._generate_minimal_skill_md()
# Ensure frontmatter uses config name/description, not auto-generated slugs
content = self._normalize_frontmatter(content)
# Write final content
with open(skill_path, "w", encoding="utf-8") as f:
f.write(content)
logger.info(f"Created SKILL.md ({len(content)} chars, ~{len(content.split())} words)")
def _normalize_frontmatter(self, content: str) -> str:
"""Ensure SKILL.md frontmatter uses the config name and description.
Standalone source SKILL.md files may have auto-generated slugs
(e.g., 'primetween-github-0-kyrylokuzyk-primetween'). This replaces
the name and description with the canonical values from the config.
"""
if not content.startswith("---"):
return content
end = content.find("---", 3)
if end == -1:
return content
frontmatter = content[3:end]
body = content[end + 3 :]
canonical_name = self.name.lower().replace("_", "-").replace(" ", "-")[:64]
frontmatter = re.sub(
r"^name:.*$", f"name: {canonical_name}", frontmatter, count=1, flags=re.MULTILINE
)
# Handle both single-line and multiline YAML description values
desc = self.description[:1024]
frontmatter = re.sub(
r"^description:.*(?:\n[ \t]+.*)*$",
f"description: {desc}",
frontmatter,
count=1,
flags=re.MULTILINE,
)
return f"---{frontmatter}---{body}"
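The normalization above can be exercised standalone. A minimal sketch of the same slug-to-canonical rewrite (the function name and parameters here are illustrative, not part of the PR):

```python
import re


def normalize_frontmatter(content: str, name: str, description: str) -> str:
    # Sketch: replace auto-generated name/description in SKILL.md
    # frontmatter with canonical config values.
    if not content.startswith("---"):
        return content
    end = content.find("---", 3)
    if end == -1:
        return content
    frontmatter = content[3:end]
    body = content[end + 3:]
    canonical = name.lower().replace("_", "-").replace(" ", "-")[:64]
    frontmatter = re.sub(
        r"^name:.*$", f"name: {canonical}", frontmatter,
        count=1, flags=re.MULTILINE,
    )
    frontmatter = re.sub(
        r"^description:.*(?:\n[ \t]+.*)*$",
        lambda _: f"description: {description[:1024]}",
        frontmatter, count=1, flags=re.MULTILINE,
    )
    return f"---{frontmatter}---{body}"
```

Given a frontmatter name like `primetween-github-0-kyrylokuzyk-primetween` and a config name `Prime_Tween`, this rewrites the name to `prime-tween` while leaving the body untouched.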
def _synthesize_docs_pdf(self, skill_mds: dict[str, str]) -> str:
"""Synthesize documentation + PDF sources.
@@ -958,8 +996,8 @@ This skill combines knowledge from multiple sources:
def _format_api_entry(self, api_data: dict, inline_conflict: bool = False) -> str:
"""Format a single API entry."""
name = api_data.get("name", "Unknown")
signature = api_data.get("merged_signature", name)
description = api_data.get("merged_description", "")
signature = api_data.get("merged_signature", api_data.get("signature", name))
description = api_data.get("merged_description", api_data.get("description", ""))
warning = api_data.get("warning", "")
entry = f"#### `{signature}`\n\n"
@@ -1302,7 +1340,7 @@ This skill combines knowledge from multiple sources:
apis = self.merged_data.get("apis", {})
for api_name in sorted(apis.keys()):
api_data = apis[api_name]
api_data = {**apis[api_name], "name": api_name}
entry = self._format_api_entry(api_data, inline_conflict=True)
f.write(entry)
@@ -1386,16 +1424,33 @@ This skill combines knowledge from multiple sources:
if c3_data.get("architecture"):
languages = c3_data["architecture"].get("languages", {})
# If no languages from C3.7, try to get from GitHub data
# github_data already available from method scope
if not languages and github_data.get("languages"):
# GitHub data has languages as list, convert to dict with count 1
languages = dict.fromkeys(github_data["languages"], 1)
# If no languages from C3.7, try to get from code_analysis or GitHub data
if not languages:
code_analysis = github_data.get("code_analysis", {})
if code_analysis.get("files_analyzed") and code_analysis.get("languages_analyzed"):
# Use code_analysis file counts per language
files = code_analysis.get("files", [])
lang_counts = {}
for file_info in files:
lang = file_info.get("language", "Unknown")
lang_counts[lang] = lang_counts.get(lang, 0) + 1
if lang_counts:
languages = lang_counts
else:
# Fallback: attribute the total file count to each analyzed language
for lang in code_analysis["languages_analyzed"]:
languages[lang] = code_analysis["files_analyzed"]
elif github_data.get("languages"):
gh_langs = github_data["languages"]
# dict.fromkeys accepts any iterable (dict keys or list items)
if isinstance(gh_langs, (dict, list)):
languages = dict.fromkeys(gh_langs, 0)
if languages:
f.write("**Languages Detected**:\n")
for lang, count in sorted(languages.items(), key=lambda x: x[1], reverse=True)[:5]:
if isinstance(count, int):
if isinstance(count, int) and count > 0:
f.write(f"- {lang}: {count} files\n")
else:
f.write(f"- {lang}\n")
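The per-language counting in the code_analysis branch reduces to a small helper. A standalone sketch (the helper name is illustrative):

```python
def count_files_per_language(files: list[dict]) -> dict[str, int]:
    """Count analyzed files per language, defaulting to 'Unknown'."""
    counts: dict[str, int] = {}
    for info in files:
        lang = info.get("language", "Unknown")
        counts[lang] = counts.get(lang, 0) + 1
    return counts
```

This is what turns the old "1 files" placeholder into real per-language counts in ARCHITECTURE.md.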
@@ -1534,6 +1589,24 @@ This skill combines knowledge from multiple sources:
logger.info("📐 Created ARCHITECTURE.md")
@staticmethod
def _make_path_relative(file_path: str) -> str:
"""Strip absolute path prefixes, keeping only the repo-relative path."""
# Strip .skillseeker-cache repo clone paths
if ".skillseeker-cache" in file_path:
# Pattern: ...repos/{idx}_{owner}_{repo}/relative/path
parts = file_path.split("/repos/")
if len(parts) > 1:
# Skip the repo dir name (e.g., '0_Owner_Repo/')
remainder = parts[1]
slash_idx = remainder.find("/")
if slash_idx != -1:
return remainder[slash_idx + 1 :]
# Generic: if it looks absolute, try to make it relative
if os.path.isabs(file_path):
return os.path.basename(file_path)
return file_path
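Extracted as a standalone function, the stripping behaves like this sketch (assuming the cache layout `.../.skillseeker-cache/repos/{idx}_{owner}_{repo}/...` described in the comment above):

```python
import os


def make_path_relative(file_path: str) -> str:
    """Strip absolute path prefixes, keeping only the repo-relative path."""
    if ".skillseeker-cache" in file_path:
        parts = file_path.split("/repos/")
        if len(parts) > 1:
            # Skip the repo dir name (e.g., '0_Owner_Repo/')
            remainder = parts[1]
            idx = remainder.find("/")
            if idx != -1:
                return remainder[idx + 1:]
    # Generic: if it still looks absolute, fall back to the basename
    if os.path.isabs(file_path):
        return os.path.basename(file_path)
    return file_path
```

Relative paths pass through unchanged, so calling it on already-clean data is a no-op.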
def _generate_pattern_references(self, c3_dir: str, patterns_data: dict):
"""Generate design pattern references (C3.1)."""
if not patterns_data:
@@ -1556,7 +1629,8 @@ This skill combines knowledge from multiple sources:
for file_data in patterns_data:
patterns = file_data.get("patterns", [])
if patterns:
f.write(f"## {file_data['file_path']}\n\n")
display_path = self._make_path_relative(file_data["file_path"])
f.write(f"## {display_path}\n\n")
for p in patterns:
f.write(f"### {p['pattern_type']}\n\n")
if p.get("class_name"):


@@ -14,14 +14,13 @@ Usage:
python3 video_scraper.py --from-json video_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
import time
from skill_seekers.cli.skill_converter import SkillConverter
from skill_seekers.cli.video_models import (
AudioVisualAlignment,
TextGroupTimeline,
@@ -318,9 +317,11 @@ def _ai_clean_reference(ref_path: str, content: str, api_key: str | None = None)
# =============================================================================
class VideoToSkillConverter:
class VideoToSkillConverter(SkillConverter):
"""Convert video content to AI skill."""
SOURCE_TYPE = "video"
def __init__(self, config: dict):
"""Initialize converter.
@@ -333,6 +334,7 @@ class VideoToSkillConverter:
- visual: Whether to enable visual extraction
- whisper_model: Whisper model size
"""
super().__init__(config)
self.config = config
self.name = config["name"]
self.description = config.get("description", "")
@@ -355,6 +357,10 @@ class VideoToSkillConverter:
# Results
self.result: VideoScraperResult | None = None
def extract(self):
"""Extract content from video source (SkillConverter interface)."""
self.process()
def process(self) -> VideoScraperResult:
"""Run the full video processing pipeline.
@@ -1015,241 +1021,3 @@ class VideoToSkillConverter:
lines.append(f"- [{video.title}](references/{ref_filename})")
return "\n".join(lines)
# =============================================================================
# CLI Entry Point
# =============================================================================
def main() -> int:
"""Entry point for video scraper CLI.
Returns:
Exit code (0 for success, non-zero for error).
"""
from skill_seekers.cli.arguments.video import add_video_arguments
parser = argparse.ArgumentParser(
prog="skill-seekers-video",
description="Extract transcripts and metadata from videos and generate skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""\
Examples:
skill-seekers video --url https://www.youtube.com/watch?v=...
skill-seekers video --video-file recording.mp4
skill-seekers video --playlist https://www.youtube.com/playlist?list=...
skill-seekers video --from-json video_extracted.json
skill-seekers video --url https://youtu.be/... --languages en,es
""",
)
add_video_arguments(parser)
args = parser.parse_args()
# --setup: run GPU detection + dependency installation, then exit
if getattr(args, "setup", False):
from skill_seekers.cli.video_setup import run_setup
return run_setup(interactive=True)
# Setup logging
log_level = logging.DEBUG if args.verbose else (logging.WARNING if args.quiet else logging.INFO)
logging.basicConfig(level=log_level, format="%(levelname)s: %(message)s")
# Validate inputs
has_source = any(
[
getattr(args, "url", None),
getattr(args, "video_file", None),
getattr(args, "playlist", None),
]
)
has_json = getattr(args, "from_json", None)
if not has_source and not has_json:
parser.error("Must specify --url, --video-file, --playlist, or --from-json")
# Parse and validate time clipping
raw_start = getattr(args, "start_time", None)
raw_end = getattr(args, "end_time", None)
clip_start: float | None = None
clip_end: float | None = None
if raw_start is not None:
try:
clip_start = parse_time_to_seconds(raw_start)
except ValueError as exc:
parser.error(f"--start-time: {exc}")
if raw_end is not None:
try:
clip_end = parse_time_to_seconds(raw_end)
except ValueError as exc:
parser.error(f"--end-time: {exc}")
if clip_start is not None or clip_end is not None:
if getattr(args, "playlist", None):
parser.error("--start-time/--end-time cannot be used with --playlist")
if clip_start is not None and clip_end is not None and clip_start >= clip_end:
parser.error(f"--start-time ({clip_start}s) must be before --end-time ({clip_end}s)")
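`parse_time_to_seconds` is defined elsewhere in this module; one plausible implementation, assuming it accepts plain seconds, MM:SS, and HH:MM:SS (this sketch is an assumption, not the PR's actual code):

```python
def parse_time_to_seconds(value: str) -> float:
    """Parse '90', '1:30', or '0:01:30' into seconds (sketch)."""
    parts = value.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"invalid time: {value!r}")
    try:
        nums = [float(p) for p in parts]
    except ValueError:
        raise ValueError(f"invalid time: {value!r}") from None
    # Horner-style accumulation: each colon shifts by a factor of 60
    seconds = 0.0
    for n in nums:
        seconds = seconds * 60 + n
    return seconds
```

Raising `ValueError` matches the `except ValueError` handling in the clipping validation above.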
# Build config
config = {
"name": args.name or "video_skill",
"description": getattr(args, "description", None) or "",
"output": getattr(args, "output", None),
"url": getattr(args, "url", None),
"video_file": getattr(args, "video_file", None),
"playlist": getattr(args, "playlist", None),
"languages": getattr(args, "languages", "en"),
"visual": getattr(args, "visual", False),
"whisper_model": getattr(args, "whisper_model", "base"),
"visual_interval": getattr(args, "visual_interval", 0.7),
"visual_min_gap": getattr(args, "visual_min_gap", 0.5),
"visual_similarity": getattr(args, "visual_similarity", 3.0),
"vision_ocr": getattr(args, "vision_ocr", False),
"start_time": clip_start,
"end_time": clip_end,
}
converter = VideoToSkillConverter(config)
# Dry run
if args.dry_run:
logger.info("DRY RUN — would process:")
for key in ["url", "video_file", "playlist"]:
if config.get(key):
logger.info(f" {key}: {config[key]}")
logger.info(f" name: {config['name']}")
logger.info(f" languages: {config['languages']}")
logger.info(f" visual: {config['visual']}")
if clip_start is not None or clip_end is not None:
start_str = _format_duration(clip_start) if clip_start is not None else "start"
end_str = _format_duration(clip_end) if clip_end is not None else "end"
logger.info(f" clip range: {start_str} - {end_str}")
return 0
# Workflow 1: Build from JSON
if has_json:
logger.info(f"Loading extracted data from {args.from_json}")
converter.load_extracted_data(args.from_json)
converter.build_skill()
logger.info(f"Skill built at {converter.skill_dir}")
return 0
# Workflow 2: Full extraction + build
try:
result = converter.process()
if not result.videos:
logger.error("No videos were successfully processed")
if result.errors:
for err in result.errors:
logger.error(f" {err['source']}: {err['error']}")
return 1
converter.save_extracted_data()
converter.build_skill()
logger.info(f"\nSkill built successfully at {converter.skill_dir}")
logger.info(f" Videos: {len(result.videos)}")
logger.info(f" Segments: {result.total_segments}")
logger.info(f" Duration: {_format_duration(result.total_duration_seconds)}")
logger.info(f" Processing time: {result.processing_time_seconds:.1f}s")
if result.warnings:
for w in result.warnings:
logger.warning(f" {w}")
except RuntimeError as e:
logger.error(str(e))
return 1
# Enhancement
enhance_level = getattr(args, "enhance_level", 0)
if enhance_level > 0:
# Pass 1: Clean reference files (Code Timeline reconstruction)
converter._enhance_reference_files(enhance_level, args)
# Auto-inject video-tutorial workflow if no workflow specified
if not getattr(args, "enhance_workflow", None):
args.enhance_workflow = ["video-tutorial"]
# Pass 2: Run workflow stages (specialized video analysis)
try:
from skill_seekers.cli.workflow_runner import run_workflows
video_context = {
"skill_name": converter.name,
"skill_dir": converter.skill_dir,
"source_type": "video_tutorial",
}
run_workflows(args, context=video_context)
except ImportError:
logger.debug("Workflow runner not available, skipping workflow stages")
# Run traditional SKILL.md enhancement (reads references + rewrites)
_run_video_enhancement(converter.skill_dir, enhance_level, args)
return 0
def _run_video_enhancement(skill_dir: str, enhance_level: int, args) -> None:
"""Run traditional SKILL.md enhancement with video-aware prompt.
This calls the same SkillEnhancer used by other scrapers, but the prompt
auto-detects video_tutorial source type and uses a video-specific prompt.
"""
import os
import subprocess
has_api_key = bool(
os.environ.get("ANTHROPIC_API_KEY")
or os.environ.get("ANTHROPIC_AUTH_TOKEN")
or getattr(args, "api_key", None)
or os.environ.get("MOONSHOT_API_KEY")
)
agent = getattr(args, "agent", None)
if not has_api_key and not agent:
logger.info("\n💡 Enhance your video skill with AI:")
logger.info(f" export ANTHROPIC_API_KEY=sk-ant-...")
logger.info(f" skill-seekers enhance {skill_dir} --enhance-level {enhance_level}")
return
logger.info(f"\n🤖 Running video-aware SKILL.md enhancement (level {enhance_level})...")
try:
enhance_cmd = ["skill-seekers-enhance", skill_dir]
api_key = getattr(args, "api_key", None)
if api_key:
enhance_cmd.extend(["--api-key", api_key])
if agent:
enhance_cmd.extend(["--agent", agent])
logger.info(
"Starting video skill enhancement (this may take 10+ minutes "
"for large videos with AI enhancement)..."
)
subprocess.run(enhance_cmd, check=True, timeout=1800)
logger.info("Video skill enhancement complete!")
except subprocess.TimeoutExpired:
logger.warning(
"⚠ Enhancement timed out after 30 minutes. "
"The skill was still built without enhancement. "
"You can retry manually with:\n"
f" skill-seekers enhance {skill_dir} --enhance-level {enhance_level}"
)
except subprocess.CalledProcessError as exc:
logger.warning(
f"⚠ Enhancement failed (exit code {exc.returncode}), "
"but skill was still built. You can retry manually with:\n"
f" skill-seekers enhance {skill_dir} --enhance-level {enhance_level}"
)
except FileNotFoundError:
logger.warning("⚠ skill-seekers-enhance not found. Run manually:")
logger.info(f" skill-seekers enhance {skill_dir} --enhance-level {enhance_level}")
if __name__ == "__main__":
sys.exit(main())


@@ -10,12 +10,10 @@ Usage:
python3 word_scraper.py --from-json document_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard
@@ -27,6 +25,8 @@ try:
except ImportError:
WORD_AVAILABLE = False
from .skill_converter import SkillConverter
logger = logging.getLogger(__name__)
@@ -72,10 +72,13 @@ def infer_description_from_word(metadata: dict = None, name: str = "") -> str:
)
class WordToSkillConverter:
class WordToSkillConverter(SkillConverter):
"""Convert Word document (.docx) to AI skill."""
SOURCE_TYPE = "word"
def __init__(self, config):
super().__init__(config)
self.config = config
self.name = config["name"]
self.docx_path = config.get("docx_path", "")
@@ -93,6 +96,10 @@ class WordToSkillConverter:
# Extracted data
self.extracted_data = None
def extract(self):
"""SkillConverter interface — delegates to extract_docx()."""
return self.extract_docx()
def extract_docx(self):
"""Extract content from Word document using mammoth + python-docx.
@@ -918,146 +925,3 @@ def _score_code_quality(code: str) -> float:
score -= 2.0
return min(10.0, max(0.0, score))
def main():
from .arguments.word import add_word_arguments
parser = argparse.ArgumentParser(
description="Convert Word document (.docx) to AI skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_word_arguments(parser)
args = parser.parse_args()
# Set logging level
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Handle --dry-run
if getattr(args, "dry_run", False):
source = getattr(args, "docx", None) or getattr(args, "from_json", None) or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: Word Document Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return 0
# Validate inputs
if not (getattr(args, "docx", None) or getattr(args, "from_json", None)):
parser.error("Must specify --docx or --from-json")
# Build from JSON workflow
if getattr(args, "from_json", None):
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None)
or f"Use when referencing {name} documentation",
}
try:
converter = WordToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
return 0
# Direct DOCX mode
if not getattr(args, "name", None):
# Auto-detect name from filename
args.name = Path(args.docx).stem
config = {
"name": args.name,
"docx_path": args.docx,
# Pass None so extract_docx() can infer from document metadata (subject/title)
"description": getattr(args, "description", None),
}
if getattr(args, "categories", None):
config["categories"] = args.categories
try:
converter = WordToSkillConverter(config)
# Extract
if not converter.extract_docx():
print("\n❌ Word extraction failed - see error above", file=sys.stderr)
sys.exit(1)
# Build skill
converter.build_skill()
# Enhancement Workflow Integration
from skill_seekers.cli.workflow_runner import run_workflows
workflow_executed, workflow_names = run_workflows(args)
workflow_name = ", ".join(workflow_names) if workflow_names else None
# Traditional enhancement (complements workflow system)
if getattr(args, "enhance_level", 0) > 0:
import os
api_key = getattr(args, "api_key", None) or os.environ.get("ANTHROPIC_API_KEY")
mode = "API" if api_key else "LOCAL"
print("\n" + "=" * 80)
print(f"🤖 Traditional AI Enhancement ({mode} mode, level {args.enhance_level})")
print("=" * 80)
if workflow_executed:
print(f" Running after workflow: {workflow_name}")
print(
" (Workflow provides specialized analysis, enhancement provides general improvements)"
)
print("")
skill_dir = converter.skill_dir
if api_key:
try:
from skill_seekers.cli.enhance_skill import enhance_skill_md
enhance_skill_md(skill_dir, api_key)
print("✅ API enhancement complete!")
except ImportError:
print("❌ API enhancement not available. Falling back to LOCAL mode...")
from pathlib import Path
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
else:
from pathlib import Path
from skill_seekers.cli.enhance_skill_local import LocalSkillEnhancer
agent = getattr(args, "agent", None) if args else None
agent_cmd = getattr(args, "agent_cmd", None) if args else None
enhancer = LocalSkillEnhancer(Path(skill_dir), agent=agent, agent_cmd=agent_cmd)
enhancer.run(headless=True)
print("✅ Local enhancement complete!")
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"\n❌ Unexpected error during Word processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
return 0
if __name__ == "__main__":
sys.exit(main())


@@ -13,7 +13,9 @@ This module contains all scraping-related MCP tool implementations:
Extracted from server.py for better modularity and organization.
"""
import io
import json
import logging
import sys
from pathlib import Path
@@ -34,6 +36,48 @@ except ImportError:
CLI_DIR = Path(__file__).parent.parent.parent / "cli"
def _run_converter(converter, progress_msg: str) -> list:
"""Run a converter in-process with log capture.
Args:
converter: An initialized SkillConverter instance.
progress_msg: Progress message to prepend to output.
Returns:
List[TextContent] with success/error message.
"""
log_capture = io.StringIO()
handler = logging.StreamHandler(log_capture)
handler.setLevel(logging.INFO)
sk_logger = logging.getLogger("skill_seekers")
sk_logger.addHandler(handler)
try:
result = converter.run()
except Exception as exc:
captured = log_capture.getvalue()
return [
TextContent(
type="text",
text=f"{progress_msg}{captured}\n\n❌ Converter raised an exception:\n{exc}",
)
]
finally:
sk_logger.removeHandler(handler)
captured = log_capture.getvalue()
output = progress_msg + captured
if result == 0:
return [TextContent(type="text", text=output)]
else:
return [
TextContent(
type="text",
text=f"{output}\n\n❌ Converter returned non-zero exit code ({result})",
)
]
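The capture pattern in `_run_converter` generalizes to a tiny helper. A sketch (the helper name is illustrative):

```python
import io
import logging


def run_with_log_capture(func, logger_name: str = "skill_seekers") -> tuple:
    """Run func() while capturing that logger's INFO+ output to a string."""
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    handler.setLevel(logging.INFO)
    lg = logging.getLogger(logger_name)
    lg.addHandler(handler)
    try:
        result = func()
    finally:
        # Always detach, even on exceptions
        lg.removeHandler(handler)
    return result, buf.getvalue()
```

Detaching the handler in `finally` keeps repeated tool invocations from stacking handlers on the shared `skill_seekers` logger.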
def run_subprocess_with_streaming(cmd: list[str], timeout: int = None) -> tuple:
"""
Run subprocess with real-time output streaming.
@@ -141,10 +185,11 @@ async def estimate_pages_tool(args: dict) -> list[TextContent]:
# Estimate: 0.5s per page discovered
timeout = max(300, max_discovery // 2) # Minimum 5 minutes
# Run estimate_pages.py
# Run estimate_pages module
cmd = [
sys.executable,
str(CLI_DIR / "estimate_pages.py"),
"-m",
"skill_seekers.cli.estimate_pages",
config_path,
"--max-discovery",
str(max_discovery),
@@ -185,8 +230,6 @@ async def scrape_docs_tool(args: dict) -> list[TextContent]:
"""
config_path = args["config_path"]
unlimited = args.get("unlimited", False)
enhance_local = args.get("enhance_local", False)
skip_scrape = args.get("skip_scrape", False)
dry_run = args.get("dry_run", False)
merge_mode = args.get("merge_mode")
@@ -218,80 +261,52 @@ async def scrape_docs_tool(args: dict) -> list[TextContent]:
else:
config_to_use = config_path
# Choose scraper based on format
# Build progress message
if is_unified:
scraper_script = "unified_scraper.py"
progress_msg = "🔄 Starting unified multi-source scraping...\n"
progress_msg += "📦 Config format: Unified (multiple sources)\n"
else:
scraper_script = "doc_scraper.py"
progress_msg = "🔄 Starting scraping process...\n"
progress_msg += "📦 Config format: Legacy (single source)\n"
# Build command
cmd = [sys.executable, str(CLI_DIR / scraper_script), "--config", config_to_use]
# Add merge mode for unified configs
if is_unified and merge_mode:
cmd.extend(["--merge-mode", merge_mode])
# Add --fresh to avoid user input prompts when existing data found
if not skip_scrape:
cmd.append("--fresh")
if enhance_local:
cmd.append("--enhance-local")
if skip_scrape:
cmd.append("--skip-scrape")
if dry_run:
cmd.append("--dry-run")
# Determine timeout based on operation type
if dry_run:
timeout = 300 # 5 minutes for dry run
elif skip_scrape:
timeout = 600 # 10 minutes for building from cache
elif unlimited:
timeout = None # No timeout for unlimited mode (user explicitly requested)
else:
# Read config to estimate timeout
try:
if is_unified:
# For unified configs, estimate based on all sources
total_pages = 0
for source in config.get("sources", []):
if source.get("type") == "documentation":
total_pages += source.get("max_pages", 500)
max_pages = total_pages or 500
else:
max_pages = config.get("max_pages", 500)
# Estimate: 30s per page + buffer
timeout = max(3600, max_pages * 35) # Minimum 1 hour, or 35s per page
except Exception:
timeout = 14400 # Default: 4 hours
# Add progress message
if timeout:
progress_msg += f"⏱️ Maximum time allowed: {timeout // 60} minutes\n"
else:
progress_msg += "⏱️ Unlimited mode - no timeout\n"
progress_msg += "📝 Progress will be shown below:\n\n"
# Run scraper with streaming
stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
# Run converter in-process
try:
if is_unified:
from skill_seekers.cli.unified_scraper import UnifiedScraper
# Clean up temporary config
if unlimited and Path(config_to_use).exists():
Path(config_to_use).unlink()
converter = UnifiedScraper(config_to_use, merge_mode=merge_mode)
else:
from skill_seekers.cli.skill_converter import get_converter
output = progress_msg + stdout
# For legacy format, detect type from config keys
with open(config_to_use) as f:
config_to_pass = json.load(f)
if returncode == 0:
return [TextContent(type="text", text=output)]
else:
error_output = output + f"\n\n❌ Error:\n{stderr}"
return [TextContent(type="text", text=error_output)]
# Detect source type from config content
if "base_url" in config_to_pass:
source_type = "web"
elif "repo" in config_to_pass:
source_type = "github"
elif "pdf_path" in config_to_pass:
source_type = "pdf"
elif "directory" in config_to_pass:
source_type = "local"
else:
source_type = "web" # default fallback
converter = get_converter(source_type, config_to_pass)
if dry_run:
converter.dry_run = True
result = _run_converter(converter, progress_msg)
finally:
# Clean up temporary config
if unlimited and Path(config_to_use).exists():
Path(config_to_use).unlink()
return result
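The key-based detection for legacy configs is easy to unit-test in isolation. A sketch of the same if/elif chain as a function (name is illustrative):

```python
def detect_source_type(config: dict) -> str:
    """Detect legacy config source type from its distinguishing keys."""
    if "base_url" in config:
        return "web"
    if "repo" in config:
        return "github"
    if "pdf_path" in config:
        return "pdf"
    if "directory" in config:
        return "local"
    return "web"  # default fallback
```

Order matters only if a config carried several of these keys at once; with legacy single-source configs each key is mutually exclusive.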
async def scrape_pdf_tool(args: dict) -> list[TextContent]:
@@ -318,44 +333,48 @@ async def scrape_pdf_tool(args: dict) -> list[TextContent]:
description = args.get("description")
from_json = args.get("from_json")
# Build command
cmd = [sys.executable, str(CLI_DIR / "pdf_scraper.py")]
progress_msg = "📄 Scraping PDF documentation...\n\n"
# Mode 1: Config file
if config_path:
cmd.extend(["--config", config_path])
with open(config_path) as f:
pdf_config = json.load(f)
# Mode 2: Direct PDF
elif pdf_path and name:
cmd.extend(["--pdf", pdf_path, "--name", name])
pdf_config = {"name": name, "pdf_path": pdf_path}
if description:
cmd.extend(["--description", description])
pdf_config["description"] = description
# Mode 3: From JSON
# Mode 3: From JSON — use PDFToSkillConverter.load_extracted_data
elif from_json:
cmd.extend(["--from-json", from_json])
from skill_seekers.cli.pdf_scraper import PDFToSkillConverter
# Build a minimal config; name is derived from the JSON filename
json_name = Path(from_json).stem.replace("_extracted", "")
pdf_config = {"name": json_name}
converter = PDFToSkillConverter(pdf_config)
converter.load_extracted_data(from_json)
converter.build_skill()
return [
TextContent(
type="text",
text=f"{progress_msg}✅ Skill built from extracted JSON: {from_json}",
)
]
else:
return [
TextContent(
type="text", text="❌ Error: Must specify --config, --pdf + --name, or --from-json"
type="text",
text="❌ Error: Must specify --config, --pdf + --name, or --from-json",
)
]
# Run pdf_scraper.py with streaming (can take a while)
timeout = 600 # 10 minutes for PDF extraction
from skill_seekers.cli.skill_converter import get_converter
progress_msg = "📄 Scraping PDF documentation...\n"
progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"
stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
output = progress_msg + stdout
if returncode == 0:
return [TextContent(type="text", text=output)]
else:
return [TextContent(type="text", text=f"{output}\n\n❌ Error:\n{stderr}")]
converter = get_converter("pdf", pdf_config)
return _run_converter(converter, progress_msg)
async def scrape_video_tool(args: dict) -> list[TextContent]:
@@ -411,29 +430,23 @@ async def scrape_video_tool(args: dict) -> list[TextContent]:
start_time = args.get("start_time")
end_time = args.get("end_time")
# Build command
cmd = [sys.executable, str(CLI_DIR / "video_scraper.py")]
# Build config dict for the converter
video_config: dict = {}
if from_json:
cmd.extend(["--from-json", from_json])
video_config["from_json"] = from_json
video_config["name"] = name or Path(from_json).stem.replace("_video_extracted", "")
elif url:
cmd.extend(["--url", url])
if name:
cmd.extend(["--name", name])
if description:
cmd.extend(["--description", description])
if languages:
cmd.extend(["--languages", languages])
video_config["url"] = url
if not name:
return [TextContent(type="text", text="❌ Error: --name is required with --url")]
video_config["name"] = name
elif video_file:
cmd.extend(["--video-file", video_file])
if name:
cmd.extend(["--name", name])
if description:
cmd.extend(["--description", description])
video_config["video_file"] = video_file
video_config["name"] = name or Path(video_file).stem
elif playlist:
cmd.extend(["--playlist", playlist])
if name:
cmd.extend(["--name", name])
video_config["playlist"] = playlist
video_config["name"] = name or "playlist"
else:
return [
TextContent(
@@ -442,38 +455,31 @@ async def scrape_video_tool(args: dict) -> list[TextContent]:
)
]
# Visual extraction parameters
if visual:
cmd.append("--visual")
if description:
video_config["description"] = description
if languages:
video_config["languages"] = languages
video_config["visual"] = visual
if whisper_model:
cmd.extend(["--whisper-model", whisper_model])
video_config["whisper_model"] = whisper_model
if visual_interval is not None:
cmd.extend(["--visual-interval", str(visual_interval)])
video_config["visual_interval"] = visual_interval
if visual_min_gap is not None:
cmd.extend(["--visual-min-gap", str(visual_min_gap)])
video_config["visual_min_gap"] = visual_min_gap
if visual_similarity is not None:
cmd.extend(["--visual-similarity", str(visual_similarity)])
if vision_ocr:
cmd.append("--vision-ocr")
video_config["visual_similarity"] = visual_similarity
video_config["vision_ocr"] = vision_ocr
if start_time:
cmd.extend(["--start-time", str(start_time)])
video_config["start_time"] = start_time
if end_time:
cmd.extend(["--end-time", str(end_time)])
video_config["end_time"] = end_time
# Run video_scraper.py with streaming
timeout = 600 # 10 minutes for video extraction
progress_msg = "🎬 Scraping video content...\n\n"
progress_msg = "🎬 Scraping video content...\n"
progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"
from skill_seekers.cli.skill_converter import get_converter
stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
output = progress_msg + stdout
if returncode == 0:
return [TextContent(type="text", text=output)]
else:
return [TextContent(type="text", text=f"{output}\n\n❌ Error:\n{stderr}")]
converter = get_converter("video", video_config)
return _run_converter(converter, progress_msg)
async def scrape_github_tool(args: dict) -> list[TextContent]:
@@ -510,50 +516,37 @@ async def scrape_github_tool(args: dict) -> list[TextContent]:
max_issues = args.get("max_issues", 100)
scrape_only = args.get("scrape_only", False)
# Build command
cmd = [sys.executable, str(CLI_DIR / "github_scraper.py")]
# Mode 1: Config file
# Build config dict for the converter
if config_path:
cmd.extend(["--config", config_path])
# Mode 2: Direct repo
with open(config_path) as f:
github_config = json.load(f)
elif repo:
cmd.extend(["--repo", repo])
github_config: dict = {"repo": repo}
if name:
cmd.extend(["--name", name])
github_config["name"] = name
if description:
cmd.extend(["--description", description])
github_config["description"] = description
if token:
cmd.extend(["--token", token])
github_config["token"] = token
if no_issues:
cmd.append("--no-issues")
github_config["no_issues"] = True
if no_changelog:
cmd.append("--no-changelog")
github_config["no_changelog"] = True
if no_releases:
cmd.append("--no-releases")
github_config["no_releases"] = True
if max_issues != 100:
cmd.extend(["--max-issues", str(max_issues)])
github_config["max_issues"] = max_issues
if scrape_only:
cmd.append("--scrape-only")
github_config["scrape_only"] = True
else:
return [TextContent(type="text", text="❌ Error: Must specify --repo or --config")]
# Run github_scraper.py with streaming (can take a while)
timeout = 600 # 10 minutes for GitHub scraping
progress_msg = "🐙 Scraping GitHub repository...\n\n"
progress_msg = "🐙 Scraping GitHub repository...\n"
progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"
from skill_seekers.cli.skill_converter import get_converter
stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
output = progress_msg + stdout
if returncode == 0:
return [TextContent(type="text", text=output)]
else:
return [TextContent(type="text", text=f"{output}\n\n❌ Error:\n{stderr}")]
converter = get_converter("github", github_config)
return _run_converter(converter, progress_msg)
 async def scrape_codebase_tool(args: dict) -> list[TextContent]:
@@ -605,7 +598,7 @@ async def scrape_codebase_tool(args: dict) -> list[TextContent]:
     if not directory:
         return [TextContent(type="text", text="❌ Error: directory parameter is required")]
-    output = args.get("output", "output/codebase/")
+    output_dir = args.get("output", "output/codebase/")
     depth = args.get("depth", "deep")
     languages = args.get("languages", "")
     file_patterns = args.get("file_patterns", "")
@@ -620,43 +613,28 @@ async def scrape_codebase_tool(args: dict) -> list[TextContent]:
     skip_config_patterns = args.get("skip_config_patterns", False)
     skip_docs = args.get("skip_docs", False)
-    # Build command
-    cmd = [sys.executable, "-m", "skill_seekers.cli.codebase_scraper"]
-    cmd.extend(["--directory", directory])
+    # Derive a name from the directory for the converter
+    dir_name = Path(directory).resolve().name or "codebase"
-    if output:
-        cmd.extend(["--output", output])
-    if depth:
-        cmd.extend(["--depth", depth])
+    # Build config dict for CodebaseAnalyzer
+    codebase_config: dict = {
+        "name": dir_name,
+        "directory": directory,
+        "output_dir": output_dir,
+        "depth": depth,
+        "enhance_level": enhance_level,
+        "build_api_reference": not skip_api_reference,
+        "build_dependency_graph": not skip_dependency_graph,
+        "detect_patterns": not skip_patterns,
+        "extract_test_examples": not skip_test_examples,
+        "build_how_to_guides": not skip_how_to_guides,
+        "extract_config_patterns": not skip_config_patterns,
+        "extract_docs": not skip_docs,
+    }
     if languages:
-        cmd.extend(["--languages", languages])
+        codebase_config["languages"] = languages
     if file_patterns:
-        cmd.extend(["--file-patterns", file_patterns])
-    if enhance_level > 0:
-        cmd.extend(["--enhance-level", str(enhance_level)])
-    # Skip flags
-    if skip_api_reference:
-        cmd.append("--skip-api-reference")
-    if skip_dependency_graph:
-        cmd.append("--skip-dependency-graph")
-    if skip_patterns:
-        cmd.append("--skip-patterns")
-    if skip_test_examples:
-        cmd.append("--skip-test-examples")
-    if skip_how_to_guides:
-        cmd.append("--skip-how-to-guides")
-    if skip_config_patterns:
-        cmd.append("--skip-config-patterns")
-    if skip_docs:
-        cmd.append("--skip-docs")
-    # Adjust timeout based on enhance_level
-    timeout = 600  # 10 minutes base
-    if enhance_level >= 2:
-        timeout = 1200  # 20 minutes with AI enhancement
-    if enhance_level >= 3:
-        timeout = 3600  # 60 minutes for full enhancement
+        codebase_config["file_patterns"] = file_patterns
     level_names = {0: "off", 1: "SKILL.md only", 2: "standard", 3: "full"}
     progress_msg = "🔍 Analyzing local codebase...\n"
@@ -664,16 +642,12 @@ async def scrape_codebase_tool(args: dict) -> list[TextContent]:
     progress_msg += f"📊 Depth: {depth}\n"
     if enhance_level > 0:
         progress_msg += f"🤖 AI Enhancement: Level {enhance_level} ({level_names.get(enhance_level, 'unknown')})\n"
-    progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"
+    progress_msg += "\n"
-    stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
+    from skill_seekers.cli.skill_converter import get_converter
-    output_text = progress_msg + stdout
-    if returncode == 0:
-        return [TextContent(type="text", text=output_text)]
-    else:
-        return [TextContent(type="text", text=f"{output_text}\n\n❌ Error:\n{stderr}")]
+    converter = get_converter("local", codebase_config)
+    return _run_converter(converter, progress_msg)
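The hunk above turns the negative `skip_*` booleans into positive feature flags and names the skill after the directory. A self-contained sketch of that mapping (the helper name `build_codebase_config` is hypothetical, and only a subset of the flags is shown):

```python
from pathlib import Path


def build_codebase_config(directory: str, args: dict) -> dict:
    """Sketch of the config-building step: CLI-style skip_* booleans
    become positive feature flags, and the skill name falls back to the
    directory's basename."""
    dir_name = Path(directory).resolve().name or "codebase"
    config: dict = {
        "name": dir_name,
        "directory": directory,
        "depth": args.get("depth", "deep"),
        "build_api_reference": not args.get("skip_api_reference", False),
        "detect_patterns": not args.get("skip_patterns", False),
        "extract_docs": not args.get("skip_docs", False),
    }
    # Optional filters are only set when non-empty, as in the diff above.
    if args.get("languages"):
        config["languages"] = args["languages"]
    return config
```

Inverting at the boundary means the converter sees only positive capabilities (`detect_patterns`, `extract_docs`) and never needs to know the MCP tool exposed them as skip flags.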
 async def detect_patterns_tool(args: dict) -> list[TextContent]:
@@ -1096,51 +1070,35 @@ async def scrape_generic_tool(args: dict) -> list[TextContent]:
         )
     ]
-    # Build the subprocess command
-    # Map source type to module name (most are <type>_scraper, but some differ)
-    _MODULE_NAMES = {
-        "manpage": "man_scraper",
+    # Build config dict for the converter — map MCP args to the keys
+    # each converter expects in its __init__.
+    _CONFIG_KEY: dict[str, str] = {
+        "jupyter": "notebook_path",
+        "html": "html_path",
+        "openapi": "spec_path",
+        "asciidoc": "asciidoc_path",
+        "pptx": "pptx_path",
+        "manpage": "man_path",
+        "confluence": "export_path",
+        "notion": "export_path",
+        "rss": "feed_path",
+        "chat": "export_path",
     }
-    module_name = _MODULE_NAMES.get(source_type, f"{source_type}_scraper")
-    cmd = [sys.executable, "-m", f"skill_seekers.cli.{module_name}"]
-    # Map source type to the correct CLI flag for file/path input and URL input.
-    # Each scraper has its own flag name — using a generic --path or --url would fail.
-    _PATH_FLAGS: dict[str, str] = {
-        "jupyter": "--notebook",
-        "html": "--html-path",
-        "openapi": "--spec",
-        "asciidoc": "--asciidoc-path",
-        "pptx": "--pptx",
-        "manpage": "--man-path",
-        "confluence": "--export-path",
-        "notion": "--export-path",
-        "rss": "--feed-path",
-        "chat": "--export-path",
-    }
-    _URL_FLAGS: dict[str, str] = {
-        "confluence": "--base-url",
-        "notion": "--page-id",
-        "rss": "--feed-url",
-        "openapi": "--spec-url",
+    _URL_CONFIG_KEY: dict[str, str] = {
+        "confluence": "base_url",
+        "notion": "page_id",
+        "rss": "feed_url",
+        "openapi": "spec_url",
     }
-    # Determine the input flag based on source type
+    config: dict = {"name": name}
     if source_type in _URL_BASED_TYPES and url:
-        url_flag = _URL_FLAGS.get(source_type, "--url")
-        cmd.extend([url_flag, url])
+        config[_URL_CONFIG_KEY.get(source_type, "url")] = url
     elif path:
-        path_flag = _PATH_FLAGS.get(source_type, "--path")
-        cmd.extend([path_flag, path])
+        config[_CONFIG_KEY.get(source_type, "path")] = path
     elif url:
         # Allow url fallback for file-based types (some may accept URLs too)
-        url_flag = _URL_FLAGS.get(source_type, "--url")
-        cmd.extend([url_flag, url])
-    cmd.extend(["--name", name])
-    # Set a reasonable timeout
-    timeout = 600  # 10 minutes
+        config[_URL_CONFIG_KEY.get(source_type, "url")] = url
     emoji = _SOURCE_EMOJIS.get(source_type, "🔧")
     progress_msg = f"{emoji} Scraping {source_type} source...\n"
@@ -1148,14 +1106,9 @@ async def scrape_generic_tool(args: dict) -> list[TextContent]:
     progress_msg += f"📁 Path: {path}\n"
     if url:
         progress_msg += f"🔗 URL: {url}\n"
-    progress_msg += f"📛 Name: {name}\n"
-    progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"
+    progress_msg += f"📛 Name: {name}\n\n"
-    stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
+    from skill_seekers.cli.skill_converter import get_converter
-    output = progress_msg + stdout
-    if returncode == 0:
-        return [TextContent(type="text", text=output)]
-    else:
-        return [TextContent(type="text", text=f"{output}\n\n❌ Error:\n{stderr}")]
+    converter = get_converter(source_type, config)
+    return _run_converter(converter, progress_msg)