skill-seekers-reference/CLAUDE.md
Latest commit 6d37e43b83 (yusyus): feat: Grand Unification — one command, one interface, direct converters (#346)
* fix: resolve 8 pipeline bugs found during skill quality review

- Fix 0 APIs extracted from documentation by enriching summary.json
  with individual page file content before conflict detection
- Fix all "Unknown" entries in merged_api.md by injecting dict keys
  as API names and falling back to AI merger field names
- Fix frontmatter using raw slugs instead of config name by
  normalizing frontmatter after SKILL.md generation
- Fix leaked absolute filesystem paths in patterns/index.md by
  stripping .skillseeker-cache repo clone prefixes
- Fix ARCHITECTURE.md file count always showing "1 files" by
  counting files per language from code_analysis data
- Fix YAML parse errors on GitHub Actions workflows by converting
  boolean keys (on: true) to strings
- Fix false React/Vue.js framework detection in C# projects by
  filtering web frameworks based on primary language
- Improve how-to guide generation by broadening workflow example
  filter to include setup/config examples with sufficient complexity
- Fix test_git_sources_e2e failures caused by git init default
  branch being 'main' instead of 'master'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address 6 review issues in ExecutionContext implementation

Fixes from code review:

1. Mode resolution (#3 critical): _args_to_data no longer unconditionally
   overwrites mode. Only writes mode="api" when --api-key explicitly passed.
   Env-var-based mode detection moved to _default_data() as lowest priority.

2. Re-initialization warning (#4): initialize() now logs debug message
   when called a second time instead of silently returning stale instance.

3. _raw_args preserved in override (#5): temp context now copies _raw_args
   from parent so get_raw() works correctly inside override blocks.

4. test_local_mode_detection env cleanup (#7): test now saves/restores
   API key env vars to prevent failures when ANTHROPIC_API_KEY is set.

5. _load_config_file error handling (#8): wraps FileNotFoundError and
   JSONDecodeError with user-friendly ValueError messages.

6. Lint fixes: added logging import, fixed Generator import from
   collections.abc, fixed AgentClient return type annotation.

Remaining P2/P3 items (documented, not blocking):
- Lock TOCTOU in override() — safe on CPython, needs fix for no-GIL
- get() reads _instance without lock — same CPython caveat
- config_path not stored on instance
- AnalysisSettings.depth not Literal constrained

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address all remaining P2/P3 review issues in ExecutionContext

1. Thread safety: get() now acquires _lock before reading _instance (#2)
2. Thread safety: override() saves/restores _initialized flag to prevent
   re-init during override blocks (#10)
3. Config path stored: _config_path PrivateAttr + config_path property (#6)
4. Literal validation: AnalysisSettings.depth now uses
   Literal["surface", "deep", "full"] — rejects invalid values (#9)
5. Test updated: test_analysis_depth_choices now expects ValidationError
   for invalid depth, added test_analysis_depth_valid_choices
6. Lint cleanup: removed unused imports, fixed whitespace in tests

All 10 previously reported issues now resolved.
26 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore 5 truncated scrapers, migrate unified_scraper, fix context init

5 scrapers had main() truncated with "# Original main continues here..."
after Kimi's migration — business logic was never connected:
- html_scraper.py — restored HtmlToSkillConverter extraction + build
- pptx_scraper.py — restored PptxToSkillConverter extraction + build
- confluence_scraper.py — restored ConfluenceToSkillConverter with 3 modes
- notion_scraper.py — restored NotionToSkillConverter with 4 sources
- chat_scraper.py — restored ChatToSkillConverter extraction + build

unified_scraper.py — migrated main() to context-first pattern with argv fallback

Fixed context initialization chain:
- main.py no longer initializes ExecutionContext (was stealing init from commands)
- create_command.py now passes config_path from source_info.parsed
- execution_context.py handles SourceInfo.raw_input (not raw_source)

All 18 scrapers now genuinely migrated. 26 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve 7 data flow conflicts between ExecutionContext and legacy paths

Critical fixes (CLI args silently lost):
- unified_scraper Phase 6: reads ctx.enhancement.level instead of raw JSON
  when args=None (#3, #4)
- unified_scraper Phase 6 agent: reads ctx.enhancement.agent instead of
  3 independent env var lookups (#5)
- doc_scraper._run_enhancement: uses agent_client.api_key instead of raw
  os.environ.get() — respects config file api_key (#1)

Important fixes:
- main._handle_analyze_command: populates _fake_args from ExecutionContext
  so --agent and --api-key aren't lost in analyze→enhance path (#6)
- doc_scraper type annotations: replaced forward refs with Any to avoid
  F821 undefined name errors

All changes include RuntimeError fallback for backward compatibility when
ExecutionContext isn't initialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 3 crashes + 1 stub in migrated scrapers found by deep scan

1. github_scraper.py: args.scrape_only and args.enhance_level crash when
   args=None (context path). Guarded with if args and getattr(). Also
   fixed agent fallback to read ctx.enhancement.agent.

2. codebase_scraper.py: args.output and args.skip_api_reference crash in
   summary block when args=None. Replaced with output_dir local var and
   ctx.analysis.skip_api_reference.

3. epub_scraper.py: main() was still a stub ending with "# Rest of main()
   continues..." — restored full extraction + build + enhancement logic
   using ctx values exclusively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: complete ExecutionContext migration for remaining scrapers

Kimi's Phase 4 scraper migrations + Claude's review fixes.
All 18 scrapers now use context-first pattern with argv fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 1 — ExecutionContext.get() always returns context (no RuntimeError)

get() now returns a default context instead of raising RuntimeError when
not explicitly initialized. This eliminates the need for try/except
RuntimeError blocks in all 18 scrapers.

Components can always call ExecutionContext.get() safely — it returns
defaults if not initialized, or the explicitly initialized instance.

Updated tests: test_get_returns_defaults_when_not_initialized,
test_reset_clears_instance (no longer expects RuntimeError).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Phase 2a-c — remove 16 individual scraper CLI commands

Removed individual scraper commands from:
- COMMAND_MODULES in main.py (16 entries: scrape, github, pdf, word,
  epub, video, jupyter, html, openapi, asciidoc, pptx, rss, manpage,
  confluence, notion, chat)
- pyproject.toml entry points (16 skill-seekers-<type> binaries)
- parsers/__init__.py (16 parser registrations)

All source types now accessed via: skill-seekers create <source>
Kept: create, unified, analyze, enhance, package, upload, install,
      install-agent, config, doctor, and utility commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: create SkillConverter base class + converter registry

New base interface that all 17 converters will inherit:
- SkillConverter.run() — extract + build (same call for all types)
- SkillConverter.extract() — override in subclass
- SkillConverter.build_skill() — override in subclass
- get_converter(source_type, config) — factory from registry
- CONVERTER_REGISTRY — maps source type → (module, class)

create_command will use get_converter() instead of _call_module().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Grand Unification — one command, one interface, direct converters

Complete the Grand Unification refactor: `skill-seekers create` is now
the single entry point for all 18 source types. Individual scraper CLI
commands (scrape, github, pdf, analyze, unified, etc.) are removed.

## Architecture changes

- **18 SkillConverter subclasses**: Every scraper now inherits SkillConverter
  with extract() + build_skill() + SOURCE_TYPE. Factory via get_converter().
- **create_command.py rewritten**: _build_config() constructs config dicts
  from ExecutionContext for each source type. Direct converter.run() calls
  replace the old _build_argv() + sys.argv swap + _call_module() machinery.
- **main.py simplified**: create command bypasses _reconstruct_argv entirely,
  calls CreateCommand(args).execute() directly. analyze/unified commands
  removed (create handles both via auto-detection).
- **CreateParser mode="all"**: Top-level parser now accepts all 120+ flags
  (--browser, --max-pages, --depth, etc.) since create is the only entry.
- **Centralized enhancement**: Runs once in create_command after converter,
  not duplicated in each scraper.
- **MCP tools use converters**: 5 scraping tools call get_converter()
  directly instead of subprocess. Config type auto-detected from keys.
- **ConfigValidator → UniSkillConfigValidator**: Renamed with backward-
  compat alias.
- **Data flow**: AgentClient + LocalSkillEnhancer read ExecutionContext
  first, env vars as fallback.

## What was removed

- main() from all 18 scraper files (~3400 lines)
- 18 CLI commands from COMMAND_MODULES + pyproject.toml entry points
- analyze + unified parsers from parser registry
- _build_argv, _call_module, _SKIP_ARGS, _DEST_TO_FLAG, all _route_*()
- setup_argument_parser, get_configuration, _check_deprecated_flags
- Tests referencing removed commands/functions

## Net impact

51 files changed, ~6000 lines removed. 2996 tests pass, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review fixes for Grand Unification PR

- Add autouse conftest fixture to reset ExecutionContext singleton between tests
- Replace hardcoded defaults in _is_explicitly_set() with parser-derived defaults
- Upgrade ExecutionContext double-init log from debug to info
- Use logger.exception() in SkillConverter.run() to preserve tracebacks
- Fix docstring "17 types" → "18 types" in skill_converter.py
- DRY up 10 copy-paste help handlers into dict + loop (~100 lines removed)
- Fix 2 CI workflows still referencing removed `skill-seekers scrape` command
- Remove broken pyproject.toml entry point for codebase_scraper:main

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve 12 logic/flow issues found in deep review

Critical fixes:
- UnifiedScraper.run(): replace sys.exit(1) with return 1, add return 0
- doc_scraper: use ExecutionContext.get() when already initialized instead
  of re-calling initialize() which silently discards new config
- unified_scraper: define enhancement_config before try/except to prevent
  UnboundLocalError in LOCAL enhancement timeout read

Important fixes:
- override(): cleaner tuple save/restore for singleton swap
- --agent without --api-key now sets mode="local" so env API key doesn't
  override explicit agent choice
- Remove DeprecationWarning from _reconstruct_argv (fires on every
  non-create command in production)
- Rewrite scrape_generic_tool to use get_converter() instead of subprocess
  calls to removed main() functions
- SkillConverter.run() checks build_skill() return value, returns 1 if False
- estimate_pages_tool uses -m module invocation instead of .py file path

Low-priority fixes:
- get_converter() raises descriptive ValueError on class name typo
- test_default_values: save/clear API key env vars before asserting mode
- test_get_converter_pdf: fix config key "path" → "pdf_path"

3056 passed, 4 failed (pre-existing dep version issues), 32 skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update MCP server tests to mock converter instead of subprocess

scrape_docs_tool now uses get_converter() + _run_converter() in-process
instead of run_subprocess_with_streaming. Update 4 TestScrapeDocsTool
tests to mock the converter layer instead of the removed subprocess path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: YusufKaraaslanSpyke <yusuf@spykegames.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
**Skill Seekers** converts documentation from 18 source types into production-ready formats for 24+ AI platforms (LLM platforms, RAG frameworks, vector databases, AI coding assistants). Published on PyPI as `skill-seekers`.
**Version:** 3.4.0 | **Python:** 3.10+ | **Website:** https://skillseekersweb.com/
**Architecture:** See `docs/UML_ARCHITECTURE.md` for UML diagrams and module overview. StarUML project at `docs/UML/skill_seekers.mdj`.
## Essential Commands
```bash
# REQUIRED before running tests or CLI (src/ layout)
pip install -e .
# Run all tests (NEVER skip - all must pass before commits)
pytest tests/ -v
# Fast iteration (skip slow MCP tests ~20min)
pytest tests/ --ignore=tests/test_mcp_fastmcp.py --ignore=tests/test_mcp_server.py --ignore=tests/test_install_skill_e2e.py -q
# Single test
pytest tests/test_scraper_features.py::test_detect_language -vv -s
# Code quality (must pass before push - matches CI)
uvx ruff check src/ tests/
uvx ruff format --check src/ tests/
mypy src/skill_seekers # continue-on-error in CI
# Auto-fix lint/format issues
uvx ruff check --fix --unsafe-fixes src/ tests/
uvx ruff format src/ tests/
# Build & publish
uv build
uv publish
```
## CI Matrix
Runs on push/PR to `main` or `development`. Lint job (Python 3.12, Ubuntu) + Test job (Ubuntu + macOS, Python 3.10/3.11/3.12, excludes macOS+3.10). Both must pass for merge.
## Git Workflow
- **Main branch:** `main` (requires tests + 1 review)
- **Development branch:** `development` (default PR target, requires tests)
- **Feature branches:** `feature/{task-id}-{description}` from `development`
- PRs always target `development`, never `main` directly
## Architecture
### CLI: Unified create command
Entry point `src/skill_seekers/cli/main.py`. The `create` command is the **only** entry point for skill creation — it auto-detects source type and routes to the appropriate `SkillConverter`.
```
skill-seekers create <source> # Auto-detect: URL, owner/repo, ./path, file.pdf, etc.
skill-seekers package <dir> # Package for platform (--target claude/gemini/openai/markdown/minimax/opencode/kimi/deepseek/qwen/openrouter/together/fireworks, --format langchain/llama-index/haystack/chroma/faiss/weaviate/qdrant/pinecone)
```
### SkillConverter Pattern (Template Method + Factory)
All 18 source types implement the `SkillConverter` base class (`skill_converter.py`):
```python
converter = get_converter("web", config) # Factory lookup
converter.run() # Template: extract() → build_skill()
```
Registry in `CONVERTER_REGISTRY` maps source type → (module, class). `create_command.py` builds config from `ExecutionContext`, calls `get_converter()`, then runs centralized enhancement.
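A minimal sketch of the registry + factory wiring (entries abridged; module paths and the `DocToSkillConverter` name below are assumptions — the real entries live in `skill_converter.py`):
```python
import importlib

# Illustrative shape only — the real registry has one entry per source type (18 total).
CONVERTER_REGISTRY = {
    "web": ("skill_seekers.cli.doc_scraper", "DocToSkillConverter"),
    "html": ("skill_seekers.cli.html_scraper", "HtmlToSkillConverter"),
    "pptx": ("skill_seekers.cli.pptx_scraper", "PptxToSkillConverter"),
    # ...
}

def get_converter(source_type: str, config: dict):
    try:
        module_name, class_name = CONVERTER_REGISTRY[source_type]
    except KeyError as exc:
        raise ValueError(f"Unknown source type: {source_type!r}") from exc
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(config)
```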
### Data Flow (5 phases)
1. **Scrape** - Source-specific scraper extracts content to `output/{name}_data/pages/*.json`
2. **Build** - `build_skill()` categorizes pages, extracts patterns, generates `output/{name}/SKILL.md`
3. **Enhance** (optional) - LLM rewrites SKILL.md (`--enhance-level 0-3`, auto-detects API vs LOCAL mode)
4. **Package** - Platform adaptor formats output (`.zip`, `.tar.gz`, JSON, vector index)
5. **Upload** (optional) - Platform API upload (see the command sketch after this list)
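Phases 1-3 run inside `create`; packaging and upload are separate commands. A hedged end-to-end example (output paths and the `upload` arguments are illustrative):
```bash
skill-seekers create https://docs.example.com --enhance-level 2   # phases 1-3: scrape, build, enhance
skill-seekers package output/example --target claude              # phase 4: package for a platform
skill-seekers upload output/example.zip                           # phase 5 (optional); arguments illustrative
```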
### Platform Adaptor Pattern (Strategy + Factory)
Factory: `get_adaptor(platform, config)` in `adaptors/__init__.py` returns a `SkillAdaptor` instance. Base class `SkillAdaptor` + `SkillMetadata` in `adaptors/base.py`.
```
src/skill_seekers/cli/adaptors/
├── __init__.py # Factory: get_adaptor(platform, config), ADAPTORS registry
├── base.py # Abstract base: SkillAdaptor, SkillMetadata
├── openai_compatible.py # Shared base for OpenAI-compatible platforms
├── claude.py # --target claude
├── gemini.py # --target gemini
├── openai.py # --target openai
├── markdown.py # --target markdown
├── minimax.py # --target minimax
├── opencode.py # --target opencode
├── kimi.py # --target kimi
├── deepseek.py # --target deepseek
├── qwen.py # --target qwen
├── openrouter.py # --target openrouter
├── together.py # --target together
├── fireworks.py # --target fireworks
├── langchain.py # --format langchain
├── llama_index.py # --format llama-index
├── haystack.py # --format haystack
├── chroma.py # --format chroma
├── faiss_helpers.py # --format faiss
├── qdrant.py # --format qdrant
├── weaviate.py # --format weaviate
├── pinecone_adaptor.py # --format pinecone
└── streaming_adaptor.py # --format streaming
```
`--target` = LLM platforms, `--format` = RAG/vector DBs. All adaptors are imported with `try/except ImportError` so missing optional deps don't break the registry.
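The registration idiom looks roughly like this sketch (the `ChromaAdaptor` class name is an assumption):
```python
# adaptors/__init__.py — sketch of the optional-import registration idiom.
ADAPTORS = {}

try:
    from .chroma import ChromaAdaptor  # class name assumed for illustration
    ADAPTORS["chroma"] = ChromaAdaptor
except ImportError:
    pass  # chromadb not installed: the adaptor is simply absent from the registry

def get_adaptor(platform: str, config: dict):
    try:
        return ADAPTORS[platform](config)
    except KeyError as exc:
        raise ValueError(
            f"Adaptor {platform!r} is unavailable — is its optional dependency installed?"
        ) from exc
```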
### 18 Source Type Converters
Each lives in `src/skill_seekers/cli/{type}_scraper.py` as a `SkillConverter` subclass (no `main()`). `create_command.py` uses `source_detector.py` to auto-detect the source type, then calls `get_converter()`. Converters: web (doc_scraper), github, pdf, word, epub, video, local (codebase_scraper), jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat, config (unified_scraper).
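The detection heuristics are roughly of this shape (rules and return values are illustrative; the real logic is in `source_detector.py`):
```python
from pathlib import Path

def detect_source_type(source: str) -> str:
    # Illustrative rules only — see source_detector.py for the real ones.
    if source.startswith(("http://", "https://")):
        return "web"                      # documentation site URL
    if source.lower().endswith(".pdf"):
        return "pdf"
    path = Path(source)
    if path.is_dir():
        return "local"                    # local codebase
    if "/" in source and not path.exists():
        return "github"                   # looks like owner/repo
    raise ValueError(f"Could not detect source type for {source!r}")
```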
### CLI Argument System
```
src/skill_seekers/cli/
├── parsers/ # Subcommand parser registration
│ └── create_parser.py # Progressive help disclosure (--help-web, --help-github, etc.)
├── arguments/ # Argument definitions
│ ├── common.py # add_all_standard_arguments() - shared across all scrapers
│ └── create.py # UNIVERSAL_ARGUMENTS, WEB_ARGUMENTS, GITHUB_ARGUMENTS, etc.
└── source_detector.py # Auto-detect source type from input string
```
### C3.x Codebase Analysis Pipeline
Local codebase analysis features, all opt-out (`--skip-*` flags):
- C3.1 `pattern_recognizer.py` - Design pattern detection (10 GoF patterns, 9 languages)
- C3.2 `test_example_extractor.py` - Usage examples from tests
- C3.3 `how_to_guide_builder.py` - AI-enhanced educational guides
- C3.4 `config_extractor.py` - Configuration pattern extraction
- C3.5 `generate_router.py` - Architecture overview generation
- C3.10 `signal_flow_analyzer.py` - Godot signal flow analysis
### MCP Server
`src/skill_seekers/mcp/server_fastmcp.py` - 40 tools via FastMCP. Transport: stdio (Claude Code) or HTTP (Cursor/Windsurf). Optional dependency: `pip install -e ".[mcp]"`
Supporting modules:
- `marketplace_publisher.py` - Publish skills to plugin marketplace repositories
- `marketplace_manager.py` - Manage marketplace registry
- `config_publisher.py` - Push configs to registered config source repositories
### Enhancement Modes (via AgentClient)
Enhancement now uses the `AgentClient` abstraction (`src/skill_seekers/cli/agent_client.py`) instead of direct Claude API calls:
- **API mode** (if API key set): Supports Anthropic, Moonshot/Kimi, Google Gemini, OpenAI
- **LOCAL mode** (fallback): Supports Claude Code, Kimi Code, Codex, Copilot, OpenCode, custom agents
- Control: `--enhance-level 0` (off) / `1` (SKILL.md only) / `2` (default, balanced) / `3` (full)
- Agent selection: `--agent claude|codex|copilot|opencode|kimi|custom` (example below)
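For example (the source URL is a placeholder):
```bash
# API mode is picked automatically when a provider API key is set; otherwise LOCAL mode
skill-seekers create https://docs.example.com --enhance-level 2 --agent claude
# Disable enhancement entirely
skill-seekers create https://docs.example.com --enhance-level 0
```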
## Key Implementation Details
### Smart Categorization (`doc_scraper.py:smart_categorize()`)
Scores each page against category keywords: 3 points for a URL match, 2 for a title match, 1 for a content match. A page is assigned to the best-scoring category with a score of 2+; otherwise it falls back to "other".
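In sketch form (keyword matching and score accumulation details differ in the real code):
```python
def categorize(page: dict, categories: dict[str, list[str]]) -> str:
    # page has "url", "title", "content"; categories maps name -> keyword list.
    best_name, best_score = "other", 0
    for name, keywords in categories.items():
        score = 0
        for kw in keywords:
            if kw in page["url"].lower():
                score += 3
            if kw in page["title"].lower():
                score += 2
            if kw in page["content"].lower():
                score += 1
        if score >= 2 and score > best_score:   # threshold of 2+, best match wins
            best_name, best_score = name, score
    return best_name
```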
### Content Extraction (`doc_scraper.py`)
`FALLBACK_MAIN_SELECTORS` constant + `_find_main_content()` helper handle CSS selector fallback. Links are extracted from the full page before the early return (not just from the main content). `body` is deliberately excluded from the fallbacks.
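The fallback idea, sketched with BeautifulSoup (the library choice and selector values here are assumptions; the real list is the `FALLBACK_MAIN_SELECTORS` constant):
```python
from bs4 import BeautifulSoup

# Illustrative values — the real list lives in doc_scraper.py.
FALLBACK_MAIN_SELECTORS = ["main", "article", "[role='main']", ".content"]  # "body" intentionally absent

def _find_main_content(soup: BeautifulSoup):
    for selector in FALLBACK_MAIN_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node
    return None  # caller handles the no-match case
```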
### Three-Stream GitHub Architecture (`unified_codebase_analyzer.py`)
Stream 1: Code Analysis (AST, patterns, tests, guides). Stream 2: Documentation (README, docs/, wiki). Stream 3: Community (issues, PRs, metadata). Depth control: `basic` (1-2 min) or `c3x` (20-60 min).
## Testing
### Test markers (pytest.ini)
```bash
pytest tests/ -v # Default: fast tests only
pytest tests/ -v -m slow # Include slow tests (>5s)
pytest tests/ -v -m integration # External services required
pytest tests/ -v -m e2e # Resource-intensive
pytest tests/ -v -m "not slow and not integration" # Fastest subset
```
### Known legitimate skips (~11)
- 2: chromadb incompatible with Python 3.14 (pydantic v1)
- 2: weaviate-client not installed
- 2: Qdrant not running (requires docker)
- 2: langchain/llama_index not installed
- 3: GITHUB_TOKEN not set
### sys.modules gotcha
`test_swift_detection.py` deletes `skill_seekers.cli` modules from `sys.modules`. It must save and restore both `sys.modules` entries AND parent package attributes (`setattr`). See the test file for the pattern.
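The save/restore shape, simplified:
```python
import sys
import skill_seekers

# Snapshot everything the test is about to delete.
saved = {k: v for k, v in sys.modules.items() if k.startswith("skill_seekers.cli")}
saved_cli_attr = getattr(skill_seekers, "cli", None)
try:
    for name in saved:
        del sys.modules[name]
    ...  # re-import and exercise the code under test
finally:
    sys.modules.update(saved)                          # restore module cache
    if saved_cli_attr is not None:
        setattr(skill_seekers, "cli", saved_cli_attr)  # restore parent package attribute
```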
## Dependencies
Core deps include `langchain`, `llama-index`, `anthropic`, `httpx`, `PyMuPDF`, `pydantic`. Platform-specific deps are optional:
```bash
pip install -e ".[mcp]" # MCP server
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[openai]" # OpenAI
pip install -e ".[docx]" # Word documents
pip install -e ".[epub]" # EPUB books
pip install -e ".[video]" # Video (lightweight)
pip install -e ".[video-full]"# Video (Whisper + visual)
pip install -e ".[jupyter]" # Jupyter notebooks
pip install -e ".[pptx]" # PowerPoint
pip install -e ".[rss]" # RSS/Atom feeds
pip install -e ".[confluence]"# Confluence wiki
pip install -e ".[notion]" # Notion pages
pip install -e ".[chroma]" # ChromaDB
pip install -e ".[all]" # Everything (except video-full)
```
Dev dependencies use PEP 735 `[dependency-groups]` in pyproject.toml.
## Environment Variables
```bash
ANTHROPIC_API_KEY=sk-ant-... # Claude AI (or compatible endpoint)
ANTHROPIC_BASE_URL=https://... # Optional: Claude-compatible API endpoint
GOOGLE_API_KEY=AIza... # Google Gemini (optional)
OPENAI_API_KEY=sk-... # OpenAI (optional)
GITHUB_TOKEN=ghp_... # Higher GitHub rate limits
```
## Adding New Features
### New platform adaptor
1. Create `src/skill_seekers/cli/adaptors/{platform}.py` inheriting `SkillAdaptor` from `base.py` (see the sketch after this list)
2. Register in `adaptors/__init__.py` (add try/except import + add to `ADAPTORS` dict)
3. Add optional dep to `pyproject.toml`
4. Add tests in `tests/`
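A minimal sketch of steps 1-2; the `package()` hook name is an assumption — check `adaptors/base.py` for the actual abstract methods:
```python
# src/skill_seekers/cli/adaptors/myplatform.py — illustrative skeleton only.
from .base import SkillAdaptor, SkillMetadata

class MyPlatformAdaptor(SkillAdaptor):
    """Hypothetical adaptor for --target myplatform."""

    def package(self, skill_dir: str, metadata: SkillMetadata) -> str:
        # Assumed hook: format the built skill for the platform and return the artifact path.
        raise NotImplementedError
```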
### New source type converter
1. Create `src/skill_seekers/cli/{type}_scraper.py` with a class inheriting `SkillConverter`
2. Implement `extract()` and `build_skill()` methods, set `SOURCE_TYPE` (sketched after this list)
3. Register in `CONVERTER_REGISTRY` in `skill_converter.py`
4. Add source type config building in `create_command.py:_build_config()`
5. Add auto-detection in `source_detector.py`
6. Add optional dep if needed
7. Add tests
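A minimal sketch of steps 1-3 for a hypothetical "foo" source type (internals and the import path are illustrative; the class contract — `SOURCE_TYPE`, `extract()`, `build_skill()` — is described above):
```python
# src/skill_seekers/cli/foo_scraper.py — skeleton only.
from skill_seekers.cli.skill_converter import SkillConverter

class FooToSkillConverter(SkillConverter):
    SOURCE_TYPE = "foo"

    def extract(self) -> None:
        # Pull raw content from the source into output/{name}_data/pages/*.json
        ...

    def build_skill(self) -> bool:
        # Categorize pages, extract patterns, write output/{name}/SKILL.md
        return True  # run() treats a falsy return as failure
```
Then add an entry such as `"foo": ("skill_seekers.cli.foo_scraper", "FooToSkillConverter")` to `CONVERTER_REGISTRY`.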
### New CLI argument
- Universal: `UNIVERSAL_ARGUMENTS` in `arguments/create.py`
- Source-specific: appropriate dict (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, etc.)
- Shared across scrapers: `add_all_standard_arguments()` in `arguments/common.py`