skill-seekers-reference/CLAUDE.md
yusyus 8152045e38 chore: consolidate Docs/ into docs/ (single documentation directory)
Move UML/ directory and Architecture.md from Docs/ to docs/.
Rename Architecture.md to UML_ARCHITECTURE.md to avoid collision
with existing docs/ARCHITECTURE.md (docs organization file).

Update all references in README.md, CONTRIBUTING.md, CLAUDE.md,
and the architecture file itself.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 20:02:53 +03:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
**Skill Seekers** converts documentation from 17 source types into production-ready formats for 24+ AI platforms (LLM platforms, RAG frameworks, vector databases, AI coding assistants). Published on PyPI as `skill-seekers`.
**Version:** 3.3.0 | **Python:** 3.10+ | **Website:** https://skillseekersweb.com/
**Architecture:** See `docs/UML_ARCHITECTURE.md` for UML diagrams and module overview. StarUML project at `docs/UML/skill_seekers.mdj`.
## Essential Commands
```bash
# REQUIRED before running tests or CLI (src/ layout)
pip install -e .
# Run all tests (NEVER skip - all must pass before commits)
pytest tests/ -v
# Fast iteration (skip slow MCP tests ~20min)
pytest tests/ --ignore=tests/test_mcp_fastmcp.py --ignore=tests/test_mcp_server.py --ignore=tests/test_install_skill_e2e.py -q
# Single test
pytest tests/test_scraper_features.py::test_detect_language -vv -s
# Code quality (must pass before push - matches CI)
uvx ruff check src/ tests/
uvx ruff format --check src/ tests/
mypy src/skill_seekers # continue-on-error in CI
# Auto-fix lint/format issues
uvx ruff check --fix --unsafe-fixes src/ tests/
uvx ruff format src/ tests/
# Build & publish
uv build
uv publish
```
## CI Matrix
Runs on push/PR to `main` or `development`. Lint job (Python 3.12, Ubuntu) + Test job (Ubuntu + macOS, Python 3.10/3.11/3.12, excludes macOS+3.10). Both must pass for merge.
## Git Workflow
- **Main branch:** `main` (requires tests + 1 review)
- **Development branch:** `development` (default PR target, requires tests)
- **Feature branches:** `feature/{task-id}-{description}` from `development`
- PRs always target `development`, never `main` directly
## Architecture
### CLI: Git-style dispatcher
Entry point `src/skill_seekers/cli/main.py` maps subcommands to modules. The `create` command auto-detects source type and is the recommended entry point for users.
```
skill-seekers create <source> # Auto-detect: URL, owner/repo, ./path, file.pdf, etc.
skill-seekers <type> [options] # Direct: scrape, github, pdf, word, epub, video, jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat
skill-seekers analyze <dir> # Analyze local codebase (C3.x pipeline)
skill-seekers package <dir> # Package for platform (--target claude/gemini/openai/markdown/minimax/opencode/kimi/deepseek/qwen/openrouter/together/fireworks, --format langchain/llama-index/haystack/chroma/faiss/weaviate/qdrant/pinecone)
```
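A minimal sketch of the Git-style dispatch described above, assuming a `COMMAND_MODULES` dict that maps each subcommand to a module exposing `main()`. The module paths, `resolve()`, and `dispatch()` are illustrative assumptions and do not reproduce `cli/main.py`.

```python
import importlib
import sys

# Hypothetical mapping; the real COMMAND_MODULES in cli/main.py may differ.
COMMAND_MODULES = {
    "scrape": "skill_seekers.cli.doc_scraper",
    "github": "skill_seekers.cli.github_scraper",
}


def resolve(argv):
    """Return (module_path, remaining_args) for the subcommand in argv[0]."""
    if not argv or argv[0] not in COMMAND_MODULES:
        known = ", ".join(sorted(COMMAND_MODULES))
        raise SystemExit(f"usage: skill-seekers <command> [options] (known: {known})")
    return COMMAND_MODULES[argv[0]], argv[1:]


def dispatch(argv):
    """Import the target module lazily and hand the remaining args to its main()."""
    module_path, rest = resolve(argv)
    module = importlib.import_module(module_path)
    sys.argv = [f"skill-seekers {argv[0]}", *rest]  # main() parses sys.argv itself
    return module.main()
```

Importing the module only after the subcommand is resolved keeps startup cheap and means a scraper with a missing optional dependency fails only when it is actually invoked.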
### Data Flow (5 phases)
1. **Scrape** - Source-specific scraper extracts content to `output/{name}_data/pages/*.json`
2. **Build** - `build_skill()` categorizes pages, extracts patterns, generates `output/{name}/SKILL.md`
3. **Enhance** (optional) - LLM rewrites SKILL.md (`--enhance-level 0-3`, auto-detects API vs LOCAL mode)
4. **Package** - Platform adaptor formats output (`.zip`, `.tar.gz`, JSON, vector index)
5. **Upload** (optional) - Platform API upload
### Platform Adaptor Pattern (Strategy + Factory)
Factory: `get_adaptor(platform, config)` in `adaptors/__init__.py` returns a `SkillAdaptor` instance. Base class `SkillAdaptor` + `SkillMetadata` in `adaptors/base.py`.
```
src/skill_seekers/cli/adaptors/
├── __init__.py # Factory: get_adaptor(platform, config), ADAPTORS registry
├── base.py # Abstract base: SkillAdaptor, SkillMetadata
├── openai_compatible.py # Shared base for OpenAI-compatible platforms
├── claude.py # --target claude
├── gemini.py # --target gemini
├── openai.py # --target openai
├── markdown.py # --target markdown
├── minimax.py # --target minimax
├── opencode.py # --target opencode
├── kimi.py # --target kimi
├── deepseek.py # --target deepseek
├── qwen.py # --target qwen
├── openrouter.py # --target openrouter
├── together.py # --target together
├── fireworks.py # --target fireworks
├── langchain.py # --format langchain
├── llama_index.py # --format llama-index
├── haystack.py # --format haystack
├── chroma.py # --format chroma
├── faiss_helpers.py # --format faiss
├── qdrant.py # --format qdrant
├── weaviate.py # --format weaviate
├── pinecone_adaptor.py # --format pinecone
└── streaming_adaptor.py # --format streaming
```
`--target` = LLM platforms, `--format` = RAG/vector DBs. All adaptors are imported with `try/except ImportError` so missing optional deps don't break the registry.
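The registry-plus-factory arrangement can be sketched as follows. This is an illustration under assumptions, not the code in `adaptors/__init__.py`: the `package()` method, the two adaptor classes, and the `hypothetical_vector_sdk` dependency are made up for the example.

```python
from abc import ABC, abstractmethod


class SkillAdaptor(ABC):
    """Stand-in for the abstract base in adaptors/base.py (method name assumed)."""

    def __init__(self, config=None):
        self.config = config or {}

    @abstractmethod
    def package(self, skill_dir: str) -> str:
        """Return the path of the packaged artifact."""


class MarkdownAdaptor(SkillAdaptor):
    def package(self, skill_dir):
        return f"{skill_dir}.zip"


ADAPTORS = {"markdown": MarkdownAdaptor}

# An optional adaptor registers itself only if its dependency imports cleanly.
try:
    import hypothetical_vector_sdk  # noqa: F401  (placeholder optional dependency)

    class VectorAdaptor(SkillAdaptor):
        def package(self, skill_dir):
            return f"{skill_dir}.index"

    ADAPTORS["vector"] = VectorAdaptor
except ImportError:
    pass  # the registry stays usable without the optional dependency


def get_adaptor(platform, config=None):
    """Factory: look up the adaptor class and instantiate it with the config."""
    try:
        adaptor_cls = ADAPTORS[platform]
    except KeyError:
        raise ValueError(
            f"Unknown platform {platform!r}; available: {sorted(ADAPTORS)}"
        ) from None
    return adaptor_cls(config)
```

With this shape, a missing optional package simply leaves its key out of `ADAPTORS`, so the error surfaces as an unknown-platform message at lookup time rather than as an import failure at startup.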
### 17 Source Type Scrapers
Each in `src/skill_seekers/cli/{type}_scraper.py` with a `main()` entry point. The `create_command.py` uses `source_detector.py` to auto-route. New scrapers added in v3.2.0+: jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat.
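To illustrate the kind of auto-routing `create` performs, here is a rough sketch based only on the input shapes listed in the CLI usage (URL, `owner/repo`, `./path`, `file.pdf`). The rule order, the extension table, and the function name are assumptions; the real `source_detector.py` may work differently.

```python
import re
from pathlib import PurePosixPath

# Assumed extension-to-subcommand table; only a few types are shown.
EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".docx": "word",
    ".epub": "epub",
    ".ipynb": "jupyter",
    ".pptx": "pptx",
}


def detect_source_type(source: str) -> str:
    """Guess which subcommand should handle the given input string."""
    if re.match(r"^https?://", source):
        return "scrape"  # any web URL goes to the documentation scraper
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in EXTENSION_TYPES:
        return EXTENSION_TYPES[suffix]  # known file extension
    if source.startswith(("./", "../", "/")):
        return "analyze"  # explicit filesystem path: local codebase
    if re.fullmatch(r"[\w.-]+/[\w.-]+", source):
        return "github"  # bare owner/repo shorthand
    raise ValueError(f"Cannot detect source type for {source!r}")
```

Checking the URL scheme first keeps a link such as `https://example.com/guide.pdf` on the web scraper, and checking explicit path prefixes before the `owner/repo` pattern avoids mistaking `./src` for a repository.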
### CLI Argument System
```
src/skill_seekers/cli/
├── parsers/ # Subcommand parser registration
│ └── create_parser.py # Progressive help disclosure (--help-web, --help-github, etc.)
├── arguments/ # Argument definitions
│ ├── common.py # add_all_standard_arguments() - shared across all scrapers
│ └── create.py # UNIVERSAL_ARGUMENTS, WEB_ARGUMENTS, GITHUB_ARGUMENTS, etc.
└── source_detector.py # Auto-detect source type from input string
```
### C3.x Codebase Analysis Pipeline
Local codebase analysis features, all opt-out (`--skip-*` flags):
- C3.1 `pattern_recognizer.py` - Design pattern detection (10 GoF patterns, 9 languages)
- C3.2 `test_example_extractor.py` - Usage examples from tests
- C3.3 `how_to_guide_builder.py` - AI-enhanced educational guides
- C3.4 `config_extractor.py` - Configuration pattern extraction
- C3.5 `generate_router.py` - Architecture overview generation
- C3.10 `signal_flow_analyzer.py` - Godot signal flow analysis
### MCP Server
`src/skill_seekers/mcp/server_fastmcp.py` - 26+ tools via FastMCP. Transport: stdio (Claude Code) or HTTP (Cursor/Windsurf). Optional dependency: `pip install -e ".[mcp]"`
### Enhancement Modes
- **API mode** (if `ANTHROPIC_API_KEY` set): Direct Claude API calls
- **LOCAL mode** (fallback): Uses Claude Code CLI (free with Max plan)
- Control: `--enhance-level 0` (off) / `1` (SKILL.md only) / `2` (default, balanced) / `3` (full)
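The mode selection above amounts to a small decision function. The sketch below assumes the only inputs are the enhance level and the environment; `pick_enhance_mode` is a hypothetical name, not a function from the codebase.

```python
import os


def pick_enhance_mode(level: int, env=None) -> str:
    """Return "off", "api", or "local" for a given --enhance-level value."""
    env = os.environ if env is None else env
    if level not in (0, 1, 2, 3):
        raise ValueError("--enhance-level must be between 0 and 3")
    if level == 0:
        return "off"  # enhancement disabled
    # API mode when a key is present, otherwise fall back to the local CLI.
    return "api" if env.get("ANTHROPIC_API_KEY") else "local"
```

Passing `env` explicitly makes the fallback easy to test without touching the real environment.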
## Key Implementation Details
### Smart Categorization (`doc_scraper.py:smart_categorize()`)
Scores pages against category keywords: 3 points for URL match, 2 for title, 1 for content. Threshold of 2+ required. Falls back to "other".
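A simplified sketch of this scoring rule, assuming pages are dicts with `url`, `title`, and `content` keys, that points accumulate per matching keyword, and that the highest-scoring category wins. The function name and these details are assumptions; see `smart_categorize()` for the actual behavior.

```python
def categorize_pages(pages, categories, threshold=2):
    """Group pages by best-scoring category; weak matches go to "other"."""
    result = {name: [] for name in categories}
    result["other"] = []
    for page in pages:
        url = page.get("url", "").lower()
        title = page.get("title", "").lower()
        content = page.get("content", "").lower()
        best_name, best_score = "other", 0
        for name, keywords in categories.items():
            score = 0
            for keyword in keywords:
                keyword = keyword.lower()
                score += 3 if keyword in url else 0      # URL match
                score += 2 if keyword in title else 0    # title match
                score += 1 if keyword in content else 0  # content match
            if score > best_score:
                best_name, best_score = name, score
        if best_score < threshold:
            best_name = "other"  # below the 2-point threshold
        result[best_name].append(page)
    return result
```

Under these weights a single content-only mention (1 point) never categorizes a page, while one title or URL match is enough.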
### Content Extraction (`doc_scraper.py`)
`FALLBACK_MAIN_SELECTORS` constant + `_find_main_content()` helper handle CSS selector fallback. Links are extracted from the full page before early return (not just main content). `body` is deliberately excluded from fallbacks.
### Three-Stream GitHub Architecture (`unified_codebase_analyzer.py`)
Stream 1: Code Analysis (AST, patterns, tests, guides). Stream 2: Documentation (README, docs/, wiki). Stream 3: Community (issues, PRs, metadata). Depth control: `basic` (1-2 min) or `c3x` (20-60 min).
## Testing
### Test markers (pytest.ini)
```bash
pytest tests/ -v # Default: fast tests only
pytest tests/ -v -m slow # Only tests marked slow (>5s)
pytest tests/ -v -m integration # External services required
pytest tests/ -v -m e2e # Resource-intensive
pytest tests/ -v -m "not slow and not integration" # Fastest subset
```
### Known legitimate skips (~11)
- 2: chromadb incompatible with Python 3.14 (pydantic v1)
- 2: weaviate-client not installed
- 2: Qdrant not running (requires docker)
- 2: langchain/llama_index not installed
- 3: GITHUB_TOKEN not set
### sys.modules gotcha
`test_swift_detection.py` deletes `skill_seekers.cli` modules from `sys.modules`. It must save and restore both `sys.modules` entries AND parent package attributes (`setattr`). See the test file for the pattern.
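One way to express the save-and-restore pattern is a context manager like the sketch below. It is an illustration, not the code in `test_swift_detection.py`; `preserved_modules` is a hypothetical helper name.

```python
import contextlib
import sys


def _matches(name, prefix):
    return name == prefix or name.startswith(prefix + ".")


@contextlib.contextmanager
def preserved_modules(prefix):
    """Restore sys.modules entries and parent attributes for `prefix` on exit."""
    saved = {name: mod for name, mod in sys.modules.items() if _matches(name, prefix)}
    try:
        yield
    finally:
        # Drop anything imported or re-imported inside the block...
        for name in [n for n in sys.modules if _matches(n, prefix)]:
            del sys.modules[name]
        # ...then put the original module objects back.
        sys.modules.update(saved)
        # Re-point each parent package attribute at the original submodule,
        # since a re-import inside the block rebinds it to the new object.
        for name, module in saved.items():
            parent_name, _, child = name.rpartition(".")
            parent = sys.modules.get(parent_name) if parent_name else None
            if parent is not None:
                setattr(parent, child, module)
```

Restoring only `sys.modules` is not enough: code that reaches a submodule through its parent package (for example via `unittest.mock.patch("pkg.sub.func")`) would still see the replacement object, which is why the `setattr` step matters.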
## Dependencies
Core deps include `langchain`, `llama-index`, `anthropic`, `httpx`, `PyMuPDF`, `pydantic`. Platform-specific deps are optional:
```bash
pip install -e ".[mcp]" # MCP server
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[openai]" # OpenAI
pip install -e ".[docx]" # Word documents
pip install -e ".[epub]" # EPUB books
pip install -e ".[video]" # Video (lightweight)
pip install -e ".[video-full]" # Video (Whisper + visual)
pip install -e ".[jupyter]" # Jupyter notebooks
pip install -e ".[pptx]" # PowerPoint
pip install -e ".[rss]" # RSS/Atom feeds
pip install -e ".[confluence]" # Confluence wiki
pip install -e ".[notion]" # Notion pages
pip install -e ".[chroma]" # ChromaDB
pip install -e ".[all]" # Everything (except video-full)
```
Dev dependencies use PEP 735 `[dependency-groups]` in pyproject.toml.
## Environment Variables
```bash
ANTHROPIC_API_KEY=sk-ant-... # Claude AI (or compatible endpoint)
ANTHROPIC_BASE_URL=https://... # Optional: Claude-compatible API endpoint
GOOGLE_API_KEY=AIza... # Google Gemini (optional)
OPENAI_API_KEY=sk-... # OpenAI (optional)
GITHUB_TOKEN=ghp_... # Higher GitHub rate limits
```
## Adding New Features
### New platform adaptor
1. Create `src/skill_seekers/cli/adaptors/{platform}.py` inheriting `SkillAdaptor` from `base.py`
2. Register in `adaptors/__init__.py` (add try/except import + add to `ADAPTORS` dict)
3. Add optional dep to `pyproject.toml`
4. Add tests in `tests/`
### New source type scraper
1. Create `src/skill_seekers/cli/{type}_scraper.py` with `main()`
2. Add to `COMMAND_MODULES` in `cli/main.py`
3. Add entry point in `pyproject.toml` `[project.scripts]`
4. Add auto-detection in `source_detector.py`
5. Add optional dep if needed
6. Add tests
### New CLI argument
- Universal: `UNIVERSAL_ARGUMENTS` in `arguments/create.py`
- Source-specific: appropriate dict (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, etc.)
- Shared across scrapers: `add_all_standard_arguments()` in `arguments/common.py`
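As a rough illustration of dict-driven argument definitions, the sketch below assumes each entry stores its flags and `argparse` keyword arguments. The dict layout, the `--max-pages` flag, and `add_arguments()` are hypothetical; check `arguments/create.py` for the structure actually in use.

```python
import argparse

# Assumed layout: {dest: {"flags": (...), "kwargs": {...}}}
UNIVERSAL_ARGUMENTS = {
    "enhance_level": {
        "flags": ("--enhance-level",),
        "kwargs": {"type": int, "choices": [0, 1, 2, 3], "default": 2,
                   "help": "Enhancement level (0 = off)"},
    },
}

WEB_ARGUMENTS = {
    "max_pages": {  # hypothetical source-specific flag
        "flags": ("--max-pages",),
        "kwargs": {"type": int, "default": None, "help": "Page limit"},
    },
}


def add_arguments(parser, *groups):
    """Register every argument from the given definition dicts on the parser."""
    for group in groups:
        for spec in group.values():
            parser.add_argument(*spec["flags"], **spec["kwargs"])
    return parser


parser = add_arguments(argparse.ArgumentParser(prog="skill-seekers create"),
                       UNIVERSAL_ARGUMENTS, WEB_ARGUMENTS)
```

Keeping definitions as data lets the same entries be attached to several subcommand parsers without repeating `add_argument` calls.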