skill-seekers-reference/CLAUDE.md

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**Skill Seekers** converts documentation from 17 source types into production-ready formats for 24+ AI platforms (LLM platforms, RAG frameworks, vector databases, AI coding assistants). Published on PyPI as `skill-seekers`.

**Version:** 3.3.0 | **Python:** 3.10+ | **Website:** https://skillseekersweb.com/

**Architecture:** See `docs/UML_ARCHITECTURE.md` for UML diagrams and module overview. StarUML project at `docs/UML/skill_seekers.mdj`.

## Essential Commands

```bash
# REQUIRED before running tests or CLI (src/ layout)
pip install -e .

# Run all tests (NEVER skip - all must pass before commits)
pytest tests/ -v

# Fast iteration (skip slow MCP tests ~20min)
pytest tests/ --ignore=tests/test_mcp_fastmcp.py --ignore=tests/test_mcp_server.py --ignore=tests/test_install_skill_e2e.py -q

# Single test
pytest tests/test_scraper_features.py::test_detect_language -vv -s

# Code quality (must pass before push - matches CI)
uvx ruff check src/ tests/
uvx ruff format --check src/ tests/
mypy src/skill_seekers  # continue-on-error in CI

# Auto-fix lint/format issues
uvx ruff check --fix --unsafe-fixes src/ tests/
uvx ruff format src/ tests/

# Build & publish
uv build
uv publish
```

## CI Matrix

Runs on push/PR to `main` or `development`. Lint job (Python 3.12, Ubuntu) + Test job (Ubuntu + macOS, Python 3.10/3.11/3.12, excludes macOS+3.10). Both must pass for merge.

## Git Workflow

- **Main branch:** `main` (requires tests + 1 review)
- **Development branch:** `development` (default PR target, requires tests)
- **Feature branches:** `feature/{task-id}-{description}` from `development`
- PRs always target `development`, never `main` directly

## Architecture

### CLI: Git-style dispatcher

Entry point `src/skill_seekers/cli/main.py` maps subcommands to modules. The `create` command auto-detects source type and is the recommended entry point for users.

```
skill-seekers create <source>     # Auto-detect: URL, owner/repo, ./path, file.pdf, etc.
skill-seekers <type> [options]    # Direct: scrape, github, pdf, word, epub, video, jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat
skill-seekers analyze <dir>       # Analyze local codebase (C3.x pipeline)
skill-seekers package <dir>       # Package for platform (--target claude/gemini/openai/markdown/minimax/opencode/kimi/deepseek/qwen/openrouter/together/fireworks, --format langchain/llama-index/haystack/chroma/faiss/weaviate/qdrant/pinecone)
```

### Data Flow (5 phases)

1. **Scrape** - Source-specific scraper extracts content to `output/{name}_data/pages/*.json`
2. **Build** - `build_skill()` categorizes pages, extracts patterns, generates `output/{name}/SKILL.md`
3. **Enhance** (optional) - LLM rewrites SKILL.md (`--enhance-level 0-3`, auto-detects API vs LOCAL mode)
4. **Package** - Platform adaptor formats output (`.zip`, `.tar.gz`, JSON, vector index)
5. **Upload** (optional) - Platform API upload

### Platform Adaptor Pattern (Strategy + Factory)

Factory: `get_adaptor(platform, config)` in `adaptors/__init__.py` returns a `SkillAdaptor` instance. Base class `SkillAdaptor` + `SkillMetadata` in `adaptors/base.py`.

```
src/skill_seekers/cli/adaptors/
├── __init__.py              # Factory: get_adaptor(platform, config), ADAPTORS registry
├── base.py                  # Abstract base: SkillAdaptor, SkillMetadata
├── openai_compatible.py     # Shared base for OpenAI-compatible platforms
├── claude.py                # --target claude
├── gemini.py                # --target gemini
├── openai.py                # --target openai
├── markdown.py              # --target markdown
├── minimax.py               # --target minimax
├── opencode.py              # --target opencode
├── kimi.py                  # --target kimi
├── deepseek.py              # --target deepseek
├── qwen.py                  # --target qwen
├── openrouter.py            # --target openrouter
├── together.py              # --target together
├── fireworks.py             # --target fireworks
├── langchain.py             # --format langchain
├── llama_index.py           # --format llama-index
├── haystack.py              # --format haystack
├── chroma.py                # --format chroma
├── faiss_helpers.py         # --format faiss
├── qdrant.py                # --format qdrant
├── weaviate.py              # --format weaviate
├── pinecone_adaptor.py      # --format pinecone
└── streaming_adaptor.py     # --format streaming
```

`--target` = LLM platforms, `--format` = RAG/vector DBs. All adaptors are imported with `try/except ImportError` so missing optional deps don't break the registry.

### 17 Source Type Scrapers

Each in `src/skill_seekers/cli/{type}_scraper.py` with a `main()` entry point. The `create_command.py` uses `source_detector.py` to auto-route. New scrapers added in v3.2.0+: jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat.

### CLI Argument System

```
src/skill_seekers/cli/
├── parsers/              # Subcommand parser registration
│   └── create_parser.py  # Progressive help disclosure (--help-web, --help-github, etc.)
├── arguments/            # Argument definitions
│   ├── common.py         # add_all_standard_arguments() - shared across all scrapers
│   └── create.py         # UNIVERSAL_ARGUMENTS, WEB_ARGUMENTS, GITHUB_ARGUMENTS, etc.
└── source_detector.py    # Auto-detect source type from input string
```

### C3.x Codebase Analysis Pipeline

Local codebase analysis features, all opt-out (`--skip-*` flags):
- C3.1 `pattern_recognizer.py` - Design pattern detection (10 GoF patterns, 9 languages)
- C3.2 `test_example_extractor.py` - Usage examples from tests
- C3.3 `how_to_guide_builder.py` - AI-enhanced educational guides
- C3.4 `config_extractor.py` - Configuration pattern extraction
- C3.5 `generate_router.py` - Architecture overview generation
- C3.10 `signal_flow_analyzer.py` - Godot signal flow analysis

### MCP Server

`src/skill_seekers/mcp/server_fastmcp.py` - 26+ tools via FastMCP. Transport: stdio (Claude Code) or HTTP (Cursor/Windsurf). Optional dependency: `pip install -e ".[mcp]"`

### Enhancement Modes

- **API mode** (if `ANTHROPIC_API_KEY` set): Direct Claude API calls
- **LOCAL mode** (fallback): Uses Claude Code CLI (free with Max plan)
- Control: `--enhance-level 0` (off) / `1` (SKILL.md only) / `2` (default, balanced) / `3` (full)

## Key Implementation Details

### Smart Categorization (`doc_scraper.py:smart_categorize()`)

Scores pages against category keywords: 3 points for URL match, 2 for title, 1 for content. Threshold of 2+ required. Falls back to "other".

### Content Extraction (`doc_scraper.py`)

`FALLBACK_MAIN_SELECTORS` constant + `_find_main_content()` helper handle CSS selector fallback. Links are extracted from the full page before early return (not just main content). `body` is deliberately excluded from fallbacks.

### Three-Stream GitHub Architecture (`unified_codebase_analyzer.py`)

Stream 1: Code Analysis (AST, patterns, tests, guides). Stream 2: Documentation (README, docs/, wiki). Stream 3: Community (issues, PRs, metadata). Depth control: `basic` (1-2 min) or `c3x` (20-60 min).

## Testing

### Test markers (pytest.ini)

```bash
pytest tests/ -v                                    # Default: fast tests only
pytest tests/ -v -m slow                            # Include slow tests (>5s)
pytest tests/ -v -m integration                     # External services required
pytest tests/ -v -m e2e                             # Resource-intensive
pytest tests/ -v -m "not slow and not integration"  # Fastest subset
```

### Known legitimate skips (~11)

- 2: chromadb incompatible with Python 3.14 (pydantic v1)
- 2: weaviate-client not installed
- 2: Qdrant not running (requires docker)
- 2: langchain/llama_index not installed
- 3: GITHUB_TOKEN not set

### sys.modules gotcha

`test_swift_detection.py` deletes `skill_seekers.cli` modules from `sys.modules`. It must save and restore both `sys.modules` entries AND parent package attributes (`setattr`). See the test file for the pattern.

## Dependencies

Core deps include `langchain`, `llama-index`, `anthropic`, `httpx`, `PyMuPDF`, `pydantic`. Platform-specific deps are optional:

```bash
pip install -e ".[mcp]"       # MCP server
pip install -e ".[gemini]"    # Google Gemini
pip install -e ".[openai]"    # OpenAI
pip install -e ".[docx]"      # Word documents
pip install -e ".[epub]"      # EPUB books
pip install -e ".[video]"     # Video (lightweight)
pip install -e ".[video-full]"# Video (Whisper + visual)
pip install -e ".[jupyter]"   # Jupyter notebooks
pip install -e ".[pptx]"      # PowerPoint
pip install -e ".[rss]"       # RSS/Atom feeds
pip install -e ".[confluence]"# Confluence wiki
pip install -e ".[notion]"    # Notion pages
pip install -e ".[chroma]"    # ChromaDB
pip install -e ".[all]"       # Everything (except video-full)
```

Dev dependencies use PEP 735 `[dependency-groups]` in pyproject.toml.

## Environment Variables

```bash
ANTHROPIC_API_KEY=sk-ant-...          # Claude AI (or compatible endpoint)
ANTHROPIC_BASE_URL=https://...        # Optional: Claude-compatible API endpoint
GOOGLE_API_KEY=AIza...                # Google Gemini (optional)
OPENAI_API_KEY=sk-...                 # OpenAI (optional)
GITHUB_TOKEN=ghp_...                  # Higher GitHub rate limits
```

## Adding New Features

### New platform adaptor
1. Create `src/skill_seekers/cli/adaptors/{platform}.py` inheriting `SkillAdaptor` from `base.py`
2. Register in `adaptors/__init__.py` (add try/except import + add to `ADAPTORS` dict)
3. Add optional dep to `pyproject.toml`
4. Add tests in `tests/`

### New source type scraper
1. Create `src/skill_seekers/cli/{type}_scraper.py` with `main()`
2. Add to `COMMAND_MODULES` in `cli/main.py`
3. Add entry point in `pyproject.toml` `[project.scripts]`
4. Add auto-detection in `source_detector.py`
5. Add optional dep if needed
6. Add tests

### New CLI argument
- Universal: `UNIVERSAL_ARGUMENTS` in `arguments/create.py`
- Source-specific: appropriate dict (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, etc.)
- Shared across scrapers: `add_all_standard_arguments()` in `arguments/common.py`