# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**Skill Seekers** converts documentation from 17 source types into production-ready formats for 24+ AI platforms (LLM platforms, RAG frameworks, vector databases, AI coding assistants). Published on PyPI as `skill-seekers`.

**Version:** 3.4.0 | **Python:** 3.10+ | **Website:** https://skillseekersweb.com/

**Architecture:** See `docs/UML_ARCHITECTURE.md` for UML diagrams and module overview. StarUML project at `docs/UML/skill_seekers.mdj`.

## Essential Commands

```bash
# REQUIRED before running tests or CLI (src/ layout)
pip install -e .

# Run all tests (NEVER skip - all must pass before commits)
pytest tests/ -v

# Fast iteration (skip slow MCP tests ~20min)
pytest tests/ --ignore=tests/test_mcp_fastmcp.py --ignore=tests/test_mcp_server.py --ignore=tests/test_install_skill_e2e.py -q

# Single test
pytest tests/test_scraper_features.py::test_detect_language -vv -s

# Code quality (must pass before push - matches CI)
uvx ruff check src/ tests/
uvx ruff format --check src/ tests/
mypy src/skill_seekers  # continue-on-error in CI

# Auto-fix lint/format issues
uvx ruff check --fix --unsafe-fixes src/ tests/
uvx ruff format src/ tests/

# Build & publish
uv build
uv publish
```

## CI Matrix

Runs on push/PR to `main` or `development`. Lint job (Python 3.12, Ubuntu) + Test job (Ubuntu + macOS, Python 3.10/3.11/3.12, excludes macOS+3.10). Both must pass for merge.

## Git Workflow

- **Main branch:** `main` (requires tests + 1 review)
- **Development branch:** `development` (default PR target, requires tests)
- **Feature branches:** `feature/{task-id}-{description}` from `development`
- PRs always target `development`, never `main` directly

## Architecture

### CLI: Git-style dispatcher

Entry point `src/skill_seekers/cli/main.py` maps subcommands to modules. The `create` command auto-detects source type and is the recommended entry point for users.

```
skill-seekers create <source>     # Auto-detect: URL, owner/repo, ./path, file.pdf, etc.
skill-seekers <type> [options]    # Direct: scrape, github, pdf, word, epub, video, jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat
skill-seekers analyze <dir>       # Analyze local codebase (C3.x pipeline)
skill-seekers package <dir>       # Package for platform (--target claude/gemini/openai/markdown/minimax/opencode/kimi/deepseek/qwen/openrouter/together/fireworks, --format langchain/llama-index/haystack/chroma/faiss/weaviate/qdrant/pinecone)
```
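
As a rough illustration of the dispatcher idea, the sketch below resolves a subcommand to a module path and then calls that module's `main()`. The table contents, the module paths, `resolve_command()`, `dispatch()`, and the assumption that each `main()` reads `sys.argv` are illustrative only; the real `cli/main.py` may be organized differently.

```python
import importlib
import sys
from typing import Mapping, Sequence

# Illustrative table only; the real COMMAND_MODULES in cli/main.py may differ.
COMMAND_MODULES: dict[str, str] = {
    "github": "skill_seekers.cli.github_scraper",
    "pdf": "skill_seekers.cli.pdf_scraper",
}


def resolve_command(argv: Sequence[str], table: Mapping[str, str]) -> tuple[str, list[str]]:
    """Split argv into (module path, remaining args) for the requested subcommand."""
    if not argv:
        raise SystemExit("usage: skill-seekers <command> [options]")
    command, rest = argv[0], list(argv[1:])
    if command not in table:
        raise SystemExit(f"unknown command: {command}")
    return table[command], rest


def dispatch(argv: Sequence[str], table: Mapping[str, str] = COMMAND_MODULES) -> int:
    """Import the target module lazily and hand the remaining arguments to its main()."""
    module_path, rest = resolve_command(argv, table)
    module = importlib.import_module(module_path)
    # Assumption: each module's main() parses sys.argv, as argparse does by default.
    sys.argv = [f"skill-seekers {argv[0]}", *rest]
    result = module.main()
    return result or 0
```

In this sketch the import happens inside `dispatch()`, so a missing optional dependency for one subcommand only surfaces when that subcommand is actually run.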

### Data Flow (5 phases)

1. **Scrape** - Source-specific scraper extracts content to `output/{name}_data/pages/*.json`
2. **Build** - `build_skill()` categorizes pages, extracts patterns, generates `output/{name}/SKILL.md`
3. **Enhance** (optional) - LLM rewrites SKILL.md (`--enhance-level 0-3`, auto-detects API vs LOCAL mode)
4. **Package** - Platform adaptor formats output (`.zip`, `.tar.gz`, JSON, vector index)
5. **Upload** (optional) - Platform API upload

### Platform Adaptor Pattern (Strategy + Factory)

Factory: `get_adaptor(platform, config)` in `adaptors/__init__.py` returns a `SkillAdaptor` instance. Base class `SkillAdaptor` + `SkillMetadata` in `adaptors/base.py`.

```
src/skill_seekers/cli/adaptors/
├── __init__.py           # Factory: get_adaptor(platform, config), ADAPTORS registry
├── base.py               # Abstract base: SkillAdaptor, SkillMetadata
├── openai_compatible.py  # Shared base for OpenAI-compatible platforms
├── claude.py             # --target claude
├── gemini.py             # --target gemini
├── openai.py             # --target openai
├── markdown.py           # --target markdown
├── minimax.py            # --target minimax
├── opencode.py           # --target opencode
├── kimi.py               # --target kimi
├── deepseek.py           # --target deepseek
├── qwen.py               # --target qwen
├── openrouter.py         # --target openrouter
├── together.py           # --target together
├── fireworks.py          # --target fireworks
├── langchain.py          # --format langchain
├── llama_index.py        # --format llama-index
├── haystack.py           # --format haystack
├── chroma.py             # --format chroma
├── faiss_helpers.py      # --format faiss
├── qdrant.py             # --format qdrant
├── weaviate.py           # --format weaviate
├── pinecone_adaptor.py   # --format pinecone
└── streaming_adaptor.py  # --format streaming
```

`--target` = LLM platforms, `--format` = RAG/vector DBs. All adaptors are imported with `try/except ImportError` so missing optional deps don't break the registry.
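
A minimal sketch of how such a registry can tolerate missing optional dependencies is shown below. It is not the project's actual code: the simplified `SkillAdaptor`, its `package()` method, `MarkdownAdaptor`, `_register_optional()`, and the module name `hypothetical_chroma_adaptor` are all assumptions made for illustration, and it uses `importlib` inside the `try/except ImportError` rather than plain import statements.

```python
import importlib
from abc import ABC, abstractmethod


class SkillAdaptor(ABC):
    """Simplified stand-in for the base class in adaptors/base.py."""

    def __init__(self, config=None):
        self.config = config or {}

    @abstractmethod
    def package(self, skill_dir: str) -> str:
        """Return the path of the packaged artifact (hypothetical method)."""


class MarkdownAdaptor(SkillAdaptor):
    def package(self, skill_dir: str) -> str:
        return f"{skill_dir}.zip"


# Registry of platform name -> adaptor class.
ADAPTORS: dict[str, type[SkillAdaptor]] = {"markdown": MarkdownAdaptor}


def _register_optional(platform: str, module_name: str, class_name: str) -> bool:
    """Register an adaptor only if its module (and its dependencies) can be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # optional dependency missing: leave the registry untouched
    ADAPTORS[platform] = getattr(module, class_name)
    return True


# Hypothetical module name; it does not exist, so this registration is skipped.
_register_optional("chroma", "hypothetical_chroma_adaptor", "ChromaAdaptor")


def get_adaptor(platform: str, config=None) -> SkillAdaptor:
    """Factory: look up the adaptor class and instantiate it with the given config."""
    try:
        adaptor_cls = ADAPTORS[platform]
    except KeyError:
        available = ", ".join(sorted(ADAPTORS))
        raise ValueError(f"Unknown platform {platform!r}; available: {available}") from None
    return adaptor_cls(config)
```

With this shape, asking for an adaptor whose dependency is not installed fails with a clear "unknown platform" error instead of breaking every other adaptor at import time.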

### 17 Source Type Scrapers

Each in `src/skill_seekers/cli/{type}_scraper.py` with a `main()` entry point. The `create_command.py` uses `source_detector.py` to auto-route. New scrapers added in v3.2.0+: jupyter, html, openapi, asciidoc, pptx, rss, manpage, confluence, notion, chat.
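
To make the auto-routing idea concrete, here is a hedged sketch of detecting a source type from the input string. The function name `detect_source_type()`, the returned labels (`web`, `local`), the extension table, and the rule order are assumptions for illustration; the real `source_detector.py` may apply different rules and cover more types.

```python
import re
from pathlib import Path

# Hypothetical extension table covering only a few of the supported types.
_EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".docx": "word",
    ".epub": "epub",
    ".ipynb": "jupyter",
    ".pptx": "pptx",
}
_OWNER_REPO = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def detect_source_type(source: str) -> str:
    """Guess the source type from a URL, file name, local path, or owner/repo string."""
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        return "github" if "github.com/" in lowered else "web"
    suffix = Path(lowered).suffix
    if suffix in _EXTENSION_TYPES:
        return _EXTENSION_TYPES[suffix]
    if source.startswith(("./", "../", "/")):
        return "local"
    if _OWNER_REPO.match(source):
        return "github"
    raise ValueError(f"Cannot detect source type for {source!r}")
```

The checks run from most to least specific, so a path such as `./docs/guide.pdf` is treated as a PDF before the generic local-path rule is considered.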

### CLI Argument System

```
src/skill_seekers/cli/
├── parsers/               # Subcommand parser registration
│   └── create_parser.py   # Progressive help disclosure (--help-web, --help-github, etc.)
├── arguments/             # Argument definitions
│   ├── common.py          # add_all_standard_arguments() - shared across all scrapers
│   └── create.py          # UNIVERSAL_ARGUMENTS, WEB_ARGUMENTS, GITHUB_ARGUMENTS, etc.
└── source_detector.py     # Auto-detect source type from input string
```

### C3.x Codebase Analysis Pipeline

Local codebase analysis features, all enabled by default and individually disabled with `--skip-*` flags:
- C3.1 `pattern_recognizer.py` - Design pattern detection (10 GoF patterns, 9 languages)
- C3.2 `test_example_extractor.py` - Usage examples from tests
- C3.3 `how_to_guide_builder.py` - AI-enhanced educational guides
- C3.4 `config_extractor.py` - Configuration pattern extraction
- C3.5 `generate_router.py` - Architecture overview generation
- C3.10 `signal_flow_analyzer.py` - Godot signal flow analysis

### MCP Server

`src/skill_seekers/mcp/server_fastmcp.py` - 26+ tools via FastMCP. Transport: stdio (Claude Code) or HTTP (Cursor/Windsurf). Optional dependency: `pip install -e ".[mcp]"`

### Enhancement Modes

- **API mode** (if `ANTHROPIC_API_KEY` set): Direct Claude API calls
- **LOCAL mode** (fallback): Uses Claude Code CLI (free with Max plan)
- Control: `--enhance-level 0` (off) / `1` (SKILL.md only) / `2` (default, balanced) / `3` (full)
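
The mode selection described above can be sketched as follows. The function name `select_enhancement_mode()` and the returned strings are assumptions for illustration; only the `ANTHROPIC_API_KEY` check, the LOCAL fallback, and the 0-3 level range come from this document.

```python
import os
from typing import Mapping


def select_enhancement_mode(level: int = 2, env: Mapping[str, str] = os.environ) -> str:
    """Return "off", "api", or "local" for the given --enhance-level and environment."""
    if level not in range(4):
        raise ValueError(f"--enhance-level must be 0-3, got {level}")
    if level == 0:
        return "off"  # enhancement disabled entirely
    # API mode when a key is present, otherwise fall back to the local CLI.
    return "api" if env.get("ANTHROPIC_API_KEY") else "local"
```

Passing the environment as a parameter keeps the check easy to exercise in tests without touching the real `os.environ`.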

## Key Implementation Details

### Smart Categorization (`doc_scraper.py:smart_categorize()`)

Scores pages against category keywords: 3 points for URL match, 2 for title, 1 for content. Threshold of 2+ required. Falls back to "other".
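
The scoring rule above can be sketched roughly as below. This is not the body of `smart_categorize()`: the function name `categorize_page()`, the page dictionary keys, substring matching, summing scores across keywords, and breaking ties in favor of the first category are all assumptions.

```python
from typing import Mapping, Sequence


def categorize_page(
    page: Mapping[str, str],
    categories: Mapping[str, Sequence[str]],
    threshold: int = 2,
) -> str:
    """Pick the best-scoring category for a page, or "other" if none reaches the threshold."""
    url = page.get("url", "").lower()
    title = page.get("title", "").lower()
    content = page.get("content", "").lower()

    best_category, best_score = "other", 0
    for category, keywords in categories.items():
        score = 0
        for keyword in keywords:
            kw = keyword.lower()
            if kw in url:
                score += 3  # URL match
            if kw in title:
                score += 2  # title match
            if kw in content:
                score += 1  # content match
        # Only categories at or above the threshold can win; ties keep the earlier one.
        if score >= threshold and score > best_score:
            best_category, best_score = category, score
    return best_category
```

Under these assumptions, a single keyword hit in the page body (1 point) is never enough on its own, while one hit in the title or URL is.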

### Content Extraction (`doc_scraper.py`)

`FALLBACK_MAIN_SELECTORS` constant + `_find_main_content()` helper handle CSS selector fallback. Links are extracted from the full page before early return (not just main content). `body` is deliberately excluded from fallbacks.

### Three-Stream GitHub Architecture (`unified_codebase_analyzer.py`)

Stream 1: Code Analysis (AST, patterns, tests, guides). Stream 2: Documentation (README, docs/, wiki). Stream 3: Community (issues, PRs, metadata). Depth control: `basic` (1-2 min) or `c3x` (20-60 min).

## Testing

### Test markers (pytest.ini)

```bash
pytest tests/ -v                                    # Default: fast tests only
pytest tests/ -v -m slow                            # Include slow tests (>5s)
pytest tests/ -v -m integration                     # External services required
pytest tests/ -v -m e2e                             # Resource-intensive
pytest tests/ -v -m "not slow and not integration"  # Fastest subset
```

### Known legitimate skips (~11)

- 2: chromadb incompatible with Python 3.14 (pydantic v1)
- 2: weaviate-client not installed
- 2: Qdrant not running (requires docker)
- 2: langchain/llama_index not installed
- 3: GITHUB_TOKEN not set

### sys.modules gotcha

`test_swift_detection.py` deletes `skill_seekers.cli` modules from `sys.modules`. It must save and restore both `sys.modules` entries AND parent package attributes (`setattr`). See the test file for the pattern.
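
A generic version of that save-and-restore pattern might look like the sketch below. It is not copied from `test_swift_detection.py`: the `preserved_modules()` context manager and the throwaway `demo_pkg` modules are hypothetical, used so the example does not disturb any real package.

```python
import sys
import types
from contextlib import contextmanager


@contextmanager
def preserved_modules(prefix: str):
    """Snapshot sys.modules entries under `prefix` plus parent attributes, then restore them."""

    def matches(name: str) -> bool:
        return name == prefix or name.startswith(prefix + ".")

    saved_modules = {name: mod for name, mod in list(sys.modules.items()) if matches(name)}
    # Parent packages also hold each submodule as an attribute; remember those too.
    saved_attrs = []
    for name in saved_modules:
        parent_name, _, child = name.rpartition(".")
        parent = sys.modules.get(parent_name)
        if parent is not None and hasattr(parent, child):
            saved_attrs.append((parent, child, getattr(parent, child)))
    try:
        yield
    finally:
        # Drop whatever the test left behind, then put the originals back.
        for name in [n for n in sys.modules if matches(n)]:
            del sys.modules[name]
        sys.modules.update(saved_modules)
        for parent, child, value in saved_attrs:
            setattr(parent, child, value)


# Demo with throwaway modules (hypothetical names).
pkg = types.ModuleType("demo_pkg")
sub = types.ModuleType("demo_pkg.sub")
pkg.sub = sub
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.sub"] = sub

with preserved_modules("demo_pkg"):
    del sys.modules["demo_pkg.sub"]  # simulate a test forcing a re-import
    del pkg.sub                      # the parent attribute goes stale as well
```

Restoring only `sys.modules` would leave the parent package without its submodule attribute (or pointing at a different module object), which is the kind of mismatch that can leak into later tests.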

## Dependencies

Core deps include `langchain`, `llama-index`, `anthropic`, `httpx`, `PyMuPDF`, `pydantic`. Platform-specific deps are optional:

```bash
pip install -e ".[mcp]"         # MCP server
pip install -e ".[gemini]"      # Google Gemini
pip install -e ".[openai]"      # OpenAI
pip install -e ".[docx]"        # Word documents
pip install -e ".[epub]"        # EPUB books
pip install -e ".[video]"       # Video (lightweight)
pip install -e ".[video-full]"  # Video (Whisper + visual)
pip install -e ".[jupyter]"     # Jupyter notebooks
pip install -e ".[pptx]"        # PowerPoint
pip install -e ".[rss]"         # RSS/Atom feeds
pip install -e ".[confluence]"  # Confluence wiki
pip install -e ".[notion]"      # Notion pages
pip install -e ".[chroma]"      # ChromaDB
pip install -e ".[all]"         # Everything (except video-full)
```

Dev dependencies use PEP 735 `[dependency-groups]` in pyproject.toml.

## Environment Variables

```bash
ANTHROPIC_API_KEY=sk-ant-...    # Claude AI (or compatible endpoint)
ANTHROPIC_BASE_URL=https://...  # Optional: Claude-compatible API endpoint
GOOGLE_API_KEY=AIza...          # Google Gemini (optional)
OPENAI_API_KEY=sk-...           # OpenAI (optional)
GITHUB_TOKEN=ghp_...            # Higher GitHub rate limits
```

## Adding New Features

### New platform adaptor
1. Create `src/skill_seekers/cli/adaptors/{platform}.py` inheriting `SkillAdaptor` from `base.py`
2. Register in `adaptors/__init__.py` (add try/except import + add to `ADAPTORS` dict)
3. Add optional dep to `pyproject.toml`
4. Add tests in `tests/`

### New source type scraper
1. Create `src/skill_seekers/cli/{type}_scraper.py` with `main()`
2. Add to `COMMAND_MODULES` in `cli/main.py`
3. Add entry point in `pyproject.toml` `[project.scripts]`
4. Add auto-detection in `source_detector.py`
5. Add optional dep if needed
6. Add tests

### New CLI argument
- Universal: `UNIVERSAL_ARGUMENTS` in `arguments/create.py`
- Source-specific: appropriate dict (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, etc.)
- Shared across scrapers: `add_all_standard_arguments()` in `arguments/common.py`
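
One way dictionary-based argument definitions like these can feed `argparse` is sketched below. The dictionary shape (`flags` plus `kwargs`), the helper names `add_arguments()` and `build_parser()`, and the `--example-github-option` flag are hypothetical; only `--enhance-level` and the dictionary names come from this document, and the real definitions in `arguments/create.py` may be structured differently.

```python
import argparse

# Hypothetical shape: each entry holds the flags and keyword arguments for add_argument().
UNIVERSAL_ARGUMENTS = {
    "enhance_level": {
        "flags": ("--enhance-level",),
        "kwargs": {"type": int, "choices": [0, 1, 2, 3], "default": 2,
                   "help": "Enhancement level (0 = off, 3 = full)"},
    },
}
WEB_ARGUMENTS: dict = {}
GITHUB_ARGUMENTS = {
    "example_github_option": {
        "flags": ("--example-github-option",),  # made-up flag, for illustration only
        "kwargs": {"default": None, "help": "Placeholder for a GitHub-only option"},
    },
}
SOURCE_ARGUMENTS = {"web": WEB_ARGUMENTS, "github": GITHUB_ARGUMENTS}


def add_arguments(parser: argparse.ArgumentParser, *groups: dict) -> None:
    """Register every argument definition from the given groups on the parser."""
    for group in groups:
        for spec in group.values():
            parser.add_argument(*spec["flags"], **spec["kwargs"])


def build_parser(source_type: str) -> argparse.ArgumentParser:
    """Build a parser with universal arguments plus those for one source type."""
    parser = argparse.ArgumentParser(prog="skill-seekers create")
    parser.add_argument("source")
    add_arguments(parser, UNIVERSAL_ARGUMENTS, SOURCE_ARGUMENTS.get(source_type, {}))
    return parser
```

For example, `build_parser("github").parse_args(["owner/repo", "--enhance-level", "3"])` yields a namespace with `enhance_level == 3`, while the same flags parsed for `"web"` would not accept the GitHub-only option.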