# AGENTS.md - Skill Seekers

Concise reference for AI coding agents. Skill Seekers is a Python CLI tool (v3.2.0) that converts documentation sites, GitHub repos, PDFs, videos, notebooks, wikis, and more into AI-ready skills for 16+ LLM platforms and RAG pipelines.
## Setup

```bash
# REQUIRED before running tests (src/ layout; tests fail without this)
pip install -e .

# With dev tools
pip install -e ".[dev]"

# With all optional deps
pip install -e ".[all]"
```
## Build / Test / Lint Commands

```bash
# Run ALL tests (never skip tests; all must pass before committing)
pytest tests/ -v

# Run a single test file
pytest tests/test_scraper_features.py -v

# Run a single test function
pytest tests/test_scraper_features.py::test_detect_language -v

# Run a single test class method
pytest tests/test_adaptors/test_claude_adaptor.py::TestClaudeAdaptor::test_package -v

# Skip slow/integration tests
pytest tests/ -v -m "not slow and not integration"

# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term

# Lint (ruff)
ruff check src/ tests/
ruff check src/ tests/ --fix

# Format (ruff)
ruff format --check src/ tests/
ruff format src/ tests/

# Type check (mypy)
mypy src/skill_seekers --show-error-codes --pretty
```

**Test markers:** `slow`, `integration`, `e2e`, `venv`, `bootstrap`, `benchmark`
**Async tests:** use `@pytest.mark.asyncio`; `asyncio_mode` is `auto`.
## Code Style

### Formatting Rules (ruff, from pyproject.toml)
- **Line length:** 100 characters
- **Target Python:** 3.10+
- **Enabled lint rules:** E, W, F, I, B, C4, UP, ARG, SIM
- **Ignored rules:** E501 (line length handled by formatter), F541 (f-string style), ARG002 (unused method args for interface compliance), B007 (intentional unused loop vars), I001 (formatter handles imports), SIM114 (readability preference)
### Imports
- Sort with isort (via ruff); `skill_seekers` is first-party
- Standard library → third-party → first-party, separated by blank lines
- Use `from __future__ import annotations` only if needed for forward refs
- Guard optional imports with try/except ImportError (see `adaptors/__init__.py` pattern)
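The import rules above can be illustrated with a small module. This is a minimal sketch, not code from the repository: `load_config` is a hypothetical helper, and PyYAML stands in for any optional dependency. The commented first-party import assumes `get_adaptor` is importable from `skill_seekers.cli.adaptors`, which matches the layout described in this file but is not verified here.

```python
"""Example module showing the import layout and an optional-import guard."""

# Standard library
import json
from pathlib import Path

# Third-party: guarded because it is an optional dependency.
# PyYAML is used here only as an illustration.
try:
    import yaml
except ImportError:
    yaml = None

# First-party imports would come last, for example (assumed path):
# from skill_seekers.cli.adaptors import get_adaptor


def load_config(path: Path) -> dict:
    """Load a JSON or YAML config (hypothetical helper)."""
    text = path.read_text(encoding="utf-8")
    if path.suffix in {".yaml", ".yml"}:
        if yaml is None:
            # Fail at the point of use with clear install instructions.
            raise ImportError(
                "PyYAML is required for YAML configs. Install it with: pip install pyyaml"
            )
        return yaml.safe_load(text)
    return json.loads(text)
```

Deferring the failure to the point of use keeps the module importable for users who never touch YAML configs.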
### Naming Conventions
- **Files:** `snake_case.py`
- **Classes:** `PascalCase` (e.g., `SkillAdaptor`, `ClaudeAdaptor`)
- **Functions/methods:** `snake_case`
- **Constants:** `UPPER_CASE` (e.g., `ADAPTORS`, `DEFAULT_CHUNK_TOKENS`)
- **Private:** prefix with `_`
### Type Hints
- Gradual typing: add hints where practical, not enforced everywhere
- Use modern syntax: `str | None` not `Optional[str]`, `list[str]` not `List[str]`
- MyPy config: `disallow_untyped_defs = false`, `check_untyped_defs = true`, `ignore_missing_imports = true`
### Docstrings
- Module-level docstring on every file (triple-quoted, describes purpose)
- Google-style or standard docstrings for public functions/classes
- Include `Args:`, `Returns:`, `Raises:` sections where useful
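A Google-style docstring following these rules might look like the sketch below. `chunk_text` is a made-up function chosen only to show the `Args:`, `Returns:`, and `Raises:` sections; it is not part of the package.

```python
"""Utilities for splitting text into chunks (illustrative module docstring)."""


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most ``max_chars`` characters.

    Args:
        text: The text to split.
        max_chars: Maximum length of each chunk. Must be positive.

    Returns:
        The chunks in their original order. Empty input yields an empty list.

    Raises:
        ValueError: If ``max_chars`` is not positive.
    """
    if max_chars <= 0:
        raise ValueError(f"max_chars must be positive, got {max_chars}")
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```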
### Error Handling
- Use specific exceptions, never bare `except:`
- Provide helpful error messages with context (see `get_adaptor()` in `adaptors/__init__.py`)
- Use `raise ValueError(...)` for invalid arguments, `raise RuntimeError(...)` for state errors
- Guard optional dependency imports with try/except and give clear install instructions on failure
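The split between `ValueError` and `RuntimeError` is easiest to see side by side. The sketch below uses a hypothetical `SkillPackager` class and `parse_port` helper; neither exists in the project, and the messages are only examples of adding context.

```python
class SkillPackager:
    """Hypothetical class used only to illustrate the error-handling rules."""

    SUPPORTED_FORMATS = ("zip", "tar")

    def __init__(self) -> None:
        self._built = False

    def build(self) -> None:
        """Mark the skill as built (placeholder for real work)."""
        self._built = True

    def package(self, fmt: str) -> str:
        """Return the archive name for a built skill."""
        if fmt not in self.SUPPORTED_FORMATS:
            # Invalid argument: ValueError, with the valid choices listed.
            supported = ", ".join(self.SUPPORTED_FORMATS)
            raise ValueError(f"Unknown format '{fmt}'. Supported formats: {supported}")
        if not self._built:
            # Valid argument but wrong object state: RuntimeError.
            raise RuntimeError("package() called before build(); run build() first")
        return f"skill.{fmt}"


def parse_port(raw: str) -> int:
    """Catch the specific exception and re-raise it with context."""
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"Invalid port {raw!r}: expected an integer") from exc
```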
### Suppressing Lint Warnings
- Use inline `# noqa: XXXX` comments (e.g., `# noqa: F401` for re-exports, `# noqa: ARG001` for required but unused params)
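For illustration, the two suppressions named above could appear as follows. The names are hypothetical: `dumps` stands in for any re-exported symbol, and `on_progress` for a callback whose signature is dictated by its caller.

```python
"""Package-style module that re-exports a name for callers."""

from json import dumps  # noqa: F401  (re-exported; unused in this module)


def on_progress(event: str, payload: dict) -> None:  # noqa: ARG001
    """Callback with a fixed signature; payload is intentionally unused."""
    print(f"progress: {event}")
```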
## Supported Source Types (17)

| Type | CLI Command | Config Type | Detection |
|------|------------|-------------|-----------|
| Documentation (web) | `scrape` / `create <url>` | `documentation` | HTTP/HTTPS URLs |
| GitHub repo | `github` / `create owner/repo` | `github` | `owner/repo` or github.com URLs |
| PDF | `pdf` / `create file.pdf` | `pdf` | `.pdf` extension |
| Word (.docx) | `word` / `create file.docx` | `word` | `.docx` extension |
| EPUB | `epub` / `create file.epub` | `epub` | `.epub` extension |
| Video | `video` / `create <url/file>` | `video` | YouTube/Vimeo URLs, video extensions |
| Local codebase | `analyze` / `create ./path` | `local` | Directory paths |
| Jupyter Notebook | `jupyter` / `create file.ipynb` | `jupyter` | `.ipynb` extension |
| Local HTML | `html` / `create file.html` | `html` | `.html`/`.htm` extensions |
| OpenAPI/Swagger | `openapi` / `create spec.yaml` | `openapi` | `.yaml`/`.yml` with OpenAPI content |
| AsciiDoc | `asciidoc` / `create file.adoc` | `asciidoc` | `.adoc`/`.asciidoc` extensions |
| PowerPoint | `pptx` / `create file.pptx` | `pptx` | `.pptx` extension |
| RSS/Atom | `rss` / `create feed.rss` | `rss` | `.rss`/`.atom` extensions |
| Man pages | `manpage` / `create cmd.1` | `manpage` | `.1`-`.8`/`.man` extensions |
| Confluence | `confluence` | `confluence` | API or export directory |
| Notion | `notion` | `notion` | API or export directory |
| Slack/Discord | `chat` | `chat` | Export directory or API |
## Project Layout

```
src/skill_seekers/              # Main package (src/ layout)
  cli/                          # CLI commands and entry points
    adaptors/                   # Platform adaptors (Strategy pattern, inherit SkillAdaptor)
    arguments/                  # CLI argument definitions (one per source type)
    parsers/                    # Subcommand parsers (one per source type)
    storage/                    # Cloud storage (inherit BaseStorageAdaptor)
    main.py                     # Unified CLI entry point (COMMAND_MODULES dict)
    source_detector.py          # Auto-detects source type from user input
    create_command.py           # Unified `create` command routing
    config_validator.py         # VALID_SOURCE_TYPES set + per-type validation
    unified_scraper.py          # Multi-source orchestrator (scraped_data + dispatch)
    unified_skill_builder.py    # Pairwise synthesis + generic merge
  mcp/                          # MCP server (FastMCP + legacy)
    tools/                      # MCP tool implementations by category
  sync/                         # Sync monitoring (Pydantic models)
  benchmark/                    # Benchmarking framework
  embedding/                    # FastAPI embedding server
  workflows/                    # 67 YAML workflow presets (includes complex-merge.yaml)
  _version.py                   # Reads version from pyproject.toml
tests/                          # 115+ test files (pytest)
configs/                        # Preset JSON scraping configs
docs/                           # 80+ markdown doc files
```
## Key Patterns

**Adaptor (Strategy) pattern:** all platform logic lives in `cli/adaptors/`. Inherit `SkillAdaptor` and implement `format_skill_md()`, `package()`, and `upload()`. Register the class in the `ADAPTORS` dict in `adaptors/__init__.py`.

**Scraper pattern:** each source type has `cli/<type>_scraper.py` (with a `<Type>ToSkillConverter` class and `main()`), `arguments/<type>.py`, and `parsers/<type>_parser.py`. Register it in the `PARSERS` list in `parsers/__init__.py`, the `COMMAND_MODULES` dict in `main.py`, and the `VALID_SOURCE_TYPES` set in `config_validator.py`.

**Unified pipeline:** `unified_scraper.py` dispatches to per-type `_scrape_<type>()` methods. `unified_skill_builder.py` uses pairwise synthesis for docs+github+pdf combos and `_generic_merge()` for all other combinations.

**MCP tools:** grouped in `mcp/tools/` by category. `scrape_generic_tool` handles the 10 newer source types (Jupyter Notebook through Slack/Discord in the table above).

**CLI subcommands:** git-style, defined in `cli/main.py`. Each delegates to a module's `main()` function.
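The Adaptor pattern described in this section can be sketched as a base class plus a registry lookup. This is a simplified stand-in, not the project's code: the method signatures, `ExampleAdaptor`, and the body of `get_adaptor` are assumptions. Only the names `SkillAdaptor`, `format_skill_md`, `package`, `upload`, `ADAPTORS`, and `get_adaptor` come from this document.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SkillAdaptor(ABC):
    """Simplified stand-in for the real base class (signatures assumed)."""

    @abstractmethod
    def format_skill_md(self, content: str) -> str:
        """Render skill content in the platform's format."""

    @abstractmethod
    def package(self, skill_dir: Path) -> Path:
        """Return the path of the packaged skill."""

    @abstractmethod
    def upload(self, package_path: Path) -> str:
        """Upload the package and return a status string."""


class ExampleAdaptor(SkillAdaptor):
    """Hypothetical platform adaptor with placeholder behaviour."""

    def format_skill_md(self, content: str) -> str:
        return "# Skill\n\n" + content.strip() + "\n"

    def package(self, skill_dir: Path) -> Path:
        # Placeholder: computes the archive path without writing a file.
        return skill_dir.with_suffix(".zip")

    def upload(self, package_path: Path) -> str:
        # Placeholder: no network call is made.
        return f"uploaded:{package_path.name}"


# Registry mapping platform name -> adaptor class.
ADAPTORS: dict[str, type[SkillAdaptor]] = {"example": ExampleAdaptor}


def get_adaptor(platform: str) -> SkillAdaptor:
    """Look up and instantiate an adaptor, with a helpful error on a miss."""
    try:
        adaptor_cls = ADAPTORS[platform]
    except KeyError:
        available = ", ".join(sorted(ADAPTORS))
        raise ValueError(
            f"Unknown platform '{platform}'. Available: {available}"
        ) from None
    return adaptor_cls()
```

With a registry like this, adding a platform means writing one subclass and adding one dict entry; callers keep using `get_adaptor(name)`.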
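The per-type dispatch in the unified pipeline can be pictured as a name-based method lookup. The sketch below is an assumption-heavy illustration: `UnifiedScraperSketch`, the `sources`/`type`/`path` config keys, and the handler bodies are invented for the example, and the real `unified_scraper.py` may dispatch differently. Only the `_scrape_<type>()` naming and the `scraped_data` attribute are taken from this document.

```python
class UnifiedScraperSketch:
    """Illustrative multi-source orchestrator (not the real class)."""

    def __init__(self, config: dict) -> None:
        self.config = config
        # Results grouped by source type (structure assumed).
        self.scraped_data: dict[str, list[dict]] = {}

    def run(self) -> dict[str, list[dict]]:
        """Scrape every configured source via its _scrape_<type>() method."""
        for source in self.config.get("sources", []):
            source_type = source["type"]
            handler = getattr(self, f"_scrape_{source_type}", None)
            if handler is None:
                raise ValueError(f"Unsupported source type: {source_type!r}")
            self.scraped_data.setdefault(source_type, []).append(handler(source))
        return self.scraped_data

    # Placeholder handlers; real ones would call the per-type scrapers.
    def _scrape_pdf(self, source: dict) -> dict:
        return {"type": "pdf", "path": source["path"]}

    def _scrape_html(self, source: dict) -> dict:
        return {"type": "html", "path": source["path"]}
```

Under this scheme a new source type needs one new `_scrape_<type>()` method plus its registration in the validator; the loop itself does not change.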
## Git Workflow

- **`main`**: production, protected
- **`development`**: default PR target, active dev
- Feature branches created from `development`
## Pre-commit Checklist

```bash
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x  # stop on first failure
```

Never commit API keys. Use env vars: `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `OPENAI_API_KEY`, `GITHUB_TOKEN`.
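Reading a key from the environment, rather than from source, might look like this minimal sketch. `require_env` is a hypothetical helper, not a project function; it follows the error-handling rules above by raising `RuntimeError` with a hint.

```python
import os


def require_env(name: str) -> str:
    """Return a required environment variable or fail with a clear hint."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running this command."
        )
    return value
```

A caller would use it as `token = require_env("GITHUB_TOKEN")`, so the secret never appears in code or config files.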
## CI

GitHub Actions (`.github/workflows/tests.yml`): ruff + mypy lint job, then pytest matrix (Ubuntu + macOS, Python 3.10-3.12) with Codecov upload.