Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint, RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new skill source types. Each type is fully integrated across:
- Standalone CLI commands (skill-seekers <type>)
- Auto-detection via 'skill-seekers create' (file extension + content sniffing)
- Unified multi-source configs (scraped_data, dispatch, config validation)
- Unified skill builder (generic merge + source-attributed synthesis)
- MCP server (scrape_generic tool with per-type flag mapping)
- pyproject.toml (entry points, optional deps, [all] group)
Also fixes: EPUB unified pipeline gap, missing word/video config validators, OpenAPI yaml import guard, MCP flag mismatch for all 10 types, stale docstrings, and adds 77 integration tests + complex-merge workflow.
50 files changed, +20,201 lines
AGENTS.md - Skill Seekers
Concise reference for AI coding agents. Skill Seekers is a Python CLI tool (v3.2.0) that converts documentation sites, GitHub repos, PDFs, videos, notebooks, wikis, and more into AI-ready skills for 16+ LLM platforms and RAG pipelines.
Setup
# REQUIRED before running tests (src/ layout — tests fail without this)
pip install -e .
# With dev tools
pip install -e ".[dev]"
# With all optional deps
pip install -e ".[all]"
Build / Test / Lint Commands
# Run ALL tests (never skip tests — all must pass before commits)
pytest tests/ -v
# Run a single test file
pytest tests/test_scraper_features.py -v
# Run a single test function
pytest tests/test_scraper_features.py::test_detect_language -v
# Run a single test class method
pytest tests/test_adaptors/test_claude_adaptor.py::TestClaudeAdaptor::test_package -v
# Skip slow/integration tests
pytest tests/ -v -m "not slow and not integration"
# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term
# Lint (ruff)
ruff check src/ tests/
ruff check src/ tests/ --fix
# Format (ruff)
ruff format --check src/ tests/
ruff format src/ tests/
# Type check (mypy)
mypy src/skill_seekers --show-error-codes --pretty
Test markers: slow, integration, e2e, venv, bootstrap, benchmark
Async tests: use @pytest.mark.asyncio; asyncio_mode is auto.
Code Style
Formatting Rules (ruff — from pyproject.toml)
- Line length: 100 characters
- Target Python: 3.10+
- Enabled lint rules: E, W, F, I, B, C4, UP, ARG, SIM
- Ignored rules: E501 (line length handled by formatter), F541 (f-string style), ARG002 (unused method args for interface compliance), B007 (intentional unused loop vars), I001 (formatter handles imports), SIM114 (readability preference)
Imports
- Sort with isort (via ruff); `skill_seekers` is first-party
- Standard library → third-party → first-party, separated by blank lines
- Use `from __future__ import annotations` only if needed for forward refs
- Guard optional imports with try/except ImportError (see `adaptors/__init__.py` pattern)
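The optional-import guard from the last bullet could look roughly like the sketch below. The module name `hypothetical_pdf_backend` and the helper `require_pdf_support` are invented for illustration; the project's actual pattern is the one in `adaptors/__init__.py`.

```python
"""Sketch of guarding an optional dependency import."""

try:
    import hypothetical_pdf_backend  # hypothetical optional dependency
except ImportError:
    # Keep the module importable; fail later, only when the feature is used.
    hypothetical_pdf_backend = None


def require_pdf_support() -> None:
    """Raise a clear error if the optional dependency is missing."""
    if hypothetical_pdf_backend is None:
        raise ImportError(
            "PDF support needs the optional 'hypothetical_pdf_backend' package. "
            "Install it with: pip install hypothetical_pdf_backend"
        )
```

Deferring the failure to the point of use keeps unrelated commands working when an optional extra is not installed.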
Naming Conventions
- Files: `snake_case.py`
- Classes: `PascalCase` (e.g., `SkillAdaptor`, `ClaudeAdaptor`)
- Functions/methods: `snake_case`
- Constants: `UPPER_CASE` (e.g., `ADAPTORS`, `DEFAULT_CHUNK_TOKENS`)
- Private: prefix with `_`
Type Hints
- Gradual typing — add hints where practical, not enforced everywhere
- Use modern syntax: `str | None` not `Optional[str]`, `list[str]` not `List[str]`
- MyPy config: `disallow_untyped_defs = false`, `check_untyped_defs = true`, `ignore_missing_imports = true`
Docstrings
- Module-level docstring on every file (triple-quoted, describes purpose)
- Google-style or standard docstrings for public functions/classes
- Include `Args:`, `Returns:`, `Raises:` sections where useful
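A Google-style docstring with all three sections might look like this. The function `chunk_words` is a made-up example, not part of the codebase.

```python
"""Example module docstring: helpers for splitting text into chunks."""


def chunk_words(text: str, max_words: int) -> list[str]:
    """Split text into chunks of at most ``max_words`` words.

    Args:
        text: Input text; it is split on whitespace.
        max_words: Maximum number of words per chunk; must be positive.

    Returns:
        A list of chunks, each joined with single spaces. Empty input
        yields an empty list.

    Raises:
        ValueError: If ``max_words`` is less than 1.
    """
    if max_words < 1:
        raise ValueError(f"max_words must be >= 1, got {max_words}")
    words = text.split()
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]
```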
Error Handling
- Use specific exceptions, never bare `except:`
- Provide helpful error messages with context (see `get_adaptor()` in `adaptors/__init__.py`)
- Use `raise ValueError(...)` for invalid arguments, `raise RuntimeError(...)` for state errors
- Guard optional dependency imports with try/except and give clear install instructions on failure
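The split between `ValueError` and `RuntimeError` can be sketched as follows. `SkillPackager` and `VALID_FORMATS` are hypothetical names used only to show which exception fits which situation.

```python
VALID_FORMATS = {"markdown", "json"}  # hypothetical set of accepted values


class SkillPackager:
    """Hypothetical class contrasting argument errors with state errors."""

    def __init__(self, output_format: str) -> None:
        if output_format not in VALID_FORMATS:
            # Invalid argument: name the bad value and list the valid ones.
            raise ValueError(
                f"Unknown output format '{output_format}'. "
                f"Valid formats: {', '.join(sorted(VALID_FORMATS))}"
            )
        self.output_format = output_format
        self._built = False

    def build(self) -> None:
        """Mark the skill as built (placeholder for real work)."""
        self._built = True

    def package(self) -> str:
        """Return a description of the package; requires build() first."""
        if not self._built:
            # State error: the call itself is valid, just not at this point.
            raise RuntimeError("package() called before build(); call build() first")
        return f"packaged as {self.output_format}"
```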
Suppressing Lint Warnings
- Use inline `# noqa: XXXX` comments (e.g., `# noqa: F401` for re-exports, `# noqa: ARG001` for required but unused params)
Supported Source Types (17)
| Type | CLI Command | Config Type | Detection |
|---|---|---|---|
| Documentation (web) | `scrape` / `create <url>` | `documentation` | HTTP/HTTPS URLs |
| GitHub repo | `github` / `create owner/repo` | `github` | owner/repo or github.com URLs |
| PDF | `pdf` / `create file.pdf` | `pdf` | .pdf extension |
| Word (.docx) | `word` / `create file.docx` | `word` | .docx extension |
| EPUB | `epub` / `create file.epub` | `epub` | .epub extension |
| Video | `video` / `create <url/file>` | `video` | YouTube/Vimeo URLs, video extensions |
| Local codebase | `analyze` / `create ./path` | `local` | Directory paths |
| Jupyter Notebook | `jupyter` / `create file.ipynb` | `jupyter` | .ipynb extension |
| Local HTML | `html` / `create file.html` | `html` | .html/.htm extensions |
| OpenAPI/Swagger | `openapi` / `create spec.yaml` | `openapi` | .yaml/.yml with OpenAPI content |
| AsciiDoc | `asciidoc` / `create file.adoc` | `asciidoc` | .adoc/.asciidoc extensions |
| PowerPoint | `pptx` / `create file.pptx` | `pptx` | .pptx extension |
| RSS/Atom | `rss` / `create feed.rss` | `rss` | .rss/.atom extensions |
| Man pages | `manpage` / `create cmd.1` | `manpage` | .1-.8/.man extensions |
| Confluence | `confluence` | `confluence` | API or export directory |
| Notion | `notion` | `notion` | API or export directory |
| Slack/Discord | `chat` | `chat` | Export directory or API |
Project Layout
src/skill_seekers/            # Main package (src/ layout)
  cli/                        # CLI commands and entry points
    adaptors/                 # Platform adaptors (Strategy pattern, inherit SkillAdaptor)
    arguments/                # CLI argument definitions (one per source type)
    parsers/                  # Subcommand parsers (one per source type)
    storage/                  # Cloud storage (inherit BaseStorageAdaptor)
    main.py                   # Unified CLI entry point (COMMAND_MODULES dict)
    source_detector.py        # Auto-detects source type from user input
    create_command.py         # Unified `create` command routing
    config_validator.py       # VALID_SOURCE_TYPES set + per-type validation
    unified_scraper.py        # Multi-source orchestrator (scraped_data + dispatch)
    unified_skill_builder.py  # Pairwise synthesis + generic merge
  mcp/                        # MCP server (FastMCP + legacy)
    tools/                    # MCP tool implementations by category
  sync/                       # Sync monitoring (Pydantic models)
  benchmark/                  # Benchmarking framework
  embedding/                  # FastAPI embedding server
  workflows/                  # 67 YAML workflow presets (includes complex-merge.yaml)
  _version.py                 # Reads version from pyproject.toml
tests/                        # 115+ test files (pytest)
configs/                      # Preset JSON scraping configs
docs/                         # 80+ markdown doc files
Key Patterns
Adaptor (Strategy) pattern — all platform logic in cli/adaptors/. Inherit SkillAdaptor, implement format_skill_md(), package(), upload(). Register in adaptors/__init__.py ADAPTORS dict.
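A rough sketch of the adaptor pattern follows. The three method names and the `ADAPTORS` registry come from the description above; the method signatures, `ExampleAdaptor`, and the body of `get_adaptor` are assumptions made for illustration and may differ from the real code.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SkillAdaptor(ABC):
    """Base class sketch; the signatures below are assumed, not the real ones."""

    @abstractmethod
    def format_skill_md(self, content: str) -> str: ...

    @abstractmethod
    def package(self, skill_dir: Path) -> Path: ...

    @abstractmethod
    def upload(self, package_path: Path) -> str: ...


class ExampleAdaptor(SkillAdaptor):
    """Hypothetical platform adaptor used only in this sketch."""

    def format_skill_md(self, content: str) -> str:
        return f"# Skill\n\n{content}"

    def package(self, skill_dir: Path) -> Path:
        return skill_dir.with_suffix(".zip")  # only computes the target path

    def upload(self, package_path: Path) -> str:
        return f"uploaded:{package_path.name}"  # placeholder, no network call


# Registry mapping platform names to adaptor classes.
ADAPTORS: dict[str, type[SkillAdaptor]] = {"example": ExampleAdaptor}


def get_adaptor(platform: str) -> SkillAdaptor:
    """Look up and instantiate an adaptor, with a helpful error on a miss."""
    if platform not in ADAPTORS:
        raise ValueError(
            f"Unknown platform '{platform}'. Available: {', '.join(sorted(ADAPTORS))}"
        )
    return ADAPTORS[platform]()
```

Under this layout, adding a platform means one new subclass plus one registry entry, and callers never branch on the platform name.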
Scraper pattern — each source type has: cli/<type>_scraper.py (with <Type>ToSkillConverter class + main()), arguments/<type>.py, parsers/<type>_parser.py. Register in parsers/__init__.py PARSERS list, main.py COMMAND_MODULES dict, config_validator.py VALID_SOURCE_TYPES set.
Unified pipeline — unified_scraper.py dispatches to per-type _scrape_<type>() methods. unified_skill_builder.py uses pairwise synthesis for docs+github+pdf combos and _generic_merge() for all other combinations.
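The per-type dispatch can be pictured with the sketch below. `UnifiedScraperSketch`, the shape of the source dicts, and the placeholder return values are assumptions; only the `_scrape_<type>()` naming and the `scraped_data` attribute name come from this document.

```python
class UnifiedScraperSketch:
    """Hypothetical stand-in for the orchestrator in unified_scraper.py."""

    def __init__(self) -> None:
        # Results grouped by source type, e.g. {"pdf": [...], "jupyter": [...]}.
        self.scraped_data: dict[str, list[dict]] = {}

    def scrape(self, sources: list[dict]) -> None:
        """Route each source to the method named _scrape_<type>."""
        for source in sources:
            source_type = source["type"]
            handler = getattr(self, f"_scrape_{source_type}", None)
            if handler is None:
                raise ValueError(f"Unsupported source type '{source_type}'")
            self.scraped_data.setdefault(source_type, []).append(handler(source))

    def _scrape_pdf(self, source: dict) -> dict:
        return {"source": source["path"], "pages": []}  # placeholder result

    def _scrape_jupyter(self, source: dict) -> dict:
        return {"source": source["path"], "cells": []}  # placeholder result
```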
MCP tools — grouped in mcp/tools/ by category. scrape_generic_tool handles all new source types.
CLI subcommands — git-style in cli/main.py. Each delegates to a module's main() function.
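Git-style delegation through a command-to-module mapping might be sketched like this. The real `COMMAND_MODULES` dict lives in `cli/main.py`; the `hello` command, the in-memory demo module, and the assumption that each `main()` takes an argument list and returns an exit code are all hypothetical.

```python
import importlib
import sys
import types

# Hypothetical mapping of subcommand name to module path.
COMMAND_MODULES = {"hello": "sketch_hello_command"}


def _hello_main(argv: list[str]) -> int:
    """Demo subcommand: print a greeting and return exit code 0."""
    print("hello", *argv)
    return 0


# Register a throwaway in-memory module so the sketch runs without the real package.
_demo = types.ModuleType("sketch_hello_command")
_demo.main = _hello_main
sys.modules["sketch_hello_command"] = _demo


def dispatch(argv: list[str]) -> int:
    """Route argv[0] to the matching module's main() and return its exit code."""
    if not argv or argv[0] not in COMMAND_MODULES:
        print(f"usage: <command> [args]; commands: {', '.join(sorted(COMMAND_MODULES))}")
        return 2
    module = importlib.import_module(COMMAND_MODULES[argv[0]])
    return module.main(argv[1:])
```

Importing the module only after the command is known means a broken optional dependency in one subcommand does not stop the others from starting.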
Git Workflow
- `main`: production, protected
- `development`: default PR target, active dev
- Feature branches created from `development`
Pre-commit Checklist
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x # stop on first failure
Never commit API keys. Use env vars: ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN.
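One way to read a key from the environment and fail clearly when it is missing is sketched below. The helper `require_env` is hypothetical and not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it in your shell instead of hardcoding it"
        )
    return value
```

A call such as `require_env("GITHUB_TOKEN")` keeps the secret out of source files and out of commits.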
CI
GitHub Actions (.github/workflows/tests.yml): ruff + mypy lint job, then pytest matrix (Ubuntu + macOS, Python 3.10-3.12) with Codecov upload.