skill-seekers-reference/AGENTS.md
yusyus 53b911b697 feat: add 10 new skill source types (17 total) with full pipeline integration
Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint,
RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new
skill source types. Each type is fully integrated across:

- Standalone CLI commands (skill-seekers <type>)
- Auto-detection via 'skill-seekers create' (file extension + content sniffing)
- Unified multi-source configs (scraped_data, dispatch, config validation)
- Unified skill builder (generic merge + source-attributed synthesis)
- MCP server (scrape_generic tool with per-type flag mapping)
- pyproject.toml (entry points, optional deps, [all] group)

Also fixes: EPUB unified pipeline gap, missing word/video config validators,
OpenAPI yaml import guard, MCP flag mismatch for all 10 types, stale
docstrings, and adds 77 integration tests + complex-merge workflow.

50 files changed, +20,201 lines
2026-03-15 15:30:15 +03:00

AGENTS.md - Skill Seekers

Concise reference for AI coding agents. Skill Seekers is a Python CLI tool (v3.2.0) that converts documentation sites, GitHub repos, PDFs, videos, notebooks, wikis, and more into AI-ready skills for 16+ LLM platforms and RAG pipelines.

Setup

# REQUIRED before running tests (src/ layout — tests fail without this)
pip install -e .
# With dev tools
pip install -e ".[dev]"
# With all optional deps
pip install -e ".[all]"

Build / Test / Lint Commands

# Run ALL tests (never skip tests — all must pass before commits)
pytest tests/ -v

# Run a single test file
pytest tests/test_scraper_features.py -v

# Run a single test function
pytest tests/test_scraper_features.py::test_detect_language -v

# Run a single test class method
pytest tests/test_adaptors/test_claude_adaptor.py::TestClaudeAdaptor::test_package -v

# Skip slow/integration tests
pytest tests/ -v -m "not slow and not integration"

# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term

# Lint (ruff)
ruff check src/ tests/
ruff check src/ tests/ --fix

# Format (ruff)
ruff format --check src/ tests/
ruff format src/ tests/

# Type check (mypy)
mypy src/skill_seekers --show-error-codes --pretty

Test markers: slow, integration, e2e, venv, bootstrap, benchmark.

Async tests: use @pytest.mark.asyncio; asyncio_mode is auto.

Code Style

Formatting Rules (ruff — from pyproject.toml)

  • Line length: 100 characters
  • Target Python: 3.10+
  • Enabled lint rules: E, W, F, I, B, C4, UP, ARG, SIM
  • Ignored rules: E501 (line length handled by formatter), F541 (f-string style), ARG002 (unused method args for interface compliance), B007 (intentional unused loop vars), I001 (formatter handles imports), SIM114 (readability preference)

Imports

  • Sort with isort (via ruff); skill_seekers is first-party
  • Standard library → third-party → first-party, separated by blank lines
  • Use from __future__ import annotations only if needed for forward refs
  • Guard optional imports with try/except ImportError (see adaptors/__init__.py pattern)

Naming Conventions

  • Files: snake_case.py
  • Classes: PascalCase (e.g., SkillAdaptor, ClaudeAdaptor)
  • Functions/methods: snake_case
  • Constants: UPPER_CASE (e.g., ADAPTORS, DEFAULT_CHUNK_TOKENS)
  • Private: prefix with _

Type Hints

  • Gradual typing — add hints where practical, not enforced everywhere
  • Use modern syntax: str | None not Optional[str], list[str] not List[str]
  • MyPy config: disallow_untyped_defs = false, check_untyped_defs = true, ignore_missing_imports = true

Docstrings

  • Module-level docstring on every file (triple-quoted, describes purpose)
  • Google-style or standard docstrings for public functions/classes
  • Include Args:, Returns:, Raises: sections where useful
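As an illustration of the docstring rules above, here is a hypothetical module with a module-level docstring and one Google-style function docstring. The module, the constant `EXAMPLE_CHUNK_SIZE`, and the function are invented for this sketch.

```python
"""Chunking helpers (hypothetical module, used only to illustrate docstring style)."""

EXAMPLE_CHUNK_SIZE = 512  # made-up default, not a value from the project


def split_into_chunks(
    words: list[str], chunk_size: int = EXAMPLE_CHUNK_SIZE
) -> list[list[str]]:
    """Split a word list into consecutive chunks.

    Args:
        words: Words to split, in their original order.
        chunk_size: Maximum number of words per chunk.

    Returns:
        A list of chunks; the last chunk may be shorter than chunk_size.

    Raises:
        ValueError: If chunk_size is not positive.
    """
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")
    return [words[i : i + chunk_size] for i in range(0, len(words), chunk_size)]
```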

Error Handling

  • Use specific exceptions, never bare except:
  • Provide helpful error messages with context (see get_adaptor() in adaptors/__init__.py)
  • Use raise ValueError(...) for invalid arguments, raise RuntimeError(...) for state errors
  • Guard optional dependency imports with try/except and give clear install instructions on failure
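The optional-dependency guard and the exception guidance above might look like the following. This is a minimal sketch, not the code in adaptors/__init__.py; `load_spec` is a hypothetical helper, and PyYAML stands in for any optional dependency.

```python
try:
    import yaml  # optional dependency (PyYAML)
except ImportError:
    yaml = None


def load_spec(text: str) -> dict:
    """Parse a YAML document into a dict.

    Raises:
        ImportError: If PyYAML is not installed (message includes the install command).
        ValueError: If the document's top level is not a mapping.
    """
    if yaml is None:
        raise ImportError(
            "PyYAML is required to parse YAML specs. Install it with: pip install pyyaml"
        )
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError(f"Expected a mapping at the top level, got {type(data).__name__}")
    return data
```

Deferring the failure to call time keeps the module importable when the extra is missing, so unrelated commands still work.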

Suppressing Lint Warnings

  • Use inline # noqa: XXXX comments (e.g., # noqa: F401 for re-exports, # noqa: ARG001 for required but unused params)
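A small sketch of both suppressions in context. The callback name and signature are hypothetical.

```python
# Re-exported so callers can import it from this module; unused here on purpose.
from json import JSONDecodeError  # noqa: F401


def on_page_scraped(url: str, html: str) -> str:  # noqa: ARG001
    """Callback with a fixed signature; html is required by the caller but unused."""
    return f"scraped {url}"
```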

Supported Source Types (17)

| Type | CLI Command | Config Type | Detection |
|------|-------------|-------------|-----------|
| Documentation (web) | scrape / create <url> | documentation | HTTP/HTTPS URLs |
| GitHub repo | github / create owner/repo | github | owner/repo or github.com URLs |
| PDF | pdf / create file.pdf | pdf | .pdf extension |
| Word (.docx) | word / create file.docx | word | .docx extension |
| EPUB | epub / create file.epub | epub | .epub extension |
| Video | video / create <url/file> | video | YouTube/Vimeo URLs, video extensions |
| Local codebase | analyze / create ./path | local | Directory paths |
| Jupyter Notebook | jupyter / create file.ipynb | jupyter | .ipynb extension |
| Local HTML | html / create file.html | html | .html/.htm extensions |
| OpenAPI/Swagger | openapi / create spec.yaml | openapi | .yaml/.yml with OpenAPI content |
| AsciiDoc | asciidoc / create file.adoc | asciidoc | .adoc/.asciidoc extensions |
| PowerPoint | pptx / create file.pptx | pptx | .pptx extension |
| RSS/Atom | rss / create feed.rss | rss | .rss/.atom extensions |
| Man pages | manpage / create cmd.1 | manpage | .1-.8/.man extensions |
| Confluence | confluence | confluence | API or export directory |
| Notion | notion | notion | API or export directory |
| Slack/Discord | chat | chat | Export directory or API |

Project Layout

src/skill_seekers/           # Main package (src/ layout)
  cli/                       # CLI commands and entry points
    adaptors/                # Platform adaptors (Strategy pattern, inherit SkillAdaptor)
    arguments/               # CLI argument definitions (one per source type)
    parsers/                 # Subcommand parsers (one per source type)
    storage/                 # Cloud storage (inherit BaseStorageAdaptor)
    main.py                  # Unified CLI entry point (COMMAND_MODULES dict)
    source_detector.py       # Auto-detects source type from user input
    create_command.py        # Unified `create` command routing
    config_validator.py      # VALID_SOURCE_TYPES set + per-type validation
    unified_scraper.py       # Multi-source orchestrator (scraped_data + dispatch)
    unified_skill_builder.py # Pairwise synthesis + generic merge
  mcp/                       # MCP server (FastMCP + legacy)
    tools/                   # MCP tool implementations by category
  sync/                      # Sync monitoring (Pydantic models)
  benchmark/                 # Benchmarking framework
  embedding/                 # FastAPI embedding server
  workflows/                 # 67 YAML workflow presets (includes complex-merge.yaml)
  _version.py                # Reads version from pyproject.toml
tests/                       # 115+ test files (pytest)
configs/                     # Preset JSON scraping configs
docs/                        # 80+ markdown doc files

Key Patterns

Adaptor (Strategy) pattern — all platform logic in cli/adaptors/. Inherit SkillAdaptor, implement format_skill_md(), package(), upload(). Register in adaptors/__init__.py ADAPTORS dict.
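
A minimal sketch of the adaptor pattern described above. The three method names and the `ADAPTORS` / `get_adaptor()` names come from this document; the method signatures, return values, and `DemoAdaptor` are assumptions made for illustration and may differ from the real base class.

```python
from abc import ABC, abstractmethod


class SkillAdaptor(ABC):
    """Strategy interface; signatures below are assumed, not the project's."""

    @abstractmethod
    def format_skill_md(self, content: str) -> str: ...

    @abstractmethod
    def package(self, skill_dir: str) -> str: ...

    @abstractmethod
    def upload(self, package_path: str) -> str: ...


class DemoAdaptor(SkillAdaptor):
    """Hypothetical platform adaptor."""

    def format_skill_md(self, content: str) -> str:
        return f"# Demo Skill\n\n{content}\n"

    def package(self, skill_dir: str) -> str:
        return f"{skill_dir}.zip"  # pretend archive path; nothing is written

    def upload(self, package_path: str) -> str:
        return f"uploaded:{package_path}"  # placeholder, no network call


# Registry mapping platform names to adaptor classes.
ADAPTORS: dict[str, type[SkillAdaptor]] = {"demo": DemoAdaptor}


def get_adaptor(name: str) -> SkillAdaptor:
    """Instantiate the adaptor registered under name, with a helpful error."""
    if name not in ADAPTORS:
        available = ", ".join(sorted(ADAPTORS))
        raise ValueError(f"Unknown platform '{name}'. Available: {available}")
    return ADAPTORS[name]()
```

Adding a platform then means writing one subclass and one registry entry, without touching the callers of `get_adaptor()`.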

Scraper pattern — each source type has: cli/<type>_scraper.py (with <Type>ToSkillConverter class + main()), arguments/<type>.py, parsers/<type>_parser.py. Register in parsers/__init__.py PARSERS list, main.py COMMAND_MODULES dict, config_validator.py VALID_SOURCE_TYPES set.

Unified pipeline: unified_scraper.py dispatches to per-type _scrape_<type>() methods. unified_skill_builder.py uses pairwise synthesis for docs+github+pdf combos and _generic_merge() for all other combinations.
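
Dispatching to `_scrape_<type>()` methods by name can be sketched as follows. The class name, the shape of `scraped_data`, and the handler bodies are assumptions; only the method-naming convention comes from the paragraph above.

```python
class UnifiedScraperSketch:
    """Illustrative orchestrator; not the real unified_scraper.py."""

    def __init__(self) -> None:
        # Results grouped by source type (assumed layout).
        self.scraped_data: dict[str, list[dict]] = {}

    def scrape_all(self, sources: list[dict]) -> dict[str, list[dict]]:
        """Route each source config to its _scrape_<type>() method."""
        for source in sources:
            source_type = source.get("type")
            handler = getattr(self, f"_scrape_{source_type}", None)
            if handler is None:
                raise ValueError(f"Unsupported source type: {source_type!r}")
            self.scraped_data.setdefault(source_type, []).append(handler(source))
        return self.scraped_data

    # Placeholder handlers; real ones would invoke the per-type scrapers.
    def _scrape_pdf(self, source: dict) -> dict:
        return {"path": source["path"], "pages": []}

    def _scrape_html(self, source: dict) -> dict:
        return {"path": source["path"], "sections": []}
```

With this layout, supporting a new type in the orchestrator amounts to adding one `_scrape_<type>()` method.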

MCP tools — grouped in mcp/tools/ by category. scrape_generic_tool handles all new source types.

CLI subcommands — git-style in cli/main.py. Each delegates to a module's main() function.
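
Git-style delegation through a command table could be sketched like this. It assumes `COMMAND_MODULES` maps a subcommand name to a module path, which this document does not confirm; a stand-in module is registered so the example runs without the real package.

```python
import importlib
import sys
import types


def _demo_main(argv: list[str]) -> int:
    """Stand-in for a scraper module's main()."""
    print(f"pdf scraper called with {argv}")
    return 0


# Register a fake module so import_module() can resolve it in this sketch.
_demo_module = types.ModuleType("demo_pdf_scraper")
_demo_module.main = _demo_main
sys.modules["demo_pdf_scraper"] = _demo_module

# Subcommand name -> module path (assumed shape of the real dict).
COMMAND_MODULES: dict[str, str] = {"pdf": "demo_pdf_scraper"}


def dispatch(argv: list[str]) -> int:
    """Look up the subcommand and delegate the remaining args to its main()."""
    if not argv or argv[0] not in COMMAND_MODULES:
        known = ", ".join(sorted(COMMAND_MODULES))
        raise ValueError(f"Unknown command. Expected one of: {known}")
    module = importlib.import_module(COMMAND_MODULES[argv[0]])
    return module.main(argv[1:])
```

Importing the module only at dispatch time means a missing optional dependency in one scraper does not break the other subcommands.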

Git Workflow

  • main — production, protected
  • development — default PR target, active dev
  • Feature branches created from development

Pre-commit Checklist

ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x   # stop on first failure

Never commit API keys. Use env vars: ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN.
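
One way to read a key from the environment and fail clearly when it is absent. `require_env` is a hypothetical helper, not a function from the codebase.

```python
import os


def require_env(name: str) -> str:
    """Return the named environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell instead of hard-coding it."
        )
    return value
```

For example, `require_env("GITHUB_TOKEN")` returns the token when it is exported and raises `RuntimeError` otherwise.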

CI

GitHub Actions (.github/workflows/tests.yml): ruff + mypy lint job, then pytest matrix (Ubuntu + macOS, Python 3.10-3.12) with Codecov upload.