feat: add 10 new skill source types (17 total) with full pipeline integration

Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint,
RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new
skill source types. Each type is fully integrated across:

- Standalone CLI commands (skill-seekers <type>)
- Auto-detection via 'skill-seekers create' (file extension + content sniffing)
- Unified multi-source configs (scraped_data, dispatch, config validation)
- Unified skill builder (generic merge + source-attributed synthesis)
- MCP server (scrape_generic tool with per-type flag mapping)
- pyproject.toml (entry points, optional deps, [all] group)

Also fixes: EPUB unified pipeline gap, missing word/video config validators,
OpenAPI yaml import guard, MCP flag mismatch for all 10 types, stale
docstrings, and adds 77 integration tests + complex-merge workflow.

50 files changed, +20,201 lines
Author: yusyus
Date: 2026-03-15 15:30:15 +03:00
Commit: 53b911b697 (parent 64403a3686)
50 changed files with 20193 additions and 856 deletions

AGENTS.md

@@ -1,866 +1,171 @@
# AGENTS.md - Skill Seekers
Concise reference for AI coding agents. Skill Seekers is a Python CLI tool (v3.2.0) that converts documentation sites, GitHub repos, PDFs, videos, notebooks, wikis, and more into AI-ready skills for 16+ LLM platforms and RAG pipelines.
---
## Project Overview
**Skill Seekers** is a Python CLI tool that converts documentation websites, GitHub repositories, PDF files, and videos into AI-ready skills for LLM platforms and RAG (Retrieval-Augmented Generation) pipelines. It serves as the universal preprocessing layer for AI systems.
### Key Facts
| Attribute | Value |
|-----------|-------|
| **Current Version** | 3.1.3 |
| **Python Version** | 3.10+ (tested on 3.10, 3.11, 3.12, 3.13) |
| **License** | MIT |
| **Package Name** | `skill-seekers` (PyPI) |
| **Source Files** | 182 Python files |
| **Test Files** | 105+ test files |
| **Website** | https://skillseekersweb.com/ |
| **Repository** | https://github.com/yusufkaraaslan/Skill_Seekers |
### Supported Target Platforms
| Platform | Format | Use Case |
|----------|--------|----------|
| **Claude AI** | ZIP + YAML | Claude Code skills |
| **Google Gemini** | tar.gz | Gemini skills |
| **OpenAI ChatGPT** | ZIP + Vector Store | Custom GPTs |
| **LangChain** | Documents | QA chains, agents, retrievers |
| **LlamaIndex** | TextNodes | Query engines, chat engines |
| **Haystack** | Documents | Enterprise RAG pipelines |
| **Pinecone** | Ready for upsert | Production vector search |
| **Weaviate** | Vector objects | Vector database |
| **Qdrant** | Points | Vector database |
| **Chroma** | Documents | Local vector database |
| **FAISS** | Index files | Local similarity search |
| **Cursor IDE** | .cursorrules | AI coding assistant rules |
| **Windsurf** | .windsurfrules | AI coding rules |
| **Cline** | .clinerules + MCP | VS Code extension |
| **Continue.dev** | HTTP context | Universal IDE support |
| **Generic Markdown** | ZIP | Universal export |
### Core Workflow
1. **Scrape Phase** - Crawl documentation/GitHub/PDF/video sources
2. **Build Phase** - Organize content into categorized references
3. **Enhancement Phase** - AI-powered quality improvements (optional)
4. **Package Phase** - Create platform-specific packages
5. **Upload Phase** - Auto-upload to target platform (optional)
---
## Project Structure
```
/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/
├── src/skill_seekers/ # Main source code (src/ layout)
│ ├── cli/ # CLI tools and commands (~70 modules)
│ │ ├── adaptors/ # Platform adaptors (Strategy pattern)
│ │ │ ├── base.py # Abstract base class (SkillAdaptor)
│ │ │ ├── claude.py # Claude AI adaptor
│ │ │ ├── gemini.py # Google Gemini adaptor
│ │ │ ├── openai.py # OpenAI ChatGPT adaptor
│ │ │ ├── markdown.py # Generic Markdown adaptor
│ │ │ ├── chroma.py # Chroma vector DB adaptor
│ │ │ ├── faiss_helpers.py # FAISS index adaptor
│ │ │ ├── haystack.py # Haystack RAG adaptor
│ │ │ ├── langchain.py # LangChain adaptor
│ │ │ ├── llama_index.py # LlamaIndex adaptor
│ │ │ ├── qdrant.py # Qdrant vector DB adaptor
│ │ │ ├── weaviate.py # Weaviate vector DB adaptor
│ │ │ └── streaming_adaptor.py # Streaming output adaptor
│ │ ├── arguments/ # CLI argument definitions
│ │ ├── parsers/ # Argument parsers
│ │ │ └── extractors/ # Content extractors
│ │ ├── presets/ # Preset configuration management
│ │ ├── storage/ # Cloud storage adaptors
│ │ ├── main.py # Unified CLI entry point
│ │ ├── create_command.py # Unified create command
│ │ ├── doc_scraper.py # Documentation scraper
│ │ ├── github_scraper.py # GitHub repository scraper
│ │ ├── pdf_scraper.py # PDF extraction
│ │ ├── word_scraper.py # Word document scraper
│ │ ├── video_scraper.py # Video extraction
│ │ ├── video_setup.py # GPU detection & dependency installation
│ │ ├── unified_scraper.py # Multi-source scraping
│ │ ├── codebase_scraper.py # Local codebase analysis
│ │ ├── enhance_command.py # AI enhancement command
│ │ ├── enhance_skill_local.py # AI enhancement (local mode)
│ │ ├── package_skill.py # Skill packager
│ │ ├── upload_skill.py # Upload to platforms
│ │ ├── cloud_storage_cli.py # Cloud storage CLI
│ │ ├── benchmark_cli.py # Benchmarking CLI
│ │ ├── sync_cli.py # Sync monitoring CLI
│ │ └── workflows_command.py # Workflow management CLI
│ ├── mcp/ # MCP server integration
│ │ ├── server_fastmcp.py # FastMCP server (~708 lines)
│ │ ├── server_legacy.py # Legacy server implementation
│ │ ├── server.py # Server entry point
│ │ ├── agent_detector.py # AI agent detection
│ │ ├── git_repo.py # Git repository operations
│ │ ├── source_manager.py # Config source management
│ │ └── tools/ # MCP tool implementations
│ │ ├── config_tools.py # Configuration tools
│ │ ├── packaging_tools.py # Packaging tools
│ │ ├── scraping_tools.py # Scraping tools
│ │ ├── source_tools.py # Source management tools
│ │ ├── splitting_tools.py # Config splitting tools
│ │ ├── vector_db_tools.py # Vector database tools
│ │ └── workflow_tools.py # Workflow management tools
│ ├── sync/ # Sync monitoring module
│ │ ├── detector.py # Change detection
│ │ ├── models.py # Data models (Pydantic)
│ │ ├── monitor.py # Monitoring logic
│ │ └── notifier.py # Notification system
│ ├── benchmark/ # Benchmarking framework
│ │ ├── framework.py # Benchmark framework
│ │ ├── models.py # Benchmark models
│ │ └── runner.py # Benchmark runner
│ ├── embedding/ # Embedding server
│ │ ├── server.py # FastAPI embedding server
│ │ ├── generator.py # Embedding generation
│ │ ├── cache.py # Embedding cache
│ │ └── models.py # Embedding models
│ ├── workflows/ # YAML workflow presets (66 presets)
│ ├── _version.py # Version information (reads from pyproject.toml)
│ └── __init__.py # Package init
├── tests/ # Test suite (105+ test files)
├── configs/ # Preset configuration files
├── docs/ # Documentation (80+ markdown files)
│ ├── integrations/ # Platform integration guides
│ ├── guides/ # User guides
│ ├── reference/ # API reference
│ ├── features/ # Feature documentation
│ ├── blog/ # Blog posts
│ └── roadmap/ # Roadmap documents
├── examples/ # Usage examples
├── .github/workflows/ # CI/CD workflows
├── pyproject.toml # Main project configuration
├── requirements.txt # Pinned dependencies
├── mypy.ini # MyPy type checker configuration
├── Dockerfile # Main Docker image (multi-stage)
├── Dockerfile.mcp # MCP server Docker image
└── docker-compose.yml # Full stack deployment
```
---
## Build and Development Commands
### Prerequisites
- Python 3.10 or higher
- pip or uv package manager
- Git (for GitHub scraping features)
### Setup (REQUIRED before any development)
```bash
# REQUIRED before running tests (src/ layout — tests fail without this)
pip install -e .
# Install with all platform dependencies
pip install -e ".[all-llms]"
# Install with all optional dependencies
pip install -e ".[all]"
# Install specific platforms only
pip install -e ".[gemini]" # Google Gemini support
pip install -e ".[openai]" # OpenAI ChatGPT support
pip install -e ".[mcp]" # MCP server dependencies
pip install -e ".[s3]" # AWS S3 support
pip install -e ".[gcs]" # Google Cloud Storage
pip install -e ".[azure]" # Azure Blob Storage
pip install -e ".[embedding]" # Embedding server support
pip install -e ".[rag-upload]" # Vector DB upload support
# Install dev dependencies (using dependency-groups)
pip install -e ".[dev]"
```
**CRITICAL:** The project uses a `src/` layout. Tests WILL FAIL unless you install with `pip install -e .` first.
### Building
```bash
# Build package using uv (recommended)
uv build
# Or using standard build
python -m build
# Publish to PyPI
uv publish
```
### Docker
```bash
# Build Docker image
docker build -t skill-seekers .
# Run with docker-compose (includes vector databases)
docker-compose up -d
# Run MCP server only
docker-compose up -d mcp-server
# View logs
docker-compose logs -f mcp-server
```
---
## Testing Instructions
### Running Tests
**CRITICAL:** Never skip tests - all tests must pass before commits.
```bash
# All tests (must run `pip install -e .` first!)
pytest tests/ -v
# Run a single test file
pytest tests/test_scraper_features.py -v
pytest tests/test_mcp_fastmcp.py -v
pytest tests/test_cloud_storage.py -v
# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html
# Run a single test function
pytest tests/test_scraper_features.py::test_detect_language -v
# E2E tests
pytest tests/test_e2e_three_stream_pipeline.py -v
# Run a single test class method
pytest tests/test_adaptors/test_claude_adaptor.py::TestClaudeAdaptor::test_package -v
# Skip slow tests
pytest tests/ -v -m "not slow"
# Run only integration tests
pytest tests/ -v -m integration
# Skip slow/integration tests
pytest tests/ -v -m "not slow and not integration"
```
### Test Architecture
- **105+ test files** covering all features
- **CI Matrix:** Ubuntu + macOS, Python 3.10-3.12
- Test markers defined in `pyproject.toml`:
| Marker | Description |
|--------|-------------|
| `slow` | Tests taking >5 seconds |
| `integration` | Requires external services (APIs) |
| `e2e` | End-to-end tests (resource-intensive) |
| `venv` | Requires virtual environment setup |
| `bootstrap` | Bootstrap skill specific |
| `benchmark` | Performance benchmark tests |
### Test Configuration
From `pyproject.toml`:
```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "-v --tb=short --strict-markers"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
```
The `conftest.py` file checks that the package is installed before running tests.
---
## Code Style Guidelines
### Linting and Formatting
```bash
# Lint (ruff)
ruff check src/ tests/
ruff check src/ tests/ --fix   # auto-fix issues
# Format (ruff)
ruff format --check src/ tests/
ruff format src/ tests/
# Type check (mypy)
mypy src/skill_seekers --show-error-codes --pretty
```
**Test markers:** `slow`, `integration`, `e2e`, `venv`, `bootstrap`, `benchmark`
**Async tests:** use `@pytest.mark.asyncio`; asyncio_mode is `auto`.
## Code Style
### Formatting Rules (ruff — from pyproject.toml)
- **Line length:** 100 characters
- **Target Python:** 3.10+
- **Enabled lint rules:** E, W, F, I, B, C4, UP, ARG, SIM
- **Ignored rules:** E501 (line length handled by formatter), F541 (f-string style), ARG002 (unused method args for interface compliance), B007 (intentional unused loop vars), I001 (formatter handles imports), SIM114 (readability preference)
- **Import sorting:** isort style with `skill_seekers` as first-party
### Imports
- Sort with isort (via ruff); `skill_seekers` is first-party
- Standard library → third-party → first-party, separated by blank lines
- Use `from __future__ import annotations` only if needed for forward refs
- Guard optional imports with try/except ImportError (see `adaptors/__init__.py` pattern)
### MyPy Configuration (from pyproject.toml)
```toml
[tool.mypy]
python_version = "3.10"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = false
disallow_incomplete_defs = false
check_untyped_defs = true
ignore_missing_imports = true
show_error_codes = true
pretty = true
```
### Naming Conventions
- **Files:** `snake_case.py`
- **Classes:** `PascalCase` (e.g., `SkillAdaptor`, `ClaudeAdaptor`)
- **Functions/methods:** `snake_case`
- **Constants:** `UPPER_CASE` (e.g., `ADAPTORS`, `DEFAULT_CHUNK_TOKENS`)
- **Private:** prefix with `_`
### Type Hints
- Gradual typing — add hints where practical, not enforced everywhere
- Use modern syntax: `str | None` not `Optional[str]`, `list[str]` not `List[str]`
- MyPy config: `disallow_untyped_defs = false`, `check_untyped_defs = true`, `ignore_missing_imports = true`
### Docstrings
- Module-level docstring on every file (triple-quoted, describes purpose)
- Google-style or standard docstrings for public functions/classes
- Include `Args:`, `Returns:`, `Raises:` sections where useful
### Error Handling
- Use specific exceptions, never bare `except:`
- Provide helpful error messages with context (see `get_adaptor()` in `adaptors/__init__.py`)
- Use `raise ValueError(...)` for invalid arguments, `raise RuntimeError(...)` for state errors
- Guard optional dependency imports with try/except and give clear install instructions on failure
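The optional-dependency guard described above can be sketched as follows. This is a minimal illustration — `get_optional` and the extra names are hypothetical stand-ins, not the project's actual helpers:

```python
import importlib


def get_optional(module_name: str, extra: str):
    """Import an optional dependency, or fail with clear install instructions."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message, preserving the original cause
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f'Install it with: pip install -e ".[{extra}]"'
        ) from exc


mod = get_optional("json", "core")  # stdlib module: import succeeds
try:
    get_optional("definitely_missing_pkg", "mcp")
except ImportError as e:
    print(e)
```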
### Suppressing Lint Warnings
- Use inline `# noqa: XXXX` comments (e.g., `# noqa: F401` for re-exports, `# noqa: ARG001` for required but unused params)
## Supported Source Types (17)
| Type | CLI Command | Config Type | Detection |
|------|------------|-------------|-----------|
| Documentation (web) | `scrape` / `create <url>` | `documentation` | HTTP/HTTPS URLs |
| GitHub repo | `github` / `create owner/repo` | `github` | `owner/repo` or github.com URLs |
| PDF | `pdf` / `create file.pdf` | `pdf` | `.pdf` extension |
| Word (.docx) | `word` / `create file.docx` | `word` | `.docx` extension |
| EPUB | `epub` / `create file.epub` | `epub` | `.epub` extension |
| Video | `video` / `create <url/file>` | `video` | YouTube/Vimeo URLs, video extensions |
| Local codebase | `analyze` / `create ./path` | `local` | Directory paths |
| Jupyter Notebook | `jupyter` / `create file.ipynb` | `jupyter` | `.ipynb` extension |
| Local HTML | `html` / `create file.html` | `html` | `.html`/`.htm` extensions |
| OpenAPI/Swagger | `openapi` / `create spec.yaml` | `openapi` | `.yaml`/`.yml` with OpenAPI content |
| AsciiDoc | `asciidoc` / `create file.adoc` | `asciidoc` | `.adoc`/`.asciidoc` extensions |
| PowerPoint | `pptx` / `create file.pptx` | `pptx` | `.pptx` extension |
| RSS/Atom | `rss` / `create feed.rss` | `rss` | `.rss`/`.atom` extensions |
| Man pages | `manpage` / `create cmd.1` | `manpage` | `.1`-`.8`/`.man` extensions |
| Confluence | `confluence` | `confluence` | API or export directory |
| Notion | `notion` | `notion` | API or export directory |
| Slack/Discord | `chat` | `chat` | Export directory or API |
## Project Layout
```
src/skill_seekers/ # Main package (src/ layout)
cli/ # CLI commands and entry points
adaptors/ # Platform adaptors (Strategy pattern, inherit SkillAdaptor)
arguments/ # CLI argument definitions (one per source type)
parsers/ # Subcommand parsers (one per source type)
storage/ # Cloud storage (inherit BaseStorageAdaptor)
main.py # Unified CLI entry point (COMMAND_MODULES dict)
source_detector.py # Auto-detects source type from user input
create_command.py # Unified `create` command routing
config_validator.py # VALID_SOURCE_TYPES set + per-type validation
unified_scraper.py # Multi-source orchestrator (scraped_data + dispatch)
unified_skill_builder.py # Pairwise synthesis + generic merge
mcp/ # MCP server (FastMCP + legacy)
tools/ # MCP tool implementations by category
sync/ # Sync monitoring (Pydantic models)
benchmark/ # Benchmarking framework
embedding/ # FastAPI embedding server
workflows/ # 67 YAML workflow presets (includes complex-merge.yaml)
_version.py # Reads version from pyproject.toml
tests/ # 115+ test files (pytest)
configs/ # Preset JSON scraping configs
docs/ # 80+ markdown doc files
```
### Code Conventions
1. **Use type hints** where practical (gradual typing approach)
2. **Docstrings:** Use Google-style or standard docstrings
3. **Error handling:** Use specific exceptions, provide helpful messages
4. **Async code:** Use `asyncio`, mark tests with `@pytest.mark.asyncio`
5. **File naming:** Use snake_case for all Python files
6. **Class naming:** Use PascalCase for classes
7. **Function naming:** Use snake_case for functions and methods
8. **Constants:** Use UPPER_CASE for module-level constants
## Key Patterns
**Adaptor (Strategy) pattern** — all platform logic in `cli/adaptors/`. Inherit `SkillAdaptor`, implement `format_skill_md()`, `package()`, `upload()`. Register in `adaptors/__init__.py` ADAPTORS dict.
**Scraper pattern** — each source type has: `cli/<type>_scraper.py` (with `<Type>ToSkillConverter` class + `main()`), `arguments/<type>.py`, `parsers/<type>_parser.py`. Register in `parsers/__init__.py` PARSERS list, `main.py` COMMAND_MODULES dict, `config_validator.py` VALID_SOURCE_TYPES set.
**Unified pipeline** — `unified_scraper.py` dispatches to per-type `_scrape_<type>()` methods. `unified_skill_builder.py` uses pairwise synthesis for docs+github+pdf combos and `_generic_merge()` for all other combinations.
**MCP tools** — grouped in `mcp/tools/` by category. `scrape_generic_tool` handles all new source types.
---
## Architecture Patterns
### Platform Adaptor Pattern (Strategy Pattern)
All platform-specific logic is encapsulated in adaptors:
```python
import os
from skill_seekers.cli.adaptors import get_adaptor
# Get platform-specific adaptor
adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'langchain', etc.
# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')
# Upload to platform
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY'),
)
```
Each adaptor inherits from `SkillAdaptor` base class and implements:
- `format_skill_md()` - Format SKILL.md content
- `package()` - Create platform-specific package
- `upload()` - Upload to platform API
- `validate_api_key()` - Validate API key format
- `supports_enhancement()` - Whether AI enhancement is supported
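A minimal sketch of what such an adaptor can look like. The base class here is a self-contained stand-in so the snippet runs on its own; the real `SkillAdaptor` lives in `cli/adaptors/base.py` and its signatures may differ:

```python
from abc import ABC, abstractmethod


class SkillAdaptor(ABC):  # stand-in for the real base class
    @abstractmethod
    def format_skill_md(self, skill: dict) -> str: ...

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> str: ...

    @abstractmethod
    def upload(self, package_path: str, api_key: str) -> bool: ...

    def validate_api_key(self, api_key: str) -> bool:
        return bool(api_key)

    def supports_enhancement(self) -> bool:
        return False


class MyPlatformAdaptor(SkillAdaptor):
    """Hypothetical platform adaptor following the interface above."""

    def format_skill_md(self, skill: dict) -> str:
        return f"# {skill['name']}\n\n{skill['description']}"

    def package(self, skill_dir: str, output_path: str) -> str:
        name = skill_dir.rstrip('/').split('/')[-1]
        return f"{output_path}/{name}-myplatform.zip"

    def upload(self, package_path: str, api_key: str) -> bool:
        return self.validate_api_key(api_key)
```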
### CLI Architecture (Git-style)
Entry point: `src/skill_seekers/cli/main.py`
The CLI uses subcommands that delegate to existing modules:
```bash
# skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
```
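The delegation above can be sketched like this. Names are illustrative; the real mapping lives in `main.py`'s COMMAND_MODULES dict:

```python
import sys


def doc_scraper_main() -> list[str]:
    # Stand-in for doc_scraper.main(); a real main() would parse these args
    return sys.argv[1:]


def dispatch(argv: list[str]):
    """skill-seekers <subcommand> [args...] -> delegate to the submodule's main()."""
    command_modules = {"scrape": doc_scraper_main}  # illustrative registry
    subcommand, rest = argv[0], argv[1:]
    sys.argv = [f"skill-seekers-{subcommand}", *rest]  # rewrite argv for the delegate
    return command_modules[subcommand]()


args = dispatch(["scrape", "--config", "react.json"])
print(args)  # ['--config', 'react.json']
```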
**Available subcommands:**
- `create` - Unified create command
- `config` - Configuration wizard
- `scrape` - Documentation scraping
- `github` - GitHub repository scraping
- `pdf` - PDF extraction
- `word` - Word document extraction
- `video` - Video extraction (YouTube or local). Use `--setup` to auto-detect GPU and install visual deps.
- `unified` - Multi-source scraping
- `analyze` / `codebase` - Local codebase analysis
- `enhance` - AI enhancement
- `package` - Package skill for target platform
- `upload` - Upload to platform
- `cloud` - Cloud storage operations
- `sync` - Sync monitoring
- `benchmark` - Performance benchmarking
- `embed` - Embedding server
- `install` / `install-agent` - Complete workflow
- `stream` - Streaming ingestion
- `update` - Incremental updates
- `multilang` - Multi-language support
- `quality` - Quality metrics
- `resume` - Resume interrupted jobs
- `estimate` - Estimate page counts
- `workflows` - Workflow management
### MCP Server Architecture
Two implementations:
- `server_fastmcp.py` - Modern, decorator-based (recommended, ~708 lines)
- `server_legacy.py` - Legacy implementation
Tools are organized by category:
- Config tools (3 tools): generate_config, list_configs, validate_config
- Scraping tools (10 tools): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video (supports `setup` parameter for GPU detection and visual dep installation), scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns
- Packaging tools (4 tools): package_skill, upload_skill, enhance_skill, install_skill
- Source tools (5 tools): fetch_config, submit_config, add_config_source, list_config_sources, remove_config_source
- Splitting tools (2 tools): split_config, generate_router
- Vector Database tools (4 tools): export_to_weaviate, export_to_chroma, export_to_faiss, export_to_qdrant
- Workflow tools (5 tools): list_workflows, get_workflow, create_workflow, update_workflow, delete_workflow
**Running MCP Server:**
```bash
# Stdio transport (default)
python -m skill_seekers.mcp.server_fastmcp
# HTTP transport
python -m skill_seekers.mcp.server_fastmcp --http --port 8765
```
### Cloud Storage Architecture
Abstract base class pattern for cloud providers:
- `base_storage.py` - Defines `BaseStorageAdaptor` interface
- `s3_storage.py` - AWS S3 implementation
- `gcs_storage.py` - Google Cloud Storage implementation
- `azure_storage.py` - Azure Blob Storage implementation
### Sync Monitoring Architecture
Pydantic-based models in `src/skill_seekers/sync/`:
- `models.py` - Data models (SyncConfig, ChangeReport, SyncState)
- `detector.py` - Change detection logic
- `monitor.py` - Monitoring daemon
- `notifier.py` - Notification system (webhook, email, slack)
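An illustrative shape for these models. The real classes are Pydantic models; plain dataclasses stand in here so the sketch is self-contained, and all field names are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class SyncConfig:
    """Hypothetical shape of the sync configuration."""
    config_path: str
    check_interval_hours: int = 24
    notify: list[str] = field(default_factory=list)  # e.g. ["webhook", "slack"]


@dataclass
class ChangeReport:
    """Hypothetical shape of a change-detection report."""
    source: str
    changed_urls: list[str]

    @property
    def has_changes(self) -> bool:
        return bool(self.changed_urls)


report = ChangeReport(source="godot", changed_urls=["https://docs.godotengine.org/a"])
print(report.has_changes)  # True
```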
---
**CLI subcommands** — git-style in `cli/main.py`. Each delegates to a module's `main()` function.
## Git Workflow
### Branch Structure
- **`main`** — production, always stable, protected
- **`development`** — active development, default target for PRs
- **Feature branches** — your work, created from `development`
```
main (production)
│ (only maintainer merges)
development (integration) ← default branch for PRs
│ (all contributor PRs go here)
feature branches
```
### Creating a Feature Branch
```bash
# 1. Checkout development
git checkout development
git pull upstream development
# 2. Create feature branch
git checkout -b my-feature
# 3. Make changes, commit, push
git add .
git commit -m "Add my feature"
git push origin my-feature
# 4. Create PR targeting 'development' branch
```
---
## CI/CD Configuration
### GitHub Actions Workflows
All workflows are in `.github/workflows/`:
**`tests.yml`:**
- Runs on: push/PR to `main` and `development`
- Lint job: Ruff + MyPy
- Test matrix: Ubuntu + macOS, Python 3.10-3.12
- Coverage: Uploads to Codecov
**`release.yml`:**
- Triggered on version tags (`v*`)
- Builds and publishes to PyPI using `uv`
- Creates GitHub release with changelog
**`docker-publish.yml`:**
- Builds and publishes Docker images
- Multi-architecture support (linux/amd64, linux/arm64)
**`vector-db-export.yml`:**
- Tests vector database exports
**`scheduled-updates.yml`:**
- Scheduled sync monitoring
**`quality-metrics.yml`:**
- Quality metrics tracking
**`test-vector-dbs.yml`:**
- Vector database integration tests
### Pre-commit Checks (Manual)
```bash
# Before committing, run:
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x  # stop on first failure
```
---
## Security Considerations
### API Keys and Secrets
1. **Never commit API keys** to the repository
2. **Use environment variables:**
- `ANTHROPIC_API_KEY` - Claude AI
- `GOOGLE_API_KEY` - Google Gemini
- `OPENAI_API_KEY` - OpenAI
- `GITHUB_TOKEN` - GitHub API
- `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` - AWS S3
- `GOOGLE_APPLICATION_CREDENTIALS` - GCS
- `AZURE_STORAGE_CONNECTION_STRING` - Azure
3. **Configuration storage:**
- Stored at `~/.config/skill-seekers/config.json`
- Permissions: 600 (owner read/write only)
### Rate Limit Handling
- GitHub API has rate limits (5000 requests/hour for authenticated)
- The tool has built-in rate limit handling with retry logic
- Use `--non-interactive` flag for CI/CD environments
### Custom API Endpoints
Support for Claude-compatible APIs:
```bash
export ANTHROPIC_API_KEY=your-custom-api-key
export ANTHROPIC_BASE_URL=https://custom-endpoint.com/v1
```
---
## Common Development Tasks
### Adding a New CLI Command
1. Create module in `src/skill_seekers/cli/my_command.py`
2. Implement `main()` function with argument parsing
3. Add entry point in `pyproject.toml`:
```toml
[project.scripts]
skill-seekers-my-command = "skill_seekers.cli.my_command:main"
```
4. Add subcommand handler in `src/skill_seekers/cli/main.py`
5. Add argument parser in `src/skill_seekers/cli/parsers/`
6. Add tests in `tests/test_my_command.py`
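Step 2 might look like this minimal skeleton (flag names are placeholders, not an actual skill-seekers command):

```python
import argparse


def main(argv=None) -> int:
    """Entry point with its own argument parsing, as step 2 describes."""
    parser = argparse.ArgumentParser(prog="skill-seekers-my-command")
    parser.add_argument("--config", required=True, help="path to a JSON scraping config")
    parser.add_argument("--output", default="output/", help="output directory")
    args = parser.parse_args(argv)
    print(f"processing {args.config} -> {args.output}")
    return 0


# The [project.scripts] entry point in pyproject.toml would call main()
exit_code = main(["--config", "react.json"])
```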
### Adding a New Platform Adaptor
1. Create `src/skill_seekers/cli/adaptors/my_platform.py`
2. Inherit from `SkillAdaptor` base class
3. Implement required methods: `package()`, `upload()`, `format_skill_md()`
4. Register in `src/skill_seekers/cli/adaptors/__init__.py`
5. Add optional dependencies in `pyproject.toml`
6. Add tests in `tests/test_adaptors/`
### Adding an MCP Tool
1. Implement tool logic in `src/skill_seekers/mcp/tools/category_tools.py`
2. Register in `src/skill_seekers/mcp/server_fastmcp.py`
3. Add test in `tests/test_mcp_fastmcp.py`
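A sketch of steps 1-2. The real server registers tools with FastMCP decorators; a stand-in registry mimics the shape so this runs on its own, and the tool body is purely illustrative:

```python
TOOLS = {}


def tool(fn):
    """Stand-in for the FastMCP tool decorator: record the function by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def validate_config(config_path: str) -> dict:
    """Illustrative tool logic that would live in mcp/tools/config_tools.py."""
    ok = config_path.endswith(".json")
    return {"valid": ok, "path": config_path}


print(TOOLS["validate_config"]("configs/react.json"))
```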
### Adding Cloud Storage Provider
1. Create module in `src/skill_seekers/cli/storage/my_storage.py`
2. Inherit from `BaseStorageAdaptor` base class
3. Implement required methods: `upload_file()`, `download_file()`, `list_files()`, `delete_file()`
4. Register in `src/skill_seekers/cli/storage/__init__.py`
5. Add optional dependencies in `pyproject.toml`
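A self-contained sketch of the interface (method names come from step 3 above; the base class stub and the in-memory implementation are purely illustrative):

```python
from abc import ABC, abstractmethod


class BaseStorageAdaptor(ABC):  # stand-in for the real base class
    @abstractmethod
    def upload_file(self, local: str, remote: str) -> None: ...

    @abstractmethod
    def download_file(self, remote: str, local: str) -> None: ...

    @abstractmethod
    def list_files(self, prefix: str = "") -> list[str]: ...

    @abstractmethod
    def delete_file(self, remote: str) -> None: ...


class MemoryStorage(BaseStorageAdaptor):
    """In-memory fake implementing the four required methods."""

    def __init__(self):
        self._blobs: dict[str, str] = {}

    def upload_file(self, local, remote):
        self._blobs[remote] = local

    def download_file(self, remote, local):
        if remote not in self._blobs:
            raise FileNotFoundError(remote)

    def list_files(self, prefix=""):
        return sorted(k for k in self._blobs if k.startswith(prefix))

    def delete_file(self, remote):
        self._blobs.pop(remote, None)
```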
---
## Documentation
### Project Documentation (New Structure - v3.1.0+)
**Entry Points:**
- **README.md** - Main project documentation with navigation
- **docs/README.md** - Documentation hub
- **AGENTS.md** - This file, for AI coding agents
**Getting Started (for new users):**
- `docs/getting-started/01-installation.md` - Installation guide
- `docs/getting-started/02-quick-start.md` - 3 commands to first skill
- `docs/getting-started/03-your-first-skill.md` - Complete walkthrough
- `docs/getting-started/04-next-steps.md` - Where to go from here
**User Guides (common tasks):**
- `docs/user-guide/01-core-concepts.md` - How Skill Seekers works
- `docs/user-guide/02-scraping.md` - All scraping options
- `docs/user-guide/03-enhancement.md` - AI enhancement explained
- `docs/user-guide/04-packaging.md` - Export to platforms
- `docs/user-guide/05-workflows.md` - Enhancement workflows
- `docs/user-guide/06-troubleshooting.md` - Common issues
**Reference (technical details):**
- `docs/reference/CLI_REFERENCE.md` - Complete command reference (20 commands)
- `docs/reference/MCP_REFERENCE.md` - MCP tools reference (33 tools)
- `docs/reference/CONFIG_FORMAT.md` - JSON configuration specification
- `docs/reference/ENVIRONMENT_VARIABLES.md` - All environment variables
**Advanced (power user topics):**
- `docs/advanced/mcp-server.md` - MCP server setup
- `docs/advanced/mcp-tools.md` - Advanced MCP usage
- `docs/advanced/custom-workflows.md` - Creating custom workflows
- `docs/advanced/multi-source.md` - Multi-source scraping
### Configuration Documentation
Preset configs are in `configs/` directory:
- `godot.json` / `godot_unified.json` - Godot Engine
- `blender.json` / `blender-unified.json` - Blender
- `claude-code.json` - Claude Code
- `httpx_comprehensive.json` - HTTPX library
- `medusa-mercurjs.json` - Medusa/MercurJS
- `astrovalley_unified.json` - Astrovalley
- `react.json` - React documentation
- `configs/integrations/` - Integration-specific configs
---
## Key Dependencies
### Core Dependencies (Required)
| Package | Version | Purpose |
|---------|---------|---------|
| `requests` | >=2.32.5 | HTTP requests |
| `beautifulsoup4` | >=4.14.2 | HTML parsing |
| `PyGithub` | >=2.5.0 | GitHub API |
| `GitPython` | >=3.1.40 | Git operations |
| `httpx` | >=0.28.1 | Async HTTP |
| `anthropic` | >=0.76.0 | Claude AI API |
| `PyMuPDF` | >=1.24.14 | PDF processing |
| `Pillow` | >=11.0.0 | Image processing |
| `pytesseract` | >=0.3.13 | OCR |
| `pydantic` | >=2.12.3 | Data validation |
| `pydantic-settings` | >=2.11.0 | Settings management |
| `click` | >=8.3.0 | CLI framework |
| `Pygments` | >=2.19.2 | Syntax highlighting |
| `pathspec` | >=0.12.1 | Path matching |
| `networkx` | >=3.0 | Graph operations |
| `schedule` | >=1.2.0 | Scheduled tasks |
| `python-dotenv` | >=1.1.1 | Environment variables |
| `jsonschema` | >=4.25.1 | JSON validation |
| `PyYAML` | >=6.0 | YAML parsing |
| `langchain` | >=1.2.10 | LangChain integration |
| `llama-index` | >=0.14.15 | LlamaIndex integration |
### Optional Dependencies
| Feature | Package | Install Command |
|---------|---------|-----------------|
| MCP Server | `mcp>=1.25,<2` | `pip install -e ".[mcp]"` |
| Google Gemini | `google-generativeai>=0.8.0` | `pip install -e ".[gemini]"` |
| OpenAI | `openai>=1.0.0` | `pip install -e ".[openai]"` |
| AWS S3 | `boto3>=1.34.0` | `pip install -e ".[s3]"` |
| Google Cloud Storage | `google-cloud-storage>=2.10.0` | `pip install -e ".[gcs]"` |
| Azure Blob Storage | `azure-storage-blob>=12.19.0` | `pip install -e ".[azure]"` |
| Word Documents | `mammoth>=1.6.0`, `python-docx>=1.1.0` | `pip install -e ".[docx]"` |
| Video (lightweight) | `yt-dlp>=2024.12.0`, `youtube-transcript-api>=1.2.0` | `pip install -e ".[video]"` |
| Video (full) | +`faster-whisper`, `scenedetect`, `opencv-python-headless` (`easyocr` now installed via `--setup`) | `pip install -e ".[video-full]"` |
| Video (GPU setup) | Auto-detects GPU, installs PyTorch + easyocr + all visual deps | `skill-seekers video --setup` |
| Chroma DB | `chromadb>=0.4.0` | `pip install -e ".[chroma]"` |
| Weaviate | `weaviate-client>=3.25.0` | `pip install -e ".[weaviate]"` |
| Pinecone | `pinecone>=5.0.0` | `pip install -e ".[pinecone]"` |
| Embedding Server | `fastapi>=0.109.0`, `uvicorn>=0.27.0`, `sentence-transformers>=2.3.0` | `pip install -e ".[embedding]"` |
### Dev Dependencies (in dependency-groups)
| Package | Version | Purpose |
|---------|---------|---------|
| `pytest` | >=8.4.2 | Testing framework |
| `pytest-asyncio` | >=0.24.0 | Async test support |
| `pytest-cov` | >=7.0.0 | Coverage |
| `coverage` | >=7.11.0 | Coverage reporting |
| `ruff` | >=0.14.13 | Linting/formatting |
| `mypy` | >=1.19.1 | Type checking |
| `psutil` | >=5.9.0 | Process utilities for testing |
| `numpy` | >=1.24.0 | Numerical operations |
| `starlette` | >=0.31.0 | HTTP transport testing |
| `httpx` | >=0.24.0 | HTTP client for testing |
| `boto3` | >=1.26.0 | AWS S3 testing |
| `google-cloud-storage` | >=2.10.0 | GCS testing |
| `azure-storage-blob` | >=12.17.0 | Azure testing |
---
## Troubleshooting
### Common Issues
**ImportError: No module named 'skill_seekers'**
- Solution: Run `pip install -e .`
**Tests failing with "package not installed"**
- Solution: Ensure you ran `pip install -e .` in the correct virtual environment
**MCP server import errors**
- Solution: Install with `pip install -e ".[mcp]"`
**Type checking failures**
- MyPy is configured to be lenient (gradual typing)
- Focus on critical paths, not full coverage
**Docker build failures**
- Ensure you have BuildKit enabled: `DOCKER_BUILDKIT=1`
- Check that all submodules are initialized: `git submodule update --init`
**Rate limit errors from GitHub**
- Set `GITHUB_TOKEN` environment variable for authenticated requests
- Improves rate limit from 60 to 5000 requests/hour
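As an illustration, the token-based auth can be sketched like this (header names follow the GitHub REST API; the helper itself is hypothetical, not project code):

```python
import os

def github_headers() -> dict[str, str]:
    """Build GitHub API request headers, adding auth when GITHUB_TOKEN is set."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Authenticated requests get 5,000 req/hour instead of 60
        headers["Authorization"] = f"Bearer {token}"
    return headers
```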
### Getting Help
- Check **TROUBLESHOOTING.md** for detailed solutions
- Review **docs/FAQ.md** for common questions
- Visit https://skillseekersweb.com/ for documentation
- Open an issue on GitHub with:
- Clear title and description
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version)
- Error messages and stack traces
---
## Environment Variables Reference
| Variable | Purpose | Required For |
|----------|---------|--------------|
| `ANTHROPIC_API_KEY` | Claude AI API access | Claude enhancement/upload |
| `GOOGLE_API_KEY` | Google Gemini API access | Gemini enhancement/upload |
| `OPENAI_API_KEY` | OpenAI API access | OpenAI enhancement/upload |
| `GITHUB_TOKEN` | GitHub API authentication | GitHub scraping (recommended) |
| `AWS_ACCESS_KEY_ID` | AWS S3 authentication | S3 cloud storage |
| `AWS_SECRET_ACCESS_KEY` | AWS S3 authentication | S3 cloud storage |
| `GOOGLE_APPLICATION_CREDENTIALS` | GCS authentication path | GCS cloud storage |
| `AZURE_STORAGE_CONNECTION_STRING` | Azure Blob authentication | Azure cloud storage |
| `ANTHROPIC_BASE_URL` | Custom Claude endpoint | Custom API endpoints |
| `SKILL_SEEKERS_HOME` | Data directory path | Docker/runtime |
| `SKILL_SEEKERS_OUTPUT` | Output directory path | Docker/runtime |
---
## Version Management
The version is defined in `pyproject.toml` and dynamically read by `src/skill_seekers/_version.py`:
```python
# _version.py reads from pyproject.toml
__version__ = get_version() # Returns version from pyproject.toml
```
**To update version:**
1. Edit `version` in `pyproject.toml`
2. The `_version.py` file will automatically pick up the new version
---
## Configuration File Format
Skill Seekers uses JSON configuration files to define scraping targets. Example structure:
```json
{
"name": "godot",
"description": "Godot Engine documentation",
"merge_mode": "claude-enhanced",
"sources": [
{
"type": "documentation",
"base_url": "https://docs.godotengine.org/en/stable/",
"extract_api": true,
"selectors": {
"main_content": "div[role='main']",
"title": "title",
"code_blocks": "pre"
},
"url_patterns": {
"include": [],
"exclude": ["/search.html", "/_static/"]
},
"categories": {
"getting_started": ["introduction", "getting_started"],
"scripting": ["scripting", "gdscript"]
},
"rate_limit": 0.5,
"max_pages": 500
},
{
"type": "github",
"repo": "godotengine/godot",
"enable_codebase_analysis": true,
"code_analysis_depth": "deep",
"fetch_issues": true,
"max_issues": 100
}
]
}
```
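A minimal loader sketch for configs of this shape (field names as in the example above; the project's actual config validator performs far more thorough per-type checks):

```python
import json

REQUIRED_TOP_LEVEL = ("name", "sources")

def load_config(text: str) -> dict:
    """Parse a Skill Seekers-style JSON config and check its basic shape."""
    cfg = json.loads(text)
    for field in REQUIRED_TOP_LEVEL:
        if field not in cfg:
            raise ValueError(f"config missing required field: {field}")
    for i, src in enumerate(cfg["sources"]):
        if "type" not in src:
            raise ValueError(f"sources[{i}] missing 'type'")
    return cfg

sample = '{"name": "demo", "sources": [{"type": "github", "repo": "octocat/hello"}]}'
print(load_config(sample)["sources"][0]["type"])
```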
---
## Workflow Presets
Skill Seekers includes 66 YAML workflow presets for AI enhancement in `src/skill_seekers/workflows/`:
**Built-in presets:**
- `default.yaml` - Standard enhancement workflow
- `minimal.yaml` - Fast, minimal enhancement
- `security-focus.yaml` - Security-focused review
- `architecture-comprehensive.yaml` - Deep architecture analysis
- `api-documentation.yaml` - API documentation focus
- And 61 more specialized presets...
**Usage:**
```bash
# Apply a preset
skill-seekers create ./my-project --enhance-workflow security-focus
# Chain multiple presets
skill-seekers create ./my-project --enhance-workflow security-focus --enhance-workflow minimal
# Manage presets
skill-seekers workflows list
skill-seekers workflows show security-focus
skill-seekers workflows copy security-focus
```
---
*This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.*
*Last updated: 2026-03-01*
GitHub Actions (`.github/workflows/tests.yml`): ruff + mypy lint job, then pytest matrix (Ubuntu + macOS, Python 3.10-3.12) with Codecov upload.

View File

@@ -8,6 +8,77 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Added
#### 10 New Skill Source Types (17 total)
Skill Seekers now supports 17 source types — up from 7. Every new type is fully integrated into the CLI (`skill-seekers <type>`), `create` command auto-detection, unified multi-source configs, config validation, the MCP server, and the skill builder.
- **Jupyter Notebook** — `skill-seekers jupyter --notebook file.ipynb` or `skill-seekers create file.ipynb`
- Extracts markdown cells, code cells with outputs, kernel metadata, imports, and language detection
- Handles single files and directories of notebooks; filters `.ipynb_checkpoints`
- Optional dependency: `pip install "skill-seekers[jupyter]"` (nbformat)
- Entry point: `skill-seekers-jupyter`
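The cell extraction described above can be sketched with plain `json` (the real scraper uses `nbformat`; this simplified illustration also ignores raw cells):

```python
import json

def split_cells(notebook_json: str) -> tuple[list[str], list[str]]:
    """Separate a notebook's markdown and code cell sources."""
    nb = json.loads(notebook_json)
    md, code = [], []
    for cell in nb.get("cells", []):
        text = "".join(cell.get("source", []))
        (md if cell["cell_type"] == "markdown" else code).append(text)
    return md, code

sample = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Intro"]},
    {"cell_type": "code", "source": ["print('hi')"]},
]})
print(split_cells(sample))
```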
- **Local HTML** — `skill-seekers html --html-path file.html` or `skill-seekers create file.html`
- Parses HTML using BeautifulSoup with smart main content detection (`<article>`, `<main>`, `.content`, largest div)
- Extracts headings, code blocks, tables (to markdown), images, links; converts inline HTML to markdown
- Handles single files and directories; supports `.html`, `.htm`, `.xhtml` extensions
- No extra dependencies (BeautifulSoup is a core dep)
- **OpenAPI/Swagger** — `skill-seekers openapi --spec spec.yaml` or `skill-seekers create spec.yaml`
- Parses OpenAPI 3.0/3.1 and Swagger 2.0 specs from YAML or JSON (local files or URLs via `--spec-url`)
- Extracts endpoints, parameters, request/response schemas, security schemes, tags
- Resolves `$ref` references with circular reference protection; handles `allOf`/`oneOf`/`anyOf`
- Groups endpoints by tags; generates comprehensive API reference markdown
- Source detection sniffs YAML file content for `openapi:` or `swagger:` keys (avoids false positives on non-API YAML files)
- No extra dependency required (`pyyaml` is already a core dep; an import guard was added for safety)
- **AsciiDoc** — `skill-seekers asciidoc --asciidoc-path file.adoc` or `skill-seekers create file.adoc`
- Regex-based parser (no external library required) with optional `asciidoc` library support
- Extracts headings (= through =====), `[source,lang]` code blocks, `|===` tables, admonitions (NOTE/TIP/WARNING/IMPORTANT/CAUTION), and `include::` directives
- Converts AsciiDoc formatting to markdown; handles single files and directories
- Optional dependency: `pip install "skill-seekers[asciidoc]"` (asciidoc library for advanced rendering)
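A regex sketch of the `[source,lang]` block extraction (simplified; the project's parser handles more delimiter variants):

```python
import re

_CODE_BLOCK = re.compile(r"\[source,(\w+)\]\s*\n----\n(.*?)\n----", re.DOTALL)

def extract_code_blocks(text: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for [source,lang] ---- delimited blocks."""
    return [(m.group(1), m.group(2)) for m in _CODE_BLOCK.finditer(text)]

doc = "== Title\n\n[source,python]\n----\nprint('hi')\n----\n"
print(extract_code_blocks(doc))
```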
- **PowerPoint (.pptx)** — `skill-seekers pptx --pptx file.pptx` or `skill-seekers create file.pptx`
- Extracts slide text, speaker notes, tables, images (with alt text), and grouped shapes
- Detects code blocks by monospace font analysis (30+ font families)
- Groups slides into sections by layout type; handles single files and directories
- Optional dependency: `pip install "skill-seekers[pptx]"` (python-pptx)
- **RSS/Atom Feeds** — `skill-seekers rss --feed-url <url>` / `--feed-path file.rss` or `skill-seekers create feed.rss`
- Parses RSS 2.0, RSS 1.0, and Atom feeds via feedparser
- Optionally follows article links (`--follow-links`, default on) to scrape full page content using BeautifulSoup
- Extracts article titles, summaries, authors, dates, categories; configurable `--max-articles` (default 50)
- Source detection matches `.rss` and `.atom` extensions (`.xml` excluded to avoid false positives)
- Optional dependency: `pip install "skill-seekers[rss]"` (feedparser)
- **Man Pages** — `skill-seekers manpage --man-names git,curl` / `--man-path dir/` or `skill-seekers create git.1`
- Extracts man pages by running the `man` command via subprocess or reading `.1`-`.8`/`.man` files directly
- Handles gzip/bzip2/xz compressed man files; strips troff/groff formatting (backspace overstriking, macros, font escapes)
- Parses structured sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, EXAMPLES, SEE ALSO)
- Source detection uses basename heuristic to avoid false positives on log rotation files (e.g., `access.log.1`)
- No external dependencies (stdlib only)
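The overstrike stripping and the basename heuristic can be sketched as (hypothetical helpers, not the project's exact code):

```python
import os
import re

def strip_overstrike(text: str) -> str:
    """Remove nroff backspace overstriking (bold and underline sequences)."""
    return re.sub(".\x08", "", text)

def looks_like_man_file(path: str) -> bool:
    """True for names like 'git.1' but not rotated logs like 'access.log.1'."""
    base = os.path.basename(path)
    if base.endswith(".man"):
        return True
    stem, _, ext = base.rpartition(".")
    return ext.isdigit() and 1 <= int(ext) <= 8 and "." not in stem

print(strip_overstrike("N\x08NA\x08AM\x08ME\x08E"))
print(looks_like_man_file("git.1"), looks_like_man_file("access.log.1"))
```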
- **Confluence** — `skill-seekers confluence --base-url <url> --space-key <key>` or `--export-path dir/`
- API mode: fetches pages from Confluence REST API with pagination (`atlassian-python-api`)
- Export mode: parses Confluence HTML/XML export directories
- Extracts page content, code/panel/info/warning macros, page hierarchy, tables
- Optional dependency: `pip install "skill-seekers[confluence]"` (atlassian-python-api)
- **Notion** — `skill-seekers notion --database-id <id>` / `--page-id <id>` or `--export-path dir/`
- API mode: fetches pages via Notion API with support for 20+ block types (paragraph, heading, code, callout, toggle, table, etc.)
- Export mode: parses Notion Markdown/CSV export directories
- Extracts rich text with annotations (bold, italic, code, links), 16+ property types for database entries
- Optional dependency: `pip install "skill-seekers[notion]"` (notion-client)
- **Slack/Discord Chat** — `skill-seekers chat --export-path dir/` or `--token <token> --channel <channel>`
- Slack: parses workspace JSON exports or fetches via Slack Web API (`slack_sdk`)
- Discord: parses DiscordChatExporter JSON or fetches via Discord HTTP API
- Extracts messages, code snippets (fenced blocks), shared URLs, threads, reactions, attachments
- Generates per-channel summaries and topic categorization
- Optional dependency: `pip install "skill-seekers[chat]"` (slack-sdk)
#### EPUB Unified Pipeline Integration
- **EPUB (.epub) input support** via `skill-seekers create book.epub` or `skill-seekers epub --epub book.epub`
- Extracts chapters, metadata (Dublin Core), code blocks, images, and tables from EPUB 2 and EPUB 3 files
- DRM detection with clear error messages (Adobe ADEPT, Apple FairPlay, Readium LCP)
@@ -16,6 +87,61 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `--help-epub` flag for EPUB-specific help
- Optional dependency: `pip install "skill-seekers[epub]"` (ebooklib)
- 107 tests across 14 test classes
- **EPUB added to unified scraper** — `_scrape_epub()` method, `scraped_data["epub"]`, config validation (`_validate_epub_source`), and dry-run display. Previously EPUB worked standalone but was missing from multi-source configs.
#### Unified Skill Builder — Generic Merge System
- **`_generic_merge()`** — Priority-based section merge for any combination of source types not covered by existing pairwise synthesis (docs+github, docs+pdf, etc.). Produces YAML frontmatter + source-attributed sections.
- **`_append_extra_sources()`** — Appends additional source type content (e.g., Jupyter + PPTX) to pairwise-synthesized SKILL.md.
- **`_generate_generic_references()`** — Generates `references/<type>/index.md` for any source type, with ID resolution fallback chain.
- **`_SOURCE_LABELS`** dict — Human-readable labels for all 17 source types used in merge attribution.
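The priority-based merge can be sketched as follows (labels and priority order are illustrative, abbreviated from the 17 types):

```python
_SOURCE_LABELS = {"documentation": "Documentation", "github": "GitHub Repository",
                  "jupyter": "Jupyter Notebooks"}  # subset for illustration
_PRIORITY = ["documentation", "github", "jupyter"]

def generic_merge(sections: dict[str, str]) -> str:
    """Merge per-source content in priority order with source attribution."""
    parts = []
    for stype in _PRIORITY:
        if stype in sections:
            parts.append(f"## {_SOURCE_LABELS[stype]}\n\n{sections[stype]}")
    return "\n\n".join(parts)

print(generic_merge({"jupyter": "Notebook notes.", "documentation": "Docs overview."}))
```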
#### Config Validator Expansion
- **17 source types in `VALID_SOURCE_TYPES`** — All new types plus `word` and `video` now have per-type validation methods.
- **`_validate_word_source()`** — Validates `path` field for Word documents (was previously missing).
- **`_validate_video_source()`** — Validates `url`, `path`, or `playlist` field for video sources (was previously missing).
- **11 new `_validate_*_source()` methods** — One for each new type with appropriate required-field checks.
#### Source Detection Improvements
- **7 new file extension detections** in `SourceDetector.detect()``.ipynb`, `.html`/`.htm`, `.pptx`, `.adoc`/`.asciidoc`, `.rss`/`.atom`, `.1`-`.8`/`.man`, `.yaml`/`.yml` (with content sniffing)
- **`_looks_like_openapi()`** — Content sniffing for YAML files: only classifies as OpenAPI if the file contains `openapi:` or `swagger:` key in first 20 lines (prevents false positives on docker-compose, Ansible, Kubernetes manifests, etc.)
- **Man page basename heuristic** — `.1`-`.8` extensions only detected as man pages if the basename has no dots (e.g., `git.1` matches but `access.log.1` does not)
- **`.xml` excluded from RSS detection** — Too generic; only `.rss` and `.atom` trigger RSS detection
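The content sniffing can be sketched like this (the real method takes a file path; a string is used here for self-containment):

```python
def looks_like_openapi(text: str, max_lines: int = 20) -> bool:
    """Classify YAML content as an OpenAPI/Swagger spec only if a top-level
    'openapi:' or 'swagger:' key appears within the first max_lines lines."""
    for line in text.splitlines()[:max_lines]:
        if line.startswith(("openapi:", "swagger:")):
            return True
    return False

print(looks_like_openapi("openapi: 3.1.0\ninfo:\n  title: Demo\n"))
print(looks_like_openapi("version: '3'\nservices:\n  web:\n    image: nginx\n"))
```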
#### MCP Server Integration
- **`scrape_generic` tool** — New MCP tool handles all 10 new source types via subprocess with per-type flag mapping
- **`_PATH_FLAGS` / `_URL_FLAGS` dicts** — Correct flag routing for each source type (e.g., jupyter→`--notebook`, html→`--html-path`, rss→`--feed-url`)
- **`GENERIC_SOURCE_TYPES` tuple** — Lists all 10 new types for validation
- **Config validation display** — `validate_config` tool now shows source details for all new types
- **Tool count updated** — 33 → 34 tools (scraping tools 10 → 11)
#### CLI Wiring
- **10 new CLI subcommands** — `jupyter`, `html`, `openapi`, `asciidoc`, `pptx`, `rss`, `manpage`, `confluence`, `notion`, `chat` in `COMMAND_MODULES`
- **10 new argument modules** — `arguments/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}.py` with per-type `*_ARGUMENTS` dicts
- **10 new parser modules** — `parsers/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}_parser.py` with `SubcommandParser` implementations
- **`create` command routing** — `_route_generic()` method for all new types with correct module names and CLI flags
- **10 new entry points** in pyproject.toml — `skill-seekers-{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}`
- **7 new optional dependency groups** in pyproject.toml — `[jupyter]`, `[asciidoc]`, `[pptx]`, `[confluence]`, `[notion]`, `[rss]`, `[chat]`
- **`[all]` group updated** — Includes all 7 new optional dependencies
#### Workflow & Documentation
- **`complex-merge.yaml`** — New 7-stage AI-powered workflow for complex multi-source merging (source inventory → cross-reference → conflict detection → priority merge → gap analysis → synthesis → quality check)
- **AGENTS.md rewritten** — Updated with all 17 source types, scraper pattern docs, project layout, and key pattern documentation
- **77 new integration tests** in `test_new_source_types.py` — Source detection, config validation, generic merge, CLI wiring, validation, and create command routing
### Fixed
- **Config validator missing `word` and `video` dispatch** — `_validate_source()` had no `elif` branches for `word` or `video` types, silently skipping validation. Added dispatch entries and `_validate_word_source()` / `_validate_video_source()` methods.
- **`openapi_scraper.py` unconditional `import yaml`** — Would crash at import time if pyyaml not installed. Added `try/except ImportError` guard with `YAML_AVAILABLE` flag and `_check_yaml_deps()` helper.
- **`asciidoc_scraper.py` missing standard arguments** — `main()` manually defined args instead of using `add_asciidoc_arguments()`. Refactored to use shared argument definitions + added enhancement workflow integration.
- **`pptx_scraper.py` missing standard arguments** — Same issue. Refactored to use `add_pptx_arguments()`.
- **`chat_scraper.py` missing standard arguments** — Same issue. Refactored to use `add_chat_arguments()`.
- **`notion_scraper.py` missing `run_workflows` call** — `--enhance-workflow` flags were silently ignored. Added workflow runner integration.
- **`openapi_scraper.py` return type `None`** — `main()` returned `None` instead of `int`. Fixed to `return 0` on success, matching all other scrapers.
- **MCP `scrape_generic_tool` flag mismatch** — Was passing `--path`/`--url` as generic flags, but every scraper expects its own flag name (e.g., `--notebook`, `--html-path`, `--spec`). All 10 source types would have failed at runtime. Fixed with per-type `_PATH_FLAGS` and `_URL_FLAGS` mappings.
- **Word scraper `docx_id` key mismatch** — Unified scraper data dict used `docx_id` but generic reference generation looked for `word_id`. Added `word_id` alias.
- **`main.py` docstring stale** — Missing all 10 new commands. Updated to list all 27 commands.
- **`source_detector.py` module docstring stale** — Described only 5 source types. Updated to describe 14+ detected types.
- **`manpage_parser.py` docstring referenced wrong file** — Said `manpage_scraper.py` but actual file is `man_scraper.py`. Fixed.
- **Parser registry test count** — Updated expected count from 25 to 35 for 10 new parsers.
## [3.2.0] - 2026-03-01

View File

@@ -168,6 +168,35 @@ all-cloud = [
"azure-storage-blob>=12.19.0",
]
# New source type dependencies (v3.2.0+)
jupyter = [
"nbformat>=5.9.0",
]
asciidoc = [
"asciidoc>=10.0.0",
]
pptx = [
"python-pptx>=0.6.21",
]
confluence = [
"atlassian-python-api>=3.41.0",
]
notion = [
"notion-client>=2.0.0",
]
rss = [
"feedparser>=6.0.0",
]
chat = [
"slack-sdk>=3.27.0",
]
# Embedding server support
embedding = [
"fastapi>=0.109.0",
@@ -204,6 +233,14 @@ all = [
"sentence-transformers>=2.3.0",
"numpy>=1.24.0",
"voyageai>=0.2.0",
# New source types (v3.2.0+)
"nbformat>=5.9.0",
"asciidoc>=10.0.0",
"python-pptx>=0.6.21",
"atlassian-python-api>=3.41.0",
"notion-client>=2.0.0",
"feedparser>=6.0.0",
"slack-sdk>=3.27.0",
]
[project.urls]
@@ -253,6 +290,18 @@ skill-seekers-quality = "skill_seekers.cli.quality_metrics:main"
skill-seekers-workflows = "skill_seekers.cli.workflows_command:main"
skill-seekers-sync-config = "skill_seekers.cli.sync_config:main"
# New source type entry points (v3.2.0+)
skill-seekers-jupyter = "skill_seekers.cli.jupyter_scraper:main"
skill-seekers-html = "skill_seekers.cli.html_scraper:main"
skill-seekers-openapi = "skill_seekers.cli.openapi_scraper:main"
skill-seekers-asciidoc = "skill_seekers.cli.asciidoc_scraper:main"
skill-seekers-pptx = "skill_seekers.cli.pptx_scraper:main"
skill-seekers-rss = "skill_seekers.cli.rss_scraper:main"
skill-seekers-manpage = "skill_seekers.cli.man_scraper:main"
skill-seekers-confluence = "skill_seekers.cli.confluence_scraper:main"
skill-seekers-notion = "skill_seekers.cli.notion_scraper:main"
skill-seekers-chat = "skill_seekers.cli.chat_scraper:main"
[tool.setuptools]
package-dir = {"" = "src"}

View File

@@ -0,0 +1,68 @@
"""AsciiDoc command argument definitions.
This module defines ALL arguments for the asciidoc command in ONE place.
Both asciidoc_scraper.py (standalone) and parsers/asciidoc_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# AsciiDoc-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
ASCIIDOC_ARGUMENTS: dict[str, dict[str, Any]] = {
"asciidoc_path": {
"flags": ("--asciidoc-path",),
"kwargs": {
"type": str,
"help": "Path to AsciiDoc file or directory containing .adoc files",
"metavar": "PATH",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_asciidoc_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all asciidoc command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds AsciiDoc-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for AsciiDoc.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for AsciiDoc
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for AsciiDoc), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# AsciiDoc-specific args
for arg_name, arg_def in ASCIIDOC_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,102 @@
"""Chat command argument definitions.
This module defines ALL arguments for the chat command in ONE place.
Both chat_scraper.py (standalone) and parsers/chat_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# Chat-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
CHAT_ARGUMENTS: dict[str, dict[str, Any]] = {
"export_path": {
"flags": ("--export-path",),
"kwargs": {
"type": str,
"help": "Path to chat export directory or file",
"metavar": "PATH",
},
},
"platform": {
"flags": ("--platform",),
"kwargs": {
"type": str,
"choices": ["slack", "discord"],
"default": "slack",
"help": "Chat platform type (default: slack)",
},
},
"token": {
"flags": ("--token",),
"kwargs": {
"type": str,
"help": "API token for chat platform authentication",
"metavar": "TOKEN",
},
},
"channel": {
"flags": ("--channel",),
"kwargs": {
"type": str,
"help": "Channel name or ID to extract from",
"metavar": "CHANNEL",
},
},
"max_messages": {
"flags": ("--max-messages",),
"kwargs": {
"type": int,
"default": 10000,
"help": "Maximum number of messages to extract (default: 10000)",
"metavar": "N",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_chat_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all chat command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds Chat-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for Chat.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for Chat
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for Chat), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# Chat-specific args
for arg_name, arg_def in CHAT_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,109 @@
"""Confluence command argument definitions.
This module defines ALL arguments for the confluence command in ONE place.
Both confluence_scraper.py (standalone) and parsers/confluence_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# Confluence-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
CONFLUENCE_ARGUMENTS: dict[str, dict[str, Any]] = {
"base_url": {
"flags": ("--base-url",),
"kwargs": {
"type": str,
"help": "Confluence instance base URL",
"metavar": "URL",
},
},
"space_key": {
"flags": ("--space-key",),
"kwargs": {
"type": str,
"help": "Confluence space key to extract from",
"metavar": "KEY",
},
},
"export_path": {
"flags": ("--export-path",),
"kwargs": {
"type": str,
"help": "Path to Confluence HTML/XML export directory",
"metavar": "PATH",
},
},
"username": {
"flags": ("--username",),
"kwargs": {
"type": str,
"help": "Confluence username for API authentication",
"metavar": "USER",
},
},
"token": {
"flags": ("--token",),
"kwargs": {
"type": str,
"help": "Confluence API token for authentication",
"metavar": "TOKEN",
},
},
"max_pages": {
"flags": ("--max-pages",),
"kwargs": {
"type": int,
"default": 500,
"help": "Maximum number of pages to extract (default: 500)",
"metavar": "N",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_confluence_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all confluence command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds Confluence-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for Confluence.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for Confluence
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for Confluence), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# Confluence-specific args
for arg_name, arg_def in CONFLUENCE_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -549,6 +549,121 @@ CONFIG_ARGUMENTS: dict[str, dict[str, Any]] = {
# For unified config files, use `skill-seekers unified --fresh` directly.
}
# New source type arguments (v3.2.0+)
# These are minimal dicts since most flags are handled by each scraper's own argument module.
# The create command only needs the primary input flag for routing.
JUPYTER_ARGUMENTS: dict[str, dict[str, Any]] = {
"notebook": {
"flags": ("--notebook",),
"kwargs": {"type": str, "help": "Jupyter Notebook file path (.ipynb)", "metavar": "PATH"},
},
}
HTML_ARGUMENTS: dict[str, dict[str, Any]] = {
"html_path": {
"flags": ("--html-path",),
"kwargs": {"type": str, "help": "Local HTML file or directory path", "metavar": "PATH"},
},
}
OPENAPI_ARGUMENTS: dict[str, dict[str, Any]] = {
"spec": {
"flags": ("--spec",),
"kwargs": {"type": str, "help": "OpenAPI/Swagger spec file path", "metavar": "PATH"},
},
"spec_url": {
"flags": ("--spec-url",),
"kwargs": {"type": str, "help": "OpenAPI/Swagger spec URL", "metavar": "URL"},
},
}
ASCIIDOC_ARGUMENTS: dict[str, dict[str, Any]] = {
"asciidoc_path": {
"flags": ("--asciidoc-path",),
"kwargs": {"type": str, "help": "AsciiDoc file or directory path", "metavar": "PATH"},
},
}
PPTX_ARGUMENTS: dict[str, dict[str, Any]] = {
"pptx": {
"flags": ("--pptx",),
"kwargs": {"type": str, "help": "PowerPoint file path (.pptx)", "metavar": "PATH"},
},
}
RSS_ARGUMENTS: dict[str, dict[str, Any]] = {
"feed_url": {
"flags": ("--feed-url",),
"kwargs": {"type": str, "help": "RSS/Atom feed URL", "metavar": "URL"},
},
"feed_path": {
"flags": ("--feed-path",),
"kwargs": {"type": str, "help": "RSS/Atom feed file path", "metavar": "PATH"},
},
}
MANPAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
"man_names": {
"flags": ("--man-names",),
"kwargs": {
"type": str,
"help": "Comma-separated man page names (e.g., 'git,curl')",
"metavar": "NAMES",
},
},
"man_path": {
"flags": ("--man-path",),
"kwargs": {"type": str, "help": "Directory of man page files", "metavar": "PATH"},
},
}
CONFLUENCE_ARGUMENTS: dict[str, dict[str, Any]] = {
"conf_base_url": {
"flags": ("--conf-base-url",),
"kwargs": {"type": str, "help": "Confluence base URL", "metavar": "URL"},
},
"space_key": {
"flags": ("--space-key",),
"kwargs": {"type": str, "help": "Confluence space key", "metavar": "KEY"},
},
"conf_export_path": {
"flags": ("--conf-export-path",),
"kwargs": {"type": str, "help": "Confluence export directory", "metavar": "PATH"},
},
}
NOTION_ARGUMENTS: dict[str, dict[str, Any]] = {
"database_id": {
"flags": ("--database-id",),
"kwargs": {"type": str, "help": "Notion database ID", "metavar": "ID"},
},
"page_id": {
"flags": ("--page-id",),
"kwargs": {"type": str, "help": "Notion page ID", "metavar": "ID"},
},
"notion_export_path": {
"flags": ("--notion-export-path",),
"kwargs": {"type": str, "help": "Notion export directory", "metavar": "PATH"},
},
}
CHAT_ARGUMENTS: dict[str, dict[str, Any]] = {
"chat_export_path": {
"flags": ("--chat-export-path",),
"kwargs": {"type": str, "help": "Slack/Discord export directory", "metavar": "PATH"},
},
"platform": {
"flags": ("--platform",),
"kwargs": {
"type": str,
"choices": ["slack", "discord"],
"default": "slack",
"help": "Chat platform (default: slack)",
},
},
}
# =============================================================================
# TIER 3: ADVANCED/RARE ARGUMENTS
# =============================================================================
@@ -613,6 +728,17 @@ def get_source_specific_arguments(source_type: str) -> dict[str, dict[str, Any]]
"epub": EPUB_ARGUMENTS,
"video": VIDEO_ARGUMENTS,
"config": CONFIG_ARGUMENTS,
# New source types (v3.2.0+)
"jupyter": JUPYTER_ARGUMENTS,
"html": HTML_ARGUMENTS,
"openapi": OPENAPI_ARGUMENTS,
"asciidoc": ASCIIDOC_ARGUMENTS,
"pptx": PPTX_ARGUMENTS,
"rss": RSS_ARGUMENTS,
"manpage": MANPAGE_ARGUMENTS,
"confluence": CONFLUENCE_ARGUMENTS,
"notion": NOTION_ARGUMENTS,
"chat": CHAT_ARGUMENTS,
}
return source_args.get(source_type, {})
@@ -703,6 +829,24 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
for arg_name, arg_def in CONFIG_ARGUMENTS.items():
parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
# New source types (v3.2.0+)
_NEW_SOURCE_ARGS = {
"jupyter": JUPYTER_ARGUMENTS,
"html": HTML_ARGUMENTS,
"openapi": OPENAPI_ARGUMENTS,
"asciidoc": ASCIIDOC_ARGUMENTS,
"pptx": PPTX_ARGUMENTS,
"rss": RSS_ARGUMENTS,
"manpage": MANPAGE_ARGUMENTS,
"confluence": CONFLUENCE_ARGUMENTS,
"notion": NOTION_ARGUMENTS,
"chat": CHAT_ARGUMENTS,
}
for stype, sargs in _NEW_SOURCE_ARGS.items():
if mode in [stype, "all"]:
for arg_name, arg_def in sargs.items():
parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
# Add advanced arguments if requested
if mode in ["advanced", "all"]:
for arg_name, arg_def in ADVANCED_ARGUMENTS.items():

View File

@@ -0,0 +1,68 @@
"""HTML command argument definitions.
This module defines ALL arguments for the html command in ONE place.
Both html_scraper.py (standalone) and parsers/html_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# HTML-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
HTML_ARGUMENTS: dict[str, dict[str, Any]] = {
"html_path": {
"flags": ("--html-path",),
"kwargs": {
"type": str,
"help": "Path to HTML file or directory containing HTML files",
"metavar": "PATH",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_html_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all html command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds HTML-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for HTML.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for HTML
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for HTML), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# HTML-specific args
for arg_name, arg_def in HTML_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)
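The `add_html_arguments` helper above mutates an already-registered default by walking `parser._actions`. That technique can be sketched in isolation (standalone names, not the project's modules; note `_actions` is a private argparse attribute, so this relies on CPython's current behavior):

```python
import argparse

# Register a shared flag with one default, then override it per command,
# mirroring how add_html_arguments resets --enhance-level to 0.
parser = argparse.ArgumentParser()
parser.add_argument("--enhance-level", type=int, default=1)

# argparse keeps every registered option as an Action in parser._actions.
for action in parser._actions:
    if getattr(action, "dest", None) == "enhance_level":
        action.default = 0

args = parser.parse_args([])
```

An explicit `--enhance-level 2` on the command line still wins over the overridden default.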

View File

@@ -0,0 +1,68 @@
"""Jupyter Notebook command argument definitions.
This module defines ALL arguments for the jupyter command in ONE place.
Both jupyter_scraper.py (standalone) and parsers/jupyter_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# Jupyter-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
JUPYTER_ARGUMENTS: dict[str, dict[str, Any]] = {
"notebook": {
"flags": ("--notebook",),
"kwargs": {
"type": str,
"help": "Path to .ipynb file or directory containing notebooks",
"metavar": "PATH",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_jupyter_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all jupyter command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds Jupyter-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for Jupyter.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for Jupyter
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for Jupyter), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# Jupyter-specific args
for arg_name, arg_def in JUPYTER_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,84 @@
"""Man page command argument definitions.
This module defines ALL arguments for the manpage command in ONE place.
Both manpage_scraper.py (standalone) and parsers/manpage_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# ManPage-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
MANPAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
"man_names": {
"flags": ("--man-names",),
"kwargs": {
"type": str,
"help": "Comma-separated list of man page names (e.g., 'ls,grep,find')",
"metavar": "NAMES",
},
},
"man_path": {
"flags": ("--man-path",),
"kwargs": {
"type": str,
"help": "Path to directory containing man page files",
"metavar": "PATH",
},
},
"sections": {
"flags": ("--sections",),
"kwargs": {
"type": str,
"help": "Comma-separated section numbers to include (e.g., '1,3,8')",
"metavar": "SECTIONS",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_manpage_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all manpage command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds ManPage-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for ManPage.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for ManPage
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for ManPage), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# ManPage-specific args
for arg_name, arg_def in MANPAGE_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,101 @@
"""Notion command argument definitions.
This module defines ALL arguments for the notion command in ONE place.
Both notion_scraper.py (standalone) and parsers/notion_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# Notion-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
NOTION_ARGUMENTS: dict[str, dict[str, Any]] = {
"database_id": {
"flags": ("--database-id",),
"kwargs": {
"type": str,
"help": "Notion database ID to extract from",
"metavar": "ID",
},
},
"page_id": {
"flags": ("--page-id",),
"kwargs": {
"type": str,
"help": "Notion page ID to extract from",
"metavar": "ID",
},
},
"export_path": {
"flags": ("--export-path",),
"kwargs": {
"type": str,
"help": "Path to Notion export directory",
"metavar": "PATH",
},
},
"token": {
"flags": ("--token",),
"kwargs": {
"type": str,
"help": "Notion integration token for API authentication",
"metavar": "TOKEN",
},
},
"max_pages": {
"flags": ("--max-pages",),
"kwargs": {
"type": int,
"default": 500,
"help": "Maximum number of pages to extract (default: 500)",
"metavar": "N",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_notion_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all notion command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds Notion-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for Notion.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for Notion
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for Notion), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# Notion-specific args
for arg_name, arg_def in NOTION_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,76 @@
"""OpenAPI command argument definitions.
This module defines ALL arguments for the openapi command in ONE place.
Both openapi_scraper.py (standalone) and parsers/openapi_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# OpenAPI-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
OPENAPI_ARGUMENTS: dict[str, dict[str, Any]] = {
"spec": {
"flags": ("--spec",),
"kwargs": {
"type": str,
"help": "Path to OpenAPI/Swagger spec file",
"metavar": "PATH",
},
},
"spec_url": {
"flags": ("--spec-url",),
"kwargs": {
"type": str,
"help": "URL to OpenAPI/Swagger spec",
"metavar": "URL",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_openapi_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all openapi command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds OpenAPI-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for OpenAPI.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for OpenAPI
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for OpenAPI), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# OpenAPI-specific args
for arg_name, arg_def in OPENAPI_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,68 @@
"""PPTX command argument definitions.
This module defines ALL arguments for the pptx command in ONE place.
Both pptx_scraper.py (standalone) and parsers/pptx_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# PPTX-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
PPTX_ARGUMENTS: dict[str, dict[str, Any]] = {
"pptx": {
"flags": ("--pptx",),
"kwargs": {
"type": str,
"help": "Path to PowerPoint file (.pptx)",
"metavar": "PATH",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_pptx_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all pptx command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds PPTX-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for PPTX.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for PPTX
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for PPTX), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# PPTX-specific args
for arg_name, arg_def in PPTX_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)

View File

@@ -0,0 +1,101 @@
"""RSS command argument definitions.
This module defines ALL arguments for the rss command in ONE place.
Both rss_scraper.py (standalone) and parsers/rss_parser.py (unified CLI)
import and use these definitions.
Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""
import argparse
from typing import Any
from .common import add_all_standard_arguments
# RSS-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
RSS_ARGUMENTS: dict[str, dict[str, Any]] = {
"feed_url": {
"flags": ("--feed-url",),
"kwargs": {
"type": str,
"help": "URL of the RSS/Atom feed",
"metavar": "URL",
},
},
"feed_path": {
"flags": ("--feed-path",),
"kwargs": {
"type": str,
"help": "Path to local RSS/Atom feed file",
"metavar": "PATH",
},
},
"follow_links": {
"flags": ("--follow-links",),
"kwargs": {
"action": "store_true",
"default": True,
"help": "Follow article links and extract full content (default: True)",
},
},
"no_follow_links": {
"flags": ("--no-follow-links",),
"kwargs": {
"action": "store_false",
"dest": "follow_links",
"help": "Do not follow article links; use feed summary only",
},
},
"max_articles": {
"flags": ("--max-articles",),
"kwargs": {
"type": int,
"default": 50,
"help": "Maximum number of articles to extract (default: 50)",
"metavar": "N",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_rss_arguments(parser: argparse.ArgumentParser) -> None:
"""Add all rss command arguments to a parser.
Registers shared args (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
then adds RSS-specific args on top.
The default for --enhance-level is overridden to 0 (disabled) for RSS.
"""
# Shared universal args first
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for RSS
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for RSS), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
# RSS-specific args
for arg_name, arg_def in RSS_ARGUMENTS.items():
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)
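The `--follow-links`/`--no-follow-links` pair above uses `store_true` plus a `store_false` action writing to the same `dest`, so either flag flips a single boolean. A minimal standalone sketch:

```python
import argparse

parser = argparse.ArgumentParser()
# Positive flag: default True, so passing --follow-links just confirms it.
parser.add_argument("--follow-links", action="store_true", default=True)
# Negative flag: writes False into the same destination.
parser.add_argument("--no-follow-links", action="store_false", dest="follow_links")

default_run = parser.parse_args([])
negated_run = parser.parse_args(["--no-follow-links"])
```

On Python 3.9+, `argparse.BooleanOptionalAction` can generate such a `--flag`/`--no-flag` pair from one `add_argument` call.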

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -7,6 +7,19 @@ Validates unified config format that supports multiple sources:
- github (repository scraping)
- pdf (PDF document scraping)
- local (local codebase analysis)
- word (Word .docx document scraping)
- video (video transcript/visual extraction)
- epub (EPUB e-book extraction)
- jupyter (Jupyter Notebook extraction)
- html (local HTML file extraction)
- openapi (OpenAPI/Swagger spec extraction)
- asciidoc (AsciiDoc document extraction)
- pptx (PowerPoint presentation extraction)
- confluence (Confluence wiki extraction)
- notion (Notion page extraction)
- rss (RSS/Atom feed extraction)
- manpage (man page extraction)
- chat (Slack/Discord chat export extraction)
Legacy config format support removed in v2.11.0.
All configs must use unified format with 'sources' array.
@@ -27,7 +40,25 @@ class ConfigValidator:
"""
# Valid source types
VALID_SOURCE_TYPES = {"documentation", "github", "pdf", "local", "word", "video"}
VALID_SOURCE_TYPES = {
"documentation",
"github",
"pdf",
"local",
"word",
"video",
"epub",
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"confluence",
"notion",
"rss",
"manpage",
"chat",
}
# Valid merge modes
VALID_MERGE_MODES = {"rule-based", "claude-enhanced"}
@@ -159,6 +190,32 @@ class ConfigValidator:
self._validate_pdf_source(source, index)
elif source_type == "local":
self._validate_local_source(source, index)
elif source_type == "word":
self._validate_word_source(source, index)
elif source_type == "video":
self._validate_video_source(source, index)
elif source_type == "epub":
self._validate_epub_source(source, index)
elif source_type == "jupyter":
self._validate_jupyter_source(source, index)
elif source_type == "html":
self._validate_html_source(source, index)
elif source_type == "openapi":
self._validate_openapi_source(source, index)
elif source_type == "asciidoc":
self._validate_asciidoc_source(source, index)
elif source_type == "pptx":
self._validate_pptx_source(source, index)
elif source_type == "confluence":
self._validate_confluence_source(source, index)
elif source_type == "notion":
self._validate_notion_source(source, index)
elif source_type == "rss":
self._validate_rss_source(source, index)
elif source_type == "manpage":
self._validate_manpage_source(source, index)
elif source_type == "chat":
self._validate_chat_source(source, index)
def _validate_documentation_source(self, source: dict[str, Any], index: int):
"""Validate documentation source configuration."""
@@ -253,12 +310,126 @@ class ConfigValidator:
f"Source {index} (local): Invalid ai_mode '{ai_mode}'. Must be one of {self.VALID_AI_MODES}"
)
def _validate_word_source(self, source: dict[str, Any], index: int):
"""Validate Word document (.docx) source configuration."""
if "path" not in source:
raise ValueError(f"Source {index} (word): Missing required field 'path'")
word_path = source["path"]
if not Path(word_path).exists():
logger.warning(f"Source {index} (word): File not found: {word_path}")
def _validate_video_source(self, source: dict[str, Any], index: int):
"""Validate video source configuration."""
has_url = "url" in source
has_path = "path" in source
has_playlist = "playlist" in source
if not has_url and not has_path and not has_playlist:
raise ValueError(
f"Source {index} (video): Missing required field 'url', 'path', or 'playlist'"
)
def _validate_epub_source(self, source: dict[str, Any], index: int):
"""Validate EPUB source configuration."""
if "path" not in source:
raise ValueError(f"Source {index} (epub): Missing required field 'path'")
epub_path = source["path"]
if not Path(epub_path).exists():
logger.warning(f"Source {index} (epub): File not found: {epub_path}")
def _validate_jupyter_source(self, source: dict[str, Any], index: int):
"""Validate Jupyter Notebook source configuration."""
if "path" not in source:
raise ValueError(f"Source {index} (jupyter): Missing required field 'path'")
nb_path = source["path"]
if not Path(nb_path).exists():
logger.warning(f"Source {index} (jupyter): Path not found: {nb_path}")
def _validate_html_source(self, source: dict[str, Any], index: int):
"""Validate local HTML source configuration."""
if "path" not in source:
raise ValueError(f"Source {index} (html): Missing required field 'path'")
html_path = source["path"]
if not Path(html_path).exists():
logger.warning(f"Source {index} (html): Path not found: {html_path}")
def _validate_openapi_source(self, source: dict[str, Any], index: int):
"""Validate OpenAPI/Swagger source configuration."""
if "path" not in source and "url" not in source:
raise ValueError(f"Source {index} (openapi): Missing required field 'path' or 'url'")
if "path" in source and not Path(source["path"]).exists():
logger.warning(f"Source {index} (openapi): File not found: {source['path']}")
def _validate_asciidoc_source(self, source: dict[str, Any], index: int):
"""Validate AsciiDoc source configuration."""
if "path" not in source:
raise ValueError(f"Source {index} (asciidoc): Missing required field 'path'")
adoc_path = source["path"]
if not Path(adoc_path).exists():
logger.warning(f"Source {index} (asciidoc): Path not found: {adoc_path}")
def _validate_pptx_source(self, source: dict[str, Any], index: int):
"""Validate PowerPoint source configuration."""
if "path" not in source:
raise ValueError(f"Source {index} (pptx): Missing required field 'path'")
pptx_path = source["path"]
if not Path(pptx_path).exists():
logger.warning(f"Source {index} (pptx): File not found: {pptx_path}")
def _validate_confluence_source(self, source: dict[str, Any], index: int):
"""Validate Confluence source configuration."""
has_url = "url" in source or "base_url" in source
has_path = "path" in source
if not has_url and not has_path:
raise ValueError(
f"Source {index} (confluence): Missing required field 'url'/'base_url' "
f"(for API) or 'path' (for export)"
)
if has_url and "space_key" not in source and "path" not in source:
logger.warning(f"Source {index} (confluence): No 'space_key' specified for API mode")
def _validate_notion_source(self, source: dict[str, Any], index: int):
"""Validate Notion source configuration."""
has_url = "url" in source or "database_id" in source or "page_id" in source
has_path = "path" in source
if not has_url and not has_path:
raise ValueError(
f"Source {index} (notion): Missing required field 'url'/'database_id'/'page_id' "
f"(for API) or 'path' (for export)"
)
def _validate_rss_source(self, source: dict[str, Any], index: int):
"""Validate RSS/Atom feed source configuration."""
if "url" not in source and "path" not in source:
raise ValueError(f"Source {index} (rss): Missing required field 'url' or 'path'")
def _validate_manpage_source(self, source: dict[str, Any], index: int):
"""Validate man page source configuration."""
if "path" not in source and "names" not in source:
raise ValueError(f"Source {index} (manpage): Missing required field 'path' or 'names'")
if "path" in source and not Path(source["path"]).exists():
logger.warning(f"Source {index} (manpage): Path not found: {source['path']}")
def _validate_chat_source(self, source: dict[str, Any], index: int):
"""Validate Slack/Discord chat source configuration."""
has_path = "path" in source
has_api = "token" in source or "webhook_url" in source
has_channel = "channel" in source or "channel_id" in source
if not has_path and not has_api:
raise ValueError(
f"Source {index} (chat): Missing required field 'path' (for export) "
f"or 'token' (for API)"
)
if has_api and not has_channel:
logger.warning(
f"Source {index} (chat): No 'channel' or 'channel_id' specified for API mode"
)
def get_sources_by_type(self, source_type: str) -> list[dict[str, Any]]:
"""
Get all sources of a specific type.
Args:
source_type: 'documentation', 'github', 'pdf', or 'local'
source_type: Any valid source type string
Returns:
List of sources matching the type

File diff suppressed because it is too large

View File

@@ -140,6 +140,26 @@ class CreateCommand:
return self._route_video()
elif self.source_info.type == "config":
return self._route_config()
elif self.source_info.type == "jupyter":
return self._route_generic("jupyter_scraper", "--notebook")
elif self.source_info.type == "html":
return self._route_generic("html_scraper", "--html-path")
elif self.source_info.type == "openapi":
return self._route_generic("openapi_scraper", "--spec")
elif self.source_info.type == "asciidoc":
return self._route_generic("asciidoc_scraper", "--asciidoc-path")
elif self.source_info.type == "pptx":
return self._route_generic("pptx_scraper", "--pptx")
elif self.source_info.type == "rss":
return self._route_generic("rss_scraper", "--feed-path")
elif self.source_info.type == "manpage":
return self._route_generic("man_scraper", "--man-path")
elif self.source_info.type == "confluence":
return self._route_generic("confluence_scraper", "--export-path")
elif self.source_info.type == "notion":
return self._route_generic("notion_scraper", "--export-path")
elif self.source_info.type == "chat":
return self._route_generic("chat_scraper", "--export-path")
else:
logger.error(f"Unknown source type: {self.source_info.type}")
return 1
@@ -485,6 +505,40 @@ class CreateCommand:
finally:
sys.argv = original_argv
def _route_generic(self, module_name: str, file_flag: str) -> int:
"""Generic routing for new source types.
Most new source types (jupyter, html, openapi, asciidoc, pptx, rss,
manpage, confluence, notion, chat) follow the same pattern:
import module, build argv with --flag <file_path>, add common args, call main().
Args:
module_name: Python module name under skill_seekers.cli (e.g., "jupyter_scraper")
file_flag: CLI flag for the source file (e.g., "--notebook")
Returns:
Exit code from scraper
"""
import importlib
module = importlib.import_module(f"skill_seekers.cli.{module_name}")
argv = [module_name]
file_path = self.source_info.parsed.get("file_path", "")
if file_path:
argv.extend([file_flag, file_path])
self._add_common_args(argv)
logger.debug(f"Calling {module_name} with argv: {argv}")
original_argv = sys.argv
try:
sys.argv = argv
return module.main()
finally:
sys.argv = original_argv
def _add_common_args(self, argv: list[str]) -> None:
"""Add truly universal arguments to argv list.

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -15,7 +15,17 @@ Commands:
word Extract from Word (.docx) file
epub Extract from EPUB e-book (.epub)
video Extract from video (YouTube or local)
unified Multi-source scraping (docs + GitHub + PDF)
jupyter Extract from Jupyter Notebook (.ipynb)
html Extract from local HTML files
openapi Extract from OpenAPI/Swagger spec
asciidoc Extract from AsciiDoc documents (.adoc)
pptx Extract from PowerPoint (.pptx)
rss Extract from RSS/Atom feeds
manpage Extract from man pages
confluence Extract from Confluence wiki
notion Extract from Notion pages
chat Extract from Slack/Discord chat exports
unified Multi-source scraping (docs + GitHub + PDF + more)
analyze Analyze local codebase and extract code knowledge
enhance AI-powered enhancement (auto: API or LOCAL mode)
enhance-status Check enhancement status (for background/daemon modes)
@@ -70,6 +80,17 @@ COMMAND_MODULES = {
"quality": "skill_seekers.cli.quality_metrics",
"workflows": "skill_seekers.cli.workflows_command",
"sync-config": "skill_seekers.cli.sync_config",
# New source types (v3.2.0+)
"jupyter": "skill_seekers.cli.jupyter_scraper",
"html": "skill_seekers.cli.html_scraper",
"openapi": "skill_seekers.cli.openapi_scraper",
"asciidoc": "skill_seekers.cli.asciidoc_scraper",
"pptx": "skill_seekers.cli.pptx_scraper",
"rss": "skill_seekers.cli.rss_scraper",
"manpage": "skill_seekers.cli.man_scraper",
"confluence": "skill_seekers.cli.confluence_scraper",
"notion": "skill_seekers.cli.notion_scraper",
"chat": "skill_seekers.cli.chat_scraper",
}

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -33,6 +33,18 @@ from .quality_parser import QualityParser
from .workflows_parser import WorkflowsParser
from .sync_config_parser import SyncConfigParser
# New source type parsers (v3.2.0+)
from .jupyter_parser import JupyterParser
from .html_parser import HtmlParser
from .openapi_parser import OpenAPIParser
from .asciidoc_parser import AsciiDocParser
from .pptx_parser import PptxParser
from .rss_parser import RssParser
from .manpage_parser import ManPageParser
from .confluence_parser import ConfluenceParser
from .notion_parser import NotionParser
from .chat_parser import ChatParser
# Registry of all parsers (in order of usage frequency)
PARSERS = [
CreateParser(), # NEW: Unified create command (placed first for prominence)
@@ -60,6 +72,17 @@ PARSERS = [
QualityParser(),
WorkflowsParser(),
SyncConfigParser(),
# New source types (v3.2.0+)
JupyterParser(),
HtmlParser(),
OpenAPIParser(),
AsciiDocParser(),
PptxParser(),
RssParser(),
ManPageParser(),
ConfluenceParser(),
NotionParser(),
ChatParser(),
]
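The `PARSERS` registry above lets the top-level CLI build every subcommand in one loop instead of hand-wiring each one. A minimal sketch of that pattern (toy parser classes, not the project's `SubcommandParser` base):

```python
import argparse

class Sub:
    # Toy stand-in for SubcommandParser: a name plus an add_arguments hook.
    def __init__(self, name: str, flag: str):
        self.name, self.flag = name, flag

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument(self.flag, type=str)

REGISTRY = [Sub("jupyter", "--notebook"), Sub("rss", "--feed-url")]

root = argparse.ArgumentParser(prog="skill-seekers")
subparsers = root.add_subparsers(dest="command")
for p in REGISTRY:
    p.add_arguments(subparsers.add_parser(p.name))

args = root.parse_args(["jupyter", "--notebook", "demo.ipynb"])
```

Adding a new source type then reduces to appending one entry to the registry, which is exactly what this commit does ten times over.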

View File

@@ -0,0 +1,32 @@
"""AsciiDoc subcommand parser.
Uses shared argument definitions from arguments.asciidoc to ensure
consistency with the standalone asciidoc_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.asciidoc import add_asciidoc_arguments
class AsciiDocParser(SubcommandParser):
"""Parser for asciidoc subcommand."""
@property
def name(self) -> str:
return "asciidoc"
@property
def help(self) -> str:
return "Extract from AsciiDoc documents (.adoc)"
@property
def description(self) -> str:
return "Extract content from AsciiDoc documents (.adoc) and generate skill"
def add_arguments(self, parser):
"""Add asciidoc-specific arguments.
Uses shared argument definitions to ensure consistency
with asciidoc_scraper.py (standalone scraper).
"""
add_asciidoc_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""Chat subcommand parser.
Uses shared argument definitions from arguments.chat to ensure
consistency with the standalone chat_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.chat import add_chat_arguments
class ChatParser(SubcommandParser):
"""Parser for chat subcommand."""
@property
def name(self) -> str:
return "chat"
@property
def help(self) -> str:
return "Extract from Slack/Discord chat exports"
@property
def description(self) -> str:
return "Extract content from Slack/Discord chat exports and generate skill"
def add_arguments(self, parser):
"""Add chat-specific arguments.
Uses shared argument definitions to ensure consistency
with chat_scraper.py (standalone scraper).
"""
add_chat_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""Confluence subcommand parser.
Uses shared argument definitions from arguments.confluence to ensure
consistency with the standalone confluence_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.confluence import add_confluence_arguments
class ConfluenceParser(SubcommandParser):
"""Parser for confluence subcommand."""
@property
def name(self) -> str:
return "confluence"
@property
def help(self) -> str:
return "Extract from Confluence wiki"
@property
def description(self) -> str:
return "Extract content from Confluence wiki and generate skill"
def add_arguments(self, parser):
"""Add confluence-specific arguments.
Uses shared argument definitions to ensure consistency
with confluence_scraper.py (standalone scraper).
"""
add_confluence_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""HTML subcommand parser.
Uses shared argument definitions from arguments.html to ensure
consistency with the standalone html_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.html import add_html_arguments
class HtmlParser(SubcommandParser):
"""Parser for html subcommand."""
@property
def name(self) -> str:
return "html"
@property
def help(self) -> str:
return "Extract from local HTML files (.html/.htm)"
@property
def description(self) -> str:
return "Extract content from local HTML files (.html/.htm) and generate skill"
def add_arguments(self, parser):
"""Add html-specific arguments.
Uses shared argument definitions to ensure consistency
with html_scraper.py (standalone scraper).
"""
add_html_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""Jupyter Notebook subcommand parser.
Uses shared argument definitions from arguments.jupyter to ensure
consistency with the standalone jupyter_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.jupyter import add_jupyter_arguments
class JupyterParser(SubcommandParser):
"""Parser for jupyter subcommand."""
@property
def name(self) -> str:
return "jupyter"
@property
def help(self) -> str:
return "Extract from Jupyter Notebook (.ipynb)"
@property
def description(self) -> str:
return "Extract content from Jupyter Notebook (.ipynb) and generate skill"
def add_arguments(self, parser):
"""Add jupyter-specific arguments.
Uses shared argument definitions to ensure consistency
with jupyter_scraper.py (standalone scraper).
"""
add_jupyter_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""Man page subcommand parser.
Uses shared argument definitions from arguments.manpage to ensure
consistency with the standalone man_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.manpage import add_manpage_arguments
class ManPageParser(SubcommandParser):
"""Parser for manpage subcommand."""
@property
def name(self) -> str:
return "manpage"
@property
def help(self) -> str:
return "Extract from man pages"
@property
def description(self) -> str:
return "Extract content from man pages and generate skill"
def add_arguments(self, parser):
"""Add manpage-specific arguments.
Uses shared argument definitions to ensure consistency
with man_scraper.py (standalone scraper).
"""
add_manpage_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""Notion subcommand parser.
Uses shared argument definitions from arguments.notion to ensure
consistency with the standalone notion_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.notion import add_notion_arguments
class NotionParser(SubcommandParser):
"""Parser for notion subcommand."""
@property
def name(self) -> str:
return "notion"
@property
def help(self) -> str:
return "Extract from Notion pages"
@property
def description(self) -> str:
return "Extract content from Notion pages and generate skill"
def add_arguments(self, parser):
"""Add notion-specific arguments.
Uses shared argument definitions to ensure consistency
with notion_scraper.py (standalone scraper).
"""
add_notion_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""OpenAPI subcommand parser.
Uses shared argument definitions from arguments.openapi to ensure
consistency with the standalone openapi_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.openapi import add_openapi_arguments
class OpenAPIParser(SubcommandParser):
"""Parser for openapi subcommand."""
@property
def name(self) -> str:
return "openapi"
@property
def help(self) -> str:
return "Extract from OpenAPI/Swagger spec"
@property
def description(self) -> str:
return "Extract content from OpenAPI/Swagger spec and generate skill"
def add_arguments(self, parser):
"""Add openapi-specific arguments.
Uses shared argument definitions to ensure consistency
with openapi_scraper.py (standalone scraper).
"""
add_openapi_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""PPTX subcommand parser.
Uses shared argument definitions from arguments.pptx to ensure
consistency with the standalone pptx_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.pptx import add_pptx_arguments
class PptxParser(SubcommandParser):
"""Parser for pptx subcommand."""
@property
def name(self) -> str:
return "pptx"
@property
def help(self) -> str:
return "Extract from PowerPoint presentations (.pptx)"
@property
def description(self) -> str:
return "Extract content from PowerPoint presentations (.pptx) and generate skill"
def add_arguments(self, parser):
"""Add pptx-specific arguments.
Uses shared argument definitions to ensure consistency
with pptx_scraper.py (standalone scraper).
"""
add_pptx_arguments(parser)

View File

@@ -0,0 +1,32 @@
"""RSS subcommand parser.
Uses shared argument definitions from arguments.rss to ensure
consistency with the standalone rss_scraper module.
"""
from .base import SubcommandParser
from skill_seekers.cli.arguments.rss import add_rss_arguments
class RssParser(SubcommandParser):
"""Parser for rss subcommand."""
@property
def name(self) -> str:
return "rss"
@property
def help(self) -> str:
return "Extract from RSS/Atom feeds"
@property
def description(self) -> str:
return "Extract content from RSS/Atom feeds and generate skill"
def add_arguments(self, parser):
"""Add rss-specific arguments.
Uses shared argument definitions to ensure consistency
with rss_scraper.py (standalone scraper).
"""
add_rss_arguments(parser)

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -1,7 +1,12 @@
"""Source type detection for unified create command.
Auto-detects whether a source is a web URL, GitHub repository,
local directory, PDF file, or config file based on patterns.
Auto-detects source type from user input: web URLs, GitHub repos,
local directories, and 14+ file types (PDF, DOCX, EPUB, IPYNB, HTML, YAML/OpenAPI,
AsciiDoc, PPTX, RSS/Atom, man pages, video files, and config JSON).
Note: Confluence, Notion, and Slack/Discord chat sources are API/export-based
and cannot be auto-detected from a single argument. Use their dedicated
subcommands (``skill-seekers confluence``, ``notion``, ``chat``) instead.
"""
import os
@@ -66,11 +71,49 @@ class SourceDetector:
if source.endswith(".epub"):
return cls._detect_epub(source)
if source.endswith(".ipynb"):
return cls._detect_jupyter(source)
if source.lower().endswith((".html", ".htm")):
return cls._detect_html(source)
if source.endswith(".pptx"):
return cls._detect_pptx(source)
if source.lower().endswith((".adoc", ".asciidoc")):
return cls._detect_asciidoc(source)
# Man page file extensions (.1 through .8, .man)
# Only match if the basename looks like a man page (e.g., "git.1", not "access.log.1"):
# the basename without the extension must be a plausible command name
if source.lower().endswith(".man"):
return cls._detect_manpage(source)
MAN_SECTION_EXTENSIONS = (".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8")
if source.lower().endswith(MAN_SECTION_EXTENSIONS):
# Heuristic: man pages have a simple basename (no dots before extension)
# e.g., "git.1" is a man page, "access.log.1" is not
basename_no_ext = os.path.splitext(os.path.basename(source))[0]
if "." not in basename_no_ext:
return cls._detect_manpage(source)
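The heuristic above can be exercised in isolation. This standalone sketch (the helper name is hypothetical, not part of the codebase) mirrors the extension-plus-basename check:

```python
import os

MAN_SECTION_EXTENSIONS = (".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8")

def looks_like_manpage(path: str) -> bool:
    """Mirror of the detection heuristic: a man-section extension plus
    a dot-free basename ('git.1' matches, 'access.log.1' does not)."""
    if path.lower().endswith(".man"):
        return True
    if path.lower().endswith(MAN_SECTION_EXTENSIONS):
        basename_no_ext = os.path.splitext(os.path.basename(path))[0]
        return "." not in basename_no_ext
    return False

print(looks_like_manpage("git.1"))         # True
print(looks_like_manpage("access.log.1"))  # False (rotated log, not a man page)
```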
# Video file extensions
VIDEO_EXTENSIONS = (".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv")
if source.lower().endswith(VIDEO_EXTENSIONS):
return cls._detect_video_file(source)
# RSS/Atom feed file extensions (only .rss and .atom; .xml is too generic)
if source.lower().endswith((".rss", ".atom")):
return cls._detect_rss(source)
# OpenAPI/Swagger spec detection (YAML files with OpenAPI content)
# Sniff file content for 'openapi:' or 'swagger:' keys before committing
if (
source.lower().endswith((".yaml", ".yml"))
and os.path.isfile(source)
and cls._looks_like_openapi(source)
):
return cls._detect_openapi(source)
# 2. Video URL detection (before directory check)
video_url_info = cls._detect_video_url(source)
if video_url_info:
@@ -97,15 +140,22 @@ class SourceDetector:
raise ValueError(
f"Cannot determine source type for: {source}\n\n"
"Examples:\n"
" Web: skill-seekers create https://docs.react.dev/\n"
" GitHub: skill-seekers create facebook/react\n"
" Local: skill-seekers create ./my-project\n"
" PDF: skill-seekers create tutorial.pdf\n"
" DOCX: skill-seekers create document.docx\n"
" EPUB: skill-seekers create ebook.epub\n"
" Video: skill-seekers create https://youtube.com/watch?v=...\n"
" Video: skill-seekers create recording.mp4\n"
" Config: skill-seekers create configs/react.json"
" Web: skill-seekers create https://docs.react.dev/\n"
" GitHub: skill-seekers create facebook/react\n"
" Local: skill-seekers create ./my-project\n"
" PDF: skill-seekers create tutorial.pdf\n"
" DOCX: skill-seekers create document.docx\n"
" EPUB: skill-seekers create ebook.epub\n"
" Jupyter: skill-seekers create notebook.ipynb\n"
" HTML: skill-seekers create page.html\n"
" OpenAPI: skill-seekers create openapi.yaml\n"
" AsciiDoc: skill-seekers create document.adoc\n"
" PowerPoint: skill-seekers create presentation.pptx\n"
" RSS: skill-seekers create feed.rss\n"
" Man page: skill-seekers create command.1\n"
" Video: skill-seekers create https://youtube.com/watch?v=...\n"
" Video: skill-seekers create recording.mp4\n"
" Config: skill-seekers create configs/react.json"
)
@classmethod
@@ -140,6 +190,90 @@ class SourceDetector:
type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_jupyter(cls, source: str) -> SourceInfo:
"""Detect Jupyter Notebook file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="jupyter", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_html(cls, source: str) -> SourceInfo:
"""Detect local HTML file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="html", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_pptx(cls, source: str) -> SourceInfo:
"""Detect PowerPoint file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="pptx", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_asciidoc(cls, source: str) -> SourceInfo:
"""Detect AsciiDoc file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="asciidoc", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_manpage(cls, source: str) -> SourceInfo:
"""Detect man page file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="manpage", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_rss(cls, source: str) -> SourceInfo:
"""Detect RSS/Atom feed file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="rss", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _looks_like_openapi(cls, source: str) -> bool:
"""Check if a YAML/JSON file looks like an OpenAPI or Swagger spec.
Reads the first few lines to look for 'openapi:' or 'swagger:' keys.
Args:
source: Path to the file
Returns:
True if the file appears to be an OpenAPI/Swagger spec
"""
try:
with open(source, encoding="utf-8", errors="replace") as f:
# Read the first 20 lines; the openapi/swagger key typically appears near the top
for _ in range(20):
line = f.readline()
if not line:
break
stripped = line.strip().lower()
if stripped.startswith("openapi:") or stripped.startswith("swagger:"):
return True
if stripped.startswith('"openapi"') or stripped.startswith('"swagger"'):
return True
except OSError:
pass
return False
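A minimal round-trip of the sniffing logic, reimplemented standalone for illustration (the helper function and the temporary file are assumptions for the demo, not project code):

```python
import os
import tempfile

def looks_like_openapi(path: str) -> bool:
    """Scan the first 20 lines for an 'openapi:'/'swagger:' key (YAML or JSON)."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            for _ in range(20):
                line = f.readline()
                if not line:
                    break
                stripped = line.strip().lower()
                if stripped.startswith(("openapi:", "swagger:", '"openapi"', '"swagger"')):
                    return True
    except OSError:
        pass
    return False

# Write a throwaway spec-like YAML file and sniff it
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write("openapi: 3.1.0\ninfo:\n  title: Demo\n")
    spec = f.name

print(looks_like_openapi(spec))  # True
os.unlink(spec)
```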
@classmethod
def _detect_openapi(cls, source: str) -> SourceInfo:
"""Detect OpenAPI/Swagger spec file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="openapi", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
@classmethod
def _detect_video_file(cls, source: str) -> SourceInfo:
"""Detect local video file source."""
@@ -312,5 +446,19 @@ class SourceDetector:
if not os.path.isfile(config_path):
raise ValueError(f"Path is not a file: {config_path}")
# For web and github, validation happens during scraping
# (URL accessibility, repo existence)
elif source_info.type in ("jupyter", "html", "pptx", "asciidoc", "manpage", "openapi"):
file_path = source_info.parsed.get("file_path", "")
if file_path:
type_label = source_info.type.upper()
if not os.path.exists(file_path):
raise ValueError(f"{type_label} file does not exist: {file_path}")
if not os.path.isfile(file_path) and not os.path.isdir(file_path):
raise ValueError(f"Path is not a file or directory: {file_path}")
elif source_info.type == "rss":
file_path = source_info.parsed.get("file_path", "")
if file_path and not os.path.exists(file_path):
raise ValueError(f"RSS/Atom file does not exist: {file_path}")
# For web, github, confluence, notion, chat, rss (URL), validation happens
# during scraping (URL accessibility, API auth, etc.)

View File

@@ -76,6 +76,17 @@ class UnifiedScraper:
"word": [], # List of word sources
"video": [], # List of video sources
"local": [], # List of local sources (docs or code)
"epub": [], # List of epub sources
"jupyter": [], # List of Jupyter notebook sources
"html": [], # List of local HTML sources
"openapi": [], # List of OpenAPI/Swagger spec sources
"asciidoc": [], # List of AsciiDoc sources
"pptx": [], # List of PowerPoint sources
"confluence": [], # List of Confluence wiki sources
"notion": [], # List of Notion page sources
"rss": [], # List of RSS/Atom feed sources
"manpage": [], # List of man page sources
"chat": [], # List of Slack/Discord chat sources
}
# Track source index for unique naming (multi-source support)
@@ -86,6 +97,17 @@ class UnifiedScraper:
"word": 0,
"video": 0,
"local": 0,
"epub": 0,
"jupyter": 0,
"html": 0,
"openapi": 0,
"asciidoc": 0,
"pptx": 0,
"confluence": 0,
"notion": 0,
"rss": 0,
"manpage": 0,
"chat": 0,
}
# Output paths - cleaner organization
@@ -166,6 +188,28 @@ class UnifiedScraper:
self._scrape_video(source)
elif source_type == "local":
self._scrape_local(source)
elif source_type == "epub":
self._scrape_epub(source)
elif source_type == "jupyter":
self._scrape_jupyter(source)
elif source_type == "html":
self._scrape_html(source)
elif source_type == "openapi":
self._scrape_openapi(source)
elif source_type == "asciidoc":
self._scrape_asciidoc(source)
elif source_type == "pptx":
self._scrape_pptx(source)
elif source_type == "confluence":
self._scrape_confluence(source)
elif source_type == "notion":
self._scrape_notion(source)
elif source_type == "rss":
self._scrape_rss(source)
elif source_type == "manpage":
self._scrape_manpage(source)
elif source_type == "chat":
self._scrape_chat(source)
else:
logger.warning(f"Unknown source type: {source_type}")
except Exception as e:
@@ -571,6 +615,7 @@ class UnifiedScraper:
{
"docx_path": docx_path,
"docx_id": docx_id,
"word_id": docx_id, # Alias for generic reference generation
"idx": idx,
"data": word_data,
"data_file": cache_word_data,
@@ -788,6 +833,595 @@ class UnifiedScraper:
logger.debug(f"Traceback: {traceback.format_exc()}")
raise
# ------------------------------------------------------------------
# New source type handlers (v3.2.0+)
# ------------------------------------------------------------------
def _scrape_epub(self, source: dict[str, Any]):
"""Scrape EPUB e-book (.epub)."""
try:
from skill_seekers.cli.epub_scraper import EpubToSkillConverter
except ImportError:
logger.error(
"EPUB scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[epub]"
)
return
idx = self._source_counters["epub"]
self._source_counters["epub"] += 1
epub_path = source["path"]
epub_id = os.path.splitext(os.path.basename(epub_path))[0]
epub_config = {
"name": f"{self.name}_epub_{idx}_{epub_id}",
"epub_path": source["path"],
"description": source.get("description", f"{epub_id} e-book"),
}
logger.info(f"Scraping EPUB: {source['path']}")
converter = EpubToSkillConverter(epub_config)
converter.extract_epub()
epub_data_file = converter.data_file
with open(epub_data_file, encoding="utf-8") as f:
epub_data = json.load(f)
cache_epub_data = os.path.join(self.data_dir, f"epub_data_{idx}_{epub_id}.json")
shutil.copy(epub_data_file, cache_epub_data)
self.scraped_data["epub"].append(
{
"epub_path": epub_path,
"epub_id": epub_id,
"idx": idx,
"data": epub_data,
"data_file": cache_epub_data,
}
)
try:
converter.build_skill()
logger.info("✅ EPUB: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone EPUB SKILL.md: {e}")
logger.info(f"✅ EPUB: {len(epub_data.get('chapters', []))} chapters extracted")
def _scrape_jupyter(self, source: dict[str, Any]):
"""Scrape Jupyter Notebook (.ipynb)."""
try:
from skill_seekers.cli.jupyter_scraper import JupyterToSkillConverter
except ImportError:
logger.error(
"Jupyter scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[jupyter]"
)
return
idx = self._source_counters["jupyter"]
self._source_counters["jupyter"] += 1
nb_path = source["path"]
nb_id = os.path.splitext(os.path.basename(nb_path))[0]
nb_config = {
"name": f"{self.name}_jupyter_{idx}_{nb_id}",
"notebook_path": source["path"],
"description": source.get("description", f"{nb_id} notebook"),
}
logger.info(f"Scraping Jupyter Notebook: {source['path']}")
converter = JupyterToSkillConverter(nb_config)
converter.extract_notebook()
nb_data_file = converter.data_file
with open(nb_data_file, encoding="utf-8") as f:
nb_data = json.load(f)
cache_nb_data = os.path.join(self.data_dir, f"jupyter_data_{idx}_{nb_id}.json")
shutil.copy(nb_data_file, cache_nb_data)
self.scraped_data["jupyter"].append(
{
"notebook_path": nb_path,
"notebook_id": nb_id,
"idx": idx,
"data": nb_data,
"data_file": cache_nb_data,
}
)
try:
converter.build_skill()
logger.info("✅ Jupyter: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone Jupyter SKILL.md: {e}")
logger.info(f"✅ Jupyter: {len(nb_data.get('cells', []))} cells extracted")
def _scrape_html(self, source: dict[str, Any]):
"""Scrape local HTML file(s)."""
try:
from skill_seekers.cli.html_scraper import HtmlToSkillConverter
except ImportError:
logger.error("html_scraper.py not found")
return
idx = self._source_counters["html"]
self._source_counters["html"] += 1
html_path = source["path"]
html_id = os.path.splitext(os.path.basename(html_path.rstrip("/")))[0]
html_config = {
"name": f"{self.name}_html_{idx}_{html_id}",
"html_path": source["path"],
"description": source.get("description", f"{html_id} HTML content"),
}
logger.info(f"Scraping local HTML: {source['path']}")
converter = HtmlToSkillConverter(html_config)
converter.extract_html()
html_data_file = converter.data_file
with open(html_data_file, encoding="utf-8") as f:
html_data = json.load(f)
cache_html_data = os.path.join(self.data_dir, f"html_data_{idx}_{html_id}.json")
shutil.copy(html_data_file, cache_html_data)
self.scraped_data["html"].append(
{
"html_path": html_path,
"html_id": html_id,
"idx": idx,
"data": html_data,
"data_file": cache_html_data,
}
)
try:
converter.build_skill()
logger.info("✅ HTML: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone HTML SKILL.md: {e}")
logger.info(f"✅ HTML: {len(html_data.get('pages', []))} pages extracted")
def _scrape_openapi(self, source: dict[str, Any]):
"""Scrape OpenAPI/Swagger specification."""
try:
from skill_seekers.cli.openapi_scraper import OpenAPIToSkillConverter
except ImportError:
logger.error("openapi_scraper.py not found")
return
idx = self._source_counters["openapi"]
self._source_counters["openapi"] += 1
spec_path = source.get("path", source.get("url", ""))
spec_id = os.path.splitext(os.path.basename(spec_path))[0] if spec_path else f"spec_{idx}"
openapi_config = {
"name": f"{self.name}_openapi_{idx}_{spec_id}",
"spec_path": source.get("path"),
"spec_url": source.get("url"),
"description": source.get("description", f"{spec_id} API spec"),
}
logger.info(f"Scraping OpenAPI spec: {spec_path}")
converter = OpenAPIToSkillConverter(openapi_config)
converter.extract_spec()
api_data_file = converter.data_file
with open(api_data_file, encoding="utf-8") as f:
api_data = json.load(f)
cache_api_data = os.path.join(self.data_dir, f"openapi_data_{idx}_{spec_id}.json")
shutil.copy(api_data_file, cache_api_data)
self.scraped_data["openapi"].append(
{
"spec_path": spec_path,
"spec_id": spec_id,
"idx": idx,
"data": api_data,
"data_file": cache_api_data,
}
)
try:
converter.build_skill()
logger.info("✅ OpenAPI: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone OpenAPI SKILL.md: {e}")
logger.info(f"✅ OpenAPI: {len(api_data.get('endpoints', []))} endpoints extracted")
def _scrape_asciidoc(self, source: dict[str, Any]):
"""Scrape AsciiDoc document(s)."""
try:
from skill_seekers.cli.asciidoc_scraper import AsciiDocToSkillConverter
except ImportError:
logger.error(
"AsciiDoc scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[asciidoc]"
)
return
idx = self._source_counters["asciidoc"]
self._source_counters["asciidoc"] += 1
adoc_path = source["path"]
adoc_id = os.path.splitext(os.path.basename(adoc_path.rstrip("/")))[0]
adoc_config = {
"name": f"{self.name}_asciidoc_{idx}_{adoc_id}",
"asciidoc_path": source["path"],
"description": source.get("description", f"{adoc_id} AsciiDoc content"),
}
logger.info(f"Scraping AsciiDoc: {source['path']}")
converter = AsciiDocToSkillConverter(adoc_config)
converter.extract_asciidoc()
adoc_data_file = converter.data_file
with open(adoc_data_file, encoding="utf-8") as f:
adoc_data = json.load(f)
cache_adoc_data = os.path.join(self.data_dir, f"asciidoc_data_{idx}_{adoc_id}.json")
shutil.copy(adoc_data_file, cache_adoc_data)
self.scraped_data["asciidoc"].append(
{
"asciidoc_path": adoc_path,
"asciidoc_id": adoc_id,
"idx": idx,
"data": adoc_data,
"data_file": cache_adoc_data,
}
)
try:
converter.build_skill()
logger.info("✅ AsciiDoc: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone AsciiDoc SKILL.md: {e}")
logger.info(f"✅ AsciiDoc: {len(adoc_data.get('sections', []))} sections extracted")
def _scrape_pptx(self, source: dict[str, Any]):
"""Scrape PowerPoint presentation (.pptx)."""
try:
from skill_seekers.cli.pptx_scraper import PptxToSkillConverter
except ImportError:
logger.error(
"PowerPoint scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[pptx]"
)
return
idx = self._source_counters["pptx"]
self._source_counters["pptx"] += 1
pptx_path = source["path"]
pptx_id = os.path.splitext(os.path.basename(pptx_path))[0]
pptx_config = {
"name": f"{self.name}_pptx_{idx}_{pptx_id}",
"pptx_path": source["path"],
"description": source.get("description", f"{pptx_id} presentation"),
}
logger.info(f"Scraping PowerPoint: {source['path']}")
converter = PptxToSkillConverter(pptx_config)
converter.extract_pptx()
pptx_data_file = converter.data_file
with open(pptx_data_file, encoding="utf-8") as f:
pptx_data = json.load(f)
cache_pptx_data = os.path.join(self.data_dir, f"pptx_data_{idx}_{pptx_id}.json")
shutil.copy(pptx_data_file, cache_pptx_data)
self.scraped_data["pptx"].append(
{
"pptx_path": pptx_path,
"pptx_id": pptx_id,
"idx": idx,
"data": pptx_data,
"data_file": cache_pptx_data,
}
)
try:
converter.build_skill()
logger.info("✅ PowerPoint: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone PowerPoint SKILL.md: {e}")
logger.info(f"✅ PowerPoint: {len(pptx_data.get('slides', []))} slides extracted")
def _scrape_confluence(self, source: dict[str, Any]):
"""Scrape Confluence wiki (API or exported HTML/XML)."""
try:
from skill_seekers.cli.confluence_scraper import ConfluenceToSkillConverter
except ImportError:
logger.error(
"Confluence scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[confluence]"
)
return
idx = self._source_counters["confluence"]
self._source_counters["confluence"] += 1
source_id = source.get("space_key", source.get("path", f"confluence_{idx}"))
if isinstance(source_id, str) and "/" in source_id:
source_id = os.path.basename(source_id.rstrip("/"))
conf_config = {
"name": f"{self.name}_confluence_{idx}_{source_id}",
"base_url": source.get("base_url", source.get("url")),
"space_key": source.get("space_key"),
"export_path": source.get("path"),
"username": source.get("username"),
"token": source.get("token"),
"description": source.get("description", f"{source_id} Confluence content"),
"max_pages": source.get("max_pages", 500),
}
logger.info(f"Scraping Confluence: {source_id}")
converter = ConfluenceToSkillConverter(conf_config)
converter.extract_confluence()
conf_data_file = converter.data_file
with open(conf_data_file, encoding="utf-8") as f:
conf_data = json.load(f)
cache_conf_data = os.path.join(self.data_dir, f"confluence_data_{idx}_{source_id}.json")
shutil.copy(conf_data_file, cache_conf_data)
self.scraped_data["confluence"].append(
{
"source_id": source_id,
"idx": idx,
"data": conf_data,
"data_file": cache_conf_data,
}
)
try:
converter.build_skill()
logger.info("✅ Confluence: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone Confluence SKILL.md: {e}")
logger.info(f"✅ Confluence: {len(conf_data.get('pages', []))} pages extracted")
def _scrape_notion(self, source: dict[str, Any]):
"""Scrape Notion pages (API or exported Markdown)."""
try:
from skill_seekers.cli.notion_scraper import NotionToSkillConverter
except ImportError:
logger.error(
"Notion scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[notion]"
)
return
idx = self._source_counters["notion"]
self._source_counters["notion"] += 1
source_id = source.get(
"database_id", source.get("page_id", source.get("path", f"notion_{idx}"))
)
if isinstance(source_id, str) and "/" in source_id:
source_id = os.path.basename(source_id.rstrip("/"))
notion_config = {
"name": f"{self.name}_notion_{idx}_{source_id}",
"database_id": source.get("database_id"),
"page_id": source.get("page_id"),
"export_path": source.get("path"),
"token": source.get("token"),
"description": source.get("description", f"{source_id} Notion content"),
"max_pages": source.get("max_pages", 500),
}
logger.info(f"Scraping Notion: {source_id}")
converter = NotionToSkillConverter(notion_config)
converter.extract_notion()
notion_data_file = converter.data_file
with open(notion_data_file, encoding="utf-8") as f:
notion_data = json.load(f)
cache_notion_data = os.path.join(self.data_dir, f"notion_data_{idx}_{source_id}.json")
shutil.copy(notion_data_file, cache_notion_data)
self.scraped_data["notion"].append(
{
"source_id": source_id,
"idx": idx,
"data": notion_data,
"data_file": cache_notion_data,
}
)
try:
converter.build_skill()
logger.info("✅ Notion: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone Notion SKILL.md: {e}")
logger.info(f"✅ Notion: {len(notion_data.get('pages', []))} pages extracted")
def _scrape_rss(self, source: dict[str, Any]):
"""Scrape RSS/Atom feed (with optional full article scraping)."""
try:
from skill_seekers.cli.rss_scraper import RssToSkillConverter
except ImportError:
logger.error(
"RSS scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[rss]"
)
return
idx = self._source_counters["rss"]
self._source_counters["rss"] += 1
feed_url = source.get("url", source.get("path", ""))
feed_id = feed_url.split("/")[-1].split(".")[0] if feed_url else f"feed_{idx}"
rss_config = {
"name": f"{self.name}_rss_{idx}_{feed_id}",
"feed_url": source.get("url"),
"feed_path": source.get("path"),
"follow_links": source.get("follow_links", True),
"max_articles": source.get("max_articles", 50),
"description": source.get("description", f"{feed_id} RSS/Atom feed"),
}
logger.info(f"Scraping RSS/Atom feed: {feed_url}")
converter = RssToSkillConverter(rss_config)
converter.extract_feed()
rss_data_file = converter.data_file
with open(rss_data_file, encoding="utf-8") as f:
rss_data = json.load(f)
cache_rss_data = os.path.join(self.data_dir, f"rss_data_{idx}_{feed_id}.json")
shutil.copy(rss_data_file, cache_rss_data)
self.scraped_data["rss"].append(
{
"feed_url": feed_url,
"feed_id": feed_id,
"idx": idx,
"data": rss_data,
"data_file": cache_rss_data,
}
)
try:
converter.build_skill()
logger.info("✅ RSS: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone RSS SKILL.md: {e}")
logger.info(f"✅ RSS: {len(rss_data.get('articles', []))} articles extracted")
def _scrape_manpage(self, source: dict[str, Any]):
"""Scrape man page(s)."""
try:
from skill_seekers.cli.man_scraper import ManPageToSkillConverter
except ImportError:
logger.error("man_scraper.py not found")
return
idx = self._source_counters["manpage"]
self._source_counters["manpage"] += 1
man_names = source.get("names", [])
man_path = source.get("path", "")
man_id = man_names[0] if man_names else os.path.basename(man_path.rstrip("/"))
man_config = {
"name": f"{self.name}_manpage_{idx}_{man_id}",
"man_names": man_names,
"man_path": man_path,
"sections": source.get("sections", []),
"description": source.get("description", f"{man_id} man pages"),
}
logger.info(f"Scraping man pages: {man_id}")
converter = ManPageToSkillConverter(man_config)
converter.extract_manpages()
man_data_file = converter.data_file
with open(man_data_file, encoding="utf-8") as f:
man_data = json.load(f)
cache_man_data = os.path.join(self.data_dir, f"manpage_data_{idx}_{man_id}.json")
shutil.copy(man_data_file, cache_man_data)
self.scraped_data["manpage"].append(
{
"man_id": man_id,
"idx": idx,
"data": man_data,
"data_file": cache_man_data,
}
)
try:
converter.build_skill()
logger.info("✅ Man pages: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone man page SKILL.md: {e}")
logger.info(f"✅ Man pages: {len(man_data.get('pages', []))} man pages extracted")
def _scrape_chat(self, source: dict[str, Any]):
"""Scrape Slack/Discord chat export or API."""
try:
from skill_seekers.cli.chat_scraper import ChatToSkillConverter
except ImportError:
logger.error(
"Chat scraper dependencies not installed.\n"
" Install with: pip install skill-seekers[chat]"
)
return
idx = self._source_counters["chat"]
self._source_counters["chat"] += 1
export_path = source.get("path", "")
channel = source.get("channel", source.get("channel_id", ""))
chat_id = channel or os.path.basename(export_path.rstrip("/")) or f"chat_{idx}"
chat_config = {
"name": f"{self.name}_chat_{idx}_{chat_id}",
"export_path": source.get("path"),
"platform": source.get("platform", "slack"),
"token": source.get("token"),
"channel": channel,
"max_messages": source.get("max_messages", 10000),
"description": source.get("description", f"{chat_id} chat export"),
}
logger.info(f"Scraping chat: {chat_id}")
converter = ChatToSkillConverter(chat_config)
converter.extract_chat()
chat_data_file = converter.data_file
with open(chat_data_file, encoding="utf-8") as f:
chat_data = json.load(f)
cache_chat_data = os.path.join(self.data_dir, f"chat_data_{idx}_{chat_id}.json")
shutil.copy(chat_data_file, cache_chat_data)
self.scraped_data["chat"].append(
{
"chat_id": chat_id,
"platform": source.get("platform", "slack"),
"idx": idx,
"data": chat_data,
"data_file": cache_chat_data,
}
)
try:
converter.build_skill()
logger.info("✅ Chat: Standalone SKILL.md created")
except Exception as e:
logger.warning(f"⚠️ Failed to build standalone chat SKILL.md: {e}")
logger.info(f"✅ Chat: {len(chat_data.get('messages', []))} messages extracted")
def _load_json(self, file_path: Path) -> dict:
"""
Load JSON file safely.
@@ -1297,14 +1931,33 @@ Examples:
if args.dry_run:
logger.info("🔍 DRY RUN MODE - Preview only, no scraping will occur")
logger.info(f"\nWould scrape {len(scraper.config.get('sources', []))} sources:")
# Source type display config: type -> (label, key for detail)
_SOURCE_DISPLAY = {
"documentation": ("Documentation", "base_url"),
"github": ("GitHub", "repo"),
"pdf": ("PDF", "path"),
"word": ("Word", "path"),
"epub": ("EPUB", "path"),
"video": ("Video", "url"),
"local": ("Local Codebase", "path"),
"jupyter": ("Jupyter Notebook", "path"),
"html": ("HTML", "path"),
"openapi": ("OpenAPI Spec", "path"),
"asciidoc": ("AsciiDoc", "path"),
"pptx": ("PowerPoint", "path"),
"confluence": ("Confluence", "base_url"),
"notion": ("Notion", "page_id"),
"rss": ("RSS/Atom Feed", "url"),
"manpage": ("Man Page", "names"),
"chat": ("Chat Export", "path"),
}
for idx, source in enumerate(scraper.config.get("sources", []), 1):
source_type = source.get("type", "unknown")
if source_type == "documentation":
logger.info(f" {idx}. Documentation: {source.get('base_url', 'N/A')}")
elif source_type == "github":
logger.info(f" {idx}. GitHub: {source.get('repo', 'N/A')}")
elif source_type == "pdf":
logger.info(f" {idx}. PDF: {source.get('pdf_path', 'N/A')}")
label, key = _SOURCE_DISPLAY.get(source_type, (source_type.title(), "path"))
detail = source.get(key, "N/A")
if isinstance(detail, list):
detail = ", ".join(str(d) for d in detail)
logger.info(f" {idx}. {label}: {detail}")
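The table-driven display above replaces the old per-type elif chain. A self-contained sketch (with a trimmed-down mapping; the entries shown are only a subset) illustrates the formatting it produces:

```python
# Subset of the display mapping: type -> (label, key for detail)
_SOURCE_DISPLAY = {
    "github": ("GitHub", "repo"),
    "manpage": ("Man Page", "names"),
}

def format_source(idx: int, source: dict) -> str:
    """Format one dry-run line; unknown types fall back to a title-cased label."""
    source_type = source.get("type", "unknown")
    label, key = _SOURCE_DISPLAY.get(source_type, (source_type.title(), "path"))
    detail = source.get(key, "N/A")
    if isinstance(detail, list):
        detail = ", ".join(str(d) for d in detail)
    return f" {idx}. {label}: {detail}"

print(format_source(1, {"type": "github", "repo": "facebook/react"}))
print(format_source(2, {"type": "manpage", "names": ["git", "tar"]}))
```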
logger.info(f"\nOutput directory: {scraper.output_dir}")
logger.info(f"Merge mode: {scraper.merge_mode}")
return

View File

@@ -136,6 +136,44 @@ class UnifiedSkillBuilder:
skill_mds["pdf"] = "\n\n---\n\n".join(pdf_sources)
logger.debug(f"Combined {len(pdf_sources)} PDF SKILL.md files")
# Load additional source types using generic glob pattern
# Each source type uses: {name}_{type}_{idx}_*/ or {name}_{type}_*/
_extra_types = [
"word",
"epub",
"video",
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"confluence",
"notion",
"rss",
"manpage",
"chat",
]
for source_type in _extra_types:
type_sources = []
for type_dir in sources_dir.glob(f"{self.name}_{source_type}_*"):
type_skill_path = type_dir / "SKILL.md"
if type_skill_path.exists():
try:
content = type_skill_path.read_text(encoding="utf-8")
type_sources.append(content)
logger.debug(
f"Loaded {source_type} SKILL.md from {type_dir.name} "
f"({len(content)} chars)"
)
except OSError as e:
logger.warning(
f"Failed to read {source_type} SKILL.md from {type_dir.name}: {e}"
)
if type_sources:
skill_mds[source_type] = "\n\n---\n\n".join(type_sources)
logger.debug(f"Combined {len(type_sources)} {source_type} SKILL.md files")
logger.info(f"Loaded {len(skill_mds)} source SKILL.md files")
return skill_mds
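The generic glob loop above can be exercised in isolation. This sketch reproduces the `{name}_{source_type}_*` directory convention against a temporary tree (directory and skill names are illustrative):

```python
import tempfile
from pathlib import Path

def load_type_skill_mds(sources_dir: Path, name: str, source_type: str) -> list[str]:
    """Collect SKILL.md contents from {name}_{source_type}_* directories."""
    contents = []
    # sorted() makes the merge order deterministic across filesystems
    for type_dir in sorted(sources_dir.glob(f"{name}_{source_type}_*")):
        skill_path = type_dir / "SKILL.md"
        if skill_path.exists():
            contents.append(skill_path.read_text(encoding="utf-8"))
    return contents

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for idx in range(2):
        d = root / f"myskill_jupyter_{idx}_example"
        d.mkdir()
        (d / "SKILL.md").write_text(f"# Notebook {idx}", encoding="utf-8")
    merged = "\n\n---\n\n".join(load_type_skill_mds(root, "myskill", "jupyter"))
```

The `\n\n---\n\n` joiner matches the separator the builder uses when combining multiple SKILL.md files of the same type.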
@@ -477,6 +515,18 @@ This skill synthesizes knowledge from multiple sources:
logger.info("Using PDF SKILL.md as-is")
content = skill_mds["pdf"]
# Generic merge for additional source types not covered by pairwise methods
if not content and skill_mds:
# At least one source SKILL.md exists but not docs/github/pdf
logger.info(f"Generic merge for source types: {list(skill_mds.keys())}")
content = self._generic_merge(skill_mds)
elif content and len(skill_mds) > (int(has_docs) + int(has_github) + int(has_pdf)):
# Pairwise synthesis handled the core types; append additional sources
extra_types = set(skill_mds.keys()) - {"documentation", "github", "pdf"}
if extra_types:
logger.info(f"Appending additional sources: {extra_types}")
content = self._append_extra_sources(content, skill_mds, extra_types)
# Fallback: generate minimal SKILL.md (legacy behavior)
if not content:
logger.warning("No source SKILL.md files found, generating minimal SKILL.md (legacy)")
@@ -574,6 +624,165 @@ This skill synthesizes knowledge from multiple sources:
return "\n".join(lines)
# ------------------------------------------------------------------
# Generic merge system for any combination of source types (v3.2.0+)
# ------------------------------------------------------------------
# Human-readable labels for source types
_SOURCE_LABELS: dict[str, str] = {
"documentation": "Documentation",
"github": "GitHub Repository",
"pdf": "PDF Document",
"word": "Word Document",
"epub": "EPUB E-book",
"video": "Video",
"local": "Local Codebase",
"jupyter": "Jupyter Notebook",
"html": "HTML Document",
"openapi": "OpenAPI/Swagger Spec",
"asciidoc": "AsciiDoc Document",
"pptx": "PowerPoint Presentation",
"confluence": "Confluence Wiki",
"notion": "Notion Page",
"rss": "RSS/Atom Feed",
"manpage": "Man Page",
"chat": "Chat Export",
}
def _generic_merge(self, skill_mds: dict[str, str]) -> str:
"""Generic merge for any combination of source types.
Uses a priority-based section ordering approach:
1. Parse all source SKILL.md files into sections
2. Collect unique sections across all sources
3. Merge matching sections with source attribution
4. Produce a unified SKILL.md
This preserves the existing pairwise synthesis for docs+github, docs+pdf, etc.
and handles any other combination generically.
Args:
skill_mds: Dict mapping source type to SKILL.md content
Returns:
Merged SKILL.md content string
"""
skill_name = self.name.lower().replace("_", "-").replace(" ", "-")[:64]
desc = self.description[:1024] if len(self.description) > 1024 else self.description
# Parse all source SKILL.md files into sections
all_sections: dict[str, dict[str, str]] = {}
for source_type, content in skill_mds.items():
all_sections[source_type] = self._parse_skill_md_sections(content)
# Determine all unique section names in priority order
# Sections that appear earlier in sources have higher priority
seen_sections: list[str] = []
for _source_type, sections in all_sections.items():
for section_name in sections:
if section_name not in seen_sections:
seen_sections.append(section_name)
# Build merged content
source_labels = ", ".join(self._SOURCE_LABELS.get(t, t.title()) for t in skill_mds)
lines = [
"---",
f"name: {skill_name}",
f"description: {desc}",
"---",
"",
f"# {self.name.replace('_', ' ').title()}",
"",
f"{self.description}",
"",
f"*Merged from: {source_labels}*",
"",
]
# Emit each section, merging content from all sources that have it
for section_name in seen_sections:
contributing_sources = [
(stype, sections[section_name])
for stype, sections in all_sections.items()
if section_name in sections
]
if len(contributing_sources) == 1:
# Single source for this section — emit as-is
stype, content = contributing_sources[0]
label = self._SOURCE_LABELS.get(stype, stype.title())
lines.append(f"## {section_name}")
lines.append("")
lines.append(f"*From {label}*")
lines.append("")
lines.append(content)
lines.append("")
else:
# Multiple sources — merge with attribution
lines.append(f"## {section_name}")
lines.append("")
for stype, content in contributing_sources:
label = self._SOURCE_LABELS.get(stype, stype.title())
lines.append(f"### From {label}")
lines.append("")
lines.append(content)
lines.append("")
lines.append("---")
lines.append("")
lines.append("*Generated by Skill Seeker's unified multi-source scraper*")
return "\n".join(lines)
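The section-collection core of `_generic_merge` relies on `_parse_skill_md_sections`, which is not shown in this diff. A hedged sketch, assuming a simple parser that splits a SKILL.md body on `## ` headings:

```python
def parse_sections(md: str) -> dict[str, str]:
    """Assumed stand-in for _parse_skill_md_sections: split on '## ' headings."""
    sections, current, buf = {}, None, []
    for line in md.splitlines():
        if line.startswith("## "):
            if current is not None:
                sections[current] = "\n".join(buf).strip()
            current, buf = line[3:].strip(), []
        elif current is not None:
            buf.append(line)
    if current is not None:
        sections[current] = "\n".join(buf).strip()
    return sections

def merge_section_names(skill_mds: dict[str, str]) -> list[str]:
    """First-seen ordering of section names across all sources."""
    seen = []
    for content in skill_mds.values():
        for name in parse_sections(content):
            if name not in seen:
                seen.append(name)
    return seen
```

First-seen ordering is what gives earlier sources their section-priority in the merged output.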
def _append_extra_sources(
self,
base_content: str,
skill_mds: dict[str, str],
extra_types: set[str],
) -> str:
"""Append additional source content to existing pairwise-synthesized SKILL.md.
Used when the core docs+github+pdf synthesis has run, but there are
additional source types (epub, jupyter, etc.) that need to be included.
Args:
base_content: Already-synthesized SKILL.md content
skill_mds: All source SKILL.md files
extra_types: Set of extra source type keys to append
Returns:
Extended SKILL.md content
"""
lines = base_content.split("\n")
# Find the final separator (---) or end of file
insertion_index = len(lines)
for i in range(len(lines) - 1, -1, -1):
if lines[i].strip() == "---":
insertion_index = i
break
# Build extra content
extra_lines = [""]
for source_type in sorted(extra_types):
if source_type not in skill_mds:
continue
label = self._SOURCE_LABELS.get(source_type, source_type.title())
sections = self._parse_skill_md_sections(skill_mds[source_type])
extra_lines.append(f"## {label} Content")
extra_lines.append("")
for section_name, content in sections.items():
extra_lines.append(f"### {section_name}")
extra_lines.append("")
extra_lines.append(content)
extra_lines.append("")
lines[insertion_index:insertion_index] = extra_lines
return "\n".join(lines)
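The insertion-point search in `_append_extra_sources` can be sketched on its own: scan backwards for the last `---` separator so extra sections land before the footer rather than after it.

```python
def find_insertion_index(lines: list[str]) -> int:
    """Return the index of the last '---' separator, or end of file."""
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "---":
            return i
    return len(lines)

body = ["# Skill", "", "## Docs", "text", "", "---", "", "*footer*"]
idx = find_insertion_index(body)
# Slice-assignment splices the extra section in place, keeping the footer last
body[idx:idx] = ["", "## EPUB Content", "extra"]
```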
def _generate_minimal_skill_md(self) -> str:
"""Generate minimal SKILL.md (legacy fallback behavior).
@@ -597,18 +806,42 @@ This skill combines knowledge from multiple sources:
"""
# Source type display keys: type -> (label, primary_key, extra_keys)
_source_detail_map = {
"documentation": ("Documentation", "base_url", [("Pages", "max_pages", "unlimited")]),
"github": (
"GitHub Repository",
"repo",
[("Code Analysis", "code_analysis_depth", "surface"), ("Issues", "max_issues", 0)],
),
"pdf": ("PDF Document", "path", []),
"word": ("Word Document", "path", []),
"epub": ("EPUB E-book", "path", []),
"video": ("Video", "url", []),
"local": ("Local Codebase", "path", [("Analysis Depth", "analysis_depth", "surface")]),
"jupyter": ("Jupyter Notebook", "path", []),
"html": ("HTML Document", "path", []),
"openapi": ("OpenAPI Spec", "path", []),
"asciidoc": ("AsciiDoc Document", "path", []),
"pptx": ("PowerPoint", "path", []),
"confluence": ("Confluence Wiki", "base_url", []),
"notion": ("Notion Page", "page_id", []),
"rss": ("RSS/Atom Feed", "url", []),
"manpage": ("Man Page", "names", []),
"chat": ("Chat Export", "path", []),
}
# List sources
for source in self.config.get("sources", []):
source_type = source["type"]
if source_type == "documentation":
content += f"- ✅ **Documentation**: {source.get('base_url', 'N/A')}\n"
content += f" - Pages: {source.get('max_pages', 'unlimited')}\n"
elif source_type == "github":
content += f"- ✅ **GitHub Repository**: {source.get('repo', 'N/A')}\n"
content += f" - Code Analysis: {source.get('code_analysis_depth', 'surface')}\n"
content += f" - Issues: {source.get('max_issues', 0)}\n"
elif source_type == "pdf":
content += f"- ✅ **PDF Document**: {source.get('path', 'N/A')}\n"
display = _source_detail_map.get(source_type, (source_type.title(), "path", []))
label, primary_key, extras = display
primary_val = source.get(primary_key, "N/A")
if isinstance(primary_val, list):
primary_val = ", ".join(str(v) for v in primary_val)
content += f"- ✅ **{label}**: {primary_val}\n"
for extra_label, extra_key, extra_default in extras:
content += f" - {extra_label}: {source.get(extra_key, extra_default)}\n"
# C3.x Architecture & Code Analysis section (if available)
github_data = self.scraped_data.get("github", {})
@@ -796,6 +1029,27 @@ This skill combines knowledge from multiple sources:
if pdf_list:
self._generate_pdf_references(pdf_list)
# Generate references for all additional source types
_extra_source_types = [
"word",
"epub",
"video",
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"confluence",
"notion",
"rss",
"manpage",
"chat",
]
for source_type in _extra_source_types:
source_list = self.scraped_data.get(source_type, [])
if source_list:
self._generate_generic_references(source_type, source_list)
# Generate merged API reference if available
if self.merged_data:
self._generate_merged_api_reference()
@@ -977,6 +1231,63 @@ This skill combines knowledge from multiple sources:
logger.info(f"Created PDF references ({len(pdf_list)} sources)")
def _generate_generic_references(self, source_type: str, source_list: list[dict]):
"""Generate references for any source type using a generic approach.
Creates a references/<source_type>/ directory with an index and
copies any data files from the source list.
Args:
source_type: The source type key (e.g., 'epub', 'jupyter')
source_list: List of scraped source dicts for this type
"""
if not source_list:
return
label = self._SOURCE_LABELS.get(source_type, source_type.title())
type_dir = os.path.join(self.skill_dir, "references", source_type)
os.makedirs(type_dir, exist_ok=True)
# Create index
index_path = os.path.join(type_dir, "index.md")
with open(index_path, "w", encoding="utf-8") as f:
f.write(f"# {label} References\n\n")
f.write(f"Reference from {len(source_list)} {label} source(s).\n\n")
for i, source_data in enumerate(source_list):
# Try common ID fields
source_id = (
source_data.get("source_id")
or source_data.get(f"{source_type}_id")
or source_data.get("notebook_id")
or source_data.get("spec_id")
or source_data.get("feed_id")
or source_data.get("man_id")
or source_data.get("chat_id")
or f"source_{i}"
)
f.write(f"## {source_id}\n\n")
# Write summary of extracted data
data = source_data.get("data", {})
if isinstance(data, dict):
for key in ["title", "description", "metadata"]:
if key in data:
val = data[key]
if isinstance(val, str) and val:
f.write(f"**{key.title()}:** {val}\n\n")
# Copy data file if available
data_file = source_data.get("data_file")
if data_file and os.path.isfile(data_file):
dest = os.path.join(type_dir, f"{source_id}_data.json")
import contextlib
with contextlib.suppress(OSError):
shutil.copy(data_file, dest)
logger.info(f"Created {label} references ({len(source_list)} sources)")
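The ID fallback chain in `_generate_generic_references` tries several per-type keys before giving up. A compact sketch of that resolution order:

```python
def resolve_source_id(source_data: dict, source_type: str, index: int) -> str:
    """Pick a display ID for a scraped source, falling back to source_<index>."""
    for key in ("source_id", f"{source_type}_id", "notebook_id", "spec_id",
                "feed_id", "man_id", "chat_id"):
        value = source_data.get(key)
        if value:
            return str(value)
    return f"source_{index}"
```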
def _generate_merged_api_reference(self):
"""Generate merged API reference file."""
api_dir = os.path.join(self.skill_dir, "references", "api")

@@ -3,16 +3,16 @@
Skill Seeker MCP Server (FastMCP Implementation)
Modern, decorator-based MCP server using FastMCP for simplified tool registration.
Provides 33 tools for generating Claude AI skills from documentation.
Provides 34 tools for generating Claude AI skills from documentation.
This is a streamlined alternative to server.py (2200 lines → 708 lines, 68% reduction).
All tool implementations are delegated to modular tool files in tools/ directory.
**Architecture:**
- FastMCP server with decorator-based tool registration
- 33 tools organized into 7 categories:
- 34 tools organized into 7 categories:
* Config tools (3): generate_config, list_configs, validate_config
* Scraping tools (10): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video, scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns
* Scraping tools (11): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video, scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns, scrape_generic
* Packaging tools (4): package_skill, upload_skill, enhance_skill, install_skill
* Splitting tools (2): split_config, generate_router
* Source tools (5): fetch_config, submit_config, add_config_source, list_config_sources, remove_config_source
@@ -97,6 +97,7 @@ try:
remove_config_source_impl,
scrape_codebase_impl,
scrape_docs_impl,
scrape_generic_impl,
scrape_github_impl,
scrape_pdf_impl,
scrape_video_impl,
@@ -141,6 +142,7 @@ except ImportError:
remove_config_source_impl,
scrape_codebase_impl,
scrape_docs_impl,
scrape_generic_impl,
scrape_github_impl,
scrape_pdf_impl,
scrape_video_impl,
@@ -301,7 +303,7 @@ async def sync_config(
# ============================================================================
# SCRAPING TOOLS (10 tools)
# SCRAPING TOOLS (11 tools)
# ============================================================================
@@ -823,6 +825,50 @@ async def extract_config_patterns(
return str(result)
@safe_tool_decorator(
description="Scrape content from new source types: jupyter, html, openapi, asciidoc, pptx, confluence, notion, rss, manpage, chat. A generic entry point that delegates to the appropriate CLI scraper module."
)
async def scrape_generic(
source_type: str,
name: str,
path: str | None = None,
url: str | None = None,
) -> str:
"""
Scrape content from various source types and build a skill.
A generic scraper that supports 10 new source types. It delegates to the
corresponding CLI scraper module (e.g., skill_seekers.cli.jupyter_scraper).
File-based types (jupyter, html, openapi, asciidoc, pptx, manpage, chat)
typically use the 'path' parameter. URL-based types (confluence, notion, rss)
typically use the 'url' parameter.
Args:
source_type: Source type to scrape. One of: jupyter, html, openapi,
asciidoc, pptx, confluence, notion, rss, manpage, chat.
name: Skill name for the output
path: File or directory path (for file-based sources like jupyter, html, pptx)
url: URL (for URL-based sources like confluence, notion, rss)
Returns:
Scraping results with file paths and statistics.
"""
args = {
"source_type": source_type,
"name": name,
}
if path:
args["path"] = path
if url:
args["url"] = url
result = await scrape_generic_impl(args)
if isinstance(result, list) and result:
return result[0].text if hasattr(result[0], "text") else str(result[0])
return str(result)
# ============================================================================
# PACKAGING TOOLS (4 tools)
# ============================================================================

@@ -63,6 +63,9 @@ from .scraping_tools import (
from .scraping_tools import (
scrape_pdf_tool as scrape_pdf_impl,
)
from .scraping_tools import (
scrape_generic_tool as scrape_generic_impl,
)
from .scraping_tools import (
scrape_video_tool as scrape_video_impl,
)
@@ -135,6 +138,7 @@ __all__ = [
"extract_test_examples_impl",
"build_how_to_guides_impl",
"extract_config_patterns_impl",
"scrape_generic_impl",
# Packaging tools
"package_skill_impl",
"upload_skill_impl",

@@ -205,6 +205,18 @@ async def validate_config(args: dict) -> list[TextContent]:
)
elif source["type"] == "pdf":
result += f" Path: {source.get('path', 'N/A')}\n"
elif source["type"] in (
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"manpage",
"chat",
):
result += f" Path: {source.get('path', 'N/A')}\n"
elif source["type"] in ("confluence", "notion", "rss"):
result += f" URL: {source.get('url', 'N/A')}\n"
# Show merge settings if applicable
if validator.needs_api_merge():

View File

@@ -7,6 +7,8 @@ This module contains all scraping-related MCP tool implementations:
- scrape_github_tool: Scrape GitHub repositories
- scrape_pdf_tool: Scrape PDF documentation
- scrape_codebase_tool: Analyze local codebase and extract code knowledge
- scrape_generic_tool: Generic scraper for new source types (jupyter, html,
openapi, asciidoc, pptx, confluence, notion, rss, manpage, chat)
Extracted from server.py for better modularity and organization.
"""
@@ -1005,3 +1007,155 @@ async def extract_config_patterns_tool(args: dict) -> list[TextContent]:
return [TextContent(type="text", text=output_text)]
else:
return [TextContent(type="text", text=f"{output_text}\n\n❌ Error:\n{stderr}")]
# Valid source types for the generic scraper
GENERIC_SOURCE_TYPES = (
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"confluence",
"notion",
"rss",
"manpage",
"chat",
)
# Mapping from source type to the CLI flag used for the primary input argument.
# URL-based types use --url; file/path-based types use --path.
_URL_BASED_TYPES = {"confluence", "notion", "rss"}
# Friendly emoji labels per source type
_SOURCE_EMOJIS = {
"jupyter": "📓",
"html": "🌐",
"openapi": "📡",
"asciidoc": "📄",
"pptx": "📊",
"confluence": "🏢",
"notion": "📝",
"rss": "📰",
"manpage": "📖",
"chat": "💬",
}
async def scrape_generic_tool(args: dict) -> list[TextContent]:
"""
Generic scraper for new source types.
Handles all 10 new source types by building the appropriate subprocess
command and delegating to the corresponding CLI scraper module.
Supported source types: jupyter, html, openapi, asciidoc, pptx,
confluence, notion, rss, manpage, chat.
Args:
args: Dictionary containing:
- source_type (str): One of the supported source types
- path (str, optional): File or directory path (for file-based sources)
- url (str, optional): URL (for URL-based sources like confluence, notion, rss)
- name (str): Skill name for the output
Returns:
List[TextContent]: Tool execution results
"""
source_type = args.get("source_type", "")
path = args.get("path")
url = args.get("url")
name = args.get("name")
# Validate source_type
if source_type not in GENERIC_SOURCE_TYPES:
return [
TextContent(
type="text",
text=(
f"❌ Error: Unknown source_type '{source_type}'. "
f"Must be one of: {', '.join(GENERIC_SOURCE_TYPES)}"
),
)
]
# Validate that we have either path or url
if not path and not url:
return [
TextContent(
type="text",
text="❌ Error: Must specify either 'path' (file/directory) or 'url'",
)
]
if not name:
return [
TextContent(
type="text",
text="❌ Error: 'name' parameter is required",
)
]
# Build the subprocess command
# Map source type to module name (most are <type>_scraper, but some differ)
_MODULE_NAMES = {
"manpage": "man_scraper",
}
module_name = _MODULE_NAMES.get(source_type, f"{source_type}_scraper")
cmd = [sys.executable, "-m", f"skill_seekers.cli.{module_name}"]
# Map source type to the correct CLI flag for file/path input and URL input.
# Each scraper has its own flag name — using a generic --path or --url would fail.
_PATH_FLAGS: dict[str, str] = {
"jupyter": "--notebook",
"html": "--html-path",
"openapi": "--spec",
"asciidoc": "--asciidoc-path",
"pptx": "--pptx",
"manpage": "--man-path",
"confluence": "--export-path",
"notion": "--export-path",
"rss": "--feed-path",
"chat": "--export-path",
}
_URL_FLAGS: dict[str, str] = {
"confluence": "--base-url",
"notion": "--page-id",
"rss": "--feed-url",
"openapi": "--spec-url",
}
# Determine the input flag based on source type
if source_type in _URL_BASED_TYPES and url:
url_flag = _URL_FLAGS.get(source_type, "--url")
cmd.extend([url_flag, url])
elif path:
path_flag = _PATH_FLAGS.get(source_type, "--path")
cmd.extend([path_flag, path])
elif url:
# Allow url fallback for file-based types (some may accept URLs too)
url_flag = _URL_FLAGS.get(source_type, "--url")
cmd.extend([url_flag, url])
cmd.extend(["--name", name])
# Set a reasonable timeout
timeout = 600 # 10 minutes
emoji = _SOURCE_EMOJIS.get(source_type, "🔧")
progress_msg = f"{emoji} Scraping {source_type} source...\n"
if path:
progress_msg += f"📁 Path: {path}\n"
if url:
progress_msg += f"🔗 URL: {url}\n"
progress_msg += f"📛 Name: {name}\n"
progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"
stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)
output = progress_msg + stdout
if returncode == 0:
return [TextContent(type="text", text=output)]
else:
return [TextContent(type="text", text=f"{output}\n\n❌ Error:\n{stderr}")]

@@ -106,7 +106,9 @@ async def split_config(args: dict) -> list[TextContent]:
Supports both documentation and unified (multi-source) configs:
- Documentation configs: Split by categories, size, or create router skills
- Unified configs: Split by source type (documentation, github, pdf)
- Unified configs: Split by source type (documentation, github, pdf,
jupyter, html, openapi, asciidoc, pptx, confluence, notion, rss,
manpage, chat)
For large documentation sites (10K+ pages), this tool splits the config into
multiple smaller configs. For unified configs with multiple sources, splits

@@ -0,0 +1,222 @@
name: complex-merge
description: Intelligent multi-source merging with conflict resolution, priority rules, and gap analysis
version: "1.0"
author: Skill Seekers
tags:
- merge
- multi-source
- conflict-resolution
- synthesis
applies_to:
- doc_scraping
- codebase_analysis
- github_analysis
variables:
merge_strategy: priority
source_priority_order: "official_docs,code,community"
conflict_resolution: highest_priority
min_sources_for_consensus: 2
stages:
- name: source_inventory
type: custom
target: inventory
uses_history: false
enabled: true
prompt: >
Catalog every source that contributed content to this skill extraction.
For each source, classify its type and assess its characteristics.
For each source, determine:
1. Source type (official_docs, codebase, github_repo, pdf, video, community, blog)
2. Content scope — what topics or areas does this source cover?
3. Freshness — how recent is the content? Look for version numbers, dates, deprecation notices
4. Authority level — is this an official maintainer, core contributor, or third party?
5. Content density — roughly how much substantive information does this source provide?
6. Format characteristics — prose, code samples, API reference, tutorial, etc.
Output JSON with:
- "sources": array of {id, type, scope_summary, topics_covered, freshness_estimate, authority, density, format}
- "source_type_distribution": count of sources by type
- "total_topics_identified": number of unique topics across all sources
- "coverage_summary": brief overview of what the combined sources cover
- name: cross_reference
type: custom
target: cross_references
uses_history: true
enabled: true
prompt: >
Using the source inventory, identify overlapping topics across sources.
Find where multiple sources discuss the same concept, API, feature, or pattern.
For each overlapping topic:
1. List which sources cover it and how deeply
2. Note whether sources agree, complement each other, or diverge
3. Identify the richest source for that topic (most detail, best examples)
4. Flag any terminology differences across sources for the same concept
Output JSON with:
- "overlapping_topics": array of {topic, sources_covering, agreement_level, richest_source, terminology_variants}
- "high_overlap_topics": topics covered by 3+ sources
- "complementary_pairs": pairs of sources that cover different aspects of the same topic well
- "terminology_map": dictionary mapping variant terms to a canonical term
- name: conflict_detection
type: custom
target: conflicts
uses_history: true
enabled: true
prompt: >
Examine the cross-referenced topics and identify genuine contradictions
between sources. Distinguish between true conflicts and superficial differences.
Categories of conflict to detect:
1. Factual contradictions — sources state opposite things about the same feature
2. Version mismatches — sources describe different versions of an API or behavior
3. Best practice disagreements — sources recommend conflicting approaches
4. Deprecated vs current — one source shows deprecated usage another shows current
5. Scope conflicts — sources disagree on what a feature can or cannot do
For each conflict:
- Identify the specific claim from each source
- Assess which source is more likely correct and why
- Recommend a resolution strategy
Output JSON with:
- "conflicts": array of {topic, type, source_a_claim, source_b_claim, likely_correct, resolution_rationale}
- "conflict_count_by_type": breakdown of conflicts by category
- "high_severity_conflicts": conflicts that would mislead users if unresolved
- "auto_resolvable": conflicts that can be resolved by version/date alone
- name: priority_merge
type: custom
target: merged_content
uses_history: true
enabled: true
prompt: >
Merge content from all sources using the following priority hierarchy:
1. Official documentation (highest authority)
2. Source code and inline comments (ground truth for behavior)
3. Community content — tutorials, blog posts, Stack Overflow (practical usage)
Merging rules:
- When sources agree, combine the best explanation with the best examples
- When sources conflict, prefer the higher-priority source but note the alternative
- When only a lower-priority source covers a topic, include it but flag the authority level
- Preserve code examples from any source, annotating their origin
- Deduplicate content — do not repeat the same information from multiple sources
- Normalize terminology using the canonical terms from cross-referencing
For each merged topic, produce:
1. Authoritative explanation (from highest-priority source)
2. Practical examples (best available from any source)
3. Source attribution (which sources contributed)
4. Confidence level (high if official docs confirm, medium if code-only, low if community-only)
Output JSON with:
- "merged_topics": array of {topic, explanation, examples, sources_used, confidence, notes}
- "merge_decisions": array of {topic, decision, rationale} for non-trivial merges
- "source_contribution_stats": how much each source contributed to the final output
- name: gap_analysis
type: custom
target: gaps
uses_history: true
enabled: true
prompt: >
Analyse the merged content to identify gaps — topics or areas that are
underrepresented or missing entirely.
Identify:
1. Single-source topics — covered by only one source, making them fragile
2. Missing fundamentals — core concepts that should be documented but are not
3. Missing examples — topics explained in prose but lacking code samples
4. Missing edge cases — common error scenarios or limitations not documented
5. Broken references — topics that reference other topics not present in any source
6. Audience gaps — content assumes knowledge that is never introduced
For each gap, assess:
- Severity (critical, important, nice-to-have)
- Whether the gap can be inferred from existing content
- Suggested source type that would best fill this gap
Output JSON with:
- "single_source_topics": array of {topic, sole_source, risk_level}
- "missing_fundamentals": topics that should exist but do not
- "example_gaps": topics needing code examples
- "edge_case_gaps": undocumented error scenarios
- "broken_references": internal references with no target
- "gap_severity_summary": counts by severity level
- name: synthesis
type: custom
target: skill_md
uses_history: true
enabled: true
prompt: >
Create a unified, coherent narrative from the merged content. The output
should read as if written by a single knowledgeable author, not as a
patchwork of multiple sources.
Synthesis guidelines:
1. Structure content logically — concepts build on each other
2. Lead with the most important information for each topic
3. Integrate code examples naturally within explanations
4. Use consistent voice, terminology, and formatting throughout
5. Add transition text between topics for narrative flow
6. Include a "Sources and Confidence" appendix noting where information came from
7. Mark any low-confidence or single-source claims with a caveat
8. Fill minor gaps by inference where safe to do so, clearly marking inferred content
Output JSON with:
- "synthesized_sections": array of {title, content, sources_used, confidence}
- "section_order": recommended reading order
- "inferred_content": content that was inferred rather than directly sourced
- "caveats": any warnings about content reliability
- name: quality_check
type: custom
target: quality
uses_history: true
enabled: true
prompt: >
Perform a final quality review of the synthesized output. Evaluate the
merge result against multiple quality dimensions.
Check for:
1. Completeness — does the output cover all topics from all sources?
2. Accuracy — are merged claims consistent and non-contradictory?
3. Coherence — does the document flow logically as a unified piece?
4. Attribution — are source contributions properly tracked?
5. Confidence calibration — are confidence levels appropriate?
6. Example quality — are code examples correct, runnable, and well-annotated?
7. Terminology consistency — is the canonical terminology used throughout?
8. Gap acknowledgment — are known gaps clearly communicated?
Scoring:
- Rate each dimension 1-10
- Provide specific issues found for any dimension scoring below 7
- Suggest concrete fixes for each issue
Output JSON with:
- "quality_scores": {completeness, accuracy, coherence, attribution, confidence_calibration, example_quality, terminology_consistency, gap_acknowledgment}
- "overall_score": weighted average (accuracy and completeness weighted 2x)
- "issues_found": array of {dimension, description, severity, suggested_fix}
- "merge_health": "excellent" | "good" | "needs_review" | "poor" based on overall score
- "recommendations": top 3 actions to improve merge quality
post_process:
reorder_sections:
- overview
- core_concepts
- api_reference
- examples
- advanced_topics
- troubleshooting
- sources_and_confidence
add_metadata:
enhanced: true
workflow: complex-merge
multi_source: true
conflict_resolution: priority
quality_checked: true
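The workflow's `source_priority_order` variable is a comma-separated ranking. A small sketch of how such a value could be turned into a rank lookup for priority-based merging (the helper name is hypothetical, not part of the workflow engine):

```python
def priority_rank(order_csv: str) -> dict[str, int]:
    """Map each source class to its rank; lower rank = higher priority."""
    return {name.strip(): i for i, name in enumerate(order_csv.split(","))}

ranks = priority_rank("official_docs,code,community")
```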

@@ -24,12 +24,12 @@ class TestParserRegistry:
def test_all_parsers_registered(self):
"""Test that all parsers are registered."""
assert len(PARSERS) == 25, f"Expected 25 parsers, got {len(PARSERS)}"
assert len(PARSERS) == 35, f"Expected 35 parsers, got {len(PARSERS)}"
def test_get_parser_names(self):
"""Test getting list of parser names."""
names = get_parser_names()
assert len(names) == 25
assert len(names) == 35
assert "scrape" in names
assert "github" in names
assert "package" in names
@@ -243,9 +243,9 @@ class TestBackwardCompatibility:
assert cmd in names, f"Command '{cmd}' not found in parser registry!"
def test_command_count_matches(self):
"""Test that we have exactly 25 commands (includes create, workflows, word, epub, video, and sync-config)."""
assert len(PARSERS) == 25
assert len(get_parser_names()) == 25
"""Test that we have exactly 35 commands (25 original + 10 new source types)."""
assert len(PARSERS) == 35
assert len(get_parser_names()) == 35
if __name__ == "__main__":

@@ -0,0 +1,824 @@
#!/usr/bin/env python3
"""
Tests for v3.2.0 new source type integration points.
Covers source detection, config validation, generic merge, CLI wiring,
and source validation for the 10 new source types: jupyter, html, openapi,
asciidoc, pptx, rss, manpage, confluence, notion, chat.
"""
import os
import textwrap
import pytest
from skill_seekers.cli.config_validator import ConfigValidator
from skill_seekers.cli.main import COMMAND_MODULES
from skill_seekers.cli.parsers import PARSERS, get_parser_names
from skill_seekers.cli.source_detector import SourceDetector, SourceInfo
from skill_seekers.cli.unified_skill_builder import UnifiedSkillBuilder
# ---------------------------------------------------------------------------
# 1. SourceDetector — new type detection
# ---------------------------------------------------------------------------
class TestSourceDetectorNewTypes:
"""Test that SourceDetector.detect() maps new extensions to correct types."""
# -- Jupyter --
def test_detect_ipynb(self):
"""Test .ipynb → jupyter detection."""
info = SourceDetector.detect("analysis.ipynb")
assert info.type == "jupyter"
assert info.parsed["file_path"] == "analysis.ipynb"
assert info.suggested_name == "analysis"
# -- HTML --
def test_detect_html_extension(self):
"""Test .html → html detection."""
info = SourceDetector.detect("page.html")
assert info.type == "html"
assert info.parsed["file_path"] == "page.html"
def test_detect_htm_extension(self):
"""Test .htm → html detection."""
info = SourceDetector.detect("index.HTM")
assert info.type == "html"
assert info.parsed["file_path"] == "index.HTM"
# -- PowerPoint --
def test_detect_pptx(self):
"""Test .pptx → pptx detection."""
info = SourceDetector.detect("slides.pptx")
assert info.type == "pptx"
assert info.parsed["file_path"] == "slides.pptx"
assert info.suggested_name == "slides"
# -- AsciiDoc --
def test_detect_adoc(self):
"""Test .adoc → asciidoc detection."""
info = SourceDetector.detect("manual.adoc")
assert info.type == "asciidoc"
assert info.parsed["file_path"] == "manual.adoc"
def test_detect_asciidoc_extension(self):
"""Test .asciidoc → asciidoc detection."""
info = SourceDetector.detect("guide.ASCIIDOC")
assert info.type == "asciidoc"
assert info.parsed["file_path"] == "guide.ASCIIDOC"
# -- Man pages --
def test_detect_man_extension(self):
"""Test .man → manpage detection."""
info = SourceDetector.detect("curl.man")
assert info.type == "manpage"
assert info.parsed["file_path"] == "curl.man"
@pytest.mark.parametrize("section", range(1, 9))
def test_detect_man_sections(self, section):
"""Test .1 through .8 → manpage for simple basenames."""
filename = f"git.{section}"
info = SourceDetector.detect(filename)
assert info.type == "manpage", f"{filename} should detect as manpage"
assert info.suggested_name == "git"
def test_man_section_with_dotted_basename_not_detected(self):
"""Test that 'access.log.1' is NOT detected as a man page.
The heuristic checks that the basename (without extension) has no dots.
"""
# This should fall through to web/domain detection (has a dot, not a path)
info = SourceDetector.detect("access.log.1")
# access.log.1 has a dot in the basename-without-ext ("access.log"),
# so it should NOT be detected as manpage. It falls through to the
# domain inference branch because it contains a dot and doesn't start
# with '/'.
assert info.type != "manpage"
# -- RSS/Atom --
def test_detect_rss_extension(self):
"""Test .rss → rss detection."""
info = SourceDetector.detect("feed.rss")
assert info.type == "rss"
assert info.parsed["file_path"] == "feed.rss"
def test_detect_atom_extension(self):
"""Test .atom → rss detection."""
info = SourceDetector.detect("updates.atom")
assert info.type == "rss"
assert info.parsed["file_path"] == "updates.atom"
def test_xml_not_detected_as_rss(self):
"""Test .xml is NOT detected as rss (too generic).
The fix ensures .xml files do not get incorrectly classified as RSS feeds.
"""
# .xml has no special handling — it will fall through to domain inference
# or raise ValueError depending on contents. Either way, it must not
# be classified as "rss".
info = SourceDetector.detect("data.xml")
assert info.type != "rss"
# -- OpenAPI --
def test_yaml_with_openapi_content_detected(self, tmp_path):
"""Test .yaml with 'openapi:' key → openapi detection."""
spec = tmp_path / "petstore.yaml"
spec.write_text(
textwrap.dedent("""\
openapi: "3.0.0"
info:
title: Petstore
version: "1.0.0"
paths: {}
""")
)
info = SourceDetector.detect(str(spec))
assert info.type == "openapi"
assert info.parsed["file_path"] == str(spec)
assert info.suggested_name == "petstore"
def test_yaml_with_swagger_content_detected(self, tmp_path):
"""Test .yaml with 'swagger:' key → openapi detection."""
spec = tmp_path / "legacy.yml"
spec.write_text(
textwrap.dedent("""\
swagger: "2.0"
info:
title: Legacy API
basePath: /v1
""")
)
info = SourceDetector.detect(str(spec))
assert info.type == "openapi"
def test_yaml_without_openapi_not_detected(self, tmp_path):
"""Test .yaml without OpenAPI content is NOT detected as openapi.
When the YAML file doesn't contain openapi/swagger keys the detector
skips OpenAPI and falls through. For an absolute path it will raise
ValueError (cannot determine type), which still confirms it was NOT
classified as openapi.
"""
plain = tmp_path / "config.yaml"
plain.write_text("name: my-project\nversion: 1.0\n")
# Absolute path falls through to ValueError (no matching type).
# Either way, it must NOT be "openapi".
try:
info = SourceDetector.detect(str(plain))
assert info.type != "openapi"
except ValueError:
# Raised because source type cannot be determined — this is fine,
# the important thing is it was not classified as openapi.
pass
def test_looks_like_openapi_returns_false_for_missing_file(self):
"""Test _looks_like_openapi returns False for non-existent file."""
assert SourceDetector._looks_like_openapi("/nonexistent/spec.yaml") is False
def test_looks_like_openapi_json_key_format(self, tmp_path):
"""Test _looks_like_openapi detects JSON-style keys (quoted)."""
spec = tmp_path / "api.yaml"
spec.write_text('"openapi": "3.0.0"\n')
assert SourceDetector._looks_like_openapi(str(spec)) is True
# ---------------------------------------------------------------------------
# 2. ConfigValidator — new source type validation
# ---------------------------------------------------------------------------
class TestConfigValidatorNewTypes:
"""Test ConfigValidator VALID_SOURCE_TYPES and per-type validation."""
# All 17 expected types
EXPECTED_TYPES = {
"documentation",
"github",
"pdf",
"local",
"word",
"video",
"epub",
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"confluence",
"notion",
"rss",
"manpage",
"chat",
}
def test_all_17_types_present(self):
"""Test that VALID_SOURCE_TYPES contains all 17 types."""
assert ConfigValidator.VALID_SOURCE_TYPES == self.EXPECTED_TYPES
def test_unknown_type_rejected(self):
"""Test that an unknown source type is rejected during validation."""
config = {
"name": "test",
"description": "test",
"sources": [{"type": "foobar"}],
}
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Invalid type 'foobar'"):
validator.validate()
# --- Per-type required-field validation ---
def _make_config(self, source: dict) -> dict:
"""Helper: wrap a source dict in a valid config structure."""
return {
"name": "test",
"description": "test",
"sources": [source],
}
def test_epub_requires_path(self):
"""Test epub source validation requires 'path'."""
config = self._make_config({"type": "epub"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path'"):
validator.validate()
def test_jupyter_requires_path(self):
"""Test jupyter source validation requires 'path'."""
config = self._make_config({"type": "jupyter"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path'"):
validator.validate()
def test_html_requires_path(self):
"""Test html source validation requires 'path'."""
config = self._make_config({"type": "html"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path'"):
validator.validate()
def test_openapi_requires_path_or_url(self):
"""Test openapi source validation requires 'path' or 'url'."""
config = self._make_config({"type": "openapi"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path' or 'url'"):
validator.validate()
def test_openapi_accepts_url(self):
"""Test openapi source passes validation with 'url'."""
config = self._make_config({"type": "openapi", "url": "https://example.com/spec.yaml"})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_pptx_requires_path(self):
"""Test pptx source validation requires 'path'."""
config = self._make_config({"type": "pptx"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path'"):
validator.validate()
def test_asciidoc_requires_path(self):
"""Test asciidoc source validation requires 'path'."""
config = self._make_config({"type": "asciidoc"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path'"):
validator.validate()
def test_confluence_requires_url_or_path(self):
"""Test confluence requires 'url'/'base_url' or 'path'."""
config = self._make_config({"type": "confluence"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field"):
validator.validate()
def test_confluence_accepts_base_url(self):
"""Test confluence passes with base_url + space_key."""
config = self._make_config(
{
"type": "confluence",
"base_url": "https://wiki.example.com",
"space_key": "DEV",
}
)
validator = ConfigValidator(config)
assert validator.validate() is True
def test_confluence_accepts_path(self):
"""Test confluence passes with export path."""
config = self._make_config({"type": "confluence", "path": "/exports/wiki"})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_notion_requires_url_or_path(self):
"""Test notion requires 'url'/'database_id'/'page_id' or 'path'."""
config = self._make_config({"type": "notion"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field"):
validator.validate()
def test_notion_accepts_page_id(self):
"""Test notion passes with page_id."""
config = self._make_config({"type": "notion", "page_id": "abc123"})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_notion_accepts_database_id(self):
"""Test notion passes with database_id."""
config = self._make_config({"type": "notion", "database_id": "db-456"})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_rss_requires_url_or_path(self):
"""Test rss source validation requires 'url' or 'path'."""
config = self._make_config({"type": "rss"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'url' or 'path'"):
validator.validate()
def test_rss_accepts_url(self):
"""Test rss passes with url."""
config = self._make_config({"type": "rss", "url": "https://blog.example.com/feed.xml"})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_manpage_requires_path_or_names(self):
"""Test manpage source validation requires 'path' or 'names'."""
config = self._make_config({"type": "manpage"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path' or 'names'"):
validator.validate()
def test_manpage_accepts_names(self):
"""Test manpage passes with 'names' list."""
config = self._make_config({"type": "manpage", "names": ["git", "curl"]})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_chat_requires_path_or_token(self):
"""Test chat source validation requires 'path' or 'token'."""
config = self._make_config({"type": "chat"})
validator = ConfigValidator(config)
with pytest.raises(ValueError, match="Missing required field 'path'.*or 'token'"):
validator.validate()
def test_chat_accepts_path(self):
"""Test chat passes with export path."""
config = self._make_config({"type": "chat", "path": "/exports/slack"})
validator = ConfigValidator(config)
assert validator.validate() is True
def test_chat_accepts_token_with_channel(self):
"""Test chat passes with API token + channel."""
config = self._make_config(
{
"type": "chat",
"token": "xoxb-fake",
"channel": "#general",
}
)
validator = ConfigValidator(config)
assert validator.validate() is True
# ---------------------------------------------------------------------------
# 3. UnifiedSkillBuilder — generic merge system
# ---------------------------------------------------------------------------
class TestUnifiedSkillBuilderGenericMerge:
"""Test _generic_merge, _append_extra_sources, and _SOURCE_LABELS."""
def _make_builder(self, tmp_path) -> UnifiedSkillBuilder:
"""Create a minimal builder instance for testing."""
config = {
"name": "test_project",
"description": "A test project for merge testing",
"sources": [
{"type": "jupyter", "path": "nb.ipynb"},
{"type": "rss", "url": "https://example.com/feed.rss"},
],
}
scraped_data: dict = {}
builder = UnifiedSkillBuilder(
config=config,
scraped_data=scraped_data,
cache_dir=str(tmp_path / "cache"),
)
# Override skill_dir to use tmp_path
builder.skill_dir = str(tmp_path / "output" / "test_project")
os.makedirs(builder.skill_dir, exist_ok=True)
os.makedirs(os.path.join(builder.skill_dir, "references"), exist_ok=True)
return builder
def test_generic_merge_produces_valid_markdown(self, tmp_path):
"""Test _generic_merge with two source types produces markdown."""
builder = self._make_builder(tmp_path)
skill_mds = {
"jupyter": "## When to Use\n\nFor data analysis.\n\n## Quick Reference\n\nImport pandas.",
"rss": "## When to Use\n\nFor feed monitoring.\n\n## Feed Items\n\nLatest entries.",
}
result = builder._generic_merge(skill_mds)
# Must be non-empty markdown
assert len(result) > 100
# Must contain the project title
assert "Test Project" in result
def test_generic_merge_includes_yaml_frontmatter(self, tmp_path):
"""Test _generic_merge includes YAML frontmatter."""
builder = self._make_builder(tmp_path)
skill_mds = {
"html": "## Overview\n\nHTML content here.",
}
result = builder._generic_merge(skill_mds)
assert result.startswith("---\n")
assert "name: test-project" in result
assert "description: A test project" in result
def test_generic_merge_attributes_content_to_sources(self, tmp_path):
"""Test _generic_merge attributes content to correct source labels."""
builder = self._make_builder(tmp_path)
skill_mds = {
"jupyter": "## Overview\n\nNotebook content.",
"pptx": "## Overview\n\nSlide content.",
}
result = builder._generic_merge(skill_mds)
# Check source labels appear
assert "Jupyter Notebook" in result
assert "PowerPoint Presentation" in result
def test_generic_merge_single_source_section(self, tmp_path):
"""Test section unique to one source has 'From <Label>' attribution."""
builder = self._make_builder(tmp_path)
skill_mds = {
"manpage": "## Synopsis\n\ngit [options]",
}
result = builder._generic_merge(skill_mds)
assert "*From Man Page*" in result
assert "## Synopsis" in result
def test_generic_merge_multi_source_section(self, tmp_path):
"""Test section shared by multiple sources gets sub-headings per source."""
builder = self._make_builder(tmp_path)
skill_mds = {
"asciidoc": "## Quick Reference\n\nAsciiDoc quick ref.",
"html": "## Quick Reference\n\nHTML quick ref.",
}
result = builder._generic_merge(skill_mds)
# Both sources should be attributed under the shared section
assert "### From AsciiDoc Document" in result
assert "### From HTML Document" in result
def test_generic_merge_footer(self, tmp_path):
"""Test _generic_merge ends with the standard footer."""
builder = self._make_builder(tmp_path)
skill_mds = {
"rss": "## Feeds\n\nSome feeds.",
}
result = builder._generic_merge(skill_mds)
assert "Generated by Skill Seeker" in result
def test_generic_merge_merged_from_line(self, tmp_path):
"""Test _generic_merge includes 'Merged from:' with correct labels."""
builder = self._make_builder(tmp_path)
skill_mds = {
"confluence": "## Pages\n\nWiki pages.",
"notion": "## Databases\n\nNotion DBs.",
}
result = builder._generic_merge(skill_mds)
assert "*Merged from: Confluence Wiki, Notion Page*" in result
def test_append_extra_sources_adds_sections(self, tmp_path):
"""Test _append_extra_sources adds new sections to base content."""
builder = self._make_builder(tmp_path)
base_content = "# Test\n\nIntro.\n\n## Main Section\n\nContent.\n\n---\n\n*Footer*\n"
skill_mds = {
"epub": "## Chapters\n\nChapter list.\n\n## Key Concepts\n\nConcept A.",
}
result = builder._append_extra_sources(base_content, skill_mds, {"epub"})
# The extra source content should be inserted before the footer separator
assert "EPUB E-book Content" in result
assert "Chapters" in result
assert "Key Concepts" in result
# Original content should still be present
assert "# Test" in result
assert "## Main Section" in result
def test_append_extra_sources_preserves_footer(self, tmp_path):
"""Test _append_extra_sources keeps the footer intact."""
builder = self._make_builder(tmp_path)
base_content = "# Test\n\n---\n\n*Footer*\n"
skill_mds = {
"chat": "## Messages\n\nChat history.",
}
result = builder._append_extra_sources(base_content, skill_mds, {"chat"})
assert "*Footer*" in result
def test_source_labels_has_all_17_types(self):
"""Test _SOURCE_LABELS has entries for all 17 source types."""
expected = {
"documentation",
"github",
"pdf",
"word",
"epub",
"video",
"local",
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"confluence",
"notion",
"rss",
"manpage",
"chat",
}
assert set(UnifiedSkillBuilder._SOURCE_LABELS.keys()) == expected
def test_source_labels_values_are_nonempty_strings(self):
"""Test all _SOURCE_LABELS values are non-empty strings."""
for key, label in UnifiedSkillBuilder._SOURCE_LABELS.items():
assert isinstance(label, str), f"Label for '{key}' is not a string"
assert len(label) > 0, f"Label for '{key}' is empty"
# ---------------------------------------------------------------------------
# 4. COMMAND_MODULES and parser wiring
# ---------------------------------------------------------------------------
class TestCommandModules:
"""Test that all 10 new source types are wired into CLI."""
NEW_COMMAND_NAMES = [
"jupyter",
"html",
"openapi",
"asciidoc",
"pptx",
"rss",
"manpage",
"confluence",
"notion",
"chat",
]
def test_new_types_in_command_modules(self):
"""Test all 10 new source types are in COMMAND_MODULES."""
for cmd in self.NEW_COMMAND_NAMES:
assert cmd in COMMAND_MODULES, f"'{cmd}' not in COMMAND_MODULES"
def test_command_modules_values_are_module_paths(self):
"""Test COMMAND_MODULES values look like importable module paths."""
for cmd in self.NEW_COMMAND_NAMES:
module_path = COMMAND_MODULES[cmd]
assert module_path.startswith("skill_seekers.cli."), (
f"Module path for '{cmd}' doesn't start with 'skill_seekers.cli.'"
)
def test_new_parser_names_include_all_10(self):
"""Test that get_parser_names() includes all 10 new source types."""
names = get_parser_names()
for cmd in self.NEW_COMMAND_NAMES:
assert cmd in names, f"Parser '{cmd}' not registered"
def test_total_parser_count(self):
"""Test total PARSERS count is 35 (25 original + 10 new)."""
assert len(PARSERS) == 35
def test_no_duplicate_parser_names(self):
"""Test no duplicate parser names exist."""
names = get_parser_names()
assert len(names) == len(set(names)), "Duplicate parser names found!"
def test_command_module_count(self):
"""Test COMMAND_MODULES has expected number of entries."""
# 25 original + 10 new = 35
assert len(COMMAND_MODULES) == 35
# ---------------------------------------------------------------------------
# 5. SourceDetector.validate_source — new types
# ---------------------------------------------------------------------------
class TestSourceDetectorValidation:
"""Test validate_source for new file-based source types."""
def test_validation_passes_for_existing_jupyter(self, tmp_path):
"""Test validation passes for an existing .ipynb file."""
nb = tmp_path / "test.ipynb"
nb.write_text('{"cells": []}')
info = SourceInfo(
type="jupyter",
parsed={"file_path": str(nb)},
suggested_name="test",
raw_input=str(nb),
)
# Should not raise
SourceDetector.validate_source(info)
def test_validation_raises_for_nonexistent_jupyter(self):
"""Test validation raises ValueError for non-existent file."""
info = SourceInfo(
type="jupyter",
parsed={"file_path": "/nonexistent/notebook.ipynb"},
suggested_name="notebook",
raw_input="/nonexistent/notebook.ipynb",
)
with pytest.raises(ValueError, match="does not exist"):
SourceDetector.validate_source(info)
def test_validation_passes_for_existing_html(self, tmp_path):
"""Test validation passes for an existing .html file."""
html = tmp_path / "page.html"
html.write_text("<html></html>")
info = SourceInfo(
type="html",
parsed={"file_path": str(html)},
suggested_name="page",
raw_input=str(html),
)
SourceDetector.validate_source(info)
def test_validation_raises_for_nonexistent_pptx(self):
"""Test validation raises ValueError for non-existent pptx."""
info = SourceInfo(
type="pptx",
parsed={"file_path": "/nonexistent/slides.pptx"},
suggested_name="slides",
raw_input="/nonexistent/slides.pptx",
)
with pytest.raises(ValueError, match="does not exist"):
SourceDetector.validate_source(info)
def test_validation_passes_for_existing_openapi(self, tmp_path):
"""Test validation passes for an existing OpenAPI spec file."""
spec = tmp_path / "api.yaml"
spec.write_text("openapi: '3.0.0'\n")
info = SourceInfo(
type="openapi",
parsed={"file_path": str(spec)},
suggested_name="api",
raw_input=str(spec),
)
SourceDetector.validate_source(info)
def test_validation_raises_for_nonexistent_asciidoc(self):
"""Test validation raises ValueError for non-existent asciidoc."""
info = SourceInfo(
type="asciidoc",
parsed={"file_path": "/nonexistent/doc.adoc"},
suggested_name="doc",
raw_input="/nonexistent/doc.adoc",
)
with pytest.raises(ValueError, match="does not exist"):
SourceDetector.validate_source(info)
def test_validation_raises_for_nonexistent_manpage(self):
"""Test validation raises ValueError for non-existent manpage."""
info = SourceInfo(
type="manpage",
parsed={"file_path": "/nonexistent/git.1"},
suggested_name="git",
raw_input="/nonexistent/git.1",
)
with pytest.raises(ValueError, match="does not exist"):
SourceDetector.validate_source(info)
def test_validation_passes_for_existing_manpage(self, tmp_path):
"""Test validation passes for an existing man page file."""
man = tmp_path / "curl.1"
man.write_text(".TH CURL 1\n")
info = SourceInfo(
type="manpage",
parsed={"file_path": str(man)},
suggested_name="curl",
raw_input=str(man),
)
SourceDetector.validate_source(info)
def test_rss_url_validation_no_file_check(self):
"""Test rss validation passes for URL-based source (no file check)."""
info = SourceInfo(
type="rss",
parsed={"url": "https://example.com/feed.rss"},
suggested_name="feed",
raw_input="https://example.com/feed.rss",
)
# rss validation only checks file if file_path is present; URL should pass
SourceDetector.validate_source(info)
def test_rss_validation_raises_for_nonexistent_file(self):
"""Test rss validation raises for non-existent local file."""
info = SourceInfo(
type="rss",
parsed={"file_path": "/nonexistent/feed.rss"},
suggested_name="feed",
raw_input="/nonexistent/feed.rss",
)
with pytest.raises(ValueError, match="does not exist"):
SourceDetector.validate_source(info)
def test_rss_validation_passes_for_existing_file(self, tmp_path):
"""Test rss validation passes for an existing .rss file."""
rss = tmp_path / "feed.rss"
rss.write_text("<rss></rss>")
info = SourceInfo(
type="rss",
parsed={"file_path": str(rss)},
suggested_name="feed",
raw_input=str(rss),
)
SourceDetector.validate_source(info)
def test_validation_passes_for_directory_types(self, tmp_path):
"""Test validation passes when source is a directory (e.g., html dir)."""
html_dir = tmp_path / "pages"
html_dir.mkdir()
info = SourceInfo(
type="html",
parsed={"file_path": str(html_dir)},
suggested_name="pages",
raw_input=str(html_dir),
)
# The validator allows directories for these types (isfile or isdir)
SourceDetector.validate_source(info)
# ---------------------------------------------------------------------------
# 6. CreateCommand._route_generic coverage
# ---------------------------------------------------------------------------
class TestCreateCommandRouting:
"""Test that CreateCommand._route_to_scraper maps new types to _route_generic."""
# We can't easily call _route_to_scraper (it imports real scrapers),
# but we verify the routing table is correct by checking the method source.
GENERIC_ROUTES = {
"jupyter": ("jupyter_scraper", "--notebook"),
"html": ("html_scraper", "--html-path"),
"openapi": ("openapi_scraper", "--spec"),
"asciidoc": ("asciidoc_scraper", "--asciidoc-path"),
"pptx": ("pptx_scraper", "--pptx"),
"rss": ("rss_scraper", "--feed-path"),
"manpage": ("man_scraper", "--man-path"),
"confluence": ("confluence_scraper", "--export-path"),
"notion": ("notion_scraper", "--export-path"),
"chat": ("chat_scraper", "--export-path"),
}
def test_route_to_scraper_source_coverage(self):
"""Test _route_to_scraper method handles all 10 new types.
We inspect the method source to verify each type has a branch.
"""
import inspect
source = inspect.getsource(
__import__(
"skill_seekers.cli.create_command",
fromlist=["CreateCommand"],
).CreateCommand._route_to_scraper
)
for source_type in self.GENERIC_ROUTES:
assert f'"{source_type}"' in source, (
f"_route_to_scraper missing branch for '{source_type}'"
)
def test_generic_route_module_names(self):
"""Test _route_generic is called with correct module names."""
import inspect
source = inspect.getsource(
__import__(
"skill_seekers.cli.create_command",
fromlist=["CreateCommand"],
).CreateCommand._route_to_scraper
)
for source_type, (module, flag) in self.GENERIC_ROUTES.items():
assert f'"{module}"' in source, f"Module name '{module}' not found for '{source_type}'"
assert f'"{flag}"' in source, f"Flag '{flag}' not found for '{source_type}'"
if __name__ == "__main__":
pytest.main([__file__, "-v"])

uv.lock generated

@@ -220,6 +220,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/7f/9c/36c5c37947ebfb8c7f22e0eb6e4d188ee2d53aa3880f3f2744fb894f0cb1/anyio-4.12.0-py3-none-any.whl", hash = "sha256:dad2376a628f98eeca4881fc56cd06affd18f659b17a747d3ff0307ced94b1bb", size = 113362, upload-time = "2025-11-28T23:36:57.897Z" },
]
[[package]]
name = "asciidoc"
version = "10.2.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1d/e7/315a82f2d256e9270977aa3c15e8fe281fd7c40b8e2a0b97e0cb61ca8fa0/asciidoc-10.2.1.tar.gz", hash = "sha256:d9f13c285981b3c7eb660d02ca0a2779981e88d48105de81bb40445e60dddb83", size = 230179, upload-time = "2024-07-17T03:12:52.681Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/75/1f/87941eaa96e86aa22086064f67e4187e2710fb76c147312979ea29278dac/asciidoc-10.2.1-py2.py3-none-any.whl", hash = "sha256:3f277a636b617c9ce7e0b87bcaea51f144500e9a5c8a6488421ee24594850d40", size = 272433, upload-time = "2024-07-17T03:12:49.012Z" },
]
[[package]]
name = "async-timeout"
version = "5.0.1"
@@ -229,6 +238,24 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/fe/ba/e2081de779ca30d473f21f5b30e0e737c438205440784c7dfc81efc2b029/async_timeout-5.0.1-py3-none-any.whl", hash = "sha256:39e3809566ff85354557ec2398b55e096c8364bacac9405a7a1fa429e77fe76c", size = 6233, upload-time = "2024-11-06T16:41:37.9Z" },
]
[[package]]
name = "atlassian-python-api"
version = "4.0.7"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "beautifulsoup4" },
{ name = "deprecated" },
{ name = "jmespath" },
{ name = "oauthlib" },
{ name = "requests" },
{ name = "requests-oauthlib" },
{ name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/40/e8/f23b7273e410c6fe9f98f9db25268c6736572f22a9566d1dc9ed3614bb68/atlassian_python_api-4.0.7.tar.gz", hash = "sha256:8d9cc6068b1d2a48eb434e22e57f6bbd918a47fac9e46b95b7a3cefb00fceacb", size = 271149, upload-time = "2025-08-21T13:19:40.746Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/1d/83/e4f9976ce3c933a079b8931325e7a9c0a8bba7030a2cb85764c0048f3479/atlassian_python_api-4.0.7-py3-none-any.whl", hash = "sha256:46a70cb29eaab87c0a1697fccd3e25df1aa477e6aa4fb9ba936a9d46b425933c", size = 197746, upload-time = "2025-08-21T13:19:39.044Z" },
]
[[package]]
name = "attrs"
version = "25.4.0"
@@ -1135,6 +1162,27 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/05/99/49ee85903dee060d9f08297b4a342e5e0bcfca2f027a07b4ee0a38ab13f9/faster_whisper-1.2.1-py3-none-any.whl", hash = "sha256:79a66ad50688c0b794dd501dc340a736992a6342f7f95e5811be60b5224a26a7", size = 1118909, upload-time = "2025-10-31T11:35:47.794Z" },
]
[[package]]
name = "fastjsonschema"
version = "2.21.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/20/b5/23b216d9d985a956623b6bd12d4086b60f0059b27799f23016af04a74ea1/fastjsonschema-2.21.2.tar.gz", hash = "sha256:b1eb43748041c880796cd077f1a07c3d94e93ae84bba5ed36800a33554ae05de", size = 374130, upload-time = "2025-08-14T18:49:36.666Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/cb/a8/20d0723294217e47de6d9e2e40fd4a9d2f7c4b6ef974babd482a59743694/fastjsonschema-2.21.2-py3-none-any.whl", hash = "sha256:1c797122d0a86c5cace2e54bf4e819c36223b552017172f32c5c024a6b77e463", size = 24024, upload-time = "2025-08-14T18:49:34.776Z" },
]
[[package]]
name = "feedparser"
version = "6.0.12"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "sgmllib3k" },
]
sdist = { url = "https://files.pythonhosted.org/packages/dc/79/db7edb5e77d6dfbc54d7d9df72828be4318275b2e580549ff45a962f6461/feedparser-6.0.12.tar.gz", hash = "sha256:64f76ce90ae3e8ef5d1ede0f8d3b50ce26bcce71dd8ae5e82b1cd2d4a5f94228", size = 286579, upload-time = "2025-09-10T13:33:59.486Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/4e/eb/c96d64137e29ae17d83ad2552470bafe3a7a915e85434d9942077d7fd011/feedparser-6.0.12-py3-none-any.whl", hash = "sha256:6bbff10f5a52662c00a2e3f86a38928c37c48f77b3c511aedcd51de933549324", size = 81480, upload-time = "2025-09-10T13:33:58.022Z" },
]
[[package]]
name = "ffmpeg-python"
version = "0.2.0"
@@ -2100,6 +2148,19 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/41/45/1a4ed80516f02155c51f51e8cedb3c1902296743db0bbc66608a0db2814f/jsonschema_specifications-2025.9.1-py3-none-any.whl", hash = "sha256:98802fee3a11ee76ecaca44429fda8a41bff98b00a0f2838151b113f210cc6fe", size = 18437, upload-time = "2025-09-08T01:34:57.871Z" },
]
[[package]]
name = "jupyter-core"
version = "5.9.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "platformdirs" },
{ name = "traitlets" },
]
sdist = { url = "https://files.pythonhosted.org/packages/02/49/9d1284d0dc65e2c757b74c6687b6d319b02f822ad039e5c512df9194d9dd/jupyter_core-5.9.1.tar.gz", hash = "sha256:4d09aaff303b9566c3ce657f580bd089ff5c91f5f89cf7d8846c3cdf465b5508", size = 89814, upload-time = "2025-10-16T19:19:18.444Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/e7/e7/80988e32bf6f73919a113473a604f5a8f09094de312b9d52b79c2df7612b/jupyter_core-5.9.1-py3-none-any.whl", hash = "sha256:ebf87fdc6073d142e114c72c9e29a9d7ca03fad818c5d300ce2adc1fb0743407", size = 29032, upload-time = "2025-10-16T19:19:16.783Z" },
]
[[package]]
name = "kubernetes"
version = "35.0.0"
@@ -3122,6 +3183,21 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/79/7b/2c79738432f5c924bef5071f933bcc9efd0473bac3b4aa584a6f7c1c8df8/mypy_extensions-1.1.0-py3-none-any.whl", hash = "sha256:1be4cccdb0f2482337c4743e60421de3a356cd97508abadd57d47403e94f5505", size = 4963, upload-time = "2025-04-22T14:54:22.983Z" },
]
[[package]]
name = "nbformat"
version = "5.10.4"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "fastjsonschema" },
{ name = "jsonschema" },
{ name = "jupyter-core" },
{ name = "traitlets" },
]
sdist = { url = "https://files.pythonhosted.org/packages/6d/fd/91545e604bc3dad7dca9ed03284086039b294c6b3d75c0d2fa45f9e9caf3/nbformat-5.10.4.tar.gz", hash = "sha256:322168b14f937a5d11362988ecac2a4952d3d8e3a2cbeb2319584631226d5b3a", size = 142749, upload-time = "2024-04-04T11:20:37.371Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a9/82/0340caa499416c78e5d8f5f05947ae4bc3cba53c9f038ab6e9ed964e22f1/nbformat-5.10.4-py3-none-any.whl", hash = "sha256:3b48d6c8fbca4b299bf3982ea7db1af21580e4fec269ad087b9e81588891200b", size = 78454, upload-time = "2024-04-04T11:20:34.895Z" },
]
[[package]]
name = "nest-asyncio"
version = "1.6.0"
@@ -3173,6 +3249,18 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/60/90/81ac364ef94209c100e12579629dc92bf7a709a84af32f8c551b02c07e94/nltk-3.9.2-py3-none-any.whl", hash = "sha256:1e209d2b3009110635ed9709a67a1a3e33a10f799490fa71cf4bec218c11c88a", size = 1513404, upload-time = "2025-10-01T07:19:21.648Z" },
]
[[package]]
name = "notion-client"
version = "3.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "httpx" },
]
sdist = { url = "https://files.pythonhosted.org/packages/a5/39/60afcbc0148c3dafaaefe851ae3f058077db49d66288dfb218a11a57b997/notion_client-3.0.0.tar.gz", hash = "sha256:05c4d2b4fa3491dc0de21c9c826277202ea8b8714077ee7f51a6e1a09ab23d0f", size = 31357, upload-time = "2026-02-16T11:15:48.024Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/aa/ce/6b03f9aedd2edfcc28e23ced5c2582d543f6ddbb2be5c570533f02890b27/notion_client-3.0.0-py2.py3-none-any.whl", hash = "sha256:177fc3d2ace7e8ef69cf96f46269e8a66071c2c7c526194bf06ce7925853e759", size = 18746, upload-time = "2026-02-16T11:15:46.602Z" },
]
[[package]]
name = "numpy"
version = "2.2.6"
@@ -4789,6 +4877,21 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/aa/76/03af049af4dcee5d27442f71b6924f01f3efb5d2bd34f23fcd563f2cc5f5/python_multipart-0.0.21-py3-none-any.whl", hash = "sha256:cf7a6713e01c87aa35387f4774e812c4361150938d20d232800f75ffcf266090", size = 24541, upload-time = "2025-12-17T09:24:21.153Z" },
]
[[package]]
name = "python-pptx"
version = "1.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "lxml" },
{ name = "pillow" },
{ name = "typing-extensions" },
{ name = "xlsxwriter" },
]
sdist = { url = "https://files.pythonhosted.org/packages/52/a9/0c0db8d37b2b8a645666f7fd8accea4c6224e013c42b1d5c17c93590cd06/python_pptx-1.0.2.tar.gz", hash = "sha256:479a8af0eaf0f0d76b6f00b0887732874ad2e3188230315290cd1f9dd9cc7095", size = 10109297, upload-time = "2024-08-07T17:33:37.772Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d9/4f/00be2196329ebbff56ce564aa94efb0fbc828d00de250b1980de1a34ab49/python_pptx-1.0.2-py3-none-any.whl", hash = "sha256:160838e0b8565a8b1f67947675886e9fea18aa5e795db7ae531606d68e785cba", size = 472788, upload-time = "2024-08-07T17:33:28.192Z" },
]
[[package]]
name = "pytz"
version = "2025.2"
@@ -5570,6 +5673,12 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/e1/e3/c164c88b2e5ce7b24d667b9bd83589cf4f3520d97cad01534cd3c4f55fdb/setuptools-81.0.0-py3-none-any.whl", hash = "sha256:fdd925d5c5d9f62e4b74b30d6dd7828ce236fd6ed998a08d81de62ce5a6310d6", size = 1062021, upload-time = "2026-02-06T21:10:37.175Z" },
]
[[package]]
name = "sgmllib3k"
version = "1.0.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/9e/bd/3704a8c3e0942d711c1299ebf7b9091930adae6675d7c8f476a7ce48653c/sgmllib3k-1.0.0.tar.gz", hash = "sha256:7868fb1c8bfa764c1ac563d3cf369c381d1325d36124933a726f29fcdaa812e9", size = 5750, upload-time = "2010-08-24T14:33:52.445Z" }
[[package]]
name = "shellingham"
version = "1.5.4"
@@ -5619,23 +5728,30 @@ dependencies = [
[package.optional-dependencies]
all = [
{ name = "asciidoc" },
{ name = "atlassian-python-api" },
{ name = "azure-storage-blob" },
{ name = "boto3" },
{ name = "chromadb" },
{ name = "ebooklib" },
{ name = "fastapi" },
{ name = "feedparser" },
{ name = "google-cloud-storage" },
{ name = "google-generativeai" },
{ name = "httpx" },
{ name = "httpx-sse" },
{ name = "mammoth" },
{ name = "mcp" },
{ name = "nbformat" },
{ name = "notion-client" },
{ name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" },
{ name = "numpy", version = "2.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" },
{ name = "openai" },
{ name = "pinecone" },
{ name = "python-docx" },
{ name = "python-pptx" },
{ name = "sentence-transformers" },
{ name = "slack-sdk" },
{ name = "sse-starlette" },
{ name = "starlette" },
{ name = "uvicorn" },
@@ -5653,12 +5769,21 @@ all-llms = [
{ name = "google-generativeai" },
{ name = "openai" },
]
asciidoc = [
{ name = "asciidoc" },
]
azure = [
{ name = "azure-storage-blob" },
]
chat = [
{ name = "slack-sdk" },
]
chroma = [
{ name = "chromadb" },
]
confluence = [
{ name = "atlassian-python-api" },
]
docx = [
{ name = "mammoth" },
{ name = "python-docx" },
@@ -5680,6 +5805,9 @@ gcs = [
gemini = [
{ name = "google-generativeai" },
]
jupyter = [
{ name = "nbformat" },
]
mcp = [
{ name = "httpx" },
{ name = "httpx-sse" },
@@ -5688,18 +5816,27 @@ mcp = [
{ name = "starlette" },
{ name = "uvicorn" },
]
notion = [
{ name = "notion-client" },
]
openai = [
{ name = "openai" },
]
pinecone = [
{ name = "pinecone" },
]
pptx = [
{ name = "python-pptx" },
]
rag-upload = [
{ name = "chromadb" },
{ name = "pinecone" },
{ name = "sentence-transformers" },
{ name = "weaviate-client" },
]
rss = [
{ name = "feedparser" },
]
s3 = [
{ name = "boto3" },
]
@@ -5743,6 +5880,10 @@ dev = [
[package.metadata]
requires-dist = [
{ name = "anthropic", specifier = ">=0.76.0" },
{ name = "asciidoc", marker = "extra == 'all'", specifier = ">=10.0.0" },
{ name = "asciidoc", marker = "extra == 'asciidoc'", specifier = ">=10.0.0" },
{ name = "atlassian-python-api", marker = "extra == 'all'", specifier = ">=3.41.0" },
{ name = "atlassian-python-api", marker = "extra == 'confluence'", specifier = ">=3.41.0" },
{ name = "azure-storage-blob", marker = "extra == 'all'", specifier = ">=12.19.0" },
{ name = "azure-storage-blob", marker = "extra == 'all-cloud'", specifier = ">=12.19.0" },
{ name = "azure-storage-blob", marker = "extra == 'azure'", specifier = ">=12.19.0" },
@@ -5759,6 +5900,8 @@ requires-dist = [
{ name = "fastapi", marker = "extra == 'all'", specifier = ">=0.109.0" },
{ name = "fastapi", marker = "extra == 'embedding'", specifier = ">=0.109.0" },
{ name = "faster-whisper", marker = "extra == 'video-full'", specifier = ">=1.0.0" },
{ name = "feedparser", marker = "extra == 'all'", specifier = ">=6.0.0" },
{ name = "feedparser", marker = "extra == 'rss'", specifier = ">=6.0.0" },
{ name = "gitpython", specifier = ">=3.1.40" },
{ name = "google-cloud-storage", marker = "extra == 'all'", specifier = ">=2.10.0" },
{ name = "google-cloud-storage", marker = "extra == 'all-cloud'", specifier = ">=2.10.0" },
@@ -5778,7 +5921,11 @@ requires-dist = [
{ name = "mammoth", marker = "extra == 'docx'", specifier = ">=1.6.0" },
{ name = "mcp", marker = "extra == 'all'", specifier = ">=1.25,<2" },
{ name = "mcp", marker = "extra == 'mcp'", specifier = ">=1.25,<2" },
{ name = "nbformat", marker = "extra == 'all'", specifier = ">=5.9.0" },
{ name = "nbformat", marker = "extra == 'jupyter'", specifier = ">=5.9.0" },
{ name = "networkx", specifier = ">=3.0" },
{ name = "notion-client", marker = "extra == 'all'", specifier = ">=2.0.0" },
{ name = "notion-client", marker = "extra == 'notion'", specifier = ">=2.0.0" },
{ name = "numpy", marker = "extra == 'all'", specifier = ">=1.24.0" },
{ name = "numpy", marker = "extra == 'embedding'", specifier = ">=1.24.0" },
{ name = "openai", marker = "extra == 'all'", specifier = ">=1.0.0" },
@@ -5799,6 +5946,8 @@ requires-dist = [
{ name = "python-docx", marker = "extra == 'all'", specifier = ">=1.1.0" },
{ name = "python-docx", marker = "extra == 'docx'", specifier = ">=1.1.0" },
{ name = "python-dotenv", specifier = ">=1.1.1" },
{ name = "python-pptx", marker = "extra == 'all'", specifier = ">=0.6.21" },
{ name = "python-pptx", marker = "extra == 'pptx'", specifier = ">=0.6.21" },
{ name = "pyyaml", specifier = ">=6.0" },
{ name = "requests", specifier = ">=2.32.5" },
{ name = "scenedetect", extras = ["opencv"], marker = "extra == 'video-full'", specifier = ">=0.6.4" },
@@ -5807,6 +5956,8 @@ requires-dist = [
{ name = "sentence-transformers", marker = "extra == 'embedding'", specifier = ">=2.3.0" },
{ name = "sentence-transformers", marker = "extra == 'rag-upload'", specifier = ">=2.2.0" },
{ name = "sentence-transformers", marker = "extra == 'sentence-transformers'", specifier = ">=2.2.0" },
{ name = "slack-sdk", marker = "extra == 'all'", specifier = ">=3.27.0" },
{ name = "slack-sdk", marker = "extra == 'chat'", specifier = ">=3.27.0" },
{ name = "sse-starlette", marker = "extra == 'all'", specifier = ">=3.0.2" },
{ name = "sse-starlette", marker = "extra == 'mcp'", specifier = ">=3.0.2" },
{ name = "starlette", marker = "extra == 'all'", specifier = ">=0.48.0" },
@@ -5827,7 +5978,7 @@ requires-dist = [
{ name = "yt-dlp", marker = "extra == 'video'", specifier = ">=2024.12.0" },
{ name = "yt-dlp", marker = "extra == 'video-full'", specifier = ">=2024.12.0" },
]
provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "epub", "video", "video-full", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "embedding", "all"]
provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "epub", "video", "video-full", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "jupyter", "asciidoc", "pptx", "confluence", "notion", "rss", "chat", "embedding", "all"]
[package.metadata.requires-dev]
dev = [
@@ -5846,6 +5997,15 @@ dev = [
{ name = "starlette", specifier = ">=0.31.0" },
]
[[package]]
name = "slack-sdk"
version = "3.41.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/22/35/fc009118a13187dd9731657c60138e5a7c2dea88681a7f04dc406af5da7d/slack_sdk-3.41.0.tar.gz", hash = "sha256:eb61eb12a65bebeca9cb5d36b3f799e836ed2be21b456d15df2627cfe34076ca", size = 250568, upload-time = "2026-03-12T16:10:11.381Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a1/df/2e4be347ff98281b505cc0ccf141408cdd25eb5ca9f3830deb361b2472d3/slack_sdk-3.41.0-py2.py3-none-any.whl", hash = "sha256:bb18dcdfff1413ec448e759cf807ec3324090993d8ab9111c74081623b692a89", size = 313885, upload-time = "2026-03-12T16:10:09.811Z" },
]
[[package]]
name = "smmap"
version = "5.0.2"
@@ -6233,6 +6393,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl", hash = "sha256:26445eca388f82e72884e0d580d5464cd801a3ea01e63e5601bdff9ba6a48de2", size = 78540, upload-time = "2024-11-24T20:12:19.698Z" },
]
[[package]]
name = "traitlets"
version = "5.14.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/eb/79/72064e6a701c2183016abbbfedaba506d81e30e232a68c9f0d6f6fcd1574/traitlets-5.14.3.tar.gz", hash = "sha256:9ed0579d3502c94b4b3732ac120375cda96f923114522847de4b3bb98b96b6b7", size = 161621, upload-time = "2024-04-19T11:11:49.746Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/00/c0/8f5d070730d7836adc9c9b6408dec68c6ced86b304a9b26a14df072a6e8c/traitlets-5.14.3-py3-none-any.whl", hash = "sha256:b74e89e397b1ed28cc831db7aea759ba6640cb3de13090ca145426688ff1ac4f", size = 85359, upload-time = "2024-04-19T11:11:46.763Z" },
]
[[package]]
name = "transformers"
version = "5.1.0"
@@ -6753,6 +6922,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/1f/f6/a933bd70f98e9cf3e08167fc5cd7aaaca49147e48411c0bd5ae701bb2194/wrapt-1.17.3-py3-none-any.whl", hash = "sha256:7171ae35d2c33d326ac19dd8facb1e82e5fd04ef8c6c0e394d7af55a55051c22", size = 23591, upload-time = "2025-08-12T05:53:20.674Z" },
]
[[package]]
name = "xlsxwriter"
version = "3.2.9"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/46/2c/c06ef49dc36e7954e55b802a8b231770d286a9758b3d936bd1e04ce5ba88/xlsxwriter-3.2.9.tar.gz", hash = "sha256:254b1c37a368c444eac6e2f867405cc9e461b0ed97a3233b2ac1e574efb4140c", size = 215940, upload-time = "2025-09-16T00:16:21.63Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/3a/0c/3662f4a66880196a590b202f0db82d919dd2f89e99a27fadef91c4a33d41/xlsxwriter-3.2.9-py3-none-any.whl", hash = "sha256:9a5db42bc5dff014806c58a20b9eae7322a134abb6fce3c92c181bfb275ec5b3", size = 175315, upload-time = "2025-09-16T00:16:20.108Z" },
]
[[package]]
name = "xxhash"
version = "3.6.0"