feat: add 10 new skill source types (17 total) with full pipeline integration
Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint, RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new skill source types. Each type is fully integrated across:

- Standalone CLI commands (`skill-seekers <type>`)
- Auto-detection via `skill-seekers create` (file extension + content sniffing)
- Unified multi-source configs (scraped_data, dispatch, config validation)
- Unified skill builder (generic merge + source-attributed synthesis)
- MCP server (scrape_generic tool with per-type flag mapping)
- pyproject.toml (entry points, optional deps, [all] group)

Also fixes: EPUB unified pipeline gap, missing word/video config validators, OpenAPI yaml import guard, MCP flag mismatch for all 10 types, and stale docstrings; adds 77 integration tests + a complex-merge workflow.

50 files changed, +20,201 lines
This commit is contained in:

AGENTS.md (933)
@@ -1,866 +1,171 @@
# AGENTS.md - Skill Seekers

Concise reference for AI coding agents. Skill Seekers is a Python CLI tool (v3.2.0) that converts documentation sites, GitHub repos, PDFs, videos, notebooks, wikis, and more into AI-ready skills for 16+ LLM platforms and RAG pipelines.

---
## Project Overview

**Skill Seekers** is a Python CLI tool that converts documentation websites, GitHub repositories, PDF files, and videos into AI-ready skills for LLM platforms and RAG (Retrieval-Augmented Generation) pipelines. It serves as the universal preprocessing layer for AI systems.

### Key Facts

| Attribute | Value |
|-----------|-------|
| **Current Version** | 3.2.0 |
| **Python Version** | 3.10+ (tested on 3.10, 3.11, 3.12, 3.13) |
| **License** | MIT |
| **Package Name** | `skill-seekers` (PyPI) |
| **Source Files** | 182 Python files |
| **Test Files** | 105+ test files |
| **Website** | https://skillseekersweb.com/ |
| **Repository** | https://github.com/yusufkaraaslan/Skill_Seekers |
### Supported Target Platforms

| Platform | Format | Use Case |
|----------|--------|----------|
| **Claude AI** | ZIP + YAML | Claude Code skills |
| **Google Gemini** | tar.gz | Gemini skills |
| **OpenAI ChatGPT** | ZIP + Vector Store | Custom GPTs |
| **LangChain** | Documents | QA chains, agents, retrievers |
| **LlamaIndex** | TextNodes | Query engines, chat engines |
| **Haystack** | Documents | Enterprise RAG pipelines |
| **Pinecone** | Ready for upsert | Production vector search |
| **Weaviate** | Vector objects | Vector database |
| **Qdrant** | Points | Vector database |
| **Chroma** | Documents | Local vector database |
| **FAISS** | Index files | Local similarity search |
| **Cursor IDE** | .cursorrules | AI coding assistant rules |
| **Windsurf** | .windsurfrules | AI coding rules |
| **Cline** | .clinerules + MCP | VS Code extension |
| **Continue.dev** | HTTP context | Universal IDE support |
| **Generic Markdown** | ZIP | Universal export |
### Core Workflow

1. **Scrape Phase** - Crawl documentation/GitHub/PDF/video sources
2. **Build Phase** - Organize content into categorized references
3. **Enhancement Phase** - AI-powered quality improvements (optional)
4. **Package Phase** - Create platform-specific packages
5. **Upload Phase** - Auto-upload to target platform (optional)

---
## Project Structure

```
/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/
├── src/skill_seekers/              # Main source code (src/ layout)
│   ├── cli/                        # CLI tools and commands (~70 modules)
│   │   ├── adaptors/               # Platform adaptors (Strategy pattern)
│   │   │   ├── base.py             # Abstract base class (SkillAdaptor)
│   │   │   ├── claude.py           # Claude AI adaptor
│   │   │   ├── gemini.py           # Google Gemini adaptor
│   │   │   ├── openai.py           # OpenAI ChatGPT adaptor
│   │   │   ├── markdown.py         # Generic Markdown adaptor
│   │   │   ├── chroma.py           # Chroma vector DB adaptor
│   │   │   ├── faiss_helpers.py    # FAISS index adaptor
│   │   │   ├── haystack.py         # Haystack RAG adaptor
│   │   │   ├── langchain.py        # LangChain adaptor
│   │   │   ├── llama_index.py      # LlamaIndex adaptor
│   │   │   ├── qdrant.py           # Qdrant vector DB adaptor
│   │   │   ├── weaviate.py         # Weaviate vector DB adaptor
│   │   │   └── streaming_adaptor.py # Streaming output adaptor
│   │   ├── arguments/              # CLI argument definitions
│   │   ├── parsers/                # Argument parsers
│   │   │   └── extractors/         # Content extractors
│   │   ├── presets/                # Preset configuration management
│   │   ├── storage/                # Cloud storage adaptors
│   │   ├── main.py                 # Unified CLI entry point
│   │   ├── create_command.py       # Unified create command
│   │   ├── doc_scraper.py          # Documentation scraper
│   │   ├── github_scraper.py       # GitHub repository scraper
│   │   ├── pdf_scraper.py          # PDF extraction
│   │   ├── word_scraper.py         # Word document scraper
│   │   ├── video_scraper.py        # Video extraction
│   │   ├── video_setup.py          # GPU detection & dependency installation
│   │   ├── unified_scraper.py      # Multi-source scraping
│   │   ├── codebase_scraper.py     # Local codebase analysis
│   │   ├── enhance_command.py      # AI enhancement command
│   │   ├── enhance_skill_local.py  # AI enhancement (local mode)
│   │   ├── package_skill.py        # Skill packager
│   │   ├── upload_skill.py         # Upload to platforms
│   │   ├── cloud_storage_cli.py    # Cloud storage CLI
│   │   ├── benchmark_cli.py        # Benchmarking CLI
│   │   ├── sync_cli.py             # Sync monitoring CLI
│   │   └── workflows_command.py    # Workflow management CLI
│   ├── mcp/                        # MCP server integration
│   │   ├── server_fastmcp.py       # FastMCP server (~708 lines)
│   │   ├── server_legacy.py        # Legacy server implementation
│   │   ├── server.py               # Server entry point
│   │   ├── agent_detector.py       # AI agent detection
│   │   ├── git_repo.py             # Git repository operations
│   │   ├── source_manager.py       # Config source management
│   │   └── tools/                  # MCP tool implementations
│   │       ├── config_tools.py     # Configuration tools
│   │       ├── packaging_tools.py  # Packaging tools
│   │       ├── scraping_tools.py   # Scraping tools
│   │       ├── source_tools.py     # Source management tools
│   │       ├── splitting_tools.py  # Config splitting tools
│   │       ├── vector_db_tools.py  # Vector database tools
│   │       └── workflow_tools.py   # Workflow management tools
│   ├── sync/                       # Sync monitoring module
│   │   ├── detector.py             # Change detection
│   │   ├── models.py               # Data models (Pydantic)
│   │   ├── monitor.py              # Monitoring logic
│   │   └── notifier.py             # Notification system
│   ├── benchmark/                  # Benchmarking framework
│   │   ├── framework.py            # Benchmark framework
│   │   ├── models.py               # Benchmark models
│   │   └── runner.py               # Benchmark runner
│   ├── embedding/                  # Embedding server
│   │   ├── server.py               # FastAPI embedding server
│   │   ├── generator.py            # Embedding generation
│   │   ├── cache.py                # Embedding cache
│   │   └── models.py               # Embedding models
│   ├── workflows/                  # YAML workflow presets (66 presets)
│   ├── _version.py                 # Version information (reads from pyproject.toml)
│   └── __init__.py                 # Package init
├── tests/                          # Test suite (105+ test files)
├── configs/                        # Preset configuration files
├── docs/                           # Documentation (80+ markdown files)
│   ├── integrations/               # Platform integration guides
│   ├── guides/                     # User guides
│   ├── reference/                  # API reference
│   ├── features/                   # Feature documentation
│   ├── blog/                       # Blog posts
│   └── roadmap/                    # Roadmap documents
├── examples/                       # Usage examples
├── .github/workflows/              # CI/CD workflows
├── pyproject.toml                  # Main project configuration
├── requirements.txt                # Pinned dependencies
├── mypy.ini                        # MyPy type checker configuration
├── Dockerfile                      # Main Docker image (multi-stage)
├── Dockerfile.mcp                  # MCP server Docker image
└── docker-compose.yml              # Full stack deployment
```

---
## Build and Development Commands

### Prerequisites

- Python 3.10 or higher
- pip or uv package manager
- Git (for GitHub scraping features)

### Setup (REQUIRED before any development)

```bash
# Install in editable mode - REQUIRED before running tests (src/ layout; tests fail without this)
pip install -e .

# Install with all platform dependencies
pip install -e ".[all-llms]"

# Install with all optional dependencies
pip install -e ".[all]"

# Install specific platforms only
pip install -e ".[gemini]"      # Google Gemini support
pip install -e ".[openai]"      # OpenAI ChatGPT support
pip install -e ".[mcp]"         # MCP server dependencies
pip install -e ".[s3]"          # AWS S3 support
pip install -e ".[gcs]"         # Google Cloud Storage
pip install -e ".[azure]"       # Azure Blob Storage
pip install -e ".[embedding]"   # Embedding server support
pip install -e ".[rag-upload]"  # Vector DB upload support

# Install dev dependencies
pip install -e ".[dev]"
```

**CRITICAL:** The project uses a `src/` layout. Tests WILL FAIL unless you install with `pip install -e .` first.
### Building

```bash
# Build package using uv (recommended)
uv build

# Or using standard build
python -m build

# Publish to PyPI
uv publish
```

### Docker

```bash
# Build Docker image
docker build -t skill-seekers .

# Run with docker-compose (includes vector databases)
docker-compose up -d

# Run MCP server only
docker-compose up -d mcp-server

# View logs
docker-compose logs -f mcp-server
```

---
## Testing Instructions

### Running Tests

**CRITICAL:** Never skip tests - all tests must pass before commits.

```bash
# Run ALL tests (requires pip install -e . first)
pytest tests/ -v

# Run a single test file
pytest tests/test_scraper_features.py -v
pytest tests/test_mcp_fastmcp.py -v
pytest tests/test_cloud_storage.py -v

# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html

# Run a single test function
pytest tests/test_scraper_features.py::test_detect_language -v

# Run a single test class method
pytest tests/test_adaptors/test_claude_adaptor.py::TestClaudeAdaptor::test_package -v

# E2E tests
pytest tests/test_e2e_three_stream_pipeline.py -v

# Skip slow tests
pytest tests/ -v -m "not slow"

# Run only integration tests
pytest tests/ -v -m integration

# Skip slow and integration tests
pytest tests/ -v -m "not slow and not integration"
```
### Test Architecture

- **105+ test files** covering all features
- **CI Matrix:** Ubuntu + macOS, Python 3.10-3.12
- Test markers defined in `pyproject.toml`:

| Marker | Description |
|--------|-------------|
| `slow` | Tests taking >5 seconds |
| `integration` | Requires external services (APIs) |
| `e2e` | End-to-end tests (resource-intensive) |
| `venv` | Requires virtual environment setup |
| `bootstrap` | Bootstrap skill specific |
| `benchmark` | Performance benchmark tests |
### Test Configuration

From `pyproject.toml`:

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "-v --tb=short --strict-markers"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
```

The `conftest.py` file checks that the package is installed before running tests.

---
## Code Style Guidelines

### Linting and Formatting

```bash
# Lint (ruff)
ruff check src/ tests/

# Auto-fix issues
ruff check src/ tests/ --fix

# Format (ruff)
ruff format --check src/ tests/
ruff format src/ tests/

# Type check (mypy)
mypy src/skill_seekers --show-error-codes --pretty
```
### Style Rules (ruff, from pyproject.toml)

- **Line length:** 100 characters
- **Target Python:** 3.10+
- **Enabled lint rules:** E, W, F, I, B, C4, UP, ARG, SIM
- **Ignored rules:** E501 (line length handled by formatter), F541 (f-string style), ARG002 (unused method args for interface compliance), B007 (intentional unused loop vars), I001 (formatter handles imports), SIM114 (readability preference)
- **Import sorting:** isort style with `skill_seekers` as first-party
### Imports

- Sort with isort (via ruff); `skill_seekers` is first-party
- Standard library → third-party → first-party, separated by blank lines
- Use `from __future__ import annotations` only if needed for forward refs
- Guard optional imports with try/except ImportError (see `adaptors/__init__.py` pattern)

### MyPy Configuration (from pyproject.toml)

```toml
[tool.mypy]
python_version = "3.10"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = false
disallow_incomplete_defs = false
check_untyped_defs = true
ignore_missing_imports = true
show_error_codes = true
pretty = true
```
### Naming Conventions

- **Files:** `snake_case.py`
- **Classes:** `PascalCase` (e.g., `SkillAdaptor`, `ClaudeAdaptor`)
- **Functions/methods:** `snake_case`
- **Constants:** `UPPER_CASE` (e.g., `ADAPTORS`, `DEFAULT_CHUNK_TOKENS`)
- **Private:** prefix with `_`

### Type Hints

- Gradual typing: add hints where practical, not enforced everywhere
- Use modern syntax: `str | None` not `Optional[str]`, `list[str]` not `List[str]`
- MyPy config: `disallow_untyped_defs = false`, `check_untyped_defs = true`, `ignore_missing_imports = true`

### Docstrings

- Module-level docstring on every file (triple-quoted, describes purpose)
- Google-style or standard docstrings for public functions/classes
- Include `Args:`, `Returns:`, `Raises:` sections where useful

### Error Handling

- Use specific exceptions, never bare `except:`
- Provide helpful error messages with context (see `get_adaptor()` in `adaptors/__init__.py`)
- Use `raise ValueError(...)` for invalid arguments, `raise RuntimeError(...)` for state errors
- Guard optional dependency imports with try/except and give clear install instructions on failure

### Suppressing Lint Warnings

- Use inline `# noqa: XXXX` comments (e.g., `# noqa: F401` for re-exports, `# noqa: ARG001` for required but unused params)
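The optional-import guard mentioned under Error Handling can be sketched as follows. Everything here is illustrative: `some_optional_lib` and the `[extra]` group are placeholder names, not real Skill Seekers dependencies.

```python
# Guarded optional dependency import with a helpful install hint.
# "some_optional_lib" is a placeholder module name for illustration only.
try:
    import some_optional_lib  # e.g. installed via pip install -e ".[extra]"
    HAS_OPTIONAL = True
except ImportError:
    some_optional_lib = None
    HAS_OPTIONAL = False


def require_optional():
    """Raise a clear, actionable error if the optional dependency is missing."""
    if not HAS_OPTIONAL:
        raise ImportError(
            "some_optional_lib is required for this feature. "
            'Install it with: pip install -e ".[extra]"'
        )
    return some_optional_lib
```

The module-level flag lets callers cheaply check availability, while `require_optional()` defers the hard failure (with install instructions) to the point of use.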
## Supported Source Types (17)

| Type | CLI Command | Config Type | Detection |
|------|------------|-------------|-----------|
| Documentation (web) | `scrape` / `create <url>` | `documentation` | HTTP/HTTPS URLs |
| GitHub repo | `github` / `create owner/repo` | `github` | `owner/repo` or github.com URLs |
| PDF | `pdf` / `create file.pdf` | `pdf` | `.pdf` extension |
| Word (.docx) | `word` / `create file.docx` | `word` | `.docx` extension |
| EPUB | `epub` / `create file.epub` | `epub` | `.epub` extension |
| Video | `video` / `create <url/file>` | `video` | YouTube/Vimeo URLs, video extensions |
| Local codebase | `analyze` / `create ./path` | `local` | Directory paths |
| Jupyter Notebook | `jupyter` / `create file.ipynb` | `jupyter` | `.ipynb` extension |
| Local HTML | `html` / `create file.html` | `html` | `.html`/`.htm` extensions |
| OpenAPI/Swagger | `openapi` / `create spec.yaml` | `openapi` | `.yaml`/`.yml` with OpenAPI content |
| AsciiDoc | `asciidoc` / `create file.adoc` | `asciidoc` | `.adoc`/`.asciidoc` extensions |
| PowerPoint | `pptx` / `create file.pptx` | `pptx` | `.pptx` extension |
| RSS/Atom | `rss` / `create feed.rss` | `rss` | `.rss`/`.atom` extensions |
| Man pages | `manpage` / `create cmd.1` | `manpage` | `.1`-`.8`/`.man` extensions |
| Confluence | `confluence` | `confluence` | API or export directory |
| Notion | `notion` | `notion` | API or export directory |
| Slack/Discord | `chat` | `chat` | Export directory or API |
## Project Layout

```
src/skill_seekers/           # Main package (src/ layout)
  cli/                       # CLI commands and entry points
    adaptors/                # Platform adaptors (Strategy pattern, inherit SkillAdaptor)
    arguments/               # CLI argument definitions (one per source type)
    parsers/                 # Subcommand parsers (one per source type)
    storage/                 # Cloud storage (inherit BaseStorageAdaptor)
    main.py                  # Unified CLI entry point (COMMAND_MODULES dict)
    source_detector.py       # Auto-detects source type from user input
    create_command.py        # Unified `create` command routing
    config_validator.py      # VALID_SOURCE_TYPES set + per-type validation
    unified_scraper.py       # Multi-source orchestrator (scraped_data + dispatch)
    unified_skill_builder.py # Pairwise synthesis + generic merge
  mcp/                       # MCP server (FastMCP + legacy)
    tools/                   # MCP tool implementations by category
  sync/                      # Sync monitoring (Pydantic models)
  benchmark/                 # Benchmarking framework
  embedding/                 # FastAPI embedding server
  workflows/                 # 67 YAML workflow presets (includes complex-merge.yaml)
  _version.py                # Reads version from pyproject.toml
tests/                       # 115+ test files (pytest)
configs/                     # Preset JSON scraping configs
docs/                        # 80+ markdown doc files
```
## Key Patterns

**Adaptor (Strategy) pattern** — all platform logic in `cli/adaptors/`. Inherit `SkillAdaptor`, implement `format_skill_md()`, `package()`, `upload()`. Register in the `adaptors/__init__.py` ADAPTORS dict.

**Scraper pattern** — each source type has: `cli/<type>_scraper.py` (with a `<Type>ToSkillConverter` class + `main()`), `arguments/<type>.py`, `parsers/<type>_parser.py`. Register in the `parsers/__init__.py` PARSERS list, the `main.py` COMMAND_MODULES dict, and the `config_validator.py` VALID_SOURCE_TYPES set.

**Unified pipeline** — `unified_scraper.py` dispatches to per-type `_scrape_<type>()` methods. `unified_skill_builder.py` uses pairwise synthesis for docs+github+pdf combos and `_generic_merge()` for all other combinations.

**MCP tools** — grouped in `mcp/tools/` by category. `scrape_generic_tool` handles all new source types.

**Async code**: use `asyncio`; mark async tests with `@pytest.mark.asyncio`.

---

## Architecture Patterns

### Platform Adaptor Pattern (Strategy Pattern)

All platform-specific logic is encapsulated in adaptors:
```python
import os

from skill_seekers.cli.adaptors import get_adaptor

# Get platform-specific adaptor
adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'langchain', etc.

# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')

# Upload to platform
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY')
)
```
Each adaptor inherits from the `SkillAdaptor` base class and implements:

- `format_skill_md()` - Format SKILL.md content
- `package()` - Create platform-specific package
- `upload()` - Upload to platform API
- `validate_api_key()` - Validate API key format
- `supports_enhancement()` - Whether AI enhancement is supported
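A minimal, self-contained sketch of this pattern (the class and method signatures here are simplified stand-ins, not the real ones from `cli/adaptors/base.py`; `MarkdownAdaptor` is hypothetical):

```python
from abc import ABC, abstractmethod


class SkillAdaptor(ABC):
    """Simplified stand-in for the real base class in cli/adaptors/base.py."""

    @abstractmethod
    def format_skill_md(self, skill_name: str, content: str) -> str: ...

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> str: ...


class MarkdownAdaptor(SkillAdaptor):
    """Hypothetical adaptor producing plain-markdown output."""

    def format_skill_md(self, skill_name: str, content: str) -> str:
        return f"# {skill_name}\n\n{content}"

    def package(self, skill_dir: str, output_path: str) -> str:
        # Real adaptors zip/tar the skill dir; here we only compute the target path.
        return f"{output_path}/{skill_dir.rstrip('/').split('/')[-1]}.zip"


# Registry lookup mirrors the ADAPTORS dict behind get_adaptor().
ADAPTORS = {"markdown": MarkdownAdaptor}

adaptor = ADAPTORS["markdown"]()
print(adaptor.format_skill_md("react", "Skill body"))
```

The registry dict is what makes `get_adaptor('gemini')` a one-line lookup: adding a platform means adding one class and one registry entry, never touching call sites.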
### CLI Architecture (Git-style)

Entry point: `src/skill_seekers/cli/main.py`

The CLI uses subcommands that delegate to existing modules:

```bash
# skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
```
**Available subcommands:**

- `create` - Unified create command
- `config` - Configuration wizard
- `scrape` - Documentation scraping
- `github` - GitHub repository scraping
- `pdf` - PDF extraction
- `word` - Word document extraction
- `video` - Video extraction (YouTube or local). Use `--setup` to auto-detect GPU and install visual deps.
- `unified` - Multi-source scraping
- `analyze` / `codebase` - Local codebase analysis
- `enhance` - AI enhancement
- `package` - Package skill for target platform
- `upload` - Upload to platform
- `cloud` - Cloud storage operations
- `sync` - Sync monitoring
- `benchmark` - Performance benchmarking
- `embed` - Embedding server
- `install` / `install-agent` - Complete workflow
- `stream` - Streaming ingestion
- `update` - Incremental updates
- `multilang` - Multi-language support
- `quality` - Quality metrics
- `resume` - Resume interrupted jobs
- `estimate` - Estimate page counts
- `workflows` - Workflow management
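The git-style delegation described above can be sketched like this. The handler functions and their return values are illustrative stand-ins; the real `COMMAND_MODULES` in `cli/main.py` maps subcommand names to modules whose `main()` does the actual work.

```python
import sys


# Hypothetical stand-ins for per-command modules, each exposing a main().
def scrape_main():
    # A real module would run argparse over sys.argv[1:] itself.
    return ("scrape", sys.argv[1:])


def package_main():
    return ("package", sys.argv[1:])


COMMAND_MODULES = {"scrape": scrape_main, "package": package_main}


def main(argv):
    command, *rest = argv
    handler = COMMAND_MODULES.get(command)
    if handler is None:
        raise SystemExit(f"unknown command: {command}")
    # Git-style delegation: rewrite sys.argv so the module's own
    # argument parser sees only its arguments.
    sys.argv = [f"skill-seekers-{command}", *rest]
    return handler()


print(main(["scrape", "--config", "react.json"]))
```

Rewriting `sys.argv` (rather than passing arguments through) is what lets each module double as a standalone `skill-seekers-<type>` entry point.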
### MCP Server Architecture

Two implementations:

- `server_fastmcp.py` - Modern, decorator-based (recommended, ~708 lines)
- `server_legacy.py` - Legacy implementation

Tools are organized by category:

- Config tools (3 tools): generate_config, list_configs, validate_config
- Scraping tools (10 tools): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video (supports a `setup` parameter for GPU detection and visual dep installation), scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns
- Packaging tools (4 tools): package_skill, upload_skill, enhance_skill, install_skill
- Source tools (5 tools): fetch_config, submit_config, add_config_source, list_config_sources, remove_config_source
- Splitting tools (2 tools): split_config, generate_router
- Vector database tools (4 tools): export_to_weaviate, export_to_chroma, export_to_faiss, export_to_qdrant
- Workflow tools (5 tools): list_workflows, get_workflow, create_workflow, update_workflow, delete_workflow
**Running MCP Server:**

```bash
# Stdio transport (default)
python -m skill_seekers.mcp.server_fastmcp

# HTTP transport
python -m skill_seekers.mcp.server_fastmcp --http --port 8765
```
### Cloud Storage Architecture

Abstract base class pattern for cloud providers:

- `base_storage.py` - Defines the `BaseStorageAdaptor` interface
- `s3_storage.py` - AWS S3 implementation
- `gcs_storage.py` - Google Cloud Storage implementation
- `azure_storage.py` - Azure Blob Storage implementation

### Sync Monitoring Architecture

Pydantic-based models in `src/skill_seekers/sync/`:

- `models.py` - Data models (SyncConfig, ChangeReport, SyncState)
- `detector.py` - Change detection logic
- `monitor.py` - Monitoring daemon
- `notifier.py` - Notification system (webhook, email, slack)

---
## Git Workflow

### Branch Structure

- **`main`** - Production, always stable, protected
- **`development`** - Active development, default target for PRs
- **Feature branches** - Your work, created from `development`

```
main (production)
  ↑
  │ (only maintainer merges)
  │
development (integration)  ← default branch for PRs
  ↑
  │ (all contributor PRs go here)
  │
feature branches
```
### Creating a Feature Branch

```bash
# 1. Checkout development
git checkout development
git pull upstream development

# 2. Create feature branch
git checkout -b my-feature

# 3. Make changes, commit, push
git add .
git commit -m "Add my feature"
git push origin my-feature

# 4. Create PR targeting the 'development' branch
```

---
## CI/CD Configuration

### GitHub Actions Workflows

All workflows are in `.github/workflows/`:

**`tests.yml`:**
- Runs on: push/PR to `main` and `development`
- Lint job: Ruff + MyPy
- Test matrix: Ubuntu + macOS, Python 3.10-3.12
- Coverage: Uploads to Codecov

**`release.yml`:**
- Triggered on version tags (`v*`)
- Builds and publishes to PyPI using `uv`
- Creates GitHub release with changelog

**`docker-publish.yml`:**
- Builds and publishes Docker images
- Multi-architecture support (linux/amd64, linux/arm64)

**`vector-db-export.yml`:**
- Tests vector database exports

**`scheduled-updates.yml`:**
- Scheduled sync monitoring

**`quality-metrics.yml`:**
- Quality metrics tracking

**`test-vector-dbs.yml`:**
- Vector database integration tests
### Pre-commit Checks (Manual)

```bash
# Before committing, run:
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x   # stop on first failure
```

---
## Security Considerations

### API Keys and Secrets

1. **Never commit API keys** to the repository
2. **Use environment variables:**
   - `ANTHROPIC_API_KEY` - Claude AI
   - `GOOGLE_API_KEY` - Google Gemini
   - `OPENAI_API_KEY` - OpenAI
   - `GITHUB_TOKEN` - GitHub API
   - `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` - AWS S3
   - `GOOGLE_APPLICATION_CREDENTIALS` - GCS
   - `AZURE_STORAGE_CONNECTION_STRING` - Azure
3. **Configuration storage:**
   - Stored at `~/.config/skill-seekers/config.json`
   - Permissions: 600 (owner read/write only)

### Rate Limit Handling

- The GitHub API has rate limits (5,000 requests/hour when authenticated)
- The tool has built-in rate limit handling with retry logic
- Use the `--non-interactive` flag in CI/CD environments

### Custom API Endpoints

Support for Claude-compatible APIs:

```bash
export ANTHROPIC_API_KEY=your-custom-api-key
export ANTHROPIC_BASE_URL=https://custom-endpoint.com/v1
```

---
## Common Development Tasks

### Adding a New CLI Command

1. Create module in `src/skill_seekers/cli/my_command.py`
2. Implement `main()` function with argument parsing
3. Add entry point in `pyproject.toml`:
   ```toml
   [project.scripts]
   skill-seekers-my-command = "skill_seekers.cli.my_command:main"
   ```
4. Add subcommand handler in `src/skill_seekers/cli/main.py`
5. Add argument parser in `src/skill_seekers/cli/parsers/`
6. Add tests in `tests/test_my_command.py`
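Step 2's `main()` might look like the skeleton below. The command name and its flags are hypothetical examples, not real Skill Seekers options.

```python
import argparse


def main(argv=None):
    """Hypothetical entry point for a new `my-command` subcommand."""
    parser = argparse.ArgumentParser(
        prog="skill-seekers-my-command",
        description="Example scaffold for a new CLI command.",
    )
    parser.add_argument("--config", required=True, help="Path to a JSON config")
    parser.add_argument("--output", default="output/", help="Output directory")
    args = parser.parse_args(argv)
    # A real command would do its work here; we just return the parsed args.
    return {"config": args.config, "output": args.output}


print(main(["--config", "react.json"]))
```

Accepting an optional `argv` (falling back to `sys.argv[1:]` inside argparse) keeps the function directly testable while still working as a console-script entry point.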
### Adding a New Platform Adaptor

1. Create `src/skill_seekers/cli/adaptors/my_platform.py`
2. Inherit from the `SkillAdaptor` base class
3. Implement required methods: `package()`, `upload()`, `format_skill_md()`
4. Register in `src/skill_seekers/cli/adaptors/__init__.py`
5. Add optional dependencies in `pyproject.toml`
6. Add tests in `tests/test_adaptors/`
### Adding an MCP Tool

1. Implement tool logic in `src/skill_seekers/mcp/tools/category_tools.py`
2. Register in `src/skill_seekers/mcp/server_fastmcp.py`
3. Add a test in `tests/test_mcp_fastmcp.py`
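Registration in step 2 follows FastMCP's decorator style. The sketch below uses a plain-Python stand-in registry rather than the real `fastmcp` API, and `validate_config` here is a toy, not the real tool:

```python
# Stand-in tool registry mimicking decorator-based registration
# (the real server registers tools via FastMCP decorators).
TOOLS = {}


def tool(func):
    """Register a function as a tool under its own name."""
    TOOLS[func.__name__] = func
    return func


@tool
def validate_config(path: str) -> dict:
    """Toy tool: pretend a config is valid iff it is a .json file."""
    return {"path": path, "valid": path.endswith(".json")}


print(TOOLS["validate_config"]("react.json"))
```

The decorator returns the function unchanged, so tools stay callable as ordinary functions in unit tests while also being discoverable by name through the registry.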
### Adding Cloud Storage Provider

1. Create module in `src/skill_seekers/cli/storage/my_storage.py`
2. Inherit from the `BaseStorageAdaptor` base class
3. Implement required methods: `upload_file()`, `download_file()`, `list_files()`, `delete_file()`
4. Register in `src/skill_seekers/cli/storage/__init__.py`
5. Add optional dependencies in `pyproject.toml`

---
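A self-contained sketch of steps 2-3, using an in-memory dict as the "cloud" backend. The base-class signatures are simplified stand-ins for the real `base_storage.py`, and `MemoryStorage` is a hypothetical provider useful mainly for tests:

```python
from abc import ABC, abstractmethod


class BaseStorageAdaptor(ABC):
    """Simplified stand-in for cli/storage/base_storage.py."""

    @abstractmethod
    def upload_file(self, local_path: str, remote_key: str) -> None: ...

    @abstractmethod
    def download_file(self, remote_key: str) -> bytes: ...

    @abstractmethod
    def list_files(self) -> list[str]: ...

    @abstractmethod
    def delete_file(self, remote_key: str) -> None: ...


class MemoryStorage(BaseStorageAdaptor):
    """Hypothetical provider backed by a dict, for illustration and tests."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def upload_file(self, local_path: str, remote_key: str) -> None:
        with open(local_path, "rb") as f:
            self._blobs[remote_key] = f.read()

    def download_file(self, remote_key: str) -> bytes:
        return self._blobs[remote_key]

    def list_files(self) -> list[str]:
        return sorted(self._blobs)

    def delete_file(self, remote_key: str) -> None:
        del self._blobs[remote_key]
```

Because every provider implements the same four methods, the rest of the CLI can treat S3, GCS, Azure, or this in-memory fake interchangeably.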
## Documentation
|
||||
|
||||
### Project Documentation (New Structure - v3.1.0+)
|
||||
|
||||
**Entry Points:**
|
||||
- **README.md** - Main project documentation with navigation
|
||||
- **docs/README.md** - Documentation hub
|
||||
- **AGENTS.md** - This file, for AI coding agents
|
||||
|
||||
**Getting Started (for new users):**
|
||||
- `docs/getting-started/01-installation.md` - Installation guide
|
||||
- `docs/getting-started/02-quick-start.md` - 3 commands to first skill
|
||||
- `docs/getting-started/03-your-first-skill.md` - Complete walkthrough
|
||||
- `docs/getting-started/04-next-steps.md` - Where to go from here
|
||||
|
||||
**User Guides (common tasks):**
|
||||
- `docs/user-guide/01-core-concepts.md` - How Skill Seekers works
|
||||
- `docs/user-guide/02-scraping.md` - All scraping options
|
||||
- `docs/user-guide/03-enhancement.md` - AI enhancement explained
|
||||
- `docs/user-guide/04-packaging.md` - Export to platforms
|
||||
- `docs/user-guide/05-workflows.md` - Enhancement workflows
|
||||
- `docs/user-guide/06-troubleshooting.md` - Common issues
|
||||
|
||||
**Reference (technical details):**
|
||||
- `docs/reference/CLI_REFERENCE.md` - Complete command reference (20 commands)
|
||||
- `docs/reference/MCP_REFERENCE.md` - MCP tools reference (33 tools)
|
||||
- `docs/reference/CONFIG_FORMAT.md` - JSON configuration specification
|
||||
- `docs/reference/ENVIRONMENT_VARIABLES.md` - All environment variables
|
||||
|
||||
**Advanced (power user topics):**
|
||||
- `docs/advanced/mcp-server.md` - MCP server setup
|
||||
- `docs/advanced/mcp-tools.md` - Advanced MCP usage
|
||||
- `docs/advanced/custom-workflows.md` - Creating custom workflows
|
||||
- `docs/advanced/multi-source.md` - Multi-source scraping
|
||||
|
||||
### Configuration Documentation
|
||||
|
||||
Preset configs are in `configs/` directory:
|
||||
- `godot.json` / `godot_unified.json` - Godot Engine
|
||||
- `blender.json` / `blender-unified.json` - Blender Engine
|
||||
- `claude-code.json` - Claude Code
|
||||
- `httpx_comprehensive.json` - HTTPX library
|
||||
- `medusa-mercurjs.json` - Medusa/MercurJS
|
||||
- `astrovalley_unified.json` - Astrovalley
|
||||
- `react.json` - React documentation
|
||||
- `configs/integrations/` - Integration-specific configs
|
||||
|
||||
---

## Key Dependencies

### Core Dependencies (Required)

| Package | Version | Purpose |
|---------|---------|---------|
| `requests` | >=2.32.5 | HTTP requests |
| `beautifulsoup4` | >=4.14.2 | HTML parsing |
| `PyGithub` | >=2.5.0 | GitHub API |
| `GitPython` | >=3.1.40 | Git operations |
| `httpx` | >=0.28.1 | Async HTTP |
| `anthropic` | >=0.76.0 | Claude AI API |
| `PyMuPDF` | >=1.24.14 | PDF processing |
| `Pillow` | >=11.0.0 | Image processing |
| `pytesseract` | >=0.3.13 | OCR |
| `pydantic` | >=2.12.3 | Data validation |
| `pydantic-settings` | >=2.11.0 | Settings management |
| `click` | >=8.3.0 | CLI framework |
| `Pygments` | >=2.19.2 | Syntax highlighting |
| `pathspec` | >=0.12.1 | Path matching |
| `networkx` | >=3.0 | Graph operations |
| `schedule` | >=1.2.0 | Scheduled tasks |
| `python-dotenv` | >=1.1.1 | Environment variables |
| `jsonschema` | >=4.25.1 | JSON validation |
| `PyYAML` | >=6.0 | YAML parsing |
| `langchain` | >=1.2.10 | LangChain integration |
| `llama-index` | >=0.14.15 | LlamaIndex integration |

### Optional Dependencies

| Feature | Package | Install Command |
|---------|---------|-----------------|
| MCP Server | `mcp>=1.25,<2` | `pip install -e ".[mcp]"` |
| Google Gemini | `google-generativeai>=0.8.0` | `pip install -e ".[gemini]"` |
| OpenAI | `openai>=1.0.0` | `pip install -e ".[openai]"` |
| AWS S3 | `boto3>=1.34.0` | `pip install -e ".[s3]"` |
| Google Cloud Storage | `google-cloud-storage>=2.10.0` | `pip install -e ".[gcs]"` |
| Azure Blob Storage | `azure-storage-blob>=12.19.0` | `pip install -e ".[azure]"` |
| Word Documents | `mammoth>=1.6.0`, `python-docx>=1.1.0` | `pip install -e ".[docx]"` |
| Video (lightweight) | `yt-dlp>=2024.12.0`, `youtube-transcript-api>=1.2.0` | `pip install -e ".[video]"` |
| Video (full) | +`faster-whisper`, `scenedetect`, `opencv-python-headless` (`easyocr` now installed via `--setup`) | `pip install -e ".[video-full]"` |
| Video (GPU setup) | Auto-detects GPU, installs PyTorch + easyocr + all visual deps | `skill-seekers video --setup` |
| Chroma DB | `chromadb>=0.4.0` | `pip install -e ".[chroma]"` |
| Weaviate | `weaviate-client>=3.25.0` | `pip install -e ".[weaviate]"` |
| Pinecone | `pinecone>=5.0.0` | `pip install -e ".[pinecone]"` |
| Embedding Server | `fastapi>=0.109.0`, `uvicorn>=0.27.0`, `sentence-transformers>=2.3.0` | `pip install -e ".[embedding]"` |

### Dev Dependencies (in `[dependency-groups]`)

| Package | Version | Purpose |
|---------|---------|---------|
| `pytest` | >=8.4.2 | Testing framework |
| `pytest-asyncio` | >=0.24.0 | Async test support |
| `pytest-cov` | >=7.0.0 | Coverage |
| `coverage` | >=7.11.0 | Coverage reporting |
| `ruff` | >=0.14.13 | Linting/formatting |
| `mypy` | >=1.19.1 | Type checking |
| `psutil` | >=5.9.0 | Process utilities for testing |
| `numpy` | >=1.24.0 | Numerical operations |
| `starlette` | >=0.31.0 | HTTP transport testing |
| `httpx` | >=0.24.0 | HTTP client for testing |
| `boto3` | >=1.26.0 | AWS S3 testing |
| `google-cloud-storage` | >=2.10.0 | GCS testing |
| `azure-storage-blob` | >=12.17.0 | Azure testing |
---

## Troubleshooting

### Common Issues

**ImportError: No module named 'skill_seekers'**
- Solution: run `pip install -e .`

**Tests failing with "package not installed"**
- Solution: make sure you ran `pip install -e .` in the correct virtual environment

**MCP server import errors**
- Solution: install with `pip install -e ".[mcp]"`

**Type checking failures**
- MyPy is configured to be lenient (gradual typing)
- Focus on critical paths, not full coverage

**Docker build failures**
- Ensure BuildKit is enabled: `DOCKER_BUILDKIT=1`
- Check that all submodules are initialized: `git submodule update --init`

**Rate limit errors from GitHub**
- Set the `GITHUB_TOKEN` environment variable for authenticated requests
- This raises the rate limit from 60 to 5,000 requests/hour

### Getting Help

- Check **TROUBLESHOOTING.md** for detailed solutions
- Review **docs/FAQ.md** for common questions
- Visit https://skillseekersweb.com/ for documentation
- Open an issue on GitHub with:
  - A clear title and description
  - Steps to reproduce
  - Expected vs actual behavior
  - Environment details (OS, Python version)
  - Error messages and stack traces
---

## Environment Variables Reference

| Variable | Purpose | Required For |
|----------|---------|--------------|
| `ANTHROPIC_API_KEY` | Claude AI API access | Claude enhancement/upload |
| `GOOGLE_API_KEY` | Google Gemini API access | Gemini enhancement/upload |
| `OPENAI_API_KEY` | OpenAI API access | OpenAI enhancement/upload |
| `GITHUB_TOKEN` | GitHub API authentication | GitHub scraping (recommended) |
| `AWS_ACCESS_KEY_ID` | AWS S3 authentication | S3 cloud storage |
| `AWS_SECRET_ACCESS_KEY` | AWS S3 authentication | S3 cloud storage |
| `GOOGLE_APPLICATION_CREDENTIALS` | GCS authentication path | GCS cloud storage |
| `AZURE_STORAGE_CONNECTION_STRING` | Azure Blob authentication | Azure cloud storage |
| `ANTHROPIC_BASE_URL` | Custom Claude endpoint | Custom API endpoints |
| `SKILL_SEEKERS_HOME` | Data directory path | Docker/runtime |
| `SKILL_SEEKERS_OUTPUT` | Output directory path | Docker/runtime |
---

## Version Management

The version is defined in `pyproject.toml` and read dynamically by `src/skill_seekers/_version.py`:

```python
# _version.py reads from pyproject.toml
__version__ = get_version()  # Returns version from pyproject.toml
```

**To update the version:**
1. Edit `version` in `pyproject.toml`
2. `_version.py` picks up the new version automatically
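A minimal sketch of such a version reader, assuming a simple string-valued `version` field; the real `get_version()` may instead use `tomllib` or `importlib.metadata`, so treat this as illustrative only:

```python
import re
from pathlib import Path


def get_version(pyproject_path: Path) -> str:
    """Read `version = "..."` from a pyproject.toml file (sketch)."""
    text = pyproject_path.read_text(encoding="utf-8")
    match = re.search(r'^version\s*=\s*"([^"]+)"', text, re.MULTILINE)
    if match is None:
        raise ValueError(f"no version field in {pyproject_path}")
    return match.group(1)
```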
---

## Configuration File Format

Skill Seekers uses JSON configuration files to define scraping targets. Example structure:

```json
{
  "name": "godot",
  "description": "Godot Engine documentation",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "extract_api": true,
      "selectors": {
        "main_content": "div[role='main']",
        "title": "title",
        "code_blocks": "pre"
      },
      "url_patterns": {
        "include": [],
        "exclude": ["/search.html", "/_static/"]
      },
      "categories": {
        "getting_started": ["introduction", "getting_started"],
        "scripting": ["scripting", "gdscript"]
      },
      "rate_limit": 0.5,
      "max_pages": 500
    },
    {
      "type": "github",
      "repo": "godotengine/godot",
      "enable_codebase_analysis": true,
      "code_analysis_depth": "deep",
      "fetch_issues": true,
      "max_issues": 100
    }
  ]
}
```
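A consumer of this format loads the JSON and dispatches on each source's `type` field. An illustrative snippet (not the unified scraper's actual code):

```python
import json

# A trimmed-down config in the same shape as the example above.
config_text = """
{
  "name": "example",
  "sources": [
    {"type": "documentation", "base_url": "https://docs.example.com/"},
    {"type": "github", "repo": "example/project"}
  ]
}
"""

config = json.loads(config_text)
# The unified scraper validates each source, then routes it to the
# matching per-type scraper based on this "type" value.
source_types = [src["type"] for src in config["sources"]]
```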

---

## Workflow Presets

Skill Seekers ships 66 YAML workflow presets for AI enhancement in `src/skill_seekers/workflows/`:

**Built-in presets:**
- `default.yaml` - Standard enhancement workflow
- `minimal.yaml` - Fast, minimal enhancement
- `security-focus.yaml` - Security-focused review
- `architecture-comprehensive.yaml` - Deep architecture analysis
- `api-documentation.yaml` - API documentation focus
- ...and 61 more specialized presets

**Usage:**
```bash
# Apply a preset
skill-seekers create ./my-project --enhance-workflow security-focus

# Chain multiple presets
skill-seekers create ./my-project --enhance-workflow security-focus --enhance-workflow minimal

# Manage presets
skill-seekers workflows list
skill-seekers workflows show security-focus
skill-seekers workflows copy security-focus
```
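The presets themselves are YAML documents. As a rough sketch of the shape such a file might take (the field names below are illustrative assumptions; the authoritative schema is whatever the workflow runner defines):

```yaml
# Hypothetical preset structure -- field names are illustrative
name: security-focus
description: Security-focused review of the generated skill
stages:
  - name: inventory
    prompt: "List the security-relevant APIs covered by this skill."
  - name: review
    prompt: "Flag insecure usage patterns in the code examples."
```

Use `skill-seekers workflows show security-focus` to see the real fields of any shipped preset.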

---

*This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.*

*Last updated: 2026-03-01*

GitHub Actions (`.github/workflows/tests.yml`): ruff + mypy lint job, then pytest matrix (Ubuntu + macOS, Python 3.10-3.12) with Codecov upload.
126 CHANGELOG.md
@@ -8,6 +8,77 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

#### 10 New Skill Source Types (17 total)

Skill Seekers now supports 17 source types — up from 7. Every new type is fully integrated into the CLI (`skill-seekers <type>`), `create` command auto-detection, unified multi-source configs, config validation, the MCP server, and the skill builder.

- **Jupyter Notebook** — `skill-seekers jupyter --notebook file.ipynb` or `skill-seekers create file.ipynb`
  - Extracts markdown cells, code cells with outputs, kernel metadata, imports, and language detection
  - Handles single files and directories of notebooks; filters `.ipynb_checkpoints`
  - Optional dependency: `pip install "skill-seekers[jupyter]"` (nbformat)
  - Entry point: `skill-seekers-jupyter`

- **Local HTML** — `skill-seekers html --html-path file.html` or `skill-seekers create file.html`
  - Parses HTML using BeautifulSoup with smart main content detection (`<article>`, `<main>`, `.content`, largest div)
  - Extracts headings, code blocks, tables (to markdown), images, links; converts inline HTML to markdown
  - Handles single files and directories; supports `.html`, `.htm`, `.xhtml` extensions
  - No extra dependencies (BeautifulSoup is a core dep)

- **OpenAPI/Swagger** — `skill-seekers openapi --spec spec.yaml` or `skill-seekers create spec.yaml`
  - Parses OpenAPI 3.0/3.1 and Swagger 2.0 specs from YAML or JSON (local files or URLs via `--spec-url`)
  - Extracts endpoints, parameters, request/response schemas, security schemes, tags
  - Resolves `$ref` references with circular reference protection; handles `allOf`/`oneOf`/`anyOf`
  - Groups endpoints by tags; generates comprehensive API reference markdown
  - Source detection sniffs YAML file content for `openapi:` or `swagger:` keys (avoids false positives on non-API YAML files)
  - Optional dependency: `pip install "skill-seekers[openapi]"` (pyyaml — already a core dep, guard added for safety)

- **AsciiDoc** — `skill-seekers asciidoc --asciidoc-path file.adoc` or `skill-seekers create file.adoc`
  - Regex-based parser (no external library required) with optional `asciidoc` library support
  - Extracts headings (`=` through `=====`), `[source,lang]` code blocks, `|===` tables, admonitions (NOTE/TIP/WARNING/IMPORTANT/CAUTION), and `include::` directives
  - Converts AsciiDoc formatting to markdown; handles single files and directories
  - Optional dependency: `pip install "skill-seekers[asciidoc]"` (asciidoc library for advanced rendering)

- **PowerPoint (.pptx)** — `skill-seekers pptx --pptx file.pptx` or `skill-seekers create file.pptx`
  - Extracts slide text, speaker notes, tables, images (with alt text), and grouped shapes
  - Detects code blocks by monospace font analysis (30+ font families)
  - Groups slides into sections by layout type; handles single files and directories
  - Optional dependency: `pip install "skill-seekers[pptx]"` (python-pptx)

- **RSS/Atom Feeds** — `skill-seekers rss --feed-url <url>` / `--feed-path file.rss` or `skill-seekers create feed.rss`
  - Parses RSS 2.0, RSS 1.0, and Atom feeds via feedparser
  - Optionally follows article links (`--follow-links`, default on) to scrape full page content using BeautifulSoup
  - Extracts article titles, summaries, authors, dates, categories; configurable `--max-articles` (default 50)
  - Source detection matches `.rss` and `.atom` extensions (`.xml` excluded to avoid false positives)
  - Optional dependency: `pip install "skill-seekers[rss]"` (feedparser)

- **Man Pages** — `skill-seekers manpage --man-names git,curl` / `--man-path dir/` or `skill-seekers create git.1`
  - Extracts man pages by running the `man` command via subprocess or reading `.1`–`.8`/`.man` files directly
  - Handles gzip/bzip2/xz compressed man files; strips troff/groff formatting (backspace overstriking, macros, font escapes)
  - Parses structured sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, EXAMPLES, SEE ALSO)
  - Source detection uses a basename heuristic to avoid false positives on log rotation files (e.g., `access.log.1`)
  - No external dependencies (stdlib only)

- **Confluence** — `skill-seekers confluence --base-url <url> --space-key <key>` or `--export-path dir/`
  - API mode: fetches pages from the Confluence REST API with pagination (`atlassian-python-api`)
  - Export mode: parses Confluence HTML/XML export directories
  - Extracts page content, code/panel/info/warning macros, page hierarchy, tables
  - Optional dependency: `pip install "skill-seekers[confluence]"` (atlassian-python-api)

- **Notion** — `skill-seekers notion --database-id <id>` / `--page-id <id>` or `--export-path dir/`
  - API mode: fetches pages via the Notion API with support for 20+ block types (paragraph, heading, code, callout, toggle, table, etc.)
  - Export mode: parses Notion Markdown/CSV export directories
  - Extracts rich text with annotations (bold, italic, code, links), 16+ property types for database entries
  - Optional dependency: `pip install "skill-seekers[notion]"` (notion-client)

- **Slack/Discord Chat** — `skill-seekers chat --export-path dir/` or `--token <token> --channel <channel>`
  - Slack: parses workspace JSON exports or fetches via the Slack Web API (`slack_sdk`)
  - Discord: parses DiscordChatExporter JSON or fetches via the Discord HTTP API
  - Extracts messages, code snippets (fenced blocks), shared URLs, threads, reactions, attachments
  - Generates per-channel summaries and topic categorization
  - Optional dependency: `pip install "skill-seekers[chat]"` (slack-sdk)

#### EPUB Unified Pipeline Integration

- **EPUB (.epub) input support** via `skill-seekers create book.epub` or `skill-seekers epub --epub book.epub`
  - Extracts chapters, metadata (Dublin Core), code blocks, images, and tables from EPUB 2 and EPUB 3 files
  - DRM detection with clear error messages (Adobe ADEPT, Apple FairPlay, Readium LCP)

@@ -16,6 +87,61 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

  - `--help-epub` flag for EPUB-specific help
  - Optional dependency: `pip install "skill-seekers[epub]"` (ebooklib)
  - 107 tests across 14 test classes
- **EPUB added to unified scraper** — `_scrape_epub()` method, `scraped_data["epub"]`, config validation (`_validate_epub_source`), and dry-run display. Previously EPUB worked standalone but was missing from multi-source configs.

#### Unified Skill Builder — Generic Merge System

- **`_generic_merge()`** — Priority-based section merge for any combination of source types not covered by existing pairwise synthesis (docs+github, docs+pdf, etc.). Produces YAML frontmatter + source-attributed sections.
- **`_append_extra_sources()`** — Appends additional source type content (e.g., Jupyter + PPTX) to pairwise-synthesized SKILL.md.
- **`_generate_generic_references()`** — Generates `references/<type>/index.md` for any source type, with an ID resolution fallback chain.
- **`_SOURCE_LABELS`** dict — Human-readable labels for all 17 source types used in merge attribution.

#### Config Validator Expansion

- **17 source types in `VALID_SOURCE_TYPES`** — All new types plus `word` and `video` now have per-type validation methods.
- **`_validate_word_source()`** — Validates the `path` field for Word documents (was previously missing).
- **`_validate_video_source()`** — Validates the `url`, `path`, or `playlist` field for video sources (was previously missing).
- **11 new `_validate_*_source()` methods** — One for each new type with appropriate required-field checks.

#### Source Detection Improvements

- **7 new file extension detections** in `SourceDetector.detect()` — `.ipynb`, `.html`/`.htm`, `.pptx`, `.adoc`/`.asciidoc`, `.rss`/`.atom`, `.1`–`.8`/`.man`, `.yaml`/`.yml` (with content sniffing)
- **`_looks_like_openapi()`** — Content sniffing for YAML files: only classifies as OpenAPI if the file contains an `openapi:` or `swagger:` key in the first 20 lines (prevents false positives on docker-compose, Ansible, Kubernetes manifests, etc.)
- **Man page basename heuristic** — `.1`–`.8` extensions are only detected as man pages if the basename has no dots (e.g., `git.1` matches but `access.log.1` does not)
- **`.xml` excluded from RSS detection** — Too generic; only `.rss` and `.atom` trigger RSS detection
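The two sniffing heuristics above can be sketched as follows (illustrative; the real `SourceDetector` code may differ in detail):

```python
import re
from pathlib import PurePath


def looks_like_openapi(yaml_text: str, max_lines: int = 20) -> bool:
    """Only treat YAML as OpenAPI if an `openapi:`/`swagger:` key appears early."""
    head = yaml_text.splitlines()[:max_lines]
    return any(re.match(r"^\s*(openapi|swagger)\s*:", line) for line in head)


def looks_like_man_page(filename: str) -> bool:
    """`git.1` matches; `access.log.1` does not (basename must have no extra dots)."""
    name = PurePath(filename).name
    stem, dot, ext = name.rpartition(".")
    return dot == "." and ext in {"1", "2", "3", "4", "5", "6", "7", "8"} and "." not in stem
```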
#### MCP Server Integration

- **`scrape_generic` tool** — New MCP tool handles all 10 new source types via subprocess with per-type flag mapping
- **`_PATH_FLAGS` / `_URL_FLAGS` dicts** — Correct flag routing for each source type (e.g., jupyter→`--notebook`, html→`--html-path`, rss→`--feed-url`)
- **`GENERIC_SOURCE_TYPES` tuple** — Lists all 10 new types for validation
- **Config validation display** — `validate_config` tool now shows source details for all new types
- **Tool count updated** — 33 → 34 tools (scraping tools 10 → 11)
#### CLI Wiring

- **10 new CLI subcommands** — `jupyter`, `html`, `openapi`, `asciidoc`, `pptx`, `rss`, `manpage`, `confluence`, `notion`, `chat` in `COMMAND_MODULES`
- **10 new argument modules** — `arguments/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}.py` with per-type `*_ARGUMENTS` dicts
- **10 new parser modules** — `parsers/{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}_parser.py` with `SubcommandParser` implementations
- **`create` command routing** — `_route_generic()` method for all new types with correct module names and CLI flags
- **10 new entry points** in pyproject.toml — `skill-seekers-{jupyter,html,openapi,asciidoc,pptx,rss,manpage,confluence,notion,chat}`
- **7 new optional dependency groups** in pyproject.toml — `[jupyter]`, `[asciidoc]`, `[pptx]`, `[confluence]`, `[notion]`, `[rss]`, `[chat]`
- **`[all]` group updated** — Includes all 7 new optional dependencies

#### Workflow & Documentation

- **`complex-merge.yaml`** — New 7-stage AI-powered workflow for complex multi-source merging (source inventory → cross-reference → conflict detection → priority merge → gap analysis → synthesis → quality check)
- **AGENTS.md rewritten** — Updated with all 17 source types, scraper pattern docs, project layout, and key pattern documentation
- **77 new integration tests** in `test_new_source_types.py` — Source detection, config validation, generic merge, CLI wiring, validation, and create command routing

### Fixed

- **Config validator missing `word` and `video` dispatch** — `_validate_source()` had no `elif` branches for `word` or `video` types, silently skipping validation. Added dispatch entries and `_validate_word_source()` / `_validate_video_source()` methods.
- **`openapi_scraper.py` unconditional `import yaml`** — Would crash at import time if pyyaml were not installed. Added a `try/except ImportError` guard with a `YAML_AVAILABLE` flag and a `_check_yaml_deps()` helper.
- **`asciidoc_scraper.py` missing standard arguments** — `main()` manually defined args instead of using `add_asciidoc_arguments()`. Refactored to use shared argument definitions + added enhancement workflow integration.
- **`pptx_scraper.py` missing standard arguments** — Same issue. Refactored to use `add_pptx_arguments()`.
- **`chat_scraper.py` missing standard arguments** — Same issue. Refactored to use `add_chat_arguments()`.
- **`notion_scraper.py` missing `run_workflows` call** — `--enhance-workflow` flags were silently ignored. Added workflow runner integration.
- **`openapi_scraper.py` return type `None`** — `main()` returned `None` instead of `int`. Fixed to `return 0` on success, matching all other scrapers.
- **MCP `scrape_generic_tool` flag mismatch** — Was passing `--path`/`--url` as generic flags, but every scraper expects its own flag name (e.g., `--notebook`, `--html-path`, `--spec`). All 10 source types would have failed at runtime. Fixed with per-type `_PATH_FLAGS` and `_URL_FLAGS` mappings.
- **Word scraper `docx_id` key mismatch** — The unified scraper data dict used `docx_id` but generic reference generation looked for `word_id`. Added a `word_id` alias.
- **`main.py` docstring stale** — Missing all 10 new commands. Updated to list all 27 commands.
- **`source_detector.py` module docstring stale** — Described only 5 source types. Updated to describe 14+ detected types.
- **`manpage_parser.py` docstring referenced wrong file** — Said `manpage_scraper.py` but the actual file is `man_scraper.py`. Fixed.
- **Parser registry test count** — Updated expected count from 25 to 35 for 10 new parsers.
## [3.2.0] - 2026-03-01

pyproject.toml
@@ -168,6 +168,35 @@ all-cloud = [
    "azure-storage-blob>=12.19.0",
]

# New source type dependencies (v3.2.0+)
jupyter = [
    "nbformat>=5.9.0",
]

asciidoc = [
    "asciidoc>=10.0.0",
]

pptx = [
    "python-pptx>=0.6.21",
]

confluence = [
    "atlassian-python-api>=3.41.0",
]

notion = [
    "notion-client>=2.0.0",
]

rss = [
    "feedparser>=6.0.0",
]

chat = [
    "slack-sdk>=3.27.0",
]

# Embedding server support
embedding = [
    "fastapi>=0.109.0",

@@ -204,6 +233,14 @@ all = [
    "sentence-transformers>=2.3.0",
    "numpy>=1.24.0",
    "voyageai>=0.2.0",
    # New source types (v3.2.0+)
    "nbformat>=5.9.0",
    "asciidoc>=10.0.0",
    "python-pptx>=0.6.21",
    "atlassian-python-api>=3.41.0",
    "notion-client>=2.0.0",
    "feedparser>=6.0.0",
    "slack-sdk>=3.27.0",
]

[project.urls]

@@ -253,6 +290,18 @@ skill-seekers-quality = "skill_seekers.cli.quality_metrics:main"
skill-seekers-workflows = "skill_seekers.cli.workflows_command:main"
skill-seekers-sync-config = "skill_seekers.cli.sync_config:main"

# New source type entry points (v3.2.0+)
skill-seekers-jupyter = "skill_seekers.cli.jupyter_scraper:main"
skill-seekers-html = "skill_seekers.cli.html_scraper:main"
skill-seekers-openapi = "skill_seekers.cli.openapi_scraper:main"
skill-seekers-asciidoc = "skill_seekers.cli.asciidoc_scraper:main"
skill-seekers-pptx = "skill_seekers.cli.pptx_scraper:main"
skill-seekers-rss = "skill_seekers.cli.rss_scraper:main"
skill-seekers-manpage = "skill_seekers.cli.man_scraper:main"
skill-seekers-confluence = "skill_seekers.cli.confluence_scraper:main"
skill-seekers-notion = "skill_seekers.cli.notion_scraper:main"
skill-seekers-chat = "skill_seekers.cli.chat_scraper:main"

[tool.setuptools]
package-dir = {"" = "src"}
68 src/skill_seekers/cli/arguments/asciidoc.py (new file)
@@ -0,0 +1,68 @@
|
||||
"""AsciiDoc command argument definitions.
|
||||
|
||||
This module defines ALL arguments for the asciidoc command in ONE place.
|
||||
Both asciidoc_scraper.py (standalone) and parsers/asciidoc_parser.py (unified CLI)
|
||||
import and use these definitions.
|
||||
|
||||
Shared arguments (name, description, output, enhance-level, api-key,
|
||||
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
|
||||
via ``add_all_standard_arguments()``.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
from typing import Any
|
||||
|
||||
from .common import add_all_standard_arguments
|
||||
|
||||
# AsciiDoc-specific argument definitions as data structure
|
||||
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
|
||||
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
|
||||
ASCIIDOC_ARGUMENTS: dict[str, dict[str, Any]] = {
|
||||
"asciidoc_path": {
|
||||
"flags": ("--asciidoc-path",),
|
||||
"kwargs": {
|
||||
"type": str,
|
||||
"help": "Path to AsciiDoc file or directory containing .adoc files",
|
||||
"metavar": "PATH",
|
||||
},
|
||||
},
|
||||
"from_json": {
|
||||
"flags": ("--from-json",),
|
||||
"kwargs": {
|
||||
"type": str,
|
||||
"help": "Build skill from extracted JSON",
|
||||
"metavar": "FILE",
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def add_asciidoc_arguments(parser: argparse.ArgumentParser) -> None:
|
||||
"""Add all asciidoc command arguments to a parser.
|
||||
|
||||
Registers shared args (name, description, output, enhance-level, api-key,
|
||||
dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
|
||||
then adds AsciiDoc-specific args on top.
|
||||
|
||||
The default for --enhance-level is overridden to 0 (disabled) for AsciiDoc.
|
||||
"""
|
||||
# Shared universal args first
|
||||
add_all_standard_arguments(parser)
|
||||
|
||||
# Override enhance-level default to 0 for AsciiDoc
|
||||
for action in parser._actions:
|
||||
if hasattr(action, "dest") and action.dest == "enhance_level":
|
||||
action.default = 0
|
||||
action.help = (
|
||||
"AI enhancement level (auto-detects API vs LOCAL mode): "
|
||||
"0=disabled (default for AsciiDoc), 1=SKILL.md only, "
|
||||
"2=+architecture/config, 3=full enhancement. "
|
||||
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
|
||||
"otherwise LOCAL (Claude Code)"
|
||||
)
|
||||
|
||||
# AsciiDoc-specific args
|
||||
for arg_name, arg_def in ASCIIDOC_ARGUMENTS.items():
|
||||
flags = arg_def["flags"]
|
||||
kwargs = arg_def["kwargs"]
|
||||
parser.add_argument(*flags, **kwargs)
|
||||
102 src/skill_seekers/cli/arguments/chat.py (new file)
@@ -0,0 +1,102 @@
|
||||
"""Chat command argument definitions.
|
||||
|
||||
This module defines ALL arguments for the chat command in ONE place.
|
||||
Both chat_scraper.py (standalone) and parsers/chat_parser.py (unified CLI)
|
||||
import and use these definitions.
|
||||
|
||||
Shared arguments (name, description, output, enhance-level, api-key,
|
||||
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
|
||||
via ``add_all_standard_arguments()``.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
from typing import Any
|
||||
|
||||
from .common import add_all_standard_arguments
|
||||
|
||||
# Chat-specific argument definitions as data structure
|
||||
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
|
||||
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
|
||||
CHAT_ARGUMENTS: dict[str, dict[str, Any]] = {
|
||||
"export_path": {
|
||||
"flags": ("--export-path",),
|
||||
"kwargs": {
|
||||
"type": str,
|
||||
"help": "Path to chat export directory or file",
|
||||
"metavar": "PATH",
|
||||
        },
    },
    "platform": {
        "flags": ("--platform",),
        "kwargs": {
            "type": str,
            "choices": ["slack", "discord"],
            "default": "slack",
            "help": "Chat platform type (default: slack)",
        },
    },
    "token": {
        "flags": ("--token",),
        "kwargs": {
            "type": str,
            "help": "API token for chat platform authentication",
            "metavar": "TOKEN",
        },
    },
    "channel": {
        "flags": ("--channel",),
        "kwargs": {
            "type": str,
            "help": "Channel name or ID to extract from",
            "metavar": "CHANNEL",
        },
    },
    "max_messages": {
        "flags": ("--max-messages",),
        "kwargs": {
            "type": int,
            "default": 10000,
            "help": "Maximum number of messages to extract (default: 10000)",
            "metavar": "N",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_chat_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all chat command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds Chat-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for Chat.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for Chat
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for Chat), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # Chat-specific args
    for arg_name, arg_def in CHAT_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
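The table-plus-loop pattern above (each argument as a `flags`/`kwargs` dict consumed by `parser.add_argument(*flags, **kwargs)`) can be exercised standalone; a minimal sketch with a two-entry table mirroring the shape of CHAT_ARGUMENTS:

```python
import argparse

# Hypothetical mini-table in the same shape as CHAT_ARGUMENTS.
ARGS = {
    "platform": {
        "flags": ("--platform",),
        "kwargs": {"type": str, "choices": ["slack", "discord"], "default": "slack"},
    },
    "max_messages": {
        "flags": ("--max-messages",),
        "kwargs": {"type": int, "default": 10000, "metavar": "N"},
    },
}

parser = argparse.ArgumentParser()
for arg_def in ARGS.values():
    # Unpack the data-driven definition exactly as add_chat_arguments() does.
    parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

ns = parser.parse_args(["--platform", "discord", "--max-messages", "500"])
print(ns.platform, ns.max_messages)  # → discord 500
```

Keeping the definitions as data is what lets both the standalone scraper and the unified CLI register identical flags from one source of truth.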
109 src/skill_seekers/cli/arguments/confluence.py (new file)
@@ -0,0 +1,109 @@
"""Confluence command argument definitions.

This module defines ALL arguments for the confluence command in ONE place.
Both confluence_scraper.py (standalone) and parsers/confluence_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# Confluence-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
CONFLUENCE_ARGUMENTS: dict[str, dict[str, Any]] = {
    "base_url": {
        "flags": ("--base-url",),
        "kwargs": {
            "type": str,
            "help": "Confluence instance base URL",
            "metavar": "URL",
        },
    },
    "space_key": {
        "flags": ("--space-key",),
        "kwargs": {
            "type": str,
            "help": "Confluence space key to extract from",
            "metavar": "KEY",
        },
    },
    "export_path": {
        "flags": ("--export-path",),
        "kwargs": {
            "type": str,
            "help": "Path to Confluence HTML/XML export directory",
            "metavar": "PATH",
        },
    },
    "username": {
        "flags": ("--username",),
        "kwargs": {
            "type": str,
            "help": "Confluence username for API authentication",
            "metavar": "USER",
        },
    },
    "token": {
        "flags": ("--token",),
        "kwargs": {
            "type": str,
            "help": "Confluence API token for authentication",
            "metavar": "TOKEN",
        },
    },
    "max_pages": {
        "flags": ("--max-pages",),
        "kwargs": {
            "type": int,
            "default": 500,
            "help": "Maximum number of pages to extract (default: 500)",
            "metavar": "N",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_confluence_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all confluence command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds Confluence-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for Confluence.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for Confluence
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for Confluence), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # Confluence-specific args
    for arg_name, arg_def in CONFLUENCE_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
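The enhance-level override used in each of these modules walks `parser._actions` (argparse internal state) and mutates the matching action before parsing. A standalone check that mutating `action.default` this way actually changes the parsed default; the flag name is the real `--enhance-level`, while the initial help string here is a placeholder:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--enhance-level", type=int, default=1, help="placeholder help")

# Same pattern as add_confluence_arguments(): find the action by dest, patch it.
for action in parser._actions:
    if hasattr(action, "dest") and action.dest == "enhance_level":
        action.default = 0
        action.help = "0=disabled (default)"

ns = parser.parse_args([])
print(ns.enhance_level)  # → 0
```

argparse reads `action.default` when `parse_args()` runs, so patching it after registration but before parsing is sufficient; note that `_actions` is a private attribute and could change across Python versions.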
@@ -549,6 +549,121 @@ CONFIG_ARGUMENTS: dict[str, dict[str, Any]] = {
    # For unified config files, use `skill-seekers unified --fresh` directly.
}

# New source type arguments (v3.2.0+)
# These are minimal dicts since most flags are handled by each scraper's own argument module.
# The create command only needs the primary input flag for routing.

JUPYTER_ARGUMENTS: dict[str, dict[str, Any]] = {
    "notebook": {
        "flags": ("--notebook",),
        "kwargs": {"type": str, "help": "Jupyter Notebook file path (.ipynb)", "metavar": "PATH"},
    },
}

HTML_ARGUMENTS: dict[str, dict[str, Any]] = {
    "html_path": {
        "flags": ("--html-path",),
        "kwargs": {"type": str, "help": "Local HTML file or directory path", "metavar": "PATH"},
    },
}

OPENAPI_ARGUMENTS: dict[str, dict[str, Any]] = {
    "spec": {
        "flags": ("--spec",),
        "kwargs": {"type": str, "help": "OpenAPI/Swagger spec file path", "metavar": "PATH"},
    },
    "spec_url": {
        "flags": ("--spec-url",),
        "kwargs": {"type": str, "help": "OpenAPI/Swagger spec URL", "metavar": "URL"},
    },
}

ASCIIDOC_ARGUMENTS: dict[str, dict[str, Any]] = {
    "asciidoc_path": {
        "flags": ("--asciidoc-path",),
        "kwargs": {"type": str, "help": "AsciiDoc file or directory path", "metavar": "PATH"},
    },
}

PPTX_ARGUMENTS: dict[str, dict[str, Any]] = {
    "pptx": {
        "flags": ("--pptx",),
        "kwargs": {"type": str, "help": "PowerPoint file path (.pptx)", "metavar": "PATH"},
    },
}

RSS_ARGUMENTS: dict[str, dict[str, Any]] = {
    "feed_url": {
        "flags": ("--feed-url",),
        "kwargs": {"type": str, "help": "RSS/Atom feed URL", "metavar": "URL"},
    },
    "feed_path": {
        "flags": ("--feed-path",),
        "kwargs": {"type": str, "help": "RSS/Atom feed file path", "metavar": "PATH"},
    },
}

MANPAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
    "man_names": {
        "flags": ("--man-names",),
        "kwargs": {
            "type": str,
            "help": "Comma-separated man page names (e.g., 'git,curl')",
            "metavar": "NAMES",
        },
    },
    "man_path": {
        "flags": ("--man-path",),
        "kwargs": {"type": str, "help": "Directory of man page files", "metavar": "PATH"},
    },
}

CONFLUENCE_ARGUMENTS: dict[str, dict[str, Any]] = {
    "conf_base_url": {
        "flags": ("--conf-base-url",),
        "kwargs": {"type": str, "help": "Confluence base URL", "metavar": "URL"},
    },
    "space_key": {
        "flags": ("--space-key",),
        "kwargs": {"type": str, "help": "Confluence space key", "metavar": "KEY"},
    },
    "conf_export_path": {
        "flags": ("--conf-export-path",),
        "kwargs": {"type": str, "help": "Confluence export directory", "metavar": "PATH"},
    },
}

NOTION_ARGUMENTS: dict[str, dict[str, Any]] = {
    "database_id": {
        "flags": ("--database-id",),
        "kwargs": {"type": str, "help": "Notion database ID", "metavar": "ID"},
    },
    "page_id": {
        "flags": ("--page-id",),
        "kwargs": {"type": str, "help": "Notion page ID", "metavar": "ID"},
    },
    "notion_export_path": {
        "flags": ("--notion-export-path",),
        "kwargs": {"type": str, "help": "Notion export directory", "metavar": "PATH"},
    },
}

CHAT_ARGUMENTS: dict[str, dict[str, Any]] = {
    "chat_export_path": {
        "flags": ("--chat-export-path",),
        "kwargs": {"type": str, "help": "Slack/Discord export directory", "metavar": "PATH"},
    },
    "platform": {
        "flags": ("--platform",),
        "kwargs": {
            "type": str,
            "choices": ["slack", "discord"],
            "default": "slack",
            "help": "Chat platform (default: slack)",
        },
    },
}

# =============================================================================
# TIER 3: ADVANCED/RARE ARGUMENTS
# =============================================================================
@@ -613,6 +728,17 @@ def get_source_specific_arguments(source_type: str) -> dict[str, dict[str, Any]]
        "epub": EPUB_ARGUMENTS,
        "video": VIDEO_ARGUMENTS,
        "config": CONFIG_ARGUMENTS,
        # New source types (v3.2.0+)
        "jupyter": JUPYTER_ARGUMENTS,
        "html": HTML_ARGUMENTS,
        "openapi": OPENAPI_ARGUMENTS,
        "asciidoc": ASCIIDOC_ARGUMENTS,
        "pptx": PPTX_ARGUMENTS,
        "rss": RSS_ARGUMENTS,
        "manpage": MANPAGE_ARGUMENTS,
        "confluence": CONFLUENCE_ARGUMENTS,
        "notion": NOTION_ARGUMENTS,
        "chat": CHAT_ARGUMENTS,
    }
    return source_args.get(source_type, {})

@@ -703,6 +829,24 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
    for arg_name, arg_def in CONFIG_ARGUMENTS.items():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

    # New source types (v3.2.0+)
    _NEW_SOURCE_ARGS = {
        "jupyter": JUPYTER_ARGUMENTS,
        "html": HTML_ARGUMENTS,
        "openapi": OPENAPI_ARGUMENTS,
        "asciidoc": ASCIIDOC_ARGUMENTS,
        "pptx": PPTX_ARGUMENTS,
        "rss": RSS_ARGUMENTS,
        "manpage": MANPAGE_ARGUMENTS,
        "confluence": CONFLUENCE_ARGUMENTS,
        "notion": NOTION_ARGUMENTS,
        "chat": CHAT_ARGUMENTS,
    }
    for stype, sargs in _NEW_SOURCE_ARGS.items():
        if mode in [stype, "all"]:
            for arg_name, arg_def in sargs.items():
                parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

    # Add advanced arguments if requested
    if mode in ["advanced", "all"]:
        for arg_name, arg_def in ADVANCED_ARGUMENTS.items():
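The `_NEW_SOURCE_ARGS` loop above registers a source's flags only when `mode` is that source type or "all". A reduced sketch of that routing; the `build_parser` helper and its two-entry table are hypothetical stand-ins:

```python
import argparse

# Two-entry stand-in for _NEW_SOURCE_ARGS.
JUPYTER = {"notebook": {"flags": ("--notebook",), "kwargs": {"type": str}}}
RSS = {"feed_url": {"flags": ("--feed-url",), "kwargs": {"type": str}}}
NEW_SOURCE_ARGS = {"jupyter": JUPYTER, "rss": RSS}


def build_parser(mode: str) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for stype, sargs in NEW_SOURCE_ARGS.items():
        if mode in [stype, "all"]:  # same gate as add_create_arguments()
            for arg_def in sargs.values():
                parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
    return parser


ns = build_parser("jupyter").parse_args(["--notebook", "a.ipynb"])
print(ns.notebook)  # → a.ipynb
```

In "jupyter" mode the RSS flags are never registered, so `--feed-url` would be rejected there; "all" mode picks up every table, which is what keeps `skill-seekers create` able to route any of the 17 source types.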
68 src/skill_seekers/cli/arguments/html.py (new file)
@@ -0,0 +1,68 @@
"""HTML command argument definitions.

This module defines ALL arguments for the html command in ONE place.
Both html_scraper.py (standalone) and parsers/html_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# HTML-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
HTML_ARGUMENTS: dict[str, dict[str, Any]] = {
    "html_path": {
        "flags": ("--html-path",),
        "kwargs": {
            "type": str,
            "help": "Path to HTML file or directory containing HTML files",
            "metavar": "PATH",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_html_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all html command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds HTML-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for HTML.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for HTML
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for HTML), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # HTML-specific args
    for arg_name, arg_def in HTML_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
68 src/skill_seekers/cli/arguments/jupyter.py (new file)
@@ -0,0 +1,68 @@
"""Jupyter Notebook command argument definitions.

This module defines ALL arguments for the jupyter command in ONE place.
Both jupyter_scraper.py (standalone) and parsers/jupyter_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# Jupyter-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
JUPYTER_ARGUMENTS: dict[str, dict[str, Any]] = {
    "notebook": {
        "flags": ("--notebook",),
        "kwargs": {
            "type": str,
            "help": "Path to .ipynb file or directory containing notebooks",
            "metavar": "PATH",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_jupyter_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all jupyter command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds Jupyter-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for Jupyter.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for Jupyter
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for Jupyter), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # Jupyter-specific args
    for arg_name, arg_def in JUPYTER_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
84 src/skill_seekers/cli/arguments/manpage.py (new file)
@@ -0,0 +1,84 @@
"""Man page command argument definitions.

This module defines ALL arguments for the manpage command in ONE place.
Both manpage_scraper.py (standalone) and parsers/manpage_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# ManPage-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
MANPAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
    "man_names": {
        "flags": ("--man-names",),
        "kwargs": {
            "type": str,
            "help": "Comma-separated list of man page names (e.g., 'ls,grep,find')",
            "metavar": "NAMES",
        },
    },
    "man_path": {
        "flags": ("--man-path",),
        "kwargs": {
            "type": str,
            "help": "Path to directory containing man page files",
            "metavar": "PATH",
        },
    },
    "sections": {
        "flags": ("--sections",),
        "kwargs": {
            "type": str,
            "help": "Comma-separated section numbers to include (e.g., '1,3,8')",
            "metavar": "SECTIONS",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_manpage_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all manpage command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds ManPage-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for ManPage.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for ManPage
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for ManPage), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # ManPage-specific args
    for arg_name, arg_def in MANPAGE_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
101 src/skill_seekers/cli/arguments/notion.py (new file)
@@ -0,0 +1,101 @@
"""Notion command argument definitions.

This module defines ALL arguments for the notion command in ONE place.
Both notion_scraper.py (standalone) and parsers/notion_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# Notion-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
NOTION_ARGUMENTS: dict[str, dict[str, Any]] = {
    "database_id": {
        "flags": ("--database-id",),
        "kwargs": {
            "type": str,
            "help": "Notion database ID to extract from",
            "metavar": "ID",
        },
    },
    "page_id": {
        "flags": ("--page-id",),
        "kwargs": {
            "type": str,
            "help": "Notion page ID to extract from",
            "metavar": "ID",
        },
    },
    "export_path": {
        "flags": ("--export-path",),
        "kwargs": {
            "type": str,
            "help": "Path to Notion export directory",
            "metavar": "PATH",
        },
    },
    "token": {
        "flags": ("--token",),
        "kwargs": {
            "type": str,
            "help": "Notion integration token for API authentication",
            "metavar": "TOKEN",
        },
    },
    "max_pages": {
        "flags": ("--max-pages",),
        "kwargs": {
            "type": int,
            "default": 500,
            "help": "Maximum number of pages to extract (default: 500)",
            "metavar": "N",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_notion_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all notion command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds Notion-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for Notion.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for Notion
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for Notion), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # Notion-specific args
    for arg_name, arg_def in NOTION_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
76 src/skill_seekers/cli/arguments/openapi.py (new file)
@@ -0,0 +1,76 @@
"""OpenAPI command argument definitions.

This module defines ALL arguments for the openapi command in ONE place.
Both openapi_scraper.py (standalone) and parsers/openapi_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# OpenAPI-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
OPENAPI_ARGUMENTS: dict[str, dict[str, Any]] = {
    "spec": {
        "flags": ("--spec",),
        "kwargs": {
            "type": str,
            "help": "Path to OpenAPI/Swagger spec file",
            "metavar": "PATH",
        },
    },
    "spec_url": {
        "flags": ("--spec-url",),
        "kwargs": {
            "type": str,
            "help": "URL to OpenAPI/Swagger spec",
            "metavar": "URL",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_openapi_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all openapi command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds OpenAPI-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for OpenAPI.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for OpenAPI
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for OpenAPI), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # OpenAPI-specific args
    for arg_name, arg_def in OPENAPI_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
68 src/skill_seekers/cli/arguments/pptx.py (new file)
@@ -0,0 +1,68 @@
"""PPTX command argument definitions.

This module defines ALL arguments for the pptx command in ONE place.
Both pptx_scraper.py (standalone) and parsers/pptx_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# PPTX-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
PPTX_ARGUMENTS: dict[str, dict[str, Any]] = {
    "pptx": {
        "flags": ("--pptx",),
        "kwargs": {
            "type": str,
            "help": "Path to PowerPoint file (.pptx)",
            "metavar": "PATH",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_pptx_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all pptx command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds PPTX-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for PPTX.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for PPTX
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for PPTX), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # PPTX-specific args
    for arg_name, arg_def in PPTX_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
101 src/skill_seekers/cli/arguments/rss.py (new file)
@@ -0,0 +1,101 @@
"""RSS command argument definitions.

This module defines ALL arguments for the rss command in ONE place.
Both rss_scraper.py (standalone) and parsers/rss_parser.py (unified CLI)
import and use these definitions.

Shared arguments (name, description, output, enhance-level, api-key,
dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
via ``add_all_standard_arguments()``.
"""

import argparse
from typing import Any

from .common import add_all_standard_arguments

# RSS-specific argument definitions as data structure
# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
RSS_ARGUMENTS: dict[str, dict[str, Any]] = {
    "feed_url": {
        "flags": ("--feed-url",),
        "kwargs": {
            "type": str,
            "help": "URL of the RSS/Atom feed",
            "metavar": "URL",
        },
    },
    "feed_path": {
        "flags": ("--feed-path",),
        "kwargs": {
            "type": str,
            "help": "Path to local RSS/Atom feed file",
            "metavar": "PATH",
        },
    },
    "follow_links": {
        "flags": ("--follow-links",),
        "kwargs": {
            "action": "store_true",
            "default": True,
            "help": "Follow article links and extract full content (default: True)",
        },
    },
    "no_follow_links": {
        "flags": ("--no-follow-links",),
        "kwargs": {
            "action": "store_false",
            "dest": "follow_links",
            "help": "Do not follow article links; use feed summary only",
        },
    },
    "max_articles": {
        "flags": ("--max-articles",),
        "kwargs": {
            "type": int,
            "default": 50,
            "help": "Maximum number of articles to extract (default: 50)",
            "metavar": "N",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_rss_arguments(parser: argparse.ArgumentParser) -> None:
    """Add all rss command arguments to a parser.

    Registers shared args (name, description, output, enhance-level, api-key,
    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
    then adds RSS-specific args on top.

    The default for --enhance-level is overridden to 0 (disabled) for RSS.
    """
    # Shared universal args first
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for RSS
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for RSS), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    # RSS-specific args
    for arg_name, arg_def in RSS_ARGUMENTS.items():
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
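RSS is the only module in this commit that defines a paired on/off switch: `--follow-links` (`store_true`, default True) and `--no-follow-links` (`store_false` writing into the same `dest`). A standalone sketch of just that pair:

```python
import argparse

parser = argparse.ArgumentParser()
# Two flags, one dest: the explicit default=True is applied first,
# and --no-follow-links flips the shared follow_links attribute to False.
parser.add_argument("--follow-links", action="store_true", default=True)
parser.add_argument("--no-follow-links", action="store_false", dest="follow_links")

print(parser.parse_args([]).follow_links)                     # → True
print(parser.parse_args(["--no-follow-links"]).follow_links)  # → False
```

Since Python 3.9, `argparse.BooleanOptionalAction` offers a single-registration alternative for this shape, but the explicit pair above keeps the data-driven `flags`/`kwargs` table uniform.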
1085 src/skill_seekers/cli/asciidoc_scraper.py (new file; diff suppressed because it is too large)
1920 src/skill_seekers/cli/chat_scraper.py (new file; diff suppressed because it is too large)
@@ -7,6 +7,19 @@ Validates unified config format that supports multiple sources:
- github (repository scraping)
- pdf (PDF document scraping)
- local (local codebase analysis)
- word (Word .docx document scraping)
- video (video transcript/visual extraction)
- epub (EPUB e-book extraction)
- jupyter (Jupyter Notebook extraction)
- html (local HTML file extraction)
- openapi (OpenAPI/Swagger spec extraction)
- asciidoc (AsciiDoc document extraction)
- pptx (PowerPoint presentation extraction)
- confluence (Confluence wiki extraction)
- notion (Notion page extraction)
- rss (RSS/Atom feed extraction)
- manpage (man page extraction)
- chat (Slack/Discord chat export extraction)

Legacy config format support removed in v2.11.0.
All configs must use unified format with 'sources' array.
@@ -27,7 +40,25 @@ class ConfigValidator:
    """

    # Valid source types
-    VALID_SOURCE_TYPES = {"documentation", "github", "pdf", "local", "word", "video"}
+    VALID_SOURCE_TYPES = {
+        "documentation",
+        "github",
+        "pdf",
+        "local",
+        "word",
+        "video",
+        "epub",
+        "jupyter",
+        "html",
+        "openapi",
+        "asciidoc",
+        "pptx",
+        "confluence",
+        "notion",
+        "rss",
+        "manpage",
+        "chat",
+    }
# Valid merge modes
|
||||
VALID_MERGE_MODES = {"rule-based", "claude-enhanced"}
|
||||
@@ -159,6 +190,32 @@ class ConfigValidator:
|
||||
self._validate_pdf_source(source, index)
|
||||
elif source_type == "local":
|
||||
self._validate_local_source(source, index)
|
||||
elif source_type == "word":
|
||||
self._validate_word_source(source, index)
|
||||
elif source_type == "video":
|
||||
self._validate_video_source(source, index)
|
||||
elif source_type == "epub":
|
||||
self._validate_epub_source(source, index)
|
||||
elif source_type == "jupyter":
|
||||
self._validate_jupyter_source(source, index)
|
||||
elif source_type == "html":
|
||||
self._validate_html_source(source, index)
|
||||
elif source_type == "openapi":
|
||||
self._validate_openapi_source(source, index)
|
||||
elif source_type == "asciidoc":
|
||||
self._validate_asciidoc_source(source, index)
|
||||
elif source_type == "pptx":
|
||||
self._validate_pptx_source(source, index)
|
||||
elif source_type == "confluence":
|
||||
self._validate_confluence_source(source, index)
|
||||
elif source_type == "notion":
|
||||
self._validate_notion_source(source, index)
|
||||
elif source_type == "rss":
|
||||
self._validate_rss_source(source, index)
|
||||
elif source_type == "manpage":
|
||||
self._validate_manpage_source(source, index)
|
||||
elif source_type == "chat":
|
||||
self._validate_chat_source(source, index)
|
||||
|
||||
def _validate_documentation_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate documentation source configuration."""
|
||||
@@ -253,12 +310,126 @@ class ConfigValidator:
|
||||
f"Source {index} (local): Invalid ai_mode '{ai_mode}'. Must be one of {self.VALID_AI_MODES}"
|
||||
)
|
||||
|
||||
def _validate_word_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate Word document (.docx) source configuration."""
|
||||
if "path" not in source:
|
||||
raise ValueError(f"Source {index} (word): Missing required field 'path'")
|
||||
word_path = source["path"]
|
||||
if not Path(word_path).exists():
|
||||
logger.warning(f"Source {index} (word): File not found: {word_path}")
|
||||
|
||||
def _validate_video_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate video source configuration."""
|
||||
has_url = "url" in source
|
||||
has_path = "path" in source
|
||||
has_playlist = "playlist" in source
|
||||
if not has_url and not has_path and not has_playlist:
|
||||
raise ValueError(
|
||||
f"Source {index} (video): Missing required field 'url', 'path', or 'playlist'"
|
||||
)
|
||||
|
||||
def _validate_epub_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate EPUB source configuration."""
|
||||
if "path" not in source:
|
||||
raise ValueError(f"Source {index} (epub): Missing required field 'path'")
|
||||
epub_path = source["path"]
|
||||
if not Path(epub_path).exists():
|
||||
logger.warning(f"Source {index} (epub): File not found: {epub_path}")
|
||||
|
||||
def _validate_jupyter_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate Jupyter Notebook source configuration."""
|
||||
if "path" not in source:
|
||||
raise ValueError(f"Source {index} (jupyter): Missing required field 'path'")
|
||||
nb_path = source["path"]
|
||||
if not Path(nb_path).exists():
|
||||
logger.warning(f"Source {index} (jupyter): Path not found: {nb_path}")
|
||||
|
||||
def _validate_html_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate local HTML source configuration."""
|
||||
if "path" not in source:
|
||||
raise ValueError(f"Source {index} (html): Missing required field 'path'")
|
||||
html_path = source["path"]
|
||||
if not Path(html_path).exists():
|
||||
logger.warning(f"Source {index} (html): Path not found: {html_path}")
|
||||
|
||||
def _validate_openapi_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate OpenAPI/Swagger source configuration."""
|
||||
if "path" not in source and "url" not in source:
|
||||
raise ValueError(f"Source {index} (openapi): Missing required field 'path' or 'url'")
|
||||
if "path" in source and not Path(source["path"]).exists():
|
||||
logger.warning(f"Source {index} (openapi): File not found: {source['path']}")
|
||||
|
||||
def _validate_asciidoc_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate AsciiDoc source configuration."""
|
||||
if "path" not in source:
|
||||
raise ValueError(f"Source {index} (asciidoc): Missing required field 'path'")
|
||||
adoc_path = source["path"]
|
||||
if not Path(adoc_path).exists():
|
||||
logger.warning(f"Source {index} (asciidoc): Path not found: {adoc_path}")
|
||||
|
||||
def _validate_pptx_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate PowerPoint source configuration."""
|
||||
if "path" not in source:
|
||||
raise ValueError(f"Source {index} (pptx): Missing required field 'path'")
|
||||
pptx_path = source["path"]
|
||||
if not Path(pptx_path).exists():
|
||||
logger.warning(f"Source {index} (pptx): File not found: {pptx_path}")
|
||||
|
||||
def _validate_confluence_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate Confluence source configuration."""
|
||||
has_url = "url" in source or "base_url" in source
|
||||
has_path = "path" in source
|
||||
if not has_url and not has_path:
|
||||
raise ValueError(
|
||||
f"Source {index} (confluence): Missing required field 'url'/'base_url' "
|
||||
f"(for API) or 'path' (for export)"
|
||||
)
|
||||
if has_url and "space_key" not in source and "path" not in source:
|
||||
logger.warning(f"Source {index} (confluence): No 'space_key' specified for API mode")
|
||||
|
||||
def _validate_notion_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate Notion source configuration."""
|
||||
has_url = "url" in source or "database_id" in source or "page_id" in source
|
||||
has_path = "path" in source
|
||||
if not has_url and not has_path:
|
||||
raise ValueError(
|
||||
f"Source {index} (notion): Missing required field 'url'/'database_id'/'page_id' "
|
||||
f"(for API) or 'path' (for export)"
|
||||
)
|
||||
|
||||
def _validate_rss_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate RSS/Atom feed source configuration."""
|
||||
if "url" not in source and "path" not in source:
|
||||
raise ValueError(f"Source {index} (rss): Missing required field 'url' or 'path'")
|
||||
|
||||
def _validate_manpage_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate man page source configuration."""
|
||||
if "path" not in source and "names" not in source:
|
||||
raise ValueError(f"Source {index} (manpage): Missing required field 'path' or 'names'")
|
||||
if "path" in source and not Path(source["path"]).exists():
|
||||
logger.warning(f"Source {index} (manpage): Path not found: {source['path']}")
|
||||
|
||||
def _validate_chat_source(self, source: dict[str, Any], index: int):
|
||||
"""Validate Slack/Discord chat source configuration."""
|
||||
has_path = "path" in source
|
||||
has_api = "token" in source or "webhook_url" in source
|
||||
has_channel = "channel" in source or "channel_id" in source
|
||||
if not has_path and not has_api:
|
||||
raise ValueError(
|
||||
f"Source {index} (chat): Missing required field 'path' (for export) "
|
||||
f"or 'token' (for API)"
|
||||
)
|
||||
if has_api and not has_channel:
|
||||
logger.warning(
|
||||
f"Source {index} (chat): No 'channel' or 'channel_id' specified for API mode"
|
||||
)
|
||||
|
||||
def get_sources_by_type(self, source_type: str) -> list[dict[str, Any]]:
|
||||
"""
|
||||
Get all sources of a specific type.
|
||||
|
||||
Args:
|
||||
source_type: 'documentation', 'github', 'pdf', or 'local'
|
||||
source_type: Any valid source type string
|
||||
|
||||
Returns:
|
||||
List of sources matching the type
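A minimal sketch of what these per-type rules accept. The config below uses hypothetical paths, and the `REQUIRED_ANY` table is a simplified stand-in for the validator methods above, not the real ConfigValidator:

```python
# Hypothetical unified config mixing three of the new source types
config = {
    "sources": [
        {"type": "jupyter", "path": "notebooks/analysis.ipynb"},
        {"type": "openapi", "url": "https://example.com/openapi.yaml"},
        {"type": "chat", "token": "hypothetical-token", "channel": "general"},
    ]
}

# Each type must carry at least one of these keys (simplified view of the
# required-field checks; warnings and API-mode nuances are omitted)
REQUIRED_ANY = {
    "jupyter": {"path"},
    "openapi": {"path", "url"},
    "chat": {"path", "token", "webhook_url"},
}

for i, src in enumerate(config["sources"]):
    if not REQUIRED_ANY[src["type"]] & src.keys():
        raise ValueError(f"Source {i} ({src['type']}): missing required field")
print("config OK")
```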
src/skill_seekers/cli/confluence_scraper.py (new file, 2166 lines; diff suppressed: too large)
@@ -140,6 +140,26 @@ class CreateCommand:
            return self._route_video()
        elif self.source_info.type == "config":
            return self._route_config()
        elif self.source_info.type == "jupyter":
            return self._route_generic("jupyter_scraper", "--notebook")
        elif self.source_info.type == "html":
            return self._route_generic("html_scraper", "--html-path")
        elif self.source_info.type == "openapi":
            return self._route_generic("openapi_scraper", "--spec")
        elif self.source_info.type == "asciidoc":
            return self._route_generic("asciidoc_scraper", "--asciidoc-path")
        elif self.source_info.type == "pptx":
            return self._route_generic("pptx_scraper", "--pptx")
        elif self.source_info.type == "rss":
            return self._route_generic("rss_scraper", "--feed-path")
        elif self.source_info.type == "manpage":
            return self._route_generic("man_scraper", "--man-path")
        elif self.source_info.type == "confluence":
            return self._route_generic("confluence_scraper", "--export-path")
        elif self.source_info.type == "notion":
            return self._route_generic("notion_scraper", "--export-path")
        elif self.source_info.type == "chat":
            return self._route_generic("chat_scraper", "--export-path")
        else:
            logger.error(f"Unknown source type: {self.source_info.type}")
            return 1
@@ -485,6 +505,40 @@ class CreateCommand:
        finally:
            sys.argv = original_argv

    def _route_generic(self, module_name: str, file_flag: str) -> int:
        """Generic routing for new source types.

        Most new source types (jupyter, html, openapi, asciidoc, pptx, rss,
        manpage, confluence, notion, chat) follow the same pattern:
        import module, build argv with --flag <file_path>, add common args, call main().

        Args:
            module_name: Python module name under skill_seekers.cli (e.g., "jupyter_scraper")
            file_flag: CLI flag for the source file (e.g., "--notebook")

        Returns:
            Exit code from scraper
        """
        import importlib

        module = importlib.import_module(f"skill_seekers.cli.{module_name}")

        argv = [module_name]

        file_path = self.source_info.parsed.get("file_path", "")
        if file_path:
            argv.extend([file_flag, file_path])

        self._add_common_args(argv)

        logger.debug(f"Calling {module_name} with argv: {argv}")
        original_argv = sys.argv
        try:
            sys.argv = argv
            return module.main()
        finally:
            sys.argv = original_argv

    def _add_common_args(self, argv: list[str]) -> None:
        """Add truly universal arguments to argv list.
src/skill_seekers/cli/html_scraper.py (new file, 1942 lines; diff suppressed: too large)
src/skill_seekers/cli/jupyter_scraper.py (new file, 1209 lines; diff suppressed: too large)
@@ -15,7 +15,17 @@ Commands:
    word            Extract from Word (.docx) file
    epub            Extract from EPUB e-book (.epub)
    video           Extract from video (YouTube or local)
    unified         Multi-source scraping (docs + GitHub + PDF)
    jupyter         Extract from Jupyter Notebook (.ipynb)
    html            Extract from local HTML files
    openapi         Extract from OpenAPI/Swagger spec
    asciidoc        Extract from AsciiDoc documents (.adoc)
    pptx            Extract from PowerPoint (.pptx)
    rss             Extract from RSS/Atom feeds
    manpage         Extract from man pages
    confluence      Extract from Confluence wiki
    notion          Extract from Notion pages
    chat            Extract from Slack/Discord chat exports
    unified         Multi-source scraping (docs + GitHub + PDF + more)
    analyze         Analyze local codebase and extract code knowledge
    enhance         AI-powered enhancement (auto: API or LOCAL mode)
    enhance-status  Check enhancement status (for background/daemon modes)
@@ -70,6 +80,17 @@ COMMAND_MODULES = {
    "quality": "skill_seekers.cli.quality_metrics",
    "workflows": "skill_seekers.cli.workflows_command",
    "sync-config": "skill_seekers.cli.sync_config",
    # New source types (v3.2.0+)
    "jupyter": "skill_seekers.cli.jupyter_scraper",
    "html": "skill_seekers.cli.html_scraper",
    "openapi": "skill_seekers.cli.openapi_scraper",
    "asciidoc": "skill_seekers.cli.asciidoc_scraper",
    "pptx": "skill_seekers.cli.pptx_scraper",
    "rss": "skill_seekers.cli.rss_scraper",
    "manpage": "skill_seekers.cli.man_scraper",
    "confluence": "skill_seekers.cli.confluence_scraper",
    "notion": "skill_seekers.cli.notion_scraper",
    "chat": "skill_seekers.cli.chat_scraper",
}
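Dispatch through a registry like COMMAND_MODULES keeps startup fast: a command's module is imported only when that command is invoked. A sketch of the pattern, using stdlib module names as stand-ins for the real skill_seekers.cli paths:

```python
import importlib

# Stand-in registry: real entries map commands to skill_seekers.cli.* modules
COMMAND_MODULES = {
    "jupyter": "json",
    "rss": "csv",
}

def resolve(command: str):
    try:
        module_path = COMMAND_MODULES[command]
    except KeyError:
        raise SystemExit(f"Unknown command: {command}")
    # Imported lazily, on demand — unused scrapers never load
    return importlib.import_module(module_path)

print(resolve("jupyter").__name__)  # json
```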
src/skill_seekers/cli/man_scraper.py (new file, 1513 lines; diff suppressed: too large)
src/skill_seekers/cli/notion_scraper.py (new file, 1023 lines; diff suppressed: too large)
src/skill_seekers/cli/openapi_scraper.py (new file, 1959 lines; diff suppressed: too large)
@@ -33,6 +33,18 @@ from .quality_parser import QualityParser
from .workflows_parser import WorkflowsParser
from .sync_config_parser import SyncConfigParser

# New source type parsers (v3.2.0+)
from .jupyter_parser import JupyterParser
from .html_parser import HtmlParser
from .openapi_parser import OpenAPIParser
from .asciidoc_parser import AsciiDocParser
from .pptx_parser import PptxParser
from .rss_parser import RssParser
from .manpage_parser import ManPageParser
from .confluence_parser import ConfluenceParser
from .notion_parser import NotionParser
from .chat_parser import ChatParser

# Registry of all parsers (in order of usage frequency)
PARSERS = [
    CreateParser(),  # NEW: Unified create command (placed first for prominence)
@@ -60,6 +72,17 @@ PARSERS = [
    QualityParser(),
    WorkflowsParser(),
    SyncConfigParser(),
    # New source types (v3.2.0+)
    JupyterParser(),
    HtmlParser(),
    OpenAPIParser(),
    AsciiDocParser(),
    PptxParser(),
    RssParser(),
    ManPageParser(),
    ConfluenceParser(),
    NotionParser(),
    ChatParser(),
]
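Each entry in PARSERS follows the same shape: a name/help pair plus an `add_arguments` hook wired to shared definitions, and a registration loop turns the list into subcommands. A self-contained sketch of that loop (the base class here is a simplified stand-in for the real SubcommandParser):

```python
import argparse

class SubcommandParser:
    """Minimal stand-in for the real base class."""
    name = ""
    help = ""

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        raise NotImplementedError

class RssParser(SubcommandParser):
    name = "rss"
    help = "Extract from RSS/Atom feeds"

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--feed-path")

PARSERS = [RssParser()]

root = argparse.ArgumentParser(prog="skill-seekers")
sub = root.add_subparsers(dest="command")
for p in PARSERS:
    p.add_arguments(sub.add_parser(p.name, help=p.help))

args = root.parse_args(["rss", "--feed-path", "feed.rss"])
print(args.command, args.feed_path)  # rss feed.rss
```

Adding a source type then means appending one parser instance to the list; nothing else in the CLI wiring changes.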
src/skill_seekers/cli/parsers/asciidoc_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""AsciiDoc subcommand parser.

Uses shared argument definitions from arguments.asciidoc to ensure
consistency with the standalone asciidoc_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.asciidoc import add_asciidoc_arguments


class AsciiDocParser(SubcommandParser):
    """Parser for asciidoc subcommand."""

    @property
    def name(self) -> str:
        return "asciidoc"

    @property
    def help(self) -> str:
        return "Extract from AsciiDoc documents (.adoc)"

    @property
    def description(self) -> str:
        return "Extract content from AsciiDoc documents (.adoc) and generate skill"

    def add_arguments(self, parser):
        """Add asciidoc-specific arguments.

        Uses shared argument definitions to ensure consistency
        with asciidoc_scraper.py (standalone scraper).
        """
        add_asciidoc_arguments(parser)
src/skill_seekers/cli/parsers/chat_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""Chat subcommand parser.

Uses shared argument definitions from arguments.chat to ensure
consistency with the standalone chat_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.chat import add_chat_arguments


class ChatParser(SubcommandParser):
    """Parser for chat subcommand."""

    @property
    def name(self) -> str:
        return "chat"

    @property
    def help(self) -> str:
        return "Extract from Slack/Discord chat exports"

    @property
    def description(self) -> str:
        return "Extract content from Slack/Discord chat exports and generate skill"

    def add_arguments(self, parser):
        """Add chat-specific arguments.

        Uses shared argument definitions to ensure consistency
        with chat_scraper.py (standalone scraper).
        """
        add_chat_arguments(parser)
src/skill_seekers/cli/parsers/confluence_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""Confluence subcommand parser.

Uses shared argument definitions from arguments.confluence to ensure
consistency with the standalone confluence_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.confluence import add_confluence_arguments


class ConfluenceParser(SubcommandParser):
    """Parser for confluence subcommand."""

    @property
    def name(self) -> str:
        return "confluence"

    @property
    def help(self) -> str:
        return "Extract from Confluence wiki"

    @property
    def description(self) -> str:
        return "Extract content from Confluence wiki and generate skill"

    def add_arguments(self, parser):
        """Add confluence-specific arguments.

        Uses shared argument definitions to ensure consistency
        with confluence_scraper.py (standalone scraper).
        """
        add_confluence_arguments(parser)
src/skill_seekers/cli/parsers/html_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""HTML subcommand parser.

Uses shared argument definitions from arguments.html to ensure
consistency with the standalone html_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.html import add_html_arguments


class HtmlParser(SubcommandParser):
    """Parser for html subcommand."""

    @property
    def name(self) -> str:
        return "html"

    @property
    def help(self) -> str:
        return "Extract from local HTML files (.html/.htm)"

    @property
    def description(self) -> str:
        return "Extract content from local HTML files (.html/.htm) and generate skill"

    def add_arguments(self, parser):
        """Add html-specific arguments.

        Uses shared argument definitions to ensure consistency
        with html_scraper.py (standalone scraper).
        """
        add_html_arguments(parser)
src/skill_seekers/cli/parsers/jupyter_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""Jupyter Notebook subcommand parser.

Uses shared argument definitions from arguments.jupyter to ensure
consistency with the standalone jupyter_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.jupyter import add_jupyter_arguments


class JupyterParser(SubcommandParser):
    """Parser for jupyter subcommand."""

    @property
    def name(self) -> str:
        return "jupyter"

    @property
    def help(self) -> str:
        return "Extract from Jupyter Notebook (.ipynb)"

    @property
    def description(self) -> str:
        return "Extract content from Jupyter Notebook (.ipynb) and generate skill"

    def add_arguments(self, parser):
        """Add jupyter-specific arguments.

        Uses shared argument definitions to ensure consistency
        with jupyter_scraper.py (standalone scraper).
        """
        add_jupyter_arguments(parser)
src/skill_seekers/cli/parsers/manpage_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""Man page subcommand parser.

Uses shared argument definitions from arguments.manpage to ensure
consistency with the standalone man_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.manpage import add_manpage_arguments


class ManPageParser(SubcommandParser):
    """Parser for manpage subcommand."""

    @property
    def name(self) -> str:
        return "manpage"

    @property
    def help(self) -> str:
        return "Extract from man pages"

    @property
    def description(self) -> str:
        return "Extract content from man pages and generate skill"

    def add_arguments(self, parser):
        """Add manpage-specific arguments.

        Uses shared argument definitions to ensure consistency
        with man_scraper.py (standalone scraper).
        """
        add_manpage_arguments(parser)
src/skill_seekers/cli/parsers/notion_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""Notion subcommand parser.

Uses shared argument definitions from arguments.notion to ensure
consistency with the standalone notion_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.notion import add_notion_arguments


class NotionParser(SubcommandParser):
    """Parser for notion subcommand."""

    @property
    def name(self) -> str:
        return "notion"

    @property
    def help(self) -> str:
        return "Extract from Notion pages"

    @property
    def description(self) -> str:
        return "Extract content from Notion pages and generate skill"

    def add_arguments(self, parser):
        """Add notion-specific arguments.

        Uses shared argument definitions to ensure consistency
        with notion_scraper.py (standalone scraper).
        """
        add_notion_arguments(parser)
src/skill_seekers/cli/parsers/openapi_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""OpenAPI subcommand parser.

Uses shared argument definitions from arguments.openapi to ensure
consistency with the standalone openapi_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.openapi import add_openapi_arguments


class OpenAPIParser(SubcommandParser):
    """Parser for openapi subcommand."""

    @property
    def name(self) -> str:
        return "openapi"

    @property
    def help(self) -> str:
        return "Extract from OpenAPI/Swagger spec"

    @property
    def description(self) -> str:
        return "Extract content from OpenAPI/Swagger spec and generate skill"

    def add_arguments(self, parser):
        """Add openapi-specific arguments.

        Uses shared argument definitions to ensure consistency
        with openapi_scraper.py (standalone scraper).
        """
        add_openapi_arguments(parser)
src/skill_seekers/cli/parsers/pptx_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""PPTX subcommand parser.

Uses shared argument definitions from arguments.pptx to ensure
consistency with the standalone pptx_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.pptx import add_pptx_arguments


class PptxParser(SubcommandParser):
    """Parser for pptx subcommand."""

    @property
    def name(self) -> str:
        return "pptx"

    @property
    def help(self) -> str:
        return "Extract from PowerPoint presentations (.pptx)"

    @property
    def description(self) -> str:
        return "Extract content from PowerPoint presentations (.pptx) and generate skill"

    def add_arguments(self, parser):
        """Add pptx-specific arguments.

        Uses shared argument definitions to ensure consistency
        with pptx_scraper.py (standalone scraper).
        """
        add_pptx_arguments(parser)
src/skill_seekers/cli/parsers/rss_parser.py (new file, 32 lines)
@@ -0,0 +1,32 @@
"""RSS subcommand parser.

Uses shared argument definitions from arguments.rss to ensure
consistency with the standalone rss_scraper module.
"""

from .base import SubcommandParser
from skill_seekers.cli.arguments.rss import add_rss_arguments


class RssParser(SubcommandParser):
    """Parser for rss subcommand."""

    @property
    def name(self) -> str:
        return "rss"

    @property
    def help(self) -> str:
        return "Extract from RSS/Atom feeds"

    @property
    def description(self) -> str:
        return "Extract content from RSS/Atom feeds and generate skill"

    def add_arguments(self, parser):
        """Add rss-specific arguments.

        Uses shared argument definitions to ensure consistency
        with rss_scraper.py (standalone scraper).
        """
        add_rss_arguments(parser)
src/skill_seekers/cli/pptx_scraper.py (new file, 1821 lines; diff suppressed: too large)
src/skill_seekers/cli/rss_scraper.py (new file, 1087 lines; diff suppressed: too large)
@@ -1,7 +1,12 @@
"""Source type detection for unified create command.

Auto-detects whether a source is a web URL, GitHub repository,
local directory, PDF file, or config file based on patterns.
Auto-detects source type from user input — supports web URLs, GitHub repos,
local directories, and 14+ file types (PDF, DOCX, EPUB, IPYNB, HTML, YAML/OpenAPI,
AsciiDoc, PPTX, RSS/Atom, man pages, video files, and config JSON).

Note: Confluence, Notion, and Slack/Discord chat sources are API/export-based
and cannot be auto-detected from a single argument. Use their dedicated
subcommands (``skill-seekers confluence``, ``notion``, ``chat``) instead.
"""

import os
@@ -66,11 +71,49 @@ class SourceDetector:
        if source.endswith(".epub"):
            return cls._detect_epub(source)

        if source.endswith(".ipynb"):
            return cls._detect_jupyter(source)

        if source.lower().endswith((".html", ".htm")):
            return cls._detect_html(source)

        if source.endswith(".pptx"):
            return cls._detect_pptx(source)

        if source.lower().endswith((".adoc", ".asciidoc")):
            return cls._detect_asciidoc(source)

        # Man page file extensions (.1 through .8, .man)
        # Only match if the basename looks like a man page (e.g., "git.1", not "log.1")
        # Require basename without the extension to be a plausible command name
        if source.lower().endswith(".man"):
            return cls._detect_manpage(source)
        MAN_SECTION_EXTENSIONS = (".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8")
        if source.lower().endswith(MAN_SECTION_EXTENSIONS):
            # Heuristic: man pages have a simple basename (no dots before extension)
            # e.g., "git.1" is a man page, "access.log.1" is not
            basename_no_ext = os.path.splitext(os.path.basename(source))[0]
            if "." not in basename_no_ext:
                return cls._detect_manpage(source)

        # Video file extensions
        VIDEO_EXTENSIONS = (".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv")
        if source.lower().endswith(VIDEO_EXTENSIONS):
            return cls._detect_video_file(source)

        # RSS/Atom feed file extensions (only .rss and .atom — .xml is too generic)
        if source.lower().endswith((".rss", ".atom")):
            return cls._detect_rss(source)

        # OpenAPI/Swagger spec detection (YAML files with OpenAPI content)
        # Sniff file content for 'openapi:' or 'swagger:' keys before committing
        if (
            source.lower().endswith((".yaml", ".yml"))
            and os.path.isfile(source)
            and cls._looks_like_openapi(source)
        ):
            return cls._detect_openapi(source)

        # 2. Video URL detection (before directory check)
        video_url_info = cls._detect_video_url(source)
        if video_url_info:
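The man-page branch in the hunk above can be checked in isolation. This sketch reproduces the dotted-basename heuristic so "git.1" matches while a rotated log like "access.log.1" does not:

```python
import os

MAN_SECTION_EXTENSIONS = (".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8")

def looks_like_manpage(path: str) -> bool:
    if path.lower().endswith(".man"):
        return True
    if not path.lower().endswith(MAN_SECTION_EXTENSIONS):
        return False
    # Man pages have a simple basename: no dots remain once ".N" is stripped
    basename_no_ext = os.path.splitext(os.path.basename(path))[0]
    return "." not in basename_no_ext

print(looks_like_manpage("git.1"), looks_like_manpage("access.log.1"))  # True False
```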
@@ -97,15 +140,22 @@ class SourceDetector:
        raise ValueError(
            f"Cannot determine source type for: {source}\n\n"
            "Examples:\n"
            "  Web:    skill-seekers create https://docs.react.dev/\n"
            "  GitHub: skill-seekers create facebook/react\n"
            "  Local:  skill-seekers create ./my-project\n"
            "  PDF:    skill-seekers create tutorial.pdf\n"
            "  DOCX:   skill-seekers create document.docx\n"
            "  EPUB:   skill-seekers create ebook.epub\n"
            "  Video:  skill-seekers create https://youtube.com/watch?v=...\n"
            "  Video:  skill-seekers create recording.mp4\n"
            "  Config: skill-seekers create configs/react.json"
            "  Web:        skill-seekers create https://docs.react.dev/\n"
            "  GitHub:     skill-seekers create facebook/react\n"
            "  Local:      skill-seekers create ./my-project\n"
            "  PDF:        skill-seekers create tutorial.pdf\n"
            "  DOCX:       skill-seekers create document.docx\n"
            "  EPUB:       skill-seekers create ebook.epub\n"
            "  Jupyter:    skill-seekers create notebook.ipynb\n"
            "  HTML:       skill-seekers create page.html\n"
            "  OpenAPI:    skill-seekers create openapi.yaml\n"
            "  AsciiDoc:   skill-seekers create document.adoc\n"
            "  PowerPoint: skill-seekers create presentation.pptx\n"
            "  RSS:        skill-seekers create feed.rss\n"
            "  Man page:   skill-seekers create command.1\n"
            "  Video:      skill-seekers create https://youtube.com/watch?v=...\n"
            "  Video:      skill-seekers create recording.mp4\n"
            "  Config:     skill-seekers create configs/react.json"
        )

    @classmethod
@@ -140,6 +190,90 @@ class SourceDetector:
            type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_jupyter(cls, source: str) -> SourceInfo:
        """Detect Jupyter Notebook file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="jupyter", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_html(cls, source: str) -> SourceInfo:
        """Detect local HTML file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="html", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_pptx(cls, source: str) -> SourceInfo:
        """Detect PowerPoint file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="pptx", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_asciidoc(cls, source: str) -> SourceInfo:
        """Detect AsciiDoc file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="asciidoc", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_manpage(cls, source: str) -> SourceInfo:
        """Detect man page file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="manpage", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_rss(cls, source: str) -> SourceInfo:
        """Detect RSS/Atom feed file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="rss", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _looks_like_openapi(cls, source: str) -> bool:
        """Check if a YAML/JSON file looks like an OpenAPI or Swagger spec.

        Reads the first few lines to look for 'openapi:' or 'swagger:' keys.
|
||||
|
||||
Args:
|
||||
source: Path to the file
|
||||
|
||||
Returns:
|
||||
True if the file appears to be an OpenAPI/Swagger spec
|
||||
"""
|
||||
try:
|
||||
with open(source, encoding="utf-8", errors="replace") as f:
|
||||
# Read first 20 lines — the openapi/swagger key is always near the top
|
||||
for _ in range(20):
|
||||
line = f.readline()
|
||||
if not line:
|
||||
break
|
||||
stripped = line.strip().lower()
|
||||
if stripped.startswith("openapi:") or stripped.startswith("swagger:"):
|
||||
return True
|
||||
if stripped.startswith('"openapi"') or stripped.startswith('"swagger"'):
|
||||
return True
|
||||
except OSError:
|
||||
pass
|
||||
return False
|
||||
|
||||
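The sniffing heuristic above can be exercised on its own; here is a minimal, self-contained sketch of the same logic (the standalone `looks_like_openapi` name and the sample files are illustrative, not part of the package):

```python
# Standalone sketch of SourceDetector._looks_like_openapi: peek at the first
# 20 lines of a file and look for a top-level 'openapi'/'swagger' key.
def looks_like_openapi(path: str) -> bool:
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            for _ in range(20):
                line = f.readline()
                if not line:  # EOF before 20 lines
                    break
                stripped = line.strip().lower()
                # Matches YAML ('openapi: 3.1.0') and JSON ('"openapi": "3.1.0"')
                if stripped.startswith(("openapi:", "swagger:", '"openapi"', '"swagger"')):
                    return True
    except OSError:
        pass  # unreadable files are simply "not a spec"
    return False
```

Because this sniffs content rather than trusting the extension, ordinary `.yaml` or `.json` files that are not API specs fall through to the other detectors.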
    @classmethod
    def _detect_openapi(cls, source: str) -> SourceInfo:
        """Detect OpenAPI/Swagger spec file source."""
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="openapi", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )

    @classmethod
    def _detect_video_file(cls, source: str) -> SourceInfo:
        """Detect local video file source."""

@@ -312,5 +446,19 @@ class SourceDetector:
            if not os.path.isfile(config_path):
                raise ValueError(f"Path is not a file: {config_path}")

        elif source_info.type in ("jupyter", "html", "pptx", "asciidoc", "manpage", "openapi"):
            file_path = source_info.parsed.get("file_path", "")
            if file_path:
                type_label = source_info.type.upper()
                if not os.path.exists(file_path):
                    raise ValueError(f"{type_label} file does not exist: {file_path}")
                if not os.path.isfile(file_path) and not os.path.isdir(file_path):
                    raise ValueError(f"Path is not a file or directory: {file_path}")

        elif source_info.type == "rss":
            file_path = source_info.parsed.get("file_path", "")
            if file_path and not os.path.exists(file_path):
                raise ValueError(f"RSS/Atom file does not exist: {file_path}")

        # For web, github, confluence, notion, chat, rss (URL), validation happens
        # during scraping (URL accessibility, API auth, etc.)
@@ -76,6 +76,17 @@ class UnifiedScraper:
            "word": [],  # List of word sources
            "video": [],  # List of video sources
            "local": [],  # List of local sources (docs or code)
            "epub": [],  # List of epub sources
            "jupyter": [],  # List of Jupyter notebook sources
            "html": [],  # List of local HTML sources
            "openapi": [],  # List of OpenAPI/Swagger spec sources
            "asciidoc": [],  # List of AsciiDoc sources
            "pptx": [],  # List of PowerPoint sources
            "confluence": [],  # List of Confluence wiki sources
            "notion": [],  # List of Notion page sources
            "rss": [],  # List of RSS/Atom feed sources
            "manpage": [],  # List of man page sources
            "chat": [],  # List of Slack/Discord chat sources
        }

        # Track source index for unique naming (multi-source support)

@@ -86,6 +97,17 @@ class UnifiedScraper:
            "word": 0,
            "video": 0,
            "local": 0,
            "epub": 0,
            "jupyter": 0,
            "html": 0,
            "openapi": 0,
            "asciidoc": 0,
            "pptx": 0,
            "confluence": 0,
            "notion": 0,
            "rss": 0,
            "manpage": 0,
            "chat": 0,
        }

        # Output paths - cleaner organization

@@ -166,6 +188,28 @@ class UnifiedScraper:
                    self._scrape_video(source)
                elif source_type == "local":
                    self._scrape_local(source)
                elif source_type == "epub":
                    self._scrape_epub(source)
                elif source_type == "jupyter":
                    self._scrape_jupyter(source)
                elif source_type == "html":
                    self._scrape_html(source)
                elif source_type == "openapi":
                    self._scrape_openapi(source)
                elif source_type == "asciidoc":
                    self._scrape_asciidoc(source)
                elif source_type == "pptx":
                    self._scrape_pptx(source)
                elif source_type == "confluence":
                    self._scrape_confluence(source)
                elif source_type == "notion":
                    self._scrape_notion(source)
                elif source_type == "rss":
                    self._scrape_rss(source)
                elif source_type == "manpage":
                    self._scrape_manpage(source)
                elif source_type == "chat":
                    self._scrape_chat(source)
                else:
                    logger.warning(f"Unknown source type: {source_type}")
            except Exception as e:
@@ -571,6 +615,7 @@ class UnifiedScraper:
                {
                    "docx_path": docx_path,
                    "docx_id": docx_id,
                    "word_id": docx_id,  # Alias for generic reference generation
                    "idx": idx,
                    "data": word_data,
                    "data_file": cache_word_data,
@@ -788,6 +833,595 @@ class UnifiedScraper:
            logger.debug(f"Traceback: {traceback.format_exc()}")
            raise

    # ------------------------------------------------------------------
    # New source type handlers (v3.2.0+)
    # ------------------------------------------------------------------

    def _scrape_epub(self, source: dict[str, Any]):
        """Scrape EPUB e-book (.epub)."""
        try:
            from skill_seekers.cli.epub_scraper import EpubToSkillConverter
        except ImportError:
            logger.error(
                "EPUB scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[epub]"
            )
            return

        idx = self._source_counters["epub"]
        self._source_counters["epub"] += 1

        epub_path = source["path"]
        epub_id = os.path.splitext(os.path.basename(epub_path))[0]

        epub_config = {
            "name": f"{self.name}_epub_{idx}_{epub_id}",
            "epub_path": source["path"],
            "description": source.get("description", f"{epub_id} e-book"),
        }

        logger.info(f"Scraping EPUB: {source['path']}")
        converter = EpubToSkillConverter(epub_config)
        converter.extract_epub()

        epub_data_file = converter.data_file
        with open(epub_data_file, encoding="utf-8") as f:
            epub_data = json.load(f)

        cache_epub_data = os.path.join(self.data_dir, f"epub_data_{idx}_{epub_id}.json")
        shutil.copy(epub_data_file, cache_epub_data)

        self.scraped_data["epub"].append(
            {
                "epub_path": epub_path,
                "epub_id": epub_id,
                "idx": idx,
                "data": epub_data,
                "data_file": cache_epub_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ EPUB: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone EPUB SKILL.md: {e}")

        logger.info(f"✅ EPUB: {len(epub_data.get('chapters', []))} chapters extracted")
    def _scrape_jupyter(self, source: dict[str, Any]):
        """Scrape Jupyter Notebook (.ipynb)."""
        try:
            from skill_seekers.cli.jupyter_scraper import JupyterToSkillConverter
        except ImportError:
            logger.error(
                "Jupyter scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[jupyter]"
            )
            return

        idx = self._source_counters["jupyter"]
        self._source_counters["jupyter"] += 1

        nb_path = source["path"]
        nb_id = os.path.splitext(os.path.basename(nb_path))[0]

        nb_config = {
            "name": f"{self.name}_jupyter_{idx}_{nb_id}",
            "notebook_path": source["path"],
            "description": source.get("description", f"{nb_id} notebook"),
        }

        logger.info(f"Scraping Jupyter Notebook: {source['path']}")
        converter = JupyterToSkillConverter(nb_config)
        converter.extract_notebook()

        nb_data_file = converter.data_file
        with open(nb_data_file, encoding="utf-8") as f:
            nb_data = json.load(f)

        cache_nb_data = os.path.join(self.data_dir, f"jupyter_data_{idx}_{nb_id}.json")
        shutil.copy(nb_data_file, cache_nb_data)

        self.scraped_data["jupyter"].append(
            {
                "notebook_path": nb_path,
                "notebook_id": nb_id,
                "idx": idx,
                "data": nb_data,
                "data_file": cache_nb_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ Jupyter: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone Jupyter SKILL.md: {e}")

        logger.info(f"✅ Jupyter: {len(nb_data.get('cells', []))} cells extracted")
    def _scrape_html(self, source: dict[str, Any]):
        """Scrape local HTML file(s)."""
        try:
            from skill_seekers.cli.html_scraper import HtmlToSkillConverter
        except ImportError:
            logger.error("html_scraper.py not found")
            return

        idx = self._source_counters["html"]
        self._source_counters["html"] += 1

        html_path = source["path"]
        html_id = os.path.splitext(os.path.basename(html_path.rstrip("/")))[0]

        html_config = {
            "name": f"{self.name}_html_{idx}_{html_id}",
            "html_path": source["path"],
            "description": source.get("description", f"{html_id} HTML content"),
        }

        logger.info(f"Scraping local HTML: {source['path']}")
        converter = HtmlToSkillConverter(html_config)
        converter.extract_html()

        html_data_file = converter.data_file
        with open(html_data_file, encoding="utf-8") as f:
            html_data = json.load(f)

        cache_html_data = os.path.join(self.data_dir, f"html_data_{idx}_{html_id}.json")
        shutil.copy(html_data_file, cache_html_data)

        self.scraped_data["html"].append(
            {
                "html_path": html_path,
                "html_id": html_id,
                "idx": idx,
                "data": html_data,
                "data_file": cache_html_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ HTML: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone HTML SKILL.md: {e}")

        logger.info(f"✅ HTML: {len(html_data.get('pages', []))} pages extracted")
    def _scrape_openapi(self, source: dict[str, Any]):
        """Scrape OpenAPI/Swagger specification."""
        try:
            from skill_seekers.cli.openapi_scraper import OpenAPIToSkillConverter
        except ImportError:
            logger.error("openapi_scraper.py not found")
            return

        idx = self._source_counters["openapi"]
        self._source_counters["openapi"] += 1

        spec_path = source.get("path", source.get("url", ""))
        spec_id = os.path.splitext(os.path.basename(spec_path))[0] if spec_path else f"spec_{idx}"

        openapi_config = {
            "name": f"{self.name}_openapi_{idx}_{spec_id}",
            "spec_path": source.get("path"),
            "spec_url": source.get("url"),
            "description": source.get("description", f"{spec_id} API spec"),
        }

        logger.info(f"Scraping OpenAPI spec: {spec_path}")
        converter = OpenAPIToSkillConverter(openapi_config)
        converter.extract_spec()

        api_data_file = converter.data_file
        with open(api_data_file, encoding="utf-8") as f:
            api_data = json.load(f)

        cache_api_data = os.path.join(self.data_dir, f"openapi_data_{idx}_{spec_id}.json")
        shutil.copy(api_data_file, cache_api_data)

        self.scraped_data["openapi"].append(
            {
                "spec_path": spec_path,
                "spec_id": spec_id,
                "idx": idx,
                "data": api_data,
                "data_file": cache_api_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ OpenAPI: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone OpenAPI SKILL.md: {e}")

        logger.info(f"✅ OpenAPI: {len(api_data.get('endpoints', []))} endpoints extracted")
    def _scrape_asciidoc(self, source: dict[str, Any]):
        """Scrape AsciiDoc document(s)."""
        try:
            from skill_seekers.cli.asciidoc_scraper import AsciiDocToSkillConverter
        except ImportError:
            logger.error(
                "AsciiDoc scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[asciidoc]"
            )
            return

        idx = self._source_counters["asciidoc"]
        self._source_counters["asciidoc"] += 1

        adoc_path = source["path"]
        adoc_id = os.path.splitext(os.path.basename(adoc_path.rstrip("/")))[0]

        adoc_config = {
            "name": f"{self.name}_asciidoc_{idx}_{adoc_id}",
            "asciidoc_path": source["path"],
            "description": source.get("description", f"{adoc_id} AsciiDoc content"),
        }

        logger.info(f"Scraping AsciiDoc: {source['path']}")
        converter = AsciiDocToSkillConverter(adoc_config)
        converter.extract_asciidoc()

        adoc_data_file = converter.data_file
        with open(adoc_data_file, encoding="utf-8") as f:
            adoc_data = json.load(f)

        cache_adoc_data = os.path.join(self.data_dir, f"asciidoc_data_{idx}_{adoc_id}.json")
        shutil.copy(adoc_data_file, cache_adoc_data)

        self.scraped_data["asciidoc"].append(
            {
                "asciidoc_path": adoc_path,
                "asciidoc_id": adoc_id,
                "idx": idx,
                "data": adoc_data,
                "data_file": cache_adoc_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ AsciiDoc: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone AsciiDoc SKILL.md: {e}")

        logger.info(f"✅ AsciiDoc: {len(adoc_data.get('sections', []))} sections extracted")
    def _scrape_pptx(self, source: dict[str, Any]):
        """Scrape PowerPoint presentation (.pptx)."""
        try:
            from skill_seekers.cli.pptx_scraper import PptxToSkillConverter
        except ImportError:
            logger.error(
                "PowerPoint scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[pptx]"
            )
            return

        idx = self._source_counters["pptx"]
        self._source_counters["pptx"] += 1

        pptx_path = source["path"]
        pptx_id = os.path.splitext(os.path.basename(pptx_path))[0]

        pptx_config = {
            "name": f"{self.name}_pptx_{idx}_{pptx_id}",
            "pptx_path": source["path"],
            "description": source.get("description", f"{pptx_id} presentation"),
        }

        logger.info(f"Scraping PowerPoint: {source['path']}")
        converter = PptxToSkillConverter(pptx_config)
        converter.extract_pptx()

        pptx_data_file = converter.data_file
        with open(pptx_data_file, encoding="utf-8") as f:
            pptx_data = json.load(f)

        cache_pptx_data = os.path.join(self.data_dir, f"pptx_data_{idx}_{pptx_id}.json")
        shutil.copy(pptx_data_file, cache_pptx_data)

        self.scraped_data["pptx"].append(
            {
                "pptx_path": pptx_path,
                "pptx_id": pptx_id,
                "idx": idx,
                "data": pptx_data,
                "data_file": cache_pptx_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ PowerPoint: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone PowerPoint SKILL.md: {e}")

        logger.info(f"✅ PowerPoint: {len(pptx_data.get('slides', []))} slides extracted")
    def _scrape_confluence(self, source: dict[str, Any]):
        """Scrape Confluence wiki (API or exported HTML/XML)."""
        try:
            from skill_seekers.cli.confluence_scraper import ConfluenceToSkillConverter
        except ImportError:
            logger.error(
                "Confluence scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[confluence]"
            )
            return

        idx = self._source_counters["confluence"]
        self._source_counters["confluence"] += 1

        source_id = source.get("space_key", source.get("path", f"confluence_{idx}"))
        if isinstance(source_id, str) and "/" in source_id:
            source_id = os.path.basename(source_id.rstrip("/"))

        conf_config = {
            "name": f"{self.name}_confluence_{idx}_{source_id}",
            "base_url": source.get("base_url", source.get("url")),
            "space_key": source.get("space_key"),
            "export_path": source.get("path"),
            "username": source.get("username"),
            "token": source.get("token"),
            "description": source.get("description", f"{source_id} Confluence content"),
            "max_pages": source.get("max_pages", 500),
        }

        logger.info(f"Scraping Confluence: {source_id}")
        converter = ConfluenceToSkillConverter(conf_config)
        converter.extract_confluence()

        conf_data_file = converter.data_file
        with open(conf_data_file, encoding="utf-8") as f:
            conf_data = json.load(f)

        cache_conf_data = os.path.join(self.data_dir, f"confluence_data_{idx}_{source_id}.json")
        shutil.copy(conf_data_file, cache_conf_data)

        self.scraped_data["confluence"].append(
            {
                "source_id": source_id,
                "idx": idx,
                "data": conf_data,
                "data_file": cache_conf_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ Confluence: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone Confluence SKILL.md: {e}")

        logger.info(f"✅ Confluence: {len(conf_data.get('pages', []))} pages extracted")
    def _scrape_notion(self, source: dict[str, Any]):
        """Scrape Notion pages (API or exported Markdown)."""
        try:
            from skill_seekers.cli.notion_scraper import NotionToSkillConverter
        except ImportError:
            logger.error(
                "Notion scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[notion]"
            )
            return

        idx = self._source_counters["notion"]
        self._source_counters["notion"] += 1

        source_id = source.get(
            "database_id", source.get("page_id", source.get("path", f"notion_{idx}"))
        )
        if isinstance(source_id, str) and "/" in source_id:
            source_id = os.path.basename(source_id.rstrip("/"))

        notion_config = {
            "name": f"{self.name}_notion_{idx}_{source_id}",
            "database_id": source.get("database_id"),
            "page_id": source.get("page_id"),
            "export_path": source.get("path"),
            "token": source.get("token"),
            "description": source.get("description", f"{source_id} Notion content"),
            "max_pages": source.get("max_pages", 500),
        }

        logger.info(f"Scraping Notion: {source_id}")
        converter = NotionToSkillConverter(notion_config)
        converter.extract_notion()

        notion_data_file = converter.data_file
        with open(notion_data_file, encoding="utf-8") as f:
            notion_data = json.load(f)

        cache_notion_data = os.path.join(self.data_dir, f"notion_data_{idx}_{source_id}.json")
        shutil.copy(notion_data_file, cache_notion_data)

        self.scraped_data["notion"].append(
            {
                "source_id": source_id,
                "idx": idx,
                "data": notion_data,
                "data_file": cache_notion_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ Notion: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone Notion SKILL.md: {e}")

        logger.info(f"✅ Notion: {len(notion_data.get('pages', []))} pages extracted")
    def _scrape_rss(self, source: dict[str, Any]):
        """Scrape RSS/Atom feed (with optional full article scraping)."""
        try:
            from skill_seekers.cli.rss_scraper import RssToSkillConverter
        except ImportError:
            logger.error(
                "RSS scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[rss]"
            )
            return

        idx = self._source_counters["rss"]
        self._source_counters["rss"] += 1

        feed_url = source.get("url", source.get("path", ""))
        feed_id = feed_url.split("/")[-1].split(".")[0] if feed_url else f"feed_{idx}"

        rss_config = {
            "name": f"{self.name}_rss_{idx}_{feed_id}",
            "feed_url": source.get("url"),
            "feed_path": source.get("path"),
            "follow_links": source.get("follow_links", True),
            "max_articles": source.get("max_articles", 50),
            "description": source.get("description", f"{feed_id} RSS/Atom feed"),
        }

        logger.info(f"Scraping RSS/Atom feed: {feed_url}")
        converter = RssToSkillConverter(rss_config)
        converter.extract_feed()

        rss_data_file = converter.data_file
        with open(rss_data_file, encoding="utf-8") as f:
            rss_data = json.load(f)

        cache_rss_data = os.path.join(self.data_dir, f"rss_data_{idx}_{feed_id}.json")
        shutil.copy(rss_data_file, cache_rss_data)

        self.scraped_data["rss"].append(
            {
                "feed_url": feed_url,
                "feed_id": feed_id,
                "idx": idx,
                "data": rss_data,
                "data_file": cache_rss_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ RSS: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone RSS SKILL.md: {e}")

        logger.info(f"✅ RSS: {len(rss_data.get('articles', []))} articles extracted")
    def _scrape_manpage(self, source: dict[str, Any]):
        """Scrape man page(s)."""
        try:
            from skill_seekers.cli.man_scraper import ManPageToSkillConverter
        except ImportError:
            logger.error("man_scraper.py not found")
            return

        idx = self._source_counters["manpage"]
        self._source_counters["manpage"] += 1

        man_names = source.get("names", [])
        man_path = source.get("path", "")
        man_id = man_names[0] if man_names else os.path.basename(man_path.rstrip("/"))

        man_config = {
            "name": f"{self.name}_manpage_{idx}_{man_id}",
            "man_names": man_names,
            "man_path": man_path,
            "sections": source.get("sections", []),
            "description": source.get("description", f"{man_id} man pages"),
        }

        logger.info(f"Scraping man pages: {man_id}")
        converter = ManPageToSkillConverter(man_config)
        converter.extract_manpages()

        man_data_file = converter.data_file
        with open(man_data_file, encoding="utf-8") as f:
            man_data = json.load(f)

        cache_man_data = os.path.join(self.data_dir, f"manpage_data_{idx}_{man_id}.json")
        shutil.copy(man_data_file, cache_man_data)

        self.scraped_data["manpage"].append(
            {
                "man_id": man_id,
                "idx": idx,
                "data": man_data,
                "data_file": cache_man_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ Man pages: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone man page SKILL.md: {e}")

        logger.info(f"✅ Man pages: {len(man_data.get('pages', []))} man pages extracted")
    def _scrape_chat(self, source: dict[str, Any]):
        """Scrape Slack/Discord chat export or API."""
        try:
            from skill_seekers.cli.chat_scraper import ChatToSkillConverter
        except ImportError:
            logger.error(
                "Chat scraper dependencies not installed.\n"
                " Install with: pip install skill-seekers[chat]"
            )
            return

        idx = self._source_counters["chat"]
        self._source_counters["chat"] += 1

        export_path = source.get("path", "")
        channel = source.get("channel", source.get("channel_id", ""))
        chat_id = channel or os.path.basename(export_path.rstrip("/")) or f"chat_{idx}"

        chat_config = {
            "name": f"{self.name}_chat_{idx}_{chat_id}",
            "export_path": source.get("path"),
            "platform": source.get("platform", "slack"),
            "token": source.get("token"),
            "channel": channel,
            "max_messages": source.get("max_messages", 10000),
            "description": source.get("description", f"{chat_id} chat export"),
        }

        logger.info(f"Scraping chat: {chat_id}")
        converter = ChatToSkillConverter(chat_config)
        converter.extract_chat()

        chat_data_file = converter.data_file
        with open(chat_data_file, encoding="utf-8") as f:
            chat_data = json.load(f)

        cache_chat_data = os.path.join(self.data_dir, f"chat_data_{idx}_{chat_id}.json")
        shutil.copy(chat_data_file, cache_chat_data)

        self.scraped_data["chat"].append(
            {
                "chat_id": chat_id,
                "platform": source.get("platform", "slack"),
                "idx": idx,
                "data": chat_data,
                "data_file": cache_chat_data,
            }
        )

        try:
            converter.build_skill()
            logger.info("✅ Chat: Standalone SKILL.md created")
        except Exception as e:
            logger.warning(f"⚠️ Failed to build standalone chat SKILL.md: {e}")

        logger.info(f"✅ Chat: {len(chat_data.get('messages', []))} messages extracted")
    def _load_json(self, file_path: Path) -> dict:
        """
        Load JSON file safely.
@@ -1297,14 +1931,33 @@ Examples:
    if args.dry_run:
        logger.info("🔍 DRY RUN MODE - Preview only, no scraping will occur")
        logger.info(f"\nWould scrape {len(scraper.config.get('sources', []))} sources:")
        # Source type display config: type -> (label, key for detail)
        _SOURCE_DISPLAY = {
            "documentation": ("Documentation", "base_url"),
            "github": ("GitHub", "repo"),
            "pdf": ("PDF", "path"),
            "word": ("Word", "path"),
            "epub": ("EPUB", "path"),
            "video": ("Video", "url"),
            "local": ("Local Codebase", "path"),
            "jupyter": ("Jupyter Notebook", "path"),
            "html": ("HTML", "path"),
            "openapi": ("OpenAPI Spec", "path"),
            "asciidoc": ("AsciiDoc", "path"),
            "pptx": ("PowerPoint", "path"),
            "confluence": ("Confluence", "base_url"),
            "notion": ("Notion", "page_id"),
            "rss": ("RSS/Atom Feed", "url"),
            "manpage": ("Man Page", "names"),
            "chat": ("Chat Export", "path"),
        }
        for idx, source in enumerate(scraper.config.get("sources", []), 1):
            source_type = source.get("type", "unknown")
            label, key = _SOURCE_DISPLAY.get(source_type, (source_type.title(), "path"))
            detail = source.get(key, "N/A")
            if isinstance(detail, list):
                detail = ", ".join(str(d) for d in detail)
            logger.info(f"  {idx}. {label}: {detail}")
        logger.info(f"\nOutput directory: {scraper.output_dir}")
        logger.info(f"Merge mode: {scraper.merge_mode}")
        return
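The table-driven preview above replaces a growing if/elif chain with a lookup. A minimal standalone sketch of the same formatting logic (the `preview_line` helper and sample source dicts are illustrative, not part of the CLI):

```python
# Each source type maps to a (label, detail-key) pair; unknown types fall
# back to a title-cased label and the "path" key, so new types degrade nicely.
_SOURCE_DISPLAY = {
    "documentation": ("Documentation", "base_url"),
    "github": ("GitHub", "repo"),
    "manpage": ("Man Page", "names"),
}

def preview_line(idx: int, source: dict) -> str:
    source_type = source.get("type", "unknown")
    label, key = _SOURCE_DISPLAY.get(source_type, (source_type.title(), "path"))
    detail = source.get(key, "N/A")
    if isinstance(detail, list):  # e.g. man page names
        detail = ", ".join(str(d) for d in detail)
    return f"  {idx}. {label}: {detail}"
```

Adding a new source type then touches the table only, not the display loop.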
@@ -136,6 +136,44 @@ class UnifiedSkillBuilder:
            skill_mds["pdf"] = "\n\n---\n\n".join(pdf_sources)
            logger.debug(f"Combined {len(pdf_sources)} PDF SKILL.md files")

        # Load additional source types using generic glob pattern
        # Each source type uses: {name}_{type}_{idx}_*/ or {name}_{type}_*/
        _extra_types = [
            "word",
            "epub",
            "video",
            "jupyter",
            "html",
            "openapi",
            "asciidoc",
            "pptx",
            "confluence",
            "notion",
            "rss",
            "manpage",
            "chat",
        ]
        for source_type in _extra_types:
            type_sources = []
            for type_dir in sources_dir.glob(f"{self.name}_{source_type}_*"):
                type_skill_path = type_dir / "SKILL.md"
                if type_skill_path.exists():
                    try:
                        content = type_skill_path.read_text(encoding="utf-8")
                        type_sources.append(content)
                        logger.debug(
                            f"Loaded {source_type} SKILL.md from {type_dir.name} "
                            f"({len(content)} chars)"
                        )
                    except OSError as e:
                        logger.warning(
                            f"Failed to read {source_type} SKILL.md from {type_dir.name}: {e}"
                        )

            if type_sources:
                skill_mds[source_type] = "\n\n---\n\n".join(type_sources)
                logger.debug(f"Combined {len(type_sources)} {source_type} SKILL.md files")

        logger.info(f"Loaded {len(skill_mds)} source SKILL.md files")
        return skill_mds

@@ -477,6 +515,18 @@ This skill synthesizes knowledge from multiple sources:
            logger.info("Using PDF SKILL.md as-is")
            content = skill_mds["pdf"]

        # Generic merge for additional source types not covered by pairwise methods
        if not content and skill_mds:
            # At least one source SKILL.md exists but not docs/github/pdf
            logger.info(f"Generic merge for source types: {list(skill_mds.keys())}")
            content = self._generic_merge(skill_mds)
        elif content and len(skill_mds) > (int(has_docs) + int(has_github) + int(has_pdf)):
            # Pairwise synthesis handled the core types; append additional sources
            extra_types = set(skill_mds.keys()) - {"documentation", "github", "pdf"}
            if extra_types:
                logger.info(f"Appending additional sources: {extra_types}")
                content = self._append_extra_sources(content, skill_mds, extra_types)

        # Fallback: generate minimal SKILL.md (legacy behavior)
        if not content:
            logger.warning("No source SKILL.md files found, generating minimal SKILL.md (legacy)")
@@ -574,6 +624,165 @@ This skill synthesizes knowledge from multiple sources:
        return "\n".join(lines)

    # ------------------------------------------------------------------
    # Generic merge system for any combination of source types (v3.2.0+)
    # ------------------------------------------------------------------

    # Human-readable labels for source types
    _SOURCE_LABELS: dict[str, str] = {
        "documentation": "Documentation",
        "github": "GitHub Repository",
        "pdf": "PDF Document",
        "word": "Word Document",
        "epub": "EPUB E-book",
        "video": "Video",
        "local": "Local Codebase",
        "jupyter": "Jupyter Notebook",
        "html": "HTML Document",
        "openapi": "OpenAPI/Swagger Spec",
        "asciidoc": "AsciiDoc Document",
        "pptx": "PowerPoint Presentation",
        "confluence": "Confluence Wiki",
        "notion": "Notion Page",
        "rss": "RSS/Atom Feed",
        "manpage": "Man Page",
        "chat": "Chat Export",
    }

    def _generic_merge(self, skill_mds: dict[str, str]) -> str:
        """Generic merge for any combination of source types.

        Uses a priority-based section ordering approach:
        1. Parse all source SKILL.md files into sections
        2. Collect unique sections across all sources
        3. Merge matching sections with source attribution
        4. Produce a unified SKILL.md

        This preserves the existing pairwise synthesis for docs+github, docs+pdf, etc.
        and handles any other combination generically.

        Args:
            skill_mds: Dict mapping source type to SKILL.md content

        Returns:
            Merged SKILL.md content string
        """
        skill_name = self.name.lower().replace("_", "-").replace(" ", "-")[:64]
        desc = self.description[:1024] if len(self.description) > 1024 else self.description

        # Parse all source SKILL.md files into sections
        all_sections: dict[str, dict[str, str]] = {}
        for source_type, content in skill_mds.items():
            all_sections[source_type] = self._parse_skill_md_sections(content)

        # Determine all unique section names in priority order
        # Sections that appear earlier in sources have higher priority
        seen_sections: list[str] = []
        for _source_type, sections in all_sections.items():
            for section_name in sections:
                if section_name not in seen_sections:
                    seen_sections.append(section_name)

        # Build merged content
        source_labels = ", ".join(self._SOURCE_LABELS.get(t, t.title()) for t in skill_mds)
        lines = [
            "---",
            f"name: {skill_name}",
            f"description: {desc}",
            "---",
            "",
            f"# {self.name.replace('_', ' ').title()}",
            "",
            f"{self.description}",
            "",
            f"*Merged from: {source_labels}*",
            "",
        ]

        # Emit each section, merging content from all sources that have it
        for section_name in seen_sections:
            contributing_sources = [
                (stype, sections[section_name])
                for stype, sections in all_sections.items()
                if section_name in sections
            ]

            if len(contributing_sources) == 1:
                # Single source for this section — emit as-is
                stype, content = contributing_sources[0]
                label = self._SOURCE_LABELS.get(stype, stype.title())
                lines.append(f"## {section_name}")
                lines.append("")
                lines.append(f"*From {label}*")
                lines.append("")
                lines.append(content)
                lines.append("")
            else:
                # Multiple sources — merge with attribution
                lines.append(f"## {section_name}")
                lines.append("")
                for stype, content in contributing_sources:
                    label = self._SOURCE_LABELS.get(stype, stype.title())
                    lines.append(f"### From {label}")
                    lines.append("")
                    lines.append(content)
                    lines.append("")

        lines.append("---")
        lines.append("")
        lines.append("*Generated by Skill Seeker's unified multi-source scraper*")

        return "\n".join(lines)

    def _append_extra_sources(
        self,
        base_content: str,
        skill_mds: dict[str, str],
        extra_types: set[str],
    ) -> str:
        """Append additional source content to existing pairwise-synthesized SKILL.md.

        Used when the core docs+github+pdf synthesis has run, but there are
        additional source types (epub, jupyter, etc.) that need to be included.

        Args:
            base_content: Already-synthesized SKILL.md content
            skill_mds: All source SKILL.md files
            extra_types: Set of extra source type keys to append

        Returns:
            Extended SKILL.md content
        """
        lines = base_content.split("\n")

        # Find the final separator (---) or end of file
        insertion_index = len(lines)
        for i in range(len(lines) - 1, -1, -1):
            if lines[i].strip() == "---":
                insertion_index = i
                break

        # Build extra content
        extra_lines = [""]
        for source_type in sorted(extra_types):
            if source_type not in skill_mds:
                continue
            label = self._SOURCE_LABELS.get(source_type, source_type.title())
            sections = self._parse_skill_md_sections(skill_mds[source_type])

            extra_lines.append(f"## {label} Content")
            extra_lines.append("")

            for section_name, content in sections.items():
                extra_lines.append(f"### {section_name}")
                extra_lines.append("")
                extra_lines.append(content)
                extra_lines.append("")

        lines[insertion_index:insertion_index] = extra_lines

        return "\n".join(lines)

    def _generate_minimal_skill_md(self) -> str:
        """Generate minimal SKILL.md (legacy fallback behavior).
@@ -597,18 +806,42 @@ This skill combines knowledge from multiple sources:
        """

        # Source type display keys: type -> (label, primary_key, extra_keys)
        _source_detail_map = {
            "documentation": ("Documentation", "base_url", [("Pages", "max_pages", "unlimited")]),
            "github": (
                "GitHub Repository",
                "repo",
                [("Code Analysis", "code_analysis_depth", "surface"), ("Issues", "max_issues", 0)],
            ),
            "pdf": ("PDF Document", "path", []),
            "word": ("Word Document", "path", []),
            "epub": ("EPUB E-book", "path", []),
            "video": ("Video", "url", []),
            "local": ("Local Codebase", "path", [("Analysis Depth", "analysis_depth", "surface")]),
            "jupyter": ("Jupyter Notebook", "path", []),
            "html": ("HTML Document", "path", []),
            "openapi": ("OpenAPI Spec", "path", []),
            "asciidoc": ("AsciiDoc Document", "path", []),
            "pptx": ("PowerPoint", "path", []),
            "confluence": ("Confluence Wiki", "base_url", []),
            "notion": ("Notion Page", "page_id", []),
            "rss": ("RSS/Atom Feed", "url", []),
            "manpage": ("Man Page", "names", []),
            "chat": ("Chat Export", "path", []),
        }

        # List sources (the former per-type if/elif chain for documentation,
        # github, and pdf is replaced by this generic table lookup)
        for source in self.config.get("sources", []):
            source_type = source["type"]
            display = _source_detail_map.get(source_type, (source_type.title(), "path", []))
            label, primary_key, extras = display
            primary_val = source.get(primary_key, "N/A")
            if isinstance(primary_val, list):
                primary_val = ", ".join(str(v) for v in primary_val)
            content += f"- ✅ **{label}**: {primary_val}\n"
            for extra_label, extra_key, extra_default in extras:
                content += f"  - {extra_label}: {source.get(extra_key, extra_default)}\n"

        # C3.x Architecture & Code Analysis section (if available)
        github_data = self.scraped_data.get("github", {})
@@ -796,6 +1029,27 @@ This skill combines knowledge from multiple sources:
        if pdf_list:
            self._generate_pdf_references(pdf_list)

        # Generate references for all additional source types
        _extra_source_types = [
            "word",
            "epub",
            "video",
            "jupyter",
            "html",
            "openapi",
            "asciidoc",
            "pptx",
            "confluence",
            "notion",
            "rss",
            "manpage",
            "chat",
        ]
        for source_type in _extra_source_types:
            source_list = self.scraped_data.get(source_type, [])
            if source_list:
                self._generate_generic_references(source_type, source_list)

        # Generate merged API reference if available
        if self.merged_data:
            self._generate_merged_api_reference()
@@ -977,6 +1231,63 @@ This skill combines knowledge from multiple sources:
        logger.info(f"Created PDF references ({len(pdf_list)} sources)")

    def _generate_generic_references(self, source_type: str, source_list: list[dict]):
        """Generate references for any source type using a generic approach.

        Creates a references/<source_type>/ directory with an index and
        copies any data files from the source list.

        Args:
            source_type: The source type key (e.g., 'epub', 'jupyter')
            source_list: List of scraped source dicts for this type
        """
        if not source_list:
            return

        label = self._SOURCE_LABELS.get(source_type, source_type.title())
        type_dir = os.path.join(self.skill_dir, "references", source_type)
        os.makedirs(type_dir, exist_ok=True)

        # Create index
        index_path = os.path.join(type_dir, "index.md")
        with open(index_path, "w", encoding="utf-8") as f:
            f.write(f"# {label} References\n\n")
            f.write(f"Reference from {len(source_list)} {label} source(s).\n\n")

            for i, source_data in enumerate(source_list):
                # Try common ID fields
                source_id = (
                    source_data.get("source_id")
                    or source_data.get(f"{source_type}_id")
                    or source_data.get("notebook_id")
                    or source_data.get("spec_id")
                    or source_data.get("feed_id")
                    or source_data.get("man_id")
                    or source_data.get("chat_id")
                    or f"source_{i}"
                )
                f.write(f"## {source_id}\n\n")

                # Write summary of extracted data
                data = source_data.get("data", {})
                if isinstance(data, dict):
                    for key in ["title", "description", "metadata"]:
                        if key in data:
                            val = data[key]
                            if isinstance(val, str) and val:
                                f.write(f"**{key.title()}:** {val}\n\n")

                # Copy data file if available
                data_file = source_data.get("data_file")
                if data_file and os.path.isfile(data_file):
                    dest = os.path.join(type_dir, f"{source_id}_data.json")
                    import contextlib

                    with contextlib.suppress(OSError):
                        shutil.copy(data_file, dest)

        logger.info(f"Created {label} references ({len(source_list)} sources)")

    def _generate_merged_api_reference(self):
        """Generate merged API reference file."""
        api_dir = os.path.join(self.skill_dir, "references", "api")

@@ -3,16 +3,16 @@
Skill Seeker MCP Server (FastMCP Implementation)

Modern, decorator-based MCP server using FastMCP for simplified tool registration.
Provides 34 tools for generating Claude AI skills from documentation.

This is a streamlined alternative to server.py (2200 lines → 708 lines, 68% reduction).
All tool implementations are delegated to modular tool files in tools/ directory.

**Architecture:**
- FastMCP server with decorator-based tool registration
- 34 tools organized into 7 categories:
  * Config tools (3): generate_config, list_configs, validate_config
  * Scraping tools (11): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video, scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns, scrape_generic
  * Packaging tools (4): package_skill, upload_skill, enhance_skill, install_skill
  * Splitting tools (2): split_config, generate_router
  * Source tools (5): fetch_config, submit_config, add_config_source, list_config_sources, remove_config_source
@@ -97,6 +97,7 @@ try:
        remove_config_source_impl,
        scrape_codebase_impl,
        scrape_docs_impl,
        scrape_generic_impl,
        scrape_github_impl,
        scrape_pdf_impl,
        scrape_video_impl,
@@ -141,6 +142,7 @@ except ImportError:
        remove_config_source_impl,
        scrape_codebase_impl,
        scrape_docs_impl,
        scrape_generic_impl,
        scrape_github_impl,
        scrape_pdf_impl,
        scrape_video_impl,
@@ -301,7 +303,7 @@ async def sync_config(


# ============================================================================
# SCRAPING TOOLS (11 tools)
# ============================================================================


@@ -823,6 +825,50 @@ async def extract_config_patterns(
    return str(result)


@safe_tool_decorator(
    description="Scrape content from new source types: jupyter, html, openapi, asciidoc, pptx, confluence, notion, rss, manpage, chat. A generic entry point that delegates to the appropriate CLI scraper module."
)
async def scrape_generic(
    source_type: str,
    name: str,
    path: str | None = None,
    url: str | None = None,
) -> str:
    """
    Scrape content from various source types and build a skill.

    A generic scraper that supports 10 new source types. It delegates to the
    corresponding CLI scraper module (e.g., skill_seekers.cli.jupyter_scraper).

    File-based types (jupyter, html, openapi, asciidoc, pptx, manpage, chat)
    typically use the 'path' parameter. URL-based types (confluence, notion, rss)
    typically use the 'url' parameter.

    Args:
        source_type: Source type to scrape. One of: jupyter, html, openapi,
            asciidoc, pptx, confluence, notion, rss, manpage, chat.
        name: Skill name for the output
        path: File or directory path (for file-based sources like jupyter, html, pptx)
        url: URL (for URL-based sources like confluence, notion, rss)

    Returns:
        Scraping results with file paths and statistics.
    """
    args = {
        "source_type": source_type,
        "name": name,
    }
    if path:
        args["path"] = path
    if url:
        args["url"] = url

    result = await scrape_generic_impl(args)
    if isinstance(result, list) and result:
        return result[0].text if hasattr(result[0], "text") else str(result[0])
    return str(result)


# ============================================================================
# PACKAGING TOOLS (4 tools)
# ============================================================================

@@ -63,6 +63,9 @@ from .scraping_tools import (
from .scraping_tools import (
    scrape_pdf_tool as scrape_pdf_impl,
)
from .scraping_tools import (
    scrape_generic_tool as scrape_generic_impl,
)
from .scraping_tools import (
    scrape_video_tool as scrape_video_impl,
)
@@ -135,6 +138,7 @@ __all__ = [
    "extract_test_examples_impl",
    "build_how_to_guides_impl",
    "extract_config_patterns_impl",
    "scrape_generic_impl",
    # Packaging tools
    "package_skill_impl",
    "upload_skill_impl",
@@ -205,6 +205,18 @@ async def validate_config(args: dict) -> list[TextContent]:
            )
        elif source["type"] == "pdf":
            result += f"  Path: {source.get('path', 'N/A')}\n"
        elif source["type"] in (
            "jupyter",
            "html",
            "openapi",
            "asciidoc",
            "pptx",
            "manpage",
            "chat",
        ):
            result += f"  Path: {source.get('path', 'N/A')}\n"
        elif source["type"] in ("confluence", "notion", "rss"):
            result += f"  URL: {source.get('url', 'N/A')}\n"

    # Show merge settings if applicable
    if validator.needs_api_merge():
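For orientation, a unified config that would exercise the new path-based and URL-based branches above might look like the following. This is a hypothetical example: the skill name and source values are invented, and only the `type`/`path`/`url` fields checked by the validator are shown.

```yaml
name: acme-stack
sources:
  - type: jupyter
    path: notebooks/intro.ipynb      # path-based branch
  - type: manpage
    path: man/grep.1                 # path-based branch
  - type: rss
    url: https://example.com/feed.xml  # URL-based branch
  - type: confluence
    url: https://wiki.example.com      # URL-based branch
```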
@@ -7,6 +7,8 @@ This module contains all scraping-related MCP tool implementations:
- scrape_github_tool: Scrape GitHub repositories
- scrape_pdf_tool: Scrape PDF documentation
- scrape_codebase_tool: Analyze local codebase and extract code knowledge
- scrape_generic_tool: Generic scraper for new source types (jupyter, html,
  openapi, asciidoc, pptx, confluence, notion, rss, manpage, chat)

Extracted from server.py for better modularity and organization.
"""
@@ -1005,3 +1007,155 @@ async def extract_config_patterns_tool(args: dict) -> list[TextContent]:
        return [TextContent(type="text", text=output_text)]
    else:
        return [TextContent(type="text", text=f"{output_text}\n\n❌ Error:\n{stderr}")]


# Valid source types for the generic scraper
GENERIC_SOURCE_TYPES = (
    "jupyter",
    "html",
    "openapi",
    "asciidoc",
    "pptx",
    "confluence",
    "notion",
    "rss",
    "manpage",
    "chat",
)

# Source types whose primary input is a URL; all other types take a file/directory
# path. The concrete flag names are resolved per type below.
_URL_BASED_TYPES = {"confluence", "notion", "rss"}

# Friendly emoji labels per source type
_SOURCE_EMOJIS = {
    "jupyter": "📓",
    "html": "🌐",
    "openapi": "📡",
    "asciidoc": "📄",
    "pptx": "📊",
    "confluence": "🏢",
    "notion": "📝",
    "rss": "📰",
    "manpage": "📖",
    "chat": "💬",
}


async def scrape_generic_tool(args: dict) -> list[TextContent]:
    """
    Generic scraper for new source types.

    Handles all 10 new source types by building the appropriate subprocess
    command and delegating to the corresponding CLI scraper module.

    Supported source types: jupyter, html, openapi, asciidoc, pptx,
    confluence, notion, rss, manpage, chat.

    Args:
        args: Dictionary containing:
            - source_type (str): One of the supported source types
            - path (str, optional): File or directory path (for file-based sources)
            - url (str, optional): URL (for URL-based sources like confluence, notion, rss)
            - name (str): Skill name for the output

    Returns:
        List[TextContent]: Tool execution results
    """
    source_type = args.get("source_type", "")
    path = args.get("path")
    url = args.get("url")
    name = args.get("name")

    # Validate source_type
    if source_type not in GENERIC_SOURCE_TYPES:
        return [
            TextContent(
                type="text",
                text=(
                    f"❌ Error: Unknown source_type '{source_type}'. "
                    f"Must be one of: {', '.join(GENERIC_SOURCE_TYPES)}"
                ),
            )
        ]

    # Validate that we have either path or url
    if not path and not url:
        return [
            TextContent(
                type="text",
                text="❌ Error: Must specify either 'path' (file/directory) or 'url'",
            )
        ]

    if not name:
        return [
            TextContent(
                type="text",
                text="❌ Error: 'name' parameter is required",
            )
        ]

    # Build the subprocess command
    # Map source type to module name (most are <type>_scraper, but some differ)
    _MODULE_NAMES = {
        "manpage": "man_scraper",
    }
    module_name = _MODULE_NAMES.get(source_type, f"{source_type}_scraper")
    cmd = [sys.executable, "-m", f"skill_seekers.cli.{module_name}"]

    # Map source type to the correct CLI flag for file/path input and URL input.
    # Each scraper has its own flag name — using a generic --path or --url would fail.
    _PATH_FLAGS: dict[str, str] = {
        "jupyter": "--notebook",
        "html": "--html-path",
        "openapi": "--spec",
        "asciidoc": "--asciidoc-path",
        "pptx": "--pptx",
        "manpage": "--man-path",
        "confluence": "--export-path",
        "notion": "--export-path",
        "rss": "--feed-path",
        "chat": "--export-path",
    }
    _URL_FLAGS: dict[str, str] = {
        "confluence": "--base-url",
        "notion": "--page-id",
        "rss": "--feed-url",
        "openapi": "--spec-url",
    }

    # Determine the input flag based on source type
    if source_type in _URL_BASED_TYPES and url:
        url_flag = _URL_FLAGS.get(source_type, "--url")
        cmd.extend([url_flag, url])
    elif path:
        path_flag = _PATH_FLAGS.get(source_type, "--path")
        cmd.extend([path_flag, path])
    elif url:
        # Allow url fallback for file-based types (some may accept URLs too)
        url_flag = _URL_FLAGS.get(source_type, "--url")
        cmd.extend([url_flag, url])

    cmd.extend(["--name", name])

    # Set a reasonable timeout
    timeout = 600  # 10 minutes

    emoji = _SOURCE_EMOJIS.get(source_type, "🔧")
    progress_msg = f"{emoji} Scraping {source_type} source...\n"
    if path:
        progress_msg += f"📁 Path: {path}\n"
    if url:
        progress_msg += f"🔗 URL: {url}\n"
    progress_msg += f"📛 Name: {name}\n"
    progress_msg += f"⏱️ Maximum time: {timeout // 60} minutes\n\n"

    stdout, stderr, returncode = run_subprocess_with_streaming(cmd, timeout=timeout)

    output = progress_msg + stdout

    if returncode == 0:
        return [TextContent(type="text", text=output)]
    else:
        return [TextContent(type="text", text=f"{output}\n\n❌ Error:\n{stderr}")]
@@ -106,7 +106,9 @@ async def split_config(args: dict) -> list[TextContent]:

    Supports both documentation and unified (multi-source) configs:
    - Documentation configs: Split by categories, size, or create router skills
    - Unified configs: Split by source type (documentation, github, pdf,
      jupyter, html, openapi, asciidoc, pptx, confluence, notion, rss,
      manpage, chat)

    For large documentation sites (10K+ pages), this tool splits the config into
    multiple smaller configs. For unified configs with multiple sources, splits

222
src/skill_seekers/workflows/complex-merge.yaml
Normal file
@@ -0,0 +1,222 @@
name: complex-merge
description: Intelligent multi-source merging with conflict resolution, priority rules, and gap analysis
version: "1.0"
author: Skill Seekers
tags:
  - merge
  - multi-source
  - conflict-resolution
  - synthesis
applies_to:
  - doc_scraping
  - codebase_analysis
  - github_analysis
variables:
  merge_strategy: priority
  source_priority_order: "official_docs,code,community"
  conflict_resolution: highest_priority
  min_sources_for_consensus: 2
stages:
  - name: source_inventory
    type: custom
    target: inventory
    uses_history: false
    enabled: true
    prompt: >
      Catalog every source that contributed content to this skill extraction.
      For each source, classify its type and assess its characteristics.

      For each source, determine:
      1. Source type (official_docs, codebase, github_repo, pdf, video, community, blog)
      2. Content scope — what topics or areas does this source cover?
      3. Freshness — how recent is the content? Look for version numbers, dates, deprecation notices
      4. Authority level — is this an official maintainer, core contributor, or third party?
      5. Content density — roughly how much substantive information does this source provide?
      6. Format characteristics — prose, code samples, API reference, tutorial, etc.

      Output JSON with:
      - "sources": array of {id, type, scope_summary, topics_covered, freshness_estimate, authority, density, format}
      - "source_type_distribution": count of sources by type
      - "total_topics_identified": number of unique topics across all sources
      - "coverage_summary": brief overview of what the combined sources cover

  - name: cross_reference
    type: custom
    target: cross_references
    uses_history: true
    enabled: true
    prompt: >
      Using the source inventory, identify overlapping topics across sources.
      Find where multiple sources discuss the same concept, API, feature, or pattern.

      For each overlapping topic:
      1. List which sources cover it and how deeply
      2. Note whether sources agree, complement each other, or diverge
      3. Identify the richest source for that topic (most detail, best examples)
      4. Flag any terminology differences across sources for the same concept

      Output JSON with:
      - "overlapping_topics": array of {topic, sources_covering, agreement_level, richest_source, terminology_variants}
      - "high_overlap_topics": topics covered by 3+ sources
      - "complementary_pairs": pairs of sources that cover different aspects of the same topic well
      - "terminology_map": dictionary mapping variant terms to a canonical term

  - name: conflict_detection
    type: custom
    target: conflicts
    uses_history: true
    enabled: true
    prompt: >
      Examine the cross-referenced topics and identify genuine contradictions
      between sources. Distinguish between true conflicts and superficial differences.

      Categories of conflict to detect:
      1. Factual contradictions — sources state opposite things about the same feature
      2. Version mismatches — sources describe different versions of an API or behavior
      3. Best practice disagreements — sources recommend conflicting approaches
      4. Deprecated vs current — one source shows deprecated usage, another shows current
      5. Scope conflicts — sources disagree on what a feature can or cannot do

      For each conflict:
      - Identify the specific claim from each source
      - Assess which source is more likely correct and why
      - Recommend a resolution strategy

      Output JSON with:
      - "conflicts": array of {topic, type, source_a_claim, source_b_claim, likely_correct, resolution_rationale}
      - "conflict_count_by_type": breakdown of conflicts by category
      - "high_severity_conflicts": conflicts that would mislead users if unresolved
      - "auto_resolvable": conflicts that can be resolved by version/date alone

  - name: priority_merge
    type: custom
    target: merged_content
    uses_history: true
    enabled: true
    prompt: >
      Merge content from all sources using the following priority hierarchy:
      1. Official documentation (highest authority)
      2. Source code and inline comments (ground truth for behavior)
      3. Community content — tutorials, blog posts, Stack Overflow (practical usage)

      Merging rules:
      - When sources agree, combine the best explanation with the best examples
      - When sources conflict, prefer the higher-priority source but note the alternative
      - When only a lower-priority source covers a topic, include it but flag the authority level
      - Preserve code examples from any source, annotating their origin
      - Deduplicate content — do not repeat the same information from multiple sources
      - Normalize terminology using the canonical terms from cross-referencing

      For each merged topic, produce:
      1. Authoritative explanation (from highest-priority source)
      2. Practical examples (best available from any source)
      3. Source attribution (which sources contributed)
      4. Confidence level (high if official docs confirm, medium if code-only, low if community-only)

      Output JSON with:
      - "merged_topics": array of {topic, explanation, examples, sources_used, confidence, notes}
      - "merge_decisions": array of {topic, decision, rationale} for non-trivial merges
      - "source_contribution_stats": how much each source contributed to the final output

  - name: gap_analysis
    type: custom
    target: gaps
    uses_history: true
    enabled: true
    prompt: >
      Analyze the merged content to identify gaps — topics or areas that are
      underrepresented or missing entirely.

      Identify:
      1. Single-source topics — covered by only one source, making them fragile
      2. Missing fundamentals — core concepts that should be documented but are not
      3. Missing examples — topics explained in prose but lacking code samples
      4. Missing edge cases — common error scenarios or limitations not documented
      5. Broken references — topics that reference other topics not present in any source
      6. Audience gaps — content assumes knowledge that is never introduced

      For each gap, assess:
      - Severity (critical, important, nice-to-have)
      - Whether the gap can be inferred from existing content
      - Suggested source type that would best fill this gap

      Output JSON with:
      - "single_source_topics": array of {topic, sole_source, risk_level}
      - "missing_fundamentals": topics that should exist but do not
      - "example_gaps": topics needing code examples
      - "edge_case_gaps": undocumented error scenarios
      - "broken_references": internal references with no target
      - "gap_severity_summary": counts by severity level

  - name: synthesis
    type: custom
    target: skill_md
    uses_history: true
    enabled: true
    prompt: >
      Create a unified, coherent narrative from the merged content. The output
      should read as if written by a single knowledgeable author, not as a
      patchwork of multiple sources.

      Synthesis guidelines:
      1. Structure content logically — concepts build on each other
      2. Lead with the most important information for each topic
      3. Integrate code examples naturally within explanations
      4. Use consistent voice, terminology, and formatting throughout
      5. Add transition text between topics for narrative flow
      6. Include a "Sources and Confidence" appendix noting where information came from
      7. Mark any low-confidence or single-source claims with a caveat
      8. Fill minor gaps by inference where safe to do so, clearly marking inferred content

      Output JSON with:
      - "synthesized_sections": array of {title, content, sources_used, confidence}
      - "section_order": recommended reading order
      - "inferred_content": content that was inferred rather than directly sourced
      - "caveats": any warnings about content reliability

  - name: quality_check
    type: custom
    target: quality
    uses_history: true
    enabled: true
    prompt: >
      Perform a final quality review of the synthesized output. Evaluate the
      merge result against multiple quality dimensions.

      Check for:
      1. Completeness — does the output cover all topics from all sources?
      2. Accuracy — are merged claims consistent and non-contradictory?
      3. Coherence — does the document flow logically as a unified piece?
      4. Attribution — are source contributions properly tracked?
      5. Confidence calibration — are confidence levels appropriate?
      6. Example quality — are code examples correct, runnable, and well-annotated?
      7. Terminology consistency — is the canonical terminology used throughout?
      8. Gap acknowledgment — are known gaps clearly communicated?

      Scoring:
      - Rate each dimension 1-10
      - Provide specific issues found for any dimension scoring below 7
      - Suggest concrete fixes for each issue

      Output JSON with:
      - "quality_scores": {completeness, accuracy, coherence, attribution, confidence_calibration, example_quality, terminology_consistency, gap_acknowledgment}
      - "overall_score": weighted average (accuracy and completeness weighted 2x)
      - "issues_found": array of {dimension, description, severity, suggested_fix}
      - "merge_health": "excellent" | "good" | "needs_review" | "poor" based on overall score
      - "recommendations": top 3 actions to improve merge quality

post_process:
  reorder_sections:
    - overview
    - core_concepts
|
||||
- api_reference
|
||||
- examples
|
||||
- advanced_topics
|
||||
- troubleshooting
|
||||
- sources_and_confidence
|
||||
add_metadata:
|
||||
enhanced: true
|
||||
workflow: complex-merge
|
||||
multi_source: true
|
||||
conflict_resolution: priority
|
||||
quality_checked: true
|
||||
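The `reorder_sections` key in the `post_process` block above can be sketched in plain Python. This is an illustrative sketch only, not the project's actual implementation; the function name `reorder_sections` and the behavior for sections missing from the order list are assumptions here:

```python
# Hypothetical sketch of the reorder_sections post-processing step:
# sections named in the desired order come first, in that order;
# sections not named keep their relative position at the end, so
# nothing is silently dropped.
def reorder_sections(sections: dict[str, str], order: list[str]) -> list[tuple[str, str]]:
    ordered = [(name, sections[name]) for name in order if name in sections]
    leftover = [(name, body) for name, body in sections.items() if name not in order]
    return ordered + leftover


merged = {
    "examples": "...",
    "overview": "...",
    "troubleshooting": "...",
}
result = reorder_sections(
    merged, ["overview", "core_concepts", "api_reference", "examples"]
)
# "overview" comes first, "examples" second, and the unlisted
# "troubleshooting" section is appended at the end.
```

A stable, drop-nothing reordering like this keeps the post-processing step safe even when a merge produces sections the config author did not anticipate.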
@@ -24,12 +24,12 @@ class TestParserRegistry:
 
     def test_all_parsers_registered(self):
         """Test that all parsers are registered."""
-        assert len(PARSERS) == 25, f"Expected 25 parsers, got {len(PARSERS)}"
+        assert len(PARSERS) == 35, f"Expected 35 parsers, got {len(PARSERS)}"
 
     def test_get_parser_names(self):
         """Test getting list of parser names."""
         names = get_parser_names()
-        assert len(names) == 25
+        assert len(names) == 35
         assert "scrape" in names
         assert "github" in names
         assert "package" in names
@@ -243,9 +243,9 @@ class TestBackwardCompatibility:
         assert cmd in names, f"Command '{cmd}' not found in parser registry!"
 
     def test_command_count_matches(self):
-        """Test that we have exactly 25 commands (includes create, workflows, word, epub, video, and sync-config)."""
-        assert len(PARSERS) == 25
-        assert len(get_parser_names()) == 25
+        """Test that we have exactly 35 commands (25 original + 10 new source types)."""
+        assert len(PARSERS) == 35
+        assert len(get_parser_names()) == 35
 
 
 if __name__ == "__main__":
824
tests/test_new_source_types.py
Normal file
@@ -0,0 +1,824 @@
#!/usr/bin/env python3
"""
Tests for v3.2.0 new source type integration points.

Covers source detection, config validation, generic merge, CLI wiring,
and source validation for the 10 new source types: jupyter, html, openapi,
asciidoc, pptx, rss, manpage, confluence, notion, chat.
"""

import os
import textwrap

import pytest

from skill_seekers.cli.config_validator import ConfigValidator
from skill_seekers.cli.main import COMMAND_MODULES
from skill_seekers.cli.parsers import PARSERS, get_parser_names
from skill_seekers.cli.source_detector import SourceDetector, SourceInfo
from skill_seekers.cli.unified_skill_builder import UnifiedSkillBuilder


# ---------------------------------------------------------------------------
# 1. SourceDetector — new type detection
# ---------------------------------------------------------------------------

class TestSourceDetectorNewTypes:
    """Test that SourceDetector.detect() maps new extensions to correct types."""

    # -- Jupyter --
    def test_detect_ipynb(self):
        """Test .ipynb → jupyter detection."""
        info = SourceDetector.detect("analysis.ipynb")
        assert info.type == "jupyter"
        assert info.parsed["file_path"] == "analysis.ipynb"
        assert info.suggested_name == "analysis"

    # -- HTML --
    def test_detect_html_extension(self):
        """Test .html → html detection."""
        info = SourceDetector.detect("page.html")
        assert info.type == "html"
        assert info.parsed["file_path"] == "page.html"

    def test_detect_htm_extension(self):
        """Test .htm → html detection."""
        info = SourceDetector.detect("index.HTM")
        assert info.type == "html"
        assert info.parsed["file_path"] == "index.HTM"

    # -- PowerPoint --
    def test_detect_pptx(self):
        """Test .pptx → pptx detection."""
        info = SourceDetector.detect("slides.pptx")
        assert info.type == "pptx"
        assert info.parsed["file_path"] == "slides.pptx"
        assert info.suggested_name == "slides"

    # -- AsciiDoc --
    def test_detect_adoc(self):
        """Test .adoc → asciidoc detection."""
        info = SourceDetector.detect("manual.adoc")
        assert info.type == "asciidoc"
        assert info.parsed["file_path"] == "manual.adoc"

    def test_detect_asciidoc_extension(self):
        """Test .asciidoc → asciidoc detection."""
        info = SourceDetector.detect("guide.ASCIIDOC")
        assert info.type == "asciidoc"
        assert info.parsed["file_path"] == "guide.ASCIIDOC"

    # -- Man pages --
    def test_detect_man_extension(self):
        """Test .man → manpage detection."""
        info = SourceDetector.detect("curl.man")
        assert info.type == "manpage"
        assert info.parsed["file_path"] == "curl.man"

    @pytest.mark.parametrize("section", range(1, 9))
    def test_detect_man_sections(self, section):
        """Test .1 through .8 → manpage for simple basenames."""
        filename = f"git.{section}"
        info = SourceDetector.detect(filename)
        assert info.type == "manpage", f"{filename} should detect as manpage"
        assert info.suggested_name == "git"

    def test_man_section_with_dotted_basename_not_detected(self):
        """Test that 'access.log.1' is NOT detected as a man page.

        The heuristic checks that the basename (without extension) has no dots.
        """
        # This should fall through to web/domain detection (has a dot, not a path)
        info = SourceDetector.detect("access.log.1")
        # access.log.1 has a dot in the basename-without-ext ("access.log"),
        # so it should NOT be detected as manpage. It falls through to the
        # domain inference branch because it contains a dot and doesn't start
        # with '/'.
        assert info.type != "manpage"

    # -- RSS/Atom --
    def test_detect_rss_extension(self):
        """Test .rss → rss detection."""
        info = SourceDetector.detect("feed.rss")
        assert info.type == "rss"
        assert info.parsed["file_path"] == "feed.rss"

    def test_detect_atom_extension(self):
        """Test .atom → rss detection."""
        info = SourceDetector.detect("updates.atom")
        assert info.type == "rss"
        assert info.parsed["file_path"] == "updates.atom"

    def test_xml_not_detected_as_rss(self):
        """Test .xml is NOT detected as rss (too generic).

        The fix ensures .xml files do not get incorrectly classified as RSS feeds.
        """
        # .xml has no special handling — it will fall through to domain inference
        # or raise ValueError depending on contents. Either way, it must not
        # be classified as "rss".
        info = SourceDetector.detect("data.xml")
        assert info.type != "rss"

    # -- OpenAPI --
    def test_yaml_with_openapi_content_detected(self, tmp_path):
        """Test .yaml with 'openapi:' key → openapi detection."""
        spec = tmp_path / "petstore.yaml"
        spec.write_text(
            textwrap.dedent("""\
                openapi: "3.0.0"
                info:
                  title: Petstore
                  version: "1.0.0"
                paths: {}
                """)
        )
        info = SourceDetector.detect(str(spec))
        assert info.type == "openapi"
        assert info.parsed["file_path"] == str(spec)
        assert info.suggested_name == "petstore"

    def test_yaml_with_swagger_content_detected(self, tmp_path):
        """Test .yaml with 'swagger:' key → openapi detection."""
        spec = tmp_path / "legacy.yml"
        spec.write_text(
            textwrap.dedent("""\
                swagger: "2.0"
                info:
                  title: Legacy API
                basePath: /v1
                """)
        )
        info = SourceDetector.detect(str(spec))
        assert info.type == "openapi"

    def test_yaml_without_openapi_not_detected(self, tmp_path):
        """Test .yaml without OpenAPI content is NOT detected as openapi.

        When the YAML file doesn't contain openapi/swagger keys the detector
        skips OpenAPI and falls through. For an absolute path it will raise
        ValueError (cannot determine type), which still confirms it was NOT
        classified as openapi.
        """
        plain = tmp_path / "config.yaml"
        plain.write_text("name: my-project\nversion: 1.0\n")
        # Absolute path falls through to ValueError (no matching type).
        # Either way, it must NOT be "openapi".
        try:
            info = SourceDetector.detect(str(plain))
            assert info.type != "openapi"
        except ValueError:
            # Raised because source type cannot be determined — this is fine,
            # the important thing is it was not classified as openapi.
            pass

    def test_looks_like_openapi_returns_false_for_missing_file(self):
        """Test _looks_like_openapi returns False for non-existent file."""
        assert SourceDetector._looks_like_openapi("/nonexistent/spec.yaml") is False

    def test_looks_like_openapi_json_key_format(self, tmp_path):
        """Test _looks_like_openapi detects JSON-style keys (quoted)."""
        spec = tmp_path / "api.yaml"
        spec.write_text('"openapi": "3.0.0"\n')
        assert SourceDetector._looks_like_openapi(str(spec)) is True

# ---------------------------------------------------------------------------
# 2. ConfigValidator — new source type validation
# ---------------------------------------------------------------------------


class TestConfigValidatorNewTypes:
    """Test ConfigValidator VALID_SOURCE_TYPES and per-type validation."""

    # All 17 expected types
    EXPECTED_TYPES = {
        "documentation",
        "github",
        "pdf",
        "local",
        "word",
        "video",
        "epub",
        "jupyter",
        "html",
        "openapi",
        "asciidoc",
        "pptx",
        "confluence",
        "notion",
        "rss",
        "manpage",
        "chat",
    }

    def test_all_17_types_present(self):
        """Test that VALID_SOURCE_TYPES contains all 17 types."""
        assert ConfigValidator.VALID_SOURCE_TYPES == self.EXPECTED_TYPES

    def test_unknown_type_rejected(self):
        """Test that an unknown source type is rejected during validation."""
        config = {
            "name": "test",
            "description": "test",
            "sources": [{"type": "foobar"}],
        }
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Invalid type 'foobar'"):
            validator.validate()

    # --- Per-type required-field validation ---

    def _make_config(self, source: dict) -> dict:
        """Helper: wrap a source dict in a valid config structure."""
        return {
            "name": "test",
            "description": "test",
            "sources": [source],
        }

    def test_epub_requires_path(self):
        """Test epub source validation requires 'path'."""
        config = self._make_config({"type": "epub"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path'"):
            validator.validate()

    def test_jupyter_requires_path(self):
        """Test jupyter source validation requires 'path'."""
        config = self._make_config({"type": "jupyter"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path'"):
            validator.validate()

    def test_html_requires_path(self):
        """Test html source validation requires 'path'."""
        config = self._make_config({"type": "html"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path'"):
            validator.validate()

    def test_openapi_requires_path_or_url(self):
        """Test openapi source validation requires 'path' or 'url'."""
        config = self._make_config({"type": "openapi"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path' or 'url'"):
            validator.validate()

    def test_openapi_accepts_url(self):
        """Test openapi source passes validation with 'url'."""
        config = self._make_config({"type": "openapi", "url": "https://example.com/spec.yaml"})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_pptx_requires_path(self):
        """Test pptx source validation requires 'path'."""
        config = self._make_config({"type": "pptx"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path'"):
            validator.validate()

    def test_asciidoc_requires_path(self):
        """Test asciidoc source validation requires 'path'."""
        config = self._make_config({"type": "asciidoc"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path'"):
            validator.validate()

    def test_confluence_requires_url_or_path(self):
        """Test confluence requires 'url'/'base_url' or 'path'."""
        config = self._make_config({"type": "confluence"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field"):
            validator.validate()

    def test_confluence_accepts_base_url(self):
        """Test confluence passes with base_url + space_key."""
        config = self._make_config(
            {
                "type": "confluence",
                "base_url": "https://wiki.example.com",
                "space_key": "DEV",
            }
        )
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_confluence_accepts_path(self):
        """Test confluence passes with export path."""
        config = self._make_config({"type": "confluence", "path": "/exports/wiki"})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_notion_requires_url_or_path(self):
        """Test notion requires 'url'/'database_id'/'page_id' or 'path'."""
        config = self._make_config({"type": "notion"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field"):
            validator.validate()

    def test_notion_accepts_page_id(self):
        """Test notion passes with page_id."""
        config = self._make_config({"type": "notion", "page_id": "abc123"})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_notion_accepts_database_id(self):
        """Test notion passes with database_id."""
        config = self._make_config({"type": "notion", "database_id": "db-456"})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_rss_requires_url_or_path(self):
        """Test rss source validation requires 'url' or 'path'."""
        config = self._make_config({"type": "rss"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'url' or 'path'"):
            validator.validate()

    def test_rss_accepts_url(self):
        """Test rss passes with url."""
        config = self._make_config({"type": "rss", "url": "https://blog.example.com/feed.xml"})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_manpage_requires_path_or_names(self):
        """Test manpage source validation requires 'path' or 'names'."""
        config = self._make_config({"type": "manpage"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path' or 'names'"):
            validator.validate()

    def test_manpage_accepts_names(self):
        """Test manpage passes with 'names' list."""
        config = self._make_config({"type": "manpage", "names": ["git", "curl"]})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_chat_requires_path_or_token(self):
        """Test chat source validation requires 'path' or 'token'."""
        config = self._make_config({"type": "chat"})
        validator = ConfigValidator(config)
        with pytest.raises(ValueError, match="Missing required field 'path'.*or 'token'"):
            validator.validate()

    def test_chat_accepts_path(self):
        """Test chat passes with export path."""
        config = self._make_config({"type": "chat", "path": "/exports/slack"})
        validator = ConfigValidator(config)
        assert validator.validate() is True

    def test_chat_accepts_token_with_channel(self):
        """Test chat passes with API token + channel."""
        config = self._make_config(
            {
                "type": "chat",
                "token": "xoxb-fake",
                "channel": "#general",
            }
        )
        validator = ConfigValidator(config)
        assert validator.validate() is True

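The per-type required-field rules these tests pin down can be summarized as a small data table. The sketch below is reconstructed from the assertions above and is not the validator's actual code; the names `REQUIRED_ANY_OF` and `check_source` are hypothetical:

```python
# Hypothetical reconstruction of the rules implied by the tests above:
# each new source type needs at least one of the listed fields.
REQUIRED_ANY_OF = {
    "jupyter": ("path",),
    "html": ("path",),
    "epub": ("path",),
    "pptx": ("path",),
    "asciidoc": ("path",),
    "openapi": ("path", "url"),
    "confluence": ("path", "url", "base_url"),
    "notion": ("path", "url", "page_id", "database_id"),
    "rss": ("path", "url"),
    "manpage": ("path", "names"),
    "chat": ("path", "token"),
}


def check_source(source: dict) -> None:
    """Raise ValueError when none of the accepted fields is present."""
    fields = REQUIRED_ANY_OF.get(source.get("type"), ())
    if fields and not any(f in source for f in fields):
        wanted = " or ".join(f"'{f}'" for f in fields)
        raise ValueError(f"Missing required field {wanted}")


check_source({"type": "rss", "url": "https://blog.example.com/feed.xml"})  # passes
```

Keeping the rules as data rather than as one branch per type makes it harder for a newly added source type to skip validation entirely, which is the gap the word/video validator fix in this commit closed.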
# ---------------------------------------------------------------------------
# 3. UnifiedSkillBuilder — generic merge system
# ---------------------------------------------------------------------------


class TestUnifiedSkillBuilderGenericMerge:
    """Test _generic_merge, _append_extra_sources, and _SOURCE_LABELS."""

    def _make_builder(self, tmp_path) -> UnifiedSkillBuilder:
        """Create a minimal builder instance for testing."""
        config = {
            "name": "test_project",
            "description": "A test project for merge testing",
            "sources": [
                {"type": "jupyter", "path": "nb.ipynb"},
                {"type": "rss", "url": "https://example.com/feed.rss"},
            ],
        }
        scraped_data: dict = {}
        builder = UnifiedSkillBuilder(
            config=config,
            scraped_data=scraped_data,
            cache_dir=str(tmp_path / "cache"),
        )
        # Override skill_dir to use tmp_path
        builder.skill_dir = str(tmp_path / "output" / "test_project")
        os.makedirs(builder.skill_dir, exist_ok=True)
        os.makedirs(os.path.join(builder.skill_dir, "references"), exist_ok=True)
        return builder

    def test_generic_merge_produces_valid_markdown(self, tmp_path):
        """Test _generic_merge with two source types produces markdown."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "jupyter": "## When to Use\n\nFor data analysis.\n\n## Quick Reference\n\nImport pandas.",
            "rss": "## When to Use\n\nFor feed monitoring.\n\n## Feed Items\n\nLatest entries.",
        }
        result = builder._generic_merge(skill_mds)

        # Must be non-empty markdown
        assert len(result) > 100
        # Must contain the project title
        assert "Test Project" in result

    def test_generic_merge_includes_yaml_frontmatter(self, tmp_path):
        """Test _generic_merge includes YAML frontmatter."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "html": "## Overview\n\nHTML content here.",
        }
        result = builder._generic_merge(skill_mds)

        assert result.startswith("---\n")
        assert "name: test-project" in result
        assert "description: A test project" in result

    def test_generic_merge_attributes_content_to_sources(self, tmp_path):
        """Test _generic_merge attributes content to correct source labels."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "jupyter": "## Overview\n\nNotebook content.",
            "pptx": "## Overview\n\nSlide content.",
        }
        result = builder._generic_merge(skill_mds)

        # Check source labels appear
        assert "Jupyter Notebook" in result
        assert "PowerPoint Presentation" in result

    def test_generic_merge_single_source_section(self, tmp_path):
        """Test section unique to one source has 'From <Label>' attribution."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "manpage": "## Synopsis\n\ngit [options]",
        }
        result = builder._generic_merge(skill_mds)

        assert "*From Man Page*" in result
        assert "## Synopsis" in result

    def test_generic_merge_multi_source_section(self, tmp_path):
        """Test section shared by multiple sources gets sub-headings per source."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "asciidoc": "## Quick Reference\n\nAsciiDoc quick ref.",
            "html": "## Quick Reference\n\nHTML quick ref.",
        }
        result = builder._generic_merge(skill_mds)

        # Both sources should be attributed under the shared section
        assert "### From AsciiDoc Document" in result
        assert "### From HTML Document" in result

    def test_generic_merge_footer(self, tmp_path):
        """Test _generic_merge ends with the standard footer."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "rss": "## Feeds\n\nSome feeds.",
        }
        result = builder._generic_merge(skill_mds)
        assert "Generated by Skill Seeker" in result

    def test_generic_merge_merged_from_line(self, tmp_path):
        """Test _generic_merge includes 'Merged from:' with correct labels."""
        builder = self._make_builder(tmp_path)
        skill_mds = {
            "confluence": "## Pages\n\nWiki pages.",
            "notion": "## Databases\n\nNotion DBs.",
        }
        result = builder._generic_merge(skill_mds)

        assert "*Merged from: Confluence Wiki, Notion Page*" in result

    def test_append_extra_sources_adds_sections(self, tmp_path):
        """Test _append_extra_sources adds new sections to base content."""
        builder = self._make_builder(tmp_path)
        base_content = "# Test\n\nIntro.\n\n## Main Section\n\nContent.\n\n---\n\n*Footer*\n"
        skill_mds = {
            "epub": "## Chapters\n\nChapter list.\n\n## Key Concepts\n\nConcept A.",
        }
        result = builder._append_extra_sources(base_content, skill_mds, {"epub"})

        # The extra source content should be inserted before the footer separator
        assert "EPUB E-book Content" in result
        assert "Chapters" in result
        assert "Key Concepts" in result
        # Original content should still be present
        assert "# Test" in result
        assert "## Main Section" in result

    def test_append_extra_sources_preserves_footer(self, tmp_path):
        """Test _append_extra_sources keeps the footer intact."""
        builder = self._make_builder(tmp_path)
        base_content = "# Test\n\n---\n\n*Footer*\n"
        skill_mds = {
            "chat": "## Messages\n\nChat history.",
        }
        result = builder._append_extra_sources(base_content, skill_mds, {"chat"})

        assert "*Footer*" in result

    def test_source_labels_has_all_17_types(self):
        """Test _SOURCE_LABELS has entries for all 17 source types."""
        expected = {
            "documentation",
            "github",
            "pdf",
            "word",
            "epub",
            "video",
            "local",
            "jupyter",
            "html",
            "openapi",
            "asciidoc",
            "pptx",
            "confluence",
            "notion",
            "rss",
            "manpage",
            "chat",
        }
        assert set(UnifiedSkillBuilder._SOURCE_LABELS.keys()) == expected

    def test_source_labels_values_are_nonempty_strings(self):
        """Test all _SOURCE_LABELS values are non-empty strings."""
        for key, label in UnifiedSkillBuilder._SOURCE_LABELS.items():
            assert isinstance(label, str), f"Label for '{key}' is not a string"
            assert len(label) > 0, f"Label for '{key}' is empty"

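The attribution behavior these tests describe can be sketched as a toy merge function. This is an illustrative reconstruction of the behavior the assertions above require, not the project's `_generic_merge` implementation; `LABELS` and `merge_sections` are hypothetical names:

```python
# Hypothetical sketch of source-attributed merging: a section that
# appears in several sources gets one "### From <Label>" sub-heading
# per source, while a section unique to one source gets a single
# italic "*From <Label>*" attribution line instead.
LABELS = {
    "jupyter": "Jupyter Notebook",
    "pptx": "PowerPoint Presentation",
    "manpage": "Man Page",
    "html": "HTML Document",
}


def merge_sections(per_source: dict[str, dict[str, str]]) -> str:
    # per_source maps source type -> {section title: section body}.
    # Preserve first-seen section order across sources.
    titles: list[str] = []
    for sections in per_source.values():
        for title in sections:
            if title not in titles:
                titles.append(title)

    out: list[str] = []
    for title in titles:
        owners = [s for s, secs in per_source.items() if title in secs]
        out.append(f"## {title}")
        if len(owners) == 1:
            out.append(f"*From {LABELS[owners[0]]}*")
            out.append(per_source[owners[0]][title])
        else:
            for s in owners:
                out.append(f"### From {LABELS[s]}")
                out.append(per_source[s][title])
    return "\n\n".join(out)
```

Tracking ownership per section is what later lets the gap-analysis step flag "single-source topics" as fragile.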
# ---------------------------------------------------------------------------
# 4. COMMAND_MODULES and parser wiring
# ---------------------------------------------------------------------------


class TestCommandModules:
    """Test that all 10 new source types are wired into CLI."""

    NEW_COMMAND_NAMES = [
        "jupyter",
        "html",
        "openapi",
        "asciidoc",
        "pptx",
        "rss",
        "manpage",
        "confluence",
        "notion",
        "chat",
    ]

    def test_new_types_in_command_modules(self):
        """Test all 10 new source types are in COMMAND_MODULES."""
        for cmd in self.NEW_COMMAND_NAMES:
            assert cmd in COMMAND_MODULES, f"'{cmd}' not in COMMAND_MODULES"

    def test_command_modules_values_are_module_paths(self):
        """Test COMMAND_MODULES values look like importable module paths."""
        for cmd in self.NEW_COMMAND_NAMES:
            module_path = COMMAND_MODULES[cmd]
            assert module_path.startswith("skill_seekers.cli."), (
                f"Module path for '{cmd}' doesn't start with 'skill_seekers.cli.'"
            )

    def test_new_parser_names_include_all_10(self):
        """Test that get_parser_names() includes all 10 new source types."""
        names = get_parser_names()
        for cmd in self.NEW_COMMAND_NAMES:
            assert cmd in names, f"Parser '{cmd}' not registered"

    def test_total_parser_count(self):
        """Test total PARSERS count is 35 (25 original + 10 new)."""
        assert len(PARSERS) == 35

    def test_no_duplicate_parser_names(self):
        """Test no duplicate parser names exist."""
        names = get_parser_names()
        assert len(names) == len(set(names)), "Duplicate parser names found!"

    def test_command_module_count(self):
        """Test COMMAND_MODULES has expected number of entries."""
        # 25 original + 10 new = 35
        assert len(COMMAND_MODULES) == 35

# ---------------------------------------------------------------------------
|
||||
# 5. SourceDetector.validate_source — new types
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestSourceDetectorValidation:
|
||||
"""Test validate_source for new file-based source types."""
|
||||
|
||||
def test_validation_passes_for_existing_jupyter(self, tmp_path):
|
||||
"""Test validation passes for an existing .ipynb file."""
|
||||
nb = tmp_path / "test.ipynb"
|
||||
nb.write_text('{"cells": []}')
|
||||
|
||||
info = SourceInfo(
|
||||
type="jupyter",
|
||||
parsed={"file_path": str(nb)},
|
||||
suggested_name="test",
|
||||
raw_input=str(nb),
|
||||
)
|
||||
# Should not raise
|
||||
SourceDetector.validate_source(info)
|
||||
|
||||
def test_validation_raises_for_nonexistent_jupyter(self):
|
||||
"""Test validation raises ValueError for non-existent file."""
|
||||
info = SourceInfo(
|
||||
type="jupyter",
|
||||
parsed={"file_path": "/nonexistent/notebook.ipynb"},
|
||||
suggested_name="notebook",
|
||||
raw_input="/nonexistent/notebook.ipynb",
|
||||
)
|
||||
with pytest.raises(ValueError, match="does not exist"):
|
||||
SourceDetector.validate_source(info)
|
||||
|
||||
def test_validation_passes_for_existing_html(self, tmp_path):
|
||||
"""Test validation passes for an existing .html file."""
|
||||
html = tmp_path / "page.html"
|
||||
html.write_text("<html></html>")
|
||||
|
||||
info = SourceInfo(
|
||||
type="html",
|
||||
parsed={"file_path": str(html)},
|
||||
suggested_name="page",
|
||||
raw_input=str(html),
|
||||
)
|
||||
SourceDetector.validate_source(info)
|
||||
|
||||
def test_validation_raises_for_nonexistent_pptx(self):
|
||||
"""Test validation raises ValueError for non-existent pptx."""
|
||||
info = SourceInfo(
|
||||
type="pptx",
|
||||
parsed={"file_path": "/nonexistent/slides.pptx"},
|
||||
suggested_name="slides",
|
||||
raw_input="/nonexistent/slides.pptx",
|
||||
)
|
||||
with pytest.raises(ValueError, match="does not exist"):
|
||||
SourceDetector.validate_source(info)
|
||||
|
||||
def test_validation_passes_for_existing_openapi(self, tmp_path):
|
||||
"""Test validation passes for an existing OpenAPI spec file."""
|
||||
spec = tmp_path / "api.yaml"
|
||||
spec.write_text("openapi: '3.0.0'\n")
|
||||
|
||||
info = SourceInfo(
|
||||
type="openapi",
|
||||
            parsed={"file_path": str(spec)},
            suggested_name="api",
            raw_input=str(spec),
        )
        SourceDetector.validate_source(info)

    def test_validation_raises_for_nonexistent_asciidoc(self):
        """Test validation raises ValueError for non-existent asciidoc."""
        info = SourceInfo(
            type="asciidoc",
            parsed={"file_path": "/nonexistent/doc.adoc"},
            suggested_name="doc",
            raw_input="/nonexistent/doc.adoc",
        )
        with pytest.raises(ValueError, match="does not exist"):
            SourceDetector.validate_source(info)

    def test_validation_raises_for_nonexistent_manpage(self):
        """Test validation raises ValueError for non-existent manpage."""
        info = SourceInfo(
            type="manpage",
            parsed={"file_path": "/nonexistent/git.1"},
            suggested_name="git",
            raw_input="/nonexistent/git.1",
        )
        with pytest.raises(ValueError, match="does not exist"):
            SourceDetector.validate_source(info)

    def test_validation_passes_for_existing_manpage(self, tmp_path):
        """Test validation passes for an existing man page file."""
        man = tmp_path / "curl.1"
        man.write_text(".TH CURL 1\n")

        info = SourceInfo(
            type="manpage",
            parsed={"file_path": str(man)},
            suggested_name="curl",
            raw_input=str(man),
        )
        SourceDetector.validate_source(info)

    def test_rss_url_validation_no_file_check(self):
        """Test rss validation passes for URL-based source (no file check)."""
        info = SourceInfo(
            type="rss",
            parsed={"url": "https://example.com/feed.rss"},
            suggested_name="feed",
            raw_input="https://example.com/feed.rss",
        )
        # rss validation only checks file if file_path is present; URL should pass
        SourceDetector.validate_source(info)

    def test_rss_validation_raises_for_nonexistent_file(self):
        """Test rss validation raises for non-existent local file."""
        info = SourceInfo(
            type="rss",
            parsed={"file_path": "/nonexistent/feed.rss"},
            suggested_name="feed",
            raw_input="/nonexistent/feed.rss",
        )
        with pytest.raises(ValueError, match="does not exist"):
            SourceDetector.validate_source(info)

    def test_rss_validation_passes_for_existing_file(self, tmp_path):
        """Test rss validation passes for an existing .rss file."""
        rss = tmp_path / "feed.rss"
        rss.write_text("<rss></rss>")

        info = SourceInfo(
            type="rss",
            parsed={"file_path": str(rss)},
            suggested_name="feed",
            raw_input=str(rss),
        )
        SourceDetector.validate_source(info)

    def test_validation_passes_for_directory_types(self, tmp_path):
        """Test validation passes when source is a directory (e.g., html dir)."""
        html_dir = tmp_path / "pages"
        html_dir.mkdir()

        info = SourceInfo(
            type="html",
            parsed={"file_path": str(html_dir)},
            suggested_name="pages",
            raw_input=str(html_dir),
        )
        # The validator allows directories for these types (isfile or isdir)
        SourceDetector.validate_source(info)


# ---------------------------------------------------------------------------
# 6. CreateCommand._route_generic coverage
# ---------------------------------------------------------------------------


class TestCreateCommandRouting:
    """Test that CreateCommand._route_to_scraper maps new types to _route_generic."""

    # We can't easily call _route_to_scraper (it imports real scrapers),
    # but we verify the routing table is correct by checking the method source.

    GENERIC_ROUTES = {
        "jupyter": ("jupyter_scraper", "--notebook"),
        "html": ("html_scraper", "--html-path"),
        "openapi": ("openapi_scraper", "--spec"),
        "asciidoc": ("asciidoc_scraper", "--asciidoc-path"),
        "pptx": ("pptx_scraper", "--pptx"),
        "rss": ("rss_scraper", "--feed-path"),
        "manpage": ("man_scraper", "--man-path"),
        "confluence": ("confluence_scraper", "--export-path"),
        "notion": ("notion_scraper", "--export-path"),
        "chat": ("chat_scraper", "--export-path"),
    }

    def test_route_to_scraper_source_coverage(self):
        """Test _route_to_scraper method handles all 10 new types.

        We inspect the method source to verify each type has a branch.
        """
        import inspect

        source = inspect.getsource(
            __import__(
                "skill_seekers.cli.create_command",
                fromlist=["CreateCommand"],
            ).CreateCommand._route_to_scraper
        )
        for source_type in self.GENERIC_ROUTES:
            assert f'"{source_type}"' in source, (
                f"_route_to_scraper missing branch for '{source_type}'"
            )

    def test_generic_route_module_names(self):
        """Test _route_generic is called with correct module names."""
        import inspect

        source = inspect.getsource(
            __import__(
                "skill_seekers.cli.create_command",
                fromlist=["CreateCommand"],
            ).CreateCommand._route_to_scraper
        )
        for source_type, (module, flag) in self.GENERIC_ROUTES.items():
            assert f'"{module}"' in source, f"Module name '{module}' not found for '{source_type}'"
            assert f'"{flag}"' in source, f"Flag '{flag}' not found for '{source_type}'"


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
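The GENERIC_ROUTES table above implies a very simple dispatch shape. The sketch below is hypothetical (the real `CreateCommand._route_generic` may shell out differently); it only illustrates how a (module, flag) pair from such a table would translate into a scraper invocation:

```python
# Hypothetical dispatch mirroring the GENERIC_ROUTES table in the tests;
# module paths under skill_seekers.cli are an assumption for illustration.
GENERIC_ROUTES = {
    "jupyter": ("jupyter_scraper", "--notebook"),
    "rss": ("rss_scraper", "--feed-path"),
    "manpage": ("man_scraper", "--man-path"),
}

def build_scraper_argv(source_type: str, path: str) -> list[str]:
    """Map a detected source type to the argv of its standalone scraper."""
    module, flag = GENERIC_ROUTES[source_type]
    return ["python", "-m", f"skill_seekers.cli.{module}", flag, path]
```

Keeping the mapping in one table is what makes the source-inspection tests above cheap: each new type is a data row, not a new code path.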
180
uv.lock
generated
@@ -220,6 +220,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/7f/9c/36c5c37947ebfb8c7f22e0eb6e4d188ee2d53aa3880f3f2744fb894f0cb1/anyio-4.12.0-py3-none-any.whl", hash = "sha256:dad2376a628f98eeca4881fc56cd06affd18f659b17a747d3ff0307ced94b1bb", size = 113362, upload-time = "2025-11-28T23:36:57.897Z" },
]

[[package]]
name = "asciidoc"
version = "10.2.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1d/e7/315a82f2d256e9270977aa3c15e8fe281fd7c40b8e2a0b97e0cb61ca8fa0/asciidoc-10.2.1.tar.gz", hash = "sha256:d9f13c285981b3c7eb660d02ca0a2779981e88d48105de81bb40445e60dddb83", size = 230179, upload-time = "2024-07-17T03:12:52.681Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/75/1f/87941eaa96e86aa22086064f67e4187e2710fb76c147312979ea29278dac/asciidoc-10.2.1-py2.py3-none-any.whl", hash = "sha256:3f277a636b617c9ce7e0b87bcaea51f144500e9a5c8a6488421ee24594850d40", size = 272433, upload-time = "2024-07-17T03:12:49.012Z" },
]

[[package]]
name = "async-timeout"
version = "5.0.1"
@@ -229,6 +238,24 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/fe/ba/e2081de779ca30d473f21f5b30e0e737c438205440784c7dfc81efc2b029/async_timeout-5.0.1-py3-none-any.whl", hash = "sha256:39e3809566ff85354557ec2398b55e096c8364bacac9405a7a1fa429e77fe76c", size = 6233, upload-time = "2024-11-06T16:41:37.9Z" },
]

[[package]]
name = "atlassian-python-api"
version = "4.0.7"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "beautifulsoup4" },
    { name = "deprecated" },
    { name = "jmespath" },
    { name = "oauthlib" },
    { name = "requests" },
    { name = "requests-oauthlib" },
    { name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/40/e8/f23b7273e410c6fe9f98f9db25268c6736572f22a9566d1dc9ed3614bb68/atlassian_python_api-4.0.7.tar.gz", hash = "sha256:8d9cc6068b1d2a48eb434e22e57f6bbd918a47fac9e46b95b7a3cefb00fceacb", size = 271149, upload-time = "2025-08-21T13:19:40.746Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/1d/83/e4f9976ce3c933a079b8931325e7a9c0a8bba7030a2cb85764c0048f3479/atlassian_python_api-4.0.7-py3-none-any.whl", hash = "sha256:46a70cb29eaab87c0a1697fccd3e25df1aa477e6aa4fb9ba936a9d46b425933c", size = 197746, upload-time = "2025-08-21T13:19:39.044Z" },
]

[[package]]
name = "attrs"
version = "25.4.0"
@@ -1135,6 +1162,27 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/05/99/49ee85903dee060d9f08297b4a342e5e0bcfca2f027a07b4ee0a38ab13f9/faster_whisper-1.2.1-py3-none-any.whl", hash = "sha256:79a66ad50688c0b794dd501dc340a736992a6342f7f95e5811be60b5224a26a7", size = 1118909, upload-time = "2025-10-31T11:35:47.794Z" },
]

[[package]]
name = "fastjsonschema"
version = "2.21.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/20/b5/23b216d9d985a956623b6bd12d4086b60f0059b27799f23016af04a74ea1/fastjsonschema-2.21.2.tar.gz", hash = "sha256:b1eb43748041c880796cd077f1a07c3d94e93ae84bba5ed36800a33554ae05de", size = 374130, upload-time = "2025-08-14T18:49:36.666Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/cb/a8/20d0723294217e47de6d9e2e40fd4a9d2f7c4b6ef974babd482a59743694/fastjsonschema-2.21.2-py3-none-any.whl", hash = "sha256:1c797122d0a86c5cace2e54bf4e819c36223b552017172f32c5c024a6b77e463", size = 24024, upload-time = "2025-08-14T18:49:34.776Z" },
]

[[package]]
name = "feedparser"
version = "6.0.12"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "sgmllib3k" },
]
sdist = { url = "https://files.pythonhosted.org/packages/dc/79/db7edb5e77d6dfbc54d7d9df72828be4318275b2e580549ff45a962f6461/feedparser-6.0.12.tar.gz", hash = "sha256:64f76ce90ae3e8ef5d1ede0f8d3b50ce26bcce71dd8ae5e82b1cd2d4a5f94228", size = 286579, upload-time = "2025-09-10T13:33:59.486Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/4e/eb/c96d64137e29ae17d83ad2552470bafe3a7a915e85434d9942077d7fd011/feedparser-6.0.12-py3-none-any.whl", hash = "sha256:6bbff10f5a52662c00a2e3f86a38928c37c48f77b3c511aedcd51de933549324", size = 81480, upload-time = "2025-09-10T13:33:58.022Z" },
]

[[package]]
name = "ffmpeg-python"
version = "0.2.0"
@@ -2100,6 +2148,19 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/41/45/1a4ed80516f02155c51f51e8cedb3c1902296743db0bbc66608a0db2814f/jsonschema_specifications-2025.9.1-py3-none-any.whl", hash = "sha256:98802fee3a11ee76ecaca44429fda8a41bff98b00a0f2838151b113f210cc6fe", size = 18437, upload-time = "2025-09-08T01:34:57.871Z" },
]

[[package]]
name = "jupyter-core"
version = "5.9.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "platformdirs" },
    { name = "traitlets" },
]
sdist = { url = "https://files.pythonhosted.org/packages/02/49/9d1284d0dc65e2c757b74c6687b6d319b02f822ad039e5c512df9194d9dd/jupyter_core-5.9.1.tar.gz", hash = "sha256:4d09aaff303b9566c3ce657f580bd089ff5c91f5f89cf7d8846c3cdf465b5508", size = 89814, upload-time = "2025-10-16T19:19:18.444Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/e7/e7/80988e32bf6f73919a113473a604f5a8f09094de312b9d52b79c2df7612b/jupyter_core-5.9.1-py3-none-any.whl", hash = "sha256:ebf87fdc6073d142e114c72c9e29a9d7ca03fad818c5d300ce2adc1fb0743407", size = 29032, upload-time = "2025-10-16T19:19:16.783Z" },
]

[[package]]
name = "kubernetes"
version = "35.0.0"
@@ -3122,6 +3183,21 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/79/7b/2c79738432f5c924bef5071f933bcc9efd0473bac3b4aa584a6f7c1c8df8/mypy_extensions-1.1.0-py3-none-any.whl", hash = "sha256:1be4cccdb0f2482337c4743e60421de3a356cd97508abadd57d47403e94f5505", size = 4963, upload-time = "2025-04-22T14:54:22.983Z" },
]

[[package]]
name = "nbformat"
version = "5.10.4"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "fastjsonschema" },
    { name = "jsonschema" },
    { name = "jupyter-core" },
    { name = "traitlets" },
]
sdist = { url = "https://files.pythonhosted.org/packages/6d/fd/91545e604bc3dad7dca9ed03284086039b294c6b3d75c0d2fa45f9e9caf3/nbformat-5.10.4.tar.gz", hash = "sha256:322168b14f937a5d11362988ecac2a4952d3d8e3a2cbeb2319584631226d5b3a", size = 142749, upload-time = "2024-04-04T11:20:37.371Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/a9/82/0340caa499416c78e5d8f5f05947ae4bc3cba53c9f038ab6e9ed964e22f1/nbformat-5.10.4-py3-none-any.whl", hash = "sha256:3b48d6c8fbca4b299bf3982ea7db1af21580e4fec269ad087b9e81588891200b", size = 78454, upload-time = "2024-04-04T11:20:34.895Z" },
]

[[package]]
name = "nest-asyncio"
version = "1.6.0"
@@ -3173,6 +3249,18 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/60/90/81ac364ef94209c100e12579629dc92bf7a709a84af32f8c551b02c07e94/nltk-3.9.2-py3-none-any.whl", hash = "sha256:1e209d2b3009110635ed9709a67a1a3e33a10f799490fa71cf4bec218c11c88a", size = 1513404, upload-time = "2025-10-01T07:19:21.648Z" },
]

[[package]]
name = "notion-client"
version = "3.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "httpx" },
]
sdist = { url = "https://files.pythonhosted.org/packages/a5/39/60afcbc0148c3dafaaefe851ae3f058077db49d66288dfb218a11a57b997/notion_client-3.0.0.tar.gz", hash = "sha256:05c4d2b4fa3491dc0de21c9c826277202ea8b8714077ee7f51a6e1a09ab23d0f", size = 31357, upload-time = "2026-02-16T11:15:48.024Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/aa/ce/6b03f9aedd2edfcc28e23ced5c2582d543f6ddbb2be5c570533f02890b27/notion_client-3.0.0-py2.py3-none-any.whl", hash = "sha256:177fc3d2ace7e8ef69cf96f46269e8a66071c2c7c526194bf06ce7925853e759", size = 18746, upload-time = "2026-02-16T11:15:46.602Z" },
]

[[package]]
name = "numpy"
version = "2.2.6"
@@ -4789,6 +4877,21 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/aa/76/03af049af4dcee5d27442f71b6924f01f3efb5d2bd34f23fcd563f2cc5f5/python_multipart-0.0.21-py3-none-any.whl", hash = "sha256:cf7a6713e01c87aa35387f4774e812c4361150938d20d232800f75ffcf266090", size = 24541, upload-time = "2025-12-17T09:24:21.153Z" },
]

[[package]]
name = "python-pptx"
version = "1.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "lxml" },
    { name = "pillow" },
    { name = "typing-extensions" },
    { name = "xlsxwriter" },
]
sdist = { url = "https://files.pythonhosted.org/packages/52/a9/0c0db8d37b2b8a645666f7fd8accea4c6224e013c42b1d5c17c93590cd06/python_pptx-1.0.2.tar.gz", hash = "sha256:479a8af0eaf0f0d76b6f00b0887732874ad2e3188230315290cd1f9dd9cc7095", size = 10109297, upload-time = "2024-08-07T17:33:37.772Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/d9/4f/00be2196329ebbff56ce564aa94efb0fbc828d00de250b1980de1a34ab49/python_pptx-1.0.2-py3-none-any.whl", hash = "sha256:160838e0b8565a8b1f67947675886e9fea18aa5e795db7ae531606d68e785cba", size = 472788, upload-time = "2024-08-07T17:33:28.192Z" },
]

[[package]]
name = "pytz"
version = "2025.2"
@@ -5570,6 +5673,12 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/e1/e3/c164c88b2e5ce7b24d667b9bd83589cf4f3520d97cad01534cd3c4f55fdb/setuptools-81.0.0-py3-none-any.whl", hash = "sha256:fdd925d5c5d9f62e4b74b30d6dd7828ce236fd6ed998a08d81de62ce5a6310d6", size = 1062021, upload-time = "2026-02-06T21:10:37.175Z" },
]

[[package]]
name = "sgmllib3k"
version = "1.0.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/9e/bd/3704a8c3e0942d711c1299ebf7b9091930adae6675d7c8f476a7ce48653c/sgmllib3k-1.0.0.tar.gz", hash = "sha256:7868fb1c8bfa764c1ac563d3cf369c381d1325d36124933a726f29fcdaa812e9", size = 5750, upload-time = "2010-08-24T14:33:52.445Z" }

[[package]]
name = "shellingham"
version = "1.5.4"
@@ -5619,23 +5728,30 @@ dependencies = [

[package.optional-dependencies]
all = [
    { name = "asciidoc" },
    { name = "atlassian-python-api" },
    { name = "azure-storage-blob" },
    { name = "boto3" },
    { name = "chromadb" },
    { name = "ebooklib" },
    { name = "fastapi" },
    { name = "feedparser" },
    { name = "google-cloud-storage" },
    { name = "google-generativeai" },
    { name = "httpx" },
    { name = "httpx-sse" },
    { name = "mammoth" },
    { name = "mcp" },
    { name = "nbformat" },
    { name = "notion-client" },
    { name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" },
    { name = "numpy", version = "2.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" },
    { name = "openai" },
    { name = "pinecone" },
    { name = "python-docx" },
    { name = "python-pptx" },
    { name = "sentence-transformers" },
    { name = "slack-sdk" },
    { name = "sse-starlette" },
    { name = "starlette" },
    { name = "uvicorn" },
@@ -5653,12 +5769,21 @@ all-llms = [
    { name = "google-generativeai" },
    { name = "openai" },
]
asciidoc = [
    { name = "asciidoc" },
]
azure = [
    { name = "azure-storage-blob" },
]
chat = [
    { name = "slack-sdk" },
]
chroma = [
    { name = "chromadb" },
]
confluence = [
    { name = "atlassian-python-api" },
]
docx = [
    { name = "mammoth" },
    { name = "python-docx" },
@@ -5680,6 +5805,9 @@ gcs = [
gemini = [
    { name = "google-generativeai" },
]
jupyter = [
    { name = "nbformat" },
]
mcp = [
    { name = "httpx" },
    { name = "httpx-sse" },
@@ -5688,18 +5816,27 @@ mcp = [
    { name = "starlette" },
    { name = "uvicorn" },
]
notion = [
    { name = "notion-client" },
]
openai = [
    { name = "openai" },
]
pinecone = [
    { name = "pinecone" },
]
pptx = [
    { name = "python-pptx" },
]
rag-upload = [
    { name = "chromadb" },
    { name = "pinecone" },
    { name = "sentence-transformers" },
    { name = "weaviate-client" },
]
rss = [
    { name = "feedparser" },
]
s3 = [
    { name = "boto3" },
]
@@ -5743,6 +5880,10 @@ dev = [

[package.metadata]
requires-dist = [
    { name = "anthropic", specifier = ">=0.76.0" },
    { name = "asciidoc", marker = "extra == 'all'", specifier = ">=10.0.0" },
    { name = "asciidoc", marker = "extra == 'asciidoc'", specifier = ">=10.0.0" },
    { name = "atlassian-python-api", marker = "extra == 'all'", specifier = ">=3.41.0" },
    { name = "atlassian-python-api", marker = "extra == 'confluence'", specifier = ">=3.41.0" },
    { name = "azure-storage-blob", marker = "extra == 'all'", specifier = ">=12.19.0" },
    { name = "azure-storage-blob", marker = "extra == 'all-cloud'", specifier = ">=12.19.0" },
    { name = "azure-storage-blob", marker = "extra == 'azure'", specifier = ">=12.19.0" },
@@ -5759,6 +5900,8 @@ requires-dist = [
    { name = "fastapi", marker = "extra == 'all'", specifier = ">=0.109.0" },
    { name = "fastapi", marker = "extra == 'embedding'", specifier = ">=0.109.0" },
    { name = "faster-whisper", marker = "extra == 'video-full'", specifier = ">=1.0.0" },
    { name = "feedparser", marker = "extra == 'all'", specifier = ">=6.0.0" },
    { name = "feedparser", marker = "extra == 'rss'", specifier = ">=6.0.0" },
    { name = "gitpython", specifier = ">=3.1.40" },
    { name = "google-cloud-storage", marker = "extra == 'all'", specifier = ">=2.10.0" },
    { name = "google-cloud-storage", marker = "extra == 'all-cloud'", specifier = ">=2.10.0" },
@@ -5778,7 +5921,11 @@ requires-dist = [
    { name = "mammoth", marker = "extra == 'docx'", specifier = ">=1.6.0" },
    { name = "mcp", marker = "extra == 'all'", specifier = ">=1.25,<2" },
    { name = "mcp", marker = "extra == 'mcp'", specifier = ">=1.25,<2" },
    { name = "nbformat", marker = "extra == 'all'", specifier = ">=5.9.0" },
    { name = "nbformat", marker = "extra == 'jupyter'", specifier = ">=5.9.0" },
    { name = "networkx", specifier = ">=3.0" },
    { name = "notion-client", marker = "extra == 'all'", specifier = ">=2.0.0" },
    { name = "notion-client", marker = "extra == 'notion'", specifier = ">=2.0.0" },
    { name = "numpy", marker = "extra == 'all'", specifier = ">=1.24.0" },
    { name = "numpy", marker = "extra == 'embedding'", specifier = ">=1.24.0" },
    { name = "openai", marker = "extra == 'all'", specifier = ">=1.0.0" },
@@ -5799,6 +5946,8 @@ requires-dist = [
    { name = "python-docx", marker = "extra == 'all'", specifier = ">=1.1.0" },
    { name = "python-docx", marker = "extra == 'docx'", specifier = ">=1.1.0" },
    { name = "python-dotenv", specifier = ">=1.1.1" },
    { name = "python-pptx", marker = "extra == 'all'", specifier = ">=0.6.21" },
    { name = "python-pptx", marker = "extra == 'pptx'", specifier = ">=0.6.21" },
    { name = "pyyaml", specifier = ">=6.0" },
    { name = "requests", specifier = ">=2.32.5" },
    { name = "scenedetect", extras = ["opencv"], marker = "extra == 'video-full'", specifier = ">=0.6.4" },
@@ -5807,6 +5956,8 @@ requires-dist = [
    { name = "sentence-transformers", marker = "extra == 'embedding'", specifier = ">=2.3.0" },
    { name = "sentence-transformers", marker = "extra == 'rag-upload'", specifier = ">=2.2.0" },
    { name = "sentence-transformers", marker = "extra == 'sentence-transformers'", specifier = ">=2.2.0" },
    { name = "slack-sdk", marker = "extra == 'all'", specifier = ">=3.27.0" },
    { name = "slack-sdk", marker = "extra == 'chat'", specifier = ">=3.27.0" },
    { name = "sse-starlette", marker = "extra == 'all'", specifier = ">=3.0.2" },
    { name = "sse-starlette", marker = "extra == 'mcp'", specifier = ">=3.0.2" },
    { name = "starlette", marker = "extra == 'all'", specifier = ">=0.48.0" },
@@ -5827,7 +5978,7 @@ requires-dist = [
    { name = "yt-dlp", marker = "extra == 'video'", specifier = ">=2024.12.0" },
    { name = "yt-dlp", marker = "extra == 'video-full'", specifier = ">=2024.12.0" },
]
provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "epub", "video", "video-full", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "embedding", "all"]
provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "epub", "video", "video-full", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "jupyter", "asciidoc", "pptx", "confluence", "notion", "rss", "chat", "embedding", "all"]

[package.metadata.requires-dev]
dev = [
@@ -5846,6 +5997,15 @@ dev = [
    { name = "starlette", specifier = ">=0.31.0" },
]

[[package]]
name = "slack-sdk"
version = "3.41.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/22/35/fc009118a13187dd9731657c60138e5a7c2dea88681a7f04dc406af5da7d/slack_sdk-3.41.0.tar.gz", hash = "sha256:eb61eb12a65bebeca9cb5d36b3f799e836ed2be21b456d15df2627cfe34076ca", size = 250568, upload-time = "2026-03-12T16:10:11.381Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/a1/df/2e4be347ff98281b505cc0ccf141408cdd25eb5ca9f3830deb361b2472d3/slack_sdk-3.41.0-py2.py3-none-any.whl", hash = "sha256:bb18dcdfff1413ec448e759cf807ec3324090993d8ab9111c74081623b692a89", size = 313885, upload-time = "2026-03-12T16:10:09.811Z" },
]

[[package]]
name = "smmap"
version = "5.0.2"
@@ -6233,6 +6393,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl", hash = "sha256:26445eca388f82e72884e0d580d5464cd801a3ea01e63e5601bdff9ba6a48de2", size = 78540, upload-time = "2024-11-24T20:12:19.698Z" },
]

[[package]]
name = "traitlets"
version = "5.14.3"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/eb/79/72064e6a701c2183016abbbfedaba506d81e30e232a68c9f0d6f6fcd1574/traitlets-5.14.3.tar.gz", hash = "sha256:9ed0579d3502c94b4b3732ac120375cda96f923114522847de4b3bb98b96b6b7", size = 161621, upload-time = "2024-04-19T11:11:49.746Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/00/c0/8f5d070730d7836adc9c9b6408dec68c6ced86b304a9b26a14df072a6e8c/traitlets-5.14.3-py3-none-any.whl", hash = "sha256:b74e89e397b1ed28cc831db7aea759ba6640cb3de13090ca145426688ff1ac4f", size = 85359, upload-time = "2024-04-19T11:11:46.763Z" },
]

[[package]]
name = "transformers"
version = "5.1.0"
@@ -6753,6 +6922,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/1f/f6/a933bd70f98e9cf3e08167fc5cd7aaaca49147e48411c0bd5ae701bb2194/wrapt-1.17.3-py3-none-any.whl", hash = "sha256:7171ae35d2c33d326ac19dd8facb1e82e5fd04ef8c6c0e394d7af55a55051c22", size = 23591, upload-time = "2025-08-12T05:53:20.674Z" },
]

[[package]]
name = "xlsxwriter"
version = "3.2.9"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/46/2c/c06ef49dc36e7954e55b802a8b231770d286a9758b3d936bd1e04ce5ba88/xlsxwriter-3.2.9.tar.gz", hash = "sha256:254b1c37a368c444eac6e2f867405cc9e461b0ed97a3233b2ac1e574efb4140c", size = 215940, upload-time = "2025-09-16T00:16:21.63Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/3a/0c/3662f4a66880196a590b202f0db82d919dd2f89e99a27fadef91c4a33d41/xlsxwriter-3.2.9-py3-none-any.whl", hash = "sha256:9a5db42bc5dff014806c58a20b9eae7322a134abb6fce3c92c181bfb275ec5b3", size = 175315, upload-time = "2025-09-16T00:16:20.108Z" },
]

[[package]]
name = "xxhash"
version = "3.6.0"
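The optional-dependency groups recorded in the lockfile above (jupyter, rss, pptx, confluence, notion, chat, asciidoc) are the kind of extras that are typically guarded at import time, as the commit message's "OpenAPI yaml import guard" fix suggests. A minimal, hypothetical helper (not from the codebase) showing the pattern:

```python
import importlib

def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or fail with the pip extra that provides it.

    Hypothetical helper illustrating the guard pattern; the real scrapers
    may guard imports inline instead.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            f"{module_name} is required for this source type; "
            f"install it with: pip install 'skill-seekers[{extra}]'"
        ) from None
```

A scraper would call, e.g., `require_optional("feedparser", "rss")` at the top of its run path, so a missing extra fails with an actionable message instead of a bare traceback.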