feat: add 10 new skill source types (17 total) with full pipeline integration

Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint, RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new skill source types.

Each type is fully integrated across:
- Standalone CLI commands (skill-seekers <type>)
- Auto-detection via 'skill-seekers create' (file extension + content sniffing)
- Unified multi-source configs (scraped_data, dispatch, config validation)
- Unified skill builder (generic merge + source-attributed synthesis)
- MCP server (scrape_generic tool with per-type flag mapping)
- pyproject.toml (entry points, optional deps, [all] group)

Also fixes: EPUB unified pipeline gap, missing word/video config validators, OpenAPI yaml import guard, MCP flag mismatch for all 10 types, stale docstrings, and adds 77 integration tests + complex-merge workflow.

50 files changed, +20,201 lines
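
Illustrative note on the auto-detection item: the sketch below shows one way "file extension + content sniffing" could work, using the extensions listed in the "Supported Source Types" table of the AGENTS.md diff. It is not the code in `source_detector.py`. The names `EXTENSION_TO_TYPE` and `guess_source_type` are hypothetical, and the OpenAPI check (looking for an `openapi:` or `swagger:` key near the top of the file) is an assumption about what the sniffing might test.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical mapping built from the "Detection" column of the AGENTS.md table.
EXTENSION_TO_TYPE = {
    ".pdf": "pdf",
    ".docx": "word",
    ".epub": "epub",
    ".ipynb": "jupyter",
    ".html": "html",
    ".htm": "html",
    ".adoc": "asciidoc",
    ".asciidoc": "asciidoc",
    ".pptx": "pptx",
    ".rss": "rss",
    ".atom": "rss",
    ".man": "manpage",
}
# Man page sections .1 through .8 all map to the same type.
EXTENSION_TO_TYPE.update({f".{n}": "manpage" for n in range(1, 9)})


def guess_source_type(name: str, head: str = "") -> str | None:
    """Guess a config type from a file name and an optional content prefix."""
    suffix = Path(name).suffix.lower()
    if suffix in (".yaml", ".yml"):
        # Content sniffing (assumed rule): YAML counts as OpenAPI only if
        # the start of the file declares a spec version key.
        lowered = head.lower()
        if "openapi:" in lowered or "swagger:" in lowered:
            return "openapi"
        return None
    return EXTENSION_TO_TYPE.get(suffix)
```

The YAML case is handled separately because the extension alone is ambiguous: a `.yaml` file may be an OpenAPI spec or any other config, so only that branch inspects content.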
2026-03-15 15:30:15 +03:00
parent 64403a3686
commit 53b911b697
50 changed files with 20193 additions and 856 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,866 +1,171 @@
 # AGENTS.md - Skill Seekers

-Essential guidance for AI coding agents working with the Skill Seekers codebase.
+Concise reference for AI coding agents. Skill Seekers is a Python CLI tool (v3.2.0) that converts documentation sites, GitHub repos, PDFs, videos, notebooks, wikis, and more into AI-ready skills for 16+ LLM platforms and RAG pipelines.

---
-
-## Project Overview
-
-**Skill Seekers** is a Python CLI tool that converts documentation websites, GitHub repositories, PDF files, and videos into AI-ready skills for LLM platforms and RAG (Retrieval-Augmented Generation) pipelines. It serves as the universal preprocessing layer for AI systems.
-
-### Key Facts
-
-| Attribute | Value |
-|-----------|-------|
-| **Current Version** | 3.1.3 |
-| **Python Version** | 3.10+ (tested on 3.10, 3.11, 3.12, 3.13) |
-| **License** | MIT |
-| **Package Name** | `skill-seekers` (PyPI) |
-| **Source Files** | 182 Python files |
-| **Test Files** | 105+ test files |
-| **Website** | https://skillseekersweb.com/ |
-| **Repository** | https://github.com/yusufkaraaslan/Skill_Seekers |
-
-### Supported Target Platforms
-
-| Platform | Format | Use Case |
-|----------|--------|----------|
-| **Claude AI** | ZIP + YAML | Claude Code skills |
-| **Google Gemini** | tar.gz | Gemini skills |
-| **OpenAI ChatGPT** | ZIP + Vector Store | Custom GPTs |
-| **LangChain** | Documents | QA chains, agents, retrievers |
-| **LlamaIndex** | TextNodes | Query engines, chat engines |
-| **Haystack** | Documents | Enterprise RAG pipelines |
-| **Pinecone** | Ready for upsert | Production vector search |
-| **Weaviate** | Vector objects | Vector database |
-| **Qdrant** | Points | Vector database |
-| **Chroma** | Documents | Local vector database |
-| **FAISS** | Index files | Local similarity search |
-| **Cursor IDE** | .cursorrules | AI coding assistant rules |
-| **Windsurf** | .windsurfrules | AI coding rules |
-| **Cline** | .clinerules + MCP | VS Code extension |
-| **Continue.dev** | HTTP context | Universal IDE support |
-| **Generic Markdown** | ZIP | Universal export |
-
-### Core Workflow
-
-1. **Scrape Phase** - Crawl documentation/GitHub/PDF/video sources
-2. **Build Phase** - Organize content into categorized references
-3. **Enhancement Phase** - AI-powered quality improvements (optional)
-4. **Package Phase** - Create platform-specific packages
-5. **Upload Phase** - Auto-upload to target platform (optional)
-
---
-
-## Project Structure
-
-```
-/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/
-├── src/skill_seekers/              # Main source code (src/ layout)
-│   ├── cli/                        # CLI tools and commands (~70 modules)
-│   │   ├── adaptors/               # Platform adaptors (Strategy pattern)
-│   │   │   ├── base.py             # Abstract base class (SkillAdaptor)
-│   │   │   ├── claude.py           # Claude AI adaptor
-│   │   │   ├── gemini.py           # Google Gemini adaptor
-│   │   │   ├── openai.py           # OpenAI ChatGPT adaptor
-│   │   │   ├── markdown.py         # Generic Markdown adaptor
-│   │   │   ├── chroma.py           # Chroma vector DB adaptor
-│   │   │   ├── faiss_helpers.py    # FAISS index adaptor
-│   │   │   ├── haystack.py         # Haystack RAG adaptor
-│   │   │   ├── langchain.py        # LangChain adaptor
-│   │   │   ├── llama_index.py      # LlamaIndex adaptor
-│   │   │   ├── qdrant.py           # Qdrant vector DB adaptor
-│   │   │   ├── weaviate.py         # Weaviate vector DB adaptor
-│   │   │   └── streaming_adaptor.py # Streaming output adaptor
-│   │   ├── arguments/              # CLI argument definitions
-│   │   ├── parsers/                # Argument parsers
-│   │   │   └── extractors/         # Content extractors
-│   │   ├── presets/                # Preset configuration management
-│   │   ├── storage/                # Cloud storage adaptors
-│   │   ├── main.py                 # Unified CLI entry point
-│   │   ├── create_command.py       # Unified create command
-│   │   ├── doc_scraper.py          # Documentation scraper
-│   │   ├── github_scraper.py       # GitHub repository scraper
-│   │   ├── pdf_scraper.py          # PDF extraction
-│   │   ├── word_scraper.py         # Word document scraper
-│   │   ├── video_scraper.py        # Video extraction
-│   │   ├── video_setup.py          # GPU detection & dependency installation
-│   │   ├── unified_scraper.py      # Multi-source scraping
-│   │   ├── codebase_scraper.py     # Local codebase analysis
-│   │   ├── enhance_command.py      # AI enhancement command
-│   │   ├── enhance_skill_local.py  # AI enhancement (local mode)
-│   │   ├── package_skill.py        # Skill packager
-│   │   ├── upload_skill.py         # Upload to platforms
-│   │   ├── cloud_storage_cli.py    # Cloud storage CLI
-│   │   ├── benchmark_cli.py        # Benchmarking CLI
-│   │   ├── sync_cli.py             # Sync monitoring CLI
-│   │   └── workflows_command.py    # Workflow management CLI
-│   ├── mcp/                        # MCP server integration
-│   │   ├── server_fastmcp.py       # FastMCP server (~708 lines)
-│   │   ├── server_legacy.py        # Legacy server implementation
-│   │   ├── server.py               # Server entry point
-│   │   ├── agent_detector.py       # AI agent detection
-│   │   ├── git_repo.py             # Git repository operations
-│   │   ├── source_manager.py       # Config source management
-│   │   └── tools/                  # MCP tool implementations
-│   │       ├── config_tools.py     # Configuration tools
-│   │       ├── packaging_tools.py  # Packaging tools
-│   │       ├── scraping_tools.py   # Scraping tools
-│   │       ├── source_tools.py     # Source management tools
-│   │       ├── splitting_tools.py  # Config splitting tools
-│   │       ├── vector_db_tools.py  # Vector database tools
-│   │       └── workflow_tools.py   # Workflow management tools
-│   ├── sync/                       # Sync monitoring module
-│   │   ├── detector.py             # Change detection
-│   │   ├── models.py               # Data models (Pydantic)
-│   │   ├── monitor.py              # Monitoring logic
-│   │   └── notifier.py             # Notification system
-│   ├── benchmark/                  # Benchmarking framework
-│   │   ├── framework.py            # Benchmark framework
-│   │   ├── models.py               # Benchmark models
-│   │   └── runner.py               # Benchmark runner
-│   ├── embedding/                  # Embedding server
-│   │   ├── server.py               # FastAPI embedding server
-│   │   ├── generator.py            # Embedding generation
-│   │   ├── cache.py                # Embedding cache
-│   │   └── models.py               # Embedding models
-│   ├── workflows/                  # YAML workflow presets (66 presets)
-│   ├── _version.py                 # Version information (reads from pyproject.toml)
-│   └── __init__.py                 # Package init
-├── tests/                          # Test suite (105+ test files)
-├── configs/                        # Preset configuration files
-├── docs/                           # Documentation (80+ markdown files)
-│   ├── integrations/               # Platform integration guides
-│   ├── guides/                     # User guides
-│   ├── reference/                  # API reference
-│   ├── features/                   # Feature documentation
-│   ├── blog/                       # Blog posts
-│   └── roadmap/                    # Roadmap documents
-├── examples/                       # Usage examples
-├── .github/workflows/              # CI/CD workflows
-├── pyproject.toml                  # Main project configuration
-├── requirements.txt                # Pinned dependencies
-├── mypy.ini                        # MyPy type checker configuration
-├── Dockerfile                      # Main Docker image (multi-stage)
-├── Dockerfile.mcp                  # MCP server Docker image
-└── docker-compose.yml              # Full stack deployment
-```
-
---
-
-## Build and Development Commands
-
-### Prerequisites
-
- Python 3.10 or higher
- pip or uv package manager
- Git (for GitHub scraping features)
-
-### Setup (REQUIRED before any development)
+## Setup

 ```bash
-# Install in editable mode (REQUIRED for tests due to src/ layout)
+# REQUIRED before running tests (src/ layout — tests fail without this)
 pip install -e .
-
-# Install with all platform dependencies
-pip install -e ".[all-llms]"
-
-# Install with all optional dependencies
-pip install -e ".[all]"
-
-# Install specific platforms only
-pip install -e ".[gemini]"    # Google Gemini support
-pip install -e ".[openai]"    # OpenAI ChatGPT support
-pip install -e ".[mcp]"       # MCP server dependencies
-pip install -e ".[s3]"        # AWS S3 support
-pip install -e ".[gcs]"       # Google Cloud Storage
-pip install -e ".[azure]"     # Azure Blob Storage
-pip install -e ".[embedding]" # Embedding server support
-pip install -e ".[rag-upload]" # Vector DB upload support
-
-# Install dev dependencies (using dependency-groups)
+# With dev tools
 pip install -e ".[dev]"
+# With all optional deps
+pip install -e ".[all]"
 ```

-**CRITICAL:** The project uses a `src/` layout. Tests WILL FAIL unless you install with `pip install -e .` first.
-
-### Building
+## Build / Test / Lint Commands

 ```bash
-# Build package using uv (recommended)
-uv build
-
-# Or using standard build
-python -m build
-
-# Publish to PyPI
-uv publish
-```
-
-### Docker
-
-```bash
-# Build Docker image
-docker build -t skill-seekers .
-
-# Run with docker-compose (includes vector databases)
-docker-compose up -d
-
-# Run MCP server only
-docker-compose up -d mcp-server
-
-# View logs
-docker-compose logs -f mcp-server
-```
-
---
-
-## Testing Instructions
-
-### Running Tests
-
-**CRITICAL:** Never skip tests - all tests must pass before commits.
-
-```bash
-# All tests (must run pip install -e . first!)
+# Run ALL tests (never skip tests — all must pass before commits)
 pytest tests/ -v

-# Specific test file
+# Run a single test file
 pytest tests/test_scraper_features.py -v
-pytest tests/test_mcp_fastmcp.py -v
-pytest tests/test_cloud_storage.py -v

-# With coverage
-pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html
-
-# Single test
+# Run a single test function
 pytest tests/test_scraper_features.py::test_detect_language -v

-# E2E tests
-pytest tests/test_e2e_three_stream_pipeline.py -v
+# Run a single test class method
+pytest tests/test_adaptors/test_claude_adaptor.py::TestClaudeAdaptor::test_package -v

-# Skip slow tests
-pytest tests/ -v -m "not slow"
-
-# Run only integration tests
-pytest tests/ -v -m integration
-
-# Run only specific marker
+# Skip slow/integration tests
 pytest tests/ -v -m "not slow and not integration"
-```

-### Test Architecture
+# With coverage
+pytest tests/ --cov=src/skill_seekers --cov-report=term

- **105+ test files** covering all features
- **CI Matrix:** Ubuntu + macOS, Python 3.10-3.12
- Test markers defined in `pyproject.toml`:
-
-| Marker | Description |
-|--------|-------------|
-| `slow` | Tests taking >5 seconds |
-| `integration` | Requires external services (APIs) |
-| `e2e` | End-to-end tests (resource-intensive) |
-| `venv` | Requires virtual environment setup |
-| `bootstrap` | Bootstrap skill specific |
-| `benchmark` | Performance benchmark tests |
-
-### Test Configuration
-
-From `pyproject.toml`:
-```toml
-[tool.pytest.ini_options]
-testpaths = ["tests"]
-python_files = ["test_*.py"]
-addopts = "-v --tb=short --strict-markers"
-asyncio_mode = "auto"
-asyncio_default_fixture_loop_scope = "function"
-```
-
-The `conftest.py` file checks that the package is installed before running tests.
-
---
-
-## Code Style Guidelines
-
-### Linting and Formatting
-
-```bash
-# Run ruff linter
+# Lint (ruff)
 ruff check src/ tests/
-
-# Run ruff formatter check
-ruff format --check src/ tests/
-
-# Auto-fix issues
 ruff check src/ tests/ --fix
+
+# Format (ruff)
+ruff format --check src/ tests/
 ruff format src/ tests/

-# Run mypy type checker
+# Type check (mypy)
 mypy src/skill_seekers --show-error-codes --pretty
 ```

-### Style Rules (from pyproject.toml)
+**Test markers:** `slow`, `integration`, `e2e`, `venv`, `bootstrap`, `benchmark`
+**Async tests:** use `@pytest.mark.asyncio`; asyncio_mode is `auto`.

+## Code Style
+
+### Formatting Rules (ruff — from pyproject.toml)
 - **Line length:** 100 characters
 - **Target Python:** 3.10+
- **Enabled rules:** E, W, F, I, B, C4, UP, ARG, SIM
- **Ignored rules:** E501, F541, ARG002, B007, I001, SIM114
- **Import sorting:** isort style with `skill_seekers` as first-party
+- **Enabled lint rules:** E, W, F, I, B, C4, UP, ARG, SIM
+- **Ignored rules:** E501 (line length handled by formatter), F541 (f-string style), ARG002 (unused method args for interface compliance), B007 (intentional unused loop vars), I001 (formatter handles imports), SIM114 (readability preference)

-### MyPy Configuration (from pyproject.toml)
+### Imports
+- Sort with isort (via ruff); `skill_seekers` is first-party
+- Standard library → third-party → first-party, separated by blank lines
+- Use `from __future__ import annotations` only if needed for forward refs
+- Guard optional imports with try/except ImportError (see `adaptors/__init__.py` pattern)

-```toml
-[tool.mypy]
-python_version = "3.10"
-warn_return_any = true
-warn_unused_configs = true
-disallow_untyped_defs = false
-disallow_incomplete_defs = false
-check_untyped_defs = true
-ignore_missing_imports = true
-show_error_codes = true
-pretty = true
+### Naming Conventions
+- **Files:** `snake_case.py`
+- **Classes:** `PascalCase` (e.g., `SkillAdaptor`, `ClaudeAdaptor`)
+- **Functions/methods:** `snake_case`
+- **Constants:** `UPPER_CASE` (e.g., `ADAPTORS`, `DEFAULT_CHUNK_TOKENS`)
+- **Private:** prefix with `_`
+
+### Type Hints
+- Gradual typing — add hints where practical, not enforced everywhere
+- Use modern syntax: `str | None` not `Optional[str]`, `list[str]` not `List[str]`
+- MyPy config: `disallow_untyped_defs = false`, `check_untyped_defs = true`, `ignore_missing_imports = true`
+
+### Docstrings
+- Module-level docstring on every file (triple-quoted, describes purpose)
+- Google-style or standard docstrings for public functions/classes
+- Include `Args:`, `Returns:`, `Raises:` sections where useful
+
+### Error Handling
+- Use specific exceptions, never bare `except:`
+- Provide helpful error messages with context (see `get_adaptor()` in `adaptors/__init__.py`)
+- Use `raise ValueError(...)` for invalid arguments, `raise RuntimeError(...)` for state errors
+- Guard optional dependency imports with try/except and give clear install instructions on failure
+
+### Suppressing Lint Warnings
+- Use inline `# noqa: XXXX` comments (e.g., `# noqa: F401` for re-exports, `# noqa: ARG001` for required but unused params)
+
+## Supported Source Types (17)
+
+| Type | CLI Command | Config Type | Detection |
+|------|------------|-------------|-----------|
+| Documentation (web) | `scrape` / `create <url>` | `documentation` | HTTP/HTTPS URLs |
+| GitHub repo | `github` / `create owner/repo` | `github` | `owner/repo` or github.com URLs |
+| PDF | `pdf` / `create file.pdf` | `pdf` | `.pdf` extension |
+| Word (.docx) | `word` / `create file.docx` | `word` | `.docx` extension |
+| EPUB | `epub` / `create file.epub` | `epub` | `.epub` extension |
+| Video | `video` / `create <url/file>` | `video` | YouTube/Vimeo URLs, video extensions |
+| Local codebase | `analyze` / `create ./path` | `local` | Directory paths |
+| Jupyter Notebook | `jupyter` / `create file.ipynb` | `jupyter` | `.ipynb` extension |
+| Local HTML | `html` / `create file.html` | `html` | `.html`/`.htm` extensions |
+| OpenAPI/Swagger | `openapi` / `create spec.yaml` | `openapi` | `.yaml`/`.yml` with OpenAPI content |
+| AsciiDoc | `asciidoc` / `create file.adoc` | `asciidoc` | `.adoc`/`.asciidoc` extensions |
+| PowerPoint | `pptx` / `create file.pptx` | `pptx` | `.pptx` extension |
+| RSS/Atom | `rss` / `create feed.rss` | `rss` | `.rss`/`.atom` extensions |
+| Man pages | `manpage` / `create cmd.1` | `manpage` | `.1`-`.8`/`.man` extensions |
+| Confluence | `confluence` | `confluence` | API or export directory |
+| Notion | `notion` | `notion` | API or export directory |
+| Slack/Discord | `chat` | `chat` | Export directory or API |
+
+## Project Layout
+
+```
+src/skill_seekers/           # Main package (src/ layout)
+  cli/                       # CLI commands and entry points
+    adaptors/                # Platform adaptors (Strategy pattern, inherit SkillAdaptor)
+    arguments/               # CLI argument definitions (one per source type)
+    parsers/                 # Subcommand parsers (one per source type)
+    storage/                 # Cloud storage (inherit BaseStorageAdaptor)
+    main.py                  # Unified CLI entry point (COMMAND_MODULES dict)
+    source_detector.py       # Auto-detects source type from user input
+    create_command.py        # Unified `create` command routing
+    config_validator.py      # VALID_SOURCE_TYPES set + per-type validation
+    unified_scraper.py       # Multi-source orchestrator (scraped_data + dispatch)
+    unified_skill_builder.py # Pairwise synthesis + generic merge
+  mcp/                       # MCP server (FastMCP + legacy)
+    tools/                   # MCP tool implementations by category
+  sync/                      # Sync monitoring (Pydantic models)
+  benchmark/                 # Benchmarking framework
+  embedding/                 # FastAPI embedding server
+  workflows/                 # 67 YAML workflow presets (includes complex-merge.yaml)
+  _version.py                # Reads version from pyproject.toml
+tests/                       # 115+ test files (pytest)
+configs/                     # Preset JSON scraping configs
+docs/                        # 80+ markdown doc files
 ```

-### Code Conventions
+## Key Patterns

-1. **Use type hints** where practical (gradual typing approach)
-2. **Docstrings:** Use Google-style or standard docstrings
-3. **Error handling:** Use specific exceptions, provide helpful messages
-4. **Async code:** Use `asyncio`, mark tests with `@pytest.mark.asyncio`
-5. **File naming:** Use snake_case for all Python files
-6. **Class naming:** Use PascalCase for classes
-7. **Function naming:** Use snake_case for functions and methods
-8. **Constants:** Use UPPER_CASE for module-level constants
+**Adaptor (Strategy) pattern** — all platform logic in `cli/adaptors/`. Inherit `SkillAdaptor`, implement `format_skill_md()`, `package()`, `upload()`. Register in `adaptors/__init__.py` ADAPTORS dict.

---
+**Scraper pattern** — each source type has: `cli/<type>_scraper.py` (with `<Type>ToSkillConverter` class + `main()`), `arguments/<type>.py`, `parsers/<type>_parser.py`. Register in `parsers/__init__.py` PARSERS list, `main.py` COMMAND_MODULES dict, `config_validator.py` VALID_SOURCE_TYPES set.

-## Architecture Patterns
+**Unified pipeline** — `unified_scraper.py` dispatches to per-type `_scrape_<type>()` methods. `unified_skill_builder.py` uses pairwise synthesis for docs+github+pdf combos and `_generic_merge()` for all other combinations.

-### Platform Adaptor Pattern (Strategy Pattern)
+**MCP tools** — grouped in `mcp/tools/` by category. `scrape_generic_tool` handles all new source types.

-All platform-specific logic is encapsulated in adaptors:
-
-```python
-from skill_seekers.cli.adaptors import get_adaptor
-
-# Get platform-specific adaptor
-adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'langchain', etc.
-
-# Package skill
-adaptor.package(skill_dir='output/react/', output_path='output/')
-
-# Upload to platform
-adaptor.upload(
-    package_path='output/react-gemini.tar.gz',
-    api_key=os.getenv('GOOGLE_API_KEY')
-)
-```
-
-Each adaptor inherits from `SkillAdaptor` base class and implements:
- `format_skill_md()` - Format SKILL.md content
- `package()` - Create platform-specific package
- `upload()` - Upload to platform API
- `validate_api_key()` - Validate API key format
- `supports_enhancement()` - Whether AI enhancement is supported
-
-### CLI Architecture (Git-style)
-
-Entry point: `src/skill_seekers/cli/main.py`
-
-The CLI uses subcommands that delegate to existing modules:
-
-```bash
-# skill-seekers scrape --config react.json
-# Transforms to: doc_scraper.main() with modified sys.argv
-```
-
-**Available subcommands:**
- `create` - Unified create command
- `config` - Configuration wizard
- `scrape` - Documentation scraping
- `github` - GitHub repository scraping
- `pdf` - PDF extraction
- `word` - Word document extraction
- `video` - Video extraction (YouTube or local). Use `--setup` to auto-detect GPU and install visual deps.
- `unified` - Multi-source scraping
- `analyze` / `codebase` - Local codebase analysis
- `enhance` - AI enhancement
- `package` - Package skill for target platform
- `upload` - Upload to platform
- `cloud` - Cloud storage operations
- `sync` - Sync monitoring
- `benchmark` - Performance benchmarking
- `embed` - Embedding server
- `install` / `install-agent` - Complete workflow
- `stream` - Streaming ingestion
- `update` - Incremental updates
- `multilang` - Multi-language support
- `quality` - Quality metrics
- `resume` - Resume interrupted jobs
- `estimate` - Estimate page counts
- `workflows` - Workflow management
-
-### MCP Server Architecture
-
-Two implementations:
- `server_fastmcp.py` - Modern, decorator-based (recommended, ~708 lines)
- `server_legacy.py` - Legacy implementation
-
-Tools are organized by category:
- Config tools (3 tools): generate_config, list_configs, validate_config
- Scraping tools (10 tools): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video (supports `setup` parameter for GPU detection and visual dep installation), scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns
- Packaging tools (4 tools): package_skill, upload_skill, enhance_skill, install_skill
- Source tools (5 tools): fetch_config, submit_config, add_config_source, list_config_sources, remove_config_source
- Splitting tools (2 tools): split_config, generate_router
- Vector Database tools (4 tools): export_to_weaviate, export_to_chroma, export_to_faiss, export_to_qdrant
- Workflow tools (5 tools): list_workflows, get_workflow, create_workflow, update_workflow, delete_workflow
-
-**Running MCP Server:**
-```bash
-# Stdio transport (default)
-python -m skill_seekers.mcp.server_fastmcp
-
-# HTTP transport
-python -m skill_seekers.mcp.server_fastmcp --http --port 8765
-```
-
-### Cloud Storage Architecture
-
-Abstract base class pattern for cloud providers:
- `base_storage.py` - Defines `BaseStorageAdaptor` interface
- `s3_storage.py` - AWS S3 implementation
- `gcs_storage.py` - Google Cloud Storage implementation
- `azure_storage.py` - Azure Blob Storage implementation
-
-### Sync Monitoring Architecture
-
-Pydantic-based models in `src/skill_seekers/sync/`:
- `models.py` - Data models (SyncConfig, ChangeReport, SyncState)
- `detector.py` - Change detection logic
- `monitor.py` - Monitoring daemon
- `notifier.py` - Notification system (webhook, email, slack)
-
---
+**CLI subcommands** — git-style in `cli/main.py`. Each delegates to a module's `main()` function.

 ## Git Workflow

-### Branch Structure
+- **`main`** — production, protected
+- **`development`** — default PR target, active dev
+- Feature branches created from `development`

-```
-main (production)
-  ↑
-  │ (only maintainer merges)
-  │
-development (integration) ← default branch for PRs
-  ↑
-  │ (all contributor PRs go here)
-  │
-feature branches
-```
-
- **`main`** - Production, always stable, protected
- **`development`** - Active development, default for PRs
- **Feature branches** - Your work, created from `development`
-
-### Creating a Feature Branch
+## Pre-commit Checklist

 ```bash
-# 1. Checkout development
-git checkout development
-git pull upstream development
-
-# 2. Create feature branch
-git checkout -b my-feature
-
-# 3. Make changes, commit, push
-git add .
-git commit -m "Add my feature"
-git push origin my-feature
-
-# 4. Create PR targeting 'development' branch
-```
-
---
-
-## CI/CD Configuration
-
-### GitHub Actions Workflows
-
-All workflows are in `.github/workflows/`:
-
-**`tests.yml`:**
- Runs on: push/PR to `main` and `development`
- Lint job: Ruff + MyPy
- Test matrix: Ubuntu + macOS, Python 3.10-3.12
- Coverage: Uploads to Codecov
-
-**`release.yml`:**
- Triggered on version tags (`v*`)
- Builds and publishes to PyPI using `uv`
- Creates GitHub release with changelog
-
-**`docker-publish.yml`:**
- Builds and publishes Docker images
- Multi-architecture support (linux/amd64, linux/arm64)
-
-**`vector-db-export.yml`:**
- Tests vector database exports
-
-**`scheduled-updates.yml`:**
- Scheduled sync monitoring
-
-**`quality-metrics.yml`:**
- Quality metrics tracking
-
-**`test-vector-dbs.yml`:**
- Vector database integration tests
-
-### Pre-commit Checks (Manual)
-
-```bash
-# Before committing, run:
 ruff check src/ tests/
 ruff format --check src/ tests/
-pytest tests/ -v -x  # Stop on first failure
+pytest tests/ -v -x   # stop on first failure
 ```

---
+Never commit API keys. Use env vars: `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `OPENAI_API_KEY`, `GITHUB_TOKEN`.

-## Security Considerations
+## CI

-### API Keys and Secrets
-
-1. **Never commit API keys** to the repository
-2. **Use environment variables:**
-   - `ANTHROPIC_API_KEY` - Claude AI
-   - `GOOGLE_API_KEY` - Google Gemini
-   - `OPENAI_API_KEY` - OpenAI
-   - `GITHUB_TOKEN` - GitHub API
-   - `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` - AWS S3
-   - `GOOGLE_APPLICATION_CREDENTIALS` - GCS
-   - `AZURE_STORAGE_CONNECTION_STRING` - Azure
-3. **Configuration storage:**
-   - Stored at `~/.config/skill-seekers/config.json`
-   - Permissions: 600 (owner read/write only)
-
-### Rate Limit Handling
-
- GitHub API has rate limits (5000 requests/hour for authenticated)
- The tool has built-in rate limit handling with retry logic
- Use `--non-interactive` flag for CI/CD environments
-
-### Custom API Endpoints
-
-Support for Claude-compatible APIs:
-
-```bash
-export ANTHROPIC_API_KEY=your-custom-api-key
-export ANTHROPIC_BASE_URL=https://custom-endpoint.com/v1
-```
-
---
-
-## Common Development Tasks
-
-### Adding a New CLI Command
-
-1. Create module in `src/skill_seekers/cli/my_command.py`
-2. Implement `main()` function with argument parsing
-3. Add entry point in `pyproject.toml`:
-   ```toml
-   [project.scripts]
-   skill-seekers-my-command = "skill_seekers.cli.my_command:main"
-   ```
-4. Add subcommand handler in `src/skill_seekers/cli/main.py`
-5. Add argument parser in `src/skill_seekers/cli/parsers/`
-6. Add tests in `tests/test_my_command.py`
-
-### Adding a New Platform Adaptor
-
-1. Create `src/skill_seekers/cli/adaptors/my_platform.py`
-2. Inherit from `SkillAdaptor` base class
-3. Implement required methods: `package()`, `upload()`, `format_skill_md()`
-4. Register in `src/skill_seekers/cli/adaptors/__init__.py`
-5. Add optional dependencies in `pyproject.toml`
-6. Add tests in `tests/test_adaptors/`
-
-### Adding an MCP Tool
-
-1. Implement tool logic in `src/skill_seekers/mcp/tools/category_tools.py`
-2. Register in `src/skill_seekers/mcp/server_fastmcp.py`
-3. Add test in `tests/test_mcp_fastmcp.py`
-
-### Adding Cloud Storage Provider
-
-1. Create module in `src/skill_seekers/cli/storage/my_storage.py`
-2. Inherit from `BaseStorageAdaptor` base class
-3. Implement required methods: `upload_file()`, `download_file()`, `list_files()`, `delete_file()`
-4. Register in `src/skill_seekers/cli/storage/__init__.py`
-5. Add optional dependencies in `pyproject.toml`
-
---
-
-## Documentation
-
-### Project Documentation (New Structure - v3.1.0+)
-
-**Entry Points:**
- **README.md** - Main project documentation with navigation
- **docs/README.md** - Documentation hub
- **AGENTS.md** - This file, for AI coding agents
-
-**Getting Started (for new users):**
- `docs/getting-started/01-installation.md` - Installation guide
- `docs/getting-started/02-quick-start.md` - 3 commands to first skill
- `docs/getting-started/03-your-first-skill.md` - Complete walkthrough
- `docs/getting-started/04-next-steps.md` - Where to go from here
-
-**User Guides (common tasks):**
- `docs/user-guide/01-core-concepts.md` - How Skill Seekers works
- `docs/user-guide/02-scraping.md` - All scraping options
- `docs/user-guide/03-enhancement.md` - AI enhancement explained
- `docs/user-guide/04-packaging.md` - Export to platforms
- `docs/user-guide/05-workflows.md` - Enhancement workflows
- `docs/user-guide/06-troubleshooting.md` - Common issues
-
-**Reference (technical details):**
- `docs/reference/CLI_REFERENCE.md` - Complete command reference (20 commands)
- `docs/reference/MCP_REFERENCE.md` - MCP tools reference (33 tools)
- `docs/reference/CONFIG_FORMAT.md` - JSON configuration specification
- `docs/reference/ENVIRONMENT_VARIABLES.md` - All environment variables
-
-**Advanced (power user topics):**
- `docs/advanced/mcp-server.md` - MCP server setup
- `docs/advanced/mcp-tools.md` - Advanced MCP usage
- `docs/advanced/custom-workflows.md` - Creating custom workflows
- `docs/advanced/multi-source.md` - Multi-source scraping
-
-### Configuration Documentation
-
-Preset configs are in `configs/` directory:
- `godot.json` / `godot_unified.json` - Godot Engine
- `blender.json` / `blender-unified.json` - Blender Engine
- `claude-code.json` - Claude Code
- `httpx_comprehensive.json` - HTTPX library
- `medusa-mercurjs.json` - Medusa/MercurJS
- `astrovalley_unified.json` - Astrovalley
- `react.json` - React documentation
- `configs/integrations/` - Integration-specific configs
-
---
-
-## Key Dependencies
-
-### Core Dependencies (Required)
-
-| Package | Version | Purpose |
-|---------|---------|---------|
-| `requests` | >=2.32.5 | HTTP requests |
-| `beautifulsoup4` | >=4.14.2 | HTML parsing |
-| `PyGithub` | >=2.5.0 | GitHub API |
-| `GitPython` | >=3.1.40 | Git operations |
-| `httpx` | >=0.28.1 | Async HTTP |
-| `anthropic` | >=0.76.0 | Claude AI API |
-| `PyMuPDF` | >=1.24.14 | PDF processing |
-| `Pillow` | >=11.0.0 | Image processing |
-| `pytesseract` | >=0.3.13 | OCR |
-| `pydantic` | >=2.12.3 | Data validation |
-| `pydantic-settings` | >=2.11.0 | Settings management |
-| `click` | >=8.3.0 | CLI framework |
-| `Pygments` | >=2.19.2 | Syntax highlighting |
-| `pathspec` | >=0.12.1 | Path matching |
-| `networkx` | >=3.0 | Graph operations |
-| `schedule` | >=1.2.0 | Scheduled tasks |
-| `python-dotenv` | >=1.1.1 | Environment variables |
-| `jsonschema` | >=4.25.1 | JSON validation |
-| `PyYAML` | >=6.0 | YAML parsing |
-| `langchain` | >=1.2.10 | LangChain integration |
-| `llama-index` | >=0.14.15 | LlamaIndex integration |
-
-### Optional Dependencies
-
-| Feature | Package | Install Command |
-|---------|---------|-----------------|
-| MCP Server | `mcp>=1.25,<2` | `pip install -e ".[mcp]"` |
-| Google Gemini | `google-generativeai>=0.8.0` | `pip install -e ".[gemini]"` |
-| OpenAI | `openai>=1.0.0` | `pip install -e ".[openai]"` |
-| AWS S3 | `boto3>=1.34.0` | `pip install -e ".[s3]"` |
-| Google Cloud Storage | `google-cloud-storage>=2.10.0` | `pip install -e ".[gcs]"` |
-| Azure Blob Storage | `azure-storage-blob>=12.19.0` | `pip install -e ".[azure]"` |
-| Word Documents | `mammoth>=1.6.0`, `python-docx>=1.1.0` | `pip install -e ".[docx]"` |
-| Video (lightweight) | `yt-dlp>=2024.12.0`, `youtube-transcript-api>=1.2.0` | `pip install -e ".[video]"` |
-| Video (full) | +`faster-whisper`, `scenedetect`, `opencv-python-headless` (`easyocr` now installed via `--setup`) | `pip install -e ".[video-full]"` |
-| Video (GPU setup) | Auto-detects GPU, installs PyTorch + easyocr + all visual deps | `skill-seekers video --setup` |
-| Chroma DB | `chromadb>=0.4.0` | `pip install -e ".[chroma]"` |
-| Weaviate | `weaviate-client>=3.25.0` | `pip install -e ".[weaviate]"` |
-| Pinecone | `pinecone>=5.0.0` | `pip install -e ".[pinecone]"` |
-| Embedding Server | `fastapi>=0.109.0`, `uvicorn>=0.27.0`, `sentence-transformers>=2.3.0` | `pip install -e ".[embedding]"` |
-
-### Dev Dependencies (in dependency-groups)
-
-| Package | Version | Purpose |
-|---------|---------|---------|
-| `pytest` | >=8.4.2 | Testing framework |
-| `pytest-asyncio` | >=0.24.0 | Async test support |
-| `pytest-cov` | >=7.0.0 | Coverage |
-| `coverage` | >=7.11.0 | Coverage reporting |
-| `ruff` | >=0.14.13 | Linting/formatting |
-| `mypy` | >=1.19.1 | Type checking |
-| `psutil` | >=5.9.0 | Process utilities for testing |
-| `numpy` | >=1.24.0 | Numerical operations |
-| `starlette` | >=0.31.0 | HTTP transport testing |
-| `httpx` | >=0.24.0 | HTTP client for testing |
-| `boto3` | >=1.26.0 | AWS S3 testing |
-| `google-cloud-storage` | >=2.10.0 | GCS testing |
-| `azure-storage-blob` | >=12.17.0 | Azure testing |
-
---
-
-## Troubleshooting
-
-### Common Issues
-
-**ImportError: No module named 'skill_seekers'**
- Solution: Run `pip install -e .`
-
-**Tests failing with "package not installed"**
- Solution: Ensure you ran `pip install -e .` in the correct virtual environment
-
-**MCP server import errors**
- Solution: Install with `pip install -e ".[mcp]"`
-
-**Type checking failures**
- MyPy is configured to be lenient (gradual typing)
- Focus on critical paths, not full coverage
-
-**Docker build failures**
- Ensure you have BuildKit enabled: `DOCKER_BUILDKIT=1`
- Check that all submodules are initialized: `git submodule update --init`
-
-**Rate limit errors from GitHub**
- Set `GITHUB_TOKEN` environment variable for authenticated requests
- Improves rate limit from 60 to 5000 requests/hour
-
-### Getting Help
-
- Check **TROUBLESHOOTING.md** for detailed solutions
- Review **docs/FAQ.md** for common questions
- Visit https://skillseekersweb.com/ for documentation
- Open an issue on GitHub with:
-  - Clear title and description
-  - Steps to reproduce
-  - Expected vs actual behavior
-  - Environment details (OS, Python version)
-  - Error messages and stack traces
-
---
-
-## Environment Variables Reference
-
-| Variable | Purpose | Required For |
-|----------|---------|--------------|
-| `ANTHROPIC_API_KEY` | Claude AI API access | Claude enhancement/upload |
-| `GOOGLE_API_KEY` | Google Gemini API access | Gemini enhancement/upload |
-| `OPENAI_API_KEY` | OpenAI API access | OpenAI enhancement/upload |
-| `GITHUB_TOKEN` | GitHub API authentication | GitHub scraping (recommended) |
-| `AWS_ACCESS_KEY_ID` | AWS S3 authentication | S3 cloud storage |
-| `AWS_SECRET_ACCESS_KEY` | AWS S3 authentication | S3 cloud storage |
-| `GOOGLE_APPLICATION_CREDENTIALS` | GCS authentication path | GCS cloud storage |
-| `AZURE_STORAGE_CONNECTION_STRING` | Azure Blob authentication | Azure cloud storage |
-| `ANTHROPIC_BASE_URL` | Custom Claude endpoint | Custom API endpoints |
-| `SKILL_SEEKERS_HOME` | Data directory path | Docker/runtime |
-| `SKILL_SEEKERS_OUTPUT` | Output directory path | Docker/runtime |
-
---
-
-## Version Management
-
-The version is defined in `pyproject.toml` and dynamically read by `src/skill_seekers/_version.py`:
-
-```python
-# _version.py reads from pyproject.toml
-__version__ = get_version()  # Returns version from pyproject.toml
-```
-
-**To update version:**
-1. Edit `version` in `pyproject.toml`
-2. The `_version.py` file will automatically pick up the new version
-
---
-
-## Configuration File Format
-
-Skill Seekers uses JSON configuration files to define scraping targets. Example structure:
-
-```json
-{
-  "name": "godot",
-  "description": "Godot Engine documentation",
-  "merge_mode": "claude-enhanced",
-  "sources": [
-    {
-      "type": "documentation",
-      "base_url": "https://docs.godotengine.org/en/stable/",
-      "extract_api": true,
-      "selectors": {
-        "main_content": "div[role='main']",
-        "title": "title",
-        "code_blocks": "pre"
-      },
-      "url_patterns": {
-        "include": [],
-        "exclude": ["/search.html", "/_static/"]
-      },
-      "categories": {
-        "getting_started": ["introduction", "getting_started"],
-        "scripting": ["scripting", "gdscript"]
-      },
-      "rate_limit": 0.5,
-      "max_pages": 500
-    },
-    {
-      "type": "github",
-      "repo": "godotengine/godot",
-      "enable_codebase_analysis": true,
-      "code_analysis_depth": "deep",
-      "fetch_issues": true,
-      "max_issues": 100
-    }
-  ]
-}
-```
-
---
-
-## Workflow Presets
-
-Skill Seekers includes 66 YAML workflow presets for AI enhancement in `src/skill_seekers/workflows/`:
-
-**Built-in presets:**
- `default.yaml` - Standard enhancement workflow
- `minimal.yaml` - Fast, minimal enhancement
- `security-focus.yaml` - Security-focused review
- `architecture-comprehensive.yaml` - Deep architecture analysis
- `api-documentation.yaml` - API documentation focus
- And 61 more specialized presets...
-
-**Usage:**
-```bash
-# Apply a preset
-skill-seekers create ./my-project --enhance-workflow security-focus
-
-# Chain multiple presets
-skill-seekers create ./my-project --enhance-workflow security-focus --enhance-workflow minimal
-
-# Manage presets
-skill-seekers workflows list
-skill-seekers workflows show security-focus
-skill-seekers workflows copy security-focus
-```
-
---
-
-*This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.*
-
-*Last updated: 2026-03-01*
+GitHub Actions (`.github/workflows/tests.yml`): ruff + mypy lint job, then pytest matrix (Ubuntu + macOS, Python 3.10-3.12) with Codecov upload.