feat: add 10 new skill source types (17 total) with full pipeline integration

Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint,
RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new
skill source types. Each type is fully integrated across:

- Standalone CLI commands (skill-seekers <type>)
- Auto-detection via 'skill-seekers create' (file extension + content sniffing)
- Unified multi-source configs (scraped_data, dispatch, config validation)
- Unified skill builder (generic merge + source-attributed synthesis)
- MCP server (scrape_generic tool with per-type flag mapping)
- pyproject.toml (entry points, optional deps, [all] group)

Also fixes: EPUB unified pipeline gap, missing word/video config validators,
OpenAPI yaml import guard, MCP flag mismatch for all 10 new types, and stale
docstrings. Also adds 77 integration tests and a complex-merge workflow.

Author: yusyus
Date: 2026-03-15 15:30:15 +03:00
Commit: 53b911b697 (parent: 64403a3686)

50 changed files with 20,193 additions and 856 deletions


@@ -3,16 +3,16 @@
 Skill Seeker MCP Server (FastMCP Implementation)
 Modern, decorator-based MCP server using FastMCP for simplified tool registration.
-Provides 33 tools for generating Claude AI skills from documentation.
+Provides 34 tools for generating Claude AI skills from documentation.
 This is a streamlined alternative to server.py (2200 lines → 708 lines, 68% reduction).
 All tool implementations are delegated to modular tool files in tools/ directory.
 **Architecture:**
 - FastMCP server with decorator-based tool registration
-- 33 tools organized into 7 categories:
+- 34 tools organized into 7 categories:
   * Config tools (3): generate_config, list_configs, validate_config
-  * Scraping tools (10): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video, scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns
+  * Scraping tools (11): estimate_pages, scrape_docs, scrape_github, scrape_pdf, scrape_video, scrape_codebase, detect_patterns, extract_test_examples, build_how_to_guides, extract_config_patterns, scrape_generic
   * Packaging tools (4): package_skill, upload_skill, enhance_skill, install_skill
   * Splitting tools (2): split_config, generate_router
   * Source tools (5): fetch_config, submit_config, add_config_source, list_config_sources, remove_config_source
@@ -97,6 +97,7 @@ try:
     remove_config_source_impl,
     scrape_codebase_impl,
     scrape_docs_impl,
+    scrape_generic_impl,
     scrape_github_impl,
     scrape_pdf_impl,
     scrape_video_impl,
@@ -141,6 +142,7 @@ except ImportError:
     remove_config_source_impl,
     scrape_codebase_impl,
     scrape_docs_impl,
+    scrape_generic_impl,
    scrape_github_impl,
     scrape_pdf_impl,
     scrape_video_impl,
@@ -301,7 +303,7 @@ async def sync_config(
 # ============================================================================
-# SCRAPING TOOLS (10 tools)
+# SCRAPING TOOLS (11 tools)
 # ============================================================================
@@ -823,6 +825,50 @@ async def extract_config_patterns(
     return str(result)
+
+
+@safe_tool_decorator(
+    description="Scrape content from new source types: jupyter, html, openapi, asciidoc, pptx, confluence, notion, rss, manpage, chat. A generic entry point that delegates to the appropriate CLI scraper module."
+)
+async def scrape_generic(
+    source_type: str,
+    name: str,
+    path: str | None = None,
+    url: str | None = None,
+) -> str:
+    """
+    Scrape content from various source types and build a skill.
+
+    A generic scraper that supports 10 new source types. It delegates to the
+    corresponding CLI scraper module (e.g., skill_seekers.cli.jupyter_scraper).
+    File-based types (jupyter, html, openapi, asciidoc, pptx, manpage, chat)
+    typically use the 'path' parameter. URL-based types (confluence, notion, rss)
+    typically use the 'url' parameter.
+
+    Args:
+        source_type: Source type to scrape. One of: jupyter, html, openapi,
+            asciidoc, pptx, confluence, notion, rss, manpage, chat.
+        name: Skill name for the output.
+        path: File or directory path (for file-based sources like jupyter, html, pptx).
+        url: URL (for URL-based sources like confluence, notion, rss).
+
+    Returns:
+        Scraping results with file paths and statistics.
+    """
+    args = {
+        "source_type": source_type,
+        "name": name,
+    }
+    if path:
+        args["path"] = path
+    if url:
+        args["url"] = url
+
+    result = await scrape_generic_impl(args)
+    if isinstance(result, list) and result:
+        return result[0].text if hasattr(result[0], "text") else str(result[0])
+    return str(result)
 # ============================================================================
 # PACKAGING TOOLS (4 tools)
 # ============================================================================