docs: update all documentation for 17 source types

Update 32 documentation files across English and Chinese (zh-CN) docs to
reflect the 10 new source types added in the previous commit.

Updated files:
- README.md, README.zh-CN.md: taglines, feature lists, examples, install extras
- docs/reference/: CLI_REFERENCE, FEATURE_MATRIX, MCP_REFERENCE, CONFIG_FORMAT, API_REFERENCE
- docs/features/: UNIFIED_SCRAPING with generic merge docs
- docs/advanced/: multi-source guide, MCP server guide
- docs/getting-started/: installation extras, quick-start examples
- docs/user-guide/: core-concepts, scraping, packaging, workflows (complex-merge)
- docs/: FAQ, TROUBLESHOOTING, BEST_PRACTICES, ARCHITECTURE, UNIFIED_PARSERS, README
- Root: BULLETPROOF_QUICKSTART, CONTRIBUTING, ROADMAP
- docs/zh-CN/: Chinese translations for all of the above

32 files changed, +3,016 lines, -245 lines
@@ -1,6 +1,6 @@
 # MCP Server Setup Guide

-> **Skill Seekers v3.1.0**
+> **Skill Seekers v3.2.0**
 > **Integrate with AI agents via Model Context Protocol**

 ---
@@ -143,7 +143,7 @@ skill-seekers-mcp --transport http --port 8765

 ## Available Tools

-26 tools organized by category:
+27 tools organized by category:

 ### Core Tools (9)
 - `list_configs` - List presets
@@ -156,9 +156,10 @@ skill-seekers-mcp --transport http --port 8765
 - `enhance_skill` - AI enhancement
 - `install_skill` - Complete workflow

-### Extended Tools (9)
+### Extended Tools (10)
 - `scrape_github` - GitHub repo
 - `scrape_pdf` - PDF extraction
+- `scrape_generic` - Generic scraper for 10 new source types (see below)
 - `scrape_codebase` - Local code
 - `unified_scrape` - Multi-source
 - `detect_patterns` - Pattern detection
@@ -180,6 +181,37 @@ skill-seekers-mcp --transport http --port 8765
 - `export_to_faiss`
 - `export_to_qdrant`

+### scrape_generic Tool
+
+The `scrape_generic` tool is the generic entry point for 10 new source types added in v3.2.0. It delegates to the appropriate CLI scraper module.
+
+**Supported source types:** `jupyter`, `html`, `openapi`, `asciidoc`, `pptx`, `rss`, `manpage`, `confluence`, `notion`, `chat`
+
+**Parameters:**
+
+| Name | Type | Required | Description |
+|------|------|----------|-------------|
+| `source_type` | string | Yes | One of the 10 supported source types |
+| `name` | string | Yes | Skill name for the output |
+| `path` | string | No | File or directory path (for file-based sources) |
+| `url` | string | No | URL (for URL-based sources like confluence, notion, rss) |
+
+**Usage examples:**
+
+```
+"Scrape the Jupyter notebook analysis.ipynb"
+→ scrape_generic(source_type="jupyter", name="analysis", path="analysis.ipynb")
+
+"Extract content from the API spec"
+→ scrape_generic(source_type="openapi", name="my-api", path="api-spec.yaml")
+
+"Process the PowerPoint slides"
+→ scrape_generic(source_type="pptx", name="slides", path="presentation.pptx")
+
+"Scrape the Confluence wiki"
+→ scrape_generic(source_type="confluence", name="wiki", url="https://wiki.example.com")
+```
+
 See [MCP Reference](../reference/MCP_REFERENCE.md) for full details.
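
The `scrape_generic` parameter table can be checked on the client side before a call is issued. The sketch below is a hypothetical helper and not part of Skill Seekers: `build_scrape_generic_args` and `SUPPORTED_SOURCE_TYPES` are names made up for illustration, and the rule that at least one of `path` or `url` must be given is an assumption, since the table marks both as optional.

```python
# The 10 source types listed for scrape_generic in the docs above.
SUPPORTED_SOURCE_TYPES = frozenset({
    "jupyter", "html", "openapi", "asciidoc", "pptx",
    "rss", "manpage", "confluence", "notion", "chat",
})


def build_scrape_generic_args(source_type, name, path=None, url=None):
    """Return an argument dict for a scrape_generic call.

    Hypothetical client-side helper; the real tool may validate differently.
    """
    if source_type not in SUPPORTED_SOURCE_TYPES:
        raise ValueError(f"unsupported source_type: {source_type!r}")
    if not name:
        raise ValueError("name is required")
    # Assumption: a call needs at least one locator, even though the
    # table marks both path and url as optional.
    if path is None and url is None:
        raise ValueError("provide path or url")

    args = {"source_type": source_type, "name": name}
    if path is not None:
        args["path"] = path
    if url is not None:
        args["url"] = url
    return args
```

Calling `build_scrape_generic_args("jupyter", "analysis", path="analysis.ipynb")` yields the same arguments as the first usage example, while an unlisted type such as `"pdf"` is rejected before any tool call is made.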

 ---

@@ -1,28 +1,34 @@
 # Multi-Source Scraping Guide

-> **Skill Seekers v3.1.0**
-> **Combine documentation, code, and PDFs into one skill**
+> **Skill Seekers v3.2.0**
+> **Combine 17 source types into one unified skill**

 ---

 ## What is Multi-Source Scraping?

-Combine multiple sources into a single, comprehensive skill:
+Combine multiple sources into a single, comprehensive skill. Skill Seekers supports **17 source types** that can be freely mixed and matched:

 ```
 ┌──────────────┐
-│ Documentation │──┐
-│ (Web docs)   │  │
-└──────────────┘  │
-                  │
-┌──────────────┐  │     ┌──────────────────┐
-│ GitHub Repo  │──┼────▶│  Unified Skill   │
-│ (Source code)│  │     │  (Single source  │
-└──────────────┘  │     │   of truth)      │
-                  │     └──────────────────┘
-┌──────────────┐  │
-│ PDF Manual   │──┘
-│ (Reference)  │
+│ Documentation│──┐
+│ (Web docs)   │  │
+├──────────────┤  │
+│ GitHub Repo  │  │
+│ (Source code) │  │
+├──────────────┤  │     ┌──────────────────┐
+│ PDF / Word / │  │     │  Unified Skill   │
+│ EPUB / PPTX  │──┼────▶│  (Single source  │
+├──────────────┤  │     │   of truth)      │
+│ Video /      │  │     └──────────────────┘
+│ Jupyter / HTML│  │
+├──────────────┤  │
+│ OpenAPI /    │  │
+│ AsciiDoc /   │  │
+│ RSS / Man    │  │
+├──────────────┤  │
+│ Confluence / │──┘
+│ Notion / Chat│
 └──────────────┘
 ```

@@ -38,6 +44,14 @@ Combine multiple sources into a single, comprehensive skill:
 | Product + API | Docs + OpenAPI spec | Usage + reference |
 | Legacy + Current | PDF + Web docs | Complete history |
 | Internal + External | Local code + Public docs | Full context |
+| Data Science Project | Jupyter + GitHub + Docs | Code + notebooks + docs |
+| Enterprise Wiki | Confluence + GitHub + Video | Wiki + code + tutorials |
+| API-First Product | OpenAPI + Docs + Jupyter | Spec + docs + examples |
+| CLI Tool | Man pages + GitHub + AsciiDoc | Reference + code + docs |
+| Team Knowledge | Notion + Slack/Discord + Docs | Notes + discussions + docs |
+| Book + Code | EPUB + GitHub + PDF | Theory + implementation |
+| Presentations + Code | PowerPoint + GitHub + Docs | Slides + code + reference |
+| Content Feed | RSS/Atom + Docs + GitHub | Updates + docs + code |

 ### Benefits

@@ -75,9 +89,9 @@ Combine multiple sources into a single, comprehensive skill:

 ---

-## Source Types
+## Source Types (17 Supported)

-### 1. Documentation
+### 1. Documentation (Web)

 ```json
 {
@@ -127,6 +141,139 @@ Combine multiple sources into a single, comprehensive skill:
 }
 ```

+### 5. Word Document (.docx)
+
+```json
+{
+  "type": "word",
+  "name": "product-spec",
+  "path": "docs/specification.docx"
+}
+```
+
+### 6. Video (YouTube/Vimeo/Local)
+
+```json
+{
+  "type": "video",
+  "name": "tutorial-video",
+  "url": "https://www.youtube.com/watch?v=example",
+  "language": "en"
+}
+```
+
+### 7. EPUB
+
+```json
+{
+  "type": "epub",
+  "name": "programming-book",
+  "path": "books/python-guide.epub"
+}
+```
+
+### 8. Jupyter Notebook
+
+```json
+{
+  "type": "jupyter",
+  "name": "analysis-notebooks",
+  "path": "notebooks/data-analysis.ipynb"
+}
+```
+
+### 9. Local HTML
+
+```json
+{
+  "type": "html",
+  "name": "exported-docs",
+  "path": "exports/documentation.html"
+}
+```
+
+### 10. OpenAPI/Swagger
+
+```json
+{
+  "type": "openapi",
+  "name": "api-spec",
+  "path": "specs/openapi.yaml"
+}
+```
+
+### 11. AsciiDoc
+
+```json
+{
+  "type": "asciidoc",
+  "name": "technical-docs",
+  "path": "docs/manual.adoc"
+}
+```
+
+### 12. PowerPoint (.pptx)
+
+```json
+{
+  "type": "pptx",
+  "name": "architecture-deck",
+  "path": "presentations/architecture.pptx"
+}
+```
+
+### 13. RSS/Atom Feed
+
+```json
+{
+  "type": "rss",
+  "name": "release-feed",
+  "url": "https://blog.example.com/releases.xml"
+}
+```
+
+### 14. Man Pages
+
+```json
+{
+  "type": "manpage",
+  "name": "cli-reference",
+  "path": "man/mytool.1"
+}
+```
+
+### 15. Confluence
+
+```json
+{
+  "type": "confluence",
+  "name": "team-wiki",
+  "base_url": "https://company.atlassian.net/wiki",
+  "space_key": "ENGINEERING"
+}
+```
+
+### 16. Notion
+
+```json
+{
+  "type": "notion",
+  "name": "project-docs",
+  "workspace": "my-workspace",
+  "root_page_id": "abc123def456"
+}
+```
+
+### 17. Slack/Discord (Chat)
+
+```json
+{
+  "type": "chat",
+  "name": "team-discussions",
+  "path": "exports/slack-export/"
+}
+```
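
The per-type JSON snippets for source types 5 through 17 each use a different locator key (`path`, `url`, or a pair such as `base_url` plus `space_key`). A minimal sketch of a pre-flight check follows, assuming the keys shown in those snippets are the ones each type needs. `EXAMPLE_KEYS` and `missing_keys` are hypothetical names, and the project's real config validation may accept other keys or enforce different rules.

```python
# Locator keys copied from the example snippets for source types 5-17.
# Assumption: these are required; the real validator may differ.
EXAMPLE_KEYS = {
    "word": ("path",),
    "video": ("url",),
    "epub": ("path",),
    "jupyter": ("path",),
    "html": ("path",),
    "openapi": ("path",),
    "asciidoc": ("path",),
    "pptx": ("path",),
    "rss": ("url",),
    "manpage": ("path",),
    "confluence": ("base_url", "space_key"),
    "notion": ("workspace", "root_page_id"),
    "chat": ("path",),
}


def missing_keys(source):
    """Return the example locator keys that a source entry lacks."""
    source_type = source.get("type")
    if source_type not in EXAMPLE_KEYS:
        raise ValueError(f"no example keys recorded for type {source_type!r}")
    return [key for key in EXAMPLE_KEYS[source_type] if key not in source]
```

For instance, a `confluence` entry that sets `base_url` but omits `space_key` is reported with `["space_key"]`, which is cheaper to catch here than partway through a long scrape.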
+
 ---

 ## Complete Example
@@ -240,6 +387,21 @@ Uses defined rules for merging:
 skill-seekers unified --config my-config.json --merge-mode rule-based
 ```

+### Generic Merge System
+
+When combining source types beyond the standard docs+github+pdf trio, the **generic merge system** (`_generic_merge()` in `unified_skill_builder.py`) handles any combination automatically. It uses pairwise synthesis for known combos (docs+github, docs+pdf, github+pdf) and falls back to a generic merging strategy for all other source type combinations.
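
The dispatch rule described for the generic merge system (pairwise synthesis for the three known combos, a generic fallback otherwise) can be pictured roughly as follows. This is an illustrative sketch only and is not taken from `unified_skill_builder.py`: `KNOWN_PAIRS` and `choose_merge_strategy` are invented names, and how the real code treats the full docs+github+pdf trio is not covered here.

```python
# The three combos the docs name as having pairwise synthesis.
KNOWN_PAIRS = {
    frozenset({"docs", "github"}),
    frozenset({"docs", "pdf"}),
    frozenset({"github", "pdf"}),
}


def choose_merge_strategy(source_types):
    """Pick 'pairwise' for a known two-type combo, else 'generic'.

    Hypothetical illustration of the rule in the paragraph above;
    duplicate types (e.g. two jupyter sources) collapse to one.
    """
    present = frozenset(source_types)
    return "pairwise" if present in KNOWN_PAIRS else "generic"
```

Under these assumptions, a config with `docs` and `github` sources selects the pairwise path, while `confluence` plus `github` plus `openapi` falls through to the generic strategy.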
+
+### AI-Powered Multi-Source Merging
+
+For complex multi-source projects, use the `complex-merge.yaml` workflow preset to apply AI-powered merging:
+
+```bash
+skill-seekers unified --config my-config.json \
+  --enhance-workflow complex-merge
+```
+
+This workflow uses Claude to intelligently reconcile content from disparate source types, resolving conflicts and creating coherent cross-references between sources that would otherwise be difficult to merge deterministically.
+
 ---

 ## Conflict Detection
@@ -319,7 +481,9 @@ output/react-complete/
   "sources": [
     {"type": "docs", "name": "official-docs"},
     {"type": "github", "name": "source-code"},
-    {"type": "pdf", "name": "legacy-reference"}
+    {"type": "pdf", "name": "legacy-reference"},
+    {"type": "openapi", "name": "api-spec"},
+    {"type": "confluence", "name": "team-wiki"}
   ]
 }
 ```
@@ -406,14 +570,40 @@ skill-seekers unified --config my-config.json --merge-mode rule-based
 }
 ```

-### API + Documentation
+### Docs + OpenAPI Spec

 ```json
 {
   "name": "stripe-complete",
   "sources": [
     {"type": "docs", "base_url": "https://stripe.com/docs"},
-    {"type": "pdf", "pdf_path": "stripe-api-reference.pdf"}
+    {"type": "openapi", "path": "specs/stripe-openapi.yaml"}
   ]
 }
 ```
+
+### Code + Jupyter Notebooks
+
+```json
+{
+  "name": "ml-project",
+  "sources": [
+    {"type": "github", "repo": "org/ml-pipeline"},
+    {"type": "jupyter", "path": "notebooks/training.ipynb"},
+    {"type": "jupyter", "path": "notebooks/evaluation.ipynb"}
+  ]
+}
+```
+
+### Confluence + GitHub
+
+```json
+{
+  "name": "internal-platform",
+  "sources": [
+    {"type": "confluence", "base_url": "https://company.atlassian.net/wiki", "space_key": "PLATFORM"},
+    {"type": "github", "repo": "company/platform-core"},
+    {"type": "openapi", "path": "specs/platform-api.yaml"}
+  ]
+}
+```
@@ -430,6 +620,32 @@ skill-seekers unified --config my-config.json --merge-mode rule-based
 }
 ```

+### CLI Tool (Man Pages + GitHub + AsciiDoc)
+
+```json
+{
+  "name": "mytool-complete",
+  "sources": [
+    {"type": "manpage", "path": "man/mytool.1"},
+    {"type": "github", "repo": "org/mytool"},
+    {"type": "asciidoc", "path": "docs/user-guide.adoc"}
+  ]
+}
+```
+
+### Team Knowledge (Notion + Chat + Video)
+
+```json
+{
+  "name": "onboarding-knowledge",
+  "sources": [
+    {"type": "notion", "workspace": "engineering", "root_page_id": "abc123"},
+    {"type": "chat", "path": "exports/slack-engineering/"},
+    {"type": "video", "url": "https://www.youtube.com/playlist?list=PLonboarding"}
+  ]
+}
+```
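
Multi-source configs like the ones in these examples can also be generated from a script rather than written by hand. The sketch below assumes only the two top-level keys the examples show (`name` and `sources`); `write_unified_config` is a hypothetical helper, not a Skill Seekers API, and it performs no validation of the individual source entries.

```python
import json
from pathlib import Path


def write_unified_config(name, sources, config_path):
    """Write a config file with the top-level keys used in the examples.

    Hypothetical helper: it only serializes the data and returns the path.
    """
    config = {"name": name, "sources": list(sources)}
    path = Path(config_path)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path
```

The resulting file can then be passed to the command already shown in this guide, for example `skill-seekers unified --config my-config.json`.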
+
 ---

 ## See Also