skill-seekers-reference/docs/advanced/multi-source.md
yusyus 37cb307455 docs: update all documentation for 17 source types
Update 32 documentation files across English and Chinese (zh-CN) docs
to reflect the 10 new source types added in the previous commit.

Updated files:
- README.md, README.zh-CN.md — taglines, feature lists, examples, install extras
- docs/reference/ — CLI_REFERENCE, FEATURE_MATRIX, MCP_REFERENCE, CONFIG_FORMAT, API_REFERENCE
- docs/features/ — UNIFIED_SCRAPING with generic merge docs
- docs/advanced/ — multi-source guide, MCP server guide
- docs/getting-started/ — installation extras, quick-start examples
- docs/user-guide/ — core-concepts, scraping, packaging, workflows (complex-merge)
- docs/ — FAQ, TROUBLESHOOTING, BEST_PRACTICES, ARCHITECTURE, UNIFIED_PARSERS, README
- Root — BULLETPROOF_QUICKSTART, CONTRIBUTING, ROADMAP
- docs/zh-CN/ — Chinese translations for all of the above

32 files changed, +3,016 lines, -245 lines
2026-03-15 15:56:04 +03:00


Multi-Source Scraping Guide

Skill Seekers v3.2.0
Combine 17 source types into one unified skill


What is Multi-Source Scraping?

Multi-source scraping combines several sources into a single, comprehensive skill. Skill Seekers supports 17 source types that can be freely mixed and matched:

┌───────────────┐
│ Documentation │──┐
│ (Web docs)    │  │
├───────────────┤  │
│ GitHub Repo   │  │
│ (Source code) │  │
├───────────────┤  │     ┌──────────────────┐
│ PDF / Word /  │  │     │  Unified Skill   │
│ EPUB / PPTX   │──┼────▶│  (Single source  │
├───────────────┤  │     │   of truth)      │
│ Video /       │  │     └──────────────────┘
│ Jupyter / HTML│  │
├───────────────┤  │
│ OpenAPI /     │  │
│ AsciiDoc /    │  │
│ RSS / Man     │  │
├───────────────┤  │
│ Confluence /  │──┘
│ Notion / Chat │
└───────────────┘

When to Use Multi-Source

Use Cases

| Scenario | Sources | Benefit |
| --- | --- | --- |
| Framework + Examples | Docs + GitHub repo | Theory + practice |
| Product + API | Docs + OpenAPI spec | Usage + reference |
| Legacy + Current | PDF + Web docs | Complete history |
| Internal + External | Local code + Public docs | Full context |
| Data Science Project | Jupyter + GitHub + Docs | Code + notebooks + docs |
| Enterprise Wiki | Confluence + GitHub + Video | Wiki + code + tutorials |
| API-First Product | OpenAPI + Docs + Jupyter | Spec + docs + examples |
| CLI Tool | Man pages + GitHub + AsciiDoc | Reference + code + docs |
| Team Knowledge | Notion + Slack/Discord + Docs | Notes + discussions + docs |
| Book + Code | EPUB + GitHub + PDF | Theory + implementation |
| Presentations + Code | PowerPoint + GitHub + Docs | Slides + code + reference |
| Content Feed | RSS/Atom + Docs + GitHub | Updates + docs + code |

Benefits

  • Single source of truth - One skill with all context
  • Conflict detection - Find doc/code discrepancies
  • Cross-references - Link between sources
  • Comprehensive - No gaps in knowledge

Creating Unified Configs

Basic Structure

{
  "name": "my-framework-complete",
  "description": "Complete documentation and code",
  "merge_mode": "claude-enhanced",
  
  "sources": [
    {
      "type": "docs",
      "name": "documentation",
      "base_url": "https://docs.example.com/"
    },
    {
      "type": "github",
      "name": "source-code",
      "repo": "owner/repo"
    }
  ]
}
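
Before running a scrape, it can help to sanity-check a config against the structure shown above. The following is a minimal sketch, not the tool's own validator: `check_unified_config` is a hypothetical helper, and the sets of accepted types and merge modes are simply copied from this guide, so treat them as assumptions about what the tool accepts.

```python
import json

# Source types and merge modes as listed in this guide (assumed, not
# read from the tool's schema).
KNOWN_TYPES = {
    "docs", "github", "pdf", "local", "word", "video", "epub", "jupyter",
    "html", "openapi", "asciidoc", "pptx", "rss", "manpage", "confluence",
    "notion", "chat",
}
MERGE_MODES = {"claude-enhanced", "rule-based"}


def check_unified_config(config: dict) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if not config.get("name"):
        problems.append("missing top-level 'name'")
    mode = config.get("merge_mode")
    if mode is not None and mode not in MERGE_MODES:
        problems.append(f"unknown merge_mode: {mode!r}")
    sources = config.get("sources")
    if not isinstance(sources, list) or not sources:
        problems.append("'sources' must be a non-empty list")
        return problems
    for i, source in enumerate(sources):
        if source.get("type") not in KNOWN_TYPES:
            problems.append(f"sources[{i}]: unknown type {source.get('type')!r}")
    return problems


config = json.loads("""
{
  "name": "my-framework-complete",
  "merge_mode": "claude-enhanced",
  "sources": [
    {"type": "docs", "name": "documentation", "base_url": "https://docs.example.com/"},
    {"type": "github", "name": "source-code", "repo": "owner/repo"}
  ]
}
""")
print(check_unified_config(config))
```

An empty list means the config has the basic shape; it says nothing about whether the URLs or repositories are reachable.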

Source Types (17 Supported)

1. Documentation (Web)

{
  "type": "docs",
  "name": "official-docs",
  "base_url": "https://docs.framework.com/",
  "max_pages": 500,
  "categories": {
    "getting_started": ["intro", "quickstart"],
    "api": ["reference", "api"]
  }
}

2. GitHub Repository

{
  "type": "github",
  "name": "source-code",
  "repo": "facebook/react",
  "fetch_issues": true,
  "max_issues": 100,
  "enable_codebase_analysis": true
}

3. PDF Document

{
  "type": "pdf",
  "name": "legacy-manual",
  "pdf_path": "docs/legacy-manual.pdf",
  "enable_ocr": false
}

4. Local Codebase

{
  "type": "local",
  "name": "internal-tools",
  "directory": "./internal-lib",
  "languages": ["Python", "JavaScript"]
}

5. Word Document (.docx)

{
  "type": "word",
  "name": "product-spec",
  "path": "docs/specification.docx"
}

6. Video (YouTube/Vimeo/Local)

{
  "type": "video",
  "name": "tutorial-video",
  "url": "https://www.youtube.com/watch?v=example",
  "language": "en"
}

7. EPUB

{
  "type": "epub",
  "name": "programming-book",
  "path": "books/python-guide.epub"
}

8. Jupyter Notebook

{
  "type": "jupyter",
  "name": "analysis-notebooks",
  "path": "notebooks/data-analysis.ipynb"
}

9. Local HTML

{
  "type": "html",
  "name": "exported-docs",
  "path": "exports/documentation.html"
}

10. OpenAPI/Swagger

{
  "type": "openapi",
  "name": "api-spec",
  "path": "specs/openapi.yaml"
}

11. AsciiDoc

{
  "type": "asciidoc",
  "name": "technical-docs",
  "path": "docs/manual.adoc"
}

12. PowerPoint (.pptx)

{
  "type": "pptx",
  "name": "architecture-deck",
  "path": "presentations/architecture.pptx"
}

13. RSS/Atom Feed

{
  "type": "rss",
  "name": "release-feed",
  "url": "https://blog.example.com/releases.xml"
}

14. Man Pages

{
  "type": "manpage",
  "name": "cli-reference",
  "path": "man/mytool.1"
}

15. Confluence

{
  "type": "confluence",
  "name": "team-wiki",
  "base_url": "https://company.atlassian.net/wiki",
  "space_key": "ENGINEERING"
}

16. Notion

{
  "type": "notion",
  "name": "project-docs",
  "workspace": "my-workspace",
  "root_page_id": "abc123def456"
}

17. Slack/Discord (Chat)

{
  "type": "chat",
  "name": "team-discussions",
  "path": "exports/slack-export/"
}
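
Each source type in the examples above points at its content with a different key (`base_url`, `repo`, `pdf_path`, and so on), which is easy to get wrong when mixing many types. The sketch below collects those keys into one table and reports a source that lacks its key. The mapping is inferred from this guide's examples only; the tool may accept other keys (for instance, a local path for `video`), and `missing_locator` is a hypothetical helper.

```python
# Locator key used for each source type in this guide's examples.
# Inferred from the examples, not from the tool's schema.
LOCATOR_KEY = {
    "docs": "base_url",
    "github": "repo",
    "pdf": "pdf_path",
    "local": "directory",
    "word": "path",
    "video": "url",
    "epub": "path",
    "jupyter": "path",
    "html": "path",
    "openapi": "path",
    "asciidoc": "path",
    "pptx": "path",
    "rss": "url",
    "manpage": "path",
    "confluence": "base_url",
    "notion": "workspace",
    "chat": "path",
}


def missing_locator(source: dict):
    """Return a message if the source lacks its locator key, else None."""
    kind = source.get("type")
    key = LOCATOR_KEY.get(kind)
    if key is None:
        return f"unknown source type {kind!r}"
    if key not in source:
        return f"{kind} source needs {key!r}"
    return None


print(missing_locator({"type": "pdf", "name": "legacy-manual", "path": "docs/legacy-manual.pdf"}))
```

The example call uses `path` where the PDF example uses `pdf_path`, so it reports the mismatch instead of returning `None`.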

Complete Example

React Complete Skill

{
  "name": "react-complete",
  "description": "React - docs, source, and guides",
  "merge_mode": "claude-enhanced",
  
  "sources": [
    {
      "type": "docs",
      "name": "react-docs",
      "base_url": "https://react.dev/",
      "max_pages": 300,
      "categories": {
        "getting_started": ["learn", "tutorial"],
        "api": ["reference", "hooks"],
        "advanced": ["concurrent", "suspense"]
      }
    },
    {
      "type": "github",
      "name": "react-source",
      "repo": "facebook/react",
      "fetch_issues": true,
      "max_issues": 50,
      "enable_codebase_analysis": true,
      "code_analysis_depth": "deep"
    },
    {
      "type": "pdf",
      "name": "react-patterns",
      "pdf_path": "downloads/react-patterns.pdf"
    }
  ],
  
  "conflict_detection": {
    "enabled": true,
    "rules": [
      {
        "field": "api_signature",
        "action": "flag_mismatch"
      },
      {
        "field": "version",
        "action": "warn_outdated"
      }
    ]
  },
  
  "output_structure": {
    "group_by_source": false,
    "cross_reference": true
  }
}

Running Unified Scraping

Basic Command

skill-seekers unified --config react-complete.json

With Options

# Fresh start (ignore cache)
skill-seekers unified --config react-complete.json --fresh

# Dry run
skill-seekers unified --config react-complete.json --dry-run

# Rule-based merging
skill-seekers unified --config react-complete.json --merge-mode rule-based
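
When the command is driven from a script, building the argument list in one place keeps the flags consistent. This is a small sketch that uses only the flags shown above (`--config`, `--fresh`, `--dry-run`, `--merge-mode`); `unified_command` is a hypothetical wrapper, not part of Skill Seekers.

```python
import shlex


def unified_command(config_path, *, fresh=False, dry_run=False, merge_mode=None):
    """Build the argv list for `skill-seekers unified` from the documented flags."""
    cmd = ["skill-seekers", "unified", "--config", config_path]
    if fresh:
        cmd.append("--fresh")
    if dry_run:
        cmd.append("--dry-run")
    if merge_mode:
        cmd += ["--merge-mode", merge_mode]
    return cmd


cmd = unified_command("react-complete.json", dry_run=True, merge_mode="rule-based")
# Print a shell-safe rendering of the command for review.
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing a list instead of a single string avoids shell quoting problems with config paths that contain spaces.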

Merge Modes

claude-enhanced (Default)

Uses AI to intelligently merge sources:

  • Detects relationships between content
  • Resolves conflicts intelligently
  • Creates cross-references
  • Best quality, slower

skill-seekers unified --config my-config.json --merge-mode claude-enhanced

rule-based

Uses defined rules for merging:

  • Faster
  • Deterministic
  • Less sophisticated

skill-seekers unified --config my-config.json --merge-mode rule-based

Generic Merge System

When a config combines source types beyond the standard docs+github+pdf trio, the generic merge system (_generic_merge() in unified_skill_builder.py) handles the combination automatically. It uses pairwise synthesis for the known pairs (docs+github, docs+pdf, github+pdf) and falls back to a generic merge strategy for every other combination of source types.
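
The choice between pairwise synthesis and the generic fallback can be pictured as a lookup on the set of source types in a config. The sketch below only illustrates that decision; it is not the code of `_generic_merge()`. `choose_strategy` is a hypothetical name, and treating anything other than an exact known pair (including the three-source trio) as "generic" is an assumption made to keep the example small.

```python
# Known pairs named in this guide; order within a pair does not matter.
KNOWN_PAIRS = {
    frozenset({"docs", "github"}),
    frozenset({"docs", "pdf"}),
    frozenset({"github", "pdf"}),
}


def choose_strategy(source_types):
    """Return 'pairwise' for an exact known pair, otherwise 'generic'."""
    kinds = frozenset(source_types)
    if kinds in KNOWN_PAIRS:
        return "pairwise"
    return "generic"


print(choose_strategy(["github", "docs"]))
print(choose_strategy(["notion", "chat", "video"]))
```

Using a `frozenset` makes the lookup independent of the order in which sources appear in the config.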

AI-Powered Multi-Source Merging

For complex multi-source projects, use the complex-merge.yaml workflow preset to apply AI-powered merging:

skill-seekers unified --config my-config.json \
  --enhance-workflow complex-merge

This workflow uses Claude to intelligently reconcile content from disparate source types, resolving conflicts and creating coherent cross-references between sources that would otherwise be difficult to merge deterministically.


Conflict Detection

Automatic Detection

Conflict detection finds discrepancies between sources. Enable it in the config and list the rules to apply:

{
  "conflict_detection": {
    "enabled": true,
    "rules": [
      {
        "field": "api_signature",
        "action": "flag_mismatch"
      },
      {
        "field": "version",
        "action": "warn_outdated"
      },
      {
        "field": "deprecation",
        "action": "highlight"
      }
    ]
  }
}
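
To make the `api_signature` rule concrete, the sketch below compares signatures extracted from docs with those extracted from code and flags every mismatch. It illustrates the idea only: the function name, the input dictionaries, and the shape of each conflict record are hypothetical and are not the tool's detection logic or the format of its conflict report.

```python
def find_signature_conflicts(docs_api: dict, code_api: dict) -> list:
    """Flag names whose signature differs between docs and code."""
    conflicts = []
    # Only names present in both sources can disagree.
    for name in sorted(docs_api.keys() & code_api.keys()):
        if docs_api[name] != code_api[name]:
            conflicts.append({
                "field": "api_signature",
                "action": "flag_mismatch",
                "name": name,
                "docs": docs_api[name],
                "code": code_api[name],
            })
    return conflicts


# Hypothetical extracted signatures, for illustration only.
docs_api = {"connect": "connect(host, port)", "close": "close()"}
code_api = {"connect": "connect(host, port, timeout=30)", "close": "close()"}

for conflict in find_signature_conflicts(docs_api, code_api):
    print(conflict["name"], "|", conflict["docs"], "|", conflict["code"])
```

Here only `connect` is flagged, because the documented signature omits the `timeout` parameter found in the code.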

Conflict Report

After scraping, check for conflicts:

# Conflicts are reported in output
ls output/react-complete/conflicts.json

# Or call the detect_conflicts MCP tool from an MCP client (not a shell command)
detect_conflicts({
  "docs_source": "output/react-docs",
  "code_source": "output/react-source"
})

Output Structure

Merged Output

output/react-complete/
├── SKILL.md                    # Combined skill
├── references/
│   ├── index.md               # Master index
│   ├── getting_started.md     # From docs
│   ├── api_reference.md       # From docs
│   ├── source_overview.md     # From GitHub
│   ├── code_examples.md       # From GitHub
│   └── patterns.md            # From PDF
├── .skill-seekers/
│   ├── manifest.json          # Metadata
│   ├── sources.json           # Source list
│   └── conflicts.json         # Detected conflicts
└── cross-references.json      # Links between sources
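
A quick way to confirm that a run produced the layout above is to check for a few of the listed files. This is a minimal sketch: `missing_outputs` is a hypothetical helper, and the list of expected paths is taken from the tree above, so adjust it if your output differs.

```python
import tempfile
from pathlib import Path

# A subset of the files shown in the output tree above.
EXPECTED = [
    "SKILL.md",
    "references/index.md",
    ".skill-seekers/manifest.json",
    ".skill-seekers/sources.json",
]


def missing_outputs(skill_dir):
    """Return the expected relative paths that are not files under skill_dir."""
    root = Path(skill_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# Demonstrate on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "references").mkdir()
    (root / "SKILL.md").write_text("# demo\n")
    (root / "references" / "index.md").write_text("# index\n")
    print(missing_outputs(root))
```

For a real run, call `missing_outputs("output/react-complete")` and expect an empty list.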

Best Practices

1. Name Sources Clearly

{
  "sources": [
    {"type": "docs", "name": "official-docs"},
    {"type": "github", "name": "source-code"},
    {"type": "pdf", "name": "legacy-reference"},
    {"type": "openapi", "name": "api-spec"},
    {"type": "confluence", "name": "team-wiki"}
  ]
}

2. Limit Source Scope

Include only the core files and exclude tests and docs:

{
  "type": "github",
  "name": "core-source",
  "repo": "owner/repo",
  "file_patterns": ["src/**/*.py"],
  "exclude_patterns": ["tests/**", "docs/**"]
}
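
To preview which files a pair of include and exclude patterns would keep, the patterns can be tried against a file list before scraping. The sketch below assumes common glob semantics (`*` stays within one path segment, `**` spans directories); this guide does not specify how Skill Seekers itself interprets `file_patterns` and `exclude_patterns`, and both helper names are hypothetical.

```python
import re


def glob_to_regex(pattern: str):
    """Translate a glob with * / ** / ? into a compiled regex (assumed semantics)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")         # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")      # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def select_files(paths, include, exclude):
    """Keep paths matching any include pattern and no exclude pattern."""
    inc = [glob_to_regex(p) for p in include]
    exc = [glob_to_regex(p) for p in exclude]
    return [
        path for path in paths
        if any(r.match(path) for r in inc) and not any(r.match(path) for r in exc)
    ]


files = ["src/core/app.py", "src/app.py", "src/app.js", "tests/test_app.py", "docs/index.md"]
print(select_files(files, ["src/**/*.py"], ["tests/**", "docs/**"]))
```

With these inputs only the two Python files under `src/` are kept.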

3. Enable Conflict Detection

{
  "conflict_detection": {
    "enabled": true
  }
}

4. Use Appropriate Merge Mode

  • claude-enhanced - Best quality, for important skills
  • rule-based - Faster, for testing or large datasets

5. Test Incrementally

# Test with one source first
skill-seekers create <source1>

# Then add sources
skill-seekers unified --config my-config.json --dry-run

Troubleshooting

"Source not found"

# Check all sources exist
curl -I https://docs.example.com/
ls downloads/manual.pdf

"Merge conflicts"

# Check conflicts report
cat output/my-skill/conflicts.json

# Adjust merge_mode
skill-seekers unified --config my-config.json --merge-mode rule-based

"Out of memory"

# Process sources separately
# Then merge manually
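
One way to process sources separately is to split a unified config into one config per source and run each on its own. The sketch below does only the splitting. `split_by_source` is a hypothetical helper, the naming scheme for the generated configs is arbitrary, and it assumes that a config with a single source is acceptable to the tool.

```python
import copy


def split_by_source(config: dict) -> list:
    """Return one config per source, keeping all other top-level settings."""
    parts = []
    for index, source in enumerate(config.get("sources", [])):
        # Copy every top-level key except the source list itself.
        part = {k: copy.deepcopy(v) for k, v in config.items() if k != "sources"}
        label = source.get("name") or f"{source.get('type', 'source')}-{index}"
        part["name"] = f"{config['name']}-{label}"
        part["sources"] = [copy.deepcopy(source)]
        parts.append(part)
    return parts


config = {
    "name": "product-docs",
    "merge_mode": "rule-based",
    "sources": [
        {"type": "docs", "base_url": "https://docs.example.com/v2/"},
        {"type": "pdf", "pdf_path": "v1-legacy-manual.pdf"},
    ],
}
for part in split_by_source(config):
    print(part["name"], [s["type"] for s in part["sources"]])
```

Each resulting dictionary can be written out with `json.dump` and passed to `skill-seekers unified --config` one at a time.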

Examples

Framework + Examples

{
  "name": "django-complete",
  "sources": [
    {"type": "docs", "base_url": "https://docs.djangoproject.com/"},
    {"type": "github", "repo": "django/django", "fetch_issues": false}
  ]
}

Docs + OpenAPI Spec

{
  "name": "stripe-complete",
  "sources": [
    {"type": "docs", "base_url": "https://stripe.com/docs"},
    {"type": "openapi", "path": "specs/stripe-openapi.yaml"}
  ]
}

Code + Jupyter Notebooks

{
  "name": "ml-project",
  "sources": [
    {"type": "github", "repo": "org/ml-pipeline"},
    {"type": "jupyter", "path": "notebooks/training.ipynb"},
    {"type": "jupyter", "path": "notebooks/evaluation.ipynb"}
  ]
}

Confluence + GitHub

{
  "name": "internal-platform",
  "sources": [
    {"type": "confluence", "base_url": "https://company.atlassian.net/wiki", "space_key": "PLATFORM"},
    {"type": "github", "repo": "company/platform-core"},
    {"type": "openapi", "path": "specs/platform-api.yaml"}
  ]
}

Legacy + Current

{
  "name": "product-docs",
  "sources": [
    {"type": "docs", "base_url": "https://docs.example.com/v2/"},
    {"type": "pdf", "pdf_path": "v1-legacy-manual.pdf"}
  ]
}

CLI Tool (Man Pages + GitHub + AsciiDoc)

{
  "name": "mytool-complete",
  "sources": [
    {"type": "manpage", "path": "man/mytool.1"},
    {"type": "github", "repo": "org/mytool"},
    {"type": "asciidoc", "path": "docs/user-guide.adoc"}
  ]
}

Team Knowledge (Notion + Chat + Video)

{
  "name": "onboarding-knowledge",
  "sources": [
    {"type": "notion", "workspace": "engineering", "root_page_id": "abc123"},
    {"type": "chat", "path": "exports/slack-engineering/"},
    {"type": "video", "url": "https://www.youtube.com/playlist?list=PLonboarding"}
  ]
}

See Also