firefrost-gaming/skill-seekers-reference

Files

yusyus 37cb307455 docs: update all documentation for 17 source types

Update 32 documentation files across English and Chinese (zh-CN) docs
to reflect the 10 new source types added in the previous commit.

Updated files:
- README.md, README.zh-CN.md — taglines, feature lists, examples, install extras
- docs/reference/ — CLI_REFERENCE, FEATURE_MATRIX, MCP_REFERENCE, CONFIG_FORMAT, API_REFERENCE
- docs/features/ — UNIFIED_SCRAPING with generic merge docs
- docs/advanced/ — multi-source guide, MCP server guide
- docs/getting-started/ — installation extras, quick-start examples
- docs/user-guide/ — core-concepts, scraping, packaging, workflows (complex-merge)
- docs/ — FAQ, TROUBLESHOOTING, BEST_PRACTICES, ARCHITECTURE, UNIFIED_PARSERS, README
- Root — BULLETPROOF_QUICKSTART, CONTRIBUTING, ROADMAP
- docs/zh-CN/ — Chinese translations for all of the above

32 files changed, +3,016 lines, -245 lines

2026-03-15 15:56:04 +03:00

24 KiB

Raw Blame History

Unified Multi-Source Scraping

Version: 3.2.0 (17 source types supported)

Overview

Unified multi-source scraping allows you to combine knowledge from multiple sources into a single comprehensive skill. Instead of choosing between documentation, GitHub repositories, PDF manuals, or any of the 17 supported source types, you can extract and intelligently merge information from all of them.

Why Unified Scraping?

The Problem: Documentation and code often drift apart over time. Official docs might be outdated, missing features that exist in code, or documenting features that have been removed. Separately scraping docs and code creates two incomplete skills.

The Solution: Unified scraping:

Extracts information from 17 source types (documentation, GitHub, PDFs, videos, Word docs, EPUB, Jupyter notebooks, local HTML, OpenAPI specs, AsciiDoc, PowerPoint, RSS/Atom feeds, man pages, Confluence, Notion, Slack/Discord, and local codebases)
Detects conflicts between documentation and actual code implementation
Intelligently merges conflicting information with transparency
Generic merge system combines any combination of source types via pairwise synthesis
Highlights discrepancies with inline warnings
Creates a single, comprehensive skill that shows the complete picture

Quick Start

1. Create a Unified Config

Create a config file with multiple sources:

{
  "name": "react",
  "description": "Complete React knowledge from docs + codebase",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "facebook/react",
      "include_code": true,
      "code_analysis_depth": "surface",
      "max_issues": 100
    }
  ]
}

2. Scrape and Build

python3 cli/unified_scraper.py --config configs/react_unified.json

The tool will:

✅ Phase 1: Scrape all sources (any of the 17 supported types)
✅ Phase 2: Detect conflicts between sources
✅ Phase 3: Merge conflicts intelligently (pairwise synthesis or generic merge)
✅ Phase 4: Build unified skill with conflict transparency
✅ Phase 5: Apply enhancement workflows (optional)

3. Package and Upload

python3 cli/package_skill.py output/react/

Config Format

Unified Config Structure

{
  "name": "skill-name",
  "description": "When to use this skill",
  "merge_mode": "rule-based|claude-enhanced",
  "sources": [
    {
      "type": "<source-type>",
      ...source-specific fields...
    }
  ]
}

Supported Source Types

Type	Config `type` Value	Description
Documentation (web)	`documentation`	Web documentation sites
GitHub repo	`github`	GitHub repository analysis
PDF	`pdf`	PDF document extraction
Local codebase	`local`	Local directory analysis
Word (.docx)	`word`	Word document extraction
Video	`video`	YouTube/Vimeo/local video transcription
EPUB	`epub`	EPUB ebook extraction
Jupyter Notebook	`jupyter`	`.ipynb` notebook extraction
Local HTML	`html`	Local HTML file extraction
OpenAPI/Swagger	`openapi`	OpenAPI/Swagger spec parsing
AsciiDoc	`asciidoc`	AsciiDoc document extraction
PowerPoint	`pptx`	PowerPoint presentation extraction
RSS/Atom	`rss`	RSS/Atom feed extraction
Man pages	`manpage`	Unix man page extraction
Confluence	`confluence`	Atlassian Confluence wiki extraction
Notion	`notion`	Notion workspace extraction
Slack/Discord	`chat`	Chat export extraction

Documentation Source

{
  "type": "documentation",
  "base_url": "https://docs.example.com/",
  "extract_api": true,
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": [],
    "exclude": ["/blog/"]
  },
  "categories": {
    "getting_started": ["intro", "tutorial"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 200
}

GitHub Source

{
  "type": "github",
  "repo": "owner/repo",
  "github_token": "ghp_...",
  "include_issues": true,
  "max_issues": 100,
  "include_changelog": true,
  "include_releases": true,
  "include_code": true,
  "code_analysis_depth": "surface|deep|full",
  "file_patterns": [
    "src/**/*.js",
    "lib/**/*.ts"
  ]
}

Code Analysis Depth:

surface (default): Basic structure, no code analysis
deep: Extract class/function signatures, parameters, return types
full: Complete AST analysis (expensive)

PDF Source

{
  "type": "pdf",
  "path": "/path/to/manual.pdf",
  "extract_tables": false,
  "ocr": false,
  "password": "optional-password"
}

Video Source

{
  "type": "video",
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "language": "en"
}

Word Document Source

{
  "type": "word",
  "path": "/path/to/document.docx"
}

EPUB Source

{
  "type": "epub",
  "path": "/path/to/book.epub"
}

Jupyter Notebook Source

{
  "type": "jupyter",
  "path": "/path/to/notebook.ipynb"
}

Local HTML Source

{
  "type": "html",
  "path": "/path/to/page.html"
}

OpenAPI/Swagger Source

{
  "type": "openapi",
  "path": "/path/to/openapi.yaml"
}

AsciiDoc Source

{
  "type": "asciidoc",
  "path": "/path/to/document.adoc"
}

PowerPoint Source

{
  "type": "pptx",
  "path": "/path/to/presentation.pptx"
}

RSS/Atom Feed Source

{
  "type": "rss",
  "url": "https://blog.example.com/feed.xml"
}

Man Page Source

{
  "type": "manpage",
  "path": "/path/to/command.1"
}

Confluence Source

{
  "type": "confluence",
  "base_url": "https://company.atlassian.net/wiki",
  "space_key": "DOCS"
}

Notion Source

{
  "type": "notion",
  "workspace": "my-workspace",
  "root_page_id": "abc123"
}

Slack/Discord Chat Source

{
  "type": "chat",
  "path": "/path/to/export/"
}

Conflict Detection

The unified scraper automatically detects 4 types of conflicts:

1. Missing in Documentation

Severity: Medium Description: API exists in code but is not documented

Example:

# Code has this method:
def move_local_x(self, delta: float, snap: bool = False) -> None:
    """Move node along local X axis"""

# But documentation doesn't mention it

Suggestion: Add documentation for this API

2. Missing in Code

Severity: High Description: API is documented but not found in codebase

Example:

# Docs say:
def rotate(angle: float) -> None

# But code doesn't have this function

Suggestion: Update documentation to remove this API, or add it to codebase

3. Signature Mismatch

Severity: Medium-High Description: API exists in both but signatures differ

Example:

# Docs say:
def move_local_x(delta: float)

# Code has:
def move_local_x(delta: float, snap: bool = False)

Suggestion: Update documentation to match actual signature

4. Description Mismatch

Severity: Low Description: Different descriptions/docstrings

Merge Modes

Rule-Based Merge (Default)

Fast, deterministic merging using predefined rules:

If API only in docs → Include with [DOCS_ONLY] tag
If API only in code → Include with [UNDOCUMENTED] tag
If both match perfectly → Include normally
If conflict exists → Prefer code signature, keep docs description

When to use:

Fast merging (< 1 second)
Automated workflows
You don't need human oversight

Example:

python3 cli/unified_scraper.py --config config.json --merge-mode rule-based

Claude-Enhanced Merge

AI-powered reconciliation using local Claude Code:

Opens new terminal with Claude Code
Provides conflict context and instructions
Claude analyzes and creates reconciled API reference
Human can review and adjust before finalizing

When to use:

Complex conflicts requiring judgment
You want highest quality merge
You have time for human oversight

Example:

python3 cli/unified_scraper.py --config config.json --merge-mode claude-enhanced

Skill Output Structure

The unified scraper creates this structure:

output/skill-name/
├── SKILL.md                     # Main skill file with merged APIs
├── references/
│   ├── documentation/           # Documentation references
│   │   └── index.md
│   ├── github/                  # GitHub references
│   │   ├── README.md
│   │   ├── issues.md
│   │   └── releases.md
│   ├── pdf/                     # PDF references (if applicable)
│   │   └── index.md
│   ├── video/                   # Video transcripts (if applicable)
│   │   └── index.md
│   ├── openapi/                 # OpenAPI spec (if applicable)
│   │   └── index.md
│   ├── jupyter/                 # Notebook content (if applicable)
│   │   └── index.md
│   ├── <source-type>/           # Other source type references
│   │   └── index.md
│   ├── api/                     # Merged API reference
│   │   └── merged_api.md
│   └── conflicts.md             # Detailed conflict report
├── scripts/                     # Empty (for user scripts)
└── assets/                      # Empty (for user assets)

SKILL.md Format

# React

Complete React knowledge base combining official documentation and React codebase insights.

## 📚 Sources

This skill combines knowledge from multiple sources:

- ✅ **Documentation**: https://react.dev/
  - Pages: 200
- ✅ **GitHub Repository**: facebook/react
  - Code Analysis: surface
  - Issues: 100

## ⚠️ Data Quality

**5 conflicts detected** between sources.

**Conflict Breakdown:**
- missing_in_docs: 3
- missing_in_code: 2

See `references/conflicts.md` for detailed conflict information.

## 🔧 API Reference

*Merged from documentation and code analysis*

### ✅ Verified APIs

*Documentation and code agree*

#### `useState(initialValue)`

...

### ⚠️ APIs with Conflicts

*Documentation and code differ*

#### `useEffect(callback, deps?)`

⚠️ **Conflict**: Documentation signature differs from code implementation

**Documentation says:**

useEffect(callback: () => void, deps: any[])


**Code implementation:**

useEffect(callback: () => void | (() => void), deps?: readonly any[])


*Source: both*

---

Examples

Example 1: React (Docs + GitHub)

{
  "name": "react",
  "description": "Complete React framework knowledge",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "facebook/react",
      "include_code": true,
      "code_analysis_depth": "surface"
    }
  ]
}

Example 2: Django (Docs + GitHub)

{
  "name": "django",
  "description": "Complete Django framework knowledge",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.djangoproject.com/en/stable/",
      "extract_api": true,
      "max_pages": 300
    },
    {
      "type": "github",
      "repo": "django/django",
      "include_code": true,
      "code_analysis_depth": "deep",
      "file_patterns": [
        "django/db/**/*.py",
        "django/views/**/*.py"
      ]
    }
  ]
}

Example 3: API Project (Docs + OpenAPI + Jupyter)

{
  "name": "my-api",
  "description": "Complete API knowledge with spec and notebooks",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://api.example.com/docs/",
      "extract_api": true,
      "max_pages": 100
    },
    {
      "type": "openapi",
      "path": "specs/openapi.yaml"
    },
    {
      "type": "jupyter",
      "path": "notebooks/api-examples.ipynb"
    }
  ]
}

Example 4: Enterprise Knowledge (Confluence + GitHub + Video)

{
  "name": "internal-platform",
  "description": "Internal platform knowledge from all sources",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "confluence",
      "base_url": "https://company.atlassian.net/wiki",
      "space_key": "PLATFORM"
    },
    {
      "type": "github",
      "repo": "company/platform",
      "include_code": true,
      "code_analysis_depth": "deep"
    },
    {
      "type": "video",
      "url": "https://www.youtube.com/playlist?list=PLexample",
      "language": "en"
    }
  ]
}

Example 5: Mixed Sources (Docs + GitHub + PDF)

{
  "name": "godot",
  "description": "Complete Godot Engine knowledge",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "extract_api": true,
      "max_pages": 500
    },
    {
      "type": "github",
      "repo": "godotengine/godot",
      "include_code": true,
      "code_analysis_depth": "deep"
    },
    {
      "type": "pdf",
      "path": "/path/to/godot_manual.pdf",
      "extract_tables": true
    }
  ]
}

Command Reference

Unified Scraper

# Basic usage
skill-seekers unified --config configs/react_unified.json

# Override merge mode
skill-seekers unified --config configs/react_unified.json --merge-mode claude-enhanced

# Fresh start (clear cached data)
skill-seekers unified --config configs/react_unified.json --fresh

# Dry run (preview without executing)
skill-seekers unified --config configs/react_unified.json --dry-run

Enhancement Workflow Options

All workflow flags are now supported:

# Apply workflow preset
skill-seekers unified --config configs/react_unified.json --enhance-workflow security-focus

# Multiple workflows (chained)
skill-seekers unified --config configs/react_unified.json \
  --enhance-workflow security-focus \
  --enhance-workflow api-documentation

# Custom enhancement stage
skill-seekers unified --config configs/react_unified.json \
  --enhance-stage "cleanup:Remove boilerplate content"

# Workflow variables
skill-seekers unified --config configs/react_unified.json \
  --enhance-workflow my-workflow \
  --var focus_area=performance \
  --var detail_level=high

# Preview workflows without executing
skill-seekers unified --config configs/react_unified.json \
  --enhance-workflow security-focus \
  --workflow-dry-run

Global Enhancement Override

Override enhancement settings from CLI:

# Override enhance level for all sources
skill-seekers unified --config configs/react_unified.json --enhance-level 3

# Provide API key (or use ANTHROPIC_API_KEY env var)
skill-seekers unified --config configs/react_unified.json --api-key YOUR_API_KEY

Workflow Configuration in JSON

Define workflows directly in your unified config:

{
  "name": "react-complete",
  "description": "React with security focus",
  "merge_mode": "claude-enhanced",
  "workflows": ["security-focus"],
  "workflow_stages": [
    {
      "name": "cleanup",
      "prompt": "Remove boilerplate and standardize formatting"
    }
  ],
  "workflow_vars": {
    "focus_area": "security",
    "detail_level": "comprehensive"
  },
  "sources": [
    {"type": "documentation", "base_url": "https://react.dev/"},
    {"type": "github", "repo": "facebook/react"}
  ]
}

Priority: CLI flags override config values.

Validate Config

python3 -c "
import sys
sys.path.insert(0, 'cli')
from config_validator import validate_config

validator = validate_config('configs/react_unified.json')
print(f'Format: {\"Unified\" if validator.is_unified else \"Legacy\"}')
print(f'Sources: {len(validator.config.get(\"sources\", []))}')
print(f'Needs API merge: {validator.needs_api_merge()}')
"

MCP Integration

The unified scraper is fully integrated with MCP. The scrape_docs tool automatically detects unified vs legacy configs and routes to the appropriate scraper.

# MCP tool usage
{
  "name": "scrape_docs",
  "arguments": {
    "config_path": "configs/react_unified.json",
    "merge_mode": "rule-based"  # Optional override
  }
}

The tool will:

Auto-detect unified format
Route to unified_scraper.py
Apply specified merge mode
Return comprehensive output

Backward Compatibility

Legacy configs still work! The system automatically detects legacy single-source configs and routes to the original doc_scraper.py.

// Legacy config (still works)
{
  "name": "react",
  "base_url": "https://react.dev/",
  ...
}

// Automatically detected as legacy format
// Routes to doc_scraper.py

Testing

Run integration tests:

python3 cli/test_unified_simple.py

Tests validate:

✅ Unified config validation
✅ Backward compatibility with legacy configs
✅ Mixed source type support
✅ Error handling for invalid configs

Architecture

Components

config_validator.py: Validates unified and legacy configs
code_analyzer.py: Extracts code signatures at configurable depth
conflict_detector.py: Detects API conflicts between sources
merge_sources.py: Implements rule-based and Claude-enhanced merging
unified_scraper.py: Main orchestrator
unified_skill_builder.py: Generates final skill structure
skill_seeker_mcp/server.py: MCP integration with auto-detection

Data Flow

Unified Config
     ↓
ConfigValidator (validates format)
     ↓
UnifiedScraper.run()
     ↓
┌────────────────────────────────────┐
│ Phase 1: Scrape All Sources        │
│  - Documentation → doc_scraper     │
│  - GitHub → github_scraper         │
│  - PDF → pdf_scraper               │
│  - Local → codebase_scraper        │
│  - Video → video_scraper           │
│  - Word → word_scraper             │
│  - EPUB → epub_scraper             │
│  - Jupyter → jupyter_scraper       │
│  - HTML → html_scraper             │
│  - OpenAPI → openapi_scraper       │
│  - AsciiDoc → asciidoc_scraper     │
│  - PowerPoint → pptx_scraper       │
│  - RSS/Atom → rss_scraper          │
│  - Man pages → manpage_scraper     │
│  - Confluence → confluence_scraper │
│  - Notion → notion_scraper         │
│  - Chat → chat_scraper             │
└────────────────────────────────────┘
     ↓
┌────────────────────────────────────┐
│ Phase 2: Detect Conflicts          │
│  - ConflictDetector                │
│  - Compare docs APIs vs code APIs  │
│  - Classify by type and severity   │
└────────────────────────────────────┘
     ↓
┌────────────────────────────────────┐
│ Phase 3: Merge Sources              │
│  - Pairwise synthesis (docs+github │
│    +pdf combos)                    │
│  - Generic merge (_generic_merge)  │
│    for all other combinations      │
│  - RuleBasedMerger (fast)          │
│  - OR ClaudeEnhancedMerger (AI)    │
│  - Create unified API reference    │
└────────────────────────────────────┘
     ↓
┌────────────────────────────────────┐
│ Phase 4: Build Skill                │
│  - UnifiedSkillBuilder             │
│  - Generate SKILL.md with conflicts│
│  - Create reference structure      │
│  - Generate conflicts report       │
└────────────────────────────────────┘
     ↓
┌────────────────────────────────────┐
│ Phase 5: Enhancement Workflows      │
│  - Apply workflow presets          │
│  - Run custom enhancement stages   │
│  - Variable substitution           │
└────────────────────────────────────┘
     ↓
Unified Skill (.zip ready)

Best Practices

1. Start with Rule-Based Merge

Rule-based is fast and works well for most cases. Only use Claude-enhanced if you need human oversight.

2. Use Surface-Level Code Analysis

code_analysis_depth: "surface" is usually sufficient. Deep analysis is expensive and rarely needed.

3. Limit GitHub Issues

max_issues: 100 is a good default. More than 200 issues rarely adds value.

4. Be Specific with File Patterns

"file_patterns": [
  "src/**/*.js",     // Good: specific paths
  "lib/**/*.ts"
]

// Not recommended:
"file_patterns": ["**/*.js"]  // Too broad, slow

5. Monitor Conflict Reports

Always review references/conflicts.md to understand discrepancies between sources.

Troubleshooting

No Conflicts Detected

Possible causes:

extract_api: false in documentation source
include_code: false in GitHub source
Code analysis found no APIs (check code_analysis_depth)

Solution: Ensure both sources have API extraction enabled

Too Many Conflicts

Possible causes:

Fuzzy matching threshold too strict
Documentation uses different naming conventions
Old documentation version

Solution: Review conflicts manually and adjust merge strategy

Merge Takes Too Long

Possible causes:

Using code_analysis_depth: "full" (very slow)
Too many file patterns
Large repository

Solution:

Use "surface" or "deep" analysis
Narrow file patterns
Increase rate_limit

Future Enhancements

Planned features:

Automated conflict resolution strategies
Conflict trend analysis across versions
Multi-version comparison (docs v1 vs v2)
Custom merge rules DSL
Conflict confidence scores

Support

For issues, questions, or suggestions:

GitHub Issues: https://github.com/yusufkaraaslan/Skill_Seekers/issues
Documentation: https://github.com/yusufkaraaslan/Skill_Seekers/docs

Changelog

v3.2.0 (March 2026): 17 source types supported

✅ 13 new source types: Word, EPUB, Video, Jupyter, HTML, OpenAPI, AsciiDoc, PowerPoint, RSS/Atom, Man pages, Confluence, Notion, Slack/Discord
✅ Generic merge system (_generic_merge()) for combining any source type combination
✅ Pairwise synthesis for docs+github+pdf combos
✅ complex-merge.yaml workflow preset for AI-powered multi-source merging

v3.1.0 (February 2026): Enhancement workflow support

✅ Full workflow system integration (Phase 5)
✅ All workflow flags supported (--enhance-workflow, --enhance-stage, --var, --workflow-dry-run)
✅ Workflow configuration in JSON configs
✅ Global --enhance-level and --api-key CLI overrides
✅ Local source type support (codebase analysis)

v2.0 (October 2025): Unified multi-source scraping feature complete

✅ Config validation for unified format
✅ Deep code analysis with AST parsing
✅ Conflict detection (4 types, 3 severity levels)
✅ Rule-based merging
✅ Claude-enhanced merging
✅ Unified skill builder with inline conflict warnings
✅ MCP integration with auto-detection
✅ Backward compatibility with legacy configs
✅ Comprehensive tests and documentation

24 KiB Raw Blame History

Unified Multi-Source Scraping

Overview

Why Unified Scraping?

Quick Start

1. Create a Unified Config

2. Scrape and Build

3. Package and Upload

Config Format

Unified Config Structure

Supported Source Types

Documentation Source

GitHub Source

PDF Source

Video Source

Word Document Source

EPUB Source

Jupyter Notebook Source

Local HTML Source

OpenAPI/Swagger Source

AsciiDoc Source

PowerPoint Source

RSS/Atom Feed Source

Man Page Source

Confluence Source

Notion Source

Slack/Discord Chat Source

Conflict Detection

1. Missing in Documentation

2. Missing in Code

3. Signature Mismatch

4. Description Mismatch

Merge Modes

Rule-Based Merge (Default)

Claude-Enhanced Merge

Skill Output Structure

SKILL.md Format

Examples

Example 1: React (Docs + GitHub)

Example 2: Django (Docs + GitHub)

Example 3: API Project (Docs + OpenAPI + Jupyter)

Example 4: Enterprise Knowledge (Confluence + GitHub + Video)

Example 5: Mixed Sources (Docs + GitHub + PDF)

Command Reference

Unified Scraper

Enhancement Workflow Options

Global Enhancement Override

Workflow Configuration in JSON

Validate Config

MCP Integration

Backward Compatibility

Testing

Architecture

Components

Data Flow

Best Practices

1. Start with Rule-Based Merge

2. Use Surface-Level Code Analysis

3. Limit GitHub Issues

4. Be Specific with File Patterns

5. Monitor Conflict Reports

Troubleshooting

No Conflicts Detected

Too Many Conflicts

Merge Takes Too Long

Future Enhancements

Support

Changelog

24 KiB

Raw Blame History