skill-seekers-reference/docs/reference/CONFIG_FORMAT.md
yusyus 37cb307455 docs: update all documentation for 17 source types
2026-03-15 15:56:04 +03:00

Config Format Reference - Skill Seekers

Version: 3.2.0
Last Updated: 2026-03-15
Complete JSON configuration specification for 17 source types


Overview

Skill Seekers uses JSON configuration files with a unified format. All configs use a sources array, even for single-source scraping.

Important: Legacy configs without sources were removed in v2.11.0. All configs must use the unified format shown below.

| Use Case | Example |
|----------|---------|
| Single source | `"sources": [{ "type": "documentation", ... }]` |
| Multiple sources | `"sources": [{ "type": "documentation", ... }, { "type": "github", ... }]` |
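To make the unified-format rule concrete, here is a minimal sketch of a shape check that rejects a legacy config lacking `sources`. The function name `check_unified_shape` and its messages are hypothetical and are not part of Skill Seekers; the tool's own validation may check more than this.

```python
import json
from typing import List


def check_unified_shape(config: dict) -> List[str]:
    """Return a list of problems; an empty list means the basic shape looks right.

    Hypothetical helper for illustration only, not the tool's validator.
    """
    problems = []
    if not isinstance(config.get("name"), str) or not config.get("name"):
        problems.append("missing top-level 'name'")
    sources = config.get("sources")
    if not isinstance(sources, list) or not sources:
        problems.append("'sources' must be a non-empty array")
    else:
        for i, source in enumerate(sources):
            if not isinstance(source, dict) or "type" not in source:
                problems.append(f"sources[{i}] needs a 'type'")
    return problems


# A legacy-style config (no "sources") and its unified equivalent.
legacy = json.loads('{"name": "react", "base_url": "https://react.dev/"}')
unified = json.loads(
    '{"name": "react", "sources": '
    '[{"type": "documentation", "base_url": "https://react.dev/"}]}'
)

print(check_unified_shape(legacy))   # ["'sources' must be a non-empty array"]
print(check_unified_shape(unified))  # []
```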

Single-Source Config

Even for a single source, wrap it in a sources array.

Documentation Source

For scraping documentation websites.

{
  "name": "react",
  "description": "React - JavaScript library for building UIs",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",

      "start_urls": [
        "https://react.dev/learn",
        "https://react.dev/reference/react"
      ],

      "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code"
      },

      "url_patterns": {
        "include": ["/learn/", "/reference/"],
        "exclude": ["/blog/", "/community/"]
      },

      "categories": {
        "getting_started": ["learn", "tutorial", "intro"],
        "api": ["reference", "api", "hooks"]
      },

      "rate_limit": 0.5,
      "max_pages": 300
    }
  ]
}

Documentation Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name (alphanumeric, dashes, underscores) |
| `base_url` | string | Yes | - | Base documentation URL |
| `description` | string | No | `""` | Skill description for SKILL.md |
| `start_urls` | array | No | `[base_url]` | URLs to start crawling from |
| `selectors` | object | No | see below | CSS selectors for content extraction |
| `url_patterns` | object | No | `{}` | Include/exclude URL patterns |
| `categories` | object | No | `{}` | Content categorization rules |
| `rate_limit` | number | No | `0.5` | Seconds between requests |
| `max_pages` | number | No | `500` | Maximum pages to scrape |
| `merge_mode` | string | No | `"claude-enhanced"` | Merge strategy |
| `extract_api` | boolean | No | `false` | Extract API references |
| `llms_txt_url` | string | No | auto | Path to llms.txt file |
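Since `rate_limit` is a delay in seconds and `max_pages` caps the crawl, the two together give a rough lower bound on how long a scrape takes. The sketch below assumes one pause between each pair of consecutive requests and ignores download and parsing time; `min_crawl_seconds` is a hypothetical helper, not a Skill Seekers function.

```python
def min_crawl_seconds(max_pages: int, rate_limit: float) -> float:
    """Lower bound on crawl time, assuming one pause between consecutive requests."""
    return max(max_pages - 1, 0) * rate_limit


# The example config above: 300 pages at 0.5 s between requests.
print(min_crawl_seconds(300, 0.5))  # 149.5

# The documented defaults: 500 pages at 0.5 s between requests.
print(min_crawl_seconds(500, 0.5))  # 249.5
```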

GitHub Source

For analyzing GitHub repositories.

{
  "name": "react-github",
  "description": "React GitHub repository analysis",
  "sources": [
    {
      "type": "github",
      "repo": "facebook/react",

      "enable_codebase_analysis": true,
      "code_analysis_depth": "deep",

      "fetch_issues": true,
      "max_issues": 100,
      "issue_labels": ["bug", "enhancement"],

      "fetch_releases": true,
      "max_releases": 20,

      "fetch_changelog": true,
      "analyze_commit_history": true,

      "file_patterns": ["*.js", "*.ts", "*.tsx"],
      "exclude_patterns": ["*.test.js", "node_modules/**"],

      "rate_limit": 1.0
    }
  ]
}

GitHub Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name |
| `type` | string | Yes | - | Must be `"github"` |
| `repo` | string | Yes | - | Repository in `owner/repo` format |
| `description` | string | No | `""` | Skill description |
| `enable_codebase_analysis` | boolean | No | `true` | Analyze source code |
| `code_analysis_depth` | string | No | `"standard"` | `surface`, `standard`, `deep` |
| `fetch_issues` | boolean | No | `true` | Fetch GitHub issues |
| `max_issues` | number | No | `100` | Maximum issues to fetch |
| `issue_labels` | array | No | `[]` | Filter by labels |
| `fetch_releases` | boolean | No | `true` | Fetch releases |
| `max_releases` | number | No | `20` | Maximum releases |
| `fetch_changelog` | boolean | No | `true` | Extract CHANGELOG |
| `analyze_commit_history` | boolean | No | `false` | Analyze commits |
| `file_patterns` | array | No | `[]` | Include file patterns |
| `exclude_patterns` | array | No | `[]` | Exclude file patterns |

PDF Source

For extracting content from PDF files.

{
  "name": "product-manual",
  "description": "Product documentation manual",
  "sources": [
    {
      "type": "pdf",
      "pdf_path": "docs/manual.pdf",

      "enable_ocr": false,
      "password": "",

      "extract_images": true,
      "image_output_dir": "output/images/",

      "extract_tables": true,
      "table_format": "markdown",

      "page_range": [1, 100],
      "split_by_chapters": true,

      "chunk_size": 1000,
      "chunk_overlap": 100
    }
  ]
}

PDF Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name |
| `type` | string | Yes | - | Must be `"pdf"` |
| `pdf_path` | string | Yes | - | Path to PDF file |
| `description` | string | No | `""` | Skill description |
| `enable_ocr` | boolean | No | `false` | OCR for scanned PDFs |
| `password` | string | No | `""` | PDF password if encrypted |
| `extract_images` | boolean | No | `false` | Extract embedded images |
| `image_output_dir` | string | No | auto | Directory for images |
| `extract_tables` | boolean | No | `false` | Extract tables |
| `table_format` | string | No | `"markdown"` | `markdown`, `json`, `csv` |
| `page_range` | array | No | all | `[start, end]` page range |
| `split_by_chapters` | boolean | No | `false` | Split by detected chapters |
| `chunk_size` | number | No | `1000` | Characters per chunk |
| `chunk_overlap` | number | No | `100` | Overlap between chunks |
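The interaction between `chunk_size` and `chunk_overlap` is easiest to see with a small worked example. The following is a sketch of plain character-based sliding-window chunking under the documented defaults; it is an assumption about how such settings are typically applied, and the tool's own splitter (for example with `split_by_chapters`) may behave differently.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Split text into chunks of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


# 2,500 characters with the defaults: chunks start at offsets 0, 900, and 1800.
print([len(c) for c in chunk_text("x" * 2500)])  # [1000, 1000, 700]

# A tiny case that makes the one-character overlap visible.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))  # ['abcd', 'defg', 'ghij']
```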

Local Source

For analyzing local codebases.

{
  "name": "my-project",
  "description": "Local project analysis",
  "sources": [
    {
      "type": "local",
      "directory": "./my-project",

      "languages": ["Python", "JavaScript"],
      "file_patterns": ["*.py", "*.js"],
      "exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],

      "analysis_depth": "comprehensive",

      "extract_api": true,
      "extract_patterns": true,
      "extract_test_examples": true,
      "extract_how_to_guides": true,
      "extract_config_patterns": true,

      "include_comments": true,
      "include_docstrings": true,
      "include_readme": true
    }
  ]
}

Local Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name |
| `type` | string | Yes | - | Must be `"local"` |
| `directory` | string | Yes | - | Path to directory |
| `description` | string | No | `""` | Skill description |
| `languages` | array | No | auto | Languages to analyze |
| `file_patterns` | array | No | all | Include patterns |
| `exclude_patterns` | array | No | common | Exclude patterns |
| `analysis_depth` | string | No | `"standard"` | `quick`, `standard`, `comprehensive` |
| `extract_api` | boolean | No | `true` | Extract API documentation |
| `extract_patterns` | boolean | No | `true` | Detect patterns |
| `extract_test_examples` | boolean | No | `true` | Extract test examples |
| `extract_how_to_guides` | boolean | No | `true` | Generate guides |
| `extract_config_patterns` | boolean | No | `true` | Extract config patterns |
| `include_comments` | boolean | No | `true` | Include code comments |
| `include_docstrings` | boolean | No | `true` | Include docstrings |
| `include_readme` | boolean | No | `true` | Include README |

Additional Source Types

The following 10 source types were added in v3.2.0. Each can be used as a standalone config or within a unified sources array.

Jupyter Notebook Source

{
  "name": "ml-tutorial",
  "sources": [{
    "type": "jupyter",
    "notebook_path": "notebooks/tutorial.ipynb"
  }]
}

Local HTML Source

{
  "name": "offline-docs",
  "sources": [{
    "type": "html",
    "html_path": "./exported-docs/"
  }]
}

OpenAPI/Swagger Source

{
  "name": "petstore-api",
  "sources": [{
    "type": "openapi",
    "spec_path": "api/openapi.yaml",
    "spec_url": "https://petstore.swagger.io/v2/swagger.json"
  }]
}

AsciiDoc Source

{
  "name": "project-guide",
  "sources": [{
    "type": "asciidoc",
    "asciidoc_path": "./docs/guide.adoc"
  }]
}

PowerPoint Source

{
  "name": "training-slides",
  "sources": [{
    "type": "pptx",
    "pptx_path": "presentations/training.pptx"
  }]
}

RSS/Atom Feed Source

{
  "name": "engineering-blog",
  "sources": [{
    "type": "rss",
    "feed_url": "https://engineering.example.com/feed.xml",
    "follow_links": true,
    "max_articles": 50
  }]
}

Man Page Source

{
  "name": "unix-tools",
  "sources": [{
    "type": "manpage",
    "man_names": "ls,grep,find,awk,sed",
    "sections": "1,3"
  }]
}

Confluence Source

{
  "name": "team-wiki",
  "sources": [{
    "type": "confluence",
    "base_url": "https://wiki.example.com",
    "space_key": "DEV",
    "username": "user@example.com",
    "max_pages": 500
  }]
}

Notion Source

{
  "name": "product-docs",
  "sources": [{
    "type": "notion",
    "database_id": "abc123def456",
    "max_pages": 500
  }]
}

Chat (Slack/Discord) Source

{
  "name": "team-knowledge",
  "sources": [{
    "type": "chat",
    "export_path": "./slack-export/",
    "platform": "slack",
    "channel": "engineering",
    "max_messages": 10000
  }]
}

Additional Source Fields Reference

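| Source Type | Required Fields | Optional Fields |
|-------------|-----------------|-----------------|
| `jupyter` | `notebook_path` | |
| `html` | `html_path` | |
| `openapi` | `spec_path` or `spec_url` | |
| `asciidoc` | `asciidoc_path` | |
| `pptx` | `pptx_path` | |
| `rss` | `feed_url` or `feed_path` | `follow_links`, `max_articles` |
| `manpage` | `man_names` or `man_path` | `sections` |
| `confluence` | `base_url` + `space_key` or `export_path` | `username`, `token`, `max_pages` |
| `notion` | `database_id` or `page_id` or `export_path` | `token`, `max_pages` |
| `chat` | `export_path` | `platform`, `token`, `channel`, `max_messages` |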

Unified (Multi-Source) Config

Combine multiple sources into one skill with conflict detection.

{
  "name": "react-complete",
  "description": "React docs + GitHub + examples",
  "merge_mode": "claude-enhanced",
  
  "sources": [
    {
      "type": "docs",
      "name": "react-docs",
      "base_url": "https://react.dev/",
      "max_pages": 200,
      "categories": {
        "getting_started": ["learn"],
        "api": ["reference"]
      }
    },
    {
      "type": "github",
      "name": "react-github",
      "repo": "facebook/react",
      "fetch_issues": true,
      "max_issues": 50
    },
    {
      "type": "pdf",
      "name": "react-cheatsheet",
      "pdf_path": "docs/react-cheatsheet.pdf"
    },
    {
      "type": "local",
      "name": "react-examples",
      "directory": "./react-examples"
    }
  ],
  
  "conflict_detection": {
    "enabled": true,
    "rules": [
      {
        "field": "api_signature",
        "action": "flag_mismatch"
      }
    ]
  },
  
  "output_structure": {
    "group_by_source": false,
    "cross_reference": true
  }
}

Unified Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Combined skill name |
| `description` | string | No | `""` | Skill description |
| `merge_mode` | string | No | `"claude-enhanced"` | `rule-based`, `claude-enhanced` |
| `sources` | array | Yes | - | List of source configs |
| `conflict_detection` | object | No | `{}` | Conflict detection settings |
| `output_structure` | object | No | `{}` | Output organization |
| `workflows` | array | No | `[]` | Workflow presets to apply |
| `workflow_stages` | array | No | `[]` | Inline enhancement stages |
| `workflow_vars` | object | No | `{}` | Workflow variable overrides |
| `workflow_dry_run` | boolean | No | `false` | Preview workflows without executing |

Workflow Configuration (Unified)

Unified configs support defining enhancement workflows at the top level:

{
  "name": "react-complete",
  "description": "React docs + GitHub with security enhancement",
  "merge_mode": "claude-enhanced",
  
  "workflows": ["security-focus", "api-documentation"],
  "workflow_stages": [
    {
      "name": "cleanup",
      "prompt": "Remove boilerplate sections and standardize formatting"
    }
  ],
  "workflow_vars": {
    "focus_area": "performance",
    "detail_level": "comprehensive"
  },
  
  "sources": [
    {"type": "docs", "base_url": "https://react.dev/"},
    {"type": "github", "repo": "facebook/react"}
  ]
}

Workflow Fields:

| Field | Type | Description |
|-------|------|-------------|
| `workflows` | array | List of workflow preset names to apply |
| `workflow_stages` | array | Inline stages with `name` and `prompt` |
| `workflow_vars` | object | Key-value pairs for workflow variables |
| `workflow_dry_run` | boolean | Preview workflows without executing |

Note: When the same setting is given both in the config file and as a CLI flag, the CLI flag takes precedence.
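The precedence rule can be pictured as a simple dictionary merge. This is only a sketch: `effective_settings` is a hypothetical helper, and it assumes that unset CLI flags arrive as `None`, which is an illustration choice rather than documented behavior.

```python
def effective_settings(config: dict, cli_flags: dict) -> dict:
    """CLI values win; flags left unset (None) fall back to the config file."""
    merged = dict(config)
    merged.update({key: value for key, value in cli_flags.items() if value is not None})
    return merged


config = {"workflows": ["security-focus"], "workflow_dry_run": False}
cli_flags = {"workflow_dry_run": True, "workflows": None}  # only the dry-run flag was passed

print(effective_settings(config, cli_flags))
# {'workflows': ['security-focus'], 'workflow_dry_run': True}
```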

Source Types in Unified Config

Each source in the sources array can be any of the 17 supported types:

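| Type | Required Fields |
|------|-----------------|
| `documentation` / `docs` | `base_url` |
| `github` | `repo` |
| `pdf` | `pdf_path` |
| `word` | `docx_path` |
| `epub` | `epub_path` |
| `video` | `url` or `video_path` |
| `local` | `directory` |
| `jupyter` | `notebook_path` |
| `html` | `html_path` |
| `openapi` | `spec_path` or `spec_url` |
| `asciidoc` | `asciidoc_path` |
| `pptx` | `pptx_path` |
| `rss` | `feed_url` or `feed_path` |
| `manpage` | `man_names` or `man_path` |
| `confluence` | `base_url` + `space_key` or `export_path` |
| `notion` | `database_id` or `page_id` or `export_path` |
| `chat` | `export_path` |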
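The required-field table above can be encoded as data and checked per source. The sketch below is a hypothetical pre-flight check, not the tool's validator. It reads "`base_url` + `space_key` or `export_path`" as "(both `base_url` and `space_key`) or `export_path`", which is an interpretation of the table rather than confirmed behavior.

```python
from typing import Optional

# Each type maps to a list of alternatives; every field in one alternative must be present.
REQUIRED = {
    "documentation": [("base_url",)],
    "docs": [("base_url",)],
    "github": [("repo",)],
    "pdf": [("pdf_path",)],
    "word": [("docx_path",)],
    "epub": [("epub_path",)],
    "video": [("url",), ("video_path",)],
    "local": [("directory",)],
    "jupyter": [("notebook_path",)],
    "html": [("html_path",)],
    "openapi": [("spec_path",), ("spec_url",)],
    "asciidoc": [("asciidoc_path",)],
    "pptx": [("pptx_path",)],
    "rss": [("feed_url",), ("feed_path",)],
    "manpage": [("man_names",), ("man_path",)],
    "confluence": [("base_url", "space_key"), ("export_path",)],
    "notion": [("database_id",), ("page_id",), ("export_path",)],
    "chat": [("export_path",)],
}


def missing_required(source: dict) -> Optional[str]:
    """Return an error message if the source lacks its required fields, else None."""
    kind = source.get("type")
    alternatives = REQUIRED.get(kind)
    if alternatives is None:
        return f"unknown source type: {kind!r}"
    if any(all(field in source for field in alt) for alt in alternatives):
        return None
    options = " or ".join(" + ".join(alt) for alt in alternatives)
    return f"{kind} source needs {options}"


print(missing_required({"type": "confluence", "base_url": "https://wiki.example.com"}))
# confluence source needs base_url + space_key or export_path
print(missing_required({"type": "confluence", "export_path": "./confluence-export/"}))
# None
```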

Common Fields

Fields available in all config types:

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Skill identifier (letters, numbers, dashes, underscores) |
| `description` | string | Human-readable description |
| `rate_limit` | number | Delay between requests in seconds |
| `output_dir` | string | Custom output directory |
| `skip_scrape` | boolean | Use existing data |
| `enhance_level` | number | 0=off, 1=SKILL.md, 2=+config, 3=full |

Selectors

CSS selectors for content extraction from HTML:

{
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code",
    "navigation": "nav.sidebar",
    "breadcrumbs": "nav[aria-label='breadcrumb']",
    "next_page": "a[rel='next']",
    "prev_page": "a[rel='prev']"
  }
}

Default Selectors

If main_content is not specified, the scraper tries these selectors in order until one matches:

  1. main
  2. div[role="main"]
  3. article
  4. [role="main"]
  5. .content
  6. .doc-content
  7. #main-content

Tip: Omit main_content from your config to let auto-detection work. Only specify it when auto-detection picks the wrong element.
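The fallback order can be sketched as a loop that stops at the first selector returning a match. The code below is illustrative only: `pick_main_content` is a hypothetical name, and the page is faked with a dictionary so the example runs without an HTML parser. With a real parser, a lookup such as BeautifulSoup's `soup.select_one` could be passed as `select` instead.

```python
from typing import Callable, Optional, Tuple

# The documented auto-detection order for main_content.
FALLBACK_SELECTORS = [
    "main",
    'div[role="main"]',
    "article",
    '[role="main"]',
    ".content",
    ".doc-content",
    "#main-content",
]


def pick_main_content(
    select: Callable[[str], Optional[str]],
    configured: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (selector, content) for the first selector that matches, else (None, None)."""
    candidates = [configured] if configured else FALLBACK_SELECTORS
    for selector in candidates:
        content = select(selector)
        if content is not None:
            return selector, content
    return None, None


# Stand-in for a parsed page: selectors that would match, mapped to their text.
fake_page = {"article": "Docs body", ".content": "Wrapper with sidebar"}

print(pick_main_content(fake_page.get))                         # ('article', 'Docs body')
print(pick_main_content(fake_page.get, configured=".content"))  # ('.content', 'Wrapper with sidebar')
```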

Other defaults:

| Element | Default Selector |
|---------|------------------|
| `title` | `title` |
| `code_blocks` | `pre code` |

Categories

Map URL patterns to content categories:

{
  "categories": {
    "getting_started": [
      "intro", "tutorial", "quickstart", 
      "installation", "getting-started"
    ],
    "core_concepts": [
      "concept", "fundamental", "architecture",
      "principle", "overview"
    ],
    "api_reference": [
      "reference", "api", "method", "function",
      "class", "interface", "type"
    ],
    "guides": [
      "guide", "how-to", "example", "recipe",
      "pattern", "best-practice"
    ],
    "advanced": [
      "advanced", "expert", "performance",
      "optimization", "internals"
    ]
  }
}

Categories appear as sections in the generated SKILL.md.
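One plausible reading of the category rules is keyword matching against the page URL. The sketch below assumes case-insensitive substring matching, first matching category wins (in config order), and a fallback bucket named `other`. All three are assumptions made for illustration, since the matching rules are not spelled out in this reference.

```python
def categorize(url: str, categories: dict, default: str = "other") -> str:
    """Return the first category whose keyword list has a substring match in the URL."""
    lowered = url.lower()
    for category, keywords in categories.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return default


categories = {
    "getting_started": ["learn", "tutorial", "intro"],
    "api": ["reference", "api", "hooks"],
}

print(categorize("https://react.dev/learn/thinking-in-react", categories))   # getting_started
print(categorize("https://react.dev/reference/react/useState", categories))  # api
print(categorize("https://react.dev/blog/2024", categories))                 # other
```

Under this reading, keyword order and category order both matter: a broad keyword such as `api` in an early category would capture URLs that a later, more specific category was meant to catch.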


URL Patterns

Control which URLs are included or excluded:

{
  "url_patterns": {
    "include": [
      "/docs/",
      "/guide/",
      "/api/",
      "/reference/"
    ],
    "exclude": [
      "/blog/",
      "/news/",
      "/community/",
      "/search",
      "?print=1",
      "/_static/",
      "/_images/"
    ]
  }
}

Pattern Rules

  • Patterns are matched against the URL path
  • Use * for wildcards: /api/v*/
  • Use ** for recursive: /docs/**/*.html
  • Exclude takes precedence over include
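The rules above can be sketched as a small filter. This is not the tool's matcher: it assumes that a pattern may match anywhere in the URL path, that `*` stays within one path segment, that `**` may cross segments, and that an empty include list allows everything. Because it looks only at the path, a query-string pattern such as `?print=1` would never match in this sketch.

```python
import re
from typing import List
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern with * and ** wildcards into a compiled regex."""
    pieces = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append(".*")       # recursive: may cross "/" boundaries
        elif part == "*":
            pieces.append("[^/]*")    # wildcard: stays inside one path segment
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def is_allowed(url: str, include: List[str], exclude: List[str]) -> bool:
    """Apply exclude first, then include, against the URL path only."""
    path = urlsplit(url).path

    def hit(patterns: List[str]) -> bool:
        return any(pattern_to_regex(p).search(path) for p in patterns)

    if hit(exclude):
        return False  # exclude takes precedence over include
    return not include or hit(include)


print(is_allowed("https://example.com/api/v2/users", ["/api/v*/"], ["/blog/"]))        # True
print(is_allowed("https://example.com/docs/blog/post", ["/docs/"], ["/blog/"]))        # False
print(is_allowed("https://example.com/docs/a/b/page.html", ["/docs/**/*.html"], []))   # True
print(is_allowed("https://example.com/about", ["/docs/"], []))                         # False
```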

Examples

React Documentation

{
  "name": "react",
  "description": "React - JavaScript library for building UIs",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "start_urls": [
        "https://react.dev/learn",
        "https://react.dev/reference/react",
        "https://react.dev/reference/react-dom"
      ],
      "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code"
      },
      "url_patterns": {
        "include": ["/learn/", "/reference/"],
        "exclude": ["/community/", "/search"]
      },
      "categories": {
        "getting_started": ["learn", "tutorial"],
        "api": ["reference", "api"]
      },
      "rate_limit": 0.5,
      "max_pages": 300
    }
  ]
}

Django GitHub

{
  "name": "django-github",
  "description": "Django web framework source code",
  "sources": [
    {
      "type": "github",
      "repo": "django/django",
      "enable_codebase_analysis": true,
      "code_analysis_depth": "deep",
      "fetch_issues": true,
      "max_issues": 100,
      "fetch_releases": true,
      "file_patterns": ["*.py"],
      "exclude_patterns": ["tests/**", "docs/**"]
    }
  ]
}

Unified Multi-Source

{
  "name": "godot-complete",
  "description": "Godot Engine - docs, source, and manual",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "docs",
      "name": "godot-docs",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "max_pages": 500
    },
    {
      "type": "github",
      "name": "godot-source",
      "repo": "godotengine/godot",
      "fetch_issues": false
    },
    {
      "type": "pdf",
      "name": "godot-manual",
      "pdf_path": "docs/godot-manual.pdf"
    }
  ]
}

Unified with New Source Types

{
  "name": "project-complete",
  "description": "Full project knowledge from multiple source types",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "docs",
      "name": "project-docs",
      "base_url": "https://docs.example.com/",
      "max_pages": 200
    },
    {
      "type": "github",
      "name": "project-code",
      "repo": "example/project"
    },
    {
      "type": "openapi",
      "name": "project-api",
      "spec_path": "api/openapi.yaml"
    },
    {
      "type": "confluence",
      "name": "project-wiki",
      "export_path": "./confluence-export/"
    },
    {
      "type": "jupyter",
      "name": "project-notebooks",
      "notebook_path": "./notebooks/"
    }
  ]
}

Local Project

{
  "name": "my-api",
  "description": "My REST API implementation",
  "sources": [
    {
      "type": "local",
      "directory": "./my-api-project",
      "languages": ["Python"],
      "file_patterns": ["*.py"],
      "exclude_patterns": ["tests/**", "migrations/**"],
      "analysis_depth": "comprehensive",
      "extract_api": true,
      "extract_test_examples": true
    }
  ]
}

Validation

Validate your config before scraping:

# Using CLI
skill-seekers scrape --config my-config.json --dry-run

# Using MCP tool
validate_config({"config": "my-config.json"})

See Also


For more examples, see the configs/ directory in the repository.