Config Format Reference - Skill Seekers
Version: 3.1.0
Last Updated: 2026-02-16
Complete JSON configuration specification
Table of Contents
- Overview
- Single-Source Config
- Unified (Multi-Source) Config
- Common Fields
- Selectors
- Categories
- URL Patterns
- Examples
Overview
Skill Seekers uses JSON configuration files to define scraping targets. There are two types:
| Type | Use Case | File |
|---|---|---|
| Single-Source | One source (docs, GitHub, PDF, or local) | *.json |
| Unified | Multiple sources combined | *-unified.json |
Single-Source Config
Documentation Source
For scraping documentation websites.
{
"name": "react",
"base_url": "https://react.dev/",
"description": "React - JavaScript library for building UIs",
"start_urls": [
"https://react.dev/learn",
"https://react.dev/reference/react"
],
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/learn/", "/reference/"],
"exclude": ["/blog/", "/community/"]
},
"categories": {
"getting_started": ["learn", "tutorial", "intro"],
"api": ["reference", "api", "hooks"]
},
"rate_limit": 0.5,
"max_pages": 300,
"merge_mode": "claude-enhanced"
}
Documentation Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | Skill name (alphanumeric, dashes, underscores) |
| base_url | string | Yes | - | Base documentation URL |
| description | string | No | "" | Skill description for SKILL.md |
| start_urls | array | No | [base_url] | URLs to start crawling from |
| selectors | object | No | see below | CSS selectors for content extraction |
| url_patterns | object | No | {} | Include/exclude URL patterns |
| categories | object | No | {} | Content categorization rules |
| rate_limit | number | No | 0.5 | Seconds between requests |
| max_pages | number | No | 500 | Maximum pages to scrape |
| merge_mode | string | No | "claude-enhanced" | Merge strategy |
| extract_api | boolean | No | false | Extract API references |
| llms_txt_url | string | No | auto | Path to llms.txt file |
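The defaults in the table above can be pictured as a merge step applied before scraping. The following is only a sketch of that idea, not the tool's own loader: `with_doc_defaults` and `DOC_DEFAULTS` are hypothetical names, and the fields whose default is listed as "see below" or "auto" are deliberately left out because the table gives no concrete value for them.

```python
from copy import deepcopy

# Defaults copied from the Documentation Fields table. selectors and
# llms_txt_url are omitted: their defaults are "see below" and "auto".
DOC_DEFAULTS = {
    "description": "",
    "url_patterns": {},
    "categories": {},
    "rate_limit": 0.5,
    "max_pages": 500,
    "merge_mode": "claude-enhanced",
    "extract_api": False,
}


def with_doc_defaults(config: dict) -> dict:
    """Return a copy of config with documented defaults filled in (hypothetical helper)."""
    for field in ("name", "base_url"):
        if field not in config:
            raise ValueError(f"missing required field: {field}")
    # Values from the user's config override the defaults.
    merged = {**deepcopy(DOC_DEFAULTS), **config}
    # start_urls falls back to a one-element list holding base_url.
    merged.setdefault("start_urls", [config["base_url"]])
    return merged


resolved = with_doc_defaults(
    {"name": "react", "base_url": "https://react.dev/", "max_pages": 300}
)
print(resolved["start_urls"], resolved["max_pages"], resolved["rate_limit"])
```

Only `name` and `base_url` are treated as mandatory here, matching the Required column.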
GitHub Source
For analyzing GitHub repositories.
{
"name": "react-github",
"type": "github",
"repo": "facebook/react",
"description": "React GitHub repository analysis",
"enable_codebase_analysis": true,
"code_analysis_depth": "deep",
"fetch_issues": true,
"max_issues": 100,
"issue_labels": ["bug", "enhancement"],
"fetch_releases": true,
"max_releases": 20,
"fetch_changelog": true,
"analyze_commit_history": true,
"file_patterns": ["*.js", "*.ts", "*.tsx"],
"exclude_patterns": ["*.test.js", "node_modules/**"],
"rate_limit": 1.0
}
GitHub Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | Skill name |
| type | string | Yes | - | Must be "github" |
| repo | string | Yes | - | Repository in owner/repo format |
| description | string | No | "" | Skill description |
| enable_codebase_analysis | boolean | No | true | Analyze source code |
| code_analysis_depth | string | No | "standard" | surface, standard, deep |
| fetch_issues | boolean | No | true | Fetch GitHub issues |
| max_issues | number | No | 100 | Maximum issues to fetch |
| issue_labels | array | No | [] | Filter by labels |
| fetch_releases | boolean | No | true | Fetch releases |
| max_releases | number | No | 20 | Maximum releases |
| fetch_changelog | boolean | No | true | Extract CHANGELOG |
| analyze_commit_history | boolean | No | false | Analyze commits |
| file_patterns | array | No | [] | Include file patterns |
| exclude_patterns | array | No | [] | Exclude file patterns |
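A quick local check of the three required fields and the allowed `code_analysis_depth` values might look like the sketch below. `github_config_errors` is a hypothetical helper, and the character set accepted for owner and repository names is an assumption, not something this reference specifies.

```python
import re

ALLOWED_DEPTHS = {"surface", "standard", "deep"}
# Assumption: owner and repo consist of letters, digits, ".", "_" and "-".
REPO_RE = re.compile(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")


def github_config_errors(config: dict) -> list[str]:
    """Collect human-readable problems found in a GitHub source config."""
    errors = []
    if not config.get("name"):
        errors.append("name is required")
    if config.get("type") != "github":
        errors.append('type must be "github"')
    if not REPO_RE.fullmatch(str(config.get("repo", ""))):
        errors.append("repo must use owner/repo format")
    # code_analysis_depth is optional and defaults to "standard".
    depth = config.get("code_analysis_depth", "standard")
    if depth not in ALLOWED_DEPTHS:
        errors.append(f"code_analysis_depth must be one of {sorted(ALLOWED_DEPTHS)}")
    return errors


print(github_config_errors({"name": "react-github", "type": "github", "repo": "facebook/react"}))
print(github_config_errors({"name": "bad", "type": "github", "repo": "react"}))
```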
PDF Source
For extracting content from PDF files.
{
"name": "product-manual",
"type": "pdf",
"pdf_path": "docs/manual.pdf",
"description": "Product documentation manual",
"enable_ocr": false,
"password": "",
"extract_images": true,
"image_output_dir": "output/images/",
"extract_tables": true,
"table_format": "markdown",
"page_range": [1, 100],
"split_by_chapters": true,
"chunk_size": 1000,
"chunk_overlap": 100
}
PDF Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | Skill name |
| type | string | Yes | - | Must be "pdf" |
| pdf_path | string | Yes | - | Path to PDF file |
| description | string | No | "" | Skill description |
| enable_ocr | boolean | No | false | OCR for scanned PDFs |
| password | string | No | "" | PDF password if encrypted |
| extract_images | boolean | No | false | Extract embedded images |
| image_output_dir | string | No | auto | Directory for images |
| extract_tables | boolean | No | false | Extract tables |
| table_format | string | No | "markdown" | markdown, json, csv |
| page_range | array | No | all | [start, end] page range |
| split_by_chapters | boolean | No | false | Split by detected chapters |
| chunk_size | number | No | 1000 | Characters per chunk |
| chunk_overlap | number | No | 100 | Overlap between chunks |
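To see how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of character-based chunking. It assumes each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so consecutive chunks share `chunk_overlap` characters. The reference does not say how the tool applies the overlap, so treat `chunk_text` as an illustration only.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With the default values, a 2,500-character text yields three chunks under this scheme: two of 1,000 characters and a final one of 700.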
Local Source
For analyzing local codebases.
{
"name": "my-project",
"type": "local",
"directory": "./my-project",
"description": "Local project analysis",
"languages": ["Python", "JavaScript"],
"file_patterns": ["*.py", "*.js"],
"exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],
"analysis_depth": "comprehensive",
"extract_api": true,
"extract_patterns": true,
"extract_test_examples": true,
"extract_how_to_guides": true,
"extract_config_patterns": true,
"include_comments": true,
"include_docstrings": true,
"include_readme": true
}
Local Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | Skill name |
| type | string | Yes | - | Must be "local" |
| directory | string | Yes | - | Path to directory |
| description | string | No | "" | Skill description |
| languages | array | No | auto | Languages to analyze |
| file_patterns | array | No | all | Include patterns |
| exclude_patterns | array | No | common | Exclude patterns |
| analysis_depth | string | No | "standard" | quick, standard, comprehensive |
| extract_api | boolean | No | true | Extract API documentation |
| extract_patterns | boolean | No | true | Detect patterns |
| extract_test_examples | boolean | No | true | Extract test examples |
| extract_how_to_guides | boolean | No | true | Generate guides |
| extract_config_patterns | boolean | No | true | Extract config patterns |
| include_comments | boolean | No | true | Include code comments |
| include_docstrings | boolean | No | true | Include docstrings |
| include_readme | boolean | No | true | Include README |
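The interplay of `file_patterns` and `exclude_patterns` can be approximated with Python's `fnmatch` module. This sketch assumes that patterns are shell-style globs tested against both the relative path and the bare file name, that an empty `file_patterns` list means "all files", and that exclusions win over inclusions. None of these matching details are stated for local sources, and `is_selected` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_selected(path: str, file_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Decide whether a relative POSIX-style path would be analyzed (illustrative only)."""
    name = PurePosixPath(path).name

    def matches(pattern: str) -> bool:
        # fnmatchcase's "*" also matches "/", so "node_modules/**" covers nested files.
        return fnmatchcase(path, pattern) or fnmatchcase(name, pattern)

    if any(matches(p) for p in exclude_patterns):
        return False
    if not file_patterns:
        return True
    return any(matches(p) for p in file_patterns)


include = ["*.py", "*.js"]
exclude = ["*.pyc", "node_modules/**", ".git/**"]
for candidate in ("src/app.py", "node_modules/lib/index.js", "README.md"):
    print(candidate, is_selected(candidate, include, exclude))
```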
Unified (Multi-Source) Config
Combine multiple sources into one skill with conflict detection.
{
"name": "react-complete",
"description": "React docs + GitHub + examples",
"merge_mode": "claude-enhanced",
"sources": [
{
"type": "docs",
"name": "react-docs",
"base_url": "https://react.dev/",
"max_pages": 200,
"categories": {
"getting_started": ["learn"],
"api": ["reference"]
}
},
{
"type": "github",
"name": "react-github",
"repo": "facebook/react",
"fetch_issues": true,
"max_issues": 50
},
{
"type": "pdf",
"name": "react-cheatsheet",
"pdf_path": "docs/react-cheatsheet.pdf"
},
{
"type": "local",
"name": "react-examples",
"directory": "./react-examples"
}
],
"conflict_detection": {
"enabled": true,
"rules": [
{
"field": "api_signature",
"action": "flag_mismatch"
}
]
},
"output_structure": {
"group_by_source": false,
"cross_reference": true
}
}
Unified Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | Combined skill name |
| description | string | No | "" | Skill description |
| merge_mode | string | No | "claude-enhanced" | rule-based, claude-enhanced |
| sources | array | Yes | - | List of source configs |
| conflict_detection | object | No | {} | Conflict detection settings |
| output_structure | object | No | {} | Output organization |
Source Types in Unified Config
Each entry in the sources array declares a type and must supply the field that type requires:
| Type | Required Fields |
|---|---|
| docs | base_url |
| github | repo |
| pdf | pdf_path |
| local | directory |
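The type-to-field mapping above lends itself to a simple pre-flight check over a unified config. The sketch below is a hypothetical helper (`source_errors`), not part of Skill Seekers. It only verifies that each source names a known type and carries that type's required field.

```python
# Mapping taken from the Source Types table.
REQUIRED_BY_TYPE = {
    "docs": "base_url",
    "github": "repo",
    "pdf": "pdf_path",
    "local": "directory",
}


def source_errors(unified_config: dict) -> list[str]:
    """Report sources with an unknown type or a missing required field."""
    errors = []
    for index, source in enumerate(unified_config.get("sources", [])):
        source_type = source.get("type")
        required = REQUIRED_BY_TYPE.get(source_type)
        if required is None:
            errors.append(f"sources[{index}]: unknown type {source_type!r}")
        elif not source.get(required):
            errors.append(f"sources[{index}]: {source_type} source requires {required}")
    return errors


config = {
    "name": "react-complete",
    "sources": [
        {"type": "docs", "name": "react-docs", "base_url": "https://react.dev/"},
        {"type": "github", "name": "react-github"},
        {"type": "wiki", "name": "react-wiki"},
    ],
}
print(source_errors(config))
```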
Common Fields
Fields available in all config types:
| Field | Type | Description |
|---|---|---|
| name | string | Skill identifier (letters, numbers, dashes, underscores) |
| description | string | Human-readable description |
| rate_limit | number | Delay between requests in seconds |
| output_dir | string | Custom output directory |
| skip_scrape | boolean | Use existing data |
| enhance_level | number | 0=off, 1=SKILL.md, 2=+config, 3=full |
Selectors
CSS selectors for content extraction from HTML:
{
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code",
"navigation": "nav.sidebar",
"breadcrumbs": "nav[aria-label='breadcrumb']",
"next_page": "a[rel='next']",
"prev_page": "a[rel='prev']"
}
}
Default Selectors
If not specified, these defaults are used:
| Element | Default Selector |
|---|---|
| main_content | article, main, .content, #content, [role='main'] |
| title | h1, .page-title, title |
| code_blocks | pre code, code[class*="language-"] |
| navigation | nav, .sidebar, .toc |
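One plausible reading of "if not specified" is that the fallback applies per selector key, so a config that sets only `title` keeps the default `main_content`. The sketch below encodes that assumption. `resolve_selectors` is a hypothetical helper, and the tool's real behaviour may differ.

```python
# Default selector strings copied from the Default Selectors table.
DEFAULT_SELECTORS = {
    "main_content": "article, main, .content, #content, [role='main']",
    "title": "h1, .page-title, title",
    "code_blocks": 'pre code, code[class*="language-"]',
    "navigation": "nav, .sidebar, .toc",
}


def resolve_selectors(config: dict) -> dict:
    """Overlay the config's selectors on the defaults, key by key (assumed behaviour)."""
    return {**DEFAULT_SELECTORS, **config.get("selectors", {})}


selectors = resolve_selectors({"selectors": {"title": "h1.doc-title"}})
print(selectors["title"])         # h1.doc-title
print(selectors["main_content"])  # article, main, .content, #content, [role='main']
```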
Categories
Map each content category to the URL keywords that identify it:
{
"categories": {
"getting_started": [
"intro", "tutorial", "quickstart",
"installation", "getting-started"
],
"core_concepts": [
"concept", "fundamental", "architecture",
"principle", "overview"
],
"api_reference": [
"reference", "api", "method", "function",
"class", "interface", "type"
],
"guides": [
"guide", "how-to", "example", "recipe",
"pattern", "best-practice"
],
"advanced": [
"advanced", "expert", "performance",
"optimization", "internals"
]
}
}
Categories appear as sections in the generated SKILL.md.
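As an illustration of how keyword lists could drive categorization, the sketch below assigns a page to the first category whose keyword appears in the URL path. The first-match rule, the case-insensitive substring test, and the `"other"` fallback are all assumptions made for this example. The reference does not describe the actual matching algorithm.

```python
from urllib.parse import urlparse


def categorize(url: str, categories: dict[str, list[str]], fallback: str = "other") -> str:
    """Return the first category with a keyword found in the URL path (illustrative only)."""
    path = urlparse(url).path.lower()
    for category, keywords in categories.items():
        if any(keyword.lower() in path for keyword in keywords):
            return category
    return fallback


categories = {
    "getting_started": ["learn", "tutorial", "intro"],
    "api": ["reference", "api", "hooks"],
}
print(categorize("https://react.dev/learn/thinking-in-react", categories))    # getting_started
print(categorize("https://react.dev/reference/react/useState", categories))   # api
print(categorize("https://react.dev/community", categories))                  # other
```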
URL Patterns
Control which URLs are included or excluded:
{
"url_patterns": {
"include": [
"/docs/",
"/guide/",
"/api/",
"/reference/"
],
"exclude": [
"/blog/",
"/news/",
"/community/",
"/search",
"?print=1",
"/_static/",
"/_images/"
]
}
}
Pattern Rules
- Patterns are matched against the URL path
- Use * for wildcards: /api/v*/
- Use ** for recursive: /docs/**/*.html
- Exclude takes precedence over include
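The rules above can be sketched as a small filter. This is an interpretation, not the tool's implementation: it assumes `*` matches within one path segment, `**` matches across segments, a pattern may match anywhere in the path, and an empty include list allows everything. `pattern_to_regex` and `url_allowed` are hypothetical names. Because only the path is examined, a query-string pattern such as `?print=1` would need to be matched against the full URL instead.

```python
import re
from urllib.parse import urlparse


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL pattern with * and ** into a regular expression."""
    pieces = []
    # Splitting with a capture group keeps the wildcard tokens in the result.
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append(".*")      # recursive: may cross "/" boundaries
        elif part == "*":
            pieces.append("[^/]*")   # wildcard: stays within one path segment
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply include/exclude patterns to the URL path; exclude takes precedence."""
    path = urlparse(url).path
    if any(pattern_to_regex(p).search(path) for p in exclude):
        return False
    if not include:
        return True
    return any(pattern_to_regex(p).search(path) for p in include)


include = ["/docs/", "/api/v*/"]
exclude = ["/blog/", "/_static/"]
for url in (
    "https://example.com/docs/intro",
    "https://example.com/api/v2/users",
    "https://example.com/blog/docs/post",
    "https://example.com/about",
):
    print(url, url_allowed(url, include, exclude))
```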
Examples
React Documentation
{
"name": "react",
"base_url": "https://react.dev/",
"description": "React - JavaScript library for building UIs",
"start_urls": [
"https://react.dev/learn",
"https://react.dev/reference/react",
"https://react.dev/reference/react-dom"
],
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/learn/", "/reference/", "/blog/"],
"exclude": ["/community/", "/search"]
},
"categories": {
"getting_started": ["learn", "tutorial"],
"api": ["reference", "api"],
"blog": ["blog"]
},
"rate_limit": 0.5,
"max_pages": 300
}
Django GitHub
{
"name": "django-github",
"type": "github",
"repo": "django/django",
"description": "Django web framework source code",
"enable_codebase_analysis": true,
"code_analysis_depth": "deep",
"fetch_issues": true,
"max_issues": 100,
"fetch_releases": true,
"file_patterns": ["*.py"],
"exclude_patterns": ["tests/**", "docs/**"]
}
Unified Multi-Source
{
"name": "godot-complete",
"description": "Godot Engine - docs, source, and manual",
"merge_mode": "claude-enhanced",
"sources": [
{
"type": "docs",
"name": "godot-docs",
"base_url": "https://docs.godotengine.org/en/stable/",
"max_pages": 500
},
{
"type": "github",
"name": "godot-source",
"repo": "godotengine/godot",
"fetch_issues": false
},
{
"type": "pdf",
"name": "godot-manual",
"pdf_path": "docs/godot-manual.pdf"
}
]
}
Local Project
{
"name": "my-api",
"type": "local",
"directory": "./my-api-project",
"description": "My REST API implementation",
"languages": ["Python"],
"file_patterns": ["*.py"],
"exclude_patterns": ["tests/**", "migrations/**"],
"analysis_depth": "comprehensive",
"extract_api": true,
"extract_test_examples": true
}
Validation
Validate your config before scraping:
# Using CLI
skill-seekers scrape --config my-config.json --dry-run
# Using MCP tool
validate_config({"config": "my-config.json"})
See Also
- CLI Reference - Command reference
- Environment Variables - Configuration environment
For more examples, see the configs/ directory in the repository.