docs: complete documentation overhaul with v3.1.0 release notes and zh-CN translations
Documentation restructure: - New docs/getting-started/ guide (4 files: install, quick-start, first-skill, next-steps) - New docs/user-guide/ section (6 files: core concepts through troubleshooting) - New docs/reference/ section (CLI_REFERENCE, CONFIG_FORMAT, ENVIRONMENT_VARIABLES, MCP_REFERENCE) - New docs/advanced/ section (custom-workflows, mcp-server, multi-source) - New docs/ARCHITECTURE.md - system architecture overview - Archived legacy files (QUICKSTART.md, QUICK_REFERENCE.md, docs/guides/USAGE.md) to docs/archive/legacy/ Chinese (zh-CN) translations: - Full zh-CN mirror of all user-facing docs (getting-started, user-guide, reference, advanced) - GitHub Actions workflow for translation sync (.github/workflows/translate-docs.yml) - Translation sync checker script (scripts/check_translation_sync.sh) - Translation helper script (scripts/translate_doc.py) Content updates: - CHANGELOG.md: [Unreleased] → [3.1.0] - 2026-02-22 - README.md: updated with new doc structure links - AGENTS.md: updated agent documentation - docs/features/UNIFIED_SCRAPING.md: updated for unified scraper workflow JSON config Analysis/planning artifacts (kept for reference): - DOCUMENTATION_OVERHAUL_PLAN.md, DOCUMENTATION_OVERHAUL_SUMMARY.md - FEATURE_GAP_ANALYSIS.md, IMPLEMENTATION_GAPS_ANALYSIS.md, CREATE_COMMAND_COVERAGE_ANALYSIS.md - CHINESE_TRANSLATION_IMPLEMENTATION_SUMMARY.md, ISSUE_260_UPDATE.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
610
docs/reference/CONFIG_FORMAT.md
Normal file
610
docs/reference/CONFIG_FORMAT.md
Normal file
@@ -0,0 +1,610 @@
|
||||
# Config Format Reference - Skill Seekers
|
||||
|
||||
> **Version:** 3.1.0
|
||||
> **Last Updated:** 2026-02-16
|
||||
> **Complete JSON configuration specification**
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Single-Source Config](#single-source-config)
|
||||
- [Documentation Source](#documentation-source)
|
||||
- [GitHub Source](#github-source)
|
||||
- [PDF Source](#pdf-source)
|
||||
- [Local Source](#local-source)
|
||||
- [Unified (Multi-Source) Config](#unified-multi-source-config)
|
||||
- [Common Fields](#common-fields)
|
||||
- [Selectors](#selectors)
|
||||
- [Categories](#categories)
|
||||
- [URL Patterns](#url-patterns)
|
||||
- [Examples](#examples)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Skill Seekers uses JSON configuration files to define scraping targets. There are two types:
|
||||
|
||||
| Type | Use Case | File |
|
||||
|------|----------|------|
|
||||
| **Single-Source** | One source (docs, GitHub, PDF, or local) | `*.json` |
|
||||
| **Unified** | Multiple sources combined | `*-unified.json` |
|
||||
|
||||
---
|
||||
|
||||
## Single-Source Config
|
||||
|
||||
### Documentation Source
|
||||
|
||||
For scraping documentation websites.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "react",
|
||||
"base_url": "https://react.dev/",
|
||||
"description": "React - JavaScript library for building UIs",
|
||||
|
||||
"start_urls": [
|
||||
"https://react.dev/learn",
|
||||
"https://react.dev/reference/react"
|
||||
],
|
||||
|
||||
"selectors": {
|
||||
"main_content": "article",
|
||||
"title": "h1",
|
||||
"code_blocks": "pre code"
|
||||
},
|
||||
|
||||
"url_patterns": {
|
||||
"include": ["/learn/", "/reference/"],
|
||||
"exclude": ["/blog/", "/community/"]
|
||||
},
|
||||
|
||||
"categories": {
|
||||
"getting_started": ["learn", "tutorial", "intro"],
|
||||
"api": ["reference", "api", "hooks"]
|
||||
},
|
||||
|
||||
"rate_limit": 0.5,
|
||||
"max_pages": 300,
|
||||
"merge_mode": "claude-enhanced"
|
||||
}
|
||||
```
|
||||
|
||||
#### Documentation Fields
|
||||
|
||||
| Field | Type | Required | Default | Description |
|
||||
|-------|------|----------|---------|-------------|
|
||||
| `name` | string | Yes | - | Skill name (alphanumeric, dashes, underscores) |
|
||||
| `base_url` | string | Yes | - | Base documentation URL |
|
||||
| `description` | string | No | "" | Skill description for SKILL.md |
|
||||
| `start_urls` | array | No | `[base_url]` | URLs to start crawling from |
|
||||
| `selectors` | object | No | see below | CSS selectors for content extraction |
|
||||
| `url_patterns` | object | No | `{}` | Include/exclude URL patterns |
|
||||
| `categories` | object | No | `{}` | Content categorization rules |
|
||||
| `rate_limit` | number | No | 0.5 | Seconds between requests |
|
||||
| `max_pages` | number | No | 500 | Maximum pages to scrape |
|
||||
| `merge_mode` | string | No | "claude-enhanced" | Merge strategy |
|
||||
| `extract_api` | boolean | No | false | Extract API references |
|
||||
| `llms_txt_url` | string | No | auto | Path to llms.txt file |
|
||||
|
||||
---
|
||||
|
||||
### GitHub Source
|
||||
|
||||
For analyzing GitHub repositories.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "react-github",
|
||||
"type": "github",
|
||||
"repo": "facebook/react",
|
||||
"description": "React GitHub repository analysis",
|
||||
|
||||
"enable_codebase_analysis": true,
|
||||
"code_analysis_depth": "deep",
|
||||
|
||||
"fetch_issues": true,
|
||||
"max_issues": 100,
|
||||
"issue_labels": ["bug", "enhancement"],
|
||||
|
||||
"fetch_releases": true,
|
||||
"max_releases": 20,
|
||||
|
||||
"fetch_changelog": true,
|
||||
"analyze_commit_history": true,
|
||||
|
||||
"file_patterns": ["*.js", "*.ts", "*.tsx"],
|
||||
"exclude_patterns": ["*.test.js", "node_modules/**"],
|
||||
|
||||
"rate_limit": 1.0
|
||||
}
|
||||
```
|
||||
|
||||
#### GitHub Fields
|
||||
|
||||
| Field | Type | Required | Default | Description |
|
||||
|-------|------|----------|---------|-------------|
|
||||
| `name` | string | Yes | - | Skill name |
|
||||
| `type` | string | Yes | - | Must be `"github"` |
|
||||
| `repo` | string | Yes | - | Repository in `owner/repo` format |
|
||||
| `description` | string | No | "" | Skill description |
|
||||
| `enable_codebase_analysis` | boolean | No | true | Analyze source code |
|
||||
| `code_analysis_depth` | string | No | "standard" | `surface`, `standard`, `deep` |
|
||||
| `fetch_issues` | boolean | No | true | Fetch GitHub issues |
|
||||
| `max_issues` | number | No | 100 | Maximum issues to fetch |
|
||||
| `issue_labels` | array | No | [] | Filter by labels |
|
||||
| `fetch_releases` | boolean | No | true | Fetch releases |
|
||||
| `max_releases` | number | No | 20 | Maximum releases |
|
||||
| `fetch_changelog` | boolean | No | true | Extract CHANGELOG |
|
||||
| `analyze_commit_history` | boolean | No | false | Analyze commits |
|
||||
| `file_patterns` | array | No | [] | Include file patterns |
|
||||
| `exclude_patterns` | array | No | [] | Exclude file patterns |
|
||||
|
||||
---
|
||||
|
||||
### PDF Source
|
||||
|
||||
For extracting content from PDF files.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "product-manual",
|
||||
"type": "pdf",
|
||||
"pdf_path": "docs/manual.pdf",
|
||||
"description": "Product documentation manual",
|
||||
|
||||
"enable_ocr": false,
|
||||
"password": "",
|
||||
|
||||
"extract_images": true,
|
||||
"image_output_dir": "output/images/",
|
||||
|
||||
"extract_tables": true,
|
||||
"table_format": "markdown",
|
||||
|
||||
"page_range": [1, 100],
|
||||
"split_by_chapters": true,
|
||||
|
||||
"chunk_size": 1000,
|
||||
"chunk_overlap": 100
|
||||
}
|
||||
```
|
||||
|
||||
#### PDF Fields
|
||||
|
||||
| Field | Type | Required | Default | Description |
|
||||
|-------|------|----------|---------|-------------|
|
||||
| `name` | string | Yes | - | Skill name |
|
||||
| `type` | string | Yes | - | Must be `"pdf"` |
|
||||
| `pdf_path` | string | Yes | - | Path to PDF file |
|
||||
| `description` | string | No | "" | Skill description |
|
||||
| `enable_ocr` | boolean | No | false | OCR for scanned PDFs |
|
||||
| `password` | string | No | "" | PDF password if encrypted |
|
||||
| `extract_images` | boolean | No | false | Extract embedded images |
|
||||
| `image_output_dir` | string | No | auto | Directory for images |
|
||||
| `extract_tables` | boolean | No | false | Extract tables |
|
||||
| `table_format` | string | No | "markdown" | `markdown`, `json`, `csv` |
|
||||
| `page_range` | array | No | all | `[start, end]` page range |
|
||||
| `split_by_chapters` | boolean | No | false | Split by detected chapters |
|
||||
| `chunk_size` | number | No | 1000 | Characters per chunk |
|
||||
| `chunk_overlap` | number | No | 100 | Overlap between chunks |
|
||||
|
||||
---
|
||||
|
||||
### Local Source
|
||||
|
||||
For analyzing local codebases.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "my-project",
|
||||
"type": "local",
|
||||
"directory": "./my-project",
|
||||
"description": "Local project analysis",
|
||||
|
||||
"languages": ["Python", "JavaScript"],
|
||||
"file_patterns": ["*.py", "*.js"],
|
||||
"exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],
|
||||
|
||||
"analysis_depth": "comprehensive",
|
||||
|
||||
"extract_api": true,
|
||||
"extract_patterns": true,
|
||||
"extract_test_examples": true,
|
||||
"extract_how_to_guides": true,
|
||||
"extract_config_patterns": true,
|
||||
|
||||
"include_comments": true,
|
||||
"include_docstrings": true,
|
||||
"include_readme": true
|
||||
}
|
||||
```
|
||||
|
||||
#### Local Fields
|
||||
|
||||
| Field | Type | Required | Default | Description |
|
||||
|-------|------|----------|---------|-------------|
|
||||
| `name` | string | Yes | - | Skill name |
|
||||
| `type` | string | Yes | - | Must be `"local"` |
|
||||
| `directory` | string | Yes | - | Path to directory |
|
||||
| `description` | string | No | "" | Skill description |
|
||||
| `languages` | array | No | auto | Languages to analyze |
|
||||
| `file_patterns` | array | No | all | Include patterns |
|
||||
| `exclude_patterns` | array | No | common | Exclude patterns |
|
||||
| `analysis_depth` | string | No | "standard" | `quick`, `standard`, `comprehensive` |
|
||||
| `extract_api` | boolean | No | true | Extract API documentation |
|
||||
| `extract_patterns` | boolean | No | true | Detect patterns |
|
||||
| `extract_test_examples` | boolean | No | true | Extract test examples |
|
||||
| `extract_how_to_guides` | boolean | No | true | Generate guides |
|
||||
| `extract_config_patterns` | boolean | No | true | Extract config patterns |
|
||||
| `include_comments` | boolean | No | true | Include code comments |
|
||||
| `include_docstrings` | boolean | No | true | Include docstrings |
|
||||
| `include_readme` | boolean | No | true | Include README |
|
||||
|
||||
---
|
||||
|
||||
## Unified (Multi-Source) Config
|
||||
|
||||
Combine multiple sources into one skill with conflict detection.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "react-complete",
|
||||
"description": "React docs + GitHub + examples",
|
||||
"merge_mode": "claude-enhanced",
|
||||
|
||||
"sources": [
|
||||
{
|
||||
"type": "docs",
|
||||
"name": "react-docs",
|
||||
"base_url": "https://react.dev/",
|
||||
"max_pages": 200,
|
||||
"categories": {
|
||||
"getting_started": ["learn"],
|
||||
"api": ["reference"]
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "github",
|
||||
"name": "react-github",
|
||||
"repo": "facebook/react",
|
||||
"fetch_issues": true,
|
||||
"max_issues": 50
|
||||
},
|
||||
{
|
||||
"type": "pdf",
|
||||
"name": "react-cheatsheet",
|
||||
"pdf_path": "docs/react-cheatsheet.pdf"
|
||||
},
|
||||
{
|
||||
"type": "local",
|
||||
"name": "react-examples",
|
||||
"directory": "./react-examples"
|
||||
}
|
||||
],
|
||||
|
||||
"conflict_detection": {
|
||||
"enabled": true,
|
||||
"rules": [
|
||||
{
|
||||
"field": "api_signature",
|
||||
"action": "flag_mismatch"
|
||||
}
|
||||
]
|
||||
},
|
||||
|
||||
"output_structure": {
|
||||
"group_by_source": false,
|
||||
"cross_reference": true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Unified Fields
|
||||
|
||||
| Field | Type | Required | Default | Description |
|
||||
|-------|------|----------|---------|-------------|
|
||||
| `name` | string | Yes | - | Combined skill name |
|
||||
| `description` | string | No | "" | Skill description |
|
||||
| `merge_mode` | string | No | "claude-enhanced" | `rule-based`, `claude-enhanced` |
|
||||
| `sources` | array | Yes | - | List of source configs |
|
||||
| `conflict_detection` | object | No | `{}` | Conflict detection settings |
|
||||
| `output_structure` | object | No | `{}` | Output organization |
|
||||
| `workflows` | array | No | `[]` | Workflow presets to apply |
|
||||
| `workflow_stages` | array | No | `[]` | Inline enhancement stages |
|
||||
| `workflow_vars` | object | No | `{}` | Workflow variable overrides |
|
||||
| `workflow_dry_run` | boolean | No | `false` | Preview workflows without executing |
|
||||
|
||||
#### Workflow Configuration (Unified)
|
||||
|
||||
Unified configs support defining enhancement workflows at the top level:
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "react-complete",
|
||||
"description": "React docs + GitHub with security enhancement",
|
||||
"merge_mode": "claude-enhanced",
|
||||
|
||||
"workflows": ["security-focus", "api-documentation"],
|
||||
"workflow_stages": [
|
||||
{
|
||||
"name": "cleanup",
|
||||
"prompt": "Remove boilerplate sections and standardize formatting"
|
||||
}
|
||||
],
|
||||
"workflow_vars": {
|
||||
"focus_area": "performance",
|
||||
"detail_level": "comprehensive"
|
||||
},
|
||||
|
||||
"sources": [
|
||||
{"type": "docs", "base_url": "https://react.dev/"},
|
||||
{"type": "github", "repo": "facebook/react"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Workflow Fields:**
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `workflows` | array | List of workflow preset names to apply |
|
||||
| `workflow_stages` | array | Inline stages with `name` and `prompt` |
|
||||
| `workflow_vars` | object | Key-value pairs for workflow variables |
|
||||
| `workflow_dry_run` | boolean | Preview workflows without executing |
|
||||
|
||||
**Note:** CLI flags override config values (CLI takes precedence).
|
||||
|
||||
#### Source Types in Unified Config
|
||||
|
||||
Each source in the `sources` array can be:
|
||||
|
||||
| Type | Required Fields |
|
||||
|------|-----------------|
|
||||
| `docs` | `base_url` |
|
||||
| `github` | `repo` |
|
||||
| `pdf` | `pdf_path` |
|
||||
| `local` | `directory` |
|
||||
|
||||
---
|
||||
|
||||
## Common Fields
|
||||
|
||||
Fields available in all config types:
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `name` | string | Skill identifier (letters, numbers, dashes, underscores) |
|
||||
| `description` | string | Human-readable description |
|
||||
| `rate_limit` | number | Delay between requests in seconds |
|
||||
| `output_dir` | string | Custom output directory |
|
||||
| `skip_scrape` | boolean | Use existing data |
|
||||
| `enhance_level` | number | 0=off, 1=SKILL.md, 2=+config, 3=full |
|
||||
|
||||
---
|
||||
|
||||
## Selectors
|
||||
|
||||
CSS selectors for content extraction from HTML:
|
||||
|
||||
```json
|
||||
{
|
||||
"selectors": {
|
||||
"main_content": "article",
|
||||
"title": "h1",
|
||||
"code_blocks": "pre code",
|
||||
"navigation": "nav.sidebar",
|
||||
"breadcrumbs": "nav[aria-label='breadcrumb']",
|
||||
"next_page": "a[rel='next']",
|
||||
"prev_page": "a[rel='prev']"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Default Selectors
|
||||
|
||||
If not specified, these defaults are used:
|
||||
|
||||
| Element | Default Selector |
|
||||
|---------|-----------------|
|
||||
| `main_content` | `article, main, .content, #content, [role='main']` |
|
||||
| `title` | `h1, .page-title, title` |
|
||||
| `code_blocks` | `pre code, code[class*="language-"]` |
|
||||
| `navigation` | `nav, .sidebar, .toc` |
|
||||
|
||||
---
|
||||
|
||||
## Categories
|
||||
|
||||
Map URL patterns to content categories:
|
||||
|
||||
```json
|
||||
{
|
||||
"categories": {
|
||||
"getting_started": [
|
||||
"intro", "tutorial", "quickstart",
|
||||
"installation", "getting-started"
|
||||
],
|
||||
"core_concepts": [
|
||||
"concept", "fundamental", "architecture",
|
||||
"principle", "overview"
|
||||
],
|
||||
"api_reference": [
|
||||
"reference", "api", "method", "function",
|
||||
"class", "interface", "type"
|
||||
],
|
||||
"guides": [
|
||||
"guide", "how-to", "example", "recipe",
|
||||
"pattern", "best-practice"
|
||||
],
|
||||
"advanced": [
|
||||
"advanced", "expert", "performance",
|
||||
"optimization", "internals"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Categories appear as sections in the generated SKILL.md.
|
||||
|
||||
---
|
||||
|
||||
## URL Patterns
|
||||
|
||||
Control which URLs are included or excluded:
|
||||
|
||||
```json
|
||||
{
|
||||
"url_patterns": {
|
||||
"include": [
|
||||
"/docs/",
|
||||
"/guide/",
|
||||
"/api/",
|
||||
"/reference/"
|
||||
],
|
||||
"exclude": [
|
||||
"/blog/",
|
||||
"/news/",
|
||||
"/community/",
|
||||
"/search",
|
||||
"?print=1",
|
||||
"/_static/",
|
||||
"/_images/"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Pattern Rules
|
||||
|
||||
- Patterns are matched against the URL path
|
||||
- Use `*` for wildcards: `/api/v*/`
|
||||
- Use `**` for recursive: `/docs/**/*.html`
|
||||
- Exclude takes precedence over include
|
||||
|
||||
---
|
||||
|
||||
## Examples
|
||||
|
||||
### React Documentation
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "react",
|
||||
"base_url": "https://react.dev/",
|
||||
"description": "React - JavaScript library for building UIs",
|
||||
"start_urls": [
|
||||
"https://react.dev/learn",
|
||||
"https://react.dev/reference/react",
|
||||
"https://react.dev/reference/react-dom"
|
||||
],
|
||||
"selectors": {
|
||||
"main_content": "article",
|
||||
"title": "h1",
|
||||
"code_blocks": "pre code"
|
||||
},
|
||||
"url_patterns": {
|
||||
"include": ["/learn/", "/reference/", "/blog/"],
|
||||
"exclude": ["/community/", "/search"]
|
||||
},
|
||||
"categories": {
|
||||
"getting_started": ["learn", "tutorial"],
|
||||
"api": ["reference", "api"],
|
||||
"blog": ["blog"]
|
||||
},
|
||||
"rate_limit": 0.5,
|
||||
"max_pages": 300
|
||||
}
|
||||
```
|
||||
|
||||
### Django GitHub
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "django-github",
|
||||
"type": "github",
|
||||
"repo": "django/django",
|
||||
"description": "Django web framework source code",
|
||||
"enable_codebase_analysis": true,
|
||||
"code_analysis_depth": "deep",
|
||||
"fetch_issues": true,
|
||||
"max_issues": 100,
|
||||
"fetch_releases": true,
|
||||
"file_patterns": ["*.py"],
|
||||
"exclude_patterns": ["tests/**", "docs/**"]
|
||||
}
|
||||
```
|
||||
|
||||
### Unified Multi-Source
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "godot-complete",
|
||||
"description": "Godot Engine - docs, source, and manual",
|
||||
"merge_mode": "claude-enhanced",
|
||||
"sources": [
|
||||
{
|
||||
"type": "docs",
|
||||
"name": "godot-docs",
|
||||
"base_url": "https://docs.godotengine.org/en/stable/",
|
||||
"max_pages": 500
|
||||
},
|
||||
{
|
||||
"type": "github",
|
||||
"name": "godot-source",
|
||||
"repo": "godotengine/godot",
|
||||
"fetch_issues": false
|
||||
},
|
||||
{
|
||||
"type": "pdf",
|
||||
"name": "godot-manual",
|
||||
"pdf_path": "docs/godot-manual.pdf"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Local Project
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "my-api",
|
||||
"type": "local",
|
||||
"directory": "./my-api-project",
|
||||
"description": "My REST API implementation",
|
||||
"languages": ["Python"],
|
||||
"file_patterns": ["*.py"],
|
||||
"exclude_patterns": ["tests/**", "migrations/**"],
|
||||
"analysis_depth": "comprehensive",
|
||||
"extract_api": true,
|
||||
"extract_test_examples": true
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Validation
|
||||
|
||||
Validate your config before scraping:
|
||||
|
||||
```bash
|
||||
# Using CLI
|
||||
skill-seekers scrape --config my-config.json --dry-run
|
||||
|
||||
# Using MCP tool
|
||||
validate_config({"config": "my-config.json"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- [CLI Reference](CLI_REFERENCE.md) - Command reference
|
||||
- [Environment Variables](ENVIRONMENT_VARIABLES.md) - Configuration environment
|
||||
|
||||
---
|
||||
|
||||
*For more examples, see `configs/` directory in the repository*
|
||||
Reference in New Issue
Block a user