# Config Format Reference - Skill Seekers
> **Version:** 3.1.0
> **Last Updated:** 2026-02-16
> **Complete JSON configuration specification**
---
## Table of Contents
- [Overview](#overview)
- [Single-Source Config](#single-source-config)
  - [Documentation Source](#documentation-source)
  - [GitHub Source](#github-source)
  - [PDF Source](#pdf-source)
  - [Local Source](#local-source)
- [Unified (Multi-Source) Config](#unified-multi-source-config)
- [Common Fields](#common-fields)
- [Selectors](#selectors)
- [Categories](#categories)
- [URL Patterns](#url-patterns)
- [Examples](#examples)
- [Validation](#validation)
- [See Also](#see-also)
---
## Overview
Skill Seekers uses JSON configuration files to define scraping targets. There are two types:
| Type | Use Case | File |
|------|----------|------|
| **Single-Source** | One source (docs, GitHub, PDF, or local) | `*.json` |
| **Unified** | Multiple sources combined | `*-unified.json` |
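
The two config types can be told apart from the parsed JSON alone. The sketch below is a hypothetical helper, not part of Skill Seekers: it assumes a unified config is recognized by its `sources` array, and that a single-source config without a `type` key is a documentation source (as in the examples in this reference).

```python
import json


def config_kind(config: dict) -> str:
    """Classify a parsed config as 'unified' or as a single-source type."""
    if "sources" in config:
        return "unified"
    # Assumption: documentation configs carry no "type" key,
    # so a missing type is treated as a docs source.
    return config.get("type", "docs")


docs_config = json.loads('{"name": "react", "base_url": "https://react.dev/"}')
print(config_kind(docs_config))
print(config_kind({"name": "react-complete", "sources": []}))
```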
---
## Single-Source Config
### Documentation Source
For scraping documentation websites.
```json
{
  "name": "react",
  "base_url": "https://react.dev/",
  "description": "React - JavaScript library for building UIs",
  "start_urls": [
    "https://react.dev/learn",
    "https://react.dev/reference/react"
  ],
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/learn/", "/reference/"],
    "exclude": ["/blog/", "/community/"]
  },
  "categories": {
    "getting_started": ["learn", "tutorial", "intro"],
    "api": ["reference", "api", "hooks"]
  },
  "rate_limit": 0.5,
  "max_pages": 300,
  "merge_mode": "claude-enhanced"
}
```
#### Documentation Fields
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name (alphanumeric, dashes, underscores) |
| `base_url` | string | Yes | - | Base documentation URL |
| `description` | string | No | "" | Skill description for SKILL.md |
| `start_urls` | array | No | `[base_url]` | URLs to start crawling from |
| `selectors` | object | No | see below | CSS selectors for content extraction |
| `url_patterns` | object | No | `{}` | Include/exclude URL patterns |
| `categories` | object | No | `{}` | Content categorization rules |
| `rate_limit` | number | No | 0.5 | Seconds between requests |
| `max_pages` | number | No | 500 | Maximum pages to scrape |
| `merge_mode` | string | No | "claude-enhanced" | Merge strategy |
| `extract_api` | boolean | No | false | Extract API references |
| `llms_txt_url` | string | No | auto | URL of the site's llms.txt file (auto-detected if omitted) |
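
To make the defaults in the table concrete, here is a minimal sketch of how a partial documentation config could be resolved into a complete one. `with_doc_defaults` and `DOC_DEFAULTS` are hypothetical names for illustration; the values are copied from the table above, and the actual resolution logic inside Skill Seekers may differ.

```python
import copy

# Defaults copied from the Documentation Fields table (illustrative only).
DOC_DEFAULTS = {
    "description": "",
    "url_patterns": {},
    "categories": {},
    "rate_limit": 0.5,
    "max_pages": 500,
    "merge_mode": "claude-enhanced",
    "extract_api": False,
}


def with_doc_defaults(config: dict) -> dict:
    """Return a copy of `config` with table defaults filled in."""
    missing = [key for key in ("name", "base_url") if key not in config]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    resolved = {**copy.deepcopy(DOC_DEFAULTS), **config}
    # `start_urls` defaults to a one-element list holding `base_url`.
    resolved.setdefault("start_urls", [config["base_url"]])
    return resolved


resolved = with_doc_defaults({"name": "react", "base_url": "https://react.dev/"})
print(resolved["start_urls"], resolved["max_pages"])
```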
---
### GitHub Source
For analyzing GitHub repositories.
```json
{
  "name": "react-github",
  "type": "github",
  "repo": "facebook/react",
  "description": "React GitHub repository analysis",
  "enable_codebase_analysis": true,
  "code_analysis_depth": "deep",
  "fetch_issues": true,
  "max_issues": 100,
  "issue_labels": ["bug", "enhancement"],
  "fetch_releases": true,
  "max_releases": 20,
  "fetch_changelog": true,
  "analyze_commit_history": true,
  "file_patterns": ["*.js", "*.ts", "*.tsx"],
  "exclude_patterns": ["*.test.js", "node_modules/**"],
  "rate_limit": 1.0
}
```
#### GitHub Fields
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name |
| `type` | string | Yes | - | Must be `"github"` |
| `repo` | string | Yes | - | Repository in `owner/repo` format |
| `description` | string | No | "" | Skill description |
| `enable_codebase_analysis` | boolean | No | true | Analyze source code |
| `code_analysis_depth` | string | No | "standard" | `surface`, `standard`, `deep` |
| `fetch_issues` | boolean | No | true | Fetch GitHub issues |
| `max_issues` | number | No | 100 | Maximum issues to fetch |
| `issue_labels` | array | No | [] | Filter by labels |
| `fetch_releases` | boolean | No | true | Fetch releases |
| `max_releases` | number | No | 20 | Maximum releases |
| `fetch_changelog` | boolean | No | true | Extract CHANGELOG |
| `analyze_commit_history` | boolean | No | false | Analyze commits |
| `file_patterns` | array | No | [] | Include file patterns |
| `exclude_patterns` | array | No | [] | Exclude file patterns |
---
### PDF Source
For extracting content from PDF files.
```json
{
  "name": "product-manual",
  "type": "pdf",
  "pdf_path": "docs/manual.pdf",
  "description": "Product documentation manual",
  "enable_ocr": false,
  "password": "",
  "extract_images": true,
  "image_output_dir": "output/images/",
  "extract_tables": true,
  "table_format": "markdown",
  "page_range": [1, 100],
  "split_by_chapters": true,
  "chunk_size": 1000,
  "chunk_overlap": 100
}
```
#### PDF Fields
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name |
| `type` | string | Yes | - | Must be `"pdf"` |
| `pdf_path` | string | Yes | - | Path to PDF file |
| `description` | string | No | "" | Skill description |
| `enable_ocr` | boolean | No | false | OCR for scanned PDFs |
| `password` | string | No | "" | PDF password if encrypted |
| `extract_images` | boolean | No | false | Extract embedded images |
| `image_output_dir` | string | No | auto | Directory for images |
| `extract_tables` | boolean | No | false | Extract tables |
| `table_format` | string | No | "markdown" | `markdown`, `json`, `csv` |
| `page_range` | array | No | all | `[start, end]` page range |
| `split_by_chapters` | boolean | No | false | Split by detected chapters |
| `chunk_size` | number | No | 1000 | Characters per chunk |
| `chunk_overlap` | number | No | 100 | Overlap between chunks |
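
The interaction between `chunk_size` and `chunk_overlap` is easiest to see in code. The following is a minimal sketch of character-based chunking with overlap, assuming each chunk starts `chunk_size - chunk_overlap` characters after the previous one. `chunk_text` is a hypothetical function, and the chunker used by Skill Seekers may split differently (for example on sentence or chapter boundaries).

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters,
    where consecutive chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# -> ['abcd', 'defg', 'ghij']
```

With the defaults from the table, a 2,500-character document would yield three chunks under this scheme, starting at offsets 0, 900, and 1800.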
---
### Local Source
For analyzing local codebases.
```json
{
  "name": "my-project",
  "type": "local",
  "directory": "./my-project",
  "description": "Local project analysis",
  "languages": ["Python", "JavaScript"],
  "file_patterns": ["*.py", "*.js"],
  "exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],
  "analysis_depth": "comprehensive",
  "extract_api": true,
  "extract_patterns": true,
  "extract_test_examples": true,
  "extract_how_to_guides": true,
  "extract_config_patterns": true,
  "include_comments": true,
  "include_docstrings": true,
  "include_readme": true
}
```
#### Local Fields
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Skill name |
| `type` | string | Yes | - | Must be `"local"` |
| `directory` | string | Yes | - | Path to directory |
| `description` | string | No | "" | Skill description |
| `languages` | array | No | auto | Languages to analyze |
| `file_patterns` | array | No | all | Include patterns |
| `exclude_patterns` | array | No | common | Exclude patterns |
| `analysis_depth` | string | No | "standard" | `quick`, `standard`, `comprehensive` |
| `extract_api` | boolean | No | true | Extract API documentation |
| `extract_patterns` | boolean | No | true | Detect patterns |
| `extract_test_examples` | boolean | No | true | Extract test examples |
| `extract_how_to_guides` | boolean | No | true | Generate guides |
| `extract_config_patterns` | boolean | No | true | Extract config patterns |
| `include_comments` | boolean | No | true | Include code comments |
| `include_docstrings` | boolean | No | true | Include docstrings |
| `include_readme` | boolean | No | true | Include README |
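
As an illustration of how `file_patterns` and `exclude_patterns` might combine, the sketch below filters relative paths with Python's standard `fnmatch` globbing. It assumes that excludes win over includes and that an empty include list selects everything. `is_selected` is a hypothetical helper, and the matcher Skill Seekers actually uses may treat `**` differently.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_selected(path: str, file_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Decide whether a relative POSIX-style path should be analyzed."""
    name = PurePosixPath(path).name

    def matches(pattern: str) -> bool:
        # Try the full relative path first, then the bare file name.
        return fnmatchcase(path, pattern) or fnmatchcase(name, pattern)

    if any(matches(pattern) for pattern in exclude_patterns):
        return False
    if not file_patterns:
        return True  # assumption: no include patterns means "all files"
    return any(matches(pattern) for pattern in file_patterns)


files = ["src/app.py", "src/app.pyc", "node_modules/lodash/index.js", "web/main.js", "README.md"]
include = ["*.py", "*.js"]
exclude = ["*.pyc", "node_modules/**", ".git/**"]
print([f for f in files if is_selected(f, include, exclude)])
# -> ['src/app.py', 'web/main.js']
```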
---
## Unified (Multi-Source) Config
Combine multiple sources into one skill with conflict detection.
```json
{
  "name": "react-complete",
  "description": "React docs + GitHub + examples",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "docs",
      "name": "react-docs",
      "base_url": "https://react.dev/",
      "max_pages": 200,
      "categories": {
        "getting_started": ["learn"],
        "api": ["reference"]
      }
    },
    {
      "type": "github",
      "name": "react-github",
      "repo": "facebook/react",
      "fetch_issues": true,
      "max_issues": 50
    },
    {
      "type": "pdf",
      "name": "react-cheatsheet",
      "pdf_path": "docs/react-cheatsheet.pdf"
    },
    {
      "type": "local",
      "name": "react-examples",
      "directory": "./react-examples"
    }
  ],
  "conflict_detection": {
    "enabled": true,
    "rules": [
      {
        "field": "api_signature",
        "action": "flag_mismatch"
      }
    ]
  },
  "output_structure": {
    "group_by_source": false,
    "cross_reference": true
  }
}
```
#### Unified Fields
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Combined skill name |
| `description` | string | No | "" | Skill description |
| `merge_mode` | string | No | "claude-enhanced" | `rule-based`, `claude-enhanced` |
| `sources` | array | Yes | - | List of source configs |
| `conflict_detection` | object | No | `{}` | Conflict detection settings |
| `output_structure` | object | No | `{}` | Output organization |
#### Source Types in Unified Config
Each source in the `sources` array can be:
| Type | Required Fields |
|------|-----------------|
| `docs` | `base_url` |
| `github` | `repo` |
| `pdf` | `pdf_path` |
| `local` | `directory` |
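
The per-type requirements above lend themselves to a quick pre-flight check. The sketch below is a hypothetical helper (not the project's own validator) that reports sources with an unknown `type` or a missing required field, using only the mapping from the table.

```python
# Required field per source type, taken from the table above.
REQUIRED_BY_TYPE = {
    "docs": "base_url",
    "github": "repo",
    "pdf": "pdf_path",
    "local": "directory",
}


def source_errors(sources: list[dict]) -> list[str]:
    """Return one human-readable message per invalid source entry."""
    errors = []
    for index, source in enumerate(sources):
        kind = source.get("type")
        if kind not in REQUIRED_BY_TYPE:
            errors.append(f"sources[{index}]: unknown type {kind!r}")
            continue
        field = REQUIRED_BY_TYPE[kind]
        if not source.get(field):
            errors.append(f"sources[{index}] ({kind}): missing {field!r}")
    return errors


print(source_errors([
    {"type": "docs", "base_url": "https://react.dev/"},
    {"type": "github"},
]))
# -> ["sources[1] (github): missing 'repo'"]
```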
---
## Common Fields
Fields available in all config types:
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Skill identifier (letters, numbers, dashes, underscores) |
| `description` | string | Human-readable description |
| `rate_limit` | number | Delay between requests in seconds |
| `output_dir` | string | Custom output directory |
| `skip_scrape` | boolean | Use existing data |
| `enhance_level` | number | 0=off, 1=SKILL.md, 2=+config, 3=full |
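
The `name` rule (letters, numbers, dashes, underscores) can be checked with a short regular expression. This is a sketch under the assumption that only ASCII letters and digits are intended; `is_valid_name` is a hypothetical helper.

```python
import re

# ASCII letters, digits, dashes, and underscores; at least one character.
NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name: str) -> bool:
    """Check a skill name against the character rule in the table above."""
    return NAME_RE.fullmatch(name) is not None


print(is_valid_name("react-github"))  # True
print(is_valid_name("my skill"))      # False (contains a space)
```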
---
## Selectors
CSS selectors for content extraction from HTML:
```json
{
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code",
    "navigation": "nav.sidebar",
    "breadcrumbs": "nav[aria-label='breadcrumb']",
    "next_page": "a[rel='next']",
    "prev_page": "a[rel='prev']"
  }
}
```
### Default Selectors
If not specified, these defaults are used:
| Element | Default Selector |
|---------|-----------------|
| `main_content` | `article, main, .content, #content, [role='main']` |
| `title` | `h1, .page-title, title` |
| `code_blocks` | `pre code, code[class*="language-"]` |
| `navigation` | `nav, .sidebar, .toc` |
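
One plausible reading of "if not specified" is a per-key fallback: any selector the config omits takes its default, while the ones it sets are used as written. The sketch below illustrates that reading. `resolve_selectors` is a hypothetical helper, and per-key (rather than whole-object) fallback is an assumption.

```python
# Defaults copied from the Default Selectors table.
DEFAULT_SELECTORS = {
    "main_content": "article, main, .content, #content, [role='main']",
    "title": "h1, .page-title, title",
    "code_blocks": 'pre code, code[class*="language-"]',
    "navigation": "nav, .sidebar, .toc",
}


def resolve_selectors(config: dict) -> dict:
    """Overlay the config's selectors on top of the defaults, key by key."""
    return {**DEFAULT_SELECTORS, **config.get("selectors", {})}


selectors = resolve_selectors({"selectors": {"main_content": "article"}})
print(selectors["main_content"])  # article
print(selectors["title"])         # h1, .page-title, title
```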
---
## Categories
Map each content category to a list of keywords that identify its pages:
```json
{
  "categories": {
    "getting_started": [
      "intro", "tutorial", "quickstart",
      "installation", "getting-started"
    ],
    "core_concepts": [
      "concept", "fundamental", "architecture",
      "principle", "overview"
    ],
    "api_reference": [
      "reference", "api", "method", "function",
      "class", "interface", "type"
    ],
    "guides": [
      "guide", "how-to", "example", "recipe",
      "pattern", "best-practice"
    ],
    "advanced": [
      "advanced", "expert", "performance",
      "optimization", "internals"
    ]
  }
}
```
Categories appear as sections in the generated SKILL.md.
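
A simple way to picture the categorization step is a keyword lookup against each page URL. The sketch below assumes case-insensitive substring matching with the first matching category winning. `categorize` and the `"other"` fallback are hypothetical, and the real scraper may also inspect titles or page content.

```python
def categorize(url: str, categories: dict[str, list[str]], default: str = "other") -> str:
    """Return the first category whose keywords appear in the URL."""
    lowered = url.lower()
    for category, keywords in categories.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return category
    return default


categories = {
    "getting_started": ["learn", "tutorial"],
    "api": ["reference", "api"],
}
print(categorize("https://react.dev/learn/thinking-in-react", categories))    # getting_started
print(categorize("https://react.dev/reference/react/useState", categories))  # api
print(categorize("https://react.dev/community", categories))                  # other
```

Because dictionaries preserve insertion order, the order of categories in the config decides ties in this sketch.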
---
## URL Patterns
Control which URLs are included or excluded:
```json
{
  "url_patterns": {
    "include": [
      "/docs/",
      "/guide/",
      "/api/",
      "/reference/"
    ],
    "exclude": [
      "/blog/",
      "/news/",
      "/community/",
      "/search",
      "?print=1",
      "/_static/",
      "/_images/"
    ]
  }
}
```
### Pattern Rules
- Patterns are matched against the URL path
- Use `*` for wildcards: `/api/v*/`
- Use `**` for recursive: `/docs/**/*.html`
- Exclude takes precedence over include
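
The rules above can be sketched as a small filter. This is an illustrative interpretation, not the project's matcher: it assumes patterns match anywhere within the URL path, that `*` stays within one path segment while `**` may cross `/`, and that an empty include list allows every URL. `pattern_to_regex` and `is_allowed` are hypothetical names.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL pattern with `*` and `**` wildcards into a regex."""
    pieces = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append(".*")      # recursive: may span several segments
        elif part == "*":
            pieces.append("[^/]*")   # wildcard: stays within one segment
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def is_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply include/exclude patterns to the URL path; exclude wins."""
    path = urlsplit(url).path
    if any(pattern_to_regex(p).search(path) for p in exclude):
        return False
    if not include:
        return True  # assumption: no include patterns means "allow all"
    return any(pattern_to_regex(p).search(path) for p in include)


print(is_allowed("https://example.com/api/v2/users", ["/api/v*/"], []))         # True
print(is_allowed("https://example.com/docs/blog/post", ["/docs/"], ["/blog/"]))  # False
```

Since this sketch looks only at the path, a query-string pattern such as `?print=1` would never match here; handling it would require matching against the path plus query instead.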
---
## Examples
### React Documentation
```json
{
  "name": "react",
  "base_url": "https://react.dev/",
  "description": "React - JavaScript library for building UIs",
  "start_urls": [
    "https://react.dev/learn",
    "https://react.dev/reference/react",
    "https://react.dev/reference/react-dom"
  ],
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/learn/", "/reference/", "/blog/"],
    "exclude": ["/community/", "/search"]
  },
  "categories": {
    "getting_started": ["learn", "tutorial"],
    "api": ["reference", "api"],
    "blog": ["blog"]
  },
  "rate_limit": 0.5,
  "max_pages": 300
}
```
### Django GitHub
```json
{
  "name": "django-github",
  "type": "github",
  "repo": "django/django",
  "description": "Django web framework source code",
  "enable_codebase_analysis": true,
  "code_analysis_depth": "deep",
  "fetch_issues": true,
  "max_issues": 100,
  "fetch_releases": true,
  "file_patterns": ["*.py"],
  "exclude_patterns": ["tests/**", "docs/**"]
}
```
### Unified Multi-Source
```json
{
  "name": "godot-complete",
  "description": "Godot Engine - docs, source, and manual",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "docs",
      "name": "godot-docs",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "max_pages": 500
    },
    {
      "type": "github",
      "name": "godot-source",
      "repo": "godotengine/godot",
      "fetch_issues": false
    },
    {
      "type": "pdf",
      "name": "godot-manual",
      "pdf_path": "docs/godot-manual.pdf"
    }
  ]
}
```
### Local Project
```json
{
  "name": "my-api",
  "type": "local",
  "directory": "./my-api-project",
  "description": "My REST API implementation",
  "languages": ["Python"],
  "file_patterns": ["*.py"],
  "exclude_patterns": ["tests/**", "migrations/**"],
  "analysis_depth": "comprehensive",
  "extract_api": true,
  "extract_test_examples": true
}
```
---
## Validation
Validate your config before scraping:
```bash
# Using CLI
skill-seekers scrape --config my-config.json --dry-run
# Using MCP tool
validate_config({"config": "my-config.json"})
```
---
## See Also
- [CLI Reference](CLI_REFERENCE.md) - Command reference
- [Environment Variables](ENVIRONMENT_VARIABLES.md) - Configuration environment
---
*For more examples, see the `configs/` directory in the repository.*