Documentation restructure: - New docs/getting-started/ guide (4 files: install, quick-start, first-skill, next-steps) - New docs/user-guide/ section (6 files: core concepts through troubleshooting) - New docs/reference/ section (CLI_REFERENCE, CONFIG_FORMAT, ENVIRONMENT_VARIABLES, MCP_REFERENCE) - New docs/advanced/ section (custom-workflows, mcp-server, multi-source) - New docs/ARCHITECTURE.md - system architecture overview - Archived legacy files (QUICKSTART.md, QUICK_REFERENCE.md, docs/guides/USAGE.md) to docs/archive/legacy/ Chinese (zh-CN) translations: - Full zh-CN mirror of all user-facing docs (getting-started, user-guide, reference, advanced) - GitHub Actions workflow for translation sync (.github/workflows/translate-docs.yml) - Translation sync checker script (scripts/check_translation_sync.sh) - Translation helper script (scripts/translate_doc.py) Content updates: - CHANGELOG.md: [Unreleased] → [3.1.0] - 2026-02-22 - README.md: updated with new doc structure links - AGENTS.md: updated agent documentation - docs/features/UNIFIED_SCRAPING.md: updated for unified scraper workflow JSON config Analysis/planning artifacts (kept for reference): - DOCUMENTATION_OVERHAUL_PLAN.md, DOCUMENTATION_OVERHAUL_SUMMARY.md - FEATURE_GAP_ANALYSIS.md, IMPLEMENTATION_GAPS_ANALYSIS.md, CREATE_COMMAND_COVERAGE_ANALYSIS.md - CHINESE_TRANSLATION_IMPLEMENTATION_SUMMARY.md, ISSUE_260_UPDATE.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
567 lines
14 KiB
Markdown
567 lines
14 KiB
Markdown
# Config Format Reference - Skill Seekers
|
|
|
|
> **Version:** 3.1.0
|
|
> **Last Updated:** 2026-02-16
|
|
> **Complete JSON configuration specification**
|
|
|
|
---
|
|
|
|
## Table of Contents
|
|
|
|
- [Overview](#overview)
|
|
- [Single-Source Config](#single-source-config)
|
|
- [Documentation Source](#documentation-source)
|
|
- [GitHub Source](#github-source)
|
|
- [PDF Source](#pdf-source)
|
|
- [Local Source](#local-source)
|
|
- [Unified (Multi-Source) Config](#unified-multi-source-config)
|
|
- [Common Fields](#common-fields)
|
|
- [Selectors](#selectors)
|
|
- [Categories](#categories)
|
|
- [URL Patterns](#url-patterns)
|
|
- [Examples](#examples)
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
Skill Seekers uses JSON configuration files to define scraping targets. There are two types:
|
|
|
|
| Type | Use Case | File |
|
|
|------|----------|------|
|
|
| **Single-Source** | One source (docs, GitHub, PDF, or local) | `*.json` |
|
|
| **Unified** | Multiple sources combined | `*-unified.json` |
|
|
|
|
---
|
|
|
|
## Single-Source Config
|
|
|
|
### Documentation Source
|
|
|
|
For scraping documentation websites.
|
|
|
|
```json
|
|
{
|
|
"name": "react",
|
|
"base_url": "https://react.dev/",
|
|
"description": "React - JavaScript library for building UIs",
|
|
|
|
"start_urls": [
|
|
"https://react.dev/learn",
|
|
"https://react.dev/reference/react"
|
|
],
|
|
|
|
"selectors": {
|
|
"main_content": "article",
|
|
"title": "h1",
|
|
"code_blocks": "pre code"
|
|
},
|
|
|
|
"url_patterns": {
|
|
"include": ["/learn/", "/reference/"],
|
|
"exclude": ["/blog/", "/community/"]
|
|
},
|
|
|
|
"categories": {
|
|
"getting_started": ["learn", "tutorial", "intro"],
|
|
"api": ["reference", "api", "hooks"]
|
|
},
|
|
|
|
"rate_limit": 0.5,
|
|
"max_pages": 300,
|
|
"merge_mode": "claude-enhanced"
|
|
}
|
|
```
|
|
|
|
#### Documentation Fields
|
|
|
|
| Field | Type | Required | Default | Description |
|
|
|-------|------|----------|---------|-------------|
|
|
| `name` | string | Yes | - | Skill name (alphanumeric, dashes, underscores) |
|
|
| `base_url` | string | Yes | - | Base documentation URL |
|
|
| `description` | string | No | "" | Skill description for SKILL.md |
|
|
| `start_urls` | array | No | `[base_url]` | URLs to start crawling from |
|
|
| `selectors` | object | No | see below | CSS selectors for content extraction |
|
|
| `url_patterns` | object | No | `{}` | Include/exclude URL patterns |
|
|
| `categories` | object | No | `{}` | Content categorization rules |
|
|
| `rate_limit` | number | No | 0.5 | Seconds between requests |
|
|
| `max_pages` | number | No | 500 | Maximum pages to scrape |
|
|
| `merge_mode` | string | No | "claude-enhanced" | Merge strategy |
|
|
| `extract_api` | boolean | No | false | Extract API references |
|
|
| `llms_txt_url` | string | No | auto | Path to llms.txt file |
|
|
|
|
---
|
|
|
|
### GitHub Source
|
|
|
|
For analyzing GitHub repositories.
|
|
|
|
```json
|
|
{
|
|
"name": "react-github",
|
|
"type": "github",
|
|
"repo": "facebook/react",
|
|
"description": "React GitHub repository analysis",
|
|
|
|
"enable_codebase_analysis": true,
|
|
"code_analysis_depth": "deep",
|
|
|
|
"fetch_issues": true,
|
|
"max_issues": 100,
|
|
"issue_labels": ["bug", "enhancement"],
|
|
|
|
"fetch_releases": true,
|
|
"max_releases": 20,
|
|
|
|
"fetch_changelog": true,
|
|
"analyze_commit_history": true,
|
|
|
|
"file_patterns": ["*.js", "*.ts", "*.tsx"],
|
|
"exclude_patterns": ["*.test.js", "node_modules/**"],
|
|
|
|
"rate_limit": 1.0
|
|
}
|
|
```
|
|
|
|
#### GitHub Fields
|
|
|
|
| Field | Type | Required | Default | Description |
|
|
|-------|------|----------|---------|-------------|
|
|
| `name` | string | Yes | - | Skill name |
|
|
| `type` | string | Yes | - | Must be `"github"` |
|
|
| `repo` | string | Yes | - | Repository in `owner/repo` format |
|
|
| `description` | string | No | "" | Skill description |
|
|
| `enable_codebase_analysis` | boolean | No | true | Analyze source code |
|
|
| `code_analysis_depth` | string | No | "standard" | `surface`, `standard`, `deep` |
|
|
| `fetch_issues` | boolean | No | true | Fetch GitHub issues |
|
|
| `max_issues` | number | No | 100 | Maximum issues to fetch |
|
|
| `issue_labels` | array | No | [] | Filter by labels |
|
|
| `fetch_releases` | boolean | No | true | Fetch releases |
|
|
| `max_releases` | number | No | 20 | Maximum releases |
|
|
| `fetch_changelog` | boolean | No | true | Extract CHANGELOG |
|
|
| `analyze_commit_history` | boolean | No | false | Analyze commits |
|
|
| `file_patterns` | array | No | [] | Include file patterns |
|
|
| `exclude_patterns` | array | No | [] | Exclude file patterns |
|
|
|
|
---
|
|
|
|
### PDF Source
|
|
|
|
For extracting content from PDF files.
|
|
|
|
```json
|
|
{
|
|
"name": "product-manual",
|
|
"type": "pdf",
|
|
"pdf_path": "docs/manual.pdf",
|
|
"description": "Product documentation manual",
|
|
|
|
"enable_ocr": false,
|
|
"password": "",
|
|
|
|
"extract_images": true,
|
|
"image_output_dir": "output/images/",
|
|
|
|
"extract_tables": true,
|
|
"table_format": "markdown",
|
|
|
|
"page_range": [1, 100],
|
|
"split_by_chapters": true,
|
|
|
|
"chunk_size": 1000,
|
|
"chunk_overlap": 100
|
|
}
|
|
```
|
|
|
|
#### PDF Fields
|
|
|
|
| Field | Type | Required | Default | Description |
|
|
|-------|------|----------|---------|-------------|
|
|
| `name` | string | Yes | - | Skill name |
|
|
| `type` | string | Yes | - | Must be `"pdf"` |
|
|
| `pdf_path` | string | Yes | - | Path to PDF file |
|
|
| `description` | string | No | "" | Skill description |
|
|
| `enable_ocr` | boolean | No | false | OCR for scanned PDFs |
|
|
| `password` | string | No | "" | PDF password if encrypted |
|
|
| `extract_images` | boolean | No | false | Extract embedded images |
|
|
| `image_output_dir` | string | No | auto | Directory for images |
|
|
| `extract_tables` | boolean | No | false | Extract tables |
|
|
| `table_format` | string | No | "markdown" | `markdown`, `json`, `csv` |
|
|
| `page_range` | array | No | all | `[start, end]` page range |
|
|
| `split_by_chapters` | boolean | No | false | Split by detected chapters |
|
|
| `chunk_size` | number | No | 1000 | Characters per chunk |
|
|
| `chunk_overlap` | number | No | 100 | Overlap between chunks |
|
|
|
|
---
|
|
|
|
### Local Source
|
|
|
|
For analyzing local codebases.
|
|
|
|
```json
|
|
{
|
|
"name": "my-project",
|
|
"type": "local",
|
|
"directory": "./my-project",
|
|
"description": "Local project analysis",
|
|
|
|
"languages": ["Python", "JavaScript"],
|
|
"file_patterns": ["*.py", "*.js"],
|
|
"exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],
|
|
|
|
"analysis_depth": "comprehensive",
|
|
|
|
"extract_api": true,
|
|
"extract_patterns": true,
|
|
"extract_test_examples": true,
|
|
"extract_how_to_guides": true,
|
|
"extract_config_patterns": true,
|
|
|
|
"include_comments": true,
|
|
"include_docstrings": true,
|
|
"include_readme": true
|
|
}
|
|
```
|
|
|
|
#### Local Fields
|
|
|
|
| Field | Type | Required | Default | Description |
|
|
|-------|------|----------|---------|-------------|
|
|
| `name` | string | Yes | - | Skill name |
|
|
| `type` | string | Yes | - | Must be `"local"` |
|
|
| `directory` | string | Yes | - | Path to directory |
|
|
| `description` | string | No | "" | Skill description |
|
|
| `languages` | array | No | auto | Languages to analyze |
|
|
| `file_patterns` | array | No | all | Include patterns |
|
|
| `exclude_patterns` | array | No | common | Exclude patterns |
|
|
| `analysis_depth` | string | No | "standard" | `quick`, `standard`, `comprehensive` |
|
|
| `extract_api` | boolean | No | true | Extract API documentation |
|
|
| `extract_patterns` | boolean | No | true | Detect patterns |
|
|
| `extract_test_examples` | boolean | No | true | Extract test examples |
|
|
| `extract_how_to_guides` | boolean | No | true | Generate guides |
|
|
| `extract_config_patterns` | boolean | No | true | Extract config patterns |
|
|
| `include_comments` | boolean | No | true | Include code comments |
|
|
| `include_docstrings` | boolean | No | true | Include docstrings |
|
|
| `include_readme` | boolean | No | true | Include README |
|
|
|
|
---
|
|
|
|
## Unified (Multi-Source) Config
|
|
|
|
Combine multiple sources into one skill with conflict detection.
|
|
|
|
```json
|
|
{
|
|
"name": "react-complete",
|
|
"description": "React docs + GitHub + examples",
|
|
"merge_mode": "claude-enhanced",
|
|
|
|
"sources": [
|
|
{
|
|
"type": "docs",
|
|
"name": "react-docs",
|
|
"base_url": "https://react.dev/",
|
|
"max_pages": 200,
|
|
"categories": {
|
|
"getting_started": ["learn"],
|
|
"api": ["reference"]
|
|
}
|
|
},
|
|
{
|
|
"type": "github",
|
|
"name": "react-github",
|
|
"repo": "facebook/react",
|
|
"fetch_issues": true,
|
|
"max_issues": 50
|
|
},
|
|
{
|
|
"type": "pdf",
|
|
"name": "react-cheatsheet",
|
|
"pdf_path": "docs/react-cheatsheet.pdf"
|
|
},
|
|
{
|
|
"type": "local",
|
|
"name": "react-examples",
|
|
"directory": "./react-examples"
|
|
}
|
|
],
|
|
|
|
"conflict_detection": {
|
|
"enabled": true,
|
|
"rules": [
|
|
{
|
|
"field": "api_signature",
|
|
"action": "flag_mismatch"
|
|
}
|
|
]
|
|
},
|
|
|
|
"output_structure": {
|
|
"group_by_source": false,
|
|
"cross_reference": true
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Unified Fields
|
|
|
|
| Field | Type | Required | Default | Description |
|
|
|-------|------|----------|---------|-------------|
|
|
| `name` | string | Yes | - | Combined skill name |
|
|
| `description` | string | No | "" | Skill description |
|
|
| `merge_mode` | string | No | "claude-enhanced" | `rule-based`, `claude-enhanced` |
|
|
| `sources` | array | Yes | - | List of source configs |
|
|
| `conflict_detection` | object | No | `{}` | Conflict detection settings |
|
|
| `output_structure` | object | No | `{}` | Output organization |
|
|
|
|
#### Source Types in Unified Config
|
|
|
|
Each source in the `sources` array can be:
|
|
|
|
| Type | Required Fields |
|
|
|------|-----------------|
|
|
| `docs` | `base_url` |
|
|
| `github` | `repo` |
|
|
| `pdf` | `pdf_path` |
|
|
| `local` | `directory` |
|
|
|
|
---
|
|
|
|
## Common Fields
|
|
|
|
Fields available in all config types:
|
|
|
|
| Field | Type | Description |
|
|
|-------|------|-------------|
|
|
| `name` | string | Skill identifier (letters, numbers, dashes, underscores) |
|
|
| `description` | string | Human-readable description |
|
|
| `rate_limit` | number | Delay between requests in seconds |
|
|
| `output_dir` | string | Custom output directory |
|
|
| `skip_scrape` | boolean | Use existing data |
|
|
| `enhance_level` | number | 0=off, 1=SKILL.md, 2=+config, 3=full |
|
|
|
|
---
|
|
|
|
## Selectors
|
|
|
|
CSS selectors for content extraction from HTML:
|
|
|
|
```json
|
|
{
|
|
"selectors": {
|
|
"main_content": "article",
|
|
"title": "h1",
|
|
"code_blocks": "pre code",
|
|
"navigation": "nav.sidebar",
|
|
"breadcrumbs": "nav[aria-label='breadcrumb']",
|
|
"next_page": "a[rel='next']",
|
|
"prev_page": "a[rel='prev']"
|
|
}
|
|
}
|
|
```
|
|
|
|
### Default Selectors
|
|
|
|
If not specified, these defaults are used:
|
|
|
|
| Element | Default Selector |
|
|
|---------|-----------------|
|
|
| `main_content` | `article, main, .content, #content, [role='main']` |
|
|
| `title` | `h1, .page-title, title` |
|
|
| `code_blocks` | `pre code, code[class*="language-"]` |
|
|
| `navigation` | `nav, .sidebar, .toc` |
|
|
|
|
---
|
|
|
|
## Categories
|
|
|
|
Map URL patterns to content categories:
|
|
|
|
```json
|
|
{
|
|
"categories": {
|
|
"getting_started": [
|
|
"intro", "tutorial", "quickstart",
|
|
"installation", "getting-started"
|
|
],
|
|
"core_concepts": [
|
|
"concept", "fundamental", "architecture",
|
|
"principle", "overview"
|
|
],
|
|
"api_reference": [
|
|
"reference", "api", "method", "function",
|
|
"class", "interface", "type"
|
|
],
|
|
"guides": [
|
|
"guide", "how-to", "example", "recipe",
|
|
"pattern", "best-practice"
|
|
],
|
|
"advanced": [
|
|
"advanced", "expert", "performance",
|
|
"optimization", "internals"
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
Categories appear as sections in the generated SKILL.md.
|
|
|
|
---
|
|
|
|
## URL Patterns
|
|
|
|
Control which URLs are included or excluded:
|
|
|
|
```json
|
|
{
|
|
"url_patterns": {
|
|
"include": [
|
|
"/docs/",
|
|
"/guide/",
|
|
"/api/",
|
|
"/reference/"
|
|
],
|
|
"exclude": [
|
|
"/blog/",
|
|
"/news/",
|
|
"/community/",
|
|
"/search",
|
|
"?print=1",
|
|
"/_static/",
|
|
"/_images/"
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
### Pattern Rules
|
|
|
|
- Patterns are matched against the URL path
|
|
- Use `*` for wildcards: `/api/v*/`
|
|
- Use `**` for recursive: `/docs/**/*.html`
|
|
- Exclude takes precedence over include
|
|
|
|
---
|
|
|
|
## Examples
|
|
|
|
### React Documentation
|
|
|
|
```json
|
|
{
|
|
"name": "react",
|
|
"base_url": "https://react.dev/",
|
|
"description": "React - JavaScript library for building UIs",
|
|
"start_urls": [
|
|
"https://react.dev/learn",
|
|
"https://react.dev/reference/react",
|
|
"https://react.dev/reference/react-dom"
|
|
],
|
|
"selectors": {
|
|
"main_content": "article",
|
|
"title": "h1",
|
|
"code_blocks": "pre code"
|
|
},
|
|
"url_patterns": {
|
|
"include": ["/learn/", "/reference/", "/blog/"],
|
|
"exclude": ["/community/", "/search"]
|
|
},
|
|
"categories": {
|
|
"getting_started": ["learn", "tutorial"],
|
|
"api": ["reference", "api"],
|
|
"blog": ["blog"]
|
|
},
|
|
"rate_limit": 0.5,
|
|
"max_pages": 300
|
|
}
|
|
```
|
|
|
|
### Django GitHub
|
|
|
|
```json
|
|
{
|
|
"name": "django-github",
|
|
"type": "github",
|
|
"repo": "django/django",
|
|
"description": "Django web framework source code",
|
|
"enable_codebase_analysis": true,
|
|
"code_analysis_depth": "deep",
|
|
"fetch_issues": true,
|
|
"max_issues": 100,
|
|
"fetch_releases": true,
|
|
"file_patterns": ["*.py"],
|
|
"exclude_patterns": ["tests/**", "docs/**"]
|
|
}
|
|
```
|
|
|
|
### Unified Multi-Source
|
|
|
|
```json
|
|
{
|
|
"name": "godot-complete",
|
|
"description": "Godot Engine - docs, source, and manual",
|
|
"merge_mode": "claude-enhanced",
|
|
"sources": [
|
|
{
|
|
"type": "docs",
|
|
"name": "godot-docs",
|
|
"base_url": "https://docs.godotengine.org/en/stable/",
|
|
"max_pages": 500
|
|
},
|
|
{
|
|
"type": "github",
|
|
"name": "godot-source",
|
|
"repo": "godotengine/godot",
|
|
"fetch_issues": false
|
|
},
|
|
{
|
|
"type": "pdf",
|
|
"name": "godot-manual",
|
|
"pdf_path": "docs/godot-manual.pdf"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
### Local Project
|
|
|
|
```json
|
|
{
|
|
"name": "my-api",
|
|
"type": "local",
|
|
"directory": "./my-api-project",
|
|
"description": "My REST API implementation",
|
|
"languages": ["Python"],
|
|
"file_patterns": ["*.py"],
|
|
"exclude_patterns": ["tests/**", "migrations/**"],
|
|
"analysis_depth": "comprehensive",
|
|
"extract_api": true,
|
|
"extract_test_examples": true
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Validation
|
|
|
|
Validate your config before scraping:
|
|
|
|
```bash
|
|
# Using CLI
|
|
skill-seekers scrape --config my-config.json --dry-run
|
|
|
|
# Using MCP tool
|
|
validate_config({"config": "my-config.json"})
|
|
```
|
|
|
|
---
|
|
|
|
## See Also
|
|
|
|
- [CLI Reference](CLI_REFERENCE.md) - Command reference
|
|
- [Environment Variables](ENVIRONMENT_VARIABLES.md) - Configuration environment
|
|
|
|
---
|
|
|
|
*For more examples, see `configs/` directory in the repository*
|