Files
yusyus ba9a8ff8b5 docs: complete documentation overhaul with v3.1.0 release notes and zh-CN translations
Documentation restructure:
- New docs/getting-started/ guide (4 files: install, quick-start, first-skill, next-steps)
- New docs/user-guide/ section (6 files: core concepts through troubleshooting)
- New docs/reference/ section (CLI_REFERENCE, CONFIG_FORMAT, ENVIRONMENT_VARIABLES, MCP_REFERENCE)
- New docs/advanced/ section (custom-workflows, mcp-server, multi-source)
- New docs/ARCHITECTURE.md - system architecture overview
- Archived legacy files (QUICKSTART.md, QUICK_REFERENCE.md, docs/guides/USAGE.md) to docs/archive/legacy/

Chinese (zh-CN) translations:
- Full zh-CN mirror of all user-facing docs (getting-started, user-guide, reference, advanced)
- GitHub Actions workflow for translation sync (.github/workflows/translate-docs.yml)
- Translation sync checker script (scripts/check_translation_sync.sh)
- Translation helper script (scripts/translate_doc.py)

Content updates:
- CHANGELOG.md: [Unreleased] → [3.1.0] - 2026-02-22
- README.md: updated with new doc structure links
- AGENTS.md: updated agent documentation
- docs/features/UNIFIED_SCRAPING.md: updated for unified scraper workflow JSON config

Analysis/planning artifacts (kept for reference):
- DOCUMENTATION_OVERHAUL_PLAN.md, DOCUMENTATION_OVERHAUL_SUMMARY.md
- FEATURE_GAP_ANALYSIS.md, IMPLEMENTATION_GAPS_ANALYSIS.md, CREATE_COMMAND_COVERAGE_ANALYSIS.md
- CHINESE_TRANSLATION_IMPLEMENTATION_SUMMARY.md, ISSUE_260_UPDATE.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 01:01:51 +03:00

14 KiB

Config Format Reference - Skill Seekers

Version: 3.1.0
Last Updated: 2026-02-16
Complete JSON configuration specification


Table of Contents


Overview

Skill Seekers uses JSON configuration files to define scraping targets. There are two types:

Type Use Case File
Single-Source One source (docs, GitHub, PDF, or local) *.json
Unified Multiple sources combined *-unified.json

Single-Source Config

Documentation Source

For scraping documentation websites.

{
  "name": "react",
  "base_url": "https://react.dev/",
  "description": "React - JavaScript library for building UIs",
  
  "start_urls": [
    "https://react.dev/learn",
    "https://react.dev/reference/react"
  ],
  
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  
  "url_patterns": {
    "include": ["/learn/", "/reference/"],
    "exclude": ["/blog/", "/community/"]
  },
  
  "categories": {
    "getting_started": ["learn", "tutorial", "intro"],
    "api": ["reference", "api", "hooks"]
  },
  
  "rate_limit": 0.5,
  "max_pages": 300,
  "merge_mode": "claude-enhanced"
}

Documentation Fields

Field Type Required Default Description
name string Yes - Skill name (alphanumeric, dashes, underscores)
base_url string Yes - Base documentation URL
description string No "" Skill description for SKILL.md
start_urls array No [base_url] URLs to start crawling from
selectors object No see below CSS selectors for content extraction
url_patterns object No {} Include/exclude URL patterns
categories object No {} Content categorization rules
rate_limit number No 0.5 Seconds between requests
max_pages number No 500 Maximum pages to scrape
merge_mode string No "claude-enhanced" Merge strategy
extract_api boolean No false Extract API references
llms_txt_url string No auto Path to llms.txt file

GitHub Source

For analyzing GitHub repositories.

{
  "name": "react-github",
  "type": "github",
  "repo": "facebook/react",
  "description": "React GitHub repository analysis",
  
  "enable_codebase_analysis": true,
  "code_analysis_depth": "deep",
  
  "fetch_issues": true,
  "max_issues": 100,
  "issue_labels": ["bug", "enhancement"],
  
  "fetch_releases": true,
  "max_releases": 20,
  
  "fetch_changelog": true,
  "analyze_commit_history": true,
  
  "file_patterns": ["*.js", "*.ts", "*.tsx"],
  "exclude_patterns": ["*.test.js", "node_modules/**"],
  
  "rate_limit": 1.0
}

GitHub Fields

Field Type Required Default Description
name string Yes - Skill name
type string Yes - Must be "github"
repo string Yes - Repository in owner/repo format
description string No "" Skill description
enable_codebase_analysis boolean No true Analyze source code
code_analysis_depth string No "standard" surface, standard, deep
fetch_issues boolean No true Fetch GitHub issues
max_issues number No 100 Maximum issues to fetch
issue_labels array No [] Filter by labels
fetch_releases boolean No true Fetch releases
max_releases number No 20 Maximum releases
fetch_changelog boolean No true Extract CHANGELOG
analyze_commit_history boolean No false Analyze commits
file_patterns array No [] Include file patterns
exclude_patterns array No [] Exclude file patterns

PDF Source

For extracting content from PDF files.

{
  "name": "product-manual",
  "type": "pdf",
  "pdf_path": "docs/manual.pdf",
  "description": "Product documentation manual",
  
  "enable_ocr": false,
  "password": "",
  
  "extract_images": true,
  "image_output_dir": "output/images/",
  
  "extract_tables": true,
  "table_format": "markdown",
  
  "page_range": [1, 100],
  "split_by_chapters": true,
  
  "chunk_size": 1000,
  "chunk_overlap": 100
}

PDF Fields

Field Type Required Default Description
name string Yes - Skill name
type string Yes - Must be "pdf"
pdf_path string Yes - Path to PDF file
description string No "" Skill description
enable_ocr boolean No false OCR for scanned PDFs
password string No "" PDF password if encrypted
extract_images boolean No false Extract embedded images
image_output_dir string No auto Directory for images
extract_tables boolean No false Extract tables
table_format string No "markdown" markdown, json, csv
page_range array No all [start, end] page range
split_by_chapters boolean No false Split by detected chapters
chunk_size number No 1000 Characters per chunk
chunk_overlap number No 100 Overlap between chunks

Local Source

For analyzing local codebases.

{
  "name": "my-project",
  "type": "local",
  "directory": "./my-project",
  "description": "Local project analysis",
  
  "languages": ["Python", "JavaScript"],
  "file_patterns": ["*.py", "*.js"],
  "exclude_patterns": ["*.pyc", "node_modules/**", ".git/**"],
  
  "analysis_depth": "comprehensive",
  
  "extract_api": true,
  "extract_patterns": true,
  "extract_test_examples": true,
  "extract_how_to_guides": true,
  "extract_config_patterns": true,
  
  "include_comments": true,
  "include_docstrings": true,
  "include_readme": true
}

Local Fields

Field Type Required Default Description
name string Yes - Skill name
type string Yes - Must be "local"
directory string Yes - Path to directory
description string No "" Skill description
languages array No auto Languages to analyze
file_patterns array No all Include patterns
exclude_patterns array No common Exclude patterns
analysis_depth string No "standard" quick, standard, comprehensive
extract_api boolean No true Extract API documentation
extract_patterns boolean No true Detect patterns
extract_test_examples boolean No true Extract test examples
extract_how_to_guides boolean No true Generate guides
extract_config_patterns boolean No true Extract config patterns
include_comments boolean No true Include code comments
include_docstrings boolean No true Include docstrings
include_readme boolean No true Include README

Unified (Multi-Source) Config

Combine multiple sources into one skill with conflict detection.

{
  "name": "react-complete",
  "description": "React docs + GitHub + examples",
  "merge_mode": "claude-enhanced",
  
  "sources": [
    {
      "type": "docs",
      "name": "react-docs",
      "base_url": "https://react.dev/",
      "max_pages": 200,
      "categories": {
        "getting_started": ["learn"],
        "api": ["reference"]
      }
    },
    {
      "type": "github",
      "name": "react-github",
      "repo": "facebook/react",
      "fetch_issues": true,
      "max_issues": 50
    },
    {
      "type": "pdf",
      "name": "react-cheatsheet",
      "pdf_path": "docs/react-cheatsheet.pdf"
    },
    {
      "type": "local",
      "name": "react-examples",
      "directory": "./react-examples"
    }
  ],
  
  "conflict_detection": {
    "enabled": true,
    "rules": [
      {
        "field": "api_signature",
        "action": "flag_mismatch"
      }
    ]
  },
  
  "output_structure": {
    "group_by_source": false,
    "cross_reference": true
  }
}

Unified Fields

Field Type Required Default Description
name string Yes - Combined skill name
description string No "" Skill description
merge_mode string No "claude-enhanced" rule-based, claude-enhanced
sources array Yes - List of source configs
conflict_detection object No {} Conflict detection settings
output_structure object No {} Output organization

Source Types in Unified Config

Each source in the sources array can be:

Type Required Fields
docs base_url
github repo
pdf pdf_path
local directory

Common Fields

Fields available in all config types:

Field Type Description
name string Skill identifier (letters, numbers, dashes, underscores)
description string Human-readable description
rate_limit number Delay between requests in seconds
output_dir string Custom output directory
skip_scrape boolean Use existing data
enhance_level number 0=off, 1=SKILL.md, 2=+config, 3=full

Selectors

CSS selectors for content extraction from HTML:

{
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code",
    "navigation": "nav.sidebar",
    "breadcrumbs": "nav[aria-label='breadcrumb']",
    "next_page": "a[rel='next']",
    "prev_page": "a[rel='prev']"
  }
}

Default Selectors

If not specified, these defaults are used:

Element Default Selector
main_content article, main, .content, #content, [role='main']
title h1, .page-title, title
code_blocks pre code, code[class*="language-"]
navigation nav, .sidebar, .toc

Categories

Map URL patterns to content categories:

{
  "categories": {
    "getting_started": [
      "intro", "tutorial", "quickstart", 
      "installation", "getting-started"
    ],
    "core_concepts": [
      "concept", "fundamental", "architecture",
      "principle", "overview"
    ],
    "api_reference": [
      "reference", "api", "method", "function",
      "class", "interface", "type"
    ],
    "guides": [
      "guide", "how-to", "example", "recipe",
      "pattern", "best-practice"
    ],
    "advanced": [
      "advanced", "expert", "performance",
      "optimization", "internals"
    ]
  }
}

Categories appear as sections in the generated SKILL.md.


URL Patterns

Control which URLs are included or excluded:

{
  "url_patterns": {
    "include": [
      "/docs/",
      "/guide/",
      "/api/",
      "/reference/"
    ],
    "exclude": [
      "/blog/",
      "/news/",
      "/community/",
      "/search",
      "?print=1",
      "/_static/",
      "/_images/"
    ]
  }
}

Pattern Rules

  • Patterns are matched against the URL path
  • Use * for wildcards: /api/v*/
  • Use ** for recursive: /docs/**/*.html
  • Exclude takes precedence over include

Examples

React Documentation

{
  "name": "react",
  "base_url": "https://react.dev/",
  "description": "React - JavaScript library for building UIs",
  "start_urls": [
    "https://react.dev/learn",
    "https://react.dev/reference/react",
    "https://react.dev/reference/react-dom"
  ],
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/learn/", "/reference/", "/blog/"],
    "exclude": ["/community/", "/search"]
  },
  "categories": {
    "getting_started": ["learn", "tutorial"],
    "api": ["reference", "api"],
    "blog": ["blog"]
  },
  "rate_limit": 0.5,
  "max_pages": 300
}

Django GitHub

{
  "name": "django-github",
  "type": "github",
  "repo": "django/django",
  "description": "Django web framework source code",
  "enable_codebase_analysis": true,
  "code_analysis_depth": "deep",
  "fetch_issues": true,
  "max_issues": 100,
  "fetch_releases": true,
  "file_patterns": ["*.py"],
  "exclude_patterns": ["tests/**", "docs/**"]
}

Unified Multi-Source

{
  "name": "godot-complete",
  "description": "Godot Engine - docs, source, and manual",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "docs",
      "name": "godot-docs",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "max_pages": 500
    },
    {
      "type": "github",
      "name": "godot-source",
      "repo": "godotengine/godot",
      "fetch_issues": false
    },
    {
      "type": "pdf",
      "name": "godot-manual",
      "pdf_path": "docs/godot-manual.pdf"
    }
  ]
}

Local Project

{
  "name": "my-api",
  "type": "local",
  "directory": "./my-api-project",
  "description": "My REST API implementation",
  "languages": ["Python"],
  "file_patterns": ["*.py"],
  "exclude_patterns": ["tests/**", "migrations/**"],
  "analysis_depth": "comprehensive",
  "extract_api": true,
  "extract_test_examples": true
}

Validation

Validate your config before scraping:

# Using CLI
skill-seekers scrape --config my-config.json --dry-run

# Using MCP tool
validate_config({"config": "my-config.json"})

See Also


For more examples, see configs/ directory in the repository