API Reference - Programmatic Usage
Version: 3.1.0-dev Last Updated: 2026-02-18 Status: ✅ Production Ready
Overview
Skill Seekers can be used programmatically for integration into other tools, automation scripts, and CI/CD pipelines. This guide covers the public APIs available for developers who want to embed Skill Seekers functionality into their own applications.
Use Cases:
- Automated documentation skill generation in CI/CD
- Batch processing multiple documentation sources
- Custom skill generation workflows
- Integration with internal tooling
- Automated skill updates on documentation changes
Installation
Basic Installation
pip install skill-seekers
With Platform Dependencies
# Google Gemini support
pip install skill-seekers[gemini]
# OpenAI ChatGPT support
pip install skill-seekers[openai]
# All platform support
pip install skill-seekers[all-llms]
Development Installation
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
pip install -e ".[all-llms]"
Core APIs
1. Documentation Scraping API
Extract content from documentation websites using BFS traversal and smart categorization.
Basic Usage
from skill_seekers.cli.doc_scraper import scrape_all, build_skill
import json
# Load configuration
with open('configs/react.json', 'r') as f:
config = json.load(f)
# Scrape documentation
pages = scrape_all(
base_url=config['base_url'],
selectors=config['selectors'],
config=config,
output_dir='output/react_data'
)
print(f"Scraped {len(pages)} pages")
# Build skill from scraped data
skill_path = build_skill(
config_name='react',
output_dir='output/react',
data_dir='output/react_data'
)
print(f"Skill created at: {skill_path}")
Advanced Scraping Options
from skill_seekers.cli.doc_scraper import scrape_all
# Custom scraping with advanced options
pages = scrape_all(
base_url='https://docs.example.com',
selectors={
'main_content': 'article',
'title': 'h1',
'code_blocks': 'pre code'
},
config={
'name': 'my-framework',
'description': 'Custom framework documentation',
'rate_limit': 0.5, # 0.5 second delay between requests
'max_pages': 500, # Limit to 500 pages
'url_patterns': {
'include': ['/docs/'],
'exclude': ['/blog/', '/changelog/']
}
},
output_dir='output/my-framework_data',
use_async=True # Enable async scraping (2-3x faster)
)
Rebuilding Without Scraping
from skill_seekers.cli.doc_scraper import build_skill
# Rebuild skill from existing data (fast!)
skill_path = build_skill(
config_name='react',
output_dir='output/react',
data_dir='output/react_data', # Use existing scraped data
skip_scrape=True # Don't re-scrape
)
2. GitHub Repository Analysis API
Analyze GitHub repositories with a three-stream architecture (Code + Docs + Insights).
Basic GitHub Analysis
from skill_seekers.cli.github_scraper import scrape_github_repo
# Analyze GitHub repository
result = scrape_github_repo(
repo_url='https://github.com/facebook/react',
output_dir='output/react-github',
analysis_depth='c3x', # Options: 'basic' or 'c3x'
github_token='ghp_...' # Optional: higher rate limits
)
print(f"Analysis complete: {result['skill_path']}")
print(f"Code files analyzed: {result['stats']['code_files']}")
print(f"Patterns detected: {result['stats']['patterns']}")
Stream-Specific Analysis
from skill_seekers.cli.github_scraper import scrape_github_repo
# Focus on specific streams
result = scrape_github_repo(
repo_url='https://github.com/vercel/next.js',
output_dir='output/nextjs',
analysis_depth='c3x',
enable_code_stream=True, # C3.x codebase analysis
enable_docs_stream=True, # README, docs/, wiki
enable_insights_stream=True, # GitHub metadata, issues
include_tests=True, # Extract test examples
include_patterns=True, # Detect design patterns
include_how_to_guides=True # Generate guides from tests
)
3. PDF Extraction API
Extract content from PDF documents with OCR and image support.
Basic PDF Extraction
from skill_seekers.cli.pdf_scraper import scrape_pdf
# Extract from single PDF
skill_path = scrape_pdf(
pdf_path='documentation.pdf',
output_dir='output/pdf-skill',
skill_name='my-pdf-skill',
description='Documentation from PDF'
)
print(f"PDF skill created: {skill_path}")
Advanced PDF Processing
from skill_seekers.cli.pdf_scraper import scrape_pdf
# PDF extraction with all features
skill_path = scrape_pdf(
pdf_path='large-manual.pdf',
output_dir='output/manual',
skill_name='product-manual',
description='Product manual documentation',
enable_ocr=True, # OCR for scanned PDFs
extract_images=True, # Extract embedded images
extract_tables=True, # Parse tables
chunk_size=50, # Pages per chunk (large PDFs)
language='eng', # OCR language
dpi=300 # Image DPI for OCR
)
4. Unified Multi-Source Scraping API
Combine multiple sources (docs + GitHub + PDF) into a single unified skill.
Unified Scraping
from skill_seekers.cli.unified_scraper import unified_scrape
# Scrape from multiple sources
result = unified_scrape(
config_path='configs/unified/react-unified.json',
output_dir='output/react-complete'
)
print(f"Unified skill created: {result['skill_path']}")
print(f"Sources merged: {result['sources']}")
print(f"Conflicts detected: {result['conflicts']}")
Conflict Detection
from skill_seekers.cli.unified_scraper import detect_conflicts
# Detect discrepancies between sources
conflicts = detect_conflicts(
docs_dir='output/react_data',
github_dir='output/react-github',
pdf_dir='output/react-pdf'
)
for conflict in conflicts:
print(f"Conflict in {conflict['topic']}:")
print(f" Docs say: {conflict['docs_version']}")
print(f" Code shows: {conflict['code_version']}")
5. Skill Packaging API
Package skills for different LLM platforms using the platform adaptor architecture.
Basic Packaging
from skill_seekers.cli.adaptors import get_adaptor
# Get platform-specific adaptor
adaptor = get_adaptor('claude') # Options: claude, gemini, openai, markdown
# Package skill
package_path = adaptor.package(
skill_dir='output/react/',
output_path='output/'
)
print(f"Claude skill package: {package_path}")
Multi-Platform Packaging
from skill_seekers.cli.adaptors import get_adaptor
# Package for all platforms
platforms = ['claude', 'gemini', 'openai', 'markdown']
for platform in platforms:
adaptor = get_adaptor(platform)
package_path = adaptor.package(
skill_dir='output/react/',
output_path='output/'
)
print(f"{platform.capitalize()} package: {package_path}")
Custom Packaging Options
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor('gemini')
# Gemini-specific packaging (.tar.gz format)
package_path = adaptor.package(
skill_dir='output/react/',
output_path='output/',
compress_level=9, # Maximum compression
include_metadata=True
)
Shared Embedding Methods
The base SkillAdaptor class provides two shared embedding methods inherited by all vector database adaptors (chroma, weaviate, pinecone):
- _generate_openai_embeddings(texts, model) -- Generate embeddings via the OpenAI API.
- _generate_st_embeddings(texts, model) -- Generate embeddings using a local sentence-transformers model.
These methods are available on any adaptor instance returned by get_adaptor() for vector database targets, so you do not need to implement embedding logic per-adaptor.
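For example, here is a minimal sketch of calling the inherited helpers on a vector database adaptor; it assumes get_adaptor('chroma') returns the chroma adaptor, that OPENAI_API_KEY is set in the environment for the OpenAI path, and that the model names shown are placeholders:
from skill_seekers.cli.adaptors import get_adaptor

# Vector database adaptors inherit both helpers from the SkillAdaptor base class
adaptor = get_adaptor('chroma')

texts = ['How do I configure routing?', 'Which hooks are available?']

# Remote embeddings via the OpenAI API (requires OPENAI_API_KEY; model name is a placeholder)
openai_vectors = adaptor._generate_openai_embeddings(texts, model='text-embedding-3-small')

# Local embeddings via sentence-transformers (no API key needed; model name is a placeholder)
st_vectors = adaptor._generate_st_embeddings(texts, model='all-MiniLM-L6-v2')

print(f"Generated {len(openai_vectors)} OpenAI and {len(st_vectors)} local embeddings")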
6. Skill Upload API
Upload packaged skills to LLM platforms via their APIs.
Claude AI Upload
import os
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor('claude')
# Upload to Claude AI
result = adaptor.upload(
package_path='output/react-claude.zip',
api_key=os.getenv('ANTHROPIC_API_KEY')
)
print(f"Uploaded to Claude AI: {result['skill_id']}")
Google Gemini Upload
import os
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor('gemini')
# Upload to Google Gemini
result = adaptor.upload(
package_path='output/react-gemini.tar.gz',
api_key=os.getenv('GOOGLE_API_KEY')
)
print(f"Gemini corpus ID: {result['corpus_id']}")
OpenAI ChatGPT Upload
import os
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor('openai')
# Upload to OpenAI Vector Store
result = adaptor.upload(
package_path='output/react-openai.zip',
api_key=os.getenv('OPENAI_API_KEY')
)
print(f"Vector store ID: {result['vector_store_id']}")
7. AI Enhancement API
Enhance skills with AI-powered improvements using platform-specific models.
API Mode Enhancement
import os
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor('claude')
# Enhance using Claude API
result = adaptor.enhance(
skill_dir='output/react/',
mode='api',
api_key=os.getenv('ANTHROPIC_API_KEY')
)
print(f"Enhanced skill: {result['enhanced_path']}")
print(f"Quality score: {result['quality_score']}/10")
LOCAL Mode Enhancement
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor('claude')
# Enhance using Claude Code CLI (free!)
result = adaptor.enhance(
skill_dir='output/react/',
mode='LOCAL',
execution_mode='headless', # Options: headless, background, daemon
timeout=300 # 5 minute timeout
)
print(f"Enhanced skill: {result['enhanced_path']}")
Background Enhancement with Monitoring
from skill_seekers.cli.enhance_skill_local import enhance_skill
from skill_seekers.cli.enhance_status import monitor_enhancement
import time
# Start background enhancement
result = enhance_skill(
skill_dir='output/react/',
mode='background'
)
pid = result['pid']
print(f"Enhancement started in background (PID: {pid})")
# Monitor progress
while True:
status = monitor_enhancement('output/react/')
print(f"Status: {status['state']}, Progress: {status['progress']}%")
if status['state'] == 'completed':
print(f"Enhanced skill: {status['output_path']}")
break
elif status['state'] == 'failed':
print(f"Enhancement failed: {status['error']}")
break
time.sleep(5) # Check every 5 seconds
8. Complete Workflow Automation API
Automate the entire workflow: fetch config → scrape → enhance → package → upload.
One-Command Install
import os
from skill_seekers.cli.install_skill import install_skill
# Complete workflow automation
result = install_skill(
config_name='react', # Use preset config
target='claude', # Target platform
api_key=os.getenv('ANTHROPIC_API_KEY'),
enhance=True, # Enable AI enhancement
upload=True, # Upload to platform
force=True # Skip confirmations
)
print(f"Skill installed: {result['skill_id']}")
print(f"Package path: {result['package_path']}")
print(f"Time taken: {result['duration']}s")
Custom Config Install
import os
from skill_seekers.cli.install_skill import install_skill
# Install with custom configuration
result = install_skill(
config_path='configs/custom/my-framework.json',
target='gemini',
api_key=os.getenv('GOOGLE_API_KEY'),
enhance=True,
upload=True,
analysis_depth='c3x', # Deep codebase analysis
enable_router=True # Generate router for large docs
)
Configuration Objects
Config Schema
Skill Seekers uses JSON configuration files to define scraping behavior.
{
"name": "framework-name",
"description": "When to use this skill",
"base_url": "https://docs.example.com/",
"selectors": {
"main_content": "article",
"title": "h1",
"code_blocks": "pre code",
"navigation": "nav.sidebar"
},
"url_patterns": {
"include": ["/docs/", "/api/", "/guides/"],
"exclude": ["/blog/", "/changelog/", "/archive/"]
},
"categories": {
"getting_started": ["intro", "quickstart", "installation"],
"api": ["api", "reference", "methods"],
"guides": ["guide", "tutorial", "how-to"],
"examples": ["example", "demo", "sample"]
},
"rate_limit": 0.5,
"max_pages": 500,
"llms_txt_url": "https://example.com/llms.txt",
"enable_async": true
}
Required Fields
| Field | Type | Description |
|---|---|---|
| name | string | Skill name (alphanumeric + hyphens) |
| description | string | When to use this skill |
| base_url | string | Documentation website URL |
| selectors | object | CSS selectors for content extraction |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| url_patterns.include | array | [] | URL path patterns to include |
| url_patterns.exclude | array | [] | URL path patterns to exclude |
| categories | object | {} | Category keywords mapping |
| rate_limit | float | 0.5 | Delay between requests (seconds) |
| max_pages | int | 500 | Maximum pages to scrape |
| llms_txt_url | string | null | URL to llms.txt file |
| enable_async | bool | false | Enable async scraping (faster) |
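A minimal sketch of checking a config against this schema, using the validate_config helper shown later in the testing section (the config path is illustrative):
import json
from skill_seekers.cli.config_validator import validate_config

# Load a config that follows the schema above (path is illustrative)
with open('configs/my-framework.json') as f:
    config = json.load(f)

is_valid, errors = validate_config(config)
if not is_valid:
    for error in errors:
        print(f"Config error: {error}")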
Unified Config Schema (Multi-Source)
{
"name": "framework-unified",
"description": "Complete framework documentation",
"sources": {
"documentation": {
"type": "docs",
"base_url": "https://docs.example.com/",
"selectors": { "main_content": "article" }
},
"github": {
"type": "github",
"repo_url": "https://github.com/org/repo",
"analysis_depth": "c3x"
},
"pdf": {
"type": "pdf",
"pdf_path": "manual.pdf",
"enable_ocr": true
}
},
"conflict_resolution": "prefer_code",
"merge_strategy": "smart"
}
Advanced Options
Custom Selectors
from skill_seekers.cli.doc_scraper import scrape_all
# Custom CSS selectors for complex sites
pages = scrape_all(
base_url='https://complex-site.com',
selectors={
'main_content': 'div.content-wrapper > article',
'title': 'h1.page-title',
'code_blocks': 'pre.highlight code',
'navigation': 'aside.sidebar nav',
'metadata': 'meta[name="description"]'
},
config={'name': 'complex-site'}
)
URL Pattern Matching
# Advanced URL filtering
config = {
'url_patterns': {
'include': [
'/docs/', # Exact path match
'/api/**', # Wildcard: all subpaths
'/guides/v2.*' # Regex: version-specific
],
'exclude': [
'/blog/',
'/changelog/',
'**/*.png', # Exclude images
'**/*.pdf' # Exclude PDFs
]
}
}
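To apply these filters, merge them into the scraping config passed to scrape_all; a hedged sketch building on the config dict above (the URL and selectors are placeholders):
from skill_seekers.cli.doc_scraper import scrape_all

# Combine the URL filters above with the rest of the scraping config
pages = scrape_all(
    base_url='https://docs.example.com',
    selectors={'main_content': 'article'},
    config={'name': 'example', **config}  # config holds the url_patterns defined above
)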
Category Inference
from skill_seekers.cli.doc_scraper import infer_categories
# Auto-detect categories from URL structure
categories = infer_categories(
pages=[
{'url': 'https://docs.example.com/getting-started/intro'},
{'url': 'https://docs.example.com/api/authentication'},
{'url': 'https://docs.example.com/guides/tutorial'}
]
)
print(categories)
# Output: {
# 'getting-started': ['intro'],
# 'api': ['authentication'],
# 'guides': ['tutorial']
# }
Error Handling
Common Exceptions
from skill_seekers.cli.doc_scraper import scrape_all
from skill_seekers.exceptions import (
NetworkError,
InvalidConfigError,
ScrapingError,
RateLimitError
)
try:
pages = scrape_all(
base_url='https://docs.example.com',
selectors={'main_content': 'article'},
config={'name': 'example'}
)
except NetworkError as e:
print(f"Network error: {e}")
# Retry with exponential backoff
except InvalidConfigError as e:
print(f"Invalid config: {e}")
# Fix configuration and retry
except RateLimitError as e:
print(f"Rate limited: {e}")
# Increase rate_limit in config
except ScrapingError as e:
print(f"Scraping failed: {e}")
# Check selectors and URL patterns
Retry Logic
from skill_seekers.cli.doc_scraper import scrape_all
from skill_seekers.utils import retry_with_backoff
@retry_with_backoff(max_retries=3, base_delay=1.0)
def scrape_with_retry(base_url, config):
return scrape_all(
base_url=base_url,
selectors=config['selectors'],
config=config
)
# Automatically retries on network errors
pages = scrape_with_retry(
base_url='https://docs.example.com',
config={'name': 'example', 'selectors': {...}}
)
Testing Your Integration
Unit Tests
import pytest
from skill_seekers.cli.doc_scraper import scrape_all
def test_basic_scraping():
"""Test basic documentation scraping."""
pages = scrape_all(
base_url='https://docs.example.com',
selectors={'main_content': 'article'},
config={
'name': 'test-framework',
'max_pages': 10 # Limit for testing
}
)
assert len(pages) > 0
assert all('title' in p for p in pages)
assert all('content' in p for p in pages)
def test_config_validation():
"""Test configuration validation."""
from skill_seekers.cli.config_validator import validate_config
config = {
'name': 'test',
'base_url': 'https://example.com',
'selectors': {'main_content': 'article'}
}
is_valid, errors = validate_config(config)
assert is_valid
assert len(errors) == 0
Integration Tests
import pytest
import os
from skill_seekers.cli.install_skill import install_skill
@pytest.mark.integration
def test_end_to_end_workflow():
"""Test complete skill installation workflow."""
result = install_skill(
config_name='react',
target='markdown', # No API key needed for markdown
enhance=False, # Skip AI enhancement
upload=False, # Don't upload
force=True
)
assert result['success']
assert os.path.exists(result['package_path'])
assert result['package_path'].endswith('.zip')
@pytest.mark.integration
def test_multi_platform_packaging():
"""Test packaging for multiple platforms."""
from skill_seekers.cli.adaptors import get_adaptor
platforms = ['claude', 'gemini', 'openai', 'markdown']
for platform in platforms:
adaptor = get_adaptor(platform)
package_path = adaptor.package(
skill_dir='output/test-skill/',
output_path='output/'
)
assert os.path.exists(package_path)
Performance Optimization
Async Scraping
from skill_seekers.cli.doc_scraper import scrape_all
# Enable async for 2-3x speed improvement
pages = scrape_all(
base_url='https://docs.example.com',
selectors={'main_content': 'article'},
config={'name': 'example'},
use_async=True # 2-3x faster
)
Caching and Rebuilding
from skill_seekers.cli.doc_scraper import build_skill
# First scrape (slow - 15-45 minutes)
build_skill(config_name='react', output_dir='output/react')
# Rebuild without re-scraping (fast - <1 minute)
build_skill(
config_name='react',
output_dir='output/react',
data_dir='output/react_data',
skip_scrape=True # Use cached data
)
Batch Processing
from concurrent.futures import ThreadPoolExecutor
from skill_seekers.cli.install_skill import install_skill
configs = ['react', 'vue', 'angular', 'svelte']
def install_config(config_name):
return install_skill(
config_name=config_name,
target='markdown',
enhance=False,
upload=False,
force=True
)
# Process 4 configs in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
results = list(executor.map(install_config, configs))
for config, result in zip(configs, results):
print(f"{config}: {result['success']}")
CI/CD Integration Examples
GitHub Actions
name: Generate Skills
on:
schedule:
- cron: '0 0 * * *' # Daily at midnight
workflow_dispatch:
jobs:
generate-skills:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install Skill Seekers
run: pip install skill-seekers[all-llms]
- name: Generate Skills
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
run: |
skill-seekers install react --target claude --enhance --upload
skill-seekers install vue --target gemini --enhance --upload
- name: Archive Skills
uses: actions/upload-artifact@v3
with:
name: skills
path: output/**/*.zip
GitLab CI
generate_skills:
image: python:3.11
script:
- pip install skill-seekers[all-llms]
- skill-seekers install react --target claude --enhance --upload
- skill-seekers install vue --target gemini --enhance --upload
artifacts:
paths:
- output/
only:
- schedules
Best Practices
1. Use Configuration Files
Store configs in version control for reproducibility:
import json
with open('configs/my-framework.json') as f:
config = json.load(f)
scrape_all(config=config)
2. Enable Async for Large Sites
pages = scrape_all(base_url=url, config=config, use_async=True)
3. Cache Scraped Data
# Scrape once
scrape_all(config=config, output_dir='output/data')
# Rebuild many times (fast!)
build_skill(config_name='framework', data_dir='output/data', skip_scrape=True)
4. Use Platform Adaptors
# Good: Platform-agnostic
adaptor = get_adaptor(target_platform)
adaptor.package(skill_dir)
# Bad: Hardcoded for one platform
# create_zip_for_claude(skill_dir)
5. Handle Errors Gracefully
try:
    result = install_skill(config_name='framework', target='claude')
except NetworkError:
    pass  # Retry logic
except InvalidConfigError:
    pass  # Fix config and retry
6. Monitor Background Enhancements
# Start enhancement
enhance_skill(skill_dir='output/react/', mode='background')
# Monitor progress
monitor_enhancement('output/react/', watch=True)
API Reference Summary
| API | Module | Use Case |
|---|---|---|
| Documentation Scraping | doc_scraper | Extract from docs websites |
| GitHub Analysis | github_scraper | Analyze code repositories |
| PDF Extraction | pdf_scraper | Extract from PDF files |
| Unified Scraping | unified_scraper | Multi-source scraping |
| Skill Packaging | adaptors | Package for LLM platforms |
| Skill Upload | adaptors | Upload to platforms |
| AI Enhancement | adaptors | Improve skill quality |
| Complete Workflow | install_skill | End-to-end automation |
Additional Resources
- Main Documentation - Complete user guide
- Usage Guide - CLI usage examples
- MCP Setup - MCP server integration
- Multi-LLM Support - Platform comparison
- CHANGELOG - Version history and API changes
Version: 3.1.0-dev Last Updated: 2026-02-18 Status: ✅ Production Ready