This commit includes three major improvements:
## 1. Unified Create Command (v3.0.0 feature)
- Auto-detects source type (web, GitHub, local, PDF, config)
- Three-tier argument organization (universal, source-specific, advanced)
- Routes to existing scrapers (100% backward compatible)
- Progressive disclosure: 15 universal flags in default help
**New files:**
- src/skill_seekers/cli/source_detector.py - Auto-detection logic
- src/skill_seekers/cli/arguments/create.py - Argument definitions
- src/skill_seekers/cli/create_command.py - Main orchestrator
- src/skill_seekers/cli/parsers/create_parser.py - Parser integration
**Tests:**
- tests/test_source_detector.py (35 tests)
- tests/test_create_arguments.py (30 tests)
- tests/test_create_integration_basic.py (10 tests)
## 2. Enhanced Flag Consolidation (Phase 1)
- Consolidated 3 flags (--enhance, --enhance-local, --enhance-level) → 1 flag
- --enhance-level 0-3 with auto-detection of API vs LOCAL mode
- Default: --enhance-level 2 (balanced enhancement)
**Modified files:**
- arguments/{common,create,scrape,github,analyze}.py - Added enhance_level
- {doc_scraper,github_scraper,config_extractor,main}.py - Updated logic
- create_command.py - Uses consolidated flag
**Auto-detection:**
- If ANTHROPIC_API_KEY set → API mode
- Else → LOCAL mode (Claude Code)
## 3. PresetManager Bug Fix
- Fixed module naming conflict (presets.py vs presets/ directory)
- Moved presets.py → presets/manager.py
- Updated __init__.py exports
**Test Results:**
- All 160+ tests passing
- Zero regressions
- 100% backward compatible
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
🎯 Project Overview
Skill Seekers is the universal documentation preprocessor for AI systems. It transforms documentation websites, GitHub repositories, and PDFs into production-ready formats for 16+ platforms: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).
**Current Version:** v3.0.0
**Python Version:** 3.10+ required
**Status:** Production-ready, published on PyPI
**Website:** https://skillseekersweb.com/ - Browse configs, share, and access documentation
📚 Table of Contents
- First Time Here? - Start here!
- Quick Commands - Common workflows
- Architecture - How it works
- Development - Building & testing
- Testing - Test strategy
- Debugging - Troubleshooting
- Contributing - How to add features
👋 First Time Here?
Complete this 3-minute setup to start contributing:
# 1. Install package in editable mode (REQUIRED for development)
pip install -e .
# 2. Verify installation
python -c "import skill_seekers; print(skill_seekers.__version__)" # Should print: 3.0.0
# 3. Run a quick test
pytest tests/test_scraper_features.py::test_detect_language -v
# 4. You're ready! Pick a task from the roadmap:
# https://github.com/users/yusufkaraaslan/projects/2
Quick Navigation:
- Building/Testing → Development Commands
- Architecture → Core Design Pattern
- Common Issues → Common Pitfalls
- Contributing → See CONTRIBUTING.md
⚡ Quick Command Reference (Most Used)
First time setup:
pip install -e . # REQUIRED before running tests or CLI
Running tests (NEVER skip - user requirement):
pytest tests/ -v # All tests
pytest tests/test_scraper_features.py -v # Single file
pytest tests/ --cov=src/skill_seekers --cov-report=html # With coverage
Code quality checks (matches CI):
ruff check src/ tests/ # Lint
ruff format src/ tests/ # Format
mypy src/skill_seekers # Type check
Common workflows:
# Documentation scraping
skill-seekers scrape --config configs/react.json
# GitHub analysis
skill-seekers github --repo facebook/react
# Local codebase analysis
skill-seekers analyze --directory . --comprehensive
# Package for LLM platforms
skill-seekers package output/react/ --target claude
skill-seekers package output/react/ --target gemini
RAG Pipeline workflows:
# LangChain Documents
skill-seekers package output/react/ --format langchain
# LlamaIndex TextNodes
skill-seekers package output/react/ --format llama-index
# Haystack Documents
skill-seekers package output/react/ --format haystack
# ChromaDB direct upload
skill-seekers package output/react/ --format chroma --upload
# FAISS export
skill-seekers package output/react/ --format faiss
# Weaviate/Qdrant upload (requires API keys)
skill-seekers package output/react/ --format weaviate --upload
skill-seekers package output/react/ --format qdrant --upload
AI Coding Assistant workflows:
# Cursor IDE
skill-seekers package output/react/ --target claude
cp output/react-claude/SKILL.md .cursorrules
# Windsurf
cp output/react-claude/SKILL.md .windsurf/rules/react.md
# Cline (VS Code)
cp output/react-claude/SKILL.md .clinerules
# Continue.dev (universal IDE)
python examples/continue-dev-universal/context_server.py
# Configure in ~/.continue/config.json
Cloud Storage:
# Upload to S3
skill-seekers cloud upload --provider s3 --bucket my-skills output/react.zip
# Upload to GCS
skill-seekers cloud upload --provider gcs --bucket my-skills output/react.zip
# Upload to Azure
skill-seekers cloud upload --provider azure --container my-skills output/react.zip
🏗️ Architecture
Core Design Pattern: Platform Adaptors
The codebase uses the Strategy Pattern with a factory method to support 16 platforms across 4 categories:
src/skill_seekers/cli/adaptors/
├── __init__.py # Factory: get_adaptor(target/format)
├── base.py # Abstract base class
# LLM Platforms (3)
├── claude.py # Claude AI (ZIP + YAML)
├── gemini.py # Google Gemini (tar.gz)
├── openai.py # OpenAI ChatGPT (ZIP + Vector Store)
# RAG Frameworks (3)
├── langchain.py # LangChain Documents
├── llama_index.py # LlamaIndex TextNodes
├── haystack.py # Haystack Documents
# Vector Databases (5)
├── chroma.py # ChromaDB
├── faiss_helpers.py # FAISS
├── qdrant.py # Qdrant
├── weaviate.py # Weaviate
# AI Coding Assistants (4 - via Claude format + config files)
# - Cursor, Windsurf, Cline, Continue.dev
# Generic (1)
├── markdown.py # Generic Markdown (ZIP)
└── streaming_adaptor.py # Streaming data ingest
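The factory-plus-Strategy pattern described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the class names, registry, and return values here are assumptions for demonstration.

```python
from abc import ABC, abstractmethod


class BaseAdaptor(ABC):
    """Strategy interface: each platform supplies its own packaging logic."""

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> str: ...


class ClaudeAdaptor(BaseAdaptor):
    def package(self, skill_dir: str, output_path: str) -> str:
        # The real adaptor builds a ZIP with YAML frontmatter.
        return f"{output_path}/skill-claude.zip"


class GeminiAdaptor(BaseAdaptor):
    def package(self, skill_dir: str, output_path: str) -> str:
        # The real adaptor builds a tar.gz archive.
        return f"{output_path}/skill-gemini.tar.gz"


_REGISTRY = {"claude": ClaudeAdaptor, "gemini": GeminiAdaptor}


def get_adaptor(target: str) -> BaseAdaptor:
    """Factory: map a --target/--format string to a strategy instance."""
    try:
        return _REGISTRY[target]()
    except KeyError:
        raise ValueError(f"Unknown target: {target!r}") from None
```

Adding a platform then means adding one subclass and one registry entry, with no changes to callers.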
Key Methods:
- `package(skill_dir, output_path)` - Platform-specific packaging
- `upload(package_path, api_key)` - Platform-specific upload (where applicable)
- `enhance(skill_dir, mode)` - AI enhancement with platform-specific models
- `export(skill_dir, format)` - Export to RAG/vector DB formats
Data Flow (5 Phases)
1. Scrape Phase (`doc_scraper.py:scrape_all()`)
   - BFS traversal from base_url
   - Output: `output/{name}_data/pages/*.json`
2. Build Phase (`doc_scraper.py:build_skill()`)
   - Load pages → Categorize → Extract patterns
   - Output: `output/{name}/SKILL.md` + `references/*.md`
3. Enhancement Phase (optional, `enhance_skill_local.py`)
   - LLM analyzes references → Rewrites SKILL.md
   - Platform-specific models (Sonnet 4, Gemini 2.0, GPT-4o)
4. Package Phase (`package_skill.py` → adaptor)
   - Platform adaptor packages in appropriate format
   - Output: `.zip` or `.tar.gz`
5. Upload Phase (optional, `upload_skill.py` → adaptor)
   - Upload via platform API
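The Scrape Phase's BFS traversal can be sketched like this. It is a hedged sketch, not the scraper's actual implementation: `crawl` and its `fetch` callback (which returns the hrefs found on a page) are illustrative names.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(base_url: str, fetch, max_pages: int = 500) -> list[str]:
    """BFS traversal from base_url; fetch(url) returns a list of hrefs."""
    seen = {base_url}
    order: list[str] = []
    queue = deque([base_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in fetch(url):
            link = urljoin(url, href)
            # Stay on the documentation host; skip already-queued pages.
            if urlparse(link).netloc == urlparse(base_url).netloc and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

BFS (rather than DFS) keeps pages close to the docs root early in the output, which matters when `max_pages` truncates the crawl.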
File Structure (src/ layout) - Key Files Only
src/skill_seekers/
├── cli/ # All CLI commands
│ ├── main.py # ⭐ Git-style CLI dispatcher
│ ├── doc_scraper.py # ⭐ Main scraper (~790 lines)
│ │ ├── scrape_all() # BFS traversal engine
│ │ ├── smart_categorize() # Category detection
│ │ └── build_skill() # SKILL.md generation
│ ├── github_scraper.py # GitHub repo analysis
│ ├── codebase_scraper.py # ⭐ Local analysis (C2.x+C3.x)
│ ├── package_skill.py # Platform packaging
│ ├── unified_scraper.py # Multi-source scraping
│ ├── unified_codebase_analyzer.py # Three-stream GitHub+local analyzer
│ ├── enhance_skill_local.py # AI enhancement (LOCAL mode)
│ ├── enhance_status.py # Enhancement status monitoring
│ ├── upload_skill.py # Upload to platforms
│ ├── install_skill.py # Complete workflow automation
│ ├── install_agent.py # Install to AI agent directories
│ ├── pattern_recognizer.py # C3.1 Design pattern detection
│ ├── test_example_extractor.py # C3.2 Test example extraction
│ ├── how_to_guide_builder.py # C3.3 How-to guide generation
│ ├── config_extractor.py # C3.4 Configuration extraction
│ ├── generate_router.py # C3.5 Router skill generation
│ ├── code_analyzer.py # Multi-language code analysis
│ ├── api_reference_builder.py # API documentation builder
│ ├── dependency_analyzer.py # Dependency graph analysis
│ ├── signal_flow_analyzer.py # C3.10 Signal flow analysis (Godot)
│ ├── pdf_scraper.py # PDF extraction
│ └── adaptors/ # ⭐ Platform adaptor pattern
│ ├── __init__.py # Factory: get_adaptor()
│ ├── base_adaptor.py # Abstract base
│ ├── claude_adaptor.py # Claude AI
│ ├── gemini_adaptor.py # Google Gemini
│ ├── openai_adaptor.py # OpenAI ChatGPT
│ ├── markdown_adaptor.py # Generic Markdown
│ ├── langchain.py # LangChain RAG
│ ├── llama_index.py # LlamaIndex RAG
│ ├── haystack.py # Haystack RAG
│ ├── chroma.py # ChromaDB
│ ├── faiss_helpers.py # FAISS
│ ├── qdrant.py # Qdrant
│ ├── weaviate.py # Weaviate
│ └── streaming_adaptor.py # Streaming data ingest
└── mcp/ # MCP server (26 tools)
├── server_fastmcp.py # FastMCP server
└── tools/ # Tool implementations
Most Modified Files (when contributing):
- Platform adaptors: `src/skill_seekers/cli/adaptors/{platform}.py`
- Tests: `tests/test_{feature}.py`
- Configs: `configs/{framework}.json`
🛠️ Development Commands
Setup
# Install in editable mode (required before tests due to src/ layout)
pip install -e .
# Install with all platform dependencies
pip install -e ".[all-llms]"
# Install specific platforms
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[openai]" # OpenAI ChatGPT
Running Tests
CRITICAL: Never skip tests - User requires all tests to pass before commits.
# All tests (must run pip install -e . first!)
pytest tests/ -v
# Specific test file
pytest tests/test_scraper_features.py -v
# Multi-platform tests
pytest tests/test_install_multiplatform.py -v
# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html
# Single test
pytest tests/test_scraper_features.py::test_detect_language -v
# MCP server tests
pytest tests/test_mcp_fastmcp.py -v
Test Architecture:
- 46 test files covering all features
- CI Matrix: Ubuntu + macOS, Python 3.10-3.13
- 1,852 tests passing (up from 700+ in v2.x)
- Must run `pip install -e .` before tests (src/ layout requirement)
Building & Publishing
# Build package (using uv - recommended)
uv build
# Or using build
python -m build
# Publish to PyPI
uv publish
# Or using twine
python -m twine upload dist/*
Testing CLI Commands
# Test configuration wizard (NEW: v2.7.0)
skill-seekers config --show # Show current configuration
skill-seekers config --github # GitHub token setup
skill-seekers config --test # Test connections
# Test resume functionality (NEW: v2.7.0)
skill-seekers resume --list # List resumable jobs
skill-seekers resume --clean # Clean up old jobs
# Test GitHub scraping with profiles (NEW: v2.7.0)
skill-seekers github --repo facebook/react --profile personal # Use specific profile
skill-seekers github --repo owner/repo --non-interactive # CI/CD mode
# Test scraping (dry run)
skill-seekers scrape --config configs/react.json --dry-run
# Test codebase analysis (C2.x features)
skill-seekers analyze --directory . --output output/codebase/
# Test pattern detection (C3.1)
skill-seekers patterns --file src/skill_seekers/cli/code_analyzer.py
# Test how-to guide generation (C3.3)
skill-seekers how-to-guides output/test_examples.json --output output/guides/
# Test enhancement status monitoring
skill-seekers enhance-status output/react/ --watch
# Test multi-platform packaging
skill-seekers package output/react/ --target gemini --dry-run
# Test MCP server (stdio mode)
python -m skill_seekers.mcp.server_fastmcp
# Test MCP server (HTTP mode)
python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
New v3.0.0 CLI Commands
# Setup wizard (interactive configuration)
skill-seekers-setup
# Cloud storage operations
skill-seekers cloud upload --provider s3 --bucket my-bucket output/react.zip
skill-seekers cloud download --provider gcs --bucket my-bucket react.zip
skill-seekers cloud list --provider azure --container my-container
# Embedding server (for RAG pipelines)
skill-seekers embed --port 8080 --model sentence-transformers
# Sync & incremental updates
skill-seekers sync --source https://docs.react.dev/ --target output/react/
skill-seekers update --skill output/react/ --check-changes
# Quality metrics & benchmarking
skill-seekers quality --skill output/react/ --report
skill-seekers benchmark --config configs/react.json --compare-versions
# Multilingual support
skill-seekers multilang --detect output/react/
skill-seekers multilang --translate output/react/ --target zh-CN
# Streaming data ingest
skill-seekers stream --source docs/ --target output/streaming/
🔧 Key Implementation Details
CLI Architecture (Git-style)
Entry point: src/skill_seekers/cli/main.py
The unified CLI modifies sys.argv and calls existing main() functions to maintain backward compatibility:
# Example: skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
Subcommands: scrape, github, pdf, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides
Recent Additions:
- `codebase` - Local codebase analysis without GitHub API (C2.x + C3.x features)
- `enhance-status` - Monitor background/daemon enhancement processes
- `patterns` - Detect design patterns in code (C3.1)
- `how-to-guides` - Generate educational guides from tests (C3.3)
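The sys.argv rewriting that keeps the legacy main() functions backward compatible can be sketched as follows. This is a simplified illustration: the subcommand table and function names are hypothetical, not the dispatcher's actual code.

```python
import sys


def scrape_main():
    """Stand-in for a legacy entry point like doc_scraper.main()."""
    return sys.argv[1:]  # legacy main() parses sys.argv as before


SUBCOMMANDS = {"scrape": scrape_main}  # real table maps every subcommand


def dispatch(argv):
    """Rewrite sys.argv so the legacy main() sees its original interface."""
    cmd, rest = argv[1], argv[2:]
    sys.argv = [f"skill-seekers-{cmd}"] + rest  # drop the subcommand token
    return SUBCOMMANDS[cmd]()
```

Because each legacy main() still reads `sys.argv` itself, no argument-parsing code inside the individual tools has to change.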
Platform Adaptor Usage
from skill_seekers.cli.adaptors import get_adaptor
# Get platform-specific adaptor
adaptor = get_adaptor('gemini') # or 'claude', 'openai', 'markdown'
# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')
# Upload to platform
adaptor.upload(
package_path='output/react-gemini.tar.gz',
api_key=os.getenv('GOOGLE_API_KEY')
)
# AI enhancement
adaptor.enhance(skill_dir='output/react/', mode='api')
C3.x Codebase Analysis Features
The project has comprehensive codebase analysis capabilities (C3.1-C3.8):
C3.1 Design Pattern Detection (pattern_recognizer.py):
- Detects 10 common patterns: Singleton, Factory, Observer, Strategy, Decorator, Builder, Adapter, Command, Template Method, Chain of Responsibility
- Supports 9 languages: Python, JavaScript, TypeScript, C++, C, C#, Go, Rust, Java
- Three detection levels: surface (fast), deep (balanced), full (thorough)
- 87% precision, 80% recall on real-world projects
C3.2 Test Example Extraction (test_example_extractor.py):
- Extracts real usage examples from test files
- Categories: instantiation, method_call, config, setup, workflow
- AST-based for Python, regex-based for 8 other languages
- Quality filtering with confidence scoring
C3.3 How-To Guide Generation (how_to_guide_builder.py):
- Transforms test workflows into educational guides
- 5 AI enhancements: step descriptions, troubleshooting, prerequisites, next steps, use cases
- Dual-mode AI: API (fast) or LOCAL (free with Claude Code Max)
- 4 grouping strategies: AI tutorial group, file path, test name, complexity
C3.4 Configuration Pattern Extraction (config_extractor.py):
- Extracts configuration patterns from codebases
- Identifies config files, env vars, CLI arguments
- AI enhancement for better organization
C3.5 Architectural Overview (generate_router.py):
- Generates comprehensive ARCHITECTURE.md files
- Router skill generation for large documentation
- Quality improvements: 6.5/10 → 8.5/10 (+31%)
- Integrates GitHub metadata, issues, labels
C3.6 AI Enhancement (Claude API integration):
- Enhances C3.1-C3.5 with AI-powered insights
- Pattern explanations and improvement suggestions
- Test example context and best practices
- Guide enhancement with troubleshooting and prerequisites
C3.7 Architectural Pattern Detection (architectural_pattern_detector.py):
- Detects 8 architectural patterns (MVC, MVVM, MVP, Repository, etc.)
- Framework detection (Django, Flask, Spring, React, Angular, etc.)
- Multi-file analysis with directory structure patterns
- Evidence-based detection with confidence scoring
C3.8 Standalone Codebase Scraper (codebase_scraper.py):
# Quick analysis (1-2 min, basic features only)
skill-seekers analyze --directory /path/to/repo --quick
# Comprehensive analysis (20-60 min, all features + AI)
skill-seekers analyze --directory . --comprehensive
# With AI enhancement (auto-detects API or LOCAL)
skill-seekers analyze --directory . --enhance
# Granular AI enhancement control (NEW)
skill-seekers analyze --directory . --enhance-level 1 # SKILL.md only
skill-seekers analyze --directory . --enhance-level 2 # + Architecture + Config + Docs
skill-seekers analyze --directory . --enhance-level 3 # Full enhancement (all features)
# Disable specific features
skill-seekers analyze --directory . --skip-patterns --skip-how-to-guides
- Generates 300+ line standalone SKILL.md files from codebases
- All C3.x features integrated (patterns, tests, guides, config, architecture, docs)
- Complete codebase analysis without documentation scraping
- NEW: Granular AI enhancement control with `--enhance-level` (0-3)
C3.9 Project Documentation Extraction (codebase_scraper.py):
- Extracts and categorizes all markdown files from the project
- Auto-detects categories: overview, architecture, guides, workflows, features, etc.
- Integrates documentation into SKILL.md with summaries
- AI enhancement (level 2+) adds topic extraction and cross-references
- Controlled by depth: surface=raw copy, deep=parse+summarize, full=AI-enhanced
- Default ON, use `--skip-docs` to disable
C3.10 Signal Flow Analysis for Godot Projects (signal_flow_analyzer.py):
- Complete signal flow analysis system for event-driven Godot architectures
- Signal declaration extraction (detects `signal` keyword declarations)
- Connection mapping (tracks `.connect()` calls with targets and methods)
- Emission tracking (finds `.emit()` and `emit_signal()` calls)
- Real-world metrics: 208 signals, 634 connections, 298 emissions in test project
- Signal density metrics (signals per file)
- Event chain detection (signals triggering other signals)
- Signal pattern detection:
- EventBus Pattern (0.90 confidence): Centralized signal hub in autoload
- Observer Pattern (0.85 confidence): Multi-observer signals (3+ listeners)
- Event Chains (0.80 confidence): Cascading signal propagation
- Signal-based how-to guides (C3.10.1):
- AI-generated step-by-step usage guides (Connect → Emit → Handle)
- Real code examples from project
- Common usage locations
- Parameter documentation
- Outputs: `signal_flow.json`, `signal_flow.mmd` (Mermaid diagram), `signal_reference.md`, `signal_how_to_guides.md`
- Comprehensive Godot 4.x support:
- GDScript (.gd), Scene files (.tscn), Resources (.tres), Shaders (.gdshader)
- GDScript test extraction (GUT, gdUnit4, WAT frameworks)
- 396 test cases extracted in test project
- Framework detection (Unity, Unreal, Godot)
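The three extraction passes above (declarations, connections, emissions) boil down to scans over GDScript source. A rough sketch with illustrative regexes follows; the analyzer's actual patterns may differ.

```python
import re

# Illustrative patterns, not the analyzer's real ones.
SIGNAL_DECL = re.compile(r"^\s*signal\s+(\w+)", re.MULTILINE)
CONNECTION = re.compile(r"(\w+)\.connect\(")
EMISSION = re.compile(r"(\w+)\.emit\(|emit_signal\(\s*[\"'](\w+)[\"']")


def analyze_gdscript(source: str) -> dict:
    """Count signal declarations, connections, and emissions in one file."""
    return {
        "declared": SIGNAL_DECL.findall(source),
        "connections": len(CONNECTION.findall(source)),
        "emissions": len(EMISSION.findall(source)),
    }


gd = '''
signal health_changed
signal died
func take_damage(n):
    health_changed.emit(n)
    if health <= 0:
        emit_signal("died")
'''
```

Running `analyze_gdscript(gd)` finds both declarations and both emission styles; aggregating such per-file results over a project yields the density and chain metrics described above.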
Key Architecture Decision (BREAKING in v2.5.2):
- Changed from opt-in (`--build-*`) to opt-out (`--skip-*`) flags
- All analysis features now ON by default for maximum value
- Backward compatibility warnings for deprecated flags
Smart Categorization Algorithm
Located in doc_scraper.py:smart_categorize():
- Scores pages against category keywords
- 3 points for URL match, 2 for title, 1 for content
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
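A minimal reading of that scoring scheme (3/2/1 points, threshold of 2, "other" fallback) might look like this. It is a sketch of the documented rules, not the actual `smart_categorize()` implementation.

```python
def smart_categorize(url: str, title: str, content: str,
                     categories: dict[str, list[str]], threshold: int = 2) -> str:
    """Score a page against category keywords: 3 pts URL, 2 title, 1 content."""
    scores = {}
    for cat, keywords in categories.items():
        score = 0
        for kw in keywords:
            if kw in url.lower():
                score += 3
            if kw in title.lower():
                score += 2
            if kw in content.lower():
                score += 1
        scores[cat] = score
    best = max(scores, key=scores.get) if scores else None
    # Below the threshold, fall back to the catch-all category.
    return best if best and scores[best] >= threshold else "other"


cats = {"getting_started": ["intro", "quickstart"], "api": ["api", "reference"]}
print(smart_categorize("/docs/api/hooks", "API Reference", "useState reference", cats))  # → "api"
```

A page matching no keywords (or scoring below 2) lands in "other", matching the fallback described above.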
Language Detection
Located in doc_scraper.py:detect_language():
- CSS class attributes (`language-*`, `lang-*`)
- Heuristics (keywords like `def`, `const`, `func`)
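A sketch of that two-step detection (CSS class hint first, keyword heuristics second). The keyword table here is an assumption for illustration; the real `detect_language()` may use different markers.

```python
import re

CLASS_RE = re.compile(r"\b(?:language|lang)-(\w+)")
KEYWORDS = {  # hypothetical keyword table
    "python": ["def ", "import ", "self."],
    "javascript": ["const ", "=> ", "function "],
    "go": ["func ", ":= ", "package "],
}


def detect_language(css_classes: str, code: str) -> str:
    """Prefer the CSS class hint; fall back to keyword heuristics."""
    match = CLASS_RE.search(css_classes)
    if match:
        return match.group(1)
    scores = {lang: sum(kw in code for kw in kws) for lang, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "text"


print(detect_language("highlight language-python", ""))  # → "python" (class wins)
print(detect_language("", "func main() { x := 1 }"))     # → "go" (keyword fallback)
```

The class hint is authoritative when present because documentation sites usually annotate code blocks explicitly; the heuristics only kick in for unannotated blocks.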
Configuration File Structure
Configs (configs/*.json) define scraping behavior:
{
"name": "framework-name",
"description": "When to use this skill",
"base_url": "https://docs.example.com/",
"selectors": {
"main_content": "article", // CSS selector
"title": "h1",
"code_blocks": "pre code"
},
"url_patterns": {
"include": ["/docs"],
"exclude": ["/blog"]
},
"categories": {
"getting_started": ["intro", "quickstart"],
"api": ["api", "reference"]
},
"rate_limit": 0.5,
"max_pages": 500
}
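A minimal loader for such a config might look like the sketch below. The required-key list and defaults are inferred from the example above, not taken from the tool's actual validation logic; note also that `json.loads` rejects the `//` annotations shown in the example, so real config files must be comment-free JSON.

```python
import json

REQUIRED_KEYS = ("name", "base_url", "selectors")  # assumed minimum, for illustration


def load_config(text: str) -> dict:
    """Parse a scrape config and apply defaults matching the example above."""
    cfg = json.loads(text)
    missing = [k for k in REQUIRED_KEYS if k not in cfg]
    if missing:
        raise ValueError(f"config missing required keys: {missing}")
    cfg.setdefault("rate_limit", 0.5)  # seconds between requests
    cfg.setdefault("max_pages", 500)
    return cfg


cfg = load_config(
    '{"name": "react", "base_url": "https://docs.example.com/",'
    ' "selectors": {"main_content": "article"}}'
)
```

Failing fast on missing keys keeps a typo in a config from surfacing only halfway through a long scrape.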
🧪 Testing Guidelines
Test Coverage Requirements
- Core features: 100% coverage required
- Platform adaptors: Each platform has dedicated tests
- MCP tools: All 26 tools must be tested
- Integration tests: End-to-end workflows
Test Markers (from pytest.ini_options)
The project uses pytest markers to categorize tests:
# Run only fast unit tests (default)
pytest tests/ -v
# Run only slow tests (>5 seconds)
pytest tests/ -v -m slow
# Run integration tests (requires external services)
pytest tests/ -v -m integration
# Run end-to-end tests (resource-intensive, creates files)
pytest tests/ -v -m e2e
# Run tests requiring virtual environment setup
pytest tests/ -v -m venv
# Run bootstrap feature tests
pytest tests/ -v -m bootstrap
# Skip slow and integration tests (fastest)
pytest tests/ -v -m "not slow and not integration"
Test Execution Strategy
By default, only fast tests run. Use markers to control test execution:
# Default: Only fast tests (skip slow/integration/e2e)
pytest tests/ -v
# Run only slow tests (>5 seconds)
pytest tests/ -v -m slow
# Run only integration tests (requires external services)
pytest tests/ -v -m integration
# Run only e2e tests (resource-intensive, creates files)
pytest tests/ -v -m e2e
# Run ONLY fast tests (explicit)
pytest tests/ -v -m "not slow and not integration and not e2e"
# Run everything (CI does this)
pytest tests/ -v -m ""
When to use which:
- Local development: Default (fast tests only) - `pytest tests/ -v`
- Pre-commit: Fast tests - `pytest tests/ -v`
- Before PR: Include slow + integration - `pytest tests/ -v -m "not e2e"`
- CI validation: All tests run automatically
Key Test Files
- `test_scraper_features.py` - Core scraping functionality
- `test_mcp_server.py` - MCP integration (18 tools)
- `test_mcp_fastmcp.py` - FastMCP framework
- `test_unified.py` - Multi-source scraping
- `test_github_scraper.py` - GitHub analysis
- `test_pdf_scraper.py` - PDF extraction
- `test_install_multiplatform.py` - Multi-platform packaging
- `test_integration.py` - End-to-end workflows
- `test_install_skill.py` - One-command install
- `test_install_agent.py` - AI agent installation
- `conftest.py` - Test configuration (checks package installation)
🌐 Environment Variables
# Claude AI / Compatible APIs
# Option 1: Official Anthropic API (default)
export ANTHROPIC_API_KEY=sk-ant-...
# Option 2: GLM-4.7 Claude-compatible API (or any compatible endpoint)
export ANTHROPIC_API_KEY=your-api-key
export ANTHROPIC_BASE_URL=https://glm-4-7-endpoint.com/v1
# Google Gemini (optional)
export GOOGLE_API_KEY=AIza...
# OpenAI ChatGPT (optional)
export OPENAI_API_KEY=sk-...
# GitHub (for higher rate limits)
export GITHUB_TOKEN=ghp_...
# Private config repositories (optional)
export GITLAB_TOKEN=glpat-...
export GITEA_TOKEN=...
export BITBUCKET_TOKEN=...
All AI enhancement features respect these settings:
- `enhance_skill.py` - API mode SKILL.md enhancement
- `ai_enhancer.py` - C3.1/C3.2 pattern and test example enhancement
- `guide_enhancer.py` - C3.3 guide enhancement
- `config_enhancer.py` - C3.4 configuration enhancement
- `adaptors/claude.py` - Claude platform adaptor enhancement
Note: Setting ANTHROPIC_BASE_URL allows you to use any Claude-compatible API endpoint, such as GLM-4.7 (智谱 AI).
📦 Package Structure (pyproject.toml)
Entry Points
[project.scripts]
# Main unified CLI
skill-seekers = "skill_seekers.cli.main:main"
# Individual tool entry points (Core)
skill-seekers-config = "skill_seekers.cli.config_command:main" # v2.7.0 Configuration wizard
skill-seekers-resume = "skill_seekers.cli.resume_command:main" # v2.7.0 Resume interrupted jobs
skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
skill-seekers-github = "skill_seekers.cli.github_scraper:main"
skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
skill-seekers-codebase = "skill_seekers.cli.codebase_scraper:main" # C2.x Local codebase analysis
skill-seekers-enhance = "skill_seekers.cli.enhance_skill_local:main"
skill-seekers-enhance-status = "skill_seekers.cli.enhance_status:main" # Status monitoring
skill-seekers-package = "skill_seekers.cli.package_skill:main"
skill-seekers-upload = "skill_seekers.cli.upload_skill:main"
skill-seekers-estimate = "skill_seekers.cli.estimate_pages:main"
skill-seekers-install = "skill_seekers.cli.install_skill:main"
skill-seekers-install-agent = "skill_seekers.cli.install_agent:main"
skill-seekers-patterns = "skill_seekers.cli.pattern_recognizer:main" # C3.1 Pattern detection
skill-seekers-how-to-guides = "skill_seekers.cli.how_to_guide_builder:main" # C3.3 Guide generation
# New v3.0.0 Entry Points
skill-seekers-setup = "skill_seekers.cli.setup_wizard:main" # NEW: v3.0.0 Setup wizard
skill-seekers-cloud = "skill_seekers.cli.cloud_storage_cli:main" # NEW: v3.0.0 Cloud storage
skill-seekers-embed = "skill_seekers.embedding.server:main" # NEW: v3.0.0 Embedding server
skill-seekers-sync = "skill_seekers.cli.sync_cli:main" # NEW: v3.0.0 Sync & monitoring
skill-seekers-benchmark = "skill_seekers.cli.benchmark_cli:main" # NEW: v3.0.0 Benchmarking
skill-seekers-stream = "skill_seekers.cli.streaming_ingest:main" # NEW: v3.0.0 Streaming ingest
skill-seekers-update = "skill_seekers.cli.incremental_updater:main" # NEW: v3.0.0 Incremental updates
skill-seekers-multilang = "skill_seekers.cli.multilang_support:main" # NEW: v3.0.0 Multilingual
skill-seekers-quality = "skill_seekers.cli.quality_metrics:main" # NEW: v3.0.0 Quality metrics
Optional Dependencies
Project uses PEP 735 `[dependency-groups]`:
- Replaces deprecated `tool.uv.dev-dependencies`
- Dev dependencies: `[dependency-groups] dev = [...]` in pyproject.toml
- Install with: `pip install -e .` (installs only core deps)
- Install dev deps: See CI workflow or manually install pytest, ruff, mypy
[project.optional-dependencies]
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]
[dependency-groups] # PEP 735 (replaces tool.uv.dev-dependencies)
dev = [
"pytest>=8.4.2",
"pytest-asyncio>=0.24.0",
"pytest-cov>=7.0.0",
"coverage>=7.11.0",
]
🚨 Critical Development Notes
Must Run Before Tests
# REQUIRED: Install package before running tests
pip install -e .
# Why: src/ layout requires package installation
# Without this, imports will fail
Never Skip Tests
Per user instructions in ~/.claude/CLAUDE.md:
- "never skipp any test. always make sure all test pass"
- All 1,852 tests must pass before commits
- Run full test suite: `pytest tests/ -v`
Platform-Specific Dependencies
Platform dependencies are optional (install only what you need):
# Install specific platform support
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[openai]" # OpenAI ChatGPT
pip install -e ".[chroma]" # ChromaDB
pip install -e ".[weaviate]" # Weaviate
pip install -e ".[s3]" # AWS S3
pip install -e ".[gcs]" # Google Cloud Storage
pip install -e ".[azure]" # Azure Blob Storage
pip install -e ".[mcp]" # MCP integration
pip install -e ".[all]" # Everything (16 platforms + cloud + embedding)
# Or install from PyPI:
pip install skill-seekers[gemini] # Google Gemini support
pip install skill-seekers[openai] # OpenAI ChatGPT support
pip install skill-seekers[all-llms] # All LLM platforms
pip install skill-seekers[chroma] # ChromaDB support
pip install skill-seekers[weaviate] # Weaviate support
pip install skill-seekers[s3] # AWS S3 support
pip install skill-seekers[all] # All optional dependencies
AI Enhancement Modes
AI enhancement transforms basic skills (2-3/10) into production-ready skills (8-9/10). Two modes available:
API Mode (default if ANTHROPIC_API_KEY is set):
- Direct Claude API calls (fast, efficient)
- Cost: ~$0.15-$0.30 per skill
- Perfect for CI/CD automation
- Requires: `export ANTHROPIC_API_KEY=sk-ant-...`
LOCAL Mode (fallback if no API key):
- Uses Claude Code CLI (your existing Max plan)
- Free! No API charges
- 4 execution modes:
  - Headless (default): Foreground, waits for completion
  - Background (`--background`): Returns immediately
  - Daemon (`--daemon`): Fully detached with nohup
  - Terminal (`--interactive-enhancement`): Opens new terminal (macOS)
- Status monitoring: `skill-seekers enhance-status output/react/ --watch`
- Timeout configuration: `--timeout 300` (seconds)
Force Mode (default ON since v2.5.2):
- Skip all confirmations automatically
- Perfect for CI/CD, batch processing
- Use `--no-force` to enable prompts if needed
# API mode (if ANTHROPIC_API_KEY is set)
skill-seekers enhance output/react/
# LOCAL mode (no API key needed)
skill-seekers enhance output/react/ --mode LOCAL
# Background with status monitoring
skill-seekers enhance output/react/ --background
skill-seekers enhance-status output/react/ --watch
# Force mode OFF (enable prompts)
skill-seekers enhance output/react/ --no-force
See docs/ENHANCEMENT_MODES.md for detailed documentation.
Git Workflow
Git Workflow Notes:
- Main branch: `main`
- Development branch: `development`
- Always create feature branches from `development`
- Branch naming: `feature/{task-id}-{description}` or `feature/{category}`
To see current status: git status
CI/CD Pipeline
The project has GitHub Actions workflows in .github/workflows/:
tests.yml - Runs on every push and PR to main or development:
1. Lint Job (Python 3.12, Ubuntu):
   - `ruff check src/ tests/` - Code linting with GitHub annotations
   - `ruff format --check src/ tests/` - Format validation
   - `mypy src/skill_seekers` - Type checking (continue-on-error)
2. Test Job (Matrix):
   - OS: Ubuntu + macOS
   - Python: 3.10, 3.11, 3.12
   - Exclusions: macOS + Python 3.10 (speed optimization)
   - Steps:
     - Install dependencies + `pip install -e .`
     - Run CLI tests (scraper, config, integration)
     - Run MCP server tests
     - Generate coverage report → Upload to Codecov
3. Summary Job - Single status check for branch protection
   - Ensures both lint and test jobs succeed
   - Provides single "All Checks Complete" status
release.yml - Triggers on version tags (e.g., v2.9.0):
- Builds package with `uv build`
- Publishes to PyPI with `uv publish`
- Creates GitHub release
Local Pre-Commit Validation
Run the same checks as CI before pushing:
# 1. Code quality (matches lint job)
ruff check src/ tests/
ruff format --check src/ tests/
mypy src/skill_seekers
# 2. Tests (matches test job)
pip install -e .
pytest tests/ -v --cov=src/skill_seekers --cov-report=term
# 3. If all pass, you're good to push!
git push origin feature/my-feature
Branch Protection Rules:
- main: Requires tests + 1 review, only maintainers merge
- development: Requires tests to pass, default target for PRs
🚨 Common Pitfalls & Solutions
1. Import Errors
Problem: ModuleNotFoundError: No module named 'skill_seekers'
Solution: Must install package first due to src/ layout
pip install -e .
Why: The src/ layout prevents imports from repo root. Package must be installed.
2. Tests Fail with "No module named..."
Problem: Package not installed in test environment
Solution: CI runs pip install -e . before tests - do the same locally
pip install -e .
pytest tests/ -v
3. Platform-Specific Dependencies Not Found
Problem: ModuleNotFoundError: No module named 'google.generativeai'
Solution: Install platform-specific dependencies
pip install -e ".[gemini]" # For Gemini
pip install -e ".[openai]" # For OpenAI
pip install -e ".[all-llms]" # For all platforms
4. Git Branch Confusion
Problem: PR targets main instead of development
Solution: Always create PRs targeting development branch
git checkout development
git pull upstream development
git checkout -b feature/my-feature
# ... make changes ...
git push origin feature/my-feature
# Create PR: feature/my-feature → development
Important: See CONTRIBUTING.md for complete branch workflow.
5. Tests Pass Locally But Fail in CI
Problem: Different Python version or missing dependency
Solution: Test with multiple Python versions locally
# CI tests: Python 3.10, 3.11, 3.12 on Ubuntu + macOS
# Use pyenv or docker to test locally:
pyenv install 3.10.13 3.11.7 3.12.1
pyenv local 3.10.13
pip install -e . && pytest tests/ -v
pyenv local 3.11.7
pip install -e . && pytest tests/ -v
pyenv local 3.12.1
pip install -e . && pytest tests/ -v
6. Enhancement Not Working
Problem: AI enhancement fails or hangs
Solutions:
# Check if API key is set
echo $ANTHROPIC_API_KEY
# Try LOCAL mode instead (uses Claude Code Max, no API key needed)
skill-seekers enhance output/react/ --mode LOCAL
# Monitor enhancement status for background jobs
skill-seekers enhance-status output/react/ --watch
7. Rate Limit Errors from GitHub
Problem: 403 Forbidden from GitHub API
Solutions:
# Check current rate limit
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
# Configure multiple GitHub profiles (recommended)
skill-seekers config --github
# Use specific profile
skill-seekers github --repo owner/repo --profile work
# Test all configured tokens
skill-seekers config --test
🔌 MCP Integration
MCP Server (26 Tools)
Transport modes:
- stdio: Claude Code, VS Code + Cline
- HTTP: Cursor, Windsurf, IntelliJ IDEA
Core Tools (9):
1. list_configs - List preset configurations
2. generate_config - Generate config from docs URL
3. validate_config - Validate config structure
4. estimate_pages - Estimate page count
5. scrape_docs - Scrape documentation
6. package_skill - Package to format (supports --format and --target)
7. upload_skill - Upload to platform (supports --target)
8. enhance_skill - AI enhancement with platform support
9. install_skill - Complete workflow automation
Extended Tools (10):
10. scrape_github - GitHub repository analysis
11. scrape_pdf - PDF extraction
12. unified_scrape - Multi-source scraping
13. merge_sources - Merge docs + code
14. detect_conflicts - Find discrepancies
15. add_config_source - Register git repos
16. fetch_config - Fetch configs from git
17. list_config_sources - List registered sources
18. remove_config_source - Remove config source
19. split_config - Split large configs
NEW Vector DB Tools (4):
20. export_to_chroma - Export to ChromaDB
21. export_to_weaviate - Export to Weaviate
22. export_to_faiss - Export to FAISS
23. export_to_qdrant - Export to Qdrant
NEW Cloud Tools (3):
24. cloud_upload - Upload to S3/GCS/Azure
25. cloud_download - Download from cloud storage
26. cloud_list - List files in cloud storage
Starting MCP Server
# stdio mode (Claude Code, VS Code + Cline)
python -m skill_seekers.mcp.server_fastmcp
# HTTP mode (Cursor, Windsurf, IntelliJ)
python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
🤖 RAG Framework & Vector Database Integrations (NEW - v3.0.0)
Skill Seekers is now the universal preprocessor for RAG pipelines. Export documentation to any RAG framework or vector database with a single command.
RAG Frameworks
LangChain Documents:
# Export to LangChain Document format
skill-seekers package output/django --format langchain
# Output: output/django-langchain.json
# Format: Array of LangChain Document objects
# - page_content: Full text content
# - metadata: {source, category, type, url}
# Use in LangChain (JSONLoader requires the jq package):
from langchain.document_loaders import JSONLoader
loader = JSONLoader("output/django-langchain.json", jq_schema=".[]",
                    content_key="page_content", text_content=False)
documents = loader.load()
LlamaIndex TextNodes:
# Export to LlamaIndex TextNode format
skill-seekers package output/django --format llama-index
# Output: output/django-llama-index.json
# Format: Array of LlamaIndex TextNode objects
# - text: Content
# - id_: Unique identifier
# - metadata: {source, category, type}
# - relationships: Document relationships
# Use in LlamaIndex:
import json
from llama_index.schema import TextNode
nodes = [TextNode.from_dict(n) for n in json.load(open("output/django-llama-index.json"))]
Haystack Documents:
# Export to Haystack Document format
skill-seekers package output/django --format haystack
# Output: output/django-haystack.json
# Format: Haystack Document objects for pipelines
# Perfect for: Question answering, search, RAG pipelines
Vector Databases
ChromaDB (Direct Integration):
# Export and optionally upload to ChromaDB
skill-seekers package output/django --format chroma
# Output: output/django-chroma/ (ChromaDB collection)
# With direct upload (requires chromadb running):
skill-seekers package output/django --format chroma --upload
# Configuration via environment:
export CHROMA_HOST=localhost
export CHROMA_PORT=8000
FAISS (Facebook AI Similarity Search):
# Export to FAISS index format
skill-seekers package output/django --format faiss
# Output:
# - output/django-faiss.index (FAISS index)
# - output/django-faiss-metadata.json (Document metadata)
# Use with FAISS:
import faiss, json
index = faiss.read_index("output/django-faiss.index")
metadata = json.load(open("output/django-faiss-metadata.json"))
Weaviate:
# Export and upload to Weaviate
skill-seekers package output/django --format weaviate --upload
# Requires environment variables:
export WEAVIATE_URL=http://localhost:8080
export WEAVIATE_API_KEY=your-api-key
# Creates class "DjangoDoc" with schema
Qdrant:
# Export and upload to Qdrant
skill-seekers package output/django --format qdrant --upload
# Requires environment variables:
export QDRANT_URL=http://localhost:6333
export QDRANT_API_KEY=your-api-key
# Creates collection "django_docs"
Pinecone (via Markdown):
# Pinecone uses the markdown format
skill-seekers package output/django --target markdown
# Then use Pinecone's Python client for upsert
# See: docs/integrations/PINECONE.md
Complete RAG Pipeline Example
# 1. Scrape documentation
skill-seekers scrape --config configs/django.json
# 2. Export to your RAG stack
skill-seekers package output/django --format langchain # For LangChain
skill-seekers package output/django --format llama-index # For LlamaIndex
skill-seekers package output/django --format chroma --upload # Direct to ChromaDB
# 3. Use in your application
# See examples/:
# - examples/langchain-rag-pipeline/
# - examples/llama-index-query-engine/
# - examples/pinecone-upsert/
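Once exported, the LangChain-format JSON (an array of objects with page_content and metadata fields) plugs into even the simplest retriever. A minimal sketch, using a naive keyword scorer as a stand-in for real embeddings and in-memory sample documents instead of an actual export file:

```python
import json

def load_documents(path):
    """Load a skill-seekers LangChain export: a JSON array of
    {"page_content": ..., "metadata": {...}} objects."""
    with open(path) as f:
        return json.load(f)

def keyword_search(documents, query, top_k=3):
    """Rank documents by how often query terms appear in page_content;
    a naive stand-in for a real embedding-based retriever."""
    terms = query.lower().split()
    scored = [
        (sum(doc["page_content"].lower().count(term) for term in terms), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

# In-memory sample documents standing in for output/django-langchain.json:
docs = [
    {"page_content": "Django ORM querysets are lazy.", "metadata": {"category": "orm"}},
    {"page_content": "Templates render HTML responses.", "metadata": {"category": "templates"}},
]
hits = keyword_search(docs, "queryset lazy")
```

Swapping the scorer for an embedding model is the only change needed to turn this into a real RAG retriever.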
Integration Hub: docs/integrations/RAG_PIPELINES.md
🛠️ AI Coding Assistant Integrations (NEW - v3.0.0)
Transform any framework documentation into persistent expert context for 4+ AI coding assistants. Your IDE's AI now "knows" your frameworks without manual prompting.
Cursor IDE
Setup:
# 1. Generate skill
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/ --target claude
# 2. Install to Cursor
cp output/react-claude/SKILL.md .cursorrules
# 3. Restart Cursor
# AI now has React expertise!
Benefits:
- ✅ AI suggests React-specific patterns
- ✅ No manual "use React hooks" prompts needed
- ✅ Consistent team patterns
- ✅ Works for ANY framework
Guide: docs/integrations/CURSOR.md Example: examples/cursor-react-skill/
Windsurf
Setup:
# 1. Generate skill
skill-seekers scrape --config configs/django.json
skill-seekers package output/django/ --target claude
# 2. Install to Windsurf
mkdir -p .windsurf/rules
cp output/django-claude/SKILL.md .windsurf/rules/django.md
# 3. Restart Windsurf
# AI now knows Django patterns!
Benefits:
- ✅ Flow-based coding with framework knowledge
- ✅ IDE-native AI assistance
- ✅ Persistent context across sessions
Guide: docs/integrations/WINDSURF.md Example: examples/windsurf-fastapi-context/
Cline (VS Code Extension)
Setup:
# 1. Generate skill
skill-seekers scrape --config configs/fastapi.json
skill-seekers package output/fastapi/ --target claude
# 2. Install to Cline
cp output/fastapi-claude/SKILL.md .clinerules
# 3. Reload VS Code
# Cline now has FastAPI expertise!
Benefits:
- ✅ Agentic code generation in VS Code
- ✅ Cursor Composer equivalent for VS Code
- ✅ System prompts + MCP integration
Guide: docs/integrations/CLINE.md Example: examples/cline-django-assistant/
Continue.dev (Universal IDE)
Setup:
# 1. Generate skill
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/ --target claude
# 2. Start context server
cd examples/continue-dev-universal/
python context_server.py --port 8765
# 3. Configure in ~/.continue/config.json
{
"contextProviders": [
{
"name": "http",
"params": {
"url": "http://localhost:8765/context",
"title": "React Documentation"
}
}
]
}
# 4. Works in ALL IDEs!
# VS Code, JetBrains, Vim, Emacs...
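For illustration only, here is what a minimal context endpoint of this shape could look like with just the stdlib; the /context path and JSON payload are assumptions for this sketch, and examples/continue-dev-universal/context_server.py remains the real implementation:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory context; a real server would read SKILL.md and
# reference files from the packaged skill directory.
CONTEXT = {"name": "React Documentation", "content": "React hooks: useState, useEffect, ..."}

class ContextHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/context":  # path assumed for this sketch
            body = json.dumps(CONTEXT).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve(port=8765):
    """Blocking call: serve the context endpoint on localhost."""
    HTTPServer(("127.0.0.1", port), ContextHandler).serve_forever()
```

Serving over plain HTTP is what makes this work from any IDE: each editor's Continue.dev extension just issues a GET to the configured URL.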
Benefits:
- ✅ IDE-agnostic (works in VS Code, IntelliJ, Vim, Emacs)
- ✅ Custom LLM providers supported
- ✅ HTTP-based context serving
- ✅ Team consistency across mixed IDE environments
Guide: docs/integrations/CONTINUE_DEV.md Example: examples/continue-dev-universal/
Multi-IDE Team Setup
For teams using different IDEs (VS Code, IntelliJ, Vim):
# Use Continue.dev as universal context provider
skill-seekers scrape --config configs/react.json
python context_server.py --host 0.0.0.0 --port 8765
# ALL team members configure Continue.dev
# Result: Identical AI suggestions across all IDEs!
Integration Hub: docs/integrations/INTEGRATIONS.md
☁️ Cloud Storage Integration (NEW - v3.0.0)
Upload skills directly to cloud storage for team sharing and CI/CD pipelines.
Supported Providers
AWS S3:
# Upload skill
skill-seekers cloud upload --provider s3 --bucket my-skills output/react.zip
# Download skill
skill-seekers cloud download --provider s3 --bucket my-skills react.zip
# List skills
skill-seekers cloud list --provider s3 --bucket my-skills
# Environment variables:
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_REGION=us-east-1
Google Cloud Storage:
# Upload skill
skill-seekers cloud upload --provider gcs --bucket my-skills output/react.zip
# Download skill
skill-seekers cloud download --provider gcs --bucket my-skills react.zip
# List skills
skill-seekers cloud list --provider gcs --bucket my-skills
# Environment variables:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
Azure Blob Storage:
# Upload skill
skill-seekers cloud upload --provider azure --container my-skills output/react.zip
# Download skill
skill-seekers cloud download --provider azure --container my-skills react.zip
# List skills
skill-seekers cloud list --provider azure --container my-skills
# Environment variables:
export AZURE_STORAGE_CONNECTION_STRING=your-connection-string
CI/CD Integration
# GitHub Actions example
- name: Upload skill to S3
run: |
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/
skill-seekers cloud upload --provider s3 --bucket ci-skills output/react.zip
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Guide: docs/integrations/CLOUD_STORAGE.md
📋 Common Workflows
Adding a New Platform
- Create adaptor in src/skill_seekers/cli/adaptors/{platform}_adaptor.py
- Inherit from BaseAdaptor
- Implement package(), upload(), enhance() methods
- Add to factory in adaptors/__init__.py
- Add optional dependency to pyproject.toml
- Add tests in tests/test_install_multiplatform.py
Adding a New Feature
- Implement in appropriate CLI module
- Add entry point to pyproject.toml if needed
- Add tests in tests/test_{feature}.py
- Run full test suite: pytest tests/ -v
- Update CHANGELOG.md
- Commit only when all tests pass
Debugging Common Issues
Import Errors:
# Always ensure package is installed first
pip install -e .
# Verify installation
python -c "import skill_seekers; print(skill_seekers.__version__)"
Rate Limit Issues:
# Check current GitHub rate limit status
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
# Configure multiple GitHub profiles
skill-seekers config --github
# Test your tokens
skill-seekers config --test
Enhancement Not Working:
# Check if API key is set
echo $ANTHROPIC_API_KEY
# Try LOCAL mode instead (uses Claude Code Max)
skill-seekers enhance output/react/ --mode LOCAL
# Monitor enhancement status
skill-seekers enhance-status output/react/ --watch
Test Failures:
# Run specific failing test with verbose output
pytest tests/test_file.py::test_name -vv
# Run with print statements visible
pytest tests/test_file.py -s
# Run with coverage to see what's not tested
pytest tests/test_file.py --cov=src/skill_seekers --cov-report=term-missing
# Run only unit tests (skip slow integration tests)
pytest tests/ -v -m "not slow and not integration"
Config Issues:
# Validate config structure
skill-seekers-validate configs/myconfig.json
# Show current configuration
skill-seekers config --show
# Estimate pages before scraping
skill-seekers estimate configs/myconfig.json
🎯 Where to Make Changes
This section helps you quickly locate the right files when implementing common changes.
Adding a New CLI Command
Files to modify:
- Create command file: src/skill_seekers/cli/my_command.py
  def main():
      """Entry point for my-command."""
      # Implementation
- Add entry point: pyproject.toml
  [project.scripts]
  skill-seekers-my-command = "skill_seekers.cli.my_command:main"
- Update unified CLI: src/skill_seekers/cli/main.py - add subcommand handler to dispatcher
- Add tests: tests/test_my_command.py
  - Test main functionality
  - Test CLI argument parsing
  - Test error cases
- Update docs: CHANGELOG.md + README.md (if user-facing)
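A slightly fuller, self-contained version of the command-file skeleton from step 1 (flag names here are illustrative, not an existing command's interface):

```python
import argparse

def build_parser():
    """Parser for the hypothetical my-command entry point."""
    parser = argparse.ArgumentParser(
        prog="skill-seekers-my-command",
        description="Skeleton for a new skill-seekers subcommand.",
    )
    parser.add_argument("--config", required=True, help="Path to a JSON config")
    parser.add_argument("--dry-run", action="store_true", help="Parse arguments and exit")
    return parser

def main(argv=None):
    """Entry point for my-command."""
    args = build_parser().parse_args(argv)
    if args.dry_run:
        print(f"Would process {args.config}")
        return 0
    # ... real implementation goes here ...
    return 0
```

Keeping the parser in its own function makes the argument surface testable without invoking the command.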
Adding a New Platform Adaptor
Files to modify:
- Create adaptor: src/skill_seekers/cli/adaptors/my_platform_adaptor.py
  from .base import BaseAdaptor

  class MyPlatformAdaptor(BaseAdaptor):
      def package(self, skill_dir, output_path, **kwargs):
          # Platform-specific packaging
          pass

      def upload(self, package_path, api_key=None, **kwargs):
          # Platform-specific upload (optional for some platforms)
          pass

      def export(self, skill_dir, format, **kwargs):
          # For RAG/vector DB adaptors: export to specific format
          pass
- Register in factory: src/skill_seekers/cli/adaptors/__init__.py
  def get_adaptor(target=None, format=None):
      # For LLM platforms (--target flag)
      target_adaptors = {
          'claude': ClaudeAdaptor,
          'gemini': GeminiAdaptor,
          'openai': OpenAIAdaptor,
          'markdown': MarkdownAdaptor,
          'myplatform': MyPlatformAdaptor,  # ADD THIS
      }
      # For RAG/vector DBs (--format flag)
      format_adaptors = {
          'langchain': LangChainAdaptor,
          'llama-index': LlamaIndexAdaptor,
          'chroma': ChromaAdaptor,
          # ... etc
      }
- Add optional dependency: pyproject.toml
  [project.optional-dependencies]
  myplatform = ["myplatform-sdk>=1.0.0"]
- Add tests: tests/test_adaptors/test_my_platform_adaptor.py
  - Test export format
  - Test upload (if applicable)
  - Test with real data
- Update documentation:
  - README.md - Platform comparison table
  - docs/integrations/MY_PLATFORM.md - Integration guide
  - examples/my-platform-example/ - Working example
Adding a New Config Preset
Files to modify:
- Create config: configs/my_framework.json
  {
    "name": "my_framework",
    "base_url": "https://docs.myframework.com/",
    "selectors": {...},
    "categories": {...}
  }
- Test locally:
  # Estimate first
  skill-seekers estimate configs/my_framework.json
  # Test scrape (small sample)
  skill-seekers scrape --config configs/my_framework.json --max-pages 50
- Add to README: Update presets table in README.md
- Submit to website: (Optional) Submit to SkillSeekersWeb.com
Modifying Core Scraping Logic
Key files by feature:
| Feature | File | Size | Notes |
|---|---|---|---|
| Doc scraping | src/skill_seekers/cli/doc_scraper.py | ~90KB | Main scraper, BFS traversal |
| GitHub scraping | src/skill_seekers/cli/github_scraper.py | ~56KB | Repo analysis + metadata |
| GitHub API | src/skill_seekers/cli/github_fetcher.py | ~17KB | Rate limit handling |
| PDF extraction | src/skill_seekers/cli/pdf_scraper.py | Medium | PyMuPDF + OCR |
| Code analysis | src/skill_seekers/cli/code_analyzer.py | ~65KB | Multi-language AST parsing |
| Pattern detection | src/skill_seekers/cli/pattern_recognizer.py | Medium | C3.1 - 10 GoF patterns |
| Test extraction | src/skill_seekers/cli/test_example_extractor.py | Medium | C3.2 - 5 categories |
| Guide generation | src/skill_seekers/cli/how_to_guide_builder.py | ~45KB | C3.3 - AI-enhanced guides |
| Config extraction | src/skill_seekers/cli/config_extractor.py | ~32KB | C3.4 - 9 formats |
| Router generation | src/skill_seekers/cli/generate_router.py | ~43KB | C3.5 - Architecture docs |
| Signal flow | src/skill_seekers/cli/signal_flow_analyzer.py | Medium | C3.10 - Godot-specific |
Always add tests when modifying core logic!
Adding MCP Tools
Files to modify:
- Add tool function: src/skill_seekers/mcp/tools/{category}_tools.py
- Register tool: src/skill_seekers/mcp/server.py
  @mcp.tool()
  def my_new_tool(param: str) -> str:
      """Tool description."""
      # Implementation
- Add tests: tests/test_mcp_fastmcp.py
- Update count: README.md (currently 26 tools)
📍 Key Files Quick Reference
| Task | File(s) | What to Modify |
|---|---|---|
| Add new CLI command | src/skill_seekers/cli/my_cmd.py, pyproject.toml | Create main() function; add entry point |
| Add platform adaptor | src/skill_seekers/cli/adaptors/my_platform.py, adaptors/__init__.py | Inherit BaseAdaptor; register in factory |
| Fix scraping logic | src/skill_seekers/cli/doc_scraper.py | scrape_all(), extract_content() |
| Add MCP tool | src/skill_seekers/mcp/server_fastmcp.py | Add @mcp.tool() function |
| Fix tests | tests/test_{feature}.py | Add/modify test functions |
| Add config preset | configs/{framework}.json | Create JSON config |
| Update CI | .github/workflows/tests.yml | Modify workflow steps |
📚 Key Code Locations
Documentation Scraper (src/skill_seekers/cli/doc_scraper.py):
- is_valid_url() - URL validation
- extract_content() - Content extraction
- detect_language() - Code language detection
- extract_patterns() - Pattern extraction
- smart_categorize() - Smart categorization
- infer_categories() - Category inference
- generate_quick_reference() - Quick reference generation
- create_enhanced_skill_md() - SKILL.md generation
- scrape_all() - Main scraping loop
- main() - Entry point
Codebase Analysis (src/skill_seekers/cli/):
- codebase_scraper.py - Main CLI for local codebase analysis
- code_analyzer.py - Multi-language AST parsing (9 languages)
- api_reference_builder.py - API documentation generation
- dependency_analyzer.py - NetworkX-based dependency graphs
- pattern_recognizer.py - C3.1 design pattern detection
- test_example_extractor.py - C3.2 test example extraction
- how_to_guide_builder.py - C3.3 guide generation
- config_extractor.py - C3.4 configuration extraction
- generate_router.py - C3.5 router skill generation
- signal_flow_analyzer.py - C3.10 signal flow analysis (Godot projects)
- unified_codebase_analyzer.py - Three-stream GitHub+local analyzer
AI Enhancement (src/skill_seekers/cli/):
- enhance_skill_local.py - LOCAL mode enhancement (4 execution modes)
- enhance_skill.py - API mode enhancement
- enhance_status.py - Status monitoring for background processes
- ai_enhancer.py - Shared AI enhancement logic
- guide_enhancer.py - C3.3 guide AI enhancement
- config_enhancer.py - C3.4 config AI enhancement
Platform Adaptors (src/skill_seekers/cli/adaptors/):
- __init__.py - Factory function
- base_adaptor.py - Abstract base class
- claude_adaptor.py - Claude AI implementation
- gemini_adaptor.py - Google Gemini implementation
- openai_adaptor.py - OpenAI ChatGPT implementation
- markdown_adaptor.py - Generic Markdown implementation
MCP Server (src/skill_seekers/mcp/):
- server.py - FastMCP-based server
- tools/ - 26 MCP tool implementations
Configuration & Rate Limit Management (NEW: v2.7.0 - src/skill_seekers/cli/):
- config_manager.py - Multi-token configuration system (~490 lines)
  - ConfigManager class - Singleton pattern for global config access
  - add_github_profile() - Add GitHub profile with token and strategy
  - get_github_token() - Smart fallback chain (CLI → Env → Config → Prompt)
  - get_next_profile() - Profile switching for rate limit handling
  - save_progress() / load_progress() - Job resumption support
  - cleanup_old_progress() - Auto-cleanup of old jobs (7 days default)
- config_command.py - Interactive configuration wizard (~400 lines)
  - main_menu() - 7-option main menu with navigation
  - github_token_menu() - GitHub profile management
  - add_github_profile() - Guided token setup with browser integration
  - api_keys_menu() - API key configuration for Claude/Gemini/OpenAI
  - test_connections() - Connection testing for tokens and API keys
- rate_limit_handler.py - Smart rate limit detection and handling (~450 lines)
  - RateLimitHandler class - Strategy pattern for rate limit handling
  - check_upfront() - Upfront rate limit check before starting
  - check_response() - Real-time detection from API responses
  - handle_rate_limit() - Execute strategy (prompt/wait/switch/fail)
  - try_switch_profile() - Automatic profile switching
  - wait_for_reset() - Countdown timer with live progress
  - show_countdown_timer() - Live terminal countdown display
- resume_command.py - Resume interrupted scraping jobs (~150 lines)
  - list_resumable_jobs() - Display all jobs with progress details
  - resume_job() - Resume from saved checkpoint
  - clean_old_jobs() - Cleanup old progress files
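The CLI → Env → Config → Prompt fallback chain behind get_github_token() can be sketched as follows (the function signature and config shape here are illustrative, not the real ConfigManager API):

```python
import os

def resolve_github_token(cli_token=None, config=None, prompt=None):
    """Resolve a GitHub token via the chain: explicit CLI flag, then the
    GITHUB_TOKEN env var, then stored config, then an interactive prompt
    as a last resort."""
    if cli_token:
        return cli_token, "cli"
    env_token = os.environ.get("GITHUB_TOKEN")
    if env_token:
        return env_token, "env"
    if config and config.get("github", {}).get("token"):
        return config["github"]["token"], "config"
    if prompt is not None:  # e.g. getpass.getpass in an interactive session
        return prompt(), "prompt"
    return None, "missing"
```

Returning the source alongside the token makes the chain easy to log and test, which is why a fallback resolver of this shape is a common pattern for multi-profile credential handling.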
GitHub Integration (Modified for v2.7.0 - src/skill_seekers/cli/):
- github_fetcher.py - Integrated rate limit handler
  - Constructor now accepts interactive and profile_name parameters
  - fetch() - Added upfront rate limit check
  - All API calls check responses for rate limits
  - Raises RateLimitError when a rate limit cannot be handled
- github_scraper.py - Added CLI flags
  - --non-interactive flag for CI/CD mode (fail fast)
  - --profile flag to select GitHub profile from config
  - Config supports interactive and github_profile keys
RAG & Vector Database Adaptors (NEW: v3.0.0 - src/skill_seekers/cli/adaptors/):
- langchain.py - LangChain Documents export (~250 lines)
  - Exports to LangChain Document format
  - Preserves metadata (source, category, type, url)
  - Smart chunking with overlap
- llama_index.py - LlamaIndex TextNodes export (~280 lines)
  - Exports to TextNode format with unique IDs
  - Relationship mapping between documents
  - Metadata preservation
- haystack.py - Haystack Documents export (~230 lines)
  - Pipeline-ready document format
  - Supports embeddings and filters
- chroma.py - ChromaDB integration (~350 lines)
  - Direct collection creation
  - Batch upsert with embeddings
  - Query interface
- weaviate.py - Weaviate vector search (~320 lines)
  - Schema creation with auto-detection
  - Batch import with error handling
- faiss_helpers.py - FAISS index generation (~280 lines)
  - Index building with metadata
  - Search utilities
- qdrant.py - Qdrant vector database (~300 lines)
  - Collection management
  - Payload indexing
- streaming_adaptor.py - Streaming data ingest (~200 lines)
  - Real-time data processing
  - Incremental updates
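The "smart chunking with overlap" performed by the langchain.py adaptor can be illustrated with a minimal character-window chunker (the chunk size and overlap defaults here are illustrative, not the adaptor's actual values):

```python
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into windows where each chunk repeats the last
    `overlap` characters of the previous one, so content cut at a
    chunk boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap is what keeps retrieval robust: a sentence straddling a boundary is still fully contained in one of the two neighboring chunks.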
Cloud Storage & Infrastructure (NEW: v3.0.0 - src/skill_seekers/cli/):
- cloud_storage_cli.py - S3/GCS/Azure upload/download (~450 lines)
  - Multi-provider abstraction
  - Parallel uploads for large files
  - Retry logic with exponential backoff
- embedding_pipeline.py - Embedding generation for vectors (~320 lines)
  - Sentence-transformers integration
  - Batch processing
  - Multiple embedding models
- sync_cli.py - Continuous sync & monitoring (~380 lines)
  - File watching for changes
  - Automatic re-scraping
  - Smart diff detection
- incremental_updater.py - Smart incremental updates (~350 lines)
  - Change detection algorithms
  - Partial skill updates
  - Version tracking
- streaming_ingest.py - Real-time data streaming (~290 lines)
  - Stream processing pipelines
  - WebSocket support
- benchmark_cli.py - Performance benchmarking (~280 lines)
  - Scraping performance tests
  - Comparison reports
  - CI/CD integration
- quality_metrics.py - Quality analysis & reporting (~340 lines)
  - Completeness scoring
  - Link checking
  - Content quality metrics
- multilang_support.py - Internationalization support (~260 lines)
  - Language detection
  - Translation integration
  - Multi-locale skills
- setup_wizard.py - Interactive setup wizard (~220 lines)
  - Configuration management
  - Profile creation
  - First-time setup
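The change-detection idea behind incremental_updater.py can be sketched with content hashing (a generic approach for illustration; the module's actual algorithm may differ):

```python
import hashlib

def content_digest(text):
    """Stable fingerprint of one page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(old_index, new_pages):
    """Compare a saved {url: digest} index against freshly scraped pages
    and report which pages are new, changed, or unchanged."""
    changes = {"new": [], "changed": [], "unchanged": []}
    for url, text in new_pages.items():
        digest = content_digest(text)
        if url not in old_index:
            changes["new"].append(url)
        elif old_index[url] != digest:
            changes["changed"].append(url)
        else:
            changes["unchanged"].append(url)
    return changes
```

Only the "new" and "changed" buckets need re-processing, which is what makes incremental updates cheap relative to a full re-scrape.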
🎯 Project-Specific Best Practices
- Always use platform adaptors - Never hardcode platform-specific logic
- Test all platforms - Changes must work for all 16 platforms (was 4 in v2.x)
- Maintain backward compatibility - Legacy configs and v2.x workflows must still work
- Document API changes - Update CHANGELOG.md for every release
- Keep dependencies optional - Platform-specific deps are optional (RAG, cloud, etc.)
- Use src/ layout - Proper package structure with pip install -e .
- Run tests before commits - Per user instructions, never skip tests (1,852 tests must pass)
- RAG-first mindset - v3.0.0 is the universal preprocessor for AI systems
- Export format clarity - Use --format for RAG/vector DBs, --target for LLM platforms
- Test with real integrations - Verify exports work with actual LangChain, ChromaDB, etc.
🐛 Debugging Tips
Enable Verbose Logging
# Set environment variable for debug output
export SKILL_SEEKERS_DEBUG=1
skill-seekers scrape --config configs/react.json
Test Single Function/Module
Run Python modules directly for debugging:
# Run modules with --help to see options
python -m skill_seekers.cli.doc_scraper --help
python -m skill_seekers.cli.github_scraper --repo facebook/react --dry-run
python -m skill_seekers.cli.package_skill --help
# Test MCP server directly
python -m skill_seekers.mcp.server_fastmcp
Use pytest with Debugging
# Drop into debugger on failure
pytest tests/test_scraper_features.py --pdb
# Show print statements (normally suppressed)
pytest tests/test_scraper_features.py -s
# Verbose test output (shows full diff, more details)
pytest tests/test_scraper_features.py -vv
# Run only failed tests from last run
pytest tests/ --lf
# Run until first failure (stop immediately)
pytest tests/ -x
# Show local variables on failure
pytest tests/ -l
Debug Specific Test
# Run single test with full output
pytest tests/test_scraper_features.py::test_detect_language -vv -s
# With debugger
pytest tests/test_scraper_features.py::test_detect_language --pdb
Check Package Installation
# Verify package is installed
pip list | grep skill-seekers
# Check installation mode (should show editable location)
pip show skill-seekers
# Verify imports work
python -c "import skill_seekers; print(skill_seekers.__version__)"
# Check CLI entry points
which skill-seekers
skill-seekers --version
Common Error Messages & Solutions
"ModuleNotFoundError: No module named 'skill_seekers'"
→ Solution: pip install -e .
→ Why: src/ layout requires package installation
"403 Forbidden" from GitHub API
→ Solution: Rate limit hit, set GITHUB_TOKEN or use skill-seekers config --github
→ Check limit: curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
"SKILL.md enhancement failed"
→ Solution: Check if ANTHROPIC_API_KEY is set, or use --mode LOCAL
→ Monitor: skill-seekers enhance-status output/react/ --watch
"No such file or directory: 'configs/myconfig.json'" → Solution: Config path resolution order:
- Exact path as provided
./configs/(current directory)~/.config/skill-seekers/configs/(user config)- SkillSeekersWeb.com API (presets)
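The three local lookup steps can be sketched as follows (the web-API fallback is omitted and the function name is illustrative):

```python
from pathlib import Path

def resolve_config_path(name, cwd=".", home=None):
    """Return the first existing config file from the local lookup
    locations, or None (the CLI would then fall back to the web API)."""
    home = Path(home) if home else Path.home()
    candidates = [
        Path(name),                                             # 1. exact path as given
        Path(cwd) / "configs" / name,                           # 2. ./configs/
        home / ".config" / "skill-seekers" / "configs" / name,  # 3. user config dir
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Walking the candidates in order explains why a stale copy in ./configs/ can shadow a newer file in the user config directory.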
"pytest: command not found" → Solution: Install dev dependencies
pip install pytest pytest-asyncio pytest-cov coverage
# Or: pip install -e ".[dev]" (if available)
"ruff: command not found" → Solution: Install ruff
pip install ruff
# Or use uvx: uvx ruff check src/
Debugging Scraping Issues
No content extracted?
# Test selectors in Python
from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
print(soup.select_one('.documentation-content'))
Categories not working?
- Check that categories in the config have the correct keywords
- Run with --dry-run to see categorization without scraping
- Enable debug mode: export SKILL_SEEKERS_DEBUG=1
Profiling Performance
# Profile scraping performance
python -m cProfile -o profile.stats -m skill_seekers.cli.doc_scraper --config configs/react.json --max-pages 10
# Analyze profile
python -m pstats profile.stats
# In pstats shell:
# > sort cumtime
# > stats 20
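The pstats shell commands above can also be driven programmatically, which is convenient in CI; a stdlib-only sketch profiling a stand-in function:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Stand-in workload for profiling."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Equivalent of `sort cumtime` + `stats 5` in the pstats shell:
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Capturing the report into a string makes it easy to assert on hot functions in a regression test instead of eyeballing the interactive output.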
📖 Additional Documentation
Official Website:
- SkillSeekersWeb.com - Browse 24+ preset configs, share configs, complete documentation
For Users:
- README.md - Complete user documentation
- BULLETPROOF_QUICKSTART.md - Beginner guide
- TROUBLESHOOTING.md - Common issues
For Developers:
- CHANGELOG.md - Release history
- ROADMAP.md - 136 tasks across 10 categories
- docs/UNIFIED_SCRAPING.md - Multi-source scraping
- docs/MCP_SETUP.md - MCP server setup
- docs/ENHANCEMENT_MODES.md - AI enhancement modes
- docs/PATTERN_DETECTION.md - C3.1 pattern detection
- docs/THREE_STREAM_STATUS_REPORT.md - Three-stream architecture
- docs/MULTI_LLM_SUPPORT.md - Multi-platform support
🎓 Understanding the Codebase
Why src/ Layout?
Modern Python best practice (PEP 517/518):
- Prevents accidental imports from repo root
- Forces proper package installation
- Better isolation between package and tests
- Required: pip install -e . before running tests
Why Platform Adaptors?
Strategy pattern benefits:
- Single codebase supports all 16 platforms
- Platform-specific optimizations (format, APIs, models)
- Easy to add new platforms (implement BaseAdaptor)
- Clean separation of concerns
- Testable in isolation
Why Git-style CLI?
User experience benefits:
- Familiar to developers (like git)
- Single entry point: skill-seekers
- Easier to document and teach
Three-Stream GitHub Architecture
The unified_codebase_analyzer.py splits GitHub repositories into three independent streams:
Stream 1: Code Analysis (C3.x features)
- Deep AST parsing (9 languages)
- Design pattern detection (C3.1)
- Test example extraction (C3.2)
- How-to guide generation (C3.3)
- Configuration extraction (C3.4)
- Architectural overview (C3.5)
- API reference + dependency graphs
Stream 2: Documentation
- README, CONTRIBUTING, LICENSE
- docs/ directory markdown files
- Wiki pages (if available)
- CHANGELOG and version history
Stream 3: Community Insights
- GitHub metadata (stars, forks, watchers)
- Issue analysis (top problems and solutions)
- PR trends and contributor stats
- Release history
- Label-based topic detection
Key Benefits:
- Unified interface for GitHub URLs and local paths
- Analysis depth control: 'basic' (1-2 min) or 'c3x' (20-60 min)
- Enhanced router generation with GitHub context
- Smart keyword extraction weighted by GitHub labels (2x weight)
- 81 E2E tests passing (0.44 seconds)
🔧 Helper Scripts
The scripts/ directory contains utility scripts:
# Bootstrap skill generation - self-hosting skill-seekers as a Claude skill
./scripts/bootstrap_skill.sh
# Start MCP server for HTTP transport
./scripts/start_mcp_server.sh
# Script templates are in scripts/skill_header.md
Bootstrap Skill Workflow:
- Analyzes skill-seekers codebase itself (dogfooding)
- Combines handcrafted header with auto-generated analysis
- Validates SKILL.md structure
- Outputs ready-to-use skill for Claude Code
🔍 Performance Characteristics
| Operation | Time | Notes |
|---|---|---|
| Scraping (sync) | 15-45 min | First time, thread-based |
| Scraping (async) | 5-15 min | 2-3x faster with --async |
| Building | 1-3 min | Fast rebuild from cache |
| Re-building | <1 min | With --skip-scrape |
| Enhancement (LOCAL) | 30-60 sec | Uses Claude Code Max |
| Enhancement (API) | 20-40 sec | Requires API key |
| Packaging | 5-10 sec | Final .zip creation |
🎉 Recent Achievements
v3.0.0 (February 10, 2026) - "Universal Intelligence Platform":
- 🚀 16 Platform Adaptors - RAG frameworks (LangChain, LlamaIndex, Haystack), vector DBs (Chroma, FAISS, Weaviate, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), LLM platforms (Claude, Gemini, OpenAI)
- 🛠️ 26 MCP Tools (up from 18) - Complete automation for any AI system
- ✅ 1,852 Tests Passing (up from 700+) - Production-grade reliability
- ☁️ Cloud Storage - S3, GCS, Azure Blob Storage integration
- 🎯 AI Coding Assistants - Persistent context for Cursor, Windsurf, Cline, Continue.dev
- 📊 Quality Metrics - Automated completeness scoring and content analysis
- 🌐 Multilingual Support - Language detection and translation
- 🔄 Streaming Ingest - Real-time data processing pipelines
- 📈 Benchmarking Tools - Performance comparison and CI/CD integration
- 🔧 Setup Wizard - Interactive first-time configuration
- 📦 12 Example Projects - Complete working examples for every integration
- 📚 18 Integration Guides - Comprehensive documentation for all platforms
v2.9.0 (February 3, 2026):
- C3.10: Signal Flow Analysis - Complete signal flow analysis for Godot projects
- Comprehensive Godot 4.x support (GDScript, .tscn, .tres, .gdshader files)
- GDScript test extraction (GUT, gdUnit4, WAT frameworks)
- Signal pattern detection (EventBus, Observer, Event Chains)
- Signal-based how-to guides generation
v2.8.0 (February 1, 2026):
- C3.9: Project Documentation Extraction
- Granular AI enhancement control with `--enhance-level` (0-3)
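The API/LOCAL auto-detection behind `--enhance-level` can be sketched as a one-line rule. This is a sketch of the documented behavior (API mode when `ANTHROPIC_API_KEY` is set, LOCAL mode via Claude Code otherwise), not the project's actual implementation:

```shell
# Mode auto-detection for --enhance-level, as documented:
# API mode when ANTHROPIC_API_KEY is set, LOCAL mode (Claude Code) otherwise.
detect_enhance_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "API"
  else
    echo "LOCAL"
  fi
}
```

With the variable unset this prints `LOCAL`; exporting a key switches it to `API`.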
v2.7.1 (January 18, 2026 - Hotfix):
- 🚨 Critical Bug Fix: Config download 404 errors resolved
- Fixed manual URL construction bug - now uses `download_url` from API response
- All 15 source tools tests + 8 fetch_config tests passing
v2.7.0 (January 18, 2026):
- 🔐 Smart Rate Limit Management - Multi-token GitHub configuration system
- 🧙 Interactive Configuration Wizard - Beautiful terminal UI (`skill-seekers config`)
- 🚦 Intelligent Rate Limit Handler - Four strategies (prompt/wait/switch/fail)
- 📥 Resume Capability - Continue interrupted jobs with progress tracking
- 🔧 CI/CD Support - Non-interactive mode for automation
- 🎯 Bootstrap Skill - Self-hosting skill-seekers as Claude Code skill
v2.6.0 (January 14, 2026):
- C3.x Codebase Analysis Suite Complete (C3.1-C3.8)
- Multi-platform support with platform adaptor architecture (4 platforms)
- 18 MCP tools fully functional
- 700+ tests passing
- Unified multi-source scraping maturity
C3.x Series (Complete - Code Analysis Features):
- C3.1: Design pattern detection (10 GoF patterns, 9 languages, 87% precision)
- C3.2: Test example extraction (5 categories, AST-based for Python)
- C3.3: How-to guide generation with AI enhancement (5 improvements)
- C3.4: Configuration pattern extraction (env vars, config files, CLI args)
- C3.5: Architectural overview & router skill generation
- C3.6: AI enhancement for patterns and test examples (Claude API integration)
- C3.7: Architectural pattern detection (8 patterns, framework-aware)
- C3.8: Standalone codebase scraper (300+ line SKILL.md from code alone)
- C3.9: Project documentation extraction (markdown categorization, AI enhancement)
- C3.10: Signal flow analysis (Godot event-driven architecture, pattern detection)
v2.5.2:
- UX Improvement: Analysis features now default ON with --skip-* flags (BREAKING)
- Router quality improvements: 6.5/10 → 8.5/10 (+31%)
- All 107 codebase analysis tests passing
v2.5.0:
- Multi-platform support (Claude, Gemini, OpenAI, Markdown)
- Platform adaptor architecture
- 18 MCP tools (up from 9)
- Complete feature parity across platforms
v2.1.0:
- Unified multi-source scraping (docs + GitHub + PDF)
- Conflict detection between sources
- 427 tests passing
v1.0.0:
- Production release with MCP integration
- Documentation scraping with smart categorization
- 12 preset configurations