skill-seekers-reference/CLAUDE.md
yusyus 2e30970dfb feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
2026-03-15 02:34:41 +03:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

🎯 Project Overview

Skill Seekers is the universal documentation preprocessor for AI systems. It transforms documentation websites, GitHub repositories, PDFs, and EPUBs into production-ready formats for 16+ platforms: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).

Current Version: v3.1.3
Python Version: 3.10+ required
Status: Production-ready, published on PyPI
Website: https://skillseekersweb.com/ - Browse configs, share, and access documentation

👋 First Time Here?

Complete this 3-minute setup to start contributing:

# 1. Install package in editable mode (REQUIRED for development)
pip install -e .

# 2. Verify installation
python -c "import skill_seekers; print(skill_seekers.__version__)"  # Should print the installed version (e.g. 3.1.3)

# 3. Run a quick test
pytest tests/test_scraper_features.py::test_detect_language -v

# 4. You're ready! Pick a task from the roadmap:
# https://github.com/users/yusufkaraaslan/projects/2

Quick Command Reference (Most Used)

First time setup:

pip install -e .  # REQUIRED before running tests or CLI

Running tests (NEVER skip - user requirement):

pytest tests/ -v  # All tests
pytest tests/test_scraper_features.py -v  # Single file
pytest tests/ --cov=src/skill_seekers --cov-report=html  # With coverage

Code quality checks (matches CI):

ruff check src/ tests/  # Lint
ruff format src/ tests/  # Format
mypy src/skill_seekers  # Type check

Common workflows:

# NEW unified create command (auto-detects source type)
skill-seekers create https://docs.react.dev/ -p quick
skill-seekers create facebook/react -p standard
skill-seekers create ./my-project -p comprehensive
skill-seekers create tutorial.pdf

# Legacy commands (still supported)
skill-seekers scrape --config configs/react.json
skill-seekers github --repo facebook/react
skill-seekers analyze --directory . --comprehensive

# Package for LLM platforms
skill-seekers package output/react/ --target claude
skill-seekers package output/react/ --target gemini

RAG Pipeline workflows:

# LangChain Documents
skill-seekers package output/react/ --format langchain

# LlamaIndex TextNodes
skill-seekers package output/react/ --format llama-index

# Haystack Documents
skill-seekers package output/react/ --format haystack

# ChromaDB direct upload
skill-seekers package output/react/ --format chroma --upload

# FAISS export
skill-seekers package output/react/ --format faiss

# Weaviate/Qdrant upload (requires API keys)
skill-seekers package output/react/ --format weaviate --upload
skill-seekers package output/react/ --format qdrant --upload

AI Coding Assistant workflows:

# Cursor IDE
skill-seekers package output/react/ --target claude
cp output/react-claude/SKILL.md .cursorrules

# Windsurf
cp output/react-claude/SKILL.md .windsurf/rules/react.md

# Cline (VS Code)
cp output/react-claude/SKILL.md .clinerules

# Continue.dev (universal IDE)
python examples/continue-dev-universal/context_server.py
# Configure in ~/.continue/config.json

Cloud Storage:

# Upload to S3
skill-seekers cloud upload --provider s3 --bucket my-skills output/react.zip

# Upload to GCS
skill-seekers cloud upload --provider gcs --bucket my-skills output/react.zip

# Upload to Azure
skill-seekers cloud upload --provider azure --container my-skills output/react.zip

🏗️ Architecture

Core Design Pattern: Platform Adaptors

The codebase uses the Strategy Pattern with a factory method to support 16 targets: 15 platforms across 4 categories, plus a generic Markdown adaptor:

src/skill_seekers/cli/adaptors/
├── __init__.py          # Factory: get_adaptor(target/format)
├── base.py              # Abstract base class
# LLM Platforms (3)
├── claude.py            # Claude AI (ZIP + YAML)
├── gemini.py            # Google Gemini (tar.gz)
├── openai.py            # OpenAI ChatGPT (ZIP + Vector Store)
# RAG Frameworks (3)
├── langchain.py         # LangChain Documents
├── llama_index.py       # LlamaIndex TextNodes
├── haystack.py          # Haystack Documents
# Vector Databases (5)
├── chroma.py            # ChromaDB
├── faiss_helpers.py     # FAISS
├── qdrant.py            # Qdrant
├── weaviate.py          # Weaviate
# AI Coding Assistants (4 - via Claude format + config files)
# - Cursor, Windsurf, Cline, Continue.dev
# Generic (1)
├── markdown.py          # Generic Markdown (ZIP)
└── streaming_adaptor.py # Streaming data ingest

Key Methods:

  • package(skill_dir, output_path) - Platform-specific packaging
  • upload(package_path, api_key) - Platform-specific upload (where applicable)
  • enhance(skill_dir, mode) - AI enhancement with platform-specific models
  • export(skill_dir, format) - Export to RAG/vector DB formats
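
The Strategy-plus-factory shape described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class names (SkillAdaptor, ZipAdaptor, TarGzAdaptor), the registry, and the decision to only compute the output path instead of writing an archive are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SkillAdaptor(ABC):
    """Hypothetical base class; the real base adaptor may look different."""

    # Archive suffix that package() would produce; overridden per platform.
    suffix = ".zip"

    def __init__(self, platform: str) -> None:
        self.platform = platform

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> Path:
        """Return the path of the package this platform would produce."""


class ZipAdaptor(SkillAdaptor):
    def package(self, skill_dir: str, output_path: str) -> Path:
        # Only computes the target path; no archive is written in this sketch.
        name = Path(skill_dir).name
        return Path(output_path) / f"{name}-{self.platform}{self.suffix}"


class TarGzAdaptor(ZipAdaptor):
    suffix = ".tar.gz"


# Registry mapping a target name to the strategy class that handles it.
_REGISTRY: dict[str, type[SkillAdaptor]] = {
    "claude": ZipAdaptor,
    "gemini": TarGzAdaptor,
    "markdown": ZipAdaptor,
}


def get_adaptor(target: str) -> SkillAdaptor:
    """Factory: look up the strategy for a target, failing loudly if unknown."""
    adaptor_cls = _REGISTRY.get(target)
    if adaptor_cls is None:
        raise ValueError(f"Unknown target: {target!r}")
    return adaptor_cls(target)
```

The point of the pattern is that callers only ever touch get_adaptor() and the shared method names, so adding a platform means adding one class and one registry entry.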

Data Flow (5 Phases)

  1. Scrape Phase (doc_scraper.py:scrape_all())

    • BFS traversal from base_url
    • Output: output/{name}_data/pages/*.json
  2. Build Phase (doc_scraper.py:build_skill())

    • Load pages → Categorize → Extract patterns
    • Output: output/{name}/SKILL.md + references/*.md
  3. Enhancement Phase (optional, enhance_skill_local.py)

    • LLM analyzes references → Rewrites SKILL.md
    • Platform-specific models (Sonnet 4, Gemini 2.0, GPT-4o)
  4. Package Phase (package_skill.py → adaptor)

    • Platform adaptor packages in appropriate format
    • Output: .zip or .tar.gz
  5. Upload Phase (optional, upload_skill.py → adaptor)

    • Upload via platform API
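
As an illustration of the BFS traversal in the scrape phase, here is a minimal sketch. It is not the code in doc_scraper.py: the function name bfs_crawl, the get_links callback (standing in for fetching and parsing a page), and the prefix check against base_url are assumptions chosen to keep the example offline and self-contained.

```python
from collections import deque
from urllib.parse import urljoin


def bfs_crawl(base_url, get_links, max_pages=500):
    """Visit pages breadth-first from base_url, staying under its prefix.

    get_links(url) is a hypothetical callback returning the hrefs on a page.
    """
    seen = {base_url}
    queue = deque([base_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop fragments before deduplicating.
            absolute = urljoin(url, href).split("#", 1)[0]
            if absolute.startswith(base_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order
```

Because the queue is first-in, first-out, pages closest to base_url are visited first, which is what makes a max_pages cap cut off the deepest pages rather than arbitrary ones.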

File Structure (src/ layout) - Key Files Only

src/skill_seekers/
├── cli/                              # All CLI commands
│   ├── main.py                       # ⭐ Git-style CLI dispatcher
│   ├── doc_scraper.py                # ⭐ Main scraper (~790 lines)
│   │   ├── scrape_all()              # BFS traversal engine
│   │   ├── smart_categorize()        # Category detection
│   │   └── build_skill()             # SKILL.md generation
│   ├── github_scraper.py             # GitHub repo analysis
│   ├── codebase_scraper.py           # ⭐ Local analysis (C2.x+C3.x)
│   ├── package_skill.py              # Platform packaging
│   ├── unified_scraper.py            # Multi-source scraping
│   ├── unified_codebase_analyzer.py  # Three-stream GitHub+local analyzer
│   ├── enhance_skill_local.py        # AI enhancement (LOCAL mode)
│   ├── enhance_status.py             # Enhancement status monitoring
│   ├── upload_skill.py               # Upload to platforms
│   ├── install_skill.py              # Complete workflow automation
│   ├── install_agent.py              # Install to AI agent directories
│   ├── pattern_recognizer.py         # C3.1 Design pattern detection
│   ├── test_example_extractor.py     # C3.2 Test example extraction
│   ├── how_to_guide_builder.py       # C3.3 How-to guide generation
│   ├── config_extractor.py           # C3.4 Configuration extraction
│   ├── generate_router.py            # C3.5 Router skill generation
│   ├── code_analyzer.py              # Multi-language code analysis
│   ├── api_reference_builder.py      # API documentation builder
│   ├── dependency_analyzer.py        # Dependency graph analysis
│   ├── signal_flow_analyzer.py       # C3.10 Signal flow analysis (Godot)
│   ├── pdf_scraper.py                # PDF extraction
│   ├── epub_scraper.py               # EPUB extraction
│   └── adaptors/                     # ⭐ Platform adaptor pattern
│       ├── __init__.py               # Factory: get_adaptor()
│       ├── base_adaptor.py           # Abstract base
│       ├── claude_adaptor.py         # Claude AI
│       ├── gemini_adaptor.py         # Google Gemini
│       ├── openai_adaptor.py         # OpenAI ChatGPT
│       ├── markdown_adaptor.py       # Generic Markdown
│       ├── langchain.py              # LangChain RAG
│       ├── llama_index.py            # LlamaIndex RAG
│       ├── haystack.py               # Haystack RAG
│       ├── chroma.py                 # ChromaDB
│       ├── faiss_helpers.py          # FAISS
│       ├── qdrant.py                 # Qdrant
│       ├── weaviate.py               # Weaviate
│       └── streaming_adaptor.py      # Streaming data ingest
└── mcp/                              # MCP server (26 tools)
    ├── server_fastmcp.py             # FastMCP server
    └── tools/                        # Tool implementations

Most Modified Files (when contributing):

  • Platform adaptors: src/skill_seekers/cli/adaptors/{platform}.py
  • Tests: tests/test_{feature}.py
  • Configs: configs/{framework}.json

🛠️ Development Commands

Setup

# Install in editable mode (required before tests due to src/ layout)
pip install -e .

# Install with all platform dependencies
pip install -e ".[all-llms]"

# Install specific platforms
pip install -e ".[gemini]"   # Google Gemini
pip install -e ".[openai]"   # OpenAI ChatGPT

Running Tests

CRITICAL: Never skip tests - User requires all tests to pass before commits.

# All tests (must run pip install -e . first!)
pytest tests/ -v

# Specific test file
pytest tests/test_scraper_features.py -v

# Multi-platform tests
pytest tests/test_install_multiplatform.py -v

# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html

# Single test
pytest tests/test_scraper_features.py::test_detect_language -v

# MCP server tests
pytest tests/test_mcp_fastmcp.py -v

Test Architecture:

  • 46 test files covering all features
  • CI Matrix: Ubuntu + macOS, Python 3.10-3.13
  • 2,540 tests passing (current), up from 700+ in v2.x
  • Must run pip install -e . before tests (src/ layout requirement)
  • Tests include create command integration tests, CLI refactor E2E tests

Building & Publishing

# Build package (using uv - recommended)
uv build

# Or using build
python -m build

# Publish to PyPI
uv publish

# Or using twine
python -m twine upload dist/*

Testing CLI Commands

# Test configuration wizard (NEW: v2.7.0)
skill-seekers config --show                          # Show current configuration
skill-seekers config --github                        # GitHub token setup
skill-seekers config --test                          # Test connections

# Test resume functionality (NEW: v2.7.0)
skill-seekers resume --list                          # List resumable jobs
skill-seekers resume --clean                         # Clean up old jobs

# Test GitHub scraping with profiles (NEW: v2.7.0)
skill-seekers github --repo facebook/react --profile personal    # Use specific profile
skill-seekers github --repo owner/repo --non-interactive         # CI/CD mode

# Test scraping (dry run)
skill-seekers scrape --config configs/react.json --dry-run

# Test codebase analysis (C2.x features)
skill-seekers analyze --directory . --output output/codebase/

# Test pattern detection (C3.1)
skill-seekers patterns --file src/skill_seekers/cli/code_analyzer.py

# Test how-to guide generation (C3.3)
skill-seekers how-to-guides output/test_examples.json --output output/guides/

# Test enhancement status monitoring
skill-seekers enhance-status output/react/ --watch

# Video setup (auto-detect GPU and install deps)
skill-seekers video --setup

# Test multi-platform packaging
skill-seekers package output/react/ --target gemini --dry-run

# Test MCP server (stdio mode)
python -m skill_seekers.mcp.server_fastmcp

# Test MCP server (HTTP mode)
python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765

New v3.0.0 CLI Commands

# Setup wizard (interactive configuration)
skill-seekers-setup

# Cloud storage operations
skill-seekers cloud upload --provider s3 --bucket my-bucket output/react.zip
skill-seekers cloud download --provider gcs --bucket my-bucket react.zip
skill-seekers cloud list --provider azure --container my-container

# Embedding server (for RAG pipelines)
skill-seekers embed --port 8080 --model sentence-transformers

# Sync & incremental updates
skill-seekers sync --source https://docs.react.dev/ --target output/react/
skill-seekers update --skill output/react/ --check-changes

# Quality metrics & benchmarking
skill-seekers quality --skill output/react/ --report
skill-seekers benchmark --config configs/react.json --compare-versions

# Multilingual support
skill-seekers multilang --detect output/react/
skill-seekers multilang --translate output/react/ --target zh-CN

# Streaming data ingest
skill-seekers stream --source docs/ --target output/streaming/

🔧 Key Implementation Details

CLI Architecture (Git-style)

Entry point: src/skill_seekers/cli/main.py

The unified CLI modifies sys.argv and calls existing main() functions to maintain backward compatibility:

# Example: skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv

Subcommands: create, scrape, github, pdf, epub, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides
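
The sys.argv rewriting described above can be sketched like this. It is a simplified illustration, not the contents of main.py: legacy_scrape_main stands in for an existing tool's main(), and the COMMANDS table and dispatch() function are hypothetical names.

```python
import argparse
import sys


def legacy_scrape_main() -> int:
    """Stand-in for an existing tool's main(); it parses sys.argv itself."""
    parser = argparse.ArgumentParser(prog="skill-seekers scrape")
    parser.add_argument("--config", required=True)
    args = parser.parse_args()  # reads sys.argv[1:]
    print(f"scraping with {args.config}")
    return 0


# Hypothetical table mapping subcommand names to the legacy entry points.
COMMANDS = {"scrape": legacy_scrape_main}


def dispatch(argv):
    """Strip the subcommand from argv, then call the legacy main() unchanged."""
    if len(argv) < 2 or argv[1] not in COMMANDS:
        print(f"usage: {argv[0]} <{'|'.join(COMMANDS)}> [options]")
        return 2
    command, rest = argv[1], argv[2:]
    saved = sys.argv
    sys.argv = [f"{argv[0]} {command}", *rest]
    try:
        return COMMANDS[command]()
    finally:
        sys.argv = saved  # restore so callers and tests are not affected
```

The benefit of this approach is that each tool keeps its own argument parser and its standalone entry point keeps working, at the cost of mutating global state during the call.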

NEW: Unified create Command

The recommended way to create skills - Auto-detects source type and provides progressive help disclosure:

# Auto-detection examples
skill-seekers create https://docs.react.dev/         # → Web scraping
skill-seekers create facebook/react                  # → GitHub analysis
skill-seekers create ./my-project                    # → Local codebase
skill-seekers create tutorial.pdf                    # → PDF extraction
skill-seekers create book.epub                       # → EPUB extraction
skill-seekers create configs/react.json              # → Multi-source

# Progressive help system
skill-seekers create --help           # Shows universal args only (13 flags)
skill-seekers create --help-web       # Shows web-specific options
skill-seekers create --help-github    # Shows GitHub-specific options
skill-seekers create --help-local     # Shows local analysis options
skill-seekers create --help-pdf       # Shows PDF extraction options
skill-seekers create --help-epub      # Shows EPUB extraction options
skill-seekers create --help-advanced  # Shows advanced/rare options
skill-seekers create --help-all       # Shows all 120+ flags

# Universal flags work for ALL sources
skill-seekers create <source> -p quick                    # Preset (-p shortcut)
skill-seekers create <source> --enhance-level 2           # AI enhancement (0-3)
skill-seekers create <source> --chunk-for-rag             # RAG chunking
skill-seekers create <source> --dry-run                   # Preview

Key improvements:

  • Single command replaces scrape/github/analyze for most use cases
  • Smart detection - No need to specify source type
  • Progressive disclosure - Default help shows 13 flags, detailed help available
  • -p shortcut - Quick preset selection (-p quick|standard|comprehensive)
  • Universal features - RAG chunking, dry-run, presets work everywhere
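
One way the source auto-detection could work is sketched below. The rules and their order are assumptions inferred from the examples above (URL, owner/repo, local path, .pdf, .epub, .json); the function name detect_source_type and the returned labels are hypothetical, and the real create command may apply different checks.

```python
import re
from pathlib import Path

# owner/repo shorthand: exactly one slash, no scheme, no leading dot path.
_REPO_RE = re.compile(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")


def detect_source_type(source: str) -> str:
    """Classify a create-style source argument (illustrative rules only)."""
    if source.startswith(("http://", "https://")):
        return "web"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".epub":
        return "epub"
    if suffix == ".json":
        return "config"
    # Explicit path prefixes or an existing directory mean a local codebase.
    if source.startswith(("./", "../", "/", "~")) or Path(source).is_dir():
        return "local"
    if _REPO_RE.fullmatch(source):
        return "github"
    raise ValueError(f"Cannot determine source type for {source!r}")
```

Checking file suffixes before the owner/repo pattern matters in this sketch, because a path such as configs/react.json would otherwise look like a repository name.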

Recent Additions:

  • create - NEW: Unified command with auto-detection and progressive help
  • codebase - Local codebase analysis without GitHub API (C2.x + C3.x features)
  • enhance-status - Monitor background/daemon enhancement processes
  • patterns - Detect design patterns in code (C3.1)
  • how-to-guides - Generate educational guides from tests (C3.3)

Platform Adaptor Usage

import os

from skill_seekers.cli.adaptors import get_adaptor

# Get platform-specific adaptor
adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'markdown'

# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')

# Upload to platform
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY')
)

# AI enhancement
adaptor.enhance(skill_dir='output/react/', mode='api')

C3.x Codebase Analysis Features

The project has comprehensive codebase analysis capabilities (C3.1-C3.10):

C3.1 Design Pattern Detection (pattern_recognizer.py):

  • Detects 10 common patterns: Singleton, Factory, Observer, Strategy, Decorator, Builder, Adapter, Command, Template Method, Chain of Responsibility
  • Supports 9 languages: Python, JavaScript, TypeScript, C++, C, C#, Go, Rust, Java
  • Three detection levels: surface (fast), deep (balanced), full (thorough)
  • 87% precision, 80% recall on real-world projects

C3.2 Test Example Extraction (test_example_extractor.py):

  • Extracts real usage examples from test files
  • Categories: instantiation, method_call, config, setup, workflow
  • AST-based for Python, regex-based for 8 other languages
  • Quality filtering with confidence scoring

C3.3 How-To Guide Generation (how_to_guide_builder.py):

  • Transforms test workflows into educational guides
  • 5 AI enhancements: step descriptions, troubleshooting, prerequisites, next steps, use cases
  • Dual-mode AI: API (fast) or LOCAL (free with Claude Code Max)
  • 4 grouping strategies: AI tutorial group, file path, test name, complexity

C3.4 Configuration Pattern Extraction (config_extractor.py):

  • Extracts configuration patterns from codebases
  • Identifies config files, env vars, CLI arguments
  • AI enhancement for better organization

C3.5 Architectural Overview (generate_router.py):

  • Generates comprehensive ARCHITECTURE.md files
  • Router skill generation for large documentation
  • Quality improvements: 6.5/10 → 8.5/10 (+31%)
  • Integrates GitHub metadata, issues, labels

C3.6 AI Enhancement (Claude API integration):

  • Enhances C3.1-C3.5 with AI-powered insights
  • Pattern explanations and improvement suggestions
  • Test example context and best practices
  • Guide enhancement with troubleshooting and prerequisites

C3.7 Architectural Pattern Detection (architectural_pattern_detector.py):

  • Detects 8 architectural patterns (MVC, MVVM, MVP, Repository, etc.)
  • Framework detection (Django, Flask, Spring, React, Angular, etc.)
  • Multi-file analysis with directory structure patterns
  • Evidence-based detection with confidence scoring

C3.8 Standalone Codebase Scraper (codebase_scraper.py):

# Quick analysis (1-2 min, basic features only)
skill-seekers analyze --directory /path/to/repo --quick

# Comprehensive analysis (20-60 min, all features + AI)
skill-seekers analyze --directory . --comprehensive

# With AI enhancement (auto-detects API or LOCAL)
skill-seekers analyze --directory . --enhance

# Granular AI enhancement control (NEW)
skill-seekers analyze --directory . --enhance-level 1  # SKILL.md only
skill-seekers analyze --directory . --enhance-level 2  # + Architecture + Config + Docs
skill-seekers analyze --directory . --enhance-level 3  # Full enhancement (all features)

# Disable specific features
skill-seekers analyze --directory . --skip-patterns --skip-how-to-guides

  • Generates 300+ line standalone SKILL.md files from codebases
  • All C3.x features integrated (patterns, tests, guides, config, architecture, docs)
  • Complete codebase analysis without documentation scraping
  • NEW: Granular AI enhancement control with --enhance-level (0-3)

C3.9 Project Documentation Extraction (codebase_scraper.py):

  • Extracts and categorizes all markdown files from the project
  • Auto-detects categories: overview, architecture, guides, workflows, features, etc.
  • Integrates documentation into SKILL.md with summaries
  • AI enhancement (level 2+) adds topic extraction and cross-references
  • Controlled by depth: surface=raw copy, deep=parse+summarize, full=AI-enhanced
  • Default ON, use --skip-docs to disable

C3.10 Signal Flow Analysis for Godot Projects (signal_flow_analyzer.py):

  • Complete signal flow analysis system for event-driven Godot architectures
  • Signal declaration extraction (detects signal keyword declarations)
  • Connection mapping (tracks .connect() calls with targets and methods)
  • Emission tracking (finds .emit() and emit_signal() calls)
  • Real-world metrics: 208 signals, 634 connections, 298 emissions in test project
  • Signal density metrics (signals per file)
  • Event chain detection (signals triggering other signals)
  • Signal pattern detection:
    • EventBus Pattern (0.90 confidence): Centralized signal hub in autoload
    • Observer Pattern (0.85 confidence): Multi-observer signals (3+ listeners)
    • Event Chains (0.80 confidence): Cascading signal propagation
  • Signal-based how-to guides (C3.10.1):
    • AI-generated step-by-step usage guides (Connect → Emit → Handle)
    • Real code examples from project
    • Common usage locations
    • Parameter documentation
  • Outputs: signal_flow.json, signal_flow.mmd (Mermaid diagram), signal_reference.md, signal_how_to_guides.md
  • Comprehensive Godot 4.x support:
    • GDScript (.gd), Scene files (.tscn), Resources (.tres), Shaders (.gdshader)
    • GDScript test extraction (GUT, gdUnit4, WAT frameworks)
    • 396 test cases extracted in test project
    • Framework detection (Unity, Unreal, Godot)
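
To make the declaration, connection, and emission steps concrete, here is a rough regex-based sketch. It is not the logic in signal_flow_analyzer.py: the patterns, the function name analyze_signals, and the focus on Godot 4 style syntax (signal.connect(...), signal.emit(...), plus emit_signal("name")) are assumptions for illustration.

```python
import re

# `signal name` or `signal name(args)` at the start of a line.
SIGNAL_DECL = re.compile(r"^\s*signal\s+(\w+)", re.MULTILINE)
# Godot 4 style: some_signal.connect(handler)
CONNECT = re.compile(r"(\w+)\.connect\(")
# Godot 4 style some_signal.emit(...) or string-based emit_signal("name", ...)
EMIT = re.compile(r"(\w+)\.emit\(|emit_signal\(\s*[\"'](\w+)[\"']")


def analyze_signals(source: str) -> dict:
    """Collect declared, connected, and emitted signal names from GDScript text."""
    emissions = [attr or name for attr, name in EMIT.findall(source)]
    return {
        "declared": SIGNAL_DECL.findall(source),
        "connected": CONNECT.findall(source),
        "emitted": emissions,
    }


GDSCRIPT = """
signal health_changed(new_value)
signal died

func _ready():
    health_changed.connect(_on_health_changed)

func take_damage(amount):
    health_changed.emit(health - amount)
    emit_signal("died")
"""

report = analyze_signals(GDSCRIPT)
print(report)
```

Counting the entries in each list per file gives the kind of density metric mentioned above; a production analyzer would also need to resolve which object each signal belongs to, which plain regexes cannot do.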

Key Architecture Decision (BREAKING in v2.5.2):

  • Changed from opt-in (--build-*) to opt-out (--skip-*) flags
  • All analysis features now ON by default for maximum value
  • Backward compatibility warnings for deprecated flags

Smart Categorization Algorithm

Located in doc_scraper.py:smart_categorize():

  • Scores pages against category keywords
  • 3 points for URL match, 2 for title, 1 for content
  • Threshold of 2+ for categorization
  • Auto-infers categories from URL segments if none provided
  • Falls back to "other" category
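
The scoring rules above can be sketched as follows. This is an illustration rather than the code in smart_categorize(): substring matching, summing points per keyword, and picking the highest-scoring category are assumptions, as are the function names score_page and categorize.

```python
def score_page(page: dict, keywords: list[str]) -> int:
    """3 points per keyword found in the URL, 2 in the title, 1 in the content."""
    url = page["url"].lower()
    title = page["title"].lower()
    content = page["content"].lower()
    score = 0
    for keyword in keywords:
        keyword = keyword.lower()
        if keyword in url:
            score += 3
        if keyword in title:
            score += 2
        if keyword in content:
            score += 1
    return score


def categorize(page: dict, categories: dict[str, list[str]], threshold: int = 2) -> str:
    """Return the best-scoring category, or "other" if none reaches the threshold."""
    best_name, best_score = "other", 0
    for name, keywords in categories.items():
        score = score_page(page, keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else "other"
```

With a threshold of 2, a keyword that appears only in the body text (1 point) is not enough on its own, while a single title or URL match is.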

Language Detection

Located in doc_scraper.py:detect_language():

  1. CSS class attributes (language-*, lang-*)
  2. Heuristics (keywords like def, const, func)
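
A minimal sketch of this two-step detection is shown below. It is not the body of detect_language(): the exact regexes, the three example languages, and the "unknown" fallback are assumptions chosen for illustration.

```python
import re

# Step 1: class names such as "language-python" or "lang-js".
_CLASS_RE = re.compile(r"^(?:language|lang)-(\S+)$")

# Step 2: keyword heuristics, checked in order (illustrative, far from complete).
_HEURISTICS = [
    ("python", re.compile(r"^\s*def \w+\(", re.MULTILINE)),
    ("javascript", re.compile(r"^\s*(?:const|let) \w+\s*=", re.MULTILINE)),
    ("go", re.compile(r"^\s*func \w+\(", re.MULTILINE)),
]


def detect_language(css_classes: list[str], code: str) -> str:
    """Prefer an explicit CSS class; fall back to keyword heuristics."""
    for css_class in css_classes:
        match = _CLASS_RE.match(css_class)
        if match:
            return match.group(1).lower()
    for language, pattern in _HEURISTICS:
        if pattern.search(code):
            return language
    return "unknown"
```

The CSS class is checked first because it is an explicit statement by the documentation author, whereas keyword heuristics can misfire on short snippets.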

Configuration File Structure

Configs (configs/*.json) define scraping behavior:

{
  "name": "framework-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs"],
    "exclude": ["/blog"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
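
As a sketch of how base_url and url_patterns might be applied during a crawl, consider the filter below. The matching semantics are an assumption (substring match against the URL path, with exclude taking priority over include), and should_scrape is a hypothetical helper, not a function from the project.

```python
from urllib.parse import urlsplit


def should_scrape(url: str, config: dict) -> bool:
    """Decide whether a URL is in scope for a config like the one above."""
    if not url.startswith(config["base_url"]):
        return False
    # Match against the path only, so the host name cannot satisfy a pattern
    # (for example "/docs" would otherwise match "https://docs.example.com").
    path = urlsplit(url).path
    patterns = config.get("url_patterns", {})
    if any(pattern in path for pattern in patterns.get("exclude", [])):
        return False
    include = patterns.get("include", [])
    return not include or any(pattern in path for pattern in include)


config = {
    "base_url": "https://docs.example.com/",
    "url_patterns": {"include": ["/docs"], "exclude": ["/blog"]},
}
```

An empty include list is treated here as "allow everything under base_url", which keeps minimal configs usable.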

🧪 Testing Guidelines

Test Coverage Requirements

  • Core features: 100% coverage required
  • Platform adaptors: Each platform has dedicated tests
  • MCP tools: All 18 tools must be tested
  • Integration tests: End-to-end workflows

Test Markers (from [tool.pytest.ini_options] in pyproject.toml)

The project uses pytest markers to categorize tests:

# Run only fast unit tests (default)
pytest tests/ -v

# Run only slow tests (>5 seconds)
pytest tests/ -v -m slow

# Run only integration tests (require external services)
pytest tests/ -v -m integration

# Run only end-to-end tests (resource-intensive, create files)
pytest tests/ -v -m e2e

# Run tests requiring virtual environment setup
pytest tests/ -v -m venv

# Run bootstrap feature tests
pytest tests/ -v -m bootstrap

# Skip slow and integration tests (fastest)
pytest tests/ -v -m "not slow and not integration"

Test Execution Strategy

By default, only fast tests run. Use markers to control test execution:

# Default: Only fast tests (skip slow/integration/e2e)
pytest tests/ -v

# Run only slow tests (>5 seconds)
pytest tests/ -v -m slow

# Run only integration tests (require external services)
pytest tests/ -v -m integration

# Run only resource-intensive e2e tests (create files)
pytest tests/ -v -m e2e

# Run ONLY fast tests (explicit)
pytest tests/ -v -m "not slow and not integration and not e2e"

# Run everything (CI does this)
pytest tests/ -v -m ""

When to use which:

  • Local development: Default (fast tests only) - pytest tests/ -v
  • Pre-commit: Fast tests - pytest tests/ -v
  • Before PR: Include slow + integration - pytest tests/ -v -m "not e2e"
  • CI validation: All tests run automatically

Key Test Files

  • test_scraper_features.py - Core scraping functionality
  • test_mcp_server.py - MCP integration (18 tools)
  • test_mcp_fastmcp.py - FastMCP framework
  • test_unified.py - Multi-source scraping
  • test_github_scraper.py - GitHub analysis
  • test_pdf_scraper.py - PDF extraction
  • test_epub_scraper.py - EPUB extraction
  • test_install_multiplatform.py - Multi-platform packaging
  • test_integration.py - End-to-end workflows
  • test_install_skill.py - One-command install
  • test_install_agent.py - AI agent installation
  • conftest.py - Test configuration (checks package installation)

🌐 Environment Variables

# Claude AI / Compatible APIs
# Option 1: Official Anthropic API (default)
export ANTHROPIC_API_KEY=sk-ant-...

# Option 2: GLM-4.7 Claude-compatible API (or any compatible endpoint)
export ANTHROPIC_API_KEY=your-api-key
export ANTHROPIC_BASE_URL=https://glm-4-7-endpoint.com/v1

# Google Gemini (optional)
export GOOGLE_API_KEY=AIza...

# OpenAI ChatGPT (optional)
export OPENAI_API_KEY=sk-...

# GitHub (for higher rate limits)
export GITHUB_TOKEN=ghp_...

# Private config repositories (optional)
export GITLAB_TOKEN=glpat-...
export GITEA_TOKEN=...
export BITBUCKET_TOKEN=...

All AI enhancement features respect these settings:

  • enhance_skill.py - API mode SKILL.md enhancement
  • ai_enhancer.py - C3.1/C3.2 pattern and test example enhancement
  • guide_enhancer.py - C3.3 guide enhancement
  • config_enhancer.py - C3.4 configuration enhancement
  • adaptors/claude.py - Claude platform adaptor enhancement

Note: Setting ANTHROPIC_BASE_URL allows you to use any Claude-compatible API endpoint, such as GLM-4.7 (智谱 AI).
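
A minimal sketch of how an enhancer might turn these variables into client settings is shown below. The helper name anthropic_client_kwargs is hypothetical and the project's enhancers may read the environment differently; the returned dict is meant for a client constructor that accepts api_key and base_url keyword arguments.

```python
import os


def anthropic_client_kwargs(env=None) -> dict:
    """Build client keyword arguments from the environment (illustrative)."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    kwargs = {"api_key": api_key}
    # Only override the endpoint when a compatible base URL is configured.
    base_url = env.get("ANTHROPIC_BASE_URL")
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs
```

Leaving base_url out of the dict when the variable is unset lets the client fall back to its own default endpoint.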

📦 Package Structure (pyproject.toml)

Entry Points

[project.scripts]
# Main unified CLI
skill-seekers = "skill_seekers.cli.main:main"

# Individual tool entry points (Core)
skill-seekers-config = "skill_seekers.cli.config_command:main"                # v2.7.0 Configuration wizard
skill-seekers-resume = "skill_seekers.cli.resume_command:main"                # v2.7.0 Resume interrupted jobs
skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
skill-seekers-github = "skill_seekers.cli.github_scraper:main"
skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
skill-seekers-codebase = "skill_seekers.cli.codebase_scraper:main"           # C2.x Local codebase analysis
skill-seekers-enhance = "skill_seekers.cli.enhance_skill_local:main"
skill-seekers-enhance-status = "skill_seekers.cli.enhance_status:main"       # Status monitoring
skill-seekers-package = "skill_seekers.cli.package_skill:main"
skill-seekers-upload = "skill_seekers.cli.upload_skill:main"
skill-seekers-estimate = "skill_seekers.cli.estimate_pages:main"
skill-seekers-install = "skill_seekers.cli.install_skill:main"
skill-seekers-install-agent = "skill_seekers.cli.install_agent:main"
skill-seekers-patterns = "skill_seekers.cli.pattern_recognizer:main"         # C3.1 Pattern detection
skill-seekers-how-to-guides = "skill_seekers.cli.how_to_guide_builder:main" # C3.3 Guide generation
skill-seekers-workflows = "skill_seekers.cli.workflows_command:main"         # NEW: Workflow preset management
skill-seekers-video = "skill_seekers.cli.video_scraper:main"                  # Video scraping pipeline (use --setup to install deps)

# New v3.0.0 Entry Points
skill-seekers-setup = "skill_seekers.cli.setup_wizard:main"                  # NEW: v3.0.0 Setup wizard
skill-seekers-cloud = "skill_seekers.cli.cloud_storage_cli:main"             # NEW: v3.0.0 Cloud storage
skill-seekers-embed = "skill_seekers.embedding.server:main"                  # NEW: v3.0.0 Embedding server
skill-seekers-sync = "skill_seekers.cli.sync_cli:main"                       # NEW: v3.0.0 Sync & monitoring
skill-seekers-benchmark = "skill_seekers.cli.benchmark_cli:main"             # NEW: v3.0.0 Benchmarking
skill-seekers-stream = "skill_seekers.cli.streaming_ingest:main"             # NEW: v3.0.0 Streaming ingest
skill-seekers-update = "skill_seekers.cli.incremental_updater:main"          # NEW: v3.0.0 Incremental updates
skill-seekers-multilang = "skill_seekers.cli.multilang_support:main"         # NEW: v3.0.0 Multilingual
skill-seekers-quality = "skill_seekers.cli.quality_metrics:main"             # NEW: v3.0.0 Quality metrics

Optional Dependencies

Project uses PEP 735 [dependency-groups], which needs a PEP 735-aware installer (for example uv, or pip 25.1+):

  • Replaces deprecated tool.uv.dev-dependencies
  • Dev dependencies: [dependency-groups] dev = [...] in pyproject.toml
  • Install with: pip install -e . (installs only core deps)
  • Install dev deps: See CI workflow or manually install pytest, ruff, mypy

Note on video dependencies: easyocr and GPU-specific PyTorch builds are not included in the video-full optional dependency group. They are installed at runtime by skill-seekers video --setup, which auto-detects the GPU (CUDA/ROCm/MPS/CPU) and installs the correct builds.

[project.optional-dependencies]
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]

[dependency-groups]  # PEP 735 (replaces tool.uv.dev-dependencies)
dev = [
    "pytest>=8.4.2",
    "pytest-asyncio>=0.24.0",
    "pytest-cov>=7.0.0",
    "coverage>=7.11.0",
]

🚨 Critical Development Notes

Must Run Before Tests

# REQUIRED: Install package before running tests
pip install -e .

# Why: src/ layout requires package installation
# Without this, imports will fail

Never Skip Tests

Per user instructions in ~/.claude/CLAUDE.md:

  • "never skip any test. always make sure all test pass"
  • All 2,540 tests must pass before commits
  • Run full test suite: pytest tests/ -v
  • New tests added for create command and CLI refactor work

Platform-Specific Dependencies

Platform dependencies are optional (install only what you need):

# Install specific platform support
pip install -e ".[gemini]"         # Google Gemini
pip install -e ".[openai]"         # OpenAI ChatGPT
pip install -e ".[chroma]"         # ChromaDB
pip install -e ".[weaviate]"       # Weaviate
pip install -e ".[s3]"             # AWS S3
pip install -e ".[gcs]"            # Google Cloud Storage
pip install -e ".[azure]"          # Azure Blob Storage
pip install -e ".[mcp]"            # MCP integration
pip install -e ".[all]"            # Everything (16 platforms + cloud + embedding)

# Or install from PyPI:
pip install skill-seekers[gemini]    # Google Gemini support
pip install skill-seekers[openai]    # OpenAI ChatGPT support
pip install skill-seekers[all-llms]  # All LLM platforms
pip install skill-seekers[chroma]    # ChromaDB support
pip install skill-seekers[weaviate]  # Weaviate support
pip install skill-seekers[s3]        # AWS S3 support
pip install skill-seekers[all]       # All optional dependencies

AI Enhancement Modes

AI enhancement transforms basic skills (2-3/10) into production-ready skills (8-9/10). Two modes are available:

API Mode (default if ANTHROPIC_API_KEY is set):

  • Direct Claude API calls (fast, efficient)
  • Cost: ~$0.15-$0.30 per skill
  • Perfect for CI/CD automation
  • Requires: export ANTHROPIC_API_KEY=sk-ant-...

LOCAL Mode (fallback if no API key):

  • Uses Claude Code CLI (your existing Max plan)
  • Free! No API charges
  • 4 execution modes:
    • Headless (default): Foreground, waits for completion
    • Background (--background): Returns immediately
    • Daemon (--daemon): Fully detached with nohup
    • Terminal (--interactive-enhancement): Opens new terminal (macOS)
  • Status monitoring: skill-seekers enhance-status output/react/ --watch
  • Timeout configuration: --timeout 300 (seconds)

Enhancement Flag Consolidation (Phase 1)

IMPORTANT CHANGE: Three enhancement flags have been unified into a single granular control:

Old flags (deprecated):

  • --enhance - Enable AI enhancement
  • --enhance-local - Use LOCAL mode (Claude Code)
  • --api-key KEY - Anthropic API key

New unified flag:

  • --enhance-level LEVEL - Granular AI enhancement control (0-3, default: 2)
    • 0 - Disabled, no AI enhancement
    • 1 - SKILL.md only (core documentation)
    • 2 - + Architecture + Config + Docs (default, balanced)
    • 3 - Full enhancement (all features, comprehensive)

Auto-detection: Mode (API vs LOCAL) is auto-detected:

  • If ANTHROPIC_API_KEY is set → API mode
  • Otherwise → LOCAL mode (Claude Code Max)
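The auto-detection rule above can be sketched in a few lines. This is an illustration only, not the project's implementation: `detect_enhancement_mode` is a hypothetical name, and treating an empty or whitespace-only key as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def detect_enhancement_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "API" when ANTHROPIC_API_KEY is set, otherwise "LOCAL".

    Hypothetical helper that mirrors the rule described above.
    """
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as "not set".
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "API" if key else "LOCAL"
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.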

Examples:

# Auto-detect mode, default enhancement level (2)
skill-seekers create https://docs.react.dev/

# Disable enhancement
skill-seekers create facebook/react --enhance-level 0

# SKILL.md only (fast)
skill-seekers create ./my-project --enhance-level 1

# Full enhancement (comprehensive)
skill-seekers create tutorial.pdf --enhance-level 3

# Force LOCAL mode with specific level
skill-seekers enhance output/react/ --mode LOCAL --enhance-level 2

# Background with status monitoring
skill-seekers enhance output/react/ --background
skill-seekers enhance-status output/react/ --watch

Migration: The old flags still work but emit deprecation warnings; they will be removed in v4.0.0.

See docs/ENHANCEMENT_MODES.md for detailed documentation.

Git Workflow

Git Workflow Notes:

  • Main branch: main
  • Development branch: development
  • Always create feature branches from development
  • Branch naming: feature/{task-id}-{description} or feature/{category}

To see current status: git status

CI/CD Pipeline

The project has GitHub Actions workflows in .github/workflows/:

tests.yml - Runs on every push and PR to main or development:

  1. Lint Job (Python 3.12, Ubuntu):

    • ruff check src/ tests/ - Code linting with GitHub annotations
    • ruff format --check src/ tests/ - Format validation
    • mypy src/skill_seekers - Type checking (continue-on-error)
  2. Test Job (Matrix):

    • OS: Ubuntu + macOS
    • Python: 3.10, 3.11, 3.12
    • Exclusions: macOS + Python 3.10 (speed optimization)
    • Steps:
      • Install dependencies + pip install -e .
      • Run CLI tests (scraper, config, integration)
      • Run MCP server tests
      • Generate coverage report → Upload to Codecov
  3. Summary Job - Single status check for branch protection

    • Ensures both lint and test jobs succeed
    • Provides single "All Checks Complete" status
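To see which combinations the test matrix described above covers, the expansion can be sketched as follows. The OS labels and the `expand_matrix` name are illustrative assumptions; the real definition lives in tests.yml.

```python
from itertools import product

# Values taken from the matrix description above; labels are illustrative.
OPERATING_SYSTEMS = ["ubuntu", "macos"]
PYTHON_VERSIONS = ["3.10", "3.11", "3.12"]
EXCLUDED = {("macos", "3.10")}  # skipped for speed


def expand_matrix() -> list[tuple[str, str]]:
    """Return every (os, python) pair except the excluded ones."""
    return [
        combo
        for combo in product(OPERATING_SYSTEMS, PYTHON_VERSIONS)
        if combo not in EXCLUDED
    ]
```

With one exclusion out of six pairs, this yields five combinations, so a failure that appears only on Python 3.10 will show up on Ubuntu but never on macOS.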

release.yml - Triggers on version tags (e.g., v2.9.0):

  • Builds package with uv build
  • Publishes to PyPI with uv publish
  • Creates GitHub release

Local Pre-Commit Validation

Run the same checks as CI before pushing:

# 1. Code quality (matches lint job) - WITH AUTO-FIX
uvx ruff check --fix --unsafe-fixes src/ tests/  # Auto-fix issues
uvx ruff format src/ tests/                      # Auto-format
uvx ruff check src/ tests/                       # Verify clean
uvx ruff format --check src/ tests/              # Verify formatted
mypy src/skill_seekers

# 2. Tests (matches test job)
pip install -e .
pytest tests/ -v --cov=src/skill_seekers --cov-report=term

# 3. If all pass, you're good to push!
git add -A  # Stage any auto-fixes
git commit --amend --no-edit  # Add fixes to commit (or new commit)
git push origin feature/my-feature

Branch Protection Rules:

  • main: Requires tests + 1 review, only maintainers merge
  • development: Requires tests to pass, default target for PRs

Common CI Failure Patterns and Fixes

If CI fails after your changes, follow this debugging checklist:

# 1. Fix linting errors automatically
uvx ruff check --fix --unsafe-fixes src/ tests/

# 2. Fix formatting issues
uvx ruff format src/ tests/

# 3. Check for remaining issues
uvx ruff check src/ tests/
uvx ruff format --check src/ tests/

# 4. Verify tests pass locally
pip install -e .
pytest tests/ -v

# 5. Push fixes
git add -A
git commit -m "fix: resolve CI linting/formatting issues"
git push

Critical dependency patterns to check:

  • MCP version mismatch: Ensure requirements.txt and pyproject.toml have matching MCP versions
  • Missing module-level imports: If a tool file imports a module at top level (e.g., import yaml), that module MUST be in core dependencies
  • Try/except ImportError: Silent failures in try/except blocks can hide missing dependencies
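One way to keep an optional import from failing silently is to re-raise with an install hint. The sketch below uses only the standard library; `require_optional` is a hypothetical helper name, not an existing function in this codebase.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message.

    `extra` is the optional-dependency group to suggest, e.g. "gemini".
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Surface the problem instead of swallowing it in a bare except.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install -e ".[{extra}]"'
        ) from exc
```

Calling this at the point of use (rather than at module top level) keeps the module importable when the extra is not installed, while still producing a clear error when the feature is actually requested.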

Timing-sensitive tests:

  • Benchmark tests may fail on slower CI runners (macOS)
  • If a test times out or exceeds threshold only in CI, consider relaxing the threshold
  • Local passing doesn't guarantee CI passing for performance tests

🚨 Common Pitfalls & Solutions

1. Import Errors

Problem: ModuleNotFoundError: No module named 'skill_seekers'

Solution: Must install package first due to src/ layout

pip install -e .

Why: The src/ layout prevents imports from repo root. Package must be installed.

2. Tests Fail with "No module named..."

Problem: Package not installed in test environment

Solution: CI runs pip install -e . before tests - do the same locally

pip install -e .
pytest tests/ -v

3. Platform-Specific Dependencies Not Found

Problem: ModuleNotFoundError: No module named 'google.generativeai'

Solution: Install platform-specific dependencies

pip install -e ".[gemini]"   # For Gemini
pip install -e ".[openai]"   # For OpenAI
pip install -e ".[all-llms]" # For all LLM platforms

4. Git Branch Confusion

Problem: PR targets main instead of development

Solution: Always create PRs targeting development branch

git checkout development
git pull upstream development
git checkout -b feature/my-feature
# ... make changes ...
git push origin feature/my-feature
# Create PR: feature/my-feature → development

Important: See CONTRIBUTING.md for complete branch workflow.

5. Tests Pass Locally But Fail in CI

Problem: Different Python version or missing dependency

Solution: Test with multiple Python versions locally

# CI tests: Python 3.10, 3.11, 3.12 on Ubuntu + macOS
# Use pyenv or docker to test locally:
pyenv install 3.10.13 3.11.7 3.12.1

pyenv local 3.10.13
pip install -e . && pytest tests/ -v

pyenv local 3.11.7
pip install -e . && pytest tests/ -v

pyenv local 3.12.1
pip install -e . && pytest tests/ -v

6. Enhancement Not Working

Problem: AI enhancement fails or hangs

Solutions:

# Check if API key is set
echo $ANTHROPIC_API_KEY

# Try LOCAL mode instead (uses Claude Code Max, no API key needed)
skill-seekers enhance output/react/ --mode LOCAL

# Monitor enhancement status for background jobs
skill-seekers enhance-status output/react/ --watch

7. Rate Limit Errors from GitHub

Problem: 403 Forbidden from GitHub API

Solutions:

# Check current rate limit
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit

# Configure multiple GitHub profiles (recommended)
skill-seekers config --github

# Use specific profile
skill-seekers github --repo owner/repo --profile work

# Test all configured tokens
skill-seekers config --test

8. Confused About Command Options

Problem: "Too many flags!" or "Which flags work with which sources?"

Solution: Use the progressive disclosure help system in the create command:

# Start with universal options (13 flags)
skill-seekers create --help

# Need web scraping options?
skill-seekers create --help-web

# GitHub-specific flags?
skill-seekers create --help-github

# See ALL options (120+ flags)?
skill-seekers create --help-all

# Quick preset shortcut
skill-seekers create <source> -p quick
skill-seekers create <source> -p standard
skill-seekers create <source> -p comprehensive

Why: The create command shows only relevant flags by default to reduce cognitive load.

Legacy commands (scrape, github, analyze) show all flags in one help screen - use them if you prefer that style.

9. Checks Pass Locally But Fail in GitHub Actions

Problem: Ruff check/format or tests pass locally but fail in CI

Common causes:

  1. Dependency version mismatch - requirements.txt vs pyproject.toml conflicts

    # Check both files have matching versions for core deps
    grep "mcp" requirements.txt pyproject.toml
    grep "PyYAML" requirements.txt pyproject.toml
    
  2. Module imported but not declared - File imports module at top level but it's not in dependencies

    # Search for imports that might not be in dependencies
    grep -r "^import yaml" src/
    grep -r "^from yaml" src/
    # Ensure PyYAML is in pyproject.toml core dependencies
    
  3. Ruff version differences - Local ruff vs CI ruff may have different rules

    # Use uvx to match CI's ruff version
    uvx ruff check src/ tests/
    uvx ruff format src/ tests/
    

Solution:

# Run CI validation commands exactly as CI does
pip install -e .  # Fresh install
uvx ruff check src/ tests/  # Use uvx, not local ruff
uvx ruff format --check src/ tests/
pytest tests/ -v

🔌 MCP Integration

MCP Server (26 Tools)

Transport modes:

  • stdio: Claude Code, VS Code + Cline
  • HTTP: Cursor, Windsurf, IntelliJ IDEA

Core Tools (9):

  1. list_configs - List preset configurations
  2. generate_config - Generate config from docs URL
  3. validate_config - Validate config structure
  4. estimate_pages - Estimate page count
  5. scrape_docs - Scrape documentation
  6. package_skill - Package to format (supports --format and --target)
  7. upload_skill - Upload to platform (supports --target)
  8. enhance_skill - AI enhancement with platform support
  9. install_skill - Complete workflow automation

Extended Tools (10):

  10. scrape_github - GitHub repository analysis
  11. scrape_pdf - PDF extraction
  12. unified_scrape - Multi-source scraping
  13. merge_sources - Merge docs + code
  14. detect_conflicts - Find discrepancies
  15. add_config_source - Register git repos
  16. fetch_config - Fetch configs from git
  17. list_config_sources - List registered sources
  18. remove_config_source - Remove config source
  19. split_config - Split large configs

NEW Vector DB Tools (4):

  20. export_to_chroma - Export to ChromaDB
  21. export_to_weaviate - Export to Weaviate
  22. export_to_faiss - Export to FAISS
  23. export_to_qdrant - Export to Qdrant

NEW Cloud Tools (3):

  24. cloud_upload - Upload to S3/GCS/Azure
  25. cloud_download - Download from cloud storage
  26. cloud_list - List files in cloud storage

Starting MCP Server

# stdio mode (Claude Code, VS Code + Cline)
python -m skill_seekers.mcp.server_fastmcp

# HTTP mode (Cursor, Windsurf, IntelliJ)
python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765

🤖 RAG Framework & Vector Database Integrations (NEW - v3.0.0)

Skill Seekers is now the universal preprocessor for RAG pipelines. Export documentation to any RAG framework or vector database with a single command.

RAG Frameworks

LangChain Documents:

# Export to LangChain Document format
skill-seekers package output/django --format langchain

# Output: output/django-langchain.json
# Format: Array of LangChain Document objects
# - page_content: Full text content
# - metadata: {source, category, type, url}

# Use in LangChain:
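from langchain_community.document_loaders import JSONLoader  # needs the jq package
loader = JSONLoader(
    "output/django-langchain.json",
    jq_schema=".[]",              # one Document per array element
    content_key="page_content",   # field that holds the text
)
documents = loader.load()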
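For a quick sanity check without installing LangChain, the exported file can be inspected with the standard library alone. This is a minimal sketch, assuming the export is a JSON array whose items carry the page_content and metadata fields listed above; `parse_exported_documents` is a hypothetical helper, not part of Skill Seekers.

```python
import json
from typing import Any


def parse_exported_documents(text: str) -> list[dict[str, Any]]:
    """Parse exported JSON text and check the two documented fields."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of documents")
    for index, record in enumerate(records):
        # Assumption: every item has both fields, as in the format notes.
        if "page_content" not in record or "metadata" not in record:
            raise ValueError(f"record {index} lacks page_content or metadata")
    return records
```

Typical use would be `parse_exported_documents(open("output/django-langchain.json").read())`, followed by a look at a few `metadata` values before wiring the file into a pipeline.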

LlamaIndex TextNodes:

# Export to LlamaIndex TextNode format
skill-seekers package output/django --format llama-index

# Output: output/django-llama-index.json
# Format: Array of LlamaIndex TextNode objects
# - text: Content
# - id_: Unique identifier
# - metadata: {source, category, type}
# - relationships: Document relationships

# Use in LlamaIndex:
import json
from llama_index.core.schema import TextNode  # llama-index >= 0.10; older releases used llama_index.schema
with open("output/django-llama-index.json") as f:
    nodes = [TextNode.from_dict(n) for n in json.load(f)]

Haystack Documents:

# Export to Haystack Document format
skill-seekers package output/django --format haystack

# Output: output/django-haystack.json
# Format: Haystack Document objects for pipelines
# Perfect for: Question answering, search, RAG pipelines

Vector Databases

ChromaDB (Direct Integration):

# Export and optionally upload to ChromaDB
skill-seekers package output/django --format chroma

# Output: output/django-chroma/ (ChromaDB collection)
# With direct upload (requires chromadb running):
skill-seekers package output/django --format chroma --upload

# Configuration via environment:
export CHROMA_HOST=localhost
export CHROMA_PORT=8000

FAISS (Facebook AI Similarity Search):

# Export to FAISS index format
skill-seekers package output/django --format faiss

# Output:
# - output/django-faiss.index (FAISS index)
# - output/django-faiss-metadata.json (Document metadata)

# Use with FAISS:
import faiss
index = faiss.read_index("output/django-faiss.index")

Weaviate:

# Export and upload to Weaviate
skill-seekers package output/django --format weaviate --upload

# Requires environment variables:
export WEAVIATE_URL=http://localhost:8080
export WEAVIATE_API_KEY=your-api-key

# Creates class "DjangoDoc" with schema

Qdrant:

# Export and upload to Qdrant
skill-seekers package output/django --format qdrant --upload

# Requires environment variables:
export QDRANT_URL=http://localhost:6333
export QDRANT_API_KEY=your-api-key

# Creates collection "django_docs"

Pinecone (via Markdown):

# Pinecone uses the markdown format
skill-seekers package output/django --target markdown

# Then use Pinecone's Python client for upsert
# See: docs/integrations/PINECONE.md

Complete RAG Pipeline Example

# 1. Scrape documentation
skill-seekers scrape --config configs/django.json

# 2. Export to your RAG stack
skill-seekers package output/django --format langchain  # For LangChain
skill-seekers package output/django --format llama-index  # For LlamaIndex
skill-seekers package output/django --format chroma --upload  # Direct to ChromaDB

# 3. Use in your application
# See examples/:
# - examples/langchain-rag-pipeline/
# - examples/llama-index-query-engine/
# - examples/pinecone-upsert/

Integration Hub: docs/integrations/RAG_PIPELINES.md

🛠️ AI Coding Assistant Integrations (NEW - v3.0.0)

Transform any framework documentation into persistent expert context for 4+ AI coding assistants. Your IDE's AI now "knows" your frameworks without manual prompting.

Cursor IDE

Setup:

# 1. Generate skill
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/ --target claude

# 2. Install to Cursor
cp output/react-claude/SKILL.md .cursorrules

# 3. Restart Cursor
# AI now has React expertise!

Benefits:

  • AI suggests React-specific patterns
  • No manual "use React hooks" prompts needed
  • Consistent team patterns
  • Works for ANY framework

Guide: docs/integrations/CURSOR.md Example: examples/cursor-react-skill/

Windsurf

Setup:

# 1. Generate skill
skill-seekers scrape --config configs/django.json
skill-seekers package output/django/ --target claude

# 2. Install to Windsurf
mkdir -p .windsurf/rules
cp output/django-claude/SKILL.md .windsurf/rules/django.md

# 3. Restart Windsurf
# AI now knows Django patterns!

Benefits:

  • Flow-based coding with framework knowledge
  • IDE-native AI assistance
  • Persistent context across sessions

Guide: docs/integrations/WINDSURF.md Example: examples/windsurf-fastapi-context/

Cline (VS Code Extension)

Setup:

# 1. Generate skill
skill-seekers scrape --config configs/fastapi.json
skill-seekers package output/fastapi/ --target claude

# 2. Install to Cline
cp output/fastapi-claude/SKILL.md .clinerules

# 3. Reload VS Code
# Cline now has FastAPI expertise!

Benefits:

  • Agentic code generation in VS Code
  • Cursor Composer equivalent for VS Code
  • System prompts + MCP integration

Guide: docs/integrations/CLINE.md Example: examples/cline-django-assistant/

Continue.dev (Universal IDE)

Setup:

# 1. Generate skill
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/ --target claude

# 2. Start context server
cd examples/continue-dev-universal/
python context_server.py --port 8765

# 3. Configure in ~/.continue/config.json
{
  "contextProviders": [
    {
      "name": "http",
      "params": {
        "url": "http://localhost:8765/context",
        "title": "React Documentation"
      }
    }
  ]
}

# 4. Works in ALL IDEs!
# VS Code, JetBrains, Vim, Emacs...

Benefits:

  • IDE-agnostic (works in VS Code, IntelliJ, Vim, Emacs)
  • Custom LLM providers supported
  • HTTP-based context serving
  • Team consistency across mixed IDE environments

Guide: docs/integrations/CONTINUE_DEV.md Example: examples/continue-dev-universal/

Multi-IDE Team Setup

For teams using different IDEs (VS Code, IntelliJ, Vim):

# Use Continue.dev as universal context provider
skill-seekers scrape --config configs/react.json
python context_server.py --host 0.0.0.0 --port 8765

# ALL team members configure Continue.dev
# Result: Identical AI suggestions across all IDEs!

Integration Hub: docs/integrations/INTEGRATIONS.md

☁️ Cloud Storage Integration (NEW - v3.0.0)

Upload skills directly to cloud storage for team sharing and CI/CD pipelines.

Supported Providers

AWS S3:

# Upload skill
skill-seekers cloud upload --provider s3 --bucket my-skills output/react.zip

# Download skill
skill-seekers cloud download --provider s3 --bucket my-skills react.zip

# List skills
skill-seekers cloud list --provider s3 --bucket my-skills

# Environment variables:
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_REGION=us-east-1

Google Cloud Storage:

# Upload skill
skill-seekers cloud upload --provider gcs --bucket my-skills output/react.zip

# Download skill
skill-seekers cloud download --provider gcs --bucket my-skills react.zip

# List skills
skill-seekers cloud list --provider gcs --bucket my-skills

# Environment variables:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

Azure Blob Storage:

# Upload skill
skill-seekers cloud upload --provider azure --container my-skills output/react.zip

# Download skill
skill-seekers cloud download --provider azure --container my-skills react.zip

# List skills
skill-seekers cloud list --provider azure --container my-skills

# Environment variables:
export AZURE_STORAGE_CONNECTION_STRING=your-connection-string
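Before running an upload in CI, it can help to check that the variables listed for each provider are present. The sketch below is an assumption-laden illustration: `missing_env_vars` is a hypothetical helper, and whether every listed variable is strictly required by the CLI (AWS_REGION in particular) is not confirmed here.

```python
import os
from typing import Mapping, Optional

# Variable names copied from the provider examples above.
EXPECTED_ENV = {
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    "gcs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure": ["AZURE_STORAGE_CONNECTION_STRING"],
}


def missing_env_vars(
    provider: str, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the listed variables that are unset or empty for a provider."""
    if provider not in EXPECTED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    env = os.environ if env is None else env
    return [name for name in EXPECTED_ENV[provider] if not env.get(name)]
```

An empty result means all listed variables are set; anything else can be printed as a pre-flight warning before calling the upload command.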

CI/CD Integration

# GitHub Actions example
- name: Upload skill to S3
  run: |
    skill-seekers scrape --config configs/react.json
    skill-seekers package output/react/
    skill-seekers cloud upload --provider s3 --bucket ci-skills output/react.zip
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Guide: docs/integrations/CLOUD_STORAGE.md

📋 Common Workflows

Adding a New Platform

  1. Create adaptor in src/skill_seekers/cli/adaptors/{platform}_adaptor.py
  2. Inherit from BaseAdaptor
  3. Implement package(), upload(), enhance() methods
  4. Add to factory in adaptors/__init__.py
  5. Add optional dependency to pyproject.toml
  6. Add tests in tests/test_install_multiplatform.py

Adding a New Feature

  1. Implement in appropriate CLI module
  2. Add entry point to pyproject.toml if needed
  3. Add tests in tests/test_{feature}.py
  4. Run full test suite: pytest tests/ -v
  5. Update CHANGELOG.md
  6. Commit only when all tests pass

Debugging Common Issues

Import Errors:

# Always ensure package is installed first
pip install -e .

# Verify installation
python -c "import skill_seekers; print(skill_seekers.__version__)"

Rate Limit Issues:

# Check current GitHub rate limit status
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit

# Configure multiple GitHub profiles
skill-seekers config --github

# Test your tokens
skill-seekers config --test

Enhancement Not Working:

# Check if API key is set
echo $ANTHROPIC_API_KEY

# Try LOCAL mode instead (uses Claude Code Max)
skill-seekers enhance output/react/ --mode LOCAL

# Monitor enhancement status
skill-seekers enhance-status output/react/ --watch

Test Failures:

# Run specific failing test with verbose output
pytest tests/test_file.py::test_name -vv

# Run with print statements visible
pytest tests/test_file.py -s

# Run with coverage to see what's not tested
pytest tests/test_file.py --cov=src/skill_seekers --cov-report=term-missing

# Run only unit tests (skip slow integration tests)
pytest tests/ -v -m "not slow and not integration"

Config Issues:

# Validate config structure
skill-seekers-validate configs/myconfig.json

# Show current configuration
skill-seekers config --show

# Estimate pages before scraping
skill-seekers estimate configs/myconfig.json

🎯 Where to Make Changes

This section helps you quickly locate the right files when implementing common changes.

Adding a New CLI Command

Files to modify:

  1. Create command file: src/skill_seekers/cli/my_command.py

    def main():
        """Entry point for my-command."""
        # Implementation
    
  2. Add entry point: pyproject.toml

    [project.scripts]
    skill-seekers-my-command = "skill_seekers.cli.my_command:main"
    
  3. Update unified CLI: src/skill_seekers/cli/main.py

    • Add subcommand handler to dispatcher
  4. Add tests: tests/test_my_command.py

    • Test main functionality
    • Test CLI argument parsing
    • Test error cases
  5. Update docs: CHANGELOG.md + README.md (if user-facing)
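A fuller version of the command file from step 1 might look like the sketch below. The argument names (`source`, `--verbose`) are placeholders chosen for illustration; only the `main()` entry-point shape comes from the steps above.

```python
import argparse
import sys
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the parser separately so tests can exercise argument parsing."""
    parser = argparse.ArgumentParser(
        prog="skill-seekers-my-command",
        description="Example command (hypothetical).",
    )
    parser.add_argument("source", help="Input to process")
    parser.add_argument("--verbose", action="store_true", help="Print extra detail")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point for my-command; returns a process exit code."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Processing {args.source}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Accepting `argv` as a parameter lets tests call `main(["docs/"])` directly instead of patching `sys.argv`, which covers the "CLI argument parsing" and "error cases" test bullets with plain function calls.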

Adding a New Platform Adaptor

Files to modify:

  1. Create adaptor: src/skill_seekers/cli/adaptors/my_platform_adaptor.py

    from .base import BaseAdaptor
    
    class MyPlatformAdaptor(BaseAdaptor):
        def package(self, skill_dir, output_path, **kwargs):
            # Platform-specific packaging
            pass
    
        def upload(self, package_path, api_key=None, **kwargs):
            # Platform-specific upload (optional for some platforms)
            pass
    
        def export(self, skill_dir, format, **kwargs):
            # For RAG/vector DB adaptors: export to specific format
            pass
    
  2. Register in factory: src/skill_seekers/cli/adaptors/__init__.py

    def get_adaptor(target=None, format=None):
        # For LLM platforms (--target flag)
        target_adaptors = {
            'claude': ClaudeAdaptor,
            'gemini': GeminiAdaptor,
            'openai': OpenAIAdaptor,
            'markdown': MarkdownAdaptor,
            'myplatform': MyPlatformAdaptor,  # ADD THIS
        }
    
        # For RAG/vector DBs (--format flag)
        format_adaptors = {
            'langchain': LangChainAdaptor,
            'llama-index': LlamaIndexAdaptor,
            'chroma': ChromaAdaptor,
            # ... etc
        }
    
  3. Add optional dependency: pyproject.toml

    [project.optional-dependencies]
    myplatform = ["myplatform-sdk>=1.0.0"]
    
  4. Add tests: tests/test_adaptors/test_my_platform_adaptor.py

    • Test export format
    • Test upload (if applicable)
    • Test with real data
  5. Update documentation:

    • README.md - Platform comparison table
    • docs/integrations/MY_PLATFORM.md - Integration guide
    • examples/my-platform-example/ - Working example
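The two-registry factory from step 2 can be sketched as a runnable example. The classes here are empty stand-ins, and the rule that exactly one of `target` or `format` must be given is an assumption made for the illustration, not confirmed behavior of the real `get_adaptor`.

```python
from typing import Optional


class BaseAdaptor:
    """Stand-in for the project's base class (illustration only)."""


class ClaudeAdaptor(BaseAdaptor):
    """Stand-in for an LLM platform adaptor."""


class LangChainAdaptor(BaseAdaptor):
    """Stand-in for a RAG/vector DB adaptor."""


TARGET_ADAPTORS = {"claude": ClaudeAdaptor}        # --target flag
FORMAT_ADAPTORS = {"langchain": LangChainAdaptor}  # --format flag


def get_adaptor(
    target: Optional[str] = None, format: Optional[str] = None
) -> BaseAdaptor:
    """Look up and instantiate an adaptor from one of the two registries."""
    # Assumption: callers pass exactly one of the two selectors.
    if (target is None) == (format is None):
        raise ValueError("pass exactly one of target or format")
    if target is not None:
        registry, key = TARGET_ADAPTORS, target
    else:
        registry, key = FORMAT_ADAPTORS, format
    if key not in registry:
        raise ValueError(f"unknown adaptor {key!r}; choose from {sorted(registry)}")
    return registry[key]()
```

Keeping the registries as plain dictionaries means registering a new platform is a one-line change, which is what the "ADD THIS" comment in step 2 relies on.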

Adding a New Config Preset

Files to modify:

  1. Create config: configs/my_framework.json

    {
      "name": "my_framework",
      "base_url": "https://docs.myframework.com/",
      "selectors": {...},
      "categories": {...}
    }
    
  2. Test locally:

    # Estimate first
    skill-seekers estimate configs/my_framework.json
    
    # Test scrape (small sample)
    skill-seekers scrape --config configs/my_framework.json --max-pages 50
    
  3. Add to README: Update presets table in README.md

  4. Submit to website: (Optional) Submit to SkillSeekersWeb.com
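Before running the estimate step, a quick structural check of a new preset can catch typos early. This sketch only looks for the four keys shown in step 1 and is not a replacement for the project's own validator; treating all four keys as mandatory is an assumption, and `check_preset` is a hypothetical name.

```python
import json

# Keys shown in the example preset above; assumed mandatory for this sketch.
EXPECTED_KEYS = ("name", "base_url", "selectors", "categories")


def check_preset(text: str) -> list[str]:
    """Return a list of problems found in a preset's JSON text (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    base_url = config.get("base_url")
    if isinstance(base_url, str) and not base_url.startswith(("http://", "https://")):
        problems.append("base_url should start with http:// or https://")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a contributor fix everything in one pass.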

Modifying Core Scraping Logic

Key files by feature:

| Feature | File | Size | Notes |
|---|---|---|---|
| Doc scraping | src/skill_seekers/cli/doc_scraper.py | ~90KB | Main scraper, BFS traversal |
| GitHub scraping | src/skill_seekers/cli/github_scraper.py | ~56KB | Repo analysis + metadata |
| GitHub API | src/skill_seekers/cli/github_fetcher.py | ~17KB | Rate limit handling |
| PDF extraction | src/skill_seekers/cli/pdf_scraper.py | Medium | PyMuPDF + OCR |
| EPUB extraction | src/skill_seekers/cli/epub_scraper.py | Medium | ebooklib + BeautifulSoup |
| Code analysis | src/skill_seekers/cli/code_analyzer.py | ~65KB | Multi-language AST parsing |
| Pattern detection | src/skill_seekers/cli/pattern_recognizer.py | Medium | C3.1 - 10 GoF patterns |
| Test extraction | src/skill_seekers/cli/test_example_extractor.py | Medium | C3.2 - 5 categories |
| Guide generation | src/skill_seekers/cli/how_to_guide_builder.py | ~45KB | C3.3 - AI-enhanced guides |
| Config extraction | src/skill_seekers/cli/config_extractor.py | ~32KB | C3.4 - 9 formats |
| Router generation | src/skill_seekers/cli/generate_router.py | ~43KB | C3.5 - Architecture docs |
| Signal flow | src/skill_seekers/cli/signal_flow_analyzer.py | Medium | C3.10 - Godot-specific |

Always add tests when modifying core logic!

Modifying the Unified Create Command

The create command uses a modular argument system:

Files involved:

  1. Parser: src/skill_seekers/cli/parsers/create_parser.py

    • Defines help text and formatter
    • Registers help mode flags (--help-web, --help-github, etc.)
    • Uses custom NoWrapFormatter for better help display
  2. Arguments: src/skill_seekers/cli/arguments/create.py

    • Three tiers of arguments:
      • UNIVERSAL_ARGUMENTS (13 flags) - Work for all sources
      • Source-specific dicts (WEB_ARGUMENTS, GITHUB_ARGUMENTS, EPUB_ARGUMENTS, etc.)
      • ADVANCED_ARGUMENTS - Rare/advanced options
    • add_create_arguments(parser, mode) - Multi-mode argument addition
  3. Source Detection: src/skill_seekers/cli/source_detector.py (if implemented)

    • Auto-detect source type from input
    • Pattern matching (URLs, GitHub repos, file extensions)
  4. Main Logic: src/skill_seekers/cli/create_command.py (if implemented)

    • Route to appropriate scraper based on detected type
    • Argument validation and compatibility checking
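The source-detection step in item 3 can be illustrated with a small pattern-matching function. This is a sketch under stated assumptions: the type labels ("web", "github", "pdf", "epub", "local") and the precedence order are invented for the example and may differ from source_detector.py.

```python
import re

# owner/repo shorthand such as "facebook/react"
_GITHUB_REPO = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def detect_source_type(source: str) -> str:
    """Guess the source type from the input string alone (no filesystem access)."""
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        return "web"
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith(".epub"):
        return "epub"
    # Explicit path prefixes win over the owner/repo pattern, because
    # "./my-project" would otherwise look like "owner/repo".
    if source.startswith((".", "/", "~")):
        return "local"
    if _GITHUB_REPO.match(source):
        return "github"
    return "local"
```

The ordering matters: checking path prefixes before the repository pattern is what keeps a relative directory from being misread as a GitHub repository.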

When adding new arguments:

  • Universal args → UNIVERSAL_ARGUMENTS in arguments/create.py
  • Source-specific → Appropriate dict (WEB_ARGUMENTS, etc.)
  • Always update help text and add tests

Example: Adding a new universal flag:

# In arguments/create.py
UNIVERSAL_ARGUMENTS = {
    # ... existing args ...
    "my_flag": {
        "flags": ("--my-flag", "-m"),
        "kwargs": {
            "action": "store_true",
            "help": "Description of my flag",
        },
    },
}
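To show how a definition in this shape can reach argparse, here is a minimal sketch. `add_arguments` is a hypothetical stand-in for the real `add_create_arguments(parser, mode)` and ignores the mode handling entirely.

```python
import argparse

UNIVERSAL_ARGUMENTS = {
    "my_flag": {
        "flags": ("--my-flag", "-m"),
        "kwargs": {
            "action": "store_true",
            "help": "Description of my flag",
        },
    },
}


def add_arguments(parser: argparse.ArgumentParser, definitions: dict) -> None:
    """Register each definition: flags become positionals, kwargs pass through."""
    for spec in definitions.values():
        parser.add_argument(*spec["flags"], **spec["kwargs"])


parser = argparse.ArgumentParser(prog="create-demo")
add_arguments(parser, UNIVERSAL_ARGUMENTS)
```

argparse derives the attribute name from the long option, so the flag is read back as `args.my_flag` whether the user typed `--my-flag` or `-m`.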

Adding MCP Tools

Files to modify:

  1. Add tool function: src/skill_seekers/mcp/tools/{category}_tools.py

  2. Register tool: src/skill_seekers/mcp/server_fastmcp.py

    @mcp.tool()
    def my_new_tool(param: str) -> str:
        """Tool description."""
        # Implementation
    
  3. Add tests: tests/test_mcp_fastmcp.py

  4. Update count: README.md (currently 26 tools)

📍 Key Files Quick Reference

| Task | File(s) | What to Modify |
|---|---|---|
| Add new CLI command | src/skill_seekers/cli/my_cmd.py, pyproject.toml | Create main() function; add entry point |
| Add platform adaptor | src/skill_seekers/cli/adaptors/my_platform.py, adaptors/__init__.py | Inherit BaseAdaptor; register in factory |
| Fix scraping logic | src/skill_seekers/cli/doc_scraper.py | scrape_all(), extract_content() |
| Add MCP tool | src/skill_seekers/mcp/server_fastmcp.py | Add @mcp.tool() function |
| Fix tests | tests/test_{feature}.py | Add/modify test functions |
| Add config preset | configs/{framework}.json | Create JSON config |
| Update CI | .github/workflows/tests.yml | Modify workflow steps |

📚 Key Code Locations

Documentation Scraper (src/skill_seekers/cli/doc_scraper.py):

  • FALLBACK_MAIN_SELECTORS - Shared fallback CSS selectors for finding main content (no body)
  • _find_main_content() - Centralized selector fallback: config selector → fallback list
  • is_valid_url() - URL validation
  • extract_content() - Content extraction (links extracted from full page before early return)
  • detect_language() - Code language detection
  • extract_patterns() - Pattern extraction
  • smart_categorize() - Smart categorization
  • infer_categories() - Category inference
  • generate_quick_reference() - Quick reference generation
  • create_enhanced_skill_md() - SKILL.md generation
  • scrape_all() - Main scraping loop (dry-run extracts links from full page)
  • main() - Entry point

Codebase Analysis (src/skill_seekers/cli/):

  • codebase_scraper.py - Main CLI for local codebase analysis
  • code_analyzer.py - Multi-language AST parsing (9 languages)
  • api_reference_builder.py - API documentation generation
  • dependency_analyzer.py - NetworkX-based dependency graphs
  • pattern_recognizer.py - C3.1 design pattern detection
  • test_example_extractor.py - C3.2 test example extraction
  • how_to_guide_builder.py - C3.3 guide generation
  • config_extractor.py - C3.4 configuration extraction
  • generate_router.py - C3.5 router skill generation
  • signal_flow_analyzer.py - C3.10 signal flow analysis (Godot projects)
  • unified_codebase_analyzer.py - Three-stream GitHub+local analyzer

AI Enhancement (src/skill_seekers/cli/):

  • enhance_skill_local.py - LOCAL mode enhancement (4 execution modes)
  • enhance_skill.py - API mode enhancement
  • enhance_status.py - Status monitoring for background processes
  • ai_enhancer.py - Shared AI enhancement logic
  • guide_enhancer.py - C3.3 guide AI enhancement
  • config_enhancer.py - C3.4 config AI enhancement

Platform Adaptors (src/skill_seekers/cli/adaptors/):

  • __init__.py - Factory function
  • base_adaptor.py - Abstract base class
  • claude_adaptor.py - Claude AI implementation
  • gemini_adaptor.py - Google Gemini implementation
  • openai_adaptor.py - OpenAI ChatGPT implementation
  • markdown_adaptor.py - Generic Markdown implementation

MCP Server (src/skill_seekers/mcp/):

  • server.py - FastMCP-based server
  • tools/ - 26 MCP tool implementations

Configuration & Rate Limit Management (NEW: v2.7.0 - src/skill_seekers/cli/):

  • config_manager.py - Multi-token configuration system (~490 lines)
    • ConfigManager class - Singleton pattern for global config access
    • add_github_profile() - Add GitHub profile with token and strategy
    • get_github_token() - Smart fallback chain (CLI → Env → Config → Prompt)
    • get_next_profile() - Profile switching for rate limit handling
    • save_progress() / load_progress() - Job resumption support
    • cleanup_old_progress() - Auto-cleanup of old jobs (7 days default)
  • config_command.py - Interactive configuration wizard (~400 lines)
    • main_menu() - 7-option main menu with navigation
    • github_token_menu() - GitHub profile management
    • add_github_profile() - Guided token setup with browser integration
    • api_keys_menu() - API key configuration for Claude/Gemini/OpenAI
    • test_connections() - Connection testing for tokens and API keys
  • rate_limit_handler.py - Smart rate limit detection and handling (~450 lines)
    • RateLimitHandler class - Strategy pattern for rate limit handling
    • check_upfront() - Upfront rate limit check before starting
    • check_response() - Real-time detection from API responses
    • handle_rate_limit() - Execute strategy (prompt/wait/switch/fail)
    • try_switch_profile() - Automatic profile switching
    • wait_for_reset() - Countdown timer with live progress
    • show_countdown_timer() - Live terminal countdown display
  • resume_command.py - Resume interrupted scraping jobs (~150 lines)
    • list_resumable_jobs() - Display all jobs with progress details
    • resume_job() - Resume from saved checkpoint
    • clean_old_jobs() - Cleanup old progress files
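The token fallback chain noted for get_github_token() above (CLI → Env → Config → Prompt) can be sketched as a first-match lookup. The function name, its parameters, and the use of the GITHUB_TOKEN variable as the environment source are assumptions for this illustration; the real method lives on ConfigManager.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_github_token(
    cli_token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config_token: Optional[str] = None,
    prompt: Optional[Callable[[], Optional[str]]] = None,
) -> Optional[str]:
    """Return the first available token: CLI, then env, then config, then prompt."""
    env = os.environ if env is None else env
    for token in (cli_token, env.get("GITHUB_TOKEN"), config_token):
        if token:
            return token
    # Only ask the user when every non-interactive source is empty.
    return prompt() if prompt is not None else None
```

Leaving `prompt` as `None` models a non-interactive run, where the function returns `None` instead of blocking on input.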

GitHub Integration (Modified for v2.7.0 - src/skill_seekers/cli/):

  • github_fetcher.py - Integrated rate limit handler
    • Constructor now accepts interactive and profile_name parameters
    • fetch() - Added upfront rate limit check
    • All API calls check responses for rate limits
    • Raises RateLimitError when rate limit cannot be handled
  • github_scraper.py - Added CLI flags
    • --non-interactive flag for CI/CD mode (fail fast)
    • --profile flag to select GitHub profile from config
    • Config supports interactive and github_profile keys
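
The four strategies named above (prompt/wait/switch/fail) and the fail-fast behavior of non-interactive mode might be dispatched roughly as follows. This is a sketch under assumptions: apply_rate_limit_strategy and its parameters are hypothetical, and only the RateLimitError name and the strategy names are taken from this document.

```python
import time
from typing import Callable, Optional


class RateLimitError(Exception):
    """Raised when a rate limit cannot be handled (name taken from the notes above)."""


def apply_rate_limit_strategy(
    strategy: str,
    reset_in_seconds: float,
    switch_profile: Callable[[], bool],
    ask_user: Optional[Callable[[], str]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply one strategy and return the action taken: 'waited' or 'switched'."""
    if strategy == "prompt":
        # Non-interactive runs have nobody to ask, so they fail fast.
        if ask_user is None:
            raise RateLimitError("rate limited and no interactive prompt available")
        strategy = ask_user()
        if strategy == "prompt":
            raise RateLimitError("prompt must resolve to wait, switch, or fail")
    if strategy == "wait":
        sleep(max(0.0, reset_in_seconds))
        return "waited"
    if strategy == "switch":
        # switch_profile() returns True when another profile was activated.
        if switch_profile():
            return "switched"
        raise RateLimitError("no other GitHub profile available")
    if strategy == "fail":
        raise RateLimitError("rate limited")
    raise ValueError(f"unknown strategy: {strategy!r}")
```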

RAG & Vector Database Adaptors (NEW: v3.0.0 - src/skill_seekers/cli/adaptors/):

  • langchain.py - LangChain Documents export (~250 lines)
    • Exports to LangChain Document format
    • Preserves metadata (source, category, type, url)
    • Smart chunking with overlap
  • llama_index.py - LlamaIndex TextNodes export (~280 lines)
    • Exports to TextNode format with unique IDs
    • Relationship mapping between documents
    • Metadata preservation
  • haystack.py - Haystack Documents export (~230 lines)
    • Pipeline-ready document format
    • Supports embeddings and filters
  • chroma.py - ChromaDB integration (~350 lines)
    • Direct collection creation
    • Batch upsert with embeddings
    • Query interface
  • weaviate.py - Weaviate vector search (~320 lines)
    • Schema creation with auto-detection
    • Batch import with error handling
  • faiss_helpers.py - FAISS index generation (~280 lines)
    • Index building with metadata
    • Search utilities
  • qdrant.py - Qdrant vector database (~300 lines)
    • Collection management
    • Payload indexing
  • streaming_adaptor.py - Streaming data ingest (~200 lines)
    • Real-time data processing
    • Incremental updates
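
As a rough illustration of the "chunking with overlap" mentioned for the LangChain adaptor, the sketch below splits text into fixed-size character windows that share a tail with the previous window. It is an assumption-laden simplification: chunk_text, chunk_size, and overlap are hypothetical names, and the real adaptor may split on headings or tokens instead of characters.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.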

Cloud Storage & Infrastructure (NEW: v3.0.0 - src/skill_seekers/cli/):

  • cloud_storage_cli.py - S3/GCS/Azure upload/download (~450 lines)
    • Multi-provider abstraction
    • Parallel uploads for large files
    • Retry logic with exponential backoff
  • embedding_pipeline.py - Embedding generation for vectors (~320 lines)
    • Sentence-transformers integration
    • Batch processing
    • Multiple embedding models
  • sync_cli.py - Continuous sync & monitoring (~380 lines)
    • File watching for changes
    • Automatic re-scraping
    • Smart diff detection
  • incremental_updater.py - Smart incremental updates (~350 lines)
    • Change detection algorithms
    • Partial skill updates
    • Version tracking
  • streaming_ingest.py - Real-time data streaming (~290 lines)
    • Stream processing pipelines
    • WebSocket support
  • benchmark_cli.py - Performance benchmarking (~280 lines)
    • Scraping performance tests
    • Comparison reports
    • CI/CD integration
  • quality_metrics.py - Quality analysis & reporting (~340 lines)
    • Completeness scoring
    • Link checking
    • Content quality metrics
  • multilang_support.py - Internationalization support (~260 lines)
    • Language detection
    • Translation integration
    • Multi-locale skills
  • setup_wizard.py - Interactive setup wizard (~220 lines)
    • Configuration management
    • Profile creation
    • First-time setup
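
The "retry logic with exponential backoff" listed for cloud_storage_cli.py follows a common pattern that can be sketched as below. The helper name retry_with_backoff, its defaults, and the use of jitter are assumptions for illustration; the project's actual retry parameters are not stated in this document.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on retry_on with capped, doubling delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel uploads from retrying in lockstep.
            sleep(random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```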

Video Scraper (src/skill_seekers/cli/):

  • video_scraper.py - Main video scraping pipeline CLI
  • video_setup.py - GPU auto-detection, PyTorch installation, visual dependency setup (~835 lines)
    • Detects CUDA/ROCm/MPS/CPU and installs matching PyTorch build
    • Installs easyocr and other visual processing deps at runtime via --setup
    • Run skill-seekers video --setup before first use

🎯 Project-Specific Best Practices

  1. Prefer the unified create command - Use skill-seekers create <source> over legacy commands for consistency
  2. Always use platform adaptors - Never hardcode platform-specific logic
  3. Test all platforms - Changes must work for all 16 platforms (was 4 in v2.x)
  4. Maintain backward compatibility - Legacy commands (scrape, github, analyze) must still work
  5. Document API changes - Update CHANGELOG.md for every release
  6. Keep dependencies optional - Platform-specific deps are optional (RAG, cloud, etc.)
  7. Use src/ layout - Proper package structure with pip install -e .
  8. Run tests before commits - Per user instructions, never skip tests (1,765+ tests must pass)
  9. RAG-first mindset - v3.0.0 is the universal preprocessor for AI systems
  10. Export format clarity - Use --format for RAG/vector DBs, --target for LLM platforms
  11. Test with real integrations - Verify exports work with actual LangChain, ChromaDB, etc.
  12. Progressive disclosure - When adding flags, categorize as universal/source-specific/advanced

🐛 Debugging Tips

Enable Verbose Logging

# Set environment variable for debug output
export SKILL_SEEKERS_DEBUG=1
skill-seekers scrape --config configs/react.json

Test Single Function/Module

Run Python modules directly for debugging:

# Run modules with --help to see options
python -m skill_seekers.cli.doc_scraper --help
python -m skill_seekers.cli.github_scraper --repo facebook/react --dry-run
python -m skill_seekers.cli.package_skill --help

# Test MCP server directly
python -m skill_seekers.mcp.server_fastmcp

Use pytest with Debugging

# Drop into debugger on failure
pytest tests/test_scraper_features.py --pdb

# Show print statements (normally suppressed)
pytest tests/test_scraper_features.py -s

# Verbose test output (shows full diff, more details)
pytest tests/test_scraper_features.py -vv

# Run only failed tests from last run
pytest tests/ --lf

# Run until first failure (stop immediately)
pytest tests/ -x

# Show local variables on failure
pytest tests/ -l

Debug Specific Test

# Run single test with full output
pytest tests/test_scraper_features.py::test_detect_language -vv -s

# With debugger
pytest tests/test_scraper_features.py::test_detect_language --pdb

Check Package Installation

# Verify package is installed
pip list | grep skill-seekers

# Check installation mode (should show editable location)
pip show skill-seekers

# Verify imports work
python -c "import skill_seekers; print(skill_seekers.__version__)"

# Check CLI entry points
which skill-seekers
skill-seekers --version

Common Error Messages & Solutions

"ModuleNotFoundError: No module named 'skill_seekers'"

  • Solution: pip install -e .
  • Why: src/ layout requires package installation

"403 Forbidden" from GitHub API

  • Solution: Rate limit hit, set GITHUB_TOKEN or use skill-seekers config --github
  • Check limit: curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit

"SKILL.md enhancement failed"

  • Solution: Check if ANTHROPIC_API_KEY is set, or use --mode LOCAL
  • Monitor: skill-seekers enhance-status output/react/ --watch

"No such file or directory: 'configs/myconfig.json'"

Solution: Config path resolution order:

  1. Exact path as provided
  2. ./configs/ (current directory)
  3. ~/.config/skill-seekers/configs/ (user config)
  4. SkillSeekersWeb.com API (presets)

"pytest: command not found"

Solution: Install dev dependencies

pip install pytest pytest-asyncio pytest-cov coverage
# Or: pip install -e ".[dev]"  (if available)

"ruff: command not found"

Solution: Install ruff

pip install ruff
# Or use uvx: uvx ruff check src/
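
When chasing the "No such file or directory" error above, it can help to print the local candidates in the documented resolution order. The sketch below covers only the three filesystem steps and leaves the SkillSeekersWeb.com API fallback to the caller. The function names are hypothetical, and the exact lookup code in the project may differ.

```python
from pathlib import Path
from typing import List, Optional


def candidate_config_paths(
    name: str, cwd: Optional[Path] = None, home: Optional[Path] = None
) -> List[Path]:
    """List local candidates: exact path, ./configs/, then the user config dir."""
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    filename = Path(name).name
    return [
        Path(name),
        cwd / "configs" / filename,
        home / ".config" / "skill-seekers" / "configs" / filename,
    ]


def resolve_local_config(
    name: str, cwd: Optional[Path] = None, home: Optional[Path] = None
) -> Optional[Path]:
    """Return the first candidate that exists, or None if none is found locally."""
    for path in candidate_config_paths(name, cwd, home):
        if path.is_file():
            return path
    return None  # A caller would fall back to the preset API at this point.
```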

Debugging Scraping Issues

No content extracted?

# Test selectors in Python
from bs4 import BeautifulSoup
import requests

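url = "https://docs.example.com/page"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail loudly on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.content, 'html.parser')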

# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
print(soup.select_one('.documentation-content'))

Categories not working?

  • Check categories in config has correct keywords
  • Run with --dry-run to see categorization without scraping
  • Enable debug mode: export SKILL_SEEKERS_DEBUG=1

Profiling Performance

# Profile scraping performance
python -m cProfile -o profile.stats -m skill_seekers.cli.doc_scraper --config configs/react.json --max-pages 10

# Analyze profile
python -m pstats profile.stats
# In pstats shell:
# > sort cumtime
# > stats 20

📖 Additional Documentation

Official Website: https://skillseekersweb.com/

For Users:

For Developers:

🎓 Understanding the Codebase

Why src/ Layout?

Modern Python packaging practice (used with the PEP 517/518 build configuration in pyproject.toml):

  • Prevents accidental imports from repo root
  • Forces proper package installation
  • Better isolation between package and tests
  • Required: pip install -e . before running tests

Why Platform Adaptors?

Strategy pattern benefits:

  • Single codebase supports 16 platforms (up from 4 in v2.x)
  • Platform-specific optimizations (format, APIs, models)
  • Easy to add new platforms (implement BaseAdaptor)
  • Clean separation of concerns
  • Testable in isolation
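
A stripped-down view of the strategy pattern behind the adaptors might look like the sketch below. Only the BaseAdaptor name comes from this document. The format_skill method, the two concrete adaptors, and the registry are hypothetical and much simpler than the real interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class BaseAdaptor(ABC):
    """Hypothetical minimal interface; the project's BaseAdaptor defines more."""

    @abstractmethod
    def format_skill(self, title: str, body: str) -> str:
        """Render one skill document in the platform's preferred format."""


class MarkdownAdaptor(BaseAdaptor):
    def format_skill(self, title: str, body: str) -> str:
        return f"# {title}\n\n{body}\n"


class PlainTextAdaptor(BaseAdaptor):
    def format_skill(self, title: str, body: str) -> str:
        return f"{title.upper()}\n{body}\n"


# Adding a platform means adding one class and one registry entry.
ADAPTORS: Dict[str, Type[BaseAdaptor]] = {
    "markdown": MarkdownAdaptor,
    "text": PlainTextAdaptor,
}


def get_adaptor(target: str) -> BaseAdaptor:
    """Look up and instantiate the adaptor registered for a target name."""
    try:
        return ADAPTORS[target]()
    except KeyError:
        raise ValueError(f"unknown target {target!r}; known: {sorted(ADAPTORS)}") from None
```

Callers depend only on get_adaptor() and the base interface, which is what keeps platform-specific logic out of the scrapers.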

Why Git-style CLI?

User experience benefits:

  • Familiar to developers (like git)
  • Single entry point: skill-seekers
  • Backward compatible: individual tools still work
  • Cleaner than multiple separate commands
  • Easier to document and teach

Three-Stream GitHub Architecture

The unified_codebase_analyzer.py splits GitHub repositories into three independent streams:

Stream 1: Code Analysis (C3.x features)

  • Deep AST parsing (9 languages)
  • Design pattern detection (C3.1)
  • Test example extraction (C3.2)
  • How-to guide generation (C3.3)
  • Configuration extraction (C3.4)
  • Architectural overview (C3.5)
  • API reference + dependency graphs

Stream 2: Documentation

  • README, CONTRIBUTING, LICENSE
  • docs/ directory markdown files
  • Wiki pages (if available)
  • CHANGELOG and version history

Stream 3: Community Insights

  • GitHub metadata (stars, forks, watchers)
  • Issue analysis (top problems and solutions)
  • PR trends and contributor stats
  • Release history
  • Label-based topic detection

Key Benefits:

  • Unified interface for GitHub URLs and local paths
  • Analysis depth control: 'basic' (1-2 min) or 'c3x' (20-60 min)
  • Enhanced router generation with GitHub context
  • Smart keyword extraction weighted by GitHub labels (2x weight)
  • 81 E2E tests passing (0.44 seconds)
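
One plausible reading of "keyword extraction weighted by GitHub labels (2x weight)" is that each label counts twice as much as a plain word occurrence. The sketch below implements that reading only. extract_keywords and its parameters are hypothetical, and the real scoring in unified_codebase_analyzer.py may work differently.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def extract_keywords(
    texts: Iterable[str],
    labels: Iterable[str],
    label_weight: float = 2.0,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score words by frequency, adding label_weight for every GitHub label."""
    scores: Counter = Counter()
    for text in texts:
        # Words of two or more characters; hyphens and underscores are kept.
        for word in re.findall(r"[a-z0-9][a-z0-9_-]+", text.lower()):
            scores[word] += 1.0
    for label in labels:
        scores[label.lower()] += label_weight
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```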

🔧 Helper Scripts

The scripts/ directory contains utility scripts:

# Bootstrap skill generation - self-hosting skill-seekers as a Claude skill
./scripts/bootstrap_skill.sh

# Start MCP server for HTTP transport
./scripts/start_mcp_server.sh

# Script templates are in scripts/skill_header.md

Bootstrap Skill Workflow:

  1. Analyzes skill-seekers codebase itself (dogfooding)
  2. Combines handcrafted header with auto-generated analysis
  3. Validates SKILL.md structure
  4. Outputs ready-to-use skill for Claude Code

🔍 Performance Characteristics

| Operation | Time | Notes |
|---|---|---|
| Scraping (sync) | 15-45 min | First time, thread-based |
| Scraping (async) | 5-15 min | 2-3x faster with --async |
| Building | 1-3 min | Fast rebuild from cache |
| Re-building | <1 min | With --skip-scrape |
| Enhancement (LOCAL) | 30-60 sec | Uses Claude Code Max |
| Enhancement (API) | 20-40 sec | Requires API key |
| Packaging | 5-10 sec | Final .zip creation |

🎉 Recent Achievements

v3.1.4 (Unreleased) - "Selector Fallback & Dry-Run Fix":

  • 🐛 Issue #300: create https://reactflow.dev/ only found 1 page — Now finds 20+ pages
  • 🔧 Centralized selector fallback - FALLBACK_MAIN_SELECTORS constant + _find_main_content() helper replace 3 duplicated fallback loops
  • 🔗 Link extraction before early return - extract_content() now discovers links even when no content selector matches
  • 🔍 Dry-run full-page link discovery — Both sync and async dry-run paths extract links from the full page (was main-content-only or missing entirely)
  • 🛣️ Smart create --config routing — Peeks at JSON to route base_url configs to doc_scraper and sources configs to unified_scraper
  • 🧹 Removed body fallback - body matched everything, hiding real selector failures
  • Pre-existing test fixes - test_auto_fetch_enabled (react.json exists locally) and test_mcp_validate_legacy_config (react.json is now unified format)
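
The config routing item above (base_url configs go to doc_scraper, sources configs go to unified_scraper) amounts to a small key check on the parsed JSON. The sketch below is illustrative only: route_config is a hypothetical name, and giving "sources" precedence when both keys are present is an assumption, not documented behavior.

```python
import json
from pathlib import Path
from typing import Union


def route_config(config_path: Union[str, Path]) -> str:
    """Peek at a JSON config and name the scraper that should handle it."""
    data = json.loads(Path(config_path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    # Assumption: a config with both keys is treated as unified.
    if "sources" in data:
        return "unified_scraper"
    if "base_url" in data:
        return "doc_scraper"
    raise ValueError("config has neither 'sources' nor 'base_url'")
```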

v3.1.3 (Released) - "Unified Argument Interface":

  • 🔧 Unified Scraper Arguments - All scrapers (scrape, github, analyze, pdf) now share a common argument contract via add_all_standard_arguments(parser) in arguments/common.py
  • 🐛 Fix create Argument Forwarding - create <url> --dry-run, create owner/repo --dry-run, create ./path --dry-run all work now (previously crashed)
  • 🏗️ Argument Deduplication - Removed duplicated arg definitions from github.py, scrape.py, analyze.py, pdf.py; all import shared args
  • New Flags - GitHub and PDF scrapers gain --dry-run, --verbose, --quiet; analyze gains --name, --description, --quiet
  • 🔀 Route-Specific Forwarding - create command's _add_common_args() now only forwards universal flags; route-specific flags moved to their respective methods
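
The idea of a shared argument contract can be sketched with plain argparse, as below. add_shared_arguments is a hypothetical stand-in for add_all_standard_arguments(parser), whose real flag set lives in arguments/common.py. The --pdf flag and the build_parser helper are likewise made up for the example.

```python
import argparse
from typing import Callable, Optional


def add_shared_arguments(parser: argparse.ArgumentParser) -> None:
    """Hypothetical stand-in for add_all_standard_arguments(parser)."""
    parser.add_argument("--dry-run", action="store_true", help="preview without writing output")
    parser.add_argument("--verbose", action="store_true", help="more logging")
    parser.add_argument("--quiet", action="store_true", help="less logging")


def build_parser(
    prog: str,
    add_specific: Optional[Callable[[argparse.ArgumentParser], object]] = None,
) -> argparse.ArgumentParser:
    """Create a scraper parser: shared flags first, then source-specific ones."""
    parser = argparse.ArgumentParser(prog=prog)
    add_shared_arguments(parser)
    if add_specific is not None:
        add_specific(parser)
    return parser


# Each scraper defines only the flags unique to it.
github_parser = build_parser("github", lambda p: p.add_argument("--repo"))
pdf_parser = build_parser("pdf", lambda p: p.add_argument("--pdf"))  # --pdf is illustrative
```

Because every parser is built through the same helper, a flag such as --dry-run added once is accepted by all scrapers, which is what lets create forward it safely.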

v3.1.0 - "Unified CLI & Developer Experience":

  • 🎯 Unified create Command - Auto-detects source type (web/GitHub/local/PDF/config)
  • 📋 Progressive Disclosure Help - Default shows 13 universal flags, detailed help available per source
  • -p Shortcut - Quick preset selection (-p quick|standard|comprehensive)
  • 🔧 Enhancement Flag Consolidation - --enhance-level (0-3) replaces 3 separate flags
  • 🎨 Smart Source Detection - No need to specify whether input is URL, repo, or directory
  • 🔄 Enhancement Workflow Presets - YAML-based presets; skill-seekers workflows list/show/copy/add/remove/validate; bundled presets: default, minimal, security-focus, architecture-comprehensive, api-documentation
  • 🔀 Multiple Workflows from CLI - --enhance-workflow wf-a --enhance-workflow wf-b chains presets in a single command; workflows copy/add/remove all accept multiple names/files at once
  • 🐛 Bug Fix - create command now correctly forwards multiple --enhance-workflow flags to sub-scrapers
  • 2,121 Tests Passing - All CLI refactor + workflow preset work verified
  • 📚 Improved Documentation - CLAUDE.md, README, QUICK_REFERENCE updated with workflow preset details
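
Accepting a repeated flag such as --enhance-workflow wf-a --enhance-workflow wf-b, and forwarding every occurrence to a sub-scraper, can be done with argparse's append action. This is a minimal sketch, not the project's code: forward_workflows and the enhance_workflows destination name are hypothetical.

```python
import argparse
from typing import List, Optional

parser = argparse.ArgumentParser(prog="create")
parser.add_argument(
    "--enhance-workflow",
    action="append",           # Collect every occurrence instead of keeping the last.
    dest="enhance_workflows",
    default=None,
)


def forward_workflows(workflows: Optional[List[str]]) -> List[str]:
    """Rebuild the argv fragment so a sub-scraper receives every workflow, in order."""
    argv: List[str] = []
    for name in workflows or []:
        argv += ["--enhance-workflow", name]
    return argv
```

Forwarding only the last value, or joining the names into one string, would silently drop or merge workflows, so each flag is re-emitted individually.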

v3.1.0 CI Stability (February 20, 2026):

  • 🔧 Dependency Alignment - Fixed MCP version mismatch between requirements.txt (was 1.18.0) and pyproject.toml (>=1.25)
  • 📦 PyYAML Core Dependency - Added PyYAML>=6.0 to core dependencies (required by workflow_tools.py module-level import)
  • Benchmark Stability - Relaxed timing-sensitive test thresholds for CI environment variability
  • 2,121 Tests Passing - All CI matrix jobs passing (ubuntu 3.10/3.11/3.12, macos 3.11/3.12)

v3.0.0 (February 10, 2026) - "Universal Intelligence Platform":

  • 🚀 16 Platform Adaptors - RAG frameworks (LangChain, LlamaIndex, Haystack), vector DBs (Chroma, FAISS, Weaviate, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), LLM platforms (Claude, Gemini, OpenAI)
  • 🛠️ 26 MCP Tools (up from 18) - Complete automation for any AI system
  • 1,852 Tests Passing (up from 700+) - Production-grade reliability
  • ☁️ Cloud Storage - S3, GCS, Azure Blob Storage integration
  • 🎯 AI Coding Assistants - Persistent context for Cursor, Windsurf, Cline, Continue.dev
  • 📊 Quality Metrics - Automated completeness scoring and content analysis
  • 🌐 Multilingual Support - Language detection and translation
  • 🔄 Streaming Ingest - Real-time data processing pipelines
  • 📈 Benchmarking Tools - Performance comparison and CI/CD integration
  • 🔧 Setup Wizard - Interactive first-time configuration
  • 📦 12 Example Projects - Complete working examples for every integration
  • 📚 18 Integration Guides - Comprehensive documentation for all platforms

v2.9.0 (February 3, 2026):

  • C3.10: Signal Flow Analysis - Complete signal flow analysis for Godot projects
  • Comprehensive Godot 4.x support (GDScript, .tscn, .tres, .gdshader files)
  • GDScript test extraction (GUT, gdUnit4, WAT frameworks)
  • Signal pattern detection (EventBus, Observer, Event Chains)
  • Signal-based how-to guides generation

v2.8.0 (February 1, 2026):

  • C3.9: Project Documentation Extraction
  • Granular AI enhancement control with --enhance-level (0-3)

v2.7.1 (January 18, 2026 - Hotfix):

  • 🚨 Critical Bug Fix: Config download 404 errors resolved
  • Fixed manual URL construction bug - now uses download_url from API response
  • All 15 source tools tests + 8 fetch_config tests passing

v2.7.0 (January 18, 2026):

  • 🔐 Smart Rate Limit Management - Multi-token GitHub configuration system
  • 🧙 Interactive Configuration Wizard - Beautiful terminal UI (skill-seekers config)
  • 🚦 Intelligent Rate Limit Handler - Four strategies (prompt/wait/switch/fail)
  • 📥 Resume Capability - Continue interrupted jobs with progress tracking
  • 🔧 CI/CD Support - Non-interactive mode for automation
  • 🎯 Bootstrap Skill - Self-hosting skill-seekers as Claude Code skill

v2.6.0 (January 14, 2026):

  • C3.x Codebase Analysis Suite Complete (C3.1-C3.8)
  • Multi-platform support with platform adaptor architecture (4 platforms)
  • 18 MCP tools fully functional
  • 700+ tests passing
  • Unified multi-source scraping maturity

C3.x Series (Complete - Code Analysis Features):

  • C3.1: Design pattern detection (10 GoF patterns, 9 languages, 87% precision)
  • C3.2: Test example extraction (5 categories, AST-based for Python)
  • C3.3: How-to guide generation with AI enhancement (5 improvements)
  • C3.4: Configuration pattern extraction (env vars, config files, CLI args)
  • C3.5: Architectural overview & router skill generation
  • C3.6: AI enhancement for patterns and test examples (Claude API integration)
  • C3.7: Architectural pattern detection (8 patterns, framework-aware)
  • C3.8: Standalone codebase scraper (300+ line SKILL.md from code alone)
  • C3.9: Project documentation extraction (markdown categorization, AI enhancement)
  • C3.10: Signal flow analysis (Godot event-driven architecture, pattern detection)

v2.5.2:

  • UX Improvement: Analysis features now default ON with --skip-* flags (BREAKING)
  • Router quality improvements: 6.5/10 → 8.5/10 (+31%)
  • All 107 codebase analysis tests passing

v2.5.0:

  • Multi-platform support (Claude, Gemini, OpenAI, Markdown)
  • Platform adaptor architecture
  • 18 MCP tools (up from 9)
  • Complete feature parity across platforms

v2.1.0:

  • Unified multi-source scraping (docs + GitHub + PDF)
  • Conflict detection between sources
  • 427 tests passing

v1.0.0:

  • Production release with MCP integration
  • Documentation scraping with smart categorization
  • 12 preset configurations