skill-seekers-reference/CLAUDE.md
yusyus 2e30970dfb feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
2026-03-15 02:34:41 +03:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## 🎯 Project Overview
**Skill Seekers** is the **universal documentation preprocessor** for AI systems. It transforms documentation websites, GitHub repositories, PDFs, and EPUBs into production-ready formats for **16+ platforms**: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).
**Current Version:** v3.1.3
**Python Version:** 3.10+ required
**Status:** Production-ready, published on PyPI
**Website:** https://skillseekersweb.com/ - Browse configs, share, and access documentation
## 📚 Table of Contents
- [First Time Here?](#-first-time-here) - Start here!
- [Quick Commands](#-quick-command-reference-most-used) - Common workflows
- [Architecture](#-architecture) - How it works
- [Development](#-development-commands) - Building & testing
- [Testing](#-testing-guidelines) - Test strategy
- [Debugging](#-debugging-tips) - Troubleshooting
- [Contributing](#-where-to-make-changes) - How to add features
## 👋 First Time Here?
**Complete this 3-minute setup to start contributing:**
```bash
# 1. Install package in editable mode (REQUIRED for development)
pip install -e .
# 2. Verify installation
python -c "import skill_seekers; print(skill_seekers.__version__)" # Should print the installed version (e.g. 3.1.0-dev)
# 3. Run a quick test
pytest tests/test_scraper_features.py::test_detect_language -v
# 4. You're ready! Pick a task from the roadmap:
# https://github.com/users/yusufkaraaslan/projects/2
```
**Quick Navigation:**
- Building/Testing → [Development Commands](#-development-commands)
- Architecture → [Core Design Pattern](#-architecture)
- Common Issues → [Common Pitfalls](#-common-pitfalls--solutions)
- Contributing → See `CONTRIBUTING.md`
## ⚡ Quick Command Reference (Most Used)
**First time setup:**
```bash
pip install -e . # REQUIRED before running tests or CLI
```
**Running tests (NEVER skip - user requirement):**
```bash
pytest tests/ -v # All tests
pytest tests/test_scraper_features.py -v # Single file
pytest tests/ --cov=src/skill_seekers --cov-report=html # With coverage
```
**Code quality checks (matches CI):**
```bash
ruff check src/ tests/ # Lint
ruff format src/ tests/ # Format
mypy src/skill_seekers # Type check
```
**Common workflows:**
```bash
# NEW unified create command (auto-detects source type)
skill-seekers create https://docs.react.dev/ -p quick
skill-seekers create facebook/react -p standard
skill-seekers create ./my-project -p comprehensive
skill-seekers create tutorial.pdf
# Legacy commands (still supported)
skill-seekers scrape --config configs/react.json
skill-seekers github --repo facebook/react
skill-seekers analyze --directory . --comprehensive
# Package for LLM platforms
skill-seekers package output/react/ --target claude
skill-seekers package output/react/ --target gemini
```
**RAG Pipeline workflows:**
```bash
# LangChain Documents
skill-seekers package output/react/ --format langchain
# LlamaIndex TextNodes
skill-seekers package output/react/ --format llama-index
# Haystack Documents
skill-seekers package output/react/ --format haystack
# ChromaDB direct upload
skill-seekers package output/react/ --format chroma --upload
# FAISS export
skill-seekers package output/react/ --format faiss
# Weaviate/Qdrant upload (requires API keys)
skill-seekers package output/react/ --format weaviate --upload
skill-seekers package output/react/ --format qdrant --upload
```
**AI Coding Assistant workflows:**
```bash
# Cursor IDE
skill-seekers package output/react/ --target claude
cp output/react-claude/SKILL.md .cursorrules
# Windsurf
cp output/react-claude/SKILL.md .windsurf/rules/react.md
# Cline (VS Code)
cp output/react-claude/SKILL.md .clinerules
# Continue.dev (universal IDE)
python examples/continue-dev-universal/context_server.py
# Configure in ~/.continue/config.json
```
**Cloud Storage:**
```bash
# Upload to S3
skill-seekers cloud upload --provider s3 --bucket my-skills output/react.zip
# Upload to GCS
skill-seekers cloud upload --provider gcs --bucket my-skills output/react.zip
# Upload to Azure
skill-seekers cloud upload --provider azure --container my-skills output/react.zip
```
## 🏗️ Architecture
### Core Design Pattern: Platform Adaptors
The codebase uses the **Strategy Pattern** with a factory method to support **16 platforms**, grouped as shown below:
```
src/skill_seekers/cli/adaptors/
├── __init__.py # Factory: get_adaptor(target/format)
├── base.py # Abstract base class
# LLM Platforms (3)
├── claude.py # Claude AI (ZIP + YAML)
├── gemini.py # Google Gemini (tar.gz)
├── openai.py # OpenAI ChatGPT (ZIP + Vector Store)
# RAG Frameworks (3)
├── langchain.py # LangChain Documents
├── llama_index.py # LlamaIndex TextNodes
├── haystack.py # Haystack Documents
# Vector Databases (5)
├── chroma.py # ChromaDB
├── faiss_helpers.py # FAISS
├── qdrant.py # Qdrant
├── weaviate.py # Weaviate
# AI Coding Assistants (4 - via Claude format + config files)
# - Cursor, Windsurf, Cline, Continue.dev
# Generic (1)
├── markdown.py # Generic Markdown (ZIP)
└── streaming_adaptor.py # Streaming data ingest
```
**Key Methods:**
- `package(skill_dir, output_path)` - Platform-specific packaging
- `upload(package_path, api_key)` - Platform-specific upload (where applicable)
- `enhance(skill_dir, mode)` - AI enhancement with platform-specific models
- `export(skill_dir, format)` - Export to RAG/vector DB formats
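
The Strategy-plus-factory arrangement described above can be sketched as follows. This is a minimal illustration, not the project's code: the class names, the `_ADAPTORS` registry, and the archive naming are assumptions, and only `package()` is shown.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseAdaptor(ABC):
    """Hypothetical abstract base; each platform supplies its own strategy."""

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> Path:
        """Return the path of the archive this platform would produce."""


class ClaudeAdaptor(BaseAdaptor):
    def package(self, skill_dir: str, output_path: str) -> Path:
        name = Path(skill_dir).name
        return Path(output_path) / f"{name}-claude.zip"


class GeminiAdaptor(BaseAdaptor):
    def package(self, skill_dir: str, output_path: str) -> Path:
        name = Path(skill_dir).name
        return Path(output_path) / f"{name}-gemini.tar.gz"


# Registry mapping target names to strategy classes (illustrative subset).
_ADAPTORS: dict[str, type[BaseAdaptor]] = {
    "claude": ClaudeAdaptor,
    "gemini": GeminiAdaptor,
}


def get_adaptor(target: str) -> BaseAdaptor:
    """Factory: look up the strategy for a target and instantiate it."""
    try:
        return _ADAPTORS[target]()
    except KeyError:
        raise ValueError(f"Unknown target: {target!r}") from None
```

Callers only depend on `get_adaptor()` and the base interface, so adding a platform in this sketch means adding one class and one registry entry.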
### Data Flow (5 Phases)
1. **Scrape Phase** (`doc_scraper.py:scrape_all()`)
   - BFS traversal from base_url
   - Output: `output/{name}_data/pages/*.json`
2. **Build Phase** (`doc_scraper.py:build_skill()`)
   - Load pages → Categorize → Extract patterns
   - Output: `output/{name}/SKILL.md` + `references/*.md`
3. **Enhancement Phase** (optional, `enhance_skill_local.py`)
   - LLM analyzes references → Rewrites SKILL.md
   - Platform-specific models (Sonnet 4, Gemini 2.0, GPT-4o)
4. **Package Phase** (`package_skill.py` → adaptor)
   - Platform adaptor packages in appropriate format
   - Output: `.zip` or `.tar.gz`
5. **Upload Phase** (optional, `upload_skill.py` → adaptor)
   - Upload via platform API
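
The BFS traversal in the Scrape Phase can be illustrated with a small sketch. It is not the code in `scrape_all()`: the function name `bfs_crawl`, the `get_links` callback, and the toy `SITE` graph are assumptions that replace real HTTP fetching so the example stays self-contained.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    base_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 500,
) -> list[str]:
    """Visit pages breadth-first, staying under base_url and max_pages."""
    visited: list[str] = []
    seen = {base_url}
    queue = deque([base_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Only follow unseen links that stay inside the documentation site.
            if link.startswith(base_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real HTTP fetches.
SITE = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://docs.example.com/b"],
    "https://docs.example.com/a": ["https://docs.example.com/c", "https://other.example.org/"],
    "https://docs.example.com/b": ["https://docs.example.com/a"],
    "https://docs.example.com/c": [],
}

order = bfs_crawl("https://docs.example.com/", lambda u: SITE.get(u, []))
```

Because the queue is first-in, first-out, pages closer to the base URL are visited before deeper ones, and the `seen` set prevents revisiting pages that link to each other.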
### File Structure (src/ layout) - Key Files Only
```
src/skill_seekers/
├── cli/ # All CLI commands
│ ├── main.py # ⭐ Git-style CLI dispatcher
│ ├── doc_scraper.py # ⭐ Main scraper (~790 lines)
│ │ ├── scrape_all() # BFS traversal engine
│ │ ├── smart_categorize() # Category detection
│ │ └── build_skill() # SKILL.md generation
│ ├── github_scraper.py # GitHub repo analysis
│ ├── codebase_scraper.py # ⭐ Local analysis (C2.x+C3.x)
│ ├── package_skill.py # Platform packaging
│ ├── unified_scraper.py # Multi-source scraping
│ ├── unified_codebase_analyzer.py # Three-stream GitHub+local analyzer
│ ├── enhance_skill_local.py # AI enhancement (LOCAL mode)
│ ├── enhance_status.py # Enhancement status monitoring
│ ├── upload_skill.py # Upload to platforms
│ ├── install_skill.py # Complete workflow automation
│ ├── install_agent.py # Install to AI agent directories
│ ├── pattern_recognizer.py # C3.1 Design pattern detection
│ ├── test_example_extractor.py # C3.2 Test example extraction
│ ├── how_to_guide_builder.py # C3.3 How-to guide generation
│ ├── config_extractor.py # C3.4 Configuration extraction
│ ├── generate_router.py # C3.5 Router skill generation
│ ├── code_analyzer.py # Multi-language code analysis
│ ├── api_reference_builder.py # API documentation builder
│ ├── dependency_analyzer.py # Dependency graph analysis
│ ├── signal_flow_analyzer.py # C3.10 Signal flow analysis (Godot)
│ ├── pdf_scraper.py # PDF extraction
│ ├── epub_scraper.py # EPUB extraction
│ └── adaptors/ # ⭐ Platform adaptor pattern
│ ├── __init__.py # Factory: get_adaptor()
│ ├── base_adaptor.py # Abstract base
│ ├── claude_adaptor.py # Claude AI
│ ├── gemini_adaptor.py # Google Gemini
│ ├── openai_adaptor.py # OpenAI ChatGPT
│ ├── markdown_adaptor.py # Generic Markdown
│ ├── langchain.py # LangChain RAG
│ ├── llama_index.py # LlamaIndex RAG
│ ├── haystack.py # Haystack RAG
│ ├── chroma.py # ChromaDB
│ ├── faiss_helpers.py # FAISS
│ ├── qdrant.py # Qdrant
│ ├── weaviate.py # Weaviate
│ └── streaming_adaptor.py # Streaming data ingest
└── mcp/ # MCP server (26 tools)
    ├── server_fastmcp.py # FastMCP server
    └── tools/ # Tool implementations
```
**Most Modified Files (when contributing):**
- Platform adaptors: `src/skill_seekers/cli/adaptors/{platform}.py`
- Tests: `tests/test_{feature}.py`
- Configs: `configs/{framework}.json`
## 🛠️ Development Commands
### Setup
```bash
# Install in editable mode (required before tests due to src/ layout)
pip install -e .
# Install with all platform dependencies
pip install -e ".[all-llms]"
# Install specific platforms
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[openai]" # OpenAI ChatGPT
```
### Running Tests
**CRITICAL: Never skip tests** - User requires all tests to pass before commits.
```bash
# All tests (must run pip install -e . first!)
pytest tests/ -v
# Specific test file
pytest tests/test_scraper_features.py -v
# Multi-platform tests
pytest tests/test_install_multiplatform.py -v
# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html
# Single test
pytest tests/test_scraper_features.py::test_detect_language -v
# MCP server tests
pytest tests/test_mcp_fastmcp.py -v
```
**Test Architecture:**
- 46 test files covering all features
- CI Matrix: Ubuntu + macOS, Python 3.10-3.13
- **2,540 tests passing** (current), up from 700+ in v2.x
- Must run `pip install -e .` before tests (src/ layout requirement)
- Tests include create command integration tests, CLI refactor E2E tests
### Building & Publishing
```bash
# Build package (using uv - recommended)
uv build
# Or using build
python -m build
# Publish to PyPI
uv publish
# Or using twine
python -m twine upload dist/*
```
### Testing CLI Commands
```bash
# Test configuration wizard (NEW: v2.7.0)
skill-seekers config --show # Show current configuration
skill-seekers config --github # GitHub token setup
skill-seekers config --test # Test connections
# Test resume functionality (NEW: v2.7.0)
skill-seekers resume --list # List resumable jobs
skill-seekers resume --clean # Clean up old jobs
# Test GitHub scraping with profiles (NEW: v2.7.0)
skill-seekers github --repo facebook/react --profile personal # Use specific profile
skill-seekers github --repo owner/repo --non-interactive # CI/CD mode
# Test scraping (dry run)
skill-seekers scrape --config configs/react.json --dry-run
# Test codebase analysis (C2.x features)
skill-seekers analyze --directory . --output output/codebase/
# Test pattern detection (C3.1)
skill-seekers patterns --file src/skill_seekers/cli/code_analyzer.py
# Test how-to guide generation (C3.3)
skill-seekers how-to-guides output/test_examples.json --output output/guides/
# Test enhancement status monitoring
skill-seekers enhance-status output/react/ --watch
# Video setup (auto-detect GPU and install deps)
skill-seekers video --setup
# Test multi-platform packaging
skill-seekers package output/react/ --target gemini --dry-run
# Test MCP server (stdio mode)
python -m skill_seekers.mcp.server_fastmcp
# Test MCP server (HTTP mode)
python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
```
### New v3.0.0 CLI Commands
```bash
# Setup wizard (interactive configuration)
skill-seekers-setup
# Cloud storage operations
skill-seekers cloud upload --provider s3 --bucket my-bucket output/react.zip
skill-seekers cloud download --provider gcs --bucket my-bucket react.zip
skill-seekers cloud list --provider azure --container my-container
# Embedding server (for RAG pipelines)
skill-seekers embed --port 8080 --model sentence-transformers
# Sync & incremental updates
skill-seekers sync --source https://docs.react.dev/ --target output/react/
skill-seekers update --skill output/react/ --check-changes
# Quality metrics & benchmarking
skill-seekers quality --skill output/react/ --report
skill-seekers benchmark --config configs/react.json --compare-versions
# Multilingual support
skill-seekers multilang --detect output/react/
skill-seekers multilang --translate output/react/ --target zh-CN
# Streaming data ingest
skill-seekers stream --source docs/ --target output/streaming/
```
## 🔧 Key Implementation Details
### CLI Architecture (Git-style)
**Entry point:** `src/skill_seekers/cli/main.py`
The unified CLI modifies `sys.argv` and calls existing `main()` functions to maintain backward compatibility:
```python
# Example: skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
```
**Subcommands:** create, scrape, github, pdf, epub, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides
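
A Git-style dispatcher that rewrites `sys.argv` before delegating could look roughly like the sketch below. The `COMMANDS` table, `scrape_main()`, and `dispatch()` are hypothetical stand-ins, not the contents of `main.py`.

```python
import sys
from typing import Callable


def scrape_main() -> int:
    """Stand-in for a legacy tool's main(); it reads sys.argv itself."""
    print(f"scrape called with {sys.argv[1:]}")
    return 0


# Hypothetical subcommand table; a real dispatcher would list every tool.
COMMANDS: dict[str, Callable[[], int]] = {"scrape": scrape_main}


def dispatch(argv: list[str]) -> int:
    """Route `prog <subcommand> [args...]` to the matching legacy main()."""
    if len(argv) < 2 or argv[1] not in COMMANDS:
        choices = ", ".join(COMMANDS)
        print(f"usage: <prog> <subcommand> [args...] (choices: {choices})")
        return 2
    subcommand, rest = argv[1], argv[2:]
    saved = sys.argv
    # Make the legacy tool believe it was invoked directly.
    sys.argv = [f"{argv[0]}-{subcommand}", *rest]
    try:
        return COMMANDS[subcommand]()
    finally:
        # Restore the original arguments for the caller.
        sys.argv = saved
```

Rewriting `sys.argv` lets each existing `main()` keep its own argument parser unchanged, which is the backward-compatibility benefit the text describes.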
### NEW: Unified `create` Command
**The recommended way to create skills** - Auto-detects source type and provides progressive help disclosure:
```bash
# Auto-detection examples
skill-seekers create https://docs.react.dev/ # → Web scraping
skill-seekers create facebook/react # → GitHub analysis
skill-seekers create ./my-project # → Local codebase
skill-seekers create tutorial.pdf # → PDF extraction
skill-seekers create book.epub # → EPUB extraction
skill-seekers create configs/react.json # → Multi-source
# Progressive help system
skill-seekers create --help # Shows universal args only (13 flags)
skill-seekers create --help-web # Shows web-specific options
skill-seekers create --help-github # Shows GitHub-specific options
skill-seekers create --help-local # Shows local analysis options
skill-seekers create --help-pdf # Shows PDF extraction options
skill-seekers create --help-epub # Shows EPUB extraction options
skill-seekers create --help-advanced # Shows advanced/rare options
skill-seekers create --help-all # Shows all 120+ flags
# Universal flags work for ALL sources
skill-seekers create <source> -p quick # Preset (-p shortcut)
skill-seekers create <source> --enhance-level 2 # AI enhancement (0-3)
skill-seekers create <source> --chunk-for-rag # RAG chunking
skill-seekers create <source> --dry-run # Preview
```
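
One way the source-type auto-detection above could work is a chain of cheap checks on the argument's shape. The sketch below is an assumption about the approach, not the `create` command's actual logic; `detect_source_type` and its return labels are hypothetical.

```python
import re
from pathlib import Path


def detect_source_type(source: str) -> str:
    """Guess the source type from the shape of the argument (heuristic)."""
    if re.match(r"^https?://", source):
        return "web"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".epub":
        return "epub"
    if suffix == ".json":
        return "config"
    # Paths that look local (./x, ../x, /x, ~/x) are treated as codebases.
    if source.startswith((".", "/", "~")):
        return "local"
    # owner/repo shorthand, e.g. facebook/react
    if re.fullmatch(r"[\w.-]+/[\w.-]+", source):
        return "github"
    return "local"
```

The order of the checks matters in this sketch: file extensions are tested before the `owner/repo` pattern so that `configs/react.json` is treated as a config file rather than a repository.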
**Key improvements:**
- **Single command** replaces scrape/github/analyze for most use cases
- **Smart detection** - No need to specify source type
- **Progressive disclosure** - Default help shows 13 flags, detailed help available
- **-p shortcut** - Quick preset selection (`-p quick|standard|comprehensive`)
- **Universal features** - RAG chunking, dry-run, presets work everywhere
**Recent Additions:**
- `create` - **NEW:** Unified command with auto-detection and progressive help
- `codebase` - Local codebase analysis without GitHub API (C2.x + C3.x features)
- `enhance-status` - Monitor background/daemon enhancement processes
- `patterns` - Detect design patterns in code (C3.1)
- `how-to-guides` - Generate educational guides from tests (C3.3)
### Platform Adaptor Usage
```python
import os

from skill_seekers.cli.adaptors import get_adaptor

# Get platform-specific adaptor
adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'markdown'

# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')

# Upload to platform (os.getenv returns None if the variable is unset)
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY'),
)

# AI enhancement
adaptor.enhance(skill_dir='output/react/', mode='api')
```
### C3.x Codebase Analysis Features
The project has comprehensive codebase analysis capabilities (C3.1-C3.10):
**C3.1 Design Pattern Detection** (`pattern_recognizer.py`):
- Detects 10 common patterns: Singleton, Factory, Observer, Strategy, Decorator, Builder, Adapter, Command, Template Method, Chain of Responsibility
- Supports 9 languages: Python, JavaScript, TypeScript, C++, C, C#, Go, Rust, Java
- Three detection levels: surface (fast), deep (balanced), full (thorough)
- 87% precision, 80% recall on real-world projects
**C3.2 Test Example Extraction** (`test_example_extractor.py`):
- Extracts real usage examples from test files
- Categories: instantiation, method_call, config, setup, workflow
- AST-based for Python, regex-based for 8 other languages
- Quality filtering with confidence scoring
**C3.3 How-To Guide Generation** (`how_to_guide_builder.py`):
- Transforms test workflows into educational guides
- 5 AI enhancements: step descriptions, troubleshooting, prerequisites, next steps, use cases
- Dual-mode AI: API (fast) or LOCAL (free with Claude Code Max)
- 4 grouping strategies: AI tutorial group, file path, test name, complexity
**C3.4 Configuration Pattern Extraction** (`config_extractor.py`):
- Extracts configuration patterns from codebases
- Identifies config files, env vars, CLI arguments
- AI enhancement for better organization
**C3.5 Architectural Overview** (`generate_router.py`):
- Generates comprehensive ARCHITECTURE.md files
- Router skill generation for large documentation
- Quality improvements: 6.5/10 → 8.5/10 (+31%)
- Integrates GitHub metadata, issues, labels
**C3.6 AI Enhancement** (Claude API integration):
- Enhances C3.1-C3.5 with AI-powered insights
- Pattern explanations and improvement suggestions
- Test example context and best practices
- Guide enhancement with troubleshooting and prerequisites
**C3.7 Architectural Pattern Detection** (`architectural_pattern_detector.py`):
- Detects 8 architectural patterns (MVC, MVVM, MVP, Repository, etc.)
- Framework detection (Django, Flask, Spring, React, Angular, etc.)
- Multi-file analysis with directory structure patterns
- Evidence-based detection with confidence scoring
**C3.8 Standalone Codebase Scraper** (`codebase_scraper.py`):
```bash
# Quick analysis (1-2 min, basic features only)
skill-seekers analyze --directory /path/to/repo --quick
# Comprehensive analysis (20-60 min, all features + AI)
skill-seekers analyze --directory . --comprehensive
# With AI enhancement (auto-detects API or LOCAL)
skill-seekers analyze --directory . --enhance
# Granular AI enhancement control (NEW)
skill-seekers analyze --directory . --enhance-level 1 # SKILL.md only
skill-seekers analyze --directory . --enhance-level 2 # + Architecture + Config + Docs
skill-seekers analyze --directory . --enhance-level 3 # Full enhancement (all features)
# Disable specific features
skill-seekers analyze --directory . --skip-patterns --skip-how-to-guides
```
- Generates 300+ line standalone SKILL.md files from codebases
- All C3.x features integrated (patterns, tests, guides, config, architecture, docs)
- Complete codebase analysis without documentation scraping
- **NEW**: Granular AI enhancement control with `--enhance-level` (0-3)
**C3.9 Project Documentation Extraction** (`codebase_scraper.py`):
- Extracts and categorizes all markdown files from the project
- Auto-detects categories: overview, architecture, guides, workflows, features, etc.
- Integrates documentation into SKILL.md with summaries
- AI enhancement (level 2+) adds topic extraction and cross-references
- Controlled by depth: surface=raw copy, deep=parse+summarize, full=AI-enhanced
- Default ON, use `--skip-docs` to disable
**C3.10 Signal Flow Analysis for Godot Projects** (`signal_flow_analyzer.py`):
- Complete signal flow analysis system for event-driven Godot architectures
- Signal declaration extraction (detects `signal` keyword declarations)
- Connection mapping (tracks `.connect()` calls with targets and methods)
- Emission tracking (finds `.emit()` and `emit_signal()` calls)
- Real-world metrics: 208 signals, 634 connections, 298 emissions in test project
- Signal density metrics (signals per file)
- Event chain detection (signals triggering other signals)
- Signal pattern detection:
  - **EventBus Pattern** (0.90 confidence): Centralized signal hub in autoload
  - **Observer Pattern** (0.85 confidence): Multi-observer signals (3+ listeners)
  - **Event Chains** (0.80 confidence): Cascading signal propagation
- Signal-based how-to guides (C3.10.1):
  - AI-generated step-by-step usage guides (Connect → Emit → Handle)
  - Real code examples from project
  - Common usage locations
  - Parameter documentation
- Outputs: `signal_flow.json`, `signal_flow.mmd` (Mermaid diagram), `signal_reference.md`, `signal_how_to_guides.md`
- Comprehensive Godot 4.x support:
  - GDScript (.gd), Scene files (.tscn), Resources (.tres), Shaders (.gdshader)
  - GDScript test extraction (GUT, gdUnit4, WAT frameworks)
  - 396 test cases extracted in test project
- Framework detection (Unity, Unreal, Godot)
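
The declaration, connection, and emission steps listed above can be approximated with regular expressions. The sketch below is a simplified assumption, not the logic in `signal_flow_analyzer.py`: the patterns and `scan_gdscript()` are hypothetical, and they only cover Godot 4 style `signal_name.connect(...)` calls.

```python
import re

# `signal name` declarations at the start of a line.
SIGNAL_DECL = re.compile(r"^\s*signal\s+(\w+)", re.MULTILINE)
# `name.connect(` calls (Godot 4 style only).
CONNECT_CALL = re.compile(r"(\w+)\.connect\(")
# `name.emit(` calls, or the older `emit_signal("name")` form.
EMIT_CALL = re.compile(r"(\w+)\.emit\(|emit_signal\(\s*[\"'](\w+)[\"']")


def scan_gdscript(source: str) -> dict[str, list[str]]:
    """Collect signal names by role from one GDScript source string."""
    declared = SIGNAL_DECL.findall(source)
    connected = CONNECT_CALL.findall(source)
    # Each match fills exactly one of the two groups; keep whichever is set.
    emitted = [a or b for a, b in EMIT_CALL.findall(source)]
    return {"declared": declared, "connected": connected, "emitted": emitted}


SAMPLE = '''
extends Node
signal health_changed(new_value)
signal died

func _ready():
    health_changed.connect(_on_health_changed)

func hit():
    health_changed.emit(10)
    emit_signal("died")
'''

result = scan_gdscript(SAMPLE)
```

Counting the entries per file would give a rough signal density, and matching emitted names against connected ones is a starting point for the event-chain detection mentioned above.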
**Key Architecture Decision (BREAKING in v2.5.2):**
- Changed from opt-in (`--build-*`) to opt-out (`--skip-*`) flags
- All analysis features now ON by default for maximum value
- Backward compatibility warnings for deprecated flags
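
The opt-in to opt-out switch can be sketched with `argparse`. This is an illustration under assumptions: `--skip-patterns` appears in this guide, while `--build-patterns` is used here as a hypothetical example of a deprecated `--build-*` flag, and `parse_flags()` is not a function from the codebase.

```python
import argparse
import warnings


def parse_flags(argv: list[str]) -> argparse.Namespace:
    """Parse opt-out flags and warn about a deprecated opt-in flag."""
    parser = argparse.ArgumentParser(prog="analyze-sketch")
    # Opt-out: the feature runs unless the user skips it.
    parser.add_argument("--skip-patterns", action="store_true")
    # Deprecated opt-in flag: still accepted, hidden from --help.
    parser.add_argument("--build-patterns", action="store_true", help=argparse.SUPPRESS)
    args = parser.parse_args(argv)
    if args.build_patterns:
        warnings.warn(
            "--build-patterns is deprecated; pattern detection is on by default",
            DeprecationWarning,
            stacklevel=2,
        )
    # Derived setting used by the rest of the (hypothetical) pipeline.
    args.patterns_enabled = not args.skip_patterns
    return args
```

Keeping the old flag as a hidden no-op means existing scripts keep working while users are nudged toward the new defaults.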
### Smart Categorization Algorithm
Located in `doc_scraper.py:smart_categorize()`:
- Scores pages against category keywords
- 3 points for URL match, 2 for title, 1 for content
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category
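
The scoring rules above can be written out as a short sketch. It is not the body of `smart_categorize()`: the helper names are hypothetical, and it assumes that points accumulate per matching keyword and that matching is a case-insensitive substring test.

```python
def score_page(url: str, title: str, content: str, keywords: list[str]) -> int:
    """3 points per keyword found in the URL, 2 in the title, 1 in the content."""
    score = 0
    for kw in (k.lower() for k in keywords):
        if kw in url.lower():
            score += 3
        if kw in title.lower():
            score += 2
        if kw in content.lower():
            score += 1
    return score


def categorize(page: dict[str, str], categories: dict[str, list[str]], threshold: int = 2) -> str:
    """Return the best-scoring category, or "other" below the threshold."""
    best_name, best_score = "other", 0
    for name, keywords in categories.items():
        s = score_page(page["url"], page["title"], page["content"], keywords)
        if s > best_score:
            best_name, best_score = name, s
    return best_name if best_score >= threshold else "other"


CATEGORIES = {
    "getting_started": ["intro", "quickstart"],
    "api": ["api", "reference"],
}
api_page = {
    "url": "https://docs.example.com/api/hooks",
    "title": "Hooks reference",
    "content": "Details.",
}
weak_page = {
    "url": "https://docs.example.com/misc",
    "title": "Notes",
    "content": "a short intro",
}
```

With a threshold of 2, a single content-only match (1 point) is not enough to categorize a page, which is why `weak_page` would fall back to "other" in this sketch.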
### Language Detection
Located in `doc_scraper.py:detect_language()`:
1. CSS class attributes (`language-*`, `lang-*`)
2. Heuristics (keywords like `def`, `const`, `func`)
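
The two-step detection order can be sketched as below. This is a simplified assumption rather than the real `detect_language()`: `guess_language`, its signature, and the keyword patterns are hypothetical.

```python
import re

# Keyword heuristics used only when no CSS class hint exists (illustrative).
_HEURISTICS = [
    ("python", re.compile(r"^\s*def \w+\(.*\):", re.MULTILINE)),
    ("go", re.compile(r"^\s*func \w+\(", re.MULTILINE)),
    ("javascript", re.compile(r"^\s*(const|let) \w+\s*=", re.MULTILINE)),
]


def guess_language(css_classes: list[str], code: str) -> str:
    """Prefer `language-*` / `lang-*` classes, then fall back to keywords."""
    for cls in css_classes:
        match = re.fullmatch(r"(?:language|lang)-([\w+#-]+)", cls)
        if match:
            return match.group(1).lower()
    for language, pattern in _HEURISTICS:
        if pattern.search(code):
            return language
    return "unknown"
```

The class hint is checked first because it is explicit markup from the documentation site, while keyword heuristics can misfire on short snippets.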
### Configuration File Structure
Configs (`configs/*.json`) define scraping behavior. The values under `selectors` are CSS selectors:
```json
{
  "name": "framework-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs"],
    "exclude": ["/blog"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
```
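
To show how `base_url` and `url_patterns` might combine into a scrape decision, here is a small sketch. The function `should_scrape` is hypothetical, and it assumes patterns are matched as plain substrings of the URL path, which may differ from the scraper's real matching rules.

```python
from urllib.parse import urlparse

CONFIG = {
    "base_url": "https://docs.example.com/",
    "url_patterns": {"include": ["/docs"], "exclude": ["/blog"]},
}


def should_scrape(url: str, config: dict) -> bool:
    """Apply base_url, then exclude patterns, then include patterns."""
    if not url.startswith(config["base_url"]):
        return False
    path = urlparse(url).path
    patterns = config.get("url_patterns", {})
    if any(p in path for p in patterns.get("exclude", [])):
        return False
    include = patterns.get("include", [])
    # An empty include list is treated as "allow everything not excluded".
    return not include or any(p in path for p in include)
```

Checking exclusions before inclusions means a page that matches both lists is skipped in this sketch.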
## 🧪 Testing Guidelines
### Test Coverage Requirements
- Core features: 100% coverage required
- Platform adaptors: Each platform has dedicated tests
- MCP tools: All 18 tools must be tested
- Integration tests: End-to-end workflows
### Test Markers (from `[tool.pytest.ini_options]` in `pyproject.toml`)
The project uses pytest markers to categorize tests:
```bash
# Run only fast unit tests (default)
pytest tests/ -v
# Run only tests marked slow (>5 seconds)
pytest tests/ -v -m slow
# Run integration tests (requires external services)
pytest tests/ -v -m integration
# Run end-to-end tests (resource-intensive, creates files)
pytest tests/ -v -m e2e
# Run tests requiring virtual environment setup
pytest tests/ -v -m venv
# Run bootstrap feature tests
pytest tests/ -v -m bootstrap
# Skip slow and integration tests (fastest)
pytest tests/ -v -m "not slow and not integration"
```
### Test Execution Strategy
**By default, only fast tests run**. Use markers to control test execution:
```bash
# Default: Only fast tests (skip slow/integration/e2e)
pytest tests/ -v
# Run only tests marked slow (>5 seconds)
pytest tests/ -v -m slow
# Run only integration tests (requires external services)
pytest tests/ -v -m integration
# Run only resource-intensive e2e tests (creates files)
pytest tests/ -v -m e2e
# Run ONLY fast tests (explicit)
pytest tests/ -v -m "not slow and not integration and not e2e"
# Run everything (CI does this)
pytest tests/ -v -m ""
```
**When to use which:**
- **Local development:** Default (fast tests only) - `pytest tests/ -v`
- **Pre-commit:** Fast tests - `pytest tests/ -v`
- **Before PR:** Include slow + integration - `pytest tests/ -v -m "not e2e"`
- **CI validation:** All tests run automatically
### Key Test Files
- `test_scraper_features.py` - Core scraping functionality
- `test_mcp_server.py` - MCP integration (18 tools)
- `test_mcp_fastmcp.py` - FastMCP framework
- `test_unified.py` - Multi-source scraping
- `test_github_scraper.py` - GitHub analysis
- `test_pdf_scraper.py` - PDF extraction
- `test_epub_scraper.py` - EPUB extraction
- `test_install_multiplatform.py` - Multi-platform packaging
- `test_integration.py` - End-to-end workflows
- `test_install_skill.py` - One-command install
- `test_install_agent.py` - AI agent installation
- `conftest.py` - Test configuration (checks package installation)
## 🌐 Environment Variables
```bash
# Claude AI / Compatible APIs
# Option 1: Official Anthropic API (default)
export ANTHROPIC_API_KEY=sk-ant-...
# Option 2: GLM-4.7 Claude-compatible API (or any compatible endpoint)
export ANTHROPIC_API_KEY=your-api-key
export ANTHROPIC_BASE_URL=https://glm-4-7-endpoint.com/v1
# Google Gemini (optional)
export GOOGLE_API_KEY=AIza...
# OpenAI ChatGPT (optional)
export OPENAI_API_KEY=sk-...
# GitHub (for higher rate limits)
export GITHUB_TOKEN=ghp_...
# Private config repositories (optional)
export GITLAB_TOKEN=glpat-...
export GITEA_TOKEN=...
export BITBUCKET_TOKEN=...
```
**All AI enhancement features respect these settings**:
- `enhance_skill.py` - API mode SKILL.md enhancement
- `ai_enhancer.py` - C3.1/C3.2 pattern and test example enhancement
- `guide_enhancer.py` - C3.3 guide enhancement
- `config_enhancer.py` - C3.4 configuration enhancement
- `adaptors/claude.py` - Claude platform adaptor enhancement
**Note**: Setting `ANTHROPIC_BASE_URL` allows you to use any Claude-compatible API endpoint, such as GLM-4.7 (智谱 AI).
## 📦 Package Structure (pyproject.toml)
### Entry Points
```toml
[project.scripts]
# Main unified CLI
skill-seekers = "skill_seekers.cli.main:main"
# Individual tool entry points (Core)
skill-seekers-config = "skill_seekers.cli.config_command:main" # v2.7.0 Configuration wizard
skill-seekers-resume = "skill_seekers.cli.resume_command:main" # v2.7.0 Resume interrupted jobs
skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
skill-seekers-github = "skill_seekers.cli.github_scraper:main"
skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
skill-seekers-codebase = "skill_seekers.cli.codebase_scraper:main" # C2.x Local codebase analysis
skill-seekers-enhance = "skill_seekers.cli.enhance_skill_local:main"
skill-seekers-enhance-status = "skill_seekers.cli.enhance_status:main" # Status monitoring
skill-seekers-package = "skill_seekers.cli.package_skill:main"
skill-seekers-upload = "skill_seekers.cli.upload_skill:main"
skill-seekers-estimate = "skill_seekers.cli.estimate_pages:main"
skill-seekers-install = "skill_seekers.cli.install_skill:main"
skill-seekers-install-agent = "skill_seekers.cli.install_agent:main"
skill-seekers-patterns = "skill_seekers.cli.pattern_recognizer:main" # C3.1 Pattern detection
skill-seekers-how-to-guides = "skill_seekers.cli.how_to_guide_builder:main" # C3.3 Guide generation
skill-seekers-workflows = "skill_seekers.cli.workflows_command:main" # NEW: Workflow preset management
skill-seekers-video = "skill_seekers.cli.video_scraper:main" # Video scraping pipeline (use --setup to install deps)
# New v3.0.0 Entry Points
skill-seekers-setup = "skill_seekers.cli.setup_wizard:main" # NEW: v3.0.0 Setup wizard
skill-seekers-cloud = "skill_seekers.cli.cloud_storage_cli:main" # NEW: v3.0.0 Cloud storage
skill-seekers-embed = "skill_seekers.embedding.server:main" # NEW: v3.0.0 Embedding server
skill-seekers-sync = "skill_seekers.cli.sync_cli:main" # NEW: v3.0.0 Sync & monitoring
skill-seekers-benchmark = "skill_seekers.cli.benchmark_cli:main" # NEW: v3.0.0 Benchmarking
skill-seekers-stream = "skill_seekers.cli.streaming_ingest:main" # NEW: v3.0.0 Streaming ingest
skill-seekers-update = "skill_seekers.cli.incremental_updater:main" # NEW: v3.0.0 Incremental updates
skill-seekers-multilang = "skill_seekers.cli.multilang_support:main" # NEW: v3.0.0 Multilingual
skill-seekers-quality = "skill_seekers.cli.quality_metrics:main" # NEW: v3.0.0 Quality metrics
```
### Optional Dependencies
**Project uses PEP 735 `[dependency-groups]`** (a packaging standard, so it needs an installer that understands it, such as uv or a recent pip):
- Replaces deprecated `tool.uv.dev-dependencies`
- Dev dependencies: `[dependency-groups] dev = [...]` in pyproject.toml
- Install with: `pip install -e .` (installs only core deps)
- Install dev deps: See CI workflow or manually install pytest, ruff, mypy
**Note on video dependencies:** `easyocr` and GPU-specific PyTorch builds are **not** included in the `video-full` optional dependency group. They are installed at runtime by `skill-seekers video --setup`, which auto-detects the GPU (CUDA/ROCm/MPS/CPU) and installs the correct builds.
```toml
[project.optional-dependencies]
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]
[dependency-groups] # PEP 735 (replaces tool.uv.dev-dependencies)
dev = [
"pytest>=8.4.2",
"pytest-asyncio>=0.24.0",
"pytest-cov>=7.0.0",
"coverage>=7.11.0",
]
```
## 🚨 Critical Development Notes
### Must Run Before Tests
```bash
# REQUIRED: Install package before running tests
pip install -e .
# Why: src/ layout requires package installation
# Without this, imports will fail
```
### Never Skip Tests
Per user instructions in `~/.claude/CLAUDE.md`:
- "never skip any test. always make sure all test pass"
- All 2,540 tests must pass before commits
- Run full test suite: `pytest tests/ -v`
- New tests added for create command and CLI refactor work
### Platform-Specific Dependencies
Platform dependencies are optional (install only what you need):
```bash
# Install specific platform support
pip install -e ".[gemini]" # Google Gemini
pip install -e ".[openai]" # OpenAI ChatGPT
pip install -e ".[chroma]" # ChromaDB
pip install -e ".[weaviate]" # Weaviate
pip install -e ".[s3]" # AWS S3
pip install -e ".[gcs]" # Google Cloud Storage
pip install -e ".[azure]" # Azure Blob Storage
pip install -e ".[mcp]" # MCP integration
pip install -e ".[all]" # Everything (16 platforms + cloud + embedding)
# Or install from PyPI:
pip install skill-seekers[gemini] # Google Gemini support
pip install skill-seekers[openai] # OpenAI ChatGPT support
pip install skill-seekers[all-llms] # All LLM platforms
pip install skill-seekers[chroma] # ChromaDB support
pip install skill-seekers[weaviate] # Weaviate support
pip install skill-seekers[s3] # AWS S3 support
pip install skill-seekers[all] # All optional dependencies
```
### AI Enhancement Modes
AI enhancement transforms basic skills (2-3/10) into production-ready skills (8-9/10). Two modes available:
**API Mode** (default if ANTHROPIC_API_KEY is set):
- Direct Claude API calls (fast, efficient)
- Cost: ~$0.15-$0.30 per skill
- Perfect for CI/CD automation
- Requires: `export ANTHROPIC_API_KEY=sk-ant-...`
**LOCAL Mode** (fallback if no API key):
- Uses Claude Code CLI (your existing Max plan)
- Free! No API charges
- 4 execution modes:
  - Headless (default): Foreground, waits for completion
  - Background (`--background`): Returns immediately
  - Daemon (`--daemon`): Fully detached with nohup
  - Terminal (`--interactive-enhancement`): Opens new terminal (macOS)
- Status monitoring: `skill-seekers enhance-status output/react/ --watch`
- Timeout configuration: `--timeout 300` (seconds)
### Enhancement Flag Consolidation (Phase 1)
**IMPORTANT CHANGE:** Three enhancement flags have been unified into a single granular control:
**Old flags (deprecated):**
- `--enhance` - Enable AI enhancement
- `--enhance-local` - Use LOCAL mode (Claude Code)
- `--api-key KEY` - Anthropic API key
**New unified flag:**
- `--enhance-level LEVEL` - Granular AI enhancement control (0-3, default: 2)
  - `0` - Disabled, no AI enhancement
  - `1` - SKILL.md only (core documentation)
  - `2` - SKILL.md + Architecture + Config + Docs (default, balanced)
  - `3` - Full enhancement (all features, comprehensive)
**Auto-detection:** Mode (API vs LOCAL) is auto-detected:
- If `ANTHROPIC_API_KEY` is set → API mode
- Otherwise → LOCAL mode (Claude Code Max)
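The auto-detection rule above comes down to one environment check. A minimal sketch, assuming a hypothetical helper `detect_enhancement_mode` (not the project's actual function) and assuming that an empty or whitespace-only key counts as unset:

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def detect_enhancement_mode(env: Mapping[str, str] | None = None) -> str:
    """Return "API" when ANTHROPIC_API_KEY is set, otherwise "LOCAL".

    Hypothetical helper for illustration only. Treating a blank value
    as unset is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "API" if api_key else "LOCAL"
```

Passing the environment as a parameter keeps the rule testable without touching the real process environment.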
**Examples:**
```bash
# Auto-detect mode, default enhancement level (2)
skill-seekers create https://docs.react.dev/
# Disable enhancement
skill-seekers create facebook/react --enhance-level 0
# SKILL.md only (fast)
skill-seekers create ./my-project --enhance-level 1
# Full enhancement (comprehensive)
skill-seekers create tutorial.pdf --enhance-level 3
# Force LOCAL mode with specific level
skill-seekers enhance output/react/ --mode LOCAL --enhance-level 2
# Background with status monitoring
skill-seekers enhance output/react/ --background
skill-seekers enhance-status output/react/ --watch
```
**Migration:** Old flags still work with deprecation warnings, will be removed in v4.0.0.
See `docs/ENHANCEMENT_MODES.md` for detailed documentation.
### Git Workflow
**Git Workflow Notes:**
- Main branch: `main`
- Development branch: `development`
- Always create feature branches from `development`
- Branch naming: `feature/{task-id}-{description}` or `feature/{category}`
**To see current status:** `git status`
### CI/CD Pipeline
The project has GitHub Actions workflows in `.github/workflows/`:
**tests.yml** - Runs on every push and PR to `main` or `development`:
1. **Lint Job** (Python 3.12, Ubuntu):
   - `ruff check src/ tests/` - Code linting with GitHub annotations
   - `ruff format --check src/ tests/` - Format validation
   - `mypy src/skill_seekers` - Type checking (continue-on-error)
2. **Test Job** (Matrix):
   - **OS:** Ubuntu + macOS
   - **Python:** 3.10, 3.11, 3.12
   - **Exclusions:** macOS + Python 3.10 (speed optimization)
   - **Steps:**
     - Install dependencies + `pip install -e .`
     - Run CLI tests (scraper, config, integration)
     - Run MCP server tests
     - Generate coverage report → Upload to Codecov
3. **Summary Job** - Single status check for branch protection
   - Ensures both lint and test jobs succeed
   - Provides single "All Checks Complete" status
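To see which combinations the test matrix actually runs, the OS and Python lists can be expanded and the exclusion removed. A small sketch, assuming illustrative labels (`ubuntu`, `macos`) rather than the real runner names used in `tests.yml`:

```python
from itertools import product

# Illustrative labels only; the workflow's actual runner names may differ.
OPERATING_SYSTEMS = ["ubuntu", "macos"]
PYTHON_VERSIONS = ["3.10", "3.11", "3.12"]
EXCLUDED = {("macos", "3.10")}  # excluded as a speed optimization


def expand_matrix(oses, versions, excluded):
    """Return every (os, python) pair that is not explicitly excluded."""
    return [combo for combo in product(oses, versions) if combo not in excluded]


jobs = expand_matrix(OPERATING_SYSTEMS, PYTHON_VERSIONS, EXCLUDED)
for os_name, python_version in jobs:
    print(f"{os_name} / Python {python_version}")
```

Two operating systems times three Python versions gives six pairs; removing the one exclusion leaves five test jobs.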
**release.yml** - Triggers on version tags (e.g., `v2.9.0`):
- Builds package with `uv build`
- Publishes to PyPI with `uv publish`
- Creates GitHub release
**Local Pre-Commit Validation**
Run the same checks as CI before pushing:
```bash
# 1. Code quality (matches lint job) - WITH AUTO-FIX
uvx ruff check --fix --unsafe-fixes src/ tests/ # Auto-fix issues
uvx ruff format src/ tests/ # Auto-format
uvx ruff check src/ tests/ # Verify clean
uvx ruff format --check src/ tests/ # Verify formatted
mypy src/skill_seekers
# 2. Tests (matches test job)
pip install -e .
pytest tests/ -v --cov=src/skill_seekers --cov-report=term
# 3. If all pass, you're good to push!
git add -A # Stage any auto-fixes
git commit --amend --no-edit # Add fixes to commit (or new commit)
git push origin feature/my-feature
```
**Branch Protection Rules:**
- **main:** Requires tests + 1 review, only maintainers merge
- **development:** Requires tests to pass, default target for PRs
**Common CI Failure Patterns and Fixes**
If CI fails after your changes, follow this debugging checklist:
```bash
# 1. Fix linting errors automatically
uvx ruff check --fix --unsafe-fixes src/ tests/
# 2. Fix formatting issues
uvx ruff format src/ tests/
# 3. Check for remaining issues
uvx ruff check src/ tests/
uvx ruff format --check src/ tests/
# 4. Verify tests pass locally
pip install -e .
pytest tests/ -v
# 5. Push fixes
git add -A
git commit -m "fix: resolve CI linting/formatting issues"
git push
```
**Critical dependency patterns to check:**
- **MCP version mismatch**: Ensure `requirements.txt` and `pyproject.toml` have matching MCP versions
- **Missing module-level imports**: If a tool file imports a module at top level (e.g., `import yaml`), that module MUST be in core dependencies
- **Try/except ImportError**: Silent failures in try/except blocks can hide missing dependencies
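One way to keep optional imports from failing silently is to re-raise with an actionable message instead of swallowing the error. A minimal sketch, assuming a hypothetical helper `require_optional` (not part of the codebase) and the extras names listed under Platform-Specific Dependencies:

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail loudly with an install hint.

    Hypothetical helper: `extra` is the pyproject extras name to suggest.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Surface the missing dependency instead of hiding it behind `pass`
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install it with: pip install 'skill-seekers[{extra}]'"
        ) from exc
```

A module that is always imported at top level should still be declared as a core dependency; this pattern only suits imports that are truly optional.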
**Timing-sensitive tests:**
- Benchmark tests may fail on slower CI runners (macOS)
- If a test times out or exceeds threshold only in CI, consider relaxing the threshold
- Local passing doesn't guarantee CI passing for performance tests
## 🚨 Common Pitfalls & Solutions
### 1. Import Errors
**Problem:** `ModuleNotFoundError: No module named 'skill_seekers'`
**Solution:** Must install package first due to src/ layout
```bash
pip install -e .
```
**Why:** The src/ layout prevents imports from repo root. Package must be installed.
### 2. Tests Fail with "No module named..."
**Problem:** Package not installed in test environment
**Solution:** CI runs `pip install -e .` before tests - do the same locally
```bash
pip install -e .
pytest tests/ -v
```
### 3. Platform-Specific Dependencies Not Found
**Problem:** `ModuleNotFoundError: No module named 'google.generativeai'`
**Solution:** Install platform-specific dependencies
```bash
pip install -e ".[gemini]" # For Gemini
pip install -e ".[openai]" # For OpenAI
pip install -e ".[all-llms]" # For all LLM platforms
```
### 4. Git Branch Confusion
**Problem:** PR targets `main` instead of `development`
**Solution:** Always create PRs targeting `development` branch
```bash
git checkout development
git pull upstream development
git checkout -b feature/my-feature
# ... make changes ...
git push origin feature/my-feature
# Create PR: feature/my-feature → development
```
**Important:** See `CONTRIBUTING.md` for complete branch workflow.
### 5. Tests Pass Locally But Fail in CI
**Problem:** Different Python version or missing dependency
**Solution:** Test with multiple Python versions locally
```bash
# CI tests: Python 3.10, 3.11, 3.12 on Ubuntu + macOS
# Use pyenv or docker to test locally:
pyenv install 3.10.13 3.11.7 3.12.1
pyenv local 3.10.13
pip install -e . && pytest tests/ -v
pyenv local 3.11.7
pip install -e . && pytest tests/ -v
pyenv local 3.12.1
pip install -e . && pytest tests/ -v
```
### 6. Enhancement Not Working
**Problem:** AI enhancement fails or hangs
**Solutions:**
```bash
# Check if API key is set
echo $ANTHROPIC_API_KEY
# Try LOCAL mode instead (uses Claude Code Max, no API key needed)
skill-seekers enhance output/react/ --mode LOCAL
# Monitor enhancement status for background jobs
skill-seekers enhance-status output/react/ --watch
```
### 7. Rate Limit Errors from GitHub
**Problem:** `403 Forbidden` from GitHub API
**Solutions:**
```bash
# Check current rate limit
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
# Configure multiple GitHub profiles (recommended)
skill-seekers config --github
# Use specific profile
skill-seekers github --repo owner/repo --profile work
# Test all configured tokens
skill-seekers config --test
```
### 8. Confused About Command Options
**Problem:** "Too many flags!" or "Which flags work with which sources?"
**Solution:** Use the progressive disclosure help system in the `create` command:
```bash
# Start with universal options (13 flags)
skill-seekers create --help
# Need web scraping options?
skill-seekers create --help-web
# GitHub-specific flags?
skill-seekers create --help-github
# See ALL options (120+ flags)?
skill-seekers create --help-all
# Quick preset shortcut
skill-seekers create <source> -p quick
skill-seekers create <source> -p standard
skill-seekers create <source> -p comprehensive
```
**Why:** The create command shows only relevant flags by default to reduce cognitive load.
**Legacy commands** (scrape, github, analyze) show all flags in one help screen - use them if you prefer that style.
### 9. Checks Pass Locally But Fail in GitHub Actions
**Problem:** Ruff check/format or tests pass locally but fail in CI
**Common causes:**
1. **Dependency version mismatch** - `requirements.txt` vs `pyproject.toml` conflicts
```bash
# Check both files have matching versions for core deps
grep "mcp" requirements.txt pyproject.toml
grep "PyYAML" requirements.txt pyproject.toml
```
2. **Module imported but not declared** - File imports module at top level but it's not in dependencies
```bash
# Search for imports that might not be in dependencies
grep -r "^import yaml" src/
grep -r "^from yaml" src/
# Ensure PyYAML is in pyproject.toml core dependencies
```
3. **Ruff version differences** - Local ruff vs CI ruff may have different rules
```bash
# Use uvx to match CI's ruff version
uvx ruff check src/ tests/
uvx ruff format src/ tests/
```
**Solution:**
```bash
# Run CI validation commands exactly as CI does
pip install -e . # Fresh install
uvx ruff check src/ tests/ # Use uvx, not local ruff
uvx ruff format --check src/ tests/
pytest tests/ -v
```
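The dependency-mismatch check from cause 1 can also be scripted rather than eyeballed from `grep` output. A rough sketch, assuming simple one-requirement-per-line specifiers without extras or environment markers; the helper names and the version numbers in the usage note are illustrative, not taken from the repository:

```python
import re

# Matches lines such as  mcp>=1.25,<2   or   "PyYAML>=6.0",
_REQUIREMENT = re.compile(
    r"""^\s*["']?(?P<name>[A-Za-z0-9_.\-]+)\s*(?P<spec>[<>=!~][^"'#]*)"""
)


def find_specs(text: str, package: str) -> set[str]:
    """Collect the version specifiers declared for `package` in `text`."""
    specs = set()
    for line in text.splitlines():
        match = _REQUIREMENT.match(line)
        if match and match.group("name").lower() == package.lower():
            spec = match.group("spec").strip().rstrip(",").replace(" ", "")
            specs.add(spec)
    return specs


def specs_match(requirements_txt: str, pyproject_toml: str, package: str) -> bool:
    """True when both files declare the same specifier set for `package`."""
    return find_specs(requirements_txt, package) == find_specs(pyproject_toml, package)
```

For example, `specs_match("mcp>=1.25,<2", '"mcp>=1.0",', "mcp")` is false because the two specifiers differ. A real check would read both files from disk and would be better served by a proper requirements parser.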
## 🔌 MCP Integration
### MCP Server (26 Tools)
**Transport modes:**
- stdio: Claude Code, VS Code + Cline
- HTTP: Cursor, Windsurf, IntelliJ IDEA
**Core Tools (9):**
1. `list_configs` - List preset configurations
2. `generate_config` - Generate config from docs URL
3. `validate_config` - Validate config structure
4. `estimate_pages` - Estimate page count
5. `scrape_docs` - Scrape documentation
6. `package_skill` - Package to format (supports `--format` and `--target`)
7. `upload_skill` - Upload to platform (supports `--target`)
8. `enhance_skill` - AI enhancement with platform support
9. `install_skill` - Complete workflow automation
**Extended Tools (10):**
10. `scrape_github` - GitHub repository analysis
11. `scrape_pdf` - PDF extraction
12. `unified_scrape` - Multi-source scraping
13. `merge_sources` - Merge docs + code
14. `detect_conflicts` - Find discrepancies
15. `add_config_source` - Register git repos
16. `fetch_config` - Fetch configs from git
17. `list_config_sources` - List registered sources
18. `remove_config_source` - Remove config source
19. `split_config` - Split large configs
**NEW Vector DB Tools (4):**
20. `export_to_chroma` - Export to ChromaDB
21. `export_to_weaviate` - Export to Weaviate
22. `export_to_faiss` - Export to FAISS
23. `export_to_qdrant` - Export to Qdrant
**NEW Cloud Tools (3):**
24. `cloud_upload` - Upload to S3/GCS/Azure
25. `cloud_download` - Download from cloud storage
26. `cloud_list` - List files in cloud storage
### Starting MCP Server
```bash
# stdio mode (Claude Code, VS Code + Cline)
python -m skill_seekers.mcp.server_fastmcp
# HTTP mode (Cursor, Windsurf, IntelliJ)
python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
```
## 🤖 RAG Framework & Vector Database Integrations (**NEW - v3.0.0**)
Skill Seekers is now the **universal preprocessor for RAG pipelines**. Export documentation to any RAG framework or vector database with a single command.
### RAG Frameworks
**LangChain Documents:**
```bash
# Export to LangChain Document format
skill-seekers package output/django --format langchain
# Output: output/django-langchain.json
# Format: Array of LangChain Document objects
# - page_content: Full text content
# - metadata: {source, category, type, url}
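# Use in LangChain (Python; JSONLoader requires the `jq` package):
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path="output/django-langchain.json",
    jq_schema=".[]",
    content_key="page_content",
)
documents = loader.load()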
```
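Before handing the export to a framework, it can help to sanity-check the file with the standard library alone. A minimal sketch, assuming the layout described above (a top-level JSON array whose items carry `page_content` and `metadata`); `load_exported_documents` is a hypothetical helper, not part of Skill Seekers:

```python
import json
from pathlib import Path


def load_exported_documents(path: str) -> list[dict]:
    """Load the export and verify each record has the expected keys."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of documents")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"document {index} is not a JSON object")
        missing = {"page_content", "metadata"} - set(record)
        if missing:
            raise ValueError(f"document {index} is missing keys: {sorted(missing)}")
    return records
```

If the actual export schema differs from the description above, adjust the required keys accordingly.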
**LlamaIndex TextNodes:**
```bash
# Export to LlamaIndex TextNode format
skill-seekers package output/django --format llama-index
# Output: output/django-llama-index.json
# Format: Array of LlamaIndex TextNode objects
# - text: Content
# - id_: Unique identifier
# - metadata: {source, category, type}
# - relationships: Document relationships
# Use in LlamaIndex (Python):
import json
# llama-index >= 0.10 (older releases: from llama_index.schema import TextNode)
from llama_index.core.schema import TextNode
with open("output/django-llama-index.json") as f:
    nodes = [TextNode.from_dict(n) for n in json.load(f)]
```
**Haystack Documents:**
```bash
# Export to Haystack Document format
skill-seekers package output/django --format haystack
# Output: output/django-haystack.json
# Format: Haystack Document objects for pipelines
# Perfect for: Question answering, search, RAG pipelines
```
### Vector Databases
**ChromaDB (Direct Integration):**
```bash
# Export and optionally upload to ChromaDB
skill-seekers package output/django --format chroma
# Output: output/django-chroma/ (ChromaDB collection)
# With direct upload (requires chromadb running):
skill-seekers package output/django --format chroma --upload
# Configuration via environment:
export CHROMA_HOST=localhost
export CHROMA_PORT=8000
```
**FAISS (Facebook AI Similarity Search):**
```bash
# Export to FAISS index format
skill-seekers package output/django --format faiss
# Output:
# - output/django-faiss.index (FAISS index)
# - output/django-faiss-metadata.json (Document metadata)
# Use with FAISS:
import faiss
index = faiss.read_index("output/django-faiss.index")
```
**Weaviate:**
```bash
# Export and upload to Weaviate
skill-seekers package output/django --format weaviate --upload
# Requires environment variables:
export WEAVIATE_URL=http://localhost:8080
export WEAVIATE_API_KEY=your-api-key
# Creates class "DjangoDoc" with schema
```
**Qdrant:**
```bash
# Export and upload to Qdrant
skill-seekers package output/django --format qdrant --upload
# Requires environment variables:
export QDRANT_URL=http://localhost:6333
export QDRANT_API_KEY=your-api-key
# Creates collection "django_docs"
```
**Pinecone (via Markdown):**
```bash
# Pinecone uses the markdown format
skill-seekers package output/django --target markdown
# Then use Pinecone's Python client for upsert
# See: docs/integrations/PINECONE.md
```
### Complete RAG Pipeline Example
```bash
# 1. Scrape documentation
skill-seekers scrape --config configs/django.json
# 2. Export to your RAG stack
skill-seekers package output/django --format langchain # For LangChain
skill-seekers package output/django --format llama-index # For LlamaIndex
skill-seekers package output/django --format chroma --upload # Direct to ChromaDB
# 3. Use in your application
# See examples/:
# - examples/langchain-rag-pipeline/
# - examples/llama-index-query-engine/
# - examples/pinecone-upsert/
```
**Integration Hub:** [docs/integrations/RAG_PIPELINES.md](docs/integrations/RAG_PIPELINES.md)
## 🛠️ AI Coding Assistant Integrations (**NEW - v3.0.0**)
Transform any framework documentation into persistent expert context for 4+ AI coding assistants. Your IDE's AI now "knows" your frameworks without manual prompting.
### Cursor IDE
**Setup:**
```bash
# 1. Generate skill
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/ --target claude
# 2. Install to Cursor
cp output/react-claude/SKILL.md .cursorrules
# 3. Restart Cursor
# AI now has React expertise!
```
**Benefits:**
- ✅ AI suggests React-specific patterns
- ✅ No manual "use React hooks" prompts needed
- ✅ Consistent team patterns
- ✅ Works for ANY framework
**Guide:** [docs/integrations/CURSOR.md](docs/integrations/CURSOR.md)
**Example:** [examples/cursor-react-skill/](examples/cursor-react-skill/)
### Windsurf
**Setup:**
```bash
# 1. Generate skill
skill-seekers scrape --config configs/django.json
skill-seekers package output/django/ --target claude
# 2. Install to Windsurf
mkdir -p .windsurf/rules
cp output/django-claude/SKILL.md .windsurf/rules/django.md
# 3. Restart Windsurf
# AI now knows Django patterns!
```
**Benefits:**
- ✅ Flow-based coding with framework knowledge
- ✅ IDE-native AI assistance
- ✅ Persistent context across sessions
**Guide:** [docs/integrations/WINDSURF.md](docs/integrations/WINDSURF.md)
**Example:** [examples/windsurf-fastapi-context/](examples/windsurf-fastapi-context/)
### Cline (VS Code Extension)
**Setup:**
```bash
# 1. Generate skill
skill-seekers scrape --config configs/fastapi.json
skill-seekers package output/fastapi/ --target claude
# 2. Install to Cline
cp output/fastapi-claude/SKILL.md .clinerules
# 3. Reload VS Code
# Cline now has FastAPI expertise!
```
**Benefits:**
- ✅ Agentic code generation in VS Code
- ✅ Cursor Composer equivalent for VS Code
- ✅ System prompts + MCP integration
**Guide:** [docs/integrations/CLINE.md](docs/integrations/CLINE.md)
**Example:** [examples/cline-django-assistant/](examples/cline-django-assistant/)
### Continue.dev (Universal IDE)
**Setup:**
```bash
# 1. Generate skill
skill-seekers scrape --config configs/react.json
skill-seekers package output/react/ --target claude
# 2. Start context server
cd examples/continue-dev-universal/
python context_server.py --port 8765
# 3. Configure in ~/.continue/config.json
{
  "contextProviders": [
    {
      "name": "http",
      "params": {
        "url": "http://localhost:8765/context",
        "title": "React Documentation"
      }
    }
  ]
}
# 4. Works in ALL IDEs!
# VS Code, JetBrains, Vim, Emacs...
```
**Benefits:**
- ✅ IDE-agnostic (works in VS Code, IntelliJ, Vim, Emacs)
- ✅ Custom LLM providers supported
- ✅ HTTP-based context serving
- ✅ Team consistency across mixed IDE environments
**Guide:** [docs/integrations/CONTINUE_DEV.md](docs/integrations/CONTINUE_DEV.md)
**Example:** [examples/continue-dev-universal/](examples/continue-dev-universal/)
### Multi-IDE Team Setup
For teams using different IDEs (VS Code, IntelliJ, Vim):
```bash
# Use Continue.dev as universal context provider
skill-seekers scrape --config configs/react.json
python context_server.py --host 0.0.0.0 --port 8765
# ALL team members configure Continue.dev
# Result: Identical AI suggestions across all IDEs!
```
**Integration Hub:** [docs/integrations/INTEGRATIONS.md](docs/integrations/INTEGRATIONS.md)
## ☁️ Cloud Storage Integration (**NEW - v3.0.0**)
Upload skills directly to cloud storage for team sharing and CI/CD pipelines.
### Supported Providers
**AWS S3:**
```bash
# Upload skill
skill-seekers cloud upload --provider s3 --bucket my-skills output/react.zip
# Download skill
skill-seekers cloud download --provider s3 --bucket my-skills react.zip
# List skills
skill-seekers cloud list --provider s3 --bucket my-skills
# Environment variables:
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_REGION=us-east-1
```
**Google Cloud Storage:**
```bash
# Upload skill
skill-seekers cloud upload --provider gcs --bucket my-skills output/react.zip
# Download skill
skill-seekers cloud download --provider gcs --bucket my-skills react.zip
# List skills
skill-seekers cloud list --provider gcs --bucket my-skills
# Environment variables:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```
**Azure Blob Storage:**
```bash
# Upload skill
skill-seekers cloud upload --provider azure --container my-skills output/react.zip
# Download skill
skill-seekers cloud download --provider azure --container my-skills react.zip
# List skills
skill-seekers cloud list --provider azure --container my-skills
# Environment variables:
export AZURE_STORAGE_CONNECTION_STRING=your-connection-string
```
### CI/CD Integration
```yaml
# GitHub Actions example
- name: Upload skill to S3
  run: |
    skill-seekers scrape --config configs/react.json
    skill-seekers package output/react/
    skill-seekers cloud upload --provider s3 --bucket ci-skills output/react.zip
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```
**Guide:** [docs/integrations/CLOUD_STORAGE.md](docs/integrations/CLOUD_STORAGE.md)
## 📋 Common Workflows
### Adding a New Platform
1. Create adaptor in `src/skill_seekers/cli/adaptors/{platform}_adaptor.py`
2. Inherit from `BaseAdaptor`
3. Implement `package()`, `upload()`, `enhance()` methods
4. Add to factory in `adaptors/__init__.py`
5. Add optional dependency to `pyproject.toml`
6. Add tests in `tests/test_install_multiplatform.py`
### Adding a New Feature
1. Implement in appropriate CLI module
2. Add entry point to `pyproject.toml` if needed
3. Add tests in `tests/test_{feature}.py`
4. Run full test suite: `pytest tests/ -v`
5. Update CHANGELOG.md
6. Commit only when all tests pass
### Debugging Common Issues
**Import Errors:**
```bash
# Always ensure package is installed first
pip install -e .
# Verify installation
python -c "import skill_seekers; print(skill_seekers.__version__)"
```
**Rate Limit Issues:**
```bash
# Check current GitHub rate limit status
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
# Configure multiple GitHub profiles
skill-seekers config --github
# Test your tokens
skill-seekers config --test
```
**Enhancement Not Working:**
```bash
# Check if API key is set
echo $ANTHROPIC_API_KEY
# Try LOCAL mode instead (uses Claude Code Max)
skill-seekers enhance output/react/ --mode LOCAL
# Monitor enhancement status
skill-seekers enhance-status output/react/ --watch
```
**Test Failures:**
```bash
# Run specific failing test with verbose output
pytest tests/test_file.py::test_name -vv
# Run with print statements visible
pytest tests/test_file.py -s
# Run with coverage to see what's not tested
pytest tests/test_file.py --cov=src/skill_seekers --cov-report=term-missing
# Run only unit tests (skip slow integration tests)
pytest tests/ -v -m "not slow and not integration"
```
**Config Issues:**
```bash
# Validate config structure
skill-seekers-validate configs/myconfig.json
# Show current configuration
skill-seekers config --show
# Estimate pages before scraping
skill-seekers estimate configs/myconfig.json
```
## 🎯 Where to Make Changes
This section helps you quickly locate the right files when implementing common changes.
### Adding a New CLI Command
**Files to modify:**
1. **Create command file:** `src/skill_seekers/cli/my_command.py`
```python
def main():
    """Entry point for my-command."""
    # Implementation
```
2. **Add entry point:** `pyproject.toml`
```toml
[project.scripts]
skill-seekers-my-command = "skill_seekers.cli.my_command:main"
```
3. **Update unified CLI:** `src/skill_seekers/cli/main.py`
   - Add subcommand handler to dispatcher
4. **Add tests:** `tests/test_my_command.py`
   - Test main functionality
   - Test CLI argument parsing
   - Test error cases
5. **Update docs:** `CHANGELOG.md` + `README.md` (if user-facing)
### Adding a New Platform Adaptor
**Files to modify:**
1. **Create adaptor:** `src/skill_seekers/cli/adaptors/my_platform_adaptor.py`
```python
from .base import BaseAdaptor


class MyPlatformAdaptor(BaseAdaptor):
    def package(self, skill_dir, output_path, **kwargs):
        # Platform-specific packaging
        pass

    def upload(self, package_path, api_key=None, **kwargs):
        # Platform-specific upload (optional for some platforms)
        pass

    def export(self, skill_dir, format, **kwargs):
        # For RAG/vector DB adaptors: export to specific format
        pass
```
2. **Register in factory:** `src/skill_seekers/cli/adaptors/__init__.py`
```python
def get_adaptor(target=None, format=None):
    # For LLM platforms (--target flag)
    target_adaptors = {
        'claude': ClaudeAdaptor,
        'gemini': GeminiAdaptor,
        'openai': OpenAIAdaptor,
        'markdown': MarkdownAdaptor,
        'myplatform': MyPlatformAdaptor,  # ADD THIS
    }
    # For RAG/vector DBs (--format flag)
    format_adaptors = {
        'langchain': LangChainAdaptor,
        'llama-index': LlamaIndexAdaptor,
        'chroma': ChromaAdaptor,
        # ... etc
    }
```
3. **Add optional dependency:** `pyproject.toml`
```toml
[project.optional-dependencies]
myplatform = ["myplatform-sdk>=1.0.0"]
```
4. **Add tests:** `tests/test_adaptors/test_my_platform_adaptor.py`
   - Test export format
   - Test upload (if applicable)
   - Test with real data
5. **Update documentation:**
   - README.md - Platform comparison table
   - docs/integrations/MY_PLATFORM.md - Integration guide
   - examples/my-platform-example/ - Working example
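The factory fragment in step 2 stops before the lookup itself. One way the lookup could be completed is sketched below, with stand-in classes so the example runs on its own; `get_adaptor_sketch`, the one-entry `TARGET_ADAPTORS` dict, and the placeholder `package()` body are assumptions, not the project's real code:

```python
from abc import ABC, abstractmethod


class BaseAdaptor(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    @abstractmethod
    def package(self, skill_dir: str, output_path: str, **kwargs) -> str:
        """Package a skill directory and return the path that was written."""


class MyPlatformAdaptor(BaseAdaptor):
    def package(self, skill_dir: str, output_path: str, **kwargs) -> str:
        # Placeholder: a real adaptor would write a package to output_path
        return output_path


# Maps the value of --target to an adaptor class (illustrative subset)
TARGET_ADAPTORS: dict[str, type[BaseAdaptor]] = {
    "myplatform": MyPlatformAdaptor,
}


def get_adaptor_sketch(target: str) -> BaseAdaptor:
    """Instantiate the adaptor for `target` or list the valid choices."""
    try:
        adaptor_cls = TARGET_ADAPTORS[target]
    except KeyError:
        available = ", ".join(sorted(TARGET_ADAPTORS))
        raise ValueError(f"Unknown target {target!r}; available: {available}") from None
    return adaptor_cls()
```

Listing the valid names in the error message makes a mistyped `--target` value easy to diagnose.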
### Adding a New Config Preset
**Files to modify:**
1. **Create config:** `configs/my_framework.json`
```json
{
  "name": "my_framework",
  "base_url": "https://docs.myframework.com/",
  "selectors": {...},
  "categories": {...}
}
```
2. **Test locally:**
```bash
# Estimate first
skill-seekers estimate configs/my_framework.json
# Test scrape (small sample)
skill-seekers scrape --config configs/my_framework.json --max-pages 50
```
3. **Add to README:** Update presets table in `README.md`
4. **Submit to website:** (Optional) Submit to SkillSeekersWeb.com
### Modifying Core Scraping Logic
**Key files by feature:**
| Feature | File | Size | Notes |
|---------|------|------|-------|
| Doc scraping | `src/skill_seekers/cli/doc_scraper.py` | ~90KB | Main scraper, BFS traversal |
| GitHub scraping | `src/skill_seekers/cli/github_scraper.py` | ~56KB | Repo analysis + metadata |
| GitHub API | `src/skill_seekers/cli/github_fetcher.py` | ~17KB | Rate limit handling |
| PDF extraction | `src/skill_seekers/cli/pdf_scraper.py` | Medium | PyMuPDF + OCR |
| EPUB extraction | `src/skill_seekers/cli/epub_scraper.py` | Medium | ebooklib + BeautifulSoup |
| Code analysis | `src/skill_seekers/cli/code_analyzer.py` | ~65KB | Multi-language AST parsing |
| Pattern detection | `src/skill_seekers/cli/pattern_recognizer.py` | Medium | C3.1 - 10 GoF patterns |
| Test extraction | `src/skill_seekers/cli/test_example_extractor.py` | Medium | C3.2 - 5 categories |
| Guide generation | `src/skill_seekers/cli/how_to_guide_builder.py` | ~45KB | C3.3 - AI-enhanced guides |
| Config extraction | `src/skill_seekers/cli/config_extractor.py` | ~32KB | C3.4 - 9 formats |
| Router generation | `src/skill_seekers/cli/generate_router.py` | ~43KB | C3.5 - Architecture docs |
| Signal flow | `src/skill_seekers/cli/signal_flow_analyzer.py` | Medium | C3.10 - Godot-specific |
**Always add tests when modifying core logic!**
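The table describes the main doc scraper as a BFS traversal. The general shape of a breadth-first crawl with a page cap is sketched below. This is a generic illustration, not the code in `doc_scraper.py`; `bfs_crawl` and the `get_links` callback are hypothetical names, and link extraction is abstracted away so the example needs no network access:

```python
from collections import deque
from collections.abc import Callable, Iterable


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
) -> list[str]:
    """Visit pages level by level, returning URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


# Usage with an in-memory link graph standing in for real pages
site = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": []}
order = bfs_crawl("/", lambda url: site.get(url, []))
```

Marking URLs as seen when they are enqueued, rather than when they are visited, keeps the queue free of duplicates on heavily cross-linked sites.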
### Modifying the Unified Create Command
**The create command uses a modular argument system:**
**Files involved:**
1. **Parser:** `src/skill_seekers/cli/parsers/create_parser.py`
   - Defines help text and formatter
   - Registers help mode flags (`--help-web`, `--help-github`, etc.)
   - Uses custom `NoWrapFormatter` for better help display
2. **Arguments:** `src/skill_seekers/cli/arguments/create.py`
   - Three tiers of arguments:
     - `UNIVERSAL_ARGUMENTS` (13 flags) - Work for all sources
     - Source-specific dicts (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, `EPUB_ARGUMENTS`, etc.)
     - `ADVANCED_ARGUMENTS` - Rare/advanced options
   - `add_create_arguments(parser, mode)` - Multi-mode argument addition
3. **Source Detection:** `src/skill_seekers/cli/source_detector.py` (if implemented)
   - Auto-detect source type from input
   - Pattern matching (URLs, GitHub repos, file extensions)
4. **Main Logic:** `src/skill_seekers/cli/create_command.py` (if implemented)
   - Route to appropriate scraper based on detected type
   - Argument validation and compatibility checking
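The pattern matching described for source detection might look roughly like the sketch below. It is not the contents of `source_detector.py`: the function name, the returned labels, and the rule order are all assumptions made for illustration.

```python
import re
from pathlib import Path

# owner/repo shorthand such as "facebook/react" (assumed pattern)
_GITHUB_REPO = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def detect_source_type(source: str) -> str:
    """Classify a `create` input as web, pdf, epub, local, github, or unknown."""
    if source.startswith(("http://", "https://")):
        return "web"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".epub":
        return "epub"
    if source.startswith((".", "/", "~")):
        return "local"
    if _GITHUB_REPO.match(source):
        return "github"
    return "unknown"
```

Note that a bare `owner/repo` string is ambiguous with a relative directory path; this sketch resolves it in favor of GitHub, and it treats GitHub URLs as plain web sources.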
**When adding new arguments:**
- Universal args → `UNIVERSAL_ARGUMENTS` in `arguments/create.py`
- Source-specific → Appropriate dict (`WEB_ARGUMENTS`, etc.)
- Always update help text and add tests
**Example: Adding a new universal flag:**
```python
# In arguments/create.py
UNIVERSAL_ARGUMENTS = {
    # ... existing args ...
    "my_flag": {
        "flags": ("--my-flag", "-m"),
        "kwargs": {
            "action": "store_true",
            "help": "Description of my flag",
        },
    },
}
```
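A dictionary in this layout maps directly onto `argparse`: each entry's `flags` become positional arguments to `add_argument` and its `kwargs` are passed through. A minimal sketch, assuming a hypothetical `add_arguments_from_dict` helper (the project's `add_create_arguments(parser, mode)` may work differently):

```python
import argparse

EXAMPLE_ARGUMENTS = {
    "my_flag": {
        "flags": ("--my-flag", "-m"),
        "kwargs": {"action": "store_true", "help": "Description of my flag"},
    },
}


def add_arguments_from_dict(parser: argparse.ArgumentParser, arguments: dict) -> None:
    """Register every entry of an arguments dict on the parser."""
    for spec in arguments.values():
        parser.add_argument(*spec["flags"], **spec["kwargs"])


parser = argparse.ArgumentParser(prog="example")
add_arguments_from_dict(parser, EXAMPLE_ARGUMENTS)
args = parser.parse_args(["--my-flag"])
print(args.my_flag)
```

Keeping argument definitions as data lets the same entries be registered on several parsers without repeating the `add_argument` calls.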
### Adding MCP Tools
**Files to modify:**
1. **Add tool function:** `src/skill_seekers/mcp/tools/{category}_tools.py`
2. **Register tool:** `src/skill_seekers/mcp/server_fastmcp.py`
```python
@mcp.tool()
def my_new_tool(param: str) -> str:
    """Tool description."""
    # Implementation
```
3. **Add tests:** `tests/test_mcp_fastmcp.py`
4. **Update count:** README.md (currently 26 tools)
## 📍 Key Files Quick Reference
| Task | File(s) | What to Modify |
|------|---------|----------------|
| Add new CLI command | `src/skill_seekers/cli/my_cmd.py`<br>`pyproject.toml` | Create `main()` function<br>Add entry point |
| Add platform adaptor | `src/skill_seekers/cli/adaptors/my_platform.py`<br>`adaptors/__init__.py` | Inherit `BaseAdaptor`<br>Register in factory |
| Fix scraping logic | `src/skill_seekers/cli/doc_scraper.py` | `scrape_all()`, `extract_content()` |
| Add MCP tool | `src/skill_seekers/mcp/server_fastmcp.py` | Add `@mcp.tool()` function |
| Fix tests | `tests/test_{feature}.py` | Add/modify test functions |
| Add config preset | `configs/{framework}.json` | Create JSON config |
| Update CI | `.github/workflows/tests.yml` | Modify workflow steps |
## 📚 Key Code Locations
**Documentation Scraper** (`src/skill_seekers/cli/doc_scraper.py`):
- `FALLBACK_MAIN_SELECTORS` - Shared fallback CSS selectors for finding main content (no `body`)
- `_find_main_content()` - Centralized selector fallback: config selector → fallback list
- `is_valid_url()` - URL validation
- `extract_content()` - Content extraction (links extracted from full page before early return)
- `detect_language()` - Code language detection
- `extract_patterns()` - Pattern extraction
- `smart_categorize()` - Smart categorization
- `infer_categories()` - Category inference
- `generate_quick_reference()` - Quick reference generation
- `create_enhanced_skill_md()` - SKILL.md generation
- `scrape_all()` - Main scraping loop (dry-run extracts links from full page)
- `main()` - Entry point
**Codebase Analysis** (`src/skill_seekers/cli/`):
- `codebase_scraper.py` - Main CLI for local codebase analysis
- `code_analyzer.py` - Multi-language AST parsing (9 languages)
- `api_reference_builder.py` - API documentation generation
- `dependency_analyzer.py` - NetworkX-based dependency graphs
- `pattern_recognizer.py` - C3.1 design pattern detection
- `test_example_extractor.py` - C3.2 test example extraction
- `how_to_guide_builder.py` - C3.3 guide generation
- `config_extractor.py` - C3.4 configuration extraction
- `generate_router.py` - C3.5 router skill generation
- `signal_flow_analyzer.py` - C3.10 signal flow analysis (Godot projects)
- `unified_codebase_analyzer.py` - Three-stream GitHub+local analyzer
**AI Enhancement** (`src/skill_seekers/cli/`):
- `enhance_skill_local.py` - LOCAL mode enhancement (4 execution modes)
- `enhance_skill.py` - API mode enhancement
- `enhance_status.py` - Status monitoring for background processes
- `ai_enhancer.py` - Shared AI enhancement logic
- `guide_enhancer.py` - C3.3 guide AI enhancement
- `config_enhancer.py` - C3.4 config AI enhancement
**Platform Adaptors** (`src/skill_seekers/cli/adaptors/`):
- `__init__.py` - Factory function
- `base_adaptor.py` - Abstract base class
- `claude_adaptor.py` - Claude AI implementation
- `gemini_adaptor.py` - Google Gemini implementation
- `openai_adaptor.py` - OpenAI ChatGPT implementation
- `markdown_adaptor.py` - Generic Markdown implementation
**MCP Server** (`src/skill_seekers/mcp/`):
- `server.py` - FastMCP-based server
- `tools/` - 26 MCP tool implementations
**Configuration & Rate Limit Management** (NEW: v2.7.0 - `src/skill_seekers/cli/`):
- `config_manager.py` - Multi-token configuration system (~490 lines)
  - `ConfigManager` class - Singleton pattern for global config access
  - `add_github_profile()` - Add GitHub profile with token and strategy
  - `get_github_token()` - Smart fallback chain (CLI → Env → Config → Prompt)
  - `get_next_profile()` - Profile switching for rate limit handling
  - `save_progress()` / `load_progress()` - Job resumption support
  - `cleanup_old_progress()` - Auto-cleanup of old jobs (7 days default)
- `config_command.py` - Interactive configuration wizard (~400 lines)
  - `main_menu()` - 7-option main menu with navigation
  - `github_token_menu()` - GitHub profile management
  - `add_github_profile()` - Guided token setup with browser integration
  - `api_keys_menu()` - API key configuration for Claude/Gemini/OpenAI
  - `test_connections()` - Connection testing for tokens and API keys
- `rate_limit_handler.py` - Smart rate limit detection and handling (~450 lines)
  - `RateLimitHandler` class - Strategy pattern for rate limit handling
  - `check_upfront()` - Upfront rate limit check before starting
  - `check_response()` - Real-time detection from API responses
  - `handle_rate_limit()` - Execute strategy (prompt/wait/switch/fail)
  - `try_switch_profile()` - Automatic profile switching
  - `wait_for_reset()` - Countdown timer with live progress
  - `show_countdown_timer()` - Live terminal countdown display
- `resume_command.py` - Resume interrupted scraping jobs (~150 lines)
  - `list_resumable_jobs()` - Display all jobs with progress details
  - `resume_job()` - Resume from saved checkpoint
  - `clean_old_jobs()` - Cleanup old progress files
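The fallback chain listed for `get_github_token()` (CLI → Env → Config → Prompt) could be sketched roughly as follows. This is a minimal illustration, not the code in `config_manager.py`: the function name `resolve_github_token` and its parameters are hypothetical, and only the `GITHUB_TOKEN` variable name comes from this document.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_github_token(
    cli_token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config_token: Optional[str] = None,
    prompt: Optional[Callable[[], str]] = None,
) -> Optional[str]:
    """Return the first non-empty token: CLI, then env, then config, then prompt.

    Hypothetical helper; the real ConfigManager may differ.
    """
    env = os.environ if env is None else env
    for candidate in (cli_token, env.get("GITHUB_TOKEN"), config_token):
        if candidate:
            return candidate
    # Prompt only as a last resort, and only when a callback is supplied
    # (a non-interactive run would pass prompt=None).
    if prompt is not None:
        return prompt() or None
    return None
```

Passing `env` and `prompt` explicitly keeps the sketch testable without touching the real environment or terminal.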
**GitHub Integration** (Modified for v2.7.0 - `src/skill_seekers/cli/`):
- `github_fetcher.py` - Integrated rate limit handler
  - Constructor now accepts `interactive` and `profile_name` parameters
  - `fetch()` - Added upfront rate limit check
  - All API calls check responses for rate limits
  - Raises `RateLimitError` when rate limit cannot be handled
- `github_scraper.py` - Added CLI flags
  - `--non-interactive` flag for CI/CD mode (fail fast)
  - `--profile` flag to select GitHub profile from config
  - Config supports `interactive` and `github_profile` keys
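As a rough sketch of how a response check and the four strategies (prompt/wait/switch/fail) might fit together: the header names below are the ones GitHub's REST API documents, but `seconds_until_reset` and `choose_action` are hypothetical helpers, and the local `RateLimitError` class is only a stand-in for the project's exception.

```python
import time
from typing import Mapping, Optional


class RateLimitError(Exception):
    """Stand-in for the error raised when a rate limit cannot be handled."""


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> Optional[float]:
    """Return seconds to wait if the limit is exhausted, else None.

    Assumes a plain dict with canonical header casing; real HTTP clients
    usually expose case-insensitive header mappings.
    """
    if headers.get("X-RateLimit-Remaining") != "0":
        return None
    now = time.time() if now is None else now
    reset_at = float(headers.get("X-RateLimit-Reset", now))  # UTC epoch seconds
    return max(0.0, reset_at - now)


def choose_action(wait_seconds: Optional[float], strategy: str, interactive: bool) -> str:
    """Map a detected rate limit onto: continue, wait, switch, or prompt."""
    if wait_seconds is None:
        return "continue"
    # Fail fast when asked to, or when prompting is impossible (CI/CD mode).
    if strategy == "fail" or (strategy == "prompt" and not interactive):
        raise RateLimitError(f"rate limited for {wait_seconds:.0f}s")
    return strategy
```

In this sketch, `--non-interactive` would correspond to `interactive=False`, so a `prompt` strategy degrades to a hard failure rather than blocking a CI job.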
**RAG & Vector Database Adaptors** (NEW: v3.0.0 - `src/skill_seekers/cli/adaptors/`):
- `langchain.py` - LangChain Documents export (~250 lines)
  - Exports to LangChain Document format
  - Preserves metadata (source, category, type, url)
  - Smart chunking with overlap
- `llama_index.py` - LlamaIndex TextNodes export (~280 lines)
  - Exports to TextNode format with unique IDs
  - Relationship mapping between documents
  - Metadata preservation
- `haystack.py` - Haystack Documents export (~230 lines)
  - Pipeline-ready document format
  - Supports embeddings and filters
- `chroma.py` - ChromaDB integration (~350 lines)
  - Direct collection creation
  - Batch upsert with embeddings
  - Query interface
- `weaviate.py` - Weaviate vector search (~320 lines)
  - Schema creation with auto-detection
  - Batch import with error handling
- `faiss_helpers.py` - FAISS index generation (~280 lines)
  - Index building with metadata
  - Search utilities
- `qdrant.py` - Qdrant vector database (~300 lines)
  - Collection management
  - Payload indexing
- `streaming_adaptor.py` - Streaming data ingest (~200 lines)
  - Real-time data processing
  - Incremental updates
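"Chunking with overlap", as mentioned for the LangChain export, generally means that consecutive chunks share a tail/head region so context is not lost at boundaries. A minimal character-based sketch follows; the function name, defaults, and character-level splitting are assumptions, and the actual adaptor may split on tokens or headings instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk repeats the last `overlap` characters of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=2)` yields `["abcd", "cdef", "efgh", "ghij"]`.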
**Cloud Storage & Infrastructure** (NEW: v3.0.0 - `src/skill_seekers/cli/`):
- `cloud_storage_cli.py` - S3/GCS/Azure upload/download (~450 lines)
  - Multi-provider abstraction
  - Parallel uploads for large files
  - Retry logic with exponential backoff
- `embedding_pipeline.py` - Embedding generation for vectors (~320 lines)
  - Sentence-transformers integration
  - Batch processing
  - Multiple embedding models
- `sync_cli.py` - Continuous sync & monitoring (~380 lines)
  - File watching for changes
  - Automatic re-scraping
  - Smart diff detection
- `incremental_updater.py` - Smart incremental updates (~350 lines)
  - Change detection algorithms
  - Partial skill updates
  - Version tracking
- `streaming_ingest.py` - Real-time data streaming (~290 lines)
  - Stream processing pipelines
  - WebSocket support
- `benchmark_cli.py` - Performance benchmarking (~280 lines)
  - Scraping performance tests
  - Comparison reports
  - CI/CD integration
- `quality_metrics.py` - Quality analysis & reporting (~340 lines)
  - Completeness scoring
  - Link checking
  - Content quality metrics
- `multilang_support.py` - Internationalization support (~260 lines)
  - Language detection
  - Translation integration
  - Multi-locale skills
- `setup_wizard.py` - Interactive setup wizard (~220 lines)
  - Configuration management
  - Profile creation
  - First-time setup
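The "retry logic with exponential backoff" noted for cloud uploads can be illustrated with a small generic helper. This is a sketch under assumptions: `retry_with_backoff` and its parameters are hypothetical names, and the real module may add jitter, delay caps, or provider-specific error filtering.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` lets tests record the delays instead of actually waiting.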
**Video Scraper** (`src/skill_seekers/cli/`):
- `video_scraper.py` - Main video scraping pipeline CLI
- `video_setup.py` - GPU auto-detection, PyTorch installation, visual dependency setup (~835 lines)
  - Detects CUDA/ROCm/MPS/CPU and installs matching PyTorch build
  - Installs `easyocr` and other visual processing deps at runtime via `--setup`
  - Run `skill-seekers video --setup` before first use
## 🎯 Project-Specific Best Practices
1. **Prefer the unified `create` command** - Use `skill-seekers create <source>` over legacy commands for consistency
2. **Always use platform adaptors** - Never hardcode platform-specific logic
3. **Test all platforms** - Changes must work for all 16 platforms (was 4 in v2.x)
4. **Maintain backward compatibility** - Legacy commands (scrape, github, analyze) must still work
5. **Document API changes** - Update CHANGELOG.md for every release
6. **Keep dependencies optional** - Platform-specific deps are optional (RAG, cloud, etc.)
7. **Use src/ layout** - Proper package structure with `pip install -e .`
8. **Run tests before commits** - Per user instructions, never skip tests (1,765+ tests must pass)
9. **RAG-first mindset** - Since v3.0.0, Skill Seekers is the universal preprocessor for AI systems
10. **Export format clarity** - Use `--format` for RAG/vector DBs, `--target` for LLM platforms
11. **Test with real integrations** - Verify exports work with actual LangChain, ChromaDB, etc.
12. **Progressive disclosure** - When adding flags, categorize as universal/source-specific/advanced
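Practice 1 depends on `create` inferring the source type from its argument. The sketch below shows one plausible set of detection rules. The function name, the rule order, and the returned labels are all assumptions made for illustration; the real command's detection logic is not shown in this document.

```python
import re
from pathlib import Path


def detect_source_type(source: str) -> str:
    """Classify a `create` argument (hypothetical rules, for illustration)."""
    if re.match(r"^https?://github\.com/", source):
        return "github"
    if re.match(r"^https?://", source):
        return "web"
    suffix = Path(source).suffix.lower()
    if suffix in {".pdf", ".epub"}:
        return suffix.lstrip(".")
    if suffix == ".json":
        return "config"
    if Path(source).exists():
        return "local"  # an existing path wins over owner/repo shorthand
    # owner/repo shorthand: exactly one slash and no leading "."
    if re.fullmatch(r"[\w.-]+/[\w.-]+", source) and not source.startswith("."):
        return "github"
    return "local"
```

Checking the filesystem before the `owner/repo` pattern is a design choice in this sketch: it avoids treating a real local directory such as `src/utils` as a repository name.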
## 🐛 Debugging Tips
### Enable Verbose Logging
```bash
# Set environment variable for debug output
export SKILL_SEEKERS_DEBUG=1
skill-seekers scrape --config configs/react.json
```
### Test Single Function/Module
Run Python modules directly for debugging:
```bash
# Run modules with --help to see options
python -m skill_seekers.cli.doc_scraper --help
python -m skill_seekers.cli.github_scraper --repo facebook/react --dry-run
python -m skill_seekers.cli.package_skill --help
# Test MCP server directly
python -m skill_seekers.mcp.server_fastmcp
```
### Use pytest with Debugging
```bash
# Drop into debugger on failure
pytest tests/test_scraper_features.py --pdb
# Show print statements (normally suppressed)
pytest tests/test_scraper_features.py -s
# Verbose test output (shows full diff, more details)
pytest tests/test_scraper_features.py -vv
# Run only failed tests from last run
pytest tests/ --lf
# Run until first failure (stop immediately)
pytest tests/ -x
# Show local variables on failure
pytest tests/ -l
```
### Debug Specific Test
```bash
# Run single test with full output
pytest tests/test_scraper_features.py::test_detect_language -vv -s
# With debugger
pytest tests/test_scraper_features.py::test_detect_language --pdb
```
### Check Package Installation
```bash
# Verify package is installed
pip list | grep skill-seekers
# Check installation mode (should show editable location)
pip show skill-seekers
# Verify imports work
python -c "import skill_seekers; print(skill_seekers.__version__)"
# Check CLI entry points
which skill-seekers
skill-seekers --version
```
### Common Error Messages & Solutions
**"ModuleNotFoundError: No module named 'skill_seekers'"**
→ **Solution:** `pip install -e .`
→ **Why:** src/ layout requires package installation
**"403 Forbidden" from GitHub API**
→ **Solution:** This usually means the rate limit was hit. Set `GITHUB_TOKEN` or run `skill-seekers config --github`
→ **Check limit:** `curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit`
**"SKILL.md enhancement failed"**
→ **Solution:** Check if `ANTHROPIC_API_KEY` is set, or use `--mode LOCAL`
→ **Monitor:** `skill-seekers enhance-status output/react/ --watch`
**"No such file or directory: 'configs/myconfig.json'"**
→ **Solution:** Config path resolution order:
1. Exact path as provided
2. `./configs/` (current directory)
3. `~/.config/skill-seekers/configs/` (user config)
4. SkillSeekersWeb.com API (presets)
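The resolution order above could be sketched as a simple "first existing candidate wins" lookup. The function name and signature are hypothetical, and step 4 (the preset API) is only signalled by returning `None` rather than performing a network call.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(
    name: str,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first existing candidate, or None to signal an API lookup."""
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    filename = Path(name).name
    candidates = [
        Path(name),                                               # 1. exact path
        cwd / "configs" / filename,                               # 2. ./configs/
        home / ".config" / "skill-seekers" / "configs" / filename,  # 3. user config
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # 4. caller would fall back to the preset API
```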
**"pytest: command not found"**
→ **Solution:** Install dev dependencies
```bash
pip install pytest pytest-asyncio pytest-cov coverage
# Or: pip install -e ".[dev]" (if available)
```
**"ruff: command not found"**
→ **Solution:** Install ruff
```bash
pip install ruff
# Or use uvx: uvx ruff check src/
```
### Debugging Scraping Issues
**No content extracted?**
```python
# Test selectors in Python
from bs4 import BeautifulSoup
import requests
url = "https://docs.example.com/page"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.content, 'html.parser')
# Try different selectors; None means the selector matched nothing
for selector in ('article', 'main', 'div[role="main"]', '.documentation-content'):
    print(selector, '->', soup.select_one(selector))
```
**Categories not working?**
- Check `categories` in config has correct keywords
- Run with `--dry-run` to see categorization without scraping
- Enable debug mode: `export SKILL_SEEKERS_DEBUG=1`
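To reason about why a page lands in the wrong category, it can help to picture keyword matching as a simple scoring pass. The sketch below assumes `categories` maps a category name to a list of keywords and that the best match wins; the function name, the scoring rule, and the `"other"` fallback are assumptions, not the scraper's actual algorithm.

```python
from typing import Dict, List


def categorize(url: str, title: str, categories: Dict[str, List[str]]) -> str:
    """Return the category whose keywords appear most often in the URL/title."""
    haystack = f"{url} {title}".lower()
    best, best_score = "other", 0
    for name, keywords in categories.items():
        score = sum(1 for kw in keywords if kw.lower() in haystack)
        if score > best_score:
            best, best_score = name, score
    return best
```

Under this model, a keyword that never appears in any URL or title contributes nothing, which is the first thing to check in a misbehaving config.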
### Profiling Performance
```bash
# Profile scraping performance
python -m cProfile -o profile.stats -m skill_seekers.cli.doc_scraper --config configs/react.json --max-pages 10
# Analyze profile
python -m pstats profile.stats
# In pstats shell:
# > sort cumtime
# > stats 20
```
## 📖 Additional Documentation
**Official Website:**
- [SkillSeekersWeb.com](https://skillseekersweb.com/) - Browse 24+ preset configs, share configs, complete documentation
**For Users:**
- [README.md](README.md) - Complete user documentation
- [BULLETPROOF_QUICKSTART.md](BULLETPROOF_QUICKSTART.md) - Beginner guide
- [TROUBLESHOOTING.md](TROUBLESHOOTING.md) - Common issues
**For Developers:**
- [CHANGELOG.md](CHANGELOG.md) - Release history
- [ROADMAP.md](ROADMAP.md) - 136 tasks across 10 categories
- [docs/UNIFIED_SCRAPING.md](docs/UNIFIED_SCRAPING.md) - Multi-source scraping
- [docs/MCP_SETUP.md](docs/MCP_SETUP.md) - MCP server setup
- [docs/ENHANCEMENT_MODES.md](docs/ENHANCEMENT_MODES.md) - AI enhancement modes
- [docs/PATTERN_DETECTION.md](docs/PATTERN_DETECTION.md) - C3.1 pattern detection
- [docs/THREE_STREAM_STATUS_REPORT.md](docs/THREE_STREAM_STATUS_REPORT.md) - Three-stream architecture
- [docs/MULTI_LLM_SUPPORT.md](docs/MULTI_LLM_SUPPORT.md) - Multi-platform support
## 🎓 Understanding the Codebase
### Why src/ Layout?
Modern Python packaging practice (pairs well with PEP 517/518 builds via `pyproject.toml`):
- Prevents accidental imports from repo root
- Forces proper package installation
- Better isolation between package and tests
- Required: `pip install -e .` before running tests
### Why Platform Adaptors?
Strategy pattern benefits:
- Single codebase supports 16 platforms (4 in v2.x)
- Platform-specific optimizations (format, APIs, models)
- Easy to add new platforms (implement BaseAdaptor)
- Clean separation of concerns
- Testable in isolation
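A compact illustration of the strategy pattern described above: one abstract base, one subclass per platform, and a factory that picks the implementation by name. Only the `BaseAdaptor` name comes from this document; the `format_skill` method, the `PlainTextAdaptor` class, and the registry are hypothetical and kept deliberately small.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class BaseAdaptor(ABC):
    """Every platform implements the same small interface."""

    @abstractmethod
    def format_skill(self, title: str, body: str) -> str:
        """Render a skill in the platform's preferred format."""


class MarkdownAdaptor(BaseAdaptor):
    def format_skill(self, title: str, body: str) -> str:
        return f"# {title}\n\n{body}\n"


class PlainTextAdaptor(BaseAdaptor):  # hypothetical platform, for illustration
    def format_skill(self, title: str, body: str) -> str:
        return f"{title.upper()}\n{body}\n"


_REGISTRY: Dict[str, Type[BaseAdaptor]] = {
    "markdown": MarkdownAdaptor,
    "plaintext": PlainTextAdaptor,
}


def get_adaptor(target: str) -> BaseAdaptor:
    """Factory: look up the adaptor class by target name and instantiate it."""
    try:
        return _REGISTRY[target]()
    except KeyError:
        raise ValueError(f"unknown target: {target!r}") from None
```

Adding a platform in this sketch means writing one subclass and one registry entry; calling code such as `get_adaptor("markdown").format_skill(...)` never changes.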
### Why Git-style CLI?
User experience benefits:
- Familiar to developers (like `git`)
- Single entry point: `skill-seekers`
- Backward compatible: individual tools still work
- Cleaner than multiple separate commands
- Easier to document and teach
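A Git-style CLI of this kind is commonly built with `argparse` subparsers. The sketch below wires up two of the subcommands named in this document; which flags are required, and the help strings, are assumptions rather than the project's actual parser definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a single entry point with one subparser per subcommand."""
    parser = argparse.ArgumentParser(prog="skill-seekers")
    subparsers = parser.add_subparsers(dest="command", required=True)

    scrape = subparsers.add_parser("scrape", help="scrape a documentation site")
    scrape.add_argument("--config", required=True)

    github = subparsers.add_parser("github", help="analyze a GitHub repository")
    github.add_argument("--repo", required=True)
    return parser
```

After parsing, `args.command` tells the dispatcher which tool to run, so each legacy tool can keep its own entry point while sharing the top-level command.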
### Three-Stream GitHub Architecture
The `unified_codebase_analyzer.py` splits GitHub repositories into three independent streams:
**Stream 1: Code Analysis** (C3.x features)
- Deep AST parsing (9 languages)
- Design pattern detection (C3.1)
- Test example extraction (C3.2)
- How-to guide generation (C3.3)
- Configuration extraction (C3.4)
- Architectural overview (C3.5)
- API reference + dependency graphs
**Stream 2: Documentation**
- README, CONTRIBUTING, LICENSE
- docs/ directory markdown files
- Wiki pages (if available)
- CHANGELOG and version history
**Stream 3: Community Insights**
- GitHub metadata (stars, forks, watchers)
- Issue analysis (top problems and solutions)
- PR trends and contributor stats
- Release history
- Label-based topic detection
**Key Benefits:**
- Unified interface for GitHub URLs and local paths
- Analysis depth control: 'basic' (1-2 min) or 'c3x' (20-60 min)
- Enhanced router generation with GitHub context
- Smart keyword extraction weighted by GitHub labels (2x weight)
- 81 E2E tests passing (0.44 seconds)
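The "2x weight" for GitHub labels can be pictured as a weighted count: each keyword found in the docs scores 1, and each matching label adds 2. The function below is a sketch under that assumption; its name, its inputs, and the exact scoring are hypothetical and may not match the analyzer's implementation.

```python
from collections import Counter
from typing import Iterable, List


def weighted_keywords(
    doc_keywords: Iterable[str],
    github_labels: Iterable[str],
    label_weight: float = 2.0,
    top_n: int = 5,
) -> List[str]:
    """Rank keywords, counting each GitHub label label_weight times."""
    scores: Counter = Counter()
    for keyword in doc_keywords:
        scores[keyword.lower()] += 1.0
    for label in github_labels:
        scores[label.lower()] += label_weight
    return [keyword for keyword, _ in scores.most_common(top_n)]
```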
## 🔧 Helper Scripts
The `scripts/` directory contains utility scripts:
```bash
# Bootstrap skill generation - self-hosting skill-seekers as a Claude skill
./scripts/bootstrap_skill.sh
# Start MCP server for HTTP transport
./scripts/start_mcp_server.sh
# Script templates are in scripts/skill_header.md
```
**Bootstrap Skill Workflow:**
1. Analyzes skill-seekers codebase itself (dogfooding)
2. Combines handcrafted header with auto-generated analysis
3. Validates SKILL.md structure
4. Outputs ready-to-use skill for Claude Code
## 🔍 Performance Characteristics
| Operation | Time | Notes |
|-----------|------|-------|
| Scraping (sync) | 15-45 min | First time, thread-based |
| Scraping (async) | 5-15 min | 2-3x faster with `--async` |
| Building | 1-3 min | Fast rebuild from cache |
| Re-building | <1 min | With `--skip-scrape` |
| Enhancement (LOCAL) | 30-60 sec | Uses Claude Code Max |
| Enhancement (API) | 20-40 sec | Requires API key |
| Packaging | 5-10 sec | Final .zip creation |
## 🎉 Recent Achievements
**v3.1.4 (Unreleased) - "Selector Fallback & Dry-Run Fix":**
- 🐛 **Issue #300: `create https://reactflow.dev/` only found 1 page** — Now finds 20+ pages
- 🔧 **Centralized selector fallback** — `FALLBACK_MAIN_SELECTORS` constant + `_find_main_content()` helper replace 3 duplicated fallback loops
- 🔗 **Link extraction before early return** — `extract_content()` now discovers links even when no content selector matches
- 🔍 **Dry-run full-page link discovery** — Both sync and async dry-run paths extract links from the full page (was main-content-only or missing entirely)
- 🛣️ **Smart `create --config` routing** — Peeks at JSON to route `base_url` configs to doc_scraper and `sources` configs to unified_scraper
- 🧹 **Removed `body` fallback** — `body` matched everything, hiding real selector failures
- ✅ **Pre-existing test fixes** — `test_auto_fetch_enabled` (react.json exists locally) and `test_mcp_validate_legacy_config` (react.json is now unified format)
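The `create --config` routing described above (peek at the JSON, then pick a scraper) could look roughly like this. The key names `base_url` and `sources` come from the release note; the function name, the return values, and the precedence when both keys are present are assumptions.

```python
import json
from pathlib import Path


def route_config(config_path: str) -> str:
    """Return which scraper a config should go to, based on its top-level keys."""
    data = json.loads(Path(config_path).read_text(encoding="utf-8"))
    if "sources" in data:
        return "unified_scraper"  # assumed to win if both keys exist
    if "base_url" in data:
        return "doc_scraper"
    raise ValueError("config has neither 'sources' nor 'base_url'")
```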
**v3.1.3 (Released) - "Unified Argument Interface":**
- 🔧 **Unified Scraper Arguments** - All scrapers (scrape, github, analyze, pdf) now share a common argument contract via `add_all_standard_arguments(parser)` in `arguments/common.py`
- 🐛 **Fix `create` Argument Forwarding** - `create <url> --dry-run`, `create owner/repo --dry-run`, `create ./path --dry-run` all work now (previously crashed)
- 🏗️ **Argument Deduplication** - Removed duplicated arg definitions from github.py, scrape.py, analyze.py, pdf.py; all import shared args
- **New Flags** - GitHub and PDF scrapers gain `--dry-run`, `--verbose`, `--quiet`; analyze gains `--name`, `--description`, `--quiet`
- 🔀 **Route-Specific Forwarding** - `create` command's `_add_common_args()` now only forwards universal flags; route-specific flags moved to their respective methods
**v3.1.0 - "Unified CLI & Developer Experience":**
- 🎯 **Unified `create` Command** - Auto-detects source type (web/GitHub/local/PDF/config)
- 📋 **Progressive Disclosure Help** - Default shows 13 universal flags, detailed help available per source
- ⚡ **-p Shortcut** - Quick preset selection (`-p quick|standard|comprehensive`)
- 🔧 **Enhancement Flag Consolidation** - `--enhance-level` (0-3) replaces 3 separate flags
- 🎨 **Smart Source Detection** - No need to specify whether input is URL, repo, or directory
- 🔄 **Enhancement Workflow Presets** - YAML-based presets; `skill-seekers workflows list/show/copy/add/remove/validate`; bundled presets: `default`, `minimal`, `security-focus`, `architecture-comprehensive`, `api-documentation`
- 🔀 **Multiple Workflows from CLI** - `--enhance-workflow wf-a --enhance-workflow wf-b` chains presets in a single command; `workflows copy/add/remove` all accept multiple names/files at once
- 🐛 **Bug Fix** - `create` command now correctly forwards multiple `--enhance-workflow` flags to sub-scrapers
- ✅ **2,121 Tests Passing** - All CLI refactor + workflow preset work verified
- 📚 **Improved Documentation** - CLAUDE.md, README, QUICK_REFERENCE updated with workflow preset details
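Accepting a flag several times and forwarding each occurrence is typically done with `argparse`'s `append` action. The sketch below shows the idea for `--enhance-workflow`; the parser layout and the `forward_workflows` helper are hypothetical and stand in for whatever `create` does internally.

```python
import argparse
from typing import List


def build_create_parser() -> argparse.ArgumentParser:
    """Parser that collects every --enhance-workflow occurrence into a list."""
    parser = argparse.ArgumentParser(prog="skill-seekers create")
    parser.add_argument("source")
    parser.add_argument(
        "--enhance-workflow",
        action="append",  # each occurrence is appended, preserving order
        default=[],
        dest="enhance_workflows",
    )
    return parser


def forward_workflows(workflows: List[str]) -> List[str]:
    """Rebuild the repeated flag for a sub-scraper's argument list."""
    forwarded: List[str] = []
    for name in workflows:
        forwarded += ["--enhance-workflow", name]
    return forwarded
```

Forwarding one flag per workflow, rather than only the last value, is what keeps chained presets intact when `create` delegates to a sub-scraper.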
**v3.1.0 CI Stability (February 20, 2026):**
- 🔧 **Dependency Alignment** - Fixed MCP version mismatch between requirements.txt (was 1.18.0) and pyproject.toml (>=1.25)
- 📦 **PyYAML Core Dependency** - Added PyYAML>=6.0 to core dependencies (required by workflow_tools.py module-level import)
- ⚡ **Benchmark Stability** - Relaxed timing-sensitive test thresholds for CI environment variability
- ✅ **2,121 Tests Passing** - All CI matrix jobs passing (ubuntu 3.10/3.11/3.12, macos 3.11/3.12)
**v3.0.0 (February 10, 2026) - "Universal Intelligence Platform":**
- 🚀 **16 Platform Adaptors** - RAG frameworks (LangChain, LlamaIndex, Haystack), vector DBs (Chroma, FAISS, Weaviate, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), LLM platforms (Claude, Gemini, OpenAI)
- 🛠️ **26 MCP Tools** (up from 18) - Complete automation for any AI system
- ✅ **1,852 Tests Passing** (up from 700+) - Production-grade reliability
- ☁️ **Cloud Storage** - S3, GCS, Azure Blob Storage integration
- 🎯 **AI Coding Assistants** - Persistent context for Cursor, Windsurf, Cline, Continue.dev
- 📊 **Quality Metrics** - Automated completeness scoring and content analysis
- 🌐 **Multilingual Support** - Language detection and translation
- 🔄 **Streaming Ingest** - Real-time data processing pipelines
- 📈 **Benchmarking Tools** - Performance comparison and CI/CD integration
- 🔧 **Setup Wizard** - Interactive first-time configuration
- 📦 **12 Example Projects** - Complete working examples for every integration
- 📚 **18 Integration Guides** - Comprehensive documentation for all platforms
**v2.9.0 (February 3, 2026):**
- **C3.10: Signal Flow Analysis** - Complete signal flow analysis for Godot projects
- Comprehensive Godot 4.x support (GDScript, .tscn, .tres, .gdshader files)
- GDScript test extraction (GUT, gdUnit4, WAT frameworks)
- Signal pattern detection (EventBus, Observer, Event Chains)
- Signal-based how-to guides generation
**v2.8.0 (February 1, 2026):**
- C3.9: Project Documentation Extraction
- Granular AI enhancement control with `--enhance-level` (0-3)
**v2.7.1 (January 18, 2026 - Hotfix):**
- 🚨 **Critical Bug Fix:** Config download 404 errors resolved
- Fixed manual URL construction bug - now uses `download_url` from API response
- All 15 source tools tests + 8 fetch_config tests passing
**v2.7.0 (January 18, 2026):**
- 🔐 **Smart Rate Limit Management** - Multi-token GitHub configuration system
- 🧙 **Interactive Configuration Wizard** - Beautiful terminal UI (`skill-seekers config`)
- 🚦 **Intelligent Rate Limit Handler** - Four strategies (prompt/wait/switch/fail)
- 📥 **Resume Capability** - Continue interrupted jobs with progress tracking
- 🔧 **CI/CD Support** - Non-interactive mode for automation
- 🎯 **Bootstrap Skill** - Self-hosting skill-seekers as Claude Code skill
**v2.6.0 (January 14, 2026):**
- **C3.x Codebase Analysis Suite Complete** (C3.1-C3.8)
- Multi-platform support with platform adaptor architecture (4 platforms)
- 18 MCP tools fully functional
- 700+ tests passing
- Unified multi-source scraping maturity
**C3.x Series (Complete - Code Analysis Features):**
- **C3.1:** Design pattern detection (10 GoF patterns, 9 languages, 87% precision)
- **C3.2:** Test example extraction (5 categories, AST-based for Python)
- **C3.3:** How-to guide generation with AI enhancement (5 improvements)
- **C3.4:** Configuration pattern extraction (env vars, config files, CLI args)
- **C3.5:** Architectural overview & router skill generation
- **C3.6:** AI enhancement for patterns and test examples (Claude API integration)
- **C3.7:** Architectural pattern detection (8 patterns, framework-aware)
- **C3.8:** Standalone codebase scraper (300+ line SKILL.md from code alone)
- **C3.9:** Project documentation extraction (markdown categorization, AI enhancement)
- **C3.10:** Signal flow analysis (Godot event-driven architecture, pattern detection)
**v2.5.2:**
- UX Improvement: Analysis features are now ON by default; opt out with `--skip-*` flags (BREAKING)
- Router quality improvements: 6.5/10 → 8.5/10 (+31%)
- All 107 codebase analysis tests passing
**v2.5.0:**
- Multi-platform support (Claude, Gemini, OpenAI, Markdown)
- Platform adaptor architecture
- 18 MCP tools (up from 9)
- Complete feature parity across platforms
**v2.1.0:**
- Unified multi-source scraping (docs + GitHub + PDF)
- Conflict detection between sources
- 427 tests passing
**v1.0.0:**
- Production release with MCP integration
- Documentation scraping with smart categorization
- 12 preset configurations