yusyus 6cb446d213 docs: Add 5 vector database integration guides (HAYSTACK, WEAVIATE, CHROMA, FAISS, QDRANT)
- Add HAYSTACK.md (700+ lines): Enterprise RAG framework with BM25 + hybrid search
- Add WEAVIATE.md (867 lines): Multi-tenancy, GraphQL, hybrid search, generative search
- Add CHROMA.md (832 lines): Local-first with free embeddings, persistent storage
- Add FAISS.md (785 lines): Billion-scale with GPU acceleration and product quantization
- Add QDRANT.md (746 lines): High-performance Rust engine with rich filtering

All guides follow proven 11-section pattern:
- Problem/Solution/Quick Start/Setup/Advanced/Best Practices
- Real-world examples (100-200 lines working code)
- Troubleshooting sections
- Before/After comparisons

Total: ~3,930 lines of comprehensive integration documentation

Test results:
- 26/26 tests passing for new features (RAG chunker + Haystack adaptor)
- 108 total tests passing (100%)
- 0 failures

This completes all optional integration guides from ACTION_PLAN.md.
Universal preprocessor positioning now covers:
- RAG Frameworks: LangChain, LlamaIndex, Haystack (3/3)
- Vector Databases: Pinecone, Weaviate, Chroma, FAISS, Qdrant (5/5)
- AI Coding Tools: Cursor, Windsurf, Cline, Continue.dev (4/4)
- Chat Platforms: Claude, Gemini, ChatGPT (3/3)

Total: 15 integration guides across 4 categories (+50% coverage)

Ready for v2.10.0 release.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 21:34:28 +03:00


AGENTS.md - Skill Seekers

This file provides essential guidance for AI coding agents working with the Skill Seekers codebase.


Project Overview

Skill Seekers is a Python CLI tool that converts documentation websites, GitHub repositories, and PDF files into AI-ready skills for LLM platforms and RAG (Retrieval-Augmented Generation) pipelines. It is intended as a general-purpose preprocessing layer: one scrape can be packaged for any of the target platforms listed below.

Supported Target Platforms

Platform          Format               Use Case
Claude AI         ZIP + YAML           Claude Code skills
Google Gemini     tar.gz               Gemini skills
OpenAI ChatGPT    ZIP + Vector Store   Custom GPTs
LangChain         Documents            QA chains, agents, retrievers
LlamaIndex        TextNodes            Query engines, chat engines
Haystack          Documents            Enterprise RAG pipelines
Pinecone          Ready for upsert     Production vector search
Weaviate          Vector objects       Vector database
Qdrant            Points               Vector database
Chroma            Documents            Local vector database
FAISS             Index files          Local similarity search
Cursor IDE        .cursorrules         AI coding assistant rules
Windsurf          .windsurfrules       AI coding rules
Generic           Markdown ZIP         Universal export

Current Version: 2.9.0
Python Version: 3.10+ required
License: MIT
Website: https://skillseekersweb.com/
Repository: https://github.com/yusufkaraaslan/Skill_Seekers

Core Workflow

  1. Scrape Phase - Crawl documentation/GitHub/PDF sources
  2. Build Phase - Organize content into categorized references
  3. Enhancement Phase - AI-powered quality improvements (optional)
  4. Package Phase - Create platform-specific packages
  5. Upload Phase - Auto-upload to target platform (optional)
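
The five phases above can be pictured as an ordered pipeline in which only the phases marked optional may be skipped. The sketch below is illustrative only: `Phase`, `run_pipeline`, and the placeholder lambdas are hypothetical names, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Phase:
    """One step of the workflow (hypothetical structure for illustration)."""
    name: str
    run: Callable[[dict], dict]
    optional: bool = False


# Placeholder phases; each returns a new state dict instead of doing real work.
PHASES = [
    Phase("scrape", lambda s: {**s, "pages": ["intro", "api"]}),
    Phase("build", lambda s: {**s, "references": sorted(s["pages"])}),
    Phase("enhance", lambda s: {**s, "enhanced": True}, optional=True),
    Phase("package", lambda s: {**s, "package": "skill.zip"}),
    Phase("upload", lambda s: {**s, "uploaded": True}, optional=True),
]


def run_pipeline(
    phases: List[Phase], state: dict, skip: Iterable[str] = ()
) -> Tuple[List[str], dict]:
    """Run phases in order; only optional phases may appear in `skip`."""
    skipped = set(skip)
    executed = []
    for phase in phases:
        if phase.name in skipped:
            if not phase.optional:
                raise ValueError(f"{phase.name} is required and cannot be skipped")
            continue
        state = phase.run(state)
        executed.append(phase.name)
    return executed, state
```

Calling `run_pipeline(PHASES, {}, skip={"enhance", "upload"})` runs only the three required phases, which mirrors a local build with no AI enhancement and no upload.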

Project Structure

Skill_Seekers/
├── src/skill_seekers/              # Main source code (src/ layout)
│   ├── cli/                        # CLI tools and commands
│   │   ├── adaptors/               # Platform adaptors (Strategy pattern)
│   │   │   ├── base.py             # Abstract base class
│   │   │   ├── claude.py           # Claude AI adaptor
│   │   │   ├── gemini.py           # Google Gemini adaptor
│   │   │   ├── openai.py           # OpenAI ChatGPT adaptor
│   │   │   ├── markdown.py         # Generic Markdown adaptor
│   │   │   ├── chroma.py           # Chroma vector DB adaptor
│   │   │   ├── faiss_helpers.py    # FAISS index adaptor
│   │   │   ├── haystack.py         # Haystack RAG adaptor
│   │   │   ├── langchain.py        # LangChain adaptor
│   │   │   ├── llama_index.py      # LlamaIndex adaptor
│   │   │   ├── qdrant.py           # Qdrant vector DB adaptor
│   │   │   ├── weaviate.py         # Weaviate vector DB adaptor
│   │   │   └── streaming_adaptor.py # Streaming output adaptor
│   │   ├── storage/                # Cloud storage backends
│   │   │   ├── base_storage.py     # Storage interface
│   │   │   ├── s3_storage.py       # AWS S3 support
│   │   │   ├── gcs_storage.py      # Google Cloud Storage
│   │   │   └── azure_storage.py    # Azure Blob Storage
│   │   ├── main.py                 # Unified CLI entry point
│   │   ├── doc_scraper.py          # Documentation scraper
│   │   ├── github_scraper.py       # GitHub repository scraper
│   │   ├── pdf_scraper.py          # PDF extraction
│   │   ├── unified_scraper.py      # Multi-source scraping
│   │   ├── codebase_scraper.py     # Local codebase analysis
│   │   ├── enhance_skill_local.py  # AI enhancement (local mode)
│   │   ├── package_skill.py        # Skill packager
│   │   ├── upload_skill.py         # Upload to platforms
│   │   ├── cloud_storage_cli.py    # Cloud storage CLI
│   │   ├── benchmark_cli.py        # Benchmarking CLI
│   │   ├── sync_cli.py             # Sync monitoring CLI
│   │   └── ...                     # 70+ CLI modules
│   ├── mcp/                        # MCP server integration
│   │   ├── server_fastmcp.py       # FastMCP server (main)
│   │   ├── server_legacy.py        # Legacy server implementation
│   │   ├── server.py               # Server entry point
│   │   └── tools/                  # MCP tool implementations
│   │       ├── config_tools.py     # Configuration tools
│   │       ├── scraping_tools.py   # Scraping tools
│   │       ├── packaging_tools.py  # Packaging tools
│   │       ├── source_tools.py     # Source management tools
│   │       ├── splitting_tools.py  # Config splitting tools
│   │       └── vector_db_tools.py  # Vector database tools
│   ├── sync/                       # Sync monitoring module
│   │   ├── detector.py             # Change detection
│   │   ├── models.py               # Data models
│   │   ├── monitor.py              # Monitoring logic
│   │   └── notifier.py             # Notification system
│   ├── benchmark/                  # Benchmarking framework
│   │   ├── framework.py            # Benchmark framework
│   │   ├── models.py               # Benchmark models
│   │   └── runner.py               # Benchmark runner
│   └── embedding/                  # Embedding server
│       ├── server.py               # FastAPI embedding server
│       ├── generator.py            # Embedding generation
│       ├── cache.py                # Embedding cache
│       └── models.py               # Embedding models
├── tests/                          # Test suite (83 test files)
├── configs/                        # Preset configuration files
├── docs/                           # Documentation (80+ markdown files)
├── .github/workflows/              # CI/CD workflows
├── pyproject.toml                  # Main project configuration
├── requirements.txt                # Pinned dependencies
├── Dockerfile                      # Main Docker image
├── Dockerfile.mcp                  # MCP server Docker image
└── docker-compose.yml              # Full stack deployment

Build and Development Commands

Setup (REQUIRED before any development)

# Install in editable mode (REQUIRED for tests due to src/ layout)
pip install -e .

# Install with all platform dependencies
pip install -e ".[all-llms]"

# Install with all optional dependencies
pip install -e ".[all]"

# Install specific platforms only
pip install -e ".[gemini]"    # Google Gemini support
pip install -e ".[openai]"    # OpenAI ChatGPT support
pip install -e ".[mcp]"       # MCP server dependencies
pip install -e ".[s3]"        # AWS S3 support
pip install -e ".[gcs]"       # Google Cloud Storage
pip install -e ".[azure]"     # Azure Blob Storage
pip install -e ".[embedding]" # Embedding server support

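# Install dev dependencies (using dependency-groups)
pip install -e ".[dev]"
# If "dev" is declared only under [dependency-groups] (PEP 735) and not as an
# extra, pip 25.1+ installs the group with: pip install --group dev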

CRITICAL: The project uses a src/ layout. Tests WILL FAIL unless you install with pip install -e . first.

Building

# Build package using uv (recommended)
uv build

# Or using standard build
python -m build

# Publish to PyPI
uv publish

Docker

# Build Docker image
docker build -t skill-seekers .

# Run with docker-compose (includes vector databases)
docker-compose up -d

# Run MCP server only
docker-compose up -d mcp-server

Running Tests

CRITICAL: Never skip tests - all tests must pass before commits.

# All tests (must run pip install -e . first!)
pytest tests/ -v

# Specific test file
pytest tests/test_scraper_features.py -v
pytest tests/test_mcp_fastmcp.py -v
pytest tests/test_cloud_storage.py -v

# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html

# Single test
pytest tests/test_scraper_features.py::test_detect_language -v

# E2E tests
pytest tests/test_e2e_three_stream_pipeline.py -v

# Skip slow tests
pytest tests/ -v -m "not slow"

# Run only integration tests
pytest tests/ -v -m integration

Test Architecture:

  • 83 test files covering all features
  • CI Matrix: Ubuntu + macOS, Python 3.10-3.12
  • 1200+ tests passing
  • Test markers: slow, integration, e2e, venv, bootstrap

Code Style Guidelines

Linting and Formatting

# Run ruff linter
ruff check src/ tests/

# Run ruff formatter check
ruff format --check src/ tests/

# Auto-fix issues
ruff check src/ tests/ --fix
ruff format src/ tests/

# Run mypy type checker
mypy src/skill_seekers --show-error-codes --pretty

Style Rules (from pyproject.toml)

  • Line length: 100 characters
  • Target Python: 3.10+
  • Enabled rules: E, W, F, I, B, C4, UP, ARG, SIM
  • Ignored rules: E501, F541, ARG002, B007, I001, SIM114
  • Import sorting: isort style with skill_seekers as first-party

Code Conventions

  1. Use type hints where practical (gradual typing approach)
  2. Docstrings: Use Google-style or standard docstrings
  3. Error handling: Use specific exceptions, provide helpful messages
  4. Async code: Use asyncio, mark tests with @pytest.mark.asyncio
  5. File naming: Use snake_case for all Python files
  6. MyPy configuration: Lenient gradual typing (see mypy.ini)

Architecture Patterns

Platform Adaptor Pattern (Strategy Pattern)

All platform-specific logic is encapsulated in adaptors:

import os

from skill_seekers.cli.adaptors import get_adaptor

# Get platform-specific adaptor
adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'langchain', etc.

# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')

# Upload to platform
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY')
)

CLI Architecture (Git-style)

Entry point: src/skill_seekers/cli/main.py

The CLI uses subcommands that delegate to existing modules:

# skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
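
A minimal sketch of this delegation style, assuming each module exposes a `main()` that reads `sys.argv` itself. The `COMMANDS` table, `dispatch()`, and the stand-in `doc_scraper_main()` are hypothetical and only illustrate the idea; the real logic lives in main.py.

```python
import sys
from typing import Callable, Dict, List, Tuple


def doc_scraper_main() -> int:
    """Stand-in for a module-level main() that parses sys.argv on its own."""
    print(f"doc_scraper received: {sys.argv[1:]}")
    return 0


# Hypothetical registry: subcommand -> (program name, entry point).
COMMANDS: Dict[str, Tuple[str, Callable[[], int]]] = {
    "scrape": ("doc_scraper.py", doc_scraper_main),
}


def dispatch(argv: List[str]) -> int:
    """Route `skill-seekers <subcommand> ...` to the matching main()."""
    if not argv or argv[0] not in COMMANDS:
        print("usage: skill-seekers <subcommand> [args...]", file=sys.stderr)
        return 2
    prog, entry = COMMANDS[argv[0]]
    saved = sys.argv
    # Drop the subcommand so the delegate sees the arguments it expects.
    sys.argv = [prog, *argv[1:]]
    try:
        return entry()
    finally:
        sys.argv = saved  # restore the caller's argv
```

Restoring `sys.argv` in the `finally` block keeps the rewrite from leaking into later code, which matters when the dispatcher is called from tests.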

Available subcommands:

  • config - Configuration wizard
  • scrape - Documentation scraping
  • github - GitHub repository scraping
  • pdf - PDF extraction
  • unified - Multi-source scraping
  • analyze / codebase - Local codebase analysis
  • enhance - AI enhancement
  • package - Package skill for target platform
  • upload - Upload to platform
  • cloud - Cloud storage operations
  • sync - Sync monitoring
  • benchmark - Performance benchmarking
  • embed - Embedding server
  • install / install-agent - Complete workflow

MCP Server Architecture

Two implementations:

  • server_fastmcp.py - Modern, decorator-based (recommended)
  • server_legacy.py - Legacy implementation

Tools are organized by category:

  • Config tools (3 tools)
  • Scraping tools (8 tools)
  • Packaging tools (4 tools)
  • Source tools (4 tools)
  • Splitting tools (2 tools)
  • Vector DB tools (multiple)

Cloud Storage Architecture

Abstract base class pattern for cloud providers:

  • base_storage.py - Defines CloudStorage interface
  • s3_storage.py - AWS S3 implementation
  • gcs_storage.py - Google Cloud Storage implementation
  • azure_storage.py - Azure Blob Storage implementation

Testing Instructions

Test Categories

Marker        Description
slow          Tests taking >5 seconds
integration   Requires external services (APIs)
e2e           End-to-end tests (resource-intensive)
venv          Requires virtual environment setup
bootstrap     Bootstrap skill specific

Running Specific Test Categories

# Skip slow tests
pytest tests/ -v -m "not slow"

# Run only integration tests
pytest tests/ -v -m integration

# Run E2E tests
pytest tests/ -v -m e2e

Test Configuration (pyproject.toml)

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "-v --tb=short --strict-markers"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"

Git Workflow

Branch Structure

main (production)
  ↑
  │ (only maintainer merges)
  │
development (integration) ← default branch for PRs
  ↑
  │ (all contributor PRs go here)
  │
feature branches

  • main - Production, always stable, protected
  • development - Active development, default for PRs
  • Feature branches - Your work, created from development

Creating a Feature Branch

# 1. Checkout development
git checkout development
git pull upstream development

# 2. Create feature branch
git checkout -b my-feature

# 3. Make changes, commit, push
git add .
git commit -m "Add my feature"
git push origin my-feature

# 4. Create PR targeting 'development' branch

CI/CD Configuration

GitHub Actions Workflows

.github/workflows/tests.yml:

  • Runs on: push/PR to main and development
  • Lint job: Ruff + MyPy
  • Test matrix: Ubuntu + macOS, Python 3.10-3.12
  • Coverage: Uploads to Codecov

.github/workflows/release.yml:

  • Triggered on version tags (v*)
  • Builds and publishes to PyPI using uv
  • Creates GitHub release with changelog

.github/workflows/docker-publish.yml:

  • Builds and publishes Docker images

.github/workflows/vector-db-export.yml:

  • Tests vector database exports

.github/workflows/scheduled-updates.yml:

  • Scheduled sync monitoring

Pre-commit Checks (Manual)

# Before committing, run:
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x  # Stop on first failure

Security Considerations

API Keys and Secrets

  1. Never commit API keys to the repository
  2. Use environment variables:
    • ANTHROPIC_API_KEY - Claude AI
    • GOOGLE_API_KEY - Google Gemini
    • OPENAI_API_KEY - OpenAI
    • GITHUB_TOKEN - GitHub API
    • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - AWS S3
    • GOOGLE_APPLICATION_CREDENTIALS - GCS
    • AZURE_STORAGE_CONNECTION_STRING - Azure
  3. Configuration storage:
    • Stored at ~/.config/skill-seekers/config.json
    • Permissions: 600 (owner read/write only)
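
As a rough illustration of the storage rule above, the sketch below writes a JSON config with mode 600 and reads a key back, preferring the environment variable. `save_config` and `load_api_key` are hypothetical helpers, and the environment-first lookup order is an assumption, not documented behaviour.

```python
import json
import os
from pathlib import Path
from typing import Optional


def save_config(data: dict, path: Path) -> None:
    """Write JSON config readable and writable by the owner only (mode 600)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Request 0o600 at creation time so a new file is never group/world readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    # A pre-existing file keeps its old mode, so tighten it explicitly.
    os.chmod(path, 0o600)


def load_api_key(name: str, path: Path) -> Optional[str]:
    """Return the key from the environment, else from the config file."""
    value = os.environ.get(name)
    if value:
        return value
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")).get(name)
    return None
```

For example, `load_api_key("GITHUB_TOKEN", Path.home() / ".config/skill-seekers/config.json")` returns the exported token when one is set and falls back to the file otherwise.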

Rate Limit Handling

  • GitHub API has rate limits (5000 requests/hour for authenticated requests, 60/hour unauthenticated), so set GITHUB_TOKEN for anything beyond small scrapes
  • The tool has built-in rate limit handling with retry logic
  • Use --non-interactive flag for CI/CD environments
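
Retry logic of this kind is commonly written as exponential backoff. The sketch below shows the general technique only; `RateLimitError` and `call_with_retry` are hypothetical names, and the tool's own handling may differ (for example, by waiting until the reset time GitHub reports).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports an exhausted quota."""


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimitError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Waits grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.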

Custom API Endpoints

Support for Claude-compatible APIs:

export ANTHROPIC_API_KEY=your-custom-api-key
export ANTHROPIC_BASE_URL=https://custom-endpoint.com/v1

Common Development Tasks

Adding a New CLI Command

  1. Create module in src/skill_seekers/cli/my_command.py
  2. Implement main() function with argument parsing
  3. Add entry point in pyproject.toml:
    [project.scripts]
    skill-seekers-my-command = "skill_seekers.cli.my_command:main"
    
  4. Add subcommand handler in src/skill_seekers/cli/main.py
  5. Add tests in tests/test_my_command.py
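
A minimal skeleton for steps 1 and 2, assuming argparse for argument parsing. The flags shown (`--config`, `--verbose`) are placeholders for illustration; mirror the conventions of the existing CLI modules in practice.

```python
"""Hypothetical src/skill_seekers/cli/my_command.py."""
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the parser separately so tests can inspect it."""
    parser = argparse.ArgumentParser(
        prog="skill-seekers-my-command",
        description="Example command (placeholder behaviour).",
    )
    parser.add_argument("--config", required=True, help="Path to a config file")
    parser.add_argument("--verbose", action="store_true", help="Print extra detail")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point. argv=None makes argparse read sys.argv, which is what a
    [project.scripts] console script relies on."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Using config: {args.config}")
    return 0
```

Accepting an explicit `argv` keeps the command testable: `main(["--config", "react.json"])` can be called directly from tests/test_my_command.py without patching `sys.argv`.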

Adding a New Platform Adaptor

  1. Create src/skill_seekers/cli/adaptors/my_platform.py
  2. Inherit from SkillAdaptor base class
  3. Implement required methods: package(), upload(), enhance()
  4. Register in src/skill_seekers/cli/adaptors/__init__.py
  5. Add optional dependencies in pyproject.toml
  6. Add tests in tests/test_adaptors/
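
The steps above might look roughly like the sketch below. The `SkillAdaptor` class here is a simplified stand-in written for this example; the real base class in adaptors/base.py may define different signatures, return types, and extra hooks. `MyPlatformAdaptor`, `_REGISTRY`, and the `my-platform` key are hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Optional, Type


class SkillAdaptor(ABC):
    """Simplified stand-in for the project's abstract base class."""

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> Path: ...

    @abstractmethod
    def upload(self, package_path: str, api_key: Optional[str] = None) -> dict: ...

    @abstractmethod
    def enhance(self, skill_dir: str) -> bool: ...


class MyPlatformAdaptor(SkillAdaptor):
    """Hypothetical adaptor; the method bodies are placeholders."""

    def package(self, skill_dir: str, output_path: str) -> Path:
        # Only computes the target path; a real adaptor would write the archive.
        return Path(output_path) / f"{Path(skill_dir).name}-my-platform.zip"

    def upload(self, package_path: str, api_key: Optional[str] = None) -> dict:
        if not api_key:
            return {"success": False, "message": "API key required"}
        return {"success": True, "message": f"would upload {package_path}"}

    def enhance(self, skill_dir: str) -> bool:
        return False  # this sketch has no enhancement step


# Registration step: map the platform name to its adaptor class.
_REGISTRY: Dict[str, Type[SkillAdaptor]] = {"my-platform": MyPlatformAdaptor}


def get_adaptor(platform: str) -> SkillAdaptor:
    """Look up and instantiate the adaptor for a platform name."""
    try:
        return _REGISTRY[platform]()
    except KeyError:
        raise ValueError(f"Unknown platform: {platform!r}") from None
```

Because callers only use `get_adaptor()` and the three methods, adding a platform does not require changes to the packaging or upload commands.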

Adding an MCP Tool

  1. Implement tool logic in src/skill_seekers/mcp/tools/category_tools.py
  2. Register in src/skill_seekers/mcp/server_fastmcp.py
  3. Add test in tests/test_mcp_fastmcp.py

Adding Cloud Storage Provider

  1. Create module in src/skill_seekers/cli/storage/my_storage.py
  2. Inherit from CloudStorage base class
  3. Implement required methods: upload(), download(), list(), delete()
  4. Register in src/skill_seekers/cli/storage/__init__.py
  5. Add optional dependencies in pyproject.toml
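
A minimal sketch of a provider, assuming the four required methods take local paths and remote keys. The `CloudStorage` class below is a stand-in for the interface in base_storage.py, whose real signatures may differ, and `LocalDirStorage` is a toy backend that treats a directory as the bucket.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class CloudStorage(ABC):
    """Stand-in for the project's storage interface."""

    @abstractmethod
    def upload(self, local_path: str, remote_key: str) -> None: ...

    @abstractmethod
    def download(self, remote_key: str, local_path: str) -> None: ...

    @abstractmethod
    def list(self, prefix: str = "") -> List[str]: ...

    @abstractmethod
    def delete(self, remote_key: str) -> None: ...


class LocalDirStorage(CloudStorage):
    """Toy provider backed by a local directory (illustration only)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def upload(self, local_path: str, remote_key: str) -> None:
        target = self.root / remote_key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, target)

    def download(self, remote_key: str, local_path: str) -> None:
        shutil.copyfile(self.root / remote_key, local_path)

    def list(self, prefix: str = "") -> List[str]:
        # Keys are POSIX-style paths relative to the root, sorted for stable output.
        keys = (
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
        return sorted(k for k in keys if k.startswith(prefix))

    def delete(self, remote_key: str) -> None:
        (self.root / remote_key).unlink()
```

A directory-backed provider like this can also serve as a test double for code that depends on the storage interface, since it needs no credentials.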

Documentation

Project Documentation

  • README.md - Main project documentation
  • README.zh-CN.md - Chinese translation
  • CLAUDE.md - Detailed implementation guidance
  • QUICKSTART.md - Quick start guide
  • CONTRIBUTING.md - Contribution guidelines
  • TROUBLESHOOTING.md - Common issues and solutions
  • docs/ - Comprehensive documentation (80+ files)
    • docs/integrations/ - Integration guides for each platform
    • docs/guides/ - User guides
    • docs/reference/ - API reference
    • docs/features/ - Feature documentation
    • docs/blog/ - Blog posts and articles

Configuration Documentation

Preset configs are in configs/ directory:

  • react.json - React documentation
  • vue.json - Vue.js documentation
  • fastapi.json - FastAPI documentation
  • django.json - Django documentation
  • blender.json / blender-unified.json - Blender
  • godot.json - Godot Engine
  • claude-code.json - Claude Code
  • *_unified.json - Multi-source configs

Key Dependencies

Core Dependencies

  • requests>=2.32.5 - HTTP requests
  • beautifulsoup4>=4.14.2 - HTML parsing
  • PyGithub>=2.5.0 - GitHub API
  • GitPython>=3.1.40 - Git operations
  • httpx>=0.28.1 - Async HTTP
  • anthropic>=0.76.0 - Claude AI API
  • PyMuPDF>=1.24.14 - PDF processing
  • Pillow>=11.0.0 - Image processing
  • pytesseract>=0.3.13 - OCR
  • pydantic>=2.12.3 - Data validation
  • pydantic-settings>=2.11.0 - Settings management
  • click>=8.3.0 - CLI framework
  • Pygments>=2.19.2 - Syntax highlighting
  • pathspec>=0.12.1 - Path matching
  • networkx>=3.0 - Graph operations
  • schedule>=1.2.0 - Scheduled tasks
  • python-dotenv>=1.1.1 - Environment variables
  • jsonschema>=4.25.1 - JSON validation

Optional Dependencies

  • mcp>=1.25,<2 - MCP server
  • google-generativeai>=0.8.0 - Gemini support
  • openai>=1.0.0 - OpenAI support
  • boto3>=1.34.0 - AWS S3
  • google-cloud-storage>=2.10.0 - GCS
  • azure-storage-blob>=12.19.0 - Azure
  • fastapi>=0.109.0 - Embedding server
  • uvicorn>=0.27.0 - ASGI server
  • sentence-transformers>=2.3.0 - Embeddings
  • numpy>=1.24.0 - Numerical computing
  • voyageai>=0.2.0 - Voyage AI embeddings

Dev Dependencies (in dependency-groups)

  • pytest>=8.4.2 - Testing framework
  • pytest-asyncio>=0.24.0 - Async test support
  • pytest-cov>=7.0.0 - Coverage
  • coverage>=7.11.0 - Coverage reporting
  • ruff>=0.14.13 - Linting/formatting
  • mypy>=1.19.1 - Type checking

Troubleshooting

Common Issues

ModuleNotFoundError: No module named 'skill_seekers'

  • Solution: Run pip install -e .

Tests failing with "package not installed"

  • Solution: Ensure you ran pip install -e . in the correct virtual environment

MCP server import errors

  • Solution: Install with pip install -e ".[mcp]"

Type checking failures

  • MyPy is configured to be lenient (gradual typing)
  • Focus on critical paths, not full coverage

Docker build failures

  • Ensure you have BuildKit enabled: DOCKER_BUILDKIT=1
  • Check that all submodules are initialized: git submodule update --init

Getting Help

  • Check TROUBLESHOOTING.md for detailed solutions
  • Review docs/FAQ.md for common questions
  • Visit https://skillseekersweb.com/ for documentation
  • Open an issue on GitHub with:
    • Clear title and description
    • Steps to reproduce
    • Expected vs actual behavior
    • Environment details (OS, Python version)
    • Error messages and stack traces

This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.