firefrost-gaming/skill-seekers-reference

yusyus 6cb446d213 docs: Add 5 vector database integration guides (HAYSTACK, WEAVIATE, CHROMA, FAISS, QDRANT)

- Add HAYSTACK.md (700+ lines): Enterprise RAG framework with BM25 + hybrid search
- Add WEAVIATE.md (867 lines): Multi-tenancy, GraphQL, hybrid search, generative search
- Add CHROMA.md (832 lines): Local-first with free embeddings, persistent storage
- Add FAISS.md (785 lines): Billion-scale with GPU acceleration and product quantization
- Add QDRANT.md (746 lines): High-performance Rust engine with rich filtering

All guides follow proven 11-section pattern:
- Problem/Solution/Quick Start/Setup/Advanced/Best Practices
- Real-world examples (100-200 lines working code)
- Troubleshooting sections
- Before/After comparisons

Total: ~3,930 lines of comprehensive integration documentation

Test results:
- 26/26 tests passing for new features (RAG chunker + Haystack adaptor)
- 108 total tests passing (100%)
- 0 failures

This completes all optional integration guides from ACTION_PLAN.md.
Universal preprocessor positioning now covers:
- RAG Frameworks: LangChain, LlamaIndex, Haystack (3/3)
- Vector Databases: Pinecone, Weaviate, Chroma, FAISS, Qdrant (5/5)
- AI Coding Tools: Cursor, Windsurf, Cline, Continue.dev (4/4)
- Chat Platforms: Claude, Gemini, ChatGPT (3/3)

Total: 15 integration guides across 4 categories (+50% coverage)

Ready for v2.10.0 release.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-07 21:34:28 +03:00

19 KiB

AGENTS.md - Skill Seekers

This file provides essential guidance for AI coding agents working with the Skill Seekers codebase.

Project Overview

Skill Seekers is a Python CLI tool that converts documentation websites, GitHub repositories, and PDF files into AI-ready skills for LLM platforms and RAG (Retrieval-Augmented Generation) pipelines. It serves as the universal preprocessing layer for AI systems.

Supported Target Platforms

Platform	Format	Use Case
Claude AI	ZIP + YAML	Claude Code skills
Google Gemini	tar.gz	Gemini skills
OpenAI ChatGPT	ZIP + Vector Store	Custom GPTs
LangChain	Documents	QA chains, agents, retrievers
LlamaIndex	TextNodes	Query engines, chat engines
Haystack	Documents	Enterprise RAG pipelines
Pinecone	Ready for upsert	Production vector search
Weaviate	Vector objects	Vector database
Qdrant	Points	Vector database
Chroma	Documents	Local vector database
FAISS	Index files	Local similarity search
Cursor IDE	.cursorrules	AI coding assistant rules
Windsurf	.windsurfrules	AI coding rules
Generic Markdown	ZIP	Universal export

Current Version: 2.9.0
Python Version: 3.10+ required
License: MIT
Website: https://skillseekersweb.com/
Repository: https://github.com/yusufkaraaslan/Skill_Seekers

Core Workflow

Scrape Phase - Crawl documentation/GitHub/PDF sources
Build Phase - Organize content into categorized references
Enhancement Phase - AI-powered quality improvements (optional)
Package Phase - Create platform-specific packages
Upload Phase - Auto-upload to target platform (optional)
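
The phases run in the order listed, and the two marked optional are skipped unless requested. A minimal sketch of that control flow, assuming hypothetical names throughout (`Phase`, `run_pipeline`, and the lambda stand-ins are illustrative and not taken from the codebase):

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One pipeline step; optional phases run only when explicitly enabled."""

    name: str
    run: Callable[[dict], dict]
    optional: bool = False


def run_pipeline(phases: list[Phase], state: dict, enabled: set[str]) -> dict:
    """Run phases in order, skipping optional ones that are not enabled."""
    for phase in phases:
        if phase.optional and phase.name not in enabled:
            continue
        state = phase.run(state)
        state.setdefault("completed", []).append(phase.name)
    return state


# Hypothetical stand-ins for the real phase implementations.
PHASES = [
    Phase("scrape", lambda s: {**s, "pages": ["intro", "api"]}),
    Phase("build", lambda s: {**s, "references": sorted(s["pages"])}),
    Phase("enhance", lambda s: {**s, "enhanced": True}, optional=True),
    Phase("package", lambda s: {**s, "package": "skill.zip"}),
    Phase("upload", lambda s: {**s, "uploaded": True}, optional=True),
]

# Enable enhancement but leave upload off.
result = run_pipeline(PHASES, {}, enabled={"enhance"})
print(result["completed"])
```

Each phase receives the state produced by the previous one, which is why packaging can assume that scraping and building have already run.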

Project Structure

/mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/
├── src/skill_seekers/              # Main source code (src/ layout)
│   ├── cli/                        # CLI tools and commands
│   │   ├── adaptors/               # Platform adaptors (Strategy pattern)
│   │   │   ├── base.py             # Abstract base class
│   │   │   ├── claude.py           # Claude AI adaptor
│   │   │   ├── gemini.py           # Google Gemini adaptor
│   │   │   ├── openai.py           # OpenAI ChatGPT adaptor
│   │   │   ├── markdown.py         # Generic Markdown adaptor
│   │   │   ├── chroma.py           # Chroma vector DB adaptor
│   │   │   ├── faiss_helpers.py    # FAISS index adaptor
│   │   │   ├── haystack.py         # Haystack RAG adaptor
│   │   │   ├── langchain.py        # LangChain adaptor
│   │   │   ├── llama_index.py      # LlamaIndex adaptor
│   │   │   ├── qdrant.py           # Qdrant vector DB adaptor
│   │   │   ├── weaviate.py         # Weaviate vector DB adaptor
│   │   │   └── streaming_adaptor.py # Streaming output adaptor
│   │   ├── storage/                # Cloud storage backends
│   │   │   ├── base_storage.py     # Storage interface
│   │   │   ├── s3_storage.py       # AWS S3 support
│   │   │   ├── gcs_storage.py      # Google Cloud Storage
│   │   │   └── azure_storage.py    # Azure Blob Storage
│   │   ├── main.py                 # Unified CLI entry point
│   │   ├── doc_scraper.py          # Documentation scraper
│   │   ├── github_scraper.py       # GitHub repository scraper
│   │   ├── pdf_scraper.py          # PDF extraction
│   │   ├── unified_scraper.py      # Multi-source scraping
│   │   ├── codebase_scraper.py     # Local codebase analysis
│   │   ├── enhance_skill_local.py  # AI enhancement (local mode)
│   │   ├── package_skill.py        # Skill packager
│   │   ├── upload_skill.py         # Upload to platforms
│   │   ├── cloud_storage_cli.py    # Cloud storage CLI
│   │   ├── benchmark_cli.py        # Benchmarking CLI
│   │   ├── sync_cli.py             # Sync monitoring CLI
│   │   └── ...                     # 70+ CLI modules
│   ├── mcp/                        # MCP server integration
│   │   ├── server_fastmcp.py       # FastMCP server (main)
│   │   ├── server_legacy.py        # Legacy server implementation
│   │   ├── server.py               # Server entry point
│   │   └── tools/                  # MCP tool implementations
│   │       ├── config_tools.py     # Configuration tools
│   │       ├── scraping_tools.py   # Scraping tools
│   │       ├── packaging_tools.py  # Packaging tools
│   │       ├── source_tools.py     # Source management tools
│   │       ├── splitting_tools.py  # Config splitting tools
│   │       └── vector_db_tools.py  # Vector database tools
│   ├── sync/                       # Sync monitoring module
│   │   ├── detector.py             # Change detection
│   │   ├── models.py               # Data models
│   │   ├── monitor.py              # Monitoring logic
│   │   └── notifier.py             # Notification system
│   ├── benchmark/                  # Benchmarking framework
│   │   ├── framework.py            # Benchmark framework
│   │   ├── models.py               # Benchmark models
│   │   └── runner.py               # Benchmark runner
│   └── embedding/                  # Embedding server
│       ├── server.py               # FastAPI embedding server
│       ├── generator.py            # Embedding generation
│       ├── cache.py                # Embedding cache
│       └── models.py               # Embedding models
├── tests/                          # Test suite (83 test files)
├── configs/                        # Preset configuration files
├── docs/                           # Documentation (80+ markdown files)
├── .github/workflows/              # CI/CD workflows
├── pyproject.toml                  # Main project configuration
├── requirements.txt                # Pinned dependencies
├── Dockerfile                      # Main Docker image
├── Dockerfile.mcp                  # MCP server Docker image
└── docker-compose.yml              # Full stack deployment

Build and Development Commands

Setup (REQUIRED before any development)

# Install in editable mode (REQUIRED for tests due to src/ layout)
pip install -e .

# Install with all platform dependencies
pip install -e ".[all-llms]"

# Install with all optional dependencies
pip install -e ".[all]"

# Install specific platforms only
pip install -e ".[gemini]"    # Google Gemini support
pip install -e ".[openai]"    # OpenAI ChatGPT support
pip install -e ".[mcp]"       # MCP server dependencies
pip install -e ".[s3]"        # AWS S3 support
pip install -e ".[gcs]"       # Google Cloud Storage
pip install -e ".[azure]"     # Azure Blob Storage
pip install -e ".[embedding]" # Embedding server support

# Install dev dependencies (defined under [dependency-groups], so use --group; needs pip 25.1+)
pip install -e . --group dev

CRITICAL: The project uses a src/ layout. Tests WILL FAIL unless you install with pip install -e . first.

Building

# Build package using uv (recommended)
uv build

# Or using standard build
python -m build

# Publish to PyPI
uv publish

Docker

# Build Docker image
docker build -t skill-seekers .

# Run with docker-compose (includes vector databases)
docker-compose up -d

# Run MCP server only
docker-compose up -d mcp-server

Running Tests

CRITICAL: Never skip tests - all tests must pass before commits.

# All tests (must run pip install -e . first!)
pytest tests/ -v

# Specific test file
pytest tests/test_scraper_features.py -v
pytest tests/test_mcp_fastmcp.py -v
pytest tests/test_cloud_storage.py -v

# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html

# Single test
pytest tests/test_scraper_features.py::test_detect_language -v

# E2E tests
pytest tests/test_e2e_three_stream_pipeline.py -v

# Skip slow tests
pytest tests/ -v -m "not slow"

# Run only integration tests
pytest tests/ -v -m integration

Test Architecture:

83 test files covering all features
CI Matrix: Ubuntu + macOS, Python 3.10-3.12
1200+ tests passing
Test markers: slow, integration, e2e, venv, bootstrap

Code Style Guidelines

Linting and Formatting

# Run ruff linter
ruff check src/ tests/

# Run ruff formatter check
ruff format --check src/ tests/

# Auto-fix issues
ruff check src/ tests/ --fix
ruff format src/ tests/

# Run mypy type checker
mypy src/skill_seekers --show-error-codes --pretty

Style Rules (from pyproject.toml)

Line length: 100 characters
Target Python: 3.10+
Enabled rules: E, W, F, I, B, C4, UP, ARG, SIM
Ignored rules: E501, F541, ARG002, B007, I001, SIM114
Import sorting: isort style with skill_seekers as first-party

Code Conventions

Use type hints where practical (gradual typing approach)
Docstrings: Use Google-style or standard docstrings
Error handling: Use specific exceptions, provide helpful messages
Async code: Use asyncio, mark tests with @pytest.mark.asyncio
File naming: Use snake_case for all Python files
MyPy configuration: Lenient gradual typing (see mypy.ini)
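
As an illustration of these conventions together (type hints, a Google-style docstring, and specific exceptions with helpful messages), here is a small sketch. The function `load_config` and the exception `ConfigError` are hypothetical names chosen for the example, not identifiers from the codebase:

```python
import json
from pathlib import Path
from typing import Any


class ConfigError(ValueError):
    """Raised when a config file is missing or malformed (hypothetical name)."""


def load_config(path: Path) -> dict[str, Any]:
    """Load a JSON config file.

    Args:
        path: Location of the JSON file.

    Returns:
        The parsed config as a dictionary.

    Raises:
        ConfigError: If the file is missing, is not valid JSON, or does not
            contain a JSON object at the top level.
    """
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError as exc:
        raise ConfigError(f"Config file not found: {path}") from exc
    except json.JSONDecodeError as exc:
        raise ConfigError(f"Invalid JSON in {path}: {exc.msg} (line {exc.lineno})") from exc
    if not isinstance(data, dict):
        raise ConfigError(f"Expected a JSON object in {path}, got {type(data).__name__}")
    return data
```

Chaining with `from exc` keeps the original traceback available while the message tells the user which file to fix.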

Architecture Patterns

Platform Adaptor Pattern (Strategy Pattern)

All platform-specific logic is encapsulated in adaptors:

import os

from skill_seekers.cli.adaptors import get_adaptor

# Get platform-specific adaptor
adaptor = get_adaptor('gemini')  # or 'claude', 'openai', 'langchain', etc.

# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')

# Upload to platform
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY')
)

CLI Architecture (Git-style)

Entry point: src/skill_seekers/cli/main.py

The CLI uses subcommands that delegate to existing modules:

# skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
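
The delegation described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of main.py: `SUBCOMMANDS`, `dispatch`, and `scrape_main` are hypothetical names, and the real entry point may route commands differently.

```python
import sys
from collections.abc import Callable


def scrape_main() -> int:
    """Stand-in for a module-level main() that parses sys.argv itself."""
    print(f"scrape called with argv={sys.argv}")
    return 0


# Hypothetical registry: subcommand name -> (program name, entry point).
SUBCOMMANDS: dict[str, tuple[str, Callable[[], int]]] = {
    "scrape": ("doc_scraper.py", scrape_main),
}


def dispatch(argv: list[str]) -> int:
    """Route `skill-seekers <subcommand> ...` to the matching module main()."""
    if len(argv) < 2 or argv[1] not in SUBCOMMANDS:
        print(f"usage: skill-seekers <{'|'.join(SUBCOMMANDS)}> [options]", file=sys.stderr)
        return 2
    prog, entry_point = SUBCOMMANDS[argv[1]]
    original_argv = sys.argv
    # Present the delegated module with the argv it would see if run directly.
    sys.argv = [prog, *argv[2:]]
    try:
        return entry_point()
    finally:
        sys.argv = original_argv  # Restore so the caller is not affected.
```

Calling `dispatch(["skill-seekers", "scrape", "--config", "react.json"])` runs `scrape_main` with `sys.argv` set to `["doc_scraper.py", "--config", "react.json"]`, so an existing module can be reused without changing its own argument parsing.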

Available subcommands:

config - Configuration wizard
scrape - Documentation scraping
github - GitHub repository scraping
pdf - PDF extraction
unified - Multi-source scraping
analyze / codebase - Local codebase analysis
enhance - AI enhancement
package - Package skill for target platform
upload - Upload to platform
cloud - Cloud storage operations
sync - Sync monitoring
benchmark - Performance benchmarking
embed - Embedding server
install / install-agent - Complete workflow

MCP Server Architecture

Two implementations:

server_fastmcp.py - Modern, decorator-based (recommended)
server_legacy.py - Legacy implementation

Tools are organized by category:

Config tools (3 tools)
Scraping tools (8 tools)
Packaging tools (4 tools)
Source tools (4 tools)
Splitting tools (2 tools)
Vector DB tools (multiple)

Cloud Storage Architecture

Abstract base class pattern for cloud providers:

base_storage.py - Defines CloudStorage interface
s3_storage.py - AWS S3 implementation
gcs_storage.py - Google Cloud Storage implementation
azure_storage.py - Azure Blob Storage implementation

Testing Instructions

Test Categories

Marker	Description
`slow`	Tests taking >5 seconds
`integration`	Requires external services (APIs)
`e2e`	End-to-end tests (resource-intensive)
`venv`	Requires virtual environment setup
`bootstrap`	Bootstrap skill specific

Running Specific Test Categories

# Skip slow tests
pytest tests/ -v -m "not slow"

# Run only integration tests
pytest tests/ -v -m integration

# Run E2E tests
pytest tests/ -v -m e2e

Test Configuration (pyproject.toml)

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "-v --tb=short --strict-markers"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"

Git Workflow

Branch Structure

main (production)
  ↑
  │ (only maintainer merges)
  │
development (integration) ← default branch for PRs
  ↑
  │ (all contributor PRs go here)
  │
feature branches

main - Production, always stable, protected
development - Active development, default for PRs
Feature branches - Your work, created from development

Creating a Feature Branch

# 1. Checkout development
git checkout development
git pull upstream development

# 2. Create feature branch
git checkout -b my-feature

# 3. Make changes, commit, push
git add .
git commit -m "Add my feature"
git push origin my-feature

# 4. Create PR targeting 'development' branch

CI/CD Configuration

GitHub Actions Workflows

.github/workflows/tests.yml:

Runs on: push/PR to main and development
Lint job: Ruff + MyPy
Test matrix: Ubuntu + macOS, Python 3.10-3.12
Coverage: Uploads to Codecov

.github/workflows/release.yml:

Triggered on version tags (v*)
Builds and publishes to PyPI using uv
Creates GitHub release with changelog

.github/workflows/docker-publish.yml:

Builds and publishes Docker images

.github/workflows/vector-db-export.yml:

Tests vector database exports

.github/workflows/scheduled-updates.yml:

Scheduled sync monitoring

Pre-commit Checks (Manual)

# Before committing, run:
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x  # Stop on first failure

Security Considerations

API Keys and Secrets

Never commit API keys to the repository
Use environment variables:
- ANTHROPIC_API_KEY - Claude AI
- GOOGLE_API_KEY - Google Gemini
- OPENAI_API_KEY - OpenAI
- GITHUB_TOKEN - GitHub API
- AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - AWS S3
- GOOGLE_APPLICATION_CREDENTIALS - GCS
- AZURE_STORAGE_CONNECTION_STRING - Azure
Configuration storage:
- Stored at ~/.config/skill-seekers/config.json
- Permissions: 600 (owner read/write only)
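
One way to write a config file with owner-only permissions is to create it with mode 600 from the start rather than changing the mode afterwards. The sketch below assumes a hypothetical helper named `save_config`; it is not the project's implementation, and the permission bits are only meaningful on POSIX systems.

```python
import json
import os
from pathlib import Path


def save_config(path: Path, data: dict) -> None:
    """Write `data` as JSON so that only the owner can read or write it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0o600 from the start, so the secret is never
    # briefly readable by others (as it could be with write-then-chmod).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)
    # os.open applies the mode only to new files; tighten pre-existing ones too.
    os.chmod(path, 0o600)
```

A call such as `save_config(Path.home() / ".config" / "skill-seekers" / "config.json", {...})` would match the location given above.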

Rate Limit Handling

The GitHub API is rate limited (5,000 requests/hour for authenticated requests)
The tool has built-in rate limit handling with retry logic
Use --non-interactive flag for CI/CD environments
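
The retry logic mentioned above is not shown in this document. As a rough sketch of the general technique (wait until the limit resets, then try again), assuming a hypothetical `RateLimitError` that carries the reset time as epoch seconds and a hypothetical `call_with_retry` helper:

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error carrying the epoch time at which the limit resets."""

    def __init__(self, reset_at: float) -> None:
        super().__init__(f"rate limited until {reset_at}")
        self.reset_at = reset_at


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    now: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, waiting for the rate-limit reset between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError as exc:
            if attempt == max_attempts:
                raise
            # Wait until the advertised reset time, but never a negative amount.
            sleep(max(0.0, exc.reset_at - now()))
    raise ValueError("max_attempts must be at least 1")
```

Passing `now` and `sleep` as parameters keeps the helper testable without real delays, which matters when the wait can be most of an hour.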

Custom API Endpoints

Support for Claude-compatible APIs:

export ANTHROPIC_API_KEY=your-custom-api-key
export ANTHROPIC_BASE_URL=https://custom-endpoint.com/v1
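
A sketch of how code might pick up these two variables. The helper name `anthropic_client_kwargs` is hypothetical; the returned keys match the `api_key` and `base_url` arguments that the `anthropic` package's client constructor accepts, but how Skill Seekers itself wires them up is not shown in this document.

```python
import os
from collections.abc import Mapping


def anthropic_client_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect client settings from the environment (hypothetical helper)."""
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    kwargs = {"api_key": api_key}
    base_url = env.get("ANTHROPIC_BASE_URL")
    if base_url:
        # Override the endpoint only when one is configured; otherwise the
        # client library's default endpoint applies.
        kwargs["base_url"] = base_url
    return kwargs
```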

Common Development Tasks

Adding a New CLI Command

Create module in src/skill_seekers/cli/my_command.py
Implement main() function with argument parsing

Add entry point in pyproject.toml:

[project.scripts]
skill-seekers-my-command = "skill_seekers.cli.my_command:main"

Add subcommand handler in src/skill_seekers/cli/main.py
Add tests in tests/test_my_command.py

Adding a New Platform Adaptor

Create src/skill_seekers/cli/adaptors/my_platform.py
Inherit from SkillAdaptor base class
Implement required methods: package(), upload(), enhance()
Register in src/skill_seekers/cli/adaptors/__init__.py
Add optional dependencies in pyproject.toml
Add tests in tests/test_adaptors/

Adding an MCP Tool

Implement tool logic in src/skill_seekers/mcp/tools/category_tools.py
Register in src/skill_seekers/mcp/server_fastmcp.py
Add test in tests/test_mcp_fastmcp.py

Adding Cloud Storage Provider

Create module in src/skill_seekers/cli/storage/my_storage.py
Inherit from CloudStorage base class
Implement required methods: upload(), download(), list(), delete()
Register in src/skill_seekers/cli/storage/__init__.py
Add optional dependencies in pyproject.toml
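
As a sketch of these steps, the example below implements the four required methods against a local directory instead of a real cloud service. The `CloudStorage` class is a stand-in for the interface in storage/base_storage.py, and its method signatures are assumptions; `LocalDirStorage` is a hypothetical provider.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class CloudStorage(ABC):
    """Stand-in for the interface in storage/base_storage.py (signatures assumed)."""

    @abstractmethod
    def upload(self, local_path: str, remote_key: str) -> None: ...

    @abstractmethod
    def download(self, remote_key: str, local_path: str) -> None: ...

    @abstractmethod
    def delete(self, remote_key: str) -> None: ...

    # Defined last so the method name does not shadow the builtin `list`
    # in annotations of other methods in this class body.
    @abstractmethod
    def list(self, prefix: str = "") -> list[str]: ...


class LocalDirStorage(CloudStorage):
    """Hypothetical provider that treats a local directory as the bucket."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def upload(self, local_path: str, remote_key: str) -> None:
        target = self.root / remote_key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, target)

    def download(self, remote_key: str, local_path: str) -> None:
        shutil.copyfile(self.root / remote_key, local_path)

    def delete(self, remote_key: str) -> None:
        (self.root / remote_key).unlink()

    def list(self, prefix: str = "") -> list[str]:
        files = (p for p in self.root.rglob("*") if p.is_file())
        keys = (p.relative_to(self.root).as_posix() for p in files)
        return sorted(k for k in keys if k.startswith(prefix))
```

A provider like this can also serve as a test double for code that expects a `CloudStorage`, since it needs no credentials or network access.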

Documentation

Project Documentation

README.md - Main project documentation
README.zh-CN.md - Chinese translation
CLAUDE.md - Detailed implementation guidance
QUICKSTART.md - Quick start guide
CONTRIBUTING.md - Contribution guidelines
TROUBLESHOOTING.md - Common issues and solutions
docs/ - Comprehensive documentation (80+ files)
- docs/integrations/ - Integration guides for each platform
- docs/guides/ - User guides
- docs/reference/ - API reference
- docs/features/ - Feature documentation
- docs/blog/ - Blog posts and articles

Configuration Documentation

Preset configs are in configs/ directory:

react.json - React documentation
vue.json - Vue.js documentation
fastapi.json - FastAPI documentation
django.json - Django documentation
blender.json / blender-unified.json - Blender Engine
godot.json - Godot Engine
claude-code.json - Claude Code
*_unified.json - Multi-source configs

Key Dependencies

Core Dependencies

requests>=2.32.5 - HTTP requests
beautifulsoup4>=4.14.2 - HTML parsing
PyGithub>=2.5.0 - GitHub API
GitPython>=3.1.40 - Git operations
httpx>=0.28.1 - Async HTTP
anthropic>=0.76.0 - Claude AI API
PyMuPDF>=1.24.14 - PDF processing
Pillow>=11.0.0 - Image processing
pytesseract>=0.3.13 - OCR
pydantic>=2.12.3 - Data validation
pydantic-settings>=2.11.0 - Settings management
click>=8.3.0 - CLI framework
Pygments>=2.19.2 - Syntax highlighting
pathspec>=0.12.1 - Path matching
networkx>=3.0 - Graph operations
schedule>=1.2.0 - Scheduled tasks
python-dotenv>=1.1.1 - Environment variables
jsonschema>=4.25.1 - JSON validation

Optional Dependencies

mcp>=1.25,<2 - MCP server
google-generativeai>=0.8.0 - Gemini support
openai>=1.0.0 - OpenAI support
boto3>=1.34.0 - AWS S3
google-cloud-storage>=2.10.0 - GCS
azure-storage-blob>=12.19.0 - Azure
fastapi>=0.109.0 - Embedding server
uvicorn>=0.27.0 - ASGI server
sentence-transformers>=2.3.0 - Embeddings
numpy>=1.24.0 - Numerical computing
voyageai>=0.2.0 - Voyage AI embeddings

Dev Dependencies (in dependency-groups)

pytest>=8.4.2 - Testing framework
pytest-asyncio>=0.24.0 - Async test support
pytest-cov>=7.0.0 - Coverage
coverage>=7.11.0 - Coverage reporting
ruff>=0.14.13 - Linting/formatting
mypy>=1.19.1 - Type checking

Troubleshooting

Common Issues

ModuleNotFoundError: No module named 'skill_seekers'

Solution: Run pip install -e .

Tests failing with "package not installed"

Solution: Ensure you ran pip install -e . in the correct virtual environment

MCP server import errors

Solution: Install with pip install -e ".[mcp]"

Type checking failures

MyPy is configured to be lenient (gradual typing)
Focus on critical paths, not full coverage

Docker build failures

Ensure you have BuildKit enabled: DOCKER_BUILDKIT=1
Check that all submodules are initialized: git submodule update --init

Getting Help

Check TROUBLESHOOTING.md for detailed solutions
Review docs/FAQ.md for common questions
Visit https://skillseekersweb.com/ for documentation
Open an issue on GitHub with:
- Clear title and description
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version)
- Error messages and stack traces

This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.
