AGENTS.md - Skill Seekers
This file provides essential guidance for AI coding agents working with the Skill Seekers codebase.
Project Overview
Skill Seekers is a Python CLI tool that converts documentation websites, GitHub repositories, and PDF files into AI-ready skills for LLM platforms and RAG (Retrieval-Augmented Generation) pipelines. It serves as the universal preprocessing layer for AI systems.
Supported Target Platforms
| Platform | Format | Use Case |
|---|---|---|
| Claude AI | ZIP + YAML | Claude Code skills |
| Google Gemini | tar.gz | Gemini skills |
| OpenAI ChatGPT | ZIP + Vector Store | Custom GPTs |
| LangChain | Documents | QA chains, agents, retrievers |
| LlamaIndex | TextNodes | Query engines, chat engines |
| Haystack | Documents | Enterprise RAG pipelines |
| Pinecone | Ready for upsert | Production vector search |
| Weaviate | Vector objects | Vector database |
| Qdrant | Points | Vector database |
| Chroma | Documents | Local vector database |
| FAISS | Index files | Local similarity search |
| Cursor IDE | .cursorrules | AI coding assistant rules |
| Windsurf | .windsurfrules | AI coding rules |
| Generic Markdown | ZIP | Universal export |
Current Version: 2.9.0
Python Version: 3.10+ required
License: MIT
Website: https://skillseekersweb.com/
Repository: https://github.com/yusufkaraaslan/Skill_Seekers
Core Workflow
- Scrape Phase - Crawl documentation/GitHub/PDF sources
- Build Phase - Organize content into categorized references
- Enhancement Phase - AI-powered quality improvements (optional)
- Package Phase - Create platform-specific packages
- Upload Phase - Auto-upload to target platform (optional)
Project Structure
Skill_Seekers/
├── src/skill_seekers/ # Main source code (src/ layout)
│ ├── cli/ # CLI tools and commands
│ │ ├── adaptors/ # Platform adaptors (Strategy pattern)
│ │ │ ├── base.py # Abstract base class
│ │ │ ├── claude.py # Claude AI adaptor
│ │ │ ├── gemini.py # Google Gemini adaptor
│ │ │ ├── openai.py # OpenAI ChatGPT adaptor
│ │ │ ├── markdown.py # Generic Markdown adaptor
│ │ │ ├── chroma.py # Chroma vector DB adaptor
│ │ │ ├── faiss_helpers.py # FAISS index adaptor
│ │ │ ├── haystack.py # Haystack RAG adaptor
│ │ │ ├── langchain.py # LangChain adaptor
│ │ │ ├── llama_index.py # LlamaIndex adaptor
│ │ │ ├── qdrant.py # Qdrant vector DB adaptor
│ │ │ ├── weaviate.py # Weaviate vector DB adaptor
│ │ │ └── streaming_adaptor.py # Streaming output adaptor
│ │ ├── storage/ # Cloud storage backends
│ │ │ ├── base_storage.py # Storage interface
│ │ │ ├── s3_storage.py # AWS S3 support
│ │ │ ├── gcs_storage.py # Google Cloud Storage
│ │ │ └── azure_storage.py # Azure Blob Storage
│ │ ├── main.py # Unified CLI entry point
│ │ ├── doc_scraper.py # Documentation scraper
│ │ ├── github_scraper.py # GitHub repository scraper
│ │ ├── pdf_scraper.py # PDF extraction
│ │ ├── unified_scraper.py # Multi-source scraping
│ │ ├── codebase_scraper.py # Local codebase analysis
│ │ ├── enhance_skill_local.py # AI enhancement (local mode)
│ │ ├── package_skill.py # Skill packager
│ │ ├── upload_skill.py # Upload to platforms
│ │ ├── cloud_storage_cli.py # Cloud storage CLI
│ │ ├── benchmark_cli.py # Benchmarking CLI
│ │ ├── sync_cli.py # Sync monitoring CLI
│ │ └── ... # 70+ CLI modules
│ ├── mcp/ # MCP server integration
│ │ ├── server_fastmcp.py # FastMCP server (main)
│ │ ├── server_legacy.py # Legacy server implementation
│ │ ├── server.py # Server entry point
│ │ └── tools/ # MCP tool implementations
│ │     ├── config_tools.py # Configuration tools
│ │     ├── scraping_tools.py # Scraping tools
│ │     ├── packaging_tools.py # Packaging tools
│ │     ├── source_tools.py # Source management tools
│ │     ├── splitting_tools.py # Config splitting tools
│ │     └── vector_db_tools.py # Vector database tools
│ ├── sync/ # Sync monitoring module
│ │ ├── detector.py # Change detection
│ │ ├── models.py # Data models
│ │ ├── monitor.py # Monitoring logic
│ │ └── notifier.py # Notification system
│ ├── benchmark/ # Benchmarking framework
│ │ ├── framework.py # Benchmark framework
│ │ ├── models.py # Benchmark models
│ │ └── runner.py # Benchmark runner
│ └── embedding/ # Embedding server
│     ├── server.py # FastAPI embedding server
│     ├── generator.py # Embedding generation
│     ├── cache.py # Embedding cache
│     └── models.py # Embedding models
├── tests/ # Test suite (83 test files)
├── configs/ # Preset configuration files
├── docs/ # Documentation (80+ markdown files)
├── .github/workflows/ # CI/CD workflows
├── pyproject.toml # Main project configuration
├── requirements.txt # Pinned dependencies
├── Dockerfile # Main Docker image
├── Dockerfile.mcp # MCP server Docker image
└── docker-compose.yml # Full stack deployment
Build and Development Commands
Setup (REQUIRED before any development)
# Install in editable mode (REQUIRED for tests due to src/ layout)
pip install -e .
# Install with all platform dependencies
pip install -e ".[all-llms]"
# Install with all optional dependencies
pip install -e ".[all]"
# Install specific platforms only
pip install -e ".[gemini]" # Google Gemini support
pip install -e ".[openai]" # OpenAI ChatGPT support
pip install -e ".[mcp]" # MCP server dependencies
pip install -e ".[s3]" # AWS S3 support
pip install -e ".[gcs]" # Google Cloud Storage
pip install -e ".[azure]" # Azure Blob Storage
pip install -e ".[embedding]" # Embedding server support
# Install dev dependencies (using dependency-groups)
pip install -e ".[dev]"
CRITICAL: The project uses a src/ layout. Tests WILL FAIL unless you install with pip install -e . first.
Building
# Build package using uv (recommended)
uv build
# Or using standard build
python -m build
# Publish to PyPI
uv publish
Docker
# Build Docker image
docker build -t skill-seekers .
# Run with docker-compose (includes vector databases)
docker-compose up -d
# Run MCP server only
docker-compose up -d mcp-server
Running Tests
CRITICAL: Never skip tests - all tests must pass before commits.
# All tests (must run pip install -e . first!)
pytest tests/ -v
# Specific test file
pytest tests/test_scraper_features.py -v
pytest tests/test_mcp_fastmcp.py -v
pytest tests/test_cloud_storage.py -v
# With coverage
pytest tests/ --cov=src/skill_seekers --cov-report=term --cov-report=html
# Single test
pytest tests/test_scraper_features.py::test_detect_language -v
# E2E tests
pytest tests/test_e2e_three_stream_pipeline.py -v
# Skip slow tests
pytest tests/ -v -m "not slow"
# Run only integration tests
pytest tests/ -v -m integration
Test Architecture:
- 83 test files covering all features
- CI Matrix: Ubuntu + macOS, Python 3.10-3.12
- 1200+ tests passing
- Test markers: slow, integration, e2e, venv, bootstrap
Code Style Guidelines
Linting and Formatting
# Run ruff linter
ruff check src/ tests/
# Run ruff formatter check
ruff format --check src/ tests/
# Auto-fix issues
ruff check src/ tests/ --fix
ruff format src/ tests/
# Run mypy type checker
mypy src/skill_seekers --show-error-codes --pretty
Style Rules (from pyproject.toml)
- Line length: 100 characters
- Target Python: 3.10+
- Enabled rules: E, W, F, I, B, C4, UP, ARG, SIM
- Ignored rules: E501, F541, ARG002, B007, I001, SIM114
- Import sorting: isort style with skill_seekers as first-party
Code Conventions
- Use type hints where practical (gradual typing approach)
- Docstrings: Use Google-style or standard docstrings
- Error handling: Use specific exceptions, provide helpful messages
- Async code: Use asyncio, mark tests with @pytest.mark.asyncio
- File naming: Use snake_case for all Python files
- MyPy configuration: Lenient gradual typing (see mypy.ini)
Architecture Patterns
Platform Adaptor Pattern (Strategy Pattern)
All platform-specific logic is encapsulated in adaptors:
import os
from skill_seekers.cli.adaptors import get_adaptor
# Get platform-specific adaptor
adaptor = get_adaptor('gemini') # or 'claude', 'openai', 'langchain', etc.
# Package skill
adaptor.package(skill_dir='output/react/', output_path='output/')
# Upload to platform (the API key is read from the environment)
adaptor.upload(
    package_path='output/react-gemini.tar.gz',
    api_key=os.getenv('GOOGLE_API_KEY')
)
CLI Architecture (Git-style)
Entry point: src/skill_seekers/cli/main.py
The CLI uses subcommands that delegate to existing modules:
# skill-seekers scrape --config react.json
# Transforms to: doc_scraper.main() with modified sys.argv
Available subcommands:
- config - Configuration wizard
- scrape - Documentation scraping
- github - GitHub repository scraping
- pdf - PDF extraction
- unified - Multi-source scraping
- analyze / codebase - Local codebase analysis
- enhance - AI enhancement
- package - Package skill for target platform
- upload - Upload to platform
- cloud - Cloud storage operations
- sync - Sync monitoring
- benchmark - Performance benchmarking
- embed - Embedding server
- install / install-agent - Complete workflow
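The delegation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual main.py: the functions scrape_main and package_main, the HANDLERS table, and dispatch() are hypothetical stand-ins for the real modules.

```python
import sys
from typing import Callable


# Hypothetical stand-ins for module entry points such as doc_scraper.main().
def scrape_main() -> int:
    print(f"scrape called with argv={sys.argv}")
    return 0


def package_main() -> int:
    print(f"package called with argv={sys.argv}")
    return 0


# Subcommand name -> (program name the module expects, entry point).
HANDLERS: dict[str, tuple[str, Callable[[], int]]] = {
    "scrape": ("doc_scraper.py", scrape_main),
    "package": ("package_skill.py", package_main),
}


def dispatch(argv: list[str]) -> int:
    """Route `skill-seekers <subcommand> ...` to the matching module."""
    if len(argv) < 2 or argv[1] not in HANDLERS:
        print("usage: skill-seekers {" + ",".join(HANDLERS) + "} ...")
        return 2
    prog, handler = HANDLERS[argv[1]]
    saved = sys.argv
    # Rewrite sys.argv so the delegated module parses only its own arguments.
    sys.argv = [prog, *argv[2:]]
    try:
        return handler()
    finally:
        sys.argv = saved  # restore the original arguments for the caller
```

Restoring sys.argv in the finally block keeps the rewrite from leaking into later code, which matters when several subcommands run in one process (for example in tests).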
MCP Server Architecture
Two implementations:
- server_fastmcp.py - Modern, decorator-based (recommended)
- server_legacy.py - Legacy implementation
Tools are organized by category:
- Config tools (3 tools)
- Scraping tools (8 tools)
- Packaging tools (4 tools)
- Source tools (4 tools)
- Splitting tools (2 tools)
- Vector DB tools (multiple)
Cloud Storage Architecture
Abstract base class pattern for cloud providers:
- base_storage.py - Defines CloudStorage interface
- s3_storage.py - AWS S3 implementation
- gcs_storage.py - Google Cloud Storage implementation
- azure_storage.py - Azure Blob Storage implementation
Testing Instructions
Test Categories
| Marker | Description |
|---|---|
| slow | Tests taking >5 seconds |
| integration | Requires external services (APIs) |
| e2e | End-to-end tests (resource-intensive) |
| venv | Requires virtual environment setup |
| bootstrap | Bootstrap skill specific |
Running Specific Test Categories
# Skip slow tests
pytest tests/ -v -m "not slow"
# Run only integration tests
pytest tests/ -v -m integration
# Run E2E tests
pytest tests/ -v -m e2e
Test Configuration (pyproject.toml)
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "-v --tb=short --strict-markers"
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
Git Workflow
Branch Structure
main (production)
↑
│ (only maintainer merges)
│
development (integration) ← default branch for PRs
↑
│ (all contributor PRs go here)
│
feature branches
- main - Production, always stable, protected
- development - Active development, default for PRs
- Feature branches - Your work, created from development
Creating a Feature Branch
# 1. Checkout development
git checkout development
git pull upstream development
# 2. Create feature branch
git checkout -b my-feature
# 3. Make changes, commit, push
git add .
git commit -m "Add my feature"
git push origin my-feature
# 4. Create PR targeting 'development' branch
CI/CD Configuration
GitHub Actions Workflows
.github/workflows/tests.yml:
- Runs on: push/PR to main and development
- Lint job: Ruff + MyPy
- Test matrix: Ubuntu + macOS, Python 3.10-3.12
- Coverage: Uploads to Codecov
.github/workflows/release.yml:
- Triggered on version tags (v*)
- Builds and publishes to PyPI using uv
- Creates GitHub release with changelog
.github/workflows/docker-publish.yml:
- Builds and publishes Docker images
.github/workflows/vector-db-export.yml:
- Tests vector database exports
.github/workflows/scheduled-updates.yml:
- Scheduled sync monitoring
Pre-commit Checks (Manual)
# Before committing, run:
ruff check src/ tests/
ruff format --check src/ tests/
pytest tests/ -v -x # Stop on first failure
Security Considerations
API Keys and Secrets
- Never commit API keys to the repository
- Use environment variables:
  - ANTHROPIC_API_KEY - Claude AI
  - GOOGLE_API_KEY - Google Gemini
  - OPENAI_API_KEY - OpenAI
  - GITHUB_TOKEN - GitHub API
  - AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - AWS S3
  - GOOGLE_APPLICATION_CREDENTIALS - GCS
  - AZURE_STORAGE_CONNECTION_STRING - Azure
- Configuration storage:
  - Stored at ~/.config/skill-seekers/config.json
  - Permissions: 600 (owner read/write only)
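As an illustration of the two rules above (keys come from the environment, and the config file is owner-only), here is a minimal sketch. The helper names save_config and load_api_key are hypothetical and are not taken from the codebase; the permission handling assumes a POSIX system.

```python
import json
import os
from pathlib import Path


def save_config(path: Path, data: dict) -> None:
    """Write a JSON config that only the owner can read or write (mode 600)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with 0o600 from the start so it is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    # os.open applies the mode only on creation; enforce it for existing files.
    os.chmod(path, 0o600)


def load_api_key(name: str) -> str:
    """Read a key from the environment instead of hard-coding it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value
```

A caller might use save_config(Path.home() / ".config" / "skill-seekers" / "config.json", {...}) and load_api_key("GITHUB_TOKEN").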
Rate Limit Handling
- The GitHub API is rate limited (5,000 requests/hour for authenticated requests)
- The tool has built-in rate limit handling with retry logic
- Use the --non-interactive flag for CI/CD environments
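The tool's own retry logic is not reproduced here. As a generic sketch of what retrying on a rate limit can look like, the following uses exponential backoff; RateLimitError and call_with_retry are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when an API reports a rate limit."""


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep as a parameter lets tests substitute a recorder for time.sleep, so the backoff schedule can be checked without waiting.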
Custom API Endpoints
Support for Claude-compatible APIs:
export ANTHROPIC_API_KEY=your-custom-api-key
export ANTHROPIC_BASE_URL=https://custom-endpoint.com/v1
Common Development Tasks
Adding a New CLI Command
- Create module in src/skill_seekers/cli/my_command.py
- Implement main() function with argument parsing
- Add entry point in pyproject.toml:
  [project.scripts]
  skill-seekers-my-command = "skill_seekers.cli.my_command:main"
- Add subcommand handler in src/skill_seekers/cli/main.py
- Add tests in tests/test_my_command.py
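A new command module might look roughly like the sketch below. The argument names (source, --output, --verbose) are invented for illustration; only the main() entry point and the use of argument parsing come from the steps above.

```python
import argparse
from collections.abc import Sequence
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the parser separately so tests can inspect defaults."""
    parser = argparse.ArgumentParser(
        prog="skill-seekers-my-command",
        description="Hypothetical example command.",
    )
    parser.add_argument("source", help="Input path or URL")
    parser.add_argument("--output", default="output/", help="Output directory")
    parser.add_argument("--verbose", action="store_true", help="Print extra detail")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point; argparse falls back to sys.argv[1:] when argv is None."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Processing {args.source} -> {args.output}")
    return 0  # exit code for the console script
```

Accepting an optional argv makes the command callable from tests, for example main(["docs/", "--verbose"]), without touching sys.argv.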
Adding a New Platform Adaptor
- Create src/skill_seekers/cli/adaptors/my_platform.py
- Inherit from SkillAdaptor base class
- Implement required methods: package(), upload(), enhance()
- Register in src/skill_seekers/cli/adaptors/__init__.py
- Add optional dependencies in pyproject.toml
- Add tests in tests/test_adaptors/
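The shape of a new adaptor can be sketched as follows. The real SkillAdaptor in adaptors/base.py may differ: the base class below is a stand-in that mirrors only the three method names listed above, and every signature, return type, and the ADAPTORS registry are assumptions made for this example.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SkillAdaptor(ABC):
    """Stand-in for the real base class; signatures here are assumptions."""

    @abstractmethod
    def package(self, skill_dir: str, output_path: str) -> Path: ...

    @abstractmethod
    def upload(self, package_path: str, api_key: str) -> dict: ...

    @abstractmethod
    def enhance(self, skill_dir: str, api_key: str) -> bool: ...


class MyPlatformAdaptor(SkillAdaptor):
    """Hypothetical adaptor for a platform called 'my_platform'."""

    def package(self, skill_dir: str, output_path: str) -> Path:
        name = Path(skill_dir).name
        # Only computes the target path; real code would write the archive.
        return Path(output_path) / f"{name}-my_platform.zip"

    def upload(self, package_path: str, api_key: str) -> dict:
        if not api_key:
            raise ValueError("api_key is required")
        return {"success": True, "package": package_path}

    def enhance(self, skill_dir: str, api_key: str) -> bool:
        return False  # this hypothetical platform offers no AI enhancement


# Hypothetical registry, analogous to what get_adaptor() might consult.
ADAPTORS: dict[str, type[SkillAdaptor]] = {"my_platform": MyPlatformAdaptor}


def get_adaptor(platform: str) -> SkillAdaptor:
    """Look up and instantiate the adaptor registered for a platform."""
    try:
        return ADAPTORS[platform]()
    except KeyError:
        raise ValueError(f"Unknown platform: {platform}") from None
```

Because callers depend only on the base-class methods, adding a platform means adding one class and one registry entry, which is the point of the Strategy pattern used here.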
Adding an MCP Tool
- Implement tool logic in src/skill_seekers/mcp/tools/category_tools.py
- Register in src/skill_seekers/mcp/server_fastmcp.py
- Add test in tests/test_mcp_fastmcp.py
Adding Cloud Storage Provider
- Create module in src/skill_seekers/cli/storage/my_storage.py
- Inherit from CloudStorage base class
- Implement required methods: upload(), download(), list(), delete()
- Register in src/skill_seekers/cli/storage/__init__.py
- Add optional dependencies in pyproject.toml
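A provider following these steps might be structured like the sketch below. The CloudStorage class here is a stand-in whose method signatures are assumptions (the real interface lives in base_storage.py), and InMemoryStorage is a hypothetical provider that keeps objects in a dict so the example runs without any cloud SDK.

```python
from abc import ABC, abstractmethod
from typing import List


class CloudStorage(ABC):
    """Stand-in for the real interface; signatures here are assumptions."""

    @abstractmethod
    def upload(self, local_path: str, remote_key: str) -> None: ...

    @abstractmethod
    def download(self, remote_key: str, local_path: str) -> None: ...

    @abstractmethod
    def list(self, prefix: str = "") -> List[str]: ...

    @abstractmethod
    def delete(self, remote_key: str) -> None: ...


class InMemoryStorage(CloudStorage):
    """Hypothetical provider that keeps objects in a dict, for illustration."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def upload(self, local_path: str, remote_key: str) -> None:
        with open(local_path, "rb") as fh:
            self._objects[remote_key] = fh.read()

    def download(self, remote_key: str, local_path: str) -> None:
        with open(local_path, "wb") as fh:
            fh.write(self._objects[remote_key])

    def list(self, prefix: str = "") -> List[str]:
        # Return keys in sorted order so results are deterministic.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, remote_key: str) -> None:
        self._objects.pop(remote_key, None)  # deleting a missing key is a no-op
```

An in-memory provider like this can also double as a test fake for code that accepts any CloudStorage implementation.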
Documentation
Project Documentation
- README.md - Main project documentation
- README.zh-CN.md - Chinese translation
- CLAUDE.md - Detailed implementation guidance
- QUICKSTART.md - Quick start guide
- CONTRIBUTING.md - Contribution guidelines
- TROUBLESHOOTING.md - Common issues and solutions
- docs/ - Comprehensive documentation (80+ files)
  - docs/integrations/ - Integration guides for each platform
  - docs/guides/ - User guides
  - docs/reference/ - API reference
  - docs/features/ - Feature documentation
  - docs/blog/ - Blog posts and articles
Configuration Documentation
Preset configs are in the configs/ directory:
- react.json - React documentation
- vue.json - Vue.js documentation
- fastapi.json - FastAPI documentation
- django.json - Django documentation
- blender.json / blender-unified.json - Blender Engine
- godot.json - Godot Engine
- claude-code.json - Claude Code
- *_unified.json - Multi-source configs
Key Dependencies
Core Dependencies
- requests>=2.32.5 - HTTP requests
- beautifulsoup4>=4.14.2 - HTML parsing
- PyGithub>=2.5.0 - GitHub API
- GitPython>=3.1.40 - Git operations
- httpx>=0.28.1 - Async HTTP
- anthropic>=0.76.0 - Claude AI API
- PyMuPDF>=1.24.14 - PDF processing
- Pillow>=11.0.0 - Image processing
- pytesseract>=0.3.13 - OCR
- pydantic>=2.12.3 - Data validation
- pydantic-settings>=2.11.0 - Settings management
- click>=8.3.0 - CLI framework
- Pygments>=2.19.2 - Syntax highlighting
- pathspec>=0.12.1 - Path matching
- networkx>=3.0 - Graph operations
- schedule>=1.2.0 - Scheduled tasks
- python-dotenv>=1.1.1 - Environment variables
- jsonschema>=4.25.1 - JSON validation
Optional Dependencies
- mcp>=1.25,<2 - MCP server
- google-generativeai>=0.8.0 - Gemini support
- openai>=1.0.0 - OpenAI support
- boto3>=1.34.0 - AWS S3
- google-cloud-storage>=2.10.0 - GCS
- azure-storage-blob>=12.19.0 - Azure
- fastapi>=0.109.0 - Embedding server
- uvicorn>=0.27.0 - ASGI server
- sentence-transformers>=2.3.0 - Embeddings
- numpy>=1.24.0 - Numerical computing
- voyageai>=0.2.0 - Voyage AI embeddings
Dev Dependencies (in dependency-groups)
- pytest>=8.4.2 - Testing framework
- pytest-asyncio>=0.24.0 - Async test support
- pytest-cov>=7.0.0 - Coverage
- coverage>=7.11.0 - Coverage reporting
- ruff>=0.14.13 - Linting/formatting
- mypy>=1.19.1 - Type checking
Troubleshooting
Common Issues
ImportError: No module named 'skill_seekers'
- Solution: Run pip install -e .
Tests failing with "package not installed"
- Solution: Ensure you ran pip install -e . in the correct virtual environment
MCP server import errors
- Solution: Install with pip install -e ".[mcp]"
Type checking failures
- MyPy is configured to be lenient (gradual typing)
- Focus on critical paths, not full coverage
Docker build failures
- Ensure you have BuildKit enabled: DOCKER_BUILDKIT=1
- Check that all submodules are initialized: git submodule update --init
Getting Help
- Check TROUBLESHOOTING.md for detailed solutions
- Review docs/FAQ.md for common questions
- Visit https://skillseekersweb.com/ for documentation
- Open an issue on GitHub with:
  - Clear title and description
  - Steps to reproduce
  - Expected vs actual behavior
  - Environment details (OS, Python version)
  - Error messages and stack traces
This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.