Production Deployment Guide
Complete guide for deploying Skill Seekers in production environments.
Table of Contents
- Prerequisites
- Installation
- Configuration
- Deployment Options
- Monitoring & Observability
- Security
- Scaling
- Backup & Disaster Recovery
- Troubleshooting
Prerequisites
System Requirements
Minimum:
- CPU: 2 cores
- RAM: 4 GB
- Disk: 10 GB
- Python: 3.10+
Recommended (for production):
- CPU: 4+ cores
- RAM: 8+ GB
- Disk: 50+ GB SSD
- Python: 3.12+
Dependencies
Required:
# System packages (Ubuntu/Debian)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3-pip \
git curl wget build-essential libssl-dev
# System packages (RHEL/CentOS)
sudo yum install -y python312 python312-devel git curl wget \
gcc gcc-c++ openssl-devel
Optional (for specific features):
# OCR support (PDF scraping)
sudo apt install -y tesseract-ocr
# Cloud storage
# (Install provider-specific SDKs via pip)
# Embedding generation
# (GPU support requires CUDA)
Installation
1. Production Installation
# Create dedicated user
sudo useradd -m -s /bin/bash skillseekers
sudo su - skillseekers
# Create virtual environment
python3.12 -m venv /opt/skillseekers/venv
source /opt/skillseekers/venv/bin/activate
# Install package
pip install --upgrade pip
pip install "skill-seekers[all]"
# Verify installation
skill-seekers --version
2. Configuration Directory
# Create config directory
mkdir -p ~/.config/skill-seekers/{configs,output,logs,cache}
# Set permissions
chmod 700 ~/.config/skill-seekers
3. Environment Variables
Create /opt/skillseekers/.env:
# API Keys
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=...
# GitHub Tokens (use skill-seekers config --github for multiple)
GITHUB_TOKEN=ghp_...
# Cloud Storage (optional)
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcs-key.json
AZURE_STORAGE_CONNECTION_STRING=...
# MCP Server
MCP_TRANSPORT=http
MCP_PORT=8765
# Sync Monitoring (optional)
SYNC_WEBHOOK_URL=https://...
SLACK_WEBHOOK_URL=https://hooks.slack.com/...
# Logging
LOG_LEVEL=INFO
LOG_FILE=/var/log/skillseekers/app.log
Security Note: Never commit .env files to version control!
# Secure the env file
chmod 600 /opt/skillseekers/.env
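In production the service loads these variables via systemd's EnvironmentFile (shown later), and python-dotenv is the usual library choice. As an illustration of what that loading amounts to, here is a minimal stdlib-only sketch; `load_env_file` and `apply_env` are hypothetical helper names, not part of Skill Seekers:

```python
import os

def load_env_file(path: str) -> dict:
    """Parse a simple KEY=VALUE .env file, skipping blanks and # comments."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

def apply_env(values: dict) -> None:
    """Export parsed values without overriding variables already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Note that `setdefault` lets real environment variables (e.g. injected by systemd or Docker) take precedence over the file.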
Configuration
1. GitHub Configuration
Use the interactive configuration wizard:
skill-seekers config --github
This will:
- Add GitHub personal access tokens
- Configure rate limit strategies
- Test token validity
- Support multiple profiles (work, personal, etc.)
2. API Keys Configuration
skill-seekers config --api-keys
Configure:
- Claude API (Anthropic)
- Gemini API (Google)
- OpenAI API
- Voyage AI (embeddings)
3. Connection Testing
skill-seekers config --test
Verifies:
- ✅ GitHub token(s) validity and rate limits
- ✅ Claude API connectivity
- ✅ Gemini API connectivity
- ✅ OpenAI API connectivity
- ✅ Cloud storage access (if configured)
Deployment Options
Option 1: Systemd Service (Recommended)
Create /etc/systemd/system/skillseekers-mcp.service:
[Unit]
Description=Skill Seekers MCP Server
After=network.target
[Service]
Type=simple
User=skillseekers
Group=skillseekers
WorkingDirectory=/opt/skillseekers
EnvironmentFile=/opt/skillseekers/.env
ExecStart=/opt/skillseekers/venv/bin/python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=skillseekers-mcp
# Security
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/skillseekers /var/log/skillseekers
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable skillseekers-mcp
sudo systemctl start skillseekers-mcp
sudo systemctl status skillseekers-mcp
Option 2: Docker Deployment
See Docker Deployment Guide for detailed instructions.
Quick Start:
# Build image
docker build -t skillseekers:latest .
# Run container
docker run -d \
--name skillseekers-mcp \
-p 8765:8765 \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-e GITHUB_TOKEN=$GITHUB_TOKEN \
-v /opt/skillseekers/data:/app/data \
--restart unless-stopped \
skillseekers:latest
Option 3: Kubernetes Deployment
See Kubernetes Deployment Guide for detailed instructions.
Quick Start:
# Install with Helm
helm install skillseekers ./helm/skillseekers \
--namespace skillseekers \
--create-namespace \
--set secrets.anthropicApiKey=$ANTHROPIC_API_KEY \
--set secrets.githubToken=$GITHUB_TOKEN
Option 4: Docker Compose
See Docker Compose Guide for multi-service deployment.
# Start all services
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f
Monitoring & Observability
1. Health Checks
MCP Server Health:
# HTTP transport
curl http://localhost:8765/health
# Expected response:
{
"status": "healthy",
"version": "2.9.0",
"uptime": 3600,
"tools": 25
}
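For automated monitoring you typically want a small probe that fetches `/health` and validates the payload rather than just checking for HTTP 200. A minimal sketch, using the field names from the sample response above (`check_health` is a hypothetical helper, not part of the package):

```python
def check_health(payload: dict) -> bool:
    """Return True if a /health payload looks healthy.

    Field names follow the sample response: status, version, uptime, tools.
    """
    return (
        payload.get("status") == "healthy"
        and isinstance(payload.get("uptime"), (int, float))
        and payload.get("uptime") >= 0
    )
```

Wire this to `urllib.request.urlopen("http://localhost:8765/health")` plus `json.load` in a cron job or liveness probe, alerting when it returns False.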
2. Logging
Configure structured logging:
# config/logging.yaml
version: 1
formatters:
json:
format: '{"time":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}'
handlers:
file:
class: logging.handlers.RotatingFileHandler
filename: /var/log/skillseekers/app.log
maxBytes: 10485760 # 10MB
backupCount: 5
formatter: json
loggers:
skill_seekers:
level: INFO
handlers: [file]
Log aggregation options:
- ELK Stack: Elasticsearch + Logstash + Kibana
- Grafana Loki: Lightweight log aggregation
- CloudWatch Logs: For AWS deployments
- Stackdriver: For GCP deployments
3. Metrics
Prometheus metrics endpoint:
# Add to MCP server
from prometheus_client import start_http_server, Counter, Histogram

# Metrics
scraping_requests = Counter('scraping_requests_total', 'Total scraping requests')
scraping_duration = Histogram('scraping_duration_seconds', 'Scraping duration')

# Start metrics server (scrape target for Prometheus)
start_http_server(9090)

# Instrument each request
scraping_requests.inc()
with scraping_duration.time():
    ...  # perform scraping
Key metrics to monitor:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Memory usage
- CPU usage
- Disk I/O
- GitHub API rate limit remaining
- Claude API token usage
4. Alerting
Example Prometheus alert rules:
groups:
- name: skillseekers
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
annotations:
summary: "High error rate detected"
- alert: HighMemoryUsage
expr: process_resident_memory_bytes > 2e9 # 2GB
for: 10m
annotations:
summary: "Memory usage above 2GB"
- alert: GitHubRateLimitLow
expr: github_rate_limit_remaining < 100
for: 1m
annotations:
summary: "GitHub rate limit low"
Security
1. API Key Management
Best Practices:
✅ DO:
- Store keys in environment variables or secret managers
- Use different keys for dev/staging/prod
- Rotate keys regularly (quarterly minimum)
- Use least-privilege IAM roles for cloud services
- Monitor key usage for anomalies
❌ DON'T:
- Commit keys to version control
- Share keys via email/Slack
- Use production keys in development
- Grant overly broad permissions
Recommended Secret Managers:
- Kubernetes Secrets (for K8s deployments)
- AWS Secrets Manager (for AWS)
- Google Secret Manager (for GCP)
- Azure Key Vault (for Azure)
- HashiCorp Vault (cloud-agnostic)
2. Network Security
Firewall Rules:
# Allow only necessary ports
sudo ufw enable
sudo ufw allow 22/tcp # SSH
sudo ufw allow 8765/tcp # MCP server (if public)
sudo ufw default deny incoming
sudo ufw default allow outgoing
Reverse Proxy (Nginx):
# /etc/nginx/sites-available/skillseekers
server {
listen 80;
server_name api.skillseekers.example.com;
# Redirect to HTTPS
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
server_name api.skillseekers.example.com;
ssl_certificate /etc/letsencrypt/live/api.skillseekers.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/api.skillseekers.example.com/privkey.pem;
# Security headers
add_header Strict-Transport-Security "max-age=31536000" always;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
# Rate limiting
# Note: limit_req_zone must be declared in the http context
# (e.g. in /etc/nginx/nginx.conf):
#   limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req zone=api burst=20 nodelay;
location / {
proxy_pass http://localhost:8765;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeouts
proxy_connect_timeout 60s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
}
}
3. TLS/SSL
Let's Encrypt (free certificates):
# Install certbot
sudo apt install certbot python3-certbot-nginx
# Obtain certificate
sudo certbot --nginx -d api.skillseekers.example.com
# Auto-renewal (cron)
0 12 * * * /usr/bin/certbot renew --quiet
4. Authentication & Authorization
API Key Authentication (optional):
# Add to MCP server
import os

from fastapi import Security, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    token = credentials.credentials
    if token != os.getenv("API_SECRET_KEY"):
        raise HTTPException(status_code=401, detail="Invalid token")
    return token
Scaling
1. Vertical Scaling
Increase resources:
# Kubernetes resource limits
resources:
requests:
cpu: "2"
memory: "4Gi"
limits:
cpu: "4"
memory: "8Gi"
2. Horizontal Scaling
Deploy multiple instances:
# Kubernetes HPA (Horizontal Pod Autoscaler)
kubectl autoscale deployment skillseekers-mcp \
--cpu-percent=70 \
--min=2 \
--max=10
Load Balancing:
# Nginx load balancer
upstream skillseekers {
least_conn;
server 10.0.0.1:8765;
server 10.0.0.2:8765;
server 10.0.0.3:8765;
}
server {
listen 80;
location / {
proxy_pass http://skillseekers;
}
}
3. Database/Storage Scaling
Distributed caching:
# Redis for distributed cache
import redis
cache = redis.Redis(host='redis.example.com', port=6379, db=0)
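The Redis client above needs a running server; the get/set-with-TTL pattern it provides can be illustrated with a minimal in-process stand-in (the `TTLCache` class is hypothetical, for illustration only — use Redis in production for cross-instance sharing):

```python
import time

class TTLCache:
    """In-process stand-in for the Redis get/setex pattern."""

    def __init__(self):
        self._store = {}

    def setex(self, key, ttl_seconds, value):
        """Store a value that expires after ttl_seconds."""
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key, default=None):
        """Return the value if present and unexpired, else default."""
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]
            return default
        return value
```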
Object storage:
- Use S3/GCS/Azure Blob for skill packages
- Enable CDN for static assets
- Use read replicas for databases
4. Rate Limit Management
Multiple GitHub tokens:
# Configure multiple profiles
skill-seekers config --github
# Automatic token rotation on rate limit
# (handled by rate_limit_handler.py)
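The rotation logic in rate_limit_handler.py is roughly a round-robin over configured tokens that skips any token whose quota is exhausted. A hedged sketch of that idea (the `TokenRotator` class and method names are illustrative, not the actual implementation):

```python
import itertools

class TokenRotator:
    """Round-robin over GitHub tokens, skipping exhausted ones."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._exhausted = set()
        self._cycle = itertools.cycle(self._tokens)

    def mark_exhausted(self, token):
        """Record that a token hit its rate limit."""
        self._exhausted.add(token)

    def next_token(self):
        """Return the next usable token, raising if all are exhausted."""
        for _ in range(len(self._tokens)):
            token = next(self._cycle)
            if token not in self._exhausted:
                return token
        raise RuntimeError("all tokens rate-limited; back off and retry later")
```

In practice you would also clear `_exhausted` entries once the GitHub rate-limit window resets.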
Backup & Disaster Recovery
1. Data Backup
What to back up:
- Configuration files (~/.config/skill-seekers/)
- Generated skills (output/)
- Database/cache (if applicable)
- Logs (for forensics)
Backup script:
#!/bin/bash
# /opt/skillseekers/scripts/backup.sh
BACKUP_DIR="/backups/skillseekers"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Create backup
tar -czf "$BACKUP_DIR/backup_$TIMESTAMP.tar.gz" \
~/.config/skill-seekers \
/opt/skillseekers/output \
/opt/skillseekers/.env
# Retain last 30 days
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete
# Upload to S3 (optional)
aws s3 cp "$BACKUP_DIR/backup_$TIMESTAMP.tar.gz" \
s3://backups/skillseekers/
Schedule backups:
# Crontab
0 2 * * * /opt/skillseekers/scripts/backup.sh
2. Disaster Recovery Plan
Recovery steps:
1. Provision new infrastructure
# Deploy from backup
terraform apply
2. Restore configuration
tar -xzf backup_20250207.tar.gz -C /
3. Verify services
skill-seekers config --test
systemctl status skillseekers-mcp
4. Test functionality
skill-seekers scrape --config configs/test.json --max-pages 10
RTO/RPO targets:
- RTO (Recovery Time Objective): < 2 hours
- RPO (Recovery Point Objective): < 24 hours
Troubleshooting
Common Issues
1. High Memory Usage
Symptoms:
- OOM kills
- Slow performance
- Swapping
Solutions:
# Check memory usage
ps aux --sort=-%mem | head -10
# Reduce batch size
skill-seekers scrape --config config.json --batch-size 10
# Enable memory limits
docker run --memory=4g skillseekers:latest
2. GitHub Rate Limits
Symptoms:
- 403 Forbidden errors
- "API rate limit exceeded" messages
Solutions:
# Check rate limit
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/rate_limit
# Add more tokens
skill-seekers config --github
# Use rate limit strategy
# (automatic with multi-token config)
3. Slow Scraping
Symptoms:
- Long scraping times
- Timeouts
Solutions:
# Enable async scraping (2-3x faster)
skill-seekers scrape --config config.json --async
# Increase concurrency
# (adjust in config: "concurrency": 10)
# Use caching
skill-seekers scrape --config config.json --use-cache
4. API Errors
Symptoms:
- 401 Unauthorized
- 429 Too Many Requests
Solutions:
# Verify API keys
skill-seekers config --test
# Check API key validity
# Claude API: https://console.anthropic.com/
# OpenAI: https://platform.openai.com/api-keys
# Google: https://console.cloud.google.com/apis/credentials
# Rotate keys if compromised
5. Service Won't Start
Symptoms:
- systemd service fails
- Container exits immediately
Solutions:
# Check logs
journalctl -u skillseekers-mcp -n 100
# Or for Docker
docker logs skillseekers-mcp
# Common causes:
# - Missing environment variables
# - Port already in use
# - Permission issues
# Verify config
skill-seekers config --show
Debug Mode
Enable detailed logging:
# Set debug level
export LOG_LEVEL=DEBUG
# Run with verbose output
skill-seekers scrape --config config.json --verbose
Getting Help
Community Support:
- GitHub Issues: https://github.com/yusufkaraaslan/Skill_Seekers/issues
- Documentation: https://skillseekersweb.com/
Log Collection:
# Collect diagnostic info (never include secrets such as .env in shared bundles)
tar -czf skillseekers-debug.tar.gz \
/var/log/skillseekers/ \
~/.config/skill-seekers/configs/
Performance Tuning
1. Scraping Performance
Optimization techniques:
# Enable async scraping
"async_scraping": true,
"concurrency": 20,  # adjust based on resources

# Optimize selectors (more specific = faster)
"selectors": {
  "main_content": "article",
  "code_blocks": "pre code"
},

# Enable caching
"use_cache": true,
"cache_ttl": 86400  # 24 hours
2. Embedding Performance
GPU acceleration (if available):
# Use GPU for sentence-transformers
pip install sentence-transformers[gpu]
# Configure
export CUDA_VISIBLE_DEVICES=0
Batch processing:
# Generate embeddings in batches
generator.generate_batch(texts, batch_size=32)
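The `generate_batch` call above is illustrative; the batching pattern it relies on — splitting a large list of texts into fixed-size chunks so the model processes them in groups — can be sketched as:

```python
def batched(items, batch_size=32):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Larger batches improve GPU utilization but raise peak memory; 32 is a common starting point for sentence-transformers-style models.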
3. Storage Performance
Use SSD for:
- SQLite databases
- Cache directories
- Log files
Use object storage for:
- Skill packages
- Backup archives
- Large datasets
Next Steps
- Review deployment option that fits your infrastructure
- Configure monitoring and alerting
- Set up backups and disaster recovery
- Test failover procedures
- Document your specific deployment
- Train your team on operations
Need help? See TROUBLESHOOTING.md or open an issue on GitHub.