# Production Deployment Guide

Complete guide for deploying Skill Seekers in production environments.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Deployment Options](#deployment-options)
- [Monitoring & Observability](#monitoring--observability)
- [Security](#security)
- [Scaling](#scaling)
- [Backup & Disaster Recovery](#backup--disaster-recovery)
- [Troubleshooting](#troubleshooting)
- [Performance Tuning](#performance-tuning)
- [Next Steps](#next-steps)

## Prerequisites

### System Requirements

**Minimum:**
- CPU: 2 cores
- RAM: 4 GB
- Disk: 10 GB
- Python: 3.10+

**Recommended (for production):**
- CPU: 4+ cores
- RAM: 8+ GB
- Disk: 50+ GB SSD
- Python: 3.12+

### Dependencies

**Required:**
```bash
# System packages (Ubuntu/Debian)
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3-pip \
  git curl wget build-essential libssl-dev

# System packages (RHEL/CentOS)
sudo yum install -y python312 python312-devel git curl wget \
  gcc gcc-c++ openssl-devel
```

**Optional (for specific features):**
```bash
# OCR support (PDF scraping)
sudo apt install -y tesseract-ocr

# Cloud storage
# (Install provider-specific SDKs via pip)

# Embedding generation
# (GPU support requires CUDA)
```

## Installation

### 1. Production Installation

```bash
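# Create dedicated user and the directories the service needs
sudo useradd -m -s /bin/bash skillseekers
sudo mkdir -p /opt/skillseekers /var/log/skillseekers
sudo chown skillseekers:skillseekers /opt/skillseekers /var/log/skillseekers
sudo su - skillseekers

# Create virtual environment
python3.12 -m venv /opt/skillseekers/venv
source /opt/skillseekers/venv/bin/activate

# Install package (quotes keep the shell from globbing the brackets)
pip install --upgrade pip
pip install "skill-seekers[all]"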

# Verify installation
skill-seekers --version
```

### 2. Configuration Directory

```bash
# Create config directory
mkdir -p ~/.config/skill-seekers/{configs,output,logs,cache}

# Set permissions
chmod 700 ~/.config/skill-seekers
```

### 3. Environment Variables

Create `/opt/skillseekers/.env`:

```bash
# API Keys
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=...

# GitHub Tokens (use skill-seekers config --github for multiple)
GITHUB_TOKEN=ghp_...

# Cloud Storage (optional)
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcs-key.json
AZURE_STORAGE_CONNECTION_STRING=...

# MCP Server
MCP_TRANSPORT=http
MCP_PORT=8765

# Sync Monitoring (optional)
SYNC_WEBHOOK_URL=https://...
SLACK_WEBHOOK_URL=https://hooks.slack.com/...

# Logging
LOG_LEVEL=INFO
LOG_FILE=/var/log/skillseekers/app.log
```

**Security Note:** Never commit `.env` files to version control!

```bash
# Secure the env file
chmod 600 /opt/skillseekers/.env
```
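
The checks above can also be automated before the service starts. The following is a minimal sketch, not part of Skill Seekers: `check_env_file` and `REQUIRED_KEYS` are hypothetical names, and the set of required keys is an assumption you should adjust to the providers you actually use.

```python
import stat
from pathlib import Path

# Assumption: these two keys are mandatory for your deployment.
REQUIRED_KEYS = {"ANTHROPIC_API_KEY", "GITHUB_TOKEN"}


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env_file(path: Path, required=REQUIRED_KEYS) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:  # any permission bit set for group or others
        problems.append(f"{path} is accessible by group/others (mode {mode:o})")
    values = parse_env_file(path)
    for key in sorted(required):
        if not values.get(key):
            problems.append(f"{key} is missing or empty")
    return problems
```

Calling `check_env_file(Path("/opt/skillseekers/.env"))` from a pre-start script and refusing to start on a non-empty result is one way to use it.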

## Configuration

### 1. GitHub Configuration

Use the interactive configuration wizard:

```bash
skill-seekers config --github
```

The wizard:
- Adds GitHub personal access tokens
- Configures rate limit strategies
- Tests token validity
- Supports multiple profiles (work, personal, etc.)

### 2. API Keys Configuration

```bash
skill-seekers config --api-keys
```

Configure:
- Claude API (Anthropic)
- Gemini API (Google)
- OpenAI API
- Voyage AI (embeddings)

### 3. Connection Testing

```bash
skill-seekers config --test
```

Verifies:
- ✅ GitHub token(s) validity and rate limits
- ✅ Claude API connectivity
- ✅ Gemini API connectivity
- ✅ OpenAI API connectivity
- ✅ Cloud storage access (if configured)

## Deployment Options

### Option 1: Systemd Service (Recommended)

Create `/etc/systemd/system/skillseekers-mcp.service`:

```ini
[Unit]
Description=Skill Seekers MCP Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=skillseekers
Group=skillseekers
WorkingDirectory=/opt/skillseekers
EnvironmentFile=/opt/skillseekers/.env
ExecStart=/opt/skillseekers/venv/bin/python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=skillseekers-mcp

# Security
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
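# ProtectHome=true hides /home from the service; use ProtectHome=read-only
# if it must read ~/.config/skill-seekers
ProtectHome=true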
ReadWritePaths=/opt/skillseekers /var/log/skillseekers

[Install]
WantedBy=multi-user.target
```

**Enable and start:**

```bash
sudo systemctl daemon-reload
sudo systemctl enable skillseekers-mcp
sudo systemctl start skillseekers-mcp
sudo systemctl status skillseekers-mcp
```

### Option 2: Docker Deployment

See [Docker Deployment Guide](./DOCKER_DEPLOYMENT.md) for detailed instructions.

**Quick Start:**

```bash
# Build image
docker build -t skillseekers:latest .

# Run container
docker run -d \
  --name skillseekers-mcp \
  -p 8765:8765 \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  -v /opt/skillseekers/data:/app/data \
  --restart unless-stopped \
  skillseekers:latest
```

### Option 3: Kubernetes Deployment

See [Kubernetes Deployment Guide](./KUBERNETES_DEPLOYMENT.md) for detailed instructions.

**Quick Start:**

```bash
# Install with Helm
helm install skillseekers ./helm/skillseekers \
  --namespace skillseekers \
  --create-namespace \
  --set secrets.anthropicApiKey="$ANTHROPIC_API_KEY" \
  --set secrets.githubToken="$GITHUB_TOKEN"
```

### Option 4: Docker Compose

See [Docker Compose Guide](./DOCKER_COMPOSE.md) for multi-service deployment.

```bash
# Start all services
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f
```

## Monitoring & Observability

### 1. Health Checks

**MCP Server Health:**

```bash
# HTTP transport
curl http://localhost:8765/health

# Expected response:
# {
#   "status": "healthy",
#   "version": "2.9.0",
#   "uptime": 3600,
#   "tools": 25
# }
```
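
For scripted checks (cron, a load balancer probe, a deploy gate), the same request can be made from Python. This is a minimal sketch using only the standard library; it assumes the `/health` endpoint and the `status` field shown above, and the function names are illustrative.

```python
import json
import urllib.error
import urllib.request


def is_healthy(payload) -> bool:
    """True when the decoded JSON body reports status == "healthy"."""
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(url: str = "http://localhost:8765/health", timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; any network or parse failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            return is_healthy(json.loads(response.read().decode("utf-8")))
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers refused connections and HTTP errors,
        # ValueError covers invalid JSON.
        return False
```

Treating every failure mode as "unhealthy" keeps the caller simple: a probe only needs the boolean.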

### 2. Logging

**Configure structured logging:**

```yaml
# config/logging.yaml
version: 1
formatters:
  json:
    format: '{"time":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}'
handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    filename: /var/log/skillseekers/app.log
    maxBytes: 10485760  # 10MB
    backupCount: 5
    formatter: json
loggers:
  skill_seekers:
    level: INFO
    handlers: [file]
```
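
The format string above interpolates the message directly, so a log message that contains a double quote or a newline produces a line that is not valid JSON. One way around this is a small formatter that serializes each record with `json.dumps`. The sketch below is an illustration, not the formatter Skill Seekers ships; `JsonFormatter` and `build_logger` are hypothetical names.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Serialize each record with json.dumps so quotes and newlines are escaped."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            entry["exc"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str = "skill_seekers") -> logging.Logger:
    """Attach a stream handler that uses JsonFormatter.

    For file output, swap in logging.handlers.RotatingFileHandler.
    """
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

In a `dictConfig`-style file, a custom formatter like this is referenced with the `()` key instead of `format`.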

**Log aggregation options:**
- **ELK Stack:** Elasticsearch + Logstash + Kibana
- **Grafana Loki:** Lightweight log aggregation
- **CloudWatch Logs:** For AWS deployments
- **Google Cloud Logging (formerly Stackdriver):** For GCP deployments

### 3. Metrics

**Prometheus metrics endpoint:**

```python
# Add to MCP server
from prometheus_client import start_http_server, Counter, Histogram

# Metrics
scraping_requests = Counter('scraping_requests_total', 'Total scraping requests')
scraping_duration = Histogram('scraping_duration_seconds', 'Scraping duration')

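# Start metrics server
# (9090 is also the Prometheus server's own default port; pick another
# port if both run on the same host)
start_http_server(9090)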
```

**Key metrics to monitor:**
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Memory usage
- CPU usage
- Disk I/O
- GitHub API rate limit remaining
- Claude API token usage

### 4. Alerting

**Example Prometheus alert rules:**

```yaml
groups:
  - name: skillseekers
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "More than 5% of requests are failing"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 2e9  # 2GB
        for: 10m
        annotations:
          summary: "Memory usage above 2GB"

      - alert: GitHubRateLimitLow
        expr: github_rate_limit_remaining < 100
        for: 1m
        annotations:
          summary: "GitHub rate limit low"
```

## Security

### 1. API Key Management

**Best Practices:**

✅ **DO:**
- Store keys in environment variables or secret managers
- Use different keys for dev/staging/prod
- Rotate keys regularly (quarterly minimum)
- Use least-privilege IAM roles for cloud services
- Monitor key usage for anomalies

❌ **DON'T:**
- Commit keys to version control
- Share keys via email/Slack
- Use production keys in development
- Grant overly broad permissions

**Recommended Secret Managers:**
- **Kubernetes Secrets** (for K8s deployments)
- **AWS Secrets Manager** (for AWS)
- **Google Secret Manager** (for GCP)
- **Azure Key Vault** (for Azure)
- **HashiCorp Vault** (cloud-agnostic)

### 2. Network Security

**Firewall Rules:**

```bash
# Set default policies first
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow only necessary ports
sudo ufw allow 22/tcp    # SSH
sudo ufw allow 8765/tcp  # MCP server (if public)

# Enable last, so the SSH rule is already in place
sudo ufw enable
```

**Reverse Proxy (Nginx):**

```nginx
# /etc/nginx/sites-available/skillseekers

# Rate limiting zone (limit_req_zone is only valid in the http context,
# which is where files in sites-enabled are included)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    listen 80;
    server_name api.skillseekers.example.com;

    # Redirect to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.skillseekers.example.com;

    ssl_certificate /etc/letsencrypt/live/api.skillseekers.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.skillseekers.example.com/privkey.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000" always;
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;

    # Rate limiting
    limit_req zone=api burst=20 nodelay;

    location / {
        proxy_pass http://localhost:8765;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }
}
```
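
To get a feel for what `rate=10r/s` with `burst=20 nodelay` allows, the following sketch simulates the accounting for a single client. It is a simplified model written for illustration, not nginx's implementation: it tracks how far the client is above the configured rate and rejects a request once that excess is larger than the burst.

```python
def simulate_limit_req(arrivals_ms, rate_per_s=10, burst=20):
    """Return one bool per arrival time (in ms): True = accepted, False = rejected.

    `excess` counts requests above the configured rate, in thousandths
    of a request, so all arithmetic stays in integers.
    """
    decisions = []
    excess = 0       # milli-requests above the rate after the last accepted request
    last_ms = None   # arrival time of the last accepted request
    for now_ms in arrivals_ms:
        if last_ms is None:
            new_excess = 0  # first request from this client
        else:
            drained = rate_per_s * (now_ms - last_ms)  # milli-requests drained
            new_excess = max(excess - drained + 1000, 0)
        if new_excess > burst * 1000:
            decisions.append(False)  # nginx answers 503 by default
            continue
        excess, last_ms = new_excess, now_ms
        decisions.append(True)
    return decisions
```

Under this model, 25 requests arriving at the same instant yield 21 accepted and 4 rejected, while a client that sends one request every 100 ms is never rejected.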

### 3. TLS/SSL

**Let's Encrypt (free certificates):**

```bash
# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain certificate
sudo certbot --nginx -d api.skillseekers.example.com

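# Auto-renewal: the certbot package usually installs a systemd timer
# (check with: systemctl list-timers | grep certbot). If it does not,
# add this line to root's crontab:
# 0 12 * * * /usr/bin/certbot renew --quiet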
```

### 4. Authentication & Authorization

**API Key Authentication (optional):**

```python
# Add to MCP server
import os
import secrets

from fastapi import Security, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    token = credentials.credentials
    expected = os.getenv("API_SECRET_KEY", "")
    # Reject when no key is configured; otherwise compare in constant time
    if not expected or not secrets.compare_digest(token.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="Invalid token")
    return token
```

## Scaling

### 1. Vertical Scaling

**Increase resources:**

```yaml
# Kubernetes resource limits
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "4"
    memory: "8Gi"
```

### 2. Horizontal Scaling

**Deploy multiple instances:**

```bash
# Kubernetes HPA (Horizontal Pod Autoscaler)
kubectl autoscale deployment skillseekers-mcp \
  --cpu-percent=70 \
  --min=2 \
  --max=10
```
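
The autoscaler's core rule is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured minimum and maximum. The sketch below applies that rule to the values used above (70% CPU target, 2 to 10 replicas). It deliberately ignores the controller's tolerance band and stabilization windows, so treat it as a planning aid rather than a prediction.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent=70,
                     min_replicas=2, max_replicas=10):
    """Apply ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


# Two pods averaging 140% of their CPU request against a 70% target
print(desired_replicas(2, 140))  # 4
```

The clamp matters at both ends: a nearly idle deployment stays at 2 replicas, and a heavily loaded one stops at 10.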

**Load Balancing:**

```nginx
# Nginx load balancer
upstream skillseekers {
    least_conn;
    server 10.0.0.1:8765;
    server 10.0.0.2:8765;
    server 10.0.0.3:8765;
}

server {
    listen 80;
    location / {
        proxy_pass http://skillseekers;
    }
}
```

### 3. Database/Storage Scaling

**Distributed caching:**

```python
# Redis for distributed cache
import redis

cache = redis.Redis(host='redis.example.com', port=6379, db=0)
```

**Object storage:**
- Use S3/GCS/Azure Blob for skill packages
- Enable CDN for static assets
- Use read replicas for databases

### 4. Rate Limit Management

**Multiple GitHub tokens:**

```bash
# Configure multiple profiles
skill-seekers config --github

# Automatic token rotation on rate limit
# (handled by rate_limit_handler.py)
```
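
As an illustration of what token rotation involves, the sketch below picks the token with the most remaining quota and, if every token is exhausted, reports how long to wait for the earliest reset. It is not the logic in `rate_limit_handler.py`; `TokenStatus`, `pick_token`, and the `min_remaining` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenStatus:
    name: str         # profile label, e.g. "work" (hypothetical)
    remaining: int    # requests left in the current window
    reset_epoch: int  # Unix time at which the window resets


def pick_token(tokens, now_epoch, min_remaining=100):
    """Return (token, wait_seconds).

    Prefer the token with the most remaining quota; if none has at least
    `min_remaining` requests left, return the one that resets soonest
    together with the number of seconds until that reset.
    """
    if not tokens:
        raise ValueError("no tokens configured")
    usable = [t for t in tokens if t.remaining >= min_remaining]
    if usable:
        return max(usable, key=lambda t: t.remaining), 0
    soonest = min(tokens, key=lambda t: t.reset_epoch)
    return soonest, max(soonest.reset_epoch - now_epoch, 0)
```

Keeping a reserve (`min_remaining`) avoids switching tokens in the middle of a multi-request operation.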

## Backup & Disaster Recovery

### 1. Data Backup

**What to backup:**
- Configuration files (`~/.config/skill-seekers/`)
- Generated skills (`output/`)
- Database/cache (if applicable)
- Logs (for forensics)

**Backup script:**

```bash
#!/bin/bash
# /opt/skillseekers/scripts/backup.sh
set -euo pipefail

BACKUP_DIR="/backups/skillseekers"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
ARCHIVE="$BACKUP_DIR/backup_$TIMESTAMP.tar.gz"

# The archive contains .env secrets, so keep new files owner-only
umask 077
mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$ARCHIVE" \
  "$HOME/.config/skill-seekers" \
  /opt/skillseekers/output \
  /opt/skillseekers/.env

# Delete archives older than 30 days
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete

# Upload to S3 (optional; use a private, encrypted bucket)
aws s3 cp "$ARCHIVE" s3://backups/skillseekers/
```

**Schedule backups:**

```bash
# Crontab
0 2 * * * /opt/skillseekers/scripts/backup.sh
```
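
A backup is only useful if it can be read back. The sketch below opens an archive, lists its members, and reports which expected paths are absent. `verify_backup` is a hypothetical helper; note that `tar` strips the leading `/` when it creates an archive, so the expected prefixes are written without it.

```python
import tarfile
from pathlib import Path


def verify_backup(archive: Path, expected_prefixes) -> list[str]:
    """Return the expected path prefixes that are missing from the archive.

    Raises tarfile.TarError (or OSError/EOFError) if the archive cannot be read.
    """
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
    missing = []
    for prefix in expected_prefixes:
        if not any(name == prefix or name.startswith(prefix + "/") for name in names):
            missing.append(prefix)
    return missing
```

For the script above, a call might look like `verify_backup(path, ["opt/skillseekers/output", "opt/skillseekers/.env"])`, run right after the archive is written.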

### 2. Disaster Recovery Plan

**Recovery steps:**

1. **Provision new infrastructure**
   ```bash
   # Recreate infrastructure from your Terraform configuration
   terraform apply
   ```

2. **Restore configuration**
   ```bash
   tar -xzf backup_20250207_020000.tar.gz -C /
   ```

3. **Verify services**
   ```bash
   skill-seekers config --test
   systemctl status skillseekers-mcp
   ```

4. **Test functionality**
   ```bash
   skill-seekers scrape --config configs/test.json --max-pages 10
   ```

**RTO/RPO targets:**
- **RTO (Recovery Time Objective):** < 2 hours
- **RPO (Recovery Point Objective):** < 24 hours
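
An RPO target is easy to state and easy to miss silently, for example when the nightly cron job starts failing. A minimal sketch of a check that compares the age of the newest archive with the 24-hour target; the function names are illustrative, and the `backup_*.tar.gz` pattern is taken from the backup script.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Age in hours of the most recently modified backup archive, or None if there is none."""
    archives = list(backup_dir.glob("backup_*.tar.gz"))
    if not archives:
        return None
    newest_mtime = max(a.stat().st_mtime for a in archives)
    now = time.time() if now is None else now
    return (now - newest_mtime) / 3600


def rpo_met(age_hours: Optional[float], rpo_hours: float = 24.0) -> bool:
    """True when a backup exists and is younger than the RPO target."""
    return age_hours is not None and age_hours < rpo_hours
```

Wiring `rpo_met(newest_backup_age_hours(Path("/backups/skillseekers")))` into monitoring turns the RPO from a stated goal into an alert.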

## Troubleshooting

### Common Issues

#### 1. High Memory Usage

**Symptoms:**
- OOM kills
- Slow performance
- Swapping

**Solutions:**

```bash
# Check memory usage
ps aux --sort=-%mem | head -10

# Reduce batch size
skill-seekers scrape --config config.json --batch-size 10

# Enable memory limits
docker run --memory=4g skillseekers:latest
```

#### 2. GitHub Rate Limits

**Symptoms:**
- `403 Forbidden` errors
- "API rate limit exceeded" messages

**Solutions:**

```bash
# Check rate limit
curl -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/rate_limit

# Add more tokens
skill-seekers config --github

# Use rate limit strategy
# (automatic with multi-token config)
```
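
The `/rate_limit` response is JSON with per-resource counters under `resources`. A short sketch of pulling out the numbers that matter for the `core` REST API; `summarize_rate_limit` is an illustrative helper, and the sample payload in the usage lines is made up.

```python
def summarize_rate_limit(payload: dict, now_epoch: int) -> dict:
    """Extract limit, remaining requests, and seconds until reset for the core API."""
    core = payload["resources"]["core"]
    return {
        "limit": core["limit"],
        "remaining": core["remaining"],
        "seconds_until_reset": max(core["reset"] - now_epoch, 0),
    }


# Usage with a made-up payload in the shape GitHub returns
sample = {"resources": {"core": {"limit": 5000, "remaining": 12, "reset": 1700003600, "used": 4988}}}
print(summarize_rate_limit(sample, now_epoch=1700000000))
```

A low `remaining` combined with a long `seconds_until_reset` is the signal to add another token rather than wait.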

#### 3. Slow Scraping

**Symptoms:**
- Long scraping times
- Timeouts

**Solutions:**

```bash
# Enable async scraping (2-3x faster)
skill-seekers scrape --config config.json --async

# Increase concurrency
# (adjust in config: "concurrency": 10)

# Use caching
skill-seekers scrape --config config.json --use-cache
```

#### 4. API Errors

**Symptoms:**
- `401 Unauthorized`
- `429 Too Many Requests`

**Solutions:**

```bash
# Verify API keys
skill-seekers config --test

# Check API key validity
# Claude API: https://console.anthropic.com/
# OpenAI: https://platform.openai.com/api-keys
# Google: https://console.cloud.google.com/apis/credentials

# Rotate keys if compromised
```

#### 5. Service Won't Start

**Symptoms:**
- systemd service fails
- Container exits immediately

**Solutions:**

```bash
# Check logs
journalctl -u skillseekers-mcp -n 100

# Or for Docker
docker logs skillseekers-mcp

# Common causes:
# - Missing environment variables
# - Port already in use
# - Permission issues

# Verify config
skill-seekers config --show
```
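
For the "port already in use" case, a quick check is to try connecting to the port. A minimal sketch using the standard library; `port_in_use` is an illustrative helper and 8765 is the port configured earlier in this guide.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8765)` is true while the service is stopped, another process holds the port; `ss -ltnp` shows which one.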

### Debug Mode

Enable detailed logging:

```bash
# Set debug level
export LOG_LEVEL=DEBUG

# Run with verbose output
skill-seekers scrape --config config.json --verbose
```

### Getting Help

**Community Support:**
- GitHub Issues: https://github.com/yusufkaraaslan/Skill_Seekers/issues
- Documentation: https://skillseekersweb.com/

**Log Collection:**

```bash
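# Collect diagnostic info
# Do not include /opt/skillseekers/.env: it contains API keys and tokens.
# Review configs for embedded tokens before sharing the archive.
tar -czf skillseekers-debug.tar.gz \
  /var/log/skillseekers/ \
  ~/.config/skill-seekers/configs/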
```
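
If maintainers ask to see your environment settings, share a redacted copy rather than the real file. The sketch below masks the value of any variable whose name looks like a credential. The name patterns are an assumption and will not catch every secret, so read the output before posting it.

```python
import re

# Assumption: variable names containing these words hold secrets.
SECRET_LINE = re.compile(
    r"^(\s*[A-Z0-9_]*(?:KEY|TOKEN|SECRET|WEBHOOK|CONNECTION_STRING)[A-Z0-9_]*\s*=\s*)(.+)$"
)


def redact_env_text(text: str) -> str:
    """Replace the values of secret-looking KEY=VALUE lines with <redacted>."""
    redacted = []
    for line in text.splitlines():
        redacted.append(SECRET_LINE.sub(r"\1<redacted>", line))
    return "\n".join(redacted)


print(redact_env_text("ANTHROPIC_API_KEY=sk-ant-example\nLOG_LEVEL=INFO"))
```

Non-secret settings such as `LOG_LEVEL` pass through unchanged, which keeps the redacted file useful for debugging.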

## Performance Tuning

### 1. Scraping Performance

**Optimization techniques:**

```jsonc
// Enable async scraping
"async_scraping": true,
"concurrency": 20,  // Adjust based on resources

// Optimize selectors
"selectors": {
    "main_content": "article",  // More specific = faster
    "code_blocks": "pre code"
},

// Enable caching
"use_cache": true,
"cache_ttl": 86400  // 24 hours
```
```
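
What a `concurrency` setting means in practice is a cap on the number of requests in flight at once. The sketch below shows the usual asyncio pattern for such a cap, a semaphore around each fetch. It is not Skill Seekers' scraper: `fetch_page` is a stand-in that sleeps instead of touching the network, and `scrape_all` also returns the peak concurrency so the cap can be observed.

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Stand-in for a real HTTP fetch; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def scrape_all(urls, concurrency: int = 20):
    """Fetch all URLs with at most `concurrency` requests in flight.

    Returns (pages, peak), where peak is the highest number of
    simultaneous fetches that was observed.
    """
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def bounded_fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch_page(url)
            finally:
                in_flight -= 1

    pages = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return pages, peak
```

Raising the cap helps only while the target site and your own CPU and memory keep up; past that point it mostly produces timeouts and rate-limit responses.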

### 2. Embedding Performance

**GPU acceleration (if available):**

```bash
# Install sentence-transformers; it uses the GPU automatically when a
# CUDA-enabled PyTorch build is installed
pip install sentence-transformers

# Configure
export CUDA_VISIBLE_DEVICES=0
```

**Batch processing:**

```python
# Generate embeddings in batches
generator.generate_batch(texts, batch_size=32)
```
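
Batching means splitting the input into fixed-size chunks so that each model call processes several texts at once while memory use stays bounded. How `generate_batch` does this internally is not shown here; the helper below is a generic sketch of the chunking step, with `batched` as an illustrative name.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"document {i}" for i in range(70)]
print([len(batch) for batch in batched(texts)])  # [32, 32, 6]
```

Larger batches improve GPU utilization until memory runs out, so the batch size is usually tuned downward from the largest value that fits.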

### 3. Storage Performance

**Use SSD for:**
- SQLite databases
- Cache directories
- Log files

**Use object storage for:**
- Skill packages
- Backup archives
- Large datasets

## Next Steps

1. **Choose** the deployment option that fits your infrastructure
2. **Configure** monitoring and alerting
3. **Set up** backups and disaster recovery
4. **Test** failover procedures
5. **Document** your specific deployment
6. **Train** your team on operations

---

**Need help?** See [TROUBLESHOOTING.md](./TROUBLESHOOTING.md) or open an issue on GitHub.